Web Scraping with Python
Attributes & Contents of Element
The methods discussed in the previous sections return specific parts of the HTML code. BeautifulSoup also lets us retrieve the attributes and contents of a particular element. To access the attributes of an element, use the .attrs attribute. For instance, we can retrieve the attributes of the first <div> element.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div').attrs)
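To see the shape of the .attrs result without a network request, here is a minimal sketch that parses a small inline HTML string. The markup, the attribute names, and the variable div_attrs are made up for illustration and are not taken from the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML (an assumption for illustration, not the course page)
html = '<div id="main" class="box highlighted" data-role="intro"><p>Hello</p></div>'

soup = BeautifulSoup(html, "html.parser")
div_attrs = soup.find("div").attrs

# The whole attribute dictionary of the first <div>
print(div_attrs)

# Individual attributes are read like ordinary dictionary entries
print(div_attrs["id"])

# class can hold several values, so BeautifulSoup stores it as a list
print(div_attrs["class"])

# .get() avoids a KeyError when an attribute is absent
print(div_attrs.get("title"))
```

Because the result is a plain dictionary, the usual dictionary tools (in, .get(), .items()) apply to it directly.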
Note that .attrs returns a dictionary in which the keys are attribute names and the values are the corresponding attribute values. To obtain the content stored within a tag, use the .contents attribute, which returns a list of the tag's direct children. For example, let's examine the contents of the first <div> element.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div').contents)
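The same behaviour can be reproduced offline. The sketch below assumes a small made-up HTML snippet (not the course page) and labels each child in .contents as either a tag or a plain string, which makes the newline entries easy to spot. The names children and tags_only are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical inline HTML (an assumption for illustration, not the course page)
html = """<div>
<h1>Title</h1>
<p>First paragraph</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
children = soup.find("div").contents

# Each child is either a Tag or a string; line breaks between tags show up as "\n" strings
for child in children:
    kind = "tag" if isinstance(child, Tag) else "string"
    print(kind, repr(str(child)))

# One possible way to keep only the element children
tags_only = [child for child in children if isinstance(child, Tag)]
print([tag.name for tag in tags_only])
```

Filtering on Tag is only one option; it discards every string child, including meaningful text placed directly inside the element.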
As the output above shows, .contents returns a list of child elements in which the newline characters between tags appear as separate items, which is rarely the most convenient representation. To extract only the text within a specific element, use the .get_text() method. Compare the result of the example below with the one obtained earlier.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div').get_text())
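The text returned by .get_text() still contains the line breaks from the source markup. As a minimal sketch on a made-up HTML snippet (an assumption, not the course page), the optional separator and strip arguments of .get_text() can be used to tidy the result. The names raw_text and clean_text are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML (an assumption for illustration, not the course page)
html = """<div>
<h1>Title</h1>
<p>First paragraph</p>
</div>"""

div = BeautifulSoup(html, "html.parser").find("div")

# Default call: text only, but the newlines between tags are preserved
raw_text = div.get_text()
print(repr(raw_text))

# strip=True trims each text fragment and drops empty ones;
# separator controls what is placed between the remaining fragments
clean_text = div.get_text(separator=" ", strip=True)
print(repr(clean_text))
```

Whether to strip depends on the task: the default output mirrors the layout of the markup, while the stripped form is usually easier to store or compare.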