Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Attributes & Contents of Element | Beautiful Soup: Part II
Web Scraping with Python
course content

Contenido del Curso

Web Scraping with Python

Web Scraping with Python

1. Getting Acquainted with HTML
2. Beautiful Soup: Part I
3. Beautiful Soup: Part II

bookAttributes & Contents of Element

The methods discussed in the previous sections return specific parts of the HTML code. BeautifulSoup enables us to retrieve the attributes and contents of particular elements. To access the attributes of an object, use the .attrs attribute. For instance, we can retrieve the attributes of the first <div> element.

123456789101112
# Importing libraries from bs4 import BeautifulSoup from urllib.request import urlopen # Reading web page url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html" page = urlopen(url) html = page.read().decode("utf-8") # Reading HTML with BeautifulSoup soup = BeautifulSoup(html, 'html.parser') print(soup.find('div').attrs)
copy

It's important to note that the result of using the .attrs attribute is a dictionary where the keys are attribute names and the values are their respective values. If you wish to obtain the content stored within a tag, employ the .contents attribute. For example, let's examine the contents of the first <div> element.

123456789101112
# Importing libraries from bs4 import BeautifulSoup from urllib.request import urlopen # Reading web page url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html" page = urlopen(url) html = page.read().decode("utf-8") # Reading HTML with BeautifulSoup soup = BeautifulSoup(html, 'html.parser') print(soup.find('div').contents)
copy

As observed above, all the newline characters were included in a list of elements, which may not be the most desirable representation of content. If you want to extract only the text within a specific element, utilize the .get_text() method. Compare the results from the example below with the one obtained earlier.

123456789101112
# Importing libraries from bs4 import BeautifulSoup from urllib.request import urlopen # Reading web page url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html" page = urlopen(url) html = page.read().decode("utf-8") # Reading HTML with BeautifulSoup soup = BeautifulSoup(html, 'html.parser') print(soup.find('div').get_text())
copy

¿Todo estuvo claro?

¿Cómo podemos mejorarlo?

¡Gracias por tus comentarios!

Sección 3. Capítulo 1
some-alt