Advanced Search

Certain HTML tags require mandatory attributes, such as the anchor tag necessitating the href attribute or <img> requiring the src attribute. If you are interested in a specific attribute, you can use the .get() method following .attrs. For example, let's retrieve all the src attributes of all <img> elements.


              12345678910111213
            
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
  print(img.attrs.get("src"))

You may also encounter the id attribute, which is quite common and is used to distinguish elements under the same tag. If you are interested in specific attribute values, you can pass them as a dictionary (in the format attr_name: attr_value) as the parameter for .find_all() (immediately after specifying the tag you are searching for). For example, we are interested in only <div> elements with the class attribute set to "box", or we are searching for the <p> element with an "id" attribute value of "id2".


              12345678910111213141516
            
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", {"class": "box"}):
  print(div)

# Filtering by id attribute value
print(soup.find("p", {"id": "id2"}))

We utilized the .find() method (instead of .find_all()) to retrieve the element with a specific id since the id is a unique identifier, and there cannot be more than one element with the same value. To ensure that we obtained only specific <div> elements, let's examine the classes that <div> elements have.


              12345678910111213
            
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div"):
  print(div.attrs.get("class"))

Everything was clear?

Thanks for your feedback!

Section 3. Chapter 5

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Course Content

Web Scraping with Python

1. Getting Acquainted with HTML

2. Decoding HTML with Beautiful Soup

What is Beautiful Soup?Navigating HTML Document Challenge: The BeautifulSoup Object Challenge: Iterating Over Lists Working with Specific Elements Working with Paragraph Elements

3. Working with Element Attributes in Beautiful Soup

Attributes and Contents of Element Challenge: Attributes Attributes and Contents of Multiple Elements Challenge: Text from HTML Elements Advanced Search Challenge: Find All

Advanced Search


              12345678910111213
            
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
  print(img.attrs.get("src"))


              12345678910111213141516
            
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", {"class": "box"}):
  print(div)

# Filtering by id attribute value
print(soup.find("p", {"id": "id2"}))


              12345678910111213
            
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div"):
  print(div.attrs.get("class"))

Everything was clear?

Thanks for your feedback!

Section 3. Chapter 5