Web Scraping with Python
What is Beautiful Soup?
BeautifulSoup is a Python library that offers extensive functionality for parsing HTML pages. In the previous section, you worked with HTML as a string, which imposed significant limitations.
To install BeautifulSoup, execute the following command in your terminal or command prompt:

pip install beautifulsoup4

To get started, import BeautifulSoup from bs4:

# Importing the library
from bs4 import BeautifulSoup
print(BeautifulSoup)
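To make the limitation of working with HTML as a plain string more concrete, here is a minimal sketch. The HTML snippet is made up for illustration and is not part of the course page. It shows a literal string search missing a tag that carries an attribute, while the parser finds it regardless.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = '<html><body><h1 class="main">Welcome</h1></body></html>'

# String approach: searching for the literal "<h1>" fails,
# because the real opening tag is <h1 class="main">
start = html.find("<h1>")
print(start)  # -1

# Parser approach: the tag is found no matter which attributes it has
soup = BeautifulSoup(html, "html.parser")
heading = soup.h1.string
print(heading)  # Welcome
```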
BeautifulSoup is designed for parsing HTML and does not download pages from URLs itself. However, you already know how to do that using urlopen from urllib.request. To start parsing, pass two arguments to the BeautifulSoup constructor: the first is the HTML content, and the second is the parser (we will use Python's built-in html.parser). This creates a BeautifulSoup object. For example, let's open and read a web page.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(type(soup))
print(soup)
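The same two-argument call also works on HTML that is already in memory, which is handy for experimenting without a network connection. The sketch below assumes a small made-up HTML string in place of the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML string standing in for a downloaded page
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# First argument: the HTML content; second argument: the parser name
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title.string)    # Demo page
```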
The first method we will explore is .prettify(), which returns the HTML as a string with each tag on its own line, indented to reflect the nesting of the document.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
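To see what .prettify() does on a small scale, the sketch below applies it to a short made-up snippet (an assumption for illustration, not the course page). The single-line input comes back as a multi-line string with one tag or text node per line.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML snippet
html = "<div><p>Hi</p></div>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str; nested tags are placed on their own indented lines
pretty = soup.prettify()
print(pretty)

# Ignoring the indentation, the lines follow the nesting order of the tags
lines = [line.strip() for line in pretty.splitlines()]
print(lines)  # ['<div>', '<p>', 'Hi', '</p>', '</div>']
```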