Web Scraping with Python
Opening HTML File
Now that you're acquainted with the fundamental aspects of HTML, let's explore the first method of working with it in Python.
One of the modules you can use to work with HTML pages in Python is urllib.request. Import the urlopen function from it to access web pages, and pass the URL of the page you wish to open as the argument.
# Importing the module
from urllib.request import urlopen

# Opening web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/mother.html"
page = urlopen(url)
print(page)
As seen in the example above, you receive an http.client.HTTPResponse object rather than the HTML itself. To obtain the HTML structure, call .read() on that object to get the raw bytes, then call .decode("utf-8") on the bytes to turn them into a string.
Note
The decode("utf-8") part converts the raw binary data into a human-readable string, assuming that the webpage's content is encoded using UTF-8. This conversion lets you work with the text of the webpage in a meaningful way, such as parsing or analyzing its content.
# Importing the module
from urllib.request import urlopen

# Opening web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/mother.html"
page = urlopen(url)

# Reading and decoding
web_page = page.read().decode("utf-8")
print(type(web_page))
print(web_page)
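The example above assumes the page is encoded in UTF-8. As a minimal sketch of a slightly more defensive variant, the code below reads the charset from the response's Content-Type header and falls back to UTF-8 only when none is declared. The function name fetch_html, the fallback_encoding parameter, and the sample HTML are hypothetical. To keep the sketch runnable without a network connection, it opens a data: URL, which urlopen also accepts, in place of a real web address.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Open a URL and return its body as a string (hypothetical helper)."""
    # The with-statement closes the response once the body has been read
    with urlopen(url) as page:
        raw = page.read()  # bytes
        # Use the charset declared in the Content-Type header, if any
        charset = page.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset)


# Assumption: a tiny stand-in page embedded in a data: URL instead of a real site
sample_html = "<html><body><h1>Café</h1></body></html>"
sample_url = "data:text/html;charset=utf-8," + quote(sample_html)

print(fetch_html(sample_url))
```

The same function can be called with an ordinary https:// address, since the response object returned for web pages exposes the same .read() method and .headers attribute.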
As a result of applying the .read() and .decode() methods, you obtain a string. This string contains the HTML structure with its line breaks and indentation preserved, which makes it easy to read and lets you apply string methods to it.
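To illustrate what applying string methods to the page can look like, here is a small sketch that works on a hard-coded sample string instead of a downloaded page. The helper extract_between and the sample HTML are hypothetical and are not part of the lesson's page. Plain string searching like this is only suitable for simple, predictable markup.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag, or None."""
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    end = text.find(end_tag, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Assumption: a stand-in for the decoded web_page string
web_page = """<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>"""

print(extract_between(web_page, "<title>", "</title>"))  # Sample page
print(web_page.count("<p>"))                             # 2
print("<h1>" in web_page)                                # False
```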
If the .decode() method weren't applied, you would receive a bytes object instead. Printing it shows the entire HTML page on a single line, prefixed with b, with line breaks and non-ASCII characters shown as escape sequences such as \n. Feel free to experiment with it!
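The difference can be seen without downloading anything. In the sketch below, a short hard-coded HTML snippet (an assumption standing in for the result of page.read()) is encoded to bytes and then decoded back, and both forms are printed.

```python
# Assumption: a stand-in for the bytes that page.read() would return
raw = "<html>\n  <body>\n    <h1>Café</h1>\n  </body>\n</html>".encode("utf-8")

print(type(raw))   # bytes
print(raw)         # one line, with escapes such as \n and \xc3\xa9

text = raw.decode("utf-8")

print(type(text))  # str
print(text)        # multi-line, readable HTML
```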