Learn Opening HTML File | Getting Acquainted with HTML
Web Scraping with Python

Opening HTML File

Now that you are familiar with the basics of HTML, you can explore a first way of working with it in Python.

One of the modules you can use to fetch HTML pages in Python is urllib.request. Import its urlopen function and pass it the URL of the page you want to open.

    # Importing the module
    from urllib.request import urlopen

    # Opening web page
    url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/mother.html"
    page = urlopen(url)
    print(page)

As the example above shows, the result is an http.client.HTTPResponse object rather than the HTML itself. To obtain the HTML, call .read() on the response to get the raw bytes, then call .decode('utf-8') on those bytes.

Note

The decode("utf-8") call converts the raw bytes into a human-readable string, assuming the page is encoded in UTF-8. If the page uses a different encoding, decoding may raise a UnicodeDecodeError or produce garbled characters. Once decoded, the text can be parsed or analyzed like any other string.
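The encoding does not have to be hard-coded. As a minimal sketch, assuming the server declares a charset in its Content-Type header, the declared value can be read from the response and UTF-8 used only as a fallback. The helper name fetch_html and the data: URL in the demonstration are illustrative assumptions, not part of the lesson; the data: URL simply lets the sketch run without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, fallback="utf-8"):
    """Return the page at `url` as a str (hypothetical helper)."""
    with urlopen(url) as page:
        # Charset declared in the Content-Type header, or None if absent
        charset = page.headers.get_content_charset() or fallback
        return page.read().decode(charset)


# Offline stand-in for a real page: a data: URL embedding a tiny HTML document
sample_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")
html = fetch_html(sample_url)
print(html)
```

The same helper should work with an ordinary https:// URL such as the one used in the lesson, provided the page is reachable.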

    # Importing the module
    from urllib.request import urlopen

    # Opening web page
    url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/mother.html"
    page = urlopen(url)

    # Reading and decoding
    web_page = page.read().decode("utf-8")
    print(type(web_page))
    print(web_page)

Applying .read() and .decode() gives you a string (str). It contains the page's HTML with its original line breaks and indentation, so it is easy to read, and you can apply string methods to it.
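As a rough illustration of applying string methods to the decoded page, the sketch below pulls the text between the title tags using find and slicing. The web_page value here is a made-up sample, not the content of the lesson's page, and the variable names are only for illustration.

```python
# Hypothetical decoded page; the real page's content may differ
web_page = (
    "<html>\n"
    "<head><title>Sample page</title></head>\n"
    "<body><p>Hi</p></body>\n"
    "</html>"
)

start_tag, end_tag = "<title>", "</title>"
start = web_page.find(start_tag)  # index of the opening tag, or -1
end = web_page.find(end_tag)      # index of the closing tag, or -1

if start != -1 and end != -1:
    # Slice out the text between the two tags
    title = web_page[start + len(start_tag):end]
else:
    title = None

print(title)
print(web_page.count("<p>"))  # number of opening paragraph tags
```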

If you skip .decode(), you get a bytes object instead. Printing it shows the whole page as a single b'...' literal, with line breaks appearing as escape sequences such as \n and non-ASCII characters as \x.. byte escapes. Feel free to experiment with it!
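The difference can be seen without fetching anything. In this sketch, raw is a made-up stand-in for what page.read() returns, assumed to be UTF-8 encoded.

```python
# Stand-in for the result of page.read(): raw UTF-8 bytes (sample content is made up)
raw = b"<html>\n  <body>caf\xc3\xa9</body>\n</html>\n"

print(type(raw))   # the bytes type
print(raw)         # a single b'...' literal; newlines appear as \n

text = raw.decode("utf-8")
print(type(text))  # the str type
print(text)        # the HTML printed across several lines
```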

Section 1. Chapter 8