First Steps | HTML Files and DevTools
Web Scraping with Python (res)

1. HTML Files and DevTools
2. Beautiful Soup
3. CSS Selectors/XPaths
4. Tables

First Steps

A web page is built from HTML.

HTML is the markup language for creating web pages.

Let’s get the HTML file of the web page!

We can access a site's data through its URL. To open a URL in a Python program, use the urlopen() function from the urllib.request module and pass it the URL as a string:

    from urllib.request import urlopen

    url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
    page = urlopen(url)
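As a minimal sketch of the same call, the example below opens a URL inside a with statement so the response is closed automatically, and passes a timeout. To keep it runnable without network access it uses a data: URL, which carries the document inside the URL itself; demo_url and its contents are assumptions made for illustration, not part of the course material. A real web address can be passed in the same way.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for a real page: a data: URL embeds the document
# in the URL itself, so urlopen() can open it without network access.
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")

# The with statement closes the response when the block ends.
# timeout (in seconds) limits how long a network request may block.
with urlopen(demo_url, timeout=10) as page:
    # urlopen() hands back a response object, not the HTML text itself.
    # For http(s) URLs this is an HTTPResponse; other schemes may use
    # a different file-like class.
    response_type = type(page).__name__
    raw = page.read()

print(response_type)
print(raw)
```

The value read from the response is still a sequence of bytes at this point; it has not been decoded to a string yet.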

It looks simple, but printing the variable page does not show the HTML: urlopen() returns an HTTPResponse object, so print(page) displays only a description of that object. To get the content, call the .read() method, which returns a sequence of bytes, and then the .decode("utf-8") method to convert the bytes to a string:

    bytes = page.read()
    html = bytes.decode("utf-8")
    print(html)

We can also chain the methods: page.read().decode("utf-8").
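To see why the decoding step matters, here is a small sketch that needs no network access. The bytes are built by hand to stand in for what page.read() might return; the sample text is an assumption chosen only because it contains non-ASCII characters.

```python
# Bytes as they might arrive from page.read(), built by hand for illustration.
raw = "<p>Café menu: crème brûlée</p>".encode("utf-8")

# Decoding with the matching encoding restores the original text.
html = raw.decode("utf-8")

# The byte sequence is longer than the string because each accented
# character takes more than one byte in UTF-8.
print(type(raw).__name__, len(raw))
print(type(html).__name__, len(html))

# Decoding with a codec that cannot represent these bytes raises an error.
try:
    raw.decode("ascii")
except UnicodeDecodeError as error:
    print("ascii failed:", error.reason)
```

If a page is not encoded as UTF-8, the encoding name passed to .decode() has to match the encoding the page actually uses.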

This is the URL we will work with throughout this course!

Task

Write the missing code to get the HTML of the page you are interested in.

  1. Import the urlopen function from the urllib.request module to open URLs from your code.
  2. Open the URL. Assign the result to the variable page.
  3. Get a sequence of bytes using the method .read(). Assign the result to the variable bytes.
  4. Decode bytes to string using the method .decode(). Assign the result to the variable html.
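The steps above can also be wrapped into a reusable helper. This is only a sketch under stated assumptions: get_html and default_charset are hypothetical names, and instead of always assuming UTF-8 the function first checks whether the response headers declare a charset. The demonstration uses a data: URL so that it runs offline.

```python
from urllib.parse import quote
from urllib.request import urlopen


def get_html(url, default_charset="utf-8"):
    """Open url and return its body decoded to a string (hypothetical helper)."""
    with urlopen(url) as page:
        # Use the charset declared in the Content-Type header if there is one,
        # otherwise fall back to default_charset.
        charset = page.headers.get_content_charset() or default_charset
        return page.read().decode(charset)


# Offline demonstration with a data: URL instead of a real web address.
demo_url = "data:text/html;charset=utf-8," + quote("<p>It works</p>")
html = get_html(demo_url)
print(html)
```

Falling back to a default keeps the helper usable for responses that do not declare a charset, although the default may still be wrong for such pages.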
