CSS Selectors in BeautifulSoup
If you already have CSS selectors written, you can extract data with the Selector class from the scrapy library. Here, however, we will use CSS selectors with the BeautifulSoup library from the previous section. To select data from a parsed document, call the .select() method of an existing BeautifulSoup object:
css_locator = "html > body > div"
print(soup.select(css_locator))
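To make the path selector concrete, here is a minimal sketch that runs on a small inline document. The HTML string is hypothetical and exists only for illustration; the point is that the > combinator matches direct children only, so a div nested deeper in the tree is not returned.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<html>
  <body>
    <div>First block</div>
    <div>Second block</div>
    <section><div>Nested block</div></section>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" means "direct child", so only divs placed directly inside body match.
css_locator = "html > body > div"
matches = soup.select(css_locator)  # .select() returns a list of tags

for tag in matches:
    print(tag.get_text())
```

The div inside section is skipped because it is a grandchild of body, not a direct child.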
We already know how to navigate HTML files using attributes. CSS selectors also let us select all elements with a given class or id without specifying the tag name or the full path. For example:
print(soup.select("#id-1"))
print(soup.select(".class-1"))
The first line selects all elements whose id equals id-1. The second line selects all tags that belong to the class class-1.
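As a small sketch of the id and class selectors, the snippet below uses a hypothetical HTML fragment. It also shows .select_one(), which returns the first match (or None) instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<div id="id-1" class="class-1">Intro</div>
<p class="class-1 highlighted">Paragraph</p>
<p class="class-2">Other</p>
"""

soup = BeautifulSoup(html, "html.parser")

# "#id-1" matches by id, regardless of the tag name.
by_id = soup.select("#id-1")

# ".class-1" matches every tag whose class list contains class-1.
by_class = soup.select(".class-1")

# select_one() returns the first matching tag, or None if nothing matches.
first_by_id = soup.select_one("#id-1")

print([tag.name for tag in by_id])
print([tag.name for tag in by_class])
print(first_by_id.get_text())
```

Note that the second paragraph matches .class-1 even though it carries an additional class.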
You can also iterate over all matched elements with a for loop:
for link in soup.select(".class-link > a"):
    # Pass the URL stored in the href attribute, not the tag object itself.
    page = urlopen(link["href"])
    html = page.read().decode("utf-8")
    new_soup = BeautifulSoup(html, "html.parser")
Here we loop over all links that are direct children of elements with the class class-link and create a BeautifulSoup object for each linked page.
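Links on real pages are often relative or missing an href altogether, so it can help to collect clean URLs before opening anything. The sketch below is one possible way to do that; the base URL and the HTML fragment are hypothetical, and urljoin from the standard library is an addition not used in the lesson itself.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and HTML, used only for illustration.
base_url = "https://example.com/catalog/"
html = """
<ul>
  <li class="class-link"><a href="page1.html">Page 1</a></li>
  <li class="class-link"><a href="/about.html">About</a></li>
  <li class="class-link"><a>No target</a></li>
  <li class="other"><a href="skip.html">Skipped</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
for link in soup.select(".class-link > a"):
    href = link.get("href")  # .get() returns None when the attribute is absent
    if href is None:
        continue
    # Resolve relative links against the page they were found on.
    urls.append(urljoin(base_url, href))

print(urls)
```

Each entry in urls could then be passed to urlopen() as in the loop above.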
Keep in mind that instead of the urllib.request library, you can send a GET request (a request to retrieve a webpage) with the requests library and read the .content attribute of the response to get the page's HTML as bytes:
import requests

page_response = requests.get(url)
page = page_response.content
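Because .content holds raw bytes rather than a string, it may be worth seeing that BeautifulSoup accepts bytes directly, with no manual .decode() step. The sketch below avoids a network call by using a hypothetical bytes value as a stand-in for page_response.content.

```python
from bs4 import BeautifulSoup

# Stand-in for page_response.content: raw bytes, as a response body would be.
# The markup itself is hypothetical.
page = b"<html><head><title>Sample page</title></head><body></body></html>"

# BeautifulSoup accepts bytes and decodes them itself.
soup = BeautifulSoup(page, "html.parser")

print(type(page).__name__)
print(soup.title.string)
```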
Task
Go through all the links on the main webpage, get their HTML code, and print the title of each page. First, collect all a tags into a list, then read the href attribute of each extracted tag to get the URLs of the pages.
- Import the library for opening URLs.
- Select all a tags using the method .select() with a CSS selector as the parameter. Assign the result to the variable a_tags.
- Create the empty list links.
- Go through the list a_tags with a for loop and the variable a to extract the href attributes and add them to the empty list links.
- While going through each webpage by its link, get the title tag using the .title attribute of the BeautifulSoup object and print it.
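The steps above can be sketched as follows. To keep the example runnable offline, a hypothetical dictionary named pages stands in for the website and a hypothetical helper fetch_html() stands in for opening a URL; with real pages, its body would be replaced by a call such as urlopen(url).read().decode("utf-8").

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a website: maps each URL to its HTML.
pages = {
    "main": (
        "<html><head><title>Main</title></head><body>"
        '<a href="page1">One</a><a href="page2">Two</a>'
        "</body></html>"
    ),
    "page1": "<html><head><title>First page</title></head><body></body></html>",
    "page2": "<html><head><title>Second page</title></head><body></body></html>",
}


def fetch_html(url):
    """Return the HTML for url (hypothetical helper; swap in urlopen for real pages)."""
    return pages[url]


soup = BeautifulSoup(fetch_html("main"), "html.parser")

# Select all a tags with a CSS selector.
a_tags = soup.select("a")

# Collect the href attribute of every extracted tag.
links = []
for a in a_tags:
    links.append(a["href"])

# Visit each page and print its title tag.
titles = []
for link in links:
    page_soup = BeautifulSoup(fetch_html(link), "html.parser")
    titles.append(page_soup.title.string)
    print(page_soup.title)
```

Printing page_soup.title shows the whole tag; page_soup.title.string gives only the text inside it.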