String Methods | HTML Files and DevTools
Web Scraping with Python (res)
1. HTML Files and DevTools
2. Beautiful Soup
3. CSS Selectors/XPaths
4. Tables

String Methods

We already know the structure of an HTML document and how to get a page's HTML from its URL. Now let's look at how to locate information in that code by its tags. One way is to use string methods, for example .find(). This method returns the index of the first occurrence of the substring you are searching for, or -1 if the substring is not found.
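As a quick illustration of how .find() behaves, here is a minimal sketch. The string `sample` is a made-up stand-in, not the page used in this course. It shows a successful search, a failed search, and the optional second argument of .find(), which sets the position where the search starts.

```python
# Hypothetical sample document; the course's real page is not reproduced here.
sample = "<html><body><p>first</p><p>second</p></body></html>"

first_p = sample.find("<p>")                 # index where the first "<p>" starts
missing = sample.find("<table>")             # substring absent, so find() returns -1
second_p = sample.find("<p>", first_p + 1)   # resume the search after the first match

print(first_p, missing, second_p)
```

Because a failed search returns -1 rather than raising an error, it is worth checking for -1 before using the result as a slice index.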

Let’s find the index of the first occurrence of the tag <head> on the site from the second chapter:

    index_first = html.find("<head>")
    print(index_first)

The result is -1, the value .find() returns when the substring is not found. Does that mean the HTML has no head? That can't be right: the tag is easy to spot in the first lines of the HTML file. The reason is that on this page the tag has an extra space before the closing angle bracket. HTML allows any amount of whitespace between the end of the tag name and the closing bracket. So here we can find the index with the following code:

    index_first = html.find("<head >")
    print(index_first)

We can find the first index of a closing tag in the same way (here the tag has no extra spaces).

Finally, you can extract the title block by slicing the html string:

    # Find the first index of the tag
    index_first = html.find("<title>")
    print(index_first)

    # Find the last index of the tag
    index_last = html.find("</title>") + len("</title>")
    print(index_last)

    # Extract the title
    title = html[index_first:index_last]
    print(title)

In the code, we add the length of the closing tag </title> to its index so that the slice ends after the tag and the closing tag appears in the output.
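The index arithmetic can be checked on a small example. The sketch below uses a made-up `sample` string and a hypothetical helper, `extract_title_text`, that returns only the text between the tags and guards against the -1 that .find() returns when a tag is missing.

```python
from typing import Optional

OPEN_TAG = "<title>"
CLOSE_TAG = "</title>"


def extract_title_text(html: str) -> Optional[str]:
    """Return the text between <title> and </title>, or None if a tag is missing."""
    start = html.find(OPEN_TAG)
    if start == -1:
        return None
    end = html.find(CLOSE_TAG, start)
    if end == -1:
        return None
    # Skip past the opening tag and stop right before the closing tag.
    return html[start + len(OPEN_TAG):end]


# Hypothetical sample document, not the course page.
sample = "<html><head ><title>Demo page</title></head ><body></body></html>"

index_first = sample.find(OPEN_TAG)
index_last = sample.find(CLOSE_TAG) + len(CLOSE_TAG)
title_block = sample[index_first:index_last]

print(title_block)                                 # block with both tags
print(extract_title_text(sample))                  # inner text only
print(extract_title_text("<p>no title here</p>"))  # no title tag at all
```

Without the -1 check, a missing closing tag would make index_last equal to len("</title>") - 1, and the slice would silently return the wrong text instead of failing.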

Task

Let's find the title of the page!

  1. Find the index of the first occurrence of the tag <title> and assign it to the variable index_first. Print the variable index_first.
  2. Find the index of the first occurrence of the tag </title>. Add the length of the string "</title>" and assign the result to the variable index_last. Print the variable index_last.
  3. Extract the title block by slicing the variable html and assign the result to the variable title. Print the variable title.

Before scraping a website, always check its terms of use to find out whether accessing the site with automated tools violates them.

Section 1. Chapter 5