Web Scraping with Python
Applying String Methods
What can you do with the page you have read? It's a string, so you can use any string method. For instance, the .find() method returns the index of the first occurrence of a given substring, or -1 if the substring is not found. For example, you can locate the page title by finding the indexes of the first opening and closing title tags. We'll also take into account the length of the closing tag.
# Importing the module
from urllib.request import urlopen

# Opening web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/mother.html"
page = urlopen(url)

# Reading and decoding
web_page = page.read().decode("utf-8")

# Indexes of opening and closing title tags
start = web_page.find('<title')
finish = web_page.find('</title>') + len('</title>')
print(web_page[start:finish])
The example above creates two variables, start and finish. The start variable holds the index of the first character of the first opening <title tag. The finish variable holds the index of the character immediately after the closing </title> tag. Because .find() returns the index where the closing tag begins, we add the length of the tag to get the index just past its end.
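The same start/finish logic can be wrapped in a small helper that also covers the case where .find() returns -1 because a tag is missing. This is only a sketch: the function name extract_element and the sample_page string are hypothetical and stand in for a real downloaded page.

```python
from typing import Optional


def extract_element(html: str, open_prefix: str, close_tag: str) -> Optional[str]:
    """Return the first element delimited by open_prefix and close_tag, or None."""
    start = html.find(open_prefix)
    if start == -1:
        return None  # opening tag is not present

    # Look for the closing tag only after the opening tag
    close_at = html.find(close_tag, start)
    if close_at == -1:
        return None  # closing tag is not present

    # .find() gives the index where the closing tag begins,
    # so add its length to get the index just past its end
    finish = close_at + len(close_tag)
    return html[start:finish]


# Hypothetical sample page, used here instead of a real download
sample_page = "<html><head><title>Sample page</title></head><body></body></html>"

title = extract_element(sample_page, "<title", "</title>")
missing = extract_element(sample_page, "<h1", "</h1>")

print(title)
print(missing)
```

Without the -1 checks, a missing tag would silently produce a wrong slice instead of an error, because -1 is itself a valid index in Python.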
Note
String slicing excludes the character at the end index, which is why finish must point to the character right after the closing tag.
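A short illustration of this on a made-up string (the text value below is an assumption chosen for the example, not part of the lesson's page):

```python
text = "<b>hi</b>"
close_tag = "</b>"

close_at = text.find(close_tag)      # index where the closing tag begins
finish = close_at + len(close_tag)   # index just past the closing tag

without_tag = text[0:close_at]       # stops before the closing tag
with_tag = text[0:finish]            # includes the whole closing tag

print(without_tag)
print(with_tag)
```

Slicing up to close_at drops the closing tag entirely, while slicing up to finish keeps it, because the character at the end index is never included.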