How To Scrape A Website For Your ML Project

A while ago, I was reading a thread on the LearnML subreddit in which the OP needed to scrape webpage data for his ML project.

People in the thread gave good answers, which mainly came down to learning how to use BeautifulSoup and Selenium.

But the OP may not know how to relate those libraries to his ML project if he has no experience with them.

I have used BeautifulSoup and Selenium for some of my data science projects. While these were not the most advanced tasks, the libraries got the work done.

https://www.tobiolabode.com/blog/2020/4/21/bookstores-vs-income-data-science-project-part-1

https://www.tobiolabode.com/blog/2020/4/26/bookstore-vs-income-part-2

In this blog post, I’m going to show you how to scrape a webpage with some useful data and convert it into a pandas dataframe.

The reason we want to convert it into a dataframe is that most ML libraries can handle pandas dataframes, so the data can be adapted for your model with minimal changes.

 

First, we are going to find a table on Wikipedia to convert into a dataframe.

Here I’m going to scrape a table of the most viewed sportspeople on Wikipedia.

image001.png

A lot of the work will be navigating the HTML tree to get to the table we want.

image003.png

We will use BeautifulSoup with the help of requests and Python's built-in re (regex) module. pandas is imported for the dataframe step later on.

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

Here we are extracting the HTML code from the webpage:

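# Download the page and keep the response body (the raw HTML) as text
html = requests.get('https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages').text
# Parse the HTML (the 'lxml' parser needs the lxml package installed)
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())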
</a>
    </li>
    <li id="footer-places-disclaimer">
     <a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">
      Disclaimers
     </a>
    </li>
    <li id="footer-places-contact">
     <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">
      Contact Wikipedia
     </a>
    </li>
    <li id="footer-places-mobileview">
     <a class="noprint stopMobileRedirectTog
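The fetch above assumes the request always succeeds. As a minimal sketch (not part of my original code), the helper below adds a timeout and a status check before parsing. The name fetch_soup, the session parameter and the User-Agent string are assumptions made up for illustration, and it uses Python's built-in 'html.parser' so that lxml is not required.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors.

    fetch_soup and its parameters are hypothetical names for this sketch.
    """
    # requests and requests.Session both expose a compatible .get() method
    http = session if session is not None else requests
    response = http.get(
        url,
        timeout=timeout,  # seconds to wait before giving up
        headers={'User-Agent': 'ml-scraping-tutorial/0.1'},  # placeholder value
    )
    response.raise_for_status()  # raises an error on 4xx/5xx responses
    return BeautifulSoup(response.text, 'html.parser')
```

You would call it as fetch_soup(url) with the Wikipedia URL from earlier. Passing a requests.Session is optional and mostly useful if you scrape several pages in a row.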

We want to collect all of the tables from the page so that we have a smaller surface area to search.

wiki_tables = soup.find_all('table', class_='wikitable')
wiki_tables
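To make it clearer what find_all('table', class_='wikitable') picks up, here is a small sketch on a made-up HTML snippet (not the real Wikipedia page). The sample markup and variable names are assumptions for illustration. It shows that class_ matches a table when any one of its CSS classes is 'wikitable'.

```python
from bs4 import BeautifulSoup

# Tiny made-up HTML sample, not the real Wikipedia page
sample_html = """
<table class="wikitable sortable"><tr><th>Rank</th></tr></table>
<table class="infobox"><tr><th>Other</th></tr></table>
<table class="wikitable"><tr><th>Page</th></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches a tag if ANY of its CSS classes equals 'wikitable'
sample_tables = sample_soup.find_all('table', class_='wikitable')

print(len(sample_tables))  # how many tables are left to search through
for table in sample_tables:
    print(table.get('class'))  # the full class list of each match
```

Printing the length first is a cheap way to see how much filtering is still ahead before dumping every table to the screen.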

As there are numerous tables we need a way to filter them.

 

We know that Cristiano Ronaldo has an anchor tag that will likely appear in only a few tables.

image005.png

We can filter for the tables that have an anchor tag with the text Cristiano Ronaldo, and grab the parent element that contains each anchor tag.

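links = []
for table in wiki_tables:
  # Look for an anchor tag whose text matches 'Cristiano Ronaldo'
  anchor = table.find('a', string=re.compile('Cristiano Ronaldo'))
  if not anchor:
    continue  # this table does not mention him
  print(anchor)

  # The anchor's direct parent is the table cell that contains it
  cell = anchor.parent
  print(cell)
  links.append(cell)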
<a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
<td style="text-align: left;"><a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
</td>

<a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
<td style="text-align: left;"><a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
</td>

<a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
<td style="text-align: left;"><a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
</td>
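One thing worth knowing about the filter: when BeautifulSoup is given a compiled regex, it applies it as a search, so partial matches count too. The sketch below uses made-up sample cells (the second name is hypothetical) to compare the regex filter with an exact string match.

```python
import re
from bs4 import BeautifulSoup

# Made-up cells for illustration; the second entry is hypothetical
sample_html = """
<table>
  <tr><td><a href="/wiki/Cristiano_Ronaldo">Cristiano Ronaldo</a></td></tr>
  <tr><td><a href="/wiki/Example">Cristiano Ronaldo Jr.</a></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A compiled regex is applied as a search, so partial matches count
regex_hits = sample_soup.find_all('a', string=re.compile('Cristiano Ronaldo'))

# A plain string must equal the anchor text exactly
exact_hits = sample_soup.find_all('a', string='Cristiano Ronaldo')

print(len(regex_hits), len(exact_hits))
```

For the Wikipedia page in this post the regex worked fine, but if a page has similar names, an exact string (or a regex anchored with ^ and $) is the safer choice.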

The parent only shows us the cell. We need to climb higher up the tree.

Here is the cell as shown in the browser's web dev tools.

image007.png
parent_lst = []
for cell in links:
  # find_parents returns a list of every enclosing tbody element
  tbodies = cell.find_parents('tbody')
  print(tbodies)
  parent_lst.append(tbodies)

By using the tbody, we are able to return the tables that contain the anchor tags from earlier.
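As a side note, find_parents returns a list-like result, which is why the later code indexes into it with [0]. The sketch below, on a made-up sample that mirrors the tbody structure seen above, compares it with find_parent (singular), which returns a single element. The sample HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample with an explicit tbody, mirroring the structure above
sample_html = """
<table class="wikitable"><tbody>
  <tr><th>Rank</th><th>Page</th></tr>
  <tr><td>1</td><td><a href="/wiki/Cristiano_Ronaldo">Cristiano Ronaldo</a></td></tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
cell = sample_soup.find('a').parent  # the td that wraps the anchor

all_tbodies = cell.find_parents('tbody')   # list-like result of all matches
nearest_table = cell.find_parent('table')  # a single tag, or None if absent

print(len(all_tbodies))
print(nearest_table.get('class'))
```

Either approach gets you to the table. I kept find_parents in this post, so the rest of the code works with lists.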

 

To filter even more we can search through the various headers of those tables:

for i in parent_lst:
  print(i[0].find('tr'))
<tr>
<th>Rank*</th>
<th>Page</th>
<th>Views in millions
</th></tr>
<tr>
<th>Rank</th>
<th>Page</th>
<th>Views in millions
</th></tr>
<tr>
<th>Rank</th>
<th>Page</th>
<th>Sport</th>
<th>Views in millions
</th></tr>

The third one looks like the table that we want.
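Picking the table by eye works, but the same choice can be made in code by looking for the Sport header. This is a minimal sketch on made-up sample tables. The helper name find_by_header is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Two made-up tables with the same header layouts as the output above
sample_html = """
<table><tbody>
  <tr><th>Rank</th><th>Page</th><th>Views in millions</th></tr>
</tbody></table>
<table><tbody>
  <tr><th>Rank</th><th>Page</th><th>Sport</th><th>Views in millions</th></tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')


def find_by_header(candidates, header_text):
    """Return the first candidate whose first row has the given header."""
    for candidate in candidates:
        first_row = candidate.find('tr')
        if first_row is None:
            continue  # no rows at all, nothing to compare
        headers = [th.get_text(strip=True) for th in first_row.find_all('th')]
        if header_text in headers:
            return candidate
    return None  # no candidate had that header


sports_body = find_by_header(sample_soup.find_all('tbody'), 'Sport')
print(sports_body.find('tr'))
```

With the lists collected earlier, you would pass the first element of each entry, for example [p[0] for p in parent_lst]. This avoids relying on the table staying in third place if the page changes.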

 

Now we start to create the logic needed to extract and clean the details we want.

sports_table = parent_lst[2]

complete_row = []

for i in sports_table:
  rows = i.find_all('tr')
  print('\n--------row--------\n')
  print(rows)

  for row in rows:
    cells = row.find_all('td')
    print('\n-------cells--------\n')
    print(cells)

    if not cells:
      continue

    rank = cells[0].text.strip('\n')
    page_name = cells[1].find('a').text
    sport = cells[2].find('a').text
    views = cells[3].text.strip('\n')

    print('\n-------CLEAN--------\n')
    print(rank)
    print(page_name)
    print(sport)
    print(views)

    complete_row.append([rank, page_name, sport, views])


for i in complete_row:
  print(i)

To break it down:

sports_table = parent_lst[2]

complete_row = []

Here we select the third element from the earlier list, as it's the table we wanted.

Then we create an empty list that will store the details of each row as we iterate through the table.

 

We create a loop over sports_table (a list holding the table's tbody) and save all of its tr rows into the rows variable.

for i in sports_table:
  rows = i.find_all('tr')
  print('\n--------row--------\n')
  print(rows)
image009.png
for row in rows:
  cells = row.find_all('td')
  print('\n-------cells--------\n')
  print(cells)

We create a nested loop that iterates through each row saved from the last loop. For each row, we save its individual cells in a new variable.

image011.png
if not cells:
  continue

This short snippet skips rows that have no td cells, such as the header row (which only contains th cells), and prevents errors when extracting the text from the cells.
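To see why the check is needed, here is a small sketch on a made-up two-row table. The sample HTML and variable names are assumptions for illustration. The header row only has th cells, so find_all('td') gives back an empty list for it.

```python
from bs4 import BeautifulSoup

# Made-up table: one header row (th cells) and one data row (td cells)
sample_html = """
<table>
  <tr><th>Rank</th><th>Page</th></tr>
  <tr><td>1</td><td>Example page</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

cell_counts = []
for row in sample_soup.find_all('tr'):
    cells = row.find_all('td')  # th cells are not matched
    cell_counts.append(len(cells))
    if not cells:
        continue  # header row: nothing to extract
    print(cells[0].text)

print(cell_counts)
```

Without the check, cells[0] on the header row would raise an IndexError because the list is empty.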

rank = cells[0].text.strip('\n')
page_name = cells[1].find('a').text
sport = cells[2].find('a').text
views = cells[3].text.strip('\n')

Here we clean the various cells into plain text. The cleaned values are saved in variables named after their columns.
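The cleaning above relies on the name and sport cells always containing a link. As a hedged alternative, the sketch below uses get_text(strip=True) and falls back to the cell's own text when there is no anchor. The helper name cell_text and the sample row are made up for illustration (note that the third cell has no link here, unlike the real table).

```python
from bs4 import BeautifulSoup

# Made-up row: the third cell has no link, unlike the cells in this post
sample_html = """
<table><tr>
<td>13
</td><td><a href="/wiki/Conor_McGregor">Conor McGregor</a>
</td><td>Mixed martial arts
</td><td>43
</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
cells = sample_soup.find_all('td')


def cell_text(cell):
    """Return the link text if the cell has a link, else the cell's own text."""
    link = cell.find('a')
    source = link if link is not None else cell
    return source.get_text(strip=True)  # strips newlines and spaces


cleaned = [cell_text(cell) for cell in cells]
print(cleaned)
```

This is just one way to guard against a missing link. Calling .find('a').text directly, as in the post, raises an AttributeError when a cell has no anchor.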

print('\n-------CLEAN--------\n')
print(rank)
print(page_name)
print(sport)
print(views)

complete_row.append([rank, page_name, sport, views])

Here we print the cleaned values and add them to the complete_row list.

-------cells--------

[<td>13
</td>, <td style="text-align: left;"><a href="/wiki/Conor_McGregor" title="Conor McGregor">Conor McGregor</a>
</td>, <td><a href="/wiki/Mixed_martial_arts" title="Mixed martial arts">Mixed martial arts</a>
</td>, <td>43
</td>]

-------CLEAN--------

13
Conor McGregor
Mixed martial arts
43

Now we convert it into a dataframe:

headers = ['Rank', 'Name', 'Sport', 'Views Mil']

df = pd.DataFrame(complete_row, columns=headers)
df
image013.png

Now we have a pandas dataframe you can use for your ML project. You can use your favourite library to fit a model onto the data.
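One caveat before fitting anything: every scraped value is still a string. Below is a minimal sketch of one possible next step, using two made-up rows in the same shape as complete_row (the names and numbers are placeholders, not scraped data). It converts the numeric columns with pd.to_numeric and one-hot encodes the Sport column with pd.get_dummies.

```python
import pandas as pd

# Two made-up rows in the same shape as complete_row (all values are strings)
sample_rows = [
    ['1', 'Example Athlete A', 'Association football', '120'],
    ['2', 'Example Athlete B', 'Basketball', '95'],
]
sample_df = pd.DataFrame(sample_rows, columns=['Rank', 'Name', 'Sport', 'Views Mil'])

# Turn the numeric columns into numbers; unparseable values become NaN
for column in ['Rank', 'Views Mil']:
    sample_df[column] = pd.to_numeric(sample_df[column], errors='coerce')

# One-hot encode the categorical column so a model can consume it
features = pd.get_dummies(sample_df[['Rank', 'Sport']], columns=['Sport'])

print(sample_df.dtypes)
print(features.columns.tolist())
```

Which columns you keep as features depends on your project. This only shows how to get from scraped text to numeric inputs.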

--

If you found this article interesting, then check out my mailing list, where I write more stuff like this.
