How To Scrape A Website For Your ML Project
A while ago, I was reading a thread on the LearnML subreddit in which the OP needed to scrape webpage data for his ML project.
People in the thread gave good answers, which mainly came down to: learn how to use BeautifulSoup and Selenium.
But if the OP has no experience with those libraries, he may not know how to relate them to his ML project.
I have used BeautifulSoup and Selenium for some of my data science projects. While they weren't the most advanced tasks, they got the work done.
https://www.tobiolabode.com/blog/2020/4/21/bookstores-vs-income-data-science-project-part-1
https://www.tobiolabode.com/blog/2020/4/26/bookstore-vs-income-part-2
In this blog post, I’m going to show you how to scrape a webpage with some useful data and convert it into a pandas dataframe.
The reason we want to convert it into a dataframe is that most ML libraries can handle pandas dataframes, and the data can be edited for your model with minimal changes.
First, we are going to find a table on Wikipedia to convert into a dataframe.
Here I’m going to scrape a table of the most viewed sportspeople on Wikipedia.
A lot of the work will be navigating the HTML tree to get to the table we want.
We will use BeautifulSoup with the help of requests and the regex library.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
Here we are extracting the HTML code from the webpage:
website_url = requests.get('https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages').text
soup = BeautifulSoup(website_url, 'lxml')
print(soup.prettify())
</a>
</li>
<li id="footer-places-disclaimer">
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">
Disclaimers
</a>
</li>
<li id="footer-places-contact">
<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us">
Contact Wikipedia
</a>
</li>
<li id="footer-places-mobileview">
<a class="noprint stopMobileRedirectTog
We want to collect all of the tables from the page, so we have a smaller surface area to search through.
wiki_tables = soup.find_all('table', class_='wikitable')
wiki_tables
As there are numerous tables, we need a way to filter them.
We know that Cristiano Ronaldo has an anchor tag that will likely be unique to a few tables.
We can filter for the tables that have an anchor tag with the text Cristiano Ronaldo, then find the parent elements that contain that anchor tag.
links = []
for table in wiki_tables:
    _table = table.find('a', string=re.compile('Cristiano Ronaldo'))
    if not _table:
        continue
    print(_table)
    _parent = _table.parent
    print(_parent)
    links.append(_parent)
<a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
<td style="text-align: left;"><a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
</td>
<a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
<td style="text-align: left;"><a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
</td>
<a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
<td style="text-align: left;"><a href="/wiki/Cristiano_Ronaldo" title="Cristiano Ronaldo">Cristiano Ronaldo</a>
</td>
The parent only shows us the cell. We need to climb higher up the tree.
Here is one of those cells viewed in the browser's web dev tools.
parent_lst = []
for anchor in links:
    _ = anchor.find_parents('tbody')
    print(_)
    parent_lst.append(_)
By searching for the tbody parents, we return the tables that contain the anchor tags from earlier.
To filter even more, we can search through the headers of those tables:
for i in parent_lst:
    print(i[0].find('tr'))
<tr>
<th>Rank*</th>
<th>Page</th>
<th>Views in millions
</th></tr>
<tr>
<th>Rank</th>
<th>Page</th>
<th>Views in millions
</th></tr>
<tr>
<th>Rank</th>
<th>Page</th>
<th>Sport</th>
<th>Views in millions
</th></tr>
The third one looks like the table that we want.
Now we start to create the logic needed to extract and clean the details we want.
sports_table = parent_lst[2]

complete_row = []

for i in sports_table:
    rows = i.find_all('tr')
    print('\n--------row--------\n')
    print(rows)

    for row in rows:
        cells = row.find_all('td')
        print('\n-------cells--------\n')
        print(cells)

        if not cells:
            continue

        rank = cells[0].text.strip('\n')
        page_name = cells[1].find('a').text
        sport = cells[2].find('a').text
        views = cells[3].text.strip('\n')

        print('\n-------CLEAN--------\n')
        print(rank)
        print(page_name)
        print(sport)
        print(views)

        complete_row.append([rank, page_name, sport, views])

for i in complete_row:
    print(i)
To break it down:
sports_table = parent_lst[2]
complete_row = []
Here we select the third element of the list from earlier, as it's the table we want.
Then we create an empty list that will store the details of each row as we iterate through the table.
We create a loop that iterates over each row in the table and saves the rows into the rows variable.
for i in sports_table:
    rows = i.find_all('tr')
    print('\n--------row--------\n')
    print(rows)

    for row in rows:
        cells = row.find_all('td')
        print('\n-------cells--------\n')
        print(cells)
We create a nested loop that iterates through each row saved by the last loop. As we iterate through a row, we save its individual cells into a new variable.
if not cells:
    continue
This short snippet of code allows us to avoid empty cells and prevent errors when extracting the text from the cell.
rank = cells[0].text.strip('\n')
page_name = cells[1].find('a').text
sport = cells[2].find('a').text
views = cells[3].text.strip('\n')
Here we clean the various cells into plain-text form. The cleaned values are saved in variables named after their columns.
print('\n-------CLEAN--------\n')
print(rank)
print(page_name)
print(sport)
print(views)
complete_row.append([rank, page_name, sport, views])
Here we print the cleaned values and append them to the complete_row list.
-------cells--------
[<td>13
</td>, <td style="text-align: left;"><a href="/wiki/Conor_McGregor" title="Conor McGregor">Conor McGregor</a>
</td>, <td><a href="/wiki/Mixed_martial_arts" title="Mixed martial arts">Mixed martial arts</a>
</td>, <td>43
</td>]
-------CLEAN--------
13
Conor McGregor
Mixed martial arts
43
Now we convert it into a dataframe:
headers = ['Rank', 'Name', 'Sport', 'Views Mil']
df = pd.DataFrame(complete_row, columns=headers)
df
Now we have a pandas dataframe that you can use for your ML project. You can use your favourite library to fit a model to the data.
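One small thing to note: the scraped values come out as strings, so you will likely want to cast the numeric columns before feeding the dataframe to a model. A minimal sketch (the column names assume the headers defined above):
# Cast the scraped string columns to numbers; 'coerce' turns anything unparsable into NaN
df['Rank'] = pd.to_numeric(df['Rank'], errors='coerce')
df['Views Mil'] = pd.to_numeric(df['Views Mil'], errors='coerce')
print(df.dtypes)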
--
If you found this article interesting, then check out my mailing list, where I write more stuff like this.
Making Sure All Your Data Sources End Up in The Same Shape
I was reading a thread a while ago and the OP asked:
I have data from different input sources that vary in the number of columns they have. It looks like for Keras I need to make sure every file has the same number of columns. My first thought is to just add 0's where the columns should be but I don't know what this does to weighting so I'm asking here. Thanks!
This looks like another feature engineering problem. My best answer is to make a combined dataset of these various sources.
This can be done with pandas, using the concat and merge functions.
result = pd.concat([df1, df4], axis=1)
An example from the pandas documentation.
Another example from the documentation:
result = pd.concat([df1, df4], ignore_index=True, sort=False)
To get rid of the NaN rows, you can use the dropna function, or pass join='inner' while concatenating the dataframes.
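Here is a minimal sketch of both options, using made-up dataframes in the style of the pandas docs:
import pandas as pd

# Hypothetical dataframes whose rows only partially line up
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']}, index=[0, 1, 2, 3])
df4 = pd.DataFrame({'C': ['C2', 'C3', 'C6', 'C7'],
                    'D': ['D2', 'D3', 'D6', 'D7']}, index=[2, 3, 6, 7])

# Option 1: concatenate side by side, then drop the rows containing NaN
result = pd.concat([df1, df4], axis=1).dropna()

# Option 2: only keep the rows present in both dataframes while concatenating
result = pd.concat([df1, df4], axis=1, join='inner')
print(result)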
Check out the pandas documentation here.
A person in the thread recommended a decent course of action:
it might be the best to manually prepare the data to use only/mostly the columns which are present across all datasets, and make sure they match together.
This is a good idea, but important features may be dropped in the process. Because of that, I would recommend the first option, then deciding which columns to drop after training. Depending on your data, you may also want to add an extra feature yourself that records the input source.
The main task is some extra pre-processing of your data, and combining the sources is the best bet. From there you can apply various feature selection techniques to decide which columns you would like to keep. Check out the pandas documentation if you're not sure how to deal with dataframes.
If you found this article interesting, then check out my mailing list, where I write more stuff like this.
Using ARIMA to Forecast Your Weekly Dataset
I was reading a Reddit thread in which the OP asked for help forecasting some of the weekday performance in the dataset. Machine learning gives you a few ways to do this.
This is the area of time series forecasting. There are two main ways to do it. First, you can use neural networks like LSTMs, which take a sequence of data and predict the next time window. The second is to use methods from the stats world, mainly models like ARIMA.
In this article, we are just going to focus on ARIMA, a statistical technique for forecasting.
ARIMA is easier to set up and understand than a neural network, and it's very useful if you have a small dataset.
One of my projects was to forecast rainfall in a certain area. It did not work as well as I hoped, but it will likely work better if your data has a clear correlation between variables.
A person in the thread gave a good resource for ARIMA. https://www.askpython.com/python/examples/arima-model-demonstration
Since I'm not an expert in time series forecasting, I can point you to some resources to check out:
https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
Some general tasks you want to do:
- Make sure your data is stationary
- Install pmdarima
- If your data is seasonal use SARIMA instead.
After using the resources above, you can then forecast the win-loss ratio for your dataset, or any other variable you want to forecast into the future.
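To give a flavour of what this looks like in code, here is a minimal sketch using statsmodels (the file name and the (1, 1, 1) order are just placeholders; you would pick the order based on your own data):
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical weekly series; replace with your own data
series = pd.read_csv('weekly_data.csv', index_col=0, parse_dates=True).squeeze()

# Fit a simple ARIMA model
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next 4 periods
print(fitted.forecast(steps=4))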
If you found this article interesting, then check out my mailing list, where I write more stuff like this.
Some resources on “How do I learn about machine learning?”
“How do I learn machine learning?” is a question you’ve seen many times on this subreddit and in other places. Maybe you’re the one asking it. In this blog post, I will provide some links that you can check out.
While I’m not an expert (and nowhere close to one), I can still point to some areas that you should check out.
Your first step is to enrol in the fast.ai course. This class will let you try out ML first-hand, without getting bogged down in lectures and theory at first (which are still important).
If you feel your Python skills are not up to scratch, then I recommend these resources:
Also, there are lots of YouTube videos explaining Python.
You should have a decent grasp of the basics after using these materials.
The second step is some type of theory and maths:
There are various arguments about learning the maths for ML. One argument says you should do it on a case-by-case basis: for example, if you’re learning about CNNs, then learning linear algebra at the same time would be useful. The other is to learn the basic maths up front, maybe by following a course or reading a textbook.
In my opinion, the first case wins, because the ML field is so vast that you may get stuck learning theory for a long time without getting much hands-on work. But if the second option sounds appealing, then go ahead.
Resources for maths include:
Mathematics for Machine Learning
Learning Math for Machine Learning
For ML Theory:
Andrew Ng’s deep learning course
Yann LeCun & co.’s deep learning course
Lectures for specific areas:
Introduction to Convolutional Neural Networks for Visual Recognition (CS231n)
There are some textbooks you can try out:
Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow
Deep Learning with Python by Francois Chollet
The Hundred-Page Machine Learning Book
(I have not used these books myself, but they have gotten good reviews from various people.)
“But I don’t like learning via courses!” I wrote a blog post if that’s your case.
Related to that, you can continue to improve your skills by doing more custom tasks, like:
Side projects
Kaggle
Implementing papers
After going through some of the material above, you will know a bit more about machine learning.
Now you can decide what to do next with that knowledge.
-
If you found this post useful, then check out my mailing list where I write more stuff like this.
Neural Networks you can try to implement from scratch (for beginners)
I was reading a tweet about how useful it is to implement neural networks from scratch, and how it allowed for a greater understanding of the topic. The author said he found it more useful than having other people explain the concept to him.
While I disagree with the author’s view that it removes the need for explanations, it certainly does help you understand your own model.
I recommend giving it a go. In this blog post, I will suggest which models you should try to implement from scratch using NumPy or your favourite library, and I will link to some accompanying resources.
Simple Feedforward Network
This is the most famous example because it’s so simple, yet it allows you to learn so much. I heard about this idea from Andrew Trask, and it helped me think about implementing networks from scratch in general.
For the feedforward network, you will only be using NumPy; you won’t need PyTorch or TensorFlow to do the heavy lifting for the complex calculations.
You simply create NumPy arrays for the training and testing data, create a nonlinear function using NumPy, then work out the error between the layer’s guess and the real data.
Resource for this task: https://iamtrask.github.io/2015/07/12/basic-python-network/
Follow this tutorial; it does a much better job of explaining how to do this in NumPy, with code examples to follow.
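To give a rough idea of the shape of the task, here is a minimal sketch of a single-layer network’s forward pass and error in plain NumPy (the toy data and the sigmoid nonlinearity are just assumptions for illustration, in the spirit of that tutorial):
import numpy as np

def sigmoid(x):
    # Nonlinear activation that squashes values into (0, 1)
    return 1 / (1 + np.exp(-x))

# Toy training data: 3 input features per example, 1 target
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])

# Randomly initialised weights for a single layer
np.random.seed(1)
weights = 2 * np.random.random((3, 1)) - 1

# Forward pass: the layer's guess for each example
guess = sigmoid(np.dot(X, weights))

# Error between the guess and the real data
error = y - guess
print(error)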
Feedforward Network with Gradient Descent
This is an extension of the network above. In this network, we allow the model to optimise its weights. This can also be done in NumPy.
Resource for this task: https://iamtrask.github.io/2015/07/27/python-network-part2/
A follow-on from the previous article.
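Continuing the sketch from above, here is a hedged example of what the weight update could look like (same toy data, sigmoid and weights as before):
def sigmoid_deriv(output):
    # Slope of the sigmoid, evaluated on its output
    return output * (1 - output)

# Repeatedly guess, measure the error, and nudge the weights to reduce it
for _ in range(10000):
    guess = sigmoid(np.dot(X, weights))
    error = y - guess
    weights += np.dot(X.T, error * sigmoid_deriv(guess))

print(sigmoid(np.dot(X, weights)))  # the guesses should now be close to y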
PyTorch Version of Perceptrons and Multi-layer Perceptrons
Here we go up a level by using a library. The examples I’m linking to are done in PyTorch, but you can use whatever library you prefer. When implementing these networks, you learn how much of the work a library does for you.
Resources for this task:
https://medium.com/@tomgrek/building-your-first-neural-net-from-scratch-with-pytorch-56b0e9c84d54
https://becominghuman.ai/pytorch-from-first-principles-part-ii-d37529c57a62
K Means Clustering
Yes, this one does not count as a network, but a traditional machine learning algorithm is still very useful. As it’s not a deep learning algorithm, it should be easier to understand. It can be implemented using just NumPy or pandas, depending on the implementation.
Resources for this task:
https://www.machinelearningplus.com/predictive-modeling/k-means-clustering/
https://gdcoder.com/implementation-of-k-means-from-scratch-in-python-9-lines/
There are quite a few implementations to choose from, so pick whichever one helps you understand the concepts better.
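For reference, here is a minimal sketch of the core k-means loop in NumPy (random toy data; a fuller implementation would also handle convergence checks and empty clusters):
import numpy as np

np.random.seed(0)
points = np.random.rand(100, 2)  # toy 2D data
k = 3

# Start with k random data points as the initial centroids
centroids = points[np.random.choice(len(points), k, replace=False)]

for _ in range(10):
    # Assign each point to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Move each centroid to the mean of its assigned points
    centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])

print(centroids)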
These networks and models should be simple enough that you won’t get lost trying to implement them, but they’ll still help you learn a few things along the way.
-
If you found this post useful, then check out my mailing list where I write more stuff like this.
Tips For Learning ML If You Don't Like learning Via Courses
I read a Reddit post about how the OP was struggling to learn ML, because he found the courses abstract and theoretical and did not see how they related to his ML goals. In the thread, people gave their opinions and useful suggestions.
I will show some of those suggestions below:
Create A Side Project
Start working on a project that involves ML. Then you can learn about the topic as you develop the project. You can write about what you learned in a blog post, so you know what to work on next time.
Implement A Paper
Implementing a paper helps you learn new concepts and forces you to translate that knowledge into something tangible.
Take Courses That Focus On Coding Models Straight Away
I recommend fast.ai, which is a very hands-on course that focuses on working through ML examples straight away. It helps you learn the basics of deep learning while giving you some tangible examples to show.
Tutorials Provided By PyTorch and Tensorflow
You can try the tutorials provided on their websites. You will work through practical examples of how to use the library, and you can read about the concepts that some of the tutorials discuss along the way.
Create Your Favourite Models From Scratch
This idea is from Andrew Trask: you create neural networks using only NumPy. This forces you to turn any theoretical knowledge you have about ML into real examples. It won’t be enough to name a concept and move on; you will need to make tangible examples of the concepts. This can be done with your favourite libraries as well.
Additional note:
You still need theoretical knowledge if you want to do well with ML, as you want to know how your model works behind the scenes, and it helps you grasp any new concept that comes your way. If you want to learn the maths, check out these resources (the MML book and YC’s Learning Math for Machine Learning), as maths is something that many people struggle with when learning ML.
After this, you should be more confident about learning ML, with hands-on experience making models and a greater understanding of the courses you were watching.
-
If you found this post useful, then check out my mailing list where I write more stuff like this.
How to run python scripts on Google Colab
Did you know that you can run Python scripts in Google Colab?
No? Neither did I.
I just learned this from reading a Reddit comment.
I’m going to show you how in this blog post, so you can put Google’s compute power to full use.
If I had learned about this earlier, many of my projects would have been easier.
If you saved your script in your Google Drive folder, then click the Mount Drive button and grant permission.
I’m just going to have a simple hello world script:
print('Hello World, This is a Google Colab script')
If you’re uploading locally, you can click the upload to session storage button. This should work if your file is not too large.
Then you run this statement:
!python '/content/colab_script.py'
With the result:
Hello World, This is a Google Colab script
You can upload to drive using:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
First the file will be saved in session storage, then you can move it into your Drive folder.
NOTE: I don’t know if this is a bug, but uploading your .py file via the normal file upload on Google Drive (not Colab) turns the .py file into a .gdoc file, which Google Colab can’t run. So you will need to upload your file through Google Colab to access it.
Hopefully you found this useful, knowing you can run some of your scripts on Google Colab.
Using assert statements to prevent errors in ML code
In my previous post, I showed what I learnt about unit testing. Testing tends not to be thought about when coding ML models (the exception being production), so I thought it would be an interesting topic to learn about.
I found one unit test to try out because it solves an issue I face a lot when coding up my models.
The unit test checks if I’m passing the right shape of data into the model, because I make this simple mistake from time to time. This mistake can add hours to your project if you don’t know the source of the problem.
After I shared the blog post on Reddit, a Redditor commented: “Why not just use assert?”
That was something that had not crossed my mind, so I jogged my memory by checking out what assert does.
Then I started working out how to use it for testing ML code.
One of the most popular blog posts on the internet about ML testing uses assertion statements to test the code.
When writing an assertion statement, wrapping it in a function is needed most of the time. This is how unit tests can be made.
Assertion Statement for the Wrong Shape
I was able to hack up this simple assertion statement.
def test_shape():
    assert torch.Size((BATCH_SIZE, 3, 32, 32)) == images.shape
This is even shorter than the unit test I created in the last blog post.
I tried out the new test by dropping the batch dimension, the same thing I did in the last post.
images = images[0,:,:,:]
images.shape
Now we get an assertion error:
To make the assertion error clearer, I added info about the shapes of the tensors.
def test_shape():
    assert torch.Size((BATCH_SIZE, 3, 32, 32)) == images.shape, f'Wrong Shape: {images.shape} != {torch.Size((BATCH_SIZE, 3, 32, 32))}'
This is super short. Now, you have something to try out straight away for your current project.
As I spend more time on this, I should be writing more about testing ML code.
An area I want to explore with ML testing is production, because I can imagine testing will be very important for making sure the data is all set up and ready before the model goes into production. (I don’t have the experience, so I’m only guessing.)
When I start work on my side projects, I can implement more testing, on both the model side and the production side, which would be a very useful skill to have.
-
If you liked this blog post, consider signing up to my mailing list, where I write more stuff like this.
Stop passing the wrong shape into your model with a unit test
When coding up a model, it can be easy to make a few trivial mistakes, leading to serious errors when training the model later on and more time spent debugging, only to find that your data was in the wrong shape or the layers were not configured properly.
Catching such mistakes earlier can make life so much easier.
I decided to do some googling around, and found out that you can use testing libraries to automatically catch those mistakes for you.
Now, passing the wrong shape through your layers should be a thing of the past.
Using unittest for your model
I’m going to use the standard unittest library. The example comes from this article: How to Trust Your Deep Learning Code.
All credit goes to the author; have a look at his blog post for a great tutorial on unit testing deep learning code.
This test simply checks if your data is the same shape that you intend to fit into your model.
Trust me.
You don’t know how many times an error connected to this pops up, especially when you’re only half paying attention.
This test should take minutes to set up and can save you hours in the future.
import unittest
import torch

# trainloader is your existing PyTorch DataLoader
dataiter = iter(trainloader)
images, labels = next(dataiter)

class MyFirstTest(unittest.TestCase):
    def test_shape(self):
        self.assertEqual(torch.Size((4, 3, 32, 32)), images.shape)
To run it:
unittest.main(argv=[''], verbosity=2, exit=False)
test_shape (__main__.MyFirstTest) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.056s
OK
<unittest.main.TestProgram at 0x7fb137fe3a20>
The batch size is hard-coded in, but this can be changed if we save our batch size into a separate variable.
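A minimal sketch of what that could look like (assuming trainset is your existing dataset of 3x32x32 images):
BATCH_SIZE = 4

# Build the DataLoader and the test from the same variable, so they can't drift apart
trainloader = torch.utils.data.DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True)
images, labels = next(iter(trainloader))

class MyFirstTest(unittest.TestCase):
    def test_shape(self):
        self.assertEqual(torch.Size((BATCH_SIZE, 3, 32, 32)), images.shape)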
The test with the wrong shape
Now let’s check out the test when the data has a different shape.
I’m just going to drop the batch dimension. This is a mistake that could happen if you manipulated some of your tensors.
images = images[0,:,:,:]
images.shape
torch.Size([3, 32, 32])
unittest.main(argv=[''], verbosity=5, exit=False)
As we can see, the unit test catches the error. This can save you time, as you won’t hit this issue later on when you start training.
I wanted to keep this one short. This is an area I’m still learning about, so I decided to share what I just learnt, and I wanted to give you something you can try out straight away.
Visit these links for far more detailed resources about unit testing for machine learning:
https://krokotsch.eu/cleancode/2020/08/11/Unit-Tests-for-Deep-Learning.html
https://towardsdatascience.com/pytest-for-data-scientists-2990319e55e6
https://medium.com/@keeper6928/how-to-unit-test-machine-learning-code-57cf6fd81765
https://towardsdatascience.com/unit-testing-for-data-scientists-dc5e0cd397fb
As I start to use unit testing more in my deep learning projects, I should be creating more blog posts about other short tests you can write to save time and effort when debugging your model and data.
I used PyTorch for this, but it can be done with most other frameworks. TensorFlow has its own test module, so if that’s your thing then you should check it out.
Other people also use pytest and other testing libraries. I wanted to keep things simple, but if you’re interested you can check them out for yourself and see how they can improve your tests.
…
If you liked this blog post, consider signing up to my mailing list, where I write more stuff like this.
How to extract currency related info from text
I was scrolling through Reddit and a user asked how to extract currency-related text in news headlines.
This is the question:
Hi, I'm new to this group. I'm trying to extract currency related entities in news headlines. I also want to extend it to a web app to highlight the captured entities. For example the sentence "Company XYZ gained $100 million in revenue in Q2". I want to highlight [$100 million] in the headline. Which library can be used to achieve such outcomes? Also note since this is news headlines $ maybe replaced with USD, in that case I would like to highlight [USD 100 million].
While I have not done this before, I have experience scraping text from websites, and the problem looks simple enough that it would likely only require basic NLP.
So I did a few Google searches and found several popular libraries that do just that.
Using spaCy to extract monetary information from text
In this blog post, I’m going to show you how to extract currency-related info from text.
I’m going to take this headline I found on Google:
23andMe Goes Public as $3.5 Billion Company With Branson Aid
Now, by using a few lines of the NLP library spaCy, we can extract the currency-related text.
The code was adapted from this Stack Overflow answer:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('23andMe Goes Public as $3.5 Billion Company With Branson Aid')
extracted_text = [ent.text for ent in doc.ents if ent.label_ == 'MONEY']
print(extracted_text)
['$3.5 Billion']
With only a few lines of code, we were able to extract the financial information.
You will need some extra code when dealing with multiple headlines, like storing them in a list and having a for loop do the extraction.
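A minimal sketch of what that loop might look like (the headlines below are just placeholders, the second one taken from the question above):
headlines = [
    '23andMe Goes Public as $3.5 Billion Company With Branson Aid',
    'Company XYZ gained $100 million in revenue in Q2',
]

extracted = []
for headline in headlines:
    doc = nlp(headline)
    money = [ent.text for ent in doc.ents if ent.label_ == 'MONEY']
    extracted.append(money)

print(extracted)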
spaCy is a great library for getting things done with NLP. I don’t consider myself an expert in NLP, but you should check it out.
The code is taking advantage of spaCy’s named entities.
From the docs:
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction.
The named entities have annotations, which we access in the code. By filtering the entities down to the MONEY type only, we make sure that we are extracting the financial information from the headline.
How to replace a currency symbol with a currency abbreviation
As we can see, spaCy did a great job extracting the wanted information, so the main task is done.
In the question, the person also needed help with replacing the dollar sign with USD, as well as highlighting the financial information.
Replacing the dollar sign is easy, as it can be done with native Python string methods.
extracted_text[0].replace('$', 'USD ')
USD 3.5 Billion
Now we have replaced the symbol with the currency abbreviation. The same can be done for any other currencies you want.
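For example, here is a small sketch using a symbol-to-abbreviation mapping (the symbols and codes are just illustrative; extend the dictionary for your own use case):
# Hypothetical mapping of currency symbols to abbreviations
symbol_to_code = {'$': 'USD ', '£': 'GBP ', '€': 'EUR '}

def replace_symbols(text):
    for symbol, code in symbol_to_code.items():
        text = text.replace(symbol, code)
    return text

print(replace_symbols('$3.5 Billion'))  # USD 3.5 Billion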
Highlighting selected text in data
Highlighting the text moves away from processing data and into the realm of web development.
It would require adjusting the person’s web app to add some extra HTML and CSS attributes.
While I don’t have the know-how to do that myself, I can point you in some directions:
Highlight Searched text on a page with just Javascript
https://stackoverflow.com/questions/8644428/how-to-highlight-text-using-javascript
Hopefully, this blog post has helped your situation and set you on your way to completing your project.
If you want more stuff like this, then check out my mailing list, where I solve many of your problems straight from your inbox.