Bookstores vs Income data science project part 1
The goal of this project is to use data to find out whether there is a correlation between the locations of the bookstore chain Waterstones and the income of each London borough. Waterstones is one of my favourite places to visit in London, as it has a wide selection of books and its shops are great places to stay if you plan on reading one.
I wanted to do a data science project as I haven’t done one in a while. As the goal of the project is simply to find correlations, the model will likely be a simple linear regression.
Now, the first step for any data science project is to get data. For the locations of the bookshops, the Waterstones website gives information about its stores. The default view sorts the bookshops into alphabetical directories:
But by clicking the “view all” button on the page, we can see all of the bookstores with less separation into different pages.
Having it like this makes the website easier to scrape, as I only need to navigate 15 pages compared to 24 with the alphabetical view.
To collect the income data of the London boroughs, there are two main datasets I’m thinking of using. One, by HM Revenue & Customs, gives the average income of taxpayers by borough, based on a survey. The other, by the Office for National Statistics, gives gross earnings by residence, sorted by borough.
Developing a web scraper
The first step is to create a web scraper that will take the addresses of the bookstores and store them in a file for later sorting, as using the search function on the website is not very effective.
In the image below I typed in “London”, but it does not show all the bookstores in the city:
Below the map the website shows a list of the bookshops. This is the information I want to scrape.
Going through the inspect element option in my browser, we can see how the HTML for the list is designed:
The div class="shops-directory-list span12 alpha omega section" holds the directory of the bookstore elements.
Then div class="shop-item span6 mobile-span12" holds the element for an individual bookstore.
From that div class, the bookstore’s information is located in the child tags.
One level below, the “inner” tag contains information about the bookstore.
One more level down leads to the textual information about the bookstore like the name and address.
This “info-wrap” div class gives me all the information I want, as it contains both the title of the bookstore and its address.
To create the scraper, I will be using Beautiful Soup to extract the HTML content and the requests library for the HTTP requests to collect the HTML from the web.
I was able to print out the contents of the HTML from beautiful soup like so:
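The fetch-and-parse step can be sketched as below. To keep the example self-contained and runnable, I use a small stand-in HTML snippet (my own invention, modelled on the class names seen in the inspector) instead of a live request; in the real scraper the HTML would come from requests.get on the Waterstones shops page.

```python
from bs4 import BeautifulSoup

# In the real scraper the HTML comes from the live site, e.g.:
#   import requests
#   html = requests.get(url).text
# Here a small stand-in snippet (hypothetical content) keeps the
# example self-contained.
html = """
<html><body>
  <div class="shops-directory-list span12 alpha omega section">
    <div class="shop-item span6 mobile-span12">
      <div class="inner"><div class="info-wrap">
        <h3>Waterstones Piccadilly</h3>
        <span>203-206 Piccadilly, London</span>
      </div></div>
    </div>
  </div>
</body></html>
"""

# Parse the HTML and print it with indentation, as in the post.
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
```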
Now I need to navigate to the section where the information about the bookstores is.
From printing the body tag using print(soup.body.prettify()),
I was able to find the area where the bookstores were, but there is a lot of other irrelevant information in the HTML.
So, I need to zoom in further on the HTML hierarchy. To get to the element where the bookstore information is located, I need to travel “sideways” across the HTML tree, as one of the div elements is located on the same level as the div with the useful information.
So, I used Beautiful Soup’s navigation with the next_sibling attribute to get to the “main-page row” div. But when I ran the code, I got this error:
Deleting the next_sibling attribute got me this:
This is the div element on the first level. But this element is not needed, as the useful information is in the second div. Doing the same as earlier, but without the prettify function, print(soup.body.div.div.next_sibling)
gets me nothing:
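A likely explanation for both the error and the empty result is that Beautiful Soup treats the whitespace between tags as text nodes, so next_sibling often lands on a newline rather than the next div. A minimal sketch, using a two-line stand-in HTML snippet of my own rather than the real page:

```python
from bs4 import BeautifulSoup

# Two sibling divs separated by a newline; BeautifulSoup represents
# that newline as a NavigableString sitting between the two tags.
html = "<body><div id='first'>a</div>\n<div id='second'>b</div></body>"
soup = BeautifulSoup(html, "html.parser")

first = soup.body.div
print(repr(first.next_sibling))          # the newline text node, not a tag
print(first.next_sibling.next_sibling)   # stepping twice reaches the second div
print(first.find_next_sibling("div"))    # find_next_sibling skips text nodes
```

This is why chains of next_sibling on real pages often need two steps per element, and why find_next_sibling("div") is usually the more robust choice.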
Later on, I was able to access the div element I wanted using the find_all function. I saved the parent element of the starting div containing the bookstore information in a variable.
main_container_div_element = soup.body.div
From there I searched the div by class then printed the variable out.
find_div = main_container_div_element.find_all('div', class_="main-page row")
print(find_div)
I found another way to reach the “main-page row” div. By saving the first sibling in a variable, I was able to manually cross sideways using the next_sibling attribute multiple times.
sibling_2 = first_sibling_div.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling
print(sibling_2)
The one advantage of this way is that the result does not come within a list. But with the previous solution I can simply use an index on the variable to get the chosen element from the list, like this:
print(find_div[0])
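The point about indexing can be sketched as below: find_all returns a ResultSet, which behaves like an ordinary list, so [0] picks out a single tag. The HTML here is a stand-in of my own, not the Waterstones page.

```python
from bs4 import BeautifulSoup

# find_all returns a ResultSet (a list subclass), so ordinary
# indexing picks out a single Tag from the matches.
html = "<div class='a'>one</div><div class='a'>two</div>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("div", class_="a")
print(type(matches).__name__)  # ResultSet
print(matches[0].text)         # one
print(matches[1].text)         # two
```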
Now, using further navigation, we can move deeper into the div element where the bookstore information is located. I used the find_all function again to move one level deeper. Doing it like this is easier, as the div element I want has other siblings as well.
find_div_2 = find_div[0].find_all('div', class_='shops-directory-list span12 alpha omega section')
print(find_div_2)
Now I used the find_all function to move a level deeper and collect the children of this element. This was done as the div class 'shop-item span6 mobile-span12' contains the information of the individual bookstores.
bookstore_info = find_div_2[0].find_all('div', class_='shop-item span6 mobile-span12')
print(bookstore_info)
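Putting the three navigation steps together, the whole chain can be sketched as below. The class names come from the inspector screenshots above; the HTML itself is a small stand-in of my own, with two placeholder shop entries instead of the real page's many.

```python
from bs4 import BeautifulSoup

# Stand-in HTML mirroring the class names seen in the inspector;
# the real page has one shop-item div per bookstore.
html = """
<div class="main-page row">
  <div class="shops-directory-list span12 alpha omega section">
    <div class="shop-item span6 mobile-span12">Shop A</div>
    <div class="shop-item span6 mobile-span12">Shop B</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: the outer container, step 2: the directory list,
# step 3: one entry per individual bookstore.
find_div = soup.find_all("div", class_="main-page row")
find_div_2 = find_div[0].find_all("div", class_="shops-directory-list span12 alpha omega section")
bookstore_info = find_div_2[0].find_all("div", class_="shop-item span6 mobile-span12")
print(len(bookstore_info))  # number of bookstores found on the page
```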
The same div classes contain the information for the other bookstores.
The information about each bookstore is stored at this layer.
I will work on extracting the information in this layer on the next blog post.