Bookstore vs Income part 2

Click here for part 1

Information about the bookstores is stored in this layer.

image001.png

Using the bookstore_info variable, I want to extract the name and address from the elements inside the div. To do this I need to create a loop: the find_all function returns an iterable, so I will be able to use it for the loop I want to develop.

The idea of the loop I got from this page: https://realpython.com/beautiful-soup-web-scraper-python/

I was able to develop this loop:

for each_bookstore in bookstore_info:
    name_of_bookstore = each_bookstore.find('a', class_='title link-invert')
    address_of_bookstore = each_bookstore.find('a', class_='shop-address')

    print(name_of_bookstore)
    print(address_of_bookstore)

The loop iterates through each 'shop-item span6 mobile-span12' element from the bookstore_info variable. It uses the find function, which differs from find_all in that it only returns one result. It finds the element the title is stored in (an anchor tag) by its class ('title link-invert'). The next line does the same thing for the address of the bookstore.

image003.png

The last lines of output from the loop's print statements:

image005.png

But I still have hanger-on HTML tags. By adding .text to the output, I was able to remove the tags.

print(name_of_bookstore.text)
print(address_of_bookstore.text)
image007.png

But I have lots of whitespace, so it looks like I need to add the Python strip function. I also added a line break between the name and address to make the output clearer:

print(name_of_bookstore.text.strip())
print('\n')
print(address_of_bookstore.text.strip())

This removed a lot of the whitespace, but some still remains around the addresses:

image009.png

I made a simple text file to check whether the problem was just my Anaconda prompt messing with the output:

image011.png

As we can see the format is still messed up.

From looking around to fix this issue, I found that the main source of the problem is the text from the HTML itself: it already has odd formatting that later shows up when extracting it.

As we can see below, I double-clicked the text, which shows the amount of whitespace around the address.

image013.png

I was able to find the solution in a Stack Overflow answer. The question asked how to remove the whitespace in a string while keeping the whitespace between the words. This is similar to my issue, as I want to keep the whitespace within the addresses so they still make sense.

separated_address = ' '.join(address_of_bookstore.text.split())
image015.png

I added print statements saying 'Bookstore name: ' and 'address_of_bookstore: ' to make the output clearer.

I tried a different solution earlier where I used str.replace to remove all the whitespace from the address variable, but this made the addresses unreadable. One of the major issues was that it deleted the space inside the postcodes, which is hard to add back as a postcode is just a string of numbers and letters.
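To illustrate the difference between the two approaches (the address below is a made-up example, not the site's actual data):

```python
# A string with leading/trailing whitespace and internal newlines,
# similar in shape to the scraped addresses
raw = '  \n        123 Example Street,\n        London,\n        N1 9GU\n    '

# str.replace deletes *every* space, mangling the postcode
flattened = raw.replace(' ', '').replace('\n', '')
print(flattened)  # 123ExampleStreet,London,N19GU

# split() breaks on any run of whitespace; join() rebuilds with single spaces
cleaned = ' '.join(raw.split())
print(cleaned)    # 123 Example Street, London, N1 9GU
```

The split/join version keeps exactly one space between words, which is why the postcode survives intact.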

By adding a few newline statements when writing the text file, and to the print statements for the output, the format is significantly improved:

image017.png
image019.png


Now that I have the format of the address fixed, I want to convert the text file into a CSV for the dataset. Or, more simply, change the Python file writer to write CSV files instead of text files.

If I want to make a CSV file using this data, then like I mentioned earlier, I need to adjust how the file is written. CSV files are written by rows, so the 'Bookstore name: ' and 'Bookstore_address: ' strings will need to go, as the CSV file would count them as data entries. The best move is to put those strings in the top row and use them as column names. The 'Bookstore_address: ' column may also have to be adjusted, as the addresses themselves contain commas, which would create separate columns in a CSV file. I may split the string into street name, town, and postcode, as this is the order of the address parts inside the data.
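Splitting the cleaned address into those three columns could be sketched like this, assuming the parts are comma-separated in street, town, postcode order (the address below is hypothetical):

```python
# A cleaned address as produced by the earlier split/join step (hypothetical)
separated_address = '123 Example Street, London, N1 9GU'

# Split on commas and trim the surrounding spaces from each part
parts = [part.strip() for part in separated_address.split(',')]
street, town, postcode = parts
print(street)    # 123 Example Street
print(town)      # London
print(postcode)  # N1 9GU
```

This simple split assumes every address has exactly three comma-separated parts; any address with extra commas would need special handling.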

Here is the file writer adjusted to write CSV files:
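The exact original code isn't reproduced here, but a minimal sketch of such a writer using Python's built-in csv module might look like this (the rows are hypothetical stand-ins for the scraped results):

```python
import csv

# Example (name, address) pairs as the scraping loop would produce them
rows = [
    ('Example Books', '123 Example Street, London, N1 9GU'),
    ('Another Bookshop', '45 Sample Road, Leeds, LS1 4AB'),
]

with open('bookstores.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # Column names go in the top row instead of repeating label strings per entry
    writer.writerow(['Bookstore name', 'Bookstore address'])
    for name, address in rows:
        writer.writerow([name, address])
```

Note that because the addresses contain commas, the csv module wraps those fields in quote marks by default.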

And the produced dataset:

image021.png

As we can see, it worked very well. I just need to adjust the quoting setting for the file so the quote marks can be removed.
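The quote marks appear because the address fields contain the comma delimiter. One way to suppress them (a sketch, not the exact code used) is the csv module's quoting parameter, which then requires an escapechar for delimiters inside fields:

```python
import csv
import io

buffer = io.StringIO()
# QUOTE_NONE suppresses quote marks entirely; an escapechar is then
# required for any field that contains the delimiter itself
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow(['Example Books', '123 Example Street, London, N1 9GU'])
print(buffer.getvalue())  # Example Books,123 Example Street\, London\, N1 9GU
```

Splitting the address into separate street/town/postcode columns would remove the embedded commas and avoid the quoting issue altogether.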

Also, I need to add a way to scrape the rest of the pages, as this dataset is based only on the first page of the website. Looking at the URLs for the other pages, a small addition appears at the end of the URL:

image023.png

The page/2 is added. When moving to the other pages, only the number changes, for example page/3 and page/5. Using this knowledge, I need to create a loop that increases the number at the end of the link.

I used some of the code design from this blog post.

I made a list of numbers going up to 15 (as this is the number of pages on the website). Each number gets turned into a string so it can be attached to the URL. Afterwards, a loop iterates through the list; inside the loop a request is made to the website. A '/page/' string is added to the end of the URL so the page number can be appended after it.

Below that, an if statement checks whether the request returns a successful status code; if not, a message is printed. Afterwards there is a 10-second delay before moving on with the rest of the code. This is done so the scraper does not overwhelm the servers and get my IP address banned.
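Putting those steps together, the paging loop could be sketched as below. The URL is a placeholder, not the actual site, and the CSS class is carried over from the earlier loop:

```python
import time

import requests
from bs4 import BeautifulSoup

base_url = 'https://example.com/bookstores'  # placeholder for the real site URL

# List of page numbers up to 15, converted to strings for the URL
page_numbers = [str(number) for number in range(1, 16)]

for page_number in page_numbers:
    # '/page/' is appended to the URL, followed by the page number
    response = requests.get(base_url + '/page/' + page_number)

    # Print a message if the request did not succeed
    if response.status_code != 200:
        print('Request failed for page ' + page_number)

    # 10-second delay so the scraper does not overwhelm the servers
    time.sleep(10)

    soup = BeautifulSoup(response.text, 'html.parser')
    bookstore_info = soup.find_all('div', class_='shop-item span6 mobile-span12')
    # ...extract names and addresses as in the earlier loop...
```
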

The CSV file produced only showed the last page. This is likely because the CSV writer is inside the page loop, so it resets itself each time there is a new page.

image024.png

Now that I have moved the CSV writer outside of the loop, all the data is in the CSV file:

image027.png

While going through the CSV file, I noticed that the column names were written into the file multiple times.

image028.png

I think the problem is where the CSV writer for the column names is placed: it is inside the loop over the pages, as it was not moved when I adjusted the file writer.
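The fix is to write the header row once, before the page loop starts, and only write data rows inside it. A sketch with placeholder data:

```python
import csv

# Stand-in for scraped results: one list of (name, address) tuples per page
pages = [
    [('Example Books', '123 Example Street')],
    [('Another Bookshop', '45 Sample Road')],
]

with open('bookstores.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # Header row is written once, outside the page loop
    writer.writerow(['Bookstore name', 'Bookstore address'])
    for page_results in pages:
        # Only data rows are written inside the loop
        for name, address in page_results:
            writer.writerow([name, address])
```
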

 

After making changes the issue is fixed:

image030.png







Tobi Olabode