Bookstores vs income part 5

Click here for part 1, part 2, part 3, part 4

Now that I have the dataset of London bookstores complete, I can move on to getting the dataset of the incomes of the London boroughs. In the first article I said there are two datasets that I may use:

“One by HM Revenue & Customs which gives the average income of taxpayers by borough, based on survey. Another dataset by the Office of National Statistics, which gives the gross earnings by residence, sorted out by borough.” – part 1

This is what the HM Revenue & Customs dataset looks like:

image001.png

These incomes are reported by tax year.

This is what the Office for National Statistics dataset looks like:

image003.png

These earnings are reported as weekly or hourly figures.

I am going to use the first dataset, provided by HMRC, as it looks more usable and clearer: annual income figures can be more easily compared with other values.

I used pandas to open the Excel file:

import pandas as pd

excel_df = pd.read_excel('income-of-tax-payers.xls')
print(excel_df.head(10))

Got this result:

image005.png

This is the wrong sheet being read: the first sheet is the metadata of the file, giving general information about the dataset, while the data is in the second sheet. This is something I did not think about, as I usually deal with csv files, which only have one sheet. So, I need to tell pandas to read the second sheet.

excel_df = pd.read_excel('income-of-tax-payers.xls', sheet_name=1)
print(excel_df.head(10))
image007.png

I was able to load the correct sheet, but the information shown is not very useful, as it gives the incomes for 1999–2000. I want the latest tax year, so the data is as up to date as possible when I use it for analysis. First, I need to get the names of the columns, so I can find the column I want to use.

for col in excel_df.columns:
    print(col)
image009.png
image011.png

So when selecting columns in pandas I need to select the unnamed columns along with the column that has the tax year as its name.

tax_year = excel_df[['2016-17', 'Unnamed: 49', 'Unnamed: 50']]
print(tax_year)
image013.png

The columns were correctly selected, but scrolling down a few rows I noticed there are rows that are not needed, as they are not related to London.

image015.png

First, I tried using negative slices, but it kept the unwanted rows instead of removing them (drop removes the rows you pass it, and index[:-17] selects everything except the last 17 rows, so this call deleted all the data and kept only the tail I wanted gone):

tax_year = tax_year.drop(tax_year.index[:-17])

image017.png

Then I tried a solution recommended on Stack Overflow:

tax_year = tax_year.drop(tax_year.tail(10).index, inplace=True)

but I got this warning:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)

None
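The None at the end is the real clue: drop with inplace=True modifies the dataframe and returns None, so assigning the result back to tax_year wipes out the dataframe (the SettingWithCopyWarning is a separate matter: tax_year is a slice of excel_df, so pandas warns about modifying it). A minimal sketch of the two correct patterns (use one or the other, never both):

# mutate the dataframe in place, with no assignment:
tax_year.drop(tax_year.tail(10).index, inplace=True)

# or assign the returned copy, with no inplace:
tax_year = tax_year.drop(tax_year.tail(10).index)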

So I went back to trying negative slicing, but I got this:

TypeError: Expected tuple, got str

I updated pandas, as I read on this website that this error means my pandas is out of date. But that only led to more errors: I had to change the names of some of the columns, as the newer pandas parsed the file differently.

tax_year = excel_df[['Unnamed: 1', '2016-17', 'Unnamed: 51', 'Unnamed: 52']]

I added 'Unnamed: 1' for the names of the areas, as 'Area' does not work as a column name.

image019.png

I added a dropna() call to remove any missing values.
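In code that is a single call on the dataframe (the same line appears in the full snippet further down):

tax_year = tax_year.dropna()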

Now there are no NA values in the data section:

image021.png

But this is still the section I want to remove.

When looking at the dataframe I noticed that the names of the boroughs stop at index 36, so I just need to remove the rows from index 36 onwards.

image023.png

After a few tries I was able to slice the dataframe and cut out the last rows:

tax_year = excel_df[['Unnamed: 1', '2016-17', 'Unnamed: 51', 'Unnamed: 52']]
tax_year = tax_year.dropna()
tax_year = tax_year.drop(tax_year.index[34:])
print(tax_year)

image025.png

Now that we have the data for the tax year, I can start doing analysis on incomes and bookstores. I'm going to drop the population column, as it may not be too useful for the analysis, but I will save the dataframe without the population column in a separate variable, so the original column can quickly be called upon if it is needed later.

While trying to drop the population column I noticed that pandas does not treat it as a column, which makes sense, as the unnamed columns were used. This is mainly due to the design of the sheet: the tax year is in the first row (the header), and the actual column names are a row below.

image027.png

This is why the dataframe looks like this:

image029.png

So I want to move the header one row down, making the dataframe easier to navigate and clearer to read in pandas. Using this answer from Stack Overflow I was able to change the columns:

new_header = tax_year.iloc[0]   # the first row holds the real column names
tax_year = tax_year[1:]         # remove that row from the data
tax_year.columns = new_header   # use it as the header
image031.png
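In hindsight, it looks like this shift could also be done at load time: read_excel has a header parameter that chooses which row to use for the column names. I have not verified it against this file, but something like this should skip the tax-year row:

excel_df = pd.read_excel('income-of-tax-payers.xls', sheet_name=1, header=1)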

And print the column names to double-check:

for col in tax_year.columns:
    print(col)
image033.png

Now I can drop the population column:

tax_year_without_pop = tax_year.drop('Number of Individuals', axis=1)
print(tax_year_without_pop)
image035.png

I plotted the data as a bar chart to get an idea of the data I'm looking at.

import matplotlib.pyplot as plt

tax_year_without_pop.plot(kind='bar', x='Area', y='Mean £')
plt.show()
image037.png

Now I need to match the bookstores in the bookstore dataset to the boroughs they are in. While researching how to find a borough name from a postcode, I found this government website, where you can type in your postcode and it returns the name of the local authority or borough.

image039.png

There is also a dataset which gives a postcode to local district and local authority lookup for the UK.

image041.png

The only issue is that the dataset is very large (a few thousand rows), so filtering or searching for my 50-ish London postcodes may take a while. I am going to use the website to get the name of the borough instead.
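For reference, if I did go the lookup-file route, a pandas merge would probably be the way to do it. A sketch, with a hypothetical filename and the assumption that both files share a 'Postcode' column:

import pandas as pd

df = pd.read_csv('filtered_dataset.csv')              # my filtered bookstore dataset
lookup = pd.read_csv('postcode_lookup.csv')           # hypothetical filename for the lookup dataset
merged = df.merge(lookup, on='Postcode', how='left')  # attach the local authority columns to each bookstore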

As I need to input data into the website to get the result I want, I need some type of automation software. I will be using the Selenium WebDriver to automate the task of typing the postcode into the entry box. After the website returns the results, I can switch to normal scraping with BeautifulSoup.

The code design was based on this website and this Stack Overflow answer.
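Putting the pieces together, the first version of the scraper looked roughly like this. A sketch only: the element locator and the postcode are placeholders for illustration, not necessarily what the page actually uses.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

url = 'https://www.gov.uk/find-local-council'   # the government lookup page

driver = webdriver.Chrome()
driver.get(url)

# type a postcode into the entry box (the locator here is a placeholder)
input_element = driver.find_element_by_name('postcode')
input_element.send_keys('SW1A 1AA')
input_element.send_keys(Keys.ENTER)

# once the results load, hand the page over to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')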

The browser was able to load the website, but the program threw an error before submitting the input.

File "borough_scraper.py", line 14, in <module>
    inputElement.send_keys(Keys.ENTER)
AttributeError: 'NoneType' object has no attribute 'send_keys'

I was able to fix this issue by replacing the enter key with a button click:

find_button = driver.find_element_by_xpath(
    '/html/body/div[6]/div[1]/main/div/form/fieldset/div/button')
find_button.click()
image043.png

But the program scraped the wrong element, as this is what I got:

image045.png

I was able to fix the issue by navigating to the correct div element, then extracting the first paragraph tag, which has the wanted text.

soup = BeautifulSoup(driver.page_source, 'lxml')
result_layer = soup.find('div', class_='unitary-result group')
name_borough = result_layer.p.text
image047.png

Now I need to adapt the code so it can use multiple postcodes from my filtered dataset. I started by importing pandas to read the csv file, then took the postcode column and turned it into a list so it can be used in a loop.

import pandas as pd

df = pd.read_csv('filtered_dataset.csv')
list_postcodes = list(df['Postcode'])

print(list_postcodes)

A for loop is then added around all of the scraping code from earlier, which is indented into the loop:

for postcode in list_postcodes:
    print(df['Bookstore name'])

    driver.implicitly_wait(15)
    print('Waiting ...')
    driver.get(url)

    …..

    print(name_borough)
    names_boroughs.append(name_borough)

driver.quit()

At the bottom of the loop, the name of each borough is appended to a list called names_boroughs, which starts empty.
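In code, that list is simply initialised before the loop starts:

names_boroughs = []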

Afterwards I make a new column in the dataframe using the names_boroughs list, then produce a csv file with the new column:

df['Borough'] = names_boroughs
print(df.head())
df.to_csv('filtered_dataset_boroughs.csv')

After running it for the first time I had to make a few edits. First, it kept printing the whole bookstore name column on every iteration of the loop, because print(df['Bookstore name']) prints the entire column rather than the name for the current row.

image049.png
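A likely fix (a sketch — it assumes the order of list_postcodes matches the dataframe's row order, which it does here since the list came straight from the column) is to carry the loop index and print only the current row:

for i, postcode in enumerate(list_postcodes):
    print(df['Bookstore name'].iloc[i])   # one store name per iteration, not the whole column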

Then I got this error:

image051.png

So I added a sleep (this needs import time at the top of the script) after the value is added to the list:

print(name_borough)
names_boroughs.append(name_borough)

time.sleep(10)
print('Waiting ...')

I ran the program again but got this result:

image053.png

The issue is with the postcode area TN. I checked whether any of the TN postcodes I have are in London and found that none are, so I deleted them from the dataset.
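The same filtering could also be expressed in pandas — a sketch against the same 'Postcode' column:

df = df[~df['Postcode'].str.startswith('TN')]   # drop every row whose postcode starts with 'TN'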

I ran the program again and got the same result later on:

image055.png

I checked the postcode in the dataset that is most likely to be outside of London and near the area the program was searching for, which is KT13 8AB.

I googled the postcode:

image057.png

Then I checked the city that Google gave for the postcode:

image059.png

From the description the bookstore is not in London, so I deleted it from the dataset.

I did this again for another value:

image061.png

Value in the dataset: Walton-on-Thames,"12/13 The Heart Shopping Centre, Walton-on-Thames, KT12 1GH",KT12 1GH

image063.png

And did the same for several other values:

image065.png
image067.png
image069.png
image071.png
image073.png
image075.png

After deleting the non-London values, I got a new dataframe and csv file with the borough names.

image077.png
image079.png

Some bookstores that are not in London still slipped into the dataset. The 4th value in the first image says the borough is Thurrock council, which is not a London borough. This happened because its postcode does not produce the error that the other non-London values did.
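One way to catch these silent misses would be to validate each scraped name against the official borough names. A sketch, with a deliberately incomplete set (the full list of 32 boroughs plus the City of London would be needed, and the scraped strings may need cleaning first):

london_boroughs = {'Camden', 'Hackney', 'Westminster'}   # incomplete, for illustration only
print(df[~df['Borough'].isin(london_boroughs)])          # rows whose scraped borough is not a known London borough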

image081.png

As you can see in this image, the website only produces one result. But for a postcode outside of London:

image083.png

Here the website returns two local bodies, and the class used to extract the name of the borough is not used on the page. BeautifulSoup can't navigate the HTML, as the class does not exist, which is why the program throws an error when such a non-London postcode is passed in.

image085.png

This is the class used for the first result in the image above, which is different from the class the BeautifulSoup code searches for.
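A defensive check would make these cases explicit instead of crashing. A sketch around the existing extraction code:

result_layer = soup.find('div', class_='unitary-result group')
if result_layer is None:
    # the page used a different layout (e.g. two local bodies), so flag the postcode for review
    name_borough = None
else:
    name_borough = result_layer.p.text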

 

Also, the csv file includes the index of the dataframe. This is an easy fix: setting the index parameter of to_csv to False stops it writing the row numbers.

df.to_csv('filtered_dataset_boroughs.csv', index=False)





image087.png