Bookstores vs income part 5

Click here for part 1, part 2, part 3, part 4

Now that I have the dataset of London bookstores complete, I can move on to getting the dataset of the incomes of the London boroughs. In the first article I said there are two datasets that I may use:

“One by HM Revenue & Customs which gives the average income of taxpayers by borough, based on survey. Another dataset by the Office of National Statistics, which gives the gross earnings by residence, sorted out by borough.” – part 1

This is what the HM Revenue & Customs dataset looks like:

image001.png

These incomes are reported by tax year.

This is what the Office for National Statistics dataset looks like:

image003.png

These earnings are reported as weekly or hourly figures.

I am going to use the first dataset, provided by HMRC, as it looks more usable and clearer: annual income figures can be more easily compared with other values.

I used pandas to open the Excel file:

import pandas as pd

excel_df = pd.read_excel('income-of-tax-payers.xls')
print(excel_df.head(10))

Got this result:

image005.png

This is the wrong sheet being read: the first sheet is the metadata of the file, giving general information about the dataset, while the data is in the second sheet. This is something I did not think about, as I usually deal with csv files, which only have one sheet. So, I need to tell pandas to read the second sheet.

excel_df = pd.read_excel('income-of-tax-payers.xls', sheet_name=1)
print(excel_df.head(10))
image007.png

I was able to load the correct sheet, but the information shown is not very useful, as it gives the incomes for 1999–2000. I want the latest tax year, so the data is as up to date as possible when I use it for analysis. First, I need to get the names of the columns, so I can find the column I want to use.

for col in excel_df.columns:
    print(col)
image009.png
image011.png

So when selecting columns in pandas I need to select the unnamed columns along with the column that has the tax year as its name.

tax_year = excel_df[['2016-17', 'Unnamed: 49', 'Unnamed: 50']]
print(tax_year)
image013.png

The columns were correctly selected, but scrolling down a few rows I noticed there are rows that are not needed, as they are not related to London.

image015.png

First, I tried using negative slices, but it kept the unwanted rows instead of removing them (drop removes the rows you pass it, and index[:-17] selects everything except the last 17 rows, so this call deleted all the data and kept only the tail I wanted gone):

tax_year = tax_year.drop(tax_year.index[:-17])

image017.png

Then I tried a solution recommended on Stack Overflow:

tax_year = tax_year.drop(tax_year.tail(10).index, inplace=True)

but I got this warning:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)

None
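The None at the end is the real clue: drop with inplace=True modifies the dataframe and returns None, so assigning the result back to tax_year wipes out the dataframe (the SettingWithCopyWarning is a separate matter: tax_year is a slice of excel_df, so pandas warns about modifying it). A minimal sketch of the two correct patterns (use one or the other, never both):

# mutate the dataframe in place, with no assignment:
tax_year.drop(tax_year.tail(10).index, inplace=True)

# or assign the returned copy, with no inplace:
tax_year = tax_year.drop(tax_year.tail(10).index)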

So I went back to trying negative slicing, but I got this:

TypeError: Expected tuple, got str

I updated pandas, as I read on this website that this error means my pandas is out of date. But that only led to more errors: I had to change the names of some of the columns, as the newer pandas parsed the file differently.

tax_year = excel_df[['Unnamed: 1', '2016-17', 'Unnamed: 51', 'Unnamed: 52']]

I added 'Unnamed: 1' for the names of the areas, as 'Area' does not work as a column name.

image019.png

I added a dropna() call to remove any missing values.
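In code that is a single call on the dataframe (the same line appears in the full snippet further down):

tax_year = tax_year.dropna()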

Now there are no NA values in the data section:

image021.png

But this is still the section I want to remove.

When looking at the dataframe I noticed that the names of the boroughs stop at index 36, so I just need to remove the rows from index 36 onwards.

image023.png

After a few tries I was able to slice the dataframe and cut out the last rows:

tax_year = excel_df[['Unnamed: 1', '2016-17', 'Unnamed: 51', 'Unnamed: 52']]
tax_year = tax_year.dropna()
tax_year = tax_year.drop(tax_year.index[34:])
print(tax_year)

image025.png

Now that we have the data for the tax year, I can start doing analysis on incomes and bookstores. I'm going to drop the population column, as it may not be too useful for the analysis, but I will save the dataframe without the population column in a separate variable, so the original column can quickly be called upon if it is needed later.

While trying to drop the population column I noticed that pandas does not treat it as a column, which makes sense, as the unnamed columns were used. This is mainly due to the design of the sheet: the tax year is in the first row (the header), and the actual column names are a row below.

image027.png

This is why the dataframe looks like this:

image029.png

So I want to move the header one row down, making the dataframe easier to navigate and clearer to read in pandas. Using this answer from Stack Overflow I was able to change the columns:

new_header = tax_year.iloc[0]   # the first row holds the real column names
tax_year = tax_year[1:]         # remove that row from the data
tax_year.columns = new_header   # use it as the header
image031.png
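In hindsight, it looks like this shift could also be done at load time: read_excel has a header parameter that chooses which row to use for the column names. I have not verified it against this file, but something like this should skip the tax-year row:

excel_df = pd.read_excel('income-of-tax-payers.xls', sheet_name=1, header=1)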

And print the column names to double-check:

for col in tax_year.columns:
    print(col)
image033.png

Now I can drop the population column:

tax_year_without_pop = tax_year.drop('Number of Individuals', axis=1)
print(tax_year_without_pop)
image035.png

I plotted the data as a bar chart to get an idea of the data I'm looking at.

import matplotlib.pyplot as plt

tax_year_without_pop.plot(kind='bar', x='Area', y='Mean £')
plt.show()
image037.png

Now I need to match the bookstores in the bookstore dataset to the boroughs they are in. While researching how to find a borough name from a postcode, I found this government website, where you can type in your postcode and it returns the name of the local authority or borough.

image039.png

There is also a dataset which gives a postcode to local district and local authority lookup for the UK.

image041.png

The only issue is that the dataset is very large (a few thousand rows), so filtering or searching for my 50-ish London postcodes may take a while. I am going to use the website to get the name of the borough instead.
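For reference, if I did go the lookup-file route, a pandas merge would probably be the way to do it. A sketch, with a hypothetical filename and the assumption that both files share a 'Postcode' column:

import pandas as pd

df = pd.read_csv('filtered_dataset.csv')              # my filtered bookstore dataset
lookup = pd.read_csv('postcode_lookup.csv')           # hypothetical filename for the lookup dataset
merged = df.merge(lookup, on='Postcode', how='left')  # attach the local authority columns to each bookstore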

As I need to input data into the website to get the result I want, I need some type of automation software. I will be using the Selenium WebDriver to automate the task of typing the postcode into the entry box. After the website returns the results, I can switch to normal scraping with BeautifulSoup.

The code design was based on this website and this Stack Overflow answer.
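Putting the pieces together, the first version of the scraper looked roughly like this. A sketch only: the element locator and the postcode are placeholders for illustration, not necessarily what the page actually uses.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

url = 'https://www.gov.uk/find-local-council'   # the government lookup page

driver = webdriver.Chrome()
driver.get(url)

# type a postcode into the entry box (the locator here is a placeholder)
input_element = driver.find_element_by_name('postcode')
input_element.send_keys('SW1A 1AA')
input_element.send_keys(Keys.ENTER)

# once the results load, hand the page over to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')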

The browser was able to load the website, but the program threw an error before submitting the input.

File "borough_scraper.py", line 14, in <module>
    inputElement.send_keys(Keys.ENTER)
AttributeError: 'NoneType' object has no attribute 'send_keys'

I was able to fix this issue by replacing the enter key with a button click:

find_button = driver.find_element_by_xpath(
    '/html/body/div[6]/div[1]/main/div/form/fieldset/div/button')
find_button.click()
image043.png

But the program scraped the wrong element, as this is what I got:

image045.png

I was able to fix the issue by navigating to the correct div element, then extracting the first paragraph tag, which has the wanted text.

soup = BeautifulSoup(driver.page_source, 'lxml')
result_layer = soup.find('div', class_='unitary-result group')
name_borough = result_layer.p.text
image047.png

Now I need to adapt the code so it can use multiple postcodes from my filtered dataset. I started by importing pandas to read the csv file, then took the postcode column and turned it into a list so it can be used in a loop.

import pandas as pd

df = pd.read_csv('filtered_dataset.csv')
list_postcodes = list(df['Postcode'])

print(list_postcodes)

A for loop is then added around all of the scraping code from earlier, which is indented into the loop:

for postcode in list_postcodes:
    print(df['Bookstore name'])

    driver.implicitly_wait(15)
    print('Waiting ...')
    driver.get(url)

    …..

    print(name_borough)
    names_boroughs.append(name_borough)

driver.quit()

At the bottom of the loop, the name of each borough is appended to a list called names_boroughs, which starts empty.
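In code, that list is simply initialised before the loop starts:

names_boroughs = []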

Afterwards I make a new column in the dataframe using the names_boroughs list, then produce a csv file with the new column:

df['Borough'] = names_boroughs
print(df.head())
df.to_csv('filtered_dataset_boroughs.csv')

After running it for the first time I had to make a few edits. First, it kept printing the whole bookstore name column on every iteration of the loop, because print(df['Bookstore name']) prints the entire column rather than the name for the current row.

image049.png
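A likely fix (a sketch — it assumes the order of list_postcodes matches the dataframe's row order, which it does here since the list came straight from the column) is to carry the loop index and print only the current row:

for i, postcode in enumerate(list_postcodes):
    print(df['Bookstore name'].iloc[i])   # one store name per iteration, not the whole column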

Then I got this error:

image051.png

So I added a sleep (this needs import time at the top of the script) after the value is added to the list:

print(name_borough)
names_boroughs.append(name_borough)

time.sleep(10)
print('Waiting ...')

I ran the program again but got this result:

image053.png

The issue is with the postcode area TN. I checked whether any of the TN postcodes I have are in London and found that none are, so I deleted them from the dataset.
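The same filtering could also be expressed in pandas — a sketch against the same 'Postcode' column:

df = df[~df['Postcode'].str.startswith('TN')]   # drop every row whose postcode starts with 'TN'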

I ran the program again and got the same result later on:

image055.png

I checked the postcode in the dataset that is most likely to be outside of London and near the area the program was searching for, which is KT13 8AB.

I googled the postcode:

image057.png

Then I checked the city that Google gave for the postcode:

image059.png

From the description the bookstore is not in London, so I deleted it from the dataset.

I did this again for another value:

image061.png

Value in the dataset: Walton-on-Thames,"12/13 The Heart Shopping Centre, Walton-on-Thames, KT12 1GH",KT12 1GH

image063.png

And did the same for several other values:

image065.png
image067.png
image069.png
image071.png
image073.png
image075.png

After deleting the non-London values, I got a new dataframe and csv file with the borough names.

image077.png
image079.png

Some bookstores that are not in London still slipped into the dataset. The 4th value in the first image says the borough is Thurrock council, which is not a London borough. This happened because its postcode does not produce the error that the other non-London values did.
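One way to catch these silent misses would be to validate each scraped name against the official borough names. A sketch, with a deliberately incomplete set (the full list of 32 boroughs plus the City of London would be needed, and the scraped strings may need cleaning first):

london_boroughs = {'Camden', 'Hackney', 'Westminster'}   # incomplete, for illustration only
print(df[~df['Borough'].isin(london_boroughs)])          # rows whose scraped borough is not a known London borough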

image081.png

As you can see in this image, the website only produces one result. But for a postcode outside of London:

image083.png

Here the website returns two local bodies, and the class used to extract the name of the borough is not used on the page. BeautifulSoup can't navigate the HTML, as the class does not exist, which is why the program throws an error when such a non-London postcode is passed in.

image085.png

This is the class used for the first result in the image above, which is different from the class the BeautifulSoup code searches for.
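A defensive check would make these cases explicit instead of crashing. A sketch around the existing extraction code:

result_layer = soup.find('div', class_='unitary-result group')
if result_layer is None:
    # the page used a different layout (e.g. two local bodies), so flag the postcode for review
    name_borough = None
else:
    name_borough = result_layer.p.text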

 

Also, the csv file includes the index of the dataframe. This is an easy fix: setting the index parameter of to_csv to False stops it writing the row numbers.

df.to_csv('filtered_dataset_boroughs.csv', index=False)





image087.png