Bookstores vs income part 3
I was able to get quite a few results. But there's more to this than meets the eye. There are lots of places in London that do not have ‘London’ in their town address or a London postcode.
I will show an example from the dataset to illustrate the point:
Looking at these two examples, you would not think that the second address is part of London, as it does not have ‘London’ anywhere in its address or a London postcode (N, NW, SW, SE, W, WC, E and EC). But if we google where Romford is, Wikipedia gives us this answer:
Romford is a large town in east London and the administrative centre of the London Borough of Havering. -Wikipedia
So Romford is part of London, with little indication in the address itself. This is mainly because the London postal district does not cover all of London (mainly Greater London).
This picture from Wikipedia highlights the issue:
Knowing this, we can't use a simple find and search with the string ‘London’. We need to combine the Greater London postcode prefixes with the normal London ones to capture all the London postcodes in the dataset.
First I want to make a list containing all the London postcode prefixes (Greater and Inner), which can later be used to filter down the dataset.
This is the list of prefixes, each stored as a string.
Postcodes = ['EN', 'IG', 'RM', 'DA', 'BR', 'TN', 'CR', 'SM', 'KT',
'TW', 'HA', 'UB', 'E', 'EC', 'N', 'NW', 'SE', 'SW', 'W', 'WC']
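As a rough sketch of how this list will eventually be used (assuming a clean Postcode column, which only exists later in this post), the letter prefix of each postcode can be pulled out and checked against the list:

# Hypothetical usage: str.extract returns a DataFrame, so [0] selects
# the single captured group (the leading letters) as a Series
prefixes = dataset_df['Postcode'].str.extract(r'^([A-Z]{1,2})')[0]
# Keep only the rows whose prefix appears in the London list
london_df = dataset_df[prefixes.isin(Postcodes)]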
Next I import pandas so I can start filtering the data. Using pandas, I parsed the dataset and printed the first few rows.
import pandas as pd

dataset_df = pd.read_csv('dataset.csv')
print(dataset_df.head())
At first I thought the dataset was fine, but I noticed that the bookstore names appeared in place of the index column. So I tried to access some of the columns by name to print out their values and data types.
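The sort of lookup I tried looked something like this (with the column names I expected the file to have):

# Both of these raised a KeyError
print(dataset_df['Town'])
print(dataset_df['Bookstore name'].dtype)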
I got errors like KeyError: 'Town' and KeyError: 'Bookstore name'.
I set the skiprows parameter to 1. Pandas then produced an index column, but it used the first data entry as the column names.
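For reference, the call I tried looked something like this:

# Skip the broken header line and let pandas infer new column names
dataset_df = pd.read_csv('dataset.csv', skiprows=1)
print(dataset_df.head())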
I also tried a version of the dataset with the quotes removed, using my code editor's find and replace. But pandas could not parse it, as the extra commas inside the street addresses added extra columns to those rows.
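A made-up row shows the problem: without the quotes, the comma inside a street address splits it into an extra field.

import csv

# With quotes around the address, the embedded comma is preserved: 4 fields
print(next(csv.reader(['Foyles,"107 Charing Cross Road, London",London,WC2H 0DT'])))
# Without quotes, the same row parses into 5 fields
print(next(csv.reader(['Foyles,107 Charing Cross Road, London,London,WC2H 0DT'])))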
Then I was able to print out the data types inside the DataFrame (via dataset_df.dtypes). I got this result:
Looking at this, it seems pandas counts all the column names as one batch of data.
Looking back at my scraper code, I noticed the issue lay in how I was setting up the column names.
dataset_writer.writerow(['Bookstore name, Street_name, Town, Postcode'])
As we can see, all of the column names count as one string, which explains why they were quoted in the CSV file.
After changing the column names into individual strings, I ran the scraper again and fed the new file into pandas. Using the head function, I printed the first 10 rows.
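For reference, the fixed line in the scraper, and the head call with 10 rows:

# Each column name is now its own string
dataset_writer.writerow(['Bookstore name', 'Street_name', 'Town', 'Postcode'])

# After re-running the scraper and reloading the CSV
dataset_df = pd.read_csv('dataset.csv')
print(dataset_df.head(10))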
As we can see, pandas recognises the rest of the column names, and an index column also appears in the result. But the Town and Postcode columns are empty. This is because the quotes around the address make pandas count it as one whole block, like the column name issue from earlier.
I figured that if I could find a way of dropping everything in the street address except the postcode, I could move on to the next step, since I will be using the postcodes to filter the dataset.
What I can do is use pandas and a regex to extract the postcodes from the addresses and put them in a separate column.
From this Stack Overflow answer I found a pattern that I can use to extract the postcodes from my DataFrame.
I stored the regex in a variable:
Postcode_regex = '[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}'
The regex is a bit too long to be used directly in a pandas function without looking messy.
Below is a breakdown of how the regex pattern works (a quick test of the full pattern follows the breakdown). The highlighted areas correspond to the parts of the postcode each section of the regex is dealing with.
• [A-Z]{1,2}: The first section of the pattern checks that the postcode starts with letters, which is why the character set covers the alphabet [A-Z]. The quantifier {1,2} matches the character set only once or twice, so only 1 or 2 characters are allowed.
• [0-9]: The second section matches the first number in the postcode.
• [0-9A-Z]?: The third section matches a character that is either a number or a letter, which is why the character set contains 0 to 9 and A to Z. The question mark makes the character set optional, because a few postcodes don’t have a fourth character, for example EC1 6KH or PA2 6NH.
• ○: This is just a placeholder symbolising the space character, which is needed so the pattern can move on to the second half of the postcode.
• [0-9]: Matches a number, the same as the second section.
• [A-Z]{2}: This section matches letters with a length of two characters. It is similar to the first section, but it must be exactly two letters in a row or it won’t match.
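As a quick sanity check before touching the DataFrame, the pattern can be tried on a couple of made-up addresses with Python's re module:

import re

samples = ['Example Books, 12 Market Place, Romford RM1 3AB',
           'Another Bookshop, 107 Charing Cross Road, London WC2H 0DT']
for address in samples:
    # findall returns every substring matching the postcode pattern
    print(re.findall(Postcode_regex, address))
# ['RM1 3AB']
# ['WC2H 0DT']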
To extract the postcodes, I used the pandas str.extract function. I borrowed the code from this page.
dataset_df[' Street_name'] = dataset_df['Postcode'].str.extract(Postcode_regex)
I got this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas.
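A quick way to debug this kind of error is to check the dtype of the column the .str accessor is being called on; an all-NaN column comes back as float64 rather than object:

print(dataset_df['Postcode'].dtype)  # float64 here, since the column was empty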
I later found out I had the columns the wrong way round: the name on the left defines the new column, not the column I'm extracting from. So I changed it, as you can see below.
dataset_df['Postcode'] = dataset_df[' Street_name'].str.extract(Postcode_regex)
The code should move the extracted postcodes into a new column. One more error popped up when running it: ValueError: pattern contains no capture groups. This was easily fixed by adding parentheses to my regex pattern string, like so:
Postcode_regex = '([A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2})'
Running the code after I fixed the issue, I was able to get successful results:
Now, as the Town column is not needed, I want to drop it. The pandas drop function should remove it from the DataFrame.
dataset_df = dataset_df.drop('Town', axis=1)
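As a side note, newer versions of pandas also accept a columns keyword, which does the same thing and reads slightly more clearly:

dataset_df = dataset_df.drop(columns='Town')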
I was able to get the results I wanted: