Bookstores vs income part 4
Now that the Town column is no longer needed, I want to drop it. The pandas drop function should be able to remove it from the dataframe.
dataset_df = dataset_df.drop('Town', axis=1)
I was able to get the results I wanted:
Now I can start filtering the dataset based on the list of postcodes I created earlier. I first tried using the pandas isin function, but all the values came back False or not available.
filtered_dataset_df = dataset_df.isin(Postcodes)
filtered_dataset_df = dataset_df[dataset_df.isin(Postcodes)]
Then I noticed that the isin function probably won’t work, because it checks for the value itself in both the list and the column. It won’t find anything, because the list of postcodes contains only the starting letters of a full postcode. Because of this, I will probably need to use regex again, so it can match the postcode prefixes from the list against the column.
A more visual example of the problem. These are the strings pandas is looking for:
Postcodes = ['EN', 'IG', 'RM', 'DA', 'BR', 'TN', 'CR', 'SM', 'KT', 'TW', 'HA', 'UB', 'E', 'EC', 'N', 'NW', 'SE', 'SW', 'W', 'WC']
But pandas is going through data that looks like this:
PR1 3NN
TW9 1TN -> Pandas is not going to get this as ‘TW’ is not a separate string.
SY1 1SJ
EH15 3RD
CV37 6AU
S1 2FB
DD1 3DG
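To make the mismatch concrete, here is a minimal sketch with a made-up two-row dataframe:
import pandas as pd

Postcodes = ['EN', 'IG', 'TW']  # prefixes only, shortened for the example
df = pd.DataFrame({'Postcode': ['TW9 1TN', 'PR1 3NN']})
# isin tests whole-value equality, so 'TW9 1TN' is never equal to 'TW'
print(df['Postcode'].isin(Postcodes))  # both rows come back False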
So, I need to develop a new regex pattern that can match the letters from the list. While working on a pattern to match the first two letters of the postcode, I developed this draft:
[EN]{2}[0-9]|[IG]{2}[0-9]|[RM]{2}[0-9]|[DA]{2}[0-9]|[BR]{2}[0-9]|[TN]{2}[0-9]|[CR]{2}[0-9]|[SM]{2}[0-9]|[KT]{2}[0-9]|[TW]{2}[0-9]|[HA]{2}[0-9]|[UB]{2}[0-9]
This is simply the first two letters of the greater London postcodes, with a quantifier asking for exactly two letters, and also a number character set to match a digit after the first two letters of the postcode. This stops it from matching the second half of postcodes, which may contain the same letters.
This is what it matched while testing:
It got the matches I want, but also some incorrect ones, like ‘DD1’. I did further tests with EE1 and DD2. The problem stems from the character sets allowing duplicates of the characters they contain. For example, [DA]{2}[0-9] allows DD2 because the ‘D’ character is in both the string and the character set and meets the quantifier of {2}. Also, the number ‘2’ is the third character and meets the [0-9] character set. Later on, I was able to fix the issue by changing the regex pattern.
[E]N[0-9]|[I]G[0-9]|[R]M[0-9]|[D]A[0-9]|[B]R[0-9]|[T]N[0-9]|[C]R[0-9]|[S]M[0-9]|[K]T[0-9]|[T]W[0-9]|[H]A[0-9]|[U]B[0-9]
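A quick re-based sanity check of the change, using toy strings:
import re

old = r'[DA]{2}[0-9]'  # the set allows repeats, so 'DD' satisfies {2}
new = r'[D]A[0-9]'     # one letter from the set, then a literal 'A'

print(re.findall(old, 'DD2 4AB'))  # ['DD2'] - the false positive
print(re.findall(new, 'DD2 4AB'))  # [] - no longer matches
print(re.findall(new, 'DA1 4AB'))  # ['DA1'] - real DA postcodes still match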
I reduced each character set to just the first letter of the postcode. This means the pattern can only match the first character once before moving on to the literal second letter, which stops the duplicate-character matches from earlier. I was then able to develop a regex pattern for the second half of the original pattern, focusing on inner London.
[^A-Z]E[0-9]|EC[0-9]|[^A-Z]N[0-9]|NW[0-9]|SE[0-9]|SW[0-9]|[^A-Z]W[0-9]|WC[0-9]
Like the first pattern, but I added negated character sets next to the postcode areas that only have one letter (E, N and W), so they do not match inside other postcodes that contain the same letters. For example, without a negated set the pattern matched the ‘W9’ inside TW9. This is not what I wanted, as TW9 is already covered separately by the TW match. Before negation:
After negation:
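The same effect, checked quickly with the re module on a toy string:
import re

print(re.findall(r'W[0-9]', 'TW9 1TN'))        # ['W9'] - false positive inside TW9
print(re.findall(r'[^A-Z]W[0-9]', 'TW9 1TN'))  # [] - the negated set blocks it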
I combined both of the regex patterns into one:
[^A-Z]E[0-9]|EC[0-9]|[^A-Z]N[0-9]|NW[0-9]|SE[0-9]|SW[0-9]|[^A-Z]W[0-9]|WC[0-9]|[E]N[0-9]|[I]G[0-9]|[R]M[0-9]|[D]A[0-9]|[B]R[0-9]|[T]N[0-9]|[C]R[0-9]|[S]M[0-9]|[K]T[0-9]|[T]W[0-9]|[H]A[0-9]|[U]B[0-9]
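For the next steps, the combined pattern is stored in a variable, roughly like this:
London_postcode_regex = (
    r'[^A-Z]E[0-9]|EC[0-9]|[^A-Z]N[0-9]|NW[0-9]|SE[0-9]|SW[0-9]|[^A-Z]W[0-9]|WC[0-9]|'
    r'[E]N[0-9]|[I]G[0-9]|[R]M[0-9]|[D]A[0-9]|[B]R[0-9]|[T]N[0-9]|[C]R[0-9]|'
    r'[S]M[0-9]|[K]T[0-9]|[T]W[0-9]|[H]A[0-9]|[U]B[0-9]'
)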
I then created another variable that uses this pattern to see if there are any matches in the column.
filtered_column = dataset_df.Postcode.str.contains(London_postcode_regex)
As we can see, the function works, as most of the values return True. But I want the actual rows back, not Boolean values.
So, to do that, I created another line which filters the dataframe on the Postcode column using the contains function, returning all of the columns. I used this page to help me create the line of code.
dataset_df = dataset_df[dataset_df['Postcode'].str.contains(London_postcode_regex) == True]
print(dataset_df.head(10))
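As a side note, the == True comparison also turns any NaN results from str.contains into False; an equivalent, slightly more explicit spelling would be:
# na=False makes rows with a missing postcode count as non-matches
mask = dataset_df['Postcode'].str.contains(London_postcode_regex, na=False)
dataset_df = dataset_df[mask]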
As we can see, the filtered dataset has mostly London postcodes. (TN31 1PJ is not a London postcode, but the regex passes it through because of TN31. I tried to adjust the regex to stop it from matching, but without success.) To see the whole dataset, I think it’s a good idea to export the filtered dataset. To do this, I just need to use the df.to_csv function.
dataset_df.to_csv('filtered_dataset.csv', index=False)
Now we can see the filtered dataset. I think it did a good job. But when looking at both datasets, I noticed that some bookstores slipped through and were not captured by the filter. I did my own manual regex search on the dataset and found these:
Searching using E[0-9]:
Searching using N[0-9]:
Searching using W[0-9]:
The pattern among these uncaptured postcodes is that they are all single-letter postcode areas. My guess is that the negated character sets affected the search for these postcodes, and the combination of the inner and outer London patterns may also have played a part. To test this hypothesis, I extracted these values and put them in a separate file, so I could manually check with regex whether they match.
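The same manual check can also be scripted with the re module; a small sketch (the postcode shown is a hypothetical example, not one from the extracted file):
import re

with_negation = r'[^A-Z]E[0-9]'
without_negation = r'E[0-9]'

# Compare both variants against an extracted postcode
for postcode in ['E1 6AN']:  # hypothetical example value
    print(postcode,
          bool(re.search(without_negation, postcode)),
          bool(re.search(with_negation, postcode)))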
Searching without negation:
Searching with negation:
Adding negation does not seem to affect the matching of the postcode.
I changed the pandas function from contains to match to see if that changes anything.
dataset_df = dataset_df[dataset_df['Postcode'].str.match(London_postcode_regex) == True]
Looking at the dataset, I can see that it did not make a difference, mainly judging by the number of values in the dataset.
Using str.contains:
Using str.match:
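For reference, the difference between the two functions on a toy string:
import pandas as pd

s = pd.Series(['TW9 1TN'])
print(s.str.contains(r'W[0-9]'))  # True  - the pattern can occur anywhere in the string
print(s.str.match(r'W[0-9]'))     # False - the pattern must match from the first character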
As the unfiltered values all contain London in their bookstore name, I think I should filter based on the bookstore name containing “London -”; a small regex pattern may be needed. I can also search the greater London postcodes separately, as those bookstore names do not contain “London -”. Both of these separate searches can then be combined into one filtered dataset, holding the values from both the bookstore-name search and the greater London postcode search.
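The two patterns used below look roughly like this; the greater London one reuses the outer-London pattern from earlier, and the name pattern just looks for “London -”:
# Rough sketch of the two patterns (the exact assignments are not shown above)
Greater_london_postcode_regex = (
    r'[E]N[0-9]|[I]G[0-9]|[R]M[0-9]|[D]A[0-9]|[B]R[0-9]|[T]N[0-9]|'
    r'[C]R[0-9]|[S]M[0-9]|[K]T[0-9]|[T]W[0-9]|[H]A[0-9]|[U]B[0-9]'
)
London_bookstore_name_regex = r'London -'  # any name containing 'London -'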
filtered_greater_london = dataset_df.Postcode.str.contains(Greater_london_postcode_regex)
filtered_inner_london = dataset_df['Bookstore name'].str.contains(London_bookstore_name_regex)
dataset_df['Greater_london'] = filtered_greater_london
dataset_df['Inner_london'] = filtered_inner_london
dataset_df = dataset_df.drop(' Street_name', axis=1)
print(dataset_df.head(10))
I dropped the street name column so the printed results are neater. Looking at the preview of the first 10 rows of the dataset, I can say it went well, as pandas correctly spotted which stores are in inner London and which are in greater London. Now I just need to take the rows flagged in these columns and put them into a separate file. I tried this code to convert the Boolean values back into the actual string values (postcode, bookstore name):
filtered_dataset = dataset_df[dataset_df[['Inner_london', 'Greater_london']] == True]
print(filtered_dataset.head(10))
But I got this instead:
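Indexing with a dataframe-shaped mask does not drop rows; it keeps the original shape and fills the non-matching cells with NaN. A minimal illustration with a made-up frame:
import pandas as pd

df = pd.DataFrame({'Inner_london': [True, False], 'Greater_london': [False, True]})
# A dataframe-shaped mask keeps every row and fills non-matches with NaN
print(df[df[['Inner_london', 'Greater_london']] == True])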
I was able to convert the Boolean values of greater_london and inner_london into the actual rows by applying Boolean indexing to the columns directly.
non_bool_filtered_greater_london = dataset_df[
    dataset_df.Postcode.str.contains(Greater_london_postcode_regex) == True]
non_bool_filtered_inner_london = dataset_df[
    dataset_df['Bookstore name'].str.contains(London_bookstore_name_regex) == True]
print(non_bool_filtered_greater_london.head(10))
print(non_bool_filtered_inner_london.head(10))
Next, I combined the two dataframes using the pandas concat() function.
frames = [non_bool_filtered_greater_london, non_bool_filtered_inner_london]
result = pd.concat(frames)
result = result.drop(' Street_name', axis=1)
print(result)
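One caveat with this approach: if a store happened to pass both filters, concat would include it twice. A drop_duplicates call would guard against that:
result = result.drop_duplicates()  # guard against stores caught by both filters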
This picture is a smaller preview of the print statement’s output.
Now I used df.to_csv to make a new dataset from this dataframe. I also commented out the drop of the street name column, so it stays in the csv file.
result.to_csv('filtered_dataset.csv', index=False)
As we can see, new values that were not there before have now been added into the dataset: