Bookstores vs income part 6

See here for part1, part2, part3 part4, part5.

Now we need to prepare the data for the analysis. For the linear regression, the question will ask if there is a correlation between the number of bookstores in a borough and its income. So, to do this I need to make one more dataset containing the name of boroughs and the number of bookstores it has. I typed a few lines of code that should count the borough values in the borough column.

bookstore_df = pd.read_csv('filtered_dataset_boroughs.csv')
count = bookstore_df['Borough'].value_counts()
print(count)

As we can we pandas gave the count of the bookstores in the boroughs. Some boroughs are not in the count because of the bookstore is not in a specific borough. For example, Southwark and Lambeth do not have bookstores so they are not included in the count. I will export the count into a CSV file then make edits by adding the boroughs that do not appear in the count. Got this result:

This means I did something wrong as the name of boroughs does not appear. I fixed the issue by setting the index of the to_csv to true.

I added the boroughs that did not have any of the bookstores. And deleted the ‘London Borough of’ and ‘Royal London Borough of’ phrases. Which is easier than typing of the prefix into the recently added boroughs. And should make the data easier to read.

Now we just need to attach the income to the dataset. So I can add the correct income to the boroughs, I will sort both of the datasets by alphabetical order.

count_df = pd.read_csv('count_borough.csv')
count_df = count_df.sort_values('Borough')
tax_year_without_pop = tax_year_without_pop.sort_values('Area')
print(count_df.head())
print(tax_year_without_pop.head())

I will be using the mean income so I will drop the median column.

Now can take the mean column and add to the count dataframe.

mean_income = tax_year_without_pop['Median £']
count_df['Income'] = mean_income

print(count_df.head())

As we can see there is something wrong with the income column. There are missing values. And the income is in the wrong rows. I later fixed the issue by resetting the indexes of mean income series and the count dataframe. The reason why resetting fixes the issue that when the boroughs were sorted by alphabetical order the index was moved around. When I tried to add the mean income column into the dataframe.

Pandas matched the values to the index. But this was not needed as the income was already sorted by alphabetical order. So resetting the index of the mean income column allowed the values to stay in place while changing the index to normal ascending order. And changing the count dataframe did the same thing by keeping the values in place. So when I added the column into the dataframe the values able to match correctly without pandas moving the values around because of the index.

mean_income = tax_year_without_pop['Mean £']
mean_income = mean_income.reset_index(drop=True)
print(mean_income)
count_df = count_df.reset_index(drop=True)
count_df['Income'] = mean_income

Now we can start making a scatter plot to see how the data looks. By using pandas built in matplotlib functions I’m able to plot the data.

ax1 = count_df.plot.scatter(x='Count', y='Income')
plt.show()

As we can see most of the data point are clustered around the 0-1 bookstores area. There is some meaning we can take from the graph but I will swap the axes around to see if the graph look slightly better.

The axes swapped around:

Swapping the axes around does change the graph that much.

I went back to the normal axes then added a zero to the y-axis.

I then plotted a line though the data to see the correlation in the data.

ax1 = count_df.plot.scatter(x='Count', y='Income')
ax1 = ax1.set_ylim(bottom=0)


x = count_df['Count'].astype('float')
y = count_df['Income'].astype('float')

b, m = polyfit(x, y, 1)
plt.plot(x, y, '.')
plt.plot(x, b + m * x, '-')

plt.show()

https://stackoverflow.com/a/19069001

As we can see there is a positive a correlation between the number of bookstores and income. I will next delete the outliners to see if it changes the appearance of the graph.

count_df['Income'] = count_df['Income'][count_df['Income'].between(
    count_df['Income'].quantile(.15), count_df['Income'].quantile(.85))]

count_df = count_df.dropna()
print(count_df)

https://nextjournal.com/schmudde/how-to-remove-outliers-in-data

When removing the outliers we can see the income range is reduced. And the bookstore count. As the bookstore count reduced from 6 to 3. And the income range of the y-axis from 160000 to 60000.

I am going to bring the outliers back to annotate and add colours to the data points. To make it clearer. I’m going to colour the data points based on the location of the bookstore. North, south, east, west, and central London.

To do that I exported the count dataframe. Then created a new column where the category will be. I added the compass location of the bookstore.

The I grouped the data frame by the labels.

compass_df = pd.read_csv('Borough_income_count.csv')
print(compass_df)
groups = compass_df.groupby('Compass')

for name, group in groups:
    plt.plot(group["Count"], group["Income"], marker="o", linestyle="", label=name)

plt.legend()

https://www.kite.com/python/answers/how-to-color-a-scatter-plot-by-category-using-matplotlib-in-python

Now we can see the results. One pattern I see is the East London boroughs do not have more than one bookstore.

I will next annotate the outliers. While working on getting the annotations able was able to annotate all of the data points because I was still struggling to annotate a few data points.

for i, txt in enumerate(compass_df['Borough']):
     ax.annotate(txt, (compass_df['Count'].iloc[i], compass_df['Income'].iloc[i]))

As we can see the annotations make the graph look messy. As much of the text bunched together. I was able to later work out how label one data point. Using the code from above.

ax.annotate(compass_df['Borough'].iloc[1],
            (compass_df['Count'].iloc[1], compass_df['Income'].iloc[1]))

The compass_df['Borough'].iloc[1] is the name of the borough. Taken from the borough series. The iloc selects the second value(index starts at 0) in the series. The borough name is returned as text as this piece of code is located where it returns text.

compass_df['Count'].iloc[1], compass_df['Income'].iloc[1]

Similar to the first, compass_df['Count'].iloc[1] and compass_df['Income'].iloc[1] selects the integer 1 in their series. The count series is first because it is on the x-axis. And the annotate function has the order x-axis first and y-axis second.

Now I know how to annotate the data points, I can now annotate the outliers which I originally wanted to.

Bookstores vs income part 6

Bookstore vs Income Part 7

Bookstores vs income part 5