Predicting Flooding with Python

Getting Rainfall Data and Cleaning

For this project, I made a model that estimates the long-term flooding risk of an area. It ties together climate change and machine learning, which I have been writing about a lot recently. The idea was to predict whether an area has a higher risk of flooding in 10 years. The general plan was to get rainfall data, work out whether the rainfall exceeded the land elevation, and count the area as flooded if it did.

To get started I had to find rainfall data. Luckily, that was not too hard, but the question was which rainfall data to use. First I found the national (UK) rainfall data, which looked very helpful, but as the analysis would be done on a geographic basis, I decided to use London rainfall data. When I got the rainfall data it looked like this:

image001.png

Some of the columns gave information about soil moisture, which was not relevant to the project, so I had to get rid of them. Also, as this was a geographic analysis, I picked the column closest to the general location I wanted to map: Lower Lee rainfall, since I would be analysing East London.

To do the data wrangling I used pandas, no surprise there. To start, I had to get rid of the first row of the dataframe, as it acts as a second header. This makes sense, as the data was meant for an Excel spreadsheet.

I used this to get rid of the first row:

df = df[1:]
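An equivalent way, if you are loading the CSV yourself, is to skip that row at read time (a sketch; the filename is assumed):

import pandas as pd

# skiprows=[1] skips the second physical line, i.e. the sub-header row
df = pd.read_csv('london_rainfall.csv', skiprows=[1])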

After that, I had to get rid of the locations I was not going to use, so I used pandas' iloc to slice away a block of columns in the dataframe.

df = df.drop(df.iloc[:, 1:6], axis=1)
image003.png

After that, I used the dataframe drop function to get rid of the columns by name.

df = df.drop(['Roding', 'Lower Lee.1', 'North Downs South London.1', 'Roding.1'], axis=1)
image005.png

Now, before I show you the rest, I should mention the errors I ran into when trying to analyse or manipulate the contents of the dataframe. To fix them, I changed the date column into pandas DateTime, parsing the dates as day-first because pandas defaults to the American date format, and changed the Lower Lee column into a float type. This had to be done because the header row I sliced off earlier had turned the columns into non-numeric data types. After all of this I could get back to the analysis.
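Roughly, the fixes looked like this (a sketch; the exact column names are assumed):

import pandas as pd

# Parse the dates day-first (UK format) and make the rainfall numeric
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Lower Lee'] = df['Lower Lee'].astype(float)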

To make the analysis more manageable, I decided to sum the rainfall up to a monthly basis rather than a daily one. Daily data means dealing with a lot of extra rows, and monthly rainfall makes it easier to see changes at a glance. To do this I had to group the dataframe into monthly data. I was stuck on this for a while, but I was able to find a solution.

First, I created a new dataframe that grouped the DateTime column by month; this is why I had to change the data type earlier. Then I used the dataframe aggregate function to sum the values, followed by the unstack function, which pivots the index labels. Next I used reset_index(level=[0,1]) to revert the multi-index into a single-index dataframe, dropped the level_0 column, and renamed the remaining columns to date and rain.
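My exact chain of calls was a bit convoluted, so here is a simpler equivalent (a sketch, assuming the dataframe has 'Date' and 'Lower Lee' columns) using resample instead:

import pandas as pd

# Sum the daily rows into calendar months
monthly = (df.set_index('Date')
             .resample('M')['Lower Lee']
             .sum()
             .reset_index())
monthly.columns = ['Date', 'Rain']  # matching the column names used later on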

image007.png

Analysing the Data

image009.png

One of the major issues that popped up was the data type of the date column. After tonnes of digging around on Stack Overflow, I found the solution was to convert it to a timestamp and then back into a DateTime format. I think grouping the dataframe into a monthly one messed up the data type, which is why I had to change it again.

A minor thing I had to adjust was the index, because when I first plotted the graphs the forecast did not show the dates, only an increasing number. The tutorial's notebook had the date as the index, so I changed my dataframe to match, and now the dates show on the x-axis when the forecast is plotted.
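Something along these lines (a sketch, reusing the monthly dataframe from above; new_index_df_new_index is the name used in the rest of the post):

# Re-parse the dates and use them as the index
monthly['Date'] = pd.to_datetime(monthly['Date'])
new_index_df_new_index = monthly.set_index('Date')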

Now for the analysis. This is a time-series analysis, as we are forecasting. I found this article here, which I followed. I used the statsmodels package, which provides models for statistical analysis. First, we did a decomposition, which separates the dataframe into trend, seasonal, and residual components.
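A sketch of the decomposition (assuming an additive model, which is the common default):

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Split the monthly rainfall series into trend, seasonal and residual parts
decomposition = sm.tsa.seasonal_decompose(new_index_df_new_index['Rain'], model='additive')
decomposition.plot()
plt.show()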

image011.png

Next, the tutorial asks us to check if the time series is stationary. In the article, it's defined as “A time series is stationary when its statistical properties such as mean, variance, and autocorrelation are constant over time. In other words, the time series is stationary when it is not dependent on time and not have a trend or seasonal effects.”

To check if the data is stationary, we used autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.
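These plots can be produced with statsmodels (a sketch):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

plot_acf(new_index_df_new_index['Rain'])
plot_pacf(new_index_df_new_index['Rain'])
plt.show()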

image013.png
image015.png

There is a quick cut-off, so the data is stationary. The autocorrelation and partial autocorrelation functions tell us how strongly the time-series values depend on their previous values.

Next we used another Python package called pmdarima, which helped me decide which model to use.

import pmdarima as pm
 
model = pm.auto_arima(new_index_df_new_index['Rain'], d=1, D=1,
                      m=12, trend='c', seasonal=True, 
                      start_p=0, start_q=0, max_order=6, test='adf',
                      stepwise=True, trace=True)


All of the settings were taken from the tutorial. I will let the tutorial explain the numbers:

“Inside auto_arima function, we will specify d=1 and D=1 as we differentiate once for the trend and once for seasonality, m=12 because we have monthly data, and trend='C' to include constant and seasonal=True to fit a seasonal-ARIMA. Besides, we specify trace=True to print status on the fits. This helps us to determine the best parameters by comparing the AIC scores.”

After that I split the data into train and test sets.

train_x = new_index_df_new_index[:int(0.85*(len(new_index_df_new_index)))]
test_x = new_index_df_new_index[int(0.85*(len(new_index_df_new_index))):]
image017.png

When splitting the data for the first time, I used scikit-learn's train_test_split function. But that led to some major errors later on when plotting the data, since train_test_split shuffles the rows by default, which breaks the temporal order a time series needs, so I used the tutorial's method instead.

Then we trained a SARIMAX model based on the parameters produced earlier.

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train_x['Rain'],
                order=(2,1,0),seasonal_order=(2,1,0,12))
results = model.fit()
results.summary()
image019.png

Plotting the forecast

Now that we have a trained model, we can start work on the forecasting.

forecast_object = results.get_forecast(steps=len(test_x))
mean = forecast_object.predicted_mean
conf_int = forecast_object.conf_int()
dates = mean.index

These variables are used to help us plot the forecast. The forecast is as long as the test dataset, the mean is the average prediction, the confidence interval gives us a range in which the values are likely to lie, and the dates provide an index so we can plot against the date.

import matplotlib.pyplot as plt

plt.figure(figsize=(16,8))

df = new_index_df_new_index
plt.plot(df.index, df, label='real')
plt.plot(dates, mean, label='predicted')
plt.fill_between(dates, conf_int.iloc[:,0], conf_int.iloc[:,1], alpha=0.2)
plt.legend()
plt.show()

image021.png

This is an example of an in-sample forecast. Now let's see how to make an out-of-sample forecast.

pred_f = results.get_forecast(steps=60)
pred_ci = pred_f.conf_int()
ax = df.plot(label='Rain', figsize=(14, 7))
pred_f.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Monthly Rain in lower lee')
plt.legend()
plt.show()

image023.png

This is forecasting 60 months into the future.
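I also pulled out the predicted mean of this forecast, since I use it again below (mean_60 is the name the later code expects):

mean_60 = pred_f.predicted_mean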

 

Now that we have forecast data, I needed to work out which areas could get flooded.

 

Getting Elevation Data

To work out which areas are at risk of flooding, I had to find elevation data. After googling around, I found that the UK government provides elevation data of the country, collected using LiDAR. While I was able to download the data, I realised I had no way to view it in Python, and I might have had to pay for and learn a new program called ArcGIS, which is something I did not want to do.

So I found a simpler alternative: the Google Maps Elevation API, where you can get the elevation of an area from its coordinates. I was able to access the elevation data using the Python requests package.

import requests

# 'key' holds my Google Maps API key
r = requests.get('https://maps.googleapis.com/maps/api/elevation/json?locations=39.7391536,-104.9847034&key={}'.format(key))
r.json()

{'results': [{'elevation': 1608.637939453125,
   'location': {'lat': 39.7391536, 'lng': -104.9847034},
   'resolution': 4.771975994110107}],
 'status': 'OK'}

Now we need to work out when the point will get flooded. Using the rainfall data, we compare the rainfall against the elevation; if the rain exceeds the elevation, the place is counted as underwater.

r = requests.get('https://maps.googleapis.com/maps/api/elevation/json?locations=51.528771,0.155324&key={}'.format(key))
json_data = r.json()
print(json_data['results'])
elevation = json_data['results'][0]['elevation']
print('elevation: ', elevation)

rainfall_dates = []
for index, values in mean.items():  # .items() replaces the now-removed .iteritems()
    print(index)
    rainfall_dates.append(index)

print(rainfall_dates)

for i in mean:
    print('Predicted Rainfall:', i)
    print('Rainfall vs elevation:', elevation - i)
    print('\n')
Predicted Rainfall: 8.427437412467206
Rainfall vs elevation: -5.012201654639448


Predicted Rainfall: 40.91480530998025
Rainfall vs elevation: -37.499569552152494


Predicted Rainfall: 26.277342698245548
Rainfall vs elevation: -22.86210694041779


Predicted Rainfall: 16.720892909866357
Rainfall vs elevation: -13.305657152038599

As we can see, if a month's rainfall were to fall all in one day, the area would get flooded.

diff_rain_ls = []
for f, b in zip(rainfall_dates, mean):
    print('Date:', f)
    print('Predicted Rainfall:', b)
    diff_rain = elevation - b
    diff_rain_ls.append(diff_rain)
    print('Rainfall vs elevation:', elevation - b)
    print('\n')
    # print(f, b)

This allows me to compare the dates with rainfall vs elevation difference.

df = pd.DataFrame(list(zip(rainfall_dates, diff_rain_ls)), 
               columns =['Date', 'diff']) 
df.plot(kind='line',x='Date',y='diff')
plt.show()
image025.png

I did the same thing with the 60-month forecast:

rainfall_dates_60 = []
for index, values in mean_60.items():
    print(index)
    rainfall_dates_60.append(index)

diff_rain_ls_60 = []
for f, b in zip(rainfall_dates_60, mean_60):
    print('Date:', f)
    print('Predicted Rainfall:', b)
    diff_rain_60 = elevation - b
    diff_rain_ls_60.append(diff_rain_60)
    print('Rainfall vs elevation:', elevation - b)
    print('\n')

In the long term, the forecast says there will be less flooding. This is likely because the data collection is not perfect and the timespan is short.

How the Project Fell Short

While I was able to work out the amount of rainfall needed to flood an area, I did not meet the goal of showing it on a map. I could not work out the LiDAR data from earlier, and the Google Maps packages for Jupyter notebooks did not work, so I only had the coordinates and the rainfall amounts.

I wanted to make something like this:

For the reasons mentioned earlier, I could not do it. The idea was to have the map zoomed in on the local area, showing underwater properties and land.

I think that's the main bottleneck: getting elevation data as a map that can be manipulated in Python. With that, I could have created a script that colours in areas with a low elevation.
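If I had got a map working, one possible direction would have been something like this folium sketch (hypothetical; I did not get this working, and the threshold logic simply mirrors the comparison above):

import folium

# Hypothetical: mark the point red if the forecast rain ever exceeds its elevation
location = [51.528771, 0.155324]
colour = 'red' if max(mean_60) > elevation else 'blue'

m = folium.Map(location=location, zoom_start=14)
folium.CircleMarker(location=location, radius=10,
                    color=colour, fill=True).add_to(m)
m.save('flood_map.html')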

Why you should NOT use this model

While I learnt some things from this project, I do think there are some major issues with how I decided which areas are at risk. Just calculating monthly rainfall and finding the difference from the elevation is arbitrary. How well does monthly rainfall correlate with a real flood, where the rain can pour down ten times harder? This is something I started to notice as I went through the project. Floods in the UK mostly happen from flash flooding, where a month's worth of rain pours down in one day, so normal rainfall will only correlate so far. Real flood mappers use other data points, like simulating the physics of the water to see how it flows through and affects an area (hydrology); other inputs can include temperature and snow. Even the data I did have could have been better: the longest national rainfall series only went back to the 1970s. I think I did a good job picking the local rain gauge (Lower Lee) from the dataset, but I wonder if it would have been better to take the average or sum of all the gauges to get a general idea of rainfall across the city.

So, apart from not mapping the flooding, this risk assessment is woefully inaccurate.

If you liked reading this article, please check out my other blog posts:

Failing to implement my first paper

How I created an API that can work out your shipping emissions
