What has been the longest raining period? (Rolling mean with Pandas and Agglomerative clustering with Scikit-Learn)

Rubén Chuliá Mena
14 min readAug 8, 2022

--

Note: all the code can be accessed in the next Kaggle Notebook.

Lately it has been raining a lot in my city (a lot for being in Spain I mean). Like I am really missing my vitamin D. This is the third consecutive week, which is too much for everybody here.

My friend said that she didn’t remember another time when there were so many consecutive days raining. I disagreed because, just the day before, it did not rain, so it broke the chain of consecutive days raining.
However, it rained the days before that day and the days afterwards.

My friend told me not to be so mathematically rigorous. That it does not matter if one day it does not rain. What she meant is that *most of the days* have rained for the longest period of times she can remember.

And, as I love data and quantifying things, I decided to find if these last weeks have been indeed the longest period of time (over the last years) where it has been raining. But first, what does it mean

most of the days?

If it has been raining for 20 days, but on the 10th day it didn’t, can we consider it has rained?

What about if there have been 2 consecutive sunny days?

What if there has been more than one sunny day but they were not consecutive?

Before answering the question, I need to find a definition of *most of days*. We will go through multiple intuitive and valid interpretations, and we will find an answer for each one.

Get the data

We are going to work with weather data from the Spanish official weather registry, AEMET. They provide an API for accessing the historical records of rain. As this is not an article about API usage, I am just going to provide the Python code to consume the API. If you want to modify the behaviour, it should be easy to understand.

from enum import Enum
from IPython.display import Image
import requests
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib as mpl
import numpy as np
import plotly.graph_objects as go
from sklearn.cluster import AgglomerativeClusteringAPI_KEY = 'eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJydWJjaHVtZUB0ZWxlY28udXB2LmVzIiwianRpIjoiNTRlN2Q1MTUtMDVhMy00MzM5LWI0YzktMjJlYWZkODAwNzU0IiwiaXNzIjoiQUVNRVQiLCJpYXQiOjE1OTA0MDc0MDMsInVzZXJJZCI6IjU0ZTdkNTE1LTA1YTMtNDMzOS1iNGM5LTIyZWFmZDgwMDc1NCIsInJvbGUiOiIifQ.5nTvS38ehJYXQJPO9t6WaFJBtbdz9ts7nRzlfBHP0TE'def get_stations():
stations_data_url = requests.request(
"GET",
"https://opendata.aemet.es/opendata/api/valores/climatologicos/inventarioestaciones/todasestaciones/",
headers={
'cache-control': "no-cache"
},
params={"api_key": API_KEY}
).json()['datos']
stations = requests.request("GET", stations_data_url).json()
return pd.DataFrame(stations)
def get_station_id(station_name):
stations = get_stations()
station = stations[stations.nombre == station_name].iloc[0]
return station.indicativo
def decimal_with_comma_to_float(s):
if not isinstance(s, str):
return s
if s == 'Ip':
return 0.1
return float(s.replace(',', '.'))def clean_weather_data(raw_weather_data):
weather_data = raw_weather_data.copy()
weather_data.fecha = pd.to_datetime(raw_weather_data.fecha)
weather_data = weather_data.sort_values(by="fecha", ascending=True)
weather_data.prec = weather_data.prec.fillna(0.1)
weather_data.prec = weather_data.prec.map(decimal_with_comma_to_float)
return weather_data
def get_weather_data(station_id, start_date, end_date):
weather_url = (
"https://opendata.aemet.es/opendata/api/valores/climatologicos/diarios/datos"
"/fechaini/{start_date}T00%3A00%3A00UTC"
"/fechafin/{end_date}T23%3A59%3A59UTC"
"/estacion/{station_id}"
).format(
start_date=start_date,
end_date=end_date,
station_id=station_id
)

weather_data_url = requests.request(
"GET",
weather_url,
params={"api_key": API_KEY}
).json()['datos']

data = requests.request("GET", weather_data_url).json()

raw_weather_data = pd.DataFrame(data)
return clean_weather_data(raw_weather_data)
def format_date(date):
return date.strftime("%Y-%m-%d")
download_new_data = Falseif download_new_data:
oliva_station_id = get_station_id("OLIVA")
end_date = pd.Timestamp.today()
start_date = pd.to_datetime(end_date) - pd.DateOffset(years=5) + pd.DateOffset(days=1)
weather_data = get_weather_data(oliva_station_id, format_date(start_date), format_date(end_date))
else:
weather_data = clean_weather_data(pd.read_csv("../input/longestrainingperiodarticledata/weather_data_oliva.csv"))
rain = weather_data.set_index("fecha").prec

Now that we have our raining data in the variable rain, let's get a grasp of what it is about

fig = go.Figure(
go.Scatter(x=rain.index, y=rain, showlegend=False),
layout=go.Layout(
yaxis_title="Precipitation (mm)"
)
)
fig.show()

Maximum amount of rain in predefined calendar intervals

A way of finding when it rained the most can be just finding the day it rained most.

class PlotType(Enum):
SCATTER = "Scatter"
BAR = "Bar"
def plot_max_rain(rain, plot_type=PlotType.SCATTER):
max_date = rain.idxmax()
max_rain = rain[max_date]
if plot_type == PlotType.SCATTER:
trace_function = go.Scatter
else:
trace_function = go.Bar

return go.Figure(
data=[
trace_function(x=rain.index, y=rain, showlegend=False),
go.Scatter(x=[max_date], y=[max_rain], marker_size=20, showlegend=False)
]
)
plot_max_rain(rain)

This can be a curious fact, but it is not really much meaningful. Much more meaningful is to compute the amount of rain over a week:

beginning_of_week = rain.index.to_period('W').start_time
rain_by_week = rain.groupby(by=beginning_of_week).sum()
plot_max_rain(rain_by_week)

Indeed, that was the week with most rain over the last 5 years.

What about the rainiest month?

beginning_of_month = rain.index.to_period('M').start_time
rain_by_month = rain.groupby(by=beginning_of_month).sum()
plot_max_rain(rain_by_month, PlotType.BAR)

And again, we see that was the rainiest month.

And just to satisfy our curiosity, let’s aggregate the data by year:

rain_by_year = rain.groupby(by=rain.index.year).sum()plot_max_rain(rain_by_year[(rain_by_year.index > 2017) & (rain_by_year.index < 2022)], PlotType.BAR)

It is very shocking to see that the rain in the city of Oliva has been very irregular over the last 5 years. The current year (2022) and the first one (2017) should not count since we have no complete data about them.

This approach is simple and effective. However, it has two problems. The first one is that weeks, months, years, and other calendar intervals (quarters, etc.) are just conventions. What if it rainned so much from 15th February to 15th March. It is an interval of 30 days, i.e., one month, but in the previous calculation, half of that rain would have been computed within February, and the other half within March. So, even if it has been the 30-day interval when it has rained the most, it is not registered as that.

Therefore, a better question than which month was the rainiest, is which interval of 30 days (or 7 days, or 10 days, or 666 days, whatever seems meaningful to you) has been the rainiest one.

For that, we can use a rolling window function. However, this is tricky if we don’t pay attention. The series has multitude of missing days. We must include them and assume on those days it didn’t rain at all. If we do not do this, we would be computing the wrong information. Instead of measuring the rain over the last 30 days, we would be measuring the rain over the last 30 days when we have data, which is not what we want.

full_date_range = pd.date_range(start=rain.index[0], end=rain.index[-1], freq="D")
rain = rain.reindex(full_date_range).fillna(0)
rain_last_30_days = rain.rolling(30).sum()plot_max_rain(rain_last_30_days)

Nothing new here. April 2019 was, not only the rainiest month, but the rainiest 30-day interval also ended on April 25th.

It seems there is no doubt by now that April 2019 was especially rainy.

However, why did we choose a window of 30 days? Would have the result been similar with other sizes? We can visualize the rainiest periods for different period lengths.

More formally, we want to compute a sum of a sequence with a given window length:

where 𝑃(𝑛)[𝑜] is the total precipitation amount for the 𝑛n-length interval starting at the 𝑜-th day:

max_window_size = 90intervals = []
for window_size in range(1, max_window_size + 1):
rain_interval_sum = rain.rolling(window_size).sum()
end_date_max_interval = rain_interval_sum.idxmax()
start_date_max_interval = end_date_max_interval - pd.Timedelta(window_size - 1, "D")
intervals.append({
"start": start_date_max_interval,
"end": end_date_max_interval,
"precipitation": rain_interval_sum.max(),
"length": window_size
})

intervals = pd.DataFrame(intervals)
go.Figure(
data=[
go.Scatter(
x=[interval.start, interval.end],
y=[interval.length, interval.length],
mode="lines",
line_color="blue",
showlegend=False
)
for _, interval in intervals.iterrows()
],
layout=go.Layout(
yaxis_title="Interval length (days)",
xaxis_title="Date",
xaxis_range=[rain.index[0], rain.index[-1]]
)
)

As we see, with an interval of length less than 34 days, the rainiest rainy period is around April 2019. However, this changes when we increase the interval length.

But, what if we want to have one single answer? We will need to have a way of comparing two different intervals. And to compare them, we can come up with a number for each interval. The number will be higher the rainier the day. How can we come up from such a mathematical expression? What should be the rainyness of an interval of days?

A naive answer would be: “the rainiest period is the one where more rain fell”. But if you think about this, it is not a good answer. Longer periods will be more rainy. Does that mean that 50mm of precipitation that fell in the course of one year is a rainier period that 49mm fallen in one single day?

Another naive answer could be: “the rainiest period is the one where more rain fell per day, i.e. the one with highest velocity of precipitation”. This also has a problem. Let’s take the previously mentioned day when 49mm fell. Is that period of one single day rainier than three consecutive days when it rained 48mm each day? I think most people would agree that the answer is no.

To find the answer, let’s go through the properties that a reasonable rainyness score would need to have. I think it should have two basic properties:

  • If in another interval the same quantity of rain but the period is 𝑁 times shorter, the rainyness score must be 𝑁 times higher.
  • If there is another interval with the same average precipitation rate (same average amount of rain per day) but it is 𝑁 times longer, the rainyness period should also be 𝑁 times higher.

If we call 𝑃 the amount of precipitation it rained in the period, 𝐿 its length, and 𝐼 the intensity (average precipitation per day) then a score of rainyness 𝑅 that would satisfy previous two conditions is:

But what if we want to give more weight to the amount of rain? Or to its intensity? We can add a correction factor 𝛼:

alphas = np.arange(0, 1, 0.01)
max_rainyness_interval_indices = pd.Series(0, index=alphas)
for alpha in alphas:
intensity = intervals.precipitation / intervals.length
rainyness = intensity ** alpha * intervals.precipitation
max_rainyness_interval_indices[alpha] = rainyness.idxmax()

max_rainyness_intervals = intervals.reindex(max_rainyness_interval_indices).set_index(max_rainyness_interval_indices.index)
go.Figure(
data=[
go.Scatter(
x=[period.start, period.end + pd.DateOffset(1)],
y=[alpha, alpha],
mode="lines",
line_color="blue",
showlegend=False
)
for alpha, period in max_rainyness_intervals.iterrows()
],
layout=go.Layout(
yaxis_title=r"$\alpha$",
xaxis_title="Date",
xaxis_range=[rain.index[0], rain.index[-1]]
)
)

As we can see, when we give low importance to the intensity of rain, it is Autumn 2018 the rainiest period (the period it rained the most). However, when we start increasing the factor 𝛼α, we rapidly change to 2 day periods or 1 day period (the 1 day period cannot even be appreciated in the graph, but it is of course the rainiest day in the last 5 years).

Most consecutive days raining

Previous approach to computing the rainiest period takes in account the amount of rain. There is a different interpretation however that I found even more intuitive.

When we talk about the period when it most rained, we might be referring to the longest consecutive sequence of days when it rained. In this interpretation we don’t care about the amount of rain, but about the fact that it rained or not.

Let’s go further. Let’s imagine a period when it rained for 10 days straight, then it stopped for 1 day, and then it rained again for 10 days straight. Should we consider them as two different raining periods? What if we allow for at least one day not raining? What if we allow for two?

This interpretation can be computed in another approach it works in a very straightforward way using a type of hierarchical clustering called Agglomerative clustering. The idea is this. We are going to represent each day that it rained as an integer. The integer is the amount of days from the beginning of the period until that day. The algorithm will take the two closest days (any pair of rainy days that are consecutive) and will put them together in a group or cluster. That will be considered the two-day period with rainier days.

In the next iteration, it will take the next two closest days (or groups) and will put them together into one group (cluster).

At the beginning, as there are many days, the new clusters will be made from consecutive days. However, at some point, there will not be any more groups that are separated by just one day. From that iteration onwards we will move to the next stage where the algorithm will merge clusters that are separated by two days, and then by three, etc.

In the last iteration of the first stage, we will have all the clusters of consecutive rainy days. From that group we can get the longest period.

In the last iteration of the second stage, we will have all the clusters of consecutive rainy days but allowing for periods of 1 day where it does not rain. From this set of intervals, we can get the longest period of rain allowing one-sunny-day intervals.

In the last iteration of the next stage, we can get the longest period of rain allowing two-sunny-day intervals, and so on.

Take a look at the visualization below. It can help you grasp the idea of agglomerative clustering.

Without further ado, let’s get our hands dirty with the code. First, for previous example we can generate a simple Numpy array that represents the days it rained in the period under study:

days_from_beginning = np.array([1, 2, 3, 6, 7, 9, 10])

What we are going to do now it use the agglomerative clustering algorithm implementation from Scikit-Learn package to find groups rainy days:

clustering = AgglomerativeClustering(
n_clusters=1,
affinity="l1",
compute_full_tree=True,
linkage="single",
compute_distances=True
).fit(days_from_beginning.reshape((-1, 1)))

The clustering variable has a field called children_, which is really a cluster merge history, i.e. it contains the information about the order in which the merge of clusters is performed.

Each cluster that has ever existed is represented by an integer index. At the initial stage, before any clustering operation, each point (each day in our use case) belongs to its own cluster. If there are N points, there will be N clusters at the beginning. They go from 0 to N-1.

children is a (N-1)-sized list of 2-element integer tuples. Each tuple represents the merging of two clusters. The 2 integers of the tuple are the indices of the clusters that are merged. Hence, the i-th element of children represents the two clusters that have been merged in the i-th merge. The new cluster that results of the merge of the other two will have index (N + i).

Check out the Scikit-Learn Agglomerative Clustering documentation for more information algorithm.

For our simple hypothetical case of only 10 days, this is how the algorithm would work:

clustering.children_array([[ 0,  1],
[ 7, 2],
[ 3, 4],
[ 5, 6],
[ 9, 10],
[ 8, 11]])

Which visually is represented as:

From the children_ variable we can get, at each stage, the clusters that have been created:

def get_groups_by_merge_from_agglomerative_clustering_merges(merge_history):
"""
For each merge of pairs of clusters that an agglomerative clustering algorithm has performed on a number of points, get the clusters that exist at each stage.

Each cluster that has ever existed is represented by an integer index.
At the initial stage, before any clustering operation, each point belongs to its own cluster.
If there are N points, there will be N clusters at the beginning. They go from 0 to N-1.

merge_history is a (N-1)-sized list of 2-element integer tuples. Each tuple represents the merging of two clusters.
The 2 integers of the tuple are the indices of the clusters that are merged.
Hence, the i-th element of merge_history represents the two clusters that have been merged in the i-th merge.
The new cluster that results of the merge of the other two will have index (N + i).

The function returns two lists: active_clusters and clusters.
The i-th element of the 'active_clusters' list contains the set of indices of the clusters that exist after the i-th merge.
The element j of 'clusters' is the set of elements that belong to the cluster j.

Check out https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html for more information on the agglomerative clustering algorithm.
"""
N = len(merge_history) + 1
current_active_clusters = set(range(N))
active_clusters = []
clusters = [{i} for i in range(N)]
for merge_index, (cluster_A_index, cluster_B_index) in enumerate(merge_history):
cluster_A_elements = clusters[cluster_A_index]
cluster_B_elements = clusters[cluster_B_index]
clusters.append(cluster_A_elements.union(cluster_B_elements))

current_active_clusters.difference_update({cluster_A_index, cluster_B_index})
current_active_clusters.add(merge_index + N)
active_clusters.append(current_active_clusters.copy())

return active_clusters, clusters

active_clusters, clusters =
get_groups_by_merge_from_agglomerative_clustering_merges(clustering.children_)
active_clusters[{2, 3, 4, 5, 6, 7}, {3, 4, 5, 6, 8}, {5, 6, 8, 9}, {8, 9, 10}, {8, 11}, {12}]clusters[{0},
{1},
{2},
{3},
{4},
{5},
{6},
{0, 1},
{0, 1, 2},
{3, 4},
{5, 6},
{3, 4, 5, 6},
{0, 1, 2, 3, 4, 5, 6}]

The way to interpret these variables is the next. Let’s take, for example, the third element of active_clusters:

{5, 6, 8, 9}

This means that, after the third merge of clusters (fourth line of the figure above), the clusters that are left are four, with indices 5, 6, 8 and 9. The elements that are inside each of those clusters are clusters[5] (5), cluster[6] (6), cluster[8] (0, 1 and 2) and cluster[9] (3 and 4). Take a look at previous visualization to understand it better.

We now just need to take stages at which the number of clusters is the minimum for a given number of allowed consecutive sunny days inside a rainy period. In our example, if we do not allow any rainy day, the stage we want is the fourth one (after four merges). If we allow for 1 sunny day, we get the fifth stage. If we allow two, we will be at the sixth stage.

We can perform this computation thanks to the distances_ field of the clustering variable:

clustering.distances_array([1., 1., 1., 1., 2., 3.])

This variable indicates us that, for example, at the fifth stage, the clusters that have been merged are separated by a distance of clustering.distances[4] (2).

def get_most_complete_cluster_stages_for_each_cluster_separation_distance(merged_clusters_separation):
num_merges = len(merged_clusters_separation)
most_complete_stages = np.append(
np.flatnonzero(np.diff(merged_clusters_separation)),
num_merges - 1
)
return pd.Series(
most_complete_stages,
index=merged_clusters_separation[most_complete_stages],
name="Most complete clustering stages"
).rename_axis(index="Distance")
most_complete_stages = get_most_complete_cluster_stages_for_each_cluster_separation_distance(clustering.distances_)
print(most_complete_stages)
Distance
1.0 3
2.0 4
3.0 5
Name: Most complete clustering stages, dtype: int64
active_clusters_in_most_complete_stages = {distance - 1: active_clusters[stage] for distance, stage in most_complete_stages.items()}
print(active_clusters_in_most_complete_stages)
{0.0: {8, 9, 10}, 1.0: {8, 11}, 2.0: {12}}

As we can see, for a distance of 1 (intra-cluster sunny period length of 0, i.e. only consecutive days in clusters), the most complete cluster configuration is with clusters 8, 9 and 10. For a distance of 2 (allowing 1 day periods of sunny days inside clusters), the most complete cluster configuration happens with clusters 8 and 11. For a distance of 3 (allowing for 2 day periods without raining inside clusters) all days have been grouped into just 1 cluster.

We can represent this in the next way

rainy_days_timeline = (
pd.date_range("2022-01-01", "2022-01-10")[(days_from_beginning - 1).flatten()]
)
from itertools import chain
import plotly.graph_objects as go
def get_clustered_days_rectangle(cluster, sunny_period_length, rainy_days_timeline, highlight=False):
date_interval = pd.date_range(rainy_days_timeline[min(cluster)], rainy_days_timeline[max(cluster)] + pd.Timedelta(days=1), freq="1D")

return go.Scatter(
x=date_interval,
y=np.ones(len(date_interval)) * sunny_period_length,
showlegend=False,
marker_color="red" if highlight else "blue",
line_width=20,
mode="lines",
text=len(cluster) if highlight else None,
)
def get_rainy_days_cluster_timeline(clusters, sunny_period_length, rainy_days_timeline):
cluster_sizes = [len(cluster) for cluster in clusters]
max_size = max(cluster_sizes)

return [
get_clustered_days_rectangle(cluster, sunny_period_length, rainy_days_timeline, highlight=True) if len(cluster) == max_size
else get_clustered_days_rectangle(cluster, sunny_period_length, rainy_days_timeline)
for cluster in clusters
]
def draw_date_clusters_by_distance(rainy_days_timeline, clusters, active_clusters_in_most_complete_stages):
data = list(chain.from_iterable([
get_rainy_days_cluster_timeline(
[clusters[cluster_index] for cluster_index in active_clusters],
sunny_period_length,
rainy_days_timeline
) for sunny_period_length, active_clusters in active_clusters_in_most_complete_stages.items()]))
return go.Figure(
data=data,
layout=go.Layout(
xaxis=dict(
title="Time"
),
yaxis=dict(
title="Intra-cluster sunny period max length (days)",
tickmode="array",
tickvals=list(active_clusters_in_most_complete_stages.keys())
)
)
)
draw_date_clusters_by_distance(rainy_days_timeline, clusters, active_clusters_in_most_complete_stages)

The longest cluster in each stage is painted in red.

Now let’s do it with the real data. First, get the subset of days when it rained:

rained = weather_data.prec > 0
rainy_days = weather_data.fecha[rained]
days_from_beginning = (rainy_days - weather_data.fecha.iloc[0]).dt.days

In days_from_beginning we have an array of integers in increasing order. Each element represents a day it rained, and the value of each element is the number of days since the beginning of the period of 5 years we are measuring.

Then apply the agglomerative clustering algorithm

clustering = AgglomerativeClustering(
n_clusters=1,
affinity="l1",
compute_full_tree=True,
linkage="single",
compute_distances=True
).fit(days_from_beginning.values.reshape(-1,1))

Now it is time to detect the clusters at each stage and select the most complete ones for each distance

active_clusters, clusters = get_groups_by_merge_from_agglomerative_clustering_merges(clustering.children_)
most_complete_stages = get_most_complete_cluster_stages_for_each_cluster_separation_distance(clustering.distances_)
active_clusters_in_most_complete_stages = {distance - 1: active_clusters[stage] for distance, stage in most_complete_stages.items()}

Finally, draw it:

active_clusters_in_most_complete_stages = {
sunny_period_max_length: active_clusters
for sunny_period_max_length, active_clusters in active_clusters_in_most_complete_stages.items()
if sunny_period_max_length < 10
}
draw_date_clusters_by_distance(rainy_days.values, clusters, active_clusters_in_most_complete_stages)

As we can see, the longest period of consecutive days raining happened around March 2020 (red period on inferior row). If we allow for 4 or more days without raining, then it is the winter of 2018 the rainiest period.

Now you can use this method to any one-dimensional binary data (like rain/no-rain). I hope you found this useful.

Sign up to discover human stories that deepen your understanding of the world.

--

--

Rubén Chuliá Mena
Rubén Chuliá Mena

Written by Rubén Chuliá Mena

Artificial Intelligence Engineer with experience in ML for bank and insurance sector. I strive with passion to make a meaningful impact in the industry.

No responses yet

Write a response