Geographic visualization with Matplotlib and Pandas

Rubén Chuliá Mena
15 min readJun 9, 2021

--

Geographic and demographic visualizations are one of the most fun parts about data visualization. They allow you to connect technical skills with facts about the reality in which we live in in a very straightforward manner. This is a short exercise for showing how you can use GeoPandas and Matplotlib libraries to visualize simple but insightful facts about a country’s population.

You can access all of this code in my Kaggle notebook.

Downloading the data

There are two types of data we need. On the one hand there is the demographic information about the geographical regions we want to study. In my case, I am interested in the population of Spain, my coutry. So I need the information about the people in each city, and its distribution across age groups, sexes and nationalities. On the other hand, we need the geographical information: geographical coordinates, shapes of the cities, etc.

The demographic information can come in a great variety of formats, and it depends on the organism that makes that data available in each country. So, don’t get caught up in the particular information you have available at your country. The intention of this post is not that you replicate exactly what I do. It is much better if you create your own list of visualizations that you are interested in.

In any case, I am going to use the Spanish census population from the year 2017, available in this site.

It turns out you can navigate to the downloads section and choose the file in .geojson format. You can also download that file from this Kaggle dataset. This is the ideal format because the file contains, in addition to the census population of every town, the geometric shape of each one. Hence, we have all information we need in just one file.

Reading geographical data

We can load the data from a .geojson file into a GeoDataFrame.

A GeoWhat?

Working with Pandas DataFrames for any type of tabular data set is convenient, easy and popular. And in this post it is expected that you already know how to use Pandas. But, when dealing with geospatial data (like the shapes of municipalities), one might need to use some other libraries like Shapely to perform some operations: calculating areas, transforming to other coordinate system…

Fortunately, there is a library called GeoPandas, that makes it really easy to work with tabular data with geospatial data associated to it. This library extends the datatypes used by Pandas to allow spatial operations using Shapely. It also makes plotting maps very easily since it also uses Matplotlib utilities.

A GeoDataFrame is very similar to a Pandas DataFrame, but one of the columns has geometric data that Shapely can operate on. It can be loaded as:

import geopandas
spain_localities_info = geopandas.read_file(file_geojson)

The geometry column of spain_localities_info, which has been generated by Geopandas when reading the file in .geojson format, contains the shapes of the localities in Spain.

A shape is nothing more than the sequence of points that, when joined together in order by straight lines, form a 2D polygon. So, for each city, we have a sequence of points that form the polygon with the shape of that city.

But, wait. The Earth is more or less like a Sphere, a 3-dimensional object. But the polygons in the geometry column are in a 2-dimensional space. In other words, each point of the polygon has an X and a Y coorditates. How are they calculated? What is the X, Y coordinate for London for example?

The answer? They are calculated by projecting geographical locations to a Coordinate Reference System or CRS. A CRS is just a way transforming (projecting) each 3D real geographical location to a lower dimensional (2D) space. There is always a loss of information, a distortion. According to Wikipedia:

All projections of a sphere on a plane necessarily distort the surface in some way and to some extent

And, in which way are locations distorted? It depends on the CRS. If we take the (X, Y) coordinates for London, Madrid, and Rome for example, the distance between them in the 2D space might be different from the real one, or maybe the angles of the triangle are different. It could be that the difference in distance is larger in the X direction than the Y direction or viceversa.

In general, all types of distortions occur to some extent. Hence, the question is not how to avoid distortion, but what type of distortion we are willing to allow. What type of distortion is not very critical for our application.

The way to know the CRS of a Geopandas DataFrame is to use the crs property. Sometimes the crs property is not going to be set because the data from the source does not have a CRS specified, but usually it does. Geographical coordinates without the CRS they use are pretty useless.

Anyway, in our example the geographical data came with the CRS specified.

spain_localities_info.crs<Geographic 2D CRS: EPSG:4326> 
Name: WGS 84
Axis Info [ellipsoidal]:
— Lat[north]: Geodetic latitude (degree)
— Lon[east]: Geodetic longitude (degree)
Area of Use:
— name: World
— bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
— Ellipsoid: WGS 84
— Prime Meridian: Greenwich

It seems that the geographic information from the source (and the one that has been loaded into the geometry column of spain_localities_info) is in a CRS (Coordinate Reference System) called WGS84 or EPSG:4326, which is what we popularly know as latitude and longitude coordinate system.

Renaming variables

Although it is a boring task, it is necessary if we want to navigate the data comfortably. The names of the columns don’t make much sense. We must rename them to something more expressive. Just execute the following code or copy and paste the code in the following Jupyter notebook cell:

field_code_to_name_manual = {
"PAD_2C02": "Total de personas",
"PAD_2C03": "Pob. 0-4",
"PAD_2C04": "Pob. 5-9",
"PAD_2C05": "Pob. 10-14",
"PAD_2C06": "Pob. 15-19",
"PAD_2C07": "Pob. 20-24",
"PAD_2C08": "Pob. 25-29",
"PAD_2C09": "Pob. 30-34",
"PAD_2C10": "Pob. 35-39",
"PAD_2C11": "Pob. 40-44",
"PAD_2C12": "Pob. 45-49",
"PAD_2C13": "Pob. 50-54",
"PAD_2C14": "Pob. 55-59",
"PAD_2C15": "Pob. 60-64",
"PAD_2C16": "Pob. 65-69",
"PAD_2C17": "Pob. 70-74",
"PAD_2C18": "Pob. 75-79",
"PAD_2C19": "Pob. 80-84",
"PAD_2C20": "Pob. 85 o más",

"PAD_2C21": "Total de varones",
"PAD_2C22": "Varones 0-4",
"PAD_2C23": "Varones 5-9",
"PAD_2C24": "Varones 10-14",
"PAD_2C25": "Varones 15-19",
"PAD_2C26": "Varones 20-24",
"PAD_2C27": "Varones 25-29",
"PAD_2C28": "Varones 30-34",
"PAD_2C29": "Varones 35-39",
"PAD_2C30": "Varones 40-44",
"PAD_2C31": "Varones 45-49",
"PAD_2C32": "Varones 50-54",
"PAD_2C33": "Varones 55-59",
"PAD_2C34": "Varones 60-64",
"PAD_2C35": "Varones 65-69",
"PAD_2C36": "Varones 70-74",
"PAD_2C37": "Varones 75-79",
"PAD_2C38": "Varones 80-84",
"PAD_2C39": "Varones 85 o más",

"PAD_2C40": "Total de mujeres",
"PAD_2C41": "Mujeres 0-4",
"PAD_2C42": "Mujeres 5-9",
"PAD_2C43": "Mujeres 10-14",
"PAD_2C44": "Mujeres 15-19",
"PAD_2C45": "Mujeres 20-24",
"PAD_2C46": "Mujeres 25-29",
"PAD_2C47": "Mujeres 30-34",
"PAD_2C48": "Mujeres 35-39",
"PAD_2C49": "Mujeres 40-44",
"PAD_2C50": "Mujeres 45-49",
"PAD_2C51": "Mujeres 50-54",
"PAD_2C52": "Mujeres 55-59",
"PAD_2C53": "Mujeres 60-64",
"PAD_2C54": "Mujeres 65-69",
"PAD_2C55": "Mujeres 70-74",
"PAD_2C56": "Mujeres 75-79",
"PAD_2C57": "Mujeres 80-84",
"PAD_2C58": "Mujeres 85 o más",

"PAD_3C03": "Pob. España",
"PAD_3C04": "Pob. extranjera",

"PAD_3C05": "Pob. Unión Europea",
"PAD_3C06": "Pob. Alemania",
"PAD_3C07": "Pob. Bulgaria",
"PAD_3C08": "Pob. Francia",
"PAD_3C09": "Pob. Italia",
"PAD_3C10": "Pob. Polonia",
"PAD_3C11": "Pob. Portugal",
"PAD_3C12": "Pob. Reino Unido",
"PAD_3C13": "Pob. Rumanía",

"PAD_3C14": "Pob. europea no comunitaria",
"PAD_3C15": "Pob. Rusia",
"PAD_3C16": "Pob. Ucrania",

"PAD_3C17": "Pob. África",
"PAD_3C18": "Pob. Argelia",
"PAD_3C19": "Pob. Marruecos",
"PAD_3C20": "Pob. Nigeria",
"PAD_3C21": "Pob. Senegal",

"PAD_3C22": "Pob. América",
"PAD_3C23": "Pob. Argentina",
"PAD_3C24": "Pob. Bolivia",
"PAD_3C25": "Pob. Brasil",
"PAD_3C26": "Pob. Colombia",
"PAD_3C27": "Pob. Cuba",
"PAD_3C28": "Pob. Chile",
"PAD_3C29": "Pob. Ecuador",
"PAD_3C30": "Pob. Paraguay",
"PAD_3C31": "Pob. Perú",
"PAD_3C32": "Pob. Rep_Dominicana",
"PAD_3C33": "Pob. Uruguay",
"PAD_3C34": "Pob. Venezuela",

"PAD_3C35": "Pob. Asia",
"PAD_3C36": "Pob. China",
"PAD_3C37": "Pob. Pakistán",

"PAD_3C38": "Pob. Oceanía y Apátridas",

"PAD_3C39": "Total Varones (Todas Nacionalidades)",
"PAD_3C40": "Varones España",
"PAD_3C41": "Varones Extranjeros",

"PAD_3C42": "Varones Unión Europea",
"PAD_3C43": "Varones Alemania",
"PAD_3C44": "Varones Bulgaria",
"PAD_3C45": "Varones Francia",
"PAD_3C46": "Varones Italia",
"PAD_3C47": "Varones Polonia",
"PAD_3C48": "Varones Portugal",
"PAD_3C49": "Varones Reino Unido",
"PAD_3C50": "Varones Rumanía",

"PAD_3C51": "Varones europea no comunitaria",
"PAD_3C52": "Varones Rusia",
"PAD_3C53": "Varones Ucrania",

"PAD_3C54": "Varones África",
"PAD_3C55": "Varones Argelia",
"PAD_3C56": "Varones Marruecos",
"PAD_3C57": "Varones Nigeria",
"PAD_3C58": "Varones Senegal",

"PAD_3C59": "Varones América",
"PAD_3C60": "Varones Argentina",
"PAD_3C61": "Varones Bolivia",
"PAD_3C62": "Varones Brasil",
"PAD_3C63": "Varones Colombia",
"PAD_3C64": "Varones Cuba",
"PAD_3C65": "Varones Chile",
"PAD_3C66": "Varones Ecuador",
"PAD_3C67": "Varones Paraguay",
"PAD_3C68": "Varones Perú",
"PAD_3C69": "Varones Rep_Dominicana",
"PAD_3C70": "Varones Uruguay",
"PAD_3C71": "Varones Venezuela",

"PAD_3C72": "Varones Asia",
"PAD_3C73": "Varones China",
"PAD_3C74": "Varones Pakistán",

"PAD_3C75": "Varones Oceanía y Apátridas",

"PAD_3C76": "Total Mujeres (Todas Nacionalidades)",
"PAD_3C77": "Mujeres España",
"PAD_3C78": "Mujeres Extranjeros",

"PAD_3C79": "Mujeres Unión Europea",
"PAD_3C80": "Mujeres Alemania",
"PAD_3C81": "Mujeres Bulgaria",
"PAD_3C82": "Mujeres Francia",
"PAD_3C83": "Mujeres Italia",
"PAD_3C84": "Mujeres Polonia",
"PAD_3C85": "Mujeres Portugal",
"PAD_3C86": "Mujeres Reino Unido",
"PAD_3C87": "Mujeres Rumanía",

"PAD_3C88": "Mujeres europea no comunitaria",
"PAD_3C89": "Mujeres Rusia",
"PAD_3C90": "Mujeres Ucrania",

"PAD_3C91": "Mujeres África",
"PAD_3C92": "Mujeres Argelia",
"PAD_3C93": "Mujeres Marruecos",
"PAD_3C94": "Mujeres Nigeria",
"PAD_3C95": "Mujeres Senegal",

"PAD_3C96": "Mujeres América",
"PAD_3C97": "Mujeres Argentina",
"PAD_3C98": "Mujeres Bolivia",
"PAD_3C99": "Mujeres Brasil",

"PAD_3C100": "Mujeres Colombia",
"PAD_3C101": "Mujeres Cuba",
"PAD_3C102": "Mujeres Chile",
"PAD_3C103": "Mujeres Ecuador",
"PAD_3C104": "Mujeres Paraguay",
"PAD_3C105": "Mujeres Perú",
"PAD_3C106": "Mujeres Rep_Domincana",
"PAD_3C107": "Mujeres Uruguay",
"PAD_3C108": "Mujeres Venezuela",
"PAD_3C109": "Mujeres Asia",
"PAD_3C110": "Mujeres China",
"PAD_3C111": "Mujeres Pakistán",
"PAD_3C112": "Mujeres Oceanía y Apátridas",
"PAD_3C113": "Mujeres China",

"PAD_3_COD_PROV": "Código Provincia",
"PAD_3_COD_CCAA": "Código Comunidad Autónoma",
}
def add_localities_natcode(census_localities):

def locality_census_to_natcode(locality_census):
autonomy_code = locality_census["Cod_CCAA"]
province_code = locality_census["Cod_Prov"]
ine = locality_census["Codigo"]
return f"34{autonomy_code}{province_code}{ine}"

localities_natcodes = census_localities.apply(locality_census_to_natcode, axis="columns")
return census_localities.set_index(localities_natcodes).rename_axis(index="NATCODE")
spain_localities_info = add_localities_natcode(
spain_localities_info.rename(columns=field_code_to_name_manual)
)

Calculating the area of shapes

As we are going to study the population density of towns in Spain, we need to calculate the area. GeoPandas has the builtin property area for calculating the area of the polygons in the geometry column.

However, as we have just seen, some types of distortions (every CRS is a distortion) are more fit to a given application than others. We need to choose the right type of distortion. The units of the WGS84 system are not in meters but in degrees and moreover one unit distance of latitude does not represent the same distance as one unit distance in longitude. And, actually, the ratio between the two changes with the latitude.

In summary, it is not the approppriate system for calculating areas because distances are deeply distorted. The area of the polygons in this system does not have meaningful units and due to the distortion that increases with latitude, we cannot even compare areas of different parts of the world.

We need to project the data onto a CRS where distances are in meters and the distortion is not too big.

For Spain, a good CRS is EPSG:2062. We will project the coordinates onto this CRS using the Geopandas builtin function to_crs and then calculate the area.

spain_localities_info["Area"] = spain_localities_info.to_crs(epsg=2062).geometry.area / 1e6 spain_localities_info["Density"] = (
spain_localities_info["Total de personas"]
/ spain_localities_info["Area"]
)

Notice we have divided the area between 1000000 to transform square meters to square kilometers.

Plotting

Now, when plotting shapes on a map, we need the map as the background and the shapes to be drawed on top. The map and the shapes must, of course, use the same CRS.

The library for adding background maps we are going to use is Contextily. By default, it plots maps using the CRS widely popular Web Mercator, also called EPSG:3857, which is different from the one we are using.

Although Contextily has the option to change it, it looks to me more convenient to convert the geometry information using the to_crs function of Geopandas. So, let's just convert it because from now on it will be the only CRS we are going to need to use.

spain_localities_info = spain_localities_info.to_crs(epsg=3857)

Now we are ready to plot!

import contextily as cxfig, ax = plt.subplots(1, 1, figsize=(25, 15))spain_localities_info.plot(
ax=ax,
column='Density',
legend=True,
cmap='brg',
alpha=0.75,
vmin=0,
vmax=10000,
figsize=(20,16)
)
cx.add_basemap(ax, zoom=8)

And it’s done! However, some improvements could be made. For example, we could focus the zoom to the Iberian Peninsula (sorry Canary Islands). We can do that by setting the x axis and y axis limits according to the CRS we are using:

ax.set_xlim(-1250000, 600000) ax.set_ylim(4250000, 5500000)

Now we don’t need the tick marks anymore since they do not contain useful information unless one is very interested in the coordinates in the given CRS. For now, we are going to remove them since it just adds noise to the image:

ax.tick_params(axis="both", bottom=False, top=False, left=False, right=False, labelbottom=False, labelleft=False)

Another issue is that a few cities with very high density force most of the cities to be of approximately the same color of the colorbar. We can solve this by using the logarithm of the density instead of the density itself. This can be done by passing a LogNorm object to the norm parameter of the plot function:

spain_localities_info.plot( 

norm=matplotlib.colors.LogNorm(
vmin=1, vmax=spain_localities_info.Density.max(), clip=True
)
)

The clip argument makes the normalization not fail if some value is 0.

Finally, I am going to set a title for the colorbar and change the labels so they don’t use fancy math notation. Nothing wrong with that notation, it is just that normal decimal notation is easier to comprehend for everyone. The modification of number formatting can be done by using the argument legend_kwds of the geopandas.GeoDataFrame.plot function, which internally will be passed as**kwargs to thematplotlib.pyplot.colorbar function.

spain_localities_info.plot( 

legend_kwds=dict( format="%d" )
)

Setting a label for the colorbar can also be done using the legend_kwds argument. However, I also want to change the font size, and for that I need to access directly the colorbar matplotlib axes object and use lower level functions:

colorbar = fig.axes[1] colorbar.tick_params(labelsize=20) colorbar.set_ylabel("Density (Log-Scale)", fontsize=20)

Putting it all together:

fig, ax = plt.subplots(1, 1, figsize=(20, 12))  ax.set_xlim(-1250000, 600000)
ax.set_ylim(4250000, 5500000)
spain_localities_info.plot(
ax=ax,
column='Density',
legend=True,
cmap='brg',
alpha=0.75,
norm=matplotlib.colors.LogNorm(
vmin=1, vmax=spain_localities_info.Density.max(),
clip=True
),
legend_kwds=dict(format="%d")
)

ax.tick_params(axis="both", bottom=False, top=False, left=False, right=False, labelbottom=False, labelleft=False)
colorbar = fig.axes[1]
colorbar.tick_params(labelsize=20)
colorbar.set_ylabel("Density (Log-Scale)", fontsize=20)
cx.add_basemap(ax, zoom=7)

And finally, let’s create a function so we can use it later:

def plot_map(color, name=None, normalizer=None, tick_format=None, colormap="brg"):
fig, ax = plt.subplots(1, 1, figsize=(20, 12))

ax.set_xlim(-1250000, 600000)
ax.set_ylim(4250000, 5500000)

spain_localities_info.plot(
ax=ax,
column=color,
legend=True,
cmap=colormap,
alpha=0.75,
norm=normalizer,
legend_kwds={"format": tick_format}
)

ax.tick_params(axis="both", bottom=False, top=False, left=False, right=False, labelbottom=False, labelleft=False)

colorbar = fig.axes[1]
colorbar.tick_params(labelsize=20)
colorbar.set_ylabel(name, fontsize=20)

cx.add_basemap(ax, zoom=7)

return fig


plot_map(
color="Density",
name="Density (people / km$^2$)",
normalizer=matplotlib.colors.LogNorm(vmin=1, vmax=spain_localities_info.Density.max(), clip=True),
tick_format="%d"
);

This is much better! Now we can really see the difference between low density regions, and VERY low density regions.

And that’s what I wanted to show your about map plotting. Now what it is left is just playing with the data to build your own insightful visualizations. If you want to see more examples of how I explore the dataset, stay with me through the rest of the article.

Further exploration: distribution of foreigners in Spain

We might ask: what parts of spain have more foreigners? Another way of asking it is: is foreign population distributed accross spain the same way nationals are? Before answering the question with data, I would say big cities like Madrid, Barcelona and Valencia have a bigger proportion of foreigners than the rest of Spain, but it might not be true and be just a sensation.

I have thought about two ways to answer this question visually. I think the second way is better, but I will start with the first one since it is more intuitive.

In order to answer what cities of Spain have more foreigners, we should compare to the amount of nationals that are living there. For example, almost 7% of all Spaniards (nationals) in Spain are living in Madrid. If the distribution of foreigners was the same as nationals, we would expect the same proportion of them to live in Madrid.

However, if there are more (as I suspect is the case), then that would mean that in that part of Spain is more cosmopolitan/diverse, and we want the color of that city in the map to correspond to its level of diversity.

How can we come up with a number that is significative? To me, the most intuitive is just to divide the number of actual foreigners in a city by the expected number of them:

where

  • S_i is the Diversity Score of the i-th municipality/territory/region.
  • N_i is the number of people from the target population (the foreign population in this case) in the i-th municipality
  • N_i^{reference} is the number of people from the reference population (the national population in this case) in the $i$-th municipality
  • \mu_i is the expected number of people from the target population considering the distribution of the reference population as the expected one (Spaniard population in this case)
def calculate_diversity_score(target_population, reference_population):
total_target_population = target_population.sum()
total_reference_population = reference_population.sum()

expected_target_population = reference_population * total_target_population / total_reference_population

return target_population / expected_target_population
diversity_score = calculate_diversity_score(
target_population=spain_localities_info["Pob. extranjera"],
reference_population=spain_localities_info["Pob. España"],
)
fig = plot_map(
diversity_score,
normalizer=matplotlib.colors.LogNorm(vmin=0.1, vmax=10, clip=True),
name="Diversity Score",
tick_format="%.1f",
colormap="bwr"
);

Red tonalities mean that the nuber of foreigners is bigger than the expected. Blue tonalities mean the opposite. As we can see, the east of Spain has, in general a bigger diversity score than the west.

However, I find a problem with this visualization. Some town might be very big in extension, but not in population. There is, for example, a little town of less than 400 people, where only 70 of them are nationals. That part of the map is very red, but in terms of total population, it is not that relevant.

Hence, this visualization gives too much weight to municipalities with low density. This is why the diversity score is not very meaningful.

I propose another score that will take in account the population of municipalities. Instead of taking the quocient, I will just take the difference between the actual and expected amount of foreigners, and then I will divide by the area of the municipality to transform it to a density. Therefore, this quantity is the “excess” of foreigners (the difference with respect to the expected quantity) per square kilometer that live in a municipality.

In this way, low populated towns will not be relevant at all (will appear with a white color), while very populated towns will have a weight proportional to its density.

Finally, in order to appreciate the magnitude of the unbalance in the distribution of foreigners with respect the total size of the group, I will divide the current score by the mean density of foreign population per square kilometer. This way, the final quantity, which I will call Relative Deviation Density, can be thought of as the “excess” of proportion of foreigners per square kilometer.

where

  • D_i is the Relative Deviation Density of the i-th municipality/territory/region.
  • N_i is the number of people from the target population (the foreign population in this case) in the i-th municipality
  • N_i^{reference} is the number of people from the reference population (the national population in this case) in the i-th municipality
  • \mu_i is the expected number of people from the target population considering the distribution of the reference population as the expected one (Spaniard population in this case)
  • A_i is the area of the i-th municipality in square kilometers
def calculate_relative_deviation_density(target_population, reference_population, area):
total_area = area.sum()
total_target_population = target_population.sum()
total_reference_population = reference_population.sum()

expected_target_population = reference_population * total_target_population / total_reference_population
mean_target_density = total_target_population / total_area

return (
(target_population - expected_target_population)
/ (area * mean_target_density)
)
diversity_deviation_density = calculate_relative_deviation_density(
target_population=spain_localities_info["Pob. extranjera"],
reference_population=spain_localities_info["Pob. España"],
area=spain_localities_info["Area"]
)
fig = plot_map(
diversity_deviation_density,
normalizer=matplotlib.colors.SymLogNorm(
linthresh=10,
linscale=1,
vmin=-diversity_deviation_density.abs().quantile(0.99),
vmax=diversity_deviation_density.abs().quantile(0.99),
clip=True
),
name="Diversity Deviation Density",
colormap="bwr"
)

Now we can see that, in most of Spain, as it is mostly uninhabited, the deviation is close to 0 (white color). However, in highly populated cities, we see if they present a bigger or smaller diversity than expected.

In general, foreign population is unusually high in cities of the Mediterranean coast and Madrid. The counterpart are the big cities of the rest of Spain: Sevilla, Salamanca, Valladolid, Pontevedra, Santander, A Coruña, etc.

Distribution of foreigners of specific nationalities

Now we can do the same but selecting specific nationalities.

nationality = "China"
foreign_nationality_deviation_density = calculate_relative_deviation_density(
target_population=spain_localities_info[f"Pob. {nationality}"],
reference_population=spain_localities_info["Pob. España"],
area=spain_localities_info["Area"]
)

fig = plot_map(
foreign_nationality_deviation_density,
normalizer=matplotlib.colors.SymLogNorm(
linthresh=10,
linscale=1,
vmin=-foreign_nationality_deviation_density.abs().quantile(0.99),
vmax=foreign_nationality_deviation_density.abs().quantile(0.99),
clip=True
),
name=f"{nationality} Deviation Density",
colormap="bwr"
)

We can even do it using the total foreign population as a reference for the expected distribution, instead of the national distribution.

In this case, this would help detect the parts of Spain with unusually high density of, say, Germans for example, but not compared to the population of Spaniards, but to the population of foreigners.

In other words, we already know that Barcelona has lots of people from other countries, so it is not a surprise that there are lots of people from basically any country. But, given its cosmopolitanism, does it have more or less Chinese population, for example?

nationality = "China"
foreign_nationality_deviation_density = calculate_relative_deviation_density(
target_population=spain_localities_info[f"Pob. {nationality}"],
reference_population=spain_localities_info["Pob. extranjera"],
area=spain_localities_info["Area"]
)

fig = plot_map(
foreign_nationality_deviation_density,
normalizer=matplotlib.colors.SymLogNorm(
linthresh=10,
linscale=1,
vmin=-foreign_nationality_deviation_density.abs().quantile(0.99),
vmax=foreign_nationality_deviation_density.abs().quantile(0.99),
clip=True
),
name=f"{nationality} Deviation Density",
colormap="bwr"
)

The maps look very similar, but they are different. We can notice that in the subtleties. For exaple, the city of Granada has an unusually low number of foreign people and, hence, an unusually low number of Chinese people living ther. However, given that Granada is like that, i.e. given that it has the number of foreigners it has, the size of the Chinese population is actually higher than expected.

If you are curious about how different nationalities are distributed according to the Relative Deviation Density, play a little with more visualiations in the original post.

Originally published at https://www.listeningtothedata.com.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

--

--

Rubén Chuliá Mena
Rubén Chuliá Mena

Written by Rubén Chuliá Mena

Artificial Intelligence Engineer with experience in ML for bank and insurance sector. I strive with passion to make a meaningful impact in the industry.

No responses yet

Write a response