Meeting the limit of Plotly maps

data analysis
prison map
In trying to make an animated prison map, I’ve reached one of Plotly’s limits
Published

2024-03-23

Previously, I have used Plotly to make a map of UK prisons, and built a data model for how they change over time. I had hoped to be able to combine these, but I think I’ll have to change tack to make the animated map I had planned.

Data extraction

First, I need the data. I’m using Plotly’s scatter_mapbox() to create the map. The plan is to have the colours represent the operator, the size represent the operational capacity, then animate over time. I had a nice little tooltip on the static version of the plot, so it would be nice to include other information like the type of prison on that. This means each row of the table needs to contain:

  1. Prison name
  2. Latitude
  3. Longitude
  4. Prison type
  5. Operational capacity
  6. Operator
  7. Date

SQL

For my friend to update the data as he finds it, I have a spreadsheet. For my own purposes, however, I have an sqlite database. Partly I’m hoping having it in this format will help protect me from the perils of spreadsheets, partly I want an excuse to use SQL. I’ve explored SQL through coding problems, but haven’t really used it for anything practical.

Show the code
import sqlite3
import pandas as pd
import itertools
import plotly.express as px

# Connect to the database
conn = sqlite3.connect('data/prisons.db')

# Perform SQL query
query = """
SELECT 
    s.latitude, 
    s.longitude, 
    p.prison_name, 
    pt.prison_type, 
    pc.operational_capacity, 
    od.operator, 
    od.start_date AS date 
FROM 
    sites s
JOIN 
    prisons p ON s.site_id = p.site_id
JOIN 
    prison_types pt ON p.prison_id = pt.prison_id
JOIN 
    prison_capacities pc ON p.prison_id = pc.prison_id
JOIN 
    operator_dates od ON p.prison_id = od.prison_id
WHERE 
    (pt.prison_type, pc.operational_capacity, od.operator, od.start_date) IN (
        SELECT DISTINCT 
            prison_type, 
            operational_capacity, 
            operator, 
            operator_dates.start_date 
        FROM 
            prison_types 
        JOIN 
            prison_capacities ON prison_types.prison_id = prison_capacities.prison_id
        JOIN 
            operator_dates ON prison_types.prison_id = operator_dates.prison_id
    )
"""

# Execute the query
df = pd.read_sql_query(query, conn)
conn.close()

I’m about to go through the SQL query in some detail. If that sounds dreadful, you should probably skip ahead to the next section.

The basic stuff is at the top. The SELECT statement describes the kind of table that plotly needs, with the correct fields. The FROM and JOIN statements mean we can get all of the data from the tables in the database, using the relationships set up in the data model.

The interesting part is in the WHERE clause. You might notice that the code inside the brackets looks a bit like the overall query. This is a subquery that makes a makes a table with a row for every time there is a new operator for a prison. This works by combining every DISTINCT value taken from these tables. This won’t do exactly what I want in the end: I’m going to want a new row for every time any of these attributes changes, but it’s good enough to start creating this map.

Data manipulation

Cleaning

For some of the values in this table, the dates might be parsed as timestamps. For any dates in the original data, the values will be a four digit integer, as a string. We will tidy these up and initially simply have them as an integer for the year. We also have prisons dating back a long way; for example, the Marshalsea was open from 1373-1842. However, there was not much change back then, so the map is fairly boring early on, and the interesting stuff only comes along in the 20th century, so we can filter out events before then to show a more informative range.

Show the code
df.dropna(inplace=True)
df['date'] = df['date'].apply(lambda x: x.year if isinstance(x, pd.Timestamp) else int(x))


df_after_1900 = df.loc[df['date'] >=1900]
df_after_1900.sort_values('date', inplace=True)
df_after_1900.head()
latitude longitude prison_name prison_type operational_capacity operator date
33 51.440556 -0.435278 HMP/YOI Feltham Closed male Young Offender Institution 48 His Majesty's Prison Service 1910
71 53.635833 -1.611667 HMP/ YOI New Hall Local adult female prison 34 His Majesty's Prison Service 1933
73 52.941276 0.062979 HMP North Sea Camp Open adult male prison 42 His Majesty's Prison Service 1935
48 52.051389 1.451944 HMP/ YOI Hollesley Bay Open adult male prison 485 His Majesty's Prison Service 1938
82 51.686111 -2.943333 HMP/ YOI Prescoed Open adult male prison 252 His Majesty's Prison Service 1939

Filling in dates

If we plot this data, you will see a problem.

Show the code
df_after_1900['datestamp'] = pd.to_datetime(df_after_1900['date'], format='%Y')
fig = px.scatter_mapbox(
    df_after_1900,
    lat = 'latitude',
    lon = 'longitude',
    mapbox_style = 'carto-positron',
    custom_data = ['prison_name', 'operational_capacity', 'operator', 'date'],
    size = 'operational_capacity',
    color = 'operator',
    animation_frame = 'datestamp',
    animation_group = 'prison_name',
    width = 800,
    height = 800,
    zoom = 4.6,
)


# Update the map center
fig.update_mapboxes(
    center_lat = 54,
    center_lon= -2.5
)

# Show the figure
fig.show()

Plotly’s API requires that you have a row for every entity at every year you want to see it. The current form of the DataFrame is an entry every time an aspect of a prison changes. This is why we only get spots popping up every so often - Plotly thinks that the prisons only exist when things change. This means that for the kind of map I’m after, I’ll need to make a row for every year for every prison and fill in the data.

First I need all the prison names:

Show the code
prison_names = pd.Series(df_after_1900['prison_name'].unique(), name='prison_name')
prison_names
0            HMP/YOI Feltham
1          HMP/ YOI New Hall
2         HMP North Sea Camp
3     HMP/ YOI Hollesley Bay
4          HMP/ YOI Prescoed
               ...          
77              HMP/YOI Isis
78               HMP Oakwood
79             HMP Thameside
80                HMP Berwyn
81            HMP Five Wells
Name: prison_name, Length: 82, dtype: object

Then make a range of the years covered in the dataset and make a row for each year for each prison. Luckily, pandas includes one of the weirder join operations, the cross-join, which achieves exactly this.

Show the code
df_to_plot = pd.DataFrame({"date": range(min(df_after_1900['date']),
                                         max(df_after_1900['date']))})\
                        .merge(prison_names, how='cross')
df_to_plot.head()
date prison_name
0 1910 HMP/YOI Feltham
1 1910 HMP/ YOI New Hall
2 1910 HMP North Sea Camp
3 1910 HMP/ YOI Hollesley Bay
4 1910 HMP/ YOI Prescoed

Then we can fill in the data from the table taken from the database, so each year there’s a change, the right information is added.

Show the code
df_to_plot = df_to_plot.merge(df_after_1900.drop(columns='datestamp'), how='left').sort_values(['prison_name', 'date'])
df_to_plot.loc[df_to_plot['prison_name'] == 'HMP Altcourse']
date prison_name latitude longitude prison_type operational_capacity operator
66 1910 HMP Altcourse NaN NaN NaN NaN NaN
148 1911 HMP Altcourse NaN NaN NaN NaN NaN
230 1912 HMP Altcourse NaN NaN NaN NaN NaN
312 1913 HMP Altcourse NaN NaN NaN NaN NaN
394 1914 HMP Altcourse NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
8840 2017 HMP Altcourse NaN NaN NaN NaN NaN
8922 2018 HMP Altcourse NaN NaN NaN NaN NaN
9004 2019 HMP Altcourse NaN NaN NaN NaN NaN
9086 2020 HMP Altcourse NaN NaN NaN NaN NaN
9168 2021 HMP Altcourse NaN NaN NaN NaN NaN

112 rows × 7 columns

Most of these rows are empty, because for most years nothing happens to a particular prison. What we can do now is, for each column, fill the data in from the last change, crucially filling only forward. Then we can drop the rows from before a prison opens, because we’ve only filled in forward in time.

Show the code
df_to_plot['prison_type'] = df_to_plot.groupby(['prison_name'])['prison_type'].ffill()
df_to_plot['operator'] = df_to_plot.groupby(['prison_name'])['operator'].ffill()
df_to_plot['operational_capacity'] = df_to_plot.groupby(['prison_name'])['operational_capacity'].ffill()
df_to_plot['latitude'] = df_to_plot.groupby(['prison_name'])['latitude'].ffill()
df_to_plot['longitude'] = df_to_plot.groupby(['prison_name'])['longitude'].ffill()
df_to_plot['date'] = pd.to_datetime(df_to_plot['date'], format='%Y')

df_to_plot.dropna(inplace=True)
df_to_plot.sort_values('date', inplace=True)
df_to_plot.head()
date prison_name latitude longitude prison_type operational_capacity operator
0 1910-01-01 HMP/YOI Feltham 51.440556 -0.435278 Closed male Young Offender Institution 48.0 His Majesty's Prison Service
82 1911-01-01 HMP/YOI Feltham 51.440556 -0.435278 Closed male Young Offender Institution 48.0 His Majesty's Prison Service
164 1912-01-01 HMP/YOI Feltham 51.440556 -0.435278 Closed male Young Offender Institution 48.0 His Majesty's Prison Service
246 1913-01-01 HMP/YOI Feltham 51.440556 -0.435278 Closed male Young Offender Institution 48.0 His Majesty's Prison Service
328 1914-01-01 HMP/YOI Feltham 51.440556 -0.435278 Closed male Young Offender Institution 48.0 His Majesty's Prison Service

Now we have the table in the format required we can plot it… right?

A disappointing map

Show the code
fig = px.scatter_mapbox(
    df_to_plot,
    lat = 'latitude',
    lon = 'longitude',
    mapbox_style = 'carto-positron',
    custom_data = ['prison_name', 'operational_capacity', 'operator', 'date'],
    size = 'operational_capacity',
    color = 'operator',
    animation_frame = 'date',
    animation_group = 'prison_name',
    width = 800,
    height = 800,
    zoom = 4.6,
)


# Update the map center
fig.update_mapboxes(
    center_lat = 54,
    center_lon= -2.5
)

# Show the figure
fig.show()

I was very sure I had messed up here, somehow, as only some of the prisons show up, and those are only publicly operated. What I eventually figured out is that plotly’s scatter_mapbox will only read the first few thousand rows of a dataframe. This means that only some of the prisons are plotted. It’s tantalisingly close but I can’t see anything in the documentation that means I can change this. I think this means that I’m going to have to do some hard work to plot all of the prisons.

Next steps

I did worry, when I was planning this, that having a row for each year when most years nothing happens was not the ideal solution. I have been messing about with javascript versions of the map, but nothing is ready yet!