First, I need the data. I’m using Plotly’s scatter_mapbox() to create the map. The plan is to have the colours represent the operator, the size represent the operational capacity, then animate over time. I had a nice little tooltip on the static version of the plot, so it would be nice to include other information like the type of prison on that. This means each row of the table needs to contain:
Prison name
Latitude
Longitude
Prison type
Operational capacity
Operator
Date
SQL
For my friend to update the data as he finds it, I have a spreadsheet. For my own purposes, however, I have an sqlite database. Partly I’m hoping having it in this format will help protect me from the perils of spreadsheets, partly I want an excuse to use SQL. I’ve explored SQL through coding problems, but haven’t really used it for anything practical.
Show the code
import sqlite3import pandas as pdimport itertoolsimport plotly.express as px# Connect to the databaseconn = sqlite3.connect('data/prisons.db')# Perform SQL queryquery ="""SELECT s.latitude, s.longitude, p.prison_name, pt.prison_type, pc.operational_capacity, od.operator, od.start_date AS date FROM sites sJOIN prisons p ON s.site_id = p.site_idJOIN prison_types pt ON p.prison_id = pt.prison_idJOIN prison_capacities pc ON p.prison_id = pc.prison_idJOIN operator_dates od ON p.prison_id = od.prison_idWHERE (pt.prison_type, pc.operational_capacity, od.operator, od.start_date) IN ( SELECT DISTINCT prison_type, operational_capacity, operator, operator_dates.start_date FROM prison_types JOIN prison_capacities ON prison_types.prison_id = prison_capacities.prison_id JOIN operator_dates ON prison_types.prison_id = operator_dates.prison_id )"""# Execute the querydf = pd.read_sql_query(query, conn)conn.close()
I’m about to go through the SQL query in some detail. If that sounds dreadful, you should probably skip ahead to the next section.
The basic stuff is at the top. The SELECT statement describes the kind of table that plotly needs, with the correct fields. The FROM and JOIN statements mean we can get all of the data from the tables in the database, using the relationships set up in the data model.
The interesting part is in the WHERE clause. You might notice that the code inside the brackets looks a bit like the overall query. This is a subquery that makes a makes a table with a row for every time there is a new operator for a prison. This works by combining every DISTINCT value taken from these tables. This won’t do exactly what I want in the end: I’m going to want a new row for every time any of these attributes changes, but it’s good enough to start creating this map.
Data manipulation
Cleaning
For some of the values in this table, the dates might be parsed as timestamps. For any dates in the original data, the values will be a four digit integer, as a string. We will tidy these up and initially simply have them as an integer for the year. We also have prisons dating back a long way; for example, the Marshalsea was open from 1373-1842. However, there was not much change back then, so the map is fairly boring early on, and the interesting stuff only comes along in the 20th century, so we can filter out events before then to show a more informative range.
df_after_1900['datestamp'] = pd.to_datetime(df_after_1900['date'], format='%Y')fig = px.scatter_mapbox( df_after_1900, lat ='latitude', lon ='longitude', mapbox_style ='carto-positron', custom_data = ['prison_name', 'operational_capacity', 'operator', 'date'], size ='operational_capacity', color ='operator', animation_frame ='datestamp', animation_group ='prison_name', width =800, height =800, zoom =4.6,)# Update the map centerfig.update_mapboxes( center_lat =54, center_lon=-2.5)# Show the figurefig.show()
Plotly’s API requires that you have a row for every entity at every year you want to see it. The current form of the DataFrame is an entry every time an aspect of a prison changes. This is why we only get spots popping up every so often - Plotly thinks that the prisons only exist when things change. This means that for the kind of map I’m after, I’ll need to make a row for every year for every prison and fill in the data.
0 HMP/YOI Feltham
1 HMP/ YOI New Hall
2 HMP North Sea Camp
3 HMP/ YOI Hollesley Bay
4 HMP/ YOI Prescoed
...
77 HMP/YOI Isis
78 HMP Oakwood
79 HMP Thameside
80 HMP Berwyn
81 HMP Five Wells
Name: prison_name, Length: 82, dtype: object
Then make a range of the years covered in the dataset and make a row for each year for each prison. Luckily, pandas includes one of the weirder join operations, the cross-join, which achieves exactly this.
Most of these rows are empty, because for most years nothing happens to a particular prison. What we can do now is, for each column, fill the data in from the last change, crucially filling only forward. Then we can drop the rows from before a prison opens, because we’ve only filled in forward in time.
Now we have the table in the format required we can plot it… right?
A disappointing map
Show the code
fig = px.scatter_mapbox( df_to_plot, lat ='latitude', lon ='longitude', mapbox_style ='carto-positron', custom_data = ['prison_name', 'operational_capacity', 'operator', 'date'], size ='operational_capacity', color ='operator', animation_frame ='date', animation_group ='prison_name', width =800, height =800, zoom =4.6,)# Update the map centerfig.update_mapboxes( center_lat =54, center_lon=-2.5)# Show the figurefig.show()
I was very sure I had messed up here, somehow, as only some of the prisons show up, and those are only publicly operated. What I eventually figured out is that plotly’s scatter_mapbox will only read the first few thousand rows of a dataframe. This means that only some of the prisons are plotted. It’s tantalisingly close but I can’t see anything in the documentation that means I can change this. I think this means that I’m going to have to do some hard work to plot all of the prisons.
Next steps
I did worry, when I was planning this, that having a row for each year when most years nothing happens was not the ideal solution. I have been messing about with javascript versions of the map, but nothing is ready yet!