A friend of mine wants an interactive map of prisons for a project. I like having an excuse to play with libraries I don’t know yet, so I’m going to try to make one.
My friend is gathering a lot of data manually, but as a starting point, let’s see what we can get from Wikipedia.
Web Scraping
I’ve used Beautiful Soup to parse the HTML from Wikipedia’s list of prisons in the UK. To be a good neighbour, I’m not actually re-running the scraping code each time this post is rendered.
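One way to avoid hitting Wikipedia repeatedly is to save the HTML to disk the first time and read the local copy afterwards. The sketch below is only an illustration of that idea, not necessarily how this post does it: the names load_html and fetch_html, the User-Agent string, and the cache file name are all hypothetical, and it uses the standard library's urllib to stay dependency-free.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url):
    """Download a page, identifying the script with a User-Agent header."""
    # The User-Agent value is a placeholder; use something that identifies you.
    request = Request(url, headers={"User-Agent": "prison-map-example/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def load_html(url, cache_path, fetch=fetch_html):
    """Return the page HTML, downloading it only if no cached copy exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cached copy found: no network request is made.
        return cache_path.read_text(encoding="utf-8")
    html = fetch(url)
    cache_path.write_text(html, encoding="utf-8")
    return html


# Hypothetical usage (the file name is an assumption):
# html = load_html(list_page_url, "prisons_list.html")
# soup = BeautifulSoup(html, "html.parser")
```

Passing the fetch function as a parameter keeps the caching logic easy to check without touching the network.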
The page contains seven tables, each of which I’ve extracted as a pandas DataFrame.
import pandas as pd

# `soup` is the BeautifulSoup object for the Wikipedia list page
tables = soup.find_all('table')
data_from_tables = []
for table in tables:
    rows = table.find_all('tr')
    header = rows[0]
    row_cells = header.find_all('th')
    row_titles = [cell.text.strip() for cell in row_cells]
    table_data = {title: [] for title in row_titles}
    table_data['url'] = []
    for row in rows[1:]:
        cells = row.find_all('td')
        # Get the URL from the link in the Name cell
        name_cell = cells[0]
        name_link = name_cell.find('a')
        if name_link:
            prison_url = name_link['href']
        else:
            prison_url = 'No URL'
        table_data['url'].append(prison_url)
        for i in range(len(row_titles)):
            if i >= len(cells):  # If there's no matching cell in this row
                table_data[row_titles[i]].append('')  # Append empty field
            else:
                table_data[row_titles[i]].append(cells[i].text.strip())
    # Store the extracted data, falling back to the raw dict if the
    # columns ended up with different lengths
    try:
        data_from_tables.append(pd.DataFrame(table_data))
    except ValueError:
        data_from_tables.append(table_data)
As well as taking the text from each cell, I’ve taken the URL for each prison’s page, and used that to extract the co-ordinates for each. There’s a little helper in there to parse the degrees, minutes, and seconds into numeric values.
import re

import requests
from bs4 import BeautifulSoup


def parse_coordinates(coord_string):
    """Convert a coordinate string such as 51°30′26″N to signed decimal degrees."""
    # Match integers and decimals, so plain decimal degrees and
    # degrees/minutes/seconds with fractional seconds both parse
    nums = re.findall(r"\d+(?:\.\d+)?", coord_string)
    l_number = 0
    for i, num in enumerate(nums):
        l_number += float(num) / 60**i
    if 'W' in coord_string or 'S' in coord_string:
        return -l_number
    else:
        return l_number


def extract_coordinates(prison_url):
    """Extracts latitude and longitude from a prison's Wikipedia page."""
    try:
        prison_response = requests.get("https://en.wikipedia.org" + prison_url)
        prison_response.raise_for_status()
        prison_soup = BeautifulSoup(prison_response.content, 'html.parser')
        latitude = prison_soup.find('span', class_='latitude').get_text()
        longitude = prison_soup.find('span', class_='longitude').get_text()
        return parse_coordinates(latitude), parse_coordinates(longitude)
    except Exception as e:
        print(f"Error getting coordinates for {prison_url}: {e}")
        return float('nan'), float('nan')  # Return NaN if coordinates are not found


for df in data_from_tables:
    coords_list = []
    for url in df['url']:
        latitude, longitude = extract_coordinates(url)
        coords_list.append((latitude, longitude))
    df['Coordinates'] = coords_list
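To make the arithmetic concrete: a coordinate in degrees, minutes, and seconds becomes decimal degrees as degrees + minutes/60 + seconds/3600, negated for south and west. Here is a stand-alone sketch of that conversion for illustration; the function name dms_to_decimal and the sample strings are my own choices rather than anything taken from the scraped pages.

```python
import re


def dms_to_decimal(coord_string):
    """Convert a string like 51°30′26″N to signed decimal degrees."""
    # Each successive number is worth 1/60 of the previous unit:
    # degrees, then minutes (1/60), then seconds (1/3600).
    parts = re.findall(r"\d+(?:\.\d+)?", coord_string)
    value = sum(float(part) / 60**i for i, part in enumerate(parts))
    # South and west are negative by convention.
    if coord_string.strip().endswith(("S", "W")):
        return -value
    return value


# Sample strings chosen for illustration only.
print(dms_to_decimal("51°30′26″N"))  # 51 + 30/60 + 26/3600
print(dms_to_decimal("0°7′39″W"))    # -(0 + 7/60 + 39/3600)
```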
I then named the tables after their headings on the Wikipedia page.
table_names = [
    "Current England and Wales Prisons",
    "Former England and Wales Prisons",
    "Current Northern Ireland Prisons",
    "Former Northern Ireland Prisons",
    "Current Scottish Prisons",
    "Former Scottish Prisons",
    "Future Prisons",
]
tables = dict([(table_names[i], data_from_tables[i]) for i in range(7)])
Data cleaning
I then did some clean-up:
Prisons run by HMP have a blank operator in the table of current prisons in England and Wales, so I filled that in.
I had to rename the prisons in Northern Ireland. I didn’t bother doing it programmatically because there are only four.
Northern Ireland prisons are run by the Department of Justice NI, which I filled into the Operator column.
I had to find the capacities of the Scottish prisons manually, and use the information in the notes to fill in the operator.
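The operator fills could look something like this in pandas. This is a minimal sketch rather than the exact code I ran: the helper name fill_operator is hypothetical, and the example rows are placeholders, not real table data.

```python
import pandas as pd


def fill_operator(df, default_operator):
    """Return a copy of df with blank or missing Operator values filled in."""
    out = df.copy()
    if "Operator" not in out.columns:
        # No Operator column at all: every row gets the default.
        out["Operator"] = default_operator
        return out
    # Treat both missing values and empty/whitespace strings as blank.
    blank = out["Operator"].isna() | (out["Operator"].astype(str).str.strip() == "")
    out.loc[blank, "Operator"] = default_operator
    return out


# Placeholder rows for illustration only.
example = pd.DataFrame({"Name": ["Prison A", "Prison B"],
                        "Operator": ["", "Example Operator"]})
print(fill_operator(example, "HMP"))
```

The same helper covers the Northern Ireland case, where every row gets the same operator.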
combo = pd.concat([
    tables['Current England and Wales Prisons'][['Capacity', 'Operator', 'Coordinates']],
    tables['Current Northern Ireland Prisons'][['Capacity', 'Operator', 'Coordinates']],
    tables['Current Scottish Prisons'][['Capacity', 'Operator', 'Coordinates']]
])
The last things to do are to:
1. Remove the citations from the Capacity and Operator columns. This turns a value such as 563[17] into 563, which lets the capacities be parsed as numbers.
2. Split the coordinates from a single column of (latitude, longitude) pairs into two columns: latitude and longitude.
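Both steps can be sketched in plain Python. This is an illustration under some assumptions: the helper names are hypothetical, and the pattern only strips numeric markers like [17], not lettered footnotes.

```python
import re

# Matches numeric citation markers such as [17]
CITATION = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove numeric citation markers from a cell's text."""
    return CITATION.sub("", text).strip()


def parse_capacity(text):
    """Return the capacity as an int, or None if the cell holds no number."""
    cleaned = strip_citations(text).replace(",", "")
    return int(cleaned) if cleaned.isdigit() else None


def split_coordinates(pairs):
    """Turn a list of (latitude, longitude) pairs into two separate lists."""
    latitudes = [lat for lat, _ in pairs]
    longitudes = [lon for _, lon in pairs]
    return latitudes, longitudes


print(strip_citations("563[17]"))
print(parse_capacity("563[17]"))
```

On a DataFrame, the same pattern should work through `Series.str.replace(r"\[\d+\]", "", regex=True)` followed by `pd.to_numeric(..., errors="coerce")`.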
I also had to correct a few of the co-ordinates by hand where the script hadn’t picked them up correctly.
Plotting
I then used Plotly’s scatter_mapbox to load a map from Mapbox and plot the prison locations, sized by capacity and coloured by operator. Pretty neat, huh?