The GLA’s data science journey
Over the past couple of years the Greater London Authority has been developing its data science capacity. This blog post aims to give a brief snapshot of where we are and where we are headed, in terms of skills, organisation and infrastructure.
The Intelligence Unit at the GLA comprises a number of teams with many years of experience between them in areas including economics, opinion research, GIS, crime, census and demography.
If you have been following the development of the field of data science, you are likely to have come across Drew Conway’s data science Venn diagram, which describes data science as the intersection of hacking skills, maths and statistics knowledge, and substantive expertise (or domain knowledge). Traditionally, the Intelligence Unit at the GLA has been extremely strong on substantive expertise, with some good maths and statistics knowledge. We have recently been working to increase our hacking skills: moving areas of work from Excel into R and Python, making use of PostgreSQL databases, Amazon Web Services and APIs, employing data scientists and software engineers, and offering training and support for existing staff. In this way we are starting to add data science to the traditional research where we are already strong.
The two examples below show how we have been making use of these new skills to answer questions relevant to the environment and policing teams within the GLA.
Question: How many green roofs are there in London?
The environment team at the GLA is working to encourage the planting of green roofs in London as a way to increase the city’s green space, biodiversity and carbon capture. Green roofs also absorb rainwater, which helps to reduce flooding, and help to insulate buildings. Until now, the uptake of green roofs has been measured manually: green roofs are counted on aerial photographs of sample areas of London and the count is extrapolated to the rest of the city. The challenge is to create an automated method to recognise green roofs. The approach that we are currently working on, which is still very much experimental, uses three datasets: aerial photographs, near infra-red imaging, and Ordnance Survey building outline maps. Healthy plants reflect near infra-red radiation strongly, giving a strong signal in near-IR images which can be matched against the Ordnance Survey building outlines to identify where there is plant growth on a building. The manually identified green roof areas are used as training data for a machine learning algorithm, which then predicts where green roofs appear in other areas.
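To illustrate how a near-IR signal can be combined with building outlines, here is a minimal sketch in Python. It is not the GLA's implementation. It assumes the near-IR and red bands have already been aligned on the same pixel grid, that the building outline has been rasterised into a 0/1 mask, and that vegetation is scored with the standard Normalised Difference Vegetation Index (NDVI), which the post does not name. The function names, the toy pixel values and the 0.3 threshold are all hypothetical.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index for one pixel, in [-1, 1]."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def vegetated_fraction(nir_band, red_band, building_mask, threshold=0.3):
    """Fraction of pixels inside a building outline whose NDVI exceeds threshold.

    All three inputs are equally sized 2D lists. The threshold is an
    illustrative assumption, not a calibrated value.
    """
    inside = 0
    vegetated = 0
    for nir_row, red_row, mask_row in zip(nir_band, red_band, building_mask):
        for nir, red, in_building in zip(nir_row, red_row, mask_row):
            if not in_building:
                continue  # ignore pixels outside the building outline
            inside += 1
            if ndvi(nir, red) > threshold:
                vegetated += 1
    return vegetated / inside if inside else 0.0


# Toy 3x3 tile (made-up reflectance values, not real imagery).
nir_band = [[200, 210, 50], [190, 60, 55], [40, 45, 50]]
red_band = [[60, 50, 45], [70, 58, 50], [42, 44, 48]]
building_mask = [[1, 1, 1], [1, 1, 0], [0, 0, 0]]

fraction = vegetated_fraction(nir_band, red_band, building_mask)
print(f"Vegetated share of roof: {fraction:.0%}")
```

A per-building score like this could serve as one input feature for a classifier trained on the manually identified green roofs, rather than as a decision rule on its own.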
Question: How can we build a picture of unreported crime?
It is estimated that only 30% of violent crime is reported to the police, but there are other government bodies that collect data related to violence. As part of the Home Office-funded ‘Information Sharing to Tackle Violence’ programme, the GLA’s Safestats multi-agency crime and community safety data system has been collating data from hospital A&E departments in an effort to measure and locate unreported crimes. When victims turn up at A&E, details of where the crime occurred are recorded. The quality of this data can range from an exact address to a vague area, and it is always recorded as free text. We are developing an automated clean-split-match procedure which takes the address data, cleans it, splits it into relevant fields, then matches those fields to location databases and, crucially, attaches a confidence measure to each matched location.
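The clean-split-match idea can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the procedure used in Safestats. The three-street gazetteer, the abbreviation table, the field names and the use of a string similarity ratio as the confidence measure are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stand-ins for a real location database and abbreviation list.
GAZETTEER = ["brixton road", "camden high street", "oxford street"]
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}


def clean(raw):
    """Lower-case, strip punctuation, collapse spaces, expand abbreviations."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


def split(cleaned):
    """Split into an optional leading house number and the remaining text."""
    parts = re.match(r"^(?P<number>\d+[a-z]?\b)?\s*(?P<street>.*)$", cleaned)
    return {"number": parts.group("number"), "street": parts.group("street")}


def match(street):
    """Return the closest gazetteer entry and a 0-1 similarity score."""
    scored = [(SequenceMatcher(None, street, name).ratio(), name)
              for name in GAZETTEER]
    confidence, best = max(scored)
    return best, round(confidence, 2)


def locate(raw):
    """Run clean, split and match on one free-text location."""
    fields = split(clean(raw))
    best, confidence = match(fields["street"])
    return {"number": fields["number"], "match": best, "confidence": confidence}


exact = locate("12 Brixton Rd.")
vague = locate("nr Camden Hgh St")
print(exact)
print(vague)
```

Keeping the confidence alongside each match means that low-scoring records can be excluded from analysis or flagged for manual review, instead of being silently assigned to the wrong place.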
The age and size of government institutions can mean that there are barriers to change. Traditionally government has been slow to take up new technologies and working methods, and has worked in a very siloed manner. This has led to a system where data collection and storage are not standardised, and protocols for data sharing are patchy. To address this, the Borough Data Partnership has been founded to encourage data sharing and standardisation, as well as the sharing of skills and problems. As part of this, a London Office of Data Analytics (LODA) is being piloted, in partnership with Nesta and ASI, as a test case for borough data sharing. Boroughs put forward 20 different ideas, and the problem of identifying unlicensed Houses in Multiple Occupation (HMOs) is currently being taken forward. This test case is expected to deliver quick, actionable results because the boroughs involved already hold good data, housing quality is a pressing issue for Londoners, and there is potential for councils to recover lost revenue from licensing fees. Keep an eye out for updates on the blog as this project progresses!
To support the expansion of our data capabilities, we have been investing in technology to store and access the different types of data that are held by London government. The City Data Store, a project which is being implemented by our partners Mastodon C, will allow for secure storage of sensitive data, a federated access system where other parties such as London’s local authorities can store and control access to their own data, and the ability to interact with the data via APIs. This will help to break down some of the barriers to data sharing between different organisations while still maintaining the security of sensitive data, and the APIs will allow for real-time access to datasets, including data that can be shared with the public to encourage crowdsourced solutions to London’s challenges. The City Data Store will be ready to accept data from IoT (internet of things) sources as they start to proliferate through the city as part of the Horizon 2020 smart cities initiative.