For over 40 years the GLA, and its predecessors, have produced population projections for London and the boroughs. These projections form the foundation of much of the GLA’s strategic planning and help to shape a wide range of policy decisions. The projection models themselves are continually evolving as new data and methods are incorporated into existing systems. This blog outlines how concepts and practices from the emerging discipline of data science have been integrated into the work of the GLA’s demography team over the last year.
The project to redevelop the GLA’s modelling framework began with a review of the team’s suite of population and household models in late 2015. This review identified a number of areas where the models could be improved, both methodologically and in terms of workflow and data structures. New datasets and developments in the field of data science provide a context in which more complex data-driven models are possible, offering greater output detail and modelling flexibility.
The open-source environment R was chosen as the platform in which to develop this new suite of robust and methodologically transparent models. The overarching goal is to develop a system in which all of the population and household models are connected through a single user interface.
The project has also provided an opportunity to undertake an audit of how we store, consume and output data. Much of the input data to our models comes directly from third parties (particularly the national statistical agencies ONS, NISRA and NRS) and in various formats. The data is pre-processed so that it conforms to the principles of tidy data, in which each variable forms a column and each row represents a single observation. This not only greatly increases the efficiency with which the data can be processed but also aids in data validation. Data transfer between models is undertaken using the R-native file format ‘Rdata’, which saves read and write time and protects against data corruption.
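As a minimal sketch of this pre-processing step (the dataset, column names and file paths here are hypothetical, and the exact GLA pipeline will differ), base R can reshape a wide table into tidy long format and then save it in an R-native serialised file for fast, lossless transfer between models:

```r
# Hypothetical wide-format input: one row per area, one column per year
wide <- data.frame(
  gss_code = c("E09000001", "E09000002"),
  `2015` = c(8000, 194000),
  `2016` = c(8200, 197000),
  check.names = FALSE
)

# Reshape to tidy long format: one row per observation (area x year)
tidy <- reshape(wide,
                direction = "long",
                varying   = c("2015", "2016"),
                v.names   = "population",
                times     = c(2015, 2016),
                timevar   = "year",
                idvar     = "gss_code")
rownames(tidy) <- NULL

# Save in an R-native format; readRDS() restores the object exactly,
# avoiding the parsing overhead and type ambiguity of csv round-trips
out_file <- file.path(tempdir(), "population_tidy.rds")
saveRDS(tidy, out_file)
```

A downstream model would then call `readRDS(out_file)` and receive the data frame with its column types intact, which is one reason an R-native format is quicker and safer than repeatedly writing and parsing text files.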
Beyond the work of the demography team, the wider Intelligence Unit has seen a rise in interest and capacity for data science projects. A core group of individuals across a range of teams has begun to develop the skills and knowledge necessary to take on Big Data projects.
A vital element of this kind of coding project work is the development of procedures for peer review and quality assurance. The flexibility of R means that there are often many ways to approach the same problem, so having a common set of coding practices facilitates code validation. Having an agreed-upon core of libraries ensures that those working on disparate projects are using the same dialect of the R language.
Another valuable tool in the development of our data science capacity has been knowledge-sharing. Through meetings, presentations and collaborative working we ensure that new knowledge and ways of working are disseminated across teams. Perhaps the most valuable aspect of working within a data science community is the ability to crowd-source answers to problems when you hit a roadblock. Often just explaining an issue to a colleague is enough to trigger an answer in your own mind.
We have also begun to use more tools for collaborative working, such as GitHub, which supports peer review of code and, importantly, version control. This is an area in which we are looking to improve, and we are planning to develop procedures over the coming months which will consolidate this aspect of our workflow.
It is vital that the work we do at the GLA is transparent so that those using our data can have confidence in its validity. To that end the demography team recently commissioned the Centre for Population Change at the University of Southampton to review the methodology and implementation of our Cohort Component Population Model. This is an example of the team reaching out and working with other experts in the field of demography to ensure that our products are the best they can be.
Another aspect of transparency is the availability of data. The new modelling approach allows us to extract and publish a much greater range and variety of data from our models (around 0.6 gigabytes per model run). This allows users to fully understand and investigate the nuances of our population projections in a way that has not previously been possible. We are always working towards ensuring that our outputs are in the most useful formats. This includes producing data in human-readable Excel files but also in machine-readable flat csv files. We also produce bespoke csv outputs designed to be used as input files for specific software applications.
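To illustrate the kind of machine-readable and bespoke outputs described above (the data, column names and file layouts here are invented for the example, not the GLA’s actual output specifications), a model run might write both a standard flat csv and an application-specific variant:

```r
# Hypothetical projection output: one row per area x year
proj <- data.frame(
  gss_code   = c("E09000001", "E09000001", "E09000002"),
  year       = c(2016, 2017, 2016),
  population = c(8200, 8400, 197000)
)

# Machine-readable flat csv for general consumption
csv_file <- file.path(tempdir(), "projection.csv")
write.csv(proj, csv_file, row.names = FALSE)

# A bespoke input file for a specific application might, for example,
# require a fixed column order and no header row (illustrative only)
bespoke_file <- file.path(tempdir(), "bespoke_input.csv")
write.table(proj[, c("year", "gss_code", "population")],
            bespoke_file, sep = ",",
            row.names = FALSE, col.names = FALSE)
```

Keeping a single tidy data frame as the canonical output and deriving each format from it means the human-readable, machine-readable and bespoke files always agree with one another.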
The model review and development process is ongoing. We have already transferred our cohort-component, school rolls and ethnic group projection models into R. Over the coming year we will also redesign and implement the housing-linked and small area models within the suite. We will be looking at developing a platform that can host all of our models and allow them to pass data directly to, and consume data directly from, one another. This latter development will, we hope, provide even greater efficiencies and cut back on the need to write out and read in data files.
The model development work that we are undertaking now will set the scene for the next decade of demographic modelling at the GLA but will also provide a cutting-edge template on which others working in the field can build.