London Building Stock Model 2
Blog 2: Celebrating London’s data story
Introduction
Nearly two-thirds of London’s CO2 emissions come from buildings, with domestic buildings being a major contributor at 32%. Delivering retrofit to domestic buildings at scale, targeted at the building types where it is most effective, is therefore a key route to net zero. A joint study with London Councils identified ~500,000 homes that require retrofit in the social housing stock alone. However, only a few thousand were retrofitted in 2022.
Data scientists at the GLA have used machine learning and data from a range of sources to provide energy efficiency data for all London homes in a transparent, shareable way that can be regularly updated and improved over time. This data is now available in the London Building Stock Model 2 (LBSM2).

Problem to be solved:
Information about London’s properties is needed to prioritise retrofit programmes, identify fuel poverty and feed into other programmes, such as those addressing damp and mould.
Energy Performance Certificates (EPCs) are generated when a property is sold or re-let. They contain most of the information required to inform energy policy and delivery, but are only available for about 65% of London’s residential properties. There are also gaps in the EPC data and known issues with consistency and data quality.
The problem to be solved is therefore to generate modelled values for those properties without an EPC and to improve the quality of the data using alternative sources. Key variables that were modelled for LBSM2 include:
- Property type & built form
- Floor area & number of habitable rooms
- Property tenure
- Primary fuel type & heating system
- Wall type & insulation
- Roof type & insulation
- Construction age
- Glazing type
- Energy consumption
- Current & potential EPC rating
Who are the users and what do they need to do?

Working with experts at the GLA, London Councils and the London Office of Technology & Innovation, we identified the following groups of users to inform the initial release:
- Policy staff wanting to design energy efficiency programmes or identify the most effective route to decarbonise a local area’s energy system through Local Area Energy Plans.
- Retrofit programme delivery staff wanting to carry out an initial sift to identify properties that might qualify for support through a retrofit programme, such as the Warm Homes: Local Grant.
- Specialist suppliers (for instance, of replacement glazing or heat networks) wanting to identify the scale and location of the market.
What’s new about this work?
Although LBSM2 builds on previous work (see below), it is novel in a number of ways:
- Accessibility – Particular care has been taken to base the modelling either on open data or on Ordnance Survey data (available to public organisations and their partners through the Public Sector Geospatial Agreement, PSGA). This means that the outputs can be shared much more widely and put in the hands of the organisations best placed to deliver much larger retrofit programmes in the future.
- Beyond EPCs – We’ve enhanced the information that is captured in an EPC visit using other sources including the 2021 Census and Land Registry’s Price Paid data. We’ve also included planning data (such as conservation areas) and information that might help targeting areas (such as Fuel Poverty or Deprivation levels).
- Updates – Around 200,000 EPC visits are carried out in London each year and up to 30,000 new properties are built. The data pipelines and modelling have therefore been built so that they can be run regularly (ideally monthly) to keep the LBSM2 up to date.
- Data Explorer – The raw data is too large to be manipulated in standard spreadsheets, and it is such a rich, multi-dimensional set of information that patterns aren’t immediately visible. Therefore, alongside a data download and API (available on request), we’ve also developed a map-based data explorer that allows users to filter and view building-level or summarised data.
- Validation – The outputs of the model have been independently reviewed by Buro Happold at an area level, with a focus on the attributes most important for the likely uses. The review concluded ‘that the model performed well’.
What’s next?
In many ways, our journey with building stock data has just begun.
As well as being a key tool for the delivery of Warmer Homes London, we intend to use the LBSM2 as a core linking dataset for other programmes including work to tackle damp and mould.
We will also look into expanding LBSM2 to include additional data inputs, such as:
- Other housing surveys – Through partnerships with borough housing officers, energy companies, social landlords and others, we’ll actively seek data from other housing surveys to improve the model inputs. If you have any data you would like to share, please contact us at gis@london.gov.uk
- Sensor data – EPC data indicates a property’s theoretical energy consumption, but bringing in actual energy consumption data will help us to build up a set of realistic building archetypes.
- Novel data sources – Thermal imagery and machine-generated building classifications will provide independent inputs to cross-check or validate EPC data.
To some extent, we’ve pushed our EPC-led modelling as far as it can go, given the inconsistencies and errors in the original EPC data. However, we will continue to look for improvements in the modelling approach, particularly now that a clean set of input data is available to other interested modelling teams.
Technical details and further information
Acknowledgements
LBSM2 builds on two particular pieces of work:
1. The London Building Stock Model was built for the GLA by UCL in 2020 based on data from 2017. The work was carried out by the UCL Energy Institute and the Centre for Advanced Spatial Analysis, both in the Bartlett Faculty of the Built Environment. The model contains data on every domestic and non-domestic building in London. Whilst it has proved useful, the GLA wanted to update the model (as version 2) and bring the modelling work in-house as work on building retrofit accelerates.
2. The ONS Data Science Campus published a particularly useful blog evaluating different approaches and data sets. This work focussed on buildings in Wales, so the outputs didn’t help answer questions about London. Nonetheless, the detail on the methodology provided inspiration and reassurance that reasonable results could be achieved using datasets that we could access for London and easily available machine learning tools.
Methodology
The project can be thought of in three broad areas:
1. Data Processing
Energy Performance Certificates – The EU’s Energy Performance of Buildings Directive requires domestic properties to have an Energy Performance Certificate (EPC) when constructed, sold or let. Based on information gathered during an in-person survey, an Energy Efficiency Rating is generated on a scale from 1 to 100 and then mapped to a band of A to G.
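As a rough illustration of the score-to-band mapping, the sketch below assumes the standard published band boundaries for domestic EPCs (A is 92 and above, B 81 to 91, C 69 to 80, D 55 to 68, E 39 to 54, F 21 to 38, G 1 to 20). These thresholds are not stated in this post, and the function name is our own for the example.

```python
# Assumed standard EPC band boundaries: (lowest score in band, band letter),
# listed from the best band down. Not taken from the LBSM2 code.
EPC_BANDS = [(92, "A"), (81, "B"), (69, "C"), (55, "D"), (39, "E"), (21, "F"), (1, "G")]


def epc_band(score: int) -> str:
    """Map an Energy Efficiency Rating (1 to 100) to its A to G band."""
    if score < 1:
        raise ValueError(f"score must be at least 1, got {score}")
    for lowest, band in EPC_BANDS:
        if score >= lowest:
            return band
    raise AssertionError("unreachable: every score >= 1 falls in a band")


bands = [epc_band(s) for s in (95, 72, 54, 12)]
```

Keeping the thresholds in a single table makes it easy to check them against the official definition, or to swap them if a different rating scheme is modelled.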

MHCLG has released EPC data, matched to Ordnance Survey’s Unique Property Reference Number (UPRN). As well as the headline band, this contains the detailed observations for each property covering heating, insulation, size and other key factors needed for planning retrofit programmes. The main challenges encountered included:
- Multiple records exist for some properties (for instance, where a property has been re-let or resold several times). These were sometimes found to contain conflicting values, even for information that shouldn’t change, like the construction age. Therefore, a set of rules had to be designed to take the most recent value, the most common value or an average value, depending on the field.
- Gaps in the data – some fields were missing from the data download. For example, energy consumption wasn’t included and had to be estimated based on fuel bill data.
- Very high cardinality for some fields (for instance, over 100 possible answers), which had to be grouped down to a manageable number of categories, such as 4 or 5, without losing their meaning.
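The first and third challenges above can be sketched in a few lines. This is a minimal illustration, not the LBSM2 pipeline: the field names, the rule assigned to each field, the `lodgement_date` key and the wall-type lookup are all hypothetical choices made for the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical rule per field for resolving conflicting records.
FIELD_RULES = {
    "current_rating": "latest",          # can genuinely change over time
    "construction_age": "most_common",   # should not change, so trust the majority
    "floor_area": "average",             # repeated measurements, so smooth the noise
}

# Hypothetical lookup collapsing raw wall descriptions into a few categories.
WALL_GROUPS = {
    "cavity wall, filled cavity": "cavity_insulated",
    "cavity wall, as built, no insulation (assumed)": "cavity_uninsulated",
    "solid brick, as built, no insulation (assumed)": "solid_uninsulated",
}


def resolve_field(records, field, rule):
    """Collapse one field across several records for the same property."""
    dated = sorted(records, key=lambda r: r["lodgement_date"])  # oldest first
    values = [r[field] for r in dated if r.get(field) is not None]
    if not values:
        return None
    if rule == "latest":
        return values[-1]
    if rule == "most_common":
        counts = Counter(values)
        top = max(counts.values())
        # Ties go to the most recent of the tied values.
        return next(v for v in reversed(values) if counts[v] == top)
    if rule == "average":
        return mean(values)
    raise ValueError(f"unknown rule: {rule}")


def merge_records(records):
    """Produce a single record per property using the per-field rules."""
    return {f: resolve_field(records, f, rule) for f, rule in FIELD_RULES.items()}


def group_wall(description):
    """Map a raw wall description to a coarse category, defaulting to 'other'."""
    return WALL_GROUPS.get(description.strip().lower(), "other")


records = [
    {"lodgement_date": "2012-03-01", "current_rating": 52,
     "construction_age": "1930-1949", "floor_area": 80.0},
    {"lodgement_date": "2018-07-15", "current_rating": 61,
     "construction_age": "1930-1949", "floor_area": 84.0},
    {"lodgement_date": "2023-01-20", "current_rating": 70,
     "construction_age": "1950-1966", "floor_area": 82.0},
]
merged = merge_records(records)
```

Here the merged record takes the latest rating, the majority construction age and the mean floor area. An explicit "other" bucket in the grouping step keeps unmapped descriptions visible rather than silently dropping them.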
2. Modelling
At a high level:
Step 1 – Data cleaning and merging to create a single, pan-London data set containing the known information for each property and its five nearest neighbours, ready for the gaps to be filled by the model.
Step 2 – The tool currently uses LightGBM, a gradient-boosted decision tree algorithm. This is run iteratively in a particular order (the known data is used to predict variable 1, then the known data + variable 1 is used to predict variable 2 and so on).
Step 3 – Create performance reports and map how predictive each input variable is for the subsequent variables.
Step 4 – Fine-tune the model to improve the accuracy for individual fields and the overall distribution of the values. As an example, values for the five nearest neighbours were often important predictors. Restricting these neighbours to properties of the same type (the five nearest houses when predicting for a house, the five nearest flats when predicting for a flat) improved the accuracy scores.
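The chained prediction described in Step 2 can be sketched as follows. To keep the example self-contained, it uses a deliberately simple stand-in model rather than LightGBM. Any estimator exposing `fit(X, y)` and `predict(X)` could be passed in instead, although a real model would also need the categorical values encoded. The function name, field names and sample rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict


class ModeByFeatures:
    """Stand-in model (not LightGBM): predicts the most common target
    value seen in training for an identical combination of features."""

    def fit(self, X, y):
        self.by_key = defaultdict(Counter)
        self.overall = Counter(y)
        for row, target in zip(X, y):
            self.by_key[tuple(row)][target] += 1
        return self

    def predict(self, X):
        fallback = self.overall.most_common(1)[0][0]
        preds = []
        for row in X:
            counts = self.by_key.get(tuple(row))
            preds.append(counts.most_common(1)[0][0] if counts else fallback)
        return preds


def chained_fill(rows, known, targets, make_model=ModeByFeatures):
    """Fill missing targets in order, feeding each filled target forward."""
    rows = [dict(r) for r in rows]      # leave the caller's data untouched
    features = list(known)
    for target in targets:
        train = [r for r in rows if r.get(target) is not None]
        missing = [r for r in rows if r.get(target) is None]
        if train and missing:
            model = make_model().fit(
                [[r[f] for f in features] for r in train],
                [r[target] for r in train],
            )
            preds = model.predict([[r[f] for f in features] for r in missing])
            for r, p in zip(missing, preds):
                r[target] = p
        features.append(target)         # later targets can now use this one
    return rows


rows = [
    {"property_type": "flat", "wall_type": "cavity", "wall_insulation": "insulated"},
    {"property_type": "flat", "wall_type": "cavity", "wall_insulation": "insulated"},
    {"property_type": "house", "wall_type": "solid", "wall_insulation": "uninsulated"},
    {"property_type": "house", "wall_type": None, "wall_insulation": None},
]
filled = chained_fill(rows, known=["property_type"],
                      targets=["wall_type", "wall_insulation"])
```

In this toy run the last property first receives a predicted wall type from its property type, and that predicted wall type then helps predict its insulation. The order of `targets` matters, because each variable can only draw on those filled before it.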
3. Sharing
The data is available here.
LBSM2 has been added to the Department for Science, Innovation and Technology registry on algorithmic transparency, and the record is available here.
The Team:
- Anupam Bose – Data Engineer/Analyst
- Conor Dempsey – Data Scientist
- James Scott-Brown – Data Visualisation Developer
- Karan Luthra – Senior Data Engineer
- Paul Hodgson – Senior Manager – City Data
- Rachel Humphries – Data Scientist
- Ruxanda Profir – Principal Policy and Programme Officer (Energy)
- Yiran Wei – GIS Manager