Introduction

This page provides the R code used to produce the GLA Pan-London School Roll Projections. An overview of the methodology and analysis of the model’s outputs can be found in the GLA Intelligence Unit report 2015-02: Pan-London demand for school places.

A flow diagram describing how the model operates can be viewed here and in Appendix D of the report).

The model code and csv inputs can be downloaded from the pan-London demand dataset page.

Users will not be able to run the model themselves using this code as the GLA are unable to publish the underlying data. The data used is an extract of the Department for Education’s National Pupil Database (NPD). This is a pupil-level dataset provided to the GLA under special licence. For further information or to request the NPD from DfE go to the GOV.UK website.

A list of the data inputs to the model, including fields, can be found in Appendix C of the report.

If you have any queries please contact demography@london.gov.uk.

Versions

This code was written and run in R version 3.2.0 – 64 bit – “Full of Ingedients” (download) using R Studio version Version 0.98.1102 (download).

3 non-standard packages are necessary to run the code, all of which are avalable in CRAN: dplyr, tidyr and data.table. See package help for full details.

The Pan-London Model

Initial settings

library(data.table)
library(dplyr)
library(tidyr)
WD <- "~/Models/Pan London SRP"

subDir <- substr(as.character(Sys.time()),1,16)
subDir <- gsub(":", "", subDir)
dir.create(file.path(WD, "Outputs", subDir), showWarnings = F)
WD.OUT <- file.path(WD, "Outputs", subDir)

Data sources and User-defined variables

The Model uses two main sources of data. The first is the GLA ward population projections (2014 round).

The second is the NPD which is processed prior to being imported into R to filter the number of records and the number of fields down to just those needed. The NPD is also split into two dataset, one for primary pupils (4 to 10) and one for secondary (11 to 15).

The Roll.Year variable is the date of the NPD extract. In this case the data used is from the January 2015 NPD.

SHLAA <- "ward_SHLAA_2014.csv"

secondary.npd.data <- "Secondary_wardflows_2015.csv"

primary.npd.data <- "Primary_wardflows_2015.csv"

Roll.Year <- 2015 

Derived variables

Output.Start.Year <- Roll.Year -1
Output.End.Year <- Output.Start.Year + 10

Bespoke Functions

The main code module in R calls a number of functions written specifically for this project. These functions are saved in the RScripts/Functions folder and are sourced here.

func.loc <- file.path(WD,"Model/R Scripts/Functions/")
func.list <- list.files(path=func.loc)

for(i in 1:length(func.list)){
        func <- paste(func.loc,func.list[i],sep="/")
        source(func)
}

Read-in population projection

Read-in the 2014 round ward-level SHLAA capped household size projections.

setwd(file.path(WD,"Model/Inputs"))

Input.Projection <- read.csv(SHLAA, header=T, nrow=1)
Input.Projection <- read.csv(SHLAA, header=T,
                            colClasses=c(rep("character",4),
                            rep("numeric",(ncol(Input.Projection)-4))))

Data.Col <- which( colnames(Input.Projection)=="X0")
Proj.First.Year <- min(Input.Projection$Year)
Proj.Last.Year <- max(Input.Projection$Year)
Proj.Rng <- Proj.Last.Year - Proj.First.Year + 1

Read in lookup

A ward to district lookup is needed for filtering and aggregating within the code and functions.

setwd(file.path(WD,"Model/Inputs"))
ward.district.lookup <- read.csv("ward to district.csv", header=T, colClasses="character")

List of wards in City of London

A list of wards in the City of London is also needed. This is because the GLA only projects at borough-level for the City meaning inputs at ward level need to be identified and aggregated.

city.wards <- c(paste("E0500000",c(1:9),sep=""),paste("E050000",c(10:25),sep=""))

Pan London Model

The main code of the model runs twice, once for primary projections and once for secondary. This is achieved by containing the model code within a function whose sole argument is the cohort being projected.

Pan.London.Projections <- function(Cohort){

At the begining of the function three variables are set:
1. LBound.Age The pupil minimum age,
2. UBound.Age The maximum pupil age
3. Roll.File The name of the relevant npd extract (primary or secondary)

if(Cohort == "Primary"){
        LBound.Age <- 4
        UBound.Age <- 10
        Roll.File <- primary.npd.data
}

if(Cohort == "Secondary"){
        LBound.Age <- 11
        UBound.Age <- 15
        Roll.File <- secondary.npd.data
}        

Read in NPD data

Read in to school data and remove any flows from Wales.

setwd(file.path(WD,"Model/Inputs"))

NPD.Extract <- read.csv(Roll.File, header=T,
                        colClasses=c("character","character",
                                     "numeric","numeric",
                                     "character","character")) %>%

        mutate(Country = substr(HomeWard,1,1))%>%
        filter(Country!="W")%>%
        select(-Country) %>%
        setnames(c("HomeWard","SchoolWard","PupilFlow",
                   "PupilsOnRoll","SchoolLA","HomeLSOA"))

Ward lists

Two ward lists are needed for filtering and aggregating. The first is a list of all wards in London and the second is a list of wards in London excluding those in City of London.

London.Wards <- filter(ward.district.lookup,
                       substr(ward.district.lookup$District,1,3)=="E09")
London.Wards.Excl.City <- filter(London.Wards,District!="E09000001")

London.Wards <- as.character(London.Wards$Ward)
London.Wards.Excl.City <- as.character(London.Wards.Excl.City$Ward)

State school ward residents

This is a dataframe based on the NPD extract containing the total children on roll in each ward of residence.

ward.roll.residents <- select(NPD.Extract,HomeWard,PupilsOnRoll) %>%
        unique()

Base Population

This is a dataframe based on the input population. It sums the single year of age data into a single cohort, either primary or secondary.

Base.Popn <- select(Input.Projection,GSS.Code,Year) %>%

        mutate(Data = rowSums(select(Input.Projection,
                                    c((Data.Col + LBound.Age):(Data.Col+UBound.Age))))) %>%

        spread(Year, Data) %>%
        setnames(c("Ward",Proj.First.Year:Proj.Last.Year))

Wards Outside London

Wards that aren’t in London aren’t included in the input SHLAA projection, therefore projection data must be approximated and appended. 2011 Census ward populations are projected forward using borough-level growth rates from the 2012 SNPP.

  1. Extract a dataframe of wards not in London from the roll data.
outside.london <- filter(NPD.Extract,!HomeWard %in% London.Wards) %>%
        select(HomeWard) %>%
        unique()
  1. The necessary dataframes and variables are passed to a function to calculate ward level projections for wards outside London. This requires both census population data and SNPP data to be read in.
    View the function here: outside.london. The approximated projection for wards outside London is then appended to the GLA ward projection.
setwd(file.path(WD,"Model/Inputs"))

census.ward.pop <- read.csv("WARD POP CENSUS 2011.CSV",header=T,
                     colClasses=c("character","character","numeric")) %>%
        setnames(c("Ward","Age","Popn"))

snpp <- read.csv("2012 SNPP Population persons.csv",header=T,
                 colClasses=c(rep("character",5),
                              rep("numeric",27)))

outside.london <- Outside.London(ward.district.lookup, outside.london,
                                 LBound.Age, UBound.Age, Proj.Rng,
                                 census.ward.pop, snpp) %>%
                setnames(names(Base.Popn))

Base.Popn <- rbind(Base.Popn, outside.london)

Rebase the projection from June to August

The base projection is an estimate of the population at mid-year (June) while the NPD data is collected in August. To address this difference the base projection is rolled forward by two months using the Rebase function (see below).

Base.Popn <- Rebase(Base.Popn,2)

Not all years in the base projection are needed. Data for the base year (the year for which the NPD data is available) is saved to a dataframe called Indep.School.Estimates. The redundant years are then removed from the Base.Popn dataframe.

start.col <- Roll.Year - Proj.First.Year + 1
col.rng <- Output.End.Year - Roll.Year

Indep.School.Estimates <- select(Base.Popn,1,start.col)
Base.Popn <- select(Base.Popn,Ward,start.col:(start.col+col.rng+1))      

The City of London

Due to the small size of wards within the City of London the GLA projections only project borough-level data for the City. In order to make the NPD data comparable the ward-level flows it contains must be aggregated to borough.

  1. Change City of London ward codes to the borough code in the NPD data.
change.ward.to.borough <- filter(NPD.Extract, HomeWard %in% city.wards) %>%
        select(-HomeWard) %>%
        mutate(HomeWard = "E09000001") %>%
        select(HomeWard,SchoolWard:HomeLSOA)

NPD.Extract <- filter(NPD.Extract, !HomeWard %in%  city.wards) %>%
        rbind(change.ward.to.borough)    
  1. Exract from the total residents on roll dataframe the resident populations for wards in City. Sum these to obtain a total resident population for the borough.
city.total <- filter(ward.roll.residents, HomeWard %in% city.wards) %>%
                select(PupilsOnRoll) %>%
                sum()
  1. Remove city wards from the master list of ward flows and add in the newly created borough level data.
city <- data.frame(HomeWard="E09000001",PupilsOnRoll=city.total) %>%
        setnames(names(ward.roll.residents))


ward.roll.residents <- filter(ward.roll.residents,!HomeWard %in% city.wards) %>%
        rbind(city) 
  1. Add the summed ward population to the NPD dataset
NPD.Extract$PupilsOnRoll <- ifelse(NPD.Extract$HomeWard=="E09000001", city.total, NPD.Extract$PupilsOnRoll)

Calculate Number and Proportion at Independent School

The NPD constitutes a count of residents in a borough attending state school. The base projection is an estimate of all residents in the ward. Therefore the difference between the two can be used as an estimate for the independent school population.

Indep.School.Estimates <- left_join(Indep.School.Estimates, ward.roll.residents,
                                    by=c("Ward"="HomeWard")) %>%

        setnames(c("Ward","Total.Pop","State.Pop")) %>%
        mutate(Indy.Pop = Total.Pop - State.Pop,
               Indy.pc = ((Total.Pop - State.Pop)/Total.Pop)*100,
               State.pc = 100-(((Total.Pop - State.Pop)/Total.Pop)*100))

Where Roll is larger than Projection

In some cases the resident population recorded in the NPD is greater than the estimated population in the GLA projection. (See the accompanying report for an explanation Projected demand for school places).

In the model such instances are dealt with by adding the negative difference to the base population and setting the independent population estimate to zero.

  1. Create a dataframe of the population to be added to the base population. Wards where the calculated independent population is positive are set to zero. Wards where it is negative are set to the negative value multiplied by minus one.
Adjustment <- Indep.School.Estimates %>%
        mutate(adjustment = ifelse(Indep.School.Estimates$Indy.Pop<0,
                                   Indep.School.Estimates$Indy.Pop*-1,0)) %>%
        select(Ward, adjustment)
  1. Join the projection and add.back.in dataframes together.
Base.names <- names(Base.Popn)
Base.Popn <- left_join(Base.Popn,Adjustment,by="Ward")
  1. Add the adjustment to each year of the projection.
LastColumn <- ncol(Base.Popn)

Base.Popn <- cbind(select(Base.Popn,Ward),
                   lapply(Base.Popn[2:(LastColumn-1)],
                          function(x) x+Base.Popn[LastColumn])) %>%
                setnames(Base.names)
  1. Save the base population for output.
res.pop.for.output <- Base.Popn

Recalculate Indepedent school estimates

Recreate the Indep.School.Estimates data based on the new projection numbers. A shortcut is to set the independent population to zero rather than running the entire calculation again.

Indep.School.Estimates <- mutate(Indep.School.Estimates,
                        temp = ifelse(Indep.School.Estimates$Indy.Pop<0, 0, Indep.School.Estimates$Indy.Pop),

                        temp1 = ifelse(Indep.School.Estimates$Indy.Pop<0, 0, Indep.School.Estimates$Indy.pc),

                        temp2 = ifelse(Indep.School.Estimates$Indy.Pop<0, 100, Indep.School.Estimates$State.pc)) %>%

                        select(Ward,Total.Pop,State.Pop,temp,temp1,temp2) %>%

                        setnames(c("Ward","Total.Pop","State.Pop","Indy","Indy.pc","State.pc"))

Join the independent school estimates to the base population projection dataframe

Base.Popn <- left_join(Base.Popn,Indep.School.Estimates,by="Ward")

Run Projections

Column headers

col.labels <- c("Ward",Output.Start.Year:Output.End.Year)

Demand by ward of residence

Determine school place demand if all future demand were met in ward of residence. Uses the Demand.Residence function (see below).

Residence.1 <- Demand.Residence(Base.Popn, type="abs") %>%
        setnames(col.labels)

Residence.2 <- Demand.Residence(Base.Popn, type="rate") %>%
        setnames(col.labels)

Demand by current patterns of mobility

Determine school place demand if all future demand were a distributed according to of current patterns of mobility.

  1. Determine home ward to school ward propensity rates
Mobility.Rates <- NPD.Extract %>%
        mutate(Prop = PupilFlow/PupilsOnRoll) %>%
        select(HomeWard,SchoolWard,Prop)
  1. Pass propensity rate and residence projection data to the Mobility.Demand function (see below).
Mobility.1 <- Mobility.Demand(Mobility.Rates, Residence.1) %>%
        setnames(col.labels)

Mobility.2 <- Mobility.Demand(Mobility.Rates, Residence.2) %>%
        setnames(col.labels)

Hybrid model of demand

Determine school place demand if a constant number of children adhered to curent patterns of mobility but any population growth was met locally with the ward of residence. Uses the Hybrid function (see below)

Hybrid.1 <- Hybrid(Residence.1, NPD.Extract) %>%
        setnames(col.labels)

Hybrid.2 <- Hybrid(Residence.2, NPD.Extract) %>%
        setnames(col.labels)

Output Data

Set up row and column labels.

row.labels <- data.frame(Wd = c("E09000001", London.Wards.Excl.City)) %>%
        mutate(Ward = as.character(Wd))%>%
        select(Ward)%>%
        left_join(ward.district.lookup,by="Ward")

col.labels <- c("Ward.Code","Ward.Name","Borough.Code","Borough.Name")

for(i in Output.Start.Year:Output.End.Year){
        col.labels <- c(col.labels,paste(i,"/",i-1999,sep=""))
}

Outputs are saved using a function which takes as input one of the eight projections and produces ward and borough files for state demand projections and ward-level independent school projections. The final outputs are the population bases on which the projections are based. Again these are published at ward and borough level.

output(Residence.1, "Residence", "N", col.labels, row.labels, Cohort, WD.OUT)
output(Residence.2, "Residence", "P", col.labels, row.labels, Cohort, WD.OUT)
output(Independent.1, "Independent", "N", col.labels, row.labels, Cohort, WD.OUT)
output(Independent.2, "Independent", "P", col.labels, row.labels, Cohort, WD.OUT)
output(Mobility.1, "Mobility", "N", col.labels, row.labels, Cohort, WD.OUT)
output(Mobility.2, "Mobility", "P", col.labels, row.labels, Cohort, WD.OUT)
output(Hybrid.1, "Hybrid", "N", col.labels, row.labels, Cohort, WD.OUT)
output(Hybrid.2, "Hybrid", "P", col.labels, row.labels, Cohort, WD.OUT)

output.residents.base(row.labels, res.pop.for.output, WD.OUT)

End function

The Pan.London.Projections function is closed

}

Run the Pan-London Model

The main function is called twice, once for primary and once for secondary.

Pan.London.Projections("Primary")
Pan.London.Projections("Secondary")

Additional Functions

Outside London

(return to main code)
This function creates ward level projections for wards outside London using the ONS SNPP.

The inputs are:
1. lookup – the ward GSS code to local authority district GSS code lookup
2. wards.to.project – a dataframe of wards for which projections are needed
3. LBound.Age – the youngest age in the cohort (4 for Primary, 11 for Secondary)
4. UBound.Age – the oldest age in the cohort (10 for Primary, 15 for Secondary)
5. Years.to.Project – the projection range
6. census.ward.pop – 2011 Census single year of age (sya) data for all wards in England & Wales
7. snpp – the 2012 ONS SNPP, sya, persons

Outside.London <- function(lookup, wards.to.project, LBound.Age, UBound.Age, Years.to.Project, census.ward.pop, snpp){


        #Add borough GSS code to wards to project
        wards.to.project$Borough <- lookup[match(wards.to.project$HomeWard,lookup$Ward),3]
                setnames(wards.to.project,c("Ward","Borough"))


        #Group census sya ward data into age cohort
        census.ward.pop <- filter(census.ward.pop, Age %in% c(LBound.Age:UBound.Age)) %>%
                group_by(Ward)%>%
                summarise(Popn = sum(Popn))


        #Group snpp sya ward data into age cohort
        snpp <- filter(snpp, snpp$Age %in% c(LBound.Age:UBound.Age)) %>%
                select(-areaname,-areatype,-AgeGroup,-sex) %>%
                group_by(areacode) %>%
                summarise_each(funs(sum)) %>%
                setnames("areacode","District")


        #calculate year on year growth rates for boroughs
        borough.growth <- cbind(select(snpp,District),
                   lapply(3:ncol(snpp), function(x) (snpp[x]-snpp[x-1])/snpp[x-1]))


        #snpp projects to 2037, GLA to 2041
        #extra years are added by rolling forward 2037 rates

        last.Column <- ncol(borough.growth)
        extra.Columns <- Years.to.Project - last.Column

        borough.growth <- cbind(borough.growth,
                                lapply((last.Column + 1):(last.Column + extra.Columns),
                                 function(x) borough.growth[last.Column]))  %>%
                as.data.frame()


        #apply borough growth rates to wards to approximate ward population changes

        ward.growth <- left_join(wards.to.project, borough.growth, by=c("Borough"="District"))

        EW.ward.projection <- left_join(wards.to.project, census.ward.pop, by="Ward") %>%
                select(Ward,Popn)


        for(x in 2:Years.to.Project){

                EW.ward.projection[x+1] <- EW.ward.projection[x] * (ward.growth[x+1] + 1)

        }

return(EW.ward.projection)

}

Rebase Projections

(return to main code)
This function rolls the base projection of residents forward by a specified number of months. The inputs are:
1. Projection – The base projection 2. months – The number of months to roll forward

Rebase <-function(Projection, months){

        #Each column of the Projection dataframe, except the first, is a year

        #A year is rebased by taking x% of that year and adding (100-x)% of next year.

        #The Final year of the projection is dropped as the necessary data is not available

        all.columns <- ncol(Projection)
        output.columns <- all.columns-1

        follow.year.proportion <- (1/12) * months
        current.year.proportion <- 1 - follow.year.proportion

        nm <- names(Projection[1:output.columns])

        ReBased <- select(Projection, 1)

        for(i in 2:output.columns){
                ReBased[i]<-(Projection[i]*current.year.proportion)+
                                (Projection[i+1]*follow.year.proportion)
        }

        setnames(ReBased, nm)

return(ReBased)

}

Demand Residence

(return to main code)
This function creates a roll projection for both state and independent school by ward of residence.

The function inputs are:
1. Projection – The ward level projection of all children
2. type – the method for determining how many children are expected to attend independent school. Will either be “˜abs’ (static number) or “˜rate’ (static preportion).

Demand.Residence <- function(Projection, type){

        #The number of columns in projection input file
        Columns <- ncol(Projection)

        #The last column with projection data
        last.col <- Columns-5

        #The column containing the number of children at independent school
        indep.pop.col <- which(colnames(Projection)=="Indy.Pop")
        indep.pc.col <- which(colnames(Projection)=="Indy.pc")

        #Loop through each year of the projection and
        #subtract pupils expected to be at independent school

        #If type is set to absolute then this a constant value each year
        #If type set to rate then value is a proportion of the total population

        state <- select(Projection,1)
        independent <- select(Projection,1)

        for(i in 2:last.col){

                if(type=="abs"){state[i] <- Projection[i]-Projection[indep.pop.col]}

                if(type=="rate"){state[i] <- Projection[i]-((Projection[indep.pc.col]/100)*Projection[i])}
        }

        for(i in 2:last.col){

                independent[i] <- Projection[i] - state[i]
        }

return(list(state,independent))

}

Mobility Demand

(return to main code)
This function applies the 2014 pattern of ward to ward flows to the projected on-roll residential population.

The inputs are:
1. Propensity.Rates – Ward of residence to ward of schooling propensity rates
2. Projection – The on-roll projection by ward of residence

Mobility.Demand <- function(Propensity.Rates, Projection){

        rates.and.projection <- merge(Propensity.Rates, Projection, by.x="HomeWard", by.y="Ward", all=T)

#Multiply each year's projected population by each of the propensity rates this distributes the projection which is on a residence base to the school ward those children would attend if 2014 patterns held true throughout the projection period

        last.col <-ncol(rates.and.projection)-2

        mobility <- select(rates.and.projection,SchoolWard)

        for(i in 2:last.col){

                mobility[i] <- rates.and.projection[3] * rates.and.projection[i+2]

        }


        mobility[is.na(mobility)]<-0

        #Sum ward flows for each school ward

        mobility <- as.data.table(mobility)
        mobility <- mobility[ , lapply(.SD, sum), by = "SchoolWard"] %>%
                as.data.frame()

        # Sort

        mobility <- mobility[order(mobility[,1]),]

return(mobility)

}

Hybrid

(return to main code)
The inputs are:
1. Projection – The on-roll projection by ward of residence
2. NPD.Data – The ward to ward flows for 2015

Hybrid <-function(Projection, NPD.data){


#For each year calculate growth of population defined as difference between that year's population and the 2014/15 baseline population

        growth <- select(Projection,1)

        for(i in 2:ncol(Projection)){

                growth[i]<-Projection[i]-Projection[2]
        }


#Import the 2015 distribution of pupils by ward from the roll data. Column 2 is school ward, column 3 is flow. Add together flows for each unique school ward.

        NPD.data <- select(NPD.data,2,3) %>%
                as.data.table()

        NPD.data <- NPD.data[ , lapply(.SD, sum), by = "SchoolWard"] %>%
                as.data.frame()


#Join the 2015 roll data to the 2014-23 growth data. Where there are missing values these will be noted as NA by R. Convert all NA values to zeros

        growth.and.NPD <- merge(growth,NPD.data,by.x="Ward",by.y="SchoolWard",all=T)

        growth.and.NPD[is.na(growth.and.NPD)] <- 0

#For each column (year) in the projection add the 2015 school attendees for the ward to the additional resident population in the ward

        last.col <- ncol(growth.and.NPD)
        hybrid <- select(growth.and.NPD,1)

        for(i in 2:(last.col-1)){

               hybrid[i] <- growth.and.NPD[i] + growth.and.NPD[last.col]
        }

# Sort
        hybrid <- hybrid[order(hybrid[,1]),]


return(hybrid)

}

Output

(return to main code)
The inputs are:
1. projection – The projection of school places demand to be output
2. file.name – The desired name of the output file
3. indy – The independent take-up assumption used in the projection (N for static number or P for static proportion)
4. col.labels – a character vector containing the column heading for the output file
5. row.labels – a one-column dataframe containing the list wards to be output
6. Cohort – The cohort being projected (Primary or Secondary)
7. WD.OUT – The output directory

output <- function(projection, Base.Popn, file.name, indy, col.labels, row.labels, Cohort, WD.OUT){

#Create an output directory if one doesn't already exist
        dir.create(file.path(WD.OUT, Cohort), showWarnings = FALSE)

#Create sub-directories
        dir.create(file.path(WD.OUT, Cohort, "Ward"), showWarnings = FALSE)
        dir.create(file.path(WD.OUT, Cohort, "Borough"), showWarnings = FALSE)



#Output ward table

#Join the projection file to the row labels and set the column headers

        ward.out <- left_join(row.labels, projection, by="Ward") %>%
                setnames(col.labels)

        setwd(file.path(WD.OUT, Cohort, "Ward"))
        nm <- paste(indy, file.name, Cohort,"(Ward).csv")

        write.csv(ward.out, nm, row.names=F)



#Aggregate by boroughs

        setwd(file.path(WD.OUT, Cohort, "Borough"))

        borough.out <- select(ward.out,-c(1:2))

        borough.out <- aggregate(borough.out[c(3:ncol(borough.out))],
                                 by=list(Borough.Code=borough.out$Borough.Code,
                                         Borough.Name=borough.out$Borough.Name), FUN=sum)

        borough.out <- borough.out[order(borough.out$Borough.Code),]

        nm <- paste(indy,file.name,Cohort,"(District).csv")
        write.csv(borough.out, nm, row.names=F)

}

Download code Zip file

The code and associated functions are availble to download here: Projected Demand for School Places code