Administrative data
census research
Chair: Becky Tinsley, Admin Data Census,
ONS
Becky.tinsley@ons.gov.uk
Population and Household
Estimates – what we have done
and what we plan to do
Chris Hill, Ali Dent and Claire Pereira
Administrative Data Census Project
Census Transformation Programme
Overview
Administrative Data Research Outputs:
• Population estimates – background, our
progress so far and our future plans (Chris
and Ali)
• Producing household estimates from
administrative data (Claire)
Note: These research outputs are NOT official statistics on the
population
Background
Beyond 2011 programme (April 2011)
• review the future provision of population statistics in England and Wales
and inform government and Parliament about options for the next census
• culminated in the National Statistician’s recommendation on the future of the
census and population statistics (March 2014)
The Census Transformation Programme (January 2015) to take forward
the National Statistician’s recommendation:
• deliver a predominantly online census in 2021
• increased use of administrative data and surveys to enhance the statistics
from the 2021 census and improve annual statistics between censuses
Administrative Data Census Project - aim is to produce the type of
information that is collected by a ten-yearly census (on housing, households
and people) from use of administrative data and surveys
Administrative Data Census Project –
Research Outputs
Key aim of the Research Outputs is:
• to replicate as many census outputs as possible using
admin data (and surveys) to compare with the 2021
Census
• Size of population
• Number and structure of households
• Characteristics of housing and the population
Continued development of the methodology based on
acquisition of new data sources and user feedback
Publish an annual assessment each spring to show
progress of our ability to move to an Administrative
Data Census in the next decade
A long way to go…but we have begun
• Oct 2015: Published first set of Research Outputs
• May 2016: Published first Annual Assessment
• Autumn 2016: Next set of Research Outputs published – expanding the range
What was included in the 2015 release?
• research outputs for each LA in England and Wales as a series
of admin data population estimates for 2011, 2013 and 2014 by
5 year age-sex groups
• analytical report comparing these to the 2011 Census and
subsequently the ONS mid-year population estimates
• case studies to highlight quality issues with the admin data
• interactive maps and population pyramids
• administrative data update paper – plans and aspirations for
future years
• feedback from users - aim of improving our methods
Producing population estimates from
administrative data (SPD)
NHS Patient
Register (PR)
DWP/HMRC Customer
Information
System (CIS)
Higher Education Statistics Agency (HESA)
data (students)
population
estimates
Included in Statistical
Population
Dataset (SPD)
If in different location
on PR & CIS, split half
and half across two
addresses
Statistical Population Dataset – SPD V1.0
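A minimal sketch of the SPD v1.0 rule as these slides describe it, assuming each linked record is a dict from source name to area code. The function names and record layout are hypothetical. The choice of area when PR and CIS agree, or when only one of them is present, is an assumption: the slides do not spell out how HESA locations are used here.

```python
from collections import defaultdict


def spd_v1_weights(record):
    """Return {area: weight} for one linked record, or {} if it is excluded.

    `record` maps a source name ("PR", "CIS", "HESA") to the area code held
    on that source; a source with no match is simply absent.
    """
    sources = {s for s in ("PR", "CIS", "HESA") if record.get(s)}
    if len(sources) < 2:
        # Inclusion rule from the slides: a match between any two sources.
        return {}
    pr, cis = record.get("PR"), record.get("CIS")
    if pr and cis and pr != cis:
        # PR and CIS disagree on location: half a person in each place.
        return {pr: 0.5, cis: 0.5}
    # Assumption: otherwise count one whole person at the PR (else CIS) area.
    return {pr or cis: 1.0}


def estimate_by_area(records):
    """Sum the weights of all included records by area."""
    totals = defaultdict(float)
    for rec in records:
        for area, weight in spd_v1_weights(rec).items():
            totals[area] += weight
    return dict(totals)
```

A record found on only one source contributes nothing, while a record with conflicting PR and CIS areas contributes 0.5 to each area.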
Performance of SPD v1.0 compared with the 2011
Census estimates by LA
94% of LA total population
estimates within 3.8% of Census
estimate in 2011
[Figure: percentage difference from 2011 Census estimates by LA, showing where the admin data method is lower or higher than the 2011 Census]
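The headline figure (share of LAs within 3.8% of the benchmark) can be sketched as follows. This is an illustration only: the function names and the dict-by-LA layout are hypothetical, and the sign convention (negative means the admin data method is lower than the benchmark) is an assumption chosen to match the chart labels.

```python
def pct_diff(admin, benchmark):
    """Percentage difference of the admin-based estimate from the benchmark."""
    return 100.0 * (admin - benchmark) / benchmark


def share_within(admin_by_la, benchmark_by_la, tolerance_pct=3.8):
    """Percentage of LAs whose absolute percentage difference is within tolerance."""
    diffs = [
        pct_diff(admin_by_la[la], benchmark_by_la[la]) for la in benchmark_by_la
    ]
    within = sum(abs(d) <= tolerance_pct for d in diffs)
    return 100.0 * within / len(diffs)
```

The same function applies to either benchmark (2011 Census or mid-year estimates), since only the comparison series changes.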
Performance of SPDv1.0 compared with the 2014
mid-year estimates by LA
90% of LA total population estimates within 3.8% of mid-year estimate in 2014
[Figure: percentage difference from 2014 MYEs by LA, showing where the admin data method is lower or higher than the 2014 MYEs]
Feedback summary
• the need for a Population Coverage Survey to help with
estimating the size of the population (considering options)
• using ‘activity data’ (1) to help reduce levels of over-coverage
that are seen for particular age groups (some progress)
• refining the Statistical Population Dataset inclusion and
exclusion rules (changes made)
• reviewing the quality standards that are used to assess the
quality of the SPDs (considering options)
• producing population estimates for small areas, within a local
authority (potentially autumn 2016)
1 Information from administrative data sources about when individuals have interacted with
systems or services, such as the National Insurance, tax or benefits systems, or a hospital visit
through the NHS system.
SPD Developments for 2016
SPD (Statistical Population Dataset)
used to produce estimates of the size of the population by anonymously
linking multiple administrative datasets
• Continue with SPD v1.0 for 2015 estimates (stable)
• SPD v2.0 (improved model) will be used to produce
pop estimates for 2011 and 2015
SPD v2.0 changes
• Improve overall coverage of the usual resident
population
• Redistribute people to the correct location
Plans for 2016 Research Outputs
• Population estimates – expanding the breadth and
detail
• Improvements to the methods used to produce
administrative data population estimates
• Outputs on the number of households
• Research on income from combined PAYE and
benefits data
• Stagger the outputs over the autumn
What are we planning to publish this year?
Package 1 (Autumn 2016)
Population estimates (National and LA) by LA, sex and 5 year age-group, 2015
(As last year, but extends to include a new time series for 2011 and 2015, and the old time series extends to SYOA)
Package 2 NEW (Autumn 2016)
Population estimates (Small Area) by LSOA, sex and 5 year age-group
Package 3 NEW (Autumn 2016)
Household estimates (Number of households) by LA (2011 and 2015)
Combined PAYE and benefits research by LA (2013/14 Tax Year)
All content and timings are provisional
Focus of methodological research for 2016
[Figure: percentage difference for males and females in 2011 (England and Wales), showing where comparison data is higher or lower than official estimates]
Annotations on the figure:
• Addition of other admin data
• Activity data
• Improve matching methodology, increasing number of matches
Tackling undercoverage for school age
children
User feedback had suggested additional
administrative sources to use including the
School Census
• School Census = record level source that includes all
pupils at state Schools, produced annually
• SPD V1 includes matches between any two of PR,
DWP-CIS and HESA
With School Census we can find additional
matches to include in an SPD:
• PR-SC matches
• CIS-SC matches
Also - improve use of matches for
students
• In SPD V1 – it is possible for the record identifiers to
conflict, for example
• In this case SPD v1 does not choose between the two HESA
IDs, so PR and CIS locations are used to place people in an
area
• The conflict implies that there is one SPD row that might
represent two people, and that we are not using the HESA
location for them in SPD v1
Example row built from a HESA-PR match, a PR-CIS match and a CIS-HESA match:
HESA ID (via PR): 365421
NHS number (PR): 7404747201
DWP CIS ID: 889877261543
HESA ID (via CIS): 739542
Improved use of matches for students
• Want to ensure there are no rows with more than 1 identifier
from each dataset (e.g. prevent 2 HESA ids in 1 row)
• Resolving conflicts may change number of records in SPD
Could convert this to 2 matches, so each HESA ID only appears
once!
• Achieved by changing how we use the information for the
matches
• All IDs from the datasets are included in a “spine” of records
(even if they are non-matches) – convenient for research
HESA ID NHS number
PR
365421 7404747201
DWP CIS ID HESA ID
889877261543 739542
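The conflict resolution described above can be sketched as below, assuming each SPD row is a dict holding the identifiers reached through each match. The field names and function are hypothetical; the split into a HESA-PR match and a CIS-HESA match follows the example on the slide, and the real procedure may differ.

```python
def resolve_hesa_conflict(row):
    """Split a row carrying two different HESA IDs into two rows.

    After resolution every output row has at most one identifier per dataset.
    """
    h_pr, h_cis = row.get("hesa_via_pr"), row.get("hesa_via_cis")
    if h_pr and h_cis and h_pr != h_cis:
        # Conflict: keep the HESA-PR match and the CIS-HESA match as two rows.
        return [
            {"hesa_id": h_pr, "nhs_number": row.get("nhs_number"), "cis_id": None},
            {"hesa_id": h_cis, "nhs_number": None, "cis_id": row.get("cis_id")},
        ]
    # No conflict: a single row with whichever HESA ID is available.
    return [
        {
            "hesa_id": h_pr or h_cis,
            "nhs_number": row.get("nhs_number"),
            "cis_id": row.get("cis_id"),
        }
    ]
```

Applied to the example row, this returns two rows, so each HESA ID appears once and the number of records in the SPD can change.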
School Census matches and HESA conflicts:
estimated effect of extra matches compared to SPD v1
[Figure: estimated effect of extra matches as % of SPD 1.0 estimate, by single year of age (0 to 50), for females and males; annotations mark effects from adding SC matches and effects from HESA conflict resolution]
Estimated proportion of SPD records by source
(after SPD exclusion/inclusion rules)
Frequency Percent
PR and CIS 45,202,800 81.2
PR, CIS and SC 7,733,800 13.9
PR,CIS and HESA 2,070,800 3.7
PR 225,100 0.4
PR and HESA 170,500 0.3
PR and SC 143,500 0.3
CIS and HESA 89,700 0.2
CIS and SC 52,200 0.1
New
matches
Increased number
of matches
Coverage improvements for working
ages
• SPD v1 used “exact” or deterministic matches
(e.g. based on combination of name, address, DoB etc).
• Using score based matches (probabilistic) we can find
more PR to CIS matches
• Over 150,000 additional matches of expected good
quality can be identified
• 60% male, majority are distributed evenly across ages
18-50
• 40% Female, many in 18-24 range but also some older
• The extra matches to be added to the
deterministic/matchkey matches to test impact (work in
progress)
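To illustrate the difference between exact (matchkey) and score-based matching, here is a simplified sketch. It is not the ONS method: the field names, the weights and the threshold are all made up for illustration, and a full probabilistic approach would derive weights from match and non-match probabilities rather than fix them by hand.

```python
from difflib import SequenceMatcher

# Illustrative agreement weights only; not taken from the slides.
WEIGHTS = {"name": 4.0, "dob": 5.0, "postcode": 3.0}


def deterministic_match(a, b):
    """Exact matchkey: every field must agree exactly."""
    return all(a[k] == b[k] for k in WEIGHTS)


def match_score(a, b):
    """Score a candidate pair, giving partial credit for similar names."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    score = WEIGHTS["name"] * name_sim
    score += WEIGHTS["dob"] if a["dob"] == b["dob"] else 0.0
    score += WEIGHTS["postcode"] if a["postcode"] == b["postcode"] else 0.0
    return score


def score_based_match(a, b, threshold=10.0):
    """Accept the pair when the total score reaches the (assumed) threshold."""
    return match_score(a, b) >= threshold
```

A pair such as "Jon Smith" and "John Smith" with the same date of birth and postcode fails the exact match but passes the score-based one, which is how additional PR to CIS matches can be found.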
“Activity” data
New activity data acquired from DWP and
HMRC (abbrev = BIDS):
• National Benefits Database (NBD)
• PAYE (Pay as you earn – income tax)
• Single Housing Benefit (SHBE)
• Tax Credits
• excludes: Child Benefit, self-employed and people on
Universal Credit
Research aim:
to derive broad activity to verify residency in E&W and
potentially reduce overcoverage in an SPD
Created combined DWP/HMRC activity
dataset for 2011
• PAYE and TC:
• Anyone present in 2010/2011 and 2011/2012 tax years
• NBD:
• All people with an active claim on 15/03/2011
• Any other people with JSA or ESA claim since 15/03/2009
• Any partners of the above given their own record
• SHBE:
• Active claim on 15/03/2011 or started after
• Partner status was active on 15/03/2011 or after
• Variables:
• Dates
• Additional variables derived from each source – eg account
status, claim status, boolean tax year variables for PAYE/TC
• Links to CIS, Census, PR, SPDv1
Age-sex distribution of active DWP/HMRC
records in SPDv1 2011
[Figure: number of SPD v1 records in 2011 by age (0 to 100+), split into male in BIDS, male not in BIDS, female in BIDS and female not in BIDS; records not in BIDS are the SPD records not active]
Effect of removing inactive records
• Self-employed likely to be excluded from activity dataset
• Child benefit not included
• Would like to acquire MORE activity data!
[Figure: difference from census estimates by 5 year age group (15-19 to 90+), on a scale from -50% to 10%: proportion of males and of females in BIDS compared to census estimates, and proportion of males and of females in SPD v1 compared to census estimates]
Improving local distributions – with
“PDS” activity data
• PDS is our first set of health activity data
• Is based on interactions with NHS, not same as Patient Register -
contains history (multiple rows per person)
• Extract for ONS contains “movers” – a history of locations for each
person
• Aim to remove uncertainty in SPD about location of people, so a
single record is not allocated as a half-person in 2 locations
• SPD v1.0 contains ~3.1 million half-weighted people (5.5%)
• PDS information likely to be more recent than CIS/PR and may
help to resolve half-weighted people
• Many half-weighted records persist in SPD for multiple years, so
linking to PDS from current and previous years may resolve more
• Aimed to link half-weighted records to PDS in current year or
earlier, and categorise
Resolving half weighted people with PDS:
linkage results (2013)
• 48% of 2013 half-weighted records are not linked to any PDS
record from 2013 or earlier, 52% are linked
• Linked records by the most recent PDS extract on which they are found:
• Significant benefit from including earlier years
(maximise information available for half-weighted records)
PDS extract Frequency Percentage
June 2013 806125 49.1
June 2012 469416 28.6
June 2011 323584 19.7
March 2011 44480 2.7
Example of resolving location: perfect moves
Category 1a – a perfect CIS-PR move:
PR: LA E06000047, Mod date 16/04/2013
CIS: LA E09000027, Addr start date 09/03/2011
Most recent PDS move: Origin LA E09000027, Destination LA E06000047, Effective date 04/04/2013
• Dates on PDS and PR are both later than CIS
• Category 1b is the same except PR to CIS
• Can assign very confidently to destination LA
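The "perfect move" test can be sketched as a small classifier. This is an assumption-laden illustration: the tuple layout and function name are hypothetical, and the exact date conditions for category 1b are inferred by mirroring the 1a description.

```python
from datetime import date


def classify_perfect_move(pr, cis, pds_move):
    """Classify a half-weighted record using its most recent PDS move.

    pr and cis are (la_code, date) tuples; pds_move is
    (origin_la, destination_la, effective_date).
    Returns (category, assigned_la), or (None, None) if no perfect move.
    """
    pr_la, pr_date = pr
    cis_la, cis_date = cis
    origin, destination, effective = pds_move
    # 1a: move from the CIS area to the PR area, with PR and PDS dates
    # both later than the CIS date.
    if (origin == cis_la and destination == pr_la
            and pr_date > cis_date and effective > cis_date):
        return "1a", destination
    # 1b (assumed mirror image): move from the PR area to the CIS area.
    if (origin == pr_la and destination == cis_la
            and cis_date > pr_date and effective > pr_date):
        return "1b", destination
    return None, None
```

With the dates from the example, the record is classed as 1a and assigned to destination LA E06000047 as a whole person rather than two halves.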
Producing population estimates from
administrative data (SPD)
NHS Patient
Register
DWP/HMRC Customer
Information
System
HESA data
(students)
SPD population
estimates
Included in Statistical
Population
Dataset (SPD)
School Census
Statistical Population Dataset – SPD V2.0
resolve half-weights
Add extra
PR-CIS
matches
To be published this autumn !
Producing household
estimates from
administrative data
Methodology and analysis towards
ONS Research Outputs 2016
What is a household?
A
household
is defined as:
one person living alone,
or
a group of people (not
necessarily related) living at the
same address who share
cooking facilities and
share a living room or
sitting room or dining area.
Beyond 2011
Early research showed potential for admin
data to provide number and sizes of
‘occupied addresses’.
But key challenges….
• Limited data sources available.
• Coverage & measurement error –undercount &
people not in the right place.
• Definitions – occupied address v census definition
Aims
Short term
• Producing numbers of households in England
and Wales by Local Authority for 2011 and 2015
• Deal with key challenges from previous work
Longer term
• Keep developing – build breadth and time series
of households statistics
• Develop alongside SPD production
What data can we use?
Address
Base
Tax and
Benefits
data
Population
Coverage
Survey
Comparing with other ONS outputs
OA
Output Area
DAU
Demographics
Analysis Unit
LFS
Labour Force
Survey
SPD
Statistical
Population
Dataset
No mid year estimates as with population
Can evaluate quality in 2011 by comparing with Census
estimates, down to OA level.
DAU produce national estimates for 1996 onwards:
• Families and people in families
• Households and people in households
Produced from LFS – sample size - 41,000 households
containing around 100,000 individuals. Internally
estimates can be produced at Local Authority level.
Can AddressBase help?
C Commercial
L Land
M Military
O Other (Ordnance
Survey Only)
P Parent Shell
R Residential
U Unclassified
X Dual Use
Z Object of Interest
RB Ancillary Building
RC Car Park Space
RD Dwelling
RG Garage
RH House In Multiple Occupation
RI Residential Institution
There are 1128 classifications of address on AddressBase, an
Ordnance Survey product, including care home, house boat and
caravan. Classifications can have up to four levels of detail (though many
do not) and have dates attached, which allows further validation.
Address Matching
OSAPR
Ordnance Survey
Address-Point
Reference Number
UPRN
Unique Property
Reference Number
Address matching methodology is developing at ONS -
estimate a 5% increase in match rate.
Need a reliable unique identifier for addresses - transition
from OSAPR to UPRN
Changes in address identification
[Figure: OSAPRs (2013) and UPRNs (2014) in millions, by region of England and Wales]
Currently ONS has attached OSAPRs onto records up to 2013,
with a switch to UPRNs in 2014.
We would expect an increase due to housing stock growth of
around 1%. 77% of LAs show an increase of more than 1%.
Challenges
Our biggest challenges for producing household numbers
Definition – household/address is not a one to one relationship.
Putting people in the right place
•Half weights on SPD – when sources disagree
•Correct address allocation
• data lags
• high churn
• people not deregistering
• poor AddressBase matching/allocation
Dealing with half sizes
Our objective is to count each person in a household – need to resolve
unmatched records
Two methods
1. Source preference
HESA PR CIS
2. Redistribute according to
household size distributions
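Method 1 can be sketched as follows, assuming the order shown (HESA, then PR, then CIS) is the priority order and that each person record is a dict from source to an address identifier. Both the order and the data layout are assumptions; method 2 is not sketched because the slides do not describe the redistribution rule. Counting people per address gives sizes of occupied addresses, which is only a proxy for households.

```python
from collections import Counter

# Assumed priority order, read from the order the sources are listed in.
PREFERENCE = ("HESA", "PR", "CIS")


def preferred_address(addresses):
    """Pick one address for a person from the highest-priority source."""
    for source in PREFERENCE:
        if addresses.get(source):
            return addresses[source]
    return None


def household_sizes(people):
    """Count occupied addresses by number of people allocated to them."""
    allocated = (preferred_address(p) for p in people)
    per_address = Counter(a for a in allocated if a is not None)
    return Counter(per_address.values())
```

Each person is then counted once, at a single address, instead of as two half-weights.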
Dealing with half sizes
[Figure: number of households by household size (hh size 1, 2, 3, 4, 5 plus) and total hhs, Census (QS406EW) compared with redistributed half sizes]
Comparing with Census Outputs
[Figure: % differences between redistributed half sizes and Census outputs, by household size and for total hhs]
Over counting large
household sizes, whilst
undercounting 1 and 2
person households.
It is anticipated that better
address matching and the
use of UPRNs rather than
OSAPRs will resolve some of
these differences.
Dual System Estimation
ONS often uses DSE to weight up for non-response. To trial the use of
DSE to weight up for undercount, I used a 4% sample by postcode taken
from the Census as a proxy for a survey.
To allow for variation between samples, 400 samples were taken.
In the future, an annual
survey similar to a population
coverage survey could
contribute.
[Figure: SPD % diff by region and for England and Wales, on a scale from -14 to 0]
Dual System Estimation
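The sample population is matched against the entire population, giving an estimate for the sampled areas:
\[
\hat{N} = \frac{\text{Census addresses} \times \text{SPD addresses}}{\text{Matched addresses}}
\]
This is then scaled up to the England and Wales aggregate.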
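The estimator above is the standard two-list (capture-recapture) form and can be sketched in a few lines. The function and parameter names are hypothetical. The scale-up step is an assumption: the slides say only that results are scaled up to England and Wales, so the ratio adjustment shown is one plausible reading, not the documented method.

```python
def dual_system_estimate(n_census, n_spd, n_matched):
    """Estimate the true count in sampled areas from two overlapping lists."""
    if n_matched <= 0:
        raise ValueError("at least one matched address is needed")
    return n_census * n_spd / n_matched


def scale_up(spd_total, n_census, n_spd, n_matched):
    """Assumed scale-up: apply the sample coverage ratio to the full SPD count."""
    coverage_ratio = dual_system_estimate(n_census, n_spd, n_matched) / n_spd
    return spd_total * coverage_ratio
```

For example, 100 Census addresses and 90 SPD addresses with 75 matched give an estimate of 120 addresses in the sampled areas.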
Dual System Estimation
Impact of DSE on household counts
85% of Local Authorities are within 0.5% of Census estimate
90% of Local Authorities are within 1% of Census estimate
95% of Local Authorities are within 1.5% of Census estimate
[Figure: DSE % diff and SPD % diff by region and for England and Wales, on a scale from -14 to 0]
Allocating address at SPD record level
Using many data sources to find our
‘best’ address.
Benefits
Enables aggregation at different
levels and cross tabulation with other
variables.
Can weight certain data sources for
different demographic groups . e.g.
students
Allocating address at record level
PR
Joe Bloggs
17/4/1974
UPRN: 12345
CIS
Joe Bloggs
17/4/1974
UPRN: 12346
Patient register moves - PDS
Joe Bloggs 17/4/1974 move 1 - 1/1/2011: UPRN: 12345
Joe Bloggs 17/4/1974 move 2 - 2/2/2011: UPRN: 22345
Joe Bloggs 17/4/1974 move 3 - 3/3/2013: UPRN: 12346
Can use
activity
data to
locate the
newest
address.
True
match on
SPD
Plans for the future
This year
• Numbers of households by LA, England and Wales, 2011
and 2015 for Research Outputs, Autumn 2016 (including
case studies of Local Authorities of interest)
• Focus on issue of definitional differences – what is the real
need vs what can be produced.
• We have initiated an ONS household working group to join
different sectors of work to share ideas and build knowledge.
Future years
• Develop time series of numbers of households
• Explore additional data sources to fill gaps in household
statistics
•Household sizes
•Household composition
•Investigate production of an enhanced address register.
Integrated sources for estimating
population characteristics
Alison Whitworth and Meghan Elkin
Administrative Data Census
Census provides information on:
1. Size of the population
by area, age and sex
2. Household and families
number, size and type of family
3. Population characteristics
information on ethnicity, educational attainment,
religion (etc)
Producing population statistics from
admin data
• Beyond 2011 admin data option :
o Admin data – population size by age and sex
o 4% annual survey – population characteristics
• National Statistician’s recommendation: use
all available sources
o Approach going forward to explore in more depth
the potential of admin data
Outline
• Framework for characteristics post 2021
• Methods for combining sources
• Academic research
• Examples
o Income
o Population by ethnic group
Framework for population characteristics
Admin
• Admin data available for target variable
o Direct counts or estimates from administrative data
Integrated sources
• Admin data for characteristic associated with target variable
o Regression model for estimates
• Admin data is a proxy of target variable
o Structural model for estimates
Survey
• No administrative data available for target variable
o Direct survey estimates
Framework – top down approach
Levels: National / regional, Local benchmarks, Microlevel / multivariate
Sources: Survey, Integrated sources, Admin
Two key methods
Regression model
• Auxiliary data are correlated
with the target variable
• Model to define relationship
Structural model
• Auxiliary data same structure as
target variable
• Model to define the structure
Local Authority claimant count by
unemployment
Methods – ONS applications
Regression model
✓ Mean household income (MSOA)
✓ Median household income (MSOA) for 2011
✓ Unemployment (LA)
✓ Emigration (LA)
SPREE
✓ Broad ethnic group (LA)
Regression model
× Unemployment (MSOA)
× Informal caring (wards)
× Crime and fear of crime
(wards)
× Mental health of children
and adolescents (wards)
× Adult Neurotic Disorder
(wards)
Role of survey data
Admin
• Direct counts or estimates from administrative data - survey to adjust for under- or over-coverage and measurement error
Integrated sources
• Regression model - survey estimates strengthened by admin sources
• Structural model - survey provides accurate marginal totals
Survey
• No administrative data - direct survey estimates
Academic input
• Collaborative projects: Structure Preserving Estimation (SPREE)
• Expert advisory group: Small area estimation
• Funded research / bids: National Centre for Research Methods (NCRM)
• Conferences: NCRM Bath (July), SAE Maastricht (Aug)
Summary
1. Framework: top down
o gaps at small area, micro level & multivariate
2. Methods: two approaches for combining sources
o understand wider application
3. Academic momentum
Case Study: Income outputs
Meghan Elkin
Current published income outputs
Survey outputs: Majority of current income outputs (lowest geography parliamentary constituency)
Integrated sources: Small Area Income Estimates (modelled household weekly income at MSOA)
Admin outputs: Working towards multivariate, small area income outputs
Income definition (Canberra Handbook)
Ideally achieve gross income:
• Income from employment
e.g. employee income and income from self-employment
• Property income
e.g. income from financial and non-financial assets
• Current transfer received
e.g. social security schemes, pensions
Source: UNECE Canberra Group Handbook on Household Income Statistics
Components of income & admin data
[Figure: bubble diagram of the components of total income, colour-coded by admin data availability. Size of the bubbles not relative to proportion of income.]
• Income from employment: employment via PAYE, self employment
• Property income: financial assets, non-financial assets, dividends, interest, rent, royalties
• Current transfers received: social security & assistance, benefits, tax credits, state pension, pensions, occupational pension, personal pension, current transfers from non-profit institutions, current transfers from other households
• Undeclared
Key: Green = access to admin data; Amber = admin data available; Red = unknown/admin data not available – will need to estimate using surveys
This diagram does not provide a full disaggregation of components. For more detail see the Canberra Group Handbook.
Admin Data Census plans for income
[Figure: timeline for 2016, 2017 and 2018 showing increased use of admin data against output quality, for admin outputs and integrated sources]
• Direct estimates from admin data: individual income estimates
• Combining admin data with surveys (work with SAIE)
• More developed publication e.g. components of income
• Produce household income estimates
• More detailed PAYE & self-assessment data
Geography labels on the figure: National, LA, LSOA, OA; limited coverage
Income research outputs 2016 definition
Output: Distributions
Population: Those resident at 30 June 2013 on the Statistical Population Dataset (SPD) aged 16 and over
Geography: England and Wales
Geographic level: Local Authorities
Unit level: Individuals
Reference period: Annual
Time period: Tax year 2013/14
Source of income: PAYE earnings (employment and occupational pensions), child and working tax credits, housing benefit, many DWP benefits
Accrual or receipt: Receipt – administrative data on income and benefits is recorded on receipt of the income or benefits
Location: Income by individual’s home address
Proportion of SPD population that has
some income information by age
[Figure: proportion of the SPD population with some income information, by age (16 to 88), for males and females]
Proposed income bands for 2016 outputs
[Figure: proportion of the population in each income band, for two example local authorities (LA1, LA2)]
Bands: Zero, 0-5K, 5-10K, 10-15K, 15-20K, 20-30K, 30-40K, 40-60K, 60K+, Missing
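Assigning individuals to the proposed bands could look like the sketch below. The band labels come from the slide; everything else is an assumption, in particular that band upper limits are inclusive, that `None` stands for missing income information, and that negative values are rejected.

```python
from bisect import bisect_left
from collections import Counter

# Upper limits of the bounded bands (assumed inclusive), in pounds.
UPPER = [5_000, 10_000, 15_000, 20_000, 30_000, 40_000, 60_000]
LABELS = ["0-5K", "5-10K", "10-15K", "15-20K", "20-30K", "30-40K", "40-60K", "60K+"]


def income_band(income):
    """Map an annual income to one of the proposed band labels."""
    if income is None:
        return "Missing"
    if income < 0:
        raise ValueError("negative income is not handled in this sketch")
    if income == 0:
        return "Zero"
    return LABELS[bisect_left(UPPER, income)]


def band_shares(incomes):
    """Proportion of the population falling in each band."""
    counts = Counter(income_band(x) for x in incomes)
    return {band: n / len(incomes) for band, n in counts.items()}
```

Keeping "Zero" and "Missing" as separate bands distinguishes people with no recorded income from people with no income information at all.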
Case Study: Local Authority
Population Estimates by Ethnic
Group using Generalised Structure
Preserving Estimation (GSPREE)
June 2016
Census Table
Population by Local Authority and Ethnic Group
England (March 2011)
Local Authority White Mixed Asian Chinese Black Other Total
Fareham 107959 1359 1200 467 357 239 111581
Southampton 203528 5678 16443 3449 5067 2717 236882
Portsmouth 181182 5467 9863 2611 3777 2156 205056
Winchester 111577 1626 1894 745 457 296 116595
……. … … … … … … …
Bath & NE Somerset 166473 2898 2665 1912 1326 742 176016
Data for Ethnic Group
• 2011 Census estimates (Mar 2011)
Proxy: Detailed cross tabulation but outdated
• School Census (Jan 2014)
Proxy: Detailed cross tabulation but age 5-15 only
• Annual Population Survey (2014)
Total population by ethnic group
• Mid Year Population Estimates (June 2014)
Total population by local authority
Data for Ethnic Group
Census 2011 MYE
2014
White Mixed Asian Chinese Black Other Total
Fareham 107959 1359 1200 467 357 239 ……..
Southampton 203528 5678 16443 3449 5067 2717 ……..
Portsmouth 181182 5467 9863 2611 3777 2156 ….....
…. … … … … … … …
Tower Hamlets 114819 10360 96392 8109 18629 5787 ……..
Slough 64053 4758 54900 797 12115 3582 ………
….. … … … … … … …
APS July 2012 - June 2014 (weighted estimates)
National total ………. ………. ………. ………. ……….. ……….
School Census Dec 2014
White Mixed Asian Chinese Black Other
Fareham … … … … … …
Southampton … … … … … …
Portsmouth … … … … … …
…. … … … … … …
Tower Hamlets … … … … … …
Slough … … … … … …
….. … … … … … …
Solution…
• Combine administrative and census data with
survey data to borrow strength and produce
reliable estimate for each cell (domain) using
GSPREE (Zhang and Chambers, 2004 and
Luna-Hernandez, A. 2014).
Applying GSPREE
• Step 1: Estimate the association structure by relating
survey counts (Yaj) to census counts (Xaj):
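\[
\log Y_{aj} = \gamma_a + \lambda_j + \beta\,\alpha^{X}_{aj}, \qquad \sum_{j} \lambda_j = 0, \qquad a = 1,\dots,A,\; j = 1,\dots,J
\]
\[
\alpha^{Y}_{aj} = \beta\,\alpha^{X}_{aj}
\]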
2011 Census (Xaj)
White Mixed Asian Chinese Black Other
Fareham 107959 1359 1200 467 357 239
Southampton 203528 5678 16443 3449 5067 2717
Portsmouth 181182 5467 9863 2611 3777 2156
…. … … … … … …
Tower Hamlets 114819 10360 96392 8109 18629 5787
Slough 64053 4758 54900 797 12115 3582
….. … … … … … …
2013 School Census (Xaj)
White Mixed Asian Chinese Black Other
Fareham … … … … … …
Southampton … … … … … …
Portsmouth … … … … … …
….
Tower Hamlets … … … … … …
Slough
….. … … … … … …
APS (Yaj)
Jan 2012-Dec 2014
White Mixed Asian Chinese Black Other
Fareham … … … … … …
Southampton … … … … … …
Portsmouth … … … … … …
….
Tower Hamlets … … … … … …
Slough
….. … … … … … …
- \(\hat{\beta}\) obtained via MLE
- Poisson or Multinomial distribution assumed
- Predict cell counts but no benchmarking
Applying GSPREE
• Step 2: Benchmark updated cell counts to margins totals
Iterative Proportional Fitting (IPF) to impose the known row
and column totals to the cell counts obtained in step 1
GSPREE
Estimates
Dec
2014
MYE
2014
White Mixed Asian Chinese Black Other Total
Fareham … … … … … … ……..
Southampton … … … … … … ……..
Portsmouth … … … … … … ……..
…. … … … … … … …
Tower Hamlets … … … … … … ……..
Slough … … … … … … ……..
….. … … … … … … …
APS July 2012 - June 2014 (weighted estimates)
National total ……….. …….. ………. …….. ……… …………
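Step 2 can be illustrated with a plain-Python Iterative Proportional Fitting routine. This is a generic sketch rather than the ONS implementation: the function name and tolerance are hypothetical, and it assumes strictly positive seed cells and row and column totals that sum to the same grand total. In this setting the rows would be local authorities (benchmarked to MYE totals) and the columns ethnic groups (benchmarked to APS totals).

```python
def ipf(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Rescale `seed` so its margins match the given row and column totals."""
    table = [[float(v) for v in row] for row in seed]
    n_rows, n_cols = len(table), len(table[0])
    for _ in range(max_iter):
        # Scale each row to its target total.
        for i in range(n_rows):
            row_sum = sum(table[i])
            table[i] = [v * row_totals[i] / row_sum for v in table[i]]
        # Scale each column to its target total.
        for j in range(n_cols):
            col_sum = sum(table[i][j] for i in range(n_rows))
            for i in range(n_rows):
                table[i][j] *= col_totals[j] / col_sum
        # Columns are now exact; stop once the rows also agree.
        if all(abs(sum(table[i]) - row_totals[i]) <= tol for i in range(n_rows)):
            break
    return table
```

Because IPF only rescales rows and columns, the cross-product ratios of the step 1 cell counts are preserved while the margins are forced to the known totals.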
• Step 3: Obtain precision estimates via bootstrap
Distribution of LA estimates by ethnic
group, 2014
(England)
RMSE. LA by ethnic group, 2014
• Overall, GSPREE is successful in providing reliable
estimates for most LAs.
• However, non-negligible RMSEs (and CVs) are
observed in some areas
Fixed Effects GSPREE estimator (England)
Conclusions
• GSPREE shows good performance
Small RMSE in most LAs
• Work in progress
Validation study (1991/2001 Census)
GSPREE: 2001 Census x 2011 data (APS, MYE, ESC)
Validation: 2011 Census
• Further work …
Modelling strategy for more detailed categories
Consider SPD as row totals
Consider only School Census as proxy data
Consider different attributes
References
Purcell, N. J. and Kish, L. (1980). Postcensal Estimates for Local Areas
(or Domains). International Statistical Review, 48, 3-18.
Zhang, L.C. and Chambers, R. (2004). Small area estimates for cross-
classifications. Journal of the Royal Statistical Society, B, 66, 479–
496.
Luna-Hernandez, A. (2014). On Small Area Estimation for
Compositions Using Structure Preserving Models. Unpublished PhD
upgrade document, Department of Social Statistics and
Demography, University of Southampton.
Contacts
• For further feedback on today’s session, please contact us at:
Beyond.2021.Research.and.Design@ons.gov.uk
More Related Content

PDF
Pemanfaatan Data Statistik Sektoral Pada E-Walidata SIPD
PPTX
3 mekanisme dan tatacara pendataan (provinsi)
PDF
30 sept, sambutan pencanangan tni manunggal kb kesehatan, pelantikan bpc a ku...
PPTX
Results-Based Management in UNDP
PPTX
Monitoring and evaluation presentatios
PPTX
Build Your NGO: Monitoring & Evaluation
PDF
The Results Chain: 
 A logical model to achieve results
PPTX
RBM Presentation
Pemanfaatan Data Statistik Sektoral Pada E-Walidata SIPD
3 mekanisme dan tatacara pendataan (provinsi)
30 sept, sambutan pencanangan tni manunggal kb kesehatan, pelantikan bpc a ku...
Results-Based Management in UNDP
Monitoring and evaluation presentatios
Build Your NGO: Monitoring & Evaluation
The Results Chain: 
 A logical model to achieve results
RBM Presentation

What's hot (10)

PPT
500-5 The Reign of God and the Poor 560-4
PPTX
Data Demand and Use Training Materials
PPTX
Addresses vs Households : Who needs Household statistics?
PPTX
Modul Kas Dalam SPAN
DOCX
Kak pembinaan desa siaga.docx
PPTX
Mpim 4. pelaporan pemantauan pengendalian utama
PPT
Manajemen operasional lapangan
PPT
182927836 1-konsep-kesehatan-reproduksi-ppt
PPTX
kader pembangunan manusia
500-5 The Reign of God and the Poor 560-4
Data Demand and Use Training Materials
Addresses vs Households : Who needs Household statistics?
Modul Kas Dalam SPAN
Kak pembinaan desa siaga.docx
Mpim 4. pelaporan pemantauan pengendalian utama
Manajemen operasional lapangan
182927836 1-konsep-kesehatan-reproduksi-ppt
kader pembangunan manusia
Ad




Administrative data census research

  • 1. Administrative data census research Chair: Becky Tinsley, Admin Data Census, ONS Becky.tinsley@ons.gov.uk
  • 2. Population and Household Estimates – what we have done and what we plan to do Chris Hill, Ali Dent and Claire Pereira Administrative Data Census Project Census Transformation Programme
  • 3. Overview Administrative Data Research Outputs: • Population estimates – background, our progress so far and our future plans (Chris and Ali) • Producing household estimates from administrative data (Claire) Note: These research outputs are NOT official statistics on the population
  • 4. Background Beyond 2011 programme (April 2011) • review the future provision of population statistics in England and Wales and inform government and Parliament about options for the next census • culminated in the National Statistician’s recommendation on the future of the census and population statistics (March 2014) The Census Transformation Programme (January 2015) to take forward the National Statistician’s recommendation: • deliver a predominantly online census in 2021 • increased use of administrative data and surveys to enhance the statistics from the 2021 census and improve annual statistics between censuses Administrative Data Census Project - aim is to produce the type of information that is collected by a ten-yearly census (on housing, households and people) from use of administrative data and surveys
  • 5. Administrative Data Census Project – Research Outputs Key aim of the Research Outputs is: • to replicate as many census outputs as possible using admin data (and surveys) to compare with the 2021 Census • Size of population • Number and structure of households • Characteristics of housing and the population Continued development of the methodology based on acquisition of new data sources and user feedback Publish an annual assessment each spring to show progress of our ability to move to an Administrative Data Census in the next decade
  • 6. A long way to go…but we have begun • Published first set of Research Outputs Oct 2015 • Published first Annual Assessment May 2016 • Next set of Research Outputs published – expanding the range Autumn 2016
  • 7. What was included in the 2015 release? • research outputs for each LA in England and Wales as a series of admin data population estimates for 2011, 2013 and 2014 by 5 year age-sex groups • analytical report comparing these to the 2011 Census and subsequently the ONS mid-year population estimates • case studies to highlight quality issues with the admin data • interactive maps and population pyramids • administrative data update paper – plans and aspirations for future years • feedback from users - aim of improving our methods
  • 8. Producing population estimates from administrative data (SPD) NHS Patient Register (PR) DWP/HMRC Customer Information System (CIS) Higher Education Statistics Agency (HESA) data (students) population estimates Included in Statistical Population Dataset (SPD) If in different location on PR & CIS, split half and half across two addresses Statistical Population Dataset – SPD V1.0
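The half-and-half rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the field names `pr_area` and `cis_area`, the area labels, and the function `allocate_weights` are hypothetical and not taken from the ONS implementation, and locations stand in for addresses.

```python
from collections import defaultdict

def allocate_weights(records):
    """Sum person weights by area.

    A record counts as 1.0 where the PR and CIS locations agree, and is
    split 0.5 / 0.5 across the two locations where they disagree.
    """
    totals = defaultdict(float)
    for rec in records:
        pr_area, cis_area = rec["pr_area"], rec["cis_area"]
        if pr_area == cis_area:
            totals[pr_area] += 1.0
        else:
            # Sources disagree: half a person in each location.
            totals[pr_area] += 0.5
            totals[cis_area] += 0.5
    return dict(totals)

# Hypothetical linked records (made-up areas, not real data).
records = [
    {"pr_area": "LA_A", "cis_area": "LA_A"},
    {"pr_area": "LA_A", "cis_area": "LA_B"},
    {"pr_area": "LA_B", "cis_area": "LA_B"},
]
estimates = allocate_weights(records)
```

The total weight always equals the number of linked records, so splitting changes where people are counted, not how many are counted.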
  • 9. Performance of SPD v1.0 compared with the 2011 Census estimates by LA 94% of LA total population estimates within 3.8% of Census estimate in 2011 [Map: percentage difference from 2011 Census estimates by LA; legend: admin data method lower than 2011 Census, admin data method higher than 2011 Census]
  • 10. Performance of SPD v1.0 compared with the 2014 mid-year estimates by LA 90% of LA total population estimates within 3.8% of mid-year estimate in 2014 [Map: percentage difference from 2014 MYEs by LA; legend as shown on the slide: admin data method lower than 2011 Census, admin data method higher than 2011 Census]
  • 11. Feedback summary • the need for a Population Coverage Survey to help with estimating the size of the population (considering options) • using ‘activity data’ (1) to help reduce levels of over-coverage that are seen for particular age groups (some progress) • refining the Statistical Population Dataset inclusion and exclusion rules (changes made) • reviewing the quality standards that are used to assess the quality of the SPDs (considering options) • producing population estimates for small areas, within a local authority (potentially autumn 2016) 1 Information from administrative data sources about when individuals have interacted with systems or services, such as the National Insurance, tax or benefits systems, or a hospital visit through the NHS system.
  • 12. SPD Developments for 2016 The SPD (Statistical Population Dataset) is used to produce estimates of the size of the population by anonymously linking multiple administrative datasets • Continue with SPD v1.0 for 2015 estimates (stable) • SPD v2.0 (improved model) will be used to produce population estimates for 2011 and 2015 SPD v2.0 changes • Improve overall coverage of the usual resident population • Redistribute people to the correct location
  • 13. Plans for 2016 Research Outputs • Population estimates – expanding the breadth and detail • Improvements to the methods used to produce administrative data population estimates • Outputs on the number of households • Research on income from combined PAYE and benefits data • Stagger the outputs over the autumn
  • 14. What are we planning to publish this year?
      Package 1: Population estimates (National and LA), by LA, sex and 5 year age-group, 2015 (as last year, but extends to include a new time series for 2011 and 2015, and the old time series extends to SYOA). Autumn 2016
      Package 2 (NEW): Population estimates (Small Area), by LSOA, sex and 5 year age-group. Autumn 2016
      Package 3 (NEW): Household estimates (number of households), by LA (2011 and 2015); combined PAYE and benefits research, by LA (2013/14 tax year). Autumn 2016
      All content and timings are provisional
  • 15. Focus of methodological research for 2016 [Chart: males and females, percentage difference 2011 (England and Wales), where comparison data is higher or lower than official estimates] Chart annotations: • Addition of other admin data • Activity data • Improve matching methodology, increasing number of matches
  • 16. Tackling undercoverage for school age children User feedback had suggested additional administrative sources to use including the School Census • School Census = record level source that includes all pupils at state Schools, produced annually • SPD V1 includes matches between any two of PR, DWP-CIS and HESA With School Census we can find additional matches to include in an SPD: • PR-SC matches • CIS-SC matches
  • 17. Also - improve use of matches for students • In SPD v1 it is possible for the record identifiers to conflict, for example:
      HESA ID (via PR) | NHS number PR | DWP CIS ID | HESA ID (via CIS)
      365421 | 7404747201 | 889877261543 | 739542
      (links shown: HESA-PR match, PR-CIS match, CIS-HESA match)
    • In this case SPD v1 does not choose between the two HESA IDs, so PR and CIS locations are used to place people in an area • The conflict implies that there is one SPD row that might represent two people, and that we are not using the HESA location for them in SPD v1
  • 18. Improved use of matches for students • Want to ensure there are no rows with more than 1 identifier from each dataset (e.g. prevent 2 HESA IDs in 1 row) • Resolving conflicts may change number of records in SPD • Achieved by changing how we use the information for the matches • All IDs from the datasets are included in a “spine” of records (even if they are non-matches) – convenient for research
      Conflicting row:
      HESA ID (via PR) | NHS number PR | DWP CIS ID | HESA ID (via CIS)
      365421 | 7404747201 | 889877261543 | 739542
      Could convert this to 2 matches, so each HESA ID only appears once:
      HESA ID 365421 | NHS number PR 7404747201
      DWP CIS ID 889877261543 | HESA ID 739542
  • 19. School Census matches and HESA conflicts: estimated effect of extra matches compared to SPD v1 [Chart: % of SPD 1.0 estimate (0.0 to 3.5) by single year of age (0 to 50), females and males; series: effects from adding SC matches, effects from HESA conflict resolution]
  • 20. Estimated proportion of SPD records by source (after SPD exclusion/inclusion rules)
      Source combination | Frequency | Percent
      PR and CIS | 45,202,800 | 81.2
      PR, CIS and SC | 7,733,800 | 13.9
      PR, CIS and HESA | 2,070,800 | 3.7
      PR | 225,100 | 0.4
      PR and HESA | 170,500 | 0.3
      PR and SC | 143,500 | 0.3
      CIS and HESA | 89,700 | 0.2
      CIS and SC | 52,200 | 0.1
      Slide annotations: new matches; increased number of matches
  • 21. Coverage improvements for working ages • SPD v1 used “exact” or deterministic matches (e.g. based on combination of name, address, DoB etc). • Using score based matches (probabilistic) we can find more PR to CIS matches • Over 150,000 additional matches of expected good quality can be identified • 60% male, majority are distributed evenly across ages 18-50 • 40% Female, many in 18-24 range but also some older • The extra matches to be added to the deterministic/matchkey matches to test impact (work in progress)
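The difference between a deterministic (matchkey) match and a score-based match can be illustrated with a toy example. This is a sketch under assumptions, not the ONS linkage method: the field names, the agreement weights in `WEIGHTS`, the disagreement penalty and the threshold are all hypothetical values chosen for illustration.

```python
def matchkey(rec):
    """Exact-match key: surname, date of birth and postcode must all agree."""
    return (rec["surname"].lower(), rec["dob"],
            rec["postcode"].replace(" ", "").upper())

def deterministic_match(a, b):
    return matchkey(a) == matchkey(b)

# Hypothetical agreement weights per field (illustrative only).
WEIGHTS = {"surname": 4.0, "forename": 3.0, "dob": 5.0, "postcode": 3.0}
THRESHOLD = 8.0  # hypothetical cut-off for accepting a match

def match_score(a, b):
    """Add a field's weight on agreement, subtract half of it on disagreement."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        if a[field].strip().lower() == b[field].strip().lower():
            score += weight
        else:
            score -= weight / 2
    return score

def probabilistic_match(a, b):
    return match_score(a, b) >= THRESHOLD

# Made-up records: same person details, but the postcode differs.
pr = {"forename": "Joe", "surname": "Bloggs", "dob": "1974-04-17", "postcode": "AB1 2CD"}
cis = {"forename": "Joe", "surname": "Bloggs", "dob": "1974-04-17", "postcode": "AB1 9ZZ"}

exact = deterministic_match(pr, cis)   # False: the postcode is part of the key
score = match_score(pr, cis)           # 4 + 3 + 5 - 1.5
scored = probabilistic_match(pr, cis)  # True: score clears the threshold
```

A pair that fails the exact key because of one stale field can still be accepted by the score, which is how score-based matching finds additional PR to CIS links.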
  • 22. “Activity” data New activity data acquired from DWP and HMRC (abbrev = BIDS): • National Benefits Database (NBD) • PAYE (Pay as you earn – income tax) • Single Housing Benefit (SHBE) • Tax Credits • excludes: Child Benefit, self-employed and people on Universal Credit Research aim: to derive broad activity to verify residency in E&W and potentially reduce overcoverage in an SPD
  • 23. Created combined DWP/HMRC activity dataset for 2011
      PAYE and TC: • Anyone present in 2010/2011 and 2011/2012 tax years
      NBD: • All people with an active claim on 15/03/2011 • Any other people with JSA or ESA claim since 15/03/2009 • Any partners of the above given their own record
      SHBE: • Active claim on 15/03/2011 or started after • Partner status was active on 15/03/2011 or after
      Variables: • Dates • Additional variables derived from each source, e.g. account status, claim status, boolean tax year variables for PAYE/TC • Links to CIS, Census, PR, SPD v1
  • 24. Age-sex distribution of active DWP/HMRC records in SPD v1 2011 [Chart: number of SPD records (0 to 450,000) by age (0 to 100+); series: male not in BIDS, male in BIDS, female not in BIDS, female in BIDS; annotation: SPD records not active]
  • 25. Effect of removing inactive records • Self-employed likely to be excluded from activity dataset • Child benefit not included • Would like to acquire MORE activity data! [Chart: proportion compared to census estimates (-50% to 10%) by 5 year age group (15-19 to 90+); series: males in BIDS, females in BIDS, males in SPD v1, females in SPD v1]
  • 26. Improving local distributions – with “PDS” activity data • PDS is our first set of health activity data • Is based on interactions with NHS, not same as Patient Register - contains history (multiple rows per person) • Extract for ONS contains “movers” – a history of locations for each person • Aim to remove uncertainty in SPD about location of people, so a single record is not allocated as a half-person in 2 locations • SPD v1.0 contains ~3.1 million half-weighted people (5.5%) • PDS information likely to be more recent than CIS/PR and may help to resolve half-weighted people • Many half-weighted records persist in SPD for multiple years, so linking to PDS from current and previous years may resolve more • Aimed to link half-weighted records to PDS in current year or earlier, and categorise
  • 27. Resolving half weighted people with PDS: linkage results (2013) • 48% of 2013 half-weighted records are not linked to any PDS record from 2013 or earlier, 52% are linked • Linked records, by the most recent PDS extract on which they are found:
      PDS extract | Frequency | Percentage
      June 2013 | 806125 | 49.1
      June 2012 | 469416 | 28.6
      June 2011 | 323584 | 19.7
      March 2011 | 44480 | 2.7
    • Significant benefit from including earlier years (maximise information available for half-weighted records)
  • 28. Example of resolving location: perfect moves Category 1a – a perfect CIS-PR move:
      PR: LA E06000047, Mod date 16/04/2013
      CIS: LA E09000027, Addr start date 09/03/2011
      Most recent PDS move: Origin LA E09000027, Destination LA E06000047, Effective date 04/04/2013
    • Dates on PDS and PR are both later than CIS • Category 1b is the same except PR to CIS • Can assign very confidently to destination LA
  • 29. Producing population estimates from administrative data (SPD) Statistical Population Dataset – SPD V2.0 [Diagram: NHS Patient Register, DWP/HMRC Customer Information System, HESA data (students) and School Census feed the Statistical Population Dataset (SPD), which produces SPD population estimates; additional steps shown: resolve half-weights, add extra PR-CIS matches] To be published this autumn!
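The "perfect move" check on this slide can be expressed as a small rule. The sketch below is an interpretation, not the ONS code: the function `classify_move`, the record layout, and the reading of category 1b as the mirror image of 1a (a move from the PR location to the CIS location) are assumptions.

```python
from datetime import date

def classify_move(pr, cis, pds_move):
    """Return (category, assigned LA) for a half-weighted record.

    1a: PDS move goes from the CIS LA to the PR LA, and both the PDS and PR
        dates are later than the CIS date, so assign to the PR LA.
    1b: assumed mirror image, a move from the PR LA to the CIS LA.
    """
    origin, dest, effective = (pds_move["origin"], pds_move["destination"],
                               pds_move["effective"])
    if (origin == cis["la"] and dest == pr["la"]
            and effective > cis["date"] and pr["date"] > cis["date"]):
        return "1a", dest
    if (origin == pr["la"] and dest == cis["la"]
            and effective > pr["date"] and cis["date"] > pr["date"]):
        return "1b", dest
    return None, None  # not a perfect move; needs another rule

# Values taken from the slide's example.
pr = {"la": "E06000047", "date": date(2013, 4, 16)}
cis = {"la": "E09000027", "date": date(2011, 3, 9)}
move = {"origin": "E09000027", "destination": "E06000047",
        "effective": date(2013, 4, 4)}

category, assigned_la = classify_move(pr, cis, move)
```

In both categories the person is assigned to the destination LA of the PDS move, which replaces the two half-weights with a single full weight.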
  • 30. Producing household estimates from administrative data Methodology and analysis towards ONS Research Outputs 2016
  • 31. What is a household? A household is defined as: one person living alone, or a group of people (not necessarily related) living at the same address who share cooking facilities and share a living room or sitting room or dining area.
  • 32. Beyond 2011 Early research showed potential for admin data to provide number and sizes of ‘occupied addresses’. But key challenges…. • Limited data sources available. • Coverage & measurement error –undercount & people not in the right place. • Definitions – occupied address v census definition
  • 33. Aims Short term • Producing numbers of households in England and Wales by Local Authority for 2011 and 2015 • Deal with key challenges from previous work Longer term • Keep developing – build breadth and time series of households statistics • Develop alongside SPD production
  • 34. What data can we use? Address Base Tax and Benefits data Population Coverage Survey
  • 35. Comparing with other ONS outputs Abbreviations: OA = Output Area; DAU = Demographics Analysis Unit; LFS = Labour Force Survey; SPD = Statistical Population Dataset. • There are no mid-year estimates for households, as there are for population • Can evaluate quality in 2011 by comparing with Census estimates, down to OA level • DAU produce national estimates for 1996 onwards: families and people in families; households and people in households • These are produced from the LFS, which has a sample size of 41,000 households containing around 100,000 individuals • Internally, estimates can be produced at Local Authority level
  • 36. Can AddressBase help? Primary classification codes: C Commercial; L Land; M Military; O Other (Ordnance Survey Only); P Parent Shell; R Residential; U Unclassified; X Dual Use; Z Object of Interest. Residential codes: RB Ancillary Building; RC Car Park Space; RD Dwelling; RG Garage; RH House In Multiple Occupation; RI Residential Institution. There are 1128 classifications of address on AddressBase, an Ordnance Survey product, including care home, house boat and caravan. Classifications have up to four levels of detail (many/most do not use all four) and have dates attached, which allows further validation.
  • 37. Address Matching OSAPR Ordnance Survey Address-Point Reference Number UPRN Unique Property Reference Number Address matching methodology is developing at ONS - estimate a 5% increase in match rate. Need a reliable unique identifier for addresses - transition from OSAPR to UPRN
  • 38. Changes in address identification OSAPR = Ordnance Survey Address-Point Reference Number; UPRN = Unique Property Reference Number [Chart: OSAPRs (2013) and UPRNs (2014) in millions (0 to 4.00) by region: East Midlands, East of England, London, North East, North West, South East, South West, Wales, West Midlands, Yorkshire and…] Currently ONS has attached OSAPRs onto records up to 2013, with a switch to UPRNs in 2014. We would expect an increase due to housing stock growth of around 1%. 77% of LAs show an increase of more than 1%.
  • 39. Challenges Our biggest challenges for producing household numbers Definition – household/address is not a one to one relationship. Putting people in the right place • Half weights on SPD – when sources disagree • Correct address allocation • data lags • high churn • people not deregistering • poor AddressBase matching/allocation
  • 40. Dealing with half sizes Our objective is to count each person in a household – need to resolve unmatched records Two methods 1. Source preference HESA PR CIS 2. Redistribute according to household size distributions
  • 41. Dealing with half sizes Comparing with Census Outputs [Chart: number of households (0 to 25,000,000) by household size (hh size 1, 2, 3, 4, 5 plus, total hhs); series: Census QS406EW, redistributed half sizes] [Chart: % differences (-25 to 25) by household size for redistributed half sizes] Over-counting large household sizes, whilst undercounting 1 and 2 person households. It is anticipated that better address matching and the use of UPRNs rather than OSAPRs will resolve some of these differences.
  • 42. Dual System Estimation ONS often uses DSE to weight up for non-response. To trial the use of DSE to weight up for undercount, I used a 4% sample by postcode taken from the Census as a proxy for a survey. To allow for differences in samples, 400 samples were taken. In the future, an annual survey similar to a population coverage survey could contribute. [Chart: SPD % diff (-14 to 0) by region: East Midlands, East of England, London, North East, North West, South East, South West, Wales, West Midlands, Yorkshire and The Humber, England and Wales]
  • 43. Dual System Estimation For the sample population match: Entire population = (Census addresses * SPD addresses) / Matched addresses. Then scale up to the England and Wales aggregate.
  • 44. Dual System Estimation Impact of DSE on household counts 85% of Local Authorities are within 0.5% of Census estimate 90% of Local Authorities are within 1% of Census estimate 95% of Local Authorities are within 1.5% of Census estimate [Chart: DSE % diff and SPD % diff (-14 to 0) by region: East Midlands, East of England, London, North East, North West, South East, South West, Wales, West Midlands, Yorkshire and The Humber, England and Wales]
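The dual system estimate on this slide has the form of the standard capture-recapture (Lincoln-Petersen) estimator. The sketch below illustrates it on made-up address identifiers; the function name `dual_system_estimate` and the sample data are hypothetical, and the real work would involve sampling, repeated draws and scaling that are not shown here.

```python
def dual_system_estimate(census_addresses, spd_addresses):
    """Estimate the total number of addresses in the sampled area.

    N_hat = (addresses on Census list * addresses on SPD list) / matched addresses
    """
    matched = census_addresses & spd_addresses
    if not matched:
        raise ValueError("no matched addresses; the estimate is undefined")
    return len(census_addresses) * len(spd_addresses) / len(matched)

# Hypothetical address identifiers for one sampled area.
census = set(range(1, 101))   # 100 addresses found by the Census sample
spd = set(range(11, 106))     # 95 addresses found on the SPD
n_hat = dual_system_estimate(census, spd)  # 100 * 95 / 90

# A coverage weight that scales SPD counts up to the estimated total.
weight = n_hat / len(spd)
```

The estimator assumes the two lists are independent and that matching is accurate; missed matches inflate the estimate, which is one reason address matching quality matters for this method.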
  • 45. Allocating address at SPD record level Using many data sources to find our ‘best’ address. Benefits Enables aggregation at different levels and cross tabulation with other variables. Can weight certain data sources for different demographic groups . e.g. students
  • 46. Allocating address at record level
      PR: Joe Bloggs, 17/4/1974, UPRN: 12345
      CIS: Joe Bloggs, 17/4/1974, UPRN: 12346 (true match on SPD)
      Patient register moves (PDS):
      Joe Bloggs, 17/4/1974, move 1 - 1/1/2011: UPRN: 12345
      Joe Bloggs, 17/4/1974, move 2 - 2/2/2011: UPRN: 22345
      Joe Bloggs, 17/4/1974, move 3 - 3/3/2013: UPRN: 12346
      Can use activity data to locate the newest address.
  • 47. Plans for the future This year • Numbers of households by LA, England and Wales, 2011 and 2015 for Research Outputs, Autumn 2016 (including case studies of Local Authorities of interest) • Focus on issue of definitional differences – what is the real need vs what can be produced. • We have initiated an ONS household working group to join different sectors of work, share ideas and build knowledge. Future years • Develop time series of numbers of households • Explore additional data sources to fill gaps in household statistics • Household sizes • Household composition • Investigate production of an enhanced address register.
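Using the slide's Joe Bloggs example, the idea of letting the most recent PDS move decide between the PR and CIS addresses might look like the following. It is a sketch under assumptions: `best_address` and the record layout are hypothetical, and the rule of accepting the latest move only when it agrees with one of the two candidate addresses is an illustrative choice.

```python
from datetime import date

def best_address(pr_uprn, cis_uprn, pds_moves):
    """Pick one UPRN for a person whose PR and CIS addresses may differ."""
    if pr_uprn == cis_uprn:
        return pr_uprn
    if not pds_moves:
        return None  # nothing to break the tie; stays unresolved
    latest = max(pds_moves, key=lambda m: m["date"])
    # Accept the latest move only if it confirms one of the candidates.
    if latest["uprn"] in (pr_uprn, cis_uprn):
        return latest["uprn"]
    return None

# PDS move history from the slide.
moves = [
    {"date": date(2011, 1, 1), "uprn": 12345},
    {"date": date(2011, 2, 2), "uprn": 22345},
    {"date": date(2013, 3, 3), "uprn": 12346},
]
chosen = best_address(pr_uprn=12345, cis_uprn=12346, pds_moves=moves)
```

Here the newest move points to UPRN 12346, so the CIS address is kept and the PR address is treated as out of date.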
  • 48. Integrated sources for estimating population characteristics Alison Whitworth and Meghan Elkin
  • 49. Administrative Data Census Census provides information on: 1. Size of the population by area, age and sex 2. Household and families number, size and type of family 3. Population characteristics information on ethnicity, educational attainment, religion (etc)
  • 50. Producing population statistics from admin data • Beyond 2011 admin data option: o Admin data – population size by age and sex o 4% annual survey – population characteristics • National Statistician’s recommendation: use all available sources o Approach going forward to explore in more depth the potential of admin data
  • 51. Outline • Framework for characteristics post 2021 • Methods for combining sources • Academic research • Examples o Income o Population by ethnic group
  • 52. Framework for population characteristics [Diagram: scale running from Admin through Integrated sources to Survey] • Admin data available for target variable • Admin data for characteristic associated with target variable • Admin data is a proxy of target variable • No administrative data available for target variable
  • 53. Framework for population characteristics [Diagram: scale running from Admin through Integrated sources to Survey] • Admin data available for target variable: direct counts or estimates from administrative data • Admin data for characteristic associated with target variable: regression model for estimates • Admin data is a proxy of target variable: structural model for estimates • No administrative data available for target variable: direct survey estimates
  • 54. Framework – top down approach [Diagram: levels shown are National / regional, Local benchmarks, Microlevel / multivariate; sources shown are Survey, Integrated sources, Admin]
  • 55. Two key methods Regression model • Auxiliary data are correlated with the target variable • Model to define relationship Structural model • Auxiliary data same structure as target variable • Model to define the structure [Chart: Local Authority claimant count by unemployment]
  • 56. Methods – ONS applications Regression model ✓ Mean Household income (MSOA) ✓ Median Household income (MSOA) for 2011 ✓ Unemployment (LA) ✓ Emigration (LA) SPREE ✓ Broad ethnic group (LA) Regression model × Unemployment (MSOA) × Informal caring (wards) × Crime and fear of crime (wards) × Mental health of children and adolescents (wards) × Adult Neurotic Disorder (wards)
  • 57. Role of survey data [Diagram: scale running from Admin through Integrated sources to Survey] • Direct counts or estimates from administrative data - Survey to adjust for under or over coverage, measurement error • Regression model - Survey estimates strengthened by admin sources • Structural model - Survey provides accurate marginal totals • No administrative data - Direct survey estimates
  • 58. Academic input • Collaborative projects Structure Preserving Estimation (SPREE), • Expert advisory group Small area estimation • Funded research/ Bids National Centre Research Methods • Conferences NCRM Bath (July) SAE Maastricht (Aug)
  • 59. Summary 1. Framework: top down → gaps at small area, micro level & multivariate 2. Methods: two approaches for combining sources → understand wider application 3. Academic momentum
  • 60. Case Study: Income outputs Meghan Elkin
  • 61. Current published income outputs Admin outputs Integrated sources Survey outputs Majority of current income outputs (lowest geography parliamentary constituency) Small Area Income Estimates (modelled household weekly income at MSOA) Working towards multivariate, small area income outputs
  • 62. Income definition (Canberra Handbook) Ideally achieve gross income: • Income from employment e.g. employee income and income from self-employment • Property income e.g. income from financial and non-financial assets • Current transfer received e.g. social security schemes, pensions Source: UNECE Canberra Group Handbook on Household Income Statistics
  • 63. Components of income & admin data [Bubble diagram of the components of Total Income; size of the bubbles not relative to proportion of income] Bubble labels: Income from employment; Employment via PAYE; Self employment; Undeclared; Property income; Financial assets; Dividends; Interest; Non-financial assets; Rent; Royalties; Undeclared; Current transfers received; Pensions; State pension; Occupational pension; Personal pension; Social security & assistance; Benefits; Tax Credits; Current transfers from non-profit institutions; Current transfers from other households. Colour key: Green = access to admin data; Amber = admin data available; Red = unknown/admin data not available – will need to estimate using surveys. This diagram does not provide a full disaggregation of components. For more detail see the Canberra Group Handbook.
  • 64. Admin Data Census plans for income Increased use of Admin Data Direct estimates from admin data individual income estimates Combining admin data with surveys (work with SAIE) More developed publication e.g. components of income Admin outputs Produce household income estimates More detailed PAYE & self-assessment data 2016 2017 2018 Integrated sources Output quality National OA Limited coverage LA LSOA OA
  • 65. Income research outputs 2016 definition
      Output: Distributions
      Population: Those resident at 30 June 2013 on the Statistical Population Dataset (SPD) aged 16 and over
      Geography: England and Wales
      Geographic level: Local Authorities
      Unit level: Individuals
      Reference period: Annual
      Time period: Tax year 2013/14
      Source of income: PAYE earnings (employment and occupational pensions), child and working tax credits, housing benefit, many DWP benefits
      Accrual or receipt: Receipt – administrative data on income and benefits is recorded on receipt of the income or benefits
      Location: Income by individual’s home address
  • 66. Proportion of SPD population that has some income information by age [Chart: proportion of the population (0% to 100%) by age (16 to 88), males and females]
  • 67. Proportion of SPD population that has some income information
  • 68. Proposed income bands for 2016 outputs [Chart: proportion of the population (0% to 100%) for LA1 and LA2 by income band: Zero, 0-5K, 5-10K, 10-15K, 15-20K, 20-30K, 30-40K, 40-60K, 60K+, Missing]
  • 69. Case Study: Local Authority Population Estimates by Ethnic Group using Generalised Structure Preserving Estimation (GSPREE) June 2016
  • 70. Census Table Population by Local Authority and Ethnic Group England (March 2011)
      Local Authority | White | Mixed | Asian | Chinese | Black | Other | Total
      Fareham | 107959 | 1359 | 1200 | 467 | 357 | 239 | 111581
      Southampton | 203528 | 5678 | 16443 | 3449 | 5067 | 2717 | 236882
      Portsmouth | 181182 | 5467 | 9863 | 2611 | 3777 | 2156 | 205056
      Winchester | 111577 | 1626 | 1894 | 745 | 457 | 296 | 116595
      ……. | … | … | … | … | … | … | …
      Bath & NE Somerset | 166473 | 2898 | 2665 | 1912 | 1326 | 742 | 176016
  • 71. Data for Ethnic Group • 2011 Census estimates (Mar 2011) Proxy: Detailed cross tabulation but outdated • School Census (Jan 2014) Proxy: Detailed cross tabulation but age 5-15 only • Annual Population Survey (2014) Total population by ethnic group • Mid Year Population Estimates (June 2014) Total population by local authority
  • 72. Data for Ethnic Group
      Census 2011 cell counts, with MYE 2014 as the Total column and APS July 2012 - June 2014 (weighted estimates) as the national total row:
      Local Authority      | White  | Mixed | Asian | Chinese | Black | Other | Total (MYE 2014)
      Fareham              | 107959 | 1359  | 1200  | 467     | 357   | 239   | …
      Southampton          | 203528 | 5678  | 16443 | 3449    | 5067  | 2717  | …
      Portsmouth           | 181182 | 5467  | 9863  | 2611    | 3777  | 2156  | …
      …                    | …      | …     | …     | …       | …     | …     | …
      Tower Hamlets        | 114819 | 10360 | 96392 | 8109    | 18629 | 5787  | …
      Slough               | 64053  | 4758  | 54900 | 797     | 12115 | 3582  | …
      …                    | …      | …     | …     | …       | …     | …     | …
      National total (APS) | …      | …     | …     | …       | …     | …     |
      School Census Dec 2014: same rows (Fareham, Southampton, Portsmouth, …, Tower Hamlets, Slough, …) and columns (White, Mixed, Asian, Chinese, Black, Other); cell values not shown
  • 73. Solution… • Combine administrative and census data with survey data to borrow strength and produce a reliable estimate for each cell (domain) using GSPREE (Zhang and Chambers, 2004; Luna-Hernandez, 2014).
  • 74. Applying GSPREE • Step 1: Estimate the association structure by relating survey counts ($Y_{aj}$) to census counts ($X_{aj}$): $\log Y_{aj} = \gamma_a + \lambda_j + \beta\,\alpha^X_{aj}$, with $\sum_j \lambda_j = 0$, $a = 1, \dots, A$, $j = 1, \dots, J$, so that $\alpha^Y_{aj} = \beta\,\alpha^X_{aj}$, where $\alpha^X_{aj}$ and $\alpha^Y_{aj}$ denote the area-by-category interaction terms of the proxy and survey tables • $\hat{\beta}$ obtained via MLE • Poisson or Multinomial distribution assumed • Predict cell counts but no benchmarking
      2011 Census ($X_{aj}$):
      Local Authority | White  | Mixed | Asian | Chinese | Black | Other
      Fareham         | 107959 | 1359  | 1200  | 467     | 357   | 239
      Southampton     | 203528 | 5678  | 16443 | 3449    | 5067  | 2717
      Portsmouth      | 181182 | 5467  | 9863  | 2611    | 3777  | 2156
      …               | …      | …     | …     | …       | …     | …
      Tower Hamlets   | 114819 | 10360 | 96392 | 8109    | 18629 | 5787
      Slough          | 64053  | 4758  | 54900 | 797     | 12115 | 3582
      …               | …      | …     | …     | …       | …     | …
      2013 School Census ($X_{aj}$): same rows and columns; cell values not shown
      APS ($Y_{aj}$), Jan 2012-Dec 2014: same rows and columns; cell values not shown
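To make the "association structure" in Step 1 concrete, the sketch below decomposes the logged 2011 Census counts shown on this slide into row effects, zero-sum column effects and interaction terms, then rebuilds cell counts from a given β. It does not estimate β by maximum likelihood as the slide describes. The function names `decompose` and `predict` are hypothetical, only the five local authorities printed on the slide are used, and the double-centring of the log counts is an assumed (though standard) way of defining the interactions.

```python
import math

# 2011 Census counts from the slide (White, Mixed, Asian, Chinese, Black, Other)
# for Fareham, Southampton, Portsmouth, Tower Hamlets and Slough.
CENSUS_2011 = [
    [107959, 1359, 1200, 467, 357, 239],
    [203528, 5678, 16443, 3449, 5067, 2717],
    [181182, 5467, 9863, 2611, 3777, 2156],
    [114819, 10360, 96392, 8109, 18629, 5787],
    [64053, 4758, 54900, 797, 12115, 3582],
]


def decompose(counts):
    """Split log counts into row effects, zero-sum column effects and
    interactions by double-centring. All counts must be positive."""
    logs = [[math.log(x) for x in row] for row in counts]
    n_rows, n_cols = len(logs), len(logs[0])
    row_mean = [sum(row) / n_cols for row in logs]
    col_mean = [sum(logs[a][j] for a in range(n_rows)) / n_rows
                for j in range(n_cols)]
    grand = sum(row_mean) / n_rows
    gamma = row_mean                       # row effect (absorbs the grand mean)
    lam = [c - grand for c in col_mean]    # column effects, sum to zero
    alpha = [[logs[a][j] - row_mean[a] - col_mean[j] + grand
              for j in range(n_cols)] for a in range(n_rows)]
    return gamma, lam, alpha


def predict(gamma, lam, alpha, beta):
    """Cell counts implied by log Y = gamma_a + lambda_j + beta * alpha_aj."""
    return [[math.exp(gamma[a] + lam[j] + beta * alpha[a][j])
             for j in range(len(lam))] for a in range(len(gamma))]


gamma, lam, alpha = decompose(CENSUS_2011)
same_structure = predict(gamma, lam, alpha, beta=1.0)   # reproduces the proxy table
weaker_structure = predict(gamma, lam, alpha, beta=0.5)  # illustrative beta only
```

With β = 1 the proxy table is reproduced exactly. A β below 1 shrinks the interactions, so local authorities look more alike in their ethnic composition. In the method on the slide, β and the main effects would come from fitting the survey counts rather than from the proxy table itself.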
  • 75. Applying GSPREE • Step 2: Benchmark the updated cell counts to the marginal totals, using Iterative Proportional Fitting (IPF) to impose the known row and column totals on the cell counts obtained in Step 1 • Step 3: Obtain precision estimates via bootstrap
      GSPREE Estimates Dec 2014: rows are local authorities (Fareham, Southampton, Portsmouth, …, Tower Hamlets, Slough, …) and columns are White, Mixed, Asian, Chinese, Black, Other; the Total column is MYE 2014 and the national total row is APS July 2012 - June 2014 (weighted estimates); cell values not shown
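The benchmarking in Step 2 can be sketched with a plain implementation of Iterative Proportional Fitting. This is a minimal illustration rather than the code used for the research outputs: the function name `ipf`, the stopping rule and the small example table are all assumptions, and the example numbers are made up.

```python
def ipf(seed, row_totals, col_totals, max_iter=100, tol=1e-9):
    """Scale a table of positive cell counts until its margins match the
    given row and column totals (which must share the same grand total)."""
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Row step: rescale each row to its known total.
        for a, target in enumerate(row_totals):
            current = sum(table[a])
            if current > 0:
                factor = target / current
                table[a] = [v * factor for v in table[a]]
        # Column step: rescale each column to its known total.
        for j, target in enumerate(col_totals):
            current = sum(row[j] for row in table)
            if current > 0:
                factor = target / current
                for row in table:
                    row[j] *= factor
        # Columns now match exactly, so only the row margins need checking.
        gap = max(abs(sum(row) - t) for row, t in zip(table, row_totals))
        if gap < tol:
            break
    return table


# Made-up 2x2 example: Step 1 predictions, then known row and column totals.
step1_counts = [[1, 2], [3, 4]]
fitted = ipf(step1_counts, row_totals=[40, 60], col_totals=[30, 70])
```

IPF only multiplies rows and columns by constants, so the cross-product ratios of the Step 1 table carry over to the benchmarked table. This is the sense in which the association structure is preserved.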
  • 76. Distribution of LA estimates by ethnic group, 2014 (England)
  • 77. RMSE: LA by ethnic group, 2014 (Fixed Effects GSPREE estimator, England) • Overall, GSPREE is successful in providing reliable estimates for most LAs. • However, non-negligible RMSEs (and CVs) are observed in some areas.
  • 78. Conclusions • GSPREE shows good performance: small RMSE in most LAs • Work in progress: validation study (1991/2001 Census); GSPREE: 2001 Census x 2011 data (APS, MYE, ESC); validation: 2011 Census • Further work: modelling strategy for more detailed categories; consider SPD as row totals; consider only School Census as proxy data; consider different attributes
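For readers unfamiliar with the two precision measures named on this slide, the sketch below computes them for one cell from a set of replicate estimates, such as bootstrap replicates. The function names and the replicate values are hypothetical, and defining the CV as RMSE divided by the reference value is an assumption, since the slides do not give the formula used.

```python
import math


def rmse(replicates, reference):
    """Root mean squared error of replicate estimates around a reference value."""
    return math.sqrt(sum((r - reference) ** 2 for r in replicates) / len(replicates))


def cv(replicates, reference):
    """Coefficient of variation, taken here as RMSE relative to the reference."""
    return rmse(replicates, reference) / reference


# Made-up replicate estimates for one LA-by-ethnic-group cell.
replicates = [98.0, 102.0, 100.0]
cell_rmse = rmse(replicates, reference=100.0)
cell_cv = cv(replicates, reference=100.0)
```

A cell with a small count can have a modest RMSE but a large CV, which is one reason both measures are worth reporting.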
  • 79. References • Purcell, N. J. and Kish, L. (1980). Postcensal Estimates for Local Areas (or Domains). International Statistical Review, 48, 3-18. • Zhang, L. C. and Chambers, R. (2004). Small area estimates for cross-classifications. Journal of the Royal Statistical Society, Series B, 66, 479-496. • Luna-Hernandez, A. (2014). On Small Area Estimation for Compositions Using Structure Preserving Models. Unpublished PhD upgrade document, Department of Social Statistics and Demography, University of Southampton.
  • 80. Contacts • For further feedback on today’s session, please contact us at: Beyond.2021.Research.and.Design@ons.gov.uk