MESSY AND COMPLEX
DATA 1
This slide deck was produced for the in-person workshop "Messy and complex data" (part 1) at the ADR UK 2023 conference, Tuesday 14 November
This session
• Intro
• Getting started
• Definitions
• This session
• Principles of Data Engineering
• Structure and complexity
Introductions
Presenters
• Jonathan Swan
• Head of ADRUK Data Engineering, ONS
• Jen Hampton
• Head of ADRUK Linkage, ONS
Definitions: Complex Data
• “Data that are hard to process and translate into a digestible format.”
• Complexity may be because of
• Size (wide or tall)
• Number of Sources
• Structure
• Multiple entities (e.g. people, households, jobs, employers all in same data)
• Relationships within the data, between entities, or between sources
• The type of data (pictures, sound, video, free text, etc.)
Definitions: Messy Data
• Data where there is a barrier to using the data for analysis.
• Examples of Messy data
• missing data.
• unstructured data.
• multiple variables in one column.
• variables stored in the wrong places.
• observations split incorrectly or left together against normalization rules.
• switched columns and rows.
• extra spaces
• Also
• uncertainty
• imprecision
PRINCIPLES OF DATA
ENGINEERING
Tuesday 14 November
Why do data engineering principles
matter?
• “I’m a researcher – why should I care about data engineering?”
• Data engineering includes the process of preparing data to enable
users, researchers, to use the data.
• Using the right principles enables:
• Ease of use
• Consistency of use
• Accuracy of the data
• Accuracy in interpretation
• Avoidance of error
Standard set of Data Engineering Principles?
• 7 Data Engineering Principles You Should Be Aware Of
• 6 data integration principles for data engineers to live by
• The Three P's of Data Engineering
• Data engineering principles - AWS Prescriptive Guidance
• Essential Data Engineering Concepts and Principles
• Data Engineering Design Principles
• SOLID Principles in Data Engineering
• 10 Major DataOps Principles to Overcome Data Engineer Burnout Simplified
Questions for this session
• Can we form a list of practical Data Engineering principles?
• What do they mean to us in practice?
• Does usage for research data change anything?
The art of a data engineer is dealing with
messy and complex data, whilst maintaining
clarity.
Embrace complexity,
Avoid complicated.
Two main frameworks of principles (sort of)
Software Engineering Principles
• Flexibility
• Reproducibility
• Reusability
• Scalability
• Auditability
SOLID Framework
• Single Responsibility
• Open/Closed
• Liskov Substitution
• Interface Segregation
• Dependency Inversion
… And the Acronyms
DRY … Don’t Repeat Yourself
YAGNI … You Ain’t Gonna Need It
KISS … Keep It Simple Stupid
For the Record - SOLID
Single responsibility | A class should have only one responsibility.
Open/closed | Classes should be open for extension, but closed for modification.
Liskov substitution | If A is a subtype of class B, we should be able to replace B with A without disrupting program behaviour. (If B is the class Car, it should also work for Electric Car.)
Interface segregation | Larger interfaces should be split into smaller ones. By doing so, we can ensure that implementing classes only need to be concerned with the methods that are of interest to them.
Dependency inversion | High-level modules, which provide complex logic, should be easily reusable and unaffected by changes in low-level modules, which provide utility features.
Towards a Principles Framework for Data Engineering to Support Research
• Delivery Principles: principles that help data engineers deliver
• User Principles: principles that aid research
• Implementation Principles: principles for technical implementation (like SOLID)
What principles are key?

Classification | Principle | Practice | Implication for researchers
User | Change minimum | Keep data values as unchanged as possible. Even if adding value, keep originals. | Researchers can be confident that data are as recorded.
User | Derived Variables in source data | Where several users need DVs, agree and add DVs to researcher data. | Users can work off consistent information and don't need to re-invent the wheel.
User | Standardise data | Provide data that meet standards; where necessary include original and standardised data. | Users can more easily interpret and compare data and results.
User | Use meaningful terms | Where possible use meaningful file and variable names, and meaningful category names (Male, Female vs 1, 2). | Makes analysis easier for the user, and reduces the risk of errors.
User | Ease of use | Prioritise ease of use over storage efficiency. | Easier for researchers.
Classification | Principle | Practice | Implication for researchers
Delivery | Reproducibility | Use a reproducible pipeline that meets good coding standards. | Reliable delivery, able to consistently re-supply. With adaptation, enables re-supply with changes.
Delivery | Reusability | Use functions and modular code. Generalise where possible. | Quicker, more reliable development. Helps consistency.
Delivery | Scalability | Write code and use tools that facilitate scaling to increased (or decreased) data volumes. |
Delivery | Auditability | Use procedures, tools, and code so that changes to pipelines and data can be traced, sourced, reasoned, and justified. | Helps ensure the reliability of data.
Delivery | Document and Share | Document everything! Code, derivations, etc. Share documentation and code. | Researchers can understand how their data are derived.
Over to you
• Do you have any comments on these principles?
• What other principles are important, especially to researchers?
• A reminder (each principle has a Classification – User, Delivery, or (Implementation) – a Practice, and an Implication for researchers):

User Principles: Change minimum; Derived Variables in source data; Standardise data; Use meaningful terms; Ease of use
Delivery Principles: Reproducibility; Reusability; Scalability; Auditability; Document and Share
STRUCTURING DATA
FOR RESEARCHERS
Tuesday 14 November
Structuring data
• Underlying principle – make the data as easy to use as possible
• But modern data get complex fast!
• And linking across sources ratchets up the complexity!
Starting simple – a flat table
• Simplest to analyse
• Ideal for simple surveys
• e.g. Opinions Survey

ID   | Type of House | Number of Bedrooms
HID1 | Semi detached | 2
HID2 | Terrace       | 4
HID3 | Flat          | 1
Sparse tables
• Keep the advantages of a flat table
• Not efficient for memory
• Good for linked data
• e.g. ASHE linked to Census
ID   | House_type    | No. Bedrooms
HID1 | Semi detached | 2
HID2 | Terrace       | 4
HID3 | Flat          | 1

ID   | No. Cars
HID1 | 3
HID3 | 0
HID4 | 1

Linked result:
ID   | House_type    | No. Bedrooms | No. Cars
HID1 | Semi detached | 2            | 3
HID2 | Terrace       | 4            | <NULL>
HID3 | Flat          | 1            | 0
HID4 | <NULL>        | <NULL>       | 1

Consider: a NULL in the original data vs a NULL meaning no match.
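One way to keep that null-in-original vs null-means-no-match distinction explicit is an outer join with a match indicator. A minimal pandas sketch, using the toy house and car tables from this slide (column names are illustrative):

```python
import pandas as pd

houses = pd.DataFrame({
    "ID": ["HID1", "HID2", "HID3"],
    "House_type": ["Semi detached", "Terrace", "Flat"],
    "Bedrooms": [2, 4, 1],
})
cars = pd.DataFrame({
    "ID": ["HID1", "HID3", "HID4"],
    "Cars": [3, 0, 1],
})

# Outer join keeps every ID from both sources; indicator=True adds a
# _merge column recording whether a null arose from a failed match
# rather than being null in the original data.
linked = houses.merge(cars, on="ID", how="outer", indicator=True)
print(linked)
```

Here HID2 shows `left_only` (no car record) and HID4 shows `right_only` (no house record), so researchers can tell those nulls apart from genuine missing values.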
Wide, Stacked, Summarised
• Three ways to structure the same data

Summarised:
ID   | No. Cars
HID1 | 3
HID3 | 0
HID4 | 1

Stacked:
ID   | Model
HID1 | Ford Focus
HID1 | Toyota Aygo
HID1 | Range Rover Velar
HID3 | <NULL>
HID4 | Bentley Continental

Wide:
ID   | Model 1             | Model 2     | Model 3
HID1 | Ford Focus          | Toyota Aygo | Range Rover Velar
HID3 | <NULL>              | <NULL>      | <NULL>
HID4 | Bentley Continental | <NULL>      | <NULL>
Over to you:
• How would you structure the house data linked to the car models data?
And Why?
• HID, House Type, No. Bedrooms
• HID, Make 1, Make 2, etc.
• How would you structure data on pupils covering all their qualifications?
And Why?
• ID
• Qualification type
• Subject
• Grade/Result
More on Stacking
• Useful when the row is about an entity of interest
• e.g. ASHE data relate to employments
• Each person can have several employments
• Some research interest is around employments
• Useful for data covering several time periods
• e.g. (again) ASHE data
• Wide data structure sometimes not practical
• If lots of entries for an entity (e.g. cars to a household)
• Or if large variation in the number of entries
• Stacking doesn’t work well if lots of variation in data contents
• e.g. New Earnings Survey (NES) questions varied dramatically from year to year
Multiple flat files
• What it says on the tin
• Linked by common keys
• E.g. Labour Force Survey
• Individual File
• Household File
• Longitudinal File
ID       | HHID | Age | Employment Status
Person 1 | HH1  | 45  | Full Time
Person 2 | HH2  | 28  | Unemployed
Person 3 | HH2  | 1   | <NULL>
Person 4 | HH3  | 36  | Full Time
Person 5 | HH3  | 34  | Economically inactive
Person 6 | HH3  | 10  | <NULL>

HHID | Property type | No. Employed
HH1  | Flat          | 1
HH2  | Terrace       | 0
HH3  | Semi-detached | 1
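Because the files share a common key, household attributes can be pulled onto each person's row (or vice versa) with a join. A minimal pandas sketch with a subset of the example values (column names illustrative):

```python
import pandas as pd

# Individual and household files linked by the common key HHID.
people = pd.DataFrame({
    "ID": ["Person 1", "Person 2", "Person 3"],
    "HHID": ["HH1", "HH2", "HH2"],
    "Age": [45, 28, 1],
})
households = pd.DataFrame({
    "HHID": ["HH1", "HH2", "HH3"],
    "Property_type": ["Flat", "Terrace", "Semi-detached"],
})

# Left join keeps one row per person and attaches household attributes.
analysis = people.merge(households, on="HHID", how="left")
print(analysis)
```

The same join in the other direction supports household-level analysis, e.g. aggregating person-level variables up to the household.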
Table + Spine
• Used for stacked data
• Spine contains ‘fixed’ or persistent data
• Traditionally one spine entry per secondary entity
• But can be more than one if details change
PID | Income | Job Start
P1  | 28,000 | 01/04/2022
P1  | 32,000 | 01/08/2023
P2  | 96,000 | 01/01/2018

PID | Age at xx | Sex
P1  | 32        | Male
P2  | 52        | Female
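When person-level attributes are needed alongside the stacked rows, the spine is joined on. A hedged pandas sketch of the jobs-plus-spine example above (names and dates as in the slide, ISO-formatted):

```python
import pandas as pd

# Stacked jobs table plus a spine of persistent person-level data.
jobs = pd.DataFrame({
    "PID": ["P1", "P1", "P2"],
    "Income": [28000, 32000, 96000],
    "Job_start": pd.to_datetime(["2022-04-01", "2023-08-01", "2018-01-01"]),
})
spine = pd.DataFrame({
    "PID": ["P1", "P2"],
    "Sex": ["Male", "Female"],
    "Age": [32, 52],
})

# Left join: every job row gains the person's fixed attributes.
jobs_with_person = jobs.merge(spine, on="PID", how="left")
print(jobs_with_person)
```

Keeping the fixed attributes in the spine avoids repeating them on every stacked row, while the join makes them available whenever analysis needs them.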
Multiple Tables + Spine
• Used for data covering multiple sources
• Where the data structure is not amenable to stacking etc.
• Data Spine indicates source of data and may contain fixed data
• One entry per data subject
• Demographic spine gives ‘fixed’ or persistent data
• Can use multiple spines
• Typically one data spine + one demographic spine
Multiple Tables + Spine - example
Exam results:
ID   | Subject | Grade
ID 1 | English | 8
ID 1 | Maths   | 9
ID 2 | Greek   | 7
ID 2 | Latin   | 4
ID 3 | Physics | 9
ID 3 | French  | 2
ID 4 | Biology | 3

PLASC (school census):
ID   | Sex | Age at xx | School ID
ID 1 | F   | 17        | ABC
ID 2 | M   | 18        | ABC
ID 4 | F   | 18        | DEF
ID 5 | M   | 11        | GHI
ID 6 | F   | 9         | GHI

Degree results:
ID   | Subject       | Result
ID 3 | Physics       | First
ID 7 | Media studies | 2:2

Data spine:
ID   | Exam  | PLASC | Degree
ID 1 | TRUE  | TRUE  | FALSE
ID 2 | TRUE  | TRUE  | FALSE
ID 3 | TRUE  | FALSE | TRUE
ID 4 | TRUE  | TRUE  | FALSE
ID 5 | FALSE | TRUE  | FALSE
ID 6 | FALSE | TRUE  | FALSE
ID 7 | FALSE | FALSE | TRUE

Demographic spine:
ID   | Age at xx | Sex    | Postcode
ID 1 | 17        | M      | SW1A 1AA
ID 1 | 18        | M      | SW1A 1AA
ID 2 | 18        | F      | SW1A 0AA
ID 2 | 18        | F      | SW1A 0PW
ID 3 | 19        | <NULL> | PO15 5RR
ID 4 | 18        | F      | NP10 8XG
ID 5 | 11        | M      | DL1 5AD
ID 6 | 9         | F      | M1 6EU
ID 7 | 22        | F      | SW1P 4DF
ID 7 | 22        | F      | EC3N 4AB
Other data structures
• JSON
• JavaScript Object Notation
• Plain text
• Language independent
• Used to talk computer to computer
• Great for complex data
• But a pig to analyse without conversion!
JSON Example
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}

Not exactly user friendly!
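Converting JSON like this for analysis usually means flattening the nested lists into tidy tables. A minimal sketch using pandas' `json_normalize` on a trimmed version of the donut document above:

```python
import json
import pandas as pd

doc = json.loads("""{
  "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55,
  "batters": {"batter": [
    {"id": "1001", "type": "Regular"},
    {"id": "1002", "type": "Chocolate"}
  ]},
  "topping": [
    {"id": "5001", "type": "None"},
    {"id": "5002", "type": "Glazed"}
  ]
}""")

# Flatten the nested batter list into one row per batter, carrying
# selected product-level fields along on each row.
batters = pd.json_normalize(doc, record_path=["batters", "batter"],
                            meta=["name", "ppu"])
print(batters)
```

Each nested list (batters, toppings) becomes its own table keyed back to the product, i.e. the JSON is re-expressed as linked flat tables.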
Other data structures
• JSON
• Relational Databases
• A way of storing data across multiple tables
• Can be made available as multiple related tables
• Easier to analyse if translated to stacked or wide tables
Other data structures:
Relational Database

Student_ID | Name
1          | Alice
2          | Bob
3          | Cate

Student_ID | Course_ID
1          | 1
1          | 2
2          | 1
2          | 3
3          | 1
3          | 2
3          | 3

Course_ID | Course
1         | English
2         | Maths
3         | Science
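Translating these related tables into a single stacked analysis table is a pair of joins through the link table. A minimal pandas sketch of the student/course example (table names are illustrative):

```python
import pandas as pd

students = pd.DataFrame({"Student_ID": [1, 2, 3],
                         "Name": ["Alice", "Bob", "Cate"]})
enrolments = pd.DataFrame({"Student_ID": [1, 1, 2, 2, 3, 3, 3],
                           "Course_ID": [1, 2, 1, 3, 1, 2, 3]})
courses = pd.DataFrame({"Course_ID": [1, 2, 3],
                        "Course": ["English", "Maths", "Science"]})

# Resolve the many-to-many link table into one stacked table:
# one row per (student, course) with names attached.
flat = (enrolments
        .merge(students, on="Student_ID")
        .merge(courses, on="Course_ID"))
print(flat)
```

The result is the stacked structure from earlier in the session, which most analysis tools handle directly.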
Other data structures
• JSON
• Relational Databases
• Graph databases
• Stores nodes and relationships
• Not tables
• Useful for complex linkage and matching
• Useful for visualisation
• Useful for uncertainty in data/relationships
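To make the idea concrete without assuming any particular graph database product, here is a hypothetical plain-Python sketch of nodes and relationships, with a weight on one edge to represent linkage uncertainty:

```python
# Nodes carry properties; edges carry a relationship type and,
# here, a match weight representing linkage uncertainty.
nodes = {
    "P1": {"label": "Person", "name": "Ann"},
    "P2": {"label": "Person", "name": "An"},
    "H1": {"label": "Household"},
}
edges = [
    ("P1", "LIVES_IN", "H1", {"weight": 1.0}),
    ("P1", "POSSIBLE_MATCH", "P2", {"weight": 0.83}),
]

# Traversal works by following relationships, not by joining tables.
def neighbours(node, rel=None):
    return [(target, props) for src, r, target, props in edges
            if src == node and (rel is None or r == rel)]

print(neighbours("P1", "POSSIBLE_MATCH"))
```

Real graph databases store data this way natively and add query languages for traversals; the point here is just that the relationship (and its uncertainty) is a first-class object.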
Graph Database Example
Image: By Ole Mussmann - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=87002327
Other data structures
• JSON
• Relational Databases
• Graph databases
Towards the future
• Data queries
• Pre-canned data queries
• Updates when the data does
• Useful for consistency
• Useful for sharing common queries
• Creating and updating derived variables
• Data Views
• Build a view for analysis
• Re-presents underlying data
• Can be built across several sources
• When the data is updated / changed – reflected in the view
• Cloud-based tools are making building custom views much easier
• Dashboards
• Present summary information
• Can be graphical
• Updates when the data does
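A data view in miniature, as a hedged sketch using Python's built-in sqlite3 (table and view names invented for illustration): the view re-presents the underlying table, so re-querying it after the data change returns the updated result with no extra work.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE earnings (pid TEXT, year INTEGER, income REAL);
INSERT INTO earnings VALUES ('P1', 2022, 28000), ('P1', 2023, 32000),
                            ('P2', 2023, 96000);
-- The view stores no data of its own: it is a saved query over the
-- underlying table, so it always reflects the current data.
CREATE VIEW latest_income AS
  SELECT pid, income FROM earnings e
  WHERE year = (SELECT MAX(year) FROM earnings WHERE pid = e.pid);
""")
print(con.execute("SELECT * FROM latest_income ORDER BY pid").fetchall())
```

The same idea scales up in cloud warehouses, where views can also span several sources and serve as shared, consistent definitions of derived variables.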
Structuring Data
Over to you!
• Are there any other data structures you have used?
• When have you had difficulty with structure?
• When has a structure worked well for you (for complex data)?
• What techniques have you found useful?
Editor's Notes
  • #2: Throughout there will be polls via Slido & opportunities for discussion in groups (as seated in room; breakout rooms online) Suggest you log into Slido now.
  • #5: We’ll return to messy data in the second session
  • #8: If you Google “Data Engineering Principles” this is what you get. There is no standard framework for principles.
  • #9: For this session I would like us to have three questions in the back of our minds …
  • #10: Of the proposals for data engineering principles, various sources propose principles based around software engineering principles – but no two sources have the same idea of which software engineering principles to start from, or which are most relevant to data engineering. Then we have the SOLID framework – which is all fine and dandy – but is very much aimed at the technical end of engineering. And there are also a few acronyms floating around – which are kind of fun.
  • #11: I’m not going to dwell on the technical end of things, but for the record, and if you have not come across the SOLID Framework here’s a little detail you can look at in your own time.
  • #12: I’d like to focus on user and delivery principles and not worry about the technical implementation principles.
  • #23: Useful when you have linked entity types, where different entities of research interest. e.g. analysis at individual and separately at HH level Joining means you can pull new derived variables between tables – e.g. say, income of highest earner onto HH file.
  • #24: e.g. Primary entity Job (employment), second entity person Useful for quick analysis Used for linkage Data-space reduction
  • #25: e.g. Primary entity Job (employment), second entity person Useful for quick analysis Used for linkage Data-space reduction
  • #26: A demographic spine is useful for linking and or quick analysis A data spine enables a researcher to pin down where data are held on the data subject, i.e. which files.
  • #27: Example (based on GUIE) of education based data.
  • #29: Not exactly user friendly!
  • #35: I say towards the future – but this uses tech available now. And plenty already use these approaches. But cloud based delivery makes these techniques easier to implement.