MESSY AND COMPLEX
DATA 1
This slide deck was produced for the in-person workshop "Messy and complex data" (part 1) at the ADR UK 2023 conference, Tuesday 14 November
This session
• Intro
• Getting started
• Definitions
• This session
• Principles of Data Engineering
• Structure and complexity
Introductions
Presenters
• Jonathan Swan
• Head of ADRUK Data Engineering, ONS
• Jen Hampton
• Head of ADRUK Linkage, ONS
Definitions: Complex Data
• “Data that are hard to process and translate into a digestible format.”
• Complexity may be because of
• Size (wide or tall)
• Number of Sources
• Structure
• Multiple entities (e.g. people, households, jobs, employers all in same data)
• Relationships within the data, between entities, or between sources
• The type of data (pictures, sound, video, free text, etc.)
Definitions: Messy Data
• Data where there is a barrier to using the data for analysis.
• Examples of Messy data
• missing data.
• unstructured data.
• multiple variables in one column.
• variables stored in the wrong places.
• observations split incorrectly or left together against normalization rules.
• switched columns and rows.
• extra spaces
• Also
• uncertainty
• imprecision
PRINCIPLES OF DATA
ENGINEERING
Tuesday 14 November
Why do data engineering principles
matter?
• “I’m a researcher – why should I care about data engineering?”
• Data engineering includes the process of preparing data to enable
users, researchers, to use the data.
• Using the right principles enables:
• Ease of use
• Consistency of use
• Accuracy of the data
• Accuracy in interpretation
• Avoidance of error
Standard set of Data Engineering Principles?
• 7 Data Engineering Principles You Should Be Aware Of
• 6 data integration principles for data engineers to live by
• The Three P's of Data Engineering
• Data engineering principles - AWS Prescriptive Guidance
• Essential Data Engineering Concepts and Principles
• Data Engineering Design Principles
• SOLID Principles in Data Engineering
• 10 Major DataOps Principles to Overcome Data Engineer Burnout Simplified
Questions for this session
• Can we form a list of practical Data Engineering principles?
• What do they mean to us in practice?
• Does usage for research data change anything?
The art of a data engineer is dealing with
messy and complex data, whilst maintaining
clarity.
Embrace complexity,
Avoid complicated.
Two main frameworks of principles (sort of)
Software Engineering Principles
• Flexibility
• Reproducibility
• Reusability
• Scalability
• Auditability
SOLID Framework
• Single Responsibility
• Open/Closed
• Liskov Substitution
• Interface Segregation
• Dependency Inversion
… And the Acronyms
DRY … Don’t Repeat Yourself
YAGNI … You Ain’t Gonna Need It
KISS … Keep It Simple Stupid
For the Record - SOLID
Single responsibility | A class should have only one responsibility.
Open/closed | Classes should be open for extension, but closed for modification.
Liskov substitution | If A is a subtype of class B, we should be able to replace B with A without disrupting program behaviour. (If B is the class Car, it should also work for Electric Car.)
Interface segregation | Larger interfaces should be split into smaller ones. By doing so, we can ensure that implementing classes only need to be concerned with the methods that are of interest to them.
Dependency inversion | High-level modules, which provide complex logic, should be easily reusable and unaffected by changes in low-level modules, which provide utility features.
Towards a Principles Framework for Data Engineering to Support Research
• Delivery Principles: principles that help data engineers deliver
• User Principles: principles that aid research
• Implementation Principles: principles for technical implementation (like SOLID)
What principles are key?

Classification | Principle | Practice | Implication for researchers
User | Change minimum | Keep data values as unchanged as possible. Even if adding value, keep originals. | Researchers can be confident that data are as recorded.
User | Derived Variables in source data | Where several users need DVs, agree and add DVs to researcher data. | Users can work off consistent information and don't need to re-invent the wheel.
User | Standardise data | Provide data that meet standards; where necessary include original and standardised data. | Users can more easily interpret and compare data and results.
User | Use meaningful terms | Where possible use meaningful file and variable names, and meaningful category names (Male, Female vs 1, 2). | Makes analysis easier for the user, and reduces the risk of errors.
User | Ease of use | Prioritise ease of use over storage efficiency. | Easier for researchers.
Classification | Principle | Practice | Implication for researchers
Delivery | Reproducibility | Use a reproducible pipeline that meets good coding standards. | Reliable delivery, able to consistently re-supply. With adaptation, enables re-supply with changes.
Delivery | Reusability | Use functions and modular code. Generalise where possible. | Quicker, more reliable development. Helps consistency.
Delivery | Scalability | Write code and use tools that facilitate scaling to increased (or decreased) data volumes. |
Delivery | Auditability | Use procedures, tools, and code so that changes to pipelines and data can be traced, sourced, reasoned, and justified. | Helps ensure the reliability of data.
Delivery | Document and Share | Document everything! Code, derivations, etc. Share documentation and code. | Researchers can understand how their data are derived.
Over to you
• Do you have any comments on these principles?
• What other principles are important, especially to researchers?
• A reminder (each principle has a Classification – User, Delivery, or (Implementation) – a Practice, and an Implication for researchers):

User Principles: Change minimum; Derived Variables in source data; Standardise data; Use meaningful terms; Ease of use
Delivery Principles: Reproducibility; Reusability; Scalability; Auditability; Document and Share
STRUCTURING DATA
FOR RESEARCHERS
Tuesday 14 November
Structuring data
• Underlying principle – make the data as easy to use as possible
• But modern data get complex fast!
• And linking across sources ratchets up the complexity!
Starting simple – a flat table
• Simplest to analyse
• Ideal for simple surveys
• e.g. Opinions Survey

ID   | Type of House | Number of Bedrooms
HID1 | Semi detached | 2
HID2 | Terrace       | 4
HID3 | Flat          | 1
Sparse tables
• Keep the advantages of a flat table
• Not efficient for memory
• Good for linked data
• e.g. ASHE linked to Census
ID   | House_type    | No. Bedrooms
HID1 | Semi detached | 2
HID2 | Terrace       | 4
HID3 | Flat          | 1

ID   | No. Cars
HID1 | 3
HID3 | 0
HID4 | 1

Linked result:
ID   | House_type    | No. Bedrooms | No. Cars
HID1 | Semi detached | 2            | 3
HID2 | Terrace       | 4            | <NULL>
HID3 | Flat          | 1            | 0
HID4 | <NULL>        | <NULL>       | 1

Consider: a NULL in the original data vs a NULL meaning no match.
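One way to keep that null-in-original vs null-means-no-match distinction explicit is an outer join with a match indicator. A minimal pandas sketch, using the toy house and car tables from this slide (column names are illustrative):

```python
import pandas as pd

houses = pd.DataFrame({
    "ID": ["HID1", "HID2", "HID3"],
    "House_type": ["Semi detached", "Terrace", "Flat"],
    "Bedrooms": [2, 4, 1],
})
cars = pd.DataFrame({
    "ID": ["HID1", "HID3", "HID4"],
    "Cars": [3, 0, 1],
})

# Outer join keeps every ID from both sources; indicator=True adds a
# _merge column recording whether a null arose from a failed match
# rather than being null in the original data.
linked = houses.merge(cars, on="ID", how="outer", indicator=True)
print(linked)
```

Here HID2 shows `left_only` (no car record) and HID4 shows `right_only` (no house record), so researchers can tell those nulls apart from genuine missing values.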
Wide, Stacked, Summarised
• Three ways to structure the same data

Summarised:
ID   | No. Cars
HID1 | 3
HID3 | 0
HID4 | 1

Stacked:
ID   | Model
HID1 | Ford Focus
HID1 | Toyota Aygo
HID1 | Range Rover Velar
HID3 | <NULL>
HID4 | Bentley Continental

Wide:
ID   | Model 1             | Model 2     | Model 3
HID1 | Ford Focus          | Toyota Aygo | Range Rover Velar
HID3 | <NULL>              | <NULL>      | <NULL>
HID4 | Bentley Continental | <NULL>      | <NULL>
Over to you:
• How would you structure the house data linked to the car models data?
And Why?
• HID, House Type, No. Bedrooms
• HID, Make 1, Make 2, etc.
• How would you structure data on pupils covering all their qualifications?
And Why?
• ID
• Qualification type
• Subject
• Grade/Result
More on Stacking
• Useful when the row is about an entity of interest
• e.g. ASHE data relate to employments
• Each person can have several employments
• Some research interest is around employments
• Useful for data covering several time periods
• e.g. (again) ASHE data
• Wide data structure sometimes not practical
• If lots of entries for an entity (e.g. cars to a household)
• Or if large variation in the number of entries
• Stacking doesn’t work well if lots of variation in data contents
• e.g. New Earnings Survey (NES) questions varied dramatically from year to year
Multiple flat files
• What it says on the tin
• Linked by common keys
• E.g. Labour Force Survey
• Individual File
• Household File
• Longitudinal File
ID       | HHID | Age | Employment Status
Person 1 | HH1  | 45  | Full Time
Person 2 | HH2  | 28  | Unemployed
Person 3 | HH2  | 1   | <NULL>
Person 4 | HH3  | 36  | Full Time
Person 5 | HH3  | 34  | Economically inactive
Person 6 | HH3  | 10  | <NULL>

HHID | Property type | No. Employed
HH1  | Flat          | 1
HH2  | Terrace       | 0
HH3  | Semi-detached | 1
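Because the files share a common key, household attributes can be pulled onto each person's row (or vice versa) with a join. A minimal pandas sketch with a subset of the example values (column names illustrative):

```python
import pandas as pd

# Individual and household files linked by the common key HHID.
people = pd.DataFrame({
    "ID": ["Person 1", "Person 2", "Person 3"],
    "HHID": ["HH1", "HH2", "HH2"],
    "Age": [45, 28, 1],
})
households = pd.DataFrame({
    "HHID": ["HH1", "HH2", "HH3"],
    "Property_type": ["Flat", "Terrace", "Semi-detached"],
})

# Left join keeps one row per person and attaches household attributes.
analysis = people.merge(households, on="HHID", how="left")
print(analysis)
```

The same join in the other direction supports household-level analysis, e.g. aggregating person-level variables up to the household.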
Table + Spine
• Used for stacked data
• Spine contains ‘fixed’ or persistent data
• Traditionally one spine entry per secondary entity
• But can be more than one if details change
PID | Income | Job Start
P1  | 28,000 | 01/04/2022
P1  | 32,000 | 01/08/2023
P2  | 96,000 | 01/01/2018

PID | Age at xx | Sex
P1  | 32        | Male
P2  | 52        | Female
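When person-level attributes are needed alongside the stacked rows, the spine is joined on. A hedged pandas sketch of the jobs-plus-spine example above (names and dates as in the slide, ISO-formatted):

```python
import pandas as pd

# Stacked jobs table plus a spine of persistent person-level data.
jobs = pd.DataFrame({
    "PID": ["P1", "P1", "P2"],
    "Income": [28000, 32000, 96000],
    "Job_start": pd.to_datetime(["2022-04-01", "2023-08-01", "2018-01-01"]),
})
spine = pd.DataFrame({
    "PID": ["P1", "P2"],
    "Sex": ["Male", "Female"],
    "Age": [32, 52],
})

# Left join: every job row gains the person's fixed attributes.
jobs_with_person = jobs.merge(spine, on="PID", how="left")
print(jobs_with_person)
```

Keeping the fixed attributes in the spine avoids repeating them on every stacked row, while the join makes them available whenever analysis needs them.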
Multiple Tables + Spine
• Used for data covering multiple sources
• Where the data structure is not amenable to stacking etc.
• Data Spine indicates source of data and may contain fixed data
• One entry per data subject
• Demographic spine gives ‘fixed’ or persistent data
• Can use multiple spines
• Typically one data spine + one demographic spine
Multiple Tables + Spine - example
Exam results:
ID   | Subject | Grade
ID 1 | English | 8
ID 1 | Maths   | 9
ID 2 | Greek   | 7
ID 2 | Latin   | 4
ID 3 | Physics | 9
ID 3 | French  | 2
ID 4 | Biology | 3

PLASC (school census):
ID   | Sex | Age at xx | School ID
ID 1 | F   | 17        | ABC
ID 2 | M   | 18        | ABC
ID 4 | F   | 18        | DEF
ID 5 | M   | 11        | GHI
ID 6 | F   | 9         | GHI

Degree results:
ID   | Subject       | Result
ID 3 | Physics       | First
ID 7 | Media studies | 2:2

Data spine:
ID   | Exam  | PLASC | Degree
ID 1 | TRUE  | TRUE  | FALSE
ID 2 | TRUE  | TRUE  | FALSE
ID 3 | TRUE  | FALSE | TRUE
ID 4 | TRUE  | TRUE  | FALSE
ID 5 | FALSE | TRUE  | FALSE
ID 6 | FALSE | TRUE  | FALSE
ID 7 | FALSE | FALSE | TRUE

Demographic spine:
ID   | Age at xx | Sex    | Postcode
ID 1 | 17        | M      | SW1A 1AA
ID 1 | 18        | M      | SW1A 1AA
ID 2 | 18        | F      | SW1A 0AA
ID 2 | 18        | F      | SW1A 0PW
ID 3 | 19        | <NULL> | PO15 5RR
ID 4 | 18        | F      | NP10 8XG
ID 5 | 11        | M      | DL1 5AD
ID 6 | 9         | F      | M1 6EU
ID 7 | 22        | F      | SW1P 4DF
ID 7 | 22        | F      | EC3N 4AB
Other data structures
• JSON
• JavaScript Object Notation
• Plain text
• Language independent
• Used to talk computer to computer
• Great for complex data
• But a pig to analyse without conversion!
JSON Example
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}

Not exactly user friendly!
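Converting JSON like this for analysis usually means flattening the nested lists into tidy tables. A minimal sketch using pandas' `json_normalize` on a trimmed version of the donut document above:

```python
import json
import pandas as pd

doc = json.loads("""{
  "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55,
  "batters": {"batter": [
    {"id": "1001", "type": "Regular"},
    {"id": "1002", "type": "Chocolate"}
  ]},
  "topping": [
    {"id": "5001", "type": "None"},
    {"id": "5002", "type": "Glazed"}
  ]
}""")

# Flatten the nested batter list into one row per batter, carrying
# selected product-level fields along on each row.
batters = pd.json_normalize(doc, record_path=["batters", "batter"],
                            meta=["name", "ppu"])
print(batters)
```

Each nested list (batters, toppings) becomes its own table keyed back to the product, i.e. the JSON is re-expressed as linked flat tables.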
Other data structures
• JSON
• Relational Databases
• A way of storing data across multiple tables
• Can be made available as multiple related tables
• Easier to analyse if translated to stacked or wide tables
Other data structures:
Relational Database

Student_ID | Name
1          | Alice
2          | Bob
3          | Cate

Student_ID | Course_ID
1          | 1
1          | 2
2          | 1
2          | 3
3          | 1
3          | 2
3          | 3

Course_ID | Course
1         | English
2         | Maths
3         | Science
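Translating these related tables into a single stacked analysis table is a pair of joins through the link table. A minimal pandas sketch of the student/course example (table names are illustrative):

```python
import pandas as pd

students = pd.DataFrame({"Student_ID": [1, 2, 3],
                         "Name": ["Alice", "Bob", "Cate"]})
enrolments = pd.DataFrame({"Student_ID": [1, 1, 2, 2, 3, 3, 3],
                           "Course_ID": [1, 2, 1, 3, 1, 2, 3]})
courses = pd.DataFrame({"Course_ID": [1, 2, 3],
                        "Course": ["English", "Maths", "Science"]})

# Resolve the many-to-many link table into one stacked table:
# one row per (student, course) with names attached.
flat = (enrolments
        .merge(students, on="Student_ID")
        .merge(courses, on="Course_ID"))
print(flat)
```

The result is the stacked structure from earlier in the session, which most analysis tools handle directly.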
Other data structures
• JSON
• Relational Databases
• Graph databases
• Stores nodes and relationships
• Not tables
• Useful for complex linkage and matching
• Useful for visualisation
• Useful for uncertainty in data/relationships
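To make the idea concrete without assuming any particular graph database product, here is a hypothetical plain-Python sketch of nodes and relationships, with a weight on one edge to represent linkage uncertainty:

```python
# Nodes carry properties; edges carry a relationship type and,
# here, a match weight representing linkage uncertainty.
nodes = {
    "P1": {"label": "Person", "name": "Ann"},
    "P2": {"label": "Person", "name": "An"},
    "H1": {"label": "Household"},
}
edges = [
    ("P1", "LIVES_IN", "H1", {"weight": 1.0}),
    ("P1", "POSSIBLE_MATCH", "P2", {"weight": 0.83}),
]

# Traversal works by following relationships, not by joining tables.
def neighbours(node, rel=None):
    return [(target, props) for src, r, target, props in edges
            if src == node and (rel is None or r == rel)]

print(neighbours("P1", "POSSIBLE_MATCH"))
```

Real graph databases store data this way natively and add query languages for traversals; the point here is just that the relationship (and its uncertainty) is a first-class object.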
Graph Database Example
Image: By Ole Mussmann - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=87002327
Other data structures
• JSON
• Relational Databases
• Graph databases
Towards the future
• Data queries
• Pre-canned data queries
• Updates when the data does
• Useful for consistency
• Useful for sharing common queries
• Creating and updating derived variables
• Data Views
• Build a view for analysis
• Re-presents underlying data
• Can be built across several sources
• When the data is updated / changed – reflected in the view
• Cloud-based tools are making building custom views much easier
• Dashboards
• Present summary information
• Can be graphical
• Updates when the data does
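A data view in miniature, as a hedged sketch using Python's built-in sqlite3 (table and view names invented for illustration): the view re-presents the underlying table, so re-querying it after the data change returns the updated result with no extra work.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE earnings (pid TEXT, year INTEGER, income REAL);
INSERT INTO earnings VALUES ('P1', 2022, 28000), ('P1', 2023, 32000),
                            ('P2', 2023, 96000);
-- The view stores no data of its own: it is a saved query over the
-- underlying table, so it always reflects the current data.
CREATE VIEW latest_income AS
  SELECT pid, income FROM earnings e
  WHERE year = (SELECT MAX(year) FROM earnings WHERE pid = e.pid);
""")
print(con.execute("SELECT * FROM latest_income ORDER BY pid").fetchall())
```

The same idea scales up in cloud warehouses, where views can also span several sources and serve as shared, consistent definitions of derived variables.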
Structuring Data
Over to you!
• Are there any other data structures you have used?
• When have you had difficulty with structure?
• When has a structure worked well for you (for complex data)?
• What techniques have you found useful?
Editor's Notes
  • #2: Throughout there will be polls via Slido & opportunities for discussion in groups (as seated in room; breakout rooms online) Suggest you log into Slido now.
  • #5: We’ll return to messy data in the second session
  • #8: If you Google “Data Engineering Principles” this is what you get. There is no standard framework for principles.
  • #9: For this session I would like us to have three questions in the back of our minds …
  • #10: Of the proposals for data engineering principles, various sources propose principles based around software engineering principles – but no two sources have the same idea of which software engineering principles to start from, or which are most relevant to data engineering. Then we have the SOLID framework – which is all fine and dandy – but is very much aimed at the technical end of engineering. And there are also a few acronyms floating around – which are kind of fun.
  • #11: I’m not going to dwell on the technical end of things, but for the record, and if you have not come across the SOLID Framework here’s a little detail you can look at in your own time.
  • #12: I’d like to focus on user and delivery principles and not worry about the technical implementation principles.
  • #23: Useful when you have linked entity types, where different entities of research interest. e.g. analysis at individual and separately at HH level Joining means you can pull new derived variables between tables – e.g. say, income of highest earner onto HH file.
  • #24: e.g. Primary entity Job (employment), second entity person Useful for quick analysis Used for linkage Data-space reduction
  • #25: e.g. Primary entity Job (employment), second entity person Useful for quick analysis Used for linkage Data-space reduction
  • #26: A demographic spine is useful for linking and or quick analysis A data spine enables a researcher to pin down where data are held on the data subject, i.e. which files.
  • #27: Example (based on GUIE) of education based data.
  • #29: Not exactly user friendly!
  • #35: I say towards the future – but this uses tech available now. And plenty already use these approaches. But cloud based delivery makes these techniques easier to implement.