“Provenance and Social Science Data”
15 March 2017
Documenting Data Transformations
George Alter, University of Michigan
• Data are useless without Metadata – “data
about data”
• Metadata should:
– Include all information about data creation
– Describe transformations to variables
– Be easy to create
• Our goal: Automated capture of metadata
Why Metadata?
A few words about ICPSR
• World’s largest archive of social science data
• Consortium established 1962
• 760+ member institutions around the world
• Founding member and home office for the DDI Alliance
Powered by DDI Metadata
ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML
Codebooks (pdf and online) are rendered from the DDI.
Searchable database of
4.5M variables
Click here for
online
codebook
Online codebook shows
variable in context of
dataset
Link to online
crosstab tool
What question
was asked?
How was the
question coded?
Link to online
graph tool
Searchable database of
4.5M variables
Click here for
variable
comparison
Variable comparison
display
Click here for
online
codebook
Search for datasets with
3 desired variables
Check boxes
for variable
comparison
Crosswalk for American National Election
Study (ANES) and General Social Survey
(GSS)
Columns link to
70 datasets
134 tags in
8 lists
Variable
comparison
display
Variables linked to
online codebooks
Metadata for the American National Election Study
What question
was asked?
Who answered
this question?
How was the
question coded?
Metadata for the American National Election Study
Who answered
this question?
Who answered
this question?
How do we know who
answered the question?
It’s in the pdf.
When data arrive at the
archive…
• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented.
How is research data created?
• Most surveys are conducted with computer
assisted interview software (CAI)
– CATI – Computer-assisted Telephone Interview
– CAPI – Computer-assisted Personal Interview
– CAWI – Computer-assisted Web Interview
• There is no paper questionnaire
• The CAI program is the questionnaire
– i.e. the program is the metadata
[Diagram: Computer Assisted Interviewing (CAI) produces the original data. CAI-to-DDI converters (Colectica, MQDS, others) produce the original metadata as DDI XML.]
We already have tools to convert CAI to machine-readable metadata.
[Diagram: Computer Assisted Interviewing (CAI) produces the original data; CAI-to-DDI converters (Colectica, MQDS, others) produce the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data.]
What happens when a
project modifies the data.
The modified
data no longer
match the
metadata.
[Diagram: Computer Assisted Interviewing (CAI) produces the original data; CAI-to-DDI converters (Colectica, MQDS, others) produce the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. A statistical-package-to-DDI step then extracts metadata from the SPSS/SAS/Stata/R data file as DDI XML.]
Metadata are re-
created after the
data are
transformed.
Transformations
are documented
by hand
Statistics packages have limited
metadata
• Variable names
• Variable labels
• Value labels
• No provenance
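[Diagram: Computer Assisted Interviewing (CAI) produces the original data; CAI-to-DDI converters (Colectica, MQDS, others) produce the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. A script parser reads the command scripts and emits Standard Data Transformation Language (SDTL) XML; an updater uses the SDTL XML and the original DDI XML to produce revised metadata as DDI XML.]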
Automating the
capture of
transformation
metadata.
Missing links that we
will build.
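A rough sketch of what a script parser might do, written in Python. Everything here is a hypothetical illustration: the regular expression, the function name `describe_stata_command`, and the dictionary keys are assumptions made for this example and are not the actual SDTL schema or any project code. It handles only two simple Stata commands.

```python
import json
import re

# Hypothetical pattern covering only "generate" and "replace" with an
# optional "if" clause. A real script parser needs a full grammar.
STATA_COMMAND = re.compile(
    r"^(?P<command>generate|replace)\s+(?P<target>\w+)\s*=\s*"
    r"(?P<expression>.+?)(?:\s+if\s+(?P<condition>.+))?$"
)


def describe_stata_command(line):
    """Turn one Stata command into a package-neutral description.

    The keys below are invented for illustration; they are not SDTL.
    """
    text = line.strip()
    match = STATA_COMMAND.match(text)
    if match is None:
        raise ValueError(f"unsupported command: {text!r}")
    is_new = match["command"] == "generate"
    return {
        "source_language": "Stata",
        "source_text": text,
        "operation": "create_variable" if is_new else "update_variable",
        "target_variable": match["target"],
        "expression": match["expression"].strip(),
        "condition": match["condition"],  # None when there is no "if"
    }


if __name__ == "__main__":
    script = ["replace X=. if X==-1", "generate Y=9 if X>3"]
    records = [describe_stata_command(line) for line in script]
    print(json.dumps(records, indent=2))
```

Keeping the original command text next to the neutral description means the provenance of each variable can be traced back to the line that created or changed it.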
What statistics packages should be
covered?
ICPSR Downloads by Format
Format         | All downloads | Studies with all formats
Delimited text | 43%           | 29%
SPSS           | 22%           | 24%
SAS            | 10%           | 12%
Stata          | 19%           | 23%
R              | 5%            | 12%
Excel          | 0%            | 1%
Other          | 0%            | 0%
Total          | 100%          | 100%
Number         | 378,007       | 154,663
Input Data Output Data
SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
X
2
3
4
-1
Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
X
2
3
4
-1
SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
X
2
3
4
-1
Why do we need an SDTL?
Input Data Output Data
SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
Input X | Output X | Y | Z
2       | 2        |   | 8
3       | 3        |   |
4       | 4        | 9 |
-1      | -1       |   |
Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
Input X | Output X | Y | Z
2       | 2        |   | 8
3       | 3        |   |
4       | 4        | 9 |
-1      |          | 9 |
SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
Input X | Output X | Y | Z
2       | 2        | . | 8
3       | 3        | . | .
4       | 4        | 9 | .
-1      | .        | . | 8
Why do we need an SDTL?
What happens when a missing value is
in a logical comparison?
• SPSS
– Logical expressions including a missing value are
considered “Missing.” Usually, “Missing” is equivalent to
“False.”
• Stata
– Missing values are treated as numbers equal to infinity.
So, any number is less than a missing value.
• SAS
– Missing values are treated as numbers equal to minus
infinity. So, any number is greater than a missing value.
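The three rules above can be sketched in Python. This is an illustrative emulation, not code from any of the packages: the function names and the use of `None` for a missing value are assumptions made for this example. It applies the same recode and the two conditional assignments to X = 2, 3, 4, -1 under each rule.

```python
import math
import operator


def as_number(x, semantics):
    """Map a missing value (None) to the number it is compared as."""
    if x is not None:
        return x
    if semantics == "stata":
        return math.inf    # missing is larger than any number
    if semantics == "sas":
        return -math.inf   # missing is smaller than any number
    return None            # "spss": the comparison itself is missing


def condition(x, op, threshold, semantics):
    """Evaluate `x op threshold` under the chosen missing-value rule."""
    value = as_number(x, semantics)
    if value is None:
        return False       # SPSS: a missing result is treated as false
    return op(value, threshold)


def transform(raw_values, semantics):
    """Recode -1 to missing, then set Y=9 if X>3 and Z=8 if X<3."""
    rows = []
    for raw in raw_values:
        # Simplification: SPSS keeps the stored -1 and only flags it
        # as missing; here every package gets None.
        x = None if raw == -1 else raw
        y = 9 if condition(x, operator.gt, 3, semantics) else None
        z = 8 if condition(x, operator.lt, 3, semantics) else None
        rows.append((x, y, z))
    return rows


if __name__ == "__main__":
    for name in ("spss", "stata", "sas"):
        print(name, transform([2, 3, 4, -1], name))
```

Only the last row differs between packages: the missing X yields neither Y nor Z under the SPSS rule, Y=9 under the Stata rule, and Z=8 under the SAS rule.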
Input Data Output Data
SPSS
MISSING VALUES X(-1).
IF (X > 3) Y=9.
IF (X < 3) Z=8.
Input X | Output X | Y | Z
2       | 2        |   | 8
3       | 3        |   |
4       | 4        | 9 |
-1      | NULL     |   |
Stata
replace X=. if X==-1
generate Y=9 if X>3
generate Z=8 if X<3
Input X | Output X | Y | Z
2       | 2        |   | 8
3       | 3        |   |
4       | 4        | 9 |
-1      | ∞        | 9 |
SAS
if X=-1 then X=.;
if X>3 then Y=9;
if X<3 then Z=8;
Input X | Output X | Y | Z
2       | 2        | . | 8
3       | 3        | . | .
4       | 4        | 9 | .
-1      | -∞       | . | 8
Missing Values in Comparisons
Benefits of automated metadata
capture
• Metadata will be better
– All the information in the CAI can be included.
– Variable transformations can be described
• Automation will lower costs
– Metadata will not be discarded and re-created
• All metadata will be standardized and machine-readable
– Codebooks with rich information can be rendered at will
• If we make it easy and beneficial, researchers will
use it.
Continuous Capture of Metadata for
Statistical Data
(NSF ACI-1640575)
Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan
Questions?
George Alter
altergc@umich.edu
