DATA MANAGEMENT
SCI 2777 • Storytelling with Data • Spring 2014
Sister Edith Bogue • The College of St Scholastica
DISPOSABLE DATA MANAGEMENT
• Researchers know they need clean, reliable data
• It is the analysis that really interests them
• When data arrive, they do a quick manual clean-up of any problems they see.
• Often cut-and-paste in spreadsheets
• Look for and fix anomalies

• If no errors crop up in the analysis, they make a clean archive copy and forget about the data.

The Perils of Disposable Data Management from Prometheus Research blog at
https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
DISPOSABLE DATA MANAGEMENT
• PROBLEM #1: More data arrive, and they have to do the same cut-and-paste / sorting / combining operations all over again.
• PROBLEM #2: An anomaly appears in a later data set. They have to check all the earlier data to find out whether it is there too. It was a cut-and-paste error.
• PROBLEM #3: The results look peculiar, or are opposite to the prediction. Was it the data handling, or is it real?
The Perils of Disposable Data Management from Prometheus Research blog at
https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
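One way to avoid repeating manual clean-up is to write the cleaning rules down as a small script and rerun it on every new batch. The sketch below is only an illustration in Python: the column names (participant_id, score), the two sample batches, and the cleaning rules themselves are assumptions made up for this example, not part of the cited blog post.

```python
import csv
import io


def clean_rows(rows):
    """Apply the same cleaning rules to every batch of rows."""
    cleaned = []
    for row in rows:
        # Strip stray whitespace from every field.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Hypothetical rule: drop rows that have no participant id.
        if not row["participant_id"]:
            continue
        # Store scores as numbers; a blank score becomes None.
        row["score"] = float(row["score"]) if row["score"] else None
        cleaned.append(row)
    return cleaned


def load_batch(text):
    """Parse CSV text and clean it with the shared rules."""
    return clean_rows(csv.DictReader(io.StringIO(text)))


# Two invented batches that arrive at different times.
batch_1 = "participant_id,score\n p01 , 12.5\np02,\n"
batch_2 = "participant_id,score\np03,9\n,4\n"

# The second batch gets exactly the same treatment as the first.
all_rows = load_batch(batch_1) + load_batch(batch_2)
for row in all_rows:
    print(row)
```

Because the rules live in one function, an anomaly found later can be traced to a specific rule and fixed for all batches at once, rather than hunted down in earlier hand-edited copies.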
GOOD DATA PRACTICES
• “It’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.”
• “Data cleaning and data transformation are two major bottlenecks in data analysis.”

Good Data Management Practices for Data Analysis from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
DATA CLEANING
It should be no surprise that it takes longer
to clean messier data. Unfortunately, there
are many ways that data can be messy.
Powerful tools and practices can help you
turn messy data into clean data.
Good Data Management Practices for Data Analysis from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/

DATA TRANSFORMATION
“This is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data… The point is that frequent data transformations are required to mediate changes between these representations, introducing an underappreciated amount of friction in analysis.”
TIDY DATA
• Each variable forms a column
• Each observation forms a row
• Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)

Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
MESSY DATA
• Column names represent data values instead
of variable names
• A single column contains data on multiple
variables instead of a single variable
• Variables are contained in both rows and
columns instead of just columns
• A single table contains more than one
observational unit
• Data about an observational unit is spread
across multiple data sets
Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
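The first kind of messiness listed above (column names that are really data values) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the table, the countries "A" and "B", the year columns, and the helper name melt are all invented for the example.

```python
# Messy: the column names "2012" and "2013" are values of a variable (year).
messy = [
    {"country": "A", "2012": 10, "2013": 12},
    {"country": "B", "2012": 7, "2013": 8},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


# Tidy: one row per country-year observation, one column per variable.
tidy = melt(messy, "country", "year", "value")
for row in tidy:
    print(row)
```

After reshaping, each variable (country, year, value) has its own column and each observation has its own row, which matches the tidy data rules stated earlier.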
TIDY TOOLS
• Tidy tools are those that
accept, manipulate, and return tidy data.
• Tidy tools are like Lego blocks—individually
simple but flexible & powerful in combination.
• What tools are tidy?
• Most functions in R
• Most transformations in SPSS or SAS
• Relational databases (an entire skill of its own)

• Spreadsheets are not tidy tools

Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
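The Lego-block idea can be sketched with small functions that each accept a tidy table and return a new tidy table, so they can be chained in any order. The example below is an assumption-laden illustration in Python (the slides themselves point to R, SPSS, and SAS): the function names, the visits table, and the BMI calculation are hypothetical.

```python
def keep_rows(rows, predicate):
    """Return only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def add_column(rows, name, compute):
    """Return new rows with one extra column computed from each row."""
    return [{**row, name: compute(row)} for row in rows]


def sort_rows(rows, key):
    """Return the rows sorted by the given column."""
    return sorted(rows, key=lambda row: row[key])


# Invented tidy table: one row per participant visit.
visits = [
    {"participant": "p02", "height_cm": 180, "weight_kg": 81.0},
    {"participant": "p01", "height_cm": 160, "weight_kg": 64.0},
    {"participant": "p03", "height_cm": 170, "weight_kg": None},
]

# Each step takes a tidy table and hands a tidy table to the next step.
complete = keep_rows(visits, lambda r: r["weight_kg"] is not None)
with_bmi = add_column(
    complete,
    "bmi",
    lambda r: round(r["weight_kg"] / (r["height_cm"] / 100) ** 2, 1),
)
result = sort_rows(with_bmi, "participant")

for row in result:
    print(row)
```

None of the functions changes its input, so the original visits table stays available for checking. A spreadsheet edit, by contrast, usually overwrites the cell it touches.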
SCI 2777
• We will learn about cleaning data first with
untidy tools: spreadsheets and the like.
• They are more familiar and easy to use right away
• We will learn how to track the provenance even
with our untidy tools.

• Soon, we will use R for some tasks, and get some
basic skills for using a tidy tool for cleaning data.

Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
A CAUTIONARY EXAMPLE
• THOMAS HERNDON
• Third-year economics grad
student at UMass-Amherst
(age 28)
• Class assignment:
replicate the findings
of a published study.
• Growth in a Time of Debt by
Reinhart & Rogoff in American
Economic Review
• Finding: Growth drops off
sharply if debt is high
• Basis for austerity economics

• Could not replicate
• Found 3-4 errors.

Photo: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm


Herndon et al. (2013) Does High Public Debt Consistently Stifle Economic Growth?
A Critique of Reinhart and Rogoff. PERI Working Papers Number 322.
http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
“There were actually four errors all together. Any one error by itself would not have been enough to cause the negative average. It was the combined effect of all four of them: They interacted with each other and amplified each other—almost like a perfect storm of errors.”
Quote from: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm

Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems
from Next New Deal at http://bit.ly/1f1XUHG
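One common kind of spreadsheet slip is a formula range that leaves out some rows. The sketch below is a generic Python illustration of how that can shift an average and how a simple row-count check can catch it. The numbers and country labels are invented for this example and are not taken from either paper; the helper checked_mean is hypothetical.

```python
# Invented numbers for illustration only; not data from any published study.
growth_by_country = {
    "A": 2.0, "B": 3.0, "C": 1.0, "D": 4.0, "E": -1.0, "F": 3.0,
}
values = list(growth_by_country.values())


def mean(numbers):
    return sum(numbers) / len(numbers)


# A spreadsheet-style range that stops two rows short.
short_range = values[:4]

full_mean = mean(values)          # uses all six rows
short_mean = mean(short_range)    # silently uses only four rows


def checked_mean(numbers, expected_count):
    """Refuse to average unless the expected number of rows is present."""
    if len(numbers) != expected_count:
        raise ValueError(
            f"expected {expected_count} rows, got {len(numbers)}"
        )
    return mean(numbers)


print(full_mean, short_mean)
print(checked_mean(values, expected_count=len(growth_by_country)))
```

The short range produces a plausible-looking number with no warning, which is why a check on how many rows went into a result is worth writing down as part of the analysis.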
DATA PROVENANCE
• Main goals
• Keep a record
• Be able to replicate your steps
• Facilitate collaboration (most data work uses a team)

• Versioning
• Some software automatically keeps old versions of files
• Google Docs (online files) does this
• Dropbox also syncs files across all your devices and keeps a local copy on computers (i.e., one you can use when there is no internet)
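Keeping a record can also be done by hand or with a short script. As a minimal sketch in Python, the example below appends one line per step to a log file, with a timestamp, the file name, a checksum of the file contents, and a description. The file names (world_bank_extract.csv, provenance.jsonl), the log format, and the function record_step are assumptions chosen for illustration, not something prescribed by the course.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Checksum of the file contents; it changes whenever the data change."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_step(log_path, data_path, description):
    """Append one provenance entry (a JSON line) describing a step."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "file": Path(data_path).name,
        "sha256": file_sha256(data_path),
        "step": description,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


# Demonstration in a temporary folder with invented file contents.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "world_bank_extract.csv"
    log_file = Path(folder) / "provenance.jsonl"

    data_file.write_text("country,value\nA,1\n", encoding="utf-8")
    first = record_step(log_file, data_file, "downloaded raw file")

    data_file.write_text("country,value\nA,1\nB,2\n", encoding="utf-8")
    second = record_step(log_file, data_file, "added row for country B")

    log_lines = log_file.read_text(encoding="utf-8").splitlines()

for line in log_lines:
    print(line)
```

The log gives collaborators the order of the steps, and the checksum shows whether the file they hold is the same version the step was applied to.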
TODAY
• Look at the World Bank Data visually: what do we
notice?
• World Bank Data – computing variables in spreadsheet
using the School of Data instructions.
• Getting your first look at Graphs using the School of
Data instructions.

• Seeing versions of files in Google Drive
GOALS BY JANUARY 29
• Clean data from the World Bank
• First graphs of variables
• Practice in dreaming up analyses
• Beginning to find our own data
• Basic Descriptive Statistics in ALEKS
• Basic Graphics in ALEKS
• FUN with Design

• First thoughts about your projects
