Check points for
data quality
Soonmok Kwon
Naver
Last update 2019.2.22
Data pipeline and data quality
Sometimes, it is as serious as code quality.
(Figure: app and data)
Introduction
• Let’s check the following topics for data quality:
• File type
• Text content type
• Table data
• Tidiness
• Wholeness
File types
• Text, Parquet, Hadoop SequenceFile, …
• + various compression codecs
• Characteristics of those types?
• space efficiency
• processing speed (including compression/decompression times)
• human readability (simplicity)
• language dependency
• Use a general-purpose format.
• A Hadoop SequenceFile can be read from Python with some libraries, but life would be much easier if the original file were TEXT.
• The content, not the application logic, should decide the file type.
• Because, some day, applications with different logic may need to access the data.
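The trade-off between space efficiency and processing speed can be measured rather than guessed. The following is a minimal sketch using only the Python standard library; the sample records and the choice of the three codecs are assumptions for illustration, not part of the original slides.

```python
import bz2
import gzip
import json
import lzma
import time

# Hypothetical sample: 5,000 small records serialized as JSON Lines text.
records = [{"id": i, "word": f"item{i % 50}", "score": i % 7} for i in range(5000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Each codec is a (compress, decompress) pair from the standard library.
codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

report = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(raw)
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == raw  # the round trip must be lossless
    report[name] = {"bytes": len(packed), "seconds": elapsed}

print(f"raw text: {len(raw)} bytes")
for name, stats in report.items():
    print(f"{name}: {stats['bytes']} bytes, {stats['seconds']:.4f} s round trip")
```

Timings depend on the machine and the data, so numbers like these are only meaningful when measured on a representative sample of the real files.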
Text content types
• TSV (CSV), JSON, XML, …
• Characteristics of those types?
• processing speed
• space efficiency
• human readability (simplicity)
• self-explanation power
• expandability
• JSON is an all-rounder. Good for most cases.
• Don’t mix formats (JSON in XML, JSON in TSV, …)!
• You will have a hard time generalizing the code that processes such data.
• You will run into annoying character-escaping issues.
• You will lose the support of the developer community, such as validation code.
• In general, when you mix, the sub-content written in the higher-level format loses its functionality.
• Put them in TEXT files.
• Because they are text.
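The escaping problem with mixed formats can be reproduced in a few lines. This is a minimal sketch assuming a hypothetical record with quotes and commas in its values; it embeds JSON in a CSV cell and compares that with keeping the whole record as one JSON line.

```python
import csv
import io
import json

# Hypothetical payload containing characters that both formats treat specially.
payload = {"name": 'Kim "Neo"', "tags": ["a,b", "c"]}

# Mixed: JSON embedded as one cell of a CSV row.
buf = io.StringIO()
csv.writer(buf).writerow(["user-1", json.dumps(payload)])
mixed_line = buf.getvalue().rstrip("\r\n")

# A naive comma split breaks, because the JSON itself contains commas.
naive_pieces = mixed_line.split(",")

# The raw cell is not valid JSON any more: CSV doubled every quote.
raw_cell = mixed_line.split(",", 1)[1]
try:
    json.loads(raw_cell)
    raw_cell_is_json = True
except json.JSONDecodeError:
    raw_cell_is_json = False

# Correct decoding needs two parsers applied in the right order.
row = next(csv.reader([mixed_line]))
decoded_mixed = json.loads(row[1])

# Unmixed: the whole record is one JSON document per line, one parser.
json_line = json.dumps({"id": "user-1", **payload})
decoded_plain = json.loads(json_line)

print(mixed_line)
print(json_line)
```

With the mixed line, every consumer has to know about both layers of escaping; with the plain JSON line, any JSON tool or validator works directly.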
Table data
• Table data is a dataset that can be converted to a 2-dimensional table with mostly PRIMITIVE-type values in each cell.
• This is what data scientists call a DATASET.
• Some data is in TSV (CSV), JSON, XML, … but is not a table.
• E.g. an arbitrary number of items per line in TSV
• Non-flat (nested) JSON
• In data science, terms like TSV, CSV, and JSON are used with the following meanings:
• TSV: table data in TSV
• CSV: table data in CSV
• JSON: table data in JSON
• When Spark can’t read a CSV, it is not table data in CSV.
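Whether a set of JSON records counts as table data can be checked mechanically. The sketch below is one possible check, under a stricter assumption than the slide's "mostly primitive": every record must have the same keys and only primitive values. The function name `is_table` and the sample records are hypothetical.

```python
import json

# Types treated as primitive cell values (an assumption of this sketch).
PRIMITIVES = (str, int, float, bool, type(None))


def is_table(records):
    """Return True if the records form a 2-D table: same keys, primitive cells."""
    if not records or not all(isinstance(r, dict) for r in records):
        return False
    columns = set(records[0])
    return all(
        set(r) == columns and all(isinstance(v, PRIMITIVES) for v in r.values())
        for r in records
    )


# Flat JSON Lines: every line has the same keys and primitive values.
flat = [json.loads(s) for s in ('{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}')]

# Nested JSON: a cell holds an object, so it is JSON but not a table.
nested = [json.loads('{"id": 1, "profile": {"age": 30}}')]

# Ragged records: like a TSV with an arbitrary number of items per line.
ragged = [{"id": 1, "a": 1}, {"id": 2, "a": 1, "b": 2}]

print(is_table(flat), is_table(nested), is_table(ragged))
```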
Table data – Tidy data
• Definition
• Table data is TIDY when
• Each column is a VARIABLE, and
• Each row is an OBSERVATION.
• In this definition, variable and observation are statistics terminology.
• Observation: the values of all aspects measured in a single experiment
• Variable: an aspect to be measured in experiments. A feature.
• Some properties
• No variable needs to be split further
• Each variable has ONE type
• The process of building tidy data has been systematized in large part, but many parts remain an art.
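One of the systematized steps is turning column headers that are really values into a proper variable. The following is a minimal sketch in plain Python; the `melt` helper, its parameters, and the city/year numbers are all hypothetical illustration data, not taken from the slides.

```python
# Untidy: the years are spread across columns, so the variable "year"
# is hidden in the header instead of being a column of its own.
wide = [
    {"city": "Seoul", "2018": 10, "2019": 12},
    {"city": "Busan", "2018": 7, "2019": 9},
]


def melt(rows, id_column, variable_name, value_name, cast=str):
    """Turn every non-id column into one (variable, value) row per observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: cast(column),  # give the variable ONE type
                    value_name: value,
                }
            )
    return tidy


# Tidy: each row is one observation (city, year), each column one variable.
tidy = melt(wide, id_column="city", variable_name="year", value_name="visits", cast=int)
for row in tidy:
    print(row)
```

After the reshaping, each column holds a single variable with a single type, which is the property the definition above asks for.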
Table data – Tidy data
Example data engineering process to create tidy data
(Figure: three numbered steps, 1 to 3; image not reproduced)
Table data – Tidy data (continued)
Example data engineering process to create tidy data
(Figure: three numbered steps, 1 to 3; image not reproduced)
Table data – Whole data
• Is data on the same topic scattered all around?
• That’s bad. Let’s gather it up.
• But this move can result in un-tidy data.
• When you pursue wholeness, consider tidiness.
• Or, offer both the parts and the whole.
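Gathering scattered parts does not have to cost tidiness if the information that distinguished the parts becomes a column of its own. This is a minimal sketch under that assumption; the monthly parts, the `gather` and `split` helpers, and the column names are hypothetical.

```python
# Hypothetical scattered parts: one small dataset per month.
parts = {
    "2019-01": [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}],
    "2019-02": [{"user": "a", "clicks": 4}],
}


def gather(parts, key_name):
    """Build the whole dataset, keeping the part key as an ordinary column."""
    whole = []
    for key, rows in sorted(parts.items()):
        for row in rows:
            whole.append({key_name: key, **row})
    return whole


def split(whole, key_name):
    """Recover the parts from the whole, so both views can be offered."""
    result = {}
    for row in whole:
        rest = {k: v for k, v in row.items() if k != key_name}
        result.setdefault(row[key_name], []).append(rest)
    return result


whole = gather(parts, key_name="month")
for row in whole:
    print(row)
```

Because the month lives in a column rather than in a file name or a column header, the whole stays tidy and the parts can be regenerated from it at any time.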
Conclusion
• Mind the following topics for data quality:
• File type
• Text content type
• Table data
• Tidiness
• Wholeness

