Data Preparation for Data Science
Casey Stella
@casey_stella
2016
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Table of Contents
Preliminaries
Demo
Questions
Introduction
Hi, I’m Casey Stella!
Garbage In ⇒ Garbage Out
“80% of the work in any data project is in cleaning the data.”
— DJ Patil in Data Jujitsu
Data Cleansing ⇒ Data Understanding
There are two ways to understand your data:
• Syntactic Understanding
• Semantic Understanding
If you hope to get anything out of your data, you have to have a handle on both.
Syntactic Understanding: True Types
A true type is a label applied to data points x_i such that the x_i are mutually comparable.
• Schema type != true data type
• A specific column can have many different types
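As a minimal sketch of the idea (not code from the talk), the true types hiding in a single string column can be tallied by matching each value against a few patterns. The pattern list, the labels, and the function names true_type and true_type_counts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical true-type patterns, checked in order; the first match wins.
TRUE_TYPE_PATTERNS = [
    ("missing", re.compile(r"^\s*(|null|n/a|none)\s*$", re.IGNORECASE)),
    ("integer", re.compile(r"^\s*[+-]?\d+\s*$")),
    ("float", re.compile(r"^\s*[+-]?\d*\.\d+\s*$")),
    ("iso_date", re.compile(r"^\s*\d{4}-\d{2}-\d{2}\s*$")),
]


def true_type(value: str) -> str:
    """Return the label of the first pattern that matches, else 'string'."""
    for label, pattern in TRUE_TYPE_PATTERNS:
        if pattern.match(value):
            return label
    return "string"


def true_type_counts(column):
    """Count how many values of a column fall under each true type."""
    return Counter(true_type(v) for v in column)


if __name__ == "__main__":
    # One column whose schema type is "string" but which mixes several true types.
    column = ["42", "3.14", "2016-06-28", "N/A", "", "abc", "-7"]
    print(true_type_counts(column))
```

A column that reports more than one label here is a candidate for splitting or cleaning before its values are compared with each other.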
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative.
• For non-numeric data, counts and distinct counts of a canonical representation are extremely useful.
A canonical representation gives you an idea of the data format at a glance. It can be built by:
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Data density is an assumption underlying any conclusions drawn from your data.
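The three canonicalization steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the code from the demo: the function names canonical and density are hypothetical, and mapping every punctuation character to '-' is one possible reading of "normalizing punctuation".

```python
import re
from collections import Counter


def canonical(value: str) -> str:
    """Reduce a value to a representation of its format."""
    value = re.sub(r"\s+", "", value)             # strip all whitespace
    value = re.sub(r"\d", "d", value)             # replace digits with 'd'
    value = re.sub(r"[^0-9A-Za-z]", "-", value)   # assumption: any punctuation becomes '-'
    return value


def density(column):
    """Counts of each canonical representation in a column."""
    return Counter(canonical(v) for v in column)


if __name__ == "__main__":
    phones = ["555-123-4567", "(555) 123-4567", "555.123.4567", " 555 123 4567 "]
    counts = density(phones)
    print(counts)                       # count per format
    print(len(counts), "distinct formats")
```

Note that a literal letter 'd' in the data is indistinguishable from a replaced digit; a different placeholder can be chosen if that matters for the column at hand.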
Syntactic Understanding: Density over Time
ΔDensity/Δt is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
ΔDensity/Δt ⇒
• Automation
• Outlier Alerting
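One way to automate this kind of check, sketched here under assumptions, is to compare the counts of each format or category between two time windows and alert when the distribution shifts too far. The slides do not name a metric; total variation distance, the 0.2 threshold, and the function names below are choices made for illustration.

```python
from collections import Counter


def proportions(counts):
    """Turn raw counts into proportions that sum to 1 (empty dict if no data)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def density_shift(before, after):
    """Total variation distance between two count distributions.

    0.0 means identical proportions, 1.0 means no overlap at all.
    """
    p, q = proportions(before), proportions(after)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def should_alert(before, after, threshold=0.2):
    """Hypothetical alert rule: flag a window whose shift exceeds the threshold."""
    return density_shift(before, after) > threshold


if __name__ == "__main__":
    monday = Counter({"ddd-ddd-dddd": 90, "dddddddddd": 10})
    tuesday = Counter({"ddd-ddd-dddd": 50, "": 50})  # half the values are suddenly empty
    print(density_shift(monday, tuesday), should_alert(monday, tuesday))
```

A sudden jump in the shift between consecutive windows often points at an upstream pipeline change rather than at a real change in the underlying population, so it is worth checking the pipeline first.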
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Semantic understanding does not imply SkyNet
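A small sketch of a context-sensitive equivalence, assuming a hand-curated mapping of the kind a domain expert might supply. The EQUIVALENCES table, its contents, and the normalize function are hypothetical and only illustrate why the same stored value can mean different things depending on how it is used.

```python
# Hypothetical, hand-curated equivalences keyed by context.
EQUIVALENCES = {
    "us_state": {"calif.": "CA", "california": "CA", "ca": "CA"},
    "country": {"ca": "Canada", "canada": "Canada"},
}


def normalize(value: str, context: str) -> str:
    """Map a value to its canonical meaning within a context.

    Values with no known equivalence are returned unchanged.
    """
    key = value.strip().lower()
    return EQUIVALENCES.get(context, {}).get(key, value)


if __name__ == "__main__":
    # The same stored string resolves differently depending on the context.
    print(normalize("CA", "us_state"))
    print(normalize("CA", "country"))
```

A learned variant would populate the same kind of table from usage patterns instead of from an expert.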
DEMO
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation and understanding.
• Your data science teams have to be willing to get their hands dirty.
• Your data science teams have to be allowed to get their hands dirty.
• Your data science teams need software engineering chops.
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk are available on my GitHub presentations page [1].
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
[1] http://github.com/cestella/presentations/