Data Science
A practitioner’s perspective
Amir Ziai
@amirziai
Who am I?
● Data Scientist at ZEFR, ad tech, LA
● Previously worked in healthcare, SaaS, and finance
Agenda
● Data Science
● My perspective
○ Problems
○ Pitfalls
○ Minimum skills
○ How to build your skills
● Resources
Data Science, a short history
● 1960, Peter Naur used it as a substitute for computer science
● 1997, Jeff Wu gave the “Statistics = Data Science?” lecture
● 2008, DJ Patil and Jeff Hammerbacher used “data scientist” to describe their job
● 2011, McKinsey, shortage of 140k analysts and 1.5M managers by 2018
● 2015, Data Scientists don’t scale
● 2016, Why You’re Not Getting Value from Your Data Science
https://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/
Data Science, growth
Data Science, hyped?
http://www.kdnuggets.com/wp-content/uploads/gartner-2014-hype-cycle.jpeg
Data Science, too broad
● BI Analyst/Engineer
● Analytics Engineer
● Data Engineer
● Statistician
● Research Scientist
● Machine Learning Engineer
● AI Engineer
● Solutions Specialist (with analytical background)
● Software Architect
● Financial Modeler
● Actuary
● ...
Data Science, definition
“Data Scientist is a Data Analyst who lives in California”
“Data Scientist is statistics on a Mac”
“...someone who is better at statistics than any software engineer and better at software
engineering than any statistician”
Data Science, the many Venn diagrams
Data Science, process
● Data wrangling (get data from any source, reshape, scale up if needed)
● Problem formulation and modeling (ML, DL, AI)
● Communicate the findings (visualization, UI/UX)
● Productize (SWE, Data Engineering, DevOps)
In the context of:
● Benefit (business value)
● Cost (development, infrastructure, and architecture)
My perspective, what does ZEFR do?
● Ingesting hundreds of millions of videos per day
● Help brands show relevant ads
● Identify content for monetization
● Data science
○ Optimize advertising campaigns
○ Forecast inventory
○ Process text, image, audio, and video
○ Petabyte scale
My perspective, scale and automation
Requirements
● Billions of examples and millions of features to train the models with
● Scoring on a similar scale of data
● Models to be re-trained near real-time
Implications
● Have to use cloud computing and distributed systems
● Small deltas in quality and algorithm efficiency magnified to massive cost or
benefit deltas
● Solid software engineering and automation are key
My perspective, example
Task
● Train a better forecasting model (vs. a benchmark statistic)
● Hundreds of terabytes of historical data available
Process
● Wrangling: pre-process and featurize (Spark, S3, Redshift)
● Modeling: VW, H2O, hyper-parameter optimization
● Communication: justify cost of 100-node EMR cluster ($1,000 per day)
● Productize: test, deploy, automate with Jenkins, ECS and Kafka
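To put the cluster cost in the example above in perspective, the arithmetic can be sketched as follows. The 100 nodes and $1,000 per day come from the slide; the 21-day tuning period is a hypothetical stand-in for "weeks" and is not a figure from the talk, and the per-node-hour number assumes the cluster runs around the clock.

```python
# Figures from the slide: a 100-node EMR cluster at about $1,000 per day.
NODES = 100
COST_PER_DAY_USD = 1_000.0

# Hypothetical: three weeks of continuous use (not a number from the talk).
TUNING_DAYS = 21


def cost_per_node_hour(cost_per_day: float, nodes: int) -> float:
    """Average cost of one node for one hour, assuming 24-hour usage."""
    return cost_per_day / (nodes * 24)


def total_cost(cost_per_day: float, days: int) -> float:
    """Total spend for keeping the cluster up for the given number of days."""
    return cost_per_day * days


if __name__ == "__main__":
    print(f"per node-hour: ${cost_per_node_hour(COST_PER_DAY_USD, NODES):.3f}")
    print(f"{TUNING_DAYS} days: ${total_cost(COST_PER_DAY_USD, TUNING_DAYS):,.0f}")
```

Framing the cost per node-hour and per project phase makes it easier to compare against the expected benefit of a better forecast.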
My perspective, the grind
Weeks of tuning the infrastructure,
finding the right features, reasoning
through algorithm complexity
My perspective, pitfalls
● Unreasonable expectations
○ Hype, just hire a few PhDs
○ Is data science too easy?
● Throwing it over the fence*
○ Data science builds models in R/Python, engineering implements it in Java, C, Scala
● Dismissing the importance of good software engineering practices
○ Use tests, understand algo complexity, do code reviews, experiments should be reproducible
● Dismissing the importance of understanding and formulating the problem
○ Get out and talk to people
● Dismissing or not understanding architecture, infrastructure, and cost/benefit
* Full disclosure: article is written by my boss Jonathan Morra at ZEFR
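As a minimal sketch of what "use tests" and "experiments should be reproducible" can look like in practice: the toy experiment below takes an explicit seed and uses a local random generator, and a small test checks that the same seed gives the same result. The function names and the experiment itself are hypothetical illustrations, not code from the talk.

```python
import random
import statistics


def run_experiment(seed: int, n: int = 1_000) -> float:
    """Toy experiment: mean of n pseudo-random draws from a seeded generator."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    return statistics.fmean(rng.random() for _ in range(n))


def test_experiment_is_reproducible() -> None:
    # The same seed must give exactly the same result on every run.
    assert run_experiment(seed=42) == run_experiment(seed=42)
    # Draws come from [0, 1), so the mean must stay inside that range.
    assert 0.0 < run_experiment(seed=42) < 1.0


if __name__ == "__main__":
    test_experiment_is_reproducible()
    print("experiment is reproducible")
```

Passing the seed in as a parameter, rather than seeding a global generator, keeps each run's randomness explicit and easy to record alongside the results.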
My perspective, data science platforms
● Many companies have recognized the problem with the disconnect between
data science and engineering
● Facebook and Uber have in-house platforms
● A number of commercial solutions: Sense, Domino Data Lab, DataScience,
DataRobot, Yhat, just to name a few
● Very expensive and inflexible in our case
https://blog.dominodatalab.com/uber-and-the-need-for-a-data-science-platform/
https://medium.com/@novakkm/the-purpose-of-platforms-in-data-science-965e2124edf8#.vwlz3idyw
https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
My perspective, minimum data science requirements
- Statically-typed language (C, Java, Scala)
- Dynamically-typed language (Python, R)
- SQL (lag, partition, joins, rank, nested subqueries)
- NoSQL (JSON, MongoDB, Couch)
- Data wrangling (Pandas, dplyr, Julia, PySpark, Dask)
- Command-line fu
- Cloud computing (spin up instances, S3, ssh) and environment isolation
- Software engineering best practices (testing, version control, complexity)
- ML theory (bias/variance, complexity, encoding, hashing, feature engineering)
- ML practice (sklearn, R, Julia, MLlib, H2O, TensorFlow)
- Basic stats (experiment design, hypothesis testing, moments)
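The "hashing" item under ML theory refers to ideas such as the hashing trick, which maps an unbounded vocabulary of features into a fixed-length vector. The sketch below is a simplified, unsigned variant written for illustration; the function name and bucket count are assumptions, and production tools that use this idea (for example VW, or sklearn's FeatureHasher) differ in their details.

```python
import hashlib
from typing import Iterable, List


def hash_features(tokens: Iterable[str], n_buckets: int = 16) -> List[float]:
    """Map tokens to a fixed-length count vector using a stable hash."""
    vector = [0.0] * n_buckets
    for token in tokens:
        # A cryptographic hash is used only because it is stable across runs,
        # unlike Python's built-in hash() for strings.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1.0  # colliding tokens share a bucket
    return vector


if __name__ == "__main__":
    features = hash_features(["ad", "video", "ad", "brand"])
    print(len(features), sum(features))
```

The vector length stays fixed no matter how many distinct tokens appear, at the cost of occasional collisions between unrelated features.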
My perspective, how to build your skills
● Take courses in areas of weakness (Udacity, Coursera)
● Showcase your skills with projects on GitHub
● Write a blog about things you’re good at to refine your understanding
● Do Kaggle competitions
● Contribute to Stack Overflow and/or Cross Validated
● Contribute to open source projects (sklearn, tensorflow, dask, spark)
Resources
Newsletters, blogs and people to follow
Data Elixir, Data Science Weekly, The Morning Paper, Intuition Machine, The Wild
Week in AI, MLConf, Talking Machines, Partially Derivative, Brandon Rohrer, Julia
Evans, Chris Fregly, Bryan Smith, Stitch Fix, Unofficial Google Data Science Blog,
Variance Explained, Wes McKinney, Peter Norvig’s IPython notebooks, Frank Chen of
a16z, Fast Forward Labs, Chris Olah, Andrej Karpathy, OpenAI, Indico, John Cook, ...
