1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA SCIENCE
CRASH COURSE
BERLIN 2018
Robert Hryniewicz
Data Evangelist
@RobHryniewicz
AI
MACHINE LEARNING
DEEP LEARNING
WHY NOW?
Key drivers behind AI Explosion
• Exponential data growth
– And the ability to process all data, both structured and unstructured
• Faster & open distributed systems
– Such as Hadoop, Spark, TensorFlow, …
• Smarter algorithms
– Esp. in the Machine Learning and Deep Learning domains
– More accurate models → better ROI for customers
Source: Deloitte Tech Trends 2017 report
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
DATA
Google does not have better algorithms, only more data.
-- Peter Norvig, Director of Research, Google
50ZB+ in 2021
Data: The New Oil
Training Data: The New New Oil
“Effectiveness of AI technologies will be only as good as the data they have
access to, and the most valuable data may exist beyond the borders of
one’s own organization.”
DATA SCIENCE PREREQUISITES
Source: hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
DATA SCIENCE & MACHINE LEARNING
WHAT IS A MODEL?
What is an ML Model?
• Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
• E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input: 1, 0, 7, 2, … → Model: y = 2x + 5 → Output: 7, 5, 19, 9, …
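A minimal sketch of what "model training" could look like for the line above, assuming ordinary least squares as the fitting method (the slide does not name one). The helper name `fit_line` is illustrative only; the four input/output pairs are the ones shown on the slide.

```python
def fit_line(xs, ys):
    """Return (m, c) minimizing the squared error of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


xs = [1, 0, 7, 2]    # inputs from the slide
ys = [7, 5, 19, 9]   # outputs from the slide
m, c = fit_line(xs, ys)
print(f"y = {m:g}x + {c:g}")
```

The learned parameters are m and c; everything else in the formula is fixed in advance by the choice of a linear model.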
ALGORITHMS
START
• Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes
• Regression: Linear Regression
• Collaborative Filtering: Alternating Least Squares (ALS)
• Clustering: K-Means, LDA
• Dimensionality Reduction: Principal Component Analysis (PCA)
CLASSIFICATION
Identifying which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
• Logistic Regression
– Fast training, linear model
– Classes expressed in probabilities
• Support Vector Machines (SVM)
– Often regarded as one of the “best” off-the-shelf supervised learning algorithms
– More robust to outliers than Logistic Regression
– Handles non-linearity (via kernels)
• Random Forest
– Fast training
– Handles categorical features
– Does not require feature scaling
– Captures non-linearity and feature interaction
• Naïve Bayes
– Good for text classification
– Assumes features are independent of each other
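To make "classes expressed in probabilities" concrete, here is a rough sketch of logistic regression trained with plain batch gradient descent. The toy spam data, the learning rate and the function names are assumptions made for illustration; a real project would more likely use a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: number of suspicious words in a message, 1 = spam
xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)

p_low = sigmoid(w * 1 + b)    # probability of spam for 1 suspicious word
p_high = sigmoid(w * 8 + b)   # probability of spam for 8 suspicious words
print(f"P(spam | 1 word) = {p_low:.2f}, P(spam | 8 words) = {p_high:.2f}")
```

The model outputs a probability rather than a hard label; a threshold (commonly 0.5) turns it into a class.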
CLASSIFICATION
Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, LDA
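A small sketch of how k-means could group customers into segments, using the standard assign/update loop (Lloyd's algorithm). The customer data, the starting centroids and the function name are hypothetical and chosen only to keep the example deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (annual spend in $k, store visits per month)
customers = [(1, 2), (2, 1), (1.5, 1.5), (8, 9), (9, 8), (8.5, 8.5)]
centroids, clusters = kmeans(customers, centroids=[(0, 0), (10, 10)])
print(centroids)
```

The number of clusters (two here) is chosen up front; it is a hyperparameter, not something the algorithm learns.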
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
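The idea of "filling in missing entries" can be sketched with a deliberately simplified rank-1 version of ALS: each rating is approximated by one user factor times one item factor, and the two sets of factors are solved for in turn. The matrix, the rank-1 restriction and the absence of regularization are assumptions for illustration; production ALS uses more latent factors and a regularization term.

```python
def als_rank1(ratings, iterations=200):
    """Rank-1 ALS: approximate ratings[i][j] by u[i] * v[j] using observed entries only."""
    n_users, n_items = len(ratings), len(ratings[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors v, solve a 1-D least-squares problem per user
        for i in range(n_users):
            obs = [j for j in range(n_items) if ratings[i][j] is not None]
            u[i] = sum(ratings[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Fix user factors u, solve a 1-D least-squares problem per item
        for j in range(n_items):
            obs = [i for i in range(n_users) if ratings[i][j] is not None]
            v[j] = sum(ratings[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return u, v


# Hypothetical user-item matrix; None marks a missing rating
ratings = [
    [1, None, 4],
    [2, 4, 8],
    [3, 6, None],
]
u, v = als_rank1(ratings)
predicted = u[2] * v[2]   # estimate for user 2, item 2
print(round(predicted, 3))
```

Because the observed entries here happen to follow an exact rank-1 pattern, the missing values are recovered almost exactly; real rating data is noisier.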
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
• Removing noise in images by selecting only “important” features
• Removing redundant features, e.g. MPH & KPH are linearly dependent
Algorithms: Principal Component Analysis (PCA)
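The MPH/KPH example can be checked numerically. The sketch below computes the two eigenvalues of the 2x2 covariance matrix (which is what PCA does for two features) and reports the share of variance each principal component explains. The speed readings and the function name are made up for illustration.

```python
import math


def explained_variance_2d(xs, ys):
    """Share of total variance explained by each of the two principal components."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    first, second = half_trace + root, half_trace - root
    total = first + second
    return first / total, second / total


mph = [10, 25, 40, 55, 70]            # hypothetical speed readings
kph = [s * 1.609344 for s in mph]     # the same readings in km/h
pc1, pc2 = explained_variance_2d(mph, kph)
print(f"PC1: {pc1:.4f}, PC2: {pc2:.4f}")
```

The first component carries essentially all of the variance, so one of the two columns can be dropped without losing information.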
START
• Classification: XGBoost (Extreme Gradient Boosting)
• Deep Learning: Recurrent Neural Network (RNN), Convolutional Neural Network (CNN)
• Clustering: Yinyang K-Means
• Dimensionality Reduction: t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Regression: Local Regression (LOESS)
• Collaborative Filtering: Weighted Alternating Least Squares (WALS)
DEEP LEARNING
TensorFlow Playground
playground.tensorflow.org
Identify the right Deep Learning problems
• DL is terrific at language tasks, image classification, speech translation, machine translation, and game playing (e.g. Chess, Go, StarCraft).
• It is less performant at traditional Machine Learning tasks such as credit card fraud detection, asset pricing, and credit scoring.
Source: towardsdatascience.com/deep-misconceptions-about-deep-learning-f26c41faceec
Limits to Deep Learning (DL)
• We don’t have infinite datasets
– DL is not great at generalizing
– ImageNet example: a 9-layer network with 60 million parameters and 650,000 nodes, trained on about 1 million examples in 1,000 categories
• Top 10 challenges for Deep Learning
1. Data hungry
2. Shallow and limited capacity for transfer
3. No natural way to deal w/ hierarchical structure
4. Struggles w/ open-ended inference
5. Not sufficiently transparent
6. Not well integrated w/ prior knowledge
7. Cannot distinguish causation from correlation
8. Presumes stable world
9. Works well as an approximation, but answers cannot be fully trusted
10. Difficult to engineer with
Source: arxiv.org/pdf/1801.00631.pdf
AI HACKING
Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
DATA SCIENCE JOURNEY
Start by Asking Relevant Questions
• Specific (can you think of a clear answer?)
• Measurable (quantifiable? data driven?)
• Actionable (if you had an answer, could you do something with it?)
• Realistic (can you get an answer with data you have?)
• Timely (answer in reasonable timeframe?)
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate seq. of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for the same U.S. state: California, CA, Cal., Cal
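As a sketch of the transformation step for the state example above, the snippet below maps the four spellings to one canonical code. The alias table and the function name are hypothetical; a real cleanup would need a complete lookup and a policy for unknown values.

```python
# Hypothetical lookup table mapping known variants to one canonical state code
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal": "CA",
}


def normalize_state(raw):
    """Map a free-text state value to a canonical code, or None if unknown."""
    key = raw.strip().lower().rstrip(".")
    return STATE_ALIASES.get(key)


values = ["California", "CA", "Cal.", "Cal"]
cleaned = [normalize_state(v) for v in values]
print(cleaned)
```

Returning None for unknown values keeps bad data visible for the audit step instead of silently guessing.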
Feature Selection
• Also known as variable or attribute selection
• Why important?
– simplification of models → easier to interpret by researchers/users
– shorter training times
– enhanced generalization by reducing overfitting
• Dimensionality reduction vs feature selection
– Dimensionality reduction: create new combinations of attributes
– Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
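One simple way to approach the question above is a correlation filter: rank each feature by the absolute value of its Pearson correlation with the target and keep the top few. This is only one of many selection strategies, and the housing data and function names below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(features, target, k):
    """Keep the k feature names with the highest |correlation| to the target."""
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)), reverse=True)
    return ranked[:k]


# Hypothetical housing data
features = {
    "square_feet": [800, 1200, 1500, 2000, 2600],
    "bedrooms": [1, 2, 3, 3, 4],
    "house_number": [12, 87, 5, 40, 33],
}
price = [150, 210, 260, 330, 420]
selected = select_features(features, price, k=2)
print(selected)
```

Note that the selected columns are kept unchanged, which is the difference from dimensionality reduction described above.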
Hyperparameters
• Define higher-level model properties, e.g. complexity or learning rate
• Cannot be learned during training → need to be predefined
• Can be decided by
– setting different values
– training different models
– choosing the values that test better
• Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in k-means clustering
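The "set values, train models, pick what tests better" loop can be sketched with a tiny k-nearest-neighbors regressor, where k is the hyperparameter. The data, the candidate values and the helper names are assumptions for illustration only.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_error(train, valid, k):
    """Mean squared error on the validation set for a given k."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


# Hypothetical (x, y) data split into training and validation sets
train = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.1), (5, 5.0), (6, 5.9), (7, 7.2)]
valid = [(0.5, 0.5), (2.5, 2.5), (4.5, 4.5), (6.5, 6.5)]

candidates = [1, 2, 4, 8]   # hyperparameter values to try
errors = {k: validation_error(train, valid, k) for k in candidates}
best_k = min(errors, key=errors.get)
print(best_k, errors)
```

The validation set is kept separate from the training data so that the chosen value is not simply the one that memorizes the training points best.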
• Residuals
– The residual of an observed value is the difference between the observed value and the estimated value
• R² (R Squared) – Coefficient of Determination
– Indicates goodness of fit
– R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
– Measure of the differences between values predicted by a model and the values actually observed
– Good measure of accuracy, but only for comparing forecasting errors of different models on the same data, because it is scale-dependent
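The three measures above can be computed in a few lines. This sketch scores the line y = 2x + 5 from the earlier slide against a set of hypothetical observations; the observed values and the function name are assumptions.

```python
import math


def regression_metrics(observed, predicted):
    """Return (residuals, R squared, RMSE) for paired observed/predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r ** 2 for r in residuals)               # unexplained variation
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total variation
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r_squared, rmse


# Hypothetical observations scored against the line y = 2x + 5
xs = [1, 0, 7, 2]
observed = [8, 5, 18, 9]
predicted = [2 * x + 5 for x in xs]
residuals, r2, rmse = regression_metrics(observed, predicted)
print(residuals, round(r2, 4), round(rmse, 4))
```

RMSE is in the same units as the target, which is why it is only comparable between models evaluated on the same data.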
With that in mind…
• No simple formula for “good questions”, only general guidelines
• The right data is better than lots of data
• Understanding relationships matters
[Diagram: Training Set → Learning Algorithm → h (hypothesis/model); input → h → output]
• Ingest / Enrich Data
• Clean / Transform / Filter
• Select / Create New Features
• Evaluate Accuracy / Score
MODEL TRAINING
Scatter Data
|label|features|
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
...
...
Training Result
Coefficients: 2.81 Intercept: 3.05
y = 2.81x + 3.05
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
SPARK ML PIPELINES
[Diagram: Pipeline = Feature transform 1 → Feature transform 2 → Combine features → Linear Regression]
Train: Input DataFrame → Pipeline → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model
Spark ML Pipeline
• fit() is for training
• transform() is for prediction
Train: Input DataFrame → Pipeline.fit() → Pipeline Model
Predict: Input DataFrame → Pipeline Model.transform() → Output DataFrame
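The fit/transform contract can be illustrated without a cluster. The classes below are a toy, pure-Python imitation of the pattern and are not the Spark API: every class name is hypothetical, rows are plain (x, y) tuples instead of DataFrames, and the fitted pipeline object stands in for Spark's separate Pipeline Model.

```python
class MeanCenter:
    """Toy transformer: learns the mean of x in fit(), subtracts it in transform()."""

    def fit(self, rows):
        self.mean = sum(x for x, _ in rows) / len(rows)
        return self

    def transform(self, rows):
        return [(x - self.mean, y) for x, y in rows]


class LineModel:
    """Toy estimator: least-squares line through (x, y) rows."""

    def fit(self, rows):
        n = len(rows)
        mx = sum(x for x, _ in rows) / n
        my = sum(y for _, y in rows) / n
        self.m = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
        self.c = my - self.m * mx
        return self

    def transform(self, rows):
        # Replace the second column with the prediction
        return [(x, self.m * x + self.c) for x, _ in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        # Train each stage in order, feeding it the output of the previous one
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self   # the fitted pipeline plays the role of the "Pipeline Model"

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [(1, 7), (0, 5), (7, 19), (2, 9)]
model = ToyPipeline([MeanCenter(), LineModel()]).fit(train)
predictions = model.transform([(3, None), (10, None)])
print([round(y, 2) for _, y in predictions])
```

The point of the pattern is that the same preprocessing learned during fit() is replayed unchanged at prediction time.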
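SAMPLE CODE
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# indexer, parser, hashingTF, vecAssembler and the train/test DataFrames are assumed to be defined earlier
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)  # Train model
results = model.transform(testData)  # Test model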
DSX + HDP
Source: Google NIPS
ML Code
DATA SCIENCE PLATFORM
Community
• Find tutorials and datasets
• Connect with Data Scientists
• Ask questions
• Read articles and papers
• Fork and share projects
Open Source
• Code in Scala/Python/R/SQL
• Zeppelin & Jupyter Notebooks
• RStudio IDE and Shiny
• Apache Spark
• Your favorite libraries
Scale & Enterprise Security
• Data Science at Scale
• Run Spark Jobs on HDP Cluster
• Secure Hadoop Support
• Ranger Atlas Support for Data
• Support for ABAC
Model Management
• Data Shaping Pipeline UI
• Auto-data preparation & modeling
• Advanced Visualizations
• Model management & deployment
• Documented Model APIs
Data Science Experience
SPSS Modeler for DSX
DO for DSX
Data Science Experience (DSX) Local
Enterprise Data Science platform for teams
Livy REST interface
Hortonworks Data Platform (HDP)
Enterprise compute (Spark/Hive) & storage (HDFS/Ozone)
FINAL NOTES
“The business value of AI consists of its ability to lower the cost of
prediction, just as computers lowered the cost of arithmetic.”
Building a Business Around AI - HBR
1. Find and own valuable data no one else has
2. Take a systemic view of your business, and find data adjacencies
3. Package AI for the customer experience
"No single tool, even one as powerful as AI, determines the fate of a business. As
much as the world changes, deep truths — around unearthing customer
knowledge, capturing scarce goods, and finding profitable adjacencies — will
matter greatly. As ever, the technology works to the extent that its owners
know what it can do, and know their market."
Robert Hryniewicz
@RobHryniewicz
Thanks!

Data Science Crash Course

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE CRASH COURSE BERLIN 2018 Robert Hryniewicz Data Evangelist @RobHryniewicz
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved AI MACHINE LEARNING DEEP LEARNING WHY NOW?
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Key drivers behind AI Explosion  Exponential data growth – And the ability to Process All Data – both Structured & Unstructured  Faster & open distributed systems – Such as Hadoop, Spark, TensorFlow, …  Smarter algorithms – Esp. in the Machine Learning and Deep Learning domains – More Accurate Models  Better ROI for Customers Source: Deloitte Tech Trends 2017 report
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Healthcare Predict diagnosis Prioritize screenings Reduce re-admittance rates Financial services Fraud Detection/prevention Predict underwriting risk New account risk screens Public Sector Analyze public sentiment Optimize resource allocation Law enforcement & security Retail Product recommendation Inventory management Price optimization Telco/mobile Predict customer churn Predict equipment failure Customer behavior analysis Oil & Gas Predictive maintenance Seismic data management Predict well production levels
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Google does not have better algorithms, only more data. -- Peter Norvig, Dir of Research, Google
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 50ZB+ in 2021
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data: The New Oil Training Data: The New New Oil
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved “Effectiveness of AI technologies will be only as good as the data they have access to, and the most valuable data may exist beyond the borders of one’s own organization.”
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE PREREQUISITES
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007 Source: hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE & MACHINE LEARNING WHAT IS A MODEL?
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is a ML Model?  Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.  E.g. linear regression – Goal: fit a line y = mx + c to data points – After model training: y = 2x + 5 Input OutputModel 1, 0, 7, 2, … 7, 5, 19, 9, … y = 2x + 5
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved ALGORITHMS
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved START Regression Classification Collaborative Filtering Clustering Dimensionality Reduction • Logistic Regression • Support Vector Machines (SVM) • Random Forest (RF) • Naïve Bayes • Linear Regression • Alternating Least Squares (ALS) • K-Means, LDA • Principal Component Analysis (PCA)
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved CLASSIFICATION Identifying to which category an object belongs to Examples: spam detection, diabetes diagnosis, text labeling Algorithms:  Logistic Regression – Fast training, linear model – Classes expressed in probabilities  Support Vector Machines (SVM) – “Best” supervised learning algorithm, effective – More robust to outliers than Log Regression – Handles non-linearity  Random Forest – Fast training – Handles categorical features – Does not require feature scaling – Captures non-linearity and feature interaction  Naïve Bayes – Good for text classification – Assumes independent variables
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Visual Intro to Decision Trees  http://www.r2d3.us/visual-intro-to-machine-learning-part-1 CLASSIFICATION
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved REGRESSION Predicting a continuous-valued output Example: Predicting house prices based on number of bedrooms and square footage Algorithms: Linear Regression
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved CLUSTERING Automatic grouping of similar objects into sets (clusters) Example: market segmentation – auto group customers into different market segments Algorithms: K-means, LDA
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved COLLABORATIVE FILTERING Fill in the missing entries of a user-item association matrix Applications: Product/movie recommendation Algorithms: Alternating Least Squares (ALS)
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DIMENSIONALITY REDUCTION Reducing the number of redundant features/variables Applications:  Removing noise in images by selecting only “important” features  Removing redundant features, e.g. MPH & KPH are linearly dependent Algorithms: Principal Component Analysis (PCA)
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved START Regression Classification Deep Learning Clustering Dimensionality Reduction • XGBoost (Extreme Gradient Boosting) • Recurrent Neural Network (RNN) • Convolutional Neural Network (CNN) • Yinyang K-Means • t-Distributed Stochastic Neighbor Embedding (t-SNE) • Local Regression (LOESS) Collaborative Filtering • Weighted Alternating Least Squares (WALS)
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DEEP LEARNING
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved TensorFlow Playground playground.tensorflow.org
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Identify the right Deep Learning problems  DL is terrific at language tasks, image classification, speech translation, machine translation, and game playing (i.e. Chess, Go, Starcraft).  It is less performant at traditional Machine Learning tasks such as credit card fraud detection, asset pricing, and credit scoring. Source: towardsdatascience.com/deep-misconceptions-about-deep-learning-f26c41faceec
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Limits to Deep Learning (DL)  We don’t have infinite datasets – DL is not great at generalizing – ImageNet: 9 layers and 60 mil parameters with 650,000 nodes from 1 mil examples with 1000 categories  Top 10 challenges for Deep Learning 1.Data hungry 2.Shallow and limited capacity for transfer 3.No natural way to deal w/ hierarchical structure 4.Struggles w/ open-ended inference 5.Not sufficiently transparent 6.Now well integrated w/ prior knowledge 7.Cannot distinguish causation from correlation 8.Presumes stable world 9.Works well as an approximation, but answers cannot be fully trusted 10. Difficult to engineer with Source: arxiv.org/pdf/1801.00631.pdf
  • 36. AI HACKING
  • 37. Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
  • 38. Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
  • 39. DATA SCIENCE JOURNEY
  • 41. Start by Asking Relevant Questions
      • Specific (can you think of a clear answer?)
      • Measurable (quantifiable? data driven?)
      • Actionable (if you had an answer, could you do something with it?)
      • Realistic (can you get an answer with the data you have?)
      • Timely (can you get an answer in a reasonable timeframe?)
  • 42. Data Preparation
      1. Data analysis (audit for anomalies/errors)
      2. Creating an intuitive workflow (formulate the sequence of preparation operations)
      3. Validation (correctness evaluated against a representative sample dataset)
      4. Transformation (the actual preparation process takes place)
      5. Backflow of cleaned data (replace the original dirty data)
      Approx. 80% of a Data Analyst's job is Data Preparation!
      Example of multiple values used for the same U.S. state: California, CA, Cal., Cal
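The transformation step for the U.S. state example can be sketched as a small lookup. This is a minimal illustration only: the alias table and the `normalize_state` helper are hypothetical names, not part of the original deck, and a real pipeline would need a complete mapping for every state.

```python
# Hypothetical lookup table: every spelling seen in the raw data maps to one canonical code.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal": "CA",
}

def normalize_state(raw: str) -> str:
    """Return the canonical state code, or the trimmed input if the value is unknown."""
    key = raw.strip().rstrip(".").lower()
    return STATE_ALIASES.get(key, raw.strip())

dirty = ["California", "CA", "Cal.", "Cal"]
clean = [normalize_state(value) for value in dirty]
print(clean)
```

Unknown values are passed through unchanged rather than guessed, so they can be caught in the validation step.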
  • 43. Feature Selection
      • Also known as variable or attribute selection
      • Why important?
        – simplification of models → easier to interpret by researchers/users
        – shorter training times
        – enhanced generalization by reducing overfitting
      • Dimensionality reduction vs feature selection
        – Dimensionality reduction: create new combinations of attributes
        – Feature selection: include/exclude attributes in data without changing them
      Q: Which features should you use to create a predictive model?
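The contrast between the two approaches can be sketched in plain Python. Everything here is assumed for illustration: the data is made up, the correlation threshold of 0.8 is arbitrary, and the hand-weighted `size_index` only stands in for a real dimensionality reduction method such as PCA.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up data: three candidate features and a target.
features = {
    "sq_meters": [50, 70, 90, 110, 130],
    "rooms": [2, 3, 3, 4, 5],
    "door_color_id": [3, 1, 4, 1, 5],
}
price = [150, 200, 260, 310, 370]

# Feature selection: columns stay unchanged, weakly correlated ones are excluded.
THRESHOLD = 0.8  # illustrative cut-off, not a recommended value
selected = [
    name for name, column in features.items()
    if abs(pearson(column, price)) >= THRESHOLD
]
print(selected)

# Dimensionality reduction (by contrast): build a NEW attribute out of existing ones.
size_index = [
    0.5 * s / 130 + 0.5 * r / 5
    for s, r in zip(features["sq_meters"], features["rooms"])
]
```

A correlation filter is only one of many selection strategies; it ignores interactions between features.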
  • 44. Hyperparameters
      • Define higher-level model properties, e.g. complexity or learning rate
      • Cannot be learned during training → need to be predefined
      • Can be decided by
        – setting different values
        – training different models
        – choosing the values that test better
      • Hyperparameter examples
        – Number of leaves or depth of a tree
        – Number of latent factors in a matrix factorization
        – Learning rate (in many models)
        – Number of hidden layers in a deep neural network
        – Number of clusters in a k-means clustering
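The "set values, train models, choose what tests better" loop can be sketched for one hyperparameter, the learning rate. This is a minimal sketch under stated assumptions: the data, the candidate values, and the `train` and `mse` helpers are all made up for illustration.

```python
def train(xs, ys, learning_rate, epochs=200):
    """Fit y = w*x + b by batch gradient descent; learning_rate is the hyperparameter."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data following y = 2x + 1, split into training and validation parts.
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
valid_x, valid_y = [4.0, 5.0], [9.0, 11.0]

# 1) set different values, 2) train one model per value, 3) keep the value that tests best.
candidates = [0.0001, 0.01, 0.1]
scores = {}
for lr in candidates:
    w, b = train(train_x, train_y, lr)
    scores[lr] = mse(valid_x, valid_y, w, b)

best_lr = min(scores, key=scores.get)
print(best_lr, scores[best_lr])
```

The validation points are held out from training, so the chosen value reflects how well the model generalizes and not how well it memorizes.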
  • 45. • Residuals
        – the residual of an observed value is the difference between the observed value and the estimated value
      • R² (R Squared), the Coefficient of Determination
        – indicates goodness of fit
        – an R² of 1 means the regression line fits the data perfectly
      • RMSE (Root Mean Square Error)
        – measures the differences between the values predicted by a model and the values actually observed
        – a good measure of accuracy, but only for comparing the forecasting errors of different models on the same variable, because RMSE is scale-dependent
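The three measures can be computed directly from their definitions. The observed and predicted values below are made up for illustration; R² is computed with the usual formula 1 − SS_res / SS_tot.

```python
import math

observed = [3.0, 5.0, 7.5, 9.0]    # made-up actual values
predicted = [2.5, 5.5, 7.0, 9.5]   # made-up model estimates

# Residual: observed value minus estimated value.
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]

ss_res = sum(r ** 2 for r in residuals)              # residual sum of squares
mean_y = sum(observed) / len(observed)
ss_tot = sum((y - mean_y) ** 2 for y in observed)    # total sum of squares

r_squared = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(observed))

print(residuals, r_squared, rmse)
```

RMSE comes out in the same units as the target, which is why it is only comparable across models that predict the same variable.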
  • 46. With that in mind…
      • There is no simple formula for “good questions”, only general guidelines
      • The right data is better than lots of data
      • Understanding relationships matters
  • 47. [Diagram] Training Set → Learning Algorithm → h (hypothesis/model); input → h → output
      Workflow: Ingest / Enrich Data → Clean / Transform / Filter → Select / Create New Features → Evaluate Accuracy / Score
  • 48. MODEL TRAINING
  • 49. Scatter Data
      |label|features|
      |-12.0|  [-4.9]|
      | -6.0|  [-4.5]|
      | -7.2|  [-4.1]|
      | -5.0|  [-3.2]|
      | -2.0|  [-3.0]|
      | -3.1|  [-2.1]|
      | -4.0|  [-1.5]|
      | -2.2|  [-1.2]|
      | -2.0|  [-0.7]|
      |  1.0|  [-0.5]|
      | -0.7|  [-0.2]|
      ...
  • 50. Training Result
      Coefficients: 2.81
      Intercept: 3.05
      y = 2.81x + 3.05
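How a coefficient and intercept are obtained can be sketched with closed-form least squares in plain Python. This is an assumption-laden illustration: it uses only the 11 rows visible on the Scatter Data slide (the rest are elided there), so it is not expected to reproduce the 2.81 and 3.05 reported above, and the deck itself trains with Spark rather than with this code.

```python
# Only the rows visible on the "Scatter Data" slide; the remaining rows are elided there.
xs = [-4.9, -4.5, -4.1, -3.2, -3.0, -2.1, -1.5, -1.2, -0.7, -0.5, -0.2]
ys = [-12.0, -6.0, -7.2, -5.0, -2.0, -3.1, -4.0, -2.2, -2.0, 1.0, -0.7]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for one feature: slope = cov(x, y) / var(x).
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

def predict(x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

print(f"y = {slope:.2f}x + {intercept:.2f}")
```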
  • 51. Linear Regression (two features)
      Coefficients: [0.464, 0.464]
      Intercept: 0.0563
  • 52. SPARK ML PIPELINES
  • 53. [Diagram] Pipeline stages: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
      Train: Input DataFrame → Pipeline → Pipeline Model
      Predict: Input DataFrame → Pipeline Model → Output DataFrame
      Export Model
  • 54. Spark ML Pipeline
      • fit() is for training
      • transform() is for prediction
      Train: Input DataFrame → Pipeline → fit → Pipeline Model
      Predict: Input DataFrame → Pipeline Model → transform → Output DataFrame
  • 55. SAMPLE CODE
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import RandomForestClassifier

      # indexer, parser, hashingTF and vecAssembler are feature stages defined elsewhere
      rf = RandomForestClassifier(numTrees=100)
      pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
      model = pipe.fit(trainData)          # Train model
      results = model.transform(testData)  # Test model
  • 56. DSX + HDP
  • 57. Source: Google NIPS ML Code
  • 59. DATA SCIENCE PLATFORM
      Community
        • Find tutorials and datasets
        • Connect with Data Scientists
        • Ask questions
        • Read articles and papers
        • Fork and share projects
      Open Source
        • Code in Scala/Python/R/SQL
        • Zeppelin & Jupyter Notebooks
        • RStudio IDE and Shiny
        • Apache Spark
        • Your favorite libraries
      Scale & Enterprise Security
        • Data Science at Scale
        • Run Spark Jobs on HDP Cluster
        • Secure Hadoop Support
        • Ranger Atlas Support for Data
        • Support for ABAC
      Model Management
        • Data Shaping Pipeline UI
        • Auto-data preparation & modeling
        • Advanced Visualizations
        • Model management & deployment
        • Documented Model APIs
      Data Science Experience, SPSS Modeler for DSX, DO for DSX
  • 60. DSX + HDP
      • Data Science Experience (DSX) Local: Enterprise Data Science platform for teams
      • Livy REST interface
      • Hortonworks Data Platform (HDP): Enterprise compute (Spark/Hive) & storage (HDFS/Ozone)
  • 61. FINAL NOTES
  • 62. “The business value of AI consists of its ability to lower the cost of prediction, just as computers lowered the cost of arithmetic.”
  • 63. Building a Business Around AI - HBR
      1. Find and own valuable data no one else has
      2. Take a systemic view of your business, and find data adjacencies
      3. Package AI for the customer experience
      "No single tool, even one as powerful as AI, determines the fate of a business. As much as the world changes, deep truths – around unearthing customer knowledge, capturing scarce goods, and finding profitable adjacencies – will matter greatly. As ever, the technology works to the extent that its owners know what it can do, and know their market."
  • 64. Robert Hryniewicz, @RobHryniewicz. Thanks!