Advanced Analytics,
Big Data
and
Being a Data Scientist
Zenodia Charpy
1. Introduction to data science – where did it come from
2. Why did I become a data scientist ?
3. Definition of data science
4. Data science skillset map
5. Data science process – one-off vs. production pipeline
6. Data science process breakdown – a bit more detail
7. Various Data Science tools
8. Q&A
Agenda of today
Data Science – where did it come from ?
Google trend – what people are searching
(Google Trends chart comparing search interest in the four terms, labelled 1–4.)
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
Google trend
(Google Trends chart comparing search interest in the four terms, labelled 1–4.)
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
Cloud computing Virtualization Big Data Data Science
Cloud computing
Virtualization
Data Science
Big Data
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
what people are searching – top 5 keywords
Examples of what makes the data so big
Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
Data Science can help
to reveal these insights
Data Value from
business’s perspective
(Chart: complexity vs. time – Data Science finds the patterns that unlock the data's value.)
Why did I become a data scientist ?
WHY ?
As an analyst for many years…
I realise …
Insight to action – too slow!
(Cycle diagram: marketers request insights → the analysts monitor and analyse → answers come back, usually as dashboards/reports → act on the customer; the steps are marked weekly, monthly and +6 months.)
Issues discovered
1. Data is not centralized / synchronized
2. Data quality is poor
3. The organization's hierarchy slows down the decision-making process
4. No common KPIs (isolated measurement)
5. Marketing strategy depends strongly on gut feelings (for historical reasons)
6. Knowledge gaps & misconceptions (focus on visualization, not necessarily facts)
7. Insufficient information (insufficient data sources to answer the given question)
How did it happen?
Fragmented data view
1. Focus on the database as the only source of truth
2. Limited data sources (mostly DB + clickstreams)
3. No central data repository existed
4. No common definition of a customer existed
5. Customers' ever-changing behavior (historical vs. real-time behavioural data)
6. Marketers' beliefs vs. real evidence about the customers
Skewed data view –
example : seeing is believing, really ?
The 5 V’s of Big data
Data Science can answer at least SOME of those concerns!
But . . .
it heavily depends on how mature the organization is
Organization maturity: Resistance to change → Isolated acceptance → Growing importance → Embracing throughout business disciplines → Data-driven product & organization
Data maturity: Fragmented data (ad-hoc reports focused) → Central data lake (exploratory analysis) → 360° data view in real time (predictive analytics) → Data governance (data quality control) → Data-driven enterprise strategy (recommender system)
Source : https://datafloq.com/read/five-levels-big-data-maturity-organisation/259
Data Scientist – definition !
Data science is a "concept to unify statistics, data analysis and their
related methods" in order to "understand and analyze actual phenomena"
with data. It employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, information science,
and computer science, in particular from the subdomains of machine
learning, classification, cluster analysis, data mining, databases,
and visualization.
Short definition (wikipedia)
Data Science synonyms … what includes what

Machine Learning (ML)
Typical characteristics:
Is question specific
Bias-variance tradeoff + over-/under-fitting
Split data into training, testing (validation) sets
Can be combined with other algorithms
Can utilize parallelization
Deals with all kinds of data (incl. unstructured)
Data mining techniques (for big data) are applied

Predictive analytics (Supervised Learning)
Typical characteristics:
Focus on feature engineering (variable selection)
Exploration vs. exploitation
Prediction performance decays quickly with time
Mostly ad-hoc | one-off based
Deals with all kinds of data (when applying machine learning), otherwise mostly structured | semi-structured data

Inferential + Exploratory + Descriptive
Typical characteristics:
Ad-hoc based
Limited data blending
Mostly structured data (from databases)
Focus on historical statistical models
Modelling focuses on finding correlations or describing existing datasets
Data Science knowledge-domains overlays
Data Scientist – the mythical creature ?
Fire-breathing dragon Real-life dragon (relaxed version)
Data Scientist – The skillset map
Unicorn version vs your own path !
Not on the map but equally important
Teamwork essentials -
• Story-telling
• visualization
• Cooperation/team building
• Inter-personal / inspiration coach
• Open mind
• Knowledge sharing
Personality traits –
• Extreme Curiosity
• Detective spirit
• Naive and stupid
• Strong ethics (data protection / privacy law)
My journey – my own version
Tree Trunk :
Skillsets yet to
be acquired
Math
(University)
Statistics
(University)
Computer
Science
(Master)
The ground
Data Science threshold
Specialization areas/
Further development
• Programming : R & Python
• Machine Learning Algorithms
• Data mining techniques
• Cloud services (Virtualization concepts)
• Big Data ecosystems
• Bayesian Statistics
• Graph Theory (optional)
• Text mining techniques (optional)
Analyst
(work experience)
Roots :
Your initial foundation
• Leadership /Team building
• Recommender system
• Experimental design
• Game theory
• Story-telling/presentation skills
• New model development
• Deep Learning → Artificial Intelligence
Tree branches & leaves :
Specialized interests
Motivation
is the key !
My motivation !
Waterfall (M. C. Escher)
Monument valley
What motivates you ?
What would your path look like ?
(15 mins Break)
Refresh our memory from the previous section -
• Relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related
synonyms
• The skillset map for becoming a data scientist ( unicorn
version vs. your own)
• Motivation is the key !
WHY teamwork approach
Ask yourself the following questions . . .
Do you have an unlimited amount of time ?
Knowledge bank
Do you think that you know absolutely EVERYTHING there is to know on earth ?
A Data Science Dream Team
Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
A Data Science Dream Team
In REALITY . . .
Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
data science process
one-off (POC) vs. production pipeline
Where do these two approaches come from ?
Due to organization maturity . . .
Organization maturity (from traditional BI to a data-driven organization & products):
Phase 1 (Infancy) – Resistance to change
Phase 2 (Technical adoption) – Isolated acceptance
Phase 3 (Business adoption) – Growing importance
Phase 4 (Data & Analytics as a Service) – Embraced throughout business disciplines

Platform maturity (data + technology), per phase:
Data silos – fragmented data views → data lake acquisition → data quality and governance → automated data management & administration

Visualization of deliveries, per phase:
Real-time dashboard(s) → algorithm-embedded dashboard(s) → algorithm-performance dashboard(s)

Possible type of ML used in each phase:
Pattern detection / unsupervised learning → supervised learning → recommender system(s) → deep learning

Pipeline data processing & application flow:
Data exploration & experimental design → map data sources vs. customer touch points → acquire a solution for the architecture → control data quality → merge data sources and automate processing → design experiments – extract preference data
(The same maturity matrix as above, now annotated with where the two approaches sit: One-off (Proof of Concept = POC) vs. Production Pipeline.)
The two approaches -
one-off (POC) vs. production pipeline
Roles involved: business knowledge, data scientist, data engineer, IT support

One-off iterations:
Business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deliverables

Production pipelines:
Business question understanding → data sources scoping → data acquisition → data preparation → descriptive statistics → feature engineering → model training → model validation → deployment → apply to application
(plus performance optimization and enabling automation)
data science process
Compare the two approaches
(The same roles and process steps as above, annotated with how the effort is split across the process: roughly 70–80% vs. 10–20%.)
Comparison: One-off vs. Production pipeline

Organization maturity:
- One-off: phase 1 – phase 2
- Production pipeline: phase 2 and beyond

What are they looking for:
- One-off: to understand how data science works (baby steps)
- Production pipeline: to participate in the data science process

Project scope:
- One-off: small, 4–8 weeks
- Production pipeline: at least 3 months and above

Platform & technology:
- One-off: do not change anything existing in-house
- Production pipeline: considering or already migrating to a new platform/technology

Data source availability:
- One-off: mainly the DB + 1 or 2 additional data sources
- Production pipeline: start to map out all available data sources

Data quality:
- One-off: poor, needs lots of cleaning
- Production pipeline: start to sort out data quality

Deliverables:
- One-off: focus on interpretation (visualized)
- Production pipeline: focus on code (hence limitations on programming language)
Data Science Process –
Box-in the activities overview
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Define business question:
• Define the goal
• Decompose the question
• Verify understanding

Project scoping:
• Map data sources
• Establish performance measures
• Data scientist workspace
• Task force
• Business limitations
• Define project scope

Data acquisition & preparation:
• Environment set-up
• Languages: SQL, R, Python…etc.
• Data sources merging
• Data pre-scan Q&A
• Data quality review

Descriptive statistics (data exploration):
• Explore data (plots)
• Data manipulation
• Outliers/NAs
• Summary statistics
• Data exploration review

Features engineering:
• Establish performance thresholds
• Feature engineering
• Algorithm selection
• Business sign-off

Model building & validation:
• Types of models
• Model selection criteria
• Build and validate the model
• Review results

Deploy / deliverables:
• To whom
• On what platform
• Update frequency
• Performance review
• Infographic (visualization)
• Deployment review
Step-wise Data Science Process :
from Business Question → Scoping
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
The scope – define:
1. Thresholds
2. Data scope
3. Resources
4. Task force
5. Limitations
6. Budget & timeline…etc.

(Flowchart: from the questions, define the scope; if it is not done, iterate. Once the scope is ready, specify how to get the data (access), set up the data lake environment, and extract; resolve issues until done. Next: about the data.)

Question → Scope
Step-wise Data Science Process :
Data acquisition → Data preparation
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Acquire data – merge the data sources
The GOAL: join every source onto the main table (PK = TransactionID, FK = StoreID):
1. Customer purchase information (PK = CID, FK = TransactionID) – data source type: database – joined by TransactionID
2. Website browsing: pages viewed, avg. time on site, products browsed…etc. (PK = CookieID, FK = TransactionID) – data source type: clickstream – joined by TransactionID
3. Promotions: campaign name, campaign duration, in which store, discount level…etc. (PK = CampaignID, FK = StoreID) – data source type: campaign tool – joined by StoreID
4. Store survey: questions, scale of satisfaction, product rating…etc. (PK = SurveyID, FK = StoreID) – data source type: survey tool – joined by StoreID
5. Store geo info: location, km to center, km to the customer's address, km to the competitor's store in the same postcode region…etc. (PK = StoreID) – data source type: API calls – joined by StoreID
Customer database (PK = CID, FK = email) – data source type: database – joined by CID
6. Customer interests (PK = email address) – data source type: social – joined by email
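To make the join order concrete, here is a minimal R sketch of the merge (the deck lists SQL, R and Python as options); the data frame and column names are hypothetical stand-ins for the sources above, not a real schema.

```r
# Minimal sketch (R), assuming hypothetical data frames named after the slide's sources.
library(dplyr)

merged <- main_table %>%                            # PK = TransactionID, FK = StoreID
  left_join(purchases,  by = "TransactionID") %>%   # customer purchase info (brings in CID)
  left_join(browsing,   by = "TransactionID") %>%   # clickstream: pages viewed, time on site…
  left_join(promotions, by = "StoreID") %>%         # campaign tool data
  left_join(survey,     by = "StoreID") %>%         # store survey answers
  left_join(store_geo,  by = "StoreID") %>%         # geo info via API calls
  left_join(customers,  by = "CID") %>%             # customer database (brings in email)
  left_join(interests,  by = "email")               # social data on customer interests
```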
Step-wise Data Science Process :
Descriptive Statistics
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
A flower called iris
3 species: Setosa, Virginica, Versicolor
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
(Diagram: sepal and petal, each measured by length and width)
(Image-only slides: descriptive-statistics plots.)
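The plots on those slides are not captured in the text; as a stand-in, here is a minimal R sketch of the kind of descriptive statistics this step covers, using R's built-in iris dataset (base R functions only).

```r
# Minimal sketch (R): descriptive statistics on the built-in iris dataset.
data(iris)
str(iris)                                          # 150 rows, 4 numeric features + Species factor
summary(iris)                                      # min / quartiles / median / mean / max per column
table(iris$Species)                                # 50 observations per species
aggregate(. ~ Species, data = iris, FUN = mean)    # mean of each feature per species
pairs(iris[, 1:4], col = iris$Species)             # scatterplot matrix coloured by species
boxplot(Petal.Length ~ Species, data = iris)       # petal length distribution per species
```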
Step-wise Data Science Process:
Feature engineering
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Feature selection (things to consider)
- Observations from the descriptive statistics
- Remove highly correlated columns/parameters (example slides further down the presentation, and a code sketch below)
- Candidate models' requirements?
- Some models require One-Hot-Encoding (e.g. neural networks, PCA, K-means clustering)
- Sensitive to outliers or not? (e.g. regression models are more sensitive to outliers than tree models)
- Forward stepwise / backward stepwise / shrinkage selection concepts vs. black-box models ranking feature importance?
- Computing time vs. response
- Business limitations (e.g. the business requires shrinking the features to <= 20)
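As referenced above, a minimal R sketch of two of these steps: dropping highly correlated features and one-hot encoding. The objects `features` (numeric feature matrix) and `raw_data` (data frame with factors) are hypothetical; the heavy lifting is done by caret's findCorrelation and dummyVars.

```r
# Minimal sketch (R), assuming a hypothetical numeric feature matrix `features`.
library(caret)

cor_mat  <- cor(features, use = "pairwise.complete.obs")   # correlation matrix of the features
too_corr <- findCorrelation(cor_mat, cutoff = 0.75)        # indices of columns with |r| above 0.75
features_reduced <- if (length(too_corr) > 0) features[, -too_corr] else features

# One-hot encoding for models that need it (e.g. neural networks, PCA, k-means):
dummies <- dummyVars(~ ., data = raw_data)                 # raw_data: hypothetical data frame with factors
encoded <- predict(dummies, newdata = raw_data)            # numeric matrix, one column per factor level
```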
Example (justifying selected features)
Background :
You've done an exploratory analysis of correlation; you have the result, and now you need to explain it in a way a 5-year-old can understand, and use the exploratory results to do your feature selection!
Explain correlation with a metaphor
(Diagram: two cars, A and B, moving to the right, observed at intervals of distance)

Highly correlated (0.75–1): the Tesla and the Volvo move at almost the same speed and in the same direction
Negatively correlated (<0): the Tesla and the Volvo move in different directions
Positively correlated (0.5–0.75): the Tesla moves a bit faster than the Volvo, but they are still both heading in the same direction

Explain correlation with a metaphor, continued
Linear Correlation
In the following slides, for intuitive convenience, we rescale and map the correlation coefficient into a % format.
Example: strong positive correlation: 1 → 100%

Pearson's correlation:
r(X, Y) = cov(X, Y) / (σ_X · σ_Y)
where cov(X, Y) is the covariance of variables X and Y, and σ_X and σ_Y are the standard deviations of X and Y.
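A minimal R sketch of the formula above on two made-up vectors, just to show that cov/sd and the built-in cor() agree.

```r
# Minimal sketch (R): Pearson correlation, matching the formula above.
x <- c(1.0, 2.1, 2.9, 4.2, 5.1)
y <- c(1.2, 1.9, 3.2, 3.9, 5.3)

cov(x, y) / (sd(x) * sd(y))    # cov(X, Y) / (sigma_X * sigma_Y)
cor(x, y, method = "pearson")  # same value via the built-in; about 0.99, i.e. ~99% on the slide's % scale
```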
The result of the analysis
(Correlation plot of the variables: external sheet temp exhaust pipe, actual exhaust temperature exhaust pipe, process value regulator under pressure, process value regulator hood damper, negative pressure exhaust pipe, regulator value hood damper, regulator value exhaust damper, actual value damper exhaust pipe, regulator process value)
Before we leave this metaphor –
one last thing :
” correlation does not imply causation ! ”
Correlation does not imply causation !
Question : Why did these two cars (Tesla car and Volvo car) move toward the same direction in the first place?
Guess 1 : husband and wife
I drive
Tesla car
I drive
Volvo car
Guess 2 : racing track
A B
A B
Guess 3 : coincidence
Before diving into training your model(s) …
ask yourself :
what type of model should I use ?
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
deployment
What type of models are suitable ?
Question: do you have the correct answer to the given business question?
YES → Supervised learning: regressions, classes
NO → Unsupervised learning / deep learning: clustering, association analysis
Before diving into training your model(s) …
Models landscape
1. Supervised
2. Unsupervised
3.Deep learning
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
deployment
Supervised Learning

Regressions:
Linear Regression
Stepwise Regression
Piecewise Polynomials and Splines
Smoothing Splines
Logistic Regression
Multivariate Adaptive Regression Splines
Least Absolute Shrinkage and Selection Operator (LASSO)
Ridge Regression
Linear Discriminant Analysis (LDA)

Trees:
Decision trees
Gradient Boosted Regression trees
Adaptive Boosting trees (AdaBoost)
Conditional Inference trees (CI trees)
Bootstrap Aggregation (Bagging) trees
Gradient Boosted Machines (GBM)
Random Forest (RF)

Support Vector Machines (SVM):
Support vector classifier (two class)
Support vector classifier (multiclass)
Kernels and support vector machines

Unsupervised Learning

Dimensionality reduction:
Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
MinHash
Locality Sensitive Hashing (LSH)
t-Distributed Stochastic Neighbor Embedding (t-SNE)

Clustering:
K-means Clustering
Hierarchical Clustering
Bradley-Fayyad-Reina (BFR) clustering
Clustering Using REpresentatives (CURE) clustering
Bayesian networks
Topic modelling

Market Basket:
Apriori (association rules)
Park, Chen and Yu algorithm (PCY)
Savasere, Omiecinski and Navathe (SON)
Toivonen's algorithm

Stream Analysis:
Bloom filters
Flajolet-Martin algorithm
Alon-Matias-Szegedy
Datar-Gionis-Indyk-Motwani algorithm

Deep Learning (Neural Network families):
Perceptrons
Simple Neural Networks (fully connected)
Deep Boltzmann Machines
Convolutional Neural Networks
Recurrent Neural Networks
Hierarchical Temporal Memory

Others:
Genetic algorithm (chromosome)
Multi-armed bandit
K-Nearest Neighbors (KNN)

Recommender Systems:
Content-based recommender
User-user recommender
Item-item recommender
Hybrid recommender
Latent Dirichlet Allocation recommender
Data Science Process :
Model training Model Validation
( example : supervised learning)
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
deployment
Pre-processed data → split into a training set, a validation set, and a test set.
Train ML models on the training set, check them against the validation set, and pass the survivors to the test set; select one winning model.
Monitor the winning model's performance and decide: re-train the models? Yes → go back to training; No → keep the current winner.
If we want to be REALLY picky: live-test the winning model on samples from live data streams.
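A minimal R sketch of this split-train-validate-test loop using the caret package, assuming a hypothetical data frame `dat` with a Purchase (Yes/No) factor label; the two candidate models are only examples.

```r
# Minimal sketch (R) of the flow above: split the data, train candidates, pick a winner.
library(caret)
set.seed(42)

in_train  <- createDataPartition(dat$Purchase, p = 0.7, list = FALSE)
train_set <- dat[in_train, ]
holdout   <- dat[-in_train, ]
in_valid  <- createDataPartition(holdout$Purchase, p = 0.5, list = FALSE)
valid_set <- holdout[in_valid, ]    # used to compare the candidate models
test_set  <- holdout[-in_valid, ]   # touched only once, to score the winning model

fit_glm   <- train(Purchase ~ ., data = train_set, method = "glm")    # logistic regression
fit_rpart <- train(Purchase ~ ., data = train_set, method = "rpart")  # decision tree

# Check the candidates on the validation set, keep the winner, then score it once on the test set
confusionMatrix(predict(fit_glm,   valid_set), valid_set$Purchase)
confusionMatrix(predict(fit_rpart, valid_set), valid_set$Purchase)
confusionMatrix(predict(fit_rpart, test_set),  test_set$Purchase)     # assuming rpart won
```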
data science process
Model selection criteria
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Example (justifying how you select the model)
Background:
You built a prediction model (let's say to classify customer purchase = Yes/No); now you need to explain why you picked THAT algorithm in the first place!
Criteria                        Logistic   Trees     RF        GBM       Weight
Performance = Accuracy          86.5%      86.7%     86.8%     85.8%     10%
Sensitivity                     4.6%       12.5%     8.4%      21.4%     20%
Interpretability                1          0.8       0.4       0.2       30%
Time to compute                 1          0.8       0.2       0.2       20%
# of parameters                 2.4        2.4       1.89      2.38      10%
Conflict with using regression  Yes        partial   minimum   minimum   10%
Ranking                         1.016      1.063     0.625     0.894     100%

Performance = (true positives + true negatives) / test set population → how often the model correctly predicts BOTH whether you are a Purchaser or a NonPurchaser
Sensitivity = true positives / all positives in the test set → how often the model correctly predicts that you are going to purchase
Construct the criteria for model selection with input from both the business and the data characteristics. (Note: none of the numerical data is normally distributed.)
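The slide does not spell out how the Ranking row is computed; below is a minimal R sketch of one plausible scheme (a weighted sum over a subset of the criteria, using the weights from the table). It is an assumption for illustration, not a reproduction of the slide's exact Ranking numbers.

```r
# Minimal sketch (R): a weighted scoring of candidate models over four of the criteria.
scores <- data.frame(
  row.names        = c("logistic", "trees", "RF", "GBM"),
  accuracy         = c(0.865, 0.867, 0.868, 0.858),
  sensitivity      = c(0.046, 0.125, 0.084, 0.214),
  interpretability = c(1.0, 0.8, 0.4, 0.2),
  time_to_compute  = c(1.0, 0.8, 0.2, 0.2)
)
weights <- c(accuracy = 0.10, sensitivity = 0.20,
             interpretability = 0.30, time_to_compute = 0.20)

ranking <- as.matrix(scores) %*% weights[colnames(scores)]    # weighted sum per model
ranking[order(ranking, decreasing = TRUE), , drop = FALSE]    # higher weighted score first
```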
Data Science Process :
explain your model
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Example (explaining the selected model)
Background:
Now I have selected a model called a recursive partitioning tree (rPart); the stakeholders asked me to explain how this model works…
Recursive Partitioning Tree (rPart) – how does it work?
Explained at 2 levels:
High level – conceptually
Medium level – a bit more detail

High Level – Conceptually
High level – how does rPart work?

Starting from the parent node, for every parameter Pi the tree checks two criteria:
Criteria 1: does splitting on Pi with value Xi give me more information?
Criteria 2: does splitting on Pi with value Xi give me better prediction accuracy?
Using both criteria it decides whether to split or not. If YES, it splits on parameter Pi with value Xi into child node 2.1 and child node 2.2, and each child repeats the same thing; if NO, the node is not split.
Note: "information" is defined by information theory, with the option of using the Gini index or information gain ( link ).

Hyper-parameters the tree splits nodes on:
• minsplit – the minimum number of observations that must exist in a node in order for a split to be attempted
• minbucket – the minimum number of observations in a terminal node (= minsplit/3)
• cp – the complexity parameter; it penalizes the model if many parameters are used without much gain in accuracy/information
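A minimal R sketch of fitting such a tree with the rpart package, using the hyper-parameters named above; the data frame `dat` and its Purchase label are hypothetical.

```r
# Minimal sketch (R): fit an rpart classification tree with explicit hyper-parameters.
library(rpart)
library(rpart.plot)   # optional, only for the plot

fit <- rpart(
  Purchase ~ .,
  data    = dat,
  method  = "class",
  parms   = list(split = "information"),             # or "gini"
  control = rpart.control(minsplit  = 20,             # min. observations in a node to attempt a split
                          minbucket = round(20 / 3),  # min. observations in a terminal node
                          cp        = 0.01)           # complexity parameter: penalize unhelpful splits
)
printcp(fit)       # cp table: where extra splits stop paying off
rpart.plot(fit)    # visualize the tree for the stakeholders
```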
Medium Level – a bit more detail
1) information gain 2) accuracy improvement
1) Information gain – checking the impurity of the end nodes, calculated by entropy

Scenario 1 (maximum impurity):
If the end nodes have a 50–50 percent chance of the class being Purchaser or noPurchaser, it is as good as a guess, hence the node is said to reach maximum impurity (entropy = 1).
Data: splitting condition 1 (Yes/No) on 10 data points labelled 5 Purchase + 5 noPurchase gives end node 1 (0 Purchase + 5 noPurchase) and end node 2 (5 Purchase + 0 noPurchase), so
P1(Purchase) = 0, P1(noPurchase) = 5/10 = 1/2, P2(Purchase) = 5/10, P2(noPurchase) = 0/10
Calculation formula:
-P1(Purchase)·log2(P1(Purchase)) - P1(noPurchase)·log2(P1(noPurchase)) - P2(Purchase)·log2(P2(Purchase)) - P2(noPurchase)·log2(P2(noPurchase))
= -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1 → maximum impurity

Scenario 2 (minimum impurity):
If the end nodes have a 100 percent chance of the class being Purchaser or noPurchaser, it is a perfect classification, hence the node is said to reach minimum impurity (entropy = 0).
Data: splitting condition 1 (Yes/No) on 10 data points labelled 0 Purchase + 10 noPurchase gives end node 1 (0 Purchase + 10 noPurchase) and end node 2 (0 Purchase + 0 noPurchase), so
P1(Purchase) = 0, P1(noPurchase) = 10/10 = 1, P2(Purchase) = 0, P2(noPurchase) = 0
Calculation formula:
-P1(Purchase)·log2(P1(Purchase)) - P1(noPurchase)·log2(P1(noPurchase)) - P2(Purchase)·log2(P2(Purchase)) - P2(noPurchase)·log2(P2(noPurchase))
= -(1)·log2(1) = 0 → minimum impurity
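A minimal R sketch of the entropy calculation, reproducing the two scenarios above.

```r
# Minimal sketch (R): node impurity (entropy) for the two scenarios above.
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))     # 50-50 node -> 1 (maximum impurity)
entropy(c(1.0, 0.0))     # pure node  -> 0 (minimum impurity)
```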
2) How rPart calculates the misclassification rate on a parameter Pi with value Xi

(Example tree: 20 data points are split on "Age < 45?" into two groups of 10.
One group is split on "cntTotal < 110?" into "predict noPurchase" (7 points, correctly classified rate = 1/7) and "predict Purchase" (3 points, correctly classified rate = 1/3).
The other group is split on "cntTotal < 75?" into "predict noPurchase" (5 points, correctly classified rate = 1/5) and "predict Purchase" (5 points, correctly classified rate = 1/5).)

The rPart model asks, for each and every value Xi of a parameter Pi, whether it was a good idea to split on this value (via the misclassification rate), and it does so for every parameter Pi over all possible values Xi associated with Pi (see the example tree above).

Overall correctly classified rate = (true Purchase + true noPurchase) / total population = 4/20 = 20%
Misclassification rate = 1 − 20% = 80%
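A minimal R sketch of the overall rate calculation from a confusion table, assuming hypothetical vectors `pred_labels` and `true_labels` of "Purchase"/"noPurchase" values.

```r
# Minimal sketch (R): overall accuracy and misclassification rate from predicted vs. actual labels.
conf <- table(predicted = pred_labels, actual = true_labels)
accuracy <- sum(diag(conf)) / sum(conf)    # (true Purchase + true noPurchase) / total population
misclassification <- 1 - accuracy
c(accuracy = accuracy, misclassification = misclassification)
```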
Data Science Process :
deployment
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Deliverables: One-off (POC)
Audience: board members (CTO, CEO, CFO…etc.), marketing directors, marketers
The data scientist hands over processed data for visualization plus model performance metrics & output predictions, provided they pass the business owner's vision.
Focus: interpretability, a WoW-effect visualization, and lessons learned – final reports or prototype dashboards for internal sales.
Deployment: Production Pipeline
Audience: IT + content creators + marketers
The data scientist hands over processed data for visualization, code for embedding into applications, and model performance metrics & output predictions, provided they pass the integration tests.
Focus: reproducibility and process efficiency
• Add to the organization-wide dashboards & reporting pipeline (automated)
• Embed code directly into applications (content recommender, product mix vs. customer segment matching…etc.)
• Use the output of the model predictions for further marketing purposes (such as segmentation, customer profiling…etc.)
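As a concrete, hypothetical illustration of "code for embedding into applications" in R: serialize the winning model once, then load and score it inside the production job. File paths and column names are made up, and the target directories are assumed to exist.

```r
# Minimal sketch (R): hand a trained model (`fit`, the winner from the training step) to a production job.
saveRDS(fit, "models/purchase_model_v1.rds")

# Inside the application / scheduled scoring job:
model      <- readRDS("models/purchase_model_v1.rds")
new_data   <- read.csv("incoming/new_customers.csv")
prediction <- predict(model, newdata = new_data, type = "class")
write.csv(data.frame(new_data, prediction), "outgoing/scored_customers.csv", row.names = FALSE)
```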
(15 mins Break)
Refresh our memory from the previous sections
• Relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related
synonyms
• The skillset map for becoming a data scientist ( unicorn
version vs. your own)
o Why a teamwork approach
o Dream team mates
o Data science process: two approaches (why, compare,
boxed-in activities)
o Data science process breakdown in detail (step-wise)
Data Science Tools –
SPSS Modeler
SPSS modeler – visualized programming
Data Science Tools –
Microsoft Azure ML
(demo)
URL : https://studio.azureml.net/
Data Science Tools –
IBM data science experience/workbench
(Python+Jupyter Notebook demo)
URL : https://datascientistworkbench.com/
Data Science Tools –
R+RStudio(demo)
Data Science Tools –
Python and R cheatsheet
https://www.analyticsvidhya.com/blog/2016/12/cheatsheet-scikit-learn-caret-package-for-python-r-respectively/?utm_content=buffer3140b&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer
More Related Content

PPTX
Intro to Data Science by DatalentTeam at Data Science Clinic#11
PDF
Data Science
PDF
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
PDF
Introduction to Data Science
PDF
From Rocket Science to Data Science
PPTX
Session 01 designing and scoping a data science project
PDF
Introduction to Data Science (Data Summit, 2017)
PDF
Data science vs. Data scientist by Jothi Periasamy
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Data Science
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Introduction to Data Science
From Rocket Science to Data Science
Session 01 designing and scoping a data science project
Introduction to Data Science (Data Summit, 2017)
Data science vs. Data scientist by Jothi Periasamy

What's hot (20)

PDF
Data science e machine learning
PPTX
Data Science: Past, Present, and Future
PPTX
A Practical-ish Introduction to Data Science
PDF
Data Science Project Lifecycle
PPTX
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
PDF
Data Scientist Enablement roadmap 1.0
PDF
Data Scientist Toolbox
PPTX
Big Data and the Art of Data Science
PPTX
Introduction to data science
PPTX
Big Data and Data Science: The Technologies Shaping Our Lives
PPTX
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
PPTX
Session 04 communicating results
PDF
Introduction on Data Science
PDF
Emcien overview v6 01282013
PPTX
Public Data and Data Mining Competitions - What are Lessons?
PDF
8 minute intro to data science
PDF
The Evolution of Data Science
PPTX
Introduction to Big Data and its Trends
PPSX
Intro to Data Science Big Data
PPTX
Introduction of Data Science
Data science e machine learning
Data Science: Past, Present, and Future
A Practical-ish Introduction to Data Science
Data Science Project Lifecycle
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
Data Scientist Enablement roadmap 1.0
Data Scientist Toolbox
Big Data and the Art of Data Science
Introduction to data science
Big Data and Data Science: The Technologies Shaping Our Lives
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
Session 04 communicating results
Introduction on Data Science
Emcien overview v6 01282013
Public Data and Data Mining Competitions - What are Lessons?
8 minute intro to data science
The Evolution of Data Science
Introduction to Big Data and its Trends
Intro to Data Science Big Data
Introduction of Data Science
Ad

Viewers also liked (20)

PDF
2015 Internet Trends Report
PPTX
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
PDF
KoprowskiT-Difinify2017-SQL_Security_In_The_Cloud
PDF
MongoDB NoSQL database a deep dive -MyWhitePaper
PPTX
Film funding
PDF
2017 iosco research report on financial technologies (fintech)
PDF
IBM Bluemix Paris Meetup #22-20170315 Meetup @VillagebyCA- Bluemix, présent &...
PPTX
Tugas[4] 0317-[Wildan Latief]-[1512500818]
PPTX
Freewill Eng245 2017
PDF
Regulating corporate vc
PDF
Email Marketing Metrics Benchmark Study 2016
PPTX
Tugas 4 0317-imelda felicia-1412510545
PDF
Tracxn Research - Mobile Advertising Landscape, February 2017
PPTX
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
PPTX
Tugas 4 0317-fahreza yozi-1612510832 -
PDF
Europa AI startup scaleups report 2016
PPTX
Meetup sthlm - introduction to Machine Learning with demo cases
PPTX
5 Job Skills Every Data Scientist Must Possess
PDF
Data_Scientist_Position_Description
PDF
The field-guide-to-data-science 2015 (second edition) By Booz | Allen | Hamilton
2015 Internet Trends Report
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
KoprowskiT-Difinify2017-SQL_Security_In_The_Cloud
MongoDB NoSQL database a deep dive -MyWhitePaper
Film funding
2017 iosco research report on financial technologies (fintech)
IBM Bluemix Paris Meetup #22-20170315 Meetup @VillagebyCA- Bluemix, présent &...
Tugas[4] 0317-[Wildan Latief]-[1512500818]
Freewill Eng245 2017
Regulating corporate vc
Email Marketing Metrics Benchmark Study 2016
Tugas 4 0317-imelda felicia-1412510545
Tracxn Research - Mobile Advertising Landscape, February 2017
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
Tugas 4 0317-fahreza yozi-1612510832 -
Europa AI startup scaleups report 2016
Meetup sthlm - introduction to Machine Learning with demo cases
5 Job Skills Every Data Scientist Must Possess
Data_Scientist_Position_Description
The field-guide-to-data-science 2015 (second edition) By Booz | Allen | Hamilton
Ad

Similar to Göteborg university(condensed) (20)

PDF
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
PDF
How Can Analytics Improve Business?
PDF
3 джозеп курто превращаем вашу организацию в big data компанию
PDF
02 a holistic approach to big data
PDF
Problem Definition muAoPS | Analytics Problem Solving | Mu Sigma
PDF
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN)
PPT
Data mining
PDF
Data science presentation
PDF
Embracing data science
PPTX
Data Mining With Big Data
PDF
Thinkful DC - Intro to Data Science
PPTX
Big Data and HR - Talk @SwissHR Congress
PPTX
Data Science and AI in Biomedicine: The World has Changed
PPTX
Tips and Tricks to be an Effective Data Scientist
PPTX
Data Science and AI in Biomedicine: The World has Changed
PDF
Ultimate Data Science Cheat Sheet For Success
PPTX
Chapter 1 Introduction to Data Science (Computing)
PDF
DAS Slides: Graph Databases — Practical Use Cases
PPTX
Lect 1 introduction
PPTX
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
How Can Analytics Improve Business?
3 джозеп курто превращаем вашу организацию в big data компанию
02 a holistic approach to big data
Problem Definition muAoPS | Analytics Problem Solving | Mu Sigma
Empowering your Enterprise with a Self-Service Data Marketplace (ASEAN)
Data mining
Data science presentation
Embracing data science
Data Mining With Big Data
Thinkful DC - Intro to Data Science
Big Data and HR - Talk @SwissHR Congress
Data Science and AI in Biomedicine: The World has Changed
Tips and Tricks to be an Effective Data Scientist
Data Science and AI in Biomedicine: The World has Changed
Ultimate Data Science Cheat Sheet For Success
Chapter 1 Introduction to Data Science (Computing)
DAS Slides: Graph Databases — Practical Use Cases
Lect 1 introduction
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...

More from Zenodia Charpy (7)

PPTX
DeepLearning Experiments in Medical Image show case
PPTX
how to build a Length of Stay model for a ProofOfConcept project
PPTX
Tech Day Kista Mässa Stockholm 2018
PPTX
PPTX
Data Science on Azure
PPTX
Zenodia TechDays talks Oct 24-25 Stockholm Kistamässan
PPTX
Datascience and Azure(v1.0)
DeepLearning Experiments in Medical Image show case
how to build a Length of Stay model for a ProofOfConcept project
Tech Day Kista Mässa Stockholm 2018
Data Science on Azure
Zenodia TechDays talks Oct 24-25 Stockholm Kistamässan
Datascience and Azure(v1.0)

Recently uploaded (20)

PDF
.pdf is not working space design for the following data for the following dat...
PDF
Mega Projects Data Mega Projects Data
PDF
Launch Your Data Science Career in Kochi – 2025
PPTX
Computer network topology notes for revision
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
Fluorescence-microscope_Botany_detailed content
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Introduction to Business Data Analytics.
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Lecture1 pattern recognition............
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Database Infoormation System (DBIS).pptx
PPTX
Introduction to Knowledge Engineering Part 1
PPT
Miokarditis (Inflamasi pada Otot Jantung)
.pdf is not working space design for the following data for the following dat...
Mega Projects Data Mega Projects Data
Launch Your Data Science Career in Kochi – 2025
Computer network topology notes for revision
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Clinical guidelines as a resource for EBP(1).pdf
Fluorescence-microscope_Botany_detailed content
1_Introduction to advance data techniques.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Introduction to Business Data Analytics.
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Lecture1 pattern recognition............
Data_Analytics_and_PowerBI_Presentation.pptx
IB Computer Science - Internal Assessment.pptx
Database Infoormation System (DBIS).pptx
Introduction to Knowledge Engineering Part 1
Miokarditis (Inflamasi pada Otot Jantung)

Göteborg university(condensed)

  • 1. Advanced Analytics, Big Data and Being a Data Scientist Zenodia Charpy
  • 2. 1. Introduction to data science – where did it come from 2. Why did I become a data scientist ? 3. Definition of data science 4. Data science skillset map 5. Data science process – one off vs. production pipeline 6. Data science process breakdown – a bit more detail 7. Various Data Science tools 8. Q&A Agenda of today
  • 3. Data Science – where did it come from ?
  • 4. Google trend – what people are searching 1 2 3 4 Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science 1 2 3 4
  • 5. Google trend 1 2 3 4 Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science 1 2 3 4
  • 6. Cloud computing Virtualization Big Data Data Science
  • 7. Cloud computing Virtualization Data Science Big Data Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science what people are searching – top 5 keywords
  • 8. Examples of what make the data so big Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
  • 9. Data Science can help to reveal these insights Data Value from business’s perspective
  • 12. Why did I become a data scientist ?
  • 13. WHY ? As an analyst for many years… I realise …
  • 14. Act on Customer Time (weekly) Time! Time (weekly) Time (+6 months) Time (monthly) Insight to action – too slow ! Request insights The Analysts Issues discovered 1. Data is not centralized /syncronized 2. Data quality is bad 3. Organization’s hierarchy slow down decision making process 4. NO Common KPIs (isolated measurement) 5. Marketing Strategy strongly depending on gut-feelings ( historical reason ) 6. Knowledge gaps & misconceptions (focus on visualization, not necessary facts) 7. Insufficient information ( insufficient data sources to answer to the given question) monitor marketers Answering , usually in a dashboard/reports … format Analysing
  • 15. How did it happened ? Fragmented data view 1. Focus on Database as the only truth 2. Limited data sources ( mostly DB + clickstreams) 3. Central data repository non-existed 4. Common definiton of a customer non-existed 5. Customers’ ever-changing behavior ( historical vs real time behavioural data ) 6. Marketers’ believes vs. real evidence about the customers
  • 16. Skewed data view – example : seeing is believing, really ?
  • 17. The 5 V’s of Big data
  • 18. Data Science can at least answer to SOME of those concerns ! But . . . it heavily depends on how mature is the organization
  • 19. Organization Maturity Data Maturity Resistance to change Isolated acceptance Growing importance Embracing throughout business disciplines Data-driven product & organization Fragmented data (Ad-hoc reports focused) Central Data lake (exploratory analysis) 360 data view In real time (predictive analytics) Data governance (Data quality control) Data driven enterprise strategy (recommender system) Source : https://datafloq.com/read/five-levels-big-data-maturity-organisation/259
  • 20. Data Scientist – definition !
  • 21. Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization. Short definition (wikipedia)
  • 22. Typical characteristics : Is question specific Bias-Variance tradeoff + over/under fitting Split data into training , testing ( validation ) sets Can be combined with other algorithms Can utilize parallelization Deal with all kinds of data (incl. unstructured) Data mining technique ( for big data) is applied Machine Learning(ML) Predictive analytics (Supervised Learning) Typical Characteristic: Focus on feature engineering ( variables selection) Exploration vs exploitation Prediction preformance decade quickly with time Mostly ad-hoc | one-off based Deal with all kinds of data ( when applying machine learning) or else mostly structured|semi-structured data Typical characteristics: Ad-hoc based Limited data blending Mostly structured data ( from database) Focus on historical statistic models Modelling focus on finding correlation or describing existed datasets Inferential + Exploratory + Descriptive Data Science synonyms … what includes what
  • 24. Data Scientist – the mytical creature ?
  • 25. Fire-breathing dragon Real-life dragon (relaxed version)
  • 26. Data Scientist – The skillset map Unicorn version vs your own path !
  • 27. Not on the map but equally important Teamwork essentials - • Story-telling • visualization • Cooperation/team building • Inter-personal / inspiration coach • Open mind • Knowledge sharing Personality traits – • Extreme Curiosity • Detective spirit • Naive and stupid • Strong ethic (data protection / privacy law)
  • 28. My journey – my own version Tree Trunk : Skillsets yet to be acquired Math (University) Statistic (University) Computer Science (Master) The ground Data Science threshold Specialization areas/ Further development • Programming : R & Python • Machine Learning Algorithms • Data mining techniques • Cloud services (Virtualization concepts) • Big Data Eco systems • Bayesian Statistics • Graph Theory (option) • Text mining techniques(option) Analyst (work experience) Roots : Your initial foundation • Leadership /Team building • Recommender system • Experimental design • Game theory • Story-telling/presentation skills • New model development • Deep Learning  artificial Intelligence Tree branches & leaves : Specialized interests Motivation is the key !
  • 30. Waterfall (M. C. Escher) Monument valley
  • 31. What motivate you ? What would your path look like ? (15 mins Break)
  • 32. Refresh our memory from previous section - • Relationship between data science and big data • What motivate me to become a data scientist • The definition of data science and it’s closely related synonym • The skillset map for becoming a data scientist ( unicorn version vs. your own) • Motivation is the key !
  • 34. WHY teamwork approach Ask yourself the follow questions . . .
  • 35. Do you have unlimited amount of time ? Knowledge bank Do you think that you know absolutely EVERYTHING there is to know on earth ?
  • 36. A Data Science Dream Team
  • 38. A Data Science Dream Team In REALITY . . .
  • 40. data science process one-off (POC) vs. production pipeline
  • 41. Where are these two approaches came from ? due to organization maturity . . .
  • 42. Traditional BI Data- Driven Organization & Products Data silos – Fragmented data views Resistance to Change Isolated Acceptance DataLake Acquisition Growing Importance Data Quality and Governance Embrace throughout Business Disciplines Automated data management & administration Organization maturity Phase 1 (Infancy) Phase 2 (Technical adoption) Phase 3 (Business adoption) Phase 4 (Data&Analytic as a Service) Phase component Real-time dashboard(s) Algorithm embedded dashboard(s) Algorithm Performance dashboard(s) Visualization of deliveries Pattern detecting Unsupervised learning Supervised Learning Recommender System(s) Deep Learning Possible type of ML used in each phase Data exploration Experimental design Map data sources vs customers touch points Acquire solution for architecture Control data Quality merge data sources and automise processing Design experiment – extract preference data Platform maturity (data + technology) Pipe-line data processing & application flow
  • 43. Traditional BI Data- Driven Organization & Products Data silos – Fragmented data views Resistance to Change Isolated Acceptance DataLake Acquisition Growing Importance Data Quality and Governance Embrace throughout Business Disciplines Automated data management & administration Organization maturity Phase 1 (Infancy) Phase 2 (Technical adoption) Phase 3 (Business adoption) Phase 4 (Data&Analytic as a Service) Phase component Real-time dashboard(s) Algorithm embedded dashboard(s) Algorithm Performance dashboard(s) Visualization of deliveries Pattern detecting Unsupervised learning Supervised Learning Recommender System(s) Deep Learning Possible type of ML used in each phase Data exploration Experimental design Map data sources vs customers touch points Acquire solution for architecture Control data Quality merge data sources and automise processing Design experiment – extract preference data Platform maturity (data + technology) Pipe-line data processing & application flow One-off (Proof Of Concepts=POC) Production PipeLine
  • 44. The two approaches - one-off (POC) vs. production pipeline
  • 45. Data engineer Business knowledge Data scientist IT support Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deliverables One-Off iterations Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deployment Apply to Application Production Pipelines Performance Optimization Enable automization
  • 46. data science process Compare the two approach
  • 47. Data engineer Business knowledge Data scientist IT support One-Off iterations Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deployment Apply to Application Production Pipelines Performance Optimization Enable automization 70-80% 10%~20%
  • 48. comparison Oragnization maturity What are they looking for Project scope Platform & technology Data source availbility Data quality Deliverbles One-off phrase 1 phrase 2 To understand how data science work (baby step) Small 4 -8 weeks Do not change anything existed inhouse Mainly DB + 1 or 2 additional datasource Poor, need lots of clearning Focus in intepretation(visualized) Production Pipeline Phrase +2 and forward Participate in data science process At least 3 months and above Consider or already migrate to new platform/technology Start to map out all available datasources Start to sort out data quality Focus on code( hence limitation on programming language)
  • 49. Data Science Process – Box-in the activities overview Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 50. Define Business Question Define the goal Decompose the question Verify understanding Project Scoping Map data sources Establish performance measure Data scientist Workspace Task Force Business limitation Define project scope Data acquisition & Preparation Environment set up Languages: SQL, R, Python…etc Data sources merging Data pre-scan Q&A Data Quality review Descriptive statistics (data exploration) Explore data (plots) Data manipulation Outliers/NA s summary statistics Data explore review Features Engineering Establish performance threshold Features engineering Algorithms selection Bueinss sign off Model building & validation Type of models Model selection criteria Build and Validate the model Review results Deploy /deliverables To whom On what platform Update Frequency Performance review Infographic(visual ization) Deployment review
  • 51. Step-wised Data Science Process : from Business Question  Scoping Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 52. questions How to get the data (access) done Datalake Environment set up issues Extract Next : About Data SpecifyNot ready ? The Scope 1. thresholds 2. Data scope 3. Resource 4. taskforce 5. Limitations 6. Budget & timeline…etc define NOT done Ready Question  Scope
  • 53. Step-wised Data Science Process : Data acquisition  data preparation Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 54. Main table (PK= Transaction ID FK=StoreID ) Acquire data – merge the data sources Customer Interests (PK=email address) 6.Joined by email Data source type : social 3.Joined by StoreID Promotions : campaign name, campaign duration, in which store, discout level…etc (PK=CampaignID, FK=StoreID) Data source type : campaign tool 1.Joined by TrasactionID Customer Purchase informaiton (PK=CID FK=Transaction ID) Customer Database (PK=CID FK=email)Joined by CID Data source type : DatabaseData source type : Database 4.Joined by StoreID Store Survey : questions, scale of satisfaction, product rating..etc (PK=SurveyID, FK=StoreID) Data source type : Survey tool Store Geo Info: location, km to center, km to customer’s address, kms to competitor’s store in the same postcode region…etc (PK=StoreID) 5.Joined by StoreID Data source type : API calls 2.Joined by Transaction IDWebsite Browsing : Pages viewed, avg time on site , product browsed..etc (PK=CookieID, FK=TrasactionID) Data source type : clickstream The GOAL
  • 55. Step-wised Data Science Process : Descriptive Statistics Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 56. A flower called iris 3 Sentosa Virginica Versicolor Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
  • 62. Step-wised Data Science Process: Features engineering Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 63. - Observation from Descriptive Statistics - Remove highly correlated columns/parameters (example slides further down the presentation) - Candidate models’ requirement ? - Some model requires you to do One-Hot-Encoding ( example Neural Network, PCA , Kmeans clusering ) - Outliers sensitive or not ? ( example: regression models are more sensitive to outliers than tree models) - Forward stepwised /Backward stepwised / shrinkage selection concepts vs. Blackbox model rank features importance ? - Computing time vs. response - Business limitations ( example, business equire to shink the features to <=20 ) Feature Selection ( things to consider)
  • 64. Example (justifying selected features) Background : you’ve done an exploratory analysis about correlation, you have the result and now you need to explain it in a 5- year’s-old-can-understand way and use the exploratory results to do your feature selection !
  • 65. explain Correlation with a metaphor Interval of distance Direction to the right A B
  • 66. Observation Interval of distance Direction to the right A B Highly correlated(0.75~1) : Tesla car and Volvo car moving almost at the same speed and toward the same direction Negatively correlated(<0) : Tesla car and Volvo car moving toward different directions Positively correlated (0.5 ~0.75) : Tesla car move a bit faster than Volvo car but they are still both heading at the same direction explain Correlation with a metaphor continued
  • 67. Linear Correlation In the following slides, for intuitive convenience purpose we rescale and map the correlation coefficient into the % format - - - Example : Strong positive correlation : 1  100% where: is the covariance of varible X and Y is th standard deviation of X is th standard deviation of Y Pearson’s correlation :
  • 68. The result of the analysis Externalsheettempexhaustpipe External sheet temp exhaust pipe Actual exhaust temperature exhaust pipe Actualexhausttemperatureexhaustpipe Process value regulator under pressure Processvalueregulatorunderpressure Process value regulator hood damper Processvalueregulatorhooddamper Negative pressure exhaust pipe Negativepressureexhaustpipe Regulator value hood damper Regulator value exhaust damper Actualvaluedamperexhaustpipe Regulatorvalueexhaustdamper Regulatorprocessvalue Actual value damper exhaust pipe
  • 69. Before we leave this metaphor – one last thing : ” correlation does not impley causation ! ”
  • 70. Correlation does not imply causation ! Question : Why did these two cars (Tesla car and Volvo car) move toward the same direction in the first place? Guess 1 : husband and wife I drive Tesla car I drive Volvo car Guess 2 : racing track A B A B Guess 3 : coincidence
  • 71. Before diving into training your model(s) … ask yourself : what type of model should I use ? Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deployment
  • 72. Question : Do you have the correct answer to a given business question ? Supervised learning Regressions Classes Unsupervised learning Deep learning Clustering Association analysis What type of models are suitable ? YES NO
  • 73. Before diving into training your model(s) … Models landscape 1. Supervised 2. Unsupervised 3.Deep learning Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deployment
  • 74. Supervised Learning Regressions: Linear Regression Step-wised Regression Piecewise Polynomials and splines Smoothing Splines Logistic Regression Multivariate Adaptive Regression Splines Least Absolute Shrinkage and Selection Operator (LASSO) Ridge Regression Linear Discriminant Analysis (LDA) Trees : Decision trees Gradient Boosted Regression trees Adaptive Boosting trees (AdaBoost) Conditional Inference trees (CI trees) Bootstrap Aggregation (Bagging) trees Gradient Boosted Machines(GBM) Random Forest (RF) Support Vector Machines (SVM) : Support vector classifier (two class) Support vector classifier (multiclass) Kernels and support vector machines Dimensionality reduction: Principal Component Analysis(PCA) Singular Value Decomposition (SVD) MinHash Locality Sensitive Hashing(LSH) t-Distributed Stochastic Neighbor embedding (t-SNE) Clustering : Kmeans Clustering Hierarchical Clustering Bradley-Fayyad-Reina (BFR) clustering Clustering Using REpresentatives CURE clustering Bayesian networks Topic modelling Market Basket : Apriori (association rules) Park Chen and Yu algorithm (PCY) Savasere, Omiecinski and Navathe (SON) Toivonen’s algorithm Stream Analysis : Bloom filters Flajolet-Martin Algorithm Alon-Matias-Szegedy Datar-Gionis-Indyk-Motwani algorithm Unsupervised Learning NeuralNetwork families Deep Learning Perceptrons Simple Neural Networks (fully connected ) Deep Boltzmann machines Convolutional neural networks Recurrent neural networks Hierarchical temporal memory Genetic algorithm (chromosome) Multi-arm bandit K’s Nearest Neighbors (KNN) Content based recommender User-User recommender Item-item recommender Hybrid recommender Latent Dirichlet Allocation recommender Recommender Systems Others Others
  • 75. Data Science Process : Model training Model Validation ( example : supervised learning) Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deployment
  • 76. pre-processed data Validation set Training set Test set Split into Train ML models Check Select one winning model Models that pass the testing set Winning model Monitor model performance Re-train the models ? Yes No decide Sampling from live data streams If we want to be REALLY picky Live testing the winning model
  • 77. data science process Model selection criteria Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 78. Example (justifying how you select the model). Background: you built a prediction model (say, to classify customer purchase = Yes/No); now you need to explain why you picked THAT algorithm in the first place!
  • 79. Model selection criteria and weights across candidate models:
  Criterion (weight)               logistic   trees     RF        GBM
  Performance = Accuracy (10%)     86.5%      86.7%     86.8%     85.8%
  Sensitivity (20%)                4.6%       12.5%     8.4%      21.4%
  Interpretability (30%)           1          0.8       0.4       0.2
  Time to compute (20%)            1          0.8       0.2       0.2
  # of parameters (10%)            2.4        2.4       1.89      2.38
  Conflict with regression (10%)   Yes        partial   minimum   minimum
  Weighted ranking (100%)          1.016      1.063     0.625     0.894
  Performance = (true positives + true negatives) / test-set population → how well the model predicts both Purchasers and Non-Purchasers. Sensitivity = true positives / all positives in the test set → how well the model predicts that you are going to purchase. Construct the criteria for model selection with input from the business as well as from the data characteristics (here, none of the numerical data is normally distributed).
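A hedged R sketch of the weighted ranking idea behind the table above, using only the four criteria that are plain numbers (accuracy, sensitivity, interpretability, time to compute); because the number-of-parameters and regression-conflict criteria are left out, the resulting scores will not exactly reproduce the ranking row in the table.

```r
# Normalised criterion scores per candidate model (values taken from the table above).
scores <- data.frame(
  row.names        = c("logistic", "trees", "RF", "GBM"),
  accuracy         = c(0.865, 0.867, 0.868, 0.858),
  sensitivity      = c(0.046, 0.125, 0.084, 0.214),
  interpretability = c(1.0, 0.8, 0.4, 0.2),
  time_to_compute  = c(1.0, 0.8, 0.2, 0.2)
)

# Business-driven weights (the remaining 20% sits on the criteria omitted here).
weights <- c(accuracy = 0.10, sensitivity = 0.20,
             interpretability = 0.30, time_to_compute = 0.20)

# Weighted sum per model, highest score first.
ranking <- as.matrix(scores[, names(weights)]) %*% weights
ranking[order(ranking[, 1], decreasing = TRUE), , drop = FALSE]
```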
  • 80. Data Science Process: explain your model. (Process: Business question understanding → Data sources scoping → Data acquisition → Data preparation → Descriptive statistics → Feature engineering → Model training → Model validation → Deploy / deliverables)
  • 81. Example (explaining the selected model). Background: now that I have selected a model called the Recursive Partitioning Tree (rPart), the stakeholders asked me to explain how this model works…
  • 82. Recursive Partitioning Tree (rPart) – how does it work? Explained at two levels: high level (conceptually) and medium level (a bit more detail).
  • 83. High Level – Conceptually
  • 84. High level – how does rPart work? Starting from the parent node, for every parameter Pi and every candidate value Xi, rPart checks two criteria: 1) does splitting on Pi at value Xi give me more information? 2) does splitting on Pi at value Xi give me better prediction accuracy? It uses both criteria to decide whether to split (YES/NO), and each child node (2.1, 2.2) then repeats the same procedure. Note: "information" is defined by information theory, with the option of the Gini index or information gain (link). Hyper-parameters the tree splits nodes on: minsplit – the minimum number of observations that must exist in a node for a split to be attempted; minbucket – the minimum number of observations in a terminal node (= minsplit / 3 by default); cp – the complexity parameter, which penalises the model when extra parameters are added without much increase in accuracy/information.
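A minimal sketch of fitting such a tree with R's rpart package, exposing the three hyper-parameters named above; the data frame and column names (train_set, Purchase) are assumptions carried over from the earlier split example.

```r
library(rpart)

fit <- rpart(
  Purchase ~ .,                    # classify Purchase from all other columns (assumed names)
  data    = train_set,
  method  = "class",               # classification tree
  parms   = list(split = "gini"),  # splitting criterion: "gini" or "information"
  control = rpart.control(
    minsplit  = 20,   # minimum observations in a node before a split is attempted
    minbucket = 7,    # minimum observations in a terminal node (~ minsplit / 3)
    cp        = 0.01  # complexity parameter: penalise splits that add little accuracy/information
  )
)

printcp(fit)   # inspect how the complexity parameter prunes the tree
```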
  • 85. Medium Level – a bit more detail 1) information gain 2) accuracy improvement
  • 86. 1) Information gain: check the impurity of the end nodes, measured by entropy = -Σ p(class)·log2(p(class)) over the classes in the node.
  Scenario 1 – maximum impurity: if an end node gives a 50-50 chance of the class being Purchaser or noPurchaser, it is as good as a guess. Example: a node with 5 Purchase + 5 noPurchase out of 10 data points gives entropy = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1.
  Scenario 2 – minimum impurity: if an end node says with 100 percent certainty that the class is Purchaser or noPurchaser, it is a perfect classification. Example: a node with 0 Purchase + 10 noPurchase gives entropy = -(1)·log2(1) = 0.
  A split on a condition (Yes/No) is worth making when it moves the child nodes from high impurity towards low impurity, i.e. when it yields information gain relative to the parent node.
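A minimal R sketch of the entropy calculation used in the two scenarios above (the class counts are the illustrative ones from the slide):

```r
# Entropy of a node, given the count of each class in that node.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(Purchase = 5, noPurchase = 5))   # = 1 -> maximum impurity (50-50 node)
entropy(c(Purchase = 0, noPurchase = 10))  # = 0 -> minimum impurity (pure node)
```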
  • 87. 2) Accuracy improvement: for every parameter Pi and every possible value Xi of that parameter, rPart asks whether it was a good idea to split on that value, by calculating the misclassification rate of the resulting nodes. Example (figure): a tree of 20 data points first splits on Age < 45 (10 observations per branch), then on cntTotal < 110 and cntTotal < 75; the four leaves correctly classify 1/7, 1/3, 1/5 and 1/5 of their observations. Overall correct-classification rate = (true Purchase + true noPurchase) / total population = 4/20 = 20%, so the misclassification rate = 1 - 20% = 80%.
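A minimal R sketch of the overall rates, computed from a confusion matrix on the test set; it assumes the fit and test_set objects from the earlier examples and a label column Purchase (assumed names).

```r
# Score the held-out test set and build the confusion matrix.
pred <- predict(fit, newdata = test_set, type = "class")
conf <- table(predicted = pred, actual = test_set$Purchase)

# Correct-classification rate = (true Purchase + true noPurchase) / total population.
accuracy          <- sum(diag(conf)) / sum(conf)
misclassification <- 1 - accuracy
```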
  • 88. Data Science Process: deployment. (Process: Business question understanding → Data sources scoping → Data acquisition → Data preparation → Descriptive statistics → Feature engineering → Model training → Model validation → Deploy / deliverables)
  • 89. Deliverables: one-off (POC). Audience: board members (CTO, CEO, CFO, etc.), marketing directors and marketers. The data scientist delivers processed data for visualization, plus model performance metrics and output predictions that pass the business owner's vision. What matters here: interpretability, lessons learned, and visualization with a wow-effect – final reports or prototype dashboards for internal sales.
  • 90. Deployment: production pipeline. Audience: IT, content creators and marketers. The data scientist delivers processed data for visualization, code to be embedded into applications, and model performance metrics and output predictions that pass the integration test. What matters here: reproducibility and process efficiency – add to organization-wide dashboards and reporting pipelines (automated), embed code directly into applications (content recommenders, matching product mix to customer segments, etc.), and use the model's output predictions for further marketing purposes (such as segmentation and customer profiling). A persistence/scoring sketch follows below.
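A hedged sketch of one way to hand the winning model over to an automated pipeline in R: persist the fitted object, then re-load and score incoming data in a scheduled job. File paths, column names and the "Yes" class label are assumptions for illustration, not part of the original deck.

```r
# One-time step after model validation: persist the winning model.
saveRDS(fit, "models/purchase_rpart_v1.rds")        # hypothetical path

# --- scheduled scoring job (runs automatically in the production pipeline) ---
model    <- readRDS("models/purchase_rpart_v1.rds")
new_data <- read.csv("data/incoming_customers.csv") # hypothetical incoming data

# Class probabilities; "Yes" assumes the purchase label has levels Yes/No.
new_data$purchase_probability <- predict(model, newdata = new_data, type = "prob")[, "Yes"]

# Feed downstream uses: dashboards, segmentation, customer profiling, etc.
write.csv(new_data, "output/scored_customers.csv", row.names = FALSE)
```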
  • 92. Refresh our memory from the previous sections: • the relationship between data science and big data • what motivated me to become a data scientist • the definition of data science and its closely related synonyms • the skillset map for becoming a data scientist (the unicorn version vs. your own) o why the teamwork approach o dream teammates o the data science process: two approaches (why, comparison, boxed-in activities) o the data science process broken down in detail (step by step)
  • 93. Data Science Tools – SPSS Modeler
  • 94. SPSS Modeler – visual programming
  • 97. Data Science Tools – Microsoft Azure ML (demo) URL : https://studio.azureml.net/
  • 98. Data Science Tools – IBM data science experience/workbench (Python+Jupyter Notebook demo) URL : https://datascientistworkbench.com/
  • 99. Data Science Tools – R+RStudio(demo)
  • 100. Data Science Tools – Python and R cheat sheets

Editor's Notes

  • #5: Source: https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science — Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand. Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
  • #6: Same note and source as #5.
  • #9: Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
  • #20: LOB = Line of Business
  • #21: LOB = Line of Business
  • #23: Source : https://en.wikipedia.org/wiki/Data_science
  • #24: Source : http://scott.fortmann-roe.com/docs/BiasVariance.html
  • #25: http://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/
  • #29: http://www.intelligenthq.com/technology/top-10-requirements-to-be-a-data-scientist/
  • #32: Now I want you to spend some time reading this slide so I can drink some water, because I am thirsty… :P OK, so motivation is very personal – you need to find yours. Here is mine. I am extremely attracted to knowledge: every time I find something interesting, I can't just let it pass; I need to stay with it until I know more, or at least enough to satisfy my hunger for knowledge. I don't know about you, but for me this need to know more drives me further. Second, it sounds a bit cliché to say that the beautiful thing about learning is that no one can take it away from you, but if you really think about it, it is true: we can try to keep the people we care about close and build the most secure locker in the world, yet eventually things and people leave us. The only things you are stuck with are yourself and the knowledge you carry – in a way, that is both sad and nice. The third picture is quite curious – does anyone know who made this artwork? https://en.wikipedia.org/wiki/Waterfall_(M._C._Escher) And does anyone want to guess why I chose it? Things are not always what they seem at first glance: when you look a bit longer, you realize something is off, and then you ask yourself why. That is exactly the point – it challenges you to think outside the box. We live in a world full of conditions we are not even consciously aware of; for example, we restrict ourselves to thinking in at most three dimensions and get confused when the dimensionality grows higher. What if we allowed ourselves to go to the fourth or fifth dimension – what would happen? Another way to think about it: we assume gravity exists even in pictures – who says it should, at all costs? What if we go surreal? This habit of challenging your fundamental biases extends to everything you do as a data scientist. Remember that I said in the data science skillset map that you need to be naive and ask "stupid" questions? Asking why things are done the way they are is actually important – it sometimes reveals hidden truths.
  • #35: Can anyone tell me the difference between these two pictures?
  • #39: Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
  • #41: Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
  • #52: We will only go through to model validation, not deployment or what happens after deployment.
  • #58: Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
  • #67: We have two cars here, a Tesla and a Volvo. Over the interval from point A to point B, both cars move to the right at almost the same speed, and when we observe them we see that they arrive at roughly the same place, moving along the path almost in sync. This could be because a husband and wife (each owning one of the cars) were driving home together, because the two cars were on a racing track, or it could be complete coincidence – two strangers who just happened to share this road in the same direction within the observed path from A to B. Since we have not been given enough information, we have no idea which scenario it is. The only valid conclusion we can draw is that, over the observed interval, the two cars move almost in sync in speed and time (which also means the distances they cover are very similar). So if we know we will eventually see the Tesla when standing at point B, we know the Volvo will be there as well; and we only need to track one of the cars (either the Tesla or the Volvo) to determine how much distance both covered, since they arrive at B at almost the same time. This means the two cars are strongly positively correlated, with a correlation approaching 1. But we do not know whether they happened to move together by accident or whether there is some scenario behind the scenes yet to be discovered – which means that correlation (positive or negative) does not mean causation. So why is this important for feature engineering? Say we want to model the fuel-consumption efficiency of cars: we should NOT include the Tesla, since the Tesla does not use fuel at all. It would only confuse the model we build – the model could not possibly know why the Tesla has nothing but zeros for fuel consumption. So not selecting your features carefully is actually harmful.
  • #68: Same note as #67 (the two-cars correlation example).
  • #107: Source : https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience+(1).pdf
  • #108: Source: https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf
  • #109: Source: https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf