Deriving Knowledge from Data at Scale
Before we begin tonight…
developer edition
Lecture 3 Outline
• Opening Discussion
• Forecasting, continued (2/2)
• Introducing Weka
• Decision Trees
• Hands On, Decision Tree in Weka (might be a stretch…)
Learning Objectives
• Understand the elements of a time series
• Excel as a tool
• Practical application
• Become familiar with time series manipulation techniques
• Automatic time series procedures (homework)
• Gain familiarity with Weka
• Dive into Decision Trees, in Weka (time permitting)
Follow Up
What tools to use?
• Weka – explorer…
• KNIME – experimentation…
Get proficient in at least two (2) tools…
http://www.cs.waikato.ac.nz/ml/weka/ (we’ll use this in class…)
Components of a time series:
• Seasonal: fixed and known period
• Cyclical: rise and fall, not a fixed period
• Trend
Smoothing:
• Moving average
• Exponential
Model:
• Linear
• Exponential
• Auto regressive
In the multiplicative model for time series:
Time Series = Trend * Seasonality * Irregular
Let’s assume there is no cyclical component (its factor is 1)…
forecast Year 5
trend
Linear regression on the deseasonalized data (Excel Analysis ToolPak)
• Create a time step column
• Select the Data Analysis option
• Inputs: the deseasonalized data and the time step column
• Check Labels, then click OK
Trend for each row: =(intercept + slope * time step for the row), pressing F4 to lock the intercept and slope cell references
First row: =(5.099 + 0.147 * 1)
Copy the formula all the way down to Y4 Q4
seasonality
seasonality * trend = prediction
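The spreadsheet steps (seasonal indices, deseasonalize, fit a linear trend on the time step, then multiply trend by seasonality) can be sketched in Python. This is a minimal sketch, not the workbook used in class: the `sales` values are hypothetical, the function names are made up for illustration, and normalizing the seasonal indices so they average to 1 is an assumption that the spreadsheet may not make.

```python
# Hypothetical quarterly sales for Years 1-4 (illustrative values only).
sales = [4.8, 4.1, 6.0, 6.5, 5.8, 5.2, 6.8, 7.4,
         6.0, 5.6, 7.5, 7.8, 6.3, 5.9, 8.0, 8.4]


def centered_moving_average(values, period=4):
    """Centered moving average for an even period (e.g. quarterly data)."""
    half = period // 2
    cma = [None] * len(values)
    for t in range(half, len(values) - half):
        first = sum(values[t - half:t + half]) / period
        second = sum(values[t - half + 1:t + half + 1]) / period
        cma[t] = (first + second) / 2
    return cma


def seasonal_indices(values, period=4):
    """Average ratio of actual to centered moving average, per season."""
    cma = centered_moving_average(values, period)
    ratios = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(values, cma)):
        if m is not None:
            ratios[t % period].append(y / m)
    raw = [sum(r) / len(r) for r in ratios]
    scale = period / sum(raw)  # assumption: normalize so indices average to 1
    return [r * scale for r in raw]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def forecast_multiplicative(values, period=4, horizon=4):
    """prediction = seasonality * trend, for the next `horizon` time steps."""
    idx = seasonal_indices(values, period)
    deseasonalized = [y / idx[t % period] for t, y in enumerate(values)]
    steps = list(range(1, len(values) + 1))  # time step column: 1, 2, 3, ...
    intercept, slope = fit_line(steps, deseasonalized)
    future = range(len(values) + 1, len(values) + horizon + 1)
    return [(intercept + slope * s) * idx[(s - 1) % period] for s in future]


year5 = forecast_multiplicative(sales, period=4, horizon=4)
print([round(v, 2) for v in year5])
```

The last call produces the four quarterly forecasts for Year 5 from the 16 quarters of history.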
Running Example: Amazon Orders
[Figure: DailyOrders, 8/17/04 to 8/17/14; y-axis from 0 to 16,000,000]
[Figure: time series, 2/1/15 to 4/26/15; y-axis from 0 to 6,000,000]
[Figure: time series, 11/1/14 to 4/25/15; y-axis from 0 to 16,000,000; annotated peaks: Black Friday, Cyber Monday, Super Saturday, Christmas Eve]
$$y_t \approx a_1 y_{t-1} + a_2 y_{t-7} + a_3 y_{t-365} + \varepsilon_t$$
$$SS_{residual} = \sum_t \left[ y_t - \left( a_1 y_{t-1} + a_2 y_{t-7} + a_3 y_{t-365} \right) \right]^2$$
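Fitting the lag coefficients means choosing them to minimize the residual sum of squares. A minimal sketch of that least-squares fit, assuming a synthetic series in place of the order data: the helper names, the added intercept column, and the "true" coefficients used to generate the series are all hypothetical.

```python
import random


def lagged_design(y, lags):
    """Rows of [y[t-lag] for each lag] plus a constant 1.0, and targets y[t]."""
    start = max(lags)
    X = [[y[t - lag] for lag in lags] + [1.0] for t in range(start, len(y))]
    return X, y[start:]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_least_squares(X, target):
    """Minimize the residual sum of squares via the normal equations."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    Xty = [sum(row[i] * t for row, t in zip(X, target)) for i in range(k)]
    return solve_linear(XtX, Xty)


# Synthetic series built exactly from known coefficients (no noise term),
# so the fit should recover them up to floating-point error.
random.seed(0)
lags = [1, 7, 365]
true_coeffs = [0.5, 0.3, 0.1, 10.0]  # a1, a2, a3, intercept (hypothetical)
series = [random.uniform(50, 150) for _ in range(365)]
for t in range(365, 800):
    value = sum(a * series[t - lag] for a, lag in zip(true_coeffs, lags))
    series.append(value + true_coeffs[-1])

X, target = lagged_design(series, lags)
coeffs = fit_least_squares(X, target)
print([round(c, 4) for c in coeffs])
```

With real data the residuals are not zero, and a library least-squares routine would normally replace the hand-written solver.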
Uncover Missing Data
Missing vs. Anomalous Data
[Figure: DailyOrders, 8/17/04 to 8/17/14, split into Train (fit parameters) and Test; annotation: No lag]
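1. Coefficient of Determination (R^2)
$$SS_{total} = \sum_t (y_t - \bar{y})^2, \qquad R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$$
2. Mean Absolute Error (MAE)
$$MAE = \frac{1}{T} \sum_t \left| y_t - \left( a_1 y_{t-1} + a_2 y_{t-7} + a_3 y_{t-365} \right) \right|$$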
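Both measures compare actual values with the model's predictions on the test period. A small sketch with made-up numbers; the function names and the sample values are illustrative only.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_residual / SS_total."""
    mean_y = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [1.0, 2.0, 3.0, 4.0]      # hypothetical observations
predicted = [2.0, 2.0, 3.0, 3.0]   # hypothetical model output
print(r_squared(actual, predicted), mean_absolute_error(actual, predicted))
```

A perfect prediction gives R^2 = 1 and MAE = 0; predicting the mean everywhere gives R^2 = 0.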
$$y_t \approx 0.59\, y_{t-1} + 0.25\, y_{t-7} + 0.20\, y_{t-365} + 31{,}252$$
[Figure: actual $y_t$ and residual $y_t - (0.59\, y_{t-1} + 0.25\, y_{t-7} + 0.20\, y_{t-365} + 31{,}252)$]
$$y_t \approx 0.57\, y_{t-1} + 0.27\, y_{t-7} + 0.19\, y_{t-365} + 2{,}288{,}140 \times I(\text{CyberMonday}) + 30{,}239$$
10 Minute Break…
Data Downloads
• Naïve method
• Mean method
• Seasonal naïve method
• Drift method
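The four benchmark methods are simple enough to write out directly. A minimal sketch, assuming the usual textbook definitions (naïve repeats the last value, seasonal naïve repeats the last full season, drift extends the line from the first to the last observation); the function names and the sample series are illustrative.

```python
def naive_forecast(y, h):
    """Every forecast equals the last observed value."""
    return [y[-1]] * h


def mean_forecast(y, h):
    """Every forecast equals the mean of the history."""
    return [sum(y) / len(y)] * h


def seasonal_naive_forecast(y, h, period):
    """Each forecast equals the value from the same season one period back."""
    return [y[len(y) - period + (i % period)] for i in range(h)]


def drift_forecast(y, h):
    """Extend the straight line through the first and last observations."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [y[-1] + slope * (i + 1) for i in range(h)]


history = [10, 12, 14, 16, 11, 13, 15, 17]  # hypothetical, two "years" of quarters
print(naive_forecast(history, 2))
print(mean_forecast(history, 2))
print(seasonal_naive_forecast(history, 5, period=4))
print(drift_forecast(history, 2))
```

Any more elaborate model should at least beat these baselines on held-out data.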
But by now, you know this…
Nothing is forecastable until it is stable…
(1) Mean constant
(2) Volatility constant
Transformation: take differences (diff() function in R)
Transformation: take logs or powers. The Box-Cox family of transformations flexibly covers both:
w = (y^lambda - 1)/lambda for lambda != 0, and w = log(y) for lambda = 0
Back-transform: y = (lambda*w + 1)^(1/lambda)
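Both transformations are short to write by hand. The sketch below assumes the standard Box-Cox definition for the forward direction and the (lambda*w + 1)^(1/lambda) form for the back-transform; the function names are illustrative and are not the R functions mentioned on the slide.

```python
import math


def difference(y, lag=1):
    """Differences y[t] - y[t-lag]; lag=1 matches the default of R's diff()."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


def box_cox(y, lam):
    """Box-Cox transform: log for lam == 0, scaled power otherwise."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]


def inverse_box_cox(w, lam):
    """Back-transform to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in w]
    return [(lam * v + 1) ** (1 / lam) for v in w]


values = [1.0, 4.0, 9.0, 16.0]             # hypothetical positive series
print(difference(values))                   # removes a changing level
transformed = box_cox(values, 0.5)          # square-root-like transform
print(inverse_box_cox(transformed, 0.5))    # round trip back to the data
```

Differencing targets a non-constant mean, while logs and powers target volatility that grows with the level of the series.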
y(t) = y(t-1) + e
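A quick simulation makes the random walk concrete: each value is the previous one plus noise, so taking first differences returns the noise term. This is a sketch under the assumption that e is Gaussian with mean zero; the function name, seed, and length are arbitrary.

```python
import random


def random_walk(n, sigma=1.0, start=0.0, seed=0):
    """Simulate y(t) = y(t-1) + e with e drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    y = [start]
    for _ in range(n - 1):
        y.append(y[-1] + rng.gauss(0.0, sigma))
    return y


walk = random_walk(500)
# First differences recover the noise terms e.
steps = [walk[t] - walk[t - 1] for t in range(1, len(walk))]
print(round(sum(steps) / len(steps), 3))
```

The walk itself wanders without a fixed mean, while the differenced series fluctuates around zero.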
R functions (package):
• arima() (stats)
• forecast(), Arima(), auto.arima() (forecast)
[1] "ETS(M,Md,M)" – Holt-Winters: multiplicative error, multiplicative damped trend, multiplicative seasonality
Feb Mar Apr May Jun Jul
114,727,363 123,818,067 132,671,221 141,424,018 150,134,416 158,826,902
These error measures are scale dependent, so they are OK only when comparing
forecasts on the same data set or on data of the same scale…
rolling forecasting origin
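With a rolling forecasting origin, the training window grows one observation at a time and the model is re-run at each origin, so every test point is forecast using only earlier data. A minimal sketch: the function names are illustrative, and the naïve method stands in for whatever model is being evaluated.

```python
def rolling_origin_mae(y, forecast_fn, min_train, horizon=1):
    """MAE of h-step-ahead forecasts over a rolling forecasting origin."""
    errors = []
    for origin in range(min_train, len(y) - horizon + 1):
        train = y[:origin]                        # data available at this origin
        prediction = forecast_fn(train, horizon)[-1]
        actual = y[origin + horizon - 1]          # value h steps after the origin
        errors.append(abs(actual - prediction))
    return sum(errors) / len(errors)


def naive(train, h):
    """Stand-in model: repeat the last observed value."""
    return [train[-1]] * h


series = [1, 2, 3, 4, 5, 6]  # hypothetical data
print(rolling_origin_mae(series, naive, min_train=3, horizon=1))
print(rolling_origin_mae(series, naive, min_train=3, horizon=2))
```

Because the forecast function is a parameter, the same loop can compare several methods on identical origins.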
• R: A Little R Time Series Book
• Python/Pandas: Video1, Video2, Video3, statsmodels
• Texts
Out of Class Reading (2), optional but very helpful…
For this project you can use the beer data set or analyze a
dataset of interest to you. The objective is to give you a
hands-on opportunity to work with the R time series
functionality, in particular ARIMA. You can find time series
datasets at the Time Series Data Library, or you can
fall back on the beer data set.
See homework description for what to turn in…
10 Minute Break…
http://www.cs.waikato.ac.nz/ml/weka/
http://weka.wikispaces.com/ARFF+(stable+version)
evaluation
http://www.20q.net/
• Classification
• Regression
• Clustering
classification trees
Outlook
  sunny -> Humidity
    high -> No
    normal -> Yes
  overcast -> Yes
  rain -> Windy
    true -> Yes
    false -> No
Each node is a test on one attribute.
Branches are the possible attribute values of the node.
Leaves are the decisions.
Sample size: your data gets smaller as you move down the tree.
A new test example: (Outlook==rain) and (not Windy==false)
Pass it down the tree -> the decision is yes.
(Outlook==overcast) -> yes
(Outlook==rain) and (not Windy==false) -> yes
(Outlook==sunny) and (Humidity==normal) -> yes
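Read as code, a decision tree is just nested conditionals. The sketch below encodes the three rules exactly as written on the slide; returning "no" for every other combination is an assumption based on the two "No" leaves in the tree, and the function name is made up.

```python
def decide(outlook, humidity, windy):
    """Apply the slide's rules; every other path is assumed to end in 'no'."""
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        # Slide rule: (Outlook==rain) and (not Windy==false) -> yes
        return "yes" if windy else "no"
    if outlook == "sunny":
        # Slide rule: (Outlook==sunny) and (Humidity==normal) -> yes
        return "yes" if humidity == "normal" else "no"
    raise ValueError(f"unknown outlook: {outlook!r}")


print(decide("rain", "high", True))      # the test example from the slides
print(decide("sunny", "high", False))
```

Each `if` corresponds to a node's test on one attribute, and each `return` corresponds to a leaf.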
• The goal is to have the resulting decision tree as small as possible (Occam’s Razor)
• Finding the minimal decision tree consistent with the data is NP-hard
• The recursive algorithm is a greedy heuristic search for a simple tree, but it cannot guarantee optimality
• Select attributes that split the examples into sets that are relatively pure in one label; this way we are closer to a leaf node
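The greedy recursion can be sketched in a few lines. This is an ID3-style illustration, not Weka's implementation: it assumes categorical attributes, uses entropy-based information gain as the purity measure, and runs on a tiny made-up data set; all names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy before the split minus weighted entropy after the split."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Greedy recursion: pick the best attribute now, never backtrack."""
    if len(set(labels)) == 1:          # pure node: make a leaf
        return labels[0]
    if not attrs:                      # nothing left to test: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {"attribute": best, "branches": {}}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree["branches"][value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining)
    return tree


def classify(tree, row):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Tiny hypothetical training set (not the data behind the slides).
rows = [
    {"outlook": "sunny", "humidity": "high", "windy": False},
    {"outlook": "sunny", "humidity": "normal", "windy": True},
    {"outlook": "overcast", "humidity": "high", "windy": False},
    {"outlook": "rain", "humidity": "high", "windy": True},
    {"outlook": "rain", "humidity": "normal", "windy": False},
    {"outlook": "sunny", "humidity": "high", "windy": True},
    {"outlook": "overcast", "humidity": "normal", "windy": True},
    {"outlook": "rain", "humidity": "normal", "windy": True},
]
labels = ["no", "yes", "yes", "yes", "no", "no", "yes", "yes"]

tree = build_tree(rows, labels, ["outlook", "humidity", "windy"])
print(tree)
```

Because each split is chosen locally, the result is a small tree in practice but not necessarily the smallest one consistent with the data.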
Overfitting
Which attribute should be used as the test?
Intuitively, you would prefer the one that separates
the training examples as much as possible, reduces
the entropy…
[Figure: sets of + and - examples, ranging from highly disorganized (high entropy) to highly organized (low entropy)]
amount of uncertainty
[4+, 4-] vs. [8+, 0-]: for the second node the distribution is less uniform, entropy is lower, and the node is purer.
(information before split) – (information after split)
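Written out, with "information" measured as entropy, the difference takes the following form. The notation is an assumption, since the slide does not define symbols: $S$ is the set of examples at a node, $p_c$ the fraction of $S$ with class $c$, and $S_v$ the subset of $S$ where attribute $A$ takes value $v$.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

For two classes, a [4+, 4-] node has $H = 1$ bit and a [8+, 0-] node has $H = 0$.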
provides most information
about the class
reduces class entropy most
information gain
Example: S = [9+,5-], E = 0.940

Attribute   Value    Subset     Entropy
Humidity    High     [3+,4-]    E = 0.985
Humidity    Normal   [6+,1-]    E = 0.592
Wind        Weak     [6+,2-]    E = 0.811
Wind        Strong   [3+,3-]    E = 1.0

Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048
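The numbers in this example can be checked from the class counts alone. A short sketch; the function names are illustrative, and each node is given as a (positives, negatives) pair.

```python
import math


def entropy(pos, neg):
    """Two-class entropy in bits from positive and negative counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                       # skip empty classes: 0 * log(0) -> 0
            p = count / total
            result -= p * math.log2(p)
    return result


def gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder


gain_humidity = gain((9, 5), [(3, 4), (6, 1)])
gain_wind = gain((9, 5), [(6, 2), (3, 3)])
print(round(entropy(9, 5), 3), round(gain_humidity, 3), round(gain_wind, 3))
```

Computed without intermediate rounding, the Humidity gain comes out at about 0.152 rather than 0.151, because the slide plugs in entropies already rounded to three decimals. Either way Humidity is the better split.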
Hypothesis space search in TDIDT
Overfitting: Example
[Figure: scatter of + and - examples, with an area of probably wrong predictions marked]
That’s all for tonight….

Barga Data Science lecture 3