Data Mining 101
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars
Outline
Introduction
•Terminology
•Potential application
•Venn diagram
Process overview
•Business understanding
•Data understanding (exploration)
•Data preparation (preprocessing)
•Modeling
•Evaluation
•Deployment (presentation)
Tools & Resources
Introduction – Terminology
• Data mining
• Knowledge Discovery in Databases
• Big data analytics
• Statistics
• Data science
“The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.”
Data Mining - dictionary.reference.com
Introduction – Potential Application
• Customer segmentation
• Recommendation engine
• Social media mining
“What should we do?”
Where to start? Do I have to get a master's degree in statistics?
http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
And now the business process…
CRISP-DM Methodology
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
Business Understanding
CRISP-DM Methodology
Objective Statement
Bottom-up
Top-down
Objective Statement
Data vs Problem
Situation Assessment
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Terminology
Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Inventory of Resources
Resource:
• Data, knowledge, tools
• Hardware
• Personnel
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Requirements, Assumptions, and Constraints
Requirements
• Scheduling
• Accuracy
• Security
Assumptions
• Data quality
• External factors
• Reporting type
Constraints
• Legal issues
• Budget
• Resources
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Risks and Contingencies
Contingency Plan
• Financial
• Organizational
• Business
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Terminology
Write down related terminology
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg
Situation Assessment – Costs and Benefits
Money, money, money!
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg
“How to evaluate the results?”
Define your success criteria!
Data Understanding
CRISP-DM Methodology
Data Collection
External vs Internal
Watch out!
“visible ≠ accessible ≠ storable ≠ presentable”
Victor Lavrenko – Text Technologies
http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
Data Exploration – Visualization Heuristics
• Visualize fast. Visualize reactively.
• Go for high information 2D visualizations.
• Select data subsets to visualize.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration – Visualization Heuristics
• Never let anomalies pass you by. Dig deeper.
• Use your visualizations to inform potential models. Use your potential model to direct your visualizations.
• Expect problems in your data.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
“This is the cheapest and most informative stage of data mining.”
Nigel Goddard – DME Visualization
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration – Visualization Tools
• Column/bar: large changes
• Line, curve: small changes, long periods
• Histogram: frequency distribution
https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
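As a rough illustration of the histogram entry above, the sketch below bins a handful of made-up values into fixed-width intervals and prints a text histogram. The data, the bin width, and the function name `histogram` are assumptions for illustration only; in practice a plotting library would usually draw the chart.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Returns a dict mapping the lower edge of each bin to its frequency.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data: ages of ten customers.
ages = [21, 25, 27, 31, 33, 34, 38, 42, 45, 58]
freq = histogram(ages, bin_width=10)

# One row per bin, one '#' per observation in that bin.
for lower, count in freq.items():
    print(f"{lower:>3}-{lower + 9:<3} {'#' * count}")
```

Even this crude view shows where most of the values sit and whether any bin looks suspiciously empty or full, which is the kind of quick check the heuristics above ask for.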
Data Preparation
CRISP-DM Methodology
“Which one should I include (or exclude)?”
Data Selection
Data Cleaning
Dirty data:
• Missing value
• Incomplete
• Outdated
• Duplication
• Outlier
Remember: Expect problems in your data.
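Dirty data can be handled in many ways; the following is one minimal sketch using only the Python standard library. It covers three of the problems listed above (duplication, missing values, outliers). The records, the field meanings, median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, not something the slides prescribe.

```python
from statistics import median, quantiles

# Hypothetical raw records: (customer_id, monthly_spend); None marks a missing value.
raw = [
    (1, 120.0),
    (2, 95.0),
    (2, 95.0),    # duplicated row
    (3, None),    # missing value
    (4, 110.0),
    (5, 105.0),
    (6, 2500.0),  # suspiciously large value
]

# 1. Remove exact duplicates while keeping the original order.
deduped = list(dict.fromkeys(raw))

# 2. Impute missing values with the median of the observed values.
observed = [spend for _, spend in deduped if spend is not None]
fill = median(observed)
cleaned = [(cid, fill if spend is None else spend) for cid, spend in deduped]

# 3. Flag outliers with the 1.5 * IQR rule (flag them, do not silently drop them).
q1, _, q3 = quantiles([spend for _, spend in cleaned], n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(cid, spend) for cid, spend in cleaned if not low <= spend <= high]

print(cleaned)
print(outliers)
```

Whether a flagged value is an error or a genuinely unusual customer is a business question, so the sketch only reports it and leaves the decision to the analyst.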
Data Construction
• Feature engineering – derived attributes, e.g.:
  • year from timestamp
  • quarter from timestamp
  • BMI from weight and height
  • log(x) for skewed data (e.g. house price)
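The derived attributes listed above can be sketched in a few lines. The record below is artificial: it combines unrelated fields in one row purely to mirror the slide's four examples, and every field name (`timestamp`, `weight_kg`, `height_cm`, `house_price`) is an assumption.

```python
import math
from datetime import datetime


def derive_features(record):
    """Return a copy of `record` with derived attributes added."""
    ts = datetime.fromisoformat(record["timestamp"])
    height_m = record["height_cm"] / 100
    return {
        **record,
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,           # months 1-3 -> Q1, 4-6 -> Q2, ...
        "bmi": record["weight_kg"] / height_m ** 2,   # kg / m^2
        "log_price": math.log(record["house_price"]), # natural log compresses a long right tail
    }


# Hypothetical record; all values are made up.
row = {
    "timestamp": "2015-02-14T19:00:00",
    "weight_kg": 70.0,
    "height_cm": 175.0,
    "house_price": 250_000.0,
}
features = derive_features(row)
print(features["year"], features["quarter"], round(features["bmi"], 1))
```

The log transform only makes sense for strictly positive values, so a real pipeline would need to decide what to do with zero or negative entries first.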
Data Splitting
Two kinds of data splitting:
Training-Validation-Testing
Cross Validation
Data Splitting – Training-Validation-Testing
Training
• Construct classifier
Validation
• Pick algorithm
• Knob settings (tree depth, k in kNN, C in SVM)
Testing
• Estimate future error rate
Split randomly to avoid bias
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
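A random three-way split can be sketched as follows. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; libraries such as scikit-learn offer ready-made helpers (for example `train_test_split`) that would normally be used instead.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle `items` and cut them into training, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # random order avoids ordering bias
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


data = list(range(100))  # stand-in for 100 labelled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

The test portion should be touched only once, after the algorithm and its knob settings have been chosen on the validation portion; otherwise the error estimate is optimistic.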
Data Splitting – Cross Validation
Every point is used for both training and testing, but never at the same time
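The idea can be sketched with a small index generator for k-fold cross validation. The function name `k_fold_indices` is hypothetical, and the sketch assumes the data have already been shuffled; it does no shuffling of its own.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Across the k rounds each index lands in the test fold exactly once and in the training folds the other k - 1 times, and the per-fold scores are then averaged.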
Dimensionality Reduction
Principal Component Analysis vs Linear Discriminant Analysis
Modeling
CRISP-DM Methodology
Machine Learning
• Classification
• Regression
• Ranking
• Clustering
Model Selection
Regression – technique and generalization bound:
• Linear regression
• Kernel ridge regression
• Support vector regression
• Lasso
“Which one should I choose? Should I use all of them?”
It depends on…
Model Selection
Assumptions
• The predictors are linearly independent
• The error is a random variable with a mean of zero conditional on the explanatory variables
• The sample is representative of the population for the inference prediction
Interpretability
The understandability of why the model is true or how the model is induced from
https://chenhaot.com/pubs/mldg-interpretability.pdf
Beware of Overfitting!
http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png
Model Assessment
Regression
• (R)MSE
• Mean Absolute Error
• Correlation Coefficient
Classification
• Accuracy
• Precision
• Recall
• F-score
Descriptive
• Std. Error
• p-value
• Confidence Interval
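Several of the measures above are simple enough to compute by hand, which helps when checking what a library reports. The sketch below covers RMSE and mean absolute error for regression, and accuracy, precision, recall, and F-score (here F1) for a binary classifier. The function names and all sample values are assumptions; the correlation coefficient and the descriptive statistics are left out to keep it short.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical regression targets and predictions.
reg_true, reg_pred = [3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0]
# Hypothetical binary labels and predictions.
clf_true, clf_pred = [1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0]

reg_rmse = rmse(reg_true, reg_pred)
reg_mae = mae(reg_true, reg_pred)
report = binary_metrics(clf_true, clf_pred)
print(reg_rmse, reg_mae, report)
```

RMSE penalizes the single error of 2 more heavily than MAE does, which is why the two numbers differ on the same predictions; which one matters depends on the success criteria set during business understanding.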
Evaluation
CRISP-DM Methodology
“Does my model solve the problem?”
What is the impact? Is it novel? How useful is the solution?
Deployment
CRISP-DM Methodology
The Tasks
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project
Tools & Resources
• Text mining: NLTK, spaCy, OpenNLP
• Query expansion & clustering: Carrot2, Weka
• Data mining & machine learning: Weka, scikit-learn
• Languages: R, Python, Julia, Java, MATLAB, Mathematica, Haskell, Scala
• Python libraries: pandas, SciPy, NumPy, scikit-learn
• Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
• Visualization: D3.js
• Community: Big Data & Open Data Indonesia
“Thank you!”
Data Mining 101 – Python-ID Meetup February 2015
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars


Editor's Notes

  • #11: Cross Industry Standard Process for Data Mining
  • #43: Beware of overfitting!