Data Mining
by Shaoli Lu
What is data mining?
• Data mining is the process of analyzing data to
find hidden patterns using automatic statistical
methodologies/algorithms/models
• Use data as the “brain”
• Predictive analytics is a subset of data mining
• Ad hoc queries and OLAP are not suited to the
task
Statistical Models
• Decision trees
• Clustering
• Naïve Bayes
• Neural network
• Logistic regression
• Time series
• Associations
Business Cases
• Recommendation generation
• Anomaly detection
• Churn analysis
• Risk management
• Customer segmentation
• Targeted ads
• Forecasting
• Data exploration
Common Approaches
• Classification
• Clustering
• Association
• Regression
• Forecasting
• Sequence Analysis
• Deviation Analysis
Classification
• Classification is the most common data mining task. Business problems such
as churn analysis, risk management, and targeted advertising usually involve
classification
• Classification is a supervised machine learning task: the model is trained
on cases whose class is already known
• Classification is the act of assigning a category to each case
• Each case contains a set of attributes, one of which is the class attribute
• The task requires finding a model that describes the class attribute as a
function of input attributes
• Typical classification algorithms include decision trees, neural networks, and
Naïve Bayes
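As a rough illustration of describing the class attribute as a function of the input attributes, here is a minimal Naïve Bayes sketch in plain Python. The churn cases, the attribute names (`contract`, `support_calls`, `churned`), and the function names are hypothetical and chosen only for this example; the add-one smoothing is likewise an assumption, not something the slides specify.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(cases, class_attr):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(case[class_attr] for case in cases)
    value_counts = defaultdict(Counter)  # (attribute, class) -> counts of values
    values_seen = defaultdict(set)       # attribute -> distinct values observed
    for case in cases:
        label = case[class_attr]
        for attr, value in case.items():
            if attr == class_attr:
                continue
            value_counts[(attr, label)][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def classify(model, inputs):
    """Return the class with the highest (log) posterior score for the given inputs."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for attr, value in inputs.items():
            # Add-one (Laplace) smoothing so unseen values do not zero out the score.
            numer = value_counts[(attr, label)][value] + 1
            denom = count + len(values_seen[attr])
            score += math.log(numer / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training cases; "churned" is the class attribute.
cases = [
    {"contract": "monthly", "support_calls": "many", "churned": "yes"},
    {"contract": "monthly", "support_calls": "few", "churned": "yes"},
    {"contract": "monthly", "support_calls": "many", "churned": "yes"},
    {"contract": "yearly", "support_calls": "few", "churned": "no"},
    {"contract": "yearly", "support_calls": "many", "churned": "no"},
    {"contract": "yearly", "support_calls": "few", "churned": "no"},
]
model = train_naive_bayes(cases, "churned")
prediction = classify(model, {"contract": "monthly", "support_calls": "many"})
```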
Clustering
• Clustering is also called segmentation. It is used
to identify natural groupings of cases based on a
set of attributes
• Cases within the same group have more or less
similar attribute values
• Clustering is an unsupervised machine learning task.
There is no single attribute used to guide the
training process, so all input attributes are
treated equally
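One way to picture grouping cases by similar attribute values is k-means, sketched below in plain Python. k-means is only one of many clustering techniques and is not named in the slides; the sample points, the function name `kmeans`, and the naive choice of the first `k` points as starting centroids are assumptions made for this illustration.

```python
def kmeans(points, k, iterations=10):
    """Group numeric points into k clusters; all attributes are weighted equally."""
    centroids = [points[i] for i in range(k)]  # naive start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

# Hypothetical cases with two numeric attributes each.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

Note that no class attribute appears anywhere: the groups emerge only from the distances between cases.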
Association
• Association is also called market basket analysis
• Common usage of association is to identify
common sets of items and rules for the purpose
of cross-selling
• The association task has two goals: to find those
items that appear together frequently, and from
that, to determine rules about the association
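The two goals above can be sketched as follows: count how often pairs of items appear together (support), then keep the directed rules whose confidence is high enough. This is a simplified pair-only sketch, not a full frequent-itemset algorithm; the baskets, the thresholds, and the name `pair_rules` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets containing lhs, the share that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical market baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]
rules = pair_rules(baskets)
```

A rule such as milk → butter would then suggest offering butter to customers who buy milk.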
Regression
• The regression task is similar to classification, except
that instead of looking for patterns that describe a
class, the goal is to find patterns to determine a
numerical value
• The most popular techniques used for regression are
linear regression and logistic regression (the latter
predicts the probability of a class rather than an
arbitrary number). SQL Server supports regression trees
(part of the Microsoft Decision Trees algorithm) and
neural networks
• Regression models can use categorical inputs as well
as numerical inputs
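To make "finding a pattern that determines a numerical value" concrete, here is a minimal least-squares fit of a straight line with a single numeric input. The data and the name `fit_line` are made up for the example, and the sketch does not cover categorical inputs or the SQL Server algorithms mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # predicted numerical value for a new case
```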
Forecasting
• Forecasting takes as input sequences of numbers
indicating a series of values through time, and
then predicts future values of those series
using a variety of machine-learning and statistical
techniques that deal with seasonality, trend,
and noise in the data
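A very simple way to handle both trend and seasonality is to fit a straight-line trend and then add the average seasonal offset for each position in the cycle. The sketch below does exactly that; it is an illustrative simplification rather than any specific product algorithm, and the series, the season length, and the name `forecast` are assumptions.

```python
def forecast(series, season_length, steps):
    """Extrapolate a linear trend plus an average seasonal offset."""
    n = len(series)
    times = range(n)
    mean_t = sum(times) / n
    mean_y = sum(series) / n
    # Trend: least-squares line through (time index, value).
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, series)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    intercept = mean_y - slope * mean_t
    # Seasonality: average residual at each position within the season.
    residuals = [y - (intercept + slope * t) for t, y in enumerate(series)]
    seasonal = [
        sum(residuals[p::season_length]) / len(residuals[p::season_length])
        for p in range(season_length)
    ]
    return [
        intercept + slope * t + seasonal[t % season_length]
        for t in range(n, n + steps)
    ]

# Hypothetical series: rising by about 1 per step, with every second value 2 higher.
series = [0, 3, 2, 5, 4, 7, 6, 9]
future = forecast(series, season_length=2, steps=2)
```

The two forecasts continue both the upward trend and the alternating low/high pattern, although not exactly, because the fitted trend absorbs part of the seasonal effect.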
Sequence Analysis
• Sequence analysis is used to find patterns in a
series of events called a sequence
• Sequence and time-series data are similar
in that both contain adjacent observations that
are order-dependent. The difference is that
where a time series contains numerical data, a
sequence contains discrete states
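Because a sequence consists of discrete states, one basic pattern to look for is how often each state is followed by each other state. The sketch below estimates these transition probabilities (a first-order Markov view, which is an assumption of this example, not something the slides prescribe); the clickstream data and the name `transition_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each state with the state that immediately follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }

# Hypothetical web clickstreams, each an ordered series of page visits.
clickstreams = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "product"],
]
probs = transition_probabilities(clickstreams)
```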
Deviation Analysis
• Deviation analysis is used to find rare cases that
behave very differently from the norm
• Widely used, for example in fraud protection
• There is no standard technique for deviation
analysis; decision trees, clustering, or neural
network algorithms are usually applied to this task
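As a minimal illustration of finding cases that behave very differently from the norm, the sketch below flags values that lie far from the mean in units of standard deviation. This z-score rule is a simple stand-in chosen for brevity and is not one of the techniques listed above; the transaction amounts, the threshold, and the name `find_outliers` are hypothetical.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
outliers = find_outliers(amounts)
```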
Demo
• Demo #1: Data Mining By Example – Building
Predictive Model Using Microsoft Decision Trees

