Data Mining Steps
> Problem Definition
Market Analysis: Customer Profiling, Identifying Customer Requirements, Cross Market Analysis, Target Marketing, Determining Customer Purchasing Patterns
Corporate Analysis and Risk Management: Finance Planning and Asset Evaluation, Resource Planning, Competition
Fraud Detection
Customer Retention
Production Control
Science Exploration
> Data Preparation
Data preparation is about constructing a dataset from one or
more data sources to be used for exploration and modeling. It is
good practice to start with an initial dataset to get familiar
with the data, to discover first insights, and to understand any
data quality issues. The datasets provided for these projects
were obtained from kaggle.com.
Variable selection and description
Numerical – Ratio, Interval
Categorical – Ordinal, Nominal
Simplifying variables: From continuous to discrete
Formatting the data
Basic data integrity checks: missing data, outliers
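The preparation steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `ages` column, the 1.5 * IQR outlier rule, and the bin cut points in `age_group` are hypothetical choices, not something the slides prescribe.

```python
import statistics

# Hypothetical raw column with one missing value (None) and one extreme value.
ages = [23, 35, None, 41, 29, 38, 120, 33]

# Integrity check 1: count missing entries, then keep only observed values.
missing = sum(1 for v in ages if v is None)
observed = [v for v in ages if v is not None]

# Integrity check 2: flag outliers with the 1.5 * IQR rule (one common convention).
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < low or v > high]

# Simplifying a variable: map continuous age to discrete groups.
# The cut points 30 and 40 are assumptions chosen for illustration.
def age_group(age):
    if age < 30:
        return "young"
    if age < 40:
        return "middle"
    return "senior"

groups = [age_group(v) for v in observed if v not in outliers]

print("missing:", missing)
print("outliers:", outliers)
print("groups:", groups)
```

Whether an outlier should be dropped, capped, or kept depends on the dataset; the sketch only flags it.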
> Data Exploration
Data Exploration is about describing the data by means of
statistical and visualization techniques.
· Data Visualization:
o Univariate analysis explores variables (attributes) one by one. Variables can be either categorical or numerical.
Univariate Analysis - Categorical
Statistics | Visualization | Description
Count | Bar Chart | The number of values of the specified variable.
Count% | Pie Chart | The percentage of values of the specified variable.
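As a rough illustration of Count and Count%, the sketch below tallies a categorical variable and prints a text bar chart. The `colors` data is hypothetical.

```python
from collections import Counter

# Hypothetical categorical variable.
colors = ["red", "blue", "red", "green", "red", "blue"]

# Count: number of observations per category.
counts = Counter(colors)
total = sum(counts.values())

# Count%: share of each category, in percent.
percent = {value: 100 * n / total for value, n in counts.items()}

# A text stand-in for a bar chart: one '#' per observation.
for value, n in counts.most_common():
    print(f"{value:<6} {n:>3} {percent[value]:5.1f}%  {'#' * n}")
```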
Univariate Analysis - Numerical
Statistics | Visualization | Equation | Description
Count | Histogram | N | The number of values (observations) of the variable.
Minimum | Box Plot | Min | The smallest value of the variable.
Maximum | Box Plot | Max | The largest value of the variable.
Mean | Box Plot | | The sum of the values divided by the count.
Median | Box Plot | | The middle value. An equal number of values lies below and above the median.
Mode | Histogram | | The most frequent value. There can be more than one mode.
Quantile | Box Plot | | A set of 'cut points' that divide a set of data into groups containing equal numbers of values (Quartile, Quintile, Percentile, ...).
Range | Box Plot | Max-Min | The difference between maximum and minimum.
Variance | Histogram | | A measure of data dispersion.
Standard Deviation | Histogram | | The square root of variance.
Coefficient of Variation | Histogram | | The standard deviation divided by the mean.
Skewness | Histogram | | A measure of symmetry or asymmetry in the distribution of data.
Kurtosis | Histogram | | A measure of whether the data are peaked or flat relative to a normal distribution.
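The numerical statistics in the table can be computed with Python's standard `statistics` module. The sketch below is one possible reading: the `values` list is hypothetical, and the population forms (dividing by N) for variance, skewness, and kurtosis are an assumption, since the slide's own equations are not available here.

```python
import statistics

# Hypothetical numerical variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)                           # Count
minimum, maximum = min(values), max(values)
mean = statistics.fmean(values)           # sum of values / count
median = statistics.median(values)        # middle value
modes = statistics.multimode(values)      # list, since there can be several modes
quartiles = statistics.quantiles(values, n=4)  # three cut points
value_range = maximum - minimum           # Max - Min

# Population variance (divide by N); statistics.variance gives the sample form.
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)       # square root of the variance
coeff_var = std_dev / mean                # dispersion relative to the mean

# Moment-based skewness and excess kurtosis (population form, assumed here).
m3 = sum((x - mean) ** 3 for x in values) / n
m4 = sum((x - mean) ** 4 for x in values) / n
skewness = m3 / std_dev ** 3              # 0 for a symmetric distribution
excess_kurtosis = m4 / variance ** 2 - 3  # 0 for a normal distribution

print(n, minimum, maximum, mean, median, modes, value_range)
print(variance, std_dev, coeff_var, skewness, excess_kurtosis)
```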
Note: There are two types of numerical variables, interval and
ratio. An interval variable has values whose differences are
interpretable, but it does not have a true zero. A good example
is temperature in degrees Celsius. Data on an interval scale
can be added and subtracted but cannot be meaningfully
multiplied or divided. For example, we cannot say that a 20-degree
day is twice as hot as a 10-degree day. In contrast, a ratio
variable has values with a true zero and can be added,
subtracted, multiplied or divided (e.g., weight).
o Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the relationship between two variables: whether an association exists and how strong it is.
There are three types of bivariate analysis.
1. Numerical & Numerical: Scatter Plot, Linear Correlation …
2. Categorical & Categorical: Stacked Column Chart, Combination Chart, Chi-square Test
3. Numerical & Categorical: Line Chart with Error Bars, Combination Chart, Z-test and t-test
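Two of the measures named above can be sketched directly: the Pearson linear correlation for a numerical pair and the chi-square statistic for a categorical pair. The function names and the sample data (`hours`, `scores`, `table`) are hypothetical, and the sketch stops at the statistic itself without computing a p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two numerical variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def chi_square_statistic(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Numerical & Numerical: hypothetical study hours versus exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)

# Categorical & Categorical: hypothetical 2x2 table of observed counts.
table = [[30, 10], [20, 40]]
chi2 = chi_square_statistic(table)

print(f"r = {r:.3f}, chi-square = {chi2:.3f}")
```

A value of `r` near 1 or -1 suggests a strong linear association; a large chi-square statistic relative to its degrees of freedom suggests the two categorical variables are not independent.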
> Modeling
· Predictive modeling is the process by which a model is created to predict an outcome.
o If the outcome is categorical it is called classification, and if the outcome is numerical it is called regression.
· Descriptive modeling or clustering is the assignment of observations into clusters so that observations in the same cluster are similar.
· Finally, association rules can find interesting associations amongst observations.
Classification algorithms:
Frequency Table: ZeroR, OneR, Naive Bayesian, Decision Tree
Covariance Matrix: Linear Discriminant Analysis, Logistic Regression
Similarity Functions: K Nearest Neighbors
Others: Artificial Neural Network, Support Vector Machine
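To make two of the listed classifiers concrete, here is a minimal sketch of ZeroR (always predict the most frequent class, useful as a baseline) and K Nearest Neighbors with Euclidean distance. The function names, the toy `points` and `labels`, and the choice of `k=3` are assumptions for illustration only.

```python
import math
from collections import Counter

def zero_r(train_labels):
    """ZeroR: ignore all predictors and predict the most frequent class."""
    return Counter(train_labels).most_common(1)[0][0]

def knn_predict(train_points, train_labels, query, k=3):
    """K Nearest Neighbors: majority vote among the k closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical training data: two numerical attributes and a class label.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
labels = ["a", "a", "a", "b", "b"]

baseline = zero_r(labels)
prediction = knn_predict(points, labels, (3.8, 4.0), k=3)

print("ZeroR baseline:", baseline)
print("kNN prediction:", prediction)
```

Comparing any classifier against the ZeroR baseline shows how much the predictors actually contribute.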
Regression algorithms:
Frequency Table: Decision Tree
Covariance Matrix: Multiple Linear Regression
Similarity Functions: K Nearest Neighbors
Others: Artificial Neural Network, Support Vector Machine
Clustering algorithms:
Hierarchical: Agglomerative, Divisive
Partitive: K Means, Self-Organizing Map
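As an example of a partitive method, the sketch below implements a basic K Means loop. It assumes a naive initialization (the first `k` points become the starting centroids) and a fixed number of iterations; practical implementations usually pick starting centroids more carefully. All names and data are hypothetical.

```python
import math

def nearest_index(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, k, iterations=10):
    # Naive initialization (an assumption): use the first k points.
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [nearest_index(p, centroids) for p in points]
    return centroids, labels

# Hypothetical observations forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, labels = k_means(points, k=2)

print("centroids:", centroids)
print("labels:", labels)
```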
> Evaluation
· Evaluation helps to find the model that best represents our data
and to estimate how well the chosen model will work in the future.
Two common methods are Hold-Out and Cross-Validation.
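The two evaluation schemes differ only in how the data is split. A minimal sketch, assuming a 30% test fraction for hold-out and 5 folds for cross-validation (both hypothetical defaults), with rows represented by plain integers:

```python
import random

def hold_out_split(rows, test_fraction=0.3, seed=0):
    """Hold-out: shuffle once, then reserve a fraction of rows for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_splits(rows, k=5):
    """Cross-validation: each fold serves as the test set exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test

rows = list(range(10))  # stand-in for 10 observations

train, test = hold_out_split(rows)
print("hold-out:", len(train), "train rows,", len(test), "test rows")

splits = list(k_fold_splits(rows, k=5))
for fold_number, (fold_train, fold_test) in enumerate(splits, start=1):
    print("fold", fold_number, "test rows:", fold_test)
```

In practice a model is fitted on each training part and scored on the matching test part; with cross-validation the scores are then averaged over the folds.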
> Deployment
The concept of deployment in predictive data mining refers to
the application of a model for prediction to new data.