Introduction
UNIT 1 - Chapter 1
Ranjit Reddy M M. Tech., (Ph. D)
Associate Professor
Department of Computer Science & Engineering
Contents/Topics
• What Is Data Mining?
• Motivating Challenges
• The Origins of Data Mining
• Data Mining Tasks
• Summary
January 31, 2016 Data Mining: Concepts and Techniques 3
What Is Data Mining?
• Data mining (knowledge discovery from data):
  - Extracting or “mining” knowledge from large amounts of data.
  - Searching for knowledge in your data.
• Alternative names:
  - Knowledge discovery (mining) in databases (KDD)
  - Knowledge extraction
  - Data/pattern analysis
  - Data archeology
  - Data dredging
  - Information harvesting
  - Business intelligence, etc.
Knowledge Discovery (KDD) Process
Knowledge Discovery (KDD) Process steps
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources, such as flat files, spreadsheets, and relational tables, may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
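The seven steps can be read as a chain of functions, each consuming the output of the previous one. The sketch below is only an illustration under assumed data: the records, field names, function names, and the `min_qty` threshold are all hypothetical, and the "mining" step is deliberately trivial (ranking items by total quantity).

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
sales_file = [
    {"customer": "a1", "item": "Bread", "qty": 2},
    {"customer": "a1", "item": "Milk", "qty": None},  # noisy record: missing quantity
    {"customer": "b2", "item": "Milk", "qty": 1},
]
sales_table = [
    {"customer": "b2", "item": "Bread", "qty": 1},
    {"customer": "c3", "item": "Milk", "qty": 3},
]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate quantities per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Step 5: a trivial 'pattern': items ranked by total quantity."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Step 7: render the result for the user."""
    return [f"{item}: {qty} units" for item, qty in patterns]

data = integrate(clean(sales_file), clean(sales_table))
data = select(data, ["item", "qty"])
report = present(evaluate(mine(transform(data)), min_qty=3))
print(report)  # ['Milk: 4 units', 'Bread: 3 units']
```

In practice the steps are iterative rather than strictly linear; a poor result at step 6 often sends the analyst back to selection or transformation.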
Architecture of typical data mining system
• Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
• Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
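A concept hierarchy, as mentioned under the knowledge base, can be pictured as a chain of lookups from a specific value to more general ones. The sketch below assumes a hypothetical city → state → country hierarchy for a location attribute and made-up sales figures; it shows how the same data can be summarized at different levels of abstraction.

```python
# Hypothetical concept hierarchy for a "location" attribute: city -> state -> country.
CITY_TO_STATE = {"Bengaluru": "Karnataka", "Mysuru": "Karnataka", "Chennai": "Tamil Nadu"}
STATE_TO_COUNTRY = {"Karnataka": "India", "Tamil Nadu": "India"}

def generalize(city, level):
    """Map a city to the requested abstraction level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

def roll_up(sales, level):
    """Sum sales after replacing each city by its ancestor at the given level."""
    totals = {}
    for city, amount in sales.items():
        key = generalize(city, level)
        totals[key] = totals.get(key, 0) + amount
    return totals

# Illustrative sales figures per city (made up for this example).
sales_by_city = {"Bengaluru": 120, "Mysuru": 30, "Chennai": 80}

print(roll_up(sales_by_city, "state"))    # {'Karnataka': 150, 'Tamil Nadu': 80}
print(roll_up(sales_by_city, "country"))  # {'India': 230}
```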
Architecture of typical data mining system
• Data mining engine: Consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module.
• User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. This component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
Motivating Challenges
• Scalability:
  - Datasets can be gigabytes, terabytes, or even petabytes in size.
  - Massive datasets cannot fit into main memory.
  - Scalable data mining algorithms are needed to mine massive datasets.
  - Scalability can also be improved by using sampling or by developing parallel and distributed algorithms.
• High Dimensionality:
  - Datasets can have hundreds or thousands of attributes.
  - Example: a dataset that contains measurements of temperature at various locations.
  - Traditional data analysis techniques were developed for low-dimensional data and often do not work well on high-dimensional data.
  - Data mining algorithms are needed that can handle high dimensionality.
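The scalability slide mentions sampling as one way to cope with data that does not fit in main memory. One common technique for this (chosen here for illustration; the slides do not name a specific method) is reservoir sampling, which keeps a uniform random sample of `k` records while reading the data once as a stream. The function name and parameters below are assumptions of this sketch.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown length.

    Only k items are held in memory at any time (reservoir sampling, Algorithm R).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# A range stands in for a dataset too large to load at once.
sample = reservoir_sample(range(100_000), k=5, seed=42)
print(sample)
```

A mining algorithm can then be run on `sample` instead of the full dataset, trading some accuracy for a large reduction in memory and time.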
Motivating Challenges
• Heterogeneous and Complex Data:
  - Traditional data analysis methods deal with datasets whose attributes are all of the same type (continuous or categorical).
  - Complex datasets contain images, video, text, etc.
  - Mining methods are needed that can handle complex datasets.
• Data Ownership and Distribution:
  - Data is often not stored in one location or owned by one organization.
  - Data is geographically distributed among resources belonging to multiple entities.
  - Distributed data mining algorithms are needed to handle distributed datasets.
  - Key challenges:
    - How to reduce the amount of communication needed for distributed data.
    - How to effectively consolidate the data mining results from multiple sources.
    - How to address data security issues.
Motivating Challenges
• Non-Traditional Analysis:
  - The traditional statistical approach is based on a hypothesize-and-test paradigm.
  - A hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis.
  - This process is extremely labor-intensive.
  - Mining methods are needed that automate the process of hypothesis generation and evaluation.
The Origins of Data Mining
• Data mining draws on ideas such as:
  - Sampling, estimation, and hypothesis testing from statistics.
  - Search algorithms, modeling techniques, and learning theories from artificial intelligence, machine learning, and pattern recognition.
• Database systems are needed to provide support for efficient storage, indexing, and query processing.
• Techniques from parallel computing address the massive size of some datasets.
• Distributed computing techniques are used to gather information from different locations.
Data Mining Tasks
• Data mining tasks are divided into two major categories:
  - Predictive tasks: predict the value of a particular attribute based on the values of other attributes. The predicted attribute is known as the target or dependent variable, and the attributes used to make the prediction are known as explanatory or independent variables.
  - Descriptive tasks: characterize the general properties of the data in the database (correlations, trends, clusters, trajectories, and anomalies).
• Four of the core data mining tasks:
  - Classification and regression
  - Association analysis
  - Cluster analysis
  - Anomaly detection
Data Mining Functionalities
• Predictive Modeling: building a model for the target variable as a function of the explanatory variables.
  - Classification: used for discrete target variables.
    Ex: Predicting whether a web user will make a purchase at an online bookstore (the target variable is binary-valued).
  - Regression: used for continuous target variables.
    Ex: Forecasting the future price of a stock (price is a continuous-valued attribute).
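The difference between the two kinds of predictive modeling is the type of the target variable. The sketch below uses two deliberately simple techniques, chosen only for illustration: a 1-nearest-neighbour classifier for a binary target and an ordinary least-squares line for a continuous target. All data values and names are hypothetical, and a straight line is not a realistic model of stock prices.

```python
# Classification: the target is discrete ("yes"/"no": did the visitor buy?).
# Each hypothetical visit is described by (pages_viewed, minutes_on_site).
visits = [((2, 1.0), "no"), ((3, 2.5), "no"), ((12, 9.0), "yes"), ((15, 11.0), "yes")]

def classify(x):
    """Return the label of the training visit closest to x (1-nearest-neighbour)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(visits, key=lambda pair: sq_dist(pair[0], x))
    return label

# Regression: the target is continuous (a price).
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]  # toy data lying exactly on a line

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line(days, prices)
print(classify((13, 10.0)))  # yes
print(a + b * 6)             # 20.0
```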
Data Mining Functionalities
• Association Analysis:
  - Used to discover patterns that describe strongly associated features in the data.
  - The discovered patterns are typically represented in the form of implication rules or feature subsets.
  - The table below illustrates data collected at a supermarket.
  - Association analysis can be applied to find items that are frequently bought together by customers.
  - A discovered association rule is {Diapers} → {Milk} (customers who buy diapers also tend to buy milk).

Market Basket Analysis

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
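The strength of a rule such as {Diapers} → {Milk} is usually quantified by its support (the fraction of transactions containing all items in the rule) and its confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). The slides do not define these measures, so the sketch below is an illustrative addition; the transactions are the ten from the market basket table, and the function names are assumptions.

```python
# The ten transactions from the market basket table.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diapers", "Milk"}))       # 0.5
print(confidence({"Diapers"}, {"Milk"}))  # 1.0
```

Every one of the five transactions that contains diapers also contains milk, which is why the rule has a confidence of 1.0 in this small dataset.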
Data Mining Functionalities
• Cluster Analysis:
  - A cluster is a group of similar objects.
  - Objects are clustered or grouped based on the principle of maximizing the intra-class similarity (within a cluster) and minimizing the inter-class similarity (cluster to cluster).

Document Clustering

• Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article.
• There are 2 natural clusters in the dataset below:
  - The first cluster consists of the first 3 articles (news about the economy).
  - The second cluster contains the last 3 articles (news about health care).

Article   Words
1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1
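One way to see the two natural clusters is to measure how similar each pair of articles is. The sketch below uses cosine similarity over the word counts and then groups together articles that are linked by any non-zero similarity. This linking rule is a simplification chosen for this tiny dataset, not a general-purpose clustering algorithm such as k-means; the function names and the `threshold` parameter are assumptions.

```python
import math

# Word-frequency pairs for the six articles in the table.
articles = {
    1: {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    2: {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    3: {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    4: {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    5: {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    6: {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
}

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

def group(docs, threshold=0.0):
    """Group documents that are connected by similarity above the threshold."""
    clusters = []
    unassigned = set(docs)
    while unassigned:
        start = min(unassigned)
        unassigned.remove(start)
        cluster, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            linked = {d for d in unassigned if cosine(docs[current], docs[d]) > threshold}
            unassigned -= linked
            cluster |= linked
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return clusters

print(group(articles))  # [[1, 2, 3], [4, 5, 6]]
```

The economy articles share words such as Industry, Market, and Country, the health care articles share Drug, Health, and Patient, and no word appears in both groups, so every cross-group similarity is exactly zero.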
Data Mining Functionalities
• Anomaly Detection:
  - The task of identifying observations whose characteristics are significantly different from the rest of the data.
  - Such observations are known as anomalies or outliers.
  - A good anomaly detector must have a high detection rate and a low false alarm rate.
  - Applications: detection of fraud, network intrusions, etc.
• Ex: Credit Card Fraud Detection:
  - A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address.
  - When a new transaction arrives, it is compared against the profile of the user.
  - If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
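The "compare against the profile" step can be sketched with a very simple profile: the mean and standard deviation of a card holder's past transaction amounts. A new amount is flagged when it lies more than a chosen number of standard deviations from the mean. This z-score rule, the threshold of 3, and the sample amounts are all assumptions for illustration; real fraud detection systems use far richer profiles.

```python
import statistics

def is_suspicious(history, amount, threshold=3.0):
    """Flag amount if it is more than `threshold` standard deviations from the mean of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation; needs at least 2 values
    if stdev == 0:
        # All past amounts are identical, so any different amount is unusual.
        return amount != mean
    return abs(amount - mean) / stdev > threshold

# Hypothetical past transaction amounts for one card holder.
past_amounts = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(is_suspicious(past_amounts, 44.0))   # False
print(is_suspicious(past_amounts, 900.0))  # True
```

Lowering the threshold raises the detection rate but also the false alarm rate, which is the trade-off a good anomaly detector has to balance.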
