data mining

Introduction
UNIT 1 - Chapter 1
Ranjit Reddy M M. Tech., (Ph. D)
Associate Professor
Department of Computer Science & Engineering

Contents/Topics
 What Is Data Mining?
 Motivating Challenges
 The Origins of Data Mining
 Data Mining Tasks
 Summary

January 31, 2016 Data Mining: Concepts and Techniques 3
What Is Data Mining?
 Data Mining: (knowledge discovery from data)
 Extracting or “Mining” knowledge from large amounts of data.
 Searching for knowledge in your data
 Alternative names:
 Knowledge discovery (mining) in databases (KDD)
 knowledge extraction
 data/pattern analysis
 data archeology
 data dredging
 information harvesting
 business intelligence, etc.

Knowledge Discovery (KDD) Process

Knowledge Discovery (KDD) Process steps
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources, such as flat files,
spreadsheets, and relational tables, may be combined)
 3. Data selection (where data relevant to the analysis task are retrieved from the
database)
 4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance)
 5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on some interestingness measures)
 7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
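
The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch: the two store record lists, the function names, and the `min_qty` threshold are hypothetical and stand in for real data sources and real mining methods.

```python
from collections import Counter

# Hypothetical raw sources: two "stores" with noisy records (None = missing value).
store_a = [
    {"id": 1, "item": "bread", "qty": 2},
    {"id": 2, "item": "milk", "qty": None},   # noisy record, dropped by clean()
    {"id": 3, "item": "bread", "qty": 1},
]
store_b = [
    {"id": 4, "item": "Milk", "qty": 3},      # inconsistent capitalization
    {"id": 5, "item": "eggs", "qty": 12},
]

def clean(records):
    """Step 1: drop records with missing values and normalize item names."""
    return [
        {**r, "item": r["item"].lower()}
        for r in records
        if r["qty"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Step 4: aggregate quantities per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Step 5: a trivial 'pattern': items ranked by total quantity."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Step 6: keep patterns that pass an interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Step 7: render the result for the user."""
    return [f"{item}: {qty}" for item, qty in patterns]

data = integrate(clean(store_a), clean(store_b))
data = select(data, {"bread", "milk"})
report = present(evaluate(mine(transform(data)), min_qty=3))
print(report)
```

In a real system each step would be far richer, but the order of the calls mirrors the order of the KDD steps.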

Architecture of a Typical Data Mining System

 Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories. Data cleaning and data integration
techniques may be performed on the data.
 Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.
 Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction. Knowledge such as user beliefs, which can be used to assess a
pattern’s interestingness based on its unexpectedness, may also be included. Other
examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous
sources).

 Data mining engine: Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution analysis.
 Pattern evaluation module: This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search
toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns. Alternatively, the pattern evaluation module may be integrated
with the mining module.
 User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining
query or task, providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. This
component allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different
forms.

Motivating Challenges
 Scalability:
 Datasets with sizes of gigabytes, terabytes or even petabytes
 Massive datasets cannot fit into main memory
 Need to develop scalable data mining algorithms to mine massive datasets
 Scalability can also be improved by using sampling or developing parallel and
distributed algorithms.
 High Dimensionality:
 Data sets with hundreds or thousands of attributes.
 Example: a dataset that contains measurements of temperature at various
locations.
 Traditional data analysis techniques were developed for low-dimensional
data and often do not work well on high-dimensional data.
 Need to develop data mining algorithms to handle high dimensionality.
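
The sampling idea mentioned under Scalability can be illustrated with a short sketch. The slides do not name a particular sampling method; reservoir sampling is used here as one possible choice because it draws a uniform sample in a single pass without holding the whole dataset in memory. The function name and the generated stream are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# A generator stands in for a dataset too large to fit into main memory.
records = (x * x for x in range(100_000))
sample = reservoir_sample(records, k=5, seed=42)
print(sample)
```

A mining algorithm can then be run on `sample` instead of the full stream, trading some accuracy for a much smaller memory footprint.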

 Heterogeneous and Complex Data:
 Traditional data analysis methods deal with datasets whose attributes are all of
the same type (continuous or categorical).
 Complex datasets contain images, video, text, etc.
 Need to develop mining methods to handle complex datasets.
 Data Ownership and Distribution:
 Data is not stored in one location or owned by one organization.
 Data is geographically distributed among resources belonging to multiple
entities.
 Need to develop distributed data mining algorithms to handle distributed
datasets.
 Key challenges:
 How to reduce the amount of communication needed for distributed data.
 How to effectively consolidate the data mining results from multiple sources
 How to address data security issues.
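
One way to picture the first two challenges is a scheme in which each site sends only a small summary rather than its raw data, and a coordinator consolidates the summaries. The sketch below assumes two hypothetical sites and hypothetical function names; it is not a full distributed algorithm and does not address security.

```python
from collections import Counter

# Hypothetical transactions held at two separate sites.
site_a = [{"bread", "milk"}, {"bread", "eggs"}, {"milk"}]
site_b = [{"bread", "milk"}, {"tea"}]

def local_summary(transactions):
    """Run at each site: count items locally and return only the summary."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts, len(transactions)

def consolidate(summaries):
    """Run at the coordinator: merge the per-site summaries."""
    total_counts, total_n = Counter(), 0
    for counts, n in summaries:
        total_counts += counts
        total_n += n
    return total_counts, total_n

counts, n = consolidate([local_summary(site_a), local_summary(site_b)])
# Fraction of all transactions, across both sites, that contain each item.
support = {item: c / n for item, c in counts.items()}
print(support["bread"])
```

Only the item counts and transaction totals cross the network, which keeps communication small compared with shipping every transaction.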

 Non-Traditional Analysis:
 The traditional statistical approach is based on a hypothesize-and-test paradigm.
 A hypothesis is proposed, an experiment is designed to gather the data, and then
the data is analyzed with respect to the hypothesis.
 This process is extremely labor-intensive.
 Need to develop mining methods to automate the process of hypothesis
generation and evaluation.

The Origins of Data Mining
 Data mining draws on ideas such as:
 Sampling, estimation, and hypothesis testing from statistics.
 Search algorithms, modeling techniques, and learning theories from artificial
intelligence, machine learning, and pattern recognition.
 Database systems are needed to provide support for efficient storage, indexing,
and query processing.
 Techniques from parallel computing address the massive size of some datasets.
 Distributed computing techniques are used to gather information from different
locations.

Data Mining Tasks
 Data mining tasks are divided into two major categories:
 Predictive Tasks: Predict the value of a particular attribute based on the values
of other attributes. The predicted attribute is known as the target or dependent
variable, and the other attributes are known as explanatory or independent
variables.
 Descriptive Tasks: Characterize the general properties of the data in the
database (correlations, trends, clusters, trajectories, and anomalies).
 Four of the core data mining tasks:
 Classification & Regression
 Association Analysis
 Cluster Analysis
 Anomaly Detection

Data Mining Functionalities
 Predictive Modeling: Building a model for the target variable as a function of the
explanatory variables.
 Classification: Used for discrete target variables.
 Ex: Predicting whether a web user will make a purchase at an online bookstore
(the target variable is binary-valued).
 Regression: Used for continuous target variables.
 Ex: Forecasting the future price of a stock (price is a continuous-valued attribute).
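
The difference between the two kinds of predictive model can be shown with a minimal sketch. The data values are hypothetical, and the two techniques (least-squares line fitting and 1-nearest-neighbor) are chosen only as simple illustrations; the slides do not prescribe specific algorithms.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one explanatory variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def nearest_neighbor(train, x):
    """1-nearest-neighbor: return the label of the closest training example."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Regression: continuous target (hypothetical stock price per day).
days = [1, 2, 3, 4]
prices = [10.0, 12.0, 14.0, 16.0]
a, b = fit_line(days, prices)
print(a * 5 + b)  # forecast for day 5

# Classification: binary target (hypothetical minutes on site -> purchase made?).
visits = [(1, "no"), (2, "no"), (8, "yes"), (9, "yes")]
print(nearest_neighbor(visits, 7))
```

The regression model returns a number on a continuous scale, while the classifier returns one of a fixed set of labels.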

 Association Analysis:
 Used to discover patterns that describe strongly associated features in the data.
 The discovered patterns are typically represented in the form of implication rules or
feature subsets.
 The table below lists transaction data collected at a supermarket.
 Association analysis can be applied to find items that are frequently bought together
by customers.
 A discovered association rule is {Diapers} → {Milk} (customers who buy diapers
also tend to buy milk).
Market Basket Analysis

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}
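
The strength of the rule {Diapers} → {Milk} can be checked against the ten transactions with two standard measures, support and confidence. The slides do not define these measures; the sketch below uses the usual definitions, and the function names are chosen here for illustration.

```python
# The ten market basket transactions from the table.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diapers", "Milk"}))       # 5 of 10 transactions
print(confidence({"Diapers"}, {"Milk"}))  # every diaper basket also has milk
```

In this dataset diapers and milk appear together in 5 of the 10 transactions, and every transaction that contains diapers also contains milk, which is why the rule stands out.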

 Cluster Analysis:
 A cluster is a group of similar objects.
 Objects are clustered or grouped based on the principle of maximizing the
intraclass similarity (within a cluster) and minimizing the interclass
similarity (between clusters).
Document Clustering
 Each article is represented as a set of word-frequency pairs (w, c), where w is a
word and c is the number of times the word appears in the article.
 There are 2 natural clusters in the dataset below.
 The first cluster consists of the first 3 articles (news about the economy).
 The second cluster contains the last 3 articles (news about health care).
Article   Words
1         Dollar: 1, Industry: 4, Country: 2, Loan: 3, Deal: 2, Government: 2
2         Machinery: 2, Labor: 3, Market: 4, Industry: 2, Work: 3, Country: 1
3         Domestic: 4, Forecast: 2, Gain: 1, Market: 3, Country: 2, Index: 3
4         Patient: 4, Symptom: 2, Drug: 3, Health: 2, Clinic: 2, Doctor: 2
5         Death: 2, Cancer: 4, Drug: 3, Public: 4, Health: 3, Director: 2
6         Medical: 2, Cost: 3, Increase: 2, Patient: 2, Health: 3, Care: 1
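
The two natural clusters can be recovered from the word counts with a very simple procedure. The sketch below is an assumption-laden illustration, not the method from the slides: it measures similarity with cosine similarity and places two articles in the same cluster whenever they are linked by a chain of article pairs that share vocabulary (similarity above a threshold). Practical systems usually use algorithms such as k-means or hierarchical clustering.

```python
import math

# Word-frequency pairs for the six articles in the table.
articles = {
    1: {"Dollar": 1, "Industry": 4, "Country": 2, "Loan": 3, "Deal": 2, "Government": 2},
    2: {"Machinery": 2, "Labor": 3, "Market": 4, "Industry": 2, "Work": 3, "Country": 1},
    3: {"Domestic": 4, "Forecast": 2, "Gain": 1, "Market": 3, "Country": 2, "Index": 3},
    4: {"Patient": 4, "Symptom": 2, "Drug": 3, "Health": 2, "Clinic": 2, "Doctor": 2},
    5: {"Death": 2, "Cancer": 4, "Drug": 3, "Public": 4, "Health": 3, "Director": 2},
    6: {"Medical": 2, "Cost": 3, "Increase": 2, "Patient": 2, "Health": 3, "Care": 1},
}

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

def group(docs, threshold=0.0):
    """Group documents connected by chains of pairs with similarity above threshold."""
    clusters = []
    unvisited = set(docs)
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        cluster, frontier = {start}, [start]
        while frontier:
            current = frontier.pop()
            linked = {o for o in unvisited
                      if cosine(docs[current], docs[o]) > threshold}
            unvisited -= linked
            cluster |= linked
            frontier.extend(linked)
        clusters.append(sorted(cluster))
    return clusters

print(group(articles))
```

Articles 1 to 3 share words such as Industry, Country, and Market, and articles 4 to 6 share Patient, Drug, and Health, while no word is shared across the two groups, so the similarity between the groups is zero.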

 Anomaly detection:
 The task of identifying observations whose characteristics are significantly different
from the rest of the data.
 Such observations are known as anomalies or Outliers.
 A good anomaly detector must have a high detection rate and a low false alarm rate.
 Applications: detection of fraud, network intrusions, etc.
 Ex: Credit Card Fraud Detection:
 A credit card company records the transactions made by every credit card holder,
along with personal information such as credit limit, age, annual income, and
address.
 When a new transaction arrives, it is compared against the profile of the user.
 If the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
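
The comparison of a new transaction against a user profile can be sketched as follows. This is a deliberately simple illustration: the profile is reduced to the mean and standard deviation of past transaction amounts, and the z-score rule, the threshold of 3, and the history values are all assumptions rather than anything specified in the slides.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, z_threshold=3.0):
    """Flag an amount that lies far from the user's past spending profile."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # All past amounts are identical; any different amount is unusual.
        return amount != mu
    return abs(amount - mu) / sigma > z_threshold

# Hypothetical past transaction amounts for one card holder.
history = [42.0, 38.5, 51.0, 47.25, 40.0, 44.75]

print(is_suspicious(history, 45.0))    # close to the usual spending
print(is_suspicious(history, 900.0))   # far outside the profile
```

Raising `z_threshold` lowers the false alarm rate but also lowers the detection rate, which is the trade-off a good anomaly detector has to balance.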
