Why Data Mining?
◼ The Explosive Growth of Data: from terabytes to petabytes
◼ Data collection and data availability
◼ Automated data collection tools, database systems, Web,
computerized society
◼ Major sources of abundant data
◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific simulation, …
◼ Society and everyone: news, digital cameras, YouTube
◼ We are drowning in data, but starving for knowledge!
◼ “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web
databases
◼ 2000s
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
What Is Data Mining?
◼ Data mining (knowledge discovery from data)
◼ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
◼ Alternative names
◼ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
◼ Data mining—core of the knowledge discovery process
◼ Figure (condensed): Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
KDD Process: Several Key Steps
◼ Learning the application domain
◼ relevant prior knowledge and goals of application
◼ Creating a target data set: data selection
◼ Data cleaning and preprocessing: (may take 60% of effort!)
◼ Data reduction and transformation
◼ Find useful features, dimensionality/variable reduction, invariant
representation
◼ Choosing functions of data mining
◼ summarization, classification, regression, association, clustering
◼ Choosing the mining algorithm(s)
◼ Data mining: search for patterns of interest
◼ Pattern evaluation and knowledge presentation
◼ visualization, transformation, removing redundant patterns, etc.
◼ Use of discovered knowledge
Are All the “Discovered” Patterns Interesting?
◼ Data mining may generate thousands of patterns: Not all of them
are interesting
◼ Suggested approach: Human-centered, query-based, focused mining
◼ Interestingness measures
◼ A pattern is interesting if it is easily understood by humans, valid on new
or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
◼ Objective vs. subjective interestingness measures
◼ Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
◼ Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Find All and Only Interesting Patterns?
◼ Find all the interesting patterns: Completeness
◼ Can a data mining system find all the interesting patterns? Do we
need to find all of the interesting patterns?
◼ Association vs. classification vs. clustering
◼ Search for only interesting patterns: An optimization problem
◼ Can a data mining system find only the interesting patterns?
◼ Approaches
◼ First generate all the patterns and then filter out the uninteresting ones
◼ Generate only the interesting patterns—mining query
optimization
Data Mining and Business Intelligence
Figure (condensed): a pyramid with increasing potential to support business decisions toward the top, and the typical user at each layer:
◼ Decision making (End User)
◼ Data presentation: visualization techniques (Business Analyst)
◼ Data mining: information discovery (Data Analyst)
◼ Data exploration: statistical summary, querying, and reporting (Data Analyst)
◼ Data preprocessing/integration, data warehouses (DBA)
◼ Data sources: paper, files, Web documents, scientific experiments, database systems
Data Mining: Confluence of Multiple Disciplines
Figure (condensed): data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Architecture: Typical Data Mining System
Figure (condensed), from top to bottom:
◼ Graphical user interface
◼ Pattern evaluation
◼ Data mining engine
◼ Knowledge base (interacting with the mining engine and pattern evaluation)
◼ Database or data warehouse server, fed by data cleaning, integration, and selection
◼ Data repositories: database, data warehouse, World-Wide Web, and other information repositories
Why Data Mining?—Potential Applications
◼ Data analysis and decision support
◼ Market analysis and management
◼ Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
◼ Risk analysis and management
◼ Forecasting, customer retention, quality control, competitive
analysis
◼ Fraud detection and detection of unusual patterns (outliers)
◼ Other Applications
◼ Text mining (news group, email, documents) and Web mining
◼ Stream data mining
◼ Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management
◼ Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
◼ Target marketing
◼ Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
◼ Determine customer purchasing patterns over time
◼ Cross-market analysis—Find associations/correlations between product sales, and predict based on such associations
◼ Customer profiling—What types of customers buy what products (clustering
or classification)
◼ Customer requirement analysis
◼ Identify the best products for different groups of customers
◼ Predict what factors will attract new customers
◼ Provision of summary information
◼ Multidimensional summary reports
Ex. 2: Corporate Analysis & Risk Management
◼ Finance planning and asset evaluation
◼ cash flow analysis and prediction
◼ contingent claim analysis to evaluate assets
◼ cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
◼ Resource planning
◼ summarize and compare the resources and spending
◼ Competition
◼ monitor competitors and market directions
◼ group customers into classes and a class-based pricing procedure
◼ set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual Patterns
◼ Approaches: Clustering & model construction for frauds, outlier analysis
◼ Applications: Health care, retail, credit card service, telecomm.
◼ Auto insurance: ring of collisions
◼ Money laundering: suspicious monetary transactions
◼ Medical insurance
◼ Professional patients, ring of doctors, and ring of references
◼ Unnecessary or correlated screening tests
◼ Telecommunications: phone-call fraud
◼ Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
◼ Retail industry
◼ Analysts estimate that 38% of retail shrink is due to dishonest
employees
◼ Anti-terrorism
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
◼ Knowledge to be mined
◼ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
◼ Multiple/integrated functions and mining at multiple levels
◼ Techniques utilized
◼ Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
◼ Applications adapted
◼ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
◼ General functionality
◼ Descriptive data mining
◼ Predictive data mining
◼ Different views lead to different classifications
◼ Data view: Kinds of data to be mined
◼ Knowledge view: Kinds of knowledge to be discovered
◼ Method view: Kinds of techniques utilized
◼ Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
◼ Database-oriented data sets and applications
◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structured data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia database
◼ Text databases
◼ The World-Wide Web
Data Mining Functionalities
◼ Multidimensional concept description: Characterization and
discrimination
◼ Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
◼ Frequent patterns, association, correlation vs. causality
◼ Diaper → Beer [0.5%, 75%] (Correlation or causality?)
◼ Classification and prediction
◼ Construct models (functions) that describe and distinguish
classes or concepts for future prediction
◼ E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
◼ Predict some unknown or missing numerical values
Data Mining Functionalities (2)
◼ Cluster analysis
◼ Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
◼ Maximizing intra-class similarity & minimizing interclass similarity
◼ Outlier analysis
◼ Outlier: Data object that does not comply with the general behavior
of the data
◼ Noise or exception? Useful in fraud detection, rare events analysis
◼ Trend and evolution analysis
◼ Trend and deviation: e.g., regression analysis
◼ Sequential pattern mining: e.g., digital camera → large SD memory
◼ Periodicity analysis
◼ Similarity-based analysis
◼ Other pattern-directed or statistical analyses
Primitives that Define a Data Mining Task
◼ Task-relevant data
◼ Database or data warehouse name
◼ Database tables or data warehouse cubes
◼ Condition for data selection
◼ Relevant attributes or dimensions
◼ Data grouping criteria
◼ Type of knowledge to be mined
◼ Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
◼ Background knowledge
◼ Pattern interestingness measurements
◼ Visualization/presentation of discovered patterns
Data Mining Tasks
• Data mining makes use of various algorithms to perform a variety of tasks.
• These algorithms examine the sample data of a problem and determine a model that comes close to solving the problem.
• The models used to solve a problem are classified as predictive and descriptive.
Predictive Model
• A predictive model predicts the values of data by making use of known results from a different set of sample data.
• The data mining tasks that form part of the predictive model are:
• Classification
• Regression
• Time-series analysis
Descriptive Model
• A descriptive model determines the patterns and relationships in sample data.
• The data mining tasks that form part of the descriptive model are:
• Clustering
• Summarization
• Association rules
• Sequence discovery
Classification
• Classification assigns the data in a large databank to a predefined set of classes.
• The classes are defined before the data in the databank are studied or examined.
• It not only examines the existing sample data but also makes it possible to predict the future behavior of new data.
• E.g., fraud detection in credit-card transactions.
Regression
• Regression forecasts future data values based on present and past data values.
• It examines the values of the data and develops a mathematical formula.
• The result produced using this formula makes it possible to predict the future behavior of the existing data.
• E.g., the future behavior of a savings pattern.
Time-Series Analysis
• It makes use of current and past sample data to predict future values. The values are measured at evenly spaced intervals: hourly, weekly, monthly, yearly, and so on.
• E.g., examine trends in the stock market for various companies over a specific period and make investments accordingly.
Clustering
• Clustering creates new groups and classes based on the study of patterns and relationships between values of data in a databank.
• It is similar to classification but does not require the groups or classes to be predefined.
• It is also known as unsupervised learning or segmentation.
• All data items that most closely resemble each other are clubbed together into a single group, known as a cluster.
Summarization
• Summarization condenses a large chunk of data, such as the contents of a web page or document.
• The summarized data conveys the gist of the entire web page or document.
• It is also known as characterization or generalization.
• It searches for specific characteristics and attributes of data in a large amount of data and then summarizes them.
Association Rules
• Association-rule mining establishes associations and relationships between large numbers of unclassified data items based on certain attributes and characteristics.
Sequence Discovery
• Sequence discovery determines the sequential patterns that exist in a large and unorganized databank.
Data Preprocessing
Data summarization: Summarization is a key data mining concept which involves techniques for finding a compact description of a dataset. Simple summarization methods such as tabulating the mean and standard deviation are often applied for exploratory data analysis, data visualization, and automated report generation. Clustering [12, 21] is another data mining technique that is often used to summarize large datasets.
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues: it prepares raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications (like neural networks).
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
• (a) Missing data:
This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill in the missing values:
There are various ways to do this. You can choose to fill in the missing values manually, by the attribute mean, or with the most probable value.
• (b) Noisy data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways (a code sketch of the first method follows this list):
1. Binning method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately: one can replace all the data in a segment by its mean, or use the boundary values of the segment.
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may then be detected as values that fall outside the clusters.
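To make the cleaning steps concrete, here is a minimal Python sketch of mean imputation and smoothing by bin means, assuming numeric values, missing entries encoded as None, and a bin size of 3; the function names and the price list are invented for illustration.

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the data, split it into equal-size bins, and replace each
    value in a bin by the bin mean (smoothing by bin means)."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_values = data[i:i + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, None, 15, 21, 21, 24, 25, None, 34]
print(smooth_by_bin_means(fill_missing_with_mean(prices), bin_size=3))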
2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following (a code sketch of normalization and discretization follows this list):
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval labels or conceptual labels.
4. Concept hierarchy generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
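As a rough illustration of the first and third transformations, the sketch below scales values into [0.0, 1.0] and maps ages onto the interval labels used later in the buys_computer example; the function names, cut points, and sample values are assumptions, not part of the source.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def discretize_age(age):
    """Replace a raw age by an interval label."""
    if age <= 30:
        return "<=30"
    if age <= 40:
        return "31..40"
    return ">40"

print(min_max_normalize([12000, 35000, 58000, 73600]))
print([discretize_age(a) for a in (25, 38, 52)])  # ['<=30', '31..40', '>40']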
3. Data Reduction:
Data mining handles huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs.
The main approaches to data reduction are (a PCA sketch follows this list):
1. Data cube aggregation:
Aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection:
Only the most relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of an attribute: an attribute whose p-value is greater than the significance level can be discarded.
3. Numerosity reduction:
This stores a model of the data instead of the whole data; for example, regression models.
4. Dimensionality reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
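The sketch below illustrates the second of the two methods just named, PCA, via an eigendecomposition of the covariance matrix; it assumes numpy is available, and the function name and random data are invented for illustration.

import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # reduced representation

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca_reduce(X, k=2).shape)  # (100, 2)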
Data Integration: a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M> where:
G stands for the global schema,
S stands for the heterogeneous sources of schemas,
M stands for the mappings between queries on the source and global schemas.
There are two major approaches to data integration: the “tight coupling” approach and the “loose coupling” approach.
Tight coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical location through the process of ETL (Extraction, Transformation, and Loading).
Loose coupling:
• Here, an interface is provided that takes a query from the user, transforms it in a way the source databases can understand, and then sends the query directly to the source databases to obtain the result.
• The data remains only in the actual source databases.
What is Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Support information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer
than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”
Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
• A data warehouse is a semantically consistent
data store that serves as a physical
implementation of a decision support data
model and stores the information on which an
enterprise needs to make strategic decisions.
Data Warehouse vs. Operational DBMS
• The major task of on-line operational database systems is to
perform on-line transaction and query processing. These
systems are called on-line transaction processing (OLTP)
systems. They cover most of the day-to-day operations of an
organization, such as purchasing, inventory, manufacturing,
banking, payroll, registration, and accounting.
• Data warehouse systems, on the other hand, serve users or
knowledge workers in the role of data analysis and decision
making. Such systems can organize and present data in various
formats in order to accommodate the diverse needs of the
different users. These systems are known as on-line analytical
processing (OLAP) systems.
Distinct features (OLTP vs. OLAP):
• Users and system orientation:
• An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and
information technology professionals.
• An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers,
executives, and analysts.
• Data contents:
• An OLTP system manages current data that, typically, are too
detailed to be easily used for decision making.
• An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and
stores and manages information at different levels of
granularity. These features make the data easier to use in
informed decision making.
• View: An OLTP system focuses mainly on the current data
within an enterprise or department, without referring to
historical data or data in different organizations.
• In contrast, an OLAP system often spans multiple versions of a
database schema, due to the evolutionary process of an
organization. OLAP systems also deal with information that
originates from different organizations, integrating information
from many data stores. Because of their huge volume, OLAP
data are stored on multiple storage media.
• Access patterns: The access patterns of an OLTP system consist
mainly of short, atomic transactions. Such a system requires
concurrency control and recovery mechanisms.
• Accesses to OLAP systems are mostly read-only operations
(because most data warehouses store historical rather than up-to-date information), although many could be complex queries.
OLTP vs. OLAP

                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date,          historical, summarized,
                     detailed, flat relational,    multidimensional,
                     isolated                      integrated, consolidated
usage                repetitive                    ad-hoc
access               read/write,                   lots of scans
                     index/hash on primary key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100 MB–GB                     100 GB–TB
Data warehouse architecture
Design of Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
– Top-down view
• allows selection of the relevant information necessary for the data warehouse.
This information matches the current and future business needs.
– Data source view
• exposes the information being captured, stored, and managed by operational
systems. This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source tables.
– Data warehouse view
• consists of fact tables and dimension tables. It represents the information that is
stored inside the data warehouse, including precalculated totals and counts, as
well as information regarding the source, date, and time of origin, added to
provide historical context.
– Business query view
• sees the perspectives of data in the warehouse from the viewpoint of the end-user
Data Warehouse: A Multi-Tiered Architecture
Figure (condensed): operational databases and other sources feed the bottom tier, the data warehouse server, through Extract/Transform/Load/Refresh, supported by monitoring & administration and a metadata repository; data marts sit alongside the warehouse. The middle tier is an OLAP server, and the top tier holds the front-end tools (query/reports, analysis, and data mining).
◼ 1. The bottom tier is a warehouse database server that
is almost always a relational database system.
◼ Back-end tools and utilities are used to feed data into the
bottom tier from operational databases or other external
sources (such as customer profile information provided by
external consultants).
◼ These tools and utilities perform data extraction, cleaning,
and transformation (e.g., to merge similar data from
different sources into a unified format), as well as load
and refresh functions to update the data warehouse.
◼ This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically
implemented using either
(1) a relational OLAP (ROLAP) model, that is, an
extended relational DBMS that maps operations on
multidimensional data to standard relational
operations; or
(2) a multidimensional OLAP (MOLAP) model, that is,
a special-purpose server that directly implements
multidimensional data and operations.
3. The top tier is a front-end client layer, which
contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend
analysis, prediction, and so on).
Three Data Warehouse Models
• From the architecture point of view, there are three data warehouse
models: the enterprise warehouse, the data mart, and the virtual
warehouse.
• Enterprise warehouse
– collects all of the information about subjects spanning the entire
organization
• Data Mart
– a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be materialized
Data Warehouse Back-End Tools and Utilities
• Data extraction
– get data from multiple, heterogeneous, and external sources
• Data cleaning
– detect errors in the data and rectify them when possible
• Data transformation
– convert data from legacy or host format to warehouse format
• Load
– sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh
– propagate the updates from the data sources to the
warehouse
Metadata Repository
• Metadata is the data defining warehouse objects. The repository stores:
• Description of the structure of the data warehouse
– schema, view, dimensions, hierarchies, derived data defn, data mart
locations and contents
• Operational meta-data
– data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
– warehouse schema, view and derived data definitions
• Business data
– business terms and definitions, ownership of data, charging policies
Types of OLAP Servers: ROLAP versus MOLAP
versus HOLAP
• OLAP servers present business users with multidimensional
data from data warehouses or data marts, without
concerns regarding how or where the data are stored.
• The physical architecture and implementation of OLAP
servers must consider data storage issues.
Implementations of a warehouse server for OLAP
processing include the following:
• Relational OLAP (ROLAP) servers: These are the
intermediate servers that stand in between a relational
back-end server and client front-end tools. They use a
relational or extended-relational DBMS to store and
manage warehouse data, and OLAP middleware to support
missing pieces. ROLAP technology tends to have greater
scalability than MOLAP technology.
• Multidimensional OLAP (MOLAP) servers: These
servers support multidimensional views of data
through array-based multidimensional storage
engines. They map multidimensional views directly to
data cube array structures. The advantage of using a
data cube is that it allows fast indexing to precomputed
summarized data.
• Hybrid OLAP (HOLAP) servers: The hybrid OLAP
approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and
the faster computation of MOLAP.
• Specialized SQL servers: To meet the growing demand
of OLAP processing in relational databases, some
database system vendors implement specialized SQL
servers that provide advanced query language and
query processing support for SQL queries over star and
snowflake schemas in a read-only environment.
Data Warehouse Implementation
• Data warehouses contain huge volumes of
data.
• OLAP servers demand that decision support
queries be answered in the order of seconds.
Therefore, it is crucial for data warehouse
systems to support highly efficient cube
computation techniques, access methods, and
query processing techniques.
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting using
tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami in the context of
frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Beer and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
◼ Discloses an intrinsic and important property of data sets
◼ Forms the foundation for many essential data mining tasks
◼ Association, correlation, and causality analysis
◼ Sequential, structural (e.g., sub-graph) patterns
◼ Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
◼ Classification: associative classification
◼ Cluster analysis: frequent pattern-based clustering
◼ Broad applications
Basic Concepts: Frequent Patterns and Association Rules
◼ Itemset X = {x1, …, xk}
◼ Find all the rules X → Y with minimum support and confidence
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id  Items bought
10              A, B, D
20              A, C, D
30              A, D, E
40              B, E, F
50              B, C, D, E, F

Let sup_min = 50%, conf_min = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (60%, 100%)
D → A (60%, 75%)

(The original slide also shows a Venn diagram of customers buying beer, diapers, or both.)
Association rule mining is a two-step process:
1. Find all frequent itemsets: each of these itemsets will occur at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: these rules must satisfy minimum support and minimum confidence.
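A small Python sketch, assuming the five transactions above, that reproduces the support and confidence figures on this slide; the helper names are illustrative.

# Minimal sketch: support and confidence over the example transactions.
transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also has rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))       # 0.6  -> A -> D holds with 60% support
print(confidence({"A"}, {"D"}))  # 1.0  -> A -> D holds with 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D -> A holds with 75% confidence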
Apriori: Finding Frequent Itemsets Using Candidate Generation
◼ Apriori property:
“All non-empty subsets of a frequent itemset must also be frequent.”
◼ Apriori pruning principle: If any itemset is infrequent, its supersets should not be generated/tested!
◼ Method:
◼ Initially, scan DB once to get the frequent 1-itemsets
◼ Generate length-(k+1) candidate itemsets from length-k frequent itemsets
◼ Test the candidates against the DB
◼ Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example (sup_min = 2)

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (joined from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3 (joined from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
The Apriori Algorithm — Example (sup_min = 2)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}
Scan D → L3: {2 3 5}:2
The Apriori Algorithm
◼ Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Important Details of Apriori
◼ How to generate candidates?
◼ Step 1: self-joining Lk
◼ Step 2: pruning
◼ How to count supports of candidates?
◼ Example of Candidate-generation
◼ L3={abc, abd, acd, ace, bcd}
◼ Self-joining: L3*L3
◼ abcd from abc and abd
◼ acde from acd and ace
◼ Pruning:
◼ acde is removed because ade is not in L3
◼ C4={abcd}
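A short Python sketch of this candidate-generation step, assuming k-itemsets are kept as sorted tuples; gen_candidates is an invented helper name.

# Minimal sketch: self-join Lk on the first k-1 items, then prune.
from itertools import combinations

def gen_candidates(Lk):
    frequent = set(Lk)
    k = len(Lk[0])
    joined = {p + (q[-1],) for p in Lk for q in Lk
              if p[:-1] == q[:-1] and p[-1] < q[-1]}   # self-join step
    return [c for c in joined                          # prune by k-subsets
            if all(sub in frequent for sub in combinations(c, k))]

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(gen_candidates(L3))  # [('a', 'b', 'c', 'd')]; acde is pruned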
Generating Association Rules from Frequent
Itemsets
◼ Once the frequent item sets from transactions in
a database D have been found, it is
straightforward to generate strong association
rules from them (where strong association rules
satisfy both minimum support and minimum
confidence).
◼ This can be done using the following equation
for confidence.
◼ Confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
◼ The conditional probability is expressed in terms of itemset support counts, where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
◼ Based on this equation, association rules can be generated as follows:
◼ For each frequent itemset l, generate all nonempty proper subsets of l.
◼ For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
◼ Because the rules are generated from frequent itemsets, each one automatically satisfies minimum support.
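A minimal sketch of this rule-generation loop in Python, using the support counts of the {B, C, E} itemset and its subsets from the earlier Apriori example; the dictionary and helper name are illustrative.

# Minimal sketch: output s => (l - s) whenever confidence >= min_conf.
from itertools import combinations

support_count = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}

def rules_from(l, min_conf):
    for size in range(1, len(l)):             # nonempty proper subsets of l
        for s in map(frozenset, combinations(l, size)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                print(f"{set(s)} => {set(l - s)}  (conf = {conf:.2f})")

rules_from(frozenset("BCE"), min_conf=0.6)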
Challenges of Frequent Pattern Mining
◼ Challenges
◼ Multiple scans of transaction database
◼ Huge number of candidates
◼ Tedious workload of support counting for candidates
◼ Improving Apriori: general ideas
◼ Reduce passes of transaction database scans
◼ Shrink number of candidates
◼ Facilitate support counting of candidates
Mining Frequent Patterns Without
Candidate Generation
Construct FP-tree from a Transaction Database (min_support = 3)

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once and find the frequent 1-itemsets (single-item patterns): f:4, c:4, a:3, b:3, m:3, p:3
2. Sort the frequent items in frequency-descending order into the f-list: F-list = f-c-a-b-m-p
3. Scan DB again and construct the FP-tree by inserting each transaction’s ordered frequent items, sharing common prefixes; a header table links all occurrences of each item in the tree.
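A compact Python sketch of these three steps, assuming transactions are sets; it builds the prefix tree with counts but omits the header-table node links used during mining, and ties in the f-list order are broken arbitrarily. FPNode and build_fptree are invented names.

# Minimal sketch: count items, order transactions by the f-list, insert prefixes.
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(transactions, min_support):
    freq = {}
    for t in transactions:                        # 1. scan DB once
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    flist = sorted((i for i in freq if freq[i] >= min_support),
                   key=lambda i: -freq[i])        # 2. frequency-descending f-list
    root = FPNode(None, None)
    for t in transactions:                        # 3. scan DB again
        node = root
        for item in [i for i in flist if i in t]:
            child = node.children.setdefault(item, FPNode(item, node))
            child.count += 1
            node = child
    return root, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
      set("bcksp"), set("afcelpmn")]
root, flist = build_fptree(db, min_support=3)
print(flist)  # e.g. ['f', 'c', 'a', 'b', 'm', 'p'] (ties ordered arbitrarily)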
Partition Patterns and Databases
◼ Frequent patterns can be partitioned into subsets
according to f-list
◼ F-list=f-c-a-b-m-p
◼ Patterns containing p
◼ Patterns having m but no p
◼ …
◼ Patterns having c but no a nor b, m, p
◼ Pattern f
◼ Completeness and non-redundancy
Find Patterns Having p from p’s Conditional Database
◼ Starting at the frequent-item header table in the FP-tree
◼ Traverse the FP-tree by following the link of each frequent item p
◼ Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
◼ For each pattern base
◼ Accumulate the count for each item in the base
◼ Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is dropped because its count, 1, is below min_support = 3)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns With FP-trees
◼ Idea: Frequent pattern growth
◼ Recursively grow frequent patterns by pattern and
database partition
◼ Method
◼ For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
◼ Repeat the process on each newly created conditional
FP-tree
◼ Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
Why Is FP-Growth the Winner?
◼ Divide-and-conquer:
◼ decompose both the mining task and DB according to
the frequent patterns obtained so far
◼ leads to focused search of smaller databases
◼ Other factors
◼ no candidate generation, no candidate test
◼ compressed database: FP-tree structure
◼ no repeated scan of entire database
◼ basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching
C3 = L2 ⋈ L2:
• {I1, I2, I3}
• {I1, I2, I5}
• {I1, I3, I5}
• {I2, I3, I4}
• {I2, I3, I5}
• {I2, I4, I5}
C4 = L3 ⋈ L3:
• {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent.
• So C4 = ∅ and the algorithm terminates, having found all the frequent itemsets.
Classification vs. Prediction
◼ Classification
◼ predicts categorical class labels
◼ classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
◼ Prediction
◼ models continuous-valued functions, i.e., predicts unknown or missing values
◼ Typical applications
◼ Credit approval
◼ Target marketing
◼ Medical diagnosis
◼ Fraud detection
Supervised vs. Unsupervised Learning
◼ Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
◼ The set of tuples used for model construction is training set
◼ The model is represented as classification rules, decision trees,
or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of test sample is compared with the
classified result from the model
◼ Accuracy rate is the percentage of test set samples that are
correctly classified by the model
◼ If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? The model predicts ‘yes’, since rank = ‘professor’.
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This follows an example (Playing Tennis). Decision tree induction is the learning of decision trees from class-labeled training tuples.
Output: A Decision Tree for “buys_computer”

age?
├─ <=30 → student?
│    ├─ no  → no
│    └─ yes → yes
├─ 31..40 → yes
└─ >40 → credit rating?
     ├─ excellent → no
     └─ fair      → yes
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer
manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◼ Examples are partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
◼ Conditions for stopping partitioning
◼ All samples for a given node belong to the same class
◼ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
◼ There are no samples left
Rule Extraction from a Decision Tree
◼ Rules are easier to understand than large trees
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
◼ Rules are mutually exclusive and exhaustive
◼ Example: rule extraction from our buys_computer decision tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
CSE5230 Tutorial: The ID3 Decision Tree Algorithm
MONASH UNIVERSITY
Faculty of Information Technology
CSE5230 Data Mining
Semester 2, 2004
The aim of this exercise is to learn how one famous algorithm for constructing decision trees, ID3, works.
You will do this by building a decision tree by hand for a small dataset. At the end of this exercise you
should understand how ID3 constructs a decision tree using the concept of Information Gain. You will
be able to use the decision tree you create to make a decision about new data.
1 The weather data
The weather problem is a toy data set which we will use to understand how a decision tree is built. It
comes from Quinlan (1986), a paper which discusses the ID3 algorithm introduced in Quinlan (1979). It
is reproduced with slight modifications in Witten and Frank (1999), and concerns the conditions under
which some hypothetical outdoor game may be played. The data is shown in Table 1.
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Table 1: The weather data (Witten and Frank; 1999, p. 9).
In this dataset, there are five categorical attributes outlook, temperature, humidity, windy, and play. We
are interested in building a system which will enable us to decide whether or not to play the game on
the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature,
humidity, and windy. We can think of the attribute we wish to predict, i.e. play, as the output attribute,
and the other attributes as input attributes.
2 Building a decision tree using the ID3 algorithm
As we saw in lecture 6, a decision tree consists of nodes and arcs which connect nodes. To make a
decision, one starts at the root node, and asks questions to determine which arc to follow, until one
reaches a leaf node and the decision is made. This basic structure is shown in Figure 1.
Figure 1: Basic decision tree structure (a root node connected by arcs to internal nodes, terminating in leaf nodes).
The main ideas behind the ID3 algorithm are:
Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible
value of that attribute. A leaf node corresponds to the expected value of the output attribute when
the input attributes are described by the path from the root node to that leaf node.
In a “good” decision tree, each non-leaf node should correspond to the input attribute which is
the most informative about the output attribute amongst all the input attributes not yet considered
in the path from the root node to that node. This is because we would like to predict the output
attribute using the smallest possible number of questions on average.
Entropy is used to determine how informative a particular input attribute is about the output attribute for a subset of the training data. Entropy is a measure of uncertainty in communication systems introduced by Shannon (1948). It is fundamental in modern information theory.
2.1 Entropy
In information theory, entropy is a measure of the uncertainty about a source of messages. The more
uncertain a receiver is about a source of messages, the more information that receiver will need in order
to know what message has been sent.
For example, if a message source always sends exactly the same message, the receiver does not need any information to know what message has been sent—it’s always the same! The entropy of such a source is zero: there is no uncertainty at all. On the other hand, if a source can send $n$ possible messages and each message occurs independently of the preceding message with equal probability, then the uncertainty of the receiver is maximised. The receiver will need to ask $\log_2 n$ yes/no questions to determine which message has been sent, i.e. the receiver needs $\log_2 n$ bits of information.

Question 1 Why does the receiver need $\log_2 n$ bits? Think about representing each message as a binary number.
The average number of bits required to identify each message is a measure of the receiver’s uncertainty
about the source, and is known as the entropy of the source.
Consider a source which can produce $n$ messages $m_1, m_2, \ldots, m_n$. All messages are produced independently of each other, and the probability of producing message $m_i$ is $p_i$. For such a source with a message probability distribution $P = (p_1, p_2, \ldots, p_n)$, the entropy $H(P)$ is

$$H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)$$

In the equation above, $\log$ means $\log_2$—from now on we will assume that all logarithms are to base two.
If a set $T$ of records from a database (i.e. the training set for building the decision tree) is partitioned into $k$ classes $C_1, C_2, \ldots, C_k$ on the basis of the output attribute, then the average amount of information (measured in bits) needed to identify the class of a record is $H(P_C)$, where $P_C$ is the probability distribution of the classes, estimated from the data as

$$P_C = \left( \frac{|C_1|}{|T|}, \frac{|C_2|}{|T|}, \ldots, \frac{|C_k|}{|T|} \right) \qquad (2)$$

The notation $|C_i|$ means the number of elements in set $C_i$. For the weather data, where play is the output attribute, we have for the entire dataset $T$

$$P_C = \left( \frac{5}{14}, \frac{9}{14} \right)$$

where class $C_1$ corresponds to “no”, and class $C_2$ to “yes”. Using Equation 1, the entropy of set $T$ is thus

$$H(T) = H(P_C) = -\left( \frac{5}{14}\log\frac{5}{14} + \frac{9}{14}\log\frac{9}{14} \right) = 0.940$$
Note that here we identify the entropy of the set, $H(T)$, with the entropy of the probability distribution of the members of the set, $H(P_C)$.
2.2 Information gain
Now consider what happens if we partition the set on the basis of an input attribute $X$ into subsets $T_1, T_2, \ldots, T_m$. The information needed to identify the class of an element of $T$ is then the weighted average of the information needed to identify the class of an element of each subset:

$$H(X, T) = \sum_{i=1}^{m} \frac{|T_i|}{|T|} H(T_i) \qquad (3)$$
In the context of building a decision tree, we are interested in how much information about the output attribute can be gained by knowing the value of an input attribute $X$. This is just the difference between the information needed to classify an element of $T$ before knowing the value of $X$, $H(T)$, and the information needed after partitioning the dataset $T$ on the basis of knowing the value of $X$, $H(X, T)$. We define the information gain due to attribute $X$ for set $T$ as¹

$$Gain(X, T) = H(T) - H(X, T) \qquad (4)$$

In order to decide which attribute to split upon, the ID3 algorithm computes the information gain for each attribute, and selects the one with the highest gain.
As an example, let us consider the information gain for attribute temperature in the weather data. The attribute temperature can have three values, cool, mild and hot. Partitioning on the basis of temperature will thus divide $T$ into three subsets. This reorganisation of the weather data is shown in Table 2. We have $|T_{cool}| = 4$, $|T_{mild}| = 6$, and $|T_{hot}| = 4$. We can calculate the information needed to classify an element of $T$ after this partitioning using Equation 3:

$$
\begin{aligned}
H(\text{temperature}, T) &= \frac{4}{14} H(T_{cool}) + \frac{6}{14} H(T_{mild}) + \frac{4}{14} H(T_{hot}) \\
&= \frac{4}{14}\left( -\frac{1}{4}\log\frac{1}{4} - \frac{3}{4}\log\frac{3}{4} \right)
 + \frac{6}{14}\left( -\frac{2}{6}\log\frac{2}{6} - \frac{4}{6}\log\frac{4}{6} \right)
 + \frac{4}{14}\left( -\frac{2}{4}\log\frac{2}{4} - \frac{2}{4}\log\frac{2}{4} \right) \\
&= \frac{4}{14}(0.811) + \frac{6}{14}(0.918) + \frac{4}{14}(1.00) \\
&= 0.911
\end{aligned}
$$
The information gain for attribute temperature for set $T$ is thus

$$Gain(\text{temperature}, T) = 0.940 - 0.911 = 0.029 \text{ bits}$$

¹Note that in information theory $H(T) - H(X, T)$ is known as the mutual information between the input attribute $X$ and the output attribute.
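A small Python check of the arithmetic above, computing $H(T)$, $H(\text{temperature}, T)$, and the gain from the (temperature, play) pairs of Table 1; the variable and function names are illustrative.

# Minimal sketch: entropy and information gain for the weather data.
from math import log2

pairs = [("hot", "no"), ("hot", "no"), ("hot", "yes"), ("mild", "yes"),
         ("cool", "yes"), ("cool", "no"), ("cool", "yes"), ("mild", "no"),
         ("cool", "yes"), ("mild", "yes"), ("mild", "yes"), ("mild", "yes"),
         ("hot", "yes"), ("mild", "no")]

def entropy(labels):
    """H(P) = -sum p_i log2 p_i over the label distribution."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

h_t = entropy([play for _, play in pairs])
h_x_t = sum((len(sub) / len(pairs)) * entropy(sub)
            for v in {x for x, _ in pairs}
            for sub in [[play for x, play in pairs if x == v]])
print(h_t, h_x_t, h_t - h_x_t)  # ~0.940, ~0.911, ~0.029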
outlook temperature humidity windy play
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny cool normal false yes
rainy mild high false yes
sunny mild high false no
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
rainy mild high true no
sunny hot high false no
sunny hot high true no
overcast hot high false yes
overcast hot normal false yes
Table 2: The weather data partitioned on the basis of attribute temperature.
Question 2 Calculate the information gain for the other input attributes, outlook, humidity, and windy. You may use whatever tools you like to do the calculation. Note the following useful identity:

$$\log_a x = \frac{\log_b x}{\log_b a}$$

An entropy calculator has been created to help you with this, and subsequent, questions. You can find it at:
http://www.csse.monash.edu.au/~davids/cgi-bin/cse5230/entropy_calc.cgi

Question 3 Given your answer to the question above, what attribute would ID3 select to split upon for set $T$, i.e. the complete weather data from Table 1?
2.3 The ID3 algorithm
The ID3 algorithm works by recursively applying the procedure above to each of the subsets produced
until “pure” nodes are found—a pure node contains elements of only one class—or until there are no
attributes left to consider. It can be stated in pseudocode, as is shown in Figure 2.
Question 4 Use the ID3 algorithm given in Figure 2 to construct a decision tree for the
weather data by hand. The first call to ID3() uses the entire set of input
attributes and the entire set of training data.
Question 5 Consider a new data element with input attributes (outlook, temperature,
humidity, windy) = (…). How is it classified by the tree you constructed above?
David McG. Squire, August 26, 2004
CSE5230 Tutorial: The ID3 Decision Tree Algorithm 6
function ID3 (I, O, T) {
/* I is the set of input attributes
* O is the output attribute
* T is a set of training data
*
* function ID3 returns a decision tree
*/
if (T is empty) {
return a single node with the value Failure;
}
if (all records in T have the same value for O) {
return a single node with that value;
}
if (I is empty) {
return a single node with the value of the most frequent value of
O in T;
/* Note: some elements in this node will be incorrectly classified */
}
/* now handle the case where we can’t return a single node */
compute the information gain for each attribute in I relative to T;
let X be the attribute with largest Gain(X, T) of the attributes in I;
let {x_j| j=1,2, .., m} be the values of X;
let {T_j| j=1,2, .., m} be the subsets of T when T is partitioned
according to the value of X;
return a tree with the root node labelled X and
arcs labelled x_1, x_2, .., x_m, where the arcs go to the
trees ID3(I-{X}, O, T_1), ID3(I-{X}, O, T_2), .., ID3(I-{X}, O, T_m);
}
Figure 2: The ID3 algorithm
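For readers who want to run the algorithm rather than trace the pseudocode, here is a rough Python
transcription of Figure 2. It is an illustrative sketch only (not the tutorial's code); records are
assumed to be dicts mapping attribute names to values, with the output attribute as one of the keys.

from math import log2
from collections import Counter

def subset_entropy(records, output):
    # H(T): entropy of the output-attribute distribution over records
    counts = Counter(r[output] for r in records).values()
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

def info_gain(attr, output, records):
    # Gain(attr, T) = H(T) - H(attr, T), as in Equations 3 and 4
    parts = {}
    for r in records:
        parts.setdefault(r[attr], []).append(r)
    h_after = sum(len(p) / len(records) * subset_entropy(p, output)
                  for p in parts.values())
    return subset_entropy(records, output) - h_after

def id3(inputs, output, records):
    # Returns a leaf value, or an (attribute, {value: subtree}) pair
    if not records:
        return "Failure"
    classes = [r[output] for r in records]
    if len(set(classes)) == 1:          # all records share one output value
        return classes[0]
    if not inputs:                      # no attributes left: majority value
        return Counter(classes).most_common(1)[0][0]
    best = max(inputs, key=lambda a: info_gain(a, output, records))
    branches = {}
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        branches[value] = id3([a for a in inputs if a != best], output, subset)
    return (best, branches)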
2.4 The problem of attributes with many values
The simple ID3 algorithm above can have difficulties when an input attribute has many possible
values, because Gain(X, T) tends to favour attributes which have a large number of values. It is easy to
understand why if we consider an extreme case.
Imagine that our dataset , contains an attribute that has a different value for every element of , . This
could arise in practice if a unique record ID was retained when extracting , from a database—for
example a patient ID number in a hospital database. Such an attribute would give the maximum possible
information gain, since all the training data can be correctly classified by examining its value. It would
result in a decision tree in which all nodes below the root node were leaf nodes. This tree, however,
would be completely useless for classifying new data: there would be no arc corresponding to the value
David McG. Squire, August 26, 2004
CSE5230 Tutorial: The ID3 Decision Tree Algorithm 7
of the ID attribute. Moreover, there is no reason to suspect that such an attribute would have any causal
relationship with the output attribute.
The problem also arises when an attribute can take on many values, even if they are not unique to each
element. Quinlan (1986) suggests a solution based on considering the amount of information required
to determine the value of an attribute for a set T. This is given by H(P_X), where P_X is the
probability distribution of the values of X:

$$P_X = \left( \frac{|T_1|}{|T|}, \frac{|T_2|}{|T|}, \ldots, \frac{|T_m|}{|T|} \right) \qquad (5)$$
The quantity H(P_X) is known as the split information for attribute X and set T. We will call it
SplitInfo(X, T) = H(P_X). For the weather data, the split information for the attribute temperature is

$$SplitInfo(temperature, T) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} \approx 1.557$$
Quinlan (1986) suggests that rather than choosing the attribute with the biggest Gain(X, T), we select
the one with the biggest GainRatio(X, T), where

$$GainRatio(X, T) = \frac{Gain(X, T)}{SplitInfo(X, T)} \qquad (6)$$

Note that GainRatio(X, T) might not always be defined. Quinlan (1986) specifies a gain ratio criterion,
which says that we should select the attribute with the highest GainRatio(X, T) from amongst those
attributes with average-or-better Gain(X, T).
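As a quick numerical check on Equation 6, a few lines of Python (a sketch, not part of the tutorial)
compute the split information and gain ratio for temperature, reusing the subset sizes (4, 6, 4) and
the Gain value of 0.029 from Section 2.2:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

split_info = entropy([4, 6, 4])      # SplitInfo(temperature, T), approx. 1.557
print(round(0.029 / split_info, 3))  # GainRatio(temperature, T) (Question 7)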
If you have time, attempt the following questions.
Question 6 Under what circumstances is SplitInfo(X, T) equal to 0?
Question 7 What is GainRatio(temperature, T)?
Question 8 What are the gain ratios for the other input attributes?
Question 9 Construct a decision tree for the weather data using the gain ratio criterion
rather than the information gain.
References
Quinlan, J. R. (1979). Discovering rules by induction from large collections of examples, in D. Michie
(ed.), Expert Systems in the Micro Electronic Age, Edinburgh University Press, pp. 168–201.
Quinlan, J. R. (1986). Induction of decision trees, Machine Learning 1(1): 81–106.
Shannon, C. E. (1948). A mathematical theory of communication, Bell System Technical Journal
27: 379–423 and 623–656.
http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html
David McG. Squire, August 26, 2004
CSE5230 Tutorial: The ID3 Decision Tree Algorithm 8
Witten, I. H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations, Morgan Kaufmann, San Francisco, CA, USA.
http://www.cs.waikato.ac.nz/˜ml/weka/book.html
David McG. Squire, August 26, 2004
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data
distribution
 As a preprocessing step for other algorithms
September 11, 2020 Data Mining: Concepts and Techniques 1
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according
to their house type, value, and geographical location
September 11, 2020 Data Mining: Concepts and Techniques 2
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns
September 11, 2020 Data Mining: Concepts and Techniques 3
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 Incorporation of user-specified constraints
 Interpretability and usability
September 11, 2020 Data Mining: Concepts and Techniques 5
 Distances are normally used to measure the similarity
or dissimilarity between two data objects
 Some popular ones include: Minkowski distance:
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and q is a positive integer
 If q = 1, d is Manhattan distance
September 11, 2020 Data Mining: Concepts and Techniques 6
$$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q} \quad \text{(Minkowski)}$$

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \quad \text{(Manhattan, } q = 1)$$
 If q = 2, d is Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

 Both Euclidean and Manhattan distance satisfy the
following mathematical requirements of a distance
function:
 d(i,j) ≥ 0: distance is a non-negative number
 d(i,i) = 0: the distance of an object to itself is 0
 d(i,j) = d(j,i): distance is a symmetric function
 d(i,j) ≤ d(i,k) + d(k,j): going directly from object i to object j
in space is no more than making a detour over any other
object k (triangle inequality)
September 11, 2020 Data Mining: Concepts and Techniques 7
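To make the formulas concrete, here is a tiny Python sketch (not from the slides) of the Minkowski
distance, with Manhattan and Euclidean as the q = 1 and q = 2 special cases:

def minkowski(x, y, q=2):
    # Minkowski distance between two p-dimensional points
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))  # 7.0 (Manhattan)
print(minkowski(i, j, q=2))  # 5.0 (Euclidean)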
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
September 11, 2020 Data Mining: Concepts and Techniques 8
 Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters.
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means: Each cluster is represented by the center of the cluster
 k-medoids or PAM (Partitioning Around Medoids): Each cluster is
represented by one of the objects in the cluster
September 11, 2020 Data Mining: Concepts and Techniques 9
 Given k, the k-means algorithm is implemented in
four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2; stop when no new
assignments are made
September 11, 2020 Data Mining: Concepts and Techniques 10
 Example
September 11, 2020 Data Mining: Concepts and Techniques 11
[Figure: k-means with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each
object to the most similar center; update the cluster means; reassign, and repeat until assignments
no longer change.]
 Each object is assigned to the cluster center to
which it is nearest.
 Next the cluster centers are updated. That is,
mean value of each cluster is recalculated based
on the current objects in the cluster. Using the
new cluster centers, the objects are redistributed
to the clusters based on which cluster center is
the nearest.
 This process iterates. The process of iteratively
reassigning objects to clusters to improve the
partitioning is referred to as iterative relocation.
 When no redistribution of objects in any cluster
occurs, the process terminates.
September 11, 2020 Data Mining: Concepts and Techniques 12
Input:
k: number of clusters
D: a dataset containing n objects
Output: a set of k clusters
Method
 1. Arbitrarily choose k objects from D as the
initial centers
 2. repeat
 3. (Re)assign each object to the cluster to which
the object is the most similar, based on the mean
value of the objects in the cluster
 4. Update the cluster means, i.e., calculate the
mean value of the objects for each cluster
 5. until no change
September 11, 2020 Data Mining: Concepts and Techniques 13
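The steps above map directly onto code. Below is a minimal pure-Python sketch of the k-means loop
(an illustration, not from the slides), assuming numeric points represented as tuples and squared
Euclidean distance:

import random

def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(members):
    n = len(members)
    return tuple(sum(coord) / n for coord in zip(*members))

def kmeans(points, k, max_iter=100):
    centers = random.sample(points, k)  # 1. arbitrary initial centers
    labels = []
    for _ in range(max_iter):           # 2. repeat
        # 3. (re)assign each object to the nearest center
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # 4. update each center to the mean of its cluster
        new_centers = [mean([p for p, l in zip(points, labels) if l == c])
                       if any(l == c for l in labels) else centers[c]
                       for c in range(k)]
        if new_centers == centers:      # 5. until no change
            break
        centers = new_centers
    return centers, labels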
 Weakness
 Applicable only when the mean is defined; what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
September 11, 2020 Data Mining: Concepts and Techniques 14
 The k-means algorithm is sensitive to outliers!
 Since an object with an extremely large value may substantially
distort the distribution of the data.
 K-medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, i.e., the most
centrally located object in a cluster.
September 11, 2020 Data Mining: Concepts and Techniques 16
 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids)
 starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-
medoids if it improves the total distance of the
resulting clustering
 PAM works effectively for small data sets, but does
not scale well for large data sets.
September 11, 2020 Data Mining: Concepts and Techniques 17
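A sketch of PAM's core swap step (illustrative only, not from the slides; assumes a small dataset and
a user-supplied distance function d): the cost of a configuration is the sum of distances from each
object to its nearest medoid, and a swap is kept only when it lowers that cost.

def total_cost(points, medoids, d):
    # sum of distances from each object to its nearest medoid
    return sum(min(d(p, m) for m in medoids) for p in points)

def pam(points, k, d):
    medoids = list(points[:k])           # arbitrary initial medoids
    improved = True
    while improved:                      # loop until no change
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue             # only consider non-medoid objects
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate, d) < total_cost(points, medoids, d):
                    medoids = candidate  # quality improved: keep the swap
                    improved = True
    return medoids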
September 11, 2020 Data Mining: Concepts and Techniques 18
[Figure: PAM with K = 2. Arbitrarily choose k objects as initial medoids; assign each remaining object
to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random; compute the
total cost of swapping a medoid O with O_random; swap if the quality is improved; loop until no change.]
September 11, 2020 Data Mining: Concepts and Techniques 19
September 11, 2020 Data Mining: Concepts and Techniques 20
[Figure: Four cases of the cost function for k-medoids clustering.]
 PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced
by outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not
scale well for large data sets.
➔ Sampling-based method:
CLARA (Clustering LARge Applications)
September 11, 2020 Data Mining: Concepts and Techniques 22
 It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output.
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased.
September 11, 2020 Data Mining: Concepts and Techniques 23
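The sampling idea translates directly into code on top of the PAM sketch given earlier (again
illustrative, not from the slides; the classical CLARA draws five samples of size 40 + 2k, which the
defaults below follow):

import random

def clara(points, k, d, n_samples=5, sample_size=None):
    # Apply PAM to several random samples; keep the medoid set that is
    # cheapest when evaluated on the WHOLE data set
    size = sample_size or min(len(points), 40 + 2 * k)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, size)
        medoids = pam(sample, k, d)            # cluster the sample with PAM
        cost = total_cost(points, medoids, d)  # judge on all objects
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best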
Mining Complex Types of Data
■ Mining Sequence Data
■ Mining Time Series
■ Mining Symbolic Sequences
■ Mining Biological Sequences
■ Mining Graphs and Networks
■ Mining Other Kinds of Data
Mining Sequence Data
■ Similarity Search in Time Series Data
■ Subsequence match, dimensionality reduction, query-based similarity
search, motif-based similarity search
■ Regression and Trend Analysis in Time-Series Data
■ long term + cyclic + seasonal variation + random movements
■ Sequential Pattern Mining in Symbolic Sequences
■ GSP, PrefixSpan, constraint-based sequential pattern mining
■ Sequence Classification
■ Feature-based vs. sequence-distance-based vs. model-based
Mining Graphs and Networks
■ Graph Pattern Mining
■ Frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph
■ Statistical Modeling of Networks
■ Small world phenomenon, power law (long-tail) distribution, densification
■ Clustering and Classification of Graphs and Homogeneous Networks
■ Clustering: Fast Modularity vs. SCAN
■ Classification: model vs. pattern-based mining
■ Clustering, Ranking and Classification of Heterogeneous Networks
■ RankClus, RankClass, and meta path-based, user-guided methodology
■ Role Discovery and Link Prediction in Information Networks
■ PathPredict
■ Similarity Search and OLAP in Information Networks: PathSim, GraphCube
■ Evolution of Social and Information Networks: EvoNetClus
Mining Other Kinds of Data
■ Mining Spatial Data
■ Spatial frequent/co-located patterns, spatial clustering and classification
■ Mining Spatiotemporal and Moving Object Data
■ Spatiotemporal data mining, trajectory mining, periodica, swarm, …
■ Mining Cyber-Physical System Data
■ Applications: healthcare, air-traffic control, flood simulation
■ Mining Multimedia Data
■ Social media data, geo-tagged spatial clustering, periodicity analysis, …
■ Mining Text Data
■ Topic modeling, i-topic model, integration with geo- and networked data
■ Mining Web Data
■ Web content, web structure, and web usage mining
■ Mining Data Streams
■ Dynamics, one-pass, patterns, clustering, classification, outlier detection
* Data Mining: Principles and Algorithms 5
Mining Complex Data Objects:
Generalization of Structured Data
■ Set-valued attribute
■ Generalization of each value in the set into its corresponding
higher-level concepts
■ Derivation of the general behavior of the set, such as the number
of elements in the set, the types or value ranges in the set, or the
weighted average for numerical data
■ E.g., hobby = {tennis, hockey, chess, violin, PC_games}
generalizes to {sports, music, e_games}
■ List-valued or a sequence-valued attribute
■ Same as set-valued attributes except that the order of the
elements in the sequence should be observed in the generalization
* Data Mining: Principles and Algorithms 6
Generalizing Spatial and Multimedia Data
■ Spatial data:
■ Generalize detailed geographic points into clustered regions, such
as business, residential, industrial, or agricultural areas,
according to land usage
■ Require the merge of a set of geographic areas by spatial
operations
■ Image data:
■ Extracted by aggregation and/or approximation
■ Size, color, shape, texture, orientation, and relative positions and
structures of the contained objects or regions in the image
■ Music data:
■ Summarize its melody: based on the approximate patterns that
repeatedly occur in the segment
■ Summarize its style: based on its tone, tempo, or the major
musical instruments played
* Data Mining: Principles and Algorithms 7
Generalizing Object Data
■ Object identifier
■ generalize to the lowest level of class in the class/subclass
hierarchies
■ Class composition hierarchies
■ generalize only those closely related in semantics to the current one
■ Construction and mining of object cubes
■ Extend the attribute-oriented induction method
■ Apply a sequence of class-based generalization operators on
different attributes
■ Continue until a small number of generalized objects remain that
can be summarized as a concise description in high-level terms
■ Implementation
■ Examine each attribute, generalize it to simple-valued data
■ Construct a multidimensional data cube (object cube)
■ Problem: it is not always desirable to generalize a set of values
to single-valued data
* Data Mining: Principles and Algorithms 8
Ex.: Plan Mining by Divide and Conquer
■ Plan: a sequence of actions
■ E.g., Travel (flight): traveler, departure, arrival, d-time, a-time,
airline, price, seat
■ Plan mining: extraction of important or significant generalized
(sequential) patterns from a planbase (a large collection of plans)
■ E.g., Discover travel patterns in an air flight database, or
■ find significant patterns from the sequences of actions in the
repair of automobiles
■ Method
■ Attribute-oriented induction on sequence data
■ A generalized travel plan: small-big*-small
■ Divide & conquer: Mine characteristics for each subsequence
■ E.g., big*: same airline, small-big: nearby region
* Data Mining: Principles and Algorithms 9
A Travel Database for Plan Mining
■ Example: Mining a travel planbase
Travel plan table
Airport info table
* Data Mining: Principles and Algorithms 10
Multidimensional Analysis
■ Strategy
■ Generalize the
planbase in
different directions
■ Look for sequential
patterns in the
generalized plans
■ Derive high-level
plans
A multi-D model for the planbase
* Data Mining: Principles and Algorithms 11
Multidimensional Generalization
Multi-Dimensional generalization of the planbase
Merging consecutive, identical actions in plans
* Data Mining: Principles and Algorithms 12
What Is a Spatial Database System?
■ Geometric, geographic or spatial data: space-related data
■ Example: Geographic space (2-D abstraction of earth surface),
VLSI design, model of human brain, 3-D space representing the
arrangement of chains of protein molecules.
■ Spatial database system vs. image database systems.
■ Image database system: handling digital raster image (e.g.,
satellite sensing, computer tomography), may also contain
techniques for object analysis and extraction from images and
some spatial database functionality.
■ Spatial (geometric, geographic) database system: handling
objects in space that have identity and well-defined extents,
locations, and relationships.
* Data Mining: Principles and Algorithms 13
GIS (Geographic Information System)
■ GIS (Geographic Information System)
■ Analysis and visualization of geographic data
■ Common analysis functions of GIS
■ Search (thematic search, search by region)
■ Location analysis (buffer, corridor, overlay)
■ Terrain analysis (slope/aspect, drainage network)
■ Flow analysis (connectivity, shortest path)
■ Distribution (nearest neighbor, proximity, change detection)
■ Spatial analysis/statistics (pattern, centrality, similarity, topology)
■ Measurements (distance, perimeter, shape, adjacency, direction)
* Data Mining: Principles and Algorithms 14
Spatial DBMS (SDBMS)
■ SDBMS is a software system that
■ supports spatial data models, spatial ADTs, and a query
language supporting them
■ supports spatial indexing, spatial operations efficiently,
and query optimization
■ can work with an underlying DBMS
■ Examples
■ Oracle Spatial Data Cartridge
■ ESRI Spatial Data Engine
* Data Mining: Principles and Algorithms 15
Modeling Spatial Objects
■ What needs to be represented?
■ Two important alternative views
■ Single objects: distinct entities arranged in space each
of which has its own geometric description
■ modeling cities, forests, rivers
■ Spatially related collection of objects: describe space
itself (about every point in space)
■ modeling land use, partition of a country into
districts
* Data Mining: Principles and Algorithms 16
Modeling Single Objects: Point, Line and Region
■ Point: location only but not extent
■ Line (or a curve usually represented by a polyline, a
sequence of line segments):
■ moving through space, or connections in space (roads,
rivers, cables, etc.)
■ Region:
■ Something having extent in 2D-space (country, lake,
park). It may have a hole or consist of several disjoint
pieces.
* Data Mining: Principles and Algorithms 17
Modeling Spatially Related Collection of Objects
■ Modeling spatially related collection of objects: plane partitions and
networks.
■ A partition: a set of region objects that are required to be disjoint
(e.g., a thematic map). There often exist pairs of objects with a
common boundary (adjacency relationship).
■ A network: a graph embedded into the plane, consisting of a set of
point objects, forming its nodes, and a set of line objects describing
the geometry of the edges, e.g., highways, rivers, power supply
lines.
■ Other interesting spatially related collections of objects: nested
partitions, or a digital terrain (elevation) model.
* Data Mining: Principles and Algorithms 18
Spatial Data Types and Models
■ Field-based model: raster
data
■ framework: partitioning of
space
■ Object-based model: vector
model
■ point, line, polygon,
Objects, Attributes
* Data Mining: Principles and Algorithms 19
Spatial Data Warehousing
■ Spatial data warehouse: Integrated, subject-oriented, time-variant,
and nonvolatile spatial data repository
■ Spatial data integration: a big issue
■ Structure-specific formats (raster- vs. vector-based, OO vs.
relational models, different storage and indexing, etc.)
■ Vendor-specific formats (ESRI, MapInfo, Intergraph, IDRISI, etc.)
■ Geo-specific formats (geographic vs. equal area projection, etc.)
■ Spatial data cube: multidimensional spatial database
■ Both dimensions and measures may contain spatial components
* Data Mining: Principles and Algorithms 20
Dimensions and Measures in Spatial
Data Warehouse
■ Dimensions
■ non-spatial
■ e.g. “25-30 degrees”
generalizes to “hot”
(both are strings)
■ spatial-to-nonspatial
■ e.g. Seattle generalizes
to description “Pacific
Northwest” (as a string)
■ spatial-to-spatial
■ e.g. Seattle generalizes
to Pacific Northwest (as
a spatial region)
■ Measures
■ numerical (e.g. monthly revenue
of a region)
■ distributive (e.g. count, sum)
■ algebraic (e.g. average)
■ holistic (e.g. median, rank)
■ spatial
■ collection of spatial pointers
(e.g. pointers to all regions
with temperature of 25-30
degrees in July)
* Data Mining: Principles and Algorithms 21
Spatial-to-Spatial Generalization
■ Generalize detailed
geographic points into
clustered regions, such as
businesses, residential,
industrial, or agricultural
areas, according to land
usage
■ Requires the merging of a
set of geographic areas by
spatial operations
Dissolve
Merge
Clip
Intersect
Union
* Data Mining: Principles and Algorithms 22
Example: British Columbia Weather
Pattern Analysis
■ Input
■ A map with about 3,000 weather probes scattered in B.C.
■ Daily data for temperature, precipitation, wind velocity, etc.
■ Data warehouse using star schema
■ Output
■ A map that reveals patterns: merged (similar) regions
■ Goals
■ Interactive analysis (drill-down, slice, dice, pivot, roll-up)
■ Fast response time
■ Minimizing storage space used
■ Challenge
■ A merged region may contain hundreds of “primitive” regions
(polygons)
* Data Mining: Principles and Algorithms 23
Star Schema of the BC Weather Warehouse
■ Spatial data warehouse
■ Dimensions
■ region_name
■ time
■ temperature
■ precipitation
■ Measurements
■ region_map
■ area
■ count
Fact table
Dimension table
* Data Mining: Principles and Algorithms 24
Multidimensional Analysis of
Multimedia Data
■ Multimedia data cube
■ Design and construction similar to that of traditional
data cubes from relational data
■ Contain additional dimensions and measures for
multimedia information, such as color, texture, and
shape
■ The database does not store images but their descriptors
■ Feature descriptor: a set of vectors for each visual
characteristic
■ Color vector: contains the color histogram
■ MFC (Most Frequent Color) vector: five color centroids
■ MFO (Most Frequent Orientation) vector: five edge orientation
centroids
■ Layout descriptor: contains a color layout vector and an
edge layout vector
* Data Mining: Principles and Algorithms 25
Multi-Dimensional Search in
Multimedia Databases
* Data Mining: Principles and Algorithms 26
Color histogram Texture layout
Multi-Dimensional Analysis in
Multimedia Databases
* Data Mining: Principles and Algorithms 27
Refining or combining searches
Search for “blue sky”
(top layout grid is blue)
Search for “blue sky and
green meadows”
(top layout grid is blue
and bottom is green)
Search for “airplane in blue sky”
(top layout grid is blue and
keyword = “airplane”)
Mining Multimedia Databases
* Data Mining: Principles and Algorithms 28
[Figure: the data cube and the sub-space measurements. Cross-tab and group-by aggregations of image
counts by format (GIF, JPEG), colour (red, white, blue), and size (small, medium, large, very large),
with sums over each sub-space: by format, by colour, by size, and their combinations.]
• Format of image
• Duration
• Colors
• Textures
• Keywords
• Size
• Width
• Height
• Internet domain of image
• Internet domain of parent pages
• Image popularity
Mining Multimedia Databases
* Data Mining: Principles and Algorithms 29
Mining Multimedia Databases
* Data Mining: Principles and Algorithms 30
Classification in MultiMediaMiner
* Data Mining: Principles and Algorithms 31
Spatial Relationships from Layout
property P1 next-to property P2
property P1 on-top-of property P2
Different Resolution Hierarchy
Mining Multimedia Databases
* Data Mining: Principles and Algorithms 32
From Coarse to Fine Resolution Mining
Mining Multimedia Databases
* Data Mining: Concepts and Techniques 33
Data Mining Applications
■ Data mining is an interdisciplinary field with wide
and diverse applications
■ There exist nontrivial gaps between data
mining principles and domain-specific
applications
■ Some application domains
■ Financial data analysis
■ Retail industry
■ Telecommunication industry
■ Biological data analysis
* Data Mining: Concepts and Techniques 34
Data Mining for Financial Data Analysis
■ Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
■ Design and construction of data warehouses for
multidimensional data analysis and data mining
■ View the debt and revenue changes by month, by
region, by sector, and by other factors
■ Access statistical information such as max, min, total,
average, trend, etc.
■ Loan payment prediction/consumer credit policy analysis
■ feature selection and attribute relevance ranking
■ Loan payment performance
■ Consumer credit rating
* Data Mining: Concepts and Techniques 35
Financial Data Mining
■ Classification and clustering of customers for targeted
marketing
■ multidimensional segmentation by nearest-neighbor,
classification, decision trees, etc. to identify customer
groups or associate a new customer to an appropriate
customer group
■ Detection of money laundering and other financial crimes
■ integration of data from multiple DBs (e.g., bank
transactions, federal/state crime history DBs)
■ Tools: data visualization, linkage analysis,
classification, clustering tools, outlier analysis, and
sequential pattern analysis tools (find unusual access
sequences)
* Data Mining: Concepts and Techniques 36
Data Mining for Retail Industry
■ Retail industry: huge amounts of data on sales, customer
shopping history, etc.
■ Applications of retail data mining
■ Identify customer buying behaviors
■ Discover customer shopping patterns and trends
■ Improve the quality of customer service
■ Achieve better customer retention and satisfaction
■ Enhance goods consumption ratios
■ Design more effective goods transportation and
distribution policies
* Data Mining: Concepts and Techniques 37
Data Mining in Retail Industry (2)
■ Ex. 1. Design and construction of data warehouses based
on the benefits of data mining
■ Multidimensional analysis of sales, customers, products,
time, and region
■ Ex. 2. Analysis of the effectiveness of sales campaigns
■ Ex. 3. Customer retention: Analysis of customer loyalty
■ Use customer loyalty card information to register
sequences of purchases of particular customers
■ Use sequential pattern mining to investigate changes in
customer consumption or loyalty
■ Suggest adjustments on the pricing and variety of
goods
■ Ex. 4. Purchase recommendation and cross-reference of
items
* Data Mining: Concepts and Techniques 38
Data Mining for Telecomm. Industry (1)
■ A rapidly expanding and highly competitive industry and
a great demand for data mining
■ Understand the business involved
■ Identify telecommunication patterns
■ Catch fraudulent activities
■ Make better use of resources
■ Improve the quality of service
■ Multidimensional analysis of telecommunication data
■ Intrinsically multidimensional: calling-time, duration,
location of caller, location of callee, type of call, etc.
* Data Mining: Concepts and Techniques 39
Data Mining for Telecomm. Industry (2)
■ Fraudulent pattern analysis and the identification of unusual patterns
■ Identify potentially fraudulent users and their atypical usage
patterns
■ Detect attempts to gain fraudulent entry to customer accounts
■ Discover unusual patterns which may need special attention
■ Multidimensional association and sequential pattern analysis
■ Find usage patterns for a set of communication services by
customer group, by month, etc.
■ Promote the sales of specific services
■ Improve the availability of particular services in a region
* Data Mining: Concepts and Techniques 40
Biomedical Data Analysis
■ DNA sequences
■ Gene: a sequence of hundreds of individual nucleotides
arranged in a particular order
■ Humans have around 30,000 genes
■ Tremendous number of ways that the nucleotides can be
ordered and sequenced to form distinct genes
■ Semantic integration of heterogeneous, distributed
genome databases
■ Current: highly distributed, uncontrolled generation
and use of a wide variety of DNA data
* Data Mining: Concepts and Techniques 41
DNA Analysis: Examples
■ Similarity search and comparison among DNA sequences
■ Compare the frequently occurring patterns of each class (e.g.,
diseased and healthy)
■ Identify gene sequence patterns that play roles in various diseases
■ Association analysis: identification of co-occurring gene sequences
■ Most diseases are not triggered by a single gene but by a
combination of genes acting together
■ Association analysis may help determine the kinds of genes that
are likely to co-occur together in target samples
■ Path analysis: linking genes to different disease development stages
■ Different genes may become active at different stages of the
disease
■ Develop pharmaceutical interventions that target the different
stages separately
42
Privacy, Security and Social Impacts of
Data Mining
■ Many data mining applications do not touch personal data
■ E.g., meteorology, astronomy, geography, geology, biology, and
other scientific and engineering data
■ Many DM studies are on developing scalable algorithms to find general
or statistically significant patterns, not touching individuals
■ The real privacy concern: unconstrained access of individual records,
especially privacy-sensitive information
■ Method 1: Removing sensitive IDs associated with the data
■ Method 2: Data security-enhancing methods
■ Multi-level security model: permit access only to the authorized level
■ Encryption: e.g., blind signatures, biometric encryption, and
anonymous databases (personal information is encrypted and stored
at different locations)
■ Method 3: Privacy-preserving data mining methods
* Data Mining: Concepts and Techniques 43
Trends in Data Mining
■ Application exploration
■ development of application-specific data mining
systems
■ Invisible data mining (mining as built-in function)
■ Scalable data mining methods
■ Constraint-based mining: use of constraints to guide
data mining systems in their search for interesting
patterns
■ Integration of data mining with database systems, data
warehouse systems, and Web database systems
* Data Mining: Concepts and Techniques 44
Trends in Data Mining
■ Standardization of data mining language
■ A standard will facilitate systematic development,
improve interoperability, and promote the education
and use of data mining systems in industry and society
■ Visual data mining
■ New methods for mining complex types of data
■ More research is required towards the integration of
data mining methods with existing data analysis
techniques for the complex types of data
■ Web mining
■ Privacy protection and information security in data mining
data_mining_unit1.pdf

  • 1. September 11, 2020 Data Mining: Concepts and Techniques 1 Why Data Mining? ◼ The Explosive Growth of Data: from terabytes to petabytes ◼ Data collection and data availability ◼ Automated data collection tools, database systems, Web, computerized society ◼ Major sources of abundant data ◼ Business: Web, e-commerce, transactions, stocks, … ◼ Science: Remote sensing, bioinformatics, scientific simulation, … ◼ Society and everyone: news, digital cameras, YouTube ◼ We are drowning in data, but starving for knowledge! ◼ “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
  • 2. September 11, 2020 Data Mining: Concepts and Techniques 2 Evolution of Database Technology ◼ 1960s: ◼ Data collection, database creation, IMS and network DBMS ◼ 1970s: ◼ Relational data model, relational DBMS implementation ◼ 1980s: ◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.) ◼ Application-oriented DBMS (spatial, scientific, engineering, etc.) ◼ 1990s: ◼ Data mining, data warehousing, multimedia databases, and Web databases ◼ 2000s ◼ Stream data management and mining ◼ Data mining and its applications ◼ Web technology (XML, data integration) and global information systems
  • 3. September 11, 2020 Data Mining: Concepts and Techniques 3 What Is Data Mining? ◼ Data mining (knowledge discovery from data) ◼ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data ◼ Alternative names ◼ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 4. September 11, 2020 Data Mining: Concepts and Techniques 4 Knowledge Discovery (KDD) Process ◼ Data mining—core of knowledge discovery process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 5. September 11, 2020 Data Mining: Concepts and Techniques 5 KDD Process: Several Key Steps ◼ Learning the application domain ◼ relevant prior knowledge and goals of application ◼ Creating a target data set: data selection ◼ Data cleaning and preprocessing: (may take 60% of effort!) ◼ Data reduction and transformation ◼ Find useful features, dimensionality/variable reduction, invariant representation ◼ Choosing functions of data mining ◼ summarization, classification, regression, association, clustering ◼ Choosing the mining algorithm(s) ◼ Data mining: search for patterns of interest ◼ Pattern evaluation and knowledge presentation ◼ visualization, transformation, removing redundant patterns, etc. ◼ Use of discovered knowledge
  • 6. September 11, 2020 Data Mining: Concepts and Techniques 6 Are All the “Discovered” Patterns Interesting? ◼ Data mining may generate thousands of patterns: Not all of them are interesting ◼ Suggested approach: Human-centered, query-based, focused mining ◼ Interestingness measures ◼ A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm ◼ Objective vs. subjective interestingness measures ◼ Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. ◼ Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
  • 7. September 11, 2020 Data Mining: Concepts and Techniques 7 Find All and Only Interesting Patterns? ◼ Find all the interesting patterns: Completeness ◼ Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns? ◼ Association vs. classification vs. clustering ◼ Search for only interesting patterns: An optimization problem ◼ Can a data mining system find only the interesting patterns? ◼ Approaches ◼ First general all the patterns and then filter out the uninteresting ones ◼ Generate only the interesting patterns—mining query optimization
  • 8. September 11, 2020 Data Mining: Concepts and Techniques 8 Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 9. September 11, 2020 Data Mining: Concepts and Techniques 9 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization
  • 10. September 11, 2020 Data Mining: Concepts and Techniques 10 Architecture: Typical Data Mining System data cleaning, integration, and selection Database or Data Warehouse Server Data Mining Engine Pattern Evaluation Graphical User Interface Knowl edge- Base Database Data Warehouse World-Wide Web Other Info Repositories
  • 11. September 11, 2020 Data Mining: Concepts and Techniques 12 Why Data Mining?—Potential Applications ◼ Data analysis and decision support ◼ Market analysis and management ◼ Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation ◼ Risk analysis and management ◼ Forecasting, customer retention, quality control, competitive analysis ◼ Fraud detection and detection of unusual patterns (outliers) ◼ Other Applications ◼ Text mining (news group, email, documents) and Web mining ◼ Stream data mining ◼ Bioinformatics and bio-data analysis
  • 12. September 11, 2020 Data Mining: Concepts and Techniques 13 Ex. 1: Market Analysis and Management ◼ Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies ◼ Target marketing ◼ Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. ◼ Determine customer purchasing patterns over time ◼ Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association ◼ Customer profiling—What types of customers buy what products (clustering or classification) ◼ Customer requirement analysis ◼ Identify the best products for different groups of customers ◼ Predict what factors will attract new customers ◼ Provision of summary information ◼ Multidimensional summary reports
  • 13. September 11, 2020 Data Mining: Concepts and Techniques 14 Ex. 2: Corporate Analysis & Risk Management ◼ Finance planning and asset evaluation ◼ cash flow analysis and prediction ◼ contingent claim analysis to evaluate assets ◼ cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) ◼ Resource planning ◼ summarize and compare the resources and spending ◼ Competition ◼ monitor competitors and market directions ◼ group customers into classes and a class-based pricing procedure ◼ set pricing strategy in a highly competitive market
  • 14. September 11, 2020 Data Mining: Concepts and Techniques 15 Ex. 3: Fraud Detection & Mining Unusual Patterns ◼ Approaches: Clustering & model construction for frauds, outlier analysis ◼ Applications: Health care, retail, credit card service, telecomm. ◼ Auto insurance: ring of collisions ◼ Money laundering: suspicious monetary transactions ◼ Medical insurance ◼ Professional patients, ring of doctors, and ring of references ◼ Unnecessary or correlated screening tests ◼ Telecommunications: phone-call fraud ◼ Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm ◼ Retail industry ◼ Analysts estimate that 38% of retail shrink is due to dishonest employees ◼ Anti-terrorism
  • 15. September 11, 2020 Data Mining: Concepts and Techniques 16 Multi-Dimensional View of Data Mining ◼ Data to be mined ◼ Relational, data warehouse, transactional, stream, object- oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW ◼ Knowledge to be mined ◼ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. ◼ Multiple/integrated functions and mining at multiple levels ◼ Techniques utilized ◼ Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. ◼ Applications adapted ◼ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
  • 16. September 11, 2020 Data Mining: Concepts and Techniques 17 Data Mining: Classification Schemes ◼ General functionality ◼ Descriptive data mining ◼ Predictive data mining ◼ Different views lead to different classifications ◼ Data view: Kinds of data to be mined ◼ Knowledge view: Kinds of knowledge to be discovered ◼ Method view: Kinds of techniques utilized ◼ Application view: Kinds of applications adapted
  • 17. September 11, 2020 Data Mining: Concepts and Techniques 18 Data Mining: On What Kinds of Data? ◼ Database-oriented data sets and applications ◼ Relational database, data warehouse, transactional database ◼ Advanced data sets and advanced applications ◼ Data streams and sensor data ◼ Time-series data, temporal data, sequence data (incl. bio-sequences) ◼ Structure data, graphs, social networks and multi-linked data ◼ Object-relational databases ◼ Heterogeneous databases and legacy databases ◼ Spatial data and spatiotemporal data ◼ Multimedia database ◼ Text databases ◼ The World-Wide Web
  • 18. September 11, 2020 Data Mining: Concepts and Techniques 19 Data Mining Functionalities ◼ Multidimensional concept description: Characterization and discrimination ◼ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions ◼ Frequent patterns, association, correlation vs. causality ◼ Diaper → Beer [0.5%, 75%] (Correlation or causality?) ◼ Classification and prediction ◼ Construct models (functions) that describe and distinguish classes or concepts for future prediction ◼ E.g., classify countries based on (climate), or classify cars based on (gas mileage) ◼ Predict some unknown or missing numerical values
  • 19. September 11, 2020 Data Mining: Concepts and Techniques 20 Data Mining Functionalities (2) ◼ Cluster analysis ◼ Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns ◼ Maximizing intra-class similarity & minimizing interclass similarity ◼ Outlier analysis ◼ Outlier: Data object that does not comply with the general behavior of the data ◼ Noise or exception? Useful in fraud detection, rare events analysis ◼ Trend and evolution analysis ◼ Trend and deviation: e.g., regression analysis ◼ Sequential pattern mining: e.g., digital camera → large SD memory ◼ Periodicity analysis ◼ Similarity-based analysis ◼ Other pattern-directed or statistical analyses
  • 20. September 11, 2020 Data Mining: Concepts and Techniques 21 Primitives that Define a Data Mining Task ◼ Task-relevant data ◼ Database or data warehouse name ◼ Database tables or data warehouse cubes ◼ Condition for data selection ◼ Relevant attributes or dimensions ◼ Data grouping criteria ◼ Type of knowledge to be mined ◼ Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks ◼ Background knowledge ◼ Pattern interestingness measurements ◼ Visualization/presentation of discovered patterns
  • 21. Data Mining Tasks • Data Mining makes use of various algorithms to perform a variety of tasks. • These algorithms examine the sample data of a problem and determine a model that fits close to solving the problem. • The models used to solve a problem are classified as predictive and descriptive.
  • 22. Predictive Model • A Predictive Model enables to predict the values of data by making use of known results from different set of sample data. • The data mining tasks that forms the part of predictive model are: • Classification • Regression • Time series Analysis
  • 23. Descriptive Model • A Descriptive Model enables to determine the patterns and relationships in a sample data. • The data mining tasks that forms the part of Descriptive Model are: • Clustering • Summarization • Association Rules • Sequence Discovery
  • 24. Classification • Classification enables to classify the data in a large databank into pre defined set of classes. • The classes are defined before studying or examining data in data bank. • It not only enables to examine the existing sample data but also enable to predict future behavior of that sample data. • Eg. Fraud detection in credit card related transactions.
  • 25. Regression • Regression forecasts future data values based on present and past data values. • It examines the data and develops a mathematical formula that fits them. • Applying this formula makes it possible to predict the future behavior of the existing data. • E.g., the future behavior of a savings pattern.
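To make this concrete, here is a minimal Python sketch (not from the slides; the monthly savings figures are invented for illustration) that fits a linear formula to past values with NumPy and uses it to forecast a future value:

```python
import numpy as np

# Hypothetical monthly savings balances (illustrative data only)
months = np.array([1, 2, 3, 4, 5, 6])
savings = np.array([500.0, 540.0, 610.0, 650.0, 700.0, 760.0])

# Fit a degree-1 polynomial (simple linear regression): savings ~ a*month + b
a, b = np.polyfit(months, savings, deg=1)

# Use the fitted formula to forecast the balance for a future month
future_month = 9
forecast = a * future_month + b
print(f"fitted formula: savings = {a:.2f}*month + {b:.2f}")
print(f"forecast for month {future_month}: {forecast:.2f}")
```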
  • 26. Time Series Analysis • It makes use of current and past sample data to predict future values. The values used are evenly spaced in time: hourly, weekly, monthly, yearly, and so on. • E.g., examine trends in the stock market for various companies over a specific period and make investments accordingly.
  • 27. Clustering • It creates new groups and classes based on the study of patterns and relationships between values of data in a data bank. • It is similar to classification but does not require the groups or classes to be predefined. • It is also known as unsupervised learning or segmentation. • Data items that most closely resemble each other are grouped together into a single group, known as a cluster.
  • 28. Summarization • It condenses a large chunk of data contained in a web page or document. • The summarized data conveys the gist of the entire web page or document. • It is also known as characterization or generalization. • It searches for specific characteristics and attributes of data within a large amount of data and then summarizes them.
  • 29. Association Rules • Association rule mining establishes associations and relationships between large numbers of unclassified data items based on certain attributes and characteristics.
  • 30. Sequence Discovery • Sequence discovery determines the sequential patterns that exist in a large and unorganized data bank.
  • 31. Data Preprocessing Data Summarization: Summarization is a key data mining concept which involves techniques for finding a compact description of a dataset. Simple summarization methods such as tabulating means and standard deviations are often applied for exploratory data analysis, data visualization, and automated report generation. Clustering [12, 21] is another data mining technique that is often used to summarize large datasets. Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues: it prepares raw data for further processing. Data preprocessing is used in database-driven applications such as customer relationship management and in model-based applications such as neural networks.
  • 32. Steps Involved in Data Preprocessing:
1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning handles missing data, noisy data, etc. (a small sketch of these cleaning steps follows this slide).
(a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways:
1. Ignore the tuples: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
2. Fill in the missing values: There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
(b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning: This method smooths sorted data. The whole dataset is divided into segments of equal size and each segment is handled separately: all values in a segment can be replaced by the segment mean, or boundary values can be used.
2. Regression: Data can be smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
3. Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they fall outside the clusters.
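As an illustration of the cleaning steps above, here is a small Python sketch (the `income` column and its values are hypothetical, not from the slides) that fills missing values with the attribute mean and smooths noisy values by equal-size binning with bin means:

```python
import pandas as pd
import numpy as np

# Hypothetical attribute with missing and noisy values (names are illustrative)
df = pd.DataFrame({"income": [120.0, np.nan, 135.0, 940.0, 128.0, np.nan, 142.0, 131.0]})

# (a) Missing data: fill missing values with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# (b) Noisy data: binning -- sort, split into equal-size segments,
#     then replace every value in a segment by the segment mean
values = np.sort(df["income"].to_numpy())
bins = np.array_split(values, 4)  # four equal-size segments
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```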
  • 33. 2. Data Transformation: This step transforms the data into forms appropriate for the mining process. It involves the following (a small sketch of normalization and discretization follows this slide):
1. Normalization: Scales the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection: New attributes are constructed from the given set of attributes to help the mining process.
3. Discretization: Replaces the raw values of a numeric attribute by interval labels or conceptual labels.
4. Concept Hierarchy Generation: Attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute "city" can be generalized to "country".
3. Data Reduction: Data mining handles huge volumes of data, and analysis becomes harder as the volume grows. Data reduction addresses this: it aims to increase storage efficiency and reduce data storage and analysis costs. The main approaches are:
1. Data Cube Aggregation: Aggregation operations are applied to the data to construct a data cube.
2. Attribute Subset Selection: Only the highly relevant attributes are kept; the rest are discarded. One can use the significance level and p-value of an attribute: attributes whose p-value is greater than the significance level can be discarded.
3. Numerosity Reduction: Stores a model of the data instead of the whole data, for example regression models.
4. Dimensionality Reduction: Reduces the size of the data by encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is lossless; otherwise it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
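A minimal sketch of the normalization and discretization steps mentioned above (the `ages` values, interval cut points, and labels are illustrative assumptions, not from the slides):

```python
import numpy as np

ages = np.array([13, 15, 16, 19, 20, 20, 21, 22, 25, 30, 35, 40, 46, 52, 70], dtype=float)

# Normalization: min-max rescaling of values into the range [0.0, 1.0]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw numeric values by interval labels
labels = np.digitize(ages, bins=[20, 40, 60])  # 0: <20, 1: 20-39, 2: 40-59, 3: >=60
names = np.array(["youth", "adult", "middle_aged", "senior"])
print(list(zip(ages, normalized.round(2), names[labels])))
```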
  • 34. Data Integration: a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files. The data integration approach is formally defined as a triple <G, S, M>, where G stands for the global schema, S stands for the heterogeneous source schemas, and M stands for the mapping between queries over the source and global schemas. There are two major approaches to data integration: the "tight coupling" approach and the "loose coupling" approach. Tight Coupling: Here, a data warehouse is treated as an information retrieval component. • In this coupling, data is combined from different sources into a single physical location through the process of ETL – Extraction, Transformation, and Loading. Loose Coupling: • Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. • The data remains only in the actual source databases.
  • 35. What is Data Warehouse? • Defined in many different ways, but not rigorously. – A decision support database that is maintained separately from the organization’s operational database – Support information processing by providing a solid platform of consolidated, historical data for analysis. • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon • Data warehousing: – The process of constructing and using data warehouses September 11, 2020 Data Mining: Concepts and Techniques 1
  • 36. Data Warehouse—Subject-Oriented • Organized around major subjects, such as customer, product, sales • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process September 11, 2020 Data Mining: Concepts and Techniques 2
  • 37. Data Warehouse—Integrated • Constructed by integrating multiple, heterogeneous data sources – relational databases, flat files, on-line transaction records • Data cleaning and data integration techniques are applied. – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. September 11, 2020 Data Mining: Concepts and Techniques 3
  • 38. Data Warehouse—Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems – Operational database: current value data – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) • Every key structure in the data warehouse – Contains an element of time, explicitly or implicitly – But the key of operational data may or may not contain “time element” September 11, 2020 Data Mining: Concepts and Techniques 4
  • 39. Data Warehouse—Nonvolatile • A physically separate store of data transformed from the operational environment • Operational update of data does not occur in the data warehouse environment – Does not require transaction processing, recovery, and concurrency control mechanisms – Requires only two operations in data accessing: • initial loading of data and access of data September 11, 2020 Data Mining: Concepts and Techniques 5
  • 40. • A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise needs to make strategic decisions. September 11, 2020 Data Mining: Concepts and Techniques 6
  • 41. Data Warehouse vs. Operational DBMS • The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. • Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems. September 11, 2020 Data Mining: Concepts and Techniques 7
  • 42. Distinct features (OLTP vs. OLAP): • Users and system orientation: • An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. • An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts. • Data contents: • An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. • An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making. September 11, 2020 Data Mining: Concepts and Techniques 8
  • 43. • View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. • In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media. • Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. • Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries. September 11, 2020 Data Mining: Concepts and Techniques 9
  • 44. OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on primary key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100 MB–GB | 100 GB–TB
September 11, 2020 Data Mining: Concepts and Techniques 10
  • 45. Data warehouse architecture September 11, 2020 Data Mining: Concepts and Techniques 23
  • 46. Design of Data Warehouse: A Business Analysis Framework • Four views regarding the design of a data warehouse – Top-down view • allows selection of the relevant information necessary for the data warehouse. This information matches the current and future business needs. – Data source view • exposes the information being captured, stored, and managed by operational systems. This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. – Data warehouse view • consists of fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including precalculated totals and counts, as well as information regarding the source, date, and time of origin, added to provide historical context. – Business query view • sees the perspectives of data in the warehouse from the viewpoint of the end user September 11, 2020 Data Mining: Concepts and Techniques 24
  • 47. September 11, 2020 Data Mining: Concepts and Techniques 26 Data Warehouse: A Multi-Tiered Architecture (Figure: operational DBs and other sources feed the bottom tier — the data warehouse server — via extract/transform/load/refresh, with monitoring & administration, a metadata repository, and data marts; the middle tier is an OLAP server; the top tier holds front-end tools for analysis, queries, reports, and data mining.)
  • 48. September 11, 2020 Data Mining: Concepts and Techniques 27 ◼ 1. The bottom tier is a warehouse database server that is almost always a relational database system. ◼ Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). ◼ These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. ◼ This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
  • 49. September 11, 2020 Data Mining: Concepts and Techniques 28 2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations. 3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
  • 50. Three Data Warehouse Models • From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse. • Enterprise warehouse – collects all of the information about subjects spanning the entire organization • Data mart – a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart • Independent vs. dependent (sourced directly from the warehouse) data marts • Virtual warehouse – A set of views over operational databases – Only some of the possible summary views may be materialized September 11, 2020 Data Mining: Concepts and Techniques 29
  • 51. Data Warehouse Back-End Tools and Utilities • Data extraction – get data from multiple, heterogeneous, and external sources • Data cleaning – detect errors in the data and rectify them when possible • Data transformation – convert data from legacy or host format to warehouse format • Load – sort, summarize, consolidate, compute views, check integrity, and build indices and partitions • Refresh – propagate the updates from the data sources to the warehouse September 11, 2020 Data Mining: Concepts and Techniques 30
  • 52. Metadata Repository • Metadata is the data defining warehouse objects. It stores: • Description of the structure of the data warehouse – schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents • Operational metadata – data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) • The algorithms used for summarization • The mapping from the operational environment to the data warehouse • Data related to system performance – warehouse schema, view and derived data definitions • Business data – business terms and definitions, ownership of data, charging policies September 11, 2020 Data Mining: Concepts and Techniques 31
  • 53. Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP • OLAP servers present business users with multidimensional data from data warehouses or data marts, without concerns regarding how or where the data are stored. • The physical architecture and implementation of OLAP servers must consider data storage issues. Implementations of a warehouse server for OLAP processing include the following: • Relational OLAP (ROLAP) servers: These are the intermediate servers that stand between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP technology tends to have greater scalability than MOLAP technology. September 11, 2020 Data Mining: Concepts and Techniques 32
  • 54. • Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. • Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. • Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases, some database system vendors implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment. September 11, 2020 Data Mining: Concepts and Techniques 33
  • 55. Data Warehouse Implementation • Data warehouses contain huge volumes of data. • OLAP servers demand that decision support queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques. September 11, 2020 Data Mining: Concepts and Techniques 34
  • 56. Data Warehouse Usage • Three kinds of data warehouse applications – Information processing • supports querying, basic statistical analysis, and reporting using tables, charts and graphs – Analytical processing • multidimensional analysis of data warehouse data • supports basic OLAP operations – Data mining • knowledge discovery from hidden patterns • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools September 11, 2020 Data Mining: Concepts and Techniques 35
  • 57. September 11, 2020 Data Mining: Concepts and Techniques 1 What Is Frequent Pattern Analysis? ◼ Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set ◼ First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and association rule mining ◼ Motivation: Finding inherent regularities in data ◼ What products were often purchased together?— Beer and diapers?! ◼ What are the subsequent purchases after buying a PC? ◼ What kinds of DNA are sensitive to this new drug? ◼ Can we automatically classify web documents? ◼ Applications ◼ Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
  • 58. September 11, 2020 Data Mining: Concepts and Techniques 2 Why Is Freq. Pattern Mining Important? ◼ Discloses an intrinsic and important property of data sets ◼ Forms the foundation for many essential data mining tasks ◼ Association, correlation, and causality analysis ◼ Sequential, structural (e.g., sub-graph) patterns ◼ Pattern analysis in spatiotemporal, multimedia, time- series, and stream data ◼ Classification: associative classification ◼ Cluster analysis: frequent pattern-based clustering ◼ Broad applications
  • 59. September 11, 2020 Data Mining: Concepts and Techniques 3 Basic Concepts: Frequent Patterns and Association Rules ◼ Itemset X = {x1, …, xk} ◼ Find all the rules X → Y with minimum support and confidence ◼ support, s, probability that a transaction contains X ∪ Y ◼ confidence, c, conditional probability that a transaction having X also contains Y
Example (let supmin = 50%, confmin = 50%):
Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A → D (60%, 100%); D → A (60%, 75%)
(Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both.)
  • 60. Association rule mining is a two-step process: 1. Find all frequent itemsets: each of these itemsets will occur at least as frequently as a predetermined minimum support count 2. Generate strong association rules from the frequent itemsets: these rules must satisfy minimum support and minimum confidence. September 11, 2020 Data Mining: Concepts and Techniques 4
  • 61. September 11, 2020 Data Mining: Concepts and Techniques 5 Apriori: Finding Frequent Itemsets Using Candidate Generation ◼ Apriori property: “All non-empty subsets of a frequent itemset must also be frequent” ◼ Apriori pruning principle: If there is any itemset which is infrequent, its supersets should not be generated/tested! ◼ Method: ◼ Initially, scan DB once to get the frequent 1-itemsets ◼ Generate length-(k+1) candidate itemsets from length-k frequent itemsets ◼ Test the candidates against the DB ◼ Terminate when no frequent or candidate set can be generated
  • 62. September 11, 2020 Data Mining: Concepts and Techniques 6 The Apriori Algorithm—An Example (supmin = 2)
Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E
1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1: {A}:2, {B}:3, {C}:3, {E}:3
C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
  • 63. September 11, 2020 Data Mining: Concepts and Techniques 7 The Apriori Algorithm — Example
Database D:
TID | Items
100 | 1, 3, 4
200 | 2, 3, 5
300 | 1, 2, 3, 5
400 | 2, 5
Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → L1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
Scan D → C2 counts: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2 → L2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
C3: {2,3,5}
Scan D → L3: {2,3,5}:2
  • 64. September 11, 2020 Data Mining: Concepts and Techniques 8 The Apriori Algorithm ◼ Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
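The pseudocode above can be turned into a short, runnable sketch. The following is a minimal Python implementation for illustration (not the book's code); it follows the level-wise generate-and-test scheme with Apriori pruning:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: scan the DB once for frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: self-join Lk, then apply Apriori pruning --
        # keep a (k+1)-candidate only if every k-subset is frequent
        items = list(frequent)
        candidates = set()
        for a, b in combinations(items, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Support counting: one scan over the database per level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The example database TDB above, with min_sup = 2
tdb = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
print(apriori(tdb, min_support=2))  # includes frozenset({'B','C','E'}): 2
```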
  • 65. September 11, 2020 Data Mining: Concepts and Techniques 9 Important Details of Apriori ◼ How to generate candidates? ◼ Step 1: self-joining Lk ◼ Step 2: pruning ◼ How to count supports of candidates? ◼ Example of Candidate-generation ◼ L3={abc, abd, acd, ace, bcd} ◼ Self-joining: L3*L3 ◼ abcd from abc and abd ◼ acde from acd and ace ◼ Pruning: ◼ acde is removed because ade is not in L3 ◼ C4={abcd}
  • 66. Generating Association Rules from Frequent Itemsets ◼ Once the frequent item sets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). ◼ This can be done using the following equation for confidence. September 11, 2020 Data Mining: Concepts and Techniques 11
  • 67. ◼ Confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A) ◼ The conditional probability is expressed in terms of itemset support count, where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A. ◼ Based on this equation, association rules can be generated as follows: September 11, 2020 Data Mining: Concepts and Techniques 12
  • 68. ◼ For each frequent itemset l, generate all nonempty subsets of l. ◼ For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold. ◼ Because the rules are generated from frequent itemsets, each one automatically satisfies minimum support. September 11, 2020 Data Mining: Concepts and Techniques 13
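As a short continuation of the Apriori sketch above, here is a minimal rule-generation step following exactly this procedure (`apriori` and `tdb` refer to the earlier illustrative sketch):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Emit strong rules s => (l - s) from a dict {frozenset: support_count}.

    confidence(s => l - s) = support_count(l) / support_count(s); every
    nonempty subset of a frequent itemset is itself frequent, so the
    lookup frequent[s] always succeeds.
    """
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = count_l / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

freq = apriori(tdb, min_support=2)  # from the sketch above
for lhs, rhs, conf in generate_rules(freq, min_conf=0.7):
    print(f"{lhs} => {rhs}  (confidence {conf:.0%})")
```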
  • 69. September 11, 2020 Data Mining: Concepts and Techniques 14 Challenges of Frequent Pattern Mining ◼ Challenges ◼ Multiple scans of transaction database ◼ Huge number of candidates ◼ Tedious workload of support counting for candidates ◼ Improving Apriori: general ideas ◼ Reduce passes of transaction database scans ◼ Shrink number of candidates ◼ Facilitate support counting of candidates
  • 70. September 11, 2020 Data Mining: Concepts and Techniques 15 Mining Frequent Patterns Without Candidate Generation
  • 71. September 11, 2020 Data Mining: Concepts and Techniques 16 Construct FP-tree from a Transaction Database (min_support = 3)
TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree
Header table (item : frequency, with node-links into the tree): f:4, c:4, a:3, b:3, m:3, p:3
(Figure: the resulting FP-tree — root {} with branches f:4 → c:3 → a:3 → {m:2 → p:2; b:1 → m:1}, f:4 → b:1, and c:1 → b:1 → p:1.)
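For illustration, here is a compact Python sketch of the three construction steps (tree building only; mining the tree is the FP-growth recursion described on the following slides). The node and function names are our own, and tie order among equally frequent items is arbitrary, so the printed f-list may differ from the slide's f-c-a-b-m-p:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count items, keep frequent ones, build the f-list
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}
    flist = sorted(freq, key=lambda i: -freq[i])  # descending frequency
    rank = {item: r for r, item in enumerate(flist)}
    # Scan 2: insert each transaction's frequent items in f-list order,
    # sharing prefixes and incrementing counts along the way
    root = FPNode(None, None)
    root.count = 0
    header = {item: [] for item in flist}  # header table: item -> node-links
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fp_tree(db, min_support=3)
print("f-list:", flist)  # descending frequency; ties in arbitrary order
```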
  • 72. September 11, 2020 Data Mining: Concepts and Techniques 17 Partition Patterns and Databases ◼ Frequent patterns can be partitioned into subsets according to the f-list ◼ F-list = f-c-a-b-m-p ◼ Patterns containing p ◼ Patterns having m but no p ◼ … ◼ Patterns having c but no a, b, m, or p ◼ Pattern f ◼ Completeness and non-redundancy
  • 73. September 11, 2020 Data Mining: Concepts and Techniques 18 Find Patterns Having p From p's Conditional Database ◼ Starting at the frequent-item header table in the FP-tree ◼ Traverse the FP-tree by following the node-links of each frequent item p ◼ Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
Conditional pattern bases:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
  • 74. September 11, 2020 Data Mining: Concepts and Techniques 19 From Conditional Pattern Bases to Conditional FP-trees ◼ For each pattern base ◼ Accumulate the count for each item in the base ◼ Construct the FP-tree for the frequent items of the pattern base
Example: m-conditional pattern base: fca:2, fcab:1 → m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped because its count 1 < min_support). All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
  • 75. September 11, 2020 Data Mining: Concepts and Techniques 20 Mining Frequent Patterns With FP-trees ◼ Idea: Frequent pattern growth ◼ Recursively grow frequent patterns by pattern and database partition ◼ Method ◼ For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree ◼ Repeat the process on each newly created conditional FP-tree ◼ Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
  • 76. September 11, 2020 Data Mining: Concepts and Techniques 21 Why Is FP-Growth the Winner? ◼ Divide-and-conquer: ◼ decompose both the mining task and DB according to the frequent patterns obtained so far ◼ leads to focused search of smaller databases ◼ Other factors ◼ no candidate generation, no candidate test ◼ compressed database: FP-tree structure ◼ no repeated scan of entire database ◼ basic ops—counting local freq items and building sub FP-tree, no pattern search and matching
  • 79. C3=L2*L2 • {I1,I2,I3} • {I1,I2,I5} • {I1,I3,I5} • {I2,I3,I4} • {I2,I3,I5} • {I2,I4,I5}
  • 80. • C4 = L3*L3 • {I1,I2,I3,I5} is pruned because its subset {I2,I3,I5} is not frequent. • So C4 = ∅ and the algorithm terminates, having found all the frequent itemsets.
  • 83. September 11, 2020 Data Mining: Concepts and Techniques 1 ◼ Classification ◼ predicts categorical class labels ◼ classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data ◼ Prediction ◼ models continuous-valued functions, i.e., predicts unknown or missing values ◼ Typical applications ◼ Credit approval ◼ Target marketing ◼ Medical diagnosis ◼ Fraud detection Classification vs. Prediction
  • 84. September 11, 2020 Data Mining: Concepts and Techniques 2 Supervised vs. Unsupervised Learning ◼ Supervised learning (classification) ◼ Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations ◼ New data is classified based on the training set ◼ Unsupervised learning (clustering) ◼ The class labels of the training data are unknown ◼ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
  • 85. September 11, 2020 Data Mining: Concepts and Techniques 3 Classification—A Two-Step Process ◼ Model construction: describing a set of predetermined classes ◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute ◼ The set of tuples used for model construction is the training set ◼ The model is represented as classification rules, decision trees, or mathematical formulae ◼ Model usage: for classifying future or unknown objects ◼ Estimate the accuracy of the model ◼ The known label of each test sample is compared with the classified result from the model ◼ Accuracy rate is the percentage of test set samples that are correctly classified by the model ◼ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
  • 86. September 11, 2020 Data Mining: Concepts and Techniques 4 Process (1): Model Construction
Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classification algorithms produce the classifier (model), e.g.: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
  • 87. September 11, 2020 Data Mining: Concepts and Techniques 5 Process (2): Using the Model in Prediction
Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Unseen data: (Jeff, Professor, 4) → Tenured?
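As a hedged illustration of the two-step process on this toy data (the slides do not prescribe a library; this sketch assumes scikit-learn and pandas are available):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Step 1 -- model construction from the training data above
train = pd.DataFrame({
    "rank":    ["Assistant Prof", "Assistant Prof", "Professor",
                "Associate Prof", "Assistant Prof", "Associate Prof"],
    "years":   [3, 7, 2, 7, 6, 3],
    "tenured": ["no", "yes", "yes", "yes", "no", "no"],
})
X = pd.get_dummies(train[["rank", "years"]])  # one-hot encode the rank
model = DecisionTreeClassifier().fit(X, train["tenured"])

# Step 2 -- model usage: classify the unseen tuple (Jeff, Professor, 4)
jeff = pd.DataFrame({"rank": ["Professor"], "years": [4]})
jeff_X = pd.get_dummies(jeff).reindex(columns=X.columns, fill_value=0)
print(model.predict(jeff_X))  # expected: ['yes'] under the rule learned here
```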
  • 88. September 11, 2020 Data Mining: Concepts and Techniques 8 Decision Tree Induction: Training Dataset
age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31…40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31…40 | medium | no | excellent | yes
31…40 | high | yes | fair | yes
>40 | medium | no | excellent | no
This follows an example (playing tennis). Decision tree induction is the learning of a decision tree from class-labeled training tuples.
  • 89. September 11, 2020 Data Mining: Concepts and Techniques 9 Output: A Decision Tree for “buys_computer”
age?
├── <=30 → student? (no → no; yes → yes)
├── 31..40 → yes
└── >40 → credit_rating? (excellent → no; fair → yes)
  • 90. September 11, 2020 Data Mining: Concepts and Techniques 10 Algorithm for Decision Tree Induction ◼ Basic algorithm (a greedy algorithm) ◼ Tree is constructed in a top-down recursive divide-and-conquer manner ◼ At start, all the training examples are at the root ◼ Attributes are categorical (if continuous-valued, they are discretized in advance) ◼ Examples are partitioned recursively based on selected attributes ◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) ◼ Conditions for stopping partitioning ◼ All samples for a given node belong to the same class ◼ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf ◼ There are no samples left
  • 91. September 11, 2020 Data Mining: Concepts and Techniques 11 Rule Extraction from a Decision Tree ◼ Rules are easier to understand than large trees ◼ One rule is created for each path from the root to a leaf ◼ Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction ◼ Rules are mutually exclusive and exhaustive ◼ Example: rule extraction from our buys_computer decision tree:
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
  • 92. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 1 CSE5230 Tutorial: The ID3 Decision Tree Algorithm MONASH UNIVERSITY Faculty of Information Technology CSE5230 Data Mining Semester 2, 2004 The aim of this exercise is to learn how one famous algorithm for constructing decision trees, ID3, works. You will do this by building a decision tree by hand for a small dataset. At the end of this exercise you should understand how ID3 constructs a decision tree using the concept of Information Gain. You will be able to use the decision tree you create to make a decision about new data. 1 The weather data The weather problem is a toy data set which we will use to understand how a decision tree is built. It comes from Quinlan (1986), a paper which discusses the ID3 algorithm introduced in Quinlan (1979). It is reproduced with slight modifications in Witten and Frank (1999), and concerns the conditions under which some hypothetical outdoor game may be played. The data is shown in Table 1.
outlook | temperature | humidity | windy | play
sunny | hot | high | false | no
sunny | hot | high | true | no
overcast | hot | high | false | yes
rainy | mild | high | false | yes
rainy | cool | normal | false | yes
rainy | cool | normal | true | no
overcast | cool | normal | true | yes
sunny | mild | high | false | no
sunny | cool | normal | false | yes
rainy | mild | normal | false | yes
sunny | mild | normal | true | yes
overcast | mild | high | true | yes
overcast | hot | normal | false | yes
rainy | mild | high | true | no
Table 1: The weather data (Witten and Frank; 1999, p. 9). In this dataset, there are five categorical attributes: outlook, temperature, humidity, windy, and play. We are interested in building a system which will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature, David McG. Squire, August 26, 2004
  • 93. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 2 humidity, and windy. We can think of the attribute we wish to predict, i.e. play, as the output attribute, and the other attributes as input attributes. 2 Building a decision tree using the ID3 algorithm As we saw in lecture 6, a decision tree consists of nodes and arcs which connect nodes. To make a decision, one starts at the root node, and asks questions to determine which arc to follow, until one reaches a leaf node and the decision is made. This basic structure is shown in Figure 1. (Figure 1: Basic decision tree structure — a root node connected by arcs to internal nodes and leaf nodes.) The main ideas behind the ID3 algorithm are: Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the input attributes are described by the path from the root node to that leaf node. In a “good” decision tree, each non-leaf node should correspond to the input attribute which is the most informative about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node. This is because we would like to predict the output attribute using the smallest possible number of questions on average. David McG. Squire, August 26, 2004
  • 94. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 3 Entropy is used to determine how informative a particular input attribute is about the output attribute for a subset of the training data. Entropy is a measure of uncertainty in communication systems introduced by Shannon (1948). It is fundamental in modern information theory. 2.1 Entropy In information theory, entropy is a measure of the uncertainty about a source of messages. The more uncertain a receiver is about a source of messages, the more information that receiver will need in order to know what message has been sent. For example, if a message source always sends exactly the same message, the receiver does not need any information to know what message has been sent—it’s always the same! The entropy of such a source is zero: there is no uncertainty at all. On the other hand, if a source can send $n$ possible messages and each message occurs independently of the preceding message with equal probability, then the uncertainty of the receiver is maximised. The receiver will need to ask $\log_2 n$ yes/no questions to determine which message has been sent, i.e. the receiver needs $\log_2 n$ bits of information. Question 1 Why does the receiver need $\log_2 n$ bits? Think about representing each message as a binary number. The average number of bits required to identify each message is a measure of the receiver’s uncertainty about the source, and is known as the entropy of the source. Consider a source which can produce messages $m_1, \ldots, m_n$. All messages are produced independently of each other, and the probability of producing message $m_i$ is $p_i$. For such a source with a message probability distribution $P = (p_1, \ldots, p_n)$, the entropy $H(P)$ is
$$H(P) = -\sum_{i=1}^{n} p_i \log p_i \qquad (1)$$
In the equation above, $\log$ means $\log_2$—from now on we will assume that all logarithms are to the base two. If a set $S$ of records from a database (i.e. the training set for building the decision tree) is partitioned into $k$ classes $C_1, C_2, \ldots, C_k$ on the basis of the output attribute, then the average amount of information (measured in bits) needed to identify the class of a record is $H(P_C)$, where $P_C$ is the probability distribution of the classes, estimated from the data as
$$P_C = \left( \frac{|C_1|}{|S|}, \frac{|C_2|}{|S|}, \ldots, \frac{|C_k|}{|S|} \right) \qquad (2)$$
The notation $|C_i|$ means the number of elements in set $C_i$. For the weather data, where play is the output attribute, we have for the entire dataset $S$: $P_C = (5/14, 9/14)$, where class $C_1$ corresponds to “no”, and class $C_2$ to “yes”. Using Equation 1, the entropy of set $S$ is thus
$$H(S) = H(P_C) = -\left( \tfrac{5}{14} \log \tfrac{5}{14} + \tfrac{9}{14} \log \tfrac{9}{14} \right) = 0.940 \text{ bits}$$
David McG. Squire, August 26, 2004
  • 95. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 4 Note that here we identify the entropy of the set, $H(S)$, with the entropy of the probability distribution of the members of the set, $H(P_C)$. 2.2 Information gain Now consider what happens if we partition the set on the basis of an input attribute $X$ into subsets $S_1, \ldots, S_m$. The information needed to identify the class of an element of $S$ is the weighted average of the information needed to identify the class of an element of each subset:
$$H(X, S) = \sum_{i=1}^{m} \frac{|S_i|}{|S|} H(S_i) \qquad (3)$$
In the context of building a decision tree, we are interested in how much information about the output attribute can be gained by knowing the value of an input attribute $X$. This is just the difference between the information needed to classify an element of $S$ before knowing the value of $X$, $H(S)$, and the information needed after partitioning the dataset $S$ on the basis of knowing the value of $X$, $H(X, S)$. We define the information gain due to attribute $X$ for set $S$ as¹
$$Gain(X, S) = H(S) - H(X, S) \qquad (4)$$
In order to decide which attribute to split upon, the ID3 algorithm computes the information gain for each attribute, and selects the one with the highest gain. As an example, let us consider the information gain for attribute temperature in the weather data. The attribute temperature can have three values, cool, mild and hot. Partitioning on the basis of temperature will thus divide $S$ into three subsets. This reorganisation of the weather data is shown in Table 2. We have $|S_{cool}| = 4$, $|S_{mild}| = 6$ and $|S_{hot}| = 4$. We can calculate the information needed to classify an element of $S$ after this partitioning using Equation 3:
$$H(\text{temperature}, S) = \tfrac{4}{14} H(S_{cool}) + \tfrac{6}{14} H(S_{mild}) + \tfrac{4}{14} H(S_{hot}) = \tfrac{4}{14} \times 0.811 + \tfrac{6}{14} \times 0.918 + \tfrac{4}{14} \times 1.000 = 0.911 \text{ bits}$$
The information gain for attribute temperature for set $S$ is thus
$$Gain(\text{temperature}, S) = 0.940 - 0.911 = 0.029 \text{ bits}$$
¹ Note that in information theory $H(S) - H(X, S)$ is known as the mutual information between the input attribute $X$ and the output attribute. David McG. Squire, August 26, 2004
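The entropy and information-gain calculations above are easy to reproduce in a few lines of Python. This sketch is our own illustration, not part of the tutorial; it recomputes H(S) and Gain(temperature, S), and can be reused for the Question 2 attributes:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target="play"):
    """Gain(X, S) = H(S) - sum_i |S_i|/|S| * H(S_i), partitioning S on X."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# The weather data from Table 1 as (outlook, temperature, humidity, windy, play)
raw = [
    ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no"),
]
cols = ("outlook", "temperature", "humidity", "windy", "play")
rows = [dict(zip(cols, r)) for r in raw]
print(round(entropy([r["play"] for r in rows]), 3))  # 0.94 bits
print(round(gain(rows, "temperature"), 3))           # 0.029 bits
```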
  • 96. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 5
outlook | temperature | humidity | windy | play
rainy | cool | normal | false | yes
rainy | cool | normal | true | no
overcast | cool | normal | true | yes
sunny | cool | normal | false | yes
rainy | mild | high | false | yes
sunny | mild | high | false | no
rainy | mild | normal | false | yes
sunny | mild | normal | true | yes
overcast | mild | high | true | yes
rainy | mild | high | true | no
sunny | hot | high | false | no
sunny | hot | high | true | no
overcast | hot | high | false | yes
overcast | hot | normal | false | yes
Table 2: The weather data partitioned on the basis of attribute temperature. Question 2 Calculate the information gain for the other input attributes, outlook, humidity, and windy. You may use whatever tools you like to do the calculation. Note the following useful identity: $\log_2 x = \ln x / \ln 2$. An entropy calculator has been created to help you with this, and subsequent, questions. You can find it at: http://www.csse.monash.edu.au/~davids/cgi-bin/cse5230/entropy_calc.cgi. Question 3 Given your answer to the question above, what attribute would ID3 select to split upon for set $S$, i.e. the complete weather data from Table 1? 2.3 The ID3 algorithm The ID3 algorithm works by recursively applying the procedure above to each of the subsets produced until “pure” nodes are found—a pure node contains elements of only one class—or until there are no attributes left to consider. It can be stated in pseudocode, as is shown in Figure 2. Question 4 Use the ID3 algorithm given in Figure 2 to construct a decision tree for the weather data by hand. The first call to ID3() uses the entire set of input attributes and the entire set of training data. Question 5 Consider a new data element with input attributes …. How is it classified by the tree you constructed above? David McG. Squire, August 26, 2004
  • 97. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 6
function ID3 (I, O, T) {
    /* I is the set of input attributes
     * O is the output attribute
     * T is a set of training data
     *
     * function ID3 returns a decision tree */
    if (T is empty) {
        return a single node with the value Failure;
    }
    if (all records in T have the same value for O) {
        return a single node with that value;
    }
    if (I is empty) {
        return a single node with the value of the most frequent value of O in T;
        /* Note: some elements in this node will be incorrectly classified */
    }
    /* now handle the case where we can't return a single node */
    compute the information gain for each attribute in I relative to T;
    let X be the attribute with largest Gain(X, T) of the attributes in I;
    let {x_j | j = 1, 2, .., m} be the values of X;
    let {T_j | j = 1, 2, .., m} be the subsets of T when T is partitioned according to the value of X;
    return a tree with the root node labelled X and arcs labelled x_1, x_2, .., x_m,
        where the arcs go to the trees ID3(I-{X}, O, T_1), ID3(I-{X}, O, T_2), .., ID3(I-{X}, O, T_m);
}
Figure 2: The ID3 algorithm 2.4 The problem of attributes with many values The simple ID3 algorithm above can have difficulties when an input attribute has many possible values, because $Gain(X, S)$ tends to favour attributes which have a large number of values. It is easy to understand why if we consider an extreme case. Imagine that our dataset $S$ contains an attribute that has a different value for every element of $S$. This could arise in practice if a unique record ID was retained when extracting $S$ from a database—for example a patient ID number in a hospital database. Such an attribute would give the maximum possible information gain, since all the training data can be correctly classified by examining its value. It would result in a decision tree in which all nodes below the root node were leaf nodes. This tree, however, would be completely useless for classifying new data: there would be no arc corresponding to the value David McG. Squire, August 26, 2004
  • 98. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 7 of the ID attribute. Moreover, there is no reason to suspect that such an attribute would have any causal relationship with the output attribute. The problem also arises when an attribute can take on many values, even if they are not unique to each element. Quinlan (1986) suggests a solution based on considering the amount of information required to determine the value of an attribute $X$ for a set $S$. This is given by $H(P_X)$, where $P_X$ is the probability distribution of the values of $X$:
$$P_X = \left( \frac{|S_1|}{|S|}, \frac{|S_2|}{|S|}, \ldots, \frac{|S_m|}{|S|} \right) \qquad (5)$$
The quantity $H(P_X)$ is known as the split information for attribute $X$ and set $S$. We will call it $SplitInfo(X, S) = H(P_X)$. For the weather data, the split information for the attribute temperature is
$$SplitInfo(\text{temperature}, S) = -\left( \tfrac{4}{14} \log \tfrac{4}{14} + \tfrac{6}{14} \log \tfrac{6}{14} + \tfrac{4}{14} \log \tfrac{4}{14} \right) = 1.557 \text{ bits}$$
Quinlan (1986) suggests that rather than choosing the attribute with the biggest $Gain(X, S)$, we select the one with the biggest $GainRatio(X, S)$, where
$$GainRatio(X, S) = \frac{Gain(X, S)}{SplitInfo(X, S)} \qquad (6)$$
Note that $GainRatio(X, S)$ might not always be defined. Quinlan (1986) specifies a gain ratio criterion, which says that we should select the attribute with the highest $GainRatio(X, S)$ from amongst those attributes with average-or-better $Gain(X, S)$. If you have time, attempt the following questions. Question 6 Under what circumstances is $SplitInfo(X, S)$ equal to 0? Question 7 What is $GainRatio(\text{temperature}, S)$? Question 8 What are the gain ratios for the other input attributes? Question 9 Construct a decision tree for the weather data using the gain ratio criterion rather than the information gain. References Quinlan, J. R. (1979). Discovering rules by induction from large collections of examples, in D. Michie (ed.), Expert Systems in the Micro Electronic Age, Edinburgh University Press, pp. 168–201. Quinlan, J. R. (1986). Induction of decision trees, Machine Learning 1(1): 81–106. Shannon, C. E. (1948). A mathematical theory of communication, Bell System Technical Journal 27: 379–423 and 623–656. http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html David McG. Squire, August 26, 2004
  • 99. CSE5230 Tutorial: The ID3 Decision Tree Algorithm 8 Witten, I. H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, CA, USA. http://www.cs.waikato.ac.nz/~ml/weka/book.html David McG. Squire, August 26, 2004
  • 100.  Cluster: a collection of data objects  Similar to one another within the same cluster  Dissimilar to the objects in other clusters  Cluster analysis  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters  Unsupervised learning: no predefined classes  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms September 11, 2020 Data Mining: Concepts and Techniques 1
  • 101.  Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs  Insurance: Identifying groups of motor insurance policy holders with a high average claim cost  City-planning: Identifying groups of houses according to their house type, value, and geographical location September 11, 2020 Data Mining: Concepts and Techniques 2
  • 102.  A good clustering method will produce high quality clusters with  high intra-class similarity  low inter-class similarity  The quality of a clustering result depends on both the similarity measure used by the method and its implementation  The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns September 11, 2020 Data Mining: Concepts and Techniques 3
  • 103.  Ability to deal with different types of attributes  Ability to handle dynamic data  Discovery of clusters with arbitrary shape  Minimal requirements for domain knowledge to determine input parameters  Able to deal with noise and outliers  Insensitive to order of input records  Incorporation of user-specified constraints  Interpretability and usability September 11, 2020 Data Mining: Concepts and Techniques 5
  • 104.  Distances are normally used to measure the similarity or dissimilarity between two data objects  Some popular ones include the Minkowski distance:
$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer  If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
September 11, 2020 Data Mining: Concepts and Techniques 6
  • 105.  If q = 2, d is the Euclidean distance:
$$d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 }$$
 Both the Euclidean and the Manhattan distance satisfy the following mathematical requirements of a distance function  d(i,j) ≥ 0: distance is a non-negative number  d(i,i) = 0: the distance of an object to itself is 0  d(i,j) = d(j,i): distance is a symmetric function  d(i,j) ≤ d(i,k) + d(k,j): going directly from object i to object j in space is no more than making a detour over any other object k (triangle inequality) September 11, 2020 Data Mining: Concepts and Techniques 7
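A tiny Python sketch of these distance functions (illustrative only; the example points are invented):

```python
def minkowski(x, y, q):
    """d(i, j) = (sum_k |x_k - y_k|^q)^(1/q); q=1 is Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))  # Manhattan distance: 7.0
print(minkowski(i, j, 2))  # Euclidean distance: 5.0
```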
  • 106.  Partitioning approach:  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors  Typical methods: k-means, k-medoids, CLARANS  Hierarchical approach:  Create a hierarchical decomposition of the set of data (or objects) using some criterion  Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON September 11, 2020 Data Mining: Concepts and Techniques 8
  • 107.  Partitioning method: Construct a partition of a database D of n objects into a set of k clusters.  Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion  Global optimal: exhaustively enumerate all partitions  Heuristic methods: k-means and k-medoids algorithms  k-means :Each cluster is represented by the center of the cluster  k-medoids or PAM (Partition around medoids) : Each cluster is represented by one of the objects in the cluster September 11, 2020 Data Mining: Concepts and Techniques 9
  • 108.  Given k, the k-means algorithm is implemented in four steps:  Partition objects into k nonempty subsets  Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)  Assign each object to the cluster with the nearest seed point  Go back to Step 2, stop when no more new assignment September 11, 2020 Data Mining: Concepts and Techniques 10
  • 109.  Example September 11, 2020 Data Mining: Concepts and Techniques 11 (Figure: K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means again, repeating until assignments stop changing.)
  • 110.  Each object is distributed to a cluster center to which it is nearest.  Next the cluster centers are updated. That is, mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest.  This process iterates. The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.  When no redistribution of objects in any clusters occur, the process terminates. September 11, 2020 Data Mining: Concepts and Techniques 12
  • 111. Input: k: the number of clusters; D: a dataset containing n objects. Output: a set of k clusters. Method:
1. Arbitrarily choose k objects from D as the initial cluster centers
2. repeat
3. (Re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster
4. Update the cluster means, i.e., calculate the mean value of the objects for each cluster
5. until no change
September 11, 2020 Data Mining: Concepts and Techniques 13
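A minimal runnable sketch of this loop for 2-D points (our own illustration; the sample points and seed are arbitrary):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """k-means following the pseudocode above, for 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily choose k objects as centers
    clusters = []
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda c: (p[0] - centers[c][0]) ** 2
                                    + (p[1] - centers[c][1]) ** 2)
            clusters[idx].append(p)
        # Update the cluster means (keep the old center if a cluster is empty)
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no change: terminate
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```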
  • 112.  Weakness  Applicable only when mean is defined, then what about categorical data?  Need to specify k, the number of clusters, in advance  Unable to handle noisy data and outliers  Not suitable to discover clusters with non-convex shapes September 11, 2020 Data Mining: Concepts and Techniques 14
  • 113.  The k-means algorithm is sensitive to outliers!  Since an object with an extremely large value may substantially distort the distribution of the data.  K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. September 11, 2020 Data Mining: Concepts and Techniques 16
  • 114.  Find representative objects, called medoids, in clusters  PAM (Partitioning Around Medoids)  starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering  PAM works effectively for small data sets, but does not scale well for large data sets. September 11, 2020 Data Mining: Concepts and Techniques 17
  • 115. September 11, 2020 Data Mining: Concepts and Techniques 18 (Figure: PAM with K = 2 — arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random; compute the total cost of swapping a medoid with O_random; if the quality is improved, perform the swap; loop until no change.)
  • 117. September 11, 2020 Data Mining: Concepts and Techniques 20 Four cases of cost function for k-medoid clustering
  • 118.  PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean  PAM works efficiently for small data sets but does not scale well to large data sets ➔ sampling-based method: CLARA (Clustering LARge Applications) September 11, 2020 Data Mining: Concepts and Techniques 22
  • 119.  CLARA draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output (see the sketch below).  Strength: deals with larger data sets than PAM  Weaknesses:  Efficiency depends on the sample size  A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased September 11, 2020 Data Mining: Concepts and Techniques 23
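A hedged sketch of the CLARA idea, reusing `total_cost` and `pam_step` from the PAM sketch above. The sample count and sample size are illustrative defaults (the original proposal used five samples of 40 + 2k objects); each sample is clustered with PAM, and the resulting medoids are scored on the full distance matrix so the best overall clustering wins.

```python
import numpy as np

def clara(D, k, n_samples=5, sample_size=60, seed=0):
    """Cluster via PAM on samples; keep medoids that are best on ALL data."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        sub = D[np.ix_(idx, idx)]              # the sample's distance matrix
        medoids = list(range(k))               # arbitrary initial medoids
        while True:                            # run PAM to convergence
            cost, new = pam_step(sub, medoids)
            if new == medoids:
                break
            medoids = new
        full = [int(idx[m]) for m in medoids]  # map back to the full data set
        full_cost = total_cost(D, full)        # judge on the whole data set
        if full_cost < best_cost:
            best_cost, best_medoids = full_cost, full
    return best_medoids
```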
  • 120. 1 Mining Complex Types of Data ■ Mining Sequence Data ■ Mining Time Series ■ Mining Symbolic Sequences ■ Mining Biological Sequences ■ Mining Graphs and Networks ■ Mining Other Kinds of Data
  • 121. 2 Mining Sequence Data ■ Similarity Search in Time Series Data ■ Subsequence match, dimensionality reduction, query-based similarity search, motif-based similarity search (a subsequence-matching sketch follows) ■ Regression and Trend Analysis in Time-Series Data ■ long-term + cyclic + seasonal variations + random movements ■ Sequential Pattern Mining in Symbolic Sequences ■ GSP, PrefixSpan, constraint-based sequential pattern mining ■ Sequence Classification ■ Feature-based vs. sequence-distance-based vs. model-based
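As one concrete instance of subsequence matching, the sketch below slides a query window over a long series and ranks start offsets by z-normalized Euclidean distance. This particular distance choice is an assumption for illustration, not a method named on the slide.

```python
import numpy as np

def znorm(x):
    # z-normalize so matches are invariant to offset and scale
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def best_matches(series, query, top=3):
    """Return the `top` (distance, start offset) pairs for the query."""
    q = znorm(np.asarray(query, dtype=float))
    m = len(q)
    scored = []
    for i in range(len(series) - m + 1):
        window = znorm(np.asarray(series[i:i + m], dtype=float))
        scored.append((float(np.linalg.norm(window - q)), i))
    return sorted(scored)[:top]
```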
  • 122. 3 Mining Graphs and Networks ■ Graph Pattern Mining ■ Frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph ■ Statistical Modeling of Networks ■ Small world phenomenon, power-law (long-tail) distribution, densification ■ Clustering and Classification of Graphs and Homogeneous Networks ■ Clustering: Fast Modularity vs. SCAN ■ Classification: model vs. pattern-based mining ■ Clustering, Ranking and Classification of Heterogeneous Networks ■ RankClus, RankClass, and meta path-based, user-guided methodology ■ Role Discovery and Link Prediction in Information Networks ■ PathPredict ■ Similarity Search and OLAP in Information Networks: PathSim, GraphCube ■ Evolution of Social and Information Networks: EvoNetClus
  • 123. 4 Mining Other Kinds of Data ■ Mining Spatial Data ■ Spatial frequent/co-located patterns, spatial clustering and classification ■ Mining Spatiotemporal and Moving Object Data ■ Spatiotemporal data mining, trajectory mining, Periodica, Swarm, … ■ Mining Cyber-Physical System Data ■ Applications: healthcare, air-traffic control, flood simulation ■ Mining Multimedia Data ■ Social media data, geo-tagged spatial clustering, periodicity analysis, … ■ Mining Text Data ■ Topic modeling, i-topic model, integration with geo- and networked data ■ Mining Web Data ■ Web content, web structure, and web usage mining ■ Mining Data Streams ■ Dynamics, one-pass, patterns, clustering, classification, outlier detection
  • 124. * Data Mining: Principles and Algorithms 5 Mining Complex Data Objects: Generalization of Structured Data ■ Set-valued attribute ■ Generalization of each value in the set into its corresponding higher-level concepts ■ Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data ■ E.g., hobby = {tennis, hockey, chess, violin, PC_games} generalizes to {sports, music, e_games} ■ List-valued or a sequence-valued attribute ■ Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization
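A minimal sketch of set-valued generalization under a hypothetical concept hierarchy; the mapping below is one plausible reading of the slide's hobby example.

```python
# Hypothetical concept hierarchy: leaf value -> higher-level concept
CONCEPT_HIERARCHY = {
    "tennis": "sports", "hockey": "sports",
    "chess": "e_games", "PC_games": "e_games",
    "violin": "music",
}

def generalize_set(values, hierarchy=CONCEPT_HIERARCHY):
    # map each value up one level and deduplicate; unknowns stay as-is
    return {hierarchy.get(v, v) for v in values}

hobby = {"tennis", "hockey", "chess", "violin", "PC_games"}
print(generalize_set(hobby))   # {'sports', 'e_games', 'music'}
```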
  • 125. * Data Mining: Principles and Algorithms 6 Generalizing Spatial and Multimedia Data ■ Spatial data: ■ Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage ■ Require the merge of a set of geographic areas by spatial operations ■ Image data: ■ Extracted by aggregation and/or approximation ■ Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image ■ Music data: ■ Summarize its melody: based on the approximate patterns that repeatedly occur in the segment ■ Summarize its style: based on its tone, tempo, or the major musical instruments played
  • 126. * Data Mining: Principles and Algorithms 7 Generalizing Object Data ■ Object identifier ■ generalize to the lowest level of class in the class/subclass hierarchies ■ Class composition hierarchies ■ generalize only those closely related in semantics to the current one ■ Construction and mining of object cubes ■ Extend the attribute-oriented induction method ■ Apply a sequence of class-based generalization operators on different attributes ■ Continue until obtaining a small number of generalized objects that can be summarized concisely in high-level terms ■ Implementation ■ Examine each attribute, generalize it to simple-valued data ■ Construct a multidimensional data cube (object cube) ■ Problem: it is not always desirable to generalize a set of values to single-valued data
  • 127. * Data Mining: Principles and Algorithms 8 Ex.: Plan Mining by Divide and Conquer ■ Plan: a sequence of actions ■ E.g., Travel (flight): traveler, departure, arrival, d-time, a-time, airline, price, seat ■ Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans) ■ E.g., Discover travel patterns in an air flight database, or ■ find significant patterns from the sequences of actions in the repair of automobiles ■ Method ■ Attribute-oriented induction on sequence data ■ A generalized travel plan: small-big*-small ■ Divide and conquer: mine characteristics for each subsequence (see the sketch below) ■ E.g., big*: same airline, small-big: nearby region
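The generalization step can be sketched as below: map each airport in a flight sequence to a size class via a hypothetical lookup table, then merge consecutive identical actions, so a route through one small, several big, and one small airport yields the pattern small-big*-small. The airport codes and size table are made up for illustration.

```python
from itertools import groupby

# Hypothetical airport -> size-class table
AIRPORT_SIZE = {"ALB": "small", "JFK": "big", "ORD": "big",
                "LAX": "big", "GRR": "small"}

def generalize_plan(airports):
    sizes = [AIRPORT_SIZE.get(a, "small") for a in airports]
    # merge runs of identical symbols; mark runs longer than one with '*'
    return "-".join(sym + ("*" if len(list(run)) > 1 else "")
                    for sym, run in groupby(sizes))

print(generalize_plan(["ALB", "JFK", "ORD", "LAX", "GRR"]))  # small-big*-small
```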
  • 128. * Data Mining: Principles and Algorithms 9 A Travel Database for Plan Mining ■ Example: Mining a travel planbase Travel plan table Airport info table
  • 129. * Data Mining: Principles and Algorithms 10 Multidimensional Analysis ■ Strategy ■ Generalize the planbase in different directions ■ Look for sequential patterns in the generalized plans ■ Derive high-level plans A multi-D model for the planbase
  • 130. * Data Mining: Principles and Algorithms 11 Multidimensional Generalization Multi-Dimensional generalization of the planbase Merging consecutive, identical actions in plans
  • 131. * Data Mining: Principles and Algorithms 12 What Is a Spatial Database System? ■ Geometric, geographic or spatial data: space-related data ■ Example: Geographic space (2-D abstraction of earth surface), VLSI design, model of human brain, 3-D space representing the arrangement of chains of protein molecules ■ Spatial database system vs. image database systems ■ Image database system: handling digital raster images (e.g., satellite sensing, computer tomography); may also contain techniques for object analysis and extraction from images and some spatial database functionality ■ Spatial (geometric, geographic) database system: handling objects in space that have identity and well-defined extents, locations, and relationships
  • 132. * Data Mining: Principles and Algorithms 13 GIS (Geographic Information System) ■ GIS (Geographic Information System) ■ Analysis and visualization of geographic data ■ Common analysis functions of GIS ■ Search (thematic search, search by region) ■ Location analysis (buffer, corridor, overlay) ■ Terrain analysis (slope/aspect, drainage network) ■ Flow analysis (connectivity, shortest path) ■ Distribution (nearest neighbor, proximity, change detection) ■ Spatial analysis/statistics (pattern, centrality, similarity, topology) ■ Measurements (distance, perimeter, shape, adjacency, direction)
  • 133. * Data Mining: Principles and Algorithms 14 Spatial DBMS (SDBMS) ■ SDBMS is a software system that ■ supports spatial data models, spatial ADTs, and a query language supporting them ■ supports spatial indexing, spatial operations efficiently, and query optimization ■ can work with an underlying DBMS ■ Examples ■ Oracle Spatial Data Cartridge ■ ESRI Spatial Data Engine
  • 134. * Data Mining: Principles and Algorithms 15 Modeling Spatial Objects ■ What needs to be represented? ■ Two important alternative views ■ Single objects: distinct entities arranged in space each of which has its own geometric description ■ modeling cities, forests, rivers ■ Spatially related collection of objects: describe space itself (about every point in space) ■ modeling land use, partition of a country into districts
  • 135. * Data Mining: Principles and Algorithms 16 Modeling Single Objects: Point, Line and Region ■ Point: location only but not extent ■ Line (or a curve, usually represented by a polyline, a sequence of line segments): ■ moving through space, or connections in space (roads, rivers, cables, etc.) ■ Region: ■ Something having extent in 2D-space (country, lake, park). It may have a hole or consist of several disjoint pieces. (Illustrative types are sketched below.)
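Minimal illustrative types for these three abstractions; a sketch only, not a spatial library's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coord = Tuple[float, float]

@dataclass
class Point:                 # location only, no extent
    xy: Coord

@dataclass
class Polyline:              # a curve approximated by line segments
    vertices: List[Coord]

@dataclass
class Region:                # 2-D extent; may have holes or several pieces
    outer: List[Coord]                        # outer boundary ring
    holes: List[List[Coord]] = field(default_factory=list)
```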
  • 136. * Data Mining: Principles and Algorithms 17 Modeling Spatially Related Collections of Objects ■ Modeling spatially related collections of objects: plane partitions and networks ■ A partition: a set of region objects that are required to be disjoint (e.g., a thematic map). There often exist pairs of objects with a common boundary (adjacency relationship). ■ A network: a graph embedded into the plane, consisting of a set of point objects, forming its nodes, and a set of line objects describing the geometry of the edges, e.g., highways, rivers, power supply lines ■ Other interesting spatially related collections of objects: nested partitions, or a digital terrain (elevation) model
  • 137. * Data Mining: Principles and Algorithms 18 Spatial Data Types and Models ■ Field-based model: raster data ■ framework: partitioning of space ■ Object-based model: vector model ■ points, lines, and polygons as objects with attributes
  • 138. * Data Mining: Principles and Algorithms 19 Spatial Data Warehousing ■ Spatial data warehouse: Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository ■ Spatial data integration: a big issue ■ Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.) ■ Vendor-specific formats (ESRI, MapInfo, Integraph, IDRISI, etc.) ■ Geo-specific formats (geographic vs. equal area projection, etc.) ■ Spatial data cube: multidimensional spatial database ■ Both dimensions and measures may contain spatial components
  • 139. * Data Mining: Principles and Algorithms 20 Dimensions and Measures in Spatial Data Warehouse ■ Dimensions ■ non-spatial ■ e.g., “25-30 degrees” generalizes to “hot” (both are strings) ■ spatial-to-nonspatial ■ e.g., Seattle generalizes to description “Pacific Northwest” (as a string) ■ spatial-to-spatial ■ e.g., Seattle generalizes to Pacific Northwest (as a spatial region) ■ Measures ■ numerical (e.g., monthly revenue of a region) ■ distributive (e.g., count, sum) ■ algebraic (e.g., average) ■ holistic (e.g., median, rank) ■ spatial ■ collection of spatial pointers (e.g., pointers to all regions with temperature of 25-30 degrees in July)
  • 140. * Data Mining: Principles and Algorithms 21 Spatial-to-Spatial Generalization ■ Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage ■ Requires the merging of a set of geographic areas by spatial operations: dissolve, merge, clip, intersect, union (a merge sketch follows)
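Assuming the Shapely library is available, merging adjacent land-use polygons into one generalized region is a spatial union; the coordinates below are made up.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

# two adjacent "residential" lots (hypothetical unit squares)
lots = [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])]

residential_area = unary_union(lots)   # merge/dissolve into one region
print(residential_area.area)           # 2.0: the two lots dissolved into one
```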
  • 141. * Data Mining: Principles and Algorithms 22 Example: British Columbia Weather Pattern Analysis ■ Input ■ A map with about 3,000 weather probes scattered in B.C. ■ Daily data for temperature, precipitation, wind velocity, etc. ■ Data warehouse using star schema ■ Output ■ A map that reveals patterns: merged (similar) regions ■ Goals ■ Interactive analysis (drill-down, slice, dice, pivot, roll-up) ■ Fast response time ■ Minimizing storage space used ■ Challenge ■ A merged region may contain hundreds of “primitive” regions (polygons)
  • 142. * Data Mining: Principles and Algorithms 23 Star Schema of the BC Weather Warehouse ■ Spatial data warehouse ■ Dimensions ■ region_name ■ time ■ temperature ■ precipitation ■ Measurements ■ region_map ■ area ■ count Fact table Dimension table
  • 143. * Data Mining: Principles and Algorithms 24 Multidimensional Analysis of Multimedia Data ■ Multimedia data cube ■ Design and construction similar to that of traditional data cubes from relational data ■ Contain additional dimensions and measures for multimedia information, such as color, texture, and shape ■ The database does not store images but their descriptors ■ Feature descriptor: a set of vectors for each visual characteristic ■ Color vector: contains the color histogram ■ MFC (Most Frequent Color) vector: five color centroids ■ MFO (Most Frequent Orientation) vector: five edge orientation centroids ■ Layout descriptor: contains a color layout vector and an edge layout vector
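A sketch of such a descriptor: a coarse, normalized RGB histogram plus the indices of the five dominant bins as a stand-in for the MFC vector. The slide's MFC uses five color centroids; taking the top histogram bins is a simplification for illustration.

```python
import numpy as np

def color_descriptor(img, bins=4):
    """img: H x W x 3 uint8 array; returns (histogram, top-5 bin indices)."""
    q = (img.astype(int) * bins) // 256         # quantize channels to 0..bins-1
    codes = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(float)
    hist /= hist.sum()                          # normalized color histogram
    mfc = hist.argsort()[::-1][:5]              # five most frequent color bins
    return hist, mfc

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
hist, mfc = color_descriptor(img)
```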
  • 144. * Data Mining: Principles and Algorithms 25 Multi-Dimensional Search in Multimedia Databases
  • 145. * Data Mining: Principles and Algorithms 26 Multi-Dimensional Analysis in Multimedia Databases [Figure: color histogram and texture layout features]
  • 146. * Data Mining: Principles and Algorithms 27 Refining or combining searches Search for “blue sky” (top layout grid is blue) Search for “blue sky and green meadows” (top layout grid is blue and bottom is green) Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”) Mining Multimedia Databases
  • 147. * Data Mining: Principles and Algorithms 28 Mining Multimedia Databases: The Data Cube and the Sub-Space Measurements [Figure: a multimedia data cube with dimensions colour (red, white, blue), format (GIF, JPEG), and size (small, medium, large, very large), shown as group-bys and cross-tabs with sum measures] • Format of image • Duration • Colors • Textures • Keywords • Size • Width • Height • Internet domain of image • Internet domain of parent pages • Image popularity
  • 148. * Data Mining: Principles and Algorithms 29 Mining Multimedia Databases in
  • 149. * Data Mining: Principles and Algorithms 30 Classification in MultiMediaMiner
  • 150. * Data Mining: Principles and Algorithms 31 Mining Multimedia Databases ■ Spatial relationships from layout: property P1 next-to property P2; property P1 on-top-of property P2 ■ Different resolution hierarchy
  • 151. * Data Mining: Principles and Algorithms 32 Mining Multimedia Databases ■ From coarse to fine resolution mining
  • 152. * Data Mining: Concepts and Techniques 33 Data Mining Applications ■ Data mining is an interdisciplinary field with wide and diverse applications ■ There exist nontrivial gaps between data mining principles and domain-specific applications ■ Some application domains ■ Financial data analysis ■ Retail industry ■ Telecommunication industry ■ Biological data analysis
  • 153. * Data Mining: Concepts and Techniques 34 Data Mining for Financial Data Analysis ■ Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality ■ Design and construction of data warehouses for multidimensional data analysis and data mining ■ View the debt and revenue changes by month, by region, by sector, and by other factors ■ Access statistical information such as max, min, total, average, trend, etc. ■ Loan payment prediction/consumer credit policy analysis ■ feature selection and attribute relevance ranking ■ Loan payment performance ■ Consumer credit rating
  • 154. * Data Mining: Concepts and Techniques 35 Financial Data Mining ■ Classification and clustering of customers for targeted marketing ■ multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group ■ Detection of money laundering and other financial crimes ■ integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs) ■ Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)
  • 155. * Data Mining: Concepts and Techniques 36 Data Mining for Retail Industry ■ Retail industry: huge amounts of data on sales, customer shopping history, etc. ■ Applications of retail data mining ■ Identify customer buying behaviors ■ Discover customer shopping patterns and trends ■ Improve the quality of customer service ■ Achieve better customer retention and satisfaction ■ Enhance goods consumption ratios ■ Design more effective goods transportation and distribution policies
  • 156. * Data Mining: Concepts and Techniques 37 Data Mining in Retail Industry (2) ■ Ex. 1. Design and construction of data warehouses based on the benefits of data mining ■ Multidimensional analysis of sales, customers, products, time, and region ■ Ex. 2. Analysis of the effectiveness of sales campaigns ■ Ex. 3. Customer retention: Analysis of customer loyalty ■ Use customer loyalty card information to register sequences of purchases of particular customers ■ Use sequential pattern mining to investigate changes in customer consumption or loyalty ■ Suggest adjustments on the pricing and variety of goods ■ Ex. 4. Purchase recommendation and cross-reference of items
  • 157. * Data Mining: Concepts and Techniques 38 Data Mining for Telecomm. Industry (1) ■ A rapidly expanding and highly competitive industry with a great demand for data mining ■ Understand the business involved ■ Identify telecommunication patterns ■ Catch fraudulent activities ■ Make better use of resources ■ Improve the quality of service ■ Multidimensional analysis of telecommunication data ■ Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.
  • 158. * Data Mining: Concepts and Techniques 39 Data Mining for Telecomm. Industry (2) ■ Fraudulent pattern analysis and the identification of unusual patterns ■ Identify potentially fraudulent users and their atypical usage patterns ■ Detect attempts to gain fraudulent entry to customer accounts ■ Discover unusual patterns which may need special attention ■ Multidimensional association and sequential pattern analysis ■ Find usage patterns for a set of communication services by customer group, by month, etc. ■ Promote the sales of specific services ■ Improve the availability of particular services in a region
  • 159. * Data Mining: Concepts and Techniques 40 Biomedical Data Analysis ■ DNA sequences ■ Gene: a sequence of hundreds of individual nucleotides arranged in a particular order ■ Humans have around 30,000 genes ■ Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes ■ Semantic integration of heterogeneous, distributed genome databases ■ Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  • 160. * Data Mining: Concepts and Techniques 41 DNA Analysis: Examples ■ Similarity search and comparison among DNA sequences ■ Compare the frequently occurring patterns of each class (e.g., diseased and healthy) ■ Identify gene sequence patterns that play roles in various diseases ■ Association analysis: identification of co-occurring gene sequences ■ Most diseases are not triggered by a single gene but by a combination of genes acting together ■ Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples ■ Path analysis: linking genes to different disease development stages ■ Different genes may become active at different stages of the disease ■ Develop pharmaceutical interventions that target the different stages separately
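The association-analysis idea can be illustrated with a simple co-occurrence count over made-up samples: pairs of genes that appear together in at least `min_support` target samples become candidate co-acting genes. All gene sets below are fabricated for illustration.

```python
from itertools import combinations
from collections import Counter

samples = [                      # hypothetical active-gene sets per sample
    {"BRCA1", "TP53", "EGFR"},
    {"BRCA1", "TP53"},
    {"TP53", "EGFR", "KRAS"},
]

pair_counts = Counter()
for genes in samples:
    # count every unordered gene pair that co-occurs in this sample
    pair_counts.update(combinations(sorted(genes), 2))

min_support = 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # {('BRCA1', 'TP53'): 2, ('EGFR', 'TP53'): 2}
```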
  • 161. 42 Privacy, Security and Social Impacts of Data Mining ■ Many data mining applications do not touch personal data ■ E.g., meteorology, astronomy, geography, geology, biology, and other scientific and engineering data ■ Many DM studies are on developing scalable algorithms to find general or statistically significant patterns, not touching individuals ■ The real privacy concern: unconstrained access to individual records, especially privacy-sensitive information ■ Method 1: Removing sensitive IDs associated with the data (a minimal sketch follows this slide) ■ Method 2: Data security-enhancing methods ■ Multi-level security model: permit access only to the authorized level ■ Encryption: e.g., blind signatures, biometric encryption, and anonymous databases (personal information is encrypted and stored at different locations) ■ Method 3: Privacy-preserving data mining methods
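A minimal sketch of Method 1, replacing each sensitive ID with a salted one-way hash before mining; the field names and salt handling are illustrative assumptions, not a complete security design.

```python
import hashlib

SALT = b"replace-with-a-secret-salt"   # assumption: kept out of the data set

def pseudonymize(record):
    """Strip the raw customer ID and substitute a stable pseudonym."""
    rec = dict(record)
    digest = hashlib.sha256(SALT + rec.pop("customer_id").encode()).hexdigest()
    rec["pid"] = digest[:16]           # linkable across records, not reversible here
    return rec

print(pseudonymize({"customer_id": "C10042", "spend": 120.5}))
```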
  • 162. * Data Mining: Concepts and Techniques 43 Trends in Data Mining ■ Application exploration ■ development of application-specific data mining systems ■ Invisible data mining (mining as a built-in function) ■ Scalable data mining methods ■ Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns ■ Integration of data mining with database systems, data warehouse systems, and Web database systems
  • 163. * Data Mining: Concepts and Techniques 44 Trends in Data Mining ■ Standardization of data mining language ■ A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society ■ Visual data mining ■ New methods for mining complex types of data ■ More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data ■ Web mining ■ Privacy protection and information security in data mining