Data miningppt378

 Motivation: Why data mining?
 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Are all the patterns interesting?
 Major issues in data mining

 Data explosion problem

 Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
 We are drowning in data, but starving for knowledge!
 Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing

 Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases


 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.) and application-oriented DBMS (spatial, scientific, engineering,
etc.)
 1990s—2000s:
 Data mining and data warehousing, multimedia databases, and Web
databases

 Data mining (knowledge discovery in databases):
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
 Alternative names:
 Data mining: a misnomer?
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 What is not data mining?
 (Deductive) query processing.
 Expert systems or small ML/statistical programs


 Database analysis and decision support
 Market analysis and management
▪ target marketing, customer relationship management, market
basket analysis, cross selling, market segmentation
 Risk analysis and management
▪ Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
 Fraud detection and management
 Other Applications
 Text mining (news group, email, documents)
 Stream data mining
 Web mining.
 DNA data analysis


 Where are the data sources for analysis?
 Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies
 Target marketing
 Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
 Determine customer purchasing patterns over time
 Conversion of a single to a joint bank account: marriage, etc.
 Cross-market analysis
 Associations/correlations between product sales
 Prediction based on the association information


 Customer profiling
 data mining can tell you what types of customers buy what products
(clustering or classification)
 Identifying customer requirements
 identifying the best products for different customers

 use prediction to find what factors will attract new customers
 Provides summary information
 various multidimensional summary reports

 statistical summary information (data central tendency and
variation)

 Finance planning and asset evaluation
 cash flow analysis and prediction
 contingent claim analysis to evaluate assets
 cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
 Resource planning:
 summarize and compare the resources and spending
 Competition:
 monitor competitors and market directions
 group customers into classes and set a class-based pricing procedure
 set pricing strategy in a highly competitive market


 Applications
 widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
 Approach
 use historical data to build models of fraudulent behavior and use
data mining to help identify similar instances
 Examples
 auto insurance: detect a group of people who stage accidents to
collect on insurance
 money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
 medical insurance: detect professional patients, rings of doctors,
and rings of references

 Detecting inappropriate medical treatment
 The Australian Health Insurance Commission found that in many cases
blanket screening tests were requested (saving A$1m/yr).
 Detecting telephone fraud
 Telephone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm.
 British Telecom identified discrete groups of callers with frequent
intra-group calls, especially on mobile phones, and broke a
multimillion-dollar fraud.
 Retail
 Analysts estimate that 38% of retail shrink is due to dishonest
employees.


 Sports
 IBM Advanced Scout analyzed NBA game statistics (shots blocked,
assists, and fouls) to gain competitive advantage for New York
Knicks and Miami Heat
 Astronomy
 JPL and the Palomar Observatory discovered 22 quasars with the
help of data mining
 Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preferences and behavior,
analyze the effectiveness of Web marketing, improve Web site
organization, etc.


 Data mining: the core of the knowledge discovery process.

[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

 Learning the application domain:
 relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and transformation:
 Find useful features, dimensionality/variable reduction, invariant
representation.
 Choosing functions of data mining
 summarization, classification, regression, association, clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge


 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Spatial and temporal data
 Time-series data and stream data
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW

 Association rule mining:
 Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases,
relational databases, and other information repositories.
 Frequent pattern: pattern (set of items, sequence, etc.) that occurs
frequently in a database
 Motivation: finding regularities in data
 What products were often purchased together? — Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?


 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with min confidence and support
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X
also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Let min_support = 50%, min_conf = 50%:
A ⇒ C (50%, 66.7%)
C ⇒ A (50%, 100%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Min. support 50%, min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A}∪{C}) = 50%
confidence = support({A}∪{C})/support({A}) = 50%/75% = 66.7%
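The support and confidence figures above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function names `support` and `confidence` are illustrative, and exact fractions are used so the 2/3 confidence is not hidden by rounding.

```python
from fractions import Fraction

# Transactions from the example table (transaction id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return Fraction(hits, len(db))

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

for lhs, rhs in [("A", "C"), ("C", "A")]:
    s = support(set(lhs) | set(rhs), transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{lhs} => {rhs}: support = {float(s):.1%}, confidence = {float(c):.1%}")
```

Both rules clear the 50% thresholds, which is why both appear in the result.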


 Any subset of a frequent itemset must be frequent
 if {beer, diaper, nuts} is frequent, so is {beer, diaper}
 every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested!
 Method:
 generate length (k+1) candidate itemsets from length k frequent itemsets, and
 test the candidates against DB
 The performance studies show its efficiency and scalability


Database TDB (minimum support count of 2, as implied by the pruned itemsets)
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)
Itemset
{B, C, E}

3rd scan: L3
Itemset     sup
{B, C, E}   2

 Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
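The pseudo-code can be turned into a small runnable sketch. The version below is one possible reading, not the original authors' implementation: the function name `apriori` and the `min_count` parameter (an absolute support count) are assumptions, and candidates are formed by taking unions of pairs of frequent k-itemsets rather than by an ordered prefix join. It is run on the four-transaction TDB example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items,
        # kept only if every k-subset is frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One scan of the database counts every candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The loop stops when a level produces no frequent itemsets, which mirrors the `Lk != ∅` condition in the pseudo-code.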

 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
▪ abcd from abc and abd
▪ acde from acd and ace
 Pruning:
▪ acde is removed because ade is not in L3
 C4={abcd}


 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
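The two steps can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: the function name `apriori_gen` is hypothetical, itemsets are represented as sorted tuples, and alphabetical order stands in for the item ordering. It reproduces the L3 example, where abcd survives and acde is pruned.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for p, q in combinations(prev, 2):
        # Step 1 (self-join): same first k-2 items, last item of p < last item of q.
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            c = p + (q[-1],)
            # Step 2 (prune): every (k-1)-subset of c must itself be frequent.
            if all(sub in prev_set for sub in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates

l3 = ["abc", "abd", "acd", "ace", "bcd"]
c4 = apriori_gen(l3)
print(["".join(c) for c in c4])
```

Here acde is generated by the join of acd and ace but dropped in the prune step because ade is not in L3.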


 Finding models (functions) that describe and
distinguish classes or concepts for future prediction
 E.g., classify countries based on climate, or classify
cars based on gas mileage
 Presentation: decision-tree, classification rule, neural
network
 Prediction: Predict some unknown or missing
numerical values

[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (model):
IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'

[Figure: the Classifier is applied to Testing Data and to Unseen Data]

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data:
(Jeff, Professor, 4)   Tenured?
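To make the model-use step concrete, the rule learned from the training data can be applied to the testing data and to the unseen tuple. This is a minimal sketch: the function name `predict_tenured` is illustrative, and the rows are copied from the testing table above.

```python
def predict_tenured(rank, years):
    """Rule from the training slide: rank is professor OR years > 6."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)
print(f"accuracy on testing data: {accuracy:.0%}")

# Unseen tuple from the slide.
print("Jeff:", predict_tenured("Professor", 4))
```

The rule misclassifies Merlisa (7 years but not tenured), which shows why accuracy is estimated on data that was not used to build the model.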

Training set:
age     income   student   credit_rating
<=30    high     no        fair
<=30    high     no        excellent
31…40   high     no        fair
>40     medium   no        fair
>40     low      yes       fair
>40     low      yes       excellent
31…40   low      yes       excellent
<=30    medium   no        fair
<=30    low      yes       fair
>40     medium   yes       fair
<=30    medium   yes       excellent
31…40   medium   no        excellent
31…40   high     yes       fair
>40     medium   no        excellent


Decision tree:
age?
    <=30: student?
        no: no
        yes: yes
    31…40: yes
    >40: credit rating?
        excellent: no
        fair: yes
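Read as nested tests, the tree translates directly into code. The sketch below is one reading of the diagram, not code from the slides: the function name `classify` is hypothetical, the leaves are the tree's yes/no labels, and the middle age branch is written as "31…40" to match the training set. Income never appears in the tree, so it is not an argument.

```python
def classify(age, student, credit_rating):
    """Follow the decision tree from the root to a leaf label."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age group: {age!r}")

# A few rows from the training set: (age, student, credit_rating).
rows = [
    ("<=30", "no", "fair"),
    ("31…40", "no", "fair"),
    (">40", "yes", "excellent"),
    ("<=30", "yes", "excellent"),
]
for row in rows:
    print(row, "->", classify(*row))
```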


 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g., cluster houses
to find distribution patterns
 Clustering based on the principle: maximizing the intra-class similarity and
minimizing the interclass similarity
 Outlier analysis
 Outlier: a data object that does not comply with the general behavior of the
data
 It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
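As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean of a sample. Everything here is an assumption for illustration: the slides do not name a technique, the z-score test is just one simple choice, the function name `zscore_outliers` and the threshold of 2 are arbitrary, and the call durations are made-up numbers.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` std devs."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical call durations in minutes; one call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 60]
print(zscore_outliers(call_minutes))
```

Whether a flagged value is noise or a meaningful exception (for example, a fraudulent call) still has to be decided in the application context.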
