• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Major issues in data mining
2

• Data explosion problem
  ◦ Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining
  ◦ Data warehousing and on-line analytical processing
  ◦ Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

3

• 1960s:
  ◦ Data collection, database creation, IMS and network DBMS
• 1970s:
  ◦ Relational data model, relational DBMS implementation
• 1980s:
  ◦ RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s:
  ◦ Data mining and data warehousing, multimedia databases, and Web databases
4

• Data mining (knowledge discovery in databases):
  ◦ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
• Alternative names:
  ◦ Data mining: a misnomer?
  ◦ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  ◦ (Deductive) query processing
  ◦ Expert systems or small ML/statistical programs


5

• Database analysis and decision support
  ◦ Market analysis and management
    ▪ Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  ◦ Risk analysis and management
    ▪ Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  ◦ Fraud detection and management
• Other applications
  ◦ Text mining (newsgroups, email, documents)
  ◦ Stream data mining
  ◦ Web mining
  ◦ DNA data analysis

6

• Where are the data sources for analysis?
  ◦ Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  ◦ Find clusters of “model” customers who share the same characteristics: interests, income level, spending habits, etc.
• Determine customer purchasing patterns over time
  ◦ Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  ◦ Associations/correlations between product sales
  ◦ Prediction based on the association information

7

• Customer profiling
  ◦ Data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
  ◦ Identifying the best products for different customers
  ◦ Use prediction to find what factors will attract new customers
• Provides summary information
  ◦ Various multidimensional summary reports
  ◦ Statistical summary information (data central tendency and variation)
8

• Finance planning and asset evaluation
  ◦ Cash flow analysis and prediction
  ◦ Contingent claim analysis to evaluate assets
  ◦ Cross-sectional and time series analysis (financial ratios, trend analysis, etc.)
• Resource planning:
  ◦ Summarize and compare the resources and spending
• Competition:
  ◦ Monitor competitors and market directions
  ◦ Group customers into classes and apply a class-based pricing procedure
  ◦ Set pricing strategy in a highly competitive market



9

• Applications
  ◦ Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach
  ◦ Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
• Examples
  ◦ Auto insurance: detect a group of people who stage accidents to collect on insurance
  ◦ Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  ◦ Medical insurance: detect professional patients and rings of doctors and rings of references
10

• Detecting inappropriate medical treatment
  ◦ The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
• Detecting telephone fraud
  ◦ Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  ◦ British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud.
• Retail
  ◦ Analysts estimate that 38% of retail shrink is due to dishonest employees.



11

• Sports
  ◦ IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
• Astronomy
  ◦ JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
  ◦ IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

12

• Data mining: the core of the knowledge discovery process.

[Figure: the knowledge discovery process, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
13

• Learning the application domain:
  ◦ Relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation:
  ◦ Find useful features, dimensionality/variable reduction, invariant representation
• Choosing the functions of data mining
  ◦ Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  ◦ Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge


14

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  ◦ Object-oriented and object-relational databases
  ◦ Spatial and temporal data
  ◦ Time-series data and stream data
  ◦ Text databases and multimedia databases
  ◦ Heterogeneous and legacy databases
  ◦ WWW
15

• Association rule mining:
  ◦ Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  ◦ Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data
  ◦ What products were often purchased together? Beer and diapers?!
  ◦ What are the subsequent purchases after buying a PC?
  ◦ What kinds of DNA are sensitive to this new drug?
  ◦ Can we automatically classify web documents?



16

Transaction-id   Items bought
     10             A, B, C
     20             A, C
     30             A, D
     40             B, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X ⇒ Y with minimum confidence and support
  ◦ support, s: probability that a transaction contains X∪Y
  ◦ confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diapers"; the overlap is "Customer buys both"]

• Let min_support = 50%, min_conf = 50%:
  ◦ A ⇒ C (50%, 66.7%)
  ◦ C ⇒ A (50%, 100%)
17

Min. support 50%, min. confidence 50%

Transaction-id   Items bought
     10             A, B, C
     20             A, C
     30             A, D
     40             B, E, F

Frequent pattern    Support
     {A}              75%
     {B}              50%
     {C}              50%
     {A, C}           50%

For rule A ⇒ C:
    support = support({A}∪{C}) = 50%
    confidence = support({A}∪{C})/support({A}) = 66.7%
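
The support and confidence figures above can be checked with a few lines of Python. This is only a sketch: the function names `support` and `confidence` and the use of `frozenset` are illustrative choices, not part of the slides.

```python
# The four transactions from the slide, one frozenset of items per transaction.
TRANSACTIONS = [
    frozenset("ABC"),  # transaction 10
    frozenset("AC"),   # transaction 20
    frozenset("AD"),   # transaction 30
    frozenset("BEF"),  # transaction 40
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs: support(lhs ∪ rhs) / support(lhs)."""
    both = frozenset(lhs) | frozenset(rhs)
    return support(both, transactions) / support(lhs, transactions)


print(support("AC", TRANSACTIONS))                    # 0.5
print(round(confidence("A", "C", TRANSACTIONS), 3))   # 0.667
print(confidence("C", "A", TRANSACTIONS))             # 1.0
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because the denominators support({A}) = 75% and support({C}) = 50% differ.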

18

• Any subset of a frequent itemset must be frequent
  ◦ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  ◦ Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
• Method:
  ◦ Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  ◦ Test the candidates against the DB
• Performance studies show its efficiency and scalability



19

Database TDB
Tid    Items
10     A, C, D
20     B, C, E
30     A, B, C, E
40     B, E

C1 (after 1st scan)
Itemset    sup
{A}        2
{B}        3
{C}        3
{D}        1
{E}        3

L1
Itemset    sup
{A}        2
{B}        3
{C}        3
{E}        3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

C2 (after 2nd scan)
Itemset    sup
{A, B}     1
{A, C}     2
{A, E}     1
{B, C}     2
{B, E}     3
{C, E}     2

L2
Itemset    sup
{A, C}     2
{B, C}     2
{B, E}     3
{C, E}     2

C3
Itemset
{B, C, E}

L3 (after 3rd scan)
Itemset      sup
{B, C, E}    2
20

• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
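
The pseudo-code above can be turned into a short runnable sketch. The version below is one possible reading, not a reference implementation: the function name `apriori`, the use of absolute counts instead of percentages, and the minimum count of 2 are assumptions (the count of 2 is inferred from the TDB example, where itemsets with a count of 1 are dropped). Candidates are built by taking unions of frequent k-itemsets, a simpler stand-in for the ordered self-join.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items,
        # kept only if every k-subset is frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach the minimum support count.
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# The TDB example database (Tid 10, 20, 30, 40).
TDB = ["ACD", "BCE", "ABCE", "BE"]
result = apriori(TDB, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the TDB example this yields the same L1, L2, and L3 as the worked tables, ending with {B, C, E} at a count of 2.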
21

• How to generate candidates?
  ◦ Step 1: self-joining Lk
  ◦ Step 2: pruning
• Example of candidate generation
  ◦ L3 = {abc, abd, acd, ace, bcd}
  ◦ Self-joining: L3*L3
    ▪ abcd from abc and abd
    ▪ acde from acd and ace
  ◦ Pruning:
    ▪ acde is removed because ade is not in L3
  ◦ C4 = {abcd}


22

• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
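
The two steps can be sketched in Python as follows. The function name `generate_candidates` and the representation of itemsets as sorted tuples are assumptions made for illustration; the input is the L3 = {abc, abd, acd, ace, bcd} example from the candidate-generation slide.

```python
from itertools import combinations


def generate_candidates(prev_frequent):
    """Build Ck from L(k-1); each itemset is handled as a sorted tuple of items."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = []
    for p in prev:
        for q in prev:
            # Step 1, self-join: p and q agree on the first k-2 items and
            # the last item of p sorts before the last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2, pruning: drop c if any (k-1)-subset is not in L(k-1).
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates


L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = generate_candidates(L3)
print(["".join(c) for c in C4])  # ['abcd']
```

The join produces abcd and acde; acde is then pruned because its subset ade is not in L3, leaving C4 = {abcd}.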


23

• Finding models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
• Presentation: decision tree, classification rules, neural network
• Prediction: predict some unknown or missing numerical values
24

[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training data:
NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
25

[Figure: Testing Data → Classifier; Unseen Data → Classifier]

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4). Tenured?
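
As a rough illustration of the testing step, the learned rule (IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’) can be written as a function and scored against the testing data. The names `predict_tenured` and `TEST_DATA` are hypothetical, and accuracy is used here as one simple way to evaluate a classifier; the slides do not name a metric.

```python
def predict_tenured(rank, years):
    """Apply the rule learned from the training data."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, actual tenured label).
TEST_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in TEST_DATA
)
accuracy = correct / len(TEST_DATA)
print(accuracy)                         # 0.75
print(predict_tenured("Professor", 4))  # yes
```

The rule misclassifies Merlisa (7 years, but not tenured), so it gets three of the four test rows right, and it predicts "yes" for the unseen record (Jeff, Professor, 4).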
26

Training set:
age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair
>40      medium   no        excellent

27

age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
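
Read as code, the decision tree is a set of nested conditions. In the sketch below the function name `classify`, the numeric `age` argument, and the string values are assumptions; the tree gives age only as the ranges <=30, 31…40, and >40, and the class label being predicted is not named on the slide.

```python
def classify(age, student, credit_rating):
    """Walk the decision tree and return the predicted class label ("yes" or "no")."""
    if age <= 30:
        # Youngest branch: the outcome depends on student status.
        return "yes" if student == "yes" else "no"
    if age <= 40:
        # Middle branch (31 to 40): always "yes".
        return "yes"
    # Oldest branch (over 40): the outcome depends on credit rating.
    return "no" if credit_rating == "excellent" else "yes"


print(classify(25, "no", "fair"))        # no
print(classify(35, "no", "excellent"))   # yes
print(classify(45, "yes", "excellent"))  # no
```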

28

• Cluster analysis
  ◦ Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  ◦ Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
• Outlier analysis
  ◦ Outlier: a data object that does not comply with the general behavior of the data
  ◦ It can be considered noise or an exception, but is quite useful in fraud detection and rare events analysis
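
A very simple way to make the outlier idea concrete is a z-score test: flag values that lie far from the mean in units of standard deviation. This is only one of many possible techniques and is not taken from the slides; the function name `find_outliers`, the threshold of 2.0, and the sample amounts are all hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; the last one deviates from the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(amounts))  # [500]
```

In a fraud-detection setting the flagged value would be the interesting one, while the same value would be discarded as noise in many other analyses.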


29

Data miningppt378
