Data mining 1

An Introduction to Common Data Mining Techniques
Speaker: 黃薰慧
Date: 2006.11.25

Outline
 Introduction to Data Mining
 Overview of Common Techniques
 Association Rule Mining
 Sequential Pattern Mining
 Classification
 Clustering
 Conclusion

Introduction to Data Mining
[Figure: data warehouse (Corporate Memory), information, and knowledge (Corporate Intelligence), connected by Mining steps]

What Is Data Mining?
 Finding information hidden in data, such as trends, patterns, and relationships.
 Part of Knowledge Discovery in Databases (KDD).
 Draws on computer storage and computing power together with statistical methods and tools.

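Overview of Common Techniques
 Common Data Mining techniques include Association Rule Mining, Sequential Pattern Mining, Classification, and Clustering.
 Association Rule Mining is typically used to find associations among data items.
 Sequential Pattern Mining is typically used to find temporal (ordering) relationships among data items.
 Classification starts from known data that has already been labeled with classes and uses it to train classification rules; the rules are then used to predict which class other data belongs to. Because the classes are assigned in advance, Classification is supervised.
 Clustering divides a large collection of data into groups. Because no classes are assigned in advance, Clustering is unsupervised.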

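Introduction to Association Rule Mining
 Definition: finding useful associations among the data items of a given data set.
 Glossary:
 Support(A→B) = P(A ∩ B)
 Confidence(A→B) = P(B | A)
 Example:
Buys(X, "computer") → Buys(X, "financial_management_software")
Let A = Buys(X, "computer") and B = Buys(X, "financial_management_software").
Then Support(A→B) is the probability that a customer buys both a computer and financial management software,
and Confidence(A→B) is the probability that a customer buys financial management software given that they bought a computer.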
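The two measures can be estimated from a list of transactions as relative frequencies. The following is a minimal sketch in Python; the `purchases` data and the function names `support` and `confidence` are hypothetical and only illustrate the definitions above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical purchases, one set of items per customer.
purchases = [
    {"computer", "financial_management_software"},
    {"computer"},
    {"computer", "financial_management_software", "printer"},
    {"printer"},
]
A = {"computer"}
B = {"financial_management_software"}

rule_support = support(purchases, A | B)        # 2 of 4 customers bought both
rule_confidence = confidence(purchases, A, B)   # 2 of the 3 computer buyers
print(rule_support, rule_confidence)
```

Note that `confidence` divides by the support of the antecedent, so it assumes the antecedent occurs in at least one transaction.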

(cont.)
 Algorithm:
The most commonly used algorithm for association rules is Apriori. It starts from a user-defined minimum support and minimum confidence and then produces association rules in two steps:
a. Find all frequent itemsets.
b. Generate association rules from these frequent itemsets; the rules must satisfy both minimum support and minimum confidence.
The second step only requires enumerating combinations, so the core of Apriori is how to find the frequent itemsets. First, the frequent 1-itemsets (frequent itemsets that contain a single item) are determined from the number of times each item occurs. Then, to find the frequent k-itemsets, there are two sub-steps:
a. Join the frequent (k-1)-itemsets with one another; every resulting set that contains k items is treated as a candidate itemset.
b. Check every candidate itemset against minimum support; those that satisfy it are the frequent k-itemsets.
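The join-and-check loop described above can be sketched in Python as follows. This is a minimal sketch, not an optimized implementation: it recounts support by scanning all transactions each time. The transaction list is an assumption for illustration (9 transactions over items 1 to 5), and `apriori_frequent_itemsets` is a hypothetical name.

```python
# ASSUMPTION: illustrative transactions; the slides' own table is not reproduced here.
transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # Frequent 1-itemsets: decided from how often each single item occurs.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if count(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: count(s) for s in current})
        k += 1
        # Sub-step a: join frequent (k-1)-itemsets; keep unions with exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Sub-step b: keep the candidates that satisfy minimum support.
        current = {c for c in candidates if count(c) >= min_support}
    return frequent


frequent = apriori_frequent_itemsets(transactions, min_support=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With this assumed data and a minimum support of 2, the pairs {1,4}, {3,4}, {3,5}, and {4,5} fall below the threshold and are not reported.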

(cont.)
 Example:
The table on the right holds transaction data: 9 transactions in total over 5 distinct items. We want to find the relationships among these 5 items, with minimum support set to 2.

(cont.)

(cont.)
Because the supports of {1,4}, {3,4}, {3,5}, and {4,5} are below minimum support, they are deleted.

(cont.)
No set has a support below minimum support, so none is deleted; these frequent itemsets are the final result.
Next, the association rules are computed by enumerating combinations:
With minimum confidence set to 70%, only 1&5=>2, 2&5=>1, and 5=>1&2 remain. The choice of minimum support and minimum confidence therefore determines which association rules are produced, so both thresholds should be chosen with care.
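The rule-generation step can be sketched by splitting one frequent itemset into every antecedent/consequent pair and keeping the splits whose confidence reaches the threshold. The transaction list below is an assumption (the original table is not reproduced in this text); it was chosen so that the frequent itemset {1, 2, 5} yields the three rules quoted above. `rules_from` is a hypothetical helper name.

```python
from itertools import combinations

# ASSUMPTION: illustrative transactions consistent with the rules quoted in the slides.
transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]


def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)


def rules_from(itemset, min_confidence):
    """List (antecedent, consequent, confidence) for each split that is confident enough."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = count(itemset) / count(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules


rules = rules_from({1, 2, 5}, min_confidence=0.70)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), f"{conf:.0%}")
```

Lowering `min_confidence` to 0.5 would also admit 1&2=>5 under this data, which illustrates how sensitive the rule set is to the threshold.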

Introduction to Sequential Pattern Mining
Q. How to find the sequential patterns?

[Figure: transaction table annotated with the labels Item, Itemset, and Transaction]
Step 1: Sort by Customer_Id and TransactionTime
Introduction to Sequential Pattern Mining (cont.)

With minimum support of 2 customers:
The large itemsets (litemsets): (30), (40), (70), (90), (40 70)
[Figure: transaction table annotated with the labels Item, Itemset, and Transaction]
Step 2: Find the large itemsets
(cont.)

[Figure: customer sequences, with the labels Sequence and 3-Sequence]
<(30) (90)> is supported by customers 1 and 4
<(30) (40 70)> is supported by customers 2 and 4
Step 3: List the sequences
(cont.)

Q. Find the large sequences
with minimum support set to 25%:
- Large sequences:
<(30)>, <(40)>, <(70)>, <(90)>
<(30) (40)>, <(30) (70)>, <(30) (90)>
<(40 70)>, <(30) (40 70)>
Step 4: Find the large sequences
(cont.)

Q. Find the maximal sequences
with minimum support of 2 customers:
- The answer set is:
<(30) (90)>, <(30) (40 70)>
These maximal sequences are the sequential patterns.
Step 5: Find the maximal sequences
(cont.)

 The Algorithm has five phases:
 Sort phase
 Large itemset phase
 Transformation phase
 Sequence phase
 Maximal phase
Algorithms: AprioriAll, AprioriSome, DynamicSome
(cont.)

Sort phase
Sort the database with customer-id as the major key and transaction-time as the minor key.
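A minimal Python sketch of the sort phase follows. The `rows` data and its `(customer_id, transaction_time, items)` layout are hypothetical; after sorting, each customer's transactions are grouped into one time-ordered customer sequence.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (customer_id, transaction_time, items bought).
rows = [
    (2, 3, {40, 60, 70}),
    (1, 2, {90}),
    (2, 1, {10, 20}),
    (1, 1, {30}),
    (2, 2, {30}),
]

# Major key: customer_id (index 0); minor key: transaction_time (index 1).
rows.sort(key=itemgetter(0, 1))

# Each customer's time-ordered transactions form one customer sequence.
customer_sequences = {
    customer_id: [items for _, _, items in group]
    for customer_id, group in groupby(rows, key=itemgetter(0))
}
print(customer_sequences)
```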

Litemset phase
 Find the large itemsets.
 This works like finding large itemsets in association rule mining, except that for each itemset a customer counts toward the support only once, even if several of that customer's transactions contain the itemset.
 Itemsets mapping
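The per-customer counting rule can be sketched as below. The `customers` data is an assumption, chosen so that it reproduces the litemsets (30), (40), (70), (90), (40 70) quoted in Step 2. For brevity the sketch enumerates every subset of each transaction rather than growing itemsets level by level as Apriori would, so it only suits small examples.

```python
from itertools import combinations

# ASSUMPTION: illustrative customer sequences (customer id -> ordered transactions).
customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def large_itemsets(customers, min_customers):
    """Return {itemset: number of supporting customers} for the large itemsets."""
    support = {}
    for transactions in customers.values():
        seen = set()  # itemsets found in any transaction of this customer
        for transaction in transactions:
            for size in range(1, len(transaction) + 1):
                seen.update(combinations(sorted(transaction), size))
        # A customer adds at most 1 to each itemset, however often it occurs.
        for itemset in seen:
            support[itemset] = support.get(itemset, 0) + 1
    return {s: n for s, n in support.items() if n >= min_customers}


litemsets = large_itemsets(customers, min_customers=2)
print(litemsets)
```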

Transformation phase
 Deleting non-large itemsets
 Mapping large itemsets to integers
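A minimal sketch of the transformation, assuming the large itemsets from Step 2 and an arbitrary integer id for each (the ids in `litemset_ids` are hypothetical): every transaction is replaced by the set of litemset ids it contains, and transactions that contain no litemset are dropped.

```python
# ASSUMPTION: the ids are arbitrary; any one-to-one mapping to integers would do.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_sequence):
    """Map each transaction to the ids of the litemsets it contains."""
    result = []
    for transaction in customer_sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:  # transactions without any large itemset are deleted
            result.append(ids)
    return result


transformed = transform([{10, 20}, {30}, {40, 60, 70}])
print(transformed)
```

Working on integer ids lets the later phases test whether a sequence is contained in a customer sequence with simple set membership.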

Sequence phase
 Use the set of litemsets to find the
desired sequence.
 Two families of algorithms:
 Count-all:
AprioriAll
 Count-some:
AprioriSome,
DynamicSome

Maximal phase
 Find the maximal sequences among the set of large sequences.
 Take the longest sequences from the set of large sequences in turn and remove their sub-sequences.
 What finally remains in the set are the maximal sequences.
 In some algorithms, this phase is
combined with the sequence phase.

Maximal phase
 Algorithm:
 S: the set of all large sequences
 n: the length of the longest sequence
for (k = n; k > 1; k--) do
    for each k-sequence s_k do
        Delete from S all subsequences of s_k
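The effect of the maximal phase can be sketched in Python on the large sequences listed in Step 4. This sketch filters directly (keep a sequence if no other large sequence contains it) instead of looping from the longest length downward; the helper names `seq`, `is_subsequence`, and `maximal_sequences` are hypothetical.

```python
def seq(*itemsets):
    """Build a sequence as a tuple of frozensets, e.g. seq({30}, {40, 70})."""
    return tuple(frozenset(i) for i in itemsets)


def is_subsequence(small, big):
    """True if the itemsets of `small` fit, in order, into distinct itemsets of `big`."""
    position = 0
    for itemset in small:
        while position < len(big) and not itemset <= big[position]:
            position += 1
        if position == len(big):
            return False
        position += 1
    return True


def maximal_sequences(large):
    """Keep the sequences that are not contained in any other large sequence."""
    return [
        s for s in large
        if not any(s != other and is_subsequence(s, other) for other in large)
    ]


large_sequences = [
    seq({30}), seq({40}), seq({70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}),
    seq({40, 70}), seq({30}, {40, 70}),
]
maximal = maximal_sequences(large_sequences)
print(maximal)
```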

AprioriAll
 The basic method for mining sequential patterns.
 Based on the Apriori algorithm.
 Counts all the large sequences, including non-maximal sequences.
 Uses the apriori-generate function to generate candidate sequences.

Apriori Candidate Generation
 Generate the candidates for a pass using only the large sequences found in the previous pass.
 Then make a pass over the data to
find the support of the candidates.

 Algorithm:
 L_k: the set of all large k-sequences
 C_k: the set of candidate k-sequences
insert into C_k
select p.litemset_1, p.litemset_2, …, p.litemset_{k-1}, q.litemset_{k-1}
from L_{k-1} p, L_{k-1} q
where p.litemset_1 = q.litemset_1, …, p.litemset_{k-2} = q.litemset_{k-2};
for all sequences c ∈ C_k do
    for all (k-1)-subsequences s of c do
        if (s ∉ L_{k-1}) then
            delete c from C_k;
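The join and prune steps can be sketched in Python by representing each sequence as a tuple of litemset ids. The function name `apriori_generate` and the tuple representation are assumptions of this sketch; the input is the set of large 3-sequences from the example in these slides.

```python
def apriori_generate(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences."""
    large_prev = set(large_prev)
    # Join: pair sequences p and q that agree on their first k-2 elements,
    # then append the last element of q to p.
    candidates = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune: drop a candidate if any of its (k-1)-subsequences is not large.
    def all_subsequences_large(c):
        return all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))

    return {c for c in candidates if all_subsequences_large(c)}


large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
candidates_4 = apriori_generate(large_3)
print(candidates_4)
```

The join alone also produces sequences such as (1, 2, 4, 3) and (1, 3, 4, 5); the prune step removes them because (2, 4, 3) and (3, 4, 5) are not large.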

AprioriAll (cont.)
L_1 = {large 1-sequences}; // Result of the litemset phase
for (k = 2; L_{k-1} ≠ ∅; k++) do
begin
    C_k = new candidates generated from L_{k-1}
    foreach customer-sequence c in the database do
        Increment the count of all candidates in C_k that are contained in c
    L_k = candidates in C_k with minimum support
end
Answer = maximal sequences in ∪_k L_k;

 Example: (Customer Sequences)
<{1 5}{2}{3}{4}>
<{1}{3}{4}{3 5}>
<{1}{2}{3}{4}>
<{1}{3}{5}>
<{4}{5}>
Next step: find the large 1-sequences,
with minimum support set to 25% (at least 2 of the 5 customer sequences).

Example: Large 1-Sequences
Sequence Support
<1> 4
<2> 2
<3> 4
<4> 4
<5> 4

Example: Large 2-Sequences
Sequence Support
<1 2> 2
<1 3> 4
<1 4> 3
<1 5> 2
<2 3> 2
<2 4> 2
<3 4> 3
<3 5> 2
<4 5> 2
Next step: find the large 3-sequences

Example: Large 3-Sequences
Sequence Support
<1 2 3> 2
<1 2 4> 2
<1 3 4> 3
<1 3 5> 2
<2 3 4> 2

Example: Large 4-Sequences
Sequence Support
<1 2 3 4> 2
Next step: find the sequential patterns

Example: Find the maximal large sequences
Sequence Support
<1> 4
<2> 2
<3> 4
<4> 4
<5> 4
Sequence Support
<1 2> 2
<1 3> 4
<1 4> 3
<1 5> 2
<2 3> 2
<2 4> 2
<3 4> 3
<3 5> 2
<4 5> 2
Sequence Support
<1 2 3> 2
<1 2 4> 2
<1 3 4> 3
<1 3 5> 2
<2 3 4> 2
Sequence Support
<1 2 3 4> 2
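The whole example can be replayed with a short Python sketch of the AprioriAll sequence phase followed by the maximal phase. The five customer sequences are those of the example; the tuple representation and the helper names (`contains`, `support`, `generate`, `is_subsequence`) are assumptions of this sketch, and support is recounted by a full scan for simplicity.

```python
from math import ceil

# Customer sequences from the example; each element is a set of litemset ids.
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
min_count = ceil(0.25 * len(database))  # 25% of 5 customers, rounded up


def contains(customer, sequence):
    """True if the ids of `sequence` occur in order, each in a later transaction."""
    position = 0
    for item in sequence:
        while position < len(customer) and item not in customer[position]:
            position += 1
        if position == len(customer):
            return False
        position += 1
    return True


def support(sequence):
    """Number of customers whose sequence contains `sequence`."""
    return sum(contains(customer, sequence) for customer in database)


def generate(prev):
    """Join large (k-1)-sequences on their common prefix, then prune."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


ids = sorted({i for customer in database for t in customer for i in t})
levels = [{(i,) for i in ids if support((i,)) >= min_count}]
while levels[-1]:
    levels.append({c for c in generate(levels[-1]) if support(c) >= min_count})
large = set().union(*levels)


def is_subsequence(a, b):
    """True if tuple `a` appears in `b` in order (not necessarily contiguously)."""
    remaining = iter(b)
    return all(x in remaining for x in a)


maximal = {s for s in large
           if not any(s != t and is_subsequence(s, t) for t in large)}
print(sorted(maximal))
```

Under these assumptions the maximal large sequences come out as <1 2 3 4>, <1 3 5>, and <4 5>.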

Introduction to Classification
 Definition: the given data is first labeled with classes; training on the labeled data yields classification rules, which can then be used to predict which class other data belongs to.
We now discuss classification with decision trees.
A decision tree is mainly used to obtain classification rules. Finding the optimal decision tree is an NP-hard problem. Most existing induction-based algorithms build on Hunt's algorithm, so we focus on Hunt's algorithm here.

Introduction to Classification (cont.)
Suppose we want to build a decision tree from a set T of training cases, where the classes are (C1, C2, …, Ck). There are three cases:
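A minimal Python sketch of Hunt's method follows, assuming the usual textbook form of the three cases: T is empty (the leaf takes a default class, here the parent's majority class), T contains a single class (the leaf takes that class), or T contains a mixture of classes (split on an attribute and recurse). The training cases, the class names `play` and `stay_home`, and the naive rule of splitting on the attributes in the order given are all assumptions of this sketch; practical algorithms choose the split with a measure such as information gain.

```python
from collections import Counter


def hunt(cases, attributes, default=None):
    """Build a decision tree from (features dict, class label) training cases."""
    if not cases:                       # case: T is empty
        return default
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                 # case: a single class (or nothing left to test)
    # Case: mixed classes. ASSUMPTION: split on the next attribute in the list.
    attribute, remaining = attributes[0], attributes[1:]
    values = {features[attribute] for features, _ in cases}
    branches = {
        value: hunt([c for c in cases if c[0][attribute] == value], remaining, majority)
        for value in values
    }
    return (attribute, branches, majority)


def classify(tree, features):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches, majority = tree
        tree = branches.get(features[attribute], majority)
    return tree


# Hypothetical training cases, already labeled with their class.
training = [
    ({"outlook": "sunny", "windy": False}, "stay_home"),
    ({"outlook": "sunny", "windy": True}, "stay_home"),
    ({"outlook": "overcast", "windy": False}, "play"),
    ({"outlook": "overcast", "windy": True}, "play"),
    ({"outlook": "rain", "windy": False}, "play"),
    ({"outlook": "rain", "windy": True}, "stay_home"),
]
tree = hunt(training, ["outlook", "windy"])
prediction = classify(tree, {"outlook": "rain", "windy": False})
print(prediction)
```

Each path from the root to a leaf of `tree` reads as one classification rule, for example "if outlook is rain and windy is false, then play".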

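 Example:
This is the given training data set, in which every case has already been labeled with its class.
Class distribution table for the attribute Outlook
Class distribution table for the attribute Windy
Class distribution table for the attribute Humidity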

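Once the classification rules have been obtained, they can be used to predict the class of other data. For example, if Outlook is rain, Humidity is 95, and Windy is false, the classification rules we obtained tell us that we can go out and play. The quality of the classification rules affects the accuracy of the predictions.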

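Introduction to Clustering
 Definition: grouping a large collection of objects by similarity is called Clustering. Because the objects have not been classified beforehand, it is a form of unsupervised learning.
 Clustering Methods:
 Partitioning methods: given k, divide the data in the database into k groups such that:
a. each group contains at least one data object
b. each data object belongs to exactly one group
 Hierarchical methods: decompose the data hierarchically. There are usually two approaches, agglomerative and divisive: agglomerative works bottom-up, and divisive works top-down.
 Density-based methods: proposed so that clusters can take different shapes. Data objects are separated by regions of lower density: a distance formula is defined between data objects and used to compute the density of the data, and the high-density parts are placed in the same group.
 Grid-based methods: grid-based clustering uses a multiresolution grid data structure, dividing the space into a finite number of cells that form the grid structure. "Multiresolution" means that the grid is partitioned at several different resolutions. The main advantage is speed, which usually depends on the number of cells rather than on the amount of data in the database.
 Model-based methods: the main idea is to fit the clustering result to some mathematical model.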
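As an illustration of a partitioning method, here is a minimal k-means sketch in Python. k-means is one common partitioning algorithm and is named here as an example, not as the method the slides prescribe. The 2-D `points`, the choice of the first k points as initial centroids, and the fixed iteration count are assumptions of this sketch.

```python
def k_means(points, k, iterations=20):
    """Partition 2-D points into k groups around iteratively updated centroids."""
    centroids = list(points[:k])  # ASSUMPTION: first k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            # Condition b: each point is assigned to exactly one (nearest) group.
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        for i, members in enumerate(clusters):
            if members:  # an empty group keeps its previous centroid
                centroids[i] = (
                    sum(px for px, _ in members) / len(members),
                    sum(py for _, py in members) / len(members),
                )
    return clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = k_means(points, k=2)
print(clusters)
```

Plain k-means does not by itself guarantee condition a (no empty group); this sketch simply leaves an empty group's centroid unchanged, and real implementations usually re-seed it.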

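Introduction to Clustering (cont.)
 Some applications of Clustering
 Market Segmentation
Group customers by their purchasing behavior so that different marketing strategies can be devised for different types of customers.
 Fraud Detection
Identify the characteristics of the clusters formed by known fraud data, as a reference for detecting fraud in the future.
 Defect Analysis
Identify the characteristics of the clusters formed by data on known defective parts, and from them find the causes of the defects.
 Lapse Analysis
Identify the characteristics of the clusters formed by data on customers known to have let their insurance policies lapse, as a reference for planning future insurance policies.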

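Introduction to Clustering (cont.)
 Issues that clustering techniques need to address
 How many clusters is ideal?
With too many clusters, the result is hard to grasp.
With too few clusters, the clusters become large, the records within a cluster tend not to be similar enough, and some important cluster characteristics may cancel each other out and be buried.
 How do we decide which cluster a record falls into?
The various ways of computing similarity or distance
The various ways of weighting
 How can the clustering result be presented effectively?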
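To make the second issue concrete, the sketch below assigns a record to the nearest cluster center under a weighted Euclidean distance. The record, the centers, the weights, and the names `weighted_distance` and `assign` are all hypothetical; the point is only that changing the weights can change which cluster a record falls into.

```python
from math import sqrt


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two records of equal length."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def assign(record, centers, weights):
    """Index of the cluster center closest to `record` under the given weights."""
    return min(
        range(len(centers)),
        key=lambda i: weighted_distance(record, centers[i], weights),
    )


record = (1.0, 10.0)
centers = [(0.0, 0.0), (5.0, 10.0)]

equal = assign(record, centers, weights=(1.0, 1.0))        # both attributes count
first_only = assign(record, centers, weights=(1.0, 0.0))   # second attribute ignored
print(equal, first_only)
```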

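Conclusion
Having gone through these Data Mining techniques, we can see how they differ. They do not have to be used in isolation, however; they can complement one another. For example, we can first use Association Rules to find the relationships among items and then use those relationships to improve the accuracy of Clustering.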
