Data mining knowledge representation
1 What Defines a Data Mining Task?
• Task relevant data: where and how to retrieve the data to be used
for mining
• Background knowledge: Concept hierarchies
• Interestingness measures: informal and formal selection techniques
to be applied to the output knowledge
• Representing input data and output knowledge: the structures used
to represent the input and the output of the data mining techniques
• Visualization techniques: needed to best view and document the
results of the whole process
2 Task relevant data
• Database or data warehouse name: where to find the data
• Database tables or data warehouse cubes
• Conditions for data selection, relevant attributes or dimensions, and
data grouping criteria: all of these are used in the SQL query that
retrieves the data
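As an illustration, here is a minimal Python sketch of such a retrieval query; the database, table, and attribute names (sales, region, item, amount, year) are hypothetical.

```python
import sqlite3

# Minimal sketch: task-relevant data retrieved with an SQL query.
# The table and attribute names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, item TEXT, amount REAL, year INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [("east", "pen", 9.5, 2023), ("east", "pen", 10.0, 2023),
                  ("west", "ink", 3.0, 2022)])

query = """
    SELECT region, item, SUM(amount)   -- relevant attributes / dimensions
    FROM sales
    WHERE year = 2023                  -- condition for data selection
    GROUP BY region, item              -- data grouping criteria
"""
task_relevant_data = conn.execute(query).fetchall()
print(task_relevant_data)              # [('east', 'pen', 19.5)]
```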
3 Background knowledge: Concept hierarchies
The concept hierarchies are induced by a partial order¹ over the values
of a given attribute. Depending on the type of the ordering relation we
distinguish several types of concept hierarchies.
3.1 Schema hierarchy
• Relating concept generality. The ordering reflects the generality of
the attribute values, e.g. street < city < state < country.
3.2 Set-grouping hierarchy
• The ordering relation is the subset relation (⊆). Applies to set
values.
• Example:
{13, ..., 39} = young; {13, ..., 19} = teenage;
{13, ..., 19} ⊆ {13, ..., 39} ⇒ teenage < young.
• Theory:
– power set 2^X: the set of all subsets of a set X.
– lattice (2^X, ⊆): for subsets A, B ∈ 2^X, sup(A, B) = A ∪ B and inf(A, B) = A ∩ B.
  [Hasse diagram: A ∪ B on top, A and B below it, A ∩ B at the bottom]
– top element ⊤ = X (the whole set), bottom element ⊥ = {} (the empty set).
¹ Consider a set A and an ordering relation R. R is a total (full) order if any two elements x, y ∈ A are comparable, i.e. either xRy or yRx holds. R is a partial order if only some pairs of elements need to be comparable.
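To make the set-grouping ordering concrete, here is a minimal Python sketch of the example above:

```python
# Set-grouping hierarchy: the ordering is the subset relation.
young = set(range(13, 40))    # {13, ..., 39}
teenage = set(range(13, 20))  # {13, ..., 19}

# Python's '<' on sets tests for proper subset, so it directly
# expresses "teenage < young" in the hierarchy.
print(teenage < young)        # True
print(young < teenage)        # False
```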
3.3 Operation-derived hierarchy
Produced by applying an operation (encoding, decoding, information
extraction). For example:
markovz@cs.ccsu.edu
instantiates the hierarchy user-name < department < university <
usa-university.
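A small Python sketch of such an information-extraction operation; the level names and the "edu" test for usa-university are assumptions made for illustration.

```python
# Operation-derived hierarchy: extract hierarchy levels from an e-mail address.
def email_hierarchy(address):
    user, host = address.split("@")
    parts = host.split(".")                    # e.g. ['cs', 'ccsu', 'edu']
    return {
        "user-name": user,
        "department": parts[0],
        "university": parts[1],
        "usa-university": parts[-1] == "edu",  # crude heuristic, assumed
    }

print(email_hierarchy("markovz@cs.ccsu.edu"))
```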
3.4 Rule-based hierarchy
Using rules to define the partial order, for example:
if antecedent then consequent
defines the order antecedent < consequent.
4 Interestingness measures
Criteria to evaluate hypotheses (knowledge extracted from data when
applying data mining techniques). This issue will be discussed in more
detail in Lecture Notes - Chapter 9: "Evaluating what's been learned".
4.1 Bayesian evaluation
• E - data
• H = {H_1, H_2, ..., H_n} - hypotheses
• H_best = argmax_i P(H_i|E)
• Bayes theorem:

  P(H_i|E) = P(H_i) P(E|H_i) / Σ_{j=1..n} P(H_j) P(E|H_j)
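To make the computation concrete, here is a minimal Python sketch; the prior and likelihood values are made up for illustration.

```python
# Bayesian evaluation of hypotheses with made-up numbers.
priors = [0.5, 0.3, 0.2]          # P(H_i), assumed
likelihoods = [0.1, 0.4, 0.3]     # P(E|H_i), assumed

evidence = sum(p * l for p, l in zip(priors, likelihoods))
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]

h_best = max(range(len(posteriors)), key=lambda i: posteriors[i])
print(h_best, posteriors)         # index of the most probable hypothesis
```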
4.2 Simplicity
Occam’s Razor
Consider, for example, association rule length, decision tree size, or the
number and length of classification rules. Intuition suggests that the
best hypothesis is the simplest (shortest) one. This is the so-called
Occam's Razor Principle, also expressed as a mathematical theorem
(Occam's Razor Theorem). Here is an example of applying this principle
to grammars:
• Data:
E = {0, 000, 00000, 0000000, 000000000}
• Hypotheses:
G1 : S → 0|000|00000|0000000|000000000
G2 : S → 00S|0
• Best hypothesis: G2 (fewer and simpler rules)
However, since simplicity is a subjective notion, we need formal criteria
to define it.
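To make the comparison concrete, here is a small Python sketch; counting rule symbols as a measure of grammar size is an assumption made only for illustration.

```python
# A concrete look at the simplicity comparison above.
E = {"0", "000", "00000", "0000000", "000000000"}

g1_rules = ["0", "000", "00000", "0000000", "000000000"]  # G1: one rule per string
g2_rules = ["00S", "0"]                                   # G2: S -> 00S | 0

def generated_by_g2(s):
    # G2 derives exactly the odd-length strings of 0s: S -> 0, S -> 00S.
    return set(s) == {"0"} and len(s) % 2 == 1

print(all(generated_by_g2(s) for s in E))                 # True: G2 covers the data
print(sum(map(len, g1_rules)), sum(map(len, g2_rules)))   # 25 vs 4: G2 is simpler
```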
Formal criteria for simplicity
• Bayesian approach: needs a large volume of experimental results
(statistics) to define the prior probabilities.
• Algorithmic (Kolmogorov) complexity of an object (bit string): the
length of the shortest program for a Universal Turing Machine that
generates the string. Problem: it is not computable in general.
• Information-based approaches: the Minimum Description Length
Principle (MDL). Most often used in practice.
4.3 Minimum Description Length Principle (MDL)
• Bayes Theorem:

  P(H_i|E) = P(H_i) P(E|H_i) / Σ_{j=1..n} P(H_j) P(E|H_j)

• Take −log_2 of both sides of Bayes' theorem (C is a constant):

  −log_2 P(H_i|E) = −log_2 P(H_i) − log_2 P(E|H_i) + C

• I(A) - information in message A, L(A) - minimum length of A in bits:

  −log_2 P(A) = I(A) = L(A)

• Then: L(H_i|E) = L(H_i) + L(E|H_i) + C
• MDL: The hypothesis must reduce the information needed to encode
the data, i.e.

  L(E) > L(H_i) + L(E|H_i)

• The best hypothesis must maximize information compression:

  H_best = argmax_i (L(E) − L(H_i) − L(E|H_i))
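The selection itself is easy to sketch in code; all description lengths below are invented numbers used only to show the computation.

```python
# Minimal sketch of MDL-based hypothesis selection.
L_E = 1000                        # L(E): bits to encode the data directly

hypotheses = {                    # (L(H_i), L(E|H_i)) per hypothesis, assumed
    "H1": (50, 700),
    "H2": (200, 400),
    "H3": (900, 20),
}

def compression(l_h, l_e_given_h):
    return L_E - l_h - l_e_given_h    # information saved by using the hypothesis

best = max(hypotheses, key=lambda h: compression(*hypotheses[h]))
print(best)                       # 'H2': largest compression, hence MDL-best
```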
4.4 Certainty
• Confidence of association "if A then B":

  P(B|A) = (# of tuples containing both A and B) / (# of tuples containing A)
• Classification accuracy: Use a training set to generate the hypothesis,
then test it on a separate test set.

  Accuracy = (# of correct classifications) / (# of tuples in the test set)
• Utility (support) of association "if A then B":

  P(A, B) = (# of tuples containing both A and B) / (total # of tuples)
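A short Python sketch of the confidence and support computations over a made-up set of transactions:

```python
# Confidence and support of "if A then B" over a toy transaction set.
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A"}, {"B"}, {"A", "B"},
]

n_A = sum(1 for t in transactions if "A" in t)
n_AB = sum(1 for t in transactions if {"A", "B"} <= t)

confidence = n_AB / n_A                  # P(B|A)
support = n_AB / len(transactions)       # P(A, B)
print(confidence, support)               # 0.75, 0.6
```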
5 Representing input data and output knowledge
5.1 Concepts (classes, categories, hypotheses): things to be mined/learned
• Classification mining/learning: predicting a discrete class, a kind
of supervised learning, success is measured on new data for which
class labels are known (test data).
• Association mining/learning: detecting associations between attributes;
it can be used to predict the value of any attribute, or of several
attributes at once, so many more rules can be generated; hence we
need constraints (minimum support and minimum confidence).
• Clustering: grouping similar instances into clusters, a kind of
unsupervised learning, success is measured subjectively or by objective
functions.
• Numeric prediction: predicting a numeric quantity, a kind of
supervised learning, success is measured on test data.
• Concept description: output of the learning scheme
5.2 Instances (examples, tuples, transactions)
• Things to be classified, associated, or clustered.
• Individual, independent examples of the concept to be learned (target
concept).
• Described by predetermined set of attributes.
• Input to the learning scheme: set of instances (dataset), represented
as a single relation (table).
• Independence assumption: no relationships between attributes.
• Positive and negative examples for a concept; Closed World
Assumption (CWA): {negative} = {all} \ {positive} (see the sketch
after this list).
• Relational (First Order Logic) descriptions:
– Using variables (more compact representation). For example:
< a, b, b >, < a, c, c >, < b, a, a > can be represented as one
relational tuple < X, Y, Y >.
– Multiple relation concepts (FOIL, Inductive Logic Programming,
see Lecture Notes - Chapter 11). Example:

  grandfather(X, Z) ← father(X, Y) ∧ (father(Y, Z) ∨ mother(Y, Z))
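Here is the sketch referred to above: a minimal Python illustration of the Closed World Assumption and of matching the relational pattern < X, Y, Y >; the instance set is made up.

```python
# Closed World Assumption and the relational pattern <X, Y, Y>.
all_instances = {("a", "b", "b"), ("a", "c", "c"), ("b", "a", "a"), ("a", "b", "c")}

positive = {t for t in all_instances if t[1] == t[2]}   # instances matching <X, Y, Y>
negative = all_instances - positive                     # CWA: {negative} = {all} \ {positive}

print(positive)   # {('a', 'b', 'b'), ('a', 'c', 'c'), ('b', 'a', 'a')}
print(negative)   # {('a', 'b', 'c')}
```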
5.3 Attributes (features)
• Predefined set of features to describe an instance.
• Nominal (categorical, enumerated, discrete) attributes:
– Values are distinct symbols.
– No relation among nominal values.
– Only equality test can be performed.
– Special case: boolean attributes, transforming nominal to boolean.
• Structured:
– Partial order among nominal values
– Example: concept hierarchy
• Numeric:
– Continuous: full order (e.g. integer or real numbers).
– Interval: partial order.
5.4 Output knowledge representation
• Association rules
• Decision trees
• Classification rules
• Rules with relations
• Prediction schemes:
– Nearest neighbor
– Bayesian classification
– Neural networks
– Regression
• Clusters:
– Type of grouping: partitions/hierarchical
– Grouping or describing: agglomerative/conceptual
– Type of descriptions: statistical/structural
6 Visualization techniques: Why visualize data?
• Identifying problems:
– Histograms for nominal attributes: is the distribution consistent
with background knowledge?
– Graphs for numeric values: detecting outliers.
• Visualization shows dependencies
• Consulting domain experts
• If there is too much data, take a sample
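A minimal sketch of these checks, assuming pandas and matplotlib are available; the data frame, its columns, and the sampling fraction are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; the column names ("color", "price") and values are made up.
df = pd.DataFrame({"color": ["red", "red", "blue", "green"],
                   "price": [9.5, 10.0, 250.0, 9.9]})

sample = df.sample(frac=0.5, random_state=0)   # if data are too large, work on a sample

df["color"].value_counts().plot(kind="bar")    # histogram for a nominal attribute
plt.show()

df["price"].plot(kind="box")                   # boxplot of a numeric attribute: 250.0 stands out as an outlier
plt.show()
```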