Module - 3
By
Dr.Ramkumar.T
ramkumar.thirunavukarasu@vit.ac.in
Data mining knowledge representation
What Defines a Data Mining Task?
• Task relevant data: where and how to retrieve the data
to be used for mining
• Background knowledge: Concept hierarchies
• Interestingness measures: informal and formal
selection techniques to be applied to the output
knowledge
• Representing input data and output knowledge: the
structures used to represent the input and the output of
the data mining techniques
• Visualization techniques: needed to best view and
document the results of the whole process
Task relevant data
• Database or data warehouse name: where to find the
data
• Database tables or data warehouse cubes
• Condition for data selection, relevant attributes or
dimensions and data grouping criteria: all this is used in
the SQL query to retrieve the data
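A minimal sketch of how these three items could come together in one SQL query, using Python's built-in sqlite3 module. The table name sales, its columns and the sample rows are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical in-memory database standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (city TEXT, category TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Vellore", "books", 2015, 120.0),
        ("Vellore", "books", 2016, 80.0),
        ("Chennai", "books", 2016, 200.0),
        ("Chennai", "toys", 2016, 50.0),
        ("Chennai", "books", 2014, 75.0),
    ],
)

# Relevant attributes: city and amount.
# Selection condition: year >= 2015 and category = 'books'.
# Grouping criterion: city.
query = """
    SELECT city, SUM(amount) AS total_amount
    FROM sales
    WHERE year >= ? AND category = ?
    GROUP BY city
    ORDER BY city
"""
task_relevant = conn.execute(query, (2015, "books")).fetchall()
print(task_relevant)
conn.close()
```

The WHERE clause carries the selection condition, the SELECT list names the relevant attributes, and GROUP BY expresses the grouping criterion.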
Background knowledge: Concept hierarchies
• The concept hierarchies are induced by a partial order
over the values of a given attribute. Depending on the
type of the ordering relation, we distinguish several
types of concept hierarchies.
– Schema hierarchy - Relating concept generality. The
ordering reflects the generality of the attribute values.
Example : street < city < state < country.
– Set-grouping hierarchy - The ordering relation is the subset
relation (⊆). Applies to set values. Example: {13, ..., 39} =
young; {13, ..., 19} = teenage; {13, ..., 19} ⊆ {13, ..., 39} ⇒
teenage < young.
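The two hierarchies above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the schema order is stored as a list, and the set-grouping order relies on Python's built-in proper-subset operator for sets.

```python
# Schema hierarchy: attribute levels listed from most specific to most general.
SCHEMA_LEVELS = ["street", "city", "state", "country"]


def is_more_specific(level_a, level_b):
    """True if level_a < level_b in the schema hierarchy."""
    return SCHEMA_LEVELS.index(level_a) < SCHEMA_LEVELS.index(level_b)


# Set-grouping hierarchy: each concept is a set of attribute values.
GROUPS = {
    "teenage": set(range(13, 20)),  # {13, ..., 19}
    "young": set(range(13, 40)),    # {13, ..., 39}
}


def is_below(concept_a, concept_b):
    """True if concept_a < concept_b, i.e. its value set is a proper subset."""
    return GROUPS[concept_a] < GROUPS[concept_b]


def generalize(age):
    """Concepts that cover an age, most specific (smallest set) first."""
    covering = [name for name, values in GROUPS.items() if age in values]
    return sorted(covering, key=lambda name: len(GROUPS[name]))


print(is_more_specific("street", "country"))
print(is_below("teenage", "young"))
print(generalize(15))
```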
Background knowledge: Concept hierarchies
– Operation-derived hierarchy - Produced by applying an
operation (encoding, decoding, information extraction)
Example : markovz@cs.ccsu.edu
instantiates the hierarchy user-name < department < university
– Rule-based hierarchy - Using rules to define the partial
order.
Example : if antecedent then consequent (antecedent → consequent)
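The operation-derived example can be sketched as a small information-extraction step. The function name email_hierarchy is hypothetical, and the code assumes an address of the form user@department.university.tld, which matches the example address but will not hold for every e-mail address.

```python
def email_hierarchy(address):
    """Extract the levels user-name < department < university from an address.

    Assumption: the domain has the form department.university.tld.
    """
    user, separator, domain = address.partition("@")
    labels = domain.split(".")
    if not separator or not user or len(labels) < 3:
        raise ValueError(f"unexpected address format: {address!r}")
    return {
        "user_name": user,
        "department": labels[0],
        "university": labels[1],
    }


print(email_hierarchy("markovz@cs.ccsu.edu"))
```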
Interestingness measures
• Criteria to evaluate hypotheses (knowledge extracted from data
when applying data mining techniques).
Example :
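As one possible illustration (an assumption, not taken from the slides), support and confidence are two widely used formal measures for a rule of the form antecedent → consequent. The sketch below computes both over a tiny, made-up list of transactions; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Made-up transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

A rule would then be kept as interesting only if both values exceed thresholds chosen by the analyst.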
Representing input data and output
knowledge
• Concepts (classes, categories, hypotheses): things to be
mined/learned
• Instances (examples, tuples, transactions)
• Attributes (Features)
• Output Knowledge Representations
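A minimal sketch of how these four notions might look in code. Everything here is hypothetical: the CarInstance fields, the sample values and the rule are invented for illustration and are not taken from the slides.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CarInstance:
    """One instance (example, tuple) described by its attributes."""
    mpg: float         # numeric attribute
    cylinders: int     # numeric (discrete) attribute
    origin: str        # nominal attribute
    economical: bool   # the concept (class) to be learned


CLASS_ATTRIBUTE = "economical"
ATTRIBUTES = [f.name for f in fields(CarInstance) if f.name != CLASS_ATTRIBUTE]

# Made-up instances, for illustration only.
instances = [
    CarInstance(mpg=18.0, cylinders=8, origin="US", economical=False),
    CarInstance(mpg=33.5, cylinders=4, origin="Japan", economical=True),
    CarInstance(mpg=27.0, cylinders=4, origin="Europe", economical=False),
]


def rule_economical(instance):
    """Output knowledge written as a rule: if mpg >= 30 then economical."""
    return instance.mpg >= 30.0


predictions = [rule_economical(i) for i in instances]
print(ATTRIBUTES, predictions)
```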
Visualization techniques
• Visualization techniques enable us to visually identify
trends, ranges, frequency distributions, relationships,
outliers and make comparisons
• Some of the common graphs used in exploratory data
analysis and data mining are
• Frequency Polygrams and Histograms
• Scatterplots
• Box Plots
• Multiple Graphs
Frequency polygrams
• Frequency polygrams (more commonly called frequency
polygons) plot the number of observations reported for
each value (or range of values) of a particular variable
(usually a continuous variable)
• The shape of the plot reveals trends
Frequency polygram displaying a count for cars per year
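The points of such a plot can be computed with a simple count per value. The sketch below uses only the standard library; the list of model years is made up, and drawing the line through the points is left to whatever plotting tool is at hand.

```python
from collections import Counter

# Made-up model years, one entry per observed car.
model_years = [1970, 1970, 1971, 1972, 1972, 1972, 1973, 1975]

counts = Counter(model_years)

# Include years with no observations so the line drops to zero there.
years = range(min(model_years), max(model_years) + 1)
polygon_points = [(year, counts[year]) for year in years]

for year, count in polygon_points:
    print(year, count)
```

Connecting consecutive points with straight lines gives the polygon whose shape reveals the trend.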
Histograms
• Histograms provide a clear way of viewing the
frequency distribution for a single variable.
• Variables that are not continuous can also be shown as a
histogram
• The length of the bar is proportional to the size of the
group
• For continuous variables, a histogram can be very
useful in displaying the frequency distribution.
• The central values, the shape, the range of values as
well as any outliers can be identified through
Histograms
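For a continuous variable, the bars come from counting observations per range. A minimal sketch, assuming equal-width ranges that are closed on the left and open on the right; the function name histogram and the sample lengths are hypothetical.

```python
def histogram(values, bin_width, start):
    """Count values per range [lower, lower + bin_width), keyed by lower bound."""
    counts = {}
    for value in values:
        index = int((value - start) // bin_width)
        lower = start + index * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Made-up measurements; the last one is far from the rest.
lengths = [1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 3.5, 9.7]
bins = histogram(lengths, bin_width=1.0, start=1.0)

# Text rendering: bar length is proportional to the size of the group.
for lower, count in bins.items():
    print(f"[{lower:.1f}, {lower + 1.0:.1f})  {'#' * count}")
```

The isolated bar for the range starting at 9.0 is how an outlier typically shows up in a histogram.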
Various Histogram representations
Histogram showing categorical variable Diabetes
Histogram representing counts for ranges in the variable
Length
Histogram showing an outlier
Scatterplots
• Scatterplots can be used to identify whether any
relationship exists between two continuous variables
measured on a ratio or interval scale
• The two variables are plotted on the x- and y-axes. Each
point displayed on the scatterplot is a single observation
• The position of the point is determined by the value of
the two variables.
• Scatterplots allow you to see the type of relationship
that may exist between the two variables
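A numeric companion to the scatterplot is the Pearson correlation coefficient, which is not mentioned on the slides and is added here only as an illustration. It measures linear association, so it can miss a relationship that the plot makes obvious. The data values below are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Negative linear relationship: y falls by 5 for each unit of x.
weight = [1.0, 2.0, 3.0, 4.0, 5.0]
mpg = [40.0, 35.0, 30.0, 25.0, 20.0]
r_negative = pearson_r(weight, mpg)

# Nonlinear relationship (y = x squared) with no linear component.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]
r_nonlinear = pearson_r(xs, ys)

print(r_negative, r_nonlinear)
```

The second pair is perfectly related, yet its coefficient is zero, which is why looking at the scatterplot itself remains necessary.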
Various Scatter Plot representations
Scatterplot showing an outlier (X)
Scatterplot showing a nonlinear relationship
Scatterplot showing no discernible relationship
Scatterplot showing a negative relationship
Box Plots
• Box plots (also called box-and-whisker plots) provide a succinct
summary of the overall distribution for a variable
• Six values are displayed: the five-number summary (the lower
extreme, the lower quartile, the median, the upper quartile and
the upper extreme) together with the mean
• The values on the box plot are defined as follows:
– Lower extreme: The lowest value for the variable.
– Lower quartile: The point below which 25% of all observations fall.
– Median: The point below which 50% of all observations fall.
– Upper quartile: The point below which 75% of all observations fall.
– Upper extreme: The highest value for the variable.
– Mean: The average value for the variable.
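The values listed above can be computed with Python's statistics module. This is a sketch under stated assumptions: the MPG values are made up, and the quartiles use the "inclusive" interpolation method of statistics.quantiles, which is one of several quartile conventions and may differ slightly from the one used to draw a given box plot.

```python
import statistics


def box_plot_summary(values):
    """Return the values shown on a box plot, plus the mean."""
    lower_q, median, upper_q = statistics.quantiles(
        values, n=4, method="inclusive"
    )
    return {
        "lower_extreme": min(values),
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "upper_extreme": max(values),
        "mean": statistics.mean(values),
    }


# Made-up MPG values; the last one is much larger than the rest.
mpg_values = [10, 12, 14, 16, 18, 20, 22, 24, 46]
summary = box_plot_summary(mpg_values)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Because of the single large value, the mean lies above the median here, which hints at a distribution skewed toward high values.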
Box Plot representation
A standard Box Plot
Box Plot for the Variable MPG
