1
UNIT I
DATA WAREHOUSING
2
DATA WAREHOUSING (Unit - I)
❑Data Warehouse and OLAP Technology:
○ 1.1 An Overview: Data Warehouse
○ 1.2 Data Warehouse Architecture
○ 1.3 A Multidimensional Data Model
○ 1.4 Data Warehouse Implementation
○ 1.5 From Data Warehousing to Data
Mining. (Han & Kamber)
3
Data Warehouse Overview
4
What is a Data Warehouse?
■ Data warehousing provides architectures and tools for business
executives to systematically organize, understand, and use their data
to make strategic decisions.
■ Data warehouse refers to a data repository that is maintained
separately from an organization’s operational databases.
■ “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process.”
■ Data warehousing: The process of constructing and using data
warehouses
Data Warehouse—Subject-Oriented
5
■ Organized around major subjects, such as customer,
product, sales
■ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
■ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
Data Warehouse—Integrated
6
■ Constructed by integrating multiple, heterogeneous data
sources
■ relational databases, flat files, on-line transaction
records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
7
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse contains an
element of time, explicitly or implicitly. But the key of
operational data may or may not contain “time element”
Data Warehouse—Nonvolatile
8
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
OLTP vs OLAP
9
■ User & system orientation: OLTP is customer oriented (transaction and query
processing); OLAP is market oriented (data analysis by managers, executives,
and analysts).
■ Data contents: OLTP manages current data, often too detailed; OLAP manages
large amounts of data (summarization and aggregation).
■ Database design: OLTP uses an ER data model and an application-oriented
database design; OLAP uses a star or snowflake model and a subject-oriented
database design.
■ View: OLTP focuses on current data within an enterprise or department; OLAP
spans multiple versions of a database schema (an evolutionary process) and
data from different organizations and many data stores.
■ Access patterns: OLTP consists of short, atomic transactions (requires
concurrency control and recovery); OLAP consists of mostly read-only
operations (complex queries).
15
Data Warehouse Architecture
16
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
■ Major task of traditional relational DBMS
■ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
■ OLAP (on-line analytical processing)
■ Major task of data warehouse system
■ Data analysis and decision making
■ Distinct features (OLTP vs. OLAP):
■ User and system orientation: customer vs. market
■ Data contents: current, detailed vs. historical, consolidated
■ Database design: ER + application vs. star + subject
■ View: current, local vs. evolutionary, integrated
■ Access patterns: update vs. read-only but complex queries
17
Why Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from heterogeneous
sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
18
Data Warehousing: A Multitiered Architecture
19
■ Bottom Tier:
■ Warehouse Database server
■ a relational database system
■ Back-end tools and utilities
■ data extraction
■ by using gateways (ODBC, JDBC, and OLE DB)
■ cleaning
■ transformation
■ load & refresh
Data Warehousing: A Multitiered Architecture
20
■ Middle Tier (OLAP server)
■ ROLAP - Relational OLAP
■ extended RDBMS that maps operations on
multidimensional data to standard relational
operations.
■ MOLAP - Multidimensional OLAP
■ Special-purpose server that directly implements
multidimensional data and operations.
■ Top Tier
■ Front-end Client Layer
■ Query and reporting tools, analysis tools and
data mining tools.
Data Warehousing: A Multitiered Architecture
21
■ Data Warehouse Models:
■ Enterprise warehouse:
■ collects all of the information about subjects
spanning the entire organization.
■ corporate-wide data integration
■ can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond.
■ implemented on mainframes, computer
superservers, or parallel architecture platforms
Data Warehousing: A Multitiered Architecture
22
■ Data Warehouse Models:
■ Data mart: a subset of corporate-wide data that is of value
to a specific group of users
■ confined to specific selected subjects.
■ Example - marketing data mart may confine its subjects to
customer, item, and sales.
■ implemented on low-cost departmental servers
■ Independent Data mart - data captured from
■ one or more operational systems or external information
providers,
or
■ from data generated locally within a particular department or
geographic area.
■ Dependent Data mart - sourced directly from enterprise data
warehouses.
Data Warehousing: A Multitiered Architecture
23
■ Data Warehouse Models:
■ Virtual warehouse:
■ A virtual warehouse is a set of views over operational
databases.
■ easy to build but requires excess capacity on operational
database servers.
Data Warehousing: A Multitiered Architecture
24
■ Data extraction: gathers data from multiple,
heterogeneous, and external sources.
■ Data Cleaning: detects errors in the data and
rectifies them when possible
■ Data transformation: converts data from
legacy or host format to warehouse format.
■ Load: sorts, summarizes, consolidates,
computes views, checks integrity, and builds
indices and partitions.
■ Refresh: propagates the updates from the data
sources to the warehouse.
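To make the back-end flow concrete, here is a minimal Python sketch of extract, clean, transform, and load for one feed; the field names and sample records are invented and not taken from any particular tool.

    # Minimal ETL sketch (illustrative only; field names and sample rows are made up).

    def extract():
        # Extraction: a real system would pull rows via ODBC/JDBC/OLE DB gateways;
        # here we return a small in-memory sample of "operational" records.
        return [
            {"txn_id": "T1", "amount": "120.50", "city": " vancouver "},
            {"txn_id": "T2", "amount": "80.00",  "city": "Toronto"},
            {"txn_id": "T2", "amount": "80.00",  "city": "Toronto"},   # duplicate
            {"txn_id": "T3", "amount": "",       "city": "Vancouver"}, # missing amount
        ]

    def clean(rows):
        # Cleaning: drop rows with missing values and remove duplicate transaction ids.
        seen, out = set(), []
        for r in rows:
            if r["amount"] and r["txn_id"] not in seen:
                seen.add(r["txn_id"])
                out.append(r)
        return out

    def transform(rows):
        # Transformation: convert legacy/host formats to the warehouse format
        # (cast amounts to float, standardize city names).
        return [{"txn_id": r["txn_id"],
                 "amount": float(r["amount"]),
                 "city": r["city"].strip().title()} for r in rows]

    def load(rows, warehouse):
        # Load: append to the fact table and maintain a simple per-city summary,
        # standing in for sorting, summarizing, consolidating, and index building.
        warehouse.setdefault("fact_sales", []).extend(rows)
        summary = warehouse.setdefault("sales_by_city", {})
        for r in rows:
            summary[r["city"]] = summary.get(r["city"], 0.0) + r["amount"]

    warehouse = {}
    load(transform(clean(extract())), warehouse)   # "refresh" would rerun this on newly arrived data
    print(warehouse["sales_by_city"])              # {'Vancouver': 120.5, 'Toronto': 80.0}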
Data Warehousing: A Multitiered Architecture
25
Metadata Repository:
metadata are the data that define warehouse
objects
It consists of:
1) Data warehouse structure
2) Operational metadata
3) algorithms used for summarization
4) Mapping from the operational environment to
the data warehouse
5) Data related to system performance
6) Business metadata
Data Warehousing: A Multitiered Architecture
26
Metadata Repository:
■ data warehouse structure
i) warehouse schema,
ii) view, dimensions,
iii) hierarchies, and
iv) derived data definitions,
v) data mart locations and contents.
■ Operational metadata
i) data lineage (history of migrated data and the
sequence of transformations applied to it),
ii) currency of data (active, archived, or purged),
iii) monitoring information (warehouse usage
statistics, error reports, and audit trails).
Data Warehousing: A Multitiered Architecture
27
Metadata Repository:
■ The algorithms used for summarization,
i) measure and dimension definition algorithms,
ii) data on granularity,
iii) partitions,
iv) subject areas,
v) aggregation,
vi) summarization, and
vii) predefined queries and reports.
Data Warehousing: A Multitiered Architecture
28
Metadata Repository:
■ Mapping from the operational environment to the
data warehouse
i) source databases and their contents,
ii) gateway descriptions,
iii) data partitions,
iv) data extraction, cleaning, transformation rules and
defaults
v) data refresh and purging rules, and
vi) security (user authorization and access control).
Data Warehousing: A Multitiered Architecture
29
Metadata Repository:
■ Data related to system performance
■ indices and profiles that improve data access and
retrieval performance,
■ rules for the timing and scheduling of refresh,
update, and replication cycles.
■ Business metadata,
■ business terms and definitions,
■ data ownership information, and
■ charging policies
30
A Multidimensional Data Model
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
31
■ A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
■ Dimensions are the perspectives or entities with
respect to which an organization wants to keep
records.
■ Example:-
■ AllElectronics may create a sales data warehouse
■ time, item, branch, and location - These
dimensions allow the store to keep track of things
like monthly sales of items and the branches and
locations at which the items were sold.
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
32
■ Each dimension may have a table associated with it, called
a dimension table, which further describes the
dimension.
■ For example - a dimension table for item may contain the
attributes item name, brand, type.
■ A multidimensional data model is typically organized
around a central theme, such as sales. This theme is
represented by a fact table.
■ Facts are numeric measures.
■ The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension
tables.
33
Data Cube: A Multidimensional Data Model
■ A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
■ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
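A small pandas sketch of this layout (all table contents invented): a fact table carrying the dollars_sold measure plus foreign keys, joined to its dimension tables to answer a question such as sales per brand per quarter.

    # Sketch: a fact table holds the measure plus keys to dimension tables; dimension
    # tables describe the dimensions. Joining them answers "sales per brand per quarter".
    import pandas as pd

    item_dim = pd.DataFrame({
        "item_key":  [1, 2, 3],
        "item_name": ["TV", "Laptop", "Phone"],
        "brand":     ["Acme", "Acme", "Globex"],
        "type":      ["home_entertainment", "computer", "phone"],
    })
    time_dim = pd.DataFrame({
        "time_key": [10, 11],
        "quarter":  ["Q1", "Q2"],
        "year":     [2010, 2010],
    })
    sales_fact = pd.DataFrame({          # measure: dollars_sold; keys: item_key, time_key
        "item_key":     [1, 2, 2, 3],
        "time_key":     [10, 10, 11, 11],
        "dollars_sold": [605.0, 825.0, 968.0, 38.0],
    })

    # Star join: bring in the descriptive attributes, then aggregate the measure.
    joined = sales_fact.merge(item_dim, on="item_key").merge(time_dim, on="time_key")
    print(joined.groupby(["brand", "quarter"])["dollars_sold"].sum())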
34
Data Cube: A Multidimensional Data Model
■ A data cube is a lattice of cuboids
■ A data warehouse is usually modeled by a multidimensional data
structure, called a data cube, in which
■ each dimension corresponds to an attribute or a set of
attributes in the schema, and
■ each cell stores the value of some aggregate measure such as
count or sum(sales_amount).
■ A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
Data Cube: A Multidimensional Data Model
35
2-D View of Sales data
■ AllElectronics sales data for items sold per quarter in the city of Vancouver.
■ a simple 2-D data cube that is a table or spreadsheet for sales data from
AllElectronics
Data Cube: A Multidimensional Data Model
36
3-D View of a Sales data
The 3-D data in the table are represented as a series of 2-D tables
Data Cube: A Multidimensional Data Model
37
3D Data Cube Representation of Sales data
we may also represent the same data in the form of a 3D data cube
Data Cube: A Multidimensional Data Model
38
4-D Data Cube Representation of Sales Data
we may display any n-dimensional data as a series of (n − 1)-dimensional
“cubes.”
39
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids forming a 4-D data cube for the dimensions time, item, location, and supplier. The 0-D (apex) cuboid "all" is at the top; below it are the 1-D cuboids (time), (item), (location), (supplier); the 2-D cuboids (time,item), (time,location), (time,supplier), (item,location), (item,supplier), (location,supplier); the 3-D cuboids (time,item,location), (time,item,supplier), (time,location,supplier), (item,location,supplier); and at the bottom the 4-D (base) cuboid (time, item, location, supplier).]
40
■ In data warehousing literature, an n-D base cube is called a base
cuboid.
■ The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid.
■ In our example, this is the total sales, or dollars sold,
summarized over all four dimensions.
■ The apex cuboid is typically denoted by all.
■ The lattice of cuboids forms a data cube.
Schemas for Multidimensional Data Models
41
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Schemas for Multidimensional Data Models
42
■ Star schema: In this, a data warehouse contains
(1) a large central table (fact table) containing the bulk
of the data, with no redundancy, and
(2) a set of smaller attendant tables
(dimension tables), one for each dimension.
■ Each dimension is represented by only one table.
■ Each table contains a set of attributes
■ Problem: redundancy in dimension tables.
■ e.g., the location dimension table will create redundancy
among the attributes province or state and country; that
is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).
43
Star schema
44
Snowflake schema
■ Variant of the star schema model
■ Dimension tables are normalized ( to remove
redundancy)
■ Dimension tables are split into additional tables.
■ The resulting schema graph forms a shape similar to a
snowflake.
■ Problem
■ more joins will be needed to execute a query ( affects
system performance)
■ so this is not as popular as the star schema in data
warehouse design.
45
Snowflake schema
46
Fact Constellation
● A fact constellation schema allows dimension tables to be
shared between fact tables
● A data warehouse collects information about subjects that
span the entire organization, such as customers, items,
sales, assets, and personnel, and thus its scope is
enterprise-wide.
● For data warehouses, the fact constellation
schema is commonly used.
● For data marts, the star or snowflake schema is
commonly used
47
Fact Constellation
This schema specifies two fact tables, sales and shipping; the
dimension tables for time, item, and location are shared between
the sales and shipping fact tables.
48
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
■ Just as relational query languages like SQL can be used
to specify relational queries, a data mining query
language (DMQL) can be used to specify data mining
tasks.
■ Data warehouses and data marts can be defined using
two language primitives, one for cube definition and
one for dimension definition.
49
Syntax for Cube and Dimension
Definition in DMQL
■ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
■ Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
■ Special Case (Shared Dimension Tables)
■ First time as “cube definition”
■ define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining Star Schema in DMQL
50
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
Defining Snowflake Schema in DMQL
51
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Defining Fact Constellation in DMQL
52
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Concept Hierarchies
Courtesy: Data Mining: Concepts and Techniques, 3rd Edition (Morgan Kaufmann)
■ A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level.
■ concept hierarchy for the dimension location
53
Concept Hierarchies
■ A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema
hierarchy.
Courtesy: Data Mining: Concepts and Techniques, 3rd Edition (Morgan Kaufmann)
Concept Hierarchies
55
■ Concept hierarchies may also be defined by discretizing
or grouping values for a given dimension or attribute,
resulting in a set-grouping hierarchy.
■ A total or partial order can be defined among groups of
values.
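A concept hierarchy can be held as simple mappings from lower-level to higher-level values; the sketch below (invented location values) rolls city-level sales up to province and country level, and shows a set-grouping style discretization for a numeric attribute.

    # Sketch: a location concept hierarchy as plain dictionaries (values invented),
    # used to roll sales up from city level to country level.
    city_to_province = {"Vancouver": "British Columbia", "Victoria": "British Columbia",
                        "Toronto": "Ontario", "Ottawa": "Ontario"}
    province_to_country = {"British Columbia": "Canada", "Ontario": "Canada"}

    sales_by_city = {"Vancouver": 1200, "Victoria": 300, "Toronto": 950, "Ottawa": 400}

    def roll_up(values_by_member, hierarchy_map):
        # Aggregate values from one level of the hierarchy to the next higher level.
        rolled = {}
        for member, value in values_by_member.items():
            parent = hierarchy_map[member]
            rolled[parent] = rolled.get(parent, 0) + value
        return rolled

    sales_by_province = roll_up(sales_by_city, city_to_province)       # {'British Columbia': 1500, 'Ontario': 1350}
    sales_by_country = roll_up(sales_by_province, province_to_country) # {'Canada': 2850}

    # A set-grouping hierarchy discretizes a numeric attribute into ranges, e.g. for price:
    def price_group(price):
        return "low" if price < 100 else "medium" if price < 500 else "high"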
Measures of Data Cube: Three
Categories
56
■ A multidimensional point in the data cube space can be
defined by a set of dimension-value pairs,
for example, 〈time = “Q1”, location = “Vancouver”,
item = “computer”〉.
■ A data cube measure is a numerical function that can be
evaluated at each point in the data cube space.
■ A measure value is computed for a given point by
aggregating the data corresponding to the respective
dimension-value pairs defining the given point.
■ Based on the kind of aggregate functions used, measures
can be organized into three categories : distributive,
algebraic, holistic
Measures of Data Cube: Three
Categories
■ Distributive: An aggregate function is distributive if the result
derived by applying the function to n aggregate values is the same
as that derived by applying the function to all of the data without
partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: An aggregate function is algebraic if it can be
computed by an algebraic function with M arguments (where M
is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
■ E.g., avg()=sum()/count(), min_N(), standard_deviation()
■ Holistic: An aggregate function is holistic if there is no constant
bound on the storage size and there does not exist an algebraic
function with M arguments (where M is a constant) that
characterizes the computation.
■ E.g., median(), mode(), rank()
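The distinction matters when data is aggregated per partition and then combined: distributive and algebraic measures can be assembled from partial results, while a holistic measure such as median cannot. A small Python sketch with made-up numbers:

    # Partition the data, aggregate each partition, then combine the partial results.
    # sum() is distributive; avg() is algebraic (combine (sum, count) pairs); median() is holistic.
    import statistics

    partitions = [[4, 8, 15], [16, 23], [42]]          # made-up values split across 3 chunks
    all_data = [x for p in partitions for x in p]

    # Distributive: the sum of partial sums equals the sum over all the data.
    assert sum(sum(p) for p in partitions) == sum(all_data)

    # Algebraic: avg can be computed from M=2 distributive results per partition: (sum, count).
    partials = [(sum(p), len(p)) for p in partitions]
    avg = sum(s for s, _ in partials) / sum(c for _, c in partials)
    assert avg == sum(all_data) / len(all_data)

    # Holistic: the median of partition medians is NOT, in general, the overall median.
    median_of_medians = statistics.median(statistics.median(p) for p in partitions)
    print(median_of_medians, statistics.median(all_data))   # 19.5 vs 15.5 here: they differ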
58
Typical OLAP Operations
■ Roll up (drill-up):
■ Drill down (roll down):
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: Allows users to analyze the same data through
different reports, analyze it with different features and even display it
through different visualization methods
59
Fig. 3.10 Typical OLAP
Operations
Source & Courtesy: https://www.javatpoint.com/olap-operations
60
Typical OLAP Operations:Roll Up/Drill Up
■ summarize data
■ by climbing up
hierarchy
or
■ by dimension
reduction
Source & Courtesy: https://www.javatpoint.com/olap-operations
61
Typical OLAP Operations:Roll Down
■ reverse of roll-up
■ from higher
level summary
to lower level
summary or
detailed data, or
introducing new
dimensions
Typical OLAP Operations:Slicing
● Slice is the act of picking a rectangular subset of a cube by choosing a single
value for one of its dimensions, creating a new cube with one fewer
dimension.
● Example: The sales figures of all sales regions and all product categories of
the company in the year 2005 and 2006 are "sliced" out of the data cube.
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
62
Typical OLAP Operations:Slicing
Slicing:
It selects a single
dimension from the OLAP
cube which results in a new
sub-cube creation.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Dice
● Dice: The dice operation produces a subcube by allowing the analyst to pick
specific values of multiple dimensions
● The picture shows a dicing operation: The new cube shows the sales figures
of a limited number of product categories, the time and region dimensions
cover the same range as before.
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
64
Typical OLAP Operations:Dicing
Dice:
It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Pivot
66
Pivot allows an analyst to rotate the cube in space to see its various faces. For
example, cities could be arranged vertically and products horizontally while viewing
data for a particular quarter.
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
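These operations map naturally onto ordinary group-by, filter, and pivot manipulations; the pandas sketch below (invented sales rows) mimics roll-up, drill-down, slice, dice, and pivot.

    # Mimicking OLAP operations with pandas on a tiny, invented sales data set.
    import pandas as pd

    sales = pd.DataFrame({
        "year":     [2005, 2005, 2006, 2006, 2006],
        "quarter":  ["Q1", "Q2", "Q1", "Q1", "Q3"],
        "region":   ["East", "West", "East", "West", "East"],
        "category": ["Phones", "Phones", "Laptops", "Laptops", "Phones"],
        "dollars":  [100, 150, 200, 120, 90],
    })

    # Roll-up: climb the time hierarchy from quarter to year (quarter is aggregated away).
    by_year = sales.groupby(["year", "region"])["dollars"].sum()

    # Drill-down: go back to the more detailed quarter level.
    by_quarter = sales.groupby(["year", "quarter", "region"])["dollars"].sum()

    # Slice: fix a single value of one dimension (year = 2006), leaving a smaller subcube.
    slice_2006 = sales[sales["year"] == 2006]

    # Dice: select values on two or more dimensions at once.
    dice = sales[(sales["year"].isin([2005, 2006])) & (sales["category"] == "Phones")]

    # Pivot: rotate the view so regions become columns and categories become rows.
    pivoted = sales.pivot_table(index="category", columns="region", values="dollars", aggfunc="sum")
    print(pivoted)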
A Star-Net Query Model
67
● The querying of multidimensional databases can be based
on a starnet model.
● It consists of radial lines emanating from a central point,
where each line represents a concept hierarchy for a
dimension.
● Each abstraction level in the hierarchy is called a footprint.
● These represent the granularities available for use by OLAP
operations such as drill-down and roll-up.
A Star-Net Query Model
68
A Star-Net Query Model
69
■ Four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time,
respectively
■ footprints representing abstraction levels of the
dimension - time line has four footprints: “day,”
“month,” “quarter,” and “year.”
■ Concept hierarchies can be used to generalize data by
replacing low-level values (such as “day” for the time
dimension) by higher-level abstractions (such as “year”)
or
■ to specialize data by replacing higher-level abstractions
with lower-level values.
Data Warehouse Design and Usage
70
A Business Analysis Framework for Data
Warehouse Design:
■ To design an effective data warehouse we need to
understand and analyze business needs and construct a
business analysis framework.
■ Different views are combined to form a complex
framework.
Data Warehouse Design and Usage
71
■ Four different views regarding a data warehouse design
must be considered:
■ Top-down view
■ allows the selection of the relevant information
necessary for the data warehouse (matches current
and future business needs).
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems.
■ Documented at various levels of detail and accuracy,
from individual data source tables to integrated data
source tables.
■ Modeled with the ER model or CASE (computer-aided
software engineering) tools.
Data Warehouse Design and Usage
72
■ Data warehouse view
■ includes fact tables and dimension tables.
■ It represents the information that is stored inside the
data warehouse, including
■ precalculated totals and counts,
■ information regarding the source, date, and time
of origin, added to provide historical context.
■ Business query view
■ is the data perspective in the data warehouse from
the end-user’s viewpoint.
Data Warehouse Design and Usage
■ Skills required to build & use a Data warehouse
■ Business Skills
■ how systems store and manage their data,
■ how to build extractors (operational DBMS to DW)
■ how to build warehouse refresh software(update)
■ Technology skills
■ the ability to discover patterns and trends,
■ to extrapolate trends based on history and look
for anomalies or paradigm shifts, and
■ to present coherent managerial recommendations
based on such analysis.
■ Program management skills
■ Interface with many technologies, vendors, and end-
users in order to deliver results in a timely and cost-effective manner
Data Warehouse Design and Usage
74
Data Warehouse Design Process
■ A data warehouse can be built using
■ Top-down approach (overall design and planning)
■ It is useful in cases where the technology is
mature and well known
■ Bottom-up approach (starts with experiments and prototypes)
■ a combination of both.
■ From the software engineering point of view, the design and
construction of a data warehouse may consist of the following
steps (waterfall model or spiral model):
■ planning,
■ requirements study,
■ problem analysis,
■ warehouse design,
■ data integration and testing, and
■ finally, deployment of the data warehouse.
■ Waterfall model: structured and systematic analysis at each
step before moving on, one step to the next.
■ Spiral model: rapid generation, with short intervals between
successive releases; a good choice for data warehouse
development, since the turnaround time is short, modifications
can be done quickly, and new designs and technologies can be
adapted in a timely manner.
Data Warehouse Design and Usage
75
Data Warehouse Design Process
■ 4 major Steps involved in Warehouse design are:
■ 1. Choose a business process to model (e.g., orders,
invoices, shipments, inventory, account administration,
sales, or the general ledger).
■ Data warehouse model - If the business process is
organizational and involves multiple complex object
collections
■ Data mart model - if the process is departmental and
focuses on the analysis of one kind of business
process
Data Warehouse Design and Usage
76
■ 2. Choose the business process grain
■ Fundamental, atomic level of data to be represented
in the fact table
■ (e.g., individual transactions, individual daily
snapshots, and so on).
■ 3. Choose the dimensions that will apply to each
fact table record.
■ Typical dimensions are time, item, customer, supplier,
warehouse, transaction type, and status.
■ 4. Choose the measures that will populate each
fact table record.
■ Typical measures are numeric additive quantities like
dollars sold and units sold.
Data Warehouse Design and Usage
77
Data Warehouse Usage for Information Processing
■ Evolution of DW takes place throughout a number of
phases.
■ Initial Phase - DW is used for generating reports and
answering predefined queries.
■ Progressively - to analyze summarized and detailed data,
(results are in the form of reports and charts)
■ Later - for strategic purposes, performing
multidimensional analysis and sophisticated slice-and-
dice operations.
■ Finally - for knowledge discovery and strategic decision
making using data mining tools.
78
Data Warehouse Implementation
79
Data warehouse implementation
■ OLAP servers demand that decision support queries be
answered in the order of seconds.
■ Methods for the efficient implementation of data
warehouse systems.
■ 1. Efficient data cube computation.
■ 2. OLAP data indexing (bitmap or join indices )
■ 3. OLAP query processing
■ 4. Various types of warehouse servers for OLAP
processing.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
80
■ Requires efficient computation of aggregations
across many sets of dimensions.
■ In SQL terms:
■ Aggregations are referred to as group-by’s.
■ Each group-by can be represented by a cuboid,
■ set of group-by’s forms a lattice of cuboids
defining a data cube.
■ Compute cube Operator - computes
aggregates over all subsets of the dimensions
specified in the operation.
■ require excessive storage space for large
number of dimensions.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
81
Example 4.6
■ create a data cube for AllElectronics sales that
contains the following:
■ city, item, year, and sales in dollars.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
82
■ What is the total number of cuboids, or group-
by’s, that can be computed for this data cube?
■ 3 attributes - city, item & year - 3 dimensions
■ sales in dollars - measure,
■ the total number of cuboids, or group-by's, is 2^3 = 8.
■ The possible group-by’s are the following:
■ {(city, item, year), (city, item), (city, year),
(item, year), (city), (item), (year), ()}
■ () - group-by is empty (i.e., the dimensions are not
grouped) - all.
■ group-by’s form a lattice of cuboids for the data cube
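The compute cube operator amounts to one group-by per subset of the dimensions; a sketch of that enumeration with pandas (sample rows invented):

    # Enumerate and compute all 2^n cuboids (group-by's) of a small sales table.
    from itertools import combinations
    import pandas as pd

    sales = pd.DataFrame({
        "city":  ["Vancouver", "Vancouver", "Toronto", "Toronto"],
        "item":  ["computer", "phone", "computer", "phone"],
        "year":  [2009, 2009, 2010, 2010],
        "sales_in_dollars": [1000, 250, 1200, 300],
    })

    dimensions = ["city", "item", "year"]
    cuboids = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            if dims:
                cuboids[dims] = sales.groupby(list(dims))["sales_in_dollars"].sum()
            else:
                # (): the apex cuboid "all" -- total sales over all dimensions.
                cuboids[dims] = sales["sales_in_dollars"].sum()

    print(len(cuboids))              # 8 cuboids for 3 dimensions (2^3)
    print(cuboids[("city", "year")]) # one of the 2-D cuboids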
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
83
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
84
■ Base cuboid contains all three dimensions(city, item, year)
■ returns - total sales for any combination of the three
dimensions.
■ This is the least generalized (most specific) of the cuboids.
■ Apex cuboid, or 0-D cuboid, refers to the case where
the group-by is empty (contains the total sum of all sales)
■ This is the most generalized (least specific) of the cuboids
■ Drilling down: start at the apex cuboid and explore
downward in the lattice
■ Rolling up: start at the base cuboid and explore upward
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
85
■ zero-dimensional operation:
■ An SQL query containing no group-by
■ Example - “compute the sum of total sales”
■ one-dimensional operation:
■ An SQL query containing one group-by
■ Example - “compute the sum of sales group-by city”
■ A cube operator on n dimensions is equivalent to a
collection of group-by statements, one for each subset of
the n dimensions.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ data cube could be defined as:
■ “define cube sales_cube [city, item, year]:
sum(sales_in_dollars)”
■ 2^n cuboids for a cube with n dimensions
■ “compute cube sales_cube” - statement
■ computes the sales aggregate cuboids for all eight
subsets of the set {city, item, year}, including the
empty subset.
■ In OLAP, for diff. queries diff. cuboids need to be
accessed.
■ Precomputation - compute in advance all or at least
some of the cuboids in a data cube
■ Curse of dimensionality - the required storage space
may explode if all the cuboids in a data cube are
precomputed (when there are many dimensions)
Data warehouse implementation:
87
1.4.1 Efficient Data Cube Computation
■ A data cube can be viewed as a lattice of cuboids
■ 2^n cuboids - when there is no concept hierarchy
■ How many cuboids are there in an n-dimensional cube with L levels?
■ Total number of cuboids T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1),
where Li is the number of levels associated with dimension i
(one is added to Li to include the virtual top level, all)
■ If the cube has 10 dimensions and each dimension has
five levels (including all), the total number of cuboids
that can be generated is 5^10 ≈ 9.8 × 10^6.
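A quick check of this count, assuming each of the 10 dimensions has four hierarchy levels plus the virtual level all (five per dimension in total):

    # Total cuboids = product over dimensions of (L_i + 1); here 10 dimensions,
    # L_i = 4 levels each, so 5 per dimension once "all" is included.
    from math import prod

    levels = [4] * 10
    total_cuboids = prod(L + 1 for L in levels)
    print(total_cuboids)        # 9765625, i.e. 5**10, roughly 9.8 × 10^6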
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
88
There are three choices for data cube
materialization for a given base cuboid:
■ 1. No materialization: Do not precompute -
expensive multidimensional aggregates -
extremely slow.
■ 2. Full materialization: Precompute all of the
cuboids - full cube - requires huge amounts of
memory space in order to store all of the
precomputed cuboids.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
89
■ 3. Partial materialization: Selectively compute a
proper subset of the whole set of possible cuboids.
■ compute a subset of the cube, which contains only those
cells that satisfy some user-specified criterion - subcube
■ 3 factors to consider:
■ (1) identify the subset of cuboids or subcubes to
materialize;
■ (2) exploit the materialized cuboids or subcubes
during query processing; and
■ (3) efficiently update the materialized cuboids or
subcubes during load and refresh.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
90
■ Partial Materialization: Selected Computation of Cuboids
■ The following should be taken into account when selecting
the subset of cuboids or subcubes:
■ the queries in the workload, their frequencies, and
their accessing costs
■ workload characteristics, the cost for incremental
updates, and the total storage requirements.
■ physical database design such as the generation and
selection of indices.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
91
■ Heuristic approaches for cuboid and subcube
selection
■ Iceberg cube:
■ data cube that stores only those cube cells with
an aggregate value (e.g., count) that is above
some minimum support threshold.
■ shell cube:
■ precomputing the cuboids for only a small number
of dimensions
Data warehouse implementation:
1.4.2 Indexing OLAP Data: Bitmap Index
92
Index structures - To facilitate efficient data accessing
■ Bitmap indexing method - it allows quick searching in
data cubes.
■ In the bitmap index for a given attribute, there is a
distinct bit vector, Bv, for each value v in the attribute’s
domain.
■ If a given attribute’s domain consists of n values, then n
bits are needed for each entry in the bitmap index (i.e.,
there are n bit vectors).
■ If the attribute has the value v for a given row in the
data table, then the bit representing that value is set to 1
in the corresponding row of the bitmap index. All other
bits for that row are set to 0.
Data warehouse implementation:
1.4.2 Indexing OLAP Data: Bitmap Index
93
● Example:- AllElectronics data warehouse
● dim(item)={H,C,P,S} - 4 values - 4 bit vectors
● dim(city)= {V,T} - 2 values - 2 bit vectors
● Better than Hash & Tree Indices but good for low
cardinality only (cardinality: the number of unique values in a database column)
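A minimal sketch of a bitmap index in Python; the item and city codes mirror the example above, and the rows themselves are invented.

    # Build a bitmap index: one bit vector per distinct value of an attribute.
    rows = [
        {"RID": 0, "item": "H", "city": "V"},
        {"RID": 1, "item": "C", "city": "V"},
        {"RID": 2, "item": "P", "city": "T"},
        {"RID": 3, "item": "S", "city": "T"},
        {"RID": 4, "item": "H", "city": "T"},
    ]

    def bitmap_index(rows, attribute):
        # For each value v in the attribute's domain, bit i is 1 iff row i has that value.
        domain = sorted({r[attribute] for r in rows})
        return {v: [1 if r[attribute] == v else 0 for r in rows] for v in domain}

    item_index = bitmap_index(rows, "item")   # 4 distinct values -> 4 bit vectors
    city_index = bitmap_index(rows, "city")   # 2 distinct values -> 2 bit vectors

    # Queries become bitwise operations, e.g. "item = H AND city = T":
    hits = [i for i, (b1, b2) in enumerate(zip(item_index["H"], city_index["T"])) if b1 & b2]
    print(hits)   # [4]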
94
Exercise: Bitmap index on CITY
Data warehouse implementation:
Indexing OLAP Data: Join Index
95
■ Traditional indexing maps the value in a given
column to a list of rows having that value.
■ Join indexing registers the joinable rows of
two relations from a relational database.
■ For example,
■ two relations - R(RID, A) and S(B, SID)
■ join on the attributes A and B,
■ join index record contains the pair (RID, SID),
■ where RID and SID are record identifiers from
the R and S relations, respectively
Data warehouse implementation:
Indexing OLAP Data: Join Index
96
■ Advantage:-
■ Identification of joinable tuples without performing
costly join operations.
■ Useful:-
■ To maintain the relationship between a foreign key (fact
table) and its matching primary keys (dimension table)
from the joinable relation.
■ Indexing maintains relationships between attribute
values of a dimension (e.g., within a dimension table)
and the corresponding rows in the fact table.
■ Composite join indices: Join indices with multiple
dimensions.
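A join index can be pictured as a precomputed list of joinable (SID, RID) pairs; the sketch below (invented keys and values) builds one between a location dimension table and a sales fact table and reuses it to fetch matching fact rows without re-running the join.

    # Build a join index registering joinable rows between a dimension table and a fact table.
    # Keys and values are invented for illustration.
    location_dim = [
        {"SID": "L1", "location_key": 10, "city": "Vancouver"},
        {"SID": "L2", "location_key": 20, "city": "Toronto"},
    ]
    sales_fact = [
        {"RID": "T57",  "location_key": 10, "dollars_sold": 605},
        {"RID": "T238", "location_key": 20, "dollars_sold": 968},
        {"RID": "T884", "location_key": 10, "dollars_sold": 38},
    ]

    # Join index on location: pairs (SID, RID) whose primary/foreign keys match.
    join_index = [(d["SID"], f["RID"])
                  for d in location_dim
                  for f in sales_fact
                  if d["location_key"] == f["location_key"]]
    print(join_index)   # [('L1', 'T57'), ('L1', 'T884'), ('L2', 'T238')]

    # Later, joinable fact rows for Vancouver (SID L1) can be fetched without redoing the join:
    vancouver_rids = [rid for sid, rid in join_index if sid == "L1"]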
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Example:-Star Schema
■ “sales_star [time, item, branch, location]: dollars_sold
= sum (sales_in_dollars).”
■ join index is relationship between
■ Sales fact table and
■ the location, item dimension tables
To speed up query processing, join indexing and bitmap indexing methods
can be integrated to form bitmapped join indices.
Data warehouse implementation:
Efficient processing of OLAP queries
98
Given materialized views, query processing should proceed as
follows:
■ 1. Determine which operations should be performed
on the available cuboids:
■ This involves transforming any selection, projection,
roll-up (group-by), and drill-down operations specified
in the query
into
corresponding SQL and/or OLAP operations.
■ Example:
■ slicing and dicing a data cube may correspond to
selection and/or projection operations on a
materialized cuboid.
Data warehouse implementation:
Efficient processing of OLAP queries
99
■ 2. Determine to which materialized cuboid(s) the
relevant operations should be applied:
■ pruning the set using knowledge of
“dominance” relationships among the cuboids,
■ estimating the costs of using the remaining
materialized cuboids, and selecting the cuboid with
the least cost.
Data warehouse implementation:
Efficient processing of OLAP queries
100
Example:-
■ define a data cube for AllElectronics of the
form “sales cube [time, item, location]:
sum(sales in dollars).”
■ dimension hierarchies
■ “day < month < quarter < year” for time;
■ “item_name < brand < type” for item
■ “street < city < province or state < country”
for location
■ Query:
■ {brand, province or state}, with the selection
constant “year = 2010.”
Data warehouse implementation:
Efficient processing of OLAP queries
101
■ suppose that there are four materialized cuboids
available, as follows:
■ cuboid 1: {year, item_name, city}
■ cuboid 2: {year, brand, country}
■ cuboid 3: {year, brand, province_or_state}
■ cuboid 4: {item_name, province_or_state}, where year = 2010
■ Which of these four cuboids should be selected
to process the query? Ans: 1,3,4
■ Low cost cuboid to process the query? Ans: 4
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
102
■ Relational OLAP (ROLAP) servers:
■ ROLAP uses relational tables to store data for
online analytical processing
■ Intermediate servers that stand in
between a relational back-end server and
client front-end tools.
■ Operation:
■ use a relational or extended-relational DBMS to
store and manage warehouse data
■ OLAP middleware to support missing pieces
■ ROLAP has greater scalability than MOLAP.
■ Example:-
■ DSS server of Microstrategy
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
103
■ Multidimensional OLAP (MOLAP) servers:
■ support multidimensional data views through array-
based multidimensional storage engines
■ maps multidimensional views directly to data cube
array structures.
■ Advantage:
■ fast indexing to precomputed summarized data.
■ adopt a two-level storage representation
■ Denser subcubes are stored as array structures
■ Sparse subcubes employ compression
technology
A sparse array is one that contains mostly zeros and few non-zero entries; a dense array contains mostly non-zero entries.
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
■ Hybrid OLAP (HOLAP) servers:
■ Combines ROLAP and MOLAP technology
■ benefits
■ greater scalability from ROLAP and
■ faster computation of MOLAP.
■ HOLAP server may allow
■ large volumes of detailed data to be stored in a
relational database,
■ while aggregations are kept in a separate MOLAP store.
■ Example:- Microsoft SQL Server 2000 (supports)
■ Specialized SQL servers:
■ provide advanced query language and query
processing support for SQL queries over star and
snowflake schemas in a read-only environment.
105
From Data Warehousing to Data Mining
106
From DataWarehousing to Data Mining
DataWarehouse Usage
■ Data warehouses and data marts are used in a
wide range of applications.
■ Business executives use the data in data warehouses
and data marts to perform data analysis and make
strategic decisions.
■ data warehouses are used as an integral part of a
plan-execute-assess “closed-loop” feedback
system for enterprise management.
■ Data warehouses are used extensively in banking and
financial services, consumer goods and retail
distribution sectors, and controlled manufacturing,
such as demand-based production.
DataWarehouse Usage
107
■ There are three kinds of data warehouse
applications:
■ information processing
■ analytical processing
■ data mining
DataWarehouse Usage
108
■ Information processing supports
■ querying,
■ basic statistical analysis, and
■ reporting using crosstabs, tables, charts, or
graphs.
■ Analytical processing supports
■ basic OLAP operations,
■ slice-and-dice, drill-down, roll-up, and pivoting.
■ It generally operates on historic data in both
summarized and detailed forms.
■ multidimensional data analysis
DataWarehouse Usage
109
■ Data mining supports
■ knowledge discovery by finding hidden
patterns and associations,
■ constructing analytical models,
■ performing classification and prediction, and
■ presenting the mining results using
visualization tools.
■ Note:-
■ Data mining is different from information
processing and analytical processing
From Online Analytical Processing
to Multidimensional Data Mining
110
■ On-line analytical mining (OLAM) (also called OLAP
mining) integrates on-line analytical processing (OLAP)
with data mining and mining knowledge in
multidimensional databases.
■ OLAM is particularly important for the following reasons:
■ High quality of data in data warehouses.
■ Available information processing infrastructure
surrounding data warehouses
■ OLAP-based exploratory data analysis:
■ On-line selection of data mining functions
Architecture for On-Line Analytical
Mining
111
■ An OLAM server performs analytical mining in data
cubes in a similar manner as an OLAP server performs
on-line analytical processing.
■ An integrated OLAM and OLAP architecture is shown in
Figure, where the OLAM and OLAP servers both accept
user on-line queries (or commands) via a graphical user
interface API and work with the data cube in the data
analysis via a cube API.
■ The data cube can be constructed by accessing and/or
integrating multiple databases via an MDDB API and/or
by filtering a data warehouse via a database API that may
support OLE DB or ODBC connections.
112
Data Mining
&
Motivating Challenges
UNIT - II
By
M. Rajesh Reddy
WHAT IS DATA MINING?
• Data mining is the process of automatically discovering
useful information in large data repositories.
• To find novel and useful patterns that might
otherwise remain unknown.
• provide capabilities to predict the outcome of a future
observation,
• Example
• predicting whether a newly arrived customer will spend
more than $100 at a department store.
WHAT IS DATA MINING?
• Not all information discovery tasks are considered to be
data mining.
• For example, tasks related to the area of information
retrieval.
• looking up individual records using a database
management system
or
• finding particular Web pages via a query to an Internet
search engine
• However, data mining techniques can be used to enhance information
retrieval systems.
WHAT IS DATA MINING?
Data Mining and Knowledge Discovery
• Data mining is an integral part of Knowledge Discovery in
Databases (KDD),
• process of converting raw data into useful
information
• This process consists of a series of transformation
steps
WHAT IS DATA MINING?
• Preprocessing - to transform the raw input data into an
appropriate format for subsequent analysis.
• Steps involved in data preprocessing
• Fusing (joining) data from multiple sources,
• cleaning data to remove noise and duplicate
observations
• selecting records and features that are relevant to the
data mining task at hand.
• most laborious and time-consuming step
WHAT IS DATA MINING?
• Post Processing:
• only valid and useful results are incorporated into the
decision support system.
• Visualization
• allows analysts to explore the data and the data
mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can
also be applied
• to eliminate spurious (false or fake) data mining
results.
Motivating Challenges:
• challenges that motivated the development of data
mining.
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
Motivating Challenges:
• Scalability
• Size of datasets are in the order of GB, TB or PB.
• special search strategies
• implementation of novel data structures ( for efficient
access)
• out-of-core algorithms - for large datasets
• sampling or developing parallel and distributed algorithms.
Motivating Challenges:
• High Dimensionality
• common today - data sets with hundreds or thousands
of attributes
• Example
• Bio-Informatics - microarray technology has
produced gene expression data involving
thousands of features.
• Data sets with temporal or spatial components
also tend to have high dimensionality.
• a data set that contains measurements of
temperature at various locations.
Motivating Challenges:
Heterogeneous and Complex Data
• Traditional data analysis methods deal with data sets whose attributes
are all of the same type - either continuous or categorical.
• Examples of such non-traditional types of data include
• collections of Web pages containing semi-structured
text and hyperlinks;
• DNA data with sequential and three-dimensional
structure and
• climate data with time series measurements
• DM should maintain relationships in the data, such as
• temporal and spatial autocorrelation,
• graph connectivity, and
• parent-child relationships between the elements in
semi-structured text and XML documents.
Motivating Challenges:
• Data Ownership and Distribution
• Data is not stored in one location or owned by one organization
• geographically distributed among resources belonging to multiple
entities.
• This requires the development of distributed data mining techniques.
• key challenges in distributed data mining algorithms
• (1) reduction in the amount of communication needed
• (2) effective consolidation of data mining results obtained from
multiple sources, and
• (3) Data security issues.
Motivating Challenges:
• Non-traditional Analysis:
• Traditional statistical approach: hypothesize-and-test paradigm.
• A hypothesis is proposed,
• an experiment is designed to gather the data, and
• then the data is analyzed with respect to the hypothesis.
• Current data analysis tasks
• Generation and evaluation of thousands of hypotheses,
• Some DM techniques automate the process of hypothesis
generation and evaluation.
• Some data sets frequently involve non-traditional types of data
and data distributions.
Origins of Data mining,
Data mining Tasks
&
Types of Data
Unit - II
DWDM
The Origins of Data Mining
Data mining draws upon ideas, such as
■ (1) sampling, estimation, and hypothesis testing from statistics and
■ (2) search algorithms, modeling techniques, and learning theories from
artificial intelligence, pattern recognition, and machine learning.
The Origins of Data Mining
■ adopt ideas from other areas, including
– optimization,
– evolutionary computing,
– information theory,
– signal processing,
– visualization, and
– information retrieval
The Origins of Data Mining
■ An optimization algorithm is a procedure which is executed iteratively by
comparing various solutions till an optimum or a satisfactory solution is
found.
■ Evolutionary Computation is a field of optimization theory where instead of
using classical numerical methods to solve optimization problems, we use
inspiration from biological evolution to ‘evolve’ good solutions
– Evolution can be described as a process by
which individuals become ‘fitter’ in different
environments through adaptation,
natural selection, and selective breeding.
[Picture: the famous finches Charles Darwin depicted in his journal]
The Origins of Data Mining
■ Information theory is the scientific study of the quantification, storage,
and communication of digital information.
■ The field was fundamentally established by the works of Harry
Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s.
■ The field is at the intersection of probability theory, statistics, computer
science, statistical mechanics, information engineering, and electrical
engineering.
The Origins of
Data Mining
■ Other Key areas:
– database systems
■ to provide support for efficient storage, indexing, and query processing.
– Techniques from high performance (parallel) computing
■ addressing the massive size of some data sets.
– Distributed techniques
■ also help address the issue of size and are essential when the data cannot
be gathered in one location.
Data Mining Tasks
■ Data mining tasks are generally divided into two major categories:
– Predictive tasks. - Use some variables to predict unknown or future
values of other variables
■ Task Objective: predict the value of a particular attribute based on the
values of other attributes.
■ Target/Dependent Variable: attribute to be predicted
■ Explanatory or independent variables: attributes used for making the
prediction
– Descriptive tasks. - Find human-interpretable patterns that
describe the data.
■ Task objective: derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data.
■ Descriptive data mining tasks are often exploratory in nature and
frequently require post processing techniques to validate and explain the
results.
Data Mining Tasks
■ Correlation is a statistical term describing the degree to which two variables
move in coordination with one another.
■ Trends: a general direction in which something is developing or
changing.
■ Clusters: Clustering is the task of dividing data points into a number of
groups such that data points in the same group are more similar to other
data points in the same group than to those in other groups.
(https://www.javatpoint.com/data-mining-cluster-analysis)
■ Trajectory data mining enables prediction of the moving-location details of
humans, vehicles, animals, and so on.
■ Anomaly detection is a step in data mining that identifies data points,
events, and/or observations that deviate from a dataset's normal behavior.
[Figure: Data Mining Tasks - the core data mining tasks illustrated around a sample data set. Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar]
Data Mining Tasks
■ Predictive modeling refers to the task of building a model for the target variable as a
function of the explanatory variables.
■ 2 types of predictive modeling tasks:
– Classification: Used for discrete target variables
– Regression: used for continuous target variables.
– Example:
■ Classification task: predicting whether a Web user will make a purchase at an online
bookstore is a classification task because the target variable is binary-valued.
■ Regression Task: forecasting the future price of a stock is a regression task because price
is a continuous-valued attribute.
– Goal of both tasks: learn a model that minimizes the error between the predicted and
true values of the target variable.
– Predictive modeling can be used to:
■ identify customers that will respond to a marketing campaign,
■ predict disturbances in the Earth’s ecosystem, or
■ judge whether a patient has a particular disease based on the results of medical tests.
Data Mining Tasks
■ Example: (Predicting the Type of a Flower): the task of predicting a species of flower
based on the characteristics of the flower.
■ Iris species: Setosa, Versicolour, or Virginica.
■ Requirement: need a data set containing the characteristics of various flowers of these
three species.
■ 4 other attributes (in the data set): sepal width, sepal length, petal length, and petal width.
■ Petal width is broken into the categories low, medium, and high, which correspond to the
intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively.
■ Also, petal length is broken into categories low, medium, and high, which correspond to the
intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.
■ Based on these categories of petal width and length, the following rules can be derived:
– Petal width low and petal length low implies Setosa.
– Petal width medium and petal length medium implies Versicolour.
– Petal width high and petal length high implies Virginica.
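These three rules translate directly into a tiny rule-based classifier. The sketch below hard-codes the category boundaries quoted above; a real predictive model would of course be learned from the data set rather than written by hand.

    # Rule-based classifier for the Iris example, using the petal-width/petal-length
    # categories quoted above.
    def categorize(value, medium_start, high_start):
        if value < medium_start:
            return "low"
        return "medium" if value < high_start else "high"

    def predict_species(petal_width, petal_length):
        w = categorize(petal_width, 0.75, 1.75)   # [0, 0.75) low, [0.75, 1.75) medium, [1.75, inf) high
        l = categorize(petal_length, 2.5, 5.0)    # [0, 2.5) low, [2.5, 5) medium, [5, inf) high
        if w == "low" and l == "low":
            return "Setosa"
        if w == "medium" and l == "medium":
            return "Versicolour"
        if w == "high" and l == "high":
            return "Virginica"
        return "unknown"   # combinations not covered by the three rules

    print(predict_species(0.2, 1.4))   # Setosa
    print(predict_species(1.3, 4.5))   # Versicolour
    print(predict_species(2.1, 5.8))   # Virginica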
Data Mining Tasks
■ Example: (Predicting the Type of a Flower):
Data Mining Tasks
Example: Predicting the Type of a Flower (figure)
Data Mining Tasks
■ Association analysis
– used to discover patterns that describe strongly associated features in the
data.
– Discovered patterns are represented in the form of implication rules or
feature subsets.
– Goal of association analysis:
■ To extract the most interesting patterns in an efficient manner.
– Example
■ finding groups of genes that have related functionality,
■ identifying Web pages that are accessed together, or
■ understanding the relationships between different elements of Earth’s climate
system.
Data Mining Tasks
■ Association analysis
■ Example (Market Basket Analysis).
– AIM: find items that are frequently bought together by customers.
– Association rule {Diapers} −→ {Milk},
■ suggests that customers who buy diapers also tend to buy milk.
■ This rule can be used to identify potential cross-selling opportunities among related
items.
The transactions data collected at the checkout counters of a grocery store.
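Counting co-occurrences in such transaction data is enough to surface a rule like {Diapers} → {Milk}; a small sketch over invented transactions computes the rule's support and confidence.

    # Support and confidence for the rule {Diapers} -> {Milk} over invented transactions.
    transactions = [
        {"Bread", "Butter", "Diapers", "Milk"},
        {"Coffee", "Sugar", "Cookies", "Salmon"},
        {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
        {"Bread", "Butter", "Salmon", "Chicken"},
        {"Diapers", "Milk", "Beer"},
    ]

    n = len(transactions)
    diapers = sum(1 for t in transactions if "Diapers" in t)
    both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)

    support = both / n            # fraction of all transactions containing both items
    confidence = both / diapers   # fraction of Diapers transactions that also contain Milk
    print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.60, confidence=1.00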
Data Mining Tasks
■ Cluster analysis
– Cluster analysis seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar than
observations that belong to other clusters.
– Clustering has been used to
■ group sets of related customers,
■ find areas of the ocean that have a significant impact on the Earth’s climate, and
■ compress data.
Data Mining Tasks
■ Cluster analysis
– Example 1.3 (Document Clustering)
– Each article is represented as a set of word-frequency pairs (w, c),
■ where w is a word and
■ c is the number of times the word appears in the article.
– There are two natural clusters in the data set.
– First cluster -> first four articles (news about the economy)
– Second cluster-> last four articles ( news about health care)
– A good clustering algorithm should be able to identify these two clusters
based on the similarity between words that appear in the articles.
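The word-frequency representation makes document similarity easy to compute; the sketch below (made-up articles and counts) uses cosine similarity, the kind of signal a clustering algorithm would rely on to separate the two groups.

    # Cosine similarity between documents represented as word-frequency dictionaries (made-up data).
    from math import sqrt

    docs = {
        "d1": {"dollar": 3, "industry": 2, "loan": 2},
        "d2": {"dollar": 2, "budget": 3, "deficit": 2},
        "d3": {"patient": 4, "symptom": 2, "drug": 3},
        "d4": {"health": 3, "physician": 2, "drug": 1},
    }

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0) for w in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm

    # Economy articles share words with each other, health articles with each other,
    # and the two groups share none, so cross-group similarity is 0: two natural clusters.
    print(cosine(docs["d1"], docs["d2"]))   # > 0
    print(cosine(docs["d3"], docs["d4"]))   # > 0
    print(cosine(docs["d1"], docs["d3"]))   # 0.0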
Data Mining Tasks
■ Anomaly Detection:
– Task of identifying observations whose characteristics are significantly
different from the rest of the data.
– Such observations are known as anomalies or outliers.
– A good anomaly detector must have a high detection rate and a low false alarm
rate.
– Applications of anomaly detection include
■ the detection of fraud,
■ network intrusions,
■ unusual patterns of disease, and
■ ecosystem disturbances
https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png
Data Mining Tasks
■ Anomaly Detection:
– Example 1.4 (Credit Card Fraud Detection).
– A credit card company records the transactions made by every credit card
holder, along with personal information such as credit limit, age, annual income,
and address.
– Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be
applied to build a profile of legitimate transactions for the users.
– When a new transaction arrives, it is compared against the profile of the user. If
the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
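A very simple version of such a profile check flags a transaction whose amount is far from the cardholder's historical mean; the sketch below (made-up amounts) uses a z-score threshold as a stand-in for a real anomaly detector.

    # Toy transaction-profile check: flag a new amount as potentially fraudulent when it is
    # far from the user's historical mean (z-score threshold). Amounts are made up.
    from statistics import mean, stdev

    history = [23.5, 41.0, 18.2, 36.9, 25.4, 30.1, 22.8, 45.0]   # legitimate past transactions

    def is_anomalous(amount, history, threshold=3.0):
        mu, sigma = mean(history), stdev(history)
        z = abs(amount - mu) / sigma
        return z > threshold

    print(is_anomalous(35.0, history))    # False: consistent with the profile
    print(is_anomalous(2500.0, history))  # True: very different from previous behavior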
Types of Data
■ Data set - collection of data objects.
■ Other names for a data object are:-
– record,
– point,
– vector,
– pattern,
– event,
– case,
– sample,
– observation, or
– entity.
Types of Data
■ Data objects are described by a number of attributes that
capture the basic characteristics of an object.
■ Example:-
– mass of a physical object or
– time at which an event occurred.
■ Other names for an attribute are:-
– variable,
– characteristic,
– field,
– feature, or
– dimension.
Types of Data
■ Example:-
■ Dataset - Student Information.
■ Each row corresponds to a student.
■ Each column is an attribute that describes some aspect of a
student.
Types of Data
■ Attributes and Measurement
– An attribute is a property or characteristic of an object
that may vary, either from one object to another or from
one time to another.
– Example,
■ eye color varies from person to person, while the
temperature of an object varies over time.
– Eye color is a symbolic attribute with a small number of
possible values {brown, black, blue, green, hazel, etc.},
– Temperature is a numerical attribute with a potentially
unlimited number of values.
Types of Data
■ Attributes and Measurement
– A measurement scale is a rule (function) that associates
a numerical or symbolic value with an attribute of an
object.
– process of measurement
■ application of a measurement scale to associate a
value with a particular attribute of a specific object.
Properties of Attribute Values
■ The type of an attribute depends on which of the following
properties it possesses:
■ Distinctness: = ≠
■ Order: < >
■ Addition: + ‐
■ Multiplication: * /
■ Nominal attribute: distinctness
■ Ordinal attribute: distinctness & order
■ Interval attribute: distinctness, order & addition
■ Ratio attribute: all 4 properties
Types of Data
■ Properties of Attribute Values
– Nominal - attributes to differentiate between one object
and another.
– Roll, EmpID
– Ordinal - attributes to order the objects.
– Rankings, Grades, Height
– Interval - measured on a scale of equal size units
– no Zero point
– Temperatures in C & F, Calendar Dates
– Ratio - numeric attribute with an inherent zero-point.
– value as being a multiple (or ratio) of another
value.
– Weight, No. of Staff, Income/Salary
Types of Data Properties of Attribute Values
Types of Data
Properties of Attribute Values - Transformations
– yielding the same results when the attribute is
transformed using a transformation that preserves
the attribute’s meaning.
– Example:-
■ the average length of a set of objects is different
when measured in meters rather than in feet, but
both averages represent the same length.
Types of Data
Properties of Attribute Values - Transformations
Types of Data
Attribute Types
Data
– Qualitative / Categorical (no properties of integers): Nominal, Ordinal
– Quantitative / Numeric (properties of integers): Interval, Ratio
Types of Data
■ Describing Attributes by the Number of Values
a. Discrete
■ finite or countably infinite set of values.
■ Categorical - zip codes or ID numbers, or
■ Numeric - counts.
■ Binary attributes (special case of discrete)
– assume only two values,
– e.g., true/false, yes/no, male/female, or 0/1.
b. Continuous
■ values are real numbers.
■ Ex:- temperature, height, or weight.
Any of the measurement scale types—nominal, ordinal, interval, and ratio—could be combined
with any of the types based on the number of attribute values—binary, discrete, and continuous.
Types of Data - Types of Dataset
General Characteristics of Data Sets
■ 3 characteristics that apply to many data sets are:-
– dimensionality,
– sparsity, and
– resolution.
■ Dimensionality - number of attributes that the objects in the data set possess.
– data with a small number of dimensions tends to be of higher quality than moderate- or
high-dimensional data.
– curse of dimensionality & dimensionality reduction.
■ Sparsity - data sets, with asymmetric features, most attributes of an object
have values of 0;
– fewer than 1% of the entries are non-zero.
■ Resolution - Data will be gathered at different levels of resolution
– Example:- the surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of tens of kilometers.
Types of Data - Types of Dataset
■ Record Data
– data set is a collection of records (data objects), each of which consists of
a fixed set of data fields (attributes).
– No relationships b/w records
– Same attributes for all records
– Flat files or relational DB.
Types of Data - Types of Dataset
■ Transaction or Market Basket Data
– special type of record data
– Each record (transaction) involves a set of items.
– Also called market basket data because the items in each record are the
products in a person’s “market basket.”
– Can be viewed as a set of records whose fields are asymmetric attributes.
Types of Data - Types of Dataset
■ Data Matrix / Pattern Matrix
– fixed set of numeric attributes,
– Data objects = points (vectors) in a multidimensional space
– each dimension = a distinct attribute describing the object.
– A set of such data objects can be interpreted as
■ an m by n matrix,
– where there are
– m rows, one for each object,
– and n columns, one for each attribute.
– Standard matrix operation can be applied to transform and manipulate the
data.
Types of Data - Types of Dataset
■ Sparse Data Matrix:
– Special case of a data matrix
– attributes are of the
– attributes are of the
■ same type and
■ asymmetric; i.e., only non-zero values are important.
– Example:-
■ Transaction data which has only 0–1 entries.
■ Document Term Matrix - collection of term vector
– One Term vector represents - one document ( one row in matrix)
– Attribute of vector - each term in the document ( one col in matrix)
– value in term vector under an attribute is number of times the
corresponding term occurs in the document.
Types of Data - Types of Dataset
■ Graph based Data:
– Data can be represented in the form of Graph.
– Graphs are used for 2 specific reasons
■ (1) the graph captures relationships among data objects and
■ (2) the data objects themselves are represented as graphs.
– Data with Relationships among Objects
■ Relationships among objects also convey important information.
■ Relationships among objects are captured by the links between objects
and link properties, such as direction and weight.
■ Example:
– Web page in www contain both text and links to other pages.
– Web search engines collect and process Web pages to extract their
contents.
– Links to and from each page provide a great deal of information
about the relevance of a Web page to a query, and thus, must also
be taken into consideration.
Types of Data - Types of Dataset
■ Graph based Data:
– Data with Objects That Are Graphs
■ When objects contain sub-objects that have relationships, then such
objects are frequently represented as graphs.
■ Example:-Structure of chemical compounds
■ Atoms are - nodes
■ Chemical Bonds - links between nodes
– ball-and-stick diagram of the chemical compound benzene,
which contains atoms of carbon (black) and hydrogen (gray).
Substructure mining
Types of Data - Types of Dataset
■ Ordered Data:
– In some data, the attributes have relationships that involve order in time or
space.
– Sequential Data
■ Sequential data / temporal data
■ extension of record data - each record has a time associated with it.
■ Ex:- Retail transaction data set - stores the time of transaction
– time information used to find patterns
■ “candy sales peak before Halloween.”
■ Each attribute can also have a time associated with it
– Record - purchase history of a customer
■ with a listing of items purchased at different times.
– find patterns
■ “people who buy DVD players tend to buy DVDs in the period
immediately following the purchase.”
Types of Data - Types of Dataset
■ Ordered Data: Sequential
Types of Data - Types of Dataset
■ Ordered Data: Sequence Data
– consists of a data set that is a sequence
of individual entities,
– Example
■ sequence of words or letters.
– Example:
■ Genetic information of plants and
animals can be represented in the
form of sequences of nucleotides that
are known as genes.
■ Predicting similarities in the structure
and function of genes from similarities
in nucleotide sequences.
– Ex:- Human genetic code expressed
using the four nucleotides from which all
DNA is constructed: A, T, G, and C.
Types of Data - Types of Dataset
■ Ordered Data: Time Series Data
– Special type of sequential data in
which each record is a time series,
– A series of measurements taken over
time.
– Example:
■ Financial data set might contain
objects that are time series of the
daily prices of various stocks.
– Temporal autocorrelation; i.e., if two
measurements are close in time, then
the values of those measurements are
often very similar.
(Figure: time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.)
Types of Data - Types of Dataset
■ Ordered Data: Spatial Data
■ Some objects have spatial attributes,
such as positions or areas, as well as
other types of attributes.
■ An example of spatial data is
– weather data (precipitation,
temperature, pressure) that is
collected for a variety of geographical
locations.
■ spatial autocorrelation; i.e., objects that
are physically close tend to be similar in
other ways as well.
■ Example
– two points on the Earth that are close
to each other usually have similar
values for temperature and rainfall.
(Figure: average monthly temperature of land and ocean)
Data Quality
Unit – II- DWDM
Data Quality
● Data mining applications are often applied to data that was collected for another purpose, or for
future but as-yet-unspecified applications.
● Data mining focuses on
(1) the detection and correction of data quality problems - Data Cleaning
(2) the use of algorithms that can tolerate poor data quality.
● Measurement and Data Collection Issues
● Issues Related to Applications
Data Quality
● Measurement and Data Collection Issues
● problems due to human error,
● limitations of measuring devices, or
● flaws in the data collection process.
● Values or even entire data objects may be missing.
● Spurious or duplicate objects; i.e., multiple data objects that all correspond to a
single “real” object.
○ Example - there might be two different records for a person who has recently lived at two
different addresses.
● Inconsistencies—
○ Example - a person has a height of 2 meters, but weighs only 2 kilograms.
Data Quality
● Measurement and Data Collection Errors
○ Measurement error - any problem resulting from the measurement process.
■ Value recorded differs from the true value to some extent.
■ Continuous attributes:
● Numerical difference of the measured and true value is called the
error.
○ Data collection error - errors such as omitting data objects or attribute
values, or inappropriately including a data object.
■ For example, a study of animals of a certain species might include animals
of a related species that are similar in appearance to the species of
interest.
Data Quality
● Measurement and Data Collection Errors
○ Noise and Artifacts:
○ Noise is the random component of a measurement error.
○ It may involve the distortion of a value or the addition of spurious objects.
Data Quality
● Measurement and Data Collection Errors
○ Noise and Artifacts:
○ The term noise is often used in connection with data that has a spatial or temporal component.
○ Techniques from signal or image processing can frequently be used to reduce
noise
■ These will help to discover patterns (signals) that might be “lost in the
noise.”
○ Note:Elimination of noise - difficult
■ robust algorithms - produce acceptable results even when noise is present.
Data Quality
● Measurement and Data Collection Errors
○ Noise and Artifacts:
■ Artifacts: Deterministic distortions of the data
■ Data errors may be the result of a more deterministic phenomenon, such
as a streak in the same place on a set of photographs.
Data Quality
● Measurement and Data Collection Errors
● Precision, Bias, and Accuracy:
○ Precision:
■ The closeness of repeated measurements (of the same quantity) to one another.
■ Precision is often measured by the standard deviation of a set of values
○ Bias:
■ A systematic variation of measurements from the quantity being measured.
■ Bias is measured by taking the difference between the mean of the set of values and the
known value of the quantity being measured.
○ Example:
■ standard laboratory weight with a mass of 1g and want to assess the precision and bias of our
new laboratory scale.
■ weigh the mass five times & values are: {1.015, 0.990, 1.013, 1.001, 0.986}.
■ The mean of these values is 1.001, and hence, the bias is 0.001.
■ The precision, as measured by the standard deviation, is 0.013.
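A quick NumPy check of this example (a sketch only; ddof=1 gives the sample standard deviation quoted above):

# verify the precision/bias example (illustrative sketch)
import numpy as np

measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                         # known mass of the standard weight (1 g)

mean = measurements.mean()
bias = mean - true_value                 # systematic deviation from the true value
precision = measurements.std(ddof=1)     # sample standard deviation

print("Mean:", round(mean, 3))           # 1.001
print("Bias:", round(bias, 3))           # 0.001
print("Precision:", round(precision, 3)) # 0.013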
Data Quality
● Measurement and Data Collection Errors
● Precision, Bias, and Accuracy:
○ Accuracy:
■ The closeness of measurements to the true value of the
quantity being measured.
Data Quality
● Measurement and Data Collection Errors
● Outliers:
○ Outliers are either
■ (1) data objects that, in some sense, have characteristics that
are different from most of the other data objects in the data set,
or
■ (2) values of an attribute that are unusual with respect to the
typical values for that attribute.
○ Alternatively - anomalous objects or values.
Data Quality
● Measurement and Data Collection Errors
● Missing Values:
○ Eliminate Data Objects or Attributes
○ Estimate Missing Values
○ Ignore the Missing Value during Analysis
○ Inconsistent Values
Data Quality
● Measurement and Data Collection Errors
● Duplicate Data: Same Data in multiple Data Objects
○ To detect and eliminate such duplicates, two main issues
must be addressed.
■ First - if two objects represent a single object, then the values of
corresponding attributes may differ, and these inconsistent
values must be resolved
■ Second - care needs to be taken to avoid accidentally combining
data objects that are similar - deduplication
Data Quality “data is of high quality if it is suitable for its intended use.”
● Issues Related to Applications:
● Timeliness:
○ If the data is out of date, then so are the models and patterns that are based on it.
● Relevance:
○ The available data must contain the information necessary for the application.
○ Consider the task of building a model that predicts the accident rate for drivers. If information about the age and
gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information
is indirectly available through other attributes.
● Knowledge about the Data:
○ Data sets are often accompanied by documentation that describes different aspects of the data;
○ the quality of this documentation can help or hinder the subsequent analysis.
○ For example,
■ if the documentation is poor and fails to tell us that the missing values for a
particular field are indicated with a -9999, then our analysis of the data may be faulty.
○ Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval,
ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
DATA PREPROCESSING
Datamining
Unit - II
AGGREGATION
• “less is more”
• Aggregation - combining of two or more objects into a single object.
• Example: consider a data set of transactions recording the daily sales of products at various store locations.
• One way to aggregate transactions for this data set is to replace all the transactions of a single store with a
single storewide transaction.
• This reduces number of records (1 record per store).
• How an aggregate transaction is created
• Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
• A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that
were sold at that location.
• Can also be viewed as a multidimensional array, where each attribute is a dimension.
• Used in OLAP
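A minimal pandas sketch of this kind of storewide aggregation (the column names and values are invented for illustration, not taken from the original table):

# aggregate transaction records per store (illustrative sketch)
import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["bread", "milk", "bread", "soda", "milk"],
    "price": [2.5, 1.2, 2.5, 1.0, 1.2],
})

storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),           # quantitative attribute: aggregate by sum
    items_sold=("item", lambda s: set(s)),  # qualitative attribute: summarize as a set
)
print(storewide)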
AGGREGATION
• Motivations for aggregation
• Smaller data sets require less memory and processing time which
allows the use of more expensive data mining algorithms.
• Availability of change of scope or scale
• by providing a high-level view of the data instead of a low-level view.
• Behavior of groups of objects or attributes is often more stable than
that of individual objects or attributes.
• Disadvantage of aggregation
• potential loss of interesting details.
AGGREGATION
Example: average yearly precipitation has less variability than the average monthly precipitation.
SAMPLING
• Approach for selecting a subset of the data objects to be analyzed.
• Data miners sample because it is too expensive or time consuming to
process all the data.
• The key principle for effective sampling is the following:
• Using a sample will work almost as well as using the entire data set if the sample
is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
• Choose a sampling scheme/Technique – which gives high probability of getting a
representative sample.
SAMPLING
• Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive
• Simple random sampling
• equal probability of selecting any particular item.
• Two variations on random sampling:
• (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that
together constitute the population, and
• (2) sampling with replacement—objects are not removed from the population as they are selected for the
sample.
• Problem: When the population consists of different types of objects, with widely different numbers of
objects, simple random sampling can fail to adequately represent those types of objects that are less
frequent.
• Stratified sampling:
• starts with prespecified groups of objects
• Simpler version -equal numbers of objects are drawn from each group even though the groups are of
different sizes.
• Other - the number of objects drawn from each group is proportional to the size of that group.
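A small pandas sketch contrasting simple random and stratified sampling (the group sizes and the 10% sampling fraction are invented for illustration):

# simple random vs. stratified sampling (illustrative sketch)
import pandas as pd

data = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,   # the rare group B is easy to miss
    "value": range(100),
})

# simple random sampling without replacement
simple = data.sample(n=10, replace=False, random_state=0)

# stratified sampling: draw from each group in proportion to its size (at least 1)
stratified = pd.concat(
    g.sample(max(1, int(round(0.1 * len(g)))), random_state=0)
    for _, g in data.groupby("group")
)
print(simple["group"].value_counts())
print(stratified["group"].value_counts())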
SAMPLING
Sampling and Loss of Information
• Larger sample sizes increase the probability that a sample will be representative, but they also eliminate
much of the advantage of sampling.
• Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
SAMPLING
Determining the Proper Sample Size
• Desired outcome: at least one point will be obtained from each cluster.
• Probability of getting one object from each of the 10 groups increases as the sample size runs from 10
to 60.
SAMPLING
• Adaptive/Progressive Sampling:
• Proper sample size - Difficult to determine
• Start with a small sample, and then increase the sample size until a
sample of sufficient size has been obtained.
• Eliminates the need to determine the correct sample size at the start
• Stop increasing the sample size at leveling-off point(where no
improvement in the outcome is identified).
DIMENSIONALITY REDUCTION
• Data sets can have a large number of features.
• Example
• a set of documents, where each document is represented by a vector
whose components are the frequencies with which each word occurs in
the document.
• thousands or tens of thousands of attributes (components), one for each
word in the vocabulary.
DIMENSIONALITY REDUCTION
• Benefits to dimensionality reduction.
• Data mining algorithms work better if the dimensionality is lower.
• It eliminates irrelevant features and reduce noise
• Lead to a more understandable model
• fewer attributes
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
• Reduce the dimensionality of a data set by creating new attributes that are a combination of the old
attributes.
• Feature subset selection or feature selection:
• The reduction of dimensionality by selecting new attributes that are a subset of the old.
DIMENSIONALITY REDUCTION
• The Curse of Dimensionality
• Data analysis become significantly harder as the dimensionality of the data
increases.
• data becomes increasingly sparse
• Classification
• there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects.
• Clustering
• density and the distance between points - becomes less meaningful
DIMENSIONALITY REDUCTION
• Linear Algebra Techniques for Dimensionality Reduction
• Principal Components Analysis (PCA)
• for continuous attributes
• finds new attributes (principal components) that
• (1) are linear combinations of the original attributes,
• (2) are orthogonal (perpendicular) to each other, and
• (3) capture the maximum amount of variation in the data.
• Singular Value Decomposition (SVD)
• Related to PCA
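A minimal scikit-learn sketch of PCA (assumes scikit-learn is available; the random data is purely illustrative):

# reduce 10 continuous attributes to 2 principal components (illustrative sketch)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # 100 objects, 10 continuous attributes

pca = PCA(n_components=2)            # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component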
FEATURE SUBSET SELECTION
• Another way to reduce the dimensionality - use only a subset of the features.
• Redundant Features
• Example:
• Purchase price of a product and the amount of sales tax paid
• Redundant to each other
• contain much of the same information.
• Irrelevant features contain almost no useful information for the data mining task at hand.
• Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.
• Redundant and irrelevant features
• reduce classification accuracy and the quality of the clusters that are found.
• can be eliminated immediately by using common sense or domain knowledge,
• systematic approach - for selecting the best subset of features
• Ideal approach - try all possible subsets of features as input to the data mining algorithm of interest, and
then take the subset that produces the best results (usually impractical, since the number of subsets grows exponentially).
FEATURE SUBSET SELECTION
•3 standard approaches to feature
selection:
•Embedded
•Filter
•Wrapper
FEATURE SUBSET SELECTION
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm.
• During execution of algorithm, the Algorithm itself decides which attributes to use
and which to ignore.
• Example:- Algorithms for building decision tree classifiers
• Filter approaches:
• Features are selected before the data mining algorithm is run
• Approach that is independent of the data mining task.
• Wrapper approaches:
• Uses the target data mining algorithm as a black box to find the best subset of
attributes
• typically without enumerating all possible subsets.
FEATURE SUBSET SELECTION
• An Architecture for Feature Subset Selection :
• The feature selection process is viewed as consisting of four parts:
1. a measure for evaluating a subset,
2. a search strategy that controls the generation of a new subset of features,
3. a stopping criterion, and
4. a validation procedure.
• Filter methods and wrapper methods differ only in the way in which they
evaluate a subset of features.
• wrapper method – uses the target data mining algorithm
• filter approach - evaluation technique is distinct from the target data mining
algorithm.
FEATURE SUBSET SELECTION
FEATURE SUBSET SELECTION
• Feature subset selection is a search over all possible subsets of features.
• Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
• Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
• Wrapper approach: running the target data mining application, measure the result of the data mining.
• Stopping criterion
• conditions involving the following:
• the number of iterations,
• whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• whether a subset of a certain size has been obtained,
• whether simultaneous size and evaluation criteria have been achieved, and
• whether any improvement can be achieved by the options available to the search strategy.
• Validation:
• Finally, the results of the target data mining algorithm on the selected subset should be validated.
• An evaluation approach: run the algorithm with the full set of features and compare the full results to results
obtained using the subset of features.
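A toy sketch of a filter-style evaluation step (this particular correlation-based score is one possible choice for illustration, not the one prescribed by the slides):

# filter-style feature selection: score each feature independently, keep the top k
import numpy as np

def filter_select(X, y, k):
    # absolute correlation of each feature with the target, independent of any classifier
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top_k = np.argsort(scores)[::-1][:k]
    return np.sort(top_k)                # indices of the selected features

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.1, size=50)   # depends on features 1 and 4
print(filter_select(X, y, k=2))          # likely selects features 1 and 4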
FEATURE SUBSET SELECTION
• Feature Weighting
• An alternative to keeping or eliminating features.
• One Approach
• Higher weight - More important features
• Lower weight - less important features
• Another Approach – automatic
• Example – Classification Scheme - Support vector machines
• Other Approach
• The normalization of objects that occurs when cosine similarity is used can also be viewed as a form of feature weighting
FEATURE CREATION
• Create a new set of attributes that captures the important
information in a data set from the original attributes
• much more effective.
• No. of new attributes < No. of original attributes
• Three related methodologies for creating new attributes:
1. Feature extraction
2. Mapping the data to a new space
3. Feature construction
FEATURE CREATION
• Feature Extraction
• The creation of a new set of features from the original raw data
• Example: Classify set of photographs based on existence of human face
(present or not)
• Raw data (set of pixels) - not suitable for many types of classification algorithms.
• Higher level features( presence or absence of certain types of edges and areas that are highly correlated with
the presence of human faces), then a much broader set of classification techniques can be applied to this
problem.
• Feature extraction is highly domain-specific
• New area means development of new features and feature extraction
methods.
FEATURE CREATION
Mapping the Data to a New Space
• A totally different view of the data can reveal important and interesting features.
• If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
• If, however, there are a number of periodic patterns and a significant amount of noise is present, then these
patterns are hard to detect.
• Such patterns can be detected by applying a Fourier transform to the time series in order to
change to a representation in which frequency information is explicit.
• Example:
• Power spectrum that can be computed after applying a Fourier transform to the original time series.
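A small NumPy sketch of this idea (the frequencies of 7 and 30 cycles and the noise level are made up):

# make frequency information explicit with a Fourier transform (illustrative sketch)
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
series = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 30 * t) \
         + 0.8 * np.random.default_rng(0).normal(size=t.size)

spectrum = np.abs(np.fft.rfft(series)) ** 2        # power spectrum of the time series
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
print(freqs[np.argsort(spectrum)[::-1][:2]])       # strongest frequencies: about 7 and 30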
FEATURE CREATION
• Feature Construction
• Features in the original data sets consists necessary information, but not suitable for the data mining
algorithm.
• New features constructed from the original features can be more useful than the original features.
• Example (Density).
• Dataset contains the volume and mass of historical artifact.
• Density feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
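A tiny pandas sketch of the density example (the mass and volume values are invented):

# construct a new feature from mass and volume (illustrative sketch)
import pandas as pd

artifacts = pd.DataFrame({
    "mass_g":     [19.3, 10.5, 8.9],
    "volume_cm3": [1.0, 1.0, 1.0],
})
artifacts["density"] = artifacts["mass_g"] / artifacts["volume_cm3"]   # density = mass / volume
print(artifacts)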
DISCRETIZATION AND BINARIZATION
• Classification algorithms, require that the data be in the form of
categorical attributes.
• Algorithms that find association patterns require that the data be in
the form of binary attributes.
• Discretization - transforming a continuous attribute into a
categorical attribute
• Binarization - transforming both continuous and discrete attributes
into one or more binary attributes
DISCRETIZATION AND BINARIZATION
• Binarization of a categorical attribute (Simple technique):
• If there are m categorical values, then uniquely assign
each original value to an integer in the interval [0, m − 1].
• If the attribute is ordinal, then order must be maintained
by the assignment.
• Next, convert each of these m integers to a binary number
using n binary attributes
• n = ⌈log2(m)⌉ binary digits are required to represent these integers
DISCRETIZATION AND BINARIZATION
Example: a categorical variable with 5 values
{awful, poor, OK, good, great}
require three binary variables x1, x2, and x3.
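A short Python sketch of this encoding (the integer assignment preserves the given order; the resulting bit strings are one possible binarization):

# binarize the ordinal attribute {awful, poor, OK, good, great} (illustrative sketch)
import math

values = ["awful", "poor", "OK", "good", "great"]     # m = 5, order preserved
n_bits = math.ceil(math.log2(len(values)))            # 3 binary attributes x1, x2, x3

for i, v in enumerate(values):
    bits = format(i, f"0{n_bits}b")                   # e.g. 'great' -> 4 -> '100'
    print(v, "->", i, "->", bits)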
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes ( classification or
association analysis)
• Transformation of a continuous attribute to a categorical attribute
involves two subtasks:
• decide no. of categories
• decide how to map the values of the continuous attribute to these
categories.
• Step I: Sort Attribute Values and divide into n intervals by specifying n − 1
split points.
• Step II : all the values in one interval are mapped to the same categorical
value.
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes
• Problem of discretization is
• Deciding how many split points to choose and
• where to place them.
• The result can be represented either as
• a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},
where x0 and xn may be −∞ or +∞, respectively,
or
• as a series of inequalities x0 < x ≤ x1,..., xn−1 < x < xn.
DISCRETIZATION AND BINARIZATION
• UnSupervised Discretization
• Discretization methods for Classification
• Supervised - known class information
• Unsupervised - unknown class information
• Equal width approach:
• divides the range of the attribute into a user-specified number of
intervals each having the same width.
• problem with outliers
• Equal frequency (equal depth) approach:
• Puts same number of objects into each interval
• K-means Clustering method
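A pandas sketch of equal-width and equal-frequency discretization (the data values and the choice of 4 intervals are illustrative):

# unsupervised discretization with pandas (illustrative sketch)
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).normal(loc=50, scale=15, size=100))

equal_width = pd.cut(values, bins=4)     # intervals of equal width (sensitive to outliers)
equal_freq  = pd.qcut(values, q=4)       # intervals with (roughly) equal numbers of objects

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())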
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
Original Data
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
Equal Width Discretization
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
Equal Frequency Discretization
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
K-means Clustering (better result)
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• When additional information (class labels) are used then it
produces better results.
• Some Concerns: purity of an interval and the minimum size of
an interval.
• statistically based approaches:
• start with each attribute value as a separate interval and
create larger intervals by merging adjacent intervals that are
similar according to a statistical test.
• Entropy based approaches:
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Entropy Definition
• ei - Entropy of the i th interval: ei = − Σj pij log2(pij), summed over the k classes
• pij = mij/mi probability of class j in the i th interval.
• k - no. of different class labels
• mi - no. of values in the i th interval of a partition,
• mij - no. of values of class j in interval i.
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Total entropy, e, of the partition is the
• weighted average of the individual interval entropies: e = Σi wi ei
• m - no. of values,
• wi = mi/m fraction of values in the i th interval
• n - no. of intervals.
• Perfectly Pure Interval:entropy is 0
• If an interval contains only values of one class
• Maximally impure interval: entropy is maximum
• when the classes of the values in an interval occur equally often
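A small Python sketch of these entropy formulas (the class counts per interval are invented):

# interval entropies and the weighted total entropy of a partition (illustrative sketch)
import math

def entropy(class_counts):
    m_i = sum(class_counts)
    probs = [c / m_i for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

intervals = [[10, 0], [4, 4], [1, 9]]        # mij: counts of 2 classes in 3 intervals
m = sum(sum(counts) for counts in intervals)

e = sum((sum(counts) / m) * entropy(counts) for counts in intervals)   # e = sum_i wi * ei
print([round(entropy(counts), 3) for counts in intervals])             # [0.0, 1.0, 0.469]
print("total entropy:", round(e, 3))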
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Simple approach for partitioning a continuous attribute:
• starts by bisecting the initial values so that the resulting
two intervals give minimum entropy.
• consider each value as a possible split point
• Repeat splitting process with another interval
• choosing the interval with the worst (highest) entropy,
• until a user-specified number of intervals is reached,
or
• stopping criterion is satisfied.
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy based
approaches:
• 3 categories for
both x & y
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy based
approaches:
• 5 categories for
both x & y
• Observation:
• no improvement
for 6 categories
DISCRETIZATION AND BINARIZATION
• Categorical Attributes with Too Many Values
• If categorical attribute is an ordinal,
• techniques similar to those for continuous attributes
• If the categorical attribute is nominal,
• Example:-
• University that has a large number of departments.
• department name attribute - dozens of diff. values.
• combine departments into larger groups, such as
• engineering,
• social sciences, or
• biological sciences.
Variable Transformation
• Transformation that is applied to all the values of a variable.
• Example: magnitude of a variable is important
• then the values of the variable can be transformed by taking the absolute
value.
• Simple Function Transformation:
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of such transformations include
• x^k,
• log x,
• e^x,
• √x,
• 1/x,
• sin x, or |x|
Variable Transformation
• Variable transformations should be applied with caution since they
change the nature of the data.
• Example:-
• transformation fun. is 1/x
• if the value is greater than 1, the transformation reduces its magnitude (a value of 1 is unchanged)
• values {1, 2, 3} go to {1, 1/2, 1/3}
• if value is b/w 0 & 1 then increases the magnitude of values
• values {1, 1/2, 1/3} go to {1, 2, 3}.
• so better ask questions such as the following:
• Does the order need to be maintained?
• Does the transformation apply to all values( -ve & 0)?
• What is the effect of the transformation on the values between 0 & 1?
Variable Transformation
• Normalization or Standardization
• Goal of standardization or normalization
• To make an entire set of values have a particular property.
• A traditional example is that of “standardizing a variable” in statistics.
• x̄ - mean (average) of the attribute values and
• sx - standard deviation,
• Transformation: x' = (x − x̄) / sx
• creates a new variable that has a mean of 0 and a standard deviation
of 1.
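A one-line NumPy sketch of this standardization (the values are invented):

# standardize a variable to mean 0 and standard deviation 1 (illustrative sketch)
import numpy as np

x = np.array([10.0, 12.0, 9.0, 15.0, 14.0])
x_std = (x - x.mean()) / x.std()
print(round(x_std.mean(), 10), round(x_std.std(), 10))   # ~0.0 and 1.0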
Variable Transformation
• Normalization or Standardization
• If different variables are to be combined, a transformation is necessary
to avoid having a variable with large values dominate the results of the
calculation.
• Example:
• comparing people based on two variables: age and income.
• For any two people, the difference in income will likely be much
higher in absolute terms (hundreds or thousands of dollars) than the
difference in age (less than 150).
• Income values(higher values) will dominate the calculation.
Variable Transformation
• Normalization or Standardization
• Mean and standard deviation are strongly affected by outliers
• Mean is replaced by the median, i.e., the middle value.
• x - variable
• absolute standard deviation of x is σA = Σi |xi − µ| (summed over the m values)
• xi - i th value of the variable,
• m - number of objects, and
• µ - mean or median.
• Other approaches
• computing estimates of the location (center) and
• spread of a set of values in the presence of outliers
• These measures can also be used to define a standardization transformation.
Measures of
Similarity and
Dissimilarity
Unit - II
Datamining
Measures of Similarity and
Dissimilarity
● Similarity and dissimilarity are important because they are used by a
number of data mining techniques
○ such as
■ clustering,
■ nearest neighbor classification, and
■ anomaly detection.
● Proximity is used to refer to either similarity or dissimilarity.
○ proximity between objects having only one simple attribute, and
○ proximity measures for objects with multiple attributes.
Measures of Similarity and Dissimilarity
● Similarity between two objects is a numerical measure of the
degree to which the two objects are alike.
○ Similarity - high -objects that are more alike.
○ Non-negative
○ between 0 (no similarity) and 1 (complete similarity).
● Dissimilarity between two objects is a numerical measure of the
degree to which the two objects are different.
○ Dissimilarity - low - objects are more similar.
○ Distance - synonym for dissimilarity
Measures of Similarity and Dissimilarity
Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity
○ to transform a proximity measure to fall within a particular range, such as [0,1].
● Example
○ Similarities between objects range from 1 (not at all similar) to 10 (completely
similar)
○ we can make them fall within the range [0, 1] by using the transformation
■ s’ = (s−1)/9
■ s - Original Similarity
■ s’ - New similarity values
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Euclidean Distance
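The distance formula itself appears on the original slide only as an image; in standard notation it is the following (n is the number of attributes, and xk, yk are the kth attributes of x and y):

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} $$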
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Note:-Measures that satisfy all three properties are known as metrics.
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Set Differences
A = {1, 2, 3, 4} and B = {2, 3, 4},
then A − B = {1} and
B − A = ∅, the empty set.
If d(A, B) = size(A − B), then it does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality.
d(A, B) = size(A − B) + size(B − A) (modified which follows all properties)
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Time
Dissimilarity measure that is not a metric,but still useful.
d(1PM, 2PM) = 1 hour
d(2PM, 1PM) = 23 hours
● Example:- when answering the question: “If an event occurs at 1PM
every day, and it is now 2PM, how long do I have to wait for that event to
occur again?”
Distance in python
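The code on the original slide is an image; one possible sketch using SciPy's distance functions:

# compute common dissimilarities between two points (illustrative sketch)
import numpy as np
from scipy.spatial import distance

x = np.array([1, 2, 3])
y = np.array([4, 0, 3])

print("Euclidean:", distance.euclidean(x, y))    # L2 distance
print("Manhattan:", distance.cityblock(x, y))    # L1 distance
print("Supremum:", distance.chebyshev(x, y))     # L-infinity distance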
Measures of Similarity and Dissimilarity
Similarities between Data Objects
● Typical properties of similarities are the following:
○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry)
● A Non-symmetric Similarity Measure
○ Classify a small set of characters which is flashed on a screen.
○ Confusion matrix - records how often each character is classified as itself,
and how often each is classified as another character.
○ “0” appeared 200 times but classified as
■ “0” 160 times,
■ “o” 40 times.
○ ‘o’ appeared 200 times and was classified as
■ “o” 170 times
■ “0” only 30 times.
● similarity measure can be made symmetric by setting
○ s′(x, y) = s′(y, x) = (s(x, y) + s(y, x)) / 2,
■ s′ - new similarity measure.
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
○ Similarity measures between objects that contain only binary
attributes are called similarity coefficients
○ Let x and y be two objects that consist of n binary attributes.
○ The comparison of two objects (or two binary vectors), leads to
the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
Simple Matching Coefficient (SMC) = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard Coefficient (J) = f11 / (f01 + f10 + f11)
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
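The worked example on this slide is an image; a NumPy sketch computing SMC and Jaccard for two illustrative binary vectors:

# SMC and Jaccard for two binary vectors (illustrative sketch)
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print("SMC:", smc)          # 0.7
print("Jaccard:", jaccard)  # 0.0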
Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
# import required libraries
import numpy as np
from numpy.linalg import norm
# define two lists or array
A = np.array([2,1,2,3,2,9])
B = np.array([3,4,2,4,5,5])
print("A:", A)
print("B:", B)
# compute cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))
print("Cosine Similarity:", cosine)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
● Cosine similarity - measure of angle between x and y.
● Cosine similarity = 1 (angle is 0◦, and x & y are same (except magnitude or length))
● Cosine similarity = 0 (angle is 90°, and x & y do not share any terms (words))
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
Note:-
Dividing x and y by their lengths normalizes them to have a length of 1 ( means magnitude is not
considered)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Extended Jaccard Coefficient (Tanimoto Coefficient)
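The coefficient is shown on the original slide as an image; the commonly used definition, in terms of the dot product and vector lengths, is:

$$ EJ(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \mathbf{x} \cdot \mathbf{y}} $$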
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
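The formula on the original slide is an image; the standard definition for two vectors x and y of length n is:

$$ \operatorname{corr}(\mathbf{x}, \mathbf{y}) = \frac{s_{xy}}{s_x \, s_y}, \qquad s_{xy} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y}) $$

where sx and sy are the standard deviations of x and y.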
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
● The more tightly linear the relationship between two variables X and Y, the closer
Pearson's correlation coefficient (PCC) is to −1 or +1
○ PCC = -1, if the relationship is negative
■ an increase in the value of one variable decreases the value of the other variable
○ PCC = +1, if the relationship is positive
■ an increase in the value of one variable increases the value of the other variable
○ PCC = 0, the variables are perfectly linearly uncorrelated (no linear relationship)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation ( scipy.stats.pearsonr() - automatic)
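The code on the original slide is an image; a possible sketch with scipy.stats.pearsonr (the data values are invented):

# Pearson's correlation with SciPy (illustrative sketch)
import numpy as np
from scipy import stats

x = np.array([2, 4, 6, 8, 10])
y = np.array([1, 3, 7, 9, 12])

r, p_value = stats.pearsonr(x, y)
print("Pearson's r:", round(r, 4))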
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation (manual in python)
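The manual computation on the original slide is also an image; a possible NumPy version using the same invented data as above:

# manual Pearson's correlation (illustrative sketch)
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([1, 3, 7, 9, 12], dtype=float)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # covariance (population form)
r = cov_xy / (x.std() * y.std())                    # divide by the standard deviations
print("Pearson's r:", round(r, 4))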
CLASSIFICATION
CLASSIFICATION
DATAMINING
UNIT III
BASIC CONCEPTS
• Input data -> a collection of records.
• Record / instance / example -> tuple (x, y)
• x - attribute set
• y - special attribute (class label / category / target attribute)
• Attribute set - properties of a Data Object – Discrete / Continuous
• Class label (y) –
• Classification – y is a discrete attribute
• Regression (predictive modeling task) – y is a continuous attribute.
BASIC CONCEPTS
• Definition:
• Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y.
• The target function is also known informally as a classification model.
BASIC CONCEPTS
• A classification model is useful for the following purposes.
• Descriptive modeling: A classification model can serve as an explanatory tool to distinguish
between objects of different classes.
BASIC CONCEPTS
• A classification model is useful for the following purposes.
• Predictive Modeling:
• A classification model can also be used to predict the class label of unknown records.
• Automatically assigns a class label when presented with the attribute set of an
unknown record.
• Classification techniques are best suited for binary or nominal categories.
• They do not consider the implicit order of ordinal class labels
• Relationships among classes (e.g., superclass-subclass) are also ignored
General approach to solving a classification problem
• Classification technique (or classifier)
• Systematic approach to building classification models
from an input data set.
• Examples
• Decision tree classifiers,
• Rule-based classifiers,
• Neural networks,
• Support vector machines, and
• Naive bayes classifiers.
• Learning algorithm
• Used by the classifier
• To identify a model
• That best fits the relationship between the
attribute set and class label of the input data.
General approach to solving a classification problem
• Model
• Generated by a learning algorithm
• Should satisfy the following:
• Fit the input data well
• Correctly predict the class labels of
records it has never seen before.
• Training set
• Consisting of records whose class labels are
known
• used to build a classification model
General approach to solving a classification problem
• Confusion Matrix
• Used to evaluate the performance of a classification model
• Holds details about
• counts of test records correctly and incorrectly predicted by the model.
• Table 4.2 depicts the confusion matrix for a binary classification problem.
• fij – no. of records from class i predicted to be of class j.
• f01 – no. of records from class 0 incorrectly predicted as class 1.
• total no. of correct predictions made (f11 + f00)
• total number of incorrect predictions (f10 + f01).
General approach to solving a classification problem
• Performance Metrics:
1. Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
2. Error Rate = (f10 + f01) / (f11 + f10 + f01 + f00)
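A tiny sketch computing both metrics from the counts of a 2x2 confusion matrix (the counts are invented):

# accuracy and error rate from a confusion matrix (illustrative sketch)
f11, f10, f01, f00 = 40, 10, 5, 45      # counts of correct/incorrect predictions

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total
error_rate = (f10 + f01) / total
print("Accuracy:", accuracy)      # 0.85
print("Error rate:", error_rate)  # 0.15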
DECISION TREE INDUCTION
Working of Decision Tree
• We can solve a classification problem by
asking a series of carefully crafted questions
about the attributes of the test record.
• Each time we receive an answer, a follow-up question is asked until we reach a
conclusion about the class label of the record.
• The series of questions and their possible
answers can be organized in the form of a
decision tree
• Decision tree is a hierarchical structure
consisting of nodes and directed edges.
DECISION TREE INDUCTION
Working of Decision Tree
• Three types of nodes:
• Root node
• No incoming edges
• Zero or more outgoing edges.
• Internal nodes
• Exactly one incoming edge and
• Two or more outgoing edges.
• Leaf or terminal nodes
• Exactly one incoming edge and
• No outgoing edges.
• Each leaf node is assigned a class label.
• Non-terminal nodes (root & other internal nodes)
contain attribute test conditions to separate
records that have different characteristics.
DECISION TREE INDUCTION
Working of Decision Tree
DECISION TREE INDUCTION
Building Decision Tree
• Hunt’s algorithm:
• basis of many existing decision tree induction algorithms, including
• ID3,
• C4.5, and
• CART.
• Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records
into successively purer subsets.
• Dt - set of training records with node t
• y= {y1, y2,..., yc} -> class labels.
• Hunt’s algorithm.
• Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
• Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets. A child node is created for each outcome of the
test condition and the records in Dt are distributed to the children based on the outcomes.
• Note:-algorithm is then recursively applied to each child node.
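A minimal runnable sketch of the recursive structure described above (illustrative only: for simplicity it splits on attributes in a fixed order rather than choosing the best test condition with an impurity measure, and the tiny data set is invented):

# sketch of the recursion in Hunt's algorithm (illustrative, not the book's pseudocode)
from collections import Counter

def majority(records):
    return Counter(y for _, y in records).most_common(1)[0][0]

def hunt(records, attributes):
    labels = {y for _, y in records}
    if len(labels) == 1:                       # Step 1: pure node -> leaf
        return ("leaf", labels.pop())
    if not attributes:                         # identical attribute values -> majority leaf
        return ("leaf", majority(records))

    attr = attributes[0]                       # simplification: take attributes in order
    tree = {"attr": attr, "children": {}}
    for v in {x[attr] for x, _ in records}:    # Step 2: one child per outcome
        subset = [(x, y) for x, y in records if x[attr] == v]
        tree["children"][v] = hunt(subset, attributes[1:])
    return tree

data = [({"home_owner": "yes", "marital": "single"}, "no"),
        ({"home_owner": "no",  "marital": "married"}, "no"),
        ({"home_owner": "no",  "marital": "single"}, "yes")]
print(hunt(data, ["home_owner", "marital"]))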
DECISION TREE INDUCTION
Building Decision Tree
• Example: predicting whether a loan applicant will repay the loan or default on it
• Construct a training set by examining the records of previous
borrowers.
DECISION TREE INDUCTION
Building Decision Tree
• Hunt’s algorithm will work fine
• if every combination of attribute values is present in the training data and
• if each combination has a unique class label.
• Additional conditions
1. If a child nodes is empty(no records in training set) then declare it as a leaf node with the same class
label as the majority class of training records associated with its parent node.
2. If the records have identical attribute values but different class labels, further splitting is not possible; declare the node a leaf with the
same class label as the majority class of training records associated with this node.
• Design Issues of Decision Tree Induction
• 1. How should the training records be split?
• Test condition to divide the records into smaller subsets.
• provide a method for specifying the test condition
• measure for evaluating the goodness of each test condition.
• 2. How should the splitting procedure stop?
• A stopping condition is needed
• stop when either all the records belong to the same class or all the records have identical attribute
values.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Binary Attributes
• The test condition for a binary attribute generates two potential outcomes.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Nominal Attributes
• nominal attribute can have many values
• Test condition can be expressed in two ways
• Multiway split - number of outcomes depends on the number of distinct values
• Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of
creating a binary partition of k attribute values.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Ordinal Attributes
• Ordinal attributes can also produce binary or multiway splits.
• values can be grouped without violating the order property.
• Figure 4.10(c) is invalid because it groups values in a way that violates the order property
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Continuous Attributes
• Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes,
or
• Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k.
• Multiway split
• Apply the discretization strategies
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• p(i|t) - fraction of records belonging to class i at a given node t.
• Sometimes written simply as pi
• Two-class problem
• (p0, p1) - class distribution at any node
• p1 = 1 − p0
• (0.5, 0.5) because there are an equal number of records from each class
• Example: a split on an attribute such as Car Type will result in purer partitions
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• selection of best split is based on the degree of impurity of the child nodes
• Node with class distribution (0, 1) has zero impurity,
• Node with uniform class distribution (0.5, 0.5) has the highest impurity.
• p - fraction of records that belong to one of the two classes.
• Impurity is maximum when p = 0.5 (class distribution is even)
• Impurity is minimum when p = 0 or 1 (all records belong to the same class)
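The impurity formulas appear on the original slides as images; a small sketch of the three standard measures (Gini, entropy, classification error) applied to a node's class distribution:

# impurity measures for a node's class distribution (illustrative sketch)
import math

def gini(p):        return 1 - sum(pi ** 2 for pi in p)
def entropy(p):     return -sum(pi * math.log2(pi) for pi in p if pi > 0)
def class_error(p): return 1 - max(p)

for dist in [(0.0, 1.0), (0.3, 0.7), (0.5, 0.5)]:
    print(dist, round(gini(dist), 3), round(entropy(dist), 3), round(class_error(dist), 3))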
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• Node N1 has the lowest impurity value, followed by N2 and N3.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• To Determine the performance of test condition – compare the degree of
impurity of the parent node (before splitting) with the degree of impurity
of the child nodes (after splitting).
• The larger their difference, the better the test condition.
• Information Gain:
• I(·) - impurity measure of a given node,
• N - total no. of records at parent node,
• k - no. of attribute values
• N(vj) - no. of records associated with the child node, vj.
• Gain of a test condition: ∆ = I(parent) − Σj [N(vj)/N] · I(vj)
• When entropy is used as the impurity measure I(·), the difference in entropy is known as the Information
gain, ∆info
Calculate Impurity using Gini
Find out, which attribute
is selected?
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Binary Attributes
○ Before splitting, the Gini index is 0.5
■ because equal number of records
from both classes.
○ If attribute A is chosen to split the
data,
■ Gini index
● node N1 = 0.4898, and
● node N2 = 0.480.
■ Weighted average of the Gini index
for the descendent nodes is
● (7/12) × 0.4898 + (5/12) × 0.480
= 0.486.
○ Weighted average of the Gini index for
attribute B is 0.375.
○ B is selected because of small value
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Nominal Attributes
○ First Binary Grouping
■ Gini index of {Sports, Luxury} is 0.4922 and
■ the Gini index of {Family} is 0.3750.
■ The weighted average Gini index
16/20 × 0.4922 + 4/20 × 0.3750 =
0.468.
○ Second binary grouping of {Sports} and {Family, Luxury},
■ weighted average Gini index is 0.167.
● The second grouping has a
lower Gini index
because its corresponding subsets
are much purer.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Continuous Attributes
● A brute-force method -Take every value of the attribute in the N records as a candidate split position.
● Count the number of records with annual income less than or greater than
v(computationally expensive).
● To reduce the complexity, the training records are sorted based on their annual income,
● Candidate split positions are identified by taking the midpoints between two adjacent sorted values:
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
○ Problem:
■ Customer ID - produce purer partitions.
■ Customer ID is not a predictive attribute because its value is
unique for each record.
○ Two Strategies:
■ First strategy(used in CART)
● restrict the test conditions to binary splits only.
■ Second Strategy(used in C4.5 - Gain Ratio - to determine goodness
of a split)
● modify the splitting criterion
● consider - number of outcomes produced by the attribute test
condition.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
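The formula on the original slide is an image; in standard notation (with k the total number of splits and P(vi) the fraction of records assigned to child vi):

$$ \text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} P(v_i)\,\log_2 P(v_i) $$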
Tree-Pruning
• After building the decision tree,
• Tree-pruning step - to reduce the size of the decision
tree.
• Pruning -
• trims the branches of the initial tree
• improves the generalization capability of the
decision tree.
• Decision trees that are too large are susceptible to a
phenomenon known as overfitting.
Model Overfitting
Model Overfitting
DWDM Unit-III
Model Overfitting
● Errors that generally occur in a classification model are:-
○ Training Errors ( or Resubstitution Error or Apparent Error)
■ No. of misclassification errors Committed on Training data
○ Generalization Errors
■ Expected Error of the model on previously unused records.
● Model Overfitting:
○ Model is overfitting your training data when you see that the model performs well on the training data but
does not perform well on the evaluation (Test) data.
○ This is because the model is memorizing the data it has seen and is unable to generalize to unseen
examples.
Model Overfitting
● Model Underfitting:
● Model is underfitting the training data when the model performs poorly on the training data.
● Model is unable to capture the relationship between the input examples (X) and the target values (Y).
https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 0, Test Error - 30%
● Humans and dolphins were misclassified
● Spiny anteaters (exceptional case)
● Errors due to exceptional cases are often
Unavoidable and establish the minimum error
rate achievable by any classifier.
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 20%, Test Error - 10%
Model Overfitting
Overfitting Due to Lack of Representative Samples
Overfitting also occurs when only a small number of training records is available
● Training error is zero, Test Error is 30%
● Humans, elephants, and dolphins are misclassified
● The decision tree classifies all warm-blooded vertebrates
that do not hibernate as non-mammals (because of the
eagle record - lack of representative samples).
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Methods commonly used to evaluate the performance of a classifier
○ Hold Out method
○ Random Sub Sampling
○ Cross Validation
■ K-fold
■ Leave-one-out
○ Bootstrap
■ .632 Bootstrap
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Hold Out method
○ Original data - partitioned into two disjoint sets
■ training set
■ test sets
○ A classification model is then induced from the training set
○ Model performance is evaluated on the test set.
○ Analysts can decide the proportion of data reserved for training and for testing
■ e.g., 50-50 or
■ two-thirds - training & one-third - testing
○ Limitations
1. The model may not be good because fewer records are available for model induction
2. The model may be highly dependent on the composition of the training and test sets.
● If the training set is too small, the variance of the model is larger.
● If the training set is too large, the estimated accuracy from the smaller test set is less reliable.
3. The training and test sets are no longer independent of each other.
https://www.datavedas.com/holdout-cross-validation/
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Random Sub Sampling
○ The holdout method can be repeated several times to improve the estimation of a classifier’s performance.
○ Overall accuracy is the average of the model accuracies obtained over all the runs.
○ Problems:
■ Does not utilize as much data as possible for training.
■ No control over the number of times each record is used for testing and training.
https://blog.ineuron.ai/Hold-Out-Method-Random-Sub-Sampling-Method-3MLDEXAZML
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ Alternative to Random Subsampling
○ Each record is used the same number of times for training and exactly once for testing.
○ Two fold cross-validation
■ Partition the data into two equal-sized subsets.
■ one of the subsets for training and the other for testing.
■ Then swap the roles of the subsets
https://fengkehh.github.io/post/introduction-to-cross-validation/ - picture reference
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ K-Fold Cross Validation
■ k equal-sized partitions
■ During each run,
● one of the partitions is chosen for testing,
● while the rest of them are used for training.
■ Total error is found by summing up the errors for all k runs.
Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
Model Overfitting - Evaluating the Performance of a Classifier
Cross-validation
● Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly
split up into ‘k’ groups. One of the groups is used as the test set and the rest are
used as the training set. The model is trained on the training set and scored on
the test set. Then the process is repeated until each unique group has been used
as the test set.
● For example, for 5-fold cross validation, the dataset would be split into 5 groups,
and the model would be trained and tested 5 separate times so each group would
get a chance to be the test set. This can be seen in the graph below.
● 5-fold cross validation (figure)
Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
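As an illustrative sketch (not taken from the linked pages), k-fold cross-validation can be run with scikit-learn as follows; the iris dataset and the decision-tree model are assumptions for the example:

```python
# Minimal 5-fold cross-validation sketch (assumed example).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # X: attribute set, y: class labels

kfold = KFold(n_splits=5, shuffle=True, random_state=42)   # 5 equal-sized partitions
model = DecisionTreeClassifier()

# Each of the 5 runs trains on 4 folds and tests on the remaining fold.
scores = cross_val_score(model, X, y, cv=kfold)
print("Accuracy per fold:", scores)
print("Estimated accuracy:", scores.mean())
```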
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ leave-one-out approach
■ A special case of the k-fold cross-validation
● sets k = N ( Dataset size)
■ Size of test set = 1 record
■ All remaining records = Training set
■ Advantage
● Utilizing as much data as possible for training
● Test sets are mutually exclusive and they effectively cover the entire data set.
■ Drawback
● computationally expensive
Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
Model Overfitting
Evaluating the Performance of a Classifier
● Bootstrap
○ Training records are sampled with replacement;
■ A record already chosen for training is put back into the original pool of records so that it is equally likely
to be redrawn.
○ Probability a record is chosen by a bootstrap sample is 1 − (1 − 1/N)^N
■ When N is sufficiently large, this probability asymptotically approaches 1 − e⁻¹ ≈ 0.632.
○ On average, a bootstrap sample contains 63.2% of the records of the original data.
● b - number of bootstrap runs
● εi - accuracy of the model on the i-th bootstrap sample, accs - accuracy of the model on the full training data
Picture reference - https://bradleyboehmke.github.io/HOML/process.html
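Using the symbols above, the accuracy estimate of the .632 bootstrap (standard formulation) is:

```latex
acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \times \epsilon_i + 0.368 \times acc_s \right)
```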
Bayesian Classifiers
Bayesian Classifiers
DWDM Unit - III
Bayesian Classifiers
● In many applications the relationship between the attribute set and
the class variable is non-deterministic.
● Example:
○ Risk for heart disease based on the person's diet and workout
frequency.
● So, Modeling probabilistic relationships between the attribute
set and the class variable.
● Bayes Theorem
Bayesian Classifiers
● Consider a football game between two rival teams: Team 0 and Team 1.
● Suppose Team 0 wins 65% of the time and Team 1 wins the remaining
matches.
● Among the games won by Team 0, only 30% of them come from playing on
Team 1’s football field.
● On the other hand, 75% of the victories for Team 1 are obtained while playing at
home.
● If Team 1 is to host the next match between the two teams, which team will
most likely emerge as the winner?
● This Problem can be solved by Bayes Theorem
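A worked solution with Bayes theorem, using the probabilities given above:

```latex
P(\text{Team 1 wins}) = 0.35, \qquad P(\text{Team 0 wins}) = 0.65
P(\text{Team 1 hosts} \mid \text{Team 1 wins}) = 0.75, \qquad
P(\text{Team 1 hosts} \mid \text{Team 0 wins}) = 0.30

P(\text{Team 1 wins} \mid \text{Team 1 hosts})
  = \frac{0.75 \times 0.35}{0.75 \times 0.35 + 0.30 \times 0.65}
  = \frac{0.2625}{0.4575} \approx 0.57
```

Since 0.57 > 0.43, Team 1 is the more likely winner when it hosts the match.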
Bayesian Classifiers
● Bayes Theorem
○ X and Y are random variables.
○ A conditional probability is the probability that a random variable will take on a
particular value given that the outcome for another random variable is known.
○ Example:
■ conditional probability P(Y = y|X = x) refers to the probability that the variable
Y will take on the value y, given that the variable X is observed to have the
value x.
Bayesian Classifiers
● Bayes Theorem
If {X1, X2,..., Xk} is the set of mutually exclusive and exhaustive outcomes of a
random variable X, then the denominator of the previous slide equation can be
expressed as follows:
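Assuming the Bayes-theorem equation on the previous slide has P(Y) in its denominator, the law of total probability lets that denominator be expanded as:

```latex
P(Y) = \sum_{i=1}^{k} P(Y, X_i) = \sum_{i=1}^{k} P(Y \mid X_i)\, P(X_i)
```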
Bayesian Classifiers
● Bayes Theorem
Bayesian Classifiers
● Bayes Theorem
○ Using the Bayes Theorem for Classification
■ X - attribute set
■ Y - class variable.
○ Treat X and Y as random variables -for non-deterministic relationship
○ Capture relationship probabilistically using P(Y |X) - Posterior Probability or Conditional Probability
○ P(Y) - prior probability
○ Training phase
■ Learn the posterior probabilities P(Y |X) for every combination of X and Y
○ Use these probabilities to classify a test record X` by finding the class Y` with the maximum posterior probability P(Y`|X`)
Bayesian Classifiers
Using the Bayes Theorem for Classification
Example:-
● test record
X= (Home Owner = No, Marital Status = Married, Annual Income = $120K)
● Y=?
● Use training data & compute - posterior probabilities P(Yes|X) and P(No|X)
● Y= Yes, if P(Yes|X) > P(No|X)
● Y= No, Otherwise
Bayesian Classifiers
Computing P(X|Y) - Class Conditional Probability
Naïve Bayes Classifier
● assumes that the attributes are conditionally independent, given the class label y.
● The conditional independence assumption can be formally stated as follows:
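With d attributes X = {X1, X2, ..., Xd}, the assumption reads (standard naïve Bayes formulation):

```latex
P(X \mid Y = y) = \prod_{i=1}^{d} P(X_i \mid Y = y)
```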
Bayesian Classifiers
How a Naïve Bayes Classifier Works
● Assumption - conditional independence
● Estimate the conditional probability of each Xi, given Y
○ (instead of computing the class-conditional probability for every combination of X)
○ No need of very large training set to obtain a good estimate of the probability.
● To classify a test record,
○ Compute the posterior probability for each class Y:
■ P(X) can be ignored
● Since it is fixed for every Y, it is sufficient to choose the class that maximizes the
numerator term
Bayesian Classifiers
Estimating Conditional Probabilities for Binary Attributes
Xi - categorical attribute , xi - one of the value under attribute Xi
Y - Target Attribute ( for Class Label), y- one class Label
conditional probability P(Xi = xi |Y = y) = fraction of training instances in class y that take on
attribute value xi.
P(Home Owner=yes|DB=no) =
(No. of records with HO=yes and DB=no) / (Total no. of records with DB=no)
= 3/7
P(Home Owner=no|DB=no)=4/7
P(Home Owner=yes|DB=yes)=0
P(Home Owner=no|DB=yes)=3/3
Bayesian Classifiers
Estimating Conditional Probabilities for Categorical Attributes
P(MS=single|DB=no) = 2/7
P(MS=married|DB=no) = 4/7
P(MS=divorced|DB=no) =1/7
P(MS=single|DB=yes) = 2/3
P(MS=married|DB=yes) = 0/3
P(MS=divorced|DB=yes) =1 /3
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Discretization
● Probability Distribution
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Discretization (Transforming continuous attributes into ordinal attributes)
○ Replace the continuous attribute value with its corresponding discrete interval.
○ Estimation error depends on
○ Estimation error depends on
■ the discretization strategy
■ the number of discrete intervals.
○ If the number of intervals is too large, there are too few training records in
each interval
○ If the number of intervals is too small, then some intervals may aggregate
records from different classes and we may miss the correct decision boundary.
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Probability Distribution
○ Gaussian distribution can be used to represent the class-conditional probability for continuous
attributes.
○ The distribution is characterized by two parameters,
■ mean, µ
■ variance, σ²
µij - sample mean of Xi for all training records that belong to the class yj.
σ²ij - sample variance (s²) of such training records.
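The class-conditional density itself (the equation on the slide was an image; this is the standard Gaussian form):

```latex
P(X_i = x_i \mid Y = y_j) =
\frac{1}{\sqrt{2\pi}\,\sigma_{ij}}
\exp\!\left( -\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^{2}} \right)
```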
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Probability Distribution
sample mean and variance for this attribute with respect to the class No
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● Compute the class conditional probability for each categorical attribute
● Compute sample mean and variance for the continuous attribute
● Predict the class label of a test record
X = (Home Owner=No, Marital Status = Married,
Income = $120K)
● compute the posterior probabilities
○ P(No|X)
○ P(Yes|X)
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● P(yes) = 3/10 =0.3 P(no) =7/10 = 0.7
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● P(no|x)= ?
● P(yes|x) = ?
● The class with the larger posterior value is chosen as the class label
● X = (Home Owner=No, Marital Status = Married, Income = $120K)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) = ?
● P(Y|X) ∝ P(Y) * P(X|Y) (the denominator P(X) is the same for every class and can be ignored)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) =
P(DB=no) * P(Home Owner=No, Marital Status = Married, Income = $120K | DB=no)
● P(X|Y) = P(HM=no|DB=no) * P(MS=married|DB=no) * P(Income=$120K|DB=no)
= 4/7 * 4/7 * 0.0072
=0.0024
Bayesian Classifiers
Example of the Naïve Bayes Classifier
P(DB=no | X)=P(DB=no)*P(X | DB=no) = 7/10 * 0.0024 = 0.0016
P(DB=yes | X)=P(DB=yes)*P(X | DB=yes) = 3/10 * 0 = 0
Class Label for the record is NO
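A small Python sketch that reproduces the computation above (the class-conditional probabilities and the Annual Income mean/variance come from the slide example; the code itself is illustrative):

```python
# Naïve Bayes scoring of X = (Home Owner=No, Marital Status=Married, Income=$120K).
from math import sqrt, pi, exp

p_no, p_yes = 7 / 10, 3 / 10                       # class priors P(DB=no), P(DB=yes)
p_ho_no_given_no, p_ho_no_given_yes = 4 / 7, 3 / 3
p_ms_married_given_no, p_ms_married_given_yes = 4 / 7, 0 / 3

def gaussian(x, mean, var):
    """Class-conditional density for a continuous attribute."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Sample mean/variance of Annual Income per class (values as in the slide example).
p_income_given_no = gaussian(120, 110, 2975)       # ≈ 0.0072
p_income_given_yes = gaussian(120, 90, 25)         # ≈ 1.2e-9

score_no = p_no * p_ho_no_given_no * p_ms_married_given_no * p_income_given_no
score_yes = p_yes * p_ho_no_given_yes * p_ms_married_given_yes * p_income_given_yes
print("P(No|X)  ∝", round(score_no, 4))            # ≈ 0.0016
print("P(Yes|X) ∝", score_yes)                     # 0 -> predicted class label is No
```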
Bayesian Classifiers
Find out Class Label ( Play Golf ) for
today = (Sunny, Hot, Normal, False)
https://www.geeksforgeeks.org/naive-bayes-classifiers/
Association Analysis:
Basic Concepts and Algorithms
Basic Concepts and Algorithms
DWDM Unit - IV
Basic Concepts
● Retailers are interested in analyzing the data to learn
about the purchasing behavior of their customers.
● Such Information is used in marketing promotions, inventory
management, and customer relationship management.
● Association analysis - useful for discovering interesting
relationships hidden in large data sets.
● The uncovered relationships can be represented in the form
of association rules or sets of frequent items.
Basic Concepts
● Example Association Rule
○ {Diapers} → {Beer}
● rule suggests - strong relationship exists between the sale of
diapers and beer
● many customers who buy diapers also buy beer.
● Association analysis is also applicable to
○ Bioinformatics,
○ Medical diagnosis,
○ Web mining, and
○ Scientific data analysis
● Example - analysis of Earth science data(ocean, land, &
atmospheric processes)
Basic Concepts
Problem Definition:
● Binary representation of market basket data
● each row - transaction
● each column - item
● value is one if the item is present in a transaction and
zero otherwise.
● item is an asymmetric binary variable because the
presence of an item in a transaction is often considered
more important than its absence
Basic Concepts
Itemset and Support Count:
I = {i1,i2,.. .,id} - set of all items
T = {t1, t2,..., tN} - set of all transactions
Each transaction ti contains a subset of items chosen from I
Itemset - collection of zero or more items
K-itemset - itemset contains k items
Example:-
{Beer, Diapers, Milk} - 3-itemset
null (or empty) set - no items
Basic Concepts
Itemset and Support Count:
● Transaction width - number of items present in a
transaction.
● A transaction tj contains an itemset X if X is a subset of
tj.
● Example:
○ t2 contains itemset {Bread, Diapers} but not {Bread, Milk}.
● support count,σ(X) - number of transactions that contain a
particular itemset.
● σ(X) = |{ti |X ⊆ ti, ti ∈ T}|,
○ symbol | · | denote the number of elements in a set.
● support count for {Beer, Diapers, Milk} =2
○ ( 2 transactions contain all three items)
Basic Concepts
Association Rule:
● An association rule is an implication expression of
the form X → Y, where X and Y are disjoint itemsets
○ i.e., X ∩ Y = ∅.
● The strength of an association rule can be measured
in terms of its support and confidence.
Basic Concepts
● Support
○ determines how often a rule is applicable to
a given data set
● Confidence
○ determines how frequently items in Y appear
in transactions that contain X
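In symbols (standard definitions, with N the total number of transactions):

```latex
\text{Support: } s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}
\qquad
\text{Confidence: } c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```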
Basic Concepts
● Example:
○ Consider the rule {Milk, Diapers} → {Beer}
○ support count for {Milk, Diapers, Beer}=2
○ total number of transactions=5,
○ rule’s support is 2/5 = 0.4.
○ rule’s confidence =
(support count for {Milk, Diapers, Beer})/(support count for {Milk, Diapers})
= 2/3 = 0.67.
Basic Concepts
Formulation of Association Rule Mining Problem
Association Rule Discovery
Given a set of transactions T, find all the rules having
support ≥ minsup and confidence ≥ minconf, where minsup and
minconf are the corresponding support and confidence
thresholds.
Basic Concepts
Formulation of Association Rule Mining Problem
Association Rule Discovery
● Brute-force approach: compute the support and confidence for every
possible rule (expensive)
● Total number of possible rules extracted from a data set that
contains d items is R = 3^d − 2^(d+1) + 1
● For a dataset of 6 items, the number of possible rules is 3^6 − 2^7 + 1 = 602
rules.
● More than 80% of the rules are discarded after applying minsup=20% &
minconf=50%
● most of the computations become wasted.
● Prune the rules early without having to compute their support and
confidence values.
Basic Concepts
Formulation of Association Rule Mining Problem
Association Rule Discovery
● Common strategy - decompose the problem into two major
subtasks: (separate support & confidence)
1. Frequent Itemset Generation:
■ Objective:Find all the itemsets that satisfy the minsup threshold.
2. Rule Generation:
■ Objective: Extract all the high-confidence rules from the frequent
itemsets found in the previous step.
■ These rules are called strong rules.
Frequent Itemset Generation
● Lattice structure - list of all
possible itemsets
● itemset lattice for
○ I = {a, b, c, d, e}
● Data set with k items can generate
up to 2^k − 1 frequent itemsets
(without null set)
○ Example: 2^5 − 1 = 31
● So, search space of itemsets in
practical applications is
exponentially large
Frequent Itemset Generation
● A brute-force approach for finding frequent itemsets
○ determine the support count for every candidate
itemset in the lattice structure.
● compare each candidate against every transaction
● Very expensive
○ requires O(NMw) comparisons,
○ N- No. of transactions,
○ M = 2^k − 1 is the number of candidate itemsets
○ w - maximum transaction width.
Frequent Itemset Generation
several ways to reduce the computational complexity of
frequent itemset generation.
● Reduce the number of candidate itemsets (M) - the Apriori principle
● Reduce the number of comparisons - by using more advanced data structures
Frequent Itemset Generation
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Frequent Itemset Generation
Support-based pruning:
● strategy of trimming the exponential search space based on the
support measure is known as support-based pruning.
● It uses anti-monotone property of the support measure.
● Anti-monotone property of the support measure
○ support for an itemset never exceeds the support for its subsets.
● Example:
○ {a, b} is infrequent,
○ then all of its supersets must be infrequent too.
○ entire subgraph containing the supersets of {a, b} can be pruned immediately
Frequent Itemset Generation
Let,
I - set of items
J = 2I - power set of I
A measure f is monotone/anti-monotone if
Monotonicity Property(or upward closed):
∀X, Y ∈ J: (X ⊆ Y) → f(X) ≤ f(Y)
Anti-monotone (or downward closed):
∀X, Y ∈ J: (X ⊆ Y) → f(Y) ≤ f(X)
means that if X is a subset of Y, then f(Y) must not exceed f(X).
Frequent Itemset Generation in the Apriori Algorithm
Identify Frequent Itemset
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets
Fk - set of k-frequent itemsets
Frequent Itemset Generation in the Apriori Algorithm
https://www.softwaretestinghelp.com/apriori-algorithm/
Example
Frequent Itemset Generation in the Apriori Algorithm
Example
Apriori in Python
https://intellipaat.com/blog/data-science-apriori-algorithm/
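The linked tutorials use library implementations; a comparable sketch with the mlxtend package is shown below (illustrative only; the five transactions mirror the market-basket example used earlier):

```python
# Apriori sketch with mlxtend (pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]

# Binary (one-hot) representation of the market basket data.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with minsup = 60%, then rules with minconf = 60%.
frequent = apriori(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```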
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets
Fk - set of k-frequent itemsets
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
The apriori-gen function shown in Step 5 of Algorithm 6.1
generates candidate itemsets by performing the following two
operations:
1. Candidate Generation (join)
a. Generates new candidate k-itemsets
b. based on the frequent (k − 1)-itemsets found in the previous
iteration.
2. Candidate Pruning
a. Eliminates some of the candidate k-itemsets using the support-based
pruning strategy.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
Requirements for an effective candidate generation
procedure:
1. It should avoid generating too many unnecessary
candidates
2. It must ensure that the candidate set is complete,
i.e., no frequent itemsets are left out
3. It should not generate the same candidate itemset more
than once (no duplicates).
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
Candidate Generation Procedures
1. Brute-Force Method
2. Fk−1 × F1 Method
3. Fk−1×Fk−1 Method
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
Candidate Generation Procedures
1. Brute-Force Method
a. considers every k-itemset as a
potential candidate
b. candidate pruning ( to remove
unnecessary candidates) becomes
extremely expensive
c. No. of candidate itemsets generated at level k = C(d, k), i.e., "d choose k"
d - no. of items
2. Fk−1 × F1 Method
O(|Fk−1| × |F1|) candidate k-itemsets,
|Fj | = no. of frequent j-itemsets.
overall complexity
● The procedure is complete.
● But the same candidate itemset will be generated more than once ( duplicates).
● Example:
○ {Bread, Diapers, Milk} can be generated
○ by merging {Bread, Diapers} with {Milk},
○ {Bread, Milk} with {Diapers}, or
○ {Diapers, Milk} with {Bread}.
● One Solution
○ Generate candidate itemset by joining items
in lexicographical order only
● {Bread, Diapers} join with {Milk}
Don’t join
● {Diapers, Milk} with {Bread}
● {Bread, Milk} with {Diapers}
because violation of lexicographic ordering
Problem:
Large no. of unnecessary candidates
3. Fk−1×Fk−1 Method (used in the apriori-gen function)
● merges a pair of frequent (k−1)-itemsets only if their
first k−2 items are identical.
● Let A = {a1, a2,..., ak−1} and B = {b1, b2,..., bk−1} be a
pair of frequent (k−1)-itemsets.
● A and B are merged if they satisfy the following
conditions:
○ ai = bi (for i = 1, 2,..., k−2) and
○ ak−1 != bk−1.
Merge {Bread, Diapers} & {Bread, Milk} to form a candidate 3-
itemset {Bread, Diapers, Milk}
Don’t merge {Beer, Diapers} with {Diapers, Milk} because the
first item in both itemsets is different.
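A minimal sketch of the Fk−1 × Fk−1 merge step (illustrative; candidate pruning of the generated k-itemsets would follow):

```python
# Merge frequent (k-1)-itemsets whose first k-2 items are identical.
def merge_candidates(freq_k_minus_1):
    candidates = []
    itemsets = sorted(tuple(sorted(s)) for s in freq_k_minus_1)   # lexicographic order
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            if a[:-1] == b[:-1] and a[-1] != b[-1]:               # same first k-2 items
                candidates.append(a + (b[-1],))
    return candidates

f2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk"), ("Beer", "Diapers")]
print(merge_candidates(f2))            # -> [('Bread', 'Diapers', 'Milk')]
```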
Support Counting
● Support counting is the process of
determining the frequency of
occurrence for every candidate
itemset that survives the candidate
pruning step.
● One approach for doing this is to
compare each transaction against
every candidate itemset (see Figure
6.2) and to update the support
counts of candidates contained in
the transaction.
● This approach is computationally
expensive, especially when the
numbers of transactions and
candidate itemsets are large.
Support Counting
● An alternative approach is to enumerate the itemsets contained in
each transaction and use them to update the support counts of
their respective candidate itemsets.
● To illustrate, consider a transaction t that contains five
items, {1, 2, 3, 5, 6}.
● Assuming that each itemset keeps its items in increasing
lexicographic order, an itemset can be enumerated by specifying
the smallest item first,followed by the larger items.
● For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets
contained in t must begin with item 1, 2, or 3. It is not
possible to construct a 3-itemset that begins with items 5 or 6
because there are only two items in t whose labels are greater
than or equal to 5.
Support Counting
● The number of ways to specify the first item of a 3-itemset
contained in t is illustrated by the Level 1 prefix structures.
For instance, 1 2 3 5 6 represents a 3-itemset that begins with
item 1, followed by two more items chosen from the set {2, 3, 5, 6}
● After fixing the first item, the prefix structures at Level 2
represent the number of ways to select the second item.
For example, 1 2 3 5 6 corresponds to itemsets that begin with
prefix (1 2) and are followed by items 3, 5, or 6.
● Finally, the prefix structures at Level 3 represent the complete
set of 3-itemsets contained in t.
For example, the 3-itemsets that begin with prefix {1 2} are
{1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with
prefix {2 3} are {2, 3, 5} and {2, 3, 6}.
Support Counting
(steps 6 through 11 of Algorithm 6.1. )
● Enumerate the itemsets contained in
each transaction
● Figure 6.9 demonstrates how itemsets
contained in a transaction can be
systematically enumerated, i.e., by
specifying their items one by one,
from the leftmost item to the
rightmost item.
● If an enumerated itemset of the transaction
matches one of the candidates, then
the support count of the
corresponding candidate is
incremented (line 9 in the algorithm).
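An illustrative way to enumerate these itemsets in Python (the candidate set shown is hypothetical):

```python
# Enumerate the 3-itemsets contained in a transaction and update candidate counts.
from itertools import combinations

t = [1, 2, 3, 5, 6]                         # items kept in increasing lexicographic order
three_itemsets = list(combinations(t, 3))   # C(5, 3) = 10 subsets
print(three_itemsets)

# Support counting: increment the count of every enumerated itemset that is a candidate.
candidate_counts = {(1, 2, 5): 0, (2, 3, 6): 0, (4, 5, 6): 0}    # hypothetical candidates
for itemset in three_itemsets:
    if itemset in candidate_counts:
        candidate_counts[itemset] += 1
print(candidate_counts)                     # {(1, 2, 5): 1, (2, 3, 6): 1, (4, 5, 6): 0}
```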
Support Counting Using a Hash Tree
● Candidate itemsets are
partitioned into different
buckets and stored in a
hash tree.
● Itemsets contained in
each transaction are also
hashed into their
appropriate buckets.
● Instead of comparing each
itemset in the transaction
with every candidate
itemset
● Matched only against
candidate itemsets that
belong to the same bucket
Hash Tree from a Candidate Itemset
https://www.youtube.com/watch?v=btW-uU1dhWI
Hash function= p mod 3
Rule generation
&
Compact representation of frequent
itemsets
DWDM
Unit - IV
Association Analysis
Rule Generation
● Each frequent k-itemset can produce up to 2^k − 2 association
rules, ignoring rules that have empty antecedents or
consequents.
● An association rule can be extracted by partitioning the
itemset Y into two non-empty subsets, X and Y −X, such that
X → Y −X satisfies the confidence threshold.
Confidence-Based Pruning
Theorem:
If a rule X → Y −X does not satisfy the confidence
threshold, then
any rule X` → Y − X`, where X` is a subset of X, must
not satisfy the confidence threshold as well.
Rule Generation in Apriori Algorithm
● The Apriori algorithm uses a level-wise approach for generating
association rules, where each level corresponds to the
number of items that belong to the rule consequent.
● Initially, all the high-confidence rules that have only one
item in the rule consequent are extracted.
● These rules are then used to generate new candidate rules.
For example, if {acd} →{b} and {abd} →{c} are high-confidence rules, then the
candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
Rule Generation in Apriori Algorithm
● Figure 6.15 shows a lattice
structure for the association
rules generated from the
frequent itemset {a, b, c, d}.
● If any node in the lattice has
low confidence, then
according to Theorem, the
entire sub-graph spanned by
the node can be pruned
immediately.
● Suppose the confidence for
{bcd} → {a} is low. All the rules
containing item a in its
consequent, can be discarded.
In rule generation, we do not have to make additional passes over the data set
to compute the confidence of the candidate rules.
Instead, we determine the confidence of each rule by using the support counts
computed during frequent itemset generation.
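A small sketch of this idea (the support counts are the ones from the earlier {Milk, Diapers} → {Beer} example; the code itself is illustrative):

```python
# Confidence of candidate rules computed from stored support counts only.
support = {
    frozenset({"Milk", "Diapers", "Beer"}): 2,
    frozenset({"Milk", "Diapers"}): 3,
}

def confidence(antecedent, consequent):
    """conf(X -> Y) = sigma(X u Y) / sigma(X); no extra pass over the data is needed."""
    union = frozenset(antecedent) | frozenset(consequent)
    return support[union] / support[frozenset(antecedent)]

print(confidence({"Milk", "Diapers"}, {"Beer"}))     # 2/3 ≈ 0.67
```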
Compact Representation of Frequent Itemsets
Maximal Frequent Itemsets
Definition
A maximal frequent itemset is defined as a frequent itemset for
which none of its immediate supersets are frequent.
Compact Representation of Frequent Itemsets
● The itemsets in the lattice are divided
into two groups: those that are
frequent and those that are
infrequent.
● A frequent itemset border, which is
represented by a dashed line, is also
illustrated in the diagram.
● Every itemset located above the
border is frequent, while those
located below the border (the shaded
nodes) are infrequent.
● Among the itemsets residing near the
border, {a, d}, {a, c, e}, and {b, c, d, e} are
considered to be maximal frequent
itemsets because their immediate
supersets are infrequent.
● Maximal frequent itemsets do not
contain the support information of
their subsets.
Compact Representation of Frequent Itemsets
● Maximal frequent itemsets effectively provide a compact representation of
frequent itemsets.
● They form the smallest set of itemsets from which all frequent itemsets can
be derived.
● For example, the frequent itemsets shown in Figure 6.16 can be divided into
two groups:
○ Frequent itemsets that begin with item a and that may contain items c, d, or e. This group
includes itemsets such as {a}, {a, c}, {a, d}, {a, e} and {a, c, e}.
○ Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b,
c}, {c, d},{b, c, d, e}, etc.
● Frequent itemsets that belong in the first group are subsets of either {a, c, e}
or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}.
Compact Representation of Frequent Itemsets
● Closed Frequent Itemsets
○ Closed itemsets provide a minimal representation of itemsets
without losing their support information.
○ An itemset X is closed if none of its immediate supersets has
exactly the same support count as X.
Or
○ X is not closed if at least one of its immediate supersets has the
same support count as X.
Compact Representation of Frequent Itemsets
Closed Frequent Itemsets
● An itemset is a closed
frequent itemset if it is
closed and its support is
greater than or equal to
minsup.
Compact Representation of Frequent Itemsets
Closed Frequent Itemsets
● Determine the support counts for the non-closed by using the closed frequent
itemsets
● consider the frequent itemset {a, d}: since it is not closed, its support count must be
identical to that of one of its immediate supersets {a, b, d}, {a, c, d}, or {a, d, e}.
● Apriori principle states
○ any transaction that contains the superset of {a, d} must also contain {a, d}.
○ any transaction that contains {a, d} does not have to contain the supersets of {a, d}.
● So, the support for {a, d} = largest support among its supersets = support of
{a,c,d}
● Algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the
smallest frequent itemsets.
● The items can be divided into
three groups: (1) Group A,
which contains items a1
through a5; (2) Group B, which
contains items b1 through b5;
and (3) Group C, which
contains items c1 through c5.
● The items within each group
are perfectly associated with
each other and they do not
appear with items from
another group. Assuming the
support threshold is 20%,
the total number of frequent
itemsets is 3 × (2^5 − 1) = 93.
● There are only three closed
frequent itemsets in the
data: ({a1, a2, a3, a4, a5}, {b1,
b2, b3, b4, b5}, and {c1, c2, c3,
c4, c5})
● Redundant association rules can be removed by using Closed frequent itemsets
● An association rule X → Y is redundant if there exists another rule X`→ Y`,
where
X is a subset of X` and
Y is a subset of Y `
such that the support and confidence for both rules are identical.
● From table 6.5 {b} is not a closed frequent itemset while {b, c} is closed.
● The association rule {b} → {d, e} is therefore redundant because it has the same
support and confidence as {b, c} → {d, e}.
● Such redundant rules are not generated if closed frequent itemsets are used
for rule generation.
● All maximal frequent itemsets are closed because none of the maximal frequent
itemsets can have the same support count as their immediate supersets.
FP Growth Algorithm
Association Analysis (Unit - IV)
DWDM
FP Growth Algorithm
● FP-growth algorithm takes a radically different approach for discovering frequent itemsets.
● The algorithm encodes the data set using a compact data structure called an FP-tree and extracts
frequent itemsets directly from this structure
FP-Tree Representation
● An FP-tree is a compressed representation of the input data. It is constructed by reading the data
set one transaction at a time and mapping each transaction onto a path in the FP-tree.
● As different transactions can have several items in common, their paths may overlap. The more
the paths overlap with one another, the more compression we can achieve using the FP-tree
structure.
● If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract
frequent itemsets directly from the structure in memory instead of making repeated passes over
the data stored on disk.
FP Tree Representation
● Figure 6.24 shows a data set that
contains ten transactions and five
items.
● The structures of the FP-tree after
reading the first three
transactions are also depicted in
the diagram.
● Each node in the tree contains the
label of an item along with a
counter that shows the number of
transactions mapped onto the
given path.
● Initially, the FP-tree contains only
the root node represented by the
null symbol.
FP Tree Representation
1. The data set is scanned once to
determine the support count of
each item. Infrequent items are
discarded, while the frequent
items are sorted in decreasing
support counts. For the data set
shown in Figure, a is the most
frequent item, followed by b, c, d,
and e.
FP Tree Representation
2. The algorithm makes a second
pass over the data to construct
the FP-tree. After reading the
first transaction, {a, b}, the nodes
labeled as a and b are created. A
path is then formed from null →
a → b to encode the transaction.
Every node along the path has a
frequency count of 1.
FP Tree Representation
3. After reading the second transaction, {b,c,d}, a new set of
nodes is created for items b, c, and d. A path is then
formed to represent the transaction by connecting the
nodes null → b → c → d. Every node along this path
also has a frequency count equal to one.
4. The third transaction, {a,c,d,e}, shares a common prefix
item (which is a) with the first transaction. As a result,
the path for the third transaction, null → a → c → d
→ e, overlaps with the path for the first transaction,
null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two,
while the frequency counts for the newly created
nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been
mapped onto one of the paths given in the FP-tree.
The resulting FP-tree after reading all the transactions
is shown in Figure 6.24.
FP Tree Representation
● The size of an FP-tree is typically smaller
than the size of the uncompressed data
because many transactions in market
basket data often share a few items in
common.
● In the best-case scenario, where all the
transactions have the same set of items,
the FP-tree contains only a single branch
of nodes.
● The worst-case scenario happens when
every transaction has a unique set of
items.
FP Tree Representation
● The size of an FP-tree also
depends on how the items are
ordered.
● If the ordering scheme in the
preceding example is reversed,
i.e., from lowest to highest
support item, the resulting FP-
tree is shown in Figure 6.25.
● An FP-tree also contains a list
of pointers connecting
between nodes that have the
same items.
● These pointers, represented as
dashed lines in Figures 6.24
and 6.25, help to facilitate the
rapid access of individual
items in the tree.
Frequent Itemset Generation using FP-Growth Algorithm
Steps in FP-Growth Algorithm:
Step-1: Scan the database to build Frequent 1-item set which will contain all
the elements whose frequency is greater than or equal to the minimum
support. These elements are stored in descending order of their
respective frequencies.
Step-2: For each transaction, the respective Ordered-Item set is built.
Step-3: Construct the FP tree. by scanning each Ordered-Item set
Step-4: For each item, the Conditional Pattern Base is computed which is
path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Step-5: For each item, the Conditional Frequent Pattern Tree is built.
Step-6: Frequent Pattern rules are generated by pairing the items of the
Conditional Frequent Pattern Tree set to each corresponding item.
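For comparison, a library-based sketch using mlxtend's fpgrowth (illustrative; the transactions are chosen so that the item frequencies match the K/E/M/O/Y example that follows):

```python
# FP-Growth sketch with mlxtend (pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["E", "K", "M", "N", "O", "Y"],
    ["D", "E", "K", "N", "O", "Y"],
    ["A", "E", "K", "M"],
    ["C", "K", "M", "U", "Y"],
    ["C", "E", "I", "K", "O"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# min_support = 3/5 = 0.6 corresponds to a minimum support count of 3.
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```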
Frequent Itemset Generation in FP-Growth Algorithm
Example:
The frequency of each individual
item is computed:-
Given Database: min_support=3
Frequent Itemset Generation in FP-Growth Algorithm
● A Frequent Pattern set is built which will contain all the elements whose
frequency is greater than or equal to the minimum support. These elements are
stored in descending order of their respective frequencies.
● L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
● Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained
in the transaction. The following table is built for all the transactions:
Frequent Itemset Generation in FP-Growth Algorithm
Now, all the Ordered-Item sets
are inserted into a Trie Data
Structure.
a) Inserting the set {K, E, M, O,
Y}:
All the items are simply
linked one after the other in
the order of occurrence in
the set and initialize the
support count for each item
as 1.
Frequent Itemset Generation in FP-Growth Algorithm
b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and
E, simply the support count is increased
by 1.
There is no direct link between E and O,
therefore a new node for the item O is
initialized with the support count as 1
and item E is linked to this new node.
On inserting Y, we first initialize a new
node for the item Y with support count
as 1 and link the new node of O with the
new node of Y.
Frequent Itemset Generation in FP-Growth Algorithm
c) Inserting the set {K, E, M}:
● Here simply the support
count of each element is
increased by 1.
Frequent Itemset Generation in FP-Growth Algorithm
d) Inserting the set {K, M, Y}:
● Similar to step b), first the
support count of K is
increased, then new nodes
for M and Y are initialized
and linked accordingly.
Frequent Itemset Generation in FP-Growth Algorithm
e) Inserting the set {K, E, O}:
● Here simply the support
counts of the respective
elements are increased.
Frequent Itemset Generation in FP-Growth Algorithm
Now, for each item starting from leaf, the Conditional Pattern Base is computed
which is path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Frequent Itemset Generation in FP-Growth Algorithm
Now for each item, the Conditional Frequent Pattern Tree is built.
It is done by taking the set of elements that is common in all the paths in the Conditional
Pattern Base of that item and calculating its support count by summing the support counts of all
the paths in the Conditional Pattern Base.
The itemsets whose support count >= min_support value are retained in the Conditional
Frequent Pattern Tree and the rest are discarded.
Frequent Itemset Generation in FP-Growth Algorithm
From the Conditional Frequent Pattern tree, the Frequent Pattern rules are
generated by pairing the items of the Conditional Frequent Pattern Tree set
with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred for example for the first
row which contains the element, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both the rules is calculated and the one
with confidence greater than or equal to the minimum confidence value is retained.
Data Mining
Cluster Analysis: Basic Concepts
and Algorithms
Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
What is Cluster Analysis?
● Given a set of objects, place them in groups such that the
objects in a group are similar (or related) to one another and
different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
Applications of Cluster Analysis
● Understanding
– Group related documents
for browsing(Information
Retrieval),
– group genes and proteins
that have similar
functionality(Biology),
– group stocks with similar
price fluctuations
(Business)
– Climate
– Psychology & Medicine
Clustering precipitation
in Australia
Applications of Cluster Analysis
● Clustering for Utility
– Summarization
– Compression
– Efficiently finding Nearest
Neighbors
Clustering precipitation
in Australia
Notion of a Cluster can be Ambiguous
How many clusters? Six Clusters
Four Clusters
Two Clusters
Types of Clusterings
● A clustering is a set of clusters
● Important distinction between hierarchical and
partitional sets of clusters
– Partitional Clustering (unnested)
◆ A division of data objects into non-overlapping subsets (clusters)
– Hierarchical clustering (nested)
◆ A set of nested clusters organized as a hierarchical tree
Partitional Clustering
Original Points A Partitional Clustering
Hierarchical Clustering
Traditional Hierarchical Clustering, Traditional Dendrogram
Non-traditional Hierarchical Clustering, Non-traditional Dendrogram
Other Distinctions Between Sets of Clusters
● Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple
clusters.
◆ Can belong to multiple classes or could be ‘border’ points
– Fuzzy clustering (one type of non-exclusive)
◆ In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
◆ Weights must sum to 1
◆ Probabilistic clustering has similar characteristics
● Partial versus complete
– In some cases, we only want to cluster some of the data
Types of Clusters
● Well-separated clusters
● Prototype-based clusters
● Contiguity-based clusters
● Density-based clusters
● Described by an Objective Function
Types of Clusters: Well-Separated
● Well-Separated Clusters:
– A cluster with a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster
than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Prototype-Based
● Prototype-based ( or center based)
– A cluster with set of points such that a point in a cluster is
closer (more similar) to the prototype or “center” of the
cluster, than to the center of any other cluster
– If Data is Continuous – Center will be Centroid /mean
– If Data is Categorical - Center will be Medoid ( Most
Representative point)
4 center-based clusters
Types of Clusters: Contiguity-Based ( Graph)
● Contiguous Cluster (Nearest neighbor or
Transitive)
– A cluster with set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
– Graph view (data points are nodes, links are connections): a cluster is a group of
connected objects, with no connections to objects outside the group.
● Useful when clusters are irregular or intertwined
● Trouble when noise is present
– a small bridge of points can merge two distinct clusters.
8 contiguous clusters
Types of Clusters: Density-Based
● Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
The two circular clusters are not merged, as in the figure, because the bridge between them
(previous slide figure) fades into the noise.
Curve that is present in previous slide Figure also fades into the noise and does not form a
cluster
A density based definition of a cluster is often employed when the clusters are irregular or intertwined,
and when noise and outliers are present.
Types of Clusters: Density-Based
● Shared property(Conceptual Clusters)
– a cluster as a set of objects that share some
property.
A clustering algorithm would need a very specific concept (sophisticated) of a cluster to successfully
detect these clusters. The process of finding such clusters is called conceptual clustering.
Clustering Algorithms
● K-means and its variants
● Hierarchical clustering
● Density-based clustering
K-means
● Prototype-based, partitional clustering
technique
● Attempts to find a user-specified number of
clusters (K)
Agglomerative Hierarchical Clustering
● Hierarchical clustering
● Starts with each point as a singleton cluster
● Repeatedly merges the two closest clusters
until a single, all encompassing cluster
remains.
● Some Times - graph-based clustering
● Others - prototype-based approach.
DBSCAN
● Density-based clustering algorithm
● Produces a partitional clustering,
● No. of clusters is automatically determined by
the algorithm.
● Noise - Points in low-density regions (omitted)
● Not a complete clustering.
K-means Clustering
● Partitional clustering approach
● Number of clusters, K, must be specified
● Each cluster is associated with a centroid (center point)
● Each point is assigned to the cluster with the closest
centroid
● The basic algorithm is very simple
Example of K-means Clustering
K-means Clustering – Details
● Simple iterative algorithm.
– Choose initial centroids;
– repeat {assign each point to a nearest centroid; re-compute cluster centroids}
– until centroids stop changing.
● Initial centroids are often chosen randomly.
– Clusters produced can vary from one run to another
● The centroid is (typically) the mean of the points in the cluster,
but other definitions are possible
● Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
● Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
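A minimal sketch with scikit-learn (illustrative; the three synthetic blobs are assumptions for the example):

```python
# Basic K-means run; n_init repeats the algorithm with different random initial
# centroids and keeps the solution with the lowest SSE (inertia).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("Centroids:\n", km.cluster_centers_)
print("SSE (inertia):", km.inertia_)
print("First 10 labels:", km.labels_[:10])
```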
K-means Clustering – Details
● Centroid can vary, depending on the proximity
measure for the data and the goal of the
clustering.
● The goal of the clustering is typically expressed
by an objective function that depends on the
proximities of the points to one another or to the
cluster centroids.
● e.g., minimize the squared distance of each point
to its closest centroid
K-means Clustering – Details
Centroids and Objective Functions
Data in Euclidean Space
● A common objective function (used with Euclidean
distance measure) is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
center
– To get SSE, we square these errors and sum them.
– x is a data point in cluster Ci and mi is the centroid (mean) for
cluster Ci
– A K-means run which produces the minimum SSE is preferred.
– The SSE and the centroid (mean) of the i-th cluster are given below:
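Reconstructed from the standard definitions (the formulas on the slide were images):

```latex
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2
\qquad
m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```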
K-means Objective Function
Document Data
● Cosine Similarity
● Document data is represented as Document Term Matrix
● Objective (Cohesion of the cluster)
– Maximize the similarity of the documents in a cluster
to the cluster centroid; which is called cohesion of
the cluster
Two different K-means Clusterings
Original Points
Figure a shows a clustering
solution that is the global
minimum of the SSE for three
clusters
Figure b shows suboptimal
clustering that is only a local
minimum.
Fig b: Sub-optimal Clustering
Fig a: Optimal Clustering
Importance of Choosing Initial Centroids …
The below 2 figures show the clusters that result from two particular choices of initial centroids.
(For both figures, the positions of the cluster centroids in the various iterations are indicated by
crosses.)
Fig-1
In Figure 1, even though all the
initial centroids are from one
natural cluster, the minimum
SSE clustering is still found
In Figure 2, even though the initial
centroids seem to be better
distributed, we obtain a
suboptimal clustering, with higher
squared error. This is considered a
poor choice of starting centroids
Fig-2
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
● Figure 5.7 shows that if a pair of clusters has only one initial
centroid and the other pair has three, then two of the true
clusters will be combined and one true cluster will be split.
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others
have only one.
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to Initial Centroids Problem
● Multiple runs
● K-means++
● Use hierarchical clustering to determine initial
centroids
● Bisecting K-means
Multiple Runs
● One technique that is commonly
used to address the problem of
choosing initial centroids is to
perform multiple runs, each with
a different set of randomly
chosen initial centroids, and then
select the set of clusters with the
minimum SSE
● In Figure 5.6(a), the data
consists of two pairs of clusters,
where the clusters in each (top-
bottom) pair are closer to each
other than to the clusters in the
other pair.
● Figure 5.6 (b–d) shows that if we
start with two initial centroids per
pair of clusters, then even when
both centroids are in a single
cluster, the centroids will
redistribute themselves so that
the “true” clusters are found.
K-means++
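The K-means++ slides above are figure-only; as a reminder, the standard K-means++ initialization can be sketched as follows (illustrative, not taken from the slides):

```python
# K-means++ initialization: pick the first centroid uniformly at random, then pick
# each subsequent centroid with probability proportional to the squared distance
# to its nearest already-chosen centroid.
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

points = np.random.default_rng(1).normal(size=(300, 2))
print(kmeans_pp_init(points, k=3))     # pass these as initial centroids to K-means
```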
Bisecting K-means
● Bisecting K-means algorithm
– Variant of K-means that can produce a partitional or a
hierarchical clustering
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
https://www.geeksforgeeks.org/bisecting-k-means-algorithm-introduction/
Limitations of K-means
● K-means has problems when clusters are of
differing
– Sizes
– Densities
– Non-globular shapes
● K-means has problems when the data contains
outliers.
– One possible solution is to remove outliers before
clustering
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to find a large number of clusters such that each of them represents a part of
a natural cluster. But these small clusters need to be put together in a post-processing step.
Hierarchical Clustering
● Produces a set of nested clusters organized as a
hierarchical tree
● Can be visualized as a dendrogram
– A tree like diagram that records the sequences of
merges or splits
Strengths of Hierarchical Clustering
● Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
Hierarchical Clustering
● Two main types of hierarchical clustering
– Agglomerative:
◆ Start with the points as individual clusters
◆ At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
– Divisive:
◆ Start with one, all-inclusive cluster
◆ At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)
● Traditional hierarchical algorithms use a similarity or
distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
● Key Idea: Successively merge closest clusters
● Basic algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
● Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters
distinguish the different algorithms
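An illustrative sketch of agglomerative clustering with SciPy (the random points are assumptions for the example):

```python
# Agglomerative clustering: proximity matrix, repeated merging, dendrogram cut.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.normal(size=(12, 2))            # small illustrative data set

proximity = pdist(points)                    # step 1: pairwise proximity (distances)
Z = linkage(proximity, method="single")      # merge the two closest clusters repeatedly
# method can be "single" (MIN), "complete" (MAX), "average" (group average), or "ward"

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge sequence (needs matplotlib)
```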
Steps 1 and 2
● Start with clusters of individual points and a
proximity matrix
(figure: individual points p1–p5 and their proximity matrix)
Intermediate Situation
● After some merging steps, we have some clusters
(figure: current clusters C1–C5 and their proximity matrix)
Step 4
● We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
(figure: clusters C1–C5 before merging C2 and C5, with their proximity matrix)
Step 5
● The question is “How do we update the proximity matrix?”
(figure: merged cluster C2 ∪ C5; the rows and columns of the proximity matrix that must be updated are marked with "?")
How to Define Inter-Cluster Distance
(figure: points p1–p5 and their proximity matrix)
● MIN
● MAX
● Group Average
● Distance Between Centroids
● Other methods driven by an objective function
– Ward's Method uses squared error
MIN or Single Link
● Proximity of two clusters is based on the two
closest points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph
● Example:
Distance Matrix:
Hierarchical Clustering: MIN
(figure: nested clusters and dendrogram for single link (MIN))
Strength of MIN
Original Points Six Clusters
• Can handle non-elliptical shapes
Limitations of MIN
Original Points / Two Clusters / Three Clusters (figure panels)
• Sensitive to noise
MAX or Complete Linkage
● Proximity of two clusters is based on the two
most distant points in the different clusters
– Determined by all pairs of points in the two clusters
Distance Matrix:
Hierarchical Clustering: MAX
(figure: nested clusters and dendrogram for complete link (MAX))
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise
Limitations of MAX
Original Points Two Clusters
• Tends to break large clusters
• Biased towards globular clusters
Group Average
● Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.
Distance Matrix:
Hierarchical Clustering: Group Average
(figure: nested clusters and dendrogram for group average)
Hierarchical Clustering: Group Average
● Compromise between Single and Complete
Link
● Strengths
– Less susceptible to noise
● Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
73
● Similarity of two clusters is based on the increase in squared error when two clusters are merged
– Similar to group average if distance between points is distance squared
● Less susceptible to noise
● Biased towards globular clusters
● Hierarchical analogue of K-means
– Can be used to initialize K-means (see the sketch below)
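One way this initialization can work in practice is sketched below. It is only an illustration under assumed data X: cut a Ward dendrogram into k clusters and hand the cluster centroids to K-means as initial centers.

```python
# Sketch: use Ward's hierarchical clustering to seed K-means (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # placeholder data
k = 3

Z = linkage(X, method="ward")                    # Ward's method (squared-error based)
labels = fcluster(Z, t=k, criterion="maxclust")  # cut the dendrogram into k clusters

centers = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)  # K-means initialized by Ward
```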
Hierarchical Clustering: Comparison
74
[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward’s Method]
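A comparison along these lines can be reproduced with SciPy's hierarchical clustering routines. The sketch below uses made-up data X and simply runs all four schemes on the same points and draws their dendrograms.

```python
# Sketch: compare single, complete, average, and Ward linkage on the same data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 2))                     # placeholder data

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, method in zip(axes.ravel(), ["single", "complete", "average", "ward"]):
    dendrogram(linkage(X, method=method), ax=ax, no_labels=True)
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()
```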
Hierarchical Clustering: Time and Space Requirements
75
● O(N²) space since it uses the proximity matrix.
– N is the number of points.
● O(N³) time in many cases
– There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
– Complexity can be reduced to O(N² log N) time with some cleverness
Hierarchical Clustering: Problems and Limitations
76
● Once a decision is made to combine two clusters, it cannot be undone
● No global objective function is directly minimized
● Different schemes have problems with one or more of the following:
– Sensitivity to noise
– Difficulty handling clusters of different sizes and non-globular shapes
– Breaking large clusters
Density Based Clustering
● Clusters are regions of high density that are separated from one another by regions of low density.
77
DBSCAN
● DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has at least a specified number of
points (MinPts) within Eps
◆ These are points that are at the interior of a cluster
78
◆ Counts the point itself
– A border point is not a core point, but is in the neighborhood
of a core point
– A noise point is any point that is not a core point or a border
point
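A minimal sketch of these definitions is shown below (names are illustrative): each point is labelled core, border, or noise for given Eps and MinPts, with the neighborhood count including the point itself as noted above.

```python
# Sketch: classify points as core, border, or noise for given Eps and MinPts.
import numpy as np
from scipy.spatial.distance import cdist

def label_points(X, eps, min_pts):
    D = cdist(X, X)
    n_neighbors = (D <= eps).sum(axis=1)              # count includes the point itself
    core = n_neighbors >= min_pts
    border = ~core & (D[:, core] <= eps).any(axis=1)  # non-core but within Eps of a core point
    noise = ~core & ~border
    return core, border, noise
```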
DBSCAN: Core, Border, and Noise Points
79
[Figure: illustration of core, border, and noise points with MinPts = 7]
DBSCAN: Core, Border and Noise Points
80
[Figure: original points and their point types (core, border, noise) for Eps = 10, MinPts = 4]
DBSCAN Algorithm
81
● Form clusters using core points, and assign each border point to one of its neighboring clusters
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points within a distance Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.
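For comparison, scikit-learn's DBSCAN implements the same idea. In this usage sketch (with placeholder data), points labelled -1 correspond to the noise points eliminated in step 2.

```python
# Usage sketch of scikit-learn's DBSCAN on placeholder data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))                     # placeholder data

db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_                               # cluster index per point, -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {int((labels == -1).sum())}")
```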
When DBSCAN Works Well
82
[Figure: original points and the discovered clusters (dark blue points indicate noise)]
• Can handle clusters of different shapes and sizes
• Resistant to noise
When DBSCAN Does NOT Work Well
83
[Figure: original points and the clusterings obtained with (MinPts=4, Eps=9.92) and (MinPts=4, Eps=9.75)]
• Varying densities
• High-dimensional data
  • 11. 16 Data Warehouse vs. Operational DBMS ■ OLTP (on-line transaction processing) ■ Major task of traditional relational DBMS ■ Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. ■ OLAP (on-line analytical processing) ■ Major task of data warehouse system ■ Data analysis and decision making ■ Distinct features (OLTP vs. OLAP): ■ User and system orientation: customer vs. market ■ Data contents: current, detailed vs. historical, consolidated ■ Database design: ER + application vs. star + subject ■ View: current, local vs. evolutionary, integrated ■ Access patterns: update vs. read-only but complex queries
  • 12. 17 Why Separate Data Warehouse? ■ High performance for both systems ■ DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery ■ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation ■ Different functions and different data: ■ missing data: Decision support requires historical data which operational DBs do not typically maintain ■ data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources ■ data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled ■ Note: There are more and more systems which perform OLAP analysis directly on relational databases
  • 13. 18
  • 14. Data Warehousing: A Multitiered Architecture 19 ■ Bottom Tier: ■ Warehouse Database server ■ a relational database system ■ Back-end tools and utilities ■ data extraction ■ by using API gateways(ODBC, JDBC & OLEDB) ■ cleaning ■ transformation ■ load & refresh
  • 15. Data Warehousing: A Multitiered Architecture 20 ■ Middle Tier (OLAP server) ■ ROLAP - Relational OLAP ■ extended RDBMS that maps operations on multidimensional data to standard relational operations. ■ MOLAP - Multidimensional OLAP ■ Special-purpose server that directly implements multidimensional data and operations. ■ Top Tier ■ Front-end Client Layer ■ Query and reporting tools, analysis tools and data mining tools.
  • 16. Data Warehousing: A Multitiered Architecture 21 ■ Data Warehouse Models: ■ Enterprise warehouse: ■ collects all of the information about subjects spanning the entire organization. ■ corporate-wide data integration ■ can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. ■ implemented on mainframes, computer superservers, or parallel architecture platforms
  • 17. Data Warehousing: A Multitiered Architecture 22 ■ Data Warehouse Models: ■ Data mart:a subset of corporate-wide data that is of value to a specific group of users ■ confined to specific selected subjects. ■ Example - marketing data mart may confine its subjects to customer, item, and sales. ■ implemented on low-cost departmental servers ■ Independent Data mart - data captured from ■ one or more operational systems or external information providers, or ■ from data generated locally within a particular department or geographic area. ■ Dependent Data mart - sourced directly from enterprise data warehouses.
  • 18. Data Warehousing: A Multitiered Architecture 23 ■ Data Warehouse Models: ■ Virtual warehouse: ■ A virtual warehouse is a set of views over operational databases. ■ easy to build but requires excess capacity on operational database servers.
  • 19. Data Warehousing: A Multitiered Architecture 24 ■ Data extraction: gathers data from multiple, heterogeneous, and external sources. ■ Data Cleaning: detects errors in the data and rectifies them when possible ■ Data transformation: converts data from legacy or host format to warehouse format. ■ Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions. ■ Refresh: propagates the updates from the data sources to the warehouse.
  • 20. Data Warehousing: A Multitiered Architecture 25 Metadata Repository: metadata are the data that define warehouse objects It consists of: 1) Data warehouse structure 2) Operational metadata 3) algorithms used for summarization 4) Mapping from the operational environment to the data warehouse 5) Data related to system performance 6) Business metadata
  • 21. Data Warehousing: A Multitiered Architecture 26 Metadata Repository: ■ data warehouse structure i) warehouse schema, ii) view, dimensions, iii) hierarchies, and iv) derived data definitions, v) data mart locations and contents. ■ Operational metadata i) data lineage (history of migrated data and the sequence of transformations applied to it), ii) currency of data (active, archived, or purged), iii) monitoring information (warehouse usage statistics, error reports, and audit trails).
  • 22. Data Warehousing: A Multitiered Architecture 27 Metadata Repository: ■ The algorithms used for summarization, i) measure and dimension definition algorithms, ii) data on granularity, iii) partitions, iv) subject areas, v) aggregation, vi) summarization, and vii) predefined queries and reports.
  • 23. Data Warehousing: A Multitiered Architecture 28 Metadata Repository: 1) Mapping from the operational environment to the data warehouse i) source databases and their contents, ii) gateway descriptions, iii) data partitions, iv) data extraction, cleaning, transformation rules and defaults v) data refresh and purging rules, and vi) security (user authorization and access control).
  • 24. Data Warehousing: A Multitiered Architecture 29 Metadata Repository: ■ Data related to system performance ■ indices and profiles that improve data access and retrieval performance, ■ rules for the timing and scheduling of refresh, update, and replication cycles. ■ Business metadata, ■ business terms and definitions, ■ data ownership information, and ■ charging policies
  • 26. Data Warehouse Modeling: Data Cube : A Multidimensional Data Model 31 ■ A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. ■ Dimensions are the perspectives or entities with respect to which an organization wants to keep records. ■ Example:- ■ AllElectronics may create a sales data warehouse ■ time, item, branch, and location - These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.
  • 27. Data Warehouse Modeling: Data Cube : A Multidimensional Data Model 32 ■ Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. ■ For example - a dimension table for item may contain the attributes item name, brand, type. ■ A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table. ■ Facts are numeric measures. ■ The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
  • 28. 33 Data Cube: A Multidimensional Data Model ■ A data warehouse is based on a multidimensional data model which views data in the form of a data cube ■ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions ■ Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) ■ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
  • 29. 34 Data Cube: A Multidimensional Data Model ■ A data cube is a lattice of cuboids ■ A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which ■ each dimension corresponds to an attribute or a set of attributes in the schema, and ■ each cell stores the value of some aggregate measure such as count or sum(sales_amount). ■ A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
  • 30. Data Cube: A Multidimensional Data Model 35 2-D View of Sales data ■ AllElectronics sales data for items sold per quarter in the city of Vancouver. ■ a simple 2-D data cube that is a table or spreadsheet for sales data from AllElectronics
  • 31. Data Cube: A Multidimensional Data Model 36 3-D View of a Sales data The 3-D data in the table are represented as a series of 2-D tables
  • 32. Data Cube: A Multidimensional Data Model 37 3D Data Cube Representation of Sales data we may also represent the same data in the form of a 3D data cube
  • 33. Data Cube: A Multidimensional Data Model 38 4-D Data Cube Representation of Sales Data we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”
  • 34. 39 Cube: A Lattice of Cuboids all time item location supplier 0-D(apex) cuboid 1-D cuboids time,location item,location location,supplier time,item time,supplier item,supplier 2-D cuboids time,item,location time,location,supplier cuboids time,item,supplier item,location,supplier 4-D(base) cuboid time, item, location, supplier
  • 35. 40 ■ In data warehousing literature, an n-D base cube is called a base cuboid. ■ The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. ■ In our example, this is the total sales, or dollars sold, summarized over all four dimensions. ■ The apex cuboid is typically denoted by all. ■ The lattice of cuboids forms a data cube.
  • 36. Schemas for Multidimensional Data Models 41 ■ Modeling data warehouses: dimensions & measures ■ Star schema: A fact table in the middle connected to a set of dimension tables ■ Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake ■ Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  • 37. Schemas for Multidimensional Data Models 42 ■ Star schema: In this, a data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. ■ Each dimension is represented by only one table. ■ Each table contains a set of attributes ■ Problem: redundancy in dimension tables. ■ ex:- location dimension table will create redundancy among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).
  • 39. 44 Snow flake schema ■ Variant of the star schema model ■ Dimension tables are normalized ( to remove redundancy) ■ Dimension table is splitted into additional tables. ■ The resulting schema graph forms a shape similar to a snowflake. ■ Problem ■ more joins will be needed to execute a query ( affects system performance) ■ so this is not as popular as the star schema in data warehouse design.
  • 41. 46 Fact Constellation ● A fact constellation schema allows dimension tables to be shared between fact tables ● A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. ● For data warehouses, the fact constellation schema is commonly used. ● For data marts, the star or snowflake schema is commonly used
  • 42. 47 Fact Constellation This schema specifies two fact tables, sales and shipping the dimensions tables for time, item, and location are shared between the sales and shipping fact tables.
  • 43. 48 Examples for Defining Star, Snowflake, and Fact Constellation Schemas ■ Just as relational query languages like SQL can be used to specify relational queries, a data mining query language (DMQL) can be used to specify data mining tasks. ■ Data warehouses and data marts can be defined using two language primitives, one for cube definition and one for dimension definition.
  • 44. 49 Syntax for Cube and Dimension Definition in DMQL ■ Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> ■ Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_subdimension_list>) ■ Special Case (Shared Dimension Tables) ■ First time as “cube definition” ■ define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
  • 45. Defining Star Schema in DMQL 50 define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country)
  • 46. Defining Snowflake Schema in DMQL 51 define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country))
  • 47. Defining Fact Constellation in DMQL 52 define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales
  • 48. Concept Hierarchies courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman ■ A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level. ■ concept hierarchy for the dimension location 53
  • 49. Concept Hierarchies ■ A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy. courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman 54
  • 50. Concept Hierarchies 55 ■ Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. ■ A total or partial order can be defined among groups of values.
  • 51. Measures of Data Cube: Three Categories 56 ■ A multidimensional point in the data cube space can be defined by a set of dimension-value pairs, for example, 〈time = “Q1”, location = “Vancouver”, item = “computer”〉. ■ A data cube measure is a numerical function that can be evaluated at each point in the data cube space. ■ A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point. ■ Based on the kind of aggregate functions used, measures can be organized into three categories : distributive, algebraic, holistic
  • 52. Measures of Data Cube: Three Categories ■ Distributive: An aggregate function is distributive if the result derived by applying the function to n aggregate values is same as that derived by applying the function on all the data without partitioning ■ E.g., count(), sum(), min(), max() ■ Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. ■ E.g., avg()=sum()/count(), min_N(), standard_deviation() ■ Holistic: An aggregate function is holistic if there is no constant bound on the storage size and there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. ■ E.g., median(), mode(), rank() 57
  • 53. 58 Typical OLAP Operations ■ Roll up (drill-up): ■ Drill down (roll down): ■ Slice and dice: project and select ■ Pivot (rotate): ■ reorient the cube, visualization, 3D to series of 2D planes ■ Other operations ■ drill across: involving (across) more than one fact table ■ drill through: Allows users to analyze the same data through different reports, analyze it with different features and even display it through different visualization methods
  • 54. 59 Fig. 3.10 Typical OLAP Operations
  • 55. Source & Courtesy: https://www.javatpoint.com/olap-operations 60 Typical OLAP Operations:Roll Up/Drill Up ■ summarize data ■ by climbing up hierarchy or ■ by dimension reduction
  • 56. Source & Courtesy: https://www.javatpoint.com/olap-operations 61 Typical OLAP Operations:Roll Down ■ reverse of roll-up ■ from higher level summary to lower level summary or detailed data, or introducing new dimensions
  • 57. Typical OLAP Operations:Slicing ● Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension. ● Example: The sales figures of all sales regions and all product categories of the company in the year 2005 and 2006 are "sliced" out of the data cube. Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube 62
  • 58. Typical OLAP Operations:Slicing Slicing: It selects a single dimension from the OLAP cube which results in a new sub-cube creation. Source & Courtesy: https://www.javatpoint.com/olap-operations 63
  • 59. Typical OLAP Operations:Dice ● Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions ● The picture shows a dicing operation: The new cube shows the sales figures of a limited number of product categories, the time and region dimensions cover the same range as before. Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube 64
  • 60. Typical OLAP Operations:Dicing Dice: It selects a sub- cube from the OLAP cube by selecting two or more dimensions. Source & Courtesy: https://www.javatpoint.com/olap-operations 65
  • 61. Typical OLAP Operations:Pivot 66 Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
  • 62. A Star-Net Query Model 67 ● The querying of multidimensional databases can be based on a starnet model. ● It consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. ● Each abstraction level in the hierarchy is called a footprint. ● These represent the granularities available for use by OLAP operations such as drill-down and roll-up.
  • 63. A Star-Net Query Model 68
  • 64. A Star-Net Query Model 69 ■ Four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively ■ footprints representing abstraction levels of the dimension - time line has four footprints: “day,” “month,” “quarter,” and “year.” ■ Concept hierarchies can be used to generalize data by replacing low-level values (such as “day” for the time dimension) by higher-level abstractions (such as “year”) or ■ to specialize data by replacing higher-level abstractions with lower-level values.
  • 65. Data Warehouse Design and Usage 70 A Business Analysis Framework for Data Warehouse Design: ■ To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. ■ Different views are combined to form a complex framework.
  • 66. Data Warehouse Design and Usage 71 ■ Four different views regarding a data warehouse design must be considered: ■ Top-down view ■ allows the selection of the relevant information necessary for the data warehouse (matches current and future business needs). ■ Data source view ■ exposes the information being captured, stored, and managed by operational systems. ■ Documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. ■ Modeled in ER model or CASE (computer-aided software engineering).
  • 67. Data Warehouse Design and Usage 72 ■ Data warehouse view ■ includes fact tables and dimension tables. ■ It represents the information that is stored inside the data warehouse, including ■ precalculated totals and counts, ■ information regarding the source, date, and time of origin, added to provide historical context. ■ Business query view ■ is the data perspective in the data warehouse from the end-user’s viewpoint.
  • 68. Data Warehouse Design and Usage ■ Skills required to build & use a Data warehouse ■ Business Skills ■ how systems store and manage their data, ■ how to build extractors (operational DBMS to DW) ■ how to build warehouse refresh software(update) ■ Technology skills ■ the ability to discover patterns and trends, ■ to extrapolate trends based on history and look for anomalies or paradigm shifts, and ■ to present coherent managerial recommendations based on such analysis. ■ Program management skills ■ Interface with many technologies, vendors, and end- users in order to deliver results in a timely and cost effective manner 73
  • 69. Data Warehouse Design and Usage 74 Data Warehouse Design Process ■ A data warehouse can be built using ■ Top-down approach (overall design and planning) ■ It is useful in cases where the technology is mature and well known ■ Bottom-up approach(starts with experiments & prototypes) ■ a combination of both. ■ In SE point of view ( Waterfall model or Spiral model) structured and systematic ■ planning, ■ requirements study, ■ problem analysis, ■ warehouse design, ■ ● rapid generation, short intervals between successive releases, good choice for data warehouse development ● turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely analysis at each data integration and testing, andmanner step, one step to■ finally deployment of the data warehouse the next
  • 70. Data Warehouse Design and Usage 75 Data Warehouse Design Process ■ 4 major Steps involved in Warehouse design are: ■ 1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger). ■ Data warehouse model - If the business process is organizational and involves multiple complex object collections ■ Data mart model - if the process is departmental and focuses on the analysis of one kind of business process
  • 71. Data Warehouse Design and Usage 76 ■ 2. Choose the business process grain ■ Fundamental, atomic level of data to be represented in the fact table ■ (e.g., individual transactions, individual daily snapshots, and so on). ■ 3. Choose the dimensions that will apply to each fact table record. ■ Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. ■ 4. Choose the measures that will populate each fact table record. ■ Typical measures are numeric additive quantities like dollars sold and units sold.
  • 72. Data Warehouse Design and Usage 77 Data Warehouse Usage for Information Processing ■ Evolution of DW takes place throughout a number of phases. ■ Initial Phase - DW is used for generating reports and answering predefined queries. ■ Progressively - to analyze summarized and detailed data, (results are in the form of reports and charts) ■ Later - for strategic purposes, performing multidimensional analysis and sophisticated slice-and- dice operations. ■ Finally - for knowledge discovery and strategic decision making using data mining tools.
  • 74. 79 Data warehouse implementation ■ OLAP servers demand that decision support queries be answered in the order of seconds. ■ Methods for the efficient implementation of data warehouse systems. ■ 1. Efficient data cube computation. ■ 2. OLAP data indexing (bitmap or join indices ) ■ 3. OLAP query processing ■ 4. Various types of warehouse servers for OLAP processing.
  • 75. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 80 ■ Requires efficient computation of aggregations across many sets of dimensions. ■ In SQL terms: ■ Aggregations are referred to as group-by’s. ■ Each group-by can be represented by a cuboid, ■ set of group-by’s forms a lattice of cuboids defining a data cube. ■ Compute cube Operator - computes aggregates over all subsets of the dimensions specified in the operation. ■ require excessive storage space for large number of dimensions.
  • 76. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 81 Example 4.6 ■ create a data cube for AllElectronics sales that contains the following: ■ city, item, year, and sales in dollars.
  • 77. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 82 ■ What is the total number of cuboids, or group- by’s, that can be computed for this data cube? ■ 3 attributes - city, item & year -3 dimensions ■ sales in dollars - measure, ■ the total number of cuboids, or group by’s, ■ 2 POWER 3 = 8. ■ The possible group-by’s are the following: ■ {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()} ■ () - group-by is empty (i.e., the dimensions are not grouped) - all. ■ group-by’s form a lattice of cuboids for the data cube
  • 78. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 83
  • 79. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 84 ■ Base cuboid contains all three dimensions(city, item, year) ■ returns - total sales for any combination of the three dimensions. ■ This is least generalized (most specific) of the cuboids. ■ Apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty (contains total sum of all sales) ■ This is most generalized (least specific) of the cuboids ■ Drill Down equivalent ■ start at the apex cuboid and explore downward in the lattice ■ akin to rolling up ■ start at the base cuboid and explore upward
  • 80. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 85 ■ zero-dimensional operation: ■ An SQL query containing no group-by ■ Example - “compute the sum of total sales” ■ one-dimensional operation: ■ An SQL query containing one group-by ■ Example - “compute the sum of sales group-by city” ■ A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.
  • 81. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation ■ data cube could be defined as: ■ “define cube sales_cube [city, item, year]: sum(sales_in_dollars)” ■ 2 power n cuboids - For a cube with n dimensions ■ “compute cube sales_cube” - statement ■ computes the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset. ■ In OLAP, for diff. queries diff. cuboids need to be accessed. ■ Precomputation - compute in advance all or at least some of the cuboids in a data cube ■ curse of dimensionality - required storage space may explode if all the cuboids in a data cube are precomputed ( for more dimensions) 86
  • 82. Data warehouse implementation: 87 1.4.1 Efficient Data Cube Computation ■ Data cube can be viewed as a lattice of cuboids ■ 2 power n - when no concept hierarchy ■ How many cuboids in an n-dimensional cube with L levels? ■ where Li is the number of levels associated with dimension i ( +1 for all ) ■ If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 510 ≈ 9.8 × 106 .
  • 83. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 88 There are three choices for data cube materialization for a given base cuboid: ■ 1. No materialization: Do not precompute - expensive multidimensional aggregates - extremely slow. ■ 2. Full materialization: Precompute all of the cuboids - full cube - requires huge amounts of memory space in order to store all of the precomputed cuboids.
  • 84. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 89 ■ 3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids. ■ compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion - subcube ■ 3 factors to consider: ■ (1) identify the subset of cuboids or subcubes to materialize; ■ (2) exploit the materialized cuboids or subcubes during query processing; and ■ (3) efficiently update the materialized cuboids or subcubes during load and refresh.
  • 85. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 90 ■ Partial Materialization: Selected Computation of Cuboids ■ Following should take into account during selection of the subset of cuboids or subcubes ■ the queries in the workload, their frequencies, and their accessing costs ■ workload characteristics, the cost for incremental updates, and the total storage requirements. ■ physical database design such as the generation and selection of indices.
  • 86. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 91 ■ Heuristic approaches for cuboid and subcube selection ■ Iceberg cube: ■ data cube that stores only those cube cells with an aggregate value (e.g., count) that is above some minimum support threshold. ■ shell cube: ■ precomputing the cuboids for only a small number of dimensions
  • 87. Data warehouse implementation: 1.3.2 Indexing OLAP Data: Bitmap Index 92 Index structures - To facilitate efficient data accessing ■ Bitmap indexing method - it allows quick searching in data cubes. ■ In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s domain. ■ If a given attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). ■ If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
  • 88. Data warehouse implementation: 1.3.2 Indexing OLAP Data: Bitmap Index 93 ● Example:- AllElectronics data warehouse ● dim(item)={H,C,P,S} - 4 values - 4 bit vectors ● dim(city)= {V,T} - 2 values - 2 bit vectors ● Better than Hash & Tree Indices but good for low cardinality only (cardinality:number of unique items in the database column)
  • 90. Data warehouse implementation: Indexing OLAP Data: Join Index 95 ■ Traditional indexing maps the value in a given column to a list of rows having that value. ■ Join indexing registers the joinable rows of two relations from a relational database. ■ For example, ■ two relations - R(RID, A) and S(B, SID) ■ join on the attributes A and B, ■ join index record contains the pair (RID, SID), ■ where RID and SID are record identifiers from the R and S relations, respectively
  • 91. Data warehouse implementation: Indexing OLAP Data: Join Index 96 ■ Advantage:- ■ Identification of joinable tuples without performing costly join operations. ■ Useful:- ■ To maintain the relationship between a foreign key(fact table) and its matching primary keys(dimension table), from the joinable relation. ■ Indexing maintains relationships between attribute values of a dimension (e.g., within a dimension table) and the corresponding rows in the fact table. ■ Composite join indices: Join indices with multiple dimensions.
  • 92. Data warehouse implementation: Indexing OLAP Data: Join Index ■ Example:-Star Schema ■ “sales_star [time, item, branch, location]: dollars_sold = sum (sales_in_dollars).” ■ join index is relationship between ■ Sales fact table and ■ the location, item dimension tables To speed up query processing - join indexing & bitmap indexing methods can be integrated to form bitmapped join indices. 97
  • 93. Data warehouse implementation: Efficient processing of OLAP queries 98 Given materialized views, query processing should proceed as follows: ■ 1. Determine which operations should be performed on the available cuboids: ■ This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations. ■ Example: ■ slicing and dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid.
  • 94. Data warehouse implementation: Efficient processing of OLAP queries 99 ■ 2. Determine to which materialized cuboid(s) the relevant operations should be applied: ■ pruning the set using knowledge of “dominance” relationships among the cuboids, ■ estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
  • 95. Data warehouse implementation: Efficient processing of OLAP queries 100 Example:- ■ define a data cube for AllElectronics of the form “sales cube [time, item, location]: sum(sales in dollars).” ■ dimension hierarchies ■ “day < month < quarter < year” for time; ■ “item_name < brand < type” for item ■ “street < city < province or state < country” for location ■ Query: ■ {brand, province or state}, with the selection constant “year = 2010.”
  • 96. Data warehouse implementation: Efficient processing of OLAP queries 101 ■ suppose that there are four materialized cuboids available, as follows: ■ Which of these four cuboids should be selected to process the query? Ans: 1,3,4 ■ Low cost cuboid to process the query? Ans: 4
  • 97. Data warehouse implementation: OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP 102 ■ Relational OLAP (ROLAP) servers: ■ ROLAP uses relational tables to store data for online analytical processing ■ Intermediate servers that stand in between a relational back-end server and client front-end tools. ■ Operation: ■ use a relational or extended-relational DBMS to store and manage warehouse data ■ OLAP middleware to support missing pieces ■ ROLAP has greater scalability than MOLAP. ■ Example:- ■ DSS server of Microstrategy
  • 98. Data warehouse implementation: OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP 103 ■ Multidimensional OLAP (MOLAP) servers: ■ support multidimensional data views through array- based multidimensional storage engines ■ maps multidimensional views directly to data cube array structures. ■ Advantage: ■ fast indexing to precomputed summarized data. ■ adopt a two-level storage representation ■ Denser subcubes are stored as array structures ■ Sparse subcubes employ compression technology A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non- zeros.
  • 99. Data warehouse implementation: OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP ■ Hybrid OLAP (HOLAP) servers: ■ Combines ROLAP and MOLAP technology ■ benefits ■ greater scalability from ROLAP and ■ faster computation of MOLAP. ■ HOLAP server may allow ■ large volumes of detailed data to be stored in a relational database, ■ while aggregations are kept in a separate MOLAP store. ■ Example:- Microsoft SQL Server 2000 (supports) ■ Specialized SQL servers: ■ provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment. 104
  • 100. 105 From Data Warehousing to Data Mining
  • 101. 106 From DataWarehousing to Data Mining DataWarehouse Usage ■ Data warehouses and data marts are used in a wide range of applications. ■ Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. ■ data warehouses are used as an integral part of a plan-execute-assess “closed-loop” feedback system for enterprise management. ■ Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.
  • 102. DataWarehouse Usage 107 ■ There are three kinds of data warehouse applications: ■ information processing ■ analytical processing ■ data mining
  • 103. DataWarehouse Usage 108 ■ Information processing supports ■ querying, ■ basic statistical analysis, and ■ reporting using crosstabs, tables, charts, or graphs. ■ Analytical processing supports ■ basic OLAP operations, ■ slice-and-dice, drill-down, roll-up, and pivoting. ■ It generally operates on historic data in both summarized and detailed forms. ■ multidimensional data analysis
  • 104. DataWarehouse Usage 109 ■ Data mining supports ■ knowledge discovery by finding hidden patterns and associations, ■ constructing analytical models, ■ performing classification and prediction, and ■ presenting the mining results using visualization tools. ■ Note:- ■ Data Mining is different with Information Processing and Analytical processing
  • 105. From Online Analytical Processing to Multidimensional Data Mining 110 ■ On-line analytical mining (OLAM) (also called OLAP mining) integrates on-line analytical processing (OLAP) with data mining and mining knowledge in multidimensional databases. ■ OLAM is particularly important for the following reasons: ■ High quality of data in data warehouses. ■ Available information processing infrastructure surrounding data warehouses ■ OLAP-based exploratory data analysis: ■ On-line selection of data mining functions
  • 106. Architecture for On-Line Analytical Mining 111 ■ An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. ■ An integrated OLAM and OLAP architecture is shown in Figure, where the OLAM and OLAP servers both accept user on-line queries (or commands) via a graphical user interface API and work with the data cube in the data analysis via a cube API. ■ The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a datawarehouse via a database API that may support OLE DB or ODBC connections.
  • 107. 112
  • 108. Data Mining & Motivating Challenges UNIT - II By M. Rajesh Reddy
  • 109. WHAT IS DATA MINING? • Data mining is the process of automatically discovering useful information in large data repositories. • To find novel and useful patterns that might otherwise remain unknown. otherwise remain unknown. • provide capabilities to predict the outcome of a future observation, • Example • predicting whether a newly arrived customer will spend more than $100 at a department store.
  • 110. WHAT IS DATA MINING? • Not all information discovery tasks are considered to be data mining. • For example, tasks related to the area of information retrieval. retrieval. • looking up individual records using a database management system or • finding particular Web pages via a query to an Internet search engine • To enhance information retrieval systems.
  • 111. WHAT IS DATA MINING? Data Mining and Knowledge • Data mining is an integral part of Knowledge Discovery in Databases (KDD), • process of converting raw data into useful • process of converting raw data into useful information • This process consists of a series of transformation steps
  • 112. WHAT IS DATA MINING? • Preprocessing - to transform the raw input data into an appropriate format for subsequent analysis. • Steps involved in data preprocessing • Fusing (joining) data from multiple sources, • Fusing (joining) data from multiple sources, • cleaning data to remove noise and duplicate observations • selecting records and features that are relevant to the data mining task at hand. • most laborious and time-consuming step
  • 113. WHAT IS DATA MINING? • Post Processing: • only valid and useful results are incorporated into the decision support system. • Visualization • Visualization • allows analysts to explore the data and the data mining results from a variety of viewpoints. • Statistical measures or hypothesis testing methods can also be applied • to eliminate spurious (false or fake) data mining results.
  • 114. Motivating Challenges: • challenges that motivated the development of data mining. • Scalability • High Dimensionality • Heterogeneous and Complex Data • Data Ownership and Distribution • Non-traditional Analysis
  • 115. Motivating Challenges: • Scalability • Size of datasets are in the order of GB, TB or PB. • special search strategies • special search strategies • implementation of novel data structures ( for efficient access) • out-of-core algorithms - for large datasets • sampling or developing parallel and distributed algorithms.
  • 116. Motivating Challenges: • High Dimensionality • common today - data sets with hundreds or thousands of attributes • Example • Bio-Informatics - microarray technology has • Bio-Informatics - microarray technology has produced gene expression data involving thousands of features. • Data sets with temporal or spatial components also tend to have high dimensionality. • a data set that contains measurements of temperature at various locations.
  • 117. Motivating Challenges: Heterogeneous and Complex Data • Traditional data analysis methods - data sets - attributes of the same type - either continuous or categorical. • Examples of such non-traditional types of data include • collections of Web pages containing semi-structured text and hyperlinks; text and hyperlinks; • DNA data with sequential and three-dimensional structure and • climate data with time series measurements • DM should maintain relationships in the data, such as • temporal and spatial autocorrelation, • graph connectivity, and • parent-child relationships between the elements in semi-structured text and XML documents.
  • 118. Motivating Challenges: • Data Ownership and Distribution • Data is not stored in one location or owned by one organization • geographically distributed among resources belonging to multiple entities. • This requires the development of distributed data mining techniques. • This requires the development of distributed data mining techniques. • key challenges in distributed data mining algorithms • (1) reduction in the amount of communication needed • (2) effective consolidation of data mining results obtained from multiple sources, and • (3) Data security issues.
  • 119. Motivating Challenges: • Non-traditional Analysis: • Traditional statistical approach: hypothesize-and-test paradigm. • A hypothesis is proposed, • an experiment is designed to gather the data, and • then the data is analyzed with respect to the hypothesis. • then the data is analyzed with respect to the hypothesis. • Current data analysis tasks • Generation and evaluation of thousands of hypotheses, • Some DM techniques automate the process of hypothesis generation and evaluation. • Some data sets frequently involve non-traditional types of data and data distributions.
  • 120. Origins of Data mining, Data mining Tasks & Types of Data Types of Data Unit - II DWDM
  • 121. The Origins of Data Mining Data mining draws upon ideas, such as ■ (1) sampling, estimation, and hypothesis testing from statistics and ■ (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. ■ (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning.
  • 122. The Origins of Data Mining ■ adopt ideas from other areas, including – optimization, – evolutionary computing, – information theory, – information theory, – signal processing, – visualization, and – information retrieval
  • 123. The Origins of Data Mining ■ An optimization algorithm is a procedure which is executed iteratively by comparing various solutions till an optimum or a satisfactory solution is found. ■ Evolutionary Computation is a field of optimization theory where instead of ■ Evolutionary Computation is a field of optimization theory where instead of using classical numerical methods to solve optimization problems, we use inspiration from biological evolution to ‘evolve’ good solutions – Evolution can be described as a process by which individuals become ‘fitter’ in different environments through adaptation, natural selection, and selective breeding. picture of the famous finches Charles Darwin depicted in his journal
  • 124. The Origins of Data Mining ■ Information theory is the scientific study of the quantification, storage, and communication of digital information. ■ The field was fundamentally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. ■ The field is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.
• 125. The Origins of Data Mining ■ Other key areas: – database systems ■ to provide support for efficient storage, indexing, and query processing. – Techniques from high performance (parallel) computing ■ addressing the massive size of some data sets. – Distributed techniques ■ also help address the issue of size and are essential when the data cannot be gathered in one location.
• 126. Data Mining Tasks ■ Data mining tasks are generally divided into two major categories: – Predictive tasks - use some variables to predict unknown or future values of other variables ■ Task objective: predict the value of a particular attribute based on the values of other attributes. ■ Target/Dependent variable: attribute to be predicted ■ Explanatory or independent variables: attributes used for making the prediction – Descriptive tasks - find human-interpretable patterns that describe the data. ■ Task objective: derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. ■ Descriptive data mining tasks are often exploratory in nature and frequently require post-processing techniques to validate and explain the results.
• 127. Data Mining Tasks ■ Correlation is a statistical term describing the degree to which two variables move in coordination with one another. ■ Trends: a general direction in which something is developing or changing. ■ Trajectory data mining enables prediction of the moving location details of humans, vehicles, animals and so on. ■ Anomaly detection is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior. ■ Clusters – Clustering is the task of grouping data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups https://www.javatpoint.com/data-mining-cluster-analysis
• 128. Data Mining Tasks … (Figure: the four core data mining tasks; source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar)
  • 130. Data Mining Tasks ■ Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. ■ 2 types of predictive modeling tasks: – Classification: Used for discrete target variables – Regression: used for continuous target variables.
• 131. Data Mining Tasks ■ Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. ■ 2 types of predictive modeling tasks: – Classification: used for discrete target variables – Regression: used for continuous target variables. – Example: ■ Classification task: predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. ■ Regression task: forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. – Goal of both tasks: learn a model that minimizes the error between the predicted and true values of the target variable. – Predictive modeling can be used to: ■ identify customers that will respond to a marketing campaign, ■ predict disturbances in the Earth’s ecosystem, or ■ judge whether a patient has a particular disease based on the results of medical tests.
• 132. Data Mining Tasks ■ Example (Predicting the Type of a Flower): the task of predicting a species of flower based on the characteristics of the flower. ■ Iris species: Setosa, Versicolour, or Virginica. ■ Requirement: need a data set containing the characteristics of various flowers of these three species. ■ 4 other attributes (dataset): sepal width, sepal length, petal length, and petal width. ■ Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. ■ Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. ■ Based on these categories of petal width and length, the following rules can be derived (a code sketch follows below): – Petal width low and petal length low implies Setosa. – Petal width medium and petal length medium implies Versicolour. – Petal width high and petal length high implies Virginica.
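The three rules above can be written directly as a tiny rule-based classifier. A minimal sketch in Python, assuming petal measurements in centimetres as in the standard Iris data; the function names (petal_category, classify_iris) are illustrative, not from the text.

def petal_category(value, medium_start, high_start):
    """Map a continuous petal measurement to low/medium/high."""
    if value < medium_start:
        return "low"
    if value < high_start:
        return "medium"
    return "high"

def classify_iris(petal_width, petal_length):
    """Apply the rules derived from the discretized petal attributes."""
    width = petal_category(petal_width, 0.75, 1.75)   # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = petal_category(petal_length, 2.5, 5.0)   # [0, 2.5),  [2.5, 5),     [5, inf)
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return "unclassified"   # combinations the three rules do not cover

print(classify_iris(0.2, 1.4))   # -> Setosa
print(classify_iris(1.3, 4.5))   # -> Versicolour
print(classify_iris(2.1, 5.8))   # -> Virginica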
  • 133. Data Mining Tasks ■ Example: (Predicting the Type of a Flower):
• 134. Data Mining Tasks Example: (Predicting the Type of a Flower)
• 135. Data Mining Tasks ■ Association analysis – used to discover patterns that describe strongly associated features in the data. – Discovered patterns are represented in the form of implication rules or feature subsets. – Goal of association analysis: ■ To extract the most interesting patterns in an efficient manner. – Example ■ finding groups of genes that have related functionality, ■ identifying Web pages that are accessed together, or ■ understanding the relationships between different elements of Earth’s climate system.
• 136. Data Mining Tasks ■ Association analysis ■ Example (Market Basket Analysis). – AIM: find items that are frequently bought together by customers. – Association rule {Diapers} −→ {Milk} ■ suggests that customers who buy diapers also tend to buy milk. ■ This rule can be used to identify potential cross-selling opportunities among related items. (Table: transaction data collected at the checkout counters of a grocery store.)
• 137. Data Mining Tasks ■ Cluster analysis – Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters. – Clustering has been used to ■ group sets of related customers, ■ find areas of the ocean that have a significant impact on the Earth’s climate, and ■ compress data.
• 138. Data Mining Tasks ■ Cluster analysis – Example 1.3 (Document Clustering) – Each article is represented as a set of word-frequency pairs (w, c), ■ where w is a word and ■ c is the number of times the word appears in the article. – There are two natural clusters in the data set. – First cluster -> first four articles (news about the economy) – Second cluster -> last four articles (news about health care) – A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
• 139. Data Mining Tasks ■ Anomaly Detection: – Task of identifying observations whose characteristics are significantly different from the rest of the data. – Such observations are known as anomalies or outliers. – A good anomaly detector must have a high detection rate and a low false alarm rate. – Applications of anomaly detection include ■ the detection of fraud, ■ network intrusions, ■ unusual patterns of disease, and ■ ecosystem disturbances https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png
• 140. Data Mining Tasks ■ Anomaly Detection: – Example 1.4 (Credit Card Fraud Detection). – A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. – Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. – When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
• 141. Types of Data ■ Data set - collection of data objects. ■ Other names for a data object are:- – record, – point, – vector, – pattern, – event, – case, – sample, – observation, or – entity.
• 142. Types of Data ■ Data objects are described by a number of attributes that capture the basic characteristics of an object. ■ Example:- – mass of a physical object or – time at which an event occurred. ■ Other names for an attribute are:- – variable, – characteristic, – field, – feature, or – dimension.
• 143. Types of Data ■ Example:- ■ Dataset - Student Information. ■ Each row corresponds to a student. ■ Each column is an attribute that describes some aspect of a student.
• 144. Types of Data ■ Attributes and Measurement – An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another. – Example, ■ eye color varies from person to person, while the temperature of an object varies over time. – Eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, – Temperature is a numerical attribute with a potentially unlimited number of values.
• 145. Types of Data ■ Attributes and Measurement – A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object. – Process of measurement ■ application of a measurement scale to associate a value with a particular attribute of a specific object.
• 146. Properties of Attribute Values ■ The type of an attribute depends on which of the following properties it possesses: ■ Distinctness: = ≠ ■ Order: < > ■ Addition: + ‐ ■ Multiplication: * / ■ Nominal attribute: distinctness ■ Ordinal attribute: distinctness & order ■ Interval attribute: distinctness, order & addition ■ Ratio attribute: all 4 properties
• 147. Types of Data ■ Properties of Attribute Values – Nominal - attributes to differentiate between one object and another. – Roll, EmpID – Ordinal - attributes to order the objects. – Rankings, Grades, Height – Interval - measured on a scale of equal size units – no zero point – Temperatures in C & F, Calendar Dates – Ratio - numeric attribute with an inherent zero-point. – value as being a multiple (or ratio) of another value. – Weight, No. of Staff, Income/Salary
  • 148. Types of Data Properties of Attribute Values
• 149. Types of Data Properties of Attribute Values - Transformations – yielding the same results when the attribute is transformed using a transformation that preserves the attribute’s meaning. – Example:- ■ the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length.
  • 150. Types of Data Properties of Attribute Values - Transformations
• 151. Types of Data Attribute Types: Data is divided into Qualitative / Categorical attributes (no properties of integers): Nominal, Ordinal; and Quantitative / Numeric attributes (properties of integers): Interval, Ratio.
• 152. Types of Data ■ Describing Attributes by the Number of Values a. Discrete ■ finite or countably infinite set of values. ■ Categorical - zip codes or ID numbers, or ■ Numeric - counts. ■ Binary attributes (special case of discrete) – assume only two values, – e.g., true/false, yes/no, male/female, or 0/1. b. Continuous ■ values are real numbers. ■ Ex:- temperature, height, or weight. Any of the measurement scale types (nominal, ordinal, interval, and ratio) could be combined with any of the types based on the number of attribute values (binary, discrete, and continuous).
• 153. Types of Data - Types of Dataset General Characteristics of Data Sets ■ 3 characteristics that apply to many data sets are:- – dimensionality, – sparsity, and – resolution. ■ Dimensionality - number of attributes that the objects in the data set possess. – data with a small number of dimensions tends to be of higher quality than moderate- or high-dimensional data. – curse of dimensionality & dimensionality reduction. ■ Sparsity - in data sets with asymmetric features, most attributes of an object have values of 0; – fewer than 1% of the entries are non-zero. ■ Resolution - data can be gathered at different levels of resolution – Example:- the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.
  • 154. Types of Data - Types of Dataset ■ Record Data – data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). – No relationships b/w records – Same attributes for all records – Flat files or relational DB.
  • 155. Types of Data - Types of Dataset ■ Transaction or Market Basket Data – special type of record data – Each record (transaction) involves a set of items. – Also called market basket data because the items in each record are the products in a person’s “market basket.” – Can be viewed as a set of records whose fields are asymmetric attributes.
  • 156. Types of Data - Types of Dataset ■ Data Matrix / Pattern Matrix – fixed set of numeric attributes, – Data objects = points (vectors) in a multidimensional space – each dimension = a distinct attribute describing the object. – A set of such data objects can be interpreted as ■ an m by n matrix, – where there are – m rows, one for each object, – and n columns, one for each attribute. – Standard matrix operation can be applied to transform and manipulate the data.
• 157. Types of Data - Types of Dataset ■ Sparse Data Matrix: – Special case of a data matrix – attributes are of the ■ same type and ■ asymmetric; i.e., only non-zero values are important. – Example:- ■ Transaction data which has only 0–1 entries. ■ Document-Term Matrix - collection of term vectors – One term vector represents one document (one row in the matrix) – Attribute of the vector - each term in the document (one column in the matrix) – the value in a term vector under an attribute is the number of times the corresponding term occurs in the document.
  • 158. Types of Data - Types of Dataset ■ Graph based Data: – Data can be represented in the form of Graph. – Graphs are used for 2 specific reasons ■ (1) the graph captures relationships among data objects and ■ (2) the data objects themselves are represented as graphs. – Data with Relationships among Objects ■ Relationships among objects also convey important information. ■ Relationships among objects are captured by the links between objects and link properties, such as direction and weight. ■ Example: – Web page in www contain both text and links to other pages. – Web search engines collect and process Web pages to extract their contents. – Links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration.
  • 159. Types of Data - Types of Dataset ■ Graph based Data: – Data with Relationships among Objects ■ Example: – Web page in www contain both text and links to other pages.
  • 160. Types of Data - Types of Dataset ■ Graph based Data: – Data with Objects That Are Graphs ■ When objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. ■ Example:-Structure of chemical compounds ■ Atoms are - nodes ■ Chemical Bonds - links between nodes – ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray). Substructure mining
• 161. Types of Data - Types of Dataset ■ Ordered Data: – In some data, the attributes have relationships that involve order in time or space. – Sequential Data ■ Sequential data / temporal data ■ extension of record data - each record has a time associated with it. ■ Ex:- Retail transaction data set - stores the time of transaction – time information used to find patterns ■ “candy sales peak before Halloween.” ■ Each attribute can also have a time associated with it – Record - purchase history of a customer ■ with a listing of items purchased at different times. – find patterns ■ “people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”
  • 162. Types of Data - Types of Dataset ■ Ordered Data: Sequential
  • 163. Types of Data - Types of Dataset ■ Ordered Data: Sequence Data – consists of a data set that is a sequence of individual entities, – Example ■ sequence of words or letters. – Example: ■ Genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes. ■ Predicting similarities in the structure and function of genes from similarities in nucleotide sequences. – Ex:- Human genetic code expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.
  • 164. Types of Data - Types of Dataset ■ Ordered Data: Time Series Data – Special type of sequential data in which each record is a time series, – A series of measurements taken over time. – Example: ■ Financial data set might contain objects that are time series of the daily prices of various stocks. – Temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar. Time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.
  • 165. Types of Data - Types of Dataset ■ Ordered Data: Spatial Data ■ Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. ■ An example of spatial data is – weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations. ■ spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well. ■ Example – two points on the Earth that are close to each other usually have similar values for temperature and rainfall. Average Monthly Temperature of land and ocean
• 167. Data Quality ● Data mining applications are often applied to data that was collected for another purpose, or for future, but unspecified, applications. ● Data mining focuses on (1) the detection and correction of data quality problems - Data Cleaning, and (2) the use of algorithms that can tolerate poor data quality. ● Measurement and Data Collection Issues ● Issues Related to Applications
• 168. Data Quality ● Measurement and Data Collection Issues ● problems due to human error, ● limitations of measuring devices, or ● flaws in the data collection process. ● Values or even entire data objects may be missing. ● Spurious or duplicate objects; i.e., multiple data objects that all correspond to a single “real” object. ○ Example - there might be two different records for a person who has recently lived at two different addresses. ● Inconsistencies: ○ Example - a person has a height of 2 meters, but weighs only 2 kilograms.
• 169. Data Quality ● Measurement and Data Collection Errors ○ Measurement error - any problem resulting from the measurement process. ■ Value recorded differs from the true value to some extent. ■ Continuous attributes: ● the numerical difference of the measured and true value is called the error. ○ Data collection error - errors such as omitting data objects or attribute values, or inappropriately including a data object. ■ For example, a study of animals of a certain species might include animals of a related species that are similar in appearance to the species of interest.
• 170. Data Quality ● Measurement and Data Collection Errors ○ Noise and Artifacts: ○ Noise is the random component of a measurement error. ○ It may involve the distortion of a value or the addition of spurious objects.
• 172. Data Quality ● Measurement and Data Collection Errors ○ Noise and Artifacts: ○ The term noise is often used in connection with data that has a spatial or temporal component. ○ Techniques from signal or image processing can frequently be used to reduce noise ■ These help to discover patterns (signals) that might be “lost in the noise.” ○ Note: Elimination of noise is difficult ■ robust algorithms - produce acceptable results even when noise is present.
• 173. Data Quality ● Measurement and Data Collection Errors ○ Noise and Artifacts: ■ Artifacts: deterministic distortions of the data ■ Data errors may be the result of a more deterministic phenomenon, such as a streak in the same place on a set of photographs.
• 174. Data Quality ● Measurement and Data Collection Errors ● Precision, Bias, and Accuracy: ○ Precision: ■ The closeness of repeated measurements (of the same quantity) to one another. ■ Precision is often measured by the standard deviation of a set of values ○ Bias: ■ A systematic variation of measurements from the quantity being measured. ■ Bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. ○ Example: ■ standard laboratory weight with a mass of 1g and want to assess the precision and bias of our new laboratory scale. ■ weigh the mass five times & values are: {1.015, 0.990, 1.013, 1.001, 0.986}. ■ The mean of these values is 1.001, and hence, the bias is 0.001. ■ The precision, as measured by the standard deviation, is 0.013.
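The laboratory-scale example can be reproduced in a few lines of NumPy; a minimal sketch, assuming the five measurements and the known 1 g mass from the slide.

import numpy as np

measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                      # known mass of the standard weight (1 g)

mean = measurements.mean()            # 1.001
bias = mean - true_value              # 0.001  (systematic variation)
precision = measurements.std(ddof=1)  # ~0.013 (sample standard deviation)

print(f"mean={mean:.3f}, bias={bias:.3f}, precision={precision:.3f}")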
• 175. Data Quality ● Measurement and Data Collection Errors ● Precision, Bias, and Accuracy: ○ Accuracy: ■ The closeness of measurements to the true value of the quantity being measured.
• 176. Data Quality ● Measurement and Data Collection Errors ● Outliers: ○ Outliers are either ■ (1) data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set, or ■ (2) values of an attribute that are unusual with respect to the typical values for that attribute. ○ Alternatively - anomalous objects or values.
• 177. Data Quality ● Measurement and Data Collection Errors ● Missing Values - ways to handle them: ○ Eliminate Data Objects or Attributes ○ Estimate Missing Values ○ Ignore the Missing Value during Analysis ● Inconsistent Values
• 178. Data Quality ● Measurement and Data Collection Errors ● Duplicate Data: same data in multiple data objects ○ To detect and eliminate such duplicates, two main issues must be addressed. ■ First - if two objects actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved ■ Second - care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates (the overall process is called deduplication)
• 179. Data Quality “data is of high quality if it is suitable for its intended use.” ● Issues Related to Applications: ● Timeliness: ○ If the data is out of date, then so are the models and patterns that are based on it. ● Relevance: ○ The available data must contain the information necessary for the application. ○ Consider the task of building a model that predicts the accident rate for drivers. If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information is indirectly available through other attributes. ● Knowledge about the Data: ○ Data sets are accompanied by documentation that describes different aspects of the data; ○ the quality of this documentation can help in the subsequent analysis. ○ For example, ■ if the documentation is poor and fails to tell us, for example, that the missing values for a particular field are indicated with a -9999, then our analysis of the data may be faulty. ○ Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval, ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
  • 181. AGGREGATION • “less is more” • Aggregation - combining of two or more objects into a single object. • In Example, • One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single storewide transaction. • This reduces number of records (1 record per store). • How an aggregate transaction is created • Quantitative attributes, such as price, are typically aggregated by taking a sum or an average. • A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that were sold at that location. • Can also be viewed as a multidimensional array, where each attribute is a dimension. • Used in OLAP
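A minimal pandas sketch of the storewide aggregation described above: quantitative attributes are summed, the qualitative item attribute is summarized as the set of items sold. The column names (store, item, price) and values are assumptions for illustration.

import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["milk", "bread", "milk", "soda", "bread"],
    "price": [3.0, 2.5, 3.2, 1.5, 2.4],
})

# One storewide record per store: sum the quantitative attribute,
# summarize the qualitative attribute as the set of items sold there.
storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),
    items=("item", lambda s: set(s)),
)
print(storewide)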
• 182. AGGREGATION • Motivations for aggregation • Smaller data sets require less memory and processing time, which allows the use of more expensive data mining algorithms. • Availability of change of scope or scale • by providing a high-level view of the data instead of a low-level view. • Behavior of groups of objects or attributes is often more stable than that of individual objects or attributes. • Disadvantage of aggregation • potential loss of interesting details.
  • 183. AGGREGATION average yearly precipitation has less variability than the average monthly precipitation.
• 184. SAMPLING • Approach for selecting a subset of the data objects to be analyzed. • Data miners sample because it is too expensive or time consuming to process all the data. • The key principle for effective sampling is the following: • Using a sample will work almost as well as using the entire data set if the sample is representative. • A sample is representative if it has approximately the same property (of interest) as the original set of data. • Choose a sampling scheme/technique which gives a high probability of getting a representative sample.
• 185. SAMPLING • Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive • Simple random sampling • equal probability of selecting any particular item. • Two variations on random sampling: • (1) sampling without replacement - as each item is selected, it is removed from the set of all objects that together constitute the population, and • (2) sampling with replacement - objects are not removed from the population as they are selected for the sample. • Problem: When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent. • Stratified sampling: • starts with prespecified groups of objects • Simpler version - equal numbers of objects are drawn from each group even though the groups are of different sizes. • Other - the number of objects drawn from each group is proportional to the size of that group.
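A sketch of the three random-sampling variants using NumPy; the population array, group labels, and sample sizes are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(100)                    # 100 data objects

# (1) Simple random sampling without replacement
s_without = rng.choice(population, size=10, replace=False)

# (2) Simple random sampling with replacement (an object may be picked twice)
s_with = rng.choice(population, size=10, replace=True)

# (3) Stratified sampling: draw from each prespecified group,
#     here proportionally to the group's size (70 / 20 / 10 objects).
groups = {"A": population[:70], "B": population[70:90], "C": population[90:]}
total = len(population)
stratified = np.concatenate([
    rng.choice(members, size=max(1, round(10 * len(members) / total)), replace=False)
    for members in groups.values()
])
print(s_without, s_with, stratified, sep="\n")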
  • 186. SAMPLING Sampling and Loss of Information • Larger sample sizes increase the probability that a sample will be representative, but they also eliminate much of the advantage of sampling. • Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
  • 187. SAMPLING Determining the Proper Sample Size • Desired outcome: at least one point will be obtained from each cluster. • Probability of getting one object from each of the 10 groups increases as the sample size runs from 10 to 60.
• 188. SAMPLING • Adaptive/Progressive Sampling: • Proper sample size - difficult to determine in advance • Start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained. • The need to determine the correct sample size at the start is eliminated • Stop increasing the sample size at the leveling-off point (where no improvement in the outcome is identified).
• 189. DIMENSIONALITY REDUCTION • Data sets can have a large number of features. • Example • a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document. • thousands or tens of thousands of attributes (components), one for each word in the vocabulary.
• 190. DIMENSIONALITY REDUCTION • Benefits of dimensionality reduction: • Data mining algorithms work better if the dimensionality is lower. • It eliminates irrelevant features and reduces noise • Leads to a more understandable model • fewer attributes • Allows the data to be more easily visualized. • Amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality. • Reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes. • Feature subset selection or feature selection: • The reduction of dimensionality by selecting new attributes that are a subset of the old.
• 191. DIMENSIONALITY REDUCTION • The Curse of Dimensionality • Data analysis becomes significantly harder as the dimensionality of the data increases. • data becomes increasingly sparse • Classification • there are not enough data objects to model a class for all possible objects. • Clustering • density and the distance between points become less meaningful
• 192. DIMENSIONALITY REDUCTION • Linear Algebra Techniques for Dimensionality Reduction • Principal Components Analysis (PCA) • for continuous attributes • finds new attributes (principal components) that • (1) are linear combinations of the original attributes, • (2) are orthogonal (perpendicular) to each other, and • (3) capture the maximum amount of variation in the data. • Singular Value Decomposition (SVD) • Related to PCA
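A short sketch of PCA as a dimensionality-reduction step, assuming scikit-learn is available; the random data merely stands in for a real continuous-attribute data set.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects, 10 continuous attributes

pca = PCA(n_components=3)               # keep the 3 directions of largest variance
X_reduced = pca.fit_transform(X)        # new attributes: orthogonal linear combinations

print(X_reduced.shape)                  # (200, 3)
print(pca.explained_variance_ratio_)    # fraction of variation captured by each component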
  • 193. FEATURE SUBSET SELECTION • Another way to reduce the dimensionality - use only a subset of the features. • Redundant Features • Example: • Purchase price of a product and the amount of sales tax paid • Redundant to each other • contain much of the same information. • Irrelevant features contain almost no useful information for the data mining task at hand. • Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages. • Redundant and irrelevant features • reduce classification accuracy and the quality of the clusters that are found. • can be eliminated immediately by using common sense or domain knowledge, • systematic approach - for selecting the best subset of features • Best approach - try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results.
• 194. FEATURE SUBSET SELECTION • 3 standard approaches to feature selection: • Embedded • Filter • Wrapper
• 195. FEATURE SUBSET SELECTION • Embedded approaches: • Feature selection occurs naturally as part of the data mining algorithm. • During execution of the algorithm, the algorithm itself decides which attributes to use and which to ignore. • Example:- Algorithms for building decision tree classifiers • Filter approaches: • Features are selected before the data mining algorithm is run • Approach that is independent of the data mining task. • Wrapper approaches: • Uses the target data mining algorithm as a black box to find the best subset of attributes • typically without enumerating all possible subsets.
  • 196. FEATURE SUBSET SELECTION • An Architecture for Feature Subset Selection : • The feature selection process is viewed as consisting of four parts: 1. a measure for evaluating a subset, 2. a search strategy that controls the generation of a new subset of features, 3. a stopping criterion, and 4. a validation procedure. • Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features. • wrapper method – uses the target data mining algorithm • filter approach - evaluation technique is distinct from the target data mining algorithm.
• 198. FEATURE SUBSET SELECTION • Feature subset selection is a search over all possible subsets of features. • Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task • Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes. • Wrapper approach: run the target data mining application and measure the result of the data mining. • Stopping criterion • conditions involving the following: • the number of iterations, • whether the value of the subset evaluation measure is optimal or exceeds a certain threshold, • whether a subset of a certain size has been obtained, • whether simultaneous size and evaluation criteria have been achieved, and • whether any improvement can be achieved by the options available to the search strategy. • Validation: • Finally, the results of the target data mining algorithm on the selected subset should be validated. • An evaluation approach: run the algorithm with the full set of features and compare the full results to results obtained using the subset of features.
• 199. FEATURE SUBSET SELECTION • Feature Weighting • An alternative to keeping or eliminating features. • One approach • Higher weight - more important features • Lower weight - less important features • Another approach - automatic • Example - classification scheme - Support Vector Machines • Other approach • The normalization of objects - Cosine Similarity - used as weights
• 200. FEATURE CREATION • Create a new set of attributes that captures the important information in a data set from the original attributes • much more effective. • No. of new attributes < No. of original attributes • Three related methodologies for creating new attributes: 1. Feature extraction 2. Mapping the data to a new space 3. Feature construction
• 201. FEATURE CREATION • Feature Extraction • The creation of a new set of features from the original raw data • Example: classify a set of photographs based on the existence of a human face (present or not) • Raw data (set of pixels) - not suitable for many types of classification algorithms. • With higher-level features (presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces), a much broader set of classification techniques can be applied to this problem. • Feature extraction is highly domain-specific • A new application area means development of new features and feature extraction methods.
• 202. FEATURE CREATION Mapping the Data to a New Space • A totally different view of the data can reveal important and interesting features. • If there is only a single periodic pattern and not much noise, then the pattern is easily detected. • If there are a number of periodic patterns and a significant amount of noise is present, then these patterns are hard to detect. • Such patterns can be detected by applying a Fourier transform to the time series in order to change to a representation in which frequency information is explicit. • Example: • Power spectrum that can be computed after applying a Fourier transform to the original time series.
• 203. FEATURE CREATION • Feature Construction • Features in the original data set contain the necessary information, but are not in a form suitable for the data mining algorithm. • New features constructed out of the original features can be more useful than the original features. • Example (Density). • Dataset contains the volume and mass of historical artifacts. • A density feature constructed from the mass and volume features, i.e., density = mass/volume, would most directly yield an accurate classification.
• 204. DISCRETIZATION AND BINARIZATION • Some classification algorithms require that the data be in the form of categorical attributes. • Algorithms that find association patterns require that the data be in the form of binary attributes. • Discretization - transforming a continuous attribute into a categorical attribute • Binarization - transforming both continuous and discrete attributes into one or more binary attributes
• 205. DISCRETIZATION AND BINARIZATION • Binarization of a categorical attribute (simple technique): • If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1]. • If the attribute is ordinal, then order must be maintained by the assignment. • Next, convert each of these m integers to a binary number using n binary attributes • n = ⌈log2(m)⌉ binary digits are required to represent these integers
• 206. DISCRETIZATION AND BINARIZATION Example: a categorical variable with 5 values {awful, poor, OK, good, great} requires three binary variables x1, x2, and x3 (see the sketch below).
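A sketch of the integer-then-binary encoding described above, using the five-value ordinal attribute from the example; the helper name binarize is illustrative, not from the text.

import math

values = ["awful", "poor", "OK", "good", "great"]      # ordinal, so keep this order
to_int = {v: i for i, v in enumerate(values)}          # awful -> 0, ..., great -> 4

m = len(values)
n = math.ceil(math.log2(m))                            # n = 3 binary attributes

def binarize(value):
    """Encode a categorical value as n binary attributes (x1..xn)."""
    i = to_int[value]
    return [int(bit) for bit in format(i, f"0{n}b")]

for v in values:
    print(v, binarize(v))     # e.g. OK -> [0, 1, 0]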
• 207. DISCRETIZATION AND BINARIZATION • Discretization of Continuous Attributes (classification or association analysis) • Transformation of a continuous attribute to a categorical attribute involves two subtasks: • decide the number of categories • decide how to map the values of the continuous attribute to these categories. • Step I: sort attribute values and divide them into n intervals by specifying n − 1 split points. • Step II: all the values in one interval are mapped to the same categorical value.
• 208. DISCRETIZATION AND BINARIZATION • Discretization of Continuous Attributes • Problem of discretization is • deciding how many split points to choose and • where to place them. • The result can be represented either as • a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)}, where x0 may be −∞ and xn may be +∞, or • as a series of inequalities x0 < x ≤ x1, ..., xn−1 < x < xn.
• 209. DISCRETIZATION AND BINARIZATION • Unsupervised Discretization • Discretization methods for classification • Supervised - class information is known and used • Unsupervised - class information is unknown or not used • Equal width approach: • divides the range of the attribute into a user-specified number of intervals, each having the same width. • problem with outliers • Equal frequency (equal depth) approach: • puts the same number of objects into each interval • K-means clustering method (a sketch of the first two approaches follows below)
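A sketch of the two simplest unsupervised schemes with pandas (cut for equal width, qcut for equal frequency); the sample values are made up, and the k-means variant is not shown here.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=50, scale=15, size=100))

equal_width = pd.cut(values, bins=4)        # 4 intervals of identical width
equal_freq = pd.qcut(values, q=4)           # 4 intervals with ~25 values each

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())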
  • 210. DISCRETIZATION AND BINARIZATION UnSupervised Discretization Original Data
  • 211. DISCRETIZATION AND BINARIZATION UnSupervised Discretization Equal Width Discretization
  • 212. DISCRETIZATION AND BINARIZATION UnSupervised Discretization Equal Frequency Discretization
  • 213. DISCRETIZATION AND BINARIZATION UnSupervised Discretization K-means Clustering (better result)
• 214. DISCRETIZATION AND BINARIZATION • Supervised Discretization • When additional information (class labels) is used, it produces better results. • Some concerns: purity of an interval and the minimum size of an interval. • Statistically based approaches: • start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test. • Entropy based approaches:
• 215. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • Entropy of the ith interval: ei = − Σj=1..k pij log2 pij, where • pij = mij/mi - probability (fraction) of class j in the ith interval, • k - no. of different class labels, • mi - no. of values in the ith interval of a partition, • mij - no. of values of class j in interval i.
  • 216. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy
• 217. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • Total entropy, e, of the partition is the weighted average of the individual interval entropies: e = Σi=1..n wi ei, where • m - total no. of values, • wi = mi/m - fraction of values in the ith interval, • n - no. of intervals. • Perfectly pure interval: entropy is 0 • if an interval contains only values of one class • Impure interval: entropy is maximum • when the classes of values in an interval occur equally often (a code sketch follows below)
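A sketch of the interval-entropy calculation defined above; the class counts per interval are assumptions made up for illustration.

import math

def interval_entropy(class_counts):
    """e_i = -sum_j p_ij * log2(p_ij), with p_ij = m_ij / m_i."""
    m_i = sum(class_counts)
    return -sum((c / m_i) * math.log2(c / m_i) for c in class_counts if c > 0)

def total_entropy(intervals):
    """e = sum_i w_i * e_i, with w_i = m_i / m."""
    m = sum(sum(counts) for counts in intervals)
    return sum((sum(counts) / m) * interval_entropy(counts) for counts in intervals)

intervals = [[8, 0], [2, 6]]          # counts of two classes in two intervals
print(interval_entropy([8, 0]))       # 0.0  -> perfectly pure interval
print(interval_entropy([4, 4]))       # 1.0  -> maximally impure (two equal classes)
print(total_entropy(intervals))       # weighted average of the interval entropies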
• 218. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • Simple approach for partitioning a continuous attribute: • starts by bisecting the initial values so that the resulting two intervals give minimum entropy. • consider each value as a possible split point • Repeat the splitting process with another interval • choosing the interval with the worst (highest) entropy, • until a user-specified number of intervals is reached, or • a stopping criterion is satisfied.
• 219. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • 3 categories for both x & y
• 220. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • 5 categories for both x & y • Observation: • no improvement for 6 categories
• 221. DISCRETIZATION AND BINARIZATION • Categorical Attributes with Too Many Values • If the categorical attribute is ordinal, • techniques similar to those for continuous attributes • If the categorical attribute is nominal, • Example:- • University that has a large number of departments. • department name attribute - dozens of diff. values. • combine departments into larger groups, such as • engineering, • social sciences, or • biological sciences.
• 222. Variable Transformation • Transformation that is applied to all the values of a variable. • Example: if only the magnitude of a variable is important • then the values of the variable can be transformed by taking the absolute value. • Simple Function Transformation: • A simple mathematical function is applied to each value individually. • If x is a variable, then examples of such transformations include • x^k, • log x, • e^x, • √x, • 1/x, • sin x, or |x|
• 223. Variable Transformation • Variable transformations should be applied with caution since they change the nature of the data. • Example:- • transformation function is 1/x • if a value is 1 or >1, the transformation reduces its magnitude • values {1, 2, 3} go to {1, 1/2, 1/3} • if a value is between 0 & 1, the transformation increases its magnitude • values {1, 1/2, 1/3} go to {1, 2, 3}. • so better ask questions such as the following: • Does the order need to be maintained? • Does the transformation apply to all values (negative values & 0)? • What is the effect of the transformation on the values between 0 & 1?
• 224. Variable Transformation • Normalization or Standardization • Goal of standardization or normalization • To make an entire set of values have a particular property. • A traditional example is that of “standardizing a variable” in statistics. • If x̄ is the mean (average) of the attribute values and • sx is the standard deviation, • then the transformation x' = (x − x̄)/sx • creates a new variable that has a mean of 0 and a standard deviation of 1.
• 225. Variable Transformation • Normalization or Standardization • If different variables are to be combined, a transformation is necessary to avoid having a variable with large values dominate the results of the calculation. • Example: • comparing people based on two variables: age and income. • For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). • Income values (higher values) will dominate the calculation.
• 226. Variable Transformation • Normalization or Standardization • Mean and standard deviation are strongly affected by outliers • Mean is replaced by the median, i.e., the middle value. • If x is the variable, the absolute standard deviation of x is σA = Σi=1..m |xi − µ|, where • xi - ith value of the variable, • m - number of objects, and • µ - mean or median. • Other approaches • computing estimates of the location (center) and • spread of a set of values in the presence of outliers • These measures can also be used to define a standardization transformation.
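A sketch of the classical and the outlier-robust standardization just described; the income-like values are made up. The robust variant uses the median and the average absolute deviation from it (dividing the sum of absolute deviations by m, which differs from the slide's sum only by a constant scaling factor).

import numpy as np

x = np.array([21000.0, 25000.0, 23000.0, 24000.0, 250000.0])   # one outlier

# Classical standardization: mean 0, standard deviation 1
z = (x - x.mean()) / x.std(ddof=1)

# Robust variant: replace the mean by the median and the standard deviation
# by the average absolute deviation from the median
mu = np.median(x)
sigma_a = np.mean(np.abs(x - mu))
z_robust = (x - mu) / sigma_a

print(z)
print(z_robust)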
• 227. Measures of Similarity and Dissimilarity Unit - II Datamining
• 228. Measures of Similarity and Dissimilarity ● Similarity and dissimilarity are important because they are used by a number of data mining techniques ○ such as ■ clustering, ■ nearest neighbor classification, and ■ anomaly detection. ● Proximity is used to refer to either similarity or dissimilarity. ○ proximity between objects having only one simple attribute, and ○ proximity measures for objects with multiple attributes.
• 229. Measures of Similarity and Dissimilarity ● Similarity between two objects is a numerical measure of the degree to which the two objects are alike. ○ Similarity - higher for objects that are more alike. ○ Non-negative ○ between 0 (no similarity) and 1 (complete similarity). ● Dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. ○ Dissimilarity - lower for objects that are more similar. ○ Distance - synonym for dissimilarity
• 230. Measures of Similarity and Dissimilarity Transformations ● Transformations are often applied to ○ convert a similarity to a dissimilarity, ○ convert a dissimilarity to a similarity, or ○ transform a proximity measure to fall within a particular range, such as [0,1]. ● Example ○ Similarities between objects range from 1 (not at all similar) to 10 (completely similar) ○ we can make them fall within the range [0, 1] by using the transformation ■ s' = (s−1)/9 ■ s - original similarity ■ s' - new similarity value
  • 231. Measures of Similarity and Dissimilarity
  • 232. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects Euclidean Distance
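The Euclidean distance between two n-dimensional points x and y is d(x, y) = sqrt(Σk (xk − yk)²). A minimal sketch of it, written as the r = 2 special case of the more general Minkowski distance; the example points are made up.

import numpy as np

def minkowski(x, y, r):
    """d(x, y) = (sum_k |x_k - y_k|^r)^(1/r); r=1 city block, r=2 Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])

print(minkowski(x, y, 1))   # 7.0  (L1 / city-block distance)
print(minkowski(x, y, 2))   # 5.0  (L2 / Euclidean distance)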
• 233. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects If d(x, y) is the distance between two points, x and y, then the following properties hold. 1. Positivity (a) d(x, y) ≥ 0 for all x and y, (b) d(x, y) = 0 only if x = y. 2. Symmetry d(x, y) = d(y, x) for all x and y. 3. Triangle Inequality d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. Note:- Measures that satisfy all three properties are known as metrics.
  • 234. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
  • 235. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
  • 236. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
  • 237. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
• 238. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects Non-metric Dissimilarities: Set Differences If A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. If d(A, B) = size(A − B), then it does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. d(A, B) = size(A − B) + size(B − A) (modified definition, which satisfies all the properties)
• 239. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects Non-metric Dissimilarities: Time A dissimilarity measure that is not a metric, but still useful. d(1PM, 2PM) = 1 hour d(2PM, 1PM) = 23 hours ● Example:- when answering the question: “If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?”
• 241. Measures of Similarity and Dissimilarity Similarities between Data Objects ● Typical properties of similarities are the following: ○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1) ○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry) ● A Non-symmetric Similarity Measure ○ Classify a small set of characters which are flashed on a screen. ○ Confusion matrix - records how often each character is classified as itself, and how often each is classified as another character. ○ “0” appeared 200 times but was classified as ■ “0” 160 times, ■ “o” 40 times. ○ ‘o’ appeared 200 times and was classified as ■ “o” 170 times ■ “0” only 30 times. ● The similarity measure can be made symmetric by setting ○ s'(x, y) = s'(y, x) = (s(x, y) + s(y, x))/2, ■ s' - new similarity measure.
  • 242. Measures of Similarity and Dissimilarity Examples of proximity measures ● Similarity Measures for Binary Data ○ Similarity measures between objects that contain only binary attributes are called similarity coefficients ○ Let x and y be two objects that consist of n binary attributes. ○ The comparison of two objects (or two binary vectors), leads to the following four quantities (frequencies): f00 = the number of attributes where x is 0 and y is 0 f01 = the number of attributes where x is 0 and y is 1 f10 = the number of attributes where x is 1 and y is 0 f11 = the number of attributes where x is 1 and y is 1
• 243. Measures of Similarity and Dissimilarity Examples of proximity measures ● Similarity Measures for Binary Data: Simple Matching Coefficient (SMC) = (f11 + f00) / (f01 + f10 + f11 + f00); Jaccard Coefficient (J) = f11 / (f01 + f10 + f11)
  • 244. Measures of Similarity and Dissimilarity Examples of proximity measures ● Similarity Measures for Binary Data
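A sketch that computes the four frequencies and the two coefficients for a pair of binary vectors; the example vectors are made up for illustration.

import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)   # counts both 1-1 and 0-0 matches
jaccard = f11 / (f01 + f10 + f11)             # ignores the 0-0 matches (asymmetric attributes)

print(f"SMC = {smc:.2f}, Jaccard = {jaccard:.2f}")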
• 245. Measures of Similarity and Dissimilarity Examples of proximity measures Cosine similarity (Document similarity) If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖)
  • 246. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity)
• 247. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity)

# import required libraries
import numpy as np
from numpy.linalg import norm

# define two lists or arrays
A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])
print("A:", A)
print("B:", B)

# compute cosine similarity
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)
• 248. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity) ● Cosine similarity - measure of the angle between x and y. ● Cosine similarity = 1 (angle is 0◦, and x & y are the same except for magnitude or length) ● Cosine similarity = 0 (angle is 90◦, and x & y do not share any terms (words))
• 249. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity) Note:- Dividing x and y by their lengths normalizes them to have a length of 1 (means magnitude is not considered)
• 250. Measures of Similarity and Dissimilarity Examples of proximity measures Extended Jaccard Coefficient (Tanimoto Coefficient): EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
• 251. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation: corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y))
• 252. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation ● The more tightly linear two variables X and Y are, the closer Pearson's correlation coefficient (PCC) is to −1 or +1. ○ PCC = −1, if the relationship is negative ■ an increase in the value of one variable decreases the value of the other variable ○ PCC = +1, if the relationship is positive ■ an increase in the value of one variable increases the value of the other variable ○ PCC = 0, perfectly linearly uncorrelated variables
  • 253. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation
  • 254. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation ( scipy.stats.pearsonr() - automatic)
  • 255. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation (manual in python)
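A sketch that computes Pearson's correlation both with scipy.stats.pearsonr (the "automatic" route mentioned above) and manually as covariance divided by the product of standard deviations; the two short series are made up and form a perfectly negative linear relationship.

import numpy as np
from scipy.stats import pearsonr

x = np.array([-3.0, 6.0, 0.0, 3.0, -6.0])
y = np.array([1.0, -2.0, 0.0, -1.0, 2.0])

# Library call
r, p_value = pearsonr(x, y)

# Manual: covariance(x, y) / (std(x) * std(y))
cov = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov / (x.std() * y.std())

print(r, r_manual)     # both -1.0 here, since y is an exact negative multiple of x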
• 257. BASIC CONCEPTS • Input data -> collection of records • Record / instance / example -> tuple (x, y) • x - attribute set • y - special attribute (class label / category / target attribute) • Attribute set - properties of a Data Object - Discrete / Continuous • Class label: • Classification - y is a discrete attribute • Regression (Predictive Modeling Task) - y is a continuous attribute.
• 258. BASIC CONCEPTS • Definition: • Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. • The target function is also known informally as a classification model.
• 259. BASIC CONCEPTS • A classification model is useful for the following purposes. • Descriptive modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes.
• 260. BASIC CONCEPTS • A classification model is useful for the following purposes. • Predictive Modeling: • A classification model can also be used to predict the class label of unknown records. • Automatically assigns a class label when presented with the attribute set of an unknown record. • Classification techniques are best suited for binary or nominal categories. • They do not consider the implicit order of ordinal categories • Other relationships (e.g., superclass-subclass) are also ignored
• 261. General approach to solving a classification problem • Classification technique (or classifier) • Systematic approach to building classification models from an input data set. • Examples • Decision tree classifiers, • Rule-based classifiers, • Neural networks, • Support vector machines, and • Naive Bayes classifiers. • Learning algorithm • Used by the classifier • To identify a model • That best fits the relationship between the attribute set and class label of the input data.
  • 262. General approach to solving a classification problem • Model • Generated by a learning algorithm • Should satisfy the following: • Fit the input data well • Correctly predict the class labels of records it has never seen before. • Training set • Consisting of records whose class labels are known • used to build a classification model
• 263. General approach to solving a classification problem • Confusion Matrix • Used to evaluate the performance of a classification model • Holds details about • counts of test records correctly and incorrectly predicted by the model. • Table 4.2 depicts the confusion matrix for a binary classification problem. • fij – no. of records from class i predicted to be of class j. • f01 – no. of records from class 0 incorrectly predicted as class 1. • total no. of correct predictions made (f11 + f00) • total number of incorrect predictions (f10 + f01).
• 264. General approach to solving a classification problem • Performance Metrics: 1. Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00) 2. Error Rate = (f10 + f01) / (f11 + f10 + f01 + f00)
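A sketch that derives both metrics from the binary confusion-matrix counts f11, f10, f01, f00; the counts themselves are made-up values for illustration.

# Confusion-matrix counts for a binary problem (made-up values):
# f_ij = number of records from class i predicted as class j
f11, f10 = 40, 10     # records whose actual class is 1
f01, f00 = 5, 45      # records whose actual class is 0

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total          # fraction of correct predictions
error_rate = (f10 + f01) / total        # fraction of wrong predictions

print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
assert abs(accuracy + error_rate - 1.0) < 1e-12   # the two metrics always sum to 1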
• 265. DECISION TREE INDUCTION Working of Decision Tree • We can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record. • Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record. • The series of questions and their possible answers can be organized in the form of a decision tree • A decision tree is a hierarchical structure consisting of nodes and directed edges.
  • 266. DECISION TREE INDUCTION Working of Decision Tree • Three types of nodes: • Root node • No incoming edges • Zero or more outgoing edges. • Internal nodes • Exactly one incoming edge and • Two or more outgoing edges. • Leaf or terminal nodes • Exactly one incoming edge and • No outgoing edges. • Each leaf node is assigned a class label. • Non-terminal nodes (root & other internal nodes) contain attribute test conditions to separate records that have different characteristics.
  • 267. DECISION TREE INDUCTION Working of Decision Tree
  • 268. DECISION TREE INDUCTION Building Decision Tree • Hunt’s algorithm: • basis of many existing decision tree induction algorithms, including • ID3, • C4.5, and • CART. • In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. • Dt - set of training records associated with node t • y = {y1, y2,..., yc} - set of class labels. • Hunt’s algorithm: • Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt. • Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. • Note:- the algorithm is then recursively applied to each child node.
  • 269. DECISION TREE INDUCTION Building Decision Tree • Example:- predicting whether a loan applicant will repay the loan or default • Construct a training set by examining the records of previous borrowers.
  • 270. DECISION TREE INDUCTION Building Decision Tree • Hunt’s algorithm will work fine • if every combination of attribute values is present in the training data and • if each combination has a unique class label. • Additional conditions 1. If a child node is empty (no training records reach it), declare it a leaf node with the same class label as the majority class of training records associated with its parent node. 2. If the records have identical attribute values but different class labels, it is not possible to split further; declare the node a leaf with the same class label as the majority class of training records associated with this node. • Design Issues of Decision Tree Induction • 1. How should the training records be split? • Test condition to divide the records into smaller subsets. • provide a method for specifying the test condition • measure for evaluating the goodness of each test condition. • 2. How should the splitting procedure stop? • A stopping condition is needed • stop when either all the records belong to the same class or all the records have identical attribute values.
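To make the two steps of Hunt's algorithm concrete, here is a minimal recursive sketch. It is a simplification rather than the textbook pseudocode: the test attribute is chosen arbitrarily instead of with an impurity measure, and the tiny training set at the bottom is hypothetical.

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a decision tree over `records`, a list of (attribute_dict, label) pairs.
    Returns a class label (leaf node) or a dict {attr: {value: subtree}}."""
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]

    # Step 1: all records in Dt belong to the same class -> leaf labeled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Nothing left to test (identical attribute values) -> leaf with the majority class.
    if not attributes:
        return majority

    # Step 2: select an attribute test condition (here: simply the first attribute).
    attr, remaining = attributes[0], attributes[1:]
    tree = {attr: {}}
    for value in {x[attr] for x, _ in records}:
        child = [(x, y) for x, y in records if x[attr] == value]
        # Empty child -> leaf labeled with the majority class of the parent's records.
        tree[attr][value] = hunt(child, remaining) if child else majority
    return tree

# Hypothetical loan-default style training set (not the textbook table).
data = [({"home_owner": "yes", "married": "no"}, "no"),
        ({"home_owner": "no", "married": "yes"}, "no"),
        ({"home_owner": "no", "married": "no"}, "yes")]
print(hunt(data, ["home_owner", "married"]))
```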
  • 271. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Binary Attributes • The test condition for a binary attribute generates two potential outcomes.
  • 272. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Nominal Attributes • A nominal attribute can have many values • Test condition can be expressed in two ways • Multiway split - the number of outcomes depends on the number of distinct attribute values • Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.
  • 273. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Ordinal Attributes • Ordinal attributes can also produce binary or multiway splits. • Values can be grouped as long as the grouping does not violate the order property. • The grouping shown in Figure 4.10(c) is invalid because it violates the order property.
  • 274. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Continuous Attributes • Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes, or • Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k. • Multiway split • Apply the discretization strategies
  • 275. DECISION TREE INDUCTION Measures for Selecting the Best Split • p(i|t) - fraction of records belonging to class i at a given node t. • Sometimes written simply as pi • Two-class problem • (p0, p1) - class distribution at any node, where p1 = 1 − p0 • A distribution of (0.5, 0.5) means there are an equal number of records from each class • Splitting on an attribute such as Car Type will result in purer partitions
  • 276. DECISION TREE INDUCTION Measures for Selecting the Best Split • Selection of the best split is based on the degree of impurity of the child nodes • A node with class distribution (0, 1) has zero impurity • A node with uniform class distribution (0.5, 0.5) has the highest impurity. • p - fraction of records that belong to one of the two classes. • Impurity is maximum when p = 0.5 (the class distribution is even) • Impurity is minimum when p = 0 or 1 (all records belong to the same class)
  • 277. DECISION TREE INDUCTION Measures for Selecting the Best Split • Node N1 has the lowest impurity value, followed by N2 and N3.
  • 278. DECISION TREE INDUCTION Measures for Selecting the Best Split • To determine the performance of a test condition, compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting). • The larger their difference, the better the test condition. • Information Gain: • I(·) - impurity measure of a given node, • N - total no. of records at the parent node, • k - no. of attribute values • N(vj) - no. of records associated with the child node vj. • The gain of a split is Δ = I(parent) − Σ_{j=1}^{k} [N(vj)/N] · I(vj); when entropy is used as the impurity measure, this difference is known as the information gain, Δinfo
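The impurity measures and the gain Δ can be sketched directly from these definitions. The child class counts below, (4, 3) and (2, 3), are chosen to match the attribute A split discussed on the following slides (Gini 0.4898 and 0.480); the parent distribution (6, 6) is an assumption:

```python
import math

def gini(counts):
    """Gini index: 1 - sum_i p(i|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum_i p(i|t) * log2(p(i|t))."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j)."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children_counts)
    return impurity(parent_counts) - weighted

parent = [6, 6]              # parent node: 6 records of class 0, 6 of class 1
children = [[4, 3], [2, 3]]  # class counts in the two child nodes of attribute A
print("Gini(N1) =", round(gini([4, 3]), 4))            # 0.4898
print("Gini(N2) =", round(gini([2, 3]), 4))            # 0.48
print("Gini-based gain =", round(gain(parent, children, gini), 4))
print("Information gain =", round(gain(parent, children, entropy), 4))
```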
  • 279. Calculate Impurity using Gini. Find out which attribute is selected.
  • 281. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Splitting of Binary Attributes ○ Before splitting, the Gini index is 0.5 ■ because there are an equal number of records from both classes. ○ If attribute A is chosen to split the data, ■ Gini index ● node N1 = 0.4898, and ● node N2 = 0.480. ■ Weighted average of the Gini index for the descendent nodes is ● (7/12) × 0.4898 + (5/12) × 0.480 = 0.486. ○ Weighted average of the Gini index for attribute B is 0.375. ○ B is selected because of its smaller value
  • 283. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Splitting of Nominal Attributes ○ First Binary Grouping ■ Gini index of {Sports, Luxury} is 0.4922 and ■ the Gini index of {Family} is 0.3750. ■ The weighted average Gini index is 16/20 × 0.4922 + 4/20 × 0.3750 = 0.468. ○ Second binary grouping of {Sports} and {Family, Luxury}, ■ weighted average Gini index is 0.167. ● The second grouping has a lower Gini index because its corresponding subsets are much purer.
  • 284. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Splitting of Continuous Attributes ● A brute-force method - take every value of the attribute in the N records as a candidate split position v. ● Count the number of records with annual income less than or greater than v (computationally expensive). ● To reduce the complexity, the training records are sorted based on their annual income, ● Candidate split positions are identified by taking the midpoints between two adjacent sorted values:
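A brief sketch of the sorted-midpoint idea; the annual-income values below are illustrative:

```python
# Candidate split positions for a continuous attribute: sort the distinct values
# and take the midpoint between each pair of adjacent values.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # illustrative values, in $1000s
values = sorted(set(incomes))
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(candidates)   # each candidate v defines the binary test (Income < v) vs (Income >= v)
```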
  • 285. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Gain Ratio ○ Problem: ■ Customer ID produces the purest partitions. ■ But Customer ID is not a predictive attribute because its value is unique for each record. ○ Two Strategies: ■ First strategy (used in CART) ● restrict the test conditions to binary splits only. ■ Second strategy (used in C4.5 - Gain Ratio - to determine the goodness of a split) ● modify the splitting criterion ● consider the number of outcomes produced by the attribute test condition.
  • 286. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Gain Ratio = Δinfo / Split Info, where Split Info = − Σ_{i=1}^{k} P(vi) log2 P(vi) and k is the total number of splits; a test condition that produces a large number of small partitions has a high Split Info, which lowers its gain ratio.
  • 288. Tree-Pruning • After building the decision tree, • Tree-pruning step - to reduce the size of the decision tree. • Pruning - • trims the branches of the initial tree • improves the generalization capability of the decision tree. • Decision trees that are too large are susceptible to a phenomenon known as overfitting.
  • 292. Model Overfitting ● Errors that generally occur in a classification model: ○ Training Errors (or Resubstitution Error or Apparent Error) ■ No. of misclassification errors committed on the training data ○ Generalization Errors ■ Expected error of the model on previously unseen records ● Model Overfitting: ○ A model is overfitting the training data when it performs well on the training data but does not perform well on the evaluation (test) data. ○ This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
  • 293. Model Overfitting ● Model Underfitting: ● Model is underfitting the training data when the model performs poorly on the training data. ● Model is unable to capture the relationship between the input examples (X) and the target values (Y). https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
  • 296. Model Overfitting Overfitting Due to Presence of Noise: Train Error - 0, Test Error - 30% ● Humans and dolphins were misclassified ● Spiny anteaters (exceptional case) ● Errors due to exceptional cases are often unavoidable and establish the minimum error rate achievable by any classifier.
  • 297. Model Overfitting Overfitting Due to Presence of Noise: Train Error - 20%, Test Error - 10%
  • 298. Model Overfitting Overfitting Due to Lack of Representative Samples Overfitting can also occur when only a small number of training records is available ● Training error is zero, test error is 30% ● Humans, elephants, and dolphins are misclassified ● The decision tree classifies all warm-blooded vertebrates that do not hibernate as non-mammals (because of the eagle record - lack of representative samples).
  • 299. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Methods commonly used to evaluate the performance of a classifier ○ Holdout method ○ Random Subsampling ○ Cross Validation ■ K-fold ■ Leave-one-out ○ Bootstrap ■ .632 Bootstrap
  • 300. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Holdout method ○ Original data is partitioned into two disjoint sets ■ training set ■ test set ○ A classification model is then induced from the training set ○ Model performance is evaluated on the test set. ○ Analysts can decide the proportion of data reserved for training and for testing ■ e.g., 50-50, or ■ two-thirds for training and one-third for testing ○ Limitations 1. The model may not be good because only a subset of the records is available for model induction 2. The model may be highly dependent on the composition of the training and test sets. ● If the training set is small, the variance of the model is larger. ● If the training set is too large, the estimated accuracy from a small test set is less reliable. 3. The training and test sets are no longer independent https://www.datavedas.com/holdout-cross-validation/
  • 301. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Random Subsampling ○ The holdout method can be repeated several times to improve the estimation of a classifier’s performance. ○ Overall accuracy is the average of the accuracies obtained over all runs. ○ Problems: ■ Does not utilize as much data as possible for training. ■ No control over the number of times each record is used for testing and training. https://blog.ineuron.ai/Hold-Out-Method-Random-Sub-Sampling-Method-3MLDEXAZML
  • 302. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Cross Validation ○ Alternative to random subsampling ○ Each record is used the same number of times for training and exactly once for testing. ○ Two-fold cross-validation ■ Partition the data into two equal-sized subsets. ■ Use one of the subsets for training and the other for testing. ■ Then swap the roles of the subsets https://fengkehh.github.io/post/introduction-to-cross-validation/ - picture reference
  • 303. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Cross Validation ○ K-Fold Cross Validation ■ k equal-sized partitions ■ During each run, ● one of the partitions is chosen for testing, ● while the rest of them are used for training. ■ Total error is found by summing up the errors for all k runs. Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
  • 304. Model Overfitting - Evaluating the Performance of a Classifier Cross-validation ● Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly split up into ‘k’ groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set. ● For example, for 5-fold cross validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times so each group would get a chance to be the test set. This can be seen in the graph below. ● 5-fold cross validation (image credit) Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
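A minimal k-fold partitioning sketch; the `evaluate` callback and the toy data are placeholders for whatever classifier and dataset are being assessed:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split the record indices 0..n-1 into k (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, k, evaluate):
    """Each fold serves exactly once as the test set; the rest form the training set.
    Returns the total error summed over all k runs."""
    folds = k_fold_indices(len(records), k)
    total_error = 0
    for i, test_idx in enumerate(folds):
        test = [records[j] for j in test_idx]
        train = [records[j] for f in folds[:i] + folds[i + 1:] for j in f]
        total_error += evaluate(train, test)   # e.g. number of misclassified test records
    return total_error

# Placeholder evaluator: pretend every run misclassifies 2 test records.
print(cross_validate(list(range(20)), k=5, evaluate=lambda train, test: 2))   # 10
```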
  • 305. Model Overfitting - Evaluating the Performance of a Classifier Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR 5-fold cross validation (image credit)
  • 306. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Cross Validation ○ Leave-one-out approach ■ A special case of k-fold cross-validation ● sets k = N (dataset size) ■ Size of test set = 1 record ■ All remaining records = training set ■ Advantage ● Utilizes as much data as possible for training ● Test sets are mutually exclusive and they effectively cover the entire data set. ■ Drawback ● computationally expensive Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
  • 307. Model Overfitting Evaluating the Performance of a Classifier ● Bootstrap ○ Training records are sampled with replacement; ■ A record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. ○ Probability that a record is chosen by a bootstrap sample is 1 − (1 − 1/N)^N ■ When N is sufficiently large, the probability asymptotically approaches 1 − e^(−1) = 0.632. ○ On average, a bootstrap sample contains 63.2% of the records of the original data. ● .632 bootstrap: acc_boot = (1/b) Σ_{i=1}^{b} (0.632 × εi + 0.368 × acc_s), where ● b - no. of bootstrap samples ● εi - accuracy of the ith bootstrap sample, acc_s - accuracy on the full training data Picture reference - https://bradleyboehmke.github.io/HOML/process.html
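The 63.2% figure is easy to check empirically by drawing a single bootstrap sample; N below is arbitrary:

```python
import random

N = 10_000                                            # number of records (arbitrary)
sample = [random.randrange(N) for _ in range(N)]      # sample N records with replacement
fraction_distinct = len(set(sample)) / N
print(round(fraction_distinct, 3))                    # close to 1 - (1 - 1/N)**N  ~  0.632
```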
  • 309. Bayesian Classifiers ● In many applications the relationship between the attribute set and the class variable is non-deterministic. ● Example: ○ Risk for heart disease based on the person’s diet and workout frequency. ● So, we model the relationship between the attribute set and the class variable probabilistically. ● Bayes Theorem
  • 310. Bayesian Classifiers ● Consider a football game between two rival teams: Team 0 and Team 1. ● Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches. ● Among the games won by Team 0, only 30% of them come from playing on Team 1’s football field. ● On the other hand, 75% of the victories for Team 1 are obtained while playing at home. ● If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner? ● This problem can be solved by Bayes Theorem
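The question can be answered by plugging the given percentages into Bayes' theorem; a short sketch of that calculation (the variable names are ours):

```python
# Given: P(Team 0 wins) = 0.65, P(Team 1 wins) = 0.35,
#        P(hosted by Team 1 | Team 0 wins) = 0.30,
#        P(hosted by Team 1 | Team 1 wins) = 0.75.
p_win1, p_win0 = 0.35, 0.65
p_host_given_win1, p_host_given_win0 = 0.75, 0.30

# Bayes' theorem: P(Team 1 wins | game hosted by Team 1).
p_host = p_host_given_win1 * p_win1 + p_host_given_win0 * p_win0
p_win1_given_host = p_host_given_win1 * p_win1 / p_host
print(round(p_win1_given_host, 3))   # ~0.574, so Team 1 is the more likely winner
```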
  • 311. Bayesian Classifiers ● Bayes Theorem ○ X and Y are random variables. ○ A conditional probability is the probability that a random variable will take on a particular value given that the outcome for another random variable is known. ○ Example: ■ The conditional probability P(Y = y|X = x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x.
  • 312. Bayesian Classifiers ● Bayes Theorem: P(Y|X) = P(X|Y) P(Y) / P(X). If {Y1, Y2,..., Yk} is the set of mutually exclusive and exhaustive outcomes of the random variable Y, then the denominator P(X) can be expressed using the law of total probability: P(X) = Σ_{i=1}^{k} P(X|Yi) P(Yi).
  • 314. Bayesian Classifiers ● Bayes Theorem ○ Using the Bayes Theorem for Classification ■ X - attribute set ■ Y - class variable. ○ Treat X and Y as random variables - for a non-deterministic relationship ○ Capture the relationship probabilistically using P(Y|X) - the posterior (conditional) probability ○ P(Y) - prior probability ○ Training phase ■ Learn the posterior probabilities P(Y|X) for every combination of X and Y ○ Use these probabilities to classify a test record X′ by finding the class Y′ that maximizes the posterior probability P(Y′|X′)
  • 315. Bayesian Classifiers Using the Bayes Theorem for Classification Example:- ● test record X= (Home Owner = No, Marital Status = Married, Annual Income = $120K) ● Y=? ● Use training data & compute - posterior probabilities P(Yes|X) and P(No|X) ● Y= Yes, if P(Yes|X) > P(No|X) ● Y= No, Otherwise
  • 316. Bayesian Classifiers Computing P(X|Y) - Class-Conditional Probability Naïve Bayes Classifier ● assumes that the attributes are conditionally independent, given the class label y. ● The conditional independence assumption can be formally stated as follows: P(X|Y = y) = Π_{i=1}^{d} P(Xi|Y = y), where the attribute set X = {X1, X2,..., Xd}.
  • 317. Bayesian Classifiers How a Naïve Bayes Classifier Works ● Assumption - conditional independence ● Estimate the conditional probability of each Xi, given Y ○ (instead of computing the class-conditional probability for every combination of X) ○ No need of a very large training set to obtain a good estimate of the probability. ● To classify a test record, ○ Compute the posterior probability for each class Y: P(Y|X) = P(Y) Π_{i=1}^{d} P(Xi|Y) / P(X) ■ P(X) can be ignored ● Since it is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term
  • 318. Bayesian Classifiers Estimating Conditional Probabilities for Binary Attributes Xi - categorical attribute, xi - one of the values of attribute Xi Y - target attribute (class label), y - one class label Conditional probability P(Xi = xi|Y = y) = fraction of training instances in class y that take on attribute value xi. P(Home Owner=yes|DB=no) = (No. of records with HO=yes and DB=no)/(Total no. of records with DB=no) = 3/7 P(Home Owner=no|DB=no) = 4/7 P(Home Owner=yes|DB=yes) = 0 P(Home Owner=no|DB=yes) = 3/3
  • 319. Bayesian Classifiers Estimating Conditional Probabilities for Categorical Attributes P(MS=single|DB=no) = 2/7 P(MS=married|DB=no) = 4/7 P(MS=divorced|DB=no) = 1/7 P(MS=single|DB=yes) = 2/3 P(MS=married|DB=yes) = 0/3 P(MS=divorced|DB=yes) = 1/3
  • 320. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Discretization ● Probability Distribution
  • 321. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Discretization (transforming continuous attributes into ordinal attributes) ○ Replace the continuous attribute value with its corresponding discrete interval. ○ Estimation error depends on ■ the discretization strategy ■ the number of discrete intervals. ○ If the number of intervals is too large, there are too few training records in each interval ○ If the number of intervals is too small, then some intervals may aggregate records from different classes and we may miss the correct decision boundary.
  • 322. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Probability Distribution ○ A Gaussian distribution can be used to represent the class-conditional probability for continuous attributes. ○ The distribution is characterized by two parameters, ■ mean, µ ■ variance, σ² ○ P(Xi = xi|Y = yj) = (1/√(2πσij²)) exp(−(xi − µij)² / (2σij²)) µij - sample mean of Xi for all training records that belong to the class yj. σij² - sample variance (s²) of such training records.
  • 323. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Probability Distribution sample mean and variance for this attribute with respect to the class No
  • 324. Bayesian Classifiers Example of the Naïve Bayes Classifier ● Compute the class-conditional probability for each categorical attribute ● Compute the sample mean and variance for the continuous attribute ● Predict the class label of a test record X = (Home Owner=No, Marital Status = Married, Income = $120K) ● compute the posterior probabilities ○ P(No|X) ○ P(Yes|X)
  • 325. Bayesian Classifiers Example of the Naïve Bayes Classifier ● P(yes) = 3/10 = 0.3, P(no) = 7/10 = 0.7
  • 326. Bayesian Classifiers Example of the Naïve Bayes Classifier ● P(no|X) = ? ● P(yes|X) = ? ● The larger value gives the class label ● X = (Home Owner=No, Marital Status = Married, Income = $120K) ● P(no|Home Owner=No, Marital Status = Married, Income = $120K) = ? ● P(Y|X) ∝ P(Y) × P(X|Y) (P(X) is dropped because it is the same for every class) ● P(no|Home Owner=No, Marital Status = Married, Income = $120K) ∝ P(DB=no) × P(Home Owner=No, Marital Status = Married, Income = $120K|DB=no) ● P(X|DB=no) = P(HO=no|DB=no) × P(MS=married|DB=no) × P(Income=$120K|DB=no) = 4/7 × 4/7 × 0.0072 = 0.0024
  • 327. Bayesian Classifiers Example of the Naïve Bayes Classifier P(DB=no|X) ∝ P(DB=no) × P(X|DB=no) = 7/10 × 0.0024 = 0.0016 P(DB=yes|X) ∝ P(DB=yes) × P(X|DB=yes) = 3/10 × 0 = 0 Class label for the record is NO
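A compact sketch that reproduces the calculation above end-to-end. The categorical probabilities come from the training table used on these slides; the Annual Income parameters (mean 110 and variance 2975 for class No, mean 90 and variance 25 for class Yes) are the values implied by the 0.0072 figure above and should be treated as assumptions. P(X) is dropped since it is the same for both classes.

```python
import math

def gaussian(x, mean, var):
    """Gaussian class-conditional density for a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior = {"no": 7 / 10, "yes": 3 / 10}
p_home_owner_no = {"no": 4 / 7, "yes": 3 / 3}
p_married = {"no": 4 / 7, "yes": 0 / 3}
income_params = {"no": (110.0, 2975.0),   # assumed sample mean / variance, class No
                 "yes": (90.0, 25.0)}     # assumed values, class Yes (irrelevant here,
                                          # since P(Married|Yes) = 0 already gives 0)

# Score each class for X = (Home Owner = No, Married, Income = 120K), up to 1/P(X).
scores = {}
for c in ("no", "yes"):
    mean, var = income_params[c]
    scores[c] = prior[c] * p_home_owner_no[c] * p_married[c] * gaussian(120, mean, var)

print({c: round(s, 4) for c, s in scores.items()})   # {'no': ~0.0016, 'yes': 0.0}
print("predicted class:", max(scores, key=scores.get))
```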
  • 328. Bayesian Classifiers Find out Class Label ( Play Golf ) for today = (Sunny, Hot, Normal, False) https://www.geeksforgeeks.org/naive-bayes-classifiers/
  • 329. Association Analysis: Basic Concepts and Algorithms Basic Concepts and Algorithms DWDM Unit - IV
  • 330. Basic Concepts ● Retailers are interested in analyzing the data to learn about the purchasing behavior of their customers. ● Such information is used in marketing promotions, inventory management, and customer relationship management. ● Association analysis - useful for discovering interesting relationships hidden in large data sets. ● The uncovered relationships can be represented in the form of association rules or sets of frequent items.
  • 331. Basic Concepts ● Example Association Rule ○ {Diapers} → {Beer} ● The rule suggests that a strong relationship exists between the sale of diapers and beer ● many customers who buy diapers also buy beer. ● Association analysis is also applicable to ○ Bioinformatics, ○ Medical diagnosis, ○ Web mining, and ○ Scientific data analysis ● Example - analysis of Earth science data (ocean, land, & atmospheric processes)
  • 332. Basic Concepts Problem Definition: ● Binary representation of market basket data ● each row - transaction ● each column - item ● value is one if the item is present in a transaction and zero otherwise. ● An item is an asymmetric binary variable because the presence of an item in a transaction is often considered more important than its absence
  • 333. Basic Concepts Itemset and Support Count: I = {i1, i2,..., id} - set of all items T = {t1, t2,..., tN} - set of all transactions Each transaction ti contains a subset of items chosen from I Itemset - collection of zero or more items k-itemset - itemset that contains k items Example:- {Beer, Diapers, Milk} - 3-itemset null (or empty) set - no items
  • 334. Basic Concepts Itemset and Support Count: ● Transaction width - number of items present in a transaction. ● A transaction tj contains an itemset X if X is a subset of tj. ● Example: ○ t2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. ● support count, σ(X) - number of transactions that contain a particular itemset. ● σ(X) = |{ti | X ⊆ ti, ti ∈ T}|, ○ where the symbol | · | denotes the number of elements in a set. ● support count for {Beer, Diapers, Milk} = 2 ○ (2 transactions contain all three items)
  • 335. Basic Concepts Association Rule: ● An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets ○ i.e., X ∩ Y = ∅. ● The strength of an association rule can be measured in terms of its support and confidence.
  • 336. Basic Concepts ● Support ○ determines how often a rule is applicable to a given data set ○ s(X → Y) = σ(X ∪ Y) / N ● Confidence ○ determines how frequently items in Y appear in transactions that contain X ○ c(X → Y) = σ(X ∪ Y) / σ(X)
  • 337. Basic Concepts ● Example: ○ Consider the rule {Milk, Diapers} → {Beer} ○ support count for {Milk, Diapers, Beer} = 2 ○ total number of transactions = 5, ○ rule’s support is 2/5 = 0.4. ○ rule’s confidence = (support count for {Milk, Diapers, Beer})/(support count for {Milk, Diapers}) = 2/3 = 0.67.
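A minimal sketch of these calculations on the usual five-transaction market-basket example (treat the transaction list as illustrative):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    """Support count: number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)   # s(X -> Y) = sigma(X u Y) / N
confidence = sigma(X | Y) / sigma(X)         # c(X -> Y) = sigma(X u Y) / sigma(X)
print(support, round(confidence, 2))         # 0.4 0.67
```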
  • 338. Basic Concepts Formulation of Association Rule Mining Problem Association Rule Discovery Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
  • 339. Basic Concepts Formulation of Association Rule Mining Problem Association Rule Discovery ● Brute-force approach: compute the support and confidence for every possible rule (expensive) ● Total number of possible rules extracted from a data set that contains d items is R = 3^d − 2^(d+1) + 1 ● For a data set of 6 items, the number of possible rules is 3^6 − 2^7 + 1 = 602 rules. ● More than 80% of the rules are discarded after applying minsup = 20% & minconf = 50% ● so most of the computation becomes wasted. ● Better to prune the rules early without having to compute their support and confidence values.
  • 340. Basic Concepts Formulation of Association Rule Mining Problem Association Rule Discovery ● Common strategy - decompose the problem into two major subtasks: (separate support & confidence) 1. Frequent Itemset Generation: ■ Objective: Find all the itemsets that satisfy the minsup threshold. 2. Rule Generation: ■ Objective: Extract all the high-confidence rules from the frequent itemsets found in the previous step. ■ These rules are called strong rules.
  • 341. Frequent Itemset Generation ● Lattice structure - list of all possible itemsets ● itemset lattice for ○ I = {a, b, c, d, e} ● A data set with k items can generate up to 2^k − 1 frequent itemsets (excluding the null set) ○ Example:- 2^5 − 1 = 31 ● So, the search space of itemsets in practical applications is exponentially large
  • 342. Frequent Itemset Generation ● A brute-force approach for finding frequent itemsets ○ determine the support count for every candidate itemset in the lattice structure. ● compare each candidate against every transaction ● Very expensive ○ requires O(NMw) comparisons, ○ N - no. of transactions, ○ M = 2^k − 1 - the number of candidate itemsets ○ w - maximum transaction width.
  • 343. Frequent Itemset Generation There are several ways to reduce the computational complexity of frequent itemset generation: 1. Reduce the number of candidate itemsets (M) - the Apriori principle 2. Reduce the number of comparisons - by using more advanced data structures
  • 344. The Apriori Principle Frequent Itemset Generation Principle If an itemset is frequent, then all of its subsets must also be frequent.
  • 345. Frequent Itemset Generation Support-based pruning: ● The strategy of trimming the exponential search space based on the support measure is known as support-based pruning. ● It uses the anti-monotone property of the support measure. ● Anti-monotone property of the support measure ○ support for an itemset never exceeds the support for its subsets. ● Example: ○ {a, b} is infrequent, ○ then all of its supersets must be infrequent too. ○ the entire subgraph containing the supersets of {a, b} can be pruned immediately
  • 346. Frequent Itemset Generation Let, I - set of items J = 2^I - power set of I A measure f is monotone/anti-monotone if Monotonicity property (or upward closed): ∀X, Y ∈ J: (X ⊆ Y) → f(X) ≤ f(Y) Anti-monotone (or downward closed): ∀X, Y ∈ J: (X ⊆ Y) → f(Y) ≤ f(X), which means that if X is a subset of Y, then f(Y) must not exceed f(X).
  • 347. Frequent Itemset Generation in the Apriori Algorithm
  • 350. Frequent Itemset Generation in the Apriori Algorithm Ck - set of candidate k-itemsets Fk - set of frequent k-itemsets
  • 351. Frequent Itemset Generation in the Apriori Algorithm https://www.softwaretestinghelp.com/apriori- algorithm/#:~:text=Apriori%20algorithm%20is%20a%20sequence,is%20assumed%20by%20the%20user.
  • 352. Example Frequent Itemset Generation in the Apriori Algorithm
  • 357. Frequent Itemset Generation in the Apriori Algorithm Ck - set of candidate k-itemsets Fk - set of frequent k-itemsets
  • 358. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning The apriori-gen function shown in Step 5 of Algorithm 6.1 generates candidate itemsets by performing the following two operations: 1. Candidate Generation (join) a. Generates new candidate k-itemsets b. based on the frequent (k − 1)-itemsets found in the previous iteration. 2. Candidate Pruning a. Eliminates some of the candidate k-itemsets using the support-based pruning strategy.
  • 359. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning Requirements for an effective candidate generation procedure: 1. It should avoid generating too many unnecessary candidates 2. It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out 3. It should not generate the same candidate itemset more than once (no duplicates).
  • 360. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning Candidate Generation Procedures 1. Brute-Force Method 2. Fk−1 × F1 Method 3. Fk−1 × Fk−1 Method
  • 361. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning Candidate Generation Procedures 1. Brute-Force Method a. considers every k-itemset as a potential candidate b. candidate pruning (to remove unnecessary candidates) becomes extremely expensive c. No. of candidate itemsets generated at level k = C(d, k), where d is the no. of items
  • 362. 2. Fk−1 × F1 Method Extends each frequent (k−1)-itemset with a frequent 1-itemset, producing O(|Fk−1| × |F1|) candidate k-itemsets, where |Fj| = no. of frequent j-itemsets. The overall complexity of this step is O(Σk k |Fk−1| |F1|).
  • 363. ● The procedure is complete. ● But the same candidate itemset may be generated more than once (duplicates). ● Example: ○ {Bread, Diapers, Milk} can be generated ○ by merging {Bread, Diapers} with {Milk}, ○ {Bread, Milk} with {Diapers}, or ○ {Diapers, Milk} with {Bread}. ● One solution ○ Generate candidate itemsets by joining items in lexicographical order only ● {Bread, Diapers} joins with {Milk} Don’t join ● {Diapers, Milk} with {Bread} ● {Bread, Milk} with {Diapers} because of violation of lexicographic ordering Problem: a large no. of unnecessary candidates
  • 364. 3. Fk−1 × Fk−1 Method (used in the apriori-gen function) ● merges a pair of frequent (k−1)-itemsets only if their first k−2 items are identical. ● Let A = {a1, a2,..., ak−1} and B = {b1, b2,..., bk−1} be a pair of frequent (k−1)-itemsets. ● A and B are merged if they satisfy the following conditions: ○ ai = bi (for i = 1, 2,..., k−2) and ○ ak−1 != bk−1.
  • 365. Merge {Bread, Diapers} & {Bread, Milk} to form a candidate 3- itemset {Bread, Diapers, Milk} Don’t merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different.
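A short sketch of the Fk−1 × Fk−1 merge followed by support-based candidate pruning; itemsets are kept as sorted tuples, and the F2 list below is illustrative:

```python
from itertools import combinations

def apriori_gen(freq_km1):
    """F(k-1) x F(k-1) candidate generation: merge two frequent (k-1)-itemsets whose
    first k-2 items are identical, then prune any candidate that has an infrequent
    (k-1)-subset."""
    freq_set = set(freq_km1)
    candidates = []
    for a, b in combinations(sorted(freq_km1), 2):
        if a[:-1] == b[:-1]:                           # first k-2 items identical
            cand = a + (b[-1],)                        # merged k-itemset, still sorted
            # candidate pruning: every (k-1)-subset must itself be frequent
            if all(sub in freq_set for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

f2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
print(apriori_gen(f2))    # [('Bread', 'Diapers', 'Milk')]
```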
  • 366. Support Counting ● Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. ● One approach for doing this is to compare each transaction against every candidate itemset (see Figure 6.2) and to update the support counts of candidates contained in the transaction. ● This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.
  • 367. Support Counting ● An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. ● To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. ● Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. ● For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5.
  • 368. Support Counting ● The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix structures. For instance, the prefix structure for item 1 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}. ● After fixing the first item, the prefix structures at Level 2 represent the number of ways to select the second item. For example, the prefix (1 2) corresponds to itemsets that begin with (1 2) and are followed by item 3, 5, or 6. ● Finally, the prefix structures at Level 3 represent the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.
  • 369. Support Counting (steps 6 through 11 of Algorithm 6.1) ● Enumerate the itemsets contained in each transaction ● Figure 6.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. ● If an enumerated itemset of the transaction matches one of the candidates, then the support count of the corresponding candidate is incremented (line 9 in the algorithm). For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t
  • 370. Support Counting Using a Hash Tree ● Candidate itemsets are partitioned into different buckets and stored in a hash tree. ● Itemsets contained in each transaction are also hashed into their appropriate buckets. ● Instead of comparing each itemset in the transaction with every candidate itemset, ● it is matched only against candidate itemsets that belong to the same bucket
  • 371. Hash Tree from a Candidate Itemset https://www.youtube.com/watch?v=btW-uU1dhWI
  • 376. Rule generation & Compact representation of frequent itemsets DWDM Unit - IV Association Analysis
  • 377. Rule Generation ● Each frequent k-itemset can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents. ● An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.
  • 378. Confidence-Based Pruning Theorem: If a rule X → Y −X does not satisfy the confidence threshold, then any rule X` → Y − X`, where X` is a subset of X, must not satisfy the confidence threshold as well.
  • 379. Rule Generation in Apriori Algorithm ● The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. ● Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. ● These rules are then used to generate new candidate rules. For example, if {acd} →{b} and {abd} →{c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
  • 380. Rule Generation in Apriori Algorithm ● Figure 6.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. ● If any node in the lattice has low confidence, then according to Theorem, the entire sub-graph spanned by the node can be pruned immediately. ● Suppose the confidence for {bcd} → {a} is low. All the rules containing item a in its consequent, can be discarded.
  • 381. In rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.
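A sketch of extracting the rules of one frequent itemset using only stored support counts (no extra pass over the data). For brevity it enumerates every antecedent instead of applying the level-wise confidence-based pruning described above, and the support counts are illustrative:

```python
from itertools import combinations

# Support counts recorded during frequent itemset generation (illustrative values).
support = {
    frozenset({"Bread"}): 4, frozenset({"Milk"}): 4, frozenset({"Diapers"}): 4,
    frozenset({"Bread", "Milk"}): 3, frozenset({"Bread", "Diapers"}): 3,
    frozenset({"Diapers", "Milk"}): 3, frozenset({"Bread", "Diapers", "Milk"}): 2,
}

def rules_from_itemset(itemset, minconf):
    """Enumerate the 2^k - 2 rules X -> Y - X of a frequent k-itemset Y and keep
    those whose confidence sigma(Y) / sigma(X) meets the threshold."""
    y = frozenset(itemset)
    kept = []
    for r in range(1, len(y)):                        # non-empty, proper antecedents
        for antecedent in combinations(y, r):
            x = frozenset(antecedent)
            conf = support[y] / support[x]
            if conf >= minconf:
                kept.append((set(x), set(y - x), round(conf, 2)))
    return kept

for rule in rules_from_itemset({"Bread", "Diapers", "Milk"}, minconf=0.6):
    print(rule)    # the three pair -> singleton rules, each with confidence 0.67
```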
  • 383. Compact Representation of Frequent Itemsets Maximal Frequent Itemsets Definition A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.
  • 384. Compact Representation of Frequent Itemsets ● The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. ● A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. ● Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. ● Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets because their immediate supersets are infrequent. ● Maximal frequent itemsets do not contain the support information of their subsets.
  • 385. Compact Representation of Frequent Itemsets ● Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. ● They form the smallest set of itemsets from which all frequent itemsets can be derived. ● For example, the frequent itemsets shown in Figure 6.16 can be divided into two groups: ○ Frequent itemsets that begin with item a and that may contain items c, d, or e. This group includes itemsets such as {a}, {a, c}, {a, d}, {a, e} and {a, c, e}. ○ Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b, c}, {c, d},{b, c, d, e}, etc. ● Frequent itemsets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}.
  • 386. Compact Representation of Frequent Itemsets ● Closed Frequent Itemsets ○ Closed itemsets provide a minimal representation of itemsets without losing their support information. ○ An itemset X is closed if none of its immediate supersets has exactly the same support count as X. Or ○ X is not closed if at least one of its immediate supersets has the same support count as X.
  • 387. Compact Representation of Frequent Itemsets Closed Frequent Itemsets ● An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
  • 388. Compact Representation of Frequent Itemsets Closed Frequent Itemsets ● Determine the support counts of the non-closed frequent itemsets by using the closed frequent itemsets ● consider the frequent itemset {a, d} - it is not closed, so its support count must be identical to one of its immediate supersets {a, b, d}, {a, c, d}, or {a, d, e}. ● The Apriori principle states ○ any transaction that contains a superset of {a, d} must also contain {a, d}. ○ any transaction that contains {a, d} does not have to contain the supersets of {a, d}. ● So, the support for {a, d} = largest support among its supersets = support of {a, c, d} ● The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets.
  • 390. ● The items can be divided into three groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items b1 through b5; and (3) Group C, which contains items c1 through c5. ● The items within each group are perfectly associated with each other and they do not appear with items from another group. Assuming the support threshold is 20%, the total number of frequent itemsets is 3 × (2^5 − 1) = 93. ● There are only three closed frequent itemsets in the data: ({a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5})
  • 391. ● Redundant association rules can be removed by using Closed frequent itemsets ● An association rule X → Y is redundant if there exists another rule X`→ Y`, where X is a subset of X` and Y is a subset of Y ` such that the support and confidence for both rules are identical.
  • 392. ● From Table 6.5, {b} is not a closed frequent itemset while {b, c} is closed. ● The association rule {b} → {d, e} is therefore redundant because it has the same support and confidence as {b, c} → {d, e}. ● Such redundant rules are not generated if closed frequent itemsets are used for rule generation. ● All maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
  • 393. FP Growth Algorithm Association Analysis (Unit - IV) DWDM
  • 394. FP Growth Algorithm ● The FP-growth algorithm takes a radically different approach for discovering frequent itemsets. ● The algorithm encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. FP-Tree Representation ● An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. ● As different transactions can have several items in common, their paths may overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. ● If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk.
  • 396. FP Tree Representation ● Figure 6.24 shows a data set that contains ten transactions and five items. ● The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. ● Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. ● Initially, the FP-tree contains only the root node represented by the null symbol.
  • 397. FP Tree Representation 1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in Figure, a is the most frequent item, followed by b, c, d, and e.
  • 398. FP Tree Representation 2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.
  • 399. FP Tree Representation 3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. 4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one. 5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown in Figure 6.24.
  • 400. FP Tree Representation ● The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. ● In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. ● The worst-case scenario happens when every transaction has a unique set of items.
  • 401. FP Tree Representation ● The size of an FP-tree also depends on how the items are ordered. ● If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP- tree is shown in Figure 6.25. ● An FP-tree also contains a list of pointers connecting between nodes that have the same items. ● These pointers, represented as dashed lines in Figures 6.24 and 6.25, help to facilitate the rapid access of individual items in the tree.
  • 402. Frequent Itemset Generation using FP-Growth Algorithm Steps in FP-Growth Algorithm: Step-1: Scan the database to build the frequent 1-item set, which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. Step-2: For each transaction, the respective Ordered-Item set is built. Step-3: Construct the FP-tree by scanning each Ordered-Item set. Step-4: For each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. Step-5: For each item, the Conditional Frequent Pattern Tree is built. Step-6: Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with each corresponding item.
  • 403. Frequent Itemset Generation in FP-Growth Algorithm Example: The frequency of each individual item is computed:- Given Database: min_support=3
  • 404. Frequent Itemset Generation in FP-Growth Algorithm ● A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. ● L = {K : 5, E : 4, M : 3, O : 3, Y : 3} ● Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent Pattern set and checking if the current item is contained in the transaction. The following table is built for all the transactions:
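A small sketch of Steps 1 and 2 only (item counting and building the ordered-item sets). The transaction database below is a hypothetical one chosen so that it reproduces the frequencies on the slide (K:5, E:4, M:3, O:3, Y:3); the FP-tree construction itself is left out:

```python
from collections import Counter

transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

# Step 1: count item frequencies and keep items with count >= min_support,
# ordered by decreasing frequency.
counts = Counter(item for t in transactions for item in t)
frequent = [item for item, c in counts.most_common() if c >= min_support]
print(frequent)            # ['K', 'E', ...]; ties among M, O, Y may appear in any order

# Step 2: the ordered-item set of each transaction keeps only the frequent items,
# listed in the global frequency order; these are what get inserted into the FP-tree.
ordered_item_sets = [[item for item in frequent if item in t] for t in transactions]
for row in ordered_item_sets:
    print(row)
```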
  • 405. Frequent Itemset Generation in FP-Growth Algorithm Now, all the Ordered-Item sets are inserted into a Trie Data Structure. a) Inserting the set {K, E, M, O, Y}: All the items are simply linked one after the other in the order of occurrence in the set and initialize the support count for each item as 1.
  • 406. Frequent Itemset Generation in FP-Growth Algorithm b) Inserting the set {K, E, O, Y}: Till the insertion of the elements K and E, simply the support count is increased by 1. There is no direct link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new node of Y.
  • 407. Frequent Itemset Generation in FP-Growth Algorithm c) Inserting the set {K, E, M}: ● Here simply the support count of each element is increased by 1.
  • 408. Frequent Itemset Generation in FP-Growth Algorithm d) Inserting the set {K, M, Y}: ● Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
  • 409. Frequent Itemset Generation in FP-Growth Algorithm e) Inserting the set {K, E, O}: ● Here simply the support counts of the respective elements are increased.
  • 410. Frequent Itemset Generation in FP-Growth Algorithm Now, for each item starting from leaf, the Conditional Pattern Base is computed which is path labels of all the paths which lead to any node of the given item in the frequent-pattern tree.
  • 411. Frequent Itemset Generation in FP-Growth Algorithm Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common in all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base. The itemsets whose support count >= min_support value are retained in the Conditional Frequent Pattern Tree and the rest are discarded.
  • 412. Frequent Itemset Generation in FP-Growth Algorithm From the Conditional Frequent Pattern tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below. For each row, two types of association rules can be inferred; for example, for the first row, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated and the one with confidence greater than or equal to the minimum confidence value is retained.
  • 413. Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Introduction to Data Mining, 2nd Edition Tan, Steinbach, Karpatne, Kumar
  • 414. What is Cluster Analysis? ● Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups ● Inter-cluster distances are maximized; intra-cluster distances are minimized
  • 415. Applications of Cluster Analysis ● Understanding – Group related documents for browsing (Information Retrieval), – group genes and proteins that have similar functionality (Biology), – group stocks with similar price fluctuations (Business) – Climate – Psychology & Medicine Clustering precipitation in Australia
  • 416. Applications of Cluster Analysis ● Clustering for Utility – Summarization – Compression – Efficiently finding Nearest Neighbors Clustering precipitation in Australia
  • 417. Notion of a Cluster can be Ambiguous How many clusters? Two Clusters, Four Clusters, or Six Clusters
  • 418. Types of Clusterings ● A clustering is a set of clusters ● Important distinction between hierarchical and partitional sets of clusters – Partitional Clustering (unnested) ◆ A division of data objects into non-overlapping subsets (clusters) – Hierarchical clustering (nested) ◆ A set of nested clusters organized as a hierarchical tree
  • 419. Partitional Clustering Original Points A Partitional Clustering
  • 420. Hierarchical Clustering Traditional Hierarchical Clustering and Traditional Dendrogram; Non-traditional Hierarchical Clustering and Non-traditional Dendrogram
  • 421. Other Distinctions Between Sets of Clusters ● Exclusive versus non-exclusive – In non-exclusive clusterings, points may belong to multiple clusters. ◆ Can belong to multiple classes or could be ‘border’ points – Fuzzy clustering (one type of non-exclusive) ◆ In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 ◆ Weights must sum to 1 ◆ Probabilistic clustering has similar characteristics ● Partial versus complete – In some cases, we only want to cluster some of the data
  • 422. Types of Clusters ● Well-separated clusters ● Prototype-based clusters ● Contiguity-based clusters ● Density-based clusters ● Described by an Objective Function
  • 423. Types of Clusters: Well-Separated ● Well-Separated Clusters: – A cluster with a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
  • 424. Types of Clusters: Prototype-Based ● Prototype-based (or center-based) – A cluster with a set of points such that a point in a cluster is closer (more similar) to the prototype or “center” of the cluster than to the center of any other cluster – If data is continuous – the center will be the centroid/mean – If data is categorical - the center will be the medoid (most representative point) 4 center-based clusters
  • 425. Types of Clusters: Contiguity-Based (Graph) ● Contiguous Cluster (Nearest neighbor or Transitive) – A cluster with a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. – Graph view (data - nodes, links - connections): a cluster is a group of connected objects with no connections to objects outside the group. ● Useful when clusters are irregular or intertwined ● Trouble when noise is present – a small bridge of points can merge two distinct clusters. 8 contiguous clusters
  • 426. Types of Clusters: Density-Based ● Density-based – A cluster is a dense region of points, which is separated by low-density regions from other regions of high density. – Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters The two circular clusters are not merged, as in the figure, because the bridge between them (previous slide figure) fades into the noise. The curve that is present in the previous slide figure also fades into the noise and does not form a cluster. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present.
  • 427. Types of Clusters: Shared Property (Conceptual Clusters) ● Shared property – a cluster is a set of objects that share some property. A clustering algorithm would need a very specific (sophisticated) concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering.
  • 428. Clustering Algorithms ● K-means and its variants ● Hierarchical clustering ● Density-based clustering
  • 429. K-means ● Prototype-based, partitional clustering technique ● Attempts to find a user-specified number of clusters (K)
  • 430. Agglomerative Hierarchical Clustering ● Hierarchical clustering ● Starts with each point as a singleton cluster ● Repeatedly merges the two closest clusters until a single, all-encompassing cluster remains. ● Sometimes viewed as graph-based clustering ● Other times as a prototype-based approach.
  • 431. DBSCAN ● Density-based clustering algorithm ● Produces a partitional clustering, ● No. of clusters is automatically determined by the algorithm. ● Noise - points in low-density regions (omitted) ● Not a complete clustering.
  • 432. K-means Clustering ● Partitional clustering approach ● Number of clusters, K, must be specified ● Each cluster is associated with a centroid (center point) ● Each point is assigned to the cluster with the closest centroid ● The basic algorithm is very simple
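A minimal pure-Python version of the basic algorithm just described (Euclidean distance, random initial centroids); the toy points are illustrative. On real data one would normally rely on a library implementation and rerun it with several initializations, as discussed later in this section.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: choose initial centroids, then repeat
    {assign each point to its nearest centroid; recompute each centroid as the mean}
    until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                    # random initial centroids
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: new centroid = mean of the points in the cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                   # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```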
  • 433. Example of K-means Clustering
  • 434. Example of K-means Clustering
  • 435. K-means Clustering – Details ● Simple iterative algorithm. – Choose initial centroids; – repeat {assign each point to the nearest centroid; re-compute cluster centroids} – until centroids stop changing. ● Initial centroids are often chosen randomly. – Clusters produced can vary from one run to another ● The centroid is (typically) the mean of the points in the cluster, but other definitions are possible ● Most of the convergence happens in the first few iterations. – Often the stopping condition is changed to ‘Until relatively few points change clusters’ ● Complexity is O( n * K * I * d ) – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
• 436. K-means Clustering – Details ● The choice of centroid can vary, depending on the proximity measure for the data and the goal of the clustering. ● The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids. ● E.g., minimize the squared distance of each point to its closest centroid.
  • 437. K-means Clustering – Details 25
• 438. Centroids and Objective Functions – Data in Euclidean Space ● A common objective function (used with the Euclidean distance measure) is the Sum of Squared Error (SSE) – For each point, the error is the distance to the nearest cluster centroid – To get the SSE, we square these errors and sum them: SSE = Σᵢ₌₁ᴷ Σ_{x ∈ Cᵢ} dist(mᵢ, x)² – x is a data point in cluster Cᵢ and mᵢ is the centroid (mean) of cluster Cᵢ – Given several K-means runs, the one that produces the minimum SSE is preferred – The centroid (mean) of the i-th cluster is mᵢ = (1/|Cᵢ|) Σ_{x ∈ Cᵢ} x (a short computation sketch follows below)
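As a rough companion to the formulas above, the snippet below computes the SSE and the cluster means for a given assignment; in scikit-learn's KMeans the same SSE quantity is exposed as the inertia_ attribute. Function and variable names here are illustrative.

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum over clusters of the squared Euclidean distance of each point
    # to the centroid of the cluster it is assigned to.
    return sum(
        np.sum((X[labels == i] - m) ** 2)
        for i, m in enumerate(centroids)
    )

def cluster_means(X, labels, k):
    # m_i = (1/|C_i|) * sum of the points in C_i
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```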
• 439. K-means Objective Function – Document Data ● Proximity measure: cosine similarity ● Document data is represented as a document–term matrix ● Objective: maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is called the cohesion of the cluster (see the sketch below)
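A small sketch of this objective for document data, assuming tf-idf vectors and scikit-learn's KMeans; the toy documents and variable names are ours, not from the slides.

```python
# Cohesion of document clusters: total cosine similarity of each document
# to the centroid of the cluster it belongs to.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["data mining and clustering", "warehouse and OLAP cubes",
        "document clustering with cosine similarity", "star schema and fact tables"]
X = TfidfVectorizer().fit_transform(docs)           # document-term matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_
X_dense = X.toarray()
cohesion = 0.0
for x, label in zip(X_dense, km.labels_):
    c = centroids[label]
    cohesion += x @ c / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
print("total cohesion:", cohesion)
```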
• 440. Two Different K-means Clusterings ● Figure (a) shows a clustering solution that is the global minimum of the SSE for three clusters. ● Figure (b) shows a suboptimal clustering that is only a local minimum. (Figures: original points; Fig. (a) optimal clustering; Fig. (b) sub-optimal clustering)
• 441. Importance of Choosing Initial Centroids … ● The two figures below show the clusters that result from two particular choices of initial centroids (in both figures, the positions of the cluster centroids in the various iterations are indicated by crosses). ● In Figure 1, even though all the initial centroids are from one natural cluster, the minimum-SSE clustering is still found. ● In Figure 2, even though the initial centroids appear to be better distributed, we obtain a suboptimal clustering with a higher squared error; this is an example of a poor choice of starting centroids.
  • 442. Importance of Choosing Initial Centroids … 30
  • 443. Problems with Selecting Initial Points 31 ● Figure 5.7 shows that if a pair of clusters has only one initial centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be split.
  • 444. 10 Clusters Example 32 Starting with two initial centroids in one cluster of each pair of clusters
  • 445. 10 Clusters Example 33 Starting with two initial centroids in one cluster of each pair of clusters
• 446. 10 Clusters Example – Starting with some pairs of clusters having three initial centroids, while others have only one.
• 447. 10 Clusters Example – Starting with some pairs of clusters having three initial centroids, while others have only one.
• 448. Solutions to Initial Centroids Problem ● Multiple runs ● K-means++ ● Use hierarchical clustering to determine initial centroids ● Bisecting K-means
• 449. Multiple Runs ● One technique commonly used to address the problem of choosing initial centroids is to perform multiple runs, each with a different set of randomly chosen initial centroids, and then select the set of clusters with the minimum SSE (see the sketch below). ● In Figure 5.6(a), the data consists of two pairs of clusters, where the clusters in each (top–bottom) pair are closer to each other than to the clusters in the other pair. ● Figure 5.6(b–d) shows that if we start with two initial centroids per pair of clusters, then even when both centroids end up in a single cluster, the centroids will redistribute themselves so that the "true" clusters are found.
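A hedged sketch of the multiple-runs strategy. scikit-learn's KMeans does this internally through its n_init parameter; the explicit loop below just makes the idea visible. The function name and the assumption that X is an (n_points, n_features) array are ours.

```python
# Run K-means several times with different random initial centroids and
# keep the run with the lowest SSE (exposed as inertia_ in scikit-learn).
from sklearn.cluster import KMeans

def best_of_runs(X, k, runs=10):
    best = None
    for seed in range(runs):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# Equivalent, relying on scikit-learn's own restarts:
#   KMeans(n_clusters=k, n_init=10).fit(X)
```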
• 452. Bisecting K-means ● Bisecting K-means algorithm – A variant of K-means that can produce either a partitional or a hierarchical clustering (a sketch follows below) – CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
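The slides do not give pseudocode for bisecting K-means, so the sketch below follows one common formulation: repeatedly split the cluster with the largest SSE using 2-means until K clusters remain. Recent scikit-learn releases also ship a BisectingKMeans estimator; the function name and splitting criterion here are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    # Start with a single cluster containing all points (stored as index arrays).
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE to split next.
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        # Split it with 2-means.
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[worst])
        clusters.append(worst[km.labels_ == 0])
        clusters.append(worst[km.labels_ == 1])
    # Convert the list of index arrays into a flat label vector.
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```

Recording the sequence of splits, rather than only the final labels, is what yields the hierarchical version mentioned on the slide.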
• 454. Limitations of K-means ● K-means has problems when clusters have differing – sizes – densities – non-globular shapes ● K-means also has problems when the data contains outliers – One possible solution is to remove outliers before clustering (a heuristic sketch follows below)
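The slide only mentions outlier removal as one possible fix; the snippet below is one simple heuristic for it (our assumption, not from the text): cluster once, drop the points farthest from their centroids, then re-cluster the rest.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_without_outliers(X, k, drop_fraction=0.05):
    # First pass: cluster everything and measure each point's distance
    # to its assigned centroid.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    # Treat the most distant points as outliers and remove them.
    keep = d <= np.quantile(d, 1.0 - drop_fraction)
    # Second pass: cluster only the remaining points.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[keep]), keep
```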
  • 455. Limitations of K-means: Differing Sizes 43 Original Points K-means (3 Clusters)
  • 456. Limitations of K-means: Differing Density 44 Original Points K-means (3 Clusters)
  • 457. Limitations of K-means: Non-globular Shapes 45 Original Points K-means (2 Clusters)
  • 458. Overcoming K-means Limitations 46 Original Points K-means Clusters One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be put together in a post-processing step.
  • 459. Overcoming K-means Limitations 47 Original Points K-means Clusters One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be put together in a post-processing step.
  • 460. Overcoming K-means Limitations 48 Original Points K-means Clusters One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be put together in a post-processing step.
• 461. Hierarchical Clustering ● Produces a set of nested clusters organized as a hierarchical tree ● Can be visualized as a dendrogram – A tree-like diagram that records the sequences of merges or splits
• 462. Strengths of Hierarchical Clustering ● We do not have to assume any particular number of clusters – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level (see the sketch below)
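A short illustration of 'cutting' the dendrogram with SciPy: the hierarchy is built once and different numbers of clusters are read off it. The toy data and variable names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))    # placeholder data
Z = linkage(X, method="average")                     # build the hierarchy once
labels_3 = fcluster(Z, t=3, criterion="maxclust")    # 'cut' to obtain 3 clusters
labels_5 = fcluster(Z, t=5, criterion="maxclust")    # or 5, from the same tree
print(labels_3, labels_5)
```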
• 463. Hierarchical Clustering ● Two main types of hierarchical clustering – Agglomerative: ◆ Start with the points as individual clusters ◆ At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains – Divisive: ◆ Start with one, all-inclusive cluster ◆ At each step, split a cluster until each cluster contains an individual point (or there are k clusters) ● Traditional hierarchical algorithms use a similarity or distance matrix – Merge or split one cluster at a time
• 464. Agglomerative Clustering Algorithm ● Key idea: successively merge the closest clusters ● Basic algorithm: 1. Compute the proximity matrix 2. Let each data point be a cluster 3. Repeat 4. Merge the two closest clusters 5. Update the proximity matrix 6. Until only a single cluster remains ● The key operation is the computation of the proximity of two clusters – Different approaches to defining the distance between clusters distinguish the different algorithms (a sketch follows below)
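A compact, naive sketch of steps 1–6 above, assuming Euclidean distance and the MIN (single-link) definition of cluster proximity; the function name and the choice of single link are our assumptions, since the generic algorithm leaves the proximity definition open.

```python
import numpy as np

def agglomerative_single_link(X, k=1):
    # Steps 1-2: singleton clusters and the point-to-point proximity matrix.
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def cluster_distance(a, b):
        # MIN / single link: closest pair of points across the two clusters.
        return min(D[i, j] for i in a for j in b)

    # Steps 3-6: repeatedly merge the two closest clusters until k remain.
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This naive version recomputes proximities from scratch at every merge, which is why the later slide quotes O(N³) time for straightforward implementations.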
• 465. Steps 1 and 2 ● Start with clusters of individual points (p1, p2, p3, p4, p5, …) and a proximity matrix (Figure: points and their proximity matrix)
• 466. Intermediate Situation ● After some merging steps, we have several clusters, C1–C5 (Figure: current clusters and their proximity matrix)
• 467. Step 4 ● We want to merge the two closest clusters (C2 and C5) and update the proximity matrix (Figure: clusters and proximity matrix before the merge)
• 468. Step 5 ● The question is: "How do we update the proximity matrix?" (Figure: proximity matrix after merging C2 and C5, with the new entries for C2 ∪ C5 marked '?')
• 469. How to Define Inter-Cluster Distance ● MIN ● MAX ● Group Average ● Distance Between Centroids ● Other methods driven by an objective function – Ward's Method uses squared error (Figure: points and their proximity matrix)
• 470.–473. How to Define Inter-Cluster Similarity (Figures: the same bullet list as the previous slide, with each slide graphically illustrating one of the definitions — MIN, MAX, Group Average, and Distance Between Centroids)
• 474. MIN or Single Link ● Proximity of two clusters is based on the two closest points in the different clusters – Determined by one pair of points, i.e., by one link in the proximity graph ● Example (Figure: distance matrix)
• 476. Strength of MIN (Figures: original points and the six clusters found) • Can handle non-elliptical shapes
• 477. Limitations of MIN (Figures: original points, a two-cluster result, and a three-cluster result) • Sensitive to noise
• 478. MAX or Complete Linkage ● Proximity of two clusters is based on the two most distant points in the different clusters – Determined by all pairs of points in the two clusters ● Example (Figure: distance matrix)
• 479. Hierarchical Clustering: MAX (Figures: nested clusters and the corresponding dendrogram)
• 480. Strength of MAX (Figures: original points and two clusters) • Less susceptible to noise
• 481. Limitations of MAX (Figures: original points and two clusters) • Tends to break large clusters • Biased towards globular clusters
• 482. Group Average ● Proximity of two clusters is the average of the pairwise proximities between points in the two clusters: proximity(Cᵢ, Cⱼ) = ( Σ_{x ∈ Cᵢ, y ∈ Cⱼ} proximity(x, y) ) / ( |Cᵢ| · |Cⱼ| ) ● Example (Figure: distance matrix)
• 483. Hierarchical Clustering: Group Average (Figures: nested clusters and the corresponding dendrogram)
• 484. Hierarchical Clustering: Group Average ● A compromise between single and complete link ● Strengths – Less susceptible to noise ● Limitations – Biased towards globular clusters
• 485. Cluster Similarity: Ward's Method ● Similarity of two clusters is based on the increase in squared error when the two clusters are merged – Similar to group average if the distance between points is the squared distance ● Less susceptible to noise ● Biased towards globular clusters ● Hierarchical analogue of K-means – Can be used to initialize K-means
• 486. Hierarchical Clustering: Comparison (Figures: nested clusterings of the same six points produced by MIN, MAX, Group Average, and Ward's Method — a code sketch of the comparison follows below)
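The four schemes compared above map directly onto SciPy's linkage methods ("single", "complete", "average", "ward"). The sketch below runs all four on the same toy data; the data and variable names are ours, standing in for the slide's six points.

```python
# Comparing MIN, MAX, Group Average, and Ward's Method with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))             # toy 2-D data

methods = {"MIN": "single", "MAX": "complete",
           "Group Average": "average", "Ward": "ward"}
for name, method in methods.items():
    Z = linkage(X, method=method)                     # build the hierarchy
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 clusters
    print(name, labels)
```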
• 487. Hierarchical Clustering: Time and Space Requirements ● O(N²) space, since it uses the proximity matrix – N is the number of points ● O(N³) time in many cases – There are N steps, and at each step a proximity matrix of size N² must be updated and searched – Complexity can be reduced to O(N² log N) time with some cleverness
• 488. Hierarchical Clustering: Problems and Limitations ● Once a decision is made to combine two clusters, it cannot be undone ● No global objective function is directly minimized ● Different schemes have problems with one or more of the following: – Sensitivity to noise – Difficulty handling clusters of different sizes and non-globular shapes – Breaking large clusters
• 489. Density-Based Clustering ● Clusters are regions of high density that are separated from one another by regions of low density
• 490. DBSCAN ● DBSCAN is a density-based algorithm – Density = number of points within a specified radius (Eps) – A point is a core point if it has at least a specified number of points (MinPts) within Eps ◆ These are points that are in the interior of a cluster ◆ The count includes the point itself – A border point is not a core point, but is in the neighborhood of a core point – A noise point is any point that is neither a core point nor a border point (see the classification sketch below)
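A sketch of the core/border/noise classification just described, using a radius query from scikit-learn. The function name and the way Eps and MinPts are passed in are our assumptions; the neighbor count includes the point itself, matching the slide.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def classify_points(X, eps, min_pts):
    # Indices of neighbors within radius eps; each point appears in its own
    # neighborhood (distance 0), so the count includes the point itself.
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)

    counts = np.array([len(nbrs) for nbrs in neighborhoods])
    core = counts >= min_pts
    # Border: not a core point, but has at least one core point within eps.
    border = ~core & np.array([core[nbrs].any() for nbrs in neighborhoods])
    noise = ~core & ~border
    return core, border, noise
```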
  • 491. DBSCAN: Core, Border, and Noise Points MinPts = 7 79
  • 492. DBSCAN: Core, Border and Noise Points 80 Original Points Point types: core, border and noise Eps = 10, MinPts = 4
• 493. DBSCAN Algorithm ● Form clusters using core points, and assign each border point to one of its neighboring clusters: 1. Label all points as core, border, or noise points. 2. Eliminate noise points. 3. Put an edge between all core points that are within a distance Eps of each other. 4. Make each group of connected core points into a separate cluster. 5. Assign each border point to one of the clusters of its associated core points. (A usage sketch follows below.)
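scikit-learn's DBSCAN implements essentially this procedure, so in practice the algorithm reduces to a call like the one below. The eps and min_samples values and the placeholder data are illustrative only; min_samples corresponds to MinPts and, like the slide's definition, counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # placeholder data

db = DBSCAN(eps=0.3, min_samples=4).fit(X)     # Eps = 0.3, MinPts = 4 (illustrative)
labels = db.labels_                            # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True      # which points were core points
print(n_clusters, "clusters;", int(np.sum(labels == -1)), "noise points")
```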
  • 494. When DBSCAN Works Well 82 Original Points Clusters (dark blue points indicate noise) • Can handle clusters of different shapes and sizes • Resistant to noise
• 495. When DBSCAN Does NOT Work Well (Figures: original points; clusterings with MinPts=4, Eps=9.92 and with MinPts=4, Eps=9.75) • Varying densities • High-dimensional data