1
UNIT I
DATA WAREHOUSING
2
DATA WAREHOUSING (Unit - I)
❑Data Warehouse and OLAP Technology:
○ 1.1 An Overview: Data Warehouse
○ 1.2 Data Warehouse Architecture
○ 1.3 A Multidimensional Data Model
○ 1.4 Data Warehouse Implementation
○ 1.5 From Data Warehousing to Data
Mining. (Han & Kamber)
3
Data Warehouse Overview
4
What is a Data Warehouse?
■ Data warehousing provides architectures and tools for business
executives to systematically organize, understand, and use their data
to make strategic decisions.
■ Data warehouse refers to a data repository that is maintained
separately from an organization’s operational databases.
■ “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process.”
■ Data warehousing: The process of constructing and using data
warehouses
Data Warehouse—Subject-Oriented
5
■ Organized around major subjects, such as customer,
product, sales
■ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
■ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
Data Warehouse—Integrated
6
■ Constructed by integrating multiple, heterogeneous data
sources
■ relational databases, flat files, on-line transaction
records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
7
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse contains an
element of time, explicitly or implicitly. But the key of
operational data may or may not contain “time element”
Data Warehouse—Nonvolatile
8
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
OLTP vs OLAP
9
■ User & system orientation: OLTP is customer oriented (transaction and query
processing); OLAP is market oriented (data analysis by managers, executives,
and analysts).
■ Data contents: OLTP manages current data, often too detailed; OLAP manages
large amounts of data (summarization and aggregation).
■ Database design: OLTP uses an ER data model and an application-oriented
database design; OLAP uses a star or snowflake model and a subject-oriented
database design.
■ View: OLTP focuses on current data within an enterprise or department; OLAP
spans multiple versions of a database schema (an evolutionary process) and
data from different organizations and many data stores.
■ Access patterns: OLTP consists of short, atomic transactions (requires
concurrency control and recovery); OLAP consists of mostly read-only
operations (complex queries).
15
Data Warehouse Architecture
16
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
■ Major task of traditional relational DBMS
■ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
■ OLAP (on-line analytical processing)
■ Major task of data warehouse system
■ Data analysis and decision making
■ Distinct features (OLTP vs. OLAP):
■ User and system orientation: customer vs. market
■ Data contents: current, detailed vs. historical, consolidated
■ Database design: ER + application vs. star + subject
■ View: current, local vs. evolutionary, integrated
■ Access patterns: update vs. read-only but complex queries
17
Why Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from heterogeneous
sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
18
Data Warehousing: A Multitiered Architecture
19
■ Bottom Tier:
■ Warehouse Database server
■ a relational database system
■ Back-end tools and utilities
■ data extraction
■ by using gateways (ODBC, JDBC, and OLE DB)
■ cleaning
■ transformation
■ load & refresh
Data Warehousing: A Multitiered Architecture
20
■ Middle Tier (OLAP server)
■ ROLAP - Relational OLAP
■ extended RDBMS that maps operations on
multidimensional data to standard relational
operations.
■ MOLAP - Multidimensional OLAP
■ Special-purpose server that directly implements
multidimensional data and operations.
■ Top Tier
■ Front-end Client Layer
■ Query and reporting tools, analysis tools and
data mining tools.
Data Warehousing: A Multitiered Architecture
21
■ Data Warehouse Models:
■ Enterprise warehouse:
■ collects all of the information about subjects
spanning the entire organization.
■ corporate-wide data integration
■ can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond.
■ implemented on mainframes, computer
superservers, or parallel architecture platforms
Data Warehousing: A Multitiered Architecture
22
■ Data Warehouse Models:
■ Data mart: a subset of corporate-wide data that is of value
to a specific group of users
■ confined to specific selected subjects.
■ Example - marketing data mart may confine its subjects to
customer, item, and sales.
■ implemented on low-cost departmental servers
■ Independent Data mart - data captured from
■ one or more operational systems or external information
providers,
or
■ from data generated locally within a particular department or
geographic area.
■ Dependent Data mart - sourced directly from enterprise data
warehouses.
Data Warehousing: A Multitiered Architecture
23
■ Data Warehouse Models:
■ Virtual warehouse:
■ A virtual warehouse is a set of views over operational
databases.
■ easy to build but requires excess capacity on operational
database servers.
Data Warehousing: A Multitiered Architecture
24
■ Data extraction: gathers data from multiple,
heterogeneous, and external sources.
■ Data Cleaning: detects errors in the data and
rectifies them when possible
■ Data transformation: converts data from
legacy or host format to warehouse format.
■ Load: sorts, summarizes, consolidates,
computes views, checks integrity, and builds
indices and partitions.
■ Refresh: propagates the updates from the data
sources to the warehouse.
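To make the back-end flow concrete, here is a minimal Python sketch of extract, clean, transform, and load for one feed; the field names and sample records are invented and not taken from any particular tool.

    # Minimal ETL sketch (illustrative only; field names and sample rows are made up).

    def extract():
        # Extraction: a real system would pull rows via ODBC/JDBC/OLE DB gateways;
        # here we return a small in-memory sample of "operational" records.
        return [
            {"txn_id": "T1", "amount": "120.50", "city": " vancouver "},
            {"txn_id": "T2", "amount": "80.00",  "city": "Toronto"},
            {"txn_id": "T2", "amount": "80.00",  "city": "Toronto"},   # duplicate
            {"txn_id": "T3", "amount": "",       "city": "Vancouver"}, # missing amount
        ]

    def clean(rows):
        # Cleaning: drop rows with missing values and remove duplicate transaction ids.
        seen, out = set(), []
        for r in rows:
            if r["amount"] and r["txn_id"] not in seen:
                seen.add(r["txn_id"])
                out.append(r)
        return out

    def transform(rows):
        # Transformation: convert legacy/host formats to the warehouse format
        # (cast amounts to float, standardize city names).
        return [{"txn_id": r["txn_id"],
                 "amount": float(r["amount"]),
                 "city": r["city"].strip().title()} for r in rows]

    def load(rows, warehouse):
        # Load: append to the fact table and maintain a simple per-city summary,
        # standing in for sorting, summarizing, consolidating, and index building.
        warehouse.setdefault("fact_sales", []).extend(rows)
        summary = warehouse.setdefault("sales_by_city", {})
        for r in rows:
            summary[r["city"]] = summary.get(r["city"], 0.0) + r["amount"]

    warehouse = {}
    load(transform(clean(extract())), warehouse)   # "refresh" would rerun this on newly arrived data
    print(warehouse["sales_by_city"])              # {'Vancouver': 120.5, 'Toronto': 80.0}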
Data Warehousing: A Multitiered Architecture
25
Metadata Repository:
metadata are the data that define warehouse
objects
It consists of:
1) Data warehouse structure
2) Operational metadata
3) algorithms used for summarization
4) Mapping from the operational environment to
the data warehouse
5) Data related to system performance
6) Business metadata
Data Warehousing: A Multitiered Architecture
26
Metadata Repository:
■ data warehouse structure
i) warehouse schema,
ii) view, dimensions,
iii) hierarchies, and
iv) derived data definitions,
v) data mart locations and contents.
■ Operational metadata
i) data lineage (history of migrated data and the
sequence of transformations applied to it),
ii) currency of data (active, archived, or purged),
iii) monitoring information (warehouse usage
statistics, error reports, and audit trails).
Data Warehousing: A Multitiered Architecture
27
Metadata Repository:
■ The algorithms used for summarization,
i) measure and dimension definition algorithms,
ii) data on granularity,
iii) partitions,
iv) subject areas,
v) aggregation,
vi) summarization, and
vii) predefined queries and reports.
Data Warehousing: A Multitiered Architecture
28
Metadata Repository:
■ Mapping from the operational environment to the
data warehouse
i) source databases and their contents,
ii) gateway descriptions,
iii) data partitions,
iv) data extraction, cleaning, transformation rules and
defaults
v) data refresh and purging rules, and
vi) security (user authorization and access control).
Data Warehousing: A Multitiered Architecture
29
Metadata Repository:
■ Data related to system performance
■ indices and profiles that improve data access and
retrieval performance,
■ rules for the timing and scheduling of refresh,
update, and replication cycles.
■ Business metadata,
■ business terms and definitions,
■ data ownership information, and
■ charging policies
30
A Multidimensional Data Model
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
31
■ A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
■ Dimensions are the perspectives or entities with
respect to which an organization wants to keep
records.
■ Example:-
■ AllElectronics may create a sales data warehouse
■ time, item, branch, and location - These
dimensions allow the store to keep track of things
like monthly sales of items and the branches and
locations at which the items were sold.
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
32
■ Each dimension may have a table associated with it, called
a dimension table, which further describes the
dimension.
■ For example - a dimension table for item may contain the
attributes item name, brand, type.
■ A multidimensional data model is typically organized
around a central theme, such as sales. This theme is
represented by a fact table.
■ Facts are numeric measures.
■ The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension
tables.
33
Data Cube: A Multidimensional Data Model
■ A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
■ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
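A small pandas sketch of this layout (all table contents invented): a fact table carrying the dollars_sold measure plus foreign keys, joined to its dimension tables to answer a question such as sales per brand per quarter.

    # Sketch: a fact table holds the measure plus keys to dimension tables; dimension
    # tables describe the dimensions. Joining them answers "sales per brand per quarter".
    import pandas as pd

    item_dim = pd.DataFrame({
        "item_key":  [1, 2, 3],
        "item_name": ["TV", "Laptop", "Phone"],
        "brand":     ["Acme", "Acme", "Globex"],
        "type":      ["home_entertainment", "computer", "phone"],
    })
    time_dim = pd.DataFrame({
        "time_key": [10, 11],
        "quarter":  ["Q1", "Q2"],
        "year":     [2010, 2010],
    })
    sales_fact = pd.DataFrame({          # measure: dollars_sold; keys: item_key, time_key
        "item_key":     [1, 2, 2, 3],
        "time_key":     [10, 10, 11, 11],
        "dollars_sold": [605.0, 825.0, 968.0, 38.0],
    })

    # Star join: bring in the descriptive attributes, then aggregate the measure.
    joined = sales_fact.merge(item_dim, on="item_key").merge(time_dim, on="time_key")
    print(joined.groupby(["brand", "quarter"])["dollars_sold"].sum())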
34
Data Cube: A Multidimensional Data Model
■ A data cube is a lattice of cuboids
■ A data warehouse is usually modeled by a multidimensional data
structure, called a data cube, in which
■ each dimension corresponds to an attribute or a set of
attributes in the schema, and
■ each cell stores the value of some aggregate measure such as
count or sum(sales_amount).
■ A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
Data Cube: A Multidimensional Data Model
35
2-D View of Sales data
■ AllElectronics sales data for items sold per quarter in the city of Vancouver.
■ a simple 2-D data cube that is a table or spreadsheet for sales data from
AllElectronics
Data Cube: A Multidimensional Data Model
36
3-D View of a Sales data
The 3-D data in the table are represented as a series of 2-D tables
Data Cube: A Multidimensional Data Model
37
3D Data Cube Representation of Sales data
we may also represent the same data in the form of a 3D data cube
Data Cube: A Multidimensional Data Model
38
4-D Data Cube Representation of Sales Data
we may display any n-dimensional data as a series of (n − 1)-dimensional
“cubes.”
39
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids forming a 4-D data cube for the dimensions time, item, location, and supplier. The 0-D (apex) cuboid "all" is at the top; below it are the 1-D cuboids (time), (item), (location), (supplier); the 2-D cuboids (time,item), (time,location), (time,supplier), (item,location), (item,supplier), (location,supplier); the 3-D cuboids (time,item,location), (time,item,supplier), (time,location,supplier), (item,location,supplier); and at the bottom the 4-D (base) cuboid (time, item, location, supplier).]
40
■ In data warehousing literature, an n-D base cube is called a base
cuboid.
■ The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid.
■ In our example, this is the total sales, or dollars sold,
summarized over all four dimensions.
■ The apex cuboid is typically denoted by all.
■ The lattice of cuboids forms a data cube.
Schemas for Multidimensional Data Models
41
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Schemas for Multidimensional Data Models
42
■ Star schema: In this, a data warehouse contains
(1) a large central table (fact table) containing the bulk
of the data, with no redundancy, and
(2) a set of smaller attendant tables
(dimension tables), one for each dimension.
■ Each dimension is represented by only one table.
■ Each table contains a set of attributes
■ Problem: redundancy in dimension tables.
■ e.g., the location dimension table will create redundancy
among the attributes province or state and country; that
is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).
43
Star schema
44
Snowflake schema
■ Variant of the star schema model
■ Dimension tables are normalized ( to remove
redundancy)
■ Dimension tables are split into additional tables.
■ The resulting schema graph forms a shape similar to a
snowflake.
■ Problem
■ more joins will be needed to execute a query ( affects
system performance)
■ so this is not as popular as the star schema in data
warehouse design.
45
Snowflake schema
46
Fact Constellation
● A fact constellation schema allows dimension tables to be
shared between fact tables
● A data warehouse collects information about subjects that
span the entire organization, such as customers, items,
sales, assets, and personnel, and thus its scope is
enterprise-wide.
● For data warehouses, the fact constellation
schema is commonly used.
● For data marts, the star or snowflake schema is
commonly used
47
Fact Constellation
This schema specifies two fact tables, sales and shipping; the
dimension tables for time, item, and location are shared between
the sales and shipping fact tables.
48
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
■ Just as relational query languages like SQL can be used
to specify relational queries, a data mining query
language (DMQL) can be used to specify data mining
tasks.
■ Data warehouses and data marts can be defined using
two language primitives, one for cube definition and
one for dimension definition.
49
Syntax for Cube and Dimension
Definition in DMQL
■ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
■ Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
■ Special Case (Shared Dimension Tables)
■ First time as “cube definition”
■ define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining Star Schema in DMQL
50
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
Defining Snowflake Schema in DMQL
51
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Defining Fact Constellation in DMQL
52
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Concept Hierarchies
Courtesy: Data Mining: Concepts and Techniques, 3rd Edition (Morgan Kaufmann)
■ A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level.
■ concept hierarchy for the dimension location
53
Concept Hierarchies
■ A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema
hierarchy.
Courtesy: Data Mining: Concepts and Techniques, 3rd Edition (Morgan Kaufmann)
Concept Hierarchies
55
■ Concept hierarchies may also be defined by discretizing
or grouping values for a given dimension or attribute,
resulting in a set-grouping hierarchy.
■ A total or partial order can be defined among groups of
values.
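A concept hierarchy can be held as simple mappings from lower-level to higher-level values; the sketch below (invented location values) rolls city-level sales up to province and country level, and shows a set-grouping style discretization for a numeric attribute.

    # Sketch: a location concept hierarchy as plain dictionaries (values invented),
    # used to roll sales up from city level to country level.
    city_to_province = {"Vancouver": "British Columbia", "Victoria": "British Columbia",
                        "Toronto": "Ontario", "Ottawa": "Ontario"}
    province_to_country = {"British Columbia": "Canada", "Ontario": "Canada"}

    sales_by_city = {"Vancouver": 1200, "Victoria": 300, "Toronto": 950, "Ottawa": 400}

    def roll_up(values_by_member, hierarchy_map):
        # Aggregate values from one level of the hierarchy to the next higher level.
        rolled = {}
        for member, value in values_by_member.items():
            parent = hierarchy_map[member]
            rolled[parent] = rolled.get(parent, 0) + value
        return rolled

    sales_by_province = roll_up(sales_by_city, city_to_province)       # {'British Columbia': 1500, 'Ontario': 1350}
    sales_by_country = roll_up(sales_by_province, province_to_country) # {'Canada': 2850}

    # A set-grouping hierarchy discretizes a numeric attribute into ranges, e.g. for price:
    def price_group(price):
        return "low" if price < 100 else "medium" if price < 500 else "high"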
Measures of Data Cube: Three
Categories
56
■ A multidimensional point in the data cube space can be
defined by a set of dimension-value pairs,
for example, 〈time = “Q1”, location = “Vancouver”,
item = “computer”〉.
■ A data cube measure is a numerical function that can be
evaluated at each point in the data cube space.
■ A measure value is computed for a given point by
aggregating the data corresponding to the respective
dimension-value pairs defining the given point.
■ Based on the kind of aggregate functions used, measures
can be organized into three categories : distributive,
algebraic, holistic
Measures of Data Cube: Three
Categories
■ Distributive: An aggregate function is distributive if the result
derived by applying the function to n aggregate values is the same
as that derived by applying the function to all of the data without
partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: An aggregate function is algebraic if it can be
computed by an algebraic function with M arguments (where M
is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
■ E.g., avg()=sum()/count(), min_N(), standard_deviation()
■ Holistic: An aggregate function is holistic if there is no constant
bound on the storage size and there does not exist an algebraic
function with M arguments (where M is a constant) that
characterizes the computation.
■ E.g., median(), mode(), rank()
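The distinction matters when data is aggregated per partition and then combined: distributive and algebraic measures can be assembled from partial results, while a holistic measure such as median cannot. A small Python sketch with made-up numbers:

    # Partition the data, aggregate each partition, then combine the partial results.
    # sum() is distributive; avg() is algebraic (combine (sum, count) pairs); median() is holistic.
    import statistics

    partitions = [[4, 8, 15], [16, 23], [42]]          # made-up values split across 3 chunks
    all_data = [x for p in partitions for x in p]

    # Distributive: the sum of partial sums equals the sum over all the data.
    assert sum(sum(p) for p in partitions) == sum(all_data)

    # Algebraic: avg can be computed from M=2 distributive results per partition: (sum, count).
    partials = [(sum(p), len(p)) for p in partitions]
    avg = sum(s for s, _ in partials) / sum(c for _, c in partials)
    assert avg == sum(all_data) / len(all_data)

    # Holistic: the median of partition medians is NOT, in general, the overall median.
    median_of_medians = statistics.median(statistics.median(p) for p in partitions)
    print(median_of_medians, statistics.median(all_data))   # 19.5 vs 15.5 here: they differ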
58
Typical OLAP Operations
■ Roll up (drill-up):
■ Drill down (roll down):
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: Allows users to analyze the same data through
different reports, analyze it with different features and even display it
through different visualization methods
59
Fig. 3.10 Typical OLAP
Operations
Source & Courtesy: https://www.javatpoint.com/olap-operations
60
Typical OLAP Operations:Roll Up/Drill Up
■ summarize data
■ by climbing up
hierarchy
or
■ by dimension
reduction
Source & Courtesy: https://www.javatpoint.com/olap-operations
61
Typical OLAP Operations:Roll Down
■ reverse of roll-up
■ from higher
level summary
to lower level
summary or
detailed data, or
introducing new
dimensions
Typical OLAP Operations:Slicing
● Slice is the act of picking a rectangular subset of a cube by choosing a single
value for one of its dimensions, creating a new cube with one fewer
dimension.
● Example: The sales figures of all sales regions and all product categories of
the company in the year 2005 and 2006 are "sliced" out of the data cube.
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
62
Typical OLAP Operations:Slicing
Slicing:
It selects a single
dimension from the OLAP
cube which results in a new
sub-cube creation.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Dice
● Dice: The dice operation produces a subcube by allowing the analyst to pick
specific values of multiple dimensions
● The picture shows a dicing operation: The new cube shows the sales figures
of a limited number of product categories, the time and region dimensions
cover the same range as before.
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
64
Typical OLAP Operations:Dicing
Dice:
It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Pivot
66
Pivot allows an analyst to rotate the cube in space to see its various faces. For
example, cities could be arranged vertically and products horizontally while viewing
data for a particular quarter.
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
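These operations map naturally onto ordinary group-by, filter, and pivot manipulations; the pandas sketch below (invented sales rows) mimics roll-up, drill-down, slice, dice, and pivot.

    # Mimicking OLAP operations with pandas on a tiny, invented sales data set.
    import pandas as pd

    sales = pd.DataFrame({
        "year":     [2005, 2005, 2006, 2006, 2006],
        "quarter":  ["Q1", "Q2", "Q1", "Q1", "Q3"],
        "region":   ["East", "West", "East", "West", "East"],
        "category": ["Phones", "Phones", "Laptops", "Laptops", "Phones"],
        "dollars":  [100, 150, 200, 120, 90],
    })

    # Roll-up: climb the time hierarchy from quarter to year (quarter is aggregated away).
    by_year = sales.groupby(["year", "region"])["dollars"].sum()

    # Drill-down: go back to the more detailed quarter level.
    by_quarter = sales.groupby(["year", "quarter", "region"])["dollars"].sum()

    # Slice: fix a single value of one dimension (year = 2006), leaving a smaller subcube.
    slice_2006 = sales[sales["year"] == 2006]

    # Dice: select values on two or more dimensions at once.
    dice = sales[(sales["year"].isin([2005, 2006])) & (sales["category"] == "Phones")]

    # Pivot: rotate the view so regions become columns and categories become rows.
    pivoted = sales.pivot_table(index="category", columns="region", values="dollars", aggfunc="sum")
    print(pivoted)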
A Star-Net Query Model
67
● The querying of multidimensional databases can be based
on a starnet model.
● It consists of radial lines emanating from a central point,
where each line represents a concept hierarchy for a
dimension.
● Each abstraction level in the hierarchy is called a footprint.
● These represent the granularities available for use by OLAP
operations such as drill-down and roll-up.
A Star-Net Query Model
68
A Star-Net Query Model
69
■ Four radial lines, representing concept hierarchies for the
dimensions location, customer, item, and time,
respectively
■ footprints representing abstraction levels of the
dimension - time line has four footprints: “day,”
“month,” “quarter,” and “year.”
■ Concept hierarchies can be used to generalize data by
replacing low-level values (such as “day” for the time
dimension) by higher-level abstractions (such as “year”)
or
■ to specialize data by replacing higher-level abstractions
with lower-level values.
Data Warehouse Design and Usage
70
A Business Analysis Framework for Data
Warehouse Design:
■ To design an effective data warehouse we need to
understand and analyze business needs and construct a
business analysis framework.
■ Different views are combined to form a complex
framework.
Data Warehouse Design and Usage
71
■ Four different views regarding a data warehouse design
must be considered:
■ Top-down view
■ allows the selection of the relevant information
necessary for the data warehouse (matches current
and future business needs).
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems.
■ Documented at various levels of detail and accuracy,
from individual data source tables to integrated data
source tables.
■ Modeled with the ER model or CASE (computer-aided
software engineering) tools.
Data Warehouse Design and Usage
72
■ Data warehouse view
■ includes fact tables and dimension tables.
■ It represents the information that is stored inside the
data warehouse, including
■ precalculated totals and counts,
■ information regarding the source, date, and time
of origin, added to provide historical context.
■ Business query view
■ is the data perspective in the data warehouse from
the end-user’s viewpoint.
Data Warehouse Design and Usage
■ Skills required to build & use a Data warehouse
■ Business Skills
■ how systems store and manage their data,
■ how to build extractors (operational DBMS to DW)
■ how to build warehouse refresh software(update)
■ Technology skills
■ the ability to discover patterns and trends,
■ to extrapolate trends based on history and look
for anomalies or paradigm shifts, and
■ to present coherent managerial recommendations
based on such analysis.
■ Program management skills
■ Interface with many technologies, vendors, and end-
users in order to deliver results in a timely and cost-effective manner
Data Warehouse Design and Usage
74
Data Warehouse Design Process
■ A data warehouse can be built using
■ Top-down approach (overall design and planning)
■ It is useful in cases where the technology is
mature and well known
■ Bottom-up approach (starts with experiments and prototypes)
■ a combination of both.
■ From the software engineering point of view, the design and
construction of a data warehouse may consist of the following
steps (waterfall model or spiral model):
■ planning,
■ requirements study,
■ problem analysis,
■ warehouse design,
■ data integration and testing, and
■ finally, deployment of the data warehouse.
■ Waterfall model: structured and systematic analysis at each
step before moving on, one step to the next.
■ Spiral model: rapid generation, with short intervals between
successive releases; a good choice for data warehouse
development, since the turnaround time is short, modifications
can be done quickly, and new designs and technologies can be
adapted in a timely manner.
Data Warehouse Design and Usage
75
Data Warehouse Design Process
■ 4 major Steps involved in Warehouse design are:
■ 1. Choose a business process to model (e.g., orders,
invoices, shipments, inventory, account administration,
sales, or the general ledger).
■ Data warehouse model - If the business process is
organizational and involves multiple complex object
collections
■ Data mart model - if the process is departmental and
focuses on the analysis of one kind of business
process
Data Warehouse Design and Usage
76
■ 2. Choose the business process grain
■ Fundamental, atomic level of data to be represented
in the fact table
■ (e.g., individual transactions, individual daily
snapshots, and so on).
■ 3. Choose the dimensions that will apply to each
fact table record.
■ Typical dimensions are time, item, customer, supplier,
warehouse, transaction type, and status.
■ 4. Choose the measures that will populate each
fact table record.
■ Typical measures are numeric additive quantities like
dollars sold and units sold.
Data Warehouse Design and Usage
77
Data Warehouse Usage for Information Processing
■ Evolution of DW takes place throughout a number of
phases.
■ Initial Phase - DW is used for generating reports and
answering predefined queries.
■ Progressively - to analyze summarized and detailed data,
(results are in the form of reports and charts)
■ Later - for strategic purposes, performing
multidimensional analysis and sophisticated slice-and-
dice operations.
■ Finally - for knowledge discovery and strategic decision
making using data mining tools.
78
Data Warehouse Implementation
79
Data warehouse implementation
■ OLAP servers demand that decision support queries be
answered in the order of seconds.
■ Methods for the efficient implementation of data
warehouse systems.
■ 1. Efficient data cube computation.
■ 2. OLAP data indexing (bitmap or join indices )
■ 3. OLAP query processing
■ 4. Various types of warehouse servers for OLAP
processing.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
80
■ Requires efficient computation of aggregations
across many sets of dimensions.
■ In SQL terms:
■ Aggregations are referred to as group-by’s.
■ Each group-by can be represented by a cuboid,
■ set of group-by’s forms a lattice of cuboids
defining a data cube.
■ Compute cube Operator - computes
aggregates over all subsets of the dimensions
specified in the operation.
■ require excessive storage space for large
number of dimensions.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
81
Example 4.6
■ create a data cube for AllElectronics sales that
contains the following:
■ city, item, year, and sales in dollars.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
82
■ What is the total number of cuboids, or group-
by’s, that can be computed for this data cube?
■ 3 attributes - city, item & year - 3 dimensions
■ sales in dollars - measure,
■ the total number of cuboids, or group-by's, is 2^3 = 8.
■ The possible group-by’s are the following:
■ {(city, item, year), (city, item), (city, year),
(item, year), (city), (item), (year), ()}
■ () - group-by is empty (i.e., the dimensions are not
grouped) - all.
■ group-by’s form a lattice of cuboids for the data cube
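The compute cube operator amounts to one group-by per subset of the dimensions; a sketch of that enumeration with pandas (sample rows invented):

    # Enumerate and compute all 2^n cuboids (group-by's) of a small sales table.
    from itertools import combinations
    import pandas as pd

    sales = pd.DataFrame({
        "city":  ["Vancouver", "Vancouver", "Toronto", "Toronto"],
        "item":  ["computer", "phone", "computer", "phone"],
        "year":  [2009, 2009, 2010, 2010],
        "sales_in_dollars": [1000, 250, 1200, 300],
    })

    dimensions = ["city", "item", "year"]
    cuboids = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            if dims:
                cuboids[dims] = sales.groupby(list(dims))["sales_in_dollars"].sum()
            else:
                # (): the apex cuboid "all" -- total sales over all dimensions.
                cuboids[dims] = sales["sales_in_dollars"].sum()

    print(len(cuboids))              # 8 cuboids for 3 dimensions (2^3)
    print(cuboids[("city", "year")]) # one of the 2-D cuboids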
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
83
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
84
■ Base cuboid contains all three dimensions(city, item, year)
■ returns - total sales for any combination of the three
dimensions.
■ This is the least generalized (most specific) of the cuboids.
■ Apex cuboid, or 0-D cuboid, refers to the case where
the group-by is empty (contains the total sum of all sales)
■ This is the most generalized (least specific) of the cuboids
■ Drilling down: start at the apex cuboid and explore
downward in the lattice
■ Rolling up: start at the base cuboid and explore upward
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
85
■ zero-dimensional operation:
■ An SQL query containing no group-by
■ Example - “compute the sum of total sales”
■ one-dimensional operation:
■ An SQL query containing one group-by
■ Example - “compute the sum of sales group-by city”
■ A cube operator on n dimensions is equivalent to a
collection of group-by statements, one for each subset of
the n dimensions.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
■ data cube could be defined as:
■ “define cube sales_cube [city, item, year]:
sum(sales_in_dollars)”
■ 2^n cuboids for a cube with n dimensions
■ “compute cube sales_cube” - statement
■ computes the sales aggregate cuboids for all eight
subsets of the set {city, item, year}, including the
empty subset.
■ In OLAP, for diff. queries diff. cuboids need to be
accessed.
■ Precomputation - compute in advance all or at least
some of the cuboids in a data cube
■ Curse of dimensionality - the required storage space
may explode if all the cuboids in a data cube are
precomputed (when there are many dimensions)
Data warehouse implementation:
87
1.4.1 Efficient Data Cube Computation
■ A data cube can be viewed as a lattice of cuboids
■ 2^n cuboids - when there is no concept hierarchy
■ How many cuboids are there in an n-dimensional cube with L levels?
■ Total number of cuboids T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1),
where Li is the number of levels associated with dimension i
(one is added to Li to include the virtual top level, all)
■ If the cube has 10 dimensions and each dimension has
five levels (including all), the total number of cuboids
that can be generated is 5^10 ≈ 9.8 × 10^6.
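A quick check of this count, assuming each of the 10 dimensions has four hierarchy levels plus the virtual level all (five per dimension in total):

    # Total cuboids = product over dimensions of (L_i + 1); here 10 dimensions,
    # L_i = 4 levels each, so 5 per dimension once "all" is included.
    from math import prod

    levels = [4] * 10
    total_cuboids = prod(L + 1 for L in levels)
    print(total_cuboids)        # 9765625, i.e. 5**10, roughly 9.8 × 10^6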
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
88
There are three choices for data cube
materialization for a given base cuboid:
■ 1. No materialization: Do not precompute -
expensive multidimensional aggregates -
extremely slow.
■ 2. Full materialization: Precompute all of the
cuboids - full cube - requires huge amounts of
memory space in order to store all of the
precomputed cuboids.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
89
■ 3. Partial materialization: Selectively compute a
proper subset of the whole set of possible cuboids.
■ compute a subset of the cube, which contains only those
cells that satisfy some user-specified criterion - subcube
■ 3 factors to consider:
■ (1) identify the subset of cuboids or subcubes to
materialize;
■ (2) exploit the materialized cuboids or subcubes
during query processing; and
■ (3) efficiently update the materialized cuboids or
subcubes during load and refresh.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
90
■ Partial Materialization: Selected Computation of Cuboids
■ The following should be taken into account when selecting
the subset of cuboids or subcubes:
■ the queries in the workload, their frequencies, and
their accessing costs
■ workload characteristics, the cost for incremental
updates, and the total storage requirements.
■ physical database design such as the generation and
selection of indices.
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
91
■ Heuristic approaches for cuboid and subcube
selection
■ Iceberg cube:
■ data cube that stores only those cube cells with
an aggregate value (e.g., count) that is above
some minimum support threshold.
■ shell cube:
■ precomputing the cuboids for only a small number
of dimensions
Data warehouse implementation:
1.4.2 Indexing OLAP Data: Bitmap Index
92
Index structures - To facilitate efficient data accessing
■ Bitmap indexing method - it allows quick searching in
data cubes.
■ In the bitmap index for a given attribute, there is a
distinct bit vector, Bv, for each value v in the attribute’s
domain.
■ If a given attribute’s domain consists of n values, then n
bits are needed for each entry in the bitmap index (i.e.,
there are n bit vectors).
■ If the attribute has the value v for a given row in the
data table, then the bit representing that value is set to 1
in the corresponding row of the bitmap index. All other
bits for that row are set to 0.
Data warehouse implementation:
1.4.2 Indexing OLAP Data: Bitmap Index
93
● Example:- AllElectronics data warehouse
● dim(item)={H,C,P,S} - 4 values - 4 bit vectors
● dim(city)= {V,T} - 2 values - 2 bit vectors
● Better than Hash & Tree Indices but good for low
cardinality only (cardinality: the number of unique values in a database column)
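A minimal sketch of a bitmap index in Python; the item and city codes mirror the example above, and the rows themselves are invented.

    # Build a bitmap index: one bit vector per distinct value of an attribute.
    rows = [
        {"RID": 0, "item": "H", "city": "V"},
        {"RID": 1, "item": "C", "city": "V"},
        {"RID": 2, "item": "P", "city": "T"},
        {"RID": 3, "item": "S", "city": "T"},
        {"RID": 4, "item": "H", "city": "T"},
    ]

    def bitmap_index(rows, attribute):
        # For each value v in the attribute's domain, bit i is 1 iff row i has that value.
        domain = sorted({r[attribute] for r in rows})
        return {v: [1 if r[attribute] == v else 0 for r in rows] for v in domain}

    item_index = bitmap_index(rows, "item")   # 4 distinct values -> 4 bit vectors
    city_index = bitmap_index(rows, "city")   # 2 distinct values -> 2 bit vectors

    # Queries become bitwise operations, e.g. "item = H AND city = T":
    hits = [i for i, (b1, b2) in enumerate(zip(item_index["H"], city_index["T"])) if b1 & b2]
    print(hits)   # [4]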
94
Exercise: Bitmap index on CITY
Data warehouse implementation:
Indexing OLAP Data: Join Index
95
■ Traditional indexing maps the value in a given
column to a list of rows having that value.
■ Join indexing registers the joinable rows of
two relations from a relational database.
■ For example,
■ two relations - R(RID, A) and S(B, SID)
■ join on the attributes A and B,
■ join index record contains the pair (RID, SID),
■ where RID and SID are record identifiers from
the R and S relations, respectively
Data warehouse implementation:
Indexing OLAP Data: Join Index
96
■ Advantage:-
■ Identification of joinable tuples without performing
costly join operations.
■ Useful:-
■ To maintain the relationship between a foreign key (fact
table) and its matching primary keys (dimension table)
from the joinable relation.
■ Indexing maintains relationships between attribute
values of a dimension (e.g., within a dimension table)
and the corresponding rows in the fact table.
■ Composite join indices: Join indices with multiple
dimensions.
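A join index can be pictured as a precomputed list of joinable (SID, RID) pairs; the sketch below (invented keys and values) builds one between a location dimension table and a sales fact table and reuses it to fetch matching fact rows without re-running the join.

    # Build a join index registering joinable rows between a dimension table and a fact table.
    # Keys and values are invented for illustration.
    location_dim = [
        {"SID": "L1", "location_key": 10, "city": "Vancouver"},
        {"SID": "L2", "location_key": 20, "city": "Toronto"},
    ]
    sales_fact = [
        {"RID": "T57",  "location_key": 10, "dollars_sold": 605},
        {"RID": "T238", "location_key": 20, "dollars_sold": 968},
        {"RID": "T884", "location_key": 10, "dollars_sold": 38},
    ]

    # Join index on location: pairs (SID, RID) whose primary/foreign keys match.
    join_index = [(d["SID"], f["RID"])
                  for d in location_dim
                  for f in sales_fact
                  if d["location_key"] == f["location_key"]]
    print(join_index)   # [('L1', 'T57'), ('L1', 'T884'), ('L2', 'T238')]

    # Later, joinable fact rows for Vancouver (SID L1) can be fetched without redoing the join:
    vancouver_rids = [rid for sid, rid in join_index if sid == "L1"]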
Data warehouse implementation:
Indexing OLAP Data: Join Index
■ Example:-Star Schema
■ “sales_star [time, item, branch, location]: dollars_sold
= sum (sales_in_dollars).”
■ join index is relationship between
■ Sales fact table and
■ the location, item dimension tables
To speed up query processing, join indexing and bitmap indexing methods
can be integrated to form bitmapped join indices.
Data warehouse implementation:
Efficient processing of OLAP queries
98
Given materialized views, query processing should proceed as
follows:
■ 1. Determine which operations should be performed
on the available cuboids:
■ This involves transforming any selection, projection,
roll-up (group-by), and drill-down operations specified
in the query
into
corresponding SQL and/or OLAP operations.
■ Example:
■ slicing and dicing a data cube may correspond to
selection and/or projection operations on a
materialized cuboid.
Data warehouse implementation:
Efficient processing of OLAP queries
99
■ 2. Determine to which materialized cuboid(s) the
relevant operations should be applied:
■ pruning the set using knowledge of
“dominance” relationships among the cuboids,
■ estimating the costs of using the remaining
materialized cuboids, and selecting the cuboid with
the least cost.
Data warehouse implementation:
Efficient processing of OLAP queries
100
Example:-
■ define a data cube for AllElectronics of the
form “sales cube [time, item, location]:
sum(sales in dollars).”
■ dimension hierarchies
■ “day < month < quarter < year” for time;
■ “item_name < brand < type” for item
■ “street < city < province or state < country”
for location
■ Query:
■ {brand, province or state}, with the selection
constant “year = 2010.”
Data warehouse implementation:
Efficient processing of OLAP queries
101
■ suppose that there are four materialized cuboids
available, as follows:
■ cuboid 1: {year, item_name, city}
■ cuboid 2: {year, brand, country}
■ cuboid 3: {year, brand, province_or_state}
■ cuboid 4: {item_name, province_or_state}, where year = 2010
■ Which of these four cuboids should be selected
to process the query? Ans: 1,3,4
■ Low cost cuboid to process the query? Ans: 4
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
102
■ Relational OLAP (ROLAP) servers:
■ ROLAP uses relational tables to store data for
online analytical processing
■ Intermediate servers that stand in
between a relational back-end server and
client front-end tools.
■ Operation:
■ use a relational or extended-relational DBMS to
store and manage warehouse data
■ OLAP middleware to support missing pieces
■ ROLAP has greater scalability than MOLAP.
■ Example:-
■ DSS server of Microstrategy
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
103
■ Multidimensional OLAP (MOLAP) servers:
■ support multidimensional data views through array-
based multidimensional storage engines
■ maps multidimensional views directly to data cube
array structures.
■ Advantage:
■ fast indexing to precomputed summarized data.
■ adopt a two-level storage representation
■ Denser subcubes are stored as array structures
■ Sparse subcubes employ compression
technology
A sparse array is one that contains mostly zeros and few non-zero entries; a dense array contains mostly non-zero entries.
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
■ Hybrid OLAP (HOLAP) servers:
■ Combines ROLAP and MOLAP technology
■ benefits
■ greater scalability from ROLAP and
■ faster computation of MOLAP.
■ HOLAP server may allow
■ large volumes of detailed data to be stored in a
relational database,
■ while aggregations are kept in a separate MOLAP store.
■ Example:- Microsoft SQL Server 2000 (supports)
■ Specialized SQL servers:
■ provide advanced query language and query
processing support for SQL queries over star and
snowflake schemas in a read-only environment.
105
From Data Warehousing to Data Mining
106
From DataWarehousing to Data Mining
DataWarehouse Usage
■ Data warehouses and data marts are used in a
wide range of applications.
■ Business executives use the data in data warehouses
and data marts to perform data analysis and make
strategic decisions.
■ data warehouses are used as an integral part of a
plan-execute-assess “closed-loop” feedback
system for enterprise management.
■ Data warehouses are used extensively in banking and
financial services, consumer goods and retail
distribution sectors, and controlled manufacturing,
such as demand-based production.
DataWarehouse Usage
107
■ There are three kinds of data warehouse
applications:
■ information processing
■ analytical processing
■ data mining
DataWarehouse Usage
108
■ Information processing supports
■ querying,
■ basic statistical analysis, and
■ reporting using crosstabs, tables, charts, or
graphs.
■ Analytical processing supports
■ basic OLAP operations,
■ slice-and-dice, drill-down, roll-up, and pivoting.
■ It generally operates on historic data in both
summarized and detailed forms.
■ multidimensional data analysis
DataWarehouse Usage
109
■ Data mining supports
■ knowledge discovery by finding hidden
patterns and associations,
■ constructing analytical models,
■ performing classification and prediction, and
■ presenting the mining results using
visualization tools.
■ Note:-
■ Data mining is different from information
processing and analytical processing
From Online Analytical Processing
to Multidimensional Data Mining
110
■ On-line analytical mining (OLAM) (also called OLAP
mining) integrates on-line analytical processing (OLAP)
with data mining and mining knowledge in
multidimensional databases.
■ OLAM is particularly important for the following reasons:
■ High quality of data in data warehouses.
■ Available information processing infrastructure
surrounding data warehouses
■ OLAP-based exploratory data analysis:
■ On-line selection of data mining functions
Architecture for On-Line Analytical
Mining
111
■ An OLAM server performs analytical mining in data
cubes in a similar manner as an OLAP server performs
on-line analytical processing.
■ An integrated OLAM and OLAP architecture is shown in
Figure, where the OLAM and OLAP servers both accept
user on-line queries (or commands) via a graphical user
interface API and work with the data cube in the data
analysis via a cube API.
■ The data cube can be constructed by accessing and/or
integrating multiple databases via an MDDB API and/or
by filtering a data warehouse via a database API that may
support OLE DB or ODBC connections.
112
Data Mining
&
Motivating Challenges
UNIT - II
By
M. Rajesh Reddy
WHAT IS DATA MINING?
• Data mining is the process of automatically discovering
useful information in large data repositories.
• To find novel and useful patterns that might
otherwise remain unknown.
• provide capabilities to predict the outcome of a future
observation,
• Example
• predicting whether a newly arrived customer will spend
more than $100 at a department store.
WHAT IS DATA MINING?
• Not all information discovery tasks are considered to be
data mining.
• For example, tasks related to the area of information
retrieval.
• looking up individual records using a database
management system
or
• finding particular Web pages via a query to an Internet
search engine
• However, data mining techniques can be used to enhance information
retrieval systems.
WHAT IS DATA MINING?
Data Mining and Knowledge Discovery
• Data mining is an integral part of Knowledge Discovery in
Databases (KDD),
• process of converting raw data into useful
information
• This process consists of a series of transformation
steps
WHAT IS DATA MINING?
• Preprocessing - to transform the raw input data into an
appropriate format for subsequent analysis.
• Steps involved in data preprocessing
• Fusing (joining) data from multiple sources,
• cleaning data to remove noise and duplicate
observations
• selecting records and features that are relevant to the
data mining task at hand.
• most laborious and time-consuming step
WHAT IS DATA MINING?
• Post Processing:
• only valid and useful results are incorporated into the
decision support system.
• Visualization
• allows analysts to explore the data and the data
mining results from a variety of viewpoints.
• Statistical measures or hypothesis testing methods can
also be applied
• to eliminate spurious (false or fake) data mining
results.
Motivating Challenges:
• challenges that motivated the development of data
mining.
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
Motivating Challenges:
• Scalability
• Size of datasets are in the order of GB, TB or PB.
• special search strategies
• implementation of novel data structures ( for efficient
access)
• out-of-core algorithms - for large datasets
• sampling or developing parallel and distributed algorithms.
Motivating Challenges:
• High Dimensionality
• common today - data sets with hundreds or thousands
of attributes
• Example
• Bio-Informatics - microarray technology has
produced gene expression data involving
thousands of features.
• Data sets with temporal or spatial components
also tend to have high dimensionality.
• a data set that contains measurements of
temperature at various locations.
Motivating Challenges:
Heterogeneous and Complex Data
• Traditional data analysis methods deal with data sets whose attributes
are all of the same type - either continuous or categorical.
• Examples of such non-traditional types of data include
• collections of Web pages containing semi-structured
text and hyperlinks;
• DNA data with sequential and three-dimensional
structure and
• climate data with time series measurements
• DM should maintain relationships in the data, such as
• temporal and spatial autocorrelation,
• graph connectivity, and
• parent-child relationships between the elements in
semi-structured text and XML documents.
Motivating Challenges:
• Data Ownership and Distribution
• Data is not stored in one location or owned by one organization
• geographically distributed among resources belonging to multiple
entities.
• This requires the development of distributed data mining techniques.
• key challenges in distributed data mining algorithms
• (1) reduction in the amount of communication needed
• (2) effective consolidation of data mining results obtained from
multiple sources, and
• (3) Data security issues.
Motivating Challenges:
• Non-traditional Analysis:
• Traditional statistical approach: hypothesize-and-test paradigm.
• A hypothesis is proposed,
• an experiment is designed to gather the data, and
• then the data is analyzed with respect to the hypothesis.
• Current data analysis tasks
• Generation and evaluation of thousands of hypotheses,
• Some DM techniques automate the process of hypothesis
generation and evaluation.
• Some data sets frequently involve non-traditional types of data
and data distributions.
Origins of Data mining,
Data mining Tasks
&
Types of Data
Unit - II
DWDM
The Origins of Data Mining
Data mining draws upon ideas, such as
■ (1) sampling, estimation, and hypothesis testing from statistics and
■ (2) search algorithms, modeling techniques, and learning theories from
artificial intelligence, pattern recognition, and machine learning.
The Origins of Data Mining
■ adopt ideas from other areas, including
– optimization,
– evolutionary computing,
– information theory,
– signal processing,
– visualization, and
– information retrieval
The Origins of Data Mining
■ An optimization algorithm is a procedure which is executed iteratively by
comparing various solutions till an optimum or a satisfactory solution is
found.
■ Evolutionary Computation is a field of optimization theory where instead of
using classical numerical methods to solve optimization problems, we use
inspiration from biological evolution to ‘evolve’ good solutions
– Evolution can be described as a process by
which individuals become ‘fitter’ in different
environments through adaptation,
natural selection, and selective breeding.
[Picture: the famous finches Charles Darwin depicted in his journal]
The Origins of Data Mining
■ Information theory is the scientific study of the quantification, storage,
and communication of digital information.
■ The field was fundamentally established by the works of Harry
Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s.
■ The field is at the intersection of probability theory, statistics, computer
science, statistical mechanics, information engineering, and electrical
engineering.
The Origins of
Data Mining
■ Other Key areas:
– database systems
■ to provide support for efficient storage, indexing, and query processing.
– Techniques from high performance (parallel) computing
■ addressing the massive size of some data sets.
– Distributed techniques
■ also help address the issue of size and are essential when the data cannot
be gathered in one location.
Data Mining Tasks
■ Data mining tasks are generally divided into two major categories:
– Predictive tasks. - Use some variables to predict unknown or future
values of other variables
■ Task Objective: predict the value of a particular attribute based on the
values of other attributes.
■ Target/Dependent Variable: attribute to be predicted
■ Explanatory or independent variables: attributes used for making the
prediction
– Descriptive tasks. - Find human-interpretable patterns that
describe the data.
■ Task objective: derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data.
■ Descriptive data mining tasks are often exploratory in nature and
frequently require post processing techniques to validate and explain the
results.
Data Mining Tasks
■ Correlation is a statistical term describing the degree to which two variables
move in coordination with one another.
■ Trends: a general direction in which something is developing or
changing.
■ Clusters: Clustering is the task of dividing data points into a number of
groups such that data points in the same group are more similar to other
data points in the same group than to those in other groups.
(https://www.javatpoint.com/data-mining-cluster-analysis)
■ Trajectory data mining enables prediction of the moving-location details of
humans, vehicles, animals, and so on.
■ Anomaly detection is a step in data mining that identifies data points,
events, and/or observations that deviate from a dataset's normal behavior.
[Figure: Data Mining Tasks - the core data mining tasks illustrated around a sample data set. Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar]
Data Mining Tasks
■ Predictive modeling refers to the task of building a model for the target variable as a
function of the explanatory variables.
■ 2 types of predictive modeling tasks:
– Classification: Used for discrete target variables
– Regression: used for continuous target variables.
– Example:
■ Classification task: predicting whether a Web user will make a purchase at an online
bookstore is a classification task because the target variable is binary-valued.
■ Regression Task: forecasting the future price of a stock is a regression task because price
is a continuous-valued attribute.
– Goal of both tasks: learn a model that minimizes the error between the predicted and
true values of the target variable.
– Predictive modeling can be used to:
■ identify customers that will respond to a marketing campaign,
■ predict disturbances in the Earth’s ecosystem, or
■ judge whether a patient has a particular disease based on the results of medical tests.
Data Mining Tasks
■ Example: (Predicting the Type of a Flower): the task of predicting a species of flower
based on the characteristics of the flower.
■ Iris species: Setosa, Versicolour, or Virginica.
■ Requirement: need a data set containing the characteristics of various flowers of these
three species.
■ 4 other attributes (in the data set): sepal width, sepal length, petal length, and petal width.
■ Petal width is broken into the categories low, medium, and high, which correspond to the
intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively.
■ Also, petal length is broken into categories low, medium, and high, which correspond to the
intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.
■ Based on these categories of petal width and length, the following rules can be derived:
– Petal width low and petal length low implies Setosa.
– Petal width medium and petal length medium implies Versicolour.
– Petal width high and petal length high implies Virginica.
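These three rules translate directly into a tiny rule-based classifier. The sketch below hard-codes the category boundaries quoted above; a real predictive model would of course be learned from the data set rather than written by hand.

    # Rule-based classifier for the Iris example, using the petal-width/petal-length
    # categories quoted above.
    def categorize(value, medium_start, high_start):
        if value < medium_start:
            return "low"
        return "medium" if value < high_start else "high"

    def predict_species(petal_width, petal_length):
        w = categorize(petal_width, 0.75, 1.75)   # [0, 0.75) low, [0.75, 1.75) medium, [1.75, inf) high
        l = categorize(petal_length, 2.5, 5.0)    # [0, 2.5) low, [2.5, 5) medium, [5, inf) high
        if w == "low" and l == "low":
            return "Setosa"
        if w == "medium" and l == "medium":
            return "Versicolour"
        if w == "high" and l == "high":
            return "Virginica"
        return "unknown"   # combinations not covered by the three rules

    print(predict_species(0.2, 1.4))   # Setosa
    print(predict_species(1.3, 4.5))   # Versicolour
    print(predict_species(2.1, 5.8))   # Virginica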
Data Mining Tasks
■ Example: (Predicting the Type of a Flower):
Data Mining Tasks
Example: Predicting the Type of a Flower (figure)
Data Mining Tasks
■ Association analysis
– used to discover patterns that describe strongly associated features in the
data.
– Discovered patterns are represented in the form of implication rules or
feature subsets.
– Goal of association analysis:
■ To extract the most interesting patterns in an efficient manner.
– Example
■ finding groups of genes that have related functionality,
■ identifying Web pages that are accessed together, or
■ understanding the relationships between different elements of Earth’s climate
system.
Data Mining Tasks
■ Association analysis
■ Example (Market Basket Analysis).
– AIM: find items that are frequently bought together by customers.
– Association rule {Diapers} −→ {Milk},
■ suggests that customers who buy diapers also tend to buy milk.
■ This rule can be used to identify potential cross-selling opportunities among related
items.
The transactions data collected at the checkout counters of a grocery store.
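Counting co-occurrences in such transaction data is enough to surface a rule like {Diapers} → {Milk}; a small sketch over invented transactions computes the rule's support and confidence.

    # Support and confidence for the rule {Diapers} -> {Milk} over invented transactions.
    transactions = [
        {"Bread", "Butter", "Diapers", "Milk"},
        {"Coffee", "Sugar", "Cookies", "Salmon"},
        {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
        {"Bread", "Butter", "Salmon", "Chicken"},
        {"Diapers", "Milk", "Beer"},
    ]

    n = len(transactions)
    diapers = sum(1 for t in transactions if "Diapers" in t)
    both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)

    support = both / n            # fraction of all transactions containing both items
    confidence = both / diapers   # fraction of Diapers transactions that also contain Milk
    print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.60, confidence=1.00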
Data Mining Tasks
■ Cluster analysis
– Cluster analysis seeks to find groups of closely related observations so that
observations that belong to the same cluster are more similar than
observations that belong to other clusters.
– Clustering has been used to
■ group sets of related customers,
■ find areas of the ocean that have a significant impact on the Earth’s climate, and
■ compress data.
Data Mining Tasks
■ Cluster analysis
– Example 1.3 (Document Clustering)
– Each article is represented as a set of word-frequency pairs (w, c),
■ where w is a word and
■ c is the number of times the word appears in the article.
– There are two natural clusters in the data set.
– First cluster -> first four articles (news about the economy)
– Second cluster-> last four articles ( news about health care)
– A good clustering algorithm should be able to identify these two clusters
based on the similarity between words that appear in the articles.
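The word-frequency representation makes document similarity easy to compute; the sketch below (made-up articles and counts) uses cosine similarity, the kind of signal a clustering algorithm would rely on to separate the two groups.

    # Cosine similarity between documents represented as word-frequency dictionaries (made-up data).
    from math import sqrt

    docs = {
        "d1": {"dollar": 3, "industry": 2, "loan": 2},
        "d2": {"dollar": 2, "budget": 3, "deficit": 2},
        "d3": {"patient": 4, "symptom": 2, "drug": 3},
        "d4": {"health": 3, "physician": 2, "drug": 1},
    }

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0) for w in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm

    # Economy articles share words with each other, health articles with each other,
    # and the two groups share none, so cross-group similarity is 0: two natural clusters.
    print(cosine(docs["d1"], docs["d2"]))   # > 0
    print(cosine(docs["d3"], docs["d4"]))   # > 0
    print(cosine(docs["d1"], docs["d3"]))   # 0.0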
Data Mining Tasks
■ Anomaly Detection:
– Task of identifying observations whose characteristics are significantly
different from the rest of the data.
– Such observations are known as anomalies or outliers.
– A good anomaly detector must have a high detection rate and a low false alarm
rate.
– Applications of anomaly detection include
■ the detection of fraud,
■ network intrusions,
■ unusual patterns of disease, and
■ ecosystem disturbances
https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png
Data Mining Tasks
■ Anomaly Detection:
– Example 1.4 (Credit Card Fraud Detection).
– A credit card company records the transactions made by every credit card
holder, along with personal information such as credit limit, age, annual income,
and address.
– Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be
applied to build a profile of legitimate transactions for the users.
– When a new transaction arrives, it is compared against the profile of the user. If
the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
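A very simple version of such a profile check flags a transaction whose amount is far from the cardholder's historical mean; the sketch below (made-up amounts) uses a z-score threshold as a stand-in for a real anomaly detector.

    # Toy transaction-profile check: flag a new amount as potentially fraudulent when it is
    # far from the user's historical mean (z-score threshold). Amounts are made up.
    from statistics import mean, stdev

    history = [23.5, 41.0, 18.2, 36.9, 25.4, 30.1, 22.8, 45.0]   # legitimate past transactions

    def is_anomalous(amount, history, threshold=3.0):
        mu, sigma = mean(history), stdev(history)
        z = abs(amount - mu) / sigma
        return z > threshold

    print(is_anomalous(35.0, history))    # False: consistent with the profile
    print(is_anomalous(2500.0, history))  # True: very different from previous behavior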
Types of Data
■ Data set - collection of data objects.
■ Other names for a data object are:-
– record,
– point,
– vector,
– pattern,
– event,
– case,
– sample,
– observation, or
– entity.
Types of Data
■ Data objects are described by a number of attributes that
capture the basic characteristics of an object.
■ Example:-
– mass of a physical object or
– time at which an event occurred.
■ Other names for an attribute are:-
– variable,
– characteristic,
– field,
– feature, or
– dimension.
Types of Data
■ Example:-
■ Dataset - Student Information.
■ Each row corresponds to a student.
■ Each column is an attribute that describes some aspect of a
student.
Types of Data
■ Attributes and Measurement
– An attribute is a property or characteristic of an object
that may vary, either from one object to another or from
one time to another.
– Example,
■ eye color varies from person to person, while the
temperature of an object varies over time.
– Eye color is a symbolic attribute with a small number of
possible values {brown, black, blue, green, hazel, etc.},
– Temperature is a numerical attribute with a potentially
unlimited number of values.
Types of Data
■ Attributes and Measurement
– A measurement scale is a rule (function) that associates
a numerical or symbolic value with an attribute of an
object.
– process of measurement
■ application of a measurement scale to associate a
value with a particular attribute of a specific object.
Properties of Attribute Values
■ The type of an attribute depends on which of the following
properties it possesses:
■ Distinctness: = ≠
■ Order: < >
■ Addition: + ‐
■ Multiplication: * /
■ Nominal attribute: distinctness
■ Ordinal attribute: distinctness & order
■ Interval attribute: distinctness, order & addition
■ Ratio attribute: all 4 properties
Types of Data
■ Properties of Attribute Values
– Nominal - attributes to differentiate between one object
and another.
– Roll, EmpID
– Ordinal - attributes to order the objects.
– Rankings, Grades, Height
– Interval - measured on a scale of equal size units
– no Zero point
– Temperatures in C & F, Calendar Dates
– Ratio - numeric attribute with an inherent zero-point.
– value as being a multiple (or ratio) of another
value.
– Weight, No. of Staff, Income/Salary
Types of Data Properties of Attribute Values
Types of Data
Properties of Attribute Values - Transformations
– yielding the same results when the attribute is
transformed using a transformation that preserves
the attribute’s meaning.
– Example:-
■ the average length of a set of objects is different
when measured in meters rather than in feet, but
both averages represent the same length.
Types of Data
Properties of Attribute Values - Transformations
Types of Data
Attribute Types
Data
– Qualitative / Categorical (no properties of integers): Nominal, Ordinal
– Quantitative / Numeric (properties of integers): Interval, Ratio
Types of Data
■ Describing Attributes by the Number of Values
a. Discrete
■ finite or countably infinite set of values.
■ Categorical - zip codes or ID numbers, or
■ Numeric - counts.
■ Binary attributes (special case of discrete)
– assume only two values,
– e.g., true/false, yes/no, male/female, or 0/1.
b. Continuous
■ values are real numbers.
■ Ex:- temperature, height, or weight.
Any of the measurement scale types—nominal, ordinal, interval, and ratio—could be combined
with any of the types based on the number of attribute values—binary, discrete, and continuous.
Types of Data - Types of Dataset
General Characteristics of Data Sets
■ 3 characteristics that apply to many data sets are:-
– dimensionality,
– sparsity, and
– resolution.
■ Dimensionality - number of attributes that the objects in the data set possess.
– data with a small number of dimensions tends to be of higher quality than moderate- or
high-dimensional data.
– curse of dimensionality & dimensionality reduction.
■ Sparsity - data sets, with asymmetric features, most attributes of an object
have values of 0;
– fewer than 1% of the entries are non-zero.
■ Resolution - Data will be gathered at different levels of resolution
– Example:- the surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of tens of kilometers.
Types of Data - Types of Dataset
■ Record Data
– data set is a collection of records (data objects), each of which consists of
a fixed set of data fields (attributes).
– No relationships b/w records
– Same attributes for all records
– Flat files or relational DB.
Types of Data - Types of Dataset
■ Transaction or Market Basket Data
– special type of record data
– Each record (transaction) involves a set of items.
– Also called market basket data because the items in each record are the
products in a person’s “market basket.”
– Can be viewed as a set of records whose fields are asymmetric attributes.
Types of Data - Types of Dataset
■ Data Matrix / Pattern Matrix
– fixed set of numeric attributes,
– Data objects = points (vectors) in a multidimensional space
– each dimension = a distinct attribute describing the object.
– A set of such data objects can be interpreted as
■ an m by n matrix,
– where there are
– m rows, one for each object,
– and n columns, one for each attribute.
– Standard matrix operation can be applied to transform and manipulate the
data.
Types of Data - Types of Dataset
■ Sparse Data Matrix:
– Special case of a data matrix
– attributes are of the
– attributes are of the
■ same type and
■ asymmetric; i.e., only non-zero values are important.
– Example:-
■ Transaction data which has only 0–1 entries.
■ Document Term Matrix - collection of term vector
– One Term vector represents - one document ( one row in matrix)
– Attribute of vector - each term in the document ( one col in matrix)
– value in term vector under an attribute is number of times the
corresponding term occurs in the document.
Types of Data - Types of Dataset
■ Graph based Data:
– Data can be represented in the form of Graph.
– Graphs are used for 2 specific reasons
■ (1) the graph captures relationships among data objects and
■ (2) the data objects themselves are represented as graphs.
– Data with Relationships among Objects
■ Relationships among objects also convey important information.
■ Relationships among objects are captured by the links between objects
and link properties, such as direction and weight.
■ Example:
– Web page in www contain both text and links to other pages.
– Web search engines collect and process Web pages to extract their
contents.
– Links to and from each page provide a great deal of information
about the relevance of a Web page to a query, and thus, must also
be taken into consideration.
Types of Data - Types of Dataset
■ Graph based Data:
– Data with Objects That Are Graphs
■ When objects contain sub-objects that have relationships, then such
objects are frequently represented as graphs.
■ Example:-Structure of chemical compounds
■ Atoms are - nodes
■ Chemical Bonds - links between nodes
– ball-and-stick diagram of the chemical compound benzene,
which contains atoms of carbon (black) and hydrogen (gray).
Substructure mining
Types of Data - Types of Dataset
■ Ordered Data:
– In some data, the attributes have relationships that involve order in time or
space.
– Sequential Data
■ Sequential data / temporal data
■ extension of record data - each record has a time associated with it.
■ Ex:- Retail transaction data set - stores the time of transaction
– time information used to find patterns
■ “candy sales peak before Halloween.”
■ Each attribute can also have a time associated with it
– Record - purchase history of a customer
■ with a listing of items purchased at different times.
– find patterns
■ “people who buy DVD players tend to buy DVDs in the period
immediately following the purchase.”
Types of Data - Types of Dataset
■ Ordered Data: Sequential
Types of Data - Types of Dataset
■ Ordered Data: Sequence Data
– consists of a data set that is a sequence
of individual entities,
– Example
■ sequence of words or letters.
– Example:
■ Genetic information of plants and
animals can be represented in the
form of sequences of nucleotides that
are known as genes.
■ Predicting similarities in the structure
and function of genes from similarities
in nucleotide sequences.
– Ex:- Human genetic code expressed
using the four nucleotides from which all
DNA is constructed: A, T, G, and C.
Types of Data - Types of Dataset
■ Ordered Data: Time Series Data
– Special type of sequential data in
which each record is a time series,
– A series of measurements taken over
time.
– Example:
■ Financial data set might contain
objects that are time series of the
daily prices of various stocks.
– Temporal autocorrelation; i.e., if two
measurements are close in time, then
the values of those measurements are
often very similar.
(Figure: time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.)
Types of Data - Types of Dataset
■ Ordered Data: Spatial Data
■ Some objects have spatial attributes,
such as positions or areas, as well as
other types of attributes.
■ An example of spatial data is
– weather data (precipitation,
temperature, pressure) that is
collected for a variety of geographical
locations.
■ spatial autocorrelation; i.e., objects that
are physically close tend to be similar in
other ways as well.
■ Example
– two points on the Earth that are close
to each other usually have similar
values for temperature and rainfall.
(Figure: average monthly temperature of land and ocean)
Data Quality
Unit – II- DWDM
Data Quality
● Data mining applications are often applied to data that was collected for another purpose, or for
future but as-yet-unspecified applications.
● Data mining focuses on
(1) the detection and correction of data quality problems - Data Cleaning
(2) the use of algorithms that can tolerate poor data quality.
● Measurement and Data Collection Issues
● Issues Related to Applications
Data Quality
● Measurement and Data Collection Issues
● problems due to human error,
● limitations of measuring devices, or
● flaws in the data collection process.
● Values or even entire data objects may be missing.
● Spurious or duplicate objects; i.e., multiple data objects that all correspond to a
single “real” object.
○ Example - there might be two different records for a person who has recently lived at two
different addresses.
● Inconsistencies—
○ Example - a person has a height of 2 meters, but weighs only 2 kilograms.
Data Quality
● Measurement and Data Collection Errors
○ Measurement error - any problem resulting from the measurement process.
■ Value recorded differs from the true value to some extent.
■ Continuous attributes:
● Numerical difference of the measured and true value is called the
error.
○ Data collection error - errors such as omitting data objects or attribute
values, or inappropriately including a data object.
■ For example, a study of animals of a certain species might include animals
of a related species that are similar in appearance to the species of
interest.
Data Quality
● Measurement and Data Collection Errors
○ Noise and Artifacts:
○ Noise is the random component of a measurement error.
○ It may involve the distortion of a value or the addition of spurious objects.
Data Quality
● Measurement and Data Collection Errors
○ Noise and Artifacts:
○ The term noise is often used in connection with data that has a spatial or temporal component.
○ Techniques from signal or image processing can frequently be used to reduce
noise
■ These will help to discover patterns (signals) that might be “lost in the
noise.”
○ Note:Elimination of noise - difficult
■ robust algorithms - produce acceptable results even when noise is present.
Data Quality
● Measurement and Data Collection Errors
○ Noise and Artifacts:
■ Artifacts: Deterministic distortions of the data
■ Data errors may be the result of a more deterministic phenomenon, such
as a streak in the same place on a set of photographs.
Data Quality
● Measurement and Data Collection Errors
● Precision, Bias, and Accuracy:
○ Precision:
■ The closeness of repeated measurements (of the same quantity) to one another.
■ Precision is often measured by the standard deviation of a set of values
○ Bias:
■ A systematic variation of measurements from the quantity being measured.
■ Bias is measured by taking the difference between the mean of the set of values and the
known value of the quantity being measured.
○ Example:
■ standard laboratory weight with a mass of 1g and want to assess the precision and bias of our
new laboratory scale.
■ weigh the mass five times & values are: {1.015, 0.990, 1.013, 1.001, 0.986}.
■ The mean of these values is 1.001, and hence, the bias is 0.001.
■ The precision, as measured by the standard deviation, is 0.013.
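A quick NumPy check of this example (a sketch only; ddof=1 gives the sample standard deviation quoted above):

# verify the precision/bias example (illustrative sketch)
import numpy as np

measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                         # known mass of the standard weight (1 g)

mean = measurements.mean()
bias = mean - true_value                 # systematic deviation from the true value
precision = measurements.std(ddof=1)     # sample standard deviation

print("Mean:", round(mean, 3))           # 1.001
print("Bias:", round(bias, 3))           # 0.001
print("Precision:", round(precision, 3)) # 0.013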
Data Quality
● Measurement and Data Collection Errors
● Precision, Bias, and Accuracy:
○ Accuracy:
■ The closeness of measurements to the true value of the
quantity being measured.
Data Quality
● Measurement and Data Collection Errors
● Outliers:
○ Outliers are either
■ (1) data objects that, in some sense, have characteristics that
are different from most of the other data objects in the data set,
or
■ (2) values of an attribute that are unusual with respect to the
typical values for that attribute.
○ Alternatively - anomalous objects or values.
Data Quality
● Measurement and Data Collection Errors
● Missing Values:
○ Eliminate Data Objects or Attributes
○ Estimate Missing Values
○ Ignore the Missing Value during Analysis
○ Inconsistent Values
Data Quality
● Measurement and Data Collection Errors
● Duplicate Data: Same Data in multiple Data Objects
○ To detect and eliminate such duplicates, two main issues
must be addressed.
■ First - if two objects represent a single object, then the values of
corresponding attributes may differ, and these inconsistent
values must be resolved
■ Second - care needs to be taken to avoid accidentally combining
data objects that are similar - deduplication
Data Quality “data is of high quality if it is suitable for its intended use.”
● Issues Related to Applications:
● Timeliness:
○ If the data is out of date, then so are the models and patterns that are based on it.
● Relevance:
○ The available data must contain the information necessary for the application.
○ Consider the task of building a model that predicts the accident rate for drivers. If information about the age and
gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information
is indirectly available through other attributes.
● Knowledge about the Data:
○ Data sets are often accompanied by documentation that describes different aspects of the data;
○ the quality of this documentation can help or hinder the subsequent analysis.
○ For example,
■ if the documentation is poor and fails to tell us that the missing values for a
particular field are indicated with a -9999, then our analysis of the data may be faulty.
○ Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval,
ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
DATA PREPROCESSING
Datamining
Unit - II
AGGREGATION
• “less is more”
• Aggregation - combining of two or more objects into a single object.
• Example: consider a data set of transactions recording the daily sales of products at various store locations.
• One way to aggregate transactions for this data set is to replace all the transactions of a single store with a
single storewide transaction.
• This reduces number of records (1 record per store).
• How an aggregate transaction is created
• Quantitative attributes, such as price, are typically aggregated by taking a sum or an average.
• A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that
were sold at that location.
• Can also be viewed as a multidimensional array, where each attribute is a dimension.
• Used in OLAP
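A minimal pandas sketch of this kind of storewide aggregation (the column names and values are invented for illustration, not taken from the original table):

# aggregate transaction records per store (illustrative sketch)
import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["bread", "milk", "bread", "soda", "milk"],
    "price": [2.5, 1.2, 2.5, 1.0, 1.2],
})

storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),           # quantitative attribute: aggregate by sum
    items_sold=("item", lambda s: set(s)),  # qualitative attribute: summarize as a set
)
print(storewide)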
AGGREGATION
• Motivations for aggregation
• Smaller data sets require less memory and processing time which
allows the use of more expensive data mining algorithms.
• Availability of change of scope or scale
• by providing a high-level view of the data instead of a low-level view.
• Behavior of groups of objects or attributes is often more stable than
that of individual objects or attributes.
• Disadvantage of aggregation
• potential loss of interesting details.
AGGREGATION
Example: average yearly precipitation has less variability than the average monthly precipitation.
SAMPLING
• Approach for selecting a subset of the data objects to be analyzed.
• Data miners sample because it is too expensive or time consuming to
process all the data.
• The key principle for effective sampling is the following:
• Using a sample will work almost as well as using the entire data set if the sample
is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
• Choose a sampling scheme/Technique – which gives high probability of getting a
representative sample.
SAMPLING
• Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive
• Simple random sampling
• equal probability of selecting any particular item.
• Two variations on random sampling:
• (1) sampling without replacement—as each item is selected, it is removed from the set of all objects that
together constitute the population, and
• (2) sampling with replacement—objects are not removed from the population as they are selected for the
sample.
• Problem: When the population consists of different types of objects, with widely different numbers of
objects, simple random sampling can fail to adequately represent those types of objects that are less
frequent.
• Stratified sampling:
• starts with prespecified groups of objects
• Simpler version -equal numbers of objects are drawn from each group even though the groups are of
different sizes.
• Other - the number of objects drawn from each group is proportional to the size of that group.
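A small pandas sketch contrasting simple random and stratified sampling (the group sizes and the 10% sampling fraction are invented for illustration):

# simple random vs. stratified sampling (illustrative sketch)
import pandas as pd

data = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,   # the rare group B is easy to miss
    "value": range(100),
})

# simple random sampling without replacement
simple = data.sample(n=10, replace=False, random_state=0)

# stratified sampling: draw from each group in proportion to its size (at least 1)
stratified = pd.concat(
    g.sample(max(1, int(round(0.1 * len(g)))), random_state=0)
    for _, g in data.groupby("group")
)
print(simple["group"].value_counts())
print(stratified["group"].value_counts())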
SAMPLING
Sampling and Loss of Information
• Larger sample sizes increase the probability that a sample will be representative, but they also eliminate
much of the advantage of sampling.
• Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
SAMPLING
Determining the Proper Sample Size
• Desired outcome: at least one point will be obtained from each cluster.
• Probability of getting one object from each of the 10 groups increases as the sample size runs from 10
to 60.
SAMPLING
• Adaptive/Progressive Sampling:
• Proper sample size - Difficult to determine
• Start with a small sample, and then increase the sample size until a
sample of sufficient size has been obtained.
• Eliminates the need to determine the correct sample size at the start
• Stop increasing the sample size at leveling-off point(where no
improvement in the outcome is identified).
DIMENSIONALITY REDUCTION
• Data sets can have a large number of features.
• Example
• a set of documents, where each document is represented by a vector
whose components are the frequencies with which each word occurs in
the document.
• thousands or tens of thousands of attributes (components), one for each
word in the vocabulary.
DIMENSIONALITY REDUCTION
• Benefits to dimensionality reduction.
• Data mining algorithms work better if the dimensionality is lower.
• It eliminates irrelevant features and reduce noise
• Lead to a more understandable model
• fewer attributes
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
• Reduce the dimensionality of a data set by creating new attributes that are a combination of the old
attributes.
• Feature subset selection or feature selection:
• The reduction of dimensionality by selecting new attributes that are a subset of the old.
DIMENSIONALITY REDUCTION
• The Curse of Dimensionality
• Data analysis become significantly harder as the dimensionality of the data
increases.
• data becomes increasingly sparse
• Classification
• there are not enough data objects to allow the creation of a model that reliably assigns a class to all possible objects.
• Clustering
• density and the distance between points - becomes less meaningful
DIMENSIONALITY REDUCTION
• Linear Algebra Techniques for Dimensionality Reduction
• Principal Components Analysis (PCA)
• for continuous attributes
• finds new attributes (principal components) that
• (1) are linear combinations of the original attributes,
• (2) are orthogonal (perpendicular) to each other, and
• (3) capture the maximum amount of variation in the data.
• Singular Value Decomposition (SVD)
• Related to PCA
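A minimal scikit-learn sketch of PCA (assumes scikit-learn is available; the random data is purely illustrative):

# reduce 10 continuous attributes to 2 principal components (illustrative sketch)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # 100 objects, 10 continuous attributes

pca = PCA(n_components=2)            # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component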
FEATURE SUBSET SELECTION
• Another way to reduce the dimensionality - use only a subset of the features.
• Redundant Features
• Example:
• Purchase price of a product and the amount of sales tax paid
• Redundant to each other
• contain much of the same information.
• Irrelevant features contain almost no useful information for the data mining task at hand.
• Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages.
• Redundant and irrelevant features
• reduce classification accuracy and the quality of the clusters that are found.
• can be eliminated immediately by using common sense or domain knowledge,
• systematic approach - for selecting the best subset of features
• Ideal approach - try all possible subsets of features as input to the data mining algorithm of interest, and
then take the subset that produces the best results (usually impractical, since the number of subsets grows exponentially).
FEATURE SUBSET SELECTION
•3 standard approaches to feature
selection:
•Embedded
•Filter
•Wrapper
FEATURE SUBSET SELECTION
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm.
• During execution of algorithm, the Algorithm itself decides which attributes to use
and which to ignore.
• Example:- Algorithms for building decision tree classifiers
• Filter approaches:
• Features are selected before the data mining algorithm is run
• Approach that is independent of the data mining task.
• Wrapper approaches:
• Uses the target data mining algorithm as a black box to find the best subset of
attributes
• typically without enumerating all possible subsets.
FEATURE SUBSET SELECTION
• An Architecture for Feature Subset Selection :
• The feature selection process is viewed as consisting of four parts:
1. a measure for evaluating a subset,
2. a search strategy that controls the generation of a new subset of features,
3. a stopping criterion, and
4. a validation procedure.
• Filter methods and wrapper methods differ only in the way in which they
evaluate a subset of features.
• wrapper method – uses the target data mining algorithm
• filter approach - evaluation technique is distinct from the target data mining
algorithm.
FEATURE SUBSET SELECTION
FEATURE SUBSET SELECTION
• Feature subset selection is a search over all possible subsets of features.
• Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task
• Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes.
• Wrapper approach: running the target data mining application, measure the result of the data mining.
• Stopping criterion
• conditions involving the following:
• the number of iterations,
• whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• whether a subset of a certain size has been obtained,
• whether simultaneous size and evaluation criteria have been achieved, and
• whether any improvement can be achieved by the options available to the search strategy.
• Validation:
• Finally, the results of the target data mining algorithm on the selected subset should be validated.
• An evaluation approach: run the algorithm with the full set of features and compare the full results to results
obtained using the subset of features.
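A toy sketch of a filter-style evaluation step (this particular correlation-based score is one possible choice for illustration, not the one prescribed by the slides):

# filter-style feature selection: score each feature independently, keep the top k
import numpy as np

def filter_select(X, y, k):
    # absolute correlation of each feature with the target, independent of any classifier
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top_k = np.argsort(scores)[::-1][:k]
    return np.sort(top_k)                # indices of the selected features

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(scale=0.1, size=50)   # depends on features 1 and 4
print(filter_select(X, y, k=2))          # likely selects features 1 and 4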
FEATURE SUBSET SELECTION
• Feature Weighting
• An alternative to keeping or eliminating features.
• One Approach
• Higher weight - More important features
• Lower weight - less important features
• Another Approach – automatic
• Example – Classification Scheme - Support vector machines
• Other Approach
• The normalization of objects that occurs when cosine similarity is used can also be viewed as a form of feature weighting
FEATURE CREATION
• Create a new set of attributes that captures the important
information in a data set from the original attributes
• much more effective.
• No. of new attributes < No. of original attributes
• Three related methodologies for creating new attributes:
1. Feature extraction
2. Mapping the data to a new space
3. Feature construction
FEATURE CREATION
• Feature Extraction
• The creation of a new set of features from the original raw data
• Example: Classify set of photographs based on existence of human face
(present or not)
• Raw data (set of pixels) - not suitable for many types of classification algorithms.
• Higher level features( presence or absence of certain types of edges and areas that are highly correlated with
the presence of human faces), then a much broader set of classification techniques can be applied to this
problem.
• Feature extraction is highly domain-specific
• New area means development of new features and feature extraction
methods.
FEATURE CREATION
Mapping the Data to a New Space
• A totally different view of the data can reveal important and interesting features.
• If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
• If, however, there are a number of periodic patterns and a significant amount of noise is present, then these
patterns are hard to detect.
• Such patterns can be detected by applying a Fourier transform to the time series in order to
change to a representation in which frequency information is explicit.
• Example:
• Power spectrum that can be computed after applying a Fourier transform to the original time series.
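A small NumPy sketch of this idea (the frequencies of 7 and 30 cycles and the noise level are made up):

# make frequency information explicit with a Fourier transform (illustrative sketch)
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
series = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 30 * t) \
         + 0.8 * np.random.default_rng(0).normal(size=t.size)

spectrum = np.abs(np.fft.rfft(series)) ** 2        # power spectrum of the time series
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
print(freqs[np.argsort(spectrum)[::-1][:2]])       # strongest frequencies: about 7 and 30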
FEATURE CREATION
• Feature Construction
• Features in the original data sets consists necessary information, but not suitable for the data mining
algorithm.
• New features constructed from the original features can be more useful than the original features.
• Example (Density).
• Dataset contains the volume and mass of historical artifact.
• Density feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
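A tiny pandas sketch of the density example (the mass and volume values are invented):

# construct a new feature from mass and volume (illustrative sketch)
import pandas as pd

artifacts = pd.DataFrame({
    "mass_g":     [19.3, 10.5, 8.9],
    "volume_cm3": [1.0, 1.0, 1.0],
})
artifacts["density"] = artifacts["mass_g"] / artifacts["volume_cm3"]   # density = mass / volume
print(artifacts)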
DISCRETIZATION AND BINARIZATION
• Classification algorithms, require that the data be in the form of
categorical attributes.
• Algorithms that find association patterns require that the data be in
the form of binary attributes.
• Discretization - transforming a continuous attribute into a
categorical attribute
• Binarization - transforming both continuous and discrete attributes
into one or more binary attributes
DISCRETIZATION AND BINARIZATION
• Binarization of a categorical attribute (Simple technique):
• If there are m categorical values, then uniquely assign
each original value to an integer in the interval [0, m − 1].
• If the attribute is ordinal, then order must be maintained
by the assignment.
• Next, convert each of these m integers to a binary number
using n binary attributes
• n = ⌈log2(m)⌉ binary digits are required to represent these integers
DISCRETIZATION AND BINARIZATION
Example: a categorical variable with 5 values
{awful, poor, OK, good, great}
require three binary variables x1, x2, and x3.
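A short Python sketch of this encoding (the integer assignment preserves the given order; the resulting bit strings are one possible binarization):

# binarize the ordinal attribute {awful, poor, OK, good, great} (illustrative sketch)
import math

values = ["awful", "poor", "OK", "good", "great"]     # m = 5, order preserved
n_bits = math.ceil(math.log2(len(values)))            # 3 binary attributes x1, x2, x3

for i, v in enumerate(values):
    bits = format(i, f"0{n_bits}b")                   # e.g. 'great' -> 4 -> '100'
    print(v, "->", i, "->", bits)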
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes ( classification or
association analysis)
• Transformation of a continuous attribute to a categorical attribute
involves two subtasks:
• decide no. of categories
• decide how to map the values of the continuous attribute to these
categories.
• Step I: Sort Attribute Values and divide into n intervals by specifying n − 1
split points.
• Step II : all the values in one interval are mapped to the same categorical
value.
DISCRETIZATION AND BINARIZATION
• Discretization of Continuous Attributes
• Problem of discretization is
• Deciding how many split points to choose and
• where to place them.
• The result can be represented either as
• a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)},
where x0 and xn may be −∞ or +∞, respectively,
or
• as a series of inequalities x0 < x ≤ x1,..., xn−1 < x < xn.
DISCRETIZATION AND BINARIZATION
• UnSupervised Discretization
• Discretization methods for Classification
• Supervised - known class information
• Unsupervised - unknown class information
• Equal width approach:
• divides the range of the attribute into a user-specified number of
intervals each having the same width.
• problem with outliers
• Equal frequency (equal depth) approach:
• Puts same number of objects into each interval
• K-means Clustering method
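A pandas sketch of equal-width and equal-frequency discretization (the data values and the choice of 4 intervals are illustrative):

# unsupervised discretization with pandas (illustrative sketch)
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).normal(loc=50, scale=15, size=100))

equal_width = pd.cut(values, bins=4)     # intervals of equal width (sensitive to outliers)
equal_freq  = pd.qcut(values, q=4)       # intervals with (roughly) equal numbers of objects

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())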
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
Original Data
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
Equal Width Discretization
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
Equal Frequency Discretization
DISCRETIZATION AND BINARIZATION
UnSupervised Discretization
K-means Clustering (better result)
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• When additional information (class labels) are used then it
produces better results.
• Some Concerns: purity of an interval and the minimum size of
an interval.
• statistically based approaches:
• start with each attribute value as a separate interval and
create larger intervals by merging adjacent intervals that are
similar according to a statistical test.
• Entropy based approaches:
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Entropy Definition
• ei - Entropy of the i th interval: ei = − Σj pij log2(pij), summed over the k classes
• pij = mij/mi probability of class j in the i th interval.
• k - no. of different class labels
• mi - no. of values in the i th interval of a partition,
• mij - no. of values of class j in interval i.
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Total entropy, e, of the partition is the
• weighted average of the individual interval entropies: e = Σi wi ei
• m - no. of values,
• wi = mi/m fraction of values in the i th interval
• n - no. of intervals.
• Perfectly Pure Interval:entropy is 0
• If an interval contains only values of one class
• Maximally impure interval: entropy is maximum
• when the classes of the values in an interval occur equally often
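A small Python sketch of these entropy formulas (the class counts per interval are invented):

# interval entropies and the weighted total entropy of a partition (illustrative sketch)
import math

def entropy(class_counts):
    m_i = sum(class_counts)
    probs = [c / m_i for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

intervals = [[10, 0], [4, 4], [1, 9]]        # mij: counts of 2 classes in 3 intervals
m = sum(sum(counts) for counts in intervals)

e = sum((sum(counts) / m) * entropy(counts) for counts in intervals)   # e = sum_i wi * ei
print([round(entropy(counts), 3) for counts in intervals])             # [0.0, 1.0, 0.469]
print("total entropy:", round(e, 3))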
DISCRETIZATION AND BINARIZATION
• Supervised Discretization
• Entropy based approaches:
• Simple approach for partitioning a continuous attribute:
• starts by bisecting the initial values so that the resulting
two intervals give minimum entropy.
• consider each value as a possible split point
• Repeat splitting process with another interval
• choosing the interval with the worst (highest) entropy,
• until a user-specified number of intervals is reached,
or
• stopping criterion is satisfied.
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy based
approaches:
• 3 categories for
both x & y
DISCRETIZATION AND BINARIZATION
• Supervised
Discretization
• Entropy based
approaches:
• 5 categories for
both x & y
• Observation:
• no improvement
for 6 categories
DISCRETIZATION AND BINARIZATION
• Categorical Attributes with Too Many Values
• If categorical attribute is an ordinal,
• techniques similar to those for continuous attributes
• If the categorical attribute is nominal,
• Example:-
• University that has a large number of departments.
• department name attribute - dozens of diff. values.
• combine departments into larger groups, such as
• engineering,
• social sciences, or
• biological sciences.
Variable Transformation
• Transformation that is applied to all the values of a variable.
• Example: magnitude of a variable is important
• then the values of the variable can be transformed by taking the absolute
value.
• Simple Function Transformation:
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of such transformations include
• x^k,
• log x,
• e^x,
• √x,
• 1/x,
• sin x, or |x|
Variable Transformation
• Variable transformations should be applied with caution since they
change the nature of the data.
• Example:-
• transformation fun. is 1/x
• if the value is greater than 1, the transformation reduces its magnitude (a value of 1 is unchanged)
• values {1, 2, 3} go to {1, 1/2, 1/3}
• if value is b/w 0 & 1 then increases the magnitude of values
• values {1, 1/2, 1/3} go to {1, 2, 3}.
• so better ask questions such as the following:
• Does the order need to be maintained?
• Does the transformation apply to all values( -ve & 0)?
• What is the effect of the transformation on the values between 0 & 1?
Variable Transformation
• Normalization or Standardization
• Goal of standardization or normalization
• To make an entire set of values have a particular property.
• A traditional example is that of “standardizing a variable” in statistics.
• x̄ - mean (average) of the attribute values and
• sx - standard deviation,
• Transformation: x' = (x − x̄) / sx
• creates a new variable that has a mean of 0 and a standard deviation
of 1.
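A one-line NumPy sketch of this standardization (the values are invented):

# standardize a variable to mean 0 and standard deviation 1 (illustrative sketch)
import numpy as np

x = np.array([10.0, 12.0, 9.0, 15.0, 14.0])
x_std = (x - x.mean()) / x.std()
print(round(x_std.mean(), 10), round(x_std.std(), 10))   # ~0.0 and 1.0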
Variable Transformation
• Normalization or Standardization
• If different variables are to be combined, a transformation is necessary
to avoid having a variable with large values dominate the results of the
calculation.
• Example:
• comparing people based on two variables: age and income.
• For any two people, the difference in income will likely be much
higher in absolute terms (hundreds or thousands of dollars) than the
difference in age (less than 150).
• Income values(higher values) will dominate the calculation.
Variable Transformation
• Normalization or Standardization
• Mean and standard deviation are strongly affected by outliers
• Mean is replaced by the median, i.e., the middle value.
• x - variable
• absolute standard deviation of x is σA = Σi |xi − µ| (summed over the m values)
• xi - i th value of the variable,
• m - number of objects, and
• µ - mean or median.
• Other approaches
• computing estimates of the location (center) and
• spread of a set of values in the presence of outliers
• These measures can also be used to define a standardization transformation.
Measures of
Similarity and
Dissimilarity
Unit - II
Datamining
Measures of Similarity and
Dissimilarity
● Similarity and dissimilarity are important because they are used by a
number of data mining techniques
○ such as
■ clustering,
■ nearest neighbor classification, and
■ anomaly detection.
● Proximity is used to refer to either similarity or dissimilarity.
○ proximity between objects having only one simple attribute, and
○ proximity measures for objects with multiple attributes.
Measures of Similarity and Dissimilarity
● Similarity between two objects is a numerical measure of the
degree to which the two objects are alike.
○ Similarity - high -objects that are more alike.
○ Non-negative
○ between 0 (no similarity) and 1 (complete similarity).
● Dissimilarity between two objects is a numerical measure of the
degree to which the two objects are different.
○ Dissimilarity - low - objects are more similar.
○ Distance - synonym for dissimilarity
Measures of Similarity and Dissimilarity
Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity
○ to transform a proximity measure to fall within a particular range, such as [0,1].
● Example
○ Similarities between objects range from 1 (not at all similar) to 10 (completely
similar)
○ we can make them fall within the range [0, 1] by using the transformation
■ s’ = (s−1)/9
■ s - Original Similarity
■ s’ - New similarity values
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Euclidean Distance
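The distance formula itself appears on the original slide only as an image; in standard notation it is the following (n is the number of attributes, and xk, yk are the kth attributes of x and y):

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} $$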
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Note:-Measures that satisfy all three properties are known as metrics.
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Set Differences
A = {1, 2, 3, 4} and B = {2, 3, 4},
then A − B = {1} and
B − A = ∅, the empty set.
If d(A, B) = size(A − B), then it does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality.
d(A, B) = size(A − B) + size(B − A) (modified which follows all properties)
Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Time
Dissimilarity measure that is not a metric,but still useful.
d(1PM, 2PM) = 1 hour
d(2PM, 1PM) = 23 hours
● Example:- when answering the question: “If an event occurs at 1PM
every day, and it is now 2PM, how long do I have to wait for that event to
occur again?”
Distance in python
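The code on the original slide is an image; one possible sketch using SciPy's distance functions:

# compute common dissimilarities between two points (illustrative sketch)
import numpy as np
from scipy.spatial import distance

x = np.array([1, 2, 3])
y = np.array([4, 0, 3])

print("Euclidean:", distance.euclidean(x, y))    # L2 distance
print("Manhattan:", distance.cityblock(x, y))    # L1 distance
print("Supremum:", distance.chebyshev(x, y))     # L-infinity distance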
Measures of Similarity and Dissimilarity
Similarities between Data Objects
● Typical properties of similarities are the following:
○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry)
● A Non-symmetric Similarity Measure
○ Classify a small set of characters which is flashed on a screen.
○ Confusion matrix - records how often each character is classified as itself,
and how often each is classified as another character.
○ “0” appeared 200 times but classified as
■ “0” 160 times,
■ “o” 40 times.
○ ‘o’ appeared 200 times and was classified as
■ “o” 170 times
■ “0” only 30 times.
● similarity measure can be made symmetric by setting
○ s′(x, y) = s′(y, x) = (s(x, y) + s(y, x)) / 2,
■ s′ - new similarity measure.
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
○ Similarity measures between objects that contain only binary
attributes are called similarity coefficients
○ Let x and y be two objects that consist of n binary attributes.
○ The comparison of two objects (or two binary vectors), leads to
the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
Simple Matching Coefficient (SMC) = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard Coefficient (J) = f11 / (f01 + f10 + f11)
Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
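The worked example on this slide is an image; a NumPy sketch computing SMC and Jaccard for two illustrative binary vectors:

# SMC and Jaccard for two binary vectors (illustrative sketch)
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print("SMC:", smc)          # 0.7
print("Jaccard:", jaccard)  # 0.0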
Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
# import required libraries
import numpy as np
from numpy.linalg import norm
# define two lists or array
A = np.array([2,1,2,3,2,9])
B = np.array([3,4,2,4,5,5])
print("A:", A)
print("B:", B)
# compute cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))
print("Cosine Similarity:", cosine)
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
● Cosine similarity - measure of angle between x and y.
● Cosine similarity = 1 (angle is 0◦, and x & y are same (except magnitude or length))
● Cosine similarity = 0 (angle is 90°, and x & y do not share any terms (words))
Measures of Similarity and Dissimilarity
Examples of proximity measures
cosine similarity (Document similarity)
Note:-
Dividing x and y by their lengths normalizes them to have a length of 1 ( means magnitude is not
considered)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Extended Jaccard Coefficient (Tanimoto Coefficient)
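The coefficient is shown on the original slide as an image; the commonly used definition, in terms of the dot product and vector lengths, is:

$$ EJ(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \mathbf{x} \cdot \mathbf{y}} $$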
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
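The formula on the original slide is an image; the standard definition for two vectors x and y of length n is:

$$ \operatorname{corr}(\mathbf{x}, \mathbf{y}) = \frac{s_{xy}}{s_x \, s_y}, \qquad s_{xy} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y}) $$

where sx and sy are the standard deviations of x and y.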
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
● The more tightly linear the relationship between two variables X and Y, the closer
Pearson's correlation coefficient (PCC) is to −1 or +1
○ PCC = -1, if the relationship is negative
■ an increase in the value of one variable decreases the value of the other variable
○ PCC = +1, if the relationship is positive
■ an increase in the value of one variable increases the value of the other variable
○ PCC = 0, the variables are perfectly linearly uncorrelated (no linear relationship)
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation ( scipy.stats.pearsonr() - automatic)
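The code on the original slide is an image; a possible sketch with scipy.stats.pearsonr (the data values are invented):

# Pearson's correlation with SciPy (illustrative sketch)
import numpy as np
from scipy import stats

x = np.array([2, 4, 6, 8, 10])
y = np.array([1, 3, 7, 9, 12])

r, p_value = stats.pearsonr(x, y)
print("Pearson's r:", round(r, 4))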
Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation (manual in python)
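The manual computation on the original slide is also an image; a possible NumPy version using the same invented data as above:

# manual Pearson's correlation (illustrative sketch)
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([1, 3, 7, 9, 12], dtype=float)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # covariance (population form)
r = cov_xy / (x.std() * y.std())                    # divide by the standard deviations
print("Pearson's r:", round(r, 4))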
CLASSIFICATION
CLASSIFICATION
DATAMINING
UNIT III
BASIC CONCEPTS
• Input data -> a collection of records.
• Record / instance / example -> tuple (x, y)
• x - attribute set
• y - special attribute (class label / category / target attribute)
• Attribute set - properties of a Data Object – Discrete / Continuous
• Class label (y) –
• Classification – y is a discrete attribute
• Regression (predictive modeling task) – y is a continuous attribute.
BASIC CONCEPTS
• Definition:
• Classification is the task of learning a target function f that maps each attribute set x to one of the
predefined class labels y.
• The target function is also known informally as a classification model.
BASIC CONCEPTS
• A classification model is useful for the following purposes.
• Descriptive modeling: A classification model can serve as an explanatory tool to distinguish
between objects of different classes.
BASIC CONCEPTS
• A classification model is useful for the following purposes.
• Predictive Modeling:
• A classification model can also be used to predict the class label of unknown records.
• Automatically assigns a class label when presented with the attribute set of an
unknown record.
• Classification techniques are best suited for binary or nominal categories.
• They do not consider the implicit order of ordinal class labels
• Relationships among classes (e.g., superclass-subclass) are also ignored
General approach to solving a classification problem
• Classification technique (or classifier)
• Systematic approach to building classification models
from an input data set.
• Examples
• Decision tree classifiers,
• Rule-based classifiers,
• Neural networks,
• Support vector machines, and
• Naive bayes classifiers.
• Learning algorithm
• Used by the classifier
• To identify a model
• That best fits the relationship between the
attribute set and class label of the input data.
General approach to solving a classification problem
• Model
• Generated by a learning algorithm
• Should satisfy the following:
• Fit the input data well
• Correctly predict the class labels of
records it has never seen before.
• Training set
• Consisting of records whose class labels are
known
• used to build a classification model
General approach to solving a classification problem
• Confusion Matrix
• Used to evaluate the performance of a classification model
• Holds details about
• counts of test records correctly and incorrectly predicted by the model.
• Table 4.2 depicts the confusion matrix for a binary classification problem.
• fij – no. of records from class i predicted to be of class j.
• f01 – no. of records from class 0 incorrectly predicted as class 1.
• total no. of correct predictions made (f11 + f00)
• total number of incorrect predictions (f10 + f01).
General approach to solving a classification problem
• Performance Metrics:
1. Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
2. Error Rate = (f10 + f01) / (f11 + f10 + f01 + f00)
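A tiny sketch computing both metrics from the counts of a 2x2 confusion matrix (the counts are invented):

# accuracy and error rate from a confusion matrix (illustrative sketch)
f11, f10, f01, f00 = 40, 10, 5, 45      # counts of correct/incorrect predictions

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total
error_rate = (f10 + f01) / total
print("Accuracy:", accuracy)      # 0.85
print("Error rate:", error_rate)  # 0.15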
DECISION TREE INDUCTION
Working of Decision Tree
• We can solve a classification problem by
asking a series of carefully crafted questions
about the attributes of the test record.
• Each time we receive an answer, a follow-up question is asked until we reach a
conclusion about the class label of the record.
• The series of questions and their possible
answers can be organized in the form of a
decision tree
• Decision tree is a hierarchical structure
consisting of nodes and directed edges.
DECISION TREE INDUCTION
Working of Decision Tree
• Three types of nodes:
• Root node
• No incoming edges
• Zero or more outgoing edges.
• Internal nodes
• Exactly one incoming edge and
• Two or more outgoing edges.
• Leaf or terminal nodes
• Exactly one incoming edge and
• No outgoing edges.
• Each leaf node is assigned a class label.
• Non-terminal nodes (root & other internal nodes)
contain attribute test conditions to separate
records that have different characteristics.
DECISION TREE INDUCTION
Working of Decision Tree
DECISION TREE INDUCTION
Building Decision Tree
• Hunt’s algorithm:
• basis of many existing decision tree induction algorithms, including
• ID3,
• C4.5, and
• CART.
• Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records
into successively purer subsets.
• Dt - set of training records with node t
• y= {y1, y2,..., yc} -> class labels.
• Hunt’s algorithm.
• Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
• Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets. A child node is created for each outcome of the
test condition and the records in Dt are distributed to the children based on the outcomes.
• Note:-algorithm is then recursively applied to each child node.
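A minimal runnable sketch of the recursive structure described above (illustrative only: for simplicity it splits on attributes in a fixed order rather than choosing the best test condition with an impurity measure, and the tiny data set is invented):

# sketch of the recursion in Hunt's algorithm (illustrative, not the book's pseudocode)
from collections import Counter

def majority(records):
    return Counter(y for _, y in records).most_common(1)[0][0]

def hunt(records, attributes):
    labels = {y for _, y in records}
    if len(labels) == 1:                       # Step 1: pure node -> leaf
        return ("leaf", labels.pop())
    if not attributes:                         # identical attribute values -> majority leaf
        return ("leaf", majority(records))

    attr = attributes[0]                       # simplification: take attributes in order
    tree = {"attr": attr, "children": {}}
    for v in {x[attr] for x, _ in records}:    # Step 2: one child per outcome
        subset = [(x, y) for x, y in records if x[attr] == v]
        tree["children"][v] = hunt(subset, attributes[1:])
    return tree

data = [({"home_owner": "yes", "marital": "single"}, "no"),
        ({"home_owner": "no",  "marital": "married"}, "no"),
        ({"home_owner": "no",  "marital": "single"}, "yes")]
print(hunt(data, ["home_owner", "marital"]))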
DECISION TREE INDUCTION
Building Decision Tree
• Example: predicting whether a loan applicant will repay the loan or default on it
• Construct a training set by examining the records of previous
borrowers.
DECISION TREE INDUCTION
Building Decision Tree
• Hunt’s algorithm will work fine
• if every combination of attribute values is present in the training data and
• if each combination has a unique class label.
• Additional conditions
1. If a child nodes is empty(no records in training set) then declare it as a leaf node with the same class
label as the majority class of training records associated with its parent node.
2. If the records have identical attribute values but different class labels, further splitting is not possible; declare the node a leaf with the
same class label as the majority class of training records associated with this node.
• Design Issues of Decision Tree Induction
• 1. How should the training records be split?
• Test condition to divide the records into smaller subsets.
• provide a method for specifying the test condition
• measure for evaluating the goodness of each test condition.
• 2. How should the splitting procedure stop?
• A stopping condition is needed
• stop when either all the records belong to the same class or all the records have identical attribute
values.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Binary Attributes
• The test condition for a binary attribute generates two potential outcomes.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Nominal Attributes
• nominal attribute can have many values
• Test condition can be expressed in two ways
• Multiway split - number of outcomes depends on the number of distinct values
• Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of
creating a binary partition of k attribute values.
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Ordinal Attributes
• Ordinal attributes can also produce binary or multiway splits.
• values can be grouped without violating the order property.
• Figure 4.10(c) is invalid because it groups values in a way that violates the order property
DECISION TREE INDUCTION
Methods for Expressing Attribute Test Conditions
• Test condition for Continuous Attributes
• Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes,
or
• Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k.
• Multiway split
• Apply the discretization strategies
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• p(i|t) - fraction of records belonging to class i at a given node t.
• Sometimes written simply as pi
• Two-class problem
• (p0, p1) - class distribution at any node
• p1 = 1 − p0
• (0.5, 0.5) because there are an equal number of records from each class
• Example: a split on an attribute such as Car Type will result in purer partitions
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• selection of best split is based on the degree of impurity of the child nodes
• Node with class distribution (0, 1) has zero impurity,
• Node with uniform class distribution (0.5, 0.5) has the highest impurity.
• p - fraction of records that belong to one of the two classes.
• Impurity is maximum when p = 0.5 (class distribution is even)
• Impurity is minimum when p = 0 or 1 (all records belong to the same class)
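The impurity formulas appear on the original slides as images; a small sketch of the three standard measures (Gini, entropy, classification error) applied to a node's class distribution:

# impurity measures for a node's class distribution (illustrative sketch)
import math

def gini(p):        return 1 - sum(pi ** 2 for pi in p)
def entropy(p):     return -sum(pi * math.log2(pi) for pi in p if pi > 0)
def class_error(p): return 1 - max(p)

for dist in [(0.0, 1.0), (0.3, 0.7), (0.5, 0.5)]:
    print(dist, round(gini(dist), 3), round(entropy(dist), 3), round(class_error(dist), 3))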
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• Node N1 has the lowest impurity value, followed by N2 and N3.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
• To Determine the performance of test condition – compare the degree of
impurity of the parent node (before splitting) with the degree of impurity
of the child nodes (after splitting).
• The larger their difference, the better the test condition.
• Information Gain:
• I(·) - impurity measure of a given node,
• N - total no. of records at parent node,
• k - no. of attribute values
• N(vj) - no. of records associated with the child node, vj.
• Gain of a test condition: ∆ = I(parent) − Σj [N(vj)/N] · I(vj)
• When entropy is used as the impurity measure I(·), the difference in entropy is known as the Information
gain, ∆info
Calculate Impurity using Gini
Find out, which attribute
is selected?
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Binary Attributes
○ Before splitting, the Gini index is 0.5
■ because equal number of records
from both classes.
○ If attribute A is chosen to split the
data,
■ Gini index
● node N1 = 0.4898, and
● node N2 = 0.480.
■ Weighted average of the Gini index
for the descendent nodes is
● (7/12) × 0.4898 + (5/12) × 0.480
= 0.486.
○ Weighted average of the Gini index for
attribute B is 0.375.
○ B is selected because of small value
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Nominal Attributes
○ First Binary Grouping
■ Gini index of {Sports, Luxury} is 0.4922 and
■ the Gini index of {Family} is 0.3750.
■ The weighted average Gini index
16/20 × 0.4922 + 4/20 × 0.3750 =
0.468.
○ Second binary grouping of {Sports} and {Family, Luxury},
■ weighted average Gini index is 0.167.
● The second grouping has a
lower Gini index
because its corresponding subsets
are much purer.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Splitting of Continuous Attributes
● A brute-force method -Take every value of the attribute in the N records as a candidate split position.
● Count the number of records with annual income less than or greater than
v(computationally expensive).
● To reduce the complexity, the training records are sorted based on their annual income,
● Candidate split positions are identified by taking the midpoints between two adjacent sorted values:
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
○ Problem:
■ Customer ID - produce purer partitions.
■ Customer ID is not a predictive attribute because its value is
unique for each record.
○ Two Strategies:
■ First strategy(used in CART)
● restrict the test conditions to binary splits only.
■ Second Strategy(used in C4.5 - Gain Ratio - to determine goodness
of a split)
● modify the splitting criterion
● consider - number of outcomes produced by the attribute test
condition.
DECISION TREE INDUCTION
Measures for Selecting the Best Split
● Gain Ratio
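The formula on the original slide is an image; in standard notation (with k the total number of splits and P(vi) the fraction of records assigned to child vi):

$$ \text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} P(v_i)\,\log_2 P(v_i) $$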
Tree-Pruning
• After building the decision tree,
• Tree-pruning step - to reduce the size of the decision
tree.
• Pruning -
• trims the branches of the initial tree
• improves the generalization capability of the
decision tree.
• Decision trees that are too large are susceptible to a
phenomenon known as overfitting.
Model Overfitting
Model Overfitting
DWDM Unit-III
Model Overfitting
● Errors that generally occur in a classification model are:-
○ Training Errors ( or Resubstitution Error or Apparent Error)
■ No. of misclassification errors Committed on Training data
○ Generalization Errors
■ Expected Error of the model on previously unused records.
● Model Overfitting:
○ Model is overfitting your training data when you see that the model performs well on the training data but
does not perform well on the evaluation (Test) data.
○ This is because the model is memorizing the data it has seen and is unable to generalize to unseen
examples.
Model Overfitting
● Model Underfitting:
● Model is underfitting the training data when the model performs poorly on the training data.
● Model is unable to capture the relationship between the input examples (X) and the target values (Y).
https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 0, Test Error - 30%
● Humans and dolphins were misclassified
● Spiny anteaters (exceptional case)
● Errors due to exceptional cases are often
Unavoidable and establish the minimum error
rate achievable by any classifier.
Model Overfitting
Overfitting Due to Presence of Noise: Train Error - 20%, Test Error - 10%
Model Overfitting
Overfitting Due to Lack of Representative Samples
Overfitting also occurs when only a small number of training records is available
● Training error is zero, Test Error is 30%
● Humans, elephants, and dolphins are misclassified
● The decision tree classifies all warm-blooded vertebrates
that do not hibernate as non-mammals (because of the
eagle record - lack of representative samples).
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Methods commonly used to evaluate the performance of a classifier
○ Hold Out method
○ Random Sub Sampling
○ Cross Validation
■ K-fold
■ Leave-one-out
○ Bootstrap
■ .632 Bootstrap
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Hold Out method
○ Original data - partitioned into two disjoint sets
■ training set
■ test sets
○ A classification model is then induced from the training set
○ Model performance is evaluated on the test set.
○ Analysts can decide the proportion of data reserved for training and for testing
■ e.g., 50-50 or
■ two-thirds - training & one-third - testing
○ Limitations
1. The model may not be good because fewer records are available for model induction
2. The model may be highly dependent on the composition of the training and test sets.
● If the training set is too small, the variance of the model is larger.
● If the training set is too large, the estimated accuracy from the smaller test set is less reliable.
3. The training and test sets are no longer independent of each other.
https://www.datavedas.com/holdout-cross-validation/
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Random Sub Sampling
○ The holdout method can be repeated several times to improve the estimation of a classifier’s performance.
○ Overall accuracy is the average of the model accuracies obtained over all the runs.
○ Problems:
■ Does not utilize as much data as possible for training.
■ No control over the number of times each record is used for testing and training.
https://blog.ineuron.ai/Hold-Out-Method-Random-Sub-Sampling-Method-3MLDEXAZML
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ Alternative to Random Subsampling
○ Each record is used the same number of times for training and exactly once for testing.
○ Two fold cross-validation
■ Partition the data into two equal-sized subsets.
■ one of the subsets for training and the other for testing.
■ Then swap the roles of the subsets
https://fengkehh.github.io/post/introduction-to-cross-validation/ - picture reference
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ K-Fold Cross Validation
■ k equal-sized partitions
■ During each run,
● one of the partitions is chosen for testing,
● while the rest of them are used for training.
■ Total error is found by summing up the errors for all k runs.
Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
Model Overfitting - Evaluating the Performance of a Classifier
Cross-validation
● Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly
split up into ‘k’ groups. One of the groups is used as the test set and the rest are
used as the training set. The model is trained on the training set and scored on
the test set. Then the process is repeated until each unique group has been used
as the test set.
● For example, for 5-fold cross validation, the dataset would be split into 5 groups,
and the model would be trained and tested 5 separate times so each group would
get a chance to be the test set. This can be seen in the graph below.
● 5-fold cross validation (figure)
Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
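As an illustrative sketch (not taken from the linked pages), k-fold cross-validation can be run with scikit-learn as follows; the iris dataset and the decision-tree model are assumptions for the example:

```python
# Minimal 5-fold cross-validation sketch (assumed example).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # X: attribute set, y: class labels

kfold = KFold(n_splits=5, shuffle=True, random_state=42)   # 5 equal-sized partitions
model = DecisionTreeClassifier()

# Each of the 5 runs trains on 4 folds and tests on the remaining fold.
scores = cross_val_score(model, X, y, cv=kfold)
print("Accuracy per fold:", scores)
print("Estimated accuracy:", scores.mean())
```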
Model Overfitting - Evaluating the Performance of a Classifier
Evaluating the Performance of a Classifier
● Cross Validation
○ leave-one-out approach
■ A special case of the k-fold cross-validation
● sets k = N ( Dataset size)
■ Size of test set = 1 record
■ All remaining records = Training set
■ Advantage
● Utilizing as much data as possible for training
● Test sets are mutually exclusive and they effectively cover the entire data set.
■ Drawback
● computationally expensive
Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
Model Overfitting
Evaluating the Performance of a Classifier
● Bootstrap
○ Training records are sampled with replacement;
■ A record already chosen for training is put back into the original pool of records so that it is equally likely
to be redrawn.
○ Probability a record is chosen by a bootstrap sample is 1 − (1 − 1/N)^N
■ When N is sufficiently large, this probability asymptotically approaches 1 − e⁻¹ ≈ 0.632.
○ On average, a bootstrap sample contains 63.2% of the records of the original data.
● b - number of bootstrap runs
● εi - accuracy of the model on the i-th bootstrap sample, accs - accuracy of the model on the full training data
Picture reference - https://bradleyboehmke.github.io/HOML/process.html
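Using the symbols above, the accuracy estimate of the .632 bootstrap (standard formulation) is:

```latex
acc_{boot} = \frac{1}{b} \sum_{i=1}^{b} \left( 0.632 \times \epsilon_i + 0.368 \times acc_s \right)
```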
Bayesian Classifiers
Bayesian Classifiers
DWDM Unit - III
Bayesian Classifiers
● In many applications the relationship between the attribute set and
the class variable is non-deterministic.
● Example:
○ Risk for heart disease based on the person's diet and workout
frequency.
● So, Modeling probabilistic relationships between the attribute
set and the class variable.
● Bayes Theorem
Bayesian Classifiers
● Consider a football game between two rival teams: Team 0 and Team 1.
● Suppose Team 0 wins 65% of the time and Team 1 wins the remaining
matches.
● Among the games won by Team 0, only 30% of them come from playing on
Team 1’s football field.
● On the other hand, 75% of the victories for Team 1 are obtained while playing at
home.
● If Team 1 is to host the next match between the two teams, which team will
most likely emerge as the winner?
● This Problem can be solved by Bayes Theorem
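A worked solution with Bayes theorem, using the probabilities given above:

```latex
P(\text{Team 1 wins}) = 0.35, \qquad P(\text{Team 0 wins}) = 0.65
P(\text{Team 1 hosts} \mid \text{Team 1 wins}) = 0.75, \qquad
P(\text{Team 1 hosts} \mid \text{Team 0 wins}) = 0.30

P(\text{Team 1 wins} \mid \text{Team 1 hosts})
  = \frac{0.75 \times 0.35}{0.75 \times 0.35 + 0.30 \times 0.65}
  = \frac{0.2625}{0.4575} \approx 0.57
```

Since 0.57 > 0.43, Team 1 is the more likely winner when it hosts the match.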
Bayesian Classifiers
● Bayes Theorem
○ X and Y are random variables.
○ A conditional probability is the probability that a random variable will take on a
particular value given that the outcome for another random variable is known.
○ Example:
■ conditional probability P(Y = y|X = x) refers to the probability that the variable
Y will take on the value y, given that the variable X is observed to have the
value x.
Bayesian Classifiers
● Bayes Theorem
If {X1, X2,..., Xk} is the set of mutually exclusive and exhaustive outcomes of a
random variable X, then the denominator of the previous slide equation can be
expressed as follows:
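Assuming the Bayes-theorem equation on the previous slide has P(Y) in its denominator, the law of total probability lets that denominator be expanded as:

```latex
P(Y) = \sum_{i=1}^{k} P(Y, X_i) = \sum_{i=1}^{k} P(Y \mid X_i)\, P(X_i)
```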
Bayesian Classifiers
● Bayes Theorem
Bayesian Classifiers
● Bayes Theorem
○ Using the Bayes Theorem for Classification
■ X - attribute set
■ Y - class variable.
○ Treat X and Y as random variables -for non-deterministic relationship
○ Capture relationship probabilistically using P(Y |X) - Posterior Probability or Conditional Probability
○ P(Y) - prior probability
○ Training phase
■ Learn the posterior probabilities P(Y |X) for every combination of X and Y
○ Use these probabilities to classify a test record X` by finding the class Y` with the maximum posterior probability P(Y`|X`)
Bayesian Classifiers
Using the Bayes Theorem for Classification
Example:-
● test record
X= (Home Owner = No, Marital Status = Married, Annual Income = $120K)
● Y=?
● Use training data & compute - posterior probabilities P(Yes|X) and P(No|X)
● Y= Yes, if P(Yes|X) > P(No|X)
● Y= No, Otherwise
Bayesian Classifiers
Computing P(X|Y) - Class Conditional Probability
Naïve Bayes Classifier
● assumes that the attributes are conditionally independent, given the class label y.
● The conditional independence assumption can be formally stated as follows:
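With d attributes X = {X1, X2, ..., Xd}, the assumption reads (standard naïve Bayes formulation):

```latex
P(X \mid Y = y) = \prod_{i=1}^{d} P(X_i \mid Y = y)
```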
Bayesian Classifiers
How a Naïve Bayes Classifier Works
● Assumption - conditional independence
● Estimate the conditional probability of each Xi, given Y
○ (instead of computing the class-conditional probability for every combination of X)
○ No need of very large training set to obtain a good estimate of the probability.
● To classify a test record,
○ Compute the posterior probability for each class Y:
■ P(X) can be ignored
● Since it is fixed for every Y, it is sufficient to choose the class that maximizes the
numerator term
Bayesian Classifiers
Estimating Conditional Probabilities for Binary Attributes
Xi - categorical attribute , xi - one of the value under attribute Xi
Y - Target Attribute ( for Class Label), y- one class Label
conditional probability P(Xi = xi |Y = y) = fraction of training instances in class y that take on
attribute value xi.
P(Home Owner=yes|DB=no) =
(No. of records with HO=yes and DB=no) / (Total no. of records with DB=no)
= 3/7
P(Home Owner=no|DB=no)=4/7
P(Home Owner=yes|DB=yes)=0
P(Home Owner=no|DB=yes)=3/3
Bayesian Classifiers
Estimating Conditional Probabilities for Categorical Attributes
P(MS=single|DB=no) = 2/7
P(MS=married|DB=no) = 4/7
P(MS=divorced|DB=no) =1/7
P(MS=single|DB=yes) = 2/3
P(MS=married|DB=yes) = 0/3
P(MS=divorced|DB=yes) =1 /3
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Discretization
● Probability Distribution
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Discretization (Transforming continuous attributes into ordinal attributes)
○ Replace the continuous attribute value with its corresponding discrete interval.
○ Estimation error depends on
○ Estimation error depends on
■ the discretization strategy
■ the number of discrete intervals.
○ If the number of intervals is too large, there are too few training records in
each interval
○ If the number of intervals is too small, then some intervals may aggregate
records from different classes and we may miss the correct decision boundary.
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Probability Distribution
○ Gaussian distribution can be used to represent the class-conditional probability for continuous
attributes.
○ The distribution is characterized by two parameters,
■ mean, µ
■ variance, σ²
µij - sample mean of Xi for all training records that belong to the class yj.
σ²ij - sample variance (s²) of such training records.
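The class-conditional density itself (the equation on the slide was an image; this is the standard Gaussian form):

```latex
P(X_i = x_i \mid Y = y_j) =
\frac{1}{\sqrt{2\pi}\,\sigma_{ij}}
\exp\!\left( -\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^{2}} \right)
```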
Bayesian Classifiers
Estimating Conditional Probabilities for Continuous Attributes
● Probability Distribution
sample mean and variance for this attribute with respect to the class No
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● Compute the class conditional probability for each categorical attribute
● Compute sample mean and variance for the continuous attribute
● Predict the class label of a test record
X = (Home Owner=No, Marital Status = Married,
Income = $120K)
● compute the posterior probabilities
○ P(No|X)
○ P(Yes|X)
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● P(yes) = 3/10 =0.3 P(no) =7/10 = 0.7
Bayesian Classifiers
Example of the Naïve Bayes Classifier
● P(no|x)= ?
● P(yes|x) = ?
● The class with the larger posterior value is chosen as the class label
● X = (Home Owner=No, Marital Status = Married, Income = $120K)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) = ?
● P(Y|X) ∝ P(Y) * P(X|Y) (the denominator P(X) is the same for every class and can be ignored)
● P(no| Home Owner=No, Marital Status = Married, Income = $120K) =
P(DB=no) * P(Home Owner=No, Marital Status = Married, Income = $120K | DB=no)
● P(X|Y) = P(HM=no|DB=no) * P(MS=married|DB=no) * P(Income=$120K|DB=no)
= 4/7 * 4/7 * 0.0072
=0.0024
Bayesian Classifiers
Example of the Naïve Bayes Classifier
P(DB=no | X)=P(DB=no)*P(X | DB=no) = 7/10 * 0.0024 = 0.0016
P(DB=yes | X)=P(DB=yes)*P(X | DB=yes) = 3/10 * 0 = 0
Class Label for the record is NO
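A small Python sketch that reproduces the computation above (the class-conditional probabilities and the Annual Income mean/variance come from the slide example; the code itself is illustrative):

```python
# Naïve Bayes scoring of X = (Home Owner=No, Marital Status=Married, Income=$120K).
from math import sqrt, pi, exp

p_no, p_yes = 7 / 10, 3 / 10                       # class priors P(DB=no), P(DB=yes)
p_ho_no_given_no, p_ho_no_given_yes = 4 / 7, 3 / 3
p_ms_married_given_no, p_ms_married_given_yes = 4 / 7, 0 / 3

def gaussian(x, mean, var):
    """Class-conditional density for a continuous attribute."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Sample mean/variance of Annual Income per class (values as in the slide example).
p_income_given_no = gaussian(120, 110, 2975)       # ≈ 0.0072
p_income_given_yes = gaussian(120, 90, 25)         # ≈ 1.2e-9

score_no = p_no * p_ho_no_given_no * p_ms_married_given_no * p_income_given_no
score_yes = p_yes * p_ho_no_given_yes * p_ms_married_given_yes * p_income_given_yes
print("P(No|X)  ∝", round(score_no, 4))            # ≈ 0.0016
print("P(Yes|X) ∝", score_yes)                     # 0 -> predicted class label is No
```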
Bayesian Classifiers
Find out Class Label ( Play Golf ) for
today = (Sunny, Hot, Normal, False)
https://www.geeksforgeeks.org/naive-bayes-classifiers/
Association Analysis:
Basic Concepts and Algorithms
Basic Concepts and Algorithms
DWDM Unit - IV
Basic Concepts
● Retailers are interested in analyzing the data to learn
about the purchasing behavior of their customers.
● Such Information is used in marketing promotions, inventory
management, and customer relationship management.
● Association analysis - useful for discovering interesting
relationships hidden in large data sets.
● The uncovered relationships can be represented in the form
of association rules or sets of frequent items.
Basic Concepts
● Example Association Rule
○ {Diapers} → {Beer}
● rule suggests - strong relationship exists between the sale of
diapers and beer
● many customers who buy diapers also buy beer.
● Association analysis is also applicable to
○ Bioinformatics,
○ Medical diagnosis,
○ Web mining, and
○ Scientific data analysis
● Example - analysis of Earth science data(ocean, land, &
atmospheric processes)
Basic Concepts
Problem Definition:
● Binary representation of market basket data
● each row - transaction
● each column - item
● value is one if the item is present in a transaction and
zero otherwise.
● item is an asymmetric binary variable because the
presence of an item in a transaction is often considered
more important than its absence
Basic Concepts
Itemset and Support Count:
I = {i1,i2,.. .,id} - set of all items
T = {t1, t2,..., tN} - set of all transactions
Each transaction ti contains a subset of items chosen from I
Itemset - collection of zero or more items
K-itemset - itemset contains k items
Example:-
{Beer, Diapers, Milk} - 3-itemset
null (or empty) set - no items
Basic Concepts
Itemset and Support Count:
● Transaction width - number of items present in a
transaction.
● A transaction tj contains an itemset X if X is a subset of
tj.
● Example:
○ t2 contains itemset {Bread, Diapers} but not {Bread, Milk}.
● support count,σ(X) - number of transactions that contain a
particular itemset.
● σ(X) = |{ti |X ⊆ ti, ti ∈ T}|,
○ symbol | · | denote the number of elements in a set.
● support count for {Beer, Diapers, Milk} =2
○ ( 2 transactions contain all three items)
Basic Concepts
Association Rule:
● An association rule is an implication expression of
the form X → Y, where X and Y are disjoint itemsets
○ i.e., X ∩ Y = ∅.
● The strength of an association rule can be measured
in terms of its support and confidence.
Basic Concepts
● Support
○ determines how often a rule is applicable to
a given data set
● Confidence
○ determines how frequently items in Y appear
in transactions that contain X
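In symbols (standard definitions, with N the total number of transactions):

```latex
\text{Support: } s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}
\qquad
\text{Confidence: } c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```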
Basic Concepts
● Example:
○ Consider the rule {Milk, Diapers} → {Beer}
○ support count for {Milk, Diapers, Beer}=2
○ total number of transactions=5,
○ rule’s support is 2/5 = 0.4.
○ rule’s confidence =
(support count for {Milk, Diapers, Beer})/(support count for {Milk, Diapers})
= 2/3 = 0.67.
Basic Concepts
Formulation of Association Rule Mining Problem
Association Rule Discovery
Given a set of transactions T, find all the rules having
support ≥ minsup and confidence ≥ minconf, where minsup and
minconf are the corresponding support and confidence
thresholds.
Basic Concepts
Formulation of Association Rule Mining Problem
Association Rule Discovery
● Brute-force approach: compute the support and confidence for every
possible rule (expensive)
● Total number of possible rules extracted from a data set that
contains d items is R = 3^d − 2^(d+1) + 1
● For a dataset of 6 items, the number of possible rules is 3^6 − 2^7 + 1 = 602
rules.
● More than 80% of the rules are discarded after applying minsup=20% &
minconf=50%
● most of the computations become wasted.
● Prune the rules early without having to compute their support and
confidence values.
Basic Concepts
Formulation of Association Rule Mining Problem
Association Rule Discovery
● Common strategy - decompose the problem into two major
subtasks: (separate support & confidence)
1. Frequent Itemset Generation:
■ Objective:Find all the itemsets that satisfy the minsup threshold.
2. Rule Generation:
■ Objective: Extract all the high-confidence rules from the frequent
itemsets found in the previous step.
■ These rules are called strong rules.
Frequent Itemset Generation
● Lattice structure - list of all
possible itemsets
● itemset lattice for
○ I = {a, b, c, d, e}
● Data set with k items can generate
up to 2^k − 1 frequent itemsets
(without null set)
○ Example: 2^5 − 1 = 31
● So, search space of itemsets in
practical applications is
exponentially large
Frequent Itemset Generation
● A brute-force approach for finding frequent itemsets
○ determine the support count for every candidate
itemset in the lattice structure.
● compare each candidate against every transaction
● Very expensive
○ requires O(NMw) comparisons,
○ N- No. of transactions,
○ M = 2^k − 1 is the number of candidate itemsets
○ w - maximum transaction width.
Frequent Itemset Generation
several ways to reduce the computational complexity of
frequent itemset generation.
● Reduce the number of candidate itemsets (M) - the Apriori principle
● Reduce the number of comparisons - by using more advanced data structures
Frequent Itemset Generation
The Apriori Principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Frequent Itemset Generation
Support-based pruning:
● strategy of trimming the exponential search space based on the
support measure is known as support-based pruning.
● It uses anti-monotone property of the support measure.
● Anti-monotone property of the support measure
○ support for an itemset never exceeds the support for its subsets.
● Example:
○ {a, b} is infrequent,
○ then all of its supersets must be infrequent too.
○ entire subgraph containing the supersets of {a, b} can be pruned immediately
Frequent Itemset Generation
Let,
I - set of items
J = 2I - power set of I
A measure f is monotone/anti-monotone if
Monotonicity Property(or upward closed):
∀X, Y ∈ J: (X ⊆ Y) → f(X) ≤ f(Y)
Anti-monotone (or downward closed):
∀X, Y ∈ J: (X ⊆ Y) → f(Y) ≤ f(X)
means that if X is a subset of Y, then f(Y) must not exceed f(X).
Frequent Itemset Generation in the Apriori Algorithm
Identify Frequent Itemset
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets
Fk - set of k-frequent itemsets
Frequent Itemset Generation in the Apriori Algorithm
https://www.softwaretestinghelp.com/apriori-algorithm/
Example
Frequent Itemset Generation in the Apriori Algorithm
Example
Apriori in Python
https://intellipaat.com/blog/data-science-apriori-algorithm/
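The linked tutorials use library implementations; a comparable sketch with the mlxtend package is shown below (illustrative only; the five transactions mirror the market-basket example used earlier):

```python
# Apriori sketch with mlxtend (pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]

# Binary (one-hot) representation of the market basket data.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with minsup = 60%, then rules with minconf = 60%.
frequent = apriori(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```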
Frequent Itemset Generation in the Apriori Algorithm
Ck-set of k-candidate itemsets
Fk - set of k-frequent itemsets
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
The apriori-gen function shown in Step 5 of Algorithm 6.1
generates candidate itemsets by performing the following two
operations:
1. Candidate Generation (join)
a. Generates new candidate k-itemsets
b. based on the frequent (k − 1)-itemsets found in the previous
iteration.
2. Candidate Pruning
a. Eliminates some of the candidate k-itemsets using the support-based
pruning strategy.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
Requirements for an effective candidate generation
procedure:
1. It should avoid generating too many unnecessary
candidates
2. It must ensure that the candidate set is complete,
i.e., no frequent itemsets are left out
3. It should not generate the same candidate itemset more
than once (no duplicates).
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
Candidate Generation Procedures
1. Brute-Force Method
2. Fk−1 × F1 Method
3. Fk−1×Fk−1 Method
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
Candidate Generation Procedures
1. Brute-Force Method
a. considers every k-itemset as a
potential candidate
b. candidate pruning ( to remove
unnecessary candidates) becomes
extremely expensive
c. No. of candidate itemsets generated at level k = C(d, k), i.e., "d choose k"
d - no. of items
2. Fk−1 × F1 Method
O(|Fk−1| × |F1|) candidate k-itemsets,
|Fj | = no. of frequent j-itemsets.
overall complexity
● The procedure is complete.
● But the same candidate itemset will be generated more than once ( duplicates).
● Example:
○ {Bread, Diapers, Milk} can be generated
○ by merging {Bread, Diapers} with {Milk},
○ {Bread, Milk} with {Diapers}, or
○ {Diapers, Milk} with {Bread}.
● One Solution
○ Generate candidate itemset by joining items
in lexicographical order only
● {Bread, Diapers} join with {Milk}
Don’t join
● {Diapers, Milk} with {Bread}
● {Bread, Milk} with {Diapers}
because violation of lexicographic ordering
Problem:
Large no. of unnecessary candidates
3. Fk−1×Fk−1 Method (used in the apriori-gen function)
● merges a pair of frequent (k−1)-itemsets only if their
first k−2 items are identical.
● Let A = {a1, a2,..., ak−1} and B = {b1, b2,..., bk−1} be a
pair of frequent (k−1)-itemsets.
● A and B are merged if they satisfy the following
conditions:
○ ai = bi (for i = 1, 2,..., k−2) and
○ ak−1 != bk−1.
Merge {Bread, Diapers} & {Bread, Milk} to form a candidate 3-
itemset {Bread, Diapers, Milk}
Don’t merge {Beer, Diapers} with {Diapers, Milk} because the
first item in both itemsets is different.
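A minimal sketch of the Fk−1 × Fk−1 merge step (illustrative; candidate pruning of the generated k-itemsets would follow):

```python
# Merge frequent (k-1)-itemsets whose first k-2 items are identical.
def merge_candidates(freq_k_minus_1):
    candidates = []
    itemsets = sorted(tuple(sorted(s)) for s in freq_k_minus_1)   # lexicographic order
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            if a[:-1] == b[:-1] and a[-1] != b[-1]:               # same first k-2 items
                candidates.append(a + (b[-1],))
    return candidates

f2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk"), ("Beer", "Diapers")]
print(merge_candidates(f2))            # -> [('Bread', 'Diapers', 'Milk')]
```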
Support Counting
● Support counting is the process of
determining the frequency of
occurrence for every candidate
itemset that survives the candidate
pruning step.
● One approach for doing this is to
compare each transaction against
every candidate itemset (see Figure
6.2) and to update the support
counts of candidates contained in
the transaction.
● This approach is computationally
expensive, especially when the
numbers of transactions and
candidate itemsets are large.
Support Counting
● An alternative approach is to enumerate the itemsets contained in
each transaction and use them to update the support counts of
their respective candidate itemsets.
● To illustrate, consider a transaction t that contains five
items, {1, 2, 3, 5, 6}.
● Assuming that each itemset keeps its items in increasing
lexicographic order, an itemset can be enumerated by specifying
the smallest item first,followed by the larger items.
● For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets
contained in t must begin with item 1, 2, or 3. It is not
possible to construct a 3-itemset that begins with items 5 or 6
because there are only two items in t whose labels are greater
than or equal to 5.
Support Counting
● The number of ways to specify the first item of a 3-itemset
contained in t is illustrated by the Level 1 prefix structures.
For instance, 1 2 3 5 6 represents a 3-itemset that begins with
item 1, followed by two more items chosen from the set {2, 3, 5, 6}
● After fixing the first item, the prefix structures at Level 2
represent the number of ways to select the second item.
For example, 1 2 3 5 6 corresponds to itemsets that begin with
prefix (1 2) and are followed by items 3, 5, or 6.
● Finally, the prefix structures at Level 3 represent the complete
set of 3-itemsets contained in t.
For example, the 3-itemsets that begin with prefix {1 2} are
{1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with
prefix {2 3} are {2, 3, 5} and {2, 3, 6}.
Support Counting
(steps 6 through 11 of Algorithm 6.1. )
● Enumerate the itemsets contained in
each transaction
● Figure 6.9 demonstrates how itemsets
contained in a transaction can be
systematically enumerated, i.e., by
specifying their items one by one,
from the leftmost item to the
rightmost item.
● If an enumerated itemset of the transaction
matches one of the candidates, then
the support count of the
corresponding candidate is
incremented (line 9 in the algorithm).
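An illustrative way to enumerate these itemsets in Python (the candidate set shown is hypothetical):

```python
# Enumerate the 3-itemsets contained in a transaction and update candidate counts.
from itertools import combinations

t = [1, 2, 3, 5, 6]                         # items kept in increasing lexicographic order
three_itemsets = list(combinations(t, 3))   # C(5, 3) = 10 subsets
print(three_itemsets)

# Support counting: increment the count of every enumerated itemset that is a candidate.
candidate_counts = {(1, 2, 5): 0, (2, 3, 6): 0, (4, 5, 6): 0}    # hypothetical candidates
for itemset in three_itemsets:
    if itemset in candidate_counts:
        candidate_counts[itemset] += 1
print(candidate_counts)                     # {(1, 2, 5): 1, (2, 3, 6): 1, (4, 5, 6): 0}
```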
Support Counting Using a Hash Tree
● Candidate itemsets are
partitioned into different
buckets and stored in a
hash tree.
● Itemsets contained in
each transaction are also
hashed into their
appropriate buckets.
● Instead of comparing each
itemset in the transaction
with every candidate
itemset
● Matched only against
candidate itemsets that
belong to the same bucket
Hash Tree from a Candidate Itemset
https://www.youtube.com/watch?v=btW-uU1dhWI
Hash function= p mod 3
Rule generation
&
Compact representation of frequent
itemsets
DWDM
Unit - IV
Association Analysis
Rule Generation
● Each frequent k-itemset can produce up to 2^k − 2 association
rules, ignoring rules that have empty antecedents or
consequents.
● An association rule can be extracted by partitioning the
itemset Y into two non-empty subsets, X and Y −X, such that
X → Y −X satisfies the confidence threshold.
Confidence-Based Pruning
Theorem:
If a rule X → Y −X does not satisfy the confidence
threshold, then
any rule X` → Y − X`, where X` is a subset of X, must
not satisfy the confidence threshold as well.
Rule Generation in Apriori Algorithm
● The Apriori algorithm uses a level-wise approach for generating
association rules, where each level corresponds to the
number of items that belong to the rule consequent.
● Initially, all the high-confidence rules that have only one
item in the rule consequent are extracted.
● These rules are then used to generate new candidate rules.
For example, if {acd} →{b} and {abd} →{c} are high-confidence rules, then the
candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
Rule Generation in Apriori Algorithm
● Figure 6.15 shows a lattice
structure for the association
rules generated from the
frequent itemset {a, b, c, d}.
● If any node in the lattice has
low confidence, then
according to Theorem, the
entire sub-graph spanned by
the node can be pruned
immediately.
● Suppose the confidence for
{bcd} → {a} is low. All the rules
containing item a in its
consequent, can be discarded.
In rule generation, we do not have to make additional passes over the data set
to compute the confidence of the candidate rules.
Instead, we determine the confidence of each rule by using the support counts
computed during frequent itemset generation.
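A small sketch of this idea (the support counts are the ones from the earlier {Milk, Diapers} → {Beer} example; the code itself is illustrative):

```python
# Confidence of candidate rules computed from stored support counts only.
support = {
    frozenset({"Milk", "Diapers", "Beer"}): 2,
    frozenset({"Milk", "Diapers"}): 3,
}

def confidence(antecedent, consequent):
    """conf(X -> Y) = sigma(X u Y) / sigma(X); no extra pass over the data is needed."""
    union = frozenset(antecedent) | frozenset(consequent)
    return support[union] / support[frozenset(antecedent)]

print(confidence({"Milk", "Diapers"}, {"Beer"}))     # 2/3 ≈ 0.67
```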
Compact Representation of Frequent Itemsets
Maximal Frequent Itemsets
Definition
A maximal frequent itemset is defined as a frequent itemset for
which none of its immediate supersets are frequent.
Compact Representation of Frequent Itemsets
● The itemsets in the lattice are divided
into two groups: those that are
frequent and those that are
infrequent.
● A frequent itemset border, which is
represented by a dashed line, is also
illustrated in the diagram.
● Every itemset located above the
border is frequent, while those
located below the border (the shaded
nodes) are infrequent.
● Among the itemsets residing near the
border, {a, d}, {a, c, e}, and {b, c, d, e} are
considered to be maximal frequent
itemsets because their immediate
supersets are infrequent.
● Maximal frequent itemsets do not
contain the support information of
their subsets.
Compact Representation of Frequent Itemsets
● Maximal frequent itemsets effectively provide a compact representation of
frequent itemsets.
● They form the smallest set of itemsets from which all frequent itemsets can
be derived.
● For example, the frequent itemsets shown in Figure 6.16 can be divided into
two groups:
○ Frequent itemsets that begin with item a and that may contain items c, d, or e. This group
includes itemsets such as {a}, {a, c}, {a, d}, {a, e} and {a, c, e}.
○ Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b,
c}, {c, d},{b, c, d, e}, etc.
● Frequent itemsets that belong in the first group are subsets of either {a, c, e}
or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}.
Compact Representation of Frequent Itemsets
● Closed Frequent Itemsets
○ Closed itemsets provide a minimal representation of itemsets
without losing their support information.
○ An itemset X is closed if none of its immediate supersets has
exactly the same support count as X.
Or
○ X is not closed if at least one of its immediate supersets has the
same support count as X.
Compact Representation of Frequent Itemsets
Closed Frequent Itemsets
● An itemset is a closed
frequent itemset if it is
closed and its support is
greater than or equal to
minsup.
Compact Representation of Frequent Itemsets
Closed Frequent Itemsets
● Determine the support counts for the non-closed by using the closed frequent
itemsets
● consider the frequent itemset {a, d}: since it is not closed, its support count must be
identical to that of one of its immediate supersets {a, b, d}, {a, c, d}, or {a, d, e}.
● Apriori principle states
○ any transaction that contains the superset of {a, d} must also contain {a, d}.
○ any transaction that contains {a, d} does not have to contain the supersets of {a, d}.
● So, the support for {a, d} = largest support among its supersets = support of
{a,c,d}
● Algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the
smallest frequent itemsets.
● The items can be divided into
three groups: (1) Group A,
which contains items a1
through a5; (2) Group B, which
contains items b1 through b5;
and (3) Group C, which
contains items c1 through c5.
● The items within each group
are perfectly associated with
each other and they do not
appear with items from
another group. Assuming the
support threshold is 20%,
the total number of frequent
itemsets is 3 × (2^5 − 1) = 93.
● There are only three closed
frequent itemsets in the
data: ({a1, a2, a3, a4, a5}, {b1,
b2, b3, b4, b5}, and {c1, c2, c3,
c4, c5})
● Redundant association rules can be removed by using Closed frequent itemsets
● An association rule X → Y is redundant if there exists another rule X`→ Y`,
where
X is a subset of X` and
Y is a subset of Y `
such that the support and confidence for both rules are identical.
● From table 6.5 {b} is not a closed frequent itemset while {b, c} is closed.
● The association rule {b} → {d, e} is therefore redundant because it has the same
support and confidence as {b, c} → {d, e}.
● Such redundant rules are not generated if closed frequent itemsets are used
for rule generation.
● All maximal frequent itemsets are closed because none of the maximal frequent
itemsets can have the same support count as their immediate supersets.
FP Growth Algorithm
Association Analysis (Unit - IV)
DWDM
FP Growth Algorithm
● FP-growth algorithm takes a radically different approach for discovering frequent itemsets.
● The algorithm encodes the data set using a compact data structure called an FP-tree and extracts
frequent itemsets directly from this structure
FP-Tree Representation
● An FP-tree is a compressed representation of the input data. It is constructed by reading the data
set one transaction at a time and mapping each transaction onto a path in the FP-tree.
● As different transactions can have several items in common, their paths may overlap. The more
the paths overlap with one another, the more compression we can achieve using the FP-tree
structure.
● If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract
frequent itemsets directly from the structure in memory instead of making repeated passes over
the data stored on disk.
FP Tree Representation
● Figure 6.24 shows a data set that
contains ten transactions and five
items.
● The structures of the FP-tree after
reading the first three
transactions are also depicted in
the diagram.
● Each node in the tree contains the
label of an item along with a
counter that shows the number of
transactions mapped onto the
given path.
● Initially, the FP-tree contains only
the root node represented by the
null symbol.
FP Tree Representation
1. The data set is scanned once to
determine the support count of
each item. Infrequent items are
discarded, while the frequent
items are sorted in decreasing
support counts. For the data set
shown in Figure, a is the most
frequent item, followed by b, c, d,
and e.
FP Tree Representation
2. The algorithm makes a second
pass over the data to construct
the FP-tree. After reading the
first transaction, {a, b}, the nodes
labeled as a and b are created. A
path is then formed from null →
a → b to encode the transaction.
Every node along the path has a
frequency count of 1.
FP Tree Representation
3. After reading the second transaction, {b,c,d}, a new set of
nodes is created for items b, c, and d. A path is then
formed to represent the transaction by connecting the
nodes null → b → c → d. Every node along this path
also has a frequency count equal to one.
4. The third transaction, {a,c,d,e}, shares a common prefix
item (which is a) with the first transaction. As a result,
the path for the third transaction, null → a → c → d
→ e, overlaps with the path for the first transaction,
null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two,
while the frequency counts for the newly created
nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been
mapped onto one of the paths given in the FP-tree.
The resulting FP-tree after reading all the transactions
is shown in Figure 6.24.
FP Tree Representation
● The size of an FP-tree is typically smaller
than the size of the uncompressed data
because many transactions in market
basket data often share a few items in
common.
● In the best-case scenario, where all the
transactions have the same set of items,
the FP-tree contains only a single branch
of nodes.
● The worst-case scenario happens when
every transaction has a unique set of
items.
FP Tree Representation
● The size of an FP-tree also
depends on how the items are
ordered.
● If the ordering scheme in the
preceding example is reversed,
i.e., from lowest to highest
support item, the resulting FP-
tree is shown in Figure 6.25.
● An FP-tree also contains a list
of pointers connecting
between nodes that have the
same items.
● These pointers, represented as
dashed lines in Figures 6.24
and 6.25, help to facilitate the
rapid access of individual
items in the tree.
Frequent Itemset Generation using FP-Growth Algorithm
Steps in FP-Growth Algorithm:
Step-1: Scan the database to build Frequent 1-item set which will contain all
the elements whose frequency is greater than or equal to the minimum
support. These elements are stored in descending order of their
respective frequencies.
Step-2: For each transaction, the respective Ordered-Item set is built.
Step-3: Construct the FP tree. by scanning each Ordered-Item set
Step-4: For each item, the Conditional Pattern Base is computed which is
path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Step-5: For each item, the Conditional Frequent Pattern Tree is built.
Step-6: Frequent Pattern rules are generated by pairing the items of the
Conditional Frequent Pattern Tree set to each corresponding item.
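For comparison, a library-based sketch using mlxtend's fpgrowth (illustrative; the transactions are chosen so that the item frequencies match the K/E/M/O/Y example that follows):

```python
# FP-Growth sketch with mlxtend (pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["E", "K", "M", "N", "O", "Y"],
    ["D", "E", "K", "N", "O", "Y"],
    ["A", "E", "K", "M"],
    ["C", "K", "M", "U", "Y"],
    ["C", "E", "I", "K", "O"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# min_support = 3/5 = 0.6 corresponds to a minimum support count of 3.
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```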
Frequent Itemset Generation in FP-Growth Algorithm
Example:
The frequency of each individual
item is computed:-
Given Database: min_support=3
Frequent Itemset Generation in FP-Growth Algorithm
● A Frequent Pattern set is built which will contain all the elements whose
frequency is greater than or equal to the minimum support. These elements are
stored in descending order of their respective frequencies.
● L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
● Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained
in the transaction. The following table is built for all the transactions:
Frequent Itemset Generation in FP-Growth Algorithm
Now, all the Ordered-Item sets
are inserted into a Trie Data
Structure.
a) Inserting the set {K, E, M, O,
Y}:
All the items are simply
linked one after the other in
the order of occurrence in
the set and initialize the
support count for each item
as 1.
Frequent Itemset Generation in FP-Growth Algorithm
b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and
E, simply the support count is increased
by 1.
There is no direct link between E and O,
therefore a new node for the item O is
initialized with the support count as 1
and item E is linked to this new node.
On inserting Y, we first initialize a new
node for the item Y with support count
as 1 and link the new node of O with the
new node of Y.
Frequent Itemset Generation in FP-Growth Algorithm
c) Inserting the set {K, E, M}:
● Here simply the support
count of each element is
increased by 1.
Frequent Itemset Generation in FP-Growth Algorithm
d) Inserting the set {K, M, Y}:
● Similar to step b), first the
support count of K is
increased, then new nodes
for M and Y are initialized
and linked accordingly.
Frequent Itemset Generation in FP-Growth Algorithm
e) Inserting the set {K, E, O}:
● Here simply the support
counts of the respective
elements are increased.
Frequent Itemset Generation in FP-Growth Algorithm
Now, for each item starting from leaf, the Conditional Pattern Base is computed
which is path labels of all the paths which lead to any node of the given item in
the frequent-pattern tree.
Frequent Itemset Generation in FP-Growth Algorithm
Now for each item, the Conditional Frequent Pattern Tree is built.
It is done by taking the set of elements that is common in all the paths in the Conditional
Pattern Base of that item and calculating its support count by summing the support counts of all
the paths in the Conditional Pattern Base.
The itemsets whose support count >= min_support value are retained in the Conditional
Frequent Pattern Tree and the rest are discarded.
Frequent Itemset Generation in FP-Growth Algorithm
From the Conditional Frequent Pattern tree, the Frequent Pattern rules are
generated by pairing the items of the Conditional Frequent Pattern Tree set
with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred for example for the first
row which contains the element, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both the rules is calculated and the one
with confidence greater than or equal to the minimum confidence value is retained.
Data Mining
Cluster Analysis: Basic Concepts
and Algorithms
Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
What is Cluster Analysis?
● Given a set of objects, place them in groups such that the
objects in a group are similar (or related) to one another and
different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
Applications of Cluster Analysis
● Understanding
– Group related documents
for browsing(Information
Retrieval),
– group genes and proteins
that have similar
functionality(Biology),
– group stocks with similar
price fluctuations
(Business)
– Climate
– Psychology & Medicine
Clustering precipitation
in Australia
Applications of Cluster Analysis
● Clustering for Utility
– Summarization
– Compression
– Efficiently finding Nearest
Neighbors
Clustering precipitation
in Australia
Notion of a Cluster can be Ambiguous
How many clusters? Six Clusters
Four Clusters
Two Clusters
Types of Clusterings
● A clustering is a set of clusters
● Important distinction between hierarchical and
partitional sets of clusters
– Partitional Clustering (unnested)
◆ A division of data objects into non-overlapping subsets (clusters)
– Hierarchical clustering (nested)
◆ A set of nested clusters organized as a hierarchical tree
Partitional Clustering
Original Points A Partitional Clustering
Hierarchical Clustering
Traditional Hierarchical Clustering, Traditional Dendrogram
Non-traditional Hierarchical Clustering, Non-traditional Dendrogram
Other Distinctions Between Sets of Clusters
● Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple
clusters.
◆ Can belong to multiple classes or could be ‘border’ points
– Fuzzy clustering (one type of non-exclusive)
◆ In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
◆ Weights must sum to 1
◆ Probabilistic clustering has similar characteristics
● Partial versus complete
– In some cases, we only want to cluster some of the data
Types of Clusters
● Well-separated clusters
● Prototype-based clusters
● Contiguity-based clusters
● Density-based clusters
● Described by an Objective Function
Types of Clusters: Well-Separated
● Well-Separated Clusters:
– A cluster with a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster
than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Prototype-Based
● Prototype-based ( or center based)
– A cluster with set of points such that a point in a cluster is
closer (more similar) to the prototype or “center” of the
cluster, than to the center of any other cluster
– If Data is Continuous – Center will be Centroid /mean
– If Data is Categorical - Center will be Medoid ( Most
Representative point)
4 center-based clusters
Types of Clusters: Contiguity-Based ( Graph)
● Contiguous Cluster (Nearest neighbor or
Transitive)
– A cluster with set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
– Graph view (data points are nodes, links are connections): a cluster is a group of
connected objects, with no connections to objects outside the group.
● Useful when clusters are irregular or intertwined
● Trouble when noise is present
– a small bridge of points can merge two distinct clusters.
8 contiguous clusters
Types of Clusters: Density-Based
● Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
The two circular clusters are not merged, as in the figure, because the bridge between them
(previous slide figure) fades into the noise.
Curve that is present in previous slide Figure also fades into the noise and does not form a
cluster
A density based definition of a cluster is often employed when the clusters are irregular or intertwined,
and when noise and outliers are present.
Types of Clusters: Density-Based
● Shared property(Conceptual Clusters)
– a cluster as a set of objects that share some
property.
A clustering algorithm would need a very specific concept (sophisticated) of a cluster to successfully
detect these clusters. The process of finding such clusters is called conceptual clustering.
Clustering Algorithms
● K-means and its variants
● Hierarchical clustering
● Density-based clustering
K-means
● Prototype-based, partitional clustering
technique
● Attempts to find a user-specified number of
clusters (K)
Agglomerative Hierarchical Clustering
● Hierarchical clustering
● Starts with each point as a singleton cluster
● Repeatedly merges the two closest clusters
until a single, all encompassing cluster
remains.
● Some Times - graph-based clustering
● Others - prototype-based approach.
DBSCAN
● Density-based clustering algorithm
● Produces a partitional clustering,
● No. of clusters is automatically determined by
the algorithm.
● Noise - Points in low-density regions (omitted)
● Not a complete clustering.
K-means Clustering
● Partitional clustering approach
● Number of clusters, K, must be specified
● Each cluster is associated with a centroid (center point)
● Each point is assigned to the cluster with the closest
centroid
● The basic algorithm is very simple
Example of K-means Clustering
K-means Clustering – Details
● Simple iterative algorithm.
– Choose initial centroids;
– repeat {assign each point to a nearest centroid; re-compute cluster centroids}
– until centroids stop changing.
● Initial centroids are often chosen randomly.
– Clusters produced can vary from one run to another
● The centroid is (typically) the mean of the points in the cluster,
but other definitions are possible
● Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
● Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
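A minimal sketch with scikit-learn (illustrative; the three synthetic blobs are assumptions for the example):

```python
# Basic K-means run; n_init repeats the algorithm with different random initial
# centroids and keeps the solution with the lowest SSE (inertia).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("Centroids:\n", km.cluster_centers_)
print("SSE (inertia):", km.inertia_)
print("First 10 labels:", km.labels_[:10])
```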
K-means Clustering – Details
● Centroid can vary, depending on the proximity
measure for the data and the goal of the
clustering.
● The goal of the clustering is typically expressed
by an objective function that depends on the
proximities of the points to one another or to the
cluster centroids.
● e.g., minimize the squared distance of each point
to its closest centroid
K-means Clustering – Details
Centroids and Objective Functions
Data in Euclidean Space
● A common objective function (used with Euclidean
distance measure) is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
center
– To get SSE, we square these errors and sum them.
– x is a data point in cluster Ci and mi is the centroid (mean) for
cluster Ci
– A K-means run which produces the minimum SSE is preferred.
– The SSE and the centroid (mean) of the i-th cluster are given below:
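Reconstructed from the standard definitions (the formulas on the slide were images):

```latex
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2
\qquad
m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```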
K-means Objective Function
Document Data
● Cosine Similarity
● Document data is represented as Document Term Matrix
● Objective (Cohesion of the cluster)
– Maximize the similarity of the documents in a cluster
to the cluster centroid; which is called cohesion of
the cluster
Two different K-means Clusterings
Original Points
Figure a shows a clustering
solution that is the global
minimum of the SSE for three
clusters
Figure b shows suboptimal
clustering that is only a local
minimum.
Fig b: Sub-optimal Clustering
Fig a: Optimal Clustering
Importance of Choosing Initial Centroids …
The below 2 figures show the clusters that result from two particular choices of initial centroids.
(For both figures, the positions of the cluster centroids in the various iterations are indicated by
crosses.)
Fig-1
In Figure 1, even though all the
initial centroids are from one
natural cluster, the minimum
SSE clustering is still found
In Figure 2, even though the initial
centroids seem to be better
distributed, we obtain a
suboptimal clustering, with higher
squared error. This is considered a
poor choice of starting centroids
Fig-2
Importance of Choosing Initial Centroids …
Problems with Selecting Initial Points
● Figure 5.7 shows that if a pair of clusters has only one initial
centroid and the other pair has three, then two of the true
clusters will be combined and one true cluster will be split.
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others
have only one.
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to Initial Centroids Problem
● Multiple runs
● K-means++
● Use hierarchical clustering to determine initial
centroids
● Bisecting K-means
Multiple Runs
● One technique that is commonly
used to address the problem of
choosing initial centroids is to
perform multiple runs, each with
a different set of randomly
chosen initial centroids, and then
select the set of clusters with the
minimum SSE
● In Figure 5.6(a), the data
consists of two pairs of clusters,
where the clusters in each (top-
bottom) pair are closer to each
other than to the clusters in the
other pair.
● Figure 5.6 (b–d) shows that if we
start with two initial centroids per
pair of clusters, then even when
both centroids are in a single
cluster, the centroids will
redistribute themselves so that
the “true” clusters are found.
K-means++
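The K-means++ slides above are figure-only; as a reminder, the standard K-means++ initialization can be sketched as follows (illustrative, not taken from the slides):

```python
# K-means++ initialization: pick the first centroid uniformly at random, then pick
# each subsequent centroid with probability proportional to the squared distance
# to its nearest already-chosen centroid.
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

points = np.random.default_rng(1).normal(size=(300, 2))
print(kmeans_pp_init(points, k=3))     # pass these as initial centroids to K-means
```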
Bisecting K-means
● Bisecting K-means algorithm
– Variant of K-means that can produce a partitional or a
hierarchical clustering
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
https://www.geeksforgeeks.org/bisecting-k-means-algorithm-introduction/
Limitations of K-means
● K-means has problems when clusters are of
differing
– Sizes
– Densities
– Non-globular shapes
● K-means has problems when the data contains
outliers.
– One possible solution is to remove outliers before
clustering
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to find a large number of clusters such that each of them represents a part of
a natural cluster. But these small clusters need to be put together in a post-processing step.
Hierarchical Clustering
● Produces a set of nested clusters organized as a
hierarchical tree
● Can be visualized as a dendrogram
– A tree like diagram that records the sequences of
merges or splits
Strengths of Hierarchical Clustering
● Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
Hierarchical Clustering
● Two main types of hierarchical clustering
– Agglomerative:
◆ Start with the points as individual clusters
◆ At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
– Divisive:
◆ Start with one, all-inclusive cluster
◆ At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)
● Traditional hierarchical algorithms use a similarity or
distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
● Key Idea: Successively merge closest clusters
● Basic algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
● Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters
distinguish the different algorithms
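An illustrative sketch of agglomerative clustering with SciPy (the random points are assumptions for the example):

```python
# Agglomerative clustering: proximity matrix, repeated merging, dendrogram cut.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.normal(size=(12, 2))            # small illustrative data set

proximity = pdist(points)                    # step 1: pairwise proximity (distances)
Z = linkage(proximity, method="single")      # merge the two closest clusters repeatedly
# method can be "single" (MIN), "complete" (MAX), "average" (group average), or "ward"

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge sequence (needs matplotlib)
```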
Steps 1 and 2
● Start with clusters of individual points and a
proximity matrix
(figure: individual points p1–p5 and their proximity matrix)
Intermediate Situation
● After some merging steps, we have some clusters
(figure: current clusters C1–C5 and their proximity matrix)
Step 4
● We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
(figure: clusters C1–C5 before merging C2 and C5, with their proximity matrix)
Step 5
● The question is “How do we update the proximity matrix?”
(figure: merged cluster C2 ∪ C5; the rows and columns of the proximity matrix that must be updated are marked with "?")
How to Define Inter-Cluster Distance
(figure: points p1–p5 and their proximity matrix)
● MIN
● MAX
● Group Average
● Distance Between Centroids
● Other methods driven by an objective function
– Ward's Method uses squared error
MIN or Single Link
● Proximity of two clusters is based on the two
closest points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph
● Example:
Distance Matrix:
Hierarchical Clustering: MIN
(figure: nested clusters and dendrogram for single link (MIN))
Strength of MIN
Original Points Six Clusters
• Can handle non-elliptical shapes
Limitations of MIN
Original Points / Two Clusters / Three Clusters (figure panels)
• Sensitive to noise
MAX or Complete Linkage
● Proximity of two clusters is based on the two
most distant points in the different clusters
– Determined by all pairs of points in the two clusters
Distance Matrix:
Hierarchical Clustering: MAX
(figure: nested clusters and dendrogram for complete link (MAX))
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise
Limitations of MAX
Original Points Two Clusters
• Tends to break large clusters
• Biased towards globular clusters
Group Average
● Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.
Distance Matrix:
Hierarchical Clustering: Group Average
(figure: nested clusters and dendrogram for group average)
Hierarchical Clustering: Group Average
● Compromise between Single and Complete
Link
● Strengths
– Less susceptible to noise
● Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
73
● Similarity of two clusters is based on the increase in squared error when two clusters are merged
– Similar to group average if distance between points is distance squared
● Less susceptible to noise
● Biased towards globular clusters
● Hierarchical analogue of K-means
– Can be used to initialize K-means (see the sketch below)
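One way this initialization can work in practice is sketched below. It is only an illustration under assumed data X: cut a Ward dendrogram into k clusters and hand the cluster centroids to K-means as initial centers.

```python
# Sketch: use Ward's hierarchical clustering to seed K-means (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # placeholder data
k = 3

Z = linkage(X, method="ward")                    # Ward's method (squared-error based)
labels = fcluster(Z, t=k, criterion="maxclust")  # cut the dendrogram into k clusters

centers = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)  # K-means initialized by Ward
```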
Hierarchical Clustering: Comparison
74
[Figure: the same six points clustered by MIN, MAX, Group Average, and Ward’s Method]
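A comparison along these lines can be reproduced with SciPy's hierarchical clustering routines. The sketch below uses made-up data X and simply runs all four schemes on the same points and draws their dendrograms.

```python
# Sketch: compare single, complete, average, and Ward linkage on the same data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 2))                     # placeholder data

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, method in zip(axes.ravel(), ["single", "complete", "average", "ward"]):
    dendrogram(linkage(X, method=method), ax=ax, no_labels=True)
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()
```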
Hierarchical Clustering: Time and Space Requirements
75
● O(N²) space since it uses the proximity matrix.
– N is the number of points.
● O(N³) time in many cases
– There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
– Complexity can be reduced to O(N² log N) time with some cleverness
Hierarchical Clustering: Problems and Limitations
76
● Once a decision is made to combine two clusters, it cannot be undone
● No global objective function is directly minimized
● Different schemes have problems with one or more of the following:
– Sensitivity to noise
– Difficulty handling clusters of different sizes and non-globular shapes
– Breaking large clusters
Density Based Clustering
● Clusters are regions of high density that are separated from one another by regions of low density.
77
DBSCAN
● DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has at least a specified number of
points (MinPts) within Eps
◆ These are points that are at the interior of a cluster
78
◆ Counts the point itself
– A border point is not a core point, but is in the neighborhood
of a core point
– A noise point is any point that is not a core point or a border
point
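A minimal sketch of these definitions is shown below (names are illustrative): each point is labelled core, border, or noise for given Eps and MinPts, with the neighborhood count including the point itself as noted above.

```python
# Sketch: classify points as core, border, or noise for given Eps and MinPts.
import numpy as np
from scipy.spatial.distance import cdist

def label_points(X, eps, min_pts):
    D = cdist(X, X)
    n_neighbors = (D <= eps).sum(axis=1)              # count includes the point itself
    core = n_neighbors >= min_pts
    border = ~core & (D[:, core] <= eps).any(axis=1)  # non-core but within Eps of a core point
    noise = ~core & ~border
    return core, border, noise
```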
DBSCAN: Core, Border, and Noise Points
79
[Figure: illustration of core, border, and noise points with MinPts = 7]
DBSCAN: Core, Border and Noise Points
80
[Figure: original points and their point types (core, border, noise) for Eps = 10, MinPts = 4]
DBSCAN Algorithm
81
● Form clusters using core points, and assign each border point to one of its neighboring clusters
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points within a distance Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.
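For comparison, scikit-learn's DBSCAN implements the same idea. In this usage sketch (with placeholder data), points labelled -1 correspond to the noise points eliminated in step 2.

```python
# Usage sketch of scikit-learn's DBSCAN on placeholder data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))                     # placeholder data

db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_                               # cluster index per point, -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {int((labels == -1).sum())}")
```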
When DBSCAN Works Well
82
[Figure: original points and the discovered clusters (dark blue points indicate noise)]
• Can handle clusters of different shapes and sizes
• Resistant to noise
When DBSCAN Does NOT Work Well
83
[Figure: original points and the clusterings obtained with (MinPts=4, Eps=9.92) and (MinPts=4, Eps=9.75)]
• Varying densities
• High-dimensional data
  • 11. 16 Data Warehouse vs. Operational DBMS ■ OLTP (on-line transaction processing) ■ Major task of traditional relational DBMS ■ Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. ■ OLAP (on-line analytical processing) ■ Major task of data warehouse system ■ Data analysis and decision making ■ Distinct features (OLTP vs. OLAP): ■ User and system orientation: customer vs. market ■ Data contents: current, detailed vs. historical, consolidated ■ Database design: ER + application vs. star + subject ■ View: current, local vs. evolutionary, integrated ■ Access patterns: update vs. read-only but complex queries
  • 12. 17 Why Separate Data Warehouse? ■ High performance for both systems ■ DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery ■ Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation ■ Different functions and different data: ■ missing data: Decision support requires historical data which operational DBs do not typically maintain ■ data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources ■ data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled ■ Note: There are more and more systems which perform OLAP analysis directly on relational databases
  • 13. 18
  • 14. Data Warehousing: A Multitiered Architecture 19 ■ Bottom Tier: ■ Warehouse Database server ■ a relational database system ■ Back-end tools and utilities ■ data extraction ■ by using API gateways(ODBC, JDBC & OLEDB) ■ cleaning ■ transformation ■ load & refresh
  • 15. Data Warehousing: A Multitiered Architecture 20 ■ Middle Tier (OLAP server) ■ ROLAP - Relational OLAP ■ extended RDBMS that maps operations on multidimensional data to standard relational operations. ■ MOLAP - Multidimensional OLAP ■ Special-purpose server that directly implements multidimensional data and operations. ■ Top Tier ■ Front-end Client Layer ■ Query and reporting tools, analysis tools and data mining tools.
  • 16. Data Warehousing: A Multitiered Architecture 21 ■ Data Warehouse Models: ■ Enterprise warehouse: ■ collects all of the information about subjects spanning the entire organization. ■ corporate-wide data integration ■ can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. ■ implemented on mainframes, computer superservers, or parallel architecture platforms
  • 17. Data Warehousing: A Multitiered Architecture 22 ■ Data Warehouse Models: ■ Data mart:a subset of corporate-wide data that is of value to a specific group of users ■ confined to specific selected subjects. ■ Example - marketing data mart may confine its subjects to customer, item, and sales. ■ implemented on low-cost departmental servers ■ Independent Data mart - data captured from ■ one or more operational systems or external information providers, or ■ from data generated locally within a particular department or geographic area. ■ Dependent Data mart - sourced directly from enterprise data warehouses.
  • 18. Data Warehousing: A Multitiered Architecture 23 ■ Data Warehouse Models: ■ Virtual warehouse: ■ A virtual warehouse is a set of views over operational databases. ■ easy to build but requires excess capacity on operational database servers.
  • 19. Data Warehousing: A Multitiered Architecture 24 ■ Data extraction: gathers data from multiple, heterogeneous, and external sources. ■ Data Cleaning: detects errors in the data and rectifies them when possible ■ Data transformation: converts data from legacy or host format to warehouse format. ■ Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions. ■ Refresh: propagates the updates from the data sources to the warehouse.
  • 20. Data Warehousing: A Multitiered Architecture 25 Metadata Repository: metadata are the data that define warehouse objects It consists of: 1) Data warehouse structure 2) Operational metadata 3) algorithms used for summarization 4) Mapping from the operational environment to the data warehouse 5) Data related to system performance 6) Business metadata
  • 21. Data Warehousing: A Multitiered Architecture 26 Metadata Repository: ■ data warehouse structure i) warehouse schema, ii) view, dimensions, iii) hierarchies, and iv) derived data definitions, v) data mart locations and contents. ■ Operational metadata i) data lineage (history of migrated data and the sequence of transformations applied to it), ii) currency of data (active, archived, or purged), iii) monitoring information (warehouse usage statistics, error reports, and audit trails).
  • 22. Data Warehousing: A Multitiered Architecture 27 Metadata Repository: ■ The algorithms used for summarization, i) measure and dimension definition algorithms, ii) data on granularity, iii) partitions, iv) subject areas, v) aggregation, vi) summarization, and vii) predefined queries and reports.
  • 23. Data Warehousing: A Multitiered Architecture 28 Metadata Repository: 1) Mapping from the operational environment to the data warehouse i) source databases and their contents, ii) gateway descriptions, iii) data partitions, iv) data extraction, cleaning, transformation rules and defaults v) data refresh and purging rules, and vi) security (user authorization and access control).
  • 24. Data Warehousing: A Multitiered Architecture 29 Metadata Repository: ■ Data related to system performance ■ indices and profiles that improve data access and retrieval performance, ■ rules for the timing and scheduling of refresh, update, and replication cycles. ■ Business metadata, ■ business terms and definitions, ■ data ownership information, and ■ charging policies
  • 26. Data Warehouse Modeling: Data Cube : A Multidimensional Data Model 31 ■ A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. ■ Dimensions are the perspectives or entities with respect to which an organization wants to keep records. ■ Example:- ■ AllElectronics may create a sales data warehouse ■ time, item, branch, and location - These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.
  • 27. Data Warehouse Modeling: Data Cube : A Multidimensional Data Model 32 ■ Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. ■ For example - a dimension table for item may contain the attributes item name, brand, type. ■ A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table. ■ Facts are numeric measures. ■ The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
  • 28. 33 Data Cube: A Multidimensional Data Model ■ A data warehouse is based on a multidimensional data model which views data in the form of a data cube ■ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions ■ Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) ■ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
  • 29. 34 Data Cube: A Multidimensional Data Model ■ A data cube is a lattice of cuboids ■ A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which ■ each dimension corresponds to an attribute or a set of attributes in the schema, and ■ each cell stores the value of some aggregate measure such as count or sum(sales_amount). ■ A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
  • 30. Data Cube: A Multidimensional Data Model 35 2-D View of Sales data ■ AllElectronics sales data for items sold per quarter in the city of Vancouver. ■ a simple 2-D data cube that is a table or spreadsheet for sales data from AllElectronics
  • 31. Data Cube: A Multidimensional Data Model 36 3-D View of a Sales data The 3-D data in the table are represented as a series of 2-D tables
  • 32. Data Cube: A Multidimensional Data Model 37 3D Data Cube Representation of Sales data we may also represent the same data in the form of a 3D data cube
  • 33. Data Cube: A Multidimensional Data Model 38 4-D Data Cube Representation of Sales Data we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”
  • 34. 39 Cube: A Lattice of Cuboids all time item location supplier 0-D(apex) cuboid 1-D cuboids time,location item,location location,supplier time,item time,supplier item,supplier 2-D cuboids time,item,location time,location,supplier cuboids time,item,supplier item,location,supplier 4-D(base) cuboid time, item, location, supplier
  • 35. 40 ■ In data warehousing literature, an n-D base cube is called a base cuboid. ■ The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. ■ In our example, this is the total sales, or dollars sold, summarized over all four dimensions. ■ The apex cuboid is typically denoted by all. ■ The lattice of cuboids forms a data cube.
  • 36. Schemas for Multidimensional Data Models 41 ■ Modeling data warehouses: dimensions & measures ■ Star schema: A fact table in the middle connected to a set of dimension tables ■ Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake ■ Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
  • 37. Schemas for Multidimensional Data Models 42 ■ Star schema: In this, a data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. ■ Each dimension is represented by only one table. ■ Each table contains a set of attributes ■ Problem: redundancy in dimension tables. ■ ex:- location dimension table will create redundancy among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).
  • 39. 44 Snow flake schema ■ Variant of the star schema model ■ Dimension tables are normalized ( to remove redundancy) ■ Dimension table is splitted into additional tables. ■ The resulting schema graph forms a shape similar to a snowflake. ■ Problem ■ more joins will be needed to execute a query ( affects system performance) ■ so this is not as popular as the star schema in data warehouse design.
  • 41. 46 Fact Constellation ● A fact constellation schema allows dimension tables to be shared between fact tables ● A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. ● For data warehouses, the fact constellation schema is commonly used. ● For data marts, the star or snowflake schema is commonly used
  • 42. 47 Fact Constellation This schema specifies two fact tables, sales and shipping the dimensions tables for time, item, and location are shared between the sales and shipping fact tables.
  • 43. 48 Examples for Defining Star, Snowflake, and Fact Constellation Schemas ■ Just as relational query languages like SQL can be used to specify relational queries, a data mining query language (DMQL) can be used to specify data mining tasks. ■ Data warehouses and data marts can be defined using two language primitives, one for cube definition and one for dimension definition.
  • 44. 49 Syntax for Cube and Dimension Definition in DMQL ■ Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> ■ Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_subdimension_list>) ■ Special Case (Shared Dimension Tables) ■ First time as “cube definition” ■ define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
  • 45. Defining Star Schema in DMQL 50 define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country)
  • 46. Defining Snowflake Schema in DMQL 51 define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country))
  • 47. Defining Fact Constellation in DMQL 52 define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales
  • 48. Concept Hierarchies courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman ■ A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level. ■ concept hierarchy for the dimension location 53
  • 49. Concept Hierarchies ■ A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy. courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman 54
  • 50. Concept Hierarchies 55 ■ Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. ■ A total or partial order can be defined among groups of values.
  • 51. Measures of Data Cube: Three Categories 56 ■ A multidimensional point in the data cube space can be defined by a set of dimension-value pairs, for example, 〈time = “Q1”, location = “Vancouver”, item = “computer”〉. ■ A data cube measure is a numerical function that can be evaluated at each point in the data cube space. ■ A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point. ■ Based on the kind of aggregate functions used, measures can be organized into three categories : distributive, algebraic, holistic
  • 52. Measures of Data Cube: Three Categories ■ Distributive: An aggregate function is distributive if the result derived by applying the function to n aggregate values is same as that derived by applying the function on all the data without partitioning ■ E.g., count(), sum(), min(), max() ■ Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. ■ E.g., avg()=sum()/count(), min_N(), standard_deviation() ■ Holistic: An aggregate function is holistic if there is no constant bound on the storage size and there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. ■ E.g., median(), mode(), rank() 57
  • 53. 58 Typical OLAP Operations ■ Roll up (drill-up): ■ Drill down (roll down): ■ Slice and dice: project and select ■ Pivot (rotate): ■ reorient the cube, visualization, 3D to series of 2D planes ■ Other operations ■ drill across: involving (across) more than one fact table ■ drill through: Allows users to analyze the same data through different reports, analyze it with different features and even display it through different visualization methods
  • 54. 59 Fig. 3.10 Typical OLAP Operations
  • 55. Source & Courtesy: https://www.javatpoint.com/olap-operations 60 Typical OLAP Operations:Roll Up/Drill Up ■ summarize data ■ by climbing up hierarchy or ■ by dimension reduction
  • 56. Source & Courtesy: https://www.javatpoint.com/olap-operations 61 Typical OLAP Operations:Roll Down ■ reverse of roll-up ■ from higher level summary to lower level summary or detailed data, or introducing new dimensions
  • 57. Typical OLAP Operations:Slicing ● Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension. ● Example: The sales figures of all sales regions and all product categories of the company in the year 2005 and 2006 are "sliced" out of the data cube. Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube 62
  • 58. Typical OLAP Operations:Slicing Slicing: It selects a single dimension from the OLAP cube which results in a new sub-cube creation. Source & Courtesy: https://www.javatpoint.com/olap-operations 63
  • 59. Typical OLAP Operations:Dice ● Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions ● The picture shows a dicing operation: The new cube shows the sales figures of a limited number of product categories, the time and region dimensions cover the same range as before. Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube 64
  • 60. Typical OLAP Operations:Dicing Dice: It selects a sub- cube from the OLAP cube by selecting two or more dimensions. Source & Courtesy: https://www.javatpoint.com/olap-operations 65
  • 61. Typical OLAP Operations:Pivot 66 Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
  • 62. A Star-Net Query Model 67 ● The querying of multidimensional databases can be based on a starnet model. ● It consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. ● Each abstraction level in the hierarchy is called a footprint. ● These represent the granularities available for use by OLAP operations such as drill-down and roll-up.
  • 63. A Star-Net Query Model 68
  • 64. A Star-Net Query Model 69 ■ Four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively ■ footprints representing abstraction levels of the dimension - time line has four footprints: “day,” “month,” “quarter,” and “year.” ■ Concept hierarchies can be used to generalize data by replacing low-level values (such as “day” for the time dimension) by higher-level abstractions (such as “year”) or ■ to specialize data by replacing higher-level abstractions with lower-level values.
  • 65. Data Warehouse Design and Usage 70 A Business Analysis Framework for Data Warehouse Design: ■ To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. ■ Different views are combined to form a complex framework.
  • 66. Data Warehouse Design and Usage 71 ■ Four different views regarding a data warehouse design must be considered: ■ Top-down view ■ allows the selection of the relevant information necessary for the data warehouse (matches current and future business needs). ■ Data source view ■ exposes the information being captured, stored, and managed by operational systems. ■ Documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. ■ Modeled in ER model or CASE (computer-aided software engineering).
  • 67. Data Warehouse Design and Usage 72 ■ Data warehouse view ■ includes fact tables and dimension tables. ■ It represents the information that is stored inside the data warehouse, including ■ precalculated totals and counts, ■ information regarding the source, date, and time of origin, added to provide historical context. ■ Business query view ■ is the data perspective in the data warehouse from the end-user’s viewpoint.
  • 68. Data Warehouse Design and Usage ■ Skills required to build & use a Data warehouse ■ Business Skills ■ how systems store and manage their data, ■ how to build extractors (operational DBMS to DW) ■ how to build warehouse refresh software(update) ■ Technology skills ■ the ability to discover patterns and trends, ■ to extrapolate trends based on history and look for anomalies or paradigm shifts, and ■ to present coherent managerial recommendations based on such analysis. ■ Program management skills ■ Interface with many technologies, vendors, and end- users in order to deliver results in a timely and cost effective manner 73
  • 69. Data Warehouse Design and Usage 74 Data Warehouse Design Process ■ A data warehouse can be built using ■ Top-down approach (overall design and planning) ■ It is useful in cases where the technology is mature and well known ■ Bottom-up approach(starts with experiments & prototypes) ■ a combination of both. ■ In SE point of view ( Waterfall model or Spiral model) structured and systematic ■ planning, ■ requirements study, ■ problem analysis, ■ warehouse design, ■ ● rapid generation, short intervals between successive releases, good choice for data warehouse development ● turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely analysis at each data integration and testing, andmanner step, one step to■ finally deployment of the data warehouse the next
  • 70. Data Warehouse Design and Usage 75 Data Warehouse Design Process ■ 4 major Steps involved in Warehouse design are: ■ 1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger). ■ Data warehouse model - If the business process is organizational and involves multiple complex object collections ■ Data mart model - if the process is departmental and focuses on the analysis of one kind of business process
  • 71. Data Warehouse Design and Usage 76 ■ 2. Choose the business process grain ■ Fundamental, atomic level of data to be represented in the fact table ■ (e.g., individual transactions, individual daily snapshots, and so on). ■ 3. Choose the dimensions that will apply to each fact table record. ■ Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. ■ 4. Choose the measures that will populate each fact table record. ■ Typical measures are numeric additive quantities like dollars sold and units sold.
  • 72. Data Warehouse Design and Usage 77 Data Warehouse Usage for Information Processing ■ Evolution of DW takes place throughout a number of phases. ■ Initial Phase - DW is used for generating reports and answering predefined queries. ■ Progressively - to analyze summarized and detailed data, (results are in the form of reports and charts) ■ Later - for strategic purposes, performing multidimensional analysis and sophisticated slice-and- dice operations. ■ Finally - for knowledge discovery and strategic decision making using data mining tools.
  • 74. 79 Data warehouse implementation ■ OLAP servers demand that decision support queries be answered in the order of seconds. ■ Methods for the efficient implementation of data warehouse systems. ■ 1. Efficient data cube computation. ■ 2. OLAP data indexing (bitmap or join indices ) ■ 3. OLAP query processing ■ 4. Various types of warehouse servers for OLAP processing.
  • 75. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 80 ■ Requires efficient computation of aggregations across many sets of dimensions. ■ In SQL terms: ■ Aggregations are referred to as group-by’s. ■ Each group-by can be represented by a cuboid, ■ set of group-by’s forms a lattice of cuboids defining a data cube. ■ Compute cube Operator - computes aggregates over all subsets of the dimensions specified in the operation. ■ require excessive storage space for large number of dimensions.
  • 76. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 81 Example 4.6 ■ create a data cube for AllElectronics sales that contains the following: ■ city, item, year, and sales in dollars.
  • 77. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 82 ■ What is the total number of cuboids, or group- by’s, that can be computed for this data cube? ■ 3 attributes - city, item & year -3 dimensions ■ sales in dollars - measure, ■ the total number of cuboids, or group by’s, ■ 2 POWER 3 = 8. ■ The possible group-by’s are the following: ■ {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()} ■ () - group-by is empty (i.e., the dimensions are not grouped) - all. ■ group-by’s form a lattice of cuboids for the data cube
  • 78. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 83
  • 79. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 84 ■ Base cuboid contains all three dimensions(city, item, year) ■ returns - total sales for any combination of the three dimensions. ■ This is least generalized (most specific) of the cuboids. ■ Apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty (contains total sum of all sales) ■ This is most generalized (least specific) of the cuboids ■ Drill Down equivalent ■ start at the apex cuboid and explore downward in the lattice ■ akin to rolling up ■ start at the base cuboid and explore upward
  • 80. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 85 ■ zero-dimensional operation: ■ An SQL query containing no group-by ■ Example - “compute the sum of total sales” ■ one-dimensional operation: ■ An SQL query containing one group-by ■ Example - “compute the sum of sales group-by city” ■ A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.
  • 81. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation ■ data cube could be defined as: ■ “define cube sales_cube [city, item, year]: sum(sales_in_dollars)” ■ 2 power n cuboids - For a cube with n dimensions ■ “compute cube sales_cube” - statement ■ computes the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset. ■ In OLAP, for diff. queries diff. cuboids need to be accessed. ■ Precomputation - compute in advance all or at least some of the cuboids in a data cube ■ curse of dimensionality - required storage space may explode if all the cuboids in a data cube are precomputed ( for more dimensions) 86
  • 82. Data warehouse implementation: 87 1.4.1 Efficient Data Cube Computation ■ Data cube can be viewed as a lattice of cuboids ■ 2 power n - when no concept hierarchy ■ How many cuboids in an n-dimensional cube with L levels? ■ where Li is the number of levels associated with dimension i ( +1 for all ) ■ If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 510 ≈ 9.8 × 106 .
  • 83. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 88 There are three choices for data cube materialization for a given base cuboid: ■ 1. No materialization: Do not precompute - expensive multidimensional aggregates - extremely slow. ■ 2. Full materialization: Precompute all of the cuboids - full cube - requires huge amounts of memory space in order to store all of the precomputed cuboids.
  • 84. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 89 ■ 3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids. ■ compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion - subcube ■ 3 factors to consider: ■ (1) identify the subset of cuboids or subcubes to materialize; ■ (2) exploit the materialized cuboids or subcubes during query processing; and ■ (3) efficiently update the materialized cuboids or subcubes during load and refresh.
  • 85. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 90 ■ Partial Materialization: Selected Computation of Cuboids ■ Following should take into account during selection of the subset of cuboids or subcubes ■ the queries in the workload, their frequencies, and their accessing costs ■ workload characteristics, the cost for incremental updates, and the total storage requirements. ■ physical database design such as the generation and selection of indices.
  • 86. Data warehouse implementation: 1.4.1 Efficient Data Cube Computation 91 ■ Heuristic approaches for cuboid and subcube selection ■ Iceberg cube: ■ data cube that stores only those cube cells with an aggregate value (e.g., count) that is above some minimum support threshold. ■ shell cube: ■ precomputing the cuboids for only a small number of dimensions
  • 87. Data warehouse implementation: 1.3.2 Indexing OLAP Data: Bitmap Index 92 Index structures - To facilitate efficient data accessing ■ Bitmap indexing method - it allows quick searching in data cubes. ■ In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s domain. ■ If a given attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). ■ If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
  • 88. Data warehouse implementation: 1.3.2 Indexing OLAP Data: Bitmap Index 93 ● Example:- AllElectronics data warehouse ● dim(item)={H,C,P,S} - 4 values - 4 bit vectors ● dim(city)= {V,T} - 2 values - 2 bit vectors ● Better than Hash & Tree Indices but good for low cardinality only (cardinality:number of unique items in the database column)
  • 90. Data warehouse implementation: Indexing OLAP Data: Join Index 95 ■ Traditional indexing maps the value in a given column to a list of rows having that value. ■ Join indexing registers the joinable rows of two relations from a relational database. ■ For example, ■ two relations - R(RID, A) and S(B, SID) ■ join on the attributes A and B, ■ join index record contains the pair (RID, SID), ■ where RID and SID are record identifiers from the R and S relations, respectively
  • 91. Data warehouse implementation: Indexing OLAP Data: Join Index 96 ■ Advantage:- ■ Identification of joinable tuples without performing costly join operations. ■ Useful:- ■ To maintain the relationship between a foreign key(fact table) and its matching primary keys(dimension table), from the joinable relation. ■ Indexing maintains relationships between attribute values of a dimension (e.g., within a dimension table) and the corresponding rows in the fact table. ■ Composite join indices: Join indices with multiple dimensions.
  • 92. Data warehouse implementation: Indexing OLAP Data: Join Index ■ Example:-Star Schema ■ “sales_star [time, item, branch, location]: dollars_sold = sum (sales_in_dollars).” ■ join index is relationship between ■ Sales fact table and ■ the location, item dimension tables To speed up query processing - join indexing & bitmap indexing methods can be integrated to form bitmapped join indices. 97
  • 93. Data warehouse implementation: Efficient processing of OLAP queries 98 Given materialized views, query processing should proceed as follows: ■ 1. Determine which operations should be performed on the available cuboids: ■ This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations. ■ Example: ■ slicing and dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid.
  • 94. Data warehouse implementation: Efficient processing of OLAP queries 99 ■ 2. Determine to which materialized cuboid(s) the relevant operations should be applied: ■ pruning the set using knowledge of “dominance” relationships among the cuboids, ■ estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
  • 95. Data warehouse implementation: Efficient processing of OLAP queries 100 Example:- ■ define a data cube for AllElectronics of the form “sales cube [time, item, location]: sum(sales in dollars).” ■ dimension hierarchies ■ “day < month < quarter < year” for time; ■ “item_name < brand < type” for item ■ “street < city < province or state < country” for location ■ Query: ■ {brand, province or state}, with the selection constant “year = 2010.”
  • 96. Data warehouse implementation: Efficient processing of OLAP queries 101 ■ suppose that there are four materialized cuboids available, as follows: ■ Which of these four cuboids should be selected to process the query? Ans: 1,3,4 ■ Low cost cuboid to process the query? Ans: 4
  • 97. Data warehouse implementation: OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP 102 ■ Relational OLAP (ROLAP) servers: ■ ROLAP uses relational tables to store data for online analytical processing ■ Intermediate servers that stand in between a relational back-end server and client front-end tools. ■ Operation: ■ use a relational or extended-relational DBMS to store and manage warehouse data ■ OLAP middleware to support missing pieces ■ ROLAP has greater scalability than MOLAP. ■ Example:- ■ DSS server of Microstrategy
  • 98. Data warehouse implementation: OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP 103 ■ Multidimensional OLAP (MOLAP) servers: ■ support multidimensional data views through array- based multidimensional storage engines ■ maps multidimensional views directly to data cube array structures. ■ Advantage: ■ fast indexing to precomputed summarized data. ■ adopt a two-level storage representation ■ Denser subcubes are stored as array structures ■ Sparse subcubes employ compression technology A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non- zeros.
  • 99. Data warehouse implementation: OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP ■ Hybrid OLAP (HOLAP) servers: ■ Combines ROLAP and MOLAP technology ■ benefits ■ greater scalability from ROLAP and ■ faster computation of MOLAP. ■ HOLAP server may allow ■ large volumes of detailed data to be stored in a relational database, ■ while aggregations are kept in a separate MOLAP store. ■ Example:- Microsoft SQL Server 2000 (supports) ■ Specialized SQL servers: ■ provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment. 104
  • 100. 105 From Data Warehousing to Data Mining
  • 101. 106 From DataWarehousing to Data Mining DataWarehouse Usage ■ Data warehouses and data marts are used in a wide range of applications. ■ Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. ■ data warehouses are used as an integral part of a plan-execute-assess “closed-loop” feedback system for enterprise management. ■ Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.
  • 102. DataWarehouse Usage 107 ■ There are three kinds of data warehouse applications: ■ information processing ■ analytical processing ■ data mining
  • 103. DataWarehouse Usage 108 ■ Information processing supports ■ querying, ■ basic statistical analysis, and ■ reporting using crosstabs, tables, charts, or graphs. ■ Analytical processing supports ■ basic OLAP operations, ■ slice-and-dice, drill-down, roll-up, and pivoting. ■ It generally operates on historic data in both summarized and detailed forms. ■ multidimensional data analysis
  • 104. DataWarehouse Usage 109 ■ Data mining supports ■ knowledge discovery by finding hidden patterns and associations, ■ constructing analytical models, ■ performing classification and prediction, and ■ presenting the mining results using visualization tools. ■ Note:- ■ Data Mining is different with Information Processing and Analytical processing
  • 105. From Online Analytical Processing to Multidimensional Data Mining 110 ■ On-line analytical mining (OLAM) (also called OLAP mining) integrates on-line analytical processing (OLAP) with data mining and mining knowledge in multidimensional databases. ■ OLAM is particularly important for the following reasons: ■ High quality of data in data warehouses. ■ Available information processing infrastructure surrounding data warehouses ■ OLAP-based exploratory data analysis: ■ On-line selection of data mining functions
  • 106. Architecture for On-Line Analytical Mining 111 ■ An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. ■ An integrated OLAM and OLAP architecture is shown in Figure, where the OLAM and OLAP servers both accept user on-line queries (or commands) via a graphical user interface API and work with the data cube in the data analysis via a cube API. ■ The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a datawarehouse via a database API that may support OLE DB or ODBC connections.
  • 107. 112
  • 108. Data Mining & Motivating Challenges UNIT - II By M. Rajesh Reddy
  • 109. WHAT IS DATA MINING? • Data mining is the process of automatically discovering useful information in large data repositories. • To find novel and useful patterns that might otherwise remain unknown. otherwise remain unknown. • provide capabilities to predict the outcome of a future observation, • Example • predicting whether a newly arrived customer will spend more than $100 at a department store.
  • 110. WHAT IS DATA MINING? • Not all information discovery tasks are considered to be data mining. • For example, tasks related to the area of information retrieval. retrieval. • looking up individual records using a database management system or • finding particular Web pages via a query to an Internet search engine • To enhance information retrieval systems.
  • 111. WHAT IS DATA MINING? Data Mining and Knowledge • Data mining is an integral part of Knowledge Discovery in Databases (KDD), • process of converting raw data into useful • process of converting raw data into useful information • This process consists of a series of transformation steps
  • 112. WHAT IS DATA MINING? • Preprocessing - to transform the raw input data into an appropriate format for subsequent analysis. • Steps involved in data preprocessing • Fusing (joining) data from multiple sources, • Fusing (joining) data from multiple sources, • cleaning data to remove noise and duplicate observations • selecting records and features that are relevant to the data mining task at hand. • most laborious and time-consuming step
  • 113. WHAT IS DATA MINING? • Post Processing: • only valid and useful results are incorporated into the decision support system. • Visualization • Visualization • allows analysts to explore the data and the data mining results from a variety of viewpoints. • Statistical measures or hypothesis testing methods can also be applied • to eliminate spurious (false or fake) data mining results.
  • 114. Motivating Challenges: • challenges that motivated the development of data mining. • Scalability • High Dimensionality • Heterogeneous and Complex Data • Data Ownership and Distribution • Non-traditional Analysis
  • 115. Motivating Challenges: • Scalability • Size of datasets are in the order of GB, TB or PB. • special search strategies • special search strategies • implementation of novel data structures ( for efficient access) • out-of-core algorithms - for large datasets • sampling or developing parallel and distributed algorithms.
  • 116. Motivating Challenges: • High Dimensionality • common today - data sets with hundreds or thousands of attributes • Example • Bio-Informatics - microarray technology has • Bio-Informatics - microarray technology has produced gene expression data involving thousands of features. • Data sets with temporal or spatial components also tend to have high dimensionality. • a data set that contains measurements of temperature at various locations.
  • 117. Motivating Challenges: Heterogeneous and Complex Data • Traditional data analysis methods - data sets - attributes of the same type - either continuous or categorical. • Examples of such non-traditional types of data include • collections of Web pages containing semi-structured text and hyperlinks; text and hyperlinks; • DNA data with sequential and three-dimensional structure and • climate data with time series measurements • DM should maintain relationships in the data, such as • temporal and spatial autocorrelation, • graph connectivity, and • parent-child relationships between the elements in semi-structured text and XML documents.
  • 118. Motivating Challenges: • Data Ownership and Distribution • Data is not stored in one location or owned by one organization • geographically distributed among resources belonging to multiple entities. • This requires the development of distributed data mining techniques. • This requires the development of distributed data mining techniques. • key challenges in distributed data mining algorithms • (1) reduction in the amount of communication needed • (2) effective consolidation of data mining results obtained from multiple sources, and • (3) Data security issues.
  • 119. Motivating Challenges: • Non-traditional Analysis: • Traditional statistical approach: hypothesize-and-test paradigm. • A hypothesis is proposed, • an experiment is designed to gather the data, and • then the data is analyzed with respect to the hypothesis. • then the data is analyzed with respect to the hypothesis. • Current data analysis tasks • Generation and evaluation of thousands of hypotheses, • Some DM techniques automate the process of hypothesis generation and evaluation. • Some data sets frequently involve non-traditional types of data and data distributions.
  • 120. Origins of Data mining, Data mining Tasks & Types of Data Types of Data Unit - II DWDM
  • 121. The Origins of Data Mining Data mining draws upon ideas, such as ■ (1) sampling, estimation, and hypothesis testing from statistics and ■ (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. ■ (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning.
  • 122. The Origins of Data Mining ■ adopt ideas from other areas, including – optimization, – evolutionary computing, – information theory, – information theory, – signal processing, – visualization, and – information retrieval
  • 123. The Origins of Data Mining ■ An optimization algorithm is a procedure which is executed iteratively by comparing various solutions till an optimum or a satisfactory solution is found. ■ Evolutionary Computation is a field of optimization theory where instead of ■ Evolutionary Computation is a field of optimization theory where instead of using classical numerical methods to solve optimization problems, we use inspiration from biological evolution to ‘evolve’ good solutions – Evolution can be described as a process by which individuals become ‘fitter’ in different environments through adaptation, natural selection, and selective breeding. picture of the famous finches Charles Darwin depicted in his journal
  • 124. The Origins of Data Mining ■ Information theory is the scientific study of the quantification, storage, and communication of digital information. ■ The field was fundamentally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. ■ The field is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.
• 125. The Origins of Data Mining ■ Other key areas: – database systems ■ to provide support for efficient storage, indexing, and query processing. – Techniques from high performance (parallel) computing ■ addressing the massive size of some data sets. – Distributed techniques ■ also help address the issue of size and are essential when the data cannot be gathered in one location.
• 126. Data Mining Tasks ■ Data mining tasks are generally divided into two major categories: – Predictive tasks - use some variables to predict unknown or future values of other variables ■ Task objective: predict the value of a particular attribute based on the values of other attributes. ■ Target/Dependent variable: attribute to be predicted ■ Explanatory or independent variables: attributes used for making the prediction – Descriptive tasks - find human-interpretable patterns that describe the data. ■ Task objective: derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. ■ Descriptive data mining tasks are often exploratory in nature and frequently require post-processing techniques to validate and explain the results.
• 127. Data Mining Tasks ■ Correlation is a statistical term describing the degree to which two variables move in coordination with one another. ■ Trends: a general direction in which something is developing or changing. ■ Trajectory data mining enables prediction of the moving location details of humans, vehicles, animals and so on. ■ Anomaly detection is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior. ■ Clusters – Clustering is the task of grouping data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups https://www.javatpoint.com/data-mining-cluster-analysis
• 128. Data Mining Tasks … (Figure: the four core data mining tasks; source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar)
  • 130. Data Mining Tasks ■ Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. ■ 2 types of predictive modeling tasks: – Classification: Used for discrete target variables – Regression: used for continuous target variables.
• 131. Data Mining Tasks ■ Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. ■ 2 types of predictive modeling tasks: – Classification: used for discrete target variables – Regression: used for continuous target variables. – Example: ■ Classification task: predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. ■ Regression task: forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. – Goal of both tasks: learn a model that minimizes the error between the predicted and true values of the target variable. – Predictive modeling can be used to: ■ identify customers that will respond to a marketing campaign, ■ predict disturbances in the Earth’s ecosystem, or ■ judge whether a patient has a particular disease based on the results of medical tests.
• 132. Data Mining Tasks ■ Example (Predicting the Type of a Flower): the task of predicting a species of flower based on the characteristics of the flower. ■ Iris species: Setosa, Versicolour, or Virginica. ■ Requirement: need a data set containing the characteristics of various flowers of these three species. ■ 4 other attributes (dataset): sepal width, sepal length, petal length, and petal width. ■ Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. ■ Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. ■ Based on these categories of petal width and length, the following rules can be derived (a code sketch follows below): – Petal width low and petal length low implies Setosa. – Petal width medium and petal length medium implies Versicolour. – Petal width high and petal length high implies Virginica.
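The three rules above can be written directly as a tiny rule-based classifier. A minimal sketch in Python, assuming petal measurements in centimetres as in the standard Iris data; the function names (petal_category, classify_iris) are illustrative, not from the text.

def petal_category(value, medium_start, high_start):
    """Map a continuous petal measurement to low/medium/high."""
    if value < medium_start:
        return "low"
    if value < high_start:
        return "medium"
    return "high"

def classify_iris(petal_width, petal_length):
    """Apply the rules derived from the discretized petal attributes."""
    width = petal_category(petal_width, 0.75, 1.75)   # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = petal_category(petal_length, 2.5, 5.0)   # [0, 2.5),  [2.5, 5),     [5, inf)
    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return "unclassified"   # combinations the three rules do not cover

print(classify_iris(0.2, 1.4))   # -> Setosa
print(classify_iris(1.3, 4.5))   # -> Versicolour
print(classify_iris(2.1, 5.8))   # -> Virginica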
  • 133. Data Mining Tasks ■ Example: (Predicting the Type of a Flower):
• 134. Data Mining Tasks Example: (Predicting the Type of a Flower)
• 135. Data Mining Tasks ■ Association analysis – used to discover patterns that describe strongly associated features in the data. – Discovered patterns are represented in the form of implication rules or feature subsets. – Goal of association analysis: ■ To extract the most interesting patterns in an efficient manner. – Example ■ finding groups of genes that have related functionality, ■ identifying Web pages that are accessed together, or ■ understanding the relationships between different elements of Earth’s climate system.
• 136. Data Mining Tasks ■ Association analysis ■ Example (Market Basket Analysis). – AIM: find items that are frequently bought together by customers. – Association rule {Diapers} −→ {Milk} ■ suggests that customers who buy diapers also tend to buy milk. ■ This rule can be used to identify potential cross-selling opportunities among related items. (Table: transaction data collected at the checkout counters of a grocery store.)
• 137. Data Mining Tasks ■ Cluster analysis – Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters. – Clustering has been used to ■ group sets of related customers, ■ find areas of the ocean that have a significant impact on the Earth’s climate, and ■ compress data.
• 138. Data Mining Tasks ■ Cluster analysis – Example 1.3 (Document Clustering) – Each article is represented as a set of word-frequency pairs (w, c), ■ where w is a word and ■ c is the number of times the word appears in the article. – There are two natural clusters in the data set. – First cluster -> first four articles (news about the economy) – Second cluster -> last four articles (news about health care) – A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
• 139. Data Mining Tasks ■ Anomaly Detection: – Task of identifying observations whose characteristics are significantly different from the rest of the data. – Such observations are known as anomalies or outliers. – A good anomaly detector must have a high detection rate and a low false alarm rate. – Applications of anomaly detection include ■ the detection of fraud, ■ network intrusions, ■ unusual patterns of disease, and ■ ecosystem disturbances https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png
• 140. Data Mining Tasks ■ Anomaly Detection: – Example 1.4 (Credit Card Fraud Detection). – A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. – Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. – When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
• 141. Types of Data ■ Data set - collection of data objects. ■ Other names for a data object are:- – record, – point, – vector, – pattern, – event, – case, – sample, – observation, or – entity.
• 142. Types of Data ■ Data objects are described by a number of attributes that capture the basic characteristics of an object. ■ Example:- – mass of a physical object or – time at which an event occurred. ■ Other names for an attribute are:- – variable, – characteristic, – field, – feature, or – dimension.
• 143. Types of Data ■ Example:- ■ Dataset - Student Information. ■ Each row corresponds to a student. ■ Each column is an attribute that describes some aspect of a student.
• 144. Types of Data ■ Attributes and Measurement – An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another. – Example, ■ eye color varies from person to person, while the temperature of an object varies over time. – Eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, – Temperature is a numerical attribute with a potentially unlimited number of values.
• 145. Types of Data ■ Attributes and Measurement – A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object. – Process of measurement ■ application of a measurement scale to associate a value with a particular attribute of a specific object.
• 146. Properties of Attribute Values ■ The type of an attribute depends on which of the following properties it possesses: ■ Distinctness: = ≠ ■ Order: < > ■ Addition: + ‐ ■ Multiplication: * / ■ Nominal attribute: distinctness ■ Ordinal attribute: distinctness & order ■ Interval attribute: distinctness, order & addition ■ Ratio attribute: all 4 properties
• 147. Types of Data ■ Properties of Attribute Values – Nominal - attributes to differentiate between one object and another. – Roll, EmpID – Ordinal - attributes to order the objects. – Rankings, Grades, Height – Interval - measured on a scale of equal size units – no zero point – Temperatures in C & F, Calendar Dates – Ratio - numeric attribute with an inherent zero-point. – value as being a multiple (or ratio) of another value. – Weight, No. of Staff, Income/Salary
  • 148. Types of Data Properties of Attribute Values
• 149. Types of Data Properties of Attribute Values - Transformations – yielding the same results when the attribute is transformed using a transformation that preserves the attribute’s meaning. – Example:- ■ the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length.
  • 150. Types of Data Properties of Attribute Values - Transformations
• 151. Types of Data Attribute Types: Data is divided into Qualitative / Categorical attributes (no properties of integers): Nominal, Ordinal; and Quantitative / Numeric attributes (properties of integers): Interval, Ratio.
• 152. Types of Data ■ Describing Attributes by the Number of Values a. Discrete ■ finite or countably infinite set of values. ■ Categorical - zip codes or ID numbers, or ■ Numeric - counts. ■ Binary attributes (special case of discrete) – assume only two values, – e.g., true/false, yes/no, male/female, or 0/1. b. Continuous ■ values are real numbers. ■ Ex:- temperature, height, or weight. Any of the measurement scale types (nominal, ordinal, interval, and ratio) could be combined with any of the types based on the number of attribute values (binary, discrete, and continuous).
• 153. Types of Data - Types of Dataset General Characteristics of Data Sets ■ 3 characteristics that apply to many data sets are:- – dimensionality, – sparsity, and – resolution. ■ Dimensionality - number of attributes that the objects in the data set possess. – data with a small number of dimensions tends to be of higher quality than moderate- or high-dimensional data. – curse of dimensionality & dimensionality reduction. ■ Sparsity - in data sets with asymmetric features, most attributes of an object have values of 0; – fewer than 1% of the entries are non-zero. ■ Resolution - data can be gathered at different levels of resolution – Example:- the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.
  • 154. Types of Data - Types of Dataset ■ Record Data – data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). – No relationships b/w records – Same attributes for all records – Flat files or relational DB.
  • 155. Types of Data - Types of Dataset ■ Transaction or Market Basket Data – special type of record data – Each record (transaction) involves a set of items. – Also called market basket data because the items in each record are the products in a person’s “market basket.” – Can be viewed as a set of records whose fields are asymmetric attributes.
  • 156. Types of Data - Types of Dataset ■ Data Matrix / Pattern Matrix – fixed set of numeric attributes, – Data objects = points (vectors) in a multidimensional space – each dimension = a distinct attribute describing the object. – A set of such data objects can be interpreted as ■ an m by n matrix, – where there are – m rows, one for each object, – and n columns, one for each attribute. – Standard matrix operation can be applied to transform and manipulate the data.
• 157. Types of Data - Types of Dataset ■ Sparse Data Matrix: – Special case of a data matrix – attributes are of the ■ same type and ■ asymmetric; i.e., only non-zero values are important. – Example:- ■ Transaction data which has only 0–1 entries. ■ Document-Term Matrix - collection of term vectors – One term vector represents one document (one row in the matrix) – Attribute of the vector - each term in the document (one column in the matrix) – the value in a term vector under an attribute is the number of times the corresponding term occurs in the document.
  • 158. Types of Data - Types of Dataset ■ Graph based Data: – Data can be represented in the form of Graph. – Graphs are used for 2 specific reasons ■ (1) the graph captures relationships among data objects and ■ (2) the data objects themselves are represented as graphs. – Data with Relationships among Objects ■ Relationships among objects also convey important information. ■ Relationships among objects are captured by the links between objects and link properties, such as direction and weight. ■ Example: – Web page in www contain both text and links to other pages. – Web search engines collect and process Web pages to extract their contents. – Links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration.
  • 159. Types of Data - Types of Dataset ■ Graph based Data: – Data with Relationships among Objects ■ Example: – Web page in www contain both text and links to other pages.
  • 160. Types of Data - Types of Dataset ■ Graph based Data: – Data with Objects That Are Graphs ■ When objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. ■ Example:-Structure of chemical compounds ■ Atoms are - nodes ■ Chemical Bonds - links between nodes – ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray). Substructure mining
• 161. Types of Data - Types of Dataset ■ Ordered Data: – In some data, the attributes have relationships that involve order in time or space. – Sequential Data ■ Sequential data / temporal data ■ extension of record data - each record has a time associated with it. ■ Ex:- Retail transaction data set - stores the time of transaction – time information used to find patterns ■ “candy sales peak before Halloween.” ■ Each attribute can also have a time associated with it – Record - purchase history of a customer ■ with a listing of items purchased at different times. – find patterns ■ “people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”
  • 162. Types of Data - Types of Dataset ■ Ordered Data: Sequential
  • 163. Types of Data - Types of Dataset ■ Ordered Data: Sequence Data – consists of a data set that is a sequence of individual entities, – Example ■ sequence of words or letters. – Example: ■ Genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes. ■ Predicting similarities in the structure and function of genes from similarities in nucleotide sequences. – Ex:- Human genetic code expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.
  • 164. Types of Data - Types of Dataset ■ Ordered Data: Time Series Data – Special type of sequential data in which each record is a time series, – A series of measurements taken over time. – Example: ■ Financial data set might contain objects that are time series of the daily prices of various stocks. – Temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar. Time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.
  • 165. Types of Data - Types of Dataset ■ Ordered Data: Spatial Data ■ Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. ■ An example of spatial data is – weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations. ■ spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well. ■ Example – two points on the Earth that are close to each other usually have similar values for temperature and rainfall. Average Monthly Temperature of land and ocean
• 167. Data Quality ● Data mining applications are often applied to data that was collected for another purpose, or for future, but unspecified, applications. ● Data mining focuses on (1) the detection and correction of data quality problems - Data Cleaning, and (2) the use of algorithms that can tolerate poor data quality. ● Measurement and Data Collection Issues ● Issues Related to Applications
• 168. Data Quality ● Measurement and Data Collection Issues ● problems due to human error, ● limitations of measuring devices, or ● flaws in the data collection process. ● Values or even entire data objects may be missing. ● Spurious or duplicate objects; i.e., multiple data objects that all correspond to a single “real” object. ○ Example - there might be two different records for a person who has recently lived at two different addresses. ● Inconsistencies: ○ Example - a person has a height of 2 meters, but weighs only 2 kilograms.
• 169. Data Quality ● Measurement and Data Collection Errors ○ Measurement error - any problem resulting from the measurement process. ■ Value recorded differs from the true value to some extent. ■ Continuous attributes: ● the numerical difference of the measured and true value is called the error. ○ Data collection error - errors such as omitting data objects or attribute values, or inappropriately including a data object. ■ For example, a study of animals of a certain species might include animals of a related species that are similar in appearance to the species of interest.
• 170. Data Quality ● Measurement and Data Collection Errors ○ Noise and Artifacts: ○ Noise is the random component of a measurement error. ○ It may involve the distortion of a value or the addition of spurious objects.
• 172. Data Quality ● Measurement and Data Collection Errors ○ Noise and Artifacts: ○ The term noise is often used in connection with data that has a spatial or temporal component. ○ Techniques from signal or image processing can frequently be used to reduce noise ■ These help to discover patterns (signals) that might be “lost in the noise.” ○ Note: Elimination of noise is difficult ■ robust algorithms - produce acceptable results even when noise is present.
• 173. Data Quality ● Measurement and Data Collection Errors ○ Noise and Artifacts: ■ Artifacts: deterministic distortions of the data ■ Data errors may be the result of a more deterministic phenomenon, such as a streak in the same place on a set of photographs.
• 174. Data Quality ● Measurement and Data Collection Errors ● Precision, Bias, and Accuracy: ○ Precision: ■ The closeness of repeated measurements (of the same quantity) to one another. ■ Precision is often measured by the standard deviation of a set of values ○ Bias: ■ A systematic variation of measurements from the quantity being measured. ■ Bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. ○ Example: ■ standard laboratory weight with a mass of 1g and want to assess the precision and bias of our new laboratory scale. ■ weigh the mass five times & values are: {1.015, 0.990, 1.013, 1.001, 0.986}. ■ The mean of these values is 1.001, and hence, the bias is 0.001. ■ The precision, as measured by the standard deviation, is 0.013.
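The laboratory-scale example can be reproduced in a few lines of NumPy; a minimal sketch, assuming the five measurements and the known 1 g mass from the slide.

import numpy as np

measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                      # known mass of the standard weight (1 g)

mean = measurements.mean()            # 1.001
bias = mean - true_value              # 0.001  (systematic variation)
precision = measurements.std(ddof=1)  # ~0.013 (sample standard deviation)

print(f"mean={mean:.3f}, bias={bias:.3f}, precision={precision:.3f}")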
• 175. Data Quality ● Measurement and Data Collection Errors ● Precision, Bias, and Accuracy: ○ Accuracy: ■ The closeness of measurements to the true value of the quantity being measured.
• 176. Data Quality ● Measurement and Data Collection Errors ● Outliers: ○ Outliers are either ■ (1) data objects that, in some sense, have characteristics that are different from most of the other data objects in the data set, or ■ (2) values of an attribute that are unusual with respect to the typical values for that attribute. ○ Alternatively - anomalous objects or values.
• 177. Data Quality ● Measurement and Data Collection Errors ● Missing Values - ways to handle them: ○ Eliminate Data Objects or Attributes ○ Estimate Missing Values ○ Ignore the Missing Value during Analysis ● Inconsistent Values
• 178. Data Quality ● Measurement and Data Collection Errors ● Duplicate Data: same data in multiple data objects ○ To detect and eliminate such duplicates, two main issues must be addressed. ■ First - if two objects actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved ■ Second - care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates (the overall process is called deduplication)
• 179. Data Quality “data is of high quality if it is suitable for its intended use.” ● Issues Related to Applications: ● Timeliness: ○ If the data is out of date, then so are the models and patterns that are based on it. ● Relevance: ○ The available data must contain the information necessary for the application. ○ Consider the task of building a model that predicts the accident rate for drivers. If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information is indirectly available through other attributes. ● Knowledge about the Data: ○ Data sets are accompanied by documentation that describes different aspects of the data; ○ the quality of this documentation can help in the subsequent analysis. ○ For example, ■ if the documentation is poor and fails to tell us, for example, that the missing values for a particular field are indicated with a -9999, then our analysis of the data may be faulty. ○ Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval, ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
  • 181. AGGREGATION • “less is more” • Aggregation - combining of two or more objects into a single object. • In Example, • One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single storewide transaction. • This reduces number of records (1 record per store). • How an aggregate transaction is created • Quantitative attributes, such as price, are typically aggregated by taking a sum or an average. • A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that were sold at that location. • Can also be viewed as a multidimensional array, where each attribute is a dimension. • Used in OLAP
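A minimal pandas sketch of the storewide aggregation described above: quantitative attributes are summed, the qualitative item attribute is summarized as the set of items sold. The column names (store, item, price) and values are assumptions for illustration.

import pandas as pd

transactions = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["milk", "bread", "milk", "soda", "bread"],
    "price": [3.0, 2.5, 3.2, 1.5, 2.4],
})

# One storewide record per store: sum the quantitative attribute,
# summarize the qualitative attribute as the set of items sold there.
storewide = transactions.groupby("store").agg(
    total_price=("price", "sum"),
    items=("item", lambda s: set(s)),
)
print(storewide)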
• 182. AGGREGATION • Motivations for aggregation • Smaller data sets require less memory and processing time, which allows the use of more expensive data mining algorithms. • Availability of change of scope or scale • by providing a high-level view of the data instead of a low-level view. • Behavior of groups of objects or attributes is often more stable than that of individual objects or attributes. • Disadvantage of aggregation • potential loss of interesting details.
  • 183. AGGREGATION average yearly precipitation has less variability than the average monthly precipitation.
• 184. SAMPLING • Approach for selecting a subset of the data objects to be analyzed. • Data miners sample because it is too expensive or time consuming to process all the data. • The key principle for effective sampling is the following: • Using a sample will work almost as well as using the entire data set if the sample is representative. • A sample is representative if it has approximately the same property (of interest) as the original set of data. • Choose a sampling scheme/technique which gives a high probability of getting a representative sample.
• 185. SAMPLING • Sampling Approaches: (a) Simple random (b) Stratified (c) Adaptive • Simple random sampling • equal probability of selecting any particular item. • Two variations on random sampling: • (1) sampling without replacement - as each item is selected, it is removed from the set of all objects that together constitute the population, and • (2) sampling with replacement - objects are not removed from the population as they are selected for the sample. • Problem: When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent. • Stratified sampling: • starts with prespecified groups of objects • Simpler version - equal numbers of objects are drawn from each group even though the groups are of different sizes. • Other - the number of objects drawn from each group is proportional to the size of that group.
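A sketch of the three random-sampling variants using NumPy; the population array, group labels, and sample sizes are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(100)                    # 100 data objects

# (1) Simple random sampling without replacement
s_without = rng.choice(population, size=10, replace=False)

# (2) Simple random sampling with replacement (an object may be picked twice)
s_with = rng.choice(population, size=10, replace=True)

# (3) Stratified sampling: draw from each prespecified group,
#     here proportionally to the group's size (70 / 20 / 10 objects).
groups = {"A": population[:70], "B": population[70:90], "C": population[90:]}
total = len(population)
stratified = np.concatenate([
    rng.choice(members, size=max(1, round(10 * len(members) / total)), replace=False)
    for members in groups.values()
])
print(s_without, s_with, stratified, sep="\n")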
  • 186. SAMPLING Sampling and Loss of Information • Larger sample sizes increase the probability that a sample will be representative, but they also eliminate much of the advantage of sampling. • Conversely, with smaller sample sizes, patterns may be missed or erroneous patterns can be detected.
  • 187. SAMPLING Determining the Proper Sample Size • Desired outcome: at least one point will be obtained from each cluster. • Probability of getting one object from each of the 10 groups increases as the sample size runs from 10 to 60.
• 188. SAMPLING • Adaptive/Progressive Sampling: • Proper sample size - difficult to determine in advance • Start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained. • The need to determine the correct sample size at the start is eliminated • Stop increasing the sample size at the leveling-off point (where no improvement in the outcome is identified).
• 189. DIMENSIONALITY REDUCTION • Data sets can have a large number of features. • Example • a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document. • thousands or tens of thousands of attributes (components), one for each word in the vocabulary.
• 190. DIMENSIONALITY REDUCTION • Benefits of dimensionality reduction: • Data mining algorithms work better if the dimensionality is lower. • It eliminates irrelevant features and reduces noise • Leads to a more understandable model • fewer attributes • Allows the data to be more easily visualized. • Amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality. • Reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes. • Feature subset selection or feature selection: • The reduction of dimensionality by selecting new attributes that are a subset of the old.
• 191. DIMENSIONALITY REDUCTION • The Curse of Dimensionality • Data analysis becomes significantly harder as the dimensionality of the data increases. • data becomes increasingly sparse • Classification • there are not enough data objects to model a class for all possible objects. • Clustering • density and the distance between points become less meaningful
• 192. DIMENSIONALITY REDUCTION • Linear Algebra Techniques for Dimensionality Reduction • Principal Components Analysis (PCA) • for continuous attributes • finds new attributes (principal components) that • (1) are linear combinations of the original attributes, • (2) are orthogonal (perpendicular) to each other, and • (3) capture the maximum amount of variation in the data. • Singular Value Decomposition (SVD) • Related to PCA
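A short sketch of PCA as a dimensionality-reduction step, assuming scikit-learn is available; the random data merely stands in for a real continuous-attribute data set.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects, 10 continuous attributes

pca = PCA(n_components=3)               # keep the 3 directions of largest variance
X_reduced = pca.fit_transform(X)        # new attributes: orthogonal linear combinations

print(X_reduced.shape)                  # (200, 3)
print(pca.explained_variance_ratio_)    # fraction of variation captured by each component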
  • 193. FEATURE SUBSET SELECTION • Another way to reduce the dimensionality - use only a subset of the features. • Redundant Features • Example: • Purchase price of a product and the amount of sales tax paid • Redundant to each other • contain much of the same information. • Irrelevant features contain almost no useful information for the data mining task at hand. • Example: Students’ ID numbers are irrelevant to the task of predicting students’ grade point averages. • Redundant and irrelevant features • reduce classification accuracy and the quality of the clusters that are found. • can be eliminated immediately by using common sense or domain knowledge, • systematic approach - for selecting the best subset of features • Best approach - try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results.
• 194. FEATURE SUBSET SELECTION • 3 standard approaches to feature selection: • Embedded • Filter • Wrapper
• 195. FEATURE SUBSET SELECTION • Embedded approaches: • Feature selection occurs naturally as part of the data mining algorithm. • During execution of the algorithm, the algorithm itself decides which attributes to use and which to ignore. • Example:- Algorithms for building decision tree classifiers • Filter approaches: • Features are selected before the data mining algorithm is run • Approach that is independent of the data mining task. • Wrapper approaches: • Uses the target data mining algorithm as a black box to find the best subset of attributes • typically without enumerating all possible subsets.
  • 196. FEATURE SUBSET SELECTION • An Architecture for Feature Subset Selection : • The feature selection process is viewed as consisting of four parts: 1. a measure for evaluating a subset, 2. a search strategy that controls the generation of a new subset of features, 3. a stopping criterion, and 4. a validation procedure. • Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features. • wrapper method – uses the target data mining algorithm • filter approach - evaluation technique is distinct from the target data mining algorithm.
• 198. FEATURE SUBSET SELECTION • Feature subset selection is a search over all possible subsets of features. • Evaluation step - determine the goodness of a subset of attributes with respect to a particular data mining task • Filter approach: predict how well the actual data mining algorithm will perform on a given set of attributes. • Wrapper approach: run the target data mining application and measure the result of the data mining. • Stopping criterion • conditions involving the following: • the number of iterations, • whether the value of the subset evaluation measure is optimal or exceeds a certain threshold, • whether a subset of a certain size has been obtained, • whether simultaneous size and evaluation criteria have been achieved, and • whether any improvement can be achieved by the options available to the search strategy. • Validation: • Finally, the results of the target data mining algorithm on the selected subset should be validated. • An evaluation approach: run the algorithm with the full set of features and compare the full results to results obtained using the subset of features.
• 199. FEATURE SUBSET SELECTION • Feature Weighting • An alternative to keeping or eliminating features. • One approach • Higher weight - more important features • Lower weight - less important features • Another approach - automatic • Example - classification scheme - Support Vector Machines • Other approach • The normalization of objects - Cosine Similarity - used as weights
• 200. FEATURE CREATION • Create a new set of attributes that captures the important information in a data set from the original attributes • much more effective. • No. of new attributes < No. of original attributes • Three related methodologies for creating new attributes: 1. Feature extraction 2. Mapping the data to a new space 3. Feature construction
• 201. FEATURE CREATION • Feature Extraction • The creation of a new set of features from the original raw data • Example: classify a set of photographs based on the existence of a human face (present or not) • Raw data (set of pixels) - not suitable for many types of classification algorithms. • With higher-level features (presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces), a much broader set of classification techniques can be applied to this problem. • Feature extraction is highly domain-specific • A new application area means development of new features and feature extraction methods.
• 202. FEATURE CREATION Mapping the Data to a New Space • A totally different view of the data can reveal important and interesting features. • If there is only a single periodic pattern and not much noise, then the pattern is easily detected. • If there are a number of periodic patterns and a significant amount of noise is present, then these patterns are hard to detect. • Such patterns can be detected by applying a Fourier transform to the time series in order to change to a representation in which frequency information is explicit. • Example: • Power spectrum that can be computed after applying a Fourier transform to the original time series.
• 203. FEATURE CREATION • Feature Construction • Features in the original data set contain the necessary information, but are not in a form suitable for the data mining algorithm. • New features constructed out of the original features can be more useful than the original features. • Example (Density). • Dataset contains the volume and mass of historical artifacts. • A density feature constructed from the mass and volume features, i.e., density = mass/volume, would most directly yield an accurate classification.
• 204. DISCRETIZATION AND BINARIZATION • Some classification algorithms require that the data be in the form of categorical attributes. • Algorithms that find association patterns require that the data be in the form of binary attributes. • Discretization - transforming a continuous attribute into a categorical attribute • Binarization - transforming both continuous and discrete attributes into one or more binary attributes
• 205. DISCRETIZATION AND BINARIZATION • Binarization of a categorical attribute (simple technique): • If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1]. • If the attribute is ordinal, then order must be maintained by the assignment. • Next, convert each of these m integers to a binary number using n binary attributes • n = ⌈log2(m)⌉ binary digits are required to represent these integers
• 206. DISCRETIZATION AND BINARIZATION Example: a categorical variable with 5 values {awful, poor, OK, good, great} requires three binary variables x1, x2, and x3 (see the sketch below).
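A sketch of the integer-then-binary encoding described above, using the five-value ordinal attribute from the example; the helper name binarize is illustrative, not from the text.

import math

values = ["awful", "poor", "OK", "good", "great"]      # ordinal, so keep this order
to_int = {v: i for i, v in enumerate(values)}          # awful -> 0, ..., great -> 4

m = len(values)
n = math.ceil(math.log2(m))                            # n = 3 binary attributes

def binarize(value):
    """Encode a categorical value as n binary attributes (x1..xn)."""
    i = to_int[value]
    return [int(bit) for bit in format(i, f"0{n}b")]

for v in values:
    print(v, binarize(v))     # e.g. OK -> [0, 1, 0]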
• 207. DISCRETIZATION AND BINARIZATION • Discretization of Continuous Attributes (classification or association analysis) • Transformation of a continuous attribute to a categorical attribute involves two subtasks: • decide the number of categories • decide how to map the values of the continuous attribute to these categories. • Step I: sort attribute values and divide them into n intervals by specifying n − 1 split points. • Step II: all the values in one interval are mapped to the same categorical value.
• 208. DISCRETIZATION AND BINARIZATION • Discretization of Continuous Attributes • Problem of discretization is • deciding how many split points to choose and • where to place them. • The result can be represented either as • a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn)}, where x0 may be −∞ and xn may be +∞, or • as a series of inequalities x0 < x ≤ x1, ..., xn−1 < x < xn.
• 209. DISCRETIZATION AND BINARIZATION • Unsupervised Discretization • Discretization methods for classification • Supervised - class information is known and used • Unsupervised - class information is unknown or not used • Equal width approach: • divides the range of the attribute into a user-specified number of intervals, each having the same width. • problem with outliers • Equal frequency (equal depth) approach: • puts the same number of objects into each interval • K-means clustering method (a sketch of the first two approaches follows below)
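A sketch of the two simplest unsupervised schemes with pandas (cut for equal width, qcut for equal frequency); the sample values are made up, and the k-means variant is not shown here.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=50, scale=15, size=100))

equal_width = pd.cut(values, bins=4)        # 4 intervals of identical width
equal_freq = pd.qcut(values, q=4)           # 4 intervals with ~25 values each

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())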
  • 210. DISCRETIZATION AND BINARIZATION UnSupervised Discretization Original Data
  • 211. DISCRETIZATION AND BINARIZATION UnSupervised Discretization Equal Width Discretization
  • 212. DISCRETIZATION AND BINARIZATION UnSupervised Discretization Equal Frequency Discretization
  • 213. DISCRETIZATION AND BINARIZATION UnSupervised Discretization K-means Clustering (better result)
• 214. DISCRETIZATION AND BINARIZATION • Supervised Discretization • When additional information (class labels) is used, it produces better results. • Some concerns: purity of an interval and the minimum size of an interval. • Statistically based approaches: • start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test. • Entropy based approaches:
• 215. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • Entropy of the ith interval: ei = − Σj=1..k pij log2 pij, where • pij = mij/mi - probability (fraction) of class j in the ith interval, • k - no. of different class labels, • mi - no. of values in the ith interval of a partition, • mij - no. of values of class j in interval i.
  • 216. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy
• 217. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • Total entropy, e, of the partition is the weighted average of the individual interval entropies: e = Σi=1..n wi ei, where • m - total no. of values, • wi = mi/m - fraction of values in the ith interval, • n - no. of intervals. • Perfectly pure interval: entropy is 0 • if an interval contains only values of one class • Impure interval: entropy is maximum • when the classes of values in an interval occur equally often (a code sketch follows below)
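A sketch of the interval-entropy calculation defined above; the class counts per interval are assumptions made up for illustration.

import math

def interval_entropy(class_counts):
    """e_i = -sum_j p_ij * log2(p_ij), with p_ij = m_ij / m_i."""
    m_i = sum(class_counts)
    return -sum((c / m_i) * math.log2(c / m_i) for c in class_counts if c > 0)

def total_entropy(intervals):
    """e = sum_i w_i * e_i, with w_i = m_i / m."""
    m = sum(sum(counts) for counts in intervals)
    return sum((sum(counts) / m) * interval_entropy(counts) for counts in intervals)

intervals = [[8, 0], [2, 6]]          # counts of two classes in two intervals
print(interval_entropy([8, 0]))       # 0.0  -> perfectly pure interval
print(interval_entropy([4, 4]))       # 1.0  -> maximally impure (two equal classes)
print(total_entropy(intervals))       # weighted average of the interval entropies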
• 218. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • Simple approach for partitioning a continuous attribute: • starts by bisecting the initial values so that the resulting two intervals give minimum entropy. • consider each value as a possible split point • Repeat the splitting process with another interval • choosing the interval with the worst (highest) entropy, • until a user-specified number of intervals is reached, or • a stopping criterion is satisfied.
• 219. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • 3 categories for both x & y
• 220. DISCRETIZATION AND BINARIZATION • Supervised Discretization • Entropy based approaches: • 5 categories for both x & y • Observation: • no improvement for 6 categories
• 221. DISCRETIZATION AND BINARIZATION • Categorical Attributes with Too Many Values • If the categorical attribute is ordinal, • techniques similar to those for continuous attributes • If the categorical attribute is nominal, • Example:- • University that has a large number of departments. • department name attribute - dozens of diff. values. • combine departments into larger groups, such as • engineering, • social sciences, or • biological sciences.
• 222. Variable Transformation • Transformation that is applied to all the values of a variable. • Example: if only the magnitude of a variable is important • then the values of the variable can be transformed by taking the absolute value. • Simple Function Transformation: • A simple mathematical function is applied to each value individually. • If x is a variable, then examples of such transformations include • x^k, • log x, • e^x, • √x, • 1/x, • sin x, or |x|
• 223. Variable Transformation • Variable transformations should be applied with caution since they change the nature of the data. • Example:- • transformation function is 1/x • if a value is 1 or >1, the transformation reduces its magnitude • values {1, 2, 3} go to {1, 1/2, 1/3} • if a value is between 0 & 1, the transformation increases its magnitude • values {1, 1/2, 1/3} go to {1, 2, 3}. • so better ask questions such as the following: • Does the order need to be maintained? • Does the transformation apply to all values (negative values & 0)? • What is the effect of the transformation on the values between 0 & 1?
• 224. Variable Transformation • Normalization or Standardization • Goal of standardization or normalization • To make an entire set of values have a particular property. • A traditional example is that of “standardizing a variable” in statistics. • If x̄ is the mean (average) of the attribute values and • sx is the standard deviation, • then the transformation x' = (x − x̄)/sx • creates a new variable that has a mean of 0 and a standard deviation of 1.
• 225. Variable Transformation • Normalization or Standardization • If different variables are to be combined, a transformation is necessary to avoid having a variable with large values dominate the results of the calculation. • Example: • comparing people based on two variables: age and income. • For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). • Income values (higher values) will dominate the calculation.
• 226. Variable Transformation • Normalization or Standardization • Mean and standard deviation are strongly affected by outliers • Mean is replaced by the median, i.e., the middle value. • If x is the variable, the absolute standard deviation of x is σA = Σi=1..m |xi − µ|, where • xi - ith value of the variable, • m - number of objects, and • µ - mean or median. • Other approaches • computing estimates of the location (center) and • spread of a set of values in the presence of outliers • These measures can also be used to define a standardization transformation.
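A sketch of the classical and the outlier-robust standardization just described; the income-like values are made up. The robust variant uses the median and the average absolute deviation from it (dividing the sum of absolute deviations by m, which differs from the slide's sum only by a constant scaling factor).

import numpy as np

x = np.array([21000.0, 25000.0, 23000.0, 24000.0, 250000.0])   # one outlier

# Classical standardization: mean 0, standard deviation 1
z = (x - x.mean()) / x.std(ddof=1)

# Robust variant: replace the mean by the median and the standard deviation
# by the average absolute deviation from the median
mu = np.median(x)
sigma_a = np.mean(np.abs(x - mu))
z_robust = (x - mu) / sigma_a

print(z)
print(z_robust)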
• 227. Measures of Similarity and Dissimilarity Unit - II Datamining
• 228. Measures of Similarity and Dissimilarity ● Similarity and dissimilarity are important because they are used by a number of data mining techniques ○ such as ■ clustering, ■ nearest neighbor classification, and ■ anomaly detection. ● Proximity is used to refer to either similarity or dissimilarity. ○ proximity between objects having only one simple attribute, and ○ proximity measures for objects with multiple attributes.
• 229. Measures of Similarity and Dissimilarity ● Similarity between two objects is a numerical measure of the degree to which the two objects are alike. ○ Similarity - higher for objects that are more alike. ○ Non-negative ○ between 0 (no similarity) and 1 (complete similarity). ● Dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. ○ Dissimilarity - lower for objects that are more similar. ○ Distance - synonym for dissimilarity
• 230. Measures of Similarity and Dissimilarity Transformations ● Transformations are often applied to ○ convert a similarity to a dissimilarity, ○ convert a dissimilarity to a similarity, or ○ transform a proximity measure to fall within a particular range, such as [0,1]. ● Example ○ Similarities between objects range from 1 (not at all similar) to 10 (completely similar) ○ we can make them fall within the range [0, 1] by using the transformation ■ s' = (s−1)/9 ■ s - original similarity ■ s' - new similarity value
  • 231. Measures of Similarity and Dissimilarity
  • 232. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects Euclidean Distance
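The Euclidean distance between two n-dimensional points x and y is d(x, y) = sqrt(Σk (xk − yk)²). A minimal sketch of it, written as the r = 2 special case of the more general Minkowski distance; the example points are made up.

import numpy as np

def minkowski(x, y, r):
    """d(x, y) = (sum_k |x_k - y_k|^r)^(1/r); r=1 city block, r=2 Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])

print(minkowski(x, y, 1))   # 7.0  (L1 / city-block distance)
print(minkowski(x, y, 2))   # 5.0  (L2 / Euclidean distance)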
• 233. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects If d(x, y) is the distance between two points, x and y, then the following properties hold. 1. Positivity (a) d(x, y) ≥ 0 for all x and y, (b) d(x, y) = 0 only if x = y. 2. Symmetry d(x, y) = d(y, x) for all x and y. 3. Triangle Inequality d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. Note:- Measures that satisfy all three properties are known as metrics.
  • 234. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
  • 235. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
  • 236. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
  • 237. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects
• 238. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects Non-metric Dissimilarities: Set Differences If A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. If d(A, B) = size(A − B), then it does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. d(A, B) = size(A − B) + size(B − A) (modified definition, which satisfies all the properties)
• 239. Measures of Similarity and Dissimilarity Dissimilarities between Data Objects Non-metric Dissimilarities: Time A dissimilarity measure that is not a metric, but still useful. d(1PM, 2PM) = 1 hour d(2PM, 1PM) = 23 hours ● Example:- when answering the question: “If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?”
• 241. Measures of Similarity and Dissimilarity Similarities between Data Objects ● Typical properties of similarities are the following: ○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1) ○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry) ● A Non-symmetric Similarity Measure ○ Classify a small set of characters which are flashed on a screen. ○ Confusion matrix - records how often each character is classified as itself, and how often each is classified as another character. ○ “0” appeared 200 times but was classified as ■ “0” 160 times, ■ “o” 40 times. ○ ‘o’ appeared 200 times and was classified as ■ “o” 170 times ■ “0” only 30 times. ● The similarity measure can be made symmetric by setting ○ s'(x, y) = s'(y, x) = (s(x, y) + s(y, x))/2, ■ s' - new similarity measure.
  • 242. Measures of Similarity and Dissimilarity Examples of proximity measures ● Similarity Measures for Binary Data ○ Similarity measures between objects that contain only binary attributes are called similarity coefficients ○ Let x and y be two objects that consist of n binary attributes. ○ The comparison of two objects (or two binary vectors), leads to the following four quantities (frequencies): f00 = the number of attributes where x is 0 and y is 0 f01 = the number of attributes where x is 0 and y is 1 f10 = the number of attributes where x is 1 and y is 0 f11 = the number of attributes where x is 1 and y is 1
• 243. Measures of Similarity and Dissimilarity Examples of proximity measures ● Similarity Measures for Binary Data: Simple Matching Coefficient (SMC) = (f11 + f00) / (f01 + f10 + f11 + f00); Jaccard Coefficient (J) = f11 / (f01 + f10 + f11)
  • 244. Measures of Similarity and Dissimilarity Examples of proximity measures ● Similarity Measures for Binary Data
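A sketch that computes the four frequencies and the two coefficients for a pair of binary vectors; the example vectors are made up for illustration.

import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)   # counts both 1-1 and 0-0 matches
jaccard = f11 / (f01 + f10 + f11)             # ignores the 0-0 matches (asymmetric attributes)

print(f"SMC = {smc:.2f}, Jaccard = {jaccard:.2f}")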
• 245. Measures of Similarity and Dissimilarity Examples of proximity measures Cosine similarity (Document similarity) If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖)
  • 246. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity)
• 247. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity)

# import required libraries
import numpy as np
from numpy.linalg import norm

# define two lists or arrays
A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])
print("A:", A)
print("B:", B)

# compute cosine similarity
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)
• 248. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity) ● Cosine similarity - measure of the angle between x and y. ● Cosine similarity = 1 (angle is 0◦, and x & y are the same except for magnitude or length) ● Cosine similarity = 0 (angle is 90◦, and x & y do not share any terms (words))
• 249. Measures of Similarity and Dissimilarity Examples of proximity measures cosine similarity (Document similarity) Note:- Dividing x and y by their lengths normalizes them to have a length of 1 (means magnitude is not considered)
• 250. Measures of Similarity and Dissimilarity Examples of proximity measures Extended Jaccard Coefficient (Tanimoto Coefficient): EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
• 251. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation: corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y))
• 252. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation ● The more tightly linear two variables X and Y are, the closer Pearson's correlation coefficient (PCC) is to −1 or +1. ○ PCC = −1, if the relationship is negative ■ an increase in the value of one variable decreases the value of the other variable ○ PCC = +1, if the relationship is positive ■ an increase in the value of one variable increases the value of the other variable ○ PCC = 0, perfectly linearly uncorrelated variables
  • 253. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation
  • 254. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation ( scipy.stats.pearsonr() - automatic)
  • 255. Measures of Similarity and Dissimilarity Examples of proximity measures Pearson’s correlation (manual in python)
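A sketch that computes Pearson's correlation both with scipy.stats.pearsonr (the "automatic" route mentioned above) and manually as covariance divided by the product of standard deviations; the two short series are made up and form a perfectly negative linear relationship.

import numpy as np
from scipy.stats import pearsonr

x = np.array([-3.0, 6.0, 0.0, 3.0, -6.0])
y = np.array([1.0, -2.0, 0.0, -1.0, 2.0])

# Library call
r, p_value = pearsonr(x, y)

# Manual: covariance(x, y) / (std(x) * std(y))
cov = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov / (x.std() * y.std())

print(r, r_manual)     # both -1.0 here, since y is an exact negative multiple of x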
• 257. BASIC CONCEPTS • Input data -> collection of records • Record / instance / example -> tuple (x, y) • x - attribute set • y - special attribute (class label / category / target attribute) • Attribute set - properties of a Data Object - Discrete / Continuous • Class label: • Classification - y is a discrete attribute • Regression (Predictive Modeling Task) - y is a continuous attribute.
• 258. BASIC CONCEPTS • Definition: • Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. • The target function is also known informally as a classification model.
• 259. BASIC CONCEPTS • A classification model is useful for the following purposes. • Descriptive modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes.
• 260. BASIC CONCEPTS • A classification model is useful for the following purposes. • Predictive Modeling: • A classification model can also be used to predict the class label of unknown records. • Automatically assigns a class label when presented with the attribute set of an unknown record. • Classification techniques are best suited for binary or nominal categories. • They do not consider the implicit order of ordinal categories • Other relationships (e.g., superclass-subclass) are also ignored
• 261. General approach to solving a classification problem • Classification technique (or classifier) • Systematic approach to building classification models from an input data set. • Examples • Decision tree classifiers, • Rule-based classifiers, • Neural networks, • Support vector machines, and • Naive Bayes classifiers. • Learning algorithm • Used by the classifier • To identify a model • That best fits the relationship between the attribute set and class label of the input data.
  • 262. General approach to solving a classification problem • Model • Generated by a learning algorithm • Should satisfy the following: • Fit the input data well • Correctly predict the class labels of records it has never seen before. • Training set • Consisting of records whose class labels are known • used to build a classification model
• 263. General approach to solving a classification problem • Confusion Matrix • Used to evaluate the performance of a classification model • Holds details about • counts of test records correctly and incorrectly predicted by the model. • Table 4.2 depicts the confusion matrix for a binary classification problem. • fij – no. of records from class i predicted to be of class j. • f01 – no. of records from class 0 incorrectly predicted as class 1. • total no. of correct predictions made (f11 + f00) • total number of incorrect predictions (f10 + f01).
• 264. General approach to solving a classification problem • Performance Metrics: 1. Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00) 2. Error Rate = (f10 + f01) / (f11 + f10 + f01 + f00)
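A sketch that derives both metrics from the binary confusion-matrix counts f11, f10, f01, f00; the counts themselves are made-up values for illustration.

# Confusion-matrix counts for a binary problem (made-up values):
# f_ij = number of records from class i predicted as class j
f11, f10 = 40, 10     # records whose actual class is 1
f01, f00 = 5, 45      # records whose actual class is 0

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total          # fraction of correct predictions
error_rate = (f10 + f01) / total        # fraction of wrong predictions

print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
assert abs(accuracy + error_rate - 1.0) < 1e-12   # the two metrics always sum to 1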
• 265. DECISION TREE INDUCTION Working of Decision Tree • We can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test record. • Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record. • The series of questions and their possible answers can be organized in the form of a decision tree • A decision tree is a hierarchical structure consisting of nodes and directed edges.
  • 266. DECISION TREE INDUCTION Working of Decision Tree • Three types of nodes: • Root node • No incoming edges • Zero or more outgoing edges. • Internal nodes • Exactly one incoming edge and • Two or more outgoing edges. • Leaf or terminal nodes • Exactly one incoming edge and • No outgoing edges. • Each leaf node is assigned a class label. • Non-terminal nodes (root & other internal nodes) contain attribute test conditions to separate records that have different characteristics.
  • 267. DECISION TREE INDUCTION Working of Decision Tree
  • 268. DECISION TREE INDUCTION Building Decision Tree • Hunt’s algorithm: • basis of many existing decision tree induction algorithms, including • ID3, • C4.5, and • CART. • In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. • Dt - set of training records associated with node t • y = {y1, y2,..., yc} - set of class labels. • Hunt’s algorithm: • Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt. • Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. • Note:- the algorithm is then recursively applied to each child node.
  • 269. DECISION TREE INDUCTION Building Decision Tree • Example:- predicting whether a loan applicant will repay the loan or default • Construct a training set by examining the records of previous borrowers.
  • 270. DECISION TREE INDUCTION Building Decision Tree • Hunt’s algorithm will work fine • if every combination of attribute values is present in the training data and • if each combination has a unique class label. • Additional conditions 1. If a child node is empty (no training records reach it), declare it a leaf node with the same class label as the majority class of training records associated with its parent node. 2. If the records have identical attribute values but different class labels, it is not possible to split further; declare the node a leaf with the same class label as the majority class of training records associated with this node. • Design Issues of Decision Tree Induction • 1. How should the training records be split? • Test condition to divide the records into smaller subsets. • provide a method for specifying the test condition • measure for evaluating the goodness of each test condition. • 2. How should the splitting procedure stop? • A stopping condition is needed • stop when either all the records belong to the same class or all the records have identical attribute values.
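To make the two steps of Hunt's algorithm concrete, here is a minimal recursive sketch. It is a simplification rather than the textbook pseudocode: the test attribute is chosen arbitrarily instead of with an impurity measure, and the tiny training set at the bottom is hypothetical.

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a decision tree over `records`, a list of (attribute_dict, label) pairs.
    Returns a class label (leaf node) or a dict {attr: {value: subtree}}."""
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]

    # Step 1: all records in Dt belong to the same class -> leaf labeled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Nothing left to test (identical attribute values) -> leaf with the majority class.
    if not attributes:
        return majority

    # Step 2: select an attribute test condition (here: simply the first attribute).
    attr, remaining = attributes[0], attributes[1:]
    tree = {attr: {}}
    for value in {x[attr] for x, _ in records}:
        child = [(x, y) for x, y in records if x[attr] == value]
        # Empty child -> leaf labeled with the majority class of the parent's records.
        tree[attr][value] = hunt(child, remaining) if child else majority
    return tree

# Hypothetical loan-default style training set (not the textbook table).
data = [({"home_owner": "yes", "married": "no"}, "no"),
        ({"home_owner": "no", "married": "yes"}, "no"),
        ({"home_owner": "no", "married": "no"}, "yes")]
print(hunt(data, ["home_owner", "married"]))
```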
  • 271. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Binary Attributes • The test condition for a binary attribute generates two potential outcomes.
  • 272. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Nominal Attributes • A nominal attribute can have many values • Test condition can be expressed in two ways • Multiway split - the number of outcomes depends on the number of distinct attribute values • Binary splits (used in CART) - produces binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.
  • 273. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Ordinal Attributes • Ordinal attributes can also produce binary or multiway splits. • Values can be grouped as long as the grouping does not violate the order property. • The grouping shown in Figure 4.10(c) is invalid because it violates the order property.
  • 274. DECISION TREE INDUCTION Methods for Expressing Attribute Test Conditions • Test condition for Continuous Attributes • Test condition - Comparison test (A < v) or (A ≥ v) with binary outcomes, or • Test condition - a range query with outcomes of the form vi ≤ A < vi+1, for i = 1,..., k. • Multiway split • Apply the discretization strategies
  • 275. DECISION TREE INDUCTION Measures for Selecting the Best Split • p(i|t) - fraction of records belonging to class i at a given node t. • Sometimes written simply as pi • Two-class problem • (p0, p1) - class distribution at any node, where p1 = 1 − p0 • A distribution of (0.5, 0.5) means there are an equal number of records from each class • Splitting on an attribute such as Car Type will result in purer partitions
  • 276. DECISION TREE INDUCTION Measures for Selecting the Best Split • Selection of the best split is based on the degree of impurity of the child nodes • A node with class distribution (0, 1) has zero impurity • A node with uniform class distribution (0.5, 0.5) has the highest impurity. • p - fraction of records that belong to one of the two classes. • Impurity is maximum when p = 0.5 (the class distribution is even) • Impurity is minimum when p = 0 or 1 (all records belong to the same class)
  • 277. DECISION TREE INDUCTION Measures for Selecting the Best Split • Node N1 has the lowest impurity value, followed by N2 and N3.
  • 278. DECISION TREE INDUCTION Measures for Selecting the Best Split • To determine the performance of a test condition, compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting). • The larger their difference, the better the test condition. • Information Gain: • I(·) - impurity measure of a given node, • N - total no. of records at the parent node, • k - no. of attribute values • N(vj) - no. of records associated with the child node vj. • The gain of a split is Δ = I(parent) − Σ_{j=1}^{k} [N(vj)/N] · I(vj); when entropy is used as the impurity measure, this difference is known as the information gain, Δinfo
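The impurity measures and the gain Δ can be sketched directly from these definitions. The child class counts below, (4, 3) and (2, 3), are chosen to match the attribute A split discussed on the following slides (Gini 0.4898 and 0.480); the parent distribution (6, 6) is an assumption:

```python
import math

def gini(counts):
    """Gini index: 1 - sum_i p(i|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum_i p(i|t) * log2(p(i|t))."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j)."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children_counts)
    return impurity(parent_counts) - weighted

parent = [6, 6]              # parent node: 6 records of class 0, 6 of class 1
children = [[4, 3], [2, 3]]  # class counts in the two child nodes of attribute A
print("Gini(N1) =", round(gini([4, 3]), 4))            # 0.4898
print("Gini(N2) =", round(gini([2, 3]), 4))            # 0.48
print("Gini-based gain =", round(gain(parent, children, gini), 4))
print("Information gain =", round(gain(parent, children, entropy), 4))
```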
  • 279. Calculate Impurity using Gini. Find out which attribute is selected.
  • 281. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Splitting of Binary Attributes ○ Before splitting, the Gini index is 0.5 ■ because there are an equal number of records from both classes. ○ If attribute A is chosen to split the data, ■ Gini index ● node N1 = 0.4898, and ● node N2 = 0.480. ■ Weighted average of the Gini index for the descendent nodes is ● (7/12) × 0.4898 + (5/12) × 0.480 = 0.486. ○ Weighted average of the Gini index for attribute B is 0.375. ○ B is selected because of its smaller value
  • 283. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Splitting of Nominal Attributes ○ First Binary Grouping ■ Gini index of {Sports, Luxury} is 0.4922 and ■ the Gini index of {Family} is 0.3750. ■ The weighted average Gini index is 16/20 × 0.4922 + 4/20 × 0.3750 = 0.468. ○ Second binary grouping of {Sports} and {Family, Luxury}, ■ weighted average Gini index is 0.167. ● The second grouping has a lower Gini index because its corresponding subsets are much purer.
  • 284. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Splitting of Continuous Attributes ● A brute-force method - take every value of the attribute in the N records as a candidate split position v. ● Count the number of records with annual income less than or greater than v (computationally expensive). ● To reduce the complexity, the training records are sorted based on their annual income, ● Candidate split positions are identified by taking the midpoints between two adjacent sorted values:
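A brief sketch of the sorted-midpoint idea; the annual-income values below are illustrative:

```python
# Candidate split positions for a continuous attribute: sort the distinct values
# and take the midpoint between each pair of adjacent values.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # illustrative values, in $1000s
values = sorted(set(incomes))
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(candidates)   # each candidate v defines the binary test (Income < v) vs (Income >= v)
```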
  • 285. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Gain Ratio ○ Problem: ■ Customer ID produces the purest partitions. ■ But Customer ID is not a predictive attribute because its value is unique for each record. ○ Two Strategies: ■ First strategy (used in CART) ● restrict the test conditions to binary splits only. ■ Second strategy (used in C4.5 - Gain Ratio - to determine the goodness of a split) ● modify the splitting criterion ● consider the number of outcomes produced by the attribute test condition.
  • 286. DECISION TREE INDUCTION Measures for Selecting the Best Split ● Gain Ratio = Δinfo / Split Info, where Split Info = − Σ_{i=1}^{k} P(vi) log2 P(vi) and k is the total number of splits; a test condition that produces a large number of small partitions has a high Split Info, which lowers its gain ratio.
  • 288. Tree-Pruning • After building the decision tree, • Tree-pruning step - to reduce the size of the decision tree. • Pruning - • trims the branches of the initial tree • improves the generalization capability of the decision tree. • Decision trees that are too large are susceptible to a phenomenon known as overfitting.
  • 292. Model Overfitting ● Errors that generally occur in a classification model: ○ Training Errors (or Resubstitution Error or Apparent Error) ■ No. of misclassification errors committed on the training data ○ Generalization Errors ■ Expected error of the model on previously unseen records ● Model Overfitting: ○ A model is overfitting the training data when it performs well on the training data but does not perform well on the evaluation (test) data. ○ This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
  • 293. Model Overfitting ● Model Underfitting: ● Model is underfitting the training data when the model performs poorly on the training data. ● Model is unable to capture the relationship between the input examples (X) and the target values (Y). https://www.cs.cornell.edu/courses/cs4780/2017sp/lectures/lecturenote12.html
  • 296. Model Overfitting Overfitting Due to Presence of Noise: Train Error - 0, Test Error - 30% ● Humans and dolphins were misclassified ● Spiny anteaters (exceptional case) ● Errors due to exceptional cases are often unavoidable and establish the minimum error rate achievable by any classifier.
  • 297. Model Overfitting Overfitting Due to Presence of Noise: Train Error - 20%, Test Error - 10%
  • 298. Model Overfitting Overfitting Due to Lack of Representative Samples Overfitting can also occur when only a small number of training records is available ● Training error is zero, test error is 30% ● Humans, elephants, and dolphins are misclassified ● The decision tree classifies all warm-blooded vertebrates that do not hibernate as non-mammals (because of the eagle record - lack of representative samples).
  • 299. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Methods commonly used to evaluate the performance of a classifier ○ Holdout method ○ Random Subsampling ○ Cross Validation ■ K-fold ■ Leave-one-out ○ Bootstrap ■ .632 Bootstrap
  • 300. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Holdout method ○ Original data is partitioned into two disjoint sets ■ training set ■ test set ○ A classification model is then induced from the training set ○ Model performance is evaluated on the test set. ○ Analysts can decide the proportion of data reserved for training and for testing ■ e.g., 50-50, or ■ two-thirds for training and one-third for testing ○ Limitations 1. The model may not be good because only a subset of the records is available for model induction 2. The model may be highly dependent on the composition of the training and test sets. ● If the training set is small, the variance of the model is larger. ● If the training set is too large, the estimated accuracy from a small test set is less reliable. 3. The training and test sets are no longer independent https://www.datavedas.com/holdout-cross-validation/
  • 301. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Random Subsampling ○ The holdout method can be repeated several times to improve the estimation of a classifier’s performance. ○ Overall accuracy is the average of the accuracies obtained over all runs. ○ Problems: ■ Does not utilize as much data as possible for training. ■ No control over the number of times each record is used for testing and training. https://blog.ineuron.ai/Hold-Out-Method-Random-Sub-Sampling-Method-3MLDEXAZML
  • 302. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Cross Validation ○ Alternative to random subsampling ○ Each record is used the same number of times for training and exactly once for testing. ○ Two-fold cross-validation ■ Partition the data into two equal-sized subsets. ■ Use one of the subsets for training and the other for testing. ■ Then swap the roles of the subsets https://fengkehh.github.io/post/introduction-to-cross-validation/ - picture reference
  • 303. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Cross Validation ○ K-Fold Cross Validation ■ k equal-sized partitions ■ During each run, ● one of the partitions is chosen for testing, ● while the rest of them are used for training. ■ Total error is found by summing up the errors for all k runs. Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
  • 304. Model Overfitting - Evaluating the Performance of a Classifier Cross-validation ● Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly split up into ‘k’ groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set. ● For example, for 5-fold cross validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times so each group would get a chance to be the test set. This can be seen in the graph below. ● 5-fold cross validation (image credit) Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
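A minimal k-fold partitioning sketch; the `evaluate` callback and the toy data are placeholders for whatever classifier and dataset are being assessed:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split the record indices 0..n-1 into k (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, k, evaluate):
    """Each fold serves exactly once as the test set; the rest form the training set.
    Returns the total error summed over all k runs."""
    folds = k_fold_indices(len(records), k)
    total_error = 0
    for i, test_idx in enumerate(folds):
        test = [records[j] for j in test_idx]
        train = [records[j] for f in folds[:i] + folds[i + 1:] for j in f]
        total_error += evaluate(train, test)   # e.g. number of misclassified test records
    return total_error

# Placeholder evaluator: pretend every run misclassifies 2 test records.
print(cross_validate(list(range(20)), k=5, evaluate=lambda train, test: 2))   # 10
```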
  • 305. Model Overfitting - Evaluating the Performance of a Classifier Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR 5-fold cross validation (image credit)
  • 306. Model Overfitting - Evaluating the Performance of a Classifier Evaluating the Performance of a Classifier ● Cross Validation ○ Leave-one-out approach ■ A special case of k-fold cross-validation ● sets k = N (dataset size) ■ Size of test set = 1 record ■ All remaining records = training set ■ Advantage ● Utilizes as much data as possible for training ● Test sets are mutually exclusive and they effectively cover the entire data set. ■ Drawback ● computationally expensive Picture Reference - https://blog.ineuron.ai/Cross-Validation-and-its-types-3eHiWiqJiR
  • 307. Model Overfitting Evaluating the Performance of a Classifier ● Bootstrap ○ Training records are sampled with replacement; ■ A record already chosen for training is put back into the original pool of records so that it is equally likely to be redrawn. ○ Probability that a record is chosen by a bootstrap sample is 1 − (1 − 1/N)^N ■ When N is sufficiently large, the probability asymptotically approaches 1 − e^(−1) = 0.632. ○ On average, a bootstrap sample contains 63.2% of the records of the original data. ● .632 bootstrap: acc_boot = (1/b) Σ_{i=1}^{b} (0.632 × εi + 0.368 × acc_s), where ● b - no. of bootstrap samples ● εi - accuracy of the ith bootstrap sample, acc_s - accuracy on the full training data Picture reference - https://bradleyboehmke.github.io/HOML/process.html
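The 63.2% figure is easy to check empirically by drawing a single bootstrap sample; N below is arbitrary:

```python
import random

N = 10_000                                            # number of records (arbitrary)
sample = [random.randrange(N) for _ in range(N)]      # sample N records with replacement
fraction_distinct = len(set(sample)) / N
print(round(fraction_distinct, 3))                    # close to 1 - (1 - 1/N)**N  ~  0.632
```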
  • 309. Bayesian Classifiers ● In many applications the relationship between the attribute set and the class variable is non-deterministic. ● Example: ○ Risk for heart disease based on the person’s diet and workout frequency. ● So, we model the relationship between the attribute set and the class variable probabilistically. ● Bayes Theorem
  • 310. Bayesian Classifiers ● Consider a football game between two rival teams: Team 0 and Team 1. ● Suppose Team 0 wins 65% of the time and Team 1 wins the remaining matches. ● Among the games won by Team 0, only 30% of them come from playing on Team 1’s football field. ● On the other hand, 75% of the victories for Team 1 are obtained while playing at home. ● If Team 1 is to host the next match between the two teams, which team will most likely emerge as the winner? ● This problem can be solved by Bayes Theorem
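The question can be answered by plugging the given percentages into Bayes' theorem; a short sketch of that calculation (the variable names are ours):

```python
# Given: P(Team 0 wins) = 0.65, P(Team 1 wins) = 0.35,
#        P(hosted by Team 1 | Team 0 wins) = 0.30,
#        P(hosted by Team 1 | Team 1 wins) = 0.75.
p_win1, p_win0 = 0.35, 0.65
p_host_given_win1, p_host_given_win0 = 0.75, 0.30

# Bayes' theorem: P(Team 1 wins | game hosted by Team 1).
p_host = p_host_given_win1 * p_win1 + p_host_given_win0 * p_win0
p_win1_given_host = p_host_given_win1 * p_win1 / p_host
print(round(p_win1_given_host, 3))   # ~0.574, so Team 1 is the more likely winner
```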
  • 311. Bayesian Classifiers ● Bayes Theorem ○ X and Y are random variables. ○ A conditional probability is the probability that a random variable will take on a particular value given that the outcome for another random variable is known. ○ Example: ■ The conditional probability P(Y = y|X = x) refers to the probability that the variable Y will take on the value y, given that the variable X is observed to have the value x.
  • 312. Bayesian Classifiers ● Bayes Theorem: P(Y|X) = P(X|Y) P(Y) / P(X). If {Y1, Y2,..., Yk} is the set of mutually exclusive and exhaustive outcomes of the random variable Y, then the denominator P(X) can be expressed using the law of total probability: P(X) = Σ_{i=1}^{k} P(X|Yi) P(Yi).
  • 314. Bayesian Classifiers ● Bayes Theorem ○ Using the Bayes Theorem for Classification ■ X - attribute set ■ Y - class variable. ○ Treat X and Y as random variables - for a non-deterministic relationship ○ Capture the relationship probabilistically using P(Y|X) - the posterior (conditional) probability ○ P(Y) - prior probability ○ Training phase ■ Learn the posterior probabilities P(Y|X) for every combination of X and Y ○ Use these probabilities to classify a test record X′ by finding the class Y′ that maximizes the posterior probability P(Y′|X′)
  • 315. Bayesian Classifiers Using the Bayes Theorem for Classification Example:- ● test record X= (Home Owner = No, Marital Status = Married, Annual Income = $120K) ● Y=? ● Use training data & compute - posterior probabilities P(Yes|X) and P(No|X) ● Y= Yes, if P(Yes|X) > P(No|X) ● Y= No, Otherwise
  • 316. Bayesian Classifiers Computing P(X|Y) - Class-Conditional Probability Naïve Bayes Classifier ● assumes that the attributes are conditionally independent, given the class label y. ● The conditional independence assumption can be formally stated as follows: P(X|Y = y) = Π_{i=1}^{d} P(Xi|Y = y), where the attribute set X = {X1, X2,..., Xd}.
  • 317. Bayesian Classifiers How a Naïve Bayes Classifier Works ● Assumption - conditional independence ● Estimate the conditional probability of each Xi, given Y ○ (instead of computing the class-conditional probability for every combination of X) ○ No need of a very large training set to obtain a good estimate of the probability. ● To classify a test record, ○ Compute the posterior probability for each class Y: P(Y|X) = P(Y) Π_{i=1}^{d} P(Xi|Y) / P(X) ■ P(X) can be ignored ● Since it is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term
  • 318. Bayesian Classifiers Estimating Conditional Probabilities for Binary Attributes Xi - categorical attribute, xi - one of the values of attribute Xi Y - target attribute (class label), y - one class label Conditional probability P(Xi = xi|Y = y) = fraction of training instances in class y that take on attribute value xi. P(Home Owner=yes|DB=no) = (No. of records with HO=yes and DB=no)/(Total no. of records with DB=no) = 3/7 P(Home Owner=no|DB=no) = 4/7 P(Home Owner=yes|DB=yes) = 0 P(Home Owner=no|DB=yes) = 3/3
  • 319. Bayesian Classifiers Estimating Conditional Probabilities for Categorical Attributes P(MS=single|DB=no) = 2/7 P(MS=married|DB=no) = 4/7 P(MS=divorced|DB=no) = 1/7 P(MS=single|DB=yes) = 2/3 P(MS=married|DB=yes) = 0/3 P(MS=divorced|DB=yes) = 1/3
  • 320. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Discretization ● Probability Distribution
  • 321. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Discretization (transforming continuous attributes into ordinal attributes) ○ Replace the continuous attribute value with its corresponding discrete interval. ○ Estimation error depends on ■ the discretization strategy ■ the number of discrete intervals. ○ If the number of intervals is too large, there are too few training records in each interval ○ If the number of intervals is too small, then some intervals may aggregate records from different classes and we may miss the correct decision boundary.
  • 322. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Probability Distribution ○ A Gaussian distribution can be used to represent the class-conditional probability for continuous attributes. ○ The distribution is characterized by two parameters, ■ mean, µ ■ variance, σ² ○ P(Xi = xi|Y = yj) = (1/√(2πσij²)) exp(−(xi − µij)² / (2σij²)) µij - sample mean of Xi for all training records that belong to the class yj. σij² - sample variance (s²) of such training records.
  • 323. Bayesian Classifiers Estimating Conditional Probabilities for Continuous Attributes ● Probability Distribution sample mean and variance for this attribute with respect to the class No
  • 324. Bayesian Classifiers Example of the Naïve Bayes Classifier ● Compute the class-conditional probability for each categorical attribute ● Compute the sample mean and variance for the continuous attribute ● Predict the class label of a test record X = (Home Owner=No, Marital Status = Married, Income = $120K) ● compute the posterior probabilities ○ P(No|X) ○ P(Yes|X)
  • 325. Bayesian Classifiers Example of the Naïve Bayes Classifier ● P(yes) = 3/10 = 0.3, P(no) = 7/10 = 0.7
  • 326. Bayesian Classifiers Example of the Naïve Bayes Classifier ● P(no|X) = ? ● P(yes|X) = ? ● The larger value gives the class label ● X = (Home Owner=No, Marital Status = Married, Income = $120K) ● P(no|Home Owner=No, Marital Status = Married, Income = $120K) = ? ● P(Y|X) ∝ P(Y) × P(X|Y) (P(X) is dropped because it is the same for every class) ● P(no|Home Owner=No, Marital Status = Married, Income = $120K) ∝ P(DB=no) × P(Home Owner=No, Marital Status = Married, Income = $120K|DB=no) ● P(X|DB=no) = P(HO=no|DB=no) × P(MS=married|DB=no) × P(Income=$120K|DB=no) = 4/7 × 4/7 × 0.0072 = 0.0024
  • 327. Bayesian Classifiers Example of the Naïve Bayes Classifier P(DB=no|X) ∝ P(DB=no) × P(X|DB=no) = 7/10 × 0.0024 = 0.0016 P(DB=yes|X) ∝ P(DB=yes) × P(X|DB=yes) = 3/10 × 0 = 0 Class label for the record is NO
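A compact sketch that reproduces the calculation above end-to-end. The categorical probabilities come from the training table used on these slides; the Annual Income parameters (mean 110 and variance 2975 for class No, mean 90 and variance 25 for class Yes) are the values implied by the 0.0072 figure above and should be treated as assumptions. P(X) is dropped since it is the same for both classes.

```python
import math

def gaussian(x, mean, var):
    """Gaussian class-conditional density for a continuous attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior = {"no": 7 / 10, "yes": 3 / 10}
p_home_owner_no = {"no": 4 / 7, "yes": 3 / 3}
p_married = {"no": 4 / 7, "yes": 0 / 3}
income_params = {"no": (110.0, 2975.0),   # assumed sample mean / variance, class No
                 "yes": (90.0, 25.0)}     # assumed values, class Yes (irrelevant here,
                                          # since P(Married|Yes) = 0 already gives 0)

# Score each class for X = (Home Owner = No, Married, Income = 120K), up to 1/P(X).
scores = {}
for c in ("no", "yes"):
    mean, var = income_params[c]
    scores[c] = prior[c] * p_home_owner_no[c] * p_married[c] * gaussian(120, mean, var)

print({c: round(s, 4) for c, s in scores.items()})   # {'no': ~0.0016, 'yes': 0.0}
print("predicted class:", max(scores, key=scores.get))
```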
  • 328. Bayesian Classifiers Find out Class Label ( Play Golf ) for today = (Sunny, Hot, Normal, False) https://www.geeksforgeeks.org/naive-bayes-classifiers/
  • 329. Association Analysis: Basic Concepts and Algorithms Basic Concepts and Algorithms DWDM Unit - IV
  • 330. Basic Concepts ● Retailers are interested in analyzing the data to learn about the purchasing behavior of their customers. ● Such information is used in marketing promotions, inventory management, and customer relationship management. ● Association analysis - useful for discovering interesting relationships hidden in large data sets. ● The uncovered relationships can be represented in the form of association rules or sets of frequent items.
  • 331. Basic Concepts ● Example Association Rule ○ {Diapers} → {Beer} ● The rule suggests that a strong relationship exists between the sale of diapers and beer ● many customers who buy diapers also buy beer. ● Association analysis is also applicable to ○ Bioinformatics, ○ Medical diagnosis, ○ Web mining, and ○ Scientific data analysis ● Example - analysis of Earth science data (ocean, land, & atmospheric processes)
  • 332. Basic Concepts Problem Definition: ● Binary representation of market basket data ● each row - transaction ● each column - item ● value is one if the item is present in a transaction and zero otherwise. ● An item is an asymmetric binary variable because the presence of an item in a transaction is often considered more important than its absence
  • 333. Basic Concepts Itemset and Support Count: I = {i1, i2,..., id} - set of all items T = {t1, t2,..., tN} - set of all transactions Each transaction ti contains a subset of items chosen from I Itemset - collection of zero or more items k-itemset - itemset that contains k items Example:- {Beer, Diapers, Milk} - 3-itemset null (or empty) set - no items
  • 334. Basic Concepts Itemset and Support Count: ● Transaction width - number of items present in a transaction. ● A transaction tj contains an itemset X if X is a subset of tj. ● Example: ○ t2 contains the itemset {Bread, Diapers} but not {Bread, Milk}. ● support count, σ(X) - number of transactions that contain a particular itemset. ● σ(X) = |{ti | X ⊆ ti, ti ∈ T}|, ○ where the symbol | · | denotes the number of elements in a set. ● support count for {Beer, Diapers, Milk} = 2 ○ (2 transactions contain all three items)
  • 335. Basic Concepts Association Rule: ● An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets ○ i.e., X ∩ Y = ∅. ● The strength of an association rule can be measured in terms of its support and confidence.
  • 336. Basic Concepts ● Support ○ determines how often a rule is applicable to a given data set ○ s(X → Y) = σ(X ∪ Y) / N ● Confidence ○ determines how frequently items in Y appear in transactions that contain X ○ c(X → Y) = σ(X ∪ Y) / σ(X)
  • 337. Basic Concepts ● Example: ○ Consider the rule {Milk, Diapers} → {Beer} ○ support count for {Milk, Diapers, Beer} = 2 ○ total number of transactions = 5, ○ rule’s support is 2/5 = 0.4. ○ rule’s confidence = (support count for {Milk, Diapers, Beer})/(support count for {Milk, Diapers}) = 2/3 = 0.67.
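A minimal sketch of these calculations on the usual five-transaction market-basket example (treat the transaction list as illustrative):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    """Support count: number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)   # s(X -> Y) = sigma(X u Y) / N
confidence = sigma(X | Y) / sigma(X)         # c(X -> Y) = sigma(X u Y) / sigma(X)
print(support, round(confidence, 2))         # 0.4 0.67
```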
  • 338. Basic Concepts Formulation of Association Rule Mining Problem Association Rule Discovery Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
  • 339. Basic Concepts Formulation of Association Rule Mining Problem Association Rule Discovery ● Brute-force approach: compute the support and confidence for every possible rule (expensive) ● Total number of possible rules extracted from a data set that contains d items is R = 3^d − 2^(d+1) + 1 ● For a data set of 6 items, the number of possible rules is 3^6 − 2^7 + 1 = 602 rules. ● More than 80% of the rules are discarded after applying minsup = 20% & minconf = 50% ● so most of the computation becomes wasted. ● Better to prune the rules early without having to compute their support and confidence values.
  • 340. Basic Concepts Formulation of Association Rule Mining Problem Association Rule Discovery ● Common strategy - decompose the problem into two major subtasks: (separate support & confidence) 1. Frequent Itemset Generation: ■ Objective: Find all the itemsets that satisfy the minsup threshold. 2. Rule Generation: ■ Objective: Extract all the high-confidence rules from the frequent itemsets found in the previous step. ■ These rules are called strong rules.
  • 341. Frequent Itemset Generation ● Lattice structure - list of all possible itemsets ● itemset lattice for ○ I = {a, b, c, d, e} ● A data set with k items can generate up to 2^k − 1 frequent itemsets (excluding the null set) ○ Example:- 2^5 − 1 = 31 ● So, the search space of itemsets in practical applications is exponentially large
  • 342. Frequent Itemset Generation ● A brute-force approach for finding frequent itemsets ○ determine the support count for every candidate itemset in the lattice structure. ● compare each candidate against every transaction ● Very expensive ○ requires O(NMw) comparisons, ○ N - no. of transactions, ○ M = 2^k − 1 - the number of candidate itemsets ○ w - maximum transaction width.
  • 343. Frequent Itemset Generation There are several ways to reduce the computational complexity of frequent itemset generation: 1. Reduce the number of candidate itemsets (M) - the Apriori principle 2. Reduce the number of comparisons - by using more advanced data structures
  • 344. The Apriori Principle Frequent Itemset Generation Principle If an itemset is frequent, then all of its subsets must also be frequent.
  • 345. Frequent Itemset Generation Support-based pruning: ● The strategy of trimming the exponential search space based on the support measure is known as support-based pruning. ● It uses the anti-monotone property of the support measure. ● Anti-monotone property of the support measure ○ support for an itemset never exceeds the support for its subsets. ● Example: ○ {a, b} is infrequent, ○ then all of its supersets must be infrequent too. ○ the entire subgraph containing the supersets of {a, b} can be pruned immediately
  • 346. Frequent Itemset Generation Let, I - set of items J = 2^I - power set of I A measure f is monotone/anti-monotone if Monotonicity property (or upward closed): ∀X, Y ∈ J: (X ⊆ Y) → f(X) ≤ f(Y) Anti-monotone (or downward closed): ∀X, Y ∈ J: (X ⊆ Y) → f(Y) ≤ f(X), which means that if X is a subset of Y, then f(Y) must not exceed f(X).
  • 347. Frequent Itemset Generation in the Apriori Algorithm
  • 350. Frequent Itemset Generation in the Apriori Algorithm Ck - set of candidate k-itemsets Fk - set of frequent k-itemsets
  • 351. Frequent Itemset Generation in the Apriori Algorithm https://www.softwaretestinghelp.com/apriori- algorithm/#:~:text=Apriori%20algorithm%20is%20a%20sequence,is%20assumed%20by%20the%20user.
  • 352. Example Frequent Itemset Generation in the Apriori Algorithm
  • 357. Frequent Itemset Generation in the Apriori Algorithm Ck - set of candidate k-itemsets Fk - set of frequent k-itemsets
  • 358. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning The apriori-gen function shown in Step 5 of Algorithm 6.1 generates candidate itemsets by performing the following two operations: 1. Candidate Generation (join) a. Generates new candidate k-itemsets b. based on the frequent (k − 1)-itemsets found in the previous iteration. 2. Candidate Pruning a. Eliminates some of the candidate k-itemsets using the support-based pruning strategy.
  • 359. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning Requirements for an effective candidate generation procedure: 1. It should avoid generating too many unnecessary candidates 2. It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out 3. It should not generate the same candidate itemset more than once (no duplicates).
  • 360. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning Candidate Generation Procedures 1. Brute-Force Method 2. Fk−1 × F1 Method 3. Fk−1 × Fk−1 Method
  • 361. Frequent Itemset Generation in the Apriori Algorithm Candidate Generation and Pruning Candidate Generation Procedures 1. Brute-Force Method a. considers every k-itemset as a potential candidate b. candidate pruning (to remove unnecessary candidates) becomes extremely expensive c. No. of candidate itemsets generated at level k = C(d, k), where d is the no. of items
  • 362. 2. Fk−1 × F1 Method Extends each frequent (k−1)-itemset with a frequent 1-itemset, producing O(|Fk−1| × |F1|) candidate k-itemsets, where |Fj| = no. of frequent j-itemsets. The overall complexity of this step is O(Σk k |Fk−1| |F1|).
  • 363. ● The procedure is complete. ● But the same candidate itemset may be generated more than once (duplicates). ● Example: ○ {Bread, Diapers, Milk} can be generated ○ by merging {Bread, Diapers} with {Milk}, ○ {Bread, Milk} with {Diapers}, or ○ {Diapers, Milk} with {Bread}. ● One solution ○ Generate candidate itemsets by joining items in lexicographical order only ● {Bread, Diapers} joins with {Milk} Don’t join ● {Diapers, Milk} with {Bread} ● {Bread, Milk} with {Diapers} because of violation of lexicographic ordering Problem: a large no. of unnecessary candidates
  • 364. 3. Fk−1 × Fk−1 Method (used in the apriori-gen function) ● merges a pair of frequent (k−1)-itemsets only if their first k−2 items are identical. ● Let A = {a1, a2,..., ak−1} and B = {b1, b2,..., bk−1} be a pair of frequent (k−1)-itemsets. ● A and B are merged if they satisfy the following conditions: ○ ai = bi (for i = 1, 2,..., k−2) and ○ ak−1 != bk−1.
  • 365. Merge {Bread, Diapers} & {Bread, Milk} to form a candidate 3- itemset {Bread, Diapers, Milk} Don’t merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different.
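A short sketch of the Fk−1 × Fk−1 merge followed by support-based candidate pruning; itemsets are kept as sorted tuples, and the F2 list below is illustrative:

```python
from itertools import combinations

def apriori_gen(freq_km1):
    """F(k-1) x F(k-1) candidate generation: merge two frequent (k-1)-itemsets whose
    first k-2 items are identical, then prune any candidate that has an infrequent
    (k-1)-subset."""
    freq_set = set(freq_km1)
    candidates = []
    for a, b in combinations(sorted(freq_km1), 2):
        if a[:-1] == b[:-1]:                           # first k-2 items identical
            cand = a + (b[-1],)                        # merged k-itemset, still sorted
            # candidate pruning: every (k-1)-subset must itself be frequent
            if all(sub in freq_set for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

f2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
print(apriori_gen(f2))    # [('Bread', 'Diapers', 'Milk')]
```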
  • 366. Support Counting ● Support counting is the process of determining the frequency of occurrence for every candidate itemset that survives the candidate pruning step. ● One approach for doing this is to compare each transaction against every candidate itemset (see Figure 6.2) and to update the support counts of candidates contained in the transaction. ● This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.
  • 367. Support Counting ● An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets. ● To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. ● Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. ● For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two items in t whose labels are greater than or equal to 5.
  • 368. Support Counting ● The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix structures. For instance, the prefix structure for item 1 represents a 3-itemset that begins with item 1, followed by two more items chosen from the set {2, 3, 5, 6}. ● After fixing the first item, the prefix structures at Level 2 represent the number of ways to select the second item. For example, the prefix (1 2) corresponds to itemsets that begin with (1 2) and are followed by item 3, 5, or 6. ● Finally, the prefix structures at Level 3 represent the complete set of 3-itemsets contained in t. For example, the 3-itemsets that begin with prefix {1 2} are {1, 2, 3}, {1, 2, 5}, and {1, 2, 6}, while those that begin with prefix {2 3} are {2, 3, 5} and {2, 3, 6}.
  • 369. Support Counting (steps 6 through 11 of Algorithm 6.1) ● Enumerate the itemsets contained in each transaction ● Figure 6.9 demonstrates how itemsets contained in a transaction can be systematically enumerated, i.e., by specifying their items one by one, from the leftmost item to the rightmost item. ● If an enumerated itemset of the transaction matches one of the candidates, then the support count of the corresponding candidate is incremented (line 9 in the algorithm). For instance, given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t
  • 370. Support Counting Using a Hash Tree ● Candidate itemsets are partitioned into different buckets and stored in a hash tree. ● Itemsets contained in each transaction are also hashed into their appropriate buckets. ● Instead of comparing each itemset in the transaction with every candidate itemset, ● it is matched only against candidate itemsets that belong to the same bucket
  • 371. Hash Tree from a Candidate Itemset https://www.youtube.com/watch?v=btW-uU1dhWI
  • 376. Rule generation & Compact representation of frequent itemsets DWDM Unit - IV Association Analysis
  • 377. Rule Generation ● Each frequent k-itemset can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents. ● An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.
  • 378. Confidence-Based Pruning Theorem: If a rule X → Y −X does not satisfy the confidence threshold, then any rule X` → Y − X`, where X` is a subset of X, must not satisfy the confidence threshold as well.
  • 379. Rule Generation in Apriori Algorithm ● The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. ● Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. ● These rules are then used to generate new candidate rules. For example, if {acd} →{b} and {abd} →{c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
  • 380. Rule Generation in Apriori Algorithm ● Figure 6.15 shows a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}. ● If any node in the lattice has low confidence, then according to Theorem, the entire sub-graph spanned by the node can be pruned immediately. ● Suppose the confidence for {bcd} → {a} is low. All the rules containing item a in its consequent, can be discarded.
  • 381. In rule generation, we do not have to make additional passes over the data set to compute the confidence of the candidate rules. Instead, we determine the confidence of each rule by using the support counts computed during frequent itemset generation.
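A sketch of extracting the rules of one frequent itemset using only stored support counts (no extra pass over the data). For brevity it enumerates every antecedent instead of applying the level-wise confidence-based pruning described above, and the support counts are illustrative:

```python
from itertools import combinations

# Support counts recorded during frequent itemset generation (illustrative values).
support = {
    frozenset({"Bread"}): 4, frozenset({"Milk"}): 4, frozenset({"Diapers"}): 4,
    frozenset({"Bread", "Milk"}): 3, frozenset({"Bread", "Diapers"}): 3,
    frozenset({"Diapers", "Milk"}): 3, frozenset({"Bread", "Diapers", "Milk"}): 2,
}

def rules_from_itemset(itemset, minconf):
    """Enumerate the 2^k - 2 rules X -> Y - X of a frequent k-itemset Y and keep
    those whose confidence sigma(Y) / sigma(X) meets the threshold."""
    y = frozenset(itemset)
    kept = []
    for r in range(1, len(y)):                        # non-empty, proper antecedents
        for antecedent in combinations(y, r):
            x = frozenset(antecedent)
            conf = support[y] / support[x]
            if conf >= minconf:
                kept.append((set(x), set(y - x), round(conf, 2)))
    return kept

for rule in rules_from_itemset({"Bread", "Diapers", "Milk"}, minconf=0.6):
    print(rule)    # the three pair -> singleton rules, each with confidence 0.67
```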
  • 383. Compact Representation of Frequent Itemsets Maximal Frequent Itemsets Definition A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.
  • 384. Compact Representation of Frequent Itemsets ● The itemsets in the lattice are divided into two groups: those that are frequent and those that are infrequent. ● A frequent itemset border, which is represented by a dashed line, is also illustrated in the diagram. ● Every itemset located above the border is frequent, while those located below the border (the shaded nodes) are infrequent. ● Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are considered to be maximal frequent itemsets because their immediate supersets are infrequent. ● Maximal frequent itemsets do not contain the support information of their subsets.
  • 385. Compact Representation of Frequent Itemsets ● Maximal frequent itemsets effectively provide a compact representation of frequent itemsets. ● They form the smallest set of itemsets from which all frequent itemsets can be derived. ● For example, the frequent itemsets shown in Figure 6.16 can be divided into two groups: ○ Frequent itemsets that begin with item a and that may contain items c, d, or e. This group includes itemsets such as {a}, {a, c}, {a, d}, {a, e} and {a, c, e}. ○ Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b, c}, {c, d},{b, c, d, e}, etc. ● Frequent itemsets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those that belong in the second group are subsets of {b, c, d, e}.
  • 386. Compact Representation of Frequent Itemsets ● Closed Frequent Itemsets ○ Closed itemsets provide a minimal representation of itemsets without losing their support information. ○ An itemset X is closed if none of its immediate supersets has exactly the same support count as X. Or ○ X is not closed if at least one of its immediate supersets has the same support count as X.
  • 387. Compact Representation of Frequent Itemsets Closed Frequent Itemsets ● An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
  • 388. Compact Representation of Frequent Itemsets Closed Frequent Itemsets ● Determine the support counts of the non-closed frequent itemsets by using the closed frequent itemsets ● consider the frequent itemset {a, d} - it is not closed, so its support count must be identical to one of its immediate supersets {a, b, d}, {a, c, d}, or {a, d, e}. ● The Apriori principle states ○ any transaction that contains a superset of {a, d} must also contain {a, d}. ○ any transaction that contains {a, d} does not have to contain the supersets of {a, d}. ● So, the support for {a, d} = largest support among its supersets = support of {a, c, d} ● The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent itemsets.
  • 390. ● The items can be divided into three groups: (1) Group A, which contains items a1 through a5; (2) Group B, which contains items b1 through b5; and (3) Group C, which contains items c1 through c5. ● The items within each group are perfectly associated with each other and they do not appear with items from another group. Assuming the support threshold is 20%, the total number of frequent itemsets is 3 × (2^5 − 1) = 93. ● There are only three closed frequent itemsets in the data: ({a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5})
  • 391. ● Redundant association rules can be removed by using Closed frequent itemsets ● An association rule X → Y is redundant if there exists another rule X`→ Y`, where X is a subset of X` and Y is a subset of Y ` such that the support and confidence for both rules are identical.
  • 392. ● From Table 6.5, {b} is not a closed frequent itemset while {b, c} is closed. ● The association rule {b} → {d, e} is therefore redundant because it has the same support and confidence as {b, c} → {d, e}. ● Such redundant rules are not generated if closed frequent itemsets are used for rule generation. ● All maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
  • 393. FP Growth Algorithm Association Analysis (Unit - IV) DWDM
  • 394. FP Growth Algorithm ● The FP-growth algorithm takes a radically different approach for discovering frequent itemsets. ● The algorithm encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. FP-Tree Representation ● An FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. ● As different transactions can have several items in common, their paths may overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. ● If the size of the FP-tree is small enough to fit into main memory, this will allow us to extract frequent itemsets directly from the structure in memory instead of making repeated passes over the data stored on disk.
  • 396. FP Tree Representation ● Figure 6.24 shows a data set that contains ten transactions and five items. ● The structures of the FP-tree after reading the first three transactions are also depicted in the diagram. ● Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. ● Initially, the FP-tree contains only the root node represented by the null symbol.
  • 397. FP Tree Representation 1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded, while the frequent items are sorted in decreasing support counts. For the data set shown in Figure, a is the most frequent item, followed by b, c, d, and e.
  • 398. FP Tree Representation 2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to encode the transaction. Every node along the path has a frequency count of 1.
  • 399. FP Tree Representation 3. After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c, and d. A path is then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this path also has a frequency count equal to one. 4. The third transaction, {a,c,d,e}, shares a common prefix item (which is a) with the first transaction. As a result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one. 5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree. The resulting FP-tree after reading all the transactions is shown in Figure 6.24.
  • 400. FP Tree Representation ● The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data often share a few items in common. ● In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a single branch of nodes. ● The worst-case scenario happens when every transaction has a unique set of items.
  • 401. FP Tree Representation ● The size of an FP-tree also depends on how the items are ordered. ● If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP- tree is shown in Figure 6.25. ● An FP-tree also contains a list of pointers connecting between nodes that have the same items. ● These pointers, represented as dashed lines in Figures 6.24 and 6.25, help to facilitate the rapid access of individual items in the tree.
  • 402. Frequent Itemset Generation using FP-Growth Algorithm Steps in FP-Growth Algorithm: Step-1: Scan the database to build the frequent 1-item set, which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. Step-2: For each transaction, the respective Ordered-Item set is built. Step-3: Construct the FP-tree by scanning each Ordered-Item set. Step-4: For each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths which lead to any node of the given item in the frequent-pattern tree. Step-5: For each item, the Conditional Frequent Pattern Tree is built. Step-6: Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with each corresponding item.
  • 403. Frequent Itemset Generation in FP-Growth Algorithm Example: The frequency of each individual item is computed:- Given Database: min_support=3
  • 404. Frequent Itemset Generation in FP-Growth Algorithm ● A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. ● L = {K : 5, E : 4, M : 3, O : 3, Y : 3} ● Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent Pattern set and checking if the current item is contained in the transaction. The following table is built for all the transactions:
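A small sketch of Steps 1 and 2 only (item counting and building the ordered-item sets). The transaction database below is a hypothetical one chosen so that it reproduces the frequencies on the slide (K:5, E:4, M:3, O:3, Y:3); the FP-tree construction itself is left out:

```python
from collections import Counter

transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

# Step 1: count item frequencies and keep items with count >= min_support,
# ordered by decreasing frequency.
counts = Counter(item for t in transactions for item in t)
frequent = [item for item, c in counts.most_common() if c >= min_support]
print(frequent)            # ['K', 'E', ...]; ties among M, O, Y may appear in any order

# Step 2: the ordered-item set of each transaction keeps only the frequent items,
# listed in the global frequency order; these are what get inserted into the FP-tree.
ordered_item_sets = [[item for item in frequent if item in t] for t in transactions]
for row in ordered_item_sets:
    print(row)
```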
  • 405. Frequent Itemset Generation in FP-Growth Algorithm Now, all the Ordered-Item sets are inserted into a Trie Data Structure. a) Inserting the set {K, E, M, O, Y}: All the items are simply linked one after the other in the order of occurrence in the set and initialize the support count for each item as 1.
  • 406. Frequent Itemset Generation in FP-Growth Algorithm b) Inserting the set {K, E, O, Y}: Till the insertion of the elements K and E, simply the support count is increased by 1. There is no direct link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new node of Y.
  • 407. Frequent Itemset Generation in FP-Growth Algorithm c) Inserting the set {K, E, M}: ● Here simply the support count of each element is increased by 1.
  • 408. Frequent Itemset Generation in FP-Growth Algorithm d) Inserting the set {K, M, Y}: ● Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
  • 409. Frequent Itemset Generation in FP-Growth Algorithm e) Inserting the set {K, E, O}: ● Here simply the support counts of the respective elements are increased.
  • 410. Frequent Itemset Generation in FP-Growth Algorithm Now, for each item starting from leaf, the Conditional Pattern Base is computed which is path labels of all the paths which lead to any node of the given item in the frequent-pattern tree.
  • 411. Frequent Itemset Generation in FP-Growth Algorithm Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common in all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base. The itemsets whose support count >= min_support value are retained in the Conditional Frequent Pattern Tree and the rest are discarded.
  • 412. Frequent Itemset Generation in FP-Growth Algorithm From the Conditional Frequent Pattern tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below. For each row, two types of association rules can be inferred; for example, for the first row, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated and the one with confidence greater than or equal to the minimum confidence value is retained.
  • 413. Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Introduction to Data Mining, 2nd Edition Tan, Steinbach, Karpatne, Kumar
  • 414. What is Cluster Analysis? ● Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups ● Inter-cluster distances are maximized; intra-cluster distances are minimized
  • 415. Applications of Cluster Analysis ● Understanding – Group related documents for browsing (Information Retrieval), – group genes and proteins that have similar functionality (Biology), – group stocks with similar price fluctuations (Business) – Climate – Psychology & Medicine Clustering precipitation in Australia
  • 416. Applications of Cluster Analysis ● Clustering for Utility – Summarization – Compression – Efficiently finding Nearest Neighbors Clustering precipitation in Australia
  • 417. Notion of a Cluster can be Ambiguous How many clusters? Two Clusters, Four Clusters, or Six Clusters
  • 418. Types of Clusterings ● A clustering is a set of clusters ● Important distinction between hierarchical and partitional sets of clusters – Partitional Clustering (unnested) ◆ A division of data objects into non-overlapping subsets (clusters) – Hierarchical clustering (nested) ◆ A set of nested clusters organized as a hierarchical tree
  • 419. Partitional Clustering Original Points A Partitional Clustering
  • 420. Hierarchical Clustering Traditional Hierarchical Clustering and Traditional Dendrogram; Non-traditional Hierarchical Clustering and Non-traditional Dendrogram
  • 421. Other Distinctions Between Sets of Clusters ● Exclusive versus non-exclusive – In non-exclusive clusterings, points may belong to multiple clusters. ◆ Can belong to multiple classes or could be ‘border’ points – Fuzzy clustering (one type of non-exclusive) ◆ In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 ◆ Weights must sum to 1 ◆ Probabilistic clustering has similar characteristics ● Partial versus complete – In some cases, we only want to cluster some of the data
  • 422. Types of Clusters ● Well-separated clusters ● Prototype-based clusters ● Contiguity-based clusters ● Density-based clusters ● Described by an Objective Function
  • 423. Types of Clusters: Well-Separated ● Well-Separated Clusters: – A cluster with a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
  • 424. Types of Clusters: Prototype-Based ● Prototype-based (or center-based) – A cluster with a set of points such that a point in a cluster is closer (more similar) to the prototype or “center” of the cluster than to the center of any other cluster – If data is continuous – the center will be the centroid/mean – If data is categorical - the center will be the medoid (most representative point) 4 center-based clusters
  • 425. Types of Clusters: Contiguity-Based (Graph) ● Contiguous Cluster (Nearest neighbor or Transitive) – A cluster with a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. – Graph view (data - nodes, links - connections): a cluster is a group of connected objects with no connections to objects outside the group. ● Useful when clusters are irregular or intertwined ● Trouble when noise is present – a small bridge of points can merge two distinct clusters. 8 contiguous clusters
  • 426. Types of Clusters: Density-Based ● Density-based – A cluster is a dense region of points, which is separated by low-density regions from other regions of high density. – Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters The two circular clusters are not merged, as in the figure, because the bridge between them (previous slide figure) fades into the noise. The curve that is present in the previous slide figure also fades into the noise and does not form a cluster. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present.
  • 427. Types of Clusters: Shared Property (Conceptual Clusters) ● Shared property – a cluster is a set of objects that share some property. A clustering algorithm would need a very specific (sophisticated) concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering.
  • 428. Clustering Algorithms ● K-means and its variants ● Hierarchical clustering ● Density-based clustering
  • 429. K-means ● Prototype-based, partitional clustering technique ● Attempts to find a user-specified number of clusters (K)
  • 430. Agglomerative Hierarchical Clustering ● Hierarchical clustering ● Starts with each point as a singleton cluster ● Repeatedly merges the two closest clusters until a single, all-encompassing cluster remains. ● Sometimes viewed as graph-based clustering ● Other times as a prototype-based approach.
  • 431. DBSCAN ● Density-based clustering algorithm ● Produces a partitional clustering, ● No. of clusters is automatically determined by the algorithm. ● Noise - points in low-density regions (omitted) ● Not a complete clustering.
  • 432. K-means Clustering ● Partitional clustering approach ● Number of clusters, K, must be specified ● Each cluster is associated with a centroid (center point) ● Each point is assigned to the cluster with the closest centroid ● The basic algorithm is very simple
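A minimal pure-Python version of the basic algorithm just described (Euclidean distance, random initial centroids); the toy points are illustrative. On real data one would normally rely on a library implementation and rerun it with several initializations, as discussed later in this section.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: choose initial centroids, then repeat
    {assign each point to its nearest centroid; recompute each centroid as the mean}
    until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                    # random initial centroids
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: new centroid = mean of the points in the cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                   # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```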
  • 433. Example of K-means Clustering
  • 434. Example of K-means Clustering
  • 435. K-means Clustering – Details ● Simple iterative algorithm. – Choose initial centroids; – repeat {assign each point to the nearest centroid; re-compute cluster centroids} – until centroids stop changing. ● Initial centroids are often chosen randomly. – Clusters produced can vary from one run to another ● The centroid is (typically) the mean of the points in the cluster, but other definitions are possible ● Most of the convergence happens in the first few iterations. – Often the stopping condition is changed to ‘Until relatively few points change clusters’ ● Complexity is O( n * K * I * d ) – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
• 436. K-means Clustering – Details ● The choice of centroid can vary, depending on the proximity measure for the data and the goal of the clustering. ● The goal of the clustering is typically expressed by an objective function that depends on the proximities of the points to one another or to the cluster centroids. ● E.g., minimize the squared distance of each point to its closest centroid.
  • 437. K-means Clustering – Details 25
• 438. Centroids and Objective Functions – Data in Euclidean Space ● A common objective function (used with the Euclidean distance measure) is the Sum of Squared Error (SSE) – For each point, the error is the distance to the nearest cluster centroid – To get the SSE, we square these errors and sum them: SSE = Σᵢ₌₁ᴷ Σ_{x ∈ Cᵢ} dist(mᵢ, x)² – x is a data point in cluster Cᵢ and mᵢ is the centroid (mean) of cluster Cᵢ – Given several K-means runs, the one that produces the minimum SSE is preferred – The centroid (mean) of the i-th cluster is mᵢ = (1/|Cᵢ|) Σ_{x ∈ Cᵢ} x (a short computation sketch follows below)
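As a rough companion to the formulas above, the snippet below computes the SSE and the cluster means for a given assignment; in scikit-learn's KMeans the same SSE quantity is exposed as the inertia_ attribute. Function and variable names here are illustrative.

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum over clusters of the squared Euclidean distance of each point
    # to the centroid of the cluster it is assigned to.
    return sum(
        np.sum((X[labels == i] - m) ** 2)
        for i, m in enumerate(centroids)
    )

def cluster_means(X, labels, k):
    # m_i = (1/|C_i|) * sum of the points in C_i
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```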
• 439. K-means Objective Function – Document Data ● Proximity measure: cosine similarity ● Document data is represented as a document–term matrix ● Objective: maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is called the cohesion of the cluster (see the sketch below)
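A small sketch of this objective for document data, assuming tf-idf vectors and scikit-learn's KMeans; the toy documents and variable names are ours, not from the slides.

```python
# Cohesion of document clusters: total cosine similarity of each document
# to the centroid of the cluster it belongs to.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["data mining and clustering", "warehouse and OLAP cubes",
        "document clustering with cosine similarity", "star schema and fact tables"]
X = TfidfVectorizer().fit_transform(docs)           # document-term matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

centroids = km.cluster_centers_
X_dense = X.toarray()
cohesion = 0.0
for x, label in zip(X_dense, km.labels_):
    c = centroids[label]
    cohesion += x @ c / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
print("total cohesion:", cohesion)
```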
• 440. Two Different K-means Clusterings ● Figure (a) shows a clustering solution that is the global minimum of the SSE for three clusters. ● Figure (b) shows a suboptimal clustering that is only a local minimum. (Figures: original points; Fig. (a) optimal clustering; Fig. (b) sub-optimal clustering)
• 441. Importance of Choosing Initial Centroids … ● The two figures below show the clusters that result from two particular choices of initial centroids (in both figures, the positions of the cluster centroids in the various iterations are indicated by crosses). ● In Figure 1, even though all the initial centroids are from one natural cluster, the minimum-SSE clustering is still found. ● In Figure 2, even though the initial centroids appear to be better distributed, we obtain a suboptimal clustering with a higher squared error; this is an example of a poor choice of starting centroids.
  • 442. Importance of Choosing Initial Centroids … 30
  • 443. Problems with Selecting Initial Points 31 ● Figure 5.7 shows that if a pair of clusters has only one initial centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be split.
  • 444. 10 Clusters Example 32 Starting with two initial centroids in one cluster of each pair of clusters
  • 445. 10 Clusters Example 33 Starting with two initial centroids in one cluster of each pair of clusters
• 446. 10 Clusters Example – Starting with some pairs of clusters having three initial centroids, while others have only one.
• 447. 10 Clusters Example – Starting with some pairs of clusters having three initial centroids, while others have only one.
• 448. Solutions to Initial Centroids Problem ● Multiple runs ● K-means++ ● Use hierarchical clustering to determine initial centroids ● Bisecting K-means
• 449. Multiple Runs ● One technique commonly used to address the problem of choosing initial centroids is to perform multiple runs, each with a different set of randomly chosen initial centroids, and then select the set of clusters with the minimum SSE (see the sketch below). ● In Figure 5.6(a), the data consists of two pairs of clusters, where the clusters in each (top–bottom) pair are closer to each other than to the clusters in the other pair. ● Figure 5.6(b–d) shows that if we start with two initial centroids per pair of clusters, then even when both centroids end up in a single cluster, the centroids will redistribute themselves so that the "true" clusters are found.
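A hedged sketch of the multiple-runs strategy. scikit-learn's KMeans does this internally through its n_init parameter; the explicit loop below just makes the idea visible. The function name and the assumption that X is an (n_points, n_features) array are ours.

```python
# Run K-means several times with different random initial centroids and
# keep the run with the lowest SSE (exposed as inertia_ in scikit-learn).
from sklearn.cluster import KMeans

def best_of_runs(X, k, runs=10):
    best = None
    for seed in range(runs):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best

# Equivalent, relying on scikit-learn's own restarts:
#   KMeans(n_clusters=k, n_init=10).fit(X)
```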
• 452. Bisecting K-means ● Bisecting K-means algorithm – A variant of K-means that can produce either a partitional or a hierarchical clustering (a sketch follows below) – CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
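The slides do not give pseudocode for bisecting K-means, so the sketch below follows one common formulation: repeatedly split the cluster with the largest SSE using 2-means until K clusters remain. Recent scikit-learn releases also ship a BisectingKMeans estimator; the function name and splitting criterion here are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    # Start with a single cluster containing all points (stored as index arrays).
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE to split next.
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        # Split it with 2-means.
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[worst])
        clusters.append(worst[km.labels_ == 0])
        clusters.append(worst[km.labels_ == 1])
    # Convert the list of index arrays into a flat label vector.
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```

Recording the sequence of splits, rather than only the final labels, is what yields the hierarchical version mentioned on the slide.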
• 454. Limitations of K-means ● K-means has problems when clusters have differing – sizes – densities – non-globular shapes ● K-means also has problems when the data contains outliers – One possible solution is to remove outliers before clustering (a heuristic sketch follows below)
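The slide only mentions outlier removal as one possible fix; the snippet below is one simple heuristic for it (our assumption, not from the text): cluster once, drop the points farthest from their centroids, then re-cluster the rest.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_without_outliers(X, k, drop_fraction=0.05):
    # First pass: cluster everything and measure each point's distance
    # to its assigned centroid.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    # Treat the most distant points as outliers and remove them.
    keep = d <= np.quantile(d, 1.0 - drop_fraction)
    # Second pass: cluster only the remaining points.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[keep]), keep
```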
  • 455. Limitations of K-means: Differing Sizes 43 Original Points K-means (3 Clusters)
  • 456. Limitations of K-means: Differing Density 44 Original Points K-means (3 Clusters)
  • 457. Limitations of K-means: Non-globular Shapes 45 Original Points K-means (2 Clusters)
  • 458. Overcoming K-means Limitations 46 Original Points K-means Clusters One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be put together in a post-processing step.
  • 459. Overcoming K-means Limitations 47 Original Points K-means Clusters One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be put together in a post-processing step.
  • 460. Overcoming K-means Limitations 48 Original Points K-means Clusters One solution is to find a large number of clusters such that each of them represents a part of a natural cluster. But these small clusters need to be put together in a post-processing step.
• 461. Hierarchical Clustering ● Produces a set of nested clusters organized as a hierarchical tree ● Can be visualized as a dendrogram – A tree-like diagram that records the sequences of merges or splits
• 462. Strengths of Hierarchical Clustering ● We do not have to assume any particular number of clusters – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level (see the sketch below)
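A short illustration of 'cutting' the dendrogram with SciPy: the hierarchy is built once and different numbers of clusters are read off it. The toy data and variable names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))    # placeholder data
Z = linkage(X, method="average")                     # build the hierarchy once
labels_3 = fcluster(Z, t=3, criterion="maxclust")    # 'cut' to obtain 3 clusters
labels_5 = fcluster(Z, t=5, criterion="maxclust")    # or 5, from the same tree
print(labels_3, labels_5)
```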
• 463. Hierarchical Clustering ● Two main types of hierarchical clustering – Agglomerative: ◆ Start with the points as individual clusters ◆ At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains – Divisive: ◆ Start with one, all-inclusive cluster ◆ At each step, split a cluster until each cluster contains an individual point (or there are k clusters) ● Traditional hierarchical algorithms use a similarity or distance matrix – Merge or split one cluster at a time
• 464. Agglomerative Clustering Algorithm ● Key idea: successively merge the closest clusters ● Basic algorithm: 1. Compute the proximity matrix 2. Let each data point be a cluster 3. Repeat 4. Merge the two closest clusters 5. Update the proximity matrix 6. Until only a single cluster remains ● The key operation is the computation of the proximity of two clusters – Different approaches to defining the distance between clusters distinguish the different algorithms (a sketch follows below)
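A compact, naive sketch of steps 1–6 above, assuming Euclidean distance and the MIN (single-link) definition of cluster proximity; the function name and the choice of single link are our assumptions, since the generic algorithm leaves the proximity definition open.

```python
import numpy as np

def agglomerative_single_link(X, k=1):
    # Steps 1-2: singleton clusters and the point-to-point proximity matrix.
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def cluster_distance(a, b):
        # MIN / single link: closest pair of points across the two clusters.
        return min(D[i, j] for i in a for j in b)

    # Steps 3-6: repeatedly merge the two closest clusters until k remain.
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This naive version recomputes proximities from scratch at every merge, which is why the later slide quotes O(N³) time for straightforward implementations.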
• 465. Steps 1 and 2 ● Start with clusters of individual points (p1, p2, p3, p4, p5, …) and a proximity matrix (Figure: points and their proximity matrix)
• 466. Intermediate Situation ● After some merging steps, we have several clusters, C1–C5 (Figure: current clusters and their proximity matrix)
• 467. Step 4 ● We want to merge the two closest clusters (C2 and C5) and update the proximity matrix (Figure: clusters and proximity matrix before the merge)
• 468. Step 5 ● The question is: "How do we update the proximity matrix?" (Figure: proximity matrix after merging C2 and C5, with the new entries for C2 ∪ C5 marked '?')
• 469. How to Define Inter-Cluster Distance ● MIN ● MAX ● Group Average ● Distance Between Centroids ● Other methods driven by an objective function – Ward's Method uses squared error (Figure: points and their proximity matrix)
• 470.–473. How to Define Inter-Cluster Similarity (Figures: the same bullet list as the previous slide, with each slide graphically illustrating one of the definitions — MIN, MAX, Group Average, and Distance Between Centroids)
• 474. MIN or Single Link ● Proximity of two clusters is based on the two closest points in the different clusters – Determined by one pair of points, i.e., by one link in the proximity graph ● Example (Figure: distance matrix)
• 476. Strength of MIN (Figures: original points and the six clusters found) • Can handle non-elliptical shapes
• 477. Limitations of MIN (Figures: original points, a two-cluster result, and a three-cluster result) • Sensitive to noise
• 478. MAX or Complete Linkage ● Proximity of two clusters is based on the two most distant points in the different clusters – Determined by all pairs of points in the two clusters ● Example (Figure: distance matrix)
• 479. Hierarchical Clustering: MAX (Figures: nested clusters and the corresponding dendrogram)
• 480. Strength of MAX (Figures: original points and two clusters) • Less susceptible to noise
• 481. Limitations of MAX (Figures: original points and two clusters) • Tends to break large clusters • Biased towards globular clusters
• 482. Group Average ● Proximity of two clusters is the average of the pairwise proximities between points in the two clusters: proximity(Cᵢ, Cⱼ) = ( Σ_{x ∈ Cᵢ, y ∈ Cⱼ} proximity(x, y) ) / ( |Cᵢ| · |Cⱼ| ) ● Example (Figure: distance matrix)
• 483. Hierarchical Clustering: Group Average (Figures: nested clusters and the corresponding dendrogram)
• 484. Hierarchical Clustering: Group Average ● A compromise between single and complete link ● Strengths – Less susceptible to noise ● Limitations – Biased towards globular clusters
• 485. Cluster Similarity: Ward's Method ● Similarity of two clusters is based on the increase in squared error when the two clusters are merged – Similar to group average if the distance between points is the squared distance ● Less susceptible to noise ● Biased towards globular clusters ● Hierarchical analogue of K-means – Can be used to initialize K-means
• 486. Hierarchical Clustering: Comparison (Figures: nested clusterings of the same six points produced by MIN, MAX, Group Average, and Ward's Method — a code sketch of the comparison follows below)
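The four schemes compared above map directly onto SciPy's linkage methods ("single", "complete", "average", "ward"). The sketch below runs all four on the same toy data; the data and variable names are ours, standing in for the slide's six points.

```python
# Comparing MIN, MAX, Group Average, and Ward's Method with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))             # toy 2-D data

methods = {"MIN": "single", "MAX": "complete",
           "Group Average": "average", "Ward": "ward"}
for name, method in methods.items():
    Z = linkage(X, method=method)                     # build the hierarchy
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 clusters
    print(name, labels)
```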
• 487. Hierarchical Clustering: Time and Space Requirements ● O(N²) space, since it uses the proximity matrix – N is the number of points ● O(N³) time in many cases – There are N steps, and at each step a proximity matrix of size N² must be updated and searched – Complexity can be reduced to O(N² log N) time with some cleverness
• 488. Hierarchical Clustering: Problems and Limitations ● Once a decision is made to combine two clusters, it cannot be undone ● No global objective function is directly minimized ● Different schemes have problems with one or more of the following: – Sensitivity to noise – Difficulty handling clusters of different sizes and non-globular shapes – Breaking large clusters
• 489. Density-Based Clustering ● Clusters are regions of high density that are separated from one another by regions of low density
• 490. DBSCAN ● DBSCAN is a density-based algorithm – Density = number of points within a specified radius (Eps) – A point is a core point if it has at least a specified number of points (MinPts) within Eps ◆ These are points that are in the interior of a cluster ◆ The count includes the point itself – A border point is not a core point, but is in the neighborhood of a core point – A noise point is any point that is neither a core point nor a border point (see the classification sketch below)
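A sketch of the core/border/noise classification just described, using a radius query from scikit-learn. The function name and the way Eps and MinPts are passed in are our assumptions; the neighbor count includes the point itself, matching the slide.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def classify_points(X, eps, min_pts):
    # Indices of neighbors within radius eps; each point appears in its own
    # neighborhood (distance 0), so the count includes the point itself.
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)

    counts = np.array([len(nbrs) for nbrs in neighborhoods])
    core = counts >= min_pts
    # Border: not a core point, but has at least one core point within eps.
    border = ~core & np.array([core[nbrs].any() for nbrs in neighborhoods])
    noise = ~core & ~border
    return core, border, noise
```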
  • 491. DBSCAN: Core, Border, and Noise Points MinPts = 7 79
  • 492. DBSCAN: Core, Border and Noise Points 80 Original Points Point types: core, border and noise Eps = 10, MinPts = 4
• 493. DBSCAN Algorithm ● Form clusters using core points, and assign each border point to one of its neighboring clusters: 1. Label all points as core, border, or noise points. 2. Eliminate noise points. 3. Put an edge between all core points that are within a distance Eps of each other. 4. Make each group of connected core points into a separate cluster. 5. Assign each border point to one of the clusters of its associated core points. (A usage sketch follows below.)
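scikit-learn's DBSCAN implements essentially this procedure, so in practice the algorithm reduces to a call like the one below. The eps and min_samples values and the placeholder data are illustrative only; min_samples corresponds to MinPts and, like the slide's definition, counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # placeholder data

db = DBSCAN(eps=0.3, min_samples=4).fit(X)     # Eps = 0.3, MinPts = 4 (illustrative)
labels = db.labels_                            # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True      # which points were core points
print(n_clusters, "clusters;", int(np.sum(labels == -1)), "noise points")
```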
  • 494. When DBSCAN Works Well 82 Original Points Clusters (dark blue points indicate noise) • Can handle clusters of different shapes and sizes • Resistant to noise
• 495. When DBSCAN Does NOT Work Well (Figures: original points; clusterings with MinPts=4, Eps=9.92 and with MinPts=4, Eps=9.75) • Varying densities • High-dimensional data