Unit II
Data Mining Architecture
The main components of a data mining system are the data source, the database or data warehouse server, the data mining engine, the pattern evaluation module, the graphical user interface, and the knowledge base.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. A large amount of historical data is needed for data mining to be successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes even plain text files or spreadsheets may contain useful information. Another primary source of data is the World Wide Web, or the internet.
Different processes:
Before being passed to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because the information comes from various sources in different formats, it cannot be used directly for the data mining procedure: it may be incomplete or inaccurate. So the data first needs to be cleaned and unified. More information than necessary is usually collected from the various data sources, and only the data of interest is selected and passed to the server. These procedures are not as simple as they sound; several methods may be applied to the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server holds the original data that is ready to be processed. The server is therefore responsible for retrieving the relevant data according to the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises the tools and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, using a threshold value. It interacts with the data mining engine to focus the search on interesting patterns.
This module typically employs interestingness measures that cooperate with the data mining modules to steer the search towards interesting patterns. It may apply an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on how the data mining techniques are implemented. For efficient data mining, it is strongly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure, so that the search is confined to only the interesting patterns.
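As a minimal sketch of this idea, threshold-based filtering of discovered patterns might look like the following. The rules, support/confidence values, and thresholds are illustrative assumptions, not taken from the text:

```python
# Hypothetical discovered association rules: (rule, support, confidence).
# All values below are illustrative assumptions.
discovered = [
    ("bread -> butter", 0.40, 0.90),
    ("milk -> eggs",    0.05, 0.60),
    ("tea -> sugar",    0.25, 0.80),
]

MIN_SUPPORT = 0.10      # interestingness thresholds used by the
MIN_CONFIDENCE = 0.70   # pattern evaluation module to prune patterns

def interesting(patterns, min_sup, min_conf):
    """Keep only the patterns that pass both interestingness thresholds."""
    return [(r, s, c) for r, s, c in patterns
            if s >= min_sup and c >= min_conf]

print(interesting(discovered, MIN_SUPPORT, MIN_CONFIDENCE))
```

In a real system this filter would be pushed into the mining loop itself, so uninteresting candidate patterns are pruned before they are fully generated.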
Graphical User Interface:
The graphical user interface (GUI) module mediates between the data mining system and the user. It helps the user to use the system easily and efficiently without needing to know the complexity of the underlying process. The module passes the user's query or task to the data mining system and displays the results.
Knowledge Base:
The knowledge base supports the entire data mining process. It can help guide the search or evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experience that can aid the data mining process. The data mining engine may receive inputs from the knowledge base to make results more accurate and reliable. The pattern evaluation module interacts regularly with the knowledge base to obtain inputs and to update it.
Discretization in data mining
Data discretization is a method of converting a large number of data values into a smaller set so that evaluating and managing the data becomes easier. In other words, data discretization converts the attribute values of continuous data into a finite set of intervals with minimal data loss. There are two forms of data discretization: supervised and unsupervised. Supervised discretization uses the class information in the data. Unsupervised discretization does not, and is instead characterized by the direction in which the operation proceeds: it uses either a top-down splitting strategy or a bottom-up merging strategy.
Now we can understand this concept with the help of an example. Suppose we have an Age attribute with the following values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table before and after discretization (attribute: Age):

Before discretization:  1, 5, 4, 9, 7 | 11, 14, 17, 13, 18, 19 | 31, 33, 36, 42, 44, 46 | 70, 74, 77, 78
After discretization:   Child         | Young                  | Mature                 | Old
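The mapping above can be sketched in a few lines of code. The exact cut points (10, 30, 60) are assumptions chosen to be consistent with the groups in the table:

```python
# Discretize the Age values from the example into the four labelled
# intervals used in the table. Cut points 10/30/60 are assumptions
# consistent with the value groups shown above.
ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

def age_group(age):
    if age <= 10:
        return "Child"
    elif age <= 30:
        return "Young"
    elif age <= 60:
        return "Mature"
    else:
        return "Old"

groups = [age_group(a) for a in ages]
print(groups[:5])   # the first five ages (1, 5, 9, 4, 7) are all "Child"
```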
Another example comes from web analytics, where we gather statistics about website visitors. For instance, all visitors who access the site from an IP address in India are grouped together at the country level.
Some well-known techniques of data discretization
Histogram analysis
A histogram is a plot that represents the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example revealing outliers, skewness, or an approximately normal distribution.
Binning
Binning is a data-smoothing technique that groups a large number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for developing concept hierarchies.
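As a minimal sketch of binning, the following splits a value range into k equal-width bins and smooths each value to its bin mean. The sample data and choice of k are illustrative assumptions:

```python
# Equal-width binning: split [min, max] into k bins of equal width,
# then smooth each value by replacing it with its bin's mean.
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # min(..., k - 1) ensures the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def smooth_by_bin_means(values, k):
    smoothed = {}
    for b in equal_width_bins(values, k):
        if not b:          # skip empty bins
            continue
        mean = sum(b) / len(b)
        for v in b:
            smoothed[v] = mean
    return [smoothed[v] for v in values]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]          # assumed sample values
print(equal_width_bins(data, 3))                    # three bins of width 10
print(smooth_by_bin_means(data, 3))                 # values replaced by bin means
```

Equal-depth (equal-frequency) binning, which puts roughly the same number of values in each bin, is a common alternative.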
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm discretizes a numeric attribute x by partitioning the values of x into clusters or groups.
Data discretization using decision tree analysis
Decision tree analysis discretizes data using a top-down splitting technique; it is a supervised procedure. To discretize a numeric attribute, you first select the split point that minimizes the entropy of the class labels, and then apply this step recursively. The recursive process divides the attribute into discretized disjoint intervals, from top to bottom, using the same splitting criterion at each level.
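One step of this procedure, choosing the single split point with minimum weighted class entropy, can be sketched as follows. The toy values and labels are assumptions for illustration:

```python
import math

# One step of entropy-based (supervised) discretization: pick the split
# point on a numeric attribute that minimizes the weighted class entropy
# of the two resulting intervals.
def entropy(labels):
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        w = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        split = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint candidate
        if w < best[0]:
            best = (w, split)
    return best[1]

# Toy data (assumed): the class flips from "a" to "b" between 14 and 22,
# so the best split is the midpoint 18.0.
vals = [1, 5, 9, 14, 22, 30, 41]
labs = ["a", "a", "a", "a", "b", "b", "b"]
print(best_split(vals, labs))   # 18.0
```

Full entropy-based discretization would recurse on each of the two resulting intervals until a stopping criterion is met.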
Data discretization using correlation analysis
Discretizing data by correlation analysis (for example, the chi-square based ChiMerge method) works bottom-up: the most similar neighboring intervals are identified and then recursively merged into larger intervals, with minimal loss of information. It is a supervised procedure.
Data discretization and concept hierarchy generation
The term hierarchy refers to an organizational structure or mapping in which items are ranked according to their level of generality or importance. In other words, a concept hierarchy is a sequence of mappings from a set of low-level, specific concepts to higher-level, more general concepts; that is, the mapping goes from low-level concepts to high-level concepts. For example, computer science has many hierarchical systems: a document placed in a folder in Windows, at a specific place in the tree structure, is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand a concept hierarchy for the dimension location with the help of an example.
A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
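The location hierarchy above can be sketched as a simple child-to-parent mapping. The extra city "Mumbai" is an assumed example entry:

```python
# Minimal sketch of a concept hierarchy for the location dimension,
# stored as child -> parent mappings (city -> country -> continent).
# "Mumbai" is an assumed extra entry for illustration.
hierarchy = {
    "New Delhi": "India",
    "Mumbai": "India",
    "India": "Asia",
}

def generalize(concept, hierarchy):
    """Walk from a low-level concept up to the most general concept."""
    path = [concept]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path

print(generalize("New Delhi", hierarchy))   # ['New Delhi', 'India', 'Asia']
```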
Top-down mapping
Top-down mapping generally starts at the top with general information and ends at the bottom with specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with specialized information and ends at the top with generalized information.
Data discretization and binarization in data mining
Data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimal data loss. In contrast, data binarization transforms continuous and discrete attributes into binary attributes.
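As a minimal sketch of binarization: a continuous attribute can be thresholded into a 0/1 flag, and a discrete attribute can be expanded into one binary attribute per category (one-hot encoding). The threshold, sample values, and category list are assumptions:

```python
# Binarization sketch: thresholding for continuous attributes and
# one-hot encoding for discrete attributes. All sample values below
# are illustrative assumptions.
def binarize_continuous(values, threshold):
    """1 if the value meets the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in values]

def one_hot(values, categories):
    """One binary attribute per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

incomes = [1200, 800, 3000]
print(binarize_continuous(incomes, 1000))       # [1, 0, 1]

colors = ["red", "blue", "red"]
print(one_hot(colors, ["red", "blue"]))         # [[1, 0], [0, 1], [1, 0]]
```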
Why is Discretization important?
Continuous data poses a mathematical problem with an infinite number of degrees of freedom. For many purposes, data scientists therefore need to apply discretization. It is also used to improve the signal-to-noise ratio.