DATA MINING AND DATA WAREHOUSING
BY: SHRUTI SHARMA
CONTENT
• Data warehousing components
• Need for data warehousing
• Basic elements of data warehousing
• Data mart
• Data extraction
• Cleanup
• Transformation tools
• Metadata
• Star, snowflake and galaxy schemas; multidimensional databases
• Facts and dimension data
• Partitioning strategy: horizontal and vertical partitioning
DATA WAREHOUSING COMPONENTS
• DATA: DATA IS A COLLECTION OF FACTS AND FEATURES IN A PARTICULAR FORMAT.
• INFORMATION: INFORMATION IS PROCESSED DATA.
• DATABASE: AN ORGANIZED COLLECTION OF DIFFERENT KINDS OF DATA IN A PARTICULAR FORMAT.
• DATA WAREHOUSE: A COLLECTION OF DIFFERENT KINDS OF DATABASES IN A PARTICULAR FORMAT.
• MINING: THE EXTRACTION OF DATA (PATTERNS) FROM THE DATA WAREHOUSE.
DATA WAREHOUSE OBJECTIVE
• DATA WAREHOUSE: A DATA WAREHOUSE CAN BE VIEWED AS A DATABASE FOR HISTORICAL DATA FROM DIFFERENT FUNCTIONS WITHIN A COMPANY.
[Diagram: a central data warehouse fed by the Delhi, Mumbai, Pune, and Indore branches]
BUILDING A DATA WAREHOUSE
• Subject oriented: A data warehouse typically provides information on a topic (such as customer, supplier, product, or sales) rather than on ongoing company operations.
• Time variant: Time-variant keys (e.g., for the date, month, time) are typically present, so data is identified with a particular time period.
• Integrated: A data warehouse combines data from various sources. These may include cloud sources, relational databases, flat files, structured and semi-structured data, metadata, and master data.
BUILDING A DATA WAREHOUSE
• Non-volatile: Prior data isn't deleted when new data is added. Historical data is preserved for comparisons, trends, and analytics.
• Scalability: Performance does not degrade as the load on the system increases.
NEED FOR DATA WAREHOUSING
• Ensure consistency. Data warehouses are programmed to apply a uniform
format to all collected data, which makes it easier for corporate decision-
makers to analyze and share data insights with their colleagues around the
globe.
• Data warehousing improves the speed and efficiency of accessing different
data sets and makes it easier for corporate decision-makers to derive insights
that will guide the business and marketing strategies that set them apart from
their competitors.
NEED FOR DATA WAREHOUSING
• Improve the bottom line: Warehoused data allows executives to see where they can adjust their strategy to decrease costs, maximize efficiency, and increase sales, improving the bottom line.
BASIC ELEMENTS OF DATA WAREHOUSING
• A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools,
metadata, and access tools. All of these components are engineered for speed so that you can get results
quickly and analyze data on the fly.
• 1. Operational Source –
• An operational source is a data source consisting of operational data and external data.
• Data can come from relational DBMSs such as Informix or Oracle.
• 2. Load Manager –
• The load manager performs all operations associated with the extraction and loading of data into the data warehouse.
• These tasks include simple transformations of data to prepare it for entry into the warehouse.
• 3. Warehouse Manager –
• The warehouse manager is responsible for the warehouse management process.
• The operations performed by the warehouse manager include analysis, aggregation, backup and archiving of data, and de-normalization of the data.
ELEMENTS
• 4. Query Manager –
• Query Manager performs all the tasks associated with the management of user queries.
• The complexity of the query manager is determined by the end-user access operations tool and
the features provided by the database.
• 5. Detailed Data –
• It is used to store all the detailed data in the database schema.
• Detailed data is loaded into the data warehouse to complement the data collected.
• 6. Summarized Data –
• Summarized Data is a part of the data warehouse that stores predefined aggregations
• These aggregations are generated by the warehouse manager.
ELEMENTS
• 7. Archive and Backup Data –
• The Detailed and Summarized Data are stored for the purpose of archiving and backup.
• The data is relocated to storage archives such as magnetic tapes or optical disks.
• 8. Metadata –
• Metadata is data about data.
• It is used in the extraction and loading process, the warehouse management process, and the
query management process.
• 9. End User Access Tools –
• End-User Access Tools consist of Analysis, Reporting, and mining.
• By using end-user access tools users can link with the warehouse.
DATA MART
• A data mart is a subset of a data warehouse focused on a particular line of
business, department, or subject area. Data marts make specific data
available to a defined group of users, which allows those users to quickly
access critical insights without wasting time searching through an entire data
warehouse.
DATA EXTRACTION
• Extraction is the operation of extracting data from a source system for
further use in a data warehouse environment. This is the first step of the ETL
process. After the extraction, this data can be transformed and loaded into
the data warehouse.
CLEANUP
• Step 1: Identify the critical data fields
• Step 2: Collect the data
• Step 3: Discard duplicate values
• Step 4: Resolve empty values
• Step 5: Standardize the cleansing process
• Step 6: Review, adapt, repeat
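• The cleanup steps above can be sketched with Python and pandas; the customer table and its column names below are illustrative assumptions, not data from the slides.

```python
import pandas as pd

# Hypothetical raw customer extract; column names are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email":       ["a@x.com", "b@x.com", "b@x.com", None, "d@x.com"],
    "city":        ["Delhi", "Mumbai", "Mumbai", "Pune", None],
})

# Steps 1-2: keep only the critical fields collected for analysis.
critical = raw[["customer_id", "email", "city"]]

# Step 3: discard duplicate rows.
deduped = critical.drop_duplicates()

# Step 4: resolve empty values (flag missing e-mails, default the city).
cleaned = deduped.assign(
    email=deduped["email"].fillna("unknown@example.com"),
    city=deduped["city"].fillna("UNKNOWN"),
)

# Steps 5-6: standardize and re-check; in practice this runs on every load.
cleaned["city"] = cleaned["city"].str.upper()
print(cleaned)
```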
ETL PROCESS
• ETL is a process in Data Warehousing and it stands
for Extract, Transform and Load. It is a process in which an ETL tool extracts
the data from various data source systems, transforms it in the staging area,
and then finally, loads it into the Data Warehouse system.
ETL
• Extraction: The first step of the ETL process is extraction. In this step, data from various source systems, which can be in formats such as relational databases, NoSQL stores, XML, and flat files, is extracted into the staging area. It is important to extract the data into the staging area first, rather than directly into the data warehouse, because the extracted data arrives in various formats and may also be corrupted.
• Transformation: The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format.
ETL LOAD PROCESS
• Loading: The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes data is loaded into the warehouse very frequently; at other times it is loaded at longer but regular intervals.
[Diagram: block diagram of the ETL pipeline]
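• A minimal extract-transform-load sketch in Python follows; pandas, sqlite3, and the file, table, and column names are assumptions used only for illustration.

```python
import sqlite3
import pandas as pd

# --- Extract: pull raw rows from a hypothetical source file into a staging frame.
staging = pd.read_csv("sales_source.csv")          # path is an assumption

# --- Transform: apply rules in the staging area to reach one standard format.
staging["sale_date"] = pd.to_datetime(staging["sale_date"])
staging["amount"] = staging["amount"].astype(float)
staging = staging.dropna(subset=["customer_id"])   # reject corrupted rows

# --- Load: write the conformed data into the warehouse table.
warehouse = sqlite3.connect("warehouse.db")
staging.to_sql("fact_sales", warehouse, if_exists="append", index=False)
warehouse.close()
```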
METADATA
• Metadata means "data about data". Metadata is defined as the data
providing information about one or more aspects of the data; it is used to
summarize basic information about data that can make tracking and working
with specific data easier. Some examples include: Means of creation of the
data.
• There are three main types of metadata: operational metadata, extraction and transformation metadata, and end-user metadata.
3 TYPES OF SCHEMA USED IN DATA WAREHOUSES
• Star Schemas-The star schema in a data warehouse is historically one of the most
straightforward designs. This schema follows some distinct design parameters, such as only
permitting one central table and a handful of single-dimension tables joined to the table.
• Characteristics of the Star Schema:
• Star data warehouse schemas create a denormalized database that enables quick query responses
• The primary key in a dimension table is joined to the fact table by a foreign key
• Each dimension in the star schema maps to one dimension table
• Dimension tables within a star schema are not connected to each other directly
• The star schema creates denormalized dimension tables
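• A small pandas sketch of a star schema: one fact table holding the dimension tables' primary keys as foreign keys, joined and aggregated for a report. The table and column names are illustrative assumptions.

```python
import pandas as pd

# Dimension tables: denormalized, one table per dimension (illustrative data).
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["Phone", "Laptop"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Delhi", "Mumbai"]})

# Fact table: measures plus the dimension tables' primary keys as foreign keys.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "amount":     [300.0, 900.0, 350.0],
})

# A typical star-schema query: join the fact to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["city", "product"])["amount"].sum())
print(report)
```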
SNOW FLAKE SCHEMA
• Snowflake Schema -The Snowflake Schema is a data warehouse schema that encompasses
a logical arrangement of dimension tables. This data warehouse schema builds on the star
schema by adding additional sub-dimension tables that relate to first-order dimension
tables joined to the fact table.
• Characteristics of the Snowflake Schema:
• Snowflake schemas are permitted to have dimension tables joined to other dimension tables
• A snowflake schema has one fact table only
• Snowflake schemas create normalized dimension tables
• The normalized schema reduces the disk space required for running and managing this data warehouse
• Snowflake schemas offer an easier way to implement a dimension
GALAXY SCHEMAS
• The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next iteration of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy Schema uses multiple fact tables connected with shared, normalized dimension tables. A Galaxy Schema can be thought of as star schemas interlinked and completely normalized, avoiding any kind of redundancy or inconsistency in the data.
• Characteristics of the Galaxy Schema:
• The Galaxy Schema is multidimensional, making it a strong design choice for complex database systems
• The Galaxy Schema reduces redundancy to near zero as a result of normalization
• The Galaxy Schema is known for high data quality and accuracy and lends itself to effective reporting and analytics
BASIC COMPONENT OF DATAWAREHOUSE
• Fact Table: A fact table aggregates metrics, measurements, or facts about business processes. Fact tables are connected to dimension tables to form a schema architecture representing how data relates within the data warehouse. Fact tables store the primary keys of dimension tables as foreign keys within the fact table.
DIMENSION TABLE
• Dimension tables are denormalized tables used to store data attributes, or dimensions. As mentioned above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are not joined to each other directly; instead, they are related through the central fact table.
PARTITION STRATEGY –
HORIZONTAL PARTITIONING
• Partitioning is done to enhance performance and facilitate easy management
of data. Partitioning also helps in balancing the various requirements of the
system.
• In horizontal partitioning, we have to keep in mind the requirements for
manageability of the data warehouse.
• Partitioning by Time into Equal Segments
• In this partitioning strategy, the fact table is partitioned on the basis of time period. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse partitioned tables by removing the data in them.
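• A minimal sketch of horizontal partitioning by time into equal (monthly) segments, assuming pandas and an illustrative fact table; in a real warehouse these would be physical table partitions rather than in-memory groups.

```python
import pandas as pd

# Hypothetical fact table with a sale_date column (names are assumptions).
fact = pd.DataFrame({
    "sale_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount":    [100.0, 250.0, 80.0],
})

# Horizontal partitioning: split the rows into equal monthly segments.
partitions = {
    str(month): rows
    for month, rows in fact.groupby(fact["sale_date"].dt.to_period("M"))
}

# A month-to-date query only has to scan the current month's partition,
# and an old partition can be emptied and reused once its data is archived.
print(partitions["2024-01"])
```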
HORIZONTAL PARTITIONING
• Partition by Time into Different-sized Segments
• This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and larger partitions for inactive data.
WHAT IS DATA MINING?
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
Data mining techniques and tools enable enterprises to predict future trends and
make more-informed business decisions.
Its core elements include machine learning and statistical analysis, along with
data management tasks done to prepare data for analysis. The use of machine
learning algorithms and artificial intelligence (AI) tools has automated more of
the process and made it easier to mine massive data sets, such as customer
databases, transaction records and log files from web servers, mobile apps and
sensors.
DATA MINING BLOCK DIAGRAM
MACHINE LEARNING
• Machine learning (ML) is a subset of data science. ML primarily focuses on creating algorithms that can learn from given data and make predictions. Machine learning and data mining can be combined to deliver results that help make better business decisions and boost an organization's profit margins.
DBMS (DATABASE MANAGEMENT SYSTEM)
• A database management system (or DBMS) is essentially nothing more than a
computerized data-keeping system. Users of the system are given facilities to
perform several kinds of operations on such a system for either manipulation of the
data in the database or the management of the database structure itself.
• A DBMS is a database program: it is a software system that uses a standard method of cataloging, retrieving, and running queries on data. Some DBMS examples include MySQL, PostgreSQL, Microsoft Access, SQL Server, FileMaker, and Oracle.
OLAP
• OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients. OLAP implements multidimensional analysis of business information and supports complex calculations, trend analysis, and sophisticated data modeling.
OLAP
• The data cube pictorially shows how different attributes of data are arranged in the data model. [Diagram: a general data cube]
OLAP TYPES
• The example above is a 3-D cube with dimensions such as branch (A, B, C, D), item type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999). Data cube operations:
OLAP TYPES
• Roll-up: this operation aggregates data by climbing up a dimension hierarchy or removing a dimension, combining similar data attributes that share the same dimension.
• Drill-down: this operation is the reverse of roll-up. It takes aggregated information and subdivides it further for finer-granularity analysis.
• Slicing: this operation filters out the unnecessary portions. If, in a particular dimension, the user needs only one attribute value rather than everything, a slice selects that single value.
• Dicing: this operation performs a multidimensional cut: instead of cutting only one dimension, it selects a sub-cube by restricting a range of values in two or more dimensions.
• Pivot: this operation matters for viewing; it rotates the data cube to present a different view of the same data.
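• The cube operations can be illustrated with a pandas sketch over a tiny cube using the branch, item-type, and year dimensions from the example; the sales figures are made up for illustration.

```python
import pandas as pd

# A tiny cube with the dimensions used in the example: branch, item type, year.
cube = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "C"],
    "item":   ["home", "phone", "home", "computer", "phone"],
    "year":   [1997, 1998, 1997, 1999, 1998],
    "sales":  [120, 80, 200, 150, 90],
})

# Roll-up: aggregate away the item dimension (branch x year totals).
rollup = cube.groupby(["branch", "year"])["sales"].sum()

# Drill-down is the reverse: return to the finer branch x item x year detail.
drilldown = cube.groupby(["branch", "item", "year"])["sales"].sum()

# Slice: fix one dimension to a single value (year = 1997).
slice_1997 = cube[cube["year"] == 1997]

# Dice: cut a sub-cube by restricting several dimensions at once.
dice = cube[cube["branch"].isin(["A", "B"]) & cube["year"].isin([1997, 1998])]

# Pivot: rotate the view so branches become rows and years become columns.
pivot = cube.pivot_table(index="branch", columns="year", values="sales", aggfunc="sum")
print(pivot)
```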
STATISTICS
Statistics means studying, collecting, analyzing, interpreting, and organizing
data. Statistics is a science that helps to gather and analyze numerical data in
huge quantities. With the help of statistics, you can measure, control, and
communicate uncertainty. It allows you to infer proportions in a whole, derived
from a representative sample.
STAGES OF THE DATA MINING PROCESS
• There are many factors that determine the usefulness of data, such as accuracy, completeness, consistency, and timeliness. Data has quality if it satisfies its intended purpose.
• 1) Data Cleaning
• Data cleaning is the first step in data mining. It holds importance because dirty data, if used directly in mining, can cause confusion in the procedures and produce inaccurate results.
CLEANING STEPS
• This step carries out the routine cleaning work by:
• (i) Fill the missing data:
• Missing data can be filled by methods such as:
• Ignoring the tuple.
• Filling in the missing value manually.
• Using a measure of central tendency, such as the mean or median.
• Filling in the most probable value.
• (ii) Remove the noisy data: random error is called noisy data.
• Methods to remove noise include:
• Binning: binning methods are applied by sorting values into buckets or bins. Smoothing is performed by consulting the neighboring values.
CLEANING
• Binning can smooth by bin means, i.e. each value in a bin is replaced by the mean of the bin; by bin medians, where each bin value is replaced by the bin median; or by bin boundaries, i.e. the minimum and maximum values in the bin are the bin boundaries and each bin value is replaced by the closest boundary value.
• Identifying the outliers
• Resolving inconsistencies
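• A short Python sketch of the cleaning steps above, filling a missing value with the median and smoothing by bin means; pandas and the price values are illustrative assumptions.

```python
import pandas as pd

prices = pd.Series([4, 8, None, 15, 21, 21, 24, 25, 28, 34])

# (i) Fill the missing value with a measure of central tendency (the median here).
prices = prices.fillna(prices.median())

# (ii) Binning: sort the values into (approximately) equal-depth bins, then
# smooth each value by replacing it with its bin mean (smoothing by bin means).
sorted_prices = prices.sort_values().reset_index(drop=True)
bins = pd.qcut(sorted_prices, q=3, labels=False)        # 3 quantile bins
smoothed = sorted_prices.groupby(bins).transform("mean")
print(pd.DataFrame({"bin": bins, "value": sorted_prices, "smoothed": smoothed}))
```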
• 2) Data Integration
• When multiple heterogeneous data sources such as databases, data cubes, or files are combined for analysis, the process is called data integration. This can help improve the accuracy and speed of the data mining process.
• Different databases use different naming conventions for variables, which causes redundancies in the combined data. Additional data cleaning can be performed to remove these redundancies and inconsistencies without affecting the reliability of the data.
• Data integration can be performed using data migration tools such as Oracle Data Service Integrator and Microsoft SQL, etc.
• 3) Data Transformation
• In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process is more efficient and the patterns are easier to understand. Data transformation involves data mapping and code generation.
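• As one concrete illustration of a transformation rule (not one named on the slide), the sketch below applies min-max normalisation so that an attribute's values fall in a single standard range.

```python
import pandas as pd

incomes = pd.Series([12_000, 35_000, 58_000, 99_000])

# Min-max normalisation maps every value into the range [0, 1] so that
# attributes measured on different scales become directly comparable.
normalised = (incomes - incomes.min()) / (incomes.max() - incomes.min())
print(normalised)
```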
• 4) Data Mining
• Data mining is the process of identifying interesting patterns and knowledge from a large amount of data. In this step, intelligent methods are applied to extract data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques.
• 5) Pattern Evaluation
• This step involves identifying the interesting patterns that represent knowledge, based on interestingness measures. Data summarization and visualization methods are used to make the data understandable to the user.
• 6) Knowledge Representation
• Knowledge representation is the step where data visualization and knowledge representation tools are used to present the mined data. Data is visualized in the form of reports, tables, etc.
KDD PROCESS IN DATA MINING
• KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and
potentially valuable information from large datasets. The KDD process in data mining typically involves the following
steps:
• Selection: Select a relevant subset of the data for analysis.
• Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such as data
normalization, missing value handling, and data integration.
• Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.
• Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and insights. This
may include tasks such as clustering, classification, association rule mining, and anomaly detection.
• Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as visualizing the
results, evaluating the quality of the discovered patterns and identifying relationships and associations among the data.
• Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and meaningful.
• Deployment: Use the discovered knowledge to solve the business problem and make decisions.
• The KDD process is an iterative process and it requires multiple iterations of the above steps to extract accurate
knowledge from the data.
KDD DIAGRAM
DATA MINING APPLICATIONS
• Scientific Analysis: Scientific simulations generate huge volumes of data every day. This includes data collected from nuclear laboratories, data about human psychology, etc.
• Intrusion Detection: A network intrusion refers to any unauthorized activity on
a digital network. Network intrusions often involve stealing
valuable network resources.
• Business Transactions: Every business transaction is recorded for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations.
• Market Basket Analysis: Market basket analysis is a technique based on the careful study of the purchases made by a customer in a supermarket. It identifies patterns of items that customers frequently purchase together.
• Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method.
• Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of data with precision in the research area.
• Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and determine which promotional activities will have the greatest effect in the upcoming months. In the insurance sector, data mining can help predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior.
• Transportation: A diversified transportation company with a large direct
sales force can apply data mining to identify the best prospects for its
services.
• Financial/Banking Sector: A credit card company can leverage its vast
warehouse of customer transaction data to identify customers most likely to be
interested in a new credit product.
DATA MINING TECHNIQUES
TECHNIQUE DESCRIPTION
• 1. Association
• Association analysis is the finding of association rules showing attribute-value conditions that occur frequently together in a given set of data; a minimal sketch follows.
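• A minimal pure-Python sketch of association analysis, computing support and confidence for item pairs in hypothetical market-basket transactions (the items are illustrative).

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions (illustrative items only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

# Count how often each item pair occurs together across all transactions.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for (a, b), together in pair_counts.items():
    support = together / n
    confidence = together / sum(a in basket for basket in transactions)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```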
• 2. Classification
• Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
• Data mining has different types of classifiers:
• Decision Tree
• SVM (Support Vector Machine)
• Generalized Linear Models
• Bayesian classification
DATA MINING HAS DIFFERENT TYPES OF CLASSIFIERS
• A decision tree is a flow-chart-like tree structure, where each node represents
a test on an attribute value, each branch denotes an outcome of a test, and
tree leaves represent classes or class distributions.
• Support Vector Machines (SVMs) are a type of supervised learning algorithm
that can be used for classification or regression tasks. The main idea behind
SVMs is to find a hyperplane that maximally separates the different classes in
the training data.
• Bayesian Classification: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class.
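• A short sketch of the decision tree and SVM classifiers described above; it assumes scikit-learn (not named in the slides) and uses the iris dataset as a stand-in for any class-labelled table.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Labelled training data (iris stands in for any class-labelled table).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision tree: learns a flow-chart of attribute tests whose leaves are classes.
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# SVM: finds a hyperplane that maximally separates the classes.
svm = SVC(kernel="rbf").fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("svm accuracy: ", svm.score(X_test, y_test))
```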
TECHNIQUES
• 3. Prediction
• Data prediction is a two-step process, similar to data classification. For prediction, however, we do not use the term "class label attribute," because the attribute whose values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
• 4. Clustering
• Unlike classification and prediction, which analyze class-labeled data objects or
attributes, clustering analyzes data objects without consulting an identified class
label. In general, the class labels do not exist in the training data simply
because they are not known to begin with.
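• A minimal clustering sketch, assuming scikit-learn's k-means; the points are illustrative and carry no class labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data objects: no class labels are consulted, only the objects themselves.
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [8.0, 8.2], [8.1, 7.9], [7.8, 8.1]])

# k-means groups the objects so that members of a cluster are close to each other.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster id assigned to each point
print(model.cluster_centers_)  # centre of each discovered cluster
```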
TECHNIQUES
• 5. Regression
• Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations.
• 6. Artificial Neural Network (ANN) Classifier Method
• An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational model inspired by biological neural networks. It consists of an interconnected collection of artificial neurons.
• 7. Outlier Detection
• A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. The investigation of outlier data is known as outlier mining.
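• A small sketch of outlier mining using a z-score rule; the threshold of 2 standard deviations and the values are illustrative assumptions.

```python
import numpy as np

values = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 42.0])

# Flag objects that deviate from the general behaviour of the data:
# anything more than 2 standard deviations from the mean is treated as an outlier.
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]
print(outliers)   # only the 42.0 reading is flagged
```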
• 8. Genetic Algorithm
• Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, using historical data to direct the search toward regions of better performance in the solution space.
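• A toy genetic-algorithm sketch (selection, crossover, mutation) maximizing a simple bit-counting fitness function; the population size, mutation rate, and fitness are illustrative choices, not from the slides.

```python
import random

random.seed(0)

# Toy fitness: count of 1-bits (the "OneMax" problem); real uses plug in their own.
def fitness(bits):
    return sum(bits)

def crossover(a, b):
    # Single-point crossover combines genetic material from two parents.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(bits, rate=0.05):
    # Random mutation keeps the search from stagnating.
    return [1 - b if random.random() < rate else b for b in bits]

# Initial random population of candidate solutions.
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]

for generation in range(50):
    # Selection: historical data (fitness) directs the search toward better regions.
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # Breed the next generation from the fittest individuals.
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(20)
    ]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "out of", len(best))
```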
