2. CONTENT
• Data warehousing components
• Need for data warehousing
• Basic elements of data warehousing
• Data mart
• Data extraction
• Cleanup
• Transformation tools
• Meta data
• Star, snowflake, and galaxy schemas
• Multidimensional databases
• Facts and dimension data
• Partitioning strategy - Horizontal and vertical partitioning
3. DATA WAREHOUSING COMPONENTS
• Data - Data is a collection of facts and features in a particular format.
• Information - Information is processed data.
• Database - A database stores different kinds of data in a particular format.
• Data warehouse - A data warehouse is a collection of different kinds of databases in a particular format.
• Mining - Mining is the extraction of useful information from the data warehouse.
4. DATA WAREHOUSE OBJECTIVE
• Data warehouse - A data warehouse can be viewed as a database for historical data collected from different functions within a company.
[Diagram: the Delhi, Mumbai, Pune, and Indore branches feed their data into a central data warehouse.]
5. BUILDING A DATA WAREHOUSE
• Subject oriented - A data warehouse typically provides information on a topic (such as customer, supplier, product, or sales) rather than on company operations.
• Time variant - Time-variant keys (e.g., for the date, month, or time) are typically present, so data is identified with a particular time period.
• Integrated - A data warehouse combines data from various sources. These may include a cloud, relational databases, flat files, structured and semi-structured data, metadata, and master data.
6. BUILDING A DATA WAREHOUSE
• Non volatile - Prior data isn't deleted when new data is added. Historical data is preserved for comparisons, trends, and analytics.
• Scalability - Performance does not degrade as the load on the system increases.
8. NEED FOR DATA WAREHOUSING
• Ensure consistency. Data warehouses are programmed to apply a uniform
format to all collected data, which makes it easier for corporate decision-
makers to analyze and share data insights with their colleagues around the
globe.
• Data warehousing improves the speed and efficiency of accessing different
data sets and makes it easier for corporate decision-makers to derive insights
that will guide the business and marketing strategies that set them apart from
their competitors.
9. NEED FOR DATA WAREHOUSING
• Improve the bottom line - Data warehousing allows executives to see where they can adjust their strategy to decrease costs, maximize efficiency, and increase sales to improve their bottom line.
10. BASIC ELEMENTS OF DATA WAREHOUSING
• A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools,
metadata, and access tools. All of these components are engineered for speed so that you can get results
quickly and analyze data on the fly.
• 1. Operational Source –
• An operational source is a data source consisting of operational data and external data.
• Data can come from relational DBMSs such as Informix or Oracle.
• 2. Load Manager –
• The load manager performs all operations associated with extracting data and loading it into the data warehouse.
• These tasks include simple transformations of the data to prepare it for entry into the warehouse.
• 3. Warehouse Manager –
• The warehouse manager is responsible for the warehouse management process.
• The operations performed by the warehouse manager include the analysis, aggregation, backup and collection of data, and de-normalization of the data.
11. ELEMENTS
• 4. Query Manager –
• Query Manager performs all the tasks associated with the management of user queries.
• The complexity of the query manager is determined by the end-user access tools and the features provided by the database.
• 5. Detailed Data –
• It is used to store all the detailed data in the database schema.
• Detailed data is loaded into the data warehouse to complement the data collected.
• 6. Summarized Data –
• Summarized Data is a part of the data warehouse that stores predefined aggregations
• These aggregations are generated by the warehouse manager.
12. ELEMENTS
• 7. Archive and Backup Data –
• The Detailed and Summarized Data are stored for the purpose of archiving and backup.
• The data is relocated to storage archives such as magnetic tapes or optical disks.
• 8. Metadata –
• Metadata is basically data about data.
• It is used in the extraction and loading process, the warehouse management process, and the query management process.
• 9. End User Access Tools –
• End-user access tools consist of analysis, reporting, and mining tools.
• By using end-user access tools, users can interact with the warehouse.
14. DATA MART
• A data mart is a subset of a data warehouse focused on a particular line of
business, department, or subject area. Data marts make specific data
available to a defined group of users, which allows those users to quickly
access critical insights without wasting time searching through an entire data
warehouse.
15. DATA EXTRACTION
• Extraction is the operation of extracting data from a source system for
further use in a data warehouse environment. This is the first step of the ETL
process. After the extraction, this data can be transformed and loaded into
the data warehouse.
17. ETL PROCESS
• ETL is a process in Data Warehousing and it stands
for Extract, Transform and Load. It is a process in which an ETL tool extracts
the data from various data source systems, transforms it in the staging area,
and then finally, loads it into the Data Warehouse system.
18. ETL
• Extraction: The first step of the ETL process is extraction. In this step, data from various source systems, which can be in formats such as relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from the various source systems and store it in the staging area first, rather than directly in the data warehouse, because the extracted data is in various formats and can also be corrupted.
• Transformation: The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format.
19. ETL LOAD PROCESS
• Loading:
The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the warehouse very frequently, and sometimes it is loaded at longer but regular intervals.
• In a block diagram of the ETL pipeline, data flows from the various source systems into the staging area, where it is transformed, and then into the data warehouse.
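• As a rough illustration of the three ETL stages described above, here is a minimal Python sketch. The source rows, field names, and cleaning rules are invented for the example; a real pipeline would use dedicated ETL tooling and a proper staging area.
```python
# Minimal ETL sketch: extract rows from two hypothetical sources, hold them in a
# staging list, transform them into one standard format, then load them into a
# warehouse table (represented here as a plain Python list).

def extract():
    # Extract: pull rows from different source formats (all values are made up).
    relational_rows = [{"id": 1, "amount": "120.50", "city": "delhi"}]
    flat_file_rows = [{"id": 2, "amount": "80", "city": " Mumbai "}]
    return relational_rows + flat_file_rows          # the staging area

def transform(staged_rows):
    # Transform: apply a single standard format to every record.
    cleaned = []
    for row in staged_rows:
        cleaned.append({
            "id": row["id"],
            "amount": float(row["amount"]),          # unify the numeric type
            "city": row["city"].strip().title(),     # unify the text formatting
        })
    return cleaned

def load(rows, warehouse):
    # Load: append the transformed rows to the warehouse table.
    warehouse.extend(rows)

warehouse_sales = []
load(transform(extract()), warehouse_sales)
print(warehouse_sales)
```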
20. METADATA
• Metadata means "data about data". Metadata is defined as the data
providing information about one or more aspects of the data; it is used to
summarize basic information about data that can make tracking and working
with specific data easier. Some examples include: Means of creation of the
data.
• There are three main types of metadata: operational metadata, extraction and transformation metadata, and end-user metadata.
21. 3 TYPES OF SCHEMA USED IN DATA WAREHOUSES
• Star Schema - The star schema in a data warehouse is historically one of the most straightforward designs. This schema follows some distinct design parameters, such as only permitting one central fact table and a handful of single-dimension tables joined to it.
• Characteristics of the Star Schema:
• Star data warehouse schemas create a denormalized database that enables quick
querying responses
• The primary key in the dimension table is joined to the fact table by the foreign key
• Each dimension in the star schema maps to one dimension table
• Dimension tables within a star schema are not connected to each other directly
• Star schema creates denormalized dimension tables
23. SNOW FLAKE SCHEMA
• Snowflake Schema -The Snowflake Schema is a data warehouse schema that encompasses
a logical arrangement of dimension tables. This data warehouse schema builds on the star
schema by adding additional sub-dimension tables that relate to first-order dimension
tables joined to the fact table.
• Characteristics of the Snowflake Schema:
• Snowflake schemas are permitted to have dimension tables joined to other dimension tables
• A snowflake schema has only one fact table
• Snowflake schemas create normalized dimension tables
• The normalized schema reduces the disk space required for running and managing this data warehouse
• Snowflake schemas offer an easier way to implement a dimension
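• To make the contrast with the star schema concrete, the minimal Python sketch below normalizes a product dimension into a separate category sub-dimension table, so a product's category is reached through an extra join. All table and column names are hypothetical.
```python
# Snowflake-style normalization sketch: the product dimension no longer stores
# the category name directly; it stores a foreign key into a category table.

dim_category = [
    {"category_id": 1, "category_name": "Home"},
    {"category_id": 2, "category_name": "Computer"},
]

dim_product = [
    {"product_id": 10, "product_name": "Lamp",   "category_id": 1},
    {"product_id": 11, "product_name": "Laptop", "category_id": 2},
]

# Resolving a product's category now requires an extra join (lookup),
# but the category name is stored only once, reducing redundancy.
categories = {c["category_id"]: c["category_name"] for c in dim_category}
for product in dim_product:
    print(product["product_name"], "->", categories[product["category_id"]])
```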
25. GALAXY SCHEMAS
• The Galaxy Data Warehouse Schema, also known as a Fact Constellation Schema, acts as the next
iteration of the data warehouse schema. Unlike the Star Schema and Snowflake Schema, the Galaxy
Schema uses multiple fact tables connected with shared normalized dimension tables. A Galaxy Schema can be thought of as interlinked star schemas that are completely normalized, avoiding any kind of redundancy or inconsistency of data.
• Characteristics of the Galaxy Schema:
• Galaxy Schema is multidimensional acting as a strong design consideration for complex database
systems
• Galaxy Schema reduces redundancy to nearly zero as a result of normalization
• Galaxy Schema is known for high data quality and accuracy and lends itself to effective reporting and analytics
27. BASIC COMPONENTS OF A DATA WAREHOUSE
• Fact Table - A fact table aggregates metrics, measurements, or facts about business processes. Fact tables are connected to dimension tables to form a schema architecture representing how data relates within the data warehouse. Fact tables store the primary keys of dimension tables as foreign keys within the fact table.
28. DIMENSION TABLE
• Dimension tables are tables used to store descriptive data attributes, or
dimensions. As mentioned above, the primary key of a dimension table is
stored as a foreign key in the fact table. Dimension tables are not joined
together. Instead, they are joined via association through the central fact
table.
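• A small pandas sketch of the relationship just described: the fact table holds the measures plus the dimensions' primary keys as foreign keys, and the dimension tables are related to each other only through the fact table. The data and column names are invented.
```python
import pandas as pd

# Dimension tables hold descriptive attributes.
dim_branch = pd.DataFrame({"branch_id": [1, 2], "branch": ["Delhi", "Mumbai"]})
dim_item = pd.DataFrame({"item_id": [10, 11], "item_type": ["Phone", "Computer"]})

# The fact table stores measures plus the dimensions' primary keys as foreign keys.
fact_sales = pd.DataFrame({
    "branch_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "amount": [200.0, 900.0, 150.0],
})

# Dimensions are joined to each other only via the central fact table.
report = (fact_sales
          .merge(dim_branch, on="branch_id")
          .merge(dim_item, on="item_id")
          .groupby(["branch", "item_type"])["amount"]
          .sum())
print(report)
```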
29. PARTITION STRATEGY –
HORIZONTAL PARTITIONING
• Partitioning is done to enhance performance and facilitate easy management
of data. Partitioning also helps in balancing the various requirements of the
system.
• In horizontal partitioning, we have to keep in mind the requirements for
manageability of the data warehouse.
• Partitioning by Time into Equal Segments
• In this partitioning strategy, the fact table is partitioned on the basis of time period. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.
30. HORIZONTAL PARTITIONING
• Partition by Time into Different-sized Segments
• This kind of partitioning is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and a larger partition for inactive data.
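• A minimal sketch of horizontal partitioning by time in Python, assuming each fact row carries a date column; a real warehouse would rely on the database's native partitioning features rather than application code.
```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows, each carrying a transaction date.
fact_rows = [
    {"sale_date": date(2023, 1, 5), "amount": 100},
    {"sale_date": date(2023, 1, 20), "amount": 250},
    {"sale_date": date(2023, 2, 3), "amount": 80},
]

# Horizontal partitioning: every partition keeps the same columns,
# but holds only the rows belonging to one monthly time segment.
partitions = defaultdict(list)
for row in fact_rows:
    segment = (row["sale_date"].year, row["sale_date"].month)
    partitions[segment].append(row)

for segment, rows in sorted(partitions.items()):
    print(segment, "->", len(rows), "rows")
```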
31. WHAT IS DATAMINING?
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
Data mining techniques and tools enable enterprises to predict future trends and
make more-informed business decisions.
Its core elements include machine learning and statistical analysis, along with
data management tasks done to prepare data for analysis. The use of machine
learning algorithms and artificial intelligence (AI) tools has automated more of
the process and made it easier to mine massive data sets, such as customer
databases, transaction records and log files from web servers, mobile apps and
sensors.
33. MACHINE LEARNING
• Machine learning (ML), on the other hand, is a subset of data science. ML
primarily focuses on creating algorithms that can learn and predict from given
data. Machine learning and data mining can be combined to deliver results
that can help make better business decisions and boost the profit margins
of an organization.
34. DBMS (DATABASE MANAGEMENT SYSTEM)
• A database management system (or DBMS) is essentially nothing more than a
computerized data-keeping system. Users of the system are given facilities to
perform several kinds of operations on such a system for either manipulation of the
data in the database or the management of the database structure itself.
• A DBMS is a database program: it is a software system that uses a standard method of cataloging, retrieving, and running queries on data. Some DBMS examples include MySQL, PostgreSQL, Microsoft Access, SQL Server, FileMaker, and Oracle.
35. OLAP
• OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients. OLAP implements multidimensional analysis of business information and supports the capability for complex calculations, trend analysis, and sophisticated data modeling.
36. OLAP
• A data cube pictorially shows how the different attributes of the data are arranged in the data model.
37. OLAP TYPES
• For example, a 3D cube might have dimensions such as branch (A, B, C, D), item type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999).
Data cube operations:
38. OLAP TYPES
• Roll-up: this operation aggregates similar data attributes within the same dimension together, moving up to a coarser level of the dimension hierarchy.
• Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular information and then subdivide it further for finer-granularity analysis.
• Slicing: this operation filters out the unnecessary portions. Suppose in a particular dimension the user doesn't need everything for analysis, but rather only a particular attribute value; slicing selects that subset.
39. • Dicing: this operation does multidimensional cutting; it not only cuts one dimension but can also select a certain range of values across other dimensions.
• Pivot: this operation is very important from a viewing point of view. It basically rotates the data cube to provide an alternative view of the data.
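• The pandas sketch below imitates the cube operations above on a small flat table with branch, item type, and year dimensions; the values are made up, loosely following the cube described earlier.
```python
import pandas as pd

# A flat representation of a small data cube: three dimensions plus a measure.
cube = pd.DataFrame({
    "branch": ["A", "A", "B", "B"],
    "item_type": ["phone", "computer", "phone", "computer"],
    "year": [1997, 1998, 1997, 1998],
    "sales": [100, 200, 150, 300],
})

# Roll-up: aggregate away the item_type dimension.
rollup = cube.groupby(["branch", "year"])["sales"].sum()

# Slice: fix one dimension to a single value.
slice_1997 = cube[cube["year"] == 1997]

# Dice: restrict values on several dimensions at once.
dice = cube[(cube["branch"] == "A") & (cube["year"].isin([1997, 1998]))]

# Pivot: rotate the view so branches become rows and years become columns.
pivot = cube.pivot_table(index="branch", columns="year",
                         values="sales", aggfunc="sum")

print(rollup, slice_1997, dice, pivot, sep="\n\n")
```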
40. STATISTICS
Statistics means studying, collecting, analyzing, interpreting, and organizing
data. Statistics is a science that helps to gather and analyze numerical data in
huge quantities. With the help of statistics, you can measure, control, and
communicate uncertainty. It allows you to infer proportions in a whole, derived
from a representative sample.
41. STAGES OF THE DATAMINING PROCESS
• There are many factors that determine the usefulness of data, such as accuracy, completeness, consistency, and timeliness. The data is said to have quality if it satisfies the intended purpose.
• 1) Data Cleaning
• Data cleaning is the first step in data mining. It holds importance because dirty data, if used directly in mining, can cause confusion in procedures and produce inaccurate results.
42. CLEANING STEPS
• This step carries out the routine cleaning work by:
• (i) Filling the missing data:
• Missing data can be filled by methods such as:
• Ignoring the tuple.
• Filling in the missing value manually.
• Using a measure of central tendency, such as the mean or median.
• Filling in the most probable value.
• (ii) Removing the noisy data: Random error in the data is called noisy data.
• Methods to remove noise are:
• Binning: Binning methods are applied by sorting values into buckets or bins. Smoothing is then performed by consulting the neighboring values.
43. CLEANING
• Binning is done by smoothing by bin means, i.e., each value in a bin is replaced by the mean of the bin; smoothing by bin medians, where each bin value is replaced by the bin median; or smoothing by bin boundaries, i.e., the minimum and maximum values in the bin are the bin boundaries and each bin value is replaced by the closest boundary value.
• Identifying the outliers
• Resolving inconsistencies
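• As a rough sketch of the cleaning steps listed above (filling missing values with a measure of central tendency, then smoothing noisy values by bin means), consider the following Python example with invented values.
```python
import statistics

values = [4, None, 8, 21, 24, 25, 28, 34]   # None marks a missing value

# (i) Fill the missing data with a measure of central tendency (here, the median).
known = [v for v in values if v is not None]
median = statistics.median(known)
filled = [median if v is None else v for v in values]

# (ii) Smooth noisy data by binning: sort the values, split them into
#      equal-size bins, and replace every value in a bin by the bin mean.
filled.sort()
bin_size = 4
smoothed = []
for start in range(0, len(filled), bin_size):
    bin_values = filled[start:start + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))

print(filled)
print(smoothed)
```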
44. • 2) Data Integration
• When multiple heterogeneous data sources such as databases, data cubes, or files are combined for analysis, this process is called data integration. This can help in improving the accuracy and speed of the data mining process.
• Different databases have different naming conventions for variables, which can cause redundancies in the integrated data. Additional data cleaning can be performed to remove these redundancies and inconsistencies without affecting the reliability of the data.
• Data integration can be performed using data migration tools such as Oracle Data Service Integrator and Microsoft SQL Server Integration Services.
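• A minimal pandas sketch of data integration: two hypothetical sources name the same variable differently, so the columns are reconciled before the sources are combined and the resulting redundancy is removed. The table and column names are invented.
```python
import pandas as pd

# Two heterogeneous sources that use different naming conventions.
crm = pd.DataFrame({"cust_id": [1, 2], "city": ["Delhi", "Pune"]})
billing = pd.DataFrame({"customer_id": [2, 3], "city": ["Pune", "Indore"]})

# Reconcile the variable names, combine the sources, and remove the
# redundancy that the overlap between them introduces.
billing = billing.rename(columns={"customer_id": "cust_id"})
integrated = (pd.concat([crm, billing], ignore_index=True)
              .drop_duplicates(subset="cust_id"))
print(integrated)
```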
45. • 4) Data Transformation
• In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process is more efficient and the patterns are easier to understand. Data transformation involves data mapping and code generation processes.
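• One common transformation is normalization, e.g. min-max scaling of a numeric attribute into the range [0, 1]; the short hand-rolled sketch below uses made-up income values.
```python
# Min-max normalization: map each value into the range [0, 1] so that
# attributes measured on very different scales become comparable for mining.
incomes = [12000, 35000, 58000, 98000]   # made-up values

lo, hi = min(incomes), max(incomes)
normalized = [(x - lo) / (hi - lo) for x in incomes]
print(normalized)
```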
46. • 5) Data Mining
• Data mining is a process to identify interesting patterns and knowledge from a large amount of data. In this step, intelligent methods are applied to extract the data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques.
47. • 6) Pattern Evaluation
• This step involves identifying interesting patterns representing the knowledge, based on interestingness measures. Data summarization and visualization methods are used to make the data understandable by the user.
• 7) Knowledge Representation
• Knowledge representation is a step where data visualization and knowledge representation tools are used to represent the mined data. Data is visualized in the form of reports, tables, etc.
48. KDD PROCESS IN DATA MINING
• KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and
potentially valuable information from large datasets. The KDD process in data mining typically involves the following
steps:
• Selection: Select a relevant subset of the data for analysis.
• Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such as data
normalization, missing value handling, and data integration.
• Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.
• Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and insights. This
may include tasks such as clustering, classification, association rule mining, and anomaly detection.
• Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as visualizing the
results, evaluating the quality of the discovered patterns and identifying relationships and associations among the data.
• Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and meaningful.
• Deployment: Use the discovered knowledge to solve the business problem and make decisions.
• The KDD process is an iterative process and it requires multiple iterations of the above steps to extract accurate
knowledge from the data.
51. DATAMINING APPLICATION
• Scientific Analysis: Scientific simulations generate large volumes of data every day. This includes data collected from nuclear laboratories, data about human psychology, etc.
• Intrusion Detection: A network intrusion refers to any unauthorized activity on
a digital network. Network intrusions often involve stealing
valuable network resources.
• Business Transactions: Every business transaction is recorded and kept for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations.
52. • Market Basket Analysis: Market Basket Analysis is a technique that carefully studies the purchases made by a customer in a supermarket. This concept identifies the pattern of items that customers frequently purchase together.
• Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method.
• Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of data in research applications.
• Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and figure out which marketing activities will have the best effect in the upcoming months, whereas in the insurance sector, data mining can help predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers.
53. • Transportation: A diversified transportation company with a large direct
sales force can apply data mining to identify the best prospects for its
services.
• Financial/Banking Sector: A credit card company can leverage its vast
warehouse of customer transaction data to identify customers most likely to be
interested in a new credit product.
55. TECHNIQUE DESCRIPTION
• 1. Association
• Association analysis is the finding of association rules showing attribute-value conditions
that occur frequently together in a given set of data.
• 2. Classification
• Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
• Data mining uses different types of classifiers:
• Decision Tree
• SVM (Support Vector Machine)
• Generalized Linear Models
• Bayesian classification
56. DATA MINING HAS DIFFERENT TYPES OF CLASSIFIERS
• A decision tree is a flow-chart-like tree structure, where each node represents
a test on an attribute value, each branch denotes an outcome of a test, and
tree leaves represent classes or class distributions.
• Support Vector Machines (SVMs) are a type of supervised learning algorithm
that can be used for classification or regression tasks. The main idea behind
SVMs is to find a hyperplane that maximally separates the different classes in
the training data.
• Bayesian Classification: Bayesian classifier is a statistical classifier. They can
predict class membership probabilities, for instance, the probability that a
given sample belongs to a particular class.
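• A brief scikit-learn sketch of two of the classifiers listed above, a decision tree and a (naive) Bayesian classifier, trained on a tiny made-up dataset; it assumes scikit-learn is installed and is only meant to show the general workflow.
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, income] -> whether the customer bought the product.
X = [[25, 30000], [35, 60000], [45, 80000], [22, 20000], [52, 90000]]
y = ["no", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier().fit(X, y)
bayes = GaussianNB().fit(X, y)

sample = [[30, 55000]]
print(tree.predict(sample))         # predicted class label
print(bayes.predict_proba(sample))  # class membership probabilities
```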
57. TECHNIQUES
• 3. Prediction
• Data prediction is a two-step process, similar to that of data classification. However, for prediction, we do not use the phrase "class label attribute" because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
• 4. Clustering
• Unlike classification and prediction, which analyze class-labeled data objects or
attributes, clustering analyzes data objects without consulting an identified class
label. In general, the class labels do not exist in the training data simply
because they are not known to begin with.
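• A minimal k-means clustering sketch with scikit-learn; note that, unlike the classification example earlier, no class labels are supplied. The data values are invented.
```python
from sklearn.cluster import KMeans

# Unlabeled data objects: [annual spend, number of store visits].
X = [[200, 4], [220, 5], [800, 20], [790, 22], [210, 6], [805, 19]]

# Group the objects into 2 clusters without supplying any class labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each object
print(kmeans.cluster_centers_)  # centroid of each cluster
```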
58. TECHNIQUES
• 5. Regression
• Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations.
• 6. Artificial Neural network (ANN) Classifier Method
• An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational model based on biological neural networks. It consists of an interconnected collection of artificial neurons.
• 7. Outlier Detection
• A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. The investigation of outlier data is known as outlier mining.
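• A simple outlier-mining sketch using the common z-score rule (values far from the mean, measured in standard deviations, are flagged); the threshold and data values are only illustrative.
```python
import statistics

values = [10, 12, 11, 13, 12, 11, 95]   # 95 does not follow the general behavior

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values lying more than 2 standard deviations away from the mean.
outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)
```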
59. • 8. Genetic Algorithm
• Genetic algorithms are adaptive heuristic search algorithms that belong to the
larger part of evolutionary algorithms. Genetic algorithms are based on the
ideas of natural selection and genetics. These are intelligent exploitation of
random search provided with historical data to direct the search into the
region of better performance in solution space.
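• A compact genetic-algorithm sketch in Python: a population of bit strings evolves through selection, crossover, and mutation to maximize a toy fitness function (the number of 1-bits). All parameters are illustrative and not tied to any particular data mining task.
```python
import random

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 40, 0.02

def fitness(genome):
    # Toy objective: maximize the number of 1-bits (the "one-max" problem).
    return sum(genome)

def select(population):
    # Tournament selection: the fitter of two random individuals is chosen.
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent1, parent2):
    # Single-point crossover combines genetic material from both parents.
    point = random.randint(1, GENOME_LEN - 1)
    return parent1[:point] + parent2[point:]

def mutate(genome):
    # Each bit flips with a small probability (random exploration).
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(fitness(best), best)
```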