Pre processing
• Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Real-world data is often incomplete,
inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain
many errors.
• Data preprocessing is a proven method of
resolving such issues.
• Data preprocessing prepares raw data for
further processing.
• Data preprocessing is used in database-driven
applications such as customer relationship
management and in learning-based applications
(like neural networks).
Data preprocessing techniques
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Preprocessing Techniques
• Data cleaning: can be applied to remove
noise and correct inconsistencies in the data.
• Data integration: merges data from multiple
sources into a coherent data store, such as a
data warehouse.
• Data transformation: operations such as
normalization may be applied.
• Data reduction: can reduce the data size by
aggregating, eliminating redundant features,
or clustering, for instance.
• Data cleaning routines work to “clean” the data
by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• If users believe the data are dirty, they are
unlikely to trust the results of any data mining
that has been applied to them.
• Although most mining routines have some
procedures for dealing with incomplete or noisy
data, they are not always robust.
• Therefore, a useful preprocessing step is to
run your data through some data cleaning
routines.
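The cleaning routines named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are marked with None and filled with the mean, smoothing uses equal-size bin means, and outliers are flagged with the common 1.5 × IQR rule. The function names and sample values are hypothetical and do not come from the slides.

```python
from statistics import mean, quantiles


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sample data
ages = fill_missing([10, None, 20, None, 30])
prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
suspicious = find_outliers([10, 12, 11, 13, 12, 95])
```

Mean imputation and bin means are only one choice among many; a median fill or a regression-based fill may suit skewed data better.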
• Suppose you want to include data from
multiple sources in your analysis.
• This would involve integrating multiple
databases, data cubes, or files, that is, data
integration.
• Yet some attributes representing a given
concept may have different names in different
databases, causing inconsistencies and
redundancies.
• Having a large amount of redundant data may
slow down or confuse the knowledge
discovery process.
• Clearly, in addition to data cleaning, steps
must be taken to help avoid redundancies
during data integration.
• Typically, data cleaning and data integration
are performed as a preprocessing step when
preparing the data for a data warehouse.
• Additional data cleaning can be performed to
detect and remove redundancies that may
have resulted from data integration.
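A minimal sketch of the integration problem described above, assuming two hypothetical sources that name the same key differently (customer_id in one, cust_id in the other). Each source is mapped onto a shared schema before merging, so the same customer is not stored twice. All field names and records are invented for illustration.

```python
def rename_fields(record, mapping):
    """Return a copy of the record with source-specific field names
    translated to the shared schema; unmapped names are kept."""
    return {mapping.get(name, name): value for name, value in record.items()}


def integrate(sources, key="customer_id"):
    """Merge records from several sources into one record per key.

    `sources` is a list of (records, mapping) pairs, where `mapping`
    translates that source's field names to the shared schema.
    """
    merged = {}
    for records, mapping in sources:
        for record in records:
            unified = rename_fields(record, mapping)
            merged.setdefault(unified[key], {}).update(unified)
    return list(merged.values())


# Hypothetical sources: a CRM export and a billing database
crm = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}]
billing = [{"cust_id": 1, "city": "Pune"}, {"cust_id": 3, "city": "Delhi"}]

customers = integrate([(crm, {}), (billing, {"cust_id": "customer_id"})])
```

When two sources disagree on the value of a shared field, this sketch lets the later source win. A real integration step would need an explicit conflict rule.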
• Getting back to your data, you have decided,
say, that you would like to use a distance-based
mining algorithm for your analysis, such
as neural networks, nearest-neighbor
classifiers, or clustering.
• Such methods provide better results if the data
to be analyzed have been normalized, that is,
scaled to a specific range such as [0.0, 1.0].
• You soon realize that data transformation
operations, such as normalization and
aggregation, are additional data preprocessing
procedures that would contribute toward the
success of the mining process.
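Scaling to a range such as [0.0, 1.0] is commonly done with min-max normalization, v' = new_min + (v - min) / (max - min) × (new_max - new_min). The sketch below assumes that technique; the slides do not say which normalization method is meant, and the function name and sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical attribute with a wide range
incomes = [12000, 73600, 98000]
scaled = min_max_normalize(incomes)
```

Without this step, an attribute with a large range (such as income) would dominate a distance computation over one with a small range (such as age).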
• Data reduction obtains a reduced
representation of the data set that is much
smaller in volume, yet produces the same (or
almost the same) analytical results.
• There are a number of strategies for data
reduction.
• These include data aggregation, attribute
subset selection, dimensionality reduction, and
numerosity reduction.
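Two of these strategies can be sketched briefly: aggregation (rolling quarterly figures up to yearly totals) and a very simple form of attribute subset selection (dropping attributes that are constant or exact copies of another attribute). The data layout and function names are assumptions made for illustration; practical attribute selection usually relies on statistical relevance measures rather than exact duplication.

```python
from collections import defaultdict


def aggregate_by_year(rows):
    """Sum (year, quarter, sales) rows into one total per year."""
    totals = defaultdict(float)
    for year, _quarter, sales in rows:
        totals[year] += sales
    return dict(totals)


def drop_redundant_attributes(table):
    """Keep only attributes that vary and do not duplicate an
    attribute already kept. `table` maps attribute name -> column."""
    kept = {}
    for name, column in table.items():
        if len(set(column)) <= 1:
            continue  # constant column carries no information
        if any(column == other for other in kept.values()):
            continue  # exact copy of an earlier column
        kept[name] = column
    return kept


# Hypothetical quarterly sales and a small attribute table
sales = [(2023, "Q1", 100.0), (2023, "Q2", 150.0), (2024, "Q1", 200.0)]
yearly = aggregate_by_year(sales)

table = {
    "price": [10, 20, 30],
    "price_copy": [10, 20, 30],
    "currency": ["USD", "USD", "USD"],
    "units": [1, 5, 2],
}
reduced = drop_redundant_attributes(table)
```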
DATA REDUCTION
• Data can also be “reduced” by generalization
with the use of concept hierarchies, where
low-level concepts, such as city for customer
location, are replaced with higher-level concepts,
such as region or province or state.
• A concept hierarchy organizes the concepts into
varying levels of abstraction.
• Data discretization is a form of data reduction
that is very useful for the automatic generation of
concept hierarchies from numerical data.
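Both ideas can be illustrated with a short sketch: generalization through a lookup table that maps each city to its state, and discretization through equal-width binning of a numeric attribute. The city-to-state table, the bin count, and the function names are hypothetical; equal-width binning is only one of several discretization methods.

```python
# Hypothetical one-level concept hierarchy: city -> state
CITY_TO_STATE = {
    "Pune": "Maharashtra",
    "Mumbai": "Maharashtra",
    "Chennai": "Tamil Nadu",
}


def generalize(values, hierarchy, default="unknown"):
    """Replace each low-level concept with its higher-level parent."""
    return [hierarchy.get(v, default) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins - 1) of the
    equal-width interval it falls into."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


states = generalize(["Pune", "Chennai", "Mumbai"], CITY_TO_STATE)
age_bins = equal_width_bins([0, 10, 20, 30, 40], n_bins=4)
```

The bin indices can then serve as the lowest level of a concept hierarchy, for example by labelling them "young", "middle-aged", and so on.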