International Association of Scientific Innovation and Research (IASIR)
(An Association Unifying the Sciences, Engineering, and Applied Research)
International Journal of Software and Web Sciences (IJSWS)
www.iasir.net
IJSWS 14-423; © 2014, IJSWS All Rights Reserved Page 32
ISSN (Print): 2279-0063
ISSN (Online): 2279-0071
Normalization of Data in Data Mining
Dr. Himani Goyal*1, Sandeep.D*2, Venu.R*3, Raghavendra Pokuri*4, Sandeep Kathula*5, Naveen Battula*6
*1 Dean, Dept. of Electronics and Communications; *2,*3,*4,*5,*6 Student
MLR Institute of Technology, Dundigal, Hyderabad-43 Telangana, India
_________________________________________________________________________________________
Abstract: In today's competitive, profit-driven world, using resources efficiently and making better decisions
through analytical data mining methods has become the backbone of every industry. This highlights the
importance of the tools and concepts of Data Mining and Warehousing, which, when applied effectively, can
transform an industry. Normalization techniques play a pivotal role in identifying patterns and maintaining
the consistency of a database.
Keywords: Data Normalization, Min-Max, Decimal Scaling, Zero-Score.
__________________________________________________________________________________________
I. Introduction
Normalization is a process of rescaling attribute values so that they fall within a small specified range. In the
relational sense, it also transforms a complex database into a simpler one: a sequence of rules is applied to test
individual relations so that the database can be normalized to any degree. That process is based on the concept
of normal forms. A relational schema may be in 1NF, 2NF, 3NF, or Boyce-Codd Normal Form. If the relational
schema is not in the required normal form, it has to be transformed into one of the desired normal forms.
Normalization can thus be used as a data transformation technique. The various data normalization techniques
are as follows:
II. Min-Max Normalization
This technique performs a linear transformation on the original data and retains the relationships among the
original data values. Assume 'P' to be an attribute of a given relational schema, and assume that the values of
'P' range from MP (the minimum) to XP (the maximum). A value 'd' of attribute P is mapped to d' in the new
range [nMP, nXP] using the equation:
$$d' = \frac{(d - \mathrm{MP})\,(\mathrm{nXP} - \mathrm{nMP})}{\mathrm{XP} - \mathrm{MP}} + \mathrm{nMP}$$
An "out-of-bounds" error occurs in a program if a later input value falls outside the original data range
[MP, XP].
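The mapping above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `min_max_normalize`, the sample values, and the choice to signal the out-of-bounds case with a `ValueError` are all assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Fit min-max scaling on `values`; return the scaled list and a transform function."""
    old_min, old_max = min(values), max(values)  # MP and XP in the text
    if old_min == old_max:
        raise ValueError("attribute is constant; min-max scaling is undefined")

    def transform(d):
        # Values outside the original range [MP, XP] trigger the out-of-bounds case.
        if not old_min <= d <= old_max:
            raise ValueError(f"out-of-bounds: {d} not in [{old_min}, {old_max}]")
        return (d - old_min) * (new_max - new_min) / (old_max - old_min) + new_min

    return [transform(d) for d in values], transform


scaled, transform = min_max_normalize([10, 20, 30, 50])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Calling `transform(60)` on this fitted range raises the error, because 60 lies outside the original range [10, 50].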
III. Zero-Score (Z-Score) Normalization
This method is generally used when the actual minimum and maximum of an attribute are unknown. It also gives
usable results when the minimum and maximum values are themselves outliers, which would otherwise dominate
min-max normalization. Values are normalized using the arithmetic mean and standard deviation of the attribute.
The value d is transformed into d' using the equation:
$$d' = \frac{d - \mathrm{PA}}{\sigma_P}$$
where PA is the arithmetic mean of attribute P and σP is the standard deviation of attribute P.
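As a rough sketch of this transformation, the following Python uses the standard library's `statistics` module. The function name and sample values are illustrative, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the text does not say which is intended.

```python
import statistics


def z_score_normalize(values):
    """Return the z-scores of `values` using their own mean and standard deviation."""
    mean = statistics.mean(values)   # PA in the text
    std = statistics.stdev(values)   # sigma_P; sample standard deviation (assumption)
    if std == 0:
        raise ValueError("attribute is constant; z-score is undefined")
    return [(d - mean) / std for d in values]


z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
# After the transformation the values have mean 0 and standard deviation 1.
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```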
IV. Normalization using Decimal Scaling
The data value of attribute P is normalized by moving the position of the decimal point. The number of places
moved depends on the maximum absolute value of P. The value d is transformed using the equation:
$$d' = \frac{d}{10^{Z}}$$
where Z is the smallest integer such that max(|d'|) < 1.
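A short Python sketch of decimal scaling follows. It assumes that Z is chosen as the smallest non-negative integer that brings every scaled value strictly below 1 in absolute value; the function name and the sample values are hypothetical.

```python
def decimal_scale(values):
    """Scale `values` by 10**z, with z the smallest integer giving max(|d'|) < 1."""
    max_abs = max(abs(d) for d in values)
    z = 0
    while max_abs / 10 ** z >= 1:
        z += 1
    return [d / 10 ** z for d in values], z


scaled, z = decimal_scale([-986, 120, 917])
print(z, scaled)  # 3 [-0.986, 0.12, 0.917]
```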
V. Elimination of Outliers
Outliers are common in real data, and their presence can distort computations. It is therefore often worthwhile
to detect them, for example from box-plots, and to refine the data by eliminating them. One legitimate reason to
remove outliers is to prevent the distortion of the central tendency of the data.
[Figure: the three normalization techniques: Min-Max Normalization, Zero-Score Normalization, Normalization Using Decimal Scaling]
Himani Goyal et al., International Journal of Software and Web Sciences, 10(1), September-November, 2014, pp. 32-33
Suppose that the data for analysis includes the attribute age. The age values for the data tuples, in increasing
order, are 13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70.
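Box-plot based outlier detection can be sketched on the age values listed above. The text does not state which rule it uses, so the common convention of flagging values more than 1.5 times the interquartile range beyond the quartiles is assumed here, as is the "inclusive" quartile method of Python's `statistics.quantiles`.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 21, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the box-plot whiskers (k * IQR rule, assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(ages)
cleaned = [v for v in ages if v not in outliers]
print(outliers)  # [70]
```

Under these assumptions only the value 70 is flagged; a different quartile convention or whisker length could change the result for other data.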
Using min-max normalization to transform the value 35 for age into the range [0.0, 1.0], we have MP (min) = 13
and XP (max) = 70, and the range is [nMP, nXP] = [0.0, 1.0]. The value 35 is transformed as
$$d' = \frac{(d - \mathrm{MP})\,(\mathrm{nXP} - \mathrm{nMP})}{\mathrm{XP} - \mathrm{MP}} + \mathrm{nMP} = \frac{(35 - 13)(1.0 - 0.0)}{70 - 13} + 0.0 = \frac{22}{57} \approx 0.386$$
Hence d' ≈ 0.386, which lies well within the target range. The arithmetic mean is PA = 29.96 and the standard
deviation is σP = 12.94 years. Using z-score normalization,
$$d' = \frac{d - \mathrm{PA}}{\sigma_P} = \frac{35 - 29.96}{12.94} = \frac{5.04}{12.94} \approx 0.389$$
For this value, the result of min-max normalization is close to the score obtained through z-score normalization.
Further, the value can be transformed using decimal scaling with Z = 2 as
$$d' = \frac{d}{10^{Z}} = \frac{35}{10^{2}} = 0.35$$
Taking the mean of the above three values gives approximately 0.375.
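The arithmetic of this worked example can be rechecked with a few lines of Python. The snippet uses the constants as stated in the text (minimum 13, maximum 70, mean 29.96, standard deviation 12.94, and two decimal places for scaling) rather than recomputing them from the data; the variable names are illustrative.

```python
d = 35

# Min-max normalization into [0.0, 1.0] with MP = 13 and XP = 70.
min_max = (d - 13) * (1.0 - 0.0) / (70 - 13) + 0.0

# Z-score normalization with the stated mean and standard deviation.
z_score = (d - 29.96) / 12.94

# Decimal scaling with Z = 2, since the largest absolute value is 70.
decimal = d / 10 ** 2

average = (min_max + z_score + decimal) / 3
print(round(min_max, 3), round(z_score, 3), decimal, round(average, 3))
# 0.386 0.389 0.35 0.375
```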
VI. Applications
Normalization is extensively used in the following applications:
(i) Neural network classification algorithms, such as back-propagation, where normalizing the input values
speeds up the learning phase.
(ii) Distance-based methods, such as k-nearest neighbor classification, where it prevents attributes with larger
ranges from outweighing attributes with smaller ranges.
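The effect described in item (ii) can be illustrated with a small Python sketch. The records, the attribute names (age and income), and the helper functions are hypothetical; the point is only that the nearest neighbor under Euclidean distance can change once each attribute is min-max scaled to [0, 1].

```python
import math

# Hypothetical records: (age in years, income in currency units).
records = [(25, 50_000), (60, 51_000), (27, 54_000), (45, 90_000)]


def min_max_columns(rows):
    """Min-max scale every attribute (column) of `rows` to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
            for row in rows]


def nearest(rows, query_index):
    """Index of the row closest to rows[query_index] in Euclidean distance."""
    q = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(q, rows[i]))


print(nearest(records, 0))                   # 1: income differences dominate
print(nearest(min_max_columns(records), 0))  # 2: both attributes now contribute
```

On the raw values, the 60-year-old with a similar income is judged closest to the 25-year-old; after scaling, the 27-year-old is.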
VII. Conclusion
Normalized relation tables do not contain repeating groups. Hence update, deletion, and insertion anomalies,
redundancy errors, and database inconsistency can be avoided. Further, simpler results can be obtained, which
helps in maintaining database integrity efficiently. Business enterprises can thus enhance their data analytics
through the predictive behavior of the normalized data.
Acknowledgments
We are deeply grateful to Prof. Kamakshi Prasad of JNTU-Hyderabad for assisting us in this work. The values
and beliefs that our professors have instilled in us have been a constant source of inspiration. The first author
extends special thanks to his parents for their insight and guidance. The unflinching support of our family
members through thick and thin has helped us reach where we are today. The second author would like to thank
the students of JNTU for their constant motivation. The third author extends his warm regards to his friends for
their ideas, which have been a constant source of enlightenment.
About the Authors
First Author: Raghavendra Pokuri is a final-year Computer Science Engineering student pursuing his Bachelor's in Technology from
JNTU-Hyderabad. His fields of interest include research on Data Mining and Data Warehousing, ad hoc sensor networks, and
C and Java programming.
Second Author: Sandeep Kathula is a final-year Computer Science Engineering student pursuing his Bachelor's in Technology from
JNTU-Hyderabad. His fields of interest include research on information retrieval systems, Data Mining and Warehousing,
SQL programming, and Web programming.
Third Author: Naveen Battula is a final-year Computer Science Engineering student pursuing his Bachelor's in Technology from
JNTU-Hyderabad. His fields of interest include research on database transactions and concurrency control, storage and indexing
algorithms, schema refinement, and relational calculus.
