NOVATEUR PUBLICATIONS
INTERNATIONAL JOURNAL OF INNOVATIONS IN ENGINEERING RESEARCH AND TECHNOLOGY [IJIERT]
ISSN: 2394-3696
VOLUME 2, ISSUE 12, DEC.-2015
COMPARISION OF PERCENTAGE ERROR BY USING
IMPUTATION METHOD ON MID TERM EXAMINATION
DATA
V.B. Kamble
P.E.S. College of Engineering,
Aurangabad. (M.S.), India
S.N. Deshmukh
Dr. Babasaheb Ambedkar Marathwada University,
Aurangabad. (M.S.) India.
ABSTRACT
The issue of incomplete data exists across the entire field of data mining. In this paper,
Mean Imputation, Median Imputation and Standard Deviation Imputation are used to
deal with the challenges of incomplete data in classification problems. Each imputation
method converts the incomplete dataset into a complete dataset. The percentage error
of each imputation method is then computed on the completed dataset, and the results
are compared to identify the most suitable method.
INTRODUCTION
Missing data are the absence of data items; they hide some information that may be
important. The presence of missing data is a general and challenging problem in the
data analysis field. Fortunately, missing data imputation techniques can be used to
improve the data quality. Missing data imputation techniques refer to any strategy that
fills in missing values of a dataset so that standard data analysis methods can be applied
to analyze the completed dataset [1].
Information quality is important to organizations. People use information attributes as a
tool for assessing information quality. Information quality is measured based on the
opinions of users as well as experts on the information attributes. The commonly known
information attributes for information quality include accuracy, objectivity,
reputation, access, security, relevancy, value added, timeliness, completeness, amount
of data, ease of understanding and consistent representation. These attributes are
also applicable to data quality. In practice, one can rarely find a data set that
contains complete entries [2].
1.1 Related Work- Methods for dealing with missing values can be classified into
three categories: 1) Case Deletion, 2) Learning Without Handling of Missing Values,
and 3) Missing Value Imputation. Case deletion simply omits the cases with
missing values and uses only the remaining instances for the learning
task. The second approach learns directly from the incomplete data without handling
the missing values.
Missing data imputation methods advocate filling in missing values before a learning
application. Missing data imputation is a procedure that replaces the missing values
with some plausible values [3].
1.2 Missing Data- Different methods have been applied in data mining to handle
missing values in a database. Records with missing values can be ignored; the missing
values can be filled with a global constant, with the attribute mean, or with the
attribute mean of the same class; or an algorithm can be applied to estimate the missing
values. A missing data imputation technique is a strategy for filling in the missing
values of a dataset so that standard methods, which require a complete data set, can be
applied for analysis. These techniques retain the data in incomplete cases and can also
impute values of correlated variables.
Missing data imputation techniques are classified into ignorable missing data
imputation methods, which include single imputation methods and multiple imputation
methods, and non-ignorable missing data imputation methods, which include
likelihood-based methods and non-likelihood-based methods. Single imputation methods
fill in one value for each missing value and are more commonly used at present than
multiple imputation, which replaces each missing value with several possible values and
better reflects the sampling variability about the actual value [4].
1.3 Patterns of Missing Values- Missing data can arise in a number of ways.
Little and Rubin introduced specific missing data terminology as a standard
framework to deal with missing data mechanisms and their effect on data analysis.
(A) Missing Completely at Random (MCAR)- Data are MCAR if the probability that a
response is missing is independent of both the observed data for that case and the
unobserved responses; the cases with missing responses are then simply a random sample
of all cases. An example of MCAR missing data arises when investigators randomly assign
research participants to complete two-thirds of a survey instrument.
(B) Missing at Random (MAR)- Data are MAR if the probability that a response is missing
depends on the observed data, but not on the unobserved data. When, in addition, the
parameters of the model for the data are distinct from the parameters of the missingness
mechanism, the missingness mechanism is ignorable. For example, if a reading
comprehension test is given at the beginning of a survey administration session,
research participants with lower reading comprehension scores may be less likely to
complete the entire survey. The missing data are then due to some other, observed
influence rather than to the missing values themselves.
(C) Not Missing at Random (NMAR)- Data are NMAR when respondents and non-respondents
with the same values of some variables observed for both differ systematically with
respect to the values of the variable that is missing for the non-respondents. In other
words, the pattern of missingness is non-random and is not predictable from other
variables in the database. For example, a participant in a weight-loss study does not
attend a weigh-in due to concerns about his/her weight loss; his/her data are missing
due to non-ignorable factors.
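The difference between the three mechanisms can be illustrated with a small simulation. The following Python sketch is an illustration only and is not part of the original study: the function name `make_missing`, the two example variables, and all drop probabilities and thresholds are assumptions chosen for the example.

```python
import random

def make_missing(scores, reading, mechanism, seed=0):
    """Return a copy of `scores` in which some entries are replaced by None.

    scores  : the variable that may go missing (e.g. test marks)
    reading : a second variable that is always observed
    All probabilities and thresholds below are hypothetical.
    """
    rng = random.Random(seed)
    result = []
    for score, read in zip(scores, reading):
        if mechanism == "MCAR":
            p = 0.3                        # same chance for every record
        elif mechanism == "MAR":
            p = 0.6 if read < 50 else 0.1  # depends only on the observed variable
        elif mechanism == "NMAR":
            p = 0.6 if score < 8 else 0.1  # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        result.append(None if rng.random() < p else score)
    return result

# Hypothetical example data
scores = [6, 7, 9, 11, 12, 8, 10, 5]
reading = [40, 45, 70, 80, 90, 55, 65, 30]
for name in ("MCAR", "MAR", "NMAR"):
    print(name, make_missing(scores, reading, name))
```

Under MAR the missingness could in principle be predicted from `reading`, whereas under NMAR it depends on the very value that is lost, which is why it cannot be recovered from the other variables.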
In practice it is usually difficult to meet the MCAR assumption. When making sampling
distribution inferences about the parameters of the data, it is appropriate to ignore the
process that causes missing data if the missing data are missing at random and the
observed data are observed at random, but the inferences are generally conditional on
the observed pattern of missing data [5].
FOUR DIFFERENT METHODS TO DEAL WITH MISSING VALUES
A. Case Deletion (CD) - It is also known as complete case analysis. It is available in
all statistical packages and is the default method in many programs. This method
consists of discarding all instances (cases) with missing values for at least one feature.
A variation of this method consists of determining the extent of missing data on each
instance and attribute, and deleting the instances and/or attributes with high levels of
missing data. Before deleting any attribute, it is necessary to evaluate its relevance to
the analysis; relevant attributes should be kept even with a high degree of missing
values. In situations where the sample size is insufficient or some structure exists in
the missing data, CD has been shown to produce more biased estimates than alternative
methods. CD should be applied only in cases in which data are missing completely at
random.
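As a minimal sketch of case deletion, assuming each record is a Python dictionary and a missing value is represented by `None`, the method can be written as follows. The function name `case_deletion`, the field names and the marks are hypothetical and are not taken from the dataset used in this paper.

```python
def case_deletion(records):
    """Keep only the records that have no missing (None) value in any feature."""
    return [row for row in records
            if all(value is not None for value in row.values())]

# Hypothetical records: two of the three have a missing test mark
records = [
    {"roll_no": 1, "test1": 8, "test2": None},
    {"roll_no": 2, "test1": 12, "test2": 11},
    {"roll_no": 3, "test1": None, "test2": 9},
]

complete = case_deletion(records)
print(len(records), "records before,", len(complete), "after deletion")
```

Two of the three records are discarded here, which shows how quickly the sample size can shrink when missing values are spread over several features.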
B. Mean Imputation (MI). The method replaces the missing data for a given feature
(attribute) by the mean of all the known values for that particular attribute.
Let us consider that the value $x_{ij}$ of the $k$-th class, $C_k$, is missing; then $x_{ij}$ is
calculated as
\[ x_{ij} = \sum_{i:\, x_{ij} \in C_k} \frac{x_{ij}}{n_k} \tag{1} \]
where $n_k$ represents the number of non-missing values in the $j$-th feature of the $k$-th class.
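Equation (1) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this study: one feature is stored as a list with `None` for missing values, a parallel list holds the class labels, and the name `impute_class_mean` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_class_mean(values, classes):
    """Replace each None by the mean of the known values of the same class."""
    known = defaultdict(list)
    for value, label in zip(values, classes):
        if value is not None:
            known[label].append(value)
    # n_k in equation (1) is len(known[label]); mean() divides the sum by it.
    class_mean = {label: mean(vals) for label, vals in known.items()}
    return [class_mean[label] if value is None else value
            for value, label in zip(values, classes)]

# Hypothetical marks for two classes "A" and "B"
values = [8, None, 12, 10, None, 14]
classes = ["A", "A", "A", "B", "B", "B"]
print(impute_class_mean(values, classes))
```

A class with no known value at all has no mean, so the sketch raises a `KeyError` in that case; a real application would need a fallback such as the overall attribute mean.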
However, according to Little and Rubin, the drawbacks of MI are:
a) Sample size is overestimated
b) Variance is underestimated
c) Correlation is negatively biased
d) The distribution of new values is an incorrect representation of the population values
because the shape of the distribution is distorted by adding values equal to the mean.
Replacing all missing records with a single value will deflate the variance and
artificially inflate the significance of any statistical test based on it.
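The deflation of the variance can be checked with a few lines of Python. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow: five observed values and five missing values that are all replaced by the observed mean.

```python
from statistics import mean, pvariance

observed = [6, 8, 10, 12, 14]             # hypothetical known values, mean 10
imputed = observed + [mean(observed)] * 5  # five missing values filled with the mean

print("variance of observed values:", pvariance(observed))
print("variance after mean imputation:", pvariance(imputed))
```

The imputed values add nothing to the sum of squared deviations while the number of values doubles, so the population variance is halved in this example.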
C. Median Imputation (MDI). This method uses the median of all known values of the
feature or attribute in the class to which the instance with the missing value belongs.
Consider that the value $x_{ij}$ of the $k$-th class, $C_k$, is missing. It is calculated as
\[ x_{ij} = \operatorname*{median}_{\{i:\, x_{ij} \in C_k\}} \{x_{ij}\} \tag{2} \]
Instead of the mean and median, the mode can also be used in imputation. The imputation
method is applied separately to each attribute and therefore does not consider the
correlation structure of the data [6].
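Equation (2) can be sketched in the same way. As before, this is only an illustration under assumptions: missing values are represented by `None`, and the name `impute_class_median` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def impute_class_median(values, classes):
    """Replace each None by the median of the known values of the same class."""
    known = defaultdict(list)
    for value, label in zip(values, classes):
        if value is not None:
            known[label].append(value)
    class_median = {label: median(vals) for label, vals in known.items()}
    return [class_median[label] if value is None else value
            for value, label in zip(values, classes)]

# Hypothetical marks; class "B" contains one unusually large value (30)
values = [8, None, 12, 11, 10, None, 14, 30]
classes = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(impute_class_median(values, classes))
```

In class "B" the known values are 10, 14 and 30, so the median is 14 while the mean would be 18; the median is less affected by the single large value.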
D. Standard Deviation
The standard deviation measures the spread of the data about the mean value. It is useful
in comparing sets of data which may have the same mean but a different range. The
standard deviation is given by the formula
\[ S_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \tag{3} \]
where $\{x_1, x_2, \ldots, x_N\}$ are the observed values of the sample items and $\bar{x}$ is the
mean value of these observations, while the denominator $N$ stands for the size of the
sample [7].
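Equation (3) translates directly into Python. The sketch below is illustrative; the function name `population_std` and the two sample lists are hypothetical, and the standard library function `statistics.pstdev` computes the same quantity.

```python
import math

def population_std(xs):
    """Standard deviation with denominator N, as in equation (3)."""
    n = len(xs)
    mean_value = sum(xs) / n
    return math.sqrt(sum((x - mean_value) ** 2 for x in xs) / n)

# Two hypothetical data sets with the same mean (10) but a different range
narrow = [9, 10, 11]
wide = [5, 10, 15]
print(population_std(narrow), population_std(wide))
```

Both lists have mean 10, yet the second one yields a much larger standard deviation, which is the sense in which the measure distinguishes data sets with the same mean.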
EXPERIMENTAL ANALYSIS
We use three techniques to fill in the missing data:
1) Imputing the missing values using attribute mean value.
2) Imputing the missing values using Median Imputation
3) Imputing the missing values using Standard Deviation
3.1 Imputing Missing Values Using Attribute Mean Value
This is one of the most frequently used methods. It consists of replacing the missing
data for a given feature (attribute) by the mean of all known values of that attribute in
the class to which the instance with the missing attribute belongs. In this method, the
missing values are replaced by the attribute mean, and the percentage accuracy with
respect to the original values is then found.
3.2 Imputing Missing Values Using Median Imputation
In this method the missing values of instances are imputed using the median of all
known values of the feature or attribute in the class to which the instance with the
missing value belongs.
Instead of the mean and median, the mode can also be used in imputation. The imputation
method is applied separately to each attribute and does not consider the correlation
structure of the data.
3.3 Imputing Missing Values Using Standard Deviation Imputation
The standard deviation measures the spread of the data about the mean value. It is useful
in comparing sets of data which may have the same mean but a different range. The
missing values are replaced by the standard deviation of the respective attribute, and
the percentage accuracy with respect to the original values is then found.
DATASET USED
The dataset used in this paper consists of records of class test marks from an engineering
college. It contains four attributes: Roll No., imputed value by Mean Imputation, imputed
value by Median Imputation and imputed value by Standard Deviation Imputation. The
roll numbers are imaginary and were generated for data analysis. The dataset contains
some missing values randomly distributed at particular record numbers; the imputation
techniques fill in these missing values so that the dataset becomes complete and is easy
to analyze. The dataset is shown in the following table.
Table 1. Dataset with Mean Imputation, Median Imputation and Standard Deviation Imputation

Roll No.            Imputed value by MI   Imputed value by MDI   Imputed value by SDI
121                 8.45                  8                      4.14
140                 11.16                 12                     4.53
165                 8.45                  8                      4.14
200                 8.45                  8                      4.14
230                 10.56                 11                     4.34
300                 11.16                 12                     4.53
534                 11.16                 12                     4.53
1000                11.16                 12                     4.53
1429                8.45                  8                      4.14
1800                11.16                 12                     4.53
2187                9.75                  10                     4.34
2220                11.16                 12                     4.53
2247                10.56                 11                     4.34
2350                11.16                 12                     4.53
2557                10.56                 11                     4.14
3000                11.16                 12                     4.53
3160                10.56                 11                     4.34
3225                8.45                  8                      4.14
3295                10.56                 11                     4.34
3350                11.16                 12                     4.53
3467                10.56                 11                     4.34
3550                11.16                 12                     4.53
3679                11.16                 12                     4.53
3700                10.56                 11                     4.34
3773                11.16                 12                     4.53
3900                9.75                  10                     3.9
4113                11.16                 12                     4.53
4125                11.16                 12                     4.53
4132                10.56                 11                     3.9
4150                8.45                  8                      4.14
4175                10.56                 11                     4.34
4190                11.16                 12                     4.53
4202                8.45                  8                      4.14
4250                11.16                 12                     4.53
4352                11.16                 12                     4.53
4500                8.45                  8                      4.14
4600                8.45                  8                      4.14
4762                9.75                  10                     3.9
4850                10.56                 11                     4.34
4920                11.16                 12                     4.53
4984                8.45                  8                      4.14
4998                11.16                 12                     4.53
Percentage Error    2.1                   0.04                   0.05
4.1 Comparison of Percentage Errors
The percentage error is calculated from the experimental value and the actual value:
the difference between the experimental value and the actual value is divided by the
actual value and multiplied by one hundred, as given by the following formula.
\[ \text{Percentage Error} = \frac{\text{Experimental Value} - \text{Actual Value}}{\text{Actual Value}} \times 100 \]
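The percentage error formula can be written as a short Python function. This is a minimal sketch; the function name `percentage_error` and the example values are hypothetical, and the sign is kept exactly as in the formula, so a result can be negative when the experimental value is below the actual value.

```python
def percentage_error(experimental, actual):
    """(experimental - actual) / actual * 100, as in the formula above."""
    if actual == 0:
        raise ValueError("actual value must be non-zero")
    return (experimental - actual) / actual * 100

# Hypothetical example: an imputed mark of 11 against an actual mark of 10
print(percentage_error(11.0, 10.0))
```

If only the size of the error matters, the absolute value of the result can be taken; that choice is an assumption and is not stated in the formula.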
Table 2. Comparison of Percentage Error

Imputation Technique            Percentage Error
Mean Imputation                 2.1
Median Imputation               0.04
Standard Deviation Imputation   0.05
Comparing the percentage errors of the imputation methods, the percentage error of Mean
Imputation is 2.1 %, that of Median Imputation is 0.04 % and that of Standard Deviation
Imputation is 0.05 %. The Median Imputation method therefore has the lowest percentage
error compared with the Mean Imputation and Standard Deviation Imputation methods, so
it is more suitable than the other methods.
CONCLUSION AND FUTURE WORK
Missing values are regarded as a serious problem in most information systems because
data are unavailable, and they must be imputed before the dataset is used. To handle
these missing values, three techniques are used: Mean Imputation, Median Imputation
and Standard Deviation Imputation. The Median Imputation method has the lowest
percentage error compared with the Mean Imputation and Standard Deviation Imputation
methods, so it is more suitable than the other methods.
The proposed work handles missing values only for numerical attributes. It can be
extended to handle categorical attributes. Different classification algorithms can
be used for comparative analysis of missing data techniques. Missing data techniques
can also be implemented in MATLAB.
REFERENCES
[1] Dinesh J. Prajapati and Jagruti H. Prajapati, Handling Missing Values: Application
to University Data Set. Issue 1, Vol. 1 (August 2011), ISSN 2249-6149.
[2] Shamsher Singh and Jagdish Prasad, Estimation of Missing Values in the Data
Mining and Comparison of Imputation Methods. Mathematical Journal of
Interdisciplinary Sciences, Vol. 1, Issue 1, March 2013, pp. 75–90.
[3] Xiao Feng Zhu, Shichao Zhang, Zhi Jin, Zili Zhang, and Zhuoming Xu, Missing Value
Estimation for Mixed-Attribute Data Sets. IEEE Transactions on Knowledge and Data
Engineering, Vol. 23, No. 1, January 2011.
[4] T.R. Sivapriya, V. Thavavel, and A.R. Nadira Banu Kamal, Imputation and
Classification of Missing Data Using Least Square Support Vector Machines: A New
Approach in Dementia Diagnosis. International Journal of Advanced Research in
Artificial Intelligence, Vol. 1, No. 4, 2012.
[5] Yann-Yann Shieh, Imputation Methods on General Linear Mixed Models of
Longitudinal Studies. American Institutes for Research.
[6] Edgar Acuña and Caroline Rodriguez, The Treatment of Missing Values and
Its Effect in the Classifier Accuracy. Studies in Classification, Data Analysis, and
Knowledge Organization, 2004, Springer.
[7] R. Malarvizhi and Antony Thanamani, Comparision of Imputation Techniques
after Classifying the Dataset Using Knn Classifier for the Imputation of Missing Data.
International Journal of Computational Engineering Research (ijceronline.com),
ISSN 2250-3005, January 2013.
[8] Anjana Sharma, Naina Mehta, and Iti Sharma, Reasoning With Missing Values in Multi
Attribute Datasets. International Journal of Advanced Research in Computer Science
and Software Engineering, Volume 3, Issue 5, May 2013, ISSN: 2277 128X.

COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMINATION DATA

  • 1. NOVATEUR PUBLICATIONS INTERNATIONAL JOURNAL OF INNOVATIONS IN ENGINEERING RESEARCH AND TECHNOLOGY [IJIERT] ISSN: 2394-3696 VOLUME 2, ISSUE 12, DEC.-2015 1 | P a g e COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMINATION DATA V.B. Kamble P.E.S. College of Engineering, Aurangabad. (M.S.), India S.N. Deshmukh Dr. Babasaheb Ambedkar Marathwada University, Aurangabad. (M.S.) India. ABSTRACT The issue of incomplete data exists across the entire field of data mining. In this paper, Mean Imputation, Median Imputation and Standard Deviation Imputation are used to deal with challenges of incomplete data on classification problems. By using different imputation methods converts incomplete dataset in to the complete dataset. On complete dataset by applying the suitable Imputation Method and comparing the percentage error of Imputation Method and comparing the result INTRODUCTION Missing data are the absence of data items; they hide some information that may be important. The presence of missing data is a general and challenging problem in the data analysis field. Fortunately, missing data imputation techniques can be used to improve the data quality. Missing data imputation techniques refer to any strategy that fills in missing values of a dataset so that standard data analysis methods can be applied to analyzed completed dataset [1]. Information quality is important to organization. People use information attribute as a tool for accessing information quality. Information quality is measured based on users as well as experts opinion on the information attributes. The commonly known information attributes for information quality including accuracy, objectivity, reputation, access, security, relevancy, value added, timeliness, completeness, amount of data, and ease of understanding and consistent representation. These attribute can also be applicable to data quality. Commonly, one can rarely find a data set that contains complete entries [2]. 
1.1 Related Work- Methods for dealing with missing values can be classified into three categories 1) Case Deletion, 2) Learning Without Handling of Missing Values, and 3) Missing Value Imputation. The case deletion is to simply omit those cases with missing values and only to use the remaining instances to finish the learning assignments. The second approach is to learn without handling of missing data.
  • 2. NOVATEUR PUBLICATIONS INTERNATIONAL JOURNAL OF INNOVATIONS IN ENGINEERING RESEARCH AND TECHNOLOGY [IJIERT] ISSN: 2394-3696 VOLUME 2, ISSUE 12, DEC.-2015 2 | P a g e Missing data imputation methods advocates filling in missing values before a learning application. Missing data imputation is a procedure that replaces the missing values with some possible values [3]. 1.2 Missing Data- Different methods have been applied in data mining to handle missing values in database. Data with missing values could be ignored, or a global constant could be used to fill the missing values, such as attribute mean, attribute mean of the same class, or an algorithm could be applied to find the missing values. Missing data imputation techniques means a strategy to fill the missing values of a dataset in order to apply the standard methods which require completed data set for analysis. These techniques retain data in incomplete cases, as well as impute values of correlated variables. Missing data imputations techniques are classified as ignorable missing data imputations methods, which include single imputation methods and multiple imputation methods, and non-ignorable missing data imputations methods which include likelihood based methods and the non-likelihood based methods. Single imputation methods could fill one value for each missing values and it is more commonly used at present than multiple imputations which replace each missing value with several possible values and better reflects sampling variability about actual value [4]. 1.3 Patterns of Missing Values- There are a number of ways to know how missing data arises. Little and Rubin introduced specific missing data terminology as a standard framework to deal with missing data mechanisms and their effect on data analysis. 
(A) Missing completely at Random (MCAR)- If the probability that a response is missing is independent of both the observed data for that case and the unobserved responses are simple a random sample from the observed data. An example of MCAR missing data arises when investigators randomly assign research participants to complete two-thirds of a survey instrument (B) Missing at Random (MAR)- If the probability that a response is missing depends on the observed data, but not on the unobserved data. This assumes the parameters of the model for the data are distinct from the parameters of the missingness mechanism. The missingness mechanism is ignorable. For example, in a reading comprehension test at the beginning of a survey administration session, research participants with lower reading comprehension scores may be less likely to complete the entire survey. The missing data are due to some other external influence. (C) Not-Missing at Random. (NMAR)- When respondents and non-respondents, with the same values of some variables observed for both, differ systematically with respect to values of the variable missing for the non-respondent. In other words, the pattern of data missingness is non-random and it is not predictable from other variables in the database. For example, a participant in a weight-loss study does not attend a weight-in due to concerns about his/her weight loss; his/her data are missing due to non-ignorable factors. In practice it is usually difficult to meet the MCAR assumption. It is mentioned that when making sampling distribution inferences about the parameter of the data, it is
appropriate to ignore the process that causes missing data if the missing data are missing at random and the observed data are observed at random, but the inferences are generally conditional on the observed pattern of missing data [5].

FOUR DIFFERENT METHODS TO DEAL WITH MISSING VALUES
A. Case Deletion (CD) - Also known as complete case analysis, this method is available in all statistical packages and is the default method in many programs. It consists of discarding all instances (cases) with missing values for at least one feature. A variation of this method consists of determining the extent of missing data on each instance and attribute, and deleting the instances and/or attributes with high levels of missing data. Before deleting any attribute, it is necessary to evaluate its relevance to the analysis. Unfortunately, relevant attributes should be kept even with a high degree of missing values. In other situations, where the sample size is insufficient or some structure exists in the missing data, CD has been shown to produce more biased estimates than alternative methods. CD should be applied only in cases in which data are missing completely at random.

B. Mean Imputation (MI). The method replaces the missing data for a given feature (attribute) by the mean of all the known values for that attribute. If the value $x_{ij}$ of the $k$-th class, $C_k$, is missing, it is replaced by
$$\hat{x}_{ij} = \sum_{i:\, x_{ij} \in C_k} \frac{x_{ij}}{n_k} \qquad (1)$$
where $n_k$ represents the number of non-missing values in the $j$-th feature of the $k$-th class.
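The class-wise mean of equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names `cls` and `marks`, and the function name `impute_class_mean` are hypothetical and do not come from the paper, and missing values are represented by `None`.

```python
from collections import defaultdict
from statistics import mean

def impute_class_mean(records, feature, label):
    """Fill missing `feature` values with the mean of the observed
    values of that feature within the same class, as in equation (1)."""
    observed = defaultdict(list)
    for r in records:
        if r[feature] is not None:
            observed[r[label]].append(r[feature])
    # n_k is len(observed[k]); a class with no observed value raises KeyError below.
    class_means = {k: mean(v) for k, v in observed.items()}
    return [
        {**r, feature: class_means[r[label]]} if r[feature] is None else dict(r)
        for r in records
    ]

# Hypothetical records: two classes, one missing mark in each.
records = [
    {"cls": "A", "marks": 8}, {"cls": "A", "marks": 10}, {"cls": "A", "marks": None},
    {"cls": "B", "marks": 12}, {"cls": "B", "marks": None}, {"cls": "B", "marks": 11},
]
completed = impute_class_mean(records, "marks", "cls")
```

The function returns new records and leaves the input untouched, so the original pattern of missing values remains available for later comparison.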
However, according to Little and Rubin, the drawbacks of MI are:
a) Sample size is overestimated
b) Variance is underestimated
c) Correlation is negatively biased
d) The distribution of new values is an incorrect representation of the population values, because the shape of the distribution is distorted by adding values equal to the mean.
Replacing all missing records with a single value will deflate the variance and artificially inflate the significance of any statistical test based on it.

C. Median Imputation (MDI). This method uses the median of all known values of the feature (attribute) in the class to which the instance with the missing value belongs. If the value $x_{ij}$ of the $k$-th class, $C_k$, is missing, it is replaced by
$$\hat{x}_{ij} = \operatorname{median}_{\{i:\, x_{ij} \in C_k\}} \{x_{ij}\} \qquad (2)$$
Instead of the mean or median, the mode can also be used in imputation. The imputation method is applied separately for each of the attributes. However, imputation does not consider the correlation structure of the data [6].
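A minimal sketch of equation (2) for a single class is given below, together with a check of the variance before and after imputation. The marks and the function name `impute_median` are hypothetical, and `None` is assumed to mark a missing value. In this example the variance of the completed data is smaller than that of the observed data, which is the shrinkage described above for replacement by a single value.

```python
from statistics import median, pvariance

def impute_median(values):
    """Replace each missing value (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]

# Hypothetical marks of one class, with two missing entries.
class_marks = [8, 12, None, 10, None, 11, 12]

completed = impute_median(class_marks)
observed = [v for v in class_marks if v is not None]

var_before = pvariance(observed)   # variance of the observed marks only
var_after = pvariance(completed)   # variance after filling in the median
```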
D. Standard Deviation
The standard deviation measures the spread of the data about the mean value. It is useful in comparing sets of data which may have the same mean but a different range. The standard deviation is given by the formula
$$S_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad (3)$$
where $\{x_1, x_2, \ldots, x_N\}$ are the observed values of the sample items and $\bar{x}$ is the mean value of these observations, while the denominator $N$ stands for the size of the sample [7].

EXPERIMENTAL ANALYSIS
We use three techniques for data retrieval:
1) Imputing the missing values using the attribute mean value.
2) Imputing the missing values using Median Imputation.
3) Imputing the missing values using Standard Deviation.

3.1 Imputing Missing Values Using Attribute Mean Value
This is one of the most frequently used methods. It consists of replacing the missing data for a given feature (attribute) by the mean of all known values of that attribute in the class to which the instance with the missing attribute belongs. In this method, the missing values are replaced by the attribute mean, and the percentage accuracy with respect to the original values is then found.

3.2 Imputing Missing Values Using Median Imputation
In this method the missing values of instances are imputed using the median of all known values of the feature (attribute) in the class to which the instance with the missing value belongs. Instead of the mean or median, the mode can also be used in imputation. The imputation method is applied separately for each of the attributes. However, imputation does not consider the correlation structure of the data.

3.3 Imputing Missing Values Using Standard Deviation Imputation
The standard deviation measures the spread of the data about the mean value. It is useful in comparing sets of data which may have the same mean but a different range.
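As a small illustration of this remark, the sketch below evaluates the standard deviation with denominator $N$, as in equation (3), for two samples that share the same mean but differ in range. The sample values and the function name `population_std` are hypothetical.

```python
from math import sqrt

def population_std(values):
    """Standard deviation with denominator N, as in equation (3)."""
    n = len(values)
    m = sum(values) / n                      # mean of the observations
    return sqrt(sum((x - m) ** 2 for x in values) / n)

# Two hypothetical samples with the same mean (10) but a different range.
narrow = [9, 10, 11]
wide = [5, 10, 15]

std_narrow = population_std(narrow)
std_wide = population_std(wide)
```

Both samples have mean 10, yet `std_wide` is larger than `std_narrow`, so the standard deviation separates the two samples where the mean alone does not.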
The missing values are replaced by the standard deviation value of the respective attribute, and the percentage accuracy with respect to the original values is then found.

DATASET USED
In this paper the dataset consists of records of class test marks from an engineering college. The dataset contains four attributes: Roll No., imputed value by Mean Imputation, imputed value by Median Imputation and imputed value by Standard Deviation Imputation. The Roll No. values in the dataset are imaginary and were generated for data analysis. The dataset contains some missing values, randomly distributed at particular record numbers; by using imputation
techniques, the missing values are filled in so that the dataset becomes complete and is easy to analyze. The dataset is shown in the following table.

Table 1. Dataset with Mean Imputation (MI), Median Imputation (MDI) and Standard Deviation Imputation (SDI)

Roll No.           Imputed value by MI   Imputed value by MDI   Imputed value by SDI
121                8.45                  8                      4.14
140                11.16                 12                     4.53
165                8.45                  8                      4.14
200                8.45                  8                      4.14
230                10.56                 11                     4.34
300                11.16                 12                     4.53
534                11.16                 12                     4.53
1000               11.16                 12                     4.53
1429               8.45                  8                      4.14
1800               11.16                 12                     4.53
2187               9.75                  10                     4.34
2220               11.16                 12                     4.53
2247               10.56                 11                     4.34
2350               11.16                 12                     4.53
2557               10.56                 11                     4.14
3000               11.16                 12                     4.53
3160               10.56                 11                     4.34
3225               8.45                  8                      4.14
3295               10.56                 11                     4.34
3350               11.16                 12                     4.53
3467               10.56                 11                     4.34
3550               11.16                 12                     4.53
3679               11.16                 12                     4.53
3700               10.56                 11                     4.34
3773               11.16                 12                     4.53
3900               9.75                  10                     3.9
4113               11.16                 12                     4.53
4125               11.16                 12                     4.53
4132               10.56                 11                     3.9
4150               8.45                  8                      4.14
4175               10.56                 11                     4.34
4190               11.16                 12                     4.53
4202               8.45                  8                      4.14
4250               11.16                 12                     4.53
4352               11.16                 12                     4.53
4500               8.45                  8                      4.14
4600               8.45                  8                      4.14
4762               9.75                  10                     3.9
4850               10.56                 11                     4.34
4920               11.16                 12                     4.53
4984               8.45                  8                      4.14
4998               11.16                 12                     4.53
Percentage Error   2.1                   0.04                   0.05

4.1 Comparison of Percentage Errors
The percentage error is calculated from the experimental value and the actual value: the difference between the experimental value and the actual value is divided by the actual value and multiplied by one hundred, as in the following formula.
$$\text{Percentage Error} = \frac{\text{Experimental Value} - \text{Actual Value}}{\text{Actual Value}} \times 100$$

Table 2. Comparison of Percentage Error

Imputation Technique            Percentage Error
Mean Imputation                 2.1
Median Imputation               0.04
Standard Deviation Imputation   0.05

Comparing the percentage errors of the imputation methods, the percentage error of Mean Imputation is 2.1 %, that of Median Imputation is 0.04 % and that of Standard Deviation Imputation is 0.05 %. The Median Imputation method therefore has the lowest percentage error compared with the Mean Imputation and Standard Deviation Imputation methods, and so it is more suitable than the other methods.

CONCLUSION AND FUTURE WORK
Missing values are regarded as a serious problem in most information systems because of the unavailability of data, and they must be imputed before the dataset is used. To handle these missing values, three techniques are used, namely Mean Imputation, Median Imputation and Standard Deviation Imputation. The Median Imputation method has the lowest percentage error compared with the Mean Imputation and Standard Deviation Imputation methods, so the Median Imputation method is more suitable than the other methods.
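The percentage error formula of Section 4.1 can be sketched as follows. The function name `percentage_error` and the two sample values are hypothetical and are not taken from Table 1; the sign of the result is kept, as in the formula, so an imputed value below the actual value gives a negative error.

```python
def percentage_error(experimental, actual):
    """Signed percentage error of an experimental (imputed) value
    relative to the actual value."""
    if actual == 0:
        raise ValueError("actual value must be non-zero")
    return (experimental - actual) / actual * 100

# Hypothetical pair: an imputed mark of 10.5 against an actual mark of 10.
over = percentage_error(10.5, 10.0)
# Hypothetical pair: an imputed mark of 9 against an actual mark of 10.
under = percentage_error(9.0, 10.0)
```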
The proposed work handles missing values only for numerical attributes. It can be extended further to handle categorical attributes. Different classification algorithms can be used for a comparative analysis of missing data techniques. Missing data techniques can also be implemented in MATLAB.

REFERENCES
[1] Dinesh J. Prajapati and Jagruti H. Prajapati, Handling Missing Values: Application to University Data Set. Issue 1, Vol. 1 (August 2011), ISSN 2249-6149.
[2] Shamsher Singh and Jagdish Prasad, Estimation of Missing Values in the Data Mining and Comparison of Imputation Methods. Mathematical Journal of Interdisciplinary Sciences, Vol. 1, Issue 1, March 2013, pp. 75–90.
[3] Xiaofeng Zhu, Shichao Zhang, Zhi Jin, Zili Zhang and Zhuoming Xu, Missing Value Estimation for Mixed-Attribute Data Sets. IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 1, January 2011.
[4] T.R. Sivapriya, V. Thavavel and A.R. Nadira Banu Kamal, Imputation and Classification of Missing Data Using Least Square Support Vector Machines - A New Approach in Dementia Diagnosis. International Journal of Advanced Research in Artificial Intelligence, Vol. 1, No. 4, 2012.
[5] Yann-Yann Shieh, Imputation Methods on General Linear Mixed Models of Longitudinal Studies. American Institutes for Research.
[6] Edgar Acuña and Caroline Rodriguez, The Treatment of Missing Values and Its Effect in the Classifier Accuracy. Studies in Classification, Data Analysis, and Knowledge Organization, 2004, Springer.
[7] R. Malarvizhi and Antony Thanamani, Comparision of Imputation Techniques after Classifying the Dataset Using Knn Classifier for the Imputation of Missing Data. International Journal of Computational Engineering Research (IJCER), ISSN 2250-3005, January 2013.
[8] Anjana Sharma, Naina Mehta and Iti Sharma, Reasoning With Missing Values in Multi Attribute Datasets. International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 5, May 2013, ISSN: 2277 128X.