International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 11 | Nov 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 3593
Handwritten Digit Classification using Machine Learning Models
Vidushi Garg
Student, Department of Information Technology, Maharaja Agrasen Institute of Technology, New Delhi, India
----------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - The rapid growth of new documents and
multimedia news has created new challenges in pattern
recognition and machine learning. Handwritten digit
recognition has long been an open problem in the field of
pattern classification. It remains a compelling research
topic because every individual has their own style of
writing. Handwriting recognition is the capability of a
computer to identify and understand handwritten digits or
characters automatically. The main objective of this paper
is to provide efficient and reliable techniques for the
recognition of handwritten numerals by comparing various
classification models. The MNIST dataset, which contains
70,000 handwritten digits, is widely used for this task.
This paper analyzes the accuracy and performance measures
of three algorithms: the Support Vector Classification
(SVC) model, the Logistic Regression model and the Random
Forest Classification model.
Key Words: Support Vector Classification Model (SVC),
Modified National Institute of Standards and Technology
database (MNIST), Binarization, Logistic Regression
Model, Random Forest Classifier
1. INTRODUCTION
Handwritten digit recognition is an important problem in
optical character recognition, and it can be used as a test
case for theories of pattern recognition and machine
learning algorithms. To promote research in machine
learning and pattern recognition, several standard databases
have emerged. Handwriting recognition is a compelling
problem because every individual has their own style of
writing. The main difficulty of handwritten numeral
recognition is the serious variance in size, translation,
stroke thickness, rotation and deformation of the numeral
image, because handwritten digits are written by different
users and writing style differs from one user to
another.[2] This recognition is required in real-time
applications such as the conversion of handwritten
information into digital format, postal code recognition,
bank check processing and signature verification.
This research aims to recognize handwritten digits by
using tools from Machine Learning to train classifiers that
achieve a high recognition performance. The MNIST data
set is widely used for this recognition process. The MNIST
data set has 70,000 handwritten digits. Each image in this
data set is represented as a 28x28 array. The array has
784 pixels with values ranging from 0 to 255.[6] After
binarization, a pixel value of 0 indicates a black
(background) pixel and a value of 1 indicates a white
(foreground) pixel.
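As a minimal sketch of the representation described above, the following Python snippet flattens a 28x28 grid of grey levels into a 784-element feature vector. The image is a synthetic placeholder rather than an actual MNIST sample, and the helper name `flatten_image` is an assumption made for illustration.

```python
# Hypothetical helper: turn a 28x28 grid of grey levels into a flat feature vector.
ROWS, COLS = 28, 28

def flatten_image(image):
    """Return the pixels of a ROWS x COLS image in row-major order."""
    if len(image) != ROWS or any(len(row) != COLS for row in image):
        raise ValueError("expected a 28x28 image")
    return [pixel for row in image for pixel in row]

# Synthetic placeholder image (not MNIST data): a vertical bright stroke.
image = [[255 if 12 <= c <= 15 else 0 for c in range(COLS)] for r in range(ROWS)]
features = flatten_image(image)
print(len(features))  # 784
```

Row-major flattening is only one possible ordering; any fixed ordering works as long as it is applied identically to training and test images.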
This study focuses on feature extraction and classification.
The performance of a classifier can rely as much on the
quality of the features as on the classifier itself. In this study,
we compare the performance of three different machine
learning classifiers for the recognition of digits: the
Support Vector Classification (SVC) model, Logistic
Regression, and the Random Forest Classification model.
The main purpose of this research is to build a reliable
method for the recognition of handwritten digit strings. The
main finding of this work is that the Support Vector
Classification model gives the highest accuracy when
compared to the other classification models. However, 100%
accuracy has yet to be achieved, and research is still
actively ongoing to reduce the error rate. Accuracy and
correctness are crucial in handwritten digit recognition
applications; even a 1% error may lead to inappropriate
results in real-time applications.
2. RESEARCH METHODOLOGY
2.1 Description of the dataset
The MNIST dataset, a subset of the larger NIST set, is a
database of 70,000 handwritten digits, divided into 60,000
training examples and 10,000 testing samples. Each image in
the MNIST dataset is stored as an array of 28x28 values
together with its label.[1] The same format is used for the
testing images.
The gray level values of each pixel are coded in this work in
the [0,255] interval, using a 0 value for black (background)
pixels and 255 for white (foreground) ones.
Table -1: Dataset
Fig -1: Digit 6 in 28x28 format in MNIST
2.2 Data Preprocessing
An important factor in achieving high performance in the
learning process is the construction of a useful training set.
The 70,000 different patterns contained in the MNIST
database can be seen as a rather generous set, but evidence
shows that the usual learning algorithms run into serious
trouble for about one hundred or more of the test set
samples. Therefore, some strategy is needed in order to
increase the training set cardinality and variability. Usual
actions comprise geometric transformations such as
displacements, rotation, scaling and other distortions.
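To make the idea of enlarging the training set concrete, here is a small Python sketch of one of the transformations listed above, displacement. It is illustrative only and is not claimed to be the procedure used in this work; the function names `shift_image` and `augment_with_shifts` and the one-pixel offsets are assumptions.

```python
def shift_image(image, dx, dy, fill=0):
    """Translate a 2-D image by (dx, dy) pixels, padding exposed cells with `fill`."""
    rows, cols = len(image), len(image[0])
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dy, c + dx
            if 0 <= nr < rows and 0 <= nc < cols:
                shifted[nr][nc] = image[r][c]
    return shifted

def augment_with_shifts(images, labels, offsets=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Return the original samples plus one shifted copy per offset."""
    aug_images, aug_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for dx, dy in offsets:
            aug_images.append(shift_image(image, dx, dy))
            aug_labels.append(label)  # a shifted digit keeps its label
    return aug_images, aug_labels

# Tiny synthetic example (a 3x3 "image"), only to show the mechanics.
toy = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
images, labels = augment_with_shifts([toy], [6])
print(len(images))  # 5: the original plus four shifted copies
```

Each shifted copy keeps the label of its source image, so the training set grows in size and variability without any new labelling effort.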
The variables (features) used in this paper for handwritten
digit classification require binary images. The
binarization process assumes that images contain two
classes of pixel: the foreground (or white pixels, with
maximum intensity, i.e., equal to 1) and the background (or
black pixels with minimum intensity, i.e., equal to 0). The
goal of the method is to classify all pixels with values above
the given threshold as white, and all other pixels as black.
That is, given a threshold value t and an image X with
pixels denoted as x(i, j), the binarized image X_b with
elements x_b(i, j) is obtained as follows:
$$x_b(i, j) = \begin{cases} 1 & \text{if } x(i, j) > t \\ 0 & \text{otherwise} \end{cases}$$
The key problem in binarization is then how to select the
correct threshold, t, for a given image. We observe that the
shape of any object in the image is sensitive to variations in
the threshold value, and even more so in the case of
handwritten digits. Therefore, we consider that a binary
handwritten number is better recognized computationally if
its trace is complete and continuous. This is the criterion
that we use to select the threshold, whose choice is of
crucial importance.[3]
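The thresholding rule can be sketched in a few lines of Python. The grey-level patch and the three threshold values below are made up for illustration and are not taken from MNIST or from this work; the function name `binarize` is likewise an assumption.

```python
def binarize(image, t):
    """Apply the rule: output 1 where the pixel value exceeds t, else 0."""
    return [[1 if pixel > t else 0 for pixel in row] for row in image]

# Synthetic grey-level patch (values in [0, 255]); not taken from MNIST.
patch = [[0, 40, 200],
         [30, 128, 255],
         [0, 90, 180]]

for t in (50, 127, 190):  # illustrative thresholds, chosen arbitrarily
    print(t, binarize(patch, t))
```

Running the loop shows how a higher threshold removes pixels from the stroke, which is why a poorly chosen t can break the trace of a digit.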
Fig -2: The same digit with different threshold value
2.3 Implementation
2.3.1 Logistic Regression Model
Logistic Regression is a Machine Learning classification
algorithm that is used to predict the probability of a
categorical dependent variable. In logistic regression, the
dependent variable is a binary variable that contains data
coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other
words, the logistic regression model predicts P(Y=1) as a
function of X.
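As a hedged illustration of "P(Y=1) as a function of X", the sketch below evaluates the logistic (sigmoid) function on a linear score. The weights, bias and input are hypothetical values chosen for the example. For the ten digit classes, a common extension is to train one such binary model per digit (one-vs-rest) or to use a multinomial variant; which of these applies here is an open assumption.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Estimate P(Y=1 | x) for one feature vector under a linear score."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)

# Hypothetical parameters and input, chosen only for illustration.
weights, bias = [0.8, -0.4], 0.1
p = predict_proba([1.0, 2.0], weights, bias)
print(round(p, 3))
```

A predicted probability above 0.5 would typically be mapped to class 1 and anything below to class 0.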
Fig -3: Logistic Regression Model
The training set data is then fed as input to the Logistic
Regression model so that the classifier gets trained. The
predicted label is compared with the original label to get the
accuracy of the trained classifier. Once the training is done,
the testing data is given to the classifier to predict the labels
and the testing accuracy is obtained. A confusion matrix is
generated, which counts how often each actual label is
assigned each predicted label. Using the confusion matrix, the performance
measures like precision, recall and F1 score can be calculated.
Using Logistic Regression, an accuracy of 88.97% is obtained
on test data.
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP = True Positive, FP = False Positive
$$\text{Recall} = \frac{TP}{TP + FN}$$
where FN = False Negative
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The confusion matrix obtained here is a 10x10 matrix M
because this is a multi-class classification problem (digits 0-9).
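The per-class measures can be derived from a confusion matrix as sketched below. The sketch assumes that rows index the actual class and columns the predicted class, and it uses a made-up 3-class matrix instead of the 10x10 matrix of this work; none of the numbers are results from the paper.

```python
def per_class_metrics(cm):
    """Compute (precision, recall, F1) per class from a square confusion matrix.

    Assumption: cm[i][j] counts samples of actual class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

def accuracy(cm):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

# Made-up 3-class confusion matrix for illustration; not the paper's results.
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 2, 8]]
for label, (p, r, f) in enumerate(per_class_metrics(cm)):
    print(label, round(p, 3), round(r, 3), round(f, 3))
print(round(accuracy(cm), 3))
```

In the multi-class setting each digit is treated in turn as the "positive" class, so TP, FP and FN are read off the diagonal entry, its column and its row respectively.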
Table 2 below is the confusion matrix obtained for the
trained data set using the Logistic Regression Model.
Table -2: Confusion matrix using Logistic Regression
model
Table 3 below shows the precision, recall and F1 score
values obtained for the trained data set using the Logistic
Regression Model.
Table -3: Precision, Recall and F1 score for Logistic
Regression on trained dataset
The Logistic Regression model obtained an accuracy of 88.97% on
the MNIST test data set.
2.3.2 Random Forest Classifier
Random forest is a supervised learning algorithm. It can be
used for both classification and regression. It is also a
flexible and easy-to-use algorithm. A forest is composed of
trees, and it is said that the more trees it has, the more
robust a forest is. A random forest creates decision trees on
randomly selected data samples, gets a prediction from each
tree and selects the best solution by means of voting. It also
provides a pretty good indicator of feature importance.
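The two mechanisms named above, random sampling of the data and voting across trees, can be sketched as follows. This is a simplified illustration with hypothetical tree outputs, not a full random forest; the helper names `bootstrap_sample` and `majority_vote` are assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement, one such sample per tree."""
    return [rng.choice(samples) for _ in samples]

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Bootstrap sample of ten hypothetical training indices.
rng = random.Random(0)
indices = list(range(10))
sample = bootstrap_sample(indices, rng)

# Hypothetical predictions from seven trees for a single digit image.
votes = [3, 8, 3, 3, 5, 8, 3]
print(majority_vote(votes))  # 3
```

Because each tree sees a different bootstrap sample, individual trees make different errors, and the vote tends to cancel them out.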
Fig -4: Random Forest Classifier
Here also, the confusion matrix is obtained and precision and
recall values are computed.
Table 4: Confusion matrix using Random Forest Classifier
Table 4 above shows the confusion matrix obtained for
the trained data set using the Random Forest Classifier.
Table 5 below shows the precision, recall and F1 score
values obtained for the trained data set using the Random
Forest Classifier.
Table 5: Precision, Recall and F1 score for Random Forest
Classifier on trained dataset
The Random Forest Classifier obtained an accuracy of 96.3% on
the MNIST test data set.
2.3.3 Support Vector Machine (SVM) Classifier for digit
recognition
Support Vector Machine is a supervised machine learning
technique which is applied for classification and regression.
It represents the input data as points in a space and
separates those points into classes.
The SVM separates the classes using the hyperplane
concept.[9] The separating hyperplane should be
equidistant from the nearest points of each class.
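For the simplest case, two classes and a linear boundary, the hyperplane idea can be sketched as below. The hyperplane and the points are hypothetical, and the sketch makes no claim about the kernel or multi-class strategy used in this work; it only shows how the sign of w.x + b assigns a class and how the distance to the hyperplane is measured.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(x, w, b) >= 0 else -1

def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    return abs(decision_value(x, w, b)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 and a few toy points.
w, b = [1.0, 1.0], -3.0
print(classify([3.0, 2.0], w, b), classify([1.0, 1.0], w, b))
d_neg = distance_to_hyperplane([1.0, 1.0], w, b)  # nearest point of class -1
d_pos = distance_to_hyperplane([2.0, 2.0], w, b)  # nearest point of class +1
print(d_neg, d_pos)
```

In this toy setup the two nearest points lie at the same distance on opposite sides of the boundary, which is the equidistance property a maximum-margin separator aims for.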
Fig-5: Support Vector Machine Classifier
Here also, the confusion matrix is obtained and precision and
recall values are computed.
Table 6: Confusion matrix using Support Vector Machine
Classifier
Table 6 above shows the confusion matrix obtained for
the trained data set using the Support Vector Machine
Classifier.
Table 7 below shows the precision, recall and F1 score
values obtained for the trained data set using the Support
Vector Machine Classifier.
Table 7: Precision, Recall and F1 score for Support Vector
Machine Classifier on trained dataset
The Support Vector Machine Classifier obtained an accuracy of
98.3% on the MNIST test data set.
3. CONCLUSIONS
The accuracies of the algorithms Logistic Regression,
Random Forest Classifier and Support Vector Machine
Classifier are tabulated below in table 8.
Table 8: Data Accuracy of different models
Algorithm                            Data Accuracy
Logistic Regression                  88.97%
Random Forest Classifier             96.3%
Support Vector Machine Classifier    98.3%
Chart -1: Data accuracy chart of different models
The Support Vector Machine Classifier achieves the highest
accuracy (98.3%), ahead of the Random Forest Classifier
(96.3%) and Logistic Regression (88.97%).
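To put the reported accuracies in perspective, the following sketch converts them into approximate counts of misclassified test images. It assumes that each accuracy was measured on the full 10,000-image MNIST test split described in Section 2.1; the counts are simple arithmetic on the reported figures, not additional experimental results.

```python
from decimal import Decimal

TEST_SET_SIZE = 10_000  # assumed evaluation size: the test split from Section 2.1

# Test accuracies as reported in Table 8, in percent.
reported_accuracy = {
    "Logistic Regression": Decimal("88.97"),
    "Random Forest Classifier": Decimal("96.3"),
    "Support Vector Machine Classifier": Decimal("98.3"),
}

# Error rate in percent, scaled to a count of test images.
misclassified = {
    name: int((Decimal(100) - acc) * TEST_SET_SIZE / 100)
    for name, acc in reported_accuracy.items()
}

for name, count in misclassified.items():
    print(f"{name}: about {count} of {TEST_SET_SIZE} test images misclassified")
```

Under that assumption, the gap between the weakest and the strongest model corresponds to several hundred digits, which matters for applications where even a 1% error is costly.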
REFERENCES
[1] Wu, Ming & Zhang, Zhen. (2019). “Handwritten Digit
Classification using the MNIST Data Set”
[2] S M Shamim, Mohammad Badrul Alam Miah,
Angona Sarker, Masud Rana & Abdullah Al Jobair.
“Handwritten Digit Recognition using
Machine Learning Algorithms”
[3] Andrea Giuliodori, Rosa Lillo & Daniel Peña.
“Handwritten Digit Classification”
[4] Plamondon, R., & Srihari, S. N. “Online and off-line
handwriting recognition: a comprehensive survey”.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(1), 63-84
[5] Liu, C. L., Yin, F., Wang, D. H., & Wang, Q. F. (2013).
“Online and offline handwritten Chinese character
recognition benchmarking on new databases”.
Pattern Recognition, 46(1), 155-162
[6] “Handwritten Digit Recognition using Convolutional
Neural Networks in Python with Keras”
[7] Al-Wzwazy, Haider, Albehadili, Hayder M., Alwan,
Younes & Islam, Naz. (2016).
“Handwritten Digit Recognition Using Convolutional
Neural Networks”. International Journal of
Innovative Research in Computer and
Communication Engineering
[8] AL-Mansoori, Saeed. (2015). “Intelligent
Handwritten Digit Recognition using Artificial
Neural Network”
[9] Meer Zohra & D. Rajeswara Rao. “A
Comprehensive Data Analysis on Handwritten Digit
Recognition using Machine Learning Approach”.
International Journal of Innovative Technology and
Exploring Engineering (IJITEE), ISSN: 2278-3075,
Volume-8 Issue-6, April 2019

More Related Content

PDF
Understanding the Applicability of Linear & Non-Linear Models Using a Case-Ba...
PDF
Dimensionality Reduction
PDF
IRJET- Performance Evaluation of Various Classification Algorithms
PPTX
Dimension reduction(jiten01)
PDF
Classification Techniques: A Review
PDF
Efficient classification of big data using vfdt (very fast decision tree)
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
PDF
Ijatcse71852019
Understanding the Applicability of Linear & Non-Linear Models Using a Case-Ba...
Dimensionality Reduction
IRJET- Performance Evaluation of Various Classification Algorithms
Dimension reduction(jiten01)
Classification Techniques: A Review
Efficient classification of big data using vfdt (very fast decision tree)
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Ijatcse71852019

What's hot (18)

PDF
A Review on Non Linear Dimensionality Reduction Techniques for Face Recognition
PDF
A Bayesian approach to estimate probabilities in classification trees
PDF
Feature selection in multimodal
PDF
Solving linear equations from an image using ann
PDF
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
PDF
Hypothesis on Different Data Mining Algorithms
PPTX
Decision Trees for Classification: A Machine Learning Algorithm
PDF
IRJET- A Data Mining with Big Data Disease Prediction
PPT
2.8 accuracy and ensemble methods
PDF
IRJET - Finger Vein Extraction and Authentication System for ATM
PPTX
Random forest
PPT
Data Mining: Concepts and Techniques — Chapter 2 —
PDF
Understanding random forests
PDF
The Application Of Bayes Ying-Yang Harmony Based Gmms In On-Line Signature Ve...
PDF
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
PDF
Anirban part1
PDF
2018 p 2019-ee-a2
PDF
08 classbasic
A Review on Non Linear Dimensionality Reduction Techniques for Face Recognition
A Bayesian approach to estimate probabilities in classification trees
Feature selection in multimodal
Solving linear equations from an image using ann
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
Hypothesis on Different Data Mining Algorithms
Decision Trees for Classification: A Machine Learning Algorithm
IRJET- A Data Mining with Big Data Disease Prediction
2.8 accuracy and ensemble methods
IRJET - Finger Vein Extraction and Authentication System for ATM
Random forest
Data Mining: Concepts and Techniques — Chapter 2 —
Understanding random forests
The Application Of Bayes Ying-Yang Harmony Based Gmms In On-Line Signature Ve...
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
Anirban part1
2018 p 2019-ee-a2
08 classbasic
Ad

Similar to IRJET-Handwritten Digit Classification using Machine Learning Models (20)

PDF
IRJET - Skin Disease Predictor using Deep Learning
PDF
A02610104
PDF
Brain Tumor Classification using Support Vector Machine
PDF
AMAZON STOCK PRICE PREDICTION BY USING SMLT
PDF
AIRLINE FARE PRICE PREDICTION
PDF
IRJET- A Detailed Study on Classification Techniques for Data Mining
PDF
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
PDF
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
PDF
A Novel Algorithm for Design Tree Classification with PCA
PDF
1376846406 14447221
PDF
COMPARISON OF WAVELET NETWORK AND LOGISTIC REGRESSION IN PREDICTING ENTERPRIS...
PDF
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
PDF
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
PDF
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
PDF
IRJET - Stock Market Prediction using Machine Learning Algorithm
PDF
IRJET - Single Image Super Resolution using Machine Learning
PDF
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
PDF
IRJET - Effective Workflow for High-Performance Recognition of Fruits using M...
PDF
Authentic Patient Data and Optimization Process through Cryptographic Image o...
PDF
Authentic Patient Data and Optimization Process through Cryptographic Image o...
IRJET - Skin Disease Predictor using Deep Learning
A02610104
Brain Tumor Classification using Support Vector Machine
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AIRLINE FARE PRICE PREDICTION
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
A Novel Algorithm for Design Tree Classification with PCA
1376846406 14447221
COMPARISON OF WAVELET NETWORK AND LOGISTIC REGRESSION IN PREDICTING ENTERPRIS...
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
Faster Training Algorithms in Neural Network Based Approach For Handwritten T...
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Single Image Super Resolution using Machine Learning
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
IRJET - Effective Workflow for High-Performance Recognition of Fruits using M...
Authentic Patient Data and Optimization Process through Cryptographic Image o...
Authentic Patient Data and Optimization Process through Cryptographic Image o...
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
additive manufacturing of ss316l using mig welding
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
Structs to JSON How Go Powers REST APIs.pdf
PDF
Digital Logic Computer Design lecture notes
PDF
Well-logging-methods_new................
PPTX
Strings in CPP - Strings in C++ are sequences of characters used to store and...
DOCX
573137875-Attendance-Management-System-original
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
Construction Project Organization Group 2.pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
CYBER-CRIMES AND SECURITY A guide to understanding
additive manufacturing of ss316l using mig welding
Model Code of Practice - Construction Work - 21102022 .pdf
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Structs to JSON How Go Powers REST APIs.pdf
Digital Logic Computer Design lecture notes
Well-logging-methods_new................
Strings in CPP - Strings in C++ are sequences of characters used to store and...
573137875-Attendance-Management-System-original
bas. eng. economics group 4 presentation 1.pptx
Internet of Things (IOT) - A guide to understanding
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Lesson 3_Tessellation.pptx finite Mathematics
Foundation to blockchain - A guide to Blockchain Tech
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Construction Project Organization Group 2.pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...

IRJET-Handwritten Digit Classification using Machine Learning Models

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 11 | Nov 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 3593 Handwritten Digit Classification using Machine Learning Models Vidushi Garg Student, Department of Information Technology, Maharaja Agrasen Institute of Technology, New Delhi, India ----------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - The rapid growth of new documents and multimedia news has created new challenges in pattern recognition and machine learning. The problem of handwritten digit recognition has long been an open problem in the field of pattern classification. Handwriting recognition is one of the compelling research works going on because every individual in this world has their own style of writing. It is the capability of the computer to identify and understand handwritten digits or characters automatically. The main objective of this paper is to provide efficient and reliable techniques for recognition of handwritten numerals by comparing various classification models. MNIST dataset is widely used for this recognition process and it has 70000 handwritten digits. This paper performs the analysis of accuracies and performance measures of algorithms Support Vector Classification model (SVC), Logistic Regression model and Random Forest Classification model. Key Words: Support Vector Classification Model(SVC), Modified NationalInstituteofStandards and Technology database(MNIST), Binarization, Logistic Regression Model, Random Forest Classifier 1. INTRODUCTION Handwritten digit recognition is an important problem in optical character recognition, and it can be used as a test case for theories of pattern recognition and machine learning algorithms. 
To promote research of machine learning and patternrecognition,several standarddatabases have emerged. Handwriting recognition is one of the compelling and fascinating works because every individual in this world has their own style of writing. The main difficulty of handwritten numerals recognitionistheserious variance in size, translation, stroke thickness, rotation and deformation of the numeral image because of handwritten digits are written by different users and their writingstyleis different from one user to another user.[2] In real-time applications like the conversion of handwritten information into digital format, postal code recognition, bank check processing, verification of signatures, this recognition is required. This research aims to recognize the handwritten digits by using tools from Machine Learning to train the classifiers, so it produces a high recognition performance.TheMNISTdata set is widely used for this recognition process. The MNIST data set has 70,000 handwritten digits. Each image in this data set is represented as an array of 28x28. The array has 784 pixels having values ranging from 0 to 255.[6] if the pixel value is ‘0’ it indicates that the background is black and if it is ‘1’ the background is white. This study focuses on feature extraction and classification. The performance of a classifier can rely as much on the quality of the features as on the classifier itself. In this study, we compare the performance of three different machine learning classifiers for recognition of digits. The three classifiers namelySupport VectorClassificationmodel (SVC), LogisticRegression,andRandomForestClassificationmodel. The main purpose of this research is to build a reliable method for the recognition of handwritten digit strings. The main contribution in this work is that Support Vector Classification model gives the highest accuracy while compared to the other classification models. 
Yet 100% accuracy is something that is tobeachievedandtheresearch is still actively going on in order to reduce the errorrate.The accuracy and correctness are very crucial in handwritten digit recognition applications. Even 1% error may lead to inappropriate results in real-time applications. 2. RESEARCH METHODOLOGY 2.1 Description of the dataset The MNIST dataset, a subset of a larger set NIST, is a database of 70,000 handwritten digits, divided into 60,000 training examples and 10,000 testingsamples.Theimagesin the MNIST dataset are present in form of an array consisting of 28x28 values representing an image along with their labels.[1] This is also the same in case of the testing images. The gray level values of each pixel are coded in this work in the [0,255] interval, using a 0 value for white pixels and 255 for black ones. Table -1: Dataset
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 11 | Nov 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 3594 Fig -1: Digit 6 in 28x28 format in MNIST 2.2 Data Preprocessing An important point for managing a high performance in the learning process is the construction of a useful training set. The 70000 different patterns contained in the MNIST database can be seen as a rather generous set, but evidence shows that the usual learning algorithms run into serious trouble for about one hundred or more of the test set samples. Therefore, some strategy is needed in order to increase the training set cardinality and variability. Usual actions comprise geometric transformations such as displacements, rotation, scaling and other distortions. The variables proposed in this paper to make handwritten digit classification require images in binary level. The binarization process assumes that images contain two classes of pixel: the foreground (or white pixels, with maximum intensity, i.e., equal to 1) and the background (or black pixels with minimum intensity, i.e., equal to 0). The goal of the method is to classify all pixels with values above of the given threshold as white, and all other pixels as black. That is, given a threshold value t and an image X with pixels denoted as x(i, j), the binarized image Xb with elements xb (i, j) is obtained as follows: If x(i, j) > t xb (i, j ) = 1 Else xb (i, j) = 0 Then, the key problem in the binarizationishowto selectthe correct threshold, t, for a given image. We observe that the shape of any object in the image is sensitive to variations in the threshold value, and even more sensitive in the case of handwritten digit. 
Therefore, we consider that a binary handwritten number is better recognized computationallyif its trace is complete and continuous, this is the criterionthat we use to the threshold, being its choice of crucial importance.[3] Fig -2: The same digit with different threshold value 2.3 Implementation 2.3.1 Logistic Regression Model Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X. Fig -3: Logistic Regression Model After that training set data is fed as input to the Logistic Regression model so that the classifier gets trained. The guessed label is matched with the original togettheaccuracy of the trained classifier. Once the training is done, the testing data is given to the classifier to predict the labels and testing accuracy is obtained. The confusion matrix is generated which gives the probability between the actual data and the predicted data. Using the confusion matrix, the performance
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 11 | Nov 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.34 | ISO 9001:2008 Certified Journal | Page 3595 measures like precision, recall and f1 scorecanbecalculated. UsingLogistic Regression, anaccuracy of 88.97% is obtained on test data. Precision = TP/(TP+FP) where TP = True Positive, FP = False Positive Recall = TP/(TP+FN) where FN = False Negative F1score = 2*Precision+Recall/(Precission+Recall) The confusion matrix obtained here isintheformofmatrixM 10x10 because it is a multi-class classification (0-9). The below Table 2 is the confusion matrix obtained for the trained data set using Logistic Regression Model. Table -2: Confusion matrix using Logistic Regression model The below table 3 shows the precision, recall and f1 score values obtained for the trained data set using the Logistic Regression Model. Table -3: Precision, Recall and F1 score for Logistic Regression on trained dataset The Test Data Set obtained accuracy of 88.97% using the Logistic Regression on MNIST data set. 2.3.2 Random Forest Classifier Random forest is a supervised learning algorithm. It can be used both forclassification and regression. It is also the most flexible and easy to use algorithm. A forest is comprised of trees. It is said that the more trees it has, the more robust a forest is. Random forests creates decision trees on randomly selected data samples, gets prediction from each tree and selects the best solution by means of voting.Italsoprovidesa pretty good indicator of the feature importance. Fig -4: Random Forest Classifier Here also, the confusion matrix is obtained andprecisionand recall values are computed. Table 4: Confusion matrix using Random Forest Classifier The above Table 4 shows the confusion matrix obtained for the trained data set using the Random Forest Classifier. 
Table 5 below shows the precision, recall and F1 score values obtained for the training data set using the Random Forest Classifier.
Table -5: Precision, Recall and F1 score for Random Forest Classifier on the training data set

On the test data set, the Random Forest Classifier obtained an accuracy of 96.3% on MNIST.

2.3.3 Support Vector Machine (SVM) Classifier for digit recognition

Support Vector Machine is a supervised machine learning technique that is applied to classification and regression. It represents the input data as points in a space and separates those points into classes. The SVM separates the classes using a hyperplane.[9] The separating hyperplane should be equidistant from the nearest points of each class, so that the margin between the classes is as wide as possible.

Fig -5: Support Vector Machine Classifier

Here also, the confusion matrix is obtained and the precision and recall values are computed.

Table -6: Confusion matrix using Support Vector Machine Classifier

Table 6 above shows the confusion matrix obtained for the training data set using the Support Vector Machine Classifier. Table 7 below shows the precision, recall and F1 score values obtained for the training data set using the Support Vector Machine Classifier.

Table -7: Precision, Recall and F1 score for Support Vector Machine Classifier on the training data set

On the test data set, the Support Vector Machine Classifier obtained an accuracy of 98.3% on MNIST.

3. CONCLUSIONS

The accuracies of the Logistic Regression, Random Forest Classifier and Support Vector Machine Classifier algorithms are tabulated in Table 8 below.

Table -8: Data Accuracy of different models

Algorithm                            Data Accuracy
Logistic Regression                  88.97%
Random Forest Classifier             96.3%
Support Vector Machine Classifier    98.3%
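A comparison like the one in Table 8 could be set up along the following lines. This is a minimal sketch, not the code used in this paper. It assumes scikit-learn, which the paper does not name, and it uses the small 8x8 digits data set bundled with scikit-learn (`load_digits`) as a stand-in for MNIST so that it runs without a download. The split ratio, random seed and hyperparameters are illustrative assumptions, so the accuracies it prints will differ from those in Table 8.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled 8x8 digits data set stands in for MNIST (28x28).
digits = load_digits()
X = digits.data / 16.0  # pixel values range from 0 to 16; scale to [0, 1]
y = digits.target

# Assumption: an 80/20 stratified split with a fixed seed, for illustration only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumption: near-default hyperparameters; the paper does not report its settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    "Support Vector Machine Classifier": SVC(),
}

# Train each model on the training split and score it on the held-out test split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.2%}")
```

Scoring all three models on the same held-out split keeps the comparison fair, since each accuracy is then measured on identical test samples.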
Chart -1: Data accuracy chart of different models

It can be clearly observed that the Support Vector Machine Classifier achieves a higher accuracy (98.3%) than both Logistic Regression (88.97%) and the Random Forest Classifier (96.3%).

REFERENCES

[1] Wu, Ming & Zhang, Zhen. (2019). “Handwritten Digit Classification using the MNIST Data Set”.
[2] S M Shamim, Mohammad Badrul Alam Miah, Angona Sarker, Masud Rana & Abdullah Al Jobair. “Handwritten Digit Recognition using Machine Learning Algorithms”.
[3] Andrea Giuliodori, Rosa Lillo & Daniel Peña. “Handwritten Digit Classification”.
[4] Plamondon, R., & Srihari, S. N. Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 63-84.
[5] Liu, C. L., Yin, F., Wang, D. H., & Wang, Q. F. (2013). Online and offline handwritten Chinese character recognition benchmarking on new databases. Pattern Recognition, 46(1), 155-162.
[6] “Handwritten Digit Recognition using Convolutional Neural Networks in Python with Keras”.
[7] Al-Wzwazy, Haider & M Albehadili, Hayder & Alwan, Younes & Islam, Naz. (2016). “Handwritten Digit Recognition Using Convolutional Neural Networks”. International Journal of Innovative Research in Computer and Communication Engineering.
[8] AL-Mansoori, Saeed. (2015). “Intelligent Handwritten Digit Recognition using Artificial Neural Network”.
[9] Meer Zohra & D. Rajeswara Rao. “A Comprehensive Data Analysis on Handwritten Digit Recognition using Machine Learning Approach”. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-6, April 2019.