IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 2, April 2025, pp. 1290~1301
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i2.pp1290-1301
Journal homepage: http://ijai.iaescore.com
A comparative analysis of optical character recognition models
for extracting and classifying texts in natural scenes
Puneeth Prakash, Sharath Kumar Yeliyur Hanumanthaiah
Department of Information Science and Engineering, Maharaja Institute of Technology Mysore, Affiliated to VTU Belgaum,
Karnataka, India
Article Info ABSTRACT
Article history:
Received Jun 21, 2024
Revised Oct 28, 2024
Accepted Nov 14, 2024
This research introduces prior-guided dynamic tunable network (PDTNet),
an efficient model designed to improve the detection and recognition of text
in complex environments. PDTNet’s architecture combines advanced
preprocessing techniques and deep learning methods to enhance accuracy
and reliability. The study comprehensively evaluates various optical
character recognition (OCR) models, demonstrating PDTNet’s superior
performance in terms of adaptability, accuracy, and reliability across
different environmental conditions. The results emphasize the need for a
context-aware approach in selecting OCR models for specific applications.
This research advocates for the development of hybrid OCR systems that
leverage multiple models, aiming to achieve higher accuracy and
adaptability in practical applications. With a precision of 85%, the proposed
model showed an improvement of 1.7% over the existing state-of-the-art
model. These findings contribute valuable insights into addressing the
technical challenges of text extraction and optimizing OCR model selection
for real-world scenarios.
Keywords:
Environmental adaptability in OCR
Hybrid OCR systems
Optical character recognition
Scene text recognition
Text detection algorithms
This is an open access article under the CC BY-SA license.
Corresponding Author:
Puneeth Prakash
Department of Information Science and Engineering, Maharaja Institute of Technology Mysore
Affiliated to VTU Belgaum
Srirangapatna, Karnataka 571477, India
Email: puneeth.phd20@gmail.com
1. INTRODUCTION
Detecting and recognizing text in natural scene images has emerged as a crucial focus in the fields
of computer vision, machine learning, and pattern recognition. Although there have been significant
advancements in these fields, accurately detecting and recognizing text in scene images and posters remains a
major challenge due to factors such as intricate backgrounds and diverse text orientations [1], [2]. Images can
generally be categorized into document images or scene images, with each presenting distinct characteristics
in text presentation [3], [4]. In natural scene images, the process of text recognition often involves steps like
text detection, segmentation, and optical character recognition (OCR)-based text recognition [5]. The
variability and complexity in these images, such as differing font styles, sizes, colors, and languages, pose
substantial challenges compared to text in documents [6]–[9].
OCR is an essential technology for recognizing text within images. It transforms various document
types, like PDF files and scanned papers, into editable text formats [10], [11]. Originating as a tool to digitize
printed text, OCR has now burgeoned into an essential component in numerous applications, ranging from
document scanning to aiding the visually impaired. Its significance is particularly pronounced in natural
scenes, where text is often embedded within complex and dynamic backgrounds. OCR enables the extraction
and digitization of textual content from various sources such as street signs, billboards, and product labels,
facilitating a myriad of applications from navigation aids for autonomous vehicles to accessible reading tools
for the visually impaired. By converting unstructured text into machine-readable data, OCR in natural scenes
not only enhances user interaction with the environment but also serves as a foundation for further processing
and analysis in fields like geospatial mapping, retail, and security.
The significance of OCR technology extends beyond mere text digitization; it plays a pivotal role in
interpreting and understanding our immediate environment. In the scenario where text appears in varying
forms and conditions, OCR is instrumental. Its applications span across diverse sectors such as autonomous
navigation, where it aids in interpreting road signs, to healthcare, where it assists in reading handwritten
notes and prescriptions [12]. Figure 1 shows the distinction between OCR documents and the identification
of text within natural settings. Figure 1(a) depicts a basic OCR document image, while Figure 1(b) represents an image
from a natural scene.
However, the task of recognizing classified text in natural scenes presents unique challenges. Unlike
standard documents, text in natural environments is subject to a plethora of variables including varying
lighting conditions, diverse backgrounds, and a wide range of font styles and sizes. This variability can
significantly impede the accuracy of OCR systems. Moreover, classified text, often characterized by its
sensitive nature, demands not only high accuracy but also robustness and reliability in recognition [13].
Figure 1. The distinction between OCR documents and the identification of text within natural settings:
(a) an abstract of a formal letter and (b) captured images from real-world scenes
To address these challenges, several advanced techniques have been developed, among which the
maximum stable extremal regions (MSER) and Canny edge detection algorithms stand out for their
effectiveness in detecting and segmenting text in complex natural scenes. MSER excels in identifying
coherent regions within an image that are stable across a range of thresholds, making it particularly suited for
recognizing text areas that stand out from their backgrounds. The Canny edge detector complements this by
identifying the boundaries of text through gradient analysis, highlighting edges with high contrast to the
surrounding area. The synergy between MSER’s region-based detection and Canny edge detection’s
fine-grained boundary delineation offers a robust foundation for overcoming the intrinsic challenges of text
detection in natural scenes.
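To make this pairing concrete, the sketch below shows one common way to combine the two detectors with OpenCV: stable MSER regions are intersected with dilated Canny edges to form candidate text masks. The threshold values, kernel size, and area filter are illustrative assumptions rather than the configuration used in this paper.

```python
import cv2
import numpy as np

def candidate_text_regions(image_path):
    """Intersect MSER regions with dilated Canny edges to propose text boxes.

    All numeric settings below are illustrative assumptions.
    """
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)

    # MSER: connected regions that stay stable across a range of thresholds.
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    mser_mask = np.zeros_like(gray)
    for pts in regions:
        cv2.fillPoly(mser_mask, [pts.reshape(-1, 1, 2)], 255)

    # Canny: high-contrast boundaries such as character strokes.
    edges = cv2.Canny(gray, 100, 200)
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=1)

    # Keep only stable regions that also coincide with strong edges.
    text_mask = cv2.bitwise_and(mser_mask, edges)
    contours, _ = cv2.findContours(text_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 20]
```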
Given the complexity of natural scene environments and the critical role OCR plays in interpreting
them, this study aims to conduct a comprehensive evaluation of various OCR models. Specifically, it seeks to
assess these models based on their adaptability to different environmental conditions, accuracy in recognizing
text amidst the myriad challenges posed by natural scenes, and reliability in delivering consistent results
across diverse datasets. By systematically comparing the performance of leading OCR models, this research
endeavours to provide insights into their strengths and limitations, and the technique for selecting the most
appropriate models for specific applications and setting the stage for the development of more advanced,
hybrid OCR systems tailored to the nuanced requirements of text extraction and recognition in natural scenes.
Our contribution to this research is as follows: i) addresses the gaps in OCR research, particularly in
the application of OCR technology in natural scenes, which has not been extensively explored compared to
its use in controlled environments; ii) aims to reassess existing OCR models, examining their performance
and suitability for recognizing classified text in uncontrolled, natural scenes; iii) provides valuable insights
that can guide future advancements in OCR technology, focusing on enhancing its applicability and
reliability in real-world scenarios; and iv) findings from this study are expected to benefit various
applications that rely on accurate recognition of classified text in natural settings, contributing to the
development of reliable OCR systems.
The rest of the study is as follows: section 2 is on related works in scene-text detection of various
OCR models. We introduce the methodology in section 3. The experimental results, benchmarked against
various state-of-the-art methods, are detailed in section 4. Lastly, section 5 presents the conclusions and
discusses potential directions for future research.
2. RELATED WORKS
This section provides a critical analysis of recent developments in OCR for natural scene text
recognition, highlighting advancements, challenges, and the innovative learning techniques introduced by
state-of-the-art models. The field has evolved significantly, with various approaches aimed at improving the
efficiency in text detection and recognition. However, the diversity of real-world environments still presents
challenges that many existing models do not adequately address. Despite advancements, challenges related to
complex environments and varying text characteristics continue to drive innovation in the field.
2.1. Text detection in natural scenes
The field of text detection in natural scenes has seen various approaches, especially in dealing with
the intricacies of varying environments. For instance, texture-based methods have been widely used, but their
reliance on handcrafted features limits their adaptability to dynamic real-world scenarios. These methods
treat text as a distinct kind of texture and typically employ techniques such as the fast
Fourier transform (FFT) [14], discrete cosine transform (DCT), wavelets, and Gabor filters to extract texture
characteristics. They use sliding windows to scan the image for potential text blocks and then
employ classifiers to ascertain text locations. The approach of Deng et al. [15], which uses rectangular bounding boxes
with vector regression, enhances detection in landscape images but lacks robustness when dealing with
arbitrary text shapes, which are common in natural scenes. This method employs vector regression for text
detection in the wild, underscoring the need for precision in bounding box generation.
In recent years, deep learning-based approaches have yielded promising outcomes in detecting text
within natural scenes. Methods such as efficient and accurate scene text detector (EAST), TextSnake, and
character-region awareness for text detection (CRAFT), both single-stage and two-stage, have shown
improved adaptability to diverse conditions in complex environments. The objective of scene text detection is
to create algorithms capable of reliably detecting and labeling text with bounding boxes in uncontrolled
settings like street signs, billboards, or license plates. Deep learning-based approaches for scene text
detection can be broadly classified into two categories: single-stage and two-stage methods. Single-stage
methods predict both the bounding boxes and the corresponding text labels in a single step, whereas
two-stage methods initially generate candidate regions and then classify these regions as text or non-text [16].
The single-stage methods include EAST, TextBoxes++, and TextSnake. EAST is a fully convolutional
network (FCN)-based detector that directly predicts the geometry of text regions and the corresponding text labels
in one step. TextBoxes++ is an improved version of TextBoxes that uses a more efficient backbone network
and a novel feature fusion module. TextSnake is a segmentation-based method that predicts the text center
line and the corresponding text orientation and width.
The two-stage methods include faster region-based convolutional neural network (R-CNN), mask
R-CNN, and CRAFT. Faster R-CNN is a region proposal-based method that first generates a set of candidate
regions and then classifies them as text or non-text regions. Mask R-CNN extends faster R-CNN by adding a
mask branch that predicts the pixel-level segmentation of text regions. CRAFT is a segmentation-based
method that predicts the character regions and the corresponding affinity regions [17].
The strengths of these models lie in their accuracy and detection speed in well-lit, controlled
environments, but they often struggle in low-light or complex backgrounds. Prior-guided dynamic tunable
network (PDTNet), the model proposed in this study, addresses these shortcomings by incorporating
advanced preprocessing techniques and leveraging convolutional layers with rectified linear unit (ReLU)
activations to enhance text detection in more challenging scenes, thereby offering a more reliable solution.
These layers also improve feature extraction and classification
accuracy. The model is particularly adept at handling diverse environmental conditions and varying text
characteristics, making it an ideal solution for real-world applications. By incorporating advanced preprocessing
techniques and a specialized spell-checking mechanism, PDTNet achieves superior accuracy and reliability
compared to traditional models. This positions PDTNet as a significant contribution to the ongoing evolution
of OCR technology, addressing both the challenges and opportunities in the field.
2.2. Challenges and innovations in scene text recognition
Text recognition in natural scenes presents unique challenges, with varying font sizes, orientations,
and complex backgrounds. Existing OCR models struggle with certain aspects such as curved text and noise
in low-quality images, and methods like CRAFT and TextSnake provide partial solutions but are not robust
enough for highly distorted or noisy text. Agughasi et al. [18] introduce an intuitive method for multi-
orientation text detection, inspired by the two-stage R-CNN framework. This method associates each feature
map location with a single reference box, aiming for high target box coverage. In contrast, Yadav et al. [19]
present a unified system for handling scientific document images, highlighting the variety of components,
such as images, tables, text, and expressions, that complicate text recognition in these documents.
One of the major limitations of current approaches is their inability to effectively handle complex
backgrounds or low-quality images. For example, arbitrary-shaped text is handled by contour-based methods
[20] and polygon-based methods [21], but these techniques are not fully adaptable to real-world conditions
where text may be curved or partially obscured [22]. To overcome this, our proposed PDTNet incorporates a
spell-checking mechanism to enhance the post-processing of recognized text, significantly improving
recognition accuracy in cases where traditional methods fail.
Moreover, the presence of noise, low resolution, and motion blur further complicates text
recognition, since these factors degrade text visibility and readability and make it harder for OCR models to
recognize the text correctly. Some studies have proposed enhancing image quality through techniques such as
super-resolution [24], deblurring, and denoising [25], which improve text clarity and contrast and facilitate the
subsequent recognition step, yet these approaches require significant computational power [23]. PDTNet
addresses this by employing a lightweight preprocessing technique, allowing for better handling of
low-quality input images without the need for excessive computational resources. In the same vein, other
studies leverage deep learning techniques, such as CNNs [26], recurrent neural networks (RNNs) [27], and
attention mechanisms [28], to enhance text detection and recognition performance. These methods learn
high-level features and sequential dependencies from the image data, enhancing the text representation and
interpretation.
2.3. Advancements and applications
The advancements in OCR technology have significantly improved not only text recognition
accuracy but also its practical applications across various fields. For example, the work of
Ronneberger et al. [29] on offline handwritten text recognition (HTR) in historical documents addresses the
challenges associated with mislabeling in training sets, thereby enhancing the accuracy of document
digitization. Similarly, research by Law and Deng [30] on automatic number plate recognition (ANPR)
demonstrates the effectiveness of a single neural network in recognizing mixed-style license plates,
highlighting the adaptability of OCR technology in diverse real-world scenarios.
Another application of OCR technology is in the field of print media conversion, which involves
transforming printed documents, such as books, magazines, and newspapers, into digital formats, such as
PDFs, e-books, and web pages [31]. This application can benefit various sectors, such as education, research,
and entertainment, by making the content more accessible, searchable, and shareable. Moreover, OCR
technology can also be used for enhancing the security and efficiency of various processes, such as identity
verification, document authentication, and data extraction. For example, OCR can be used to scan and
validate passports, driver's licenses, and other identification documents, reducing the risk of fraud and human
error [32]. Some examples of OCR software that can perform these tasks are Veryfi, ABBYY FineReader,
and Readiris [33].
Further, OCR technology has been widely applied to improve the quality and efficiency of various
domains, such as historical document analysis, print media conversion, and document processing. It can also
enhance the security and accuracy of various processes, such as identity verification, document
authentication, and data extraction [34]. From enhancing detection accuracy to contributing to assistive
technologies, these studies lay a foundational framework for the proposed research, emphasizing the
complexities and the need for advanced OCR models in natural scenes. However, OCR technology still faces
some challenges, such as low-quality images, complex layouts, and diverse languages, which require further
research and development [35].
3. METHOD
This paper introduces a novel approach for detecting and recognizing text within images captured in
natural environments using the PDTNet model. PDTNet leverages a combination of text detection algorithms
and deep learning techniques to enhance accuracy and reliability in recognizing text amidst complex
backgrounds. The PDTNet model’s architecture is composed of various layers, each with unique dimensions
and functionalities. The initial layer is a convolutional layer with an input activation volume of 28×28×16,
generating 2,368 output values [36]. This is succeeded by another convolutional layer, doubling the depth to
32 channels and producing 4,640 outputs. A detailed explanation of the model is given in subsequent sections, and
the overall architecture is depicted in Figure 2. Once the text is extracted from an image, the OCR technology
is used to convert image-based text into machine-readable text. To ensure the accuracy of the recognized text,
a spell-checking mechanism, specifically designed for OCR outputs, is employed.
Figure 2. The proposed methodology for the OCR model for scene text recognition
3.1. Dataset description and image enhancement
Our study emphasizes the importance of diverse datasets to reflect real-world conditions better. For
this reason, we used three datasets: ICDAR2015, ICDAR2017, and our custom PDT2023 dataset. The
ICDAR2015 [37] and ICDAR2017 [38] datasets comprise 1,670 camera-captured images and 18,000 images,
respectively, and have been used as benchmarks in the robust OCR competition for computer vision tasks.
Additionally, the PDT2023 dataset, which contains 300 images captured in and around the Mysore region of
India, was employed to evaluate the robustness of the proposed method. To ensure consistency, the images
were preprocessed and resized to 256×256 pixels. All images were in .jpg format, as shown in Figure 3.
Figure 3(a) depicts a typical camera-captured image on a busy Indian road, while Figure 3(b) depicts the
variability in complex backgrounds. The parameters chosen for image augmentation on the PDT2023 dataset
are presented in Table 1, and a sketch of how these settings map onto a standard augmentation pipeline
follows the table. This diversity allows our model to perform effectively across different
environmental conditions and highlights its adaptability in comparison with relevant methods.
Figure 3. Sample images from the PDT2023 dataset: (a) a typical camera-captured image on a busy
Indian road and (b) variability in complex backgrounds
Table 1. Parameters for image augmentation on the PDT2023 dataset
Method              Default   Augmented
Rotation            -         30°, 45°, 60°
Rescale             -         1./255
Zoom range          -         0.25
x-Shift, y-Shift    None      0.1
x-Scale, y-Scale    None      0.1
Adjusted image      Varies    256×256
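The augmentation settings in Table 1 can be approximated with Keras' ImageDataGenerator, as sketched below. The paper does not provide the actual augmentation code, so the directory layout is hypothetical; rotation_range approximates the three fixed angles with a continuous ±60° range, and zoom_range stands in for the separate x/y scale factors.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative mapping of Table 1 onto an ImageDataGenerator pipeline.
datagen = ImageDataGenerator(
    rescale=1. / 255,        # Rescale: 1./255
    rotation_range=60,       # approximates the 30°/45°/60° rotations
    zoom_range=0.25,         # Zoom range: 0.25 (also covers x/y scaling)
    width_shift_range=0.1,   # x-Shift: 0.1
    height_shift_range=0.1,  # y-Shift: 0.1
)

train_iter = datagen.flow_from_directory(
    "PDT2023/train",         # hypothetical directory layout
    target_size=(256, 256),  # Adjusted image: 256×256
    batch_size=10,
    class_mode="categorical",
)
```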
3.2. PDTNet model
The PDTNet model’s architecture is composed of various layers, each with unique dimensions and
functionalities. The initial layer is a convolutional layer with an input activation volume of 28×28×16, which
generates 2,368 output values. This is succeeded by another convolutional layer, doubling the depth to
32 channels, producing 4,640 outputs. Subsequent to these layers, the model applies a ReLU activation
function without altering the output count. This is followed by a batch normalization step that processes the
32 channels with an output size of 128, aimed at stabilizing the learning process.
As the model progresses, the spatial dimensions are reduced to 14×14 while maintaining
32 channels, which increases the output to 9,248. ReLU activation and batch normalization are applied again,
followed by a dropout layer to reduce overfitting, though without changing the output size. The model has an
additional convolutional layer with 64 filters, resulting in 18,496 outputs, and then goes through the ReLU
and batch normalization steps. This pattern is retained as the depth doubles to 128 channels, first maintaining
then reducing the spatial dimensions, which culminates in an output of 147,584 from a convolutional layer
with an activation shape of 4×4×128.
The network’s output is then flattened in preparation for the fully connected layers, starting with a
dense layer containing 512 neurons, resulting in 1,049,088 outputs. This layer is followed by a ReLU
activation function and a batch normalization layer. A dropout layer is applied next, preceding the final dense
layer, which consists of 63 neurons. The final layer uses a SoftMax activation function to produce 63 class
probabilities, each corresponding to a potential classification within the model's structure. This process is
illustrated in Figure 4.
Figure 4. The block diagram of the PDTNet showing important layers
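If the output values quoted above are read as trainable parameter counts (2,368; 4,640; 9,248; 18,496; 147,584; 1,049,088), the stack in Figure 4 can be reconstructed almost exactly in Keras, as sketched below. This is a reconstruction, not the authors' released code: the 28×28×3 input, the 7×7 first kernel, the pooling positions, the strided final convolution, and the dropout rates are assumptions chosen to reproduce the reported shapes and counts.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_pdtnet(num_classes=63, input_shape=(28, 28, 3)):
    """Reconstruction of the PDTNet layer stack from the reported counts."""
    return models.Sequential([
        layers.Conv2D(16, 7, padding="same", input_shape=input_shape),  # 2,368 params
        layers.Conv2D(32, 3, padding="same"),                           # 4,640 params
        layers.ReLU(),
        layers.BatchNormalization(),                   # 128 params for 32 channels
        layers.MaxPooling2D(),                         # 28×28 -> 14×14
        layers.Conv2D(32, 3, padding="same"),                           # 9,248 params
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.Dropout(0.25),                          # rate assumed
        layers.Conv2D(64, 3, padding="same"),                           # 18,496 params
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),                         # 14×14 -> 7×7
        layers.Conv2D(128, 3, padding="same"),         # depth doubles to 128
        layers.Conv2D(128, 3, padding="same", strides=2),   # 147,584 params, 4×4×128
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.Flatten(),                              # 4×4×128 = 2,048 features
        layers.Dense(512),                                              # 1,049,088 params
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.Dropout(0.5),                           # rate assumed
        layers.Dense(num_classes, activation="softmax"),  # 63 class probabilities
    ])
```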
3.3. Optical character recognition and spell-checking mechanism
OCR is an important software application that specializes in recognizing text embedded within
images and video content. It converts various document types, such as PDFs, scanned physical documents,
and images captured by cameras, into machine-editable text, as depicted in Figure 5. However, the OCR
process is not immune to errors, necessitating the use of error correction tools. To address misspellings that
may arise during OCR, a spell-checking method is incorporated within the proposed PDTNet system.
This spell-checker plays a significant role in achieving high levels of recognition accuracy. It functions by
aligning unrecognized characters with words from its dictionary that have similar spellings. For example, if
the OCR software misreads the character sequence “tne,” the spell checker identifies the misplaced “n” and
corrects it to “h,” enhancing the overall reliability of the OCR system. Figure 5 presents a detailed
breakdown of the OCR process and the application of the spell-checking mechanism. Figure 5(a) shows the
input image containing text, Figure 5(b) displays the initial output from the OCR process, and Figure 5(c)
demonstrates the corrected output after the spell-checking is applied.
Figure 5. Process of text identification through OCR for (a) area containing text: input image, (b) the result of
OCR application on the input image, and (c) the results of OCR correction
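The paper does not name the spell-checking library used, so the snippet below illustrates the idea with a generic closest-match lookup against a small dictionary; the word list and similarity cutoff are assumptions, and a production system would use a much larger, domain-aware vocabulary.

```python
import difflib

# Hypothetical OCR-oriented word list; real systems use far larger dictionaries.
DICTIONARY = ["the", "fire", "extinguisher", "break", "glass"]

def correct_token(token, dictionary=DICTIONARY, cutoff=0.6):
    """Replace an unrecognized token with its closest dictionary word.

    difflib ranks candidates by string similarity, so "tne" maps back to
    "the", mirroring the n -> h correction described above.
    """
    if token.lower() in dictionary:
        return token
    matches = difflib.get_close_matches(token.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_ocr_output(text):
    return " ".join(correct_token(t) for t in text.split())

print(correct_ocr_output("Fire Extinguisner Break Galss"))
# -> "Fire extinguisher Break glass" (corrected tokens come back lowercased)
```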
3.4. Experimental configuration
The PDTNet model was trained using a 3×3 filter with a single stride. For this purpose, 240 images
were allocated for the training set, and 60 were set aside for the validation set, adhering to the conventional
80:20 distribution for training and validation datasets. Standardization of the input features between 0 and 1
was implemented to enhance the speed of the network training. To enhance the precision of the deep neural
networks, data augmentation techniques were applied, including random rotations of the images by 30, 45,
and 60 degrees. The ICDAR2015 and ICDAR2017 benchmark datasets contain 1,670 and 18,000 images, respectively.
For training, only the cropped images of words that underwent data augmentation were included in the
compilation of the training set. The optimization process employed the adaptive moment estimation (ADAM)
optimizer with a learning rate of 10⁻⁵. The network was trained on an NVIDIA 1060 GPU with 24 GB
of memory, using batches of ten images at a time. All experimental procedures were conducted in Python,
utilizing the TensorFlow library as the framework for training the deep learning models. The method took
0.5 seconds per image during training and 0.4 seconds per image during testing.
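A hedged sketch of this training setup is given below, reusing build_pdtnet from the section 3.2 sketch and the augmented train_iter from section 3.1. The categorical cross-entropy loss, validation directory layout, and epoch count are assumptions not stated in the paper.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Validation images are only rescaled; the directory layout is hypothetical.
val_iter = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    "PDT2023/val", target_size=(256, 256), batch_size=10, class_mode="categorical")

model = build_pdtnet(num_classes=63)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # ADAM, lr = 10^-5
    loss="categorical_crossentropy",                         # assumed loss
    metrics=["accuracy"],
)

history = model.fit(
    train_iter,                # 240 augmented training crops in batches of ten
    validation_data=val_iter,  # 60 held-out validation images
    epochs=50,                 # loss reported to converge around epoch 50
)
```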
4. RESULTS AND DISCUSSION
In this section, we evaluated the performance of the proposed system using the
publicly accessible ICDAR2015 dataset. The system's effectiveness was analyzed and compared with similar
systems, and the results obtained from detailed experimentation are presented in Table 2. The evaluation
metrics included accuracy, precision, and recall.
Accuracy measures the proportion of correct predictions out of the total number of predictions, and
is defined as (1).

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (1)

Precision indicates the proportion of true positive identifications out of all identified positives and is
calculated as (2).

$Precision = \frac{TP}{TP + FP}$ (2)

where TP is the number of true positives and FP the number of false positives.

Recall (sensitivity) measures the proportion of actual positives correctly identified and is defined as (3).

$Recall = \frac{TP}{TP + FN}$ (3)

where FN is the number of false negatives.
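For concreteness, (1)-(3) can be computed directly from confusion-matrix counts as below; the counts in the example are made-up illustrative numbers, not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall as defined in (1)-(3)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Illustrative counts only, not taken from the experiments:
acc, prec, rec = classification_metrics(tp=85, tn=90, fp=10, fn=15)
print(f"accuracy={acc:.2%}, precision={prec:.2%}, recall={rec:.2%}")
# accuracy=87.50%, precision=89.47%, recall=85.00%
```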
4.1. Results of using PDTNet across benchmark datasets
This section presents the results of the proposed technique, PDTNet, evaluated across various
benchmark datasets. The images selected for testing represent a diverse range of illumination conditions and
scene complexities to ensure a thorough evaluation of the model’s performance. The results demonstrate
PDTNet's capability to effectively detect and recognize text under various environmental challenges,
showcasing its robustness and adaptability.
Table 2 presents a straightforward comparison of text recognition performance on two datasets,
showing the OCR model’s ability to detect text using bounding boxes. The input images for each dataset and
their corresponding OCR outputs with bounding boxes are provided to illustrate the model's effectiveness in
this domain. Further evaluation was done on our dataset, PDT2023. Preliminary observations from
this dataset show the model’s initial performance in detecting and recognizing text within the input images.
Upon iterative refinement of the model, marked improvements in detection accuracy were observed,
demonstrating enhanced accuracy at recognizing text in English, Kannada, and Tamil, with the results
presented in Table 3.
Table 2. Segmentation and recognition results of the three different datasets
(image-only table: input images alongside the OCR detections with bounding-box predictions for
ICDAR2015 and ICDAR2017; the images are not reproduced here)
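To illustrate the bounding-box visualization summarized in Table 2, the snippet below draws word-level boxes using Tesseract (one of the engines compared in Tables 4 and 5) as a stand-in OCR engine; the input file name and the confidence threshold are assumptions, and PDTNet itself is not invoked here.

```python
import cv2
import pytesseract

image = cv2.imread("scene_sample.jpg")  # hypothetical input image
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for i, word in enumerate(data["text"]):
    if word.strip() and float(data["conf"][i]) > 50:  # confidence threshold assumed
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image, word, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, (0, 255, 0), 1)

cv2.imwrite("scene_sample_with_boxes.jpg", image)
```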
Table 3. Performance metrics on three different datasets
Dataset Accuracy (%) Precision (%) Recall (%)
ICDAR2015 98.20 98.30 97.70
ICDAR2017 96.30 97.40 92.50
PDT2023 89.20 85.40 88.40
Table 3 presents the performance metrics of PDTNet across three different datasets, highlighting its
effectiveness in text recognition. For the ICDAR2015 dataset, the model achieved an accuracy of 98.20%,
precision of 98.30%, and recall of 97.70%, which are the highest metrics across all datasets tested. This
suggests that the model is particularly effective on this dataset, outperforming its performance on
ICDAR2017 and PDT2023, where lower metrics were recorded.
4.2. Results of training and validation accuracy and loss
The training and validation accuracy and loss curves for two state-of-the-art models, VGG16 and
ResNet18, are presented in this section. These models were chosen due to their demonstrated efficiency and
high performance on benchmark datasets, making them suitable candidates for comparative analysis.
Figures 6 and 7 depict the segmentation results, highlighting the performance of each model throughout the
training process. Particular emphasis is placed on their ability to minimize loss and enhance accuracy over
time. This comparison offers valuable insights into the strengths and limitations of each model when
addressing complex text recognition tasks.
Figure 6. The training and validation accuracies of VGG16 in comparison with the ground truth data
Figure 7. The training and validation accuracies of ResNet50 in comparison with the ground truth data
From Figures 6 and 7, it can be observed that the second model (ResNet50) performed better than VGG16,
since it predicted exact bounding boxes that contained the text, unlike VGG16, which generated bounding
boxes with a few misclassifications [39]. Overall, the loss gradually decreased over the training epochs until
convergence around the 50th epoch. Better accuracy was obtained through the choice of hyperparameters
and variations of dropout regularization, with the results presented in Tables 4 to 6.
Table 4. Comparative analysis of various OCR models
(domain: cargo containers, stationeries, named logos, license plate numbers, household inscriptions)
Metrics/Features      EasyOCR       Tesseract   Surya      DOCTR
Accuracy              92%           88%         80%        95%
Precision             91%           90%         85%        96%
Recall                89%           87%         82%        94%
Speed                 Fast          Moderate    Slow       Fast
Resource use          Low           Moderate    High       Low
Ease of integration   Very easy     Easy        Moderate   Very easy
Adaptability          Excellent     Good        Poor       Excellent
Language support      Multilingual  Extensive   Limited    Multilingual
Table 5. Text recognition accuracy across different number of RNN models with same number of units
Metrics/Features     EasyOCR [1]  Tesseract [2]  Surya [4]  DOCTR [39]  Proposed (PDTNet)
Accuracy (%)         92           85             80         95          96
Precision (%)        91           90             85         96          94
Recall (%)           89           87             82         94          92
Speed                Fast         Moderate       Slow       Very fast   Very fast
Resource use         Low          Moderate       High       Low         Moderate
Ease of integration  Very easy    Easy           Moderate   Very easy   Easy
Adaptability         Excellent    Good           Poor       Excellent   Superior
Table 6. Comparison with relevant methods
Author                   Methods               Accuracy  Precision  Recall
Ma et al. [40]           DocUNet               0.41      NR         NR
Zhou et al. [41]         EAST                  NR        83.3       78.3
Pratikakis et al. [42]   AdaptiveBinarization  NR        NR         17.53
Ma et al. [43]           RRPN                  91.2      90.0       72.0
Proposed                 PDTNet                98.2      85.0       75.0
NR: not reported
4.3. Comparative analysis
A comparative analysis was conducted on the methods proposed in [40]–[43] to evaluate their
performance in text recognition tasks. The accuracy, precision, and recall metrics were computed for each
method to provide a comprehensive evaluation of their effectiveness, following (1)-(3) outlined earlier in
the paper. This analysis highlights the strengths and limitations of each approach, offering a clearer
understanding of how they perform relative to each other in various OCR tasks.
Table 4 illustrates that VGG-16 achieves superior precision over ResNet18 when utilizing 256 RNN
units. Yet, increasing the RNN units to 512 allows ResNet18 to surpass VGG-16 in recall metrics
significantly. An expansion in bidirectional long short-term memory (BiLSTM) units from 256 to 512
enhances the precision for VGG-16 but only nominally improves recall, suggesting an improved detection of
true positives without a proportionate increase in the capture of all positive instances.
According to Table 5, consistent unit numbers reveal BiLSTM's superior recognition accuracy over
BiGRU, which supports BiLSTM’s proficiency in handling long-term dependencies in the context of this
task. ResNet50 coupled with BiLSTM and 512 units registers the highest scores in accuracy and precision,
indicating the potential effectiveness of deeper network architectures for complex tasks such as text
recognition. Table 6 shows that the PDTNet model, as proposed, notably achieves an accuracy of 98.2%,
which is a standout result compared to other advanced methods. Despite this high accuracy, its precision and
recall are not the leading scores, indicating possible areas for refinement in identifying true positives and
capturing all positive instances. These results provide a comprehensive assessment of the PDTNet's
capabilities in comparison to other models and various configurations, underscoring its effectiveness and
highlighting opportunities for further enhancements.
5. CONCLUSION AND FUTURE SCOPE
This study introduces PDTNet, a novel OCR model that leverages deep learning techniques to tackle
the challenges of text detection and recognition in natural scenes. The results indicate that PDTNet surpasses
existing models in terms of accuracy, precision, and recall, efficiently addressing common OCR challenges
such as low contrast and complex backgrounds. Despite these advancements, the study identifies areas for
further improvement, particularly in precision and recall metrics. The evaluation, conducted on datasets like
ICDAR2015, ICDAR2017, and PDT2023, highlights the model's superior performance, but also points out
the need for enhancements in language and environmental adaptability. Future research will focus on refining
PDTNet to enhance true positive identification and comprehensive capture of all positive instances, thereby
improving precision and recall metrics. Additionally, expanding the model's capability to recognize and
process a broader range of languages will increase its applicability in multilingual environments. Developing
techniques to improve the model’s performance under varied and challenging environmental conditions will
make it more robust in real-world applications. Exploring the integration of PDTNet with other OCR models
to leverage their strengths is also a priority, aiming for higher accuracy and adaptability in practical
applications. The findings from this study contribute significantly to the field of OCR technology, providing
valuable insights into model selection for specific applications and paving the way for the development of
more advanced hybrid OCR systems. This research underscores the importance of context-aware OCR
solutions, emphasizing the need for continued innovation to meet the requirements of text extraction and
recognition in natural scenes. By focusing on these areas, the study aims to ensure that PDTNet remains at
the forefront of OCR technology advancements, addressing both current challenges and future opportunities
in the field.
REFERENCES
[1] M. A. M. Salehudin et al., “Analysis of optical character recognition using EasyOCR under image degradation,” Journal of
Physics: Conference Series, vol. 2641, no. 1, Nov. 2023, doi: 10.1088/1742-6596/2641/1/012001.
[2] S. Kumar, N. K. Sharma, M. Sharma, and N. Agrawal, “Text extraction from images using Tesseract,” in Deep Learning
Techniques for Automation and Industrial Applications, John Wiley & Sons, Ltd, 2024, pp. 1–18. doi:
10.1002/9781394234271.ch1.
[3] D. Shruthi, H. K. Chethan, and V. I. Agughasi, “Effective approach for fine-tuning pre-trained models for the extraction of texts
from source codes,” in ITM Web of Conferences, 2024, vol. 65, doi: 10.1051/itmconf/20246503004.
[4] L. Mosbah, I. Moalla, T. M. Hamdani, B. Neji, T. Beyrouthy, and A. M. Alimi, “ADOCRNet: A deep learning OCR for Arabic
documents recognition,” IEEE Access, vol. 12, pp. 55620–55631, 2024, doi: 10.1109/ACCESS.2024.3379530.
[5] V. I. Agughasi and M. Srinivasiah, “Semi-supervised labelling of chest x-ray images using unsupervised clustering for ground-
truth generation,” Applied Engineering and Technology, vol. 2, no. 3, pp. 188–202, 2023, doi: 10.31763/aet.v2i3.1143.
[6] A. V. Ikechukwu and S. Murali, “i-Net: a deep CNN model for white blood cancer segmentation and classification,” International
Journal of Advanced Technology and Engineering Exploration, vol. 9, no. 95, pp. 1448–1464, 2022, doi:
10.19101/IJATEE.2021.875564.
[7] S. Bhimshetty and A. V. Ikechukwu, “Energy-efficient deep Q-network: reinforcement learning for efficient routing protocol in
wireless internet of things,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 33, no. 2, pp. 971–980,
2024, doi: 10.11591/ijeecs.v33.i2.pp971-980.
[8] A. V. Ikechukwu and S. Murali, “XAI: An explainable ai model for the diagnosis of COPD from CXR images,” in 2023 IEEE
2nd International Conference on Data, Decision and Systems (ICDDS), Mangaluru, India, 2023, pp. 1-6, doi:
10.1109/ICDDS59137.2023.10434619.
[9] A. V. Ikechukwu, S. Murali, and B. Honnaraju, “COPDNet: An explainable ResNet50 model for the diagnosis of COPD from
CXR images,” in 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), Mysore,
India, 2023, pp. 1-7, doi: 10.1109/INDISCON58499.2023.10270604.
[10] A. V. Ikechukwu, “The superiority of fine-tuning over full-training for the efficient diagnosis of COPD from CXR images,”
Inteligencia Artificial, vol. 27, no. 74, pp. 62–79, 2024, doi: 10.4114/intartif.vol27iss74pp62-79.
[11] A. V. Ikechukwu and S. Murali, “CX-Net: an efficient ensemble semantic deep neural network for ROI identification from chest-
x-ray images for COPD diagnosis,” Machine Learning: Science and Technology, vol. 4, no. 2, 2023, doi: 10.1088/2632-
2153/acd2a5.
[12] R. Jalloul, C. H. Krishnappa, V. I. Agughasi, and R. Alkhatib, “Enhancing Early Breast Cancer Detection with Infrared
Thermography: A Comparative Evaluation of Deep Learning and Machine Learning Models,” Technologies, vol. 13, no. 1, Art.
no. 1, Jan. 2025, doi: 10.3390/technologies13010007.
[13] Y. Tang and X. Wu, “Scene text detection using superpixel-based stroke feature transform and deep learning based region
classification,” IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2276–2288, 2018, doi: 10.1109/TMM.2018.2802644.
[14] P. Prakash, S. K. Y. Hanumanthaiah, and S. B. Mayigowda, “CRNN model for text detection and classification from natural
scenes,” IAES International Journal of Artificial Intelligence, vol. 13, no. 1, pp. 839–849, 2024, doi: 10.11591/ijai.v13.i1.pp839-
849.
[15] D. Deng, H. Liu, X. Li, and D. Cai, “PixelLink: Detecting scene text via instance segmentation,” 32nd AAAI Conference on
Artificial Intelligence, AAAI 2018, pp. 6773–6780, 2018, doi: 10.1609/aaai.v32i1.12269.
[16] X. Li, “A deep learning-based text detection and recognition approach for natural scenes,” Journal of Circuits, Systems and
Computers, vol. 32, no. 5, 2023, doi: 10.1142/S0218126623500731.
[17] I. Marthot-Santaniello, M. T. Vu, O. Serbaeva, and M. Beurton-Aimar, “Stylistic similarities in Greek Papyri based on letter
shapes: a deep learning approach,” Document Analysis and Recognition – ICDAR 2023 Workshops, pp. 307–323, 2023, doi:
10.1007/978-3-031-41498-5_22.
[18] V. I. Agughasi, S. Bhimshetty, R. Deepu, and M. V. Mala, “Advances in thermal imaging: a convolutional neural network
approach for improved breast cancer diagnosis,” International Conference on Distributed Computing and Optimization
Techniques, ICDCOT 2024, 2024, doi: 10.1109/ICDCOT61034.2024.10515323.
[19] A. Yadav, S. Singh, M. Siddique, N. Mehta, and A. Kotangale, “OCR using CRNN: a deep learning approach for text
recognition,” 2023 4th International Conference for Emerging Technology, INCET 2023, 2023, doi:
10.1109/INCET57972.2023.10170436.
[20] R. Najam and S. Faizullah, “Analysis of recent deep learning techniques for arabic handwritten-text OCR and post-OCR
correction,” Applied Sciences, vol. 13, no. 13, 2023, doi: 10.3390/app13137568.
[21] P. Chhabra, A. Shrivastava, and Z. Gupta, “Comparative analysis on text detection for scenic images using EAST and CTPN,” in
7th International Conference on Trends in Electronics and Informatics, ICOEI 2023 - Proceedings, 2023, pp. 1303–1308, doi:
10.1109/ICOEI56765.2023.10125894.
[22] A. Rahman, A. Ghosh, and C. Arora, “UTRNet: high-resolution Urdu text recognition in printed documents,” Document Analysis
and Recognition - ICDAR 2023, pp. 305–324, 2023, doi: 10.1007/978-3-031-41734-4_19.
[23] S. Kaur, S. Bawa, and R. Kumar, “Heuristic-based text segmentation of bilingual handwritten documents for Gurumukhi-Latin
scripts,” Multimedia Tools and Applications, vol. 83, no. 7, pp. 18667–18697, 2024, doi: 10.1007/s11042-023-15335-8.
[24] A. V. Ikechukwu, “Leveraging transfer learning for efficient diagnosis of COPD using CXR images and explainable AI
techniques,” Inteligencia Artificial, vol. 27, no. 74, pp. 133–151, 2024, doi: 10.4114/intartif.vol27iss74pp133-151.
[25] S. Long, X. He, and C. Yao, “Scene text detection and recognition: the deep learning era,” International Journal of Computer
Vision, vol. 129, no. 1, pp. 161–184, 2021, doi: 10.1007/s11263-020-01369-0.
[26] T. Khan, R. Sarkar, and A. F. Mollah, “Deep learning approaches to scene text detection: a comprehensive review,” Artificial
Intelligence Review, vol. 54, no. 5, pp. 3239–3298, 2021, doi: 10.1007/s10462-020-09930-6.
[27] E. Hassan and V. L. Lekshmi, “Scene text detection using attention with depthwise separable convolutions,” Applied Sciences,
vol. 12, no. 13, 2022, doi: 10.3390/app12136425.
[28] X. Liu, G. Meng, and C. Pan, “Scene text detection and recognition with advances in deep learning: a survey,” International
Journal on Document Analysis and Recognition, vol. 22, no. 2, pp. 143–162, 2019, doi: 10.1007/s10032-019-00320-5.
[29] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” IEEE Access,
vol. 9, pp. 16591–16603, May 2015, doi: 10.1109/ACCESS.2021.3053408.
[30] H. Law and J. Deng, “CornerNet: detecting objects as paired keypoints,” International Journal of Computer Vision, vol. 128,
no. 3, pp. 642–656, 2020, doi: 10.1007/s11263-019-01204-1.
[31] N. Otsu, “Threshold selection method from gray-level histograms,” IEEE Trans Syst Man Cybern, vol. SMC-9, no. 1, pp. 62–66,
1979, doi: 10.1109/tsmc.1979.4310076.
[32] B. Gatos, I. Pratikakis, and S. J. Perantonis, “Adaptive degraded document image binarization,” Pattern Recognition, vol. 39,
no. 3, pp. 317–327, 2006, doi: 10.1016/j.patcog.2005.09.010.
[33] N. Phansalkar, S. More, A. Sabale, and M. Joshi, “Adaptive local thresholding for detection of nuclei in diversity stained cytology
images,” in ICCSP 2011 - 2011 International Conference on Communications and Signal Processing, 2011, pp. 218–220, doi:
10.1109/ICCSP.2011.5739305.
[34] F. Z. A. Bella, M. El Rhabi, A. Hakim, and A. Laghrib, “An innovative document image binarization approach driven by the non-
local p-Laplacian,” Eurasip Journal on Advances in Signal Processing, vol. 2022, no. 1, 2022, doi: 10.1186/s13634-022-00883-2.
[35] M. Cheriet, J. N. Said, and C. Y. Suen, “A recursive thresholding technique for image segmentation,” IEEE Transactions on
Image Processing, vol. 7, no. 6, pp. 918–921, 1998, doi: 10.1109/83.679444.
[36] Y. C. Wei and C. H. Lin, “A robust video text detection approach using SVM,” Expert Systems with Applications, vol. 39, no. 12,
pp. 10832–10840, 2012, doi: 10.1016/j.eswa.2012.03.010.
[37] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image and
Vision Computing, vol. 22, no. 10 SPEC. ISS., pp. 761–767, 2004, doi: 10.1016/j.imavis.2004.02.006.
[38] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan, “Text flow: A unified text detection system in natural scene images,” in
Proceedings of the IEEE International Conference on Computer Vision, ICCV 2015, pp. 4651–4659, doi:
10.1109/ICCV.2015.528.
[39] P. Batra, N. Phalnikar, D. Kurmi, J. Tembhurne, P. Sahare, and T. Diwan, “OCR-MRD: performance analysis of different optical
character recognition engines for medical report digitization,” Int. J. Inf. Technol., vol. 16, no. 1, pp. 447–455, Jan. 2024, doi:
10.1007/s41870-023-01610-2.
[40] K. Ma, Z. Shu, X. Bai, J. Wang, and D. Samaras, “DocUNet: document image unwarping via a stacked U-Net,” in Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 4700–4709, doi:
10.1109/CVPR.2018.00494.
[41] X. Zhou et al., “EAST: An efficient and accurate scene text detector,” Proceedings - 30th IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2017, pp. 2642–2651, 2017, doi: 10.1109/CVPR.2017.283.
[42] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos, “ICDAR2017 competition on document image binarization (DIBCO 2017),” in
Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2017, vol. 1, pp. 1395–1403, doi:
10.1109/ICDAR.2017.228.
[43] J. Ma et al., “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11,
pp. 3111–3122, 2018, doi: 10.1109/TMM.2018.2818020.
BIOGRAPHIES OF AUTHORS
Puneeth Prakash is an Assistant Professor at Maharaja Institute of Technology
Mysore. He has 8 years of teaching and research experience. He holds a bachelor’s degree in
information science and a master’s in computer science from VTU Belagavi. His key areas of
interest are image processing, machine learning, and computer vision. He mainly works on
scene text-related images and has published papers in national and international journals. He is
keen on multiprogramming paradigm implementation. He can be contacted at email:
puneeth.phd20@gmail.com.
Sharath Kumar Yeliyur Hanumanthaiah is a Professor and Head of the
Department of Information Science and Engineering, Maharaja Institute of Technology
Mysore. His areas of interest are image processing, pattern recognition, and information
retrieval. He has around 12 years of experience in teaching. He completed a B.E. in computer
science and engineering from VTU and an M.Tech. in computer cognition technology from
the University of Mysore. He further completed a Ph.D. at the University of Mysore and has published
50 research articles in reputed conferences and journals. He served on the BoE and BoS of the
University of Mysore from 2016 to 2020. He can be contacted at email:
sharathyhk@gmail.com.

More Related Content

PDF
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
PDF
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
PDF
Optical Character Recognition from Text Image
PDF
E1803012329
PDF
40120140501009
PDF
CRNN model for text detection and classification from natural scenes
PDF
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
PDF
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
Optical Character Recognition from Text Image
E1803012329
40120140501009
CRNN model for text detection and classification from natural scenes
Investigating the Effect of BD-CRAFT to Text Detection Algorithms
INVESTIGATING THE EFFECT OF BD-CRAFT TO TEXT DETECTION ALGORITHMS

Similar to A comparative analysis of optical character recognition models for extracting and classifying texts in natural scenes (20)

PDF
IRJET- Detection and Recognition of Text for Dusty Image using Long Short...
PDF
IRJET - Text Detection in Natural Scene Images: A Survey
PDF
OPTICAL CHARACTER RECOGNITION USING RBFNN
PPTX
Text extraction from natural scene image, a survey
PDF
En31919926
PDF
PDF
Cc31331335
PDF
50120130406005
PDF
Optical Character Recognition (OCR) System
PDF
D017222226
PDF
OPTICAL CHARACTER RECOGNITION IN HEALTHCARE
PDF
Z04405149151
PDF
IRJET- A Survey on MSER Based Scene Text Detection
PDF
A Detailed Study And Recent Research On OCR
PDF
Text Detection and Recognition with Speech Output for Visually Challenged Per...
DOCX
JPM1415 Scene Text Recognition in Mobile Applications by Character Descripto...
PDF
Enhanced scene text recognition using deep learning based hybrid attention re...
PDF
Methodology for eliminating plain regions from captured images
PDF
CONTENT RECOVERY AND IMAGE RETRIVAL IN IMAGE DATABASE CONTENT RETRIVING IN TE...
PDF
Character recognition of kannada text in scene images using neural
IRJET- Detection and Recognition of Text for Dusty Image using Long Short...
IRJET - Text Detection in Natural Scene Images: A Survey
OPTICAL CHARACTER RECOGNITION USING RBFNN
Text extraction from natural scene image, a survey
En31919926
Cc31331335
50120130406005
Optical Character Recognition (OCR) System
D017222226
OPTICAL CHARACTER RECOGNITION IN HEALTHCARE
Z04405149151
IRJET- A Survey on MSER Based Scene Text Detection
A Detailed Study And Recent Research On OCR
Text Detection and Recognition with Speech Output for Visually Challenged Per...
JPM1415 Scene Text Recognition in Mobile Applications by Character Descripto...
Enhanced scene text recognition using deep learning based hybrid attention re...
Methodology for eliminating plain regions from captured images
CONTENT RECOVERY AND IMAGE RETRIVAL IN IMAGE DATABASE CONTENT RETRIVING IN TE...
Character recognition of kannada text in scene images using neural
Ad

More from IAESIJAI (20)

PDF
A novel scalable deep ensemble learning framework for big data classification...
PDF
Exploring DenseNet architectures with particle swarm optimization: efficient ...
PDF
A transfer learning-based deep neural network for tomato plant disease classi...
PDF
U-Net for wheel rim contour detection in robotic deburring
PDF
Deep learning-based classifier for geometric dimensioning and tolerancing sym...
PDF
Enhancing fire detection capabilities: Leveraging you only look once for swif...
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PDF
Depression detection through transformers-based emotion recognition in multiv...
PDF
Enhancing financial cybersecurity via advanced machine learning: analysis, co...
PDF
Crop classification using object-oriented method and Google Earth Engine
PDF
Enhanced intrusion detection through dual reduction and robust mean
PDF
Enhancing sepsis detection using feed-forward neural networks with hyperparam...
PDF
Heart disease approach using modified random forest and particle swarm optimi...
PDF
Boosting industrial internet of things intrusion detection: leveraging machin...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Learning high-level spectral-spatial features for hyperspectral image classif...
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
A hybrid feature selection with data-driven approach for cardiovascular disea...
PDF
Optimizing the gallstone detection process with feature selection statistical...
PDF
Hybrid methods to identify ovarian cancer from imbalanced high-dimensional mi...
A novel scalable deep ensemble learning framework for big data classification...
Exploring DenseNet architectures with particle swarm optimization: efficient ...
A transfer learning-based deep neural network for tomato plant disease classi...
U-Net for wheel rim contour detection in robotic deburring
A comparative analysis of optical character recognition models for extracting and classifying texts in natural scenes

Its significance is particularly pronounced in natural scenes, where text is often embedded within complex and dynamic backgrounds. OCR enables the extraction and digitization of textual content from various sources such as street signs, billboards, and product labels,
facilitating a myriad of applications, from navigation aids for autonomous vehicles to accessible reading tools for the visually impaired. By converting unstructured text into machine-readable data, OCR in natural scenes not only enhances user interaction with the environment but also serves as a foundation for further processing and analysis in fields like geospatial mapping, retail, and security.

The significance of OCR technology extends beyond mere text digitization; it plays a pivotal role in interpreting and understanding our immediate environment. Where text appears in varying forms and conditions, OCR is instrumental. Its applications span diverse sectors, from autonomous navigation, where it aids in interpreting road signs, to healthcare, where it assists in reading handwritten notes and prescriptions [12]. Figure 1 shows the distinction between OCR on documents and the identification of text within natural settings: Figure 1(a) depicts a basic OCR image, while Figure 1(b) represents an image in a natural scene. However, the task of recognizing classified text in natural scenes presents unique challenges. Unlike standard documents, text in natural environments is subject to a plethora of variables, including varying lighting conditions, diverse backgrounds, and a wide range of font styles and sizes. This variability can significantly impede the accuracy of OCR systems. Moreover, classified text, often characterized by its sensitive nature, demands not only high accuracy but also robustness and reliability in recognition [13].

Figure 1. The distinction between OCR documents and the identification of text within natural settings of (a) an abstract of a formal letter and (b) captured images from real-world scenes

To address these challenges, several advanced techniques have been developed, among which maximally stable extremal regions (MSER) and the Canny edge detector stand out for their effectiveness in detecting and segmenting text in complex natural scenes. MSER excels in identifying coherent regions within an image that are stable across a range of thresholds, making it particularly suited for recognizing text areas that stand out from their backgrounds. The Canny edge detector complements this by identifying the boundaries of text through gradient analysis, highlighting edges with high contrast to the surrounding area. The synergy between MSER's region-based detection and Canny edge detection's fine-grained boundary delineation offers a robust foundation for overcoming the intrinsic challenges of text detection in natural scenes.
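To make the MSER and Canny pairing concrete, the following is a minimal OpenCV sketch, not the authors' implementation: the image path, size filter, and edge-density threshold are illustrative assumptions.

```python
import cv2

# Load a scene image (path is a placeholder) and convert to grayscale.
image = cv2.imread("scene.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# 1) MSER proposes regions that stay stable over a range of thresholds,
#    which often correspond to individual characters or words.
mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)

# 2) Canny highlights high-contrast boundaries; thresholds are assumptions.
edges = cv2.Canny(gray, 100, 200)

# Keep MSER regions whose bounding boxes contain enough edge pixels,
# combining region-based and edge-based evidence as described above.
candidates = []
for pts in regions:
    x, y, w, h = cv2.boundingRect(pts.reshape(-1, 1, 2))
    if w < 5 or h < 5:
        continue  # discard tiny regions unlikely to be characters
    edge_density = edges[y:y + h, x:x + w].mean() / 255.0
    if edge_density > 0.05:  # heuristic cutoff, purely illustrative
        candidates.append((x, y, w, h))

# Draw the candidate text boxes for visual inspection.
for x, y, w, h in candidates:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite("candidates.jpg", image)
```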
Given the complexity of natural scene environments and the critical role OCR plays in interpreting them, this study conducts a comprehensive evaluation of various OCR models. Specifically, it assesses these models on their adaptability to different environmental conditions, their accuracy in recognizing text amidst the myriad challenges posed by natural scenes, and their reliability in delivering consistent results across diverse datasets. By systematically comparing the performance of leading OCR models, this research endeavours to provide insights into their strengths and limitations, to offer guidance for selecting the most appropriate model for a specific application, and to set the stage for the development of more advanced, hybrid OCR systems tailored to the nuanced requirements of text extraction and recognition in natural scenes. Our contributions are as follows: i) we address the gaps in OCR research, particularly in the application of OCR technology to natural scenes, which has not been explored as extensively as its use in controlled environments; ii) we reassess existing OCR models, examining their performance and suitability for recognizing classified text in uncontrolled, natural scenes; iii) we provide insights that can guide future advancements in OCR technology, focusing on enhancing its applicability and reliability in real-world scenarios; and iv) the findings from this study are expected to benefit various
applications that rely on accurate recognition of classified text in natural settings, contributing to the development of reliable OCR systems.

The rest of the study is organized as follows: section 2 reviews related work on scene-text detection across various OCR models. Section 3 introduces the methodology. The experimental results, benchmarked against various state-of-the-art methods, are detailed in section 4. Lastly, section 5 presents the conclusions and discusses potential directions for future research.

2. RELATED WORKS
This section provides a critical analysis of recent developments in OCR for natural scene text recognition, highlighting advancements, challenges, and the innovative learning techniques introduced by state-of-the-art models. The field has evolved significantly, with various approaches aimed at improving the efficiency of text detection and recognition. However, the diversity of real-world environments still presents challenges that many existing models do not adequately address, and complex environments and varying text characteristics continue to drive innovation in the field.

2.1. Text detection in natural scenes
Text detection in natural scenes has seen a variety of approaches, especially for dealing with the intricacies of varying environments. Texture-based methods, for instance, have been widely used, but their reliance on handcrafted features limits their adaptability to dynamic real-world scenarios. These methods treat text as a distinctive kind of texture and typically employ techniques such as the fast Fourier transform (FFT) [14], the discrete cosine transform (DCT), wavelets, and Gabor filters to extract texture characteristics; sliding windows scan the image for potential text blocks, and classifiers then ascertain the text locations. The approach of Deng et al. [15], which uses rectangular bounding boxes with vector regression, enhances detection in landscape images but lacks robustness when dealing with the arbitrary text shapes that are common in natural scenes. This method employs vector regression for text detection in the wild, underscoring the need for precision in bounding box generation.

In recent years, deep learning-based approaches have yielded promising outcomes for detecting text within natural scenes. Methods such as the efficient and accurate scene text detector (EAST), TextSnake, and character-region awareness for text detection (CRAFT), spanning both single-stage and two-stage designs, have shown improved adaptability to the diverse conditions of complex environments. The objective of scene text detection is to create algorithms capable of reliably detecting and labeling text with bounding boxes in uncontrolled settings such as street signs, billboards, or license plates. Deep learning-based approaches for scene text detection can be broadly classified into two categories: single-stage and two-stage methods. Single-stage methods predict both the bounding boxes and the corresponding text labels in a single step, whereas two-stage methods first generate candidate regions and then classify these regions as text or non-text [16]. Single-stage methods include EAST, TextBoxes++, and TextSnake. EAST is a fully convolutional network (FCN) that directly predicts the geometry of text regions and the corresponding text labels in one step.
TextBoxes++ is an improved version of TextBoxes that uses a more efficient backbone network and a novel feature fusion module. TextSnake is a segmentation-based method that predicts the text center line together with the corresponding text orientation and width. Two-stage methods include the faster region-based convolutional neural network (R-CNN), mask R-CNN, and CRAFT. Faster R-CNN is a region proposal-based method that first generates a set of candidate regions and then classifies them as text or non-text. Mask R-CNN extends faster R-CNN by adding a mask branch that predicts the pixel-level segmentation of text regions. CRAFT is a segmentation-based method that predicts character regions and the corresponding affinity regions [17]. The strengths of these models lie in their accuracy and detection speed in well-lit, controlled environments, but they often struggle with low-light conditions or complex backgrounds. The prior-guided dynamic tunable network (PDTNet) proposed in this study addresses these shortcomings by incorporating advanced preprocessing techniques and leveraging convolutional layers with rectified linear unit (ReLU) activations, enhancing feature extraction and classification accuracy in more challenging scenes. The model is particularly adept at handling diverse environmental conditions and varying text characteristics, making it well suited to real-world applications. By combining advanced preprocessing with a specialized spell-checking mechanism, PDTNet achieves superior accuracy and reliability compared to traditional models, positioning it as a significant contribution to the ongoing evolution of OCR technology.
2.2. Challenges and innovations in scene text recognition
Text recognition in natural scenes presents unique challenges, with varying font sizes, orientations, and complex backgrounds. Existing OCR models struggle with aspects such as curved text and noise in low-quality images; methods like CRAFT and TextSnake provide partial solutions but are not robust enough for highly distorted or noisy text. Agughasi et al. [18] introduce an intuitive method for multi-orientation text detection, inspired by the two-stage R-CNN framework, which associates each feature map location with a single reference box to achieve high target box coverage. In contrast, Yadav et al. [19] present a unified system for handling scientific document images, highlighting the variety of components, such as images, tables, text, and expressions, that complicate text recognition in these documents.

One of the major limitations of current approaches is their inability to handle complex backgrounds or low-quality images effectively. Arbitrary-shaped text, for example, is handled by contour-based methods [20] and polygon-based methods [21], but these techniques are not fully adaptable to real-world conditions where text may be curved or partially obscured [22]. To overcome this, our proposed PDTNet incorporates a spell-checking mechanism to enhance the post-processing of recognized text, significantly improving recognition accuracy in cases where traditional methods fail. Moreover, the presence of noise, low resolution, and motion blur further complicates text recognition, degrading text visibility and readability. Some studies have proposed enhancing image quality through super-resolution [24], deblurring, and denoising methods [25], which improve text clarity and contrast and thus facilitate the subsequent recognition step, yet such approaches often require significant computational power [23]. PDTNet addresses this by employing a lightweight preprocessing technique, allowing better handling of low-quality input images without excessive computational resources. To this end, several studies leverage deep learning techniques, such as CNNs [26], recurrent neural networks (RNNs) [27], and attention mechanisms [28], to enhance text detection and recognition performance. These methods learn high-level features and sequential dependencies from the image data, enhancing text representation and interpretation.
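As a simple illustration of pairing convolutional features with recurrent sequence modelling for recognition, the following is a minimal Keras sketch of a CRNN-style recognizer; the input size, filter and unit counts, and character-set size are illustrative assumptions rather than the configuration evaluated in this paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 63        # assumed character-set size
IMG_H, IMG_W = 32, 128  # assumed size of a cropped word image

inputs = keras.Input(shape=(IMG_H, IMG_W, 1))
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)

# Make the width axis the time axis, then collapse height and channels
# so each image column becomes one time step for the recurrent layer.
x = layers.Permute((2, 1, 3))(x)
x = layers.Reshape((IMG_W // 4, (IMG_H // 4) * 128))(x)

# A bidirectional LSTM models left-to-right and right-to-left context.
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

# Per-time-step character probabilities (e.g., for a CTC-style decoder).
outputs = layers.Dense(NUM_CLASSES + 1, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.summary()
```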
2.3. Advancements and applications
Advancements in OCR technology have improved not only text recognition accuracy but also its practical applications across various fields. For example, the work of Ronneberger et al. [29] on offline handwritten text recognition (HTR) in historical documents addresses the challenges associated with mislabeling in training sets, thereby enhancing the accuracy of document digitization. Similarly, research by Law and Deng [30] on automatic number plate recognition (ANPR) demonstrates the effectiveness of a single neural network in recognizing mixed-style license plates, highlighting the adaptability of OCR technology to diverse real-world scenarios. Another application of OCR is print media conversion, which transforms printed documents, such as books, magazines, and newspapers, into digital formats such as PDFs, e-books, and web pages [31]; this benefits sectors such as education, research, and entertainment by making content more accessible, searchable, and shareable. Moreover, OCR can enhance the security and efficiency of processes such as identity verification, document authentication, and data extraction. For example, OCR can be used to scan and validate passports, driver's licenses, and other identification documents, reducing the risk of fraud and human error [32]. Examples of OCR software that can perform these tasks include Veryfi, ABBYY FineReader, and Readiris [33]. More broadly, OCR has been widely applied to improve the quality and efficiency of domains such as historical document analysis, print media conversion, and document processing, and to strengthen the security and accuracy of identity verification, document authentication, and data extraction workflows [34]. From enhancing detection accuracy to contributing to assistive technologies, these studies lay a foundational framework for the proposed research, emphasizing the complexities and the need for advanced OCR models in natural scenes. However, OCR technology still faces challenges, such as low-quality images, complex layouts, and diverse languages, which require further research and development [35].

3. METHOD
This paper introduces a novel approach for detecting and recognizing inscriptions within images of natural environments using the PDTNet model. PDTNet leverages a combination of text detection algorithms and deep learning techniques to enhance accuracy and reliability in recognizing text amidst complex
backgrounds. The PDTNet architecture is composed of various layers, each with its own dimensions and functionality. The initial layer is a convolutional layer with an input activation volume of 28×28×16, contributing 2,368 parameters [36]. This is succeeded by another convolutional layer, doubling the depth to 32 channels and adding 4,640 parameters. A detailed explanation of the model is given in subsequent sections, and the overall architecture is depicted in Figure 2. Once the text is extracted from an image, OCR is used to convert the image-based text into machine-readable text. To ensure the accuracy of the recognized text, a spell-checking mechanism specifically designed for OCR outputs is employed.

Figure 2. The proposed methodology for the OCR model for scene text recognition

3.1. Dataset description and image enhancement
Our study emphasizes the importance of diverse datasets that better reflect real-world conditions. For this reason, we used three datasets: ICDAR2015, ICDAR2017, and our custom PDT2023 dataset. The ICDAR2015 [37] and ICDAR2017 [38] datasets comprise 1,670 camera-captured images and 18,000 images, respectively, and have been used as benchmarks in robust reading (OCR) competitions for computer vision tasks. Additionally, the PDT2023 dataset, which contains 300 images captured in and around the Mysore region of India, was employed to evaluate the robustness of the proposed method. To ensure consistency, all images, stored in .jpg format, were preprocessed and resized to 256×256 pixels, as shown in Figure 3: Figure 3(a) depicts a typical camera-captured image on a busy Indian road, while Figure 3(b) illustrates the variability of complex backgrounds. The parameters chosen for image augmentation on the PDT2023 dataset are presented in Table 1. This diversity allows the model to perform effectively across different environmental conditions and highlights its adaptability in comparison with relevant methods.
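The resizing and augmentation settings listed in Table 1 map naturally onto a Keras ImageDataGenerator; the sketch below is illustrative only, and the directory layout, batch size, and use of a single rotation range (rather than the three fixed angles) are assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (256, 256)  # all images are resized to 256x256 pixels

# Augmentation roughly following Table 1: rescale, rotation, zoom, shifts.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=60,       # rotations of up to 60 degrees (30/45/60 in the paper)
    zoom_range=0.25,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
val_gen = ImageDataGenerator(rescale=1.0 / 255)

# "PDT2023/train" and "PDT2023/val" are placeholder directories with one
# subdirectory per class, as required by flow_from_directory.
train_data = train_gen.flow_from_directory(
    "PDT2023/train", target_size=IMG_SIZE, batch_size=10)
val_data = val_gen.flow_from_directory(
    "PDT2023/val", target_size=IMG_SIZE, batch_size=10)
```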
Figure 3. Select samples of images from the PDT2023 dataset: (a) a typical camera-captured image on a busy Indian road and (b) variability in complex backgrounds

Table 1. Parameters for image augmentation on the PDT2023 dataset
Method              Default   Augmented
Rotation            -         30°, 45°, 60°
Rescale             -         1./255
Zoom range          -         0.25
x-Shift, y-Shift    None      0.1
x-Scale, y-Scale    None      0.1
Adjusted image      Varies    256×256

3.2. PDTNet model
The PDTNet architecture is composed of various layers, each with its own dimensions and functionality. The initial layer is a convolutional layer with an input activation volume of 28×28×16, which contributes 2,368 parameters. This is succeeded by another convolutional layer that doubles the depth to 32 channels, adding 4,640 parameters. The model then applies a ReLU activation function, which introduces no additional parameters, followed by a batch normalization step over the 32 channels (128 parameters) that stabilizes the learning process. As the model progresses, the spatial dimensions are reduced to 14×14 while maintaining 32 channels, with a further convolution adding 9,248 parameters. ReLU activation and batch normalization are applied again, followed by a dropout layer to reduce overfitting, which adds no parameters. An additional convolutional layer with 64 filters contributes 18,496 parameters, again followed by ReLU and batch normalization. This pattern is retained as the depth doubles to 128 channels, first maintaining and then reducing the spatial dimensions, culminating in a convolutional layer with an activation shape of 4×4×128 and 147,584 parameters. The network's output is then flattened in preparation for the fully connected layers, starting with a dense layer of 512 neurons (1,049,088 parameters), followed by a ReLU activation and a batch normalization layer. A dropout layer is applied next, preceding the final dense layer of 63 neurons. The final layer uses a SoftMax activation function to produce 63 class probabilities, each corresponding to a potential classification within the model's structure. This process is illustrated in Figure 4.

Figure 4. The block diagram of the PDTNet showing important layers
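To make the layer stack concrete, the following is a minimal Keras sketch consistent with the description above (16-, 32-, 64-, and 128-filter convolutions, ReLU and batch normalization, dropout, a 512-unit dense layer, and a 63-way SoftMax output). Kernel sizes, pooling positions, dropout rates, and the loss function are assumptions rather than the exact PDTNet configuration; the optimizer setting follows the ADAM learning rate reported in section 3.4.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_pdtnet_sketch(input_shape=(28, 28, 16), num_classes=63):
    """Layer stack mirroring the PDTNet description; details are assumed."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same"),             # first convolution
        layers.Conv2D(32, 3, padding="same"),             # depth doubled to 32
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),                           # spatial size 28 -> 14
        layers.Conv2D(32, 3, padding="same"),
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.Dropout(0.25),                             # rate is an assumption
        layers.Conv2D(64, 3, padding="same"),
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),                           # 14 -> 7
        layers.Conv2D(128, 3, padding="same"),
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2, padding="same"),           # 7 -> 4
        layers.Conv2D(128, 3, padding="same"),            # 4x4x128 activation
        layers.Flatten(),
        layers.Dense(512),
        layers.ReLU(),
        layers.BatchNormalization(),
        layers.Dropout(0.5),                              # rate is an assumption
        layers.Dense(num_classes, activation="softmax"),  # 63 class probabilities
    ])

model = build_pdtnet_sketch()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",     # loss choice is assumed
              metrics=["accuracy"])
```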
3.3. Optical character recognition and spell-checking mechanism
OCR is a software application that specializes in recognizing text embedded within images and video content. It converts various document types, such as PDFs, scanned physical documents, and images captured by cameras, into machine-editable text, as depicted in Figure 5. However, the OCR process is not immune to errors, necessitating the use of error correction tools. To address misspellings that may arise during OCR, a spell-checking method is incorporated within the proposed PDTNet system. This spell-checker plays a significant role in achieving high levels of recognition accuracy. It functions by aligning unrecognized character sequences with words from its dictionary that have similar spellings. For example, if the OCR software misreads the character sequence "tne," the spell checker identifies the misplaced "n" and corrects it to "h," enhancing the overall reliability of the OCR system. Figure 5 presents a detailed breakdown of the OCR process and the application of the spell-checking mechanism: Figure 5(a) shows the input image containing text, Figure 5(b) displays the initial output from the OCR process, and Figure 5(c) demonstrates the corrected output after spell-checking is applied.

Figure 5. Process of text identification through OCR for (a) area containing text: input image, (b) the result of OCR application on the input image, and (c) the results of OCR correction
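A minimal dictionary-based post-correction sketch in the spirit of this mechanism is shown below, using the standard library's difflib to map unrecognized tokens to the closest dictionary word; the word list and similarity cutoff are illustrative, not PDTNet's actual spell-checker.

```python
import difflib

# A tiny illustrative dictionary; in practice this would be a full word list.
DICTIONARY = {"the", "fire", "extinguisher", "break", "glass"}

def correct_token(token, cutoff=0.6):
    """Return the closest dictionary word, or the token itself if none is close."""
    word = token.lower()
    if word in DICTIONARY:
        return token
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_text(ocr_output):
    return " ".join(correct_token(t) for t in ocr_output.split())

print(correct_text("tne fire extinguisner"))  # -> "the fire extinguisher"
```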
3.4. Experimental configuration
The PDTNet model was trained using 3×3 filters with a stride of one. For the PDT2023 dataset, 240 images were allocated to the training set and 60 to the validation set, adhering to the conventional 80:20 split between training and validation data. The input features were normalized to the range 0 to 1 to speed up network training. To improve the precision of the deep neural networks, data augmentation was applied, including random rotations of the images by 30, 45, and 60 degrees. The ICDAR2015 and ICDAR2017 benchmark datasets contain 1,670 and 18,000 images, respectively; for training, only the cropped word images that underwent data augmentation were included in the training set. The optimization process employed the adaptive moment estimation (ADAM) optimizer with a learning rate of 10^-5. The network was trained on an NVIDIA 1060 GPU with 24 GB of memory, using batches of ten images at a time. All experiments were conducted in Python, using the TensorFlow library as the framework for training the deep learning models. The method took 0.5 seconds per image for training and 0.4 seconds per image for testing.

4. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the proposed system using the publicly accessible ICDAR2015 dataset. The system's effectiveness was analyzed and compared with similar systems, and the results obtained from detailed experimentation are presented in Table 2. The evaluation metrics were accuracy, precision, and recall. Accuracy measures the proportion of correct predictions out of the total number of predictions and is defined in (1):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision indicates the proportion of true positive identifications out of all identified positives and is calculated as in (2):

Precision = TP / (TP + FP)    (2)

Recall (sensitivity) measures the proportion of actual positives that are correctly identified and is defined in (3):

Recall = TP / (TP + FN)    (3)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
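These metrics can be computed directly from the prediction counts; a minimal sketch with illustrative counts (not results from this study) follows.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall as defined in (1)-(3)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Example with made-up counts, purely to show the calculation:
acc, prec, rec = classification_metrics(tp=854, tn=90, fp=146, fn=112)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```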
4.1. Results of using PDTNet across benchmark datasets
This section presents the results of the proposed technique, PDTNet, evaluated across various benchmark datasets. The images selected for testing represent a diverse range of illumination conditions and scene complexities to ensure a thorough evaluation of the model's performance. The results demonstrate PDTNet's capability to effectively detect and recognize text under various environmental challenges, showcasing its robustness and adaptability. Table 2 presents a straightforward comparison of text recognition performance on two datasets, showing the OCR model's ability to detect text using bounding boxes. The input images for each dataset and their corresponding OCR outputs with bounding boxes are provided to illustrate the model's effectiveness in this domain. Further evaluation was done on our own dataset, PDT2023. Preliminary observations from this dataset show the model's initial performance in detecting and recognizing text within the input images. Upon iterative refinement of the model, marked improvements in detection accuracy were observed, demonstrating enhanced accuracy in recognizing text in English, Kannada, and Tamil; the results are presented in Table 3.

Table 2. Segmentation and recognition results on the benchmark datasets (input images and OCR detections with bounding-box predictions for ICDAR2015 (a)-(b) and ICDAR2017 (c)-(d); images not reproduced here)

Table 3. Performance metrics on three different datasets
Dataset      Accuracy (%)   Precision (%)   Recall (%)
ICDAR2015    98.20          98.30           97.70
ICDAR2017    96.30          97.40           92.50
PDT2023      89.20          85.40           88.40

Table 3 presents the performance metrics of PDTNet across the three datasets, highlighting its effectiveness in text recognition. For the ICDAR2015 dataset, the model achieved an accuracy of 98.20%, a precision of 98.30%, and a recall of 97.70%, the highest metrics across all datasets tested. This suggests that the model is particularly effective on this dataset, outperforming its results on ICDAR2017 and PDT2023, where lower metrics were recorded.

4.2. Results of training and validation accuracy and loss
The training and validation accuracy and loss curves for two state-of-the-art models, VGG16 and ResNet18, are presented in this section. These models were chosen for their demonstrated efficiency and high performance on benchmark datasets, making them suitable candidates for comparative analysis. Figures 6 and 7 depict the resulting curves, highlighting the performance of each model throughout the training process. Particular emphasis is placed on their ability to minimize loss and improve accuracy over time. This comparison offers valuable insights into the strengths and limitations of each model when addressing complex text recognition tasks.
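Curves such as those in Figures 6 and 7 are typically produced from a Keras training history; the minimal sketch below assumes the model and data generators from the earlier sketches and an illustrative 50-epoch run, and is not the authors' plotting code.

```python
import matplotlib.pyplot as plt

# 'model', 'train_data', and 'val_data' are assumed from the sketches above.
history = model.fit(train_data, validation_data=val_data, epochs=50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="training accuracy")
ax1.plot(history.history["val_accuracy"], label="validation accuracy")
ax1.set_xlabel("epoch")
ax1.legend()

ax2.plot(history.history["loss"], label="training loss")
ax2.plot(history.history["val_loss"], label="validation loss")
ax2.set_xlabel("epoch")
ax2.legend()

plt.tight_layout()
plt.savefig("training_curves.png")
```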
Figure 6. The training and validation accuracies of VGG16 in comparison with the ground truth data

Figure 7. The training and validation accuracies of ResNet50 in comparison with the ground truth data

From Figures 6 and 7, it can be observed that the second model (ResNet50) performed better than VGG16, since it predicted bounding boxes that exactly contained the text, whereas VGG16 generated bounding boxes with a few misclassifications [39]. Overall, the loss gradually decreased over the training epochs until convergence around the 50th epoch. Better accuracy was obtained through the choice of hyperparameters and variations of dropout regularization, and the results are presented in Tables 4 to 6.

Table 4. Comparative analysis of various OCR models (domain: cargo containers, stationeries, named logos, license plate numbers, and household inscriptions)
Metrics/Features      EasyOCR        Tesseract   Surya      DOCTR
Accuracy              92%            88%         80%        95%
Precision             91%            90%         85%        96%
Recall                89%            87%         82%        94%
Speed                 Fast           Moderate    Slow       Fast
Resource use          Low            Moderate    High       Low
Ease of integration   Very easy      Easy        Moderate   Very easy
Adaptability          Excellent      Good        Poor       Excellent
Language support      Multilingual   Extensive   Limited    Multilingual
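Engine-level comparisons such as those in Table 4 (and Table 5 below) can be scripted by running several OCR engines over the same image set; the following minimal sketch uses the easyocr and pytesseract packages, with the image path as a placeholder (Surya and docTR are omitted for brevity).

```python
import easyocr
import pytesseract
from PIL import Image

IMAGE = "sample_scene.jpg"  # placeholder path

# EasyOCR returns (bounding box, text, confidence) triples per detection.
reader = easyocr.Reader(["en"])
easyocr_results = reader.readtext(IMAGE)
easyocr_text = " ".join(text for _, text, _ in easyocr_results)

# Tesseract (via pytesseract) returns the recognized text as a plain string.
tesseract_text = pytesseract.image_to_string(Image.open(IMAGE))

print("EasyOCR:  ", easyocr_text)
print("Tesseract:", tesseract_text)
```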
Table 5. Comparative analysis of OCR engines and the proposed PDTNet model
Metrics/Features      EasyOCR [1]   Tesseract [2]   Surya [4]   DOCTR [39]   Proposed (PDTNet)
Accuracy (%)          92            85              80          95           96
Precision (%)         91            90              85          96           94
Recall (%)            89            87              82          94           92
Speed                 Fast          Moderate        Slow        Very fast    Very fast
Resource use          Low           Moderate        High        Low          Moderate
Ease of integration   Very easy     Easy            Moderate    Very easy    Easy
Adaptability          Excellent     Good            Poor        Excellent    Superior

Table 6. Comparison with relevant methods (NR = not reported)
Author                   Method                  Accuracy   Precision   Recall
Ma et al. [40]           DocUNet                 0.41       NR          NR
Zhou et al. [41]         EAST                    NR         83.3        78.3
Pratikakis et al. [42]   Adaptive binarization   NR         NR          17.53
Ma et al. [43]           RRPN                    91.2       90.0        72.0
Proposed                 PDTNet                  98.2       85.0        75.0

4.3. Comparative analysis
A comparative analysis was conducted on the methods proposed in [40]–[43] to evaluate their performance in text recognition tasks. The accuracy, precision, and recall metrics were computed for each method, following (1)-(3) outlined earlier in the paper, to provide a comprehensive evaluation of their effectiveness. This analysis highlights the strengths and limitations of each approach, offering a clearer understanding of how they perform relative to each other in various OCR tasks. VGG-16 achieves superior precision over ResNet18 when utilizing 256 RNN units. Yet, increasing the RNN units to 512 allows ResNet18 to surpass VGG-16 significantly in recall. Expanding the bidirectional long short-term memory (BiLSTM) units from 256 to 512 enhances the precision of VGG-16 but only nominally improves recall, suggesting an improved detection of true positives without a proportionate increase in the capture of all positive instances. With consistent unit numbers, BiLSTM shows superior recognition accuracy over BiGRU, which supports BiLSTM's proficiency in handling long-term dependencies in the context of this task. ResNet50 coupled with BiLSTM and 512 units registers the highest scores in accuracy and precision, indicating the potential effectiveness of deeper network architectures for complex tasks such as text recognition. Table 6 shows that the proposed PDTNet model achieves a notable accuracy of 98.2%, a standout result compared to other advanced methods. Despite this high accuracy, its precision and recall are not the leading scores, indicating possible areas for refinement in identifying true positives and capturing all positive instances. These results provide a comprehensive assessment of PDTNet's capabilities in comparison with other models and configurations, underscoring its effectiveness and highlighting opportunities for further enhancement.

5. CONCLUSION AND FUTURE SCOPE
This study introduces PDTNet, a novel OCR model that leverages deep learning techniques to tackle the challenges of text detection and recognition in natural scenes. The results indicate that PDTNet surpasses existing models in terms of accuracy, precision, and recall, efficiently addressing common OCR challenges such as low contrast and complex backgrounds. Despite these advancements, the study identifies areas for further improvement, particularly in the precision and recall metrics.
The evaluation, conducted on datasets like ICDAR2015, ICDAR2017, and PDT2023, highlights the model's superior performance, but also points out the need for enhancements in language and environmental adaptability. Future research will focus on refining PDTNet to enhance true positive identification and comprehensive capture of all positive instances, thereby improving precision and recall metrics. Additionally, expanding the model's capability to recognize and process a broader range of languages will increase its applicability in multilingual environments. Developing techniques to improve the model’s performance under varied and challenging environmental conditions will make it more robust in real-world applications. Exploring the integration of PDTNet with other OCR models to leverage their strengths is also a priority, aiming for higher accuracy and adaptability in practical applications. The findings from this study contribute significantly to the field of OCR technology, providing valuable insights into model selection for specific applications and paving the way for the development of more advanced hybrid OCR systems. This research underscores the importance of context-aware OCR solutions, emphasizing the need for continued innovation to meet the requirements of text extraction and
recognition in natural scenes. By focusing on these areas, the study aims to ensure that PDTNet remains at the forefront of OCR technology advancements, addressing both current challenges and future opportunities in the field.

REFERENCES
[1] M. A. M. Salehudin et al., "Analysis of optical character recognition using EasyOCR under image degradation," Journal of Physics: Conference Series, vol. 2641, no. 1, Nov. 2023, doi: 10.1088/1742-6596/2641/1/012001.
[2] S. Kumar, N. K. Sharma, M. Sharma, and N. Agrawal, "Text extraction from images using Tesseract," in Deep Learning Techniques for Automation and Industrial Applications, John Wiley & Sons, Ltd, 2024, pp. 1-18, doi: 10.1002/9781394234271.ch1.
[3] D. Shruthi, H. K. Chethan, and V. I. Agughasi, "Effective approach for fine-tuning pre-trained models for the extraction of texts from source codes," in ITM Web of Conferences, 2024, vol. 65, doi: 10.1051/itmconf/20246503004.
[4] L. Mosbah, I. Moalla, T. M. Hamdani, B. Neji, T. Beyrouthy, and A. M. Alimi, "ADOCRNet: a deep learning OCR for Arabic documents recognition," IEEE Access, vol. 12, pp. 55620-55631, 2024, doi: 10.1109/ACCESS.2024.3379530.
[5] V. I. Agughasi and M. Srinivasiah, "Semi-supervised labelling of chest x-ray images using unsupervised clustering for ground-truth generation," Applied Engineering and Technology, vol. 2, no. 3, pp. 188-202, 2023, doi: 10.31763/aet.v2i3.1143.
[6] A. V. Ikechukwu and S. Murali, "i-Net: a deep CNN model for white blood cancer segmentation and classification," International Journal of Advanced Technology and Engineering Exploration, vol. 9, no. 95, pp. 1448-1464, 2022, doi: 10.19101/IJATEE.2021.875564.
[7] S. Bhimshetty and A. V. Ikechukwu, "Energy-efficient deep Q-network: reinforcement learning for efficient routing protocol in wireless internet of things," Indonesian Journal of Electrical Engineering and Computer Science, vol. 33, no. 2, pp. 971-980, 2024, doi: 10.11591/ijeecs.v33.i2.pp971-980.
[8] A. V. Ikechukwu and S. Murali, "XAI: an explainable AI model for the diagnosis of COPD from CXR images," in 2023 IEEE 2nd International Conference on Data, Decision and Systems (ICDDS), Mangaluru, India, 2023, pp. 1-6, doi: 10.1109/ICDDS59137.2023.10434619.
[9] A. V. Ikechukwu, S. Murali, and B. Honnaraju, "COPDNet: an explainable ResNet50 model for the diagnosis of COPD from CXR images," in 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), Mysore, India, 2023, pp. 1-7, doi: 10.1109/INDISCON58499.2023.10270604.
[10] A. V. Ikechukwu, "The superiority of fine-tuning over full-training for the efficient diagnosis of COPD from CXR images," Inteligencia Artificial, vol. 27, no. 74, pp. 62-79, 2024, doi: 10.4114/intartif.vol27iss74pp62-79.
[11] A. V. Ikechukwu and S. Murali, "CX-Net: an efficient ensemble semantic deep neural network for ROI identification from chest-x-ray images for COPD diagnosis," Machine Learning: Science and Technology, vol. 4, no. 2, 2023, doi: 10.1088/2632-2153/acd2a5.
[12] R. Jalloul, C. H. Krishnappa, V. I. Agughasi, and R. Alkhatib, "Enhancing early breast cancer detection with infrared thermography: a comparative evaluation of deep learning and machine learning models," Technologies, vol. 13, no. 1, Art. no. 1, Jan. 2025, doi: 10.3390/technologies13010007.
[13] Y. Tang and X. Wu, "Scene text detection using superpixel-based stroke feature transform and deep learning based region classification," IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2276-2288, 2018, doi: 10.1109/TMM.2018.2802644.
[14] P. Prakash, S. K. Y. Hanumanthaiah, and S. B. Mayigowda, "CRNN model for text detection and classification from natural scenes," IAES International Journal of Artificial Intelligence, vol. 13, no. 1, pp. 839-849, 2024, doi: 10.11591/ijai.v13.i1.pp839-849.
[15] D. Deng, H. Liu, X. Li, and D. Cai, "PixelLink: detecting scene text via instance segmentation," 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 6773-6780, 2018, doi: 10.1609/aaai.v32i1.12269.
[16] X. Li, "A deep learning-based text detection and recognition approach for natural scenes," Journal of Circuits, Systems and Computers, vol. 32, no. 5, 2023, doi: 10.1142/S0218126623500731.
[17] I. Marthot-Santaniello, M. T. Vu, O. Serbaeva, and M. Beurton-Aimar, "Stylistic similarities in Greek Papyri based on letter shapes: a deep learning approach," Document Analysis and Recognition - ICDAR 2023 Workshops, pp. 307-323, 2023, doi: 10.1007/978-3-031-41498-5_22.
[18] V. I. Agughasi, S. Bhimshetty, R. Deepu, and M. V. Mala, "Advances in thermal imaging: a convolutional neural network approach for improved breast cancer diagnosis," International Conference on Distributed Computing and Optimization Techniques, ICDCOT 2024, 2024, doi: 10.1109/ICDCOT61034.2024.10515323.
[19] A. Yadav, S. Singh, M. Siddique, N. Mehta, and A. Kotangale, "OCR using CRNN: a deep learning approach for text recognition," 2023 4th International Conference for Emerging Technology, INCET 2023, 2023, doi: 10.1109/INCET57972.2023.10170436.
[20] R. Najam and S. Faizullah, "Analysis of recent deep learning techniques for Arabic handwritten-text OCR and post-OCR correction," Applied Sciences, vol. 13, no. 13, 2023, doi: 10.3390/app13137568.
[21] P. Chhabra, A. Shrivastava, and Z. Gupta, "Comparative analysis on text detection for scenic images using EAST and CTPN," in 7th International Conference on Trends in Electronics and Informatics, ICOEI 2023 - Proceedings, 2023, pp. 1303-1308, doi: 10.1109/ICOEI56765.2023.10125894.
[22] A. Rahman, A. Ghosh, and C. Arora, "UTRNet: high-resolution Urdu text recognition in printed documents," Document Analysis and Recognition - ICDAR 2023, pp. 305-324, 2023, doi: 10.1007/978-3-031-41734-4_19.
[23] S. Kaur, S. Bawa, and R. Kumar, "Heuristic-based text segmentation of bilingual handwritten documents for Gurumukhi-Latin scripts," Multimedia Tools and Applications, vol. 83, no. 7, pp. 18667-18697, 2024, doi: 10.1007/s11042-023-15335-8.
[24] A. V. Ikechukwu, "Leveraging transfer learning for efficient diagnosis of COPD using CXR images and explainable AI techniques," Inteligencia Artificial, vol. 27, no. 74, pp. 133-151, 2024, doi: 10.4114/intartif.vol27iss74pp133-151.
[25] S. Long, X. He, and C. Yao, "Scene text detection and recognition: the deep learning era," International Journal of Computer Vision, vol. 129, no. 1, pp. 161-184, 2021, doi: 10.1007/s11263-020-01369-0.
[26] T. Khan, R. Sarkar, and A. F. Mollah, "Deep learning approaches to scene text detection: a comprehensive review," Artificial Intelligence Review, vol. 54, no. 5, pp. 3239-3298, 2021, doi: 10.1007/s10462-020-09930-6.
[27] E. Hassan and V. L. Lekshmi, "Scene text detection using attention with depthwise separable convolutions," Applied Sciences, vol. 12, no. 13, 2022, doi: 10.3390/app12136425.
[28] X. Liu, G. Meng, and C. Pan, "Scene text detection and recognition with advances in deep learning: a survey," International Journal on Document Analysis and Recognition, vol. 22, no. 2, pp. 143-162, 2019, doi: 10.1007/s10032-019-00320-5.
[29] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: convolutional networks for biomedical image segmentation," IEEE Access, vol. 9, pp. 16591-16603, May 2015, doi: 10.1109/ACCESS.2021.3053408.
[30] H. Law and J. Deng, "CornerNet: detecting objects as paired keypoints," International Journal of Computer Vision, vol. 128, no. 3, pp. 642-656, 2020, doi: 10.1007/s11263-019-01204-1.
[31] N. Otsu, "Threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-9, no. 1, pp. 62-66, 1979, doi: 10.1109/tsmc.1979.4310076.
[32] B. Gatos, I. Pratikakis, and S. J. Perantonis, "Adaptive degraded document image binarization," Pattern Recognition, vol. 39, no. 3, pp. 317-327, 2006, doi: 10.1016/j.patcog.2005.09.010.
[33] N. Phansalkar, S. More, A. Sabale, and M. Joshi, "Adaptive local thresholding for detection of nuclei in diversity stained cytology images," in ICCSP 2011 - 2011 International Conference on Communications and Signal Processing, 2011, pp. 218-220, doi: 10.1109/ICCSP.2011.5739305.
[34] F. Z. A. Bella, M. El Rhabi, A. Hakim, and A. Laghrib, "An innovative document image binarization approach driven by the non-local p-Laplacian," Eurasip Journal on Advances in Signal Processing, vol. 2022, no. 1, 2022, doi: 10.1186/s13634-022-00883-2.
[35] M. Cheriet, J. N. Said, and C. Y. Suen, "A recursive thresholding technique for image segmentation," IEEE Transactions on Image Processing, vol. 7, no. 6, pp. 918-921, 1998, doi: 10.1109/83.679444.
[36] Y. C. Wei and C. H. Lin, "A robust video text detection approach using SVM," Expert Systems with Applications, vol. 39, no. 12, pp. 10832-10840, 2012, doi: 10.1016/j.eswa.2012.03.010.
[37] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, no. 10 SPEC. ISS., pp. 761-767, 2004, doi: 10.1016/j.imavis.2004.02.006.
[38] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan, "Text flow: a unified text detection system in natural scene images," in Proceedings of the IEEE International Conference on Computer Vision, ICCV 2015, pp. 4651-4659, doi: 10.1109/ICCV.2015.528.
[39] P. Batra, N. Phalnikar, D. Kurmi, J. Tembhurne, P. Sahare, and T. Diwan, "OCR-MRD: performance analysis of different optical character recognition engines for medical report digitization," Int. J. Inf. Technol., vol. 16, no. 1, pp. 447-455, Jan. 2024, doi: 10.1007/s41870-023-01610-2.
[40] K. Ma, Z. Shu, X. Bai, J. Wang, and D. Samaras, "DocUNet: document image unwarping via a stacked U-Net," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 4700-4709, doi: 10.1109/CVPR.2018.00494.
[41] X. Zhou et al., "EAST: an efficient and accurate scene text detector," Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 2642-2651, 2017, doi: 10.1109/CVPR.2017.283.
[42] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos, "ICDAR2017 competition on document image binarization (DIBCO 2017)," in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2017, vol. 1, pp. 1395-1403, doi: 10.1109/ICDAR.2017.228.
[43] J. Ma et al., "Arbitrary-oriented scene text detection via rotation proposals," IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018, doi: 10.1109/TMM.2018.2818020.

BIOGRAPHIES OF AUTHORS
Puneeth Prakash is an Assistant Professor at Maharaja Institute of Technology Mysore. He has 8 years of teaching and research experience. He holds a bachelor's degree in information science and a master's degree in computer science from VTU Belagavi. His key areas of interest are image processing, machine learning, and computer vision. He mainly works on scene-text-related images and has published papers in national and international journals. He is keen on multiprogramming paradigm implementation. He can be contacted at email: puneeth.phd20@gmail.com.

Sharath Kumar Yeliyur Hanumanthaiah is a Professor and Head of the Department of Information Science and Engineering, Maharaja Institute of Technology Mysore. His areas of interest are image processing, pattern recognition, and information retrieval. He has around 12 years of teaching experience. He completed a B.E. in computer science and engineering from VTU and an M.Tech. in computer cognition technology from the University of Mysore, and he further completed a Ph.D. at the University of Mysore. He has published 50 research articles in reputed conferences and journals. He served on the BoE and BoS of the University of Mysore from 2016 to 2020. He can be contacted at email: sharathyhk@gmail.com.