Intelligent Text Document
Correction System Based on
Similarity Technique
A Thesis
Submitted to the Council of the College of Information Technology,
University of Babylon in Partial Fulfillment of the Requirements
for the Degree of Master of Science in Computer Science.
By
Marwa Kadhim Obeid Al-Rikaby
Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry
2015 A.D. 1436 A.H.
Ministry of Higher Education and
Scientific Research
University of Babylon- College of Information
Technology
Software Department
In the name of Allah, the Most Gracious, the Most Merciful
{By it Allah guides those who pursue His good pleasure to the ways of peace, brings them out of darkness into the light by His permission, and guides them to a straight path.}
Almighty Allah has spoken the truth.
Surah Al-Ma'idah, Verse 16
Supervisor Certification
I certify that this thesis was prepared under my supervision at the Department of
Software / Information Technology / University of Babylon, by Marwa
Kadhim Obeid Al-Rikaby, in partial fulfillment of the requirements for the
degree of Master of Science in Computer Science.
Signature:
Supervisor : Prof. Dr. Abbas Mohsen Al-Bakry
Title : Professor.
Date : / / 2015
The Head of the Department Certification
In view of the available recommendation, we forward this thesis for debate by
the examining committee.
Signature:
Name : Dr. Eman Salih Al-Shamery
Title: Assist. Professor.
Date: / / 2015
To
Master of creatures,
Loved by Allah,
The Prophet Muhammad
(Allah bless him and his family)
Acknowledgements
All praise be to Allah Almighty who enabled me to complete this task
successfully and utmost respect to His last Prophet Mohammad PBUH.
First, my appreciation is due to my advisor Prof. Dr. Abbas Mohsen Al-
Bakry, for his advice and guidance that led to the completion of this thesis.
I would like to thank the staff of the Software Department for the help they
have offered, especially, the head of the Software Department Dr. Eman Salih
Al-Shamery.
Most importantly, I would like to thank my parents, my sisters, my brothers
and my friends for their support.
Abstract
Automatic text correction is one of the challenges of human-computer
interaction. It is directly involved in several application areas, such as correcting
digitized handwritten text, and indirectly in others, such as correcting users'
queries before a retrieval process is applied in interactive databases.
The automatic text correction process passes through two major phases: error
detection and candidate suggestion. Techniques for both phases are categorized
into procedural and statistical. Procedural techniques are based on rules that
govern the acceptability of texts, and include natural language processing
techniques. Statistical techniques, on the other hand, depend on statistics and
probabilities collected from large corpora, based on what is commonly used by
humans.
In this work, natural language processing techniques are used as the basis for
analysis and for both spelling and grammar acceptance checking of English texts. A
prefix-dependent hash-indexing scheme is used to shorten the time of looking up
the dictionary, which contains all English tokens. The dictionary is used
as the basis of the error detection process.
Candidate generation is based on calculating the similarity of the source token
to the dictionary tokens, measured using an improved Levenshtein method, and
ranking them accordingly. However, this process is time intensive; therefore, tokens
are divided into smaller groups according to spelling similarity, in a way that
preserves random-access availability. Finally, candidate suggestion involves examining
a set of features related to commonly committed mistakes. The system selects the
optimal candidate, the one that provides the highest suitability without violating
grammar rules, to generate linguistically accepted text.
Testing the system's accuracy showed better results than Microsoft Word and
some other systems. The enhanced similarity measure reduced the time complexity
to the boundaries of the original Levenshtein method while discovering an
additional error type.
Table of Contents
Subject  Page No.
Chapter One : Overview
1.1 Introduction 1
1.2 Problem Statement 3
1.3 Literature Review 5
1.4 Research Objectives 10
1.5 Thesis Outlines 11
Chapter Two: Background and Related Concepts
Part I: Natural Language Processing 12
2.1 Introduction 12
2.2 Natural Language Processing Definition 12
2.3 Natural Language Processing Applications 13
2.3.1 Text Techniques 14
2.3.2 Speech Techniques 15
2.4 Natural Language Processing and Linguistics 16
2.4.1 Linguistics 16
2.4.1.1 Terms of Linguistic Analysis 17
2.4.1.2 Linguistic Units Hierarchy 19
2.4.1.3 Sentence Structure and Constituency 19
2.4.1.4 Language and Grammar 20
2.5 Natural Language Processing Techniques 22
2.5.1 Morphological Analysis 22
2.5.2 Part of Speech Tagging 23
2.5.3 Syntactic Analysis 26
2.5.4 Semantic Analysis 27
2.5.5 Discourse Integration 27
2.5.6 Pragmatic Analysis 28
2.6 Natural Language Processing Challenges 28
2.6.1 Linguistics Units Challenges 28
2.6.1.1 Tokenization 28
2.6.1.2 Segmentation 29
2.6.2 Ambiguity 31
2.6.2.1 Lexical Ambiguity 31
2.6.2.2 Syntactic Ambiguity 31
2.6.2.3 Semantic Ambiguity 32
2.6.2.4 Anaphoric Ambiguity 32
2.6.3 Language Change 32
2.6.3.1 Phonological Change 33
2.6.3.2 Morphological Change 33
2.6.3.3 Syntactic Change 33
2.6.3.4 Lexical Change 33
2.6.3.5 Semantic Change 34
Part II: Text Correction 35
2.7 Introduction 35
2.8 Text Errors 35
2.8.1 Non-words Errors 36
2.8.2 Real-word Errors 36
2.9 Error Detection Techniques 37
2.9.1 Dictionary Looking Up 37
2.9.1.1 Dictionaries Resources 37
2.9.1.2 Dictionaries Structures 38
2.9.2 N-gram Analysis 39
2.10 Error Correction Techniques 40
2.10.1 Minimum Edit Distance Techniques 40
2.10.2 Similarity Key Techniques 43
2.10.3 Rule Based Techniques 43
2.10.4 Probabilistic Techniques 43
2.11 Suggestion of Corrections 44
2.12 The Suggested Approach 44
2.12.1 Finding Candidates Using Minimum Edit Distance 45
2.12.2 Candidates Mining 45
2.12.3 Part-of-Speech Tagging and Parsing 46
Chapter Three : Hashed Dictionary and Looking Up Technique
3.1 Introduction 48
3.2 Hashing 48
3.2.1 Hash Function 49
3.2.2 Formulation 52
3.2.3 Indexing 53
3.3 Looking Up Procedure 56
3.4 Dictionary Structure Properties 58
3.5 Similarity Based Looking-Up 59
3.5.1 Bi-grams Generation 60
3.5.2 Primary Centroids Selection 62
3.5.3 Centroids Referencing 63
3.6 Application of Similarity Based Looking up approach 64
3.7 The Similarity Based Looking up Properties 67
Chapter Four : Error Detection and Candidates Generation
4.1 Introduction 69
4.2 Non-word Error Detection 69
4.3 Real-Words Error Detection 71
4.4 Candidates Generation 72
4.4.1 Candidates Generation for Non-word Errors 72
4.4.1.2 Enhanced Levenshtein Method 74
4.4.1.3 Similarity Measure 78
4.4.1.4 Looking for Candidates 79
4.4.2 Candidates Generation for Real-words Errors 81
Chapter Five : Text Correction and Candidates Suggestion
5.1 Introduction 82
5.2 Correction and Candidates Suggestion Structure 82
5.3 Named-Entity Recognition 85
5.4 Candidates Ranking 86
5.4.1 Edit Distance Based Similarity 87
5.4.2 First and End Symbols Matching 87
5.4.3 Difference in Lengths 88
5.4.4 Transposition Probability 89
5.4.5 Confusion Probability 90
5.4.6 Consecutive Letters (Duplication) 91
5.4.7 Different Symbols Existence 92
5.5 Syntax Analysis 93
5.5.1 Sentence Phrasing 93
5.5.2 Candidates Optimization 95
5.5.3 Grammar Correction 95
5.5.4 Document Correction 97
Chapter Six: Experimental Results, Conclusions, and Future Works
6.1 Experimental Results 98
6.1.1 Tagging and Error Detection Time Reduction 98
6.1.1.1 Successful Looking Up 99
6.1.1.2 Failure Looking Up 100
6.1.2 Candidates Generation and Similarity Search Space Reduction 101
6.1.3 Time Reduction of the Damerau-Levenshtein method 103
6.1.4 Features Effect on Candidates Suggestion 104
6.2 Conclusions 107
6.3 Future Works 108
References 110
Appendix A 117
Appendix B 122
List of Figures
Figure No.  Title  Page No.
(2.1) NLP dimensions 16
(2.2) Linguistics analysis steps 17
(2.3) Linguistic Units Hierarchy 19
(2.4) Classification of POS tagging models 24
(2.5) An example of lexical change 34
(2.6) Outlines of Spell Correction Algorithm 38
(2.7) Levenshtein Edit Distance Algorithm 41
(2.8) Damerau-Levenshtein Edit Distance Algorithm 42
(2.9) The Suggested System Block Diagram 47
(3.1) Token Hashing Algorithm 54
(3.2) Dictionary Structure and Indexing Scheme 55
(3.3) Algorithm of Looking Up Procedure 57
(3.4) Semi Hash Clustering block diagram 61
(3.5) Similarity Based Hashing algorithm 64
(3.6) Block diagram of candidates generation using SBL 66
(3.7) Similarity Based Looking up algorithm 68
(4.1) Tagging Flow Chart 70
(4.2) The Enhanced Levenshtein Method Algorithm 76
(4.3) Original Levenshtein Example 77
(4.4) Damerau-Levenshtein Example 77
(4.5) Enhanced Levenshtein Example 78
(5.1) Candidates ranking flowchart 84
(5.2) Syntax analysis flowchart 94
(6.1) Tokens distribution in primary packets 99
(6.2) Tokens distribution in secondary packets 99
(6.3) Time complexity Variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification) 103
(6.4) Suggestion Accuracy with a comparison to Microsoft Office Word on a Sample from the Wikipedia 104
(6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset 105
(6.6) Discarding one feature at a time for optimal candidate selection 106
(6.7) Using one feature at a time for optimal candidate selection 107
List of Tables
Table No.  Title  Page No.
(1-1) Summary of Literature Review 9
(3-1) Alphabet Encoding 50
(3-2) Addressing Range 52
(3-3) Predicting errors using Bi-grams analysis 61
(5-1) Transposition Matrix 90
(5-2) Confusion Matrix 91
List of Symbols and Abbreviations
Abbreviation  Meaning
∑  Alphabet
A  Adjectival Phrase
abs  Absolute Difference
C  Sentence Complement
CFG  Context Free Grammar
D  Dictionary
DNA  Deoxyribonucleic Acid
E  Error
G  Grammar
GEC  Grammar Error Correction
HMM  Hidden Markov Model
IR  Information Retrieval
MT  Machine Translation
NE  Named Entity
NER  Named-Entity Recognition
NG  Noun Group
NLG  Natural Language Generation
NLP  Natural Language Processing
NLs  Natural Languages
NLU  Natural Language Understanding
NP  Noun Phrase
O( )  big-Oh notation (= at most)
OCR  Optical Character Recognition
P  Production Rule
POS  Part Of Speech
PP  Prepositional Phrase
Q  Query
R  Ranking Value
R_Dist  Relative Distance
S  Start Symbol
SMT  Statistical Machine Translation
SR  Speech Recognition
St1, St2  String1, String2
V  Variable
v  Adverbial Phrase
VP  Verb Phrase
Ω( )  big-Omega notation (= at least)
Chapter One
Overview
1
Chapter One
Overview
1.1 Introduction
Natural Language Processing, also known as Computational Linguistics,
is the field of computer science that deals with linguistics; it is a form of
human-computer interaction in which the elements of human language are
formalized so that they can be processed by a computer [Ach14]. Natural
Language Processing (NLP) is the implementation of systems that are
capable of manipulating and processing natural language (NL)
sentences [Jac02] in languages like English, Arabic, and Chinese, as opposed
to formal languages like Python, Java, and C++, or descriptive languages such
as DNA in biology and chemical formulas in chemistry [Mom12]. The task of
NLP is designing and building software for analyzing, understanding, and
generating spoken and/or written NLs. [Man08] [Mis13]
NLP has many applications such as automatic summarization, Machine
Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition
(SR), Optical Character Recognition (OCR), Information Retrieval (IR),
Opinion Mining [Nad11], and others [Wol11].
Text correction is another significant application of NLP. It includes
both spell checking and Grammar Error Correction (GEC). Spell-checking
research dates back to the mid-20th century, with early work by Les Earnest
at Stanford University, but the first application was created in 1971 by Ralph
Gorin, Earnest's student, for the DEC PDP-10 mainframe, with a dictionary of
10,000 English words. [Set14] [Pet80]
Grammar error correction, despite its central role in semantic and
meaning representation, has been largely ignored by the NLP community. In recent
years, an improvement has been noticed in automatic GEC techniques. [Voo05]
[Jul13] However, most of these techniques are limited to specific domains
such as real-word spell correction [Hwe14], subject-verb disagreement
[Han06], verb tense misuse [Gam10], determiner or article errors, and improper
preposition usage. [Tet10] [Dah11]
Different techniques such as edit distance [Wan74], rule-based techniques
[Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98],
probabilistic techniques [Chu91], neural nets [Hod03], and the noisy channel
model [Tou02] have been proposed for text correction purposes. Each
technique needs some sort of resource: edit distance, rule-based, and
similarity key techniques require a dictionary (or lexicon); n-gram and
probabilistic techniques work with statistical and frequency information; neural
nets are trained with patterns; and so on.
Text correction, covering both spelling and grammar, is an extensive process
that typically includes three major steps: [Ach14] [Jul13]
The first step is to detect the incorrect words. The most popular way to
decide whether a word is misspelled is to look it up in a dictionary, a list of
correctly spelled words. This approach can detect non-word errors but not
real-word errors [Kuk92] [Mis13], because an unintended word may still match a
word in the dictionary. NLs have a large number of words, resulting in a
huge dictionary; therefore, looking up every word consumes a long
time. In GEC this step is more complicated: it requires analysis at the level
of sentences and phrases, using computational linguistics fundamentals to
detect the word that makes the sentence incorrect.
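The dictionary-lookup step described above can be sketched as follows; the word list here is a tiny illustrative stand-in (the actual system uses a dictionary of more than 300,000 tokens):

```python
# A minimal sketch of dictionary-based non-word error detection.
# The word set below is an illustrative assumption, not the real lexicon.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def detect_non_word_errors(text):
    """Return the tokens that do not appear in the dictionary."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in DICTIONARY]

# A real-word error (e.g. "form" typed instead of "from") would pass
# this check unnoticed, which is why grammar-level analysis is needed.
print(detect_non_word_errors("the quick brwon fox"))  # ['brwon']
```

As the comment notes, membership testing alone catches only non-word errors; real-word errors require the sentence-level analysis described above.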
Next, a list of candidates or alternatives should be generated for the
incorrect word (misspelled or misused). This list should preferably be short and
contain the words with the highest similarity or suitability; producing it requires
a technique that calculates the similarity of the incorrect word to
every word in the dictionary. Efficiency and accuracy are major factors in
the selection of such a technique. GEC requires broad knowledge of diverse
grammatical error categories and extensive linguistic technique to identify
alternatives, because a grammatical error may not result from a single
word.
Finally, the intended word, or a list of alternatives containing the
intended word, is suggested. This task requires ranking the words according to
their degree of similarity to the incorrect word; other considerations may
or may not be taken into account depending on the technique in use.
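The similarity calculation at the heart of candidate generation is commonly done with the Levenshtein edit distance. The following sketch ranks a handful of dictionary words against a misspelling; the word list and the top-k cutoff are assumptions for demonstration only:

```python
def levenshtein(s1, s2):
    """Classic dynamic-programming edit distance: insertions,
    deletions, and substitutions each cost one edit."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def rank_candidates(misspelled, dictionary, k=3):
    """Rank dictionary words by ascending edit distance to the error."""
    return sorted(dictionary, key=lambda w: levenshtein(misspelled, w))[:k]

words = ["receive", "recipe", "relieve", "deceive", "receipt"]
print(rank_candidates("recieve", words))
```

Note that plain Levenshtein charges the i/e swap in "recieve" as two substitutions, so "relieve" (one substitution away) outranks "receive"; this is exactly the limitation that motivates treating transposition as a single error, as in the Damerau extension discussed later.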
Text mining techniques have started to enter the area of text correction;
clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11], and
information retrieval [Kir11] are examples. Statistics and probability have also
played a great role, specifically in analyzing common mistakes and n-gram
datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and
phonetic levels, can be used to reduce the lookup space; NER may help
avoid interpreting proper nouns as misspellings; and statistics have been merged
with NLP techniques to provide more precise parsing and POS tagging, usually
in context-dependent applications. The application of a given technique
differs according to the intended level of correction: it starts at the
character level [Far14]; passes through the word, phrase (usually in GEC), and
sentence levels; and ends at the context or document-subject level.
1.2 Problem Statement
Although many text checking and correction systems have been produced,
each has its own variances in terms of input quality restrictions, techniques
used, output accuracy, speed, performance conditions, etc. [Ahm09]
[Pet80]. This field of NLP is truly open research from all sides, because
no complete algorithm or technique handles all considerations.
Limited linguistic knowledge, the huge number of lexicon entries, extensive
grammar, language ambiguity and change over time, the variety of
committed errors, and computational requirements are all challenges facing the
development of a text correction application.
In this work, some of the above-mentioned problems are addressed using a
set of solutions:
 Integrating two lexicon datasets (WordNet and Ispell).
 Using a brute-force approach to solve some sorts of ambiguity.
 Applying hashing and indexing in looking up the dictionary.
 Reducing the search space of the candidate collection process by
grouping similarly spelled words into semi-clusters.
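The hashing-and-indexing idea in the list above can be illustrated with a simplified prefix-bucket index. The two-letter bucket key here is an illustrative assumption, not the thesis's actual hash function, which is defined in Chapter Three:

```python
from collections import defaultdict

def build_index(words, prefix_len=2):
    """Group dictionary words into buckets keyed by their prefix
    (a stand-in for the prefix-dependent hash function)."""
    index = defaultdict(set)
    for w in words:
        index[w[:prefix_len]].add(w)
    return index

def lookup(index, word, prefix_len=2):
    """Membership test that inspects only one bucket,
    not the whole dictionary."""
    return word in index[word[:prefix_len]]

index = build_index(["cat", "car", "care", "dog", "dig"])
print(lookup(index, "car"), lookup(index, "cag"))  # True False
```

Because each lookup touches a single bucket, the cost depends on bucket size rather than total dictionary size, which is the point of the prefix-dependent scheme.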
The Levenshtein method [Hal11] is also enhanced to consider Damerau's
four types of errors within a shorter time than the Damerau-Levenshtein
method [Hal11]. Named-Entity Recognition, letter confusion and
transposition, and the effect of candidate length are used as features to optimize
candidate suggestion, in addition to applying rules of Part-Of-Speech tags
and sentence constituency to check a sentence's grammatical correctness,
whether or not it is lexically correct.
The proposed system consists of three components: (1) a spell error
detector based on a fast looking-up technique over a dictionary of more than
300,000 tokens, constructed by applying a string-prefix-dependent hash
function and indexing method; the grammar error detector is a brute-force
parser. (2) For candidate generation, an enhancement was implemented on
the Levenshtein method to consider Damerau's four error types; the enhanced
method is used to measure similarity according to the minimum edit distance and
the effect of the difference in lengths, and the dictionary tokens are grouped
into spelling-based clusters to reduce the search space. (3) Candidate suggestion
exploits NER features, transposition error and confusion statistics, affix
analysis (including first and last letter matching), candidate length, and
parsing success.
1.3 Literature Review
 Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to
string transformation that includes a model consisting of rules and weights for
training, and an algorithm that depends on scoring and ranking according to a
conditional probability distribution for generating the top-k candidates at
the character level, where both high- and low-frequency words can be
generated. Spell checking is one of many applications to which the
approach was applied; the misspelled strings (words or characters) are
transformed, by applying a number of operators, into the k most similar
strings in a dictionary (start and end letters are constants). [Ach14]
 Mariano F., Zheng Y., and others, 2014, tackled the correction of
grammatical errors by pipelining processes that combine results from
multiple systems. The components of the approach are: a rule-based error
corrector that uses rules automatically derived from the Cambridge Learner
Corpus, based on N-grams that have been annotated as incorrect; an
SMT system that translates incorrectly written English into correct English;
NLTK1, used to perform segmentation, tokenization, and POS
tagging; candidate generation, which produces all possible combinations
of corrections for the sentence, in addition to the sentence itself to
consider the "no correction" option; finally, the candidates are ranked
using a language model. [Fel14]
__________________________________________________________
1 The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research
and teaching in computational linguistics and natural language processing. NLTK is written in Python and
distributed under the GPL open source license. Over the past year the toolkit has been rewritten,
simplifying many linguistic data structures and taking advantage of recent enhancements in the Python
language.
 Anubhav G., 2014, presented a rule-based approach that used two POS
taggers, the Stanford parser and TreeTagger, to correct non-native English
speakers' grammatical errors. The detection of errors depends on the
outputs of the two taggers: if they differ, the sentence is not correct.
Errors are corrected using the NodeBox English Linguistics library. Error
correction includes subject-verb disagreement, verb form errors, and errors
detected by POS tag mismatch. [Gup14]
 Stephan R., 2013, proposed a model for spelling correction based on
treating words as "documents" and spell correction as a form of
document retrieval, in that the model retrieves the best-matching correct
spelling for a given input. The words are transformed into tiny documents of
bits, and Hamming distance is used to predict the closest string of bits
from a dictionary holding the correctly spelled words as strings of bits.
The model is knowledge-free and contains only a list of correct words.
[Raa13]
 Youssef B., 2012, produced a parallel spell-checking algorithm for
spelling error detection and correction. The algorithm is based on
information from the Yahoo! N-Grams Dataset 2.0; it is a shared-memory
model allowing concurrency among threads on both parallel multi-processor
and multi-core machines. The three major components (error
detector, candidate generator, and error corrector) are designed to run in
parallel. The error detector, based on unigrams, detects non-word
errors; the candidate generator is based on bi-grams; and the error corrector,
which is context sensitive, is based on 5-gram information. [Bas12]
 Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and
Gary G. L., 2012, presented a novel method for grammatical error
correction by building a meta-classifier. The meta-classifier decides the
final output depending on the internal results from several base
classifiers; they used multiple grammatical-error-tagged corpora with
different properties in various aspects. The method focused on articles,
and correction arises only when a mismatch occurs with the
observed articles. [Seo12]
 Kirthi J., Neeju N.J., and P. Nithiya, 2011, proposed a semantic
information retrieval system that performs automatic spell correction on
user queries before applying the retrieval process. The correction
procedure depends on matching the misspelled word against a dictionary of
correctly spelled words using the Levenshtein algorithm. If an incorrect
word is encountered, the system retrieves the most similar word
depending on the Levenshtein measure and the occurrence frequency of
the misspelled word. [Kir11]
 Farag, Ernesto, and Andreas, 2008, developed a language-independent
spell checker. It is based on enhancing the N-gram model by
creating a ranked list of correction candidates derived from N-gram
statistics and lexical resources, then selecting the most promising
candidates as correction suggestions. Their algorithm assigns weights to
the possible suggestions to detect non-word errors. They relied on a
"MultiWordNet" dictionary of about 80,000 entries. [Ahm09]
 Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of
real-word spelling error correction. They assumed that the observed
sentence is a signal passed through a noisy channel, where the channel
reflects the typist and the distortion reflects errors committed by the
typist. The probability of the sentence's correctness, given by the channel
(typist), is a parameter associated with that sentence. The probability of
every word in the sentence being the intended one is equivalent to the
sentence correctness probability, and each word is associated with a set of
spelling-variant words excluding the word itself. Correction can be applied
to one word in the sentence by replacing the incorrect word with another
from its candidate set (its real-word spelling variations) so that it gives
the maximum probability. [Amb08]
 Stoyan, Svetla, and others, 2005, described an approach for lexical post-
correction of the output of an optical character recognizer (OCR) as a two-part
research project. They worked on multiple fronts: on the dictionary side,
they enriched their large dictionaries with specialty dictionaries; for
candidate selection, they used a very fast search algorithm that depends on
Levenshtein automata to efficiently select correction candidates with a
distance bound not exceeding 3; and they ranked candidates depending on a
number of features such as frequency and edit distance. [Mih04]
 Suzan V., 2002, described a context-sensitive spell-checking algorithm
based on the BESL spell checker lexicons and word trigrams for
detecting and correcting real-word errors using probability information.
The algorithm splits the input text into trigrams, and every trigram is
looked up in a precompiled database containing a list of trigrams and
their occurrence counts in the corpus used to compile the database. A
trigram is correct if it is in the trigram database; otherwise, it is considered
an erroneous trigram containing a real-word error. The correction
algorithm uses the BESL spell checker to find candidates, but those most
frequent in the trigram database are suggested to the user. [Ver02]
No. Reference Methodology Technique
1
[Ach14] Generating the top K-
candidates at the
character level for both
high and low frequency.
A model consists of rules and
weights, and a conditional
probability distribution
dependent algorithm
2
[Fel14] Grammatical errors
correction based on
generating all possible
correct alternatives for
the sentence
Combining the results of
multiple systems: rule based
error corrector, SMT English
to Correct English translator,
and NLTK for segmentation,
tokenization and tagging
3
[Gup14] Non-native English
speakers' grammatical
errors correction
Error detection used Stanford
parser and Tree Tagger.
Correction based on
Nodebox English Linguistic
library
4
[Raa13] Dictionary based Spell
correction treats the
misspelled word as a
document.
Converting the misspelled
word into a tiny document of
bits and retrieving the most
similar documents using
Hamming Distance
5
[Bas12] Context sensitive spell
checking using a shared
memory model allowing
concurrency among
threads for parallel
execution
Different N-grams levels for
error detection, candidates
generation, and candidates
suggestion depending on
Yahoo! N-Grams dataset 2.0
6
[Seo12] Meta-classifier for
grammatical errors
correction focused
mainly on the articles.
Deciding the output
depending on the internal
results from several base
classifiers
7
[Kir11] Automatic spell
correction for user
queries before applying
retrieval process
Using Levenshtein algorithm
for both error detection and
correction in a dictionary
looking up technique
Table 1.1: Summary of Literature Review
8
[Ahm09] Language independent
model for non-word error
correction based on N-
gram statistics and lexical
resources
Ranking a list of correction
candidates by assigning
weights to the possible
suggestions depending on a
"MultiWordNet" dictionary
of about 80,000 entries
9
[Amb08] Noisy channel model for
Real words error
correction based on
probability.
Channel represents the typist,
distortion represents the
error, and the noise
probability is a parameter
10
[Mih04] OCR output post
correction
Levenshtein automata for
candidates generation and
frequency for ranking
11
[Ver02] Context sensitive spell
checking algorithm based
on tri-grams
Splitting texts into word
trigrams and matching them
against the precompiled
BESL spell checker lexicons,
suggestion depends on
probability information.
1.4 Research Objectives
This research attempts to design and implement a smart text-document
correction system for English texts. It is based on mining a typed
text to detect spelling and grammar errors and giving the optimal
suggestion(s) from a set of candidates. Its steps are:
1. Analyzing the given text using Natural Language Processing
techniques, detecting erroneous words at each step.
2. Looking up candidates for the erroneous words and ranking them
according to a given set of features and conditions to form the initial
solutions.
3. Optimizing the initial solutions depending on the information
extracted from the given text and the detected errors.
4. Recovering the input text document with the optimal solutions and
associating the best set of candidates with each detected incorrect
word.
1.5 Thesis Outlines
The next five chapters are:
1. Chapter Two: "Background and Related Concepts" consists of two
parts. The first overviews NLP fundamentals, applications, and
techniques, whereas the second is about text correction techniques.
2. Chapter Three: "Dictionary Structure and Looking up Technique"
describes the suggested approach to constructing the system's dictionary
for both perfect-match and similarity looking up.
3. Chapter Four: "Error Detection and Candidates Generation" presents
the suggested technique for indicating incorrect words and the method
of generating candidates.
4. Chapter Five: "Automatic Text Correction and Candidates
Suggestion" describes the techniques of suggestion selection and
optimization.
5. Chapter Six: "Experimental Results, Conclusion, and Future Works"
presents the experimental results of applying the techniques described in
chapters three, four, and five, the conclusions of the system, and future
directions.
Chapter Two
Background
and
Related Concepts
 12 
Chapter Two
Background and Related Concepts
Part I
Natural Language Processing
2.1 Introduction
Natural Language Processing (NLP) began in the late 1940s, focused
on machine translation; in 1958, NLP was linked to
information retrieval by the Washington International Conference of
Scientific Information [Jon01]. Primary ideas for developing applications
for detecting and correcting text errors started in that period.
[Pet80] [Boo58]
Natural Language Processing has attracted great interest from that time to
the present because it plays an important role in the interaction between
humans and computers. It represents the intersection of linguistics and
artificial intelligence [Nad11], where machines can be programmed to
manipulate natural language.
2.2 Natural Language Processing Definition
"Natural Language Processing (NLP) is the computerized approach
for analyzing text that is based on both a set of theories and a set of
technologies." [Sag13]
NLP describes the function of software or hardware components in a
computer system that are capable of analyzing or synthesizing human
languages (spoken or written) [Jac02] [Mis13] like English, Arabic, and
Chinese, as opposed to formal languages like Python, Java, and C++, or
descriptive languages such as DNA in biology and chemical formulas in
chemistry [Mom12].
"NLP is a tool that can reside inside almost any text processing
software application" [Wol11]
We can define NLP as a subfield of Artificial Intelligence that
encompasses everything a computer needs to understand and generate
natural language. It is based on processing human language for two tasks:
the first receives a natural language input (text or speech), applies analysis,
reasons about what was meant by that input, and produces output in a
computer language; this is the task of Natural Language Understanding
(NLU). The second task is to generate human sentences according to
specific considerations; the input is in a computer language but the output
is in a human language. This is called Natural Language Generation
(NLG). [Raj09]
"Natural Language Understanding is associated with the more
ambitious goals of having a computer system actually comprehend natural
language as a human being might". [Jac02]
2.3 Natural Language Processing Applications
Despite its wide usage in computer systems, NLP has almost entirely
disappeared into the background, where it is invisible to the user yet adds
significant business value. [Wol11]
The major distinction of NLP applications from other data
processing systems is that they use Language Knowledge. Natural
Language Processing applications are mainly divided into two categories
according to the given NL format [Mom12] [Wol11]:
2.3.1 Text Technologies
 Spell and Grammar Checking: systems deal with indicating
lexical and grammar errors and suggest corrections.
 Text Categorization and Information Filtering: In such
applications, NLP represents the documents linguistically and
compares each one to the others. In text categorization, the
documents are grouped, according to the characteristics of their
linguistic representation, into several categories. Information
filtering singles out, from a collection of documents, those
documents that satisfy some criterion.
 Information Retrieval: finds and collects information relevant to
a given query. A user expresses an information need by a query;
the system then attempts to match the given query to the database
documents that satisfy it. Query and documents are transformed
into a sort of linguistic structure, and the matching is performed
accordingly.
 Summarization: according to an information need or a query
from the user, this type of applications finds the most relevant
part of the document.
 Information Extraction: refers to the automatic extraction of
structured information, such as entities, their relationships, and
the attributes describing them, from unstructured sources. This
makes it possible to integrate structured and unstructured data
sources, if both exist, and to pose queries spanning the
integrated information, giving better results than applying
searches by keywords alone.
 Question Answering: works with plain speech or text input and
applies an information search based on the input; for example,
IBM® Watson™, the reigning JEOPARDY! champion, reads
questions, understands their intention, and then looks up its
knowledge library to find a match.
 Machine Translation: translates a given text from one natural
language into another; some applications have the ability to
recognize the language of the given text even if the user did not
specify it correctly.
 Data Fusion: Combining extracted information from several text
files into a database or an ontology.
 Optical Character Recognition: digitizing handwritten and
printed texts, i.e., converting characters from images into digital
codes.
 Classification: this type of NLP application sorts and organizes
information into relevant categories, as in e-mail spam filters and
the Google News™ news service.
 NLP has also entered other applications such as educational
essay test-scoring systems, voice-mail phone trees, and even e-
mail spam detection software.
2.3.2 Speech Technologies
 Speech Recognition: mostly used on telephone voice response
systems as a service client. Its task is processing plain speech. It
is also used to convert speech into text.
 Speech Synthesis: means converting text into speech. This
process requires working at the level of phones and converting
from alphabetic symbols into sound signals.
2.4 Natural Language Processing and Linguistics
Natural Language Processing is concerned with three dimensions:
language, algorithm, and problem, as presented in figure (2.1). On the
language dimension, NLP considers linguistics; the algorithm dimension
covers NLP techniques and tasks, while the problem dimension depicts
the mechanisms applied to solve problems. [Bha12]
2.4.1 Linguistics
Natural Language is a means of communication. It is a system of
arbitrary signals such as voice sounds and written symbols. [Ali11]
Linguistics is the scientific study of language; it starts from the simple
acoustic signals which form sounds and ends with pragmatic understanding
to produce the full contextual meaning.
There are two major levels of linguistic analysis, Speech Recognition
(SR) and Natural Language Processing (NLP), as shown in figure (2.2).
Figure (2.1) : NLP dimensions [Bha12]
2.4.1.1 Terms of Linguistic Analysis
A natural language, as a formal language does, has a set of basic
components that may vary from one language to another but remain
bounded under specific considerations, giving each language its special
characteristics.
From the computational view, a language is a set of strings generated
over a finite alphabet and can be described by a grammar. The definition
of the three abstracted names depends on the language itself; i.e.,
strings, alphabet, and grammar formulate and characterize a language.
Figure (2.2) : Linguistics analysis steps [Cha10] (from the acoustic signal
through phones, letters and strings, morphemes, words, and phrases and
sentences, up to meaning out of context and meaning in context; the early
levels fall under SR and the later ones under NLP, covering phonetics,
phonology, the lexicon, morphology, syntax, semantics, and pragmatics)
 Strings:
In natural language processing, the strings are the morphemes of the
language, their combinations (words), and the combinations of their
combinations (sentences), but linguistics goes somewhat deeper than this.
It starts with phones, the primitive acoustic patterns, which are significant
and distinguishable from one natural language to another. Phonology
groups phones together to produce phonemes, represented by symbols.
Morphemes consist of one or more symbols; thus, NLs can be further
distinguished.
 Alphabet:
When individual symbols, usually thousands, represent words, the
language is "logographic"; if the individual symbols represent syllables,
it is a "syllabic" one. When they represent sounds, the language is
"alphabetic". Syllabic and alphabetic languages typically have fewer than
100 symbols, unlike logographic ones.
English is an alphabetic language system consisting of 26 symbols;
these symbols represent phones, which combine into morphemes, which
may or may not be combined further to form words.
 Grammar:
Grammar is a set of rules specifying the legal structure of the
language; it is a declarative representation of the language's syntactic
facts. Usually, a grammar is represented by a set of production rules.
2.4.1.2 Linguistic Units Hierarchy
Language can be divided into pieces; there is a typical structure or
form for every level of analysis. Those pieces can be put into a hierarchical
structure starting from a meaningful sentence at the top level and
proceeding through the separation of building units until reaching the
primary acoustic sounds. Figure (2.3) presents an example.
Figure (2.3) : Linguistic Units Hierarchy
2.4.1.3 Sentence Structure and Constituency
"It is constantly necessary to refer to units smaller than the sentence
itself units such as those which are commonly referred as CLAUSE,
PHRASE, WORD, and MORPHEME. The relation between one unit and
another unit of which it is a part is CONSTITUENCY." [Qui85]
Figure (2.3) decomposes the example sentence "The teacher talked to
the students" level by level: the whole sentence, its phrases, its words, its
morphemes ("The teach-er talk-ed to the student-s"), and finally its
phonemes, transcribed with the latest codes of English phones adopted by
the OXFORD dictionaries.
The task of dividing a sentence into constituents is a complex one,
requiring the incorporation of more than one analysis stage; tokenization,
segmentation, parsing (and sometimes stemming) are usually merged
together to build the parse tree for a given sentence.
2.4.1.4 Language and Grammar
A language is a 'set' of sentences and a sentence is a 'sequence' of
'symbols' [Gru08]; it can be generated given its context free grammar
G=(V,∑,S,P). [Cla10]
Commonly, grammars are represented as a set of production rules
which are taken by the parser and compared against the input sentences.
Every matched rule adds something to the sentence's complete structure,
which is called a 'parse tree'. [Ric91]
Context free grammar (CFG) is a popular method for generating
formal grammars. It is used extensively to define language syntax. The
four components of the grammar are defined in CFG as [Sag13]:
 Terminals (∑): represent the basic elements which form the
strings of the language.
 Nonterminals or Syntactic Variables (V): sets of strings that
define the language generated by the grammar. Nonterminals
play a key role in syntax analysis and translation by imposing a
hierarchical structure on the language.
 Set of production rules (P): this set defines the way of combining
terminals with nonterminals to produce strings. A production
rule consists of a variable on its left side, representing its head,
and a string of terminals and nonterminals on its right side,
representing its body.
 Start symbol (S).
The following is an example describing the structure of an English
sentence:
V = {S, NP, VP, N, V, Art}
∑ = {boy, icecream, dog, bite, like, ate, the, a}
P = {S → NP VP,
NP → N,
NP → Art N,
VP → V NP,
N → boy | icecream | dog,
V → ate | like | bite,
Art → the | a}
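The toy grammar above can be made concrete in code. The following is a minimal backtracking recogniser (our own illustrative sketch, not part of the thesis) that tests whether a token sequence belongs to the language generated by the grammar:

```python
# A minimal recogniser for the toy grammar above (illustrative sketch).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"], ["N"]],
    "VP":  [["V", "NP"]],
    "N":   [["boy"], ["icecream"], ["dog"]],
    "V":   [["ate"], ["like"], ["bite"]],
    "Art": [["the"], ["a"]],
}

def parse(symbol, words, pos):
    """Yield every index up to which `symbol` can match `words` starting at `pos`."""
    if symbol not in GRAMMAR:           # terminal: must match the next word
        if pos < len(words) and words[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:  # nonterminal: try each production
        ends = {pos}
        for part in production:
            ends = {e2 for e in ends for e2 in parse(part, words, e)}
        yield from ends

def accepts(sentence):
    """A sentence is accepted if S can consume all of its words."""
    return len(sentence.split()) in parse("S", sentence.split(), 0)

print(accepts("the dog ate the icecream"))  # True
print(accepts("dog the ate"))               # False
```

The weak generative capacity of the grammar is exactly the set of sentences for which `accepts` returns True; collecting the matched productions instead of only the end positions would yield the parse tree, i.e. its strong generative capacity.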
The grammar specifies two things about the language: [Ric91]
 Its weak generative capacity; the limited set of sentences which can
be completely matched by a series of grammar rules.
 Its strong generative capacity, grammatical structure(s) of each
sentence in the language.
Generally, for each grammar there are an infinite number of sentences
which can be structured with it. The strength and importance of grammars
lie in their ability to supply structure to an infinite number of
sentences, because they succinctly summarize the structures of an infinite
number of objects of a certain class. [Gru08]
The grammar is said to be generative if it has a fixed-size set of
production rules which, if followed, can generate every sentence in the
language in a finite number of actions. [Gru08]
2.5 Natural Language Processing Techniques
2.5.1 Morphological Analysis
Morphology is the study of how words are constructed from
morphemes which represent the minimal meaning-bearing language
primitive units.[Raj09] [Jur00]
There are two broad classes of morphemes: stems and affixes; the
distinction between the two classes is language dependent in that it varies
from one language to another. The stem, usually, refers to the main part of
the word and the affixes can be added to the words to give it additional
meaning. [Jur00]
Furthermore, affixes can be divided into four categories according to
the position where they are added. Prefixes, suffixes, circumfixes, and
infixes generally refer to the different types of affixes, but it is not
necessary for a language to have all the types. English accepts both
prefixes, which precede stems, and suffixes, which follow stems, while
there is no good example of a circumfix (preceding and following a stem)
in English, and infixing (inserting inside the stem) is not allowed (unlike
German and Philippine languages, respectively). [Jur00]
Morphology is concerned with recognizing the modification of base
words to form other words with different syntactic categories but similar
meanings.
Generally, three forms of word modifications are found [Jur00]:
 Inflection: syntactic rules change the textual representation of the
words, such as adding the suffix 's' to convert nouns into plurals, or
adding 'er' and 'est' to convert regular adjectives into the
comparative and superlative forms, respectively. This type of
modification usually results in a word from the same word class as
the stem word.
 Derivation: new words are produced by adding morphemes;
derivation is usually more complex, and harder in meaning, than
inflectional morphology. It often occurs in a regular manner and
results in words that differ in word class from the stem word, like
adding the suffix 'ness' to 'happy' to produce 'happiness'.
 Compounding: this type modifies stem words by grouping them
with other stem words, like grouping 'head' with 'ache' to produce
'headache'. In English, this type is infrequent.
Morphological processing, also known as stemming, depends heavily on
the analyzed language. The output is the set of morphemes that are
combined to form words. Morphemes can be stem words, affixes, and
punctuations.
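As a minimal illustration of suffix stripping (our own sketch; a real stemmer such as Porter's uses far more rules and exception lists), an inflectional suffix stripper for English might look like this:

```python
# Naive suffix-stripping stemmer (illustrative only).
# Suffixes are tried longest-first so 'ness' wins over 's'.
SUFFIXES = ["ness", "est", "ing", "ed", "er", "s"]

def strip_suffix(word):
    """Return (stem, suffix) for the first matching inflectional suffix."""
    for suffix in SUFFIXES:
        # Require a reasonably long remainder to avoid stripping short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)], suffix
    return word, ""

print(strip_suffix("talked"))    # ('talk', 'ed')
print(strip_suffix("students"))  # ('student', 's')
```

Even this toy version shows why stemming is language-dependent: the suffix list, the ordering, and the length heuristic are all specific to English inflection.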
2.5.2 Part Of Speech Tagging
Part of Speech (POS) tagging is the process of giving the proper
lexical information or POS tag (also known as word classes, lexical tags,
and morphological classes), which is encoded as a symbol, for every word
(or token) in a sentence. [Sco99] [Has06b]
In English, POS tags are classified into four basic classes of words: [Qui85]
1. Closed classes: include prepositions, pronouns, determiners,
conjunctions, modal verbs and primary verbs.
2. Open classes: include nouns, adjectives, adverbs, and full verbs.
3. Numerals: include numbers and orders.
4. Interjections: include small set of words like oh, ah, ugh, phew.
Usually, a POS tag indicates one or more of the previous information
items, and it sometimes holds other features like the tense of the verb or
the number (plural or singular). POS tagging may generate tagged corpora
or serve as a preprocessing step for the next NLP processes. [Sco99]
The performance of most tagging systems is typically limited because
they use only the local lexical information available in the sentence, in
contrast to syntax analysis systems, which exploit both lexical and
structural information. [Sco99] More research has been done and several
models and methods have been proposed to enhance tagger performance;
they fall mainly into supervised and unsupervised methods, where the
main difference between the two categories is that the training corpora are
pre-tagged in supervised methods, unlike unsupervised methods, which
need advanced computational methods for obtaining such corpora.
[Has06a] [Has06b] Figure (2.4) presents the main categories and shows
some examples.
In both categories, the following are the most popular:
Figure (2.4) : Classification of POS tagging models [Has06a]
 Statistical (stochastic, or probabilistic) methods: taggers which
use these methods are first trained on a correctly tagged set of
sentences, which allows the tagger to disambiguate words by
extracting implicit rules or picking the most probable tag based on
the words surrounding the given word in the sentence.
Examples of these methods are Maximum-Entropy Models, Hidden
Markov Models (HMM), and Memory Based models.
 Rule based methods: a sequence of rules, a set of hand-written
rules, is applied to detect the best tag set for the sentence regardless
of any probability maximization. The set of rules needs to be
written properly and checked by human experts. Examples: the
path-voting constraint models and decision tree models.
 Transformational approach: combines both statistical methods and
rule based methods to firstly find the most probable set of available
tags and then applies a set of rules to select the best.
 Neural Networks: with linear separator or full neural network, have
been used for tagging processes.
The methods described above, as in any other research area, have their
advantages and disadvantages, but there is a major difficulty facing all
of them: the tagging of unknown words (words never seen before in the
training corpora). While rule-based approaches depend on a special set
of rules to handle such situations, stochastic and neural-net methods
lack this feature and use other means such as suffix analysis and n-
grams by applying morphological analysis; some methods use a default
set of tags to disambiguate unknown words. [Has06a]
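To make the supervised statistical idea concrete, here is a minimal sketch (our own, with an invented toy corpus) of the most-frequent-tag baseline: each word receives the tag it carried most often in the pre-tagged training data, with a default tag for unknown words:

```python
from collections import Counter, defaultdict

# Toy pre-tagged training corpus (invented for illustration).
training = [("the", "ART"), ("dog", "N"), ("bites", "V"),
            ("the", "ART"), ("bites", "V"), ("bites", "N"),
            ("dog", "N")]

# Count how often each word appears with each tag.
counts = defaultdict(Counter)
for word, tag_seen in training:
    counts[word][tag_seen] += 1

def tag(word, default="N"):
    """Pick the tag seen most often with `word`; fall back for unknown words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print([tag(w) for w in "the dog bites".split()])  # ['ART', 'N', 'V']
```

The `default` argument stands in for the unknown-word strategies mentioned above; a real tagger would instead analyze suffixes or apply morphological rules before falling back to a default tag.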
2.5.3 Syntactic Analysis
"Syntax is the study of the relationships between linguistics forms,
how they are arranged in sequence, and which sequences are well-
formed". [Yul00]
Syntactic analysis, also referred to as "parsing", is the process of
converting the sentence from its flat format, in which it is represented as
a sequence of words, into a structure that defines its units and the
relations between these units. [Raj09]
Hence, the goal of this technique is to transform natural language
into an internal system representation. The format of this representation
may be dependency graphs, frames, trees, or some other structural
representation. Syntactic parsing attempts only to convert sentences
into either dependency links representing the utterance's syntactic
structure or a tree structure; the output of this process is called a "parse
tree" or simply a "parse". [Dzi04] The parse tree of a sentence holds its
meaning at the level of the smallest parts ("words" in the terms of
language scientists, "tokens" in the terms of computer scientists). [Gru08]
Syntactic analysis makes use of both the results of morphological
analysis and Part-Of-Speech tagging to build the structural description of
the sentence by applying the grammar rules of the language under
consideration; if a sentence violates the rules then it is rejected and
assigned as incorrect. [Raj09]
The two main components of every syntax analyzer are:
 Grammar: the grammar provides the analyzer with the set of
production rules that will lead it to construct the structure of the
sentences and specifies the correctness of every given sentence.
Good grammars make a careful distinction between the
sentence/word level, which they often call syntax or syntaxis and
the word/letter level, which they call morphology. [Gru08]
 Parser: the parser reconstructs the production tree (or trees) by
applying the grammar to indicate how the given sentence (if
correctly constructed) was produced from that grammar.
Parsing is the process of structuring a linear representation in
accordance with a given grammar.
Today, most parsers combine context free grammars with probability
models to determine the most likely syntactic structure out of the many
that are accepted as parse trees for an utterance. [Dzi04]
2.5.4 Semantic Analysis
"Semantics is the study of the relationships between linguistic
forms and entities in the words; that is, how words literally connect to
things." [Yul00]
This technique and those following it are fundamental to language
understanding. Semantic analysis is the process of assigning meanings
to the syntactic structures of sentences regardless of their context.
[Yul00] [Raj09]
2.5.5 Discourse Integration
Discourse analysis is concerned with studying the effect of sentences
on each other. It shows how a given sentence is affected by the one
preceding it and how it affects the sentence following it. Discourse
integration is relevant to understanding texts and paragraphs rather than
simple sentences, so discourse knowledge is important in the
interpretation of referential aspects (like pronouns) in the conveyed
information. [Ric91] [Raj09]
2.5.6 Pragmatic Analysis
This step interprets the structure that represents what is said in order
to determine what was actually meant. Context is a fundamental resource
for processing here. [Ric91]
2.6 Natural Language Processing Challenges
The challenges of natural language processing are too many to be
summarized in a limited list; with every processing step, from the
starting point to the outputting of results, there is a set of problems that
natural language processors vary in their ability to handle. However, the
application where NLP is used is usually concerned with a specific task
rather than considering all processing steps with all their details; this is
an advantage for the NLP community that helps to outline the challenges
and problems according to the task under consideration.
For our research area, we are precisely concerned with the set of
problems that directly affect the task of text correction; the next
subsections describe some of them:
2.6.1 Linguistic Units Challenges:
The task of text correction starts from the level of characters up to
paragraphs and full texts, with every level there are a set of difficulties that
the handling analyzer faces:
2.6.1.1 Tokenization
In this process, the lexical analyzer, usually called a "tokenizer",
divides the text into smaller units; the output of this step is a series of
morphemes, words, expressions, and punctuation marks (called tokens).
It involves locating token boundaries (where one token ends and another
begins).
Issues that arise in tokenization and should be addressed are [Nad11]:
 Problems dependent on language type: languages include, in
addition to their symbols, a set of orthographic conventions which
are used in writing to indicate the boundaries of linguistic units.
English employs whitespace to separate words, but this isn't
sufficient to tokenize a text in a complete and unambiguous manner
because the same character may have different uses (as is the case
with punctuation), there are words with multiple parts (such as a
word divided by a hyphen at the end of a line, and some cases of
the addition of prefixes), and many expressions consist of more
than one word.
 Encoding problems: syllabic and alphabetic writing systems are
usually encoded using a single byte, but languages with larger
character sets require two or more bytes. The problem arises when
the same set of encodings represents different character sets,
whereas tokenizers are targeted at a specific encoding for a specific
language.
 Other problems, such as the dependency on the application
requirements, which indicate what constituent is defined as a
token; in computational linguistics the definition should precisely
indicate what the next processing step requires. The tokenizer
should also have the ability to recognize irregularities in texts such
as misspellings, erratic spacing, punctuation, etc.
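The point that whitespace alone is not enough can be seen in a small sketch (our own, not the thesis's tokenizer): a regular expression that separates words, hyphenated or apostrophized forms, and individual punctuation marks:

```python
import re

# Illustrative tokenizer: a word possibly containing internal hyphens or
# apostrophes, or else a single non-space punctuation character.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Dr. Smith didn't pay $10, did he?"))
# ['Dr', '.', 'Smith', "didn't", 'pay', '$', '10', ',', 'did', 'he', '?']
```

Note how "Dr." is split into "Dr" and ".": the tokenizer cannot tell an abbreviation period from a sentence-final one, which is exactly the ambiguity the segmentation step below must resolve.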
2.6.1.2 Segmentation
Segmenting text means dividing it into small meaningful pieces,
typically referred to as "sentences"; a sentence consists of one or more
tokens and carries a meaning which may not be completely clear. This
task requires full knowledge of the scope of punctuation marks, since
they are the major factor in denoting the starts and ends of sentences.
Segmentation becomes more complicated as the uses of punctuation
multiply. Some punctuation marks can be part of a token rather than a
stopping mark, as is the case with periods (.) when used in
abbreviations.
However, there is a set of factors can help in making the
segmentation process more accurate [Nad11]:
 Case distinction: English sentences normally start with a capital
letter (but proper nouns also do).
 POS tags: the tags surrounding a punctuation mark can assist this
process, but multi-tag situations complicate it, such as the use of
-ing verbs as nouns.
 The length of the word (in the case of abbreviation
disambiguation; notice that a period may mark the end of a
sentence and an abbreviation at the same time).
 Morphological information: this task requires finding the stem
word by suffix removal.
It is common not to separate the tokenization and segmentation
processes; they are usually merged together to solve most of the
above problems, specifically the segmentation problems.
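These factors can be combined into a crude rule-based splitter (our own sketch, with a tiny invented abbreviation list): a period, exclamation mark, or question mark ends a sentence unless the token is a known abbreviation:

```python
# Toy abbreviation list; a real segmenter would use a much larger one
# plus the case, length, and morphology cues discussed above.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g."}

def split_sentences(text):
    """Crude segmenter: break after . ! ? unless the token is an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if (token.endswith((".", "!", "?"))
                and token.lower() not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:                       # flush any trailing fragment
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

Without the abbreviation check, "Dr." would wrongly end the first sentence, which is the period ambiguity described above.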
A sentence is described as an indeterminate unit because of the
difficulty of deciding where it ends and another starts, while grammar is
indeterminate from the standpoint of deciding 'which sentence is
grammatically correct?', because this question permits divided answers;
the difficulty of discourse segmentation is not the only reason, but also
grammatical acceptability, meaning, goodness or badness of style,
lexical acceptability, context acceptability, etc. [Qui85]
2.6.2 Ambiguity
An input is ambiguous if there is more than one alternative linguistic
structure for it. [Jur00]
There are two major types of sentence ambiguity: genuine ambiguity
and computer ambiguity. In the first, the sentence really has two
different meanings to an intelligent hearer; in the second, the sentence
has one meaning, but for the computer it has more than one. Unlike the
first, this type is a real problem facing NLP applications. [Not]
Ambiguity as an NLP problem is found in every processing step [Not]
[Bha12]:
2.6.2.1 Lexical Ambiguity
Lexical ambiguity is described to be the possibility for a word to
have more than one meaning or more than one POS tag.
Obviously, meaning ambiguity leads to semantic ambiguity and tag
ambiguity to syntactic ambiguity because it can produce more than one
parse tree. Frequency is an available solution for this problem.
2.6.2.2 Syntactic Ambiguity
The sentence has more than one syntactic structure; particularly,
English common ambiguity sources are:
 Phrase attachment: how a certain phrase or a clause in the sentence
can be attached to another when there is more than one possibility.
Crossing is not allowed in parse trees; therefore, a parser generates a
parse tree for each accepted state.
 Conjunction: sometimes the parser is befuddled in selecting which
phrase a conjunction should be connected to.
 Noun group structure: the rule
NG → NG NG
allows English to generate long series of nouns to be strung together.
Some of these problems can be resolved by applying syntactic constraints.
2.6.2.3 Semantic Ambiguity
Even when a sentence is unambiguous lexically and syntactically,
sometimes, there is more than one interpretation for it. This is because a
phrase or a word may refer to more than one meaning.
"Selection restrictions" or "semantic constraints" is a way to
disambiguate such sentences. It combines two concepts in one mode if both
of the concepts or one of them has specific features. Frequency in context
also can help in deciding the meaning of a word.
2.6.2.4 Anaphoric Ambiguity
This is the possibility for a word or a phrase to refer to something
that is previously mentioned but in the reference there is more than one
possibility.
This type can be resolved by parallel structures or recency rules.
2.6.3 Language Change
"All living languages change with time; it is fortunate that they do so
rather slowly compared to the human life span." Language change is
represented by the change in the grammars of people who speak the
language, and it has been shown that English has changed in the lexicon,
phonological, morphological, syntactic, and semantic components of its
grammar over the past 1,500 years. [Fro07]
2.6.3.1 Phonological Change
Regular sound correspondences show the changes of the phonological
system. The phonological system is governed, like any other linguistic
system, by a set of rules, and this set of phonemes and phonological
rules is subject to change by modification, deletion, and addition of new
rules. A change in phonological rules can affect the lexicon, in that the
formation of some English words depends on sounds; for example,
vowel sounds differentiate the nouns house and bath from the verbs
house and bathe.
2.6.3.2 Morphological Change
Morphological rules, like the phonological, are subject to addition,
loss, and change. Mostly, the usage of suffixes is the active area of
change, where the way they are added to the ends of stems affects the
resulting words and therefore changes the lexicon.
2.6.3.3 Syntactic Change
Syntactic changes are influenced by morphological changes, which in
turn are influenced by phonological changes. This type of change
includes all types of grammar modifications that are mainly based on the
reordering of words inside the sentence.
2.6.3.4 Lexical Change
Change of lexical categories is the most common in this type of
change. An example of this situation is the usage of nouns as verbs, verbs
as nouns, and adjectives as nouns. Lexical change also includes the
addition of new words, borrowing or loan words from another language,
and the loss of existing words.
Figure (2.5) : An example of lexical change 1
2.6.3.5 Semantic Change
As the category of a word can be changed, its semantic
representation or meaning can be changed, too. Three types of change are
possible for a word:
 Broadening: the meaning of a word is expanded to mean everything
it has been used for and more than that.
 Narrowing: on the reverse of broadening, here the word meaning is
reduced from more general meaning to a specific meaning.
 Shifting: the word's reference is shifted to another meaning that
somewhat differs from the original one.
_________________________________________________________
Darby Conley/ Get fuzzy © UFS, Inc. 24 Feb. 2012
Part II
Text Correction
2.7 Introduction
Text correction is the process of indicating incorrect words in an input
text, finding candidates (or alternatives), and suggesting the candidates as
corrections for the incorrect word. The term incorrect refers to two
different types of erroneous words: misspelled and misused. Mainly, the
process is divided into two distinct phases: the error detection phase,
which indicates the incorrect words, and the error correction phase, which
combines both the generation and the suggestion of candidates.
Devising techniques and algorithms for correcting texts automatically
has been an open research challenge since the early 1960s, and it continues
today because existing correction techniques are limited in their accuracy
and application scope [Kuk92].
Usually, a correction application targets a specific type of error,
because computationally predicting the word a human writer intended is a
complex task.
2.8 Text Errors
A word can be mistaken in two ways. The first is by incorrectly
spelling a word, due to a lack of knowledge about its spelling or to
intentionally mistaking symbol(s) within it; this type of error is known as
a non-word error, where the word cannot be found in the language
lexicon.
The second is by using a correctly spelled word in the wrong position
in the sentence or in an unsuitable context. These errors are known as real-word errors
where the incorrect word is accepted in the language lexicon.
[Gol96][Amb08]
Non-word errors are easier to detect than real-word errors; the
latter need more information about the syntax and semantics of the language.
Accordingly, correction techniques are divided into isolated-word error
detection, which is concerned with non-word errors, and context-sensitive
error correction, which deals with real-word errors. [Gol96]
2.8.1 Non-word errors
These errors include words that are not found in the lexicon; a
misspelled word contains one or more of the following errors:
 Substitution: one or more symbols are changed.
 Deletion: one or more symbols are missed from the intended word.
 Insertion: adding symbol(s) to the front, end, or any index in the word.
 Transposition: two adjacent symbols are swapped.
The four errors are known as Damerau edit operations.
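These four operations can be sketched directly as string edits (a minimal Python illustration; the function names are mine, not part of any cited system):

```python
# Minimal sketch of the four Damerau edit operations: each function
# produces the erroneous form from an intended word.
def substitute(word, i, ch):      # one symbol is changed
    return word[:i] + ch + word[i + 1:]

def delete(word, i):              # one symbol is missed
    return word[:i] + word[i + 1:]

def insert(word, i, ch):          # a symbol is added at any index
    return word[:i] + ch + word[i:]

def transpose(word, i):           # two adjacent symbols are swapped
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

For instance, transpose("from", 1) restores "form", illustrating why a single transposition is worth treating as one operation rather than two substitutions.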
2.8.2 Real-word errors
These errors occur by mistaking an intended word for another one
that is lexically accepted. Real-word errors can result from phonetic
confusion, such as using the word "piece" instead of "peace", which usually
leads to semantically unaccepted sentences even after applying non-word
correction; or from misspelling the intended word and producing another
lexically accepted word. [Amb08]
Sometimes, the confusion results in syntactically unaccepted
sentences; like writing the sentence "John visit his uncle" instead of "John
visits his uncle".
Correcting real-word errors is context sensitive in that it needs to
check the surrounding words and sentences before suggesting candidates.
2.9 Error Detection Techniques
Indicating whether a word is correct depends on the type of
correction procedure: non-word error detection usually checks the
acceptance of a word in the language dictionary (the lexicon) and marks any
unmatched word as incorrect, while real-word error detection is a more
complex task that requires analysing larger parts of the text, typically
paragraphs or the full text [Kuk92]. In this work, we mainly focus on
non-word error detection techniques.
Dustin defined a spelling error as a query word Q which is not an
entry in the dictionary D at hand [Bos05]. He outlined an algorithm for
spelling correction as shown in figure (2.6).
Spell error detection techniques can be classified into two major types:
2.9.1 Dictionary Looking Up
All the words of a given text are matched against every word in a
pre-created dictionary or a list of all acceptable words in the language
under consideration (or most of them, since some languages have a huge
number of words and collecting them all is a nearly impossible task). A
word is incorrect if and only if no match is found. This technique is robust
but suffers from the long time required for checking: as the dictionary
grows larger, looking-up time grows longer. [Kuk92] [Mis13]
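The looking-up idea reduces to a membership test; a minimal sketch, assuming a toy lexicon (a real dictionary would hold hundreds of thousands of tokens):

```python
# Dictionary looking up: a word is incorrect iff no match is found.
lexicon = {"peace", "piece", "john", "visits", "his", "uncle"}

def is_non_word(token):
    return token.lower() not in lexicon

# Scanning a text marks every unmatched token as incorrect.
def detect_errors(text):
    return [t for t in text.split() if is_non_word(t)]
```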
2.9.1.1 Dictionaries Resources
Many systems deal with collecting and updating the lexical
dictionaries of languages. An example is the WordNet online application, a
large database of English lexicons. Lexicons (nouns,
verbs, adjectives, articles, etc.) are interlinked by lexical and
conceptual-semantic relations. The structure of WordNet is a network of
words and concepts that are related meaningfully, and this structure makes
it a good tool for NLP and Computational Linguistics.
Another example is the ISPELL text corrector, an online spell
checker that provides interfaces for many Western languages. ISPELL is
the latest version of R. Gorin's spell checker, which was developed for
Unix. Suggesting a spelling correction is based on a Levenshtein edit
distance of only one, by looking up every token of the input text in a huge
lexical dictionary. [ISP14]
2.9.1.2 Dictionaries Structures
The standard looking-up technique is to match every token of the
text against every token in the dictionary, but this process requires a long
time because NL dictionaries are usually huge and string matching takes
longer than matching other data types. A solution to this challenge is to
reduce the search space in a way that keeps similar tokens grouped together.
Figure (2.6) : Outlines of Spell Correction Algorithm [Bos05]
Algorithm: Spell_correction
Input: word w
Output: suggestion(s) a set of alternatives for w
Begin
If (is_mistake(w))
Begin
Candidates=get_candidates( w)
Suggestions=filter_candidates( candidates)
Return suggestions
End
Else
Return is_correct
End.
Chapter Two_ Part II   Text Correction
_______________________________________________________________________
 39
Grouping according to spelling or phones [Mis13] and using hash tables
are two fundamental ways to minimize the search space.
Hashing techniques apply a hash function to generate a numeric key
from strings. The numeric keys are references to packets of tokens that
generate the same key indices; hash functions differ in their ability to
distribute tokens and in how much they minimize the search space. A perfect
hash function generates no collisions (a collision is hashing two different
tokens to the same key index), and a uniform hash function distributes
tokens among packets uniformly. The optimal hash function is a uniform
perfect hash function, which hashes one token to every packet; such a
situation is impossible with dictionaries due to the variance of tokens.
[Nie09]
Spelling- and phone-dependent groups use a limited set of packets and
generate keys according to spelling or pronunciation; they are another style
of hashing, and sometimes of clustering. SPEEDCOP and Soundex are
examples. [Mis13] [Kuk92]
2.9.2 N-gram Analysis
N-grams are defined as subsequences of n words or characters, where n
is variable and often takes the value one to produce unigrams (or
monograms), two to produce bigrams (sometimes called "digrams"), or three
to produce trigrams; it rarely takes larger values. This technique detects
errors by examining each n-gram of the given string and looking it up in a
precompiled n-gram statistics table. The decision depends on the existence
of the n-gram or the frequency of its occurrence: if the n-gram is not found
or is highly infrequent, then the words or strings that contain it are incorrect.
[Kuk92] [Mis13]
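A letter-bigram version of this check might look as follows (the sample corpus and frequency threshold are illustrative assumptions only):

```python
from collections import Counter

# Precompiled n-gram statistics table: letter bigrams from a tiny corpus.
corpus = ["the peace treaty", "a piece of the text"]
stats = Counter(g for line in corpus
                  for w in line.split()
                  for g in zip(w, w[1:]))

def looks_incorrect(word, min_count=1):
    # A word is suspicious if any of its bigrams is unseen or too rare.
    return any(stats[g] < min_count for g in zip(word, word[1:]))
```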
2.10 Error Correction Techniques
Many techniques have been proposed to solve the problem of
generating candidates for a detected misspelled word; they vary in the
required resources, application scope, time and space complexity, and
accuracy. The most common are [Kuk92] [Mis13]:
2.10.1 Minimum Edit Distance Techniques
This technique is based on counting the minimum number of primal
operations required to convert the source string into the target one. Some
researchers take the primal operations to be insertion, deletion, and
substitution of one letter by another; others add the transposition of two
adjacent letters as a fourth primal operation. Examples include the
Levenshtein algorithm, which counts a distance of one for every primal
operation; the Hamming algorithm, which works like Levenshtein but is
limited to strings of equal length; and Longest Common Substring, which
finds the mutual substring between two words.
Levenshtein, shown in figure (2.7) [Hal11], is preferred because it has
no limitation on the types of symbols, or on their lengths. It can be executed
in time complexity of O(M.N) where M and N are the lengths of the two
input strings.
The algorithm can detect three types of errors (substitution, deletion,
and insertion). It does not count the transposition of two adjacent symbols
as one edit operation; instead, it counts such an error as two consecutive
substitution operations, giving an edit distance of 2.
One of the well-known modifications of the original Levenshtein
method was made by Fred Damerau, whose research found that about 80%
to 90% of errors are covered by the four error types altogether; the
combined measure is known as the Damerau-Levenshtein distance. [Dam64]
The modified method requires a longer execution time than the original:
in every checking round, the method applies an additional comparison to
check whether a transposition took place in the string, then applies another
comparison to select the minimum between the previous distance and
the distance with the occurrence of a transposition operation. This step
Figure (2.7) : Levenshtein Edit Distance Algorithm [Hal11]
1. Algorithm: Levenshtein Edit Distance
2. Input: String1, String2
3. Output: Edit Operations Number
4. Step1: Declaration
5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0,
cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. cost = 0
14. else
15. cost = 1
16. r=index of x, c=index of y
17. min1 = (distance(r - 1, c) + 1) // deletion
18. min2 = (distance(r, c - 1) + 1) //insertion
19. min3 = (distance(r - 1,c - 1) + cost) //substitution
20. distance( r , c )=minimum(min1 ,min2 ,min3)
21. end
22. Step3: return the value of the last cell in the distance matrix
23. return distance(Length of String1,Length of String2)
24. End.
multiplied the time complexity by a factor of 2, resulting in Ω(2·M·N).
Hence, in this work, the original Levenshtein method (figure (2.7)) is
modified to consider Damerau's four error types within a time complexity
shorter than that consumed by the Damerau-Levenshtein algorithm and close
to the original method's. Figure (2.8) shows Damerau's modification of the
Levenshtein method.
1. Algorithm: Damerau-Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5. distance(length of String1,Length of String2)=0, min1=0, min2=0,
min3=0, cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. cost = 0
14. else
15. cost = 1
16. r=index of x, c=index of y
17. min1 = (distance(r - 1, c) + 1) // deletion
18. min2 = (distance(r, c - 1) + 1) //insertion
19. min3 = (distance(r - 1,c - 1) + cost) //substitution
20. distance( r , c )=minimum(min1 ,min2 ,min3)
21. if not(String1 starts with x) and not (String2 starts with y) then
22. if (the symbol preceding x= y) and (the symbol preceding y=x)
then
23. distance(r,c)=minimum(distance(r,c), distance(r-2,c-2)+cost)
24. end
25. Step3: return the value of the last cell in the distance matrix
26. return distance(Length of String1,Length of String2)
27. End.
Figure (2.8) : Damerau-Levenshtein Edit Distance Algorithm [Dam64]
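The method of figure (2.8) can be rendered as runnable code roughly as follows; this is a sketch of the optimal-string-alignment form of the Damerau-Levenshtein distance, with the initialization of the first row and column made explicit:

```python
def damerau_levenshtein(s, t):
    """Edit distance counting substitution, deletion, insertion, and the
    transposition of two adjacent symbols as one operation each."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(m + 1):
        d[r][0] = r                          # deleting r symbols from s
    for c in range(n + 1):
        d[0][c] = c                          # inserting c symbols into s
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            cost = 0 if s[r - 1] == t[c - 1] else 1
            d[r][c] = min(d[r - 1][c] + 1,          # deletion
                          d[r][c - 1] + 1,          # insertion
                          d[r - 1][c - 1] + cost)   # substitution
            if r > 1 and c > 1 and s[r - 1] == t[c - 2] and s[r - 2] == t[c - 1]:
                # an adjacent transposition counts as a single edit
                d[r][c] = min(d[r][c], d[r - 2][c - 2] + 1)
    return d[m][n]
```

With this version, the transposed pair in "form"/"from" costs one edit, where the plain Levenshtein method of figure (2.7) would count two.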
2.10.2 Similarity Key Techniques
As its name indicates, this technique finds a key that groups
similarly spelled words together. The similarity key is computed for the
misspelled word and mapped to a pointer that refers to the group of words
similar in spelling to the input one. The Soundex algorithm derives keys
from the pronunciation of words, while the SPEEDCOP system rearranges
the letters of a word: the first letter, followed by the consonants, and
finally the vowels, according to their sequence of occurrence in the word
and without duplication. [Kuk92] [Mis13]
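Following that description, a SPEEDCOP-style similarity key could be computed like this (my reading of the scheme; the real system may differ in details such as case handling):

```python
VOWELS = set("aeiou")

def speedcop_key(word):
    # First letter, then the remaining consonants, then the vowels,
    # each in order of occurrence and without duplication.
    word = word.lower()
    key = word[0]
    seen = {word[0]}
    for belongs in (lambda c: c not in VOWELS, lambda c: c in VOWELS):
        for ch in word[1:]:
            if ch not in seen and belongs(ch):
                key += ch
                seen.add(ch)
    return key
```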
2.10.3 Rule Based Techniques
This approach applies a set of rules, derived from common mistake
patterns, to the misspelled word in order to transform it into a valid one.
After all applicable rules have been applied, the generated words that are
valid in the dictionary are suggested as candidates.
2.10.4 Probabilistic Techniques
Two methods are mainly based on statistics and probability:
1) Transition method: depends on the probability of a given letter being
followed by another. The probability is estimated from n-gram
statistics over a large corpus.
2) Confusion method: depends on the probability of a given letter being
confused with, or mistaken for, another. Probabilities in this method
are source dependent; for example, Optical Character Recognition
(OCR) systems vary in their accuracy and in how they recognize
letters, and Speech Recognition (SR) systems usually confuse sounds.
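For the transition method, the probabilities can be estimated directly from corpus counts, as in this sketch (the tiny corpus is a stand-in for real n-gram statistics):

```python
from collections import Counter

corpus = "the peace of the piece"
pair_counts = Counter(zip(corpus, corpus[1:]))
# count of each letter as a non-final character (the conditioning event)
left_counts = Counter(corpus[:-1])

def transition_prob(a, b):
    # P(next letter is b | current letter is a), by maximum likelihood
    return pair_counts[(a, b)] / left_counts[a] if left_counts[a] else 0.0
```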
2.11 Suggestion of Corrections
Suggesting corrections may be merged with candidate generation; it
fully depends on the output of the generation phase.
The user is usually provided with a set of corrections and can then
choose among them, keep the written word unchanged, add the token to the
dictionary, or rewrite the word when the desired word is not in the
corrections list.
Suggestions are listed in non-increasing order of their similarity and
suitability for replacing the source word. Similarity depends on the method
of computing the distance or similarity between every candidate and the
source token, while suitability depends on the surrounding words within the
sentence boundary or the paragraph (in context-sensitive correction, the
full text may be examined before making a suggestion).
2.12 The Suggested Approach
The primal goal of this work is to find the nearest alternative word
among all the available candidates in the underlying dictionary. When a
non-word is encountered, many candidates are available to replace it, but
the trick is: which of those alternatives was intended by the writer?
The suggested work answers this question as follows.
Any of the dictionary tokens, whose count may reach some hundreds
of thousands, could have been intended by the writer, or none of them: the
writer (or typist) might really have misspelled the word, or might have
written it perfectly while the word is simply not found in the dictionary,
i.e. never seen before, making it an "unknown" token.
The problem of deciding whether a word is misspelled or unknown
cannot be solved exactly. For this reason, the suggested system assumes every
unrecognized word is misspelled and may let the user make the final
decision. As an initial solution, all the tokens in the dictionary are
candidates, and further processing must minimize their number.
2.12.1 Find Candidates Using Minimum Edit Distance
The starting step is to look for the most similar tokens in the lexicon
dictionary and rank them according to their minimum edit distance from
the misspelled word. This reduces the number of candidates to an
acceptable amount, using either a threshold on the number of edit
operations needed to equalize a candidate with the misspelled word, or a
maximum limit on the number of candidates. The suggested system uses the
Levenshtein method after enhancing it to consider the four Damerau edit
operations.
To find the similar tokens, the lexicon must be looked up and every
token in it examined against the given word. This process is time
consuming because of the huge number of tokens held by the lexicon
dictionary and the time required by the examining algorithm itself to find
the minimum edit distance. Hence, the search space needs to shrink; a
method is proposed to group similar tokens into semi-clusters using
spelling properties.
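The threshold-based candidate generation described above can be sketched as follows (toy lexicon; a compact one-row Levenshtein stands in here for the enhanced method):

```python
def edit_distance(s, t):
    # Classic Levenshtein computed with a rolling row.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def candidates(word, lexicon, threshold=2):
    # Rank lexicon tokens by distance and keep those within the threshold.
    ranked = sorted((edit_distance(word, t), t) for t in lexicon)
    return [t for d, t in ranked if d <= threshold]
```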
2.12.2 Candidates Mining
The best set of candidates goes through another processing step that
specifies how the generated candidates are related to the misspelled token
and, accordingly, how they should be ranked. The process is implemented
using a vector of the following features:
 Named-entity recognition: many issues are considered.
 Transposition probability: Keyboard proximity and Physical Similarity.
 Confusion probability: because phonetic errors are common, this
analysis helps to find whether a word was misspelled by replacing
letter(s) with others of the same sound.
 Starting and ending letters matching.
 Candidates' length effect.
A weighting scheme is applied to give each feature a role in deciding
the best set of suggestions; however, the similarity amount has the largest
share among them.
2.12.3 Part Of Speech Tagging and Parsing
Finally, the suitable candidate is chosen by the parser. The parser
selects the candidate(s) that make(s) the sentence containing the
misspelled word correct. Tagging plays an important role in specifying the
optimal candidate, because filtering by POS tag is the base on which the
parser stands when selecting a candidate for its incomplete sentence. The
selected tag affects not only the candidate but every token in the sentence;
this is the nature of English (and of most natural languages).
At this step, the set of candidates should contain the minimum
number of elements, but the best ones.
Grammar checking, accomplished by parsing, is another goal of this
system. The system applies a sentence phrasing process and checks each
phrase's consistency against English grammar rules. When an incorrect
structure is encountered, the system tries to correct it.
Parsing is a fundamental step in specifying the correct choice of
candidates, since the basic goal is to produce a correct sentence.
The dictionary relied upon is an integration of the WordNet dictionary
with the ISPELL dictionary.
Figure (2.9) shows the block diagram of the suggested work; the
following chapters give more details for each block.
_____________________________________________________________
1 The diagram in figure (2.9) is detailed further through the next three chapters.
Figure (2.9): The suggested system block diagram1
(Diagram components: Preprocessing; WordNet lexical dictionary; ISPELL
datasets; morphological analysis and POS tags expansion; dictionaries
integration; hashing and indexing; integrated hashed indexed dictionary;
tokens stream; POS tagging; sentences stream with tagged tokens; candidates
generation; candidates ranking; phrasing; phrase-level suggestions; grammar
correction; sentences recovery and suggestions listing.)
Chapter Three
Hashed Dictionary and Looking Up Technique
3.1 Introduction
The dictionary is a basic unit in almost every NLP application. It holds
the lexicon of the language under processing, together with related
information that depends on the application's purpose, such as POS tags,
semantic information, phonetics, and pronunciation.
Typically, dictionaries are data structures in the form of a list of
tokens or a collection of words. Each word (or token) is associated with the
information that makes its use by an NLP application possible.
The number of tokens held by a dictionary is a critical point in NLP
applications, especially taggers and text correction systems: as the
number of tokens becomes smaller, the ratio of detected errors also becomes
smaller, since a poor dictionary allows erroneous words to pass undetected.
On the other hand, a large dictionary increases this ratio but requires a
longer time for looking tokens up.
Therefore, a balance is needed that keeps the dictionary as inclusive
as possible and the looking-up speed fast. Many approaches have been
proposed to handle this problem, among them indexing and hash functions.
3.2 Hashing
The optimal feature of any dictionary is the availability of random
access, but strings are a highly variable data type, which makes this
feature hard to obtain, at least under memory constraints.
Hashing is the process of converting a string S into an integer number
within the range [0, M-1], where M is the number of available addresses in
a predefined table. Hash functions promise random access, but alone they
are not enough: the variance of language tokens would require an infinite
hash table to hold every token "separately" and a variable-size addressing
buffer, which may be unloadable by most current systems, besides wasting a
great deal of storage space.
By "separately" we mean that no two strings have the same hash
value, i.e. there are no collisions. As the number of collisions becomes
larger, looking up inside packets becomes longer.
However, a hash function can be exploited as a partial solution,
applied with other approaches to solve the problem shown above. While a
hash function can map tokens, according to some of their features, into
packets of manageable size, approaches such as indexing and advanced
search techniques enhance the looking-up speed to a reasonable amount.
3.2.1 Hash Function
The hash function in this work was created to exploit the spelling of
tokens as an addressing key. It converts the prefix of a token into a packet
address so that tokens are grouped into packets.
The English alphabet, for the language considered in this work,
contains the set of uppercase letters from 'A' to 'Z', lowercase letters
from 'a' to 'z', and digits from 0 to 9, in addition to some special-purpose
characters that are unavoidable in the dictionary because they are part of
some tokens, such as slash (/), period (.), apostrophe ('), underscore (_),
whitespace, and hyphen (-). The resulting character set contains about 68
characters, which can be reduced further by replacing the codes of the
digits 1 to 9 with
the code of 0, because distinguishing between digits has no importance in
this application, for two reasons:
 The difference between numbers is not a problem in the correction
process, since no system can estimate which number the writer
intended; therefore, any written number is accepted as it is.
 If a distinction had to be made when treating numbers, we would need
to cover every possible number in the dictionary, resulting in an
infinite dictionary size, because numbers are infinite.
The final alphabet is the union of the sets mentioned above and the
reduced number set:
∑ = { A, B, …, Z, a, b, …, z, 0, /, . , ' , - , _ , whitespace }
which can be re-encoded using only 6 bits, as shown in Table 3.1 (unused
codes are marked by *).
Hashing according to prefixes is a good way to minimize the sizes of
packets; it is similar to the SOUNDEX and SPEEDCOP methods
[Mis13][Kuk92] in that they share the same goal, minimizing the size of the
search space, but it differs in that this approach maps tokens to predefined
packet addresses depending on a limited-length prefix of the string, while
those methods use the total length and filter the letters according to sound
or spelling. This difference gives the suggested approach two interesting
features:
1. The hash function is simple and can be applied directly without any
preprocessing; SOUNDEX needs to encode letters into their phonetic
groups, and SPEEDCOP rearranges letters.
Symbol Code Symbol Code Symbol Code
A 0 B 1 C 2
D 3 E 4 F 5
G 6 H 7 I 8
J 9 K 10 L 11
M 12 N 13 O 14
P 15 Q 16 R 17
S 18 T 19 U 20
V 21 W 22 X 23
Y 24 Z 25 a 26
b 27 c 28 d 29
e 30 f 31 g 32
h 33 i 34 j 35
k 36 l 37 m 38
n 39 o 40 p 41
q 42 r 43 s 44
t 45 u 46 v 47
w 48 x 49 y 50
z 51 ' 52 / 53
- 54 _ 55 . 56
0 57 whitespace 58 * 59
* 60 * 61 * 62
* 63
Table 3.1: Alphabet Encoding
2. Random access is established by using the output of the hash
function as an address, while both previous methods need to search for
a match between the computed value and the stored codes.
3.2.2 Formulation
As mentioned above, the alphabet is reduced to only 59 symbols,
which can be encoded using only 6 bits instead of the standard 8 bits. This
makes a series of hash functions available, applied over a prefix of 1, 2,
or any longer sequence of symbols, and opens another area for discussion:
if the length of the prefix is too small, then the number of packets is also
small; therefore, each packet holds a large number of tokens, resulting in a
longer looking-up time.
On the other hand, using long prefixes creates a large number of
packets, some of which are usually sparse because of the variance and
irregularity of tokens, which is a characteristic of natural languages.
The function depends on using a three-character prefix C1C2C3,
converts it into integers as presented in Table (3.1), then computes the
hash value H according to equation (3.1):
H(C1,C2,C3) = code(C1) × 2^12 + code(C2) × 2^6 + code(C3)      (3.1)
H represents the address of the packet where tokens starting with the same
prefix are held.
Obviously, the number of available packet addresses equals the
number obtained by concatenating the binary codes of the three symbols, as
shown in Table (3.2), where the symbol at index 0 is 'A' and the symbol at
index 63 (the last available index in the alphabet) is the unused cell
marked by '*'.
Start Address = (C1)2||(C2)2||(C3)2 = (000000000000000000)2 = (0)10
End Address = (C1)2||(C2)2||(C3)2 = (111111111111111111)2 = (262143)10
This makes the total number of packets 2^18 = 262,144. Some of these
packets are empty, because their addresses do not match any actual token
prefix in the lexicon, but the distribution of tokens among packets reduces
the search space to a manageable size, especially when the hash function is
combined with an indexing scheme to build the dictionary in a two-level
structure.
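Under the encoding of Table (3.1), equation (3.1) can be implemented directly; the code table below mirrors the table, and the function names are illustrative:

```python
# 6-bit codes mirroring Table (3.1): A-Z -> 0..25, a-z -> 26..51,
# then ' / - _ . 0 and whitespace -> 52..58 (codes 59..63 unused).
ALPHABET = ([chr(c) for c in range(ord('A'), ord('Z') + 1)] +
            [chr(c) for c in range(ord('a'), ord('z') + 1)] +
            ["'", "/", "-", "_", ".", "0", " "])
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(ch):
    # the digits 1..9 collapse to the code of 0
    return CODE["0"] if ch.isdigit() else CODE[ch]

def prefix_hash(token):
    # H(C1,C2,C3) = code(C1)*2^12 + code(C2)*2^6 + code(C3)
    c1, c2, c3 = token[0], token[1], token[2]
    return (encode(c1) << 12) | (encode(c2) << 6) | encode(c3)
```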
     Starting Address Encoding             End Address Encoding
     Alphabetic   Decimal   Binary         Alphabetic   Decimal   Binary
C1   A            0         000000         *            63        111111
C2   A            0         000000         *            63        111111
C3   A            0         000000         *            63        111111
3.2.3 Indexing
Key-indexing is an in-memory lookup technique based strictly on
direct addressing into an array, with no comparisons between keys. Its
area of applicability is limited to numeric keys falling in a limited range
defined by the available memory resources. Hashing helps direct addressing
work on keys of any type and range, at the cost of bringing serial search
and collision-resolution policies into the equation.
Table 3.2: Addressing Range
Chapter Three  Dictionary Structure and Looking up Technique
________________________________________________________________________
 54 
Indexing is exploited to create a reference table that holds the 2^18
packet head addresses, which can be addressed directly by the hash
function. Every record in the reference table contains two fields: the first
is the "base" field, which holds an address if its index matches a token
prefix and the value (-1) otherwise; the second is the "limit" field, which
holds the length of the primary packet related to its index. Looking up the
packet that contains tokens starting with a specific prefix is shown in
figure (3.1).
The packets referred to by the reference table are treated as primary
packets, which hold tokens identical in their 3-symbol prefix; for further
reduction of the search space, sub-packets can be created for every primary
packet.
The second level of token distribution is also based on prefixes, but
with longer sequences: instead of using only three symbols to group tokens
with identical prefixes, the prefix equality is expanded to 6 symbols by
subdividing the tokens inside primary packets into secondary packets
Figure (3.1): Token Hashing Algorithm
Algorithm: Token Hashing
Input: English token (finite string over ∑), reference and hash tables.
Output: packet head address where the input token may reside.
Step1: set variables C1,C2, and C3 to the input token prefix.
Step2: Compute Index from C1, C2, and C3.
Index = code(C1) × 2^12 + code(C2) × 2^6 + code(C3)
Step3: go to reference table at the record indexed with Index.
Step4: examine the Base field
if Base > -1
return (Base value)
else
return fail
End.
which consist of a head and a set of tokens that are identical to the head in
their first 6 symbols.
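The base/limit reference records and packet addressing behave like this minimal sketch (the stored addresses here are invented for illustration):

```python
# Reference table: one (base, limit) record per possible 3-symbol prefix
# hash; base = -1 means no primary packet exists for that prefix.
EMPTY = (-1, 0)
reference = [EMPTY] * (2 ** 18)
reference[66] = (1024, 7)   # e.g. prefix "ABC" -> packet head 1024, length 7

def packet_of(index):
    base, limit = reference[index]
    return (base, limit) if base > -1 else None
```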
The structure of the dictionary can be clarified by hashing the
exemplar token ABCDEFGH according to the approach described
previously.
Figure (3.2): Dictionary Structure and Indexing Scheme
(The figure hashes the prefix C1=A, C2=B, C3=C to the reference index
H(C1,C2,C3); the reference record gives the primary packet head address X
and length Y; the primary packet with head code "ABC" holds the tokens
ABCS0$, ABCS1$, …, ABCSY-1$, and the secondary packet under the head with
Si = "DEF" holds the tokens ABCDEFT0, ABCDEFT1, …, ABCDEFTR-1.)
(1) The dollar sign ($) refers to any sequence that may follow Si.
An interesting characteristic of secondary packets is that no extra
space is wasted, because they are not based on a predefined packet
structure. The secondary head, which is a token within a primary packet,
may be followed by tokens sharing the same 6-symbol prefix, which are
collected in one variable-size secondary packet; if no such tokens follow,
no secondary packet is needed.
3.3 Looking Up Procedure
As shown in figure (3.2), the process of looking for a target token
starts once the primary packet head address is in hand from the reference
table, which in turn is computed using the hash function.
In the hash table, where the tokens are stored according to their
indexes, the search begins with a random access via the index of the
primary packet head, and matching then proceeds sequentially.
Matching is performed on the fourth through sixth symbols of every
token related to that primary packet; this reduces comparison time, since
matching the whole sequence would take longer. Even though the reduction
is small, it is useful here because logic operations on strings are more
expensive than on other data types.
When a full match is found, the target token is compared completely
with the token at that record; if they match, the goal is reached.
Otherwise, searching continues in the secondary packet related to that
token (if one exists). The comparison inside secondary packets, unlike
primary packets, uses the full token length, and failure here means there is
no chance of finding the target token in the dictionary.
The algorithm in figure (3.3) outlines the looking up procedure after
gaining primary head address.
Figure (3.3): Algorithm of Looking Up Procedure
Algorithm: Looking up a target token
Input: Target Token, Primary Packet Head address, Primary Packet Size.
Output: tag of input target token.
Step 1: Set primary packet information
    X = head address, Y = packet size.
Step 2: Examine X:
    if X < 0 then return fail
    for primary_index = X to X+Y do
        if prefix(token at primary_index in Hash Table) = prefix(target)
        begin
            if current token = target return primary_index
            X2 = Secondary packet head address
            Y2 = Secondary Packet Length
            exit for
        end
Step 3: Examine X2:
    if X2 <= 0 return fail // no related secondary packet
    for secondary_index = X2 to X2+Y2 do
        if token at secondary_index in hash table = target
            return secondary_index
Step 4: if no match was found at Step 3, return fail
End.
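The two-level search can be sketched in Python. The hash-table record layout below is an assumption (the thesis does not fix one): each record carries the token together with the head address and length of its related secondary packet, with -1 marking absence.

```python
from typing import List, NamedTuple

class Record(NamedTuple):
    token: str
    sec_head: int   # head address of the related secondary packet (-1 if none)
    sec_len: int    # number of tokens in that secondary packet

def look_up(target: str, head: int, size: int, table: List[Record]) -> int:
    """Return the hash-table index of `target`, or -1 on failure (Figure 3.3)."""
    if head < 0:                                   # Step 2: examine X
        return -1
    x2, y2 = -1, 0
    for i in range(head, head + size):             # scan the primary packet
        if table[i].token[3:6] == target[3:6]:     # compare 4th..6th symbols only
            if table[i].token == target:
                return i
            x2, y2 = table[i].sec_head, table[i].sec_len
            break
    if x2 < 0:                                     # Step 3: no related secondary packet
        return -1
    for j in range(x2, x2 + y2):                   # full-length comparison here
        if table[j].token == target:
            return j
    return -1                                      # Step 4: no match found

packets = [Record("abcdef", 2, 2), Record("abcxyz", -1, 0),
           Record("abcdefgh", -1, 0), Record("abcdeft", -1, 0)]
print(look_up("abcdefgh", 0, 2, packets))   # 2 (found in the secondary packet)
print(look_up("abcqqq", 0, 2, packets))     # -1 (prefix never matches)
```

Note how the primary scan compares only symbols 4 to 6, since all tokens in one primary packet already share the 3-symbol head code.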
3.4 Dictionary Structure Properties
The proposed dictionary can be applied in any application that depends
on string looking up. It provides a high-speed directed search for perfect
matching.
 The reference table, although some addresses are wasted because of
string variance, is suitable for natural language
dictionaries, which are usually of huge size. The tokens are handled in
a separate table constructed depending on the reference table.
 String comparison consumes more time than comparisons of other data
types. In this approach, comparison is reduced to subsequences of both
the target and the stored tokens.
 The looking up procedure is fast in discovering the presence of a target
token in many situations:
o At the hashing step, an empty record implies a missed token after
consuming only one numeric comparison.
o At the primary packet, failure requires comparing at most the
three symbols from the fourth to the sixth position in the
6-symbol prefixes of the tokens within the primary packet.
o At the secondary packet, failure requires comparing the tokens within
that packet.
The worst case is failing to find the target at the end of a
secondary packet related to the last token in the primary packet, which
consumes (length of primary packet + length of secondary packet)
comparisons.
 Since looking up is string dependent, there is high flexibility in
associating information with tokens without any overloading of the search
process. As a result, it can be used to construct lexical and semantic
dictionaries.
3.5 Similarity Based Looking-Up
The structure described in section (3.2) is suitable for perfect
looking up, while the purpose of this work is to design a text correction
system where some errors arise from unknown or misspelled words.
Such situations require looking up the dictionary to generate candidates that
are similar (not identical) to the given misspelled token.
The main purpose of any similarity based grouping approach is to
reduce the search space to a manageable size in order to shorten looking up
time, but at the same time it should not lose good candidates or
similar objects (tokens). Clustering techniques are examples of such
approaches, but even fuzzy clustering techniques do not solve this problem
completely because:
 Token clustering should consider the sequence in which the symbols
are arranged in the token, in addition to the symbols themselves.
 Although there are many similarity measures for grouping
tokens, no obvious separation measure can be used to separate string
clusters.
 In fuzzy clustering, the decision threshold is a bottleneck:
a high threshold value loses good candidates, while a low
threshold increases redundancy by grouping less similar tokens in the
cluster, resulting in longer searching time and inaccurate candidates.
 As the number of fuzzy centroids to which a token relates becomes
larger, computing the nearest set of centroids also increases
search complexity.
For these reasons, an approach is proposed that keeps the same hash table as
the dictionary structure and improves the looking up technique. The
algorithm is presented in figure (3.5).
The improvement extends the search to include similarly spelled tokens,
depending on the same bases as the standard search described previously.
The outline of the proposed approach is:
 Bi-grams generation
 Primary centroids selection (at most 3 symbols long)
 Connecting centroids to the reference table
These three steps are presented in figure (3.4).
3.5.1 Bi-Grams Generation
The reference table is the building block of the bi-gram generation process; it
specifies the range of hashing addresses and the number of symbols
needed from token prefixes for computing hash values.
The hash-indexing method used here is limited to 3 symbols only;
therefore, bi-gram generation involves three sub-divisions, each producing
two symbols (a bi-gram):
(C1,C2), (C1,C3), and (C2,C3)
Division into three bi-grams simplifies predicting Damerau's four error
types (insertion, deletion, substitution, and transposition) by applying the
template C1C2C3 using only two symbols at a time, producing the results
shown in Table (3.3).
The variety of tokens in a natural language cannot satisfy all nine distributions
of the template sequences described above for every index in the
reference table; therefore, preprocessing is applied to collect the satisfied
prefixes by checking the presence of every generated template in the dictionary,
and the missing sequences are rejected.
Figure (3.4): Semi Hash Clustering block diagram
The diagram comprises three stages:
1. Bi-grams Generation: reference index selection, (C1,C2,C3) = H⁻¹(Index),
followed by bi-gram variant generation:
C1C2?  C1?C2  ?C1C2
C2C3?  C2?C3  ?C2C3
C1C3?  C1?C3  ?C1C3
2. Centroids Selection: per each bi-gram variant, a 3-symbol-length centroid
set is selected, followed by redundancy removal over the (bi-grams,
centroid set) pairs.
3. Centroids Referencing: association of the bi-grams with "Index".
Table (3.3): Predicting errors using bi-gram analysis
Sequence   Substitution   Insertion   Deletion    Transposition
C1C2?      √              √           ×           ×
C1?C2      ×              √           ×           If ?=C3
?C1C2      ×              √           ×           ×
C2C3?      ×              ×           √           ×
C2?C3      ×              √           If ?<>C1    If ?=C1
?C2C3      √              ×           ×           ×
C1C3?      ×              ×           If ?<>C2    If ?=C2
C1?C3      √              ×           ×           ×
?C1C3      ×              √           If ?<>C2    If ?=C2
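The nine-pattern template is straightforward to express in code. The sketch below is an illustration, not the thesis implementation: it builds the nine '?' variants from a 3-symbol prefix, and prunes, for a given pattern, the centroids that do not occur among the known dictionary prefixes (the pruning step of section 3.5.2).

```python
def bigram_variants(prefix: str) -> list:
    """Build the nine template sequences of Table (3.3) from prefix C1C2C3."""
    c1, c2, c3 = prefix
    return [c1 + c2 + '?', c1 + '?' + c2, '?' + c1 + c2,   # (C1,C2) variants
            c2 + c3 + '?', c2 + '?' + c3, '?' + c2 + c3,   # (C2,C3) variants
            c1 + c3 + '?', c1 + '?' + c3, '?' + c1 + c3]   # (C1,C3) variants

def centroids_for(pattern: str, alphabet: str, known_prefixes: set) -> list:
    """Assign every alphabet symbol to '?' and keep only prefixes that exist
    in the dictionary."""
    return [pattern.replace('?', s) for s in alphabet
            if pattern.replace('?', s) in known_prefixes]

print(bigram_variants("Che"))
# ['Ch?', 'C?h', '?Ch', 'he?', 'h?e', '?he', 'Ce?', 'C?e', '?Ce']
print(centroids_for('h?e', 'abcdefghijklmnopqrstuvwxyz',
                    {'hae', 'hee', 'hie', 'hoe', 'hue', 'hye', 'xxx'}))
# ['hae', 'hee', 'hie', 'hoe', 'hue', 'hye']
```

The second call reproduces sequence 5 of the worked example in section 3.5.2 for index 9882, assuming the listed prefixes are present in the dictionary.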
3.5.2 Primary Centroids Selection
For every accepted sequence, a set of centroids is selected as a subset
of the union of primary centroids that are at most three symbols long.
A centroid related to a specific sequence is an assignment of a symbol
from the alphabet to the '?' sign in that sequence. For example, at
index = 9882:
H⁻¹(9882) = "Che"
C1 = 'C', C2 = 'h', C3 = 'e'
The nine sequences and their related primary centroids, after pruning
mismatched sequences, are:
1. Ch?: ChB, ChE, Cha, Che, Chi, Chk, Chl, Chn, Cho, Chr, Cht, Chu,
Chw, Chy, Ch', Ch˽, Ch
2. C?h: Cah, Coh, C˽h
3. ?Ch: BCh, DCh
4. he?: hea, heb, hec, hed, hee, hef, heg, heh, hei, hej, hek, hel, hem, hen,
heo, hep, her, hes, het, heu, hev, hew, hex, hey, he', he-, he
5. h?e: hae, hee, hie, hoe, hue, hye
6. ?he: Ahe, Che, Ghe, Jhe, Khe, Lhe, Phe, Rhe, She, The, Whe, ahe,
bhe, che, dhe, ghe, khe, phe, rhe, she, the, whe
7. Ce?: Cea, Ceb, Cec, Ced, Cee, Cei, Cel, Cen, Cep, Cer, Ces, Cet,
Ceu, Cey
8. C?e: Cae, Cce, Cde, Cee, Che, Cie, Cle, Coe, Cre, Cse, Cte, Cue,
Cve, Cze
9. ?Ce: BCe, vCe
3.5.3 Centroids Referencing
The final step is to join every sequence to its centroid set and every
index to its bi-gram sequences.
This process includes creating a list of all the primary centroids in the
dictionary, which represent all the 3-symbol prefixes of primary packet
heads. Bi-grams are also stored in a separate list associated with the
address of the related primary centroid set.
The reference table, in turn, keeps track of the addresses of the bi-grams of
each index within it. As a result, bi-grams and the associated centroid sets
can be randomly accessed through the reference table.
3.6 Application of Similarity Based Looking up approach
The purpose of similarity based looking up is to minimize the
search space and maximize the chance of finding tokens that are similar to
the source token.
Figure (3.5): Similarity Based Hashing algorithm
Algorithm: Similarity Based Hashing
Input: Hashed Dictionary
Output: Similarity Based Hashed Dictionary
For each Reference Index apply the following steps:
Step 1: Bi-grams Generation
    1) CxCyCz = H⁻¹(Index)
    2) generate sequence variants
    3) filter sequences
Step 2: Primary Centroids Selection
    for each generated sequence do
        1) for every alphabet symbol do
            1.1) assign it to the sequence's missing symbol
            1.2) reject if no prefix matching is found
        2) remove duplicated centroids
Step 3: Centroids Referencing
    1) connect bi-grams to centroids
    2) connect Index to bi-grams
End.
The hashed dictionary structure shown in section (3.2) was built to
achieve perfect matching according to token prefixes; if the source token
is not found, then similar tokens should be looked up.
Because looking up the hashed dictionary is based on the prefixes of
tokens, the similarity based looking up accounts for all the possible mistakes
that can occur within the 3-symbol prefix of every token by exploiting the
bi-grams associated with the computed hash value. Every bi-gram is linked
to a list of primary centroids, which are in turn matched with the source
token's 3-symbol prefix and filtered according to the amount of similarity.
Centroids with the highest similarity are selected, while lower similarity
centroids are rejected to shorten the searching time.
The next step expands the prefix length in the similarity
calculation to include 6-symbol prefixes, because the selected
primary centroids refer to primary packets in which every token
differs from the other tokens in its 6-symbol prefix. This step makes the
search more precise by selecting the tokens in the primary
packet nearest to the source token.
Finally, every selected primary packet token may have a related
secondary packet in which each token shares the 6-symbol prefix of the
secondary head (i.e. the primary packet token). This final action, in turn,
maximizes the chance of encountering tokens similar to the source token
inside the secondary packet (which usually contains a small number of tokens).
An interesting property of this approach is the ability to use
thresholds at every level of the looking up procedure. A different threshold
can be used in the primary centroid selection, in the selection of secondary
packet heads, and in the selection of candidates. The value of the threshold is
application dependent and fundamentally restricted by the similarity
calculation method.
Figure (3.6): Block diagram of candidates generation using SBL
(C1,C2,C3) = source 3-symbol prefix → Index = H(C1,C2,C3) →
examining the 2-gram patterns (P1…P9) → primary centroids collection →
collected centroids filtering (highest similarity centroids selection) →
secondary centroids selecting and filtering → candidates generation.
3.7 The Similarity Based Looking up Properties
The proposed approach has several features that make it suitable for
various string based search applications:
1. Clustering illusion: the structure of the dictionary and its looking up
technique provide a way of dividing the search space into
three different levels:
a. Primary Centroid Clusters: only the 3-symbol prefixes are
checked, and the best are selected as centroids for the next level.
b. Primary Packet Clusters: every token here is referenced by a
primary centroid and may itself reference a secondary packet
(i.e. act as a secondary centroid).
c. Secondary Packet Clusters: every token is referenced by a
secondary centroid.
2. Time Complexity Minimization: merging the hashing function with
indexing simplifies searching and provides random access at more
than one level.
3. Application Flexibility: thresholds can be used at every clustering
level as separators to exclude uninteresting centroids or candidates.
Choosing the threshold value is left to the developer, the
similarity calculation method used, and the application area.
The algorithm in figure (3.7) outlines the complete process.
Figure (3.7): Similarity Based Looking up algorithm
Algorithm: Similarity Based Looking up
Input: Hashed Dictionary; Source_Token; similarity thresholds: T1, T2, T3 *
Output: Candidates Set
Step 1: Hash Index Calculation
    C1, C2, C3 = 3-symbol prefix of Source Token
    Index = H(C1, C2, C3)
Step 2: Primary Centroid Selection
    for each bi-gram at Index do
        for each related Primary Centroid do
            if similarity(C1C2C3, Primary Centroid) >= T1 then
                select Primary Centroid
Step 3: Secondary Centroids Selection
    for each selected Primary Centroid do
        for each related Secondary Centroid do
            if similarity(6-symbol source prefix, Secondary Centroid) >= T2 then
                select Secondary Centroid
Step 4: Candidates Selection
    for each selected Secondary Centroid do
        for each Token in the related Secondary Packet do
            if similarity(Source Token, Token) >= T3 then
                select as Candidate
End.
* If no threshold is indicated, the approach generates candidates according
to maximum similarity.
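The three selection levels of the algorithm can be sketched as follows. The container shapes (pattern list, centroid maps, packet maps) and the positional similarity function are assumptions for illustration only; the thesis's own data layout and similarity measure would replace them.

```python
def sim(a: str, b: str) -> float:
    """Crude positional similarity (assumption; stands in for the thesis measure)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def sbl(source, patterns, pattern_centroids, primary, secondary, t1, t2, t3):
    """Three-level Similarity Based Looking up (sketch of Figure 3.7)."""
    candidates = []
    for pattern in patterns:                                 # bi-grams at Index
        for centroid in pattern_centroids.get(pattern, []):
            if sim(source[:3], centroid) >= t1:              # Step 2: primary centroids
                for head in primary.get(centroid, []):
                    if sim(source[:6], head[:6]) >= t2:      # Step 3: secondary centroids
                        for tok in [head] + secondary.get(head, []):
                            if sim(source, tok) >= t3 and tok not in candidates:
                                candidates.append(tok)       # Step 4: candidates
    return candidates

# Toy run: "chest" against a tiny two-level dictionary.
out = sbl("chest", ['ch?'], {'ch?': ['che']},
          {'che': ['chess', 'chose']}, {}, 0.6, 0.5, 0.7)
print(out)   # ['chess']
```

The progressively longer prefixes (3 symbols, then 6, then the full token) mirror how the thresholds T1, T2, T3 narrow the search at each level.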
Chapter Four
Error Detection and Candidates Generation
4.1 Introduction
Error detection is the process of indicating incorrect words in a
text. The term "incorrect" may refer to a misspelled word, a misused word, or
both. Misused words are correctly spelled but used in a way that violates the
syntax or the meaning of the sentence.
The detection of misspelled words (non-word errors) is a straightforward
process in that it involves looking up every token in a pre-prepared
list or dictionary (also referred to as a "lexicon") containing all the well
spelled words of the language; however, the size of the lexicon affects the
looking up process, because larger sizes require longer time.
On the other hand, detecting misused words (real-word errors) is a
more complex task. It requires analyzing the syntax of the sentence to
check the correctness of the sentence's constituency and, if it is not correct,
indicating the word(s) that violated it. Errors resulting in
meaningless sentences entail further processing, which may extend beyond
the sentence boundaries and needs more information about the sentence
tokens.
4.2 Non-word Error Detection
Detecting misspelled tokens in this system is based on the dictionary
looking up technique and is performed within the tagging stage.
Tokens of a given text must be tagged. A tag should be found for
every token in the considered language; therefore, tokens are collected and
stored with their tags in a lexicon. Tagging is a fundamental stage
in most natural language processing systems, and it must precede syntax
analysis, since no parsing can be done without associating a tag with each
token in the sentence.
Figure (4.1): Tagging Flow Chart
Start → Read text → Convert the text into a token stream → Handle a token →
Look up inside the hashed dictionary → Found?
(Yes: save the (token, tag) pair;
No: generate candidates and save the (token, {candidates, tags}) list) →
Last token? (No: handle the next token;
Yes: pass the new tagged stream to the segmentation step) → End.
Because tagging requires looking up every token in the given text, it
serves another task at the same time, since missing tokens are marked as
misspelled.
The looking up procedure discussed in Chapter Three is used for
discovering non-word errors; the dictionary structure is a
reconstruction of about 300,000 tokens collected from two datasets as raw
data. The two major resources of the lexicon are WordNet and ISPELL;
WordNet represented the basic resource and was integrated with the ISPELL
dataset to make the lexicon more comprehensive.
The lexicon was hashed and indexed in order to achieve random
access. The looking up time is very short compared to the typical structures,
and the tagger is capable of deciding whether a token is found in the lexicon,
sometimes after consuming only one operation (for further details
see sections 3.3 and 3.4).
4.3 Real-words Error Detection
Deciding whether a word is misused is more complex than
detecting misspelled words; the process needs more computation and more
resources. Syntax analysis can be exploited to recognize misused words,
since every English sentence (like sentences in most natural languages) is
constrained by a syntactic rule or grammar. Any sentence that violates the
syntax constraints and cannot be parsed using a finite set of
production rules is marked as an incorrect sentence. Next, the sentence should
be processed to indicate the erroneous word that made the sentence
incorrect.
Phrasing is a good way to precisely indicate the incorrect word
by converting the sentence into constituents. The constituency
hierarchy starts from the sentence as the head of the tree, which contains
one or more clauses; each clause contains one or more phrases, and each
phrase contains one or more words.
The division into phrases is useful in reducing the parse tree. As the
number of tokens becomes larger, the available parses for the same
sentence increase.
The suggested approach is rule based: any sentence that cannot be parsed
correctly is described as incorrect. The syntax analyzer is based on
phrasing, applying a brute force approach to identify the misused word
in the phrase.
The syntax analyzer is fully dependent on the output of the tagger;
however, misspelled words should be replaced with suggestions in order to
allow the analyzer to proceed with analyzing the sentence and select the best
alternative that makes the sentence acceptable. (Chapter Five details the
idea.)
4.4 Candidates Generation
Candidates are tokens with high similarity to the incorrect
word. The meanings of "similarity" and "incorrect" are relative. In the case
of non-word errors, the incorrect word is misspelled, and similarity is
a measure of how closely another token is spelled or pronounced to the
misspelled word. In the case of real-word errors, the candidate
token is the one most likely to have been intended by the writer but
confused with the incorrect one; sometimes a spelling or phonetic mistake
results in another correct word.
4.4.1 Candidates Generation for Non-word Errors
In this step, the system takes the incorrect token (a token outside the
dictionary) and looks for similar tokens in the underlying dictionary.
Since every token in the dictionary may be the one intended by the writer, the
process is somewhat complex. Several issues should be considered in
deciding which tokens are suitable to be generated as candidates.
A major problem is the distinction between unknown and mistaken
words; therefore, this research considers every unknown word a
mistaken one and lets the decision be taken by the user himself/herself.
However, candidates (or alternatives) are generated depending on the
mistaken word, and the total process is performed in the following way:
At first glance, all the dictionary tokens, whose count may
reach some hundreds of thousands, could be intended by the writer, or none
of them could be. The writer (or typist) might really have misspelled the word,
or he/she wrote it perfectly but the word is simply not found in the
dictionary, i.e. never seen before, and then it is an "unknown" token.
The number of generated candidates is not limited; further
processing reduces the list of candidates to include only the best set
according to the amount of similarity and some other criteria that are fully
dependent on the spelling of the encountered misspelled token.
In the tagging stage, if a token is not found in the lexicon, then it is
misspelled. The starting step is to look for the most similar tokens in the
lexicon dictionary and rank them according to their similarity to the
misspelled token; the similarity is based on a minimum edit distance
measure. This action reduces the number of candidates to an acceptable
amount, depending on a threshold for the number of edit operations needed
to equalize a candidate and the misspelled word, or a maximum limit on the
number of candidates.
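As a sketch of this ranking step, the snippet below filters and orders candidates using `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score; the thesis itself uses the edit-distance-based measure of section 4.4.1.2, and the threshold and limit values here are illustrative only.

```python
import difflib

def rank_candidates(misspelled: str, tokens, min_ratio: float = 0.7, limit: int = 10):
    """Keep tokens scoring at least `min_ratio` against the misspelled word,
    best matches first, capped at `limit` entries."""
    scored = sorted(((difflib.SequenceMatcher(None, misspelled, t).ratio(), t)
                     for t in tokens), reverse=True)
    return [t for ratio, t in scored if ratio >= min_ratio][:limit]

print(rank_candidates("speling", ["boat", "spoiling", "spelling"]))
# ['spelling', 'spoiling']
```

Either knob implements the reduction described above: raising `min_ratio` enforces a similarity threshold, while `limit` caps the candidate list size.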
4.4.1.1 Enhanced Levenshtein Method
The modification of the Levenshtein method is performed by
extending the standard matching step at line 12 in figure (2.7) to check for the
presence of a transposition case. The idea arises from the fact that no
transposition case can be found without a matching success
between at least two symbols in the examined strings; more precisely,
the transposition can be discovered using a minimum number of operations
by considering two facts:
- Two adjacent symbols can never be mirrored by two adjacent
symbols in another string unless the first symbol in the first pair matches
the second in the second pair.
- Instead of handling the transposition occurrence separately, the
algorithm can modify the cell under processing in the distance matrix
directly, and the next matching steps will do the work.
The first fact serves to avoid trying all possibilities, as was
done in Damerau's modification at lines 20 and 21 in figure (2.8), where
each symbol is matched against every symbol in the second string, regardless
of whether a transposition operation is possible, by adding additional
matching statements to the original one at line 12 in figure (2.7).
The second fact, on the other hand, concerns another side of the
processing: the distance matrix is filled sequentially, row by row, from
the top left corner to the bottom right corner (where the total
distance is held). Using one step to process both cases (whether a transposition
happens or not) is a good way to minimize the number of
operations required to accurately compute the distance.
In this modification, the distance matrix is updated directly in one
step, and the next steps (selecting the minimum and filling the cell under
processing) continue normally as in the original algorithm; this
removes the step at line 22 of Damerau's algorithm (figure 2.8),
which requires more than one operation to complete.
Modifying the Levenshtein method reduced the time and
enhanced the candidates generation process because the modification
exploits the first fact to make the algorithm avoid checking the cases that
lead to a failure situation, unlike the Damerau-Levenshtein modification,
which makes no difference between the two situations; this is presented in
lines 15 and 16. The direct update of the distance matrix (line 17) in the
enhanced algorithm accurately adjusts the distance without any
additional processing; it is simply an assignment.
The time complexity is related to the distance between the input
strings. As the strings become more different, the steps at lines
15, 16 and 17 in the enhanced algorithm (figure 4.2) are rarely executed;
therefore, they save time. In turn, this property is preferable in the
cases where the algorithm is used for generating candidates.
Candidates should be as similar as possible to the source token
(usually a mistaken word), and the conditional nature of the additional steps
(lines 15, 16 and 17) in the enhanced algorithm makes the time consumed in
generating candidates useful (not wasted), in the sense that those
steps are executed only when there is a match with the source token, and
they are executed more often the more the source word matches the
target word, which means the target is a good candidate.
The algorithm in figure (4.2) shows the enhancement of the original
Levenshtein method, and the rest of this section describes the differences
among the three methods (original Levenshtein, Damerau-Levenshtein and the
enhanced Levenshtein method) by processing the two example strings
"Transposed" and "Tarnspaesd":
Figure (4.2): The Enhanced Levenshtein Method Algorithm
1. Algorithm: Enhanced Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5.     distance(length of String1, length of String2)=0, min1=0, min2=0, min3=0, cost=0
6. Step2: Calculate Distance
7.     if String1 is NULL return length of String2
8.     if String2 is NULL return length of String1
9.     for each symbol x in String1 do
10.        for each symbol y in String2 do
11.        begin
12.            if x = y
13.            begin
14.                cost = 0
15.                if x is not the start symbol of String1 then
16.                    if (the symbol preceding x = the symbol following y) and (x is not duplicated) then
17.                        decrease distance(index(x)-1, index(y)) by 1 // transposed
18.            end
19.            else cost = 1
20.            r = index of x, c = index of y
21.            min1 = (distance(r-1, c) + 1)      // deletion
22.            min2 = (distance(r, c-1) + 1)      // insertion
23.            min3 = (distance(r-1, c-1) + cost) // substitution
24.            distance(r, c) = minimum(min1, min2, min3)
25.        end
26. Step3: return the value of the last cell in the distance matrix
27.     return distance(length of String1, length of String2)
28. End.
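For comparison, the widely used optimal string alignment (restricted Damerau-Levenshtein) distance can be written in Python as below. This is the standard textbook formulation, not the thesis's enhanced variant, but it computes the same distance of 3 for the example pair used in this section.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transposition."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # i deletions from a
    for j in range(n + 1):
        d[0][j] = j                       # j insertions into a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n]

print(osa_distance("Transposed", "Tarnspaesd"))   # 3
```

The enhanced method of figure (4.2) aims at the same result while triggering the transposition check only when two symbols have already matched.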
Figure (4.3): Original Levenshtein Example
1) Levenshtein: the minimum edit distance = 5
1. substitute 'r' by 'a'
2. substitute 'a' by 'r'
3. substitute 'o' by 'a'
4. substitute 'e' by 's'
5. substitute 's' by 'e'
Computation complexity: M*N comparisons = 100;
(cost, min1, min2, min3) assignments * 100 = 400;
100 minimum function calls.
Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
       T  r  a  n  s  p  o  s  e  d
    0  1  2  3  4  5  6  7  8  9 10
 T  1  0  1  2  3  4  5  6  7  8  9
 a  2  1  1  1  2  3  4  5  6  7  8
 r  3  2  1  2  2  3  4  5  6  7  8
 n  4  3  2  2  2  3  4  5  6  7  8
 s  5  4  3  3  3  2  3  4  4  5  6
 p  6  5  4  4  4  3  2  3  4  5  6
 a  7  6  5  4  5  4  3  3  4  5  6
 e  8  7  6  5  5  5  4  4  4  4  5
 s  9  8  7  6  6  5  5  5  4  5  5
 d 10  9  8  7  7  6  6  6  5  5  5

Figure (4.4): Damerau-Levenshtein Example
2) Damerau-Levenshtein: the minimum edit distance = 3
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following
operations are executed: 100 comparisons (line 21);
81 comparisons (line 22); 2 calls of the minimum function (line 23).
Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
       T  r  a  n  s  p  o  s  e  d
    0  1  2  3  4  5  6  7  8  9 10
 T  1  0  1  2  3  4  5  6  7  8  9
 a  2  1  1  1  2  3  4  5  6  7  8
 r  3  2  1  1  2  3  4  5  6  7  8
 n  4  3  2  2  1  2  3  4  5  6  7
 s  5  4  3  3  2  1  2  3  3  4  5
 p  6  5  4  4  3  2  1  2  3  4  5
 a  7  6  5  4  4  3  2  2  3  4  5
 e  8  7  6  5  5  4  3  3  3  3  4
 s  9  8  7  6  6  4  4  4  3  3  4
 d 10  9  8  7  7  5  5  5  4  4  3
Figure (4.5): Enhanced Levenshtein Example
3) Enhanced Levenshtein: the minimum edit distance = 3
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following
operations are executed: 12 comparisons (line 15);
7 comparisons (line 16); 2 assignments (line 17).
Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
       T  r  a  n  s  p  o  s  e  d
    0  1  2  3  4  5  6  7  8  9 10
 T  1  0  1  2  3  4  5  6  7  8  9
 a  2  1  0  1  2  3  4  5  6  7  8
 r  3  2  0  1  2  3  4  5  6  7  8
 n  4  3  1  1  1  2  3  4  5  6  7
 s  5  4  2  2  2  1  2  3  3  4  5
 p  6  5  3  3  3  2  2  3  4  5  6
 a  7  6  4  3  4  3  3  3  4  5  6
 e  8  7  5  4  4  4  4  4  2  3  4
 s  9  8  6  5  5  4  5  5  2  3  4
 d 10  9  7  6  6  5  5  6  3  3  3
4.4.1.2 Similarity Measure
Minimum edit distance methods count the number of edit
operations required to convert one string to another, but they do not show how
similar the two strings are. For example, the distance between "a" and "b" is 1,
but the similarity is 0; whereas the distance between "Similar" and "Similer" is
also 1, but the similarity is 6/7.
String lengths should therefore be taken into account when computing the
edit distance, so that the resulting value can be used as a similarity measure.
The absolute length difference between any two strings is added to the total
of mismatched symbols, since it is considered the number of symbols deleted
from the shorter string; hence, the similarity measure must depend on the
maximum of the two lengths. The relative distance is computed by:
R_Dist(St1,St2)= distance(St1,St2) / max(length(St1),length(St2)) … (4.1)
Relative distance is a value within the interval (0,1) where
completely different strings have a relative distance of 1; and as its value
decreases, the difference is also decreases until reaching the value of 0
when the two strings are identical.
Since the similarity and difference are complements to each other,
the similarity can be computed by:
Similarity (St1, St2)=1- R_Dist(St1,St2) … (4.2)
The latter is the similarity measure used in the candidates' generation
for this work.
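Equations (4.1) and (4.2) can be sketched in code. The plain Levenshtein recurrence below stands in for the enhanced variant described earlier in the chapter; function names are illustrative.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Count the insertions, deletions, and substitutions
    needed to turn s1 into s2 (dynamic programming, two rows)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def relative_distance(st1: str, st2: str) -> float:
    """Equation (4.1): distance normalized by the longer length."""
    return levenshtein(st1, st2) / max(len(st1), len(st2))

def similarity(st1: str, st2: str) -> float:
    """Equation (4.2): similarity is the complement of the
    relative distance, within the interval [0, 1]."""
    return 1.0 - relative_distance(st1, st2)

print(similarity("Similar", "Similer"))  # 6/7, as in the example above
```

This reproduces the chapter's example: "a" vs "b" gives similarity 0, while "Similar" vs "Similer" gives 6/7.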
4.4.1.3 Looking for Candidates
To find similar tokens, the dictionary must be searched and every
token in it examined against the source word. This process consumes time
because of the huge number of tokens held by the lexicon dictionary and the
time required by the examining algorithm itself to find the minimum edit
distance and compute the similarity to the source token. Hence, the search
space needs to shrink; the Similarity Based Looking up method shown in
Chapter Three is used to group similar tokens into clusters using local
properties, i.e. the clustering process groups similar tokens depending on
their spelling only.
The input of the algorithm in figure (3.7) is the misspelled token.
The use of thresholds depends on the generating ability, i.e. how similar
the generated candidates are to the source token. If they are highly similar,
the top set is selected; but if there is difficulty discovering reasonable
candidates, using thresholds may be a good solution. As the misspelled
token becomes highly confused, the set of examined centroids becomes
larger; therefore, a filtering factor must be used to reduce the search space.
At least one generated primary centroid should be similar to the 3-
symbol prefix of the source token by an amount of 2/3, which allows at
most one mistake in the prefix. This restriction is not randomly selected;
experiments revealed that misspellings are usually single-error, with a ratio
between 70% and 95% depending on the text source, and mistakes rarely
happen in the first three letters. According to [Pol84], 7.8% of errors occur
in the first letter, 11.7% in the second letter and 19.2% in the third letter,
where each percentage is independent of the others.
After collecting the most similar set of primary centroids, the next
step is to examine the secondary centroids of every selected primary
centroid. The selection also depends on the similarity to the 6-symbol
prefix of the source token, since secondary centroids are at most 6 symbols
long. The second threshold constrains the error to at most two mistakes,
i.e. 2/6 or less; but in some situations there is a need to select the best
centroid from every secondary centroid set (from every selected primary
cluster), because looking for candidates at this stage is limited to the first
six symbols of the tokens, while longer tokens may contain more than two
mistakes in their prefixes. In other words, for every selected primary
centroid, the nearest secondary centroids are selected, and the threshold
serves as a limit to avoid selecting less similar centroids when there are
centroids with higher similarity.
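The two-level filtering described above can be sketched as follows. This is a hedged illustration, not the thesis's algorithm (figure 3.7): primary centroids are kept when their 3-symbol prefix differs from the source prefix in at most one position (2/3 similarity), secondary centroids when their 6-symbol prefix differs in at most two positions (2/6 error). All names are illustrative.

```python
def prefix_mismatches(a: str, b: str, n: int) -> int:
    """Count position-wise mismatches within the n-symbol prefix,
    padding shorter prefixes with spaces."""
    return sum(1 for x, y in zip(a[:n].ljust(n), b[:n].ljust(n)) if x != y)

def filter_primary(source: str, primary_centroids):
    # allow at most one mistake in the first three letters (2/3 similarity)
    return [c for c in primary_centroids if prefix_mismatches(source, c, 3) <= 1]

def filter_secondary(source: str, secondary_centroids):
    # allow at most two mistakes in the first six letters (2/6 error)
    return [c for c in secondary_centroids if prefix_mismatches(source, c, 6) <= 2]

print(filter_primary("tarnsposed", ["tar", "tra", "xyz"]))  # keeps 'tar'
```

Note that a transposition inside the prefix counts as two mismatches under this position-wise comparison, which is why the best secondary centroid per cluster is still selected even when the threshold is not met.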
Finally, for every selected secondary centroid, the candidates are
generated from the secondary packets related to a centroid with a
reasonable similarity to the source token. The decision to select a token as
a candidate is then easier, because the comparison is applied over the total
lengths of both the source token and the dictionary tokens.
Ranking the candidates is a subroutine of the optimization stage; it
uses more information than the similarity measure alone.
4.4.2 Candidates Generation for Real-words Errors
In this work, the generation of candidates is rule based. It can be
divided into two types according to the step in which it is applied:
 Before suggesting optimal candidates for misspelled words:
This type of generation is applied to sentences that do not contain
misspelled words; the decision is made after phrasing the sentence into
constituents and manipulating each phrase alone. A word that violates
the grammar or syntactic rules of the given sentence's construction
should be detected and replaced with a set of other forms, any of which
makes the sentence syntactically accepted.
Grammar correction techniques are many and various; two
techniques are used in this step to solve part of the syntax errors: verb
tense correction and subject-verb agreement.
 After suggesting optimal candidates for misspelled words:
After ranking candidates, this step allows the correction system to select
more precisely the candidate that best fits into the sentence, making it
correct or at least not violating its correctness. Selecting the best
candidates after ranking is an additional filter for generating the best
suggestion set.
Chapter Five
Automatic Text Correction and Candidates
Suggestion
5.1 Introduction
Text correction is the process of substituting incorrect word(s) with
correct word(s) selected as candidates and filtered to be the most suitable
among many alternatives.
Automating text correction is a complex task because of its direct
association with human nature; a written word can never be absolutely
predicted, even with perfect decision-making parameters helping a
computer choose the perfect suggestion, since artificial intelligence has not
yet reached human capabilities. However, there is always an alternative:
optimizing candidates is one way of handling the problem. Many existing
techniques can help in making the decision and providing the user with a
set of highly expected alternatives for a given incorrect word.
This work, as the next sections show, exploits many features related
primarily to the incorrect word and its candidates themselves rather than to
context. The automatic correction does not rely on meaning; it suggests
candidates depending on the output of the previous stages (tokenization,
tagging, and similarity based candidates generation) after applying
multi-feature ranking and syntax analysis.
5.2 Correction and Candidates Suggestion Structure
Figure (5.1) shows the ranking process applied to the generated candidates.
For every incorrect word, a set of candidates was generated by the
candidates generator at the tagging stage. A set of features is predefined
for ranking candidates according to similarity and error-type relevance.
The feature set includes the similarity value between the generated
candidate and the incorrect word, confusion and transposition factors, the
type of error in the incorrect word, and syntactic properties.
Ranking process involves:
 Assigning a value for every feature.
 Computing the effect factor of each feature (weighting).
 Summing all the weights in a single number.
 Inserting the processed candidate at the suitable index within the
candidates list, where high-similarity candidates are ranked at the top
and low-similarity candidates are inserted at the bottom.
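The steps above can be sketched as a weighted sum over a feature vector. The feature names and weight values below are illustrative, not the thesis's tuned parameters; section 5.4 only specifies that similarity takes the largest share.

```python
# Illustrative weights: similarity dominates, as section 5.4.1 requires.
FEATURE_WEIGHTS = {
    "similarity": 5.0,
    "first_symbol_match": 0.5,
    "end_symbol_match": 0.4,
    "length_agreement": 1.0,
    "transposition": 1.0,
    "confusion": 1.0,
    "duplication": 0.8,
    "same_symbol_set": 0.6,
}

def rank_value(features: dict) -> float:
    """Weighted sum of feature values: Rank(c) = sum_i w_i * v_i(c)."""
    return sum(FEATURE_WEIGHTS[name] * value for name, value in features.items())

def rank_candidates(candidates: dict) -> list:
    """Sort candidate words so the highest rank value comes first."""
    return sorted(candidates, key=lambda c: rank_value(candidates[c]), reverse=True)

cands = {
    "similar": {"similarity": 6/7, "first_symbol_match": 1, "end_symbol_match": 0,
                "length_agreement": 1, "transposition": 0, "confusion": 1,
                "duplication": 0, "same_symbol_set": 0},
    "simpler": {"similarity": 5/7, "first_symbol_match": 1, "end_symbol_match": 0,
                "length_agreement": 1, "transposition": 0, "confusion": 0,
                "duplication": 0, "same_symbol_set": 0},
}
print(rank_candidates(cands))  # 'similar' ranks above 'simpler'
```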
Features are represented by a vector of eight elements that may be
decreased or increased depending on the purpose for which the text
correction is applied, the source of the input text, and the expected error
rate. Similarly, the weight of each feature is also affected by the input text
source, since some features depend on error type.
Before applying the ranking process, the source token that was marked
as misspelled is examined against Named Entity (NE) features, because
most proper nouns are not added to dictionaries, resulting in a mismatch.
Recognizing NEs requires combining multiple sources of information;
some are strong enough to decide that a misspelled token is a name, but
not that it isn't. Syntax analysis follows feature-based ranking; it is
another step for optimizing results and, mostly, the one with the highest
effect. The ranking of candidates is completed by the syntactic role of the
candidates that would be selected as suggestions.
Figure (5.1): Candidates ranking flowchart
[Flowchart summary: for each (misspelled token, candidates list) pair, every
candidate is handled in turn: the similarity value is computed and inserted
symbols are specified, then the features are tested (confusion, transposition,
equal lengths, duplication, difference within threshold, same symbol set,
end-symbol match, first-symbol match). Each satisfied test assigns one of the
weights W1-W7 a factor value f1-f3; the weights are summed and the
candidate is inserted into the ranked list, repeating until the last candidate.]
5.3 Named-Entity Recognition
A big set of weak-evidence features is proposed to decide whether a
token is a named entity or not, though the features themselves and the
level of analysis vary.
Some features are efficient at deciding that a token is a named entity and
can be used individually in decision making; other features are never
helpful unless combined with others. The features fall into many
sub-categories; the best known relate to word level, part-of-speech tags,
and dictionary look-up.
Since the purpose of this system is determining token correctness, the
word-level features are the most helpful, because dictionary look-up is
already satisfied (a matched token does not need to be analyzed) and
part-of-speech tags are useless in the absence of a decision.
In English, the following features give developers some evidence
for name detection:
(1) All-uppercase: a token consisting of capital letters only.
(2) Initial-caps: a token starting with a capital letter.
(3) All-numbers: a token consisting of numbers only.
(4) Alphanumeric: a token containing both letters and numbers.
(5) Single-char: a token of one letter.
(6) Single-i: a token consisting of the single letter 'i'.
The all-uppercase feature is the strongest and can be used
individually; initial-caps may be affected by the token's position within
the sentence, because English sentences start with a capitalized word. In
this system the all-numbers feature is handled by treating all numeric
values equally, assigning the same hash code and the same tag to every
numeric string in the hash table. Many abbreviations, a sort of named
entity, are alphanumeric, so alphanumeric is a good feature. The
single-character feature is used by Microsoft Word. Finally, a single 'i'
may refer to the pronoun 'I', which is often mistakenly written as a
lowercase letter.
Named-entity recognition features may help mark a token as a
name, but they cannot precisely decide that it is not one. For example,
some names are written in lowercase letters, like "van Gogh", which
satisfies none of the features above.
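The word-level checks listed above can be sketched directly with string predicates. This is a minimal illustration; the initial-caps test here is deliberately strict (first letter upper, rest lower) and would miss mixed-case names such as "McDonald".

```python
def ne_features(token: str) -> dict:
    """Evaluate the six word-level named-entity features of section 5.3."""
    return {
        "all_uppercase": token.isalpha() and token.isupper(),
        "initial_caps": token[:1].isupper() and token[1:].islower(),
        "all_numbers": token.isdigit(),
        "alphanumeric": any(c.isalpha() for c in token)
                        and any(c.isdigit() for c in token),
        "single_char": len(token) == 1 and token.isalpha(),
        "single_i": token == "i",
    }

print(ne_features("NASA")["all_uppercase"])  # True
print(ne_features("B2B")["alphanumeric"])    # True
```

As the text notes, these features can only argue *for* name status: "van Gogh" fires none of them, yet is a name.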
5.4 Candidates Ranking
If the misspelled token is not recognized as a named entity, the
ranking process starts by measuring the similarity between the source
token and every candidate in the associated list in a more sophisticated
manner, considering the type of the committed error, to find a numeric
value that describes the fidelity of each candidate over the rest.
Eight weighted features are used to account for every error type's effect
on the whole candidate string; three different factor values are shown in
the flowchart in figure (5.1) to outline and simplify the idea of giving
different factor values to different error types (f1 = high, f2 = medium,
f3 = low). In practice, effect factors are numeric values that vary from
one feature to another.
For each element in the features vector, there is a weight reflecting that
feature's share in the total computed rank value. The rank value for each
candidate is computed by:
Rank(c) = ∑i=1..n ( wi × vi(c) ) … (5.1)
Where
n = the number of features, c = the selected candidate, wi = the weight
associated with feature no. i, and v is the features vector.
Weight values depend on the application area of the system; they rely
on the input text quality and the input device itself. The following
subsections describe each feature and its effect on the ranking process.
5.4.1 Edit Distance Based Similarity
The enhanced Levenshtein edit distance method is used to calculate the
distance of each candidate from the source token. Computing the similarity
depends on the distance and the lengths of the two strings (for more
details see section 4.4.1.2).
Similarity is measured by a numeric value within the interval [0,1];
therefore, it is multiplied by a factor that normalizes it against the other
features in such a way that it takes the largest share of the ranking value
among the features' weights.
In this application, as preferred in others, because similarity dominates
the suggestion decision, it is weighted by a factor several times larger
than the other features' weights.
5.4.2 First and End Symbols Matching
Research in the area of error analysis shows that mistakes rarely
happen in the first letters of a word, and mostly the first letter is not
mistaken.
The probability of mistaking the second letter is also high but does not
achieve interesting results compared to the first letter. On the other side,
the end letter has a mistake probability near to that of the first letter,
and hence it is used as part of the optimization procedure in calculating
ranking values.
First and end letters are sufficient because of how human brains work.
Research from Stanford University showed that our brains can predict
the correct word even if its letters are randomly permuted, provided only
that the first and end letters are correct.
Exploiting this result assists the process of optimizing candidate
suggestion. However, the idea cannot be applied directly in a
computational way, because the human mind's ability to predict and
connect facts is extremely fast and reliable; it depends on imagination
and semantic relevance in interpreting sentences even in the presence of
errors. To date, such ability is not found in computers.
As a result, this feature, the difference in lengths, and the same-symbol-set
feature together can simulate the human brain in a statistical way, because
the idea is originally dependent on statistics.
Small weights are given to both the first-letter and end-letter features,
with a preference for the first-letter feature because it has a larger effect
on prediction than the end letter does.
5.4.3 Difference in Lengths
Writing mistakes usually occur within the token length or in its
length ± 1; rarely do the lengths of the mistaken token and the intended
token differ by more than one unit.
Equality of lengths does not only affect the candidate itself directly but
also other features like transposition, confusion, and even duplication
(the next subsections detail the idea).
Candidates with larger difference values may be rejected although they
achieve good ranking indexes. The feature value is calculated by the
relative length difference:
R_L_D(St1,St2)= 1- ( abs(||St1|| - ||St2||))/ min( ||St1||,||St2||) … (5.2)
Where ||Sti|| is the length of string Sti
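Equation (5.2) translates directly to code. A minimal sketch, assuming both strings are non-empty:

```python
def relative_length_difference(st1: str, st2: str) -> float:
    """Equation (5.2): 1 - |len(St1) - len(St2)| / min(len(St1), len(St2)).
    Equal lengths give 1.0; the value drops as the lengths diverge."""
    return 1 - abs(len(st1) - len(st2)) / min(len(st1), len(st2))

print(relative_length_difference("hopeful", "hopefull"))  # 1 - 1/7
```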
The weight of this feature depends on the source of the input text; texts
entered through an optical character recognizer (OCR) usually take
smaller weights, while typed documents take larger values because the
insertion of symbols is probable.
5.4.4 Transposition Probability
Transposition refers to the case of replacing a character with a
neighbouring one that is either similar in style or placed next to it on the
keyboard.
Usually, this type of error occurs in typed texts and is referred to as a
"typo". Since English has a small alphabet, computing the probability of
transposing one letter into another is easy.
Table (5.1) shows a transposition matrix containing the probability of
each letter being mistyped as another of the 26 letters, regardless of case,
because such mistakes are related to the physical movement of the
typist's fingers, not to the typed token. This feature considers two types
of errors:
1. Errors within the length of the word: the typist mistakes a given
letter for another, i.e. substitutes it with a neighbouring letter by
pressing the wrong key instead of the intended one. Such cases
are described as first degree, and the feature value is assigned
the maximum.
2. Errors resulting in word-length increment: sometimes fingers
miss the intended letter's exact position and press two keys
simultaneously, typing two consecutive letters (the intended
letter and the one to its right or left). This mistake inserts an
additional letter and increases the word length by one.
Table (5.1) : Transposition Matrix
a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 1 0 0 1
b 0 0 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0
c 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0
d 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 1 0 0 0
e 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 0
f 0 0 1 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
g 0 1 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0
h 0 1 0 0 0 0 2 0 0 2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
i 0 0 0 0 0 0 0 0 0 1 1 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0
j 0 0 0 0 0 0 0 2 1 0 2 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
k 0 0 0 0 0 0 0 0 1 2 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 1 1 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0
n 0 0 0 0 0 0 0 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0
q 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
r 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
s 1 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 1
t 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0
u 0 0 0 0 0 0 0 1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0
v 0 0 2 2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0
x 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2
y 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0
z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0
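Reading Table (5.1) in code amounts to a two-level lookup. The sketch below stores only a fragment of the matrix (rows 'a' and 'e', copied from the table) as a dict of dicts; a full implementation would load all 26 rows.

```python
# Fragment of Table (5.1): keyboard-neighbour scores for rows 'a' and 'e'.
# 0 = not neighbours, 1 = weak neighbour, 2 = strong neighbour.
TRANSPOSITION = {
    "a": {"q": 1, "s": 2, "w": 1, "z": 1},
    "e": {"d": 1, "r": 2, "s": 1, "w": 2},
}

def transposition_score(intended: str, typed: str) -> int:
    """Score a substitution by keyboard adjacency, ignoring case,
    since the mistake depends on finger movement, not the token."""
    return TRANSPOSITION.get(intended.lower(), {}).get(typed.lower(), 0)

print(transposition_score("e", "r"))  # 2: adjacent keys
print(transposition_score("e", "p"))  # 0: far apart on the keyboard
```

Case is folded before the lookup, matching the table's case-independence.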
5.4.5 Confusion Probability
Confusion refers to the case of replacing a letter with another of
similar pronunciation; sound is the basis for calculating the probability of
confusing a given letter, unlike the transposition probability, which
depends on the keys' arrangement on the keyboard.
This type of analysis is concerned with phonetic errors; usually,
vowels are the most confused letters. The weight of this feature depends
on the application where the correction is used; it should take large values
when used with speech recognition systems. Table (5.2) shows the Stanford
confusion matrix after being updated and normalized.
Table (5.2) : Confusion Matrix
a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 0 0 3 0 0 0 2 0 0 0 0 0 2 0 0 0 2 0 1 0 0 0 0 0
b 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0
d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
e 3 0 0 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 1 0
f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 2 0 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 1 0
j 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
k 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o 2 0 0 0 3 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0
p 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
q 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
s 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
t 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
u 2 0 0 0 2 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0
v 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
y 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0
5.4.6 Consecutive Letters (Duplication)
Duplicating a single letter, or missing one of an originally duplicated
pair, is one of the typo errors. Some writers mistakenly omit or add a
letter from the original token, specifically when affixes are added. The
two major errors resulting in this type of mistake are:
 Insertion: duplicating a single letter can happen when a writer
does not know the correct formation of a word when adding an
affix; an example is duplicating the letter 'l' when adding the
suffix '-ful' to the noun 'hope' to form the adjective 'hopeful'
(writing 'hopefull'). Or it may result from pressing a key for
longer than required for typing a single letter, as in 'prrint'.
 Deletion: the reverse of insertion is missing one of the duplicated
letters, like producing 'hopefuly' when adding the suffix '-ly' to
the word 'hopeful', or writing a single letter instead of two, like
the single 's' in 'omision'.
Duplication is an interesting feature with sufficient effect in deciding
the optimum candidate, specifically when the difference between the
source token and the candidate equals the number of missed or duplicated
letters.
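One simple way to test the duplication feature, sketched here as an assumption rather than the thesis's exact method, is to collapse runs of repeated letters in both tokens and compare the results: two tokens that differ only in letter duplication collapse to the same string.

```python
import re

def collapse_runs(token: str) -> str:
    """Replace every run of a repeated letter with a single copy."""
    return re.sub(r"(.)\1+", r"\1", token)

def duplication_match(source: str, candidate: str) -> bool:
    """True when the two tokens differ only in letter duplication,
    e.g. 'prrint'/'print' or 'hopefuly'/'hopefully'."""
    return collapse_runs(source) == collapse_runs(candidate)

print(duplication_match("prrint", "print"))        # True
print(duplication_match("hopefuly", "hopefully"))  # True
print(duplication_match("print", "paint"))         # False
```

This covers both directions at once: inserted duplicates ('prrint') and deleted duplicates ('hopefuly').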
5.4.7 Different Symbols Existence
A candidate is preferred to contain the same set of letters contained in
the source token; this feature highlights the case of transposing two
adjacent letters in the word (Damerau's fourth error type), which is a
common mistake in typed text.
In conclusion:
Obviously, none of the features described above is separable from the
others; each is constrained by its weight and effect factor. Relations hold
between edit distance and all seven other features; between length
difference and each of confusion, transposition and duplication; between
transposition and duplication; and so forth.
Consequently, all the features above share the task of ranking the
candidates, each with its own weight and according to the application
environment. At this step, the suggestion of candidates at the word level
ends, and syntax restrictions begin to play a role in deciding which token
is suggested as the optimum among all the alternatives in the dictionary.
5.5 Syntax Analysis
The task of the syntax analyzer is critical at this stage; in addition to
examining sentence correctness, the selection of the optimum candidate
is done here.
In both cases, the analysis is applied at the level of phrases, where a
sentence is broken into clauses and the clauses into phrases. The syntax
analysis process is shown in figure (5.2).
5.5.1 Sentence Phrasing
The token stream is divided into groups in the segmentation stage;
segmenting a text depends on the output of the tokeniser and the tagger,
because determining sentence boundaries makes use of tags. As output, the
segmented text is a stream of sentences that can be passed to the syntax
analyzer, since the latter usually works at sentence level.
A sentence contains one or more clauses, each clause consists of one
or more phrases, and a phrase in turn contains one or more words. Phrasing
is efficient from these standpoints:
 Correcting a part of a phrase affects the structure of the sentence only
partially, which minimizes the total number of possible alternatives,
leading to a smaller set of candidates and better reconstruction of the
original sentence in a way that keeps it reasonably unchanged.
 Attachment ambiguity is a challenge facing the correction process,
specifically in semantic relations; phrase-level correction solves it,
because a phrase is completely attached to another phrase and updating it
neither affects nor is affected by other phrases, unlike word-level
correction, which must consider every possible parsing and every related
part of the sentence.
 Converting into phrases simplifies the process of generating complex
sentence structures because, however complex a sentence gets, it remains
a collection of phrases connected syntactically and semantically.
Figure (5.2): Syntax analysis flowchart
[Flowchart summary: convert each sentence into phrases; test candidates
starting from the top of the ranked candidates list; if the structure is
violated, select the next candidate; otherwise replace the misspelled token
with the candidate, check constituency after correction, and output the
corrected text with a list of candidates for each corrected token.]
English has a set of phrase types that includes: Noun Phrase (NP),
Prepositional Phrase (PP), Adjectival Phrase (AdjP), Adverbial Phrase
(AdvP), Complement (C) and Verb Phrase (VP). Each has its own set of
word classes and a structure governing those classes.
5.5.2 Candidates Optimization
Misspelled tokens are associated with a ranked list of candidates,
where the top candidate is the most similar to the misspelled word.
The optimization procedure is applied in two phases: the first is the
ranking according to feature satisfaction and weights; the second is the
syntactic agreement within the phrase that contains the misspelled word.
Selecting candidates starts from the top; checking the consistency of
the phrase structure has a fundamental impact on correction accuracy. The
tag of the selected candidate should satisfy the structure of the phrase, and
sometimes the process may require checking the next tag in the sentence,
i.e. the token that follows the misspelled word, which may form the head
of the next phrase.
The task is not such a challenge if the phrasing procedure is accurate;
the structure of the phrase under processing limits the possible alternatives
for the misspelled word to those with the best similarity and syntactic
agreement.
5.5.3 Grammar Correction
A sentence is grammatically accepted if it can be generated by
applying a finite set of grammar rules.
Grammar correction is a subfield of real-word error correction; it
depends on sentence constituency to detect the words that disagree with
the grammar rules and make the sentence violate the parsing rules.
In this step, the system checks the correctness of sentences by parsing
each sentence separately, because syntactic acceptance relates to the
sentence level, unlike semantic and further processing, which analyze
texts at the level of paragraphs and full texts.
The grammar correction procedure deals with two types of sentences:
1. Sentences containing correctly spelled words.
2. Sentences containing words that have been replaced with correct words.
In both cases, the suggestion of candidates has ended and the
correction is restricted to one suggested candidate. As shown in previous
sections, the optimal candidate is the grammatically suitable one with the
highest similarity. The grammar corrector treats the two sentence types
equivalently (as a sequence of correctly spelled words).
Many approaches to grammar correction have been proposed, because
correcting a text grammatically is an extensive process requiring deep
knowledge of the underlying language grammar and an inclusive set of
grammar rules.
This system is rule based and considers two types of correction:
 Subject-verb agreement.
 Verb tenses.
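A rule-based subject-verb agreement check can be sketched as a small table over tag pairs. This is only an illustration of the idea: the Penn-style tags used here (NN = singular noun, NNS = plural noun, VBZ = 3rd-person-singular verb, VBP = plural/base verb) stand in for the system's own, more detailed tag set.

```python
# Agreement table over (subject tag, verb tag) pairs; pairs not listed
# are assumed acceptable by this simplified rule set.
AGREEMENT = {
    ("NN", "VBZ"): True,    # the dog runs
    ("NN", "VBP"): False,   # *the dog run
    ("NNS", "VBP"): True,   # the dogs run
    ("NNS", "VBZ"): False,  # *the dogs runs
}

def agrees(subject_tag: str, verb_tag: str) -> bool:
    """Return False only when a listed rule is violated."""
    return AGREEMENT.get((subject_tag, verb_tag), True)

print(agrees("NN", "VBZ"))   # True
print(agrees("NNS", "VBZ"))  # False: plural subject, singular verb
```

Verb tense correction would extend the same table-driven pattern with rules over auxiliary/participle tag sequences within the verb phrase.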
In order to perform the two types of correction and the phrasing
procedure, the tag set needed to be more detailed than is available in the
original WordNet dataset. The dictionary was preprocessed to subdivide
some tags into finer tags, like dividing definite and indefinite determiners
into pre-, central, and post-determiners. Nouns also needed to be
categorized into plurals and singulars, and verbs into different tenses and
participles. Integration with the ISPELL database enhanced the accuracy
of the dictionary, providing it with a big set of singular and plural nouns,
adjectives, and verb tense forms.
5.5.4 Document Correction
The final step is suggesting the corrected sentences. It includes
replacing the incorrect words with the optimal candidates and associating
the remaining candidates with every corrected word.
The association is necessary because even a perfect suggester can
never absolutely decide the intended word; therefore, the user is the only
one who can decide whether the word was accurately corrected.
Candidates are listed according to their ranking values. The list should
preferably be short and accurate; a threshold can be applied to the
suggestion list to filter out any candidate with similarity below the
predefined value.
Developers can set the threshold according to the application
environment. For example, some applications are used mostly by native
speakers, so the threshold can be stricter than in applications like
language-learning programs, whose users typically have poor linguistic
knowledge.
Chapter Six
Experimental Results, Conclusions, and
Future Works
6.1 Experimental Results
The objectives of this system are achieved by applying many steps.
Some of these steps required modifying existing techniques to overcome
problems standing in the way of the desired results:
6.1.1 Tagging and Error Detection Time Reduction
Assigning a POS tag to every token in the input text requires looking
up the underlying dictionary. Looking up is an extensive process that
becomes more complex as the dictionary grows; the problem is solved
by applying a prefix-dependent dictionary structure based on hashing and
indexing.
The structure consists of two levels of division: primary packets and
secondary packets. Distributing tokens over primary packets by their
3-symbol prefixes results in quite manageable packet sizes; as shown in
figure (6.1), the search space is reduced to about one thousand tokens at
most, instead of the original hundreds of thousands, with an average packet
size of 11.16 tokens. In addition, the hash function provides random access
to the packet heads.
Secondary packets, in turn, depend on 6-symbol prefixes, which yields
steadier searching and reduces lookup time to a reasonable amount. As shown
in figure (6.2), the search space shrinks from hundreds of thousands of
tokens to a few hundred at most, with an average of 7.26 tokens per
secondary packet.
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
99 
Figure (6.1): Tokens distribution in primary packets
Figure (6.2): Tokens distribution in secondary packets
On the dictionary-lookup side of the tagging phase, the hashing scheme
gives the lookup procedure a set of useful properties:
6.1.1.1 Successful Looking Up:
In the case of a successful match, where the target token is found in
the dictionary, lookup time is reduced through three steps:
- Primary packet selection: the head of every primary packet is
randomly accessible by applying a direct hash function that consumes
three symbols from the target token; this, in turn, reduces matching
time in the next steps.
- Secondary packet selection: selecting a secondary packet involves
examining only three symbols (indices 4-6), resulting in faster
searching even when it is performed sequentially.
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
100 
- Lookup inside the secondary packet: only the remainder of the target
token (its length minus 6) needs matching, since six symbols were
already consumed in the two previous steps on the way to the target
secondary packet.
In other words, the best case for a successful lookup has a time
complexity of O(1): the target token is stored at the first entry of the
primary packet (its head), which is randomly accessible.
The worst case happens when the target token is stored at the last
entry of a secondary packet whose head is itself stored at the last entry
of the related primary packet.
The worst-case time complexity breaks down as follows:
- O(1) for primary packet head access (random access);
- O(L1) for finding the head of the secondary packet under which the
target token is stored, where each step examines only three symbols;
- O(L2) for catching the target token, matching only the remainder of
the token after discarding the first six symbols.
In total, the worst case is O(1) + O(L1) + O(L2).*
6.1.1.2 Failure Looking Up:
If the target token is not found in the dictionary, the lookup failure
can be discovered in three different situations:
- At the hashing step (generating the primary packet head address): if
no entry matches the target token's prefix, the reference table
announces the failure by referring to an empty primary packet.
________________________________________________
* L1 and L2 are the lengths of the primary and secondary packets, respectively.
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
101 
This step consumes only one operation, giving O(1) time complexity.
- Within the first six symbols of the target token, the failure can be
discovered by matching the symbols at indices 4-6 against the same
indices of each token in the primary packet.
This step performs a number of matching operations equal to the length
of the primary packet, giving O(L1) time complexity.
- If a match with a secondary packet head is found (a shared prefix of
at least six symbols), matching against the tokens of that secondary
packet is limited to the remainders of the target token and of those
tokens, since the prefixes were already checked.
The time complexity is O(L2).
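The two-level lookup above can be sketched as follows. Python dictionaries stand in here for the hash function and for the sequential scans inside packets, and every name is ours, not taken from the thesis implementation:

```python
from collections import defaultdict

def build_packets(tokens):
    """Group tokens into primary packets by 3-symbol prefix and, inside
    each primary packet, into secondary packets by 6-symbol prefix."""
    primary = defaultdict(lambda: defaultdict(list))
    for t in tokens:
        primary[t[:3]][t[:6]].append(t)
    return primary

def lookup(primary, token):
    """Return True if token is in the dictionary, failing fast at each level."""
    packet = primary.get(token[:3])    # primary packet access; failure case 1
    if packet is None:
        return False
    secondary = packet.get(token[:6])  # examines symbols 4-6; failure case 2
    if secondary is None:
        return False
    return token in secondary          # match only the remainder; failure case 3

packets = build_packets(["present", "presents", "presume", "prepare"])
lookup(packets, "presume")   # True
lookup(packets, "zebra")     # False at the primary level
```

Note that the dict accesses here are O(1), whereas the thesis scans packets sequentially; the structure and the three failure points are what the sketch illustrates.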
6.1.2 Candidates Generation and Similarity Search Space Reduction
Candidate generation requires examining all the tokens in the lexical
dictionary to compute their similarity to the misspelled token. Our
proposed solution for reducing the search space is a spelling-based
clustering illusion: similarly spelled tokens are grouped together in a way
that leaves the structure of the dictionary unaffected and allows
similarity-based lookup using bi-gram analysis and prefix similarity.
Threshold usage depends on the application environment.
The misspelled token is the basic unit of candidate generation. The
similarity-based lookup proposed in this work generates similar tokens from
the misspelled token and the underlying dictionary. It exploits the
hash-indexing scheme to speed up generation, and bi-gram analysis to
improve the accuracy of candidate selection without losing any candidates
of interest.
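One common formulation of such bi-gram analysis is a Dice-style overlap between the bi-gram sets of two tokens; this is an illustrative sketch, not necessarily the exact formula used in this work:

```python
def bigrams(token):
    """The set of adjacent character pairs in a token."""
    return {token[i:i + 2] for i in range(len(token) - 1)}

def bigram_overlap(a, b):
    """Dice-style bi-gram overlap in [0, 1]: 2|A∩B| / (|A|+|B|).
    One common formulation of bi-gram similarity, assumed for illustration."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

bigram_overlap("definately", "definitely")  # shares 7 of 9 bi-grams per side
```

A high overlap lets a candidate survive the pre-filter before the more expensive edit-distance computation is applied.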
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
102 
The proposed approach is highly flexible: it acts as a clustering
model with well-structured clusters (even though the clustering is, in
fact, an illusion) and supports a set of modifiable parameters:
1. Similarity measure:
In this work, we relied on minimum edit distance techniques
(specifically, an improvement of the Levenshtein method) and used a
similarity measure based on the distance calculated by this method.
The similarity-based lookup approach is independent of the
similarity measure; therefore, any other method or technique can be
used with it.
2. Thresholds:
Threshold specification is a challenge facing many applications;
adjusting it perfectly requires considerable work. This approach
simplifies the task in two ways:
- Candidate generation can be performed without any consideration
of thresholds. After the candidates are collected, they are ranked
and the best few can simply be selected.
- Candidate filtering can be broken into three levels: the first at
primary centroid selection, the second at secondary centroid
selection, and the third at candidate selection.
As in any other area of computation, dividing a problem into
sub-tasks simplifies both the initial assignment and the updating of
the parameters during the adjustment process.
3. Applicability:
Similarity-based lookup readily accepts updates to the candidate
generation and can be adapted to different environments. For example,
a developer can use it for post-correction in OCR applications and
add extra features to make recognition more accurate. Such features
may be incorporated implicitly through the similarity measure, or
passed explicitly as parameters to the generation procedure.
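A minimal sketch of such a distance-based similarity measure, normalizing the Levenshtein distance by the longer token's length (one common normalization, not necessarily the exact formula used here):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Map an edit distance to a [0, 1] similarity; this normalization is
    one common choice, assumed for illustration."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

similarity("acheive", "achieve")  # the ei/ie swap costs 2 plain edits here
```

Because the lookup is independent of the measure, this function could be swapped for any other without changing the surrounding structure.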
6.1.3 Reducing the Time of the Damerau-Levenshtein Method
Damerau's modification of the Levenshtein method increases its time
complexity because it adds extra checks on every symbol of the input
strings; this stems from the simple way it tests for the presence of a
transposition.
In this work, we modified the original method to cover transposition
cases while keeping the spirit of the original: the transposition test is
merged into a statement whose execution is limited to the cases where a
transposition is possible.
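The idea can be sketched as the optimal-string-alignment variant of Levenshtein, in which the transposition test is guarded so it runs only when a swap is possible; this is an illustrative reconstruction, not the thesis code:

```python
def osa_distance(a, b):
    """Levenshtein with adjacent-transposition handling (optimal string
    alignment). The guarded `if` mirrors the execution-limited statement:
    the extra check is skipped for most cells of the table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # transposition check, executed only when a swap is possible
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

osa_distance("acheive", "achieve")  # the ei/ie swap costs 1 instead of 2
```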
Figure (6.3): Time consumed by the Levenshtein, Damerau-Levenshtein, and
Enhanced Levenshtein (our modification) methods [the Y axis represents the
consumed time measured in seconds; the X axis shows the samples used for testing]
The time variance of the three methods (Levenshtein, Damerau-
Levenshtein, and the enhanced Levenshtein) is shown in figure (6.3); the
time consumed by the enhanced method is very close to that of the original
Levenshtein, while the Damerau modification results in somewhat longer
times. The reported time is the average over ten executions of each method
on the same testing group.
6.1.4 Features Effect on Candidates Suggestion
The eight features selected for suggesting the best set of candidates
were tested in three different cases to show how each of them affects the
selection of the optimal suggestion for isolated-word correction.
Figure (6.4) shows the ratios of correctly suggested candidates and of
candidates correctly chosen as optimal. "Suggested" tokens represent
situations where the target token is found in the list of suggestions but
is not necessarily selected as optimal; "chosen as optimal" is the set of
tokens correctly selected as optimal.
Figure (6.4): Suggestion accuracy, with a comparison to Microsoft Office
Word, on a sample from Wikipedia (Experiment 1: Suggestion Accuracy):
- Total misspelled tokens: 1825
- Suggested target token: 1691 (suggestion accuracy = 92.657%)
- Optimally selected: 1477 (optimality accuracy = 87.34%)
- Microsoft Word suggestion: 1659 (accuracy = 90.904%)
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
105 
Suggestion accuracy was computed by applying isolated-word correction
to a list of commonly misspelled words from the Wikipedia website
containing 1825 tokens, which yielded an accuracy of 92.657%; of the
suggested tokens, 87.34% were correctly ranked as the optimal candidate.
Checking the same test data with Microsoft Word yielded a suggestion
accuracy of 90.904%.
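The reported percentages follow directly from the counts in figure (6.4); note that the counts imply the optimality figure is computed relative to the suggested set rather than the full sample:

```python
# Reproducing the reported percentages from the raw counts in Figure 6.4.
total     = 1825  # misspelled tokens in the Wikipedia sample
suggested = 1691  # target token appeared in the suggestion list
optimal   = 1477  # target token ranked as the optimal candidate
ms_word   = 1659  # tokens Microsoft Word suggested correctly

suggestion_accuracy = 100 * suggested / total    # ≈ 92.657%
optimality_accuracy = 100 * optimal / suggested  # ≈ 87.34% of the suggested set
ms_word_accuracy    = 100 * ms_word / total      # ≈ 90.904%
```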
A subset of the Wikipedia sample (presented in Appendix A) was used to
compare our system's accuracy with other systems. The results for the
other systems were taken from the research of Ahmed Farag and others
[Ahm09]; our system was tested on the same data and gave the results shown
in figure (6.5).
Figure (6.5) : Testing the suggested system (I.T.D.C) accuracy and comparing the
results with other systems using the same dataset
The accuracies of the tested systems are:
 ASPELL : 90.833%
 Microsoft Word: 88.33%
 MultiSpell : 92.5%
 I.T.D.C system (the suggested system) : 95.83%
Experiment 2 (a comparison among our work and some systems on isolated-word
correction; 120 tokens per system):
- ASPELL: 109 correctly suggested, 11 incorrectly suggested
- Microsoft Word: 104 correctly suggested, 16 incorrectly suggested
- MultiSpell: 111 correctly suggested, 9 incorrectly suggested
- I.T.D.C System: 115 correctly suggested, 5 incorrectly suggested
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
106 
Another experiment examined the effect of each feature on the accuracy
of optimal-candidate selection. The results shown in figure (6.6) were
computed by discarding one feature at a time, while figure (6.7) shows the
results of using one feature at a time.
Although some features give high accuracy on their own, relying on a
single feature is not sufficient. An example is the duplication feature,
which accounted for 1552 correctly selected optimal tokens when used alone,
whereas discarding it barely affected the total number of optimal-set
tokens.
Figure (6.6): Discarding one feature at a time for optimal candidate
selection (Experiment 3). With all eight features, 1477 tokens are
optimally selected; discarding one feature at a time gives:
- Similarity: 827
- First Letter: 1464
- End Letter: 1468
- Length Effect: 1476
- Same Letter Set: 1436
- Transpositionally Inserted: 1464
- Duplication: 1465
- Confusion: 1475
- Transposition: 1487
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
107 
Figure (6.7): Using one feature at a time for optimal candidate selection
(Experiment 4). With all eight features, 1477 tokens are optimally
selected; using a single feature alone gives:
- Similarity: 1406
- First Letter: 317
- End Letter: 595
- Length Effect: 447
- Same Letter Set: 1486
- Transpositionally Inserted: 1478
- Duplication: 1552
- Confusion: 909
- Transposition: 923

6.2 Conclusions
Text correction is a complex problem and an extensive task. It needs
many linguistic and statistical resources, as well as efficient techniques
for automatic execution. In this work we performed a set of improvements
on both the resource and the technique sides. Our dictionary, an
integration of the WordNet and ISPELL datasets, was retagged to enable and
simplify the parsing process. Hashing and indexing techniques are used to
shorten the error-detection time, and the correction process exploits the
same hashed dictionary together with an enhancement of the Levenshtein
method for generating candidates. A set of features, some of them statistics
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
108 
dependent, are used to optimize the candidates before they are passed to
the parser, where the final decision is made at the level of phrases and
sentences.
There is no way to avoid human intervention, because computers can
never predict with absolute certainty what a human intended; therefore, a
set of alternatives is associated with every corrected word.
6.3 Future Works
Automatic text correction is an open research area; even with several
techniques and applications available, the desired results are still
imperfect. However, some issues could be pursued further in this work to
improve its accuracy:
- Semantic Processing: this system depends entirely on an extensive
parser working at the level of syntax analysis only; semantic information,
if implemented, would increase accuracy by discarding candidates that
conflict with the sentence meaning. Discourse and pragmatic analysis could
also enhance the results.
- In addition to spelling-based clustering and phonetic-based clustering,
a technique that merges the two within the same search-time constraints is
desirable. Such an enhancement would maximize candidate-generation accuracy
while minimizing time complexity.
- In the hash table, lookup inside primary and secondary packets is
performed sequentially; applying a faster technique such as binary search
would be a good improvement. This requires sorting the tokens by spelling
and applying the search in two directions:
o At the level of the token itself, where moving from one entry to
another depends on the tokens' spelling; the symbols of a token
should be examined sequentially, because tokens are short enough
not to warrant a more complex technique.
o At the level of the packets, where movement is performed over
whole tokens.
- Similarity-based lookup needs to be faster; an enhancement is needed
to reduce the number of generated primary centroids. This problem may be
eased if the application in which the system is used becomes more specific.
- In grammar correction, we considered only two error types; it is
preferable to consider as many types as possible.
- Because of time constraints, this system was implemented for simple
sentences only; an extension is required to make it general by covering
complex, compound, and complex-compound sentences. The task is
straightforward, because the construction process can exploit the
phrase-level analysis made in this work without requiring further details.
- A sophisticated study of error types and of how people commonly make
writing mistakes. Such a study requires multiple resources, including
corpora, statistics, and even an interactive analyzer for recording and
classifying commonly committed mistakes. Although it is not an easy task,
it would help in drawing conclusions about the general behavior of users
when they unintentionally change the spelling of words and produce
misspellings.
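The binary-search improvement proposed above for packet lookup can be sketched as follows; the sorted `packet` list is illustrative, not a structure from the thesis:

```python
from bisect import bisect_left

def packet_contains(packet, token):
    """Binary search over a sorted packet: O(log n) comparisons instead of a
    sequential scan, where each comparison itself proceeds symbol by symbol
    over the short tokens."""
    i = bisect_left(packet, token)
    return i < len(packet) and packet[i] == token

packet = sorted(["prepare", "present", "presents", "presume"])
packet_contains(packet, "present")   # True
packet_contains(packet, "presses")   # False
```

The same idea applies at both levels: over the tokens of one packet, and over the packet heads themselves.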
References
Achenkunju A. and Bhuma V.R. (2014). "An Efficient Reformulated
Model for Transformation of String." International Journal of Engineering
Research and Applications (IJERA) ISSN: 2248-9622 International
Conference on Humming Bird. 1 March.
Ahmed F., Ernesto W. De L., and Andreas N.( 2009). Revised N-Gram
based Automatic Spelling Correction Tool to Improve Retrieval
Effectiveness. Technical University of Berlin.
Ali A.(2011). Textual Similarity. Technical University of Denmark.
Amber W. O., Graeme H., and Alexander B. (2008). Real-word spelling
correction with trigrams: A reconsideration of the Mays, Damerau, and
Mercer model. Ontario: University of Toronto.
Baluja S., Vibhu O. M., and Rahul S.(2000). APPLYING MACHINE
LEARNING FOR HIGH PERFORMANCE NAMED-ENTITY
EXTRACTION. Cambridge: Blackwell Publishers.
Bassil Y.( 2012). "Parallel Spell-Checking Algorithm Based on Yahoo! N-
Grams Datasets." International Journal of Research and Reviews in
Computer Science (IJRRCS), ISSN: 2079-2557, Vol.3, No.1, February.
Bhattacharyya P. (2012). "Natural Language Processing A Perspective
from Computation in Presence of Ambiguity, Resource Constraint and
Multilinguality." CSI Journal of Computing, Vol.1 , No. 2, 3-13.
Booth A. D., Brandwood L., and Cleave J. P. (1958). Mechanical
Resolution of Linguistic Problems. New York, London: Academic Press
Inc. Publishers; Butterworths Scientific Publications.
Boswell D. (2005). Speling KoreKsion: A survey of techniques from past to
present. A USCD Research Exam.
Chakraborty R. C. (2010). "Artificial Intelligence: Natural Language
Processing." www.myreaders.info/html/artificial_intelligence.html, 1 June.
Church K. and Gale W. A. (1991). "Probability Scoring for Spelling
Correction." Statistics and Computing, 93-103.
Clark A., Chris F., and Shalom L. (2010). The Handbook of
Computational Linguistics and Natural Language Processing. Singapore:
Wiley-Blackwell.
Dahlmeier D. and Hwee T. N. (2011). "Grammatical Error Correction
with Alternative Structure Optimization." Proceedings of the Association
for Computational Linguistics, 915-923.
Damerau F. J. (1964). A Technique for Computer detection and
Correction of Spelling Errors. New York: ACM, Vol.3,No.4.
Dzikovska M. O. (2004). A Practical Semantic Representation For
Natural Language Parsing. New York: University of Rochester.
Farra N., Nadi T., Alla R., and Nizar H. ( 2014). "Generalized Character-
Level Spelling Error Correction." Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics, June 23-25, 161–167.
Felice M., Yuan Z., Andersen Ø. E., and others.( 2014). "Grammatical
Error Correction using Hybrid Systems and Type Filtering." Proceedings of
the Shared Task Eighteenth Conference on Computational Natural
Language Learning, Maryland, 15-24.
Fromkin V., Robert R., and Nina H. (2007). Language Change: The
Syllables of Time. Vol. 8, in An Introduction to Language, 461-497. Boston.
Gamon M. (2010). "Using Mostly Native Data to Correct Errors in
Learners' writing: A meta-classifier approach." proceedings of the Annual
Meeting of the North America Chapter of the Association for
Computational Linguistics, 163-171.
Golding A. R., and Yves S. (1996). Combining Trigram-based and
Feature-based Methods for Context-Sensitive Spelling Correction.
Cambridge: Mitsubishi Electric Research Laboratories.
Grune D. and Ceriel J. H. J. ( 2008). Parsing Techniques- a practical
guide. Vol. Second Edition. Springer.
Gupta A. (2014). "Grammatical Error Detection and Correction Using
Tagger Disagreement." Proceedings of the Shared Task of the Eighteenth
Conference on Computational Natural Language Learning, 49-52.
Haldar R. and Debajyoti M. ( 2011). Levenshtein Distance Technique in
Dictionary Lookup Methods: An Improved Approach. New York: ACM.
Han N., Martin C., and Claudia L. (2006). "Detecting errors in English
Article Usage by Non-native Speakers." Natural Language Engineering,
115-129.
Hasan F. M.( 2006). COMPARISON OF DIFFERENT POS TAGGING
TECHNIQUES FOR SOME SOUTH ASIAN LANGUAGES. Dhaka: BRAC
University.
Hasan F. M., Naushad U., and Mumit K.( 2006). Comparison of
different POS Tagging Techniques (N-Gram, HMM and Brill’s tagger) for
Bangla. Bangladesh: BRAC University.
Hodge V. J. and Austin J. (2003). "A Comparison of Standard Spell
Checking Algorithms and Novel Binary Neural Approach." IEEE Trans.
Know. Dat. Eng, 1073-1081.
Hwee T. N., Siew M. W., Ted B., and others. (2014). "The CoNLL-2014
Shared Task on Grammatical Error Correction." Proceedings of the Shared
Task of Eighteenth Conference on Computational Natural Language
Learning, June 26-27,1-14.
ISPELL. "Ispell." Wikipedia, the free encyclopedia, April 10, 2014
(accessed September 2014).
Jackson P. and Isabelle M.(2002). Natural Language Processing for
Online applications, Text Retrieval, Extraction and Categorization.
Amsterdam: John Benjamins Publishing Company.
Jones K. S. (2001). Natural Language Processing - A historical Review.
University of Cambridge, October.
Julius G. III.(2013) Intrasentential Grammatical Correction with
Weighted Finite State Transducers. Raleigh, North Carolina: North
Carolina State University.
Jurafsky D. and James H. M. (2000). Speech and Language Processing:
An introduction to natural language processing, Computational
Linguistics, and Speech Recognition. New Jersey: Alan Apt.
Kirthi J., Neeju N. J., and Nithiya P. (2011). "Automatic Spell Correction
of User Query with Semantic Information Retrieval and Ranking of Search
Results using WordNet Approach." IJCSI International Journal of
Computer Science Issues, Vol. 8, No. 2, March, 557-564.
Kukich K. ( 1992). Techniques for Automatically Correcting Words in
Text. ACM Computing Surveys, Vol. 24, No. 4.
Manning C. D., Raghavan P., and Schütze H. (2008). Introduction to
Information Retrieval. Cambridge University Press.
Mihov S., Svetla K., and others. (2004). Precise and Efficient Text
Correction Using Levenshtein Automata, Dynamic Web Dictionaries and
Optimized Correction Models. Bulgarian Academy of Sciences.
Mishra R. and Navjot K.(2013). "A Survey of Spelling Error Detection
and Correction Techniques." International Journal of Computer Trends
and Technology, vol.4, No.3, 372-374.
Momtazi S.(2012). Natural Language Processing: Introduction to
Language Technology. University of Potsdam.
Nadkarni P. M., Lucila O., and Wendy W. C.( 2011). "Natural language
processing: an introduction." J Am Med Inform Assoc, October 5, 544-551.
Niemann T.( 2009). SORTING AND SEARCHING ALGORITHMS.
Portland: epaperpress.com.
"Notes on Ambiguity." http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.
Peterson J. L. (1980). "Computer Programs for Detecting and Correcting
Spelling Errors." Communications of the ACM, Vol.23, No. 12, 676- 687.
Pollock J. J. and Zamora A. (1983). "Collection and Characterization of
Spelling Errors in Scientific and Scholarly Text." Journal of the American
Society for Information Science, 51-58.
Pollock J. J. and Zamora A. (1984). "Automatic Spelling Correction in
Scientific and Scholarly Text." Communications of the ACM, 358-368.
Quirk R., Sidney G., Geoffrey L., and Jan S. (1985). A
Comprehensive Grammar of the English Language. New York and
London: Longman.
Raaijmakers S. (2013). "A Deep Graphical Model for Spelling
Correction." Proceedings of the 25th Benelux Conference on Artificial
Intelligence. Delft, 7-8 November.
Rich E. and Kevin K.( 1991). Chapter Fifteen: Natural Language
Processing. Vol. 2, in Artificial Intelligence. Amazon.
Ritter A., Mausam S. C., and Oren E. ( 2011). Named Entity Recognition
in Tweets An Experimental Study. Computer Science and Engineering,
University of Washington.
Rajesh K. S. and Lokanatha C. R.(2009). "Natural Language Processing
- An Intelligent way to understand Context Sensitive Languages."
International Journal of Intelligent Information Processing, December
3,421-428.
Sagar and Shobha G. (2013). "Survey on Grammar Generation Methods
for Natural Languages." International Journal of Computational
Linguistics and Natural Language Processing ISSN 2279 – 0756, Vol.
2,No.1, January, 197-202.
Salifou L. and Harouna N. (2014). "Design of A Spell Corrector For
Hausa Language." International Journal of Computational Linguistics
(IJCL), Vol.5,No.2, 14-26.
Scott M. T. (1999). PARSING AND TAGGING SENTENCES
CONTAINING LEXICALLY AMBIGUOUS AND UNKNOWN TOKENS.
Purdue University.
Seo H., Jonghoon L., Seokhwan K., and others. (2012). "A Meta
Learning Approach to Grammatical Error Correction." 50th Annual
Meeting of the Association for Computational Linguistics. Jeju Island, July,
8 - 14.
Setiadi I. (2014). Damerau-Levenshtein Algorithm and Bayes Theorem for
Spell Checker Optimization. Bandung: Makalah IF2211 Strategi Algoritma
– Sem. I Tahun.
Tetreault J., Jenniefer F., and Martin C. (2010). "Using Parse Features
for Preposition Selection and Error Detection." Proceedings of the ACL
2010 Conference Short Papers, 353-358.
Toutanova K. and Moore R. C. (2002). "Pronunciation Modeling for
Improved Spelling Correction." Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, 144-151.
Verberne S. (2002). Context-sensitive spell checking based on word
trigram probabilities. University of Nijmegen.
Voorhees E., Harman D. K., and others. (2005). TREC: Experiment and
Evaluation in Information Retrieval. Cambridge: MIT Press.
Wagner R. A. and Fischer M. J. (1974). "The String-to-String Correction
Problem." Journal of the Association for Computing Machinery, 168-173.
Wolniewicz R. (2011). Auto-Coding and Natural Language Processing.
U.S.A: 3M Health Information Systems.
Yannakoudakis E.J. and Fawthrop D. (1983). "An Intelligent Spelling
Error Correction." Information Processing and Management, 101-108.
Yule G. (2000). "Pragmatics." In Oxford Introductions to Language Study
Series Editor H.G. Widdowson, 4. Oxford University Press.
Zampieri M. and Renato C. de A. ( 2014). Between Sound and Spelling
Combining Phonetics and Clustering Algorithms to Improve Target Word
Recovery. Saarland: Saarland University.
Zhan J., Xiolong M., Shu q. L., and Ditang F. (1998). A language Model
in a Large-Vocabulary Speech Recognition System. Sydney: Proceedings of
International Conference ICSLP98.
Appendix A
Appendix (A): A comparison among this work and some systems on the isolated words correction
* Bold words are incorrectly suggested
** I.T.D.C system: Intelligent Text Document Correction System Based on Similarity Technique (our suggested system)
Columns: Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C System
Abberration aberration aberration aberration aberration aberration
accomodation accommodation accommodation accommodation accommodation accommodation
acheive achieve Achieve achieve achieve achieve
abortificant abortifacient aficionados - abortifacient abortifacient
absorbsion absorption absorbsion absorbs ion absorption absorption
ackward (awkward,
backward)
awkward (awkward,
backward)
(awkward, backward) (backward,
awkward)
additinally additionally additionally additionally additionally additionally
adminstration administration administration administration administration administration
admissability admissibility admissibility admissibility admissibility admissibility
advertisments advertisements advertisements advertisements advertisements advertisements
adviced advised advised advised advice advised
afficionados aficionados aficionados aficionados aficionados aficionados
affort (effort, afford) effort afford afford (effort, afford)
agains against agings agings against against
aggreement agreement agreement agreement agreement agreement
agressively aggressively aggressively aggressively aggressively aggressively
agriculturalist agriculturist - - agriculturist agriculturist
alcoholical alcoholic alcoholically alcoholically alcoholic (alcoholically,
alcoholic)
algebraical algebraic algebraic algebraically algebraically algebraic
algoritms algorithms algorithms algorithms algorithms (algorism,
algorithms)
alterior (ulterior, anterior) ulterior (anterior,
ulterior)
(anterior, ulterior) (ulterior, anterior)
anihilation annihilation annihilation annihilation annihilation annihilation
anthromorphization anthropomorphization anthropomorphizing - anthropomorphization anthropomorphization
bankrupcy bankruptcy bankruptcy bankruptcy bankruptcy bankruptcy
baout (about, bout) bout (about, bout) bout (about, bout)
basicly basically basically basically basically basically
breakthough breakthrough break though breakthrough breakthrough breakthrough
carachter character crocheter character character character
cannotation connotation connotation (connotation,
annotation)
(connotation,
annotation)
connotation
carismatic charismatic charismatic charismatic charismatic charismatic
carmel caramel Carmel - caramel caramel
cervial (cervical, servile) cervical cervical cervical cervical
clasical classical classical classical classical classical
cleareance clearance clearance clearance clearance clearance
comissioning commissioning commissioning commissioning commissioning commissioning
commemerative commemorative commemorative commemorative commemorative commemorative
compatabilities compatibilities compatibilities compatibilities compatabilities compatibilities
committment commitment commitment commitment commitment commitment
debateable debatable debatable debatable debatable debatable
determinining determining determinining determinining determining determining
childbird childbirth child bird child bird childbirth childbirth
definately definitely definitely definitely definitely definitely
decribe describe describe describe describe describe
elphant elephant elephant elephant elephant elephant
emmediately immediately immediately immediately immediately immediately
emphysyma emphysema emphysema emphysema emphysema emphysema
erally (orally, really) orally really orally (really ,orally)
eyasr (years, eyas) eyesore years eyas (eyas ,years)
facist fascist fascist fascist fascist fascist
fluoroscent fluorescent fluorescent fluorescent fluorescent fluorescent
geneology genealogy genealogy genealogy genealogy genealogy
gernade grenade grenade grenade grenade grenade
girates gyrates grates gyrates Gyrates gyrates
gouvener governor governor souvenir convener (souvenir,
gouverneur,
governor)
gurantees guarantee guarantee guarantee guarantee (guaranties,guarantee)
guerrila (guerilla, guerrilla) guerrilla guerrilla (guerilla, guerrilla) (guerrilla, guerilla)
guerrilas (guerillas, guerrillas) guerrillas guerrillas (guerillas, guerrillas) (guerrillas, guerillas)
Guiseppe Giuseppe Giuseppe Giuseppe Giuseppe -
habaeus (habeas, sabaeus) habeas habitués sabaeus Cabaeus
hierarcical hierarchical hierarchical hierarchical hierarchical hierarchical
heros heroes heroes heroes herbs heroes
hypocracy hypocrisy hypocrisy hypocrisy hypocrisy hypocrisy
independance Independence Independence - Independence Independence
intergration integration integration integration integration integration
intrest interest interest interest interest interest
Johanine Johannine Johannes Johannes Johannine Johannine
judisuary judiciary judiciary judiciary judiciary judiciary
kindergarden kindergarten kindergarten kindergarten kindergarten kindergarten
knowlegeable knowledgeable knowledgeable knowledgeable knowledgeable knowledgeable
labatory (lavatory, laboratory) (lavatory, laboratory) (lavatory, laboratory) (lavatory, laboratory) lavatory
lonelyness loneliness loneliness loneliness loneliness loneliness
legitamate legitimate legitimate legitimate legitimate legitimate
libguistics linguistics linguistics linguistics linguistics linguistics
lisence (license, licence) licence silence licence (licence, license)
mathmatician mathematician mathematician mathematician mathematician mathematician
ministery ministry ministry ministry ministry ministry
mysogynist misogynist misogynist misogynist misogynist misogynist
naturaly naturally naturally naturally naturally naturally
ocuntries countries countries countries countries countries
paraphenalia paraphernalia paraphernalia paraphernalia paraphernalia paraphernalia
Palistian Palestinian Alsatain politian Palestinian (Pakistan, politian)
pamflet pamphlet pamphlet pamphlet pamphlet partlet
psyhic psychic psychic psychic psychic psychic
Peloponnes Peloponnesus Peloponnese Peloponnese Peloponnesus Peloponnese
personell personnel personnel personnel personnel (personally, personnel)
posseses possesses possesses possesses possess possesses
prairy prairie priory prairie airy (priory, prairie)
qutie (quite, quiet) quite quite queue quite
radify (ratify,ramify) ratify ratify ramify (rarify, ratify, ramify)
reccommended recommended recommended recommended recommended recommended
reciever receiver receiver receiver reliever receiver
reconaissance reconnaissance reconnaissance reconnaissance reconnaissance reconnaissance
restauration restoration restoration restoration instauration restoration
rigeur (rigueur, rigour, rigor) rigger rigueur (rigueur, rigour) rigour
Saterday Saturday Saturday Saturday Saturday Saturday
scandanavia Scandinavia Scandinavia Scandinavia Scandinavia Scandinavia
scaleable scalable scalable - scalable scalable
secceeded (seceded, succeeded) succeeded succeeded succeeded succeeded
sepulchure (sepulchre, sepulcher) sepulcher sepulchered sepulchre (sepulchre, sepulcher)
themselfs themselves themselves themselves themselves themselves
throught (thought, through, throughout) (thought, through) (thought, through) (thought, through, throughout) (through, thought, throughout)
troups (troupes, troops) (troupes, troops) troupes troops (troops, troupes)
simultanous simultaneous simultaneous simultaneous simultaneous simultaneous
sincerley sincerely sincerely sincerely sincerely sincerely
sophicated sophisticated suffocated supplicated sophisticate sophister
surrended (surrounded, surrendered) surrounded surrender surrounded (surrender, surrendered, surrounded)
unforetunately unfortunately unfortunately unfortunately unfortunately unfortunately
unnecesarily unnecessarily unnecessarily unnecessarily unnecessarily unnecessarily
usally usually usually usually usually usually
useing using using using seeing using
vaccum vacuum vacuum vacuum vacuum vacuum
vegitables vegetables vegetables vegetables vegetables vegetables
vetween between between between between between
volcanoe volcano volcano volcano volcano ( volcanoes, volcano)
weaponary weaponry weaponry weaponry weaponry weaponry
worstened worsened worsened worsened worsened worsened
wupport support support support support support
yeasr years years years yeast years
Yementite (Yemenite, Yemeni) Yemenite Yemenite Yemenite Yemenite
yuonger younger younger younger sponger younger
Appendix B
1. Abstract (in Arabic; translated):

Automatic text correction is considered one of the most important problems associated with human-computer interaction, as it enters into many practical areas, directly, such as correcting the errors that result from converting handwritten texts into digital form, and indirectly, such as correcting users' queries before performing a retrieval operation on an interactive database.

The automatic correction process passes through two main phases: error detection and candidate suggestion. Many techniques and methods exist for both phases, varying in the accuracy of their results and in their applicability; they are generally divided into procedural and statistical methods. Procedural methods rely on specific rules that govern the acceptability of texts, including natural language processing techniques, whereas statistical methods rely on statistical and probabilistic data usually collected from huge samples drawn mainly from what is in common use among users.

In this system, natural language processing techniques were adopted as a basis for analysis and for checking the lexical and grammatical acceptability of English texts. A dictionary containing all the vocabulary of the English language was used for detecting and identifying lexical errors; given the huge size of this dictionary, a hash function and an indexing method were used to narrow the search scope for the desired words and to provide random-access capability based on their prefixes, thereby shortening lookup time.

Candidate generation relies on computing the degree of similarity between the input word and all the dictionary words and re-ranking them according to this measure, which is calculated using an improved Levenshtein method. Since this generation process requires a long time, the dictionary words were divided into small groups, while preserving random-access capability, according to criteria that depend on the spelling of the source word. Candidate suggestion involves testing a set of features related, to some extent, to the nature of the most common errors. The system selects the optimal candidate that achieves the highest compatibility with the source word, provided that it does not conflict with the rules of grammar, so that the corrected text is lexically and grammatically acceptable.

Accuracy tests showed that the proposed system outperforms Microsoft Word and other systems; moreover, the improved string-similarity method approximately preserved the original time complexity while gaining the ability to discover an additional type of spelling error.
2. Title page (in Arabic; translated):

Intelligent Text Document Correction System Based on Similarity Technique

A thesis submitted to the Council of the College of Information Technology, University of Babylon, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By
Marwa Kadhim Obeid Al-Rikaby

Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry

2015 A.D. / 1436 A.H.

Ministry of Higher Education and Scientific Research
University of Babylon - College of Information Technology
Software Department
  • 5. I would like to thank the staff of the Software Department for the help they have offered, especially the head of the Software Department, Dr. Eman Salih Al-Shamery. Most importantly, I would like to thank my parents, my sisters, my brothers and my friends for their support.
  • 6. VI Abstract

Automatic text correction is one of the human-computer interaction challenges. It is directly involved in several application areas, such as correcting digitized handwritten text, and indirectly involved in others, such as correcting users' queries before a retrieval process is applied in interactive databases.

Automatic text correction passes through two major phases: error detection and candidate suggestion. Techniques for both phases are categorized into procedural and statistical. Procedural techniques are based on rules that govern text acceptability, including natural language processing techniques. Statistical techniques, on the other hand, depend on statistics and probabilities collected from large corpora reflecting what is commonly used by humans.

In this work, natural language processing techniques are used as the basis for analysis and for both spelling and grammar acceptability checking of English texts. A prefix-dependent hash-indexing scheme shortens the time of looking up the dictionary, which contains all English tokens and serves as the basis of the error detection process. Candidate generation is based on calculating the similarity of the source token, measured using an improved Levenshtein method, to the dictionary tokens and ranking them accordingly; because this process is time-intensive, tokens are divided into smaller groups according to spelling similarity in a way that preserves random-access availability. Finally, candidate suggestion involves examining a set of features related to commonly committed mistakes. The system selects the optimal candidate, the one providing the highest suitability without violating grammar rules, to generate linguistically accepted text.

Testing the system's accuracy showed better results than Microsoft Word and some other systems. The enhanced similarity measure kept the time complexity close to that of the original Levenshtein method while adding discovery of an additional error type.
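The prefix-dependent hash-indexing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual scheme: the two-letter bucket width, function names and toy lexicon are assumptions.

```python
from collections import defaultdict

def build_index(words, prefix_len=2):
    """Bucket dictionary words by a short prefix so that a lookup
    only has to search one small group (near random access)."""
    index = defaultdict(set)
    for w in words:
        index[w[:prefix_len].lower()].add(w.lower())
    return index

def contains(index, token, prefix_len=2):
    # Only the bucket sharing the token's prefix is consulted,
    # instead of scanning the whole dictionary.
    return token.lower() in index[token.lower()[:prefix_len]]

lexicon = ["apple", "apply", "banana", "band", "bandage"]
idx = build_index(lexicon)
print(contains(idx, "Apple"))    # True
print(contains(idx, "bandana"))  # False
```

With a real 300,000-token dictionary, the bucket sizes stay small enough that each lookup touches only a tiny fraction of the vocabulary.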
  • 7. VII Table of Contents Subject Page No. Chapter One : Overview 1.1 Introduction 1 1.2 Problem Statement 3 1.3 Literature Review 5 1.4 Research Objectives 10 1.5 Thesis Outlines 11 Chapter Two: Background and Related Concepts Part I: Natural Language Processing 12 2.1 Introduction 12 2.2 Natural Language Processing Definition 12 2.3 Natural Language Processing Applications 13 2.3.1 Text Techniques 14 2.3.2 Speech Techniques 15 2.4 Natural Language Processing and Linguistics 16 2.4.1 Linguistics 16 2.4.1.1 Terms of Linguistic Analysis 17 2.4.1.2 Linguistic Units Hierarchy 19 2.4.1.3 Sentence Structure and Constituency 19 2.4.1.4 Language and Grammar 20 2.5 Natural Language Processing Techniques 22 2.5.1 Morphological Analysis 22 2.5.2 Part of Speech Tagging 23 2.5.3 Syntactic Analysis 26 2.5.4 Semantic Analysis 27 2.5.5 Discourse Integration 27 2.5.6 Pragmatic Analysis 28 2.6 Natural Language Processing Challenges 28 2.6.1 Linguistics Units Challenges 28 2.6.1.1 Tokenization 28 2.6.1.2 Segmentation 29 2.6.2 Ambiguity 31 2.6.2.1 Lexical Ambiguity 31
  • 8. VIII Subject Page No. 2.6.2.2 Syntactic Ambiguity 31 2.6.2.3 Semantic Ambiguity 32 2.6.2.4 Anaphoric Ambiguity 32 2.6.3 Language Change 32 2.6.3.1 Phonological Change 33 2.6.3.2 Morphological Change 33 2.6.3.3 Syntactic Change 33 2.6.3.4 Lexical Change 33 2.6.3.5 Semantic Change 34 Part II: Text Correction 35 2.7 Introduction 35 2.8 Text Errors 35 2.8.1 Non-words Errors 36 2.8.2 Real-word Errors 36 2.9 Error Detection Techniques 37 2.9.1 Dictionary Looking Up 37 2.9.1.1 Dictionaries Resources 37 2.9.1.2 Dictionaries Structures 38 2.9.2 N-gram Analysis 39 2.10 Error Correction Techniques 40 2.10.1 Minimum Edit Distance Techniques 40 2.10.2 Similarity Key Techniques 43 2.10.3 Rule Based Techniques 43 2.10.4 Probabilistic Techniques 43 2.11 Suggestion of Corrections 44 2.12 The Suggested Approach 44 2.12.1 Finding Candidates Using Minimum Edit Distance 45 2.12.2 Candidates Mining 45 2.12.3 Part-of-Speech Tagging and Parsing 46 Chapter Three : Hashed Dictionary and Looking Up Technique 3.1 Introduction 48 3.2 Hashing 48 3.2.1 Hash Function 49 3.2.2 Formulation 52 3.2.3 Indexing 53 3.3 Looking Up Procedure 56
  • 9. IX Subject Page No. 3.4 Dictionary Structure Properties 58 3.5 Similarity Based Looking-Up 59 3.5.1 Bi-grams Generation 60 3.5.2 Primary Centroids Selection 62 3.5.3 Centroids Referencing 63 3.6 Application of Similarity Based Looking up approach 64 3.7 The Similarity Based Looking up Properties 67 Chapter Four : Error Detection and Candidates Generation 4.1 Introduction 69 4.2 Non-word Error Detection 69 4.3 Real-Words Error Detection 71 4.4 Candidates Generation 72 4.4.1 Candidates Generation for Non-word Errors 72 4.4.1.2 Enhanced Levenshtein Method 74 4.4.1.3 Similarity Measure 78 4.4.1.4 Looking for Candidates 79 4.4.2 Candidates Generation for Real-words Errors 81 Chapter Five : Text Correction and Candidates Suggestion 5.1 Introduction 82 5.2 Correction and Candidates Suggestion Structure 82 5.3 Named-Entity Recognition 85 5.4 Candidates Ranking 86 5.4.1 Edit Distance Based Similarity 87 5.4.2 First and End Symbols Matching 87 5.4.3 Difference in Lengths 88 5.4.4 Transposition Probability 89 5.4.5 Confusion Probability 90 5.4.6 Consecutive Letters (Duplication) 91 5.4.7 Different Symbols Existence 92 5.5 Syntax Analysis 93 5.5.1 Sentence Phrasing 93 5.5.2 Candidates Optimization 95 5.5.3 Grammar Correction 95 5.5.4 Document Correction 97 Chapter Six: Experimental Results, Conclusions, and Future Works
  • 10. X Subject Page No. 6.1 Experimental Results 98 6.1.1 Tagging and Error Detection Time Reduction 98 6.1.1.1 Successful Looking Up 99 6.1.1.2 Failure Looking Up 100 6.1.2 Candidates Generation and Similarity Search Space Reduction 101 6.1.3 Time Reduction of the Damerau-Levenshtein method 103 6.1.4 Features Effect on Candidates Suggestion 104 6.2 Conclusions 107 6.3 Future Works 108 References 110 Appendix A 117 Appendix B 122 List of Figures Figure No. Title Page No. (2.1) NLP dimensions 16 (2.2) Linguistics analysis steps 17 (2.3) Linguistic Units Hierarchy 19 (2.4) Classification of POS tagging models 24 (2.5) An example of lexical change 34 (2.6) Outlines of Spell Correction Algorithm 38 (2.7) Levenshtein Edit Distance Algorithm 41 (2.8) Damerau-Levenshtein Edit Distance Algorithm 42 (2.9) The Suggested System Block Diagram 47 (3.1) Token Hashing Algorithm 54
  • 11. XI Figure No. Title Page No. (3.2) Dictionary Structure and Indexing Scheme 55 (3.3) Algorithm of Looking Up Procedure 57 (3.4) Semi Hash Clustering block diagram 61 (3.5) Similarity Based Hashing algorithm 64 (3.6) Block diagram of candidates generation using SBL 66 (3.7) Similarity Based Looking up algorithm 68 (4.1) Tagging Flow Chart 70 (4.2) The Enhanced Levenshtein Method Algorithm 76 (4.3) Original Levenshtein Example 77 (4.4) Damerau-Levenshtein Example 77 (4.5) Enhanced Levenshtein Example 78 (5.1) Candidates ranking flowchart 84 (5.2) Syntax analysis flowchart 94 (6.1) Tokens distribution in primary packets 99 (6.2) Tokens distribution in secondary packets 99 (6.3) Time complexity Variance of Levenshtein, Damerau- Levenshtein, and Enhanced Levenshtein (our modification) 103 (6.4) Suggestion Accuracy with a comparison to Microsoft Office Word on a Sample from the Wikipedia 104 (6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset 105 (6.6) Discarding one feature at a time for optimal candidate selection 106 (6.7) Using one feature at a time for optimal candidate selection 107
  • 12. XII List of Tables Table No. Title Page No. (1-1) Summary of Literature Review 9 (3-1) Alphabet Encoding 50 (3-2) Addressing Range 52 (3-3) Predicting errors using Bi-grams analysis 61 (5-1) Transposition Matrix 90 (5-2) Confusion Matrix 91
List of Symbols and Abbreviations
∑ : Alphabet
A : Adjectival Phrase
abs : Absolute Difference
C : Sentence Complement
CFG : Context Free Grammar
D : Dictionary
DNA : Deoxyribonucleic Acid
E : Error
G : Grammar
GEC : Grammar Error Correction
HMM : Hidden Markov Model
IR : Information Retrieval
MT : Machine Translation
NE : Named Entity
NER : Named-Entity Recognition
NG : Noun Group
NLG : Natural Language Generation
NLP : Natural Language Processing
NLs : Natural Languages
NLU : Natural Language Understanding
  • 13. XIII
NP : Noun Phrase
O( ) : big-Oh notation (= at most)
OCR : Optical Character Recognition
P : Production Rule
POS : Part Of Speech
PP : Prepositional Phrase
Q : Query
R : Ranking Value
R_Dist : Relative Distance
S : Start Symbol
SMT : Stanford Machine Translator
SR : Speech Recognition
St1, St2 : String1, String2
V : Variable
v : Adverbial Phrase
VP : Verb Phrase
Ω( ) : big-Omega notation (= at least)
  • 15. 1 Chapter One: Overview

1.1 Introduction

Natural Language Processing, also known as Computational Linguistics, is the field of computer science that deals with linguistics; it is a form of human-computer interaction in which formalization is applied to the elements of human language so that they can be processed by a computer [Ach14]. Natural Language Processing (NLP) is the implementation of systems capable of manipulating and processing natural language (NL) sentences [Jac02] such as English, Arabic and Chinese, as opposed to formal languages like Python, Java and C++, or descriptive notations such as DNA sequences in biology and chemical formulas in chemistry [Mom12]. The NLP task is the design and construction of software for analyzing, understanding and generating spoken and/or written NLs. [Man08] [Mis13]

NLP has many applications, such as automatic summarization, Machine Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition (SR), Optical Character Recognition (OCR), Information Retrieval (IR), Opinion Mining [Nad11], and others [Wol11]. Text Correction is another significant application of NLP; it includes both Spell Checking and Grammar Error Correction (GEC). Spell checking research extends back to the mid-20th century with Les Earnest at Stanford University, but the first application was created in 1971 by Ralph Gorin, Earnest's student, for the DEC PDP-10 mainframe with a dictionary of 10,000 English words. [Set14] [Pet80]

Grammar error correction, in spite of its central role in semantic and meaning representation, has been largely ignored by the NLP community. In recent
  • 16. years, improvements have been noticed in automatic GEC techniques. [Voo05] [Jul13] However, most of these techniques are limited to specific domains such as real-word spell correction [Hwe14], subject-verb disagreement [Han06], verb tense misuse [Gam10], determiners or articles, and improper preposition usage. [Tet10] [Dah11]

Different techniques such as edit distance [Wan74], rule-based techniques [Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98], probabilistic techniques [Chu91], neural nets [Hod03] and the noisy channel model [Tou02] have been proposed for text correction purposes. Each technique needs some sort of resource: edit distance, rule-based and similarity key techniques require a dictionary (or lexicon); n-gram and probabilistic techniques work with statistical and frequency information; neural nets are trained with training patterns; and so on.

Text correction, spelling and grammar, is an extensive process that typically includes three major steps: [Ach14] [Jul13]

The first step is to detect the incorrect words. The most popular way to decide whether a word is misspelled is to look it up in a dictionary, a list of correctly spelled words. This approach can detect non-word errors but not real-word errors [Kuk92] [Mis13], because an unintended word may still match a word in the dictionary. NLs have a large number of words, resulting in a huge dictionary, so looking up every word consumes a long time. In GEC this step is more complicated: it requires analysis at the level of sentences and phrases, using computational linguistics, to detect the word that makes the sentence incorrect.

Next, a list of candidates or alternatives should be generated for the incorrect word (misspelled or misused).
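The dictionary-lookup detection step above can be sketched in a few lines. This is a toy illustration under stated assumptions: the tiny dictionary and the function name are invented here, and the real system works over a dictionary of more than 300,000 tokens.

```python
import re

def detect_nonword_errors(text, dictionary):
    """Return the tokens of `text` that are absent from `dictionary`.

    Only non-word errors are caught this way; a real-word error
    (e.g. typing 'form' for 'from') matches a dictionary entry
    and slips through, exactly as the text explains.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in dictionary]

dictionary = {"the", "cat", "sat", "on", "mat", "form", "from"}
print(detect_nonword_errors("The cat szt on the mat", dictionary))  # ['szt']
```

Note that scanning a flat set like this is what the thesis's hash-indexing scheme is designed to speed up at scale.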
This list is preferred to be short and contains the words with highest similarity or suitability; and to produce it, a technique is needed to calculate the similarity of the incorrect word with
  • 17. every word in the dictionary. Efficiency and accuracy are major factors in the selection of such a technique. GEC requires broad knowledge of diverse grammatical error categories and extensive linguistic technique to identify alternatives, because a grammatical error may not result from a single word.

Finally, the intended word, or a list of alternatives containing it, is suggested. This task requires ranking the words according to their similarity to the incorrect word; other considerations may or may not be taken into account depending on the technique in use.

Text mining techniques have started to enter the area of text correction; Clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11] and Information Retrieval [Kir11] are examples. Statistics and probabilities have also played a great role, specifically in analyzing common mistakes and n-gram datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and phonetic levels, can be used to reduce the lookup space; NER may help avoid interpreting proper nouns as misspellings; and statistics have been merged with NLP techniques to provide more precise parsing and POS tagging, usually in context-dependent applications. The application of a given technique differs according to the intended level of correction: it starts from the character level [Far14], passes through the word, phrase (usually in GEC) and sentence levels, and ends at the context or document-subject level.

1.2 Problem Statement

Although many text checking and correction systems have been produced, each varies in its input quality restrictions, techniques used, output accuracy, speed, performance conditions, etc. [Ahm09] [Pet80]. This field of NLP remains open research from all sides, because no complete algorithm or technique handles all considerations.
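The candidate-ranking step described above can be sketched with the standard library's difflib as a stand-in similarity measure. This is only an illustration: the thesis uses its own enhanced Levenshtein measure plus extra features, and the lexicon here is invented.

```python
import difflib

def suggest(misspelled, lexicon, n=3):
    # Rank dictionary words by a similarity score and return the n
    # best, most similar first; difflib's ratio stands in for the
    # thesis's edit-distance-based measure.
    return difflib.get_close_matches(misspelled, lexicon, n=n, cutoff=0.6)

lexicon = ["receiver", "review", "revive", "reverse", "cat"]
print(suggest("reciever", lexicon))
```

Keeping the returned list short and well ordered is exactly the "short list with highest similarity" requirement stated above.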
The limited linguistic knowledge, the huge number of lexicons, the extensive grammar, language ambiguity and change over time, the variety of committed errors, and the computational requirements are challenges facing the development of a text correction application. In this work, some of the above-mentioned problems are solved using a set of solutions:
• Integrating two lexicon datasets (WordNet and Ispell).
• Using a brute-force approach to solve some sorts of ambiguity.
• Applying hashing and indexing in looking up the dictionary.
• Reducing the search space of the candidate-collection process by grouping similarly spelled words into semi-clusters.
The Levenshtein method [Hal11] is also enhanced to consider Damerau's four types of errors within a shorter time than the Damerau-Levenshtein method [Hal11]. Named Entity Recognition, letter confusion and transposition, and candidate length are used as features to optimize the candidate suggestion, in addition to applying rules of Part-Of-Speech tags and sentence constituency to check the grammatical correctness of a sentence, whether or not it is lexically correct. The three components of the proposed system are: (1) a spelling error detector based on a fast lookup technique over a dictionary of more than 300,000 tokens, constructed by applying a string-prefix-dependent hash function and an indexing method; the grammar error detector is a brute-force parser. (2) For candidate generation, an enhancement was implemented on the Levenshtein method to consider Damerau's four error types; it is then used to measure similarity according to the minimum edit distance and the effect of the difference in lengths, and the dictionary tokens are grouped into spelling-based clusters to reduce the search space. (3) The candidate suggestion exploits NER features,
transposition error and confusion statistics, affix analysis (including first- and last-letter matching), candidate length, and parsing success.

1.3 Literature Review
• Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to string transformation that includes a model consisting of rules and weights for training, and an algorithm that depends on scoring and ranking according to a conditional probability distribution for generating the top k candidates at the character level, where both high- and low-frequency words can be generated. Spell checking is one of many applications to which the approach was applied; the misspelled strings (words or characters) are transformed, by applying a number of operators, into the k most similar strings in a dictionary (start and end letters are constant). [Ach14]
• Mariano F., Zheng Y., and others, 2014, addressed the correction of grammatical errors by pipelining processes, combining results from multiple systems. The components of the approach are: a rule-based error corrector that uses rules automatically derived from the Cambridge Learner Corpus, based on N-grams that have been annotated as incorrect; an SMT system that translates incorrectly written English into correct English; NLTK¹, used to perform segmentation, tokenization, and POS tagging; candidate generation that produces all possible combinations of corrections for the sentence, in addition to the sentence itself to consider the "no correction" option; finally, the candidates are ranked using a language model. [Fel14]
__________________________________________________________
¹ The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license.
Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language.
• Anubhav G., 2014, presented a rule-based approach that used two POS taggers, the Stanford parser and TreeTagger, to correct the grammatical errors of non-native English speakers. Error detection depends on the outputs of the two taggers: if they differ, the sentence is not correct. Errors are corrected using the Nodebox English Linguistics library. Error correction includes subject-verb disagreement, verb form, and errors detected by POS tag mismatch. [Gup14]
• Stephan R., 2013, proposed a model for spelling correction based on treating words as "documents" and spelling correction as a form of document retrieval, in that the model retrieves the best-matching correct spelling for a given input. The words are transformed into tiny documents of bits, and Hamming distance is used to predict the closest bit string from a dictionary holding the correctly spelled words as bit strings. The model is knowledge-free and only contains a list of correct words. [Raa13]
• Youssef B., 2012, produced a parallel spell checking algorithm for spelling error detection and correction. The algorithm is based on information from the Yahoo! N-gram dataset 2.0; it is a shared-memory model allowing concurrency among threads on both parallel multi-processor and multi-core machines. The three major components (error detector, candidate generator and error corrector) are designed to run in parallel. The error detector, based on unigrams, detects non-word errors; the candidate generator is based on bi-grams; the error corrector, which is context sensitive, is based on 5-gram information. [Bas12]
• Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and Gary G. L., 2012, presented a novel method for grammatical error correction by building a meta-classifier.
The meta-classifier decides the final output based on the internal results of several base classifiers; they used multiple corpora tagged for grammatical errors, with
different properties in various aspects. The method focused on articles, and correction arises only when a mismatch with the observed articles occurs. [Seo12]
• Kirthi J., Neeju N. J., and P. Nithiya, 2011, proposed a semantic information retrieval system performing automatic spelling correction on user queries before applying the retrieval process. The correcting procedure depends on matching the misspelled word against a dictionary of correctly spelled words using the Levenshtein algorithm. If an incorrect word is encountered, the system retrieves the most similar word depending on the Levenshtein measure and the occurrence frequency of the misspelled word. [Kir11]
• Farag, Ernesto, and Andreas, 2008, developed a language-independent spell checker. It is based on enhancing the N-gram model by creating a ranked list of correction candidates derived from N-gram statistics and lexical resources, then selecting the most promising candidates as correction suggestions. Their algorithm assigns weights to the possible suggestions to detect non-word errors. They relied on a "MultiWordNet" dictionary of about 80,000 entries. [Ahm09]
• Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of real-word spelling error correction. They assumed that the observed sentence is a signal passed through a noisy channel, where the channel reflects the typist and the distortion reflects errors committed by the typist. The probability of the sentence's correctness, given by the channel (typist), is a parameter associated with that sentence. The probability of every word in the sentence being the intended one is equivalent to the sentence correctness probability, and the word is associated with a set of spelling-variant words excluding the word itself. Correction can be applied to one word in the sentence by replacing the incorrect one with another
from the candidates set (its real-word spelling variations) so that it gives the maximum probability. [Amb08]
• Stoyan, Svetla, and others, 2005, described an approach for lexical post-correction of the output of an optical character recognizer (OCR), carried out within two research projects. They worked on multiple fronts: on the dictionary side, they enriched their large dictionaries with specialty dictionaries; for candidate selection, they used a very fast search algorithm that depends on Levenshtein automata for efficiently selecting the correction candidates within a distance bound not exceeding 3; and they ranked candidates depending on a number of features such as frequency and edit distance. [Mih04]
• Suzan V., 2002, described a context-sensitive spell checking algorithm based on the BESL spell checker lexicons and word trigrams, for detecting and correcting real-word errors using probability information. The algorithm splits the input text into trigrams, and every trigram is looked up in a precompiled database which contains a list of trigrams and their occurrence counts in the corpus used for compiling the database. The trigram is correct if it is in the trigram database; otherwise it is considered an erroneous trigram containing a real-word error. The correction algorithm uses the BESL spell checker to find candidates, but those most frequent in the trigram database are suggested to the user. [Ver02]
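The trigram lookup in the last approach can be sketched as follows; the corpus and the "unseen trigram" flagging rule here are illustrative assumptions, not the BESL data or thresholds.

```python
from collections import Counter

def build_trigram_db(corpus_sentences):
    """Count word trigrams from a (toy, illustrative) training corpus."""
    db = Counter()
    for sent in corpus_sentences:
        words = sent.lower().split()
        for i in range(len(words) - 2):
            db[tuple(words[i:i + 3])] += 1
    return db

def flag_real_word_errors(sentence, db):
    """A trigram unseen in the database is flagged as possibly
    containing a real-word error."""
    words = sentence.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)
            if tuple(words[i:i + 3]) not in db]

corpus = ["I ate a piece of cake", "a piece of advice"]
db = build_trigram_db(corpus)
print(flag_real_word_errors("I ate a peace of cake", db))
# → [('ate', 'a', 'peace'), ('a', 'peace', 'of'), ('peace', 'of', 'cake')]
```

Note that every trigram containing the real-word error "peace" is flagged, even though "peace" itself is a correctly spelled word, which is exactly what dictionary-only checkers miss.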
Table 1.1: Summary of Literature Review

No. | Reference | Methodology | Technique
1 | [Ach14] | Generating the top k candidates at the character level for both high- and low-frequency words | A model consisting of rules and weights, and an algorithm based on a conditional probability distribution
2 | [Fel14] | Grammatical error correction based on generating all possible correct alternatives for the sentence | Combining the results of multiple systems: a rule-based error corrector, an SMT incorrect-English-to-correct-English translator, and NLTK for segmentation, tokenization and tagging
3 | [Gup14] | Correction of non-native English speakers' grammatical errors | Error detection using the Stanford parser and TreeTagger; correction based on the Nodebox English Linguistics library
4 | [Raa13] | Dictionary-based spelling correction treating the misspelled word as a document | Converting the misspelled word into a tiny document of bits and retrieving the most similar documents using Hamming distance
5 | [Bas12] | Context-sensitive spell checking using a shared-memory model allowing concurrency among threads for parallel execution | Different N-gram levels for error detection, candidate generation, and candidate suggestion, depending on the Yahoo! N-Grams dataset 2.0
6 | [Seo12] | Meta-classifier for grammatical error correction, focused mainly on articles | Deciding the output depending on the internal results of several base classifiers
7 | [Kir11] | Automatic spelling correction for user queries before applying the retrieval process | Using the Levenshtein algorithm for both error detection and correction in a dictionary lookup technique
8 | [Ahm09] | Language-independent model for non-word error correction based on N-gram statistics and lexical resources | Ranking a list of correction candidates by assigning weights to the possible suggestions, depending on a "MultiWordNet" dictionary of about 80,000 entries
9 | [Amb08] | Noisy-channel model for real-word error correction based on probability | The channel represents the typist, distortion represents the error, and the noise probability is a parameter
10 | [Mih04] | OCR output post-correction | Levenshtein automata for candidate generation and frequency for ranking
11 | [Ver02] | Context-sensitive spell checking algorithm based on trigrams | Splitting texts into word trigrams and matching them against the precompiled BESL spell checker lexicons; suggestion depends on probability information

1.4 Research Objectives
This research attempts to design and implement a smart text document correction system for English texts. It is based on mining a typed text to detect spelling and grammar errors and giving the optimal suggestion(s) from a set of candidates. Its steps are:
1. Analyzing the given text using Natural Language Processing techniques, detecting the erroneous words at each step.
2. Looking up candidates for the erroneous words and ranking them according to a given set of features and conditions to form the initial solutions.
3. Optimizing the initial solutions depending on the information extracted from the given text and the detected errors.
4. Recovering the input text document with the optimal solutions and associating the best set of candidates with each detected incorrect word.

1.5 Thesis Outline
The next five chapters are:
1. Chapter Two, "Background and Related Concepts", consists of two parts. The first overviews NLP fundamentals, applications and techniques, whereas the second is about text correction techniques.
2. Chapter Three, "Dictionary Structure and Looking up Technique", describes the suggested approach for constructing the system's dictionary for both perfect-matching and similarity lookup.
3. Chapter Four, "Error Detection and Candidates Generation", presents the suggested technique for indicating incorrect words and the method of generating candidates.
4. Chapter Five, "Automatic Text Correction and Candidates Suggestion", describes the techniques of suggestion selection and optimization.
5. Chapter Six, "Experimental Results, Conclusion, and Future Works", shows the experimental results of applying the techniques described in chapters three, four and five, the conclusions of the system, and future directions.
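The four steps listed in Section 1.4 can be sketched as a pipeline skeleton. Every helper here is a hypothetical placeholder (difflib's similarity ratio stands in for the thesis's enhanced edit-distance measure, and a plain word set stands in for the hashed dictionary); this is an illustrative outline, not the implemented system.

```python
import difflib

def rank_candidates(word, dictionary, n=3):
    # Placeholder ranking: difflib's similarity ratio instead of the
    # thesis's enhanced Levenshtein measure (an assumption).
    return difflib.get_close_matches(word, dictionary, n=n, cutoff=0.6)

def correct_document(text, dictionary):
    """Skeleton of the four research-objective steps."""
    # 1. Analyze the text and detect erroneous words.
    tokens = text.split()
    errors = [t for t in tokens if t.lower() not in dictionary]
    # 2. Look up and rank candidates to form the initial solutions.
    candidates = {e: rank_candidates(e.lower(), dictionary) for e in errors}
    # 3. Optimize: here, trivially pick the top-ranked candidate.
    best = {e: c[0] for e, c in candidates.items() if c}
    # 4. Recover the document, keeping the candidate lists available.
    corrected = " ".join(best.get(t, t) for t in tokens)
    return corrected, candidates

corrected, cands = correct_document("I recieve my mail", {"i", "receive", "my", "mail"})
print(corrected)  # → I receive my mail
```

The real system replaces step 1 with dictionary hashing plus brute-force parsing, step 2 with the enhanced Levenshtein measure over spelling-based clusters, and step 3 with the NER, confusion and affix features described above.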
Chapter Two
Background and Related Concepts
Part I: Natural Language Processing

2.1 Introduction
Natural Language Processing (NLP) began in the late 1940s, focused on machine translation; in 1958, NLP was linked to information retrieval by the Washington International Conference of Scientific Information [Jon01]. Primary ideas for developing applications for detecting and correcting text errors started in that period. [Pet80] [Boo58] Natural Language Processing has attracted great interest from that time to the present day because it plays an important role in the interaction between humans and computers. It represents the intersection of linguistics and artificial intelligence [Nad11], where machines can be programmed to manipulate natural language.

2.2 Natural Language Processing Definition
"Natural Language Processing (NLP) is the computerized approach for analyzing text that is based on both a set of theories and a set of technologies." [Sag13] NLP describes the function of software or hardware components in a computer system that is capable of analyzing or synthesizing human languages (spoken or written) [Jac02] [Mis13] like English, Arabic, Chinese, etc.; not formal languages like Python, Java, C++, etc., nor
descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. "NLP is a tool that can reside inside almost any text processing software application." [Wol11] We can define NLP as a subfield of Artificial Intelligence encompassing anything needed by a computer to understand and generate natural language. It is based on processing human language for two tasks: the first receives a natural language input (text or speech), applies analysis, reasons about what was meant by that input, and outputs in computer language; this is the task of Natural Language Understanding (NLU). The second task is to generate human sentences according to specific considerations; the input is in computer language but the output is in human language; this is called Natural Language Generation (NLG). [Raj09] "Natural Language Understanding is associated with the more ambitious goals of having a computer system actually comprehend natural language as a human being might." [Jac02]

2.3 Natural Language Processing Applications
Despite its wide usage in computer systems, NLP almost entirely disappears into the background, where it is invisible to the user and adds significant business value. [Wol11] The major distinction of NLP applications from other data processing systems is that they use Language Knowledge. Natural Language Processing applications are mainly divided into two categories according to the given NL format [Mom12] [Wol11]:
2.3.1 Text Technologies
• Spell and Grammar Checking: systems that indicate lexical and grammar errors and suggest corrections.
• Text Categorization and Information Filtering: in such applications, NLP represents the documents linguistically and compares each one to the others. In text categorization, the documents are grouped according to the characteristics of their linguistic representation into several categories. Information filtering singles out, from a collection of documents, those satisfying some criterion.
• Information Retrieval: finds and collects information relevant to a given query. A user expresses the information need as a query, then the system attempts to match the given query to the database documents that satisfy it. Query and documents are transformed into a sort of linguistic structure, and the matching is performed accordingly.
• Summarization: according to an information need or a query from the user, this type of application finds the most relevant part of the document.
• Information Extraction: refers to the automatic extraction of structured information, such as entities, their relationships, and the attributes describing them, from unstructured sources. This can integrate structured and unstructured data sources, if both exist, and pose queries spanning the integrated information, giving better results than keyword searches alone.
• Question Answering: works with plain speech or text input and applies an information search based on the input; for example, IBM® Watson™, the reigning JEOPARDY! champion, which reads
questions and understands their intention, then looks up the knowledge library to find a match.
• Machine Translation: translates a given text from one natural language to another; some applications can recognize the language of the given text even if the user did not specify it correctly.
• Data Fusion: combining extracted information from several text files into a database or an ontology.
• Optical Character Recognition: digitizing handwritten and printed texts, i.e., converting characters from images to digital codes.
• Classification: this type of NLP application sorts and organizes information into relevant categories, like e-mail spam filters and the Google News™ news service.
NLP has also entered other applications such as educational essay test-scoring systems, voice-mail phone trees, and e-mail spam detection software.

2.3.2 Speech Technologies
• Speech Recognition: mostly used in telephone voice response systems serving clients. Its task is processing plain speech; it is also used to convert speech into text.
• Speech Synthesis: converting text into speech. This process requires working at the level of phones and converting alphabetic symbols into sound signals.
2.4 Natural Language Processing and Linguistics
Natural Language Processing is concerned with three dimensions: language, algorithm and problem, as presented in figure (2.1). On the language dimension, NLP considers linguistics; the algorithm dimension covers NLP techniques and tasks, while the problem dimension depicts the mechanisms applied to solve problems. [Bha12]

2.4.1 Linguistics
Natural language is a means of communication. It is a system of arbitrary signals such as voice sounds and written symbols. [Ali11] Linguistics is the scientific study of language; it starts from the simple acoustic signals which form sounds and ends with pragmatic understanding producing the full contextual meaning. There are two major levels of linguistic analysis, Speech Recognition (SR) and Natural Language Processing (NLP), as shown in figure (2.2).

Figure (2.1): NLP dimensions [Bha12]
2.4.1.1 Terms of Linguistic Analysis
A natural language, as a formal language does, has a set of basic components that may vary from one language to another but remain bounded under specific considerations, giving every language its special characteristics. From the computational view, a language is a set of strings generated over a finite alphabet and can be characterized by a grammar.

[Figure (2.2): Linguistics analysis steps [Cha10] — from acoustic signal through phones, letters and strings, morphemes, words, phrases and sentences, meaning out of context, and meaning in context; the corresponding analysis levels are phonetics, phonology, lexicon, morphology, syntax, semantics and pragmatics, with SR covering the early steps and NLP the rest.]

The definition
of the three abstract names depends on the language itself; i.e., strings, alphabet and grammar formulate and characterize a language.
• Strings: in natural language processing, the strings are the morphemes of the language, their combinations (words) and the combinations of their combinations (sentences), but linguistics goes somewhat deeper than this. It starts with phones, the primitive acoustic patterns, which are significant and distinguishable from one natural language to another. Phonology groups phones together to produce phonemes, represented by symbols. Morphemes consist of one or more symbols; thus, natural languages can be further distinguished.
• Alphabet: when individual symbols, usually thousands, represent words, the language is "logographic"; if the individual symbols represent syllables, it is "syllabic"; and when they represent sounds, the language is "alphabetic". Syllabic and alphabetic languages typically have fewer than 100 symbols, unlike logographic ones. English is an alphabetic language system consisting of 26 symbols; these symbols represent phones, which are combined into morphemes, which may or may not be combined further to form words.
• Grammar: a grammar is a set of rules specifying the legal structure of the language; it is a declarative representation of the language's syntactic facts. Usually, a grammar is represented by a set of production rules.
2.4.1.2 Linguistic Units Hierarchy
Language can be divided into pieces; there is a typical structure or form for every level of analysis. Those pieces can be put into a hierarchical structure, starting from a meaningful sentence at the top level and proceeding through the separation of building units until reaching the primary acoustic sounds. Figure (2.3) presents an example.

[Figure (2.3): Linguistic Units Hierarchy — the sentence "The teacher talked to the students" decomposed level by level: into phrases, words ("The teacher talked to the students"), morphemes ("The teach-er talk-ed to the student-s"), and phonemes. The phoneme symbols denote the English phone codes used by OXFORD dictionaries.]

2.4.1.3 Sentence Structure and Constituency
"It is constantly necessary to refer to units smaller than the sentence itself: units such as those which are commonly referred to as CLAUSE, PHRASE, WORD, and MORPHEME. The relation between one unit and another unit of which it is a part is CONSTITUENCY." [Qui85] The task of dividing a sentence into constituents is a complex task that
requires incorporating more than one analysis stage; tokenization, segmentation, parsing (and sometimes stemming) are usually merged together to build the parse tree for a given sentence.

2.4.1.4 Language and Grammar
A language is a 'set' of sentences and a sentence is a 'sequence' of 'symbols' [Gru08]; it can be generated given its context free grammar G = (V, ∑, S, P). [Cla10] Commonly, grammars are represented as a set of production rules which is taken by the parser and compared against the input sentences. Every matched rule adds something to the complete structure of the sentence, which is called a 'parse tree'. [Ric91] Context free grammar (CFG) is a popular method for specifying formal grammars. It is used extensively to define language syntax. The four components of the grammar are defined in CFG as [Sag13]:
• Terminals (∑): the basic elements which form the strings of the language.
• Nonterminals or Syntactic Variables (V): sets of strings that define the language generated by the grammar. Nonterminals are key to syntax analysis and translation, imposing a hierarchical structure on the language.
• Set of production rules (P): this set defines the way of combining terminals with nonterminals to produce strings. Each production rule consists of a variable on its left side, its head, and a sequence of terminals and nonterminals on its right side, its body.
• Start symbol (S).
The following example describes the structure of an English sentence:
V = {S, NP, N, VP, V, Art}
∑ = {boy, icecream, dog, bite, like, ate, the, a}
P = {S → NP VP, NP → N, NP → Art N, VP → V NP, N → boy | icecream | dog, V → ate | like | bite, Art → the | a}
The grammar specifies two things about the language: [Ric91]
• Its weak generative capacity: the limited set of sentences which can be completely matched by a series of grammar rules.
• Its strong generative capacity: the grammatical structure(s) of each sentence in the language.
Generally, there is an infinite number of sentences that can be structured with each grammar. The strength and importance of grammars lie in their ability to supply structure to an infinite number of sentences, because they succinctly summarize the structures of an infinite number of objects of a certain class. [Gru08] A grammar is said to be generative if it has a fixed-size set of production rules which, if followed, can generate every sentence in the language in a finite number of actions. [Gru08]
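The toy grammar above can be checked mechanically with a naive recursive-descent recognizer; this is an illustrative sketch, not the brute-force parser used in the thesis.

```python
# The example grammar, with terminals written as plain words.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"], ["N"]],
    "VP":  [["V", "NP"]],
    "N":   [["boy"], ["icecream"], ["dog"]],
    "V":   [["ate"], ["like"], ["bite"]],
    "Art": [["the"], ["a"]],
}

def parse(symbol, words, pos=0):
    """Return every word position reachable after deriving `symbol`
    starting at words[pos], using naive recursive descent."""
    if symbol not in GRAMMAR:  # terminal: must match the next word
        return [pos + 1] if pos < len(words) and words[pos] == symbol else []
    ends = []
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for part in production:  # derive each part in sequence
            positions = [e for p in positions for e in parse(part, words, p)]
        ends.extend(positions)
    return ends

def accepts(sentence):
    """A sentence is in the language if S can derive all of its words."""
    words = sentence.split()
    return len(words) in parse("S", words)

print(accepts("the dog ate the icecream"))  # True
print(accepts("dog the ate"))               # False
```

This illustrates weak generative capacity directly: the finite rule set decides membership for any of the infinitely many candidate word sequences, rejecting those no rule sequence can produce.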
2.5 Natural Language Processing Techniques
2.5.1 Morphological Analysis
Morphology is the study of how words are constructed from morphemes, which represent the minimal meaning-bearing primitive units of a language. [Raj09] [Jur00] There are two broad classes of morphemes: stems and affixes; the distinction between the two classes is language dependent, in that it varies from one language to another. The stem usually refers to the main part of the word, and affixes can be added to a word to give it additional meaning. [Jur00] Furthermore, affixes can be divided into four categories according to the position where they are added. Prefixes, suffixes, circumfixes and infixes generally refer to the different types of affixes, but a language need not have all the types. English accepts both prefixes, which precede stems, and suffixes, which follow stems, while there is no good example of a circumfix (preceding and following a stem) in English, and infixing (inserting inside the stem) is not allowed (unlike German and Philippine languages, respectively). [Jur00] Morphology is concerned with recognizing the modification of base words to form other words with different syntactic categories but similar meanings. Generally, three forms of word modification are found [Jur00]:
• Inflection: syntactic rules change the textual representation of words, such as adding the suffix 's' to convert nouns into plurals, or adding 'er' and 'est' to convert regular adjectives into comparative and
superlative forms, respectively. This type of modification usually results in a word from the same word class as the stem word.
• Derivation: new words are produced by adding morphemes; it is usually more complex and harder in meaning than inflectional morphology. It often occurs in a regular manner and results in words that differ in word class from the stem word, like adding the suffix 'ness' to 'happy' to produce 'happiness'.
• Compounding: this type modifies stem words with other stem words by grouping them, like grouping 'head' with 'ache' to produce 'headache'. In English, this type is infrequent.
Morphological processing, also known as stemming, depends heavily on the analyzed language. The output is the set of morphemes that are combined to form words. Morphemes can be stem words, affixes, and punctuation marks.

2.5.2 Part-Of-Speech Tagging
Part-of-Speech (POS) tagging is the process of assigning the proper lexical information or POS tag (also known as word classes, lexical tags, and morphological classes), encoded as a symbol, to every word (or token) in a sentence. [Sco99] [Has06b] In English, POS tags are classified into four basic classes of words: [Qui85]
1. Closed classes: include prepositions, pronouns, determiners, conjunctions, modal verbs and primary verbs.
2. Open classes: include nouns, adjectives, adverbs, and full verbs.
3. Numerals: include numbers and orders.
4. Interjections: include a small set of words like oh, ah, ugh, phew.
Usually, a POS tag indicates one or more of the previous pieces of information, and it sometimes holds other features like the tense of the verb or the number
(plural or singular). POS tagging may generate tagged corpora or serve as a preprocessing step for subsequent NLP processes. [Sco99] The performance of most tagging systems is typically limited because they only use the local lexical information available in the sentence, in contrast to syntax analysis systems, which exploit both lexical and structural information. [Sco99] More research has been done, and several models and methods have been proposed to enhance tagger performance; they fall mainly into supervised and unsupervised methods, where the main difference between the two categories is that the training corpora are pre-tagged in supervised methods, unlike unsupervised methods, which need advanced computational methods to obtain such corpora. [Has06a] [Has06b] Figure (2.4) presents the main categories and shows some examples. In both categories, the following are the most popular:

Figure (2.4): Classification of POS tagging models [Has06a]
• Statistical (stochastic, or probabilistic) methods: taggers using these methods are first trained on a correctly tagged set of sentences, which allows the tagger to disambiguate words by extracting implicit rules or picking the most probable tag based on the words surrounding the given word in the sentence. Examples of these methods are Maximum-Entropy models, Hidden Markov Models (HMM), and Memory-Based models.
• Rule-based methods: a sequence of rules, a set of hand-written rules, is applied to detect the best tag set for the sentence regardless of any probability maximization. The set of rules needs to be written properly and checked by human experts. Examples: path-voting constraint models and decision tree models.
• Transformational approaches: combine both statistical and rule-based methods to first find the most probable set of available tags and then apply a set of rules to select the best.
• Neural networks: with a linear separator or a full neural network, these have been used for tagging.
The methods described above, as in any other research area, have their advantages and disadvantages, but there is a major difficulty facing all of them: the tagging of unknown words (words never seen before in the training corpora). While rule-based approaches depend on a special set of rules to handle such situations, stochastic and neural-net methods lack this feature and use other means such as suffix analysis and n-grams by applying morphological analysis; some methods use a default set of tags to disambiguate unknown words. [Has06a]
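The supervised statistical idea can be illustrated with the simplest possible tagger, a most-frequent-tag baseline with a default tag for unknown words; the training corpus and tag names below are toy assumptions.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Learn the most frequent tag for each word from a
    pre-tagged corpus (the supervised setting)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos_tag in sentence:
            counts[word.lower()][pos_tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="NN"):
    # Unknown words get a default tag, one common fallback strategy.
    return [(w, model.get(w.lower(), default)) for w in words]

corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
model = train_baseline_tagger(corpus)
print(tag(["The", "dog", "sleeps", "loudly"], model))
# → [('The', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('loudly', 'NN')]
```

The mistagged "loudly" shows exactly the unknown-word problem discussed above: a purely lexical model has no information about unseen words, which is why real taggers add suffix analysis, n-gram context, or hand-written rules.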
  • 41. 2.5.3 Syntactic Analysis "Syntax is the study of the relationships between linguistic forms, how they are arranged in sequence, and which sequences are well-formed". [Yul00] Syntactic analysis, also referred to as "parsing", is the process of converting a sentence from its flat format, a sequence of words, into a structure that defines its units and the relations between these units. [Raj09] Hence, the goal of this technique is to transform natural language into an internal system representation, which may take the form of dependency graphs, frames, trees, or other structural representations. Syntactic parsing attempts only to convert sentences into either dependency links representing the utterance's syntactic structure or a tree structure; the output of this process is called a "parse tree" or simply a "parse". [Dzi04] The parse tree of a sentence holds its meaning at the level of the smallest parts ("words" in the terms of language scientists, "tokens" in the terms of computer scientists). [Gru08] Syntactic analysis makes use of both the results of morphological analysis and Part-Of-Speech tagging to build the structural description of the sentence by applying the grammar rules of the language under consideration; if a sentence violates the rules, it is rejected and marked as incorrect. [Raj09] The two main components of every syntax analyzer are:  Grammar: the grammar provides the analyzer with the set of production rules that lead it to construct the structure of sentences and specifies the correctness of every given sentence.
  • 42. Good grammars make a careful distinction between the sentence/word level, which they often call syntax or syntaxis, and the word/letter level, which they call morphology. [Gru08]  Parser: the parser reconstructs the production tree (or trees) by applying the grammar to indicate how the given sentence (if correctly constructed) was produced from that grammar. Parsing is the process of structuring a linear representation in accordance with a given grammar. Today, most parsers combine context free grammars with probability models to determine the most likely syntactic structure out of the many that are accepted as parse trees for an utterance. [Dzi04] 2.5.4 Semantic Analysis "Semantics is the study of the relationships between linguistic forms and entities in the world; that is, how words literally connect to things." [Yul00] This technique and those following it are fundamental to language understanding. Semantic analysis is the process of assigning meanings to the syntactic structures of sentences regardless of their context. [Yul00] [Raj09] 2.5.5 Discourse Integration Discourse analysis is concerned with studying the effect of sentences on each other. It shows how a given sentence is affected by the one preceding it and how it affects the sentence following it. Discourse Integration is relevant to understanding texts and paragraphs rather than simple sentences, so discourse knowledge is important in the interpretation
  • 43. of anaphoric and temporal aspects (like pronouns) in the conveyed information. [Ric91] [Raj09] 2.5.6 Pragmatic Analysis This step interprets the structure that represents what is said to determine what was actually meant. Context is a fundamental resource for processing here. [Ric91] 2.6 Natural Language Processing Challenges The challenges of natural language processing are too numerous to be summarized in a limited list; every processing step, from the starting point to the outputting of results, involves a set of problems that natural language processors vary in their ability to handle. However, the application where NLP is used is usually concerned with a specific task rather than considering all processing steps in all their details; this is an advantage for the NLP community, helping to outline the challenges and problems according to the task under consideration. For our research area, we are precisely concerned with the set of problems that directly affect the task of text correction; the next subsections describe some of them: 2.6.1 Linguistic Units Challenges: The task of text correction starts from the level of characters up to paragraphs and full texts, and at every level there is a set of difficulties that the handling analyzer faces: 2.6.1.1 Tokenization In this process, the lexical analyzer, usually called the "Tokenizer", divides the text into smaller units; the output of this step is a series of
  • 44. morphemes, words, expressions and punctuation marks (called tokens). It involves locating token boundaries (where one token ends and another begins). Issues that arise in tokenization and should be addressed are [Nad11]:  Dependency on language type: a language includes, in addition to its symbols, a set of orthographic conventions which are used in writing to indicate the boundaries of linguistic units. English employs whitespace to separate words, but this isn't sufficient to tokenize a text in a complete and unambiguous manner because the same character may serve different uses (as is the case with punctuation), there are words with multiple parts (such as words divided by a hyphen at the end of a line, and some cases of prefix attachment), and many expressions consist of more than one word.  Encoding problems: syllabic and alphabetic writing systems are usually encoded using a single byte, but languages with larger character sets require two or more bytes. A problem arises when the same set of encodings represents different character sets, whereas tokenizers are targeted at a specific encoding for a specific language.  Other problems, such as dependency on the application requirements, which indicate what constituent is defined as a token; in computational linguistics the definition should precisely indicate what the next processing step requires. The tokenizer should also be able to recognize irregularities in texts, such as misspellings, erratic spacing, punctuation, etc. 2.6.1.2 Segmentation Segmenting text means dividing it into small meaningful pieces typically referred to as "sentences"; a sentence consists of one or more tokens
  • 45. and carries a meaning which may not be completely clear on its own. This task requires full knowledge of the scope of punctuation marks, since they are the major factor in denoting the starts and ends of sentences. Segmentation becomes more complicated as the uses of punctuation multiply. Some punctuation marks can be part of a token rather than a stopping mark, as is the case with periods (.) used in abbreviations. However, there is a set of factors that can help make the segmentation process more accurate [Nad11]:  Case distinction: English sentences normally start with a capital letter (but proper nouns also do).  POS tag: the tags surrounding a punctuation mark can assist this process, but multi-tag situations complicate it, such as the use of -ing verbs as nouns.  The length of the word (in the case of abbreviation disambiguation; notice that a period may mark the end of a sentence and an abbreviation at the same time).  Morphological information: this task requires finding the stem word by suffix removal. It is preferable not to separate the tokenization and segmentation processes; they are usually merged to solve most of the above problems, specifically segmentation problems. A sentence is described as an indeterminate unit because of the difficulty of deciding where one ends and another starts, while a grammar is indeterminate from the standpoint of deciding 'which sentence is grammatically correct?', because this question permits divided answers, and discourse segmentation difficulty is not the only reason but
  • 46. also grammatical acceptability, meaning, style goodness or badness, lexical acceptability, context acceptability, etc. [Qui85] 2.6.2 Ambiguity An input is ambiguous if there is more than one alternative linguistic structure for it. [Jur00] There are two major types of sentence ambiguity: genuine ambiguity and computer ambiguity. In the first, the sentence really has two different meanings to an intelligent hearer; in the second, the sentence has one meaning, but for the computer it has more than one, and this type, unlike the first, is a real problem facing NLP applications. [Not] Ambiguity as an NLP problem is found in every processing step [Not] [Bha12]: 2.6.2.1 Lexical Ambiguity Lexical ambiguity is the possibility for a word to have more than one meaning or more than one POS tag. Obviously, meaning ambiguity leads to semantic ambiguity, and tag ambiguity to syntactic ambiguity, because the latter can produce more than one parse tree. Frequency is an available solution for this problem. 2.6.2.2 Syntactic Ambiguity The sentence has more than one syntactic structure; in particular, common English ambiguity sources are:  Phrase attachment: how a certain phrase or clause in the sentence can be attached to another when there is more than one possibility. Crossing is not allowed in parse trees; therefore, a parser generates a parse tree for each accepted state.
  • 47.  Conjunction: sometimes the parser cannot decide which phrase a conjunction should be connected to.  Noun group structure: the rule NG → NG NG allows English to generate long series of nouns strung together. Some of these problems can be resolved by applying syntactic constraints. 2.6.2.3 Semantic Ambiguity Even when a sentence is unambiguous lexically and syntactically, there is sometimes more than one interpretation for it. This is because a phrase or a word may refer to more than one meaning. "Selection restrictions" or "semantic constraints" are a way to disambiguate such sentences: two concepts are combined in one model only if both concepts, or one of them, have specific features. Frequency in context can also help in deciding the meaning of a word. 2.6.2.4 Anaphoric Ambiguity This is the possibility for a word or a phrase to refer to something previously mentioned when there is more than one possible referent. This type can be resolved by parallel structures or recency rules. 2.6.3 Language Change "All living languages change with time; it is fortunate that they do so rather slowly compared to the human life span". Language change is represented by the change of the grammars of the people who speak the language, and it has been shown that English has changed in the lexical, phonological,
  • 48. morphological, syntactic, and semantic components of the grammar over the past 1,500 years. [Fro07] 2.6.3.1 Phonological Change Regular sound correspondences show how the phonological system changes. The phonological system is governed, like any other linguistic system, by a set of rules, and this set of phonemes and phonological rules is subject to change by modification, deletion and addition of new rules. A change in phonological rules can affect the lexicon in that some English word formations depend on sounds; for example, vowel sounds differentiate the nouns house and bath from the verbs house and bathe. 2.6.3.2 Morphological Change Morphological rules, like the phonological ones, are subject to addition, loss and change. Mostly, the usage of suffixes is the active area of change, where the way they are added to the ends of stems affects the resulting words and therefore changes the lexicon. 2.6.3.3 Syntactic Change Syntactic changes are influenced by morphological changes, which in turn are influenced by phonological changes. This type of change includes all types of grammar modifications that are mainly based on the reordering of words inside the sentence. 2.6.3.4 Lexical Change Change of lexical categories is the most common in this type of change. An example of this situation is the usage of nouns as verbs, verbs as nouns, and adjectives as nouns. Lexical change also includes the
  • 49. addition of new words, the borrowing of loan words from other languages, and the loss of existing words. Figure (2.5) : An example of lexical change 1 2.6.3.5 Semantic Change As the category of a word can change, its semantic representation or meaning can change, too. Three types of change are possible for a word:  Broadening: the meaning of a word is expanded to mean everything it has been used for and more.  Narrowing: the reverse of broadening; the word's meaning is reduced from a more general meaning to a specific one.  Shifting: the word's reference is shifted to another meaning somewhat different from the original one. _________________________________________________________ 1 Darby Conley / Get Fuzzy © UFS, Inc. 24 Feb. 2012
  • 50. Part II Text Correction 2.7 Introduction Text correction is the process of indicating incorrect words in an input text, finding candidates (or alternatives), and suggesting the candidates as corrections for the incorrect word. The term incorrect refers to two different types of erroneous words: misspelled and misused. Mainly, the process is divided into two distinct phases: an error detection phase, which indicates the incorrect words, and an error correction phase, which combines both generating and suggesting candidates. Devising techniques and algorithms for correcting texts automatically is a fundamental open research challenge, begun in the early 1960s and continuing today, because existing correction techniques are limited in their accuracy and application scope [Kuk92]. Usually, a correction application targets a specific type of errors because it is a complex task to computationally predict an intended word written by a human. 2.8 Text Errors A word can be mistaken in two ways. The first is by incorrectly spelling a word, due to a lack of information about its spelling or by mistyping symbol(s) within the word; this type of error is known as a non-word error, where the word cannot be found in the language lexicon. The second is by using a correctly spelled word in a wrong position in the sentence or in an unsuitable context. These errors are known as real-word errors
  • 51. where the incorrect word is accepted by the language lexicon. [Gol96] [Amb08] Non-word errors are easier to detect, unlike real-word errors; the latter need more information about the language's syntactic and semantic nature. Accordingly, correction techniques are divided into isolated-word error detection, which is concerned with non-word errors, and context-sensitive error correction, which deals with real-word errors. [Gol96] 2.8.1 Non-word errors These errors include words that are not found in the lexicon; a misspelled word contains one or more of the following errors:  Substitution: one or more symbols are changed.  Deletion: one or more symbols are missing from the intended word.  Insertion: symbol(s) are added at the front, the end, or any index in the word.  Transposition: two adjacent symbols are swapped. These four errors are known as the Damerau edit operations. 2.8.2 Real-word errors These errors occur through mistaking an intended word for another one that is lexically accepted. Real-word errors can result from phonetic confusion, like using the word "piece" instead of "peace", which usually leads to semantically unaccepted sentences even after applying non-word correction, or from misspelling the intended word and producing another lexically accepted word. [Amb08] Sometimes the confusion results in syntactically unaccepted sentences, like writing the sentence "John visit his uncle" instead of "John visits his uncle".
  • 52. Correcting real-word errors is context sensitive in that it needs to check the surrounding words and sentences before suggesting candidates. 2.9 Error Detection Techniques Indicating whether a word is correct or not depends on the type of correction procedure: non-word error detection usually checks the acceptance of a word in the language dictionary (the lexicon) and marks any unmatched word as incorrect, while real-word error detection is a more complex task requiring the analysis of larger parts of the text, typically paragraphs and the full text [Kuk92]. In this work, we mainly focus on non-word error detection techniques. Boswell defined a spelling error as E in a given query word Q that is not an entry in the dictionary at hand, D. [Bos05] He outlined an algorithm for spelling correction as shown in figure (2.6). Spell error detection techniques can be classified into two major types: 2.9.1 Dictionary Looking Up All the words of a given text are matched against every word in a pre-created dictionary or a list of all acceptable words in the language under consideration (or most of them, since some languages have a huge number of words and collecting them all is a nearly impossible task). A word is incorrect if and only if no match is found. This technique is robust but suffers from the long time required for checking; as the dictionary size becomes larger, looking-up time becomes longer. [Kuk92] [Mis13] 2.9.1.1 Dictionaries Resources There are many systems that deal with collecting and updating languages' lexical dictionaries. An example of these systems is the WordNet online application; it is a large database of English lexicons. Lexicons (nouns,
  • 53. verbs, adjectives, articles, etc.) are interlinked by lexical and conceptual-semantic relations. The structure of WordNet is a network of words and concepts that are related meaningfully, and this structure has made it a good tool for NLP and Computational Linguistics. Another example is the ISPELL text corrector, an online spell checker providing interfaces for many western languages. ISPELL is the latest version of R. Gorin's spell checker, which was developed for Unix. Its spelling correction suggestions are based on a single Levenshtein edit distance, and it depends on looking up every token in the input text against a huge lexical dictionary. [ISP14] 2.9.1.2 Dictionaries Structures The standard looking-up technique is to match every token in the text against every token in the dictionary, but this process requires a long time because NL dictionaries are usually huge and string matching needs more time than other data types do. A solution to this challenge is to reduce the search space in a way that keeps similar tokens grouped together. Figure (2.6) : Outline of the Spell Correction Algorithm [Bos05] Algorithm: Spell_correction Input: word w Output: suggestion(s) a set of alternatives for w Begin If (is_mistake(w)) Begin Candidates=get_candidates(w) Suggestions=filter_candidates(candidates) Return suggestions End Else Return is_correct End.
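The outline in Figure (2.6) can be sketched in runnable form. The tiny lexicon and the candidate generator/filter below are illustrative stand-ins, not the actual system's components (the real generator and filter are developed in later sections):

```python
# Illustrative lexicon; a real system would load hundreds of thousands
# of entries from WordNet/ISPELL-style word lists.
LEXICON = {"peace", "piece", "place", "plane", "trace"}

def is_mistake(word):
    """Dictionary looking-up: a word is a mistake iff it has no match."""
    return word.lower() not in LEXICON

def get_candidates(word):
    # Stand-in generator: every lexicon entry of similar length.
    return [t for t in LEXICON if abs(len(t) - len(word)) <= 1]

def filter_candidates(word, candidates, limit=3):
    # Stand-in filter: prefer candidates sharing the first letter.
    ranked = sorted(candidates, key=lambda t: (t[0] != word[0], t))
    return ranked[:limit]

def spell_correction(word):
    """Figure (2.6): detect, generate, filter, suggest."""
    if not is_mistake(word):
        return "is_correct"
    return filter_candidates(word, get_candidates(word))
```

The design point the figure makes is the strict separation of detection (`is_mistake`) from generation and filtering, which the later sections refine independently.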
  • 54. Grouping according to spelling or phonetics [Mis13] and using hash tables are two fundamental ways to minimize the search space. Hashing techniques apply a hash function to generate a numeric key from strings. The numeric keys are references to buckets of tokens that generate the same key indices; hash functions differ in their ability to distribute tokens and in how much they minimize the search space. A perfect hash function generates no collisions (hashing two different tokens to the same key index), and a uniform hash function distributes tokens among buckets uniformly. The optimal hash function is a uniform perfect hash function, which hashes one token to every bucket; such a situation is impossible with dictionaries due to the variance of tokens. [Nie09] Spelling- and phonetics-dependent groups use a limited set of buckets and generate keys according to spelling or pronunciation; they are another style of hashing, and sometimes of clustering. SPEEDCOP and Soundex are examples. [Mis13] [Kuk92] 2.9.2 N-gram Analysis N-grams are defined as n-symbol subsequences of words or strings, where n is variable, often taking the value one to produce unigrams (or monograms), two to produce bigrams (sometimes called "digrams"), or three to produce trigrams; it rarely takes larger values. This technique detects errors by examining each n-gram in a given string and looking it up in a precompiled n-gram statistics table. The decision depends on the existence of such an n-gram or on the frequency of its occurrence; if the n-gram is not found, or is highly infrequent, then the words or strings which contain it are considered incorrect. [Kuk92] [Mis13]
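The n-gram detection idea can be sketched as follows, using character bigrams over a toy lexicon; the `#` boundary-padding convention is an assumption of this sketch, not prescribed by the text:

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Character n-grams of a word, padded with '#' to mark boundaries."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_ngram_table(lexicon, n=2):
    """Precompile the n-gram statistics table from a word list."""
    table = Counter()
    for word in lexicon:
        table.update(char_ngrams(word, n))
    return table

def looks_misspelled(word, table, n=2):
    """Flag the word if any of its n-grams never occurs in the table."""
    return any(table[g] == 0 for g in char_ngrams(word, n))

table = build_ngram_table(["the", "then", "hen", "ten"])
```

A production system would also apply the frequency threshold the text mentions (flagging highly infrequent n-grams), rather than only zero-occurrence ones.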
  • 55. 2.10 Error Correction Techniques Many techniques have been proposed to solve the problem of generating candidates for a detected misspelled word; they vary in their required resources, application scope, time and space complexity, and accuracy. The most common are [Kuk92] [Mis13]: 2.10.1 Minimum Edit Distance Techniques This technique is based on counting the minimum number of primal operations required to convert the source string into the target one. Some researchers take the primal operations to be insertion, deletion, and substitution of one letter by another; others add the transposition of two adjacent letters as a fourth primal operation. Examples include the Levenshtein algorithm, which counts one distance unit for every primal operation; the Hamming algorithm, which works like Levenshtein but is limited to strings of equal lengths; and the Longest Common Substring, which finds the mutual substring between two words. Levenshtein, shown in figure (2.7) [Hal11], is preferred because it has no limitation on the types of symbols or on their lengths. It can be executed in a time complexity of O(M.N), where M and N are the lengths of the two input strings. The algorithm can detect three types of errors (substitution, deletion, and insertion). It does not count the transposition of two adjacent symbols as one edit operation; instead, it counts such errors as two consecutive substitution operations, giving an edit distance of 2.
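The Levenshtein recurrence of Figure (2.7) can be written as a runnable sketch (Python is used here only for illustration):

```python
def levenshtein(s, t):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    m, n = len(s), len(t)
    # dist[i][j] = distance between the prefixes s[:i] and t[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # deleting all of s[:i]
    for j in range(n + 1):
        dist[0][j] = j          # inserting all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]
```

Note how a single adjacent transposition ("abcd" vs "abdc") costs 2 here, exactly the weakness described above.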
  • 56. One of the well-known modifications of the original Levenshtein method is due to Fred Damerau, whose research found that about 80% to 90% of errors are caused by the four error types altogether; the resulting measure is known as the Damerau-Levenshtein Distance. [Dam64] The modified method requires a longer execution time than the original; in every checking round, the method applies an additional comparison to check whether a transposition took place in the string, then applies another comparison to select the minimum between the previous distance and the distance with the occurrence of a transposition operation. This step Figure (2.7) : Levenshtein Edit Distance Algorithm [Hal11] 1. Algorithm: Levenshtein Edit Distance 2. Input: String1, String2 3. Output: Edit Operations Number 4. Step1: Declaration 5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0, cost=0 6. Step2: Calculate Distance 7. if String1 is NULL return Length of String2 8. if String2 is NULL return Length of String1 9. for each symbol x in String1 do 10. for each symbol y in String2 do 11. begin 12. if x = y 13. cost = 0 14. else 15. cost = 1 16. r=index of x, c=index of y 17. min1 = (distance(r - 1, c) + 1) // deletion 18. min2 = (distance(r, c - 1) + 1) //insertion 19. min3 = (distance(r - 1,c - 1) + cost) //substitution 20. distance( r , c )=minimum(min1 ,min2 ,min3) 21. end 22. Step3: return the value of the last cell in the distance matrix 23. return distance(Length of String1,Length of String2) 24. End.
  • 57. roughly doubles the running time, although the asymptotic complexity remains O(M.N). Hence, in this work, the original Levenshtein method (figure (2.7)) is modified to consider Damerau's four error types within a time complexity shorter than that consumed by the Damerau-Levenshtein algorithm and close to that of the original method. Figure (2.8) shows Damerau's modification of the Levenshtein method. 1. Algorithm: Damerau-Levenshtein Distance 2. Input: String1, String2 3. Output: Damerau Edit Operations Number 4. Step1: Declaration 5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0, cost=0 6. Step2: Calculate Distance 7. if String1 is NULL return Length of String2 8. if String2 is NULL return Length of String1 9. for each symbol x in String1 do 10. for each symbol y in String2 do 11. begin 12. if x = y 13. cost = 0 14. else 15. cost = 1 16. r=index of x, c=index of y 17. min1 = (distance(r - 1, c) + 1) // deletion 18. min2 = (distance(r, c - 1) + 1) //insertion 19. min3 = (distance(r - 1,c - 1) + cost) //substitution 20. distance( r , c )=minimum(min1 ,min2 ,min3) 21. if not(String1 starts with x) and not (String2 starts with y) then 22. if (the symbol preceding x= y) and (the symbol preceding y=x) then 23. distance(r,c)=minimum(distance(r,c), distance(r-2,c-2)+cost) 24. end 25. Step3: return the value of the last cell in the distance matrix 26. return distance(Length of String1,Length of String2) 27. End. Figure (2.8) : Damerau-Levenshtein Edit Distance Algorithm [Dam64]
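One common way to realize the recurrence of Figure (2.8) in runnable form is the optimal-string-alignment variant below, which charges a single edit for an adjacent transposition; this is a generic sketch, not the thesis's own enhanced method:

```python
def osa_distance(s, t):
    """Levenshtein distance extended so that an adjacent transposition
    counts as one edit (the extra check of Figure (2.8), lines 21-23)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition: current symbols are each other's neighbours.
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]
```

Compared with plain Levenshtein, "abcd" vs "abdc" now costs 1 instead of 2, while all other cases are unchanged.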
  • 58. 2.10.2 Similarity Key Techniques As its name suggests, this technique finds a unique key to group similarly spelled words together. The similarity key is computed for the misspelled word and mapped to a pointer referring to the group of words that are similar in spelling to the input one. The Soundex algorithm derives keys from the pronunciation of words, while the SPEEDCOP system rearranges the letters of a word by placing its first letter, followed by its consonants, and finally its vowels, according to their order of occurrence in the word and without duplication. [Kuk92] [Mis13] 2.10.3 Rule Based Techniques This approach applies a set of rules to the misspelled word, based on common mistake patterns, to transform the word into a valid one. After applying all the applicable rules, the set of generated words that are valid in the dictionary is suggested as candidates. 2.10.4 Probabilistic Techniques Two methods are mainly based on statistics and probability: 1) Transition Method: depends on the probability of a given letter being followed by another one. The probability is estimated from n-gram statistics over a large corpus. 2) Confusion Method: depends on the probability of a given letter being confused with or mistaken for another one. Probabilities in this method are source dependent; for example, Optical Character Recognition (OCR) systems vary in their accuracy and in the basis on which they recognize letters, and Speech Recognition (SR) systems usually confuse sounds.
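As a concrete similarity-key scheme, a simplified Soundex can be sketched as below; it follows the commonly described digit mapping and the usual h/w rule, and ignores edge cases such as empty or non-alphabetic input:

```python
# Standard Soundex digit classes for consonants; vowels, y, h, w map to nothing.
SOUNDEX_CODES = {**dict.fromkeys("bfpv", "1"),
                 **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    """Simplified Soundex key: first letter plus up to three digit codes."""
    word = word.lower()
    key = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:   # skip repeats of the same code
            key += code
        if ch not in "hw":          # h and w do not break a run of codes
            prev = code
    return (key + "000")[:4]        # pad/truncate to a 4-character key
```

Words that sound alike hash to the same bucket, e.g. `soundex("Robert")` and `soundex("Rupert")` both yield `"R163"`, which is exactly the grouping property the similarity-key lookup exploits.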
  • 59. 2.11 Suggestion of Corrections Suggesting corrections may be merged with candidate generation; it is fully dependent on the output of the generation phase. The user is usually provided with a set of corrections, and then he/she can make a choice among them, keep the written word unchanged, add the token to the dictionary, or rewrite the word in cases where the desired word is not within the corrections list. Suggestions are listed in non-increasing order according to their similarity and suitability for replacing the source word. Similarity depends on the method of computing the distance or similarity between every candidate and the source token, while suitability depends on the surrounding words within the sentence boundary or the paragraph (in context sensitive correction, the full text may be examined before making a suggestion). 2.12 The Suggested Approach The primary goal of this work is to find the nearest alternative word among all the available candidates in the underlying dictionary; when a non-word is encountered, there are many candidates available to replace it, but here is the trick: which one of those alternatives was intended by the writer? The suggested work answers this question as follows: Any of the dictionary tokens, whose count may reach some hundreds of thousands, could be the one intended by the writer, or none of them might be. The writer (or typist) might really have misspelled the word, or he/she may have written it perfectly, but the word is simply not found in the dictionary, i.e. never seen before, and is thus an "unknown" token. The problem of deciding whether a word is misspelled or unknown is impossible to solve. For this reason, the suggested system assumes every
  • 60. unrecognized word is misspelled and may let the user make the final decision. As an initial solution, all the tokens in the dictionary are candidates, and in further processing the number of candidates must be minimized. 2.12.1 Find Candidates Using Minimum Edit Distance The starting step is to look for the most similar tokens in the lexicon dictionary and rank them according to their minimum edit distance from the misspelled word. This reduces the number of candidates to an acceptable amount, depending on a threshold for the number of edit operations needed to equate a candidate with the misspelled word, or on a maximum limit for the number of candidates. The suggested system uses the Levenshtein method after enhancing it to consider the four Damerau edit operations. To find the similar tokens, the lexicon must be searched and every token in it examined against the given word. This process consumes time because of the huge number of tokens held by the lexicon dictionary and the time required by the examining algorithm itself to find the minimum edit distance. Hence, the search space needs to shrink; a method is proposed to group similar tokens into semi-clusters using spelling properties. 2.12.2 Candidates Mining The best set of candidates goes through another processing step to specify how the generated candidates are related to the misspelled token and, accordingly, how they should be ranked. The process is implemented using a vector of the following features:  Named-entity recognition: many issues are considered.  Transposition probability: keyboard proximity and physical similarity.
 Confusion probability: because phonetic errors are common, this analysis helps determine whether a word was misspelled by replacing letter(s) with others of the same sound.
 Matching of the starting and ending letters.
 Effect of candidate length.
A weighting scheme gives each feature a role in deciding the best set of suggestions; the similarity score carries the largest weight among them.

2.12.3 Part-of-Speech Tagging and Parsing
Finally, the suitable candidate is chosen by the parser, which selects the candidate(s) that make the sentence containing the misspelled word correct. Tagging plays an important role in specifying the optimal candidate, because filtering by POS tag is the basis on which the parser selects a candidate for its incomplete sentence. The selected tag affects not only the candidate but every token in the sentence; this is the nature of English (and of most natural languages). At this step the candidate set should contain the minimum number of elements, but the best ones. Grammar checking, accomplished by parsing, is another goal of this system: the system phrases each sentence and checks each phrase's consistency against English grammar rules. When an incorrect structure is encountered, the system tries to correct it. Parsing is fundamental to choosing the correct candidate, since the basic goal is to produce a correct sentence. The underlying dictionary is an integration of the WordNet dictionary with the ISPELL dictionary.
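The enhanced Levenshtein distance of Section 2.12.1, extended with Damerau's adjacent transposition, can be sketched as below. This is a minimal illustration of the technique, not the thesis's exact implementation:

```python
def damerau_levenshtein(s, t):
    """Edit distance counting Damerau's four operations:
    insertion, deletion, substitution, and adjacent transposition."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

With this extension the common typo "teh" is one operation away from "the" (a single transposition), whereas plain Levenshtein would charge two substitutions.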
Figure 2.9 shows the block diagram of the suggested work; later chapters give more details for each block.

Figure (2.9): The suggested system block diagram(1) [Block diagram: preprocessing of the WordNet lexical dictionary (morphological analysis and POS tag expansion) and the ISPELL datasets; dictionaries integration; hashing and indexing into an integrated hashed indexed dictionary; POS tagging of the token stream into sentences with tagged tokens; candidates generation and ranking; phrasing and phrase-level suggestions; grammar correction; sentence recovery and suggestions listing.]

(1) The diagram in Figure 2.9 is detailed further through the next three chapters.
Chapter Three: Hashed Dictionary and Looking Up Technique

3.1 Introduction
The dictionary is a basic unit of almost every NLP application. It holds the lexicon of the language under processing, together with related information that depends on the application's purpose, such as POS tags, semantic information, phonetics, and pronunciation. Typically, dictionaries are data structures holding a list or collection of tokens (or words), each associated with the information that makes its use by an NLP application possible. The number of tokens held by a dictionary is critical in NLP applications, especially taggers and text correction systems: a small dictionary lowers the detected-error ratio, since a poor dictionary allows erroneous words to pass undetected, while a large dictionary raises this ratio but requires longer look-up time. A balance is therefore needed that keeps the dictionary as inclusive as possible while keeping look-up fast. Many approaches have been proposed to handle this problem, among them indexing and hash functions.

3.2 Hashing
The ideal property of any dictionary is random access, but the high variability of strings makes this impossible in general, at least under memory constraints.
Hashing is the process of converting a string S into an integer within [0, M-1], where M is the number of available addresses in a predefined table. Hash functions promise random access, but not on their own: the variety of language tokens would require an infinite hash table to hold every token "separately" and a variable-size addressing buffer, which exceeds most current systems and wastes storage heavily. By "separately" we mean that no two strings share the same hash value, i.e. no collisions; as the number of collisions grows, look-up inside buckets takes longer. However, a hash function can serve as a partial solution combined with other approaches: while it maps tokens, according to some of their features, into packets of manageable size, techniques such as indexing and advanced search can raise look-up speed to a reasonable level.

3.2.1 Hash Function
The hash function in this work exploits the spelling of tokens as the addressing key: it converts token prefixes into packet addresses. The English alphabet considered in this work contains the uppercase letters 'A' to 'Z', the lowercase letters 'a' to 'z', and the digits 0 to 9, in addition to some special-purpose characters that cannot be excluded from the dictionary because they are parts of tokens, such as slash (/), period (.), apostrophe ('), underscore (_), whitespace, and hyphen (-). The resulting character set contains about 67 characters, which can be reduced further by replacing the codes of the digits 1 to 9 with
the code of 0, because distinguishing between digits is unimportant in this application for two reasons:
 Differences between numbers are not a problem for the correction process, since no system can ever estimate which number the writer intended; any written number is therefore accepted as-is.
 If numbers were distinguished, every possible number would have to be covered in the dictionary, yielding an infinite dictionary size, because numbers are infinite.
The final alphabet is the union of the sets mentioned above with the reduced number set: ∑ = {A, B, …, Z, a, b, …, z, 0, /, . , ' , - , _ , whitespace}, which can be re-encoded using only 6 bits, as shown in Table 3.1 (unused codes are marked *).
Hashing by prefix is a good way to keep packets small. It resembles the SOUNDEX and SPEEDCOP methods [Mis13][Kuk92] in sharing the same goal, minimizing the search space, but differs in that it maps tokens to predefined packet addresses using a limited-length prefix, whereas those methods use the whole string and filter letters by sound or spelling. This difference gives the suggested approach two attractive features:
1. The hash function is simple and applies directly, without any preprocessing: SOUNDEX must encode letters into their phonetic groups, and SPEEDCOP rearranges letters.
Table 3.1: Alphabet Encoding (each symbol in 6 bits)
  A – Z          codes 0 – 25  (A=0, B=1, …, Z=25)
  a – z          codes 26 – 51 (a=26, b=27, …, z=51)
  '              code 52
  /              code 53
  -              code 54
  _              code 55
  .              code 56
  0 (all digits) code 57
  whitespace     code 58
  * (unused)     codes 59 – 63
2. Random access is established by using the hash function's output directly as an address, while both previous methods must search for a match between the computed value and the stored codes.

3.2.2 Formulation
As mentioned above, the alphabet is reduced to only 59 symbols, which can be encoded in only 6 bits instead of the standard 8, making a family of hash functions available over prefixes of 1, 2, or any longer sequence of symbols. The prefix length is a trade-off: if it is too small, the number of packets is also small, so each holds a large number of tokens and look-up takes longer; long prefixes create a large number of packets, some of which are usually sparse because of the variance and irregularity of tokens that characterize natural languages.
The function uses a three-character prefix C1C2C3, converts it to integers as presented in Table (3.1), and computes the hash value H by concatenating the three 6-bit codes, according to Equation (3.1):

    H(C1,C2,C3) = code(C1)·2^12 + code(C2)·2^6 + code(C3)        (3.1)

H represents the address of the packet where tokens starting with the same prefix are held. The number of available packet addresses equals the range obtained by concatenating the three symbols' binary codes, as shown in Table (3.2), where the symbol at index 0 is 'A' and the symbol at index 63 (the last available index in the alphabet) is the unused cell marked '*'.
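Equation (3.1) amounts to concatenating three 6-bit symbol codes into one 18-bit address. A minimal sketch, with the encoding following Table 3.1 (the helper names are ours, not the thesis's):

```python
def symbol_code(c):
    """6-bit code per Table 3.1: A-Z -> 0..25, a-z -> 26..51,
    specials -> 52..56 and 58, every digit collapsed to 57 (the code of '0')."""
    if 'A' <= c <= 'Z':
        return ord(c) - ord('A')
    if 'a' <= c <= 'z':
        return ord(c) - ord('a') + 26
    specials = {"'": 52, '/': 53, '-': 54, '_': 55, '.': 56, ' ': 58}
    if c in specials:
        return specials[c]
    if c.isdigit():
        return 57
    raise ValueError("symbol outside the dictionary alphabet: %r" % c)

def hash_prefix(token):
    # Equation (3.1): H = code(C1)*2**12 + code(C2)*2**6 + code(C3)
    c1, c2, c3 = token[0], token[1], token[2]
    return (symbol_code(c1) << 12) | (symbol_code(c2) << 6) | symbol_code(c3)
```

For example hash_prefix("AAA") is 0, and every address fits below 2^18 = 262144, matching the range of Table (3.2). Note that prefixes differing only in digits collapse to the same address, as intended.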
Start Address = (C1)2 || (C2)2 || (C3)2 = (000000 000000 000000)2 = (0)10
End Address   = (C1)2 || (C2)2 || (C3)2 = (111111 111111 111111)2 = (262143)10

This makes the total number of packets 2^18 = 262144. Some of these packets are empty because their addresses match no actual token prefix in the lexicon, but the distribution of tokens among packets reduces the search space to a manageable size, especially when the hash function is combined with an indexing scheme that builds the dictionary as a two-level structure.

Table 3.2: Addressing Range
        Start (alphabetic / decimal / binary)    End (alphabetic / decimal / binary)
  C1    A / 0 / 000000                           * / 63 / 111111
  C2    A / 0 / 000000                           * / 63 / 111111
  C3    A / 0 / 000000                           * / 63 / 111111

3.2.3 Indexing
Key-indexing is an in-memory look-up technique based strictly on direct addressing into an array, with no comparisons between keys. Its area of applicability is limited to numeric keys falling in a range bounded by the available memory resources. Hashing lets direct addressing work on keys of any type and range, bringing serial search and collision-resolution policies into the equation.
Indexing is exploited to create a reference table holding the 2^18 packet-head addresses, which can be addressed directly by the hash function. Every record in the reference table contains two fields: a "base" field, which holds an address if its index matches a token prefix and -1 otherwise, and a "limit" field, which holds the length of the primary packet related to its index. Looking up the packet containing tokens that start with a specific prefix is shown in Figure (3.1).
The packets referenced by the table are treated as primary packets, holding tokens whose 3-symbol prefixes are identical. To reduce the search space further, sub-packets can be created for every primary packet. This second level of token distribution is also prefix-based, but with longer sequences: instead of using only three symbols to group tokens with identical prefixes, prefix equality is extended to 6 symbols by subdividing the tokens inside primary packets into secondary packets, each consisting of a head and a set of tokens identical to the head in their first 6 symbols.

Figure (3.1): Token Hashing Algorithm
Algorithm: Token Hashing
Input: English token (finite string over ∑), reference and hash tables.
Output: packet head address where the input token may reside.
Step 1: set variables C1, C2, and C3 to the input token prefix.
Step 2: compute Index from C1, C2, and C3:
        Index = code(C1)·2^12 + code(C2)·2^6 + code(C3)
Step 3: go to the reference table record at Index.
Step 4: examine the Base field:
        if Base > -1 return Base value, else return fail.
End.
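The reference-table probe of Figure (3.1) can be sketched as follows; the table layout here (a mapping from hash index to a (base, limit) pair, with base = -1 marking an empty record) is our assumption for illustration:

```python
def find_primary_packet(index, reference_table):
    """Return (head address, length) of the primary packet at a hash index,
    or None when no token in the lexicon starts with that prefix."""
    base, limit = reference_table.get(index, (-1, 0))
    if base > -1:
        return base, limit
    return None  # empty record: missing prefix detected after one comparison
```

A single numeric comparison suffices to reject a prefix that is absent from the lexicon, which is the fast-failure property claimed in Section 3.4.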
The structure of the dictionary can be clarified by hashing the exemplar token ABCDEFGH according to the approach described previously.

Figure (3.2): Dictionary Structure and Indexing Scheme [C1=A, C2=B, C3=C; Reference Index = H(C1,C2,C3); the reference record at that index gives the head address X and length Y; the primary packet, with head code "ABC", holds tokens ABCS0$, ABCS1$, …, ABCSY-1$; for Si = "DEF", the secondary packet holds ABCDEFT0, ABCDEFT1, …, ABCDEFTR-1.](1)

(1) The dollar sign ($) refers to any sequence that may follow Si.
An interesting characteristic of secondary packets is that no space is wasted, because they are not based on a predefined packet structure. The secondary head, which is itself a token within a primary packet, may be followed by tokens sharing its 6-symbol prefix, collected in one variable-size secondary packet; if no such tokens follow, no secondary packet is needed.

3.3 Looking Up Procedure
As shown in Figure (3.2), the search for a target token starts once the primary packet head address is obtained from the reference table, which in turn is computed by the hash function. In the hash table, where tokens are stored by index, the search begins with a random access to the primary packet head, and matching then proceeds sequentially. Matching considers the fourth through sixth symbols of every token in that primary packet; this reduces comparison time, since matching whole sequences takes longer. Even though the saving per comparison is small, it is useful here, because logical operations on strings are costlier than on other data types. When a full prefix match is found, the target token is compared completely with the token at that record: if they match, the goal is reached; otherwise, searching continues in the secondary packet related to that token (if one exists). Comparison inside secondary packets, unlike primary packets, uses the full token length, and failure there implies there is no chance of finding the target in the dictionary. The algorithm in Figure (3.3) outlines the look-up procedure after the primary head address is obtained.
  • 73. Chapter Three  Dictionary Structure and Looking up Technique ________________________________________________________________________  57  Figure (3.3) : Algorithm of Looking Up Procedure Algorithm: Looking up a target token Input: Target Token, Primary Packet Head address, Primary Packet Size. Output: tag of input target token. Step1: Set primary packet information X=head address, Y=packet size. Step2: Examine X: if X<0 then return fail for primary_index=X to X+Y do if prefix(token at Primary_index in Hash Table)=prefix(target) begin if Current token = target return primary_index X2=Secondary packet head address Y2=Secondary Packet Length exit for end Step3: Examine X2 if X2<=0 return fail // no related secondary packet for secondary_index=X2 to X2+Y2 do if token at secondary_index in hash table=target return secondary_index Step4: if no match was found at step3 return fail End.
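The two-level search of Figure (3.3) can be sketched over plain Python lists; the packet representation used here (a primary packet as a list of (token, secondary_packet) pairs) is our simplification of the indexed hash table:

```python
def look_up(target, primary_packet):
    """Two-level look-up: compare 6-symbol prefixes in the primary packet,
    then full tokens in the matching secondary packet.

    primary_packet: list of (token, secondary_packet) pairs whose tokens all
    share the target's 3-symbol prefix; each secondary_packet lists tokens
    sharing the full 6-symbol prefix of their head token."""
    for token, secondary in primary_packet:
        # only symbols 4..6 are compared: the 3-symbol prefix already matches
        if token[3:6] == target[3:6]:
            if token == target:
                return token
            for candidate in secondary:   # full-length comparison here
                if candidate == target:
                    return candidate
    return None  # failure: the target is not in the dictionary
```

For a toy packet [("abandon", ["abandoned", "abandons"]), ("abacus", [])], looking up "abandons" matches the head "abandon" on symbols 4-6 and then finds the target in its secondary packet.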
3.4 Dictionary Structure Properties
The proposed dictionary can be used in any application that depends on string look-up. It provides fast, directed search for perfect matching.
 The reference table, although some addresses are wasted because of string variance, suits natural-language dictionaries, which are usually of huge size. Tokens are handled in a separate table constructed on top of the reference table.
 String comparison consumes longer time than other types do; in this approach, comparison is reduced to subsequences of both the target and the stored tokens.
 The look-up procedure quickly discovers whether a target token exists, in several situations:
  o At the hashing step, an empty record implies a missing token after consuming only one numeric comparison.
  o In a primary packet, failure requires comparing at most the three symbols from the fourth through the sixth of the 6-symbol prefixes of the tokens within the packet.
  o In a secondary packet, failure requires comparing the tokens within that packet. The worst case is failing to find the target at the end of the secondary packet related to the last token of the primary packet, which consumes (length of primary packet + length of secondary packet) comparisons.
 Since look-up is string-dependent, there is high flexibility in associating information with tokens without any overloading of the search
process. As a result, it can be used to construct lexical and semantic dictionaries.

3.5 Similarity Based Looking-Up
The structure described in Section (3.2) is suitable for perfect look-up, but the purpose of this work is to design a text correction system, where some errors arise from unknown or misspelled words. Such situations require looking up the dictionary to generate candidates that are similar (not identical) to the given misspelled token. The main purpose of any similarity-based grouping approach is to reduce the search space to a manageable size and so shorten look-up time, while not losing good candidates or similar objects (tokens). Clustering techniques are examples of such approaches, but even fuzzy clustering does not solve this problem completely, because:
 Token clustering must consider the order in which symbols are arranged in the token, in addition to the symbols themselves.
 Although many similarity measures exist for grouping tokens, no obvious separation measure can be used to separate string clusters.
 In fuzzy clustering, the decision threshold is a bottleneck: a high threshold value loses good candidates, while a low threshold heightens redundancy by grouping less similar tokens into the cluster, resulting in longer searching time and inaccurate candidates.
 As the number of fuzzy centroids to which a token belongs becomes larger, computing the nearest set of centroids also increases search complexity.
For these reasons, an approach is proposed that keeps the same hash table as the dictionary structure while improving the look-up technique; the algorithm is presented in Figure (3.5). The improvement extends the search to include similarly spelled tokens, depending on the same basis as the standard search described previously. The outlines of the proposed approach are:
 Bi-gram generation
 Primary centroid selection (at most 3 symbols long)
 Connecting centroids to the reference table.
These three steps are presented in Figure (3.4).

3.5.1 Bi-Gram Generation
The reference table is the building block of the bi-gram generation process; it specifies the range of hashing addresses and the number of symbols taken from token prefixes to compute hash values. The hash-indexing method used here is limited to 3 symbols only; therefore, bi-gram generation involves three subdivisions producing two-symbol pairs (bi-grams): (C1,C2), (C1,C3), and (C2,C3). Division into three bi-grams simplifies predicting the four Damerau error types (insertion, deletion, substitution, and transposition) by applying the template C1C2C3 with only two symbols at a time, producing the results shown in Table (3.3).
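The three subdivisions yield nine wildcard templates over the prefix; a small sketch (the function name is ours):

```python
def bigram_variants(prefix):
    """The nine '?'-templates built from the pairs (C1,C2), (C2,C3), (C1,C3)
    of a 3-symbol prefix; '?' marks the free position."""
    c1, c2, c3 = prefix[0], prefix[1], prefix[2]
    return [c1 + c2 + '?', c1 + '?' + c2, '?' + c1 + c2,
            c2 + c3 + '?', c2 + '?' + c3, '?' + c2 + c3,
            c1 + c3 + '?', c1 + '?' + c3, '?' + c1 + c3]
```

For the prefix "Che" this yields "Ch?", "C?h", "?Ch", "he?", "h?e", "?he", "Ce?", "C?e", "?Ce", matching the nine sequences enumerated in Section 3.5.2.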
The variety of natural-language tokens cannot satisfy all nine distributions of the template sequences described above for every index in the reference table; therefore, a preprocessing step collects the satisfied prefixes by checking each generated template against the dictionary and rejecting the missing sequences.

Figure (3.4): Semi Hash Clustering block diagram [Bi-gram generation: select a reference index, recover (C1,C2,C3) = H-1(Index), and generate the bi-gram variants C1C2?, C1?C2, ?C1C2, C2C3?, C2?C3, ?C2C3, C1C3?, C1?C3, ?C1C3. Centroid selection: per bi-gram variant, select a set of 3-symbol centroids and remove redundancy. Centroid referencing: connect the (bi-grams, centroid sets) pairs and associate the bi-grams with the index.]
Table (3.3): Predicting errors using bi-gram analysis
  Sequence   Substitution   Insertion   Deletion      Transposition
  C1C2?      √              √           ×             ×
  C1?C2      ×              √           ×             if ? = C3
  ?C1C2      ×              √           ×             ×
  C2C3?      ×              ×           √             ×
  C2?C3      ×              √           if ? <> C1    if ? = C1
  ?C2C3      √              ×           ×             ×
  C1C3?      ×              ×           if ? <> C2    if ? = C2
  C1?C3      √              ×           ×             ×
  ?C1C3      ×              √           if ? <> C2    if ? = C2

3.5.2 Primary Centroids Selection
For every accepted sequence, a set of centroids is selected as a subset of the union of primary centroids of at most three symbols in length. A centroid related to a specific sequence is an assignment of an alphabet symbol to the '?' sign in that sequence. For example, at index 9882: H-1(9882) = "Che", so C1='C', C2='h', C3='e'. The nine sequences and their related primary centroids, after pruning mismatched sequences, are:
1. Ch?: ChB, ChE, Cha, Che, Chi, Chk, Chl, Chn, Cho, Chr, Cht, Chu, Chw, Chy, Ch', Ch˽, Ch
2. C?h: Cah, Coh, C˽h
3. ?Ch: BCh, DCh
4. he?: hea, heb, hec, hed, hee, hef, heg, heh, hei, hej, hek, hel, hem, hen, heo, hep, her, hes, het, heu, hev, hew, hex, hey, he', he-, he
5. h?e: hae, hee, hie, hoe, hue, hye
6. ?he: Ahe, Che, Ghe, Jhe, Khe, Lhe, Phe, Rhe, She, The, Whe, ahe, bhe, che, dhe, ghe, khe, phe, rhe, she, the, whe
7. Ce?: Cea, Ceb, Cec, Ced, Cee, Cei, Cel, Cen, Cep, Cer, Ces, Cet, Ceu, Cey
8. C?e: Cae, Cce, Cde, Cee, Che, Cie, Cle, Coe, Cre, Cse, Cte, Cue, Cve, Cze
9. ?Ce: BCe, vCe

3.5.3 Centroids Referencing
The final step joins every sequence to its centroid set and every index to its bi-gram sequences. This process includes creating a list of all primary centroids in the dictionary, representing all 3-symbol prefixes of primary packet heads. Bi-grams are stored in a separate list associated with the address of the related primary centroid set, and the reference table keeps track of the bi-gram addresses for each of its indexes. As a result, bi-grams and their associated centroid sets can be randomly accessed through the reference table.
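The pruning used in the example above (assign each alphabet symbol to '?' and reject prefixes absent from the lexicon) can be sketched as follows; the toy prefix set is ours:

```python
def expand_template(template, lexicon_prefixes, alphabet):
    """Expand a '?'-template into its primary centroids: substitute every
    alphabet symbol and keep only prefixes that exist in the lexicon."""
    centroids = set()
    for symbol in alphabet:
        candidate = template.replace('?', symbol)
        if candidate in lexicon_prefixes:   # reject unsatisfied prefixes
            centroids.add(candidate)
    return sorted(centroids)
```

With the toy set {"Cha", "Che", "Cho", "The", "she"}, expanding "Ch?" over the lowercase letters keeps exactly the prefixes that head real primary packets.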
3.6 Application of the Similarity Based Looking Up Approach
The purpose of similarity-based look-up is to minimize the search space while maximizing the chance of finding tokens similar to the source token.

Figure (3.5): Similarity Based Hashing algorithm
Algorithm: Similarity Based Hashing
Input: Hashed Dictionary
Output: Similarity Based Hashed Dictionary
For each reference index apply the following steps:
Step 1: Bi-gram generation
  1) CxCyCz = H-1(Index)
  2) generate sequence variants
  3) filter sequences
Step 2: Primary centroid selection
  for each generated sequence do
    1) for every alphabet symbol do
       1.1) assign it to the sequence's missing symbol
       1.2) reject if no prefix match is found
    2) remove duplicated centroids
Step 3: Centroid referencing
  1) connect bi-grams to centroids
  2) connect the index to its bi-grams
End.
The hashed dictionary structure shown in Section (3.2) was built to achieve perfect matching on token prefixes; if the source token is not found, similar tokens must be looked up. Because look-up in the hashed dictionary is based on token prefixes, similarity-based look-up accounts for all the mistakes that can occur within the 3-symbol prefix of every token by exploiting the bi-grams associated with the computed hash value. Every bi-gram is linked to a list of primary centroids, which in turn are matched against the source token's 3-symbol prefix and filtered by similarity. Centroids with the highest similarity are selected, while lower-similarity centroids are rejected to shorten the searching time. The next step expands the prefix length in the similarity calculation to 6-symbol prefixes, because the selected primary centroids refer to primary packets in which every token differs from the others in its 6-symbol prefix. This step directs the search to be more precise by selecting, from the primary packet, the tokens nearest to the source token. Finally, each selected primary-packet token may have a secondary packet whose tokens share its 6-symbol prefix; searching it maximizes the chance of encountering tokens similar to the source token (a secondary packet usually contains a small number of tokens). An interesting property of this approach is the ability to use thresholds at every level of the look-up procedure: a different threshold can be used in primary centroid selection, in secondary packet head selection, and in candidate selection. The value of the threshold is
application dependent and fundamentally restricted by the similarity calculation method.

Figure (3.6): Block diagram of candidates generation using SBL [(C1,C2,C3) = source 3-symbol prefix; Index = H(C1,C2,C3); examine the 2-gram patterns P1…P9; collect the primary centroids; filter the collected centroids (selecting the highest-similarity centroids); select and filter the secondary centroids; generate the candidates.]
3.7 The Similarity Based Looking Up Properties
The proposed approach has several features that make it suitable for various string-based search applications:
1. Clustering illusion: the dictionary structure and its look-up technique divide the search space into three levels:
   a. Primary centroid clusters: only the 3-symbol prefixes are checked, and the best are selected as centroids for the next level.
   b. Primary packet clusters: every token here is referenced by a primary centroid and may itself reference a secondary packet (i.e., act as a secondary centroid).
   c. Secondary packet clusters: every token is referenced by a secondary centroid.
2. Time complexity minimization: merging the hash function with indexing simplifies searching and provides random access at more than one level.
3. Application flexibility: thresholds can be used at every clustering level as separators to exclude uninteresting centroids or candidates. Choosing the threshold value is left to the developer, the similarity calculation method used, and the application area.
The algorithm in Figure (3.7) outlines the complete process.
Figure (3.7): Similarity Based Looking up algorithm
Algorithm: Similarity Based Looking up
Input:* Hashed Dictionary; Source_Token; similarity thresholds T1, T2, T3
Output: Candidates Set
Step 1: Hash index calculation
  C1,C2,C3 = 3-symbol prefix of the source token
  Index = H(C1,C2,C3) = code(C1)·2^12 + code(C2)·2^6 + code(C3)
Step 2: Primary centroid selection
  for each bi-gram at Index do
    for each related primary centroid do
      if similarity(C1C2C3, Primary Centroid) >= T1 then select the centroid
Step 3: Secondary centroid selection
  for each selected primary centroid do
    for each related secondary centroid do
      if similarity(6-symbol source prefix, Secondary Centroid) >= T2 then select the centroid
Step 4: Candidate selection
  for each selected secondary centroid do
    for each token in the related secondary packet do
      if similarity(Source Token, Token) >= T3 then select it as a candidate
End.

* If no threshold is indicated, the approach generates candidates according to maximum similarity.
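The final threshold filter (Step 4 of Figure (3.7)) can be sketched with difflib's ratio standing in for the similarity measure, which the text leaves open; the packet contents below are toy data:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in similarity in [0, 1]; the thesis does not fix the measure here
    return SequenceMatcher(None, a, b).ratio()

def select_candidates(source, secondary_packets, t3=0.7):
    """Keep packet tokens whose similarity to the source clears threshold T3,
    ranked best-first, as in Step 4 of Figure (3.7)."""
    candidates = [tok for packet in secondary_packets for tok in packet
                  if similarity(source, tok) >= t3]
    return sorted(candidates, key=lambda tok: similarity(source, tok),
                  reverse=True)
```

For the misspelling "recieve", packets containing "receive", "recipe", and "rocket" yield "receive" as the top-ranked candidate, while "rocket" falls below the threshold.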
Chapter Four: Error Detection and Candidates Generation

4.1 Introduction
Error detection is the process of indicating incorrect words in the text. The term "incorrect" may refer to a misspelled word, a misused word, or both; misused words are correctly spelled but used in a way that violates the syntax or the meaning of the sentence. The detection of misspelled words (non-word errors) is a straightforward process: it involves looking up every token in a pre-prepared list or dictionary (also referred to as a "lexicon") containing all the well-spelled words of the language. The size of the lexicon affects the look-up process, because larger sizes require longer time. Detecting misused words (real-word errors), on the other hand, is a more complex task: it requires analyzing the syntax of the sentence to check the correctness of its constituency, and, when the sentence is incorrect, indicating the word(s) that violated it. Errors resulting in meaningless sentences entail further processing, which may extend beyond sentence boundaries and needs more information about the sentence tokens.

4.2 Non-word Error Detection
Detecting misspelled tokens in this system is based on the dictionary look-up technique and is performed within the tagging stage. Tokens of a given text must be tagged: a tag should be found for every token in the considered language; therefore, tokens are collected and stored with their tags in a lexicon. The tagging stage is a fundamental process
in most natural language processing systems; tagging must precede syntax analysis, since no parsing can be done without associating a tag with each token in the sentence.

Figure (4.1): Tagging Flow Chart [Start: read the text and convert it into a token stream. For each token: look it up in the hashed dictionary; if found, save the (token, tag) pair; otherwise generate candidates and save the (token, {candidates, tags} list) pair. When the last token is processed, pass the new tagged stream to the segmentation step. End.]
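The flow chart above amounts to one lexicon probe per token, with misses flagged as misspellings. A minimal sketch, with a toy dict standing in for the hashed dictionary (names and tags are ours):

```python
def tag_and_detect(tokens, lexicon):
    """Tag each token from the lexicon (token -> POS tag); tokens missing
    from the lexicon are flagged as misspelled, as in Figure (4.1)."""
    tagged, misspelled = [], []
    for token in tokens:
        tag = lexicon.get(token)
        if tag is None:
            misspelled.append(token)   # candidate generation happens here
        tagged.append((token, tag))
    return tagged, misspelled
```

Running it over ["the", "cta", "sat"] with a three-entry lexicon flags "cta" for candidate generation while the known tokens keep their tags.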
Because tagging requires looking up every token in the given text, it serves a second task at the same time: tokens missing from the lexicon are flagged as misspelled. The looking-up procedure discussed in Chapter Three is used for discovering non-word errors; the structure of the dictionary is built from about 300,000 tokens collected from two raw datasets. The two major resources of the lexicon are WordNet and ISPELL; WordNet represented the basic resource and was integrated with the ISPELL dataset to make the lexicon more inclusive. The lexicon was hashed and indexed in order to achieve random access. The looking-up time is very short compared to typical structures, and the tagger is capable of deciding whether a token is found or not found in the lexicon even after consuming only one operation (for further details see sections 3.3 and 3.4).

4.3 Real-word Error Detection

Deciding whether a word is misused is more complex than detecting misspelled words; the process needs more computations and more resources. Syntax analysis can be exploited to recognize misused words, since every English sentence (as in most natural languages) is constrained by a syntactic rule or grammar. Any sentence that violates the syntax constraints and cannot be parsed using the finite set of production rules is marked as incorrect. Next, the sentence should be processed to indicate the erroneous word that made it incorrect. Phrasing is a good way to precisely indicate the incorrect word by converting the sentence into constituents. The constituency hierarchy starts from the sentence as the head of the tree, which contains
one or more clauses; each clause contains one or more phrases, and each phrase contains one or more words. The division into phrases is useful in reducing the parse tree: as the number of tokens becomes larger, the available parses for the same sentence increase. The suggested approach is rule based; any sentence that cannot be parsed correctly is marked as incorrect. The syntax analyzer is based on phrasing, applying a brute-force approach to identify the misused word in the phrase. The syntax analyzer fully depends on the output of the tagger; however, misspelled words should first be replaced with suggestions, in order to allow the analyzer to proceed with the sentence and select the best alternative that makes the sentence acceptable (Chapter Five details the idea).

4.4 Candidates Generation

Candidates are the tokens with high similarity to the incorrect word. The meanings of "similarity" and "incorrect" are relative. In the case of non-word errors, the incorrect word is a misspelled word, and the similarity is a measure of how much another token is spelled or pronounced in a way similar to it. In the case of real-word errors, the candidate token is the one most likely intended by the writer but confused with the incorrect one; a spelling or phonetic mistake sometimes results in another correct word.

4.4.1 Candidates Generation for Non-word Errors

In this step, the system takes the incorrect token (a token out of the dictionary) and looks for similar tokens in the underlying dictionary.
Since every token in the dictionary may be intended by the writer, the process is somewhat complex, and several issues should be considered to decide which tokens are suitable to be generated as candidates. A major problem is the distinction between unknown and mistaken words; therefore, this research considers every unknown word a mistaken one and leaves the final decision to the user. Candidates (or alternatives) are generated depending on the mistaken word, and the total process is performed in the following way.

At first sight, any of the dictionary tokens, whose count may reach some hundreds of thousands, could be the intended word, or none of them could be. The writer (or typist) might really have misspelled the word, or might have written it perfectly while the word is simply not found in the dictionary, i.e. never seen before, making it an "unknown" token. The number of generated candidates is not limited; further processing reduces the list of candidates to the best set according to the similarity amount and some other criteria that depend fully on the spelling of the encountered misspelled token.

In the tagging stage, a token not found in the lexicon is considered misspelled. The starting step is to search for the most similar tokens in the lexicon and rank them according to their similarity to the misspelled token; the similarity is based on the minimum edit distance measure. This action reduces the number of candidates to an acceptable amount, using either a threshold on the number of edit operations needed to make a candidate and the misspelled word equal, or a maximum limit on the number of candidates.
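The thresholded generation step described above can be sketched as follows; this is a hedged, simplified sketch assuming a flat toy lexicon and an illustrative distance limit of 2, not the thesis's actual thresholds or clustered dictionary:

```python
# Sketch: generate candidates whose edit distance to the misspelled
# token is within a threshold, most similar first.
def edit_distance(s, t):
    """Standard Levenshtein distance via a rolling-array DP."""
    m, n = len(s), len(t)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i          # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (s[i - 1] != t[j - 1]))  # substitution
            prev = cur
    return d[n]

def generate_candidates(token, lexicon, max_dist=2):
    """Return lexicon words within max_dist edits, ranked by distance."""
    scored = [(edit_distance(token, w), w) for w in lexicon]
    return [w for dist, w in sorted(scored) if dist <= max_dist]

print(generate_candidates("cta", ["cat", "car", "dog", "cart"]))
# → ['car', 'cat']
```

In the actual system the search space is first shrunk by the similarity-based clustering of Chapter Three rather than scanning the whole lexicon.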
4.4.1.1 Enhanced Levenshtein Method

The modification of the Levenshtein method extends the standard matching step at line 12 in figure (2.7) to check for a transposition case. The idea rises from the fact that no transposition can occur without a matching success between at least two symbols in the examined strings; more precisely, the transposition can be discovered using a minimum number of operations by considering two facts:
- Two adjacent symbols can never be mirrored by two adjacent symbols in another string unless the first symbol of the first pair matches the second symbol of the second pair.
- Instead of handling the transposition occurrence separately, the algorithm can modify the under-processing cell in the distance matrix directly, and the next matching steps will do the rest of the work.

The first fact serves to avoid trying all possibilities, as was done in Damerau's modification at lines 20 and 21 in figure (2.8), where each symbol is matched against every symbol in the second string, regardless of whether a transposition is possible, by adding matching statements to the original one at line 12 in figure (2.7). The second fact concerns the processing order: the distance matrix is filled sequentially, row by row, from the top-left corner to the bottom-right corner (where the total distance is held). Using one step to process both cases (transposition and no transposition) is a good way to minimize the number of operations required to accurately compute the distance.

In this modification, the distance matrix is updated directly in one step, and the next steps (selecting the minimum and filling the underhand
cell) continue normally as in the original algorithm; this removes the step at line 22 of Damerau's algorithm (figure 2.8), which needs more than one operation to complete. Modifying the Levenshtein method reduces the time and enhances the candidates generation process because the modification exploits the first fact to avoid checking cases that lead to a failure situation, unlike the Damerau-Levenshtein modification, which makes no distinction between the two situations; this is presented in lines 15 and 16. The directly updated distance matrix (line 17) in the enhanced algorithm adjusts the distance accurately without any additional processing; it is simply an assignment.

The time complexity is related to the distance between the input strings. As the strings become more different, the steps at lines 15, 16 and 17 in the enhanced algorithm (figure 4.2) are rarely executed, saving time; this property is preferred when the algorithm is used for generating candidates. Candidates should be as similar as possible to the source token (usually a mistaken word), and the conditional nature of the additional steps (lines 15, 16 and 17) makes the consumed time useful (not wasted): those steps are executed only when there is a match with the source token, and they are executed more often as the source word matches the target word more closely, which means the target is a good candidate.
The algorithm in figure (4.2) shows the enhancement of the original Levenshtein method, and the rest of this section describes the differences between the three methods (original Levenshtein, Damerau-Levenshtein, and the enhanced Levenshtein method) through manipulating two example strings, "Transposed" and "Tarnspaesd":

Figure (4.2): The Enhanced Levenshtein Method Algorithm
1. Algorithm: Enhanced Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5. distance(Length of String1, Length of String2)=0, min1=0, min2=0, min3=0, cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. begin
14. cost = 0
15. if x is not the start symbol of String1 then
16. if (the symbol preceding x = the symbol following y) and (x is not duplicated) then
17. decrease distance(index(x)-1, index(y)) by 1 // transposed
18. end
19. else cost = 1
20. r = index of x, c = index of y
21. min1 = (distance(r - 1, c) + 1) // deletion
22. min2 = (distance(r, c - 1) + 1) // insertion
23. min3 = (distance(r - 1, c - 1) + cost) // substitution
24. distance(r, c) = minimum(min1, min2, min3)
25. end
26. Step3: return the value of the last cell in the distance matrix
27. return distance(Length of String1, Length of String2)
28. End.
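The behaviour the enhanced algorithm targets, counting a mirrored adjacent pair as a single operation, corresponds to the restricted Damerau-Levenshtein (optimal string alignment) distance. The following is an illustrative reimplementation, not the thesis code, reproducing the distances of the worked example (5 for plain Levenshtein, 3 with transpositions, for "Transposed" vs. "Tarnspaesd"):

```python
def levenshtein(s, t):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def osa_distance(s, t):
    """Optimal string alignment: Levenshtein plus adjacent
    transpositions, each counted as one operation."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # mirrored adjacent symbols: charge a single transposition
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(levenshtein("Transposed", "Tarnspaesd"))   # → 5
print(osa_distance("Transposed", "Tarnspaesd"))  # → 3
```

The enhanced method of figure (4.2) reaches the same distances while avoiding the transposition check whenever the current symbols do not match.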
1) Levenshtein
The minimum edit distance = 5:
1. substitute 'r' by 'a'
2. substitute 'a' by 'r'
3. substitute 'o' by 'a'
4. substitute 'e' by 's'
5. substitute 's' by 'e'
Computation complexity: M*N comparisons = 100; (cost, min1, min2, min3) assignments: 4*100 = 400; 100 minimum function calls.

Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
  T r a n s p o s e d
0 1 2 3 4 5 6 7 8 9 10
T 1 0 1 2 3 4 5 6 7 8 9
a 2 1 1 1 2 3 4 5 6 7 8
r 3 2 1 2 2 3 4 5 6 7 8
n 4 3 2 2 2 3 4 5 6 7 8
s 5 4 3 3 3 2 3 4 4 5 6
p 6 5 4 4 4 3 2 3 4 5 6
a 7 6 5 4 5 4 3 3 4 5 6
e 8 7 6 5 5 5 4 4 4 4 5
s 9 8 7 6 6 5 5 5 4 5 5
d 10 9 8 7 7 6 6 6 5 5 5
Figure (4.3): Original Levenshtein Example

2) Damerau-Levenshtein
Minimum edit distance = 3:
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following operations are executed: 100 comparisons (line 21), 81 comparisons (line 22), 2 calls of the minimum function (line 23).

  T r a n s p o s e d
0 1 2 3 4 5 6 7 8 9 10
T 1 0 1 2 3 4 5 6 7 8 9
a 2 1 1 1 2 3 4 5 6 7 8
r 3 2 1 1 2 3 4 5 6 7 8
n 4 3 2 2 1 2 3 4 5 6 7
s 5 4 3 3 2 1 2 3 3 4 5
p 6 5 4 4 3 2 1 2 3 4 5
a 7 6 5 4 4 3 2 2 3 4 5
e 8 7 6 5 5 4 3 3 3 3 4
s 9 8 7 6 6 4 4 4 3 3 4
d 10 9 8 7 7 5 5 5 4 4 3
Figure (4.4): Damerau-Levenshtein Example
3) Enhanced Levenshtein
Minimum edit distance = 3:
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following operations are executed: 12 comparisons (line 15), 7 comparisons (line 16), 2 assignments (line 17).

  T r a n s p o s e d
0 1 2 3 4 5 6 7 8 9 10
T 1 0 1 2 3 4 5 6 7 8 9
a 2 1 0 1 2 3 4 5 6 7 8
r 3 2 0 1 2 3 4 5 6 7 8
n 4 3 1 1 1 2 3 4 5 6 7
s 5 4 2 2 2 1 2 3 3 4 5
p 6 5 3 3 3 2 2 3 4 5 6
a 7 6 4 3 4 3 3 3 4 5 6
e 8 7 5 4 4 4 4 4 2 3 4
s 9 8 6 5 5 4 5 5 2 3 4
d 10 9 7 6 6 5 5 6 3 3 3
Figure (4.5): Enhanced Levenshtein Example

4.4.1.2 Similarity Measure

Minimum edit distance methods count the number of edit operations required to convert one string to another, but they do not show how similar the two strings are. For example, the distance between "a" and "b" is 1, but the similarity is 0; the distance between "Similar" and "Similer" is also 1, but the similarity is 6/7. String lengths should therefore be taken into account when the edit distance is used as a similarity measure. The absolute length difference between two strings adds to the total of mismatched symbols, since it represents the number of symbols deleted from the shorter string; hence the measure must be normalized by the maximum of the two lengths. The relative distance is computed by:

R_Dist(St1, St2) = distance(St1, St2) / max(length(St1), length(St2)) … (4.1)
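The relative-distance formula (4.1) and its similarity complement can be sketched as follows; the `distance` argument stands for any precomputed edit distance (e.g. from the enhanced Levenshtein method):

```python
# Equations (4.1) and (4.2) as plain functions.
def relative_distance(s1, s2, distance):
    """R_Dist(St1, St2) = distance / max(|St1|, |St2|)   ... (4.1)"""
    return distance / max(len(s1), len(s2))

def similarity(s1, s2, distance):
    """Similarity(St1, St2) = 1 - R_Dist(St1, St2)   ... (4.2)"""
    return 1.0 - relative_distance(s1, s2, distance)

print(similarity("a", "b", 1))              # → 0.0
print(similarity("Similar", "Similer", 1))  # 6/7, ≈ 0.857
```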
The relative distance is a value within the interval [0, 1]: completely different strings have a relative distance of 1, and as its value decreases, the difference also decreases, reaching 0 when the two strings are identical. Since similarity and difference are complements of each other, the similarity can be computed by:

Similarity(St1, St2) = 1 - R_Dist(St1, St2) … (4.2)

The latter is the measure of similarity used in the candidates generation for this work.

4.4.1.3 Looking for Candidates

To find the similar tokens, the dictionary should be looked up and every token in it examined against the source word. This process consumes time because of the huge number of tokens held by the lexicon dictionary and the time required by the examining algorithm itself to find the minimum edit distance and compute the similarity to the source token. Hence, the search space needs to shrink; the similarity based looking-up method shown in Chapter Three is used to group similar tokens into clusters using local properties, i.e. the clustering process groups similar tokens depending on token spelling only. The input of the algorithm in figure (3.7) is the misspelled token. The usage of the thresholds depends on the generating ability, i.e. on how similar the generated candidates are to the source token. If they are highly similar, the top set is selected; but if there is difficulty in discovering reasonable candidates, using the thresholds may be a good solution. As the misspelled token becomes more confused, the set of examined centroids becomes larger; therefore, a filtering factor must be used to reduce the search space.
At least one generated primary centroid should be similar to the 3-symbol prefix of the source token by an amount of 2/3, which allows at most one mistake in the prefix. This restriction is not randomly selected; experiments revealed that misspellings are usually single-error, with a ratio between 70% and 95% depending on the text source, and mistakes rarely happen in the first three letters. According to [Pol84], 7.8% of errors occur in the first letter, 11.7% in the second letter and 19.2% in the third letter, each percentage being independent of the others.

After collecting the most similar set of primary centroids, the next step is to examine the secondary centroids of every selected primary centroid. The selection again depends on the similarity, now to the 6-symbol prefix of the source token, since the secondary centroids are at most 6 symbols long. The second threshold constrains the error value to at most two mistakes, i.e. 2/6 or less; but in some situations the best centroid must be selected from every secondary centroids set (from every selected primary cluster), because looking for candidates at this stage is limited to the first six symbols of the tokens, while longer tokens may contain more than two mistakes in their prefix. In other words, for every selected primary centroid, the nearest secondary centroids are selected, and the threshold serves as a limit that avoids selecting less similar centroids when centroids with higher similarity exist. Finally, for every selected secondary centroid, the candidates are generated from the secondary packets related to a centroid with a reasonable similarity to the source token.
Then, the decision of selecting a token as a candidate becomes easier, because the comparison is applied to the total lengths of both the source token and the dictionary tokens.
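The two-stage prefix filtering described above can be sketched as follows; this is a simplified sketch that applies the thesis's 3-symbol and 6-symbol prefix thresholds directly to a flat word list instead of the clustered centroid structure of Chapter Three, and the word list is hypothetical:

```python
# Sketch: prefix-based filtering with the thesis's two thresholds
# (at most 1 mistake in the first 3 symbols, at most 2 in the first 6).
def prefix_mismatches(a, b, k):
    """Count mismatching positions in the k-symbol prefixes."""
    return sum(x != y for x, y in zip(a[:k], b[:k]))

def filter_by_prefix(token, lexicon):
    """Keep words passing the 3-symbol stage, then the 6-symbol stage."""
    stage1 = [w for w in lexicon if prefix_mismatches(token, w, 3) <= 1]
    return [w for w in stage1 if prefix_mismatches(token, w, 6) <= 2]

print(filter_by_prefix("recieve", ["receive", "deceive", "remove", "banana"]))
# → ['receive']
```

The surviving words would then be scored with the similarity measure of section 4.4.1.2 and ranked.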
Ranking the candidates is a subroutine of the optimization stage; it uses more information than the similarity measure alone.

4.4.2 Candidates Generation for Real-word Errors

In this work, the generation of candidates is rule based. It can be divided into two types according to the step in which it is applied:
 Before suggesting optimal candidates for misspelled words: This type of generation is applied to sentences that do not contain misspelled words. The decision is made after phrasing the sentence into constituents and manipulating each phrase alone. The word that violates the rule of constructing the given sentence from the grammar or syntactic rules is detected and replaced with a set of other forms, any of which can make the sentence syntactically accepted. Grammar correction techniques are multiple and various; two techniques are used in this step to solve a part of the syntax errors: verb tense correction and subject-verb agreement.
 After suggesting optimal candidates for misspelled words: After ranking the candidates, this step allows the correction system to more precisely select the candidate that best fits into the sentence to make it correct, or at least does not violate its correctness. Selecting the best candidates after ranking is an additional filter for generating the best suggestions set.
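As a deliberately tiny illustration of the subject-verb agreement idea mentioned above (the thesis's actual rules operate on tagged phrases; the pronoun set and the simple "+s" inflection here are hypothetical simplifications):

```python
# Toy sketch of present-tense subject-verb agreement correction.
THIRD_SINGULAR = {"he", "she", "it"}  # simplified subject set

def agree(subject, verb_base):
    """Return the verb form agreeing with the subject (present tense)."""
    if subject.lower() in THIRD_SINGULAR:
        return verb_base + "s"   # naive inflection; ignores irregular verbs
    return verb_base

print(agree("He", "walk"))    # → walks
print(agree("They", "walk"))  # → walk
```

A real implementation would consult the token tags produced by the tagger and handle irregular verb morphology.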
Chapter Five
Automatic Text Correction and Candidates Suggestion

5.1 Introduction

Text correction is the process of substituting the incorrect word(s) with correct word(s) that were selected as candidates and filtered to be the most suitable among many alternatives. Automating text correction is a complex task because of its direct association with human nature; a written word can never be absolutely predicted, even with perfect decision-making parameters that help a computer choose the perfect suggestion, since artificial intelligence has not yet reached human capabilities. However, there is always an alternative solution: optimizing the candidates. Many existing techniques can help in making the decision and providing the user with a set of highly expected alternatives for a given incorrect word. This work, as we will see in the next sections, exploits many features that are related in the first order to the incorrect word and its candidates themselves rather than to context. The automatic correction does not rely on meaning; it suggests candidates depending on the output of the previous stages (tokenization, tagging, and similarity based candidates generation) after applying multi-feature ranking and syntax analysis.

5.2 Correction and Candidates Suggestion Structure

Figure (5.1) shows the ranking process applied to the generated candidates.
For every incorrect word, there is a set of candidates generated by the candidates generator at the tagging stage. A set of features was predefined for ranking candidates according to similarity and error type relevance. The features set includes the similarity value between the generated candidate and the incorrect word, confusion and transposition factors, the type of error in the incorrect word, and syntactic properties. The ranking process involves:
 Assigning a value to every feature.
 Computing the effect factor of each feature (weighting).
 Summing all the weights into a single number.
 Inserting the processed candidate at the suitable index within the candidates list, where high similarity candidates are ranked at the top and low similarity candidates are inserted at the bottom.

Features are represented by a vector of eight elements that may be decreased or increased depending on the purpose for which the text correction is applied, the source of the input text, and the expected error rate. Similarly, the weights of the features are also affected by the input text source, since some features depend on the error type. Before applying the ranking process, the source token that was marked as misspelled is examined against Named Entity (NE) features, because most proper nouns are not added to dictionaries, resulting in a mismatch case. Recognizing NEs requires combining multiple sources of information; some of them are strong enough to decide that a misspelled token is a name, but not that it is not one. Syntax analysis follows the feature based ranking; it is another step for optimizing the results, and mostly the one with the highest effect. The accuracy of ranking candidates should be completed by the syntactic role of the candidates that would be selected as suggestions.
Figure (5.1): Candidates ranking flowchart. For each (misspelled token, candidates list) pair, every candidate is handled in turn: the similarity value is accounted, inserted symbols are specified, and the features (confused?, transposed?, equal lengths?, duplicated?, difference <= threshold?, same symbol set?, end symbol match?, first symbol match?) are tested and assigned weights W1..W7 with factors f1..f3; the candidate is then ranked according to the weights sum, until the last candidate is processed.
5.3 Named-Entity Recognition

A big set of weak-evidence features is proposed to decide whether a token is a named entity or not, but there is variance in the level of analysis and in the features themselves. Some features are efficient in deciding that a token is a named entity and can be used individually in decision making; other features are never helpful unless combined with others. The features fall into many sub-categories; the best known are those related to the word level, part-of-speech tags, and dictionary looking up. Since the purpose of this system is determining token correctness, the word-level features are the most helpful, because the dictionary looking up is previously satisfied (a matched token does not need to be analyzed) and part-of-speech tags are useless in the absence of a decision. In English, the following features give some evidence for name detection:
(1) All-uppercase: a token consisting of capital letters only.
(2) Initial-caps: a token starting with a capital letter.
(3) All-numbers: a token consisting of numbers only.
(4) Alphanumeric: a token containing letters and numbers.
(5) Single-char: a token of one letter.
(6) Single-i: the single letter "i".

The all-uppercase feature is the strongest and can be used individually; initial-caps may be affected by the token's position within the sentence, because English sentences start with a capitalized word. In this system, the all-numbers feature is handled by treating all numeric values alike, assigning the same hash code and the same tag to every numeric string in the hash table. Many abbreviations, a sort of named entities, are
alphanumeric; therefore, it is a good feature. The single-character feature is used by Microsoft Word. Finally, a single "i" may refer to the pronoun "I", which is sometimes mistakenly written as a lowercase letter. Named-entity recognition features may help in marking a token as a name, but they cannot precisely decide that it is not one. An example of such cases is that some names may be written in lowercase letters, like "van Gogh", which satisfies none of the features above.

5.4 Candidates Ranking

If the misspelled token was not recognized as a named entity, the ranking process starts by measuring the similarity between the source token and every candidate in the associated list in a more sophisticated manner, considering the type of the committed error, to find a numeric value that describes the fidelity of each candidate over the rest. Eight weighted features are used to account for the effect of every error type on the whole candidate string; three different factor values are considered in the flowchart in figure (5.1) to outline and simplify the idea of giving different factor values to different error types (f1 = high, f2 = medium, f3 = low). Practically, effect factors are numeric values that vary from one feature to another. For each element in the features vector, there is a weight that reflects that feature's share in the total computed rank value. The rank value for each candidate is computed by:

Rank(c) = Σ (i = 1..n) wi × vi(c) … (5.1)

where n is the number of features, c is the selected candidate, wi is the weight associated with feature no. i, and v is the features vector.
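Equation (5.1) is a plain weighted sum; a minimal sketch follows, with a hypothetical eight-element feature vector and hypothetical weights (the real values depend on the input text source and are not specified here):

```python
# Equation (5.1): Rank(c) = sum_i w_i * v_i(c)
def rank_value(features, weights):
    """Weighted sum of a candidate's feature values."""
    return sum(w * v for w, v in zip(weights, features))

# Hypothetical values for one candidate: similarity, first/end symbol
# matches, length difference, transposition, confusion, symbol set,
# duplication. Similarity carries the largest weight, as in section 5.4.1.
features = [0.86, 1.0, 1.0, 0.9, 0.0, 0.0, 1.0, 0.0]
weights  = [5.0, 0.5, 0.3, 1.0, 1.0, 1.0, 0.5, 0.5]
print(round(rank_value(features, weights), 2))  # → 6.5
```

Candidates are then inserted into the suggestion list in decreasing order of this value.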
The weights' values depend on the application area of the system; they rely on the input text quality and the input device itself. The following subsections describe each feature and its effect on the ranking process.

5.4.1 Edit Distance Based Similarity

The enhanced Levenshtein edit distance method is used to calculate the distance of each candidate from the source token. Computing the similarity depends on the distance and on the lengths of the two strings (for more details see section 4.4.1.3). Similarity is measured by a numeric value within the interval (0, 1); therefore, it is multiplied by a factor that normalizes it against the other features in such a way that it gets the largest share in the ranking value among the other features' weights. In this application, as preferred in other applications because of the dominance of the similarity amount in the suggestion decision, similarity was weighted by a factor several times larger than the other features' weights.

5.4.2 First and End Symbols Matching

Research in the area of error analysis shows that mistakes rarely happen in the first letters of a word, and mostly the first letter is not mistaken. The probability of mistaking the second letter is also high but does not achieve interesting results compared to the first letter. On the other side, the end letter has a mistake probability near that of the first letter, and hence it is used as a part of the optimization procedure in calculating ranking values. The first and end letters are sufficient because they are related to human brain capabilities. Research from Stanford University showed that our brains
can predict the correct word exactly even if its letters are permuted randomly, as long as the first and last letters are correct. Exploiting these results assists the process of optimizing candidates suggestion. However, the idea cannot be realized directly in a computational way, because the human mind's ability to predict and connect facts is extremely fast and reliable; it depends on imagination and semantic relevance in interpreting sentences even in the presence of errors, and to date such ability is not found in computers. As a result, this feature, the difference in lengths, and the same-symbol-set feature together can simulate the human brain in a statistical way, because the idea is originally dependent on statistics. Small weights are given to both the first-letter and end-letter features, with a preference for the first-letter feature because it has a larger effect on the prediction than the end letter does.

5.4.3 Difference in Lengths

Writing mistakes usually occur within the token length or in its length ± 1; rarely do the lengths of the mistaken token and the intended token differ by more than one unit. Equality of lengths does not affect the candidate itself only, but also other features like transposition and confusion and even duplication (the next subsections detail the idea). Candidates with larger difference values may be rejected although they score good ranking values. The feature value is calculated by the relative length difference:

R_L_D(St1, St2) = 1 - abs(||St1|| - ||St2||) / min(||St1||, ||St2||) … (5.2)

where ||Sti|| is the length of string Sti.
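Equation (5.2) as code; note that, read literally, the formula can go negative when the length difference exceeds the shorter string's length, which is consistent with such candidates being rejected outright:

```python
# Equation (5.2): relative length difference feature.
def relative_length_difference(s1, s2):
    """1 - |len(St1) - len(St2)| / min(len(St1), len(St2))"""
    return 1 - abs(len(s1) - len(s2)) / min(len(s1), len(s2))

print(relative_length_difference("word", "words"))  # → 0.75
print(relative_length_difference("word", "word"))   # → 1.0
```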
The weight of this feature depends on the source of the input text: texts entered through an optical character recognizer (OCR) usually get smaller weights, while typed documents get larger values because the insertion of symbols is probable.

5.4.4 Transposition Probability

Transposition refers to the case of replacing a character by a neighboring one that is either similar in style or placed near it on the keyboard. Usually, this type of error occurs in typed texts and is referred to as "typos". Since English has a small alphabet, the task of computing the probability of transposing one letter for another is easy. Table (5.1) shows a transposition matrix that contains the probability of each letter being confused with another of the 26-letter alphabet, regardless of case, because such mistakes are related to the physical movement of the typist's fingers, not to the typed token. This feature considers two types of errors:
1. Errors within the length of the word: the typist mistakes a given letter for another, i.e. substitutes it with a neighboring letter by pressing the wrong key instead of the intended one. Such cases are described as being of the first degree, and the feature value is assigned the maximum.
2. Errors resulting in a word length increment: sometimes the fingers miss the exact position of the intended letter and press two keys simultaneously, typing two consecutive letters (the intended letter and the one to its right or left). This mistake inserts an additional letter and increases the word length by one.
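A lookup over such a matrix can be sketched as follows; only a few entries excerpted from the transposition matrix of Table (5.1) are included, and the symmetric-lookup convention is an assumption of this sketch:

```python
# A few entries from Table (5.1): 2 = immediate keyboard neighbour,
# 1 = nearby key, 0 = unlikely transposition.
TRANSPOSITION = {
    ("a", "s"): 2, ("a", "q"): 1, ("a", "w"): 1, ("a", "z"): 1,
    ("e", "r"): 2, ("e", "w"): 2, ("e", "d"): 1, ("e", "s"): 1,
}

def transposition_weight(intended, typed):
    """Case-insensitive, symmetric lookup of the transposition weight."""
    pair = (intended.lower(), typed.lower())
    return TRANSPOSITION.get(pair) or TRANSPOSITION.get(pair[::-1], 0)

print(transposition_weight("E", "r"))  # → 2
print(transposition_weight("a", "x"))  # → 0
```

A full implementation would load all 26 rows of the table into such a mapping.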
Table (5.1): Transposition Matrix
  a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 1 0 0 1
b 0 0 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0
c 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0
d 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 1 0 0 0
e 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 0
f 0 0 1 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
g 0 1 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0
h 0 1 0 0 0 0 2 0 0 2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
i 0 0 0 0 0 0 0 0 0 1 1 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0
j 0 0 0 0 0 0 0 2 1 0 2 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
k 0 0 0 0 0 0 0 0 1 2 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 1 1 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0
n 0 0 0 0 0 0 0 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0
q 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
r 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
s 1 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 1
t 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0
u 0 0 0 0 0 0 0 1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0
v 0 0 2 2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0
x 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2
y 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0
z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0

5.4.5 Confusion Probability

Confusion refers to the case of replacing a letter with another of similar pronunciation; sound is the basis for calculating the probability of confusing a given letter, unlike the transposition probability, which depends on the arrangement of keys on the keyboard.
This type of analysis targets phonetic errors; vowels are usually the most frequently confused letters. The weight of this feature depends on the application in which the correction is used; it should take large values when used with speech recognition systems. Table (5.2) shows the Stanford confusion matrix after being updated and normalized.
Table (5.2): Confusion Matrix
    a b c d e f g h i j k l m n o p q r s t u v w x y z
a   0 0 0 0 3 0 0 0 2 0 0 0 0 0 2 0 0 0 2 0 1 0 0 0 0 0
b   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
c   0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0
d   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
e   3 0 0 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 1 0
f   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
g   0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i   2 0 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 1 0
j   0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
k   0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
l   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
m   0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0
n   0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o   2 0 0 0 3 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0
p   0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
q   0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
s   0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
t   0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
u   2 0 0 0 2 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0
v   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
y   0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
z   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0

5.4.6 Consecutive Letters (Duplication)
Duplicating a single letter, or omitting one of two originally duplicated letters, is another of the typo errors. Some writers omit or add a letter from the original token, particularly when affixes are added.
The two major error types resulting from this kind of mistake are:
- Insertion: a single letter may be duplicated when a writer does not know the correct formation of a word after adding an affix, for example doubling the letter 'l' when attaching the suffix '-ful' to the noun 'hope' to form the adjective 'hopeful'. It can also result from pressing a key for longer than is needed to type a single letter, as in 'prrint'.
- Deletion: the reverse of insertion is dropping one of two duplicated letters, such as producing 'hopefuly' when adding the suffix '-ly' to 'hopeful', or writing a single letter instead of two, as in the single 's' of 'omision'.
Duplication is an interesting feature with a substantial effect on choosing the optimum candidate, particularly when the difference between the source token and the candidate equals the number of missing or duplicated letters.

5.4.7 Different Symbols Existence
A candidate is preferred to contain the same set of letters as the source token; this feature highlights the case of transposing two adjacent letters in a word (Damerau's fourth error type), which is a common mistake in typed text.

As a conclusion: none of the features described above is separable from the others; each is constrained by its weight and effect factor. Relations hold between edit distance and all seven remaining features; between difference in length and each of confusion, transposition, and duplication; between transposition and duplication; and so forth. Consequently, all of these features share the task of ranking the candidates, each with its own weight and according to the application environment. At this point the suggestion of candidates at the word level ends, and syntactic constraints begin to play a role in deciding which token should be suggested as the optimum among all the alternatives in the dictionary.
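Looking back at the duplication feature of section 5.4.6, a minimal detector for its two cases (doubled-letter insertion and missing-double deletion) might look like the following sketch; the function name and return values are illustrative assumptions.

```python
def duplication_relation(source: str, candidate: str):
    """Return 'insertion' if `source` has one extra doubled letter
    relative to `candidate`, 'deletion' if it is missing one half of a
    doubled pair, and None otherwise."""
    if len(source) == len(candidate) + 1:
        longer, shorter, kind = source, candidate, 'insertion'
    elif len(candidate) == len(source) + 1:
        longer, shorter, kind = candidate, source, 'deletion'
    else:
        return None
    for i, ch in enumerate(longer):
        # Dropping position i must restore the shorter word, and the
        # dropped letter must duplicate one of its neighbours.
        if longer[:i] + longer[i + 1:] == shorter:
            if (i > 0 and longer[i - 1] == ch) or \
               (i + 1 < len(longer) and longer[i + 1] == ch):
                return kind
    return None
```

For example, 'prrint' against 'print' is classified as an insertion, while 'hopefuly' against 'hopefully' is a deletion.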
5.5 Syntax Analysis
The task of the syntax analyzer is critical at this stage: in addition to examining sentence correctness, it selects the optimum candidate. In both cases, the analysis is applied at the level of phrases, where a sentence is broken into clauses and the clauses into phrases. The syntax analysis process is shown in figure (5.2).

5.5.1 Sentence Phrasing
The token stream is divided into groups in the segmentation stage; segmenting a text depends on the output of the tokenizer and the tagger, because determining sentence boundaries makes use of tags. The segmented text is a stream of sentences that can be passed to the syntax analyzer, since the latter usually works at the sentence level. A sentence contains one or more clauses, each clause consists of one or more phrases, and a phrase in turn contains one or more words. Phrasing is efficient from the following standpoints:
- Correcting part of a phrase affects the structure of the sentence only partially, which minimizes the total number of possible alternatives, leading to a smaller set of candidates and a better reconstruction of the original sentence that leaves it reasonably unchanged.
- Attachment ambiguity is a challenge facing the correction process, especially in semantic relations; phrase-level correction resolves it because a phrase is attached as a whole to another phrase, and updating it neither affects nor is affected by other phrases, unlike word-level correction, which must consider every possible parse and related part of the sentence.
- Converting into phrases simplifies the generation of complex sentence structures, because however complex a sentence becomes, it is still a collection of phrases connected syntactically and semantically.

Figure (5.2): Syntax analysis flowchart. [The flowchart converts each sentence into phrases, tests candidates starting from the top of the ranked list, replaces the misspelled token and checks constituency after each correction, selects the next candidate on violation, and outputs the corrected text with a list of candidates for each corrected token.]

English has a set of phrase types that includes: Noun Phrase (NP), Prepositional Phrase (PP), Adjectival Phrase (AdjP), Adverbial Phrase (AdvP), Complement (C), and Verb Phrase (VP). Each has its own set of word classes and a structure governing those classes.

5.5.2 Candidates Optimization
Misspelled tokens are associated with a ranked list of candidates, whose top entry is the most similar to the misspelled word. The optimization procedure is applied in two phases: the first is ranking according to feature satisfaction and weights; the second is syntactic agreement within the phrase that contains the misspelled word. Selecting candidates starts from the top; checking the consistency of the phrase structure has a fundamental impact on correction accuracy. The tag of the selected candidate should satisfy the structure of the phrase, and the process may also require checking the next tag in the sentence, i.e. the token that follows the misspelled word, which may form the head of the next phrase. The task is not especially challenging if the phrasing procedure was accurate: the structure of the phrase under processing limits the possible alternatives for the misspelled word to those with the best similarity and syntactic agreement.

5.5.3 Grammar Correction
A sentence is grammatically accepted if it can be generated by applying a finite set of grammar rules.
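This acceptance test can be illustrated with a toy rule set: a sentence's tag sequence is accepted if and only if it matches one of finitely many patterns. The rules below are invented examples for illustration, not the system's actual grammar.

```python
import re

# Toy grammar: each rule is a pattern over POS tags. A real rule set
# would be far larger and phrase-structured.
RULES = [
    r'(DET )?NOUN VERB',              # "The dog barks" / "Dogs bark"
    r'(DET )?NOUN VERB (DET )?NOUN',  # "The dog chased the cat"
]

def accepted(tags):
    """A tag sequence is accepted iff some rule generates it."""
    seq = ' '.join(tags)
    return any(re.fullmatch(rule, seq) is not None for rule in RULES)
```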
Grammar correction is a subfield of real-word error correction; it relies on sentence constituency to detect words that disagree with the grammar rules and make the sentence violate the parsing rules. In this step, the system checks the correctness of sentences by parsing each sentence separately, because syntactic acceptance is a sentence-level property, unlike semantic and further processing, which analyze texts at the level of paragraphs and full texts. The grammar correction procedure deals with two types of sentences:
1. Sentences containing correctly spelled words.
2. Sentences containing words that have been replaced with correct words.
In both cases, candidate suggestion has already finished and the correction is restricted to one suggested candidate. As shown in previous sections, the optimal candidate is the grammatically suitable one with the highest similarity; the grammar corrector therefore treats the two sentence types equivalently, as a sequence of correctly spelled words. Correcting a text grammatically is an extensive process that requires deep knowledge of the underlying language grammar and an inclusive set of grammar rules. This system is rule based and considers two types of correction:
- Subject-verb agreement.
- Verb tenses.
To perform these two types of correction and the phrasing procedure, the tag set needed to be more detailed than what is available in the original WordNet dataset. The dictionary was preprocessed to subdivide some tags into finer-grained ones, such as dividing definite and indefinite determiners into pre-, central, and post-determiners. Nouns also had to be categorized into plurals and singulars, and verbs into different tenses and participles. Integration with the ISPELL database enhanced the accuracy of the dictionary, providing it with a large set of singular and plural nouns, adjectives, and verb tense forms.

5.5.4 Document Correction
The final step is suggesting the corrected sentences. It includes replacing the incorrect words with the optimal candidates and associating the remaining candidates with every corrected word. The association is necessary because even a perfect suggester can never decide the intended word with absolute certainty; only the user can judge whether a word was accurately corrected. Candidates are listed according to their ranking values, and the list should be short and accurate. A threshold can be applied to the suggestion list to filter out any candidate whose similarity falls below a predefined value; developers can set the threshold according to the application environment. For example, applications used mainly by native speakers can afford a stricter threshold than applications such as language-learning programs, whose users typically have weaker linguistic knowledge.
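The threshold cutoff described above can be sketched as a small filter over the ranked candidate list; the default threshold and list length below are assumed, application-dependent parameters.

```python
def filter_suggestions(ranked, threshold=0.7, max_len=5):
    """Keep only candidates whose similarity reaches the threshold.

    ranked  : list of (candidate, similarity) pairs, best first.
    returns : a short, accurate list to display to the user.
    """
    kept = [(word, sim) for word, sim in ranked if sim >= threshold]
    return kept[:max_len]
```

A stricter environment (native speakers) would raise `threshold`; a language-learning application would lower it to keep more alternatives.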
  • 117. 98  Chapter Six Experimental Results, Conclusions, and Future Works 6.1 Experimental Results Objectives of this system are achieved through applying many steps. Some of these steps required techniques modifications to overcome some problems that are facing the desired results: 6.1.1 Tagging and Error Detection Time Reduction Assigning a POS tag to every token in the input text requires looking up the underhand dictionary. Looking up is an extensive process being more complex as the size of the dictionary becomes larger; the problem is solved by applying prefix dependent dictionary structure based on hashing and indexing. The structure is consisted of two levels of division: primary packets and secondary packets. Primary packets distribution depends on 3-symbols prefixes resulted in quite manageable sizes; as shown in figure (6.1) the search space is reduced to about one thousand of tokens at maximum instead of the original hundreds thousands with an average packets size of (11.16) tokens. In addition to the availability of packets heads random access that is provided by the hash function. Whereas, secondary packets are 6-prefix dependent resulted in more steadfast searching and minimized the looking up time to a reasonable amount as shown in figure (6.2), the search space reduced from hundreds thousands to some hundreds at maximum; the average size is (7.26) tokens per a secondary packet.
Figure (6.1): Tokens distribution in primary packets
Figure (6.2): Tokens distribution in secondary packets

On the dictionary lookup side of the tagging phase, the hashing scheme gives the lookup procedure a set of useful properties:

6.1.1.1 Successful Looking Up
When the target token is found in the dictionary, lookup time is reduced through three steps:
- Primary packet selection: the head of every primary packet is randomly accessible by applying a direct hash function that consumes three symbols of the target token; this, in turn, reduces matching time in the next steps.
- Secondary packet selection: selecting a secondary packet involves examining only three symbols (indices 4-6), resulting in faster searching even when performed sequentially.
- Inside-secondary-packet lookup: the remainder of the target token has length (token length - 6), since six symbols were consumed in the two previous steps on the way to the target secondary packet.
In other words, the best case of a successful lookup has time complexity O(1): the target token is stored at the first entry (the head) of a primary packet, which is randomly accessible. The worst case occurs when the target token is the last entry of a secondary packet whose head is in turn the last entry of the corresponding primary packet. The time complexity is then:
- O(1) for primary packet head access (random access);
- O(L1) for finding the secondary packet head holding the target token, examining only three symbols at each step;
- O(L2) for catching the target token, matching only the remainder of the token after discarding the first six symbols;
where L1 and L2 are the lengths of the primary and secondary packets, respectively. In total, the worst case is O(1) + O(L1) + O(L2).

6.1.1.2 Failure Looking Up
If the target token is not found in the dictionary, the failure can be discovered in three different situations:
- At the hashing step (generating the primary packet head address): if there is no match with the target token's prefix, the reference table announces the failure by referring to an empty primary packet.
This step consumes only one operation, O(1) time complexity.
- Within a prefix of six symbols of the target token: the failure can be discovered by matching the symbols at indices 4-6 against the same indices of each token in the primary packet. This step takes a number of matching operations equal to the length of the primary packet, O(L1) time complexity.
- On finding a matching secondary packet head (a 6-symbol prefix at minimum): matching against the tokens of the secondary packet is limited to the remainder of the target token, since the prefixes were already checked. The time complexity is O(L2).

6.1.2 Candidates Generation and Similarity Search Space Reduction
Candidate generation requires examining all the tokens in the lexical dictionary to compute their similarity to the misspelled token. Spell-based clustering illusion is our proposed solution for reducing the search space: similarly spelled tokens are grouped together in a way that keeps the structure of the dictionary unaffected and allows similarity-based lookup using bi-gram analysis and prefix similarity. Threshold usage depends on the application environment. The misspelled token is the basic unit of candidate generation: the proposed similarity-based lookup generates similar tokens from the misspelled token and the dictionary at hand, exploiting the hash-indexing scheme to speed up generation and bi-gram analysis to improve the accuracy of candidate selection without losing candidates of interest.
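The bi-gram analysis used to narrow the candidate search space can be sketched with the Dice coefficient over character bigrams; the 0.4 cutoff below is an assumed parameter, not the thesis's tuned threshold.

```python
def bigrams(word):
    """Set of character bigrams of a word, e.g. 'night' -> ni ig gh ht."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over the two bigram sets (1.0 = identical)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def prefilter(misspelled, vocabulary, cutoff=0.4):
    """Discard tokens sharing too few bigrams with the misspelled
    token before any more expensive edit-distance ranking."""
    return [w for w in vocabulary if bigram_similarity(misspelled, w) >= cutoff]
```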
The proposed approach is highly flexible because it acts as a clustering model with well-structured clusters (even though the clustering is in fact an illusion), and it supports a set of modifiable parameters:
1. Similarity measure: in this work we relied on minimum edit distance techniques (specifically, an improvement of the Levenshtein method) and used a similarity measure based on the distance this method computes. The similarity-based lookup approach is independent of the similarity measure, so any method or technique can be used with it.
2. Thresholds: threshold specification is a challenge facing many applications and requires effort to adjust well. This approach simplifies the task in two ways:
- Candidate generation can be performed without any consideration of thresholds; after the candidates are collected, they are ranked and the desired number of best candidates simply selected.
- Candidate filtering can be broken into three levels: the first at primary centroid selection, the second at secondary centroid selection, and the third at candidate selection. As in any other area of computation, dividing a problem into sub-tasks simplifies both the initial assignment and the updating of parameters during adjustment.
3. Applicability: similarity-based lookup readily accepts any update to candidate generation and can be adapted to different environments. For example, a developer using it for post-correction in OCR applications can add features that make recognition more accurate. This action may be
referred to implicitly by the similarity measure, or passed explicitly as parameters to the generation procedure.

6.1.3 Reducing the Time of the Damerau-Levenshtein Method
Damerau's modification of the Levenshtein method increases the time complexity because it adds extra checks on every symbol of the input strings; this goes back to the simple way it tests for the presence of a transposition. In this work, we modified the original method to handle transposition cases using the same idea as the original method, by merging the transposition test into a statement whose execution is limited to where it can apply.

Figure (6.3): Time complexity variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification). [The Y axis represents the consumed time measured in seconds; the X axis shows the samples used for testing.]

The time variance of the three methods (Levenshtein, Damerau-Levenshtein, and the enhanced Levenshtein) is shown in figure (6.3); the time consumed by the enhanced method is very close to the original Levenshtein, while the Damerau modification resulted in a somewhat longer
time. The computed time is an average over ten repeated executions of each of the three methods on the same testing group.

6.1.4 Features Effect on Candidates Suggestion
The eight features selected for suggesting the best set of candidates were tested in three different cases to show how each affects the selection of the optimal suggestion for isolated-word correction. Figure (6.4) shows the ratio of candidates that were correctly suggested and correctly chosen as optimal. Suggested tokens are situations where the target token appears in the list of suggestions but is not necessarily selected as optimal; chosen-as-optimal tokens are those correctly selected as optimal.

Figure (6.4): Suggestion accuracy, with a comparison to Microsoft Office Word, on a sample from Wikipedia.
- Total misspelled tokens: 1825
- Suggested target token: 1691 (suggestion accuracy 92.657%)
- Optimally selected: 1477 (optimality accuracy 87.34%)
- Microsoft Word suggestion: 1659 (suggestion accuracy 90.904%)
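Returning to section 6.1.3, the enhanced method's idea of folding the transposition check into the main Levenshtein loop, so that it executes only where a transposition can actually apply, can be sketched as follows; this is a minimal illustration, not the thesis's exact code.

```python
def enhanced_levenshtein(s, t):
    """Edit distance counting insertion, deletion, substitution, and
    adjacent transposition; the transposition test is merged into the
    same cell update the original method already performs."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition check, limited to where it can apply.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]
```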
Suggestion accuracy was computed by applying isolated-word correction to a list of 1825 commonly misspelled words from the Wikipedia website, resulting in an accuracy of 92.657%, of which 87.34% were correctly suggested as optimal candidates. The same testing data checked with Microsoft Word gave a suggestion accuracy of 90.904%. A subset of the Wikipedia sample (presented in Appendix A) was used to compare our system's accuracy with other systems. The results for the other systems were taken from a study by Ahmed Farag and others [Ahm09]; our system was tested on the same data and gave the results shown in figure (6.5).

Figure (6.5): Testing the suggested system (I.T.D.C) and comparing the results with other systems on the same dataset (isolated-word correction).
- ASPELL: 109 correctly suggested, 11 incorrectly suggested (accuracy 90.833%)
- Microsoft Word: 104 correctly suggested, 16 incorrectly suggested (accuracy 88.33%)
- MultiSpell: 111 correctly suggested, 9 incorrectly suggested (accuracy 92.5%)
- I.T.D.C system (the suggested system): 115 correctly suggested, 5 incorrectly suggested (accuracy 95.83%)
Another experiment checked the effect of every feature on the accuracy of optimal candidate selection. The results in figure (6.6) were computed by discarding one feature at a time, while figure (6.7) shows the results of using one feature at a time. Although some features give high accuracy alone, that alone is not sufficient evidence: an example is the duplication feature, which accounted for 1552 correctly selected optimal tokens when used alone, whereas discarding it did not affect the total number of tokens in the optimal set.

Figure (6.6): Discarding one feature at a time for optimal candidate selection (correctly selected optimal tokens; full optimal set = 1477).
- Without the similarity feature: 827
- Without the first letter feature: 1464
- Without the end letter feature: 1468
- Without the length effect feature: 1476
- Without the same letter set feature: 1436
- Without the transpositionally inserted feature: 1464
- Without the duplication feature: 1465
- Without the confusion feature: 1475
- Without the transposition feature: 1487
Figure (6.7): Using one feature at a time for optimal candidate selection (correctly selected optimal tokens; full optimal set = 1477).
- Similarity feature alone: 1406
- First letter feature alone: 317
- End letter feature alone: 595
- Length effect feature alone: 447
- Same letter set feature alone: 1486
- Transpositionally inserted feature alone: 1478
- Duplication feature alone: 1552
- Confusion feature alone: 909
- Transposition feature alone: 923

6.2 Conclusions
Text correction is a complex problem and an extensive task. It needs many linguistic and statistical resources, as well as efficient techniques for automatic execution. In this work we performed a set of improvements on both the resource and technique sides. Our dictionary, an integration of the WordNet and ISPELL datasets, was retagged to enable and simplify the parsing process. Hashing and indexing techniques shorten the error detection time; the correction process exploits the same hashed dictionary together with an enhancement of the Levenshtein method for generating candidates. A set of features, some of them statistics-dependent, is used to optimize candidates before they are passed to the parser, where the final decision is made at the level of phrases and sentences. There is no way to avoid human intervention, because computers can never predict with certainty what a human intended; therefore, a set of alternatives is associated with every corrected word.

6.3 Future Works
Automatic text correction is an open research area; even with several techniques and applications available, the desired results are still imperfect. Some issues can be further considered in this work to improve its accuracy:
- Semantic processing: this system depends entirely on an extensive parser at the level of syntax analysis only; semantic information would increase accuracy if implemented to discard candidates that conflict with the sentence meaning. Discourse and pragmatic analysis could also enhance the results.
- In addition to spell-based clustering and phonetic-based clustering, a technique merging both within the same searching time constraints is desirable. Such an enhancement would maximize candidate generation accuracy and minimize time complexity.
- In the hash table, lookup inside primary and secondary packets is performed sequentially; applying a faster technique such as binary search would be a good improvement. This requires sorting the tokens by spelling and applying the search in two directions:
o At the level of the token itself, where moving from one entry to another depends on the tokens' spelling and should therefore consider the symbols of the token sequentially, because the length is small enough not to need a complex lookup technique.
o At the level of the packets, where movement is performed at the level of tokens.
- The similarity-based lookup needs to be faster; an enhancement is needed to reduce the number of generated primary centroids. This problem may be eased if the application in which the system is used becomes more specific.
- In grammar correction we considered only two error types; it is preferable to cover as many types as possible.
- Because of time constraints, this system was implemented for simple sentences only; an extension is required to generalize it to complex, compound, and complex-compound sentences. The task is straightforward, because no further details are required for the construction process thanks to the phrase-level analysis made in this work.
- A sophisticated study of error types and of how people typically make writing mistakes is needed; such a study requires multiple resources, including corpora, statistics, and even an interactive analyzer for recording and classifying commonly committed mistakes. Although not an easy task, it could help draw a concluding picture of users' general behavior when they unintentionally change the spelling of words and generate misspellings.
  • 130. References ___________________________________________________________  110  References Achenkunju A. and Bhuma V.R. (2014). "An Efficient Reformulated Model for Transformation of String." International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 International Conference on Humming Bird. 1 March. Ahmed F., Ernesto W. De L., and Andreas N.( 2009). Revised N-Gram based Automatic Spelling Correction Tool to Improve Retrieval Effectiveness. Technical University of Berlin. Ali A.(2011). Textual Similarity. Technical University of Denmark. Amber W. O., Graeme H., and Alexander B. (2008). Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. Ontario: University of Toronto. Baluja S., Vibhu O. M., and Rahul S.(2000). APPLYING MACHINE LEARNING FOR HIGH PERFORMANCE NAMED-ENTITY EXTRACTION. Cambridge: Blackwell Publishers. Bassil Y.( 2012). "Parallel Spell-Checking Algorithm Based on Yahoo! N- Grams Datasets." International Journal of Research and Reviews in Computer Science (IJRRCS), ISSN: 2079-2557, Vol.3, No.1, February. Bhattacharyya P. (2012). "Natural Language Processing A Perspective from Computation in Presence of Ambiguity, Resource Constraint and Multilinguality." CSI Journal of Computing, Vol.1 , No. 2, 3-13. Booth A. D., Brandwood L., and Cleave J. P.. (1958). Mechanical Resolution of Linguistic Problems. New York, London: Academic Press Ink Publishers; ButterWorths Scientific Publications. Boswell D. (2005). Speling KoreKsion: A survey of techniques from past to present. A USCD Research Exam.
  • 131. References ___________________________________________________________  111  Chakraborty R. C. (2010). "Artificial Intelligence: Natural Language Processing." www.myreaders.info/html/artificial_intelligence.html, 1 June. Church K. and Gale W. A. (1991). "Probability Scoring for Spelling Correction." Statistics and Computing, 93-103. Clark A., Chris F., and Shalom L. (2010). The Handbook of Computational Linguistics and Natural Language Processing. Singapore: Wiley-Blackwell. Dahlmeier D. and Hwee T. N. (2011). "Grammatical Error Correction with Alternative Structure Optimization." Proceedings of the Association for Computational Linguistics, 915-923. Damerau F. J. (1964). A Technique for Computer detection and Correction of Spelling Errors. New York: ACM, Vol.3,No.4. Dzikovska M. O. (2004). A Practical Semantic Representation For Natural Language Parsing. New York: University of Rochester. Farra N., Nadi T., Alla R., and Nizar H. ( 2014). "Generalized Character- Level Spelling Error Correction." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 23-25, 161–167. Felice M., Yuan Z., Andersen Ø. E., and others.( 2014). "Grammatical Error Correction using Hybrid Systems and Type Filtering." Proceedings of the Shared Task Eighteenth Conference on Computational Natural Language Learning, Maryland, 15-24. Fromkin V., Robert R., and Nina H. ( 2007). Language Change: The Syllabes of Time. Vol. 8, in An Introduction to Language, 461-497. Boston. Gamon M. (2010). "Using Mostly Native Data to Correct Errors in Learners' writing: A meta-classifier approach." proceedings of the Annual
Meeting of the North American Chapter of the Association for Computational Linguistics, 163-171.
Golding A. R. and Yves S. (1996). Combining Trigram-Based and Feature-Based Methods for Context-Sensitive Spelling Correction. Cambridge: Mitsubishi Electric Research Laboratories.
Grune D. and Ceriel J. H. J. (2008). Parsing Techniques: A Practical Guide. Second Edition. Springer.
Gupta A. (2014). "Grammatical Error Detection and Correction Using Tagger Disagreement." Proceedings of the Shared Task of the Eighteenth Conference on Computational Natural Language Learning, 49-52.
Haldar R. and Debajyoti M. (2011). Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach. New York: ACM.
Han N., Martin C., and Claudia L. (2006). "Detecting Errors in English Article Usage by Non-Native Speakers." Natural Language Engineering, 115-129.
Hasan F. M. (2006). Comparison of Different POS Tagging Techniques for Some South Asian Languages. Dhaka: BRAC University.
Hasan F. M., Naushad U., and Mumit K. (2006). Comparison of Different POS Tagging Techniques (N-Gram, HMM and Brill's Tagger) for Bangla. Bangladesh: BRAC University.
Hodge V. J. and Austin J. (2003). "A Comparison of Standard Spell Checking Algorithms and a Novel Binary Neural Approach." IEEE Transactions on Knowledge and Data Engineering, 1073-1081.
Hwee T. N., Siew M. W., Ted B., and others. (2014). "The CoNLL-2014 Shared Task on Grammatical Error Correction." Proceedings of the Shared
Task of the Eighteenth Conference on Computational Natural Language Learning, June 26-27, 1-14.
"Ispell." Wikipedia, The Free Encyclopedia, April 10, 2014 (accessed September 2014).
Jackson P. and Isabelle M. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: John Benjamins Publishing Company.
Jones K. S. (2001). Natural Language Processing: A Historical Review. University of Cambridge, October.
Julius G. III. (2013). Intrasentential Grammatical Correction with Weighted Finite State Transducers. Raleigh, North Carolina: North Carolina State University.
Jurafsky D. and James H. M. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. New Jersey: Alan Apt.
Kirthi J., Neeju N. J., and Nithiya P. (2011). "Automatic Spell Correction of User Query with Semantic Information Retrieval and Ranking of Search Results Using WordNet Approach." IJCSI International Journal of Computer Science Issues, Vol. 8, No. 2, March, 557-564.
Kukich K. (1992). Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, Vol. 24, No. 4.
Manning R. and Schütze. (2008). An Introduction to Information Retrieval. Cambridge University Press.
Mihov S., Svetla K., and others. (2004). Precise and Efficient Text Correction Using Levenshtein Automata, Dynamic Web Dictionaries and Optimized Correction Models. Bulgarian Academy of Sciences.
Mishra R. and Navjot K. (2013). "A Survey of Spelling Error Detection and Correction Techniques." International Journal of Computer Trends and Technology, Vol. 4, No. 3, 372-374.
Momtazi S. (2012). Natural Language Processing: Introduction to Language Technology. University of Potsdam.
Nadkarni P. M., Lucila O., and Wendy W. C. (2011). "Natural Language Processing: An Introduction." Journal of the American Medical Informatics Association, October 5, 544-551.
Niemann T. (2009). Sorting and Searching Algorithms. Portland: epaperpress.com.
"Notes on Ambiguity." http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.
Peterson J. L. (1980). "Computer Programs for Detecting and Correcting Spelling Errors." Communications of the ACM, Vol. 23, No. 12, 676-687.
Pollock J. J. and Zamora A. (1983). "Collection and Characterization of Spelling Errors in Scientific and Scholarly Text." Journal of the American Society for Information Science, 51-58.
Pollock J. J. and Zamora A. (1984). "Automatic Spelling Correction in Scientific and Scholarly Text." Communications of the ACM, 358-368.
Quirk R., Sidney G., Geoffrey L., and Jan S. (1985). A Comprehensive Grammar of the English Language. New York and London: Longman.
Raaijmakers S. (2013). "A Deep Graphical Model for Spelling Correction." Proceedings of the 25th Benelux Conference on Artificial Intelligence, Delft, 7-8 November.
Rich E. and Kevin K. (1991). "Chapter Fifteen: Natural Language Processing." Vol. 2, in Artificial Intelligence. Amazon.
Ritter A., Mausam S. C., and Oren E. (2011). Named Entity Recognition in Tweets: An Experimental Study. Computer Science and Engineering, University of Washington.
Rajesh K. S. and Lokanatha C. R. (2009). "Natural Language Processing: An Intelligent Way to Understand Context Sensitive Languages." International Journal of Intelligent Information Processing, December, 421-428.
Sagar and Shobha G. (2013). "Survey on Grammar Generation Methods for Natural Languages." International Journal of Computational Linguistics and Natural Language Processing, ISSN 2279-0756, Vol. 2, No. 1, January, 197-202.
Salifou L. and Harouna N. (2014). "Design of a Spell Corrector for Hausa Language." International Journal of Computational Linguistics (IJCL), Vol. 5, No. 2, 14-26.
Scott M. T. (1999). Parsing and Tagging Sentences Containing Lexically Ambiguous and Unknown Tokens. Purdue University.
Seo H., Jonghoon L., Seokhwan K., and others. (2012). "A Meta Learning Approach to Grammatical Error Correction." 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, July 8-14.
Setiadi I. (2014). Damerau-Levenshtein Algorithm and Bayes Theorem for Spell Checker Optimization. Bandung: Makalah IF2211 Strategi Algoritma.
Tetreault J., Jennifer F., and Martin C. (2010). "Using Parse Features for Preposition Selection and Error Detection." Proceedings of the ACL 2010 Conference Short Papers, 353-358.
Toutanova K. and Moore R. C. (2002). "Pronunciation Modeling for Improved Spelling Correction." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 144-151.
Verberne S. (2002). Context-Sensitive Spell Checking Based on Word Trigram Probabilities. University of Nijmegen.
Voorhees E., Harman D. K., and others. (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge: MIT Press.
Wagner R. A. and Fischer M. J. (1974). "The String-to-String Correction Problem." Journal of the Association for Computing Machinery, 168-173.
Wolniewicz R. (2011). Auto-Coding and Natural Language Processing. U.S.A.: 3M Health Information Systems.
Yannakoudakis E. J. and Fawthrop D. (1983). "An Intelligent Spelling Error Correction." Information Processing and Management, 101-108.
Yule G. (2000). "Pragmatics." In Oxford Introductions to Language Study, Series Editor H. G. Widdowson, 4. Oxford University Press.
Zampieri M. and Renato C. de A. (2014). Between Sound and Spelling: Combining Phonetics and Clustering Algorithms to Improve Target Word Recovery. Saarland: Saarland University.
Zhan J., Xiaolong M., Shu q. L., and Ditang F. (1998). A Language Model in a Large-Vocabulary Speech Recognition System. Sydney: Proceedings of International Conference ICSLP98.
Appendix (A): A comparison between this work and some other systems on isolated-word correction

* Bold words are incorrectly suggested.
** I.T.D.C. System: Intelligent Text Document Correction System Based on Similarity Technique (our suggested system).

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
Abberration | aberration | aberration | aberration | aberration | aberration
accomodation | accommodation | accommodation | accommodation | accommodation | accommodation
acheive | achieve | Achieve | achieve | achieve | achieve
abortificant | abortifacient | aficionados | - | abortifacient | abortifacient
absorbsion | absorption | absorbsion | absorbs ion | absorption | absorption
ackward | (awkward, backward) | awkward | (awkward, backward) | (awkward, backward) | (backward, awkward)
additinally | additionally | additionally | additionally | additionally | additionally
adminstration | administration | administration | administration | administration | administration
admissability | admissibility | admissibility | admissibility | admissibility | admissibility
advertisments | advertisements | advertisements | advertisements | advertisements | advertisements
adviced | advised | advised | advised | advice | advised
afficionados | aficionados | aficionados | aficionados | aficionados | aficionados
affort | (effort, afford) | effort | afford | afford | (effort, afford)
agains | against | agings | agings | against | against
aggreement | agreement | agreement | agreement | agreement | agreement
agressively | aggressively | aggressively | aggressively | aggressively | aggressively
agriculturalist | agriculturist | - | - | agriculturist | agriculturist
alcoholical | alcoholic | alcoholically | alcoholically | alcoholic | (alcoholically, alcoholic)
algebraical | algebraic | algebraic | algebraically | algebraically | algebraic
algoritms | algorithms | algorithms | algorithms | algorithms | (algorism, algorithms)
alterior | (ulterior, anterior) | ulterior | (anterior, ulterior) | (anterior, ulterior) | (ulterior, anterior)
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
anihilation | annihilation | annihilation | annihilation | annihilation | annihilation
anthromorphization | anthropomorphization | anthropomorphizing | - | anthropomorphization | anthropomorphization
bankrupcy | bankruptcy | bankruptcy | bankruptcy | bankruptcy | bankruptcy
baout | (about, bout) | bout | (about, bout) | bout | (about, bout)
basicly | basically | basically | basically | basically | basically
breakthough | breakthrough | break though | breakthrough | breakthrough | breakthrough
carachter | character | crocheter | character | character | character
cannotation | connotation | connotation | (connotation, annotation) | (connotation, annotation) | connotation
carismatic | charismatic | charismatic | charismatic | charismatic | charismatic
carmel | caramel | Carmel | - | caramel | caramel
cervial | (cervical, servile) | cervical | cervical | cervical | cervical
clasical | classical | classical | classical | classical | classical
cleareance | clearance | clearance | clearance | clearance | clearance
comissioning | commissioning | commissioning | commissioning | commissioning | commissioning
commemerative | commemorative | commemorative | commemorative | commemorative | commemorative
compatabilities | compatibilities | compatibilities | compatibilities | compatabilities | compatibilities
committment | commitment | commitment | commitment | commitment | commitment
debateable | debatable | debatable | debatable | debatable | debatable
determinining | determining | determinining | determinining | determining | determining
childbird | childbirth | child bird | child bird | childbirth | childbirth
definately | definitely | definitely | definitely | definitely | definitely
decribe | describe | describe | describe | describe | describe
elphant | elephant | elephant | elephant | elephant | elephant
emmediately | immediately | immediately | immediately | immediately | immediately
emphysyma | emphysema | emphysema | emphysema | emphysema | emphysema
erally | (orally, really) | orally | really | orally | (really, orally)
eyasr | (years, eyas) | eyesore | years | eyas | (eyas, years)
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
facist | fascist | fascist | fascist | fascist | fascist
fluoroscent | fluorescent | fluorescent | fluorescent | fluorescent | fluorescent
geneology | genealogy | genealogy | genealogy | genealogy | genealogy
gernade | grenade | grenade | grenade | grenade | grenade
girates | gyrates | grates | gyrates | Gyrates | gyrates
gouvener | governor | governor | souvenir | convener | (souvenir, gouverneur, governor)
gurantees | guarantee | guarantee | guarantee | guarantee | (guaranties, guarantee)
guerrila | (guerilla, guerrilla) | guerrilla | guerrilla | (guerilla, guerrilla) | (guerrilla, guerilla)
guerrilas | (guerillas, guerrillas) | guerrillas | guerrillas | (guerillas, guerrillas) | (guerrillas, guerillas)
Guiseppe | Giuseppe | Giuseppe | Giuseppe | Giuseppe | -
habaeus | (habeas, sabaeus) | habeas | habitués | sabaeus | Cabaeus
hierarcical | hierarchical | hierarchical | hierarchical | hierarchical | hierarchical
heros | heroes | heroes | heroes | herbs | heroes
hypocracy | hypocrisy | hypocrisy | hypocrisy | hypocrisy | hypocrisy
independance | Independence | Independence | - | Independence | Independence
intergration | integration | integration | integration | integration | integration
intrest | interest | interest | interest | interest | interest
Johanine | Johannine | Johannes | Johannes | Johannine | Johannine
judisuary | judiciary | judiciary | judiciary | judiciary | judiciary
kindergarden | kindergarten | kindergarten | kindergarten | kindergarten | kindergarten
knowlegeable | knowledgeable | knowledgeable | knowledgeable | knowledgeable | knowledgeable
labatory | (lavatory, laboratory) | (lavatory, laboratory) | (lavatory, laboratory) | (lavatory, laboratory) | lavatory
lonelyness | loneliness | loneliness | loneliness | loneliness | loneliness
legitamate | legitimate | legitimate | legitimate | legitimate | legitimate
libguistics | linguistics | linguistics | linguistics | linguistics | linguistics
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
lisence | (license, licence) | licence | silence | licence | (licence, license)
mathmatician | mathematician | mathematician | mathematician | mathematician | mathematician
ministery | ministry | ministry | ministry | ministry | ministry
mysogynist | misogynist | misogynist | misogynist | misogynist | misogynist
naturaly | naturally | naturally | naturally | naturally | naturally
ocuntries | countries | countries | countries | countries | countries
paraphenalia | paraphernalia | paraphernalia | paraphernalia | paraphernalia | paraphernalia
Palistian | Palestinian | Alsatain | politian | Palestinian | (Pakistan, politian)
pamflet | pamphlet | pamphlet | pamphlet | pamphlet | partlet
psyhic | psychic | psychic | psychic | psychic | psychic
Peloponnes | Peloponnesus | Peloponnese | Peloponnese | Peloponnesus | Peloponnese
personell | personnel | personnel | personnel | personnel | (personally, personnel)
posseses | possesses | possesses | possesses | possess | possesses
prairy | prairie | priory | prairie | airy | (priory, prairie)
qutie | (quite, quiet) | quite | quite | queue | quite
radify | (ratify, ramify) | ratify | ratify | ramify | (rarify, ratify, ramify)
reccommended | recommended | recommended | recommended | recommended | recommended
reciever | receiver | receiver | receiver | reliever | receiver
reconaissance | reconnaissance | reconnaissance | reconnaissance | reconnaissance | reconnaissance
restauration | restoration | restoration | restoration | instauration | restoration
rigeur | (rigueur, rigour, rigor) | rigger | rigueur | (rigueur, rigour) | rigour
Saterday | Saturday | Saturday | Saturday | Saturday | Saturday
scandanavia | Scandinavia | Scandinavia | Scandinavia | Scandinavia | Scandinavia
scaleable | scalable | scalable | - | scalable | scalable
secceeded | (seceded, succeeded) | succeeded | succeeded | succeeded | succeeded
sepulchure | (sepulchre, sepulcher) | sepulcher | sepulchered | sepulchre | (sepulchre, sepulcher)
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
themselfs | themselves | themselves | themselves | themselves | themselves
throught | (thought, through, throughout) | (thought, through) | (thought, through) | (thought, through, throughout) | (through, thought, throughout)
troups | (troupes, troops) | (troupes, troops) | troupes | troops | (troops, troupes)
simultanous | simultaneous | simultaneous | simultaneous | simultaneous | simultaneous
sincerley | sincerely | sincerely | sincerely | sincerely | sincerely
sophicated | sophisticated | suffocated | supplicated | sophisticate | sophister
surrended | (surrounded, surrendered) | surrounded | surrender | surrounded | (surrender, surrendered, surrounded)
unforetunately | unfortunately | unfortunately | unfortunately | unfortunately | unfortunately
unnecesarily | unnecessarily | unnecessarily | unnecessarily | unnecessarily | unnecessarily
usally | usually | usually | usually | usually | usually
useing | using | using | using | seeing | using
vaccum | vacuum | vacuum | vacuum | vacuum | vacuum
vegitables | vegetables | vegetables | vegetables | vegetables | vegetables
vetween | between | between | between | between | between
volcanoe | volcano | volcano | volcano | volcano | (volcanoes, volcano)
weaponary | weaponry | weaponry | weaponry | weaponry | weaponry
worstened | worsened | worsened | worsened | worsened | worsened
wupport | support | support | support | support | support
yeasr | years | years | years | yeast | years
Yementite | (Yemenite, Yemeni) | Yemenite | Yemenite | Yemenite | Yemenite
yuonger | younger | younger | younger | sponger | younger
Abstract (translated from the Arabic)

Automatic text correction is one of the most important problems associated with human-computer interaction. It enters into many practical applications, both direct, such as correcting the errors that result from converting handwritten texts into digital form, and indirect, such as correcting users' queries before a retrieval operation is performed on an interactive database.

The automatic correction process passes through two main stages: error detection and suggestion of alternatives. Many techniques and methods exist for both stages, and they vary in the accuracy of their results and in their applicability; in general, they are divided into procedural and statistical methods. Procedural methods rely on well-defined rules that govern the acceptability of a text, including natural language processing techniques, whereas statistical methods depend on statistical and probabilistic data usually gathered from huge samples drawn essentially from what circulates among users.

In this system, natural language processing techniques were adopted as the basis for analyzing English texts and checking their lexical and grammatical acceptability. A dictionary comprising all the vocabulary of the English language was used for detecting and identifying spelling errors. Owing to the huge size of this dictionary, a hash function and an indexing method were employed to narrow the search space for the sought words and to provide random-access capability based on word prefixes, thereby shortening the search time.

Generation of alternatives relies on computing the degree of similarity between the input word and every dictionary word, and re-ranking the words according to this measure, which is calculated using a modified Levenshtein method. Because this generation process requires a long time, the dictionary words were partitioned into small groups, while retaining random-access capability, according to criteria that depend on the spelling of the source word. Suggesting alternatives involves testing a set of features related, to some extent, to the nature of the most common errors. The system then selects the optimal alternative, the one achieving the highest compatibility with the source word, provided that it does not conflict with the grammar rules, so that the corrected text is lexically and grammatically acceptable.

Accuracy tests showed that the proposed system outperforms Microsoft Word and other systems; furthermore, the modified string-similarity method approximately preserved the original time complexity while gaining the ability to detect an additional type of spelling error.
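The pipeline summarized above (dictionary lookup to detect errors, partitioning the dictionary to narrow the search, and ranking candidates by edit-distance similarity) can be sketched as follows. This is a minimal illustration only: the tiny word list, the plain (unmodified) Levenshtein distance, and first-letter bucketing are stand-ins for the thesis's full dictionary, modified similarity measure, and hash/prefix-index scheme.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def build_index(words):
    """Partition the dictionary by first letter to narrow the search space
    (a crude stand-in for the prefix-hash indexing the abstract describes)."""
    index = {}
    for w in words:
        index.setdefault(w[0], []).append(w)
    return index

def suggest(word, index, k=3):
    """Return up to k dictionary words closest to `word` by edit distance."""
    bucket = index.get(word[0], [])          # search only the matching bucket
    return sorted(bucket, key=lambda w: levenshtein(word, w))[:k]

# Hypothetical miniature dictionary for demonstration.
dictionary = ["aberration", "about", "achieve", "absorption", "against", "agreement"]
index = build_index(dictionary)
print(suggest("acheive", index))   # "achieve" ranks first (distance 2)
```

Ranking a whole bucket costs O(n log n) in the bucket size on top of the distance computations, which is the same motivation the abstract gives for splitting the dictionary into small groups while keeping random access to them.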
Arabic title page (translated)

Intelligent Text Document Correction System Based on Similarity Technique

A thesis submitted to the Council of the College of Information Technology, University of Babylon, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By
Marwa Kadhim Obeid Al-Rikaby

Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry

2015 A.D. / 1436 A.H.

Ministry of Higher Education and Scientific Research
University of Babylon - College of Information Technology
Software Department