Intelligent Text Document
Correction System Based on
Similarity Technique
A Thesis
Submitted to the Council of the College of Information Technology,
University of Babylon in Partial Fulfillment of the Requirements
for the Degree of Master of Science in Computer Science.
By
Marwa Kadhim Obeid Al-Rikaby
Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry
2015 A.D. 1436 A.H.
Ministry of Higher Education and
Scientific Research
University of Babylon- College of Information
Technology
Software Department
In the name of Allah, the Most Gracious, the Most Merciful
{By it Allah guides those who pursue His good pleasure to the ways of peace, brings them out of darkness into the light by His permission, and guides them to a straight path.}
Almighty Allah has spoken the truth.
Surah Al-Ma'idah, Verse 16
Supervisor Certification
I certify that this thesis was prepared under my supervision at the Department of
Software / Information Technology / University of Babylon, by Marwa
Kadhim Obeid Al-Rikaby, in partial fulfillment of the requirements for the
degree of Master of Science in Computer Science.
Signature:
Supervisor : Prof. Dr. Abbas Mohsen Al-Bakry
Title : Professor.
Date : / / 2015
The Head of the Department Certification
In view of the available recommendation, we forward this thesis for debate by
the examining committee.
Signature:
Name : Dr. Eman Salih Al-Shamery
Title: Assist. Professor.
Date: / / 2015
To
Master of creatures,
Loved by Allah,
The Prophet Muhammad
(Allah bless him and his family)
Acknowledgements
All praise be to Allah Almighty who enabled me to complete this task
successfully and utmost respect to His last Prophet Mohammad PBUH.
First, my appreciation is due to my advisor Prof. Dr. Abbas Mohsen Al-
Bakry, for his advice and guidance that led to the completion of this thesis.
I would like to thank the staff of the Software Department for the help they
have offered, especially, the head of the Software Department Dr. Eman Salih
Al-Shamery.
Most importantly, I would like to thank my parents, my sisters, my brothers
and my friends for their support.
Abstract
Automatic text correction is one of the challenges of human-computer
interaction. It is directly involved in several application areas, such as correcting
digitized handwritten text, and indirectly in others, such as correcting users'
queries before a retrieval process is applied in interactive databases.
The automatic text correction process passes through two major phases: error
detection and candidate suggestion. Techniques for both phases are categorized
into procedural and statistical. Procedural techniques are based on rules that
govern the acceptability of texts, and include natural language processing
techniques. Statistical techniques, on the other hand, depend on statistics and
probabilities collected from large corpora, based on what is commonly used by
humans.
In this work, natural language processing techniques are used as the basis for
analysis and for both spelling and grammar acceptance checking of English texts. A
prefix-dependent hash-indexing scheme is used to shorten the time of looking up
the dictionary, which contains all English tokens. The dictionary is used
as the basis of the error detection process.
Candidate generation is based on calculating the similarity of the source token
to the dictionary tokens, measured using an improved Levenshtein method, and
ranking them accordingly. However, this process is time intensive; therefore, tokens
are divided into smaller groups according to spelling similarity, in a way that
preserves random-access availability. Finally, candidate suggestion involves examining
a set of features related to commonly committed mistakes. The system selects the
optimal candidate, the one that provides the highest suitability without violating
grammar rules, to generate linguistically accepted text.
Testing the system's accuracy showed better results than Microsoft Word and
some other systems. The enhanced similarity measure reduced the time complexity
to the boundaries of the original Levenshtein method while discovering an
additional error type.
Table of Contents
Subject  Page No.
Chapter One : Overview
1.1 Introduction 1
1.2 Problem Statement 3
1.3 Literature Review 5
1.4 Research Objectives 10
1.5 Thesis Outlines 11
Chapter Two: Background and Related Concepts
Part I: Natural Language Processing 12
2.1 Introduction 12
2.2 Natural Language Processing Definition 12
2.3 Natural Language Processing Applications 13
2.3.1 Text Techniques 14
2.3.2 Speech Techniques 15
2.4 Natural Language Processing and Linguistics 16
2.4.1 Linguistics 16
2.4.1.1 Terms of Linguistic Analysis 17
2.4.1.2 Linguistic Units Hierarchy 19
2.4.1.3 Sentence Structure and Constituency 19
2.4.1.4 Language and Grammar 20
2.5 Natural Language Processing Techniques 22
2.5.1 Morphological Analysis 22
2.5.2 Part of Speech Tagging 23
2.5.3 Syntactic Analysis 26
2.5.4 Semantic Analysis 27
2.5.5 Discourse Integration 27
2.5.6 Pragmatic Analysis 28
2.6 Natural Language Processing Challenges 28
2.6.1 Linguistics Units Challenges 28
2.6.1.1 Tokenization 28
2.6.1.2 Segmentation 29
2.6.2 Ambiguity 31
2.6.2.1 Lexical Ambiguity 31
2.6.2.2 Syntactic Ambiguity 31
2.6.2.3 Semantic Ambiguity 32
2.6.2.4 Anaphoric Ambiguity 32
2.6.3 Language Change 32
2.6.3.1 Phonological Change 33
2.6.3.2 Morphological Change 33
2.6.3.3 Syntactic Change 33
2.6.3.4 Lexical Change 33
2.6.3.5 Semantic Change 34
Part II: Text Correction 35
2.7 Introduction 35
2.8 Text Errors 35
2.8.1 Non-words Errors 36
2.8.2 Real-word Errors 36
2.9 Error Detection Techniques 37
2.9.1 Dictionary Looking Up 37
2.9.1.1 Dictionaries Resources 37
2.9.1.2 Dictionaries Structures 38
2.9.2 N-gram Analysis 39
2.10 Error Correction Techniques 40
2.10.1 Minimum Edit Distance Techniques 40
2.10.2 Similarity Key Techniques 43
2.10.3 Rule Based Techniques 43
2.10.4 Probabilistic Techniques 43
2.11 Suggestion of Corrections 44
2.12 The Suggested Approach 44
2.12.1 Finding Candidates Using Minimum Edit Distance 45
2.12.2 Candidates Mining 45
2.12.3 Part-of-Speech Tagging and Parsing 46
Chapter Three : Hashed Dictionary and Looking Up Technique
3.1 Introduction 48
3.2 Hashing 48
3.2.1 Hash Function 49
3.2.2 Formulation 52
3.2.3 Indexing 53
3.3 Looking Up Procedure 56
3.4 Dictionary Structure Properties 58
3.5 Similarity Based Looking-Up 59
3.5.1 Bi-grams Generation 60
3.5.2 Primary Centroids Selection 62
3.5.3 Centroids Referencing 63
3.6 Application of Similarity Based Looking up approach 64
3.7 The Similarity Based Looking up Properties 67
Chapter Four : Error Detection and Candidates Generation
4.1 Introduction 69
4.2 Non-word Error Detection 69
4.3 Real-Words Error Detection 71
4.4 Candidates Generation 72
4.4.1 Candidates Generation for Non-word Errors 72
4.4.1.2 Enhanced Levenshtein Method 74
4.4.1.3 Similarity Measure 78
4.4.1.4 Looking for Candidates 79
4.4.2 Candidates Generation for Real-words Errors 81
Chapter Five : Text Correction and Candidates Suggestion
5.1 Introduction 82
5.2 Correction and Candidates Suggestion Structure 82
5.3 Named-Entity Recognition 85
5.4 Candidates Ranking 86
5.4.1 Edit Distance Based Similarity 87
5.4.2 First and End Symbols Matching 87
5.4.3 Difference in Lengths 88
5.4.4 Transposition Probability 89
5.4.5 Confusion Probability 90
5.4.6 Consecutive Letters (Duplication) 91
5.4.7 Different Symbols Existence 92
5.5 Syntax Analysis 93
5.5.1 Sentence Phrasing 93
5.5.2 Candidates Optimization 95
5.5.3 Grammar Correction 95
5.5.4 Document Correction 97
Chapter Six: Experimental Results, Conclusions, and Future Works
6.1 Experimental Results 98
6.1.1 Tagging and Error Detection Time Reduction 98
6.1.1.1 Successful Looking Up 99
6.1.1.2 Failure Looking Up 100
6.1.2 Candidates Generation and Similarity Search Space Reduction 101
6.1.3 Time Reduction of the Damerau-Levenshtein method 103
6.1.4 Features Effect on Candidates Suggestion 104
6.2 Conclusions 107
6.3 Future Works 108
References 110
Appendix A 117
Appendix B 122
List of Figures
Figure No.  Title  Page No.
(2.1) NLP dimensions 16
(2.2) Linguistics analysis steps 17
(2.3) Linguistic Units Hierarchy 19
(2.4) Classification of POS tagging models 24
(2.5) An example of lexical change 34
(2.6) Outlines of Spell Correction Algorithm 38
(2.7) Levenshtein Edit Distance Algorithm 41
(2.8) Damerau-Levenshtein Edit Distance Algorithm 42
(2.9) The Suggested System Block Diagram 47
(3.1) Token Hashing Algorithm 54
(3.2) Dictionary Structure and Indexing Scheme 55
(3.3) Algorithm of Looking Up Procedure 57
(3.4) Semi Hash Clustering block diagram 61
(3.5) Similarity Based Hashing algorithm 64
(3.6) Block diagram of candidates generation using SBL 66
(3.7) Similarity Based Looking up algorithm 68
(4.1) Tagging Flow Chart 70
(4.2) The Enhanced Levenshtein Method Algorithm 76
(4.3) Original Levenshtein Example 77
(4.4) Damerau-Levenshtein Example 77
(4.5) Enhanced Levenshtein Example 78
(5.1) Candidates ranking flowchart 84
(5.2) Syntax analysis flowchart 94
(6.1) Tokens distribution in primary packets 99
(6.2) Tokens distribution in secondary packets 99
(6.3) Time complexity Variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification) 103
(6.4) Suggestion Accuracy with a comparison to Microsoft Office Word on a Sample from the Wikipedia 104
(6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset 105
(6.6) Discarding one feature at a time for optimal candidate selection 106
(6.7) Using one feature at a time for optimal candidate selection 107
List of Tables
Table No.  Title  Page No.
(1-1) Summary of Literature Review 9
(3-1) Alphabet Encoding 50
(3-2) Addressing Range 52
(3-3) Predicting errors using Bi-grams analysis 61
(5-1) Transposition Matrix 90
(5-2) Confusion Matrix 91
List of Symbols and Abbreviations
Abbreviation  Meaning
∑  Alphabet
A  Adjectival Phrase
abs  Absolute Difference
C  Sentence Complement
CFG  Context Free Grammar
D  Dictionary
DNA  Deoxyribonucleic Acid
E  Error
G  Grammar
GEC  Grammar Error Correction
HMM  Hidden Markov Model
IR  Information Retrieval
MT  Machine Translation
NE  Named Entity
NER  Named-Entity Recognition
NG  Noun Group
NLG  Natural Language Generation
NLP  Natural Language Processing
NLs  Natural Languages
NLU  Natural Language Understanding
NP  Noun Phrase
O( )  big-Oh notation (= at most)
OCR  Optical Character Recognition
P  Production Rule
POS  Part Of Speech
PP  Prepositional Phrase
Q  Query
R  Ranking Value
R_Dist  Relative Distance
S  Start Symbol
SMT  Statistical Machine Translation
SR  Speech Recognition
St1, St2  String1, String2
V  Variable
v  Adverbial Phrase
VP  Verb Phrase
Ω( )  big-Omega notation (= at least)
Chapter One
Overview
1
Chapter One
Overview
1.1 Introduction
Natural Language Processing, also known as Computational Linguistics,
is the field of computer science that deals with linguistics; it is a form of
human-computer interaction in which the elements of human language are
formalized so that they can be processed by a computer [Ach14]. Natural
Language Processing (NLP) is the implementation of systems that are
capable of manipulating and processing natural language (NL)
sentences [Jac02] in languages like English, Arabic, and Chinese, as opposed
to formal languages like Python, Java, and C++, or descriptive languages such
as DNA in biology and chemical formulas in chemistry [Mom12]. The task of
NLP is designing and building software for analyzing, understanding, and
generating spoken and/or written NLs. [Man08] [Mis13]
NLP has many applications such as automatic summarization, Machine
Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition
(SR), Optical Character Recognition (OCR), Information Retrieval (IR),
Opinion Mining [Nad11], and others [Wol11].
Text correction is another significant application of NLP. It includes
both spell checking and Grammar Error Correction (GEC). Spell-checking
research dates back to the mid-20th century, with early work by Les Earnest
at Stanford University, but the first application was created in 1971 by Ralph
Gorin, Earnest's student, for the DEC PDP-10 mainframe, with a dictionary of
10,000 English words. [Set14] [Pet80]
Grammar error correction, despite its central role in semantic and
meaning representation, has been largely ignored by the NLP community. In recent
years, an improvement has been noticed in automatic GEC techniques. [Voo05]
[Jul13] However, most of these techniques are limited to specific domains
such as real-word spell correction [Hwe14], subject-verb disagreement
[Han06], verb tense misuse [Gam10], determiner or article errors, and improper
preposition usage. [Tet10] [Dah11]
Different techniques such as edit distance [Wan74], rule-based techniques
[Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98],
probabilistic techniques [Chu91], neural nets [Hod03], and the noisy channel
model [Tou02] have been proposed for text correction purposes. Each
technique needs some sort of resource: edit distance, rule-based, and
similarity key techniques require a dictionary (or lexicon); n-gram and
probabilistic techniques work with statistical and frequency information; neural
nets are trained with patterns; and so on.
Text correction, covering both spelling and grammar, is an extensive process
that typically includes three major steps: [Ach14] [Jul13]
The first step is to detect the incorrect words. The most popular way to
decide whether a word is misspelled is to look it up in a dictionary, a list of
correctly spelled words. This approach can detect non-word errors but not
real-word errors [Kuk92] [Mis13], because an unintended word may still match a
word in the dictionary. NLs have a large number of words, resulting in a
huge dictionary; therefore, looking up every word consumes a long
time. In GEC this step is more complicated: it requires analysis at the level
of sentences and phrases, using computational linguistics fundamentals to
detect the word that makes the sentence incorrect.
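The dictionary-lookup step described above can be sketched as follows; the word list here is a tiny illustrative stand-in (the actual system uses a dictionary of more than 300,000 tokens):

```python
# A minimal sketch of dictionary-based non-word error detection.
# The word set below is an illustrative assumption, not the real lexicon.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def detect_non_word_errors(text):
    """Return the tokens that do not appear in the dictionary."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in DICTIONARY]

# A real-word error (e.g. "form" typed instead of "from") would pass
# this check unnoticed, which is why grammar-level analysis is needed.
print(detect_non_word_errors("the quick brwon fox"))  # ['brwon']
```

As the comment notes, membership testing alone catches only non-word errors; real-word errors require the sentence-level analysis described above.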
Next, a list of candidates or alternatives should be generated for the
incorrect word (misspelled or misused). This list should preferably be short and
contain the words with the highest similarity or suitability; producing it requires
a technique that calculates the similarity of the incorrect word to
every word in the dictionary. Efficiency and accuracy are major factors in
the selection of such a technique. GEC requires broad knowledge of diverse
grammatical error categories and extensive linguistic technique to identify
alternatives, because a grammatical error may not result from a single
word.
Finally, the intended word, or a list of alternatives containing the
intended word, is suggested. This task requires ranking the words according to
their degree of similarity to the incorrect word; other considerations may
or may not be taken into account depending on the technique in use.
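The similarity calculation at the heart of candidate generation is commonly done with the Levenshtein edit distance. The following sketch ranks a handful of dictionary words against a misspelling; the word list and the top-k cutoff are assumptions for demonstration only:

```python
def levenshtein(s1, s2):
    """Classic dynamic-programming edit distance: insertions,
    deletions, and substitutions each cost one edit."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def rank_candidates(misspelled, dictionary, k=3):
    """Rank dictionary words by ascending edit distance to the error."""
    return sorted(dictionary, key=lambda w: levenshtein(misspelled, w))[:k]

words = ["receive", "recipe", "relieve", "deceive", "receipt"]
print(rank_candidates("recieve", words))
```

Note that plain Levenshtein charges the i/e swap in "recieve" as two substitutions, so "relieve" (one substitution away) outranks "receive"; this is exactly the limitation that motivates treating transposition as a single error, as in the Damerau extension discussed later.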
Text mining techniques have started to enter the area of text correction;
clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11], and
information retrieval [Kir11] are examples. Statistics and probability have also
played a great role, specifically in analyzing common mistakes and n-gram
datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and
phonetic levels, can be used to reduce the lookup space; NER may help
avoid interpreting proper nouns as misspellings; and statistics have been merged
with NLP techniques to provide more precise parsing and POS tagging, usually
in context-dependent applications. The application of a given technique
differs according to the intended level of correction: it starts at the
character level [Far14]; passes through the word, phrase (usually in GEC), and
sentence levels; and ends at the context or document-subject level.
1.2 Problem Statement
Although many text checking and correction systems have been produced,
each has its own variances in terms of input quality restrictions, techniques
used, output accuracy, speed, performance conditions, etc. [Ahm09]
[Pet80]. This field of NLP is truly open research from all sides, because
no complete algorithm or technique handles all considerations.
Limited linguistic knowledge, the huge number of lexicon entries, extensive
grammar, language ambiguity and change over time, the variety of
committed errors, and computational requirements are all challenges facing the
development of a text correction application.
In this work, some of the above-mentioned problems are addressed using a
set of solutions:
 Integrating two lexicon datasets (WordNet and Ispell).
 Using a brute-force approach to solve some sorts of ambiguity.
 Applying hashing and indexing in looking up the dictionary.
 Reducing the search space of the candidate collection process by
grouping similarly spelled words into semi-clusters.
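The hashing-and-indexing idea in the list above can be illustrated with a simplified prefix-bucket index. The two-letter bucket key here is an illustrative assumption, not the thesis's actual hash function, which is defined in Chapter Three:

```python
from collections import defaultdict

def build_index(words, prefix_len=2):
    """Group dictionary words into buckets keyed by their prefix
    (a stand-in for the prefix-dependent hash function)."""
    index = defaultdict(set)
    for w in words:
        index[w[:prefix_len]].add(w)
    return index

def lookup(index, word, prefix_len=2):
    """Membership test that inspects only one bucket,
    not the whole dictionary."""
    return word in index[word[:prefix_len]]

index = build_index(["cat", "car", "care", "dog", "dig"])
print(lookup(index, "car"), lookup(index, "cag"))  # True False
```

Because each lookup touches a single bucket, the cost depends on bucket size rather than total dictionary size, which is the point of the prefix-dependent scheme.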
The Levenshtein method [Hal11] is also enhanced to consider Damerau's
four types of errors within a shorter time than the Damerau-Levenshtein
method [Hal11]. Named-Entity Recognition, letter confusion and
transposition, and the effect of candidate length are used as features to optimize
candidate suggestion, in addition to applying rules of Part-Of-Speech tags
and sentence constituency to check a sentence's grammatical correctness,
whether or not it is lexically correct.
The proposed system consists of three components: (1) a spell error
detector based on a fast looking-up technique over a dictionary of more than
300,000 tokens, constructed by applying a string-prefix-dependent hash
function and indexing method; the grammar error detector is a brute-force
parser. (2) For candidate generation, an enhancement was implemented on
the Levenshtein method to consider Damerau's four error types; the enhanced
method is used to measure similarity according to the minimum edit distance and
the effect of the difference in lengths, and the dictionary tokens are grouped
into spelling-based clusters to reduce the search space. (3) Candidate suggestion
exploits NER features, transposition error and confusion statistics, affix
analysis (including first and last letter matching), candidate length, and
parsing success.
1.3 Literature Review
 Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to
string transformation that includes a model consisting of rules and weights for
training, and an algorithm that depends on scoring and ranking according to a
conditional probability distribution for generating the top-k candidates at
the character level, where both high- and low-frequency words can be
generated. Spell checking is one of many applications to which the
approach was applied; the misspelled strings (words or characters) are
transformed, by applying a number of operators, into the k most similar
strings in a dictionary (start and end letters are constants). [Ach14]
 Mariano F., Zheng Y., and others, 2014, tackled the correction of
grammatical errors by pipelining processes that combine results from
multiple systems. The components of the approach are: a rule-based error
corrector that uses rules automatically derived from the Cambridge Learner
Corpus, based on N-grams that have been annotated as incorrect; an
SMT system that translates incorrectly written English into correct English;
NLTK1, used to perform segmentation, tokenization, and POS
tagging; candidate generation, which produces all possible combinations
of corrections for the sentence, in addition to the sentence itself to
consider the "no correction" option; finally, the candidates are ranked
using a language model. [Fel14]
__________________________________________________________
1 The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research
and teaching in computational linguistics and natural language processing. NLTK is written in Python and
distributed under the GPL open source license. Over the past year the toolkit has been rewritten,
simplifying many linguistic data structures and taking advantage of recent enhancements in the Python
language.
 Anubhav G., 2014, presented a rule-based approach that used two POS
taggers, the Stanford parser and TreeTagger, to correct non-native English
speakers' grammatical errors. The detection of errors depends on the
outputs of the two taggers: if they differ, the sentence is not correct.
Errors are corrected using the NodeBox English Linguistics library. Error
correction includes subject-verb disagreement, verb form errors, and errors
detected by POS tag mismatch. [Gup14]
 Stephan R., 2013, proposed a model for spelling correction based on
treating words as "documents" and spell correction as a form of
document retrieval, in that the model retrieves the best-matching correct
spelling for a given input. The words are transformed into tiny documents of
bits, and Hamming distance is used to predict the closest string of bits
from a dictionary holding the correctly spelled words as strings of bits.
The model is knowledge-free and contains only a list of correct words.
[Raa13]
 Youssef B., 2012, produced a parallel spell-checking algorithm for
spelling error detection and correction. The algorithm is based on
information from the Yahoo! N-Grams Dataset 2.0; it is a shared-memory
model allowing concurrency among threads on both parallel multi-processor
and multi-core machines. The three major components (error
detector, candidate generator, and error corrector) are designed to run in
parallel. The error detector, based on unigrams, detects non-word
errors; the candidate generator is based on bi-grams; and the error corrector,
which is context sensitive, is based on 5-gram information. [Bas12]
 Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and
Gary G. L., 2012, presented a novel method for grammatical error
correction by building a meta-classifier. The meta-classifier decides the
final output depending on the internal results from several base
classifiers; they used multiple grammatical-error-tagged corpora with
different properties in various aspects. The method focused on articles,
and correction arises only when a mismatch occurs with the
observed articles. [Seo12]
 Kirthi J., Neeju N.J., and P. Nithiya, 2011, proposed a semantic
information retrieval system that performs automatic spell correction on
user queries before applying the retrieval process. The correction
procedure depends on matching the misspelled word against a dictionary of
correctly spelled words using the Levenshtein algorithm. If an incorrect
word is encountered, the system retrieves the most similar word
depending on the Levenshtein measure and the occurrence frequency of
the misspelled word. [Kir11]
 Farag, Ernesto, and Andreas, 2008, developed a language-independent
spell checker. It is based on enhancing the N-gram model by
creating a ranked list of correction candidates derived from N-gram
statistics and lexical resources, then selecting the most promising
candidates as correction suggestions. Their algorithm assigns weights to
the possible suggestions to detect non-word errors. They relied on a
"MultiWordNet" dictionary of about 80,000 entries. [Ahm09]
 Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of
real-word spelling error correction. They assumed that the observed
sentence is a signal passed through a noisy channel, where the channel
reflects the typist and the distortion reflects errors committed by the
typist. The probability of the sentence's correctness, given by the channel
(typist), is a parameter associated with that sentence. The probability of
every word in the sentence being the intended one is equivalent to the
sentence correctness probability, and each word is associated with a set of
spelling-variant words excluding the word itself. Correction can be applied
to one word in the sentence by replacing the incorrect word with another
from its candidate set (its real-word spelling variations) so that it gives
the maximum probability. [Amb08]
 Stoyan, Svetla, and others, 2005, described an approach for lexical post-
correction of the output of an optical character recognizer (OCR) as a two-part
research project. They worked on multiple fronts: on the dictionary side,
they enriched their large dictionaries with specialty dictionaries; for
candidate selection, they used a very fast search algorithm that depends on
Levenshtein automata to efficiently select correction candidates with a
distance bound not exceeding 3; and they ranked candidates depending on a
number of features such as frequency and edit distance. [Mih04]
 Suzan V., 2002, described a context-sensitive spell-checking algorithm
based on the BESL spell checker lexicons and word trigrams for
detecting and correcting real-word errors using probability information.
The algorithm splits the input text into trigrams, and every trigram is
looked up in a precompiled database containing a list of trigrams and
their occurrence counts in the corpus used to compile the database. A
trigram is correct if it is in the trigram database; otherwise, it is considered
an erroneous trigram containing a real-word error. The correction
algorithm uses the BESL spell checker to find candidates, but those most
frequent in the trigram database are suggested to the user. [Ver02]
No. Reference Methodology Technique
1
[Ach14] Generating the top K-
candidates at the
character level for both
high and low frequency.
A model consists of rules and
weights, and a conditional
probability distribution
dependent algorithm
2
[Fel14] Grammatical errors
correction based on
generating all possible
correct alternatives for
the sentence
Combining the results of
multiple systems: rule based
error corrector, SMT English
to Correct English translator,
and NLTK for segmentation,
tokenization and tagging
3
[Gup14] Non-native English
speakers' grammatical
errors correction
Error detection used Stanford
parser and Tree Tagger.
Correction based on
Nodebox English Linguistic
library
4
[Raa13] Dictionary based Spell
correction treats the
misspelled word as a
document.
Converting the misspelled
word into a tiny document of
bits and retrieving the most
similar documents using
Hamming Distance
5
[Bas12] Context sensitive spell
checking using a shared
memory model allowing
concurrency among
threads for parallel
execution
Different N-grams levels for
error detection, candidates
generation, and candidates
suggestion depending on
Yahoo! N-Grams dataset 2.0
6
[Seo12] Meta-classifier for
grammatical errors
correction focused
mainly on the articles.
Deciding the output
depending on the internal
results from several base
classifiers
7
[Kir11] Automatic spell
correction for user
queries before applying
retrieval process
Using Levenshtein algorithm
for both error detection and
correction in a dictionary
looking up technique
Table 1.1: Summary of Literature Review
8
[Ahm09] Language independent
model for non-word error
correction based on N-
gram statistics and lexical
resources
Ranking a list of correction
candidates by assigning
weights to the possible
suggestions depending on a
"MultiWordNet" dictionary
of about 80,000 entries
9
[Amb08] Noisy channel model for
Real words error
correction based on
probability.
Channel represents the typist,
distortion represents the
error, and the noise
probability is a parameter
10
[Mih04] OCR output post
correction
Levenshtein automata for
candidates generation and
frequency for ranking
11
[Ver02] Context sensitive spell
checking algorithm based
on tri-grams
Splitting texts into word
trigrams and matching them
against the precompiled
BESL spell checker lexicons,
suggestion depends on
probability information.
1.4 Research Objectives
This research attempts to design and implement a smart text-document
correction system for English texts. It is based on mining a typed
text to detect spelling and grammar errors and giving the optimal
suggestion(s) from a set of candidates. Its steps are:
1. Analyzing the given text using Natural Language Processing
techniques, detecting erroneous words at each step.
2. Looking up candidates for the erroneous words and ranking them
according to a given set of features and conditions to form the initial
solutions.
3. Optimizing the initial solutions depending on the information
extracted from the given text and the detected errors.
4. Recovering the input text document with the optimal solutions and
associating the best set of candidates with each detected incorrect
word.
1.5 Thesis Outlines
The next five chapters are:
1. Chapter Two: "Background and Related Concepts" consists of two
parts. The first overviews NLP fundamentals, applications, and
techniques, whereas the second is about text correction techniques.
2. Chapter Three: "Dictionary Structure and Looking up Technique"
describes the suggested approach to constructing the system's dictionary
for both perfect-match and similarity looking up.
3. Chapter Four: "Error Detection and Candidates Generation" presents
the suggested technique for indicating incorrect words and the method
of generating candidates.
4. Chapter Five: "Automatic Text Correction and Candidates
Suggestion" describes the techniques of suggestion selection and
optimization.
5. Chapter Six: "Experimental Results, Conclusion, and Future Works"
presents the experimental results of applying the techniques described in
chapters three, four, and five, the conclusions of the system, and future
directions.
Chapter Two
Background
and
Related Concepts
 12 
Chapter Two
Background and Related Concepts
Part I
Natural Language Processing
2.1 Introduction
Natural Language Processing (NLP) began in the late 1940s, focused
on machine translation; in 1958, NLP was linked to
information retrieval by the Washington International Conference of
Scientific Information [Jon01]. Primary ideas for developing applications
for detecting and correcting text errors started in that period.
[Pet80] [Boo58]
Natural Language Processing has attracted great interest from that time to
the present because it plays an important role in the interaction between
humans and computers. It represents the intersection of linguistics and
artificial intelligence [Nad11], where machines can be programmed to
manipulate natural language.
2.2 Natural Language Processing Definition
"Natural Language Processing (NLP) is the computerized approach
for analyzing text that is based on both a set of theories and a set of
technologies." [Sag13]
NLP describes the function of software or hardware components in a
computer system that are capable of analyzing or synthesizing human
languages (spoken or written) [Jac02] [Mis13] like English, Arabic, and
Chinese, as opposed to formal languages like Python, Java, and C++, or
descriptive languages such as DNA in biology and chemical formulas in
chemistry [Mom12].
"NLP is a tool that can reside inside almost any text processing
software application" [Wol11]
We can define NLP as a subfield of Artificial Intelligence that
encompasses everything a computer needs to understand and generate
natural language. It is based on processing human language for two tasks:
the first receives a natural language input (text or speech), applies analysis,
reasons about what was meant by that input, and produces output in a
computer language; this is the task of Natural Language Understanding
(NLU). The second task is to generate human sentences according to
specific considerations; the input is in a computer language but the output
is in a human language. This is called Natural Language Generation
(NLG). [Raj09]
"Natural Language Understanding is associated with the more
ambitious goals of having a computer system actually comprehend natural
language as a human being might". [Jac02]
2.3 Natural Language Processing Applications
Despite its wide usage in computer systems, NLP has almost entirely
disappeared into the background, where it is invisible to the user yet adds
significant business value. [Wol11]
The major distinction of NLP applications from other data
processing systems is that they use Language Knowledge. Natural
Language Processing applications are mainly divided into two categories
according to the given NL format [Mom12] [Wol11]:
2.3.1 Text Technologies
 Spell and Grammar Checking: systems deal with indicating
lexical and grammar errors and suggest corrections.
 Text Categorization and Information Filtering: In such
applications, NLP represents the documents linguistically and
compares each one to the others. In text categorization, the
documents are grouped, according to the characteristics of their
linguistic representation, into several categories. Information
filtering singles out, from a collection of documents, those
documents that satisfy some criterion.
 Information Retrieval: finds and collects information relevant to
a given query. A user expresses an information need by a query;
the system then attempts to match the given query to the database
documents that satisfy it. Query and documents are transformed
into a sort of linguistic structure, and the matching is performed
accordingly.
 Summarization: according to an information need or a query
from the user, this type of applications finds the most relevant
part of the document.
 Information Extraction: refers to the automatic extraction of
structured information, such as entities, their relationships, and
the attributes describing them, from unstructured sources. This
makes it possible to integrate structured and unstructured data
sources, if both exist, and to pose queries spanning the
integrated information, giving better results than applying
searches by keywords alone.
 Question Answering: works with plain speech or text input and
applies an information search based on the input; for example,
IBM® Watson™, the reigning JEOPARDY! champion, reads
questions, understands their intention, and then looks up its
knowledge library to find a match.
 Machine Translation: translates a given text from one natural
language into another; some applications have the ability to
recognize the language of the given text even if the user did not
specify it correctly.
 Data Fusion: Combining extracted information from several text
files into a database or an ontology.
 Optical Character Recognition: digitizing handwritten and
printed texts, i.e., converting characters from images into digital
codes.
 Classification: this type of NLP application sorts and organizes
information into relevant categories, as in e-mail spam filters and
the Google News™ news service.
 NLP has also entered other applications such as educational
essay test-scoring systems, voice-mail phone trees, and even e-
mail spam detection software.
2.3.2 Speech Technologies
 Speech Recognition: mostly used on telephone voice response
systems as a service client. Its task is processing plain speech. It
is also used to convert speech into text.
 Speech Synthesis: means converting text into speech. This
process requires working at the level of phones and converting
from alphabetic symbols into sound signals.
2.4 Natural Language Processing and Linguistics
Natural Language Processing is concerned with three dimensions:
language, algorithm, and problem, as presented in figure (2.1). On the
language dimension, NLP considers linguistics; the algorithm dimension
covers NLP techniques and tasks, while the problem dimension depicts
the mechanisms applied to solve problems. [Bha12]
2.4.1 Linguistics
Natural Language is a means of communication. It is a system of
arbitrary signals such as voice sounds and written symbols. [Ali11]
Linguistics is the scientific study of language; it starts from the simple
acoustic signals which form sounds and ends with pragmatic understanding
to produce the full contextual meaning.
There are two major levels of linguistic analysis, Speech Recognition
(SR) and Natural Language Processing (NLP), as shown in figure (2.2).
Figure (2.1) : NLP dimensions [Bha12]
2.4.1.1 Terms of Linguistic Analysis
A natural language, as a formal language does, has a set of basic
components that may vary from one language to another but remain
bounded under specific considerations, giving each language its special
characteristics.
From the computational view, a language is a set of strings generated
over a finite alphabet and can be described by a grammar. The definition
of the three abstracted names depends on the language itself; i.e.,
strings, alphabet, and grammar formulate and characterize a language.
Figure (2.2) : Linguistics analysis steps [Cha10] (from the acoustic signal
through phones, letters and strings, morphemes, words, and phrases and
sentences, up to meaning out of context and meaning in context; the early
levels fall under SR and the later ones under NLP, covering phonetics,
phonology, the lexicon, morphology, syntax, semantics, and pragmatics)
 Strings:
In natural language processing, the strings are the morphemes of the
language, their combinations (words), and the combinations of their
combinations (sentences), but linguistics goes somewhat deeper than this.
It starts with phones, the primitive acoustic patterns, which are significant
and distinguishable from one natural language to another. Phonology
groups phones together to produce phonemes, represented by symbols.
Morphemes consist of one or more symbols; thus, NLs can be further
distinguished.
 Alphabet:
When individual symbols, usually thousands, represent words, the
language is "logographic"; if the individual symbols represent syllables,
it is a "syllabic" one. When they represent sounds, the language is
"alphabetic". Syllabic and alphabetic languages typically have fewer than
100 symbols, unlike logographic ones.
English is an alphabetic language system consisting of 26 symbols;
these symbols represent phones, which combine into morphemes, which
may or may not be combined further to form words.
 Grammar:
Grammar is a set of rules specifying the legal structure of the
language; it is a declarative representation of the language's syntactic
facts. Usually, a grammar is represented by a set of production rules.
2.4.1.2 Linguistic Units Hierarchy
Language can be divided into pieces; there is a typical structure or
form for every level of analysis. Those pieces can be put into a hierarchical
structure starting from a meaningful sentence at the top level and
proceeding through the separation of building units until reaching the
primary acoustic sounds. Figure (2.3) presents an example.
Figure (2.3) : Linguistic Units Hierarchy
2.4.1.3 Sentence Structure and Constituency
"It is constantly necessary to refer to units smaller than the sentence
itself units such as those which are commonly referred as CLAUSE,
PHRASE, WORD, and MORPHEME. The relation between one unit and
another unit of which it is a part is CONSTITUENCY." [Qui85]
Figure (2.3) decomposes the example sentence "The teacher talked to
the students" level by level: the whole sentence, its phrases, its words, its
morphemes ("The teach-er talk-ed to the student-s"), and finally its
phonemes, transcribed with the latest codes of English phones adopted by
the OXFORD dictionaries.
The task of dividing a sentence into constituents is a complex one,
requiring the incorporation of more than one analysis stage; tokenization,
segmentation, parsing (and sometimes stemming) are usually merged
together to build the parse tree for a given sentence.
2.4.1.4 Language and Grammar
A language is a 'set' of sentences and a sentence is a 'sequence' of
'symbols' [Gru08]; it can be generated given its context free grammar
G=(V,∑,S,P). [Cla10]
Commonly, grammars are represented as a set of production rules
which are taken by the parser and compared against the input sentences.
Every matched rule adds something to the sentence's complete structure,
which is called a 'parse tree'. [Ric91]
Context free grammar (CFG) is a popular method for generating
formal grammars. It is used extensively to define language syntax. The
four components of the grammar are defined in CFG as [Sag13]:
 Terminals (∑): represent the basic elements which form the
strings of the language.
 Nonterminals or Syntactic Variables (V): sets of strings that
define the language generated by the grammar. Nonterminals
play a key role in syntax analysis and translation by imposing a
hierarchical structure on the language.
 Set of production rules (P): this set defines the way of combining
terminals with nonterminals to produce strings. A production
rule consists of a variable on its left side, representing its head,
and a string of terminals and nonterminals on its right side,
representing its body.
 Start symbol (S).
The following is an example describing the structure of an English
sentence:
V = {S, NP, VP, N, V, Art}
∑ = {boy, icecream, dog, bite, like, ate, the, a}
P = {S → NP VP,
NP → N,
NP → Art N,
VP → V NP,
N → boy | icecream | dog,
V → ate | like | bite,
Art → the | a}
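The toy grammar above can be made concrete in code. The following is a minimal backtracking recogniser (our own illustrative sketch, not part of the thesis) that tests whether a token sequence belongs to the language generated by the grammar:

```python
# A minimal recogniser for the toy grammar above (illustrative sketch).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"], ["N"]],
    "VP":  [["V", "NP"]],
    "N":   [["boy"], ["icecream"], ["dog"]],
    "V":   [["ate"], ["like"], ["bite"]],
    "Art": [["the"], ["a"]],
}

def parse(symbol, words, pos):
    """Yield every index up to which `symbol` can match `words` starting at `pos`."""
    if symbol not in GRAMMAR:           # terminal: must match the next word
        if pos < len(words) and words[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:  # nonterminal: try each production
        ends = {pos}
        for part in production:
            ends = {e2 for e in ends for e2 in parse(part, words, e)}
        yield from ends

def accepts(sentence):
    """A sentence is accepted if S can consume all of its words."""
    return len(sentence.split()) in parse("S", sentence.split(), 0)

print(accepts("the dog ate the icecream"))  # True
print(accepts("dog the ate"))               # False
```

The weak generative capacity of the grammar is exactly the set of sentences for which `accepts` returns True; collecting the matched productions instead of only the end positions would yield the parse tree, i.e. its strong generative capacity.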
The grammar specifies two things about the language: [Ric91]
 Its weak generative capacity; the limited set of sentences which can
be completely matched by a series of grammar rules.
 Its strong generative capacity, grammatical structure(s) of each
sentence in the language.
Generally, for each grammar there are an infinite number of sentences
which can be structured with it. The strength and importance of grammars
lie in their ability to supply structure to an infinite number of
sentences, because they succinctly summarize the structures of an infinite
number of objects of a certain class. [Gru08]
The grammar is said to be generative if it has a fixed-size set of
production rules which, if followed, can generate every sentence in the
language in a finite number of actions. [Gru08]
2.5 Natural Language Processing Techniques
2.5.1 Morphological Analysis
Morphology is the study of how words are constructed from
morphemes which represent the minimal meaning-bearing language
primitive units.[Raj09] [Jur00]
There are two broad classes of morphemes: stems and affixes; the
distinction between the two classes is language dependent in that it varies
from one language to another. The stem, usually, refers to the main part of
the word and the affixes can be added to the words to give it additional
meaning. [Jur00]
Furthermore, affixes can be divided into four categories according to
the position where they are added. Prefixes, suffixes, circumfixes, and
infixes generally refer to the different types of affixes, but it is not
necessary for a language to have all the types. English accepts both
prefixes, which precede stems, and suffixes, which follow stems, while
there is no good example of a circumfix (preceding and following a stem)
in English, and infixing (inserting inside the stem) is not allowed (unlike
German and Philippine languages, respectively). [Jur00]
Morphology is concerned with recognizing the modification of base
words to form other words with different syntactic categories but similar
meanings.
Generally, three forms of word modifications are found [Jur00]:
 Inflection: syntactic rules change the textual representation of the
words, such as adding the suffix 's' to convert nouns into plurals, or
adding 'er' and 'est' to convert regular adjectives into the
comparative and superlative forms, respectively. This type of
modification usually results in a word from the same word class as
the stem word.
 Derivation: new words are produced by adding morphemes;
derivation is usually more complex, and harder in meaning, than
inflectional morphology. It often occurs in a regular manner and
results in words that differ in word class from the stem word, like
adding the suffix 'ness' to 'happy' to produce 'happiness'.
 Compounding: this type modifies stem words by grouping them
with other stem words, like grouping 'head' with 'ache' to produce
'headache'. In English, this type is infrequent.
Morphological processing, also known as stemming, depends heavily on
the analyzed language. The output is the set of morphemes that are
combined to form words. Morphemes can be stem words, affixes, and
punctuations.
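As a minimal illustration of suffix stripping (our own sketch; a real stemmer such as Porter's uses far more rules and exception lists), an inflectional suffix stripper for English might look like this:

```python
# Naive suffix-stripping stemmer (illustrative only).
# Suffixes are tried longest-first so 'ness' wins over 's'.
SUFFIXES = ["ness", "est", "ing", "ed", "er", "s"]

def strip_suffix(word):
    """Return (stem, suffix) for the first matching inflectional suffix."""
    for suffix in SUFFIXES:
        # Require a reasonably long remainder to avoid stripping short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)], suffix
    return word, ""

print(strip_suffix("talked"))    # ('talk', 'ed')
print(strip_suffix("students"))  # ('student', 's')
```

Even this toy version shows why stemming is language-dependent: the suffix list, the ordering, and the length heuristic are all specific to English inflection.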
2.5.2 Part Of Speech Tagging
Part of Speech (POS) tagging is the process of giving the proper
lexical information or POS tag (also known as word classes, lexical tags,
and morphological classes), which is encoded as a symbol, for every word
(or token) in a sentence. [Sco99] [Has06b]
In English, POS tags are classified into four basic classes of words: [Qui85]
1. Closed classes: include prepositions, pronouns, determiners,
conjunctions, modal verbs and primary verbs.
2. Open classes: include nouns, adjectives, adverbs, and full verbs.
3. Numerals: include numbers and orders.
4. Interjections: include small set of words like oh, ah, ugh, phew.
Usually, a POS tag indicates one or more of the previous information
items, and it sometimes holds other features like the tense of the verb or
the number (plural or singular). POS tagging may generate tagged corpora
or serve as a preprocessing step for the next NLP processes. [Sco99]
The performance of most tagging systems is typically limited because
they use only the local lexical information available in the sentence, in
contrast to syntax analysis systems, which exploit both lexical and
structural information. [Sco99] More research has been done and several
models and methods have been proposed to enhance tagger performance;
they fall mainly into supervised and unsupervised methods, where the
main difference between the two categories is that the training corpora are
pre-tagged in supervised methods, unlike unsupervised methods, which
need advanced computational methods for obtaining such corpora.
[Has06a] [Has06b] Figure (2.4) presents the main categories and shows
some examples.
In both categories, the following are the most popular:
Figure (2.4) : Classification of POS tagging models [Has06a]
 Statistical (stochastic, or probabilistic) methods: taggers which
use these methods are first trained on a correctly tagged set of
sentences, which allows the tagger to disambiguate words by
extracting implicit rules or picking the most probable tag based on
the words surrounding the given word in the sentence.
Examples of these methods are Maximum-Entropy Models, Hidden
Markov Models (HMM), and Memory Based models.
 Rule based methods: a sequence of rules, a set of hand-written
rules, is applied to detect the best tag set for the sentence regardless
of any probability maximization. The set of rules needs to be
written properly and checked by human experts. Examples: the
path-voting constraint models and decision tree models.
 Transformational approach: combines both statistical methods and
rule based methods to firstly find the most probable set of available
tags and then applies a set of rules to select the best.
 Neural Networks: with linear separator or full neural network, have
been used for tagging processes.
The methods described above, as in any other research area, have their
advantages and disadvantages, but there is a major difficulty facing all
of them: the tagging of unknown words (words never seen before in the
training corpora). While rule-based approaches depend on a special set
of rules to handle such situations, stochastic and neural-net methods
lack this feature and use other means such as suffix analysis and n-
grams by applying morphological analysis; some methods use a default
set of tags to disambiguate unknown words. [Has06a]
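To make the supervised statistical idea concrete, here is a minimal sketch (our own, with an invented toy corpus) of the most-frequent-tag baseline: each word receives the tag it carried most often in the pre-tagged training data, with a default tag for unknown words:

```python
from collections import Counter, defaultdict

# Toy pre-tagged training corpus (invented for illustration).
training = [("the", "ART"), ("dog", "N"), ("bites", "V"),
            ("the", "ART"), ("bites", "V"), ("bites", "N"),
            ("dog", "N")]

# Count how often each word appears with each tag.
counts = defaultdict(Counter)
for word, tag_seen in training:
    counts[word][tag_seen] += 1

def tag(word, default="N"):
    """Pick the tag seen most often with `word`; fall back for unknown words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print([tag(w) for w in "the dog bites".split()])  # ['ART', 'N', 'V']
```

The `default` argument stands in for the unknown-word strategies mentioned above; a real tagger would instead analyze suffixes or apply morphological rules before falling back to a default tag.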
2.5.3 Syntactic Analysis
"Syntax is the study of the relationships between linguistics forms,
how they are arranged in sequence, and which sequences are well-
formed". [Yul00]
Syntactic analysis, also referred to as "parsing", is the process of
converting the sentence from its flat format, in which it is represented as
a sequence of words, into a structure that defines its units and the
relations between these units. [Raj09]
Hence, the goal of this technique is to transform natural language
into an internal system representation. The format of this representation
may be dependency graphs, frames, trees, or some other structural
representation. Syntactic parsing attempts only to convert sentences
into either dependency links representing the utterance's syntactic
structure or a tree structure; the output of this process is called a "parse
tree" or simply a "parse". [Dzi04] The parse tree of a sentence holds its
meaning at the level of the smallest parts ("words" in the terms of
language scientists, "tokens" in the terms of computer scientists). [Gru08]
Syntactic analysis makes use of both the results of morphological
analysis and Part-Of-Speech tagging to build the structural description of
the sentence by applying the grammar rules of the language under
consideration; if a sentence violates the rules then it is rejected and
assigned as incorrect. [Raj09]
The two main components of every syntax analyzer are:
 Grammar: the grammar provides the analyzer with the set of
production rules that will lead it to construct the structure of the
sentences and specifies the correctness of every given sentence.
Good grammars make a careful distinction between the
sentence/word level, which they often call syntax or syntaxis and
the word/letter level, which they call morphology. [Gru08]
 Parser: the parser reconstructs the production tree (or trees) by
applying the grammar to indicate how the given sentence (if
correctly constructed) was produced from that grammar.
Parsing is the process of structuring a linear representation in
accordance with a given grammar.
Today, most parsers combine context free grammars with probability
models to determine the most likely syntactic structure out of the many
that are accepted as parse trees for an utterance. [Dzi04]
2.5.4 Semantic Analysis
"Semantics is the study of the relationships between linguistic
forms and entities in the words; that is, how words literally connect to
things." [Yul00]
This technique and those following it are fundamental to language
understanding. Semantic analysis is the process of assigning meanings
to the syntactic structures of sentences regardless of their context.
[Yul00] [Raj09]
2.5.5 Discourse Integration
Discourse analysis is concerned with studying the effect of sentences
on each other. It shows how a given sentence is affected by the one
preceding it and how it affects the sentence following it. Discourse
integration is relevant to understanding texts and paragraphs rather than
simple sentences, so discourse knowledge is important in the
interpretation of referential aspects (like pronouns) in the conveyed
information. [Ric91] [Raj09]
2.5.6 Pragmatic Analysis
This step interprets the structure that represents what is said in order
to determine what was actually meant. Context is a fundamental resource
for processing here. [Ric91]
2.6 Natural Language Processing Challenges
The challenges of natural language processing are too many to be
summarized in a limited list; with every processing step, from the
starting point to the outputting of results, there is a set of problems that
natural language processors vary in their ability to handle. However, the
application where NLP is used is usually concerned with a specific task
rather than considering all processing steps with all their details; this is
an advantage for the NLP community that helps to outline the challenges
and problems according to the task under consideration.
For our research area, we are precisely concerned with the set of
problems that directly affect the task of text correction; the next
subsections describe some of them:
2.6.1 Linguistic Units Challenges:
The task of text correction starts from the level of characters up to
paragraphs and full texts, with every level there are a set of difficulties that
the handling analyzer faces:
2.6.1.1 Tokenization
In this process, the lexical analyzer, usually called a "tokenizer",
divides the text into smaller units; the output of this step is a series of
morphemes, words, expressions, and punctuation marks (called tokens).
It involves locating token boundaries (where one token ends and another
begins).
Issues that arise in tokenization and should be addressed are [Nad11]:
 Problems dependent on language type: languages include, in
addition to their symbols, a set of orthographic conventions which
are used in writing to indicate the boundaries of linguistic units.
English employs whitespace to separate words, but this isn't
sufficient to tokenize a text in a complete and unambiguous manner
because the same character may have different uses (as is the case
with punctuation), there are words with multiple parts (such as a
word divided by a hyphen at the end of a line, and some cases of
the addition of prefixes), and many expressions consist of more
than one word.
 Encoding problems: syllabic and alphabetic writing systems are
usually encoded using a single byte, but languages with larger
character sets require two or more bytes. The problem arises when
the same set of encodings represents different character sets,
whereas tokenizers are targeted at a specific encoding for a specific
language.
 Other problems, such as the dependency on the application
requirements, which indicate what constituent is defined as a
token; in computational linguistics the definition should precisely
indicate what the next processing step requires. The tokenizer
should also have the ability to recognize irregularities in texts such
as misspellings, erratic spacing, punctuation, etc.
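The point that whitespace alone is not enough can be seen in a small sketch (our own, not the thesis's tokenizer): a regular expression that separates words, hyphenated or apostrophized forms, and individual punctuation marks:

```python
import re

# Illustrative tokenizer: a word possibly containing internal hyphens or
# apostrophes, or else a single non-space punctuation character.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Dr. Smith didn't pay $10, did he?"))
# ['Dr', '.', 'Smith', "didn't", 'pay', '$', '10', ',', 'did', 'he', '?']
```

Note how "Dr." is split into "Dr" and ".": the tokenizer cannot tell an abbreviation period from a sentence-final one, which is exactly the ambiguity the segmentation step below must resolve.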
2.6.1.2 Segmentation
Segmenting text means dividing it into small meaningful pieces,
typically referred to as "sentences"; a sentence consists of one or more
tokens and carries a meaning which may not be completely clear. This
task requires full knowledge of the scope of punctuation marks, since
they are the major factor in denoting the starts and ends of sentences.
Segmentation becomes more complicated as the uses of punctuation
multiply. Some punctuation marks can be part of a token rather than a
stopping mark, as is the case with periods (.) when used in
abbreviations.
However, there is a set of factors can help in making the
segmentation process more accurate [Nad11]:
 Case distinction: English sentences normally start with a capital
letter (but proper nouns also do).
 POS tags: the tags surrounding a punctuation mark can assist this
process, but multi-tag situations complicate it, such as the use of
-ing verbs as nouns.
 The length of the word (in the case of abbreviation
disambiguation; notice that a period may mark the end of a
sentence and an abbreviation at the same time).
 Morphological information: this task requires finding the stem
word by suffix removal.
It is common not to separate the tokenization and segmentation
processes; they are usually merged together to solve most of the
above problems, specifically the segmentation problems.
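These factors can be combined into a crude rule-based splitter (our own sketch, with a tiny invented abbreviation list): a period, exclamation mark, or question mark ends a sentence unless the token is a known abbreviation:

```python
# Toy abbreviation list; a real segmenter would use a much larger one
# plus the case, length, and morphology cues discussed above.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g."}

def split_sentences(text):
    """Crude segmenter: break after . ! ? unless the token is an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if (token.endswith((".", "!", "?"))
                and token.lower() not in ABBREVIATIONS):
            sentences.append(" ".join(current))
            current = []
    if current:                       # flush any trailing fragment
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

Without the abbreviation check, "Dr." would wrongly end the first sentence, which is the period ambiguity described above.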
A sentence is described as an indeterminate unit because of the
difficulty of deciding where it ends and another starts, while grammar is
indeterminate from the standpoint of deciding 'which sentence is
grammatically correct?', because this question permits divided answers;
the difficulty of discourse segmentation is not the only reason, but also
grammatical acceptability, meaning, goodness or badness of style,
lexical acceptability, context acceptability, etc. [Qui85]
2.6.2 Ambiguity
An input is ambiguous if there is more than one alternative linguistic
structure for it. [Jur00]
There are two major types of sentence ambiguity: genuine ambiguity
and computer ambiguity. In the first, the sentence really has two
different meanings to an intelligent hearer; in the second, the sentence
has one meaning, but for the computer it has more than one. Unlike the
first, this type is a real problem facing NLP applications. [Not]
Ambiguity as an NLP problem is found in every processing step [Not]
[Bha12]:
2.6.2.1 Lexical Ambiguity
Lexical ambiguity is described to be the possibility for a word to
have more than one meaning or more than one POS tag.
Obviously, meaning ambiguity leads to semantic ambiguity and tag
ambiguity to syntactic ambiguity because it can produce more than one
parse tree. Frequency is an available solution for this problem.
2.6.2.2 Syntactic Ambiguity
The sentence has more than one syntactic structure; particularly,
English common ambiguity sources are:
 Phrase attachment: how a certain phrase or a clause in the sentence
can be attached to another when there is more than one possibility.
Crossing is not allowed in parse trees; therefore, a parser generates a
parse tree for each accepted state.
 Conjunction: sometimes the parser is befuddled in selecting which
phrase a conjunction should be connected to.
 Noun group structure: the rule
NG → NG NG
allows English to generate long series of nouns to be strung together.
Some of these problems can be resolved by applying syntactic constraints.
2.6.2.3 Semantic Ambiguity
Even when a sentence is unambiguous lexically and syntactically,
sometimes, there is more than one interpretation for it. This is because a
phrase or a word may refer to more than one meaning.
"Selection restrictions" or "semantic constraints" is a way to
disambiguate such sentences. It combines two concepts in one mode if both
of the concepts or one of them has specific features. Frequency in context
also can help in deciding the meaning of a word.
2.6.2.4 Anaphoric Ambiguity
This is the possibility for a word or a phrase to refer to something
that is previously mentioned but in the reference there is more than one
possibility.
This type can be resolved by parallel structures or recency rules.
2.6.3 Language Change
"All living languages change with time; it is fortunate that they do so
rather slowly compared to the human life span." Language change is
represented by the change in the grammars of people who speak the
language, and it has been shown that English has changed in the lexicon,
phonological, morphological, syntactic, and semantic components of its
grammar over the past 1,500 years. [Fro07]
2.6.3.1 Phonological Change
Regular sound correspondences show the changes of the phonological
system. The phonological system is governed, like any other linguistic
system, by a set of rules, and this set of phonemes and phonological
rules is subject to change by modification, deletion, and addition of new
rules. A change in phonological rules can affect the lexicon, in that the
formation of some English words depends on sounds; for example,
vowel sounds differentiate the nouns house and bath from the verbs
house and bathe.
2.6.3.2 Morphological Change
Morphological rules, like the phonological, are subject to addition,
loss, and change. Mostly, the usage of suffixes is the active area of
change, where the way they are added to the ends of stems affects the
resulting words and therefore changes the lexicon.
2.6.3.3 Syntactic Change
Syntactic changes are influenced by morphological changes, which in
turn are influenced by phonological changes. This type of change
includes all types of grammar modifications that are mainly based on the
reordering of words inside the sentence.
2.6.3.4 Lexical Change
Change of lexical categories is the most common in this type of
change. An example of this situation is the usage of nouns as verbs, verbs
as nouns, and adjectives as nouns. Lexical change also includes the
addition of new words, borrowing or loan words from another language,
and the loss of existing words.
Figure (2.5) : An example of lexical change 1
2.6.3.5 Semantic Change
As the category of a word can be changed, its semantic
representation or meaning can be changed, too. Three types of change are
possible for a word:
 Broadening: the meaning of a word is expanded to mean everything
it has been used for and more than that.
 Narrowing: on the reverse of broadening, here the word meaning is
reduced from more general meaning to a specific meaning.
 Shifting: the word's reference is shifted to another meaning that
somewhat differs from the original one.
_________________________________________________________
Darby Conley/ Get fuzzy © UFS, Inc. 24 Feb. 2012
Part II
Text Correction
2.7 Introduction
Text correction is the process of indicating incorrect words in an input
text, finding candidates (or alternatives), and suggesting the candidates as
corrections for the incorrect word. The term incorrect refers to two
different types of erroneous words: misspelled and misused. Mainly, the
process is divided into two distinct phases: the error detection phase,
which indicates the incorrect words, and the error correction phase, which
combines both the generation and the suggestion of candidates.
Devising techniques and algorithms for correcting texts automatically
has been an open research challenge since the early 1960s, and it continues
today because existing correction techniques are limited in their accuracy
and application scope [Kuk92].
Usually, a correction application targets a specific type of error,
because computationally predicting the word a human writer intended is a
complex task.
2.8 Text Errors
A word can be mistaken in two ways. The first is by incorrectly
spelling a word, due to a lack of knowledge about its spelling or to
intentionally mistaking symbol(s) within it; this type of error is known as
a non-word error, where the word cannot be found in the language
lexicon.
The second is by using a correctly spelled word in the wrong position
in the sentence or in an unsuitable context. These errors are known as real-word errors
where the incorrect word is accepted in the language lexicon.
[Gol96][Amb08]
Non-word errors are easier to detect than real-word errors; the
latter need more information about the syntax and semantics of the language.
Accordingly, correction techniques are divided into isolated-word error
detection, which is concerned with non-word errors, and context-sensitive
error correction, which deals with real-word errors. [Gol96]
2.8.1 Non-word errors
These errors include words that are not found in the lexicon; a
misspelled word contains one or more of the following errors:
 Substitution: one or more symbols are changed.
 Deletion: one or more symbols are missed from the intended word.
 Insertion: adding symbol(s) to the front, end, or any index in the word.
 Transposition: two adjacent symbols are swapped.
The four errors are known as Damerau edit operations.
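These four operations can be sketched directly as string edits (a minimal Python illustration; the function names are mine, not part of any cited system):

```python
# Minimal sketch of the four Damerau edit operations: each function
# produces the erroneous form from an intended word.
def substitute(word, i, ch):      # one symbol is changed
    return word[:i] + ch + word[i + 1:]

def delete(word, i):              # one symbol is missed
    return word[:i] + word[i + 1:]

def insert(word, i, ch):          # a symbol is added at any index
    return word[:i] + ch + word[i:]

def transpose(word, i):           # two adjacent symbols are swapped
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

For instance, transpose("from", 1) restores "form", illustrating why a single transposition is worth treating as one operation rather than two substitutions.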
2.8.2 Real-word errors
These errors occur by mistaking an intended word for another one
that is lexically accepted. Real-word errors can result from phonetic
confusion, such as using the word "piece" instead of "peace", which usually
leads to semantically unaccepted sentences even after applying non-word
correction; or from misspelling the intended word and producing another
lexically accepted word. [Amb08]
Sometimes, the confusion results in syntactically unaccepted
sentences; like writing the sentence "John visit his uncle" instead of "John
visits his uncle".
Correcting real-word errors is context sensitive in that it needs to
check the surrounding words and sentences before suggesting candidates.
2.9 Error Detection Techniques
Indicating whether a word is correct depends on the type of
correction procedure: non-word error detection usually checks the
acceptance of a word in the language dictionary (the lexicon) and marks any
unmatched word as incorrect, while real-word error detection is a more
complex task that requires analysing larger parts of the text, typically
paragraphs or the full text [Kuk92]. In this work, we mainly focus on
non-word error detection techniques.
Dustin defined a spelling error as a query word Q which is not an
entry in the dictionary D at hand [Bos05]. He outlined an algorithm for
spelling correction as shown in figure (2.6).
Spell error detection techniques can be classified into two major types:
2.9.1 Dictionary Looking Up
All the words of a given text are matched against every word in a
pre-created dictionary or a list of all acceptable words in the language
under consideration (or most of them, since some languages have a huge
number of words and collecting them all is a nearly impossible task). A
word is incorrect if and only if no match is found. This technique is robust
but suffers from the long time required for checking: as the dictionary
grows larger, looking-up time grows longer. [Kuk92] [Mis13]
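The looking-up idea reduces to a membership test; a minimal sketch, assuming a toy lexicon (a real dictionary would hold hundreds of thousands of tokens):

```python
# Dictionary looking up: a word is incorrect iff no match is found.
lexicon = {"peace", "piece", "john", "visits", "his", "uncle"}

def is_non_word(token):
    return token.lower() not in lexicon

# Scanning a text marks every unmatched token as incorrect.
def detect_errors(text):
    return [t for t in text.split() if is_non_word(t)]
```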
2.9.1.1 Dictionaries Resources
Many systems deal with collecting and updating the lexical
dictionaries of languages. An example is the WordNet online application, a
large database of English lexicons. Lexicons (nouns,
verbs, adjectives, articles, etc.) are interlinked by lexical and
conceptual-semantic relations. The structure of WordNet is a network of
words and concepts that are related meaningfully, and this structure makes
it a good tool for NLP and Computational Linguistics.
Another example is the ISPELL text corrector, an online spell
checker that provides interfaces for many Western languages. ISPELL is
the latest version of R. Gorin's spell checker, which was developed for
Unix. Suggesting a spelling correction is based on a Levenshtein edit
distance of only one, by looking up every token of the input text in a huge
lexical dictionary. [ISP14]
2.9.1.2 Dictionaries Structures
The standard looking-up technique is to match every token of the
text against every token in the dictionary, but this process requires a long
time because NL dictionaries are usually huge and string matching takes
longer than matching other data types. A solution to this challenge is to
reduce the search space in a way that keeps similar tokens grouped together.
Figure (2.6) : Outlines of Spell Correction Algorithm [Bos05]
Algorithm: Spell_correction
Input: word w
Output: suggestion(s) a set of alternatives for w
Begin
If (is_mistake(w))
Begin
Candidates=get_candidates( w)
Suggestions=filter_candidates( candidates)
Return suggestions
End
Else
Return is_correct
End.
Chapter Two_ Part II   Text Correction
_______________________________________________________________________
 39
Grouping according to spelling or phones [Mis13] and using hash tables
are two fundamental ways to minimize the search space.
Hashing techniques apply a hash function to generate a numeric key
from strings. The numeric keys are references to packets of tokens that
generate the same key indices; hash functions differ in their ability to
distribute tokens and in how much they minimize the search space. A perfect
hash function generates no collisions (a collision is hashing two different
tokens to the same key index), and a uniform hash function distributes
tokens among packets uniformly. The optimal hash function is a uniform
perfect hash function, which hashes one token to every packet; such a
situation is impossible with dictionaries due to the variance of tokens.
[Nie09]
Spelling- and phone-dependent groups use a limited set of packets and
generate keys according to spelling or pronunciation; they are another style
of hashing, and sometimes of clustering. SPEEDCOP and Soundex are
examples. [Mis13] [Kuk92]
2.9.2 N-gram Analysis
N-grams are defined as subsequences of n words or characters, where n
is variable and often takes the value one to produce unigrams (or
monograms), two to produce bigrams (sometimes called "digrams"), or three
to produce trigrams; it rarely takes larger values. This technique detects
errors by examining each n-gram of the given string and looking it up in a
precompiled n-gram statistics table. The decision depends on the existence
of the n-gram or the frequency of its occurrence: if the n-gram is not found
or is highly infrequent, then the words or strings that contain it are incorrect.
[Kuk92] [Mis13]
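A letter-bigram version of this check might look as follows (the sample corpus and frequency threshold are illustrative assumptions only):

```python
from collections import Counter

# Precompiled n-gram statistics table: letter bigrams from a tiny corpus.
corpus = ["the peace treaty", "a piece of the text"]
stats = Counter(g for line in corpus
                  for w in line.split()
                  for g in zip(w, w[1:]))

def looks_incorrect(word, min_count=1):
    # A word is suspicious if any of its bigrams is unseen or too rare.
    return any(stats[g] < min_count for g in zip(word, word[1:]))
```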
2.10 Error Correction Techniques
Many techniques have been proposed to solve the problem of
generating candidates for a detected misspelled word; they vary in the
required resources, application scope, time and space complexity, and
accuracy. The most common are [Kuk92] [Mis13]:
2.10.1 Minimum Edit Distance Techniques
This technique is based on counting the minimum number of primal
operations required to convert the source string into the target one. Some
researchers take the primal operations to be insertion, deletion, and
substitution of one letter by another; others add the transposition of two
adjacent letters as a fourth primal operation. Examples include the
Levenshtein algorithm, which counts a distance of one for every primal
operation; the Hamming algorithm, which works like Levenshtein but is
limited to strings of equal length; and Longest Common Substring, which
finds the mutual substring between two words.
Levenshtein, shown in figure (2.7) [Hal11], is preferred because it has
no limitation on the types of symbols, or on their lengths. It can be executed
in time complexity of O(M.N) where M and N are the lengths of the two
input strings.
The algorithm can detect three types of errors (substitution, deletion,
and insertion). It does not count the transposition of two adjacent symbols
as one edit operation; instead, it counts such an error as two consecutive
substitution operations, giving an edit distance of 2.
One of the well-known modifications of the original Levenshtein
method was made by Fred Damerau, whose research found that about 80%
to 90% of errors are covered by the four error types altogether; the
combined measure is known as the Damerau-Levenshtein distance. [Dam64]
The modified method requires a longer execution time than the original:
in every checking round, the method applies an additional comparison to
check whether a transposition took place in the string, then applies another
comparison to select the minimum between the previous distance and
the distance with the occurrence of a transposition operation. This step
Figure (2.7) : Levenshtein Edit Distance Algorithm [Hal11]
1. Algorithm: Levenshtein Edit Distance
2. Input: String1, String2
3. Output: Edit Operations Number
4. Step1: Declaration
5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0,
cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. cost = 0
14. else
15. cost = 1
16. r=index of x, c=index of y
17. min1 = (distance(r - 1, c) + 1) // deletion
18. min2 = (distance(r, c - 1) + 1) //insertion
19. min3 = (distance(r - 1,c - 1) + cost) //substitution
20. distance( r , c )=minimum(min1 ,min2 ,min3)
21. end
22. Step3: return the value of the last cell in the distance matrix
23. return distance(Length of String1,Length of String2)
24. End.
multiplied the time complexity by a factor of 2, resulting in Ω(2·M·N).
Hence, in this work, the original Levenshtein method (figure (2.7)) is
modified to consider Damerau's four error types within a time complexity
shorter than that consumed by the Damerau-Levenshtein algorithm and close
to the original method's. Figure (2.8) shows Damerau's modification of the
Levenshtein method.
1. Algorithm: Damerau-Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5. distance(length of String1,Length of String2)=0, min1=0, min2=0,
min3=0, cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. cost = 0
14. else
15. cost = 1
16. r=index of x, c=index of y
17. min1 = (distance(r - 1, c) + 1) // deletion
18. min2 = (distance(r, c - 1) + 1) //insertion
19. min3 = (distance(r - 1,c - 1) + cost) //substitution
20. distance( r , c )=minimum(min1 ,min2 ,min3)
21. if not(String1 starts with x) and not (String2 starts with y) then
22. if (the symbol preceding x= y) and (the symbol preceding y=x)
then
23. distance(r,c)=minimum(distance(r,c), distance(r-2,c-2)+cost)
24. end
25. Step3: return the value of the last cell in the distance matrix
26. return distance(Length of String1,Length of String2)
27. End.
Figure (2.8) : Damerau-Levenshtein Edit Distance Algorithm [Dam64]
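The method of figure (2.8) can be rendered as runnable code roughly as follows; this is a sketch of the optimal-string-alignment form of the Damerau-Levenshtein distance, with the initialization of the first row and column made explicit:

```python
def damerau_levenshtein(s, t):
    """Edit distance counting substitution, deletion, insertion, and the
    transposition of two adjacent symbols as one operation each."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(m + 1):
        d[r][0] = r                          # deleting r symbols from s
    for c in range(n + 1):
        d[0][c] = c                          # inserting c symbols into s
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            cost = 0 if s[r - 1] == t[c - 1] else 1
            d[r][c] = min(d[r - 1][c] + 1,          # deletion
                          d[r][c - 1] + 1,          # insertion
                          d[r - 1][c - 1] + cost)   # substitution
            if r > 1 and c > 1 and s[r - 1] == t[c - 2] and s[r - 2] == t[c - 1]:
                # an adjacent transposition counts as a single edit
                d[r][c] = min(d[r][c], d[r - 2][c - 2] + 1)
    return d[m][n]
```

With this version, the transposed pair in "form"/"from" costs one edit, where the plain Levenshtein method of figure (2.7) would count two.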
2.10.2 Similarity Key Techniques
As its name indicates, this technique finds a key that groups
similarly spelled words together. The similarity key is computed for the
misspelled word and mapped to a pointer that refers to the group of words
similar in spelling to the input one. The Soundex algorithm derives keys
from the pronunciation of words, while the SPEEDCOP system rearranges
the letters of a word: the first letter, followed by the consonants, and
finally the vowels, according to their sequence of occurrence in the word
and without duplication. [Kuk92] [Mis13]
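Following that description, a SPEEDCOP-style similarity key could be computed like this (my reading of the scheme; the real system may differ in details such as case handling):

```python
VOWELS = set("aeiou")

def speedcop_key(word):
    # First letter, then the remaining consonants, then the vowels,
    # each in order of occurrence and without duplication.
    word = word.lower()
    key = word[0]
    seen = {word[0]}
    for belongs in (lambda c: c not in VOWELS, lambda c: c in VOWELS):
        for ch in word[1:]:
            if ch not in seen and belongs(ch):
                key += ch
                seen.add(ch)
    return key
```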
2.10.3 Rule Based Techniques
This approach applies a set of rules, derived from common mistake
patterns, to the misspelled word in order to transform it into a valid one.
After all applicable rules have been applied, the generated words that are
valid in the dictionary are suggested as candidates.
2.10.4 Probabilistic Techniques
Two methods are mainly based on statistics and probability:
1) Transition method: depends on the probability of a given letter being
followed by another. The probability is estimated from n-gram
statistics over a large corpus.
2) Confusion method: depends on the probability of a given letter being
confused with, or mistaken for, another. Probabilities in this method
are source dependent; for example, Optical Character Recognition
(OCR) systems vary in their accuracy and in how they recognize
letters, and Speech Recognition (SR) systems usually confuse sounds.
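For the transition method, the probabilities can be estimated directly from corpus counts, as in this sketch (the tiny corpus is a stand-in for real n-gram statistics):

```python
from collections import Counter

corpus = "the peace of the piece"
pair_counts = Counter(zip(corpus, corpus[1:]))
# count of each letter as a non-final character (the conditioning event)
left_counts = Counter(corpus[:-1])

def transition_prob(a, b):
    # P(next letter is b | current letter is a), by maximum likelihood
    return pair_counts[(a, b)] / left_counts[a] if left_counts[a] else 0.0
```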
2.11 Suggestion of Corrections
Suggesting corrections may be merged with candidate generation; it
fully depends on the output of the generation phase.
The user is usually provided with a set of corrections and can then
choose among them, keep the written word unchanged, add the token to the
dictionary, or rewrite the word when the desired word is not in the
corrections list.
Suggestions are listed in non-increasing order of their similarity and
suitability for replacing the source word. Similarity depends on the method
of computing the distance or similarity between every candidate and the
source token, while suitability depends on the surrounding words within the
sentence boundary or the paragraph (in context-sensitive correction, the
full text may be examined before making a suggestion).
2.12 The Suggested Approach
The primal goal of this work is to find the nearest alternative word
among all the available candidates in the underlying dictionary. When a
non-word is encountered, many candidates are available to replace it, but
the trick is: which of those alternatives was intended by the writer?
The suggested work answers this question as follows.
Any of the dictionary tokens, whose count may reach some hundreds
of thousands, could have been intended by the writer, or none of them: the
writer (or typist) might really have misspelled the word, or might have
written it perfectly while the word is simply not found in the dictionary,
i.e. never seen before, making it an "unknown" token.
The problem of deciding whether a word is misspelled or unknown
cannot be solved exactly. For this reason, the suggested system assumes every
unrecognized word is misspelled and may let the user make the final
decision. As an initial solution, all the tokens in the dictionary are
candidates, and further processing must minimize their number.
2.12.1 Find Candidates Using Minimum Edit Distance
The starting step is to look for the most similar tokens in the lexicon
dictionary and rank them according to their minimum edit distance from
the misspelled word. This reduces the number of candidates to an
acceptable amount, using either a threshold on the number of edit
operations needed to equalize a candidate with the misspelled word, or a
maximum limit on the number of candidates. The suggested system uses the
Levenshtein method after enhancing it to consider the four Damerau edit
operations.
To find the similar tokens, the lexicon must be looked up and every
token in it examined against the given word. This process is time
consuming because of the huge number of tokens held by the lexicon
dictionary and the time required by the examining algorithm itself to find
the minimum edit distance. Hence, the search space needs to shrink; a
method is proposed to group similar tokens into semi-clusters using
spelling properties.
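The threshold-based candidate generation described above can be sketched as follows (toy lexicon; a compact one-row Levenshtein stands in here for the enhanced method):

```python
def edit_distance(s, t):
    # Classic Levenshtein computed with a rolling row.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def candidates(word, lexicon, threshold=2):
    # Rank lexicon tokens by distance and keep those within the threshold.
    ranked = sorted((edit_distance(word, t), t) for t in lexicon)
    return [t for d, t in ranked if d <= threshold]
```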
2.12.2 Candidates Mining
The best set of candidates goes through another processing step that
specifies how the generated candidates are related to the misspelled token
and, accordingly, how they should be ranked. The process is implemented
using a vector of the following features:
 Named-entity recognition: many issues are considered.
 Transposition probability: Keyboard proximity and Physical Similarity.
 Confusion probability: because phonetic errors are common, this
analysis helps to find whether a word was misspelled by replacing
letter(s) with others of the same sound.
 Starting and ending letters matching.
 Candidates' length effect.
A weighting scheme is applied to give each feature a role in deciding
the best set of suggestions; however, the similarity amount has the largest
share among them.
2.12.3 Part Of Speech Tagging and Parsing
Finally, the suitable candidate is chosen by the parser. The parser
selects the candidate(s) that make(s) the sentence containing the
misspelled word correct. Tagging plays an important role in specifying the
optimal candidate, because filtering by POS tag is the base on which the
parser stands when selecting a candidate for its incomplete sentence. The
selected tag affects not only the candidate but every token in the sentence;
this is the nature of English (and of most natural languages).
At this step, the set of candidates should contain the minimum
number of elements, but the best ones.
Grammar checking, accomplished by parsing, is another goal of this
system. The system applies a sentence phrasing process and checks each
phrase's consistency against English grammar rules. When an incorrect
structure is encountered, the system tries to correct it.
Parsing is a fundamental step in specifying the correct choice of
candidates, since the basic goal is to produce a correct sentence.
The dictionary relied upon is an integration of the WordNet dictionary
with the ISPELL dictionary.
Figure (2.9) shows the block diagram of the suggested work; the
following chapters give more details for each block.
_____________________________________________________________
1 The diagram in figure (2.9) is detailed further through the next three chapters.
Figure (2.9): The suggested system block diagram1
(Diagram components: Preprocessing; WordNet lexical dictionary; ISPELL
datasets; morphological analysis and POS tags expansion; dictionaries
integration; hashing and indexing; integrated hashed indexed dictionary;
tokens stream; POS tagging; sentences stream with tagged tokens; candidates
generation; candidates ranking; phrasing; phrase-level suggestions; grammar
correction; sentences recovery and suggestions listing.)
Chapter Three
Hashed Dictionary and Looking Up Technique
3.1 Introduction
The dictionary is a basic unit in almost every NLP application. It holds
the lexicon of the language under processing, together with related
information that depends on the application's purpose, such as POS tags,
semantic information, phonetics, and pronunciation.
Typically, dictionaries are data structures in the form of a list of
tokens or a collection of words. Each word (or token) is associated with the
information that makes its use by an NLP application possible.
The number of tokens held by a dictionary is a critical point in NLP
applications, especially taggers and text correction systems: as the
number of tokens becomes smaller, the ratio of detected errors also becomes
smaller, since a poor dictionary allows erroneous words to pass undetected.
On the other hand, a large dictionary increases this ratio but requires a
longer time for looking tokens up.
Therefore, a balance is needed that keeps the dictionary as inclusive
as possible and the looking-up speed fast. Many approaches have been
proposed to handle this problem, among them indexing and hash functions.
3.2 Hashing
The optimal feature of any dictionary is the availability of random
access, but strings are a highly variable data type, which makes this
feature hard to obtain, at least under memory constraints.
Hashing is the process of converting a string S into an integer number
within the range [0, M-1], where M is the number of available addresses in
a predefined table. Hash functions promise random access, but alone they
are not enough: the variance of language tokens would require an infinite
hash table to hold every token "separately" and a variable-size addressing
buffer, which may be unloadable by most current systems, besides wasting a
great deal of storage space.
By "separately" we mean that no two strings have the same hash
value, i.e. there are no collisions. As the number of collisions becomes
larger, looking up inside packets becomes longer.
However, a hash function can be exploited as a partial solution,
applied with other approaches to solve the problem shown above. While a
hash function can map tokens, according to some of their features, into
packets of manageable size, approaches such as indexing and advanced
search techniques enhance the looking-up speed to a reasonable amount.
3.2.1 Hash Function
The hash function in this work was created to exploit the spelling of
tokens as an addressing key. It converts the prefix of a token into a packet
address so that tokens are grouped into packets.
The English alphabet, for the language considered in this work,
contains the set of uppercase letters from 'A' to 'Z', lowercase letters
from 'a' to 'z', and digits from 0 to 9, in addition to some special-purpose
characters that are unavoidable in the dictionary because they are part of
some tokens, such as slash (/), period (.), apostrophe ('), underscore (_),
whitespace, and hyphen (-). The resulting character set contains about 68
characters, which can be reduced further by replacing the codes of the
digits 1 to 9 with
the code of 0, because distinguishing between digits has no importance in
this application, for two reasons:
 The difference between numbers is not a problem in the correction
process, since no system can estimate which number the writer
intended; therefore, any written number is accepted as it is.
 If a distinction had to be made when treating numbers, we would need
to cover every possible number in the dictionary, resulting in an
infinite dictionary size, because numbers are infinite.
The final alphabet is the union of the sets mentioned above and the
reduced number set:
∑ = { A, B, …, Z, a, b, …, z, 0, /, . , ' , - , _ , whitespace }
which can be re-encoded using only 6 bits, as shown in Table 3.1 (unused
codes are marked by *).
Hashing according to prefixes is a good way to minimize the sizes of
packets; it is similar to the SOUNDEX and SPEEDCOP methods
[Mis13][Kuk92] in that they share the same goal, minimizing the size of the
search space, but it differs in that this approach maps tokens to predefined
packet addresses depending on a limited-length prefix of the string, while
those methods use the total length and filter the letters according to sound
or spelling. This difference gives the suggested approach two interesting
features:
1. The hash function is simple and can be applied directly without any
preprocessing; SOUNDEX needs to encode letters into their phonetic
groups, and SPEEDCOP rearranges letters.
Symbol Code Symbol Code Symbol Code
A 0 B 1 C 2
D 3 E 4 F 5
G 6 H 7 I 8
J 9 K 10 L 11
M 12 N 13 O 14
P 15 Q 16 R 17
S 18 T 19 U 20
V 21 W 22 X 23
Y 24 Z 25 a 26
b 27 c 28 d 29
e 30 f 31 g 32
h 33 i 34 j 35
k 36 l 37 m 38
n 39 o 40 p 41
q 42 r 43 s 44
t 45 u 46 v 47
w 48 x 49 y 50
z 51 ' 52 / 53
- 54 _ 55 . 56
0 57 whitespace 58 * 59
* 60 * 61 * 62
* 63
Table 3.1: Alphabet Encoding
2. Random access is established by using the output of the hash
function as an address, while both previous methods need to search for
a match between the computed value and the stored codes.
3.2.2 Formulation
As mentioned above, the alphabet is reduced to only 59 symbols,
which can be encoded using only 6 bits instead of the standard 8 bits. This
makes a series of hash functions available, applied over a prefix of 1, 2,
or any longer sequence of symbols, and opens another area for discussion:
if the length of the prefix is too small, then the number of packets is also
small; therefore, each packet holds a large number of tokens, resulting in a
longer looking-up time.
On the other hand, using long prefixes creates a large number of
packets, some of which are usually sparse because of the variance and
irregularity of tokens, which is a characteristic of natural languages.
The function depends on using a three-character prefix C1C2C3,
converts it into integers as presented in Table (3.1), then computes the
hash value H according to equation (3.1):
H(C1,C2,C3) = code(C1) × 2^12 + code(C2) × 2^6 + code(C3)      (3.1)
H represents the address of the packet where tokens starting with the same
prefix are held.
Obviously, the number of available packet addresses equals the
number obtained by concatenating the binary codes of the three symbols, as
shown in Table (3.2), where the symbol at index 0 is 'A' and the symbol at
index 63 (the last available index in the alphabet) is the unused cell
marked by '*'.
Start Address = (C1)2||(C2)2||(C3)2 = (000000000000000000)2 = (0)10
End Address = (C1)2||(C2)2||(C3)2 = (111111111111111111)2 = (262143)10
This makes the total number of packets 2^18 = 262,144. Some of these
packets are empty, because their addresses do not match any actual token
prefix in the lexicon, but the distribution of tokens among packets reduces
the search space to a manageable size, especially when the hash function is
combined with an indexing scheme to build the dictionary in a two-level
structure.
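Under the encoding of Table (3.1), equation (3.1) can be implemented directly; the code table below mirrors the table, and the function names are illustrative:

```python
# 6-bit codes mirroring Table (3.1): A-Z -> 0..25, a-z -> 26..51,
# then ' / - _ . 0 and whitespace -> 52..58 (codes 59..63 unused).
ALPHABET = ([chr(c) for c in range(ord('A'), ord('Z') + 1)] +
            [chr(c) for c in range(ord('a'), ord('z') + 1)] +
            ["'", "/", "-", "_", ".", "0", " "])
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(ch):
    # the digits 1..9 collapse to the code of 0
    return CODE["0"] if ch.isdigit() else CODE[ch]

def prefix_hash(token):
    # H(C1,C2,C3) = code(C1)*2^12 + code(C2)*2^6 + code(C3)
    c1, c2, c3 = token[0], token[1], token[2]
    return (encode(c1) << 12) | (encode(c2) << 6) | encode(c3)
```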
     Starting Address Encoding             End Address Encoding
     Alphabetic   Decimal   Binary         Alphabetic   Decimal   Binary
C1   A            0         000000         *            63        111111
C2   A            0         000000         *            63        111111
C3   A            0         000000         *            63        111111
3.2.3 Indexing
Key-indexing is an in-memory lookup technique based strictly on
direct addressing into an array, with no comparisons between keys. Its
area of applicability is limited to numeric keys falling in a limited range
defined by the available memory resources. Hashing helps direct addressing
work on keys of any type and range, at the cost of bringing serial search
and collision-resolution policies into the equation.
Table 3.2: Addressing Range
Chapter Three  Dictionary Structure and Looking up Technique
________________________________________________________________________
 54 
Indexing is exploited to create a reference table that holds the 2^18
packet head addresses, which can be addressed directly by the hash
function. Every record in the reference table contains two fields: the first
is the "base" field, which holds an address if its index matches a token
prefix and the value (-1) otherwise; the second is the "limit" field, which
holds the length of the primary packet related to its index. Looking up the
packet that contains tokens starting with a specific prefix is shown in
figure (3.1).
The packets referred to by the reference table are treated as primary
packets, which hold tokens identical in their 3-symbol prefix; for further
reduction of the search space, sub-packets can be created for every primary
packet.
The second level of token distribution is also based on prefixes, but
with longer sequences: instead of using only three symbols to group tokens
with identical prefixes, the prefix equality is expanded to 6 symbols by
subdividing the tokens inside primary packets into secondary packets
Figure (3.1): Token Hashing Algorithm
Algorithm: Token Hashing
Input: English token (finite string over ∑), reference and hash tables.
Output: packet head address where the input token may reside.
Step1: set variables C1,C2, and C3 to the input token prefix.
Step2: Compute Index from C1, C2, and C3.
Index = code(C1) × 2^12 + code(C2) × 2^6 + code(C3)
Step3: go to reference table at the record indexed with Index.
Step4: examine the Base field
if Base > -1
return (Base value)
else
return fail
End.
which consist of a head and a set of tokens that are identical to the head in
their first 6 symbols.
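The base/limit reference records and packet addressing behave like this minimal sketch (the stored addresses here are invented for illustration):

```python
# Reference table: one (base, limit) record per possible 3-symbol prefix
# hash; base = -1 means no primary packet exists for that prefix.
EMPTY = (-1, 0)
reference = [EMPTY] * (2 ** 18)
reference[66] = (1024, 7)   # e.g. prefix "ABC" -> packet head 1024, length 7

def packet_of(index):
    base, limit = reference[index]
    return (base, limit) if base > -1 else None
```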
The structure of the dictionary can be clarified by hashing the
exemplar token ABCDEFGH according to the approach described
previously.
Figure (3.2): Dictionary Structure and Indexing Scheme
(The figure hashes the prefix C1=A, C2=B, C3=C to the reference index
H(C1,C2,C3); the reference record gives the primary packet head address X
and length Y; the primary packet with head code "ABC" holds the tokens
ABCS0$, ABCS1$, …, ABCSY-1$, and the secondary packet under the head with
Si = "DEF" holds the tokens ABCDEFT0, ABCDEFT1, …, ABCDEFTR-1.)
(1) The dollar sign ($) refers to any sequence that may follow Si.
An interesting characteristic of secondary packets is that no extra
space is wasted, because they are not based on a predefined packet
structure. The secondary head, which is a token within a primary packet,
may be followed by tokens sharing the same 6-symbol prefix, which are
collected in one variable-size secondary packet; if no such tokens follow,
no secondary packet is needed.
3.3 Looking Up Procedure
As shown in figure (3.2), the process of looking for a target token
starts once the primary packet head address is in hand from the reference
table, which in turn is computed using the hash function.
In the hash table, where the tokens are stored according to their
indexes, the search begins with a random access via the index of the
primary packet head, and matching then proceeds sequentially.
Matching is performed on the fourth through sixth symbols of every
token related to that primary packet; this reduces comparison time, since
matching the whole sequence would take longer. Even though the reduction
is small, it is useful here because logic operations on strings are more
expensive than on other data types.
When a full match is found, the target token is compared completely
with the token at that record; if they match, the goal is reached.
Otherwise, searching continues in the secondary packet related to that
token (if one exists). The comparison inside secondary packets, unlike
primary packets, uses the full token length, and failure here means there is
no chance of finding the target token in the dictionary.
The algorithm in figure (3.3) outlines the looking up procedure after
gaining primary head address.
Figure (3.3): Algorithm of Looking Up Procedure
Algorithm: Looking up a target token
Input: Target Token, Primary Packet Head address, Primary Packet Size.
Output: tag of input target token.
Step 1: Set primary packet information
    X = head address, Y = packet size.
Step 2: Examine X:
    if X < 0 then return fail
    for primary_index = X to X+Y do
        if prefix(token at primary_index in Hash Table) = prefix(target)
        begin
            if current token = target return primary_index
            X2 = Secondary packet head address
            Y2 = Secondary Packet Length
            exit for
        end
Step 3: Examine X2:
    if X2 <= 0 return fail // no related secondary packet
    for secondary_index = X2 to X2+Y2 do
        if token at secondary_index in hash table = target
            return secondary_index
Step 4: if no match was found at Step 3, return fail
End.
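The two-level search can be sketched in Python. The hash-table record layout below is an assumption (the thesis does not fix one): each record carries the token together with the head address and length of its related secondary packet, with -1 marking absence.

```python
from typing import List, NamedTuple

class Record(NamedTuple):
    token: str
    sec_head: int   # head address of the related secondary packet (-1 if none)
    sec_len: int    # number of tokens in that secondary packet

def look_up(target: str, head: int, size: int, table: List[Record]) -> int:
    """Return the hash-table index of `target`, or -1 on failure (Figure 3.3)."""
    if head < 0:                                   # Step 2: examine X
        return -1
    x2, y2 = -1, 0
    for i in range(head, head + size):             # scan the primary packet
        if table[i].token[3:6] == target[3:6]:     # compare 4th..6th symbols only
            if table[i].token == target:
                return i
            x2, y2 = table[i].sec_head, table[i].sec_len
            break
    if x2 < 0:                                     # Step 3: no related secondary packet
        return -1
    for j in range(x2, x2 + y2):                   # full-length comparison here
        if table[j].token == target:
            return j
    return -1                                      # Step 4: no match found

packets = [Record("abcdef", 2, 2), Record("abcxyz", -1, 0),
           Record("abcdefgh", -1, 0), Record("abcdeft", -1, 0)]
print(look_up("abcdefgh", 0, 2, packets))   # 2 (found in the secondary packet)
print(look_up("abcqqq", 0, 2, packets))     # -1 (prefix never matches)
```

Note how the primary scan compares only symbols 4 to 6, since all tokens in one primary packet already share the 3-symbol head code.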
3.4 Dictionary Structure Properties
The proposed dictionary can be applied in any application that depends
on string looking up. It provides a high-speed directed search for perfect
matching.
 The reference table, although some addresses are wasted because of
string variance, is suitable for natural language
dictionaries, which are usually of huge size. The tokens are handled in
a separate table constructed depending on the reference table.
 String comparison consumes more time than comparisons of other data
types. In this approach, comparison is reduced to subsequences of both
the target and the stored tokens.
 The looking up procedure is fast in discovering the presence of a target
token in many situations:
o At the hashing step, an empty record implies a missed token after
consuming only one numeric comparison.
o At the primary packet, failure requires comparing at most the
three symbols from the fourth to the sixth position in the
6-symbol prefixes of the tokens within the primary packet.
o At the secondary packet, failure requires comparing the tokens within
that packet.
The worst case is failing to find the target at the end of a
secondary packet related to the last token in the primary packet, which
consumes (length of primary packet + length of secondary packet)
comparisons.
 Since looking up is string dependent, there is high flexibility in
associating information with tokens without any overloading of the search
process. As a result, it can be used to construct lexical and semantic
dictionaries.
3.5 Similarity Based Looking-Up
The structure described in section (3.2) is suitable for perfect
looking up, while the purpose of this work is to design a text correction
system where some errors arise from unknown or misspelled words.
Such situations require looking up the dictionary to generate candidates that
are similar (not identical) to the given misspelled token.
The main purpose of any similarity based grouping approach is to
reduce the search space to a manageable size in order to shorten looking up
time, but at the same time it should not lose good candidates or
similar objects (tokens). Clustering techniques are examples of such
approaches, but even fuzzy clustering techniques do not solve this problem
completely because:
 Token clustering should consider the sequence in which the symbols
are arranged in the token, in addition to the symbols themselves.
 Although there are many similarity measures for grouping
tokens, no obvious separation measure can be used to separate string
clusters.
 In fuzzy clustering, the decision threshold is a bottleneck:
a high threshold value loses good candidates, while a low
threshold increases redundancy by grouping less similar tokens in the
cluster, resulting in longer searching time and inaccurate candidates.
 As the number of fuzzy centroids to which a token relates becomes
larger, computing the nearest set of centroids also increases
search complexity.
For these reasons, an approach is proposed that keeps the same hash table as
the dictionary structure and improves the looking up technique. The
algorithm is presented in figure (3.5).
The improvement extends the search to include similarly spelled tokens,
depending on the same bases as the standard search described previously.
The outline of the proposed approach is:
 Bi-grams generation
 Primary centroids selection (at most 3 symbols long)
 Connecting centroids to the reference table
These three steps are presented in figure (3.4).
3.5.1 Bi-Grams Generation
The reference table is the building block of the bi-gram generation process; it
specifies the range of hashing addresses and the number of symbols
needed from token prefixes for computing hash values.
The hash-indexing method used here is limited to 3 symbols only;
therefore, bi-gram generation involves three sub-divisions, each producing
two symbols (a bi-gram):
(C1,C2), (C1,C3), and (C2,C3)
Division into three bi-grams simplifies predicting Damerau's four error
types (insertion, deletion, substitution, and transposition) by applying the
template C1C2C3 using only two symbols at a time, producing the results
shown in Table (3.3).
The variety of tokens in a natural language cannot satisfy all nine distributions
of the template sequences described above for every index in the
reference table; therefore, preprocessing is applied to collect the satisfied
prefixes by checking the presence of every generated template in the dictionary,
and the missing sequences are rejected.
Figure (3.4): Semi Hash Clustering block diagram
The diagram comprises three stages:
1. Bi-grams Generation: reference index selection, (C1,C2,C3) = H⁻¹(Index),
followed by bi-gram variant generation:
C1C2?  C1?C2  ?C1C2
C2C3?  C2?C3  ?C2C3
C1C3?  C1?C3  ?C1C3
2. Centroids Selection: per each bi-gram variant, a 3-symbol-length centroid
set is selected, followed by redundancy removal over the (bi-grams,
centroid set) pairs.
3. Centroids Referencing: association of the bi-grams with "Index".
Table (3.3): Predicting errors using bi-gram analysis
Sequence   Substitution   Insertion   Deletion    Transposition
C1C2?      √              √           ×           ×
C1?C2      ×              √           ×           If ?=C3
?C1C2      ×              √           ×           ×
C2C3?      ×              ×           √           ×
C2?C3      ×              √           If ?<>C1    If ?=C1
?C2C3      √              ×           ×           ×
C1C3?      ×              ×           If ?<>C2    If ?=C2
C1?C3      √              ×           ×           ×
?C1C3      ×              √           If ?<>C2    If ?=C2
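The nine-pattern template is straightforward to express in code. The sketch below is an illustration, not the thesis implementation: it builds the nine '?' variants from a 3-symbol prefix, and prunes, for a given pattern, the centroids that do not occur among the known dictionary prefixes (the pruning step of section 3.5.2).

```python
def bigram_variants(prefix: str) -> list:
    """Build the nine template sequences of Table (3.3) from prefix C1C2C3."""
    c1, c2, c3 = prefix
    return [c1 + c2 + '?', c1 + '?' + c2, '?' + c1 + c2,   # (C1,C2) variants
            c2 + c3 + '?', c2 + '?' + c3, '?' + c2 + c3,   # (C2,C3) variants
            c1 + c3 + '?', c1 + '?' + c3, '?' + c1 + c3]   # (C1,C3) variants

def centroids_for(pattern: str, alphabet: str, known_prefixes: set) -> list:
    """Assign every alphabet symbol to '?' and keep only prefixes that exist
    in the dictionary."""
    return [pattern.replace('?', s) for s in alphabet
            if pattern.replace('?', s) in known_prefixes]

print(bigram_variants("Che"))
# ['Ch?', 'C?h', '?Ch', 'he?', 'h?e', '?he', 'Ce?', 'C?e', '?Ce']
print(centroids_for('h?e', 'abcdefghijklmnopqrstuvwxyz',
                    {'hae', 'hee', 'hie', 'hoe', 'hue', 'hye', 'xxx'}))
# ['hae', 'hee', 'hie', 'hoe', 'hue', 'hye']
```

The second call reproduces sequence 5 of the worked example in section 3.5.2 for index 9882, assuming the listed prefixes are present in the dictionary.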
3.5.2 Primary Centroids Selection
For every accepted sequence, a set of centroids is selected as a subset
of the union of primary centroids that are at most three symbols long.
A centroid related to a specific sequence is an assignment of a symbol
from the alphabet to the '?' sign in that sequence. For example, at
index = 9882:
H⁻¹(9882) = "Che"
C1 = 'C', C2 = 'h', C3 = 'e'
The nine sequences and their related primary centroids, after pruning
mismatched sequences, are:
1. Ch?: ChB, ChE, Cha, Che, Chi, Chk, Chl, Chn, Cho, Chr, Cht, Chu,
Chw, Chy, Ch', Ch˽, Ch
2. C?h: Cah, Coh, C˽h
3. ?Ch: BCh, DCh
4. he?: hea, heb, hec, hed, hee, hef, heg, heh, hei, hej, hek, hel, hem, hen,
heo, hep, her, hes, het, heu, hev, hew, hex, hey, he', he-, he
5. h?e: hae, hee, hie, hoe, hue, hye
6. ?he: Ahe, Che, Ghe, Jhe, Khe, Lhe, Phe, Rhe, She, The, Whe, ahe,
bhe, che, dhe, ghe, khe, phe, rhe, she, the, whe
7. Ce?: Cea, Ceb, Cec, Ced, Cee, Cei, Cel, Cen, Cep, Cer, Ces, Cet,
Ceu, Cey
8. C?e: Cae, Cce, Cde, Cee, Che, Cie, Cle, Coe, Cre, Cse, Cte, Cue,
Cve, Cze
9. ?Ce: BCe, vCe
3.5.3 Centroids Referencing
The final step is to join every sequence to its centroid set and every
index to its bi-gram sequences.
This process includes creating a list of all the primary centroids in the
dictionary, which represent all the 3-symbol prefixes of primary packet
heads. Bi-grams are also stored in a separate list associated with the
address of the related primary centroid set.
The reference table, in turn, keeps track of the addresses of the bi-grams of
each index within it. As a result, bi-grams and the associated centroid sets
can be randomly accessed through the reference table.
3.6 Application of Similarity Based Looking up approach
The purpose of similarity based looking up is to minimize the
search space and maximize the chance of finding tokens that are similar to
the source token.
Figure (3.5): Similarity Based Hashing algorithm
Algorithm: Similarity Based Hashing
Input: Hashed Dictionary
Output: Similarity Based Hashed Dictionary
For each Reference Index apply the following steps:
Step 1: Bi-grams Generation
    1) CxCyCz = H⁻¹(Index)
    2) generate sequence variants
    3) filter sequences
Step 2: Primary Centroids Selection
    for each generated sequence do
        1) for every alphabet symbol do
            1.1) assign it to the sequence's missing symbol
            1.2) reject if no prefix matching is found
        2) remove duplicated centroids
Step 3: Centroids Referencing
    1) connect bi-grams to centroids
    2) connect Index to bi-grams
End.
The hashed dictionary structure shown in section (3.2) was built to
achieve perfect matching according to token prefixes; if the source token
is not found, then similar tokens should be looked up.
Because looking up the hashed dictionary is based on the prefixes of
tokens, the similarity based looking up accounts for all the possible mistakes
that can occur within the 3-symbol prefix of every token by exploiting the
bi-grams associated with the computed hash value. Every bi-gram is linked
to a list of primary centroids, which are in turn matched with the source
token's 3-symbol prefix and filtered according to the amount of similarity.
Centroids with the highest similarity are selected, while lower similarity
centroids are rejected to shorten the searching time.
The next step expands the prefix length in the similarity
calculation to include 6-symbol prefixes, because the selected
primary centroids refer to primary packets in which every token
differs from the other tokens in its 6-symbol prefix. This step makes the
search more precise by selecting the tokens in the primary
packet nearest to the source token.
Finally, every selected primary packet token may have a related
secondary packet in which each token shares the 6-symbol prefix of the
secondary head (i.e. the primary packet token). This final action, in turn,
maximizes the chance of encountering tokens similar to the source token
inside the secondary packet (which usually contains a small number of tokens).
An interesting property of this approach is the ability to use
thresholds at every level of the looking up procedure. A different threshold
can be used in the primary centroid selection, in the selection of secondary
packet heads, and in the selection of candidates. The value of the threshold is
application dependent and fundamentally restricted by the similarity
calculation method.
Figure (3.6): Block diagram of candidates generation using SBL
(C1,C2,C3) = source 3-symbol prefix → Index = H(C1,C2,C3) →
examining the 2-gram patterns (P1…P9) → primary centroids collection →
collected centroids filtering (highest similarity centroids selection) →
secondary centroids selecting and filtering → candidates generation.
3.7 The Similarity Based Looking up Properties
The proposed approach has several features that make it suitable for
various string based search applications:
1. Clustering illusion: the structure of the dictionary and its looking up
technique provide a way of dividing the search space into
three different levels:
a. Primary Centroid Clusters: only the 3-symbol prefixes are
checked, and the best are selected as centroids for the next level.
b. Primary Packet Clusters: every token here is referenced by a
primary centroid and may itself reference a secondary packet
(i.e. act as a secondary centroid).
c. Secondary Packet Clusters: every token is referenced by a
secondary centroid.
2. Time Complexity Minimization: merging the hashing function with
indexing simplifies searching and provides random access at more
than one level.
3. Application Flexibility: thresholds can be used at every clustering
level as separators to exclude uninteresting centroids or candidates.
Choosing the threshold value is left to the developer, the
similarity calculation method used, and the application area.
The algorithm in figure (3.7) outlines the complete process.
Figure (3.7): Similarity Based Looking up algorithm
Algorithm: Similarity Based Looking up
Input: Hashed Dictionary; Source_Token; similarity thresholds: T1, T2, T3 *
Output: Candidates Set
Step 1: Hash Index Calculation
    C1, C2, C3 = 3-symbol prefix of Source Token
    Index = H(C1, C2, C3)
Step 2: Primary Centroid Selection
    for each bi-gram at Index do
        for each related Primary Centroid do
            if similarity(C1C2C3, Primary Centroid) >= T1 then
                select Primary Centroid
Step 3: Secondary Centroids Selection
    for each selected Primary Centroid do
        for each related Secondary Centroid do
            if similarity(6-symbol source prefix, Secondary Centroid) >= T2 then
                select Secondary Centroid
Step 4: Candidates Selection
    for each selected Secondary Centroid do
        for each Token in the related Secondary Packet do
            if similarity(Source Token, Token) >= T3 then
                select as Candidate
End.
* If no threshold is indicated, the approach generates candidates according
to maximum similarity.
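The three selection levels of the algorithm can be sketched as follows. The container shapes (pattern list, centroid maps, packet maps) and the positional similarity function are assumptions for illustration only; the thesis's own data layout and similarity measure would replace them.

```python
def sim(a: str, b: str) -> float:
    """Crude positional similarity (assumption; stands in for the thesis measure)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def sbl(source, patterns, pattern_centroids, primary, secondary, t1, t2, t3):
    """Three-level Similarity Based Looking up (sketch of Figure 3.7)."""
    candidates = []
    for pattern in patterns:                                 # bi-grams at Index
        for centroid in pattern_centroids.get(pattern, []):
            if sim(source[:3], centroid) >= t1:              # Step 2: primary centroids
                for head in primary.get(centroid, []):
                    if sim(source[:6], head[:6]) >= t2:      # Step 3: secondary centroids
                        for tok in [head] + secondary.get(head, []):
                            if sim(source, tok) >= t3 and tok not in candidates:
                                candidates.append(tok)       # Step 4: candidates
    return candidates

# Toy run: "chest" against a tiny two-level dictionary.
out = sbl("chest", ['ch?'], {'ch?': ['che']},
          {'che': ['chess', 'chose']}, {}, 0.6, 0.5, 0.7)
print(out)   # ['chess']
```

The progressively longer prefixes (3 symbols, then 6, then the full token) mirror how the thresholds T1, T2, T3 narrow the search at each level.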
Chapter Four
Error Detection and Candidates Generation
4.1 Introduction
Error detection is the process of indicating incorrect words in a
text. The term "incorrect" may refer to a misspelled word, a misused word, or
both. Misused words are correctly spelled but used in a way that violates the
syntax or the meaning of the sentence.
The detection of misspelled words (non-word errors) is a straightforward
process in that it involves looking up every token in a pre-prepared
list or dictionary (also referred to as a "lexicon") containing all the well
spelled words of the language; however, the size of the lexicon affects the
looking up process, because larger sizes require longer time.
On the other hand, detecting misused words (real-word errors) is a
more complex task. It requires analyzing the syntax of the sentence to
check the correctness of the sentence's constituency and, if it is not correct,
indicating the word(s) that violated it. Errors resulting in
meaningless sentences entail further processing, which may extend beyond
the sentence boundaries and needs more information about the sentence
tokens.
4.2 Non-word Error Detection
Detecting misspelled tokens in this system is based on the dictionary
looking up technique and is performed within the tagging stage.
Tokens of a given text must be tagged. A tag should be found for
every token in the considered language; therefore, tokens are collected and
stored with their tags in a lexicon. Tagging is a fundamental stage
in most natural language processing systems, and it must precede syntax
analysis, since no parsing can be done without associating a tag with each
token in the sentence.
Figure (4.1): Tagging Flow Chart
Start → Read text → Convert the text into a token stream → Handle a token →
Look up inside the hashed dictionary → Found?
(Yes: save the (token, tag) pair;
No: generate candidates and save the (token, {candidates, tags}) list) →
Last token? (No: handle the next token;
Yes: pass the new tagged stream to the segmentation step) → End.
Because tagging requires looking up every token in the given text, it
serves another task at the same time, since missing tokens are marked as
misspelled.
The looking up procedure discussed in Chapter Three is used for
discovering non-word errors; the dictionary structure is a
reconstruction of about 300,000 tokens collected from two datasets as raw
data. The two major resources of the lexicon are WordNet and ISPELL;
WordNet represented the basic resource and was integrated with the ISPELL
dataset to make the lexicon more comprehensive.
The lexicon was hashed and indexed in order to achieve random
access. The looking up time is very short compared to the typical structures,
and the tagger is capable of deciding whether a token is found in the lexicon,
sometimes after consuming only one operation (for further details
see sections 3.3 and 3.4).
4.3 Real-words Error Detection
Deciding whether a word is misused is more complex than
detecting misspelled words; the process needs more computation and more
resources. Syntax analysis can be exploited to recognize misused words,
since every English sentence (like sentences in most natural languages) is
constrained by a syntactic rule or grammar. Any sentence that violates the
syntax constraints and cannot be parsed using a finite set of
production rules is marked as an incorrect sentence. Next, the sentence should
be processed to indicate the erroneous word that made the sentence
incorrect.
Phrasing is a good way to precisely indicate the incorrect word
by converting the sentence into constituents. The constituency
hierarchy starts from the sentence as the head of the tree, which contains
one or more clauses; each clause contains one or more phrases, and each
phrase contains one or more words.
The division into phrases is useful in reducing the parse tree. As the
number of tokens becomes larger, the available parses for the same
sentence increase.
The suggested approach is rule based: any sentence that cannot be parsed
correctly is described as incorrect. The syntax analyzer is based on
phrasing, applying a brute force approach to identify the misused word
in the phrase.
The syntax analyzer is fully dependent on the output of the tagger;
however, misspelled words should be replaced with suggestions in order to
allow the analyzer to proceed with analyzing the sentence and select the best
alternative that makes the sentence acceptable. (Chapter Five details the
idea.)
4.4 Candidates Generation
Candidates are tokens with high similarity to the incorrect
word. The meanings of "similarity" and "incorrect" are relative. In the case
of non-word errors, the incorrect word is misspelled, and similarity is
a measure of how closely another token is spelled or pronounced to the
misspelled word. In the case of real-word errors, the candidate
token is the one most likely to have been intended by the writer but
confused with the incorrect one; sometimes a spelling or phonetic mistake
results in another correct word.
4.4.1 Candidates Generation for Non-word Errors
In this step, the system takes the incorrect token (a token outside the
dictionary) and looks for similar tokens in the underlying dictionary.
Since every token in the dictionary may be the one intended by the writer, the
process is somewhat complex. Several issues should be considered in
deciding which tokens are suitable to be generated as candidates.
A major problem is the distinction between unknown and mistaken
words; therefore, this research considers every unknown word a
mistaken one and lets the decision be taken by the user himself/herself.
However, candidates (or alternatives) are generated depending on the
mistaken word, and the total process is performed in the following way:
At first glance, all the dictionary tokens, whose count may
reach some hundreds of thousands, could be intended by the writer, or none
of them could be. The writer (or typist) might really have misspelled the word,
or he/she wrote it perfectly but the word is simply not found in the
dictionary, i.e. never seen before, and then it is an "unknown" token.
The number of generated candidates is not limited; further
processing reduces the list of candidates to include only the best set
according to the amount of similarity and some other criteria that are fully
dependent on the spelling of the encountered misspelled token.
In the tagging stage, if a token is not found in the lexicon, then it is
misspelled. The starting step is to look for the most similar tokens in the
lexicon dictionary and rank them according to their similarity to the
misspelled token; the similarity is based on a minimum edit distance
measure. This action reduces the number of candidates to an acceptable
amount, depending on a threshold for the number of edit operations needed
to equalize a candidate and the misspelled word, or a maximum limit on the
number of candidates.
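As a sketch of this ranking step, the snippet below filters and orders candidates using `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score; the thesis itself uses the edit-distance-based measure of section 4.4.1.2, and the threshold and limit values here are illustrative only.

```python
import difflib

def rank_candidates(misspelled: str, tokens, min_ratio: float = 0.7, limit: int = 10):
    """Keep tokens scoring at least `min_ratio` against the misspelled word,
    best matches first, capped at `limit` entries."""
    scored = sorted(((difflib.SequenceMatcher(None, misspelled, t).ratio(), t)
                     for t in tokens), reverse=True)
    return [t for ratio, t in scored if ratio >= min_ratio][:limit]

print(rank_candidates("speling", ["boat", "spoiling", "spelling"]))
# ['spelling', 'spoiling']
```

Either knob implements the reduction described above: raising `min_ratio` enforces a similarity threshold, while `limit` caps the candidate list size.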
4.4.1.1 Enhanced Levenshtein Method
The modification of the Levenshtein method is performed by
extending the standard matching step at line 12 in figure (2.7) to check for the
presence of a transposition case. The idea arises from the fact that no
transposition case can be found without a matching success
between at least two symbols in the examined strings; more precisely,
the transposition can be discovered using a minimum number of operations
by considering two facts:
- Two adjacent symbols can never be mirrored by two adjacent
symbols in another string unless the first symbol in the first pair matches
the second in the second pair.
- Instead of handling the transposition occurrence separately, the
algorithm can modify the cell under processing in the distance matrix
directly, and the next matching steps will do the work.
The first fact serves to avoid trying all possibilities, as was
done in Damerau's modification at lines 20 and 21 in figure (2.8), where
each symbol is matched against every symbol in the second string, regardless
of whether a transposition operation is possible, by adding additional
matching statements to the original one at line 12 in figure (2.7).
The second fact, on the other hand, concerns another side of the
processing: the distance matrix is filled sequentially, row by row, from
the top left corner to the bottom right corner (where the total
distance is held). Using one step to process both cases (whether a transposition
happens or not) is a good way to minimize the number of
operations required to accurately compute the distance.
In this modification, the distance matrix is updated directly in one
step, and the next steps (selecting the minimum and filling the cell under
processing) continue normally as in the original algorithm; this
removes the step at line 22 of Damerau's algorithm (figure 2.8),
which requires more than one operation to complete.
Modifying the Levenshtein method reduced the time and
enhanced the candidates generation process because the modification
exploits the first fact to make the algorithm avoid checking the cases that
lead to a failure situation, unlike the Damerau-Levenshtein modification,
which makes no difference between the two situations; this is presented in
lines 15 and 16. The direct update of the distance matrix (line 17) in the
enhanced algorithm accurately adjusts the distance without any
additional processing; it is simply an assignment.
The time complexity is related to the distance between the input
strings. As the strings become more different, the steps at lines
15, 16 and 17 in the enhanced algorithm (figure 4.2) are rarely executed;
therefore, they save time. In turn, this property is preferable in the
cases where the algorithm is used for generating candidates.
Candidates should be as similar as possible to the source token
(usually a mistaken word), and the conditional nature of the additional steps
(lines 15, 16 and 17) in the enhanced algorithm makes the time consumed in
generating candidates useful (not wasted), in the sense that those
steps are executed only when there is a match with the source token, and
they are executed more often the more the source word matches the
target word, which means the target is a good candidate.
The algorithm in figure (4.2) shows the enhancement of the original
Levenshtein method, and the rest of this section describes the differences
among the three methods (original Levenshtein, Damerau-Levenshtein and the
enhanced Levenshtein method) by processing the two example strings
"Transposed" and "Tarnspaesd":
Figure (4.2): The Enhanced Levenshtein Method Algorithm
1. Algorithm: Enhanced Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5.     distance(length of String1, length of String2)=0, min1=0, min2=0, min3=0, cost=0
6. Step2: Calculate Distance
7.     if String1 is NULL return length of String2
8.     if String2 is NULL return length of String1
9.     for each symbol x in String1 do
10.        for each symbol y in String2 do
11.        begin
12.            if x = y
13.            begin
14.                cost = 0
15.                if x is not the start symbol of String1 then
16.                    if (the symbol preceding x = the symbol following y) and (x is not duplicated) then
17.                        decrease distance(index(x)-1, index(y)) by 1 // transposed
18.            end
19.            else cost = 1
20.            r = index of x, c = index of y
21.            min1 = (distance(r-1, c) + 1)      // deletion
22.            min2 = (distance(r, c-1) + 1)      // insertion
23.            min3 = (distance(r-1, c-1) + cost) // substitution
24.            distance(r, c) = minimum(min1, min2, min3)
25.        end
26. Step3: return the value of the last cell in the distance matrix
27.     return distance(length of String1, length of String2)
28. End.
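For comparison, the widely used optimal string alignment (restricted Damerau-Levenshtein) distance can be written in Python as below. This is the standard textbook formulation, not the thesis's enhanced variant, but it computes the same distance of 3 for the example pair used in this section.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transposition."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # i deletions from a
    for j in range(n + 1):
        d[0][j] = j                       # j insertions into a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n]

print(osa_distance("Transposed", "Tarnspaesd"))   # 3
```

The enhanced method of figure (4.2) aims at the same result while triggering the transposition check only when two symbols have already matched.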
Figure (4.3): Original Levenshtein Example
1) Levenshtein: the minimum edit distance = 5
1. substitute 'r' by 'a'
2. substitute 'a' by 'r'
3. substitute 'o' by 'a'
4. substitute 'e' by 's'
5. substitute 's' by 'e'
Computation complexity: M*N comparisons = 100;
(cost, min1, min2, min3) assignments * 100 = 400;
100 minimum function calls.
Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
       T  r  a  n  s  p  o  s  e  d
    0  1  2  3  4  5  6  7  8  9 10
 T  1  0  1  2  3  4  5  6  7  8  9
 a  2  1  1  1  2  3  4  5  6  7  8
 r  3  2  1  2  2  3  4  5  6  7  8
 n  4  3  2  2  2  3  4  5  6  7  8
 s  5  4  3  3  3  2  3  4  4  5  6
 p  6  5  4  4  4  3  2  3  4  5  6
 a  7  6  5  4  5  4  3  3  4  5  6
 e  8  7  6  5  5  5  4  4  4  4  5
 s  9  8  7  6  6  5  5  5  4  5  5
 d 10  9  8  7  7  6  6  6  5  5  5

Figure (4.4): Damerau-Levenshtein Example
2) Damerau-Levenshtein: the minimum edit distance = 3
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following
operations are executed: 100 comparisons (line 21);
81 comparisons (line 22); 2 calls of the minimum function (line 23).
Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
       T  r  a  n  s  p  o  s  e  d
    0  1  2  3  4  5  6  7  8  9 10
 T  1  0  1  2  3  4  5  6  7  8  9
 a  2  1  1  1  2  3  4  5  6  7  8
 r  3  2  1  1  2  3  4  5  6  7  8
 n  4  3  2  2  1  2  3  4  5  6  7
 s  5  4  3  3  2  1  2  3  3  4  5
 p  6  5  4  4  3  2  1  2  3  4  5
 a  7  6  5  4  4  3  2  2  3  4  5
 e  8  7  6  5  5  4  3  3  3  3  4
 s  9  8  7  6  6  4  4  4  3  3  4
 d 10  9  8  7  7  5  5  5  4  4  3
Figure (4.5): Enhanced Levenshtein Example
3) Enhanced Levenshtein: the minimum edit distance = 3
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following
operations are executed: 12 comparisons (line 15);
7 comparisons (line 16); 2 assignments (line 17).
Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
       T  r  a  n  s  p  o  s  e  d
    0  1  2  3  4  5  6  7  8  9 10
 T  1  0  1  2  3  4  5  6  7  8  9
 a  2  1  0  1  2  3  4  5  6  7  8
 r  3  2  0  1  2  3  4  5  6  7  8
 n  4  3  1  1  1  2  3  4  5  6  7
 s  5  4  2  2  2  1  2  3  3  4  5
 p  6  5  3  3  3  2  2  3  4  5  6
 a  7  6  4  3  4  3  3  3  4  5  6
 e  8  7  5  4  4  4  4  4  2  3  4
 s  9  8  6  5  5  4  5  5  2  3  4
 d 10  9  7  6  6  5  5  6  3  3  3
4.4.1.2 Similarity Measure
Minimum edit distance methods count the number of edit
operations required to convert one string to another, but they do not show how
similar the two strings are. For example, the distance between "a" and "b" is 1,
but the similarity is 0; whereas the distance between "Similar" and "Similer" is
also 1, but the similarity is 6/7.
String lengths should therefore be taken into account when computing the
edit distance, so that the resulting value can be used as a similarity measure.
The absolute length difference between any two strings is added to the total
of mismatched symbols, since it is considered the number of symbols deleted
from the shorter string; hence, the similarity measure must depend on the
maximum of the two lengths. The relative distance is computed by:
R_Dist(St1,St2)= distance(St1,St2) / max(length(St1),length(St2)) … (4.1)
Relative distance is a value within the interval (0,1) where
completely different strings have a relative distance of 1; and as its value
decreases, the difference is also decreases until reaching the value of 0
when the two strings are identical.
Since the similarity and difference are complements to each other,
the similarity can be computed by:
Similarity (St1, St2)=1- R_Dist(St1,St2) … (4.2)
The latter is the similarity measure used in the candidates' generation
for this work.
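Equations (4.1) and (4.2) can be sketched in code. The plain Levenshtein recurrence below stands in for the enhanced variant described earlier in the chapter; function names are illustrative.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Count the insertions, deletions, and substitutions
    needed to turn s1 into s2 (dynamic programming, two rows)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def relative_distance(st1: str, st2: str) -> float:
    """Equation (4.1): distance normalized by the longer length."""
    return levenshtein(st1, st2) / max(len(st1), len(st2))

def similarity(st1: str, st2: str) -> float:
    """Equation (4.2): similarity is the complement of the
    relative distance, within the interval [0, 1]."""
    return 1.0 - relative_distance(st1, st2)

print(similarity("Similar", "Similer"))  # 6/7, as in the example above
```

This reproduces the chapter's example: "a" vs "b" gives similarity 0, while "Similar" vs "Similer" gives 6/7.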
4.4.1.3 Looking for Candidates
To find similar tokens, the dictionary must be searched and every
token in it examined against the source word. This process consumes time
because of the huge number of tokens held by the lexicon dictionary and the
time required by the examining algorithm itself to find the minimum edit
distance and compute the similarity to the source token. Hence, the search
space needs to shrink; the Similarity Based Looking up method shown in
Chapter Three is used to group similar tokens into clusters using local
properties, i.e. the clustering process groups similar tokens depending on
their spelling only.
The input of the algorithm in figure (3.7) is the misspelled token.
The use of thresholds depends on the generating ability, i.e. how similar
the generated candidates are to the source token. If they are highly similar,
the top set is selected; but if there is difficulty discovering reasonable
candidates, using thresholds may be a good solution. As the misspelled
token becomes highly confused, the set of examined centroids becomes
larger; therefore, a filtering factor must be used to reduce the search space.
At least one generated primary centroid should be similar to the 3-
symbol prefix of the source token by an amount of 2/3, which allows at
most one mistake in the prefix. This restriction is not randomly selected;
experiments revealed that misspellings are usually single-error, with a ratio
between 70% and 95% depending on the text source, and mistakes rarely
happen in the first three letters. According to [Pol84], 7.8% of errors occur
in the first letter, 11.7% in the second letter and 19.2% in the third letter,
where each percentage is independent of the others.
After collecting the most similar set of primary centroids, the next
step is to examine the secondary centroids of every selected primary
centroid. The selection also depends on the similarity to the 6-symbol
prefix of the source token, since secondary centroids are at most 6 symbols
long. The second threshold constrains the error to at most two mistakes,
i.e. 2/6 or less; but in some situations there is a need to select the best
centroid from every secondary centroid set (from every selected primary
cluster), because looking for candidates at this stage is limited to the first
six symbols of the tokens, while longer tokens may contain more than two
mistakes in their prefixes. In other words, for every selected primary
centroid, the nearest secondary centroids are selected, and the threshold
serves as a limit to avoid selecting less similar centroids when there are
centroids with higher similarity.
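The two-level filtering described above can be sketched as follows. This is a hedged illustration, not the thesis's algorithm (figure 3.7): primary centroids are kept when their 3-symbol prefix differs from the source prefix in at most one position (2/3 similarity), secondary centroids when their 6-symbol prefix differs in at most two positions (2/6 error). All names are illustrative.

```python
def prefix_mismatches(a: str, b: str, n: int) -> int:
    """Count position-wise mismatches within the n-symbol prefix,
    padding shorter prefixes with spaces."""
    return sum(1 for x, y in zip(a[:n].ljust(n), b[:n].ljust(n)) if x != y)

def filter_primary(source: str, primary_centroids):
    # allow at most one mistake in the first three letters (2/3 similarity)
    return [c for c in primary_centroids if prefix_mismatches(source, c, 3) <= 1]

def filter_secondary(source: str, secondary_centroids):
    # allow at most two mistakes in the first six letters (2/6 error)
    return [c for c in secondary_centroids if prefix_mismatches(source, c, 6) <= 2]

print(filter_primary("tarnsposed", ["tar", "tra", "xyz"]))  # keeps 'tar'
```

Note that a transposition inside the prefix counts as two mismatches under this position-wise comparison, which is why the best secondary centroid per cluster is still selected even when the threshold is not met.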
Finally, for every selected secondary centroid, the candidates are
generated from the secondary packets related to a centroid with a
reasonable similarity to the source token. The decision to select a token as
a candidate is then easier, because the comparison is applied over the total
lengths of both the source token and the dictionary tokens.
Ranking the candidates is a subroutine of the optimization stage; it
uses more information than the similarity measure alone.
4.4.2 Candidates Generation for Real-words Errors
In this work, the generation of candidates is rule based. It can be
divided into two types according to the step in which it is applied:
 Before suggesting optimal candidates for misspelled words:
This type of generation is applied to sentences that do not contain
misspelled words; the decision is made after phrasing the sentence into
constituents and manipulating each phrase alone. A word that violates
the grammar or syntactic rules of the given sentence's construction
should be detected and replaced with a set of other forms, any of which
makes the sentence syntactically accepted.
Grammar correction techniques are many and various; two
techniques are used in this step to solve part of the syntax errors: verb
tense correction and subject-verb agreement.
 After suggesting optimal candidates for misspelled words:
After ranking candidates, this step allows the correction system to select
more precisely the candidate that best fits into the sentence, making it
correct or at least not violating its correctness. Selecting the best
candidates after ranking is an additional filter for generating the best
suggestion set.
Chapter Five
Automatic Text Correction and Candidates
Suggestion
5.1 Introduction
Text correction is the process of substituting incorrect word(s) with
correct word(s) selected as candidates and filtered to be the most suitable
among many alternatives.
Automating text correction is a complex task because of its direct
association with human nature; a written word can never be absolutely
predicted, even with perfect decision-making parameters helping a
computer choose the perfect suggestion, since artificial intelligence has not
yet reached human capabilities. However, there is always an alternative:
optimizing candidates is one way of handling the problem. Many existing
techniques can help in making the decision and providing the user with a
set of highly expected alternatives for a given incorrect word.
This work, as the next sections show, exploits many features related
primarily to the incorrect word and its candidates themselves rather than to
context. The automatic correction does not rely on meaning; it suggests
candidates depending on the output of the previous stages (tokenization,
tagging, and similarity based candidates generation) after applying
multi-feature ranking and syntax analysis.
5.2 Correction and Candidates Suggestion Structure
Figure (5.1) shows the ranking process applied to the generated candidates.
For every incorrect word, a set of candidates was generated by the
candidates generator at the tagging stage. A set of features is predefined
for ranking candidates according to similarity and error-type relevance.
The feature set includes the similarity value between the generated
candidate and the incorrect word, confusion and transposition factors, the
type of error in the incorrect word, and syntactic properties.
Ranking process involves:
 Assigning a value for every feature.
 Computing the effect factor of each feature (weighting).
 Summing all the weights in a single number.
 Inserting the processed candidate at the suitable index within the
candidates list, where high-similarity candidates are ranked at the top
and low-similarity candidates are inserted at the bottom.
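The steps above can be sketched as a weighted sum over a feature vector. The feature names and weight values below are illustrative, not the thesis's tuned parameters; section 5.4 only specifies that similarity takes the largest share.

```python
# Illustrative weights: similarity dominates, as section 5.4.1 requires.
FEATURE_WEIGHTS = {
    "similarity": 5.0,
    "first_symbol_match": 0.5,
    "end_symbol_match": 0.4,
    "length_agreement": 1.0,
    "transposition": 1.0,
    "confusion": 1.0,
    "duplication": 0.8,
    "same_symbol_set": 0.6,
}

def rank_value(features: dict) -> float:
    """Weighted sum of feature values: Rank(c) = sum_i w_i * v_i(c)."""
    return sum(FEATURE_WEIGHTS[name] * value for name, value in features.items())

def rank_candidates(candidates: dict) -> list:
    """Sort candidate words so the highest rank value comes first."""
    return sorted(candidates, key=lambda c: rank_value(candidates[c]), reverse=True)

cands = {
    "similar": {"similarity": 6/7, "first_symbol_match": 1, "end_symbol_match": 0,
                "length_agreement": 1, "transposition": 0, "confusion": 1,
                "duplication": 0, "same_symbol_set": 0},
    "simpler": {"similarity": 5/7, "first_symbol_match": 1, "end_symbol_match": 0,
                "length_agreement": 1, "transposition": 0, "confusion": 0,
                "duplication": 0, "same_symbol_set": 0},
}
print(rank_candidates(cands))  # 'similar' ranks above 'simpler'
```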
Features are represented by a vector of eight elements that may be
decreased or increased depending on the purpose for which the text
correction is applied, the source of the input text, and the expected error
rate. Similarly, the weight of each feature is also affected by the input text
source, since some features depend on error type.
Before applying the ranking process, the source token that was marked
as misspelled is examined against Named Entity (NE) features, because
most proper nouns are not added to dictionaries, resulting in a mismatch.
Recognizing NEs requires combining multiple sources of information;
some are strong enough to decide that a misspelled token is a name, but
not that it isn't. Syntax analysis follows feature-based ranking; it is
another step for optimizing results and, mostly, the one with the highest
effect. The ranking of candidates is completed by the syntactic role of the
candidates that would be selected as suggestions.
Figure (5.1): Candidates ranking flowchart
[Flowchart summary: for each (misspelled token, candidates list) pair, every
candidate is handled in turn: the similarity value is computed and inserted
symbols are specified, then the features are tested (confusion, transposition,
equal lengths, duplication, difference within threshold, same symbol set,
end-symbol match, first-symbol match). Each satisfied test assigns one of the
weights W1-W7 a factor value f1-f3; the weights are summed and the
candidate is inserted into the ranked list, repeating until the last candidate.]
5.3 Named-Entity Recognition
A big set of weak-evidence features is proposed to decide whether a
token is a named entity or not, though the features themselves and the
level of analysis vary.
Some features are efficient at deciding that a token is a named entity and
can be used individually in decision making; other features are never
helpful unless combined with others. The features fall into many
sub-categories; the best known relate to word level, part-of-speech tags,
and dictionary look-up.
Since the purpose of this system is determining token correctness, the
word-level features are the most helpful, because dictionary look-up is
already satisfied (a matched token does not need to be analyzed) and
part-of-speech tags are useless in the absence of a decision.
In English, the following features give developers some evidence
for name detection:
(1) All-uppercase: a token consisting of capital letters only.
(2) Initial-caps: a token starting with a capital letter.
(3) All-numbers: a token consisting of numbers only.
(4) Alphanumeric: a token containing both letters and numbers.
(5) Single-char: a token of one letter.
(6) Single-i: a token consisting of the single letter 'i'.
The all-uppercase feature is the strongest and can be used
individually; initial-caps may be affected by the token's position within
the sentence, because English sentences start with a capitalized word. In
this system the all-numbers feature is handled by treating all numeric
values equally, assigning the same hash code and the same tag to every
numeric string in the hash table. Many abbreviations, a sort of named
entity, are alphanumeric, so alphanumeric is a good feature. The
single-character feature is used by Microsoft Word. Finally, a single 'i'
may refer to the pronoun 'I', which is often mistakenly written as a
lowercase letter.
Named-entity recognition features may help mark a token as a
name, but they cannot precisely decide that it is not one. For example,
some names are written in lowercase letters, like "van Gogh", which
satisfies none of the features above.
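The word-level checks listed above can be sketched directly with string predicates. This is a minimal illustration; the initial-caps test here is deliberately strict (first letter upper, rest lower) and would miss mixed-case names such as "McDonald".

```python
def ne_features(token: str) -> dict:
    """Evaluate the six word-level named-entity features of section 5.3."""
    return {
        "all_uppercase": token.isalpha() and token.isupper(),
        "initial_caps": token[:1].isupper() and token[1:].islower(),
        "all_numbers": token.isdigit(),
        "alphanumeric": any(c.isalpha() for c in token)
                        and any(c.isdigit() for c in token),
        "single_char": len(token) == 1 and token.isalpha(),
        "single_i": token == "i",
    }

print(ne_features("NASA")["all_uppercase"])  # True
print(ne_features("B2B")["alphanumeric"])    # True
```

As the text notes, these features can only argue *for* name status: "van Gogh" fires none of them, yet is a name.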
5.4 Candidates Ranking
If the misspelled token is not recognized as a named entity, the
ranking process starts by measuring the similarity between the source
token and every candidate in the associated list in a more sophisticated
manner, considering the type of the committed error, to find a numeric
value that describes the fidelity of each candidate over the rest.
Eight weighted features are used to account for every error type's effect
on the whole candidate string; three different factor values are shown in
the flowchart in figure (5.1) to outline and simplify the idea of giving
different factor values to different error types (f1 = high, f2 = medium,
f3 = low). In practice, effect factors are numeric values that vary from
one feature to another.
For each element in the features vector, there is a weight reflecting that
feature's share in the total computed rank value. The rank value for each
candidate is computed by:
Rank(c) = ∑i=1..n ( wi × vi(c) ) … (5.1)
Where
n = the number of features, c = the selected candidate, wi = the weight
associated with feature no. i, and v is the features vector.
Weight values depend on the application area of the system; they rely
on the input text quality and the input device itself. The following
subsections describe each feature and its effect on the ranking process.
5.4.1 Edit Distance Based Similarity
The enhanced Levenshtein edit distance method is used to calculate the
distance of each candidate from the source token. Computing the similarity
depends on the distance and the lengths of the two strings (for more
details see section 4.4.1.2).
Similarity is measured by a numeric value within the interval [0,1];
therefore, it is multiplied by a factor that normalizes it against the other
features in such a way that it takes the largest share of the ranking value
among the features' weights.
In this application, as preferred in others, because similarity dominates
the suggestion decision, it is weighted by a factor several times larger
than the other features' weights.
5.4.2 First and End Symbols Matching
Research in the area of error analysis shows that mistakes rarely
happen in the first letters of a word, and mostly the first letter is not
mistaken.
The probability of mistaking the second letter is also high but does not
achieve interesting results compared to the first letter. On the other side,
the end letter has a mistake probability near to that of the first letter,
and hence it is used as part of the optimization procedure in calculating
ranking values.
First and end letters are sufficient because of how human brains work.
Research from Stanford University showed that our brains can predict
the correct word even if its letters are randomly permuted, provided only
that the first and end letters are correct.
Exploiting this result assists the process of optimizing candidate
suggestion. However, the idea cannot be applied directly in a
computational way, because the human mind's ability to predict and
connect facts is extremely fast and reliable; it depends on imagination
and semantic relevance in interpreting sentences even in the presence of
errors. To date, such ability is not found in computers.
As a result, this feature, the difference in lengths, and the same-symbol-set
feature together can simulate the human brain in a statistical way, because
the idea is originally dependent on statistics.
Small weights are given to both the first-letter and end-letter features,
with a preference for the first-letter feature because it has a larger effect
on prediction than the end letter does.
5.4.3 Difference in Lengths
Writing mistakes usually occur within the token length or in its
length ± 1; rarely do the lengths of the mistaken token and the intended
token differ by more than one unit.
Equality of lengths does not only affect the candidate itself directly but
also other features like transposition, confusion, and even duplication
(the next subsections detail the idea).
Candidates with larger difference values may be rejected although they
achieve good ranking indexes. The feature value is calculated by the
relative length difference:
R_L_D(St1,St2)= 1- ( abs(||St1|| - ||St2||))/ min( ||St1||,||St2||) … (5.2)
Where ||Sti|| is the length of string Sti
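Equation (5.2) translates directly to code. A minimal sketch, assuming both strings are non-empty:

```python
def relative_length_difference(st1: str, st2: str) -> float:
    """Equation (5.2): 1 - |len(St1) - len(St2)| / min(len(St1), len(St2)).
    Equal lengths give 1.0; the value drops as the lengths diverge."""
    return 1 - abs(len(st1) - len(st2)) / min(len(st1), len(st2))

print(relative_length_difference("hopeful", "hopefull"))  # 1 - 1/7
```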
The weight of this feature depends on the source of the input text; texts
entered through an optical character recognizer (OCR) usually take
smaller weights, while typed documents take larger values because the
insertion of symbols is probable.
5.4.4 Transposition Probability
Transposition refers to the case of replacing a character with a
neighbouring one that is either similar in style or placed next to it on the
keyboard.
Usually, this type of error occurs in typed texts and is referred to as a
"typo". Since English has a small alphabet, computing the probability of
transposing one letter into another is easy.
Table (5.1) shows a transposition matrix containing the probability of
each letter being mistyped as another of the 26 letters, regardless of case,
because such mistakes are related to the physical movement of the
typist's fingers, not to the typed token. This feature considers two types
of errors:
1. Errors within the length of the word: the typist mistakes a given
letter for another, i.e. substitutes it with a neighbouring letter by
pressing the wrong key instead of the intended one. Such cases
are described as first degree, and the feature value is assigned
the maximum.
2. Errors resulting in word-length increment: sometimes fingers
miss the intended letter's exact position and press two keys
simultaneously, typing two consecutive letters (the intended
letter and the one to its right or left). This mistake inserts an
additional letter and increases the word length by one.
Table (5.1) : Transposition Matrix
a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 1 0 0 1
b 0 0 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0
c 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0
d 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 1 0 0 0
e 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 0
f 0 0 1 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
g 0 1 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0
h 0 1 0 0 0 0 2 0 0 2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
i 0 0 0 0 0 0 0 0 0 1 1 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0
j 0 0 0 0 0 0 0 2 1 0 2 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
k 0 0 0 0 0 0 0 0 1 2 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 1 1 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0
n 0 0 0 0 0 0 0 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0
q 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
r 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
s 1 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 1
t 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0
u 0 0 0 0 0 0 0 1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0
v 0 0 2 2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0
x 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2
y 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0
z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0
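Reading Table (5.1) in code amounts to a two-level lookup. The sketch below stores only a fragment of the matrix (rows 'a' and 'e', copied from the table) as a dict of dicts; a full implementation would load all 26 rows.

```python
# Fragment of Table (5.1): keyboard-neighbour scores for rows 'a' and 'e'.
# 0 = not neighbours, 1 = weak neighbour, 2 = strong neighbour.
TRANSPOSITION = {
    "a": {"q": 1, "s": 2, "w": 1, "z": 1},
    "e": {"d": 1, "r": 2, "s": 1, "w": 2},
}

def transposition_score(intended: str, typed: str) -> int:
    """Score a substitution by keyboard adjacency, ignoring case,
    since the mistake depends on finger movement, not the token."""
    return TRANSPOSITION.get(intended.lower(), {}).get(typed.lower(), 0)

print(transposition_score("e", "r"))  # 2: adjacent keys
print(transposition_score("e", "p"))  # 0: far apart on the keyboard
```

Case is folded before the lookup, matching the table's case-independence.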
5.4.5 Confusion Probability
Confusion refers to the case of replacing a letter with another of
similar pronunciation; sound is the basis for calculating the probability of
confusing a given letter, unlike the transposition probability, which
depends on the keys' arrangement on the keyboard.
This type of analysis is concerned with phonetic errors; usually,
vowels are the most confused letters. The weight of this feature depends
on the application where the correction is used; it should take large values
when used with speech recognition systems. Table (5.2) shows the Stanford
confusion matrix after being updated and normalized.
Table (5.2) : Confusion Matrix
a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 0 0 3 0 0 0 2 0 0 0 0 0 2 0 0 0 2 0 1 0 0 0 0 0
b 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0
d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
e 3 0 0 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 1 0
f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 2 0 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 1 0
j 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
k 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o 2 0 0 0 3 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0
p 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
q 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
s 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
t 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
u 2 0 0 0 2 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0
v 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
y 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0
5.4.6 Consecutive Letters (Duplication)
Duplicating a single letter, or missing one of an originally duplicated
pair, is one of the typo errors. Some writers mistakenly omit or add a
letter from the original token, specifically when affixes are added. The
two major errors resulting in this type of mistake are:
 Insertion: duplicating a single letter can happen when a writer
does not know the correct formation of a word when adding an
affix; an example is duplicating the letter 'l' when adding the
suffix '-ful' to the noun 'hope' to form the adjective 'hopeful'
(writing 'hopefull'). Or it may result from pressing a key for
longer than required for typing a single letter, as in 'prrint'.
 Deletion: the reverse of insertion is missing one of the duplicated
letters, like producing 'hopefuly' when adding the suffix '-ly' to
the word 'hopeful', or writing a single letter instead of two, like
the single 's' in 'omision'.
Duplication is an interesting feature with sufficient effect in deciding
the optimum candidate, specifically when the difference between the
source token and the candidate equals the number of missed or duplicated
letters.
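One simple way to test the duplication feature, sketched here as an assumption rather than the thesis's exact method, is to collapse runs of repeated letters in both tokens and compare the results: two tokens that differ only in letter duplication collapse to the same string.

```python
import re

def collapse_runs(token: str) -> str:
    """Replace every run of a repeated letter with a single copy."""
    return re.sub(r"(.)\1+", r"\1", token)

def duplication_match(source: str, candidate: str) -> bool:
    """True when the two tokens differ only in letter duplication,
    e.g. 'prrint'/'print' or 'hopefuly'/'hopefully'."""
    return collapse_runs(source) == collapse_runs(candidate)

print(duplication_match("prrint", "print"))        # True
print(duplication_match("hopefuly", "hopefully"))  # True
print(duplication_match("print", "paint"))         # False
```

This covers both directions at once: inserted duplicates ('prrint') and deleted duplicates ('hopefuly').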
5.4.7 Different Symbols Existence
A candidate is preferred to contain the same set of letters contained in
the source token; this feature highlights the case of transposing two
adjacent letters in the word (Damerau's fourth error type), which is a
common mistake in typed text.
In conclusion:
Obviously, none of the features described above is separable from the
others; each is constrained by its weight and effect factor. Relations hold
between edit distance and all seven other features; between length
difference and each of confusion, transposition and duplication; between
transposition and duplication; and so forth.
Consequently, all the features above share the task of ranking the
candidates, each with its own weight and according to the application
environment. At this step, the suggestion of candidates at the word level
ends, and syntax restrictions begin to play a role in deciding which token
is suggested as the optimum among all the alternatives in the dictionary.
5.5 Syntax Analysis
The task of the syntax analyzer is critical at this stage; in addition to
examining sentence correctness, the selection of the optimum candidate
is done here.
In both cases, the analysis is applied at the level of phrases, where a
sentence is broken into clauses and the clauses into phrases. The syntax
analysis process is shown in figure (5.2).
5.5.1 Sentence Phrasing
The token stream is divided into groups in the segmentation stage;
segmenting a text depends on the output of the tokeniser and the tagger,
because determining sentence boundaries makes use of tags. As output, the
segmented text is a stream of sentences that can be passed to the syntax
analyzer, since the latter usually works at sentence level.
A sentence contains one or more clauses, each clause consists of one
or more phrases, and a phrase in turn contains one or more words. Phrasing
is efficient from these standpoints:
 Correcting a part of a phrase affects the structure of the sentence only
partially, which minimizes the total number of possible alternatives,
leading to a smaller set of candidates and better reconstruction of the
original sentence in a way that keeps it reasonably unchanged.
 Attachment ambiguity is a challenge facing the correction process,
specifically in semantic relations; phrase-level correction solves it,
because a phrase is completely attached to another phrase and updating it
neither affects nor is affected by other phrases, unlike word-level
correction, which must consider every possible parsing and every related
part of the sentence.
 Converting into phrases simplifies the process of generating complex
sentence structures because, however complex a sentence gets, it remains
a collection of phrases connected syntactically and semantically.
Figure (5.2): Syntax analysis flowchart
[Flowchart summary: convert each sentence into phrases; test candidates
starting from the top of the ranked candidates list; if the structure is
violated, select the next candidate; otherwise replace the misspelled token
with the candidate, check constituency after correction, and output the
corrected text with a list of candidates for each corrected token.]
English has a set of phrase types that includes: Noun Phrase (NP),
Prepositional Phrase (PP), Adjectival Phrase (AdjP), Adverbial Phrase
(AdvP), Complement (C) and Verb Phrase (VP). Each has its own set of
word classes and a structure governing those classes.
5.5.2 Candidates Optimization
Misspelled tokens are associated with a ranked list of candidates,
where the top candidate is the most similar to the misspelled word.
The optimization procedure is applied in two phases: the first is the
ranking according to feature satisfaction and weights; the second is the
syntactic agreement within the phrase that contains the misspelled word.
Selecting candidates starts from the top; checking the consistency of
the phrase structure has a fundamental impact on correction accuracy. The
tag of the selected candidate should satisfy the structure of the phrase, and
sometimes the process may require checking the next tag in the sentence,
i.e. the token that follows the misspelled word, which may form the head
of the next phrase.
The task is not such a challenge if the phrasing procedure is accurate;
the structure of the phrase under processing limits the possible alternatives
for the misspelled word to those with the best similarity and syntactic
agreement.
5.5.3 Grammar Correction
A sentence is grammatically accepted if it can be generated by
applying a finite set of grammar rules.
Grammar correction is a subfield of real-word error correction; it
depends on sentence constituency to detect the words that disagree with
the grammar rules and make the sentence violate the parsing rules.
In this step, the system checks the correctness of sentences by parsing
each sentence separately, because syntactic acceptance relates to the
sentence level, unlike semantic and further processing, which analyze
texts at the level of paragraphs and full texts.
The grammar correction procedure deals with two types of sentences:
1. Sentences containing correctly spelled words.
2. Sentences containing words that have been replaced with correct words.
In both cases, the suggestion of candidates has ended and the
correction is restricted to one suggested candidate. As shown in previous
sections, the optimal candidate is the grammatically suitable one with the
highest similarity. The grammar corrector treats the two sentence types
equivalently (as a sequence of correctly spelled words).
Many approaches to grammar correction have been proposed, because
correcting a text grammatically is an extensive process requiring deep
knowledge of the underlying language grammar and an inclusive set of
grammar rules.
This system is rule based and considers two types of correction:
 Subject-verb agreement.
 Verb tenses.
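A rule-based subject-verb agreement check can be sketched as a small table over tag pairs. This is only an illustration of the idea: the Penn-style tags used here (NN = singular noun, NNS = plural noun, VBZ = 3rd-person-singular verb, VBP = plural/base verb) stand in for the system's own, more detailed tag set.

```python
# Agreement table over (subject tag, verb tag) pairs; pairs not listed
# are assumed acceptable by this simplified rule set.
AGREEMENT = {
    ("NN", "VBZ"): True,    # the dog runs
    ("NN", "VBP"): False,   # *the dog run
    ("NNS", "VBP"): True,   # the dogs run
    ("NNS", "VBZ"): False,  # *the dogs runs
}

def agrees(subject_tag: str, verb_tag: str) -> bool:
    """Return False only when a listed rule is violated."""
    return AGREEMENT.get((subject_tag, verb_tag), True)

print(agrees("NN", "VBZ"))   # True
print(agrees("NNS", "VBZ"))  # False: plural subject, singular verb
```

Verb tense correction would extend the same table-driven pattern with rules over auxiliary/participle tag sequences within the verb phrase.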
In order to perform the two types of correction and the phrasing
procedure, the tag set needed to be more detailed than is available in the
original WordNet dataset. The dictionary was preprocessed to subdivide
some tags into finer tags, like dividing definite and indefinite determiners
into pre-, central, and post-determiners. Nouns also needed to be
categorized into plurals and singulars, and verbs into different tenses and
participles. Integration with the ISPELL database enhanced the accuracy
of the dictionary, providing it with a big set of singular and plural nouns,
adjectives, and verb tense forms.
5.5.4 Document Correction
The final step is suggesting the corrected sentences. It includes
replacing the incorrect words with the optimal candidates and associating
the remaining candidates with every corrected word.
The association is necessary because even a perfect suggester can
never absolutely decide the intended word; therefore, the user is the only
one who can decide whether the word was accurately corrected.
Candidates are listed according to their ranking values. The list should
preferably be short and accurate; a threshold can be applied to the
suggestion list to filter out any candidate with similarity below the
predefined value.
Developers can set the threshold according to the application
environment. For example, some applications are used mostly by native
speakers, so the threshold can be stricter than in applications like
language-learning programs, whose users typically have poor linguistic
knowledge.
Chapter Six
Experimental Results, Conclusions, and
Future Works
6.1 Experimental Results
The objectives of this system are achieved by applying many steps.
Some of these steps required modifying existing techniques to overcome
problems standing in the way of the desired results:
6.1.1 Tagging and Error Detection Time Reduction
Assigning a POS tag to every token in the input text requires looking
up the underlying dictionary. Looking up is an extensive process that
becomes more complex as the dictionary grows; the problem is solved
by applying a prefix-dependent dictionary structure based on hashing and
indexing.
The structure consists of two levels of division: primary packets and
secondary packets. Distributing tokens over primary packets by their
3-symbol prefixes results in quite manageable packet sizes; as shown in
figure (6.1), the search space is reduced to about one thousand tokens at
most, instead of the original hundreds of thousands, with an average packet
size of 11.16 tokens. In addition, the hash function provides random access
to the packet heads.
Secondary packets, in turn, depend on 6-symbol prefixes, which yields
steadier searching and reduces lookup time to a reasonable amount. As shown
in figure (6.2), the search space shrinks from hundreds of thousands of
tokens to a few hundred at most, with an average of 7.26 tokens per
secondary packet.
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
99 
Figure (6.1): Tokens distribution in primary packets
Figure (6.2): Tokens distribution in secondary packets
On the dictionary-lookup side of the tagging phase, the hashing scheme
gives the lookup procedure a set of useful properties:
6.1.1.1 Successful Looking Up:
In the case of a successful match, where the target token is found in
the dictionary, lookup time is reduced through three steps:
- Primary packet selection: the head of every primary packet is
randomly accessible by applying a direct hash function that consumes
three symbols from the target token; this, in turn, reduces matching
time in the next steps.
- Secondary packet selection: selecting a secondary packet involves
examining only three symbols (indices 4-6), resulting in faster
searching even when it is performed sequentially.
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
100 
- Lookup inside the secondary packet: only the remainder of the target
token (its length minus 6) needs matching, since six symbols were
already consumed in the two previous steps on the way to the target
secondary packet.
In other words, the best case for a successful lookup has a time
complexity of O(1): the target token is stored at the first entry of the
primary packet (its head), which is randomly accessible.
The worst case happens when the target token is stored at the last
entry of a secondary packet whose head is itself stored at the last entry
of the related primary packet.
The worst-case time complexity breaks down as follows:
- O(1) for primary packet head access (random access);
- O(L1) for finding the head of the secondary packet under which the
target token is stored, where each step examines only three symbols;
- O(L2) for catching the target token, matching only the remainder of
the token after discarding the first six symbols.
In total, the worst case is O(1) + O(L1) + O(L2).*
6.1.1.2 Failure Looking Up:
If the target token is not found in the dictionary, the lookup failure
can be discovered in three different situations:
- At the hashing step (generating the primary packet head address): if
no entry matches the target token's prefix, the reference table
announces the failure by referring to an empty primary packet.
________________________________________________
* L1 and L2 are the lengths of the primary and secondary packets, respectively.
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
101 
This step consumes only one operation, giving O(1) time complexity.
- Within the first six symbols of the target token, the failure can be
discovered by matching the symbols at indices 4-6 against the same
indices of each token in the primary packet.
This step performs a number of matching operations equal to the length
of the primary packet, giving O(L1) time complexity.
- If a match with a secondary packet head is found (a shared prefix of
at least six symbols), matching against the tokens of that secondary
packet is limited to the remainders of the target token and of those
tokens, since the prefixes were already checked.
The time complexity is O(L2).
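The two-level lookup above can be sketched as follows. Python dictionaries stand in here for the hash function and for the sequential scans inside packets, and every name is ours, not taken from the thesis implementation:

```python
from collections import defaultdict

def build_packets(tokens):
    """Group tokens into primary packets by 3-symbol prefix and, inside
    each primary packet, into secondary packets by 6-symbol prefix."""
    primary = defaultdict(lambda: defaultdict(list))
    for t in tokens:
        primary[t[:3]][t[:6]].append(t)
    return primary

def lookup(primary, token):
    """Return True if token is in the dictionary, failing fast at each level."""
    packet = primary.get(token[:3])    # primary packet access; failure case 1
    if packet is None:
        return False
    secondary = packet.get(token[:6])  # examines symbols 4-6; failure case 2
    if secondary is None:
        return False
    return token in secondary          # match only the remainder; failure case 3

packets = build_packets(["present", "presents", "presume", "prepare"])
lookup(packets, "presume")   # True
lookup(packets, "zebra")     # False at the primary level
```

Note that the dict accesses here are O(1), whereas the thesis scans packets sequentially; the structure and the three failure points are what the sketch illustrates.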
6.1.2 Candidates Generation and Similarity Search Space Reduction
Candidate generation requires examining all the tokens in the lexical
dictionary to compute their similarity to the misspelled token. Our
proposed solution for reducing the search space is a spelling-based
clustering illusion: similarly spelled tokens are grouped together in a way
that leaves the structure of the dictionary unaffected and allows
similarity-based lookup using bi-gram analysis and prefix similarity.
Threshold usage depends on the application environment.
The misspelled token is the basic unit of candidate generation. The
similarity-based lookup proposed in this work generates similar tokens from
the misspelled token and the underlying dictionary. It exploits the
hash-indexing scheme to speed up generation, and bi-gram analysis to
improve the accuracy of candidate selection without losing any candidates
of interest.
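One common formulation of such bi-gram analysis is a Dice-style overlap between the bi-gram sets of two tokens; this is an illustrative sketch, not necessarily the exact formula used in this work:

```python
def bigrams(token):
    """The set of adjacent character pairs in a token."""
    return {token[i:i + 2] for i in range(len(token) - 1)}

def bigram_overlap(a, b):
    """Dice-style bi-gram overlap in [0, 1]: 2|A∩B| / (|A|+|B|).
    One common formulation of bi-gram similarity, assumed for illustration."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

bigram_overlap("definately", "definitely")  # shares 7 of 9 bi-grams per side
```

A high overlap lets a candidate survive the pre-filter before the more expensive edit-distance computation is applied.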
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
102 
The proposed approach is highly flexible: it acts as a clustering
model with well-structured clusters (even though the clustering is, in
fact, an illusion) and supports a set of modifiable parameters:
1. Similarity measure:
In this work, we relied on minimum edit distance techniques
(specifically, an improvement of the Levenshtein method) and used a
similarity measure based on the distance calculated by this method.
The similarity-based lookup approach is independent of the
similarity measure; therefore, any other method or technique can be
used with it.
2. Thresholds:
Threshold specification is a challenge facing many applications;
adjusting it perfectly requires considerable work. This approach
simplifies the task in two ways:
- Candidate generation can be performed without any consideration
of thresholds. After the candidates are collected, they are ranked
and the best few can simply be selected.
- Candidate filtering can be broken into three levels: the first at
primary centroid selection, the second at secondary centroid
selection, and the third at candidate selection.
As in any other area of computation, dividing a problem into
sub-tasks simplifies both the initial assignment and the updating of
the parameters during the adjustment process.
3. Applicability:
Similarity-based lookup readily accepts updates to the candidate
generation and can be adapted to different environments. For example,
a developer can use it for post-correction in OCR applications and
add extra features to make recognition more accurate. Such features
may be incorporated implicitly through the similarity measure, or
passed explicitly as parameters to the generation procedure.
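A minimal sketch of such a distance-based similarity measure, normalizing the Levenshtein distance by the longer token's length (one common normalization, not necessarily the exact formula used here):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Map an edit distance to a [0, 1] similarity; this normalization is
    one common choice, assumed for illustration."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

similarity("acheive", "achieve")  # the ei/ie swap costs 2 plain edits here
```

Because the lookup is independent of the measure, this function could be swapped for any other without changing the surrounding structure.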
6.1.3 Reducing the Time of the Damerau-Levenshtein Method
Damerau's modification of the Levenshtein method increases its time
complexity because it adds extra checks on every symbol of the input
strings; this stems from the simple way it tests for the presence of a
transposition.
In this work, we modified the original method to cover transposition
cases while keeping the spirit of the original: the transposition test is
merged into a statement whose execution is limited to the cases where a
transposition is possible.
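The idea can be sketched as the optimal-string-alignment variant of Levenshtein, in which the transposition test is guarded so it runs only when a swap is possible; this is an illustrative reconstruction, not the thesis code:

```python
def osa_distance(a, b):
    """Levenshtein with adjacent-transposition handling (optimal string
    alignment). The guarded `if` mirrors the execution-limited statement:
    the extra check is skipped for most cells of the table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # transposition check, executed only when a swap is possible
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

osa_distance("acheive", "achieve")  # the ei/ie swap costs 1 instead of 2
```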
Figure (6.3): Time consumed by the Levenshtein, Damerau-Levenshtein, and
Enhanced Levenshtein (our modification) methods [the Y axis represents the
consumed time measured in seconds; the X axis shows the samples used for testing]
The time variance of the three methods (Levenshtein, Damerau-
Levenshtein, and the enhanced Levenshtein) is shown in figure (6.3); the
time consumed by the enhanced method is very close to that of the original
Levenshtein, while the Damerau modification results in somewhat longer
times. The reported time is the average over ten executions of each method
on the same testing group.
6.1.4 Features Effect on Candidates Suggestion
The eight features selected for suggesting the best set of candidates
were tested in three different cases to show how each of them affects the
selection of the optimal suggestion for isolated-word correction.
Figure (6.4) shows the ratios of correctly suggested candidates and of
candidates correctly chosen as optimal. "Suggested" tokens represent
situations where the target token is found in the list of suggestions but
is not necessarily selected as optimal; "chosen as optimal" is the set of
tokens correctly selected as optimal.
Figure (6.4): Suggestion accuracy, with a comparison to Microsoft Office
Word, on a sample from Wikipedia (Experiment 1: Suggestion Accuracy):
- Total misspelled tokens: 1825
- Suggested target token: 1691 (suggestion accuracy = 92.657%)
- Optimally selected: 1477 (optimality accuracy = 87.34%)
- Microsoft Word suggestion: 1659 (accuracy = 90.904%)
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
105 
Suggestion accuracy was computed by applying isolated-word correction
to a list of commonly misspelled words from the Wikipedia website
containing 1825 tokens, which yielded an accuracy of 92.657%; of the
suggested tokens, 87.34% were correctly ranked as the optimal candidate.
Checking the same test data with Microsoft Word yielded a suggestion
accuracy of 90.904%.
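The reported percentages follow directly from the counts in figure (6.4); note that the counts imply the optimality figure is computed relative to the suggested set rather than the full sample:

```python
# Reproducing the reported percentages from the raw counts in Figure 6.4.
total     = 1825  # misspelled tokens in the Wikipedia sample
suggested = 1691  # target token appeared in the suggestion list
optimal   = 1477  # target token ranked as the optimal candidate
ms_word   = 1659  # tokens Microsoft Word suggested correctly

suggestion_accuracy = 100 * suggested / total    # ≈ 92.657%
optimality_accuracy = 100 * optimal / suggested  # ≈ 87.34% of the suggested set
ms_word_accuracy    = 100 * ms_word / total      # ≈ 90.904%
```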
A subset of the Wikipedia sample (presented in Appendix A) was used to
compare our system's accuracy with other systems. The results for the
other systems were taken from the research of Ahmed Farag and others
[Ahm09]; our system was tested on the same data and gave the results shown
in figure (6.5).
Figure (6.5) : Testing the suggested system (I.T.D.C) accuracy and comparing the
results with other systems using the same dataset
The accuracies of the tested systems are:
 ASPELL : 90.833%
 Microsoft Word: 88.33%
 MultiSpell : 92.5%
 I.T.D.C system (the suggested system) : 95.83%
Experiment 2 (a comparison among our work and some systems on isolated-word
correction; 120 tokens per system):
- ASPELL: 109 correctly suggested, 11 incorrectly suggested
- Microsoft Word: 104 correctly suggested, 16 incorrectly suggested
- MultiSpell: 111 correctly suggested, 9 incorrectly suggested
- I.T.D.C System: 115 correctly suggested, 5 incorrectly suggested
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
106 
Another experiment examined the effect of each feature on the accuracy
of optimal-candidate selection. The results shown in figure (6.6) were
computed by discarding one feature at a time, while figure (6.7) shows the
results of using one feature at a time.
Although some features give high accuracy on their own, relying on a
single feature is not sufficient. An example is the duplication feature,
which accounted for 1552 correctly selected optimal tokens when used alone,
whereas discarding it barely affected the total number of optimal-set
tokens.
Figure (6.6): Discarding one feature at a time for optimal candidate
selection (Experiment 3). With all eight features, 1477 tokens are
optimally selected; discarding one feature at a time gives:
- Similarity: 827
- First Letter: 1464
- End Letter: 1468
- Length Effect: 1476
- Same Letter Set: 1436
- Transpositionally Inserted: 1464
- Duplication: 1465
- Confusion: 1475
- Transposition: 1487
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
107 
Figure (6.7): Using one feature at a time for optimal candidate selection
(Experiment 4). With all eight features, 1477 tokens are optimally
selected; using a single feature alone gives:
- Similarity: 1406
- First Letter: 317
- End Letter: 595
- Length Effect: 447
- Same Letter Set: 1486
- Transpositionally Inserted: 1478
- Duplication: 1552
- Confusion: 909
- Transposition: 923

6.2 Conclusions
Text correction is a complex problem and an extensive task. It needs
many linguistic and statistical resources, as well as efficient techniques
for automatic execution. In this work we performed a set of improvements
on both the resource and the technique sides. Our dictionary, an
integration of the WordNet and ISPELL datasets, was retagged to enable and
simplify the parsing process. Hashing and indexing techniques are used to
shorten the error-detection time, and the correction process exploits the
same hashed dictionary together with an enhancement of the Levenshtein
method for generating candidates. A set of features, some of them statistics
Chapter Six Experimental Results, Conclusion, and Future Works
_________________________________________________________________________
108 
dependent, are used to optimize the candidates before they are passed to
the parser, where the final decision is made at the level of phrases and
sentences.
There is no way to avoid human intervention, because computers can
never predict with absolute certainty what a human intended; therefore, a
set of alternatives is associated with every corrected word.
6.3 Future Works
Automatic text correction is an open research area; even with several
techniques and applications available, the desired results are still
imperfect. However, some issues could be pursued further in this work to
improve its accuracy:
- Semantic Processing: this system depends entirely on an extensive
parser working at the level of syntax analysis only; semantic information,
if implemented, would increase accuracy by discarding candidates that
conflict with the sentence meaning. Discourse and pragmatic analysis could
also enhance the results.
- In addition to spelling-based clustering and phonetic-based clustering,
a technique that merges the two within the same search-time constraints is
desirable. Such an enhancement would maximize candidate-generation accuracy
while minimizing time complexity.
- In the hash table, lookup inside primary and secondary packets is
performed sequentially; applying a faster technique such as binary search
would be a good improvement. This requires sorting the tokens by spelling
and applying the search in two directions:
o At the level of the token itself, where moving from one entry to
another depends on the tokens' spelling; the symbols of a token
should be examined sequentially, because tokens are short enough
not to warrant a more complex technique.
o At the level of the packets, where movement is performed over
whole tokens.
- Similarity-based lookup needs to be faster; an enhancement is needed
to reduce the number of generated primary centroids. This problem may be
eased if the application in which the system is used becomes more specific.
- In grammar correction, we considered only two error types; it is
preferable to consider as many types as possible.
- Because of time constraints, this system was implemented for simple
sentences only; an extension is required to make it general by covering
complex, compound, and complex-compound sentences. The task is
straightforward, because the construction process can exploit the
phrase-level analysis made in this work without requiring further details.
- A sophisticated study of error types and of how people commonly make
writing mistakes. Such a study requires multiple resources, including
corpora, statistics, and even an interactive analyzer for recording and
classifying commonly committed mistakes. Although it is not an easy task,
it would help in drawing conclusions about the general behavior of users
when they unintentionally change the spelling of words and produce
misspellings.
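The binary-search improvement proposed above for packet lookup can be sketched as follows; the sorted `packet` list is illustrative, not a structure from the thesis:

```python
from bisect import bisect_left

def packet_contains(packet, token):
    """Binary search over a sorted packet: O(log n) comparisons instead of a
    sequential scan, where each comparison itself proceeds symbol by symbol
    over the short tokens."""
    i = bisect_left(packet, token)
    return i < len(packet) and packet[i] == token

packet = sorted(["prepare", "present", "presents", "presume"])
packet_contains(packet, "present")   # True
packet_contains(packet, "presses")   # False
```

The same idea applies at both levels: over the tokens of one packet, and over the packet heads themselves.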
References
Achenkunju A. and Bhuma V.R. (2014). "An Efficient Reformulated
Model for Transformation of String." International Journal of Engineering
Research and Applications (IJERA) ISSN: 2248-9622 International
Conference on Humming Bird. 1 March.
Ahmed F., Ernesto W. De L., and Andreas N.( 2009). Revised N-Gram
based Automatic Spelling Correction Tool to Improve Retrieval
Effectiveness. Technical University of Berlin.
Ali A.(2011). Textual Similarity. Technical University of Denmark.
Amber W. O., Graeme H., and Alexander B. (2008). Real-word spelling
correction with trigrams: A reconsideration of the Mays, Damerau, and
Mercer model. Ontario: University of Toronto.
Baluja S., Vibhu O. M., and Rahul S.(2000). APPLYING MACHINE
LEARNING FOR HIGH PERFORMANCE NAMED-ENTITY
EXTRACTION. Cambridge: Blackwell Publishers.
Bassil Y.( 2012). "Parallel Spell-Checking Algorithm Based on Yahoo! N-
Grams Datasets." International Journal of Research and Reviews in
Computer Science (IJRRCS), ISSN: 2079-2557, Vol.3, No.1, February.
Bhattacharyya P. (2012). "Natural Language Processing A Perspective
from Computation in Presence of Ambiguity, Resource Constraint and
Multilinguality." CSI Journal of Computing, Vol.1 , No. 2, 3-13.
Booth A. D., Brandwood L., and Cleave J. P. (1958). Mechanical
Resolution of Linguistic Problems. New York, London: Academic Press
Inc. Publishers; Butterworths Scientific Publications.
Boswell D. (2005). Speling KoreKsion: A survey of techniques from past to
present. A USCD Research Exam.
Chakraborty R. C. (2010). "Artificial Intelligence: Natural Language
Processing." www.myreaders.info/html/artificial_intelligence.html, 1 June.
Church K. and Gale W. A. (1991). "Probability Scoring for Spelling
Correction." Statistics and Computing, 93-103.
Clark A., Chris F., and Shalom L. (2010). The Handbook of
Computational Linguistics and Natural Language Processing. Singapore:
Wiley-Blackwell.
Dahlmeier D. and Hwee T. N. (2011). "Grammatical Error Correction
with Alternative Structure Optimization." Proceedings of the Association
for Computational Linguistics, 915-923.
Damerau F. J. (1964). A Technique for Computer detection and
Correction of Spelling Errors. New York: ACM, Vol.3,No.4.
Dzikovska M. O. (2004). A Practical Semantic Representation For
Natural Language Parsing. New York: University of Rochester.
Farra N., Nadi T., Alla R., and Nizar H. ( 2014). "Generalized Character-
Level Spelling Error Correction." Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics, June 23-25, 161–167.
Felice M., Yuan Z., Andersen Ø. E., and others.( 2014). "Grammatical
Error Correction using Hybrid Systems and Type Filtering." Proceedings of
the Shared Task Eighteenth Conference on Computational Natural
Language Learning, Maryland, 15-24.
Fromkin V., Robert R., and Nina H. (2007). Language Change: The
Syllables of Time. Vol. 8, in An Introduction to Language, 461-497. Boston.
Gamon M. (2010). "Using Mostly Native Data to Correct Errors in
Learners' writing: A meta-classifier approach." proceedings of the Annual
Meeting of the North America Chapter of the Association for
Computational Linguistics, 163-171.
Golding A. R., and Yves S. (1996). Combining Trigram-based and
Feature-based Methods for Context-Sensitive Spelling Correction.
Cambridge: Mitsubishi Electric Research Laboratories.
Grune D. and Ceriel J. H. J. ( 2008). Parsing Techniques- a practical
guide. Vol. Second Edition. Springer.
Gupta A. (2014). "Grammatical Error Detection and Correction Using
Tagger Disagreement." Proceedings of the Shared Task of the Eighteenth
Conference on Computational Natural Language Learning, 49-52.
Haldar R. and Debajyoti M. ( 2011). Levenshtein Distance Technique in
Dictionary Lookup Methods: An Improved Approach. New York: ACM.
Han N., Martin C., and Claudia L. (2006). "Detecting errors in English
Article Usage by Non-native Speakers." Natural Language Engineering,
115-129.
Hasan F. M.( 2006). COMPARISON OF DIFFERENT POS TAGGING
TECHNIQUES FOR SOME SOUTH ASIAN LANGUAGES. Dhaka: BRAC
University.
Hasan F. M., Naushad U., and Mumit K.( 2006). Comparison of
different POS Tagging Techniques (N-Gram, HMM and Brill’s tagger) for
Bangla. Bangladesh: BRAC University.
Hodge V. J. and Austin J. (2003). "A Comparison of Standard Spell
Checking Algorithms and Novel Binary Neural Approach." IEEE Trans.
Know. Dat. Eng, 1073-1081.
Hwee T. N., Siew M. W., Ted B., and others. (2014). "The CoNLL-2014
Shared Task on Grammatical Error Correction." Proceedings of the Shared
Task of Eighteenth Conference on Computational Natural Language
Learning, June 26-27,1-14.
ISPELL. "Ispell." Wikipedia, the free encyclopedia, April 10, 2014
(accessed September 2014).
Jackson P. and Isabelle M.(2002). Natural Language Processing for
Online applications, Text Retrieval, Extraction and Categorization.
Amsterdam: John Benjamins Publishing Company.
Jones K. S. (2001). Natural Language Processing - A historical Review.
University of Cambridge, October.
Julius G. III.(2013) Intrasentential Grammatical Correction with
Weighted Finite State Transducers. Raleigh, North Carolina: North
Carolina State University.
Jurafsky D. and James H. M. (2000). Speech and Language Processing:
An introduction to natural language processing, Computational
Linguistics, and Speech Recognition. New Jersey: Alan Apt.
Kirthi J., Neeju N. J., and Nithiya P. (2011). "Automatic Spell Correction
of User Query with Semantic Information Retrieval and Ranking of Search
Results using WordNet Approach." IJCSI International Journal of
Computer Science Issues, Vol. 8, No. 2, March, 557-564.
Kukich K. ( 1992). Techniques for Automatically Correcting Words in
Text. ACM Computing Surveys, Vol. 24, No. 4.
Manning C. D., Raghavan P., and Schütze H. (2008). Introduction to
Information Retrieval. Cambridge University Press.
Mihov S., Svetla K., and others. (2004). Precise and Efficient Text
Correction Using Levenshtein Automata, Dynamic Web Dictionaries and
Optimized Correction Models. Bulgarian Academy of Sciences.
Mishra R. and Navjot K.(2013). "A Survey of Spelling Error Detection
and Correction Techniques." International Journal of Computer Trends
and Technology, vol.4, No.3, 372-374.
Momtazi S.(2012). Natural Language Processing: Introduction to
Language Technology. University of Potsdam.
Nadkarni P. M., Lucila O., and Wendy W. C.( 2011). "Natural language
processing: an introduction." J Am Med Inform Assoc, October 5, 544-551.
Niemann T.( 2009). SORTING AND SEARCHING ALGORITHMS.
Portland: epaperpress.com.
"Notes on Ambiguity." http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.
Peterson J. L. (1980). "Computer Programs for Detecting and Correcting
Spelling Errors." Communications of the ACM, Vol.23, No. 12, 676- 687.
Pollock J. J. and Zamora A. (1983). "Collection and Characterization of
Spelling Errors in Scientific and Scholarly Text." Journal of the American
Society for Information Science, 51-58.
Pollock J. J. and Zamora A. (1984). "Automatic Spelling Correction in
Scientific and Scholarly Text." Communications of the ACM, 358-368.
Quirk R., Sidney G., Geoffrey L., and Jan S. (1985). A
Comprehensive Grammar of the English Language. New York and
London: Longman.
Raaijmakers S. (2013). "A Deep Graphical Model for Spelling
Correction." Proceedings of the 25th Benelux Conference on Artificial
Intelligence. Delft, 7-8 November.
Rich E. and Kevin K.( 1991). Chapter Fifteen: Natural Language
Processing. Vol. 2, in Artificial Intelligence. Amazon.
Ritter A., Mausam S. C., and Oren E. ( 2011). Named Entity Recognition
in Tweets An Experimental Study. Computer Science and Engineering,
University of Washington.
Rajesh K. S. and Lokanatha C. R.(2009). "Natural Language Processing
- An Intelligent way to understand Context Sensitive Languages."
International Journal of Intelligent Information Processing, December
3,421-428.
Sagar and Shobha G. (2013). "Survey on Grammar Generation Methods
for Natural Languages." International Journal of Computational
Linguistics and Natural Language Processing ISSN 2279 – 0756, Vol.
2,No.1, January, 197-202.
Salifou L. and Harouna N. (2014). "Design of A Spell Corrector For
Hausa Language." International Journal of Computational Linguistics
(IJCL), Vol.5,No.2, 14-26.
Scott M. T. (1999). PARSING AND TAGGING SENTENCES
CONTAINING LEXICALLY AMBIGUOUS AND UNKNOWN TOKENS.
Purdue University.
Seo H., Jonghoon L., Seokhwan K., and others. (2012). "A Meta
Learning Approach to Grammatical Error Correction." 50th Annual
Meeting of the Association for Computational Linguistics. Jeju Island, July,
8 - 14.
Setiadi I. (2014). Damerau-Levenshtein Algorithm and Bayes Theorem for
Spell Checker Optimization. Bandung: Makalah IF2211 Strategi Algoritma
– Sem. I Tahun.
Tetreault J., Jenniefer F., and Martin C. (2010). "Using Parse Features
for Preposition Selection and Error Detection." Proceedings of the ACL
2010 Conference Short Papers, 353-358.
Toutanova K. and Moore R. C. (2002). "Pronunciation Modeling for
Improved Spelling Correction." Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, 144-151.
Verberne S. (2002). Context-sensitive spell checking based on word
trigram probabilities. University of Nijmegen.
Voorhees E., Harman D. K., and others. (2005). TREC: Experiment and
Evaluation in Information Retrieval. Cambridge: MIT Press.
Wagner R. A. and Fischer M. J. (1974). "The String-to-String Correction
Problem." Journal of the Association for Computing Machinery, 168-173.
Wolniewicz R. (2011). Auto-Coding and Natural Language Processing.
U.S.A: 3M Health Information Systems.
Yannakoudakis E.J. and Fawthrop D. (1983). "An Intelligent Spelling
Error Correction." Information Processing and Management, 101-108.
Yule G. (2000). "Pragmatics." In Oxford Introductions to Language Study
Series Editor H.G. Widdowson, 4. Oxford University Press.
Zampieri M. and Renato C. de A. ( 2014). Between Sound and Spelling
Combining Phonetics and Clustering Algorithms to Improve Target Word
Recovery. Saarland: Saarland University.
Zhan J., Xiolong M., Shu q. L., and Ditang F. (1998). A language Model
in a Large-Vocabulary Speech Recognition System. Sydney: Proceedings of
International Conference ICSLP98.
Appendix A
Appendix (A): A comparison among this work and some systems on the isolated words correction
* Bold words are incorrectly suggested
** I.T.D.C system: Intelligent Text Document Correction System Based on Similarity Technique (our suggested system)
Columns: Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C System
Abberration aberration aberration aberration aberration aberration
accomodation accommodation accommodation accommodation accommodation accommodation
acheive achieve Achieve achieve achieve achieve
abortificant abortifacient aficionados - abortifacient abortifacient
absorbsion absorption absorbsion absorbs ion absorption absorption
ackward (awkward,
backward)
awkward (awkward,
backward)
(awkward, backward) (backward,
awkward)
additinally additionally additionally additionally additionally additionally
adminstration administration administration administration administration administration
admissability admissibility admissibility admissibility admissibility admissibility
advertisments advertisements advertisements advertisements advertisements advertisements
adviced advised advised advised advice advised
afficionados aficionados aficionados aficionados aficionados aficionados
affort (effort, afford) effort afford afford (effort, afford)
agains against agings agings against against
aggreement agreement agreement agreement agreement agreement
agressively aggressively aggressively aggressively aggressively aggressively
agriculturalist agriculturist - - agriculturist agriculturist
alcoholical alcoholic alcoholically alcoholically alcoholic (alcoholically,
alcoholic)
algebraical algebraic algebraic algebraically algebraically algebraic
algoritms algorithms algorithms algorithms algorithms (algorism,
algorithms)
alterior (ulterior, anterior) ulterior (anterior,
ulterior)
(anterior, ulterior) (ulterior, anterior)
anihilation annihilation annihilation annihilation annihilation annihilation
anthromorphization anthropomorphization anthropomorphizing - anthropomorphization anthropomorphization
bankrupcy bankruptcy bankruptcy bankruptcy bankruptcy bankruptcy
baout (about, bout) bout (about, bout) bout (about, bout)
basicly basically basically basically basically basically
breakthough breakthrough break though breakthrough breakthrough breakthrough
carachter character crocheter character character character
cannotation connotation connotation (connotation,
annotation)
(connotation,
annotation)
connotation
carismatic charismatic charismatic charismatic charismatic charismatic
carmel caramel Carmel - caramel caramel
cervial (cervical, servile) cervical cervical cervical cervical
clasical classical classical classical classical classical
cleareance clearance clearance clearance clearance clearance
comissioning commissioning commissioning commissioning commissioning commissioning
commemerative commemorative commemorative commemorative commemorative commemorative
compatabilities compatibilities compatibilities compatibilities compatabilities compatibilities
committment commitment commitment commitment commitment commitment
debateable debatable debatable debatable debatable debatable
determinining determining determinining determinining determining determining
childbird childbirth child bird child bird childbirth childbirth
definately definitely definitely definitely definitely definitely
decribe describe describe describe describe describe
elphant elephant elephant elephant elephant elephant
emmediately immediately immediately immediately immediately immediately
emphysyma emphysema emphysema emphysema emphysema emphysema
erally (orally, really) orally really orally (really ,orally)
eyasr (years, eyas) eyesore years eyas (eyas ,years)
facist fascist fascist fascist fascist fascist
fluoroscent fluorescent fluorescent fluorescent fluorescent fluorescent
geneology genealogy genealogy genealogy genealogy genealogy
gernade grenade grenade grenade grenade grenade
girates gyrates grates gyrates Gyrates gyrates
gouvener governor governor souvenir convener (souvenir,
gouverneur,
governor)
gurantees guarantee guarantee guarantee guarantee (guaranties,guarantee)
guerrila (guerilla, guerrilla) guerrilla guerrilla (guerilla, guerrilla) (guerrilla, guerilla)
guerrilas (guerillas, guerrillas) guerrillas guerrillas (guerillas, guerrillas) (guerrillas, guerillas)
Guiseppe Giuseppe Giuseppe Giuseppe Giuseppe -
habaeus (habeas, sabaeus) habeas habitués sabaeus Cabaeus
hierarcical hierarchical hierarchical hierarchical hierarchical hierarchical
heros heroes heroes heroes herbs heroes
hypocracy hypocrisy hypocrisy hypocrisy hypocrisy hypocrisy
independance Independence Independence - Independence Independence
intergration integration integration integration integration integration
intrest interest interest interest interest interest
Johanine Johannine Johannes Johannes Johannine Johannine
judisuary judiciary judiciary judiciary judiciary judiciary
kindergarden kindergarten kindergarten kindergarten kindergarten kindergarten
knowlegeable knowledgeable knowledgeable knowledgeable knowledgeable knowledgeable
labatory (lavatory, laboratory) (lavatory, laboratory) (lavatory, laboratory) (lavatory, laboratory) lavatory
lonelyness loneliness loneliness loneliness loneliness loneliness
legitamate legitimate legitimate legitimate legitimate legitimate
libguistics linguistics linguistics linguistics linguistics linguistics
lisence (license, licence) licence silence licence (licence, license)
mathmatician mathematician mathematician mathematician mathematician mathematician
ministery ministry ministry ministry ministry ministry
mysogynist misogynist misogynist misogynist misogynist misogynist
naturaly naturally naturally naturally naturally naturally
ocuntries countries countries countries countries countries
paraphenalia paraphernalia paraphernalia paraphernalia paraphernalia paraphernalia
Palistian Palestinian Alsatain politian Palestinian (Pakistan, politian)
pamflet pamphlet pamphlet pamphlet pamphlet partlet
psyhic psychic psychic psychic psychic psychic
Peloponnes Peloponnesus Peloponnese Peloponnese Peloponnesus Peloponnese
personell personnel personnel personnel personnel (personally, personnel)
posseses possesses possesses possesses possess possesses
prairy prairie priory prairie airy (priory, prairie)
qutie (quite, quiet) quite quite queue quite
radify (ratify,ramify) ratify ratify ramify (rarify, ratify, ramify)
reccommended recommended recommended recommended recommended recommended
reciever receiver receiver receiver reliever receiver
reconaissance reconnaissance reconnaissance reconnaissance reconnaissance reconnaissance
restauration restoration restoration restoration instauration restoration
rigeur (rigueur, rigour, rigor) rigger rigueur (rigueur, rigour) rigour
Saterday Saturday Saturday Saturday Saturday Saturday
scandanavia Scandinavia Scandinavia Scandinavia Scandinavia Scandinavia
scaleable scalable scalable - scalable scalable
secceeded (seceded, succeeded) succeeded succeeded succeeded succeeded
sepulchure (sepulchre, sepulcher) sepulcher sepulchered sepulchre (sepulchre, sepulcher)
themselfs themselves themselves themselves themselves themselves
throught (thought, through, throughout) (thought, through) (thought, through) (thought, through, throughout) (through, thought, throughout)
troups (troupes, troops) (troupes, troops) troupes troops (troops, troupes)
simultanous simultaneous simultaneous simultaneous simultaneous simultaneous
sincerley sincerely sincerely sincerely sincerely sincerely
sophicated sophisticated suffocated supplicated sophisticate sophister
surrended (surrounded, surrendered) surrounded surrender surrounded (surrender, surrendered, surrounded)
unforetunately unfortunately unfortunately unfortunately unfortunately unfortunately
unnecesarily unnecessarily unnecessarily unnecessarily unnecessarily unnecessarily
usally usually usually usually usually usually
useing using using using seeing using
vaccum vacuum vacuum vacuum vacuum vacuum
vegitables vegetables vegetables vegetables vegetables vegetables
vetween between between between between between
volcanoe volcano volcano volcano volcano ( volcanoes, volcano)
weaponary weaponry weaponry weaponry weaponry weaponry
worstened worsened worsened worsened worsened worsened
wupport support support support support support
yeasr years years years yeast years
Yementite (Yemenite, Yemeni) Yemenite Yemenite Yemenite Yemenite
yuonger younger younger younger sponger younger
Appendix B
1. Abstract (in Arabic; translated):

Automatic text correction is considered one of the most important problems associated with human-computer interaction, as it enters into many practical areas, directly, such as correcting the errors that result from converting handwritten texts into digital form, and indirectly, such as correcting users' queries before performing a retrieval operation on an interactive database.

The automatic correction process passes through two main phases: error detection and candidate suggestion. Many techniques and methods exist for both phases, varying in the accuracy of their results and in their applicability; they are generally divided into procedural and statistical methods. Procedural methods rely on specific rules that govern the acceptability of texts, including natural language processing techniques, whereas statistical methods rely on statistical and probabilistic data usually collected from huge samples drawn mainly from what is in common use among users.

In this system, natural language processing techniques were adopted as a basis for analysis and for checking the lexical and grammatical acceptability of English texts. A dictionary containing all the vocabulary of the English language was used for detecting and identifying lexical errors; given the huge size of this dictionary, a hash function and an indexing method were used to narrow the search scope for the desired words and to provide random-access capability based on their prefixes, thereby shortening lookup time.

Candidate generation relies on computing the degree of similarity between the input word and all the dictionary words and re-ranking them according to this measure, which is calculated using an improved Levenshtein method. Since this generation process requires a long time, the dictionary words were divided into small groups, while preserving random-access capability, according to criteria that depend on the spelling of the source word. Candidate suggestion involves testing a set of features related, to some extent, to the nature of the most common errors. The system selects the optimal candidate that achieves the highest compatibility with the source word, provided that it does not conflict with the rules of grammar, so that the corrected text is lexically and grammatically acceptable.

Accuracy tests showed that the proposed system outperforms Microsoft Word and other systems; moreover, the improved string-similarity method approximately preserved the original time complexity while gaining the ability to discover an additional type of spelling error.
2. Title page (in Arabic; translated):

Intelligent Text Document Correction System Based on Similarity Technique

A thesis submitted to the Council of the College of Information Technology, University of Babylon, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By
Marwa Kadhim Obeid Al-Rikaby

Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry

2015 A.D. / 1436 A.H.

Ministry of Higher Education and Scientific Research
University of Babylon - College of Information Technology
Software Department
  • 5. I would like to thank the staff of the Software Department for the help they have offered, especially the head of the Software Department, Dr. Eman Salih Al-Shamery. Most importantly, I would like to thank my parents, my sisters, my brothers and my friends for their support.
  • 6. VI Abstract

Automatic text correction is one of the human-computer interaction challenges. It is directly involved in several application areas, such as correcting digitized handwritten text, and indirectly involved in others, such as correcting users' queries before a retrieval process is applied in interactive databases.

Automatic text correction passes through two major phases: error detection and candidate suggestion. Techniques for both phases are categorized into procedural and statistical. Procedural techniques are based on rules that govern text acceptability, including natural language processing techniques. Statistical techniques, on the other hand, depend on statistics and probabilities collected from large corpora reflecting what is commonly used by humans.

In this work, natural language processing techniques are used as the basis for analysis and for both spelling and grammar acceptability checking of English texts. A prefix-dependent hash-indexing scheme shortens the time of looking up the dictionary, which contains all English tokens and serves as the basis of the error detection process. Candidate generation is based on calculating the similarity of the source token, measured using an improved Levenshtein method, to the dictionary tokens and ranking them accordingly; because this process is time-intensive, tokens are divided into smaller groups according to spelling similarity in a way that preserves random-access availability. Finally, candidate suggestion involves examining a set of features related to commonly committed mistakes. The system selects the optimal candidate, the one providing the highest suitability without violating grammar rules, to generate linguistically accepted text.

Testing the system's accuracy showed better results than Microsoft Word and some other systems. The enhanced similarity measure kept the time complexity close to that of the original Levenshtein method while adding discovery of an additional error type.
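The prefix-dependent hash-indexing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual scheme: the two-letter bucket width, function names and toy lexicon are assumptions.

```python
from collections import defaultdict

def build_index(words, prefix_len=2):
    """Bucket dictionary words by a short prefix so that a lookup
    only has to search one small group (near random access)."""
    index = defaultdict(set)
    for w in words:
        index[w[:prefix_len].lower()].add(w.lower())
    return index

def contains(index, token, prefix_len=2):
    # Only the bucket sharing the token's prefix is consulted,
    # instead of scanning the whole dictionary.
    return token.lower() in index[token.lower()[:prefix_len]]

lexicon = ["apple", "apply", "banana", "band", "bandage"]
idx = build_index(lexicon)
print(contains(idx, "Apple"))    # True
print(contains(idx, "bandana"))  # False
```

With a real 300,000-token dictionary, the bucket sizes stay small enough that each lookup touches only a tiny fraction of the vocabulary.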
  • 7. VII Table of Contents Subject Page No. Chapter One : Overview 1.1 Introduction 1 1.2 Problem Statement 3 1.3 Literature Review 5 1.4 Research Objectives 10 1.5 Thesis Outlines 11 Chapter Two: Background and Related Concepts Part I: Natural Language Processing 12 2.1 Introduction 12 2.2 Natural Language Processing Definition 12 2.3 Natural Language Processing Applications 13 2.3.1 Text Techniques 14 2.3.2 Speech Techniques 15 2.4 Natural Language Processing and Linguistics 16 2.4.1 Linguistics 16 2.4.1.1 Terms of Linguistic Analysis 17 2.4.1.2 Linguistic Units Hierarchy 19 2.4.1.3 Sentence Structure and Constituency 19 2.4.1.4 Language and Grammar 20 2.5 Natural Language Processing Techniques 22 2.5.1 Morphological Analysis 22 2.5.2 Part of Speech Tagging 23 2.5.3 Syntactic Analysis 26 2.5.4 Semantic Analysis 27 2.5.5 Discourse Integration 27 2.5.6 Pragmatic Analysis 28 2.6 Natural Language Processing Challenges 28 2.6.1 Linguistics Units Challenges 28 2.6.1.1 Tokenization 28 2.6.1.2 Segmentation 29 2.6.2 Ambiguity 31 2.6.2.1 Lexical Ambiguity 31
  • 8. VIII Subject Page No. 2.6.2.2 Syntactic Ambiguity 31 2.6.2.3 Semantic Ambiguity 32 2.6.2.4 Anaphoric Ambiguity 32 2.6.3 Language Change 32 2.6.3.1 Phonological Change 33 2.6.3.2 Morphological Change 33 2.6.3.3 Syntactic Change 33 2.6.3.4 Lexical Change 33 2.6.3.5 Semantic Change 34 Part II: Text Correction 35 2.7 Introduction 35 2.8 Text Errors 35 2.8.1 Non-words Errors 36 2.8.2 Real-word Errors 36 2.9 Error Detection Techniques 37 2.9.1 Dictionary Looking Up 37 2.9.1.1 Dictionaries Resources 37 2.9.1.2 Dictionaries Structures 38 2.9.2 N-gram Analysis 39 2.10 Error Correction Techniques 40 2.10.1 Minimum Edit Distance Techniques 40 2.10.2 Similarity Key Techniques 43 2.10.3 Rule Based Techniques 43 2.10.4 Probabilistic Techniques 43 2.11 Suggestion of Corrections 44 2.12 The Suggested Approach 44 2.12.1 Finding Candidates Using Minimum Edit Distance 45 2.12.2 Candidates Mining 45 2.12.3 Part-of-Speech Tagging and Parsing 46 Chapter Three : Hashed Dictionary and Looking Up Technique 3.1 Introduction 48 3.2 Hashing 48 3.2.1 Hash Function 49 3.2.2 Formulation 52 3.2.3 Indexing 53 3.3 Looking Up Procedure 56
  • 9. IX Subject Page No. 3.4 Dictionary Structure Properties 58 3.5 Similarity Based Looking-Up 59 3.5.1 Bi-grams Generation 60 3.5.2 Primary Centroids Selection 62 3.5.3 Centroids Referencing 63 3.6 Application of Similarity Based Looking up approach 64 3.7 The Similarity Based Looking up Properties 67 Chapter Four : Error Detection and Candidates Generation 4.1 Introduction 69 4.2 Non-word Error Detection 69 4.3 Real-Words Error Detection 71 4.4 Candidates Generation 72 4.4.1 Candidates Generation for Non-word Errors 72 4.4.1.2 Enhanced Levenshtein Method 74 4.4.1.3 Similarity Measure 78 4.4.1.4 Looking for Candidates 79 4.4.2 Candidates Generation for Real-words Errors 81 Chapter Five : Text Correction and Candidates Suggestion 5.1 Introduction 82 5.2 Correction and Candidates Suggestion Structure 82 5.3 Named-Entity Recognition 85 5.4 Candidates Ranking 86 5.4.1 Edit Distance Based Similarity 87 5.4.2 First and End Symbols Matching 87 5.4.3 Difference in Lengths 88 5.4.4 Transposition Probability 89 5.4.5 Confusion Probability 90 5.4.6 Consecutive Letters (Duplication) 91 5.4.7 Different Symbols Existence 92 5.5 Syntax Analysis 93 5.5.1 Sentence Phrasing 93 5.5.2 Candidates Optimization 95 5.5.3 Grammar Correction 95 5.5.4 Document Correction 97 Chapter Six: Experimental Results, Conclusions, and Future Works
  • 10. X Subject Page No. 6.1 Experimental Results 98 6.1.1 Tagging and Error Detection Time Reduction 98 6.1.1.1 Successful Looking Up 99 6.1.1.2 Failure Looking Up 100 6.1.2 Candidates Generation and Similarity Search Space Reduction 101 6.1.3 Time Reduction of the Damerau-Levenshtein method 103 6.1.4 Features Effect on Candidates Suggestion 104 6.2 Conclusions 107 6.3 Future Works 108 References 110 Appendix A 117 Appendix B 122 List of Figures Figure No. Title Page No. (2.1) NLP dimensions 16 (2.2) Linguistics analysis steps 17 (2.3) Linguistic Units Hierarchy 19 (2.4) Classification of POS tagging models 24 (2.5) An example of lexical change 34 (2.6) Outlines of Spell Correction Algorithm 38 (2.7) Levenshtein Edit Distance Algorithm 41 (2.8) Damerau-Levenshtein Edit Distance Algorithm 42 (2.9) The Suggested System Block Diagram 47 (3.1) Token Hashing Algorithm 54
  • 11. XI Figure No. Title Page No. (3.2) Dictionary Structure and Indexing Scheme 55 (3.3) Algorithm of Looking Up Procedure 57 (3.4) Semi Hash Clustering block diagram 61 (3.5) Similarity Based Hashing algorithm 64 (3.6) Block diagram of candidates generation using SBL 66 (3.7) Similarity Based Looking up algorithm 68 (4.1) Tagging Flow Chart 70 (4.2) The Enhanced Levenshtein Method Algorithm 76 (4.3) Original Levenshtein Example 77 (4.4) Damerau-Levenshtein Example 77 (4.5) Enhanced Levenshtein Example 78 (5.1) Candidates ranking flowchart 84 (5.2) Syntax analysis flowchart 94 (6.1) Tokens distribution in primary packets 99 (6.2) Tokens distribution in secondary packets 99 (6.3) Time complexity Variance of Levenshtein, Damerau- Levenshtein, and Enhanced Levenshtein (our modification) 103 (6.4) Suggestion Accuracy with a comparison to Microsoft Office Word on a Sample from the Wikipedia 104 (6.5) Testing the suggested system accuracy and comparing the results with other systems using the same dataset 105 (6.6) Discarding one feature at a time for optimal candidate selection 106 (6.7) Using one feature at a time for optimal candidate selection 107
  • 12. XII List of Tables Table No. Title Page No. (1-1) Summary of Literature Review 9 (3-1) Alphabet Encoding 50 (3-2) Addressing Range 52 (3-3) Predicting errors using Bi-grams analysis 61 (5-1) Transposition Matrix 90 (5-2) Confusion Matrix 91
List of Symbols and Abbreviations
∑ : Alphabet
A : Adjectival Phrase
abs : Absolute Difference
C : Sentence Complement
CFG : Context Free Grammar
D : Dictionary
DNA : Deoxyribonucleic Acid
E : Error
G : Grammar
GEC : Grammar Error Correction
HMM : Hidden Markov Model
IR : Information Retrieval
MT : Machine Translation
NE : Named Entity
NER : Named-Entity Recognition
NG : Noun Group
NLG : Natural Language Generation
NLP : Natural Language Processing
NLs : Natural Languages
NLU : Natural Language Understanding
  • 13. XIII
NP : Noun Phrase
O( ) : big-Oh notation (= at most)
OCR : Optical Character Recognition
P : Production Rule
POS : Part Of Speech
PP : Prepositional Phrase
Q : Query
R : Ranking Value
R_Dist : Relative Distance
S : Start Symbol
SMT : Stanford Machine Translator
SR : Speech Recognition
St1, St2 : String1, String2
V : Variable
v : Adverbial Phrase
VP : Verb Phrase
Ω( ) : big-Omega notation (= at least)
  • 15. 1 Chapter One: Overview

1.1 Introduction

Natural Language Processing, also known as Computational Linguistics, is the field of computer science that deals with linguistics; it is a form of human-computer interaction in which formalization is applied to the elements of human language so that they can be processed by a computer [Ach14]. Natural Language Processing (NLP) is the implementation of systems capable of manipulating and processing natural language (NL) sentences [Jac02] such as English, Arabic and Chinese, as opposed to formal languages like Python, Java and C++, or descriptive notations such as DNA sequences in biology and chemical formulas in chemistry [Mom12]. The NLP task is the design and construction of software for analyzing, understanding and generating spoken and/or written NLs. [Man08] [Mis13]

NLP has many applications, such as automatic summarization, Machine Translation (MT), Part-Of-Speech (POS) Tagging, Speech Recognition (SR), Optical Character Recognition (OCR), Information Retrieval (IR), Opinion Mining [Nad11], and others [Wol11]. Text Correction is another significant application of NLP; it includes both Spell Checking and Grammar Error Correction (GEC). Spell checking research extends back to the mid-20th century with Les Earnest at Stanford University, but the first application was created in 1971 by Ralph Gorin, Earnest's student, for the DEC PDP-10 mainframe with a dictionary of 10,000 English words. [Set14] [Pet80]

Grammar error correction, in spite of its central role in semantic and meaning representation, has been largely ignored by the NLP community. In recent
  • 16. years, improvements have been noticed in automatic GEC techniques. [Voo05] [Jul13] However, most of these techniques are limited to specific domains such as real-word spell correction [Hwe14], subject-verb disagreement [Han06], verb tense misuse [Gam10], determiners or articles, and improper preposition usage. [Tet10] [Dah11]

Different techniques such as edit distance [Wan74], rule-based techniques [Yan83], similarity key techniques [Pol83] [Pol84], n-grams [Zha98], probabilistic techniques [Chu91], neural nets [Hod03] and the noisy channel model [Tou02] have been proposed for text correction purposes. Each technique needs some sort of resource: edit distance, rule-based and similarity key techniques require a dictionary (or lexicon); n-gram and probabilistic techniques work with statistical and frequency information; neural nets are trained with training patterns; and so on.

Text correction, spelling and grammar, is an extensive process that typically includes three major steps: [Ach14] [Jul13]

The first step is to detect the incorrect words. The most popular way to decide whether a word is misspelled is to look it up in a dictionary, a list of correctly spelled words. This approach can detect non-word errors but not real-word errors [Kuk92] [Mis13], because an unintended word may still match a word in the dictionary. NLs have a large number of words, resulting in a huge dictionary, so looking up every word consumes a long time. In GEC this step is more complicated: it requires analysis at the level of sentences and phrases, using computational linguistics, to detect the word that makes the sentence incorrect.

Next, a list of candidates or alternatives should be generated for the incorrect word (misspelled or misused).
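The dictionary-lookup detection step above can be sketched in a few lines. This is a toy illustration under stated assumptions: the tiny dictionary and the function name are invented here, and the real system works over a dictionary of more than 300,000 tokens.

```python
import re

def detect_nonword_errors(text, dictionary):
    """Return the tokens of `text` that are absent from `dictionary`.

    Only non-word errors are caught this way; a real-word error
    (e.g. typing 'form' for 'from') matches a dictionary entry
    and slips through, exactly as the text explains.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in dictionary]

dictionary = {"the", "cat", "sat", "on", "mat", "form", "from"}
print(detect_nonword_errors("The cat szt on the mat", dictionary))  # ['szt']
```

Note that scanning a flat set like this is what the thesis's hash-indexing scheme is designed to speed up at scale.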
This list is preferred to be short and contains the words with highest similarity or suitability; and to produce it, a technique is needed to calculate the similarity of the incorrect word with
  • 17. every word in the dictionary. Efficiency and accuracy are major factors in the selection of such a technique. GEC requires broad knowledge of diverse grammatical error categories and extensive linguistic technique to identify alternatives, because a grammatical error may not result from a single word.

Finally, the intended word, or a list of alternatives containing it, is suggested. This task requires ranking the words according to their similarity to the incorrect word; other considerations may or may not be taken into account depending on the technique in use.

Text mining techniques have started to enter the area of text correction; Clustering [Zam14], Named-Entity Recognition (NER) [Bal00] [Rit11] and Information Retrieval [Kir11] are examples. Statistics and probabilities have also played a great role, specifically in analyzing common mistakes and n-gram datasets [Ahm09] [Gol96] [Amb08]. Clustering, at both the syllable and phonetic levels, can be used to reduce the lookup space; NER may help avoid interpreting proper nouns as misspellings; and statistics have been merged with NLP techniques to provide more precise parsing and POS tagging, usually in context-dependent applications. The application of a given technique differs according to the intended level of correction: it starts from the character level [Far14], passes through the word, phrase (usually in GEC) and sentence levels, and ends at the context or document-subject level.

1.2 Problem Statement

Although many text checking and correction systems have been produced, each varies in its input quality restrictions, techniques used, output accuracy, speed, performance conditions, etc. [Ahm09] [Pet80]. This field of NLP remains open research from all sides, because no complete algorithm or technique handles all considerations.
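The candidate-ranking step described above can be sketched with the standard library's difflib as a stand-in similarity measure. This is only an illustration: the thesis uses its own enhanced Levenshtein measure plus extra features, and the lexicon here is invented.

```python
import difflib

def suggest(misspelled, lexicon, n=3):
    # Rank dictionary words by a similarity score and return the n
    # best, most similar first; difflib's ratio stands in for the
    # thesis's edit-distance-based measure.
    return difflib.get_close_matches(misspelled, lexicon, n=n, cutoff=0.6)

lexicon = ["receiver", "review", "revive", "reverse", "cat"]
print(suggest("reciever", lexicon))
```

Keeping the returned list short and well ordered is exactly the "short list with highest similarity" requirement stated above.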
The limited linguistic knowledge, the huge number of lexicons, the extensive grammar, language ambiguity and change over time, the variety of committed errors, and the computational requirements are challenges facing the development of a text correction application. In this work, some of the above-mentioned problems are solved using a set of solutions:
• Integrating two lexicon datasets (WordNet and Ispell).
• Using a brute-force approach to solve some sorts of ambiguity.
• Applying hashing and indexing in looking up the dictionary.
• Reducing the search space of the candidate-collection process by grouping similarly spelled words into semi-clusters.
The Levenshtein method [Hal11] is also enhanced to consider Damerau's four types of errors within a shorter time than the Damerau-Levenshtein method [Hal11]. Named Entity Recognition, letter confusion and transposition, and candidate length are used as features to optimize the candidate suggestion, in addition to applying rules of Part-Of-Speech tags and sentence constituency to check the grammatical correctness of a sentence, whether or not it is lexically correct. The three components of the proposed system are: (1) a spelling error detector based on a fast lookup technique over a dictionary of more than 300,000 tokens, constructed by applying a string-prefix-dependent hash function and an indexing method; the grammar error detector is a brute-force parser. (2) For candidate generation, an enhancement was implemented on the Levenshtein method to consider Damerau's four error types; it is then used to measure similarity according to the minimum edit distance and the effect of the difference in lengths, and the dictionary tokens are grouped into spelling-based clusters to reduce the search space. (3) The candidate suggestion exploits NER features,
transposition error and confusion statistics, affix analysis (including first- and last-letter matching), candidate length, and parsing success.

1.3 Literature Review
• Asha A. and Bhuma V. R., 2014, introduced a probabilistic approach to string transformation that includes a model consisting of rules and weights for training, and an algorithm that depends on scoring and ranking according to a conditional probability distribution for generating the top k candidates at the character level, where both high- and low-frequency words can be generated. Spell checking is one of many applications to which the approach was applied; the misspelled strings (words or characters) are transformed, by applying a number of operators, into the k most similar strings in a dictionary (start and end letters are constant). [Ach14]
• Mariano F., Zheng Y., and others, 2014, addressed the correction of grammatical errors by pipelining processes, combining results from multiple systems. The components of the approach are: a rule-based error corrector that uses rules automatically derived from the Cambridge Learner Corpus, based on N-grams that have been annotated as incorrect; an SMT system that translates incorrectly written English into correct English; NLTK¹, used to perform segmentation, tokenization, and POS tagging; candidate generation that produces all possible combinations of corrections for the sentence, in addition to the sentence itself to consider the "no correction" option; finally, the candidates are ranked using a language model. [Fel14]
__________________________________________________________
¹ The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license.
Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language.
• Anubhav G., 2014, presented a rule-based approach that used two POS taggers, the Stanford parser and TreeTagger, to correct the grammatical errors of non-native English speakers. Error detection depends on the outputs of the two taggers: if they differ, the sentence is not correct. Errors are corrected using the Nodebox English Linguistics library. Error correction includes subject-verb disagreement, verb form, and errors detected by POS tag mismatch. [Gup14]
• Stephan R., 2013, proposed a model for spelling correction based on treating words as "documents" and spelling correction as a form of document retrieval, in that the model retrieves the best-matching correct spelling for a given input. The words are transformed into tiny documents of bits, and Hamming distance is used to predict the closest bit string from a dictionary holding the correctly spelled words as bit strings. The model is knowledge-free and only contains a list of correct words. [Raa13]
• Youssef B., 2012, produced a parallel spell checking algorithm for spelling error detection and correction. The algorithm is based on information from the Yahoo! N-gram dataset 2.0; it is a shared-memory model allowing concurrency among threads on both parallel multi-processor and multi-core machines. The three major components (error detector, candidate generator and error corrector) are designed to run in parallel. The error detector, based on unigrams, detects non-word errors; the candidate generator is based on bi-grams; the error corrector, which is context sensitive, is based on 5-gram information. [Bas12]
• Hongsuck S., Jonghoon L., Seokhwan K., Kyusong L., Sechun K., and Gary G. L., 2012, presented a novel method for grammatical error correction by building a meta-classifier.
The meta-classifier decides the final output based on the internal results of several base classifiers; they used multiple corpora tagged for grammatical errors, with
different properties in various aspects. The method focused on articles, and correction arises only when a mismatch with the observed articles occurs. [Seo12]
• Kirthi J., Neeju N. J., and P. Nithiya, 2011, proposed a semantic information retrieval system performing automatic spelling correction on user queries before applying the retrieval process. The correcting procedure depends on matching the misspelled word against a dictionary of correctly spelled words using the Levenshtein algorithm. If an incorrect word is encountered, the system retrieves the most similar word depending on the Levenshtein measure and the occurrence frequency of the misspelled word. [Kir11]
• Farag, Ernesto, and Andreas, 2008, developed a language-independent spell checker. It is based on enhancing the N-gram model by creating a ranked list of correction candidates derived from N-gram statistics and lexical resources, then selecting the most promising candidates as correction suggestions. Their algorithm assigns weights to the possible suggestions to detect non-word errors. They relied on a "MultiWordNet" dictionary of about 80,000 entries. [Ahm09]
• Mays, Damerau, and Mercer, 2008, designed a noisy-channel model of real-word spelling error correction. They assumed that the observed sentence is a signal passed through a noisy channel, where the channel reflects the typist and the distortion reflects errors committed by the typist. The probability of the sentence's correctness, given by the channel (typist), is a parameter associated with that sentence. The probability of every word in the sentence being the intended one is equivalent to the sentence correctness probability, and the word is associated with a set of spelling-variant words excluding the word itself. Correction can be applied to one word in the sentence by replacing the incorrect one with another
from the candidates set (its real-word spelling variations) so that it gives the maximum probability. [Amb08]
• Stoyan, Svetla, and others, 2005, described an approach for lexical post-correction of the output of an optical character recognizer (OCR), carried out within two research projects. They worked on multiple fronts: on the dictionary side, they enriched their large dictionaries with specialty dictionaries; for candidate selection, they used a very fast search algorithm that depends on Levenshtein automata for efficiently selecting the correction candidates within a distance bound not exceeding 3; and they ranked candidates depending on a number of features such as frequency and edit distance. [Mih04]
• Suzan V., 2002, described a context-sensitive spell checking algorithm based on the BESL spell checker lexicons and word trigrams, for detecting and correcting real-word errors using probability information. The algorithm splits the input text into trigrams, and every trigram is looked up in a precompiled database which contains a list of trigrams and their occurrence counts in the corpus used for compiling the database. The trigram is correct if it is in the trigram database; otherwise it is considered an erroneous trigram containing a real-word error. The correction algorithm uses the BESL spell checker to find candidates, but those most frequent in the trigram database are suggested to the user. [Ver02]
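The trigram lookup in the last approach can be sketched as follows; the corpus and the "unseen trigram" flagging rule here are illustrative assumptions, not the BESL data or thresholds.

```python
from collections import Counter

def build_trigram_db(corpus_sentences):
    """Count word trigrams from a (toy, illustrative) training corpus."""
    db = Counter()
    for sent in corpus_sentences:
        words = sent.lower().split()
        for i in range(len(words) - 2):
            db[tuple(words[i:i + 3])] += 1
    return db

def flag_real_word_errors(sentence, db):
    """A trigram unseen in the database is flagged as possibly
    containing a real-word error."""
    words = sentence.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)
            if tuple(words[i:i + 3]) not in db]

corpus = ["I ate a piece of cake", "a piece of advice"]
db = build_trigram_db(corpus)
print(flag_real_word_errors("I ate a peace of cake", db))
# → [('ate', 'a', 'peace'), ('a', 'peace', 'of'), ('peace', 'of', 'cake')]
```

Note that every trigram containing the real-word error "peace" is flagged, even though "peace" itself is a correctly spelled word, which is exactly what dictionary-only checkers miss.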
Table 1.1: Summary of Literature Review

No. | Reference | Methodology | Technique
1 | [Ach14] | Generating the top k candidates at the character level for both high- and low-frequency words | A model consisting of rules and weights, and an algorithm based on a conditional probability distribution
2 | [Fel14] | Grammatical error correction based on generating all possible correct alternatives for the sentence | Combining the results of multiple systems: a rule-based error corrector, an SMT incorrect-English-to-correct-English translator, and NLTK for segmentation, tokenization and tagging
3 | [Gup14] | Correction of non-native English speakers' grammatical errors | Error detection using the Stanford parser and TreeTagger; correction based on the Nodebox English Linguistics library
4 | [Raa13] | Dictionary-based spelling correction treating the misspelled word as a document | Converting the misspelled word into a tiny document of bits and retrieving the most similar documents using Hamming distance
5 | [Bas12] | Context-sensitive spell checking using a shared-memory model allowing concurrency among threads for parallel execution | Different N-gram levels for error detection, candidate generation, and candidate suggestion, depending on the Yahoo! N-Grams dataset 2.0
6 | [Seo12] | Meta-classifier for grammatical error correction, focused mainly on articles | Deciding the output depending on the internal results of several base classifiers
7 | [Kir11] | Automatic spelling correction for user queries before applying the retrieval process | Using the Levenshtein algorithm for both error detection and correction in a dictionary lookup technique
8 | [Ahm09] | Language-independent model for non-word error correction based on N-gram statistics and lexical resources | Ranking a list of correction candidates by assigning weights to the possible suggestions, depending on a "MultiWordNet" dictionary of about 80,000 entries
9 | [Amb08] | Noisy-channel model for real-word error correction based on probability | The channel represents the typist, distortion represents the error, and the noise probability is a parameter
10 | [Mih04] | OCR output post-correction | Levenshtein automata for candidate generation and frequency for ranking
11 | [Ver02] | Context-sensitive spell checking algorithm based on trigrams | Splitting texts into word trigrams and matching them against the precompiled BESL spell checker lexicons; suggestion depends on probability information

1.4 Research Objectives
This research attempts to design and implement a smart text document correction system for English texts. It is based on mining a typed text to detect spelling and grammar errors and giving the optimal suggestion(s) from a set of candidates. Its steps are:
1. Analyzing the given text using Natural Language Processing techniques, detecting the erroneous words at each step.
2. Looking up candidates for the erroneous words and ranking them according to a given set of features and conditions to form the initial solutions.
3. Optimizing the initial solutions depending on the information extracted from the given text and the detected errors.
4. Recovering the input text document with the optimal solutions and associating the best set of candidates with each detected incorrect word.

1.5 Thesis Outline
The next five chapters are:
1. Chapter Two, "Background and Related Concepts", consists of two parts. The first overviews NLP fundamentals, applications and techniques, whereas the second is about text correction techniques.
2. Chapter Three, "Dictionary Structure and Looking up Technique", describes the suggested approach for constructing the system's dictionary for both perfect-matching and similarity lookup.
3. Chapter Four, "Error Detection and Candidates Generation", presents the suggested technique for indicating incorrect words and the method of generating candidates.
4. Chapter Five, "Automatic Text Correction and Candidates Suggestion", describes the techniques of suggestion selection and optimization.
5. Chapter Six, "Experimental Results, Conclusion, and Future Works", shows the experimental results of applying the techniques described in chapters three, four and five, the conclusions of the system, and future directions.
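The four steps listed in Section 1.4 can be sketched as a pipeline skeleton. Every helper here is a hypothetical placeholder (difflib's similarity ratio stands in for the thesis's enhanced edit-distance measure, and a plain word set stands in for the hashed dictionary); this is an illustrative outline, not the implemented system.

```python
import difflib

def rank_candidates(word, dictionary, n=3):
    # Placeholder ranking: difflib's similarity ratio instead of the
    # thesis's enhanced Levenshtein measure (an assumption).
    return difflib.get_close_matches(word, dictionary, n=n, cutoff=0.6)

def correct_document(text, dictionary):
    """Skeleton of the four research-objective steps."""
    # 1. Analyze the text and detect erroneous words.
    tokens = text.split()
    errors = [t for t in tokens if t.lower() not in dictionary]
    # 2. Look up and rank candidates to form the initial solutions.
    candidates = {e: rank_candidates(e.lower(), dictionary) for e in errors}
    # 3. Optimize: here, trivially pick the top-ranked candidate.
    best = {e: c[0] for e, c in candidates.items() if c}
    # 4. Recover the document, keeping the candidate lists available.
    corrected = " ".join(best.get(t, t) for t in tokens)
    return corrected, candidates

corrected, cands = correct_document("I recieve my mail", {"i", "receive", "my", "mail"})
print(corrected)  # → I receive my mail
```

The real system replaces step 1 with dictionary hashing plus brute-force parsing, step 2 with the enhanced Levenshtein measure over spelling-based clusters, and step 3 with the NER, confusion and affix features described above.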
Chapter Two
Background and Related Concepts
Part I: Natural Language Processing

2.1 Introduction
Natural Language Processing (NLP) began in the late 1940s, focused on machine translation; in 1958, NLP was linked to information retrieval by the Washington International Conference of Scientific Information [Jon01]. Primary ideas for developing applications for detecting and correcting text errors started in that period. [Pet80] [Boo58] Natural Language Processing has attracted great interest from that time to the present day because it plays an important role in the interaction between humans and computers. It represents the intersection of linguistics and artificial intelligence [Nad11], where machines can be programmed to manipulate natural language.

2.2 Natural Language Processing Definition
"Natural Language Processing (NLP) is the computerized approach for analyzing text that is based on both a set of theories and a set of technologies." [Sag13] NLP describes the function of software or hardware components in a computer system that is capable of analyzing or synthesizing human languages (spoken or written) [Jac02] [Mis13] like English, Arabic, Chinese, etc.; not formal languages like Python, Java, C++, etc., nor
descriptive languages such as DNA in biology and chemical formulas in chemistry [Mom12]. "NLP is a tool that can reside inside almost any text processing software application." [Wol11] We can define NLP as a subfield of Artificial Intelligence encompassing anything needed by a computer to understand and generate natural language. It is based on processing human language for two tasks: the first receives a natural language input (text or speech), applies analysis, reasons about what was meant by that input, and outputs in computer language; this is the task of Natural Language Understanding (NLU). The second task is to generate human sentences according to specific considerations; the input is in computer language but the output is in human language; this is called Natural Language Generation (NLG). [Raj09] "Natural Language Understanding is associated with the more ambitious goals of having a computer system actually comprehend natural language as a human being might." [Jac02]

2.3 Natural Language Processing Applications
Despite its wide usage in computer systems, NLP almost entirely disappears into the background, where it is invisible to the user and adds significant business value. [Wol11] The major distinction of NLP applications from other data processing systems is that they use Language Knowledge. Natural Language Processing applications are mainly divided into two categories according to the given NL format [Mom12] [Wol11]:
2.3.1 Text Technologies
• Spell and Grammar Checking: systems that indicate lexical and grammar errors and suggest corrections.
• Text Categorization and Information Filtering: in such applications, NLP represents the documents linguistically and compares each one to the others. In text categorization, the documents are grouped according to the characteristics of their linguistic representation into several categories. Information filtering singles out, from a collection of documents, those satisfying some criterion.
• Information Retrieval: finds and collects information relevant to a given query. A user expresses the information need as a query, then the system attempts to match the given query to the database documents that satisfy it. Query and documents are transformed into a sort of linguistic structure, and the matching is performed accordingly.
• Summarization: according to an information need or a query from the user, this type of application finds the most relevant part of the document.
• Information Extraction: refers to the automatic extraction of structured information, such as entities, their relationships, and the attributes describing them, from unstructured sources. This can integrate structured and unstructured data sources, if both exist, and pose queries spanning the integrated information, giving better results than keyword searches alone.
• Question Answering: works with plain speech or text input and applies an information search based on the input; for example, IBM® Watson™, the reigning JEOPARDY! champion, which reads
questions and understands their intention, then looks up the knowledge library to find a match.
• Machine Translation: translates a given text from one natural language to another; some applications can recognize the language of the given text even if the user did not specify it correctly.
• Data Fusion: combining extracted information from several text files into a database or an ontology.
• Optical Character Recognition: digitizing handwritten and printed texts, i.e., converting characters from images to digital codes.
• Classification: this type of NLP application sorts and organizes information into relevant categories, like e-mail spam filters and the Google News™ news service.
NLP has also entered other applications such as educational essay test-scoring systems, voice-mail phone trees, and e-mail spam detection software.

2.3.2 Speech Technologies
• Speech Recognition: mostly used in telephone voice response systems serving clients. Its task is processing plain speech; it is also used to convert speech into text.
• Speech Synthesis: converting text into speech. This process requires working at the level of phones and converting alphabetic symbols into sound signals.
2.4 Natural Language Processing and Linguistics
Natural Language Processing is concerned with three dimensions: language, algorithm and problem, as presented in figure (2.1). On the language dimension, NLP considers linguistics; the algorithm dimension covers NLP techniques and tasks, while the problem dimension depicts the mechanisms applied to solve problems. [Bha12]

2.4.1 Linguistics
Natural language is a means of communication. It is a system of arbitrary signals such as voice sounds and written symbols. [Ali11] Linguistics is the scientific study of language; it starts from the simple acoustic signals which form sounds and ends with pragmatic understanding producing the full contextual meaning. There are two major levels of linguistic analysis, Speech Recognition (SR) and Natural Language Processing (NLP), as shown in figure (2.2).

Figure (2.1): NLP dimensions [Bha12]
2.4.1.1 Terms of Linguistic Analysis
A natural language, as a formal language does, has a set of basic components that may vary from one language to another but remain bounded under specific considerations, giving every language its special characteristics. From the computational view, a language is a set of strings generated over a finite alphabet and can be characterized by a grammar.

[Figure (2.2): Linguistics analysis steps [Cha10] — from acoustic signal through phones, letters and strings, morphemes, words, phrases and sentences, meaning out of context, and meaning in context; the corresponding analysis levels are phonetics, phonology, lexicon, morphology, syntax, semantics and pragmatics, with SR covering the early steps and NLP the rest.]

The definition
of the three abstract names depends on the language itself; i.e., strings, alphabet and grammar formulate and characterize a language.
• Strings: in natural language processing, the strings are the morphemes of the language, their combinations (words) and the combinations of their combinations (sentences), but linguistics goes somewhat deeper than this. It starts with phones, the primitive acoustic patterns, which are significant and distinguishable from one natural language to another. Phonology groups phones together to produce phonemes, represented by symbols. Morphemes consist of one or more symbols; thus, natural languages can be further distinguished.
• Alphabet: when individual symbols, usually thousands, represent words, the language is "logographic"; if the individual symbols represent syllables, it is "syllabic"; and when they represent sounds, the language is "alphabetic". Syllabic and alphabetic languages typically have fewer than 100 symbols, unlike logographic ones. English is an alphabetic language system consisting of 26 symbols; these symbols represent phones, which are combined into morphemes, which may or may not be combined further to form words.
• Grammar: a grammar is a set of rules specifying the legal structure of the language; it is a declarative representation of the language's syntactic facts. Usually, a grammar is represented by a set of production rules.
2.4.1.2 Linguistic Units Hierarchy
Language can be divided into pieces; there is a typical structure or form for every level of analysis. Those pieces can be put into a hierarchical structure, starting from a meaningful sentence at the top level and proceeding through the separation of building units until reaching the primary acoustic sounds. Figure (2.3) presents an example.

[Figure (2.3): Linguistic Units Hierarchy — the sentence "The teacher talked to the students" decomposed level by level: into phrases, words ("The teacher talked to the students"), morphemes ("The teach-er talk-ed to the student-s"), and phonemes. The phoneme symbols denote the English phone codes used by OXFORD dictionaries.]

2.4.1.3 Sentence Structure and Constituency
"It is constantly necessary to refer to units smaller than the sentence itself: units such as those which are commonly referred to as CLAUSE, PHRASE, WORD, and MORPHEME. The relation between one unit and another unit of which it is a part is CONSTITUENCY." [Qui85] The task of dividing a sentence into constituents is a complex task that
requires incorporating more than one analysis stage; tokenization, segmentation, parsing (and sometimes stemming) are usually merged together to build the parse tree for a given sentence.

2.4.1.4 Language and Grammar
A language is a 'set' of sentences and a sentence is a 'sequence' of 'symbols' [Gru08]; it can be generated given its context free grammar G = (V, ∑, S, P). [Cla10] Commonly, grammars are represented as a set of production rules which is taken by the parser and compared against the input sentences. Every matched rule adds something to the complete structure of the sentence, which is called a 'parse tree'. [Ric91] Context free grammar (CFG) is a popular method for specifying formal grammars. It is used extensively to define language syntax. The four components of the grammar are defined in CFG as [Sag13]:
• Terminals (∑): the basic elements which form the strings of the language.
• Nonterminals or Syntactic Variables (V): sets of strings that define the language generated by the grammar. Nonterminals are key to syntax analysis and translation, imposing a hierarchical structure on the language.
• Set of production rules (P): this set defines the way of combining terminals with nonterminals to produce strings. Each production rule consists of a variable on its left side, its head, and a sequence of terminals and nonterminals on its right side, its body.
• Start symbol (S).
The following example describes the structure of an English sentence:
V = {S, NP, N, VP, V, Art}
∑ = {boy, icecream, dog, bite, like, ate, the, a}
P = {S → NP VP, NP → N, NP → Art N, VP → V NP, N → boy | icecream | dog, V → ate | like | bite, Art → the | a}
The grammar specifies two things about the language: [Ric91]
• Its weak generative capacity: the limited set of sentences which can be completely matched by a series of grammar rules.
• Its strong generative capacity: the grammatical structure(s) of each sentence in the language.
Generally, there is an infinite number of sentences that can be structured with each grammar. The strength and importance of grammars lie in their ability to supply structure to an infinite number of sentences, because they succinctly summarize the structures of an infinite number of objects of a certain class. [Gru08] A grammar is said to be generative if it has a fixed-size set of production rules which, if followed, can generate every sentence in the language in a finite number of actions. [Gru08]
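The toy grammar above can be checked mechanically with a naive recursive-descent recognizer; this is an illustrative sketch, not the brute-force parser used in the thesis.

```python
# The example grammar, with terminals written as plain words.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"], ["N"]],
    "VP":  [["V", "NP"]],
    "N":   [["boy"], ["icecream"], ["dog"]],
    "V":   [["ate"], ["like"], ["bite"]],
    "Art": [["the"], ["a"]],
}

def parse(symbol, words, pos=0):
    """Return every word position reachable after deriving `symbol`
    starting at words[pos], using naive recursive descent."""
    if symbol not in GRAMMAR:  # terminal: must match the next word
        return [pos + 1] if pos < len(words) and words[pos] == symbol else []
    ends = []
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for part in production:  # derive each part in sequence
            positions = [e for p in positions for e in parse(part, words, p)]
        ends.extend(positions)
    return ends

def accepts(sentence):
    """A sentence is in the language if S can derive all of its words."""
    words = sentence.split()
    return len(words) in parse("S", words)

print(accepts("the dog ate the icecream"))  # True
print(accepts("dog the ate"))               # False
```

This illustrates weak generative capacity directly: the finite rule set decides membership for any of the infinitely many candidate word sequences, rejecting those no rule sequence can produce.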
2.5 Natural Language Processing Techniques
2.5.1 Morphological Analysis
Morphology is the study of how words are constructed from morphemes, which represent the minimal meaning-bearing primitive units of a language. [Raj09] [Jur00] There are two broad classes of morphemes: stems and affixes; the distinction between the two classes is language dependent, in that it varies from one language to another. The stem usually refers to the main part of the word, and affixes can be added to a word to give it additional meaning. [Jur00] Furthermore, affixes can be divided into four categories according to the position where they are added. Prefixes, suffixes, circumfixes and infixes generally refer to the different types of affixes, but a language need not have all the types. English accepts both prefixes, which precede stems, and suffixes, which follow stems, while there is no good example of a circumfix (preceding and following a stem) in English, and infixing (inserting inside the stem) is not allowed (unlike German and Philippine languages, respectively). [Jur00] Morphology is concerned with recognizing the modification of base words to form other words with different syntactic categories but similar meanings. Generally, three forms of word modification are found [Jur00]:
• Inflection: syntactic rules change the textual representation of words, such as adding the suffix 's' to convert nouns into plurals, or adding 'er' and 'est' to convert regular adjectives into comparative and
superlative forms, respectively. This type of modification usually results in a word from the same word class as the stem word.
• Derivation: new words are produced by adding morphemes; it is usually more complex and harder in meaning than inflectional morphology. It often occurs in a regular manner and results in words that differ in word class from the stem word, like adding the suffix 'ness' to 'happy' to produce 'happiness'.
• Compounding: this type modifies stem words with other stem words by grouping them, like grouping 'head' with 'ache' to produce 'headache'. In English, this type is infrequent.
Morphological processing, also known as stemming, depends heavily on the analyzed language. The output is the set of morphemes that are combined to form words. Morphemes can be stem words, affixes, and punctuation marks.

2.5.2 Part-Of-Speech Tagging
Part-of-Speech (POS) tagging is the process of assigning the proper lexical information or POS tag (also known as word classes, lexical tags, and morphological classes), encoded as a symbol, to every word (or token) in a sentence. [Sco99] [Has06b] In English, POS tags are classified into four basic classes of words: [Qui85]
1. Closed classes: include prepositions, pronouns, determiners, conjunctions, modal verbs and primary verbs.
2. Open classes: include nouns, adjectives, adverbs, and full verbs.
3. Numerals: include numbers and orders.
4. Interjections: include a small set of words like oh, ah, ugh, phew.
Usually, a POS tag indicates one or more of the previous pieces of information, and it sometimes holds other features like the tense of the verb or the number
(plural or singular). POS tagging may generate tagged corpora or serve as a preprocessing step for subsequent NLP processes. [Sco99] The performance of most tagging systems is typically limited because they only use the local lexical information available in the sentence, in contrast to syntax analysis systems, which exploit both lexical and structural information. [Sco99] More research has been done, and several models and methods have been proposed to enhance tagger performance; they fall mainly into supervised and unsupervised methods, where the main difference between the two categories is that the training corpora are pre-tagged in supervised methods, unlike unsupervised methods, which need advanced computational methods to obtain such corpora. [Has06a] [Has06b] Figure (2.4) presents the main categories and shows some examples. In both categories, the following are the most popular:

Figure (2.4): Classification of POS tagging models [Has06a]
• Statistical (stochastic, or probabilistic) methods: taggers using these methods are first trained on a correctly tagged set of sentences, which allows the tagger to disambiguate words by extracting implicit rules or picking the most probable tag based on the words surrounding the given word in the sentence. Examples of these methods are Maximum-Entropy models, Hidden Markov Models (HMM), and Memory-Based models.
• Rule-based methods: a sequence of rules, a set of hand-written rules, is applied to detect the best tag set for the sentence regardless of any probability maximization. The set of rules needs to be written properly and checked by human experts. Examples: path-voting constraint models and decision tree models.
• Transformational approaches: combine both statistical and rule-based methods to first find the most probable set of available tags and then apply a set of rules to select the best.
• Neural networks: with a linear separator or a full neural network, these have been used for tagging.
The methods described above, as in any other research area, have their advantages and disadvantages, but there is a major difficulty facing all of them: the tagging of unknown words (words never seen before in the training corpora). While rule-based approaches depend on a special set of rules to handle such situations, stochastic and neural-net methods lack this feature and use other means such as suffix analysis and n-grams by applying morphological analysis; some methods use a default set of tags to disambiguate unknown words. [Has06a]
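The supervised statistical idea can be illustrated with the simplest possible tagger, a most-frequent-tag baseline with a default tag for unknown words; the training corpus and tag names below are toy assumptions.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Learn the most frequent tag for each word from a
    pre-tagged corpus (the supervised setting)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos_tag in sentence:
            counts[word.lower()][pos_tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="NN"):
    # Unknown words get a default tag, one common fallback strategy.
    return [(w, model.get(w.lower(), default)) for w in words]

corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
model = train_baseline_tagger(corpus)
print(tag(["The", "dog", "sleeps", "loudly"], model))
# → [('The', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('loudly', 'NN')]
```

The mistagged "loudly" shows exactly the unknown-word problem discussed above: a purely lexical model has no information about unseen words, which is why real taggers add suffix analysis, n-gram context, or hand-written rules.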
  • 41. 2.5.3 Syntactic Analysis "Syntax is the study of the relationships between linguistic forms, how they are arranged in sequence, and which sequences are well-formed". [Yul00] Syntactic analysis, also referred to as "parsing", is the process of converting a sentence from its flat format, a sequence of words, into a structure that defines its units and the relations between these units. [Raj09] Hence, the goal of this technique is to transform natural language into an internal system representation, which may take the form of dependency graphs, frames, trees, or other structural representations. Syntactic parsing attempts only to convert sentences into either dependency links representing the utterance's syntactic structure or a tree structure; the output of this process is called a "parse tree" or simply a "parse". [Dzi04] The parse tree of a sentence holds its meaning at the level of the smallest parts ("words" in the terms of language scientists, "tokens" in the terms of computer scientists). [Gru08] Syntactic analysis makes use of both the results of morphological analysis and Part-Of-Speech tagging to build the structural description of the sentence by applying the grammar rules of the language under consideration; if a sentence violates the rules, it is rejected and marked as incorrect. [Raj09] The two main components of every syntax analyzer are:  Grammar: the grammar provides the analyzer with the set of production rules that lead it to construct the structure of sentences and specifies the correctness of every given sentence.
  • 42. Good grammars make a careful distinction between the sentence/word level, which they often call syntax or syntaxis, and the word/letter level, which they call morphology. [Gru08]  Parser: the parser reconstructs the production tree (or trees) by applying the grammar to indicate how the given sentence (if correctly constructed) was produced from that grammar. Parsing is the process of structuring a linear representation in accordance with a given grammar. Today, most parsers combine context free grammars with probability models to determine the most likely syntactic structure out of the many that are accepted as parse trees for an utterance. [Dzi04] 2.5.4 Semantic Analysis "Semantics is the study of the relationships between linguistic forms and entities in the world; that is, how words literally connect to things." [Yul00] This technique and those following it are fundamental to language understanding. Semantic analysis is the process of assigning meanings to the syntactic structures of sentences regardless of their context. [Yul00] [Raj09] 2.5.5 Discourse Integration Discourse analysis is concerned with studying the effect of sentences on each other. It shows how a given sentence is affected by the one preceding it and how it affects the sentence following it. Discourse Integration is relevant to understanding texts and paragraphs rather than simple sentences, so discourse knowledge is important in the interpretation
  • 43. of anaphoric and temporal aspects (like pronouns) in the conveyed information. [Ric91] [Raj09] 2.5.6 Pragmatic Analysis This step interprets the structure that represents what is said to determine what was actually meant. Context is a fundamental resource for processing here. [Ric91] 2.6 Natural Language Processing Challenges The challenges of natural language processing are too numerous to be summarized in a limited list; every processing step, from the starting point to the outputting of results, involves a set of problems that natural language processors vary in their ability to handle. However, the application where NLP is used is usually concerned with a specific task rather than considering all processing steps in all their details; this is an advantage for the NLP community, helping to outline the challenges and problems according to the task under consideration. For our research area, we are precisely concerned with the set of problems that directly affect the task of text correction; the next subsections describe some of them: 2.6.1 Linguistic Units Challenges: The task of text correction starts from the level of characters up to paragraphs and full texts, and at every level there is a set of difficulties that the handling analyzer faces: 2.6.1.1 Tokenization In this process, the lexical analyzer, usually called the "Tokenizer", divides the text into smaller units; the output of this step is a series of
  • 44. morphemes, words, expressions and punctuation marks (called tokens). It involves locating token boundaries (where one token ends and another begins). Issues that arise in tokenization and should be addressed are [Nad11]:  Dependency on language type: a language includes, in addition to its symbols, a set of orthographic conventions which are used in writing to indicate the boundaries of linguistic units. English employs whitespace to separate words, but this isn't sufficient to tokenize a text in a complete and unambiguous manner because the same character may serve different uses (as is the case with punctuation), there are words with multiple parts (such as words divided by a hyphen at the end of a line, and some cases of prefix attachment), and many expressions consist of more than one word.  Encoding problems: syllabic and alphabetic writing systems are usually encoded using a single byte, but languages with larger character sets require two or more bytes. A problem arises when the same set of encodings represents different character sets, whereas tokenizers are targeted at a specific encoding for a specific language.  Other problems, such as dependency on the application requirements, which indicate what constituent is defined as a token; in computational linguistics the definition should precisely indicate what the next processing step requires. The tokenizer should also be able to recognize irregularities in texts, such as misspellings, erratic spacing, punctuation, etc. 2.6.1.2 Segmentation Segmenting text means dividing it into small meaningful pieces typically referred to as "sentences"; a sentence consists of one or more tokens
  • 45. and carries a meaning which may not be completely clear on its own. This task requires full knowledge of the scope of punctuation marks, since they are the major factor in denoting the starts and ends of sentences. Segmentation becomes more complicated as the uses of punctuation multiply. Some punctuation marks can be part of a token rather than a stopping mark, as is the case with periods (.) used in abbreviations. However, there is a set of factors that can help make the segmentation process more accurate [Nad11]:  Case distinction: English sentences normally start with a capital letter (but proper nouns also do).  POS tag: the tags surrounding a punctuation mark can assist this process, but multi-tag situations complicate it, such as the use of -ing verbs as nouns.  The length of the word (in the case of abbreviation disambiguation; notice that a period may mark the end of a sentence and an abbreviation at the same time).  Morphological information: this task requires finding the stem word by suffix removal. It is preferable not to separate the tokenization and segmentation processes; they are usually merged to solve most of the above problems, specifically segmentation problems. A sentence is described as an indeterminate unit because of the difficulty of deciding where one ends and another starts, while a grammar is indeterminate from the standpoint of deciding 'which sentence is grammatically correct?', because this question permits divided answers, and discourse segmentation difficulty is not the only reason but
  • 46. also grammatical acceptability, meaning, style goodness or badness, lexical acceptability, context acceptability, etc. [Qui85] 2.6.2 Ambiguity An input is ambiguous if there is more than one alternative linguistic structure for it. [Jur00] There are two major types of sentence ambiguity: genuine ambiguity and computer ambiguity. In the first, the sentence really has two different meanings to an intelligent hearer; in the second, the sentence has one meaning, but for the computer it has more than one, and this type, unlike the first, is a real problem facing NLP applications. [Not] Ambiguity as an NLP problem is found in every processing step [Not] [Bha12]: 2.6.2.1 Lexical Ambiguity Lexical ambiguity is the possibility for a word to have more than one meaning or more than one POS tag. Obviously, meaning ambiguity leads to semantic ambiguity, and tag ambiguity to syntactic ambiguity, because the latter can produce more than one parse tree. Frequency is an available solution for this problem. 2.6.2.2 Syntactic Ambiguity The sentence has more than one syntactic structure; in particular, common English ambiguity sources are:  Phrase attachment: how a certain phrase or clause in the sentence can be attached to another when there is more than one possibility. Crossing is not allowed in parse trees; therefore, a parser generates a parse tree for each accepted state.
  • 47.  Conjunction: sometimes the parser cannot decide which phrase a conjunction should be connected to.  Noun group structure: the rule NG → NG NG allows English to generate long series of nouns strung together. Some of these problems can be resolved by applying syntactic constraints. 2.6.2.3 Semantic Ambiguity Even when a sentence is unambiguous lexically and syntactically, there is sometimes more than one interpretation for it. This is because a phrase or a word may refer to more than one meaning. "Selection restrictions" or "semantic constraints" are a way to disambiguate such sentences: two concepts are combined in one model only if both concepts, or one of them, have specific features. Frequency in context can also help in deciding the meaning of a word. 2.6.2.4 Anaphoric Ambiguity This is the possibility for a word or a phrase to refer to something previously mentioned when there is more than one possible referent. This type can be resolved by parallel structures or recency rules. 2.6.3 Language Change "All living languages change with time; it is fortunate that they do so rather slowly compared to the human life span". Language change is represented by the change of the grammars of the people who speak the language, and it has been shown that English has changed in the lexical, phonological,
  • 48. morphological, syntactic, and semantic components of the grammar over the past 1,500 years. [Fro07] 2.6.3.1 Phonological Change Regular sound correspondences show how the phonological system changes. The phonological system is governed, like any other linguistic system, by a set of rules, and this set of phonemes and phonological rules is subject to change by modification, deletion and addition of new rules. A change in phonological rules can affect the lexicon in that some English word formations depend on sounds; for example, vowel sounds differentiate the nouns house and bath from the verbs house and bathe. 2.6.3.2 Morphological Change Morphological rules, like the phonological ones, are subject to addition, loss and change. Mostly, the usage of suffixes is the active area of change, where the way they are added to the ends of stems affects the resulting words and therefore changes the lexicon. 2.6.3.3 Syntactic Change Syntactic changes are influenced by morphological changes, which in turn are influenced by phonological changes. This type of change includes all types of grammar modifications that are mainly based on the reordering of words inside the sentence. 2.6.3.4 Lexical Change Change of lexical categories is the most common in this type of change. An example of this situation is the usage of nouns as verbs, verbs as nouns, and adjectives as nouns. Lexical change also includes the
  • 49. addition of new words, the borrowing of loan words from other languages, and the loss of existing words. Figure (2.5) : An example of lexical change 1 2.6.3.5 Semantic Change As the category of a word can change, its semantic representation or meaning can change, too. Three types of change are possible for a word:  Broadening: the meaning of a word is expanded to mean everything it has been used for and more.  Narrowing: the reverse of broadening; the word's meaning is reduced from a more general meaning to a specific one.  Shifting: the word's reference is shifted to another meaning somewhat different from the original one. _________________________________________________________ 1 Darby Conley / Get Fuzzy © UFS, Inc. 24 Feb. 2012
  • 50. Part II Text Correction 2.7 Introduction Text correction is the process of indicating incorrect words in an input text, finding candidates (or alternatives), and suggesting the candidates as corrections for the incorrect word. The term incorrect refers to two different types of erroneous words: misspelled and misused. Mainly, the process is divided into two distinct phases: an error detection phase, which indicates the incorrect words, and an error correction phase, which combines both generating and suggesting candidates. Devising techniques and algorithms for correcting texts automatically is a fundamental open research challenge, begun in the early 1960s and continuing today, because existing correction techniques are limited in their accuracy and application scope [Kuk92]. Usually, a correction application targets a specific type of errors because it is a complex task to computationally predict an intended word written by a human. 2.8 Text Errors A word can be mistaken in two ways. The first is by incorrectly spelling a word, due to a lack of information about its spelling or by mistyping symbol(s) within the word; this type of error is known as a non-word error, where the word cannot be found in the language lexicon. The second is by using a correctly spelled word in a wrong position in the sentence or in an unsuitable context. These errors are known as real-word errors
  • 51. where the incorrect word is accepted by the language lexicon. [Gol96] [Amb08] Non-word errors are easier to detect, unlike real-word errors; the latter need more information about the language's syntactic and semantic nature. Accordingly, correction techniques are divided into isolated-word error detection, which is concerned with non-word errors, and context-sensitive error correction, which deals with real-word errors. [Gol96] 2.8.1 Non-word errors These errors include words that are not found in the lexicon; a misspelled word contains one or more of the following errors:  Substitution: one or more symbols are changed.  Deletion: one or more symbols are missing from the intended word.  Insertion: symbol(s) are added at the front, the end, or any index in the word.  Transposition: two adjacent symbols are swapped. These four errors are known as the Damerau edit operations. 2.8.2 Real-word errors These errors occur through mistaking an intended word for another one that is lexically accepted. Real-word errors can result from phonetic confusion, like using the word "piece" instead of "peace", which usually leads to semantically unaccepted sentences even after applying non-word correction, or from misspelling the intended word and producing another lexically accepted word. [Amb08] Sometimes the confusion results in syntactically unaccepted sentences, like writing the sentence "John visit his uncle" instead of "John visits his uncle".
  • 52. Correcting real-word errors is context sensitive in that it needs to check the surrounding words and sentences before suggesting candidates. 2.9 Error Detection Techniques Indicating whether a word is correct or not depends on the type of correction procedure: non-word error detection usually checks the acceptance of a word in the language dictionary (the lexicon) and marks any unmatched word as incorrect, while real-word error detection is a more complex task requiring the analysis of larger parts of the text, typically paragraphs and the full text [Kuk92]. In this work, we mainly focus on non-word error detection techniques. Boswell defined a spelling error as E in a given query word Q that is not an entry in the dictionary at hand, D. [Bos05] He outlined an algorithm for spelling correction as shown in figure (2.6). Spell error detection techniques can be classified into two major types: 2.9.1 Dictionary Looking Up All the words of a given text are matched against every word in a pre-created dictionary or a list of all acceptable words in the language under consideration (or most of them, since some languages have a huge number of words and collecting them all is a nearly impossible task). A word is incorrect if and only if no match is found. This technique is robust but suffers from the long time required for checking; as the dictionary size becomes larger, looking-up time becomes longer. [Kuk92] [Mis13] 2.9.1.1 Dictionaries Resources There are many systems that deal with collecting and updating languages' lexical dictionaries. An example of these systems is the WordNet online application; it is a large database of English lexicons. Lexicons (nouns,
  • 53. verbs, adjectives, articles, etc.) are interlinked by lexical and conceptual-semantic relations. The structure of WordNet is a network of words and concepts that are related meaningfully, and this structure has made it a good tool for NLP and Computational Linguistics. Another example is the ISPELL text corrector, an online spell checker providing interfaces for many western languages. ISPELL is the latest version of R. Gorin's spell checker, which was developed for Unix. Its spelling correction suggestions are based on a single Levenshtein edit distance, and it depends on looking up every token in the input text against a huge lexical dictionary. [ISP14] 2.9.1.2 Dictionaries Structures The standard looking-up technique is to match every token in the text against every token in the dictionary, but this process requires a long time because NL dictionaries are usually huge and string matching needs more time than other data types do. A solution to this challenge is to reduce the search space in a way that keeps similar tokens grouped together. Figure (2.6) : Outline of the Spell Correction Algorithm [Bos05] Algorithm: Spell_correction Input: word w Output: suggestion(s) a set of alternatives for w Begin If (is_mistake(w)) Begin Candidates=get_candidates(w) Suggestions=filter_candidates(candidates) Return suggestions End Else Return is_correct End.
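The outline in Figure (2.6) can be sketched in runnable form. The tiny lexicon and the candidate generator/filter below are illustrative stand-ins, not the actual system's components (the real generator and filter are developed in later sections):

```python
# Illustrative lexicon; a real system would load hundreds of thousands
# of entries from WordNet/ISPELL-style word lists.
LEXICON = {"peace", "piece", "place", "plane", "trace"}

def is_mistake(word):
    """Dictionary looking-up: a word is a mistake iff it has no match."""
    return word.lower() not in LEXICON

def get_candidates(word):
    # Stand-in generator: every lexicon entry of similar length.
    return [t for t in LEXICON if abs(len(t) - len(word)) <= 1]

def filter_candidates(word, candidates, limit=3):
    # Stand-in filter: prefer candidates sharing the first letter.
    ranked = sorted(candidates, key=lambda t: (t[0] != word[0], t))
    return ranked[:limit]

def spell_correction(word):
    """Figure (2.6): detect, generate, filter, suggest."""
    if not is_mistake(word):
        return "is_correct"
    return filter_candidates(word, get_candidates(word))
```

The design point the figure makes is the strict separation of detection (`is_mistake`) from generation and filtering, which the later sections refine independently.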
  • 54. Grouping according to spelling or phonetics [Mis13] and using hash tables are two fundamental ways to minimize the search space. Hashing techniques apply a hash function to generate a numeric key from strings. The numeric keys are references to buckets of tokens that generate the same key indices; hash functions differ in their ability to distribute tokens and in how much they minimize the search space. A perfect hash function generates no collisions (hashing two different tokens to the same key index), and a uniform hash function distributes tokens among buckets uniformly. The optimal hash function is a uniform perfect hash function, which hashes one token to every bucket; such a situation is impossible with dictionaries due to the variance of tokens. [Nie09] Spelling- and phonetics-dependent groups use a limited set of buckets and generate keys according to spelling or pronunciation; they are another style of hashing, and sometimes of clustering. SPEEDCOP and Soundex are examples. [Mis13] [Kuk92] 2.9.2 N-gram Analysis N-grams are defined as n-symbol subsequences of words or strings, where n is variable, often taking the value one to produce unigrams (or monograms), two to produce bigrams (sometimes called "digrams"), or three to produce trigrams; it rarely takes larger values. This technique detects errors by examining each n-gram in a given string and looking it up in a precompiled n-gram statistics table. The decision depends on the existence of such an n-gram or on the frequency of its occurrence; if the n-gram is not found, or is highly infrequent, then the words or strings which contain it are considered incorrect. [Kuk92] [Mis13]
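The n-gram detection idea can be sketched as follows, using character bigrams over a toy lexicon; the `#` boundary-padding convention is an assumption of this sketch, not prescribed by the text:

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Character n-grams of a word, padded with '#' to mark boundaries."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_ngram_table(lexicon, n=2):
    """Precompile the n-gram statistics table from a word list."""
    table = Counter()
    for word in lexicon:
        table.update(char_ngrams(word, n))
    return table

def looks_misspelled(word, table, n=2):
    """Flag the word if any of its n-grams never occurs in the table."""
    return any(table[g] == 0 for g in char_ngrams(word, n))

table = build_ngram_table(["the", "then", "hen", "ten"])
```

A production system would also apply the frequency threshold the text mentions (flagging highly infrequent n-grams), rather than only zero-occurrence ones.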
  • 55. 2.10 Error Correction Techniques Many techniques have been proposed to solve the problem of generating candidates for a detected misspelled word; they vary in their required resources, application scope, time and space complexity, and accuracy. The most common are [Kuk92] [Mis13]: 2.10.1 Minimum Edit Distance Techniques This technique is based on counting the minimum number of primal operations required to convert the source string into the target one. Some researchers take the primal operations to be insertion, deletion, and substitution of one letter by another; others add the transposition of two adjacent letters as a fourth primal operation. Examples include the Levenshtein algorithm, which counts one distance unit for every primal operation; the Hamming algorithm, which works like Levenshtein but is limited to strings of equal lengths; and the Longest Common Substring, which finds the mutual substring between two words. Levenshtein, shown in figure (2.7) [Hal11], is preferred because it has no limitation on the types of symbols or on their lengths. It can be executed in a time complexity of O(M.N), where M and N are the lengths of the two input strings. The algorithm can detect three types of errors (substitution, deletion, and insertion). It does not count the transposition of two adjacent symbols as one edit operation; instead, it counts such errors as two consecutive substitution operations, giving an edit distance of 2.
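The Levenshtein recurrence of Figure (2.7) can be written as a runnable sketch (Python is used here only for illustration):

```python
def levenshtein(s, t):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    m, n = len(s), len(t)
    # dist[i][j] = distance between the prefixes s[:i] and t[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # deleting all of s[:i]
    for j in range(n + 1):
        dist[0][j] = j          # inserting all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]
```

Note how a single adjacent transposition ("abcd" vs "abdc") costs 2 here, exactly the weakness described above.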
  • 56. One of the well-known modifications of the original Levenshtein method is due to Fred Damerau, whose research found that about 80% to 90% of errors are caused by the four error types altogether; the resulting measure is known as the Damerau-Levenshtein Distance. [Dam64] The modified method requires a longer execution time than the original; in every checking round, the method applies an additional comparison to check whether a transposition took place in the string, then applies another comparison to select the minimum between the previous distance and the distance with the occurrence of a transposition operation. This step Figure (2.7) : Levenshtein Edit Distance Algorithm [Hal11] 1. Algorithm: Levenshtein Edit Distance 2. Input: String1, String2 3. Output: Edit Operations Number 4. Step1: Declaration 5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0, cost=0 6. Step2: Calculate Distance 7. if String1 is NULL return Length of String2 8. if String2 is NULL return Length of String1 9. for each symbol x in String1 do 10. for each symbol y in String2 do 11. begin 12. if x = y 13. cost = 0 14. else 15. cost = 1 16. r=index of x, c=index of y 17. min1 = (distance(r - 1, c) + 1) // deletion 18. min2 = (distance(r, c - 1) + 1) //insertion 19. min3 = (distance(r - 1,c - 1) + cost) //substitution 20. distance( r , c )=minimum(min1 ,min2 ,min3) 21. end 22. Step3: return the value of the last cell in the distance matrix 23. return distance(Length of String1,Length of String2) 24. End.
  • 57. roughly doubles the running time, although the asymptotic complexity remains O(M.N). Hence, in this work, the original Levenshtein method (figure (2.7)) is modified to consider Damerau's four error types within a time complexity shorter than that consumed by the Damerau-Levenshtein algorithm and close to that of the original method. Figure (2.8) shows Damerau's modification of the Levenshtein method. 1. Algorithm: Damerau-Levenshtein Distance 2. Input: String1, String2 3. Output: Damerau Edit Operations Number 4. Step1: Declaration 5. distance(length of String1,Length of String2)=0, min1=0, min2=0, min3=0, cost=0 6. Step2: Calculate Distance 7. if String1 is NULL return Length of String2 8. if String2 is NULL return Length of String1 9. for each symbol x in String1 do 10. for each symbol y in String2 do 11. begin 12. if x = y 13. cost = 0 14. else 15. cost = 1 16. r=index of x, c=index of y 17. min1 = (distance(r - 1, c) + 1) // deletion 18. min2 = (distance(r, c - 1) + 1) //insertion 19. min3 = (distance(r - 1,c - 1) + cost) //substitution 20. distance( r , c )=minimum(min1 ,min2 ,min3) 21. if not(String1 starts with x) and not (String2 starts with y) then 22. if (the symbol preceding x= y) and (the symbol preceding y=x) then 23. distance(r,c)=minimum(distance(r,c), distance(r-2,c-2)+cost) 24. end 25. Step3: return the value of the last cell in the distance matrix 26. return distance(Length of String1,Length of String2) 27. End. Figure (2.8) : Damerau-Levenshtein Edit Distance Algorithm [Dam64]
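One common way to realize the recurrence of Figure (2.8) in runnable form is the optimal-string-alignment variant below, which charges a single edit for an adjacent transposition; this is a generic sketch, not the thesis's own enhanced method:

```python
def osa_distance(s, t):
    """Levenshtein distance extended so that an adjacent transposition
    counts as one edit (the extra check of Figure (2.8), lines 21-23)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition: current symbols are each other's neighbours.
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]
```

Compared with plain Levenshtein, "abcd" vs "abdc" now costs 1 instead of 2, while all other cases are unchanged.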
  • 58. 2.10.2 Similarity Key Techniques As its name suggests, this technique finds a unique key to group similarly spelled words together. The similarity key is computed for the misspelled word and mapped to a pointer referring to the group of words that are similar in spelling to the input one. The Soundex algorithm derives keys from the pronunciation of words, while the SPEEDCOP system rearranges the letters of a word by placing its first letter, followed by its consonants, and finally its vowels, according to their order of occurrence in the word and without duplication. [Kuk92] [Mis13] 2.10.3 Rule Based Techniques This approach applies a set of rules to the misspelled word, based on common mistake patterns, to transform the word into a valid one. After applying all the applicable rules, the set of generated words that are valid in the dictionary is suggested as candidates. 2.10.4 Probabilistic Techniques Two methods are mainly based on statistics and probability: 1) Transition Method: depends on the probability of a given letter being followed by another one. The probability is estimated from n-gram statistics over a large corpus. 2) Confusion Method: depends on the probability of a given letter being confused with or mistaken for another one. Probabilities in this method are source dependent; for example, Optical Character Recognition (OCR) systems vary in their accuracy and in the basis on which they recognize letters, and Speech Recognition (SR) systems usually confuse sounds.
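As a concrete similarity-key scheme, a simplified Soundex can be sketched as below; it follows the commonly described digit mapping and the usual h/w rule, and ignores edge cases such as empty or non-alphabetic input:

```python
# Standard Soundex digit classes for consonants; vowels, y, h, w map to nothing.
SOUNDEX_CODES = {**dict.fromkeys("bfpv", "1"),
                 **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    """Simplified Soundex key: first letter plus up to three digit codes."""
    word = word.lower()
    key = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:   # skip repeats of the same code
            key += code
        if ch not in "hw":          # h and w do not break a run of codes
            prev = code
    return (key + "000")[:4]        # pad/truncate to a 4-character key
```

Words that sound alike hash to the same bucket, e.g. `soundex("Robert")` and `soundex("Rupert")` both yield `"R163"`, which is exactly the grouping property the similarity-key lookup exploits.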
  • 59. 2.11 Suggestion of Corrections Suggesting corrections may be merged with candidate generation; it is fully dependent on the output of the generation phase. The user is usually provided with a set of corrections, and then he/she can make a choice among them, keep the written word unchanged, add the token to the dictionary, or rewrite the word in cases where the desired word is not within the corrections list. Suggestions are listed in non-increasing order according to their similarity and suitability for replacing the source word. Similarity depends on the method of computing the distance or similarity between every candidate and the source token, while suitability depends on the surrounding words within the sentence boundary or the paragraph (in context sensitive correction, the full text may be examined before making a suggestion). 2.12 The Suggested Approach The primary goal of this work is to find the nearest alternative word among all the available candidates in the underlying dictionary; when a non-word is encountered, there are many candidates available to replace it, but here is the trick: which one of those alternatives was intended by the writer? The suggested work answers this question as follows: Any of the dictionary tokens, whose count may reach some hundreds of thousands, could be the one intended by the writer, or none of them might be. The writer (or typist) might really have misspelled the word, or he/she may have written it perfectly, but the word is simply not found in the dictionary, i.e. never seen before, and is thus an "unknown" token. The problem of deciding whether a word is misspelled or unknown is impossible to solve. For this reason, the suggested system assumes every
  • 60. unrecognized word is misspelled and may let the user make the final decision. As an initial solution, all the tokens in the dictionary are candidates, and in further processing the number of candidates must be minimized. 2.12.1 Find Candidates Using Minimum Edit Distance The starting step is to look for the most similar tokens in the lexicon dictionary and rank them according to their minimum edit distance from the misspelled word. This reduces the number of candidates to an acceptable amount, depending on a threshold for the number of edit operations needed to equate a candidate with the misspelled word, or on a maximum limit for the number of candidates. The suggested system uses the Levenshtein method after enhancing it to consider the four Damerau edit operations. To find the similar tokens, the lexicon must be searched and every token in it examined against the given word. This process consumes time because of the huge number of tokens held by the lexicon dictionary and the time required by the examining algorithm itself to find the minimum edit distance. Hence, the search space needs to shrink; a method is proposed to group similar tokens into semi-clusters using spelling properties. 2.12.2 Candidates Mining The best set of candidates goes through another processing step to specify how the generated candidates are related to the misspelled token and, accordingly, how they should be ranked. The process is implemented using a vector of the following features:  Named-entity recognition: many issues are considered.  Transposition probability: keyboard proximity and physical similarity.
 Confusion probability: because phonetic errors are common, this analysis helps determine whether a word was misspelled by replacing letter(s) with others of the same sound.
 Matching of the starting and ending letters.
 Effect of candidate length.
A weighting scheme gives each feature a role in deciding the best set of suggestions; the similarity score carries the largest weight among them.

2.12.3 Part-of-Speech Tagging and Parsing
Finally, the suitable candidate is chosen by the parser, which selects the candidate(s) that make the sentence containing the misspelled word correct. Tagging plays an important role in specifying the optimal candidate, because filtering by POS tag is the basis on which the parser selects a candidate for its incomplete sentence. The selected tag affects not only the candidate but every token in the sentence; this is the nature of English (and of most natural languages). At this step the candidate set should contain the minimum number of elements, but the best ones. Grammar checking, accomplished by parsing, is another goal of this system: the system phrases each sentence and checks each phrase's consistency against English grammar rules. When an incorrect structure is encountered, the system tries to correct it. Parsing is fundamental to choosing the correct candidate, since the basic goal is to produce a correct sentence. The underlying dictionary is an integration of the WordNet dictionary with the ISPELL dictionary.
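The enhanced Levenshtein distance of Section 2.12.1, extended with Damerau's adjacent transposition, can be sketched as below. This is a minimal illustration of the technique, not the thesis's exact implementation:

```python
def damerau_levenshtein(s, t):
    """Edit distance counting Damerau's four operations:
    insertion, deletion, substitution, and adjacent transposition."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

With this extension the common typo "teh" is one operation away from "the" (a single transposition), whereas plain Levenshtein would charge two substitutions.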
Figure 2.9 shows the block diagram of the suggested work; later chapters give more details for each block.

Figure (2.9): The suggested system block diagram(1) [Block diagram: preprocessing of the WordNet lexical dictionary (morphological analysis and POS tag expansion) and the ISPELL datasets; dictionaries integration; hashing and indexing into an integrated hashed indexed dictionary; POS tagging of the token stream into sentences with tagged tokens; candidates generation and ranking; phrasing and phrase-level suggestions; grammar correction; sentence recovery and suggestions listing.]

(1) The diagram in Figure 2.9 is detailed further through the next three chapters.
Chapter Three: Hashed Dictionary and Looking Up Technique

3.1 Introduction
The dictionary is a basic unit of almost every NLP application. It holds the lexicon of the language under processing, together with related information that depends on the application's purpose, such as POS tags, semantic information, phonetics, and pronunciation. Typically, dictionaries are data structures holding a list or collection of tokens (or words), each associated with the information that makes its use by an NLP application possible. The number of tokens held by a dictionary is critical in NLP applications, especially taggers and text correction systems: a small dictionary lowers the detected-error ratio, since a poor dictionary allows erroneous words to pass undetected, while a large dictionary raises this ratio but requires longer look-up time. A balance is therefore needed that keeps the dictionary as inclusive as possible while keeping look-up fast. Many approaches have been proposed to handle this problem, among them indexing and hash functions.

3.2 Hashing
The ideal property of any dictionary is random access, but the high variability of strings makes this impossible in general, at least under memory constraints.
Hashing is the process of converting a string S into an integer within [0, M-1], where M is the number of available addresses in a predefined table. Hash functions promise random access, but not on their own: the variety of language tokens would require an infinite hash table to hold every token "separately" and a variable-size addressing buffer, which exceeds most current systems and wastes storage heavily. By "separately" we mean that no two strings share the same hash value, i.e. no collisions; as the number of collisions grows, look-up inside buckets takes longer. However, a hash function can serve as a partial solution combined with other approaches: while it maps tokens, according to some of their features, into packets of manageable size, techniques such as indexing and advanced search can raise look-up speed to a reasonable level.

3.2.1 Hash Function
The hash function in this work exploits the spelling of tokens as the addressing key: it converts token prefixes into packet addresses. The English alphabet considered in this work contains the uppercase letters 'A' to 'Z', the lowercase letters 'a' to 'z', and the digits 0 to 9, in addition to some special-purpose characters that cannot be excluded from the dictionary because they are parts of tokens, such as slash (/), period (.), apostrophe ('), underscore (_), whitespace, and hyphen (-). The resulting character set contains about 67 characters, which can be reduced further by replacing the codes of the digits 1 to 9 with
the code of 0, because distinguishing between digits is unimportant in this application for two reasons:
 Differences between numbers are not a problem for the correction process, since no system can ever estimate which number the writer intended; any written number is therefore accepted as-is.
 If numbers were distinguished, every possible number would have to be covered in the dictionary, yielding an infinite dictionary size, because numbers are infinite.
The final alphabet is the union of the sets mentioned above with the reduced number set: ∑ = {A, B, …, Z, a, b, …, z, 0, /, . , ' , - , _ , whitespace}, which can be re-encoded using only 6 bits, as shown in Table 3.1 (unused codes are marked *).
Hashing by prefix is a good way to keep packets small. It resembles the SOUNDEX and SPEEDCOP methods [Mis13][Kuk92] in sharing the same goal, minimizing the search space, but differs in that it maps tokens to predefined packet addresses using a limited-length prefix, whereas those methods use the whole string and filter letters by sound or spelling. This difference gives the suggested approach two attractive features:
1. The hash function is simple and applies directly, without any preprocessing: SOUNDEX must encode letters into their phonetic groups, and SPEEDCOP rearranges letters.
Table 3.1: Alphabet Encoding (each symbol in 6 bits)
  A – Z          codes 0 – 25  (A=0, B=1, …, Z=25)
  a – z          codes 26 – 51 (a=26, b=27, …, z=51)
  '              code 52
  /              code 53
  -              code 54
  _              code 55
  .              code 56
  0 (all digits) code 57
  whitespace     code 58
  * (unused)     codes 59 – 63
2. Random access is established by using the hash function's output directly as an address, while both previous methods must search for a match between the computed value and the stored codes.

3.2.2 Formulation
As mentioned above, the alphabet is reduced to only 59 symbols, which can be encoded in only 6 bits instead of the standard 8, making a family of hash functions available over prefixes of 1, 2, or any longer sequence of symbols. The prefix length is a trade-off: if it is too small, the number of packets is also small, so each holds a large number of tokens and look-up takes longer; long prefixes create a large number of packets, some of which are usually sparse because of the variance and irregularity of tokens that characterize natural languages.
The function uses a three-character prefix C1C2C3, converts it to integers as presented in Table (3.1), and computes the hash value H by concatenating the three 6-bit codes, according to Equation (3.1):

    H(C1,C2,C3) = code(C1)·2^12 + code(C2)·2^6 + code(C3)        (3.1)

H represents the address of the packet where tokens starting with the same prefix are held. The number of available packet addresses equals the range obtained by concatenating the three symbols' binary codes, as shown in Table (3.2), where the symbol at index 0 is 'A' and the symbol at index 63 (the last available index in the alphabet) is the unused cell marked '*'.
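Equation (3.1) amounts to concatenating three 6-bit symbol codes into one 18-bit address. A minimal sketch, with the encoding following Table 3.1 (the helper names are ours, not the thesis's):

```python
def symbol_code(c):
    """6-bit code per Table 3.1: A-Z -> 0..25, a-z -> 26..51,
    specials -> 52..56 and 58, every digit collapsed to 57 (the code of '0')."""
    if 'A' <= c <= 'Z':
        return ord(c) - ord('A')
    if 'a' <= c <= 'z':
        return ord(c) - ord('a') + 26
    specials = {"'": 52, '/': 53, '-': 54, '_': 55, '.': 56, ' ': 58}
    if c in specials:
        return specials[c]
    if c.isdigit():
        return 57
    raise ValueError("symbol outside the dictionary alphabet: %r" % c)

def hash_prefix(token):
    # Equation (3.1): H = code(C1)*2**12 + code(C2)*2**6 + code(C3)
    c1, c2, c3 = token[0], token[1], token[2]
    return (symbol_code(c1) << 12) | (symbol_code(c2) << 6) | symbol_code(c3)
```

For example hash_prefix("AAA") is 0, and every address fits below 2^18 = 262144, matching the range of Table (3.2). Note that prefixes differing only in digits collapse to the same address, as intended.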
Start Address = (C1)2 || (C2)2 || (C3)2 = (000000 000000 000000)2 = (0)10
End Address   = (C1)2 || (C2)2 || (C3)2 = (111111 111111 111111)2 = (262143)10

This makes the total number of packets 2^18 = 262144. Some of these packets are empty because their addresses match no actual token prefix in the lexicon, but the distribution of tokens among packets reduces the search space to a manageable size, especially when the hash function is combined with an indexing scheme that builds the dictionary as a two-level structure.

Table 3.2: Addressing Range
        Start (alphabetic / decimal / binary)    End (alphabetic / decimal / binary)
  C1    A / 0 / 000000                           * / 63 / 111111
  C2    A / 0 / 000000                           * / 63 / 111111
  C3    A / 0 / 000000                           * / 63 / 111111

3.2.3 Indexing
Key-indexing is an in-memory look-up technique based strictly on direct addressing into an array, with no comparisons between keys. Its area of applicability is limited to numeric keys falling in a range bounded by the available memory resources. Hashing lets direct addressing work on keys of any type and range, bringing serial search and collision-resolution policies into the equation.
Indexing is exploited to create a reference table holding the 2^18 packet-head addresses, which can be addressed directly by the hash function. Every record in the reference table contains two fields: a "base" field, which holds an address if its index matches a token prefix and -1 otherwise, and a "limit" field, which holds the length of the primary packet related to its index. Looking up the packet containing tokens that start with a specific prefix is shown in Figure (3.1).
The packets referenced by the table are treated as primary packets, holding tokens whose 3-symbol prefixes are identical. To reduce the search space further, sub-packets can be created for every primary packet. This second level of token distribution is also prefix-based, but with longer sequences: instead of using only three symbols to group tokens with identical prefixes, prefix equality is extended to 6 symbols by subdividing the tokens inside primary packets into secondary packets, each consisting of a head and a set of tokens identical to the head in their first 6 symbols.

Figure (3.1): Token Hashing Algorithm
Algorithm: Token Hashing
Input: English token (finite string over ∑), reference and hash tables.
Output: packet head address where the input token may reside.
Step 1: set variables C1, C2, and C3 to the input token prefix.
Step 2: compute Index from C1, C2, and C3:
        Index = code(C1)·2^12 + code(C2)·2^6 + code(C3)
Step 3: go to the reference table record at Index.
Step 4: examine the Base field:
        if Base > -1 return Base value, else return fail.
End.
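The reference-table probe of Figure (3.1) can be sketched as follows; the table layout here (a mapping from hash index to a (base, limit) pair, with base = -1 marking an empty record) is our assumption for illustration:

```python
def find_primary_packet(index, reference_table):
    """Return (head address, length) of the primary packet at a hash index,
    or None when no token in the lexicon starts with that prefix."""
    base, limit = reference_table.get(index, (-1, 0))
    if base > -1:
        return base, limit
    return None  # empty record: missing prefix detected after one comparison
```

A single numeric comparison suffices to reject a prefix that is absent from the lexicon, which is the fast-failure property claimed in Section 3.4.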
The structure of the dictionary can be clarified by hashing the exemplar token ABCDEFGH according to the approach described previously.

Figure (3.2): Dictionary Structure and Indexing Scheme [C1=A, C2=B, C3=C; Reference Index = H(C1,C2,C3); the reference record at that index gives the head address X and length Y; the primary packet, with head code "ABC", holds tokens ABCS0$, ABCS1$, …, ABCSY-1$; for Si = "DEF", the secondary packet holds ABCDEFT0, ABCDEFT1, …, ABCDEFTR-1.](1)

(1) The dollar sign ($) refers to any sequence that may follow Si.
An interesting characteristic of secondary packets is that no space is wasted, because they are not based on a predefined packet structure. The secondary head, which is itself a token within a primary packet, may be followed by tokens sharing its 6-symbol prefix, collected in one variable-size secondary packet; if no such tokens follow, no secondary packet is needed.

3.3 Looking Up Procedure
As shown in Figure (3.2), the search for a target token starts once the primary packet head address is obtained from the reference table, which in turn is computed by the hash function. In the hash table, where tokens are stored by index, the search begins with a random access to the primary packet head, and matching then proceeds sequentially. Matching considers the fourth through sixth symbols of every token in that primary packet; this reduces comparison time, since matching whole sequences takes longer. Even though the saving per comparison is small, it is useful here, because logical operations on strings are costlier than on other data types. When a full prefix match is found, the target token is compared completely with the token at that record: if they match, the goal is reached; otherwise, searching continues in the secondary packet related to that token (if one exists). Comparison inside secondary packets, unlike primary packets, uses the full token length, and failure there implies there is no chance of finding the target in the dictionary. The algorithm in Figure (3.3) outlines the look-up procedure after the primary head address is obtained.
  • 73. Chapter Three  Dictionary Structure and Looking up Technique ________________________________________________________________________  57  Figure (3.3) : Algorithm of Looking Up Procedure Algorithm: Looking up a target token Input: Target Token, Primary Packet Head address, Primary Packet Size. Output: tag of input target token. Step1: Set primary packet information X=head address, Y=packet size. Step2: Examine X: if X<0 then return fail for primary_index=X to X+Y do if prefix(token at Primary_index in Hash Table)=prefix(target) begin if Current token = target return primary_index X2=Secondary packet head address Y2=Secondary Packet Length exit for end Step3: Examine X2 if X2<=0 return fail // no related secondary packet for secondary_index=X2 to X2+Y2 do if token at secondary_index in hash table=target return secondary_index Step4: if no match was found at step3 return fail End.
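The two-level search of Figure (3.3) can be sketched over plain Python lists; the packet representation used here (a primary packet as a list of (token, secondary_packet) pairs) is our simplification of the indexed hash table:

```python
def look_up(target, primary_packet):
    """Two-level look-up: compare 6-symbol prefixes in the primary packet,
    then full tokens in the matching secondary packet.

    primary_packet: list of (token, secondary_packet) pairs whose tokens all
    share the target's 3-symbol prefix; each secondary_packet lists tokens
    sharing the full 6-symbol prefix of their head token."""
    for token, secondary in primary_packet:
        # only symbols 4..6 are compared: the 3-symbol prefix already matches
        if token[3:6] == target[3:6]:
            if token == target:
                return token
            for candidate in secondary:   # full-length comparison here
                if candidate == target:
                    return candidate
    return None  # failure: the target is not in the dictionary
```

For a toy packet [("abandon", ["abandoned", "abandons"]), ("abacus", [])], looking up "abandons" matches the head "abandon" on symbols 4-6 and then finds the target in its secondary packet.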
3.4 Dictionary Structure Properties
The proposed dictionary can be used in any application that depends on string look-up. It provides fast, directed search for perfect matching.
 The reference table, although some addresses are wasted because of string variance, suits natural-language dictionaries, which are usually of huge size. Tokens are handled in a separate table constructed on top of the reference table.
 String comparison consumes longer time than other types do; in this approach, comparison is reduced to subsequences of both the target and the stored tokens.
 The look-up procedure quickly discovers whether a target token exists, in several situations:
  o At the hashing step, an empty record implies a missing token after consuming only one numeric comparison.
  o In a primary packet, failure requires comparing at most the three symbols from the fourth through the sixth of the 6-symbol prefixes of the tokens within the packet.
  o In a secondary packet, failure requires comparing the tokens within that packet. The worst case is failing to find the target at the end of the secondary packet related to the last token of the primary packet, which consumes (length of primary packet + length of secondary packet) comparisons.
 Since look-up is string-dependent, there is high flexibility in associating information with tokens without any overloading of the search
process. As a result, it can be used to construct lexical and semantic dictionaries.

3.5 Similarity Based Looking-Up
The structure described in Section (3.2) is suitable for perfect look-up, but the purpose of this work is to design a text correction system, where some errors arise from unknown or misspelled words. Such situations require looking up the dictionary to generate candidates that are similar (not identical) to the given misspelled token. The main purpose of any similarity-based grouping approach is to reduce the search space to a manageable size and so shorten look-up time, while not losing good candidates or similar objects (tokens). Clustering techniques are examples of such approaches, but even fuzzy clustering does not solve this problem completely, because:
 Token clustering must consider the order in which symbols are arranged in the token, in addition to the symbols themselves.
 Although many similarity measures exist for grouping tokens, no obvious separation measure can be used to separate string clusters.
 In fuzzy clustering, the decision threshold is a bottleneck: a high threshold value loses good candidates, while a low threshold heightens redundancy by grouping less similar tokens into the cluster, resulting in longer searching time and inaccurate candidates.
 As the number of fuzzy centroids to which a token belongs becomes larger, computing the nearest set of centroids also increases search complexity.
For these reasons, an approach is proposed that keeps the same hash table as the dictionary structure while improving the look-up technique; the algorithm is presented in Figure (3.5). The improvement extends the search to include similarly spelled tokens, depending on the same basis as the standard search described previously. The outlines of the proposed approach are:
 Bi-gram generation
 Primary centroid selection (at most 3 symbols long)
 Connecting centroids to the reference table.
These three steps are presented in Figure (3.4).

3.5.1 Bi-Gram Generation
The reference table is the building block of the bi-gram generation process; it specifies the range of hashing addresses and the number of symbols taken from token prefixes to compute hash values. The hash-indexing method used here is limited to 3 symbols only; therefore, bi-gram generation involves three subdivisions producing two-symbol pairs (bi-grams): (C1,C2), (C1,C3), and (C2,C3). Division into three bi-grams simplifies predicting the four Damerau error types (insertion, deletion, substitution, and transposition) by applying the template C1C2C3 with only two symbols at a time, producing the results shown in Table (3.3).
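The three subdivisions yield nine wildcard templates over the prefix; a small sketch (the function name is ours):

```python
def bigram_variants(prefix):
    """The nine '?'-templates built from the pairs (C1,C2), (C2,C3), (C1,C3)
    of a 3-symbol prefix; '?' marks the free position."""
    c1, c2, c3 = prefix[0], prefix[1], prefix[2]
    return [c1 + c2 + '?', c1 + '?' + c2, '?' + c1 + c2,
            c2 + c3 + '?', c2 + '?' + c3, '?' + c2 + c3,
            c1 + c3 + '?', c1 + '?' + c3, '?' + c1 + c3]
```

For the prefix "Che" this yields "Ch?", "C?h", "?Ch", "he?", "h?e", "?he", "Ce?", "C?e", "?Ce", matching the nine sequences enumerated in Section 3.5.2.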
The variety of natural-language tokens cannot satisfy all nine distributions of the template sequences described above for every index in the reference table; therefore, a preprocessing step collects the satisfied prefixes by checking each generated template against the dictionary and rejecting the missing sequences.

Figure (3.4): Semi Hash Clustering block diagram [Bi-gram generation: select a reference index, recover (C1,C2,C3) = H-1(Index), and generate the bi-gram variants C1C2?, C1?C2, ?C1C2, C2C3?, C2?C3, ?C2C3, C1C3?, C1?C3, ?C1C3. Centroid selection: per bi-gram variant, select a set of 3-symbol centroids and remove redundancy. Centroid referencing: connect the (bi-grams, centroid sets) pairs and associate the bi-grams with the index.]
Table (3.3): Predicting errors using bi-gram analysis
  Sequence   Substitution   Insertion   Deletion      Transposition
  C1C2?      √              √           ×             ×
  C1?C2      ×              √           ×             if ? = C3
  ?C1C2      ×              √           ×             ×
  C2C3?      ×              ×           √             ×
  C2?C3      ×              √           if ? <> C1    if ? = C1
  ?C2C3      √              ×           ×             ×
  C1C3?      ×              ×           if ? <> C2    if ? = C2
  C1?C3      √              ×           ×             ×
  ?C1C3      ×              √           if ? <> C2    if ? = C2

3.5.2 Primary Centroids Selection
For every accepted sequence, a set of centroids is selected as a subset of the union of primary centroids of at most three symbols in length. A centroid related to a specific sequence is an assignment of an alphabet symbol to the '?' sign in that sequence. For example, at index 9882: H-1(9882) = "Che", so C1='C', C2='h', C3='e'. The nine sequences and their related primary centroids, after pruning mismatched sequences, are:
1. Ch?: ChB, ChE, Cha, Che, Chi, Chk, Chl, Chn, Cho, Chr, Cht, Chu, Chw, Chy, Ch', Ch˽, Ch
2. C?h: Cah, Coh, C˽h
3. ?Ch: BCh, DCh
4. he?: hea, heb, hec, hed, hee, hef, heg, heh, hei, hej, hek, hel, hem, hen, heo, hep, her, hes, het, heu, hev, hew, hex, hey, he', he-, he
5. h?e: hae, hee, hie, hoe, hue, hye
6. ?he: Ahe, Che, Ghe, Jhe, Khe, Lhe, Phe, Rhe, She, The, Whe, ahe, bhe, che, dhe, ghe, khe, phe, rhe, she, the, whe
7. Ce?: Cea, Ceb, Cec, Ced, Cee, Cei, Cel, Cen, Cep, Cer, Ces, Cet, Ceu, Cey
8. C?e: Cae, Cce, Cde, Cee, Che, Cie, Cle, Coe, Cre, Cse, Cte, Cue, Cve, Cze
9. ?Ce: BCe, vCe

3.5.3 Centroids Referencing
The final step joins every sequence to its centroid set and every index to its bi-gram sequences. This process includes creating a list of all primary centroids in the dictionary, representing all 3-symbol prefixes of primary packet heads. Bi-grams are stored in a separate list associated with the address of the related primary centroid set, and the reference table keeps track of the bi-gram addresses for each of its indexes. As a result, bi-grams and their associated centroid sets can be randomly accessed through the reference table.
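The pruning used in the example above (assign each alphabet symbol to '?' and reject prefixes absent from the lexicon) can be sketched as follows; the toy prefix set is ours:

```python
def expand_template(template, lexicon_prefixes, alphabet):
    """Expand a '?'-template into its primary centroids: substitute every
    alphabet symbol and keep only prefixes that exist in the lexicon."""
    centroids = set()
    for symbol in alphabet:
        candidate = template.replace('?', symbol)
        if candidate in lexicon_prefixes:   # reject unsatisfied prefixes
            centroids.add(candidate)
    return sorted(centroids)
```

With the toy set {"Cha", "Che", "Cho", "The", "she"}, expanding "Ch?" over the lowercase letters keeps exactly the prefixes that head real primary packets.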
3.6 Application of the Similarity Based Looking Up Approach
The purpose of similarity-based look-up is to minimize the search space while maximizing the chance of finding tokens similar to the source token.

Figure (3.5): Similarity Based Hashing algorithm
Algorithm: Similarity Based Hashing
Input: Hashed Dictionary
Output: Similarity Based Hashed Dictionary
For each reference index apply the following steps:
Step 1: Bi-gram generation
  1) CxCyCz = H-1(Index)
  2) generate sequence variants
  3) filter sequences
Step 2: Primary centroid selection
  for each generated sequence do
    1) for every alphabet symbol do
       1.1) assign it to the sequence's missing symbol
       1.2) reject if no prefix match is found
    2) remove duplicated centroids
Step 3: Centroid referencing
  1) connect bi-grams to centroids
  2) connect the index to its bi-grams
End.
The hashed dictionary structure shown in Section (3.2) was built to achieve perfect matching on token prefixes; if the source token is not found, similar tokens must be looked up. Because look-up in the hashed dictionary is based on token prefixes, similarity-based look-up accounts for all the mistakes that can occur within the 3-symbol prefix of every token by exploiting the bi-grams associated with the computed hash value. Every bi-gram is linked to a list of primary centroids, which in turn are matched against the source token's 3-symbol prefix and filtered by similarity. Centroids with the highest similarity are selected, while lower-similarity centroids are rejected to shorten the searching time. The next step expands the prefix length in the similarity calculation to 6-symbol prefixes, because the selected primary centroids refer to primary packets in which every token differs from the others in its 6-symbol prefix. This step directs the search to be more precise by selecting, from the primary packet, the tokens nearest to the source token. Finally, each selected primary-packet token may have a secondary packet whose tokens share its 6-symbol prefix; searching it maximizes the chance of encountering tokens similar to the source token (a secondary packet usually contains a small number of tokens). An interesting property of this approach is the ability to use thresholds at every level of the look-up procedure: a different threshold can be used in primary centroid selection, in secondary packet head selection, and in candidate selection. The value of the threshold is
application dependent and fundamentally restricted by the similarity calculation method.

Figure (3.6): Block diagram of candidates generation using SBL [(C1,C2,C3) = source 3-symbol prefix; Index = H(C1,C2,C3); examine the 2-gram patterns P1…P9; collect the primary centroids; filter the collected centroids (selecting the highest-similarity centroids); select and filter the secondary centroids; generate the candidates.]
3.7 The Similarity Based Looking Up Properties
The proposed approach has several features that make it suitable for various string-based search applications:
1. Clustering illusion: the dictionary structure and its look-up technique divide the search space into three levels:
   a. Primary centroid clusters: only the 3-symbol prefixes are checked, and the best are selected as centroids for the next level.
   b. Primary packet clusters: every token here is referenced by a primary centroid and may itself reference a secondary packet (i.e., act as a secondary centroid).
   c. Secondary packet clusters: every token is referenced by a secondary centroid.
2. Time complexity minimization: merging the hash function with indexing simplifies searching and provides random access at more than one level.
3. Application flexibility: thresholds can be used at every clustering level as separators to exclude uninteresting centroids or candidates. Choosing the threshold value is left to the developer, the similarity calculation method used, and the application area.
The algorithm in Figure (3.7) outlines the complete process.
Figure (3.7): Similarity Based Looking up algorithm
Algorithm: Similarity Based Looking up
Input:* Hashed Dictionary; Source_Token; similarity thresholds T1, T2, T3
Output: Candidates Set
Step 1: Hash index calculation
  C1,C2,C3 = 3-symbol prefix of the source token
  Index = H(C1,C2,C3) = code(C1)·2^12 + code(C2)·2^6 + code(C3)
Step 2: Primary centroid selection
  for each bi-gram at Index do
    for each related primary centroid do
      if similarity(C1C2C3, Primary Centroid) >= T1 then select the centroid
Step 3: Secondary centroid selection
  for each selected primary centroid do
    for each related secondary centroid do
      if similarity(6-symbol source prefix, Secondary Centroid) >= T2 then select the centroid
Step 4: Candidate selection
  for each selected secondary centroid do
    for each token in the related secondary packet do
      if similarity(Source Token, Token) >= T3 then select it as a candidate
End.

* If no threshold is indicated, the approach generates candidates according to maximum similarity.
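The final threshold filter (Step 4 of Figure (3.7)) can be sketched with difflib's ratio standing in for the similarity measure, which the text leaves open; the packet contents below are toy data:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in similarity in [0, 1]; the thesis does not fix the measure here
    return SequenceMatcher(None, a, b).ratio()

def select_candidates(source, secondary_packets, t3=0.7):
    """Keep packet tokens whose similarity to the source clears threshold T3,
    ranked best-first, as in Step 4 of Figure (3.7)."""
    candidates = [tok for packet in secondary_packets for tok in packet
                  if similarity(source, tok) >= t3]
    return sorted(candidates, key=lambda tok: similarity(source, tok),
                  reverse=True)
```

For the misspelling "recieve", packets containing "receive", "recipe", and "rocket" yield "receive" as the top-ranked candidate, while "rocket" falls below the threshold.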
Chapter Four: Error Detection and Candidates Generation

4.1 Introduction
Error detection is the process of indicating incorrect words in the text. The term "incorrect" may refer to a misspelled word, a misused word, or both; misused words are correctly spelled but used in a way that violates the syntax or the meaning of the sentence. The detection of misspelled words (non-word errors) is a straightforward process: it involves looking up every token in a pre-prepared list or dictionary (also referred to as a "lexicon") containing all the well-spelled words of the language. The size of the lexicon affects the look-up process, because larger sizes require longer time. Detecting misused words (real-word errors), on the other hand, is a more complex task: it requires analyzing the syntax of the sentence to check the correctness of its constituency, and, when the sentence is incorrect, indicating the word(s) that violated it. Errors resulting in meaningless sentences entail further processing, which may extend beyond sentence boundaries and needs more information about the sentence tokens.

4.2 Non-word Error Detection
Detecting misspelled tokens in this system is based on the dictionary look-up technique and is performed within the tagging stage. Tokens of a given text must be tagged: a tag should be found for every token in the considered language; therefore, tokens are collected and stored with their tags in a lexicon. The tagging stage is a fundamental process
in most natural language processing systems; tagging must precede syntax analysis, since no parsing can be done without associating a tag with each token in the sentence.

Figure (4.1): Tagging Flow Chart [Start: read the text and convert it into a token stream. For each token: look it up in the hashed dictionary; if found, save the (token, tag) pair; otherwise generate candidates and save the (token, {candidates, tags} list) pair. When the last token is processed, pass the new tagged stream to the segmentation step. End.]
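The flow chart above amounts to one lexicon probe per token, with misses flagged as misspellings. A minimal sketch, with a toy dict standing in for the hashed dictionary (names and tags are ours):

```python
def tag_and_detect(tokens, lexicon):
    """Tag each token from the lexicon (token -> POS tag); tokens missing
    from the lexicon are flagged as misspelled, as in Figure (4.1)."""
    tagged, misspelled = [], []
    for token in tokens:
        tag = lexicon.get(token)
        if tag is None:
            misspelled.append(token)   # candidate generation happens here
        tagged.append((token, tag))
    return tagged, misspelled
```

Running it over ["the", "cta", "sat"] with a three-entry lexicon flags "cta" for candidate generation while the known tokens keep their tags.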
Because tagging requires looking up every token in the given text, it serves a second task at the same time: tokens missing from the lexicon are flagged as misspelled. The looking-up procedure discussed in Chapter Three is used for discovering non-word errors; the structure of the dictionary is built from about 300,000 tokens collected from two raw datasets. The two major resources of the lexicon are WordNet and ISPELL; WordNet represented the basic resource and was integrated with the ISPELL dataset to make the lexicon more inclusive. The lexicon was hashed and indexed in order to achieve random access. The looking-up time is very short compared to typical structures, and the tagger is capable of deciding whether a token is found or not found in the lexicon even after consuming only one operation (for further details see sections 3.3 and 3.4).

4.3 Real-word Error Detection

Deciding whether a word is misused is more complex than detecting misspelled words; the process needs more computations and more resources. Syntax analysis can be exploited to recognize misused words, since every English sentence (as in most natural languages) is constrained by a syntactic rule or grammar. Any sentence that violates the syntax constraints and cannot be parsed using the finite set of production rules is marked as incorrect. Next, the sentence should be processed to indicate the erroneous word that made it incorrect. Phrasing is a good way to precisely indicate the incorrect word by converting the sentence into constituents. The constituency hierarchy starts from the sentence as the head of the tree, which contains
one or more clauses; each clause contains one or more phrases, and each phrase contains one or more words. The division into phrases is useful in reducing the parse tree: as the number of tokens becomes larger, the available parses for the same sentence increase. The suggested approach is rule based; any sentence that cannot be parsed correctly is marked as incorrect. The syntax analyzer is based on phrasing, applying a brute-force approach to identify the misused word in the phrase. The syntax analyzer fully depends on the output of the tagger; however, misspelled words should first be replaced with suggestions, in order to allow the analyzer to proceed with the sentence and select the best alternative that makes the sentence acceptable (Chapter Five details the idea).

4.4 Candidates Generation

Candidates are the tokens with high similarity to the incorrect word. The meanings of "similarity" and "incorrect" are relative. In the case of non-word errors, the incorrect word is a misspelled word, and the similarity is a measure of how much another token is spelled or pronounced in a way similar to it. In the case of real-word errors, the candidate token is the one most likely intended by the writer but confused with the incorrect one; a spelling or phonetic mistake sometimes results in another correct word.

4.4.1 Candidates Generation for Non-word Errors

In this step, the system takes the incorrect token (a token out of the dictionary) and looks for similar tokens in the underlying dictionary.
Since every token in the dictionary may be intended by the writer, the process is somewhat complex, and several issues should be considered to decide which tokens are suitable to be generated as candidates. A major problem is the distinction between unknown and mistaken words; therefore, this research considers every unknown word a mistaken one and leaves the final decision to the user. Candidates (or alternatives) are generated depending on the mistaken word, and the total process is performed in the following way.

At first sight, any of the dictionary tokens, whose count may reach some hundreds of thousands, could be the intended word, or none of them could be. The writer (or typist) might really have misspelled the word, or might have written it perfectly while the word is simply not found in the dictionary, i.e. never seen before, making it an "unknown" token. The number of generated candidates is not limited; further processing reduces the list of candidates to the best set according to the similarity amount and some other criteria that depend fully on the spelling of the encountered misspelled token.

In the tagging stage, a token not found in the lexicon is considered misspelled. The starting step is to search for the most similar tokens in the lexicon and rank them according to their similarity to the misspelled token; the similarity is based on the minimum edit distance measure. This action reduces the number of candidates to an acceptable amount, using either a threshold on the number of edit operations needed to make a candidate and the misspelled word equal, or a maximum limit on the number of candidates.
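The thresholded generation step described above can be sketched as follows; this is a hedged, simplified sketch assuming a flat toy lexicon and an illustrative distance limit of 2, not the thesis's actual thresholds or clustered dictionary:

```python
# Sketch: generate candidates whose edit distance to the misspelled
# token is within a threshold, most similar first.
def edit_distance(s, t):
    """Standard Levenshtein distance via a rolling-array DP."""
    m, n = len(s), len(t)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i          # prev holds the diagonal cell
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                       # deletion
                       d[j - 1] + 1,                   # insertion
                       prev + (s[i - 1] != t[j - 1]))  # substitution
            prev = cur
    return d[n]

def generate_candidates(token, lexicon, max_dist=2):
    """Return lexicon words within max_dist edits, ranked by distance."""
    scored = [(edit_distance(token, w), w) for w in lexicon]
    return [w for dist, w in sorted(scored) if dist <= max_dist]

print(generate_candidates("cta", ["cat", "car", "dog", "cart"]))
# → ['car', 'cat']
```

In the actual system the search space is first shrunk by the similarity-based clustering of Chapter Three rather than scanning the whole lexicon.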
4.4.1.1 Enhanced Levenshtein Method

The modification of the Levenshtein method extends the standard matching step at line 12 in figure (2.7) to check for a transposition case. The idea rises from the fact that no transposition can occur without a matching success between at least two symbols in the examined strings; more precisely, the transposition can be discovered using a minimum number of operations by considering two facts:
- Two adjacent symbols can never be mirrored by two adjacent symbols in another string unless the first symbol of the first pair matches the second symbol of the second pair.
- Instead of handling the transposition occurrence separately, the algorithm can modify the under-processing cell in the distance matrix directly, and the next matching steps will do the rest of the work.

The first fact serves to avoid trying all possibilities, as was done in Damerau's modification at lines 20 and 21 in figure (2.8), where each symbol is matched against every symbol in the second string, regardless of whether a transposition is possible, by adding matching statements to the original one at line 12 in figure (2.7). The second fact concerns the processing order: the distance matrix is filled sequentially, row by row, from the top-left corner to the bottom-right corner (where the total distance is held). Using one step to process both cases (transposition and no transposition) is a good way to minimize the number of operations required to accurately compute the distance.

In this modification, the distance matrix is updated directly in one step, and the next steps (selecting the minimum and filling the underhand
cell) continue normally as in the original algorithm; this removes the step at line 22 of Damerau's algorithm (figure 2.8), which needs more than one operation to complete. Modifying the Levenshtein method reduces the time and enhances the candidates generation process because the modification exploits the first fact to avoid checking cases that lead to a failure situation, unlike the Damerau-Levenshtein modification, which makes no distinction between the two situations; this is presented in lines 15 and 16. The directly updated distance matrix (line 17) in the enhanced algorithm adjusts the distance accurately without any additional processing; it is simply an assignment.

The time complexity is related to the distance between the input strings. As the strings become more different, the steps at lines 15, 16 and 17 in the enhanced algorithm (figure 4.2) are rarely executed, saving time; this property is preferred when the algorithm is used for generating candidates. Candidates should be as similar as possible to the source token (usually a mistaken word), and the conditional nature of the additional steps (lines 15, 16 and 17) makes the consumed time useful (not wasted): those steps are executed only when there is a match with the source token, and they are executed more often as the source word matches the target word more closely, which means the target is a good candidate.
The algorithm in figure (4.2) shows the enhancement of the original Levenshtein method, and the rest of this section describes the differences between the three methods (original Levenshtein, Damerau-Levenshtein, and the enhanced Levenshtein method) through manipulating two example strings, "Transposed" and "Tarnspaesd":

Figure (4.2): The Enhanced Levenshtein Method Algorithm
1. Algorithm: Enhanced Levenshtein Distance
2. Input: String1, String2
3. Output: Damerau Edit Operations Number
4. Step1: Declaration
5. distance(Length of String1, Length of String2)=0, min1=0, min2=0, min3=0, cost=0
6. Step2: Calculate Distance
7. if String1 is NULL return Length of String2
8. if String2 is NULL return Length of String1
9. for each symbol x in String1 do
10. for each symbol y in String2 do
11. begin
12. if x = y
13. begin
14. cost = 0
15. if x is not the start symbol of String1 then
16. if (the symbol preceding x = the symbol following y) and (x is not duplicated) then
17. decrease distance(index(x)-1, index(y)) by 1 // transposed
18. end
19. else cost = 1
20. r = index of x, c = index of y
21. min1 = (distance(r - 1, c) + 1) // deletion
22. min2 = (distance(r, c - 1) + 1) // insertion
23. min3 = (distance(r - 1, c - 1) + cost) // substitution
24. distance(r, c) = minimum(min1, min2, min3)
25. end
26. Step3: return the value of the last cell in the distance matrix
27. return distance(Length of String1, Length of String2)
28. End.
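The behaviour the enhanced algorithm targets, counting a mirrored adjacent pair as a single operation, corresponds to the restricted Damerau-Levenshtein (optimal string alignment) distance. The following is an illustrative reimplementation, not the thesis code, reproducing the distances of the worked example (5 for plain Levenshtein, 3 with transpositions, for "Transposed" vs. "Tarnspaesd"):

```python
def levenshtein(s, t):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def osa_distance(s, t):
    """Optimal string alignment: Levenshtein plus adjacent
    transpositions, each counted as one operation."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # mirrored adjacent symbols: charge a single transposition
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(levenshtein("Transposed", "Tarnspaesd"))   # → 5
print(osa_distance("Transposed", "Tarnspaesd"))  # → 3
```

The enhanced method of figure (4.2) reaches the same distances while avoiding the transposition check whenever the current symbols do not match.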
1) Levenshtein
The minimum edit distance = 5:
1. substitute 'r' by 'a'
2. substitute 'a' by 'r'
3. substitute 'o' by 'a'
4. substitute 'e' by 's'
5. substitute 's' by 'e'
Computation complexity: M*N comparisons = 100; (cost, min1, min2, min3) assignments: 4*100 = 400; 100 minimum function calls.

Distance matrix (rows: "Tarnspaesd", columns: "Transposed"):
  T r a n s p o s e d
0 1 2 3 4 5 6 7 8 9 10
T 1 0 1 2 3 4 5 6 7 8 9
a 2 1 1 1 2 3 4 5 6 7 8
r 3 2 1 2 2 3 4 5 6 7 8
n 4 3 2 2 2 3 4 5 6 7 8
s 5 4 3 3 3 2 3 4 4 5 6
p 6 5 4 4 4 3 2 3 4 5 6
a 7 6 5 4 5 4 3 3 4 5 6
e 8 7 6 5 5 5 4 4 4 4 5
s 9 8 7 6 6 5 5 5 4 5 5
d 10 9 8 7 7 6 6 6 5 5 5
Figure (4.3): Original Levenshtein Example

2) Damerau-Levenshtein
Minimum edit distance = 3:
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following operations are executed: 100 comparisons (line 21), 81 comparisons (line 22), 2 calls of the minimum function (line 23).

  T r a n s p o s e d
0 1 2 3 4 5 6 7 8 9 10
T 1 0 1 2 3 4 5 6 7 8 9
a 2 1 1 1 2 3 4 5 6 7 8
r 3 2 1 1 2 3 4 5 6 7 8
n 4 3 2 2 1 2 3 4 5 6 7
s 5 4 3 3 2 1 2 3 3 4 5
p 6 5 4 4 3 2 1 2 3 4 5
a 7 6 5 4 4 3 2 2 3 4 5
e 8 7 6 5 5 4 3 3 3 3 4
s 9 8 7 6 6 4 4 4 3 3 4
d 10 9 8 7 7 5 5 5 4 4 3
Figure (4.4): Damerau-Levenshtein Example
3) Enhanced Levenshtein
Minimum edit distance = 3:
1. transpose ('a', 'r')
2. substitute 'a' by 'o'
3. transpose ('e', 's')
In addition to the complexity of the original Levenshtein, the following operations are executed: 12 comparisons (line 15), 7 comparisons (line 16), 2 assignments (line 17).

  T r a n s p o s e d
0 1 2 3 4 5 6 7 8 9 10
T 1 0 1 2 3 4 5 6 7 8 9
a 2 1 0 1 2 3 4 5 6 7 8
r 3 2 0 1 2 3 4 5 6 7 8
n 4 3 1 1 1 2 3 4 5 6 7
s 5 4 2 2 2 1 2 3 3 4 5
p 6 5 3 3 3 2 2 3 4 5 6
a 7 6 4 3 4 3 3 3 4 5 6
e 8 7 5 4 4 4 4 4 2 3 4
s 9 8 6 5 5 4 5 5 2 3 4
d 10 9 7 6 6 5 5 6 3 3 3
Figure (4.5): Enhanced Levenshtein Example

4.4.1.2 Similarity Measure

Minimum edit distance methods count the number of edit operations required to convert one string to another, but they do not show how similar the two strings are. For example, the distance between "a" and "b" is 1, but the similarity is 0; the distance between "Similar" and "Similer" is also 1, but the similarity is 6/7. String lengths should therefore be taken into account when the edit distance is used as a similarity measure. The absolute length difference between two strings adds to the total of mismatched symbols, since it represents the number of symbols deleted from the shorter string; hence the measure must be normalized by the maximum of the two lengths. The relative distance is computed by:

R_Dist(St1, St2) = distance(St1, St2) / max(length(St1), length(St2)) … (4.1)
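The relative-distance formula (4.1) and its similarity complement can be sketched as follows; the `distance` argument stands for any precomputed edit distance (e.g. from the enhanced Levenshtein method):

```python
# Equations (4.1) and (4.2) as plain functions.
def relative_distance(s1, s2, distance):
    """R_Dist(St1, St2) = distance / max(|St1|, |St2|)   ... (4.1)"""
    return distance / max(len(s1), len(s2))

def similarity(s1, s2, distance):
    """Similarity(St1, St2) = 1 - R_Dist(St1, St2)   ... (4.2)"""
    return 1.0 - relative_distance(s1, s2, distance)

print(similarity("a", "b", 1))              # → 0.0
print(similarity("Similar", "Similer", 1))  # 6/7, ≈ 0.857
```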
The relative distance is a value within the interval [0, 1]: completely different strings have a relative distance of 1, and as its value decreases, the difference also decreases, reaching 0 when the two strings are identical. Since similarity and difference are complements of each other, the similarity can be computed by:

Similarity(St1, St2) = 1 - R_Dist(St1, St2) … (4.2)

The latter is the measure of similarity used in the candidates generation for this work.

4.4.1.3 Looking for Candidates

To find the similar tokens, the dictionary should be looked up and every token in it examined against the source word. This process consumes time because of the huge number of tokens held by the lexicon dictionary and the time required by the examining algorithm itself to find the minimum edit distance and compute the similarity to the source token. Hence, the search space needs to shrink; the similarity based looking-up method shown in Chapter Three is used to group similar tokens into clusters using local properties, i.e. the clustering process groups similar tokens depending on token spelling only. The input of the algorithm in figure (3.7) is the misspelled token. The usage of the thresholds depends on the generating ability, i.e. on how similar the generated candidates are to the source token. If they are highly similar, the top set is selected; but if there is difficulty in discovering reasonable candidates, using the thresholds may be a good solution. As the misspelled token becomes more confused, the set of examined centroids becomes larger; therefore, a filtering factor must be used to reduce the search space.
At least one generated primary centroid should be similar to the 3-symbol prefix of the source token by an amount of 2/3, which allows at most one mistake in the prefix. This restriction is not randomly selected; experiments revealed that misspellings are usually single-error, with a ratio between 70% and 95% depending on the text source, and mistakes rarely happen in the first three letters. According to [Pol84], 7.8% of errors occur in the first letter, 11.7% in the second letter and 19.2% in the third letter, each percentage being independent of the others.

After collecting the most similar set of primary centroids, the next step is to examine the secondary centroids of every selected primary centroid. The selection again depends on the similarity, now to the 6-symbol prefix of the source token, since the secondary centroids are at most 6 symbols long. The second threshold constrains the error value to at most two mistakes, i.e. 2/6 or less; but in some situations the best centroid must be selected from every secondary centroids set (from every selected primary cluster), because looking for candidates at this stage is limited to the first six symbols of the tokens, while longer tokens may contain more than two mistakes in their prefix. In other words, for every selected primary centroid, the nearest secondary centroids are selected, and the threshold serves as a limit that avoids selecting less similar centroids when centroids with higher similarity exist. Finally, for every selected secondary centroid, the candidates are generated from the secondary packets related to a centroid with a reasonable similarity to the source token.
Then, the decision of selecting a token as a candidate becomes easier, because the comparison is applied to the total lengths of both the source token and the dictionary tokens.
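The two-stage prefix filtering described above can be sketched as follows; this is a simplified sketch that applies the thesis's 3-symbol and 6-symbol prefix thresholds directly to a flat word list instead of the clustered centroid structure of Chapter Three, and the word list is hypothetical:

```python
# Sketch: prefix-based filtering with the thesis's two thresholds
# (at most 1 mistake in the first 3 symbols, at most 2 in the first 6).
def prefix_mismatches(a, b, k):
    """Count mismatching positions in the k-symbol prefixes."""
    return sum(x != y for x, y in zip(a[:k], b[:k]))

def filter_by_prefix(token, lexicon):
    """Keep words passing the 3-symbol stage, then the 6-symbol stage."""
    stage1 = [w for w in lexicon if prefix_mismatches(token, w, 3) <= 1]
    return [w for w in stage1 if prefix_mismatches(token, w, 6) <= 2]

print(filter_by_prefix("recieve", ["receive", "deceive", "remove", "banana"]))
# → ['receive']
```

The surviving words would then be scored with the similarity measure of section 4.4.1.2 and ranked.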
Ranking the candidates is a subroutine of the optimization stage; it uses more information than the similarity measure alone.

4.4.2 Candidates Generation for Real-word Errors

In this work, the generation of candidates is rule based. It can be divided into two types according to the step in which it is applied:
 Before suggesting optimal candidates for misspelled words: This type of generation is applied to sentences that do not contain misspelled words. The decision is made after phrasing the sentence into constituents and manipulating each phrase alone. The word that violates the rule of constructing the given sentence from the grammar or syntactic rules is detected and replaced with a set of other forms, any of which can make the sentence syntactically accepted. Grammar correction techniques are multiple and various; two techniques are used in this step to solve a part of the syntax errors: verb tense correction and subject-verb agreement.
 After suggesting optimal candidates for misspelled words: After ranking the candidates, this step allows the correction system to more precisely select the candidate that best fits into the sentence to make it correct, or at least does not violate its correctness. Selecting the best candidates after ranking is an additional filter for generating the best suggestions set.
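As a deliberately tiny illustration of the subject-verb agreement idea mentioned above (the thesis's actual rules operate on tagged phrases; the pronoun set and the simple "+s" inflection here are hypothetical simplifications):

```python
# Toy sketch of present-tense subject-verb agreement correction.
THIRD_SINGULAR = {"he", "she", "it"}  # simplified subject set

def agree(subject, verb_base):
    """Return the verb form agreeing with the subject (present tense)."""
    if subject.lower() in THIRD_SINGULAR:
        return verb_base + "s"   # naive inflection; ignores irregular verbs
    return verb_base

print(agree("He", "walk"))    # → walks
print(agree("They", "walk"))  # → walk
```

A real implementation would consult the token tags produced by the tagger and handle irregular verb morphology.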
Chapter Five
Automatic Text Correction and Candidates Suggestion

5.1 Introduction

Text correction is the process of substituting the incorrect word(s) with correct word(s) that were selected as candidates and filtered to be the most suitable among many alternatives. Automating text correction is a complex task because of its direct association with human nature; a written word can never be absolutely predicted, even with perfect decision-making parameters that help a computer choose the perfect suggestion, since artificial intelligence has not yet reached human capabilities. However, there is always an alternative solution: optimizing the candidates. Many existing techniques can help in making the decision and providing the user with a set of highly expected alternatives for a given incorrect word. This work, as we will see in the next sections, exploits many features that are related in the first order to the incorrect word and its candidates themselves rather than to context. The automatic correction does not rely on meaning; it suggests candidates depending on the output of the previous stages (tokenization, tagging, and similarity based candidates generation) after applying multi-feature ranking and syntax analysis.

5.2 Correction and Candidates Suggestion Structure

Figure (5.1) shows the ranking process applied to the generated candidates.
For every incorrect word, there is a set of candidates generated by the candidates generator at the tagging stage. A set of features was predefined for ranking candidates according to similarity and error type relevance. The features set includes the similarity value between the generated candidate and the incorrect word, confusion and transposition factors, the type of error in the incorrect word, and syntactic properties. The ranking process involves:
 Assigning a value to every feature.
 Computing the effect factor of each feature (weighting).
 Summing all the weights into a single number.
 Inserting the processed candidate at the suitable index within the candidates list, where high similarity candidates are ranked at the top and low similarity candidates are inserted at the bottom.

Features are represented by a vector of eight elements that may be decreased or increased depending on the purpose for which the text correction is applied, the source of the input text, and the expected error rate. Similarly, the weights of the features are also affected by the input text source, since some features depend on the error type. Before applying the ranking process, the source token that was marked as misspelled is examined against Named Entity (NE) features, because most proper nouns are not added to dictionaries, resulting in a mismatch case. Recognizing NEs requires combining multiple sources of information; some of them are strong enough to decide that a misspelled token is a name, but not that it is not one. Syntax analysis follows the feature based ranking; it is another step for optimizing the results, and mostly the one with the highest effect. The accuracy of ranking candidates should be completed by the syntactic role of the candidates that would be selected as suggestions.
Figure (5.1): Candidates ranking flowchart. For each (misspelled token, candidates list) pair, every candidate is handled in turn: the similarity value is accounted, inserted symbols are specified, and the features (confused?, transposed?, equal lengths?, duplicated?, difference <= threshold?, same symbol set?, end symbol match?, first symbol match?) are tested and assigned weights W1..W7 with factors f1..f3; the candidate is then ranked according to the weights sum, until the last candidate is processed.
5.3 Named-Entity Recognition

A big set of weak-evidence features is proposed to decide whether a token is a named entity or not, but there is variance in the level of analysis and in the features themselves. Some features are efficient in deciding that a token is a named entity and can be used individually in decision making; other features are never helpful unless combined with others. The features fall into many sub-categories; the best known are those related to the word level, part-of-speech tags, and dictionary looking up. Since the purpose of this system is determining token correctness, the word-level features are the most helpful, because the dictionary looking up is previously satisfied (a matched token does not need to be analyzed) and part-of-speech tags are useless in the absence of a decision. In English, the following features give some evidence for name detection:
(1) All-uppercase: a token consisting of capital letters only.
(2) Initial-caps: a token starting with a capital letter.
(3) All-numbers: a token consisting of numbers only.
(4) Alphanumeric: a token containing letters and numbers.
(5) Single-char: a token of one letter.
(6) Single-i: the single letter "i".

The all-uppercase feature is the strongest and can be used individually; initial-caps may be affected by the token's position within the sentence, because English sentences start with a capitalized word. In this system, the all-numbers feature is handled by treating all numeric values alike, assigning the same hash code and the same tag to every numeric string in the hash table. Many abbreviations, a sort of named entities, are
alphanumeric; therefore, it is a good feature. The single-character feature is used by Microsoft Word. Finally, a single "i" may refer to the pronoun "I", which is sometimes mistakenly written as a lowercase letter. Named-entity recognition features may help in marking a token as a name, but they cannot precisely decide that it is not one. An example of such cases is that some names may be written in lowercase letters, like "van Gogh", which satisfies none of the features above.

5.4 Candidates Ranking

If the misspelled token was not recognized as a named entity, the ranking process starts by measuring the similarity between the source token and every candidate in the associated list in a more sophisticated manner, considering the type of the committed error, to find a numeric value that describes the fidelity of each candidate over the rest. Eight weighted features are used to account for the effect of every error type on the whole candidate string; three different factor values are considered in the flowchart in figure (5.1) to outline and simplify the idea of giving different factor values to different error types (f1 = high, f2 = medium, f3 = low). Practically, effect factors are numeric values that vary from one feature to another. For each element in the features vector, there is a weight that reflects that feature's share in the total computed rank value. The rank value for each candidate is computed by:

Rank(c) = Σ (i = 1..n) wi × vi(c) … (5.1)

where n is the number of features, c is the selected candidate, wi is the weight associated with feature no. i, and v is the features vector.
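Equation (5.1) is a plain weighted sum; a minimal sketch follows, with a hypothetical eight-element feature vector and hypothetical weights (the real values depend on the input text source and are not specified here):

```python
# Equation (5.1): Rank(c) = sum_i w_i * v_i(c)
def rank_value(features, weights):
    """Weighted sum of a candidate's feature values."""
    return sum(w * v for w, v in zip(weights, features))

# Hypothetical values for one candidate: similarity, first/end symbol
# matches, length difference, transposition, confusion, symbol set,
# duplication. Similarity carries the largest weight, as in section 5.4.1.
features = [0.86, 1.0, 1.0, 0.9, 0.0, 0.0, 1.0, 0.0]
weights  = [5.0, 0.5, 0.3, 1.0, 1.0, 1.0, 0.5, 0.5]
print(round(rank_value(features, weights), 2))  # → 6.5
```

Candidates are then inserted into the suggestion list in decreasing order of this value.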
The weights' values depend on the application area of the system; they rely on the input text quality and the input device itself. The following subsections describe each feature and its effect on the ranking process.

5.4.1 Edit Distance Based Similarity

The enhanced Levenshtein edit distance method is used to calculate the distance of each candidate from the source token. Computing the similarity depends on the distance and on the lengths of the two strings (for more details see section 4.4.1.3). Similarity is measured by a numeric value within the interval (0, 1); therefore, it is multiplied by a factor that normalizes it against the other features in such a way that it gets the largest share in the ranking value among the other features' weights. In this application, as preferred in other applications because of the dominance of the similarity amount in the suggestion decision, similarity was weighted by a factor several times larger than the other features' weights.

5.4.2 First and End Symbols Matching

Research in the area of error analysis shows that mistakes rarely happen in the first letters of a word, and mostly the first letter is not mistaken. The probability of mistaking the second letter is also high but does not achieve interesting results compared to the first letter. On the other side, the end letter has a mistake probability near that of the first letter, and hence it is used as a part of the optimization procedure in calculating ranking values. The first and end letters are sufficient because they are related to human brain capabilities. Research from Stanford University showed that our brains
can predict the correct word exactly even if its letters are permuted randomly, as long as the first and last letters are correct. Exploiting these results assists the process of optimizing candidates suggestion. However, the idea cannot be realized directly in a computational way, because the human mind's ability to predict and connect facts is extremely fast and reliable; it depends on imagination and semantic relevance in interpreting sentences even in the presence of errors, and to date such ability is not found in computers. As a result, this feature, the difference in lengths, and the same-symbol-set feature together can simulate the human brain in a statistical way, because the idea is originally dependent on statistics. Small weights are given to both the first-letter and end-letter features, with a preference for the first-letter feature because it has a larger effect on the prediction than the end letter does.

5.4.3 Difference in Lengths

Writing mistakes usually occur within the token length or in its length ± 1; rarely do the lengths of the mistaken token and the intended token differ by more than one unit. Equality of lengths does not affect the candidate itself only, but also other features like transposition and confusion and even duplication (the next subsections detail the idea). Candidates with larger difference values may be rejected although they score good ranking values. The feature value is calculated by the relative length difference:

R_L_D(St1, St2) = 1 - abs(||St1|| - ||St2||) / min(||St1||, ||St2||) … (5.2)

where ||Sti|| is the length of string Sti.
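Equation (5.2) as code; note that, read literally, the formula can go negative when the length difference exceeds the shorter string's length, which is consistent with such candidates being rejected outright:

```python
# Equation (5.2): relative length difference feature.
def relative_length_difference(s1, s2):
    """1 - |len(St1) - len(St2)| / min(len(St1), len(St2))"""
    return 1 - abs(len(s1) - len(s2)) / min(len(s1), len(s2))

print(relative_length_difference("word", "words"))  # → 0.75
print(relative_length_difference("word", "word"))   # → 1.0
```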
The weight of this feature depends on the source of the input text: texts entered through an optical character recognizer (OCR) usually get smaller weights, while typed documents get larger values because the insertion of symbols is probable.

5.4.4 Transposition Probability

Transposition refers to the case of replacing a character by a neighboring one that is either similar in style or placed near it on the keyboard. Usually, this type of error occurs in typed texts and is referred to as "typos". Since English has a small alphabet, the task of computing the probability of transposing one letter for another is easy. Table (5.1) shows a transposition matrix that contains the probability of each letter being confused with another of the 26-letter alphabet, regardless of case, because such mistakes are related to the physical movement of the typist's fingers, not to the typed token. This feature considers two types of errors:
1. Errors within the length of the word: the typist mistakes a given letter for another, i.e. substitutes it with a neighboring letter by pressing the wrong key instead of the intended one. Such cases are described as being of the first degree, and the feature value is assigned the maximum.
2. Errors resulting in a word length increment: sometimes the fingers miss the exact position of the intended letter and press two keys simultaneously, typing two consecutive letters (the intended letter and the one to its right or left). This mistake inserts an additional letter and increases the word length by one.
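A lookup over such a matrix can be sketched as follows; only a few entries excerpted from the transposition matrix of Table (5.1) are included, and the symmetric-lookup convention is an assumption of this sketch:

```python
# A few entries from Table (5.1): 2 = immediate keyboard neighbour,
# 1 = nearby key, 0 = unlikely transposition.
TRANSPOSITION = {
    ("a", "s"): 2, ("a", "q"): 1, ("a", "w"): 1, ("a", "z"): 1,
    ("e", "r"): 2, ("e", "w"): 2, ("e", "d"): 1, ("e", "s"): 1,
}

def transposition_weight(intended, typed):
    """Case-insensitive, symmetric lookup of the transposition weight."""
    pair = (intended.lower(), typed.lower())
    return TRANSPOSITION.get(pair) or TRANSPOSITION.get(pair[::-1], 0)

print(transposition_weight("E", "r"))  # → 2
print(transposition_weight("a", "x"))  # → 0
```

A full implementation would load all 26 rows of the table into such a mapping.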
Table (5.1): Transposition Matrix
  a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 1 0 0 1
b 0 0 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0
c 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 0 0
d 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 1 0 0 0
e 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 2 0 0 0
f 0 0 1 2 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
g 0 1 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0
h 0 1 0 0 0 0 2 0 0 2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
i 0 0 0 0 0 0 0 0 0 1 1 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0
j 0 0 0 0 0 0 0 2 1 0 2 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
k 0 0 0 0 0 0 0 0 1 2 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 0 0 0 1 1 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0
n 0 0 0 0 0 0 0 1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0
q 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
r 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
s 1 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 1
t 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0
u 0 0 0 0 0 0 0 1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0
v 0 0 2 2 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0
x 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2
y 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0
z 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0

5.4.5 Confusion Probability

Confusion refers to the case of replacing a letter with another of similar pronunciation; sound is the basis for calculating the probability of confusing a given letter, unlike the transposition probability, which depends on the arrangement of keys on the keyboard.
This type of analysis targets phonetic errors; vowels are usually the most frequently confused letters. The weight of this feature depends on the application in which the correction is used; it should take large values when used with speech recognition systems. Table (5.2) shows the Stanford confusion matrix after being updated and normalized.
Table (5.2): Confusion Matrix
    a b c d e f g h i j k l m n o p q r s t u v w x y z
a   0 0 0 0 3 0 0 0 2 0 0 0 0 0 2 0 0 0 2 0 1 0 0 0 0 0
b   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
c   0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0
d   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
e   3 0 0 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0 1 0 0 0 1 0
f   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
g   0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i   2 0 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 1 0
j   0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
k   0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
l   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
m   0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0
n   0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
o   2 0 0 0 3 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0
p   0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
q   0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
r   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
s   0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
t   0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
u   2 0 0 0 2 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0
v   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
x   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
y   0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
z   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0

5.4.6 Consecutive Letters (Duplication)
Duplicating a single letter, or omitting one of two originally duplicated letters, is another of the typo errors. Some writers omit or add a letter from the original token, particularly when affixes are added.
The two major error types resulting from this kind of mistake are:
- Insertion: a single letter may be duplicated when a writer does not know the correct formation of a word after adding an affix, for example doubling the letter 'l' when attaching the suffix '-ful' to the noun 'hope' to form the adjective 'hopeful'. It can also result from pressing a key for longer than is needed to type a single letter, as in 'prrint'.
- Deletion: the reverse of insertion is dropping one of two duplicated letters, such as producing 'hopefuly' when adding the suffix '-ly' to 'hopeful', or writing a single letter instead of two, as in the single 's' of 'omision'.
Duplication is an interesting feature with a substantial effect on choosing the optimum candidate, particularly when the difference between the source token and the candidate equals the number of missing or duplicated letters.

5.4.7 Different Symbols Existence
A candidate is preferred to contain the same set of letters as the source token; this feature highlights the case of transposing two adjacent letters in a word (Damerau's fourth error type), which is a common mistake in typed text.

As a conclusion: none of the features described above is separable from the others; each is constrained by its weight and effect factor. Relations hold between edit distance and all seven remaining features; between difference in length and each of confusion, transposition, and duplication; between transposition and duplication; and so forth. Consequently, all of these features share the task of ranking the candidates, each with its own weight and according to the application environment. At this point the suggestion of candidates at the word level ends, and syntactic constraints begin to play a role in deciding which token should be suggested as the optimum among all the alternatives in the dictionary.
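Looking back at the duplication feature of section 5.4.6, a minimal detector for its two cases (doubled-letter insertion and missing-double deletion) might look like the following sketch; the function name and return values are illustrative assumptions.

```python
def duplication_relation(source: str, candidate: str):
    """Return 'insertion' if `source` has one extra doubled letter
    relative to `candidate`, 'deletion' if it is missing one half of a
    doubled pair, and None otherwise."""
    if len(source) == len(candidate) + 1:
        longer, shorter, kind = source, candidate, 'insertion'
    elif len(candidate) == len(source) + 1:
        longer, shorter, kind = candidate, source, 'deletion'
    else:
        return None
    for i, ch in enumerate(longer):
        # Dropping position i must restore the shorter word, and the
        # dropped letter must duplicate one of its neighbours.
        if longer[:i] + longer[i + 1:] == shorter:
            if (i > 0 and longer[i - 1] == ch) or \
               (i + 1 < len(longer) and longer[i + 1] == ch):
                return kind
    return None
```

For example, 'prrint' against 'print' is classified as an insertion, while 'hopefuly' against 'hopefully' is a deletion.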
5.5 Syntax Analysis
The task of the syntax analyzer is critical at this stage: in addition to examining sentence correctness, it selects the optimum candidate. In both cases, the analysis is applied at the level of phrases, where a sentence is broken into clauses and the clauses into phrases. The syntax analysis process is shown in figure (5.2).

5.5.1 Sentence Phrasing
The token stream is divided into groups in the segmentation stage; segmenting a text depends on the output of the tokenizer and the tagger, because determining sentence boundaries makes use of tags. The segmented text is a stream of sentences that can be passed to the syntax analyzer, since the latter usually works at the sentence level. A sentence contains one or more clauses, each clause consists of one or more phrases, and a phrase in turn contains one or more words. Phrasing is efficient from the following standpoints:
- Correcting part of a phrase affects the structure of the sentence only partially, which minimizes the total number of possible alternatives, leading to a smaller set of candidates and a better reconstruction of the original sentence that leaves it reasonably unchanged.
- Attachment ambiguity is a challenge facing the correction process, especially in semantic relations; phrase-level correction resolves it because a phrase is attached as a whole to another phrase, and updating it neither affects nor is affected by other phrases, unlike word-level correction, which must consider every possible parse and related part of the sentence.
- Converting into phrases simplifies the generation of complex sentence structures, because however complex a sentence becomes, it is still a collection of phrases connected syntactically and semantically.

Figure (5.2): Syntax analysis flowchart. [The flowchart converts each sentence into phrases, tests candidates starting from the top of the ranked list, replaces the misspelled token and checks constituency after each correction, selects the next candidate on violation, and outputs the corrected text with a list of candidates for each corrected token.]

English has a set of phrase types that includes: Noun Phrase (NP), Prepositional Phrase (PP), Adjectival Phrase (AdjP), Adverbial Phrase (AdvP), Complement (C), and Verb Phrase (VP). Each has its own set of word classes and a structure governing those classes.

5.5.2 Candidates Optimization
Misspelled tokens are associated with a ranked list of candidates, whose top entry is the most similar to the misspelled word. The optimization procedure is applied in two phases: the first is ranking according to feature satisfaction and weights; the second is syntactic agreement within the phrase that contains the misspelled word. Selecting candidates starts from the top; checking the consistency of the phrase structure has a fundamental impact on correction accuracy. The tag of the selected candidate should satisfy the structure of the phrase, and the process may also require checking the next tag in the sentence, i.e. the token that follows the misspelled word, which may form the head of the next phrase. The task is not especially challenging if the phrasing procedure was accurate: the structure of the phrase under processing limits the possible alternatives for the misspelled word to those with the best similarity and syntactic agreement.

5.5.3 Grammar Correction
A sentence is grammatically accepted if it can be generated by applying a finite set of grammar rules.
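This acceptance test can be illustrated with a toy rule set: a sentence's tag sequence is accepted if and only if it matches one of finitely many patterns. The rules below are invented examples for illustration, not the system's actual grammar.

```python
import re

# Toy grammar: each rule is a pattern over POS tags. A real rule set
# would be far larger and phrase-structured.
RULES = [
    r'(DET )?NOUN VERB',              # "The dog barks" / "Dogs bark"
    r'(DET )?NOUN VERB (DET )?NOUN',  # "The dog chased the cat"
]

def accepted(tags):
    """A tag sequence is accepted iff some rule generates it."""
    seq = ' '.join(tags)
    return any(re.fullmatch(rule, seq) is not None for rule in RULES)
```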
Grammar correction is a subfield of real-word error correction; it relies on sentence constituency to detect words that disagree with the grammar rules and make the sentence violate the parsing rules. In this step, the system checks the correctness of sentences by parsing each sentence separately, because syntactic acceptance is a sentence-level property, unlike semantic and further processing, which analyze texts at the level of paragraphs and full texts. The grammar correction procedure deals with two types of sentences:
1. Sentences containing correctly spelled words.
2. Sentences containing words that have been replaced with correct words.
In both cases, candidate suggestion has already finished and the correction is restricted to one suggested candidate. As shown in previous sections, the optimal candidate is the grammatically suitable one with the highest similarity; the grammar corrector therefore treats the two sentence types equivalently, as a sequence of correctly spelled words. Correcting a text grammatically is an extensive process that requires deep knowledge of the underlying language grammar and an inclusive set of grammar rules. This system is rule based and considers two types of correction:
- Subject-verb agreement.
- Verb tenses.
To perform these two types of correction and the phrasing procedure, the tag set needed to be more detailed than what is available in the original WordNet dataset. The dictionary was preprocessed to subdivide some tags into finer-grained ones, such as dividing definite and indefinite determiners into pre-, central, and post-determiners. Nouns also had to be categorized into plurals and singulars, and verbs into different tenses and participles. Integration with the ISPELL database enhanced the accuracy of the dictionary, providing it with a large set of singular and plural nouns, adjectives, and verb tense forms.

5.5.4 Document Correction
The final step is suggesting the corrected sentences. It includes replacing the incorrect words with the optimal candidates and associating the remaining candidates with every corrected word. The association is necessary because even a perfect suggester can never decide the intended word with absolute certainty; only the user can judge whether a word was accurately corrected. Candidates are listed according to their ranking values, and the list should be short and accurate. A threshold can be applied to the suggestion list to filter out any candidate whose similarity falls below a predefined value; developers can set the threshold according to the application environment. For example, applications used mainly by native speakers can afford a stricter threshold than applications such as language-learning programs, whose users typically have weaker linguistic knowledge.
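The threshold cutoff described above can be sketched as a small filter over the ranked candidate list; the default threshold and list length below are assumed, application-dependent parameters.

```python
def filter_suggestions(ranked, threshold=0.7, max_len=5):
    """Keep only candidates whose similarity reaches the threshold.

    ranked  : list of (candidate, similarity) pairs, best first.
    returns : a short, accurate list to display to the user.
    """
    kept = [(word, sim) for word, sim in ranked if sim >= threshold]
    return kept[:max_len]
```

A stricter environment (native speakers) would raise `threshold`; a language-learning application would lower it to keep more alternatives.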
  • 117. 98  Chapter Six Experimental Results, Conclusions, and Future Works 6.1 Experimental Results Objectives of this system are achieved through applying many steps. Some of these steps required techniques modifications to overcome some problems that are facing the desired results: 6.1.1 Tagging and Error Detection Time Reduction Assigning a POS tag to every token in the input text requires looking up the underhand dictionary. Looking up is an extensive process being more complex as the size of the dictionary becomes larger; the problem is solved by applying prefix dependent dictionary structure based on hashing and indexing. The structure is consisted of two levels of division: primary packets and secondary packets. Primary packets distribution depends on 3-symbols prefixes resulted in quite manageable sizes; as shown in figure (6.1) the search space is reduced to about one thousand of tokens at maximum instead of the original hundreds thousands with an average packets size of (11.16) tokens. In addition to the availability of packets heads random access that is provided by the hash function. Whereas, secondary packets are 6-prefix dependent resulted in more steadfast searching and minimized the looking up time to a reasonable amount as shown in figure (6.2), the search space reduced from hundreds thousands to some hundreds at maximum; the average size is (7.26) tokens per a secondary packet.
Figure (6.1): Tokens distribution in primary packets
Figure (6.2): Tokens distribution in secondary packets

On the dictionary lookup side of the tagging phase, the hashing scheme gives the lookup procedure a set of useful properties:

6.1.1.1 Successful Looking Up
When the target token is found in the dictionary, lookup time is reduced through three steps:
- Primary packet selection: the head of every primary packet is randomly accessible by applying a direct hash function that consumes three symbols of the target token; this, in turn, reduces matching time in the next steps.
- Secondary packet selection: selecting a secondary packet involves examining only three symbols (indices 4-6), resulting in faster searching even when performed sequentially.
- Inside-secondary-packet lookup: the remainder of the target token has length (token length - 6), since six symbols were consumed in the two previous steps on the way to the target secondary packet.
In other words, the best case of a successful lookup has time complexity O(1): the target token is stored at the first entry (the head) of a primary packet, which is randomly accessible. The worst case occurs when the target token is the last entry of a secondary packet whose head is in turn the last entry of the corresponding primary packet. The time complexity is then:
- O(1) for primary packet head access (random access);
- O(L1) for finding the secondary packet head holding the target token, examining only three symbols at each step;
- O(L2) for catching the target token, matching only the remainder of the token after discarding the first six symbols;
where L1 and L2 are the lengths of the primary and secondary packets, respectively. In total, the worst case is O(1) + O(L1) + O(L2).

6.1.1.2 Failure Looking Up
If the target token is not found in the dictionary, the failure can be discovered in three different situations:
- At the hashing step (generating the primary packet head address): if there is no match with the target token's prefix, the reference table announces the failure by referring to an empty primary packet.
This step consumes only one operation, O(1) time complexity.
- Within a prefix of six symbols of the target token: the failure can be discovered by matching the symbols at indices 4-6 against the same indices of each token in the primary packet. This step takes a number of matching operations equal to the length of the primary packet, O(L1) time complexity.
- On finding a matching secondary packet head (a 6-symbol prefix at minimum): matching against the tokens of the secondary packet is limited to the remainder of the target token, since the prefixes were already checked. The time complexity is O(L2).

6.1.2 Candidates Generation and Similarity Search Space Reduction
Candidate generation requires examining all the tokens in the lexical dictionary to compute their similarity to the misspelled token. Spell-based clustering illusion is our proposed solution for reducing the search space: similarly spelled tokens are grouped together in a way that keeps the structure of the dictionary unaffected and allows similarity-based lookup using bi-gram analysis and prefix similarity. Threshold usage depends on the application environment. The misspelled token is the basic unit of candidate generation: the proposed similarity-based lookup generates similar tokens from the misspelled token and the dictionary at hand, exploiting the hash-indexing scheme to speed up generation and bi-gram analysis to improve the accuracy of candidate selection without losing candidates of interest.
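The bi-gram analysis used to narrow the candidate search space can be sketched with the Dice coefficient over character bigrams; the 0.4 cutoff below is an assumed parameter, not the thesis's tuned threshold.

```python
def bigrams(word):
    """Set of character bigrams of a word, e.g. 'night' -> ni ig gh ht."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over the two bigram sets (1.0 = identical)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def prefilter(misspelled, vocabulary, cutoff=0.4):
    """Discard tokens sharing too few bigrams with the misspelled
    token before any more expensive edit-distance ranking."""
    return [w for w in vocabulary if bigram_similarity(misspelled, w) >= cutoff]
```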
The proposed approach is highly flexible because it acts as a clustering model with well-structured clusters (even though the clustering is in fact an illusion), and it supports a set of modifiable parameters:
1. Similarity measure: in this work we relied on minimum edit distance techniques (specifically, an improvement of the Levenshtein method) and used a similarity measure based on the distance this method computes. The similarity-based lookup approach is independent of the similarity measure, so any method or technique can be used with it.
2. Thresholds: threshold specification is a challenge facing many applications and requires effort to adjust well. This approach simplifies the task in two ways:
- Candidate generation can be performed without any consideration of thresholds; after the candidates are collected, they are ranked and the desired number of best candidates simply selected.
- Candidate filtering can be broken into three levels: the first at primary centroid selection, the second at secondary centroid selection, and the third at candidate selection. As in any other area of computation, dividing a problem into sub-tasks simplifies both the initial assignment and the updating of parameters during adjustment.
3. Applicability: similarity-based lookup readily accepts any update to candidate generation and can be adapted to different environments. For example, a developer using it for post-correction in OCR applications can add features that make recognition more accurate. This action may be
referred to implicitly by the similarity measure, or passed explicitly as parameters to the generation procedure.

6.1.3 Reducing the Time of the Damerau-Levenshtein Method
Damerau's modification of the Levenshtein method increases the time complexity because it adds extra checks on every symbol of the input strings; this goes back to the simple way it tests for the presence of a transposition. In this work, we modified the original method to handle transposition cases using the same idea as the original method, by merging the transposition test into a statement whose execution is limited to where it can apply.

Figure (6.3): Time complexity variance of Levenshtein, Damerau-Levenshtein, and Enhanced Levenshtein (our modification). [The Y axis represents the consumed time measured in seconds; the X axis shows the samples used for testing.]

The time variance of the three methods (Levenshtein, Damerau-Levenshtein, and the enhanced Levenshtein) is shown in figure (6.3); the time consumed by the enhanced method is very close to the original Levenshtein, while the Damerau modification resulted in a somewhat longer
time. The computed time is an average over ten repeated executions of each of the three methods on the same testing group.

6.1.4 Features Effect on Candidates Suggestion
The eight features selected for suggesting the best set of candidates were tested in three different cases to show how each affects the selection of the optimal suggestion for isolated-word correction. Figure (6.4) shows the ratio of candidates that were correctly suggested and correctly chosen as optimal. Suggested tokens are situations where the target token appears in the list of suggestions but is not necessarily selected as optimal; chosen-as-optimal tokens are those correctly selected as optimal.

Figure (6.4): Suggestion accuracy, with a comparison to Microsoft Office Word, on a sample from Wikipedia.
- Total misspelled tokens: 1825
- Suggested target token: 1691 (suggestion accuracy 92.657%)
- Optimally selected: 1477 (optimality accuracy 87.34%)
- Microsoft Word suggestion: 1659 (suggestion accuracy 90.904%)
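Returning to section 6.1.3, the enhanced method's idea of folding the transposition check into the main Levenshtein loop, so that it executes only where a transposition can actually apply, can be sketched as follows; this is a minimal illustration, not the thesis's exact code.

```python
def enhanced_levenshtein(s, t):
    """Edit distance counting insertion, deletion, substitution, and
    adjacent transposition; the transposition test is merged into the
    same cell update the original method already performs."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition check, limited to where it can apply.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]
```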
Suggestion accuracy was computed by applying isolated-word correction to a list of 1825 commonly misspelled words from the Wikipedia website, resulting in an accuracy of 92.657%, of which 87.34% were correctly suggested as optimal candidates. The same testing data checked with Microsoft Word gave a suggestion accuracy of 90.904%. A subset of the Wikipedia sample (presented in Appendix A) was used to compare our system's accuracy with other systems. The results for the other systems were taken from a study by Ahmed Farag and others [Ahm09]; our system was tested on the same data and gave the results shown in figure (6.5).

Figure (6.5): Testing the suggested system (I.T.D.C) and comparing the results with other systems on the same dataset (isolated-word correction).
- ASPELL: 109 correctly suggested, 11 incorrectly suggested (accuracy 90.833%)
- Microsoft Word: 104 correctly suggested, 16 incorrectly suggested (accuracy 88.33%)
- MultiSpell: 111 correctly suggested, 9 incorrectly suggested (accuracy 92.5%)
- I.T.D.C system (the suggested system): 115 correctly suggested, 5 incorrectly suggested (accuracy 95.83%)
Another experiment checked the effect of every feature on the accuracy of optimal candidate selection. The results in figure (6.6) were computed by discarding one feature at a time, while figure (6.7) shows the results of using one feature at a time. Although some features give high accuracy alone, that alone is not sufficient evidence: an example is the duplication feature, which accounted for 1552 correctly selected optimal tokens when used alone, whereas discarding it did not affect the total number of tokens in the optimal set.

Figure (6.6): Discarding one feature at a time for optimal candidate selection (correctly selected optimal tokens; full optimal set = 1477).
- Without the similarity feature: 827
- Without the first letter feature: 1464
- Without the end letter feature: 1468
- Without the length effect feature: 1476
- Without the same letter set feature: 1436
- Without the transpositionally inserted feature: 1464
- Without the duplication feature: 1465
- Without the confusion feature: 1475
- Without the transposition feature: 1487
Figure (6.7): Using one feature at a time for optimal candidate selection (correctly selected optimal tokens; full optimal set = 1477).
- Similarity feature alone: 1406
- First letter feature alone: 317
- End letter feature alone: 595
- Length effect feature alone: 447
- Same letter set feature alone: 1486
- Transpositionally inserted feature alone: 1478
- Duplication feature alone: 1552
- Confusion feature alone: 909
- Transposition feature alone: 923

6.2 Conclusions
Text correction is a complex problem and an extensive task. It needs many linguistic and statistical resources, as well as efficient techniques for automatic execution. In this work we performed a set of improvements on both the resource and technique sides. Our dictionary, an integration of the WordNet and ISPELL datasets, was retagged to enable and simplify the parsing process. Hashing and indexing techniques shorten the error detection time; the correction process exploits the same hashed dictionary together with an enhancement of the Levenshtein method for generating candidates. A set of features, some of them statistics-dependent, is used to optimize candidates before they are passed to the parser, where the final decision is made at the level of phrases and sentences. There is no way to avoid human intervention, because computers can never predict with certainty what a human intended; therefore, a set of alternatives is associated with every corrected word.

6.3 Future Works
Automatic text correction is an open research area; even with several techniques and applications available, the desired results are still imperfect. Some issues can be further considered in this work to improve its accuracy:
- Semantic processing: this system depends entirely on an extensive parser at the level of syntax analysis only; semantic information would increase accuracy if implemented to discard candidates that conflict with the sentence meaning. Discourse and pragmatic analysis could also enhance the results.
- In addition to spell-based clustering and phonetic-based clustering, a technique merging both within the same searching time constraints is desirable. Such an enhancement would maximize candidate generation accuracy and minimize time complexity.
- In the hash table, lookup inside primary and secondary packets is performed sequentially; applying a faster technique such as binary search would be a good improvement. This requires sorting the tokens by spelling and applying the search in two directions:
o At the level of the token itself, where moving from one entry to another depends on the tokens' spelling and should therefore consider the symbols of the token sequentially, because the length is small enough not to need a complex lookup technique.
o At the level of the packets, where movement is performed at the level of tokens.
- The similarity-based lookup needs to be faster; an enhancement is needed to reduce the number of generated primary centroids. This problem may be eased if the application in which the system is used becomes more specific.
- In grammar correction we considered only two error types; it is preferable to cover as many types as possible.
- Because of time constraints, this system was implemented for simple sentences only; an extension is required to generalize it to complex, compound, and complex-compound sentences. The task is straightforward, because no further details are required for the construction process thanks to the phrase-level analysis made in this work.
- A sophisticated study of error types and of how people typically make writing mistakes is needed; such a study requires multiple resources, including corpora, statistics, and even an interactive analyzer for recording and classifying commonly committed mistakes. Although not an easy task, it could help draw a concluding picture of users' general behavior when they unintentionally change the spelling of words and generate misspellings.
  • 130. References ___________________________________________________________  110  References Achenkunju A. and Bhuma V.R. (2014). "An Efficient Reformulated Model for Transformation of String." International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 International Conference on Humming Bird. 1 March. Ahmed F., Ernesto W. De L., and Andreas N.( 2009). Revised N-Gram based Automatic Spelling Correction Tool to Improve Retrieval Effectiveness. Technical University of Berlin. Ali A.(2011). Textual Similarity. Technical University of Denmark. Amber W. O., Graeme H., and Alexander B. (2008). Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. Ontario: University of Toronto. Baluja S., Vibhu O. M., and Rahul S.(2000). APPLYING MACHINE LEARNING FOR HIGH PERFORMANCE NAMED-ENTITY EXTRACTION. Cambridge: Blackwell Publishers. Bassil Y.( 2012). "Parallel Spell-Checking Algorithm Based on Yahoo! N- Grams Datasets." International Journal of Research and Reviews in Computer Science (IJRRCS), ISSN: 2079-2557, Vol.3, No.1, February. Bhattacharyya P. (2012). "Natural Language Processing A Perspective from Computation in Presence of Ambiguity, Resource Constraint and Multilinguality." CSI Journal of Computing, Vol.1 , No. 2, 3-13. Booth A. D., Brandwood L., and Cleave J. P.. (1958). Mechanical Resolution of Linguistic Problems. New York, London: Academic Press Ink Publishers; ButterWorths Scientific Publications. Boswell D. (2005). Speling KoreKsion: A survey of techniques from past to present. A USCD Research Exam.
  • 131. References ___________________________________________________________  111  Chakraborty R. C. (2010). "Artificial Intelligence: Natural Language Processing." www.myreaders.info/html/artificial_intelligence.html, 1 June. Church K. and Gale W. A. (1991). "Probability Scoring for Spelling Correction." Statistics and Computing, 93-103. Clark A., Chris F., and Shalom L. (2010). The Handbook of Computational Linguistics and Natural Language Processing. Singapore: Wiley-Blackwell. Dahlmeier D. and Hwee T. N. (2011). "Grammatical Error Correction with Alternative Structure Optimization." Proceedings of the Association for Computational Linguistics, 915-923. Damerau F. J. (1964). A Technique for Computer detection and Correction of Spelling Errors. New York: ACM, Vol.3,No.4. Dzikovska M. O. (2004). A Practical Semantic Representation For Natural Language Parsing. New York: University of Rochester. Farra N., Nadi T., Alla R., and Nizar H. ( 2014). "Generalized Character- Level Spelling Error Correction." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 23-25, 161–167. Felice M., Yuan Z., Andersen Ø. E., and others.( 2014). "Grammatical Error Correction using Hybrid Systems and Type Filtering." Proceedings of the Shared Task Eighteenth Conference on Computational Natural Language Learning, Maryland, 15-24. Fromkin V., Robert R., and Nina H. ( 2007). Language Change: The Syllabes of Time. Vol. 8, in An Introduction to Language, 461-497. Boston. Gamon M. (2010). "Using Mostly Native Data to Correct Errors in Learners' writing: A meta-classifier approach." proceedings of the Annual
Meeting of the North American Chapter of the Association for Computational Linguistics, 163-171.
Golding A. R. and Yves S. (1996). Combining Trigram-Based and Feature-Based Methods for Context-Sensitive Spelling Correction. Cambridge: Mitsubishi Electric Research Laboratories.
Grune D. and Ceriel J. H. J. (2008). Parsing Techniques: A Practical Guide. Second Edition. Springer.
Gupta A. (2014). "Grammatical Error Detection and Correction Using Tagger Disagreement." Proceedings of the Shared Task of the Eighteenth Conference on Computational Natural Language Learning, 49-52.
Haldar R. and Debajyoti M. (2011). Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach. New York: ACM.
Han N., Martin C., and Claudia L. (2006). "Detecting Errors in English Article Usage by Non-Native Speakers." Natural Language Engineering, 115-129.
Hasan F. M. (2006). Comparison of Different POS Tagging Techniques for Some South Asian Languages. Dhaka: BRAC University.
Hasan F. M., Naushad U., and Mumit K. (2006). Comparison of Different POS Tagging Techniques (N-Gram, HMM and Brill's Tagger) for Bangla. Bangladesh: BRAC University.
Hodge V. J. and Austin J. (2003). "A Comparison of Standard Spell Checking Algorithms and a Novel Binary Neural Approach." IEEE Transactions on Knowledge and Data Engineering, 1073-1081.
Hwee T. N., Siew M. W., Ted B., and others. (2014). "The CoNLL-2014 Shared Task on Grammatical Error Correction." Proceedings of the Shared
Task of the Eighteenth Conference on Computational Natural Language Learning, June 26-27, 1-14.
"Ispell." Wikipedia, The Free Encyclopedia, April 10, 2014 (accessed September 2014).
Jackson P. and Isabelle M. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: John Benjamins Publishing Company.
Jones K. S. (2001). Natural Language Processing: A Historical Review. University of Cambridge, October.
Julius G. III. (2013). Intrasentential Grammatical Correction with Weighted Finite State Transducers. Raleigh, North Carolina: North Carolina State University.
Jurafsky D. and James H. M. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. New Jersey: Alan Apt.
Kirthi J., Neeju N. J., and Nithiya P. (2011). "Automatic Spell Correction of User Query with Semantic Information Retrieval and Ranking of Search Results Using WordNet Approach." IJCSI International Journal of Computer Science Issues, Vol. 8, No. 2, March, 557-564.
Kukich K. (1992). Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, Vol. 24, No. 4.
Manning R. and Schütze. (2008). An Introduction to Information Retrieval. Cambridge University Press.
Mihov S., Svetla K., and others. (2004). Precise and Efficient Text Correction Using Levenshtein Automata, Dynamic Web Dictionaries and Optimized Correction Models. Bulgarian Academy of Sciences.
Mishra R. and Navjot K. (2013). "A Survey of Spelling Error Detection and Correction Techniques." International Journal of Computer Trends and Technology, Vol. 4, No. 3, 372-374.
Momtazi S. (2012). Natural Language Processing: Introduction to Language Technology. University of Potsdam.
Nadkarni P. M., Lucila O., and Wendy W. C. (2011). "Natural Language Processing: An Introduction." Journal of the American Medical Informatics Association, October 5, 544-551.
Niemann T. (2009). Sorting and Searching Algorithms. Portland: epaperpress.com.
"Notes on Ambiguity." http://cs.nyu.edu/faculty/davise/ai/ambiguity.html.
Peterson J. L. (1980). "Computer Programs for Detecting and Correcting Spelling Errors." Communications of the ACM, Vol. 23, No. 12, 676-687.
Pollock J. J. and Zamora A. (1983). "Collection and Characterization of Spelling Errors in Scientific and Scholarly Text." Journal of the American Society for Information Science, 51-58.
Pollock J. J. and Zamora A. (1984). "Automatic Spelling Correction in Scientific and Scholarly Text." Communications of the ACM, 358-368.
Quirk R., Sidney G., Geoffrey L., and Jan S. (1985). A Comprehensive Grammar of the English Language. New York and London: Longman.
Raaijmakers S. (2013). "A Deep Graphical Model for Spelling Correction." Proceedings of the 25th Benelux Conference on Artificial Intelligence, Delft, 7-8 November.
Rich E. and Kevin K. (1991). "Chapter Fifteen: Natural Language Processing." Vol. 2, in Artificial Intelligence. Amazon.
Ritter A., Mausam S. C., and Oren E. (2011). Named Entity Recognition in Tweets: An Experimental Study. Computer Science and Engineering, University of Washington.
Rajesh K. S. and Lokanatha C. R. (2009). "Natural Language Processing: An Intelligent Way to Understand Context Sensitive Languages." International Journal of Intelligent Information Processing, December, 421-428.
Sagar and Shobha G. (2013). "Survey on Grammar Generation Methods for Natural Languages." International Journal of Computational Linguistics and Natural Language Processing, ISSN 2279-0756, Vol. 2, No. 1, January, 197-202.
Salifou L. and Harouna N. (2014). "Design of a Spell Corrector for Hausa Language." International Journal of Computational Linguistics (IJCL), Vol. 5, No. 2, 14-26.
Scott M. T. (1999). Parsing and Tagging Sentences Containing Lexically Ambiguous and Unknown Tokens. Purdue University.
Seo H., Jonghoon L., Seokhwan K., and others. (2012). "A Meta Learning Approach to Grammatical Error Correction." 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, July 8-14.
Setiadi I. (2014). Damerau-Levenshtein Algorithm and Bayes Theorem for Spell Checker Optimization. Bandung: Makalah IF2211 Strategi Algoritma.
Tetreault J., Jennifer F., and Martin C. (2010). "Using Parse Features for Preposition Selection and Error Detection." Proceedings of the ACL 2010 Conference Short Papers, 353-358.
Toutanova K. and Moore R. C. (2002). "Pronunciation Modeling for Improved Spelling Correction." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 144-151.
Verberne S. (2002). Context-Sensitive Spell Checking Based on Word Trigram Probabilities. University of Nijmegen.
Voorhees E., Harman D. K., and others. (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge: MIT Press.
Wagner R. A. and Fischer M. J. (1974). "The String-to-String Correction Problem." Journal of the Association for Computing Machinery, 168-173.
Wolniewicz R. (2011). Auto-Coding and Natural Language Processing. U.S.A.: 3M Health Information Systems.
Yannakoudakis E. J. and Fawthrop D. (1983). "An Intelligent Spelling Error Correction." Information Processing and Management, 101-108.
Yule G. (2000). "Pragmatics." In Oxford Introductions to Language Study, Series Editor H. G. Widdowson, 4. Oxford University Press.
Zampieri M. and Renato C. de A. (2014). Between Sound and Spelling: Combining Phonetics and Clustering Algorithms to Improve Target Word Recovery. Saarland: Saarland University.
Zhan J., Xiaolong M., Shu q. L., and Ditang F. (1998). A Language Model in a Large-Vocabulary Speech Recognition System. Sydney: Proceedings of International Conference ICSLP98.
Appendix (A): A comparison between this work and some other systems on isolated-word correction

* Bold words are incorrectly suggested.
** I.T.D.C. System: Intelligent Text Document Correction System Based on Similarity Technique (our suggested system).

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
Abberration | aberration | aberration | aberration | aberration | aberration
accomodation | accommodation | accommodation | accommodation | accommodation | accommodation
acheive | achieve | Achieve | achieve | achieve | achieve
abortificant | abortifacient | aficionados | - | abortifacient | abortifacient
absorbsion | absorption | absorbsion | absorbs ion | absorption | absorption
ackward | (awkward, backward) | awkward | (awkward, backward) | (awkward, backward) | (backward, awkward)
additinally | additionally | additionally | additionally | additionally | additionally
adminstration | administration | administration | administration | administration | administration
admissability | admissibility | admissibility | admissibility | admissibility | admissibility
advertisments | advertisements | advertisements | advertisements | advertisements | advertisements
adviced | advised | advised | advised | advice | advised
afficionados | aficionados | aficionados | aficionados | aficionados | aficionados
affort | (effort, afford) | effort | afford | afford | (effort, afford)
agains | against | agings | agings | against | against
aggreement | agreement | agreement | agreement | agreement | agreement
agressively | aggressively | aggressively | aggressively | aggressively | aggressively
agriculturalist | agriculturist | - | - | agriculturist | agriculturist
alcoholical | alcoholic | alcoholically | alcoholically | alcoholic | (alcoholically, alcoholic)
algebraical | algebraic | algebraic | algebraically | algebraically | algebraic
algoritms | algorithms | algorithms | algorithms | algorithms | (algorism, algorithms)
alterior | (ulterior, anterior) | ulterior | (anterior, ulterior) | (anterior, ulterior) | (ulterior, anterior)
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
anihilation | annihilation | annihilation | annihilation | annihilation | annihilation
anthromorphization | anthropomorphization | anthropomorphizing | - | anthropomorphization | anthropomorphization
bankrupcy | bankruptcy | bankruptcy | bankruptcy | bankruptcy | bankruptcy
baout | (about, bout) | bout | (about, bout) | bout | (about, bout)
basicly | basically | basically | basically | basically | basically
breakthough | breakthrough | break though | breakthrough | breakthrough | breakthrough
carachter | character | crocheter | character | character | character
cannotation | connotation | connotation | (connotation, annotation) | (connotation, annotation) | connotation
carismatic | charismatic | charismatic | charismatic | charismatic | charismatic
carmel | caramel | Carmel | - | caramel | caramel
cervial | (cervical, servile) | cervical | cervical | cervical | cervical
clasical | classical | classical | classical | classical | classical
cleareance | clearance | clearance | clearance | clearance | clearance
comissioning | commissioning | commissioning | commissioning | commissioning | commissioning
commemerative | commemorative | commemorative | commemorative | commemorative | commemorative
compatabilities | compatibilities | compatibilities | compatibilities | compatabilities | compatibilities
committment | commitment | commitment | commitment | commitment | commitment
debateable | debatable | debatable | debatable | debatable | debatable
determinining | determining | determinining | determinining | determining | determining
childbird | childbirth | child bird | child bird | childbirth | childbirth
definately | definitely | definitely | definitely | definitely | definitely
decribe | describe | describe | describe | describe | describe
elphant | elephant | elephant | elephant | elephant | elephant
emmediately | immediately | immediately | immediately | immediately | immediately
emphysyma | emphysema | emphysema | emphysema | emphysema | emphysema
erally | (orally, really) | orally | really | orally | (really, orally)
eyasr | (years, eyas) | eyesore | years | eyas | (eyas, years)
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
facist | fascist | fascist | fascist | fascist | fascist
fluoroscent | fluorescent | fluorescent | fluorescent | fluorescent | fluorescent
geneology | genealogy | genealogy | genealogy | genealogy | genealogy
gernade | grenade | grenade | grenade | grenade | grenade
girates | gyrates | grates | gyrates | Gyrates | gyrates
gouvener | governor | governor | souvenir | convener | (souvenir, gouverneur, governor)
gurantees | guarantee | guarantee | guarantee | guarantee | (guaranties, guarantee)
guerrila | (guerilla, guerrilla) | guerrilla | guerrilla | (guerilla, guerrilla) | (guerrilla, guerilla)
guerrilas | (guerillas, guerrillas) | guerrillas | guerrillas | (guerillas, guerrillas) | (guerrillas, guerillas)
Guiseppe | Giuseppe | Giuseppe | Giuseppe | Giuseppe | -
habaeus | (habeas, sabaeus) | habeas | habitués | sabaeus | Cabaeus
hierarcical | hierarchical | hierarchical | hierarchical | hierarchical | hierarchical
heros | heroes | heroes | heroes | herbs | heroes
hypocracy | hypocrisy | hypocrisy | hypocrisy | hypocrisy | hypocrisy
independance | Independence | Independence | - | Independence | Independence
intergration | integration | integration | integration | integration | integration
intrest | interest | interest | interest | interest | interest
Johanine | Johannine | Johannes | Johannes | Johannine | Johannine
judisuary | judiciary | judiciary | judiciary | judiciary | judiciary
kindergarden | kindergarten | kindergarten | kindergarten | kindergarten | kindergarten
knowlegeable | knowledgeable | knowledgeable | knowledgeable | knowledgeable | knowledgeable
labatory | (lavatory, laboratory) | (lavatory, laboratory) | (lavatory, laboratory) | (lavatory, laboratory) | lavatory
lonelyness | loneliness | loneliness | loneliness | loneliness | loneliness
legitamate | legitimate | legitimate | legitimate | legitimate | legitimate
libguistics | linguistics | linguistics | linguistics | linguistics | linguistics
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
lisence | (license, licence) | licence | silence | licence | (licence, license)
mathmatician | mathematician | mathematician | mathematician | mathematician | mathematician
ministery | ministry | ministry | ministry | ministry | ministry
mysogynist | misogynist | misogynist | misogynist | misogynist | misogynist
naturaly | naturally | naturally | naturally | naturally | naturally
ocuntries | countries | countries | countries | countries | countries
paraphenalia | paraphernalia | paraphernalia | paraphernalia | paraphernalia | paraphernalia
Palistian | Palestinian | Alsatain | politian | Palestinian | (Pakistan, politian)
pamflet | pamphlet | pamphlet | pamphlet | pamphlet | partlet
psyhic | psychic | psychic | psychic | psychic | psychic
Peloponnes | Peloponnesus | Peloponnese | Peloponnese | Peloponnesus | Peloponnese
personell | personnel | personnel | personnel | personnel | (personally, personnel)
posseses | possesses | possesses | possesses | possess | possesses
prairy | prairie | priory | prairie | airy | (priory, prairie)
qutie | (quite, quiet) | quite | quite | queue | quite
radify | (ratify, ramify) | ratify | ratify | ramify | (rarify, ratify, ramify)
reccommended | recommended | recommended | recommended | recommended | recommended
reciever | receiver | receiver | receiver | reliever | receiver
reconaissance | reconnaissance | reconnaissance | reconnaissance | reconnaissance | reconnaissance
restauration | restoration | restoration | restoration | instauration | restoration
rigeur | (rigueur, rigour, rigor) | rigger | rigueur | (rigueur, rigour) | rigour
Saterday | Saturday | Saturday | Saturday | Saturday | Saturday
scandanavia | Scandinavia | Scandinavia | Scandinavia | Scandinavia | Scandinavia
scaleable | scalable | scalable | - | scalable | scalable
secceeded | (seceded, succeeded) | succeeded | succeeded | succeeded | succeeded
sepulchure | (sepulchre, sepulcher) | sepulcher | sepulchered | sepulchre | (sepulchre, sepulcher)
Appendix (A), continued:

Misspellings | Correct Word | ASPELL | Microsoft Word | MultiSpell [Ahm09] | I.T.D.C. System
themselfs | themselves | themselves | themselves | themselves | themselves
throught | (thought, through, throughout) | (thought, through) | (thought, through) | (thought, through, throughout) | (through, thought, throughout)
troups | (troupes, troops) | (troupes, troops) | troupes | troops | (troops, troupes)
simultanous | simultaneous | simultaneous | simultaneous | simultaneous | simultaneous
sincerley | sincerely | sincerely | sincerely | sincerely | sincerely
sophicated | sophisticated | suffocated | supplicated | sophisticate | sophister
surrended | (surrounded, surrendered) | surrounded | surrender | surrounded | (surrender, surrendered, surrounded)
unforetunately | unfortunately | unfortunately | unfortunately | unfortunately | unfortunately
unnecesarily | unnecessarily | unnecessarily | unnecessarily | unnecessarily | unnecessarily
usally | usually | usually | usually | usually | usually
useing | using | using | using | seeing | using
vaccum | vacuum | vacuum | vacuum | vacuum | vacuum
vegitables | vegetables | vegetables | vegetables | vegetables | vegetables
vetween | between | between | between | between | between
volcanoe | volcano | volcano | volcano | volcano | (volcanoes, volcano)
weaponary | weaponry | weaponry | weaponry | weaponry | weaponry
worstened | worsened | worsened | worsened | worsened | worsened
wupport | support | support | support | support | support
yeasr | years | years | years | yeast | years
Yementite | (Yemenite, Yemeni) | Yemenite | Yemenite | Yemenite | Yemenite
yuonger | younger | younger | younger | sponger | younger
Abstract (translated from the Arabic)

Automatic text correction is one of the most important problems associated with human-computer interaction. It enters into many practical applications, both direct, such as correcting the errors that result from converting handwritten texts into digital form, and indirect, such as correcting users' queries before a retrieval operation is performed on an interactive database.

The automatic correction process passes through two main stages: error detection and suggestion of alternatives. Many techniques and methods exist for both stages, and they vary in the accuracy of their results and in their applicability; in general, they are divided into procedural and statistical methods. Procedural methods rely on well-defined rules that govern the acceptability of a text, including natural language processing techniques, whereas statistical methods depend on statistical and probabilistic data usually gathered from huge samples drawn essentially from what circulates among users.

In this system, natural language processing techniques were adopted as the basis for analyzing English texts and checking their lexical and grammatical acceptability. A dictionary comprising all the vocabulary of the English language was used for detecting and identifying spelling errors. Owing to the huge size of this dictionary, a hash function and an indexing method were employed to narrow the search space for the sought words and to provide random-access capability based on word prefixes, thereby shortening the search time.

Generation of alternatives relies on computing the degree of similarity between the input word and every dictionary word, and re-ranking the words according to this measure, which is calculated using a modified Levenshtein method. Because this generation process requires a long time, the dictionary words were partitioned into small groups, while retaining random-access capability, according to criteria that depend on the spelling of the source word. Suggesting alternatives involves testing a set of features related, to some extent, to the nature of the most common errors. The system then selects the optimal alternative, the one achieving the highest compatibility with the source word, provided that it does not conflict with the grammar rules, so that the corrected text is lexically and grammatically acceptable.

Accuracy tests showed that the proposed system outperforms Microsoft Word and other systems; furthermore, the modified string-similarity method approximately preserved the original time complexity while gaining the ability to detect an additional type of spelling error.
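The pipeline summarized above (dictionary lookup to detect errors, partitioning the dictionary to narrow the search, and ranking candidates by edit-distance similarity) can be sketched as follows. This is a minimal illustration only: the tiny word list, the plain (unmodified) Levenshtein distance, and first-letter bucketing are stand-ins for the thesis's full dictionary, modified similarity measure, and hash/prefix-index scheme.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def build_index(words):
    """Partition the dictionary by first letter to narrow the search space
    (a crude stand-in for the prefix-hash indexing the abstract describes)."""
    index = {}
    for w in words:
        index.setdefault(w[0], []).append(w)
    return index

def suggest(word, index, k=3):
    """Return up to k dictionary words closest to `word` by edit distance."""
    bucket = index.get(word[0], [])          # search only the matching bucket
    return sorted(bucket, key=lambda w: levenshtein(word, w))[:k]

# Hypothetical miniature dictionary for demonstration.
dictionary = ["aberration", "about", "achieve", "absorption", "against", "agreement"]
index = build_index(dictionary)
print(suggest("acheive", index))   # "achieve" ranks first (distance 2)
```

Ranking a whole bucket costs O(n log n) in the bucket size on top of the distance computations, which is the same motivation the abstract gives for splitting the dictionary into small groups while keeping random access to them.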
Arabic title page (translated)

Intelligent Text Document Correction System Based on Similarity Technique

A thesis submitted to the Council of the College of Information Technology, University of Babylon, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By
Marwa Kadhim Obeid Al-Rikaby

Supervised by
Prof. Dr. Abbas Mohsen Al-Bakry

2015 A.D. / 1436 A.H.

Ministry of Higher Education and Scientific Research
University of Babylon - College of Information Technology
Software Department