ΓΛ545
COMPUTATIONAL LINGUISTICS
AND CORPORA
Athanasios N. Karasimos
akarasimos@gmail.com
MA in Linguistics | School of English Language and Literature
Aristotle University of Thessaloniki
Lecture 7 | Wed 15 Nov 2017
WORDS AND
TRANSDUCERS
2
LECTURE 5 RECAP
Language Modeling with N-Grams
3
FINITE-STATE AUTOMATA
• Any regular expression can be realized as a finite state automaton (FSA).
• An automaton implicitly defines a formal language as the set of strings the
automaton accepts.
• An automaton can use any set of symbols for its vocabulary, including
letters, words, or even graphic images.
• The behavior of a deterministic automaton (DFSA) is fully determined by
the state it is in.
• A non-deterministic automaton (NFSA) sometimes has to make a choice
between multiple paths to take given the same current state and next input.
• Any NFSA can be converted to a DFSA.
4
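To make the recap concrete, here is a minimal sketch (not from the slides) of a deterministic recognizer for a simple regular expression such as /baa+!/; the state numbering and the transition dictionary are illustrative choices.

```python
# Minimal DFSA sketch for the regular expression /baa+!/ ("sheeptalk").
# States and transitions are illustrative, not taken from the slides.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: one or more additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(string: str) -> bool:
    state = 0
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False           # no legal transition from this state: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

print(accepts("baaa!"), accepts("ba!"))   # True False
```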
MORPHOLOGY PART 2
Let's talk about WORDS
5
THE CASE OF PLURAL
• Simple cases of plural:
• woodchuck to woodchucks
• But what about fox, peccary, goose and fish?
• Orthographic rules: peccary to peccaries
• Morphological rules: fish with 0 plural suffix
goose to geese with vowel change
• Phonological rules: fox to foxes
6
MORPHOLOGICAL PARSING
• The task of recognizing that a word (like foxes) breaks down into
component morphemes (fox and -es) and of building a structured
representation of this fact is called morphological parsing.
• Parsing means taking an input and producing some sort of linguistic
structure for it.
• We use the term parsing very broadly, covering many kinds of
structures that might be produced (morphological, syntactic, semantic,
discourse), in the form of a string, a tree, or a network.
7
MORPHOLOGICAL PARSING
• Morphological parsing or stemming(?) applies to many affixes other
than plurals;
• for example we might need to take any English verb form ending in -ing
(going, talking, congratulating) and parse it into its verbal stem plus the -ing
morpheme.
• So given the surface or input form going, we might want to produce
the parsed form VERB-go + GERUND-ing.
• Morphological parsing is important throughout speech and language
processing. It plays a crucial role in Web search for morphologically
complex languages like Greek, Russian or German.
8
MORPHOLOGICAL PARSING
• Morphological parsing also plays a crucial role in part-of-speech tagging
for these morphologically complex languages.
• It is important for producing the large dictionaries that are necessary for
robust spell-checking.
• It is necessary in machine translation to realize for example that the
French words va and aller should both translate to forms of the English
verb go.
9
SURVEY OF ENGLISH
MORPHOLOGY
10
A FAMILIAR FACE: MORPHEMES
• A morpheme is often defined as the minimal meaning-bearing unit in a
language.
• So for example the word fox consists of a single morpheme (the morpheme fox)
while the word cats consists of two: the morpheme cat and the morpheme -s.
• As this example suggests, we often distinguish two broad classes of
morphemes: stems and affixes.
• Affixes: divided into prefixes, suffixes, infixes, and circumfixes. Prefixes
precede the stem, suffixes follow the stem, circumfixes do both, and infixes
are inserted inside the stem.
• Circumfixes: [German] past participles (ge- and -en/-t)
• Infixes: [Tagalog] affix um, which marks the agent of an action, is infixed to
the stem hingi “borrow” to produce humingi.
11
WORD FORMATION PROCESSES
• Four processes are common and play important roles in speech and
language generation: inflection, derivation, compounding, and
cliticization.
• Inflection is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of the same class as the original stem, and usually filling
some syntactic function like agreement.
• Derivation is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of a different class, often with a meaning hard to predict
exactly.
• Compounding is the combination of multiple word stems together.
• Cliticization is the combination of a word stem with a clitic. A clitic is a morpheme
that acts syntactically like a word, but is reduced in form and attached
(phonologically and sometimes orthographically) to another word.
12
MORPHOLOGICAL TASKS
• TASK I:
• Give two examples of each word formation process.
• TASK II:
• Consider possible problematic cases for morphological parsing (inflection, derivation, compounding).
• TASK III:
• Test these cases with a morphological parser.
• http://nlpdotnet.com/services/Morphparser.aspx
• TASK IV:
• Ambiguity of morphological parsing.
• https://open.xerox.com/Services/fst-nlp-tools/Consume/Morphological%20Analysis-176
• http://langrid.org/playground/morphological-analyzer.html
13
INFLECTIONAL ENGLISH
• Nominal suffixes: an affix that marks plural and an affix that marks
possessive.
• Regular plural suffix -s (also spelled -es), and irregular plurals:
Regular nouns: cat → cats, thrush → thrushes
Irregular nouns: mouse → mice, ox → oxen
• While the regular plural is spelled -s after most nouns, it is spelled -es after
words ending in -s (ibis/ibises), -z (waltz/waltzes), -sh (thrush/thrushes), -ch
(finch/finches), and sometimes -x (box/boxes). Nouns ending in -y preceded
by a consonant change the -y to -i (butterfly/butterflies).
• The possessive suffix is realized by apostrophe + -s for regular singular nouns
(llama’s) and plural nouns not ending in -s (children’s) and often by a lone
apostrophe after regular plural nouns (llamas’) and some names ending in -s
or -z (Euripides’ comedies).
14
INFLECTIONAL ENGLISH
• English verbal inflection is more complicated than nominal inflection.
• main verbs (eat, sleep, impeach), modal verbs (can, will, should), and primary
verbs (be, have, do).
• Morphological Form Classes Regularly Inflected Verbs
stem walk merge try map
-s form walks merges tries maps
-ing participle walking merging trying mapping
Past form or -ed participle walked merged tried mapped
• We can predict the other forms by adding one of three predictable endings
and making some regular spelling changes.
15
INFLECTIONAL MORPHOLOGY
• The irregular verbs are those that have some more or less idiosyncratic
forms of inflection. Irregular verbs in English often have five different forms,
but can have as many as eight (e.g., the verb be) or as few as three (e.g. cut
or hit).
• Morphological Form Classes Irregularly Inflected Verbs
stem eat catch cut
-s form eats catches cuts
-ing participle eating catching cutting
Past form ate caught cut
-ed/-en participle eaten caught cut
Morphologically rich languages have much more complex verbal inflectional paradigms.
16
DERIVATIONAL ENGLISH
• While English inflection is relatively simple compared to other
languages, derivation in English is quite complex.
• A very common kind of derivation in English is the formation of new
nouns, often from verbs or adjectives. This process is called
nominalization.
• For example, the suffix -ation produces nouns from verbs, often verbs ending in the
suffix -ize (computerize → computerization).
17
COMPOUNDING ENGLISH
• Most English compound nouns are noun phrases (i.e. nominal phrases)
that include a noun modified by adjectives or noun adjuncts.
• The monoword forms in which two usually moderately short words appear
together as one. Examples are housewife, lawsuit, wallpaper, basketball, etc.
• The hyphenated form in which two or more words are connected by a hyphen.
Compounds that contain affixes, such as house-build(er) and single-
mind(ed)(ness), as well as adjective-adjective compounds and verb-verb
compounds, such as blue-green and freeze-dried.
• Loose compounds: the open or spaced form consisting of newer combinations
of usually longer words, such as distance learning, player piano, lawn tennis,
etc.
18
Modifier Head Compound
noun noun football
adjective noun blackboard
verb noun breakwater
preposition noun underworld
noun adjective snow white
adjective adjective blue-green
verb adjective tumbledown
preposition adjective over-ripe
noun verb browbeat
adjective verb highlight
verb verb freeze-dry
preposition verb undercut
noun preposition love-in
adverb preposition forthwith
verb preposition takeout
preposition preposition without
FINITE-STATE MORPHOLOGICAL
PARSING
19
MORPHOLOGICAL FEATURES
some
some +Pron+NomObl+3P+Pl
some +Det+SP
features
<feature> +Noun+Pl
<feature> +Verb+Pres+3sg
that
that +Conj+Sub
that +Det+Sg
that +Pron+NomObl+3P+Sg
that +Pron+Rel+NomObl+3P+SP
<that> +Adv
• εργασία
• εργασία +Noun+Common+Fem+Sg+Acc
• εργασία +Noun+Common+Fem+Sg+Voc
• εργασία +Noun+Common+Fem+Sg+Nom
• υπάρχουν
• υπάρχω +Verb+Indic+Pres+P3+Pl+Imperf
+Active
• σχόλια
• σχόλιο +Noun+Common+Neut+Pl+Acc
• σχόλιο +Noun+Common+Neut+Pl+Voc
• σχόλιο +Noun+Common+Neut+Pl+Nom
• ανατροφοδότησης
• ανατροφοδότηση +Noun+Common+Fem+Sg+
Gen
20
MORPHOLOGICAL FEATURES
• The features specify additional information about the stem.
• For example the feature +N means that the word is a noun; +Sg means it is
singular, +Pl that it is plural. (check also Chapter 5 and Chapter 16); for now,
consider +Sg to be a primitive unit that means “singular”.
• Greek has some features that don’t occur in English; for example the
nouns εργασία and ανατροφοδότησης are marked +Fem (feminine).
• Note that some of the input forms will be ambiguous between different
morphological parses. For now, we will consider the goal of
morphological parsing merely to list all possible parses.
21
BUILDING A MORPHOLOGICAL PARSER
• lexicon: the list of stems and affixes, together with basic information
about them (whether a stem is a Noun stem or a Verb stem, etc.).
• morphotactics: the model of morpheme ordering that explains which
classes of morphemes can follow other classes of morphemes inside a
word. For example, the fact that the English plural morpheme follows
the noun rather than preceding it is a morphotactic fact.
• orthographic rules: these spelling rules are used to model the
changes that occur in a word, usually when two morphemes combine
(e.g., the y→ie spelling rule discussed above that changes city + -s to
cities rather than citys).
22
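As a hedged illustration of how these three knowledge sources might be laid out in code (the stems, class names, and rule names below are toy choices, not a full parser):

```python
# Toy illustration of the three knowledge sources for a morphological parser.
# Lexicon: stems and affixes with their basic class information.
LEXICON = {
    "reg-noun":      {"fox", "cat", "aardvark"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "plural(-s)":    {"s"},
}

# Morphotactics: which morpheme class may follow which class inside a word
# (the plural affix follows a regular noun stem, it never precedes it).
MORPHOTACTICS = {
    "reg-noun":      ["plural(-s)", "END"],
    "irreg-sg-noun": ["END"],
    "irreg-pl-noun": ["END"],
    "plural(-s)":    ["END"],
}

# Orthographic rules: only named here; each models a spelling change that
# applies when two morphemes combine (e.g. city + -s -> cities, not citys).
ORTHOGRAPHIC_RULES = ["e-insertion", "y-replacement", "consonant-doubling"]
```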
BUILDING A
FINITE-STATE
LEXICON
A lexicon is a repository for words. The simplest possible lexicon would consist of an explicit list of every word of the language (every word, i.e., including abbreviations (“AAA”) and proper names (“Jane” or “Beijing”)) as follows:
a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, . . .
Since it is inconvenient or impossible to list every word in the language, computational lexicons are usually structured with a list of each of the stems and affixes of the language, together with a representation of the morphotactics that tells us how they can fit together.
23
FINITE-STATE FOR NOMINAL PLURAL
24
How can we expand this finite-state transducer?
FINITE-STATE FOR
VERBAL TYPES
25
FINITE-STATE
FOR ADJECTIVES
• big, bigger, biggest, cool, cooler, coolest, coolly
• happy, happier, happiest, happily
• red, redder, reddest
• unhappy, unhappier, unhappiest, unhappily
• real, unreal, really
• clear, clearer, clearest, clearly, unclear, unclearly
26
FINITE-STATE FOR
DERIVATION
• E.g., from fossilize we can predict the word fossilization by following states q0, q1, and q2.
• Similarly, adjectives ending in -al or -able at q5 (equal, formal, realizable) can take the suffix -ity, or sometimes the suffix -ness to state q6 (naturalness, casualness).
27
MORPHOLOGICAL
RECOGNITION
• We can now use these FSAs to solve the problem of morphological recognition: determining whether an input string of letters makes up a legitimate English word or not.
• We do this by taking the morphotactic FSAs and plugging each “sublexicon” into the FSA; that is, we expand each arc (e.g., the reg-noun-stem arc) with all the morphemes that make up the set of reg-noun-stem.
28
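A hedged sketch of this idea: a tiny morphotactic FSA whose arcs carry sublexicon labels that are expanded into their morphemes at recognition time. The sublexicon contents and state names are toy choices.

```python
# Sketch of morphological recognition: arcs of the morphotactic FSA carry
# sublexicon labels, and each label is expanded into its morphemes on the fly.
SUBLEXICONS = {
    "reg-noun-stem": {"fox", "cat", "aardvark"},
    "plural-s":      {"s"},
}
ARCS = {                     # state -> list of (sublexicon label, target state)
    "q0": [("reg-noun-stem", "q1")],
    "q1": [("plural-s", "q2")],
}
ACCEPTING = {"q1", "q2"}     # a bare stem or stem + plural is a word

def recognize(word: str, state: str = "q0") -> bool:
    if word == "":
        return state in ACCEPTING
    for label, target in ARCS.get(state, []):
        for morpheme in SUBLEXICONS[label]:
            if word.startswith(morpheme) and recognize(word[len(morpheme):], target):
                return True
    return False

print(recognize("cats"), recognize("fox"), recognize("catss"))   # True True False
```

Note that this recognizer happily accepts foxs and rejects foxes, which is exactly the problem the orthographic-rule transducers later in the lecture are meant to solve.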
FINITE-STATE TRANSDUCERS
29
FINITE-STATE TRANSDUCER: DEFINITION
• A transducer maps between one representation and another; a finite-
state transducer (FST) is a type of finite automaton which maps
between two sets of symbols.
• We can visualize an FST as a two-tape automaton which recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc
in the finite-state machine with two symbol strings, one from each tape.
• More general function than an FSA; where an FSA defines a formal
language by defining a set of strings, an FST defines a relation between
sets of strings.
• Another way of looking at an FST is as a machine that reads one string
and generates another.
30
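A minimal sketch of the “machine that reads one string and generates another” view; the arcs below, mapping lexical goose to surface geese, are toy choices.

```python
# Minimal FST sketch: each arc is labelled with an input:output pair, so the
# machine reads the upper tape and writes the lower tape. Toy case: goose -> geese.
ARCS = {                        # (state, input symbol) -> (output symbol, next state)
    ("q0", "g"): ("g", "q1"),
    ("q1", "o"): ("e", "q2"),   # o on the upper tape pairs with e on the lower tape
    ("q2", "o"): ("e", "q3"),
    ("q3", "s"): ("s", "q4"),
    ("q4", "e"): ("e", "q5"),
}
FINAL = {"q5"}

def transduce(upper: str):
    state, lower = "q0", []
    for symbol in upper:
        if (state, symbol) not in ARCS:
            return None          # no arc: this pair of strings is not in the relation
        output, state = ARCS[(state, symbol)]
        lower.append(output)
    return "".join(lower) if state in FINAL else None

print(transduce("goose"))        # geese
```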
“FOUR-FOLD WAY” OF TRANSDUCERS
• FST as recognizer: a transducer that takes a pair of strings as input and
outputs accept if the string-pair is in the string-pair language, and reject
if it is not.
• FST as generator: a machine that outputs pairs of strings of the
language. Thus the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and outputs another
string.
• FST as set relater: a machine that computes relations between sets.
31
PARAMETERS OF FST
• Q: a finite set of N states q0, q1, . . . , qN−1
• Σ: a finite set corresponding to the input alphabet
• Δ: a finite set corresponding to the output alphabet
• q0 ∈ Q: the start state
• F ⊆ Q: the set of final states
• δ(q,w): the transition function or transition matrix between states. Given a state q ∈ Q
and a string w ∈ Σ∗, δ(q,w) returns a set of new states Q′ ⊆ Q. δ is thus a function from
Q×Σ∗ to 2^Q (the power set of Q); it returns a set of states rather than a single state
because a given input may be ambiguous in which state it maps to.
• σ(q,w): the output function giving the set of possible output strings for each state and
input. Given a state q ∈ Q and a string w ∈ Σ∗, σ(q,w) gives a set of output strings, each
a string o ∈ Δ∗. σ is thus a function from Q×Σ∗ to 2^(Δ∗).
32
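The same seven components, bundled into one structure as an illustrative sketch (the field names and typing are my own choices):

```python
# The formal definition above as a single structure (illustrative sketch).
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class FST:
    Q: Set[str]                              # finite set of states
    Sigma: Set[str]                          # input alphabet Σ
    Delta: Set[str]                          # output alphabet Δ
    q0: str                                  # start state
    F: Set[str]                              # final states, F ⊆ Q
    delta: Callable[[str, str], Set[str]]    # δ(q, w): set of successor states
    sigma: Callable[[str, str], Set[str]]    # σ(q, w): set of output strings
```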
REGULAR RELATIONS
• Regular relations are sets of pairs of strings, a natural extension of the
regular languages, which are sets of strings.
FSTs have two additional closure properties that turn out to be extremely
useful:
• inversion: The inversion of a transducer T (T−1) simply switches the input and
output labels. Thus if T maps from the input alphabet I to the output alphabet
O, T−1 maps from O to I.
• composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1
to O2, then T1 ◦ T2 maps from I1 to O2.
33
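Viewing a transducer extensionally, as the regular relation (set of string pairs) it defines, makes both closure properties easy to state; the toy relations below are illustrative.

```python
# Inversion and composition over toy regular relations (sets of string pairs).
T1 = {("fox+N+Pl", "fox^s#"), ("goose+N+Pl", "geese^#")}    # lexical -> intermediate
T2 = {("fox^s#", "foxes"),    ("geese^#", "geese")}          # intermediate -> surface

def invert(T):
    """T^-1 simply switches the input and output labels."""
    return {(o, i) for (i, o) in T}

def compose(T_a, T_b):
    """T_a ∘ T_b maps i to o whenever T_a maps i to m and T_b maps m to o."""
    return {(i, o) for (i, m) in T_a for (m2, o) in T_b if m == m2}

print(compose(T1, T2))   # {('fox+N+Pl', 'foxes'), ('goose+N+Pl', 'geese')} (order may vary)
```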
FST AS MORPHOLOGICAL PARSER
Coming soon…
34
FINITE-STATE
MORPHOLOGY
• In the finite-state morphology paradigm, we represent a word as a correspondence between a lexical level, which represents a concatenation of morphemes making up a word, and the surface level, which represents the concatenation of letters which make up the actual spelling of the word.
35
FINITE-STATE MORPHOLOGY
• For finite-state morphology it’s convenient to view an FST as having two
tapes. The upper or lexical tape is composed of characters from one
alphabet Σ. The lower or surface tape is composed of characters from
another alphabet Δ.
• In two-level morphology (Koskenniemi 1983), we allow each arc to have
only a single symbol from each alphabet.
• We can then combine the two symbol alphabets Σ and Δ to create a new
alphabet, Σ′, which makes the relationship to FSAs quite clear. Σ′ is a finite
alphabet of complex symbols. Each complex symbol is composed of an
input-output pair i : o: one symbol i from the input alphabet Σ, and one
symbol o from the output alphabet Δ, thus Σ′ ⊆ Σ×Δ. Σ and Δ may each also
include the epsilon symbol ε.
36
FSM: FEASIBLE PAIRS
• e.g. Σ′ = {a : a, b : b, ! : !, a : !, a : ε, ε : !}
In two-level morphology, the pairs of symbols in Σ′ are also called feasible
pairs.
Thus each feasible pair symbol a : b in the transducer alphabet Σ′ expresses
how the symbol a from one tape is mapped to the symbol b on the other
tape. For example a : ε means that an a on the upper tape will correspond
to nothing on the lower tape.
Just as for an FSA, we can write regular expressions in the complex
alphabet Σ′.
Since it’s most common for symbols to map to themselves, in two-level
morphology we call pairs like a : a default pairs, and just refer to them by
the single letter a.
37
BUILDING A FST MORPHOPARSER
• Let’s build an FST morphological parser out of our earlier morphotactic
FSAs and lexica by adding an extra “lexical” tape and the appropriate
morphological features (see the sketch below).
• These include the nominal morphological features (+Sg and +Pl) that correspond to
each morpheme.
• The symbol ^ indicates a morpheme boundary, while the symbol #
indicates a word boundary.
• The morphological features map to the empty string ε or to the boundary
symbols, since there is no segment corresponding to them on the output
tape.
38
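A hedged sketch of the lexical-to-intermediate mapping for regular nouns, with +N realized as the empty string and +Pl as ^s# as on the slide; the output chosen for +Sg below is my assumption.

```python
# Sketch: realizing the nominal features on the intermediate tape.
# +N -> empty string, +Pl -> "^s#" (per the slide); +Sg -> "#" is an assumption.
FEATURE_OUTPUT = {"+N": "", "+Pl": "^s#", "+Sg": "#"}

def lexical_to_intermediate(stem: str, features: list) -> str:
    return stem + "".join(FEATURE_OUTPUT[f] for f in features)

print(lexical_to_intermediate("fox", ["+N", "+Pl"]))   # fox^s#
print(lexical_to_intermediate("cat", ["+N", "+Sg"]))   # cat#
```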
AN FST-PARSING OF PLURAL
39
COMPLEXING THE FST-LEXICON
• To build a morphological noun parser, the transducer needs to be expanded with all the
individual regular and irregular noun stems, replacing labels like reg-noun.
• In order to do this we need to update the lexicon for this transducer, so
that irregular plurals like geese will parse into the correct stem goose
+N +Pl. We do this by allowing the lexicon to also have two levels. Since
surface geese maps to lexical goose, the new lexical entry will be “g:g o:e
o:e s:s e:e”.
• Regular forms are simpler; the two-level entry for fox will now be “f:f o:o
x:x”, but by relying on the orthographic convention that f stands for f:f
and so on, we can simply refer to it as fox and the form for geese as “g
o:e o:e s e”.
40
COMPLEXING THE FST-LEXICON
reg-noun    irreg-pl-noun     irreg-sg-noun
fox         g o:e o:e s e     goose
cat         sheep             sheep
aardvark    m o:i u:ε s:c e   mouse
41
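A small sketch of the convention that a bare letter abbreviates its default pair (f stands for f:f), expanding entries like those in the table above into explicit upper:lower pairs:

```python
# Expanding a two-level lexicon entry into explicit (upper, lower) feasible pairs;
# a bare symbol such as "f" abbreviates the default pair f:f.
def expand_entry(entry: str):
    pairs = []
    for token in entry.split():
        upper, _, lower = token.partition(":")
        pairs.append((upper, lower if lower else upper))
    return pairs

print(expand_entry("f o x"))          # [('f', 'f'), ('o', 'o'), ('x', 'x')]
print(expand_entry("g o:e o:e s e"))  # [('g', 'g'), ('o', 'e'), ('o', 'e'), ('s', 's'), ('e', 'e')]
```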
FSM: THE INTERMEDIATE LEVEL
c:c a:a t:t +N:ε +Pl:ˆs#
• Since the output symbols include the morpheme and word boundary
markers ˆ and #, the lower labels do not correspond exactly to the
surface level.
• Hence we refer to tapes with these morpheme boundary markers as
intermediate tapes; the next section will show how the boundary
marker is removed.
42
TRANSDUCERS AND
ORTHOGRAPHIC RULES
43
ORTHOGRAPHIC RULES
• But just concatenating the morphemes won’t work for cases where there
is a spelling change;
• it would incorrectly reject an input like foxes and accept an input like
foxs. We need to deal with the fact that English often requires spelling
changes at morpheme boundaries by introducing spelling rules (or
orthographic rules).
• In general, the ability to implement rules as a transducer turns out to be
useful throughout speech and language processing.
44
ORTHOGRAPHIC RULES
Name                 Description of Rule                              Example
Consonant Doubling   1-letter consonant doubled before -ing/-ed       beg/begging
E deletion           silent e dropped before -ing and -ed             make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s     watch/watches
Y replacement        -y changes to -ie before -s, to -i before -ed    try/tries
K insertion          verbs ending with vowel + -c add -k              panic/panicked
45
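As a hedged illustration, here are two of these rules applied with regular expressions to intermediate forms that still carry the ^ morpheme boundary; a real system would compile such rules into transducers, and the patterns below are simplified.

```python
# Simplified sketch of two spelling rules over intermediate forms with "^" boundaries.
import re

def apply_spelling_rules(intermediate: str) -> str:
    surface = intermediate
    # E insertion: e added after -s, -z, -x, -ch, -sh before -s (fox^s# -> foxes#).
    surface = re.sub(r"([sxz]|[cs]h)\^s", r"\1es", surface)
    # Y replacement: -y changes to -ie before -s (try^s# -> tries#).
    surface = re.sub(r"([^aeiou])y\^s", r"\1ies", surface)
    # Otherwise just delete the boundary markers.
    return surface.replace("^", "").replace("#", "")

for form in ["fox^s#", "watch^s#", "try^s#", "cat^s#"]:
    print(form, "->", apply_spelling_rules(form))
# fox^s# -> foxes, watch^s# -> watches, try^s# -> tries, cat^s# -> cats
```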
ORTHOGRAPHIC RULES 2
• This rule says something like “insert an e on the surface tape just when the
lexical tape has a morpheme ending in x (or z, etc) and the next morpheme
is -s”.
ε -> e / {x, s, z} ^ __ s#
46
FST FOR ORTHOGRAPHY
47
FST FOR ORTHOGRAPHY: EXPLANATION
• This rule is used to ensure that we can only see the ε:e pair if we are in
the proper context.
• So state q0, which models having seen only default pairs unrelated to
the rule, is an accepting state, as is q1, which models having seen a z, s,
or x.
• q2 models having seen the morpheme boundary after the z, s, or x, and
again is an accepting state.
• State q3 models having just seen the E-insertion; it is not an accepting
state, since the insertion is only allowed if it is followed by the s
morpheme and then the end-of-word symbol #.
48
FST FOR ORTHOGRAPHY: EXPLANATION
• The other symbol passes through any parts of words that don’t play a role in
the E-insertion rule. Other means “any feasible pair that is not in this
transducer”.
• So for example when leaving state q0, we go to q1 on the z, s, or x symbols, rather
than following the other arc and staying in q0.
• The semantics of other depends on what symbols are on other arcs; since #
is mentioned on some arcs, it is (by definition) not included in other, and
thus, for example, is explicitly mentioned on the arc from q2 to q0.
• State q5 is used to ensure that the e is always inserted whenever the
environment is appropriate; the transducer reaches q5 only when it has seen
an s after an appropriate morpheme boundary. If the machine is in state q5
and the next symbol is #, the machine rejects the string (because there is no
legal transition on # from q5).
49
FST FOR ORTHOGRAPHY: EXPLANATION
State\Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:           1     1     1     0     -     0    0
q1:           1     1     1     2     -     0    0
q2:           5     1     1     0     3     0    0
q3            4     -     -     -     -     -    -
q4            -     -     -     -     -     0    -
q5            1     1     1     2     -     -    0
(A colon after a state name marks an accepting state; “-” means there is no transition.)
50
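A hedged sketch that drives the transition table above over a sequence of feasible pairs; the helper function and the example pair sequences are illustrative.

```python
# The E-insertion transition table above as a dictionary, run over feasible pairs.
TABLE = {
    0: {"s:s": 1, "x:x": 1, "z:z": 1, "^:ε": 0, "#": 0, "other": 0},
    1: {"s:s": 1, "x:x": 1, "z:z": 1, "^:ε": 2, "#": 0, "other": 0},
    2: {"s:s": 5, "x:x": 1, "z:z": 1, "^:ε": 0, "ε:e": 3, "#": 0, "other": 0},
    3: {"s:s": 4},
    4: {"#": 0},
    5: {"s:s": 1, "x:x": 1, "z:z": 1, "^:ε": 2, "other": 0},
}
ACCEPTING = {0, 1, 2}                                  # states marked with ":" in the table
EXPLICIT = {"s:s", "x:x", "z:z", "^:ε", "ε:e", "#"}    # pairs mentioned in the transducer

def run(pairs) -> bool:
    state = 0
    for pair in pairs:
        key = pair if pair in EXPLICIT else "other"    # any other feasible pair is "other"
        if key not in TABLE.get(state, {}):
            return False                               # missing cell ("-"): reject
        state = TABLE[state][key]
    return state in ACCEPTING

# Intermediate fox^s# with the e inserted on the surface tape: accepted.
print(run(["other", "other", "x:x", "^:ε", "ε:e", "s:s", "#"]))   # True
# The same input without the inserted e is rejected: s:s from q2 leads to q5,
# and there is no transition on # from q5.
print(run(["other", "other", "x:x", "^:ε", "s:s", "#"]))          # False
```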
SUMMARY
• Morphological parsing is the process of finding the constituent morphemes in a word (e.g., cat +N
+PL for cats).
• English mainly uses prefixes and suffixes to express inflectional and derivational morphology.
• English inflectional morphology is relatively simple and includes person and number agreement (-s)
and tense markings (-ed and -ing).
• English derivational morphology is more complex and includes suffixes like -ation, -ness, -able as
well as prefixes like co- and re-.
• Many constraints on the English morphotactics (allowable morpheme sequences) can be
represented by finite automata.
• Finite-state transducers are an extension of finite-state automata that can generate output
symbols.
• Important operations for FSTs include composition, projection, and intersection.
• Finite-state morphology and two-level morphology are applications of finite-state transducers to
morphological representation and parsing.
• Spelling rules can be implemented as transducers.
51
READINGS
• Jurafsky, D. & J. Martin (2008). Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics and Speech
Recognition (2nd Edition). Chapter 3 (pp. 1-16).
Additional References:
• Μαρκόπουλος, Γ. (1997). Υπολογιστική Επεξεργασία του Ελληνικού Ονόματος.
Διδακτορική διατριβή (σσ. 99-106).
• Πετροπούλου, Ε. (2012). Η Σύνθεση με Δεσμευμένο Θέμα στην Αγγλική και τη
Νέα Ελληνική: Θεωρητική Ανάλυση και Υπολογιστική Επεξεργασία.
Διδακτορική διατριβή (σσ. 160-172).
52