ΓΛ545
COMPUTATIONAL LINGUISTICS
AND CORPORA
Athanasios N. Karasimos
akarasimos@gmail.com
MA in Linguistics | School of English Language and Literature
Aristotle University of Thessaloniki
Lecture 7 | Wed 15 Nov 2017
WORDS AND
TRANSDUCERS
2
LECTURE 5 RECAP
Language Modeling with N-Grams
3
FINITE-STATE AUTOMATA
• Any regular expression can be realized as a finite state automaton (FSA).
• An automaton implicitly defines a formal language as the set of strings the
automaton accepts.
• An automaton can use any set of symbols for its vocabulary, including
letters, words, or even graphic images.
• The behavior of a deterministic automaton (DFSA) is fully determined by
the state it is in.
• A non-deterministic automaton (NFSA) sometimes has to make a choice
between multiple paths to take given the same current state and next input.
• Any NFSA can be converted to a DFSA.
4
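To make the recap concrete, here is a minimal sketch (not from the slides) of a deterministic recognizer for a simple regular expression such as /baa+!/; the state numbering and the transition dictionary are illustrative choices.

```python
# Minimal DFSA sketch for the regular expression /baa+!/ ("sheeptalk").
# States and transitions are illustrative, not taken from the slides.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: one or more additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(string: str) -> bool:
    state = 0
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False           # no legal transition from this state: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

print(accepts("baaa!"), accepts("ba!"))   # True False
```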
MORPHOLOGY PART 2
Let's talk about WORDS
5
THE CASE OF PLURAL
• Simple cases of plural:
• woodchuck to woodchucks
• But what about fox, peccary, goose and fish?
• Orthographic rules: peccary to peccaries
• Morphological rules: fish with 0 plural suffix
goose to geese with vowel change
• Phonological rules: fox to foxes
6
MORPHOLOGICAL PARSING
• The task of recognizing that a word (like foxes) breaks down into
component morphemes (fox and -es) and of building a structured
representation of this fact is called morphological parsing.
• Parsing means taking an input and producing some sort of linguistic
structure for it.
• We use the term parsing very broadly, covering many kinds of
structures that might be produced (morphological, syntactic, semantic,
discourse), in the form of a string, a tree, or a network.
7
MORPHOLOGICAL PARSING
• Morphological parsing or stemming(?) applies to many affixes other
than plurals;
• for example we might need to take any English verb form ending in -ing
(going, talking, congratulating) and parse it into its verbal stem plus the -ing
morpheme.
• So given the surface or input form going, we might want to produce
the parsed form VERB-go + GERUND-ing.
• Morphological parsing is important throughout speech and language
processing. It plays a crucial role in Web search for morphologically
complex languages like Greek, Russian or German.
8
MORPHOLOGICAL PARSING
• Morphological parsing also plays a crucial role in part-of-speech tagging
for these morphologically complex languages.
• It is important for producing the large dictionaries that are necessary for
robust spell-checking.
• It is necessary in machine translation to realize for example that the
French words va and aller should both translate to forms of the English
verb go.
9
SURVEY OF ENGLISH
MORPHOLOGY
10
A FAMILIAR FACE: MORPHEMES
• A morpheme is often defined as the minimal meaning-bearing unit in a
language.
• So for example the word fox consists of a single morpheme (the morpheme fox)
while the word cats consists of two: the morpheme cat and the morpheme -s.
• As this example suggests, we often distinguish two broad classes of
morphemes: stems and affixes.
• Affixes: divided into prefixes, suffixes, infixes, and circumfixes. Prefixes
precede the stem, suffixes follow the stem, circumfixes do both, and infixes
are inserted inside the stem.
• Circumfixes: [German] past participles (ge- and -en/-t)
• Infixes: [Tagalog] affix um, which marks the agent of an action, is infixed to
the stem hingi “borrow” to produce humingi.
11
WORD FORMATION PROCESSES
• Four processes are common and play important roles in speech and
language generation: inflection, derivation, compounding, and
cliticization.
• Inflection is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of the same class as the original stem, and usually filling
some syntactic function like agreement.
• Derivation is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of a different class, often with a meaning hard to predict
exactly.
• Compounding is the combination of multiple word stems together.
• Cliticization is the combination of a word stem with a clitic. A clitic is a morpheme
that acts syntactically like a word, but is reduced in form and attached
(phonologically and sometimes orthographically) to another word.
12
MORPHOLOGICAL TASKS
• TASK I:
• Give two examples of each word formation process.
• TASK II:
• Consider possible problematic cases for morphological parsing (inflection, derivation, compounding).
• TASK III:
• Test these cases with a morphological parser.
• http://nlpdotnet.com/services/Morphparser.aspx
• TASK IV:
• Ambiguity of morphological parsing.
• https://open.xerox.com/Services/fst-nlp-tools/Consume/Morphological%20Analysis-176
• http://langrid.org/playground/morphological-analyzer.html
13
INFLECTIONAL ENGLISH
• Nominal suffixes: an affix that marks plural and an affix that marks
possessive.
• Regular plural suffix -s (also spelled -es), and irregular plurals:
Regular nouns: cat → cats, thrush → thrushes
Irregular nouns: mouse → mice, ox → oxen
• While the regular plural is spelled -s after most nouns, it is spelled -es after
words ending in -s (ibis/ibises), -z (waltz/waltzes), -sh (thrush/thrushes), -ch
(finch/finches), and sometimes -x (box/boxes). Nouns ending in -y preceded
by a consonant change the -y to -i (butterfly/butterflies).
• The possessive suffix is realized by apostrophe + -s for regular singular nouns
(llama’s) and plural nouns not ending in -s (children’s) and often by a lone
apostrophe after regular plural nouns (llamas’) and some names ending in -s
or -z (Euripides’ comedies).
14
INFLECTIONAL ENGLISH
• English verbal inflection is more complicated than nominal inflection.
• main verbs (eat, sleep, impeach), modal verbs (can, will, should), and primary
verbs (be, have, do).
• Morphological Form Classes Regularly Inflected Verbs
stem walk merge try map
-s form walks merges tries maps
-ing participle walking merging trying mapping
Past form or -ed participle walked merged tried mapped
• We can predict the other forms by adding one of three predictable endings
and making some regular spelling changes.
15
INFLECTIONAL MORPHOLOGY
• The irregular verbs are those that have some more or less idiosyncratic
forms of inflection. Irregular verbs in English often have five different forms,
but can have as many as eight (e.g., the verb be) or as few as three (e.g. cut
or hit).
• Morphological Form Classes Irregularly Inflected Verbs
stem eat catch cut
-s form eats catches cuts
-ing participle eating catching cutting
Past form ate caught cut
-ed/-en participle eaten caught cut
Morphologically rich languages have much more complex verbal inflectional paradigms.
16
DERIVATIONAL ENGLISH
• While English inflection is relatively simple compared to other
languages, derivation in English is quite complex.
• A very common kind of derivation in English is the formation of new
nouns, often from verbs or adjectives. This process is called
nominalization.
• For example, the suffix -ation produces nouns from verbs, often verbs ending in the
suffix -ize (computerize → computerization).
17
COMPOUNDING ENGLISH
• Most English compound nouns are noun phrases (i.e. nominal phrases)
that include a noun modified by adjectives or noun adjuncts.
• The monoword forms in which two usually moderately short words appear
together as one. Examples are housewife, lawsuit, wallpaper, basketball, etc.
• The hyphenated form in which two or more words are connected by a hyphen.
Compounds that contain affixes, such as house-build(er) and single-
mind(ed)(ness), as well as adjective-adjective compounds and verb-verb
compounds, such as blue-green and freeze-dried.
• Loose compounds: the open or spaced form consisting of newer combinations
of usually longer words, such as distance learning, player piano, lawn tennis,
etc.
18
Modifier Head Compound
noun noun football
adjective noun blackboard
verb noun breakwater
preposition noun underworld
noun adjective snow white
adjective adjective blue-green
verb adjective tumbledown
preposition adjective over-ripe
noun verb browbeat
adjective verb highlight
verb verb freeze-dry
preposition verb undercut
noun preposition love-in
adverb preposition forthwith
verb preposition takeout
preposition preposition without
FINITE-STATE MORPHOLOGICAL
PARSING
19
MORPHOLOGICAL FEATURES
some
some +Pron+NomObl+3P+Pl
some +Det+SP
features
<feature> +Noun+Pl
<feature> +Verb+Pres+3sg
that
that +Conj+Sub
that +Det+Sg
that +Pron+NomObl+3P+Sg
that +Pron+Rel+NomObl+3P+SP
<that> +Adv
• εργασία
• εργασία +Noun+Common+Fem+Sg+Acc
• εργασία +Noun+Common+Fem+Sg+Voc
• εργασία +Noun+Common+Fem+Sg+Nom
• υπάρχουν
• υπάρχω +Verb+Indic+Pres+P3+Pl+Imperf
+Active
• σχόλια
• σχόλιο +Noun+Common+Neut+Pl+Acc
• σχόλιο +Noun+Common+Neut+Pl+Voc
• σχόλιο +Noun+Common+Neut+Pl+Nom
• ανατροφοδότησης
• ανατροφοδότηση +Noun+Common+Fem+Sg+
Gen
20
MORPHOLOGICAL FEATURES
• The features specify additional information about the stem.
• For example the feature +N means that the word is a noun; +Sg means it is
singular, +Pl that it is plural. (check also Chapter 5 and Chapter 16); for now,
consider +Sg to be a primitive unit that means “singular”.
• Greek has some features that don’t occur in English; for example the
nouns εργασία and ανατροφοδότησης are marked +Fem (feminine).
• Note that some of the input forms will be ambiguous between different
morphological parses. For now, we will consider the goal of
morphological parsing merely to list all possible parses.
21
BUILDING A MORPHOLOGICAL PARSER
• lexicon: the list of stems and affixes, together with basic information
about them (whether a stem is a Noun stem or a Verb stem, etc.).
• morphotactics: the model of morpheme ordering that explains which
classes of morphemes can follow other classes of morphemes inside a
word. For example, the fact that the English plural morpheme follows
the noun rather than preceding it is a morphotactic fact.
• orthographic rules: these spelling rules are used to model the
changes that occur in a word, usually when two morphemes combine
(e.g., the y→ie spelling rule discussed above that changes city + -s to
cities rather than citys).
22
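As a hedged illustration of how these three knowledge sources might be laid out in code (the stems, class names, and rule names below are toy choices, not a full parser):

```python
# Toy illustration of the three knowledge sources for a morphological parser.
# Lexicon: stems and affixes with their basic class information.
LEXICON = {
    "reg-noun":      {"fox", "cat", "aardvark"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "plural(-s)":    {"s"},
}

# Morphotactics: which morpheme class may follow which class inside a word
# (the plural affix follows a regular noun stem, it never precedes it).
MORPHOTACTICS = {
    "reg-noun":      ["plural(-s)", "END"],
    "irreg-sg-noun": ["END"],
    "irreg-pl-noun": ["END"],
    "plural(-s)":    ["END"],
}

# Orthographic rules: only named here; each models a spelling change that
# applies when two morphemes combine (e.g. city + -s -> cities, not citys).
ORTHOGRAPHIC_RULES = ["e-insertion", "y-replacement", "consonant-doubling"]
```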
BUILDING A
FINITE-STATE
LEXICON
A lexicon is a repository for words. The simplest possible lexicon would consist of an explicit list of every word of the language (every word, i.e., including abbreviations (“AAA”) and proper names (“Jane” or “Beijing”)) as follows:
a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, . . .
Since it is inconvenient or impossible to list every word in the language, computational lexicons are usually structured with a list of each of the stems and affixes of the language, together with a representation of the morphotactics that tells us how they can fit together.
23
FINITE-STATE FOR NOMINAL PLURAL
24
How can we expand this finite-state transducer?
FINITE-STATE FOR
VERBAL TYPES
25
FINITE-STATE
FOR ADJECTIVES
• big, bigger, biggest, cool, cooler, coolest, coolly
• happy, happier, happiest, happily
• red, redder, reddest
• unhappy, unhappier, unhappiest, unhappily
• real, unreal, really
• clear, clearer, clearest, clearly, unclear, unclearly
26
FINITE-STATE FOR
DERIVATION
• E.g., from fossilize we can predict the word fossilization by following states q0, q1, and q2.
• Similarly, adjectives ending in -al or -able at q5 (equal, formal, realizable) can take the suffix -ity, or sometimes the suffix -ness to state q6 (naturalness, casualness).
27
MORPHOLOGICAL
RECOGNITION
• We can now use these FSAs to solve the problem of morphological recognition: determining whether an input string of letters makes up a legitimate English word or not.
• We do this by taking the morphotactic FSAs and plugging each “sublexicon” into the FSA; that is, we expand each arc (e.g., the reg-noun-stem arc) with all the morphemes that make up the set of reg-noun-stem.
28
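A hedged sketch of this idea: a tiny morphotactic FSA whose arcs carry sublexicon labels that are expanded into their morphemes at recognition time. The sublexicon contents and state names are toy choices.

```python
# Sketch of morphological recognition: arcs of the morphotactic FSA carry
# sublexicon labels, and each label is expanded into its morphemes on the fly.
SUBLEXICONS = {
    "reg-noun-stem": {"fox", "cat", "aardvark"},
    "plural-s":      {"s"},
}
ARCS = {                     # state -> list of (sublexicon label, target state)
    "q0": [("reg-noun-stem", "q1")],
    "q1": [("plural-s", "q2")],
}
ACCEPTING = {"q1", "q2"}     # a bare stem or stem + plural is a word

def recognize(word: str, state: str = "q0") -> bool:
    if word == "":
        return state in ACCEPTING
    for label, target in ARCS.get(state, []):
        for morpheme in SUBLEXICONS[label]:
            if word.startswith(morpheme) and recognize(word[len(morpheme):], target):
                return True
    return False

print(recognize("cats"), recognize("fox"), recognize("catss"))   # True True False
```

Note that this recognizer happily accepts foxs and rejects foxes, which is exactly the problem the orthographic-rule transducers later in the lecture are meant to solve.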
FINITE-STATE TRANSDUCERS
29
FINITE-STATE TRANSDUCER: DEFINITION
• A transducer maps between one representation and another; a finite-
state transducer (FST) is a type of finite automaton which maps
between two sets of symbols.
• We can visualize an FST as a two-tape automaton which recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc
in the finite-state machine with two symbol strings, one from each tape.
• More general function than an FSA; where an FSA defines a formal
language by defining a set of strings, an FST defines a relation between
sets of strings.
• Another way of looking at an FST is as a machine that reads one string
and generates another.
30
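A minimal sketch of the “machine that reads one string and generates another” view; the arcs below, mapping lexical goose to surface geese, are toy choices.

```python
# Minimal FST sketch: each arc is labelled with an input:output pair, so the
# machine reads the upper tape and writes the lower tape. Toy case: goose -> geese.
ARCS = {                        # (state, input symbol) -> (output symbol, next state)
    ("q0", "g"): ("g", "q1"),
    ("q1", "o"): ("e", "q2"),   # o on the upper tape pairs with e on the lower tape
    ("q2", "o"): ("e", "q3"),
    ("q3", "s"): ("s", "q4"),
    ("q4", "e"): ("e", "q5"),
}
FINAL = {"q5"}

def transduce(upper: str):
    state, lower = "q0", []
    for symbol in upper:
        if (state, symbol) not in ARCS:
            return None          # no arc: this pair of strings is not in the relation
        output, state = ARCS[(state, symbol)]
        lower.append(output)
    return "".join(lower) if state in FINAL else None

print(transduce("goose"))        # geese
```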
“FOUR-FOLD WAY” OF TRANSDUCERS
• FST as recognizer: a transducer that takes a pair of strings as input and
outputs accept if the string-pair is in the string-pair language, and reject
if it is not.
• FST as generator: a machine that outputs pairs of strings of the
language. Thus the output is a yes or no, and a pair of output strings.
• FST as translator: a machine that reads a string and outputs another
string.
• FST as set relater: a machine that computes relations between sets.
31
PARAMETERS OF FST
• Q: a finite set of N states q0, q1, . . . , qN−1
• Σ: a finite set corresponding to the input alphabet
• Δ: a finite set corresponding to the output alphabet
• q0 ∈ Q: the start state
• F ⊆ Q: the set of final states
• δ(q,w): the transition function or transition matrix between states. Given a state q ∈ Q
and a string w ∈ Σ∗, δ(q,w) returns a set of new states Q′ ⊆ Q. δ is thus a function from
Q×Σ∗ to 2^Q (the power set of Q); it returns a set of states rather than a single state
because a given input may be ambiguous in which state it maps to.
• σ(q,w): the output function giving the set of possible output strings for each state and
input. Given a state q ∈ Q and a string w ∈ Σ∗, σ(q,w) gives a set of output strings, each
a string o ∈ Δ∗. σ is thus a function from Q×Σ∗ to 2^(Δ∗).
32
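The same seven components, bundled into one structure as an illustrative sketch (the field names and typing are my own choices):

```python
# The formal definition above as a single structure (illustrative sketch).
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class FST:
    Q: Set[str]                              # finite set of states
    Sigma: Set[str]                          # input alphabet Σ
    Delta: Set[str]                          # output alphabet Δ
    q0: str                                  # start state
    F: Set[str]                              # final states, F ⊆ Q
    delta: Callable[[str, str], Set[str]]    # δ(q, w): set of successor states
    sigma: Callable[[str, str], Set[str]]    # σ(q, w): set of output strings
```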
REGULAR RELATIONS
• Regular relations are sets of pairs of strings, a natural extension of the
regular languages, which are sets of strings.
FSTs have two additional closure properties that turn out to be extremely
useful:
• inversion: The inversion of a transducer T (T−1) simply switches the input and
output labels. Thus if T maps from the input alphabet I to the output alphabet
O, T−1 maps from O to I.
• composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1
to O2, then T1 ◦ T2 maps from I1 to O2.
33
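Viewing a transducer extensionally, as the regular relation (set of string pairs) it defines, makes both closure properties easy to state; the toy relations below are illustrative.

```python
# Inversion and composition over toy regular relations (sets of string pairs).
T1 = {("fox+N+Pl", "fox^s#"), ("goose+N+Pl", "geese^#")}    # lexical -> intermediate
T2 = {("fox^s#", "foxes"),    ("geese^#", "geese")}          # intermediate -> surface

def invert(T):
    """T^-1 simply switches the input and output labels."""
    return {(o, i) for (i, o) in T}

def compose(T_a, T_b):
    """T_a ∘ T_b maps i to o whenever T_a maps i to m and T_b maps m to o."""
    return {(i, o) for (i, m) in T_a for (m2, o) in T_b if m == m2}

print(compose(T1, T2))   # {('fox+N+Pl', 'foxes'), ('goose+N+Pl', 'geese')} (order may vary)
```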
FST AS MORPHOLOGICAL PARSER
Coming soon…
34
FINITE-STATE
MORPHOLOGY
• In the finite-state morphology paradigm, we represent a word as a correspondence between a lexical level, which represents a concatenation of morphemes making up a word, and the surface level, which represents the concatenation of letters which make up the actual spelling of the word.
35
FINITE-STATE MORPHOLOGY
• For finite-state morphology it’s convenient to view an FST as having two
tapes. The upper or lexical tape is composed of characters from one
alphabet Σ. The lower or surface tape is composed of characters from
another alphabet Δ.
• In two-level morphology (Koskenniemi 1983), we allow each arc to have
only a single symbol from each alphabet.
• We can then combine the two symbol alphabets Σ and Δ to create a new
alphabet, Σ′, which makes the relationship to FSAs quite clear. Σ′ is a finite
alphabet of complex symbols. Each complex symbol is composed of an
input-output pair i : o: one symbol i from the input alphabet Σ, and one
symbol o from the output alphabet Δ, thus Σ′ ⊆ Σ×Δ. Σ and Δ may each also
include the epsilon symbol ε.
36
FSM: FEASIBLE PAIRS
• e.g. Σ′ = {a : a, b : b, ! : !, a : !, a : ε, ε : !}
In two-level morphology, the pairs of symbols in Σ′ are also called feasible
pairs.
Thus each feasible pair symbol a : b in the transducer alphabet Σ′ expresses
how the symbol a from one tape is mapped to the symbol b on the other
tape. For example a : ε means that an a on the upper tape will correspond
to nothing on the lower tape.
Just as for an FSA, we can write regular expressions in the complex
alphabet Σ′.
Since it’s most common for symbols to map to themselves, in two-level
morphology we call pairs like a : a default pairs, and just refer to them by
the single letter a.
37
BUILDING A FST MORPHOPARSER
• Let’s build an FST morphological parser out of our earlier morphotactic
FSAs and lexica by adding an extra “lexical” tape and the appropriate
morphological features (see the sketch below).
• These include the nominal morphological features (+Sg and +Pl) that correspond to
each morpheme.
• The symbol ^ indicates a morpheme boundary, while the symbol #
indicates a word boundary.
• The morphological features map to the empty string ε or to the boundary
symbols, since there is no segment corresponding to them on the output
tape.
38
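A hedged sketch of the lexical-to-intermediate mapping for regular nouns, with +N realized as the empty string and +Pl as ^s# as on the slide; the output chosen for +Sg below is my assumption.

```python
# Sketch: realizing the nominal features on the intermediate tape.
# +N -> empty string, +Pl -> "^s#" (per the slide); +Sg -> "#" is an assumption.
FEATURE_OUTPUT = {"+N": "", "+Pl": "^s#", "+Sg": "#"}

def lexical_to_intermediate(stem: str, features: list) -> str:
    return stem + "".join(FEATURE_OUTPUT[f] for f in features)

print(lexical_to_intermediate("fox", ["+N", "+Pl"]))   # fox^s#
print(lexical_to_intermediate("cat", ["+N", "+Sg"]))   # cat#
```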
AN FST-PARSING OF PLURAL
39
COMPLEXING THE FST-LEXICON
• To build a morphological noun parser, the transducer needs to be expanded with all the
individual regular and irregular noun stems, replacing labels like reg-noun.
• In order to do this we need to update the lexicon for this transducer, so
that irregular plurals like geese will parse into the correct stem goose
+N +Pl. We do this by allowing the lexicon to also have two levels. Since
surface geese maps to lexical goose, the new lexical entry will be “g:g o:e
o:e s:s e:e”.
• Regular forms are simpler; the two-level entry for fox will now be “f:f o:o
x:x”, but by relying on the orthographic convention that f stands for f:f
and so on, we can simply refer to it as fox and the form for geese as “g
o:e o:e s e”.
40
COMPLEXING THE FST-LEXICON
reg-noun    irreg-pl-noun     irreg-sg-noun
fox         g o:e o:e s e     goose
cat         sheep             sheep
aardvark    m o:i u:ε s:c e   mouse
41
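A small sketch of the convention that a bare letter abbreviates its default pair (f stands for f:f), expanding entries like those in the table above into explicit upper:lower pairs:

```python
# Expanding a two-level lexicon entry into explicit (upper, lower) feasible pairs;
# a bare symbol such as "f" abbreviates the default pair f:f.
def expand_entry(entry: str):
    pairs = []
    for token in entry.split():
        upper, _, lower = token.partition(":")
        pairs.append((upper, lower if lower else upper))
    return pairs

print(expand_entry("f o x"))          # [('f', 'f'), ('o', 'o'), ('x', 'x')]
print(expand_entry("g o:e o:e s e"))  # [('g', 'g'), ('o', 'e'), ('o', 'e'), ('s', 's'), ('e', 'e')]
```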
FSM: THE INTERMEDIATE LEVEL
c:c a:a t:t +N:ε +Pl:ˆs#
• Since the output symbols include the morpheme and word boundary
markers ˆ and #, the lower labels do not correspond exactly to the
surface level.
• Hence we refer to tapes with these morpheme boundary markers as
intermediate tapes; the next section will show how the boundary
marker is removed.
42
TRANSDUCERS AND
ORTHOGRAPHIC RULES
43
ORTHOGRAPHIC RULES
• But just concatenating the morphemes won’t work for cases where there
is a spelling change;
• it would incorrectly reject an input like foxes and accept an input like
foxs. We need to deal with the fact that English often requires spelling
changes at morpheme boundaries by introducing spelling rules (or
orthographic rules).
• In general, the ability to implement rules as a transducer turns out to be
useful throughout speech and language processing.
44
ORTHOGRAPHIC RULES
Name                 Description of Rule                              Example
Consonant Doubling   1-letter consonant doubled before -ing/-ed       beg/begging
E deletion           silent e dropped before -ing and -ed             make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s     watch/watches
Y replacement        -y changes to -ie before -s, to -i before -ed    try/tries
K insertion          verbs ending with vowel + -c add -k              panic/panicked
45
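As a hedged illustration, here are two of these rules applied with regular expressions to intermediate forms that still carry the ^ morpheme boundary; a real system would compile such rules into transducers, and the patterns below are simplified.

```python
# Simplified sketch of two spelling rules over intermediate forms with "^" boundaries.
import re

def apply_spelling_rules(intermediate: str) -> str:
    surface = intermediate
    # E insertion: e added after -s, -z, -x, -ch, -sh before -s (fox^s# -> foxes#).
    surface = re.sub(r"([sxz]|[cs]h)\^s", r"\1es", surface)
    # Y replacement: -y changes to -ie before -s (try^s# -> tries#).
    surface = re.sub(r"([^aeiou])y\^s", r"\1ies", surface)
    # Otherwise just delete the boundary markers.
    return surface.replace("^", "").replace("#", "")

for form in ["fox^s#", "watch^s#", "try^s#", "cat^s#"]:
    print(form, "->", apply_spelling_rules(form))
# fox^s# -> foxes, watch^s# -> watches, try^s# -> tries, cat^s# -> cats
```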
ORTHOGRAPHIC RULES 2
• This rule says something like “insert an e on the surface tape just when the
lexical tape has a morpheme ending in x (or z, etc) and the next morpheme
is -s”.
ε -> e / {x, s, z} ^ __ s#
46
FST FOR ORTHOGRAPHY
47
FST FOR ORTHOGRAPHY: EXPLANATION
• This rule is used to ensure that we can only see the ε:e pair if we are in
the proper context.
• So state q0, which models having seen only default pairs unrelated to
the rule, is an accepting state, as is q1, which models having seen a z, s,
or x.
• q2 models having seen the morpheme boundary after the z, s, or x, and
again is an accepting state.
• State q3 models having just seen the E-insertion; it is not an accepting
state, since the insertion is only allowed if it is followed by the s
morpheme and then the end-of-word symbol #.
48
FST FOR ORTHOGRAPHY: EXPLANATION
• The other symbol passes through any parts of words that don’t play a role in
the E-insertion rule. Other means “any feasible pair that is not in this
transducer”.
• So for example when leaving state q0, we go to q1 on the z, s, or x symbols, rather
than following the other arc and staying in q0.
• The semantics of other depends on what symbols are on other arcs; since #
is mentioned on some arcs, it is (by definition) not included in other, and
thus, for example, is explicitly mentioned on the arc from q2 to q0.
• State q5 is used to ensure that the e is always inserted whenever the
environment is appropriate; the transducer reaches q5 only when it has seen
an s after an appropriate morpheme boundary. If the machine is in state q5
and the next symbol is #, the machine rejects the string (because there is no
legal transition on # from q5).
49
FST FOR ORTHOGRAPHY: EXPLANATION
State\Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:           1     1     1     0     -     0    0
q1:           1     1     1     2     -     0    0
q2:           5     1     1     0     3     0    0
q3            4     -     -     -     -     -    -
q4            -     -     -     -     -     0    -
q5            1     1     1     2     -     -    0
(A colon after a state name marks an accepting state; “-” means there is no transition.)
50
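A hedged sketch that drives the transition table above over a sequence of feasible pairs; the helper function and the example pair sequences are illustrative.

```python
# The E-insertion transition table above as a dictionary, run over feasible pairs.
TABLE = {
    0: {"s:s": 1, "x:x": 1, "z:z": 1, "^:ε": 0, "#": 0, "other": 0},
    1: {"s:s": 1, "x:x": 1, "z:z": 1, "^:ε": 2, "#": 0, "other": 0},
    2: {"s:s": 5, "x:x": 1, "z:z": 1, "^:ε": 0, "ε:e": 3, "#": 0, "other": 0},
    3: {"s:s": 4},
    4: {"#": 0},
    5: {"s:s": 1, "x:x": 1, "z:z": 1, "^:ε": 2, "other": 0},
}
ACCEPTING = {0, 1, 2}                                  # states marked with ":" in the table
EXPLICIT = {"s:s", "x:x", "z:z", "^:ε", "ε:e", "#"}    # pairs mentioned in the transducer

def run(pairs) -> bool:
    state = 0
    for pair in pairs:
        key = pair if pair in EXPLICIT else "other"    # any other feasible pair is "other"
        if key not in TABLE.get(state, {}):
            return False                               # missing cell ("-"): reject
        state = TABLE[state][key]
    return state in ACCEPTING

# Intermediate fox^s# with the e inserted on the surface tape: accepted.
print(run(["other", "other", "x:x", "^:ε", "ε:e", "s:s", "#"]))   # True
# The same input without the inserted e is rejected: s:s from q2 leads to q5,
# and there is no transition on # from q5.
print(run(["other", "other", "x:x", "^:ε", "s:s", "#"]))          # False
```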
SUMMARY
• Morphological parsing is the process of finding the constituent morphemes in a word (e.g., cat +N
+PL for cats).
• English mainly uses prefixes and suffixes to express inflectional and derivational morphology.
• English inflectional morphology is relatively simple and includes person and number agreement (-s)
and tense markings (-ed and -ing).
• English derivational morphology is more complex and includes suffixes like -ation, -ness, -able as
well as prefixes like co- and re-.
• Many constraints on the English morphotactics (allowable morpheme sequences) can be
represented by finite automata.
• Finite-state transducers are an extension of finite-state automata that can generate output
symbols.
• Important operations for FSTs include composition, projection, and intersection.
• Finite-state morphology and two-level morphology are applications of finite-state transducers to
morphological representation and parsing.
• Spelling rules can be implemented as transducers.
51
READINGS
• Jurafsky, D. & J. Martin (2008). Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics and Speech
Recognition (2nd Edition). Chapter 3 (pp. 1-16).
Additional References:
• Μαρκόπουλος, Γ. (1997). Υπολογιστική Επεξεργασία του Ελληνικού Ονόματος.
Διδακτορική διατριβή (σσ. 99-106).
• Πετροπούλου, Ε. (2012). Η Σύνθεση με Δεσμευμένο Θέμα στην Αγγλική και τη
Νέα Ελληνική: Θεωρητική Ανάλυση και Υπολογιστική Επεξεργασία.
Διδακτορική διατριβή (σσ. 160-172).
52