Morphological Processing
Natural Language Processing 1
• Morphology is the study of the way words are built from smaller meaningful units
called morphemes.
• We can divide morphemes into two broad classes.
– Stems – the core meaningful units, the root of the word.
– Affixes – add additional meanings and grammatical functions to words.
• Affixes are further divided into:
– Prefixes – precede the stem: do / undo
– Suffixes – follow the stem: eat / eats
– Infixes – are inserted inside the stem
– Circumfixes – precede and follow the stem
• English doesn’t stack many affixes onto a single word.
• But a Turkish word can carry many suffixes.
• Languages, such as Turkish, that tend to string affixes together are called agglutinative
languages.
Morphology
Natural Language Processing 2
• The surface level of a word represents the actual spelling of that word.
– geliyorum eats cats kitabım
• The lexical level of a word represents a simple concatenation of morphemes making
up that word.
– gel +PROG +1SG
– eat +AOR
– cat +PLU
– kitap +P1SG
• Morphological processors try to find correspondences between lexical and surface
forms of words.
– Morphological recognition – surface to lexical
– Morphological generation – lexical to surface
Surface and Lexical Forms
Natural Language Processing 3
• There are two broad classes of morphology:
– Inflectional morphology
– Derivational morphology
• After a combination with an inflectional morpheme, the meaning and class of the
actual stem usually do not change.
– eat / eats pencil / pencils
– gel / geliyorum masa / masam
• After a combination with a derivational morpheme, the meaning and the class of the
actual stem usually change.
– compute / computer do / undo friend / friendly
– Uygar / uygarlaş kapı / kapıcı
• Irregular changes may happen with derivational affixes.
Inflectional and Derivational Morphology
Natural Language Processing 4
• Nouns have simple inflectional morphology.
– plural -- cat / cats
– possessive -- John / John’s
• Verbs have slightly more complex, but still relatively simple, inflectional
morphology.
– past form -- walk / walked
– past participle form -- walk / walked
– gerund -- walk / walking
– singular third person -- walk / walks
• Verbs can be categorized as:
– main verbs
– modal verbs -- can, will, should
– primary verbs -- be, have, do
• Regular and irregular verbs: walk / walked -- go / went
English Inflectional Morphology
Natural Language Processing 5
• Some English derivational affixes
– -ation : transport / transportation
– -er : kill / killer
– -ness : fuzzy / fuzziness
– -al : computation / computational
– -able : break / breakable
– -less : help / helpless
– un- : do / undo
– re- : try / retry
English Derivational Morphology
Natural Language Processing 6
• Some of the inflectional suffixes that Turkish nouns can take:
– singular/plural : masa / masalar
– possessive markers : masam / masan / masası / masamız / masanız / masaları
– case markers :
• ablative : masadan
• accusative : masayı
• dative : masaya
• Some of the inflectional suffixes that Turkish verbs can take:
– tense : gel / geldi / geliyor / gelmiş / gelecek
– second tense : geliyordu / gelmişti / gelecekti
– agreement marker : geldim / geldin / geldi / geldik / geldiniz / geldiler
• There is an order among inflectional suffixes (morphotactics):
– masalarımdan -- masa +PLU +P1SG +ABL
– geliyordum -- gel +PROG +PAST +1SG
Turkish Inflectional Morphology
Natural Language Processing 7
• Turkish derivational morphology is very rich.
• Some of the derivational suffixes in Turkish:
– -cı : kapı / kapıcı
– -laş : uygar / uygarlaş
– -mek : gel / gelmek
– -cik : mini / minicik
– -li : Ankara / Ankaralı
Turkish Derivational Morphology
Natural Language Processing 8
• Morphological parsing is the task of finding the lexical form of a word from its surface form.
– cats -- cat +N +PLU
– cat -- cat +N +SG
– goose -- goose +N +SG or goose +V
– geese -- goose +N +PLU
– gooses -- goose +V +3SG
– catch -- catch +V
– caught -- catch +V +PAST or catch +V +PP
– geliyorum -- gel +V +PROG +1SG
– masalardan -- masa +N +PLU +ABL
• There can be more than one lexical-level representation for a given word (ambiguity).
Morphological Parsing
Natural Language Processing 9
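As a concrete picture of this ambiguity, the surface-to-lexical mapping can be thought of as returning a set of analyses. A minimal Python sketch (the tiny table only restates the examples from this slide; it is not a real parser):

# Hypothetical parse table: a surface form may map to several lexical analyses.
ANALYSES = {
    "cats":   ["cat +N +PLU"],
    "goose":  ["goose +N +SG", "goose +V"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}

def parse(surface):
    """Return every lexical-level analysis recorded for a surface form."""
    return ANALYSES.get(surface, [])

print(parse("caught"))   # ['catch +V +PAST', 'catch +V +PP'] -> ambiguous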
• For a morphological processor, we need at least the following:
• Lexicon : The list of stems and affixes together with basic information about them
such as their main categories (noun, verb, adjective, …) and their sub-categories
(regular noun, irregular noun, …).
• Morphotactics : The model of morpheme ordering that explains which classes of
morphemes can follow other classes of morphemes inside a word.
• Orthographic Rules (Spelling Rules) : These spelling rules are used to model
changes that occur in a word (normally when two morphemes combine).
Parts of A Morphological Processor
Natural Language Processing 10
• A lexicon is a repository for words (stems).
• They are grouped according to their main categories.
– noun, verb, adjective, adverb, …
• They may be also divided into sub-categories.
– regular-nouns, irregular-singular nouns, irregular-plural nouns, …
• The simplest way to create a morphological parser is to put all possible words (together
with their inflections) into the lexicon.
– We do not do this because the number of word forms is huge (theoretically, for Turkish,
it is infinite).
Lexicon
Natural Language Processing 11
• Morphotactics describes which morphemes can follow which morphemes.
Lexicon:
regular-noun    irregular-pl-noun    irregular-sg-noun    plural
fox             geese                goose                -s
cat             sheep                sheep
dog             mice                 mouse
• Simple English Nominal Inflection (Morphotactic Rules)
Morphotactics
Natural Language Processing 12
• This automaton only says yes or no; it does not give the lexical representation.
• It also accepts an ill-formed word (foxs), since spelling is not yet handled (a small
sketch of such a recognizer follows the figure below).
Combine Lexicon and Morphotactics
Natural Language Processing 13
[Figure: the FSA obtained by combining the lexicon and the morphotactics; its arcs spell out f-o-x, c-a-t, d-o-g, s-h-e-e-p, g-o-o-s-e / g-e-e-s-e and m-o-u-s-e / m-i-c-e, with a final -s arc for the regular plural.]
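A minimal Python sketch of the combined lexicon-plus-morphotactics recognizer above (the word lists are just the toy lexicon from the slide). As stated above, it answers only yes/no, and it happily accepts the ill-formed foxs because the spelling rules come later:

REG_NOUNS      = {"fox", "cat", "dog"}
IRREG_SG_NOUNS = {"goose", "sheep", "mouse"}
IRREG_PL_NOUNS = {"geese", "sheep", "mice"}

def accepts(word):
    """Yes/no recognizer for simple English nominal inflection."""
    if word in IRREG_SG_NOUNS or word in IRREG_PL_NOUNS:
        return True                          # irregular forms are listed as-is
    if word in REG_NOUNS:
        return True                          # bare regular noun
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True                          # regular noun + plural -s
    return False

print(accepts("cats"), accepts("geese"), accepts("foxs"))   # True True True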
• Two-level morphology represents the correspondence between lexical and surface
levels.
• We use a finite-state transducer (FST) to find the mapping between these two levels.
• A FST is a two-tape automaton:
– Reads from one tape, and writes to the other one.
• For morphological processing, one tape holds the lexical representation, and the second
one holds the surface form of a word.
Two-Level Morphology
Natural Language Processing 14
A FST is a 5-tuple (Q, Σ, q0, F, δ):
• Q : a finite set of N states q0, q1, … qN
• Σ : a finite alphabet of complex symbols.
– Each complex symbol is a pair of an input and an output symbol i:o,
– where i is a member of I (an input alphabet),
– and o is a member of O (an output alphabet).
– I and O may contain the empty string.
– So, Σ is a subset of I×O.
• q0 : the start state
• F : the set of final states -- F is a subset of Q
• δ(q, i:o) : the transition function
Formal Definition of FST (Mealy Machine)
Natural Language Processing 15
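A compact Python sketch of this definition: transitions are keyed by (state, input symbol) and return (next state, output symbol), which is exactly the i:o pairing above. The toy machine maps the lexical tape c a t +N +PL to the surface string cats; it only illustrates the machinery, not the slides' full English-noun transducer:

# DELTA maps (state, input symbol) -> (next state, output symbol);
# the empty string plays the role of є on the output side.
DELTA = {
    (0, "c"):   (1, "c"),
    (1, "a"):   (2, "a"),
    (2, "t"):   (3, "t"),
    (3, "+N"):  (4, ""),       # +N is realized as the empty string
    (4, "+PL"): (5, "s"),      # +PL is realized as surface s
}
START, FINALS = 0, {5}

def transduce(symbols):
    """Run the FST on a lexical tape; return the surface string, or None if rejected."""
    state, out = START, []
    for sym in symbols:
        if (state, sym) not in DELTA:
            return None
        state, piece = DELTA[(state, sym)]
        out.append(piece)
    return "".join(out) if state in FINALS else None

print(transduce(["c", "a", "t", "+N", "+PL"]))   # cats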
• Σ may not contain all possible pairs from I×O.
• For example:
– I = {a, b, c} O = {a, b, c, є}
– Σ = {a:a, b:b, c:c, a:є, b:є, c:є}
• feasible pairs – In two-level morphology terminology, the pairs in Σ are called
feasible pairs.
• default pair – Instead of a:a, we can use a single character (a) for this default pair.
• FSAs are isomorphic to regular languages, and FSTs are isomorphic to regular
relations (pair of strings of regular languages).
FST (cont.)
Natural Language Processing 16
• FSTs are closed under: union, inversion, and composition.
• union : The union of two regular relations is also a regular relation.
• inversion : The inversion of a FST simply switches the input and output labels.
– This means that the same FST can be used for both directions of a morphological
processor.
• composition : If T1 is a FST from I1 to O1 and T2 is a FST from O1 to O2, then
composition of T1 and T2 (T1oT2) maps from I1 to O2.
• We use these properties of FSTs in the creation of the FST for a morphological
processor.
FST Properties
Natural Language Processing 17
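Inversion is easy to picture in such a representation: swap the input and output labels on every arc and the same machine runs in the opposite direction. A one-function sketch, assuming arcs are stored as (state, input, output, next state) tuples:

def invert(arcs):
    """Swap input and output labels: if T maps a:b, invert(T) maps b:a."""
    return [(q, out_sym, in_sym, r) for (q, in_sym, out_sym, r) in arcs]

generation_arcs  = [(4, "+PL", "s", 5)]          # lexical -> surface direction
recognition_arcs = invert(generation_arcs)       # [(4, 's', '+PL', 5)]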
A FST for Simple English Nominals
Natural Language Processing 18
• A FST for stems, which maps roots to their root class:
reg-noun    irreg-pl-noun        irreg-sg-noun
fox         g o:e o:e s e        goose
cat         sheep                sheep
dog         m o:i u:є s:c e      mouse
• fox stands for f:f o:o x:x
• When these two transducers are composed, we have a FST which maps lexical forms
to intermediate forms of words for simple English noun inflections.
• The next thing we should handle is to design the FSTs for the orthographic rules and
combine all these transducers.
FST for stems
Natural Language Processing 19
• A frequently used FST idiom, called a cascade, is to have the output of one FST read in
as the input to a subsequent machine.
• So, to handle spelling we use three tapes:
– lexical, intermediate and surface
• We need one transducer to work between the lexical and intermediate levels, and a
second one (a bunch of FSTs) to work between the intermediate and surface levels to patch up
the spelling.
Multi-Level Multi-Tape Machines
Natural Language Processing 20
[Figure: the three tapes for dog +N +PL — lexical: d o g +N +PL, intermediate: d o g ^ s #, surface: d o g s.]
Lexical to Intermediate FST
Natural Language Processing 21
• We need FSTs to map the intermediate level to the surface level.
• For each spelling rule we will have a FST, and these FSTs run in parallel.
• Some of the English spelling rules:
– consonant doubling -- 1-letter consonant doubled before ing/ed -- beg/begging
– E deletion -- silent e dropped before ing and ed -- make/making
– E insertion -- e added after s, z, x, ch, sh before s -- watch/watches
– Y replacement -- y changes to ie before s, and to i before ed -- try/tries
– K insertion -- for verbs ending in vowel + c, we add k -- panic/panicked
• We represent these rules using two-level morphology rules (a small sketch of the
E-insertion rule follows this slide):
– a => b / c __ d rewrite a as b when it occurs between c and d.
Orthographic Rules
Natural Language Processing 22
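The E-insertion rule can be approximated as a rewrite over the intermediate form, with ^ as the morpheme boundary and # as the word boundary, as in the slides. A minimal Python/regex sketch of this one rule only; the other spelling rules would be written the same way:

import re

def e_insertion(intermediate):
    """є => e / {x,s,z}^ __ s# : insert e after x, s or z before the -s morpheme."""
    return re.sub(r"([xsz])\^s#", r"\1^es#", intermediate)

def to_surface(intermediate):
    """Drop the ^ and # boundary symbols to obtain the surface form."""
    return intermediate.replace("^", "").replace("#", "")

print(to_surface(e_insertion("fox^s#")))   # foxes
print(to_surface(e_insertion("cat^s#")))   # cats  (rule does not apply)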
E-insertion rule: є => e / {x,s,z}^ __ s#
• ^ (the morpheme boundary) means ^:є, i.e. it maps to the empty string on the surface.
FST for E-Insertion Rule
Natural Language Processing 23
E-insertion rule: є => e / {x,s,z}^ __ s#
FST for E-Insertion Rule
Natural Language Processing 24
Generating or Parsing with FST Lexicon and Rules
Natural Language Processing 25
Accepting foxes
Natural Language Processing 26
• We can intersect all rule FSTs to create a single FST.
• The intersection algorithm just takes the Cartesian product of states.
– For each state qi of the first machine and qj of the second machine, we create a
new state qij.
– For input symbol a, if the first machine would transition to state qn and the second
machine would transition to qm, the new machine transitions to qnm.
Intersection
Natural Language Processing 27
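A sketch of that product construction in Python, assuming each rule machine is given as a deterministic {(state, symbol): next_state} dictionary together with its set of final states; for rule FSTs, each feasible pair i:o is treated as a single symbol (the representation is my own, not the slides'):

from collections import deque

def intersect(delta1, finals1, delta2, finals2, start1=0, start2=0):
    """Build the product machine: state (qi, qj) simulates both machines;
    a transition on symbol a exists only if both machines can move on a."""
    delta, finals = {}, set()
    symbols = {a for (_, a) in delta1} | {a for (_, a) in delta2}
    seen, queue = {(start1, start2)}, deque([(start1, start2)])
    while queue:
        qi, qj = queue.popleft()
        if qi in finals1 and qj in finals2:
            finals.add((qi, qj))
        for a in symbols:
            if (qi, a) in delta1 and (qj, a) in delta2:
                target = (delta1[(qi, a)], delta2[(qj, a)])
                delta[((qi, qj), a)] = target
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return delta, finals, (start1, start2)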
• A cascade can turn out to be somewhat of a pain.
– it is hard to manage all the tapes
– it fails to take advantage of the restricting power of the machines
• So, it is better to compile the cascade into a single large machine.
• Create a new state (x,y) for every pair of states x ∈ Q1 and y ∈ Q2.
• The transition function of composition will be defined as follows:
δ((x,y),i:o) = (v,z) if
there exists c such that δ1(x,i:c) = v and δ2(y,c:o) = z
Composition
Natural Language Processing 28
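The definition above translates almost line by line into code. A sketch for epsilon-free, deterministic transducers stored as {(state, input): (next state, output)} dictionaries (again my own representation): a composed move exists whenever some intermediate symbol c links a move of T1 with a move of T2:

def compose(delta1, delta2):
    """delta((x,y), i:o) = (v,z) iff delta1(x, i:c) = v and delta2(y, c:o) = z for some c."""
    composed = {}
    for (x, i), (v, c) in delta1.items():
        for (y, c2), (z, o) in delta2.items():
            if c == c2:                              # the intermediate symbols must match
                composed[((x, y), i)] = ((v, z), o)
    return composed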
Intersect Rule FSTs
Natural Language Processing 29
LEXICON NOUNS
aba POST-NOUN;
aday POST-NOUN;
benzin POST-NOUN;
…
LEXICON POST-NOUN
+Noun:0 POST-NOUNR;
LEXICON POST-NOUNR
+A3pl:+lAr PLURAL;
+A3sg:0 PLURAL;
LEXICON PLURAL
+P1sg:+Hm POSSESSIVE;
+P2sg:+Hn POSSESSIVE;
+P1pl:+HmHz POSSESSIVE;
+P2pl:+HnHz POSSESSIVE;
+Pnon:0 POSSESSIVE;
+P3sg:+sH POSSESSIVE;
LEXICON POSSESSIVE
+Acc:+yH End;
+Dat:+yA End;
+Loc:+DA End;
+Abl:+DAn End;
+Gen:+nHn End;
+Ins:+ylA End;
+Nom:0 End;
Simplified Turkish Noun Morphotactics
in Foma Environment
Natural Language Processing 30
##### Turkish Foma 2016 ####
define ALPHABET [a | e | ı | i | o | ö | u | ü | A | H | … | b | c | ç | d
| f | g | ğ | h | j | k | l | m | n | p | r | s | ş | t | v | y | z | D | … ];
define CONS [b | c | ç | d | f | g | ğ | h | j | k | l | m | n | p | r | s
| ş | t | v | y | z | D | Z | Y | K | J | B];
define VOWEL [a | e | ı | i | o | ö | u | ü | A | H | … ];
define SVOWEL [a | e | ı | i | o | ö | u | ü];
define BACKV [a | ı | u | o]; # back vowels
define FRONTV [e | i | ö | ü]; # front vowels
define HIGHV [ı | i | u | ü]; # high vowels
define FRUNRV [i | e]; # unrounded front vowels
define FRROV [ö | ü]; # rounded front vowels
define BKROV [u | o]; # rounded back vowels
define BKUNRV [a | ı]; # unrounded back vowels
define Xsyn [s | y | n];
define NDCONS [c | Z | l | d | D];
Simplified Turkish Orthographic Rules
in Foma Environment
Natural Language Processing 31
#---------------ALTERNATION RULE SECTION----------------------
define AReplacement
A -> a || [BACKV | … ] [CONS | … | "+"]* _ ;
A -> e || [FRONTV | …] [CONS | … | "+"]* _ ;
define HReplacement
H -> u || [BKROV | … ] [CONS | "+" | … ]* _ ,,
H -> ü || [FRROV | … ] [CONS | "+" | … ]* _ ,,
H -> ı || [BKUNRV | … ] [CONS | "+" | … ]* _ ,,
H -> i || [FRUNRV | … ] [CONS | "+" | … ]* _ ,,
H -> 0 || VOWEL "+" _ ;
Simplified Turkish Orthographic Rules
in Foma Environment
Natural Language Processing 32
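The effect of the A and H replacement rules can be imitated in ordinary Python: A surfaces as a or e and H as ı, i, u or ü according to the last vowel seen so far, and H is dropped right after a vowel-final morpheme (the H -> 0 case). This is only a rough approximation of the Foma rules above; other archiphonemes such as D are not handled:

BACK   = set("aıou")    # back (kalın) vowels
FRONT  = set("eiöü")    # front (ince) vowels
ROUND  = set("ouöü")    # rounded (yuvarlak) vowels
VOWELS = BACK | FRONT

def realize(lexical):
    """Resolve the A/H archiphonemes left to right and drop the + boundaries."""
    out, prev = [], ""
    for ch in lexical:
        if ch == "H" and prev == "+" and out and out[-1] in VOWELS:
            prev = ch                 # H -> 0 || VOWEL "+" _
            continue
        if ch in ("A", "H"):
            last = next((v for v in reversed(out) if v in VOWELS), "a")
            if ch == "A":
                ch = "a" if last in BACK else "e"
            else:
                ch = (("u" if last in ROUND else "ı") if last in BACK
                      else ("ü" if last in ROUND else "i"))
        if ch != "+":
            out.append(ch)
        prev = ch
    return "".join(out)

print(realize("masa+lAr+Hm"))   # masalarım
print(realize("masa+Hm"))       # masam  (H dropped after the vowel-final stem)
print(realize("masa+yH"))       # masayı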
Morphological Processing in Foma Environment
Natural Language Processing 33