Prof. Deptii Chaudhari, Department of Computer Engineering, I2IT
Lecture Notes - Are Natural Languages Regular?
This is an important question for two reasons: first, it places an upper bound on the running time of
algorithms that process natural language; second, it may tell us something about human language
processing and language acquisition.
To answer this question, let us first understand:
• What is a language (natural language / formal language)?
• What is a regular language?
• What are regular grammars?
What is a natural language?
A natural language is a human communication system. A natural language can be thought of as a
mutually understandable communication system that is used between members of some population.
When communicating, speakers of a natural language are tacitly agreeing on what strings are
allowed (i.e., which strings are grammatical). Dialects and specialized languages (including e.g.,
the language used on social media) are all natural languages in their own right.
Named languages that you are familiar with, such as French, Chinese, or English, are usually
historically, politically or geographically derived labels for populations of speakers.
A natural language has high ambiguity.
Example: I made her duck
1. I cooked waterfowl* for her.
2. I cooked waterfowl* belonging to her.
3. I created the (plaster?) duck she owns.
4. I caused her to quickly lower her head.
5. I turned her into a duck.
Several types of ambiguity combine to cause many meanings:
• morphological (her can be a dative pronoun or possessive pronoun and duck can be a noun
or a verb)
• syntactic (make can behave both transitively and ditransitively; make can select a direct
object or a verb)
• semantic (make can mean create, cause, cook ...)
What is a formal language?
A formal language is a set of strings over an alphabet.
Alphabet: An alphabet is specified by a finite set, ∑, whose elements are called symbols. Some
examples are shown below:
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9} the 10-element set of decimal digits.
{a, b, c, …. x, y, z} the 26-element set of lower-case characters of written English.
{aardvark, ….. zebra} the 250,000-element set of words in the Oxford English Dictionary.
The set of natural numbers N = {0, 1, 2, 3, ….} cannot be an alphabet because it is infinite.
Strings: A string of length n over an alphabet ∑ is an ordered n-tuple of elements of ∑.
∑ * denotes the set of all strings over ∑ of finite length.
If ∑ = {a, b} then ∊, ba, bab, aab are examples of strings over ∑.
If ∑ = {a} then ∑ * = {∊, a, aa, aaa, ….}
If ∑ = {cats, dogs, eat} then
∑ * = {∊, cats, cats eat, cats eat dogs, …..}
Languages: Given an alphabet ∑, any subset of ∑* is a formal language over alphabet ∑.
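These definitions can be made concrete with a short Python sketch. The alphabet is the {cats, dogs, eat} toy example from above; the particular language chosen is an arbitrary assumption for illustration, since any subset of ∑* qualifies:

```python
from itertools import product

# A toy alphabet of words; a "string" over it is a tuple of symbols.
sigma = ("cats", "dogs", "eat")

def strings_up_to(alphabet, max_len):
    """Enumerate all strings over `alphabet` of length <= max_len
    (a finite slice of the infinite set Sigma*)."""
    for n in range(max_len + 1):
        for s in product(alphabet, repeat=n):
            yield s

# Any subset of Sigma* is a formal language; this one is hand-picked.
language = {("cats", "eat"), ("cats", "eat", "dogs")}

finite_slice = list(strings_up_to(sigma, 2))
print(len(finite_slice))            # 13 strings: 1 + 3 + 9
print(("cats", "eat") in language)  # True
```

Note that the empty string ∊ (here the empty tuple) is always a member of ∑*.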
What is a regular language?
A language is regular if it is equal to the set of strings accepted by some deterministic finite-state
automaton (DFA); in other words, the regular languages are exactly those accepted by DFAs.
Given a DFA M = (Q, ∑, ∆, s, F), the language L(M) of strings accepted by M can be generated by
the regular grammar Greg = (N, ∑, S, P) where:
N = Q the non-terminals are the states of M
∑ = ∑ the terminals are the transition symbols of M
S = s the starting symbol is the starting state of M
P = qi → aqj when ∆(qi, a) = qj
or qi → ∊ when qi ∊ F (i.e. when qi is a final state)
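This construction can be sketched in Python. The particular DFA below (over {a, b}, accepting strings that end in b) is a hypothetical example chosen for illustration, not one from the notes:

```python
# A DFA M = (Q, Sigma, delta, s, F) accepting strings over {a, b}
# that end in 'b' (a made-up example for illustration).
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
START, FINAL = "q0", {"q1"}

def accepts(w):
    """Run the DFA over the string w and check the halting state."""
    state = START
    for ch in w:
        state = DELTA[(state, ch)]
    return state in FINAL

def regular_grammar():
    """Read the production rules of G_reg off the DFA, as in the
    construction above: qi -> a qj per transition, qi -> eps per final state."""
    rules = [f"{qi} -> {a} {qj}" for (qi, a), qj in DELTA.items()]
    rules += [f"{q} -> eps" for q in sorted(FINAL)]
    return rules

print(accepts("aab"))   # True: ends in b
print(accepts("aba"))   # False
print(len(regular_grammar()))   # 5 rules: 4 transitions + 1 final state
```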
In order to derive a string from a grammar:
• start with the designated starting symbol
• then repeatedly expand non-terminal symbols using the rewrite rules until there is
nothing further left to expand.
The rewrite rules derive the members of a language from their internal structure (or phrase
structure).
A regular language has both a left-linear and a right-linear grammar.
For every regular grammar, the rewrite rules can all be expressed in the form:
X → aY
X → a
or alternatively, they can all be expressed as:
X → Ya
X → a
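The derivation procedure above can be sketched for a right-linear grammar; the grammar itself is a made-up toy, not one from the notes:

```python
import random

# A hypothetical right-linear grammar: every rule is X -> aY or X -> a.
RULES = {
    "S": [("the", "N")],
    "N": [("cat", "V"), ("dog", "V")],
    "V": [("sleeps", None)],   # None marks a terminating rule X -> a
}

def derive(symbol="S", rng=random):
    """Start from the start symbol and repeatedly expand the single
    non-terminal until a terminating rule is used."""
    out = []
    while symbol is not None:
        terminal, symbol = rng.choice(RULES[symbol])
        out.append(terminal)
    return " ".join(out)

print(derive())   # "the cat sleeps" or "the dog sleeps"
```

Because the rules are right-linear, there is always at most one non-terminal, sitting at the right edge of the derivation; this is exactly why the derivational structure is so flat.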
A phrase structure grammar over an alphabet ∑ is defined by a tuple G = (N, ∑, S,P). The language
generated by grammar G is L(G):
Non-terminals N: Non-terminal symbols (often uppercase letters) may be rewritten using the rules
of the grammar.
Terminals ∑ : Terminal symbols (often lowercase letters) are elements of ∑ and cannot be rewritten.
Note N ∩ ∑ = ∅.
Start Symbol S: A distinguished non-terminal symbol S ∊ N. This non-terminal provides the starting
point for derivations.
Phrase Structure Rules P: Phrase structure rules are pairs of the form (w, v), usually written
w → v, where w ∊ (∑ ∪ N)* N (∑ ∪ N)* and v ∊ (∑ ∪ N)*
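The well-formedness condition on rules (the left side must contain at least one non-terminal) can be sketched as follows, with hypothetical terminal and non-terminal sets:

```python
# Hypothetical symbol sets for a tiny phrase structure grammar.
N = {"S", "NP", "VP"}
SIGMA = {"the", "cat", "sleeps"}

def valid_rule(w, v):
    """Check that (w, v) is a valid phrase structure rule:
    all symbols drawn from N union Sigma, and the left side w
    contains at least one non-terminal."""
    symbols_ok = all(s in N | SIGMA for s in w + v)
    return symbols_ok and any(s in N for s in w)

print(valid_rule(("S",), ("NP", "VP")))  # True
print(valid_rule(("the",), ("cat",)))    # False: no non-terminal on the left
```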
Now let us try to answer the question: can regular grammars model natural language?
It turns out that regular grammars have limitations when modelling natural languages, for the
following reasons:
• Centre Embedding
• Redundancy
• Useful internal structures
Problems using regular grammars for natural language
1. Centre Embedding
In principle, the syntax of natural languages cannot be described by a regular language due to the
presence of centre-embedding, i.e. infinitely recursive structures described by the rule A → αAβ,
which generate language examples of the form aⁿbⁿ.
For instance, the sentences below have a centre-embedded structure.
1. The students the police arrested complained.
2. The luggage that the passengers checked arrived.
3. The luggage that the passengers that the storm delayed checked arrived.
Intuitively, the reason that a regular language cannot describe centre-embedding is that its
associated automaton has no memory of what has occurred previously in a string.
In order to ‘know’ that n verbs were required to match n nominals already seen, an automaton would
need to ‘record’ that n nominals had been seen; but a DFA has no mechanism to do this.
Formally, we can prove this using the pumping lemma, by showing that the language of strings of
the form aⁿbⁿ is not regular.
The pumping lemma for regular languages is used to prove that a language is not regular. The
pumping lemma property (for some pumping length l) is:
All w ∊ L with |w| ≥ l can be expressed as a concatenation of three strings, w = u1vu2, where u1, v
and u2 satisfy:
• |v| ≥ 1 (i.e. v ≠ ∊)
• |u1v| ≤ l
• for all n ≥ 0, u1vⁿu2 ∊ L (i.e. u1u2 ∊ L, u1vu2 ∊ L, u1vvu2 ∊ L, u1vvvu2 ∊ L, etc.)
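For a given candidate pumping length l, the failure of aⁿbⁿ to satisfy this property can be checked by brute force over all splits; the function names below are assumptions for illustration:

```python
def in_anbn(w):
    """Membership test for {a^n b^n : n >= 0}."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * half + "b" * half

def pumpable(w, l):
    """Does SOME split w = u1 v u2 with |u1 v| <= l and |v| >= 1 keep
    every pumped string u1 v^n u2 in the language? (n checked up to 3)"""
    for i in range(l + 1):
        for j in range(i + 1, l + 1):
            u1, v, u2 = w[:i], w[i:j], w[j:]
            if all(in_anbn(u1 + v * n + u2) for n in range(4)):
                return True
    return False

# For pumping length l, the string a^l b^l has no valid split: the
# pumped segment v falls entirely inside the a's and unbalances the count.
l = 5
print(pumpable("a" * l + "b" * l, l))   # False: a^n b^n fails the lemma
```

Since the lemma must hold for every sufficiently long string of a regular language, a single failing string for each l suffices to show the language is not regular.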
If you intersect a regular language with another regular language, the result is also a regular
language (regular languages are closed under intersection):
Lreg1 ∩ Lreg2 = Lreg3
Regular languages are also closed under homomorphism (so we can map all nouns to a and all verbs
to b).
So if English were regular and we intersected it with another regular language (e.g. the one
generated by /the a (that the a)* b*/), we should get another regular language:
if Leng is regular then Leng ∩ La*b* = Lreg3
However, the intersection of a*b* with English is aⁿbⁿ (in our example, specifically /the a
(that the a)ⁿ⁻¹ bⁿ/), which is not regular as it fails the pumping lemma property.
but Leng ∩ La*b* = Laⁿbⁿ (which is not regular)
The assumption that English is regular must be incorrect.
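The homomorphism step of this argument can be sketched as follows; the noun and verb word lists are illustrative assumptions, chosen to cover the centre-embedding examples above:

```python
import re

# A crude homomorphism mapping nouns to 'a' and verbs to 'b'
# (the word lists are illustrative, covering only the example sentences).
NOUNS = {"students", "police", "luggage", "passengers", "storm"}
VERBS = {"arrested", "complained", "checked", "arrived", "delayed"}

def h(sentence):
    """Map each noun to 'a' and each verb to 'b'; drop everything else."""
    out = []
    for w in sentence.lower().split():
        if w in NOUNS:
            out.append("a")
        elif w in VERBS:
            out.append("b")
    return "".join(out)

# The centre-embedded examples all map into a^n b^n:
s = "The luggage that the passengers that the storm delayed checked arrived"
img = h(s)
print(img)                                      # "aaabbb"
print(re.fullmatch(r"a*b*", img) is not None)   # True: lies inside a*b*
```

Each level of embedding adds one more noun before any verb appears, which is exactly how the aⁿbⁿ pattern arises in the image of the homomorphism.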
2. Redundancy
Grammars written using regular grammar rules alone are highly redundant: since the rules are very
simple we need a great many of them to describe the language. This makes regular grammars very
difficult to build and maintain.
3. Useful internal structures
There are instances where a regular language can recognize the strings of a language but in doing
so does not provide a structure that is linguistically useful to us. The left-linear or right-linear
internal structures derived by regular grammars are generally not very useful for higher level NLP
applications.
We need informative internal structure so that we can, for example, build up good semantic
representations.
In practice, regular grammars can be useful for partial grammars (i.e. when we don’t need to know
the syntax tree for the whole sentence but rather just some part of it) and also when we don’t care
about derivational structure (i.e. when we just want a Boolean for whether a string is in a language).
For example, in information extraction, we need to recognize named entities.
The internal structure of named entities is normally unimportant to us, we just want to recognize
when we encounter them.
For instance, using rules such as:
NP → nnsb NP
NP → np1 NP
NP → np1
where NP is a non-terminal and nnsb and np1 are terminals representing tags from a large tagset,
you could match a titled name such as Prof. Stephen William Hawking.
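Because the three NP rules are right-linear, they are equivalent to a regular expression over the tag sequence; the sketch below assumes the tags arrive as a list of strings:

```python
import re

# The rules NP -> nnsb NP | np1 NP | np1 generate any mix of nnsb and
# np1 tags that ends in an np1 head; as a regular expression:
NP_PATTERN = re.compile(r"((nnsb|np1) )*np1")

def is_np(tags):
    """Check whether a tag sequence is derivable from the NP rules."""
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None

# "Prof. Stephen William Hawking" tagged as nnsb np1 np1 np1:
print(is_np(["nnsb", "np1", "np1", "np1"]))  # True
print(is_np(["np1"]))                        # True
print(is_np(["nnsb"]))                       # False: needs an np1 head
```

This is the flat, structure-free recognition the section describes: the matcher tells us a span is a named entity but assigns it no internal tree.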
For every natural language that exists, can we find a context-free grammar to generate it?
There is some evidence that natural language can contain cross-serial dependencies. A small
number of languages exhibit strings of the form shown below.
There is a Zurich dialect of Swiss German in which constructions like the following are found:
mer d’chind em Hans es huus haend wele laa hälfe aastriiche.
we the children Hans the house have wanted to let help paint.
we have wanted to let the children help Hans paint the house.
Such expressions may not be derivable by a context-free grammar.
Where do natural languages fit in the Chomsky hierarchy?
If we are to use formal grammars to represent natural language, it is useful to know where they
appear in the Chomsky hierarchy. With respect to natural language, it might turn out that the set of
all attested natural languages is actually as depicted in the figure. The overlap with the
context-sensitive languages accounts for those languages that have cross-serial dependencies.
Natural languages are infinite sets of sentences constructed out of a finite set of symbols; there is
also no defined upper limit on the number of words in a sentence.
When natural languages are analysed into their component parts, they break down into four levels:
syntax, semantics, morphology and phonology.
Natural languages are believed to be at least context-free. However, Dutch and Swiss German
contain grammatical constructions with cross-serial dependencies, which push them beyond
context-free into the (mildly) context-sensitive languages.
Extensions to the Chomsky hierarchy that find relevance in NLP
There are two extensions to the traditional Chomsky hierarchy that have proved useful in linguistics
and cognitive science:
Mildly context-sensitive languages – CFGs are not adequate (weakly or strongly) to characterize
some aspects of natural language structure. To obtain extra power beyond CFG, a grammatical
formalism called Tree Adjoining Grammar (TAG) was proposed as an approximate characterization
of mildly context-sensitive grammars; TAG extends CFG with an operation of tree composition
called 'adjoining'.
Another classification called Minimalist Grammars (MG) describes an even larger class of formal
languages.
Sub-regular languages
A sub-regular language is a set of strings that can be described without employing the full power of
finite state automata. Many aspects of human language are manifestly sub-regular, such as some
‘strictly local’ dependencies.
Example – identifying recurring sub-string patterns within words is one common application.
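A strictly 2-local dependency can be checked by looking only at adjacent symbol pairs, with no memory beyond the previous symbol; the forbidden bigram below is an arbitrary toy constraint, not a claim about any real language:

```python
# A strictly 2-local language: membership depends only on which adjacent
# symbol pairs (bigrams) occur. Here, a toy phonotactic constraint
# forbidding the substring "nb" inside a word.
FORBIDDEN_BIGRAMS = {"nb"}

def strictly_local_ok(word):
    """Accept iff no forbidden bigram occurs; this scans the word with
    no state beyond the previous symbol, far less than full FSA power."""
    return all(word[i:i + 2] not in FORBIDDEN_BIGRAMS
               for i in range(len(word) - 1))

print(strictly_local_ok("bank"))     # True
print(strictly_local_ok("inbound"))  # False: contains "nb"
```

A full finite-state automaton can track arbitrary long-distance regular patterns; a strictly local checker like this one cannot, which is exactly the sense in which such dependencies sit below the regular languages.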