Prof. Deptii Chaudhari, Department of Computer Engineering, I2IT
Lecture Notes - Are Natural Languages Regular?
This is an important question for two reasons: first, it places an upper bound on the running time of
algorithms that process natural language; second, it may tell us something about human language
processing and language acquisition.
To answer this question, let us first understand:
• What is a language (natural language / formal language)?
• What is a regular language?
• What are regular grammars?
What is a natural language?
A natural language is a human communication system. A natural language can be thought of as a
mutually understandable communication system that is used between members of some population.
When communicating, speakers of a natural language are tacitly agreeing on what strings are
allowed (i.e., which strings are grammatical). Dialects and specialized languages (including e.g.,
the language used on social media) are all natural languages in their own right.
Named languages that you are familiar with, such as French, Chinese, or English, are usually
historically, politically or geographically derived labels for populations of speakers.
A natural language has high ambiguity.
Example: I made her duck
1. I cooked waterfowl* for her.
2. I cooked waterfowl* belonging to her.
3. I created the (plaster?) duck she owns.
4. I caused her to quickly lower her head.
5. I turned her into a duck.
Several types of ambiguity combine to cause many meanings:
• morphological (her can be a dative pronoun or possessive pronoun and duck can be a noun
or a verb)
• syntactic (make can behave both transitively and ditransitively; make can select a direct
object or a verb)
• semantic (make can mean create, cause, cook ...)
What is a formal language?
A formal language is a set of strings over an alphabet.
Alphabet: An alphabet is specified by a finite set, ∑, whose elements are called symbols. Some
examples are shown below:
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9} the 10-element set of decimal digits.
{a, b, c, …. x, y, z} the 26-element set of lower-case characters of written English.
{aardvark, ….. zebra} the 250,000-element set of words in the Oxford English Dictionary.
The set of natural numbers N = {0, 1, 2, 3, ….} cannot be an alphabet because it is infinite.
Strings: A string of length n over an alphabet ∑ is an ordered n-tuple of elements of ∑.
∑ * denotes the set of all strings over ∑ of finite length.
If ∑ = {a, b} then ∊, ba, bab, aab are examples of strings over ∑.
If ∑ = {a} then ∑ * = {∊, a, aa, aaa, ….}
If ∑ = {cats, dogs, eat} then
∑ * = {∊, cats, cats eat, cats eat dogs, …..}
Languages: Given an alphabet ∑, any subset of ∑* is a formal language over alphabet ∑.
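These definitions can be made concrete with a short Python sketch. The alphabet is the {cats, dogs, eat} toy example from above; the particular language chosen is an arbitrary assumption for illustration, since any subset of ∑* qualifies:

```python
from itertools import product

# A toy alphabet of words; a "string" over it is a tuple of symbols.
sigma = ("cats", "dogs", "eat")

def strings_up_to(alphabet, max_len):
    """Enumerate all strings over `alphabet` of length <= max_len
    (a finite slice of the infinite set Sigma*)."""
    for n in range(max_len + 1):
        for s in product(alphabet, repeat=n):
            yield s

# Any subset of Sigma* is a formal language; this one is hand-picked.
language = {("cats", "eat"), ("cats", "eat", "dogs")}

finite_slice = list(strings_up_to(sigma, 2))
print(len(finite_slice))            # 13 strings: 1 + 3 + 9
print(("cats", "eat") in language)  # True
```

Note that the empty string ∊ (here the empty tuple) is always a member of ∑*.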
What is a regular language?
A language is regular if it is equal to the set of strings accepted by some deterministic finite-state
automaton (DFA); in other words, the regular languages are exactly those accepted by DFAs.
Given a DFA M = (Q, ∑, ∆, s, F), the language L(M) of strings accepted by M can be generated by
the regular grammar Greg = (N, ∑, S, P) where:
N = Q the non-terminals are the states of M
∑ = ∑ the terminals are the transition symbols of M
S = s the starting symbol is the starting state of M
P = qi → aqj when ∆(qi, a) = qj
or qi → ∊ when qi ∊ F (i.e. when qi is a final state)
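This construction can be sketched in Python. The particular DFA below (over {a, b}, accepting strings that end in b) is a hypothetical example chosen for illustration, not one from the notes:

```python
# A DFA M = (Q, Sigma, delta, s, F) accepting strings over {a, b}
# that end in 'b' (a made-up example for illustration).
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
START, FINAL = "q0", {"q1"}

def accepts(w):
    """Run the DFA over the string w and check the halting state."""
    state = START
    for ch in w:
        state = DELTA[(state, ch)]
    return state in FINAL

def regular_grammar():
    """Read the production rules of G_reg off the DFA, as in the
    construction above: qi -> a qj per transition, qi -> eps per final state."""
    rules = [f"{qi} -> {a} {qj}" for (qi, a), qj in DELTA.items()]
    rules += [f"{q} -> eps" for q in sorted(FINAL)]
    return rules

print(accepts("aab"))   # True: ends in b
print(accepts("aba"))   # False
print(len(regular_grammar()))   # 5 rules: 4 transitions + 1 final state
```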
In order to derive a string from a grammar:
• start with the designated starting symbol
• then repeatedly expand non-terminal symbols using the rewrite rules until there is
nothing further left to expand.
The rewrite rules derive the members of a language from their internal structure (or phrase
structure).
A regular language has both a left-linear and a right-linear grammar.
For every regular grammar, the rewrite rules can all be expressed in the form:
X → aY
X → a
or alternatively, they can all be expressed as:
X → Ya
X → a
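The derivation procedure above can be sketched for a right-linear grammar; the grammar itself is a made-up toy, not one from the notes:

```python
import random

# A hypothetical right-linear grammar: every rule is X -> aY or X -> a.
RULES = {
    "S": [("the", "N")],
    "N": [("cat", "V"), ("dog", "V")],
    "V": [("sleeps", None)],   # None marks a terminating rule X -> a
}

def derive(symbol="S", rng=random):
    """Start from the start symbol and repeatedly expand the single
    non-terminal until a terminating rule is used."""
    out = []
    while symbol is not None:
        terminal, symbol = rng.choice(RULES[symbol])
        out.append(terminal)
    return " ".join(out)

print(derive())   # "the cat sleeps" or "the dog sleeps"
```

Because the rules are right-linear, there is always at most one non-terminal, sitting at the right edge of the derivation; this is exactly why the derivational structure is so flat.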
A phrase structure grammar over an alphabet ∑ is defined by a tuple G = (N, ∑, S,P). The language
generated by grammar G is L(G):
Non-terminals N: Non-terminal symbols (often uppercase letters) may be rewritten using the rules
of the grammar.
Terminals ∑ : Terminal symbols (often lowercase letters) are elements of ∑ and cannot be rewritten.
Note N ∩ ∑ = ∅.
Start Symbol S: A distinguished non-terminal symbol S ∊ N. This non-terminal provides the starting
point for derivations.
Phrase Structure Rules P: Phrase structure rules are pairs of the form (w, v), usually written
w → v, where w ∊ (∑ ∪ N)* N (∑ ∪ N)* and v ∊ (∑ ∪ N)*
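The well-formedness condition on rules (the left side must contain at least one non-terminal) can be sketched as follows, with hypothetical terminal and non-terminal sets:

```python
# Hypothetical symbol sets for a tiny phrase structure grammar.
N = {"S", "NP", "VP"}
SIGMA = {"the", "cat", "sleeps"}

def valid_rule(w, v):
    """Check that (w, v) is a valid phrase structure rule:
    all symbols drawn from N union Sigma, and the left side w
    contains at least one non-terminal."""
    symbols_ok = all(s in N | SIGMA for s in w + v)
    return symbols_ok and any(s in N for s in w)

print(valid_rule(("S",), ("NP", "VP")))  # True
print(valid_rule(("the",), ("cat",)))    # False: no non-terminal on the left
```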
Now let us try to answer the question: can regular grammars model natural language?
It turns out that regular grammars have limitations when modelling natural languages, for the
following reasons:
• Centre Embedding
• Redundancy
• Useful internal structures
Problems using regular grammars for natural language
1. Centre Embedding
In principle, the syntax of natural languages cannot be described by a regular language due to the
presence of centre-embedding, i.e. infinitely recursive structures described by the rule A → αAβ,
which generate language examples of the form aⁿbⁿ.
For instance, the sentences below have a centre-embedded structure.
1. The students the police arrested complained.
2. The luggage that the passengers checked arrived.
3. The luggage that the passengers that the storm delayed checked arrived.
Intuitively, the reason that a regular language cannot describe centre-embedding is that its
associated automaton has no memory of what has occurred previously in a string.
In order to ‘know’ that n verbs were required to match n nominals already seen, an automaton would
need to ‘record’ that n nominals had been seen; but a DFA has no mechanism to do this.
Formally, we can prove this using the pumping lemma, by showing that the language of strings of
the form aⁿbⁿ is not regular.
The pumping lemma for regular languages is used to prove that a language is not regular. The
pumping lemma property (for some pumping length l) is:
All w ∊ L with |w| ≥ l can be expressed as a concatenation of three strings, w = u1vu2, where u1, v
and u2 satisfy:
• |v| ≥ 1 (i.e. v ≠ ∊)
• |u1v| ≤ l
• for all n ≥ 0, u1vⁿu2 ∊ L (i.e. u1u2 ∊ L, u1vu2 ∊ L, u1vvu2 ∊ L, u1vvvu2 ∊ L, etc.)
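For a given candidate pumping length l, the failure of aⁿbⁿ to satisfy this property can be checked by brute force over all splits; the function names below are assumptions for illustration:

```python
def in_anbn(w):
    """Membership test for {a^n b^n : n >= 0}."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * half + "b" * half

def pumpable(w, l):
    """Does SOME split w = u1 v u2 with |u1 v| <= l and |v| >= 1 keep
    every pumped string u1 v^n u2 in the language? (n checked up to 3)"""
    for i in range(l + 1):
        for j in range(i + 1, l + 1):
            u1, v, u2 = w[:i], w[i:j], w[j:]
            if all(in_anbn(u1 + v * n + u2) for n in range(4)):
                return True
    return False

# For pumping length l, the string a^l b^l has no valid split: the
# pumped segment v falls entirely inside the a's and unbalances the count.
l = 5
print(pumpable("a" * l + "b" * l, l))   # False: a^n b^n fails the lemma
```

Since the lemma must hold for every sufficiently long string of a regular language, a single failing string for each l suffices to show the language is not regular.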
If you intersect a regular language with another regular language, the result is also a regular
language (regular languages are closed under intersection):
Lreg1 ∩ Lreg2 = Lreg3
Regular languages are also closed under homomorphism (so we can map all nouns to a and all verbs
to b).
So if English were regular and we intersected it with another regular language (e.g. the one
generated by /the a (that the a)* b*/), we should get another regular language:
if Leng is regular then Leng ∩ La*b* = Lreg3
However, the intersection of a*b* with English is aⁿbⁿ (in our example, specifically /the a
(that the a)ⁿ⁻¹ bⁿ/), which is not regular as it fails the pumping lemma property.
but Leng ∩ La*b* = Laⁿbⁿ (which is not regular)
The assumption that English is regular must be incorrect.
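The homomorphism step of this argument can be sketched as follows; the noun and verb word lists are illustrative assumptions, chosen to cover the centre-embedding examples above:

```python
import re

# A crude homomorphism mapping nouns to 'a' and verbs to 'b'
# (the word lists are illustrative, covering only the example sentences).
NOUNS = {"students", "police", "luggage", "passengers", "storm"}
VERBS = {"arrested", "complained", "checked", "arrived", "delayed"}

def h(sentence):
    """Map each noun to 'a' and each verb to 'b'; drop everything else."""
    out = []
    for w in sentence.lower().split():
        if w in NOUNS:
            out.append("a")
        elif w in VERBS:
            out.append("b")
    return "".join(out)

# The centre-embedded examples all map into a^n b^n:
s = "The luggage that the passengers that the storm delayed checked arrived"
img = h(s)
print(img)                                      # "aaabbb"
print(re.fullmatch(r"a*b*", img) is not None)   # True: lies inside a*b*
```

Each level of embedding adds one more noun before any verb appears, which is exactly how the aⁿbⁿ pattern arises in the image of the homomorphism.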
2. Redundancy
Grammars written using regular grammar rules alone are highly redundant: since the rules are very
simple we need a great many of them to describe the language. This makes regular grammars very
difficult to build and maintain.
3. Useful internal structures
There are instances where a regular language can recognize the strings of a language but in doing
so does not provide a structure that is linguistically useful to us. The left-linear or right-linear
internal structures derived by regular grammars are generally not very useful for higher level NLP
applications.
We need informative internal structure so that we can, for example, build up good semantic
representations.
In practice, regular grammars can be useful for partial grammars (i.e. when we don’t need to know
the syntax tree for the whole sentence but rather just some part of it) and also when we don’t care
about derivational structure (i.e. when we just want a Boolean for whether a string is in a language).
For example, in information extraction, we need to recognize named entities.
The internal structure of named entities is normally unimportant to us, we just want to recognize
when we encounter them.
For instance, using rules such as:
NP → nnsb NP
NP → np1 NP
NP → np1
where NP is a non-terminal and nnsb and np1 are terminals representing tags from a large tagset,
you could match a titled name such as Prof. Stephen William Hawking.
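Because the three NP rules are right-linear, they are equivalent to a regular expression over the tag sequence; the sketch below assumes the tags arrive as a list of strings:

```python
import re

# The rules NP -> nnsb NP | np1 NP | np1 generate any mix of nnsb and
# np1 tags that ends in an np1 head; as a regular expression:
NP_PATTERN = re.compile(r"((nnsb|np1) )*np1")

def is_np(tags):
    """Check whether a tag sequence is derivable from the NP rules."""
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None

# "Prof. Stephen William Hawking" tagged as nnsb np1 np1 np1:
print(is_np(["nnsb", "np1", "np1", "np1"]))  # True
print(is_np(["np1"]))                        # True
print(is_np(["nnsb"]))                       # False: needs an np1 head
```

This is the flat, structure-free recognition the section describes: the matcher tells us a span is a named entity but assigns it no internal tree.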
For every natural language that exists, can we find a context-free grammar to generate it?
There is some evidence that natural language can contain cross-serial dependencies. A small
number of languages exhibit strings of the form shown below.
There is a Zurich dialect of Swiss German in which constructions like the following are found:
mer d’chind em Hans es huus haend wele laa hälfe aastriiche.
we the children Hans the house have wanted to let help paint.
we have wanted to let the children help Hans paint the house.
Such expressions may not be derivable by a context-free grammar.
Where do natural languages fit in the Chomsky hierarchy?
If we are to use formal grammars to represent natural language, it is useful to know where they
appear in the Chomsky hierarchy. With respect to natural language, it might turn out that the set of
all attested natural languages is actually as depicted in the figure. The overlap with the
context-sensitive languages accounts for those languages that have cross-serial dependencies.
Natural languages are infinite sets of sentences constructed out of a finite set of symbols; there is
also no defined upper limit on the number of words in a sentence.
When natural languages are analysed into their component parts, they break down into four levels:
syntax, semantics, morphology and phonology.
Natural languages are believed to be at least context-free. However, Dutch and Swiss German
contain grammatical constructions with cross-serial dependencies, which push them beyond
context-free into the (mildly) context-sensitive languages.
Extensions to the Chomsky hierarchy that find relevance in NLP
There are two extensions to the traditional Chomsky hierarchy that have proved useful in linguistics
and cognitive science:
Mildly context-sensitive languages – CFGs are not adequate (weakly or strongly) to characterize
some aspects of natural language structure. To obtain extra power beyond CFG, a grammatical
formalism called Tree Adjoining Grammar (TAG) was proposed as an approximate characterization
of mildly context-sensitive grammars; TAG extends CFG with an operation of tree composition
called 'adjoining'.
Another classification called Minimalist Grammars (MG) describes an even larger class of formal
languages.
Sub-regular languages
A sub-regular language is a set of strings that can be described without employing the full power of
finite state automata. Many aspects of human language are manifestly sub-regular, such as some
‘strictly local’ dependencies.
Example – identifying recurring sub-string patterns within words is one common application.
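A strictly 2-local dependency can be checked by looking only at adjacent symbol pairs, with no memory beyond the previous symbol; the forbidden bigram below is an arbitrary toy constraint, not a claim about any real language:

```python
# A strictly 2-local language: membership depends only on which adjacent
# symbol pairs (bigrams) occur. Here, a toy phonotactic constraint
# forbidding the substring "nb" inside a word.
FORBIDDEN_BIGRAMS = {"nb"}

def strictly_local_ok(word):
    """Accept iff no forbidden bigram occurs; this scans the word with
    no state beyond the previous symbol, far less than full FSA power."""
    return all(word[i:i + 2] not in FORBIDDEN_BIGRAMS
               for i in range(len(word) - 1))

print(strictly_local_ok("bank"))     # True
print(strictly_local_ok("inbound"))  # False: contains "nb"
```

A full finite-state automaton can track arbitrary long-distance regular patterns; a strictly local checker like this one cannot, which is exactly the sense in which such dependencies sit below the regular languages.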