Corpora-Based Generation of Dependency Parser Models for Natural Language
Processing
Edmond C. Lepedus
School of Computing
University of Kent
Canterbury, UK
Email: el210@kent.ac.uk
Abstract—In this paper, we show that it is possible to
train a dependency parser model using an unparsed corpus
of English language text. This is a novel development in
computational linguistics, with the potential to transform
parser generation.
Parsing is an essential part of natural language process-
ing, which is currently performed using parsers trained on
manually parsed and annotated texts.
However, the training data required is expensive to
produce, which limits the size and availability of training
sets, and has a knock-on effect on the performance of the
resulting parser.
In order to negate the need for annotated training data,
we develop an iterative training flow which generates train-
ing examples using heuristics extracted from past parsing
decisions.
We show that the parse trees produced using parsers
trained in this way bear qualitative resemblance to those
produced by conventionally trained parsers, and propose
three avenues for future research.
Index Terms—Natural Language Processing; Dependency
Parsing; Grammar Generation;
1. Introduction
Parsing is a fundamental part of natural language pro-
cessing. It extracts the syntactical structure of the sentence
in order to provide clues about the underlying meaning.
Parser training requires large quantities of text which
has been manually parsed and annotated by human lin-
guists using tools such as GATE [1]. The specific infor-
mation included in an annotation depends on the goals
of the particular annotation initiative, but for linguistics,
it would typically contain a canonical ‘gold’ parse tree
for each sentence in the corpus (subsection 2.1 shows one
such tree).
The production of high-quality training sets can take
months of effort by skilled linguists, and is therefore very
expensive. Even when the resulting data is made available
free of charge, the cost of producing it limits the size of
the training sets that can be developed.
This problem is particularly pronounced when the
number of linguists available to contribute to such a
project is small, such as in the case of dying or already
dead languages.
By enabling the use of unparsed texts for training, we
make every written work available as training data, thus
minimising the cost of training data while simultaneously
increasing its availability.
In this paper, we show that it is possible to train a
parser model using an unparsed corpus of English lan-
guage text, and introduce an iterative training flow to sup-
port further research. Moreover, we do this by modifying
the Stanford CoreNLP Dependency Parser, which places
our results in a well known context and makes them easier
to replicate or expand upon.
We present brief primers on parsing and the Stanford
CoreNLP in section 2, and provide more information
about our aims and the scope of the project in section 3.
The main body of the paper starts with a high-level
overview of our approach (section 4), followed by specific
details of our implementation (section 5). We present
concrete and qualitative results of our training in section 6,
and an analysis of the outcome in section 7. We suggest
three avenues for future work in section 8.
2. Background
Parsing is an essential part of any natural language
processing pipeline. In our case, it takes sentences which
have been tokenised and annotated with their parts of
speech (e.g. Listing 1), and works out the syntactical
structure between the words in order to provide clues about
the underlying meaning [2].
Sentence #1 (7 tokens):
The cat sat on the mat.
[Text=The CharacterOffsetBegin=47 CharacterOffsetEnd=50 PartOfSpeech=DT]
[Text=cat CharacterOffsetBegin=51 CharacterOffsetEnd=54 PartOfSpeech=NN]
[Text=sat CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD]
[Text=on CharacterOffsetBegin=59 CharacterOffsetEnd=61 PartOfSpeech=IN]
[Text=the CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=DT]
[Text=mat CharacterOffsetBegin=66 CharacterOffsetEnd=69 PartOfSpeech=NN]
[Text=. CharacterOffsetBegin=69 CharacterOffsetEnd=70 PartOfSpeech=.]
Listing 1. Example of parser input which has been tokenised, split into
sentences and Part-of-Speech tagged
2.1. Constituency Parsing
Traditionally, computational linguistics has
approached the task by breaking “sentences into
constituents (phrases), which are then broken into smaller
constituents” [3], until each constituent is a single word.
This is grounded in the notion of constituency grammars,
which has been passed down from the ancient Stoics
to linguists through formal logic [3], and is typically
represented using constituency trees as in Figure 1.
[Constituency tree for the sentence "This is an example of a constituency tree", breaking it down through S, NP, VP and PP constituents to individual words.]
Figure 1. Example constituency-based parse tree output
2.2. Dependency Parsing
There exists a different, and older, parsing tradition,
which assumes that sentence structure consists of words
linked by binary asymmetric relations called dependencies
[4]. These relations involve a syntactically subordinate
word called a dependent, and the word on which it de-
pends — the head. This is known as dependency parsing,
and results in a representation known as a dependency
tree. An example of such a tree can be seen in Figure 2.
Due to its ability to parse languages with loose con-
straints on the ordering of words in a sentence (e.g.
Finnish, Polish), dependency parsing has seen renewed
interest in the natural language processing community, and
is considered state-of-the-art [5].
[Dependency tree for the sentence "This is an example of a dependency tree", with arcs labelled ROOT, nsubj, cop, det, nmod, case, det and compound; the arc label describes the dependency type.]
Figure 2. Example dependency-based parse tree
2.3. Transition-Based Dependency Parsers
Transition-based dependency parsing “is a purely data-
driven method that makes no use of a formal grammar
but relies on machine learning from treebank data” [4].
A transition based parser “learns to score possible next
actions in a state machine” [4] in order to produce a de-
pendency tree. The state machine is known as a transition
system, and the possible actions are called transitions [4].
A transition system consists of configurations representing
partial parses of the sentence and a set of transitions used
to move between them. A configuration consists of a stack,
a buffer and the list of transitions which have been applied
to reach it. The transitions are:
• SHIFT: move a word from the buffer to the stack
• LEFT-ARC: remove the second word from the
stack, and add a dependency arc between it and
the first
• RIGHT-ARC: remove the first word from the
stack, and add a dependency arc between it and
the second word
A terminal configuration is one which has only the
special “ROOT” word on the stack and an empty buffer.
Parsing proceeds by applying the best transition, as scored
by an evaluation function, until a terminal configuration
is reached.
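The three transitions above can be captured in a minimal arc-standard sketch. The `ArcStandard`, `Config` and `apply` names are ours for illustration, not CoreNLP's; arcs are written in the L(dependent, head) / R(head, dependent) notation used later in Figure 3, and the sketch assumes it is only given valid transition sequences.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the arc-standard transition system described above.
public class ArcStandard {
    public static class Config {
        public final Deque<String> stack = new ArrayDeque<>();  // peek() is the first (top) word
        public final Deque<String> buffer = new ArrayDeque<>(); // poll() is the next input word
        public final List<String> arcs = new ArrayList<>();

        public Config(List<String> words) {
            stack.push("-ROOT-");
            buffer.addAll(words);
        }

        // Terminal: only -ROOT- on the stack and an empty buffer.
        public boolean isTerminal() {
            return buffer.isEmpty() && stack.size() == 1;
        }
    }

    // Apply one transition: "S" (SHIFT), "L" (LEFT-ARC) or "R" (RIGHT-ARC).
    public static void apply(Config c, String t) {
        switch (t) {
            case "S": // move the next buffer word onto the stack
                c.stack.push(c.buffer.poll());
                break;
            case "L": { // second stack word becomes a dependent of the first
                String head = c.stack.pop();
                String dep = c.stack.pop();
                c.arcs.add("L(" + dep + "," + head + ")");
                c.stack.push(head);
                break;
            }
            case "R": { // first stack word becomes a dependent of the second
                String dep = c.stack.pop();
                String head = c.stack.peek();
                c.arcs.add("R(" + head + "," + dep + ")");
                break;
            }
        }
    }
}
```

Running the transition sequence from Figure 3 through this sketch reproduces the arc set shown there.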
Various evaluation functions can be used, ranging
from deterministic rule-based implementations, to neural
network classifiers which are trained to pick the optimum
transition given a configuration.
Figure 3 illustrates the transition-based parsing pro-
cess, whereby the words on the buffer are gradually con-
sumed and a set of dependency arcs is produced. In this
example, we used our own knowledge as an interactive
evaluation function.
2.4. Stanford CoreNLP
The Stanford CoreNLP toolkit is a free, open source,
high quality set of natural language processing tools. It
includes a dependency parser based on a neural network
classifier, which can parse 1000 sentences per second at
an accuracy of 92.2% [6]. The system is structured as a
pipeline, which takes input text and applies a series of
annotations to it. The specific annotators used can be
varied and range from simple tokenisers and sentence
splitters to high-level sentiment analysis [7]. The linear
pipeline allows the output of previous annotators to be
used by subsequent ones, and thus achieves great separa-
tion of responsibility and modularity. Figure 4 illustrates
the CoreNLP annotation pipeline.
The CoreNLP Dependency Parser uses a transition
system with a neural network classifier as its scoring
function. The classifier takes a vector of features extracted
from the current system configuration and returns a vector
with a score for each possible transition. When training, it
is supplied with examples consisting of input feature vec-
tors and a desired transition, and its weights are modified
using Adaptive Gradient Descent until it reliably outputs
vectors whose maximum scores correspond to the target
transitions. The specific features used are covered in detail
by Chen et al. [6] and remain unchanged in our system.
Broadly speaking, they include word, part-of-speech and
arc label embeddings for words on both the stack and
the buffer, and of some of their children as described by
existing dependency arcs.
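At parse time, the transition actually applied is simply the one whose score is largest. A trivial sketch of that selection step (the class and method names are ours, not CoreNLP's):

```java
// Sketch of the classifier's role during parsing: given a score vector with
// one entry per possible transition, apply the transition with the maximum score.
public class TransitionScorer {
    public static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }
}
```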
Stack                     Buffer                  Arcs                          Action
-ROOT-                    The, cat, sat,          ∅                             SHIFT
                          on, the, mat
-ROOT-, The               cat, sat, on,           ∅                             SHIFT
                          the, mat
-ROOT-, The, cat          sat, on, the, mat       ∅                             LEFT
-ROOT-, cat               sat, on, the, mat       L(The,cat)                    SHIFT
-ROOT-, cat, sat          on, the, mat            L(The,cat)                    LEFT
-ROOT-, sat               on, the, mat            L(The,cat), L(cat,sat)        SHIFT
-ROOT-, sat, on           the, mat                L(The,cat), L(cat,sat)        SHIFT
-ROOT-, sat, on, the      mat                     L(The,cat), L(cat,sat)        SHIFT
-ROOT-, sat, on,                                  L(The,cat), L(cat,sat)        LEFT
the, mat
-ROOT-, sat, on, mat                              L(The,cat), L(cat,sat),       LEFT
                                                  L(the,mat)
-ROOT-, sat, mat                                  L(The,cat), L(cat,sat),       RIGHT
                                                  L(the,mat), L(on,mat)
-ROOT-, sat                                       L(The,cat), L(cat,sat),       RIGHT
                                                  L(the,mat), L(on,mat),
                                                  R(sat,mat)
-ROOT-                                            L(The,cat), L(cat,sat),       DONE
                                                  L(the,mat), L(on,mat),
                                                  R(sat,mat), R(-ROOT-,sat)
Figure 3. An example of transition-based dependency parsing
Figure 4. CoreNLP architecture. Reproduced from Manning et al. [7].
3. Aims
We set out to train a parser model using an unparsed
corpus of English language text.
Parser performance is normally evaluated using La-
belled and/or Unlabelled Attachment Scores (LAS/UAS).
Both metrics measure the proportion of words which are
assigned the correct head, but LAS also requires that the
correct label is applied to the dependency relation.
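The two metrics can be illustrated with a small sketch that scores predicted (head, label) pairs against gold ones; the class and method names are ours, not from any evaluation toolkit.

```java
// Illustrative sketch of attachment scores. Each word i has a gold and a
// predicted head index (and label). UAS counts correct heads; LAS additionally
// requires the dependency label to match.
public class AttachmentScores {
    public static double uas(int[] goldHead, int[] predHead) {
        int correct = 0;
        for (int i = 0; i < goldHead.length; i++) {
            if (goldHead[i] == predHead[i]) correct++;
        }
        return (double) correct / goldHead.length;
    }

    public static double las(int[] goldHead, String[] goldLabel,
                             int[] predHead, String[] predLabel) {
        int correct = 0;
        for (int i = 0; i < goldHead.length; i++) {
            if (goldHead[i] == predHead[i] && goldLabel[i].equals(predLabel[i])) correct++;
        }
        return (double) correct / goldHead.length;
    }
}
```

By construction LAS can never exceed UAS, since every correctly labelled attachment is also a correct attachment.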
However, given the early stage of this line of research,
the quantitative parsing accuracy of the resulting model is
of secondary concern — simply being able to develop a
model which begins to approximate the types of depen-
dency trees produced by conventionally trained parsers is
a significant step forward.
Therefore, in this preliminary research, we will only
be using placeholder ‘UNKNOWN’ and ‘PARSED’ de-
pendency labels, rather than the full set of linguistically
meaningful labels, as the derivation of suitable labels from
unparsed text merits its own research project.
In order to evaluate the outcome of our research, we
will be inspecting the high-level structure of the depen-
dency tree and looking for non-trivial relations making
use of both left and right arcs and linking non-adjacent
words where appropriate.
4. Approach
We took a high quality, open source parser and mod-
ified it to support an iterative training flow which allows
us to gradually develop a model which can parse the
whole corpus. Instead of extracting training examples for
the classifier from an annotated corpus, our system uses
its own past parse decisions, modified in accordance with a
heuristic extracted from the corpus.
4.1. Overview
We start by generating a ‘blank’ parser model (subsec-
tion 5.1), which is then used to parse the corpus, while log-
ging every parsing decision made (subsection 5.2). From
the parse log, we extract heuristics (subsection 5.3), which
are then used to produce improved training examples from
the logged decisions (subsection 5.4). The classifier is
trained on the resulting data, and the next iteration starts.
A significant advantage of this flow is that a new
model is generated at the end of each iteration and can be
evaluated and used while the training continues.
5. Implementation
We decided to modify an existing parser, rather than
develop our own, in order to circumscribe the scope of
our work and facilitate its evaluation. CoreNLP provides
a robust and well documented foundation with a state-of-
the-art dependency parser implementation in an accessible
open-source package, making it an ideal match.
5.1. Blank Model Creation
A key requirement of our project was to be able to
generate a blank initial model which could gradually be
populated. However, the CoreNLP implementation is built
with the assumption that its model will always return
a fully formed parse tree. In order to satisfy CoreNLP
without extensive modifications, we therefore decided to
use its own model generation functionality, but co-opt it to
produce a trivial model which just outputs left-bound arcs
with a custom ‘UNKNOWN’ label as seen in Figure 6
below.
[Flow diagram: Generate 'blank' model → Parse corpus → Extract heuristics → Generate training examples → Train parser, looping back to Parse corpus.]
Figure 5. Training flow

[Example 'blank' model output: a flat parse in which every arc points left and carries the UNKNOWN label.]
Figure 6. Example output from 'blank' model
Obviously, the parse trees produced by this model are
of little use for real-world parsing, but they are essential
to bootstrap the training example generation process.
In order to generate the model, we subclassed the
DependencyParser class and modified its training func-
tionality to use a new deterministic oracle.
The original oracle would extract the correct arc di-
rections and labels for a configuration from the parse trees
in the annotated training set. However, since we have no
parse trees to use, and no way of extracting labels, we
replaced it with our own code which deterministically
returns one of three transitions based on hard-coded rules.
The new implementation (Listing 2) takes a transition
system configuration, examines its stack, buffer and exist-
ing transitions and returns a new transition. Conceptually,
it tries to output a left arc, and uses a few rules to ensure
that it is a valid transition (e.g. words don’t have multiple
heads or dependents and the special -ROOT- word does
not become a dependent), otherwise it falls back to either
a SHIFT or a right arc, each of which must satisfy certain
conditions. The key property of this new oracle is that
given a valid input configuration it will always provide a
transition which results in a new configuration which is
also valid, thus allowing us to generate a model which
will always produce a valid parse tree.
Although this approach provided a quick way to boot-
strap the system, a more robust approach would be to
develop a standalone system which implements the or-
acle and outputs a CoNLL treebank which the standard
CoreNLP dependency parser can then use as training data
without any modifications.

public String getOraclePrediction(Configuration c) {
    int w1 = c.getStack(1);
    int w2 = c.getStack(0);
    if (c.getStackSize() < 3) {
        if (c.getBufferSize() > 0) {
            return "S";
        } else {
            return "R(UNKNOWN)";
        }
    } else if (c.getChildCount(w1) < 1) {
        return "R(UNKNOWN)";
    } else if (c.getChildCount(w2) < 1) {
        return "L(UNKNOWN)";
    }
    return null;
}
Listing 2. Deterministic oracle implementation
5.2. Logging
We modified the subclassed dependency parser to log
every parsing decision to a YAML file. YAML [8] is a
data serialisation language with a human-readable syntax
which allows us to visually inspect and manually modify
the parser’s decisions while maintaining the ability to load
the file back into the program. Listing 3 shows an example
of the YAML output.
 1 ---
 2 !!uk.ac.kent.parser.ParserLogEntry
 3 arcs: ['PARSED(Example, log)']
 4 bufferPOS: []
 5 bufferWords: []
 6 features: [2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 0, 1, 1, 1, 1, 839, 837, 837, 838, 838,
     838, 838, 838, 838, 838, 838, 838, 838,
     837, 838, 838, 838, 838, 841, 841, 841,
     841, 841, 841, 841, 844, 841, 841, 841,
     841]
 7 partOfSpeechArcs: ['PARSED(NN, NN)']
 8 stackPOS: [-ROOT-, NN, NN]
 9 stackWords: [-ROOT-, Example, output]
10 transition: R(PARSED)
11 ---
Listing 3. Example YAML output
The features vector (line 6) and the transition (line 10)
are needed to train the classifier as previously described
in subsection 2.4, while the rest of the logged information
provides additional context for heuristic extraction and
human inspection.
5.3. Heuristic Extraction
The heuristic we used for this proof of concept simply
counted how often pairs of adjacent part-of-speech tags,
called bigrams, occur in the input corpus. Although dependency
relations hold between words, we use POS bigrams as
a higher level abstraction which allows us to generalise
to unseen words, and assume that frequent bigrams are a
good indication of dependency between the two part-of-
speech tags.
Our corpus analyser runs the input text through a
tagger and scans through the tagged output counting how
many times each bigram is encountered. The parse log
analyser performs a similar function, but counts how
many times bigrams are encountered on the parser’s stack.
Listing 4 shows an example of extracted bigrams.
DT NN 6034
IN DT 5003
. 4707
NN IN 3755
PRP VBD 3256
JJ NN 2783
NN , 2653
NN . 2615
DT JJ 2551
...
Listing 4. Example of Part-Of-Speech Bigram histogram extracted from
the Sherlock Holmes corpus
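The counting step itself can be sketched as follows; `BigramHistogram` is a hypothetical stand-in for our corpus analyser, and it assumes the input has already been run through a part-of-speech tagger.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the bigram heuristic: count how often each adjacent pair of
// part-of-speech tags occurs in a tagged token sequence.
public class BigramHistogram {
    public static Map<String, Integer> count(List<String> tags) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tags.size(); i++) {
            String bigram = tags.get(i) + " " + tags.get(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }
}
```

Applied to the tag sequence for "The cat sat on the mat." (DT NN VBD IN DT NN .), the sketch counts "DT NN" twice and every other bigram once, mirroring the shape of Listing 4.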
It is worth mentioning that this heuristic, apart from
being na¨ıve, is also rooted in the assumption that the order
of words is significant. This is generally true in English,
but as we discussed in subsection 2.2, there are other
languages for which this assumption does not hold, and
therefore different heuristics would be required.
5.4. Training Example Generation
A training example consists of a feature vector and an
associated transition. Both of these elements are present in
the parser log output, and can be extracted and modified
to generate new training examples. In fact, our implemen-
tation exploits this similarity so that training examples are
just log entries which have had their transition modified,
and saved to a different file.
The exact modifications depend on the specific heuris-
tics extracted in the previous step, but in the current
implementation, ‘UNKNOWN’ transitions which have the
most frequent bigram at the top of the stack are replaced
with ‘L(PARSED)’ transitions.
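That replacement rule can be sketched as follows, with `LogEntry` standing in for the ParserLogEntry shown in Listing 3; the class and method names are ours, not the project's.

```java
import java.util.List;

// Sketch of the relabelling step: a logged decision whose top-of-stack
// POS bigram matches the most frequent bigram from the corpus has its
// 'UNKNOWN' transition rewritten to 'L(PARSED)'.
public class ExampleGenerator {
    public static class LogEntry {
        public List<String> stackPOS; // e.g. [-ROOT-, DT, NN]
        public String transition;     // e.g. "L(UNKNOWN)"
    }

    public static void relabel(List<LogEntry> log, String topBigram) {
        for (LogEntry e : log) {
            int n = e.stackPOS.size();
            if (n < 2 || !e.transition.contains("UNKNOWN")) continue;
            // The bigram formed by the two words at the top of the stack.
            String bigram = e.stackPOS.get(n - 2) + " " + e.stackPOS.get(n - 1);
            if (bigram.equals(topBigram)) {
                e.transition = "L(PARSED)";
            }
        }
    }
}
```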
5.5. Training
As previously mentioned, the model training func-
tionality built into the CoreNLP dependency parser is
designed to work with annotated parse trees in CoNLL
format (Listing 5), which we don’t have. We therefore
implemented a new, simplified trainer by copying the
original implementation and stripping out the CoNLL-
handling code. The resulting trainer takes feature vectors
and their corresponding transitions and trains the neural
network classifier to match the expected output.
5.6. Unified Click and Forget Flow
In order to facilitate unattended operation, we de-
veloped a unified flow which, given a suitably prepared
training directory, will generate a new model and run
through the training loop until the corpus is fully parsed.
This flow also launches a web server which is reloaded
with the latest model after every iteration, thus allowing us
to monitor the quality of intermediate models at runtime.
1 And CC 3 DEP
2 t h a t DT 3 DEP
3 might MD 0 ROOT
4 have VB 3 VC
5 been VBN 4 VC
6 the DT 7 NMOD
7 case NN 5 PRD
8 . . 3 P
Listing 5. Example training data in CoNLL format
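For reference, one line of this format can be read with a small sketch like the following; the field names are assumptions based on the columns visible in Listing 5 (index, word, POS tag, head index, dependency label), not CoreNLP's own reader.

```java
// Sketch of parsing one whitespace-separated CoNLL-style line as in Listing 5.
public class ConllToken {
    public final int index, head;
    public final String word, pos, label;

    public ConllToken(String line) {
        String[] f = line.trim().split("\\s+");
        index = Integer.parseInt(f[0]); // token position in the sentence (1-based)
        word = f[1];
        pos = f[2];
        head = Integer.parseInt(f[3]);  // 0 means the token is the root
        label = f[4];
    }
}
```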
6. Results
After running our system, we quickly see qualitative
improvements in the resulting parse trees. Instead of a
degenerate parse with no redeeming qualities, we begin
to see parse trees which resemble the output of conven-
tionally trained parsers:
The quick brown fox jumped over the lazy dog
[Arcs labelled det, amod, amod, nsubj, nmod, case, det and amod.]
Figure 7. Example output from conventional model

The quick brown fox jumped over the lazy dog
[Eight arcs, all labelled 'parsed'.]
Figure 8. Example output from trained model
Figure 7 shows a sentence parsed by the Stanford
Dependency Parser using a conventionally trained model.
Compared to our output (Figure 8), we can see that
at a high level, there are obvious similarities in the struc-
tures of the parse trees, with both exhibiting two distinct
clusters, one in the first half of the sentence and another
in the second half.
However, on closer inspection, the clusters in our
output appear to be offset towards the left, causing most
of the dependency arcs to link the wrong words. Although
there are two dependencies linking the correct words (fox
- jumped & lazy - dog), the direction is incorrect in both
cases.
In fact, if we focus on the directions of the arcs, we can
see that most of the arcs in our output are right-bound,
while all but one of the arcs in the canonical parse are
left-bound. This suggests that our efforts to overcome the
absolute left bias of our ‘blank’ model were overzealous
and have produced a substantial right-bias. It is possible
that the use of a sufficiently large input corpus would
overcome this bias, but we were unable to test this.
Quantitatively, we can see in Figure 9 that after a sin-
gle iteration using “The Adventures of Sherlock Holmes”
as a corpus, the number of ‘UNKNOWN’ transitions
decreases by nearly 4.5% from 122530 to 117042, and
continues to decrease, until it begins to level out after
iteration 20. This indicates that while the most frequent
bigrams might be a useful indication of dependency re-
lations, this approach quickly yields diminishing returns as
we look at less common bigrams.
[Line plot: number of UNKNOWN transitions (0 to 1.2 ×10⁵) against iteration number (0 to 30).]
Figure 9. Number of UNKNOWN arcs in parse output
7. Conclusion
We have shown that it is possible to train a dependency
parser model using an unparsed corpus of English lan-
guage text. This was achieved using an iterative training
flow in which every iteration uses heuristics extracted
from past parses to generate training data for the next.
As far as we are aware, there is no prior work
on training parser models without using previously
annotated text, and therefore our project’s main aim
was to determine whether this is possible.
Limitations
Although our research has achieved a positive result,
the current implementation still has three major limita-
tions:
• Quantitatively, the resulting model’s parsing ac-
curacy is very low, due to naïve heuristics and a
small input corpus.
• The proof-of-concept implementation is memory
intensive, which limits the size of the input corpus.
• It is only able to produce unlabelled dependencies,
which are not as informative as labelled dependen-
cies.
8. Further work
There are a number of avenues for future work, rang-
ing from simple implementation optimisations, to major
conceptual hurdles.
8.1. Improve Memory Efficiency
The current implementation produces over a gigabyte
of parse logs even for small corpora of a few megabytes,
which causes it to exhaust the system’s memory when
they are loaded for processing. This limitation could be
overcome by implementing stream-based processing of
log entries. Combined with minor tweaks to minimise the
size of each log entry, this should enable the use of much
larger corpora. We suspect that, in order to achieve optimal
parser accuracy, input corpus size should be on the order
of terabytes.
8.2. Develop Improved Heuristics
As discussed, the heuristic used in this proof-of-
concept was very naïve, so there is great scope to de-
velop novel and sophisticated heuristics which will di-
rectly affect the quality of the resulting parser. Computing
skip-grams, for example, could highlight pairs of part-of-
speech tags which frequently appear together in sentences,
but are separated by one or more words. Such relationships
could provide additional information to overcome the bias
towards greedy dependencies, but are currently missed.
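A minimal sketch of such a count over POS tags, where `k` is the maximum number of intervening tags allowed between the pair; the `SkipGrams` name and method are ours, for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of k-skip bigram counting: count ordered pairs of POS tags that
// occur with up to k intervening tags, which could surface non-adjacent
// dependency candidates that the plain bigram heuristic misses.
public class SkipGrams {
    public static Map<String, Integer> count(List<String> tags, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < tags.size(); i++) {
            // gap == 1 is an adjacent pair; gap == k + 1 skips k tags.
            for (int gap = 1; gap <= k + 1 && i + gap < tags.size(); gap++) {
                counts.merge(tags.get(i) + " " + tags.get(i + gap), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With k = 0 this reduces to the plain bigram histogram of subsection 5.3, so the two heuristics could share one implementation.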
8.3. Introduce Arc Labels
Although there is no obvious way to derive arc labels
from unparsed text, it might be possible to use unsu-
pervised learning techniques (e.g clustering) to identify
commonly occurring types of arcs, which could then be
used as labels.
9. Acknowledgements
I would like to thank my supervisors, Christian Kissig,
Marek Grześ and Laura Bocchi, for their continued sup-
port and guidance throughout this challenging project.
References
[1] H. Cunningham, D. Maynard, K. Bontcheva,
V. Tablan, N. Aswani, I. Roberts, G. Gorrell,
A. Funk, A. Roberts, D. Damljanovic, T. Heitz,
M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and
W. Peters, Text Processing with GATE (Version 6),
2011. [Online]. Available: http://tinyurl.com/gatebook
[2] D. Jurafsky and J. H. Martin, Speech and Language
Processing. Prentice Hall, 2000.
[3] M. A. Covington, “A fundamental algorithm for de-
pendency parsing,” in Proceedings of the 39th annual
ACM southeast ..., 2001.
[4] J. Nivre, “Dependency Parsing,” vol. 4, no. 3, pp. 138–
152, Mar. 2010.
[5] N. Green, “Dependency Parsing,” in WDS, Dec. 2011,
pp. 1–6.
[6] D. Chen and C. D. Manning, “A Fast and Accurate
Dependency Parser using Neural Networks.” EMNLP,
pp. 740–750, 2014.
[7] C. Manning, M. Surdeanu, J. Bauer, J. Finkel,
S. Bethard, and D. McClosky, “The Stanford
CoreNLP Natural Language Processing Toolkit,” in
Proceedings of 52nd Annual Meeting of the Associ-
ation for Computational Linguistics: System Demon-
strations. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2014, pp. 55–60.
[8] O. Ben-Kiki and C. Evans. YAML Ain’t Markup
Language Version 1.2 Specification. [Online].
Available: http://www.yaml.org/spec/1.2/spec.html

More Related Content

PDF
Text summarization
PDF
text summarization using amr
PPTX
Text summarization
PDF
Text Summarization
PDF
Extraction Based automatic summarization
PDF
Document Summarization
PDF
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
PDF
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
Text summarization
text summarization using amr
Text summarization
Text Summarization
Extraction Based automatic summarization
Document Summarization
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION

What's hot (20)

PDF
A Survey of Various Methods for Text Summarization
PDF
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
PDF
Improving Neural Abstractive Text Summarization with Prior Knowledge
PDF
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
PDF
Conceptual framework for abstractive text summarization
PPTX
Word embedding
PDF
semantic text doc clustering
PDF
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
PDF
Y24168171
PDF
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES
PDF
Multi Document Text Summarization using Backpropagation Network
PDF
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
PDF
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
PPTX
Intent Classifier with Facebook fastText
PDF
Abstractive Text Summarization
PDF
Towards Building Parallel Dependency Treebanks: Intra-Chunk Expansion and Ali...
PDF
Turkish language modeling using BERT
PDF
Indexing of Arabic documents automatically based on lexical analysis
PDF
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
PDF
TSD2013.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION
A Survey of Various Methods for Text Summarization
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
Improving Neural Abstractive Text Summarization with Prior Knowledge
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
Conceptual framework for abstractive text summarization
Word embedding
semantic text doc clustering
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
Y24168171
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES
Multi Document Text Summarization using Backpropagation Network
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
Intent Classifier with Facebook fastText
Abstractive Text Summarization
Towards Building Parallel Dependency Treebanks: Intra-Chunk Expansion and Ali...
Turkish language modeling using BERT
Indexing of Arabic documents automatically based on lexical analysis
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
TSD2013.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION
Ad

Similar to Understanding Natural Languange with Corpora-based Generation of Dependency Grammars (20)

PDF
Cc35451454
PDF
Extractive Document Summarization - An Unsupervised Approach
PDF
Extractive Summarization with Very Deep Pretrained Language Model
PDF
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
PDF
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
PPTX
3__Python - Tool Text summarization.pptx
PDF
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
DOC
amta-decision-trees.doc Word document
PDF
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
PPTX
PDF
EasyChair-Preprint-7375.pdf
PDF
Lectura 3.5 word normalizationintwitter finitestate_transducers
PDF
Isolated word recognition using lpc & vector quantization
PDF
Isolated word recognition using lpc &amp; vector quantization
PPTX
Unit II Natural Language Processing.pptx
PDF
G04124041046
PDF
Class Diagram Extraction from Textual Requirements Using NLP Techniques
PDF
D017232729
PDF
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
PDF
An expert system for automatic reading of a text written in standard arabic
Cc35451454
Extractive Document Summarization - An Unsupervised Approach
Extractive Summarization with Very Deep Pretrained Language Model
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
3__Python - Tool Text summarization.pptx
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
amta-decision-trees.doc Word document
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
EasyChair-Preprint-7375.pdf
Lectura 3.5 word normalizationintwitter finitestate_transducers
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc &amp; vector quantization
Unit II Natural Language Processing.pptx
G04124041046
Class Diagram Extraction from Textual Requirements Using NLP Techniques
D017232729
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
An expert system for automatic reading of a text written in standard arabic
Ad

Understanding Natural Languange with Corpora-based Generation of Dependency Grammars

  • 1. Corpora-Based Generation of Dependency Parser Models for Natural Language Processing Edmond C. Lepedus School of Computing University of Kent Canterbury, UK Email: el210@kent.ac.uk Abstract—In this paper, we show that it is possible to train a dependency parser model using an unparsed corpus of English language text. This is a novel development in computational linguistics, with the potential to transform parser generation. Parsing is an essential part of natural language process- ing, which is currently performed using parsers trained on manually parsed and annotated texts. However, the training data required is expensive to produce, which limits the size and availability of training sets, and has a knock-on effect on the performance of the resulting parser. In order to negate the need for annotated training data, we develop an iterative training flow which generates train- ing examples using heuristics extracted from past parsing decisions. We show that the parse trees produced using parsers trained in this way bear qualitative resemblance to those produced by conventionally trained parsers, and propose three avenues for future research. Index Terms—Natural Language Processing; Dependency Parsing; Grammar Generation; 1. Introduction Parsing is a fundamental part of natural language pro- cessing. It extracts the syntactical structure of the sentence in order to provide clues about the underlying meaning. Parser training requires large quantities of text which has been manually parsed and annotated by human lin- guists using tools such as GATE [1]. The specific infor- mation included in an annotation depends on the goals of the particular annotation initiative, but for linguistics, it would typically contain a canonical ‘gold’ parse tree for each sentence in the corpus (subsection 2.1 shows one such tree). The production of high-quality training sets can take months of effort by skilled linguists, and is therefore very expensive. 
Even when the resulting data is made available free of charge, the cost of producing it limits the size of the training sets that can be developed. This problem is particularly pronounced when the number of linguists available to contribute to such a project is small, as in the case of dying or already dead languages.

By enabling the use of unparsed texts for training, we make every written work available as training data, thus minimising the cost of training data while simultaneously increasing its availability.

In this paper, we show that it is possible to train a parser model using an unparsed corpus of English language text, and introduce an iterative training flow to support further research. Moreover, we do this by modifying the Stanford CoreNLP Dependency Parser, which places our results in a well-known context and makes them easier to replicate or expand upon.

We present brief primers on parsing and the Stanford CoreNLP in section 2, and provide more information about our aims and the scope of the project in section 3. The main body of the paper starts with a high-level overview of our approach (section 4), followed by specific details of our implementation (section 5). We present concrete and qualitative results of our training in section 6, and an analysis of the outcome in section 7. We suggest three avenues for future work in section 8.

2. Background

Parsing is an essential part of any natural language processing pipeline. In our case, it takes sentences which have been tokenised and annotated with their parts of speech (e.g. Listing 1), and works out the syntactical structure between them in order to provide clues about the underlying meaning [2].

Sentence #1 (7 tokens):
The cat sat on the mat.
[Text=The CharacterOffsetBegin=47 CharacterOffsetEnd=50 PartOfSpeech=DT]
[Text=cat CharacterOffsetBegin=51 CharacterOffsetEnd=54 PartOfSpeech=NN]
[Text=sat CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD]
[Text=on CharacterOffsetBegin=59 CharacterOffsetEnd=61 PartOfSpeech=IN]
[Text=the CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=DT]
[Text=mat CharacterOffsetBegin=66 CharacterOffsetEnd=69 PartOfSpeech=NN]
[Text=. CharacterOffsetBegin=69 CharacterOffsetEnd=70 PartOfSpeech=.]
Listing 1. Example of parser input which has been tokenised, split into sentences and Part-of-Speech tagged
2.1. Constituency Parsing

Traditionally, computational linguistics has approached the task by breaking “sentences into constituents (phrases), which are then broken into smaller constituents” [3], until each constituent is a single word. This is grounded in the notion of constituency grammars, which has been passed down from the ancient Stoics to linguists through formal logic [3], and is typically represented using constituency trees, as in Figure 1.

Figure 1. Example constituency-based parse tree output (a parse of “This is an example of a constituency tree”)

2.2. Dependency Parsing

There exists a different, and older, parsing tradition, which assumes that sentence structure consists of words linked by binary asymmetric relations called dependencies [4]. These relations involve a syntactically subordinate word called a dependent, and the word on which it depends, known as the head. This is known as dependency parsing, and results in a representation known as a dependency tree. An example of such a tree can be seen in Figure 2.

Due to its ability to parse languages with loose constraints on the ordering of words in a sentence (e.g. Finnish, Polish), dependency parsing has seen renewed interest in the natural language processing community, and is considered state-of-the-art [5].

Figure 2. Example dependency-based parse tree (a parse of “This is an example of a dependency tree”; each arc label describes the dependency type)

2.3. Transition-Based Dependency Parsers

Transition-based dependency parsing “is a purely data-driven method that makes no use of a formal grammar but relies on machine learning from treebank data” [4]. A transition-based parser “learns to score possible next actions in a state machine” [4] in order to produce a dependency tree. The state machine is known as a transition system, and the possible actions are called transitions [4].
A transition system consists of configurations representing partial parses of the sentence, and a set of transitions used to move between them. A configuration consists of a stack, a buffer and the list of transitions which have been applied to reach it. The transitions are:

• SHIFT: move a word from the buffer to the stack
• LEFT-ARC: remove the second word from the stack, and add a dependency arc between it and the first
• RIGHT-ARC: remove the first word from the stack, and add a dependency arc between it and the second

A terminal configuration is one which has only the special “ROOT” word on the stack and an empty buffer. Parsing proceeds by applying the best transition, as scored by an evaluation function, until a terminal configuration is reached. Various evaluation functions can be used, ranging from deterministic rule-based implementations to neural network classifiers which are trained to pick the optimum transition given a configuration.

Figure 3 illustrates the transition-based parsing process, whereby the words on the buffer are gradually consumed and a set of dependency arcs is produced. In this example, we used our own knowledge as an interactive evaluation function.

2.4. Stanford CoreNLP

The Stanford CoreNLP toolkit is a free, open-source, high-quality set of natural language processing tools. It includes a dependency parser based on a neural network classifier, which can parse 1000 sentences per second at an accuracy of 92.2% [6]. The system is structured as a pipeline, which takes input text and applies a series of annotations to it. The specific annotators used can be varied, and range from simple tokenisers and sentence splitters to high-level sentiment analysis [7]. The linear pipeline allows the output of previous annotators to be used by subsequent ones, and thus achieves great separation of responsibility and modularity. Figure 4 illustrates the CoreNLP annotation pipeline.
The CoreNLP Dependency Parser uses a transition system with a neural network classifier as its scoring function. The classifier takes a vector of features extracted from the current system configuration and returns a vector with a score for each possible transition. During training, it is supplied with examples consisting of input feature vectors and a desired transition, and its weights are modified using Adaptive Gradient Descent until it reliably outputs vectors whose maximum scores correspond to the target transitions. The specific features used are covered in detail by Chen and Manning [6] and remain unchanged in our system. Broadly speaking, they include word, part-of-speech and arc label embeddings for words on both the stack and the buffer, and for some of their children as described by existing dependency arcs.
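The transition system described in subsection 2.3 can be made concrete with a minimal sketch of an arc-standard implementation. The class and method names below are illustrative only, not CoreNLP's actual API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of an arc-standard transition system: a configuration is a
// stack, a buffer and the set of dependency arcs produced so far.
// Names are illustrative, not CoreNLP's actual API.
class ArcStandard {
    final Deque<String> stack = new ArrayDeque<>();   // top = first word
    final Deque<String> buffer = new ArrayDeque<>();  // head = next word
    final List<String> arcs = new ArrayList<>();

    ArcStandard(String... words) {
        stack.push("-ROOT-");
        for (String w : words) buffer.addLast(w);
    }

    // SHIFT: move the next word from the buffer onto the stack.
    void shift() { stack.push(buffer.removeFirst()); }

    // LEFT-ARC: the second word on the stack becomes a dependent of the first.
    void leftArc() {
        String head = stack.pop();
        String dependent = stack.pop();
        arcs.add("L(" + dependent + "," + head + ")");
        stack.push(head);
    }

    // RIGHT-ARC: the first word on the stack becomes a dependent of the second.
    void rightArc() {
        String dependent = stack.pop();
        String head = stack.pop();
        arcs.add("R(" + head + "," + dependent + ")");
        stack.push(head);
    }

    // Terminal configuration: only -ROOT- remains and the buffer is empty.
    boolean isTerminal() { return stack.size() == 1 && buffer.isEmpty(); }

    public static void main(String[] args) {
        // Replay the action column of Figure 3.
        ArcStandard c = new ArcStandard("The", "cat", "sat", "on", "the", "mat");
        c.shift(); c.shift(); c.leftArc(); c.shift(); c.leftArc();
        c.shift(); c.shift(); c.shift(); c.leftArc(); c.leftArc();
        c.rightArc(); c.rightArc();
        System.out.println(c.arcs + " terminal=" + c.isTerminal());
    }
}
```

Replaying the action column of Figure 3 against this sketch reproduces its arc set and ends in a terminal configuration.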
Stack                     | Buffer                       | Arcs                                                                    | Action
-ROOT-                    | The, cat, sat, on, the, mat  | ∅                                                                       | SHIFT
-ROOT-, The               | cat, sat, on, the, mat       | ∅                                                                       | SHIFT
-ROOT-, The, cat          | sat, on, the, mat            | ∅                                                                       | LEFT
-ROOT-, cat               | sat, on, the, mat            | L(The,cat)                                                              | SHIFT
-ROOT-, cat, sat          | on, the, mat                 | L(The,cat)                                                              | LEFT
-ROOT-, sat               | on, the, mat                 | L(The,cat), L(cat,sat)                                                  | SHIFT
-ROOT-, sat, on           | the, mat                     | L(The,cat), L(cat,sat)                                                  | SHIFT
-ROOT-, sat, on, the      | mat                          | L(The,cat), L(cat,sat)                                                  | SHIFT
-ROOT-, sat, on, the, mat |                              | L(The,cat), L(cat,sat)                                                  | LEFT
-ROOT-, sat, on, mat      |                              | L(The,cat), L(cat,sat), L(the,mat)                                      | LEFT
-ROOT-, sat, mat          |                              | L(The,cat), L(cat,sat), L(the,mat), L(on,mat)                           | RIGHT
-ROOT-, sat               |                              | L(The,cat), L(cat,sat), L(the,mat), L(on,mat), R(sat,mat)               | RIGHT
-ROOT-                    |                              | L(The,cat), L(cat,sat), L(the,mat), L(on,mat), R(sat,mat), R(-ROOT-,sat)| DONE
Figure 3. An example of transition-based dependency parsing

Figure 4. CoreNLP architecture. Reproduced from Manning et al. [7].

3. Aims

We set out to train a parser model using an unparsed corpus of English language text.

Parser performance is normally evaluated using Labelled and/or Unlabelled Attachment Scores (LAS/UAS). Both metrics measure the proportion of words which are assigned the correct head, but LAS also requires that the correct label is applied to the dependency relation.

However, given the early stage of this line of research, the quantitative parsing accuracy of the resulting model is of secondary concern: simply being able to develop a model which begins to approximate the types of dependency trees produced by conventionally trained parsers is a significant step forward.

Therefore, in this preliminary research, we will only be using placeholder ‘UNKNOWN’ and ‘PARSED’ dependency labels, rather than the full set of linguistically meaningful labels, as the derivation of suitable labels from unparsed text merits its own research project.
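The two attachment scores can be sketched in a few lines. The class and array names below are illustrative, not part of any evaluation toolkit; a word contributes to UAS when its predicted head matches the gold head, and to LAS only when the label matches as well:

```java
// Sketch of UAS/LAS computation. heads[i] holds the head index of word i
// (0 = ROOT) and labels[i] its dependency label. Names are illustrative.
class AttachmentScore {
    static double[] score(int[] goldHeads, String[] goldLabels,
                          int[] predHeads, String[] predLabels) {
        int uas = 0, las = 0, n = goldHeads.length;
        for (int i = 0; i < n; i++) {
            if (predHeads[i] == goldHeads[i]) {
                uas++;  // correct head: counts towards UAS
                if (predLabels[i].equals(goldLabels[i])) {
                    las++;  // correct head and label: counts towards LAS
                }
            }
        }
        return new double[] { (double) uas / n, (double) las / n };
    }
}
```

For a three-word sentence in which every head is correct but one label is wrong, this yields UAS = 1.0 and LAS ≈ 0.67, illustrating that LAS can never exceed UAS.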
In order to evaluate the outcome of our research, we will be inspecting the high-level structure of the dependency tree, looking for non-trivial relations which make use of both left and right arcs and link non-adjacent words where appropriate.

4. Approach

We took a high-quality, open-source parser and modified it to support an iterative training flow which allows us to gradually develop a model that can parse the whole corpus. Instead of extracting training examples for the classifier from an annotated corpus, our system uses its own past parse decisions, modified in accordance with a heuristic extracted from the corpus.

4.1. Overview

We start by generating a ‘blank’ parser model (subsection 5.1), which is then used to parse the corpus while logging every parsing decision made (subsection 5.2). From the parse log, we extract heuristics (subsection 5.3), which are then used to produce improved training examples from the logged decisions (subsection 5.4). The classifier is trained on the resulting data, and the next iteration starts. A significant advantage of this flow is that a new model is generated at the end of each iteration, and can be evaluated and used while the training continues.

5. Implementation

We decided to modify an existing parser, rather than develop our own, in order to circumscribe the scope of our work and facilitate its evaluation. CoreNLP provides a robust and well-documented foundation with a state-of-the-art dependency parser implementation in an accessible open-source package, making it an ideal match.

5.1. Blank Model Creation

A key requirement of our project was the ability to generate a blank initial model which could gradually be populated. However, the CoreNLP implementation is built on the assumption that its model will always return a fully formed parse tree.
In order to satisfy CoreNLP without extensive modifications, we therefore decided to use its own model generation functionality, but co-opt it to produce a trivial model which just outputs left-bound arcs with a custom ‘UNKNOWN’ label, as seen in Figure 6 below.
Figure 5. Training flow: generate ‘blank’ model, parse corpus, extract heuristics, generate training examples, train parser; repeat

Figure 6. Example output from the ‘blank’ model (every word attached by a left-bound arc labelled UNKNOWN)

Obviously, the parse trees produced by this model are of little use for real-world parsing, but they are essential to bootstrap the training example generation process.

In order to generate the model, we subclassed the DependencyParser class and modified its training functionality to use a new deterministic oracle. The original oracle would extract the correct arc directions and labels for a configuration from the parse trees in the annotated training set. However, since we have no parse trees to use, and no way of extracting labels, we replaced it with our own code, which deterministically returns one of three transitions based on hard-coded rules.

The new implementation (Listing 2) takes a transition system configuration, examines its stack, buffer and existing transitions, and returns a new transition. Conceptually, it tries to output a left arc, using a few rules to ensure that the transition is valid (e.g. words do not acquire multiple heads or dependents, and the special -ROOT- word does not become a dependent); otherwise, it falls back to either a SHIFT or a right arc, each of which must satisfy certain conditions.

public String getOraclePrediction(Configuration c) {
    int w1 = c.getStack(1);
    int w2 = c.getStack(0);
    if (c.getStackSize() < 3) {
        if (c.getBufferSize() > 0) {
            return "S";
        } else {
            return "R(UNKNOWN)";
        }
    } else if (c.getChildCount(w1) < 1) {
        return "R(UNKNOWN)";
    } else if (c.getChildCount(w2) < 1) {
        return "L(UNKNOWN)";
    }
    return null;
}
Listing 2. Deterministic oracle implementation

The key property of this new oracle is that, given a valid input configuration, it will always provide a transition which results in a new configuration that is also valid, thus allowing us to generate a model which will always produce a valid parse tree.

Although this approach provided a quick way to bootstrap the system, a more robust approach would be to develop a standalone system which implements the oracle and outputs a CoNLL treebank, which the standard CoreNLP dependency parser can then use as training data without any modifications.

5.2. Logging

We modified the subclassed dependency parser to log every parsing decision to a YAML file. YAML [8] is a data serialisation language with a human-readable syntax, which allows us to visually inspect and manually modify the parser’s decisions while maintaining the ability to load the file back into the program. Listing 3 shows an example of the YAML output.

1  ---
2  !!uk.ac.kent.parser.ParserLogEntry
3  arcs: ['PARSED(Example,log)']
4  bufferPOS: []
5  bufferWords: []
6  features: [2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 839, 837, 837, 838, 838, 838, 838, 838, 838, 838, 838, 838, 838, 837, 838, 838, 838, 838, 841, 841, 841, 841, 841, 841, 841, 844, 841, 841, 841, 841]
7  partOfSpeechArcs: ['PARSED(NN,NN)']
8  stackPOS: [-ROOT-, NN, NN]
9  stackWords: [-ROOT-, Example, output]
10 transition: R(PARSED)
11 ---
Listing 3. Example YAML output

The features vector (line 6) and the transition (line 10) are needed to train the classifier, as previously described in subsection 2.4, while the rest of the logged information provides additional context for heuristic extraction and human inspection.

5.3. Heuristic Extraction

The heuristic we used for this proof of concept simply counted how often pairs of part-of-speech tags, called bigrams, occur in the input corpus.
Although dependency relations hold between words, we use POS bigrams as a higher-level abstraction which allows us to generalise to unseen words, and assume that frequent bigrams are a good indication of dependency between the two part-of-speech tags.
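A minimal sketch of this bigram counting follows; the class and method names are illustrative, not our actual analyser classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the part-of-speech bigram histogram: count how often each pair of
// adjacent POS tags occurs in the tagged input. Names are illustrative.
class BigramCounter {
    static Map<String, Integer> count(String[] posTags) {
        Map<String, Integer> histogram = new LinkedHashMap<>();
        for (int i = 0; i + 1 < posTags.length; i++) {
            String bigram = posTags[i] + " " + posTags[i + 1];
            histogram.merge(bigram, 1, Integer::sum);  // increment, inserting 1 if absent
        }
        return histogram;
    }
}
```

Tagging “The cat sat on the mat.” as DT NN VBD IN DT NN . and counting gives ‘DT NN’ a count of 2, matching its position at the top of the Sherlock Holmes histogram in Listing 4.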
Our corpus analyser runs the input text through a tagger and scans through the tagged output, counting how many times each bigram is encountered. The parse log analyser performs a similar function, but counts how many times bigrams are encountered on the parser’s stack. Listing 4 shows an example of extracted bigrams.

DT NN   6034
IN DT   5003
.       4707
NN IN   3755
PRP VBD 3256
JJ NN   2783
NN ,    2653
NN .    2615
DT JJ   2551
...
Listing 4. Example of Part-of-Speech bigram histogram extracted from the Sherlock Holmes corpus

It is worth mentioning that this heuristic, apart from being naïve, is also rooted in the assumption that the order of words is significant. This is generally true in English but, as we discussed in subsection 2.2, there are other languages for which this assumption does not hold, and for which different heuristics would therefore be required.

5.4. Training Example Generation

A training example consists of a feature vector and an associated transition. Both of these elements are present in the parser log output, and can be extracted and modified to generate new training examples. In fact, our implementation exploits this similarity: training examples are just log entries which have had their transition modified and been saved to a different file.

The exact modifications depend on the specific heuristics extracted in the previous step, but in the current implementation, ‘UNKNOWN’ transitions which have the most frequent bigram at the top of the stack are replaced with ‘L(PARSED)’ transitions.

5.5. Training

As previously mentioned, the model training functionality built into the CoreNLP dependency parser is designed to work with annotated parse trees in CoNLL format (Listing 5), which we do not have. We therefore implemented a new, simplified trainer by copying the original implementation and stripping out the CoNLL-handling code.
The resulting trainer takes feature vectors and their corresponding transitions and trains the neural network classifier to match the expected output.

5.6. Unified Click-and-Forget Flow

In order to facilitate unattended operation, we developed a unified flow which, given a suitably prepared training directory, will generate a new model and run through the training loop until the corpus is fully parsed. This flow also launches a web server which is reloaded with the latest model after every iteration, thus allowing us to monitor the quality of intermediate models at runtime.

1 And   CC  3 DEP
2 that  DT  3 DEP
3 might MD  0 ROOT
4 have  VB  3 VC
5 been  VBN 4 VC
6 the   DT  7 NMOD
7 case  NN  5 PRD
8 .     .   3 P
Listing 5. Example training data in CoNLL format

6. Results

After running our system, we quickly see qualitative improvements in the resulting parse trees. Instead of a degenerate parse with no redeeming qualities, we begin to see parse trees which resemble the output of conventionally trained parsers.

Figure 7. Example output from a conventional model: “The quick brown fox jumped over the lazy dog”, with arcs labelled det, amod, amod, nsubj, nmod, case, det, amod

Figure 8. Example output from our trained model: the same sentence, with every arc labelled ‘parsed’

Figure 7 shows a sentence parsed by the Stanford Dependency Parser using a conventionally trained model. Compared to our output (Figure 8), we can see that, at a high level, there are obvious similarities in the structures of the parse trees, with both exhibiting two distinct clusters, one in the first half of the sentence and another in the second half. However, on closer inspection, the clusters in our output appear to be offset towards the left, causing most of the dependency arcs to link the wrong words. Although there are two dependencies linking the correct words (fox–jumped and lazy–dog), the direction is incorrect in both cases.
In fact, if we focus on the directions of the arcs, we can see that most of the arcs in our output are right-bound, while all but one of the arcs in the canonical parse are left-bound. This suggests that our efforts to overcome the absolute left bias of our ‘blank’ model were overzealous and have produced a substantial right bias. It is possible that the use of a sufficiently large input corpus would overcome this bias, but we were unable to test this.

Quantitatively, we can see in Figure 9 that after a single iteration using “The Adventures of Sherlock Holmes” as a corpus, the number of ‘UNKNOWN’ transitions decreases by nearly 4.5%, from 122,530 to 117,042, and continues to decrease until it begins to level out after iteration 20. This indicates that while the most frequent bigrams might be a useful indication of dependency relations, the approach quickly reaches diminishing returns as we look at less common bigrams.

Figure 9. Number of UNKNOWN arcs in parse output (y axis: UNKNOWN transitions, ×10^5; x axis: iteration number)

7. Conclusion

We have shown that it is possible to train a dependency parser model using an unparsed corpus of English language text. This was achieved using an iterative training flow in which every iteration uses heuristics extracted from past parses to generate training data for the next. As far as we are aware, there is no prior work on training parser models without using previously annotated text, and therefore our project’s main aim was to determine whether this is possible.

Limitations

Although our research has achieved a positive result, the current implementation still has three major limitations:

• Quantitatively, the resulting model’s parsing accuracy is very low, due to naïve heuristics and a small input corpus.
• The proof-of-concept implementation is memory intensive, which limits the size of the input corpus.
• It is only able to produce unlabelled dependencies, which are not as informative as labelled dependencies.

8. Further Work

There are a number of avenues for future work, ranging from simple implementation optimisations to major conceptual hurdles.

8.1. Improve Memory Efficiency

The current implementation produces over a gigabyte of parse logs even for small corpora of a few megabytes, which causes it to exhaust the system’s memory when they are loaded for processing. This limitation could be overcome by implementing stream-based processing of log entries.
Combined with minor tweaks to minimise the size of each log entry, this should enable the use of much larger corpora. We suspect that, in order to achieve optimal parser accuracy, the input corpus size should be on the order of terabytes.

8.2. Develop Improved Heuristics

As discussed, the heuristic used in this proof of concept was very naïve, so there is great scope to develop novel and sophisticated heuristics, which will directly affect the quality of the resulting parser. Computing skip-grams, for example, could highlight pairs of part-of-speech tags which frequently appear together in sentences but are separated by one or more words. Such relationships could provide additional information to overcome the bias towards greedy dependencies, but are currently missed.

8.3. Introduce Arc Labels

Although there is no obvious way to derive arc labels from unparsed text, it might be possible to use unsupervised learning techniques (e.g. clustering) to identify commonly occurring types of arcs, which could then be used as labels.

9. Acknowledgements

I would like to thank my supervisors, Christian Kissig, Marek Grześ and Laura Bocchi, for their continued support and guidance throughout this challenging project.

References

[1] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters, Text Processing with GATE (Version 6), 2011. [Online]. Available: http://tinyurl.com/gatebook
[2] D. Jurafsky and J. H. Martin, Speech and Language Processing. Prentice Hall, 2000.
[3] M. A. Covington, “A fundamental algorithm for dependency parsing,” in Proceedings of the 39th Annual ACM Southeast Conference, 2001.
[4] J. Nivre, “Dependency Parsing,” vol. 4, no. 3, pp. 138–152, Mar. 2010.
[5] N. Green, “Dependency Parsing,” in WDS, Dec. 2011, pp. 1–6.
[6] D. Chen and C. D. Manning, “A Fast and Accurate Dependency Parser using Neural Networks,” in EMNLP, 2014, pp. 740–750.
[7] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 55–60.
[8] O. Ben-Kiki and C. Evans, YAML Ain’t Markup Language Version 1.2 Specification. [Online]. Available: http://www.yaml.org/spec/1.2/spec.html