Corpora-Based Generation of Dependency Parser Models for Natural Language
Processing
Edmond C. Lepedus
School of Computing
University of Kent
Canterbury, UK
Email: el210@kent.ac.uk
Abstract—In this paper, we show that it is possible to
train a dependency parser model using an unparsed corpus
of English language text. This is a novel development in
computational linguistics, with the potential to transform
parser generation.
Parsing is an essential part of natural language process-
ing, which is currently performed using parsers trained on
manually parsed and annotated texts.
However, the training data required is expensive to
produce, which limits the size and availability of training
sets, and has a knock-on effect on the performance of the
resulting parser.
In order to negate the need for annotated training data,
we develop an iterative training flow which generates train-
ing examples using heuristics extracted from past parsing
decisions.
We show that the parse trees produced using parsers
trained in this way bear qualitative resemblance to those
produced by conventionally trained parsers, and propose
three avenues for future research.
Index Terms—Natural Language Processing; Dependency
Parsing; Grammar Generation;
1. Introduction
Parsing is a fundamental part of natural language pro-
cessing. It extracts the syntactical structure of the sentence
in order to provide clues about the underlying meaning.
Parser training requires large quantities of text which
has been manually parsed and annotated by human lin-
guists using tools such as GATE [1]. The specific infor-
mation included in an annotation depends on the goals
of the particular annotation initiative, but for linguistics,
it would typically contain a canonical ‘gold’ parse tree
for each sentence in the corpus (subsection 2.1 shows one
such tree).
The production of high-quality training sets can take
months of effort by skilled linguists, and is therefore very
expensive. Even when the resulting data is made available
free of charge, the cost of producing it limits the size of
the training sets that can be developed.
This problem is particularly pronounced when the
number of linguists available to contribute to such a
project is small, such as in the case of dying or already
dead languages.
By enabling the use of unparsed texts for training, we
make every written work available as training data, thus
minimising the cost of training data while simultaneously
increasing its availability.
In this paper, we show that it is possible to train a
parser model using an unparsed corpus of English lan-
guage text, and introduce an iterative training flow to sup-
port further research. Moreover, we do this by modifying
the Stanford CoreNLP Dependency Parser, which places
our results in a well known context and makes them easier
to replicate or expand upon.
We present brief primers on parsing and the Stanford
CoreNLP in section 2, and provide more information
about our aims and the scope of the project in section 3.
The main body of the paper starts with a high-level
overview of our approach (section 4), followed by specific
details of our implementation (section 5). We present
concrete and qualitative results of our training in section 6,
and an analysis of the outcome in section 7. We suggest
three avenues for future work in section 8.
2. Background
Parsing is an essential part of any natural language
processing pipeline. In our case, it takes sentences which
have been tokenised and annotated with their parts of
speech (e.g. Listing 1), and works out the syntactical
structure between the words in order to provide clues about
the underlying meaning [2].
Sentence #1 (7 tokens):
The cat sat on the mat.
[Text=The CharacterOffsetBegin=47 CharacterOffsetEnd=50 PartOfSpeech=DT]
[Text=cat CharacterOffsetBegin=51 CharacterOffsetEnd=54 PartOfSpeech=NN]
[Text=sat CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD]
[Text=on CharacterOffsetBegin=59 CharacterOffsetEnd=61 PartOfSpeech=IN]
[Text=the CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=DT]
[Text=mat CharacterOffsetBegin=66 CharacterOffsetEnd=69 PartOfSpeech=NN]
[Text=. CharacterOffsetBegin=69 CharacterOffsetEnd=70 PartOfSpeech=.]
Listing 1. Example of parser input which has been tokenised, split into
sentences and Part-of-Speech tagged
2.1. Constituency Parsing
Traditionally, computational linguistics has
approached the task by breaking “sentences into
constituents (phrases), which are then broken into smaller
constituents” [3], until each constituent is a single word.
This is grounded in the notion of constituency grammars,
which has been passed down from the ancient Stoics
to linguists through formal logic [3], and is typically
represented using constituency trees as in Figure 1.
[Constituency tree for the sentence "This is an example of a constituency tree", breaking it down through S, NP, VP and PP constituents to individual words.]
Figure 1. Example constituency-based parse tree output
2.2. Dependency Parsing
There exists a different, and older, parsing tradition,
which assumes that sentence structure consists of words
linked by binary asymmetric relations called dependencies
[4]. These relations involve a syntactically subordinate
word called a dependent, and the word on which it de-
pends — the head. This is known as dependency parsing,
and results in a representation known as a dependency
tree. An example of such a tree can be seen in Figure 2.
Due to its ability to parse languages with loose con-
straints on the ordering of words in a sentence (e.g.
Finnish, Polish), dependency parsing has seen renewed
interest in the natural language processing community, and
is considered state-of-the-art [5].
[Dependency tree for the sentence "This is an example of a dependency tree", with arcs labelled ROOT, nsubj, cop, det, nmod, case, det and compound; the arc label describes the dependency type.]
Figure 2. Example dependency-based parse tree
2.3. Transition-Based Dependency Parsers
Transition-based dependency parsing “is a purely data-
driven method that makes no use of a formal grammar
but relies on machine learning from treebank data” [4].
A transition based parser “learns to score possible next
actions in a state machine” [4] in order to produce a de-
pendency tree. The state machine is known as a transition
system, and the possible actions are called transitions [4].
A transition system consists of configurations representing
partial parses of the sentence and a set of transitions used
to move between them. A configuration consists of a stack,
a buffer and the list of transitions which have been applied
to reach it. The transitions are:
• SHIFT: move a word from the buffer to the stack
• LEFT-ARC: remove the second word from the
stack, and add a dependency arc between it and
the first
• RIGHT-ARC: remove the first word from the
stack, and add a dependency arc between it and
the second word
A terminal configuration is one which has only the
special “ROOT” word on the stack and an empty buffer.
Parsing proceeds by applying the best transition, as scored
by an evaluation function, until a terminal configuration
is reached.
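The three transitions above can be captured in a minimal arc-standard sketch. The `ArcStandard`, `Config` and `apply` names are ours for illustration, not CoreNLP's; arcs are written in the L(dependent, head) / R(head, dependent) notation used later in Figure 3, and the sketch assumes it is only given valid transition sequences.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the arc-standard transition system described above.
public class ArcStandard {
    public static class Config {
        public final Deque<String> stack = new ArrayDeque<>();  // peek() is the first (top) word
        public final Deque<String> buffer = new ArrayDeque<>(); // poll() is the next input word
        public final List<String> arcs = new ArrayList<>();

        public Config(List<String> words) {
            stack.push("-ROOT-");
            buffer.addAll(words);
        }

        // Terminal: only -ROOT- on the stack and an empty buffer.
        public boolean isTerminal() {
            return buffer.isEmpty() && stack.size() == 1;
        }
    }

    // Apply one transition: "S" (SHIFT), "L" (LEFT-ARC) or "R" (RIGHT-ARC).
    public static void apply(Config c, String t) {
        switch (t) {
            case "S": // move the next buffer word onto the stack
                c.stack.push(c.buffer.poll());
                break;
            case "L": { // second stack word becomes a dependent of the first
                String head = c.stack.pop();
                String dep = c.stack.pop();
                c.arcs.add("L(" + dep + "," + head + ")");
                c.stack.push(head);
                break;
            }
            case "R": { // first stack word becomes a dependent of the second
                String dep = c.stack.pop();
                String head = c.stack.peek();
                c.arcs.add("R(" + head + "," + dep + ")");
                break;
            }
        }
    }
}
```

Running the transition sequence from Figure 3 through this sketch reproduces the arc set shown there.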
Various evaluation functions can be used, ranging
from deterministic rule-based implementations, to neural
network classifiers which are trained to pick the optimum
transition given a configuration.
Figure 3 illustrates the transition-based parsing pro-
cess, whereby the words on the buffer are gradually con-
sumed and a set of dependency arcs is produced. In this
example, we used our own knowledge as an interactive
evaluation function.
2.4. Stanford CoreNLP
The Stanford CoreNLP toolkit is a free, open source,
high quality set of natural language processing tools. It
includes a dependency parser based on a neural network
classifier, which can parse 1000 sentences per second at
an accuracy of 92.2% [6]. The system is structured as a
pipeline, which takes input text and applies a series of
annotations to it. The specific annotators used can be
varied and range from simple tokenisers and sentence
splitters to high-level sentiment analysis [7]. The linear
pipeline allows the output of previous annotators to be
used by subsequent ones, and thus achieves great separa-
tion of responsibility and modularity. Figure 4 illustrates
the CoreNLP annotation pipeline.
The CoreNLP Dependency Parser uses a transition
system with a neural network classifier as its scoring
function. The classifier takes a vector of features extracted
from the current system configuration and returns a vector
with a score for each possible transition. When training, it
is supplied with examples consisting of input feature vec-
tors and a desired transition, and its weights are modified
using Adaptive Gradient Descent until it reliably outputs
vectors whose maximum scores correspond to the target
transitions. The specific features used are covered in detail
by Chen et al. [6] and remain unchanged in our system.
Broadly speaking, they include word, part-of-speech and
arc label embeddings for words on both the stack and
the buffer, and of some of their children as described by
existing dependency arcs.
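At parse time, the transition actually applied is simply the one whose score is largest. A trivial sketch of that selection step (the class and method names are ours, not CoreNLP's):

```java
// Sketch of the classifier's role during parsing: given a score vector with
// one entry per possible transition, apply the transition with the maximum score.
public class TransitionScorer {
    public static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }
}
```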
Stack                     Buffer                  Arcs                          Action
-ROOT-                    The, cat, sat,          ∅                             SHIFT
                          on, the, mat
-ROOT-, The               cat, sat, on,           ∅                             SHIFT
                          the, mat
-ROOT-, The, cat          sat, on, the, mat       ∅                             LEFT
-ROOT-, cat               sat, on, the, mat       L(The,cat)                    SHIFT
-ROOT-, cat, sat          on, the, mat            L(The,cat)                    LEFT
-ROOT-, sat               on, the, mat            L(The,cat), L(cat,sat)        SHIFT
-ROOT-, sat, on           the, mat                L(The,cat), L(cat,sat)        SHIFT
-ROOT-, sat, on, the      mat                     L(The,cat), L(cat,sat)        SHIFT
-ROOT-, sat, on,                                  L(The,cat), L(cat,sat)        LEFT
the, mat
-ROOT-, sat, on, mat                              L(The,cat), L(cat,sat),       LEFT
                                                  L(the,mat)
-ROOT-, sat, mat                                  L(The,cat), L(cat,sat),       RIGHT
                                                  L(the,mat), L(on,mat)
-ROOT-, sat                                       L(The,cat), L(cat,sat),       RIGHT
                                                  L(the,mat), L(on,mat),
                                                  R(sat,mat)
-ROOT-                                            L(The,cat), L(cat,sat),       DONE
                                                  L(the,mat), L(on,mat),
                                                  R(sat,mat), R(-ROOT-,sat)
Figure 3. An example of transition-based dependency parsing
Figure 4. CoreNLP architecture. Reproduced from Manning et al. [7].
3. Aims
We set out to train a parser model using an unparsed
corpus of English language text.
Parser performance is normally evaluated using La-
belled and/or Unlabelled Attachment Scores (LAS/UAS).
Both metrics measure the proportion of words which are
assigned the correct head, but LAS also requires that the
correct label is applied to the dependency relation.
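The two metrics can be illustrated with a small sketch that scores predicted (head, label) pairs against gold ones; the class and method names are ours, not from any evaluation toolkit.

```java
// Illustrative sketch of attachment scores. Each word i has a gold and a
// predicted head index (and label). UAS counts correct heads; LAS additionally
// requires the dependency label to match.
public class AttachmentScores {
    public static double uas(int[] goldHead, int[] predHead) {
        int correct = 0;
        for (int i = 0; i < goldHead.length; i++) {
            if (goldHead[i] == predHead[i]) correct++;
        }
        return (double) correct / goldHead.length;
    }

    public static double las(int[] goldHead, String[] goldLabel,
                             int[] predHead, String[] predLabel) {
        int correct = 0;
        for (int i = 0; i < goldHead.length; i++) {
            if (goldHead[i] == predHead[i] && goldLabel[i].equals(predLabel[i])) correct++;
        }
        return (double) correct / goldHead.length;
    }
}
```

By construction LAS can never exceed UAS, since every correctly labelled attachment is also a correct attachment.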
However, given the early stage of this line of research,
the quantitative parsing accuracy of the resulting model is
of secondary concern — simply being able to develop a
model which begins to approximate the types of depen-
dency trees produced by conventionally trained parsers is
a significant step forward.
Therefore, in this preliminary research, we will only
be using placeholder ‘UNKNOWN’ and ‘PARSED’ de-
pendency labels, rather than the full set of linguistically
meaningful labels, as the derivation of suitable labels from
unparsed text merits its own research project.
In order to evaluate the outcome of our research, we
will be inspecting the high-level structure of the depen-
dency tree and looking for non-trivial relations making
use of both left and right arcs and linking non-adjacent
words where appropriate.
4. Approach
We took a high quality, open source parser and mod-
ified it to support an iterative training flow which allows
us to gradually develop a model which can parse the
whole corpus. Instead of extracting training examples for
the classifier from an annotated corpus, our system uses
its own past parse decisions, modified in accordance with a
heuristic extracted from the corpus.
4.1. Overview
We start by generating a ‘blank’ parser model (subsec-
tion 5.1), which is then used to parse the corpus, while log-
ging every parsing decision made (subsection 5.2). From
the parse log, we extract heuristics (subsection 5.3), which
are then used to produce improved training examples from
the logged decisions (subsection 5.4). The classifier is
trained on the resulting data, and the next iteration starts.
A significant advantage of this flow is that a new
model is generated at the end of each iteration and can be
evaluated and used while the training continues.
5. Implementation
We decided to modify an existing parser, rather than
develop our own, in order to circumscribe the scope of
our work and facilitate its evaluation. CoreNLP provides
a robust and well documented foundation with a state-of-
the-art dependency parser implementation in an accessible
open-source package, making it an ideal match.
5.1. Blank Model Creation
A key requirement of our project was to be able to
generate a blank initial model which could gradually be
populated. However, the CoreNLP implementation is built
with the assumption that its model will always return
a fully formed parse tree. In order to satisfy CoreNLP
without extensive modifications, we therefore decided to
use its own model generation functionality, but co-opt it to
produce a trivial model which just outputs left-bound arcs
with a custom ‘UNKNOWN’ label as seen in Figure 6
below.
[Flow diagram: Generate 'blank' model → Parse corpus → Extract heuristics → Generate training examples → Train parser, looping back to Parse corpus.]
Figure 5. Training flow

[Example 'blank' model output: a flat parse in which every arc points left and carries the UNKNOWN label.]
Figure 6. Example output from 'blank' model
Obviously, the parse trees produced by this model are
of little use for real-world parsing, but they are essential
to bootstrap the training example generation process.
In order to generate the model, we subclassed the
DependencyParser class and modified its training func-
tionality to use a new deterministic oracle.
The original oracle would extract the correct arc di-
rections and labels for a configuration from the parse trees
in the annotated training set. However, since we have no
parse trees to use, and no way of extracting labels, we
replaced it with our own code which deterministically
returns one of three transitions based on hard-coded rules.
The new implementation (Listing 2) takes a transition
system configuration, examines its stack, buffer and exist-
ing transitions and returns a new transition. Conceptually,
it tries to output a left arc, and uses a few rules to ensure
that it is a valid transition (e.g. words don’t have multiple
heads or dependents and the special -ROOT- word does
not become a dependent), otherwise it falls back to either
a SHIFT or a right arc, each of which must satisfy certain
conditions. The key property of this new oracle is that
given a valid input configuration it will always provide a
transition which results in a new configuration which is
also valid, thus allowing us to generate a model which
will always produce a valid parse tree.
Although this approach provided a quick way to boot-
strap the system, a more robust approach would be to
develop a standalone system which implements the or-
acle and outputs a CoNLL treebank which the standard
CoreNLP dependency parser can then use as training data
without any modifications.

public String getOraclePrediction(Configuration c) {
    int w1 = c.getStack(1);
    int w2 = c.getStack(0);
    if (c.getStackSize() < 3) {
        if (c.getBufferSize() > 0) {
            return "S";
        } else {
            return "R(UNKNOWN)";
        }
    } else if (c.getChildCount(w1) < 1) {
        return "R(UNKNOWN)";
    } else if (c.getChildCount(w2) < 1) {
        return "L(UNKNOWN)";
    }
    return null;
}
Listing 2. Deterministic oracle implementation
5.2. Logging
We modified the subclassed dependency parser to log
every parsing decision to a YAML file. YAML [8] is a
data serialisation language with a human-readable syntax
which allows us to visually inspect and manually modify
the parser’s decisions while maintaining the ability to load
the file back into the program. Listing 3 shows an example
of the YAML output.
 1 ---
 2 !!uk.ac.kent.parser.ParserLogEntry
 3 arcs: ['PARSED(Example, log)']
 4 bufferPOS: []
 5 bufferWords: []
 6 features: [2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 0, 1, 1, 1, 1, 839, 837, 837, 838, 838,
     838, 838, 838, 838, 838, 838, 838, 838,
     837, 838, 838, 838, 838, 841, 841, 841,
     841, 841, 841, 841, 844, 841, 841, 841,
     841]
 7 partOfSpeechArcs: ['PARSED(NN, NN)']
 8 stackPOS: [-ROOT-, NN, NN]
 9 stackWords: [-ROOT-, Example, output]
10 transition: R(PARSED)
11 ---
Listing 3. Example YAML output
The features vector (line 6) and the transition (line 10)
are needed to train the classifier as previously described
in subsection 2.4, while the rest of the logged information
provides additional context for heuristic extraction and
human inspection.
5.3. Heuristic Extraction
The heuristic we used for this proof of concept simply
counted how often pairs of adjacent part-of-speech tags,
called bigrams, occur in the input corpus. Although dependency
relations hold between words, we use POS bigrams as
a higher level abstraction which allows us to generalise
to unseen words, and assume that frequent bigrams are a
good indication of dependency between the two part-of-
speech tags.
Our corpus analyser runs the input text through a
tagger and scans through the tagged output counting how
many times each bigram is encountered. The parse log
analyser performs a similar function, but counts how
many times bigrams are encountered on the parser’s stack.
Listing 4 shows an example of extracted bigrams.
DT NN 6034
IN DT 5003
. 4707
NN IN 3755
PRP VBD 3256
JJ NN 2783
NN , 2653
NN . 2615
DT JJ 2551
...
Listing 4. Example of Part-Of-Speech Bigram histogram extracted from
the Sherlock Holmes corpus
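The counting step itself can be sketched as follows; `BigramHistogram` is a hypothetical stand-in for our corpus analyser, and it assumes the input has already been run through a part-of-speech tagger.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the bigram heuristic: count how often each adjacent pair of
// part-of-speech tags occurs in a tagged token sequence.
public class BigramHistogram {
    public static Map<String, Integer> count(List<String> tags) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tags.size(); i++) {
            String bigram = tags.get(i) + " " + tags.get(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }
}
```

Applied to the tag sequence for "The cat sat on the mat." (DT NN VBD IN DT NN .), the sketch counts "DT NN" twice and every other bigram once, mirroring the shape of Listing 4.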
It is worth mentioning that this heuristic, apart from
being na¨ıve, is also rooted in the assumption that the order
of words is significant. This is generally true in English,
but as we discussed in subsection 2.2, there are other
languages for which this assumption does not hold, and
therefore different heuristics would be required.
5.4. Training Example Generation
A training example consists of a feature vector and an
associated transition. Both of these elements are present in
the parser log output, and can be extracted and modified
to generate new training examples. In fact, our implemen-
tation exploits this similarity so that training examples are
just log entries which have had their transition modified,
and saved to a different file.
The exact modifications depend on the specific heuris-
tics extracted in the previous step, but in the current
implementation, ‘UNKNOWN’ transitions which have the
most frequent bigram at the top of the stack are replaced
with ‘L(PARSED)’ transitions.
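That replacement rule can be sketched as follows, with `LogEntry` standing in for the ParserLogEntry shown in Listing 3; the class and method names are ours, not the project's.

```java
import java.util.List;

// Sketch of the relabelling step: a logged decision whose top-of-stack
// POS bigram matches the most frequent bigram from the corpus has its
// 'UNKNOWN' transition rewritten to 'L(PARSED)'.
public class ExampleGenerator {
    public static class LogEntry {
        public List<String> stackPOS; // e.g. [-ROOT-, DT, NN]
        public String transition;     // e.g. "L(UNKNOWN)"
    }

    public static void relabel(List<LogEntry> log, String topBigram) {
        for (LogEntry e : log) {
            int n = e.stackPOS.size();
            if (n < 2 || !e.transition.contains("UNKNOWN")) continue;
            // The bigram formed by the two words at the top of the stack.
            String bigram = e.stackPOS.get(n - 2) + " " + e.stackPOS.get(n - 1);
            if (bigram.equals(topBigram)) {
                e.transition = "L(PARSED)";
            }
        }
    }
}
```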
5.5. Training
As previously mentioned, the model training func-
tionality built into the CoreNLP dependency parser is
designed to work with annotated parse trees in CoNLL
format (Listing 5), which we don’t have. We therefore
implemented a new, simplified trainer by copying the
original implementation and stripping out the CoNLL-
handling code. The resulting trainer takes feature vectors
and their corresponding transitions and trains the neural
network classifier to match the expected output.
5.6. Unified Click and Forget Flow
In order to facilitate unattended operation, we de-
veloped a unified flow which, given a suitably prepared
training directory, will generate a new model and run
through the training loop until the corpus is fully parsed.
This flow also launches a web server which is reloaded
with the latest model after every iteration, thus allowing us
to monitor the quality of intermediate models at runtime.
1 And CC 3 DEP
2 t h a t DT 3 DEP
3 might MD 0 ROOT
4 have VB 3 VC
5 been VBN 4 VC
6 the DT 7 NMOD
7 case NN 5 PRD
8 . . 3 P
Listing 5. Example training data in CoNLL format
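For reference, one line of this format can be read with a small sketch like the following; the field names are assumptions based on the columns visible in Listing 5 (index, word, POS tag, head index, dependency label), not CoreNLP's own reader.

```java
// Sketch of parsing one whitespace-separated CoNLL-style line as in Listing 5.
public class ConllToken {
    public final int index, head;
    public final String word, pos, label;

    public ConllToken(String line) {
        String[] f = line.trim().split("\\s+");
        index = Integer.parseInt(f[0]); // token position in the sentence (1-based)
        word = f[1];
        pos = f[2];
        head = Integer.parseInt(f[3]);  // 0 means the token is the root
        label = f[4];
    }
}
```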
6. Results
After running our system, we quickly see qualitative
improvements in the resulting parse trees. Instead of a
degenerate parse with no redeeming qualities, we begin
to see parse trees which resemble the output of conven-
tionally trained parsers:
The quick brown fox jumped over the lazy dog
[Arcs labelled det, amod, amod, nsubj, nmod, case, det and amod.]
Figure 7. Example output from conventional model

The quick brown fox jumped over the lazy dog
[Eight arcs, all labelled 'parsed'.]
Figure 8. Example output from trained model
Figure 7 shows a sentence parsed by the Stanford
Dependency Parser using a conventionally trained model.
Compared to our output (Figure 8), we can see that
at a high level, there are obvious similarities in the struc-
tures of the parse trees, with both exhibiting two distinct
clusters, one in the first half of the sentence and another
in the second half.
However, on closer inspection, the clusters in our
output appear to be offset towards the left, causing most
of the dependency arcs to link the wrong words. Although
there are two dependencies linking the correct words (fox
- jumped & lazy - dog), the direction is incorrect in both
cases.
In fact, if we focus on the directions of the arcs, we can
see that most of the arcs in our output are right-bound,
while all but one of the arcs in the canonical parse are
left-bound. This suggests that our efforts to overcome the
absolute left bias of our ‘blank’ model were overzealous
and have produced a substantial right-bias. It is possible
that the use of a sufficiently large input corpus would
overcome this bias, but we were unable to test this.
Quantitatively, we can see in Figure 9 that after a sin-
gle iteration using “The Adventures of Sherlock Holmes”
as a corpus, the number of ‘UNKNOWN’ transitions
decreases by nearly 4.5% from 122530 to 117042, and
continues to decrease, until it begins to level out after
iteration 20. This indicates that while the most frequent
bigrams might be a useful indication of dependency re-
lations, this approach quickly yields diminishing returns as
we look at less common bigrams.
[Line plot: number of UNKNOWN transitions (0 to 1.2 ×10⁵) against iteration number (0 to 30).]
Figure 9. Number of UNKNOWN arcs in parse output
7. Conclusion
We have shown that it is possible to train a dependency
parser model using an unparsed corpus of English lan-
guage text. This was achieved using an iterative training
flow in which every iteration uses heuristics extracted
from past parses to generate training data for the next.
As far as we are aware, there is no prior work
on training parser models without using previously
annotated text, and therefore our project’s main aim
was to determine whether this is possible.
Limitations
Although our research has achieved a positive result,
the current implementation still has three major limita-
tions:
• Quantitatively, the resulting model’s parsing ac-
curacy is very low, due to naïve heuristics and a
small input corpus.
• The proof-of-concept implementation is memory
intensive, which limits the size of the input corpus.
• It is only able to produce unlabelled dependencies,
which are not as informative as labelled dependen-
cies.
8. Further work
There are a number of avenues for future work, rang-
ing from simple implementation optimisations, to major
conceptual hurdles.
8.1. Improve Memory Efficiency
The current implementation produces over a gigabyte
of parse logs even for small corpora of a few megabytes,
which causes it to exhaust the system’s memory when
they are loaded for processing. This limitation could be
overcome by implementing stream-based processing of
log entries. Combined with minor tweaks to minimise the
size of each log entry, this should enable the use of much
larger corpora. We suspect that, in order to achieve optimal
parser accuracy, input corpus size should be on the order
of terabytes.
8.2. Develop Improved Heuristics
As discussed, the heuristic used in this proof-of-
concept was very naïve, so there is great scope to de-
velop novel and sophisticated heuristics which will di-
rectly affect the quality of the resulting parser. Computing
skip-grams, for example, could highlight pairs of part-of-
speech tags which frequently appear together in sentences,
but are separated by one or more words. Such relationships
could provide additional information to overcome the bias
towards greedy dependencies, but are currently missed.
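A minimal sketch of such a count over POS tags, where `k` is the maximum number of intervening tags allowed between the pair; the `SkipGrams` name and method are ours, for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of k-skip bigram counting: count ordered pairs of POS tags that
// occur with up to k intervening tags, which could surface non-adjacent
// dependency candidates that the plain bigram heuristic misses.
public class SkipGrams {
    public static Map<String, Integer> count(List<String> tags, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < tags.size(); i++) {
            // gap == 1 is an adjacent pair; gap == k + 1 skips k tags.
            for (int gap = 1; gap <= k + 1 && i + gap < tags.size(); gap++) {
                counts.merge(tags.get(i) + " " + tags.get(i + gap), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With k = 0 this reduces to the plain bigram histogram of subsection 5.3, so the two heuristics could share one implementation.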
8.3. Introduce Arc Labels
Although there is no obvious way to derive arc labels
from unparsed text, it might be possible to use unsu-
pervised learning techniques (e.g clustering) to identify
commonly occurring types of arcs, which could then be
used as labels.
9. Acknowledgements
I would like to thank my supervisors, Christian Kissig,
Marek Grześ and Laura Bocchi, for their continued sup-
port and guidance throughout this challenging project.
References
[1] H. Cunningham, D. Maynard, K. Bontcheva,
V. Tablan, N. Aswani, I. Roberts, G. Gorrell,
A. Funk, A. Roberts, D. Damljanovic, T. Heitz,
M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and
W. Peters, Text Processing with GATE (Version 6),
2011. [Online]. Available: http://tinyurl.com/gatebook
[2] D. Jurafsky and J. H. Martin, Speech and Language
Processing. Prentice Hall, 2000.
[3] M. A. Covington, “A fundamental algorithm for de-
pendency parsing,” in Proceedings of the 39th annual
ACM southeast ..., 2001.
[4] J. Nivre, “Dependency Parsing,” vol. 4, no. 3, pp. 138–
152, Mar. 2010.
[5] N. Green, “Dependency Parsing,” in WDS, Dec. 2011,
pp. 1–6.
[6] D. Chen and C. D. Manning, “A Fast and Accurate
Dependency Parser using Neural Networks.” EMNLP,
pp. 740–750, 2014.
[7] C. Manning, M. Surdeanu, J. Bauer, J. Finkel,
S. Bethard, and D. McClosky, “The Stanford
CoreNLP Natural Language Processing Toolkit,” in
Proceedings of 52nd Annual Meeting of the Associ-
ation for Computational Linguistics: System Demon-
strations. Stroudsburg, PA, USA: Association for
Computational Linguistics, 2014, pp. 55–60.
[8] O. Ben-Kiki and C. Evans. YAML Ain’t Markup
Language Version 1.2 Specification. [Online].
Available: http://www.yaml.org/spec/1.2/spec.html

More Related Content

PDF
Text summarization
PDF
text summarization using amr
PPTX
Text summarization
PDF
Text Summarization
PDF
Extraction Based automatic summarization
PDF
Document Summarization
PDF
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
PDF
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION
Text summarization
text summarization using amr
Text summarization
Text Summarization
Extraction Based automatic summarization
Document Summarization
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
IMPROVE THE QUALITY OF IMPORTANT SENTENCES FOR AUTOMATIC TEXT SUMMARIZATION

What's hot (20)

PDF
A Survey of Various Methods for Text Summarization
PDF
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
PDF
Improving Neural Abstractive Text Summarization with Prior Knowledge
PDF
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
PDF
Conceptual framework for abstractive text summarization
PPTX
Word embedding
PDF
semantic text doc clustering
PDF
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
PDF
Y24168171
PDF
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES
PDF
Multi Document Text Summarization using Backpropagation Network
PDF
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
PDF
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
PPTX
Intent Classifier with Facebook fastText
PDF
Abstractive Text Summarization
PDF
Towards Building Parallel Dependency Treebanks: Intra-Chunk Expansion and Ali...
PDF
Turkish language modeling using BERT
PDF
Indexing of Arabic documents automatically based on lexical analysis
PDF
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
PDF
TSD2013.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION
A Survey of Various Methods for Text Summarization
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
Improving Neural Abstractive Text Summarization with Prior Knowledge
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
Conceptual framework for abstractive text summarization
Word embedding
semantic text doc clustering
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
Y24168171
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES
Multi Document Text Summarization using Backpropagation Network
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
Intent Classifier with Facebook fastText
Abstractive Text Summarization
Towards Building Parallel Dependency Treebanks: Intra-Chunk Expansion and Ali...
Turkish language modeling using BERT
Indexing of Arabic documents automatically based on lexical analysis
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
TSD2013.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION
Ad

Similar to Understanding Natural Languange with Corpora-based Generation of Dependency Grammars (20)

PDF
Cc35451454
PDF
Extractive Document Summarization - An Unsupervised Approach
PDF
Extractive Summarization with Very Deep Pretrained Language Model
PDF
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
PDF
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
PPTX
3__Python - Tool Text summarization.pptx
PDF
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
DOC
amta-decision-trees.doc Word document
PDF
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
PPTX
PDF
EasyChair-Preprint-7375.pdf
PDF
Lectura 3.5 word normalizationintwitter finitestate_transducers
PDF
Isolated word recognition using lpc & vector quantization
PDF
Isolated word recognition using lpc &amp; vector quantization
PPTX
Unit II Natural Language Processing.pptx
PDF
G04124041046
PDF
Class Diagram Extraction from Textual Requirements Using NLP Techniques
PDF
D017232729
PDF
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
PDF
An expert system for automatic reading of a text written in standard arabic
Cc35451454
Extractive Document Summarization - An Unsupervised Approach
Extractive Summarization with Very Deep Pretrained Language Model
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
EXTRACTIVE SUMMARIZATION WITH VERY DEEP PRETRAINED LANGUAGE MODEL
3__Python - Tool Text summarization.pptx
EMPLOYING PIVOT LANGUAGE TECHNIQUE THROUGH STATISTICAL AND NEURAL MACHINE TRA...
amta-decision-trees.doc Word document
Advancements in Hindi-English Neural Machine Translation: Leveraging LSTM wit...
EasyChair-Preprint-7375.pdf
Lectura 3.5 word normalizationintwitter finitestate_transducers
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc &amp; vector quantization
Unit II Natural Language Processing.pptx
G04124041046
Class Diagram Extraction from Textual Requirements Using NLP Techniques
D017232729
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
An expert system for automatic reading of a text written in standard arabic
Ad

Understanding Natural Languange with Corpora-based Generation of Dependency Grammars

  • 1. Corpora-Based Generation of Dependency Parser Models for Natural Language Processing Edmond C. Lepedus School of Computing University of Kent Canterbury, UK Email: el210@kent.ac.uk Abstract—In this paper, we show that it is possible to train a dependency parser model using an unparsed corpus of English language text. This is a novel development in computational linguistics, with the potential to transform parser generation. Parsing is an essential part of natural language process- ing, which is currently performed using parsers trained on manually parsed and annotated texts. However, the training data required is expensive to produce, which limits the size and availability of training sets, and has a knock-on effect on the performance of the resulting parser. In order to negate the need for annotated training data, we develop an iterative training flow which generates train- ing examples using heuristics extracted from past parsing decisions. We show that the parse trees produced using parsers trained in this way bear qualitative resemblance to those produced by conventionally trained parsers, and propose three avenues for future research. Index Terms—Natural Language Processing; Dependency Parsing; Grammar Generation; 1. Introduction Parsing is a fundamental part of natural language pro- cessing. It extracts the syntactical structure of the sentence in order to provide clues about the underlying meaning. Parser training requires large quantities of text which has been manually parsed and annotated by human lin- guists using tools such as GATE [1]. The specific infor- mation included in an annotation depends on the goals of the particular annotation initiative, but for linguistics, it would typically contain a canonical ‘gold’ parse tree for each sentence in the corpus (subsection 2.1 shows one such tree). The production of high-quality training sets can take months of effort by skilled linguists, and is therefore very expensive. 
Even when the resulting data is made available free of charge, the cost of producing it limits the size of the training sets that can be developed. This problem is particularly pronounced when the number of linguists available to contribute to such a project is small, as in the case of dying or already dead languages.

By enabling the use of unparsed texts for training, we make every written work available as training data, thus minimising the cost of training data while simultaneously increasing its availability.

In this paper, we show that it is possible to train a parser model using an unparsed corpus of English language text, and introduce an iterative training flow to support further research. Moreover, we do this by modifying the Stanford CoreNLP Dependency Parser, which places our results in a well-known context and makes them easier to replicate or expand upon.

We present brief primers on parsing and the Stanford CoreNLP in section 2, and provide more information about our aims and the scope of the project in section 3. The main body of the paper starts with a high-level overview of our approach (section 4), followed by specific details of our implementation (section 5). We present concrete and qualitative results of our training in section 6, and an analysis of the outcome in section 7. We suggest three avenues for future work in section 8.

2. Background

Parsing is an essential part of any natural language processing pipeline. In our case, it takes sentences which have been tokenised and annotated with their parts of speech (e.g. Listing 1), and works out the syntactical structure between them in order to provide clues about the underlying meaning [2].

Sentence #1 (7 tokens):
The cat sat on the mat.
[Text=The CharacterOffsetBegin=47 CharacterOffsetEnd=50 PartOfSpeech=DT]
[Text=cat CharacterOffsetBegin=51 CharacterOffsetEnd=54 PartOfSpeech=NN]
[Text=sat CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD]
[Text=on CharacterOffsetBegin=59 CharacterOffsetEnd=61 PartOfSpeech=IN]
[Text=the CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=DT]
[Text=mat CharacterOffsetBegin=66 CharacterOffsetEnd=69 PartOfSpeech=NN]
[Text=. CharacterOffsetBegin=69 CharacterOffsetEnd=70 PartOfSpeech=.]
Listing 1. Example of parser input which has been tokenised, split into sentences and Part-of-Speech tagged
2.1. Constituency Parsing

Traditionally, computational linguistics has approached the task by breaking “sentences into constituents (phrases), which are then broken into smaller constituents” [3], until each constituent is a single word. This is grounded in the notion of constituency grammars, which has been passed down from the ancient Stoics to linguists through formal logic [3], and is typically represented using constituency trees, as in Figure 1.

Figure 1. Example constituency-based parse tree output (a parse of “This is an example of a constituency tree”)

2.2. Dependency Parsing

There exists a different, and older, parsing tradition, which assumes that sentence structure consists of words linked by binary asymmetric relations called dependencies [4]. These relations involve a syntactically subordinate word called a dependent, and the word on which it depends, known as the head. This is known as dependency parsing, and results in a representation known as a dependency tree. An example of such a tree can be seen in Figure 2.

Due to its ability to parse languages with loose constraints on the ordering of words in a sentence (e.g. Finnish, Polish), dependency parsing has seen renewed interest in the natural language processing community, and is considered state-of-the-art [5].

Figure 2. Example dependency-based parse tree (a parse of “This is an example of a dependency tree”; each arc label describes the dependency type)

2.3. Transition-Based Dependency Parsers

Transition-based dependency parsing “is a purely data-driven method that makes no use of a formal grammar but relies on machine learning from treebank data” [4]. A transition-based parser “learns to score possible next actions in a state machine” [4] in order to produce a dependency tree. The state machine is known as a transition system, and the possible actions are called transitions [4].
A transition system consists of configurations representing partial parses of the sentence, and a set of transitions used to move between them. A configuration consists of a stack, a buffer and the list of transitions which have been applied to reach it. The transitions are:

• SHIFT: move a word from the buffer to the stack
• LEFT-ARC: remove the second word from the stack, and add a dependency arc between it and the first
• RIGHT-ARC: remove the first word from the stack, and add a dependency arc between it and the second

A terminal configuration is one which has only the special “ROOT” word on the stack and an empty buffer. Parsing proceeds by applying the best transition, as scored by an evaluation function, until a terminal configuration is reached. Various evaluation functions can be used, ranging from deterministic rule-based implementations to neural network classifiers which are trained to pick the optimum transition given a configuration.

Figure 3 illustrates the transition-based parsing process, whereby the words on the buffer are gradually consumed and a set of dependency arcs is produced. In this example, we used our own knowledge as an interactive evaluation function.

2.4. Stanford CoreNLP

The Stanford CoreNLP toolkit is a free, open-source, high-quality set of natural language processing tools. It includes a dependency parser based on a neural network classifier, which can parse 1000 sentences per second at an accuracy of 92.2% [6]. The system is structured as a pipeline, which takes input text and applies a series of annotations to it. The specific annotators used can be varied, and range from simple tokenisers and sentence splitters to high-level sentiment analysis [7]. The linear pipeline allows the output of previous annotators to be used by subsequent ones, and thus achieves great separation of responsibility and modularity. Figure 4 illustrates the CoreNLP annotation pipeline.
The CoreNLP Dependency Parser uses a transition system with a neural network classifier as its scoring function. The classifier takes a vector of features extracted from the current system configuration and returns a vector with a score for each possible transition. During training, it is supplied with examples consisting of input feature vectors and a desired transition, and its weights are modified using Adaptive Gradient Descent until it reliably outputs vectors whose maximum scores correspond to the target transitions. The specific features used are covered in detail by Chen and Manning [6] and remain unchanged in our system. Broadly speaking, they include word, part-of-speech and arc label embeddings for words on both the stack and the buffer, and for some of their children as described by existing dependency arcs.
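The transition system described in subsection 2.3 can be made concrete with a minimal sketch of an arc-standard implementation. The class and method names below are illustrative only, not CoreNLP's actual API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of an arc-standard transition system: a configuration is a
// stack, a buffer and the set of dependency arcs produced so far.
// Names are illustrative, not CoreNLP's actual API.
class ArcStandard {
    final Deque<String> stack = new ArrayDeque<>();   // top = first word
    final Deque<String> buffer = new ArrayDeque<>();  // head = next word
    final List<String> arcs = new ArrayList<>();

    ArcStandard(String... words) {
        stack.push("-ROOT-");
        for (String w : words) buffer.addLast(w);
    }

    // SHIFT: move the next word from the buffer onto the stack.
    void shift() { stack.push(buffer.removeFirst()); }

    // LEFT-ARC: the second word on the stack becomes a dependent of the first.
    void leftArc() {
        String head = stack.pop();
        String dependent = stack.pop();
        arcs.add("L(" + dependent + "," + head + ")");
        stack.push(head);
    }

    // RIGHT-ARC: the first word on the stack becomes a dependent of the second.
    void rightArc() {
        String dependent = stack.pop();
        String head = stack.pop();
        arcs.add("R(" + head + "," + dependent + ")");
        stack.push(head);
    }

    // Terminal configuration: only -ROOT- remains and the buffer is empty.
    boolean isTerminal() { return stack.size() == 1 && buffer.isEmpty(); }

    public static void main(String[] args) {
        // Replay the action column of Figure 3.
        ArcStandard c = new ArcStandard("The", "cat", "sat", "on", "the", "mat");
        c.shift(); c.shift(); c.leftArc(); c.shift(); c.leftArc();
        c.shift(); c.shift(); c.shift(); c.leftArc(); c.leftArc();
        c.rightArc(); c.rightArc();
        System.out.println(c.arcs + " terminal=" + c.isTerminal());
    }
}
```

Replaying the action column of Figure 3 against this sketch reproduces its arc set and ends in a terminal configuration.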
Stack                     | Buffer                       | Arcs                                                                    | Action
-ROOT-                    | The, cat, sat, on, the, mat  | ∅                                                                       | SHIFT
-ROOT-, The               | cat, sat, on, the, mat       | ∅                                                                       | SHIFT
-ROOT-, The, cat          | sat, on, the, mat            | ∅                                                                       | LEFT
-ROOT-, cat               | sat, on, the, mat            | L(The,cat)                                                              | SHIFT
-ROOT-, cat, sat          | on, the, mat                 | L(The,cat)                                                              | LEFT
-ROOT-, sat               | on, the, mat                 | L(The,cat), L(cat,sat)                                                  | SHIFT
-ROOT-, sat, on           | the, mat                     | L(The,cat), L(cat,sat)                                                  | SHIFT
-ROOT-, sat, on, the      | mat                          | L(The,cat), L(cat,sat)                                                  | SHIFT
-ROOT-, sat, on, the, mat |                              | L(The,cat), L(cat,sat)                                                  | LEFT
-ROOT-, sat, on, mat      |                              | L(The,cat), L(cat,sat), L(the,mat)                                      | LEFT
-ROOT-, sat, mat          |                              | L(The,cat), L(cat,sat), L(the,mat), L(on,mat)                           | RIGHT
-ROOT-, sat               |                              | L(The,cat), L(cat,sat), L(the,mat), L(on,mat), R(sat,mat)               | RIGHT
-ROOT-                    |                              | L(The,cat), L(cat,sat), L(the,mat), L(on,mat), R(sat,mat), R(-ROOT-,sat)| DONE
Figure 3. An example of transition-based dependency parsing

Figure 4. CoreNLP architecture. Reproduced from Manning et al. [7].

3. Aims

We set out to train a parser model using an unparsed corpus of English language text.

Parser performance is normally evaluated using Labelled and/or Unlabelled Attachment Scores (LAS/UAS). Both metrics measure the proportion of words which are assigned the correct head, but LAS also requires that the correct label is applied to the dependency relation.

However, given the early stage of this line of research, the quantitative parsing accuracy of the resulting model is of secondary concern: simply being able to develop a model which begins to approximate the types of dependency trees produced by conventionally trained parsers is a significant step forward.

Therefore, in this preliminary research, we will only be using placeholder ‘UNKNOWN’ and ‘PARSED’ dependency labels, rather than the full set of linguistically meaningful labels, as the derivation of suitable labels from unparsed text merits its own research project.
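The two attachment scores can be sketched in a few lines. The class and array names below are illustrative, not part of any evaluation toolkit; a word contributes to UAS when its predicted head matches the gold head, and to LAS only when the label matches as well:

```java
// Sketch of UAS/LAS computation. heads[i] holds the head index of word i
// (0 = ROOT) and labels[i] its dependency label. Names are illustrative.
class AttachmentScore {
    static double[] score(int[] goldHeads, String[] goldLabels,
                          int[] predHeads, String[] predLabels) {
        int uas = 0, las = 0, n = goldHeads.length;
        for (int i = 0; i < n; i++) {
            if (predHeads[i] == goldHeads[i]) {
                uas++;  // correct head: counts towards UAS
                if (predLabels[i].equals(goldLabels[i])) {
                    las++;  // correct head and label: counts towards LAS
                }
            }
        }
        return new double[] { (double) uas / n, (double) las / n };
    }
}
```

For a three-word sentence in which every head is correct but one label is wrong, this yields UAS = 1.0 and LAS ≈ 0.67, illustrating that LAS can never exceed UAS.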
In order to evaluate the outcome of our research, we will be inspecting the high-level structure of the dependency tree, looking for non-trivial relations which make use of both left and right arcs and link non-adjacent words where appropriate.

4. Approach

We took a high-quality, open-source parser and modified it to support an iterative training flow which allows us to gradually develop a model that can parse the whole corpus. Instead of extracting training examples for the classifier from an annotated corpus, our system uses its own past parse decisions, modified in accordance with a heuristic extracted from the corpus.

4.1. Overview

We start by generating a ‘blank’ parser model (subsection 5.1), which is then used to parse the corpus while logging every parsing decision made (subsection 5.2). From the parse log, we extract heuristics (subsection 5.3), which are then used to produce improved training examples from the logged decisions (subsection 5.4). The classifier is trained on the resulting data, and the next iteration starts. A significant advantage of this flow is that a new model is generated at the end of each iteration, and can be evaluated and used while the training continues.

5. Implementation

We decided to modify an existing parser, rather than develop our own, in order to circumscribe the scope of our work and facilitate its evaluation. CoreNLP provides a robust and well-documented foundation with a state-of-the-art dependency parser implementation in an accessible open-source package, making it an ideal match.

5.1. Blank Model Creation

A key requirement of our project was the ability to generate a blank initial model which could gradually be populated. However, the CoreNLP implementation is built on the assumption that its model will always return a fully formed parse tree.
In order to satisfy CoreNLP without extensive modifications, we therefore decided to use its own model generation functionality, but co-opt it to produce a trivial model which just outputs left-bound arcs with a custom ‘UNKNOWN’ label, as seen in Figure 6 below.
Figure 5. Training flow: generate ‘blank’ model, parse corpus, extract heuristics, generate training examples, train parser; repeat

Figure 6. Example output from the ‘blank’ model (every word attached by a left-bound arc labelled UNKNOWN)

Obviously, the parse trees produced by this model are of little use for real-world parsing, but they are essential to bootstrap the training example generation process.

In order to generate the model, we subclassed the DependencyParser class and modified its training functionality to use a new deterministic oracle. The original oracle would extract the correct arc directions and labels for a configuration from the parse trees in the annotated training set. However, since we have no parse trees to use, and no way of extracting labels, we replaced it with our own code, which deterministically returns one of three transitions based on hard-coded rules.

The new implementation (Listing 2) takes a transition system configuration, examines its stack, buffer and existing transitions, and returns a new transition. Conceptually, it tries to output a left arc, using a few rules to ensure that the transition is valid (e.g. words do not acquire multiple heads or dependents, and the special -ROOT- word does not become a dependent); otherwise, it falls back to either a SHIFT or a right arc, each of which must satisfy certain conditions.

public String getOraclePrediction(Configuration c) {
    int w1 = c.getStack(1);
    int w2 = c.getStack(0);
    if (c.getStackSize() < 3) {
        if (c.getBufferSize() > 0) {
            return "S";
        } else {
            return "R(UNKNOWN)";
        }
    } else if (c.getChildCount(w1) < 1) {
        return "R(UNKNOWN)";
    } else if (c.getChildCount(w2) < 1) {
        return "L(UNKNOWN)";
    }
    return null;
}
Listing 2. Deterministic oracle implementation

The key property of this new oracle is that, given a valid input configuration, it will always provide a transition which results in a new configuration that is also valid, thus allowing us to generate a model which will always produce a valid parse tree.

Although this approach provided a quick way to bootstrap the system, a more robust approach would be to develop a standalone system which implements the oracle and outputs a CoNLL treebank, which the standard CoreNLP dependency parser can then use as training data without any modifications.

5.2. Logging

We modified the subclassed dependency parser to log every parsing decision to a YAML file. YAML [8] is a data serialisation language with a human-readable syntax, which allows us to visually inspect and manually modify the parser’s decisions while maintaining the ability to load the file back into the program. Listing 3 shows an example of the YAML output.

1  ---
2  !!uk.ac.kent.parser.ParserLogEntry
3  arcs: ['PARSED(Example,log)']
4  bufferPOS: []
5  bufferWords: []
6  features: [2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 839, 837, 837, 838, 838, 838, 838, 838, 838, 838, 838, 838, 838, 837, 838, 838, 838, 838, 841, 841, 841, 841, 841, 841, 841, 844, 841, 841, 841, 841]
7  partOfSpeechArcs: ['PARSED(NN,NN)']
8  stackPOS: [-ROOT-, NN, NN]
9  stackWords: [-ROOT-, Example, output]
10 transition: R(PARSED)
11 ---
Listing 3. Example YAML output

The features vector (line 6) and the transition (line 10) are needed to train the classifier, as previously described in subsection 2.4, while the rest of the logged information provides additional context for heuristic extraction and human inspection.

5.3. Heuristic Extraction

The heuristic we used for this proof of concept simply counted how often pairs of part-of-speech tags, called bigrams, occur in the input corpus.
Although dependency relations hold between words, we use POS bigrams as a higher-level abstraction which allows us to generalise to unseen words, and assume that frequent bigrams are a good indication of dependency between the two part-of-speech tags.
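A minimal sketch of this bigram counting follows; the class and method names are illustrative, not our actual analyser classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the part-of-speech bigram histogram: count how often each pair of
// adjacent POS tags occurs in the tagged input. Names are illustrative.
class BigramCounter {
    static Map<String, Integer> count(String[] posTags) {
        Map<String, Integer> histogram = new LinkedHashMap<>();
        for (int i = 0; i + 1 < posTags.length; i++) {
            String bigram = posTags[i] + " " + posTags[i + 1];
            histogram.merge(bigram, 1, Integer::sum);  // increment, inserting 1 if absent
        }
        return histogram;
    }
}
```

Tagging “The cat sat on the mat.” as DT NN VBD IN DT NN . and counting gives ‘DT NN’ a count of 2, matching its position at the top of the Sherlock Holmes histogram in Listing 4.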
Our corpus analyser runs the input text through a tagger and scans through the tagged output, counting how many times each bigram is encountered. The parse log analyser performs a similar function, but counts how many times bigrams are encountered on the parser’s stack. Listing 4 shows an example of extracted bigrams.

DT NN   6034
IN DT   5003
.       4707
NN IN   3755
PRP VBD 3256
JJ NN   2783
NN ,    2653
NN .    2615
DT JJ   2551
...
Listing 4. Example of Part-of-Speech bigram histogram extracted from the Sherlock Holmes corpus

It is worth mentioning that this heuristic, apart from being naïve, is also rooted in the assumption that the order of words is significant. This is generally true in English but, as we discussed in subsection 2.2, there are other languages for which this assumption does not hold, and for which different heuristics would therefore be required.

5.4. Training Example Generation

A training example consists of a feature vector and an associated transition. Both of these elements are present in the parser log output, and can be extracted and modified to generate new training examples. In fact, our implementation exploits this similarity: training examples are just log entries which have had their transition modified and been saved to a different file.

The exact modifications depend on the specific heuristics extracted in the previous step, but in the current implementation, ‘UNKNOWN’ transitions which have the most frequent bigram at the top of the stack are replaced with ‘L(PARSED)’ transitions.

5.5. Training

As previously mentioned, the model training functionality built into the CoreNLP dependency parser is designed to work with annotated parse trees in CoNLL format (Listing 5), which we do not have. We therefore implemented a new, simplified trainer by copying the original implementation and stripping out the CoNLL-handling code.
The resulting trainer takes feature vectors and their corresponding transitions and trains the neural network classifier to match the expected output.

5.6. Unified Click-and-Forget Flow

In order to facilitate unattended operation, we developed a unified flow which, given a suitably prepared training directory, will generate a new model and run through the training loop until the corpus is fully parsed. This flow also launches a web server which is reloaded with the latest model after every iteration, thus allowing us to monitor the quality of intermediate models at runtime.

1 And   CC  3 DEP
2 that  DT  3 DEP
3 might MD  0 ROOT
4 have  VB  3 VC
5 been  VBN 4 VC
6 the   DT  7 NMOD
7 case  NN  5 PRD
8 .     .   3 P
Listing 5. Example training data in CoNLL format

6. Results

After running our system, we quickly see qualitative improvements in the resulting parse trees. Instead of a degenerate parse with no redeeming qualities, we begin to see parse trees which resemble the output of conventionally trained parsers.

Figure 7. Example output from a conventional model: “The quick brown fox jumped over the lazy dog”, with arcs labelled det, amod, amod, nsubj, nmod, case, det, amod

Figure 8. Example output from our trained model: the same sentence, with every arc labelled ‘parsed’

Figure 7 shows a sentence parsed by the Stanford Dependency Parser using a conventionally trained model. Compared to our output (Figure 8), we can see that, at a high level, there are obvious similarities in the structures of the parse trees, with both exhibiting two distinct clusters, one in the first half of the sentence and another in the second half. However, on closer inspection, the clusters in our output appear to be offset towards the left, causing most of the dependency arcs to link the wrong words. Although there are two dependencies linking the correct words (fox–jumped and lazy–dog), the direction is incorrect in both cases.
In fact, if we focus on the directions of the arcs, we can see that most of the arcs in our output are right-bound, while all but one of the arcs in the canonical parse are left-bound. This suggests that our efforts to overcome the absolute left bias of our ‘blank’ model were overzealous and have produced a substantial right bias. It is possible that the use of a sufficiently large input corpus would overcome this bias, but we were unable to test this.

Quantitatively, we can see in Figure 9 that after a single iteration using “The Adventures of Sherlock Holmes” as a corpus, the number of ‘UNKNOWN’ transitions decreases by nearly 4.5%, from 122,530 to 117,042, and continues to decrease until it begins to level out after iteration 20. This indicates that while the most frequent bigrams might be a useful indication of dependency relations, the approach quickly reaches diminishing returns as we look at less common bigrams.

Figure 9. Number of UNKNOWN arcs in parse output (y axis: UNKNOWN transitions, ×10^5; x axis: iteration number)

7. Conclusion

We have shown that it is possible to train a dependency parser model using an unparsed corpus of English language text. This was achieved using an iterative training flow in which every iteration uses heuristics extracted from past parses to generate training data for the next. As far as we are aware, there is no prior work on training parser models without using previously annotated text, and therefore our project’s main aim was to determine whether this is possible.

Limitations

Although our research has achieved a positive result, the current implementation still has three major limitations:

• Quantitatively, the resulting model’s parsing accuracy is very low, due to naïve heuristics and a small input corpus.
• The proof-of-concept implementation is memory intensive, which limits the size of the input corpus.
• It is only able to produce unlabelled dependencies, which are not as informative as labelled dependencies.

8. Further Work

There are a number of avenues for future work, ranging from simple implementation optimisations to major conceptual hurdles.

8.1. Improve Memory Efficiency

The current implementation produces over a gigabyte of parse logs even for small corpora of a few megabytes, which causes it to exhaust the system’s memory when they are loaded for processing. This limitation could be overcome by implementing stream-based processing of log entries.
Combined with minor tweaks to minimise the size of each log entry, this should enable the use of much larger corpora. We suspect that, in order to achieve optimal parser accuracy, the input corpus size should be on the order of terabytes.

8.2. Develop Improved Heuristics

As discussed, the heuristic used in this proof of concept was very naïve, so there is great scope to develop novel and sophisticated heuristics, which will directly affect the quality of the resulting parser. Computing skip-grams, for example, could highlight pairs of part-of-speech tags which frequently appear together in sentences but are separated by one or more words. Such relationships could provide additional information to overcome the bias towards greedy dependencies, but are currently missed.

8.3. Introduce Arc Labels

Although there is no obvious way to derive arc labels from unparsed text, it might be possible to use unsupervised learning techniques (e.g. clustering) to identify commonly occurring types of arcs, which could then be used as labels.

9. Acknowledgements

I would like to thank my supervisors, Christian Kissig, Marek Grześ and Laura Bocchi, for their continued support and guidance throughout this challenging project.

References

[1] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters, Text Processing with GATE (Version 6), 2011. [Online]. Available: http://tinyurl.com/gatebook
[2] D. Jurafsky and J. H. Martin, Speech and Language Processing. Prentice Hall, 2000.
[3] M. A. Covington, “A fundamental algorithm for dependency parsing,” in Proceedings of the 39th Annual ACM Southeast Conference, 2001.
[4] J. Nivre, “Dependency Parsing,” vol. 4, no. 3, pp. 138–152, Mar. 2010.
[5] N. Green, “Dependency Parsing,” in WDS, Dec. 2011, pp. 1–6.
[6] D. Chen and C. D. Manning, “A Fast and Accurate Dependency Parser using Neural Networks,” in EMNLP, 2014, pp. 740–750.
[7] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 55–60.
[8] O. Ben-Kiki and C. Evans, YAML Ain’t Markup Language Version 1.2 Specification. [Online]. Available: http://www.yaml.org/spec/1.2/spec.html