amta-decision-trees.doc Word document

Better contextual translation using machine learning

Arul Menezes

Microsoft Research, One Microsoft Way, Redmond WA 98008, USA
arulm@microsoft.com

Abstract: One of the problems facing translation systems that automatically
extract transfer mappings (rules or examples) from bilingual corpora is the
trade-off between contextual specificity and general applicability of the
mappings, which typically results in conflicting mappings without
distinguishing context. We present a machine-learning approach to choosing
between such mappings, using classifiers that, in effect, selectively expand the
context for these mappings using features available in a linguistic representation
of the source language input. We show that using these classifiers in our
machine translation system significantly improves the quality of the translated
output. Additionally, the set of distinguishing features selected by the classifiers
provides insight into the relative importance of the various linguistic features in
choosing the correct contextual translation.

1 Introduction

Much recent research in machine translation has explored data-driven approaches that
automatically acquire translation knowledge from aligned or unaligned bilingual
corpora. One thread of this research focuses on extracting transfer mappings, rules or
examples, from parsed sentence-aligned bilingual corpora [1,2,3,4]. Recently this
approach has been shown to produce translations of quality comparable to
commercial translation systems [5].
These systems typically obtain a dependency/predicate argument structure (called
“logical form” in our system) for source and target sentences in a sentence-aligned
bilingual corpus. The structures are then aligned at the sub-sentence level. From the
resulting alignment, lexical and structural translation correspondences are extracted,
which are then represented as a set of transfer mappings, rules or examples, for
translation. Mappings may be fully specified or contain “wild cards” or under-
specified nodes.
A problem shared by all such systems is choosing the appropriate level of
generalization for the mappings. Larger, fully specified mappings provide the best
contextual translation¹, but can result in extreme data sparsity, while smaller and more
under-specified mappings are more general, but often do not make the necessary
contextual translation distinctions.

1 In this paper, context refers only to that within the same sentence. We do not address issues
of discourse or document-level context.

All such systems (including our own) must therefore make an implicit or explicit
compromise between generality and specificity. For example, Lavoie [4] uses a hand-
coded set of language-pair specific alignment constraints and attribute constraints that
act as templates for the actual induced transfer rules.
As a result, such a system necessarily produces many mappings that are in conflict
with each other and do not include the necessary distinguishing context. A method is
needed, therefore, to automatically choose between such mappings.
In this paper, we present a machine-learning approach to choosing between
conflicting transfer mappings. For each set of conflicting mappings, we build a
decision tree classifier that learns to choose the most appropriate mapping, based on
the linguistic features present in the source language logical form. The decision tree,
by selecting distinguishing context, in effect, selectively expands the context of each
such mapping.

2 Previous Work

Meyers [6] ranks acquired rules by frequency in the training corpus. When choosing
between conflicting rules, the most frequent rule is selected. However, this results
in choosing the incorrect translation for input for which the correct contextual
translation is not the most frequent translation in the corpus.
Lavoie [4] ranks induced rules by log-likelihood ratio and uses error-driven
filtering to accept only those rules that reduce the error rate on the training corpus. In
the case of conflicting rules this effectively picks the most frequent rule.
Watanabe [7] addressed a subset of this problem by identifying “exceptional”
examples, such as idiomatic or irregular translations, and using such examples only
under stricter matching conditions than more general examples. However, he also did
not attempt to identify what context best selected between these examples.
Kaji [1] has a particularly cogent exposition of the problem of conflicting
translation templates and includes a template refinement step. The translation
examples from which conflicting templates were derived are examined, and the
templates are expanded by the addition of extra distinguishing features such as
semantic categories. For example, he refines two conflicting templates for “play
<NP>” which translate into different Japanese verbs, by recognizing that one template
is used in the training data with sports (“play baseball”, “play tennis”) while the other
is used with musical instruments (“play the piano”, “play the violin”). The conflicting
templates are then expanded to include semantic categories of “sport” and
“instrument” on their respective NPs. This approach is likely to produce much better
contextual translations than the alternatives cited. However, it appears that in Kaji’s
approach this is a manual step, and hence impractical in a large-scale system.
The approach described in this paper is analogous to Kaji's, but uses machine
learning instead of hand-inspection.

3 System Overview

3.1 The logical form

Our machine translation system [5] uses logical form representations (LFs) in
transfer. These representations are graphs, representing the predicate argument
structure of a sentence. The nodes in the graph are identified by the lemma (base
form) of a content word. The edges are directed, labeled arcs, indicating the logical
relations between nodes.
Additionally, nodes are labeled with a wealth of morpho-syntactic and semantic
features extracted by the source language analysis module.
Logical forms are intended to be as structurally language-neutral as possible. In
particular, logical forms from different languages use the same relation types and
provide similar analyses for similar constructions. The logical form abstracts away
from such language-particular aspects of a sentence as voice, constituent order and
inflectional morphology. Figure 1 depicts an example Spanish logical form, including
features such as number, gender, definiteness, etc.

Figure 1: Example Logical Form

3.2 Acquiring and using transfer mappings

In our MT architecture, alignments between logical form subgraphs of source and
target language are identified in a training phase using aligned bilingual corpora.
From these alignments a set of transfer mappings is acquired and stored in a database.
A set of language-neutral heuristic rules determines the specificity of the acquired
mappings. This process is discussed in detail in [8].
During the translation process, a sentence in the source language is analyzed, and
its logical form is matched against the database of transfer mappings. From the
matching transfer mappings, a target logical form is constructed which serves as input
to a generation component that produces a target string.

3.3 Competing transfer mappings

It is usually the case that multiple transfer mappings are found to match each input
sentence, each matching some subgraph of the input logical form. The subgraphs
matched by these transfer mappings may overlap, partially or wholly.

Overlapping mappings that indicate an identical translation for the nodes and
relations in the overlapping portion are considered compatible. The translation system
can merge such mappings when constructing a target logical form.
Overlapping transfer mappings that indicate a different translation for the
overlapping portions are considered competing. These mappings cannot be merged
when constructing a target logical form, and hence the translation system must choose
between them. Figure 2 shows an example of two partially overlapping mappings that
compete on the word “presentar”. The first mapping translates “presentar ventaja” as
“have advantage”, whereas the second mapping translates “presentar” by itself to
“display”.
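The distinction between compatible and competing mappings can be illustrated with a minimal sketch. It assumes a toy representation in which each mapping is a plain dictionary from source node to target translation; the real system works on logical form subgraphs, and the function name `relation` is hypothetical.

```python
def relation(mapping_a, mapping_b):
    """Classify two mappings by how they translate the source nodes they share.

    Each mapping is a dict from source node to target translation, a toy
    stand-in (assumption) for a logical form subgraph.
    """
    shared = mapping_a.keys() & mapping_b.keys()
    if not shared:
        return "disjoint"      # no overlap at all
    if all(mapping_a[n] == mapping_b[n] for n in shared):
        return "compatible"    # same translation on the overlap: can be merged
    return "competing"         # different translation on the overlap: must choose


# The two mappings described in the text, which overlap on "presentar".
have_advantage = {"presentar": "have", "ventaja": "advantage"}
display = {"presentar": "display"}
result = relation(have_advantage, display)
```

Here `result` is "competing", because the two mappings disagree on the translation of the shared node "presentar".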

Figure 2: Competing transfer mappings

3.4 Conflicting transfer mappings

In this paper we examine the subset of competing mappings that overlap fully.
Following Kaji [1] we define conflicting mappings as those whose left-hand sides are
identical, but whose right-hand sides differ.
Figure 3 shows two conflicting mappings that translate the same left-hand side,
“todo (las) columna(s)”, as “all (of the) column(s)” and “(the) entire column”
respectively.
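As a rough illustration of this definition, the sketch below groups mappings by left-hand side and keeps only the groups with more than one distinct right-hand side. The string representation of each side, the function name `find_conflicting_groups` and the counts are all assumptions made for the example, not data from our system.

```python
from collections import defaultdict


def find_conflicting_groups(mappings):
    """Group (lhs, rhs, count) triples by left-hand side and return only the
    groups whose right-hand sides differ, i.e. the conflicting mappings."""
    by_lhs = defaultdict(dict)
    for lhs, rhs, count in mappings:
        by_lhs[lhs][rhs] = by_lhs[lhs].get(rhs, 0) + count
    return {lhs: rhss for lhs, rhss in by_lhs.items() if len(rhss) > 1}


# Toy data: the frequencies are invented for illustration only.
mappings = [
    ("todo columna", "all column", 12),
    ("todo columna", "entire column", 5),
    ("presentar ventaja", "have advantage", 7),
]
groups = find_conflicting_groups(mappings)
```

Only "todo columna" survives as a conflicting group, since "presentar ventaja" has a single right-hand side in this toy data.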

Figure 3: Conflicting transfer mappings

4 Data

Our training corpus consists of 351,026 aligned Spanish-English sentence
pairs taken from published computer software manuals and online help documents.
The sentences have an average length of 17.44 words in Spanish and 15.10 words in
English. Our parser produces a parse in every case, but in each language roughly 15%
of the parses produced are “fitted” or non-spanning. We apply a conservative heuristic
and only use in alignment those sentence-pairs that produced spanning parses in both
languages. In this corpus, 277,109 sentence pairs (or 78.9% of the original corpus)
were used in training.

5 Using machine learning

The machine learning approach that we use in this paper is a decision tree model. The
reason for this choice is a purely pragmatic one: decision trees are easy to construct
and easy to inspect. Nothing in our methodology, however, hinges on this particular
choice.²

We use a set of automated tools to construct decision trees [9] based on the
features extracted from logical forms.
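To make the idea of a learned split concrete, the following is a minimal sketch of a one-split decision tree (a "stump") in plain Python. It is deliberately not the Bayesian learner of [9]: it is a simplified stand-in that picks the single feature test with the highest training accuracy. The feature names and the toy examples are assumptions.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def learn_stump(examples):
    """examples: list of (features dict, label).
    Returns (feature, value, label_if_equal, label_otherwise) for the single
    test `features[feature] == value` with the best training accuracy."""
    tests = {(k, v) for feats, _ in examples for k, v in feats.items()}
    best, best_correct = None, -1
    for k, v in sorted(tests, key=repr):   # sorted for a deterministic result
        yes = [lab for feats, lab in examples if feats.get(k) == v]
        no = [lab for feats, lab in examples if feats.get(k) != v]
        if not yes or not no:
            continue                       # the test does not split the data
        yes_lab, no_lab = majority(yes), majority(no)
        correct = yes.count(yes_lab) + no.count(no_lab)
        if correct > best_correct:
            best, best_correct = (k, v, yes_lab, no_lab), correct
    return best


def predict(stump, feats):
    feature, value, if_equal, otherwise = stump
    return if_equal if feats.get(feature) == value else otherwise


# Toy training data (invented): the lemma of the head noun decides the mapping.
examples = (
    [({"head.lemma": "índice", "parent.Definite": True}, "clustered")] * 3
    + [({"head.lemma": "página", "parent.Definite": True}, "grouped")] * 2
    + [({"head.lemma": "página", "parent.Definite": False}, "grouped")]
)
stump = learn_stump(examples)
```

On this toy data the most-frequent baseline is right half of the time, while the stump separates the two mappings by splitting on the head lemma rather than on definiteness.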

5.1 The classification task

Each set of conflicting transfer mappings (those with identical left-hand sides)
comprises a distinct classification task. The goal of each classifier is to pick the
correct transfer mapping. For each task the data consists of the set of sentence pairs
where the common left-hand side of these mappings matches a portion of the source
(Spanish) logical form of the sentence pair. For a given training sentence pair for a
particular task, the correct mapping is determined by matching the (differing) right-
hand sides of the transfer mappings with a portion of the reference target (English)
logical form.
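The construction of training data for one task can be sketched as follows. The sketch assumes that a logical form is reduced to a set of fragment strings, so that "matching" becomes simple set membership; the actual system matches subgraphs. Skipping pairs where zero or several right-hand sides match is also an assumption of this sketch.

```python
def label_examples(lhs, candidate_rhs, sentence_pairs):
    """Build (source_lf, correct_rhs) examples for one conflicting group.

    sentence_pairs: list of (source_lf, target_lf), each a set of fragment
    strings standing in for a logical form (toy representation).
    """
    examples = []
    for source_lf, target_lf in sentence_pairs:
        if lhs not in source_lf:
            continue  # the shared left-hand side does not match this pair
        matched = [rhs for rhs in candidate_rhs if rhs in target_lf]
        if len(matched) == 1:  # keep only unambiguous pairs (assumption)
            examples.append((source_lf, matched[0]))
    return examples


# Toy sentence pairs, invented for illustration.
pairs = [
    ({"todo columna", "seleccionar"}, {"all column", "select"}),
    ({"todo columna", "eliminar"}, {"entire column", "delete"}),
    ({"fila", "eliminar"}, {"row", "delete"}),
]
examples = label_examples("todo columna", ["all column", "entire column"], pairs)
labels = [label for _, label in examples]
```

The third pair is dropped because its source side does not contain the left-hand side, leaving one example for each of the two conflicting mappings.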

5.2 Features

The logical form provides over 200 linguistic features, including semantic
relationships such as subject, object, location, manner etc., and features such as
person, number, gender, tense, definiteness, voice, aspect, finiteness etc. We use all
these features in our classification task.
The features are extracted from the logical form of the source sentence as follows:
1. For every source sentence node that matches a node in the transfer mapping, we
extract the lemma, part-of-speech and all other linguistic features available on
that node.
2. For every source node that is a child or parent of a matching node, we extract
the relationship between that node and its matching parent or child, in
conjunction with the linguistic features on that node.

2 Our focus in this paper is understanding this key problem in data-driven MT, and the
application of machine learning to it, and not the learner itself. Hence we use an off-the-shelf
learner and do not compare different machine learning techniques etc.

3. For every source node that is a grandparent of a matching node, we extract the
chain of relationships between that node and its matching grandchild in
conjunction with the linguistic features on that node.
These features are extracted automatically by traversing the source LF and the
transfer mapping. The automated approach is advantageous, since any new features
that are added to the system will automatically be made available to the learner. The
learner in turn automatically discovers which features are predictive. Not all features
are selected for all models by the decision tree learning tools.
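The three extraction steps above can be sketched roughly as below. The `Node` class, the node identifiers, the relation name "Subject", the feature-naming scheme and the decision to skip neighbours that are themselves matched are all assumptions made for this illustration; they are not the representation used in our system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lemma: str
    pos: str
    feats: dict = field(default_factory=dict)


def extract_features(nodes, edges, matched):
    """nodes: id -> Node; edges: list of (parent_id, relation, child_id);
    matched: ids of source nodes matched by the transfer mapping."""
    parents_of = {}
    for parent, rel, child in edges:
        parents_of.setdefault(child, []).append((parent, rel))

    out = {}
    for m in sorted(matched):
        node = nodes[m]
        # Step 1: lemma, part of speech and all features of a matched node.
        out[f"{m}.lemma"] = node.lemma
        out[f"{m}.pos"] = node.pos
        for k, v in node.feats.items():
            out[f"{m}.{k}"] = v
        # Step 2 (children): relation plus features of each unmatched child.
        for parent, rel, child in edges:
            if parent == m and child not in matched:
                for k, v in nodes[child].feats.items():
                    out[f"{m}.child[{rel}].{k}"] = v
        for parent, rel in parents_of.get(m, []):
            # Step 2 (parents): relation plus features of each unmatched parent.
            if parent not in matched:
                for k, v in nodes[parent].feats.items():
                    out[f"{m}.parent[{rel}].{k}"] = v
            # Step 3: chain of relations plus features of each grandparent.
            for grand, rel2 in parents_of.get(parent, []):
                if grand not in matched:
                    for k, v in nodes[grand].feats.items():
                        out[f"{m}.grandparent[{rel2}/{rel}].{k}"] = v
    return out


# Toy logical form: residir -Subject-> índice -Attrib-> agrupado
nodes = {
    "n1": Node("residir", "Verb", {"Tense": "Pres"}),
    "n2": Node("índice", "Noun", {"Definite": True, "Number": "Sing"}),
    "n3": Node("agrupado", "Adj"),
}
edges = [("n1", "Subject", "n2"), ("n2", "Attrib", "n3")]
features = extract_features(nodes, edges, matched={"n2", "n3"})
```

Because the traversal simply copies whatever is stored on each node, a feature newly added by the analysis module would appear in the output without any change to this code, which is the property the paragraph above relies on.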

5.3 Sparse data problem

Most of the data sets for which we wish to build classifiers are small (our median data
set is 134 sentence pairs), making overfitting of our learned models likely. We
therefore employ smoothing analogous to average-count smoothing as described by
Chen and Goodman [10], which in turn is a variant of Jelinek and Mercer [11]
smoothing.
For each classification task we split the available data into a training set (70%) and
a parameter tuning set (30%). From the training set, we build decision tree classifiers
at varying levels of granularity (by manipulating the prior probability of tree
structures to favor simpler structures). If we were to pick the tree with the maximal
accuracy for each data set independently, we would run the risk of over-fitting to the
parameter tuning data. Instead, we pool all classifiers that have the same average
number of cases per mapping (i.e. per target feature value). For each such pool, we
then evaluate the pool as a whole at each level of decision tree granularity and pick
the level of granularity that maximizes the accuracy of the pool as a whole. We then
choose the same granularity level for all classifiers in the pool.
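The pooled selection of granularity can be sketched as follows. The sketch assumes each task reports a pool key (its average number of cases per mapping, already bucketed) and the number of correctly classified tuning examples at each granularity level; the field names, the level names and the numbers are hypothetical.

```python
from collections import defaultdict


def choose_granularity(tasks):
    """tasks: list of dicts with
         'pool'    -- bucketed average number of cases per mapping
         'correct' -- {granularity level: correct tuning examples}
    Returns {pool: level}, picking for each pool the level that maximizes
    the number of correct tuning examples summed over the whole pool."""
    pooled = defaultdict(lambda: defaultdict(int))
    for task in tasks:
        for level, correct in task["correct"].items():
            pooled[task["pool"]][level] += correct
    # The tuning-set size of a pool is the same at every level, so maximizing
    # the pooled count of correct examples maximizes pooled accuracy.
    return {pool: max(levels, key=levels.get) for pool, levels in pooled.items()}


# Toy numbers, invented for illustration.
tasks = [
    {"pool": 50, "correct": {"coarse": 10, "fine": 12}},
    {"pool": 50, "correct": {"coarse": 14, "fine": 9}},
    {"pool": 200, "correct": {"coarse": 40, "fine": 55}},
]
chosen = choose_granularity(tasks)
```

In this toy data the first task would pick "fine" if tuned on its own, but the pool it belongs to picks "coarse", which is the intended protection against over-fitting to a small tuning set.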

5.4 Building decision trees

From our corpus of 277,109 sentence pairs, we extracted 161,685 transfer mappings.
Among these, there were 7027 groups of conflicting mappings, amounting to 19,582
conflicting transfer mappings in all, or about 2.79 mappings per conflicting group.
The median size of the data sets was 134 sentence pairs (training and parameter
tuning) per conflicting group.
We built decision trees for all groups that had at least 10 sentence pairs, resulting
in a total of 6912 decision trees.³ The number of features emitted per data set ranged
from 15 to 1775, with an average of 653. (The number of features emitted for each
data set depends on the size of the mappings, since each mapping node provides a
distinct set of features. The number of features also depends on the diversity of
linguistic features actually present in the logical forms of the data set.)

3 We discard mappings with frequency less than 2. A conflicting mapping group (at least 2
mappings) would have, at minimum, 4 sentence pairs. The 115 groups for which we don’t
build decision trees are those with 4 to 9 sentence pairs each.

In total there were 1775 distinct features over all the data sets. Of these, 1363
features were selected by at least one decision tree model. The average model had
35.49 splits in the tree, and used 18.18 distinct features.
The average accuracy of the classifiers against the parameter tuning data set was
81.1% without smoothing, and 80.3% with smoothing. By comparison the average
baseline (most frequent mapping within each group) was 70.8%.⁴

Table 1: Training data and decision trees

Size of corpus used (sentence pairs)                            277,109
Total number of transfer mappings                               161,685
Number of conflicting transfer mappings                         19,582
Number of groups of conflicting mappings                        7,027
Number of decision tree classifiers built                       6,912
Median size of the data set used to train each classifier       134
Total number of features emitted over all classifiers           1,775
Total number of features selected by at least one classifier    1,363
Average number of features emitted per data set                 653
Average number of features used per data set                    18.18
Average number of decision tree splits                          35.49
Average baseline accuracy                                       70.8%
Average decision tree accuracy without smoothing                81.1%
Average decision tree accuracy with smoothing                   80.3%

6 Evaluating the decision trees

6.1 Evaluation method

We evaluated the decision trees using a human evaluation that compared the output of
our Spanish-English machine translation system using the decision trees (WithDT) to
the output of the same system without the decision trees (NoDT), keeping all other
aspects of the system constant.
Each system used the same set of learned transfer mappings. The system without
decision trees picked between conflicting mappings by simply choosing the most
frequent mapping (i.e., the baseline) in each case.
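The difference between the two systems at the point of choice can be sketched as below. The function name `choose_mapping`, the callable `toy_tree` and the frequencies are hypothetical; the sketch also assumes that a group without a classifier falls back to the most frequent mapping.

```python
def choose_mapping(rhs_counts, classifier=None, features=None):
    """Pick one right-hand side from a group of conflicting mappings.

    rhs_counts: {right-hand side: training frequency}.
    classifier: optional callable from a feature dict to a right-hand side.
    """
    if classifier is not None and features is not None:
        return classifier(features)               # WithDT behaviour
    return max(rhs_counts, key=rhs_counts.get)    # NoDT: most frequent mapping


# Invented frequencies and a hand-written stand-in for a learned tree.
counts = {"grouped": 9, "clustered": 6, "banded": 2}


def toy_tree(feats):
    return "clustered" if feats.get("head.lemma") == "índice" else "grouped"


baseline_choice = choose_mapping(counts)
tree_choice = choose_mapping(counts, toy_tree, {"head.lemma": "índice"})
```

The two systems produce different output only when the classifier departs from the most frequent choice, as in this toy case.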
We translated a test set of 2000 previously unseen sentences. Of these sentences,
1683 had at least one decision tree apply.
For 407 of these sentences at least one decision tree indicated a choice other than
the default (highest frequency) choice, hence a different translation was produced
between the two systems. Of these 407 different translations, 250 were randomly
selected for evaluation.

4 All of the averages mentioned in this section are weighted in proportion to the size of the
respective data sets.

Seven evaluators from an independent vendor agency were asked to rate the
sentences. For each sentence the evaluators were presented with an English reference
(human) translation and the two machine translations.⁵ The machine translations were
presented in random order, so the evaluators could not know their provenance.
Assuming that the reference translation was a perfect translation, the evaluators
were asked to pick the better machine translation or to pick neither if both were
equally good or bad. Each sentence was then rated –1, 0 or 1, based on this choice,
where –1 indicates that the translation from NoDT is preferred, 1 indicates that
WithDT is preferred, and 0 indicates no preference. The scores were then averaged
across all raters and all sentences.
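The scoring scheme can be written out as a short sketch. It computes the mean score over all raters and sentences, and also counts sentence-level preferences from each sentence's average score. The ratings below are invented, and the confidence interval is not reproduced because its computation is not specified here.

```python
from statistics import mean


def summarize(ratings):
    """ratings: one list per sentence, holding one score per rater, where
    1 = WithDT preferred, -1 = NoDT preferred, 0 = no preference."""
    overall = mean(score for sentence in ratings for score in sentence)
    per_sentence = [mean(sentence) for sentence in ratings]
    with_dt = sum(1 for s in per_sentence if s > 0)
    no_dt = sum(1 for s in per_sentence if s < 0)
    neither = sum(1 for s in per_sentence if s == 0)
    return overall, (with_dt, no_dt, neither)


# Toy ratings: four sentences, three raters each (invented).
ratings = [
    [1, 1, 0],
    [-1, 0, -1],
    [1, -1, 0],
    [1, 1, 1],
]
overall, counts = summarize(ratings)
```

For these toy ratings the overall mean is 0.25, with two sentences preferring WithDT, one preferring NoDT and one showing no preference.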

6.2 Evaluation results

The results of this evaluation are presented in Tables 2a and 2b. In Table 2a, note that
a mean score of 1 would indicate a uniform preference (across all raters and all
sentences) for WithDT, while a score of –1 would indicate a uniform preference for
NoDT. The score of 0.330 +/- 0.093 indicates a strong preference for WithDT.
Table 2b shows the number of translations preferred from each system, based on
the average score across all raters for each translation. Note that WithDT was
preferred more than twice as often as NoDT.
The evaluation thus shows that in cases where the decision trees played a role, the
mapping chosen by the decision tree resulted in a significantly better translation.

Table 2a: Evaluation results: Mean Score

                   Score             Significance   Sample size
WithDT vs. NoDT    0.330 +/- 0.093   >0.99999       250

Table 2b: Evaluation results: Sentence preference

                   Number of sentences    Number of sentences    Number of sentences
                   WithDT rated better    NoDT rated better      neither rated better
WithDT vs. NoDT    167 (66.8%)            75 (30%)               8 (3.2%)

7 Comparisons

The baseline NoDT system uses the same (highest-frequency) strategy for choosing
between conflicting mappings as is used by Meyers et al. [6] and by Lavoie et al. [4].

5 Since the evaluators are given a high-quality human reference translation, the original
Spanish sentence is not essential for judging the MT quality, and is therefore omitted. This
controls for differing levels of fluency in Spanish among the evaluators.

The results show that the use of machine learning significantly improves upon this
strategy.
The human classification task proposed by Kaji [1] points us in the right direction,
but is impractical for large-scale systems. The machine learning strategy we use is the
first automated realization of this strategy.

8 Examining the decision trees

One of the advantages of using decision trees to build our classifiers is that decision
trees lend themselves to inspection, potentially leading to interesting insights that can
aid system development. In particular, as discussed in Sections 1 and 2, conflicting
mappings are a consequence of a heuristic compromise between specificity and
generality when the transfer mappings are acquired. Hence, examining the decision
trees may help understand the nature of that compromise.
We found that of a total of 1775 features, 1363 (77%) were used by at least one
decision tree, 556 (31%) of them at the top-level of the tree. The average model had
35.49 splits in the decision tree, and used 18.18 distinct features. Furthermore, the
single most popular feature accounted for no more than 8.68% of all splits and 10.4%
of top-level splits.
This diversity of features suggests that the current heuristics used during transfer
mapping acquisition strike a good compromise between specificity and generality.
This is complemented by the learner, which enlarges the context for our mappings in
a highly selective, case-by-case manner, drawing upon the full range of linguistic
features available in the logical form.

9 An example

We used DnetViewer [12], a visualization tool for viewing decision trees and
Bayesian networks, to explore the decision trees, looking for interesting insights into
problem areas in our MT system. Figure 4 shows a simple decision tree displayed by
this viewer. The text has been enlarged for readability and each leaf node has been
annotated with the highest probability mapping (i.e., the mode of the predicted
probability distribution shown as a bar chart) at that node.
The figure depicts the decision tree for the transfer mappings for (*—Attrib—
agrupado), which translates, in this technical corpus, to either “grouped”, “clustered”,
or “banded”. The top-level split is based on the input lemma that matches the * (wild-
card) node. If this lemma is “indice” then the mapping for “clustered” is chosen.
If the lemma is not “indice”, the next split is based on whether the parent node is
marked as indefinite, which leads to further splits, as shown in the figure, based again
on the input lemma that matches the wild-card node.
Examples of sentence pairs used to build this decision tree:
Los datos y el índice agrupado residen siempre en el mismo grupo de archivos.
The data and the clustered index always reside in the same filegroup.

Se produce antes de mostrar el primer conjunto de registros en una página de
datos agrupados.
Occurs before the first set of records is displayed on a banded data page.
En el caso de páginas de acceso a datos agrupadas, puede ordenar los
registros incluidos en un grupo.
For grouped data access pages, you can sort the records within a group.

Figure 4: Decision tree for (*—Attrib—agrupado)

10 Conclusions and future work

We have shown that applying machine learning to this problem resulted in a
significant improvement in translation over the highest-frequency strategy used by
previous systems.
However, we built decision trees only for conflicting mappings, which comprise a
small subset of competing transfer mappings (discussed in Section 3.3). For instance,
in our test corpus, on average, 38.67 mappings applied to each sentence, of which
12.76 mappings competed with at least one other mapping, but of these only 2.35
were conflicting (and hence had decision trees built for them). We intend to extend
this approach to all competing mappings, which is likely to have a much
greater impact. This is, however, not entirely straightforward, since such matches
compete on some sentences but not others.

In addition, we would like to explore whether abstracting away from specific
lemmas, using thesaurus classes, WordNet syn-sets or hypernyms, etc. would result in
improved classifier performance.

11 Acknowledgements

Thanks go to Robert C. Moore for many helpful discussions, useful suggestions and
advice, particularly in connection with the smoothing method we used. Thanks to
Simon Corston-Oliver for advice on decision trees and code to use them, and to Max
Chickering, whose excellent tool-kit we used extensively. Thanks also go to members
of the NLP group at Microsoft Research for valuable feedback.

References

1. Hiroyuki Kaji, Yuuko Kida, and Yasutsugu Morimoto: Learning Translation Templates
from Bilingual Text. In Proceedings of COLING (1992)
2. Adam Meyers, Michiko Kosaka and Ralph Grishman: Chart-based transfer rule
application in machine translation. In Proceedings of COLING (2000)
3. Hideo Watanabe, Sadao Kurohashi, and Eiji Aramaki: Finding Structural Correspondences
from Bilingual Parsed Corpus for Corpus-based Translation. In Proceedings of COLING
(2000)
4. Benoit Lavoie, Michael White and Tanya Korelsky: Inducing Lexico-Structural Transfer
Rules from Parsed Bi-texts. In Proceedings of the Workshop on Data-driven Machine
Translation, ACL 2001, Toulouse, France (2001)
5. Stephen D. Richardson, William Dolan, Monica Corston-Oliver, and Arul Menezes:
Overcoming the customization bottleneck using example-based MT. In Proceedings of the
Workshop on Data-Driven Machine Translation, ACL 2001, Toulouse, France (2001)
6. Adam Meyers, Roman Yangarber, Ralph Grishman, Catherine Macleod, and Antonio
Moreno-Sandoval: Deriving transfer rules from dominance-preserving alignments. In
Proceedings of COLING (1998)
7. Hideo Watanabe: A method for distinguishing exceptional and general examples in
example-based transfer systems. In Proceedings of COLING (1994)
8. Arul Menezes and Stephen D. Richardson: A best-first alignment algorithm for automatic
extraction of transfer mappings from bilingual corpora. In Proceedings of the Workshop
on Data-Driven Machine Translation, ACL 2001, Toulouse, France (2001)
9. David Maxwell Chickering, David Heckerman, and Christopher Meek: A Bayesian
approach to learning Bayesian networks with local structure. In D. Geiger, and P. Pundalik
Shenoy (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Thirteenth
Conference. 80-89. (1997)
10. Stanley Chen and Joshua Goodman: An empirical study of smoothing techniques for
language modeling. In Proceedings of ACL (1996)
11. Frederick Jelinek and Robert L. Mercer: Interpolated estimation of Markov source
parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in
Practice, Amsterdam, The Netherlands (1980)
12. David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite,
Carl Kadie: Dependency networks for inference, collaborative filtering and data
visualization. In Journal of Machine Learning Research 1:49-75 (2000)
