Fast evaluation of connectionist language models
       10th International Work-Conference on Artificial Neural Networks


        F. Zamora-Martínez                  M.J. Castro-Bleda                     S. España-Boquera

                    Departamento de Ciencias Físicas, Matemáticas y de la Computación
                                   Universidad CEU-Cardenal Herrera
                               46115 Alfara del Patriarca (Valencia), Spain

                             Departamento de Sistemas Informáticos y Computación
                                     Universidad Politécnica de Valencia
                                               Valencia, Spain

                                   {fzamora,mcastro,sespana}@dsic.upv.es


                                                June 11 2009




Index


1    Introduction and motivation

2    Neural Network Language Models (NN LMs)

3    Fast evaluation of NNLMs

4    Estimation of the NNLMs

5    Evaluation of the proposed approach

6    Discussion and conclusions




Introduction and motivation

        Language modelling is the attempt to characterize, capture and exploit
        regularities in natural language.
        In pattern recognition problems, language models (LMs) are useful to
        guide the search for the optimal response and to increase the success
        rate of the system.

Example
LM statistical framework

 S           =      A move to stop . . .

                     |S|
 p(S)        =        ∏  p(si | s1 . . . si−1 )
                     i=1

             =      p(A) p(move|A) p(to|A move) p(stop|A move to) . . .
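
To make the chain-rule factorization above concrete, a minimal sketch (not
from the paper); `cond_prob` is a hypothetical stand-in for any conditional
model p(word | history):

```python
import math

def sentence_logprob(words, cond_prob):
    """cond_prob(word, history) -> p(word | history); history is a tuple."""
    logp = 0.0
    history = []
    for w in words:
        logp += math.log(cond_prob(w, tuple(history)))
        history.append(w)
    return logp  # p(S) = exp(logp)
```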



Statistical framework: n-grams

  + n-grams are the most popular LM, due to their simplicity and robustness.
  + The model parameters are learnt from text corpora using the occurrence
    frequencies of subsequences of n word units.

Examples
Possible n-grams with n = 2 (bigrams)
 S = A move to stop Mr. Gaitskell . . . =
    = <s> A move to stop Mr. Gaitskell . . . </s>
(<s> A), (A move), (move to), (to stop), (stop Mr.), (Mr. Gaitskell), . . .
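
As a quick illustration of the bigram extraction above, a sketch (assuming
<s> and </s> as the sentence-boundary markers shown):

```python
sentence = "A move to stop Mr. Gaitskell".split()
padded = ["<s>"] + sentence + ["</s>"]
bigrams = list(zip(padded, padded[1:]))
# [('<s>', 'A'), ('A', 'move'), ('move', 'to'), ('to', 'stop'), ...]
```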

Drawbacks of n-grams
  – Larger values of n can capture longer-term dependencies between words.
  – But the number of different n-grams grows exponentially with n, so
    estimating them requires more and more training data.
  – To alleviate this problem, some techniques such as smoothing or
    clustering can be applied.

Connectionist language models

      Recently, some authors have proposed applying neural networks (NNs)
      to language modelling [Bengio][Castro][Schwenk].
      These models can compute an automatic smoothing of unseen n-grams,
      and they scale better with n.

                                                         ⇓
Despite their theoretical advantages, these LMs are more expensive to
compute.
                                                         ⇓
A novel technique to speed up the computation of connectionist language
models is presented in this work.

Motivation
To integrate the connectionist language model into the Viterbi decoder of a
pattern recognition system.


Neural Network Language Models (NN LMs)



LM probability equation, n-grams:

                                        |S|
                   p(s1 . . . s|S| ) ≈   ∏  p(si | si−n+1 . . . si−1 ) .
                                        i=1

     A NN LM is a statistical LM which follows the same equation as n-grams.
     Probabilities that appear in that expression are estimated with a NN.
      The model fits naturally under the probabilistic interpretation of the
      outputs of NNs: if a NN, in this case a Multilayer Perceptron (MLP), is
      trained as a classifier, the outputs associated with each class are
      estimates of the posterior probabilities of the defined classes.




NN LMs: Codification of the vocabulary I

        The training set for a LM is a sequence s1 s2 . . . s|S| of words
        from a vocabulary Ω.
        Each input word is locally encoded following a “1-of-|Ω|” scheme:

                                 1     2   3   4    ...   |Ω|
                    si−2 =       0     0   1   0    ...    0
                    si−1 =       1     0   0   0    ...    0

        [Figure: a trigram example of a NN LM]

 Problems:
  – For tasks with large vocabularies, the resulting NN is very large.
  – The input of the NN is very sparse.
  – This leads to slow convergence during the training process.
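
A minimal sketch of this 1-of-|Ω| (one-hot) encoding; `vocab` is a
hypothetical word-to-index mapping:

```python
def one_hot(word, vocab):
    # all zeros except a single 1 at the word's index
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

vocab = {"<s>": 0, "a": 1, "move": 2, "to": 3, "stop": 4}
print(one_hot("move", vocab))  # [0, 0, 1, 0, 0]
```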


NN LMs: Codification of the vocabulary II


We use ideas from Bengio and Schwenk to learn a distributed representation
of each word during the MLP training.

Examples
Distributed encoding
                             si−2 =     0.2        0.1       0.5        0.3        ...   0.1

                             si−1 =     0.4        0.4       0.3        0.6        ...   0.2

                                                  |si−2 | << |Ω|
                                                  |si−1 | << |Ω|




NN LMs: Codification of the vocabulary III
The input is composed of the words si−n+1 , . . . , si−1 of the n-gram
equation. Each word is represented using a local encoding.




   [Figure: NN LM architecture; the output is p(si |si−n+1 . . . si−1 )]




A new projection layer P, formed by subsets Pi−n+1 , . . . , Pi−1 of projection
units, is added. Each Pj encodes the corresponding input word sj .




    [Figure: projection layer; Pj ⇒ codified word]




The weights from each local encoding of input word sj to the corresponding
subset of projection units Pj are the same for all input words j.




    [Figure: shared weights in the projection layer]




After training, the projection layer is removed from the network by
pre-computing a table of size |Ω| which serves as a distributed encoding.




        a       0.4     0.2         ...   0.3
    move        0.2     0.1         ...   0.8
       to       0.6     0.7         ...   0.6
     stop       0.1     0.5         ...   0.2
                       ...
     </s>       0.4     0.3         ...   0.9
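
A sketch of how this table is used after training (the numeric codes are
the illustrative values from the table above, not real trained weights):

```python
# The learned projection weights become a plain lookup table
# (a distributed encoding, i.e. an embedding table).
proj = {
    "a":    [0.4, 0.2, 0.3],
    "move": [0.2, 0.1, 0.8],
    "to":   [0.6, 0.7, 0.6],
}

def encode_history(history):
    """Concatenate the codes of the (n-1) history words."""
    code = []
    for w in history:
        code.extend(proj[w])
    return code

print(encode_history(["a", "move"]))  # input vector for the hidden layer
```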






H is the hidden layer, with an empirically chosen number of units.





O is the output layer, with |Ω| units. The softmax activation function also
ensures that the output values sum to one.



  The outputs estimate p(ω|si−n+1 . . . si−1 ), ω ∈ Ω:

        oi = exp(ai ) / Σ_{j=1..|Ω|} exp(aj ) ,

  where ai is the activation value of the i-th output unit and oi is its
  output value.
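
A sketch of this softmax computation (the max-subtraction is a standard
numerical-stability trick, not something the slides specify):

```python
import math

def softmax(activations):
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)              # normalization sum (for the shifted activations)
    return [e / z for e in exps]
```

The denominator plays the role of the softmax normalization constant that
the fast-evaluation method below pre-computes.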




NN LMs: Codification of the vocabulary IV

This NN predicts the posterior probability of each word of the vocabulary
given the history. A single forward pass of the MLP gives p(ω|si−n+1 . . . si−1 )
for every word ω ∈ Ω.

Advantages
  + Automatic estimation (as with statistical LMs).
  + In general, the lowest number of parameters among the obtained models.
  + Automatic smoothing performed by the neural network estimators.

Problems
  – The larger the lexicon is, the more parameters the neural network
    needs.
  – In speech or handwriting recognition, or in translation tasks, thousands
    of language model lookups are needed.
  – Huge NN LMs consume excessive time computing these values.


Fast evaluation of NNLMs I
The softmax normalization term requires the computation of every output
value. This computation dominates the cost of the forward pass in a typical
NN LM topology.




  p(ω|si−n+1 . . . si−1 ), ω ∈ Ω.

        oi = exp(ai ) / Σ_{j=1..|Ω|} exp(aj ) ,

  where ai is the activation value of the i-th output unit and oi is its
  output value.




Proposed approach I



      The idea: pre-compute and store the softmax normalization constants
      most likely to be needed during LM evaluation.
      A space/time trade-off has to be considered: the more space is dedicated
      to storing pre-computed softmax normalization constants, the larger the
      time reduction that can be obtained.
      When a given normalization constant is not found, two options exist:
              compute it on the fly, or
              apply some kind of smoothing.
We follow the second option: when a softmax normalization constant is not
found, we back off to a lower-order model.




Proposed approach II
 1 Observe that a bigram NN LM only needs a |Ω|-sized pre-computed table
   of softmax normalization constants.




 2 Choose a hierarchy of models, from higher-order n-grams down to
   bigrams.




                           [Figure: a possible hierarchy of models]




 3 For the bigram NN LM: pre-compute the softmax normalization constant
   for every word of the lexicon. Store the constants in a |Ω|-sized
   table.




       for each ω ∈ Ω  ⇒  Σ_{j=1..|Ω|} exp(aj )

                                  ⇓
                     Pre-computed table (|Ω| entries)

                          <s>      0.1
                          a       -0.1
                          move     1.0
                          ...      ...
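
A sketch of this step; `output_activations(history)` is a hypothetical
helper returning the NN's output-layer activations for a given history:

```python
import math

def precompute_bigram_constants(vocab, output_activations):
    """One softmax normalization constant per one-word history."""
    table = {}
    for w in vocab:
        acts = output_activations((w,))   # forward pass for history (w,)
        table[(w,)] = sum(math.exp(a) for a in acts)
    return table
```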




 4 For each n-gram NN LM of order higher than bigram: pre-compute the
   softmax normalization constants for the K most frequent (n − 1)-grams in
   the training set. Store the constants in a table.




       for each of the K most frequent (n − 1)-grams  ⇒  Σ_{j=1..|Ω|} exp(aj )

                                  ⇓
                     Pre-computed table (K entries)

                          <s>   <s>    <s>      0.5
                          <s>   <s>    a        0.1
                          <s>   a      move     1.0
                          a     move   to      -0.1
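
A sketch of this step (again with the hypothetical `output_activations`
helper; `train_tokens` is the training text as a flat list of words):

```python
import math
from collections import Counter

def precompute_topk_constants(train_tokens, n, K, output_activations):
    """Constants for the K most frequent (n-1)-word histories."""
    counts = Counter(
        tuple(train_tokens[i:i + n - 1])
        for i in range(len(train_tokens) - n + 2)
    )
    table = {}
    for hist, _ in counts.most_common(K):
        acts = output_activations(hist)
        table[hist] = sum(math.exp(a) for a in acts)
    return table
```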




 5 During the test evaluation, for each token: look up in the table the
   pre-computed softmax constant associated with the token's (n − 1)-word
   prefix. If the constant is in the table, compute the probability;
   otherwise, switch to the immediately lower-order NN LM.
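
A sketch of this lookup-with-back-off, putting the pieces together
(`models[n]` is a hypothetical (net, table) pair per n-gram order, and
`net.output_activation(history, word)` the activation of the word's output
unit):

```python
import math

def nnlm_prob(word, history, models, max_n):
    # Try the highest-order model first, backing off towards the bigram.
    for n in range(max_n, 1, -1):
        net, table = models[n]
        hist = tuple(history[-(n - 1):])   # the (n-1)-word prefix
        if hist in table:                  # constant was pre-computed
            a = net.output_activation(hist, word)
            return math.exp(a) / table[hist]
    # The bigram table covers every vocabulary word, so we never get here.
    raise KeyError(history)
```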




Estimation of the NN LMs: Corpus


     Experiments with the LOB text corpus have been conducted.
      A subcorpus with a lexicon of 3 000 words has been built as follows:
              Apply a random ordering of the sentences.
              Select the sentences whose words all lie among the first 3 000
              distinct words of the reordered corpus.
This subcorpus was partitioned into three sets:

                           Partition       # Sentences                # running words
                           Training              4 303                         37 606
                           Validation              600                          5 348
                           Test                    600                          5 455
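
One plausible reading of that construction, as a sketch (the slide does not
spell out the exact procedure, so treat this as an assumption):

```python
import random

def build_subcorpus(sentences, max_vocab=3000, seed=0):
    random.Random(seed).shuffle(sentences)
    vocab = set()
    for sent in sentences:                 # collect first 3 000 distinct words
        for w in sent:
            if len(vocab) < max_vocab:
                vocab.add(w)
    kept = [s for s in sentences if all(w in vocab for w in s)]
    return kept, vocab
```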




Validation set perplexity for the estimated NN LMs




                         Language Model              Bigram            Trigram         4-gram
                         NN LM 80–128                 82.17              73.62          74.07
                         NN LM 128–192                82.52              73.30          72.50
                         NN LM 192–192                80.01              71.90          71.91
                         Mixed NN LM                  78.92              71.34          70.63


Three different NN LM topologies for each of the bigram, trigram and 4-gram
orders, and a combination of the three (Mixed NN LM) for each order.
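
For reference, a sketch of the standard test-set perplexity definition used
to produce such tables (the slides do not give the formula):

```python
import math

def perplexity(logprobs):
    """logprobs: natural-log probabilities, one per running word."""
    return math.exp(-sum(logprobs) / len(logprobs))
```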




Test set perplexity of the best NN LMs and SRI




                        Language Model               Bigram           Trigram          4-gram
                         Mixed NN LM                 88.72             80.94            79.90
                             SRI                     88.29             83.19            87.05


The best NN LM compared with statistical n-grams estimated with the SRI
toolkit.




Evaluation of the proposed approach: perplexity




[Figure: influence of the number of pre-computed softmax normalization
constants on the test-set perplexity of the proposed approach, for the
Mixed NN LMs (left), and for the Mixed NN LMs combined with a statistical
bigram (right)]




Evaluation of the proposed approach: speed up


                        Model               seconds/token                  tokens/second
                        NN LM               6.43 × 10−3                               155
                        Fast NN LM          1.94 × 10−4                             5 154


A speedup factor of 33 is achieved (6.43 × 10−3 / 1.94 × 10−4 ≈ 33). This is
for a 3 000-word lexicon; higher speedups would be achieved for bigger
lexicon sizes.


Conclusion
This speed-up makes it feasible to integrate these LMs into the search
procedure of a recognition task.




Discussion and conclusions



    A novel method for fast evaluation of connectionist language models
    has been presented.
    The best perplexity is obtained by combining the Mixed NN LM with a
    statistical bigram. A speedup factor of 33 is achieved with this
    lexicon size (3 000 words).
    Nevertheless, higher speedups would be achieved for bigger lexicon
    sizes.
    Our next goal is to train more complex NN LMs and to integrate them
    into recognition or translation systems.



