Fast evaluation of connectionist language models
       10th International Work-Conference on Artificial Neural Networks


        F. Zamora-Martínez                  M.J. Castro-Bleda                     S. España-Boquera

                    Departamento de Ciencias Físicas, Matemáticas y de la Computación
                                   Universidad CEU-Cardenal Herrera
                               46115 Alfara del Patriarca (Valencia), Spain

                             Departamento de Sistemas Informáticos y Computación
                                     Universidad Politécnica de Valencia
                                               Valencia, Spain

                                   {fzamora,mcastro,sespana}@dsic.upv.es


                                                June 11 2009




Index


1    Introduction and motivation

2    Neural Network Language Models (NN LMs)

3    Fast evaluation of NNLMs

4    Estimation of the NNLMs

5    Evaluation of the proposed approach

6    Discussion and conclusions




Introduction and motivation

        Language modelling is the attempt to characterize, capture and exploit
        regularities in natural language.
        In pattern recognition problems, language models (LMs) are useful to
        guide the search for the optimal response and to increase the success
        rate of the system.

Example
LM statistical framework

 S           =      A move to stop . . .

                     |S|
 p(S)        =        ∏  p(si | s1 . . . si−1 )
                     i=1

             =      p(A) p(move|A) p(to|A move) p(stop|A move to) . . .
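
To make the chain-rule factorization above concrete, a minimal sketch (not
from the paper); `cond_prob` is a hypothetical stand-in for any conditional
model p(word | history):

```python
import math

def sentence_logprob(words, cond_prob):
    """cond_prob(word, history) -> p(word | history); history is a tuple."""
    logp = 0.0
    history = []
    for w in words:
        logp += math.log(cond_prob(w, tuple(history)))
        history.append(w)
    return logp  # p(S) = exp(logp)
```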



Statistical framework: n-grams

  + n-grams are the most popular LM, due to their simplicity and robustness.
  + The model parameters are learnt from text corpora using the occurrence
    frequencies of subsequences of n word units.

Examples
Possible n-grams with n = 2 (bigrams)
 S = A move to stop Mr. Gaitskell . . . =
    = <s> A move to stop Mr. Gaitskell . . . </s>
(<s> A), (A move), (move to), (to stop), (stop Mr.), (Mr. Gaitskell), . . .
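
As a quick illustration of the bigram extraction above, a sketch (assuming
<s> and </s> as the sentence-boundary markers shown):

```python
sentence = "A move to stop Mr. Gaitskell".split()
padded = ["<s>"] + sentence + ["</s>"]
bigrams = list(zip(padded, padded[1:]))
# [('<s>', 'A'), ('A', 'move'), ('move', 'to'), ('to', 'stop'), ...]
```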

Drawbacks of n-grams
  – Larger values of n can capture longer-term dependencies between words.
  – But the number of different n-grams grows exponentially with n, so
    estimating them requires more and more training data.
  – To alleviate this problem, some techniques such as smoothing or
    clustering can be applied.

Connectionist language models

      Recently, some authors have proposed applying neural networks (NNs)
      to language modelling [Bengio][Castro][Schwenk].
      These models can compute an automatic smoothing of unseen n-grams,
      and they scale better with n.

                                                         ⇓
Despite their theoretical advantages, these LMs are more expensive to
compute.
                                                         ⇓
A novel technique to speed up the computation of connectionist language
models is presented in this work.

Motivation
To integrate the connectionist language model into the Viterbi decoder of a
pattern recognition system.


Neural Network Language Models (NN LMs)



LM probability equation, n-grams:

                                        |S|
                   p(s1 . . . s|S| ) ≈   ∏  p(si | si−n+1 . . . si−1 ) .
                                        i=1

     A NN LM is a statistical LM which follows the same equation as n-grams.
     Probabilities that appear in that expression are estimated with a NN.
      The model fits naturally under the probabilistic interpretation of the
      outputs of NNs: if a NN, in this case a Multilayer Perceptron (MLP), is
      trained as a classifier, the outputs associated with each class are
      estimates of the posterior probabilities of the defined classes.




NN LMs: Codification of the vocabulary I

        The training set for a LM is a sequence s1 s2 . . . s|S| of words
        from a vocabulary Ω.
        Each input word is locally encoded following a “1-of-|Ω|” scheme:

                                 1     2   3   4    ...   |Ω|
                    si−2 =       0     0   1   0    ...    0
                    si−1 =       1     0   0   0    ...    0

        [Figure: a trigram example of a NN LM]

 Problems:
  – For tasks with large vocabularies, the resulting NN is very large.
  – The input of the NN is very sparse.
  – This leads to slow convergence during the training process.
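
A minimal sketch of this 1-of-|Ω| (one-hot) encoding; `vocab` is a
hypothetical word-to-index mapping:

```python
def one_hot(word, vocab):
    # all zeros except a single 1 at the word's index
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v

vocab = {"<s>": 0, "a": 1, "move": 2, "to": 3, "stop": 4}
print(one_hot("move", vocab))  # [0, 0, 1, 0, 0]
```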


NN LMs: Codification of the vocabulary II


We use ideas from Bengio and Schwenk to learn a distributed representation
of each word during the MLP training.

Examples
Distributed encoding
                             si−2 =     0.2        0.1       0.5        0.3        ...   0.1

                             si−1 =     0.4        0.4       0.3        0.6        ...   0.2

                                                  |si−2 | << |Ω|
                                                  |si−1 | << |Ω|




NN LMs: Codification of the vocabulary III
The input is composed of the words si−n+1 , . . . , si−1 of the n-gram
equation. Each word is represented using a local encoding.




   [Figure: NN LM architecture; the output is p(si |si−n+1 . . . si−1 )]




A new projection layer P, formed by subsets Pi−n+1 , . . . , Pi−1 of projection
units, is added. Each Pj encodes the corresponding input word sj .




    [Figure: projection layer; Pj ⇒ codified word]




The weights from each local encoding of input word sj to the corresponding
subset of projection units Pj are the same for all input words j.




    [Figure: shared weights in the projection layer]




After training, the projection layer is removed from the network by
pre-computing a table of size |Ω| which serves as a distributed encoding.




        a       0.4     0.2         ...   0.3
    move        0.2     0.1         ...   0.8
       to       0.6     0.7         ...   0.6
     stop       0.1     0.5         ...   0.2
                       ...
     </s>       0.4     0.3         ...   0.9
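
A sketch of how this table is used after training (the numeric codes are
the illustrative values from the table above, not real trained weights):

```python
# The learned projection weights become a plain lookup table
# (a distributed encoding, i.e. an embedding table).
proj = {
    "a":    [0.4, 0.2, 0.3],
    "move": [0.2, 0.1, 0.8],
    "to":   [0.6, 0.7, 0.6],
}

def encode_history(history):
    """Concatenate the codes of the (n-1) history words."""
    code = []
    for w in history:
        code.extend(proj[w])
    return code

print(encode_history(["a", "move"]))  # input vector for the hidden layer
```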






H is the hidden layer, with an empirically chosen number of units.





O is the output layer, with |Ω| units. The softmax activation function also
ensures that the output values sum to one.



  The outputs estimate p(ω|si−n+1 . . . si−1 ), ω ∈ Ω:

        oi = exp(ai ) / Σ_{j=1..|Ω|} exp(aj ) ,

  where ai is the activation value of the i-th output unit and oi is its
  output value.
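
A sketch of this softmax computation (the max-subtraction is a standard
numerical-stability trick, not something the slides specify):

```python
import math

def softmax(activations):
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)              # normalization sum (for the shifted activations)
    return [e / z for e in exps]
```

The denominator plays the role of the softmax normalization constant that
the fast-evaluation method below pre-computes.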




NN LMs: Codification of the vocabulary IV

This NN predicts the posterior probability of each word of the vocabulary
given the history. A single forward pass of the MLP gives p(ω|si−n+1 . . . si−1 )
for every word ω ∈ Ω.

Advantages
  + Automatic estimation (as with statistical LMs).
  + In general, the lowest number of parameters among the obtained models.
  + Automatic smoothing performed by the neural network estimators.

Problems
  – The larger the lexicon is, the more parameters the neural network
    needs.
  – In speech or handwriting recognition, or in translation tasks, thousands
    of language model lookups are needed.
  – Huge NN LMs consume excessive time computing these values.


Fast evaluation of NNLMs I
The softmax normalization term requires the computation of every output
value. This computation dominates the cost of the forward pass in a typical
NN LM topology.




  p(ω|si−n+1 . . . si−1 ), ω ∈ Ω.

        oi = exp(ai ) / Σ_{j=1..|Ω|} exp(aj ) ,

  where ai is the activation value of the i-th output unit and oi is its
  output value.




Proposed approach I



      The idea: pre-compute and store the softmax normalization constants
      most likely to be needed during LM evaluation.
      A space/time trade-off has to be considered: the more space is dedicated
      to storing pre-computed softmax normalization constants, the larger the
      time reduction that can be obtained.
      When a given normalization constant is not found, two options exist:
              compute it on the fly, or
              apply some kind of smoothing.
We follow the second option: when a softmax normalization constant is not
found, we back off to a lower-order model.




Proposed approach II
 1 Observe that a bigram NN LM only needs a |Ω|-sized pre-computed table
   of softmax normalization constants.




 2 Choose a hierarchy of models, from higher-order n-grams down to
   bigrams.




                           [Figure: a possible hierarchy of models]




 3 For the bigram NN LM: pre-compute the softmax normalization constant
   for every word of the lexicon. Store the constants in a |Ω|-sized
   table.




       for each ω ∈ Ω  ⇒  Σ_{j=1..|Ω|} exp(aj )

                                  ⇓
                     Pre-computed table (|Ω| entries)

                          <s>      0.1
                          a       -0.1
                          move     1.0
                          ...      ...
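
A sketch of this step; `output_activations(history)` is a hypothetical
helper returning the NN's output-layer activations for a given history:

```python
import math

def precompute_bigram_constants(vocab, output_activations):
    """One softmax normalization constant per one-word history."""
    table = {}
    for w in vocab:
        acts = output_activations((w,))   # forward pass for history (w,)
        table[(w,)] = sum(math.exp(a) for a in acts)
    return table
```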




 4 For each n-gram NN LM of order higher than bigram: pre-compute the
   softmax normalization constants for the K most frequent (n − 1)-grams in
   the training set. Store the constants in a table.




       for each of the K most frequent (n − 1)-grams  ⇒  Σ_{j=1..|Ω|} exp(aj )

                                  ⇓
                     Pre-computed table (K entries)

                          <s>   <s>    <s>      0.5
                          <s>   <s>    a        0.1
                          <s>   a      move     1.0
                          a     move   to      -0.1
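
A sketch of this step (again with the hypothetical `output_activations`
helper; `train_tokens` is the training text as a flat list of words):

```python
import math
from collections import Counter

def precompute_topk_constants(train_tokens, n, K, output_activations):
    """Constants for the K most frequent (n-1)-word histories."""
    counts = Counter(
        tuple(train_tokens[i:i + n - 1])
        for i in range(len(train_tokens) - n + 2)
    )
    table = {}
    for hist, _ in counts.most_common(K):
        acts = output_activations(hist)
        table[hist] = sum(math.exp(a) for a in acts)
    return table
```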




 5 During the test evaluation, for each token: look up in the table the
   pre-computed softmax constant associated with the token's (n − 1)-word
   prefix. If the constant is in the table, compute the probability;
   otherwise, switch to the immediately lower-order NN LM.
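
A sketch of this lookup-with-back-off, putting the pieces together
(`models[n]` is a hypothetical (net, table) pair per n-gram order, and
`net.output_activation(history, word)` the activation of the word's output
unit):

```python
import math

def nnlm_prob(word, history, models, max_n):
    # Try the highest-order model first, backing off towards the bigram.
    for n in range(max_n, 1, -1):
        net, table = models[n]
        hist = tuple(history[-(n - 1):])   # the (n-1)-word prefix
        if hist in table:                  # constant was pre-computed
            a = net.output_activation(hist, word)
            return math.exp(a) / table[hist]
    # The bigram table covers every vocabulary word, so we never get here.
    raise KeyError(history)
```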




Estimation of the NN LMs: Corpus


     Experiments with the LOB text corpus have been conducted.
      A subcorpus with a lexicon of 3 000 words has been built as follows:
              Apply a random ordering of the sentences.
              Select the sentences whose words all lie among the first 3 000
              distinct words of the reordered corpus.
This subcorpus was partitioned into three sets:

                           Partition       # Sentences                # running words
                           Training              4 303                         37 606
                           Validation              600                          5 348
                           Test                    600                          5 455
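
One plausible reading of that construction, as a sketch (the slide does not
spell out the exact procedure, so treat this as an assumption):

```python
import random

def build_subcorpus(sentences, max_vocab=3000, seed=0):
    random.Random(seed).shuffle(sentences)
    vocab = set()
    for sent in sentences:                 # collect first 3 000 distinct words
        for w in sent:
            if len(vocab) < max_vocab:
                vocab.add(w)
    kept = [s for s in sentences if all(w in vocab for w in s)]
    return kept, vocab
```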




Validation set perplexity for the estimated NN LMs




                         Language Model              Bigram            Trigram         4-gram
                         NN LM 80–128                 82.17              73.62          74.07
                         NN LM 128–192                82.52              73.30          72.50
                         NN LM 192–192                80.01              71.90          71.91
                         Mixed NN LM                  78.92              71.34          70.63


Three different NN LM topologies for each of the bigram, trigram and 4-gram
orders, and a combination of the three (Mixed NN LM) for each order.
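
For reference, a sketch of the standard test-set perplexity definition used
to produce such tables (the slides do not give the formula):

```python
import math

def perplexity(logprobs):
    """logprobs: natural-log probabilities, one per running word."""
    return math.exp(-sum(logprobs) / len(logprobs))
```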




Test set perplexity of the best NN LMs and SRI




                        Language Model               Bigram           Trigram          4-gram
                         Mixed NN LM                 88.72             80.94            79.90
                             SRI                     88.29             83.19            87.05


The best NN LM compared with statistical n-grams estimated with the SRI
toolkit.




Evaluation of the proposed approach: perplexity




[Figure: influence of the number of pre-computed softmax normalization
constants on the test-set perplexity of the proposed approach, for the
Mixed NN LMs (left), and for the Mixed NN LMs combined with a statistical
bigram (right)]




Evaluation of the proposed approach: speed up


                        Model               seconds/token                  tokens/second
                        NN LM               6.43 × 10−3                               155
                        Fast NN LM          1.94 × 10−4                             5 154


A speedup factor of 33 is achieved (6.43 × 10−3 / 1.94 × 10−4 ≈ 33). This is
for a 3 000-word lexicon; higher speedups would be achieved for bigger
lexicon sizes.


Conclusion
This speed-up makes it feasible to integrate these LMs into the search
procedure of a recognition task.




Discussion and conclusions



    A novel method for fast evaluation of connectionist language models
    has been presented.
    The best perplexity is obtained by combining the Mixed NN LM with a
    statistical bigram. A speedup factor of 33 is achieved with this
    lexicon size (3 000 words).
    Nevertheless, higher speedups would be achieved for bigger lexicon
    sizes.
    Our next goal is to train more complex NN LMs and to integrate them
    into recognition or translation systems.



