GENERALIZED BAUM-WELCH ALGORITHM FOR DISCRIMINATIVE TRAINING ON
LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION SYSTEM

Roger Hsiao, Yik-Cheung Tam and Tanja Schultz

InterACT, Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
{wrhsiao, yct, tanja}@cs.cmu.edu


ABSTRACT

We propose a new optimization algorithm called the Generalized Baum-Welch (GBW) algorithm for discriminative training of hidden Markov models (HMMs). GBW is based on Lagrange relaxation of a transformed optimization problem. We show that both the Baum-Welch (BW) algorithm for ML estimation of HMM parameters and the popular extended Baum-Welch (EBW) algorithm for discriminative training are special cases of GBW. We compare the performance of GBW and EBW on Farsi large vocabulary continuous speech recognition (LVCSR).

Index Terms— Speech recognition, discriminative training.

1. INTRODUCTION

Discriminative training is an important technique for improving recognition accuracy in large vocabulary continuous speech recognition (LVCSR) [1][2]. Common discriminative training algorithms in speech recognition employ maximum mutual information (MMI) estimation [1] and minimum phone error (MPE) [2]. While MMI and MPE have different objective functions, both use the extended Baum-Welch (EBW) algorithm [3] for optimization. In recent years, large-margin based approaches such as [4] have gained popularity and shown promising results. However, on large-scale systems, lattice-based approaches like MMI/MPE using the EBW algorithm remain the most popular methods.

One of the major challenges of discriminative training is optimization. The objective functions used in discriminative training, such as MMI/MPE, can be unbounded. This is also the reason why EBW may corrupt the acoustic model if it is not properly tuned and smoothed. In this paper, we propose a new optimization algorithm called Generalized Baum-Welch (GBW). GBW is based on Lagrange relaxation [5], and the optimization operates in a dual space. We show that GBW does not suffer from the unboundedness issue and does not corrupt the model due to improper tuning. More importantly, we show that both the Baum-Welch (BW) algorithm for maximum likelihood (ML) estimation and EBW for MMI/MPE estimation are special cases of GBW. The formulation of GBW also gives new insight into the EBW formulation, which is naturally understood within the GBW framework.

This paper is organized as follows: in Section 2, we review the EBW algorithm and the MMI objective. In Section 3, we formulate the GBW algorithm as a generalization of the BW and EBW algorithms. In Section 4, we report experimental results for EBW and GBW. We conclude our work and discuss future work in Section 5.

2. EXTENDED BAUM-WELCH ALGORITHM

The objective function for discriminative training, in its simplest form, involves the difference between two log likelihood functions. Consider the simplest case in which we have only one reference and one competitor; then

    F(X, θ) = Q_r(X, θ) − Q_c(X, θ) ,                                         (1)

where

    Q = \sum_t \sum_j γ_t(j) [ \log|Σ_j| + (x_t − μ_j)^T Σ_j^{-1} (x_t − μ_j) ]

is an auxiliary function representing the negative log likelihood. Here x is the observation and γ_t(j) is the posterior probability of x being at Gaussian j at time t. The function F represents the difference between the negative log likelihood of the reference, Q_r, and that of the competitor, Q_c, on the observation X = {x_1, ..., x_T}; θ is the model parameter set, including the mean vectors (μ), covariances (Σ) and mixture weights of the HMM. Minimizing F is the same as maximizing the mutual information, so this form of discriminative training is also known as MMI estimation. MPE is based on the same principle but has a more sophisticated objective function.

The function F is non-convex, so optimization of F may only reach a local optimum. A bigger problem with F is unboundedness. For example, if a Gaussian appears only as a competitor, optimizing the parameters of this Gaussian becomes a minimum likelihood problem, which is unbounded. In general, if the denominator occupancy of a Gaussian is higher than its numerator occupancy, the solution is unbounded. In sum, optimization of F is not trivial.

The idea of EBW is to add an additional auxiliary function to F to enforce convexity. That auxiliary function is required to have zero gradient at the current parameter set [2]. Details of EBW are available in [3]; the reestimation formula for the mean μ_j is

    μ_j = \frac{ \sum_t γ_t^r(j) x_t − \sum_t γ_t^c(j) x_t + D_j μ_j^0 }
               { \sum_t γ_t^r(j) − \sum_t γ_t^c(j) + D_j } ,                  (2)

where r and c indicate whether the term belongs to the reference or the competitor, and D_j is a constant chosen to guarantee that the estimate is valid (e.g., the covariance must stay positive definite). Compared to the Baum-Welch algorithm, which provides the ML estimate for an HMM, the EBW algorithm also takes the competitors into account. [3] shows that the EBW algorithm converges when D → ∞, where D is directly proportional to the number of discrete distributions used to represent a Gaussian in continuous space. This is why EBW needs D → ∞ to guarantee convergence.

In practice, we cannot choose D → ∞, so EBW is not guaranteed to converge. In addition, the EBW reestimation formula in equation 2 often leads to overtraining. Hence, smoothing techniques such as I-smoothing [2] have been proposed. While EBW has proven successful in practice, it often involves careful tuning.
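As a concrete illustration of the update in equation (2), here is a minimal NumPy sketch that reestimates one Gaussian mean from reference and competitor posteriors. The array names and the toy data are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def ebw_mean_update(X, gamma_ref, gamma_com, mu_old, D):
    """EBW reestimation of one Gaussian mean (equation 2).

    X         : (T, d) observation vectors x_t
    gamma_ref : (T,) reference (numerator) posteriors
    gamma_com : (T,) competitor (denominator) posteriors
    mu_old    : (d,) current mean mu_j^0
    D         : constant D_j, chosen large enough to keep the model valid
    """
    num = gamma_ref @ X - gamma_com @ X + D * mu_old
    den = gamma_ref.sum() - gamma_com.sum() + D
    return num / den

# toy usage: 5 frames of 3-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
mu_new = ebw_mean_update(X, rng.random(5), rng.random(5), np.zeros(3), D=10.0)
print(mu_new)
```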




3. GENERALIZED BAUM-WELCH ALGORITHM

We introduce the Generalized Baum-Welch (GBW) algorithm in this section. The GBW algorithm applies Lagrangian relaxation to a transformed optimization problem, and we optimize the parameters of the dual problem, which is itself a relaxed problem. GBW does not suffer from the unboundedness issue, and we can show that both BW and EBW are special cases of GBW.

3.1. Checkpointing

As mentioned, optimizing the function F can be unbounded with respect to some parameters. However, we can address the unboundedness issue by adding checkpoints to the problem,

    G(X, θ) = |Q_r(X, θ) − C_r| + |Q_c(X, θ) − C_c| ,                         (3)

where C_r and C_c are the checkpoints that we want Q_r and Q_c to achieve, respectively. For this particular example, we choose the checkpoints such that Q_r(X, θ) > C_r and Q_c(X, θ) < C_c. As a result, by minimizing the function G we are maximizing the log likelihood difference between the reference and the competitor, but we only ask them to reach the checkpoints we have chosen. In general, we have multiple files, and each file may have multiple competitors. Hence, the formulation becomes

    G(X, θ) = \sum_i |Q_i(X, θ) − C_i| .                                      (4)

Note that this formulation is very flexible in that we can represent the reference and the competitors at different granularity levels. Since we are using a lattice-based approach, each term in equation 4 corresponds to a word arc. As a result, we have multiple terms for reference and competing word arcs, and each word arc has its own checkpoint. It is also important to note that when each term corresponds to a word arc, not every term has equal importance, because the posterior counts differ. To reflect this, one may add a weighting factor to each term or scale the checkpoints. The formulas shown in this paper, however, assume equal importance for simplicity; it is trivial to incorporate this information into the algorithm.

Although the function G remains non-convex, this formulation has an obvious advantage over the original problem: the unboundedness issue no longer exists in this form, since G must be greater than or equal to zero. One easy way to define the checkpoints is to encourage higher likelihood for the reference word arcs and lower likelihood for the competing word arcs. This scheme is equivalent to MMI estimation.
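To make the checkpointed objective of equation (4) concrete, the short sketch below sums the absolute deviations |Q_i − C_i| over word arcs, so the objective is bounded below by zero no matter how the arcs are labelled. The function name and the toy arc values are illustrative assumptions.

```python
def checkpointed_objective(Q, C):
    """G(X, theta) = sum_i |Q_i(X, theta) - C_i|   (equation 4)."""
    return sum(abs(Q[i] - C[i]) for i in Q)

# toy example: Q is a negative log likelihood, so a reference arc is pushed
# towards a lower value and a competing arc towards a higher one
Q = {"ref_arc": 120.0, "com_arc": 95.0}   # current auxiliary values Q_i
C = {"ref_arc": 110.0, "com_arc": 105.0}  # chosen checkpoints C_i
print(checkpointed_objective(Q, C))       # 20.0; G can never drop below 0
```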
3.2. Lagrange Relaxation

Assume good checkpoints are given, so that if our model can reach those checkpoints it achieves good performance. To minimize the function G, we may first transform the problem to

    min_{ε,θ}  \sum_i ε_i
    s.t.  ε_i ≥ Q_i(θ) − C_i   ∀i
          ε_i ≥ C_i − Q_i(θ)   ∀i ,

where the ε_i are slack variables and i indexes a word arc. This is equivalent to the original, unconstrained problem in equation 4. We call this the primal problem for the rest of this paper.

For simplicity, we show the formulation for optimizing the mean vectors, and this formulation also includes an optional regularization using a Mahalanobis distance on the means. We would like to emphasize that the method also allows us to train covariances, and that the optional regularization is not required for GBW to work. The primal problem becomes

    min_{ε,μ}  \sum_i ε_i + \sum_j D_j ||μ_j − μ_j^0||_{Σ_j}
    s.t.  ε_i ≥ Q_i(μ) − C_i   ∀i
          ε_i ≥ C_i − Q_i(μ)   ∀i ,                                           (5)

where D_j is a Gaussian-specific constant that controls the importance of the regularization term, and μ_j^0 is the mean vector that we want GBW to back off to; it is assumed to be an ML estimate here.

We can then construct the Lagrangian dual of the primal problem. The Lagrangian is defined as

    L_m(ε, μ, α, β) = \sum_i ε_i − \sum_i α_i (ε_i − Q_i(μ) + C_i)
                      − \sum_i β_i (ε_i − C_i + Q_i(μ))
                      + \sum_j D_j ||μ_j − μ_j^0||_{Σ_j} ,                    (6)

where {α_i} and {β_i} are the Lagrange multipliers for the first and second sets of constraints of the primal problem in equation 5. The Lagrangian dual is then defined as

    L_D(α, β) = inf_{ε,μ} L_m(ε, μ, α, β) .                                   (7)

Now we can differentiate L_m w.r.t. μ and ε:

    ∂L_m/∂ε_i = 1 − α_i − β_i ,                                               (8)

    ∂L_m/∂μ_j = \sum_i (α_i − β_i) ∂Q_i/∂μ_j + D_j ∂/∂μ_j ||μ_j − μ_j^0||_{Σ_j}
              = \sum_i (α_i − β_i) ( −2 \sum_t γ_t^i(j) Σ_j^{-1} (x_t^i − μ_j) )
                + D_j ( 2 Σ_j^{-1} (μ_j − μ_j^0) ) .                          (9)

Setting these to zero implies

    α_i + β_i = 1   ∀i ,                                                      (10)

and

    μ_j = Φ_j(α, β) = \frac{ \sum_i (α_i − β_i) \sum_t γ_t^i(j) x_t^i + D_j μ_j^0 }
                           { \sum_i (α_i − β_i) \sum_t γ_t^i(j) + D_j } ,     (11)

which is the GBW update equation for the mean vectors.

The BW algorithm is a special case of GBW: if we disable the regularization (D = 0), set α to one and β to zero for the reference word arcs, and set α = β = 0.5 for all competitors, we get

    μ_j = \frac{ \sum_{i∈ref} \sum_t γ_t^i(j) x_t^i }{ \sum_{i∈ref} \sum_t γ_t^i(j) } ,   (12)

which is the BW update equation. EBW is also a special case of GBW: if we set α to one and β to zero for all references, and α to zero and β to one for all competitors, the GBW update equation becomes the EBW update equation,

    μ_j = \frac{ \sum_{i∈ref} \sum_t γ_t^i(j) x_t^i − \sum_{i∈com} \sum_t γ_t^i(j) x_t^i + D_j μ_j^0 }
               { \sum_{i∈ref} \sum_t γ_t^i(j) − \sum_{i∈com} \sum_t γ_t^i(j) + D_j } .    (13)
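The sketch below implements the GBW mean update of equation (11) and shows how the α/β settings described above recover BW (equation 12) and EBW (equation 13). The data layout, names, and toy data are assumptions made for illustration.

```python
import numpy as np

def gbw_mean_update(arcs, alpha, beta, mu0, D):
    """GBW mean update for one Gaussian j (equation 11).

    arcs  : list of (gamma, X) pairs, one per word arc i, where gamma is the
            (T_i,) posterior vector gamma_t^i(j) and X is the (T_i, d) data
    alpha : (N,) Lagrange multipliers alpha_i
    beta  : (N,) Lagrange multipliers beta_i
    mu0   : (d,) backoff mean mu_j^0 (an ML estimate); D = 0 disables it
    """
    num, den = D * mu0, D
    for (gamma, X), a, b in zip(arcs, alpha, beta):
        num = num + (a - b) * (gamma @ X)
        den = den + (a - b) * gamma.sum()
    return num / den

# toy data: one reference arc and one competing arc
rng = np.random.default_rng(1)
ref = (rng.random(4), rng.normal(size=(4, 2)))
com = (rng.random(6), rng.normal(size=(6, 2)))
arcs, mu0 = [ref, com], np.zeros(2)

# BW special case (equation 12): D = 0, reference alpha = 1, beta = 0,
# competitor alpha = beta = 0.5 -> plain ML mean of the reference arc
mu_bw = gbw_mean_update(arcs, np.array([1.0, 0.5]), np.array([0.0, 0.5]), mu0, D=0.0)

# EBW special case (equation 13): references alpha = 1, beta = 0,
# competitors alpha = 0, beta = 1
mu_ebw = gbw_mean_update(arcs, np.array([1.0, 0.0]), np.array([0.0, 1.0]), mu0, D=5.0)
print(mu_bw, mu_ebw)
```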




One should note that this result implies that the D-term used in EBW can be regarded as a regularization based on the Mahalanobis distance between the mean vectors of the new model and the ML model, and its meaning is thereby made explicit.

If the optimization is performed on the covariances, the primal problem is modified to

    min_{ε,Σ}  \sum_i ε_i + \sum_j D_j ( tr(A_j^T Σ_j^{-1} A_j) + tr(B_j^T Σ_j^{-1} B_j) + \log|Σ_j| )
    s.t.  ε_i ≥ Q_i(Σ) − C_i   ∀i
          ε_i ≥ C_i − Q_i(Σ)   ∀i ,                                           (14)

where Σ_j^0 = A_j A_j^T and M_j ≡ μ_j^0 μ_j^{0T} = B_j B_j^T. Assuming both A and B exist, we have the Lagrangian L_c,

    L_c(ε, Σ, α, β) = \sum_i ε_i − \sum_i α_i (ε_i − Q_i(Σ) + C_i)
                      − \sum_i β_i (ε_i − C_i + Q_i(Σ))
                      + \sum_j D_j ( tr(A_j^T Σ_j^{-1} A_j) + tr(B_j^T Σ_j^{-1} B_j) + \log|Σ_j| ) .   (15)

We then differentiate L_c w.r.t. the covariance,

    ∂L_c/∂Σ_j = \sum_i (α_i − β_i) \sum_t γ_t^i(j) ( Σ_j^{-1} − Σ_j^{-1} S_t^{ij} Σ_j^{-1} )
                + D_j ( Σ_j^{-1} − Σ_j^{-1} Σ_j^0 Σ_j^{-1} − Σ_j^{-1} M_j Σ_j^{-1} ) ,    (16)

where S_t^{ij} ≡ (x_t^i − μ_j)(x_t^i − μ_j)^T. Setting it to zero, we obtain the GBW update equation for the covariance,

    Σ_j = Ψ_j(α, β) = \frac{ \sum_i (α_i − β_i) \sum_t γ_t^i(j) (x_t^i − μ_j)(x_t^i − μ_j)^T + D_j (Σ_j^0 + M_j) }
                           { \sum_i (α_i − β_i) \sum_t γ_t^i(j) + D_j } ,     (17)

which is also a generalization of BW and EBW. Instead of solving two independent optimization problems, one may use the parameters obtained from the first problem as the solution of the second problem when computing the covariances. This procedure assumes the solutions of the two problems are similar, and we adopt it in our experiments. One should also note that the GBW formulation can easily incorporate I-smoothing [2] as well.
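For completeness, here is a similar sketch of the covariance update in equation (17): the weighted scatter is accumulated around the estimated mean μ_j, and the backoff term contributes D_j(Σ_j^0 + M_j). Names and data layout are again illustrative assumptions.

```python
import numpy as np

def gbw_cov_update(arcs, alpha, beta, mu, mu0, sigma0, D):
    """GBW covariance update for one Gaussian j (equation 17).

    arcs   : list of (gamma, X) pairs per word arc, as in the mean update
    mu     : (d,) mean around which the scatter matrices S_t^{ij} are formed
    mu0    : (d,) backoff mean, so M_j = mu0 mu0^T
    sigma0 : (d, d) backoff covariance Sigma_j^0
    D      : regularization constant D_j
    """
    num = D * (sigma0 + np.outer(mu0, mu0))   # D_j (Sigma_j^0 + M_j)
    den = D
    for (gamma, X), a, b in zip(arcs, alpha, beta):
        diff = X - mu                         # (T_i, d) rows x_t^i - mu_j
        num = num + (a - b) * (diff.T @ (gamma[:, None] * diff))
        den = den + (a - b) * gamma.sum()
    return num / den
```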
GBW is the same as BW and EBW in that it is based on the EM algorithm. However, the M-step of GBW is replaced by solving a dual problem to retrieve the Lagrange multipliers, from which we can use equations 11 and 17 to obtain the HMM parameters. The dual problem is formulated by plugging equations 10, 11 and 17 back into the Lagrangian. Assuming we are optimizing the mean vectors, we have

    max_{α,β}  L_D(α, β) = \sum_i (α_i − β_i) ( Q_i(Φ(α, β)) − C_i )
    s.t.  α_i + β_i = 1  and  α_i, β_i ≥ 0   ∀i .

This dual problem can be solved by gradient ascent. Taking derivatives w.r.t. the Lagrange multipliers gives the gradients. We need the assumption that, at each iteration, the parameters do not move too far; if it holds, we can treat the denominators of equations 11 and 17 as unchanged. Otherwise, the gradient equation would couple all the multipliers in the program, which would become computationally intractable. Finally, we have

    ∂L_D/∂α_i = Q_i − C_i + \sum_j (α_j − β_j) \sum_k \frac{∂Q_j}{∂Φ_k} \frac{∂Φ_k}{∂α_i} ,   (18)

and

    ∂Φ_k/∂α_i = \frac{ \sum_t γ_t^i(k) x_t^i }{ Z_k(α, β) } ,                 (19)

where Z_k(α, β) = \sum_i (α_i − β_i) \sum_t γ_t^i(k) + D_k is treated as a constant whose value we can take from the previous iteration. When α_i is updated, β_i is obtained from the constraint α_i + β_i = 1.
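A rough sketch of how such an M-step might look: a few gradient-ascent steps on the dual variables with the denominators Z_k frozen, followed by a projection that keeps α_i + β_i = 1 and α_i, β_i ≥ 0. The step size and the `dual_grad` callback (which stands in for evaluating equation 18) are hypothetical choices, not details specified by the paper.

```python
import numpy as np

def mstep_gradient_ascent(alpha, dual_grad, step=1e-3, iters=4):
    """Gradient ascent on the dual variables (the GBW M-step).

    alpha     : (N,) initial multipliers; beta is implied by beta_i = 1 - alpha_i
    dual_grad : callable returning the vector of dL_D/d(alpha_i) values,
                i.e. equation (18) evaluated with the denominators Z_k frozen
    step      : ascent step size (a tuning choice, not taken from the paper)
    iters     : number of ascent iterations (the paper reports using four)
    """
    for _ in range(iters):
        alpha = alpha + step * dual_grad(alpha)
        alpha = np.clip(alpha, 0.0, 1.0)   # keeps alpha_i and beta_i = 1 - alpha_i >= 0
    return alpha, 1.0 - alpha
```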
3.3. Convergence condition of EBW and GBW

The technique we use for GBW is known as Lagrange relaxation [5], since it converts a primal problem into a dual problem. In theory, the dual problem is always a convex problem (here, maximizing a concave objective function) [5]. Note that when strong duality does not hold, meaning the optimal value of the dual can only serve as a strict lower bound on the primal objective, there is no guarantee that the solution obtained from the dual is primal optimal. We can only consider this technique a relaxation method.

Consider the case where D → ∞ and this term dominates the objective function: strong duality holds and GBW is guaranteed to converge, although the solution is then simply the backoff model; this behavior is the same as EBW. However, given a problem and a finite D, if the solution of GBW is equivalent to BW or EBW, it can be shown that GBW is guaranteed to converge for this specific problem. One should also note that the D constant in GBW is related to the checkpoints. If the checkpoints are set more aggressively, that is, very high likelihood for reference word arcs and very low likelihood for competing word arcs, GBW is very likely to reduce to EBW (although it is possible to construct artificial cases in which GBW does not reduce to EBW). In such a case, however, the slack of the primal problem becomes larger, and therefore D has to be larger for the regularization to be effective. Hence, although we claim GBW must converge when it reduces to EBW, this case is equivalent to having D → ∞.

4. EXPERIMENTAL SETUP

We evaluated the performance of GBW and EBW on a speaker-independent Farsi LVCSR system with a 33K vocabulary. The Farsi system was trained on more than 110 hours of audio data in the force protection and medical screening domain. The audio data can be roughly divided into two categories: 1.5-way and 2-way data. 1.5-way means basic question and answering, where the sentences tend to be simpler; 2-way data is conversational and may contain more complicated or incomplete sentences. A development test set was selected from the 2-way data, as we are interested in conversational data. This development set consists of around 45 minutes of 2-way data. For the test set, we selected the Farsi offline evaluation set used in the DARPA TransTac 2007 evaluation, which consists of around 2 hours of conversational data. We tuned the algorithms on the development set and tested on the test set at the end.

The MMI objective was chosen for optimization. The checkpoints were selected based on the model used in the E-step, and they were set to be 10% to 40% higher than the log likelihood of the reference word arcs and 10% to 40% lower than that of the competing word arcs. In the M-step, we performed four iterations of gradient ascent to update the dual variables. From the dual variables, we then reestimated the Gaussian parameters. No regularization or smoothing was used for GBW in the first experiment.
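The checkpoint rule above can be sketched in a few lines. Since the paper works with negative log likelihood auxiliary values Q_i, pushing a reference arc towards higher likelihood means lowering its target; the sign handling below is an illustrative reading of the 10% to 40% scheme, not the exact recipe used in the experiments.

```python
def select_checkpoints(Q, is_reference, rel_offset=0.2):
    """Illustrative checkpoint selection for the MMI experiments.

    Q            : dict arc_id -> auxiliary value Q_i from the E-step model
                   (a negative log likelihood, so smaller means more likely)
    is_reference : dict arc_id -> True for reference arcs, False for competitors
    rel_offset   : relative offset; the paper explores roughly 0.1 to 0.4
    """
    # push reference arcs towards higher likelihood (smaller Q_i) and
    # competing arcs towards lower likelihood (larger Q_i)
    return {i: Q[i] * (1.0 - rel_offset) if is_reference[i]
            else Q[i] * (1.0 + rel_offset)
            for i in Q}
```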



The results in Figure 1 show that GBW without regularization and smoothing can improve over the baseline ML system. This shows that GBW is reliable, as it works even without regularization and smoothing. In contrast, EBW does not work when there is no regularization or smoothing; it simply corrupts the model. As the checkpoints are set more aggressively, GBW gives more improvement at earlier iterations but degrades afterwards.

[Fig. 1. Performance of GBW without regularization on dev set.]

Model initialization is important for GBW because of the EM framework. One option is to initialize the dual variables so that they conform to the ML model, that is, BW initialization. Another option is to initialize the dual variables with EBW after the first EBW iteration. Figure 2 shows the performance of GBW and EBW under different settings. Although the figure only shows the first seven iterations, the experiment was run for 16 iterations, and no improvement after the first seven iterations was observed for any algorithm.

[Fig. 2. Performance of BW, EBW and GBW on dev set.]

When GBW is initialized from EBW, GBW outperforms EBW at all iterations. GBW with BW initialization lags behind EBW at the earlier stages of training, since GBW is close to ML at the beginning, but GBW reaches the same performance as EBW in the end. When BW initialization is used (beyond what is shown in the figure), GBW without regularization and smoothing gives more improvement at the early stages than GBW with regularization; however, it overtrains the system very quickly because of this aggressiveness when regularization is not used.

Table 1 summarizes the WER performance of EBW and GBW on the test set. Both EBW and GBW achieve significant improvements over the baseline ML model, and GBW performs slightly better than EBW.

    algo    obj func    dev      test
    BW      ML          50.7%    50.2%
    EBW     MMI         46.7%    46.5%
    GBW     MMI         46.0%    45.8%

Table 1. WER of BW, EBW, and GBW on the dev set and the TransTac 2007 Farsi offline evaluation set.

5. CONCLUSION AND FUTURE WORK

We presented the Generalized Baum-Welch algorithm for discriminative training. We showed that the common BW and EBW algorithms are special cases of GBW. Unlike EBW, GBW uses a checkpointing technique to address the unboundedness issue, and GBW works even without regularization. Preliminary experiments also showed that GBW can improve over the BW and EBW algorithms. More experiments on the checkpoints and the training procedure are needed to understand the behavior of this algorithm.

The formulation of GBW also helps us to understand EBW better. We learn that the regularization and smoothing of EBW can be represented as a distance-based regularization added to the primal objective. Regularization and smoothing are not always necessary for GBW, but these methods improve performance.

6. ACKNOWLEDGMENTS

This work is in part supported by US DARPA under the TransTac (Spoken Language Communication and Translation System for Tactical Use) program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

7. REFERENCES

[1] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, "MMIE Training of Large Vocabulary Recognition Systems," Speech Communication, vol. 22, no. 4, pp. 303–314, 1997.

[2] D. Povey, Discriminative Training for Large Vocabulary Speech Recognition, Ph.D. thesis, Cambridge University Engineering Dept., 2003.

[3] Y. Normandin and S. D. Morgera, "An Improved MMIE Training Algorithm for Speaker-Independent, Small Vocabulary, Continuous Speech Recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991.

[4] F. Sha and L. K. Saul, "Large Margin Hidden Markov Models for Automatic Speech Recognition," Advances in Neural Information Processing Systems, vol. 19, pp. 1249–1256, 2007.

[5] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.



