GENERALIZED BAUM-WELCH ALGORITHM FOR DISCRIMINATIVE TRAINING ON
LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION SYSTEM

Roger Hsiao, Yik-Cheung Tam and Tanja Schultz

InterACT, Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
{wrhsiao, yct, tanja}@cs.cmu.edu


ABSTRACT

We propose a new optimization algorithm called the Generalized Baum-Welch (GBW) algorithm for discriminative training of hidden Markov models (HMMs). GBW is based on Lagrange relaxation of a transformed optimization problem. We show that both the Baum-Welch (BW) algorithm for ML estimation of HMM parameters and the popular extended Baum-Welch (EBW) algorithm for discriminative training are special cases of GBW. We compare the performance of GBW and EBW on Farsi large vocabulary continuous speech recognition (LVCSR).

Index Terms— Speech recognition, discriminative training.

1. INTRODUCTION

Discriminative training is an important technique for improving recognition accuracy in large vocabulary continuous speech recognition (LVCSR) [1][2]. Common discriminative training algorithms in speech recognition employ maximum mutual information (MMI) estimation [1] and minimum phone error (MPE) [2]. While MMI and MPE have different objective functions, both use the extended Baum-Welch (EBW) algorithm [3] for optimization. In recent years, large-margin based approaches such as [4] have gained popularity and shown promising results. However, on large-scale systems, lattice-based approaches like MMI/MPE using the EBW algorithm remain the most popular methods.

One of the major challenges of discriminative training is optimization. The objective functions used in discriminative training, such as MMI/MPE, can be unbounded. This is also the reason why EBW may corrupt the acoustic model if it is not properly tuned and smoothed. In this paper, we propose a new optimization algorithm called Generalized Baum-Welch (GBW). GBW is based on Lagrange relaxation [5], and the optimization operates in a dual space. We show that GBW does not suffer from the unboundedness issue and does not corrupt the model due to improper tuning. More importantly, we show that both the Baum-Welch (BW) algorithm for maximum likelihood (ML) estimation and EBW for MMI/MPE estimation are special cases of GBW. The formulation of GBW also gives new insight into the EBW formulation, which is naturally understood within the GBW framework.

This paper is organized as follows: in Section 2, we review the EBW algorithm and the MMI objective. In Section 3, we formulate the GBW algorithm as a generalization of the BW and EBW algorithms. In Section 4, we report experimental results for EBW and GBW. We conclude our work and discuss future work in Section 5.

2. EXTENDED BAUM-WELCH ALGORITHM

The objective function for discriminative training, in its simplest form, involves the difference between two log likelihood functions. Consider the simplest case in which we have only one reference and one competitor; then

    F(X, θ) = Q_r(X, θ) − Q_c(X, θ) ,                                         (1)

where

    Q = \sum_t \sum_j γ_t(j) [ \log|Σ_j| + (x_t − μ_j)^T Σ_j^{-1} (x_t − μ_j) ]

is an auxiliary function representing the negative log likelihood. Here x is the observation and γ_t(j) is the posterior probability of x being at Gaussian j at time t. The function F represents the difference between the negative log likelihood of the reference, Q_r, and that of the competitor, Q_c, on the observation X = {x_1, ..., x_T}; θ is the model parameter set, including the mean vectors (μ), covariances (Σ) and mixture weights of the HMM. Minimizing F is the same as maximizing the mutual information, so this form of discriminative training is also known as MMI estimation. MPE is based on the same principle but has a more sophisticated objective function.

The function F is non-convex, so optimization of F may only reach a local optimum. A bigger problem with F is unboundedness. For example, if a Gaussian appears only as a competitor, optimizing the parameters of this Gaussian becomes a minimum likelihood problem, which is unbounded. In general, if the denominator occupancy of a Gaussian is higher than its numerator occupancy, the solution is unbounded. In sum, optimization of F is not trivial.

The idea of EBW is to add an additional auxiliary function to F to enforce convexity. That auxiliary function is required to have zero gradient at the current parameter set [2]. Details of EBW are available in [3]; the reestimation formula for the mean μ_j is

    μ_j = \frac{ \sum_t γ_t^r(j) x_t − \sum_t γ_t^c(j) x_t + D_j μ_j^0 }
               { \sum_t γ_t^r(j) − \sum_t γ_t^c(j) + D_j } ,                  (2)

where r and c indicate whether the term belongs to the reference or the competitor, and D_j is a constant chosen to guarantee that the estimate is valid (e.g., the covariance must stay positive definite). Compared to the Baum-Welch algorithm, which provides the ML estimate for an HMM, the EBW algorithm also takes the competitors into account. [3] shows that the EBW algorithm converges when D → ∞, where D is directly proportional to the number of discrete distributions used to represent a Gaussian in continuous space. This is why EBW needs D → ∞ to guarantee convergence.

In practice, we cannot choose D → ∞, so EBW is not guaranteed to converge. In addition, the EBW reestimation formula in equation 2 often leads to overtraining. Hence, smoothing techniques such as I-smoothing [2] have been proposed. While EBW has proven successful in practice, it often involves careful tuning.
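As a concrete illustration of the update in equation (2), here is a minimal NumPy sketch that reestimates one Gaussian mean from reference and competitor posteriors. The array names and the toy data are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def ebw_mean_update(X, gamma_ref, gamma_com, mu_old, D):
    """EBW reestimation of one Gaussian mean (equation 2).

    X         : (T, d) observation vectors x_t
    gamma_ref : (T,) reference (numerator) posteriors
    gamma_com : (T,) competitor (denominator) posteriors
    mu_old    : (d,) current mean mu_j^0
    D         : constant D_j, chosen large enough to keep the model valid
    """
    num = gamma_ref @ X - gamma_com @ X + D * mu_old
    den = gamma_ref.sum() - gamma_com.sum() + D
    return num / den

# toy usage: 5 frames of 3-dimensional features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
mu_new = ebw_mean_update(X, rng.random(5), rng.random(5), np.zeros(3), D=10.0)
print(mu_new)
```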




3. GENERALIZED BAUM-WELCH ALGORITHM

We introduce the Generalized Baum-Welch (GBW) algorithm in this section. The GBW algorithm applies Lagrangian relaxation to a transformed optimization problem, and we optimize the parameters of the dual problem, which is itself a relaxed problem. GBW does not suffer from the unboundedness issue, and we can show that both BW and EBW are special cases of GBW.

3.1. Checkpointing

As mentioned, optimizing the function F can be unbounded with respect to some parameters. However, we can address the unboundedness issue by adding checkpoints to the problem,

    G(X, θ) = |Q_r(X, θ) − C_r| + |Q_c(X, θ) − C_c| ,                         (3)

where C_r and C_c are the checkpoints that we want Q_r and Q_c to achieve, respectively. For this particular example, we choose the checkpoints such that Q_r(X, θ) > C_r and Q_c(X, θ) < C_c. As a result, by minimizing the function G we are maximizing the log likelihood difference between the reference and the competitor, but we only ask them to reach the checkpoints we have chosen. In general, we have multiple files, and each file may have multiple competitors. Hence, the formulation becomes

    G(X, θ) = \sum_i |Q_i(X, θ) − C_i| .                                      (4)

Note that this formulation is very flexible in that we can represent the reference and the competitors at different granularity levels. Since we are using a lattice-based approach, each term in equation 4 corresponds to a word arc. As a result, we have multiple terms for reference and competing word arcs, and each word arc has its own checkpoint. It is also important to note that when each term corresponds to a word arc, not every term has equal importance, because the posterior counts differ. To reflect this, one may add a weighting factor to each term or scale the checkpoints. The formulas shown in this paper, however, assume equal importance for simplicity; it is trivial to incorporate this information into the algorithm.

Although the function G remains non-convex, this formulation has an obvious advantage over the original problem: the unboundedness issue no longer exists in this form, since G must be greater than or equal to zero. One easy way to define the checkpoints is to encourage higher likelihood for the reference word arcs and lower likelihood for the competing word arcs. This scheme is equivalent to MMI estimation.
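To make the checkpointed objective of equation (4) concrete, the short sketch below sums the absolute deviations |Q_i − C_i| over word arcs, so the objective is bounded below by zero no matter how the arcs are labelled. The function name and the toy arc values are illustrative assumptions.

```python
def checkpointed_objective(Q, C):
    """G(X, theta) = sum_i |Q_i(X, theta) - C_i|   (equation 4)."""
    return sum(abs(Q[i] - C[i]) for i in Q)

# toy example: Q is a negative log likelihood, so a reference arc is pushed
# towards a lower value and a competing arc towards a higher one
Q = {"ref_arc": 120.0, "com_arc": 95.0}   # current auxiliary values Q_i
C = {"ref_arc": 110.0, "com_arc": 105.0}  # chosen checkpoints C_i
print(checkpointed_objective(Q, C))       # 20.0; G can never drop below 0
```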
3.2. Lagrange Relaxation

Assume good checkpoints are given, so that if our model can reach those checkpoints it achieves good performance. To minimize the function G, we may first transform the problem to

    min_{ε,θ}  \sum_i ε_i
    s.t.  ε_i ≥ Q_i(θ) − C_i   ∀i
          ε_i ≥ C_i − Q_i(θ)   ∀i ,

where the ε_i are slack variables and i indexes a word arc. This is equivalent to the original, unconstrained problem in equation 4. We call this the primal problem for the rest of this paper.

For simplicity, we show the formulation for optimizing the mean vectors, and this formulation also includes an optional regularization using a Mahalanobis distance on the means. We would like to emphasize that the method also allows us to train covariances, and that the optional regularization is not required for GBW to work. The primal problem becomes

    min_{ε,μ}  \sum_i ε_i + \sum_j D_j ||μ_j − μ_j^0||_{Σ_j}
    s.t.  ε_i ≥ Q_i(μ) − C_i   ∀i
          ε_i ≥ C_i − Q_i(μ)   ∀i ,                                           (5)

where D_j is a Gaussian-specific constant that controls the importance of the regularization term, and μ_j^0 is the mean vector that we want GBW to back off to; it is assumed to be an ML estimate here.

We can then construct the Lagrangian dual of the primal problem. The Lagrangian is defined as

    L_m(ε, μ, α, β) = \sum_i ε_i − \sum_i α_i (ε_i − Q_i(μ) + C_i)
                      − \sum_i β_i (ε_i − C_i + Q_i(μ))
                      + \sum_j D_j ||μ_j − μ_j^0||_{Σ_j} ,                    (6)

where {α_i} and {β_i} are the Lagrange multipliers for the first and second sets of constraints of the primal problem in equation 5. The Lagrangian dual is then defined as

    L_D(α, β) = inf_{ε,μ} L_m(ε, μ, α, β) .                                   (7)

Now we can differentiate L_m w.r.t. μ and ε:

    ∂L_m/∂ε_i = 1 − α_i − β_i ,                                               (8)

    ∂L_m/∂μ_j = \sum_i (α_i − β_i) ∂Q_i/∂μ_j + D_j ∂/∂μ_j ||μ_j − μ_j^0||_{Σ_j}
              = \sum_i (α_i − β_i) ( −2 \sum_t γ_t^i(j) Σ_j^{-1} (x_t^i − μ_j) )
                + D_j ( 2 Σ_j^{-1} (μ_j − μ_j^0) ) .                          (9)

Setting these to zero implies

    α_i + β_i = 1   ∀i ,                                                      (10)

and

    μ_j = Φ_j(α, β) = \frac{ \sum_i (α_i − β_i) \sum_t γ_t^i(j) x_t^i + D_j μ_j^0 }
                           { \sum_i (α_i − β_i) \sum_t γ_t^i(j) + D_j } ,     (11)

which is the GBW update equation for the mean vectors.

The BW algorithm is a special case of GBW: if we disable the regularization (D = 0), set α to one and β to zero for the reference word arcs, and set α = β = 0.5 for all competitors, we get

    μ_j = \frac{ \sum_{i∈ref} \sum_t γ_t^i(j) x_t^i }{ \sum_{i∈ref} \sum_t γ_t^i(j) } ,   (12)

which is the BW update equation. EBW is also a special case of GBW: if we set α to one and β to zero for all references, and α to zero and β to one for all competitors, the GBW update equation becomes the EBW update equation,

    μ_j = \frac{ \sum_{i∈ref} \sum_t γ_t^i(j) x_t^i − \sum_{i∈com} \sum_t γ_t^i(j) x_t^i + D_j μ_j^0 }
               { \sum_{i∈ref} \sum_t γ_t^i(j) − \sum_{i∈com} \sum_t γ_t^i(j) + D_j } .    (13)
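The sketch below implements the GBW mean update of equation (11) and shows how the α/β settings described above recover BW (equation 12) and EBW (equation 13). The data layout, names, and toy data are assumptions made for illustration.

```python
import numpy as np

def gbw_mean_update(arcs, alpha, beta, mu0, D):
    """GBW mean update for one Gaussian j (equation 11).

    arcs  : list of (gamma, X) pairs, one per word arc i, where gamma is the
            (T_i,) posterior vector gamma_t^i(j) and X is the (T_i, d) data
    alpha : (N,) Lagrange multipliers alpha_i
    beta  : (N,) Lagrange multipliers beta_i
    mu0   : (d,) backoff mean mu_j^0 (an ML estimate); D = 0 disables it
    """
    num, den = D * mu0, D
    for (gamma, X), a, b in zip(arcs, alpha, beta):
        num = num + (a - b) * (gamma @ X)
        den = den + (a - b) * gamma.sum()
    return num / den

# toy data: one reference arc and one competing arc
rng = np.random.default_rng(1)
ref = (rng.random(4), rng.normal(size=(4, 2)))
com = (rng.random(6), rng.normal(size=(6, 2)))
arcs, mu0 = [ref, com], np.zeros(2)

# BW special case (equation 12): D = 0, reference alpha = 1, beta = 0,
# competitor alpha = beta = 0.5 -> plain ML mean of the reference arc
mu_bw = gbw_mean_update(arcs, np.array([1.0, 0.5]), np.array([0.0, 0.5]), mu0, D=0.0)

# EBW special case (equation 13): references alpha = 1, beta = 0,
# competitors alpha = 0, beta = 1
mu_ebw = gbw_mean_update(arcs, np.array([1.0, 0.0]), np.array([0.0, 1.0]), mu0, D=5.0)
print(mu_bw, mu_ebw)
```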




One should note that this result implies that the D-term used in EBW can be regarded as a regularization based on the Mahalanobis distance between the mean vectors of the new model and the ML model, and its meaning is thereby made explicit.

If the optimization is performed on the covariances, the primal problem is modified to

    min_{ε,Σ}  \sum_i ε_i + \sum_j D_j ( tr(A_j^T Σ_j^{-1} A_j) + tr(B_j^T Σ_j^{-1} B_j) + \log|Σ_j| )
    s.t.  ε_i ≥ Q_i(Σ) − C_i   ∀i
          ε_i ≥ C_i − Q_i(Σ)   ∀i ,                                           (14)

where Σ_j^0 = A_j A_j^T and M_j ≡ μ_j^0 μ_j^{0T} = B_j B_j^T. Assuming both A and B exist, we have the Lagrangian L_c,

    L_c(ε, Σ, α, β) = \sum_i ε_i − \sum_i α_i (ε_i − Q_i(Σ) + C_i)
                      − \sum_i β_i (ε_i − C_i + Q_i(Σ))
                      + \sum_j D_j ( tr(A_j^T Σ_j^{-1} A_j) + tr(B_j^T Σ_j^{-1} B_j) + \log|Σ_j| ) .   (15)

We then differentiate L_c w.r.t. the covariance,

    ∂L_c/∂Σ_j = \sum_i (α_i − β_i) \sum_t γ_t^i(j) ( Σ_j^{-1} − Σ_j^{-1} S_t^{ij} Σ_j^{-1} )
                + D_j ( Σ_j^{-1} − Σ_j^{-1} Σ_j^0 Σ_j^{-1} − Σ_j^{-1} M_j Σ_j^{-1} ) ,    (16)

where S_t^{ij} ≡ (x_t^i − μ_j)(x_t^i − μ_j)^T. Setting it to zero, we obtain the GBW update equation for the covariance,

    Σ_j = Ψ_j(α, β) = \frac{ \sum_i (α_i − β_i) \sum_t γ_t^i(j) (x_t^i − μ_j)(x_t^i − μ_j)^T + D_j (Σ_j^0 + M_j) }
                           { \sum_i (α_i − β_i) \sum_t γ_t^i(j) + D_j } ,     (17)

which is also a generalization of BW and EBW. Instead of solving two independent optimization problems, one may use the parameters obtained from the first problem as the solution of the second problem when computing the covariances. This procedure assumes the solutions of the two problems are similar, and we adopt it in our experiments. One should also note that the GBW formulation can easily incorporate I-smoothing [2] as well.
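For completeness, here is a similar sketch of the covariance update in equation (17): the weighted scatter is accumulated around the estimated mean μ_j, and the backoff term contributes D_j(Σ_j^0 + M_j). Names and data layout are again illustrative assumptions.

```python
import numpy as np

def gbw_cov_update(arcs, alpha, beta, mu, mu0, sigma0, D):
    """GBW covariance update for one Gaussian j (equation 17).

    arcs   : list of (gamma, X) pairs per word arc, as in the mean update
    mu     : (d,) mean around which the scatter matrices S_t^{ij} are formed
    mu0    : (d,) backoff mean, so M_j = mu0 mu0^T
    sigma0 : (d, d) backoff covariance Sigma_j^0
    D      : regularization constant D_j
    """
    num = D * (sigma0 + np.outer(mu0, mu0))   # D_j (Sigma_j^0 + M_j)
    den = D
    for (gamma, X), a, b in zip(arcs, alpha, beta):
        diff = X - mu                         # (T_i, d) rows x_t^i - mu_j
        num = num + (a - b) * (diff.T @ (gamma[:, None] * diff))
        den = den + (a - b) * gamma.sum()
    return num / den
```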
GBW is the same as BW and EBW in that it is based on the EM algorithm. However, the M-step of GBW is replaced by solving a dual problem to retrieve the Lagrange multipliers, from which we can use equations 11 and 17 to obtain the HMM parameters. The dual problem is formulated by plugging equations 10, 11 and 17 back into the Lagrangian. Assuming we are optimizing the mean vectors, we have

    max_{α,β}  L_D(α, β) = \sum_i (α_i − β_i) ( Q_i(Φ(α, β)) − C_i )
    s.t.  α_i + β_i = 1  and  α_i, β_i ≥ 0   ∀i .

This dual problem can be solved by gradient ascent. Taking derivatives w.r.t. the Lagrange multipliers gives the gradients. We need the assumption that, at each iteration, the parameters do not move too far; if it holds, we can treat the denominators of equations 11 and 17 as unchanged. Otherwise, the gradient equation would couple all the multipliers in the program, which would become computationally intractable. Finally, we have

    ∂L_D/∂α_i = Q_i − C_i + \sum_j (α_j − β_j) \sum_k \frac{∂Q_j}{∂Φ_k} \frac{∂Φ_k}{∂α_i} ,   (18)

and

    ∂Φ_k/∂α_i = \frac{ \sum_t γ_t^i(k) x_t^i }{ Z_k(α, β) } ,                 (19)

where Z_k(α, β) = \sum_i (α_i − β_i) \sum_t γ_t^i(k) + D_k is treated as a constant whose value we can take from the previous iteration. When α_i is updated, β_i is obtained from the constraint α_i + β_i = 1.
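A rough sketch of how such an M-step might look: a few gradient-ascent steps on the dual variables with the denominators Z_k frozen, followed by a projection that keeps α_i + β_i = 1 and α_i, β_i ≥ 0. The step size and the `dual_grad` callback (which stands in for evaluating equation 18) are hypothetical choices, not details specified by the paper.

```python
import numpy as np

def mstep_gradient_ascent(alpha, dual_grad, step=1e-3, iters=4):
    """Gradient ascent on the dual variables (the GBW M-step).

    alpha     : (N,) initial multipliers; beta is implied by beta_i = 1 - alpha_i
    dual_grad : callable returning the vector of dL_D/d(alpha_i) values,
                i.e. equation (18) evaluated with the denominators Z_k frozen
    step      : ascent step size (a tuning choice, not taken from the paper)
    iters     : number of ascent iterations (the paper reports using four)
    """
    for _ in range(iters):
        alpha = alpha + step * dual_grad(alpha)
        alpha = np.clip(alpha, 0.0, 1.0)   # keeps alpha_i and beta_i = 1 - alpha_i >= 0
    return alpha, 1.0 - alpha
```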
3.3. Convergence condition of EBW and GBW

The technique we use for GBW is known as Lagrange relaxation [5], since it converts a primal problem into a dual problem. In theory, the dual problem is always a convex problem (here, maximizing a concave objective function) [5]. Note that when strong duality does not hold, meaning the optimal value of the dual can only serve as a strict lower bound on the primal objective, there is no guarantee that the solution obtained from the dual is primal optimal. We can only consider this technique a relaxation method.

Consider the case where D → ∞ and this term dominates the objective function: strong duality holds and GBW is guaranteed to converge, although the solution is then simply the backoff model; this behavior is the same as EBW. However, given a problem and a finite D, if the solution of GBW is equivalent to BW or EBW, it can be shown that GBW is guaranteed to converge for this specific problem. One should also note that the D constant in GBW is related to the checkpoints. If the checkpoints are set more aggressively, that is, very high likelihood for reference word arcs and very low likelihood for competing word arcs, GBW is very likely to reduce to EBW (although it is possible to construct artificial cases in which GBW does not reduce to EBW). In such a case, however, the slack of the primal problem becomes larger, and therefore D has to be larger for the regularization to be effective. Hence, although we claim GBW must converge when it reduces to EBW, this case is equivalent to having D → ∞.

4. EXPERIMENTAL SETUP

We evaluated the performance of GBW and EBW on a speaker-independent Farsi LVCSR system with a 33K vocabulary. The Farsi system was trained on more than 110 hours of audio data in the force protection and medical screening domain. The audio data can be roughly divided into two categories: 1.5-way and 2-way data. 1.5-way means basic question and answering, where the sentences tend to be simpler; 2-way data is conversational and may contain more complicated or incomplete sentences. A development test set was selected from the 2-way data, as we are interested in conversational data. This development set consists of around 45 minutes of 2-way data. For the test set, we selected the Farsi offline evaluation set used in the DARPA TransTac 2007 evaluation, which consists of around 2 hours of conversational data. We tuned the algorithms on the development set and tested on the test set at the end.

The MMI objective was chosen for optimization. The checkpoints were selected based on the model used in the E-step, and they were set to be 10% to 40% higher than the log likelihood of the reference word arcs and 10% to 40% lower than that of the competing word arcs. In the M-step, we performed four iterations of gradient ascent to update the dual variables. From the dual variables, we then reestimated the Gaussian parameters. No regularization or smoothing was used for GBW in the first experiment.
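The checkpoint rule above can be sketched in a few lines. Since the paper works with negative log likelihood auxiliary values Q_i, pushing a reference arc towards higher likelihood means lowering its target; the sign handling below is an illustrative reading of the 10% to 40% scheme, not the exact recipe used in the experiments.

```python
def select_checkpoints(Q, is_reference, rel_offset=0.2):
    """Illustrative checkpoint selection for the MMI experiments.

    Q            : dict arc_id -> auxiliary value Q_i from the E-step model
                   (a negative log likelihood, so smaller means more likely)
    is_reference : dict arc_id -> True for reference arcs, False for competitors
    rel_offset   : relative offset; the paper explores roughly 0.1 to 0.4
    """
    # push reference arcs towards higher likelihood (smaller Q_i) and
    # competing arcs towards lower likelihood (larger Q_i)
    return {i: Q[i] * (1.0 - rel_offset) if is_reference[i]
            else Q[i] * (1.0 + rel_offset)
            for i in Q}
```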



The results in Figure 1 show that GBW without regularization and smoothing can improve over the baseline ML system. This shows that GBW is reliable, as it works even without regularization and smoothing. In contrast, EBW does not work when there is no regularization or smoothing; it simply corrupts the model. As the checkpoints are set more aggressively, GBW gives more improvement at earlier iterations but degrades afterwards.

[Fig. 1. Performance of GBW without regularization on dev set.]

Model initialization is important for GBW because of the EM framework. One option is to initialize the dual variables so that they conform to the ML model, that is, BW initialization. Another option is to initialize the dual variables with EBW after the first EBW iteration. Figure 2 shows the performance of GBW and EBW under different settings. Although the figure only shows the first seven iterations, the experiment was run for 16 iterations, and no improvement after the first seven iterations was observed for any algorithm.

[Fig. 2. Performance of BW, EBW and GBW on dev set.]

When GBW is initialized from EBW, GBW outperforms EBW at all iterations. GBW with BW initialization lags behind EBW at the earlier stages of training, since GBW is close to ML at the beginning, but GBW reaches the same performance as EBW in the end. When BW initialization is used (beyond what is shown in the figure), GBW without regularization and smoothing gives more improvement at the early stages than GBW with regularization; however, it overtrains the system very quickly because of this aggressiveness when regularization is not used.

Table 1 summarizes the WER performance of EBW and GBW on the test set. Both EBW and GBW achieve significant improvements over the baseline ML model, and GBW performs slightly better than EBW.

    algo    obj func    dev      test
    BW      ML          50.7%    50.2%
    EBW     MMI         46.7%    46.5%
    GBW     MMI         46.0%    45.8%

Table 1. WER of BW, EBW, and GBW on the dev set and the TransTac 2007 Farsi offline evaluation set.

5. CONCLUSION AND FUTURE WORK

We presented the Generalized Baum-Welch algorithm for discriminative training. We showed that the common BW and EBW algorithms are special cases of GBW. Unlike EBW, GBW uses a checkpointing technique to address the unboundedness issue, and GBW works even without regularization. Preliminary experiments also showed that GBW can improve over the BW and EBW algorithms. More experiments on the checkpoints and the training procedure are needed to understand the behavior of this algorithm.

The formulation of GBW also helps us to understand EBW better. We learn that the regularization and smoothing of EBW can be represented as a distance-based regularization added to the primal objective. Regularization and smoothing are not always necessary for GBW, but these methods improve performance.

6. ACKNOWLEDGMENTS

This work is in part supported by US DARPA under the TransTac (Spoken Language Communication and Translation System for Tactical Use) program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

7. REFERENCES

[1] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, "MMIE Training of Large Vocabulary Recognition Systems," Speech Communication, vol. 22, no. 4, pp. 303–314, 1997.

[2] D. Povey, Discriminative Training for Large Vocabulary Speech Recognition, Ph.D. thesis, Cambridge University Engineering Dept., 2003.

[3] Y. Normandin and S. D. Morgera, "An Improved MMIE Training Algorithm for Speaker-Independent, Small Vocabulary, Continuous Speech Recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1991.

[4] F. Sha and L. K. Saul, "Large Margin Hidden Markov Models for Automatic Speech Recognition," Advances in Neural Information Processing Systems, vol. 19, pp. 1249–1256, 2007.

[5] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.



