Computer Science and Engineering, UCSD                                                October 12, 1999
Tail Inequalities                                                                     Author: Bellare


                                     Tail Inequalities



1 The problem
We have a collection $X_1, \ldots, X_n$ of random variables, each ranging between 0 and 1. We let $p_i = E[X_i]$ for $i = 1, \ldots, n$ and we let $X = X_1 + \cdots + X_n$. We let $\mu = E[X]$. Linearity of expectation tells us that $\mu = p_1 + \cdots + p_n$. We fix some parameter $A \ge 0$ and are interested in the probability that $X - \mu \ge A$, namely that $X$ exceeds its expectation by some amount $A$.

A particular form in which we wish to study this probability is the following. We let $A = x\mu$ for some $x \ge 0$. Then $\Pr[X - \mu \ge A] = \Pr[X \ge (1+x)\mu]$. We are interested in how this behaves as a function of $x$, with all other quantities being fixed. Mostly we want good upper bounds.
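
To make the quantity concrete, here is a minimal Monte Carlo sketch (the parameter values $n$, $p$, $x$ and the Bernoulli model are illustrative assumptions, not part of the setup above) that estimates $\Pr[X \ge (1+x)\mu]$ by sampling:

```python
import random

# Minimal Monte Carlo sketch (illustrative parameters assumed): estimate
# Pr[X >= (1+x)*mu] for X = X_1 + ... + X_n with independent Bernoulli(p)
# variables X_i, so that mu = n*p.
n, p, x, trials = 100, 0.5, 0.2, 20_000
mu = n * p
threshold = (1 + x) * mu
hits = sum(
    1
    for _ in range(trials)
    if sum(1 for _ in range(n) if random.random() < p) >= threshold
)
print(f"estimated Pr[X >= (1+x)*mu] ~ {hits / trials:.4f}")
```

The tail inequalities below give guarantees on this probability without any sampling.
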
This situation arises extremely often. Typically, something is known about the amount of independence of the random variables $X_1, \ldots, X_n$. The simplest case is that they are actually independent. Another case common in computer science is that they satisfy some limited form of independence, for example pairwise independence or, more generally, $t$-wise independence where $t \ge 2$ is some integer. When $t = n$ we have full independence. Alternatively, they may satisfy some form of almost independence. Tail inequalities deal with these situations.
In mathematics courses on introductory probability theory, these problems are typically treated via the laws of large numbers and the central limit theorem. These provide a qualitative understanding of how the probabilities in question behave as a function of $n$. Tail inequalities are the quantitative analogue.
We will begin with some background and then go to the most common case, the one where the
random variables are fully independent. Then we address limited independence.

2 Basic inequalities
The most basic inequality is Markov's.
Proposition 1 (Markov's Inequality) For any non-negative random variable $X$ and any real number $a > 0$ we have
$$
\Pr[X \ge a] \le \frac{E[X]}{a} .
$$
As an example let $a = 2 \cdot E[X]$. Then the above says $\Pr[X \ge 2 \cdot E[X]] \le 1/2$. Namely, if you move out to twice the expectation, you can have only half the area under the curve to your right. This is quite intuitive.



Proof of Proposition 1: This is a simple computation:
$$
\begin{aligned}
a \cdot \Pr[X \ge a] &= a \cdot \sum_{x \,:\, x \ge a} \Pr[X = x] \\
 &\le \sum_{x \,:\, x \ge a} x \cdot \Pr[X = x] \\
 &\le \sum_{x} x \cdot \Pr[X = x] \\
 &= E[X] .
\end{aligned}
$$
Where in this proof did we use that $X \ge 0$? In the third line above: extending the sum to all $x$ can only increase it, because non-negativity of $X$ makes every term $x \cdot \Pr[X = x]$ non-negative.
Markov's inequality is rather weak. The curious thing is that nonetheless it is, in the end, the root of the most powerful tail inequalities around. You just have to find the right way to use it.
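
To see how loose Markov's inequality can be on its own, here is a small sketch (the Binomial example is an assumption made for illustration) comparing the bound $E[X]/a$ with the exact tail of a non-negative random variable:

```python
from math import comb

# Sketch (example distribution assumed): compare Markov's bound E[X]/a with
# the exact tail Pr[X >= a] for X ~ Binomial(n=10, p=0.5), which is non-negative.
n, p = 10, 0.5
EX = n * p
a = 2 * EX  # the "twice the expectation" example from the text
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(int(a), n + 1))
print("Markov bound E[X]/a:", EX / a)   # 0.5
print("exact tail         :", exact)    # about 0.001, far below the bound
```
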
One step up from Markov's inequality is Chebyshev's inequality. To state it we first recall that if $X$ is a random variable then its variance is $\mathrm{Var}[X] = E[(X - \mu)^2] = E[X^2] - \mu^2$, where $\mu = E[X]$ is the expectation of $X$.
Proposition 2 (Chebyshev's inequality) Let $X$ be a random variable, and let $A > 0$. Then
$$
\Pr[|X - \mu| \ge A] \le \frac{\mathrm{Var}[X]}{A^2} .
$$

Proof of Proposition 2: Let $\mu = E[X]$. Let $Y$ be the random variable defined by $Y = (X - \mu)^2 = X^2 - 2\mu X + \mu^2$. Then
$$
E[Y] = E[X^2] - 2\mu \cdot E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2 = \mathrm{Var}[X] .
$$
Since $Y \ge 0$ we can apply Proposition 1 to it. We have
$$
\Pr[|X - \mu| \ge A] = \Pr[Y \ge A^2] \le \frac{E[Y]}{A^2} = \frac{\mathrm{Var}[X]}{A^2}
$$
as desired.
This will come in useful for tail inequalities on pairwise independent random variables.
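
As a quick illustration of Proposition 2 (the Binomial example and the value of $A$ are assumptions made for illustration), one can compare Chebyshev's bound with the exact two-sided tail:

```python
from math import comb

# Sketch (example values assumed): Chebyshev's bound Var[X]/A^2 versus the
# exact two-sided tail Pr[|X - mu| >= A] for X ~ Binomial(n=10, p=0.5).
n, p, A = 10, 0.5, 3
mu, var = n * p, n * p * (1 - p)
pmf = lambda k: comb(n, k) * p**k * (1 - p) ** (n - k)
exact = sum(pmf(k) for k in range(n + 1) if abs(k - mu) >= A)
print("Chebyshev bound Var[X]/A^2:", var / A**2)  # about 0.278
print("exact tail                :", exact)       # about 0.109
```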

3 Tail inequalities for independent random variables
We have independent random variables $X_1, \ldots, X_n$ ranging between 0 and 1. For simplicity let's assume they are actually boolean, meaning they take only the values 0 and 1. This turns out to be the worst case for the situations we deal with. In that case, $p_i \stackrel{\mathrm{def}}{=} E[X_i] = \Pr[X_i = 1]$ for $i = 1, \ldots, n$. As usual set $X = X_1 + \cdots + X_n$ and $\mu = E[X]$. We are interested in upper bounding $\Pr[X - \mu \ge A]$ where $A \ge 0$ is some given real number.
The law of large numbers says that if we fix $A, p$ then
$$
\lim_{n \to \infty} \Pr\!\left[ \frac{1}{n} \sum_{i=1}^{n} (X_i - p_i) \ge A \right] = 0 .
$$
That is, the probability that $X$ deviates from its expectation gets smaller and smaller as the number of samples $n$ grows. In computer science we want more precise information: our interest is in how this probability tails off as a function of $n$.
Nomenclature in this area is not uniform, but the bounds we will now discuss sometimes go under the name of Chernoff-type bounds. A good reference is [1].

The first bound we specify is the simplest, yet good enough in many of the applications.

Proposition 3 Let $X_1, \ldots, X_n$ be independent, 0/1-valued random variables, and let $p_i = E[X_i]$ for $i = 1, \ldots, n$. Let $X = X_1 + \cdots + X_n$ and let $\mu = E[X]$. Let $A \ge 0$ be a real number. Then
$$
\Pr[X - \mu \ge A] \le e^{-A^2/(2n)} .
$$

We won't prove this because below we will prove something stronger. But let's discuss it. To get an understanding of the bound we consider the case where $p_1 = p_2 = \cdots = p_n$.

Corollary 4 Let $X_1, \ldots, X_n$ be independent, 0/1-valued random variables, all having the same expectation $p$. Let $X = X_1 + \cdots + X_n$ and let $\mu = E[X]$. Let $x \ge 0$ be a real number. Then
$$
\Pr[X \ge (1+x)\mu] \le e^{-x^2 p^2 n / 2} = e^{-x^2 p \mu / 2} .
$$

Proof: Set $A = x\mu$ in Proposition 3 and use the fact that $\mu = pn$.

This shows us the punch-line: roughly, $\Pr[X \ge (1+x)\mu]$ decreases exponentially with $n$ for fixed $x, p$. In other words, the probability that the sum of independent random variables deviates significantly from its mean (expectation) drops very quickly as the number of random variables grows. For example, if you toss many fair coins, the probability of getting significantly more than 50% heads is very small.
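
For the coin-tossing example, a small sketch (the choice of 60% heads and the values of $n$ are illustrative assumptions) shows how quickly the Corollary 4 bound $e^{-x^2 p^2 n/2}$ shrinks as $n$ grows:

```python
import math

# Sketch (coin-toss numbers assumed): the Corollary 4 bound
# Pr[X >= (1+x)*mu] <= exp(-x^2 * p^2 * n / 2) for n fair coins,
# asking for at least 60% heads (p = 0.5, x = 0.2).
p, x = 0.5, 0.2
for n in (100, 1000, 10000):
    bound = math.exp(-(x**2) * (p**2) * n / 2)
    print(f"n = {n:6d}: bound = {bound:.3e}")
# The bound drops exponentially with n, matching the punch-line above.
```
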
However you have to be careful with the above bound. The intuition that $\Pr[X \ge (1+x)\mu]$ decreases exponentially with $n$ is quite sensitive to the values of $x, p$, and there are many common situations in which the bounds obtained from the above are not good enough. One way to see why is to consider the second line in the bound of Corollary 4, which we got just by substituting $\mu$ for $np$ in the first line. It shows us that if $p$ is small, the bound is worse even for a fixed value of the expectation $\mu$ of the sum $X$. This is actually a weakness in the bound, not necessarily a reflection of reality.
Here now is a stronger Chernoff-type bound. This is pretty much tight, meaning as good as you can get. It may at first be hard to interpret, but we'll elucidate it later.


Theorem 5 Let $X_1, \ldots, X_n$ be independent, 0/1-valued random variables, and let $p_i = E[X_i]$ for $i = 1, \ldots, n$. Let $X = X_1 + \cdots + X_n$ and let $\mu = E[X]$. Let $\beta \ge 1$ be a real number. Then
$$
\Pr[X \ge \beta\mu] \le e^{-g(\beta)\mu}
$$
where we define the function $g$ by $g(\beta) = \beta\ln\beta + 1 - \beta$.

To visualize this it is again useful to set $\beta = 1 + x$ for $x \ge 0$ and see what happens to the bound viewed as a function of $x$.
Corollary 6 Let $X_1, \ldots, X_n$ be independent, 0/1-valued random variables, and let $p_i = E[X_i]$ for $i = 1, \ldots, n$. Let $X = X_1 + \cdots + X_n$ and let $\mu = E[X]$. Let $0 \le x \le 2$ be a real number. Then
$$
\Pr[X \ge (1+x)\mu] \le e^{-3x^2\mu/10} .
$$

Notice the improvement over Corollary 4: the factor of $p$ in the exponent has vanished. That is quite a change; the new bound is much better.
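
The difference is easy to see numerically. The following sketch (with illustrative values of $\mu$, $x$ and $p$ assumed) evaluates both bounds for a fixed expectation $\mu$ as $p$ shrinks:

```python
import math

# Sketch (illustrative numbers assumed): for a fixed mean mu and shrinking p,
# the Corollary 4 bound exp(-x^2*p*mu/2) weakens, while the Corollary 6
# bound exp(-3*x^2*mu/10) does not depend on p at all.
mu, x = 20.0, 1.0
for p in (0.5, 0.1, 0.01):
    weak = math.exp(-(x**2) * p * mu / 2)     # Corollary 4, written in terms of mu
    strong = math.exp(-3 * (x**2) * mu / 10)  # Corollary 6
    print(f"p = {p:4.2f}: Corollary 4 bound = {weak:.3e}, Corollary 6 bound = {strong:.3e}")
```
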
Be careful when you apply this bound to note that it only holds for $x \le 2$. If $x$ is larger, our intuition is that the probability in question should be even lower, yet the bound above does not apply. If you want a bound that works for large $x$ you will need to go back to the proofs of Theorem 5 and Corollary 6 and try to extend them. It would be nice to do this and get a clean bound for larger $x$, actually. We might explore these issues later, but right now I want to look more at the proofs. Let's begin by seeing why Corollary 6 follows from Theorem 5.
Proof of Corollary 6: Theorem 5 tells us that
$$
\Pr[X \ge (1+x)\mu] \le e^{-g(1+x)\mu}
$$
where $g$ is the function defined in the statement of Theorem 5. So it suffices to show that $g(1+x) \ge 3x^2/10$ for $0 \le x \le 2$. We do this using Taylor series approximations. We have
$$
\begin{aligned}
g(1+x) &= (1+x)\ln(1+x) + 1 - (1+x) \\
 &= (1+x)\ln(1+x) - x \\
 &= -x + (1+x) \sum_{i \ge 1} (-1)^{i-1} \frac{x^i}{i} \\
 &= -x + \sum_{i \ge 1} (-1)^{i-1} \frac{x^i}{i} + \sum_{i \ge 2} (-1)^{i} \frac{x^i}{i-1} \\
 &= \sum_{i \ge 2} \left[ (-1)^{i-1} \frac{x^i}{i} + (-1)^{i} \frac{x^i}{i-1} \right] \\
 &= \sum_{i \ge 2} (-1)^{i} \frac{x^i}{i(i-1)} \\
 &= \frac{x^2}{2} - \sum_{i \ge 3} (-1)^{i-1} \frac{x^i}{i(i-1)} \\
 &\ge \frac{x^2}{2} - x^3 \left( \frac{1}{6} - \frac{x}{12} + \frac{x^2}{20} \right) .
\end{aligned}
$$


We now want to upper bound the expression in parentheses in the last line above. We consider the function $f(x) = 1/6 - x/12 + x^2/20$. It attains its minimum at $x = 5/6$, so for $0 \le x \le 2$ the maximum value of $f$ is attained at $x = 2$, where $f(2) = 1/5$. Thus the above is
$$
\ge \frac{x^2}{2} - x^2 \cdot f(2) = x^2 \left( \frac{1}{2} - \frac{1}{5} \right) = \frac{3x^2}{10} .
$$
That concludes the proof.
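
The key inequality $g(1+x) \ge 3x^2/10$ on $[0, 2]$ is also easy to check numerically; here is a small sketch (a sanity check on a grid, not a proof):

```python
import math

# Sketch: a numerical sanity check (not a proof) that g(1+x) >= 3*x^2/10
# on 0 < x <= 2, where g(b) = b*ln(b) + 1 - b as defined in Theorem 5.
g = lambda b: b * math.log(b) + 1 - b
worst = min(g(1 + x) - 3 * x**2 / 10 for x in (k / 1000 for k in range(1, 2001)))
print("minimum of g(1+x) - 3x^2/10 on the grid:", worst)  # stays non-negative
```
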
Now we come to the interesting part, namely the proof of Theorem 5. It introduces the idea of
exponential generating functions. It is quite neat, illustrating many simple but powerful techniques.

Proof of Theorem 5: We introduce a parameter $\lambda > 0$ whose value will be set later. Recall that $\mu = p_1 + \cdots + p_n = E[X]$. We use the monotonicity of the exponential function and then apply Markov's inequality to get
$$
\Pr[X \ge \beta\mu] = \Pr\!\left[ e^{\lambda X} \ge e^{\lambda\beta\mu} \right] \le \frac{E\!\left[ e^{\lambda X} \right]}{e^{\lambda\beta\mu}} . \qquad (1)
$$
This is the exponential generating function trick. We do something that looks really trivial. We
note that the probability is unchanged if we exponentiate the terms involved, and then we use, of
all things, the weakest of the inequalities around, namely Markov's inequality. Yet as we will now
see, rather strong bounds emerge.
The next thing we do is bound the expectation in Equation (1). We start with the following:
$$
E\!\left[ e^{\lambda X} \right] = E\!\left[ e^{\lambda(X_1 + X_2 + \cdots + X_n)} \right] = E\!\left[ e^{\lambda X_1} \cdot e^{\lambda X_2} \cdots e^{\lambda X_n} \right] .
$$
The independence of $X_1, \ldots, X_n$ (this is the one and only place we use it) implies that the above equals
$$
E\!\left[ e^{\lambda X_1} \right] \cdot E\!\left[ e^{\lambda X_2} \right] \cdots E\!\left[ e^{\lambda X_n} \right] .
$$
Now we compute these individual expectations: for any $i = 1, \ldots, n$ we have
$$
E\!\left[ e^{\lambda X_i} \right] = 1 \cdot \Pr[X_i = 0] + e^{\lambda} \cdot \Pr[X_i = 1] = (1 - p_i) + e^{\lambda} p_i = 1 + (e^{\lambda} - 1) p_i .
$$
So at this point we have
$$
E\!\left[ e^{\lambda X} \right] = \left( 1 + (e^{\lambda} - 1) p_1 \right) \cdot \left( 1 + (e^{\lambda} - 1) p_2 \right) \cdots \left( 1 + (e^{\lambda} - 1) p_n \right) .
$$


Notice that so far we have done no bounding; we have equalities.
At this point, what can we do? We are looking at a complex bound, a product of many terms. We will start seeing terms involving products of the values $p_1, \ldots, p_n$, which is not something we know much about. What we do know something about is the sum of $p_1, \ldots, p_n$, because this is exactly $\mu$, the expectation of $X$. We'd like to work this in. This is done by applying a very common and useful little inequality, namely that $1 + y \le e^y$ for any real number $y$. Set $y_i = (e^{\lambda} - 1) p_i$ and we get
$$
E\!\left[ e^{\lambda X} \right] = (1 + y_1)(1 + y_2) \cdots (1 + y_n) \le e^{y_1} \cdot e^{y_2} \cdots e^{y_n} = e^{y_1 + \cdots + y_n} = e^{(e^{\lambda} - 1)(p_1 + \cdots + p_n)} = e^{(e^{\lambda} - 1)\mu} .
$$



Let's now put this back together with Equation (1). That gives us
$$
\Pr[X \ge \beta\mu] \le \frac{e^{(e^{\lambda} - 1)\mu}}{e^{\lambda\beta\mu}} = e^{(e^{\lambda} - 1)\mu - \lambda\beta\mu} = e^{-f(\lambda)\mu}
$$
where
$$
f(\lambda) = \lambda\beta - (e^{\lambda} - 1) .
$$
Now we want to analyze the function $f(\lambda)$ and choose the value of $\lambda > 0$ that makes $f$ as large as possible. Since all of the above is true for any value of $\lambda > 0$, we can plug in this special value and that will be our bound. To analyze $f(\lambda)$ we use high-school level calculus. We compute the derivative: $f'(\lambda) = \beta - e^{\lambda}$. The function $f'$ is positive for $\lambda < \ln\beta$, zero at $\lambda = \ln\beta$, and then negative for $\lambda > \ln\beta$. This tells us that $f$ is increasing for $0 < \lambda < \ln\beta$ and decreasing for $\lambda > \ln\beta$. So the maximum is at $\lambda = \ln\beta$. Now we note that
$$
f(\ln\beta) = \beta\ln\beta - \beta + 1 .
$$
This is exactly what we called $g(\beta)$, so the proof is complete.
The technique of this proof is the one used in all proofs of Chernoff-type bounds, with minor variations. It is useful to know it so that you can derive your own bounds if necessary.
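
For instance, the recipe can be replayed numerically: pick a $\beta$ and $\mu$, minimize the intermediate bound $e^{(e^{\lambda}-1)\mu - \lambda\beta\mu}$ over $\lambda$ by brute force, and compare with the closed form $e^{-g(\beta)\mu}$ obtained at $\lambda = \ln\beta$. A minimal sketch, with illustrative values of $\mu$ and $\beta$ assumed:

```python
import math

# Sketch (illustrative mu and beta assumed): minimize the lambda-parameterized
# bound exp((e^l - 1)*mu - l*beta*mu) over l > 0 by a crude grid search and
# compare with the closed form exp(-g(beta)*mu) attained at l = ln(beta).
mu, beta = 10.0, 1.5
g = lambda b: b * math.log(b) + 1 - b
bound = lambda l: math.exp((math.exp(l) - 1) * mu - l * beta * mu)
numeric = min(bound(k / 10000) for k in range(1, 20000))
closed = math.exp(-g(beta) * mu)
print("grid-search minimum   :", numeric)
print("closed form e^(-g*mu) :", closed)  # the two agree up to the grid resolution
```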

4 Tail inequalities for pairwise independent random variables
Pairwise independence means that we cannot infer anything extra about $X_i$ given $X_j$ where $j \ne i$, even though we might be able to infer something, or even everything, about $X_i$ if we were given both $X_j$ and $X_k$ where $i, j, k$ are distinct.


Definition 7 We say that $X_1, \ldots, X_n$ are pairwise independent random variables if for every $1 \le i < j \le n$ and every $a, b \in \mathbf{R}$ we have
$$
\Pr[X_i = a \text{ and } X_j = b] = \Pr[X_i = a] \cdot \Pr[X_j = b] .
$$
The tail inequality for such random variables makes use of the fact that the variance of a sum of
pairwise independent random variables behaves exactly like the variance of a sum of independent
random variables: it is the sum of the individual variances.
Lemma 8 Let $X_1, \ldots, X_n$ be pairwise independent random variables. Then
$$
\mathrm{Var}[X_1 + \cdots + X_n] = \mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n] .
$$




Proof of Lemma 8: Use the formula for the variance and the linearity of expectation to get
$$
\begin{aligned}
\mathrm{Var}[X_1 + \cdots + X_n]
 &= E\!\left[ (X_1 + \cdots + X_n)^2 \right] - \left( E[X_1 + \cdots + X_n] \right)^2 \\
 &= E\!\left[ (X_1 + \cdots + X_n)(X_1 + \cdots + X_n) \right] - \left( E[X_1] + \cdots + E[X_n] \right)^2 \\
 &= E\!\Big[ \sum_{i,j} X_i X_j \Big] - \sum_{i,j} E[X_i] \cdot E[X_j] \\
 &= \sum_{i,j} E[X_i X_j] - \sum_{i,j} E[X_i] \cdot E[X_j] \\
 &= \sum_{i} E[X_i^2] + \sum_{i \ne j} E[X_i X_j] - \sum_{i} E[X_i]^2 - \sum_{i \ne j} E[X_i] \cdot E[X_j] \\
 &= \sum_{i} \left( E[X_i^2] - E[X_i]^2 \right) + \sum_{i \ne j} \left( E[X_i X_j] - E[X_i] \cdot E[X_j] \right) \\
 &= \sum_{i} \mathrm{Var}[X_i] + \sum_{i \ne j} \left( E[X_i X_j] - E[X_i] \cdot E[X_j] \right) .
\end{aligned}
$$
The pairwise independence means that $E[X_i X_j] = E[X_i] \cdot E[X_j]$ whenever $i \ne j$. Thus the second sum above is zero, and we are done.
We can now obtain the tail inequality by applying Chebyshev's inequality.
Lemma 9 Let $X_1, \ldots, X_n$ be pairwise independent random variables, let $X = X_1 + \cdots + X_n$, let $A > 0$ be a real number, and let $\mu = E[X]$. Then
$$
\Pr[|X - \mu| \ge A] \le \frac{\mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n]}{A^2} .
$$




Proof of Lemma 9: Proposition 2 tells us that
$$
\Pr[|X - \mu| \ge A] \le \frac{\mathrm{Var}[X]}{A^2} .
$$
Now apply Lemma 8.
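
As a concrete illustration of this section (the subset-XOR construction below is one standard way to build pairwise independent bits, assumed here for the example), here is a minimal sketch that generates such variables and evaluates the Lemma 9 bound:

```python
import random
from itertools import combinations

# Sketch (one standard construction, assumed for illustration): from k
# independent fair bits, the XORs over the nonempty subsets give n = 2^k - 1
# pairwise independent 0/1 variables X_i, each uniform with Var[X_i] = 1/4.
# Lemma 9 then bounds the deviation of X = X_1 + ... + X_n by (n/4)/A^2.
k, A = 4, 3
seeds = [random.randint(0, 1) for _ in range(k)]
subsets = [S for r in range(1, k + 1) for S in combinations(range(k), r)]
sample = [sum(seeds[j] for j in S) % 2 for S in subsets]  # one draw of (X_1, ..., X_n)
n = len(subsets)   # n = 2**k - 1 = 15
mu = n / 2
print("one sample of X:", sum(sample), " (mu =", mu, ")")
print("Lemma 9 bound on Pr[|X - mu| >= A]:", (n / 4) / A**2)
```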

References

[1] R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.