mathstat.pdf

1. SELECTED TOPICS IN MATHEMATICAL STATISTICS LAJOS HORVÁTH Abstract. This is the outcome of the online Math 6070 class during the COVID-19 epidemic. 1. Some problems in nonparametric statistics First we consider some examples for various statistical problems. In all cases we should be able to get large sample approximations but getting the critical values might not be simple. Also, the large sample approximations might not work in case of our sample sizes. We assume in this section that Assumption 1.1. X1, X2, . . . , XN are independent and identically distributed random variables with distribution function F. First we consider a simple hypothesis question which has been already discussed. 1.1. Kolmogov–Smirnov and related statistics. We wish to test the null hypothesis that F(t) = F0(t) for all −∞ < t < ∞, where F0(t) is a given distribution function. We assume that F0 is continuous. There are several well known tests for this problem. The first two are due to Kolmogorov and Smirnov: TN,1 = N1/2 sup −∞<t<∞ |FN (t) − F0(t)| , TN,2 = N1/2 sup −∞<t<∞ (FN (t) − F0(t)) , where FN (t) = 1 N N X `=1 I{Xi ≤ t}, −∞ < t < ∞ denotes the empirical distribution function of our sample. The distributions of the statistics TN,1 and TN,2 do not depend on F0 for any sample size as long as (1.1) F0 is continuous. By the probability integral transformation U1 = F(X1), U2 = F(X2), . . . , UN = F(XN ) are independent identically distributed random variables, uniform on [0, 1]. Since TN,1 = N1/2 sup −∞<t<∞

6. 1 N N X `=1 I{F(Xi) ≤ F(t)} − F0(t)

11. = N1/2 sup 0<u<1

16. 1 N N X `=1 I{Ui ≤ u} − F0(F−1 (u))

21. , where F−1 is the generalized inverse of F. (If F is not strictly increasing then F−1 is not uniquely defined but it will satisfy F(F−1(u)) = u. ) We already discussed in the class of the weak convergence of the uniform empirical and quantile processes. Due to the probability integral transformation the following results are immediate consequences: (1.2) TN,1 D → sup 0≤u≤1 |B(u)| 1

22. 2 LAJOS HORVÁTH and (1.3) TN,2 D → sup 0≤u≤1 B(u), where {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. We already discussed the definition of the Brownian bridge when we studied the weak convergence of the process constructed from the uniform order statistics. The Brownian bridge is defined as B(t) = W(t) − tW(1), where W(t) is a Wiener process. Hence B(0) = B(1) = 0 (tied down). It is a Gaussian process, i.e. the finite dimensional distributions are multivariate normal. The parameters of the multivariate normal distribution can be computed from the facts that EB(t) = 0 and EB(t)B(s) = min(t, s) − ts. The Brownian bridge is continous with probability 1. More on the Brownian bridge is on page 153 of DasGupta(2008). Hence (1.2) and (1.3) provide large sample approximations for our test statistics under the null hypothesis. Even we know how good are the approximations in (1.2) and (1.3). There are constants c1 and c2 such that (1.4) sup −∞<x<∞

26. P {TN,1 ≤ x} − P sup 0≤u≤1 |B(u)| ≤ x

30. ≤ c1 log N N1/2 and (1.5) sup −∞x∞

34. P {TN,2 ≤ x} − P sup 0≤u≤1 B(u) ≤ x

38. ≤ c2 log N N1/2 . These results follows immediately from the Komlós, Major and Tusnády approximation (cf. Das- Gupta, 2008, p. 162). There are explicit bounds for c1 and c2 but these are so large that they are useless in practice. It is even more interesting, from a theoretical point of view, that the results in (1.4) and (1.5) are very close to the best possible ones; N−1/2 are lower bounds. Since the limiting distribution functions in (1.4) and (1.5) are known explicitly, we could check how these results work for finite sample sizes. Chapter 9 in Shorack and Wellner (1986) contains formulae and bounds, including exact and asymptotic bounds for TN,1 and TN,2. We need to study the behavior of the test statistics under suitable alternatives. First we look at TN,1. We assume that (1.6) HA : there is t0 such that F(t0) 6= F0(t0). If (1.6) holds, then (1.7) TN,1 P → ∞. Since TN,1 = N1/2 sup −∞t∞ |FN (t) − F(t) + F(t) − F0(t)| , we get the lower bound N1/2 sup −∞t∞ |F(t) − F0(t)| − N1/2 sup −∞t∞ |FN (t) − F(t)| ≤ TN,1. The weak convergence of the empirical process yields N1/2 sup −∞t∞ |FN (t) − F(t)| = OP (1) and (1.6) gives N1/2 sup −∞t∞ |F(t) − F0(t)| → ∞, completing the proof of (1.9). However, we might not be able to reject the null hypothesis under the alternative of (1.6). The statistic TN,2 is consistent under the alternative (1.8) HA : there is t0 such that F(t0) F0(t0).

39. MATHEMATICAL STATISTICS 3 Similarly to the proof of (1.9), one can show that (1.9) TN,2 P → ∞. The other class of statistics are due to Cramér and von Mises. We provide two formulas. If you know how to integrate with respect to a function, please use those. If not, use the formula, where the density f0(t) = F0 0(t) appears. There are two possibilities for us: TN,3 = N Z ∞ ∞ (FN (t) − F0(t))2 dF0(t) = N Z ∞ ∞ (FN (t) − F0(t))2 f0(t)dt and TN,4 = N Z ∞ ∞ (FN (t) − F0(t))2 dt. Similarly to the first two statistics TN,1 and TN,2, the distribution of TN,3 also does not depend on F0, if (1.1) holds. Using the probability integral transformation again, we get that TN,3 also does not depend on F0. However, the distribution of TN,4 does depend on F0. This means that we need to use different Monte Carlo simulations for different F0’s. The weak convergence of the uniform empirical process already used in the justification of (1.2) and (1.3) can be used to show that (1.10) TN,3 D → Z 1 0 B2 (u)du and (1.11) TN,4 D → Z 1 0 B2 (F0(t))dt, where, as before, {B(u), 0 ≤ u ≤ 1} denotes a Brownian bridge. The rate of convergence in (1.10) and (1.11) is much better than in (1.4) and (1.5). Namely, there are c3 and c4 such that (1.12) sup −∞x∞

43. P {TN,3 ≤ x} − P Z 1 0 B2 (u)du ≤ x

47. ≤ c3 N and (1.13) sup −∞x∞

51. P {TN,4 ≤ x} − P Z ∞ −∞ B2 (F0(t))dt) ≤ x

55. ≤ c4 N . The upper bound in (1.12) was obtained by Götze (cf. Shorack and Wellner, 1986, p. 223) and his method can be used to prove (1.13). It is conjectured that these results are optimal, it is impossible to replace 1/N with a sequence which would converge to 0 faster. The theoretical results in (1.12) and (1.13) were observed empirically a long time ago. This is one of the reasons for the popularity of TN,3. There is an interesting connection between U statistics and the Cramér–von Mises statistics. It can be shown that the Cramér–von Mises statistics are essentially U statistics. This claim is supported by a famous expansion of the square integral of the Brownian bridge: (1.14) Z 1 0 B2 (t)dt = ∞ X k=1 1 k2π2 N2 , where Ni, i ≥ 1 are independent standard normal random variable. This is like the limit of the degenerate U statistics. The result in (1.14) is a consequence of the Karhunen–Loéve theorem. They showed that {B(t), 0 ≤ t ≤ 1} D = ( √ 2 ∞ X k=1 Nk 1 kπ sin(kπs) ) .

56. 4 LAJOS HORVÁTH This result looks obvious, in some sense, since B(t) is square integrable, so we should have expansion with respect a basis. Here the interesting part is that the Ni’s are iid standard normal random variables. If a different basis is used, this will not be correct. We use a special basis here, since sin(kπs) are the eigenfunction of the operator K(t, s)f(s)ds. There is am other interesting and useful formula for the Brownian bridge: the integral B(t) = Z t 0 1 − t 1 − s dW(s) 0 ≤ t ≤ 1, also defines a Brownian bridge. However, first we need to define integration with respect to a Wiener process. We have two roads: study Ito integration. The other possibility is much simpler. We just assume that integration by parts defines the formula, so Z t 0 1 − t 1 − s dW(s) = 1 1 − t W(t) − Z t 0 W(s)d 1 1 − s = 1 1 − t W(t) + Z t 0 1 (1 − s)2 W(s)ds. This integral representation of the Brownian bridge is often used in biostatistics. Next we discuss the consistency of the Cramér–von Mises tests. If (1.1) and (1.6), then (1.15) TN,3 P → ∞ and (1.16) TN,4 P → ∞. We write N Z ∞ ∞ (FN (t) − F0(t))2 dF0(t) = N Z ∞ ∞ ([FN (t) − F(t)] + [F(t) − F0(t)])2 dF0(t) = N Z ∞ ∞ (FN (t) − F(t))2 dF0(t) + 2N Z ∞ ∞ [FN (t) − F(t)][F(t) − F0(t)]dF0(t) + N Z ∞ ∞ (F(t) − F0(t))2 dF0(t) and by the weak convergence of the Cramér–von Mises statistic Z ∞ ∞ (N1/2 [FN (t) − F(t)])2 dF0(t) = OP (1). Using the Cauchy–Schwartz inequality we obtain that N

60. Z ∞ ∞ [FN (t) − F(t)][F(t) − F0(t)]dF0(t)

64. ≤ N Z ∞ ∞ [FN (t) − F(t)]2 dF0(t) 1/2 Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 1/2 = N1/2 Z ∞ ∞ [N1/2 (FN (t) − F(t))]2 dF0(t) 1/2 Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 1/2 and Z ∞ ∞ [N1/2 (FN (t) − F(t))]2 dF0(t) = OP (1), N1/2 Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 1/2 = O(N1/2 ).

65. MATHEMATICAL STATISTICS 5 According to our condition Z ∞ ∞ [F(t) − F0(t)]2 dF0(t) 0 and therefore (1.15) is proven. Similar arguments give (1.16). One of the basic advise is that “do not compare apples and oranges”. One of the interpretation is that we should compare variables with the same or essentially the same variances. Since the observations are independent, under H0 we have that var (FN (t) − F0(t)) = 1 N F0(t)(1 − F0(t)), so the variance of the variables used in all statistics so far, depend on t. Darling and Erdős (1956) suggested the following statistic to test the null hypothesis against the alternative in (1.6): (1.17) TN,5 = sup −∞x∞ N1/2|FN (t) − F0(t)| (F0(t)(1 − F0(t)))1/2 . The statistic TN,5 is called self normalized. However, even under the null hypothesis (1.18) lim N→∞ P{TN,5 ≥ C} = 1, for all C , i.e. TN,5 is unbounded in probability. Here is a heuristic argument for (1.17). We observe that TN,5 does not depend on F0. If the weak convergence of the empirical process to a Brownian bridge holds, the distribution of TN,5 should be close to the distribution of sup0t1 |B(t)|/(t(1 − t))1/2. But according to the law of the iterated logarithm, lim sup t→0 |B(t)| t1/2 = ∞ a.s. and therefore P sup 0t1 |B(t)|/(t(1 − t))1/2 = ∞ = 1. We can also use the empirically self normalized (1.19) TN,6 = sup X1,N xXN,N N1/2|FN (t) − F0(t)| (FN (t)(1 − FN (t)))1/2 , where X1,N = min{X1, X2, . . . , XN } and XN,N = max{X1, X2, . . . , XN }. Using the result of Darling and Erdős (1956), it can be shown that under the null hypothesis lim N→∞ P (2 log log N)1/2 TN,5 ≤ x + 2 log log N + 1 2 log log log N − 1 2 log π (1.20) = exp(−2e−x ) for all x. The limit result in (1.20) also holds for TN,6. Here we have the interesting result that even under the null hypothesis TN,5 and TN,6 converge to ∞ in probability, since 1 (2 log log N)1/2 TN,5 P → 1. However, under the alternative (1.6), N−1/2 TN,5 P → c, where c is a positive constant and if t 6= t0, then c ≥ |F(t) − F0(t)| (F0(t)(1 − F0(t))1/2 . This means the under the alternative TN,5 will be much larger than under the null. This observation makes it possible to use bootstrap. The rate of convergence in (1.20) is slow. The limit is

66. 6 LAJOS HORVÁTH an extreme value and, in general, convergence to extreme values can be slow. Also, the norming sequences in (1.20) are chosen for their “simplicity”. They do not have any statistical meaning like the norming with the mean and the variance in the central limit theorem. There is an important observation in this discussion: the test statistic has a limit distribution under the null and converges in probability to ∞ under the alternative. In case of TN,5 and TN,6, we should say that they converge to ∞ much faster. Next we consider testing if our sample belongs to a specific family of distributions. 1.2. Parameter estimated processes. We assume that Assumption 1.1 holds. Now we wish to the null hypothesis F0(t, λ) = ( 0, if t 0 1 − e−t/λ, if t ≥ 0, (1.21) where λ is an unknown parameter. The true value of the parameter is λ0. It is natural to estimate λ from the sample by the maximum likelihood estimator λ̂N = X̄N = 1 N N X i=1 Xi. If the null hypothesis of (1.21) is true, then FN (t) should be close to F0(t, λ̂N ) for all −∞ t ∞, because FN (t) always estimates the true distribution function. Hence we study the difference between FN (t) and F0(t, λ̂N ). We start withe a Taylor expansion with respect the parameter F0(t, λ̂N ) − F0(t, λ0) = g1(t, λ0)(λ̂N − λ0) + 1 2 g2(t, λ∗ )(λ̂N − λ0)2 , where λ∗ is between λ̂N and λ0, We can assume that t 0 since both FN (t) and F(t, λ) are 0 for t ≤ 0. Let g1(t, λ) = ∂F0(t, λ) ∂λ and g2(t, λ) = ∂2F0(t, λ) ∂λ2 . We know from the law of large numbers that (1.22) λ̂N P → λ0, and from the central limit theorem that N1/2(λ̂N − λ0) is asymptotically normal and therefore (1.23) N1/2 |λ̂N − λ0| = OP (1). It is elementary that sup−∞t∞ |g2(t, λ)| is bounded, as a function of λ, in a neighbourhood of λ0. Hence (1.22) yields (1.24) sup 0t∞ |g2(t, λ∗ )| = OP (1). Putting together (1.23) and (1.24) we conclude (1.25) sup 0t∞ |g2(t, λ∗ )|(λ̂N − λ0)2 = OP 1 N . These arguments give the important observation that sup 0t∞

69. N1/2 [FN (t) − F0(t, λ̂N )] − N1/2 [FN (t) − F0(t, λ0) − g1(t, λ0)(λ̂N − λ0)]

72. (1.26) = OP (N−1/2 ).

73. MATHEMATICAL STATISTICS 7 The result is in (1.26) is very important and it can be proven in more generality. The process N1/2(FN (t) − F0(t, λ̂N )) is called parameter estimated empirical process and it is often used to check a null hypothesis when an unknown parameter appears under the null hypothesis. We know some facts already: (1.27) N1/2 (FN (t) − F0(t, λ0) D[0,∞] −→ B(F0(t, λ0)) and the asymptotic normality of N1/2(λ̂N −λ0). However, this is not enough! We need them jointly since both terms appear in (1.26). The key is a formula which you might have learnt in probability. We know that λ0 = EX1 = Z ∞ 0 tf0(t, λ0)dt = Z ∞ 0 (1 − F0(t, λ0))dt. Using integration by parts, Z ∞ 0 tf0(t, λ0)dt = − Z ∞ 0 t(1 − F0(t, λ0))0 dt = −t(1 − F0(t, λ0))

77. ∞ 0 + Z ∞ 0 (1 − F0(t, λ0))dt. Clearly, lim t→0 t(1 − F0(t, λ0)) = 0 and by the existence of the expected value lim t→∞ t(1 − F0(t, λ0)) = 0. Thus we get N1/2 (λ̂N − λ0) = Z ∞ 0 N1/2 (F0(t, λ0) − F̂N (t))dt = − Z ∞ 0 N1/2 (F̂N (t) − F0(t, λ0))dt. Now everything is expressed in terms of the empirical process. The weak convergence of the empirical process in (3.21) yields (1.28) N1/2 (FN (t)−F0(t, λ0)−g1(t, λ0)(λ̂N −λ0)) D[0,∞] −→ B(F(t, λ0))+g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du. It looks obvious that (3.21) implies (1.29) Z ∞ 0 N1/2 (F̂N (t) − F0(t, λ0))dt D → Z ∞ 0 B(F0(u, λ0))du, but it requires a little work. For any C 0, the weak convergence of the empirical process to B(F(·)) implies that N1/2 FN (t) − F0(t, λ0) + g1(t, λ0) Z C 0 N1/2 (F̂N (u) − F0(u, λ0))du (1.30) D[0,∞] −→ B(F(t, λ0)) + g1(t, λ0) Z C 0 B(F0(u, λ0))du. Also, by the Cauchy–Schwartz inequality we have var Z ∞ C B(F0(u, λ0))du = E Z ∞ C B(F0(u, λ0))du 2 (1.31) = E Z ∞ C Z ∞ C B(F0(u, λ0))B(F0(v, λ0))dudv = Z ∞ C Z ∞ C E[B(F0(u, λ0))B(F0(v, λ0))]dudv ≤ Z ∞ C Z ∞ C (E[B(F0(u, λ0))]2 )1/2 (E[B(F0(v, λ0))]2 )1/2 dudv

78. 8 LAJOS HORVÁTH = Z ∞ C Z ∞ C [F0(u, λ0)(1 − F0(u, λ0))]1/2 [F0(v, λ0)(1 − F0(v, λ0))]1/2 dudv = Z ∞ C [F0(u, λ0)(1 − F0(u, λ0))]1/2 du 2 → 0, as C → ∞. The same arguments yield for any N that var Z ∞ C N1/2 (F̂N (u) − F0(u, λ0)) → 0, as C → ∞. (1.32) Now Chebishev’s inequality implies on account of (1.31) and (1.32) that

82. Z ∞ C B(F0(u, λ0))du

86. P → 0 and for all 0 lim C→∞ lim sup N→∞ P

90. Z ∞ C N1/2 (F̂N (u) − F0(u, λ0))

94. = 0. Now the proof of (1.28) is complete. In light of (1.29) we suggest the parameter estimated Kolmogorov–Smirnov statistics: TN,7 = N1/2 sup 0t∞

97. FN (t) − F0(t, λ̂N )

100. and TN,8 = N1/2 sup 0t∞ FN (t) − F0(t, λ̂N ) . Now the limit distributions of TN,7 and TN,8 can be derived easily from (1.29). Namely, (1.33) TN,8 D → sup 0t∞

104. B(F0(t, λ0)) + g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du

108. and (1.34) TN,9 D → sup 0t∞ B(F0(t, λ0)) + g1(t, λ0) Z ∞ 0 B(F0(u, λ0))du . How to use (1.33) and (1.34) ? This is highly not obvious. Please note that the limit depends on the parametric form of F0, but this is not an issue since we know that F0 is exponential. But the dependence on λ0 is more serious since λ0 is unknown. However, a little work shows that TN,8 and TN,9 do not depend on λ0. Note that N1/2 sup 0t∞

111. FN (t) − F0(t, λ̂N )

114. = N1/2 sup 0t∞

117. FN (t) − (1 − e−t/λ̂N )

120. = N1/2 sup 0u∞

123. FN (uλ̂N ) − (1 − e−u )

126. . By definition, FN (uλ̂N ) = 1 N N X i=1 I{Xi/λ̂N ≤ u} and Xi λ̂N = Xi X̄N = Xi/λ0 PN j=1 Xj/λ0 . Hence TN,7 and therefore the limit distribution does not depend on λ0. Now Monte Carlo simulations could be used to get the distribution of the limit in (1.33), since we can assume that λ0 = 1 in

127. MATHEMATICAL STATISTICS 9 the limit. This argument also works for TN,8. Hence TN,7 and TN,8, and therefore their limits, are free of the unknown parameter. Our argument works for scale families. With some modifications, it can be done for location and location and scale families. Let assume that we are in a location family. In this case the underlying distribution is F(t, λ) = F0(t − λ). Hence sup −∞t∞

130. FN (t) − F0(t − λ̂N )

133. = sup −∞t∞

136. [FN (t) − F0(t − λ0)] + [λ0 − λ̂N ])

139. = sup −∞t∞

142. FN (u + λ0) − F0(u + [λ0 − λ̂N ])

145. and FN (u + λ0) = 1 N N X i=1 I{Xi ≤ t + λ0} = 1 N N X i=1 I{Xi − λ0 ≤ t} Since we are in a location family, the distribution of Xi − λ0 does not depend on λ0. We showed that in case of location families, if λ̂N is the maximum likelihood estimator, then the distribution of λ0 − λ̂N does not depend on λ0 (more is true, the value of λ0 − λ̂N does not depend on λ0). The same argument work for the location and scale families. As an example, let assume that F0 is a Gamma distribution with parameters λ (scale parameter) and κ (shape parameter). We assume that κ is known. Since we are in the scale family, the arguments used in the exponential case would work. So far we considered Kolmogorov–Smirnov type processes for parameter estimated processes. In case of scale families (this includes the exponential we discussed at the beginning of this section), N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 dF0(t, λ̂N ) = N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 f0(t, λ̂N )dt do not depend on λ0. An other possibility for parameter free method is the parameter estimated Cramér–von Mises statistic N Z ∞ −∞ (FN (t) − F0(t, λ̂N ))2 dFN (t) ≈ N X i=1 i N − F0(Xi,N , λ̂N ) 2 , where X1,N ≤ X2,N ≤ . . . ≤ XN,N are the order statistics. To establish the consistency of TN,7 is easy.We assume that under the alternative HA : inf λ0 sup 0t∞ |F(t) − F0(t, λ)| 0 and in this case TN,7 P → ∞. We have that TN,8 P → ∞, if HA : inf λ0 sup 0t∞ (F(t) − F0(t, λ)) 0. The asymptotic behaviour of the parameter estimated Cramér–von Mises statistics can be discussed in the same way. The self normalized statistics also can be used to test if the underlying distribution in a parametric form. For example, in case of testing for exponentiality we can use sup 0t∞ N1/2|FN (t) − F0(t, λ̂N )| (F0(t, λ̂N )(1 − F0(t, λ̂N )))1/2

146. 10 LAJOS HORVÁTH and sup X1,N tXN,N N1/2|FN (t) − F0(t, λ̂N )| (FN (t)(1 − FN (t)))1/2 , where X1,N = min(X1, X2, . . . , XN ) and XN,N = max(X1, X2, . . . , XN ). We found the similar pattern for the test statistics as in Section 1.1. The test statistics convergence in distribution to a limit and they converge to ∞ in probability under the alternative. It turns out that the estimation of the parameter does not effect the limit distribution, i.e. (1.20) holds for the parameter estimated statistics as well. Scale family. The underlying density is in the form f(t, λ) = 1 λ f0(t/λ), where λ 0 is a parameter. We use the empirical distribution function of Yi = Xi X̄N . 1 ≤ i ≤ N X̄N = 1 N N X i=1 Xi The distribution of Yi does not depend on λ0, the true value of the parameter under the null hypothesis. Hence the limit of N1/2 (HN (x) − F0(x)) with HN (x) = 1 N N X i=1 I{Yi ≤ x}, does not depend on λ0 but it DOES on f0. We used the notation F0 0 = f0. The sample mean X̄N might not be the maximum likelihood estimator. What is the maximum likelihood estimator? The likelihood function is L(λ) = N Y i=1 1 λ f0(Xi/λ) and the log likelihood is `(λ) = −N log λ + N X i=1 log f0(Xi/λ). We compute the derivative `0 (λ) = − N λ + N X i=1 1 f0(Xi/λ) f0 0(Xi/λ) − Xi λ2 and we need to solve the equation − N X i=1 1 f0(Xi/λ) f0 0(Xi/λ) − Xi λ = N. The equation depends only on Xi/λ. This shows that λ̂N /λ0 does not depend on λ0, where λ̂N is the maximum likelihood estimator. Hence the parameter estimated statistics do not depend on the unknown scale parameter under the null hypothesis. Next we consider a typical two sample problem.

147. MATHEMATICAL STATISTICS 11 1.3. Comparing two samples. In addition to Assumption 1.1 we require Assumption 1.2. Y1, Y2, . . . , YM are independent and identically distributed random variables with distribution function H. It is a very common problem to test H0 : F(t) = H(t) for all t. In addition to FN (t) we define the empirical distribution of the Y sample HM (t) = 1 M M X i=1 Yi. If H0 is true, the difference should be small. Due to independence, under the null hypothesis we have var(FN (t) − HM (t)) = 1 N + 1 M (F(t)(1 − F(t))) = N + M NM (F(t)(1 − F(t))), so our consideration will be based on the two sample version of the empirical process uN,M (t) = NM N + M 1/2 [(FN (t) − HM (t)) − (F(t) − H(t))] . Of course, under the null hypothesis F(t)−H(t) = 0 for all t in the definition of uN,M (t). The weak convergence of uN,M (t) is immediate consequence of the weak convergence of empirical processes: (1.35) uN,M (t) D[−∞,∞] −→ c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(1) (H(t)), where B(1) and B(2) are independent Brownian bridges, and lim N,M→∞ M N + M = c0. We observe that (1.36) {c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t)), −∞ t ∞} D = {B(F(t)), −∞ t ∞}. Since B(1)(F(t)), B(2)(F(t)) are jointly Gaussian, they linear combination will be Gaussian. Hence we need to compute the mean and the covariance of c 1/2 0 B(1)(F(t)) + (1 − c0)1/2B(2)(F(t)). Since EB(1)(F(t)) = 0 EB(2)(F(t)) = 0, we get that E[c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t))] = 0. Using the independence of B(1) and B(2) and EB(1)(t)B(1)(s) = EB(2)(t)B(2)(s) = min(t, s) − ts, E[c 1/2 0 B(1) (F(t)) + (1 − c0)1/2 B(2) (F(t))][c 1/2 0 B(1) (F(s)) + (1 − c0)1/2 B(2) (F(s))] = E[c 1/2 0 B(1) (F(t))][c 1/2 0 B(1) (F(s))] + E[(1 − c0)1/2 B(2) (F(t))][(1 − c0)1/2 B(2) (F(s))] = c0[min(F(t), F(s)) − F(t)F(s)] + (1 − c0)[min(F(t), F(s)) − F(t)F(s) = min(F(t), F(s)) − F(t)F(s), which is exactly the covariance function of B(F(t)). We suggest the following statistics: TN,M,1 = sup −∞t∞ |uN,M (t)| and TN,M,2 = sup −∞t∞ uN,M (t).

148. 12 LAJOS HORVÁTH If H0 holds and F = H is continuous, then (1.37) TN,M,1 D → sup −∞t∞ |B(F(t))| = sup 0≤t≤1 |B(t)| and (1.38) TN,M,2 D → sup −∞t∞ B(F(t)) = sup 0≤t≤1 B(t) The result in (1.37) and (1.38) are immediate consequences of (1.35) and (1.36). If F is continuous, then the distributions of TN,M,1 and TN,M,2 do not depend on F under H0. This observation is an immediate consequence of the probability integral transformation. The statistics TN,M,1 and TN,M,2 are Kolmogorov–Smirnov type statistics. Similarly to the previous discussions we can define Cramér–von Mises type statistics as well. We can define similarly Cramér– von Mises statistics: Z ∞ −∞ u2 N,M dFN (t) ≈ M N + M N X i=1 i N − HM (Xi,N ) 2 , where X1,N ≤ X2,N ≤ . . . , XN,N are the order statistics of the first sample. Or similarly, Z ∞ −∞ u2 N,M dHM (t) ≈ NM N + M M X i=1 i M − FN (Yi,M ) 2 , where Y1,M ≤ Y2,M ≤ . . . , YM,M are the order statistics of the second sample. Using again the weak convergence of uN,M (t), one can prove that under H0 Z ∞ −∞ u2 N,M dFN (t) D → Z 1 0 B2 (u)du and Z ∞ −∞ u2 N,M dHM (t) D → Z 1 0 B2 (u)du. So far the observations were not only independent but also identically distributed even under the alternative hypothesis. The next problem is interesting since we want to test that the assumption of identically distributed data will be tested. The topic of Section 1.4 is very popular in the literature and it is called the change point problem or testing for the stability of the data. 1.4. Change point. We assume that Assumption 1.1 holds but we observe Z1, Z2, . . . , ZN defined by Zi = ( µ0 + Xi, if 1 ≤ k ≤ k∗ µA + Xi, if k ∗ +1 ≤ k ≤ N. (1.39) We call k∗ the time of change and it is unknown. Similarly, the means, µ0 6= µA before and after the change are also unknown. Of course these are the means of the Zi’s, if EXi = 0. Hence we need to modify Assumption 1.1: Assumption 1.3. X1, X2, . . . , XN are independent and identically distributed random variables with EXi = 0 and EX2 i = σ2.

149. MATHEMATICAL STATISTICS 13 The model assumes that the variance is constant. First we even assume that σ is known to find a suitable test statistic. We will discuss how to proceed if σ is unknown. We note that σ is a nuisance parameter so we have no interest in its value. Recently, data examples confirmed that σ might not be constant during the observation period. It might be time dependent so we wish to detect changes in the mean even if the variance is changing as well. We only want to detect a change point if the mean changes regardless what happens to the variance of the observations. Only few results are available now. We assume now that (1.40) σ2 is known. We wish to test the stability of the model, i.e. the mean remains constant during the observation period: H0 : k∗ N against the alternative HA : 1 k∗ N. The null hypothesis postulates the change occurs outside of the observation period so it does not matter for us. Under the alternative the means changes exactly once. Our model is called “at most one change” (AMOC). First we need to find a test statistic. Let assume that k∗ = k is known. In this case this a simple two sample problem. We cut the data into two parts at k and we compute the sample means for each segment with Z̄k = 1 k k X i=1 Zi and Ẑk = 1 N − k N X i=k+1 Zi. If H0 holds, than |Z̄k −Ẑk| is small, the difference between the two empirical means can be explained by the variability in the data. Using Assumption 1.3 we get var Z̄k − Ẑk = σ2 k + σ2 N − k = σ2 N k(N − k) , so we reject the means of Z1, Z2, . . . , Zk and Zk+1, Zk+2, . . . , ZN are the same if Qk = 1 σ k(N − k) N 1/2 |Z̄k − Ẑk| is large. The statistic Qk should be familiar since this is the two sample z–test if the observations are normal! To prove this claim assume that X1, X2, X3, . . . , XN are independent identically distributed normal random variables with EXi = 0, EX2 i = σ2, σ2 is known. We wish to test that the means of Z1, Z2, . . . , Zk and Zk+1, Zk+2, . . . , ZN are the same. Let µ1 be the mean of the first sample, µ2 be the mean of the second sample and µ be the mean under the null hypothesis. The maximum likelihood estimators are µ̂1 = 1 k k X i=1 Zi, µ̂2 = 1 N − k N X i=k+1 Zi and µ̂ = 1 N N X i=1 Zi. Hence the likelihood ratio is k Y i=1 1 √ 2π exp(−(Zi − µ̂1)2 /(2σ2 )) N Y i=k+1 1 √ 2π exp(−(Zi − µ̂2)2 /(2σ2 )) N Y i=1 1 √ 2π exp(−(Zi − µ̂)2 /(2σ2 ))

150. 14 LAJOS HORVÁTH = exp 1 2σ2 N X i=1 (Zi − µ̂)2 − k X i=1 (Zi − µ̂1)2 − N X i=k+1 (Zi − µ̂2)2 !! = exp 1 2σ2 kµ̂2 1 + (N − k) ˆ µ2 2 − Nµ̂2 = exp 1 2σ2 kµ̂2 1 + (N − k) ˆ µ2 2 − N((k/N)µ̂1 + ((N − k)/N)µ̂2)2 = exp 1 2σ2 k(N − k) N (µ̂1 − µ̂2)2 , proving our claim. Since k is unknown we use the rule: (1.41) reject H0, if max 1≤kN 1 σ |Qk| is large. A simple algebra shows that the rule in (1.41) might not work. It is easy to see that under H0 Qk = 1 σ N k(N − k) 1/2

155. k X i=1 Xi − k N N X i=1 Xi

160. . So by the law of the iterated logarithm for partial sums of independent and identically distributed random variables (Das Gupta, 2008, pp. 8) (1.42) max 1≤kN |Qk| P → ∞ as N → ∞. We observe that σ max 1≤kN |Qk| ≥ max 1≤kN/2 N k(N − k) 1/2

165. k X i=1 Xi − k N N X i=1 Xi

170. ≥ max 1≤kN/2 N k(N − k) 1/2

175. k X i=1 Xi

180. − max 1≤kN/2 N k(N − k) 1/2

185. k N N X i=1 Xi

190. ≥ 21/2 max 1≤kN/2 1 k 1/2

195. k X i=1 Xi

200. − N−1/2

205. N X i=1 Xi

210. . The central limit theorem yields N−1/2

215. N X i=1 Xi

220. = OP (1) and the law of the iterated logarithm implies lim sup N→∞ (log log(N/2))−1/2 max 1≤kN/2 1 k 1/2

225. k X i=1 Xi

230. 0 a.s., completing the proof of (1.42). The result in (1.42) was first observed empirically by economists. This caused a stir since the z–test is widely used (without checking the required assumptions) and using the normal table for the critical values of max1≤kN |Qk| caused over rejection, and it was getting worse as N was increasing. Andrews (1993) is the most popular contribution to the applicability of the z–test for the change point problem. He observed that the law of the iterated comes into action for large and small k. He suggested rejecting for large values of max bNαc≤k≤N−bNαc 1 σ |Qk|,

231. MATHEMATICAL STATISTICS 15 where b·c is the integer part and 0 α 1/2 is chosen by the practitioner. Since the change point problem is common in economics (“nothing last forever”), there has been a tremendous interest in the choice of α. The choice of 5% and 10% is recommended. Using the weak convergence of partial sums to a Wiener process can be used to prove that under the null hypothesis (1.43) max bNαc≤k≤N−bNαc 1 σ |Qk| D → sup α≤t≤1−α |B(t)| (t(1 − t))1/2 , where {B(u), 0 ≤ u ≤ 1} is a Brownian bridge. The functional L(f) = supα≤u≤1−α |f(u)| is a continuous functional on the Skorokhod space D[0, 1], so the weak convergence of partial sums gives (1.43). Looking at (1.43) it is clear why (2.20) holds if α = 0. The limit cannot be finite in this case according to the law of iterated logarithm for the Wiener process. Hence Andrews (1993) claimed that no limit result can be established for max1≤kN |Qk|. This claim is strongly believed in econometrics but it was not even true when Andrews (1993) published his famous paper. If we look at again (1.19), we face the same issue. The self–normalization (i.e. taking the maximum of random variables with constant variance) puts too much weight at the beginning and the end of the data. If Darling and Erdős (1956) can be used to get the limit in (1.19), it might work in the present case as well. We will return to this question later. The limit result of (1.43) suggests that we should remove the weight and work with (1.44) TN,11 = max 1≤k≤N 1 σ N−1/2

236. k X i=1 Zi − k N N X i=1 Zi

241. . Now we reached a famous and useful statistic. It is called CUSUM (CUmulative SUMS) in the literature. One of the interesting feature is that it does not depend on the unknown mean under the null hypothesis. The mean is a nuisance parameter and its value does not appear in CUSUM type statistics. We usually refer to the maximally selected z–statistic as the standardized CUSUM. The limit distribution of TN,11 under the null hypothesis is very simple: (1.45) TN,11 D → sup 0≤u≤1 |B(u)|. The result in (1.45) follows from the weak convergence of partial sums. Using (1.45), it is easy to detect changes in the data since the distribution of sup0≤u≤1 |B(u)| is known and tabulated. In TN,11 we can recognise a Kolmogorov–Smirnov type statistics. A possible Cramér–von Mises type statistic for the change point problem is Z 1 0  N−1/2   bNuc X i=1 Zi − bNuc N N X i=1 Zi     2 du and under H0 Z 1 0  N−1/2   bNuc X i=1 Zi − bNuc N N X i=1 Zi     2 du D → Z 1 0 B2 (t)dt. The behaviour of TN,11 is very simple under the exactly one change point alternative. Namely, (1.46) if µ0 6= µA, then TN,11 P → ∞. We note that (1.46) holds in case of several changes in the mean. We can use TN,11 estimate the time of change k∗ from the data. The estimator is the point where the CUSUM achieves it largest value: k̂N = ( k : 1 ≤ k ≤ N :

246. k X i=1 Zi − k N N X i=1 Zi

251. = max 1≤`≤N

256. ` X i=1 Zi − ` N N X i=1 Zi

261. ) .

262. 16 LAJOS HORVÁTH If the change occurs in the middle of the data, i.e. k∗ N = bNθc, with some 0 θ 1, then (1.47) k∗ N N P → θ, i.e. we can consistently approximate θ. So we can do testing in the very unlikely case when σ is known. If we can estimate σ from the sample, we have more realistic procedures. The first candidate is the sample mean: σ̂2 N,1 = 1 N − 1 N X i=1 Zi − Z̄N 2 . We have already established that under the null hypothesis σ̂2 N,1 P → σ2 , so it is asymptotically consistent. But this is not the case under the alternative! We note that (1.48) Z̄N = k∗ N 1 k∗ k∗ X i=1 Zi + N − k∗ N 1 N − k∗ N X i=k∗+1 Zi P → µ̄ = θµ0 + (1 − θ)µA. Elementary algebra gives σ̂2 N = 1 N − 1 k∗ X i=1 Zi − µ0 + µ0 − µ̄ + µ̄ − Z̄N )2 + 1 N − 1 N X i=k∗+1 Zi − µA + µA − µ̄ + µ̄ − Z̄N 2 and 1 N − 1 k∗ X i=1 Zi − µ0 + µ0 − µ̄ + µ̄ − Z̄N )2 = k∗ N − 1 1 k∗ k∗ X i=1 (Zi − µ0)2 + k∗ N − 1 (µ0 − µ̄)2 + k∗ N − 1 (Z̄N − µ̄)2 + 2k∗(µ0 − µ̄) N − 1 1 k∗ k∗ X i=1 (Zi − µ0) + 2k∗(µ̄ − X̄N ) N − 1 1 k∗ k∗ X i=1 (Zi − µ0) + 2k∗ N − 1 (µ0 − µ̄)(µ̄ − X̄N ). Using now the law of large numbers we obtain that 1 k∗ k∗ X i=1 (Zi − µ0)2 P → σ2 , k∗ X i=1 (Zi − µ0) P → 0 and by (1.48) X̄N − µ̄ P → 0. Thus we conclude (1.49) 1 N − 1 k∗ X i=1 Zi − µ0 + µ0 − µ̄ + µ̄ − Z̄N )2 P → θσ2 + θ(µ̄ − µ0)2 . Similar arguments give (1.50) 1 N − 1 N X i=k∗+1 Zi − µA + µA − µ̄ + µ̄ − Z̄N 2 P → (1 − θ)σ2 + (1 − θ)(µ̄ − µA)2 .

263. MATHEMATICAL STATISTICS 17 Putting together (1.49) and (1.50) we get that (1.51) σ̂2 N,1 P → σ2 + θ(µ0 − µ̄)2 + (1 − θ)(µ̄ − µA)2 , so we are overestimating σ2. This is the penalty that the possible change in the mean is not taken into account. Thus we could try σ̂2 N,2 = 1 N − 1   k̂N X i=1 Zi − Z̄k̂N 2 + N X i=k̂N +1 Zi − Ẑk̂N 2   . On account of k̂N ≈ k∗, it looks obvious that σ̂2 N,2 → σ2 in probability. But now the null hypothesis is the problem since there is no k∗ under the null hypothesis! Using the weak convergence of the CUSUM process, it can be shown that k̂N /N converges in distribution. It requires lengthy calculations to show that (1.52) σ̂2 N,2 P → σ2 under the null and also under the alternative. Now we can define two statistics which does not require the knowledge of σ2: (1.53) TN,12 = max 1≤k≤N 1 σ̂N,1 N−1/2

268. k X i=1 Zi − k N N X i=1 Zi

273. and (1.54) TN,13 = max 1≤k≤N 1 σ̂N,2 N−1/2

278. k X i=1 Zi − k N N X i=1 Zi

283. . We note that under the no change null hypothesis (1.55) TN,12 D → sup 0≤u≤1 |B(u)|, and (1.56) TN,13 D → sup 0≤u≤1 |B(u)|. Under the alternative TN,12 P → ∞ and TN,13 P → ∞. The suggested tests are very similar and it is not immediate to see difference between them. None of them are perfect: in TN,12 we are overestimating the variance, so we reduce the power while in TN,13 an additional estimation is used which might impact the behaviour in case of small and moderate sample sizes. This is a typical situation in statistics. We have a choice but which one is better is not obvious. 1.5. Total Time on Test. The total time on test (TTT) is a popular concept in engineering. For example, the best test for exponentiality is based on TTT. TTT is defined for positive variables, so this will be assumed in this part. One of the ingredients for TTT is the function z(x) = 1 1 − F(x) Z x 0 (1 − F(u))du. The estimator of z(x) is simple, we just replace F with FN resulting in ẑN (x) = 1 1 − FN (x) Z x 0 (1 − FN (u))du.

284. 18 LAJOS HORVÁTH The weak convergence of N1/2(FN (x)−F(x)) to B(F(x)) (B is a Brownian bridge yields if F(T) 1, then (1.57) N1/2 (ẑN (x) − z(x)) D[0,T] −→ Γ(x), where Γ(x) = B(F(x)) (1 − F(x))2 Z x 0 (1 − F(u))du − 1 1 − F(x) Z x 0 B(F(u))du. The proof of (1.57) can be derived from the weak convergence of the empirical process with the help of some algebra. By (1.57) we have that (1.58) TN,15 D → sup 0≤x≤T |Γ(x)|, where TN,15 = sup 0≤x≤T N1/2 |ẑN (x) − z(x)|. Getting the distribution of the limit in (1.58) is hopeless since it depends on the unknown F. We will demonstrate that bootstrap works. Interestingly, (1.58) mainly used to construct confidence bands for z(x), 0 ≤ x ≤ T.

285. MATHEMATICAL STATISTICS 19 2. Several versions of resampling In Section 1 we discussed several common hypothesis testing problems in statistics and possible approaches how to tackle them. The procedures based on a single sample had the following form: we defined a test statistic TN and established the following properties: we reject for large values of TN , (2.1) lim N→∞ P{TN ≤ x} = D(x) under the null hypothesis, where D denotes the limiting distribution function and (2.2) TN P → ∞ under the alternative. Based on the original sample X1, X2, . . . , XN we want to create an other sample, called bootstrap sample, X∗ 1 , X∗ 2 , . . . , X∗ L which should resemble the original observation. We consider the original sample X1, X2, . . . , XN as fixed values, i.e. we condition with respect to them. Due to this condi- tioning we use PX to denote P{ · |X}, where X = (X1, X2, . . . , XN ). From the bootstrap sample we compute our test statistic T (1) L , as TN was computed from the original sample. Please note that we have not said how the bootstrap sample was obtained. We repeat this procedure independently of each other R times resulting in the bootstrap statistics T (1) L , T (2) L , . . . , T (R) L . Next we compute their empirical distribution function (2.3) DN,L,R(x) = 1 R R X i=1 I{T (i) L ≤ x}, −∞ x ∞. If we can show that under the null hypothesis (2.4) lim min(N,L)→∞ sup −∞x∞

288. PX{T (1) L ≤ x} − D(x)

291. = 0 a.s., since in this case the law of large numbers implies that (2.5) sup −∞x∞ |DN,L,R(x) − D(x)| → 0 if min(N, L, R) → ∞. Equation (2.5) means that for almost all realization of the original sample X, DN,L(x) converges to D. So you must be extremely unlucky if the bootstrap is not working for you! We require from the bootstrap statistic that it is bounded in probability under the alternative: (2.6) |T (1) L | = OPX (1). The construction of the critical values will explain why (2.6) is crucial for the bootstrap to work. Let 0 α 1 and define the bootstrap critical value cN,L,R = cN,L(α) by DN,L,R(cN,L,R) = 1 − α. (There is a minor technical issue, since DN,L(x) is a jump function so it might not take the value 1 − α. In this case the smallest number for which DN,L(x) is the first time larger than 1 − α. This works, for example, D(x) is a continuous.) Using (2.5) we obtain that P{TN cN,L,R} → α, as min(N, L, R) → ∞ under the null hypothesis. The requirement in (2.6) implies that |cN,L,R(α)| = OP X(1), as min(N, L, R) → ∞ even under the alternative and therefore on account of (2.2) we get under the alternative that P{TN cN,L,R} → 1, as min(N, L, R) → ∞. This means that the rejection rate under the null hypothesis is asymptotically α and we reject the alternative with probability going to 1. The statistics in Section 1 can be bootstrapped. Bootstrap is simple, you need to run the same program several times and you will get the critical value very

292. 20 LAJOS HORVÁTH easily. This sounds nice but, of course, questions will arise. How to choose L? How to choose R? We discuss these questions later. The theory supporting our discussion works well if the limit is derived from Gaussian processes. But this might not be the case for Poisson processes and extreme values. Usually the bootstrap is better (it provides better critical values) and this is proven in several cases, like bootstrapping the mean. In Section 1 we tried to discuss problems where, in some cases, only bootstrap can provide critical values. The bootstrap can be used to construct confidence intervals and confidence bands as well. 2.1. Nonparametric bootstrap. This is probably the most popular method for resampling. However, the permutation method (selection without replacement) is older but it has more limited use. It is a modification of Fisher’s exact test. Our sample is X = (X1, X2, . . . , XN ). As in the introduction of this lecture, we assume that bX is given so we consider as constant, i.e. we condition with respect to X. We assume that (2.7) F is a continuous distribution function. If (2.7) holds, then there is no tie among the Xi’s with probability 1. Now we select from X with replacement, resulting in X∗ 1 , X∗ 2 , . . . , X∗ L. Due to the construction, (2.8) X∗ 1 , X∗ 2 , . . . , X∗ L are independent and identically distributed random variables. The computation of the common distribution function is very simple. Since there is no tie among the Xi’s, due to the random selection, PX{X∗ 1 = Xj} = 1 N , 1 ≤ j ≤ N, so the number of Xi’s which are less than x gives the conditional probability that X∗ 1 less or equal than x. This means that the common distribution function of the bootstrap sample is FN (x) = 1 N N X i=1 I{Xi ≤ x}, i.e. the empirical distribution of the original sample. It is important to note that FN (x) is a jump function even if (2.8) holds. This cause problems when the definition of TN , the statistic we want to bootstrap, assumes that (2.7) holds. For example, statistics based on densities will have this problem. However, as we discussed earlier, FN is an excellent estimate for F. For example, (2.9) sup −∞x∞ |FN (x) − F(x)| → 0 a.s. Even we know that the rate of convergence in (2.9) is N−1/2(log log)1/2 according to the law of the iterated logarithm for empirical processes. The computation of the mean and the variance of X∗ 1 is simple since it takes Xi, 1 ≤ i ≤ N, with probability 1/N, so EXX∗ 1 = 1 N N X i=1 Xi = X̄N i.e. the conditional expected value of X∗ 1 is the sample mean of the original sample. We use EX to denote the conditional expected value when we condition with respect to X. Similarly, varX(X∗ 1 ) = 1 N N X i=1 X2 i − X̄2 N .

293. MATHEMATICAL STATISTICS 21 Also, E[EXX∗ 1 ] = µ and E[varX(X∗ 1 )] = (N/(N − 1))σ2, where EX1 = µ and var(X1) = σ2. According to the central theorem (2.10) sup −∞x∞

298. P ( N−1/2 N X i=1 (Xi − µ)/σ ≤ x ) − Φ(x)

303. → 0 as N → ∞. The bootstrap version of this result is (2.11) sup −∞x∞

308. PX ( N−1/2 L X i=1 (X∗ i − X̄N )/σ̄N ≤ x ) − P ( N−1/2 N X i=1 (Xi − µ)/σ ≤ x )

313. → 0 a.s., as min(N, L) → ∞. The theoretical mean and variance in (2.10) are replaced with the conditional mean and variance of the bootstrapped observations. The proof is very simple if we also assume that |X1|3 ∞. According to the Berry–Esseen theorem, there is an absolute constant c such that (2.12) sup −∞x∞

318. PX ( N−1/2 L X i=1 (X∗ i − X̄N )/σ̄N ≤ x ) − Φ(x)

328. PX ( N−1/2 L X i=1 (X∗ i − X̄N )/σ̄N ≤ x ) − Φ(x)

333. → 0 a.s., as L → ∞. Now we get (2.11) from (2.10) and (2.13). The proof of (2.11) is typical to deal with theoretical issues of the bootstrap. It is shown that the test statistic and its bootstrap version have the same limit distribution. Das Gupta (2008) has a lengthy discussion of bootstrapping the mean. He provides theoretical evidence that the rate of convergence in (2.11) is better than in (2.10). This has been confirmed empirically in the literature. Please set up Monte Carlo simulations to provide numerical evidence that the rate of convergence is better in (2.11). Just provide some graphs. Bootstrapping the mean provides theoretical results but not too useful in real life applications due to the enormous amount of results on the sample mean. We illustrate on the Kolmogorov–Smirov statistic why the bootstrap works. We already obtained TN,1 of Problem 1. Now we obtain it bootstrap version from the sample X∗ 1 , X∗ 2 , . . . , X∗ L. Their empirical distribution function is F∗ L(x) = 1 L L X i=1 I{X∗ i ≤ x} and now we can define T∗ L,1 as T∗ L,1 = max −∞x∞ L1/2 |F∗ L(x) − FN (x)|.

334. 22 LAJOS HORVÁTH We provide some heuristic arguments proving (2.14) sup −∞x∞ |

338. PX{T∗ L,1 ≤ x} − P sup 0≤u≤1 |B(u)| ≤ x

342. → 0 a.s., as min(N, L) → ∞, where B is a Brownian bridge. By the weak convergence of the empirical process L1/2 (F∗ L(x) − FN (x)) ≈ B(FN (x)) and by (2.9) and the almost sure continuity of B, B(FN (x)) ≈ B(F(x)). Since F is continuous, sup−∞x∞ |B(F(x))| = sup0≤u≤1 |B(u)|. Hence we have (2.14). Please note that (2.14) holds regardless we have H0 or the alternative. Hence we have (2.4) and (2.5). In case of the Kolmogorov–Smirnov statistic the limit distribution has a known form. Hence you could investigate the question if the bootstrap method provides a better approximation for the distribution of TN,1 than the limit distribution. There are several cases where the limit distribution of the test statistic depends on the underlying distribution of the data. In this case the bootstrap might be the only method to get critical values to test our null hypothesis. Now we return to the test for exponentiality. From the bootstrap sample we estimate λ by λ̂∗ L = 1 L L X i=1 X∗ i , so the bootstrapped parameter could be defined as L1/2(F∗ L(x) − F0(x, λ̂∗ L)). Following the proof in Section 1.2, one can show that sup0≤x∞ L1/2|F∗ L(x) − F0(x, λ̂∗ L)| converges to the limit if TN,8. So it works under the null. Now the bad news. Under the alternative (2.15) sup 0≤x∞ L1/2 |F∗ L(x) − F0(x, λ̂∗ L)| P → ∞. The proof of (2.15) is the same what we did in Section 1.2. Hence with method we will not be able to reject exponantiality even if it is false. The problem is that we should use the distribution function of the bootstrap sample, which is FN (x) which does not contain any place for the parameter λ. We need to do something else which will be done in the next section. Next we consider we consider the two sample problem. If we select from the X sample with replacement and separately from the Y’s with replacement, the procedure will not work. The distribution of the bootstrapped X sample will be around F, while the distribution of the Y sample will be close to H. This means that TN,M,1 and its bootstrapped version, T∗ N∗,M∗,1 behave exactly in the same way. Hence (2.16) PX,Y{T∗ N∗,M∗,1 K} → 1 a.s. for all K. Now we combine the two samples into 1, Z = (X1, X2, . . . , XN , Y1, Y2, . . . , YM )T . We select from Z with replacement, resulting in Z∗ = (Z∗ 1 , Z∗ 2 , . . . , Z∗ L). Due to the random selection with replacement, conditionally on Z, these are independent and identically distributed with PZ{Z∗ 1 ≤ x} = 1 N + M N+M X i=1 I{Zi ≤ x} = N N + M 1 N N X i=1 I{Xi ≤ x} ! + M N + M 1 M M X i=1 I{Zi ≤ x} ! . Let N∗ = bLN/(N + M)c and M∗ = L − N∗, where b·c denotes the integer part. Now X = (Z∗ 1 , Z∗ 2 , . . . , ZN∗ ) and Y∗ = (Z∗ N∗+1, Z∗ N∗+2, . . . , Z∗ L∗ ). Regardless if the original samples satisfy

343. MATHEMATICAL STATISTICS 23 the null or the alternative hypothesis, X∗ and Y∗ have the same distribution (conditionally on Z, the empirical distribution function of Z). Hence under the null as well as under the alternative sup −∞x∞

347. PZ{T∗ N∗,M∗,1 ≤ x} − P{ sup 0≤u≤1 |B(u)| ≤ x}

351. → 0 a.s., where T∗ N∗,M∗,1 is the bootstrap version of TN,M,1 and B is a Brownian bridge. Hence both (2.4) and (2.6) are satisfied. 2.2. Parametric bootstrap. Now we discuss how to get critical values for TN,8. Let X∗ 1 , X∗ 2 , . . . , X∗ L be independent, identically distributed random variables with distribution function F0(x, λ̂N ), i.e. given X, these simulated random variables are independent and PX{X∗ i ≤ x} = F0(x, λ̂N ). We need to estimate the parameter, for the bootstrap this is denoted by λ̂∗ N . Clearly, we need to use λ̂∗ L = 1 L L X i=1 X∗ i . Now the bootstrap version of the parameter estimated empirical process is L1/2 (F∗ L(x) − F0(x, λ̂∗ L)). We note that using integration by parts, (2.17) λ̂N = Z ∞ 0 (1 − F0(u, λ̂N )du and (2.18) λ̂∗ L = Z ∞ 0 (1 − F∗ L(u))du. Using again the mean value theorem we get for almost all realization of bX we have uniformly in x L1/2 (F∗ L(x) − F0(x, λ̂∗ L)) = L1/2 (F∗ L(x) − F0(x, λ̂N ) + F0(x, λ̂N ) − F0(x, λ̂∗ L)) = L1/2 (F∗ L(x) − F0(x, λ̂N )) + g1(x, λ̂N ) Z ∞ 0 L1/2 (F∗ L(u) − F0(u, λ̂N ))du + oX(1), where g1(u, λ) = ∂F0(u, λ) ∂λ . Conditionally on X, L1/2(F∗ L(x) − F0(x, λ̂N )) ≈ B(F0(x, λ̂N )). By the strong law of large numbers λ̂N → µ a.s., where µ = EX1, which is true under the null and the alternative. By the continuity of the Brownian bridge, B(F0(x, λ̂N )) ≈ B(F0(x, µ)). Thus we get for almost all realization of X that L1/2 (F∗ L(x) − F0(x, λ̂∗ L)) D[0,∞] −→ B(F0(x, µ)) + g1(x, µ) Z ∞ 0 B(F0(u, µ)). If H0 holds, then µ = λ0, so TN,7 and sup0≤x∞ L1/2|F∗ L(x) − F0(x, λ̂∗ L)| have the same limit distribution. Under the alternative sup 0≤x∞ L1/2 |F∗ L(x) − F0(x, λ̂∗ L)| = OPX (1). Hence this method provides a correct resampling for TN,7. The more general case can be handled in the same way.

352. 24 LAJOS HORVÁTH 2.3. Resampling without replacement (the permutation method). First we show that the permutations of the original sample can be used for the change point problem. Let Z∗ 1 , Z∗ 2 , . . . , Z∗ N be a random permutation of Z = (Z1, Z2, . . . , ZN ). We note that the permuted variables (selection without replacement) are not independent, but the dependence is week. It is easy to see that P{Z∗ i = Zj} = 1 N , 1 ≤ i, j ≤ N and P{Z∗ i = Zj, Z∗ k = Z`} = 1 N(N − 1) , 1 ≤ i, j, k, ` ≤ N, i 6= k, j 6= `. Hence |P{Z∗ i = Zj, Z∗ k = Z`} − P{Z∗ i = Zj}P{Z∗ k = Z`}| = 1 N2(N − 1) , which is much smaller than P{Z∗ i = Zj}. Also, PZ{Z∗ i ≤ x} = FN (x), 1 ≤ i ≤ N, and FN (x) = 1 N N X i=1 I{Zi ≤ x}. The permuted statistic is T∗ N,11 = max 1≤k≤N 1 σ̄N N−1/2

357. k X i=1 Z∗ i − k N N X i=1 Z∗ i

362. . Using the weak dependence between the permuted variables, one can show that (2.19) sup −∞x∞

366. PZ{T∗ N,11 ≤ x} − P sup 0≤u≤1 |B(u)| ≤ x

370. → 0 a.s., where B is a Brownian bridge. Under the alternative there is a change at k∗ so the distribution of the permuted random variables is a linear combination of two distributions, but the permuted variables will have the same marginal distribution. Namely, for all 1 ≤ j ≤ N, PZ{Z∗ j ≤ x} = FN (x) = 1 N k∗ X i=1 Zi + N X i=k∗+1 Zi ! = k∗ N 1 k∗ k∗ X i=1 Zi ! + N − k∗ N 1 N − k∗ N X i=k∗+1 Zi ! . Hence if there is no change in the means of the observations, then we still have that σ̄N → τ2 and τ2 might be different from σ2. However, this is still enough to claim that T∗ N,11 = OPZ (1), so the permutation method can be used to find critical values for the change point statistics based on partial sum processes. The permutation resampling was considered by Fisher (without calling it resampling). The idea is very simple. We permute Z resulting in Z∗ = (Z∗ 1 , Z∗ 2 , . . . , Z∗ N+M ). These are not independent since we selected without replacement but they have the same distribution, conditionally on Z, PZ{Z∗ i ≤ x} = 1 N + M N+M X `=1 I{Z` ≤ x}. Now X∗ = (Z∗ 1 , Z∗ 2 , . . . , Z∗ N ) and Y∗ = (Z∗ N+1, Z∗ N+2, . . . , Z∗ N+M ). Regardless if the original data satisfy the null hypothesis or the alternative, the marginal distributions, conditionally on Z, are the same. Hence under the null as well as under the alternative sup −∞x∞

374. PZ{T∗ N∗,M∗,1 ≤ x} − P{ sup 0≤u≤1 |B(u)| ≤ x}

378. → 0 a.s.

379. MATHEMATICAL STATISTICS 25 where T∗ N∗,M∗,1 is the bootstrap version of TN,M,1 and B is a Brownian bridge. Hence both (2.4) and (2.6) are satisfied. So far we only had to assume that min(N, L) → ∞. Of course, choosing a much larger L would result in lots of ties (we try to imitate continuous distributions which do not have ties). Since we apply limit theorems, L cannot be small. Usually, L = N is used. However, in case of extremes it might not work. 2.4. Bootstrapping the largest observation. Das Gupta (2008) contains an example when the bootstrap does not work when we sample with replacement. The original sample size is N and we generated a bootstrap sample X∗ 1 , X∗ 2 , . . . , X∗ N . Let XN,N = max(X1, X2, . . . , XN ) be the maximum in the original sample and X∗ N,N = max(X∗ 1 , X∗ 2 , . . . , X∗ N ). The bootstrap works in this case if XN,N = X∗ N,N , but the probability that X∗ N,N XN,N is (1 − 1/N)N → 1/e, N → ∞. This is clear, since during the selection, we cannot pick XN,N , so we need to choose from N − 1 possibilities. This example suggests if we increase the bootstrap sample size, we might hit XN,N . However, this is not the case. Let assume that X1, X2, . . . , XN be independent and identically distributed exponential(1) random variables, i.e. F(t) = ( 0, if t 0 1 − e−t , if t ≥ 0. Let XN,N be the largest order statistics and YN = XN,N − log N. As we did before P{YN ≤ t} = P{XN,N ≤ t + log N} = FN (t + log N) and FN (t + log N) =      0, if t − log N 1 − e−t N N , if t ≥ − log N. Thus we get that for all −∞ t ∞ lim N→∞ P{YN ≤ t} = H(t), where H(t) = e−e−t , −∞ t ∞. We can do the same for the bootstrap sample. Let Y ∗ L = max(X∗ 1 , X∗ 2 , . . . , X∗ L) − log L. As before, PX{Y ∗ L ≤ t} = FL N (t + log L), where, as before, FN (t) is the empirical distribution function of X = (X1, X2, . . . , XN ). According to the law of the iterated logarithm for the empirical process we have (2.20) lim sup N→∞ N 2 log log N 1/2 sup −∞t∞ |FN (t) − F(t)| = 1 a.s. Next we write FL N (t − log L) = (F(t + log L) + FN (t + log L) − F(t + log L))L . If t is fixed and N is so large that t + log L 0 F(t + log L) + FN (t + log L) − F(t + log L) = 1 − e−t L + FN (t + log L) − F(t + log L) = 1 − 1 L e−t (1 + L(FN (t + log L) − F(t + log L))) .

380. 26 LAJOS HORVÁTH If want use the formula 1 − xn n → e−x , if xn → x with xN,L = e−t (1 + L|FN (t + log L) − F(t + log L)|) we need that L|FN (t + log L) − F(t + log L)| → 0 a.s. In light of (2.20) this is satisfied if (2.21) L N log log N 1/2 → 0. According to (2.21), the bootstrap with replacement works if L is not large, essentially L must be less than N1/2. So more is not better in this case. Please note that if N = 100, than we should use a bootstrap sample size less than 10. Doing asymptotic theory with 10 observations is somewhat questionable. The rate of convergence to extreme values could be very slow so the bootstrap might not be better than using the limit results for the original sample. In any case, if (2.21) holds, the lim N→∞ PXP{Y ∗ L ≤ t} = H(t), a.s. So far we bootstrapped the observations directly. Now we consider the case when the observations are not identically distributed so selection with or without replacement will not work. All the bootstrap methods we discussed so far produced identically distributed random variables. 2.5. Residual bootstrap. We illustrate this method by linear models. We assume that (2.22) yi = x i β0 + i, 1 ≤ i ≤ N, where xi = (xi,1, xi,2, . . . , xi,d) ∈ Rd and β0 ∈ Rd. As usual, y1, y2, . . . , yN and x1, x2, . . . , xN are observed. We note that in statistics the xi’s are given numbers, while in econometrics, they modeled as random variables. The errors, 1, 2, . . . , N are unobservable random errors. We assume that 1, 2, . . . , N are independent and identically distributed random variables with (2.23) Ei = 0 and 0 E2 i = σ2 ∞. The parameters of interest are β0 and σ2. We estimate β0 using the least squares β̂N . The residuals defined by ˆ i = yt − x i β̂N , 1 ≤ i ≤ N. We collect some facts from linear models. First we write (2.22) in matrix form. Let YN = (y1, y2, . . . , yN ), EN = (1, 2, . . . , N ), β = (β1, β2, . . . , βd) and XN =         x1,1 x1,2 . . . x1,d x2,1 x2,2 . . . x2,d . . . . . . . . . . . . . . . . . . xN,1 xN,2 . . . xN,d         The matrix form of (2.22) is Y = XN β + EN . The least square estimator β̂ is defined by the minimization problem β̂N = argminβ kYN − XN βk2 ,

381. MATHEMATICAL STATISTICS 27 where k · k is the Euclidian norm (some of the squares of the elements of the matrix). We obtain the solution in explicit form: β̂N = X X −1 X Y, assuming that X N XN is nonsingular. Usually, the properties of β̂N are established assuming the normality of the errors. If normality of the errors is assumed, then one possibility is the parametric bootstrap. However, the bootstrap in this case is not very useful, since in this case the normality of β̂N is proven, so only the estimation of σ is needed for statistical inference. If the errors are not necessarily normal, then β̂N is still asymptotically normal. Namely, (2.24) N1/2 (β̂N − β0) D → Nd(0, σ2 A−1 ), where (2.25) lim N→∞ 1 N X X = A, β0 is the true value of the parameter and Nd denotes a d–dimensional normal random variable. It is easy to interpret the condition in (2.25): (2.26) lim N→∞ 1 N N X `=1 xi,`x`,j = ai,j, A = {ai,j, 1 ≤ i, j ≤ d}. If the xi’s are modeled as random variables, then (2.26) is just the law of large numbers. Hence (2.25) is a very natural assumption in linear models. If we go back the definition of the residuals, then we have ˆ i = i − x i (β̂N − β0). Let zi = (x i , ˆ i), Z = {zi, 1 ≤ i ≤ N}. We choose L-times, independently of each other with replacement from {zi, 1 ≤ i ≤ N}, resulting in {z∗ i , 1 ≤ i ≤ L}. Now we define (2.27) y∗ i = (x∗ i ) β̂N + ˆ ∗ i , 1 ≤ i ≤ L. Using the same notation as before but putting up ∗ we write (2.28) Y∗ = X∗ β̂N + E∗ and therefore β̂ ∗ N = (X∗ ) X∗ −1 (X∗ ) Y∗ . Using (2.27) we get that β̂ ∗ N = β̂N + (X∗ ) X∗ −1 (X∗ ) E∗ . Using again the law of large numbers lim L→∞ 1 L (X∗ ) X∗ = A a.s. We note that EZ (X∗ ) X∗ −1 (X∗ ) E∗ (X∗ ) X∗ −1 (X∗ ) E∗ = (X∗ ) X∗ −1 (X∗ ) E h E∗ (E∗ ) i X∗ (X∗ ) X∗ −1 ≈ L2 A−1 (X∗ ) EZ h E∗ (E∗ ) i X∗ A−1 . Conditionally on Z, ∗’s are independent and identically distributed and therefore EZ h E∗ (E∗ ) i = EZ(∗ 1)2 IN×N

382. 28 LAJOS HORVÁTH where IN×N is the N × N identity matrix. Thus we have L2 A−1 (X∗ ) EZ h E∗ (E∗ ) i X∗ A−1 ≈ L2 EZ(∗ 1)2 A−1 (X∗ ) IN×N X∗ A−1 ≈ LEZ(∗ 1)2 A−1 . It is easy to see that EZ(∗ 1)2 → σ2 a.s. Thus we conjecture that for almost all realization of Z (2.29) L1/2 (β̂ ∗ N − β̂N ) D → Nd(0, σ2 A−1 ). The proofs of (2.24) and (2.29) are essentially the same. We showed that EZβ̂ ∗ N = β̂N EZ h β̂ ∗ N (β̂ ∗ N ) i = 1 L EZ(∗ 1)2 A−1 + oZ 1 L = σ2 L A−1 + oZ 1 L , hence the first (mean) and the second order (variance) properties of β̂ ∗ N and β̂N are practically the same. Of course, this is not the proof of the normality of the estimators but these are necessary results for normality. Since β̂N always converges to the true value of the parameter of the linear model, the limit theorem 2.29 can be used to justify this bootstrap method for hypothesis testing. We only need that min(N, L) → ∞. The resampling of the residuals is also a popular technique in time series analysis. For the sake if simplicity we consider an autoregressive AR(1) sequence. We assume that {i, −∞ t ∞} are independent and identically distributed random variables. The AR(1) sequence is the solution of the recursion (2.30) yi = ρyi−1 + i, −∞ i ∞. If |ρ| 1 and |0|δ ∞ with some δ 0, then (2.30) has a unique solution given by (2.31) yi = X `=0 ρ` i−`, −∞ i ∞. We can estimate ρ with ρ̂N , the least square estimator. It is established in time series that N1/2(ρ̂N − ρ) is asymptotically normal. Observing y1, y2, . . . , yN , we define the residuals as ˆ i = yi − ρ̂N yi−1, 2 ≤ i ≤ N. We select from {ˆ 2, ˆ 3, . . . , ˆ N } with replacement, creating {ˆ ∗ 1, ˆ ∗ 2, . . . , ˆ ∗ L}. If the statistical inference is about the i’s, we are done. If our interest is in y1, y2, . . . , yN , then we define the boostrap sample (2.32) y∗ i = ρy∗ i−1 + ˆ ∗ i , 2 ≤ i ≤ L, with some initial value y∗ 0. However, of i is small the solution of (2.32) is certainly not close to the solution of (2.30) given by the infinite sum in (2.31). Hence we do not use all y∗ i ’s only if i ≥ L0. This we get a bootstrap sample of size L − L0. L0 is the burn in period and the practical advise is that L0 = 25 or 50. Next we discuss how the bootstrap can help to construct confidence bands.

383. MATHEMATICAL STATISTICS 29 2.6. Confidence bands. We recall the TTT function z(x) from Section 1.5. We want to define to random functions, zN,1(x) and zN,2(x) such that lim N→∞ P {zN,1(x) ≤ z(x) ≤ zN,2(x) for all x ∈ [0, a]} = 1 − α. If we choose zN,1(x) = z(x) − cN−1/2 and zN,2(x) = z(x) + cN−1/2 According to our theory, lim N→∞ P {zN,1(x) ≤ z(x) ≤ zN,2(x) for all x ∈ [0, a]} = P{ sup 0≤x≤a |Γ(x)| ≤ c}, so c = c(1 − α) is coming from the equation P sup 0≤x≤a |Γ(x)| ≤ c = 1 − α. However, the computation of the distribution distribution function P{sup0≤x≤a |Γ(x)| ≤ c} is hopeless since it depends on the unknown F. Let X∗ 1 , X∗ 2 , . . . , X∗ L be the bootstrap sample and define the bootstrap version of zN (x) by ẑ∗ L(x) = 1 1 − F∗ L(x) Z x 0 (1 − F∗ L(u))du, where, as before, F∗ L(u) = 1 L L X i=1 I{X∗ i ≤ u}. Using our previous arguments, one can show that PZ{ sup 0≤x≤a L1/2 |ẑ∗ L(x) − ẑN (x) ≤ c} → P sup 0≤x≤a |Γ(x)| ≤ c a.s., hence the bootstrap can be used to estimate c(1−α) from the sample. We obtain ĉN,L,R = ĉN,L,R(1− α) as our estimate, where N is the original sample size, L is the bootstrap sample size and R is the number of the repetations of the bootstrap procedure. We get that if ẑN,1(x) = z(x)−ĉN,L,RN−1/2 and ẑN,2(x) = z(x) + ĉN,L,RN−1/2, then lim min(N,L,R)→∞ PX {zN,1(x) ≤ z(x) ≤ zN,2(x) for all x ∈ [0, a]} = 1 − α a.s. The construction of confidence bands for a regression line is also a popular question. Let a(x) = β0 + β1x, b ≤ x ≤ d. We observe the line at N points giving the observations yi, xi, 1 ≤ i ≤ N yi = a(xi) + i = β0 + β1xi + i. We wish to define aN,1(x) and aN,2(x) from the sample that lim N→∞ P{aN,1(x) ≤ a(x) ≤ aN,2(x) for all x ∈ [b, d]} = 1 − α. We try aN,1(x) = β̂0,N + β̂1,N − cN−1/2 and aN,1(x) = β̂0,N + β̂1,N + cN−1/2 , where β̂0,N and β̂1,N are the least squares estimators for the parameters. Then P {aN,1(x) ≤ a(x) ≤ aN,2(x) for all x ∈ [b, d]} = P ( sup b≤x≤d |N1/2 (β0 − β̂0,N ) + N1/2 (β1 − β̂1,N )x| ≤ c ) . We already know that N1/2 (β0 − β̂0,N ), N1/2 (β1 − β̂1,N ) D → N2,

384. 30 LAJOS HORVÁTH where N2 = (N1, N2) is a bivariate normal random variable. Hence Γ0 (x) = N1 + N2x, b ≤ x ≤ d is a Gaussian process and we need to find c = c(1 − α) such that (2.33) P ( sup b≤x≤d |Γ0(x)| ≤ c ) = 1 − α. We have at least to possibilities to get c in (2.33). The mean of N2 is 0 and it covariance matrix is known explicitly and it is easy to estimate from the sample. Hence we can easily estimate N2 and we could use Monte Carlo simulations. The other possibility is using the bootstrap as it was done for z(x). To reflect the variability of the data even in the bands, one might try using the limit distribution of sup b≤x≤d |N1/2(β0 − β̂0,N ) + N1/2(β1 − β̂1,N )x| (var(N1/2(β0 − β̂0,N ) + N1/2(β1 − β̂1,N )x))1/2 . There was restriction on the choice of L, the bootstrap sample size in case of extreme values. It turns out that there is no restriction on L if we bootstrap a statistic with normal limit (or limit derived from normal random variables and/or processes).

385. MATHEMATICAL STATISTICS 31 3. Density estimation So far we have discussed the estimation of the distribution function and the theory related to it. These are fundamental results and the distribution of several statistics can be derived from this theory. In some cases, looking at the data, we want to guess the distribution of the underlying observations. This is impossible from the distribution function and from its estimate to do this since all distributions look the same. However, the shape of densities is relatively unique and everybody could see the difference between the shapes of exponential and normal densities. So the estimation of densities could provide important tools for statistical analysis. However, they are rarely used in hypothesis testing, for example, and their rate of convergence to the limit can be extremely slow. A large part of the statistical literature shows efforts how to avoid the estimation of densities. The main problem is that the density, as a derivative, is a limit. In real life, we have trouble to estimate limits. We discuss several versions of density estimation. 3.1. Kernel density estimator. Let X1, X2, . . . , XN be independent and identically distributed random variables with distribution function F. The density function f is defined by f(t) = F0 (t). The kernel density estimate was introduced by Rosenblatt (1956) and Parzen (1962) and it is defined by ˆ fN (t) = 1 NhN N X i=1 K t − Xi hN , (3.1) where hN is the bandwidth (analogous to length of bins in a histogram) and K(·) is the kernel. One natural requirement is that ˆ fN (t) is a density function for each N. This requirement is satisfied if (3.2) K is a density function. A function is a density function if it is non negative and its integral on the line is 1. It is clear from (3.2) that ˆ fN (t) ≥ 0 and Z ∞ −∞ ˆ fN (t)dt = 1 NhN N X i=1 Z ∞ −∞ K t − Xi hN dt and by change of variable, u = (t − Xi)/hN , Z ∞ −∞ K t − Xi hN dt = hN Z ∞ −∞ K(u)du = hN . Hence Z ∞ −∞ ˆ fN (t)dt = 1. Assumption (3.2) is attractive but it will limit how small of the bias of ˆ fN (t) can be. We will have two assumptions on the window (bandwidth) hN : (3.3) hN → 0 and (3.4) NhN → ∞, as N → ∞. Assumptions (3.3) and (3.4) will require careful balancing act. According to (3.3), hN should be small but according to (3.4), hN cannot be too small. These conflicting requirements will lead us to the optimal choice of hN . We need (3.3) to get an asymptotically unbiased estimator,

386. 32 LAJOS HORVÁTH i.e. E ˆ fN (t) → f(t). The assumption in (3.4) will imply that var( ˆ fN (t)) → 0. First we consider the behaviour of ˆ fN (t) at a fixed point t. Let Λ be a neighbourhood of t. It is easy to see that E ˆ fN (t) = 1 h EK t − X1 h , since the observations are identically distributed. By definition, 1 h EK t − X1 h = 1 h Z ∞ −∞ K t − u h f(u)du = Z ∞ −∞ K(x)f(t − xh)dx. Next we show that (3.5) Z ∞ −∞ K(x)f(t − xh)dx → f(t). We assume (3.6) sup −∞u∞ K(u) ∞, (3.7) sup −∞x∞ f(x) ∞, i.e. K and f are bounded functions. Also, (3.8) f(u) is continuous if u ∈ Λ. Using (3.2), (3.3) and (3.6)–(3.8) we show that (3.5) holds. Let 0. We choose A so large that Z −A −∞ K(x)f(t − xh)dx ≤ sup −∞u∞ f(u) Z −A −∞ K(x) ≤ and Z ∞ A K(x)f(t − xh)dx ≤ sup −∞u∞ f(u) Z ∞ A K(x) ≤ . Using the continuity assumed in (3.8) with (3.3), there is an integer N0 such that sup −A≤x≤A |f(t − xh) − f(t)| ≤ , if N ≥ N0. Hence for N ≥ N0 we have

390. Z A −A K(x)(f(t − xh) − f(t))dx

394. ≤ sup −A≤u≤A |f(t − uh) − f(t)| Z A −A K(x) ≤ sup −A≤u≤A |f(t − uh) − f(t)| Z ∞ −∞ K(x) ≤ . Also, by the choice of A we get

398. Z A −A K(u)du − 1

402. f(t) ≤ , completing the proof of (3.5). Hence (3.9) E ˆ fN (t) → f(t),

403. MATHEMATICAL STATISTICS 33 i.e. the estimator is asymptotically unbiased. In the applications the rate of convergence will be important. We need to increase our assumptions on the smoothness of K and f: (3.10) Z ∞ −∞ x2 K(x)dx ∞, (3.11) sup −∞x∞ |f0 (x)| ∞ and sup −∞x∞ |f00 (x)|, (3.12) f00 (u) is continuous, if u ∈ Λ. Using a two term Taylor expansion we obtain that Z ∞ ∞ K(x)(f(t − xh) − f(t))dx = −h Z ∞ −∞ K(x)f0 (t)xhdx + 1 2 Z ∞ −∞ K(x)(xh)2 f00 (ξ(x))dx, where ξ(x) satisfies |ξ(x) − t| ≤ |x|h. Using now (3.10)–(3.12), repeating our previous arguments we can show that Z ∞ −∞ K(x)x2 f00 (ξ(x))dx → f00 (t) Z ∞ −∞ x2 K(x)dx. Thus we conclude E ˆ fN (t) = f(t) − hf0 (t) Z ∞ −∞ xK(x)dx + h2 2 f00 (t) Z ∞ −∞ x2 K(x)dx + o(h2 ). Since we want to have small bias, from now on we assume that (3.13) Z ∞ −∞ xK(x)dx = 0. If K is symmetric around 0, assumption (3.13) holds. Under (3.13) E ˆ fN (t) = f(t) + h2 2 f00 (t) Z ∞ −∞ x2 K(x)dx + o(h2 ). Now we turn to the computation of the variance. Since the observations are independent and identically distributed we get that var( ˆ fN (t)) = 1 N2h2 N X i=1 var K t − Xi h = 1 Nh2 var K t − X1 h . and 1 h var K t − X1 h = 1 h E K t − X1 h 2 − 1 h EK t − X1 h 2 . We already showed that EK t − X1 h = O(h). (3.14) Repeating our previous calculations we conclude 1 h E K t − X1 h 2 = 1 h Z ∞ −∞ K2 t − x h f(x)dx = Z ∞ −∞ K2 (u)f(t − uh)du

404. 34 LAJOS HORVÁTH = f(t) Z ∞ −∞ K2 (u)du + o(1). Summarizing our calculations we have that var( ˆ fN (t)) = 1 Nh f(t) Z ∞ −∞ K2 (u)du + o(1) and therefore var( ˆ fN (t)) → 0 if and only if (3.4) holds. Since ˆ fN (t) is biased, the mean square error is used to evaluate its performance. By definition, MSE( ˆ fN (t)) = var( ˆ fN (t)) + (E ˆ fN (t) − f(t))2 = 1 Nh f(t) Z ∞ −∞ K2 (u)du + h4 4 (f00 (t))2 Z ∞ −∞ u2 K(u)du 2 + o 1 Nh + o(h4 ), if (3.13) holds. Now it is easy to find h which gives the smallest value of MSE( ˆ fN (t)), at least asymptotically: hopt = c0N−1/5 , where c0 = (c1/c2)1/5 c1 = f(t) Z ∞ −∞ K2 (u)du and c2 = (f00 (t))2 Z ∞ −∞ u2 K(u)du 2 . The result on the optimal h looks nice but it is not too useful. It depends on t but in our definition of the kernel density estimator, the window depends only on the sample size. Also, since f is unknown, we cannot compute c0. However, we have the interesting observation that the optimal hN is proportional to N−1/5 so it will be crucial for any theory to cover this case. We wish to use an estimator which minimizes the mean square error, i.e. we choose h and K, where min K min h E( ˆ fN (t) − f(t))2 is reached. We already found hopt and plugging this value into the formula for the MSE, we need to maximize this expression with respect to K. This is hard but the value of the MSE will not change too much. Hence the crucial question is the choice of h. There are some kernels which are often used in practice: K(u) = 1 2c I{−c ≤ u ≤ c} (uniform density), K(u) = 1 (2π)1/2 e−u2/2 (normal density), K(u) = (1 − |u|)I{|u| ≤ 1}, (triangular or Bartlett), K(u) = 3 4 (1 − u2 )I{|u| ≤ 1} (Epanechnikov kernel), K(u) = 30 20 √ 5 (5 − u2 )I{− √ 5 ≤ u ≤ √ 5} (Epanechnikov kernel) and K(u) = 1 2π sin(u/2) u/2 2 , −∞ u ∞, K(0) = 1 2π (Fejér kernel). All kernels have finite support except the normal and the Fejér. The kernel densities based on the normal and the Fejér kernels are infinitely times differentiable. The others might not provide differentiable or only few times differentiable density estimates. The Epanechnikov kernel minimizes the mean square error. The Fejér kernel is coming from the theory of Fourier analysis. In practice,

405. MATHEMATICAL STATISTICS 35 there is little difference between estimators using different kernels. Next we consider the limit distribution of ˆ fN (t). We show that (Nh)1/2( ˆ fN (t) − f(t)) is asymptotically normal for each t. We decompose the difference between the empirical and the true density as ˆ fN (t) − f(t) = [ ˆ fN (t) − E ˆ fN (t)] + [E ˆ fN (t) − f(t)], the random error and the numerical bias. The bias term will not play any role in the limit if (Nh)1/2 h2 → 0, i.e. (3.15) hN1/5 → 0. This means that using the optimal window the asymptotic mean of (Nh)1/2( ˆ fN (t) − f(t)) will not be 0. This is natural since the optimal window will give the same order of the square of the bias and the variance of ˆ fN (t). Since we already know the exact behaviour of the bias term, we consider the normality of ˆ fN (t) − E ˆ fN (t) = 1 Nh N X i=1 K t − Xi h − EK t − Xi h . We introduce ηi,N = 1 h1/2 K t − Xi h − EK t − Xi h , which are independent and identically distributed random variables. Also, by (3.6) these are bounded random variables, but the bound depends on N. Regardless, we can use Liapounov’s condition (cf. DasGupta, 2008 p. 64) to establish normality. Now var(ηi,N ) = 1 h EK2 t − Xi h − EK t − Xi h 2 # . We showed that EK t − Xi h = O(h). Repeating our previous arguments we get that 1 h EK2 t − Xi h = 1 h Z ∞ −∞ K2 t − x h f(x)dx = Z ∞ −∞ K2 (u)f(t − uh)du = f(t) Z ∞ −∞ K2 (u)du + o(1). Now we compute Eη4 i,N (essentially we only need an upper bound). The only reason why we compute that 4th moment because in this case we can get the exact asymptotic and the method can be used in other cases as well. Namely, Eη4 i,N = 1 h2 K4 t − Xi h − 4EK3 t − Xi h EK t − Xi h + 6EK2 t − Xi h EK t − Xi h 2 − 4EK t − Xi h EK t − Xi h 3 + EK t − Xi h 4 .

406. 36 LAJOS HORVÁTH For ` = 1, 2, 3 and 4 we have EK` t − Xi h = Z ∞ ∞ K` t − x h fxdx = h Z ∞ −∞ K` (u)f(t − uh)du = hf(t) Z ∞ −∞ K` (u)du + o(h). Thus we get Eη4 i,N = 1 h f(t) Z ∞ −∞ K4 (u)du + o 1 h . According to the Liapounov condition, we need to show that N X i=1 Eη4 i,N !1/4 N X i=1 Eη2 i,N !1/2 → 0. (3.16) We showed that N X i=1 Eη2 i,N !1/2 ≈ N1/2 and N X i=1 Eη4 i,N !1/4 ≈ N1/4 h−1/4 , and therefore (3.16) holds if Nh → ∞. Thus we get the asymptotic normality of (Nh)1/2( ˆ fN (t) − E ˆ fN (t)): (Nh)1/2 ( ˆ fN (t) − E ˆ fN (t)) D → N 0, f(t) Z ∞ −∞ K2 (u)du , (3.17) where N denotes a normal random variable. The normalization in the central limit theorem in (3.17) shows that the rate of convergence is always slower than, for example, the convergence of the empirical distribution function to the theoretical one. Also, we need to choose the kernel K and the window h. We also see that the optimal window h ≈ N−1/5 will give a central limit theorem for (Nh)1/2( ˆ fN (t) − f(t)) but the expected value of the limiting normal will not be 0. We are interested in f(t), as a function of t, so we would like to estimate at several points simulta- neously. First we consider the correlation between (Nh)1/2( ˆ fN (t) − E ˆ fN (t)) and (Nh)1/2( ˆ fN (s) − E ˆ fN (s)). Using independence we get that E (Nh)1/2 ( ˆ fN (t) − E ˆ fN (t))(Nh)1/2 ( ˆ fN (s) − E ˆ fN (s)) = 1 Nh N X i=1 N X j=1 E K t − Xi h − EK t − Xi h K s − Xj h − EK s − Xj h = 1 Nh N X i=1 E K t − Xi h − EK t − Xi h K s − Xi h − EK s − Xi h = 1 h E K t − X1 h − EK t − X1 h K s − X1 h − EK s − X1 h .

407. MATHEMATICAL STATISTICS 37 Now E K t − X1 h − EK t − X1 h K s − X1 h − EK s − X1 h = E K t − X1 h K s − X1 h − EK t − X1 h EK s − X1 h = E K t − X1 h K s − X1 h + O(h2 ) on account of (3.14). Following our earlier calculations we get that E K t − X1 h K s − X1 h = Z ∞ −∞ K t − x h K s − x h f(x)dx = h Z ∞ −∞ K(u)K u + t − s h f(t − uh)du. Since K is integrable on the line, if t 6= s, then for all u (3.18) K u + t − s h → 0, since |t − s|/h → ∞. On account of (3.6) and (3.7) we have that 0 ≤ K(u)K u + t − s h f(t − uh) ≤≤ sup x K(x) sup x f(x) K(u), K is inegrable on R and by (3.18) K(u)K u + t − s h f(t − uh) → 0 for all u ∈ Rd , so by the Lebesgue dominated convergence theorem we have (3.19) Z ∞ −∞ K(u)K u + t − s h f(t − uh)du → 0, if t 6= s. Thus we proved that (Nh)1/2( ˆ fN (t) − E ˆ fN (t)) and (Nh)1/2( ˆ fN (s) − E ˆ fN (s)) are asymptotically uncorrelated if t 6= s. Now we try to establish the multivariate central limit theorem for density estimates. Let t1 t2 . . . tR and our conditions on the density are satisfied at these points. We show that (Nh)1/2 ( ˆ fN (t1) − E ˆ fN (t1)), (Nh)1/2 ( ˆ fN (t2) − E ˆ fN (t2)), (3.20) . . . , (Nh)1/2 ( ˆ fN (tR) − E ˆ fN (tR)) ! D → NR(0, Σ) where NR is an R dimensional normal random vector, Σ is a diagonal matrix and diag(Σ) = f(t1) Z ∞ −∞ K2 (u)du, f(t2) Z ∞ −∞ K2 (u)du, . . . , f(tR) Z ∞ −∞ K2 (u)du . There is a standard method how to prove the asymptotic normality of a random vector. This is the Cramér–Wold theorem (DasGupta, 2008, p. 9). According to this theorem we need to show that all linear combinations are asymptotically normal, i.e. for all λ1.λ2, . . . , λR R X `=1 λ`(Nh)1/2 ( ˆ fN (t`) − E ˆ fN (t`)) D → N 0, R X `=1 λ2 ` f(t`) Z ` −∞ K2 (u)du ! . (3.21)

408. 38 LAJOS HORVÁTH The proof of (3.21) is also based on Liapounov theorem. Using the just proven asymptotic uncor- relation, we obtain that E R X `=1 λ` 1 h1/2 K t` − Xi h − EK t` − Xi h #2 = E 1 h1/2 ( R X `=1 λ` K t` − Xi h − EK t` − Xi h )#2 = 1 h E R X `=1 λ` K t` − Xi h − EK t` − Xi h #2 = R X `=1 λ2 ` f(t`) Z ∞ −∞ K2 (u)du + o(1). The Hölder inequality yields (x1 + x2 + . . . + xR)4 ≤ R4(x4 1 + x4 2 + . . . + x4 R) and therefore we get that E R X `=1 λ` 1 h1/2 K t` − Xi h − EK t` − Xi h #4 ≤ R4 R X `=1 λ4 ` E 1 h1/2 K t` − Xi h − EK t` − Xi h 4 ≤ R4 24 R X `=1 λ4 ` 1 h2 EK4 t` − Xi h + 1 h2 EK t` − Xi h 4 # = O 1 h . So if η̄i,N = R X `=1 λ` 1 h1/2 K t` − Xi h − EK t` − Xi h , then Eη̄i,N = 0 and using the formulae above we get the N X i=1 η̄4 i,N !1/4 N X i=1 η̄2 i,N !1/2 → 0. Hence the Liapounov condition is satisfied and therefore (3.21) holds. We obtain several results on the kernel density estimators for fixed t’s. The rate of convergence is (Nh)−1/2 is slower than the usual N−1/2. Also, the asymptotic independence (3.20) will cause problems when we look at the estimate on the interval [a, b]. We wish to obtain “global” results for ˆ fN (t). The popular choices are sup a≤t≤b | ˆ fN (t) − f(t)| and Z b a | ˆ fN (t) − f(t)|p dt 1/p ,

409. MATHEMATICAL STATISTICS 39 where p ≥ 1. What is the “natural” norm? It is argued that p = 1 is the “natural” norm, since the L1 norm of densities is always finite, it is always less than 2. All the other norms put restrictions on f. The visualization of ˆ fN (t) together with f is supported by the sup–norm. To obtain global results, the point wise assumptions on f must hold on [a − , b + ] with some 0. Under these conditions, sup a≤t≤b | ˆ fN (t) − f(t)| P → 0 i.e. the estimator is uniformly weakly consistent. The limit distribution of the sup norm and the L2 norm of the kernel density estimator was determined by Bickel and Rosenblatt (1973). They consider MN,1 = (Nh)1/2 sup a≤t≤b f−1/2 || ˆ fN (t) − f(t)| and MN,2 = Nh Z b a Z b a ( ˆ fN (t) − f(t))2 a(t)dt, where a(t) is a weight function. Bickel and Rosenblatt (1973) explicitly define numerical sequences r1,N and r2,N such that (3.22) P {r1,N (MN,1 − r2,N ) ≤ x} → exp(−2e−x ). We note that r1,N ≈ (log N)1/2 and r2,N ≈ (log N)1/2 and the limit is an extreme value distribution. These suggest that the rate of convergence in (3.22) is slow. This conjecture was checked empirically. They also showed that there are constants r3 and r4 such that (3.23) P 1 r3h1/2 (MN,2 − r4) → Φ(x), where Φ(x) denotes the standard normal distribution function. The rate of convergence in (3.23) is better than in (3.22), so usually (3.23) is used in hypothesis testing. Csörgő and Horváth (1988) extend the results in (3.23) to the functionals MN,3 = (Nh)p/2 Z b a | ˆ fN (t) − f(t)|p a(t)dt, where a(t) is a weight function for all p ≥ 1. Their result is mainly used when p = 1 since this will give the natural L1 norm for density estimates. Since the rate of convergence in (3.22) is rather low, we discuss how to use the bootstrap to get cN = cN (α) such that (3.24) lim N→∞ P{MN,1 ≤ cN } = 1 − α. The result in (3.24) can be used to construct confidence bands for the density on [a, b] and for hypothesis testing. We note that cN (α) does not have a limit as N → ∞, it is inceasing like (2 log(1/h))1/2. If we use boostrap with replacement, X∗ 1 , X∗ 2 , . . . , X∗ N are independent and identically distributed random variables but they are discrete, so conditionally on X they do not have a density function. Note that due to the difficult form of r1,N and r2,N , the bootstrap sample size is N, the original sample size. However, even there is no density, we can formally compute ˆ f∗ N (t) = 1 Nh N X i=1 K t − X∗ i h , which is a density for all N, it satisfies that ˆ f∗ N (t) ≥ 0 and Z ∞ −∞ ˆ f∗ N (t)dt = 1.

410. 40 LAJOS HORVÁTH This is really interesting since with a density we are estimating a non existing density if we condition on X. The bootstrap statistic is M∗ N,1 = (Nh)1/2 sup a≤t≤b ˆ f −1/2 N (t)| ˆ f∗ N (t) − ˆ fN (t)|. We cannot repeat our previous arguments since conditionally on X, ˆ fN (t) is not the density of the bootstrap observations. However, if c∗ N (α) is defined by P{M∗ N,1 ≤ c∗ N } = 1 − α, then (3.25) lim N→∞ P{MN,1 ≤ c∗ N (α)} = 1 − α. Hence we can use c∗ N (α) as an approximation for cN (α). It is more natural to use a boostrap sample with density function ˆ fN (t), conditionally on X. Since ˆ fN (t) is a density function, F̂N (x) = Z x −∞ ˆ fN (t)dt defines a distribution function. We note that F̂N (x) = 1 Nh N X i=1 Z x −∞ K t − Xi h dt = 1 N N X i=1 K x − Xi h , where K(u) = Z u −∞ K(t)dt, i.e. K(u) is the distribution function satisfying K0(u) = K(u). Hence F̂N (x) is a smooth estimator for the underlying distribution function F. Let Z1, Z2, . . . , ZN be independent identically random variables with distribution function F̂N (x), conditionally on X. Now we compute the kernel density estimator ˜ f∗ N (t) from Z1, Z2, . . . , ZN . The corresponding sup statistic is M̃∗ N,1 = (Nh)1/2 sup a≤t≤b ˆ f −1/2 N (t)| ˜ f∗ N (t) − ˆ fN (t)|. If c̃∗ N = c̃∗ N (1 − α) is defined by lim N→∞ P{MN,1 ≤ c̃∗ N (α)} = 1 − α, so we have an other resampling based estimator for cN (α). Our discussion introduced a smooth estimator for F and this estimator is used to define the smoothed bootstrap. Next we consider the effect of estimating a parameter in the fitted density function. We assume that the underlying density function is in the parametric form f(t, θ). The true value of the parameter is θ0, i.e. f0(t) = f(t, θ0). We estimate the parameter with θ̂N satisfying (3.26) N1/2 (θ̂N − θ0) = OP(1). We have seen that several estimators satisfy (3.26), for example, maximum likelihood, least squares, U–statistics and so on. If f(t, θ) has a bounded derivative in a neighbourhood of θ0, i.e. there is a constant C such that ∂f(t, θ) ∂θ ≤ C for all t and θ in a neighbourhood of θ0. Hence by (3.26) and the mean value theorem we get that sup a≤t≤b

413. f(t, θ̂N ) − f(t, θ0)

416. = OP (N−1/2 )

417. MATHEMATICAL STATISTICS 41 and therefore (3.27) (Nh)1/2 sup a≤t≤b

420. ˆ fN (t) − f(t, θ̂N )

423. = (Nh)1/2 sup a≤t≤b

426. ˆ fN (t) − f(t, θ0)

429. + oP (1). This means that estimating parameters does not effect the results on density estimation. This is different from the parameter estimated empirical process where the estimation of the parameter changes the asymptotics. 3.2. Cross validation. We have seen if we minimize MSE( ˆ fN (t)f (t))2 with respect to the window (smoothing parameter), then h depends on t. In order to find an “optimal” window, the other possible criteria is the minimization of the mean integrated square error MISE(h) = E Z b a ( ˆ fN (t) − f(t))2 dt. So we choose hopt which minimizes MISE(h), i.e. hopt = argminhMISE(h). Since E Z b a ( ˆ fN (t) − f(t))2 dt = Z b a E( ˆ fN (t) − E ˆ fN (t))2 dt + 2 Z b a E( ˆ fN (t) − E ˆ fN (t))(E ˆ fN (t) − f(t))dt + Z b a (E ˆ fN (t) − f(t))2 dt = Z b a var( ˆ fN(t))dt + Z b a (E ˆ fN (t) − f(t))2 dt = 1 Nh Z b a f(t)dt Z ∞ −∞ K2 (u)du 2 + h4 4 Z b a (f00 (t))2 dt Z ∞ −∞ u2 K(u)du 2 + o(h4 ) + o 1 Nh So, at least asymptotically. hopt = c∗N−1/5 , with c∗ = (Z b a f(t)dt Z ∞ −∞ K2 (u)du 2 )1/5 (Z b a (f00 (t))2 dt Z ∞ −∞ u2 K(u)du 2 )−1/5 , which depends on the unknown f. Cross validation provides a data based estimator for hopt. We write Z b 0 ( ˆ fN (t) − f(t))2 dt = Z b a ( ˆ fN (t))2 dt − 2 Z b a ˆ fN (t)f(t)dt + Z b a f2 (t)dt = J(h) + Z b a f2 (t)dt. Since Z b a f2 (t)dt does not depend on h, we need to minimize J(h). The estimator for J(h) is ¯ J(h) = Z b a ( ˆ fN (t))2 dt − 2 N N X i=1 Z b a ˆ fN (t) ˆ f (−i) N (t)dt,

430. 42 LAJOS HORVÁTH where ˆ f (−i) N (t) is the kernel density estimator without the ith observation. The sample based cross validation estimator is ĥ = argminh{ ¯ J(h)}. It can be shown that (3.28) ĥ hopt P → 1 and (3.29) MISE(ĥ) MISE(hopt) P → 1. (Note: proving ĥ − hopt P → 0 would not be too useful since both terms go to 0.) The result in (3.29) that using ĥ we get the asymptotically most efficient kernel density estimator. Also, we need to check that our results proven for the non random MISE(hopt) window (expansion of the bias, variance, asymptotic normality, asymptotic distribution of norms) will go through for the random ˆ fN . These have been established in the literature (cf. Silverman (1986)), so it is justified to use ˆ f. We discussed cross validation in the context of finding the optimal window. The same idea is also used in model validation on machine learning, for example. However, not always one element is removed to get the comparison estimates, but several. Also, the same idea is used in case of jackknife estimators. For computational purpose we approximate ˆ J(h) with J∗ (h) = 1 Nh2 N X i=1 N X j=1 K∗ Xi − Xj h + 2 Nh K(0), where K∗ (x) = K(2) (x) − 2K(x) and K(2) (x) = Z ∞ −∞ K(x − y)K(y)dy. The numerical work is still substantial and Fast Fourier Transform is suggested for the computation. The computation of the cross validation is not too simple so some suggestions are given which only supported by simulations. If f is a normal density h∗ = 1.06σN−1/5 is suggested where σ is the variance of the observations. Since σ is unknown, it is estimated by σ̂ = min(sample standard deviation, interquartile range/π). Hence h∗ = 1.06σ̂N−1/5 is computable from the sample. This rule of thumb is used for non–normal densities as well. Usually instead of 1.06, several other constants are tried and the “best” is used in the analysis. Choosing h requires practice!

431. MATHEMATICAL STATISTICS 43 3.3. Histogram. The histogram is wildly popular, since this was the first density estimator and it has been around since 1880’s. Also, even the simplest statistical software contains it. It is not better than the kernel density estimator and the estimate for smooth densities is a step function. The definition is very simple. We assume for the sake of simplicity that the support of f is [0, 1]. (Essentially, we have an interval which contains all the observations, or we use an interval such that the integral of the density on this interval is close to 1. Roughly speaking, we need a relatively large but not too large interval for the construction of the density.) We construct histogram with equal length bins. Let m be an integer and define the bins Bj = x : j − 1 m x ≤ j m , j = 1, 2, . . . , m. If Yj = N X i=1 I {Xi ∈ Bj} , and p̂j = Yj N , j = 1, 2, . . . , m then the histogram is defined as ˜ fN (t) = m X j=1 p̂j h I {t ∈ Bj} with h = 1 m . Hence the histogram is a step function, the value of the percentage of the observations in the bin. It is clear that the histogram is closely related to the kernel density estimator with the uniform kernel. (Writing h = 1/m is just an effort to connect the number of bins with the window.) It can be shown that MISE = Z 1 0 ( ˜ fN (t) − f(t))2 dt = h2 12 Z 1 0 (f00 (t))2 dt + 1 Nh + o(h2 ) + o 1 Nh , so in this case the optimal window is hopt = N−1/3     6 Z 1 0 (f00 (t))2 dt     1/3 . If we compare the optimal window of order N−1/3 to the optimal window of order N−1/5 for kernel densities, we see that the histogram converges to its limit much slower. Finding the optimal h (finding the number of bins m) can be found by cross validation as well. Let ˆ Jh = 2 h(N − 1) − N + 1 h(N − 1) m X j=1 p̂j. The cross validation gives the window ĥ = argminh ˆ J(h) so m = b1/ĥc. There are several versions of the histogram, including non equal bin sizes, data driven bins and so on. Silverman (1986) contains a readable account of histograms. Local log likelihood (local polynomial smoothing). The likelihood method can be extended to function spaces so the likelihood for f is N X i=1 log f(Xi) − N Z ∞ −∞ f(u)du − 1 .

432. 44 LAJOS HORVÁTH Maximizing the log likelihood function above does not give acceptable result. Nonparametric likelihood arguments give that the locally smoothed log likelihood should be used and the likelihood estimator for f is ˆ fN (t) = argmaxf L(t; f) where L(t; f) = N X i=1 K t − Xi h log Xi − N Z ∞ −∞ K t − u h f(u)du. Maximizing with respect to f is hard so we approximate log t with a polynomial in the form pt(a, u) = r X j=0 aj j! (t − u)j , a = (a0, a1, . . . , ar). So we need to minimize N X i=1 N X i=1 K t − Xi h r X j=0 aj j! (t − Xi)j − N Z ∞ −∞ K t − u h exp   r X j=0 aj j! (t − u)j   du with respect to a. The minimum is reached at â = (â0, â1, . . . , âr), Then the estimator is ˆ fN (t) = eâ0 . This method is also called to local polynomial smoothing. It is called local because of the expansion of log t around t. This method requires the choice of K, h, r. Due to the choices of more parameters, we can get good results. The best result would be for large r but in this case the error would increase. So h and r should be picked using the data. This is implemented in several statistical packages. You can find more on local polynomial smoothing in Fan and Gijbels (1996) 3.4. Estimation with series. If φ1, φ2, . . . are orthonormal functions on [0, 1], i.e. Z 1 0 φi(u)φj(u)du = ( 0, if i 6= j 1, if i = j then (3.30) f(t) = ∞ X `=1 c`φ`(t), c` = Z 1 0 f(u)φ`(u)du. The expansion in (3.30) requires some assumptions to make sense, at least Z 1 0 f2 (u)du ∞ is needed. This give you a meaningful estimate for a fixed t and also in L2, the space of square integrable functions on [0, 1]. Assuming appropriate further conditions, we have that the infinite sum converges in the sup–norm or in L1 (the space of integrable functions on [0, 1]). It is easy to unbiased estimator for c` ĉ` = 1 N N X i=1 φ`(Xi). Clearly, Eĉ` = 1 N N X i=1 Eφ`(Xi) = Eφ`(X1) = Z 1 0 φ`(u)f(u)du = c`.

433. MATHEMATICAL STATISTICS 45 But we cannot compute an infinite sum, we use only the first M, ˆ fN (t) = M X `=1 ĉ`φ`(t). Now we have the issue of choosing M. Since E ˆ fN (t) = M X `=1 c`φ`(t) so using a large M we can reduce the bias. However, it is easy to see that this will increase the variance. We have the same problem what we had in case of kernel density estimation and now 1/M behaves like h. Nor surprisingly, the optimal M is c∗N1/5 with some constant c∗, depending on f and the basis. Fourier series is a popular choice. If you had a course on Fourier series, you can see that we are getting back, as approximation, to the kernel density estimation with “non standard” (i.e. the kernel is not necessary a density) kernels, like Dirichlet and Fejér kernels. This also tells you why it is hard to get good results at the endpoints 0 and 1. This is essentially numerical analysis. The other popular choices are wavelets and Haar functions. These provide fast converging series (3.30). Also, in case of normal data, their orthogonality properties will give the independence of the estimators for c`. The series estimation of densities has not been popular in statistics but due to their good properties, wavelets and Haar functions are used more often, especially in machine learning. The transformation with wavelets is used also often in information theory. Numerical mathematics contains a large amount of results on wavelets but if some randomness is introduced, not much is known. 3.5. Multivariate density estimation. The estimation of univariate densities has a large literature and nearly everything we discussed can be easily done for multivariate densities. Let d be the dimension of your data. The multivariate density estimation used mainly when d = 2 or d = 3. Clearly, visualization is simple if d = 2. Using projections into 2 dimensional spaces, d = 3 is used but sparsely. Also, the “curse of dimension” appears in density estimation. Let assume that our observation is uniform in the d-dimensional box [0, 1]d. If we choose the inside of this box, i.e. [δ, 1 − δ]d, the probability of being in this box is (1 − 2δ)d which converges to 0, as d → ∞, for all 0 δ 1/2. This means that the data is “spreading out” out, in our case, they are around the edges. The normal case is even worse. The probability that a standard d–dimensional normal random vector will be in any compact set of Rd goes to 0, as d → ∞. So we need an extremely large sample size to see d–dimensional normal random vectors, essentially you need that the sample size is an exponential function of the dimension. Since the arrival of “big data” we might have that this assumption is satisfied but even in this case the first step is dimension reduction. The series based estimation goes through easily, you only need basis on a compact subset of Rd and you estimate the density as in case of univariate observations. For kernel density estimation you need a density function in Rd. Please note that except the d dimensional distribution, nearly all other multivariate densities were products of univariate densities. So we choose K(x) = K1(x1)K2(x2) · · · Kd(xd), x = (x1, x2, . . . , xd) as our kernel. The other popular choice is K(x) = K0(x Vx),

434. 46 LAJOS HORVÁTH where K0 is a univariate density function and V is a matrix. The observations are vectors X1, X2, . . . , XN , Xi = (Xi,1, Xi,2, . . . , Xi,d). So we can use ˆ fN (x) = 1 Nh1h2 · · · hd N X i=1 d Y `=1 Ki x` − Xi,` h` . The formula for ˆ fN (x) tells us the all our computations done for univariate densities can be done for multivariate densities. The calculations will be much more involved but the idea behind the calculations will be the same. Applications of multivariate density estimation is surveyed in Scott (2015). k–nearest neighbour. We have again a d–dimensional sample and for each x let Rk(x) be the radius of the smallest ball which contains exactly k observations. Let Vk(x) be the volume of the ball in Rd with radius Rk(x). The k–nearest neighbour density estimator is ˆ fN (x) = k N 1 Vk(x) . We note that Vk(x) = πd/2 Γ(d 2 + 1) Rd k(x), where Γ(·) is the Gamma function. We recognise that πd/2 Γ(d 2 + 1) is the volume of the unit ball in Rd . It is always assumed that k = k(N) → ∞ but slowly enough. Applications. (i) Mode. The mode is defined as the point where f(x) takes it largest value. i.e. θ = argmaxxf(x). Using any of the estimators for the density f, we get the empirical counterpart θ̂N = argmaxx ˆ fN (x). Using the properties of the density estimates, one can show that θ̂N → θ in probability and the asymptotic asymptotic normality of θ̂N . (ii) Entropy. If X is a random vector in Rd with density function f(x), the entropy is defined as HX = − Z Rd log f(x)f(x)dx. Hence ĤN = − Z Rd log ˆ fN (x) ˆ fN (x)dx is a plug in estimator. Observing that HX = −E log f(X), then H̃N = − 1 N N X i=1 log ˆ fN (Xi) is an other possible estimator. Both estimators are consistent and asymptotically normal but they converge to their limit slower than N−1/2.

435. MATHEMATICAL STATISTICS 47 4. Tests for exponentiality and normality As in the previous sections, Assumption 4.1. X1.X2, . . . , XN are independent and identically distributed random variables with distribution function F. We have already discussed the parameter estimated processes to check the null hypothesis (4.1) H0 : there is θ0 ∈ Rd such that F(t) = F0(t, θ0) for all t. It is assumed that F0(t, θ) is known, i.e. we want to test if F can be written in a given parametric form. The parameter estimated empirical processes are tailored for this problem. First we estimate the unknown parameter with θ̂N satisfying under H0 that (4.2) θ̂N − θ0 = 1 N N X `=1 g(X`) + oP 1 N , with some function g(·). Assumption (4.2) is called a linearization assumption, and it is usually satisfied when N1/2(θ̂N − θ0) is asymptotically normal. For example, the maximum likelihood in the regular case and the least squares estimators satisfy (4.2). The parameter estimated empirical process is N1/2(FN (t)−F0(t, θ̂N )), where FN (t) is the empirical distribution function. Generalizing the arguments we used in the exponential case, one can prove that (4.3) N1/2 (FN (t) − F0(t, θ̂N )) D[−∞,∞] −→ Γ(t), where Γ(·) is a Gaussian process. The proof of (4.3) provides and explicit form for Γ(t) but it is a difficult functional of the Brownian bridge and not only depends on F0 but it might depend on θ0 as well. However, in case of scale, location and location-scale families, using the maximum likelihood estimator to estimate θ, we can achieve that Γ does not depend on θ0. The weak convergence in (4.3) gives the parameter estimated Kolmogorov–Smirnov statistic TN,1 = sup −∞t∞ |N1/2 (FN (t) − F0(t, θ̂N ))| and the parameter estimated Cramér–von Mises staistic TN,2 = Z ∞ −∞ (N1/2 (FN (t) − F0(t, θ̂N )))2 dt, N,3 = Z ∞ −∞ (N1/2 (FN (t) − F0(t, θ̂N )))2 dFN (t) ≈ N X `=1 i N − F0(X`,N , θ̂N ) 2 , where X1,N ≤ X2,N ≤ . . . ≤ XN,N are the order statistics. The weak convergence (4.3) yields that (4.4) N,1 D → sup −∞t∞ |Γ(t)|, (4.5) N,2 D → Z ∞ −∞ Γ2 (t)dt, and (4.6) N,3 D → Z ∞ −∞ Γ2 (t)dF0(t, θ0). If F0 is the exponential or normal distribution functions, then TN,1, TN,2 and TN,3 are the Lilliefors tests. The convergence in distribution in (4.4)–(4.6) can be used via Monte Carlo simulations if Γ does not depend on θ0. However, the parametric boostrap as we discussed in Section ??? can be always used.

436. 48 LAJOS HORVÁTH The other popular method to test H0 is the χ2 test. We divide the support of F0 into M cells with the points t1, t2, . . . , tM−1. Please note that we assumed, without explicitly saying, that the support of F0(t, θ) does not depend on θ. Next we define the cells B1 = (−∞, t1], B2 = (t1, t2], . . . , BM−1 = (tM−2, tM−1], BM = (tM−1, ∞). The original observations are replaced with the count data Yi = N X `=1 I{X` ∈ Yi}, 1 ≤ i ≤ M. Then (Y1, Y )2, . . . , YM ) has multinomial distribution with parameters p1, p2, . . . , pM , N. If H0 holds, then p1 = p1(θ) = F0(t1, θ), p2 = p2(θ) = F0(t2, θ) − F0(t1, θ), . . . , pM−1 = pM−1(θ) = F0(tM−1, θ) − F0(tM−2, θ), pM = pM (θ) = 1 − F0(tM−1, θ) and the value of the parameter is θ0. Since it is unknown we need to estimate it but we need to use the count data Y1, Y2, . . . , YM . The likelihood function is Λ(θ) = N! Y1!Y2! · · · YM ! N Y `=1 Y pi(θ) i , and therefore tha maximum likelihood estimator is θ̃N = argmaxθΛ(θ). The maximum likelihood tests is equivalent with the χ2 test: we reject for large values of TN = N X i=1 (Yi − Npi(θ̃N ))2 Npi(θ̃N ) . If θ ∈ Rd and the coordinates are “independent” (this essentially mean that the centered and scaled θ̃N has a non degenerate d–dimensional normal limit). Under the null hypothesis TN D → χ2 (M − 1 − d). This general method can be applied to the exponential as well as to the normal distributions. The power of the χ2 test is low since it checks the original null hypothesis of (4.1). Now we start discussing tests which were developed directly to test for exponentiality. In the first family we only have a scale parameter F0(t, θ) = ( 0, if t 0 1 − e−t/θ , if t ≥ 0. (4.7) In the second formulation of the exponential distribution we have two parameters θ = (θ1, θ2) and F0(t, θ) = ( 0, if t θ1 1 − e−(t−θ1)/θ2 , if t ≥ θ1. (4.8)

437. MATHEMATICAL STATISTICS 49 4.1. Rényi representation. We assume that (4.7) holds. According to the Rényi representation, (4.9) S1 SN , S2 SN , . . . , SN−1 SN D = {U1,N−1, U2,N−1, . . . , UN−1,N−1} , where Si = i X `=1 X`, 1 ≤ i ≤ N and U1,N−1, U2,N−1, . . . , UN−1,N−1 are the order statistics from N−1 independent random variables, uniform on [0, 1]. The distribution of Si/SN does not depend on θ, so we can assume that θ = 1. Let rN (t) = N1/2 SbNtc SN − bNtc N , 0 ≤ t ≤ 1. We already showed that (4.10) N−1/2 SbNtc − bNtc D[0,1] −→ W(t), where W(·) is a Wiener. Using the (4.10), elementary arguments give that rN (t) D[0,1] −→ W(t) − tW(1). By definition, B(t) = W(t) − tW(1) is a Brownian bridge, so rN (t) D[0,1] −→ B(t). Now we get the convergence of the functionals of rn(t) and therefore (4.11) sup 0≤t≤1 |rN (t)| D → sup 0≤t≤1 |B(t)| and (4.12) Z 1 0 r2 N (t)dt ≈ N−1 X i=1 Si SN 2 D → Z 1 0 B2 (t)dt. The results in (4.11) and (4.12) are large sample approximations. But their connection in (4.9) to the uniform order statistics, we can use the exact distribution which were developed for uniform order statistics. Next we discuss how to test for (4.8). We estimate θ1 with θ̂1 = X1,N = min(X1, X2, . . . , XN ), with the maximum likelihood estimator for θ1. We change rN (t) into r̃N (t) = N1/2 SbNtc − bNtcX1,N SN − NX1,N − bNtc N , 0 ≤ t ≤ 1. The underlying model under the null hypothesis is a location and scale model, and therefore Sk − kX1,N SN − NX1,N = k X i=1 (Xi − X1,N ) N X i=1 (Xi − X1,N ) does not depend on θ = (θ1, θ2). Hence we can assume that θ1 = 0 and θ2 = 1. It is easy to see that P {NX1,N ≤ t} = 1 − P {NX1,N t} = 1 − (1 − P {X1 ≤ t/N})N = 1 − e−t

438. 50 LAJOS HORVÁTH for all t ≥ 0. Hence X1.N = OP 1 N . Now we get max 1≤k≤N

449. k X i=1 (Xi − X1,N ) N X i=1 (Xi − X1,N ) − k X i=1 Xi N X i=1 Xi

460. = OP 1 N . Hence removing X1,N from the observations, which removes the location parameter from the sample, we do not change the limit results in (4.11) and (4.12). These tests are not really powerful to test for exponentiality since we test only that the observations are in a scale (or a location and scale) family and when θ = 1, the both the mean and the variance is 1. 4.2. Gini’s test. Let TN,4 = 1 2X̄N N(N − 1) N X i,j=1 |Xi − Xj|, where X̄N = 1 N N X i=1 Xi. If H0 of (4.7) holds, then (4.13) (12(N − 1))1/2 (TN,4 − 1/2) D → N(0, 1), where N(0, 1) is a standard normal random variable. We can use TN,5 = 1 2(X̄N − X1,N )N(N − 1) N X i,j=1 |Xi − Xj| to test if (4.8) holds. Under the null hypothesis, TN,5 also satisfies (4.13). TN,4 and TN,5 are essentially U–statistics, so the corresponding theory can be used to derive (4.13). The test hinges on the observation that E|X1 − X2| = θ/2. 4.3. Mean residual life. The distribution of X1 satisfies (4.7) if and only if E(X1 − t|X1 t) is constant, which is equivalent with (4.14) E min(X1, t) = θP{X1 ≤ t}. The process vN (t) = N1/2 1 N N X i=1 min(Xi/X̄N , t) − 1 N N X i=1 I{Xi/X̄N ≤ t ! compares estimates for the left and right sides of (4.14). We reject (4.7), if sup0t∞ |vN (t)| or R ∞ 0 v2 N (t)dt is large. These are parameter free statistics so Monte Carlo simulations can provide critical values.

461. MATHEMATICAL STATISTICS 51 4.4. Laplace transform. The exponential distribution has a very simple moment generating function E exp(tX1) = θ t + θ , −θ t ∞. The process uN (t) = N1/2 1 N N X i=1 exp(tXi/X̄N ) − 1 t + 1 ! , −1 t ∞ checks, if the Laplace transform of Xi/X̄N is close to 1/(1 + t). We reject (4.7), if supa≤t≤b |uN (t)| or R b a u2 N (t)dt is large, where −1 a b ∞. These are parameter free statistics so Monte Carlo simulations can provide critical values. 4.5. Total Time on Test. Let X1,N ≤ X2,N ≤ . . . ≤ XN,N denote the order statistics. The total time is Ti = i X `=1 (N − ` + 1)(X`,N − X`−1,N ), X0,N = 0. If N machines are working, parallel with each other, Ti is the total time they worked until the th failure. The total time on transform is Ti/TN , 1 ≤ i ≤ N. Plotting Ti/TN , 1 ≤ i ≤ N against i/N, 1 ≤ i ≤ N and connecting the plotted points with straight lines we get the TTT plot. Under (4.7) the TTT plot should be close to the diagonal of the unit square. The TTT plot is an excellent visual tool for model identification and test fr ageing. We reject (4.7) if N1/2 max 1≤i≤N−1

465. Ti TN − i N

469. is large. The TTT method is one of the most often used method to test for (4.7). We can easily modify the TTT transform to test for (4.8), we just need to remove the first term in the definition of Ti, the other terms do nor depend on the location. The methods in Sections 4.3–4.4 are based on characterizations of the exponential distribution. Other characterizations of the exponential distribution, like the characteristic function (Fourier transform), can also be used to derive further tests. Test for normality is one of the central problems in statistics since several procedures assume that the underlying observations are normal. The number of procedures is enormous, we are just discussing two often used procedures and then a few others just to show how characterization of the normal distribution can be used to test for normality. Our null hypothesis is (4.15) H0 : there are µ and σ 0 such that F(t) = Φ t − µ σ , where Φ is the standard normal distribution function Φ(t) = 1 √ 2π Z t −∞ e−u2/2 du.

470. 52 LAJOS HORVÁTH 4.6. Shapiro–Wilk (Shapiro–Francia) test. The definition of the statistic requires the introduction of few further notations. Let Z1,N ≤ Z2,N ≤ . . . ≤ ZN,N the order statistics from N independent standard normal random variables and V the covariance matrix of Z1,N , Z2,N , . . . , ZN,N , i.e. V = {vi,j, 1 ≤ i, j ≤ N} with vi,j = cov(Zi,N , Zj,N ). Let m = (EZ1,N , EZ2,N , . . . , EZN,N ) be the vector of the expected values of the oder statistics of independent standard normal random variables. The computation of m and V is simple since we have formula for the joint distribution of Zi,N , Zj,N . The matrix V are several papers and they are tabulated for 2 ≤ N ≤ 20 and approximated for 20 ≤ N ≤ 50. Nowdays, if you are using a computer program, the values of V are part of the procedure. Let X = (X1, X2, . . . , XN ). We estimate σ with (4.16) σ̂N = N X `=1 c`X`,N , where (c1, c2, . . . , cN ) = mV−1 mT V−1m . We note that σ̂N is the minimum variance linear estimator for σ when µ is known. We also note that σ̂N is a linear combination of the order statistics X1,N ≤ X2,N ≤ . . . ≤ XN,N . Now the Shapiro–Wik (1965) statistic is TN,1 = σ̂2 N N X i=1 (Xi − X̄N )2 . Shapiro and Francia (1972) suggested replacing V with I, the N ×N identity matrix. They argued that if N is large, the Xi,N and EXj,N are close to being independent if |i − j| is large. Hence V can be approximated with a diagonal matrix. In this case TN,2 = N X i=1 biXi,N !2 N X i=1 (Xi − X̄N )2 , where (b1, b2, . . . , bN ) = m mm . In statistical software, usually TN,2 is used for N ≥ 50. By definition, the coordinates of m are EZi,N and by the probability integral transformation, EZi,N ≈ Φ−1 i N + 1 , which leads to TN,3 = N X i=1 Xi,N − X̄N SN − Φ−1 i N + 1 2 , with S2 N = 1 N − 1 N X i=1 (Xi − X̄N )2 . The statistic is rarely used in practice, but it is a good theoretical tool to understand TN,2 for large N since they are asymptotically independent. The meaning of TN,3 is clear. First we standardize the observation, and we work with (Xi − X̄N )/SN . Under H0 of (4.15) they should be close of being independent standard normal random variables and therefore (4.17) E Xi,N − X̄N ) SN ≈ Φ−1 i N + 1 , 1 ≤ i ≤ N.

471. MATHEMATICAL STATISTICS 53 The statistic TN,3 checks the validity of (4.17). We recall the quantile function QN (t) = F−1 N (t), the inverse of the empirical distribution function. The relation (4.18) TN,3 ≈ N Z N/(N+1) 1/(N+1) QN (t) − X̄N SN − Φ−1 (t) 2 dt, so the theory of quantile processes can be used to get the properties of TN,3. We note that N1/2 QN (t) − X̄N SN − Φ−1 (t) is a parameter estimated quantile process. So TN,3 is a Cramér–von Mises type statistic with estimated parameters. The details of our heuristic arguments are given in DeWet and Venter (1972). The theory of quantile processes yields (4.19) TN,3 − aN D → ∞ X i=1 1 i (N2 i − 1), where N1, N2, . . . are independent standard normal random variables and aN = Z N/(N+1) 1/(N+1) t(1 − t) φ2 (Φ−1(t)) dt, where φ(t) = 1 √ 2π e−t2/2 , the standard normal density function. Since aN → ∞, the result in (4.19) explains why the critical values of the Shapiro–Wilk (Shapiro–Francia) are increasing with the sample size. For those who like theory, the limit in (4.19) is coming from the Karhunen–Loéve expansion of square integrals, but now the expansion of Z 1 0 B2(t) − t(1 − t) φ2 (Φ−1(t)) dt is used. 4.7. Jarque–Bera test. If N is a standard normal random variable, then EN3 = 0 and EN4 = 3. The Jarque-Bera test checks if the standardized variables satisfy these two properties. Let AN,1 = 1 N N X i=1 Xi − X̄N SN 3 and AN,2 = 1 N N X i=1 Xi − X̄N SN 4 . These are estimates for the third and fourth moments of the standardized observations. If H0 of (4.15) holds, then N1/2[AN,1, AN,2 − 3] converge in distribution to independent normal random variables. The statistic is defined by TN, 4 = N 6 A2 N,1 + 1 4 (AN,2 − 3)2 . If the observations are normal, then TN,4 D → χ2 (2), where χ2(2) is a χ2 random variable with 2 degrees of freedom. The Jarque–Bera test is probably the most often used statistic to test for normality. It is easy to compute and it has power against, for example t–distribution alternatives. However, it is easy to find a distribution which cannot be distinguished from the normal distribution by the Jarque–Bera test since only checks the third and fourth moments. It is interesting to note that TN,4 can be used for a large class of time series. Probably this is the only test which is not effected by dependence for a large a large class random

472. 54 LAJOS HORVÁTH variables. 4.8. Anderson–Darling test. The Anderson–Darling test is just a weighted version of the parameter estimated Cramér–von Mises statistic. Let TN,6 = Z ∞ −∞ [N1/2(FN ((t − X̄N )/SN ) − Φ(t))]2 Φ(t)(1 − Φ(t)) dt. Similarly, under the null hypothesis, the limit does not depend on the unknown mean and variance. Using the weak convergence of the parameter estimated empirical processes, one can show that there is a Gaussian process Γ(t) such that TN,6 D → Z ∞ −∞ Γ2 (t)dt. The limit can be represented as an infinite weighted χ2 random variable, Z ∞ −∞ Γ2 (t)dt = ∞ X i=1 λiN2 i , where λ1 λ2 . . . and N1, N2, . . . are independent, standard normal random variables. The weights can be easily computed from the covariance of Γ(x). So the asymptotic critical values can be easily tabulated. This makes it different from the Shapiro–Wilk test, where the critical values depend on the sample size. The rate of converge convergence to the limit is faster for the Anderson–Darling statistic. It is observed empirically that the power of the Anderson–Darling statistics is close to the power of the Shapiro–Wilk statistic. 4.9. QQ and PP plot. These plots are used as visual tools to evaluate normality. The QQ plot is the more popular one. In case of the QQ plot we plot the plot the points ((Xi,N − X̄)/SN , Φ−1(i/(N + 1))), 1 ≤ i ≤ N. If the normality holds, then these points should be, approximately, on the diagonal of the unit square. Just looking at the a picture, we cannot assess critical values to our guess. We should need to see that in case of normality how far (Xi,N − X̄N )/SN can be from Φ−1(i/(N + 1)) for all 1 ≤ i ≤ N. Some calculations yield that var((Xi,N − X̄N )/SN − Φ−1(i/(N + 1)) is much larger for small and large i, than around i = N/2. So even under the null hypothesis the deviation from the straight line is not constant, it is higher around (0, 0) and (1, 1) of the diagonal. So it is not easy to see if the normality holds or not. The PP plot plots (FN ((x − X̄N )/SN ), Φ(x)), −∞ x ∞. So we are just looking at a visualization of the parameter estimated empirical process. As we have seen, the variance of FN ((x − X̄N )/SN ) − Φ(x) is smaller for small and large x’s, than in the middle, so it is opposite of the behaviour of the QQ plot. 4.10. Laplace transform. The moment generating function (Laplace transform) of a standard normal random variable is E exp(tX) = exp t2 2 , where X stands for a standard normal random variable. The moment generating function uniquely determines the normal distribution. Let MN (t) = 1 N N X i=1 exp(tXi) be the empirical moment generating function. The law of large numbers implies that for all a sup −a≤t≤a

474. MN (t) − exp((t − µ)2 /(2σ2 ))

476. ; → 0 a.s.

477. MATHEMATICAL STATISTICS 55 if the normality holds. The central limit theorem implies that for all t, zN (t) = N1/2 (MN ((t − X̄N )/(2§2 N )) − exp(t2 /2)) is asymptotically normal. Even more is true, for all a, the process zN (t) converges in C[−a, a], the space of continuous functions on [−a, a], to a Gaussian limit, so normality is rejected for large values of TN,7 = sup −a≤≤a |zN (t)| and TN,8 = Z a −a z2 N (t)dt. We need to use bootstrap to get critical values for TN,7 and TN,8, since both the depend on the true values of µ and σ2. 4.11. Fourier transform. . The characteristic function (or Fourier transform) of X, a standard normal random variable is simple, E exp(itX) = exp − t2 2 , where i2 = −1. Similarly to the empirical moment generating function, we can estimate the characteristic function from the data by CN (t) = 1 N N X i=1 exp(itXi). As in Section ??, cN (t) = N1/2{CN ((t − X̄N )/SN ) − exp(−t2/2)} converges in the space of continuous complex valued functions. We reject normality for large values of TN,9 = sup −a≤t≤ kcN (t)k and TN,10 = Z a −a kcN (t)k2 dt, with some a, where k·k denotes the norm of complex numbers. As in case of the momen generating based approach, resampling is needed to get critical values. In this section we demonstrated how characterizations of the exponential and the normal distribution can be used to construct tests. Since there are more characterizations, much more tests are in the literature. The power of the tests depend on the alternative but some tests are usually performs well for a large class of alternative distributions. The power investigation is usually based on simulations. We showed that the asymptotic properties of the density estimator are not effected by estimation of some parameters. However, using densities is not recommended, the rate of convergence is slow and the choice of the window (smoothing parameter) is not obvious. Changing the window might change the outcome of your test. There are several methods, like the Box–Cox transformation, to transform the observations into normal ones. However, these transformations are usually data driven, so we start with a (hidden) parameter estimation. We have seen that estimating parameters causes changes in the critical values, we should be careful how the transformation is done.

478. 56 LAJOS HORVÁTH 5. Multivariate methods Let X1, X2, . . . , XN be independent and identically distributed random variables in Rd with distribution function F(x) and density function f(x). The univariate method can be easily generalized to d dimension with the help of the multivariate empirical distribution FN (x) = 1 N N X i=1 I{Xi,1 ≤ x1, Xi,2 ≤ x2, . . . , Xi,d ≤ xd} = 1 N N X i=1 I{Xi,1 ≤ x1}I{Xi,2 ≤ x2} · · · I{Xi,d ≤ xd}, X i = (Xi,1, Xi,2, . . . , Xi,d) and x = (x1, x2, . . . , xd). There are several problems we discussed in case of univariate distributions, which can be asked in case of multivariate distributions: (i) testing if F(x) = F0(x), where F0(x) is a given distribution function (ii) estimation of the density f(x) (iii) testing for normality (if f exists, then usually normality is assumed and this assumption is sometimes checked. Note that a d dimensional normal distribution has d + d(d + 1)/2 parameters, so it will be hard to reject normality for large d.) However, there are few additional questions which are related to the multivariate distributions (iv) testing the independence of the coordinates of the observations (v) test if there is a matrix A such that F(x) = G(xAx), where K is a given univariate distribution function (elliptic distribution) (vi) statistical inference on conditional mean, like E(X1,1|X1,2 = x2, . . . , X1,d which is a d − 1 vari- ate function. By the properties of the conditional expected value, the density of (X1,2, . . . , X1,d) appears in the formula for the conditional expected value, so the estimation method is similar to multivariate density estimation (vii) the coordinates of the observations have the same distribution (viii) estimation of the copula. (The copula C(x) is defined on [0, 1]d and F(x) = C(F1(x1), F2(x2), . . . , Fd(xd)), where Fi, 1 ≤ i ≤ d are the marginal distribution functions, X = (x1, x2, . . . , xd).) Pearson’s χ2 test can be used if d is small. Let assume that the axes are divided in to K intervals. Then for the d dimensional data we have Kd cells. So if do not want to have empty cells, as usually recommended for the χ2 test, we need that N Kd. If, for example, N = 100 and K = 5,then we get that d ≤ 2.8, so it is only working for univariate and bivariate data. Thus we need very large sample sizes in case of high dimensional data. We discuss in Section 5.4 why and how the χ2 test can be used for high dimensional data when we want to test that the underlying distribution is a given distribution function. 5.1. Testing that the underlying distribution is F0. We wish to test that H0 : F(x) = F0(x) for all x ∈ Rd , where F0 is a given function. We can follow the univariate method and we can define the d– dimensional Kolmogorov–Smirnov statistic TN,1 = N1/2 sup x∈Rd |FN (x) − F0(x)| and the Camér–von Mises statistic TN,2 = N Z Rd (FN (x) − F0(x))2 f0(x)dx, where f0 is the density function under the null hypothesis. In contrast to the univariate case, the statistics TN,1 and TN,2 are not distribution free, and even there limit distribution will depend on

479. MATHEMATICAL STATISTICS 57 F0. However, nonparametric bootstrap can be used to get critical values for these statistics. 5.2. Testing for multivariate normality. If Y is d dimensional normal random variable with parameters µ and Σ, then N = Σ−1/2 (Y − µ) is a d dimensional normal random variable, i.e. EY = 0 and E(NNT = I, where I is the d × d identity matrix. The multivariate normal distribution is uniquely determined by its moment generating and characteristic functions. Let hcdot, ·i denote the inner product in Rd and ktk2 = ht, ti is the corresponding norm. The moment generating function (Laplace transform) of N is M(t) = exp 1 2 ktk2 and the characteristic function (Fourier transform) F(t) = exp − 1 2 ktk2 . The unknown parameters can be easily estimated from the sample with µ̂N = 1 N N X i=1 Xi and Σ̂N = 1 N − 1 N X i=1 (Xi − µ̂N )(Xi − µ̂N ) . Using the estimates µ̂N and Σ̂N we empirically transform the observations via Yi = Σ̂ −1/2 N (Xi − µ̂N ) into approximately standard normal random vectors. Hence we can use the statistics with some a 0 TN,3 = N1/2 sup ktk≤a

484. 1 N N X i=1 exp(ht, Yii) − M(t)

489. , TN,4 = N Z ktk≤a 1 N N X i=1 exp(ht, Yii) − M(t) !2 dt, and TN,5 = N Z ktk≤a 1 N N X i=1 exp(iht, Yii) − F(t) !2 dt, where i2 = −1. The parametric bootstrap is suggested to get the critical values for TN,3, TN,4 and TN,5. There are other equivalent forms for these statistics. For example, TN,6 = N1/2 sup ktk≤a

494. 1 N N X i=1 exp(ht, Xii) − exp t µ̂N + 1 2 (t − µ̂N ) Σ̂N (t − µ̂N )

499. can be used. In this case instead of transforming the observations, we estimate the parameters in the multivariate normal distribution, so we have a parameter estimated moment generating function. There are also analogues of the Jarque–Bera test for the multivariate case but due to the slow convergence to the limit they are not as popular as the univariate version.

500. 58 LAJOS HORVÁTH 5.3. Test for independence of the coordinates. There are several tests for independence when d = 2 based on nonparametric methods. We discuss a method based on the empirical distribution function. We have independence if and only if the joint distribution function is the product of the marginal distributions. So we reject for large values of TN,7 = N1/2 sup x∈Rd

505. FN (x) − d Y i=1 FN,i(xi)

510. , where FN,i(xi) = 1 N N X `=1 I{X`,i ≤ xi}, x = (x1, x2, . . . , xd). Nonparametric bootstrap can be used again to get critical values for TN,7. 5.4. χ2 test with large number of cells. Let X1, X2, . . . , XN be independent and identically distributed random variables in Rd. We also assume Rd covered by K = K(N) disjoint cells B1, B2, . . . , BK. We replace the observations with the count data Yj = N X i=1 I{Xi ∈ Bj}, 1 ≤ j ≤ K. Now (Y1, Y2, . . . , YK) is multinomial (p1, p2, . . . , pK, N), p1 + p2 + . . . + pK = 1. Under our null hypothesis H0 : pj = P{X1 ∈ Bj} = pj,0, 1 ≤ j ≤ K. Person’s χ2 statistic is defined as TN,8 = K X j=1 (Yj − Npj,0)2 Npj,0 . If K is fixed, then (5.1) TN,8 D → χ2 (K − 1), where χ2(K − 1) is a χ2 random variable with K − 1 degrees of freedom. Now we consider the case when K = K(N) → ∞. This is a typical if d, the dimension of the parameters is large. If pj(N) = P{X1 ∈ Bj} → 0 and Npj,N is going to ∞ but slowly enough, then (5.2) Yj,N ≈ Pj(Npj,N ), 1 ≤ j ≤ K, where Pj(Npj,N ), 1 ≤ j ≤ K are independent random variables, Pj(Npj,N ) is a Poisson random variable. Since Yj is a binomial random variable with parameters pj,N and N, so using the binomial formula, (5.2) can be proven. It is somewhat harder to prove the independence. Heuristically, we have only few (with respect to N) in the cells, so changing some of them, it will not effect the others. We also know that the Poisson can be approximated with normal distribution, so Yj − Npj,0 (Npj,N )1/2 ≈ Nj, 1 ≤ j ≤ K, where N1, N2, . . . , NK are independent standard normal random variables. Thus we get a χ2 approximation for TN,8: TN,8 ≈ χ2 (K). Since K = K(N) → ∞, we can approximate χ2(K) with the normal distribution. Thus we heuristically established that (5.3) TN,8 − K (2K)1/2 D → N,

511. MATHEMATICAL STATISTICS 59 where N is a standard normal random variable. Now we can compare (5.1) and (5.3). There is little difference between them, since we can approximate the χ2 distribution with normal, if the degree of freedom is large. There is one interesting observation: if K is a fixed number, we should not have empty cells. Due to the Poisson approximation in (5.2), we can have empty cells, even there number can go to ∞ as K = K(N) → ∞. You can interpret (5.3) that the Pearson test remains valid if K = K(N) → ∞ slowly enough. 5.5. Ridge, lasso and elastic nets estimates. We look at the regression yi = x i β + , 1 ≤ i ≤ N, where xi ∈ Rd, β ∈ Rd. As usual, we observe {yi, xi, 1 ≤ i ≤ N} and {i, 1 ≤ i ≤ N} are random errors satisfying (5.4) Ei = 0 and 0 σ2 = E2 i ∞, 1 ≤ i ≤ N. We have the standard assumptions in (5.4) for linear models (regressions). We discussed the classical least squares to estimate the unknown β. This method works well if d is small, certainly according to the theory of least squares, d cannot depend on N and to run statistical programs we need that d N. Recently, there has been interest when d N, i.e. model is over fitted. If there is linear dependence between {xi, 1 ≤ i ≤ N}, some of the coordinates of β must be eliminated. However, this is not so simple and it is hard to do the automatically. Ridge, lasso and elastic nets estimates are popular approaches to this problem. All of them use the least squares but an additional penalty term is added to control the number of the parameters. The classical leat squares minimizes M(β) = 1 N N X i=1 (yi − x i β)2 . Let β = (β1, β2, . . . , βd). The ridge estimator minimizes (5.5) M(β) + λ N X i=1 β2 i , so the penalty is a quadratic. We note that the vector minimizing M(β)+λ PN i=1 β2 i can be written explicitly. We need to choose a suitable λ to get “desirable”. Usually we want to choose λ that the estimates for some (or several) βi’s will be 0. Usually, λ is a function of N and d. This method is used regularization in numerical mathematics. We discuss the cross validation to choose a data based λ. The lasso is not too different from the ridge but the penalty term is the L1 norm of β, so the location of the minimum of M(β) + λ N X i=1 |βi|. The elastic nets estimate combines the ridge and the lasso penalty terms, so the estimator is the minimizer of M(β) + λ α N X i=1 |βi| + (1 − α) N X i=1 β2 i ! . In case of the elastic net, we need to choose λ and α.

512. 60 LAJOS HORVÁTH Cross validation is used often to get a suitable data driven estimate. Our goal is that λ should give the smallest “prediction” error. We discuss discuss the method for the ridge estimator. Let β̂(λ) be the minimizer of (5.5). We wish to choose λ which minimizes (5.6) E N X i=1 yi − x i β̂(λ) 2 . The K fold (or leave out N − K + 1 ) cross validation starts with putting the integers 1, 2, . . . , N into K disjoint subsets F1, F2, . . . , FK. Let Bk = F1∪ . . . ∪Fk−1∪Fk+1∪ . . . ∪FN , 1 ≤ k ≤ K, i.e. the elements of Fk are left out. Using (yi, xi), i ∈ Bk we estimate β based on (5.5) resulting in β̂ (−k) . Now using β̂ (−k) we validate if the linear model holds for (yi, xi), i ∈ Fk with 1 nk X i∈Fk yi − x i β̂ (−k) 2 , where nk is the number of the elements of Fk. Now the cross validation estimator for λ minimizing (5.6) is the minimizer of K X k=1 1 nk X i∈Fk yi − x i β̂ (−k) 2 . If K = N this is the leave out one cross validation. The ridge, lasso and elastic nets methods are not restricted to linear models. Let β = (β1, β2, . . . , βd) ∈ Rd be the parameter and the estimator for β is the minimizer of the random process UN (β) (likelihood method, least absolute deviation, weighted least squares and so on), which is computed from the sample for any value of β. Now we define the corresponding the ridge estimator as the minimizer of UN (β) + λ d X i=1 β2 i . We can define the lasso and elastic nets estimators analogously. We can also use cross validation to get the “best” value for λ from the sample. These methods sometimes called dimension reduction, since d parameters are replaced with much less since we force the estimator for the parameter several zeros. We note that these estimators are biased, this is clear for the ridge estimator. The idea is that we want to minimize the mean square error and allowing small bias we substantially reduce the variance and hence the mean square error. Also we get a simpler model. You can look at these estimators as smoothed estimates. To illustrate this point, we look at the “functional” regression yi = h(xi) + i, 0 ≤ xi ≤ 1, 1 ≤ i ≤ N, where the errors {i, 1 ≤ i ≤ N} have zero mean and finite variance. It is also assumed that the unknown h is in a given class of functions H. A popular method is the find h which minimizes (5.7) 1 N N X i=1 (yi − h(xi))2 + λ Z 1 0 (h00 (u))2 du. It is clear from this formulation that the second term is a penalty for the non smoothness of h. For example, if we connect the points (yi, xi), 1 ≤ i ≤ N linearly with a broken line, then the first

513. MATHEMATICAL STATISTICS 61 term in (5.7) is 0 but the second term will be ∞. Choosing H properly, the solution of (5.7) is the smoothing spline. References [1] Andrews, D.W.K.: Tests for parameter instability and structural change with unknown change point. Econo- metrica 61(1993), 821–856. [2] Bickel, P.J. and Rosenblatt, M.: On some global measures of the deviations of density function estimates. Annals of Statistics, 1(1973), 1071–1095. [3] Csörgő, M. and Horváth, L.: Central limit theorems for Lp–norms of density estimators. Probability Theory and Related Fields, 80(1988), 269—291. [4] Darling, D.A. and Erdős, P.: A limit theorem for the maximum of normalized sums of independent random variables. Duke Mathematical Journal 23(1956), 143–155. [5] DasGupta, A.: Asymptotic Theory of Statistics and Probability. Springer, 2008. [6] De Wet, T. and Venter, J.H.: Asymptotic distributions of certain test criteria of normality. South African Statistical Journal, 6(1972), 135–149. [7] Fan, J. and Gijbels, I.: Local Polynomial Modelling and Its Applications. First Edition, Chapman and Hall, London, (1996). [8] Jarque, C.M. and Bera, A.K.: Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6(1980), 255—259. [9] Parzen, E.: On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(1962) 1065—1076. [10] Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(1956) 832—837. [11] Scott, D.W.: Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, 2015. [12] Shapiro, S.S. and Francia, R.S.: An approximate analysis of variance test for normality. Journal of the American Statistical Association, 67(1972), 215—216. [13] Shapiro, S.S. and Wilk, M.B.: An analysis of variance test for normality (complete samples). Biometrika, 52(1965) 591—611. [14] Shorack, G.R. and Wellner, J.A.: Empirical Processes with Applications to Statistics. Wiley, 1986. [15] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, London, Chapman and Hall, 1986.

mathstat.pdf

More Related Content

Similar to mathstat.pdf (20)

Recently uploaded (20)

mathstat.pdf