Some sampling techniques for big data analysis
Jae Kwang Kim
Iowa State University & KAIST
May 31, 2017
Joint work with Zhonglei Wang
Example
Let’s look at an artificial finite population.
ID Size of farms yield(y)
1 4 1
2 6 3
3 6 5
4 20 15
Parameter of interest: Mean yield of the farms in the population
Assume that farm sizes are known.
Instead of observing N = 4 farms, we want to select a sample of size
n = 2.
Example - Continued
6 possible samples
case sample ID sample mean sampling error
1 1, 2 2 -4
2 1, 3 3 -3
3 1, 4 8 2
4 2, 3 4 -2
5 2, 4 9 3
6 3, 4 10 4
Each sample has sampling error.
Two ways to select one of the six possible samples.
Nonprobability sampling: select a sample subjectively (e.g., using the size of the farms).
Probability sampling : select a sample by a probability rule.
Example - Continued : Probability sampling 1
Simple Random Sampling : Assign the same selection probability to all
possible samples
case sample ID sample mean(¯y) sampling error selection probability
1 1, 2 2 -4 1/6
2 1, 3 3 -3 1/6
3 1, 4 8 2 1/6
4 2, 3 4 -2 1/6
5 2, 4 9 3 1/6
6 3, 4 10 4 1/6
In this case, the sample mean(¯y) has a discrete probability distribution.
ȳ ∈ {2, 3, 4, 8, 9, 10}, each value with probability 1/6.
Expected value of sampling error
E(ȳ − Ȳ) = (−4 − 3 + 2 − 2 + 3 + 4)/6 = 0
Thus, the estimator is unbiased.
Variance of the sampling error is (16 + 9 + 4 + 4 + 9 + 16)/6 = 9.67.
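This small example can be verified directly in code; a minimal sketch (the yields and sample size are taken from the example above):

```python
from itertools import combinations

yields = {1: 1, 2: 3, 3: 5, 4: 15}            # ID -> yield, from the example
Y_bar = sum(yields.values()) / len(yields)    # population mean = 6

samples = list(combinations(yields, 2))       # all 6 samples of size n = 2
errors = [sum(yields[i] for i in s) / 2 - Y_bar for s in samples]

print(sum(errors) / 6)                # 0.0  -> the estimator is unbiased
print(sum(e**2 for e in errors) / 6)  # 9.67 -> variance of the sampling error
```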
Probability Sampling
Definition : For each element in the population, the probability that
the element is included in the sample is known and greater than 0.
Advantages
1 Excludes the subjectivity of sample selection.
2 Removes sampling bias (or selection bias).
What is sampling bias? (θ: true value, θ̂: estimate of θ)
(sampling) error of θ̂ = θ̂ − θ = (θ̂ − E(θ̂)) + (E(θ̂) − θ) = variation + bias
In nonprobability sampling, the variation is 0 but there is a bias. In probability sampling, there is variation but the bias is 0.
Probability Sampling
Main theory
1 Law of Large Numbers : ˆθ converges to E ˆθ for sufficiently large
sample size.
2 Central Limit Theorem : ˆθ follows a normal distribution for sufficiently
large sample size.
Additional advantages of probability sampling with large sample :
1 Improve the precision of an estimator
2 Can compute confidence intervals or test statistical hypotheses.
With the same sample size, we may have different precision.
Survey Sampling
A classical area of statistics concerning ....
1 Drawing a probability sample from a target population
2 Using the sample to make inference about the population
Why survey sampling?
1 Cost consideration: Data collection often involves cost. Smaller sample
means less money.
2 Representativeness: Probability sampling is the only scientifically
justified approach to data collection for population studies.
3 Computational efficiency: Sample takes less memory and storage, and
less computing time.
Survey Sampling in the era of Big Data
Challenges
1 Decreasing response rates: strict probability sampling is not possible
and representativeness may be weakened.
2 Cost model changes: data collection cost is not proportional to size.
3 Many competitors: survey sampling is no longer the only way of
collecting data. Abundant data sources.
Paradigm Shift in Survey Sampling
Missing data framework:
From sample to population ⇒ From observed to unobserved
Combine survey data with other existing data
Survey Data + Auxiliary Data ⇒ Synthetic Data
Sampling techniques are still useful in handling big data (e.g., reservoir sampling, balanced sampling, inverse sampling, calibration weighting, statistical inference with informative sampling, etc.).
In this talk, we will introduce ...
1 Reservoir sampling and its variants
2 Inverse sampling
3 Synthetic data imputation (or survey integration)
Topic 1: Reservoir Sampling
McLeod and Bellhouse (1983)’s idea:
1 Let S = {1, · · · , n} be the initial reservoir sample.
2 For k = n + 1, n + 2, · · · , select element k with probability n/k.
3 If element k is selected, replace a randomly chosen element of S with it; otherwise keep S and return to Step 2.
Step 1 is the initialization step, Step 2 is the inclusion step, and Step 3 is
the removal step.
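A minimal one-pass sketch of the algorithm (the function name and stream interface are illustrative):

```python
import random

def reservoir_sample(stream, n):
    """One-pass SRS of size n (McLeod and Bellhouse, 1983)."""
    S = []
    for k, item in enumerate(stream, start=1):
        if k <= n:
            S.append(item)                  # Step 1: fill the initial reservoir
        elif random.random() < n / k:       # Step 2: include element k w.p. n/k
            S[random.randrange(n)] = item   # Step 3: replace a random element of S
    return S
```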
Justification
Let Sk be the set of sample elements selected from Uk = {1, · · · , k}.
If m ≤ n,
P(m ∈ S_k) = 1 × n/(n + 1) × (n + 1)/(n + 2) × · · · × (k − 1)/k = n/k.
If m > n,
P(m ∈ S_k) = (n/m) × m/(m + 1) × · · · × (k − 1)/k = n/k.
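The n/k result can also be checked numerically; a small Monte Carlo sketch with n = 2 and k = 10:

```python
import random
from collections import Counter

n, k, reps = 2, 10, 100_000
counts = Counter()
for _ in range(reps):
    S = list(range(1, n + 1))              # initial reservoir {1, ..., n}
    for m in range(n + 1, k + 1):
        if random.random() < n / m:        # include element m w.p. n/m
            S[random.randrange(n)] = m     # remove one element at random
    counts.update(S)

for m in range(1, k + 1):
    print(m, counts[m] / reps)             # each close to n/k = 0.2
```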
Remark
Selects an SRS (simple random sample) of size n from a stream of
arbitrary size in one pass
No need to know the population size in advance
Very useful for big data sampling when the population is updated
continuously.
Improving reservoir sampling
Goal: We wish to reduce the variance of θ̂, which is computed from the
reservoir sample.
Method 1: Balanced reservoir sampling
We wish to impose a balanced condition on the reservoir sample:
(1/n) Σ_{i∈S} z_i = (1/N) Σ_{i∈U} z_i.
The variable z is called a control variable.
To achieve the balanced condition, we can modify Step 3 (removal
step) as follows:
1 Let z̄_k be the mean of z over the population up to element k, U_k = {1, · · · , k}.
2 For each element in the current sample S, compute
D(i) = |z̄_k − z̄_S^(−i)|, i ∈ S,
where z̄_S^(−i) is the sample mean excluding z_i.
3 Instead of removing one element at random in Step 3, remove the element i* with the smallest value of D(i).
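A sketch of the modified removal step (assuming the running population mean z̄_k is tracked alongside the reservoir; the function name is illustrative):

```python
def balanced_removal_index(z_sample, z_bar_k):
    """Index i* whose removal keeps the sample mean of z closest to the
    running population mean z_bar_k, i.e. the smallest D(i)."""
    n = len(z_sample)
    total = sum(z_sample)
    # D(i) = |z_bar_k - sample mean of z excluding element i|
    D = [abs(z_bar_k - (total - z) / (n - 1)) for z in z_sample]
    return D.index(min(D))
```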
A numerical experiment
Figure: Trace of sample means of size n = 100 from a population of size
N = 10, 000. The left side is the trace of the sample mean from the classical
reservoir sampling and the right side is the trace of the sample mean from the
proposed reservoir sampling.
Remark
The balanced reservoir sampling may provide an efficient sample for
estimating the population mean, but not the population distribution.
We may wish to provide an efficient representative sample that
provides consistent estimates for various population parameters.
Method 2: Stratified reservoir sampling
Assume that the finite population consists of H subpopulations, called
strata. The within-stratum variance is small.
For scalar y with bounded support, we can predetermine the stratum
boundaries by creating H intervals with equal length.
We want to obtain a reservoir sample that achieves stratified random
sampling with proportional allocation.
The basic idea is to use Chao (1982)'s sampling to select units with probability proportional to w_i = N_h/n_h, where N_h and n_h are updated continuously, so that an equal-probability sample is obtained within each stratum.
Stratified reservoir sampling
Let Uk = {1, · · · , k} be the finite population up to element k.
Let π(k; i) be the first-order inclusion probability that unit i is
selected from Uk.
We want to have
Σ_{i=1}^k π(k; i) = n and π(k; i) ∝ w_{k,i},
where w_{k,i} = N_{k,h}/n_{k,h} for the stratum h to which unit i belongs, and N_{k,h} and n_{k,h} are the population size and the sample size, respectively, corresponding to stratum h.
Stratified reservoir sampling using Chao’s method
At time k + 1, we update the reservoir sample as follows:
1 Select unit k + 1 with probability π(k + 1; k + 1).
2 If unit k + 1 is not selected, retain the sample at time k.
3 If unit k + 1 is selected, then remove one unit from the current
sample with the removal probability
R_kj = 0 if j ∈ A_k; T_kj if j ∈ B_k; (1 − T_k)/(n − L_k) if j ∈ C_k,
where A_k = {i ∈ S_k : π(k; i) = π(k + 1; i) = 1},
B_k = {i ∈ S_k : π(k; i) = 1, π(k + 1; i) < 1},
C_k = {i ∈ S_k : π(k; i) < 1, π(k + 1; i) < 1},
T_kj = {1 − π(k + 1; j)}/π(k + 1; k + 1), T_k = Σ_{j∈B_k} T_kj, and L_k = |A_k ∪ B_k|.
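A sketch of how these removal probabilities could be computed, with the inclusion probabilities passed as dictionaries keyed by reservoir unit (the interface is illustrative):

```python
def chao_removal_probs(pi_k, pi_k1, pi_new):
    """Removal probabilities R_kj for the units in the current reservoir.

    pi_k[j], pi_k1[j]: inclusion probabilities of reservoir unit j at
    times k and k+1; pi_new = pi(k+1; k+1) for the incoming unit.
    Assumes n > L_k."""
    n = len(pi_k)
    # B_k: certainties at time k that are no longer certainties at time k+1
    T = {j: (1 - pi_k1[j]) / pi_new
         for j in pi_k if pi_k[j] == 1 and pi_k1[j] < 1}
    L_k = sum(1 for j in pi_k if pi_k[j] == 1)      # |A_k U B_k|
    T_k = sum(T.values())
    R = {}
    for j in pi_k:
        if pi_k[j] == 1 and pi_k1[j] == 1:          # j in A_k: never removed
            R[j] = 0.0
        elif j in T:                                # j in B_k
            R[j] = T[j]
        else:                                       # j in C_k
            R[j] = (1 - T_k) / (n - L_k)
    return R
```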
Simulation study
Finite population {yi : i ∈ U} is generated by
yi ∼ N(10, 10)
independently for i ∈ U, where U = {1, . . . , N} with N = 5, 000.
Sample size n = 1, 000.
Three reservoir methods are considered.
Classical reservoir sampling approach (McLeod and Bellhouse, 1983).
Balanced reservoir sampling approach with yi being the control variable.
Stratified reservoir sampling approach: 51 strata partitioned by equally
spaced knots {k1, . . . , k50}, where k1 and k50 are the 1% and
99% sample quantiles of {yi : i = 1, . . . , n}.
Simulation study (Cont’d)
We are interested in estimating the following finite population
parameters.
Mean: m_k = k^(−1) Σ_{i=1}^k y_i, for k = n + 1, . . . , N.
Probability: P_{k,j} = k^(−1) Σ_{i=1}^k I(y_i < q_j), for j = 1, . . . , 5, where
q_1, . . . , q_5 are the 5%, 27.5%, 50%, 72.5% and 95% quantiles of N(10, 10).
1,000 simulations are conducted.
Bias of estimating mk
[Figure omitted: point-wise bias of the estimator of m_k versus k − n for the Classical, Balanced, and Stratified methods. The black line is the point-wise mean over the simulations, and the blue lines are the point-wise 5% and 95% quantiles.]
Bias of estimating Pk,1
[Figure omitted: point-wise bias of the estimator of P_{k,1} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,2
[Figure omitted: point-wise bias of the estimator of P_{k,2} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,3
[Figure omitted: point-wise bias of the estimator of P_{k,3} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,4
[Figure omitted: point-wise bias of the estimator of P_{k,4} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,5
[Figure omitted: point-wise bias of the estimator of P_{k,5} versus k − n for the Classical, Balanced, and Stratified methods.]
Summary statistics for k = 5, 000
Table: Summary statistics for k = 5,000. The unit for the values is 10^(−3).

             Mean              q1                q2
             Bias     S.E.     Bias     S.E.     Bias      S.E.
Classical     0.13    88.36     0.10    5.84       0.26    12.84
Balanced      0.83     1.02    50.19    6.84     165.66    11.77
Stratified  -22.47    17.89    -0.91    1.61       3.48     3.06

             q3                q4                q5
             Bias     S.E.     Bias     S.E.     Bias      S.E.
Classical    -0.18    14.05    -0.47    12.80      0.19     6.04
Balanced     -2.61     7.07  -164.25    11.81    -50.81     6.67
Stratified    3.56     3.00     1.74     2.59      1.89     1.48

S.E.: standard error of the 1,000 Monte Carlo estimates
Discussion
Balanced reservoir sampling provides the most efficient estimation for
the population mean, but it provides biased estimates for other
parameters.
Stratified reservoir sampling provides more efficient estimates than
the classical reservoir sampling for various parameters.
It can be shown that, under some conditions, the stratified reservoir
sampling satisfies
V(θ̂) = O(1/(nH)),
where H is the number of strata.
Topic 2: Inverse sampling
Goal: Obtain a simple random sample from the current sample
obtained from a complex sampling design.
Originally proposed by Hinkins et al (1997) and Rao et al (2003) in
the context of stratified cluster sampling.
Note that the inverse sampling can be treated as a special case of
two-phase sampling:
1 Phase One: Original sample obtained from complex sampling design.
2 Phase Two: Select a subsample of the Phase One sample so that the final
sample has equal first-order inclusion probabilities.
Inverse sampling for big data analysis
Two cases
1 Big data = population
2 Big data ≠ population
In case one, the sampling problem is easy.
In case two, we need to adjust for the bias associated with the big data
itself.
If the data is not obtained from a probability sample, it may be
subject to selection bias.
We will discuss how to remove the selection bias of the big data using
the inverse sampling idea.
Inverse sampling for big data analysis
Assume that the big data (D) is a subset of the finite population.
Let δ_i be the indicator function
δ_i = 1 if i ∈ D, 0 otherwise.
The observations in D can be viewed as a random sample from f(y | δ = 1).
However, we are interested in obtaining a sample from f(y).
How can we compute E(Y) = ∫ y f(y) dy using the observations from
f(y | δ = 1)?
Importance sampling idea
Goal: We are interested in estimating E(Y) = ∫ y f(y) dy using
y_1, · · · , y_n from f(y | δ = 1).
Note that
E(Y) = ∫ y [f(y)/f(y | δ = 1)] f(y | δ = 1) dy ≅ (1/n) Σ_{i=1}^n w_i y_i,
where
w_i ∝ f(y_i)/f(y_i | δ_i = 1).
The weight w_i is called the importance weight. It measures how
representative y_i is of f(y) when y_i is generated from f(y | δ = 1).
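In code, the self-normalized form of this estimator is short; a minimal sketch:

```python
import numpy as np

def importance_mean(y, w):
    """Estimate E(Y) from draws y_i ~ f(y | delta = 1) with importance
    weights w_i proportional to f(y_i) / f(y_i | delta_i = 1)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)  # self-normalized: w known up to a constant
```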
Inverse sampling for big data analysis
Assume for now that f (y) and f (y | δ = 1) are of known form.
From each i ∈ D, compute
w_i ∝ f(y_i)/f(y_i | δ_i = 1).
Using Chao's method (or another unequal-probability scheme), select a
sample of size n with selection probability proportional to w_i.
What if f (y) and f (y | δ = 1) are unknown?
Inverse sampling for big data analysis (Cont’d)
Assume bivariate observations (x_i, y_i) from f(x, y | δ = 1).
Suppose that µx = E (X) is available from an external source.
Let
P_0 = {f(x, y) : ∫ x f(x, y) d(x, y) = µ_x}.
Using µ_x, we can find f_0(x, y) ∈ P_0 that minimizes the
Kullback-Leibler distance
∫ f_0(x, y) ln[f_0(x, y)/f(x, y | δ = 1)] d(x, y).   (1)
Inverse sampling for big data analysis (Cont’d)
The solution to (1) is
f_0(x, y) = f(x, y | δ = 1) exp{λ(x − µ_x)} / E[exp{λ(X − µ_x)}],   (2)
where λ satisfies ∫ x f_0(x, y) d(x, y) = µ_x.
The transformation from f(x, y | δ = 1) to f_0(x, y) is called
exponential tilting. It tilts the density f(x, y | δ = 1) so that f_0(x, y)
satisfies the calibration constraint.
The importance weight in this situation is
w_i* ∝ f_0(x_i, y_i)/f(x_i, y_i | δ_i = 1) ∝ exp{λ̂(x_i − µ_x)},   (3)
where λ̂ satisfies Σ_{i=1}^n w_i* x_i = µ_x.
Closely related to the exponential tilting (ET) weight (Kim 2010).
Inverse reservoir sampling using exponential tilting
Reservoir inverse sampling:
1 Phase one: compute w_i* using the ET method.
2 Phase two: use Chao's reservoir sampling to select an unequal-probability sample of size n with π_i ∝ w_i*.
Computation of λ̂ in w_i* in (3):
1 Newton method: solve Σ_{i=1}^n w_i*(λ) x_i = µ_x iteratively.
2 One-step method (Kim, 2010): approximate λ̂ = (µ_x − x̄_k)/s_k², where x̄_k and s_k² are the sample mean and variance of {x_i : i = 1, . . . , k}.
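A sketch of the one-step weight computation (the function name is illustrative); the resulting weights feed π_i ∝ w_i* in Chao's reservoir sampling:

```python
import numpy as np

def one_step_et_weights(x, mu_x):
    """One-step exponential tilting weights (Kim, 2010):
    lambda_hat = (mu_x - xbar_k) / s_k^2, w_i* prop. to exp{lambda_hat (x_i - mu_x)}."""
    x = np.asarray(x, float)
    lam = (mu_x - x.mean()) / x.var(ddof=1)   # one-step approximation of lambda
    w = np.exp(lam * (x - mu_x))
    return w / w.sum()                        # normalized weights
```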
Simulation study
Finite population {(xi , yi ) : i ∈ U} is generated by
xi ∼ N(0, 1)
yi | xi ∼ N(xi , 1)
for i ∈ U, where U = {1, . . . , N} and N = 100, 000.
Indicator {δi : i ∈ U} is obtained by
logit{Pr(δi = 1 | xi )} = 0.25xi ,
where logit(x) = log{x/(1 − x)}.
The data stream is the first 10,000 elements with δi = 1.
Sample size n = 500.
We are interested in estimating E(Y ) by the proposed reservoir
inverse sampling methods.
Bias of estimating E(Y )
[Figure omitted: bias of estimating E(Y) versus k − 4n for reservoir inverse sampling (R_whole), the one-step method (R_one_step), and traditional reservoir sampling (Trd.).]
Summary statistics for k = 10, 000
Table: Summary statistics for k = 10,000. The unit for the values is 10^(−2).

                              Bias     S.E.
Reservoir inverse sampling    0.07     6.52
Reservoir one step            0.20     6.41
Traditional reservoir        12.33*    5.94

S.E.: standard error of the 1,000 Monte Carlo estimates. *: significant bias.
Topic 3: Survey Integration
Combine information from two independent surveys, A and B, from
the same target population
1 Non-nested two-phase sample: Observe x from survey A and observe
(x, y) from survey B (Hidiroglou, 2001; Merkouris, 2004; Kim and
Rao, 2012).
2 Two surveys with measurement errors: Observe (x, y1) from survey A
and observe (x, y2) from survey B (Kim, Berg, and Park, 2016).
Survey integration
Combining big data with survey data
1 A: survey data (representative sample, expensive), observe (x, z)
2 B: big data (self-selected sample, inexpensive), observe (x, y).
Rivers (2007) idea: use x to implement nearest-neighbor imputation for sample A; that is, create synthetic imputed values of y for sample A, as sketched below.
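A sketch of this nearest-neighbor imputation for scalar x (the function name is illustrative; a KD-tree would be preferable for large B):

```python
import numpy as np

def rivers_nn_impute(x_A, x_B, y_B):
    """For each unit in survey sample A, donate y from its nearest
    neighbor (in x) within big data B."""
    x_A, x_B = np.asarray(x_A, float), np.asarray(x_B, float)
    donors = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)
    return np.asarray(y_B)[donors]
```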
Table: Data Structure

Data   Representativeness   X   Z   Y
A      Yes                  o   o
B      No                   o       o

Table: Rivers idea

Data   Representativeness   X   Z   Y
A      Yes                  o   o   o
B      No                   o       o
Remark
Data B is subject to selection bias.
The Rivers method is justified if
f(y_i | x_i, i ∈ B) = f(y_i | x_i, i ∈ A).
In some literature, the above condition is called transportability.
The transportability can be achieved if the selection mechanism for
big data is non-informative (Pfeffermann 2011) or ignorable.
In this case, the synthetic data imputation can provide unbiased
estimation of the population parameters.
New approach for combining big data with survey data
1 In survey sample A, observe
δ_i = 1 if i ∈ B, 0 otherwise,
by asking a question on membership in the big data.
2 Since we observe (xi , zi , δi ) for all i ∈ A, we can fit a model for
πi = P(δi = 1 | xi , zi ) to obtain ˆπi .
3 Use wi = 1/ˆπi as the weight for analyzing big data B.
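A sketch of steps 2-3 using logistic regression for the propensity model (illustrative; any binary-response model could be used, and in practice the fit should account for the survey weights of sample A):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def big_data_weights(XZ_A, delta_A, XZ_B):
    """Fit P(delta = 1 | x, z) on survey sample A, where delta_A flags
    membership in big data B, and return w_i = 1 / pi_hat_i for B."""
    model = LogisticRegression().fit(XZ_A, delta_A)
    pi_hat = model.predict_proba(XZ_B)[:, 1]   # estimated propensity scores
    return 1.0 / pi_hat
```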
Remark
Note that the self-selection bias is essentially the same as the
nonresponse bias in the missing data literature.
Probability sample A is used to estimate the response probability πi .
The Rivers method is based on the assumption that
P(δ = 1 | x, y, z) = P(δ = 1 | x).
The proposed method is based on a weaker assumption for the response
mechanism:
P(δ = 1 | x, y, z) = P(δ = 1 | x, z).
Remark
Once the propensity score weights are computed, we can use them for
reservoir inverse sampling.
Variance estimation needs to be developed.
Small area estimation
Small area estimation can be viewed as a special case of survey
integration.
Big data can be used in small area estimation (Rao and Molina,
2015).
We can use big data as covariates in an area-level model for small area
estimation (Marchetti et al, 2015).
For a unit-level model, statistical issues such as self-selection bias and
measurement errors may arise.
Conclusion
Three statistical tools for big data analysis
1 Reservoir sampling
2 Inverse sampling
3 Survey integration
Problems with big data: measurement error, self-selection bias
Survey data can be used to reduce the problems in big data.
Promising area of research
References
Chao, M. T. (1982). A General Purpose Unequal Probability Sampling Plan.
Biometrika 69, 653–656.
Hidiroglou, M. A. (2001). Double Sampling. Surv. Methodol. 27, 143–154.
Hinkins, S., Oh, H. L. & Scheuren, F. (1997). Inverse Sampling Design
Algorithm. Surv. Methodol. 23, 11–21.
Kim, J. K. (2010). Calibration estimation using exponential tilting in sample
surveys. Surv. Methodol. 36, 145–155.
Kim, J. K., Berg, E. & Park, T. (2016). Statistical matching using fractional
imputation. Surv. Methodol. 42, 19–40.
Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent
surveys: a model-assisted approach. Biometrika 99, 85–100.
Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F.,
Pedreschi, D., Rinzivillo, S., Pappalardo, L., & Gabrielli, L. (2015).
Small area model-based estimators using big data sources. J. Off. Stat. 31,
263–281.
References (Cont’d)
Merkouris, T. (2004). Combining independent regression estimators from
multiple surveys, J. Am. Statist. Assoc. 99, 1131–1139.
McLeod, A. I. & Bellhouse, D. R. (1983). A Convenient Algorithm for
Drawing a Simple Random Sample. Appl. Statist. 32, 182–184.
Pfeffermann, D. (2011). Modeling of complex survey data: Why model? Why
is it a problem? How can we approach it? Surv. Methodol. 37, 115–136.
Rao, J. N. K. & Molina, I. (2015). Small area estimation. John Wiley & Sons,
New Jersey.
Rao, J. N. K., Scott, A. J. & Benhin, E. (2003). Undoing Complex Survey
Data Structures: Some Theory and Applications of Inverse Sampling. Surv.
Methodol. 29, 107–128.
Rivers, D. (2007). Sampling for web surveys. Joint Statistical Meetings,
Proceedings of the Section on Survey Research Methods.
