Some sampling techniques for big data analysis
Jae Kwang Kim
Iowa State University & KAIST
May 31, 2017
Joint work with Zhonglei Wang
Example
Let’s look at an artificial finite population.
ID Size of farms yield(y)
1 4 1
2 6 3
3 6 5
4 20 15
Parameter of interest: Mean yield of the farms in the population
Assume that farm sizes are known.
Instead of observing N = 4 farms, we want to select a sample of size
n = 2.
Example - Continued
6 possible samples
case sample ID sample mean sampling error
1 1, 2 2 -4
2 1, 3 3 -3
3 1, 4 8 2
4 2, 3 4 -2
5 2, 4 9 3
6 3, 4 10 4
Each sample has sampling error.
Two ways to select one of the six possible samples.
Nonprobability sampling: select a sample subjectively (e.g., using the size of the farms).
Probability sampling : select a sample by a probability rule.
Example - Continued : Probability sampling 1
Simple Random Sampling : Assign the same selection probability to all
possible samples
case sample ID sample mean(¯y) sampling error selection probability
1 1, 2 2 -4 1/6
2 1, 3 3 -3 1/6
3 1, 4 8 2 1/6
4 2, 3 4 -2 1/6
5 2, 4 9 3 1/6
6 3, 4 10 4 1/6
In this case, the sample mean(¯y) has a discrete probability distribution.
ȳ ∈ {2, 3, 4, 8, 9, 10}, each value with probability 1/6.
Expected value of sampling error
E(ȳ − Ȳ) = (−4 − 3 + 2 − 2 + 3 + 4)/6 = 0
Thus, the estimator is unbiased.
Variance of the sampling error is (16 + 9 + 4 + 4 + 9 + 16)/6 = 9.67.
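This small example can be verified directly in code; a minimal sketch (the yields and sample size are taken from the example above):

```python
from itertools import combinations

yields = {1: 1, 2: 3, 3: 5, 4: 15}            # ID -> yield, from the example
Y_bar = sum(yields.values()) / len(yields)    # population mean = 6

samples = list(combinations(yields, 2))       # all 6 samples of size n = 2
errors = [sum(yields[i] for i in s) / 2 - Y_bar for s in samples]

print(sum(errors) / 6)                # 0.0  -> the estimator is unbiased
print(sum(e**2 for e in errors) / 6)  # 9.67 -> variance of the sampling error
```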
Probability Sampling
Definition : For each element in the population, the probability that
the element is included in the sample is known and greater than 0.
Advantages
1 Excludes the subjectivity of sample selection.
2 Removes sampling bias (or selection bias).
What is sampling bias? (θ: true value, θ̂: estimate of θ)
(sampling) error of θ̂ = θ̂ − θ = (θ̂ − E(θ̂)) + (E(θ̂) − θ) = variation + bias
In nonprobability sampling, the variation is 0 but there is a bias. In probability sampling, there is variation but the bias is 0.
Probability Sampling
Main theory
1 Law of Large Numbers : ˆθ converges to E ˆθ for sufficiently large
sample size.
2 Central Limit Theorem : ˆθ follows a normal distribution for sufficiently
large sample size.
Additional advantages of probability sampling with large sample :
1 Improve the precision of an estimator
2 Can compute confidence intervals or test statistical hypotheses.
With the same sample size, we may have different precision.
Survey Sampling
A classical area of statistics concerning ....
1 Drawing a probability sample from a target population
2 Using the sample to make inference about the population
Why survey sampling?
1 Cost consideration: Data collection often involves cost. Smaller sample
means less money.
2 Representativeness: Probability sampling is the only scientifically
justified approach to data collection for population studies.
3 Computational efficiency: Sample takes less memory and storage, and
less computing time.
Survey Sampling in the era of Big Data
Challenges
1 Decreasing response rates: strict probability sampling is not possible
and representativeness may be weakened.
2 Cost model changes: data collection cost is not proportional to size.
3 Many competitors: survey sampling is no longer the only way of
collecting data. Abundant data sources.
Paradigm Shift in Survey Sampling
Missing data framework:
From sample to population ⇒ From observed to unobserved
Combine survey data with other existing data
Survey Data + Auxiliary Data ⇒ Synthetic Data
Sampling techniques are still useful in handling big data (e.g., reservoir sampling, balanced sampling, inverse sampling, calibration weighting, statistical inference with informative sampling, etc.).
In this talk, we will introduce ...
1 Reservoir sampling and its variants
2 Inverse sampling
3 Synthetic data imputation (or survey integration)
Topic 1: Reservoir Sampling
McLeod and Bellhouse (1983)’s idea:
1 Let S = {1, · · · , n} be the initial reservoir sample.
2 For k = n + 1, n + 2, · · · , select element k with probability n/k.
3 If element k is selected, replace a randomly chosen element of S with it; otherwise keep S and return to Step 2.
Step 1 is the initialization step, Step 2 is the inclusion step, and Step 3 is
the removal step.
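A minimal one-pass sketch of the algorithm (the function name and stream interface are illustrative):

```python
import random

def reservoir_sample(stream, n):
    """One-pass SRS of size n (McLeod and Bellhouse, 1983)."""
    S = []
    for k, item in enumerate(stream, start=1):
        if k <= n:
            S.append(item)                  # Step 1: fill the initial reservoir
        elif random.random() < n / k:       # Step 2: include element k w.p. n/k
            S[random.randrange(n)] = item   # Step 3: replace a random element of S
    return S
```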
Justification
Let Sk be the set of sample elements selected from Uk = {1, · · · , k}.
If m ≤ n,
P(m ∈ S_k) = 1 × n/(n + 1) × (n + 1)/(n + 2) × · · · × (k − 1)/k = n/k.
If m > n,
P(m ∈ S_k) = (n/m) × m/(m + 1) × · · · × (k − 1)/k = n/k.
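The n/k result can also be checked numerically; a small Monte Carlo sketch with n = 2 and k = 10:

```python
import random
from collections import Counter

n, k, reps = 2, 10, 100_000
counts = Counter()
for _ in range(reps):
    S = list(range(1, n + 1))              # initial reservoir {1, ..., n}
    for m in range(n + 1, k + 1):
        if random.random() < n / m:        # include element m w.p. n/m
            S[random.randrange(n)] = m     # remove one element at random
    counts.update(S)

for m in range(1, k + 1):
    print(m, counts[m] / reps)             # each close to n/k = 0.2
```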
Remark
Selects an SRS (simple random sample) of size n from a stream of
arbitrary size in one pass
No need to know the population size in advance
Very useful for big data sampling when the population is updated
continuously.
Improving reservoir sampling
Goal: We wish to reduce the variance of θ̂, which is computed from the
reservoir sample.
Method 1: Balanced reservoir sampling
We wish to impose a balanced condition on the reservoir sample:
(1/n) Σ_{i∈S} z_i = (1/N) Σ_{i∈U} z_i.
The variable z is called a control variable.
To achieve the balanced condition, we can modify Step 3 (removal
step) as follows:
1 Let z̄_k be the mean of z over the population up to element k, U_k = {1, · · · , k}.
2 For each element in the current sample S, compute
D(i) = |z̄_k − z̄_S^(−i)|, i ∈ S,
where z̄_S^(−i) is the sample mean excluding z_i.
3 Instead of removing one element at random in Step 3, remove the element i* with the smallest value of D(i).
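A sketch of the modified removal step (assuming the running population mean z̄_k is tracked alongside the reservoir; the function name is illustrative):

```python
def balanced_removal_index(z_sample, z_bar_k):
    """Index i* whose removal keeps the sample mean of z closest to the
    running population mean z_bar_k, i.e. the smallest D(i)."""
    n = len(z_sample)
    total = sum(z_sample)
    # D(i) = |z_bar_k - sample mean of z excluding element i|
    D = [abs(z_bar_k - (total - z) / (n - 1)) for z in z_sample]
    return D.index(min(D))
```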
A numerical experiment
Figure: Trace of sample means of size n = 100 from a population of size
N = 10, 000. The left side is the trace of the sample mean from the classical
reservoir sampling and the right side is the trace of the sample mean from the
proposed reservoir sampling.
Remark
The balanced reservoir sampling may provide an efficient sample for
estimating the population mean, but not the population distribution.
We may wish to provide an efficient representative sample that
provides consistent estimates for various population parameters.
Method 2: Stratified reservoir sampling
Assume that the finite population consists of H subpopulations, called
strata. The within-stratum variance is small.
For scalar y with bounded support, we can predetermine the stratum
boundaries by creating H intervals with equal length.
We want to obtain a reservoir sample that achieves stratified random
sampling with proportional allocation.
The basic idea is to use Chao (1982)'s sampling to select units with probability proportional to w_i = N_h/n_h, where N_h and n_h are updated continuously, so that an equal-probability sample is obtained within each stratum.
Stratified reservoir sampling
Let Uk = {1, · · · , k} be the finite population up to element k.
Let π(k; i) be the first-order inclusion probability that unit i is
selected from Uk.
We want to have
Σ_{i=1}^k π(k; i) = n and π(k; i) ∝ w_{k,i},
where w_{k,i} = N_{k,h}/n_{k,h} for the stratum h to which unit i belongs, and N_{k,h} and n_{k,h} are the population size and the sample size, respectively, corresponding to stratum h.
Stratified reservoir sampling using Chao’s method
At time k + 1, we update the reservoir sample as follows:
1 Select unit k + 1 with probability π(k + 1; k + 1).
2 If unit k + 1 is not selected, retain the sample at time k.
3 If unit k + 1 is selected, then remove one unit from the current
sample with the removal probability
R_kj = 0 if j ∈ A_k; T_kj if j ∈ B_k; (1 − T_k)/(n − L_k) if j ∈ C_k,
where A_k = {i ∈ S_k : π(k; i) = π(k + 1; i) = 1},
B_k = {i ∈ S_k : π(k; i) = 1, π(k + 1; i) < 1},
C_k = {i ∈ S_k : π(k; i) < 1, π(k + 1; i) < 1},
T_kj = {1 − π(k + 1; j)}/π(k + 1; k + 1), T_k = Σ_{j∈B_k} T_kj, and L_k = |A_k ∪ B_k|.
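A sketch of how these removal probabilities could be computed, with the inclusion probabilities passed as dictionaries keyed by reservoir unit (the interface is illustrative):

```python
def chao_removal_probs(pi_k, pi_k1, pi_new):
    """Removal probabilities R_kj for the units in the current reservoir.

    pi_k[j], pi_k1[j]: inclusion probabilities of reservoir unit j at
    times k and k+1; pi_new = pi(k+1; k+1) for the incoming unit.
    Assumes n > L_k."""
    n = len(pi_k)
    # B_k: certainties at time k that are no longer certainties at time k+1
    T = {j: (1 - pi_k1[j]) / pi_new
         for j in pi_k if pi_k[j] == 1 and pi_k1[j] < 1}
    L_k = sum(1 for j in pi_k if pi_k[j] == 1)      # |A_k U B_k|
    T_k = sum(T.values())
    R = {}
    for j in pi_k:
        if pi_k[j] == 1 and pi_k1[j] == 1:          # j in A_k: never removed
            R[j] = 0.0
        elif j in T:                                # j in B_k
            R[j] = T[j]
        else:                                       # j in C_k
            R[j] = (1 - T_k) / (n - L_k)
    return R
```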
Simulation study
Finite population {yi : i ∈ U} is generated by
yi ∼ N(10, 10)
independently for i ∈ U, where U = {1, . . . , N} with N = 5, 000.
Sample size n = 1, 000.
Three reservoir methods are considered.
Classical reservoir sampling approach (McLeod and Bellhouse, 1983).
Balanced reservoir sampling approach with yi being the control variable.
Stratified reservoir sampling approach: 51 strata partitioned by equally
spaced knots {k1, . . . , k50}, where k1 and k50 are the 1% and
99% sample quantiles of {yi : i = 1, . . . , n}.
Simulation study (Cont’d)
We are interested in estimating the following finite population
parameters.
Mean: m_k = k^(−1) Σ_{i=1}^k y_i, for k = n + 1, . . . , N.
Probability: P_{k,j} = k^(−1) Σ_{i=1}^k I(y_i < q_j), for j = 1, . . . , 5, where
q_1, . . . , q_5 are the 5%, 27.5%, 50%, 72.5% and 95% quantiles of N(10, 10).
1,000 simulations are conducted.
Bias of estimating mk
[Figure omitted: point-wise bias of the estimator of m_k versus k − n for the Classical, Balanced, and Stratified methods. The black line is the point-wise mean over the simulations, and the blue lines are the point-wise 5% and 95% quantiles.]
Bias of estimating Pk,1
[Figure omitted: point-wise bias of the estimator of P_{k,1} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,2
[Figure omitted: point-wise bias of the estimator of P_{k,2} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,3
[Figure omitted: point-wise bias of the estimator of P_{k,3} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,4
[Figure omitted: point-wise bias of the estimator of P_{k,4} versus k − n for the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,5
[Figure omitted: point-wise bias of the estimator of P_{k,5} versus k − n for the Classical, Balanced, and Stratified methods.]
Summary statistics for k = 5, 000
Table: Summary statistics for k = 5,000. The unit for the values is 10^(−3).

             Mean              q1                q2
             Bias     S.E.     Bias     S.E.     Bias      S.E.
Classical     0.13    88.36     0.10    5.84       0.26    12.84
Balanced      0.83     1.02    50.19    6.84     165.66    11.77
Stratified  -22.47    17.89    -0.91    1.61       3.48     3.06

             q3                q4                q5
             Bias     S.E.     Bias     S.E.     Bias      S.E.
Classical    -0.18    14.05    -0.47    12.80      0.19     6.04
Balanced     -2.61     7.07  -164.25    11.81    -50.81     6.67
Stratified    3.56     3.00     1.74     2.59      1.89     1.48

S.E.: standard error of the 1,000 Monte Carlo estimates
Discussion
Balanced reservoir sampling provides the most efficient estimation for
the population mean, but it provides biased estimates for other
parameters.
Stratified reservoir sampling provides more efficient estimates than
the classical reservoir sampling for various parameters.
It can be shown that, under some conditions, the stratified reservoir
sampling satisfies
V(θ̂) = O(1/(nH)),
where H is the number of strata.
Topic 2: Inverse sampling
Goal: Obtain a simple random sample from the current sample
obtained from a complex sampling design.
Originally proposed by Hinkins et al (1997) and Rao et al (2003) in
the context of stratified cluster sampling.
Note that the inverse sampling can be treated as a special case of
two-phase sampling:
1 Phase One: Original sample obtained from complex sampling design.
2 Phase Two: Select a subsample of the Phase One sample so that the final
sample has equal first-order inclusion probabilities.
Inverse sampling for big data analysis
Two cases
1 Big data = population
2 Big data ≠ population
In case one, the sampling problem is easy.
In case two, we need to adjust for the bias associated with the big data
itself.
If the data is not obtained from a probability sample, it may be
subject to selection bias.
We will discuss how to remove the selection bias of the big data using
the inverse sampling idea.
Inverse sampling for big data analysis
Assume that the big data (D) is a subset of the finite population.
Let δ_i be the indicator function
δ_i = 1 if i ∈ D, 0 otherwise.
The observations in D can be viewed as a random sample from f(y | δ = 1).
However, we are interested in obtaining a sample from f(y).
How can we compute E(Y) = ∫ y f(y) dy using the observations from
f(y | δ = 1)?
Importance sampling idea
Goal: We are interested in estimating E(Y) = ∫ y f(y) dy using
y_1, · · · , y_n from f(y | δ = 1).
Note that
E(Y) = ∫ y [f(y)/f(y | δ = 1)] f(y | δ = 1) dy ≅ (1/n) Σ_{i=1}^n w_i y_i,
where
w_i ∝ f(y_i)/f(y_i | δ_i = 1).
The weight w_i is called the importance weight. It measures how
representative y_i is of f(y) when y_i is generated from f(y | δ = 1).
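In code, the self-normalized form of this estimator is short; a minimal sketch:

```python
import numpy as np

def importance_mean(y, w):
    """Estimate E(Y) from draws y_i ~ f(y | delta = 1) with importance
    weights w_i proportional to f(y_i) / f(y_i | delta_i = 1)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)  # self-normalized: w known up to a constant
```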
Inverse sampling for big data analysis
Assume for now that f (y) and f (y | δ = 1) are of known form.
From each i ∈ D, compute
w_i ∝ f(y_i)/f(y_i | δ_i = 1).
Using Chao's method (or another unequal-probability scheme), select a
sample of size n with selection probability proportional to w_i.
What if f (y) and f (y | δ = 1) are unknown?
Inverse sampling for big data analysis (Cont’d)
Assume bivariate observations (x_i, y_i) from f(x, y | δ = 1).
Suppose that µx = E (X) is available from an external source.
Let
P_0 = {f(x, y) : ∫ x f(x, y) d(x, y) = µ_x}.
Using µ_x, we can find f_0(x, y) ∈ P_0 that minimizes the
Kullback-Leibler distance
∫ f_0(x, y) ln[f_0(x, y)/f(x, y | δ = 1)] d(x, y).   (1)
Inverse sampling for big data analysis (Cont’d)
The solution to (1) is
f_0(x, y) = f(x, y | δ = 1) exp{λ(x − µ_x)} / E[exp{λ(X − µ_x)}],   (2)
where λ satisfies ∫ x f_0(x, y) d(x, y) = µ_x.
The transformation from f(x, y | δ = 1) to f_0(x, y) is called
exponential tilting. It tilts the density f(x, y | δ = 1) so that f_0(x, y)
satisfies the calibration constraint.
The importance weight in this situation is
w_i* ∝ f_0(x_i, y_i)/f(x_i, y_i | δ_i = 1) ∝ exp{λ̂(x_i − µ_x)},   (3)
where λ̂ satisfies Σ_{i=1}^n w_i* x_i = µ_x.
Closely related to the exponential tilting (ET) weight (Kim 2010).
Inverse reservoir sampling using exponential tilting
Reservoir inverse sampling:
1 Phase one: compute w_i* using the ET method.
2 Phase two: use Chao's reservoir sampling to select an unequal-probability sample of size n with π_i ∝ w_i*.
Computation of λ̂ in w_i* in (3):
1 Newton method: solve Σ_{i=1}^n w_i*(λ) x_i = µ_x iteratively.
2 One-step method (Kim, 2010): approximate λ̂ = (µ_x − x̄_k)/s_k², where x̄_k and s_k² are the sample mean and variance of {x_i : i = 1, . . . , k}.
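A sketch of the one-step weight computation (the function name is illustrative); the resulting weights feed π_i ∝ w_i* in Chao's reservoir sampling:

```python
import numpy as np

def one_step_et_weights(x, mu_x):
    """One-step exponential tilting weights (Kim, 2010):
    lambda_hat = (mu_x - xbar_k) / s_k^2, w_i* prop. to exp{lambda_hat (x_i - mu_x)}."""
    x = np.asarray(x, float)
    lam = (mu_x - x.mean()) / x.var(ddof=1)   # one-step approximation of lambda
    w = np.exp(lam * (x - mu_x))
    return w / w.sum()                        # normalized weights
```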
Simulation study
Finite population {(xi , yi ) : i ∈ U} is generated by
xi ∼ N(0, 1)
yi | xi ∼ N(xi , 1)
for i ∈ U, where U = {1, . . . , N} and N = 100, 000.
Indicator {δi : i ∈ U} is obtained by
logit{Pr(δi = 1 | xi )} = 0.25xi ,
where logit(x) = log{x/(1 − x)}.
The data stream is the first 10,000 elements with δi = 1.
Sample size n = 500.
We are interested in estimating E(Y ) by the proposed reservoir
inverse sampling methods.
Bias of estimating E(Y )
[Figure omitted: bias of estimating E(Y) versus k − 4n for reservoir inverse sampling (R_whole), the one-step method (R_one_step), and traditional reservoir sampling (Trd.).]
Summary statistics for k = 10, 000
Table: Summary statistics for k = 10,000. The unit for the values is 10^(−2).

                              Bias     S.E.
Reservoir inverse sampling    0.07     6.52
Reservoir one step            0.20     6.41
Traditional reservoir        12.33*    5.94

S.E.: standard error of the 1,000 Monte Carlo estimates. *: significant bias.
Topic 3: Survey Integration
Combine information from two independent surveys, A and B, from
the same target population
1 Non-nested two-phase sample: Observe x from survey A and observe
(x, y) from survey B (Hidiroglou, 2001; Merkouris, 2004; Kim and
Rao, 2012).
2 Two surveys with measurement errors: Observe (x, y1) from survey A
and observe (x, y2) from survey B (Kim, Berg, and Park, 2016).
Survey integration
Combining big data with survey data
1 A: survey data (representative sample, expensive), observe (x, z)
2 B: big data (self-selected sample, inexpensive), observe (x, y).
Rivers (2007) idea: use x to implement nearest-neighbor imputation for sample A; that is, create synthetic imputed values of y for sample A, as sketched below.
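A sketch of this nearest-neighbor imputation for scalar x (the function name is illustrative; a KD-tree would be preferable for large B):

```python
import numpy as np

def rivers_nn_impute(x_A, x_B, y_B):
    """For each unit in survey sample A, donate y from its nearest
    neighbor (in x) within big data B."""
    x_A, x_B = np.asarray(x_A, float), np.asarray(x_B, float)
    donors = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)
    return np.asarray(y_B)[donors]
```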
Table: Data Structure

Data   Representativeness   X   Z   Y
A      Yes                  o   o
B      No                   o       o

Table: Rivers idea

Data   Representativeness   X   Z   Y
A      Yes                  o   o   o
B      No                   o       o
Remark
Data B is subject to selection bias.
The Rivers method is justified if
f(y_i | x_i, i ∈ B) = f(y_i | x_i, i ∈ A).
In some literature, the above condition is called transportability.
The transportability can be achieved if the selection mechanism for
big data is non-informative (Pfeffermann 2011) or ignorable.
In this case, the synthetic data imputation can provide unbiased
estimation of the population parameters.
New approach for combining big data with survey data
1 In survey sample A, observe
δ_i = 1 if i ∈ B, 0 otherwise,
by asking a question on membership in the big data.
2 Since we observe (xi , zi , δi ) for all i ∈ A, we can fit a model for
πi = P(δi = 1 | xi , zi ) to obtain ˆπi .
3 Use wi = 1/ˆπi as the weight for analyzing big data B.
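A sketch of steps 2-3 using logistic regression for the propensity model (illustrative; any binary-response model could be used, and in practice the fit should account for the survey weights of sample A):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def big_data_weights(XZ_A, delta_A, XZ_B):
    """Fit P(delta = 1 | x, z) on survey sample A, where delta_A flags
    membership in big data B, and return w_i = 1 / pi_hat_i for B."""
    model = LogisticRegression().fit(XZ_A, delta_A)
    pi_hat = model.predict_proba(XZ_B)[:, 1]   # estimated propensity scores
    return 1.0 / pi_hat
```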
Remark
Note that the self-selection bias is essentially the same as the
nonresponse bias in the missing data literature.
Probability sample A is used to estimate the response probability πi .
The Rivers method is based on the assumption that
P(δ = 1 | x, y, z) = P(δ = 1 | x).
The proposed method is based on a weaker assumption for the response
mechanism:
P(δ = 1 | x, y, z) = P(δ = 1 | x, z).
Remark
Once the propensity score weights are computed, we can use them for
reservoir inverse sampling.
Variance estimation needs to be developed.
Small area estimation
Small area estimation can be viewed as a special case of survey
integration.
Big data can be used in small area estimation (Rao and Molina,
2015).
We can use big data as covariates in an area-level model for small area
estimation (Marchetti et al, 2015).
For a unit-level model, statistical issues such as self-selection bias and
measurement errors may arise.
Conclusion
Three statistical tools for big data analysis
1 Reservoir sampling
2 Inverse sampling
3 Survey integration
Problems with big data: measurement error, self-selection bias
Survey data can be used to reduce the problems in big data.
Promising area of research
References
Chao, M. T. (1982). A General Purpose Unequal Probability Sampling Plan.
Biometrika 69, 653–656.
Hidiroglou, M. A. (2001). Double Sampling. Surv. Methodol. 27, 143–154.
Hinkins, S., Oh, H. L. & Scheuren, F. (1997). Inverse Sampling Design
Algorithm. Surv. Methodol. 23, 11–21.
Kim, J. K. (2010). Calibration estimation using exponential tilting in sample
surveys. Surv. Methodol. 36, 145–155.
Kim, J. K., Berg, E. & Park, T. (2016). Statistical matching using fractional
imputation. Surv. Methodol. 42, 19–40.
Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent
surveys: a model-assisted approach. Biometrika 99, 85–100.
Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F.,
Pedreschi, D., Rinzivillo, S., Pappalardo, L., & Gabrielli, L. (2015).
Small area model-based estimators using big data sources. J. Off. Stat. 31,
263–281.
References (Cont’d)
Merkouris, T. (2004). Combining independent regression estimators from
multiple surveys, J. Am. Statist. Assoc. 99, 1131–1139.
McLeod, A. I. & Bellhouse, D. R. (1983). A Convenient Algorithm for
Drawing a Simple Random Sample. Appl. Statist. 32, 182–184.
Pfeffermann, D. (2011). Modeling of complex survey data: Why model? Why
is it a problem? How can we approach it? Surv. Methodol. 37, 115–136.
Rao, J. N. K. & Molina, I. (2015). Small area estimation. John Wiley & Sons,
New Jersey.
Rao, J. N. K., Scott, A. J. & Benhin, E. (2003). Undoing Complex Survey
Data Structures: Some Theory and Applications of Inverse Sampling. Surv.
Methodol. 29, 107–128.
Rivers, D. (2007). Sampling for web surveys. Joint Statistical Meetings,
Proceedings of the Section on Survey Research Methods.
