Modeling Big Count Data
An IRLS framework for COM-Poisson regression and GAM
Suneel Chatla
Galit Shmueli
November 12, 2016
Institute of Service Science
National Tsing Hua University, Taiwan (R.O.C)
Table of contents
1. Speed Dating Experiment – Count data models
2. Motivation
3. An IRLS framework
4. Simulation Study – Comparison of IRLS with MLE
5. A CMP Generalized Additive Model
6. Results & Conclusions
Speed Dating Experiment – Count data models
Speed dating experiment
Fisman et al. (2006) conducted a speed dating experiment to evaluate gender differences in mate selection.¹

Total sessions: 14
Decision: 1 or 0
Attractiveness: 1-10
Intelligence: 1-10
Ambition: 1-10
...
plus control variables

¹https://www.kaggle.com/annavictoria/speed-dating-experiment
Outcome/Count variables
Matches: when both persons decide Yes
Tot.Yes: total number of Yes for each subject in a particular session
Summary Statistics
Statistic N Mean St. Dev. Min Max
matches 531 2.524 2.304 0 14
Tot.Yes 531 6.433 4.361 0 21
Tot.partner 531 15.311 4.967 5 22
age 531 26.303 3.735 18 55
perc.samerace 531 0.391 0.242 0.000 0.833
avg.intcor 531 0.190 0.167 −0.298 0.569
attr 531 6.195 1.122 1.818 10.000
sinc 531 7.205 1.108 2.773 10.000
intel 531 7.381 0.988 3.409 10.000
func 531 6.438 1.103 2.682 10.000
amb 531 6.812 1.133 3.091 10.000
shar 531 5.511 1.333 1.409 10.000
like 531 6.157 1.072 1.682 10.000
prob 531 5.234 1.525 0.778 10.000
mean.agep 531 26.314 1.674 20.444 31.667
attr_o 531 6.200 1.186 2.333 8.688
sinc_o 531 7.224 0.690 4.167 9.000
intel_o 531 7.410 0.614 4.875 9.150
fun_o 531 6.438 1.015 2.625 8.615
amb_o 531 6.827 0.756 4.600 8.842
shar_o 531 5.498 0.942 1.375 7.700
like_o 531 6.161 0.873 2.333 8.300
prob_o 531 5.256 0.736 3.200 7.200
Tot.part.Yes 531 6.420 4.128 0 20
Tools:
• Poisson Regression
• Negative Binomial Regression
• Conway-Maxwell Poisson (CMP) Regression
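As a rough illustration (not part of the slides), the first two baselines could be fit in R along these lines, assuming the session-level data sit in a data frame called speed (a hypothetical name) with variable names matching the regression table later in the deck:

```r
# Sketch: baseline count models in R (the data frame `speed` is hypothetical).
library(MASS)   # for glm.nb

form <- Tot.Yes ~ Gender + age + Tot.partner + avg.intcor +
  attr + sinc + intel + func + amb + shar

pois_fit <- glm(form, family = poisson(link = "log"), data = speed)
nb_fit   <- glm.nb(form, data = speed)

AIC(pois_fit, nb_fit)   # compare the two baselines
```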
The CMP distribution
From Shmueli et al. (2005),
Y ∼ CMP(λ, ν)

implies

P(Y = y) = λ^y / ((y!)^ν Z(λ, ν)),   y = 0, 1, 2, . . .

Z(λ, ν) = Σ_{s=0}^{∞} λ^s / (s!)^ν

for λ > 0, ν ≥ 0.

The CMP distribution includes three well-known distributions as special cases:
• Poisson (ν = 1),
• Geometric (ν = 0, λ < 1),
• Bernoulli (ν → ∞), with success probability λ/(1 + λ).
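For intuition, here is a minimal R sketch of this pmf, truncating the infinite sum Z(λ, ν) at a fixed upper limit; the function name dcmp and the truncation point are our own choices, not the authors' code:

```r
# Sketch: CMP pmf with the normalizing constant Z truncated at ymax.
dcmp <- function(y, lambda, nu, ymax = 500) {
  s <- 0:ymax
  logZ_terms <- s * log(lambda) - nu * lfactorial(s)
  logZ <- max(logZ_terms) + log(sum(exp(logZ_terms - max(logZ_terms))))
  exp(y * log(lambda) - nu * lfactorial(y) - logZ)
}

dcmp(0:4, lambda = 2, nu = 3)         # under-dispersed example from the panel plot
sum(dcmp(0:50, lambda = 2, nu = 3))   # should be close to 1
```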
CMP distribution for different (λ, ν) combinations
[Figure: CMP probability mass functions for λ = 2, 8, 15 (rows) and ν = 0.5, 0.75, 1, 3 (columns); as ν increases the distribution concentrates on smaller counts.]
CMP Regression
CMP regression models can be formulated as follows:

log(λ) = Xβ   (1)
log(ν) = Zγ   (2)

Maximizing the log-likelihood with respect to the parameters β and γ yields the following score equations (Sellers and Shmueli, 2010):

U = ∂ log L / ∂β = X^T (y − E(y))   (3)
V = ∂ log L / ∂γ = ν Z^T (−log(y!) + E(log(y!)))   (4)
Motivation
Exploration of Speed Dating data
[Figure: scatter plots of log(Tot.Yes) against Sincerity (Others), Intelligence (Others), Sincerity, and Fun seeking.]
More flexibility?
Generalized Additive Models
• Smoothing Splines
• Penalized Splines
Both implementations rely on the Iteratively Reweighted Least Squares (IRLS) estimation framework.
At present, there is no IRLS framework available for CMP!
An IRLS framework
Update for each iteration:

I [β, γ]^(m) = I [β, γ]^(m−1) + [U, V]

where I is the information matrix of (β, γ). This implies the following equations:

X^T Σ_y X β^(m) − X^T Σ_{y,log(y!)} νZ γ^(m)
    = X^T Σ_y X β^(m−1) − X^T Σ_{y,log(y!)} νZ γ^(m−1) + X^T (y − E(y))

and

−νZ^T Σ_{y,log(y!)} X β^(m) + ν² Z^T Σ_{log(y!)} Z γ^(m)
    = −νZ^T Σ_{y,log(y!)} X β^(m−1) + ν² Z^T Σ_{log(y!)} Z γ^(m−1) + νZ^T (−log(y!) + E(log(y!)))
For fixed values of β and γ, the equations reduce to

X^T Σ_y X β^(m) = X^T Σ_y X β^(m−1) + X^T (y − E(y))   (5)

ν² Z^T Σ_{log(y!)} Z γ^(m) = ν² Z^T Σ_{log(y!)} Z γ^(m−1) + νZ^T (−log(y!) + E(log(y!)))   (6)
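A rough R sketch of these decoupled updates, purely illustrative and not the authors' implementation: the moment helper, the zero starting values, and the convergence tolerance are our own assumptions (the slides instead suggest λ = (y + 0.1)^ν and ν = 0.2 as starting values, plus step halving, omitted here for brevity):

```r
# Sketch of the decoupled IRLS updates (5)-(6); illustrative, not the authors' code.
cmp_cumulants <- function(lambda, nu, ymax = 500) {
  y <- 0:ymax
  logw <- y * log(lambda) - nu * lfactorial(y)
  p <- exp(logw - max(logw)); p <- p / sum(p)           # truncated CMP pmf
  EY  <- sum(y * p);              VY  <- sum((y - EY)^2 * p)
  Elf <- sum(lfactorial(y) * p);  Vlf <- sum((lfactorial(y) - Elf)^2 * p)
  c(EY = EY, VY = VY, Elf = Elf, Vlf = Vlf)
}

cmp_irls <- function(y, X, Z, maxit = 50, tol = 1e-8) {
  beta <- rep(0, ncol(X)); gamma <- rep(0, ncol(Z))     # naive starting values
  for (m in seq_len(maxit)) {
    lambda <- exp(drop(X %*% beta)); nu <- exp(drop(Z %*% gamma))
    cum <- mapply(cmp_cumulants, lambda, nu)            # 4 x n matrix of moments
    EY <- cum["EY", ]; VY <- cum["VY", ]; Elf <- cum["Elf", ]; Vlf <- cum["Vlf", ]
    # Eq. (5): beta step with weights Sigma_y = diag(Var(Y))
    delta_b <- solve(crossprod(X * VY, X), crossprod(X, y - EY))
    # Eq. (6): gamma step with weights nu^2 * Var(log y!)
    delta_g <- solve(crossprod(Z * (nu^2 * Vlf), Z),
                     crossprod(Z, nu * (Elf - lfactorial(y))))
    beta <- beta + drop(delta_b); gamma <- gamma + drop(delta_g)
    if (max(abs(c(delta_b, delta_g))) < tol) break      # the paper also uses step halving
  }
  list(beta = beta, gamma = gamma)
}
```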
Algorithm
https://arxiv.org/abs/1610.08244
Practical issues
Initial Values
• For λ: λ^(0) = (y + 0.1)^ν
• For ν: ν^(0) = 0.2

Calculation of Cumulants
• Bounding error 10^−8 or 10^−10
• Asymptotic expressions

Stopping Criterion
• Based on −2 Σ l(y_i; λ̂_i, ν̂_i)

Step size
• Step halving
Simulation Study – Comparison of IRLS with MLE
Study design
We compare our IRLS algorithm with the existing implementation
which is based on maximizing the likelihood function (through optim
in R).
(a) Set sample size n = 100
(b) Generate x1 ∼ U(0, 1) and x2 ∼ N(0, 1)
(c) Calculate x3 = 0.2x1 + U(0, 0.3) and x4 = 0.3x2 + N(0, 0.1) (to create correlated variables)
(d) Generate y ∼ CMP(λ, ν) with log(λ) = 0.05 + 0.5x1 − 0.5x2 + 0.25x3 − 0.25x4 and ν ∈ {0.5, 2, 5}
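One way to generate such a data set in R; the inverse-CDF sampler over a truncated support is our own shortcut (not necessarily how the paper's simulations were run), and N(0, 0.1) is read here as a standard deviation of 0.1:

```r
# Sketch: generate one simulated data set following steps (a)-(d).
set.seed(1)

rcmp1 <- function(lambda, nu, ymax = 200) {            # one draw from a truncated CMP
  y <- 0:ymax
  logw <- y * log(lambda) - nu * lfactorial(y)
  p <- exp(logw - max(logw)); p <- p / sum(p)
  sample(y, size = 1, prob = p)
}

n  <- 100
x1 <- runif(n); x2 <- rnorm(n)
x3 <- 0.2 * x1 + runif(n, 0, 0.3)
x4 <- 0.3 * x2 + rnorm(n, 0, 0.1)                      # N(0, 0.1) taken as sd = 0.1
loglam <- 0.05 + 0.5 * x1 - 0.5 * x2 + 0.25 * x3 - 0.25 * x4
nu <- 2                                                # one of {0.5, 2, 5}
y  <- sapply(exp(loglam), rcmp1, nu = nu)
```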
Results
[Figure: boxplots of the coefficient estimates for x1, x2, x3, x4 and of log(ν), comparing IRLS ("IR") and MLE side by side for ν = 0.5, 2, 5.]
A CMP Generalized Additive Model
Additive Model
log(λ) = α + Σ_{j=1}^{p} f_j(X_j)
log(ν) = Zγ

where f_j (j = 1, 2, . . . , p) are smooth functions of the p variables.
Backfitting
Based on Hastie and Tibshirani (1990) and Wood (2006), the algorithm is as follows:

1. Initialize: f_j = f_j^(0), j = 1, . . . , p
2. Cycle over j = 1, . . . , p, 1, . . . , p, . . . :
   f_j = S_j( y − Σ_{k≠j} f_k | x_j )
3. Continue step (2) until the individual functions stop changing.

One more nested loop inside the IRLS framework!
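A minimal backfitting sketch in R with smoothing splines, written for a Gaussian working response; inside the CMP IRLS loop the same cycle would run on a weighted working response (function and variable names here are ours, not the paper's code):

```r
# Sketch: plain backfitting with smoothing splines (Gaussian response).
backfit <- function(y, X, maxit = 20, tol = 1e-6) {
  n <- length(y); p <- ncol(X)
  f <- matrix(0, n, p)                  # f_j^(0) = 0
  alpha <- mean(y)
  for (it in seq_len(maxit)) {
    f_old <- f
    for (j in seq_len(p)) {
      partial <- y - alpha - rowSums(f[, -j, drop = FALSE])   # partial residual
      fit <- smooth.spline(X[, j], partial)                   # smoother S_j
      f[, j] <- predict(fit, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])                         # center each smooth
    }
    if (max(abs(f - f_old)) < tol) break
  }
  list(alpha = alpha, f = f)
}
```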
Results & Conclusions
Comparison of Regression models on Tot.Yes
Poisson Negative Binomial CMP
(Intercept) 0.49 0.59 0.14
(0.43) (0.55) (0.33)
GenderMale 0.05 0.05 0.03
(0.04) (0.06) (0.03)
age −0.01 −0.01 −0.004
(0.01) (0.01) (0.004)
Tot.partner 0.07∗∗∗ 0.07∗∗∗ 0.04∗∗∗
(0.00) (0.01) (0.003)
avg.intcor −0.04 −0.04 −0.02
(0.11) (0.15) (0.09)
attr 0.19∗∗∗ 0.18∗∗∗ 0.11∗∗∗
(0.03) (0.04) (0.02)
sinc −0.06 −0.05 −0.04
(0.03) (0.04) (0.02)
intel 0.05 0.06 0.03
(0.04) (0.05) (0.03)
func 0.03 0.04 0.02
(0.04) (0.05) (0.03)
amb −0.12∗∗∗ −0.13∗∗ −0.07∗∗
(0.03) (0.04) (0.02)
shar 0.10∗∗∗ 0.10∗∗∗ 0.06∗∗∗
(0.02) (0.03) (0.02)
mean.agep −0.01 −0.01 −0.007
(0.01) (0.02) (0.009)
attr_o −0.10∗∗∗ −0.10∗∗∗ −0.06∗∗∗
(0.02) (0.03) (0.02)
sinc_o 0.02 0.02 0.01
(0.04) (0.05) (0.03)
intel_o 0.08 0.08 0.05
(0.05) (0.07) (0.04)
fun_o −0.01 −0.01 −0.003
(0.03) (0.04) (0.02)
amb_o −0.00 −0.01 0.0005
(0.04) (0.05) (0.03)
shar_o 0.02 0.03 0.01
(0.03) (0.04) (0.02)
ν 0.53∗∗∗
AIC 2844.92 2777.24 2751.7
BIC 3011.64 2948.23 2922.66
Log Likelihood -1383.46 -1348.62 -1335.33
Deviance 970.04 637.25
Num. obs. 531 531 531
∗∗∗p < 0.001; ∗∗p < 0.01; ∗p < 0.05
Comparison of Additive Models on Tot.Yes
Dependent variable:
Tot.Yes
CMP(Chi.Sq) Poisson(Chi.Sq)
s(sinc) 7.16 11.53∗∗
s(func) 7.51 11.40∗∗
s(sinc_o) 13.96∗∗ 29.30∗∗∗
s(intel_o) 14.06∗∗ 13.26∗∗∗
ν 0.56
AIC 2737.03 2804.77
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
It is more about the behavior of the other person that guides us to select him/her.
Summary
• The IRLS framework is far more efficient than the existing likelihood-based method and provides more flexibility.
• Since CMP is computationally heavier than the other GLMs, some matrix computations could be parallelized in order to increase speed.
• The IRLS framework allows CMP to have other modeling extensions such as the LASSO.

Full paper available from https://arxiv.org/abs/1610.08244 and the source code is available from https://github.com/SuneelChatla/cmp
Suggestions and Questions?
References
Fisman, R., Iyengar, S. S., Kamenica, E., and Simonson, I. (2006).
Gender differences in mate selection: Evidence from a speed
dating experiment. The Quarterly Journal of Economics, pages
673–697.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models,
volume 43. CRC Press.
Sellers, K. F. and Shmueli, G. (2010). A flexible regression model for
count data. Annals of Applied Statistics, 4(2):943–961.
Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., and Boatwright, P. (2005). A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1):127–142.
Wood, S. (2006). Generalized additive models: an introduction with R.
CRC press.