Simulation Study on Hurdle Model Performance on Zero Inflated
Count Data
Adrian Daniel D. Camacho
The use of hurdle models has become prevalent in many fields, especially
where datasets contain an excessive number of zeroes. This simulation study
showed that hurdle models are robust to the omission of even significant
predictors. However, the performance of hurdle models deteriorates under
multicollinearity, with error rates rising by almost 50 percentage points or
more and parameter estimates becoming biased. The simulation analysis was
carried out on samples of 100 to 1,000,000 cases.
Keywords: hurdle model, zero-inflation, binary logistic regression,
truncated poisson, truncated negative binomial
1. Introduction
In practice, we encounter different data-generating processes that give rise to excess zeroes.
Data with excessive zeroes can be either structural or merely caused by
sampling variation. Some examples of such scenarios are counting a specific strain of virus
developed in a certain environment, the number of website visits for small retail sites, or even
counting the instances that a consigned product is bought from a coffee shop.
Poisson and Binomial distributions can accommodate such scenarios of zero occurrences,
but only to a certain extent. They can be modified so that the model adjusts for structural
zeroes. There are also approaches that truncate the zeroes; however, this poses a large loss of
information. This is where we have to strike a balance between including and excluding the zeroes
from the data modeling. Usually we consider hurdle or zero-inflated models when we encounter
count data with an outsized percentage of true zeroes. These models were developed to cope
with excessive zeroes; nevertheless, they have different characteristics. Zero-inflated
models (Poisson or Negative Binomial) assume two possible origins of the zeroes: either
the data structure or sampling variation. In hurdle models, on the other hand, all
zeroes are assumed to be structural. This being said, these models can behave differently and produce
different results when compared. One should take note of how the data-generating process is
designed and how the outcome is observed.
This paper aims to demonstrate the performance of hurdle models on zero-inflated count
data. The simulation study focuses on different scenarios of sample size,
multicollinearity, dropping important predictors, and other model adequacy factors.
The analysis is restricted to two-part hurdle models, namely the logistic-Poisson
and logistic-Negative Binomial models.
2. Related Literature
The use of hurdle models has been prevalent in many practices. The wide range of
applications of these models to the excessive-zero problem has produced a substantial body
of literature. Papers associated with the use of such models include, for
example, Cameron and Trivedi (1986), Ridout et al. (1998), and Min and Agresti (2002).
Applications have spanned econometrics (Winkelmann, 2004; Mullahy, 1986),
epidemiology (Bohning et al., 1999), ecology (Welsh et al., 1996), manufacturing
defects (Lambert, 1992), medical care (Deb and Trivedi, 1997), banking (Moffatt, 2003),
and insurance (Boucher et al., 2006).
Despite their similar structure, zero-inflated models and hurdle models differ considerably.
For instance, Hu, Pavlicova, and Nunes (2011) studied the differences between these
distributions and models and explored how to compare different count data models using data
from a multi-site clinical trial of behavioral interventions to reduce episodes of HIV risk
behaviour. Their findings conclude that zero-inflated models fit better than the corresponding
hurdle models. In their data, some participants scored zero unprotected sexual occasions
because they had no sexual partners, while others had sexual partners but scored zero because
they did not engage in high-risk sex. The example indicates the need to consider two sources
of zero observations: “sampling zeros” that are part of the underlying sampling distribution
(Poisson or negative binomial) and “structural zeros” that cannot score anything other than
zero. Hence, the design of a clinical trial is crucial to the choice of model in the event that fit
statistics do not identify a clear best fit.
Hurdle models are nevertheless useful if the assumptions and experimental design are aligned
with the data-generating process. Many analysts still advocate this model and continue to
improve its performance. A robust version of the hurdle model (Cantoni and Zedini, 2009)
was created to address the frequency of gross errors and the complexity intrinsic to some
of the phenomena considered, which may render the classical model unreliable and too limiting.
Hurdle models were also used to model repeat self-harm (Bethell et al., 2010). The first step
tests factors associated with any repetition (repeaters versus non-repeaters) and the second
part tests factors associated with the number of presentations (among repeaters). Hurdle
models are shown to be more informative than traditional binary analyses, and they also fit
these data adequately relative to some other count models.
The dynamic hurdle model for zero-inflated count data (Baetschmann and Winkelmann, 2015)
offers a new explanation of the extra zeroes encountered in many empirical count data
applications, relating them to the underlying stochastic process that generates events. In
their study, it was assumed that the process has two rates, a lower rate until the first event
and a higher rate thereafter. Using this concept, they applied the new approach to the
socio-economic determinants of the individual number of doctor visits in Germany.
3. Methodological Sketch
Hurdle Model (2-Step)
A hurdle model is a modified count model in which there are two main processes, one
generating the zeros and one generating the positive values. The concept underlying the
hurdle model is that a binomial probability model governs the binary outcome of whether
a count variable has a zero or a positive (non-zero) value. This implies that zeroes are
generated from a structured process. If the value is positive, the "hurdle is crossed," and
the conditional distribution of the positive values is governed by a zero-truncated count
model. (Agresti, 1996)
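In its general form (a standard textbook representation; the symbols $\pi_i$ for the probability of a zero and $f(\cdot;\mu_i)$ for a Poisson or negative binomial mass function with mean $\mu_i$ are introduced here only for illustration), the hurdle probability mass function combines a point mass at zero with a zero-truncated count distribution:

$$P(Y_i = y) = \begin{cases} \pi_i, & y = 0 \\ (1 - \pi_i)\,\dfrac{f(y;\mu_i)}{1 - f(0;\mu_i)}, & y = 1, 2, \dots \end{cases}$$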
For this study, we make use of two independent steps: a binary logistic regression in the
first step and either a truncated Poisson or a truncated Negative Binomial regression in the
second step. This requires re-structuring the count data (Y) into a binary response
variable (Z).
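As a minimal sketch of this re-structuring, assuming the simulated count is stored as y_p in the data set basedata produced by the Appendix A code, the binary response can be derived with a single data step:

/* derive the binary hurdle indicator Z from the simulated count Y */
data basedata2;
   set basedata;            /* simulated data from Appendix A */
   z = (y_p > 0);           /* 1 if the hurdle is crossed, 0 otherwise */
run;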
Suppose we have independent counts $Y_i$ for $i = 1, \dots, n$ and two sets of covariates
$X_i \in \mathbb{R}^p$ and $U_i \in \mathbb{R}^p$ that may or may not be (partially) equal.
The first step of the hurdle model is defined as a logistic model that predicts the probability
of a non-zero count. The probability associated with the first step is defined by:

$$P(Y \text{ hurdles } 0 \mid X) = \frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)}$$

or, consequently,

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp(X_i'\beta)}$$
If Z is predicted to likely be a zero count, then the predicted count is zero as well.
Otherwise, the second step of the model takes place, treating the distribution of the
truncated Y as either a truncated Poisson or a truncated Negative Binomial. The estimating
model is given by:

$$\hat{Y} = U_i'\alpha$$

Schematically, the count splits into two branches: when Y = 0 (Z = 0), the zero is generated
from the covariates; when Y > 0 (Z = 1), a zero-truncated count model is a function of the
covariates (which may not be the same set).
The log-likelihood of the two-part Poisson-Logistic (zero-altered Poisson) model is written
as:

$$L(\beta, \alpha \mid Y) = \sum_{Y_i = 0} \log\!\left(\frac{1}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \log\!\left(\frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \left[\, Y_i (U_i'\alpha) - \exp(U_i'\alpha) - \log\!\left(1 - \exp(-\exp(U_i'\alpha))\right) - \log(Y_i!) \,\right] = L(\beta \mid Y) + L(\alpha \mid Y)$$
Similarly, the log-likelihood of the two-part Negative Binomial-Logistic (zero-altered
negative binomial) model can be written as:

$$L(\beta, \alpha \mid Y) = \sum_{Y_i = 0} \log\!\left(\frac{1}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \log\!\left(\frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \left[\, \sum_{j=0}^{Y_i - 1} \ln\!\left(j + \frac{\exp(U_i'\alpha)}{\theta}\right) - \ln(Y_i!) - \left(Y_i + \frac{\exp(U_i'\alpha)}{\theta}\right)\ln(1 + \theta) + Y_i \ln\theta \,\right] = L(\beta \mid Y) + L(\alpha \mid Y)$$
3.1 Data and Model
The model postulated for this study was:
Y = exp(.10+.30X1+.70X2+.20X3-.40X4-.50X5)
This model was chosen to represent an average Poisson model with five independent
variables. The data for the independent variables were simulated from a Normal distribution
with mean = 0 and stdev = 1. The dependent variable Y was simulated from a Poisson
distribution with the expected mean given above. SAS was used to generate the simulated
data and to carry out the analysis. A fixed seed value was set for random number generation.
Furthermore, zero-inflation was induced by adding three variables that drive the generation
of the excessive zeroes (see Appendix A for SAS syntax). These three variables are normally
distributed, and the probability of a structural zero is obtained by passing their linear
combination through a logistic CDF. Y is then forced to zero whenever a uniformly
distributed cut-off score falls below this probability.
In order to assess the performance of the hurdle model, the data were partitioned into a
training dataset (80%) and a testing dataset (20%). This partition is used to benchmark the
adequacy of the trained model on the test dataset at different sample sizes and scenarios.
The sample sizes generated for this particular study are 100; 1,000; 10,000; 100,000; and
1,000,000.
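The partitioning code is not shown in the paper; a minimal sketch of the 80/20 random split, assuming the working data set basedata2 from the earlier sketch and a hypothetical indicator variable named train, is:

/* randomly assign roughly 80% of cases to training and 20% to testing */
data basedata2;
   set basedata2;
   train = (rand('UNIFORM') < 0.8);   /* 1 = training set, 0 = test set */
run;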
3.2 Misspecification
The data was modified to induce multicollinearity. The independent variable X2 was
redefined in the data simulation with the formula X2 = 1.5 * X5 + 2 *
rand(NORMAL, 1, 0) in SAS. This produces strong multicollinearity between the two
variables.
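In the data-generating step this corresponds to overwriting x2 as a function of x5. A sketch is shown below, assuming the arguments of rand('NORMAL', ...) were intended as mean 0 and standard deviation 1; in the study the redefinition was done inside the Appendix A data step before Y is generated, so the stand-alone step below is only illustrative:

/* induce strong collinearity between x2 and x5 (illustrative only) */
data basedata_mc;
   set basedata;
   x2 = 1.5*x5 + 2*rand('NORMAL', 0, 1);
run;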
Another source of misspecification deliberately introduced to the dataset was the omission of
an important variable. Again, X2 was used to exemplify the scenario of removing this
significant predictor from the model.
The goal of these exercises is to assess the robustness of the hurdle model under such
circumstances. The effect of misspecification is gauged by checking the consequences on
the parameter estimates and on some measures of model fit.
3.3 Estimation Procedure
After the data simulation, the data were partitioned into training and test sets (80% and 20%
of the total sample, respectively). The model developed on the training dataset serves as the
point of reference for the test data.
Following the concept of the hurdle model, the first step predicts, from the covariates
(five independent variables), the probability that each case has a zero count or not, using
PROC LOGISTIC in SAS. After predicting the probability of each case, the cut-off score
that serves as the “hurdle” rule is set to the maximum predicted probability among the
true zeroes in order to maximize specificity.
The next step is to remove all the zeroes from the data. PROC COUNTREG is then used on
the zero-truncated data to produce Poisson and Negative Binomial (p = 1 and p = 2)
predicted counts.
Once model development is completed, the test data are scored using the two-step hurdle
model. Additionally, the misspecification scenarios are run separately to check their
impact.
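A minimal SAS sketch of this two-step estimation is given below. It assumes the working data set basedata2, with the binary indicator z and the partition flag train from the earlier sketches; the exact options used in the study may have differed.

/* Step 1: logistic regression for crossing the hurdle (z = 1) on the training data */
proc logistic data=basedata2(where=(train=1));
   model z(event='1') = x1-x5;
   output out=step1_pred p=p_hurdle;   /* predicted probability of a positive count */
run;

/* hurdle cut-off: maximum predicted probability among the true zeroes */
proc means data=step1_pred max noprint;
   where y_p = 0;
   var p_hurdle;
   output out=cutoff max=hurdle_cutoff;
run;

/* Step 2: count regression on the zero-truncated training data;
   rerun with dist=negbin(p=1) and dist=negbin(p=2) for the NB1 and NB2 fits */
proc countreg data=basedata2(where=(train=1 and y_p > 0));
   model y_p = x1-x5 / dist=poisson;
run;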
4. Results
4.1. Regular Run
The proportion of zeroes among the simulated data is approximately 66%, or two-thirds
of the data. The true mean of Y is relatively low, around 0.96, for all simulations. As the
simulated sample size increases, the maximum value of Y increases as well, indicating
right-skewness as expected.
Table 1: Descriptive Statistics of the Dependent Variable Count Y
Sample Statistics Training Data (80%) Test Data (20%) Overall
100 Average 0.95 0.74 0.78
StdDev 1.79 1.13 1.28
Min 0.00 0.00 0.00
Max 6.00 4.00 6.00
1,000 Average 0.95 0.95 0.95
StdDev 1.97 2.30 2.23
Min 0.00 0.00 0.00
Max 14.00 33.00 33.00
10,000 Average 0.99 0.95 0.95
StdDev 2.25 2.16 2.18
Min 0.00 0.00 0.00
Max 46.00 33.00 46.00
100,000 Average 0.97 0.97 0.97
StdDev 2.33 2.23 2.25
Min 0.00 0.00 0.00
Max 52.00 58.00 58.00
1,000,000 Average 0.96 0.97 0.97
StdDev 2.24 2.25 2.25
Min 0.00 0.00 0.00
Max 83.00 150.00 150.00
The parameter estimates generated during the first step of the hurdle model are all
significant for the covariates used. Table 3 shows that, even when using the very
covariates that generated the data, some misclassification remains, but only at a
minimal rate.
Table 2: Results of Logistic Regression (1st Step of Hurdle) under Training Data
Parameter Estimates
Sample Intercept x1 x2 x3 x4 x5
100 -0.5367 0.0794 0.2723 -0.1319 -0.0740 -0.3868
1,000 -0.8109 0.2597 0.4251 0.0450 -0.3009 -0.5692
10,000 -0.7688 0.1961 0.4479 0.1531 -0.2337 -0.3387
100,000 -0.7439 0.1824 0.4657 0.1421 -0.2555 -0.3331
1,000,000 -0.7372 0.1932 0.4543 0.1308 -0.2558 -0.3214
Table 3: Logistic Regression Classification Table
Sample Value of Z
Passed the Hurdle
Training Data (80%) Test Data (20%)
No Yes No Yes
100
0 14 49
> 1 6 31
1,000
0 134 1 (0.50%) 541
> 1 65 259
10,000
0 1,299 5,353 1 (0.01%)
> 1 701 2,646
100,000
0 13,306 53,111 1 (0.001%)
> 1 6,694 26,888
1,000,000
0 132,748 529,686
> 1 67,252 270,314
The second step of the model is supposed to predict the counts (Y); however, the parameter
estimates differ noticeably from the original coefficients used to simulate the data. At the
10,000-sample dataset (see Table 4), the parameter estimates appear to be the closest to the
true coefficients. The usual count data models do not respond well to the zero-truncated data.
Nevertheless, the hurdle model is able to capture the expected value of Y. Looking at
Table 5, the bias (%) decreases as the sample size increases. The test data also appear to
fit well within the true mean range.
Table 4: Parameter Estimates Comparison (2nd Step of the Hurdle Model)
Model Sample Intercept (B0 = 0.1) x1 (B1 = 0.3) x2 (B2 = 0.7) x3 (B3 = 0.2) x4 (B4 = -0.4) x5 (B5 = -0.5) Alpha
Poisson
100 0.4790 0.0886 0.3696 0.1549 -0.1805 -0.1269
1,000 0.4648 0.2423 0.6171 0.1882 -0.2351 -0.4228
10,000 0.5195 0.2223 0.5362 0.1492 -0.3031 -0.3703
100,000 0.5035 0.2297 0.5404 0.1549 -0.3118 -0.3874
1,000,000 0.5023 0.2325 0.5386 0.1569 -0.3099 -0.3865
NB1
100 0.3747 0.0944 0.3817 0.1568 -0.1824 -0.1370 0.0000
1,000 0.4648 0.2423 0.6170 0.1882 -0.2351 -0.4228 0.0000
10,000 0.5195 0.2223 0.5362 0.1492 -0.3031 -0.3703 0.0000
100,000 0.5035 0.2297 0.5404 0.1549 -0.3118 -0.3874 0.0000
1,000,000 0.5023 0.2325 0.5386 0.1569 -0.3099 -0.3865 0.0000
NB2
100 0.4790 0.0886 0.3696 0.1549 -0.1805 -0.1269 0.0000
1,000 0.4750 0.2399 0.6081 0.1854 -0.2280 -0.4164 0.0130
10,000 0.5281 0.2435 0.5255 0.1406 -0.2675 -0.3287 0.0000
100,000 0.5035 0.2297 0.5404 0.1549 -0.3118 -0.3874 0.0000
1,000,000 0.5023 0.2325 0.5386 0.1569 -0.3099 -0.3865 0.0000
Table 5: 2-Step Hurdle Model Predicted Count Accuracy
Sample Values Training Data (80%) Bias (%) Test Data (20%) Bias (%) Overall Bias (%)
100 True Mean Count 0.9500 0.7375 0.7800
 Average Poisson predicted count 0.5533 -41.8% 0.7375 0.0% 0.7007 -10.2%
 Average NB1 predicted count 0.5008 -47.3% 0.6702 -9.1% 0.6363 -18.4%
 Average NB2 predicted count 0.5533 -41.8% 0.7375 0.0% 0.7007 -10.2%
1,000 True Mean Count 0.9500 0.9475 0.9480
 Average Poisson predicted count 1.1074 16.6% 0.9475 0.0% 0.9795 3.3%
 Average NB1 predicted count 1.1074 16.6% 0.9475 0.0% 0.9795 3.3%
 Average NB2 predicted count 1.1004 15.8% 0.9445 -0.3% 0.9756 2.9%
10,000 True Mean Count 0.9920 0.9453 0.9546
 Average Poisson predicted count 1.0151 2.3% 0.9476 0.2% 0.9611 0.7%
 Average NB1 predicted count 1.0151 2.3% 0.9476 0.2% 0.9611 0.7%
 Average NB2 predicted count 0.9908 -0.1% 0.9209 -2.6% 0.9349 -2.1%
100,000 True Mean Count 0.9711 0.9654 0.9666
 Average Poisson predicted count 0.9689 -0.2% 0.9660 0.1% 0.9666 0.0%
 Average NB1 predicted count 0.9689 -0.2% 0.9660 0.1% 0.9666 0.0%
 Average NB2 predicted count 0.9689 -0.2% 0.9660 0.1% 0.9666 0.0%
1,000,000 True Mean Count 0.9627 0.9693 0.9680
 Average Poisson predicted count 0.9619 -0.1% 0.9693 0.0% 0.9678 0.0%
 Average NB1 predicted count 0.9619 -0.1% 0.9693 0.0% 0.9678 0.0%
 Average NB2 predicted count 0.9619 -0.1% 0.9693 0.0% 0.9678 0.0%
The mean absolute percentage error (MAPE) and mean absolute deviation (MAD) are the
measures used to gauge prediction accuracy. Upon checking, the error rate stays roughly
below 20%; this indicates that, even though the aggregate bias is small, there remains a
sizable discrepancy at the individual-case level. As expected, the error/deviation generally
decreases as the sample size increases.
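The paper does not state the exact formulas; the standard definitions, with $\hat{Y}_i$ the hurdle-model prediction for case $i$ and $n$ the number of cases (how zero observed counts are handled in the MAPE denominator is not specified in the text), are:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|$$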
Table 6: Model Fit Measures (Mean Absolute Percentage Error and Mean Absolute Deviation)
Sample Measure Model Training Data (80%) Test Data (20%) Grand Total
100 MAPE Poisson 13.88% 18.19% 17.33%
NB1 13.31% 16.25% 15.67%
NB2 13.88% 18.19% 17.33%
MAD Poisson 0.4789 0.2782 0.3183
NB1 0.5012 0.2770 0.3218
NB2 0.4789 0.2782 0.3183
1,000 MAPE Poisson 19.91% 18.18% 18.53%
NB1 19.91% 18.18% 18.53%
NB2 19.82% 18.19% 18.52%
MAD Poisson 0.4947 0.3921 0.4127
NB1 0.4947 0.3921 0.4127
NB2 0.4902 0.3935 0.4128
10,000 MAPE Poisson 19.52% 17.24% 17.70%
NB1 19.52% 17.24% 17.70%
NB2 19.26% 16.96% 17.42%
MAD Poisson 0.3996 0.3663 0.3730
NB1 0.3996 0.3663 0.3730
NB2 0.4048 0.3708 0.3776
100,000 MAPE Poisson 17.45% 17.35% 17.37%
NB1 17.45% 17.35% 17.37%
NB2 17.45% 17.35% 17.37%
MAD Poisson 0.3651 0.3647 0.3647
NB1 0.3651 0.3647 0.3647
NB2 0.3651 0.3647 0.3647
1,000,000 MAPE Poisson 17.52% 17.54% 17.54%
NB1 17.52% 17.54% 17.54%
NB2 17.52% 17.54% 17.54%
MAD Poisson 0.3662 0.3675 0.3672
NB1 0.3662 0.3675 0.3672
NB2 0.3662 0.3675 0.3672
4.2. With Misspecification
The parameter estimates are severely affected and unstable once an ill-conditioned
independent variable is introduced. The sample size does nothing to mitigate the damage
caused by the multicollinear covariates X2 and X5. The mean of Y, however, appears
unaffected at large sample sizes.
Table 7: Induced with multicollinearity at X2
Model Sample Intercept (B0 = 0.1) x1 (B1 = 0.3) x2* x3 (B3 = 0.2) x4 (B4 = -0.4) x5 (B5 = -0.5) Alpha
Poisson
100 1.7391 -0.0582 -0.1621 0.2433 -0.3541 0.6122
1,000 1.8104 0.3427 0.0520 0.1469 -0.2965 -0.3562
10,000 2.1982 0.1715 -0.0115 0.0854 -0.1898 -0.1822
100,000 2.0910 0.1796 0.0180 0.1707 -0.3249 -0.5244
1,000,000 2.1282 0.2235 0.0041 0.1415 -0.3048 -0.3969
NB1
100 1.7615 -0.1236 -0.1663 0.1337 -0.2593 0.4055 3.4204
1,000 1.9619 0.1622 -0.0058 0.0974 -0.1362 -0.1439 8.3611
10,000 2.2730 0.0585 -0.0005 0.0551 -0.1051 -0.0977 15.3638
100,000 2.3584 0.0712 0.0002 0.0651 -0.1004 -0.1262 18.7049
1,000,000 2.3411 0.0736 0.0010 0.0496 -0.0972 -0.1245 18.0943
NB2
100 1.7149 -0.0868 -0.1686 0.1718 -0.3822 0.4846 0.5480
1,000 1.8367 0.2874 0.0341 0.0652 -0.2898 -0.3414 0.9997
10,000 2.1916 0.1687 -0.0195 0.0573 -0.2123 -0.1921 1.4860
100,000 2.1394 0.2060 0.0107 0.1475 -0.2782 -0.4195 1.5112
1,000,000 2.1438 0.2172 0.0031 0.1371 -0.2855 -0.3679 1.5149
* Induced multicollinearity (X2 = 1.5 * X5 + 2 * rand(NORMAL, 1, 0))
Compared with the severe effect of multicollinearity, the omission of an important variable
is still tolerable. The estimate for the mean of Y also remains reasonably accurate.
Table 8: Omission of most important variable X2
Model Sample Intercept (B0 = 0.1) x1 (B1 = 0.3) x3 (B3 = 0.2) x4 (B4 = -0.4) x5 (B5 = -0.5) Alpha
Poisson
100 0.6251 -0.0707 0.0539 -0.0710 -0.0431
1,000 0.7677 0.1326 0.1496 -0.1580 -0.4463
10,000 0.7993 0.1785 0.1323 -0.2701 -0.3537
100,000 0.7979 0.2007 0.1393 -0.2737 -0.3466
1,000,000 0.7947 0.2070 0.1394 -0.2755 -0.3438
NB1
100 0.4249 -0.0635 0.0623 -0.0714 -0.0629 0.0000
1,000 0.8461 0.1178 0.1352 -0.1193 -0.3524 0.6700
10,000 0.8377 0.1626 0.1159 -0.2405 -0.3129 0.4623
100,000 0.8399 0.1781 0.1221 -0.2431 -0.3049 0.4686
1,000,000 0.8376 0.1826 0.1228 -0.2424 -0.3027 0.4704
NB2
100 0.6055 -0.0666 0.0621 -0.0690 -0.0589 0.0000
1,000 0.7884 0.1319 0.1481 -0.1431 -0.4148 0.2307
10,000 0.8113 0.1687 0.1268 -0.2523 -0.3410 0.1792
100,000 0.8109 0.1916 0.1315 -0.2602 -0.3296 0.1797
1,000,000 0.8067 0.1978 0.1330 -0.2627 -0.3283 0.1799
Consistent with the properties of the Poisson and Negative Binomial models, the truncated
models clearly do not perform well under multicollinearity. Prediction deviations balloon to
roughly nine to ten times their regular-run values, and the error rate increases by almost 50
percentage points or more (see Table 9).
Table 9: Model Fit Comparison under different scenarios
Sample Measure Model regular run with MC problem omission of X2
100
MAPE
Poisson 17.33% 62.03% 21.16%
NB1 15.67% 61.52% 17.55%
NB2 17.33% 57.89% 20.73%
MAD
Poisson 0.3183 1.4536 0.4004
NB1 0.3218 1.5021 0.4016
NB2 0.3183 1.4254 0.4004
1,000
MAPE
Poisson 18.53% 81.53% 26.77%
NB1 18.53% 84.53% 27.17%
NB2 18.52% 81.67% 26.65%
MAD
Poisson 0.4127 2.3506 0.5865
NB1 0.4127 2.3245 0.5804
NB2 0.4128 2.3290 0.5822
10,000
MAPE
Poisson 17.70% 107.78% 25.53%
NB1 17.70% 111.52% 25.90%
NB2 17.42% 107.87% 25.52%
MAD
Poisson 0.3730 3.6091 0.5351
NB1 0.3730 3.6383 0.5367
NB2 0.3776 3.6171 0.5344
100,000
MAPE
Poisson 17.37% 116.45% 25.49%
NB1 17.37% 129.40% 25.90%
NB2 17.37% 115.83% 25.47%
MAD
Poisson 0.3647 3.9364 0.5367
NB1 0.3647 3.9887 0.5389
NB2 0.3647 3.8788 0.5360
1,000,000
MAPE
Poisson 17.54% 115.64% 25.54%
NB1 17.54% 126.53% 25.96%
NB2 17.54% 115.38% 25.52%
MAD
Poisson 0.3672 3.8658 0.5368
NB1 0.3672 3.9341 0.5385
NB2 0.3672 3.8476 0.5359
5. Conclusion
The logistic regression model in the first step manages to classify correctly and to mitigate the
effects of misspecification throughout the simulation analysis (keeping an error rate of < 1%).
The mean of the true count is predicted accurately enough despite the presence of
multicollinearity (within 2%). Upon inspecting the AIC criterion, the Poisson regression and
NB1 maintain almost the same value. For the sake of simplicity, the simulated data still favor
the Poisson regression as the 2nd step of the hurdle model.
The hurdle model behaves similarly to the Poisson and negative binomial models in that it is
greatly affected by ill-conditioned covariates. Omission of an important variable does not have
as worrisome an effect as multicollinearity.
Using the hurdle model on live data can gauge the reliability of the trained model, assuming
that the test data come from the same data-generating process. The bias in the average count
for the test data only went as high as 9.1%, at the smallest sample size. Increasing the number
of cases studied decreases the discrepancy.
The analysis produced is limited to Poisson data with a low mean. It does not include the
covariates that led to the zero-inflation; incorporating them could further improve the
predictions and estimates.
6. References
Cameron, A. and Trivedi, P. (1986). Econometric models based on count data, comparisons
and applications of some estimators. Journal of Applied Econometrics, 1, 29–53.
Ridout, M., Demétrio, C. G. B., and Hinde, J. (1998). Models for count data with many
zeros. In Proceedings of the 19th International Biometrics Conference, Cape Town, pp.
179–190.
Min, Y. and Agresti, A. (2002). Modeling nonnegative data with clumping at zero: A
survey. Journal of the Iranian Statistical Society, 1,(1-2), 7–33.
Winkelmann, R. (2004). Health care reform and the number of doctor visits—an econometric
analysis. Journal of Applied Econometrics, 19, 455–472.
Mullahy, J. (1986). Specification and testing of some modified count data
models. Journal of Econometrics, 33, 341–365.
Bohning, D., Dietz, E., Schlattmann, P., Mendonca, L., and Kirchner, U. (1999). The
zero-inflated poisson model and the decayed, missing and filled teeth index in dental
epidemiology. Journal of the Royal Statistical Society. Series A (Statistics in Society), 162,
195–209.
Welsh, A., Cunningham, R., Donnelly, C., and Lindenmayer, D. (1996). Modelling the
abundance of rare species : Statistical models for counts with extra zeros. Ecological
Modelling, 88, 297–308.
Lambert, D. (1992). Zero-inflated poisson regression with an application to defects in
manufacturing. Technometrics, 34, 1–14.
Deb, P. and Trivedi, P. (1997). Demand for medical care by the elderly: A finite mixture
approach. Journal of Applied Econometrics, 12, 313–336.
Moffatt, P. (2003). Hurdle models of loan default. In a Conference at the Credit Research
Center, University of Edinburgh, UK.
Boucher, J.-P., Denuit, M., and Guillen, M. (2006). Modelisation of claim count with
hurdle distribution for panel data. In Proceedings of the International Conference on
Mathematical and Statistical Modeling in Honor of Enrique Castillo.
Hu, M.-C., Pavlicova, M., and Nunes, E. V. (2011). Zero-inflated and hurdle models of count
data with extra zeros: Examples from an HIV-risk reduction intervention trial.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3238139/
Cantoni, E. and Zedini, A. (2009). A robust version of the hurdle model.
http://www.unige.ch/ses/metri/cahiers/2009_07.pdf
Bethell, J., Rhodes, A. E., Bondy, S. J., Lou, W. Y. W., and Guttmann, A. (2010). Repeat
self-harm: Application of hurdle models. The British Journal of Psychiatry.
http://bjp.rcpsych.org/content/196/3/243
Baetschmann, G. and Winkelmann, R. (2015). A dynamic hurdle model for zero-inflated
count data.
https://www.econ.uzh.ch/dam/jcr:ffffffff-a477-8018-ffff-ffffabad53fc/Dynamic_Hurdle.pdf
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: John Wiley &
Sons, Inc.
Barrios, E. (2015). Lectures on Overdispersion.
APPENDIX A: SAS code syntax to generate simulated data
%let sample = 100000; /* number of simulated cases: 100, 1000, 10000, 100000, or 1000000 */
data basedata;
call streaminit(123);
array vars x1-x5;
array zero_vars z1-z3;
array parms{5} (.3 .7 .2 -.4 -.5);
array zero_parms{3} (-.3 .1 .2);
intercept=.1;
z_intercept=-.1;
do i=1 to &sample;
/*parameter initialization for non-zero covariates*/
sum_xb=0;
sum_gz=0;
do j=1 to 5;
vars[j]=rand('NORMAL',0,1);
sum_xb=sum_xb+parms[j]*vars[j];
end;
mu=exp(intercept+sum_xb);
y_p=rand('POISSON', mu);
/*induce zeroes by some z1-z3 variables*/
do j=1 to 3;
zero_vars[j]=rand('NORMAL',0,1);
sum_gz = sum_gz+zero_parms[j]*zero_vars[j];
end;
z_gamma = z_intercept+sum_gz;
pzero = cdf('LOGISTIC',z_gamma);
cut=rand('UNIFORM');
if cut<pzero then y_p=0;
output;
end;
keep y_p x1-x5 z1-z3;
run;
