Simulation Study on Hurdle Model Performance on Zero Inflated
Count Data
Adrian Daniel D. Camacho
The use of hurdle models has become prevalent in many fields, especially
where datasets contain an excessive number of zeroes. This simulation study
showed that hurdle models are robust to the omission of even significant
predictors. However, the performance of hurdle models deteriorates under
multicollinearity, with error rates rising by almost 50 percentage points or
more and parameter estimates becoming biased. The simulation analysis was
carried out on samples of 100 to 1,000,000 cases.
Keywords: hurdle model, zero-inflation, binary logistic regression,
truncated poisson, truncated negative binomial
1. Introduction
In practice, we encounter different data-generating processes that give rise to excess zeroes.
Data with excessive zeroes can be either structural or merely caused by
sampling variation. Some examples of such scenarios are counting a specific strain of virus
developed in a certain environment, the number of website visits for small retail sites, or even
counting the instances that a consigned product is bought from a coffee shop.
Poisson and Binomial distributions can accommodate such scenarios of zero occurrences,
but only to a certain extent. They can be modified so that the model adjusts for structural
zeroes. There are also approaches that truncate the zeroes; however, this poses a large loss of
information. This is where we have to strike a balance between including and excluding the zeroes
from the data modeling. Usually we consider hurdle or zero-inflated models when we encounter
count data with an outsized percentage of true zeroes. These models were developed to cope
with excessive zeroes; nevertheless, they have different characteristics. Zero-inflated
models (Poisson or Negative Binomial) assume two possible origins of the zeroes: either
the data structure or sampling variation. In hurdle models, on the other hand, all
zeroes are assumed to be structural. This being said, these models can behave differently and produce
different results when compared. One should take note of how the data-generating process is
designed and how the outcome is observed.
This paper aims to demonstrate the performance of hurdle models on zero-inflated count
data. The simulation study focuses on different scenarios of sample size,
multicollinearity, dropping important predictors, and other model adequacy factors.
The analysis is restricted to two-part hurdle models, namely the logistic-Poisson
and logistic-Negative Binomial models.
2. Related Literature
The use of hurdle models has been prevalent in many practices. The wide range of
applications of these models to the excessive-zero problem has produced a substantial body
of literature. Papers associated with the use of such models include, for
example, Cameron and Trivedi (1986), Ridout et al. (1998), and Min and Agresti (2002).
Applications have spanned econometrics (Winkelmann, 2004; Mullahy, 1986),
epidemiology (Bohning et al., 1999), ecology (Welsh et al., 1996), manufacturing
defects (Lambert, 1992), medical care (Deb and Trivedi, 1997), banking (Moffatt, 2003),
and insurance (Boucher et al., 2006).
Despite their similar structure, zero-inflated models and hurdle models differ considerably.
For instance, Hu, Pavlicova, and Nunes (2011) studied the differences between these
distributions and models and explored how to compare different count data models using data
from a multi-site clinical trial of behavioral interventions to reduce episodes of HIV risk
behaviour. Their findings conclude that zero-inflated models fit better than the corresponding
hurdle models. In their data, some participants scored zero unprotected sexual occasions
because they had no sexual partners, while others had sexual partners but scored zero because
they did not engage in high-risk sex. The example indicates the need to consider two sources
of zero observations: “sampling zeros” that are part of the underlying sampling distribution
(Poisson or negative binomial) and “structural zeros” that cannot score anything other than
zero. Hence, the design of a clinical trial is crucial to the choice of model in the event that fit
statistics do not identify a clear best fit.
Hurdle models are nevertheless useful if the assumptions and experimental design are aligned
with the data-generating process. Many analysts still advocate this model and continue to
improve its performance. A robust version of the hurdle model (Cantoni and Zedini, 2009)
was created to address the frequency of gross errors and the complexity intrinsic to some
of the phenomena considered, which may render the classical model unreliable and too limiting.
Hurdle models were also used to model repeat self-harm (Bethell et al., 2010). The first step
tests factors associated with any repetition (repeaters versus non-repeaters) and the second
part tests factors associated with the number of presentations (among repeaters). Hurdle
models are shown to be more informative than traditional binary analyses, and they also fit
these data adequately relative to some other count models.
The dynamic hurdle model for zero-inflated count data (Baetschmann and Winkelmann, 2015)
offers a new explanation of the extra zeroes encountered in many empirical count data
applications, relating them to the underlying stochastic process that generates events. In
their study, it was assumed that the process has two rates, a lower rate until the first event
and a higher rate thereafter. Using this concept, they applied the new approach to the
socio-economic determinants of the individual number of doctor visits in Germany.
3. Methodological Sketch
Hurdle Model (2-Step)
A hurdle model is a modified count model in which there are two main processes, one
generating the zeros and one generating the positive values. The concept underlying the
hurdle model is that a binomial probability model governs the binary outcome of whether
a count variable has a zero or a positive (non-zero) value. This implies that zeroes are
generated from a structured process. If the value is positive, the "hurdle is crossed," and
the conditional distribution of the positive values is governed by a zero-truncated count
model. (Agresti, 1996)
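In its general form (a standard textbook representation; the symbols $\pi_i$ for the probability of a zero and $f(\cdot;\mu_i)$ for a Poisson or negative binomial mass function with mean $\mu_i$ are introduced here only for illustration), the hurdle probability mass function combines a point mass at zero with a zero-truncated count distribution:

$$P(Y_i = y) = \begin{cases} \pi_i, & y = 0 \\ (1 - \pi_i)\,\dfrac{f(y;\mu_i)}{1 - f(0;\mu_i)}, & y = 1, 2, \dots \end{cases}$$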
For this study, we make use of two independent steps: a binary logistic regression in the
first step and either a truncated Poisson or a truncated Negative Binomial regression in the
second step. This requires re-structuring the count data (Y) into a binary response
variable (Z).
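As a minimal sketch of this re-structuring, assuming the simulated count is stored as y_p in the data set basedata produced by the Appendix A code, the binary response can be derived with a single data step:

/* derive the binary hurdle indicator Z from the simulated count Y */
data basedata2;
   set basedata;            /* simulated data from Appendix A */
   z = (y_p > 0);           /* 1 if the hurdle is crossed, 0 otherwise */
run;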
Suppose we have independent counts $Y_i$ for $i = 1, \dots, n$ and two sets of covariates
$X_i \in \mathbb{R}^p$ and $U_i \in \mathbb{R}^p$ that may or may not be (partially) equal.
The first step of the hurdle model is defined as a logistic model that predicts the probability
of a non-zero count. The probability associated with the first step is defined by:

$$P(Y \text{ hurdles } 0 \mid X) = \frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)}$$

or, consequently,

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp(X_i'\beta)}$$
If Z is predicted to likely be a zero count, then the predicted count is zero as well.
Otherwise, the second step of the model takes place, treating the distribution of the
truncated Y as either a truncated Poisson or a truncated Negative Binomial. The estimating
model is given by:

$$\hat{Y} = U_i'\alpha$$

Schematically, the count splits into two branches: when Y = 0 (Z = 0), the zero is generated
from the covariates; when Y > 0 (Z = 1), a zero-truncated count model is a function of the
covariates (which may not be the same set).
The log-likelihood of the two-part Poisson-Logistic (zero-altered Poisson) model is written
as:

$$L(\beta, \alpha \mid Y) = \sum_{Y_i = 0} \log\!\left(\frac{1}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \log\!\left(\frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \left[\, Y_i (U_i'\alpha) - \exp(U_i'\alpha) - \log\!\left(1 - \exp(-\exp(U_i'\alpha))\right) - \log(Y_i!) \,\right] = L(\beta \mid Y) + L(\alpha \mid Y)$$
Similarly, the log-likelihood of the two-part Negative Binomial-Logistic (zero-altered
negative binomial) model can be written as:

$$L(\beta, \alpha \mid Y) = \sum_{Y_i = 0} \log\!\left(\frac{1}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \log\!\left(\frac{\exp(X_i'\beta)}{1 + \exp(X_i'\beta)}\right) + \sum_{Y_i > 0} \left[\, \sum_{j=0}^{Y_i - 1} \ln\!\left(j + \frac{\exp(U_i'\alpha)}{\theta}\right) - \ln(Y_i!) - \left(Y_i + \frac{\exp(U_i'\alpha)}{\theta}\right)\ln(1 + \theta) + Y_i \ln\theta \,\right] = L(\beta \mid Y) + L(\alpha \mid Y)$$
3.1 Data and Model
The model postulated for this study was:
Y = exp(.10+.30X1+.70X2+.20X3-.40X4-.50X5)
This model was chosen to represent an average Poisson model with five independent
variables. The data for the independent variables were simulated from a Normal distribution
with mean = 0 and stdev = 1. The dependent variable Y was simulated from a Poisson
distribution with the expected mean given above. SAS was used to generate the simulated
data and to carry out the analysis. A fixed seed value was set for random number generation.
Furthermore, zero-inflation was induced by adding three variables that drive the generation
of the excessive zeroes (see Appendix A for SAS syntax). These three variables are normally
distributed, and the probability of a structural zero is obtained by passing their linear
combination through a logistic CDF. Y is then forced to zero whenever a uniformly
distributed cut-off score falls below this probability.
In order to assess the performance of the hurdle model, the data were partitioned into a
training dataset (80%) and a testing dataset (20%). This partition is used to benchmark the
adequacy of the trained model on the test dataset at different sample sizes and scenarios.
The sample sizes generated for this particular study are 100; 1,000; 10,000; 100,000; and
1,000,000.
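The partitioning code is not shown in the paper; a minimal sketch of the 80/20 random split, assuming the working data set basedata2 from the earlier sketch and a hypothetical indicator variable named train, is:

/* randomly assign roughly 80% of cases to training and 20% to testing */
data basedata2;
   set basedata2;
   train = (rand('UNIFORM') < 0.8);   /* 1 = training set, 0 = test set */
run;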
3.2 Misspecification
The data was modified to induce multicollinearity. The independent variable X2 was
redefined in the data simulation with the formula X2 = 1.5 * X5 + 2 *
rand(NORMAL, 1, 0) in SAS. This produces strong multicollinearity between the two
variables.
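In the data-generating step this corresponds to overwriting x2 as a function of x5. A sketch is shown below, assuming the arguments of rand('NORMAL', ...) were intended as mean 0 and standard deviation 1; in the study the redefinition was done inside the Appendix A data step before Y is generated, so the stand-alone step below is only illustrative:

/* induce strong collinearity between x2 and x5 (illustrative only) */
data basedata_mc;
   set basedata;
   x2 = 1.5*x5 + 2*rand('NORMAL', 0, 1);
run;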
Another source of misspecification deliberately introduced to the dataset was the omission of
an important variable. Again, X2 was used to exemplify the scenario of removing this
significant predictor from the model.
The goal of these exercises is to assess the robustness of the hurdle model under such
circumstances. The effect of misspecification is gauged by checking the consequences on
the parameter estimates and on some measures of model fit.
3.3 Estimation Procedure
After the data simulation, the data were partitioned into training and test sets (80% and 20%
of the total sample, respectively). The model developed on the training dataset serves as the
point of reference for the test data.
Following the concept of the hurdle model, the first step predicts, from the covariates
(five independent variables), the probability that each case has a zero count or not, using
PROC LOGISTIC in SAS. After predicting the probability of each case, the cut-off score
that serves as the “hurdle” rule is set to the maximum predicted probability among the
true zeroes in order to maximize specificity.
The next step is to remove all the zeroes from the data. PROC COUNTREG is then used on
the zero-truncated data to produce Poisson and Negative Binomial (p = 1 and p = 2)
predicted counts.
Once model development is completed, the test data are scored using the two-step hurdle
model. Additionally, the misspecification scenarios are run separately to check their
impact.
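A minimal SAS sketch of this two-step estimation is given below. It assumes the working data set basedata2, with the binary indicator z and the partition flag train from the earlier sketches; the exact options used in the study may have differed.

/* Step 1: logistic regression for crossing the hurdle (z = 1) on the training data */
proc logistic data=basedata2(where=(train=1));
   model z(event='1') = x1-x5;
   output out=step1_pred p=p_hurdle;   /* predicted probability of a positive count */
run;

/* hurdle cut-off: maximum predicted probability among the true zeroes */
proc means data=step1_pred max noprint;
   where y_p = 0;
   var p_hurdle;
   output out=cutoff max=hurdle_cutoff;
run;

/* Step 2: count regression on the zero-truncated training data;
   rerun with dist=negbin(p=1) and dist=negbin(p=2) for the NB1 and NB2 fits */
proc countreg data=basedata2(where=(train=1 and y_p > 0));
   model y_p = x1-x5 / dist=poisson;
run;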
4. Results
4.1. Regular Run
The proportion of zeroes among the simulated data is approximately 66%, or two-thirds
of the data. The true mean of Y is relatively low, around 0.96, for all simulations. As the
simulated sample size increases, the maximum value of Y increases as well, indicating
right-skewness as expected.
Table 1: Descriptive Statistics of the Dependent Variable Count Y
Sample Statistics Training Data (80%) Test Data (20%) Overall
100 Average 0.95 0.74 0.78
StdDev 1.79 1.13 1.28
Min 0.00 0.00 0.00
Max 6.00 4.00 6.00
1,000 Average 0.95 0.95 0.95
StdDev 1.97 2.30 2.23
Min 0.00 0.00 0.00
Max 14.00 33.00 33.00
10,000 Average 0.99 0.95 0.95
StdDev 2.25 2.16 2.18
Min 0.00 0.00 0.00
Max 46.00 33.00 46.00
100,000 Average 0.97 0.97 0.97
StdDev 2.33 2.23 2.25
Min 0.00 0.00 0.00
Max 52.00 58.00 58.00
1,000,000 Average 0.96 0.97 0.97
StdDev 2.24 2.25 2.25
Min 0.00 0.00 0.00
Max 83.00 150.00 150.00
The parameter estimates generated during the first step of the hurdle model are all
significant for the covariates used. Table 3 shows that, even when using the very
covariates that generated the data, some misclassification remains, but only at a
minimal rate.
Table 2: Results of Logistic Regression (1st Step of Hurdle) under Training Data
Parameter Estimates
Sample Intercept x1 x2 x3 x4 x5
100 -0.5367 0.0794 0.2723 -0.1319 -0.0740 -0.3868
1,000 -0.8109 0.2597 0.4251 0.0450 -0.3009 -0.5692
10,000 -0.7688 0.1961 0.4479 0.1531 -0.2337 -0.3387
100,000 -0.7439 0.1824 0.4657 0.1421 -0.2555 -0.3331
1,000,000 -0.7372 0.1932 0.4543 0.1308 -0.2558 -0.3214
Table 3: Logistic Regression Classification Table
Sample Value of Z
Passed the Hurdle
Training Data (80%) Test Data (20%)
No Yes No Yes
100
0 14 49
> 1 6 31
1,000
0 134 1 (0.50%) 541
> 1 65 259
10,000
0 1,299 5,353 1 (0.01%)
> 1 701 2,646
100,000
0 13,306 53,111 1 (0.001%)
> 1 6,694 26,888
1,000,000
0 132,748 529,686
> 1 67,252 270,314
The second step of the model is supposed to predict the counts (Y); however, the parameter
estimates differ noticeably from the original coefficients used to simulate the data. At the
10,000-sample dataset (see Table 4), the parameter estimates appear to be the closest to the
true coefficients. The usual count data models do not respond well to the zero-truncated data.
Nevertheless, the hurdle model is able to capture the expected value of Y. Looking at
Table 5, the bias (%) decreases as the sample size increases. The test data also appear to
fit well within the true mean range.
Table 4: Parameter Estimates Comparison (2nd Step of the Hurdle Model)
Model Sample Intercept (B0 = 0.1) x1 (B1 = 0.3) x2 (B2 = 0.7) x3 (B3 = 0.2) x4 (B4 = -0.4) x5 (B5 = -0.5) Alpha
Poisson
100 0.4790 0.0886 0.3696 0.1549 -0.1805 -0.1269
1,000 0.4648 0.2423 0.6171 0.1882 -0.2351 -0.4228
10,000 0.5195 0.2223 0.5362 0.1492 -0.3031 -0.3703
100,000 0.5035 0.2297 0.5404 0.1549 -0.3118 -0.3874
1,000,000 0.5023 0.2325 0.5386 0.1569 -0.3099 -0.3865
NB1
100 0.3747 0.0944 0.3817 0.1568 -0.1824 -0.1370 0.0000
1,000 0.4648 0.2423 0.6170 0.1882 -0.2351 -0.4228 0.0000
10,000 0.5195 0.2223 0.5362 0.1492 -0.3031 -0.3703 0.0000
100,000 0.5035 0.2297 0.5404 0.1549 -0.3118 -0.3874 0.0000
1,000,000 0.5023 0.2325 0.5386 0.1569 -0.3099 -0.3865 0.0000
NB2
100 0.4790 0.0886 0.3696 0.1549 -0.1805 -0.1269 0.0000
1,000 0.4750 0.2399 0.6081 0.1854 -0.2280 -0.4164 0.0130
10,000 0.5281 0.2435 0.5255 0.1406 -0.2675 -0.3287 0.0000
100,000 0.5035 0.2297 0.5404 0.1549 -0.3118 -0.3874 0.0000
1,000,000 0.5023 0.2325 0.5386 0.1569 -0.3099 -0.3865 0.0000
Table 5: 2-Step Hurdle Model Predicted Count Accuracy
Sample Values Training Data (80%) Bias (%) Test Data (20%) Bias (%) Overall Bias (%)
100 True Mean Count 0.9500 0.7375 0.7800
 Average Poisson predicted count 0.5533 -41.8% 0.7375 0.0% 0.7007 -10.2%
 Average NB1 predicted count 0.5008 -47.3% 0.6702 -9.1% 0.6363 -18.4%
 Average NB2 predicted count 0.5533 -41.8% 0.7375 0.0% 0.7007 -10.2%
1,000 True Mean Count 0.9500 0.9475 0.9480
 Average Poisson predicted count 1.1074 16.6% 0.9475 0.0% 0.9795 3.3%
 Average NB1 predicted count 1.1074 16.6% 0.9475 0.0% 0.9795 3.3%
 Average NB2 predicted count 1.1004 15.8% 0.9445 -0.3% 0.9756 2.9%
10,000 True Mean Count 0.9920 0.9453 0.9546
 Average Poisson predicted count 1.0151 2.3% 0.9476 0.2% 0.9611 0.7%
 Average NB1 predicted count 1.0151 2.3% 0.9476 0.2% 0.9611 0.7%
 Average NB2 predicted count 0.9908 -0.1% 0.9209 -2.6% 0.9349 -2.1%
100,000 True Mean Count 0.9711 0.9654 0.9666
 Average Poisson predicted count 0.9689 -0.2% 0.9660 0.1% 0.9666 0.0%
 Average NB1 predicted count 0.9689 -0.2% 0.9660 0.1% 0.9666 0.0%
 Average NB2 predicted count 0.9689 -0.2% 0.9660 0.1% 0.9666 0.0%
1,000,000 True Mean Count 0.9627 0.9693 0.9680
 Average Poisson predicted count 0.9619 -0.1% 0.9693 0.0% 0.9678 0.0%
 Average NB1 predicted count 0.9619 -0.1% 0.9693 0.0% 0.9678 0.0%
 Average NB2 predicted count 0.9619 -0.1% 0.9693 0.0% 0.9678 0.0%
The mean absolute percentage error (MAPE) and mean absolute deviation (MAD) are the
measures used to gauge prediction accuracy. Upon checking, the error rate stays roughly
below 20%; this indicates that, even though the aggregate bias is small, there remains a
sizable discrepancy at the individual-case level. As expected, the error/deviation generally
decreases as the sample size increases.
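The paper does not state the exact formulas; the standard definitions, with $\hat{Y}_i$ the hurdle-model prediction for case $i$ and $n$ the number of cases (how zero observed counts are handled in the MAPE denominator is not specified in the text), are:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|$$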
Table 6: Model Fit Measures (Mean Absolute Percentage Error and Mean Absolute Deviation)
Sample Measure Model Training Data (80%) Test Data (20%) Grand Total
100 MAPE Poisson 13.88% 18.19% 17.33%
NB1 13.31% 16.25% 15.67%
NB2 13.88% 18.19% 17.33%
MAD Poisson 0.4789 0.2782 0.3183
NB1 0.5012 0.2770 0.3218
NB2 0.4789 0.2782 0.3183
1,000 MAPE Poisson 19.91% 18.18% 18.53%
NB1 19.91% 18.18% 18.53%
NB2 19.82% 18.19% 18.52%
MAD Poisson 0.4947 0.3921 0.4127
NB1 0.4947 0.3921 0.4127
NB2 0.4902 0.3935 0.4128
10,000 MAPE Poisson 19.52% 17.24% 17.70%
NB1 19.52% 17.24% 17.70%
NB2 19.26% 16.96% 17.42%
MAD Poisson 0.3996 0.3663 0.3730
NB1 0.3996 0.3663 0.3730
NB2 0.4048 0.3708 0.3776
100,000 MAPE Poisson 17.45% 17.35% 17.37%
NB1 17.45% 17.35% 17.37%
NB2 17.45% 17.35% 17.37%
MAD Poisson 0.3651 0.3647 0.3647
NB1 0.3651 0.3647 0.3647
NB2 0.3651 0.3647 0.3647
1,000,000 MAPE Poisson 17.52% 17.54% 17.54%
NB1 17.52% 17.54% 17.54%
NB2 17.52% 17.54% 17.54%
MAD Poisson 0.3662 0.3675 0.3672
NB1 0.3662 0.3675 0.3672
NB2 0.3662 0.3675 0.3672
4.2. With Misspecification
The parameter estimates are severely affected and unstable once an ill-conditioned
independent variable is introduced. The sample size does nothing to mitigate the damage
caused by the multicollinear covariates X2 and X5. The mean of Y, however, appears
unaffected at large sample sizes.
Table 7: Induced with multicollinearity at X2
Model Sample Intercept (B0 = 0.1) x1 (B1 = 0.3) x2* x3 (B3 = 0.2) x4 (B4 = -0.4) x5 (B5 = -0.5) Alpha
Poisson
100 1.7391 -0.0582 -0.1621 0.2433 -0.3541 0.6122
1,000 1.8104 0.3427 0.0520 0.1469 -0.2965 -0.3562
10,000 2.1982 0.1715 -0.0115 0.0854 -0.1898 -0.1822
100,000 2.0910 0.1796 0.0180 0.1707 -0.3249 -0.5244
1,000,000 2.1282 0.2235 0.0041 0.1415 -0.3048 -0.3969
NB1
100 1.7615 -0.1236 -0.1663 0.1337 -0.2593 0.4055 3.4204
1,000 1.9619 0.1622 -0.0058 0.0974 -0.1362 -0.1439 8.3611
10,000 2.2730 0.0585 -0.0005 0.0551 -0.1051 -0.0977 15.3638
100,000 2.3584 0.0712 0.0002 0.0651 -0.1004 -0.1262 18.7049
1,000,000 2.3411 0.0736 0.0010 0.0496 -0.0972 -0.1245 18.0943
NB2
100 1.7149 -0.0868 -0.1686 0.1718 -0.3822 0.4846 0.5480
1,000 1.8367 0.2874 0.0341 0.0652 -0.2898 -0.3414 0.9997
10,000 2.1916 0.1687 -0.0195 0.0573 -0.2123 -0.1921 1.4860
100,000 2.1394 0.2060 0.0107 0.1475 -0.2782 -0.4195 1.5112
1,000,000 2.1438 0.2172 0.0031 0.1371 -0.2855 -0.3679 1.5149
* Induced multicollinearity (X2 = 1.5 * X5 + 2 * rand(NORMAL, 1, 0))
Compared with the severe effect of multicollinearity, the omission of an important variable
is still tolerable. The estimate for the mean of Y also remains reasonably accurate.
Table 8: Omission of most important variable X2
Model Sample Intercept (B0 = 0.1) x1 (B1 = 0.3) x3 (B3 = 0.2) x4 (B4 = -0.4) x5 (B5 = -0.5) Alpha
Poisson
100 0.6251 -0.0707 0.0539 -0.0710 -0.0431
1,000 0.7677 0.1326 0.1496 -0.1580 -0.4463
10,000 0.7993 0.1785 0.1323 -0.2701 -0.3537
100,000 0.7979 0.2007 0.1393 -0.2737 -0.3466
1,000,000 0.7947 0.2070 0.1394 -0.2755 -0.3438
NB1
100 0.4249 -0.0635 0.0623 -0.0714 -0.0629 0.0000
1,000 0.8461 0.1178 0.1352 -0.1193 -0.3524 0.6700
10,000 0.8377 0.1626 0.1159 -0.2405 -0.3129 0.4623
100,000 0.8399 0.1781 0.1221 -0.2431 -0.3049 0.4686
1,000,000 0.8376 0.1826 0.1228 -0.2424 -0.3027 0.4704
NB2
100 0.6055 -0.0666 0.0621 -0.0690 -0.0589 0.0000
1,000 0.7884 0.1319 0.1481 -0.1431 -0.4148 0.2307
10,000 0.8113 0.1687 0.1268 -0.2523 -0.3410 0.1792
100,000 0.8109 0.1916 0.1315 -0.2602 -0.3296 0.1797
1,000,000 0.8067 0.1978 0.1330 -0.2627 -0.3283 0.1799
Consistent with the properties of the Poisson and Negative Binomial models, the truncated
models clearly do not perform well under multicollinearity. Prediction deviations balloon to
roughly nine to ten times their regular-run values, and the error rate increases by almost 50
percentage points or more (see Table 9).
Table 9: Model Fit Comparison under different scenarios
Sample Measure Model regular run with MC problem omission of X2
100
MAPE
Poisson 17.33% 62.03% 21.16%
NB1 15.67% 61.52% 17.55%
NB2 17.33% 57.89% 20.73%
MAD
Poisson 0.3183 1.4536 0.4004
NB1 0.3218 1.5021 0.4016
NB2 0.3183 1.4254 0.4004
1,000
MAPE
Poisson 18.53% 81.53% 26.77%
NB1 18.53% 84.53% 27.17%
NB2 18.52% 81.67% 26.65%
MAD
Poisson 0.4127 2.3506 0.5865
NB1 0.4127 2.3245 0.5804
NB2 0.4128 2.3290 0.5822
10,000
MAPE
Poisson 17.70% 107.78% 25.53%
NB1 17.70% 111.52% 25.90%
NB2 17.42% 107.87% 25.52%
MAD
Poisson 0.3730 3.6091 0.5351
NB1 0.3730 3.6383 0.5367
NB2 0.3776 3.6171 0.5344
100,000
MAPE
Poisson 17.37% 116.45% 25.49%
NB1 17.37% 129.40% 25.90%
NB2 17.37% 115.83% 25.47%
MAD
Poisson 0.3647 3.9364 0.5367
NB1 0.3647 3.9887 0.5389
NB2 0.3647 3.8788 0.5360
1,000,000
MAPE
Poisson 17.54% 115.64% 25.54%
NB1 17.54% 126.53% 25.96%
NB2 17.54% 115.38% 25.52%
MAD
Poisson 0.3672 3.8658 0.5368
NB1 0.3672 3.9341 0.5385
NB2 0.3672 3.8476 0.5359
5. Conclusion
The logistic regression model in the first step manages to classify correctly and to mitigate the
effects of misspecification throughout the simulation analysis (keeping an error rate of < 1%).
The mean of the true count is predicted accurately enough despite the presence of
multicollinearity (within 2%). Upon inspecting the AIC criterion, the Poisson regression and
NB1 maintain almost the same value. For the sake of simplicity, the simulated data still favor
the Poisson regression as the 2nd step of the hurdle model.
The hurdle model behaves similarly to the Poisson and negative binomial models in that it is
greatly affected by ill-conditioned covariates. Omission of an important variable does not have
as worrisome an effect as multicollinearity.
Using the hurdle model on live data can gauge the reliability of the trained model, assuming
that the test data come from the same data-generating process. The bias in the average count
for the test data only went as high as 9.1%, at the smallest sample size. Increasing the number
of cases studied decreases the discrepancy.
The analysis produced is limited to Poisson data with a low mean. It does not include the
covariates that led to the zero-inflation; incorporating them could further improve the
predictions and estimates.
6. References
Cameron, A. and Trivedi, P. (1986). Econometric models based on count data, comparisons
and applications of some estimators. Journal of Applied Econometrics, 1, 29–53.
Ridout, M., Demétrio, C. G. B., and Hinde, J. (1998). Models for count data with many
zeros. In Proceedings of the 19th International Biometrics Conference, Cape Town, pp.
179–190.
Min, Y. and Agresti, A. (2002). Modeling nonnegative data with clumping at zero: A
survey. Journal of the Iranian Statistical Society, 1,(1-2), 7–33.
Winkelmann, R. (2004). Health care reform and the number of doctor visits—an econometric
analysis. Journal of Applied Econometrics, 19, 455–472.
Mullahy, J. (1986). Specification and testing of some modified count data
models. Journal of Econometrics, 33, 341–365.
Bohning, D., Dietz, E., Schlattmann, P., Mendonca, L., and Kirchner, U. (1999). The
zero-inflated poisson model and the decayed, missing and filled teeth index in dental
epidemiology. Journal of the Royal Statistical Society. Series A (Statistics in Society), 162,
195–209.
Welsh, A., Cunningham, R., Donnelly, C., and Lindenmayer, D. (1996). Modelling the
abundance of rare species : Statistical models for counts with extra zeros. Ecological
Modelling, 88, 297–308.
Lambert, D. (1992). Zero-inflated poisson regression with an application to defects in
manufacturing. Technometrics, 34, 1–14.
Deb, P. and Trivedi, P. (1997). Demand for medical care by the elderly: A finite mixture
approach. Journal of Applied Econometrics, 12, 313–336.
Moffatt, P. (2003). Hurdle models of loan default. In a Conference at the Credit Research
Center, University of Edinburgh, UK.
Boucher, J.-P., Denuit, M., and Guillen, M. (2006). Modelisation of claim count with
hurdle distribution for panel data. In Proceedings of the International Conference on
Mathematical and Statistical Modeling in Honor of Enrique Castillo.
Hu, M.-C., Pavlicova, M., and Nunes, E. V. (2011). Zero-inflated and hurdle models of count
data with extra zeros: Examples from an HIV-risk reduction intervention trial.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3238139/
Cantoni, E. and Zedini, A. (2009). A robust version of the hurdle model.
http://www.unige.ch/ses/metri/cahiers/2009_07.pdf
Bethell, J., Rhodes, A. E., Bondy, S. J., Lou, W. Y. W., and Guttmann, A. (2010). Repeat
self-harm: Application of hurdle models. The British Journal of Psychiatry.
http://bjp.rcpsych.org/content/196/3/243
Baetschmann, G. and Winkelmann, R. (2015). A dynamic hurdle model for zero-inflated
count data.
https://www.econ.uzh.ch/dam/jcr:ffffffff-a477-8018-ffff-ffffabad53fc/Dynamic_Hurdle.pdf
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: John Wiley &
Sons, Inc.
Barrios, E. (2015). Lectures on Overdispersion.
APPENDIX A: SAS code syntax to generate simulated data
%let sample = 100000; /* number of simulated cases: 100, 1000, 10000, 100000, or 1000000 */
data basedata;
call streaminit(123);
array vars x1-x5;
array zero_vars z1-z3;
array parms{5} (.3 .7 .2 -.4 -.5);
array zero_parms{3} (-.3 .1 .2);
intercept=.1;
z_intercept=-.1;
do i=1 to &sample;
/*parameter initialization for non-zero covariates*/
sum_xb=0;
sum_gz=0;
do j=1 to 5;
vars[j]=rand('NORMAL',0,1);
sum_xb=sum_xb+parms[j]*vars[j];
end;
mu=exp(intercept+sum_xb);
y_p=rand('POISSON', mu);
/*induce zeroes by some z1-z3 variables*/
do j=1 to 3;
zero_vars[j]=rand('NORMAL',0,1);
sum_gz = sum_gz+zero_parms[j]*zero_vars[j];
end;
z_gamma = z_intercept+sum_gz;
pzero = cdf('LOGISTIC',z_gamma);
cut=rand('UNIFORM');
if cut<pzero then y_p=0;
output;
end;
keep y_p x1-x5 z1-z3;
run;
