Correcting for Non-Sum to 1 Estimated Probabilities in Applications of Discrete Probability Models to Count Data

In this paper, we examine some often ignored or assumed problems relating with fitting probability models to count data either exhibiting over, equi, or under dispersion. Of particular concern are last category truncated data, where most often, expected values in this last category are collapsed together so that the sum of the expected values sum to the sample size in the data. That is, so that k ∑ i=0 m̂i = n, the sample size. We shall for illustrative purposes in this paper, consider the following distributions: the negative binomial (NB), the Inverse trinomial (IT), the hyper-Poisson (HP), the Quasinegative binomial (QNBD), the extended com-Poisson distribution (ECOMP) as well as the negative binomial-exponential distribution (NBGE).Though, we have restricted our discussion to these six distributions, other distributions may also be employed but the patterns are always the same, that is, the sum of the estimated probabilities does not equal 1.00 and consequently, the sum of the expected values is always less or equal (Poisson case only) the sample size in the observed data. We propose a common procedure to rectify this problem for both right truncated or non-truncated frequency count data exhibiting either excess zeros or regular frequency data.


Introduction
Count data are often modeled with the Poisson distribution-it being the underlying probability model for count data.However, its use had been restricted because of the absence of dispersion parameter in its function since both mean and variance are equal, thus leading to equi-dispersion: the ratio of the variance to the mean, which in this case equals 1.00.For data exhibiting long tails or over or under dispersion, alternative models such as the negative binomial (NB), the generalized Poisson, Famoye et al. (2004), the double Poisson and several other distributions, such as, the NB-generalized exponential, Aryuyuen & Bodhisuwan (2013) have been proposed.Other distributions that have received considerable attention in the literature are the Poisson Inverse Gaussian, the NB-Lindley Zamani & Ismail (2010); Lord et al. (2011); Geedipally et. (2012) and many others.
For frequency count data, all these distributions have something in common when applied to real life data, namely, the sum of estimated probabilities under these models, does not often add to 1 as would be expected.Consequently, the sum of the expected values under these models therefore do not necessarily add up to the sample size n.Most authors just combine the expected values in the last category with the left over, so that k ∑ i=0 mi = n.We shall illustrate this in the next sections.
For this study, we plan to implement a procedure that corrects this anomaly when these distributions are applied to frequency data.Consequently, we will be employing an array of distributions such as the two-parameter type distributions, the negative binomial (NB), the hyper-Poisson and the Com-Poisson (CMP) distributions.We will also employ the three parameter type distributions (the Quasi-negative binomial, the Inverse Tri-nomial and the generalized negative binomialexponential distribution), as well as the four parameter type distribution (the extended Com-Poisson distribution).These distributions cover a broad spectrum of possible distributions usually employed for count data.These six to seven models will be applied to two example frequency data.The second data set has excess zeros and we would apply our procedure in the light of these excess zeros to the zero-inflated versions of the six models.We present these distributions in the following sections.

Probability Models Considered in this Study
We begin our discussion in this paper with brief introductions to some of these distributions.

The Negative Binomial-NB
The negative-binomial model has the probability distribution: Where is the the normalizing term and ν is the dispersion parameter such that if ν > 1 we have under dispersion, and when ν < 1, we have overdispersion.The distribution reduces to the Poisson distribution when ν = 1.The means and variance of Y are respectively given as: and,

The Hyper-Poisson Distribution
The hyper-Poisson (HP) distribution first proposed by Bardwell & Crow (1964) and Crow & Bardwell (1965) is a twoparameter discrete distribution with probability density function (pdf) is the confluent hyper-geometric series in which (β) 0 = 1.Expressions for the mean and variance of the HP distribution are provided in Kumar & Nair (2014) as: where: Alternatively, Lawal (2017) has used the expressions below to obtain the mean and variance of the HPP distribution.His results agree with using expressions in ( 9).

The Quasi-Negative Binomial-QNBD
The quasi-negative binomial distribution Janardan (1975); Sen & Jain (1996) and very recently by Li et al. (2011) has the probability mass function Hassan & Bilal (2008) of the form: This is equivalent to QNBD proposed in Li et al. (2011) which has the alternative probability mass function: The properties of both alternative distributions have been outlined in the papers referred to above.In this paper, we will employ the QNBD defined in (10).

The Inverse-Trinomial Distribution-IT
The inverse trinomial distribution Shimizu & Yanagimoto (1991) which is derived from the Lagrangian expression has the probability mass function of the form : ) t (12) y=0,1,. . .; λ > 0, p ≥ r and p + q + r = 1.It is so named because its cumulant generating function is the inverse of that for the trinomial distribution.The IT model was employed for overdispersed medical count data by Phang & Ong (2013).
It is a special case of the negative binomial distribution (NB) where ) t .
The mean and variance of the IT are, for p > r given respectively as:
The means and variances of the NB-GE distribution in (14) are: where , and .

The Extended COM-Poisson (ECOMP) Distribution
The pmf of a random variable Y having the extended COM-Poisson distribution with parameters (ν, α, , β) is given in Chakraborty & Imoto (2016) by: where The distribution is defined in the parameter space To re-emphasize our problem, one common feature of the distributions described above, and indeed for most distributions employed for count regression models is that they all defined to have infinite range.Consequently, for a real life data that takes values Y = 0, . . ., k, it is most common to observe that the expected probabilities under any of the above models are not necessarily summing to 1.00 within the range 0 ≤ Y ≤ k as expected for a probability mass function, and consequently, the expected values will also not sum to n, the sample size.To overcome this, the practice has often been to add this shor fall expected values to the last category expected value, that is, category k in our case.
While this practice is most common, there exists however situations, where such a practice may not lead to the right decision being taken.The case in point is if the last category, k has been truncated.That is, we have observations designated in k+ categories.We give example below, which is adapted from Hassan & Bilal (2008) and relates to the number of absenteeism among shift workers in steel industry as reported in Arbous & Sichel (1954).The category 25+ actually stands for counts in categories (25-48) and are combined in that category with a frequency count of 16.Thus our last category is k = 25 in this data.The distribution of Y and the corresponding frequencies are displayed in columns 1 and 2 respectively in Table 1.Also presented in Table 1, are the expected values, ML estimated parameters, both true Wald's Goodness-of-Fit test statistic, X 2 T , its corresponding empirical value X 2 E and the grouped X 2 values with their corresponding degrees of freedom for five of the models, namely, (P, NB, CMP, HPP, and QNBD).We shall discuss the estimation procedures employed for these models but first, we observe the following from our example data in Table 1.For the NB and QNBD for instance, these sums are respectively, 239.134 and 241.214 at k = 25.Thus if we add these shortfalls to the last category for instance, we will get the following for {NB, CMP, HPP, QNBD}={10.3457,9.2868, 8.903, 8.380}.The corresponding contributions to X 2 G in this case are {3.0903,4.8528, 5.6574, 6.9289}.Here, (b) The parameter estimates under each model are provided under the panel 'ML estimates'.Clearly, each F(y) < 1.00 in all cases.
(c) The mean and variance of the observed data are respectively, µ = 8.8831 and σ 2 = 51.7717,giving a dispersion parameter DP=5.8281.Thus the data is highly over-dispersed.
(d) In Table 2 are presented the following: 1. F( 25) < 1 for all models 2. F(y * ) gives the value of Y required to attain an F(y) to be 1.Thus, k has to be 25, 98, 78, 58 and 51 respectively for these estimated probabilities to sum to 1 and of course in this case, the expected values also sum to n = 248.When we realize that the range of k here is 0 ≤ k ≤ 25 we can see that we need to make the necessary adjustments to realize our expectations.
3. The estimates µ and σ 2 are the theoretical means and variances under these models for the data.These are used to compute Wald's X 2 T , where 4. The estimated ȳ and σ2 are the estimated means and variances at k = 25, representing reality from the data and are thus classified as empirical estimates on which the Wald's test statistic X 2 E are computed, where Here p is the number of parameters being estimated.
5. For the grouped data, we employed Pearson's However, we ascertain that the expected values satisfy the Lawal (1980) rule for the χ 2 approximation to be valid.The model is based on k − p degrees of freedom.
6.We see that why the theoretical means and variances (apart from the Poisson, which has a much lower variance) of the models are much closer to the observed mean and variance of the data, however, when the models are implemented, the empirical means and variances of all the distributions do not match up with the observed values.The estimated variances grossly underestimates the true variance of the data.These we plan to correct with the new procedure described at a later section of this paper.

Estimation
The log-likelihood of a single observation i from P, NB, HPP, COMP, IT, ECP, QNBD, NBGE distributions are given in expressions (17a) to (17h) respectively: Maximum-likelihood estimations of the above models are carried out with PROC NLMIXED in SAS, which minimizes the function −LL(y, Θ) over the parameter space Θ numerically.The integral approximations in PROC NLMIXED is the Adaptive Gaussian Quadrature Pinheiro & Bates (1995) and the Newton-Rapson Conjugate Gradient optimization algorithm in PROC NLMIXED (NEWRAP).To obtain a quicker convergence, the Conjugate Gradient or quasi-Newton optimization algorithms of Powell (1977) and Beal (1972) are initially employed in our computations (the latter is the default in PROC NLMIXED).Convergence is often a major problem here and the choice of starting values is very crucial.
For each of the cases considered here, the above two initial optimizing algorithms were applied in turn to ascertain accuracy and consistency.Although the results differ very slightly, on the whole, they all agree very well.Thus, we may note here that each of these give slightly different parameter estimates.They all give values that are very close.

New Procedure
We can overcome the above subjective approach by instead fit a model with the log-likelihood of a single observation i of the form: where and k is the last category of the data.In our example in Table 1, this would be k = 25.To accomplish this, first we compute: for each of the distributions NB, CP, HPP, ECP, QNBD, IT and NBGE, The expression in ( 18) for the HPP model for example would be: where LL represents the log-likelihood LL3 in (17c) for the Hyper-Poisson model.Thus, for the Hyper-Poisson model, the above in (20) becomes: Here, Once the parameters λ and β are successfully estimated, the estimated probabilities and expected values are computed viz: Here again, The results of implementing this approach (sometimes referred to as right truncated) is presented in Table 3 for the seven models.Clearly now, the estimated probabilities sum to 1.00 in the range 0 ≤ Y ≤ 25 and consequently, the expected values also sum to n. Chakraborty & Imoto (2016) have used a similar procedure.We see here from Table 3 that the sums of expected values all sum to 248, the sample size as expected.Further, the empirical means and variances are much closer to the true values of 8.8831 and 51.7717 than in the earlier models.Consequently, the empirical Wald's GOF X 2 E are much smaller than those observed in Table 1.Also, the grouped Pearson's X 2 G fit much better.The ECP gives the lowest values but it is based on 22 d.f., while the Inverse tri-nomial (IT) gives a X 2 G of 25.7397 on 23 d.f. and more parsimonious.

Data Set II
As a second example, the data set in As in the previous example, under the models considered for this data, again, the sums of the expected frequencies in each model do not add to n = 4406.Both the estimated variances of the Com-Poisson and Hyper-Poisson models grossly underestimate the variance of the observed data.However, the NG, IT, QNB and the NBGE models give estimated variances that are closer to the observed variance of 0.5571 in the data, than the HPP, and CP models.The estimated means of all the models are very close to the observed mean of 0.2960.It is noted here that, both the QNB and the NBGE fit the data best.The dispersion parameter is 1.89 for this data, which by the size of the data gives significant indication of over-dispersion in the data.This is not necessarily unexpected because of the excess zeros in the data.We examine the effect of this in the next section.
When the procedure outlines earlier is employed on these data, the corresponding results for the right truncated models are displayed in Table 5. Results from Table 5 again indicate that the right-truncated models behave much better than those in Table 4.The estimated probabilities sum to 1.00, as well as the estimated frequencies summing to 4402.For both data considered here, the QNB, IT, the NB-GE, and ECP all give means and variances that are very close to the means and variances of the observed data.Also for both data sets, the HPP and Com-Poisson give estimated variances that grossly underestimate the true variances and both therefore are not very good for modeling these data sets.The ECP, QNB and NBGE all fit the data very well, but the QNB model is the most parsimonious for this data set.It also gives the lowest empirical Wald's test statistic of 4391.6755 on 4402 d.f.

Zero-Inflated Models
We present in this section the effect of applying the procedure employed in the last section to data exhibiting excess zeros (like the data in Table 4) where 80.4% of the data are zeros.Lawal (2017) has fitted the zero-inflated negative binomialgeneralized exponential distribution (ZINBGE) to the accident data in Table 4.We present here the results of fitting the ZINB, the ZIIT, the ZIQNB and the ZINBGE to the data in Table 4.To accomplish these, we recall that a zero-inflated (ZI) model is a two-part process manifested by the structural zeros part and the process that generates random counts and can be written in the form: where ϕ is the extra proportion of zeros and Y is the count random variable with specified parameters.ϕ is modeled here in the logit form.Thus, the probability mass function for the ZINB, ZIIT, ZIQNB and ZINBGE models are given respectively in expressions ( 24) to (27).

Pr(Y
From the above, it is not too difficult to formulate the corresponding log-likelihoods.The results of implementing these models are given in the first panel in Table 6.For the data in this example, its mean and variance are respectively, 0.2960 and 0.5571, with a dispersion parameter DP=1.88 which indicates over dispersion.This is not surprising because of the excess zeros in the data.We see again here for the first part of the results that none of the zero-inflated models have expected values summing to n = 4406.Clearly, for this data set, both ZINB and ZIIT are clearly not improving our fits since the estimates of ϕ in both cases are zero.We might as well do with NB and IT models.The ZIQNB and the ZINBGE both fit the data better with grouped X 2 being respectively, 11.2699 and 10.8060 on 5 d.f.The equivalent Wald's test statistic X 2 W are 4340.5217and 4680.9653,indicating that the ZIQNB now fits better, each is of course on 4402 degrees of freedom.Because the sum of the mi < n = 4406 in all the four models, it is often the case that the last categories usually carries the difference.Thus for instance, most analysis in the literature would assign an expected value of 0.592 + (4406 − 4405.216)= 0.592 + 0.784 = 1.376 under the ZINB model.We can do the same for the other three models to ensure that the sums all add to n = 4406 in this case.
We also observe that the empirical means and variances for the four models fall short of the true values for the data, hence the models did not fit very well.We have employed here the Lawal (1980) rule of the χ 2 approximation to X 2 .These are all satisfied in our data.As discussed in the previous section, the results in the second half of the table designated right truncated are the same models with truncation at Y = 7.The results are much better for these models based on the groups' X 2 .For instance, for the ZIQNB, this is 3.2806 on 5 d.f.In particular, both ZIQNB and ZINBGE fit very well.The reasons for this can be traced to their empirical means and variances which are very close to that observed for (a) Apart from the Poisson model, all the other models have 25 ∑ i=0 mi < 248, the sample size n = 248.Corresponding fits of IT and ECOMP gives 25 ∑ y=0 mi of 237.707 and 239.3451 respectively.In all the cases, 25 ∑ y=0 mi < n, where n = 248for this data set.The cumulative estimated probabilities under each model is given by the F(y).

Table 1 .
Distribution of absenteeism among Shift Workers

Table 2 .
Statistics Computed from implementation of the designated five Models

Table 3 .
Results from the Implementation of the Procedure

Table 4 .
Table 4, is taken from Aryyuen & Bodhisuwan (2013) and gives the number of hospital stays by United States residents aged 66 and above.The data was originally presented in Flynn (2009) but recently re-analyzed in Aryyuen & Bodhisuwan (2013) with zero inflated models.Here, Y i is the number of hospital stays and n i is the frequency in each category.The sample size here is n = 4406.Parameter Estimates and Expected values under the various Models

Table 5 .
Results from the Implementation of the Procedure

Table 6 .
MLEs, expected values and empirical means and variances for the four models