On Some Mixture Models for Over-dispersed Binary Data

In this paper, we consider several binomial mixture models for fitting over-dispersed binary data. The models range from the binomial itself to the beta-binomial (BB), the Kumaraswamy-binomial distributions I and II (KPI \& KPII), and the McDonald generalized beta-binomial model (McGBB). The models are applied to five data sets that have received attention in the literature. Because of convergence issues, several optimization methods, ranging from the Newton-Raphson to quasi-Newton algorithms, were employed in SAS PROC NLMIXED, with Adaptive Gaussian Quadrature as the integral approximation method. Our results differ from those presented in Li, Huang and Zhao (2011) for the example data sets in that paper, but agree with those presented in Manoj, Wijekoon and Yapa (2013). We also applied these models to the case where we have a $k$ vector of covariates $(x_1, x_2, \ldots, x_k)^{'}$. Our results here suggest that the McGBB performs better than the other models in the GLM framework. All computations in this paper employed PROC NLMIXED in SAS, and we present in the appendix a sample SAS program implementing the McGBB model for one of the examples.


Introduction
Binomial outcome data are widely encountered in many real-world applications. The binomial distribution often fails to model such outcomes because the variance of the observed data exceeds the nominal binomial variance, a phenomenon known as over-dispersion. One way of handling over-dispersion is to model the success probability of the binomial distribution with a continuous distribution defined on the standard unit interval. The resulting general class of univariate discrete distributions is known as the class of binomial mixture distributions. The beta-binomial (BB) distribution is a prominent member of this class, and the Kumaraswamy-binomial (KB) distribution, based on the Kumaraswamy (1980) distribution, is another well-utilized member. In this paper we focus on McDonald's generalized beta distribution of the first kind (McDonald, 1984) as the mixing distribution and employ the recently introduced McDonald generalized beta-binomial (McGBB) distribution of Manoj et al. (2013). Theoretical properties of the McGBB are discussed in Manoj et al. (2013) and are not repeated here. The parameters of the McGBB distribution are estimated via the maximum likelihood technique. Real-world data sets are modeled with the McGBB mixture distribution, and it is shown that this model gives a better fit than its nested models, namely the beta-binomial and the Kumaraswamy type II models. We also compare the distributions in the generalized linear model setting when covariates are present, employing the teratology data in Moore & Tsiatis (1991). Over-dispersion in binomial regression can arise for several reasons, or combinations of reasons. In most cases it is due to the binomial model grossly underestimating the response variance, or it may be caused by positive correlation between the dependent responses. Hilbe (2011) presents an elegant overview of this for the Poisson model. When over-dispersion occurs, standard errors of parameter estimates are often underestimated, which frequently leads to wrong conclusions about the significance of the predictor variables involved. The McGBB and the other mixture models considered here therefore model the parameter π of the binomial with continuous distributions defined on the interval (0, 1).
In this paper, we consider the following binomial mixture distributions in addition to the binomial (used here as a reference model only):
• The beta-binomial
• The Kumaraswamy-binomial models I & II (KPI & KPII)
• McDonald's generalized beta-binomial distribution (McGBB)

The Binomial Model
The random variable $Y = \sum_i y_i$, where $y_i \sim \text{Bernoulli}(\pi)$, has for a fixed $n$ the binomial distribution
$$P(Y = y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y}, \quad y = 0, 1, \ldots, n.$$
The mean and variance of the binomial are given respectively as
$$E(Y) = n\pi, \qquad \text{Var}(Y) = n\pi(1 - \pi).$$
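As a quick numerical check of the formulas above, the following Python sketch uses SciPy (illustrative only; all computations in this paper employ SAS PROC NLMIXED, and the parameter values here are arbitrary):

```python
from scipy.stats import binom

n, pi = 9, 0.4                       # arbitrary illustrative values
dist = binom(n, pi)

# E(Y) = n*pi and Var(Y) = n*pi*(1 - pi)
assert abs(dist.mean() - n * pi) < 1e-12
assert abs(dist.var() - n * pi * (1 - pi)) < 1e-12

# The pmf sums to 1 over y = 0, ..., n
total = sum(dist.pmf(y) for y in range(n + 1))
assert abs(total - 1) < 1e-10
```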

Mixture Models for the Binomial
We consider in this section the following mixture distributions for the binomial:
• The beta-binomial
• The Kumaraswamy-binomial models I & II (KPI & KPII)
• McDonald's generalized beta-binomial distribution (McGBB)
We develop each of these models in turn in what follows. The mixture model is obtained by evaluating the well-known integral
$$P(Y = y) = \int_0^1 \binom{n}{y} \pi^{y} (1 - \pi)^{n - y} \, g(\pi; \Theta) \, d\pi,$$
for $y = 0, 1, \ldots, n$, where $g(\pi; \Theta)$ is the density of the mixing distribution and $\Theta$ is its parameter space.
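The mixture integral can be evaluated numerically for any mixing density. The Python sketch below (illustrative, not the paper's SAS code, with arbitrary parameter values) does this for a beta mixing distribution and confirms that the result matches the closed-form beta-binomial pmf:

```python
from scipy import integrate
from scipy.stats import binom, beta, betabinom

n, a, b = 10, 2.0, 3.0               # arbitrary illustrative values
y = 4

# P(Y = y) = integral over (0,1) of C(n,y) p^y (1-p)^(n-y) g(p) dp,
# here with g = Beta(a, b) as the mixing density
integrand = lambda p: binom.pmf(y, n, p) * beta.pdf(p, a, b)
pmf_numeric, _ = integrate.quad(integrand, 0, 1)

# For beta mixing, the mixture has the beta-binomial closed form
assert abs(pmf_numeric - betabinom.pmf(y, n, a, b)) < 1e-6
```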

The Kumaraswamy Distribution
The Kumaraswamy-binomial I (Kumaraswamy, 1980), written KBI or KW(I), is a mixture of $Y|\pi \sim \text{Bin}(n, \pi)$ and $\pi \sim \text{KB}(\alpha, \beta)$, where $\text{KB}(\alpha, \beta)$ has the density
$$g(\pi; \alpha, \beta) = \alpha \beta \, \pi^{\alpha - 1} (1 - \pi^{\alpha})^{\beta - 1}, \quad 0 < \pi < 1, \; \alpha, \beta > 0.$$
The resulting unconditional KW(I) probability distribution is
$$P(Y = y) = \alpha \beta \binom{n}{y} \int_0^1 \pi^{y + \alpha - 1} (1 - \pi)^{n - y} (1 - \pi^{\alpha})^{\beta - 1} \, d\pi, \quad y = 0, 1, \ldots, n.$$
Writing $m_r = \beta \, B(1 + r/\alpha, \beta)$ for the $r$-th raw moment of the Kumaraswamy mixing distribution, the mean and variance of KW(I) are respectively
$$E(Y) = n m_1, \qquad \text{Var}(Y) = n m_1 (1 - m_1) + n(n - 1)(m_2 - m_1^2).$$
The KW(II), on the other hand, arises as the $\alpha = 1$ special case of the McGBB of the next section, where $\alpha, \beta > 0$ and $y = 0, 1, 2, \ldots, n$. Its mean and variance follow from the same mixture formulas, with the moments of its mixing distribution in place of $m_1$ and $m_2$.
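The KW(I) moment formulas can be checked numerically. The Python sketch below (illustrative, with arbitrary parameter values) builds the pmf by quadrature and verifies the mean and variance against the mixture expressions:

```python
import numpy as np
from scipy import integrate
from scipy.special import beta as B
from scipy.stats import binom

alpha, bta, n = 2.0, 3.0, 7          # arbitrary illustrative values

# Kumaraswamy mixing density on (0, 1)
kw_pdf = lambda p: alpha * bta * p**(alpha - 1) * (1 - p**alpha)**(bta - 1)

# Raw moments of the Kumaraswamy distribution: E(pi^r) = beta * B(1 + r/alpha, beta)
m1 = bta * B(1 + 1 / alpha, bta)
m2 = bta * B(1 + 2 / alpha, bta)

# Mixture mean/variance: E(Y) = n m1, Var(Y) = n m1 (1 - m1) + n (n - 1)(m2 - m1^2)
EY = n * m1
VY = n * m1 * (1 - m1) + n * (n - 1) * (m2 - m1**2)

# Cross-check against the pmf obtained by numerical integration
pmf = [integrate.quad(lambda p: binom.pmf(y, n, p) * kw_pdf(p), 0, 1)[0]
       for y in range(n + 1)]
ys = np.arange(n + 1)
assert abs(sum(pmf) - 1) < 1e-6
assert abs(np.dot(ys, pmf) - EY) < 1e-6
assert abs(np.dot(ys**2, pmf) - np.dot(ys, pmf)**2 - VY) < 1e-6
```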

The McDonald Generalized Beta-Binomial Model (McGBB)
The McGBB distribution with parameters n, α, β, γ is a mixture of the binomial Bin(n, π) and McDonald's generalized beta distribution of the first kind, GB1(α, β, γ) (McDonald, 1984, 1995), where the latter is defined as
$$f(\pi; \alpha, \beta, \gamma) = \frac{\gamma}{B(\alpha, \beta)} \, \pi^{\alpha\gamma - 1} (1 - \pi^{\gamma})^{\beta - 1}, \quad 0 < \pi < 1, \; \alpha, \beta, \gamma > 0.$$
Clearly, the GB1 reduces to the beta distribution when γ = 1 and to the Kumaraswamy distribution when α = 1.
It can be easily demonstrated that the McGBB in (12):
• reduces to the beta-binomial with parameters n, α, β if γ = 1;
• reduces to the Kumaraswamy-binomial if α = 1.
Writing $\mu_r = B(\alpha + r/\gamma, \beta)/B(\alpha, \beta)$ for the $r$-th raw moment of the GB1 distribution, the mean and variance of the McGBB are given by
$$E(Y) = n\mu_1, \qquad \text{Var}(Y) = n\mu_1(1 - \mu_1)\left[1 + (n - 1)\rho\right],$$
where
$$\rho = \frac{\mu_2 - \mu_1^2}{\mu_1(1 - \mu_1)}.$$
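The GB1 moments are easy to verify in code. The Python sketch below (illustrative values; the paper's computations use SAS) checks that with γ = 1 the McGBB mean and variance collapse to those of the beta-binomial, as the nesting claims:

```python
from scipy.special import beta as B
from scipy.stats import betabinom

def gb1_moment(r, a, b, g):
    """r-th raw moment of GB1(a, b, g): E(pi^r) = B(a + r/g, b) / B(a, b)."""
    return B(a + r / g, b) / B(a, b)

def mcgbb_mean_var(n, a, b, g):
    """Mean and variance of the binomial mixture with GB1 mixing."""
    m1, m2 = gb1_moment(1, a, b, g), gb1_moment(2, a, b, g)
    return n * m1, n * m1 * (1 - m1) + n * (n - 1) * (m2 - m1**2)

# With g = 1 the GB1 is the beta distribution, so McGBB is the beta-binomial
n, a, b = 7, 1.5, 2.5                # arbitrary illustrative values
mean1, var1 = mcgbb_mean_var(n, a, b, 1.0)
bb = betabinom(n, a, b)
assert abs(mean1 - bb.mean()) < 1e-10
assert abs(var1 - bb.var()) < 1e-10
```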

Parameter Estimation
For a single observation, the log-likelihoods for the binomial, the beta-binomial, KPI, KPII, and the McGBB are displayed in expressions (14a) to (14e) respectively.
Maximum-likelihood estimation for the above models is carried out with PROC NLMIXED in SAS, which numerically minimizes the function −LL(y, Θ) over the parameter space Θ. The integral approximation method in PROC NLMIXED is Adaptive Gaussian Quadrature (Pinheiro & Bates, 1995), and several optimization algorithms are available, namely the quasi-Newton algorithm (QUANEW), the Nelder-Mead simplex method (NMSIMP), the Newton-Raphson method with line search (NEWRAP), and the conjugate gradient method (CONGRA) of Powell (1977) and Beale (1972). Convergence is often a major problem here, and the choice of starting values is crucial. For each of the cases considered, the four optimization algorithms were applied in turn to ascertain accuracy and consistency; although they give slightly different parameter estimates, the values are very close and on the whole agree very well.
Since estimation of the Kumaraswamy-binomial often leads to non-convergence, we employ the log-likelihood of the McGBB, which reduces to that of the KB when α = 1; the KB likelihood is therefore obtained from the McGBB likelihood with α fixed at 1.
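As a minimal illustration of the estimation step outside SAS, the following Python sketch fits the beta-binomial by directly minimizing −LL with a Nelder-Mead search, mirroring the NMSIMP option of PROC NLMIXED. The data here are simulated, not one of the paper's examples, and the log-scale reparameterization is a common device to keep the shape parameters positive:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

rng = np.random.default_rng(1)
n = 7
# Simulated counts from a beta-binomial with alpha = 2, beta = 3
y = betabinom.rvs(n, 2.0, 3.0, size=500, random_state=rng)

# Negative log-likelihood, optimized on the log scale so alpha, beta > 0
def negll(theta):
    a, b = np.exp(theta)
    return -np.sum(betabinom.logpmf(y, n, a, b))

fit = minimize(negll, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)
assert fit.success and a_hat > 0 and b_hat > 0
```

The same skeleton extends to the McGBB by adding a third parameter γ to the vector theta and swapping in the McGBB log-pmf.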

Application to Data Examples
We have applied the above models to five data sets that have previously been analyzed by various authors. The five data sets are described in Examples I to V.

Data Example I
This example is taken from Nelder & Mead (1965) and relates to the number of candidates having an "alpha", i.e. at least 15 scores out of a total of 20 points, on each of the nine questions employed in assessing the final class of candidates in an examination. Here, Y ∼ Bin(9, π). There were a total of 209 candidates for the examination, and Table 1 gives the distribution of these scores for the 209 candidates as well as the expected values and goodness-of-fit tests under the five models. For the binomial case, for instance, the expected probabilities are those of the fitted Bin(9, π̂) distribution. Under Lawal's (1980) rule, which states that expected values can be as small as $r/d^{3/2}$ (where r is the number of expected values less than 3 and d is the degrees of freedom under the model) without violating the χ² assumption, the minimum expected value required for the BIN model, for example, is $5/8^{3/2} = 0.22$, since five expected values are less than 3 and d.f. = (10 − 2) = 8. Hence cells 6, 7, 8 and 9, that is, 6+, are collapsed, giving a new X² = 80.000 on 5 d.f. This is presented in the bottom panel of Table 1. The corresponding minimum expectations for the BB, KPI, KPII and McGBB models are 0.1620, 0.1326, 0.1620 and 0.2041 respectively, resulting in categories 8 and 9 (i.e. 8+) being collapsed. The expected values, X² and the appropriate degrees of freedom are presented in the bottom panel of the table.
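Lawal's threshold is a one-line computation; the sketch below (Python, for illustration) reproduces the $r/d^{3/2}$ value quoted for the BIN model:

```python
# Lawal's (1980) minimum-expectation threshold: r / d**1.5, where
# r = number of expected cells below 3 and d = model degrees of freedom
def lawal_threshold(r, d):
    return r / d**1.5

# Binomial model in Data Example I: r = 5 small cells, d = 8 d.f.
assert round(lawal_threshold(5, 8), 2) == 0.22
```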

Data Example II-Terrorism in the US
The data in this example relate to incidents of terrorism in the United States and were analyzed in Li et al. (2011). The data were originally presented in Jenkins & Johnson (1975), who compiled a chronology of incidents in international terrorism between 01/1968 and 04/1974. The number of incidents and the frequency of occurrences are presented in columns 1 and 2 respectively of Table 2; here Y is the number of incidents recorded. For this data, we need to collapse cells 3, 4 and 5 for the binomial, and cells 4 and 5 for the other four models. The KPI seems to give the most parsimonious model for this data, but the other models also fit the data well; the McGBB, even though it has the smallest X², also has the smallest degrees of freedom because of the three parameters estimated under the model. Despite the several optimization methods mentioned above in SAS PROC NLMIXED, we could not reproduce the expected values presented in Li et al. (2011); all the methods lead to results similar to those presented above. It is our opinion that there are inherent errors in the computational aspects of the results in Li et al.: the −2 log-likelihoods (−2LL) obtained in our analyses are much lower than those presented there, and the expected probabilities generated by our models all sum to 1, with the expected values computed on this basis. Our results therefore differ considerably from those of Li et al., and we conclude that Li et al. perhaps need to revisit their computations, which we expect would then agree with those presented in Tables 2 and 3 of this paper.

Data Example III-Terrorism Data in Argentina
This example is again originally from Jenkins & Johnson (1975) and relates to the number of incidents of terrorism in Argentina between 1968 and 1974.
For this data, cells are collapsed for both the binomial and KPI models; the other three models have minimum expected values that satisfy Lawal's rule. Here the beta-binomial, KPII and McGBB all fit the data well, but the beta-binomial seems to be the most parsimonious for these data. Again, the expected values obtained are different from those presented in Li et al. (2011); the reasons given earlier for the Table 2 results also apply to these discrepancies. We believe that our computations are accurate.

Data Example IV: Alcohol Drinking Data in Week I
The alcohol drinking data refer to the numbers of alcohol consumption days in two reference weeks, separately reported by a randomly selected sample of 399 respondents in the Netherlands in 1983 (Alanko & Lemmens, 1996). These are displayed in Tables 4 and 5 for weeks I and II respectively. The choice of these data sets is motivated by the fact that they have also been analyzed in Rodriguez-Avi et al. (2007) and Li et al. (2011), so we can compare our results with theirs.
For these two data sets, the number of days Y out of n = 7 on which an individual consumes alcohol in a particular week is considered binomially distributed. However, the success probability π of consuming alcohol on a particular day cannot be assumed constant because of person-to-person variation, and thus a binomial mixture distribution can be applied to these data, with π modeled by either a beta distribution or a Kumaraswamy distribution on the interval [0, 1]. This is the approach adopted in Manoj et al. (2013) and Li et al. (2011).

Data Example V: Alcohol Drinking Data in Week II
The description of this data set is as presented earlier in 3.4, and Table 5 presents the expected values as well as the goodness-of-fit tests under the five models. We note here that the KPI does not fit the week I data. The poor fit of the binomial model can be attributed to the expected mean and variance under the binomial compared to the observed values of these quantities: while the observed mean and variance of the data are µ = 3.8195 and σ² = 6.2538, the binomial model grossly underestimates the variance, whereas the McGBB estimates it well. We thus see that the McGBB and the other mixture models capture the variance of the observed data much better and hence fit the data much better than the binomial.
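The over-dispersion diagnosis above amounts to comparing the sample variance with the nominal binomial variance nπ̂(1 − π̂). The Python sketch below illustrates the comparison on invented counts (not the Tables 4 and 5 data):

```python
import numpy as np

def binomial_variance_check(y, n):
    """Return (sample variance, nominal binomial variance with pi_hat = ybar/n)."""
    pi_hat = np.mean(y) / n
    return np.var(y, ddof=1), n * pi_hat * (1 - pi_hat)

# Illustrative counts out of n = 7 days, deliberately U-shaped (heavy at 0 and 7)
y = np.array([0, 0, 7, 7, 3, 4, 7, 0, 2, 7])
s2, v_bin = binomial_variance_check(y, 7)

# Over-dispersion: sample variance well above the nominal binomial variance
assert s2 > v_bin
```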

Extension to GLM
In this section we extend the applications of the models proposed above to the case where we have covariates. Suppose we have a vector of covariates, say (x_1, x_2, . . ., x_t)′. We can incorporate this into our models; we present an example data set for illustration.

Example: Teratology Data
Teratology is the study of abnormalities of physiological development. The data below give the results of studies on the effects of dietary regimens or chemical agents on fetal development in rats (Moore & Tsiatis, 1991). Female rats on iron-deficient diets were assigned to four groups: group 1 (placebo injection), group 2 (injections on days 7 and 10), group 3 (injections on days 0 and 7), and group 4 (injections weekly). 58 rats were made pregnant, sacrificed after three weeks, and the total number of dead fetuses was counted in each litter. Due to unmeasured covariates and genetic variability, the probability of death may vary from litter to litter within a particular treatment group. There are 31, 12, 5 and 10 rats respectively in groups 1, 2, 3 and 4, and the data are presented as y/n, where for instance 1/10 means that y = 1 and n = 10. If we let π_ij denote the probability of death for fetus j in litter i, then:
• For the binomial, the set-up expresses π in terms of the group indicators x_2, x_3 and x_4.
• For the beta-binomial, Kumaraswamy II (fitted as nested within the McGBB) and McGBB models, the set-up is as follows in (17a) to (17c) respectively:
$$\alpha = \exp(a_0 + a_2 x_2 + a_3 x_3 + a_4 x_4), \quad \gamma = 1, \; \beta > 0 \qquad (17a)$$
$$\beta = \exp(a_0 + a_2 x_2 + a_3 x_3 + a_4 x_4), \quad \alpha = 1, \; \gamma > 0 \qquad (17b)$$
$$\alpha = \exp(a_0 + a_2 x_2 + a_3 x_3 + a_4 x_4), \quad \beta, \gamma > 0 \qquad (17c)$$
where a_0, a_2, a_3, a_4 are to be estimated from the models in addition to the other parameters of the models.
The use of exp(x′a) is sometimes preferable to the use of the inverse logit 1/[1 + exp(−x′a)] (see Martinez, 2015). For a general set of t covariates, the linear predictor above, for instance for α, becomes a_0 + a_1x_1 + a_2x_2 + ⋯ + a_tx_t.
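The two link choices can be compared directly. In the Python sketch below the coefficient values are hypothetical, chosen only to show that the log link yields a positive shape parameter while the inverse logit maps the same linear predictor into (0, 1):

```python
import numpy as np

a = np.array([0.5, -0.2, 0.3, 0.1])   # hypothetical coefficients a0, a2, a3, a4
x = np.array([1.0, 1.0, 0.0, 0.0])    # intercept plus group-2 indicator

eta = x @ a                           # linear predictor x'a

# log link keeps the shape parameter positive: alpha = exp(x'a)
alpha = np.exp(eta)
# inverse-logit link instead maps x'a into (0, 1)
p = 1.0 / (1.0 + np.exp(-eta))

assert alpha > 0
assert 0 < p < 1
```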

Computing the Pearson's GOF X 2
We generate Pearson's goodness-of-fit test statistic X², with the basic GOF defined as
$$X^2 = \sum_{i=1}^{n} \frac{\left(y_i - \hat{E}(Y_i)\right)^2}{\widehat{\text{Var}}(Y_i)},$$
where n = 58 is the total number of observations in our data. Expressions for E(Y) and Var(Y) have been presented for each of the models considered in this study: those for the binomial are given in expressions (2), while those for the McGBB and its nested models BB and KPII can be generated appropriately from the expressions in (13).
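A Pearson-type statistic built from the model mean and variance can be sketched as follows. The numbers are toys, and the (y − E(Y))²/Var(Y) form is an assumption consistent with the E(Y) and Var(Y) expressions referenced above:

```python
import numpy as np

def pearson_x2(y, mean, var):
    """Sum over observations of (y_i - E(Y_i))^2 / Var(Y_i)."""
    y, mean, var = map(np.asarray, (y, mean, var))
    return np.sum((y - mean) ** 2 / var)

# Toy check with three observations sharing the same fitted mean and variance
y = np.array([1.0, 3.0, 2.0])
mean = np.array([2.0, 2.0, 2.0])
var = np.array([1.2, 1.2, 1.2])
assert abs(pearson_x2(y, mean, var) - (1 + 1 + 0) / 1.2) < 1e-12
```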

Conclusions
In all the data examples considered in this study, the McGBB performs best. We employ the finite-series representation of the McDonald generalized beta-binomial distribution as presented in Manoj et al. (2013). The McGBB offers considerable improvement over the BB, KPI and KPII models. Convergence problems can make the McGBB daunting to use, but with the array of optimization algorithms available in SAS PROC NLMIXED and R, the distribution is not more difficult to fit than any of the other models.
We also observe that the results presented in Li et al. (2011) are suspect; our results are confirmed by the several SAS procedures mentioned in the body of this paper. Thus, provided convergence can be achieved, the three-parameter McDonald generalized beta-binomial model offers a much better alternative to some of the well-known mixture models for over-dispersed binary data, at least for the example cases considered in this paper.

Table 1 .
Expected values under the five models and corresponding GOF tests.
For the binomial, the estimated probabilities are $P(Y = y) = \binom{9}{y}\,\hat{\pi}^{y}(1 - \hat{\pi})^{9 - y}$ and consequently the expected values are $m_y = 209 \times P(Y = y)$, $y = 0, 1, \ldots, 9$. The computed probabilities and expected values, together with the cumulative estimated probabilities and expected values under the McGBB model for this data, are presented in Appendix II; we note that the estimated probabilities sum to 1.0. We have employed the binomial model as a baseline for comparative purposes; the expected values in column 3 of Table 1 are those under the binomial model, designated 'BIN'. Under the natural binomial model for this data, $X^2 = 455.6136$ on 208 d.f., with estimated dispersion parameter $\hat{\varphi} = 2.1904 \gg 1$, indicating strong over-dispersion in the data. The model produces very many small expected frequencies, undermining the approximation of Pearson's $X^2$ by a $\chi^2$ distribution.

Table 2 .
Goodness of fit tests for Terrorism Incidents in the US

Table 3 .
Goodness of fit tests on Argentina Terrorism Data

Table 4 .
Goodness of fit tests for the Alcohol Drinking Data in Week I Under the Five Models

Table 5 .
Goodness of fit tests for the Alcohol Drinking Data in Week II Under the Five Models
Our results in Tables 4 and 5 agree with those presented in Manoj et al. (2013). For both data sets, the McGBB model is preferable to the other models. Of course the binomial model does not fit the data at all, as expected. At the 5% nominal level, the beta-binomial, KPI, KPII and McGBB fit the data; however, the McGBB is the most parsimonious, with p-values of 0.7071 and 0.3974 for the data in Tables 4 and 5 respectively.