Estimating Disease Risk of Diabetes Cases in the Presence of Underreporting

In real-life applications, the response variable, when it is a count, is often under-reported. In this work, we develop a model that accounts for under-reporting in count data. In particular, we allow under-reporting to vary spatially across regions through a probability captured by a binomial distribution. Count data commonly share one property: the variance is greater than the mean. When this happens, the Negative Binomial (NB) distribution is recommended instead of the usual Poisson distribution. The spatial variation of the disease was divided into correlated and uncorrelated parts. When a Negative Binomial was used, both the correlated and the uncorrelated parts were found to have a significant relationship with the relative risk for each region, with the larger contribution coming from the uncorrelated part. The model was applied to diabetes data from Ghana, and disease maps for diabetes were developed for the country. These maps are informative to policy makers who must design preventive measures with scarce resources.


Introduction
One of the greatest challenges hindering the progress of Africa is non-infectious diseases, which put a strain on the taxpayer and deplete human resources at an alarming rate. A good example of such diseases is diabetes. According to a fact sheet published in November 2016 by the World Health Organization (WHO), the number of people with diabetes increased from 108 million in 1980 to 422 million in 2014. It is also estimated that diabetes prevalence in adults rose from 4.7 percent in 1980 to 8.5 percent in 2014. Diabetes is a chronic disease that occurs when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces. It is a leading cause of death in developing countries, including Ghana. In 2014, the International Diabetes Federation (IDF) estimated the number of people living with the disease to be 450,000, raising serious concerns among taxpayers about the remedies being put in place to check the menace.
These estimates call for new and modern methods to address the problem. Statisticians and mathematicians have responded by using regression models to relate count data to variables shown to have a significant effect on the disease. Because humans operate in space, it is impossible to separate environmental factors from these diseases. This was demonstrated by John Snow when he linked a particular water pump to cholera deaths in London in 1854. His work pioneered the use of geographically coded data in disease modelling, an approach referred to as spatial modelling, which is widely applied in estimating the relative risk of small areas within a geographical region. With this information at hand, one can determine the relative risk of exposure, which serves as a guide to policy makers.
Many papers have been published in this field. Ugarte, Ibáñez, and Militino (2006) discussed the different techniques available for modelling risk in mortality data. They also presented a list of smoothing methods based on Poisson inference that give better estimates of mortality rates and ratios. In their work, over-dispersion (which arises from spatial autocorrelation, unstructured heterogeneity or a combination of the two) was identified and accounted for by incorporating random effects into the models. Waller, Carlin, Xia, and Gelfand (1997) did remarkable work on disease mapping, in which regional mortality and morbidity cases were mapped. They showed that Bayes and empirical Bayes methods help to reduce or eliminate the instability of estimates in low-population areas while maintaining their geographical resolution. Based on this, they extended their work by incorporating temporal effects and spatio-temporal interactions into their model and fitting it using Markov chain Monte Carlo (MCMC). Gamado, Streftaris, and Zachary (2014) also worked on modelling under-reporting in epidemics by considering the stochastic Markov SIR epidemic model with various reporting processes incorporated. They showed that ignoring under-reporting when it is present leads to under-estimation of the infection rate.
In this work, the objective is to develop a model that correctly accounts for under-reporting in count data. A relative risk map of a given geographical area is then plotted with reference to its spatial effects. This is achieved by using the Bayesian method of parameter estimation.
We achieve this objective by taking some cues from Waller, Carlin, Xia, and Gelfand (1997) while departing from the approach of Gamado et al. (2014). Here, under-reporting is treated differently: it varies spatially from region to region through a probability that is captured by a binomial distribution. In our case, covariates are excluded under the assumption that the count data vary spatially only.
In Section 2, we build the model that accounts for under-reporting in count data by using a Negative Binomial distribution instead of a Poisson distribution, as suggested by Pararai, Famoye, and Lee (2010). In Section 3, we derive the parameters using the Bayesian method, and in Section 4 we discuss the results. The final section presents the conclusion and recommendations.

Model
Given independent and identically distributed (i.i.d.) Bernoulli trials, the Negative Binomial distribution can be used to model the number of failures observed before a specified number of successes is reached. To model the phenomenon above, we consider the Negative Binomial (NB) distribution, Y ∼ NB(π, r). A discrete random variable, Y, is said to follow a Negative Binomial distribution if its probability mass function, f(.), can be written as
\[ f(y) = \frac{\Gamma(y + r)}{\Gamma(r)\, y!}\, \pi^{r} (1 - \pi)^{y}, \qquad y = 0, 1, 2, \ldots \qquad (1) \]
The mean and variance of this distribution are r(1−π)/π and r(1−π)/π², respectively. Also, r, an integer-valued parameter, is the number of times (r = N + y) that we need to repeat a Bernoulli experiment with success probability π until N successes are achieved. This makes the dispersion index (DI), the ratio of the variance to the mean, equal to 1/π, where π is the probability of reporting a case, or success probability. Also, λ represents the mean, so that E(Y) = λ = Eπ. Here, E denotes the expected value for the unit under discussion.
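Given that π = r/(r + λ) and 1 − π = λ/(λ + r), the pmf in Equation (1) transforms into
\[ f(y) = \frac{\Gamma(y + r)}{\Gamma(r)\, y!} \left( \frac{r}{r + \lambda} \right)^{r} \left( \frac{\lambda}{\lambda + r} \right)^{y}, \qquad y \geq 0 \text{ and } r > 0. \qquad (2) \]
When modelling count data, the assumption of independence for π can be relaxed and π made to depend on some covariates.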
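The mean parameterisation can be checked numerically. The sketch below is only an illustration: it assumes the standard Negative Binomial mass function with π = r/(r + λ), and the values of `r` and `lam` are made up rather than taken from the diabetes data. It confirms that the mean equals λ and that the dispersion index equals 1/π.

```python
import math


def nb_pmf(y, r, lam):
    """Negative Binomial pmf with mean lam and size r (assumed standard form)."""
    pi = r / (r + lam)  # success probability implied by r and the mean
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(pi) + y * math.log(1.0 - pi))
    return math.exp(log_p)


# Illustrative values only; they are not estimates from the paper.
r, lam = 3.0, 5.0
pi = r / (r + lam)

# Truncate the infinite support where the remaining tail mass is negligible.
ys = range(2000)
probs = [nb_pmf(y, r, lam) for y in ys]

mean = sum(y * p for y, p in zip(ys, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, probs))
dispersion_index = var / mean

print(f"mean={mean:.4f}, variance={var:.4f}, DI={dispersion_index:.4f}, 1/pi={1 / pi:.4f}")
```

Because the variance is λ/π = λ + λ²/r, smaller values of r give heavier over-dispersion for the same mean.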
In that case, π_i is written as in Equation (3), and the probability distributions of the spatial effects are given in Equations (4) and (5). In Equation (3), π_i is the success probability of reporting an event and, in our case, it is made to depend solely on structured (u_i) and unstructured (v_i) spatial effects, with probability distributions suggested by Besag, York, and Mollié (1991) and Ngesa, Achia, and Mwambi (2014). In Equation (4), N stands for the Normal distribution, while in Equation (5), WN represents White Noise and d_1i is the number of neighbouring units.
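To make the role of the number of neighbours concrete, the following sketch computes the conditional mean and variance of a structured effect under the intrinsic conditional autoregressive form commonly associated with Besag, York, and Mollié (1991): given its neighbours, u_i is Normal with mean equal to the neighbours' average and variance σ²_u / d_i. This form is an assumption here, and the adjacency map, effect values and variance are hypothetical.

```python
def car_conditional(i, u, neighbours, sigma2_u):
    """Conditional mean and variance of u[i] given its neighbours (intrinsic CAR, assumed form)."""
    nbrs = neighbours[i]
    d_i = len(nbrs)  # number of neighbouring units of region i
    cond_mean = sum(u[j] for j in nbrs) / d_i
    cond_var = sigma2_u / d_i
    return cond_mean, cond_var


# Hypothetical map: four regions arranged in a line, 0 - 1 - 2 - 3.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
u = [0.2, -0.1, 0.4, 0.0]   # hypothetical current values of the structured effects
sigma2_u = 0.5              # hypothetical variance parameter

m1, v1 = car_conditional(1, u, neighbours, sigma2_u)
print(m1, v1)
```

Regions with more neighbours borrow more strength from their surroundings, so their conditional variance is smaller.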
Count data mostly come with elements of over-dispersion. This happens when data are collected under non-uniform circumstances, and also when the population under study is heterogeneous. In this paper, over-dispersion is expected because under-reporting is incorporated. For the data at hand, the variance is more than seven thousand times the mean, a clear case of over-dispersion.
Also, the marginal expectation and the variance of Y_i can be computed, and from them we conclude that under-reporting, just like unobserved heterogeneity, leads to over-dispersion.
Over-dispersion means that the data show more variation than the model predicts.
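A small numerical illustration of why a region-varying reporting probability inflates the variance is given below. It is a sketch under an assumption that is not taken from the paper's equations: conditional on the reporting probability π, the observed count is Poisson with mean πλ. The law of total variance then gives Var(Y) = E(πλ) + λ² Var(π), which exceeds the mean whenever π varies. All numbers are hypothetical.

```python
from statistics import mean, pvariance

lam = 100.0                        # hypothetical true mean count
reporting_probs = [0.2, 0.5, 0.8]  # hypothetical, equally likely reporting probabilities

# Marginal mean: E(Y) = lam * E(pi)
expected_count = lam * mean(reporting_probs)

# Law of total variance: Var(Y) = E[Var(Y | pi)] + Var(E[Y | pi])
#                               = lam * E(pi)    + lam^2 * Var(pi)
variance_count = expected_count + lam ** 2 * pvariance(reporting_probs)

dispersion_index = variance_count / expected_count
print(expected_count, variance_count, dispersion_index)
```

If every region had the same reporting probability, the second term would vanish and the dispersion index would return to one under this Poisson assumption.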
To account for over-dispersion and under-reporting, we form a joint distribution from the binomial and negative binomial distributions, Equation (10), where π_u is the probability of under-reporting, which varies spatially through a binomial probability, and α is the inverse of r. The probability π_u is in turn written in terms of u_2i and v_2i, the structured and unstructured spatial effects in the under-reporting probability, each with its own probability distribution. The average number of observed cases over a period of one year is µ = π_u λ, which is expressed in Equation (15). In this case, our dependent variables are assumed to vary spatially and not to depend on covariates, so Equation (15) reduces to Equation (16). With the above in mind, the likelihood function of Equation (10) is given by Equation (17), and substituting Equation (16) into Equation (17) gives the log-likelihood.
A careful look at Equation (10) shows that it reduces to the Poisson distribution in the limit α → 0 and to the Geometric distribution when the dispersion parameter α = 1. Having identified the contributing parameters, we present the candidate models in order of complexity.
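The two limiting cases can be verified numerically. The sketch below assumes that the joint model is a Negative Binomial with mean µ = π_u λ and dispersion α = 1/r, written in the usual mean-dispersion form; that form is an assumption based on the definitions in the text, and the values of `pi_u` and `lam` are hypothetical.

```python
import math


def nb_alpha_pmf(y, mu, alpha):
    """Negative Binomial pmf with mean mu and dispersion alpha = 1/r (assumed form)."""
    inv = 1.0 / alpha
    log_p = (math.lgamma(y + inv) - math.lgamma(inv) - math.lgamma(y + 1)
             - inv * math.log1p(alpha * mu)
             + y * (math.log(alpha * mu) - math.log1p(alpha * mu)))
    return math.exp(log_p)


def poisson_pmf(y, mu):
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def geometric_pmf(y, mu):
    """Geometric pmf on y = 0, 1, 2, ... with mean mu."""
    p = 1.0 / (1.0 + mu)
    return p * (1.0 - p) ** y


# Hypothetical values: reporting probability and true mean count.
pi_u, lam = 0.4, 5.0
mu = pi_u * lam  # mean number of observed cases

max_gap_poisson = max(abs(nb_alpha_pmf(y, mu, 1e-6) - poisson_pmf(y, mu)) for y in range(11))
max_gap_geometric = max(abs(nb_alpha_pmf(y, mu, 1.0) - geometric_pmf(y, mu)) for y in range(11))
print(max_gap_poisson, max_gap_geometric)
```

Both gaps are close to zero, which matches the statement that the Poisson and Geometric models are nested within the Negative Binomial model.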

Parameter Estimation (Bayesian Approach)
The Bayesian method is preferred over the usual frequentist method because of its advantage in suppressing the effects of confounding variables. In this method, a prior distribution p(θ) is first specified, and the likelihood p(y|θ) of the observed data is then formed. Bayes' theorem is then invoked to compute the posterior distribution, p(θ|y). Mathematically, p(θ|y) ∝ p(θ) p(y|θ), with the normalising constant being the marginal distribution of the data, ∫ p(θ) p(y|θ) dθ.
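The relation p(θ|y) ∝ p(θ) p(y|θ) can be illustrated with a simple grid approximation. The sketch below is a toy example and not the model of this paper: it assumes a single reporting probability θ with a uniform prior and a binomial likelihood, and the counts `y` and `n` are hypothetical. The normalising integral is approximated by a sum over the grid.

```python
import math


def grid_posterior(y, n, grid_size=1001):
    """Grid approximation of the posterior of a binomial probability under a uniform prior."""
    step = 1.0 / (grid_size - 1)
    thetas = [i * step for i in range(grid_size)]
    prior = [1.0] * grid_size                                   # p(theta): uniform on [0, 1]
    likelihood = [math.comb(n, y) * t ** y * (1.0 - t) ** (n - y) for t in thetas]
    unnormalised = [p * l for p, l in zip(prior, likelihood)]   # p(theta) * p(y | theta)
    evidence = sum(unnormalised) * step                         # approximates the marginal of y
    posterior = [u / evidence for u in unnormalised]            # p(theta | y) as a density
    return thetas, posterior, step


# Hypothetical data: 3 reported cases out of 10 true cases.
thetas, posterior, step = grid_posterior(y=3, n=10)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior)) * step
print(round(posterior_mean, 4))
```

With a uniform prior the exact posterior is Beta(y + 1, n − y + 1), whose mean (y + 1)/(n + 2) provides a check on the grid result. For models with many parameters, such as the spatial model here, grids become impractical and MCMC sampling is used instead.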
In this write-up, the main variables are the structured effects (u), the unstructured effects (v) and their unknown variances. Applying the Bayesian method, the variables are estimated from the model in conjunction with prior distributions on the parameters. Here, α_0 in Equation (20) is assumed to also contain the intercept parameter because both have the same domain, namely the whole real line. It is against this backdrop that a Normal distribution with mean 0 and unknown variance is chosen.
This allows the posterior to be computed. We then establish the posterior marginal distribution of each of the parameters. With ⋆ standing for the conditioning arguments u_1i, u_2i, v_1i, v_2i, τ, we begin with the posterior marginal distribution of u_1i, in which ψ_1i represents the covariates, Xβ, for region 1. From Equation (23), we conclude that the posterior distribution of u_i takes the form of the marginal distribution of u_i specified in Equation (12).
With ⋆ standing for the conditioning arguments u_1i, u_2i, v_1i, v_2i, τ, the posterior distribution of the correlated part, u_2i, of the under-reporting probability can be computed in the same way. We conclude that the posterior distribution of u_2i takes the form of the marginal distribution of u_2i.
Similarly, the posterior distribution of the uncorrelated part, v_2i, of the under-reporting probability can be computed, and we conclude that the posterior distribution of v_2i takes the form of the marginal distribution of v_2i.
The conditional posterior distribution of τ can be computed as;

Data and Results
The aim of this study is to estimate the relative risk of diabetes in the presence of under-reporting. Data on diabetes cases were retrieved from the Ghana Health Service, an independent institution charged with the collection and collation of data on all aspects of health importance at the district level. The diabetes data were collected and summed by district, and the yearly totals were used in this work. The estimated population of each district for the study period was obtained from the Ghana Statistical Service. Data on this morbidity are available for all districts of Ghana. Only the year 2014 is considered, as the agency did not have complete data for later years. There were no missing data of any kind.
Model estimation was carried out using a Bayesian approach, with every parameter assigned a prior distribution. To be precise, a non-informative Normal prior was assigned to the offset parameter, α_0, while the variance parameters were assigned inverse gamma distributions. The study was carried out under the assumption that covariates are not available.
WinBUGS version 1.4 (Spiegelhalter, Thomas, Best, & Lunn, 2003) was used for the implementation. Two chains of 70,000 Markov chain Monte Carlo (MCMC) iterations were run, with the first 10,000 discarded as the burn-in period and every tenth value of the remainder retained, leaving 6,000 samples on which the estimates are based. Convergence was assessed from the behaviour of the trace plots and the auto-correlation and time series plots of the MCMC output (Gelman et al., 2014). According to Gelman et al. (2014), when the trace plots of the two chains overlap and cross each other, this is an indication of convergence. The posterior means of each model were used to assess its efficiency, and a Deviance Information Criterion (DIC) was then generated. The models were compared using the DIC, as proposed by Spiegelhalter et al. (2003). The best fitting model is the one with the smallest DIC and, in this case, Model 4 was used in the analysis.
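The sampling bookkeeping and the model comparison criterion can be sketched as follows. The functions are hypothetical helpers, not WinBUGS calls. The DIC formula used is the usual one, DIC = D̄ + p_D with p_D = D̄ − D(θ̄), where D̄ is the posterior mean deviance and D(θ̄) is the deviance at the posterior means. The deviance values in the example are made up purely to show the arithmetic.

```python
def retained_per_chain(n_iter, burn_in, thin):
    """Number of samples kept per chain after discarding burn-in and thinning."""
    return (n_iter - burn_in) // thin


def dic(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, p_D) from sampled deviances and the deviance at the posterior means."""
    d_bar = sum(deviance_samples) / len(deviance_samples)  # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean               # effective number of parameters
    return d_bar + p_d, p_d


# Settings reported in the text: 70,000 iterations, 10,000 burn-in, every tenth value kept.
kept = retained_per_chain(n_iter=70_000, burn_in=10_000, thin=10)
print(kept)

# Made-up deviance values, for illustration only.
dic_value, p_d = dic([100.0, 102.0, 104.0], deviance_at_posterior_mean=99.0)
print(dic_value, p_d)
```

A lower DIC indicates a better trade-off between fit and complexity, which is the basis on which the candidate models are ranked.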
The data we obtained have a variance far greater than the mean, which is the signature of over-dispersion. A variance larger than the mean can also be attributed to the presence of under-reporting, as shown in Equations (6) and (8). In this work, under-reporting has been accounted for by assuming that it varies spatially across all units. Under-reporting varies through a probability captured by a binomial distribution and depends solely on spatial properties. The overall count was then modelled using a Negative Binomial instead of a Poisson distribution, a choice motivated by the over-dispersion in the data. The Negative Binomial (NB) is therefore adopted for the estimation of the relative risks.
This work was based on the assumption that both the count data and under-reporting vary spatially across all regions. These spatial properties can be divided into two parts: a correlated part, u_1, and an uncorrelated part, v_1. For the count data, the mean values of the correlated part, u_1i, fall in the range (−2.194, 2.602), with most of the 95% credible intervals being entirely positive. This signifies a positive relationship with the relative risk. A similar pattern holds for the uncorrelated spatial part of the count data, v_1i, whose mean values fall in the range (−1.247, 0.2919), with all the 95% credible intervals being positive, which again indicates a positive relationship with the relative risk.
The values and the credible intervals of the generated relative risks are probabilistic in nature, i.e. all below one, with values ranging from 0.001 to 0.3 in the case of the Negative Binomial, while those of the Poisson fall in the range 0.1 to 50. With reference to the variables u_1i, v_1i, v_2i and u_2i, there was a positive relationship between the parameters and the relative risk for all units. For the under-reporting probabilities (π_u), the correlated part, u_2i, falls in the range (−0.103, 0.13), with all the 95% credible intervals being positive. The uncorrelated part, v_2i, falls in the range (−1.245, 1.066), with all the 95% credible intervals being positive. These results underline the importance of incorporating spatial elements when modelling count data.

Conclusion and Recommendation
The addition of spatial variables to the model produces a very good fit to the data. This was done by introducing structured and unstructured spatial elements in both the count data and the under-reporting probability. The spatially structured random effects were captured by the usual Conditional Auto-regressive (CAR) model proposed by Besag et al. (1991) and used by Ngesa et al. (2014). From Figure (1), it can be seen that most of the high risk areas are in the southern regions of Ghana. This outcome can be attributed to the eating and behaviour patterns of the inhabitants there. Darkwa (2011) pinpointed regions in the southern part of Ghana as having a high risk of diabetes, with a risk higher than the world average. Interestingly, inhabitants of the south have their staple foods prepared from maize, rice, cassava and yam. These crops are reported to fuel diabetes in humans (Darkwa, 2011).
The low risk areas, on the other hand, are known for meals prepared from crops like millet, sorghum and guinea corn. These foods are known to be low in carbohydrates. Also, people in the north are known to walk long distances to their farms, which serves as exercise for them. Exercise is a good therapy for diabetic patients (Darkwa, 2011; Danquah et al., 2012).
In this work, validation was not based on covariates, as the main idea was to identify and correct for spatial effects in under-reported cases in each district; covariates can be factored into future work. As a recommendation for future work, we propose an extension to the multivariate setting, where multiple diseases are known to exist in each geographical unit.

Figure 3. Time series plot of u_1
Figure 4. Trace plot of u_1

Table 1. Comparison of Count Models in Ghana
From Table (1), Model (2) fits the count data better than Model (3) because it has a lower DIC. It can also be seen from the table that the uncorrelated part, v_2i, of the probability of under-reporting (π_u) contributes more to the model than the correlated part, u_2i. This is evident from the difference in DIC between the two models. The fourth model is, however, the best model, with the lowest DIC. It can also be said that most of the geographical units under study fall below the 0.05 mark in Figure (1), signifying low risk, with most of these geographical units located in the southern part of Ghana. The probability of under-reporting, Figure (2), also has most of the geographical units with values greater than 0.1.
Figure 1. Relative Risk of diabetes cases in Ghana
Figure 2. Probability of Underreporting