The Bayes Factor for the Misclassified Categorical Data

This article addresses the issue of misclassification in a single categorical variable, that is, how to test whether the collected categorical data are misclassified. To tackle this issue, a pair of null and alternative hypotheses is proposed, and a mixed Bayesian approach is taken to test them. Specifically, a bias-adjusted cell proportion estimator is presented that accounts for the bias caused by classification errors in the observed categorical data. The chi-square test is then adjusted accordingly. To test the null hypothesis that the data are not misclassified under a specified multinomial distribution against the alternative hypothesis that they are misclassified, the Bayes factor is calculated for the observed data and compared with the classical p-value.


Introduction
The problem of misclassification is a major issue in observational epidemiologic studies. Not long after Bross (1954) pointed out that non-differential misclassification would bias the corrected odds ratio toward the null hypothesis, Diamond and Lilienfeld (1962a-b) extended the result to various types of epidemiologic studies. The 2 × 2 case-control study with a single misclassified exposure variable has been widely studied (Fleiss et al 2003, Chapter 17; Gustafson 2004, Chapter 5; Kleinbaum et al 1982, Chapter 12; Rothman et al 2008, Chapter 19). Yet almost no authors have investigated the effect of misclassification in the analysis of a single categorical variable, except Mote and Anderson (1965). Mote and Anderson primarily take a deductive approach to account for the bias caused by classification errors. The shortcoming of a deductive approach is that it does not take sampling errors into consideration. As a result, the issue of how to deal with misclassification in the analysis of categorical data remains unsolved.

This article addresses another important issue, namely whether the observed categorical data are misclassified. Instead of a deductive method, an inductive approach is employed to account for the misclassification bias embedded in the collected data. First, an inverse approach is taken by equating the expected value of the estimated sample cell proportion with its population parameter, conditional on the misclassification probabilities being given. The bias-adjusted estimator for the population cell proportion is then obtained by inverting the misclassification matrix. Second, the appropriate misclassification probabilities are calculated depending on whether misclassification can occur from one category to all other categories (scenario I) or only to its neighboring categories (scenario II). Third, to test the null hypothesis that the data are not misclassified under a specified multinomial distribution, a mixed Bayesian approach is used to calculate the Bayes factor and compare it with the traditional p-value.

Methodology & Background
Given that X is a categorical variable with K (≥ 3) categories and the data are collected through simple random sampling of size N, let $n_j$ be the observed count in the j-th category, where $\sum_{j=1}^{K} n_j = N$ (Table 1). The crude estimator $\hat{p}_j$ for the population cell proportion $p_j$ in the j-th category is then given by

$$\hat{p}_j = \frac{n_j}{N}, \qquad j = 1, \ldots, K. \tag{1}$$

Assume that $\hat{p}_j$ follows a multinomial distribution with size N and cell proportion $p_j$ for the j-th category.

Suppose that the observed data are misclassified. Let $w_{jk}$ (j ≠ k) be the misclassification probability that an observation belonging to the j-th category is incorrectly classified into the k-th category, and let $w_{jj}$ be the probability that an observation belonging to the j-th category is correctly classified into the j-th category. Then it is easily shown that the expected value of $\hat{p}_k$ is

$$E(\hat{p}_k) = \sum_{j=1}^{K} w_{jk}\, p_j, \qquad k = 1, \ldots, K, \tag{2}$$

where $W = (w_{jk})$ is the K × K misclassification matrix. Eq. 2 shows that the crude estimator $\hat{p}_k$ is no longer unbiased for the population parameter $p_k$, provided that $W \neq I$, where I is the K × K identity matrix. A set of misclassification probabilities $\{w_{jk}\}$ is said to be feasible if the misclassification matrix W in Eq. 2 is invertible (nonsingular) for $0 < w_{jk} < 1$.
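Assume that W is invertible. Then the bias-adjusted cell proportion (BACP) estimators $\tilde{p}_k$ are given by

$$\tilde{p}_k = \sum_{j=1}^{K} v_{jk}\, \hat{p}_j, \qquad k = 1, \ldots, K, \tag{3}$$

where $V = (v_{jk}) = W^{-1}$. Note that by using Eqs. 2 and 3 it is easily shown that $E(\tilde{p}_k) = p_k$. The misclassification matrix W has two possible forms depending on how the categorical variable X is misclassified.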
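The inversion step can be illustrated numerically. The following is a minimal sketch, not the author's implementation: it assumes the convention that the expected crude proportion in category k is the sum over j of w_jk p_j, and the matrix `W`, the vector `p`, and all function names are hypothetical values chosen only for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("W is singular, so these probabilities are not feasible")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def expected_crude(W, p):
    """Expected crude proportions: E(p_hat_k) = sum_j w_jk * p_j."""
    K = len(p)
    return [sum(W[j][k] * p[j] for j in range(K)) for k in range(K)]


def bias_adjusted(W, p_hat):
    """Bias-adjusted proportions: solve sum_j w_jk * p_j = p_hat_k for p."""
    K = len(p_hat)
    W_transposed = [[W[j][k] for j in range(K)] for k in range(K)]
    return solve(W_transposed, p_hat)


# Hypothetical misclassification matrix (each row sums to 1) and true proportions.
W = [[0.90, 0.06, 0.04],
     [0.05, 0.85, 0.10],
     [0.02, 0.08, 0.90]]
p = [0.5, 0.3, 0.2]

p_hat = expected_crude(W, p)      # what the crude estimator targets under misclassification
p_tilde = bias_adjusted(W, p_hat) # recovers p when W is known and invertible
print(p_hat, p_tilde)
```

In this sketch the crude proportions differ from `p`, while the adjusted values return to `p`, which is the unbiasedness property stated above. With real data the crude proportions carry sampling error, so the adjusted values can fall outside [0, 1].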
There are two possible scenarios:

Scenario I: The misclassification occurs by classifying one category incorrectly into any of the other categories. Because misclassification is equally likely from the j-th (correct) category to any k-th (observed) wrong category, we have, for fixed j and k ≠ j, […] and […].

Scenario II: The misclassification occurs by classifying one category incorrectly only into its neighboring categories. Therefore, for fixed j, $w_{jk} = 0$ for $|k - j| > 1$, and […].

When K = 3, the misclassification matrix, its determinant, and its inverse are obtained for scenarios I and II. For scenario I, the explicit forms of the misclassification matrix $W_I$ and its inverse $V_I$ are given respectively by […] and […]. By using Eqs. 6b and 7, the feasibility and admissibility constraints for the misclassification probability and the BACP estimator are given respectively by […] and […] (8b).

For scenario II, the explicit forms of the misclassification matrix $W_{II}$ and its inverse $V_{II}$ are given respectively by […] and […]. The BACP estimator for scenario II is thus given by […]. By using Eqs. 9b and 10, the feasibility and admissibility constraints for the misclassification probability and the BACP estimator are given respectively by […] and […] (11b).

To test whether the data in Table 1 are misclassified, we test the (sharp) null hypothesis that the data have no misclassification under $p = p_0$ against the alternative hypothesis that the data are misclassified (Berger and Sellke 1987):

$$H_0: p = p_0,\ \omega = 0 \quad \text{versus} \quad H_1: p \neq p_0,\ \omega > 0, \tag{12}$$

where […] and $\{w_{jk}\}$ are the entries of the misclassification matrix W given by Eq. 2. To test Eq. 12, the bias-adjusted chi-square test (BACST) is given by […] (13), where […], $v_{jk}$ denotes the entry in the j-th row and k-th column of the inverse matrix V of the misclassification matrix W in Eq. 2, and […].

For large samples, Eq. 13 is distributed under $H_0$ asymptotically as the central chi-square distribution with K − 1 degrees of freedom (df). Under $H_1$, Eq. 13 is distributed asymptotically as the noncentral chi-square distribution with K − 1 df and non-centrality parameter given by (Lancaster 1969) […] (14). When $w_{jk} = 0$ for all j ≠ k, Eq. 13 reduces to […] (15). Reject the null hypothesis $H_0$ if the statistic given by Eq. 15 exceeds $C_0$, where $C_0$ is the critical value of the central chi-square distribution with K − 1 df at significance level α.

As is well known from the Bayesian viewpoint, the p-value is not an adequate measure of the evidence supporting the null hypothesis (Goodman 1999a-b). Hence the Bayes factor is calculated for comparison with the p-value. To formulate the hypothesis-testing problem in a Bayesian setting, we begin with the data […] and assume that its probability distribution belongs to a family of distributions parameterized by […], where […] is the K-dimensional simplex. To test the hypotheses of Eq. 12, it is assumed that there exists a prior probability density function (PDF) […], where g is a prior PDF on $p \in \Sigma$ that assigns mass $\pi_0$ to $\{p = p_0\}$ and $1 - \pi_0$ to $\{p \neq p_0\}$. Define […]; the Bayes factor is given by (Kass and Raftery 1995) […], where $m_g$ is given by […] and […] in Eq. 17b is the PDF of the noncentral chi-square distribution with K − 1 df and the non-centrality parameter given by Eq. 14.
[…] of Eq. 17b is calculated for scenario I under the assumption […], where c is the upper bound on the admissible BACP for scenario I, which gives […], where the noncentral chi-square distribution is approximated by the central chi-square distribution (Cox and Reid 1987). The lower bounds for the Bayes factor, using a symmetric Dirichlet prior for g(p), are obtained under scenarios I and II: […] (19). The details for obtaining the value of max([…]$_i$), i = I or II, are given in the appendix.
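To see how a Bayes factor combines with the prior mass π0 placed on the null hypothesis, the standard identity "posterior odds = Bayes factor × prior odds" can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's Eq. 19: the function name is hypothetical, and the Bayes factor is assumed here to be expressed in favor of H1 over H0, a direction the text above does not confirm.

```python
def posterior_prob_h1(bayes_factor_10, prior_h0):
    """Posterior probability of H1 from a Bayes factor (H1 over H0) and the prior mass on H0.

    Assumption: bayes_factor_10 = P(data | H1) / P(data | H0).
    """
    if not 0.0 < prior_h0 < 1.0:
        raise ValueError("prior_h0 must lie strictly between 0 and 1")
    if bayes_factor_10 < 0.0:
        raise ValueError("a Bayes factor cannot be negative")
    prior_odds_10 = (1.0 - prior_h0) / prior_h0        # prior odds of H1 against H0
    posterior_odds_10 = bayes_factor_10 * prior_odds_10
    return posterior_odds_10 / (1.0 + posterior_odds_10)


# With equal prior mass on both hypotheses, a Bayes factor of 3 gives P(H1 | data) = 0.75.
print(posterior_prob_h1(3.0, 0.5))
```

Because the conversion is monotone in the Bayes factor, a lower bound on the Bayes factor translates directly into a lower bound on the posterior probability of the hypothesis it favors.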

Example
The issue of concern here is whether the data are misclassified, separately for males and females. Because we have no prior belief about the values of $p_0$ in Eq. 12, they are determined empirically from the observed data. As a result, the values of $p_0$ are chosen differently for males and females. For females, the values of $p_0$ in the null hypothesis are chosen to be those of equiprobability, […] = 0.47 (p-value = 0.79) for males and females. Therefore, the null hypothesis $H_0$ is not rejected at the 0.05 significance level for either males or females. Nevertheless, we would like to test the above hypotheses from the Bayesian perspective by calculating the Bayes factor for comparison with the p-value.
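The relation between a reported chi-square value and its p-value can be checked directly when K = 3, because the central chi-square distribution with 2 df has the closed-form upper tail exp(−x/2). The sketch below assumes that the statistic under the null hypothesis is the ordinary Pearson chi-square; the function names and the counts are hypothetical and are not the Bombay data.

```python
import math


def pearson_chi_square(counts, p0):
    """Pearson chi-square: sum over categories of (observed - expected)^2 / expected."""
    n = sum(counts)
    return sum((c - n * q) ** 2 / (n * q) for c, q in zip(counts, p0))


def chi2_upper_tail_df2(x):
    """Upper-tail probability of the central chi-square with 2 df, valid only for K = 3."""
    return math.exp(-x / 2.0)


# A statistic of 0.47 with 2 df gives a p-value of about 0.79.
print(round(chi2_upper_tail_df2(0.47), 2))

# Hypothetical counts for three categories tested against equiprobability.
counts = [30, 36, 34]
p0 = [1 / 3, 1 / 3, 1 / 3]
stat = pearson_chi_square(counts, p0)
print(stat, chi2_upper_tail_df2(stat))
```

For K other than 3 the closed form no longer applies, and a general chi-square tail routine would be needed.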
For both males and females under scenarios I and II, Eq. A10 in the appendix has three negative real roots, one positive real root, and a pair of complex conjugate roots. Because of the constraint τ > 0, only the positive root is a stationary point of Eq. A9. For males, Eq. A9 has a unique positive local maximum only under scenario II (Figure 1), while for females Eq. A9 has a unique positive local maximum at its stationary point only under scenario I (Figure 2).

Figure 2. The function given by Eq. A9 for CF model 12 under scenario I for females.

Discussion
Some observations are worth mentioning:

1. So far, this author is not aware of any guideline in the literature on how large the lower bound for the Bayes factor should be before we can be confident that the evidence provided by the data supports $H_1$ rather than $H_0$. Yet, since the lower bounds for the Bayes factor from the cancer data for both genders were not large enough, a tentative conclusion is that the cancer data in Table 2 seem unlikely to be misclassified. Although $H_0$ was also not rejected for either gender in Table 2 according to the p-values (Table 3, column 6), the p-value is, strictly speaking, not an appropriate measure for assessing the evidence provided by the data because of its inherent fallacy (Goodman 1999a-b).

2. From the analysis of the Bombay cancer data, the existence of the Bayes factor seems to depend not only on the scenario (I or II), that is, the misclassification pattern, but also on the multinomial distribution $p_0$ (Table 3). To clarify this issue, another data set on the degree of severity of the clinical condition of myocardial infarction patients was studied (Snow 1965), where the distributions $p_0$ for the treated and control groups are specified as (0.4, 0.4, 0.2) and (0.3, 0.4, 0.3), respectively. It was found that the Bayes factor exists for the treated group under scenario I but not under scenario II, whereas for the control group it exists under both scenarios. It seems that a crucial condition for the existence of the Bayes factor is whether the BACST value (Eq. 13) is positive. As far as the existence of the Bayes factor is concerned, I would like to make the following conjecture: "For any data set under either scenario I or II, the lower bound of the BACST value of Eq. 13 is positive for K ≥ 3."

Conclusion
This paper addresses the question of how to test whether collected categorical data are misclassified. A mixed Bayesian approach is used to test the null hypothesis that the collected data are not misclassified under a specified multinomial distribution for the studied categorical variable. The Bayes factor is employed as the main instrument to assess the evidence provided by the data. The lung cancer data from all hospitals in the city of Bombay, India, were used as an illustrative example. Based on the Bayes factor results in this study, the p-value was again shown not to be an appropriate measure of the evidence provided by the data.

Appendix A
With an assumption of […] (A1). By substituting Eq. A1 into Eq. 13, we have […], where […]. By Eq. 14, we have […]. […] of Eq. 18, with the prior h chosen to be the PDF of the uniform distribution over $[0, c_1]$, is reduced to […]. By substituting Eqs. A2 and A3 into the above equation and integrating with respect to θ, we have after algebraic simplification […], where […].

With an assumption of […], by using Eq. A6, we have […]. By substituting Eq. A7 into Eq. 13 and integrating, […], where […]. If the prior distribution for g(p) is taken to be a symmetric Dirichlet distribution with flattening constant (or hyper-parameter) τ (τ > 0) (Good 1975), then Eq. A5 is reduced to […] and set it equal to zero, we have after simplification […].

A set of misclassification error probabilities $\{w_{jk}\}$ is said to be admissible if the corresponding BACP estimators $\{\tilde{p}_k\}$ are admissible.

[…] for scenario I are given by […]. Since p and ω are a priori independent under $H_1$, we have […].

[…] and $w_{jk} > 0$, while the values of $p_0$ in the null hypothesis for males are set up as follows: […] > 0. Because the misclassification probabilities $\{w_{jk}\}$, j, k = 1, 2, 3, are zero under the null hypothesis, the BACST values of Eq. 15 are then given respectively by […].


To avoid the use of a hyper-prior distribution on τ (Good and Crook 1974), a non-Bayesian approach is used to find the stationary point max(.) […] an elementary technique in calculus to calculate the first derivative of […].

Table 1. Observed data for the categorical variable X

The data in Table 2 are taken from Table C.1 in Woodward's book, pp. 756-760 (Woodward 2005). They represent the lung cancer data collected by the Bombay Cancer Registry from all cancer patients registered in the 168 government and private hospitals and nursing homes in Bombay, India, and from death records maintained by the Bombay Municipal Corporation. The survival time of each subject with lung cancer, from first diagnosis to death (or censoring), was recorded over the period 1 January 1989 to 31 December 1991. Here we are concerned only with the type of tumor of the 682 subjects, grouped by gender.

Table 3. A comparison of the lower bound for the Bayes factor (Eq. 19) with the p-value for admissible CF models