Measures on Proportional Reduction in Error by Arithmetic, Geometric and Harmonic Means for Multiway Contingency Tables

Abstract For multi-way contingency tables with nominal categories, this paper proposes three kinds of proportional reduction in error measures, which describe the relative decrease in the probability of making an error in predicting the value of one variable when the values of the other variables are known, as opposed to when they are not known. The measures have forms of arithmetic, geometric and harmonic means. An example is shown.


Introduction
Consider an $R \times C$ contingency table in which both the explanatory variable $X$ and the response variable $Y$ have nominal categories. Let $p_{ij}$ denote the probability that an observation will fall in the $i$th category of $X$ and in the $j$th category of $Y$ ($i = 1, \ldots, R$; $j = 1, \ldots, C$). Goodman and Kruskal (1954) proposed two kinds of measures, i.e., (1) the measure which describes the proportional reduction in variation (PRV) in predicting the $Y$ category obtained when the $X$ category is known, as opposed to when the $X$ category is not known, and (2) the measure which describes the proportional reduction in error (PRE) in predicting it. Although the details are omitted, some PRV measures are considered by, e.g., Theil (1970), Tomizawa, Seo and Ebi (1997), Tomizawa and Ebi (1998), Tomizawa and Yukawa (2003), and Yamamoto, Miyamoto and Tomizawa (2010).
The present paper considers the PRE measures. Goodman and Kruskal (1954) proposed the PRE measure as

$$\lambda_B = \frac{\sum_{i=1}^{R} p_{im} - p_{\cdot m}}{1 - p_{\cdot m}},$$

where

$$p_{im} = \max_{j} p_{ij}, \qquad p_{\cdot m} = \max_{j} p_{\cdot j}, \qquad p_{\cdot j} = \sum_{i=1}^{R} p_{ij};$$

also see Bishop, Fienberg and Holland (1975, p. 388), and Everitt (1992, p. 58). This measure describes the relative decrease in the probability of making an error in predicting the value of $Y$ when the value of $X$ is known, as opposed to when it is not known. The measure $\lambda_B$ has the properties that (i) $0 \le \lambda_B \le 1$, (ii) $\lambda_B = 0$ if and only if the information about the explanatory variable $X$ does not reduce the probability of making an error in predicting the category of the variable $Y$, and (iii) $\lambda_B = 1$ if and only if no error is made, given knowledge of the explanatory variable $X$; namely, there is complete predictive association.
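As an illustration of the definition of $\lambda_B$, the following minimal Python sketch computes the measure from a table of joint probabilities. The function name `lambda_b` and the $2 \times 2$ table are hypothetical and are not taken from the paper; the sketch also assumes that the largest column marginal is less than 1, since otherwise the measure is undefined.

```python
def lambda_b(p):
    """Goodman-Kruskal lambda_B: predict the column variable Y from the row variable X.

    p is a list of rows of joint probabilities p[i][j] that sum to 1.
    """
    col_marginals = [sum(col) for col in zip(*p)]
    best_without_x = max(col_marginals)          # p_{.m}: best guess when X is unknown
    best_with_x = sum(max(row) for row in p)     # sum_i p_{im}: best guess within each row
    return (best_with_x - best_without_x) / (1.0 - best_without_x)


# Hypothetical 2 x 2 table of joint probabilities (illustration only).
table = [[0.4, 0.1],
         [0.1, 0.4]]
print(lambda_b(table))  # approximately 0.6, up to floating-point rounding
```

For this table the error probability falls from 0.5 (guessing without $X$) to 0.2 (guessing with $X$), a relative decrease of about 60%.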
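Next, consider the reverse case, in which $Y$ is the explanatory variable and $X$ is the response variable. The following measure $\lambda_A$ is suitable for predictions of $X$ from $Y$, defined by

$$\lambda_A = \frac{\sum_{j=1}^{C} p_{mj} - p_{m \cdot}}{1 - p_{m \cdot}},$$

where

$$p_{mj} = \max_{i} p_{ij}, \qquad p_{m \cdot} = \max_{i} p_{i \cdot}, \qquad p_{i \cdot} = \sum_{j=1}^{C} p_{ij};$$

see Goodman and Kruskal (1954).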
The measures $\lambda_B$ and $\lambda_A$ are specifically designed for the situation in which the explanatory and response variables are defined. Now consider the situation where the explanatory and response variables are not defined. In this case, the following measure $\lambda$ is given:

$$\lambda = \frac{\sum_{i=1}^{R} p_{im} + \sum_{j=1}^{C} p_{mj} - p_{\cdot m} - p_{m \cdot}}{2 - p_{\cdot m} - p_{m \cdot}};$$

see Goodman and Kruskal (1954). This indicates the PRE in predicting the category of either variable as between knowing and not knowing the category of the other variable. Also, the measure $\lambda$ is a weighted sum of the measures $\lambda_B$ and $\lambda_A$.
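The weighted-sum relation can be checked numerically. The sketch below computes $\lambda_B$, $\lambda_A$ and $\lambda$ for a two-way table and compares $\lambda$ with the combination of $\lambda_B$ and $\lambda_A$ weighted by $1 - p_{\cdot m}$ and $1 - p_{m \cdot}$, which follows from adding the numerators and denominators of the two asymmetric measures. The function name `gk_lambdas` and the $2 \times 3$ table are hypothetical.

```python
def gk_lambdas(p):
    """Return (lambda_B, lambda_A, lambda) for joint probabilities p[i][j]."""
    row_marginals = [sum(row) for row in p]
    col_marginals = [sum(col) for col in zip(*p)]
    p_row_max = max(row_marginals)                   # p_{m.}
    p_col_max = max(col_marginals)                   # p_{.m}
    sum_row_maxima = sum(max(row) for row in p)      # sum_i p_{im}
    sum_col_maxima = sum(max(col) for col in zip(*p))  # sum_j p_{mj}
    lam_b = (sum_row_maxima - p_col_max) / (1 - p_col_max)
    lam_a = (sum_col_maxima - p_row_max) / (1 - p_row_max)
    lam = (sum_row_maxima + sum_col_maxima - p_col_max - p_row_max) / (
        2 - p_col_max - p_row_max)
    return lam_b, lam_a, lam


# Hypothetical 2 x 3 table of joint probabilities (illustration only).
table = [[0.30, 0.10, 0.10],
         [0.05, 0.15, 0.30]]
lam_b, lam_a, lam = gk_lambdas(table)

# Weighted sum with weights proportional to (1 - p_{.m}) and (1 - p_{m.}).
w_b, w_a = 1 - 0.40, 1 - 0.50   # largest column and row marginals of this table
weighted = (w_b * lam_b + w_a * lam_a) / (w_b + w_a)
print(lam_b, lam_a, lam, weighted)
```

Here $\lambda_B \approx 0.333$, $\lambda_A = 0.5$, and both $\lambda$ and the weighted sum are approximately 0.409.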
For a two-way contingency table with both nominal categories, Yamamoto and Tomizawa (2010) proposed new PRE measures, say $\Lambda$, expressed as the arithmetic, geometric and harmonic means of $\lambda_B$ and $\lambda_A$. For a two-way contingency table with nominal-ordinal categories, Yamamoto, Nozaki and Tomizawa (2011) proposed a PRE measure, although the details are omitted here.
The purpose of the present paper is to extend the measures of Yamamoto and Tomizawa (2010) to $T$-way contingency tables ($T \ge 3$) with all nominal categories. Section 2 proposes measures for three-way tables ($T = 3$), and Section 3 extends them to multi-way tables ($T \ge 4$) and expresses them in a more general form that includes the three kinds of means. Section 4 analyzes data as an example.
2. New PRE Measures for Three-way Contingency Tables

Measures
Consider an $R \times C \times L$ contingency table with variables $X$, $Y$ and $Z$, all of which have nominal categories. Let $p_{ijk}$ denote the probability that an observation will fall in the $(i, j, k)$th cell of the table ($i = 1, \ldots, R$; $j = 1, \ldots, C$; $k = 1, \ldots, L$). When the explanatory and response variables are not defined, namely, when we cannot specify which of the variables is the response, we consider three kinds of prediction: predicting $X$, predicting $Y$ and predicting $Z$.
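First, consider the table with a response variable $X$ and two explanatory variables $Y$ and $Z$. In this case, a PRE measure, which describes the relative decrease in the probability of making an error in predicting the value of $X$ when the values of the other variables, $Y$ and $Z$, are known, as opposed to when they are not known, is defined by

$$\lambda^{(3)}_A = \frac{\sum_{j=1}^{C} \sum_{k=1}^{L} p_{mjk} - p_{m \cdot \cdot}}{1 - p_{m \cdot \cdot}},$$

where

$$p_{mjk} = \max_{i} p_{ijk}, \qquad p_{m \cdot \cdot} = \max_{i} p_{i \cdot \cdot}, \qquad p_{i \cdot \cdot} = \sum_{j=1}^{C} \sum_{k=1}^{L} p_{ijk}.$$

Similarly, the PRE measures for the table regarded as having a response variable $Y$ and two explanatory variables $X$ and $Z$, and as having a response variable $Z$ and two explanatory variables $X$ and $Y$, are defined by

$$\lambda^{(3)}_B = \frac{\sum_{i=1}^{R} \sum_{k=1}^{L} p_{imk} - p_{\cdot m \cdot}}{1 - p_{\cdot m \cdot}}, \qquad \lambda^{(3)}_C = \frac{\sum_{i=1}^{R} \sum_{j=1}^{C} p_{ijm} - p_{\cdot \cdot m}}{1 - p_{\cdot \cdot m}},$$

where

$$p_{imk} = \max_{j} p_{ijk}, \quad p_{\cdot m \cdot} = \max_{j} p_{\cdot j \cdot}, \quad p_{ijm} = \max_{k} p_{ijk}, \quad p_{\cdot \cdot m} = \max_{k} p_{\cdot \cdot k}.$$

Then, we shall propose three kinds of new PRE measures as follows:

$$\lambda^{(3)}_a = \frac{\lambda^{(3)}_A + \lambda^{(3)}_B + \lambda^{(3)}_C}{3}, \qquad \lambda^{(3)}_g = \left( \lambda^{(3)}_A \lambda^{(3)}_B \lambda^{(3)}_C \right)^{1/3}, \qquad \lambda^{(3)}_h = \frac{3}{\dfrac{1}{\lambda^{(3)}_A} + \dfrac{1}{\lambda^{(3)}_B} + \dfrac{1}{\lambda^{(3)}_C}}.$$

The measures $\lambda^{(3)}_a$, $\lambda^{(3)}_g$ and $\lambda^{(3)}_h$ are the arithmetic mean, geometric mean and harmonic mean of $\lambda^{(3)}_A$, $\lambda^{(3)}_B$ and $\lambda^{(3)}_C$, respectively.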
Let $\lambda^{*}$ denote each of the measures $\lambda^{(3)}_a$, $\lambda^{(3)}_g$ and $\lambda^{(3)}_h$. Each measure has the properties that (i) $\lambda^{*}$ must lie between 0 and 1, (ii) $\lambda^{*} = 0$ if and only if the information about two variables does not reduce the probability of making an error in predicting the category of the other variable, and (iii) $\lambda^{*} = 1$ if and only if no error is made, given knowledge of two variables; namely, there is complete predictive association. We point out that if the variables are independent, then the measure $\lambda^{*}$ equals 0, but the converse need not hold. Note that when the values of $\lambda^{(3)}_A$, $\lambda^{(3)}_B$ and $\lambda^{(3)}_C$ are 0, for example when the variables are independent, the measure $\lambda^{(3)}_h$ cannot measure the PRE. In such a case, the measures $\lambda^{(3)}_a$ and $\lambda^{(3)}_g$ should be used as PRE measures.
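We see that

$$\min\left( \lambda^{(3)}_A, \lambda^{(3)}_B, \lambda^{(3)}_C \right) \le \lambda^{(3)}_h \le \lambda^{(3)}_g \le \lambda^{(3)}_a \le \max\left( \lambda^{(3)}_A, \lambda^{(3)}_B, \lambda^{(3)}_C \right),$$

where the equality holds if and only if $\lambda^{(3)}_A$, $\lambda^{(3)}_B$ and $\lambda^{(3)}_C$ are all equal.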
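The three-way measures and the ordering of the three means can be illustrated with a short Python sketch. The function names, the nested-list layout `p[i][j][k]` and the $2 \times 2 \times 2$ table are hypothetical; returning `None` for the harmonic mean when a component is 0 is only one possible way to signal that this mean cannot be used in that case.

```python
from math import prod


def three_way_lambdas(p):
    """Return (lambda_A, lambda_B, lambda_C) for joint probabilities p[i][j][k]."""
    ri, rj, rk = range(len(p)), range(len(p[0])), range(len(p[0][0]))
    # Largest marginal probability of each variable (best guess with no information).
    mx = max(sum(p[i][j][k] for j in rj for k in rk) for i in ri)
    my = max(sum(p[i][j][k] for i in ri for k in rk) for j in rj)
    mz = max(sum(p[i][j][k] for i in ri for j in rj) for k in rk)
    # Probability of a correct guess when the other two variables are known.
    sx = sum(max(p[i][j][k] for i in ri) for j in rj for k in rk)
    sy = sum(max(p[i][j][k] for j in rj) for i in ri for k in rk)
    sz = sum(max(p[i][j][k] for k in rk) for i in ri for j in rj)
    return (sx - mx) / (1 - mx), (sy - my) / (1 - my), (sz - mz) / (1 - mz)


def three_means(values):
    """Arithmetic, geometric and harmonic means; harmonic is None if any value is 0."""
    n = len(values)
    arithmetic = sum(values) / n
    geometric = prod(values) ** (1.0 / n)
    harmonic = n / sum(1.0 / v for v in values) if min(values) > 0 else None
    return arithmetic, geometric, harmonic


# Hypothetical 2 x 2 x 2 table of joint probabilities (illustration only).
table = [[[0.20, 0.05], [0.10, 0.05]],
         [[0.05, 0.15], [0.05, 0.35]]]
lams = three_way_lambdas(table)
print(lams)               # roughly (0.5, 0.222, 0.5)
print(three_means(lams))  # roughly (0.407, 0.382, 0.353)
```

In this example the harmonic mean is the smallest and the arithmetic mean the largest of the three, and all lie between the smallest and largest component measures.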

Measures
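Consider a $T$-way contingency table with variables $X_1, X_2, \ldots, X_T$, all having nominal categories, in which the $(T - 1)$ explanatory variables and one response variable are not defined. Let $p_{i_1 i_2 \cdots i_T}$ denote the probability that an observation will fall in the $(i_1, i_2, \ldots, i_T)$th cell of the table, and let $p^{(k)}_{i_k} = P(X_k = i_k)$. Then, we shall extend the measures as follows:

$$\lambda^{(T)}_k = \frac{\sum_{i_1} \cdots \sum_{i_{k-1}} \sum_{i_{k+1}} \cdots \sum_{i_T} \max_{i_k} p_{i_1 i_2 \cdots i_T} - \max_{i_k} p^{(k)}_{i_k}}{1 - \max_{i_k} p^{(k)}_{i_k}} \quad (k = 1, \ldots, T),$$

and

$$\lambda^{(T)}_a = \frac{1}{T} \sum_{k=1}^{T} \lambda^{(T)}_k, \qquad \lambda^{(T)}_g = \left( \prod_{k=1}^{T} \lambda^{(T)}_k \right)^{1/T}, \qquad \lambda^{(T)}_h = \frac{T}{\sum_{k=1}^{T} 1 / \lambda^{(T)}_k}.$$

The measures $\lambda^{(T)}_a$, $\lambda^{(T)}_g$ and $\lambda^{(T)}_h$ are the arithmetic mean, geometric mean and harmonic mean of $\lambda^{(T)}_1$ through $\lambda^{(T)}_T$, respectively.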
Let $\Lambda^{(T)}$ denote each of the measures $\lambda^{(T)}_a$, $\lambda^{(T)}_g$ and $\lambda^{(T)}_h$. Each measure has the properties that (i) $\Lambda^{(T)}$ must lie between 0 and 1, (ii) $\Lambda^{(T)} = 0$ if and only if the information about $(T - 1)$ variables does not reduce the probability of making an error in predicting the category of the other variable, and (iii) $\Lambda^{(T)} = 1$ if and only if no error is made, given knowledge of $(T - 1)$ variables; namely, there is complete predictive association. We point out that if all variables are independent, then the measure $\Lambda^{(T)}$ equals 0, but the converse need not hold. Note that when $\lambda^{(T)}_k = 0$ ($k = 1, \ldots, T$), for example when all variables are independent, the measure $\lambda^{(T)}_h$ cannot measure the PRE. In such a case, the measures $\lambda^{(T)}_a$ and $\lambda^{(T)}_g$ should be used as PRE measures.
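We see that

$$\min_{k} \lambda^{(T)}_k \le \lambda^{(T)}_h \le \lambda^{(T)}_g \le \lambda^{(T)}_a \le \max_{k} \lambda^{(T)}_k,$$

where the equality holds if and only if $\lambda^{(T)}_1$ through $\lambda^{(T)}_T$ are all equal. We note that $\Lambda^{(T)}$ when $T = 2$ is equivalent to the measure $\Lambda$ proposed in Yamamoto and Tomizawa (2010).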
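For a general $T$-way table, the component measures $\lambda^{(T)}_k$ and their three means can be sketched in Python as follows. The representation of the table as a dictionary from index tuples to joint probabilities, the function names `lambda_k` and `pre_means`, and the example table are all hypothetical; note that the code indexes variables from 0 rather than from 1.

```python
from collections import defaultdict
from math import prod


def lambda_k(cells, k):
    """PRE for predicting variable k (0-based) from all the other variables.

    cells maps index tuples (i_1, ..., i_T) to joint probabilities summing to 1.
    """
    marginal = defaultdict(float)         # p^{(k)}_{i_k}
    best_given_rest = defaultdict(float)  # max over i_k for each setting of the others
    for idx, prob in cells.items():
        marginal[idx[k]] += prob
        rest = idx[:k] + idx[k + 1:]
        best_given_rest[rest] = max(best_given_rest[rest], prob)
    p_max = max(marginal.values())
    return (sum(best_given_rest.values()) - p_max) / (1.0 - p_max)


def pre_means(cells):
    """Return the component measures and their arithmetic, geometric, harmonic means."""
    T = len(next(iter(cells)))
    lams = [lambda_k(cells, k) for k in range(T)]
    arithmetic = sum(lams) / T
    geometric = prod(lams) ** (1.0 / T)
    # The harmonic mean is not usable when any component is 0.
    harmonic = T / sum(1.0 / v for v in lams) if min(lams) > 0 else None
    return lams, arithmetic, geometric, harmonic


# Hypothetical 2 x 2 x 2 x 2 table with complete predictive association:
# all four variables always take the same category.
perfect = {(0, 0, 0, 0): 0.5, (1, 1, 1, 1): 0.5}
print(pre_means(perfect))  # every component measure and every mean equals 1
```

With $T = 2$ the same functions reproduce the two asymmetric two-way measures, with `lambda_k(cells, 0)` corresponding to $\lambda_A$ and `lambda_k(cells, 1)` to $\lambda_B$.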

Generalization of the Measures
Considering a monotonic function $g$, we shall propose a generalized measure, which includes the measures $\lambda^{(T)}_a$, $\lambda^{(T)}_g$ and $\lambda^{(T)}_h$, as follows:

$$\Lambda^{(T)} = g^{-1}\left( \frac{1}{T} \sum_{k=1}^{T} g\left( \lambda^{(T)}_k \right) \right).$$

The functions $g$ and $g^{-1}$ are differentiable functions. In particular, (i) when $g(x) = x$, the measure $\Lambda^{(T)}$ is identical to $\lambda^{(T)}_a$, (ii) when $g(x) = \log x$, the measure $\Lambda^{(T)}$ is identical to $\lambda^{(T)}_g$, and (iii) when $g(x) = 1/x$, the measure $\Lambda^{(T)}$ is identical to $\lambda^{(T)}_h$. $\Lambda^{(T)}$ has the same properties as $\lambda^{(T)}_a$, $\lambda^{(T)}_g$ and $\lambda^{(T)}_h$ (see Section 3.1).
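The way the three choices of $g$ recover the three means can be seen in a small Python sketch. The function name `generalized_mean` and the component values are hypothetical; the logarithmic and reciprocal choices assume that every component is strictly positive.

```python
import math


def generalized_mean(lams, g, g_inv):
    """Apply g to each component, average, and map back with the inverse of g."""
    return g_inv(sum(g(x) for x in lams) / len(lams))


# Hypothetical component measures lambda_1, ..., lambda_T (illustration only).
lams = [0.5, 0.2, 0.4]

arithmetic = generalized_mean(lams, lambda x: x, lambda y: y)
geometric = generalized_mean(lams, math.log, math.exp)                 # g(x) = log x
harmonic = generalized_mean(lams, lambda x: 1 / x, lambda y: 1 / y)    # g(x) = 1/x
print(arithmetic, geometric, harmonic)
```

Other monotonic choices of $g$ with a differentiable inverse give further members of the same family.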

Approximate Confidence Interval for Measures
In a similar way to the case of $T = 3$, $\sqrt{n}(\hat{\Lambda}^{(T)} - \Lambda^{(T)})$, where $n$ is the sample size and $\hat{\Lambda}^{(T)}$ is the estimated measure, has asymptotically a normal distribution with mean 0 and variance [...]. For the measures $\lambda^{(T)}_a$, $\lambda^{(T)}_g$ and $\lambda^{(T)}_h$, the variances are [...], where [...] and $I(\cdot)$ is the indicator function.
Then, we can construct an asymptotic confidence interval for each measure using the estimated variance, although the details are omitted.
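One way to obtain such an interval numerically is sketched below. This is not the closed-form variance of this section: it is the generic delta method for multinomial sampling, with the partial derivatives of the measure approximated by forward finite differences. The function names, the step size `h`, the normal quantile `z = 1.96` and the table of counts are assumptions made for illustration, and the approximation is unreliable when two cells or two marginal totals are tied for a maximum, because the measure is not differentiable there.

```python
from collections import defaultdict
from math import sqrt


def lambda_a(cells):
    """Arithmetic-mean PRE measure for a table given as {index tuple: probability}."""
    T = len(next(iter(cells)))
    total = 0.0
    for k in range(T):
        marginal, best = defaultdict(float), defaultdict(float)
        for idx, prob in cells.items():
            marginal[idx[k]] += prob
            rest = idx[:k] + idx[k + 1:]
            best[rest] = max(best[rest], prob)
        p_max = max(marginal.values())
        total += (sum(best.values()) - p_max) / (1.0 - p_max)
    return total / T


def delta_method_ci(counts, measure, z=1.96, h=1e-6):
    """Estimate, standard error and approximate confidence interval for measure."""
    n = sum(counts.values())
    p_hat = {idx: c / n for idx, c in counts.items()}
    estimate = measure(p_hat)
    grads = {}
    for idx in p_hat:
        bumped = dict(p_hat)
        bumped[idx] += h                       # forward difference in one cell
        grads[idx] = (measure(bumped) - estimate) / h
    mean_grad = sum(p_hat[i] * grads[i] for i in p_hat)
    variance = sum(p_hat[i] * grads[i] ** 2 for i in p_hat) - mean_grad ** 2
    se = sqrt(max(variance, 0.0) / n)
    return estimate, se, (estimate - z * se, estimate + z * se)


# Hypothetical 2 x 2 table of counts (illustration only).
counts = {(0, 0): 50, (0, 1): 10, (1, 0): 10, (1, 1): 30}
print(delta_method_ci(counts, lambda_a))
```

For these counts the estimate is 0.5 with a standard error of roughly 0.108.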

An Example
Consider the data in Table 1, taken from Goodman (1975), which shows the McHugh test data on creative ability in machine design. This table cross-classifies 137 engineers with respect to their dichotomized scores (above the subtest mean (1) or below the subtest mean (2)) obtained on each of four different subtests that were supposed to measure creative ability in machine design. There are sixteen response patterns because the table has four variables (items A, B, C and D) and each has two categories. We are interested in the degree of the relative decrease in the probability of making an error in predicting the value of one variable when we know the values of the other three variables, as opposed to when we do not know them. We shall analyze these data by using the proposed measures because the explanatory and response variables are not defined. When we use the measure $\lambda^{(4)}_a$, for example, the estimated value of $\lambda^{(4)}_a$ is 0.470 (Table 2). We see that in prediction of one of the variables from the others, the information reduces the probability of making an error by 47.0%. Similarly, the estimated values of $\lambda^{(4)}_g$ and $\lambda^{(4)}_h$ are 0.469 and 0.467, respectively, so these measures lead to a similar interpretation of the data. We are also interested in the values of the test statistic for the hypotheses of independence of (1) item A and items (B, C, D), (2) B and (A, C, D), (3) C and (A, B, D), and (4) D and (A, B, C). The values of Pearson's chi-squared statistic are 35.93 for (1), 37.67 for (2), 48.17 for (3), and 42.06 for (4), each with seven degrees of freedom. Therefore, we can see a strong association between each of the variables and the other three variables, so it is meaningful to examine the values of the proposed measures.
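The independence tests of one item against the other three can be computed by collapsing the table into a two-way table whose columns are the joint configurations of the remaining variables. The Python sketch below does this for a table of counts stored as a dictionary; the function name `pearson_chi2_one_vs_rest` and the counts are hypothetical (they are not the data of Table 1), and the sketch assumes that every category and every configuration it encounters has a positive total.

```python
from collections import defaultdict


def pearson_chi2_one_vs_rest(counts, k):
    """Pearson chi-squared for independence of variable k (0-based) and the rest.

    counts maps index tuples to observed counts; returns (statistic, degrees of freedom).
    """
    n = sum(counts.values())
    row = defaultdict(float)  # totals for each category of variable k
    col = defaultdict(float)  # totals for each joint configuration of the others
    for idx, c in counts.items():
        row[idx[k]] += c
        col[idx[:k] + idx[k + 1:]] += c
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = counts.get(b[:k] + (a,) + b[k:], 0)  # rebuild the full index
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    return chi2, df


# Hypothetical 2 x 2 table of counts (illustration only).
counts = {(0, 0): 50, (0, 1): 10, (1, 0): 10, (1, 1): 30}
print(pearson_chi2_one_vs_rest(counts, 0))  # statistic near 34.03 with 1 degree of freedom
```

For a $2 \times 2 \times 2 \times 2$ table the collapsed table is $2 \times 8$, which gives the seven degrees of freedom quoted above.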

Concluding Remarks
For analyzing multi-way ($T$-way) contingency tables with nominal categories, we have proposed three kinds of PRE measures, which describe the relative decrease in the probability of making an error in predicting the value of one variable when the values of the other variables are known, as opposed to when they are not known. The proposed measures are the arithmetic mean ($\lambda^{(T)}_a$), the geometric mean ($\lambda^{(T)}_g$) and the harmonic mean ($\lambda^{(T)}_h$). These measures are useful for analyzing tables in which the explanatory and response variables are not defined. A point to notice is that the measure $\lambda^{(T)}_h$ cannot measure the PRE when the variables are independent and/or any $\lambda^{(T)}_k$ ($k = 1, \ldots, T$) is 0. In such a case, the measures $\lambda^{(T)}_a$ and $\lambda^{(T)}_g$ should be used. It is difficult to give a general rule for choosing among the arithmetic, geometric and harmonic means. We recommend the use of $\lambda^{(T)}_a$ because of its simple interpretation. In addition, the measure $\Lambda^{(T)}$, including $\lambda^{(T)}_a$, $\lambda^{(T)}_g$ and $\lambda^{(T)}_h$, is invariant under arbitrary permutations of the categories. Therefore the measure is suitable for analyzing data on a nominal scale; it can also be applied to data on an ordinal scale because it only requires a categorical scale.

References

Yamamoto, K., & Tomizawa, S. (2010). Measures of proportional reduction in error for two-way contingency tables with nominal categories. Biostatistics, Bioinformatics and Biomathematics, 2, 43-52.

Table 2. Estimates of the measures, approximate standard errors for them and approximate 95% confidence intervals for the measures, applied to Table 1.