Linear Contrasts Based on an Extension of the Wilcoxon-Mann-Whitney Approach

A well-known approach to comparing two independent groups is to focus on the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. The paper suggests a generalization that can be used with any linear contrast based on J > 2 independent groups. Roughly, the proposed measure of effect size reflects the probability that, among 2K random variables, the typical average associated with the first K variables is less than the typical average among the other K variables. In effect, it represents a relatively simple measure of effect size that might be used to supplement other measures of effect size when dealing with two-way and higher designs.


Introduction
Let $X_1$ and $X_2$ be two independent random variables. One of the best-known approaches to comparing the corresponding groups is the Wilcoxon-Mann-Whitney (WMW) test, which is based on an estimate of $p = P(X_1 < X_2)$, the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. Certainly, $p$ is a useful and important measure of effect size that is readily understood by nonstatisticians; additional arguments supporting the use of $p$ are summarized, for example, by Cliff (1996), Ruscio (2008) and Newcombe (2006). But as a method for making inferences about $p$ under general conditions, the WMW test is unsatisfactory. The basic reason is that the standard error of the WMW test statistic was derived assuming that $X_1$ and $X_2$ have identical distributions. When the distributions differ, the WMW test uses an incorrect estimate of the standard error. Numerous methods have been proposed for dealing with this issue (e.g., Brunner & Munzel, 2000; Cliff, 1996; Wilcox, 2017), some of which perform reasonably well even with relatively small sample sizes.
Now consider the case of $J$ independent variables having means $\mu_1, \ldots, \mu_J$. A common goal is testing
$$H_0: \Psi = 0, \tag{1}$$
where $\Psi = \sum c_j \mu_j$ and the linear contrast coefficients $c_1, \ldots, c_J$ satisfy $\sum c_j = 0$. Roughly, the goal in this paper is to suggest an analog of testing (1) that reduces to inferences about $p$ when dealing with two independent groups.
To elaborate in a more concrete manner, consider a two-by-two design where, for example, Factor A corresponds to two methods for treating depression and Factor B is gender. The situation can be depicted as follows:

              Factor B (gender)
Factor A      Level 1     Level 2
E             $\mu_1$     $\mu_2$
C             $\mu_3$     $\mu_4$

where E and C are the two methods for treating depression. A common way of dealing with main effects for the first factor is to test
$$H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4.$$
The basic idea here is to focus on
$$P(X_1 + X_2 < X_3 + X_4),$$
which generalizes the WMW approach in an obvious way. More broadly, for any $J \ge 2$, the goal is to make inferences about $p_L = P(\sum c_j X_j < 0)$ and, in particular, to test
$$H_0: p_L = 0.5. \tag{2}$$
An extension of the WMW test has already been derived for the particular case where the goal is to deal with an interaction in a 2-by-2 design (e.g., Wilcox, 2017, section 10.6.2). Let
$$p_I = P(X_1 - X_2 < X_3 - X_4). \tag{3}$$
As suggested by Patel and Hoel (1973), an analog of no interaction corresponds to the situation where the null hypothesis
$$H_0: p_I = 0.5 \tag{4}$$
is true. An analog of an ordinal interaction is a situation where both $P(X_1 < X_2)$ and $P(X_3 < X_4)$ are less than 0.5 or both are greater than 0.5. An analog of a disordinal interaction is a situation where one of these two probabilities is less than 0.5 and the other is greater than 0.5. The method for making inferences about $p_I$, described by Wilcox (2017), is based on a simple extension of results by Cliff (1996), which will be called method CPH henceforth. The computational details of method CPH are not provided because they are not directly relevant for the situation at hand. The main point here is that this method is not readily extended to testing for main effects or, more generally, to testing (2) for any $J \ge 4$.
To provide a brief sketch of the approach used here, consider again the case of two independent random variables, $X_1$ and $X_2$, and let $D = X_1 - X_2$, so that $p = P(D < 0)$. Inferences about $p$ are based on a nonparametric estimate of the distribution of $D$. This is evident based on how $p$ is typically estimated. In particular, let $X_{ij}$ ($i = 1, \ldots, n_j$; $j = 1, \ldots, J$) be a random sample from the $j$th group. Then an estimate of the distribution of $D$ can be based on the $n_1 n_2$ pairwise differences $D_{ik} = X_{i1} - X_{k2}$, which yields
$$\hat{p} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{k=1}^{n_2} I(D_{ik} < 0),$$
where the indicator function $I(D_{ik} < 0) = 1$ if $D_{ik} < 0$ and 0 otherwise. For a 2-by-2 design ($J = 4$), the approach is to estimate the distribution of $\sum c_j X_j$ in a similar manner and then consider how a confidence interval for $p_L$ might be computed. But computational issues arise for $J > 4$, because the number of combinations of observations grows as the product of the sample sizes. Here, a simple approximation of the distribution of $\sum c_j X_j$ is suggested for dealing with the case $J > 4$.
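The two-group estimate sketched above can be illustrated with a few lines of Python. This is a minimal sketch, not the author's R code: the function name `estimate_p` and the sample data are hypothetical, and tied values are simply counted as "not less than", which is an assumption the text does not address.

```python
def estimate_p(x1, x2):
    """Estimate p = P(X1 < X2) from all n1 * n2 pairwise differences.

    Assumption: a tie (difference of exactly zero) is not counted as negative.
    """
    negatives = sum(1 for a in x1 for b in x2 if a - b < 0)
    return negatives / (len(x1) * len(x2))


# Hypothetical data for two small groups.
group1 = [1.2, 3.4, 2.2]
group2 = [2.0, 4.1]
p_hat = estimate_p(group1, group2)  # 4 of the 6 differences are negative
```

The double loop makes the cost explicit: the work grows as the product of the sample sizes, which is the computational issue that motivates the approximation for more groups.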
The paper is organized as follows. Section 2 describes two methods when J = 4. The focus is on a linear contrast that reflects an interaction, but the results extend in an obvious way to linear contrasts relevant to main effects. Section 3 summarizes simulation results regarding how well these methods control the probability of a Type I error. Motivated in part by results in section 3, section 4 describes an approximate method for dealing with J > 4 groups, and section 5 reports simulation results on how well this method performs.

Description of the Methods
This section focuses on J = 4 using an estimate of the distribution of D that is an obvious generalization of the method used when J = 2. But the method becomes increasingly impractical as J increases, so for larger J an alternative method must be used that represents an approximation of the method used here when estimating the distribution of D. One way of judging the adequacy of the approximate method is to compare it to the more "complete" estimate of the distribution of D that is used here, which is done in section 5.
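Given the goal of testing (2), and focusing on the contrast that reflects an interaction, let
$$G_{im} = X_{i1} - X_{m2}, \quad H_{ac} = X_{a3} - X_{c4}, \quad D_{imac} = G_{im} - H_{ac},$$
and
$$\hat{p}_L = \frac{1}{n_1 n_2 n_3 n_4} \sum_{i=1}^{n_1} \sum_{m=1}^{n_2} \sum_{a=1}^{n_3} \sum_{c=1}^{n_4} I(D_{imac} < 0). \tag{6}$$
So $\hat{p}_L$ uses a "complete" estimate of the distribution of $D$ in the sense that it uses all $n_1 n_2 n_3 n_4$ combinations of the $X_{ij}$ values.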
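As a rough illustration of the "complete" estimate, the following Python sketch enumerates every combination of one observation per group and records the proportion of contrast values that are negative. The function name `complete_estimate` and the data are hypothetical, the contrast coefficients are passed in explicitly, and ties at zero are treated as non-negative, which is an assumption.

```python
from itertools import product


def complete_estimate(groups, coeffs):
    """Proportion of all cross-group combinations with sum(c_j * x_j) < 0.

    groups: list of J lists of observations; coeffs: J contrast coefficients.
    Assumption: a contrast value of exactly zero is not counted as negative.
    """
    total = 0
    negatives = 0
    for combo in product(*groups):  # one observation from each group
        d = sum(c * x for c, x in zip(coeffs, combo))
        total += 1
        if d < 0:
            negatives += 1
    return negatives / total


# Hypothetical 2-by-2 design; (1, -1, -1, 1) is the interaction contrast.
data = [[1.0, 2.5, 0.3], [0.8, 1.9], [2.2, 0.1, 1.4], [0.5, 3.0]]
p_hat_L = complete_estimate(data, (1, -1, -1, 1))
```

Changing the coefficients to `(1, 1, -1, -1)` gives the corresponding estimate for the main effect of the first factor without altering the function.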
Note that a similar approach can be used when dealing with main effects in a 2-by-2 design. For the first factor, for example, now
$$G_{im} = X_{i1} + X_{m2}$$
and
$$H_{ac} = X_{a3} + X_{c4}.$$
For the second factor, now
$$G_{im} = X_{i1} + X_{m3}$$
and
$$H_{ac} = X_{a2} + X_{c4}.$$
Observe that inferences about $p_L$ cannot be made by simply applying, for example, the methods derived by Cliff (1996) or Brunner and Munzel (2000) using the variables $G$ and $H$. The reason is that there is dependence among the $G_{im}$ variables ($i = 1, \ldots, n_1$; $m = 1, \ldots, n_2$), because each observation appears in many of them, and the same is true for the $H_{ac}$ variables ($a = 1, \ldots, n_3$; $c = 1, \ldots, n_4$). So the estimate of the standard error of $\hat{p}_I$ used by these methods would be incorrect. Moreover, simulations confirmed that this simple approach does indeed perform poorly. Method CPH avoids this problem, but it does not provide a basis for dealing with main effects and linear contrasts based on more than four groups. Here, two methods for dealing with this issue were considered. The first is to use a percentile bootstrap method and the second is based on a bootstrap estimate of the standard error of $\hat{p}_I$.
The percentile bootstrap method is applied as follows. Let $X^*_{ij}$ be a bootstrap sample from the $j$th group, which is obtained by randomly sampling with replacement $n_j$ values from the $j$th group. Let $\hat{p}^*$ be the estimate of $p_I$ based on this bootstrap sample. Repeat this process $B$ times, yielding $\hat{p}^*_1, \ldots, \hat{p}^*_B$, and let $\hat{p}^*_{(1)} \le \cdots \le \hat{p}^*_{(B)}$ be the $\hat{p}^*_b$ values written in ascending order. Here, $B = 500$ is used, which often seems to suffice when using a percentile bootstrap (Wilcox, 2017). However, $B$ greater than 500 might increase power (Racine & MacKinnon, 2007). Let $\ell = \alpha B/2$, rounded to the nearest integer, and let $u = B - \ell$. Then, based on general results in Liu and Singh (1997), an approximate $1 - \alpha$ confidence interval for $p_I$ is $(\hat{p}^*_{(\ell+1)}, \hat{p}^*_{(u)})$. Let $P^*$ be the proportion of $\hat{p}^*$ values less than 0.5. When testing (4), a p-value is given by $2\min(P^*, 1 - P^*)$. This is called method PB henceforth.
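The percentile bootstrap steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the author's implementation: the names `contrast_estimate` and `percentile_bootstrap` are hypothetical, the estimate inside each bootstrap replication is the exhaustive proportion of negative contrast values, and "rounded to the nearest integer" is implemented as round-half-up.

```python
import random
from itertools import product


def contrast_estimate(groups, coeffs):
    """Proportion of all cross-group combinations with a negative contrast."""
    values = [sum(c * x for c, x in zip(coeffs, combo))
              for combo in product(*groups)]
    return sum(v < 0 for v in values) / len(values)


def percentile_bootstrap(groups, coeffs, B=500, alpha=0.05, seed=0):
    """Return ((lower, upper), p_value) from a percentile bootstrap."""
    rng = random.Random(seed)
    boot = []
    for _ in range(B):
        # Resample each group with replacement, keeping its sample size.
        resampled = [rng.choices(g, k=len(g)) for g in groups]
        boot.append(contrast_estimate(resampled, coeffs))
    boot.sort()
    ell = int(alpha * B / 2 + 0.5)  # assumption: round half up
    u = B - ell
    # 1-based order statistics (ell + 1) and u map to 0-based ell and u - 1.
    interval = (boot[ell], boot[u - 1])
    prop_below = sum(v < 0.5 for v in boot) / B
    p_value = 2 * min(prop_below, 1 - prop_below)
    return interval, p_value
```

Because every replication enumerates all combinations, the run time grows with both B and the product of the sample sizes, so small inputs are advisable when experimenting with this sketch.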
A bootstrap estimate of the squared standard error of $\hat{p}_I$ is given by
$$\hat{\tau}^2 = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat{p}^*_{Ib} - \bar{p}^*_I)^2,$$
where $\bar{p}^*_I = \sum \hat{p}^*_{Ib}/B$. Now $B = 100$ is used, which seems to suffice based on results in Efron (1987) and which is further supported by studies summarized by Wilcox (2017). So a reasonable test statistic for testing (4) is
$$T = \frac{\hat{p}_I - 0.5}{\hat{\tau}}.$$
Here, the null distribution of $T$ is approximated with a Student's t distribution with degrees of freedom estimated as described by Brunner and Munzel (2000). This approach is called method BT henceforth. Simulations reported in the next section indicate that the percentile bootstrap method performs better than the method based on $T$, so for brevity further details regarding the degrees of freedom are not provided. (The estimated degrees of freedom were computed via the R function bmp described in Wilcox, 2017, section 5.7.2.)
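For concreteness, the bootstrap standard error and the resulting test statistic might be computed as in the Python sketch below. The function names are hypothetical, the divisor B - 1 and the form (estimate - 0.5) / standard error are assumptions based on the usual bootstrap conventions, and the degrees of freedom for the reference distribution are deliberately not computed here.

```python
def bootstrap_se(boot_estimates):
    """Bootstrap standard error: sample standard deviation of the
    bootstrap estimates (assumption: divisor B - 1)."""
    B = len(boot_estimates)
    mean = sum(boot_estimates) / B
    variance = sum((v - mean) ** 2 for v in boot_estimates) / (B - 1)
    return variance ** 0.5


def t_statistic(estimate, boot_estimates, null_value=0.5):
    """Standardized distance of the estimate from the null value."""
    return (estimate - null_value) / bootstrap_se(boot_estimates)
```

In practice `boot_estimates` would hold the B = 100 bootstrap estimates described above.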

Simulation Results
Simulations were used as a partial check on the small-sample properties of methods PB and BT. Simulation estimates of the actual Type I error probability, when testing at the 0.05 level, are based on 2000 replications. (This choice for the number of replications was based in part on an effort to avoid high execution time.) The sample sizes considered were $(n_1, n_2, n_3, n_4) = (10, 10, 10, 10)$, $(20, 20, 20, 20)$ and $(10, 20, 30, 40)$. Unequal sample sizes offered no new insights, so they are not reported. Data were generated from four types of distributions: normal, symmetric and heavy-tailed (roughly meaning that outliers tend to be common), asymmetric and relatively light-tailed, and asymmetric and relatively heavy-tailed. More specifically, data are generated from g-and-h distributions (Hoaglin, 1985), which arise as follows. Let $Z$ be a random variable having a standard normal distribution. Then
$$W = \begin{cases} \dfrac{\exp(gZ) - 1}{g} \exp(hZ^2/2), & \text{if } g > 0, \\ Z \exp(hZ^2/2), & \text{if } g = 0, \end{cases}$$
has a g-and-h distribution, where $g$ and $h$ are parameters that determine the first four moments. The four distributions used here are the standard normal ($g = h = 0$), a symmetric heavy-tailed distribution ($h = 0.2$, $g = 0$), an asymmetric distribution with relatively light tails ($h = 0$, $g = 0.2$), and an asymmetric distribution with heavy tails ($g = h = 0.2$). Table 1 summarizes the skewness ($\gamma_1$) and kurtosis ($\gamma_2$) of these distributions.
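Generating observations from a g-and-h distribution amounts to transforming standard normal draws. The following Python sketch shows one way to do this; the function names and the seed are hypothetical choices, not part of the paper.

```python
import math
import random


def g_and_h(z, g, h):
    """Transform a standard normal value z into a g-and-h value."""
    tail = math.exp(h * z * z / 2)  # h controls tail heaviness
    if g == 0:
        return z * tail
    return (math.exp(g * z) - 1) / g * tail  # g controls skewness


def sample_g_and_h(n, g, h, rng):
    """Draw n observations from a g-and-h distribution."""
    return [g_and_h(rng.gauss(0, 1), g, h) for _ in range(n)]


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
heavy_skewed = sample_g_and_h(10, g=0.2, h=0.2, rng=rng)
```

With g = h = 0 the transformation returns z unchanged, which recovers the standard normal case.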
The estimated Type I error probabilities are summarized in Table 2. Bradley (1978) suggests that in general, when testing at the 0.05 level, the actual level should be between 0.025 and 0.075.Based on this criterion, method BT is unsatisfactory when n = 10, while method PB satisfies this criterion for all of the situations considered.
A Welch-type method can be used to test (1), which allows heteroscedasticity (e.g., Wilcox, 2017, section 7.4.1). It is evident that it is sensitive to different features of the distributions compared to method PB. So at some level power comparisons are meaningless. However, to provide at least some perspective, consider testing both (1) and (2) using the contrast coefficients 1, 1, −1, −1 (main effects associated with the first factor) when a constant δ is added to the first group. For symmetric distributions, estimated power for the Welch and PB methods differed by about two units in the second decimal place. It is when distributions differ in skewness that the choice of method might make a difference in terms of power. Consider, for example, the situation where the first three groups have standard normal distributions and the fourth group has a lognormal distribution that has been shifted to have a median of 0.8. For n = 30 and δ = 0.5, the estimated power was 0.62 and 0.84 for the Welch and PB methods, respectively. This is not to suggest that method PB has, in general, more power. The only point is that the choice of method can make a substantial difference.

Dealing with More Than Four Groups
Now consider the case of $J \ge 4$ independent variables. The goal in this section is to suggest a method for testing (2) using an approximation of the complete estimate of the distribution of $D$. The approximate method is applied as follows. Let $m = \min\{n_1, \ldots, n_J\}$. For each $j$, randomly sample without replacement $m$ values from $X_{ij}$, yielding, say, $Y_{ij}$ ($i = 1, \ldots, m$; $j = 1, \ldots, J$). Let
$$\tilde{p}_L = \frac{1}{m} \sum_{i=1}^{m} I\Big(\sum_{j=1}^{J} c_j Y_{ij} < 0\Big).$$
Now repeat this process $N$ times, yielding $\tilde{p}_{1L}, \ldots, \tilde{p}_{NL}$. Then the final estimate of $p_L$ is taken to be
$$\bar{p}_L = \frac{1}{N} \sum_{k=1}^{N} \tilde{p}_{kL}.$$
Inferences based on $\bar{p}_L$, used in conjunction with a percentile bootstrap method, are henceforth called method APB.
To provide some perspective on the choice of $N$, consider the case where $J = 4$ and $p_L$ is estimated with $\hat{p}_L$ given by (6), and consider how closely the approximate estimate agrees with it. Suppose agreement is deemed acceptable if the two estimates agree to within three units in the second decimal place with probability 0.95 or higher. With equal sample sizes, $N = 50$ suffices. For unequal sample sizes, a crude rule is that $N = 50$ suffices provided the minimum sample size is at least 20. If the minimum sample size is 10, $N = 100$ is a better choice. Of course, one could simply use $N = 200$ or larger to be safe. The only concern is that as $N$ increases, execution time increases substantially when testing hypotheses with a percentile bootstrap method, at least based on the R functions described in the final section of this paper. (Additional results regarding the choice $N = 10$ are given in the next section.)
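The approximate estimate can be sketched in Python as follows. This is one reading of the procedure, not the R function linWMW: the name `approx_estimate` is hypothetical, the i-th subsampled values from each group are combined row by row to form one contrast value, and ties at zero are treated as non-negative.

```python
import random


def approx_estimate(groups, coeffs, N=50, seed=0):
    """Average, over N random subsamples, of the proportion of negative
    contrast values formed row by row from m values per group."""
    rng = random.Random(seed)
    m = min(len(g) for g in groups)  # smallest sample size
    estimates = []
    for _ in range(N):
        # Sample m values without replacement from each group; the order
        # returned by sample() is itself random, so rows pair up at random.
        sub = [rng.sample(g, m) for g in groups]
        negatives = sum(
            1 for i in range(m)
            if sum(c * sub[j][i] for j, c in enumerate(coeffs)) < 0
        )
        estimates.append(negatives / m)
    return sum(estimates) / N
```

Each repetition costs only m contrast evaluations regardless of the number of groups, which is what keeps the method practical when J exceeds four.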

Simulation Results
This section reports estimated Type I error probabilities when using the method described in the previous section. The number of groups was taken to be 4 or 6. Data were generated as described in section 3. For $J = 6$ groups, now the linear contrast coefficients were taken to be $c_1 = c_2 = c_3 = 1$ and $c_4 = c_5 = c_6 = -1$. Again VP1 refers to homoscedasticity. Now VP2 means that $\sigma_1 = \sigma_2 = \sigma_3 = 1$ and $\sigma_4 = \sigma_5 = \sigma_6 = 4$. The results are reported in Table 3 for n = N = 10. Based on Bradley's criterion, all indications are that method APB is satisfactory even with n = N = 10. Note that for J = 4, the results reported in Table 2, using the complete method for estimating the distribution of $D$, are very similar to those in Table 3, which were based on the incomplete estimate of the distribution of $D$ described in section 4.

Concluding Remarks
Of course, despite the simulations reported here, perhaps situations can be found where the method in section 4 breaks down. The main point is that, at least for the situations considered, the proposed method performs reasonably well. Moreover, there is no known alternative method that can deal with the case J > 4 in a reasonably accurate manner.
Finally, the R function linWMW computes pL and the R function linWMWpb computes a confidence interval using the percentile bootstrap method in section 4, both of which are available from the author upon request.

Table 1. Some properties of the g-and-h distribution