Explaining Lord's Paradox in Introductory Statistical Theory Courses

When two groups are compared in a pre-post study, the two-sample t-test and the analysis of covariance (ANCOVA) can lead to two different conclusions. This phenomenon is known as Lord's Paradox, and it occurs because the parameter of interest in the two-sample t-test and the parameter of interest in the ANCOVA model are not the same quantity. The difference between the two parameters can be explained by the covariance of linearly combined random variables, an important topic in introductory statistical theory courses. Lord's Paradox is frequently observed in practice, and it is important for students (future researchers) to understand it clearly. The objective of this article is to explain Lord's Paradox using the covariance of linearly combined random variables. The paradox is explained through three scenarios in the context of educational research: in the first scenario, the average baseline (pre-score) is greater in the treatment group than in the control group; in the second, the average baseline is lower in the treatment group than in the control group; and in the third, the average baseline is the same in the two groups by randomization. This article is written at the level of introductory statistical theory courses for undergraduate and graduate statistics students to help them understand the difference between the parameter of interest in the two-sample t-test and the parameter of interest in the ANCOVA model.


Introduction
When two groups are compared in a pre-post study, Lord's Paradox can be observed between two researchers when one researcher compares the average change using the two-sample t-test and the other researcher compares the average post-measurement using the analysis of covariance, or simply ANCOVA (Lord, 1967; Lord, 1969). The paradox has been studied in the context of health sciences, environmental sciences, and psychometrics (Holland & Rubin, 1983; Wainer & Brown, 2006; Glymour et al., 2005; Tu et al., 2008; Pearl, 2016). It is an interesting phenomenon which frequently occurs in practice, but it is not easy to quantify the exact difference between the parameter in the two-sample t-test and the parameter in the ANCOVA model without statistical theory. In this article, we explain Lord's Paradox using the covariance of linearly combined random variables, which is discussed in many statistical theory textbooks (Wackerly et al., 2008; Ross, 2012).

Motivating Example
The following example is adapted from the example given by Wright (2006). Suppose two groups of students are compared in their mathematics skills. Group 1 is the treatment group of size n_1 (receiving a new teaching method), and Group 0 is the control group of size n_0 (receiving a traditional teaching method). Assume each student took a pre-test and a post-test.
Scenario 1 (Wright, 2006)
Suppose each student selects a group by his or her own will. Suppose a student with high motivation (who tends to show high academic performance) is more likely to select Group 1, and a student with relatively low motivation is more likely to select Group 0. Wright (2006) illustrated a similar scenario with balanced group sizes n_1 = 5 and n_0 = 5 for Group 1 and Group 0, respectively. See Table 1 for the hypothetical data with minor modification from the example of Wright (2006). As can be calculated from Table 1, the average difference is (10 + 5 + 0 − 5 − 10) / 5 = 0 for both groups, but the post-score is 10 points greater on average when we condition on the pre-score, as shown in Figure 1. (Real-world data would contain random noise around the line.) Using the two-sample t-test, the data are not against the null hypothesis at all (same group average). Using the ANCOVA model, on the other hand, the data are against the null hypothesis and serve as strong evidence for the alternative hypothesis (greater average post-score in Group 1 conditioning on pre-score). This is a traditional example of Lord's Paradox (Lord, 1967; Wright, 2003; Maxwell & Delaney, 2004; Wainer & Brown, 2006). In addition to the graphical illustration, an analytic explanation of the paradox can be provided using the covariance of linearly combined random variables.

Covariance of Linearly Combined Random Variables
Several textbooks for the first semester of undergraduate statistical theory courses include the following proposition (Wackerly et al., 2008;Ross, 2012).

Proposition
Let U_1, ..., U_n and W_1, ..., W_m be random variables. Let L_1 = ∑_{i=1}^n a_i U_i and L_2 = ∑_{j=1}^m b_j W_j for fixed real numbers a_1, ..., a_n and b_1, ..., b_m. Then

E(L_1) = ∑_{i=1}^n a_i E(U_i),

V(L_1) = ∑_{i=1}^n a_i^2 V(U_i) + 2 ∑∑_{i<j} a_i a_j Cov(U_i, U_j),

Cov(L_1, L_2) = ∑_{i=1}^n ∑_{j=1}^m a_i b_j Cov(U_i, W_j).

From these results, we can explain why the two-sample t-test and the ANCOVA model can lead to different conclusions.
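The covariance identity in the proposition holds for sample covariances as well, since they are bilinear in exactly the same way. A minimal numerical check (in Python; the article's own computations use R) with arbitrary data and coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary data: n = 3 variables U_i, m = 2 variables W_j, 500 observations.
U = rng.normal(size=(3, 500))
W = rng.normal(size=(2, 500)) + 0.5 * U[:2]   # induce some correlation
a = np.array([1.0, -2.0, 0.5])                # fixed coefficients a_1, ..., a_n
b = np.array([3.0, 1.5])                      # fixed coefficients b_1, ..., b_m

L1 = a @ U        # L_1 = sum_i a_i U_i (one value per observation)
L2 = b @ W        # L_2 = sum_j b_j W_j

# Left-hand side: sample covariance of the two linear combinations.
lhs = np.cov(L1, L2)[0, 1]

# Right-hand side: sum_i sum_j a_i b_j Cov(U_i, W_j).
C = np.cov(np.vstack([U, W]))   # joint 5 x 5 sample covariance matrix
cov_UW = C[:3, 3:]              # the 3 x 2 block of Cov(U_i, W_j)
rhs = a @ cov_UW @ b
```

The two sides agree up to floating-point error for any data, which mirrors the population-level proposition.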

Two-sample t-test
Let Z_i denote the pre-score and Y_i the post-score of the i-th subject in a sample. Let X_i denote the group indicator for the i-th subject, where X_i = 0 for Group 0 (control) and X_i = 1 for Group 1 (treatment). The two-sample t-test can be formulated as the simple linear model

D_i = β_0 + β_1 X_i + ϵ_i, (1)

where D_i = Y_i − Z_i is the change in test score (hence a positive value of D_i is a desirable outcome), and ϵ_i ∼ N(0, σ^2) is a random variable which is independent of X_i. In Equation (1), the parameter of interest is the difference in the two group averages,

β_1 = E(D_i | X_i = 1) − E(D_i | X_i = 0).

The null hypothesis is H_0: β_1 = 0, and the one-sided alternative hypothesis is H_1: β_1 > 0. Since D_i = Y_i − Z_i, we can write

β_1 = [E(Y_i | X_i = 1) − E(Y_i | X_i = 0)] − [E(Z_i | X_i = 1) − E(Z_i | X_i = 0)] (2)

by the proposition in Section 3.1.
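The two-sample t-test on the change scores can be sketched in a few lines. The data below are simulated under assumed parameter values (γ_2 = 0.5 and a 10-point treatment benefit are illustrative choices, not values from the article); the estimate β̂_1 is simply the difference in average change, and the pooled t statistic tests H_0: β_1 = 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pre/post scores; Group 1 gets a +10 benefit on the post-score.
n0 = n1 = 50
z0 = rng.normal(50, 10, n0); y0 = 30 + 0.5 * z0 + rng.normal(0, 5, n0)
z1 = rng.normal(50, 10, n1); y1 = 40 + 0.5 * z1 + rng.normal(0, 5, n1)

d0, d1 = y0 - z0, y1 - z1          # change scores D_i = Y_i - Z_i
beta1_hat = d1.mean() - d0.mean()  # estimate of beta_1

# Pooled two-sample t statistic for H0: beta_1 = 0 vs H1: beta_1 > 0.
sp2 = ((n0 - 1) * d0.var(ddof=1) + (n1 - 1) * d1.var(ddof=1)) / (n0 + n1 - 2)
se = np.sqrt(sp2 * (1 / n0 + 1 / n1))
t_stat = beta1_hat / se
```

Because the two groups share the same pre-score distribution here (κ_1 = 0), the t-test targets the same quantity as ANCOVA and detects the effect.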

ANCOVA
Preserving the notation of Section 3.2, the ANCOVA model assumes

Y_i = γ_0 + γ_1 X_i + γ_2 Z_i + δ_i, (3)

where δ_i ∼ N(0, τ^2) is a random variable which is independent of X_i and Z_i. Under the ANCOVA model, the parameter of interest is γ_1, the difference in the expected post-score when we compare a randomly selected subject in Group 1 to a randomly selected subject in Group 0 with the same pre-score. The null hypothesis is H_0: γ_1 = 0, and the one-sided alternative hypothesis is H_1: γ_1 > 0. An alternative expression of the ANCOVA model is

Y_i − Z_i = γ_0 + γ_1 X_i + (γ_2 − 1) Z_i + δ_i

by subtracting Z_i from both sides of Equation (3). Using the proposition in Section 3.1,

E(Y_i − Z_i | X_i) = γ_0 + γ_1 X_i + (γ_2 − 1) E(Z_i | X_i),

so the parameter of interest can be written as

β_1 = γ_1 + (γ_2 − 1) [E(Z_i | X_i = 1) − E(Z_i | X_i = 0)]

from Equation (2). Using the same argument as in the two-sample t-test, we can write

κ_1 = E(Z_i | X_i = 1) − E(Z_i | X_i = 0),

which is interpreted as the difference in the average pre-score when we compare Group 1 to Group 0.
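The ANCOVA model of Equation (3) is an ordinary least squares fit of Y on the group indicator and the pre-score. A self-contained sketch (simulated data with assumed true values γ_0 = 30, γ_1 = 10, γ_2 = 0.5; the fit itself uses only a least-squares solver, no specialized ANCOVA routine):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data generated from the ANCOVA model
# Y_i = gamma_0 + gamma_1 X_i + gamma_2 Z_i + delta_i.
n = 200
x = rng.integers(0, 2, n).astype(float)   # group indicator X_i
z = rng.normal(50, 10, n)                 # pre-score Z_i
y = 30 + 10 * x + 0.5 * z + rng.normal(0, 5, n)

# Ordinary least squares fit of the ANCOVA model.
design = np.column_stack([np.ones(n), x, z])
gamma0_hat, gamma1_hat, gamma2_hat = np.linalg.lstsq(design, y, rcond=None)[0]
```

The fitted γ̂_1 estimates the vertical distance between the two parallel group lines, and γ̂_2 the common slope on the pre-score.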

Summary
In general, the two-sample t-test and the ANCOVA model have different parameters of interest, and they are related as

β_1 = γ_1 + (γ_2 − 1) κ_1. (4)

They are the same quantity (i.e., β_1 = γ_1) if κ_1 = 0 or γ_2 = 1. The first condition, κ_1 = 0, can be satisfied by randomization (i.e., conducting an experimental study instead of an observational study), but the second condition, γ_2 = 1, is out of the researcher's control. In most pre-post studies, pre- and post-scores are positively correlated in both groups, so γ_2 > 0. In addition, we often have 0 < γ_2 < 1 because of regression toward the mean (Stigler, 1997; Barnett et al., 2005).
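Equation (4) relates the population parameters, but the same identity also holds exactly for the sample estimates: with a binary group indicator, β̂_1 = γ̂_1 + (γ̂_2 − 1) κ̂_1 is an instance of the omitted-variable-bias formula and is satisfied by OLS to floating-point precision, regardless of how the data arose. A sketch with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Any pre/post data will do; the identity below holds for the OLS estimates
# no matter how the data were generated.
n = 60
x = np.repeat([0.0, 1.0], n // 2)
z = rng.normal(50, 10, n) + 5 * x          # groups differ at baseline
y = 20 + 8 * x + 0.7 * z + rng.normal(0, 4, n)

# beta_1_hat: difference in average change D = Y - Z (two-sample t-test).
d = y - z
beta1_hat = d[x == 1].mean() - d[x == 0].mean()

# gamma_1_hat, gamma_2_hat: OLS fit of the ANCOVA model.
design = np.column_stack([np.ones(n), x, z])
g0, g1, g2 = np.linalg.lstsq(design, y, rcond=None)[0]

# kappa_1_hat: difference in average pre-score.
kappa1_hat = z[x == 1].mean() - z[x == 0].mean()

# Equation (4) holds exactly in sample, not just in expectation.
gap = beta1_hat - (g1 + (g2 - 1) * kappa1_hat)
```

Here `gap` is zero up to numerical round-off, which makes Equation (4) a useful classroom check on any fitted pre-post analysis.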

Hypothetical Scenarios
In this section, using the relationship between β_1 and γ_1 in Equation (4), three scenarios are discussed in the context of educational research. The first scenario is when the average baseline (pre-score) is greater in the treatment group than in the control group (Section 2.1), the second scenario is when the average baseline is lower in the treatment group than in the control group, and the third scenario is when the average baseline is the same in the two groups by randomization. The control group is referred to as Group 0, and the treatment group is referred to as Group 1.

Revisiting Scenario 1
In Scenario 1 (from Section 2.1), ordinary least squares estimation (OLSE) yields γ̂_1 = 10 and γ̂_2 = 0.5. Due to self-selection by students, the pre-score is greater in Group 1 by 20 points on average when compared to Group 0 (i.e., κ̂_1 = 20), so

β̂_1 = γ̂_1 + (γ̂_2 − 1) κ̂_1 = 10 + (0.5 − 1)(20) = 0

for the two-sample t-test. This is an example of Lord's Paradox in which the ANCOVA model can reject the null hypothesis whereas the two-sample t-test cannot, even though the new teaching method seems significantly more effective than the traditional teaching method when we compare two randomly selected students, one from each group, with the same baseline score.
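Scenario 1 can be reproduced with hypothetical scores constructed to match the description above (Table 1 itself is in the article; the values below are illustrative, not necessarily Wright's exact data): each group's change scores average 0, Group 1 starts 20 points higher, and the two group lines are parallel with a 10-point gap.

```python
import numpy as np

# Hypothetical scores consistent with the Scenario 1 description.
pre0  = np.array([20., 30., 40., 50., 60.])
post0 = 0.5 * pre0 + 20       # Group 0 line: post = 20 + 0.5 * pre
pre1  = pre0 + 20             # Group 1 pre-scores are 20 points higher
post1 = 0.5 * pre1 + 30       # Group 1 line sits 10 points above Group 0's

# Average change is 0 in both groups, so the t-test sees no difference ...
d0, d1 = post0 - pre0, post1 - pre1

# ... while ANCOVA recovers gamma_1 = 10, and Equation (4) gives beta_1 = 0.
kappa1 = pre1.mean() - pre0.mean()        # +20
gamma1, gamma2 = 10.0, 0.5                # exact for these noiseless lines
beta1 = gamma1 + (gamma2 - 1) * kappa1    # 10 + (-0.5)(20) = 0
```

The change scores in each group are (10, 5, 0, −5, −10), matching the average difference of 0 computed in Section 2.1.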

Scenario 2 (Lower Average Baseline Score in the Treatment Group)
In the second scenario, assume the instructor allocates each student to Group 0 (control) or Group 1 (treatment) believing that the new teaching method would particularly benefit students with low academic performance. See Table 2 for hypothetical data, and see Figure 2 for the scatter plot of pre-score and post-score by group. Note that the pre-score is lower in Group 1 by 20 points on average when compared to Group 0 (i.e., κ̂_1 = −20). From the data, the OLSE provides γ̂_1 = 0 and γ̂_2 = 0.5. In this scenario, the ANCOVA model cannot reject the null hypothesis because γ̂_1 = 0. From Equation (4), for the two-sample t-test, we estimate β̂_1 = 0 + (0.5 − 1)(−20) = +10, which can lead to the rejection of H_0: β_1 = 0 in favor of H_1: β_1 > 0 (i.e., greater benefit from the new teaching method). This is another example of Lord's Paradox, in which the two-sample t-test can reject the null hypothesis even though the new teaching method seems ineffective conditioning on the pre-score.
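Scenario 2 can likewise be reproduced with hypothetical scores matching the description (not necessarily the exact values in Table 2): both groups fall on the same line, so there is no conditional treatment effect (γ_1 = 0), yet the average change differs by 10 because Group 1 starts 20 points lower.

```python
import numpy as np

# Hypothetical scores consistent with the Scenario 2 description.
pre0  = np.array([40., 50., 60., 70., 80.])
post0 = 0.5 * pre0 + 30
pre1  = pre0 - 20             # Group 1 starts 20 points lower
post1 = 0.5 * pre1 + 30       # same line: no effect conditioning on pre-score

d0, d1 = post0 - pre0, post1 - pre1
kappa1 = pre1.mean() - pre0.mean()      # -20
beta1_hat = d1.mean() - d0.mean()       # +10: the t-test sees an "effect"

# Equation (4) with gamma_1 = 0 and gamma_2 = 0.5 gives the same +10.
beta1_eq4 = 0.0 + (0.5 - 1) * kappa1
```

The t-test's apparent 10-point benefit is produced entirely by the baseline imbalance and regression toward the mean, not by the treatment.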

Scenario 3 (Same Average Baseline Score between the Two Groups)
Suppose students are randomized (or matched on the average pre-score between the two groups) so that κ_1 = 0. In this case, Equation (4) leads to β_1 = γ_1. As shown in Table 3 and Figure 3, we have κ̂_1 = 0, so β̂_1 = γ̂_1 = 10, but the statistical evidence for the alternative hypothesis is stronger in the ANCOVA model than in the two-sample t-test because the standard error is lower in the ANCOVA model. While the ANCOVA model leads to a nearly zero p-value, the two-sample t-test results in a p-value close to 0.05 (for the right-tailed H_1: β_1 > 0). In practice, when students are randomized, the ANCOVA model should have higher statistical power than the two-sample t-test. This is because, while the OLSE is unbiased for both β_1 and γ_1, the variance of Y_i − γ_2 Z_i is lower than the variance of Y_i − Z_i conditioning on X_i, as discussed in Appendix 1.
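The variance argument behind the power difference can be checked by simulation. Under assumed values γ_2 = 0.5, τ = 5, and Var(Z) = 100 (illustrative choices, not from the article), Var(Y − γ_2 Z | X) = τ² = 25 while Var(Y − Z | X) = τ² + (1 − γ_2)² Var(Z) = 50:

```python
import numpy as np

rng = np.random.default_rng(7)

# Within one group (no treatment term needed for the variance comparison),
# compare the spread of Y - gamma_2 * Z with the spread of Y - Z.
n = 5000
gamma2 = 0.5
z = rng.normal(50, 10, n)                    # pre-score, Var(Z) = 100
y = 30 + gamma2 * z + rng.normal(0, 5, n)    # tau = 5

var_ancova_resid = np.var(y - gamma2 * z, ddof=1)  # approx tau^2 = 25
var_change       = np.var(y - z, ddof=1)           # approx 25 + 0.25*100 = 50
```

Since the ANCOVA residual Y − γ_2 Z is less variable than the change score Y − Z whenever 0 < γ_2 < 1, the ANCOVA estimator has the smaller standard error under randomization, consistent with the heuristic in Appendix 1.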

Examples
In this section, we provide two practical examples.The example in Section 5.1 is to compare the effect of two programs on self-esteem score, and the example in Section 5.2 is to compare the effect of two teaching methods on test score.

Effect of Exercise on Self-Esteem
This example uses the WeightLoss data available in R through the car package (R Core Team, 2016; Fox & Weisberg, 2011). The data can be seen using the code below.

> library(car)
> WeightLoss
The data set has three groups, but we focus on two of them. Twelve subjects (n_0 = 12) were treated by a diet program for three months, and this group is referred to as Group 0. Ten subjects (n_1 = 10) were treated by an exercise program in addition to the diet program for three months, and this group is referred to as Group 1. From the data presented in Table 4, the estimated average self-esteem score at Month 1 is 14.8333 for Group 0 and 15.2 for Group 1, so κ̂_1 = 0.3667.
To formulate the hypothesis test in terms of the expected change in self-esteem (comparing Month 3 to Month 1), the two-sample t-test can be used with H_0: β_1 = 0 versus H_1: β_1 > 0, assuming diet and exercise would be more beneficial than diet only, at significance level α = 0.05. Using the two-sample t-test, we lack evidence to reject H_0: β_1 = 0, with observed statistics β̂_1 = 1.0667, se = 0.6568, T = 1.624, and p-value = 0.060.
To formulate the hypothesis test in terms of the expected self-esteem score at Month 3 given the score at Month 1, the ANCOVA model can be used with H_0: γ_1 = 0 versus H_1: γ_1 > 0 at α = 0.05. Using the ANCOVA model, we have a statistically significant result to conclude H_1: γ_1 > 0, with observed statistics γ̂_1 = 1.1764, se = 0.6253, T = 1.881, and p-value = 0.038.
In the left panel of Figure 4, the vertical distance between the two parallel lines is γ̂_1 = 1.1764. In the right panel, the vertical distance between the two horizontal lines is β̂_1 = 1.0667. Note that γ̂_2 = 0.7006 in the ANCOVA model, and the estimated parameter in the two-sample t-test is slightly attenuated toward the null value, as expected from Equation (4).

Comparing Two Teaching Methods
In a mathematics course, two teaching methods were compared for students' learning of set theory, and the learning was quantified by test scores. The first teaching method was based on a traditional lecture (Group 0), and the second was based on active learning (Group 1). Each of twenty students was randomized into Group 0 or Group 1 by the researchers (n_0 = n_1 = 10), and each student took a pre-test and a post-test on conceptual thinking.
The left panel of Figure 5 shows the pre-score on the x-axis and the post-score on the y-axis by group. Random numbers generated from N(0, η^2) with η = 0.1 were added to the original data points for illustration purposes, because it was difficult to show all twenty data points without the random noise. Under the ANCOVA model, we estimated γ̂_1 = 1.0283 (with standard error se = 0.3422) and γ̂_2 = 0.2052. For the hypothesis test H_0: γ_1 = 0 versus H_1: γ_1 > 0 at significance level α = 0.05, we could reject H_0 in favor of H_1 with T = 1.0283 / 0.3422 = 3.00 and p-value 0.004.
The right panel of Figure 5 shows the difference in scores (post-score minus pre-score) by group, and the horizontal lines indicate the estimated average difference for each group. Despite the significant result from ANCOVA, the two boxplots look very similar except for one data point in Group 1. Even though the students were randomized, the difference in estimated average pre-score was κ̂_1 = 4.5209 − 3.8075 = 0.7134 (comparing Group 1 to Group 0). From Equation (4), we can estimate β̂_1 = γ̂_1 + (γ̂_2 − 1) κ̂_1 = 1.0283 − (0.7948)(0.7134) = 0.4613. For the two-sample t-test, the estimated parameter β̂_1 = 0.4613 was attenuated toward the null value β_1 = 0, the estimated standard error was se = 0.5948, and the resulting test statistic was T = 0.4613 / 0.5948 = 0.776 with p-value 0.224. Therefore, we could not reject H_0 in the two-sample t-test at α = 0.05.
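The attenuation above follows directly from Equation (4) applied to the reported estimates; a quick arithmetic check:

```python
# Reported estimates from the teaching-methods example.
gamma1_hat, gamma2_hat, kappa1_hat = 1.0283, 0.2052, 0.7134

# Equation (4): beta_1 = gamma_1 + (gamma_2 - 1) * kappa_1.
beta1_hat = gamma1_hat + (gamma2_hat - 1) * kappa1_hat   # approx 0.4613
```

Even with randomization, a chance baseline imbalance of 0.7134 combined with the small slope γ̂_2 = 0.2052 cuts the t-test estimate by more than half.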

Discussion
Lord's Paradox has been known for a long time, and it has been explained graphically in the literature, but it has received less attention analytically. Using the covariance of linearly combined random variables, we can show that the parameter β_1 in the two-sample t-test and the parameter γ_1 in the ANCOVA model differ by the magnitude of (γ_2 − 1) κ_1, where κ_1 is the difference in the average baseline score, comparing Group 1 (treatment) to Group 0 (control). In practice, it is difficult to have (γ_2 − 1) κ_1 = 0 in observational studies. This article can be summarized by the three educational research scenarios presented in Section 4.
• When students with high baseline scores belong to the treatment group, which means κ_1 > 0, we have β_1 < γ_1. In an extreme case, we may have the opposite signs γ_1 > 0 and β_1 < 0.
• When students with low baseline scores belong to the treatment group, which means κ_1 < 0, we have β_1 > γ_1. When the treatment has no effect at all (i.e., H_0: γ_1 = 0 is true), there is a good chance of rejecting H_0: β_1 = 0 in favor of H_1: β_1 > 0 under the two-sample t-test with a large sample size.
• When students are randomized so that the average baseline score is the same in the two groups, which means κ_1 = 0, we have β_1 = γ_1. In most practical situations, where pre- and post-scores are positively correlated in both groups, the statistical power to conclude H_1: γ_1 > 0 in the ANCOVA model is greater than the statistical power to conclude H_1: β_1 > 0 in the two-sample t-test, as heuristically explained in Appendix 1.
The proposition in Section 3.1 is mentioned in most introductory statistical theory courses, and students can gain a deeper understanding of the two-sample t-test and the ANCOVA model through these examples.
In observational studies, we sometimes consider the propensity score, the conditional probability of assignment to a particular group (i.e., control or treatment) as a function of other variables, say (W_1, ..., W_k) (Rosenbaum & Rubin, 1983). An association between (W_1, ..., W_k) and X_i does not necessarily imply an association between (W_1, ..., W_k) and Y_i. In general, the difference between β_1 in the two-sample t-test and γ_1 in the multiple linear regression Y_i = γ_0 + γ_1 X_i + γ_2 Z_i + ∑_{j=1}^k α_j W_{j,i} + δ_i can be quantified as β_1 − γ_1 = (γ_2 − 1) κ_1 + ∑_{j=1}^k α_j ν_j, where ν_j ≡ E(W_{j,i} | X_i = 1) − E(W_{j,i} | X_i = 0). See Appendix 2 for details. If W_{j,i} is not associated with Y_i given all other covariates (i.e., α_j = 0), it does not contribute to the difference between β_1 and γ_1. The same argument holds for the use of a scalar propensity score, say S_i. The role of the propensity score depends on the linear relationship between S_i and Y_i and on E(S_i | X_i = 1) − E(S_i | X_i = 0). Without any association between S_i and Y_i, the propensity score does not play any role in the difference between β_1 and γ_1.
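The extended identity with an additional covariate can also be verified numerically: with a binary group indicator, the OLS estimates satisfy β̂_1 = γ̂_1 + (γ̂_2 − 1) κ̂_1 + α̂_1 ν̂_1 exactly, by the same omitted-variable argument. A sketch with one covariate W (the data-generating values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)

# Simulated data with one extra covariate W associated with both X and Y.
n = 100
x = np.repeat([0.0, 1.0], n // 2)
z = rng.normal(50, 10, n) + 3 * x          # baseline imbalance
w = rng.normal(0, 1, n) + 0.5 * x          # covariate imbalance
y = 10 + 5 * x + 0.6 * z + 2 * w + rng.normal(0, 4, n)

# Two-sample t-test estimate on the change scores.
d = y - z
beta1_hat = d[x == 1].mean() - d[x == 0].mean()

# OLS fit of the multiple linear regression with Z and W.
design = np.column_stack([np.ones(n), x, z, w])
g0, g1, g2, a1 = np.linalg.lstsq(design, y, rcond=None)[0]

# Group differences in average Z and average W.
kappa1_hat = z[x == 1].mean() - z[x == 0].mean()
nu1_hat = w[x == 1].mean() - w[x == 0].mean()

# The extended identity holds exactly for the sample estimates.
gap = beta1_hat - (g1 + (g2 - 1) * kappa1_hat + a1 * nu1_hat)
```

Setting the coefficient on W to zero in the data-generating step (i.e., α_1 = 0) removes W's contribution, matching the remark above that a covariate unrelated to Y does not drive the difference between β_1 and γ_1.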

Figure 1. Hypothetical data of a pre-post study (Scenario 1)

Figure 2. Hypothetical data of a pre-post study (Scenario 2)

Figure 3. Hypothetical data of a pre-post study (Scenario 3)

Table 1. Hypothetical data of a pre-post study (Scenario 1)

Table 2. Hypothetical data of a pre-post study (Scenario 2)

Table 4. Self-esteem data for comparing diet group (Group 0) and diet + exercise group (Group 1)