A Paradox in Bland-Altman Analysis and a Bernoulli Approach

A reliable method of measurement is important in various scientific areas. When a new method of measurement is developed, it should be tested against a standard method that is currently in use. Bland and Altman proposed limits of agreement (LOA) to compare two methods of measurement under the normality assumption. Recently, a sample size formula has been proposed for hypothesis testing to compare two methods of measurement. In the hypothesis testing, the null hypothesis states that the two methods do not satisfy a pre-specified acceptable degree of agreement. Carefully considering the interpretation of the LOA, we argue that there are cases of an acceptable degree of agreement inside the null parameter space. We refer to this subset as the paradoxical parameter space in this article. To address this paradox, we apply a Bernoulli approach to modify the null parameter space and to relax the normality assumption on the data. Using simulations, we demonstrate that the change in statistical power is not negligible when the true parameter values are inside or near the paradoxical parameter space. In addition, we demonstrate an application of the sequential probability ratio test to allow researchers to draw a conclusion with a smaller sample size and to reduce the study time.


Introduction
In medical, biological, health, and sport sciences, a new method of measurement is preferred if it is at least as reliable as the current standard method (gold standard). If a new method is attractive under practical considerations (e.g., cost, convenience), a small degree of disagreement between the two methods of measurement may be acceptable. In the past, regression and correlation analyses were popular ways to assess the degree of agreement, but their drawbacks are now well known (Altman & Bland, 1983; Hopkins, 2000). Since the introduction of the Bland and Altman analysis (Bland & Altman, 1986), the limits of agreement (LOA) have been widely used in practice because of their simple calculation, visualization, and interpretation.
In the Bland and Altman analysis, the difference between two measurements is modeled by a normal distribution with two parameters, mean µ and standard deviation σ, and the parameters of interest are A = µ − zσ and B = µ + zσ, where z is the critical value calculated from the standard normal distribution (e.g., z = 1.96 for a probability of 0.95). It is recommended to prespecify acceptable limits based on clinical necessity, biological considerations, or other practical goals (Giavarina, 2015). For predefined acceptable limits, denoted by (−δ, δ), Lu et al. (2016) formulated hypothesis testing based on confidence intervals (CIs) for A and B. They proposed an iterative numerical approach to calculate the sample size for the hypothesis test given a significance level α, statistical power 1 − β, null value δ, and true parameter values δ/σ and µ/σ. They demonstrated the accuracy of the proposed sample size calculation via simulation studies, and it is currently implemented in statistical software (MEDCALC, 2019).
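As a small illustration of the quantities A and B defined above, the following sketch estimates them from paired measurements by plugging the sample mean and sample standard deviation of the differences into µ − zσ and µ + zσ. The function name and the paired readings are hypothetical and serve only as an example.

```python
from statistics import NormalDist, mean, stdev


def limits_of_agreement(x, y, prob=0.95):
    """Plug-in estimates of (A, B) = (mu - z*sigma, mu + z*sigma).

    x, y : paired measurements from the two methods (same subjects, same order)
    prob : central probability covered by the limits (0.95 gives z of about 1.96)
    """
    diffs = [a - b for a, b in zip(x, y)]
    z = NormalDist().inv_cdf(0.5 + prob / 2)
    m = mean(diffs)   # estimate of mu
    s = stdev(diffs)  # estimate of sigma (sample standard deviation)
    return m - z * s, m + z * s


# Hypothetical paired readings (new method, standard method).
new_method = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3]
standard = [10.0, 10.0, 10.2, 10.1, 9.7, 10.2]
A, B = limits_of_agreement(new_method, standard)
print(f"estimated LOA: ({A:.3f}, {B:.3f})")
```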
In the formulation of the hypothesis test (Lu et al., 2016), the null hypothesis states that the degree of agreement between two methods of measurement is not acceptable (i.e., the interval (A, B) is not contained in (−δ, δ)), and the alternative hypothesis states that it is acceptable (i.e., (A, B) is contained in (−δ, δ)). In this article, we raise a question about the interpretations of (A, B) and (−δ, δ). In particular, based on the probabilistic interpretation of (A, B), we argue that the degree of agreement can be acceptable even when the null hypothesis is true. Throughout this paper, the paradoxical parameter space is defined as the set of values of (µ, σ) such that the degree of agreement is acceptable even though the null hypothesis is true. To resolve this paradox, we consider a simple alternative hypothesis test, referred to as the Bernoulli approach, which does not require the normality assumption or any other distributional assumption. One caveat of the Bernoulli approach is that a minimum sample size is required to perform a hypothesis test at a given significance level α. To overcome this caveat, the sequential probability ratio test (Wald, 1945; Wald, 1947) is applied in the context of the Bland and Altman analysis.

Bernoulli Approach to Address the Paradox
One simple approach to address the paradox under the normality assumption is a Bernoulli approach (which also addresses potential violation of the normality assumption). Let $D_i$ denote the $i$-th observed difference between the two methods of measurement. For a pre-specified value of $\delta > 0$, let $Y_i = 1$ if $-\delta \le D_i \le \delta$ and $Y_i = 0$ otherwise. By defining $\pi = P(-\delta \le D_i \le \delta)$, a hypothesis test can be formulated as $H_0: \pi \le 1 - \gamma$ and $H_1: \pi > 1 - \gamma$. If $H_0$ is true, the two methods of measurement agree with a probability of at most $1 - \gamma$. If $H_1$ is true, the two methods agree with a probability greater than $1 - \gamma$.
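Let $\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$ be the sample proportion of agreement between the two methods of measurement for a given threshold $\delta > 0$. According to large-sample theory,

$$Z = \frac{\bar{Y} - (1 - \gamma)}{\sqrt{(1 - \gamma)\gamma / n}} \tag{1}$$

is approximately standard normal when $\pi = 1 - \gamma$, and $H_0$ is rejected when $Z \ge z_{1-\alpha}$ for a fixed significance level $\alpha$, in which case the two methods of measurement are inferred to agree.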
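The large-sample test described above can be sketched in a few lines. The code assumes the usual one-sample proportion statistic with the null variance $(1-\gamma)\gamma/n$, which is consistent with the maximum of the $Z$ statistic discussed in the section on the minimum sample size. The function name and default values are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def bernoulli_agreement_test(diffs, delta, gamma=0.05, alpha=0.05):
    """One-sided large-sample test of H0: pi <= 1 - gamma vs H1: pi > 1 - gamma.

    diffs : observed differences D_i between the two methods
    delta : acceptable bound, Y_i = 1 when -delta <= D_i <= delta
    Returns (y_bar, z, reject_h0).
    """
    y = [1 if -delta <= d <= delta else 0 for d in diffs]
    n = len(y)
    y_bar = sum(y) / n
    p0 = 1 - gamma                            # null value of pi
    z = (y_bar - p0) / sqrt(p0 * gamma / n)   # null-variance Z statistic
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return y_bar, z, z >= z_crit


# Hypothetical data: every difference lies inside (-delta, delta).
y_bar, z, reject = bernoulli_agreement_test([0.0] * 100, delta=0.1)
print(y_bar, round(z, 3), reject)
```

Because the statistic depends on the data only through the indicators $Y_i$, no distributional assumption on $D_i$ enters the calculation.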
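Then the required sample size $n$ is determined by equation (2) for given $\alpha$, $1 - \beta$, $1 - \gamma$, and $1 - \gamma^*$, where $1 - \gamma^* > 1 - \gamma$ is the value of $\pi$ assumed under $H_1$. See Appendix B for a detailed explanation.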
Under the normality assumption, Lu et al. (2016) calculated the sample size needed for $\alpha = 0.05$; $1 - \beta = 0.8, 0.9$; $\delta/\sigma = 2, 2.1, \ldots, 3.0$; $\mu/\sigma = 0, 0.1, \ldots, 0.9$. Given $\delta > 0$, $\mu > 0$, and $\sigma > 0$, we can find $\pi = P(-\delta \le D_i \le \delta) = \Phi\left(\frac{\delta - \mu}{\sigma}\right) - \Phi\left(\frac{-\delta - \mu}{\sigma}\right)$, where $\Phi$ is the standard normal distribution function, take it as $1 - \gamma^*$, and calculate the required sample size $n$ using equation (2), as shown in Table 1. In the table, the superscript * indicates a case when $H_1$ is true in the Bernoulli approach and $H_{01}$ and $H_{02}$ are true in the normal approach (i.e., a case in the paradoxical parameter space defined in Section 2.2). Tables 2 and 3 present the relative sample size in percent for comparing the Bernoulli approach to the normal approach. Table 2 is for the power of $1 - \beta = 0.8$, and Table 3 is for $1 - \beta = 0.9$.
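As a rough companion to the sample size calculation described above, the sketch below first computes the agreement probability $P(-\delta \le D_i \le \delta)$ under normality and then a sample size from the textbook normal-approximation formula for a one-sided, one-sample proportion test. That formula is an assumption made here for illustration and is not claimed to coincide with equation (2) or Appendix B. The function names and the null value 0.95 are likewise illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def agreement_probability(mu, sigma, delta):
    """P(-delta <= D <= delta) when D ~ Normal(mu, sigma^2)."""
    upper = STD_NORMAL.cdf((delta - mu) / sigma)
    lower = STD_NORMAL.cdf((-delta - mu) / sigma)
    return upper - lower


def bernoulli_sample_size(pi0, pi1, alpha=0.05, power=0.8):
    """Textbook sample size for H0: pi = pi0 vs H1: pi = pi1 > pi0 (one-sided).

    Assumed formula (not taken from the source):
    n = [(z_{1-alpha} sqrt(pi0 (1-pi0)) + z_{power} sqrt(pi1 (1-pi1))) / (pi1 - pi0)]^2
    """
    if not 0 < pi0 < pi1 < 1:
        raise ValueError("need 0 < pi0 < pi1 < 1")
    z_a = STD_NORMAL.inv_cdf(1 - alpha)
    z_b = STD_NORMAL.inv_cdf(power)
    num = z_a * sqrt(pi0 * (1 - pi0)) + z_b * sqrt(pi1 * (1 - pi1))
    return ceil((num / (pi1 - pi0)) ** 2)


# Example on the grid used above: delta/sigma = 3 and mu/sigma = 0.
pi_true = agreement_probability(mu=0.0, sigma=1.0, delta=3.0)
n_required = bernoulli_sample_size(pi0=0.95, pi1=pi_true, alpha=0.05, power=0.8)
print(round(pi_true, 4), n_required)
```

Only the ratios $\delta/\sigma$ and $\mu/\sigma$ matter for the agreement probability, which is why the example fixes $\sigma = 1$.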
As δ/σ increases for fixed µ/σ, the Bernoulli approach requires a larger sample size relative to the normal approach. As µ/σ increases for fixed δ/σ (i.e., µ deviates from zero), the Bernoulli approach requires a smaller sample size than the normal approach. For large µ/σ, even under the normality assumption, there are extreme cases where the sample size required by the Bernoulli approach is only 2-3% of that required by the normal approach. On the other hand, within the scope of Tables 2 and 3, the Bernoulli approach can require about 200% of the sample size required by the normal approach. In summary, the sample size requirement is very sensitive to how researchers formulate the hypothesis test, particularly when the true parameter values are inside or near the paradoxical space.

Minimum Sample Size for Bernoulli Approach
Regardless of the sample size in the Bernoulli approach, $\bar{Y} = 1$ is the strongest evidence against $H_0$ in favor of $H_1$, so the maximum value of the $Z$ statistic in equation (1) is

$$\frac{1 - (1 - \gamma)}{\sqrt{(1 - \gamma)\gamma / n}} = \sqrt{\frac{n\gamma}{1 - \gamma}}.$$

Since we need $Z \ge z_{1-\alpha}$ to reject $H_0$ in favor of $H_1$, the minimum sample size requirement in the Bernoulli approach is the smallest integer such that

$$n \ge \frac{(1 - \gamma)(z_{1-\alpha})^2}{\gamma},$$

which is derived from the strongest evidence for $H_1: \pi > 1 - \gamma$ (i.e., $\bar{Y} = 1$). For example, when $1 - \gamma = 0.95$, this asymptotic approximation requires $n$ to be at least 52 for $\alpha = 0.05$ ($z_{0.95} \approx 1.645$) and at least 73 for $\alpha = 0.025$ ($z_{0.975} \approx 1.96$).
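The bound $n \ge (1 - \gamma)(z_{1-\alpha})^2 / \gamma$ can be evaluated directly. The following sketch uses the one-sided quantile $z_{1-\alpha}$ exactly as written in the inequality. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(gamma, alpha):
    """Smallest integer n with n >= (1 - gamma) * z_{1-alpha}^2 / gamma.

    Below this n, even Y-bar = 1 cannot reject H0: pi <= 1 - gamma
    with the large-sample Z test at level alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided standard normal quantile
    return ceil((1 - gamma) * z ** 2 / gamma)


for a in (0.05, 0.025):
    print(a, min_sample_size(gamma=0.05, alpha=a))
```

Evaluating the bound at more than one quantile shows how strongly the minimum depends on the significance level used.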
If a researcher observes $Y_i = 1$ for $i = 1, 2, \ldots$ in a row or $Y_i = 0$ for $i = 1, 2, \ldots$ in a row, it may be tempting to terminate the study before reaching the fixed sample size $n$ determined by $\alpha$ and $1 - \beta$. Armitage et al. (1969) demonstrated the inflation of the Type I error rate when a researcher repeatedly performs hypothesis testing during data collection. In the following section, we discuss a statistical method for drawing a valid conclusion in the middle of a study.

Sequential Probability Ratio Test for the Bernoulli Approach
In many practical situations, observations are made sequentially. Due to logistics (e.g., recruiting human subjects and scheduling), the time between two observations $D_i$ and $D_{i+1}$ can be long, and observing one data point can be expensive in terms of both cost and labor. Furthermore, in the middle of a study, a researcher can be quite certain whether the degree of agreement is acceptable or not based on the accumulated data. In such a case, the sequential probability ratio test (SPRT) can be considered to validly terminate the study before making all $n$ observations (Wald, 1945; Wald, 1947). Particularly for the Bernoulli approach, a simple formula-based rule can be applied to terminate the study during data collection.
Suppose a researcher fixes $\delta > 0$, the maximum acceptable value of $|D_i|$. Let $\pi = P(|D_i| \le \delta)$ be the parameter of interest. Let $H_0: \pi = 1 - \gamma$ be a simple null hypothesis and $H_1: \pi = 1 - \gamma^*$ be a simple alternative hypothesis, where $1 - \gamma^* > 1 - \gamma$ is chosen by the researcher. Let $\alpha$ be the significance level and $1 - \beta$ be the statistical power desired by the researcher. Set $Y_i = 1$ if $|D_i| \le \delta$, and $Y_i = 0$ otherwise. Let $S_m = \sum_{i=1}^{m} Y_i$ be the total number of observations on which the two methods of measurement agree after the $m$-th observation. Then the likelihood ratio (for comparing $H_1$ to $H_0$) is given by

$$W_m = \left(\frac{1 - \gamma^*}{1 - \gamma}\right)^{S_m} \left(\frac{\gamma^*}{\gamma}\right)^{m - S_m}.$$

After each observation, the researcher makes one of three decisions based on $W_m$, $\gamma$, $\gamma^*$, $\alpha$, and $\beta$: stop and conclude $H_1$, stop and conclude $H_0$, or continue to the next observation.
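The sequential rule can be sketched as follows. The stopping thresholds $(1-\beta)/\alpha$ and $\beta/(1-\alpha)$ are Wald's classical approximate boundaries and are an assumption of this sketch, since the exact decision list is not reproduced here. The function name and the example values $\gamma = 0.05$, $\gamma^* = 0.01$ are illustrative. The likelihood ratio is accumulated on the log scale for numerical stability.

```python
from math import log


def sprt_bernoulli(ys, gamma, gamma_star, alpha=0.05, beta=0.2):
    """Wald SPRT for H0: pi = 1 - gamma vs H1: pi = 1 - gamma_star.

    ys : iterable of agreement indicators Y_i (1 = |D_i| <= delta, 0 otherwise)
    Requires gamma_star < gamma, i.e. 1 - gamma_star > 1 - gamma.
    Returns (decision, m), where m is the number of observations used.
    """
    if not 0 < gamma_star < gamma < 1:
        raise ValueError("need 0 < gamma_star < gamma < 1")
    upper = log((1 - beta) / alpha)   # assumed Wald boundary: conclude H1
    lower = log(beta / (1 - alpha))   # assumed Wald boundary: conclude H0
    step_agree = log((1 - gamma_star) / (1 - gamma))  # positive increment
    step_disagree = log(gamma_star / gamma)           # negative increment
    log_w = 0.0  # log of the likelihood ratio W_m
    m = 0
    for m, y in enumerate(ys, start=1):
        log_w += step_agree if y == 1 else step_disagree
        if log_w >= upper:
            return "conclude H1", m
        if log_w <= lower:
            return "conclude H0", m
    return "continue", m


# Hypothetical stream in which the two methods always agree.
print(sprt_bernoulli([1] * 100, gamma=0.05, gamma_star=0.01))
```

With these illustrative settings, each disagreement moves the log likelihood ratio far more than each agreement, which reflects how informative a single disagreement is when $\pi$ is close to one.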

Simulations
In Section 2, we discussed the paradoxical parameter space in the normal approach, and we considered an alternative formulation of hypothesis testing based on the Bernoulli approach. We then discussed the application of the SPRT to the Bland and Altman analysis, and saw that it allows a researcher to terminate the Bernoulli approach early if the accumulated data strongly favor $H_0$ or $H_1$ over the other hypothesis. In Section 3.1, via Monte Carlo simulations, we compare the normal approach and the Bernoulli approach when the true values of $(\mu, \sigma)$ are inside or near the paradoxical parameter space. Note that our objective is not to argue that one approach is better than the other. The objective is to demonstrate the non-negligible difference when the true parameter values are in a neighborhood of the paradoxical space. In Section 3.2, we demonstrate the operating characteristics of the SPRT in the context of the Bland and Altman analysis, where the tested value of $\pi$ is near the boundary of the parameter space (i.e., $\pi$ close to one).
In most cases considered in this simulation study, the Bernoulli approach is more powerful than the normal approach for a given $n$. Table 4 provides the simulation results in a neighborhood of the paradoxical parameter space. Figure 2 graphically demonstrates this tendency for $\mu = 0.05$, $0.01 \le \sigma \le 0.04$, and $n = 100$ (left panel) and $n = 1000$ (right panel). The difference in $1 - \beta$ between the normal approach and the Bernoulli approach is more pronounced when $n$ is larger. Note that $\sigma = 0.0255$ is a case of the alternative hypothesis with $B = \mu + 1.96\sigma = 0.09998$ for the normal approach and $P(|D_i| \le \delta) = 0.975$ for the Bernoulli approach. The probability of concluding the alternative hypothesis is about 0.05 under the normal approach (which is supposed to happen; nominal Type I error rate), whereas it is already close to one under the Bernoulli approach when $n = 1000$ (see the right panel of Figure 2). The simulation results demonstrate that the statistical power can be very different depending on whether researchers formulate the hypothesis test based on the normal approach or the Bernoulli approach.
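The boundary case quoted above can be checked by direct arithmetic. The sketch below takes $\mu = 0.05$ and $\delta = 0.1$ as assumptions (these values reproduce the stated $B = 0.09998$ and $P(|D_i| \le \delta) = 0.975$ at $\sigma = 0.0255$) and also flags whether a given $(\mu, \sigma)$ falls in the paradoxical parameter space, here taken to mean that $(A, B)$ is not inside $(-\delta, \delta)$ while the agreement probability still exceeds $1 - \gamma$. The function names and the threshold $1 - \gamma = 0.95$ are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_limits(mu, sigma, z=1.96):
    """Population limits of agreement (A, B) = (mu - z*sigma, mu + z*sigma)."""
    return mu - z * sigma, mu + z * sigma


def agreement_probability(mu, sigma, delta):
    """P(|D| <= delta) when D ~ Normal(mu, sigma^2)."""
    return (STD_NORMAL.cdf((delta - mu) / sigma)
            - STD_NORMAL.cdf((-delta - mu) / sigma))


def in_paradoxical_space(mu, sigma, delta, gamma=0.05, z=1.96):
    """True when the normal-approach null holds but P(|D| <= delta) > 1 - gamma."""
    a, b = normal_limits(mu, sigma, z)
    normal_null_true = not (-delta < a and b < delta)
    return normal_null_true and agreement_probability(mu, sigma, delta) > 1 - gamma


MU, DELTA = 0.05, 0.1  # assumed values, see the paragraph above
for sigma in (0.0255, 0.0260):
    _, b = normal_limits(MU, sigma)
    p = agreement_probability(MU, sigma, DELTA)
    print(sigma, round(b, 5), round(p, 4), in_paradoxical_space(MU, sigma, DELTA))
```

A slightly larger $\sigma$ pushes $B$ past $\delta$, so the normal approach places the point in its null, while the agreement probability remains well above 0.95.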
The conclusion of disagreement (i.e., unacceptable degree of agreement between J1 and S1) is valid without the normality assumption.

Summary
In this article, we discussed the paradoxical parameter space in the normal approach, and we considered the Bernoulli approach (which does not require the normality assumption) based on the probabilistic interpretation of the Bland and Altman analysis, with $\pi = P(|D_i| \le \delta)$ for a given $\delta$. The authors emphasize that it is not reasonable to argue that one approach is better than the other because the partition of the parameter space into the null and alternative hypotheses is not the same. However, researchers should consider carefully whether they want to formulate their hypothesis test based on the normal approach or the Bernoulli approach, because the statistical power can be very different even when both approaches satisfy $P(|D_i| \le \delta) = 0.95$ or higher.
Given $\delta > 0$, when two methods agree because both $|\mu|$ and $\sigma$ are small, the Bernoulli approach may require a larger sample size than the normal approach for fixed $\alpha$ and $1 - \beta$. If $(\mu, \sigma)$ belongs to the paradoxical parameter space, the Bernoulli approach can require a substantially smaller sample size. Furthermore, the Bernoulli approach does not require any distributional assumption on $D_i$, and the interpretation of $H_0: \pi \le 1 - \gamma$ and $H_1: \pi > 1 - \gamma$ may be more straightforward for researchers and practitioners than the interpretation of $H_{01}$, $H_{02}$, $H_{11}$, and $H_{12}$ of the normal approach.
In most studies comparing two methods of measurement, a sample $(D_1, \ldots, D_n)$ is observed in a sequential manner. In such cases, the application of the SPRT can reduce both the sample size and the study time. The application of sequential analysis is not new to medical and health sciences. In some practical cases, it may not be feasible to calculate $W_m$ for each $m = 1, 2, \ldots$, so group sequential analyses may be suitable alternative methods (Pocock, 1977; O'Brien & Fleming, 1979; Koepcke, 1989; Jennison & Turnbull, 2000). In other practical cases, multiple measurements are taken per subject to compare the reliability and validity of two methods of measurement, and the SPRT can be applied in such cases (Kim & Wand, 2019).
There are some shortcomings of the SPRT in practice. Some funded research may require providing data on a pre-specified number of subjects, and a research team may be hired for a specific period of time, both of which could be affected by an SPRT that terminates early and by its random sample size. Additionally, if a study is terminated too early, the small amount of information obtained might be sufficient to draw a conclusion, but the parameter estimates would suffer from a lack of precision.