An Exact Method for Power Calculation for a Three-arm Clinical Endpoint Bioequivalence Study

A clinical endpoint bioequivalence (BE) study aims to establish BE between a generic drug (TEST) and an innovator drug (REF). A placebo (PLB) is usually included to demonstrate the sensitivity of the study. BE is established if TEST is shown to be superior to PLB, REF superior to PLB, and TEST equivalent to REF. Therefore, an overall BE test for a clinical endpoint BE study is composed of two superiority tests (TEST vs. PLB and REF vs. PLB) and one equivalence test (TEST vs. REF).


Introduction
Most generic drugs are systemic drugs, that is, drugs intended to be absorbed into the bloodstream. For systemically absorbed drugs, pharmacokinetic (PK) studies are usually used to establish BE between a generic drug (TEST) and an innovator drug (REF) by assessing whether TEST and REF, containing the same active moiety (chemically equivalent) in the same dosage form (pharmaceutically equivalent), reach the systemic circulation at an equivalent relative rate and extent in healthy subjects (Davit et al, 2009). For locally acting generic drugs, PK studies are generally not adequate for establishing bioequivalence, and clinical endpoint BE studies are usually requested instead. Clinical endpoint BE studies aim to establish bioequivalence by comparing clinical effects between TEST and REF based on pre-defined clinical endpoints (Yu & Li, 2014). A placebo or vehicle arm (PLB) is usually included in these studies to demonstrate that the study is sufficiently sensitive to identify the clinical effect in the enrolled study population. Therefore, the typical design of a clinical endpoint BE study for topical products is a double-blind, randomized, three-arm parallel clinical trial. In order to establish bioequivalence between TEST and REF, TEST has to be shown to be superior to PLB, REF superior to PLB, and TEST equivalent to REF. Therefore, an overall BE test for a clinical endpoint BE study is composed of two superiority tests (TEST vs. PLB and REF vs. PLB) and one equivalence test (TEST vs. REF).
For generic drugs, the two one-sided tests (TOST) procedure (Schuirmann, 1987) is the standard test for average equivalence. Previously, Phillips (1990) calculated power for the TOST in a two-phase two-period PK cross-over study using a bivariate non-central t distribution. The equivalence test in a TOST is based on the difference of means (DOM) for PK parameters after log-transformation, which are assumed to have a log-normal distribution. Chow et al (2003) determined power and sample size for the TOST in a PK cross-over study using the normal Z approximation. Hauschke et al (1999) derived exact methods for calculating power and sample size for the ratio of means (ROM) equivalence test using a bivariate non-central t distribution, assuming the original (untransformed) outcome is normally distributed; they covered both the parallel design and the two-period cross-over design. Chang et al (2014) calculated the sample size and power for an overall BE test based on one superiority test (TEST vs. PLB) and an equivalence test (TEST vs. REF) when the metric of treatment effect is DOM, ROM, or ratio of mean difference. Chang et al used the joint distribution of sample means and sample variances to calculate the expectation of a normal distribution conditional upon a chi-square distribution because "it is not easy to derive the sample size based on the multivariate t-distribution". We call this the Z-ChiSquare method. Shen et al (2015) used a similar method to calculate power for a two-period two-treatment cross-over study.
In this paper, we propose an exact method to calculate the power and sample size for an overall BE test based on two superiority tests (TEST vs. PLB, REF vs. PLB) and one equivalence test (TEST vs. REF) using a multivariate non-central t distribution directly, which we call the Exact-t method. We also extend Chang et al's and Shen et al's Z-ChiSquare method to an overall BE test with two superiority tests and one equivalence test. We focus on two commonly used metrics of treatment effect: difference of means and ratio of means. Simulation is used to compare the power calculated by the proposed Exact-t method and by the Z-ChiSquare method against the empirical power, and to compare the computational efficiency of the two methods.

Method

Exact-t Method
We assume that the continuous outcomes of the three independent treatment groups TEST, REF, and PLB, denoted $X_T$, $X_R$, and $X_P$, are normally distributed with means $\mu_T$, $\mu_R$, and $\mu_P$ and variances $\sigma_T^2$, $\sigma_R^2$, and $\sigma_P^2$. The sample sizes for the three treatment arms are $n_T$, $n_R$, and $n_P$, respectively.

Difference of Means as Metric of Treatment Effect
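In a three-arm clinical BE study, an overall BE test is composed of two superiority tests (TEST vs. PLB, REF vs. PLB), each at the $\alpha = 0.025$ level, and one equivalence test (TEST vs. REF) at $\alpha = 0.05$. The hypotheses of the superiority tests are given by

$$H_{01}: \mu_R - \mu_P \le 0 \quad \text{vs.} \quad H_{11}: \mu_R - \mu_P > 0$$

$$H_{02}: \mu_T - \mu_P \le 0 \quad \text{vs.} \quad H_{12}: \mu_T - \mu_P > 0$$

Here we assume that a larger value indicates a more desirable effect. The hypothesis of the equivalence test is given by

$$H_0: \mu_T - \mu_R \le \theta_1 \ \text{or} \ \mu_T - \mu_R \ge \theta_2 \quad \text{vs.} \quad H_1: \theta_1 < \mu_T - \mu_R < \theta_2$$

where $\theta_1$ and $\theta_2$ are pre-defined equivalence margins. Since the equivalence test is operationally the same as two one-sided tests (TOST) (Schuirmann, 1987), it can be expressed as follows:

$$H_{03}: \mu_T - \mu_R \le \theta_1 \quad \text{vs.} \quad H_{13}: \mu_T - \mu_R > \theta_1$$

$$H_{04}: \mu_T - \mu_R \ge \theta_2 \quad \text{vs.} \quad H_{14}: \mu_T - \mu_R < \theta_2$$

The test statistic for $H_{01}$ is

$$T_1 = \frac{\bar{X}_R - \bar{X}_P}{\mathrm{SE}(\bar{X}_R - \bar{X}_P)} = \frac{\bar{X}_R - \bar{X}_P}{\sqrt{\frac{S_R^2}{n_R} + \frac{S_P^2}{n_P}}}$$

Therefore $H_{01}$ is rejected when $T_1 \ge t_{0.025,\nu_1}$. For unequal variances, the degrees of freedom $\nu_1$ can be approximated by the Satterthwaite approach (Satterthwaite, 1946):

$$\nu_1 = \frac{\left(\frac{S_R^2}{n_R} + \frac{S_P^2}{n_P}\right)^2}{\frac{(S_R^2/n_R)^2}{n_R - 1} + \frac{(S_P^2/n_P)^2}{n_P - 1}}$$

Similarly, the test statistic for $H_{02}$ is

$$T_2 = \frac{\bar{X}_T - \bar{X}_P}{\sqrt{\frac{S_T^2}{n_T} + \frac{S_P^2}{n_P}}}$$

and $H_{02}$ is rejected when $T_2 \ge t_{0.025,\nu_2}$, where $\nu_2$ can be estimated by

$$\nu_2 = \frac{\left(\frac{S_T^2}{n_T} + \frac{S_P^2}{n_P}\right)^2}{\frac{(S_T^2/n_T)^2}{n_T - 1} + \frac{(S_P^2/n_P)^2}{n_P - 1}}$$

The test statistic for $H_{03}$ is

$$T_3 = \frac{\bar{X}_T - \bar{X}_R - \theta_1}{\sqrt{\frac{S_T^2}{n_T} + \frac{S_R^2}{n_R}}}$$

and $H_{03}$ is rejected when $T_3 \ge t_{0.05,\nu_3}$, where $\nu_3$ can be estimated by

$$\nu_3 = \frac{\left(\frac{S_T^2}{n_T} + \frac{S_R^2}{n_R}\right)^2}{\frac{(S_T^2/n_T)^2}{n_T - 1} + \frac{(S_R^2/n_R)^2}{n_R - 1}}$$

The test statistic for $H_{04}$ is

$$T_4 = \frac{\bar{X}_T - \bar{X}_R - \theta_2}{\sqrt{\frac{S_T^2}{n_T} + \frac{S_R^2}{n_R}}}$$

and $H_{04}$ is rejected when $T_4 \le -t_{0.05,\nu_4}$, where $\nu_4$ can be estimated by

$$\nu_4 = \frac{\left(\frac{S_T^2}{n_T} + \frac{S_R^2}{n_R}\right)^2}{\frac{(S_T^2/n_T)^2}{n_T - 1} + \frac{(S_R^2/n_R)^2}{n_R - 1}}$$

Because the overall test is an Intersection-Union Test (IUT), the family-wise type I error is controlled at 0.05 (Hauck & Anderson, 1984). The power function for the overall BE test can be calculated as

$$1 - \beta = P\left(T_1 \ge t_{0.025,\nu_1},\ T_2 \ge t_{0.025,\nu_2},\ T_3 \ge t_{0.05,\nu_3},\ T_4 \le -t_{0.05,\nu_4} \mid H_{11}, H_{12}, H_{13}, H_{14}\right)$$

For simplicity, we assume equal variance among the three treatment groups, $\sigma_T^2 = \sigma_R^2 = \sigma_P^2 = \sigma^2$, and the pooled variance is defined as

$$S^2 = \frac{(n_T - 1) S_T^2 + (n_R - 1) S_R^2 + (n_P - 1) S_P^2}{n_T + n_R + n_P - 3}$$

We then have equal degrees of freedom $\nu = \nu_1 = \nu_2 = \nu_3 = \nu_4 = n_T + n_R + n_P - 3$ and the test statistics

$$T_1 = \frac{\bar{X}_R - \bar{X}_P}{S\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}}, \quad T_2 = \frac{\bar{X}_T - \bar{X}_P}{S\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}}, \quad T_3 = \frac{\bar{X}_T - \bar{X}_R - \theta_1}{S\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}, \quad T_4 = \frac{\bar{X}_T - \bar{X}_R - \theta_2}{S\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

Under the alternative hypotheses, $T_1$, $T_2$, $T_3$, and $T_4$ follow a multivariate non-central t distribution with non-centrality parameters

$$\delta_1 = \frac{\mu_R - \mu_P}{\sigma\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}}, \quad \delta_2 = \frac{\mu_T - \mu_P}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}}, \quad \delta_3 = \frac{\mu_T - \mu_R - \theta_1}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}, \quad \delta_4 = \frac{\mu_T - \mu_R - \theta_2}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}$$

With the density function $f$ of a multivariate non-central t distribution, we can calculate the power for the overall BE test as

$$1 - \beta = \int_{t_{0.025,\nu}}^{\infty} \int_{t_{0.025,\nu}}^{\infty} \int_{t_{0.05,\nu}}^{\infty} \int_{-\infty}^{-t_{0.05,\nu}} f(\mathbf{x}; \boldsymbol{\mu}, \nu) \, dx_4 \, dx_3 \, dx_2 \, dx_1 \qquad (1)$$

where $\mathbf{x}$ is the vector $(T_1, T_2, T_3, T_4)$ and $\boldsymbol{\mu}$ is the mean of $\mathbf{x}$. We use the R package "mvtnorm" and its function "pmvt" to calculate the power for the Exact-t method.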
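As a rough illustration of the inputs that a multivariate non-central t routine needs under the equal-variance assumption, the sketch below computes the common degrees of freedom, the four non-centrality parameters, and the correlation matrix of the four statistics. The function name, argument names, and the ordering of the statistics (REF vs. PLB first, following $H_{01}$) are assumptions of this sketch, and the margins in the usage note are hypothetical values rather than values taken from the paper.

```python
import math


def dom_noncentrality(mu_t, mu_r, mu_p, sigma, n_t, n_r, n_p, theta1, theta2):
    """Return (df, delta, corr) for the four difference-of-means statistics.

    Assumed ordering (hypothetical, for illustration only):
      T1: REF vs PLB superiority      T2: TEST vs PLB superiority
      T3: TEST - REF against theta1   T4: TEST - REF against theta2
    """
    df = n_t + n_r + n_p - 3  # pooled-variance degrees of freedom

    # Standard errors of the three mean differences under a common sigma.
    se_rp = sigma * math.sqrt(1 / n_r + 1 / n_p)
    se_tp = sigma * math.sqrt(1 / n_t + 1 / n_p)
    se_tr = sigma * math.sqrt(1 / n_t + 1 / n_r)

    delta = [
        (mu_r - mu_p) / se_rp,
        (mu_t - mu_p) / se_tp,
        (mu_t - mu_r - theta1) / se_tr,
        (mu_t - mu_r - theta2) / se_tr,
    ]

    # Correlations between the numerators; they arise from the shared arm.
    r12 = (sigma**2 / n_p) / (se_rp * se_tp)   # shared PLB mean
    r13 = -(sigma**2 / n_r) / (se_rp * se_tr)  # REF enters with opposite sign
    r23 = (sigma**2 / n_t) / (se_tp * se_tr)   # shared TEST mean
    corr = [
        [1.0, r12, r13, r13],
        [r12, 1.0, r23, r23],
        [r13, r23, 1.0, 1.0],  # T3 and T4 share the same numerator up to a shift
        [r13, r23, 1.0, 1.0],
    ]
    return df, delta, corr
```

For example, `dom_noncentrality(0.5, 0.5, 0.3, 0.2, 24, 24, 24, -0.1, 0.1)` describes 24 subjects per arm with hypothetical margins of ±0.1. Note that the resulting correlation matrix is singular because $T_3$ and $T_4$ are perfectly correlated; whether a particular numerical routine accepts such a matrix is not confirmed here.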

Ratio of Means as Metric of Treatment Effect
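Ratio of means is more often used to test equivalence in a BE study. As before, the overall BE test contains two superiority tests, each at a significance level of 0.025, and an equivalence test at $\alpha = 0.05$. The superiority tests are the same as for the difference of means. The equivalence test is instead given by

$$H_{03}: \frac{\mu_T}{\mu_R} \le \theta_1 \quad \text{vs.} \quad H_{13}: \frac{\mu_T}{\mu_R} > \theta_1$$

$$H_{04}: \frac{\mu_T}{\mu_R} \ge \theta_2 \quad \text{vs.} \quad H_{14}: \frac{\mu_T}{\mu_R} < \theta_2$$

For $H_{03}$, the test statistic is

$$T_3 = \frac{\bar{X}_T - \theta_1 \bar{X}_R}{\sqrt{\frac{S_T^2}{n_T} + \theta_1^2 \frac{S_R^2}{n_R}}}$$

and $H_{03}$ is rejected when $T_3 \ge t_{0.05,\nu_3}$, where $\nu_3$ is estimated by the Satterthwaite approach (Satterthwaite, 1946),

$$\nu_3 = \frac{\left(\frac{S_T^2}{n_T} + \theta_1^2 \frac{S_R^2}{n_R}\right)^2}{\frac{(S_T^2/n_T)^2}{n_T - 1} + \frac{(\theta_1^2 S_R^2/n_R)^2}{n_R - 1}}$$

For $H_{04}$, the test statistic is

$$T_4 = \frac{\bar{X}_T - \theta_2 \bar{X}_R}{\sqrt{\frac{S_T^2}{n_T} + \theta_2^2 \frac{S_R^2}{n_R}}}$$

and $H_{04}$ is rejected when $T_4 \le -t_{0.05,\nu_4}$, where $\nu_4$ is estimated by

$$\nu_4 = \frac{\left(\frac{S_T^2}{n_T} + \theta_2^2 \frac{S_R^2}{n_R}\right)^2}{\frac{(S_T^2/n_T)^2}{n_T - 1} + \frac{(\theta_2^2 S_R^2/n_R)^2}{n_R - 1}}$$

Assuming equal variances across the three groups, with $S^2$ the pooled sample variance, we have equal degrees of freedom $\nu = n_T + n_R + n_P - 3$ and the test statistics

$$T_1 = \frac{\bar{X}_R - \bar{X}_P}{S\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}}, \quad T_2 = \frac{\bar{X}_T - \bar{X}_P}{S\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}}, \quad T_3 = \frac{\bar{X}_T - \theta_1 \bar{X}_R}{S\sqrt{\frac{1}{n_T} + \frac{\theta_1^2}{n_R}}}, \quad T_4 = \frac{\bar{X}_T - \theta_2 \bar{X}_R}{S\sqrt{\frac{1}{n_T} + \frac{\theta_2^2}{n_R}}}$$

Under the alternative hypotheses, $T_1$, $T_2$, $T_3$, and $T_4$ follow a multivariate non-central t distribution with non-centrality parameters

$$\delta_1 = \frac{\mu_R - \mu_P}{\sigma\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}}, \quad \delta_2 = \frac{\mu_T - \mu_P}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}}, \quad \delta_3 = \frac{\mu_T - \theta_1 \mu_R}{\sigma\sqrt{\frac{1}{n_T} + \frac{\theta_1^2}{n_R}}}, \quad \delta_4 = \frac{\mu_T - \theta_2 \mu_R}{\sigma\sqrt{\frac{1}{n_T} + \frac{\theta_2^2}{n_R}}}$$

The power function for the overall BE test is the same as before in (1) and can be computed with the same R package as for the difference of means.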
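The step from a ratio hypothesis to a t statistic is worth spelling out. Assuming $\mu_R > 0$ (an assumption of this note, not stated explicitly above), the ratio inequality is equivalent to a linear contrast in the two means, and the variance of the corresponding sample contrast follows from the independence of the two arms:

$$\frac{\mu_T}{\mu_R} > \theta_1 \iff \mu_T - \theta_1 \mu_R > 0, \qquad \operatorname{Var}\left(\bar{X}_T - \theta_1 \bar{X}_R\right) = \frac{\sigma_T^2}{n_T} + \theta_1^2 \frac{\sigma_R^2}{n_R}$$

The same argument with $\theta_2$ and the reversed inequality gives the contrast for the upper margin, which is why the ratio test can reuse the machinery of the difference of means test with a margin-dependent standard error.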

Extension of Z-ChiSquare Method
Chang et al (2014) and Shen et al (2015) both used the Z-ChiSquare method to calculate the power of a BE test.

Difference of Means as Metric of Treatment Effect
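Assuming homogeneity of variance across treatment groups, the power for an overall BE test including two superiority tests and one equivalence test can be written as

$$1 - \beta = P\left(T_1 \ge t_{0.025,\nu},\ T_2 \ge t_{0.025,\nu},\ T_3 \ge t_{0.05,\nu},\ T_4 \le -t_{0.05,\nu}\right)$$

Using Chang et al (2014)'s and Shen et al (2015)'s approach, let

$$Z_1 = \frac{\bar{X}_R - \bar{X}_P - (\mu_R - \mu_P)}{\sigma\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}}, \quad Z_2 = \frac{\bar{X}_T - \bar{X}_P - (\mu_T - \mu_P)}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}}, \quad Z_3 = \frac{\bar{X}_T - \bar{X}_R - (\mu_T - \mu_R)}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}, \quad U^2 = \frac{\nu S^2}{\sigma^2}$$

where $S^2$ is the pooled sample variance, so that each $Z_i$ is standard normal and $U^2$ follows a chi-square distribution with $\nu$ degrees of freedom, independent of the $Z_i$. The power can then be written as

$$1 - \beta = P\left(Z_1 \ge \frac{t_{0.025,\nu} U}{\sqrt{\nu}} - \frac{\mu_R - \mu_P}{\sigma\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}},\ \ Z_2 \ge \frac{t_{0.025,\nu} U}{\sqrt{\nu}} - \frac{\mu_T - \mu_P}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}},\ \ \frac{t_{0.05,\nu} U}{\sqrt{\nu}} - \frac{\mu_T - \mu_R - \theta_1}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}} \le Z_3 \le -\frac{t_{0.05,\nu} U}{\sqrt{\nu}} - \frac{\mu_T - \mu_R - \theta_2}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}\right)$$

Note that $U^2$ needs to be smaller than or equal to

$$\frac{\nu (\theta_2 - \theta_1)^2}{4\, t_{0.05,\nu}^2\, \sigma^2 \left(\frac{1}{n_T} + \frac{1}{n_R}\right)}$$

for the probability to be valid, that is, for the upper bound of $Z_3$ to be no smaller than its lower bound. Therefore, the power function can be written as the expectation of a multivariate normal probability conditional upon a chi-square distribution, with the expectation taken over the admissible range of $U^2$:

$$1 - \beta = E_{U^2}\left[P\left(Z_1 \ge \cdot,\ Z_2 \ge \cdot,\ \cdot \le Z_3 \le \cdot \mid U^2\right)\right]$$

where the bounds are those given above. Under the assumption of equal sample sizes $n_T = n_R = n_P = n$, we have $\nu = 3(n - 1)$ and every standard error above reduces to $\sigma\sqrt{2/n}$. We use R and the package "mvtnorm" to calculate the power. The numerical integral is calculated in the following steps:

1. Use the function "rchisq" to generate a sample of $U^2$ based on the truncated chi-square distribution.
2. For each $U^2$, calculate the inside probability with the multivariate normal cumulative distribution function "pmvnorm".
3. Take the mean of the probabilities to calculate the expectation with respect to $U^2$ as the Z-ChiSquare power.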
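The steps above can be sketched in plain Python for the equal-sample-size case. This is not the authors' R implementation: instead of evaluating the inner probability with "pmvnorm", the sketch draws the standardized arm means directly and averages an indicator, which estimates the same expectation by plain Monte Carlo. The truncation is handled implicitly, since the indicator is zero whenever the interval for $Z_3$ is empty. The t critical values come from a Cornish-Fisher approximation, and all function names, the ordering of the statistics, and the margins in the usage note are assumptions of this sketch.

```python
import math
import random
from statistics import NormalDist


def t_quantile(p, df):
    """Approximate Student t quantile (Cornish-Fisher expansion, two terms)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2


def z_chisq_power_dom(mu_t, mu_r, mu_p, sigma, n, theta1, theta2,
                      draws=20000, seed=1):
    """Monte Carlo estimate of overall BE power (difference of means, equal n)."""
    rng = random.Random(seed)
    df = 3 * n - 3
    t_sup = t_quantile(0.975, df)   # superiority tests, alpha = 0.025
    t_eq = t_quantile(0.95, df)     # each one-sided equivalence test, alpha = 0.05
    se = sigma * math.sqrt(2 / n)
    d1 = (mu_r - mu_p) / se
    d2 = (mu_t - mu_p) / se
    d3 = (mu_t - mu_r - theta1) / se
    d4 = (mu_t - mu_r - theta2) / se
    root2 = math.sqrt(2)
    hits = 0
    for _ in range(draws):
        # U^2 ~ chi-square(df), drawn as Gamma(shape=df/2, scale=2); s = U/sqrt(df).
        s = math.sqrt(rng.gammavariate(df / 2, 2) / df)
        # Standardized arm means; the Z's are contrasts of these.
        w_t, w_r, w_p = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = (w_r - w_p) / root2
        z2 = (w_t - w_p) / root2
        z3 = (w_t - w_r) / root2
        if (z1 >= t_sup * s - d1 and z2 >= t_sup * s - d2
                and t_eq * s - d3 <= z3 <= -t_eq * s - d4):
            hits += 1
    return hits / draws
```

A call such as `z_chisq_power_dom(0.5, 0.5, 0.3, 0.2, 48, -0.1, 0.1)` would estimate the power for 48 subjects per arm under hypothetical margins of ±0.1. Because only four random numbers are drawn per replicate, the cost does not grow with the sample size, unlike simulation of the raw data.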

Ratio of Means as Metric of Treatment Effect
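Similar to the difference of means, the power function for the overall BE test based on the ratio of means can be written as

$$1 - \beta = P\left(T_1 \ge t_{0.025,\nu_1},\ T_2 \ge t_{0.025,\nu_2},\ T_3 \ge t_{0.05,\nu_3},\ T_4 \le -t_{0.05,\nu_4}\right)$$

Likewise, under equal variances with common degrees of freedom $\nu$, this can be written in the form of $Z_1$, $Z_2$, $Z_3$, $Z_4$, where $Z_1$ and $Z_2$ are the standardized superiority contrasts as for the difference of means and

$$Z_3 = \frac{\bar{X}_T - \theta_1 \bar{X}_R - (\mu_T - \theta_1 \mu_R)}{\sigma\sqrt{\frac{1}{n_T} + \frac{\theta_1^2}{n_R}}}, \quad Z_4 = \frac{\bar{X}_T - \theta_2 \bar{X}_R - (\mu_T - \theta_2 \mu_R)}{\sigma\sqrt{\frac{1}{n_T} + \frac{\theta_2^2}{n_R}}}$$

so that

$$1 - \beta = E_{U^2}\left[P\left(Z_1 \ge \frac{t_{0.025,\nu} U}{\sqrt{\nu}} - \frac{\mu_R - \mu_P}{\sigma\sqrt{\frac{1}{n_R} + \frac{1}{n_P}}},\ \ Z_2 \ge \frac{t_{0.025,\nu} U}{\sqrt{\nu}} - \frac{\mu_T - \mu_P}{\sigma\sqrt{\frac{1}{n_T} + \frac{1}{n_P}}},\ \ Z_3 \ge \frac{t_{0.05,\nu} U}{\sqrt{\nu}} - \frac{\mu_T - \theta_1 \mu_R}{\sigma\sqrt{\frac{1}{n_T} + \frac{\theta_1^2}{n_R}}},\ \ Z_4 \le -\frac{t_{0.05,\nu} U}{\sqrt{\nu}} - \frac{\mu_T - \theta_2 \mu_R}{\sigma\sqrt{\frac{1}{n_T} + \frac{\theta_2^2}{n_R}}} \ \Big|\ U^2\right)\right]$$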

Simulation
To examine the performance of the Exact-t and Z-ChiSquare methods, we compare the power calculated by each method against the empirical power obtained by simulation under different scenarios.

Data Generation
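We assume a normal distribution and equal variance across the groups ($\sigma_T^2 = \sigma_R^2 = \sigma_P^2 = \sigma^2$).
1. Generate the response for each treatment group as follows:

$$X_{Ti} \sim N(\mu_T, \sigma^2),\ i = 1, \ldots, n_T; \quad X_{Ri} \sim N(\mu_R, \sigma^2),\ i = 1, \ldots, n_R; \quad X_{Pi} \sim N(\mu_P, \sigma^2),\ i = 1, \ldots, n_P$$

where $n_T$, $n_R$, and $n_P$ are the sample sizes for each group. In this simulation, we assume $0.35 < \mu_T < 0.65$, $\mu_R = 0.5$, $\mu_P = 0.3$, $\sigma \in (0.1, 0.2)$. We assume an equal sample size across treatment groups: n = 12, 24, 36, 48, 60, 100, 200, 300 per arm.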
3. Perform tests of superiority, equivalence, and overall BE test.
4. Iterate the previous steps N times (N = 10000). Empirical power is calculated as the proportion of iterations in which the null hypothesis is rejected.
5. Calculate power for the Exact-t method and Z-ChiSquare method as outlined in the Methods section.
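The simulation steps can be sketched as follows for the difference of means test with equal sample sizes and a pooled variance. This is a minimal standard-library illustration rather than the authors' code: the function names, the ordering of the statistics, the Cornish-Fisher approximation used for the t critical values, and the margins in the usage note are all assumptions of this sketch.

```python
import math
import random
from statistics import NormalDist


def t_quantile(p, df):
    """Approximate Student t quantile (Cornish-Fisher expansion, two terms)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def empirical_power_dom(mu_t, mu_r, mu_p, sigma, n, theta1, theta2,
                        reps=10000, seed=1):
    """Proportion of simulated trials in which all four tests reject."""
    rng = random.Random(seed)
    df = 3 * n - 3
    t_sup, t_eq = t_quantile(0.975, df), t_quantile(0.95, df)
    rejections = 0
    for _ in range(reps):
        # Generate one trial: n normal responses per arm.
        (m_t, v_t), (m_r, v_r), (m_p, v_p) = (
            mean_var([rng.gauss(mu, sigma) for _ in range(n)])
            for mu in (mu_t, mu_r, mu_p)
        )
        # With equal n, the pooled variance is the plain average of the three.
        se = math.sqrt((v_t + v_r + v_p) / 3) * math.sqrt(2 / n)
        t1 = (m_r - m_p) / se            # REF vs PLB superiority
        t2 = (m_t - m_p) / se            # TEST vs PLB superiority
        t3 = (m_t - m_r - theta1) / se   # lower equivalence margin
        t4 = (m_t - m_r - theta2) / se   # upper equivalence margin
        if t1 >= t_sup and t2 >= t_sup and t3 >= t_eq and t4 <= -t_eq:
            rejections += 1
    return rejections / reps
```

For instance, `empirical_power_dom(0.5, 0.5, 0.3, 0.2, 48, -0.1, 0.1, reps=2000)` simulates 2000 trials with 48 subjects per arm under hypothetical margins of ±0.1. The running time grows with both the sample size and the number of replicates, since every response is generated individually.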

Difference of Means Test
We first compare the powers for the difference of means tests under the condition that the test, reference, and placebo groups all follow a normal distribution with $\mu_R = 0.5$, $\mu_P = 0.3$, and equal standard deviation $\sigma = 0.2$ across the groups. In Figure 1, we can see that the power of the superiority test increases as the difference between the test group and the placebo group increases. The empirical power for the equivalence test and for the overall BE test is bell shaped and reaches its maximum when the means of the test group and reference group are equal. Figure 2 compares the power calculated using (A) the Exact-t method and (B) the Z-ChiSquare method for the overall BE test, by different effect sizes and different sample sizes. Both the Exact-t and Z-ChiSquare power curves are bell shaped and are generally good estimates of the empirical power. However, when the sample size is small, the Z-ChiSquare method gives unstable estimates of power, whereas the Exact-t method gives a smooth and accurate estimate of the empirical power.

Power vs. Sample Size by Different Methods
To better understand the difference between the Exact-t and Z-ChiSquare power functions, and to understand which tests (superiority tests or equivalence test) drive the power of the overall BE test, we study power vs. sample size for the two superiority tests, the equivalence test, and the overall BE test based on the empirical power, as well as the power of the overall BE test calculated from the Exact-t method and the Z-ChiSquare method. Four different scenarios are considered in Figure 3 and Figure 4. When $\mu_P = 0.3$ (3A and 3D), i.e., the effect size is large, the power for superiority is high and the overall BE test is mostly driven by the equivalence test. On the other hand, when $\mu_P = 0.4$ (3B and 3C), i.e., the placebo effect is high, the power of the superiority tests becomes the driving force for the overall BE test instead. Between 3B and 3C, as the variance increases, power decreases and the power curve shifts to the right in 3C.
A zoom-in plot (Figure 4) further compares the Exact-t method and the Z-ChiSquare method against the empirical power in the small sample size range. Figure 4 shows that the power curve of the Z-ChiSquare method (green) deviates from the empirical BE power curve (purple) when the sample size is small, whereas the power curve of the Exact-t method (blue) overlaps perfectly with that of the empirical power.

Ratio of Means Test
Similar to the difference of means test, we plotted the empirical power for the ratio of means by effect size in Figure 5 for (A) two superiority tests (TEST vs. PLB, REF vs. PLB); (B) one equivalence test (TEST vs. REF); and (C) the overall BE test including two superiority tests and one equivalence test, when $\mu_R = 0.5$, $\mu_P = 0.3$, and $\sigma = 0.2$. The results are similar to those for the difference of means (Figures 1 and 2), except that the Z-ChiSquare method (Figure 6B) matches the Exact-t method (Figure 6A) and the empirical power (Figure 5C) very well even when the sample size is small.
Likewise, the empirical power vs. sample size for the superiority tests, the equivalence test, and the overall BE test, as well as the power calculated from the Exact-t method and the Z-ChiSquare method based on the ratio of means in Figure 7, show a trend very similar to that in Figure 4 for the difference of means. This is more convenient for general users. Although the power curves of the Exact-t and Z-ChiSquare methods for the overall BE test overlap when the sample size is large, the Z-ChiSquare power deviates from the empirical power when the sample size is small. On the other hand, the power of the proposed Exact-t method overlaps with the empirical power of the overall BE test from small to large sample sizes. One interesting finding from the simulation results (Figure 3 and Figure 7) is that when the effect size of REF vs. PLB is large, the equivalence test is the driving force of the power of the overall BE test; however, when the placebo effect is high, i.e., the effect size is small, the superiority tests instead are the driving force for the overall power. One future research area is to explore different study designs to improve the efficiency of the overall BE test.
For simplicity, we assume homogeneous variance across treatment groups in this paper. A future research area is to relax this assumption to allow heterogeneous variances, because active treatment groups are likely to have different variances from PLB. One challenge for heterogeneous variance is that with Satterthwaite's approximation (Satterthwaite, 1946), there will be four different degrees of freedom for the four t test statistics. A multivariate non-central t distribution, however, only allows one degree of freedom. We would need to specify an approximate degree of freedom for the multivariate non-central t distribution. One possible extension is to use Dannenberg et al's updated degree of freedom (Dannenberg et al, 1994), which remains to be explored.
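To make the degrees-of-freedom issue concrete, the sketch below evaluates Satterthwaite's approximation for the four statistics of the ratio of means test. The sample variances and the margins 0.8 and 1.25 are hypothetical values chosen only for illustration, and the function name and its `weight_b` argument are assumptions of this sketch.

```python
def satterthwaite_df(var_a, n_a, var_b, n_b, weight_b=1.0):
    """Satterthwaite degrees of freedom for the contrast mean_a - weight_b * mean_b.

    weight_b is 1 for a difference of means and equals the margin theta
    for a ratio-of-means contrast.
    """
    a = var_a / n_a
    b = weight_b**2 * var_b / n_b
    return (a + b) ** 2 / (a**2 / (n_a - 1) + b**2 / (n_b - 1))


# Hypothetical sample variances for TEST, REF, and PLB with 24 subjects per arm.
s2_t, s2_r, s2_p = 0.04, 0.0625, 0.01
n = 24

nu1 = satterthwaite_df(s2_r, n, s2_p, n)               # REF vs PLB
nu2 = satterthwaite_df(s2_t, n, s2_p, n)               # TEST vs PLB
nu3 = satterthwaite_df(s2_t, n, s2_r, n, weight_b=0.8)   # lower margin (hypothetical)
nu4 = satterthwaite_df(s2_t, n, s2_r, n, weight_b=1.25)  # upper margin (hypothetical)
```

With these inputs the four values differ from one another, which is exactly the situation a single-degree-of-freedom multivariate t distribution cannot represent directly. When the two variances and sample sizes are equal and the weight is 1, the formula reduces to $2(n - 1)$.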
In this paper, we assume a normal distribution. In practice, however, data often deviate from normality. Therefore, another future research area is to extend the method from the normal distribution to other distributions, such as the log-normal distribution.
Our proposed Exact-t method is shown by simulation to be more accurate than the Z-ChiSquare method when the sample size is small, is computationally more efficient, and does not require self-written code to numerically calculate the conditional expectation of the normal distribution given a truncated chi-square distribution.
Therefore, we recommend using the proposed Exact-t method when calculating power and sample size for a three-arm clinical endpoint BE study.

Disclaimer
The opinions and information in this presentation are those of this presenter, and do not represent the views and/or policies of the U.S. Food and Drug Administration.

Figure 1 shows the empirical power by effect size for (A) two superiority tests (TEST vs. PLB, REF vs. PLB); (B) one equivalence test (TEST vs. REF); and (C) the overall BE test including two superiority tests and one equivalence test, when $\mu_R = 0.5$, $\mu_P = 0.3$, and $\sigma = 0.2$.

Table 1. Comparison for difference of means test (seconds)

Chang et al (2014) calculated the sample size and power for an overall BE test including one superiority test (TEST vs. PLB) and one equivalence test (TEST vs. REF) for the clinical endpoint BE study. Chang et al used the joint distribution of sample means and sample variances to calculate the expectation of a normal distribution conditional upon a chi-square distribution because "it is not easy to derive the sample size based on the multivariate-t distribution". In this paper, we propose an exact method to calculate the power and sample size for an overall BE test based on two (rather than one) superiority tests (TEST vs. PLB, REF vs. PLB) and one equivalence test (TEST vs. REF) using a multivariate non-central t distribution directly. The proposed Exact-t method is computationally more efficient, especially as the sample size gets larger. It can take advantage of an existing R package without self-written code to numerically calculate the conditional expectation of a multivariate normal distribution based on a truncated chi-square distribution.
In summary, we propose an exact method based on the multivariate non-central t distribution to calculate power and sample size for a three-arm BE study with two superiority tests (TEST vs. PLB, REF vs. PLB) and one equivalence test. It is an advance over Chang et al (2014)'s method in that we extended the overall BE test from one superiority test and one equivalence test to two superiority tests and one equivalence test.