Evaluation of Performance of Adaptive Designs Based on Treatment Effect Intervals

The accuracy of the treatment effect estimate is crucial to the success of Phase 3 studies. In a fixed sample size design, the sample size calculation relies on the treatment effect estimate and cannot be changed during the trial. Oftentimes, with limited efficacy data available from early phase studies and relevant historical studies, the sample size calculation may not accurately reflect the true treatment effect. Several adaptive designs have been proposed to address this uncertainty in the sample size calculation. These adaptive designs provide flexibility by allowing early trial stopping or sample size adjustment at interim look(s). Adaptive designs can optimize trial performance when the treatment effect is an assumed constant value; however, in practice it may be more reasonable to consider the treatment effect within an interval rather than as a point estimate. Because proper selection of adaptive designs may decrease the failure rate of Phase 3 clinical trials and increase the chance of new drug approval, this paper proposes performance measures, evaluates the performance of different adaptive designs based on treatment effect intervals, and identifies factors that may affect the performance of adaptive designs.


Introduction
It is well reported that the cost of drug development keeps rising at a high rate while the number of new drug applications does not keep pace (Lesko, 2006). It was estimated that the failure rate for Phase 3 trials exceeds 50% (Chuang-Stein, 2004). A poorly designed Phase 3 trial is a likely contributor to this high failure rate, and it costs both money and patient lives (Thoelke, 2007). The accuracy of the treatment effect assumption is crucial to the success of Phase 3 studies. The calculation of sample size in fixed sample size (FS) designs relies on the assumed treatment effect and cannot be changed during the trial. With limited efficacy data available from early phase or other relevant historical studies, the sample size calculation may not accurately reflect the true treatment effect. This lack of knowledge leads to a calculated sample size that is either too large or too small, so the trial may be either oversized or underpowered. The result could be either a waste of financial and patient resources or trial failure.
Many adaptive designs have been proposed to address this issue (e.g., Armitage et al., 1969; Cui et al., 1999; Shih, 1992 and 1998; Haybittle, 1971; Jennison and Turnbull, 1990; Lan and DeMets, 1994; Li et al., 2002; Li et al., 2005; Liu et al., 2008; O'Brien and Fleming, 1979; Peto et al., 1976; Pocock, 1977; Proschan et al., 2006; Tsiatis and Mehta, 2003; Wittes and Brittain, 1990). A broad definition of adaptive designs given by Shih (2006) is used in this paper. All classical group sequential (GS) designs and sample size re-estimation (SSR) designs fall within the adaptive design scope per this definition. For (classical) GS designs, under a pre-specified total number of looks and maximum sample size, a study is allowed to stop for efficacy or futility at an interim analysis, but no extension is allowed beyond the pre-specified maximum sample size. An adaptive GS design is a two-phase GS design with sample size adjustment, as illustrated in Figure 1G: the sample size adjustment is conducted at the j-th interim analysis (phase 1 portion of the trial), and the remaining duration of the study with the modified sample size is the phase 2 portion. It has been called an SSR design in the group sequential setting in the literature (e.g., Gould and Shih, 1998; Cui et al., 1999). We simply call it an SSR design in this paper, since it is equivalent to the SSR design with the sample size adjustment at the first look (j = 1). Shih et al. (2016) gave a review of popular weighted and unweighted SSR designs. In all adaptive designs, since interim analyses are built into the traditional study and the results at the interim look(s) are used to adjust the future course of the study, the overall type I error rate and adequate power or conditional power need to be maintained.
Currently, the discussion of adaptive designs in the literature has mostly focused on a single specified value as a representation of the unknown treatment difference. However, in practice, what is often known is a treatment effect interval (Liu et al., 2008), as illustrated in the following.
It is believed that if an experimental drug is added to the current standard therapy (Carboplatin plus Paclitaxel), the remission time for ovarian cancer patients after surgery will be prolonged. A Phase 3 confirmatory trial is planned to compare the treatment effect of the combination therapy versus the standard therapy alone. Progression-free survival (PFS) is used as the primary efficacy endpoint. The treatment effect is estimated based on the results of Phase 2 proof-of-concept studies for the experimental drug and published median PFS values for the standard treatment. However, different PFS medians for the standard therapy are found in the literature. In the Hellenic Cooperative Oncology Group (HeCOG) study (Aravantinos et al., 2005), the median PFS of 121 patients randomized to the standard (Carboplatin plus Paclitaxel) therapy was 38 months. In the Gynecologic Oncology Group (GOG) study, the median PFS of 392 patients receiving standard therapy was 19.4 months (Ozols et al., 2003). In another Phase 3 study supported by Bristol-Myers Squibb, the median PFS of patients receiving standard therapy was 16 months (Neijt et al., 2000). Two other randomized trials indicated that the median PFS was around 17.5 months (DuBois et al., 2003; Parmar et al., 2002). The standard therapy regimens, including dose levels and dose frequencies, were similar across these studies. However, there were differences in cancer stages and tumor sizes among the enrolled patient populations. Thus, it was very difficult to find an accurate point estimate of the true median PFS for the standard therapy. After careful comparison of the study designs, including inclusion/exclusion criteria and treatment schedules, the median PFS for standard therapy was judged to be most likely between 15 and 20 months.
Most previous research and designs focused on how to maximize study efficiency when the treatment effect is estimated as a point value. The example above raises several questions. How can the currently available adaptive designs be used to maximize study efficiency over a treatment effect interval? How can a mathematical framework be used to evaluate the performance of different designs? What factors affect the performance of adaptive designs? Under the same constraints, do certain GS designs (e.g., with different boundaries or different sample size increments) perform similarly to SSR designs?
In this paper, we address these issues. Specifically, we develop a method to evaluate and compare the performance of adaptive designs, including GS designs and weighted and unweighted SSR designs, on a pre-specified treatment effect interval. We first introduce the measures and indicators used for evaluating adaptive design performance in Section 2. The performance of adaptive designs when the treatment effect follows a uniform distribution and, more generally, a location-scaled beta distribution is discussed in Sections 3 and 4, respectively. Discussion and conclusions are given in Section 5.

Determination of Treatment Effect Interval
The treatment effect interval should be determined by combining multiple considerations. The lower limit of the treatment effect interval should be based on (1) the clinically meaningful treatment difference, (2) medical policies, such as restrictions on medication price, and (3) company financial considerations. The upper limit of the treatment effect interval should be based on (1) the minimum number of patients needed for adequate safety evaluation of the test drug and (2) a realistic estimate of the largest treatment difference.

Measures of Performance
Several authors have evaluated adaptive designs in different contexts. For example, Xi et al. (2017) investigated the optimal timing of interim analyses for making the futility decision. Chen et al. (2016) proposed a biomarker-based subgroup analysis at the end of a Phase III trial to fine-tune the statistical design, including hypothesis adjustment. Levin et al. (2013) compared adaptive designs with test statistics based on the minimal sufficient statistic, thus including GS and unweighted SSR designs but excluding the weighted SSR design (Cui et al., 1999). All of these works used average sample size to compare the performance of different adaptive designs. In this paper, we use functions of the true treatment effect δ on the treatment effect interval [δ_L, δ_U] as the measures and construct performance indicators based on them. When comparing sample size, this function is the per-group sample size u(δ) = 2(z_{1−α/2} + z_{1−β})²/δ², where α and β are the pre-specified type I and type II error rates for the study and z_q = Φ^{−1}(q) is the standard normal quantile. When comparing power, this function is p(δ) = 1 − β.
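These two measures can be made concrete. The following sketch assumes a two-sided, two-sample z-test with standardized effect δ (an assumption consistent with the per-group sample sizes 2018, 356, and 63 quoted later for δ_L = 0.0882, δ_M = 0.21, and δ_U = 0.5); the function names are ours.

```python
from math import ceil
from statistics import NormalDist

def per_group_sample_size(delta, alpha=0.05, beta=0.20):
    """Per-group sample size u(delta) for a two-sided two-sample z-test
    with standardized treatment effect delta:
        u(delta) = 2 * (z_{1-alpha/2} + z_{1-beta})**2 / delta**2
    """
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / delta ** 2

def power(delta, n_per_group, alpha=0.05):
    """Power of the same test when n_per_group patients per arm are
    enrolled but the true standardized effect is delta."""
    nd = NormalDist()
    return nd.cdf(delta * (n_per_group / 2) ** 0.5 - nd.inv_cdf(1 - alpha / 2))

# The interval [0.0882, 0.5] used throughout the paper:
for d in (0.0882, 0.21, 0.5):
    print(d, ceil(per_group_sample_size(d)))  # 2018, 356, 63
```

The monotone decrease of u(δ) in δ is what drives the oversized/underpowered trade-off discussed for the FS design below.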
These functions can be used separately or be combined as seen in the sequel. Let us first illustrate with the FS design.

True Treatment Effect Function for the Fixed Size Design
Without knowing the true treatment effect δ, the sample size for an FS design has to be calculated based on an estimated effect δ̂ from a previous study or historical data. The sample size cannot be changed during the study. Sample size curves as a function of the true treatment effect for the FS design are illustrated in Figure 1A. Only when δ̂ equals the true treatment effect δ does the FS design have the ideal performance. When δ̂ is shifted away from δ, the FS design is either underpowered (if δ̂ > δ) or oversized (if δ̂ < δ), and its performance worsens as the difference between the true and estimated values grows.

Failure Rate
Failure and Failure Rate: At each point on the treatment effect interval, it is considered a failure when the sample size for a particular design at that point is more than 1/f_s (usually 1/f_s = 2) times the sample size based on the true treatment effect, or the power decreases by more than f_p (usually 20%) of the power based on the true treatment effect, where 0 < f_s < 1 and 0 < f_p < 1. For example, when f_p = 20% and the nominal power is 80%, failure occurs if the power of the design is no more than 64% (= 80% × (1 − 0.2)). The failure rate is defined as the proportion of points that meet the failure criteria on the treatment effect interval. A lower failure rate indicates better performance of the adaptive design.
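The pointwise failure criterion can be written as a small predicate; a minimal sketch, with f_s and f_p as defined above (the function name is ours):

```python
def is_failure(n_design, power_design, n_true, power_true, f_s=0.5, f_p=0.2):
    """Failure at a point on the interval: the design's sample size is more
    than 1/f_s times the true-effect sample size, or the design's power is
    no more than (1 - f_p) times the true-effect power."""
    oversized = n_design > n_true / f_s
    underpowered = power_design <= (1 - f_p) * power_true
    return oversized or underpowered

# nominal power 80% and f_p = 20%: any power <= 64% is a failure
print(is_failure(100, 0.63, 80, 0.80))  # True: 0.63 <= 0.64
print(is_failure(150, 0.80, 80, 0.80))  # False: 150 <= 160 and 0.80 > 0.64
```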

Failure Rate for Fixed Sample Size Design:
Denote the sample size at the true treatment effect δ as u(δ) and let n_0 be the sample size calculated for the FS design (based on the estimated effect δ̂). By the above criterion, a failure occurs when n_0 is larger than (1/f_s)u(δ). Because the sample size curve for the true treatment effect is monotone decreasing, failure occurs for all δ > δ_fn on the interval, where δ_fn = δ̂/√f_s. The targeted power on the interval for the true treatment effect function is always 1 − β. By the above criterion, it is a failure when the power decreases by more than f_p times the targeted power. Because of the monotonicity of the power curve given the sample size n_0, failure occurs for all δ < δ_fβ, where δ_fβ satisfies Φ(δ_fβ √(n_0/2) − z_{1−α/2}) = (1 − f_p)(1 − β). Thus, the total failure rate on a treatment effect interval that combines both the sample size and power measures is defined as R_f = [(δ_U − max(δ_fn, δ_L))⁺ + (min(δ_fβ, δ_U) − δ_L)⁺]/(δ_U − δ_L), where x⁺ = max(x, 0). Below is an example of the failure rate for FS designs. When the treatment effect interval is [0.0882, 0.5], f_s = 1/2, and f_p = 0.2, the failure rates at different sample sizes for FS designs are:
1. Failure rate for an FS design with sample size calculated at δ_U: 78.9%. As indicated in Figure 1B, when the treatment effect falls into the region (0.0882 < δ < 0.4131), the power decreases by more than 20% of the power from the true treatment effect function (1 − β = 80%).
2. Failure rate for an FS design with sample size calculated at δ_L: 91.1%. As indicated in Figure 1C, when the treatment effect falls into the region (0.1247 < δ < 0.5), the sample size is more than two times the sample size from the true treatment effect function.
3. Failure rate for an FS design with sample size calculated at δ_M = √(δ_L δ_U): 70.1%. As indicated in Figures 1D and 1E, when the treatment difference falls into the red region (0.0882 < δ < 0.1738), the power decreases by more than 20% of the power from the true treatment effect function; when the treatment difference falls into the red region (0.2968 < δ < 0.5), the sample size is more than two times the sample size from the true treatment effect function.
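The thresholds δ_fn and δ_fβ and the three failure rates above can be reproduced numerically, assuming the two-sample z-test sample size u(δ) = 2(z_{1−α/2} + z_{1−β})²/δ² per group; a sketch:

```python
from math import ceil
from statistics import NormalDist

nd = NormalDist()
z = nd.inv_cdf
alpha, beta, f_s, f_p = 0.05, 0.20, 0.5, 0.2
dL, dU = 0.0882, 0.5
dM = (dL * dU) ** 0.5                 # = 0.21

def u(delta):
    """Per-group sample size at true effect delta."""
    return 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / delta ** 2

def fs_failure_rate(delta_hat):
    """Combined (oversize + underpower) failure rate of the FS design
    sized at the estimated effect delta_hat, uniform over [dL, dU]."""
    n0 = ceil(u(delta_hat))
    # oversized for all delta > delta_fn = delta_hat / sqrt(f_s)
    delta_fn = delta_hat / f_s ** 0.5
    # underpowered for all delta < delta_fbeta, where
    # Phi(delta_fbeta * sqrt(n0/2) - z_{1-alpha/2}) = (1 - f_p)(1 - beta)
    delta_fbeta = (z((1 - f_p) * (1 - beta)) + z(1 - alpha / 2)) / (n0 / 2) ** 0.5
    oversize = max(0.0, dU - max(delta_fn, dL))
    underpower = max(0.0, min(delta_fbeta, dU) - dL)
    return (oversize + underpower) / (dU - dL)

for d_hat in (dU, dL, dM):
    print(round(100 * fs_failure_rate(d_hat), 1))  # 78.9, 91.1, 70.1
```

The computed thresholds (δ_fn = 0.1247 when δ̂ = δ_L; δ_fβ = 0.4131 when δ̂ = δ_U) match the region boundaries quoted in the three examples.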

Generalization of Failure Rate:
A generalized formula for the failure rate on a treatment effect interval is R_f = ∫_{δ_L}^{δ_U} I_{{g(δ) > (1/f_s) u(δ)} ∪ {f(δ) < (1 − f_p) p(δ)}}(δ) W(δ) dδ, where g(δ) and f(δ) denote the sample size and power of the adaptive design when the treatment effect is δ, u(δ) and p(δ) denote the sample size and power from the true treatment effect function when the treatment effect is δ, and I_A(δ) is the indicator function of the set A. W(δ) denotes the weight assigned to δ ∈ [δ_L, δ_U], which is a probability density function on the treatment effect interval. When the treatment effect has a uniform distribution over [δ_L, δ_U], W(δ) = 1/(δ_U − δ_L). The treatment effect can also be assumed to follow other distributions. In this paper, we consider a treatment effect that follows a uniform distribution and a location-scaled beta distribution in Sections 3 and 4, respectively. The generalized failure rate is used as an indicator of performance in this paper. For simplicity, we simply call it the failure rate in the sequel.
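This integral can be approximated numerically for any design whose sample size and power curves are available; a sketch using a midpoint rule, illustrated with the FS design sized at δ_U under the same z-test assumption as before (all callables are passed in, so any W(δ) can be used):

```python
from statistics import NormalDist

nd = NormalDist()

def generalized_failure_rate(g, f, u, p, W, dL, dU, f_s=0.5, f_p=0.2, m=20_000):
    """Midpoint-rule approximation of
        R_f = integral over [dL, dU] of
              I{ g(d) > u(d)/f_s  or  f(d) < (1 - f_p) p(d) } * W(d) dd,
    where g, f are the design's sample size and power, u, p come from the
    true treatment effect function, and W is a density on [dL, dU]."""
    h = (dU - dL) / m
    total = 0.0
    for i in range(m):
        d = dL + (i + 0.5) * h
        failed = g(d) > u(d) / f_s or f(d) < (1 - f_p) * p(d)
        total += failed * W(d) * h
    return total

# FS design sized at dU = 0.5 (63 per group), uniform weight:
dL, dU, n0 = 0.0882, 0.5, 63
C = 2 * (nd.inv_cdf(0.975) + nd.inv_cdf(0.80)) ** 2
R = generalized_failure_rate(
    g=lambda d: n0,                                          # constant size
    f=lambda d: nd.cdf(d * (n0 / 2) ** 0.5 - nd.inv_cdf(0.975)),
    u=lambda d: C / d ** 2,                                  # true-effect size
    p=lambda d: 0.80,                                        # targeted power
    W=lambda d: 1 / (dU - dL),                               # uniform density
    dL=dL, dU=dU)
print(round(R, 3))  # ~0.789, matching the FS example at delta_U
```

For the beta-weighted case of Section 4, only W(δ) changes (a location-scaled beta density on [δ_L, δ_U]).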

Area between Log Curves (ABLC)
Loss function: The loss function (or regret) in decision theory can also be used to evaluate the performance of a design at the true treatment effect δ on the interval [δ_L, δ_U]. The ratio of the upper limit to the lower limit of the interval is called the adaptive index (AI), i.e., AI = δ_U/δ_L. We consider the absolute loss function L(δ) = |Ŝ(δ) − S(δ)|, where S(δ) is the sample size or power at the true δ and Ŝ(δ) is the estimated sample size or power based on the design. The smaller the loss, the better the adaptive design performance. In the following, we consider the loss in terms of sample size only. The study of the loss in terms of power is available from the authors upon request.

Area Between Curves:
Loss only accounts for the performance of an adaptive design at a particular point on the treatment effect interval. It is important to identify a criterion that accounts for the cumulative performance over the treatment effect interval. The risk function is the average of the loss over the treatment effect interval. Since the length of the treatment effect interval is the same across designs, comparing risks is equivalent to comparing the areas between curves, defined as follows.

Area Between Curves (ABC) of Adaptive Design and True Treatment Effect Function:
To account for the deviation from the true treatment effect function, the ABC for each adaptive design can be calculated on the treatment effect interval (see Figure 1F) as ABC = ∫_{δ_L}^{δ_U} |Ŝ(δ) − S(δ)| dδ. Performance can be evaluated by comparing the ABC of different designs: the smaller the area between the curves, the better the performance.

Area between Log Curves (ABLC) for Adaptive Design and True Treatment Effect Function:
One can interpret Ŝ(δ) − S(δ) as the sample size difference between the adaptive design and the true treatment effect function. More appropriately, though, the ratio of Ŝ(δ) to S(δ) indicates the relative difference. For easier interpretation, the log ratio can be used: log(Ŝ(δ)/S(δ)) = log(Ŝ(δ)) − log(S(δ)).
Define ABLC = ∫_{δ_L}^{δ_U} |log(Ŝ(δ)) − log(S(δ))| dδ. In addition to the failure rate R_f, ABLC is used as another performance indicator of a design in Sections 3 and 4.
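Both areas can be approximated with the same numerical quadrature; a sketch where S_hat and S are any sample size (or power) curves, illustrated with the FS design sized at δ_M against the z-test true-effect curve assumed earlier:

```python
from math import log
from statistics import NormalDist

def area_between(S_hat, S, dL, dU, log_scale=False, m=20_000):
    """Midpoint-rule approximation of
        ABC  = int |S_hat(d) - S(d)| dd          (log_scale=False)
        ABLC = int |log S_hat(d) - log S(d)| dd  (log_scale=True)
    over the treatment effect interval [dL, dU]."""
    h = (dU - dL) / m
    total = 0.0
    for i in range(m):
        d = dL + (i + 0.5) * h
        a, b = S_hat(d), S(d)
        total += (abs(log(a) - log(b)) if log_scale else abs(a - b)) * h
    return total

# FS design sized at dM (356 per group) vs. the true-effect curve:
nd = NormalDist()
C = 2 * (nd.inv_cdf(0.975) + nd.inv_cdf(0.80)) ** 2
dL, dU = 0.0882, 0.5
ablc = area_between(lambda d: 356, lambda d: C / d ** 2, dL, dU, log_scale=True)
print(round(ablc, 3))
```

The log scale makes under- and over-sizing by the same factor contribute equally, which is the motivation for preferring ABLC over ABC.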

Performance of Adaptive Designs when Treatment Effect Follows a Uniform Distribution
In this section, the performance of (classical) GS designs, weighted SSR designs, and unweighted SSR designs on a treatment effect interval is evaluated, assuming that the treatment effect follows a uniform distribution. The CHW design (Cui et al., 1999) is used as the representative of the weighted SSR design. For the unweighted (likelihood) SSR design (e.g., Li et al., 2002; Shih et al., 2016), the boundaries for the interim looks after the sample size adjustment are recalculated to maintain the overall type I error rate. For weighted and unweighted SSR designs, an initial sample size is selected based on the sample size at δ_L, δ_U, or δ_M. The sample size can then be increased based on the interim estimate of the treatment effect during the study, subject to a restriction on the maximum allowed sample size. Sample size adjustment is done based on the conditional power at selected looks before the final analysis. The effects of patient increment patterns and of different adaptive indices are also studied.
For the (classical) GS and weighted SSR designs, the boundaries at each interim and final look are fixed at the design stage through pre-specified alpha spending. Four kinds of discrete boundaries are considered in the performance comparisons: O'Brien and Fleming boundaries (OBF), Pocock boundaries (PK), and Haybittle-Peto boundaries with critical value α_0 of 0.01 (HP01) and of 0.005 (HP005). The boundaries are calculated by exact methods (as opposed to the approximation by alpha-spending functions for continuous boundaries). More specifically, let K be the total number of looks, t_1, t_2, ..., t_K denote the information fractions at each look, δ denote the treatment effect, and Z_{t_1}, Z_{t_2}, ..., Z_{t_K} denote the test statistics at each look. Then:
1. O'Brien and Fleming boundaries: the trial stops for efficacy at look k if |Z_{t_k}| ≥ c_OBF/√t_k, where c_OBF satisfies P(∪_{k=1}^{K} {|Z_{t_k}| ≥ c_OBF/√t_k} | δ = 0) = α.
2. Pocock boundaries: the trial stops for efficacy at look k if |Z_{t_k}| ≥ c_PK, where c_PK satisfies P(∪_{k=1}^{K} {|Z_{t_k}| ≥ c_PK} | δ = 0) = α.
3. Haybittle-Peto boundaries: the trial stops for efficacy at look k < K if |Z_{t_k}| ≥ c_0 = Φ^{−1}(1 − α_0), and at the final look if |Z_{t_K}| ≥ c_{α−α_0(K−1)}, where α_0 is predetermined and α_0(K − 1) is the cumulative alpha spent in the first K − 1 looks.
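The paper computes these constants exactly; as an illustration, they can also be calibrated by Monte Carlo, using the fact that under H0 the sequence B(t_k) = Z_{t_k}·√t_k behaves as a standard Brownian motion observed at the information times (a sketch, equally spaced looks only; function name is ours):

```python
import random

def boundary_constant(K, kind="PK", alpha=0.05, n_sim=100_000, seed=1):
    """Monte Carlo calibration of the common constant c for a K-look design
    with equally spaced information times t_k = k/K (two-sided level alpha).
    PK rejects at look k when |Z_{t_k}| >= c; OBF when |Z_{t_k}| >= c/sqrt(t_k),
    i.e. when |B(t_k)| = |Z_{t_k}| * sqrt(t_k) >= c."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        b, m = 0.0, 0.0
        for k in range(1, K + 1):
            b += rng.gauss(0.0, (1 / K) ** 0.5)   # Brownian increment
            t = k / K
            stat = abs(b) / t ** 0.5 if kind == "PK" else abs(b)
            m = max(m, stat)
        maxima.append(m)
    maxima.sort()
    return maxima[int((1 - alpha) * n_sim)]        # upper-alpha quantile

print(round(boundary_constant(1), 2))         # one look: ~1.96
print(round(boundary_constant(5, "PK"), 2))   # Pocock, K = 5: ~2.41
print(round(boundary_constant(5, "OBF"), 2))  # O'Brien-Fleming, K = 5: ~2.04
```

Exact computation would instead integrate the joint multivariate normal density of (Z_{t_1}, ..., Z_{t_K}) recursively; the Monte Carlo values agree with the standard tabulated constants to within simulation error.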
For the unweighted SSR design, since the sample size will be updated based on the interim findings, the information times need to be updated; thus, the boundaries after the sample size adjustment need to be recalculated as well.
The effect of different patient increment patterns is also studied. Performance is compared when patient increments are equally spaced or unequally spaced with doubling increments. The information time for each analysis is calculated based on the total number of looks and the increment pattern.
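We read "double increment" as each inter-look patient increment being twice the previous one, so for K looks the increments are proportional to 1, 2, 4, ..., 2^(K−1); this interpretation is our assumption. A sketch of the resulting information fractions:

```python
def information_fractions(K, pattern="equal"):
    """Information times 0 < t_1 < ... < t_K = 1 for K looks.
    'equal' : equally spaced increments, t_k = k/K
    'double': patient increments doubling between looks,
              proportional to 1, 2, 4, ..., 2^(K-1)."""
    inc = [1] * K if pattern == "equal" else [2 ** k for k in range(K)]
    total = sum(inc)
    t, cum = [], 0
    for w in inc:
        cum += w
        t.append(cum / total)
    return t

print(information_fractions(5, "equal"))   # [0.2, 0.4, 0.6, 0.8, 1.0]
print(information_fractions(5, "double"))  # increments 1:2:4:8:16 -> 1/31, 3/31, 7/31, 15/31, 1
```

Under the double pattern, most of the information accrues late, so the early looks are taken on very little data, which is relevant to the timing results reported below.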

Simulation Plan
Simulations for adaptive designs are based on the steps outlined below:
Step 1: Identify an interval of exploration and the maximum and minimum allowed sample sizes on the basis of early study results and the literature.
Step 2: Choose the candidate adaptive designs to be considered: GS designs with OBF, Pocock, or Haybittle-Peto boundaries, or SSR designs.
Step 3: Determine the following design parameters:
• Adaptive index
• Maximum sample size for GS designs and initial sample size for SSR designs
• Total number of looks
• Type of information increment
• Time of sample size adjustment for SSR designs
• Adjustment of sample size at the predetermined interim look
Step 4: Obtain the average sample size and power at 11 points of treatment effect (10 evenly divided sections on the selected treatment effect interval) for each design via Monte Carlo simulation.
Step 5: Obtain the sample size and power curves on the treatment effect interval through interpolation.
Step 6: Evaluate the performance of each adaptive design.
All simulation results are based on 10,000 runs for each treatment effect. Simulations are repeated for the different simulation parameters specified in Step 3.
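Steps 4-6 can be sketched for a single GS design as follows; the drift δ√(n/2) corresponds to the two-sample z-test assumed earlier, and the Pocock constant 2.413 for 5 equally spaced looks is taken from standard tables (the function name and parameters are ours):

```python
import random

def simulate_gs(delta, n_max, bounds, t, n_sim=10_000, seed=7):
    """Monte Carlo power and average per-group sample size of a GS design
    with information fractions t and two-sided efficacy boundaries `bounds`
    (one critical value per look), for true standardized effect delta.
    Under delta, B(t) is Brownian motion with drift delta * sqrt(n_max / 2)."""
    rng = random.Random(seed)
    theta = delta * (n_max / 2) ** 0.5
    rejections, n_used = 0, 0.0
    for _ in range(n_sim):
        b, prev = 0.0, 0.0
        for tk, ck in zip(t, bounds):
            dt = tk - prev
            b += rng.gauss(theta * dt, dt ** 0.5)  # Brownian increment + drift
            prev = tk
            if abs(b) / tk ** 0.5 >= ck:           # boundary crossed: stop
                rejections += 1
                break
        n_used += prev * n_max                     # sample size actually used
    return rejections / n_sim, n_used / n_sim

# 5-look Pocock design, n_max = 356 per group, true effect delta_M = 0.21
t = [k / 5 for k in range(1, 6)]
pwr, asn = simulate_gs(0.21, 356, [2.413] * 5, t)
print(round(pwr, 2), round(asn))
```

Repeating this over the 11 grid points and interpolating yields the g(δ) and f(δ) curves fed into the failure rate and ABLC indicators.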

Performance of Adaptive Designs
Performance for each adaptive design on the treatment effect interval [0.0882, 0.5] is evaluated by the failure rate R_f and the area between log curves (ABLC). For GS designs, performance is evaluated at different maximum sample sizes (2018, 356, and 63), which are the FS design sample sizes when the treatment effect is δ_L, δ_M, and δ_U, respectively. For SSR designs, 356 is used as the initial sample size (n_init) and the maximum allowed sample size after adjustment is 2018. Sample size adjustment is based on a targeted conditional power of 80%.
In Figure 2, the top row shows the performance of GS designs with n_max = 2018, 356, and 63. Rows 2 to 5 are graphs for the weighted SSR designs with n_init = 2018, 356, and 63. The last row shows the performance of unweighted SSR designs with n_init = 356. Both equally-spaced and unequally-spaced (double increment) information times are considered.
For GS designs (top row), the design with the PK boundary has the best performance (low failure rate R_f and low ABLC) when n_max is not small. When n_max is small (= 63), OBF is the best. In terms of the total number of looks, when n_max is large (= 2018), the performance of GS designs improves as the number of looks increases, regardless of the kind of boundary. When n_max is not large, the performance does not change as much with the number of looks and is similar among GS designs with different boundaries. In terms of the maximum allowed sample size, n_max = 356 gives better performance than n_max = 63 or 2018. There is an exception, though, with unequally-spaced information times, where the best performance is observed when n_max is 2018.
For the weighted SSR design (rows 2 to 5), the best performance is again with the PK boundary. The focus here is the initial sample size. The best performance is obtained with initial sample size n_init = 356, for the following reason: when the initial sample size is small (= 63), the sample size adjustment is based on very limited information and is not reliable; when the initial sample size is already large (= 2018), no sample size adjustment can occur, so the SSR design reduces to a GS design.
Based on the observations for the weighted SSR designs, the unweighted SSR designs (last row) are examined with initial sample size 356 only. For unweighted SSR designs, the focus is on the total number of looks, the timing of the sample size adjustment, and the pattern of sample size increment. As shown, the SSR design with the PK boundary still performs best overall. Performance improves for all boundaries as the total number of looks increases; however, the failure rate R_f remains similarly low after 3 looks. The timing of the sample size adjustment does not matter for R_f, but for ABLC, the earlier the adjustment, the lower the ABLC.

Performance of SSR Designs with j = 1
Comparisons of SSR designs with j = 1 are presented in the first row of Figure 3. In general, the ABLC of the unweighted SSR design is slightly smaller than that of the weighted SSR design, while the failure rates of the two designs are almost identical.

Comparison of Maximum of 5 Looks Unequally-Spaced GS Designs Using HP Boundaries Versus SSR Designs Using OBF and Pocock Boundaries
Since the sample size can be adjusted based on interim findings, the failure rate and ABLC for SSR designs are usually low. However, in practice there is still a lack of understanding of SSR designs compared to (classical) GS designs; thus, SSR designs are less well accepted by regulatory agencies such as the FDA. In the second and third rows of Figure 3, we compare unequally-spaced (double increment) maximum-of-5-looks GS designs using HP01 and HP005 boundaries versus SSR designs using OBF and PK boundaries. As shown, the failure rate (second row) and ABLC (third row) for GS designs with HP01 and HP005 boundaries are lower than or similar to those of SSR designs with the OBF boundary. However, the opposite is true for SSR designs with the PK boundary, except for two-stage designs. We discuss the special two-stage design further in the next subsection.

Performance of Adaptive Designs when Treatment Effect Follows a Location-Scaled Beta Distribution
In this section, we report results from the same simulation procedures and parameters as in the previous section, except that the treatment effects follow location-scaled beta distributions, Beta(5, 2), Beta(2, 5), or Beta(4, 5), on the treatment effect interval [0.0882, 0.5] (see Figure 1H).

Results
The top row of Figure 4 shows the performance indicators of GS designs with maximum sample size n_max = 2018. As in the previous uniform distribution case, the GS design with the PK boundary has the best performance (low failure rate R_f and low ABLC). In terms of the total number of looks, also similar to the uniform case, the performance of GS designs improves as the number of looks increases, regardless of the kind of boundary and of equal or unequal patient increments. The performance seems robust to the true treatment effect distribution, especially in the ABLC indicator.
The second and third rows of Figure 4 present the performance indicators of the weighted and unweighted SSR designs with initial sample size n_init = 356 and the sample size adjustable to a maximum of 2018 (see the previous discussion on n_init = 356 in Section 3.3.1). Only 2 or 5 looks are considered here. As shown, the best performance is still with the Pocock boundary. The most striking result is that designs with the treatment effect following a location-scaled Beta(5, 2) distribution perform much better than designs with the treatment effect following a location-scaled Beta(2, 5) or Beta(4, 5) distribution; also, the performance indicators under Beta(5, 2) are relatively insensitive to the number of looks, the timing of the sample size adjustment, and the pattern of sample size increment.
The last row of Figure 4 presents the comparison of 2-look or 5-look unequally-spaced (double increment) GS designs using HP01 and HP005 boundaries versus SSR designs using OBF and PK boundaries with equally-spaced looks. As shown, when the treatment effect distribution is Beta(5, 2) and the number of looks is 5, similar to the uniform case, GS designs with HP01 and HP005 boundaries perform better than or similar to SSR designs with the OBF boundary, but worse than SSR designs with the PK boundary. When the treatment effect distribution is Beta(2, 5) or Beta(4, 5) and the number of looks is 5, GS designs with HP01 and HP005 boundaries may perform better or worse than SSR designs with the OBF or PK boundary, depending on the timing of the sample size adjustment; in general, early or middle timing is better than late adjustment. For two-stage designs (2 looks), the conclusions are the same as in the uniform distribution case.

Discussion and Conclusion
Among the various adaptive designs, the use of classical GS designs in clinical trials is well established. However, as commented in the US FDA's Guidance for Industry, adaptation of the sample size based on interim treatment effect estimates is still regarded as a less well understood area (FDA, 2010). In this paper, we attempt to contribute to the understanding of SSR designs by comparing the performance of GS designs versus SSR designs with different design parameters. One aim is to examine situations where the performance of an SSR design may also be achieved by a classical GS design, perhaps with different design parameters.
In this paper, the performance of adaptive designs is based on the measure of sample size and/or power function over the treatment effect interval. The design parameters include the maximum sample size for GS designs, initial sample size for SSR designs, alpha-spending boundaries, total number of looks, types of information increment (equal or double increment), timing of sample size adjustment, etc. Treatment effect is assumed to follow either a uniform distribution or a general location-scaled beta distribution.
There are several interesting findings. First, with the performance indicators defined by the failure rate and ABLC in terms of the sample size and/or power measure, the PK boundary is the best choice in most cases. GS designs perform better with interim analyses at double increments of information time than at equally-spaced increments. Not surprisingly, more interim looks mean better performance for GS designs, but not necessarily for SSR designs. Of course, more interim analyses require more logistical effort.
For the more common two-stage design, SSR designs perform better than GS designs, regardless of alpha-spending boundary or timing of the interim analysis. The weighted and unweighted SSR designs perform similarly.
Most interestingly, we find that the 5-look unequally-spaced GS designs with HP01 and HP005 boundaries can achieve performance similar to or better than that of SSR designs with the OBF boundary, but not necessarily better than SSR designs with the PK boundary.
Finally, adaptive designs may not always be the best choice. When the treatment effect interval is narrow, indicating a relatively accurate estimate of the treatment effect, performance is robust on the interval regardless of the design chosen, and a fixed sample size design may be adequate in this circumstance. However, because such a narrow treatment effect interval is difficult to obtain, one should be cautious and may use simulations to confirm the point estimate of the treatment effect before choosing a fixed sample size design over an adaptive design.