A Composite Likelihood Method for the Analysis of Multivariate Survival Data: An Application to a PBDE Study

When dealing with multivariate survival data, featuring the association structure is a key difference from the univariate survival analysis. In this paper, we explore to use the composite likelihood framework to handle multivariate survival data, where only the lower dimensional survival distributions need to be specified. The development allows us to use available modeling schemes for bivariate survival data to characterize association structures of correlated survival times. The inference procedure is based on the pseudolikelihood which is the product of the lower dimensional bivariate distributions. The proposed estimation procedure is assessed through simulation studies. As a genuine application, we apply the composite likelihood inference procedure to analyze the data from the polybrominated diphenyl ethers (PBDEs) study, where four types of PBDE congeners are available. The associations among the four PBDE congeners, and the relationships between the covariates and the PBDE congeners are of interest. The result shows that there is strong association among the concentrations of the four PBDE congeners, and statistically significant predictors on the concentrations of the four PBDE congeners are identified.


Introduction
Multivariate survival data arise in many settings where the association among times to events is a key difference from standard univariaete survival data. Analysis methods of multivariate survival data may be broadly based on three types of model: marginal models, frailty models and copula models. In marginal analysis (e.g., Wei, Lin and Weissfeld, 1989), the association among the survival times is ignored with the focus on deriving the point estimates of the marginal model parameters, and the sandwich type variance estimator is invoked to incorporate the ignorance of modeling the association structure. This method is usually applied for the regression models to estimate the covariate effects, but it is not useful if association among survival times is of interest. The frailty models (e.g., Clayton 1978) assumes that the association of the survival times is attributed to a random effect, and conditional on the random effect, the survival times are treated independent. The joint survival function can be obtained by integrating out the random effect. Copula models, on the other hand, directly assume the joint survival distribution structure to be a function of the marginal survival functions.
Both frailty models and copula models are useful tools, especially for analyzing bivariate survival data. For example, bivariate frailty models such as gamma frailty model or Clayton model (Clayton 1978), and bivariate copula model such as Clayton (Clayton 1978) or Frank models (Frank 1979) are popularly used in practice. When the dimension of survival times is larger than two, the specification of the joint distribution, either through frailty models or copula models, as well as inference based on it, is often challenging. It is desirable to have methods that retain the advantages of simple specification of bivariate models and can also be used to handle general multivariate survival data. To this end, we explore the composite likelihood approach (Lindsay 1988;Cox and Reid 2004) to handle multivariate survival data.
Composite likelihood inference has now become popular in various settings of multivariate data analysis. To name a few, Parner (2001) utilized a composite likelihood approach in an adoption study, where the pairs of relationships between children and their adopted or biological parents are postulated through pairwise survival distributions. Henderson and Shimakura (2003) applied the composite likelihood to handle longitudinal count data. Renard et al. (2004) considered the composite likelihood formulation under generalized linear mixed models. He and Yi (2011) explored a composite likelihood method for the analysis of binary data with missing observation. Lindsay, Yi and Sun (2011) investigated a weighted composite likelihood formulation to incorporate importance of lower dimension likelihoods. Varin et al. (2011) provided a survey of recent developments in the theory and application of composite likelihood.
Our research is motivated by the PolyBrominated Diphenyl Ethers (PBDEs) study. Polybrominated diphenyl ethers are a class of brominated flame retardants that are widely used in building materials, electronic equipment, and household products, to provide longer escape times and to reduce the damage of property. Environmental concerns of PBDEs have been raised in past two decades because of their high lipophilicity and high resistance to degradation processes. PBDEs have been recognized as persistent organic pollutants and have become ubiquitous in the environment and in people (Hites 2004;Loeber 2008;Frederiksen et al. 2009). Health hazards of PBDEs have attracted increasing scrutiny, and prenatal exposure to PBDEs has been found to associate with adverse neurodevelopment (Costa and Giordano 2007;Herbstman et al. 2010;Bellinger 2013). The association among the PBDEs and the identification of predictors of PBDE concentrations in human body (e.g. breast milk and blood) are of great interest to health researchers.
A study was conducted to characterize predictors of exposure to PBDEs, where a multi-ethnic, low-income cohort of healthy pregnant women at the age between 16 to 35 were enrolled from two New York City prenatal clinics between September 2009 and December 2010. An extensive questionnaire including items on demographics, diet and lifestyle were administered to these pregnant women, and their maternal serum samples were collected during the first half of pregnancy (Horton et al. 2013). While 12 PBDE congeners were measured in blood samples, it is of interested to focus on the most commonly detected PBDE congeners  in the analysis, where the PBDE concentrations below the limit of detection were treated as left censored observations. A previous study examined how the predictors may be associated with each congener marginally using univariate parametric models with left-censoring accounted for (Horton et al., 2013). Such an analysis basically assumed that the predictors have different effects on the outcomes of PBDE congeners. The feasibility of this assumption, however, has not been rigorously justified, and furthermore, the association structure among the PBDE congeners has been overlooked. It is our goal here to provide a more comprehensive analysis of the PBDE data. Our objectives specifically include the assessment of the level of association among the PBDE congeners, identification of statistically significant predictors as well as estimation of their effects on the level of concentrations of the four commonly detected PBDE congeners in blood samples.
To this end, we propose to use the composite likelihood formulation to analyze the PBDE data. We develop an estimation procedure that simultaneously estimate the association parameters and the covariate effects of response variables. The proposed composite likelihood estimation procedure assumes that each survival "time" (defined to be the concentration of the PBDE congeners in the blood) follows a Cox proportional hazards (PH) model, with a parametric Weibull baseline distribution, and paired survival "times" follow a bivariate Clayton frailty model (He and Lawless, 2003). The parameter estimation is achieved by maximizing the composite likelihood, and the covariance estimates are obtained through a sandwich type estimator. Although our method is motivated by the PBDE data, it applies to other multivariate survival data as well. Our method has several appealing features. It not only allows association structures to be accommodated for analysis of multivariate survival data, but also take advantages of existing modeling schemes. In addition, the implementation is straightforward.
The remainder of the paper is organized as follows. Notation and model setup are introduced in Section 2, and the proposed estimation procedure using the composite likelihood formulation is described in Section 3. In Section 4, we conduct simulation studies to assess the performance of the estimation method. In Section 5, the proposed method is applied to analyze the PBDE data. Concluding remarks and discussion are presented in Section 6.

Notation and Model Setup
Suppose there are n clustered survival data in the study. For i = 1, ..., n and j = 1, ..., m i , let T i j and C i j be the survival time and censoring time for subject j in cluster i, δ i j = I(T i j ≤ C i j ) be the censoring indicator and X i j be the covariate vector of dimension p, where I(·) represents indicator function, and m i is the number of subject in cluster i. Write t i j = min(T i j , C i j ) for i = 1, ..., n and j = 1, ..., m i .
Suppose that the marginal regression model of T i j is characterized by the Cox proportional hazards model as for i = 1, ..., n and j = 1, ..., m i , where S i j (·) is the marginal survival function of T i j , Λ j0 (·) is the cumulative baseline hazard function and β j is the vector of regression parameter. In general, both Λ j0 and β j can be subject-dependent; in some settings, Λ j0 and β j are common across the subjects within clusters. For ease of exposition, we consider that Λ j0 (·) = Λ 0 (·) and β j = β for some function Λ 0 (·) and parameter vector β, where j = 1, ..., m i . While the baseline cumulative hazard function Λ 0 (t) can be delineated by various method, here we take a parametric approach with Λ 0 (t) modeled by a Weibull distribution where λ and γ are positive parameters. Weibull distributions have been widely used in survival analysis due to their convenience and flexibility (Lawless 2003). Different values of γ enable us to describe various types of baseline cumulative hazard functions.
Multivariate survival data may be handled with different strategies which can be distinguished by the way of treating association structures of survival times within clusters. A simplest approach is to deliberately ignore the correlation among survival times within clusters and focus on marginal features of individual survival times (e.g., Wei, Lin and Weissfeld, 1989). On the contrary, two other methods have been commonly adopted to describe association structures within clusters: one approach is to use frailty models (e.g., Clayton, 1978, He andLawless, 2003), and the other is to employ copula models (e.g., He and Lawless, 2005) to incorporate association structures into modeling survival times within clusters. While these two schemes allow us to account for potential correlation which is usually present for clustered survival data, they impose implicitly strict assumptions on the association structures. For instance, all the survival times T i j in cluster i are basically treated equally; with frailty models, they are assumed to be independent by conditional on the frailties (and possibly covariates) while under copula models a common parameter vector is used to facilitate the association.
To deal with more general practical problems, we propose to model pairwise correlation structures within clusters and leave higher order association structures unspecified. Such an approach enable us to look into a broader range of problems with complex association structures, and usually applied modeling strategies such as frailty models and copula models are included as special cases.
To flesh out our ideas, we employ bivariate Clayton models (Clayton, 1978) to postulate pairwise survival times within clusters but bearing in mind that other bivariate models can be used as well. Specifically, for i = 1, ..., n and j, k = 1, ..., m i with j k, the joint survivor function of T i j and T ik is given by (1), and ϕ is a positive pairwise association parameter.
We note that as ϕ → ∞, model (2) corresponds to the scenario where T i j and T ik are independent. Tacitly, model (2) assumes that given the covariate {X i j , X ik }, survival times T i j and T i j are not influenced by other covariate X il with l j and l k.

Inference Method
Given the marginal model (1) and the bivariate model (2), we are interested in developing estimation procedures for the model parameters, denoted as θ = (β T , γ, λ, ϕ) T . Let L (i) jk (θ) be the pairwise likelihood of θ contributed from the pair of subjects j and k in cluster i. Then the logarithm of L (i) jk (θ) is For ease of notation, let Then Define to be the log pairwise likelihood function of θ, where l (i) = ∑ j<k l (i) jk . Then inference about θ may be carried out based on l p (θ).
To conduct inference about θ by using the asymptotic result (4), we must estimate the covariance matrix Σ(θ) using the data. Typically, Σ(θ) can be estimated by the method of moments, given bŷ

Simulation Study
Simulation studies are conducted to assess the performance of the pairwise likelihood method developed in Section 3 for estimation of covariate effects β as well as parameters (λ, γ, ϕ).
Different parameter settings are considered where we set m = 3. The sample size n is chosen to be 100 and 200 to examine the effect of the sample size on the proposed estimation method. 500 runs of samples are simulated for each parameter setting. Two dimensional covariates (X i1 , X i2 ) are generated, where X i1 follows a binomial distribution with the success probability being 0.5, and X i2 follows a normal distribution with mean 1 and variance 1. The coefficients of covariates are fixed at β 1 = 0.5 and β 2 = log2 for the marginal models. For the baseline cumulative hazard functions, we set the scale parameter λ = 1, and let the shape parameter γ = 0.5 or 1.5, representing a monotone decreasing or a monotone increasing hazard rate, respectively. Five values of the association parameter ϕ are assumed : ∞ (i.e., survival times are independent), 3, 2, 1/2 and 1/3, ranging from weak to strong dependence among the survival times.
Specifically, we use the formulae to simulate three associated survival times: where U i1 , U i2 , U i3 are independently generated from the uniform distribution Uniform(0, 1).
The censoring percentage is achieved by selecting a censoring time C such that the percentage of simulated survival times which are greater than C roughly equals the required percentage. Specifically, the value of C is set as C 1 = 2.2 for 20% censoring and C 2 = 0.5 for 40% censoring when γ = 0.5, and C 1 = 1.3 for 20% censoring and C 2 = 0.8 for 40% censoring when γ = 1.5.
The simulation data are analyzed using the proposed method described in Section 3. The estimation bias (bias), the empirical standard error (SE), the model based standard error (ME), as well as the coverage rate (CR) of 95% confidence intervals are reported. The analysis results are displayed in Tables 1-5, corresponding to ϕ = 1/3, ϕ = 1/2, ϕ = 2, ϕ = 3, and ϕ = ∞, respectively.
Finite sample biases of the estimatorsλ,γ,β 1 andβ 2 for the marginal model parameters are very small in all settings. Biases of the association parameter estimatorφ are small when the true value of ϕ is small (i.e., strong association), and biases tends to increase with increasing values of ϕ. This phenomenon agrees with what were observed in other studies.
Model-based standard errors and empirical standard errors ofλ,γ,β 1 ,β 2 andφ are reasonably close, and coverage rates of 95% confidence intervals are around 95% for all scenarios except the setting of ϕ = ∞ where 95% coverage rates can not be constructed. As expected, a larger sample size (n = 200) gives better estimation results than a lower sample size (n = 100). The 95% coverage rate forφ is closer to 95% for n = 200 than for n = 100 when the true ϕ is large. Different censoring proportions do not seem to make noticeable changes in estimating λ, γ, β 1 , β 2 , however, for parameter ϕ (except for ϕ = ∞), an increasing percentage of censoring tends to increase the bias, as expected.
In conclusion, the performance of the proposed method is satisfactory; the theoretical results in Section 3 are confirmed by the simulation studies.

PBDE Data Analysis
The proposed composite likelihood method is applied to analyze the data collected from the PBDEs study. The PBDE data contain observations from 316 pregnant women. The outcome variables include the concentrations of the four most commonly detected PBDE congeners, pb47, pb99, pb100, and pb153, in blood samples. Because the positivity, the concentrations can be taken as "survival" outcomes, where the 0 value of the concentrations is treated as being left censored, indicating the actual concentrations are unknown but below the detection limit of the equipment.
Most observed PBDE concentrations assume small values, while only a few observations take extremely large values. Two extremely large observations of the four PBDE congeners are deleted from the analysis due to the possibly involved error.
The majority observations of the congener pb47 are larger than the majority observations of other three PBDE congeners.
In order to make a suitable comparison among the four congeners, all the observations are standardized to assume similar ranges with the original covariance structure retained. The standardization is implemented by first subtracting the corresponding median value of the congener and then divided by the corresponding inter-quantile range for each congener. Table 1 presents the summary statistics of the four PBDE congeners after the standardization. The correlation matrix of the four PBDE congeners remains the same after the standardization. Figure 1 displays the boxplot comparison of the values of the four PBDE congeners before and after the standardization. We further transform the standardized observations by subtracting them from 38 to convert left censoring to right censoring which is more convenient to handle by survival techniques, where 38 is a number greater than the all standardized observations. Potential predictors include maternal education (MEDUC), race, body mass index categories (BMIC), the number of household electronics (Electronics), current employment status (CurrentWork), servings of solid diary per week (Soliddairy), income categories (Income), and whether or not having any processed meat (ProcessedMeat). All the potential predictors are categorical with each category being assigned an integer value. Table 2 displays the categories of each predictor, the frequency, and the corresponding percentage. We employ (k − 1) dummy variables for each categorical predictor to reflect the effects from the different categories, where k represents the number of categories for a predictor.   The proposed composite likelihood method is applied to the standardized PBDE data. For comparison purposes, we also apply the proposed method to the original data without standardization. We first build the full model which contains all possible predictors, and consider the full model as a starting point. We backward eliminate insignificant predictors, where the significance level α = 0.1 is applied for variable elimination. Table 3 and Table 4 report the estimated coefficients of the predictors for the final model for the standardized and the original data. The estimates of the parameters α, γ, and ϕ are presented in Table 5 for both the standardized and the original data. The large estimated positive values of γ show that the risk at the baseline increases as the concentration in the PBDE congeners increases. The small estimated values of ϕ demonstrates that the dependencies between the four PBDE congeners are high.  Table 3 of the results for the standardized data, lower education (under the college level) is significantly associated with lower levels of the PBDE congeners. Higher usage of household electronics is significantly associated with higher levels of the PBDE congeners. Non-Hispanic Whites have significantly lower levels of the PBDE congeners compared to African American and Hispanics. Obese pregnant women have significantly lower levels of the PBDE congeners relative to underweight pregnant women. Pregnant women whose body weight were normal or overweight tend to have lower levels of the PBDE congeners in contrast to normal pregnant women, but this effect is only statistically significant at the significance level 0.1 but not significant at level 0.05.
From Table 4 of the results for the original data, significant predictors are different from those obtained from the standardized data. Pregnant women with 5 − 10 servings of solid dairy per week have significant higher levels of PBDE congeners than those having 5 less or 10 more servings. Processed meat consumption is significantly associated with higher levels of the PBDE congeners.

Conclusion and Discussion
Usual analysis of multivariate survival data requires the specification of the joint survival function. When the dimension of the responses is more than two, the specification of the joint distribution is often difficult and its plausibility is not easy to check. In this paper, we propose to implement the composite likelihood inference which can reduce the burden of the joint distribution specification, and therefore is appealing for handling the case with a big number of responses. In particular, we explore to use the pairwise likelihood formulation where only the pairwise survival distribution is specified. Though other available bivariate survival models can also be used for our proposed estimation procedure, we use the Clayton bivariate model to delineate paired survival times. One advantage of the Clayton model lies in its interpretability from both the frailty and the copula model perspective (e.g., He 2014).
The reliable performance of the proposed composite likelihood method is confirmed through simulation studies. We apply the proposed method to the PBDE data, and the analyses suggest the association among the four PBDE congeners and identify statistical significant predictors for the PBDE congeners.
Although our development is motivated by the PBDE data, our method applies to any correlated survival data, including sequential survival data, clustered survival data, and their mixes, where complex association structures can be facilitated using available bivariate modeling schemes. In our discussion here, we take a common ϕ when characterizing the association for any two survival times. However, this treatment is not an essential requirement for our method, and it can be readily relaxed to hand general correlated survival times which may have different levels of association between different pairs. One may group the survival times into different subfamilies, with each subfamily sharing a common association structure. In these cases the composite likelihood method is still valid with proper modification in the formulation.

Copyrights
Copyright for this article is retained by the author(s), with first publication rights granted to the journal.
This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).