Parameter Estimation of Shared Frailty Models Based on Particle Swarm Optimization

Standard survival techniques such as proportional hazards model are suffering from the unobserved heterogeneity. Frailty models provide an alternative way in order to account for heterogeneity caused by unobservable risk factors. Although vast studies have been done on estimation procedures, Evolutionary Algorithms (EAs) haven’t received much attention in frailty studies. In this paper, we investigate the estimation performance of maximum likelihood estimation (MLE) via Particle Swarm Optimization (PSO) in modelling multivariate survival data with shared gamma frailty. Simulation studies and real data application are performed in order to assess the performance of MLE via PSO, quasi-Newton and conjugate gradient method.


Introduction
The general concept of survival models is to investigate the impact of risk factors on individual failure times.However, it's impossible to include all important factors into the model.This is due to the lack of information on individual level or the researcher may not be aware of the importance of the factor, or even the existence of it (Hougaard, 1991).Several authors studied the consequences of ignoring the existence of unobserved heterogeneity caused by unobservable risk factors such as Aalen (1988), Lancaster (1979Lancaster ( , 1990)), Henderson and Oman (1999) and Van den Berg (2001).
In order to account for unobserved heterogeneity, the frailty term was first introduced by (Vaupel, Manton, & Stallard, 1979) for univariate survival data.They indicated that the individual hazard is the product of two terms: an individual level frailty and a baseline hazard function.The multivariate generalization was then introduced by Clayton (1978).The proposed model was a random effect model, which is an extension of a proportional hazards (PH) approach.In a shared frailty model, lifetimes of a group of observations in the same cluster (e.g., individuals in a family) are related to each other (Aalen, Borgan, & Gjessing, 2008).Each cluster shares the same level of frailty.In other words, the observations in a cluster share the same unobservable risks such as genetic structures.The variance of this common frailty is a measure of dependence among lifetimes within a cluster.
There exists a huge amount of literature on estimating parameters according to information of the distributional form of the baseline hazard function.If a parametric form is not assumed for the baseline hazard, different estimation methods are proposed for the semi-parametric frailty models (Clayton & Cuzick, 1985;Gill, 1985;Clayton, 1991;McGilchrist & Aisbett, 1991;Klein, 1992;Nielsen, Gill, Andersen, & Srensen, 1992;McGilchrist, 1993;Rondeau, Commenges, & Joly, 2003;Therneau, Grambsch, & Pankratz, 2003).Also, important works relevant with Bayesian framework were offered by Sinha and Dey (1997), Ibrahim, Chen and Sinha (2001) and Hanagal and Pandey (2015).On the other hand, when the baseline hazard is represented parametrically, model parameters can be estimated with MLE via derivative-based methods such as Newton Raphson, quasi-Newton and conjugate gradient.In spite of having useful properties, these techniques have some drawbacks.The main drawback is that the starting point should be chosen with a value close to a global optimum because they can easily get stuck in a local optimum at multiple solution space.The poor choice of an initial value for any model parameter could lead to finding an estimate that is far away from the optimal solution.When the number of model parameters is too large, choosing suitable initial values becomes harder and researchers are never sure whether the resulting solution is a local or global.Also, biased parameter estimates can be obtained due to the heavy censoring or small sample size.
PSO is a member of stochastic search family inspired by some natural phenomena to solve complicated and high-dimensional optimization problems.It is a global search strategy that has properties of being robust, fast converge and easy implemented for linear/non-linear functions.In contrast with the derivative-based methods, it is less sensitive to the selection of initial parameter values.Besides, it gives reliable estimates even though the number of parameters being estimated is oversized.Recently, PSO has become an important estimation technique for censored data (see Wang & Huang, 2014).However, to the best of our knowledge, there is no work that uses PSO in frailty models.The main purpose of this study is to present the estimation performance of PSO in modelling multivariate survival data with shared frailty.Two simulation studies with different parameter settings are conducted to evaluate the performances of MLE via PSO, quasi-Newton and conjugate gradient methods.It is shown that PSO is an efficient method to obtain maximum likelihood estimates (MLEs) even though the following data characteristics exist: (i) small number of clusters (ii) large number of observations in a cluster (iii) large number of censored observations (heavy censored).The remainder of paper is organized as follows.Section 2 provides a general concept of shared frailty models.Section 3 provides the basic steps of standard PSO method.Section 4 outlines the MLE via PSO procedure and demonstrates its performance using two simulation studies.Section 5 provides a real data application to confirm the efficiency of procedure.

Shared Frailty Models
Suppose that there are N clusters and each cluster i has n i observations (i = 1, ..., N). t i j is the observed failure time of j th ( j = 1, ..., n i ) observation in i th cluster.Under a right censoring scheme, t i j = min(c i j , t * i j ) where t * i j is the failure time and c i j is the censoring time.Here, t * i j and c i j are independent random variables.The observed censoring indicator δ i j is equal to 1 if t * i j < c i j , and 0 otherwise.Conditional on frailty z i (> 0) and X i j , the hazard function of i th cluster has the form where h 0 (.) is the baseline hazard function, X i j is a vector of observed covariates for the j th observation and β is a vector of regression parameters.It is assumed that survival times (in cluster i) are conditionally independent with respect to the frailty.This common frailty is the cause of dependence among lifetimes of within a cluster (Wienke, 2011).The frailties,z i , are i.i.d.variables with the common probability density function g(z i ).
Let (t i1 , t i2 , ..., t in i ) denote n i survival times of i th cluster.The conditional survival function in that cluster can be expressed such as (for a given Z i = z i ; t i j > 0 ; j = 1, 2, ..., n i ), where Λ 0 (.) = ∫ t i j 0 h 0 (s)ds is the common cumulative baseline hazard.Here, S (.\.) = exp(−z i Λ 0 (t i j )e β ′ X i j ) denotes the survival function of jth observation conditional on frailty.The conditional likelihood function of ith cluster has the form where ψ and β are the vector of baseline hazard parameters and regression parameters, respectively.Unconditional likelihood for observed data can be obtained by integrating Eq.( 3) with respect to frailty terms Various studies have been done on the choice of distribution of frailty random variables.While some authors use continuous distributions such as Gamma (Clayton, 1978;Vaupel et al., 1979), inverse Gaussian (Hougaard, 1984;Whitmore & Lee, 1991;Hanagal & Sharma, 2015), log-normal (McGilchrist & Aisbett, 1991) and positive stable (Hougaard, 1986), other authors use discrete distributions (Caroni, Crowder, & Kimber, 2010;Ata & Ozel, 2012).However, the Gamma distribution is the most common and widely used in literature for determining the frailty effect, which acts multiplicatively on the baseline hazard (Wienke, 2011).Due to its computational convenience, one parameter Gamma distribution (with mean 1 and variance θ) is used as the frailty distribution.The probability density function of one parameter Gamma distribution is as follows The larger value of θ indicates the greater degree of heterogeneity among lifetimes within a cluster.When Gamma distribution is degenerate at value 1, Eq.( 1) reduces to the standard PH model.Hence, we can conclude that θ takes the value 0 and lifetimes within clusters are independent.
Under the concept of gamma frailty, unconditional survival function for cluster i is obtained by integrating conditional survival function over the Gamma distribution and can be written as The unconditional hazard function of corresponding cluster is, Once the parametric form of baseline hazard is specified, the unconditional likelihood function can be easily derived (Wienke, 2011).Taking into account of all clusters and denoting the number of observed events in each cluster as δ i j , one might write the following unconditional likelihood function as (Klein, 1992;Duchateau & Janssen, 2008;Wienke, 2011), where i = 1, ..., N and j = 1, ..., n i .MLEs of model parameters can be obtained by maximizing Eq.( 8).

Estimation Method: PSO
PSO is a member of Evolutionary Algorithms (EAs) and can be used to overcome some limitations of derivative-based methods.It was first developed by Kennedy and Eberhart (1995) as a population-based method.It is easy to implement for complex and high-dimension, linear/non-linear functions.PSO simultaneously searches a global optimum using more than one candidate solutions in different regions of multiple solution space.Therefore, the problem of being stuck at a local optimum is almost avoided (Aladag, Yolcu, Egrioglu, & Dalar, 2012).Overall, PSO is not affected by the existence of large numbers of unknown parameters as much as derivative-based methods.The algorithm has its own parameters: particle size, inertia weight, acceleration coefficients and velocity vector.
A search in solution space is similar to bird food-finding behavior in a flock.In contrast with the derivative-based methods, the PSO algorithm starts with many candidate solutions (particles).The population of particles is called a swarm, and each particle has a location vector P m and a velocity vector V m .A particle flies thought the solution space in search for a better position and adjusts its own position with reference to the values of personel best (Pbest) and neighbors global best (Gbest).The basic steps standard PSO are given as follows (Egrioglu, Yolcu, Aladag, & Kocak, 2013): [Step 2]: Let ⃗ Pbest m represents the vector of best position of m th particle, and ⃗ Gbest is the particle which has the best fitness function value, found so far by all particles.According to the fitness function, determine ⃗ Pbest m and ⃗ Gbest that are respectively given by ( 9) and ( 10 where c 1 and c 2 are the positive acceleration coefficients (learning factors); ω is an inertia parameter, and r 1 and r 2 are uniform random variables which are generated within [0, 1].Generally, both the values of (c 1 , c 2 ) are taken fixed at 2 and ω = 0.5 + rand/2 changes in each iteration (Hu, Eberhart, & Shi, 2004). [ Step 4]: Stop when pre-specified iteration number is met.⃗ Gbest is the vector of optimal solution.Otherwise, return to Step 2.
The Gelman-Rubin (G-R) statistic, modified by Prasad and Souradeep (2012), is used in order to be sure that pre-specified iteration number satisfy the convergence for each position related to unknown model parameter.Let T represents the total iteration number.p m,k is the trajectory of mth particle of kth position.The variances of within particles (W) and between particles (B) are calculated for position k with the following formulas ( 13) and ( 14), where where pk = 1

√
Vk /W k monitors the convergence ability of performed total iteration.For the value of close to 1 suggests a good convergence achieved for the corresponding parameter k.

Simulation Study
In this section, the estimation performance of MLE via PSO is investigated by two scenarios.In the first simulation study, the performance of PSO is compared with the quasi-Newton method, and then in the second one it is compared with the conjugate gradient method.
For both of simulation studies, the baseline cumulative hazard function is assumed to be Weibull as Λ 0 (.) = λt p .In the first simulation study, Weibull parameters are set to (ψ = λ, p) = (0.2, 1.5) and the frailty variables are taken as Gamma distributed randoms with variance θ= 1.5.Three observed covariates x (1) , x (2) and x (3) are used and randomly generated from Bernoulli distribution with success probability 0.1, 0.5 and 0.9, respectively.Regression parameters are assumed to be β 1 =β 2 =β 3 = 1.In the second simulation study, Weibull parameters are set to (ψ = λ, p) = (0.5, 1).The variance of frailty is taken as θ= 1.Two observed covariates, x (1) and x (2) , are randomly generated from Binomial distribution having the same success probability 0.5 and the number of trials are taken as 2 and 3, respectively.Regression parameters are set to be β 1 =β 2 = 0.5.
The model for the failure times (t i1 , ..., t in i ) is considered as, where βX i j = β 1 x (1)  i j + β 2 x (2) i j + β 3 x (3) i j in the first simulation and βX i j = β 1 x (1) i j + β 2 x (2) i j in the second simulation.Following the study of Yu (2006), the multivariate survival data are generated as follows: (1) frailty variables are generated from Gamma distribution with mean 1 and variance θ. (2) for the j th observation in i th cluster, uniform random variable is generated as u i j ∼ U(0, 1).Also censoring time c i j is generated from an exponential distribution exp [κ] with parameter κ to create expected censoring rates.
(3) the failure times are generated by, ) 1/p (15) (4) the observed failure times are t i j = min(c i j , t * i j ).The observed censoring indicator δ i j is equal to 1 if t * i j < c i j , and 0 otherwise.
For both of simulations, different size of clusters (n i = 2, 3, 4) and different number of clusters (N = 20, 30, 40) are taken into account.Three censoring rates are chosen as 5%, 20% and 40% which represent light, medium and heavy censoring, respectively.Simulations are repeated 100 times.Mean absolute percentage error (MAPE), bias and standard error (SE) vectors are computed for all cases.Let η shows the vector of real parameters and η shows the vector of MLEs.In the first simulation study η = ( β1 , β2 , β3 , λ, p, θ), and in the second one η = ( β1 , β2 , λ, p, θ).To show the results of 54 different cases less complicated, we report the mean of parameters' MAPEs (mMAPE), biases (mBias) and SEs (mSE) given below, where ηr is the vector of MLEs obtained in r th repeat and E(η) = 100 ∑ r=1 ηr /100.Besides these measures, the mean of log-likelihood values obtained in each repeat are calculated (mLog.Lik.).The steps of MLE via PSO are given as follows: [ Step 1]: PSO parameters are selected as (ω, c 1 , c 2 , L) = (0.5 + rand/2, 2, 2, 30).T is taken as 1000.In the first simulation study the solution space is 6-dimensional and in the second, it is 5-dimensional.Positions of each particle and their corresponding velocities are generated randomly from uniform distribution in the range of [0,3] and [0,0.5],respectively.Positions of a particle in the swarm for the first and second simulation are illustrated in Figure 1 and 2, respectively.Gbest is the vector of MLEs found by using PSO (it is observed that T=1000 is sufficient for converge).To perform the MLE with either the quasi-Newton or conjugate gradient methods, randomly choose a particle from the initial swarm and use it as the initial vector.The results are shown in Table 1 and Table 2.As observed in the tables, MLE via PSO has lower mMAPE and mBias values for all cases.For all cases except two, the values of mSE are lower in the first simulation.In most cases, as the number of clusters increase, a remarkable reduction in mMAPE, mSE and mBias for PSO is seen.However, the quasi-Newton or conjugate gradient methods do not provide the same pattern.For the same level of N and δ, when the size of clusters increase, mMAPE, mSE and mBias decrease in all cases of MLE via PSO except one (in this case, mBias remains constant).It should also be noted that increasing censoring rate has lower impact on the values of mMAPE, mSE and mBias in MLE via PSO procedure.The kidney infection data (McGilchrist & Aisbett, 1991) has been widely used in frailty studies.The data set consists of first and second recurrence times of infection for 38 kidney patients using a portable dialysis machine.The event of interest is the occurrence of an infection at the point of insertion of the catheter.The catheter is removed when the infection is observed.Then, the infection is cleared up and the catheter is reinserted for the second time.Each lifetime (t i1 , t i2 > 0) represents the time elapsed (in days) from the catheter insertion to infection.Censoring occurs when the catheter is removed for any reasons except the infection.
We first investigate the goodness of fit of the kidney infection data.Weibull and exponential probability plots (P-P) are shown in Figures 5 and 6.According to P-P plots, Weibull fits the data better than the exponential distribution.Kolmogorov-Smirnov (KS) tests suggest that Weibull is an appropriate distribution on the choice of baseline hazard.
Table 3 provides the K-S statistics and associated p-values.).The hazard function of i th patient conditional on Gamma shared frailty at time t i j > 0 takes the form (i = 1, ..., 38; j = 1, 2), h(t i j \z i , X i j ) = z i (λpt i j p−1 ) exp(β 1 x (age) i j The results of MLE with PSO, quasi-Newton and conjugate gradient are reported in Table 4. Standard errors of parameters (Se) are obtained from the inverse of observed information matrix I(β, ψ, θ, ) = −H(β, ψ, θ), where H(β, ψ, θ) denotes the Hessian matrix.As observed in Table, lower standard errors and higher Log.Lik.values are obtained from PSO.Also, the G-R converge statistics of PSO calculated for 5 parameter estimates are found satisfactory (that is, equal to or less than 1.1).-332.488 -364.718 -364.776Note: The initials selected randomly for quasi-Newton and conjugate gradient are (0.257,0.929,0.307,1.180,0.098)

Conclusion
In this paper, the estimation performance of MLE via PSO is investigated against the estimation performance of MLE via quasi-Newton and conjugate gradient methods for the shared Gamma frailty model.Based on two different simulation studies, PSO procedure outperforms the two derivative-based methods for most cases based on all comparison criteria.Also it is shown that smaller biases are obtained under the situation of heavy censoring.MLE via PSO is also implemented to a real data set and it is shown that this procedure can be preferred for estimating the parameters of parametric shared frailty models.
Step 1]: In d-dimensional search space, stochastically generate a position and a velocity for each particle and then store as the vectors ⃗ P m = {p m,1 , ..., p m,k , ..., p m,d } and ⃗ V m = {v m,1 , ..., v m,k , ..., v m,d }, respectively.For m = 1, ..., L and k = 1, ..., d; p m,k denotes the k th position of m th particle and L is the total number of particles in a swarm.
and B k , the variance of the stationary distribution is calculated as, Vk = [(W k (T − 1))/T ] + [B k /T ].An estimated potential scale reduction factor (G-R statistic) Rk =

Figure 1 .Figure 2 .
Figure 1.Positions of a particle in Simulation 1

Table 1 .
Simulation Results for Model 1

Table 3 .
0.805)0.12050.2201 Two primary risk factors are included in the model: age and sex of patient (0=male, 1=female