Identification of Biomarkers for Predicting the Overall Survival of Ovarian Cancer Patients : a Sparse Group Lasso Approach

Next-generation sequencing has been routinely applied to cancer biology, making it possible for researchers to elucidate the molecular mechanisms underlying cancer initiation and progression. However, how to identify oncomarkers from massive complex genomic data poses a great challenge for both modeling and computing. In this paper, we propose a novel computational pipeline to identify genes related to the overall survival of ovarian cancer patients from the rich Cancer Genome Atlas data. Different from the existing studies, we incorporate dependence structure among genes and pathway information into the variable selection. Firstly, the dimensionality of the ovarian cancer data is reduced by a novel stepwise feature screening which mimics the hierarchy of the underlying causal network. The second step of the pipeline is to divide genes into clusters with distinct cellular functions by k-means, x-means and PAMSAM learning algorithms. In the final step, we fit a cox proportional hazard model with a sparse group lasso penalty for further variable selection. Of the 115 genes in the final list, many were reported to be associated with cancer initiation or progression in the literature. In addition, we find several gene families including the NEK family and RNF family, which are closely associated with the survival of ovarian cancer patients.


Introduction
Ovarian cancer is one of the most malignant gynecologic cancers, ranking fifth as the cause of cancer-related deaths among women in the United States.According to American Cancer Society, about 22, 280 women will receive a new diagnosis of ovarian cancer and about 14, 240 women will die from this disease in 2016.The latest data shows that about 70% of deaths occur in patients with high-grade serous epithelial ovarian cancer.The standard treatment for these patients is usually debulking surgery, followed by platinum-taxane chemotherapy.Platinum resistant cancer recurs within six months in about 25% of patients and the overall five-year survival rate is about 31%.Approximately 13% of high-grade serous ovarian cancer can be attributed to germline mutations in BRCA1 and BRCA2 and a smaller percentage can be accounted for by other germline mutations (The Cancer Genome Atlas Research Network, 2011).
In this paper, we aim to identify prognostic genes which play crucial roles in the survival of the ovarian cancer patients.Relevant works include but are not limited to McLaughlin et al. (McLaughlin et al., 2013), Nagle, Chenevixtrench, Webb, & Spurdle (Nagle, Chenevixtrench, Webb, & Spurdle, 2007), and Konstantinopoulos et al. (Konstantinopoulos et al., 2008).However, the methods used in current studies tend to be over simplistic and inaccurate, mostly the single-round independent screening based on the t test or proportional hazard model.As pointed out by many researchers, these naive methods might result in poor selection of important features by overlooking the complex dependence structure among them.For instance, the independent test may suffer from spurious correlations in high dimensional data and fail to identify important features presented in the underlying causal network.Due to several major difficulties, the association between cancer survival and different signaling pathways has been much less studied and numerous questions remain unanswered in this field.To fill this gap, we develop an efficient and general pipeline which achieves higher accuracy in the biomarker/pathway identification.The advantage of the proposed methods is twofold.First, the feature selection is conducted in account of the dependence structure among genes, resulting in a more accurate selection of biomarkers that may directly or indirectly affect cancer survival.Second, the sparse group lasso method offers a further refinement for the candidate set of biomarkers, by encouraging genes in the same pathway to be selected and balancing gene-wise and group-wise selection in the meanwhile.
The rest of paper is organized as follows.In Section 2, we briefly summarize the TCGA ovarian cancer data and elaborate the three steps in the computational pipeline: (1) stepwise feature selection for initial screening; (2) k-means clustering along with a "elbow method" to determine the number of clusters; (3) Cox proportional hazards model with sparse group lasso penalty for further variable selection.We present and discuss the main results from the analysis in Section 3, and conclude the article in Section 4.

Data Integration and Preprocessing
Using "data matrix" tool on the TCGA website, we extracted the level-3 microarray data containing the expression level of 17, 814 genes in 568 tumor samples, as well as the clinical information.Table 1 summarizes our data set.The overall survival time of each patient is defined as the time between diagnosis and death.The censoring indicator was set to be 1 if death event occurred and 0 otherwise.Throughout this study, we assume that censoring mechanism is independent of survival mechanism.The gene expression data were normalized using a quantile normalization method by Balstad et al. (Balstad et al., 2002) to correct the bias due to non-biological causes.We applied an existing method by Hsu et al. (Hsu et al., 2012) to remove age and batch effects (three age groups are defined as < 40 y.o., [40,70] y.o., and > 70 y.o.).This method is based on a median-matching and variance-matching strategy.For example, the batch-effect-adjusted gene expression value can be obtained as follows: where g i jk represents the gene expression value for gene i from batch j and sample k, M i j refers to the median of g i j = (g i j1 , ..., g i jn ), M i refers to the median of g i = (g i1 , ..., g iJ ), σg i and σg i j are the sample standard deviation of g i and g i j , respectively.

Stepwise Feature Selection
A necessary and crucial step for genome-wide association study is feature screening, i.e., to filter out irrelevant or redundant features.A refined variable set helps improve computing efficiency and estimation accuracy (Zhang et al., 2014).
Existing feature selection methods can be classified into either wrapper approach (Kohavi & John, 1997; Leng, Valli, & Armstrong, 2010) or filter approach (Haindl, Somol, Ververidis, & Kotropoulos, 1999;Jouve & Nicoloyannis, 2010).The filter approach using independent test for two conditions is more commonly used due to its efficiency and simplicity.However, it tends to filter out many related features in high-dimensional settings.To this end, Zhang et al. (Zhang et al., 2014) proposed a novel stepwise correlation-based selector (SCBS) to select features from TCGA data for further Bayesian network inference.Assume there is a causal chain X→Y→cancer.Though X to Y or Y to cancer has directed association, the association between X and cancer could greatly decay so that it cannot be detected by independent test.The SCBS procedure starts with detection of features strongly correlated with the phenotype and then progressively selects features that correlate with features selected in previous step.This procedure is a natural mimic of sparse network structure and is capable of identifying nodes that are indirectly associated with the phenotype.In practice, the method can be implemented as follows: • Step 1: Calculate the Spearman's correlation coefficients between the current variable X i and all the other variables, denoted by ρ i j , j i. Keep k most correlated variables with X i based on ρ i j for further filtering. • Step 2: Calculate the p-value of correlation coefficient for each of the k variables from step 1, select the variable if the p-value is significant under Benjamini-Hochberg (BH) procedure with FDR≤ 0.05.
• Step 3: Repeat step 1 and 2 until p variables are selected.
In practice, the total number of selected variables p is subject to the scale of the model to build.For the TCGA data, we run the SCBS for 4 rounds, in order to select more than 500 but less than 1000 genes.The computing time of SCBS is sublinear to p.The choice of k is essential for SCBS which partially depends on the network density.Based on an extensive simulation study, a k of 4 or 5 is recommended by Zhang et al. (Zhang et al., 2014) to attain moderate complexity or sparsity of the model.We set k = 4 in our analysis and obtained a set of 603 genes.

K-means Clustering
The unsupervised k-means clustering is applied to cluster the 603 selected genes based on a correlation metric defined as follows: where g i and g j represent expression level of gene i and gene j, i, j = 1, 2, ..., p, g i is n-dimensional vector where n is number of samples.The Spearman's correlation between gene i and gene j is denoted by ρ rg i ,rg j , where rg i and rg j represent the ranks of g i and g j .An immediate consequence by this definition is 0 ≤ ||g i , g j || ρ ≤ 1, where the two equalities hold when ρ rg i ,rg j = 0 and |ρ rg i ,rg j | = 1, respectively.The k-means clustering algorithm aims to partition p variables into K clusters C = (C 1 , C 2 , ..., C K ), where K is a predefined number of clusters.Its objective is to find: An "elbow method" was used for the choice of optimal number of clusters.Figure 1a shows the percentage of variance explained by the clusters against the number of clusters.At K=4 or 5, the marginal gain began to drop substantially, giving an angle in the graph.The number of clusters was chosen at the "elbow" K=4.The multi-dimensional Scaling (MDS) plot is shown in Figure 1b where clusters were highlighted by different colors.The x-means clustering (Pelleg & Moore, 2000), an alternative and variation of k-means, and PAMSAM algorithm were also applied to our data set.However, in terms of clustering, three methods did not give a significant difference.

Cox Proportional Hazard Model with Sparse Group Lasso Penalty
The last step of the pipeline conducts pathway level selection of prognostic genes.A natural way is to fit a regression model with group lasso penalty where each group represents a pathway.The group lasso, however, only works for large number of groups and gives a sparse set of groups.We therefore turned to a sparse group lasso (SGL), which generates a solution balancing both between-group and within-group sparsity.A Cox proportional hazards model and sparse group lasso regularization were then pieced together for further variable selection.
Let p be the number of genes, X = (X 1 , X 2 , ..., X n ) T be the n × p data matrix, where .., Y n ) T denote an n-dimensional vector which corresponds to failure/censor times.Let β β β = (β 1 , β 2 , ..., β p ) T be the vector of coefficients, and δ δ δ = (δ 1 , δ 2 , ..., δ n ) be the censoring indices, where δ i = 1 indicates event (death) occurred for subject i and δ i = 0 indicates censoring.The Cox proportional hazards model can be written as follows: where λ 0 (t) stands for the baseline hazard function.The loglikelihood can be written as follows: With the assumption of sparsity, the parameters β β β can be estimated through a SGL penalized likelihood: where || • || 1 and || • || 2 denote ℓ 1 -norm and ℓ 2 -norm respectively, and p k represents the size of group k and β β β k represents the coefficients of genes in group k.The SGL fit is simply a combination of the lasso and group lasso penalties (α = 0 gives the group lasso fit, α = 1 gives the lasso fit).In practice, one should choose α before the parameter estimation.In our problem, we expect a strong overall sparsity but encourage grouping, therefore a α = 0.8 was used.Here the choice of α is different from the choice of λ, which can be determined by data-driven method.In practice, the mixing rate α need to be predefined depending on the expected overall sparsity and group sparsity.Given two tuning parameters α and λ, a routine blockwise coordinate descent (BCD) approach can solve the optimization problem and we implemented the BCD algorithm using R package SGL (Simon, Friedman, Hastie, & Tibshirani, 2013).A sequence of ten candidate λ's with λ min = 0.05λ max in the regularization path was used, as suggested by R package SGL.
In the lasso-type problems, the common method for selecting the tuning parameter λ is cross-validation.However, it tends to yields a large number of false positives in the sparse network problem, as pointed out by Fu and Zhou in their seminal paper (Fu & Zhou, 2013).Fu and Zhou proposed an "elbow method" that outperforms the cross-validation method, where the optimal tuning parameter corresponds to the change point at which an increase of λ does not yield a substantial decrease of log-likelihood.In our Cox model with SGL regularization, the optimal lambda selected by this rule is λ = 0.000492 as shown in Figure 2 and 115 genes were identified in the final list.
Figure 2. Selection of tuning parameter for the penalty term is sparse group lasso regression.The log-likelihood (y-axis) against tuning parameter (x-axis) for the sparse group lasso penalty, where the optimal λ is circled.

Gene Clusters
Using the k-means approach, the set of 603 genes from initial screening were further clustered into four subgroups.Gene functions in each cluster were investigated.Interestingly, we found that genes within the same cluster tend to have similar/related cellular functions.For instance, cluster 1 (black dots in Figure 1b), the core cluster containing 407 genes including CENPJ and CDK5RAP2, is functionally related to cell cycle, spindle formation, and mitosis etc. Cluster 2 (blue dots in Figure 1b) contains 48 genes including MYOG and CDK5R2, mostly related to protein binding and transmembrane activity etc. Cluster 3 (green dots in Figure 1b), containing 85 genes including COL5A2 and COL8A2, corresponds to the pathways related to collagen biosynthesis and enzymes modification etc. Cluster 4 (red dots in Figure 1b), containing 63 genes including CD48 and CD53, is related to immune response and T-cell and B-cell development.This finding indicates that certain cellular pathways/functions may play crucial roles in the progression of the serous ovarian cancer, which may provide new clues for the cancer prevention and treatment.

Prognostic Gene Identification
Using the Cox model with sparse group lasso penalty, we obtain a final list of 115 prognostic genes (in Table 2), of which many were reported to be involved in cancer initiation and progression.To name a few, gene CTSS is closely related to gastric cancer and silencing CTSS expression suppressed the migration and invasion of gastric cancer cells (Yang et al., 2010).Gene CD248 can facilitate tumor growth via its cytoplasmic domain and multiple pathways regulated by the cytoplasmic domain of CD248 highlight its potential as a therapeutic target to treat cancer (Maia et al., 2011).Gene DRD3 is a dopamine receptor, whose expression can change as stress factors associated with breast cancer (Pornour et al., 2014).Gene CDK5RAP2 is required for spindle checkpoint function and is a common target in paclitaxel and doxorubicin.Cancer cells cultured in the presence of paclitaxel or doxorubicin exhibit a dramatic decrease in CDK5RAP2 levels (Zhang et al., 2009).
We also identify several subgroups (families) of genes whose associations with cancer have been reported.For instance, three genes in the NEK family, NEK1, NEK2 and NEK9, were identified in our final list.Mutations of NEK family members have also been identified as drivers behind the development of ciliopathies and cancer.Recent emergence of comprehensive cancer genomes is highlighting certain members of the NEK family as targets of frequent mutations (Moniz, Dutt, Haider, & Stambolic, 2011).We also identified five genes in the RNF family: RNF12, RNF181, RNF20, RNF31,RNF7.A recent study reported that the RNF family such as RNF20 drives histone H2B monoubiquitylation and modulates inflammation and inflammation-associated cancer in mice and humans (Tarcic et al., 2016).
The predictive power of our 115-gene signature was illustrated using Kaplan-Meier curves in Figure 3, where the samples were equally divided into two groups based on the hazard risk.The moderate separation of two groups demonstrates the effectiveness of our method.

Conclusion
In this paper, we developed a flexible three-step computational pipeline for identifying prognostic biomarkers related to the overall survival of serous ovarian cancer patients using the rich TCGA data set.This pipeline facilitates the pathway level analysis of the biomarkers associated with cancer survival.The proposed methods are computationally efficient and can be generally applied to many large-scale genomic cancer data sets including the TCGA data.We applied this pipeline to TCGA ovarian cancer data and identified a list of 115 genes, as well as several gene families including the NEK family and RNF family, which may greatly affect the overall survival of ovarian cancer patients.Some of these findings are well supported by literature.

Figure 1 .
Figure 1.Four gene clusters.(a) The proportion of variance that can be explained by clustering (y-axis) against the number of clusters (x-axis) based on different values of k (k=1,2,...,15) by k-means clustering method.From this plot, the most likely number of clusters is four.(b) Multidimensional scaling (MDS) plots based on correlation dissimilarity metric among 603 genes, where genes in different clusters were highlighted by different colors.

Figure 3 .
Figure 3. Kaplan-Meier curves.Survival probability against time (in days) of different groups by the hazard risk based on the 115-gene signature.The red and blue lines are based on high-risk group and low-risk group, respectively.

Table 1 .
Data types, platforms and sample size in the analysis

Table 2 .
List of 115 identified prognostic genes and corresponding coefficients in the Cox model.