Optimal Algorithm for Metabolomics Classification and Feature Selection varies by Dataset

Charles E. Determan Jr

Abstract


Metabolomics, the systematic identification and quantification of all metabolites in a biological system, is increasingly applied towards identification of biomarkers for disease diagnosis, prognosis and risk prediction. Applications of metabolomics extend across the health spectrum including Alzheimer's, cancer, diabetes, and trauma. Despite the continued interest in metabolomics there are numerous techniques for analyzing metabolomics datasets with the intent to classify group membership (e.g. Control or Treated). These include Partial Least Squares Discriminant Analysis, Support Vector Machines, Random Forest, Regularized Generalized Linear Models, and Prediction Analysis for Microarrays. Each classification algorithm is dependent upon different assumptions and can potentially lead to alternate conclusions. This project seeks to conduct an in depth comparison of algorithm performance on both simulated and real datasets to determine which algorithms perform best given alternate dataset structures. Three simulated datasets were generated to validate algorithm performance and mimic 'real' metabolomics data: (Han et al., 2011) independent null dataset (no correlation, no discriminatory variables), (Davis, Schiller, Eurich, & Sawyer, 2012) correlated null (no discriminating variables), (Guan et al., 2009) correlated discriminatory. This comparison is also applied to 3 open-access datasets including two Nuclear Magnetic Resonance (NMR) and one Mass Spectrometry (MS) dataset. Performance was evaluated based on the Robustness-Performance-Trade-off (RPT) incorporating a balance between model classification accuracy and feature selection stability. We also provide a free, open-source R Bioconductor package (OmicsMarkeR) that conducts the analyses herein. The proposed work provides an important advancement in metabolomics analysis and helps alleviate the confusion of potentially paradoxical analyses thereby leading to improved exploration of disease states and identification of clinically important biomarkers.




DOI: http://dx.doi.org/10.5539/ijb.v7n1p100

International Journal of Biology   ISSN 1916-9671(Print)   ISSN 1916-968X  (Online)

Copyright © Canadian Center of Science and Education

To make sure that you can receive messages from us, please add the 'ccsenet.org' domain to your e-mail 'safe list'. If you do not receive e-mail in your 'inbox', check your 'bulk mail' or 'junk mail' folders.

------------------------------------------------------------------------------------------------------------------------------

scholar_logo_lg_2011_120 proquest_logo_120 lockss_logo_2_120