Benchmarking Classification Models for
Cancer Prediction from Gene Expression Data:
A Novel Approach and New Findings
Geetha RAMANI, Shomona Gracia JACOB
Anna University (CEG Campus),
Guindy, Chennai, India – 600 025
Abstract: Gene selection from gene expression data for cancer prediction has been an area of intensive research, aimed at identifying the minimal and optimal set of candidate genes that yields accurate predictive performance. The two major problems encountered in this process are the high dimensionality of the data relative to the small number of instances and the need to categorize records under multiple classes. In this paper we propose a novel approach called Rank-Weight Feature Selection that combines the filtering capacity of more than one feature selection algorithm to detect the minimal set of predictive genes that yields higher predictive performance in categorizing and predicting diverse oncogenic gene expression data. The filtered features (genes) are weighted by the number of feature relevance algorithms that report them as significant. The ranked genes are then used to validate the proposed method with ten classifiers over five diverse gene expression datasets. The results show that the proposed approach achieves higher predictive performance with fewer features than previously reported results, using the most relevant and minimal set of genes, and they allow us to recommend classifiers based on their accuracy and reliability in predicting cancer data.
Keywords: Cancer prediction, Gene Expression, Feature Relevance, Multi-class classification.
CITE THIS PAPER AS:
R. Geetha RAMANI, Shomona Gracia JACOB, Benchmarking Classification Models for Cancer Prediction from Gene Expression Data: A Novel Approach and New Findings, Studies in Informatics and Control, ISSN 1220-1766, vol. 22 (2), pp. 133-142, 2013.
In recent years, gene expression profiling and data analysis have gained remarkable momentum as a means of obtaining new insights into the regulation of cellular processes in biological systems of substantial significance [1-2]. Selecting the relevant genes that differentiate cancerous from healthy patients is a common task and has been researched extensively. Cancer prediction from microarray data currently faces two major problems. The first is the need to identify the most relevant genes for subsequent analysis and use in diagnostic practice; the second is the need to identify and design novel computational techniques that generate optimal predictive performance with those genes [1-4]. We believe this research area is of great interest to investigators from both the biological and informatics fields, who seek the best predictive techniques and wish to explore the relevant genes for diagnostic, prognostic and therapeutic purposes. Cancer is one of the most deadly genetic diseases, and reports trace its cause to inherited mutations or epigenetic alterations that lead to a modified gene expression profile in oncogenic cells. Subsequent research focused on microarray technology to identify up- or down-regulated genes that play a major role in targeted cancers, the activation of oncogenic pathways, and the detection of previously unknown biomarkers for clinical diagnosis [4-6]. Previous studies on gene selection and cancer prediction have affirmed that it is necessary to find, for each cancer type, an optimal set of genes that serve as predictors and help classify differently labelled cells with high prediction accuracy [1-3]. Hence, determining potentially predictive genes to predict and categorize oncogenic ailments has been the rationale for this research. We believe this will enhance the current state of diagnostic and prognostic practice for diverse cancers.
In this paper, we propose a novel predictor method that utilizes multiple feature relevance analysis and classification techniques to identify the minimal and optimal set of genes for cancer prediction. The proposed model of feature evaluators and classifiers is validated through 10-fold cross-validation on five different gene expression datasets. Specifically, this paper makes the following contributions: 1) a novel and general framework for cancer prediction from gene expression datasets with improved prediction accuracy is proposed, 2) the minimal and optimally relevant genes are identified for use in diagnostic purposes, 3) the performance of both evolutionary and supervised machine learning algorithms in multi-class categorization of five gene expression datasets is compared and evaluated.
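As a rough illustration of the validation protocol mentioned above, the following sketch generates the train/test index splits for k-fold cross-validation using only the Python standard library. The function name `k_fold_indices`, the shuffling seed and the sample count of 25 are assumptions made for this example; the paper does not state how folds were drawn or whether they were stratified by class.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Plain (non-stratified) splitting is assumed here for simplicity.
    """
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training set = every sample that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 25 samples split into 10 folds.
splits = list(k_fold_indices(n=25, k=10))
```

Each sample appears in exactly one test fold, so every record contributes once to the reported accuracy. With few instances per class, a stratified split may be preferable, but that is a design choice outside what the text specifies.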
The choice of datasets was made to identify classifier performance on diverse kinds of data (different target values, instances and number of features) while the choice of feature selection algorithms was made to include the effects of both subset and ranking attribute evaluators.
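The idea of weighting genes by how many feature relevance algorithms report them can be sketched as a simple voting scheme. This is only a minimal sketch under stated assumptions, not the authors' implementation: the selector names, gene identifiers, the top-k cut-off applied to ranking evaluators and the weight threshold of 2 are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_weight(selections: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Weight each gene by the number of selectors that reported it.

    `selections` maps a (hypothetical) selector name to the genes it flagged.
    Returns (gene, weight) pairs, highest weight first; ties are broken
    alphabetically so the output is deterministic.
    """
    votes: Counter = Counter()
    for genes in selections.values():
        votes.update(set(genes))  # each selector votes at most once per gene
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


def top_k(ranking: List[str], k: int) -> List[str]:
    """Turn a ranking evaluator's ordered output into a subset (assumed cut-off k)."""
    return ranking[:k]


# Illustrative inputs only: gene and selector names are made up.
selections = {
    "subset_evaluator_a": ["g1", "g2", "g5"],
    "subset_evaluator_b": ["g2", "g5", "g7"],
    "ranking_evaluator_c": top_k(["g5", "g2", "g9", "g1"], k=3),
}
weighted = rank_weight(selections)
# Keep genes reported by at least two selectors (threshold is an assumption).
selected = [gene for gene, weight in weighted if weight >= 2]
```

Subset evaluators contribute their selected set directly, while a ranking evaluator first needs a cut-off to become a subset; both then cast one vote per gene.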
The rest of this paper is organized as follows: Section 2 reviews recent and related work in the field of cancer prediction from gene expression data. Section 3 describes the proposed framework, while Section 4 elaborates on the experimental setup and discusses the results obtained. Section 5 concludes the paper with possible scope for further investigation.
- GORDON, G. J., et al., Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research, vol. 62(17), 2002, pp. 4963-4967.
- BAKER, S. G, Simple and Flexible Classification of Gene Expression Microarrays via Swirls and Ripples. BMC Bioinformatics, vol. 11, 2010, p. 452.
- BANERJEE, M., S. MITRA, H. BANKA, Evolutionary-Rough Feature Selection in Gene Expression Data. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 37, 2007, pp. 622-632.
- LIOTTA, L., E. PETRICOIN, Molecular Profiling of Human Cancer. Nature Reviews Genetics, vol. 1(1), 2000, pp. 48-56.
- TAN, A. C., D. GILBERT, Ensemble Machine Learning on Gene Expression Data for Cancer Classification. Applied Bioinformatics, vol. 2(3), 2003, pp. S75-83.
- DUPUY, A., R. M. SIMON, Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting. Journal of the National Cancer Institute, vol. 99(2), 2007, pp. 147-157.
- DAGLIYAN, O., F. UNEY-YUKSEKTEPE, I. H. KAVAKLI, M. TURKAY, (2011), Optimization Based Tumor Classification from Microarray Gene Expression Data. PLoS ONE 6(2): e14579. doi:10.1371/journal.pone.0014579
- DIAZ-URIARTE, R., S. ALVAREZ DE ANDRES, Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics, vol. 7, 2006, p. 3.
- WANG, SIMON, Microarray-based Cancer Prediction using Single Genes, BMC Bioinformatics, vol. 12, 2011, p. 391.
- MRAMOR, M., G. LEBAN, J. DEMSAR, B. ZUPAN, Visualization-based Cancer Microarray Data Classification Analysis. Bioinformatics, vol. 23(16), 2007, pp. 2147-2154. A.I. Lab, Ljubljana, http://www.biolab.si/supp/bi-cancer/projections/index.htm
- Waikato Environment for Knowledge Analysis (WEKA) Machine Learning Tool, http://www.cs.waikato.ac.nz/ml/weka/
- LI BI-QING, et al., Predict and Analyze S-Nitrosylation Modification Sites with the mRMR and IFS Approaches. Journal of Proteomics vol. 7(S), 2012, pp. 1654-1655.
- BÄCK, T., Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Oxford University Press, 1996.
- ASHLOCK, D., 2006, Evolutionary Computation for Modeling and Optimization. Springer, ISBN 0-387-22196-4.
- KOTSIANTIS, S. B., Supervised Machine Learning: A Review of Classification Techniques. Informatica, vol. 31, 2007, pp. 249-268.
- GEETHA RAMANI, R., S. G. JACOB, Improved Classification of Lung Cancer Tumors Based on Structural and Physicochemical Properties of Proteins Using Data Mining Models. PLoS ONE 8(3): e58772. doi:10.1371/journal.pone.0058772
- YI PENG, GANG KOU, DAJI ERGU, WENSHUAI WU, YONG SHI, An Integrated Feature Selection and Classification Scheme. Studies in Informatics and Control, ISSN 1220-1766, vol. 21(3), 2012, pp. 241-248.
- GANG KOU, YI PENG, YONG SHI, WENSHUAI WU, Classifier Evaluation for Software Defect Prediction. Studies in Informatics and Control, ISSN 1220-1766, vol. 21(2), 2012, pp. 117-126.
- SIFAOUI, A., A. ABDELKRIM, S. ALOUANE, M. BENREJEB, On New RBF Neural Network Construction Algorithm for Classification. Studies in Informatics and Control, ISSN 1220-1766, vol. 18(2), 2009, pp. 103-110.
- JACOB, S. G., R. GEETHA RAMANI, Discovery of Knowledge Patterns in Clinical Data through Data Mining Algorithms: Multi-class Categorization of Breast Tissue Data. International Journal of Computer Applications, vol. 32(7) 2011, pp. 46-53.
- GEETHA RAMANI, R., S. G. JACOB, Prediction of P53 Mutants (Multiple Sites) Transcriptional Activity based on Structural (2D & 3D) Properties. PLoS ONE 8(2): e55401. doi:10.1371/journal.pone.0055401.
- NUTT, C. L., D. R. MANI, R. A. BETENSKY, P. TAMAYO, Gene Expression-based Classification of Malignant Gliomas Correlates Better with Survival than Histological Classification. Cancer Research, 2003.
- POMEROY et al., Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression. Nature, vol. 415, 2002, pp. 436-442, doi:10.1038/415436a.