Dataset Complexity Can Help to Generate Accurate Ensembles of K-Nearest Neighbors / O. Okun, G. Valentini. - In: Neural Networks 2008, IJCNN 2008. (IEEE World Congress on Computational Intelligence) : IEEE International Joint Conference on. - Piscataway : IEEE, 2008. - ISBN 9781424418206. - pp. 450-457. (Conference: IEEE International Joint Conference on Neural Networks - IJCNN 2008 (IEEE World Congress on Computational Intelligence), held in Hong Kong in 2008.)

Dataset Complexity Can Help to Generate Accurate Ensembles of K-Nearest Neighbors

G. Valentini
Last author
2008

Abstract

This work focuses on gene-expression-based cancer classification using classifier ensembles. A new ensemble method is proposed that combines the predictions of a small number of k-nearest neighbor (k-NN) classifiers by majority vote. Diversity of predictions is guaranteed by assigning each classifier a separate feature subset, randomly sampled from the original set of features. Accuracy of the k-NNs is ensured by exploiting the statistically confirmed dependence between classification error and dataset complexity, a measure of how difficult a dataset is to classify. Experiments on three gene expression datasets covering different types of cancer show that the ensemble method is superior to 1) the single best classifier in the ensemble, 2) the nearest shrunken centroids method originally proposed for gene expression data, and 3) the traditional ensemble construction scheme that does not take dataset complexity into account.
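
The ensemble idea described in the abstract can be illustrated with a minimal sketch: each k-NN member sees its own randomly sampled feature subset, and the members' predictions are combined by majority vote. This is not the authors' implementation. The function names, default parameters (`n_classifiers`, `subset_size`, `k`), and the use of Euclidean distance are assumptions made for illustration, and the complexity-based selection of feature subsets is left out because the abstract does not specify the complexity measure.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    # Squared Euclidean distance (an assumption) is enough for ranking.
    ranked = sorted(
        ((sum((a - b) ** 2 for a, b in zip(row, x)), label)
         for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def project(rows, features):
    """Keep only the selected feature columns of each row."""
    return [[row[j] for j in features] for row in rows]


def ensemble_predict(train_X, train_y, test_X,
                     n_classifiers=5, subset_size=10, k=3, seed=0):
    """Random-feature-subset k-NN ensemble combined by majority vote."""
    rng = random.Random(seed)
    n_features = len(train_X[0])
    # One random feature subset (sampled without replacement) per member.
    subsets = [rng.sample(range(n_features), subset_size)
               for _ in range(n_classifiers)]
    members = [(feats, project(train_X, feats)) for feats in subsets]

    predictions = []
    for x in test_X:
        votes = [knn_predict(sub_train, train_y, [x[j] for j in feats], k)
                 for feats, sub_train in members]
        # Majority vote across the ensemble members.
        predictions.append(Counter(votes).most_common(1)[0][0])
    return predictions
```

With two classes, an odd `n_classifiers` and an odd `k` avoid tied votes. A complexity-aware variant would additionally score each candidate feature subset and keep only those judged easy to classify before voting.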
Sector INF/01 - Computer Science
2008
IEEE
Book Part (author)
Use this identifier to cite or link to this document: https://hdl.handle.net/2434/49872