Model order selection for clustered bio-molecular data / A. Bertoni, G. Valentini. - In: Probabilistic Modeling and Machine Learning in Structural and Systems Biology / edited by J. Rousu, S. Kaski, E. Ukkonen. - Helsinki : Helsinki University Printing House, June 2006. - ISBN 9521032774. - pp. 85-90 (Probabilistic Modeling and Machine Learning in Structural and Systems Biology Workshop, held in Tuusula in 2006).
Model order selection for clustered bio-molecular data
A. Bertoni (first author); G. Valentini (last author)
2006
Abstract
Cluster analysis has been widely applied to investigate structure in bio-molecular data: for instance, unsupervised learning methods that exploit the overall gene expression profile of a patient may discover subclasses of pathologies that cannot be detected with traditional biochemical, histopathological and clinical criteria. Unfortunately, clustering algorithms may find structure in the data even when none is present. Hence we need methods for assessing the validity of the discovered clusters, to test whether biologically meaningful clusters exist. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in complex bio-molecular data. In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. In particular, Ben-Hur, Elisseeff and Guyon proposed to perturb the original data through subsampling procedures and then apply a suitable clustering algorithm to the subsampled data; after estimating the stability of the obtained solutions through a pairwise clustering similarity measure, they assessed the "optimal" number of clusters by visual inspection of the similarity measures across different numbers of clusters. In this paper we propose an improvement of the Ben-Hur algorithm that assesses the significance level of the solutions, by introducing a quantitative approach and a statistical test based on the distribution of suitable similarity measures between pairs of clustered projected data. We also propose a new way to perturb the data, based on random projections into lower dimensional subspaces, that seems to be well-suited to the characteristics (high dimensionality, redundancy, noise) of genomic and proteomic data.

File | Access | Type | Size | Format
---|---|---|---|---
bertoni-vale-pmsb06-revised.pdf | Open access | Post-print, accepted manuscript, etc. (version accepted by the publisher) | 178.5 kB | Adobe PDF
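
The stability idea summarized in the abstract (perturb the data by random projection, cluster each perturbed copy, and compare pairs of clusterings) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Gaussian projection matrix, the plain k-means clusterer, the Jaccard coefficient as the pairwise similarity measure, and all function names and parameters. The statistical test mentioned in the abstract is not reproduced; the sketch only collects the similarity values such a test would operate on.

```python
import numpy as np


def random_projection(X, d, rng):
    """Project the rows of X into d dimensions with a Gaussian random matrix
    (an assumed projection scheme, chosen for simplicity)."""
    R = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)
    return X @ R


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (assumed clusterer); returns one label per row."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient over pairs of points placed in the same cluster
    (an assumed similarity measure; it is invariant to label permutation)."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    upper = np.triu_indices(len(labels_a), k=1)  # each unordered pair once
    a, b = same_a[upper], same_b[upper]
    union = np.sum(a | b)
    return np.sum(a & b) / union if union > 0 else 1.0


def stability_scores(X, k, d, n_pairs, seed=0):
    """Similarity values between pairs of clusterings of independently
    projected copies of X, for a fixed number of clusters k."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_pairs):
        labels_a = kmeans(random_projection(X, d, rng), k, rng)
        labels_b = kmeans(random_projection(X, d, rng), k, rng)
        scores.append(jaccard_similarity(labels_a, labels_b))
    return np.array(scores)


# Synthetic example: three separated groups in 200 dimensions.
demo_rng = np.random.default_rng(42)
group_means = demo_rng.normal(scale=3.0, size=(3, 200))
X_demo = np.vstack([m + demo_rng.normal(size=(20, 200)) for m in group_means])

for k in range(2, 6):
    s = stability_scores(X_demo, k=k, d=20, n_pairs=10)
    print(f"k={k}: mean similarity {s.mean():.3f}")
```

In a stability-based analysis, values of k whose similarity scores concentrate near 1 would be candidates for a reliable number of clusters; how that concentration is tested for significance is the subject of the paper and is left out of this sketch.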