IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Model order selection for clustered bio-molecular data / A. Bertoni, G. Valentini. - In: Probabilistic Modeling and Machine Learning in Structural and Systems Biology / edited by J. Rousu, S. Kaski, E. Ukkonen. - Helsinki : Helsinki University Printing House, June 2006. - ISBN 9521032774. - pp. 85-90 (Probabilistic Modeling and Machine Learning in Structural and Systems Biology Workshop, held in Tuusula in 2006).

Model order selection for clustered bio-molecular data

A. Bertoni (first author); G. Valentini (last author)

2006

Abstract

Cluster analysis has been widely applied to investigate structure in bio-molecular data: for instance, unsupervised learning methods that exploit the overall gene expression profile of a patient may discover subclasses of pathologies that cannot be detected with traditional biochemical, histopathological and clinical criteria. Unfortunately, clustering algorithms may find structure in the data even when none is present. Hence we need methods for assessing the validity of the discovered clusters, to test whether biologically meaningful clusters exist. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in complex bio-molecular data. In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. In particular, Ben-Hur, Elisseeff and Guyon proposed to perturb the original data through subsampling procedures and then apply a suitable clustering algorithm to the subsampled data; after estimating the stability of the obtained solutions through a pairwise clustering similarity measure, they assessed the "optimal" number of clusters by visual inspection of the similarity measures across different numbers of clusters. In this paper we propose an improvement of the Ben-Hur algorithm that assesses the significance level of the solutions by introducing a quantitative approach and a statistical test based on the distribution of suitable similarity measures between pairs of clustered projected data. We also propose a new way to perturb the data, based on random projections into lower-dimensional subspaces, that seems to be well suited to the characteristics (high dimensionality, redundancy, noise) of genomic and proteomic data.
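The perturb-cluster-compare loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm or the similarity measure, so plain k-means and a pairwise Jaccard coefficient are assumptions made here, and the statistical test on the score distributions is left out. All function names (`random_projection`, `kmeans`, `jaccard_similarity`, `stability_scores`) and parameters are hypothetical.

```python
import itertools
import math
import random


def random_projection(data, target_dim, rng):
    """Project each row of `data` into `target_dim` dimensions
    using a Gaussian random matrix (an assumed choice of projection)."""
    d = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(d)]
    return [[sum(x[i] * matrix[i][j] for i in range(d))
             for j in range(target_dim)] for x in data]


def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient over pairs of points that are co-clustered
    in each of the two labellings."""
    both = either = 0
    for i, j in itertools.combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability_scores(data, k, target_dim, n_pairs, seed=0):
    """Similarity between clusterings of independently projected copies
    of the same data, for a fixed number of clusters k."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        first = kmeans(random_projection(data, target_dim, rng), k, rng)
        second = kmeans(random_projection(data, target_dim, rng), k, rng)
        scores.append(jaccard_similarity(first, second))
    return scores
```

Calling `stability_scores` for each candidate `k` gives one distribution of similarity scores per candidate; distributions concentrated near 1 indicate clusterings that survive the perturbation. Comparing those distributions quantitatively is the step the paper addresses with a statistical test, which this sketch does not attempt.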

Scientific-disciplinary sectors of the contribution (display only)
	Sector INF/01 - Computer Science
Publication date
	Jun-2006
Type
	Book Part (author)
Appears in types:
	03 - Book contribution

Files in this item:

File	Access	Type	Size	Format
bertoni-vale-pmsb06-revised.pdf	Open access	Post-print, accepted manuscript etc. (version accepted by the publisher)	178.5 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/19875
