IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Statistical methods for the assessment of clusters discovered in bio-molecular data / G. Valentini. - In: Atti del 6° Congresso della Società Italiana di Biometria. - Pisa : Istituto di Fisiologia Clinica - CNR Pisa, 2007. - pp. 57-60. (Paper presented at the 6th SIB National Congress, Statistics in Life and Environment Sciences, held in Pisa, Italy, in 2007.)

Statistical methods for the assessment of clusters discovered in bio-molecular data

G. Valentini (first author)

2007

Abstract

The assessment of the reliability of clusters discovered in bio-molecular data is a central issue in several bioinformatics problems, ranging from the definition of new taxonomies of malignancies based on bio-molecular data, to the validation of clusters of co-regulated or co-expressed genes, and the discovery of functional relationships from protein-protein interaction data. Recently, several methods based on the concept of stability have been proposed to estimate the reliability and the "optimal" number of clusters. In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. Different procedures have been introduced to randomly perturb the data, including bootstrapping, noise injection, and random projections into lower-dimensional subspaces. Usually, stability-based methods provide only a score or a measure of the reliability of the discovered clusters, without any assessment of the statistical significance of the clustering solutions; moreover, they cannot directly detect multiple structures (e.g. hierarchical structures) simultaneously present in the data. We recently proposed a chi-squared-based statistical test and a distribution-free test based on the classical Bernstein inequality, showing that stability-based methods can be successfully applied to assess the reliability of clusterings, as well as to discover multiple structures underlying complex bio-molecular data.
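To make the stability idea in the abstract concrete, the following is a minimal sketch and not the author's procedure. It assumes noise injection as the perturbation, a plain k-means with farthest-first initialization as the clustering algorithm, and the Rand index (fraction of point pairs on which two clusterings agree) as the similarity between perturbed clusterings. All function names and parameters (`stability_score`, `sigma`, `n_perturb`, and so on) are hypothetical choices made for illustration. Like the score-only methods the abstract mentions, it returns only a stability score and performs no significance test.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Farthest-first initialization (an illustrative choice, not from the source).
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def perturb(points, rng, sigma):
    """Noise injection: add zero-mean Gaussian noise to every coordinate."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of point pairs treated the same way by both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    same = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return same / len(pairs)


def stability_score(points, k, n_perturb=10, sigma=0.1, seed=0):
    """Mean pairwise agreement among clusterings of independently perturbed data."""
    rng = random.Random(seed)
    runs = [kmeans(perturb(points, rng, sigma), k, rng) for _ in range(n_perturb)]
    scores = [pair_agreement(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


def make_blobs(rng, centers, n_per=20, spread=0.3):
    """Synthetic 2-D data: one Gaussian blob around each given center."""
    return [
        (cx + rng.gauss(0.0, spread), cy + rng.gauss(0.0, spread))
        for cx, cy in centers
        for _ in range(n_per)
    ]


# Two well-separated blobs, so k = 2 should be the most stable choice.
data = make_blobs(random.Random(1), [(0.0, 0.0), (5.0, 5.0)])
scores = {k: stability_score(data, k) for k in (2, 3, 4)}
for k, s in scores.items():
    print(f"k={k}: stability={s:.3f}")
```

A number of clusters whose partition survives the perturbations scores close to 1, while a number that forces arbitrary splits of a real cluster tends to score lower. Deciding whether such a difference is statistically significant is the question that the chi-squared and Bernstein-based tests mentioned in the abstract are meant to address, and it is outside the scope of this sketch.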

	Scientific-disciplinary sectors of the contribution
	
			Sector INF/01 - Informatica (Computer Science)
		
	Publication date
	
			2007
		
	Type
	
			Book Part (author)
		
	Appears in types:
	
			03 - Contribution in volume

Files in this item:

File	Size	Format
valentini-stability.pdf (open access; type: Pre-print, manuscript submitted to the publisher)	342.04 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/44214
