Statistical methods for the assessment of clusters discovered in bio-molecular data / G. Valentini. In: Atti del 6° Congresso della Società Italiana di Biometria. Pisa: Istituto di Fisiologia Clinica, CNR Pisa, 2007, pp. 57-60. (Paper presented at the 6th SIB National Congress, Statistics in Life and Environment Sciences, held in Pisa, Italy, in 2007.)
Statistical methods for the assessment of clusters discovered in bio-molecular data
G. Valentini
2007
Abstract
The assessment of the reliability of clusters discovered in bio-molecular data is a central issue in several bioinformatics problems, ranging from the definition of new taxonomies of malignancies based on bio-molecular data, to the validation of clusters of co-regulated or co-expressed genes, to the discovery of functional relationships from protein-protein interaction data. Recently, several methods based on the concept of stability have been proposed to estimate the reliability and the "optimal" number of clusters. In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. Different procedures have been introduced to randomly perturb the data, including bootstrapping techniques, noise injection into the data, and random projections into lower-dimensional subspaces. Usually, stability-based methods provide only a score or a measure of the reliability of the discovered clusters, without any assessment of the statistical significance of the clustering solutions; moreover, they cannot directly detect multiple structures (e.g. hierarchical structures) simultaneously present in the data. Recently we proposed a chi-squared-based statistical test and a distribution-free test based on the classical Bernstein inequality, showing that stability-based methods can be successfully applied to assess the reliability of clusterings, as well as to discover multiple structures underlying complex bio-molecular data.
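The stability framework described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: Gaussian noise injection as the perturbation, a plain k-means with farthest-point initialization as the clustering algorithm, the Rand index as the similarity between clusterings, and the names `kmeans`, `rand_index`, `stability`, and `toy`. The sketch yields only a stability score; it does not implement the chi-squared or Bernstein-based significance tests mentioned above.

```python
import random
from itertools import combinations


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with farthest-point initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, n_perturb=10, noise=0.1, seed=0):
    """Mean pairwise Rand index over clusterings of noise-perturbed copies of the data."""
    rng = random.Random(seed)
    labelings = []
    for _ in range(n_perturb):
        noisy = [tuple(x + rng.gauss(0, noise) for x in p) for p in points]
        labelings.append(kmeans(noisy, k))
    scores = [rand_index(a, b) for a, b in combinations(labelings, 2)]
    return sum(scores) / len(scores)


# Hypothetical toy data: two compact, well-separated groups of 16 points each.
group = [(0.1 * i, 0.1 * j) for i in range(4) for j in range(4)]
toy = group + [(x + 5.0, y + 5.0) for x, y in group]

for k in (2, 3, 4):
    print(k, round(stability(toy, k), 3))
```

On data like `toy`, a clustering that matches the true group structure should be reproduced under every perturbation, while forcing more clusters than groups makes the partition depend on the injected noise. Comparing the scores across values of k is the basic idea behind using stability to suggest a number of clusters.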
File | Access | Type | Size | Format
---|---|---|---|---
valentini-stability.pdf | Open access | Pre-print (manuscript submitted to the publisher) | 342.04 kB | Adobe PDF