Cluster analysis has been widely applied to investigate structure in bio-molecular data: for instance, unsupervised learning methods that exploit the overall gene expression profile of a patient may discover subclasses of pathologies that cannot be detected with traditional biochemical, histopathological and clinical criteria. Unfortunately, clustering algorithms may find structure in the data even when none is present. Hence we need methods for assessing the validity of the discovered clusters, in order to test whether biologically meaningful clusters exist. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in complex bio-molecular data. In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. In particular, Ben-Hur, Elisseeff and Guyon proposed perturbing the original data through subsampling procedures and then applying a suitable clustering algorithm to the subsampled data; after estimating the stability of the obtained solutions through a pairwise clustering similarity measure, they assessed the "optimal" number of clusters by visual inspection of the similarity measures across different numbers of clusters. In this paper we propose an improvement of the Ben-Hur algorithm that assesses the significance level of the solutions by introducing a quantitative approach and a statistical test based on the distribution of suitable similarity measures between pairs of clusterings of projected data. We also propose a new way to perturb the data, based on random projections into lower-dimensional subspaces, that seems well suited to the characteristics (high dimensionality, redundancy, noise) of genomic and proteomic data.
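The stability idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of k-means as the clustering algorithm, a Gaussian random projection matrix, the Jaccard coefficient as the pairwise similarity measure, and all function and parameter names (`random_projection`, `kmeans`, `jaccard_similarity`, `stability_scores`, `target_dim`, `n_pairs`) are assumptions made for illustration. The statistical test on the distribution of similarity values is not covered here.

```python
import itertools
import math
import random


def random_projection(data, target_dim, rng):
    """Project each row of `data` into `target_dim` dimensions.

    Assumption: a Gaussian random matrix scaled by 1/sqrt(target_dim).
    """
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(source_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(source_dim))
             for j in range(target_dim)] for row in data]


def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient over pairs of points placed in the same cluster."""
    both = either = 0
    for i, j in itertools.combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def stability_scores(data, k, target_dim, n_pairs, seed=0):
    """Similarity between clusterings of pairs of independently projected data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        labels_a = kmeans(random_projection(data, target_dim, rng), k, rng)
        labels_b = kmeans(random_projection(data, target_dim, rng), k, rng)
        scores.append(jaccard_similarity(labels_a, labels_b))
    return scores


if __name__ == "__main__":
    # Synthetic data: two separated groups in 20 dimensions (illustrative only).
    rng = random.Random(1)
    data = [[rng.gauss(center, 1.0) for _ in range(20)]
            for center in (0.0, 10.0) for _ in range(15)]
    for k in (2, 3, 4):
        scores = stability_scores(data, k, target_dim=5, n_pairs=10)
        print(k, sum(scores) / len(scores))
```

In a sketch like this, a candidate number of clusters whose similarity values stay concentrated near 1 across perturbations would be regarded as more stable than one whose values are spread out; comparing those distributions quantitatively is the role of the statistical test mentioned in the abstract.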
|Title:||Model order selection for clustered bio-molecular data|
|Internal authors:||BERTONI, ALBERTO (First); VALENTINI, GIORGIO (Last)|
|Scientific disciplinary sector:||Sector INF/01 - Computer Science|
|Publication date:||Jun-2006|
|Type:||Book Part (author)|
|Appears in types:||03 - Contribution in volume|