Ensembles based on Random Projection for gene expression data analysis

R. Folgieri
2008

Abstract

In this work we focus on methods for solving classification problems characterized by high-dimensional, low-cardinality data. These characteristics are typical of bio-molecular data analysis, and in particular of class prediction with microarray data. Many methods have been proposed to address this problem, which is subject to the so-called curse of dimensionality (a term introduced by Richard Bellman (9)). Among them are gene selection methods, principal and independent component analysis, and kernel methods. We propose and experimentally analyze two ensemble methods based on two randomized techniques for data compression: Random Subspaces and Random Projections. While Random Subspaces, originally proposed by T. K. Ho, is a technique related to feature subsampling, Random Projections is a feature extraction technique motivated by the Johnson-Lindenstrauss theory of distance-preserving random projections. The randomness underlying the proposed approach leads to diverse sets of extracted features corresponding to low-dimensional subspaces with low metric distortion and approximate preservation of the expected loss of the trained base classifiers. In the first part of the work we justify our approach with two theoretical results. The first concerns unsupervised learning: we prove that a clustering algorithm minimizing the (quadratic) objective function provides an ε-closed solution when applied to data compressed according to Johnson-Lindenstrauss theory. The second concerns supervised learning: we prove that polynomial kernels are approximately preserved by Random Projections, up to a degradation proportional to the square of the degree of the polynomial. In the second part of the work we propose ensemble algorithms based on Random Subspaces and Random Projections and compare them experimentally with a single SVM and other state-of-the-art ensemble methods on three gene expression data sets: Colon, Leukemia and DLBL-FL (Diffuse Large B-cell and Follicular Lymphoma). The results confirm the effectiveness of the proposed approach. Moreover, we observed some performance degradation of Random Projection methods when the base learners are SVMs with a polynomial kernel of high degree.
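The two compression techniques named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the Gaussian projection matrix scaled by 1/sqrt(k), and the synthetic data sizes are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random

def random_subspace(X, k, rng):
    """Random Subspace (feature subsampling): keep the same k randomly
    chosen original features for every sample."""
    d = len(X[0])
    idx = sorted(rng.sample(range(d), k))
    return [[row[j] for j in idx] for row in X]

def random_projection(X, k, rng):
    """Random Projection (feature extraction): multiply each sample by a
    k x d matrix with N(0, 1) entries scaled by 1/sqrt(k) (assumed form)."""
    d = len(X[0])
    scale = 1.0 / math.sqrt(k)
    P = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(p * x for p, x in zip(prow, row)) for prow in P] for row in X]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def max_distortion(X, Z):
    """Largest relative change in squared pairwise distance after compression."""
    worst = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            ratio = sq_dist(Z[i], Z[j]) / sq_dist(X[i], X[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst

# Synthetic "few samples, many features" data (hypothetical sizes).
rng = random.Random(0)
n, d, k = 8, 1000, 200
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

Z_rp = random_projection(X, k, rng)  # k new features, each mixing all d inputs
Z_rs = random_subspace(X, k, rng)    # k of the original d features
print("max distortion under random projection:", max_distortion(X, Z_rp))
```

In an ensemble setting, each base classifier would be trained on a different random draw of `Z_rp` or `Z_rs`, which is where the diversity mentioned in the abstract comes from. Note that unscaled feature subsampling shrinks squared distances by roughly k/d, so the distortion measure above is meaningful mainly for the projected data.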
random subspace ; supervised learning ; support vector machine ; DNA microarray ; cancer
Sector INF/01 - Computer Science
Sector ING-INF/05 - Information Processing Systems
BERTONI, ALBERTO
VALENTINI, GIORGIO
PIURI, VINCENZO
Doctoral Thesis
Ensembles based on Random Projection for gene expression data analysis / R. Folgieri ; --. DIPARTIMENTO DI SCIENZE DELL'INFORMAZIONE, 2008. 20th cycle, academic year 2006/2007. [10.13130/folgieri-raffaella_phd2008].
Files in this item:
Tesi-Folgieri.pdf (open access)
Type: Post-print, accepted manuscript, etc. (version accepted by the publisher)
Size: 1.06 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/45878