The Gene Function Prediction (GFP) problem consists in inferring biological properties for the genes whose function is unknown or only partially known, and raises challenging issues from both a machine learning and a computational biology standpoint. The GFP problem can be formalized as a semisupervised learning problem in an undirected graph. Indeed, given a graph with a partial graph labeling, where nodes represent genes, edges functional relationships between genes, and labels their membership to functional classes, GFP consists in inferring the unknown functional classes of genes, by exploiting the topological relationships of the networks and the available a priori knowledge about the functional properties of genes. Several networkbased machine learning algorithms have been proposed for solving this problem, including Hopfield networks and label propagation methods; however, some issues have been only partially considered, e.g. the preservation of the prior knowledge and the unbalance between positive and negative labels. A first contribution of the thesis is the design of a Hopfieldbased cost sensitive neural network algorithm (COSNet) to address these learning issues. The method factorizes the solution of the problem in two parts: 1) the subnetwork composed by the labelled vertices is considered, and the network parameters are estimated through a supervised algorithm; 2) the estimated parameters are extended to the subnetwork composed of the unlabeled vertices, and the attractor reached by the dynamics of this subnetwork allows to predict the labeling of the unlabeled vertices. The proposed method embeds in the neural algorithm the “a priori” knowledge coded in the labeled part of the graph, and separates node labels and neuron states, allowing to differentially weight positive and negative node labels, and to perform a learning approach that takes into account the “unbalance problem” that affects GFP. A second contribution of this thesis is the development of a new algorithm (LSI ) which exploits some ideas of COSNet for evaluating the predictive capability of each input network. By this algorithm we can estimate the effectiveness of each source of data for predicting a specific class, and then we can use this information to appropriately integrate multiple networks by weighting them according to an appropriate integration scheme. Both COSNet and LSI are computationally efficient and scale well with the dimension of the data. COSNet and LSI have been applied to the genomewide prediction of gene functions in the yeast and mouse model organisms, achieving results comparable with those obtained with stateoftheart semisupervised and supervised machine learning methods.
GRAPHBASED APPROACHES FOR IMBALANCED DATA IN FUNCTIONAL GENOMICS / M. Frasca ; relatore: A. Bertoni ; correlatore: G. Valentini ; coordinatore: E. Damiani. Universita' degli Studi di Milano, 2012 Mar 06. 24. ciclo, Anno Accademico 2011. [10.13130/frascamarco_phd20120306].
GRAPHBASED APPROACHES FOR IMBALANCED DATA IN FUNCTIONAL GENOMICS
M. Frasca
2012
Abstract
The Gene Function Prediction (GFP) problem consists in inferring biological properties for the genes whose function is unknown or only partially known, and raises challenging issues from both a machine learning and a computational biology standpoint. The GFP problem can be formalized as a semisupervised learning problem in an undirected graph. Indeed, given a graph with a partial graph labeling, where nodes represent genes, edges functional relationships between genes, and labels their membership to functional classes, GFP consists in inferring the unknown functional classes of genes, by exploiting the topological relationships of the networks and the available a priori knowledge about the functional properties of genes. Several networkbased machine learning algorithms have been proposed for solving this problem, including Hopfield networks and label propagation methods; however, some issues have been only partially considered, e.g. the preservation of the prior knowledge and the unbalance between positive and negative labels. A first contribution of the thesis is the design of a Hopfieldbased cost sensitive neural network algorithm (COSNet) to address these learning issues. The method factorizes the solution of the problem in two parts: 1) the subnetwork composed by the labelled vertices is considered, and the network parameters are estimated through a supervised algorithm; 2) the estimated parameters are extended to the subnetwork composed of the unlabeled vertices, and the attractor reached by the dynamics of this subnetwork allows to predict the labeling of the unlabeled vertices. The proposed method embeds in the neural algorithm the “a priori” knowledge coded in the labeled part of the graph, and separates node labels and neuron states, allowing to differentially weight positive and negative node labels, and to perform a learning approach that takes into account the “unbalance problem” that affects GFP. A second contribution of this thesis is the development of a new algorithm (LSI ) which exploits some ideas of COSNet for evaluating the predictive capability of each input network. By this algorithm we can estimate the effectiveness of each source of data for predicting a specific class, and then we can use this information to appropriately integrate multiple networks by weighting them according to an appropriate integration scheme. Both COSNet and LSI are computationally efficient and scale well with the dimension of the data. COSNet and LSI have been applied to the genomewide prediction of gene functions in the yeast and mouse model organisms, achieving results comparable with those obtained with stateoftheart semisupervised and supervised machine learning methods.File  Dimensione  Formato  

phd_unimi_R08181.pdf
accesso aperto
Tipologia:
Tesi di dottorato completa
Dimensione
811.03 kB
Formato
Adobe PDF

811.03 kB  Adobe PDF  Visualizza/Apri 
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.