GRAPH-BASED APPROACHES FOR IMBALANCED DATA IN FUNCTIONAL GENOMICS

Frasca, M.

doi:10.13130/frasca-marco_phd2012-03-06

The Gene Function Prediction (GFP) problem consists in inferring biological properties for the genes whose function is unknown or only partially known, and raises challenging issues from both a machine learning and a computational biology standpoint. The GFP problem can be formalized as a semi-supervised learning problem in an undirected graph. Indeed, given a graph with a partial graph labeling, where nodes represent genes, edges functional relationships between genes, and labels their membership to functional classes, GFP consists in inferring the unknown functional classes of genes, by exploiting the topological relationships of the networks and the available a priori knowledge about the functional properties of genes. Several network-based machine learning algorithms have been proposed for solving this problem, including Hopfield networks and label propagation methods; however, some issues have been only partially considered, e.g. the preservation of the prior knowledge and the unbalance between positive and negative labels. A first contribution of the thesis is the design of a Hopfield-based cost sensitive neural network algorithm (COSNet) to address these learning issues. The method factorizes the solution of the problem in two parts: 1) the subnetwork composed by the labelled vertices is considered, and the network parameters are estimated through a supervised algorithm; 2) the estimated parameters are extended to the subnetwork composed of the unlabeled vertices, and the attractor reached by the dynamics of this subnetwork allows to predict the labeling of the unlabeled vertices. The proposed method embeds in the neural algorithm the “a priori” knowledge coded in the labeled part of the graph, and separates node labels and neuron states, allowing to differentially weight positive and negative node labels, and to perform a learning approach that takes into account the “unbalance problem” that affects GFP. A second contribution of this thesis is the development of a new algorithm (LSI ) which exploits some ideas of COSNet for evaluating the predictive capability of each input network. By this algorithm we can estimate the effectiveness of each source of data for predicting a specific class, and then we can use this information to appropriately integrate multiple networks by weighting them according to an appropriate integration scheme. Both COSNet and LSI are computationally efficient and scale well with the dimension of the data. COSNet and LSI have been applied to the genome-wide prediction of gene functions in the yeast and mouse model organisms, achieving results comparable with those obtained with state-of-the-art semi-supervised and supervised machine learning methods.

GRAPH-BASED APPROACHES FOR IMBALANCED DATA IN FUNCTIONAL GENOMICS / M. Frasca ; relatore: A. Bertoni ; correlatore: G. Valentini ; coordinatore: E. Damiani. Universita' degli Studi di Milano, 2012 Mar 06. 24. ciclo, Anno Accademico 2011. [10.13130/frasca-marco_phd2012-03-06].