IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

This thesis is dedicated to the study of algorithms for clustering and robust unsupervised learning. Specifically, a significant part of this thesis is dedicated to the problem of emph{semi-supervised cluster recovery}. Given a finite set of $n$ points partitioned according to an unknown clustering, the goal is to recover the unknown clustering using as few emph{oracle queries} as possible. We mostly focus on the framework of the emph{same-cluster queries}: given a pair of points $x$ and $y$, a same cluster query returns a binary answer saying emph{yes} if $x$ and $y$ belongs to the same cluster, and emph{no} otherwise. In this framework, we seek for algorithms that by actively asking queries reconstruct exactly the underlying clustering. The target emph{query complexity} (i.e. the number of queries asked by the algorithm) if $O(log n)$. We also consider a more general (and weaker) notion of supervision offered by the emph{similarity queries}. Given a pair of points $x$ and $y$, the oracle answer emph{yes} if the points are emph{similar} and emph{no} otherwise. In this context, the similarity may, for example, represent a notion of metric-closeness between points or also the membership to the same-cluster. In the former case, similarity queries may be seen as same-cluster queries with errors since points that are close may be well into different clusters. This type of feedback is common for clustering problems on graphs, where the presence of an edge between two vertices do not need to establish the membership to the same cluster. In this framework, we consider the case of emph{Correlation Clustering} and design a simple and efficient query-based approximation algorithm for it. We prove also that the same approximation algorithm features some cluster recovery properties. Finally a third contribution goes into the area of emph{robust unsupervised learning}. In this framework, data comes from a mixture distribution $lambda mu + (1-lambda) u$ where $ u$ is noise, and the goal is to identify a model with a small risk w.r.t. the target distribution $mu$. We show that the minimization of a certain emph{L-statistic} of the training data, provably leads to a model with small risk w.r.t. $mu$; even in the high noise regime. Our method works in the general emph{reconstruction error minimization} framework which encompass $k$-means and $k$-median clustering, principal subspace analysis, sparse coding and non-negative matrix factorization. Since the proposed method is NP-Hard to compute, we complement the results with an efficient Lloyd-like descent algorithm.

ALGORITHMS FOR CLUSTERING AND ROBUST UNSUPERVISED LEARNING PROBLEMS / A. Paudice ; tutor: N. A. CESA BIANCHI, M. PONTIL ; coordinator: P. BOLDI. Dipartimento di Informatica Giovanni Degli Antoni, 2022 Jan 24. 34. ciclo, Anno Accademico 2021.

ALGORITHMS FOR CLUSTERING AND ROBUST UNSUPERVISED LEARNING PROBLEMS

A. Paudice

2022

Abstract

This thesis is dedicated to the study of algorithms for clustering and robust unsupervised learning. Specifically, a significant part of this thesis is dedicated to the problem of emph{semi-supervised cluster recovery}. Given a finite set of $n$ points partitioned according to an unknown clustering, the goal is to recover the unknown clustering using as few emph{oracle queries} as possible. We mostly focus on the framework of the emph{same-cluster queries}: given a pair of points $x$ and $y$, a same cluster query returns a binary answer saying emph{yes} if $x$ and $y$ belongs to the same cluster, and emph{no} otherwise. In this framework, we seek for algorithms that by actively asking queries reconstruct exactly the underlying clustering. The target emph{query complexity} (i.e. the number of queries asked by the algorithm) if $O(log n)$. We also consider a more general (and weaker) notion of supervision offered by the emph{similarity queries}. Given a pair of points $x$ and $y$, the oracle answer emph{yes} if the points are emph{similar} and emph{no} otherwise. In this context, the similarity may, for example, represent a notion of metric-closeness between points or also the membership to the same-cluster. In the former case, similarity queries may be seen as same-cluster queries with errors since points that are close may be well into different clusters. This type of feedback is common for clustering problems on graphs, where the presence of an edge between two vertices do not need to establish the membership to the same cluster. In this framework, we consider the case of emph{Correlation Clustering} and design a simple and efficient query-based approximation algorithm for it. We prove also that the same approximation algorithm features some cluster recovery properties. Finally a third contribution goes into the area of emph{robust unsupervised learning}. In this framework, data comes from a mixture distribution $lambda mu + (1-lambda) u$ where $ u$ is noise, and the goal is to identify a model with a small risk w.r.t. the target distribution $mu$. We show that the minimization of a certain emph{L-statistic} of the training data, provably leads to a model with small risk w.r.t. $mu$; even in the high noise regime. Our method works in the general emph{reconstruction error minimization} framework which encompass $k$-means and $k$-median clustering, principal subspace analysis, sparse coding and non-negative matrix factorization. Since the proposed method is NP-Hard to compute, we complement the results with an efficient Lloyd-like descent algorithm.

Scheda breve

Scheda completa

Scheda completa (DC)

	Data di discussione
	
			24-gen-2022
		
	Settori scientifico-disciplinari della tesi
	
			Settore INF/01 - Informatica
		
	Tutor afferenti all'Ateneo
	
			CESA BIANCHI, NICOLO' ANTONIO
		
	Supervisori e coordinatori afferenti all'Ateneo
	
			BOLDI, PAOLO
		
	Tipologia
	
			Doctoral Thesis
		
	Citazione
	
			ALGORITHMS FOR CLUSTERING AND ROBUST UNSUPERVISED LEARNING PROBLEMS / A. Paudice ; tutor: N. A. CESA BIANCHI, M. PONTIL ; coordinator: P. BOLDI. Dipartimento di Informatica Giovanni Degli Antoni, 2022 Jan 24. 34. ciclo, Anno Accademico 2021.
		
	Appare nelle tipologie:
	
			Tesi di dottorato

File in questo prodotto:

File	Dimensione	Formato
phd_unimi_R12251.pdf Open Access dal 14/06/2022 Tipologia: Tesi di dottorato completa Dimensione 3.49 MB Formato Adobe PDF Visualizza/Apri	3.49 MB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/888854

Citazioni

ND

ND

ND

social impact