This thesis is dedicated to the study of algorithms for clustering and robust unsupervised learning. Specifically, a significant part of this thesis is dedicated to the problem of emph{semi-supervised cluster recovery}. Given a finite set of $n$ points partitioned according to an unknown clustering, the goal is to recover the unknown clustering using as few emph{oracle queries} as possible. We mostly focus on the framework of the emph{same-cluster queries}: given a pair of points $x$ and $y$, a same cluster query returns a binary answer saying emph{yes} if $x$ and $y$ belongs to the same cluster, and emph{no} otherwise. In this framework, we seek for algorithms that by actively asking queries reconstruct exactly the underlying clustering. The target emph{query complexity} (i.e. the number of queries asked by the algorithm) if $O(log n)$. We also consider a more general (and weaker) notion of supervision offered by the emph{similarity queries}. Given a pair of points $x$ and $y$, the oracle answer emph{yes} if the points are emph{similar} and emph{no} otherwise. In this context, the similarity may, for example, represent a notion of metric-closeness between points or also the membership to the same-cluster. In the former case, similarity queries may be seen as same-cluster queries with errors since points that are close may be well into different clusters. This type of feedback is common for clustering problems on graphs, where the presence of an edge between two vertices do not need to establish the membership to the same cluster. In this framework, we consider the case of emph{Correlation Clustering} and design a simple and efficient query-based approximation algorithm for it. We prove also that the same approximation algorithm features some cluster recovery properties. Finally a third contribution goes into the area of emph{robust unsupervised learning}. In this framework, data comes from a mixture distribution $lambda mu + (1-lambda) u$ where $ u$ is noise, and the goal is to identify a model with a small risk w.r.t. the target distribution $mu$. We show that the minimization of a certain emph{L-statistic} of the training data, provably leads to a model with small risk w.r.t. $mu$; even in the high noise regime. Our method works in the general emph{reconstruction error minimization} framework which encompass $k$-means and $k$-median clustering, principal subspace analysis, sparse coding and non-negative matrix factorization. Since the proposed method is NP-Hard to compute, we complement the results with an efficient Lloyd-like descent algorithm.

ALGORITHMS FOR CLUSTERING AND ROBUST UNSUPERVISED LEARNING PROBLEMS / A. Paudice ; tutor: N. A. CESA BIANCHI, M. PONTIL ; coordinator: P. BOLDI. - : . Dipartimento di Informatica Giovanni Degli Antoni, 2022 Jan 24. ((34. ciclo, Anno Accademico 2021.

ALGORITHMS FOR CLUSTERING AND ROBUST UNSUPERVISED LEARNING PROBLEMS

A. Paudice
2022

Abstract

This thesis is dedicated to the study of algorithms for clustering and robust unsupervised learning. Specifically, a significant part of this thesis is dedicated to the problem of emph{semi-supervised cluster recovery}. Given a finite set of $n$ points partitioned according to an unknown clustering, the goal is to recover the unknown clustering using as few emph{oracle queries} as possible. We mostly focus on the framework of the emph{same-cluster queries}: given a pair of points $x$ and $y$, a same cluster query returns a binary answer saying emph{yes} if $x$ and $y$ belongs to the same cluster, and emph{no} otherwise. In this framework, we seek for algorithms that by actively asking queries reconstruct exactly the underlying clustering. The target emph{query complexity} (i.e. the number of queries asked by the algorithm) if $O(log n)$. We also consider a more general (and weaker) notion of supervision offered by the emph{similarity queries}. Given a pair of points $x$ and $y$, the oracle answer emph{yes} if the points are emph{similar} and emph{no} otherwise. In this context, the similarity may, for example, represent a notion of metric-closeness between points or also the membership to the same-cluster. In the former case, similarity queries may be seen as same-cluster queries with errors since points that are close may be well into different clusters. This type of feedback is common for clustering problems on graphs, where the presence of an edge between two vertices do not need to establish the membership to the same cluster. In this framework, we consider the case of emph{Correlation Clustering} and design a simple and efficient query-based approximation algorithm for it. We prove also that the same approximation algorithm features some cluster recovery properties. Finally a third contribution goes into the area of emph{robust unsupervised learning}. In this framework, data comes from a mixture distribution $lambda mu + (1-lambda) u$ where $ u$ is noise, and the goal is to identify a model with a small risk w.r.t. the target distribution $mu$. We show that the minimization of a certain emph{L-statistic} of the training data, provably leads to a model with small risk w.r.t. $mu$; even in the high noise regime. Our method works in the general emph{reconstruction error minimization} framework which encompass $k$-means and $k$-median clustering, principal subspace analysis, sparse coding and non-negative matrix factorization. Since the proposed method is NP-Hard to compute, we complement the results with an efficient Lloyd-like descent algorithm.
CESA BIANCHI, NICOLO' ANTONIO
BOLDI, PAOLO
Settore INF/01 - Informatica
ALGORITHMS FOR CLUSTERING AND ROBUST UNSUPERVISED LEARNING PROBLEMS / A. Paudice ; tutor: N. A. CESA BIANCHI, M. PONTIL ; coordinator: P. BOLDI. - : . Dipartimento di Informatica Giovanni Degli Antoni, 2022 Jan 24. ((34. ciclo, Anno Accademico 2021.
Doctoral Thesis
File in questo prodotto:
File Dimensione Formato  
phd_unimi_R12251.pdf

Open Access dal 14/06/2022

Tipologia: Tesi di dottorato completa
Dimensione 3.49 MB
Formato Adobe PDF
3.49 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/888854
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact