Size constrained clustering problems in fixed dimension

Lin, J.

Clustering or cluster analysis \cite{hastieetal09} is a classical method in unsupervised learning and one of the most used techniques in statistical data analysis. Clustering has a wide range of applications in many areas like pattern recognition, medical diagnostics, data mining, biology, market research and image analysis among others. A cluster is a set of data points that in some sense are similar to each other, and clustering is a process of partitioning a data set into disjoint clusters. In \emph{distance clustering}, the similarity among data points is obtained by means of a \emph{distance} function. Fixed a norm $\|$ $\|_p$ ($p\geq 1$), the \emph{clustering problem} consists in finding for a finite point set $X\subset\Q^d$ and an integer $k$, a $k$-partition $\{A_1,...,A_k\}$of $X$ that minimizes the cost function \begin{equation}\label{costfunction} W(A_1,...,A_k)=\sum_{i=1}^k \sum_{x\in A_i} \|x-C_{A_i}\|_p^p \end{equation} where $C_{A_i}$ is the \emph{$p$-centroid} of $A_i$, i.e.$$ C_{A_i}=\argmin_\mu \sum_{x\in A_i} \|x-\mu\|_p^p$$ Distance clustering is a difficult problem. For an arbitrary dimension $d$, assuming the Euclidean norm ($p=2$), the problem is NP-hard even if the number $k$ of clusters equals $2$ cite{aloiseetal09}; the same occurs if $d=2$ and $k$ is arbitrary \cite{mahajanetal09,vattani10}. For the Euclidean distance, a well-known heuristic is Lloyd's algorithm \cite{lloyd82,macqueen67}, also known as the $k$-Means Algorithm; however there is no guarantee that the solution yielded by this procedure approximates the global optimum. This algorithm is usually very fast, but it can require exponential time in the worst case \cite{vattani09}. In real-world problems, often people have some information on the clusters: incorporating this information into traditional clustering algorithms can increase the clustering performance. Problems that include background information are called \emph{constrained clustering} problems and are divided in two classes. On the one hand, clustering problems with instance-based constraints typically comprise a set of must-link constraints or cannot-link constraints \cite{wagstaffcardie00}, defining pairs of elements that must be included, respectively, in the same cluster or in different clusters. On the other hand, clustering problems with cluster-based constraints \cite{bradleyetal00,tungetal01} incorporate constraints concerning the size of the possible clusters. Recently, in \cite{zhuetal10} cluster size constraints are used for improving clustering accuracy; this approach, for instance, allows one to avoid extremely small or large clusters in standard cluster analysis. Here we study a constrained clustering problem where the size of clusters is included in the instance. This problem, called \emph{Size Constrained Clustering Problem} (SCC), is formally defined as follows: given a set $X\subset \Q^d$ of $n$ points and $k$ many positive integers $m_1,...,m_k$ such that $\sum_1^k m_i=n$, find a $k$-partition $\{A_1,...,A_k\}$ of $X$ that minimizes the cost function $W(A_1,...,A_k)$ such that $|A_i|=m_i$ for each $i=1,...,k$. This problem was studied in \cite{sacca10,bertonietal12} and it is known to be a difficult problem. More precisely, the following results hold \cite{bertonietal12}: \begin{enumerate} \item[1)] For every norm $\|$ $\|_p$ with $p>1$, SCC with fixed clustering size $k$ is NP-hard, even in the case $k=2$ and $m_1=m_2=\frac{n}{2}$. \item[2)] For every norm $\|$ $\|_p$ with $p\geq 1$, SCC with fixed dimension $d$ is NP-hard, even in the case $d=1$. \end{enumerate} As a consequence, we can't expect to obtain a polynomial-time algorithm for solving the general SCC problem. \\ In this paper we investigate SCC in the plane ($d=2$) with a fixed clustering size $k=2$. In particular, we consider the following two problems:\\ $\bullet$ 2-SCC in the Plane:\\ Given a point set $X=\{x_1,...,x_n\}\subset \Q^2$ and a positive integer $m\leq \frac{n}{2}$, find a $2$-partition $\{A,\bar A\}$ of $X$ with $|A|=m, |\bar A|=n-m$, that minimizes $$ W(A,\bar A)=\sum_{x\in A} \|x-C_A\|_2^2+\sum_{x\in \bar A} \|x-C_{\bar A}\|_2^2 $$ where $C_A$ and $C_{\bar A}$ are the centroid of $A$ and $\bar A$ respectively.\\ $\bullet$ Full 2-SCC in the Plane:\\ Given a point set $X=\{x_1,...,x_n\}\subset\Q^2$, for all integers $m$, $1\leq m\leq \frac{n}{2}$, find the optimal $2$-partition $\pi_m=\{A_m,\bar A_m\}$, with $|A_m|=m$. The main results we obtain are the following: \begin{enumerate} \item[1)] There is an algorithm for solving Full 2-SCC in the Plane in time $O(n^2\cdot \log n)$. \item[2)] There is an algorithm for solving 2-SCC in the Plane in time $O(n\sqrt[3]m \cdot \log^2 n)$. \end{enumerate} It should be observed that, the algorithm for solving 2-SCC in the plane requires the application of methods for enumerating the $k$-sets of a collection of points in the plane, which is a challenging problem \cite{erdosetal73} in combinatorial geometry. \\ Here we also study the problem 2-SCC in fixed dimension $d$. First, we use a separation result \cite{bertonietal12} stating that if $\{A,\bar A\}$ is an optimal solution of an instance of the 2-SCC problem, then $A$ and $\bar A$ are separated by an hypersurface of the form $$ \|x-\alpha\|_p^p - \|x-\beta\|_p^p = c $$ for some constant parameters $\alpha,\beta\in \R^d$, $c\in\R$. By applying a suitable method for decomposing the parameter space $\R^{2d+1}$, one can compute a set of optimal 2-partitions $\pi_m=\{A_m,\bar A_m\}$ such that $|A_m|=m$, for $m=1,...,\lfloor\frac{n}{2}\rfloor$. This allows us to design an algorithm for the Full 2-SCC problem in fixed dimension $d$ that works in polynomial time both in $n$ and $p$. To obtain this result we make use of concepts and methods of real algebraic geometry, and in particular we apply the cylindrical algebraic decomposition \cite{collins75}. \\ In this work we also study another variant of the clustering problem, called \emph{Relaxed Constraints Clustering Problem} (RCC), which is defined as follows: given a point set $X=\{x_1,...,x_n\}\subset\Q^d$, an integer $k>1$ and a finite set $\mathcal M$ of positive integers, find a $k$-partition $\{A_1,...,A_k\}$ of $X$ with $$ |A_i|\in \mathcal M \quad \text{ for all } i=1,...,k $$ that minimizes the cost function $$ W(A_1,...,A_k)=\sum_{i=1}^k \sum_{x\in A_i} \|x-C_{A_i}\|_p^p. $$ We prove that for the euclidean norm $\|$ $\|_2$, the decision version of RCC in dimension $d=2$ is NP-complete even in the case $\mathcal M=\{2,3\}$. On the contrary, RCC in dimension $1$ is known to be solvable in polynomial time by a dynamic programming technique \cite{sacca10}.

Size constrained clustering problems in fixed dimension / J. Lin. ((Intervento presentato al 13. convegno Italian Conference on Theoretical Computer Science tenutosi a Varese nel 2012.