Given a set of points in a metric space, an additional query point and a positive threshold, a range query determines the subset of points whose distance from the query point does not exceed the given threshold. This paper tackles the problem of clustering the set of points so as to minimize the number of distance evaluations required by a range query. This problem models the efficient extraction of information from a database when the user is not interested in an exact match retrieval, but in the search for similar items. Since this need has become widespread in the management of text, image, audio and video databases, several data structures have been proposed to support such queries. Their optimization, however, is still left to extremely simple heuristic rules, if not to random choices. We propose the Balanced Compact Clustering Problem (BCCP) as a combinatorial model of this problem. We discuss its approximation properties and the complexity of special cases. Then, we present two Integer Programming formulations, prove their equivalence and introduce valid inequalities and variable fixing procedures. We discuss the application of a general-purpose solver on the more efficient formulation. Finally, we describe a Tabu Search algorithm and discuss its application to randomly generated and to real-world benchmark instances up to one hundred thousands points.

Balanced compact clustering for efficient range queries in metric spaces / A. Ceselli, F. Colombo, R. Cordone. - In: DISCRETE APPLIED MATHEMATICS. - ISSN 0166-218X. - 169(2014 May), pp. 43-67.

Balanced compact clustering for efficient range queries in metric spaces

A. Ceselli;F. Colombo;R. Cordone
2014-05

Abstract

Given a set of points in a metric space, an additional query point and a positive threshold, a range query determines the subset of points whose distance from the query point does not exceed the given threshold. This paper tackles the problem of clustering the set of points so as to minimize the number of distance evaluations required by a range query. This problem models the efficient extraction of information from a database when the user is not interested in an exact match retrieval, but in the search for similar items. Since this need has become widespread in the management of text, image, audio and video databases, several data structures have been proposed to support such queries. Their optimization, however, is still left to extremely simple heuristic rules, if not to random choices. We propose the Balanced Compact Clustering Problem (BCCP) as a combinatorial model of this problem. We discuss its approximation properties and the complexity of special cases. Then, we present two Integer Programming formulations, prove their equivalence and introduce valid inequalities and variable fixing procedures. We discuss the application of a general-purpose solver on the more efficient formulation. Finally, we describe a Tabu Search algorithm and discuss its application to randomly generated and to real-world benchmark instances up to one hundred thousands points.
Clustering; Information retrieval; Integer programming; Similarity search; Tabu search
Settore MAT/09 - Ricerca Operativa
Settore INF/01 - Informatica
Article (author)
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

Caricamento pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/2434/234897
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 2
social impact