The Dataset-Similarity-Based Approach to Select Datasets for Evaluation in Similarity Retrieval / M.A.L. Matiazzo, V. de Castro-Silva, R.S. Oyamada, D.S. Kaster (LECTURE NOTES IN COMPUTER SCIENCE). - In: Similarity Search and Applications / [edited by] Oscar Pedreira, Vladimir Estivill-Castro. - Cham : Springer, 2023. - ISBN 978-3-031-46993-0. - pp. 125-132 (Paper presented at the 16th International Conference on Similarity Search and Applications, SISAP, held in A Coruña, October 9–11, 2023) [10.1007/978-3-031-46994-7_11].
The Dataset-Similarity-Based Approach to Select Datasets for Evaluation in Similarity Retrieval
R.S. Oyamada
2023
Abstract
Most papers on similarity retrieval present experiments executed on an assortment of complex datasets. However, no work focuses on analyzing the selection of datasets used to evaluate the techniques proposed in the related literature. Ideally, the datasets chosen for experimental analysis should cover a variety of properties to ensure a proper evaluation; however, this is not always the case. This paper introduces the dataset-similarity-based approach, a new conceptual view of datasets that explores how they vary according to their characteristics. The approach is based on extracting a set of features from the datasets to represent them in a similarity space and analyzing their distribution in this space. We present an instantiation of our approach, along with sample analyses, using datasets gathered by surveying dataset usage in papers published in relevant conferences on similarity retrieval. Our analyses show that datasets often used together in experiments are more similar than they seem at first glance, which reduces the variability of the evaluation. The proposed representation of datasets in a similarity space allows future work to improve the choice of datasets for running experiments in similarity retrieval.