Bloom filters, since their introduction over 50 years ago, have become a pillar to handle membership queries in small space, with relevant application in Big Data Mining and Stream Processing. Further improvements have been recently proposed with the use of Machine Learning techniques: learned Bloom filters. Those latter make considerably more complicated the proper parameter setting of this multi-criteria data structure, in particular in regard to the choice of one of its key components (the classifier) and accounting for the classification complexity of the input dataset. Given this State of the Art, our contributions are as follows. (1) A novel methodology, supported by software, for designing, analyzing and implementing learned Bloom filters that account for their own multi-criteria nature, in particular concerning classifier type choice and data classification complexity. Extensive experiments show the validity of the proposed methodology and, being our software public, we offer a valid tool to the practitioners interested in using learned Bloom filters. (2) Further contributions to the advancement of the State of the Art that are of great practical relevance are the following: (a) the classifier inference time should not be taken as a proxy for the filter reject time; (b) of the many classifiers we have considered, only two offer good performance; this result is in agreement with and further strengthens early findings in the literature; (c) Sandwiched Bloom filter, which is already known as being one of the references of this area, is further shown here to have the remarkable property of robustness to data complexity and classifier performance variability.

The role of classifiers and data complexity in learned Bloom filters: insights and recommendations / D. Malchiodi, D. Raimondi, G. Fumagalli, R. Giancarlo, M. Frasca. - In: JOURNAL OF BIG DATA. - ISSN 2196-1115. - 11:1(2024), pp. 45.1-45.26. [10.1186/s40537-024-00906-9]

The role of classifiers and data complexity in learned Bloom filters: insights and recommendations

D. Malchiodi
;
M. Frasca
Ultimo
2024

Abstract

Bloom filters, since their introduction over 50 years ago, have become a pillar to handle membership queries in small space, with relevant application in Big Data Mining and Stream Processing. Further improvements have been recently proposed with the use of Machine Learning techniques: learned Bloom filters. Those latter make considerably more complicated the proper parameter setting of this multi-criteria data structure, in particular in regard to the choice of one of its key components (the classifier) and accounting for the classification complexity of the input dataset. Given this State of the Art, our contributions are as follows. (1) A novel methodology, supported by software, for designing, analyzing and implementing learned Bloom filters that account for their own multi-criteria nature, in particular concerning classifier type choice and data classification complexity. Extensive experiments show the validity of the proposed methodology and, being our software public, we offer a valid tool to the practitioners interested in using learned Bloom filters. (2) Further contributions to the advancement of the State of the Art that are of great practical relevance are the following: (a) the classifier inference time should not be taken as a proxy for the filter reject time; (b) of the many classifiers we have considered, only two offer good performance; this result is in agreement with and further strengthens early findings in the literature; (c) Sandwiched Bloom filter, which is already known as being one of the references of this area, is further shown here to have the remarkable property of robustness to data complexity and classifier performance variability.
No
English
Bloom filters; Learned Bloom filters; Approximate set membership; Dataset complexity
Settore INF/01 - Informatica
Articolo
Esperti anonimi
Ricerca di base
Pubblicazione scientifica
   Multi-criteria optimized data structures: from compressed indexes to learned indexes, and beyond
   MINISTERO DELL'ISTRUZIONE E DEL MERITO
   2017WR7SHH_004
2024
27-mar-2024
Springer
11
1
45
1
26
26
Pubblicato
Periodico con rilevanza internazionale
crossref
Aderisco
info:eu-repo/semantics/article
The role of classifiers and data complexity in learned Bloom filters: insights and recommendations / D. Malchiodi, D. Raimondi, G. Fumagalli, R. Giancarlo, M. Frasca. - In: JOURNAL OF BIG DATA. - ISSN 2196-1115. - 11:1(2024), pp. 45.1-45.26. [10.1186/s40537-024-00906-9]
open
Prodotti della ricerca::01 - Articolo su periodico
5
262
Article (author)
Periodico con Impact Factor
D. Malchiodi, D. Raimondi, G. Fumagalli, R. Giancarlo, M. Frasca
File in questo prodotto:
File Dimensione Formato  
s40537-024-00906-9 (1).pdf

accesso aperto

Descrizione: Paper pubblicato
Tipologia: Publisher's version/PDF
Dimensione 3.99 MB
Formato Adobe PDF
3.99 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1050422
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
social impact