A Semantic Schema-Based Catalog for Identifying Joinable Columns via LLMs / E. Cavalleri, M. Castagna, M. Mesiti (Lecture Notes in Computer Science). - In: Flexible Query Answering Systems / edited by G. De Tré, S. Sotirov, J. Kacprzyk, G. Psaila, G. Smits, T. Andreasen, G. Bordogna, H. Legind Larsen. - [s.l.]: Springer, 8 Sep 2025. - ISBN 9783032056061. - pp. 206-218. Paper presented at the 16th FQAS conference, held in Burgas in 2025 [10.1007/978-3-032-05607-8_20].

A Semantic Schema-Based Catalog for Identifying Joinable Columns via LLMs

E. Cavalleri (first author); M. Castagna; M. Mesiti (last author)
2025

Abstract

Joinable table discovery is the task of identifying tabular datasets that can be joined with a given query dataset. Existing approaches seldom consider the contextual information associated with datasets and columns, tailored to the kinds of analyses the user intends to carry out. In this paper, we propose the generation of semantic, task-oriented, schema-based catalogs that facilitate the identification of joinable columns. Starting from a schema diagram that outlines the classes and relationship types relevant to a certain kind of analysis, datasets are semantically annotated, and the annotations are used to generate the catalog. The catalog, represented as a property graph, can then be leveraged for visual exploration, query formulation, and the identification of joinable datasets useful for a specific analysis. The approach relies on the availability of metadata about datasets and their columns, combined with general-purpose large language models (LLMs). Initial experiments suggest that the approach is practical and efficient, with promising results in terms of both accuracy and usability.
data lake; joinable table discovery; semantic catalog
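
The catalog idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dataset names, column names, schema classes, and edge labels (`HAS_COLUMN`, `ANNOTATED_AS`) are all hypothetical, and the annotations are written by hand here, whereas the paper derives them from metadata with the help of an LLM. The sketch stores the catalog as a small property graph and treats two columns from different datasets as join candidates when they are annotated with the same schema class.

```python
from collections import defaultdict

# Hypothetical annotations: (dataset, column) -> schema class.
# Hand-written for illustration; the paper obtains them from metadata and an LLM.
ANNOTATIONS = {
    ("patients.csv", "gene_id"): "Gene",
    ("patients.csv", "diagnosis"): "Disease",
    ("expression.csv", "ensembl_gene"): "Gene",
    ("expression.csv", "tpm"): None,  # no matching class in the schema diagram
    ("trials.csv", "condition"): "Disease",
}


def build_catalog(annotations):
    """Build a tiny property graph: a dict of nodes plus a list of typed edges."""
    nodes, edges = {}, []
    for (dataset, column), cls in annotations.items():
        col_id = f"{dataset}:{column}"
        nodes[dataset] = {"label": "Dataset"}
        nodes[col_id] = {"label": "Column", "name": column}
        edges.append((dataset, "HAS_COLUMN", col_id))
        if cls is not None:
            nodes[cls] = {"label": "Class"}
            edges.append((col_id, "ANNOTATED_AS", cls))
    return nodes, edges


def joinable_columns(edges, query_dataset):
    """Pair each annotated column of query_dataset with same-class columns elsewhere."""
    owner = {col: ds for ds, rel, col in edges if rel == "HAS_COLUMN"}
    by_class = defaultdict(list)
    for col, rel, cls in edges:
        if rel == "ANNOTATED_AS":
            by_class[cls].append(col)
    pairs = []
    for cls, cols in by_class.items():
        for q in cols:
            if owner[q] != query_dataset:
                continue
            for c in cols:
                if owner[c] != query_dataset:
                    pairs.append((q, c, cls))
    return sorted(pairs)


nodes, edges = build_catalog(ANNOTATIONS)
candidates = joinable_columns(edges, "patients.csv")
for q, c, cls in candidates:
    print(f"{q} <-> {c} (both annotated as {cls})")
```

A shared class only suggests that a join is semantically meaningful; whether the column values actually overlap would still need to be checked on the data.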
Sector INFO-01/A - Computer Science (Informatica)
8 Sep 2025
uniburgas
Book Part (author)
Files in this record:
File: fqas2025.pdf (restricted access)
Type: Post-print, accepted manuscript, etc. (version accepted by the publisher)
License: No license
Size: 4.31 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1182255