Spreadsheets are often used as a simple way for representing tabular data. However, since they do not impose any restriction on their table structures and contents, their automatic processing and the integration with other information sources are particularly hard problems to solve. Many table understanding approaches have been proposed for extracting data from tables and transforming them in meaningful information. However, they require some regularities on the table contents. Starting from CSV spreadsheets that present values of different types and errors, in this paper we introduce an approach for inferring the types of columns in CSV tables by exploiting a multi-label classification approach. By means of our approach, each column of the table can be associated with a simple datatype (such as integer, float, text), a domain-specific one (such as the name of a municipality, and address), or an “union” of types (that takes into account the frequency of the corresponding values). Since the automatically inferred types might not be accurate, graphical interfaces have been developed for supporting the user in fixing the mistakes. Experimental results are finally reported on real spreadsheets obtained by a debt collection agency.

Semi-automatic Column Type Inference for CSV Table Understanding / S. Bonfitto, L. Cappelletti, F. Trovato, G. Valentini, M. Mesiti (LECTURE NOTES IN ARTIFICIAL INTELLIGENCE). - In: SOFSEM 2021: Theory and Practice of Computer Science / [a cura di] T. Bureš, R. Dondi, J. Gamper, G. Guerrini, T. Jurdziński, C. Pahl, F. Sikora, P.W.H. Wong. - [s.l] : Springer, 2021. - ISBN 9783030677305. - pp. 535-549 (( Intervento presentato al 47. convegno International Conference on Current Trends in Theory and Practice of Computer Science tenutosi a Bolzano nel 2021 [10.1007/978-3-030-67731-2_39].

Semi-automatic Column Type Inference for CSV Table Understanding

S. Bonfitto;L. Cappelletti;G. Valentini;M. Mesiti
2021

Abstract

Spreadsheets are often used as a simple way for representing tabular data. However, since they do not impose any restriction on their table structures and contents, their automatic processing and the integration with other information sources are particularly hard problems to solve. Many table understanding approaches have been proposed for extracting data from tables and transforming them in meaningful information. However, they require some regularities on the table contents. Starting from CSV spreadsheets that present values of different types and errors, in this paper we introduce an approach for inferring the types of columns in CSV tables by exploiting a multi-label classification approach. By means of our approach, each column of the table can be associated with a simple datatype (such as integer, float, text), a domain-specific one (such as the name of a municipality, and address), or an “union” of types (that takes into account the frequency of the corresponding values). Since the automatically inferred types might not be accurate, graphical interfaces have been developed for supporting the user in fixing the mistakes. Experimental results are finally reported on real spreadsheets obtained by a debt collection agency.
Table understanding; Type inference; GUI; CSVs
Settore INF/01 - Informatica
2021
Book Part (author)
File in questo prodotto:
File Dimensione Formato  
Bonfitto2021_Chapter_Semi-automaticColumnTypeInfere.pdf

accesso riservato

Tipologia: Publisher's version/PDF
Dimensione 2.42 MB
Formato Adobe PDF
2.42 MB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/809060
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 10
  • ???jsp.display-item.citation.isi??? 5
social impact