Spreadsheets are often used as a simple way to represent tabular data. However, since they impose no restrictions on table structure or content, processing them automatically and integrating them with other information sources are particularly hard problems. Many table understanding approaches have been proposed for extracting data from tables and transforming them into meaningful information, but they require some regularity in the table contents. Starting from CSV spreadsheets that contain values of different types as well as errors, in this paper we introduce an approach for inferring the types of columns in CSV tables by means of multi-label classification. With our approach, each column of the table can be associated with a simple datatype (such as integer, float, or text), a domain-specific one (such as the name of a municipality or an address), or a “union” of types (which takes into account the frequency of the corresponding values). Since the automatically inferred types might not be accurate, graphical interfaces have been developed to support the user in fixing mistakes. Finally, experimental results are reported on real spreadsheets obtained from a debt collection agency.
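To make the idea of a frequency-aware “union” of types concrete, the following is a minimal Python sketch. It is not the method of the paper, which relies on multi-label classification and also covers domain-specific types. Here a few hand-written rules stand in for the classifier, and the names `cell_type`, `infer_column_type`, and `min_share` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def cell_type(value: str) -> str:
    """Label one raw CSV cell with a simple datatype.

    Assumption: plain parsing rules stand in for the trained classifier
    described in the abstract; only integer, float, and text are handled.
    """
    text = value.strip()
    if text == "":
        return "empty"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "text"


def infer_column_type(values, min_share=0.2):
    """Return a mapping {type: share of non-empty cells} for one column.

    Types whose share falls below `min_share` (a hypothetical threshold)
    are treated as likely errors and left out. A result with more than
    one key plays the role of a "union" type.
    """
    counts = Counter(cell_type(v) for v in values)
    counts.pop("empty", None)  # empty cells carry no type information
    total = sum(counts.values())
    if total == 0:
        return {}
    shares = {t: n / total for t, n in counts.items()}
    return {t: s for t, s in shares.items() if s >= min_share}


if __name__ == "__main__":
    # A mostly numeric column with one stray textual value.
    print(infer_column_type(["10", "20", "30", "n/a"]))
```

With the default threshold, the stray `"n/a"` is frequent enough (one cell in four) to be kept, so the column is reported as a union of integer and text. Raising `min_share` would instead treat it as an error to be fixed by the user.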
Semi-automatic Column Type Inference for CSV Table Understanding / S. Bonfitto, L. Cappelletti, F. Trovato, G. Valentini, M. Mesiti (LECTURE NOTES IN ARTIFICIAL INTELLIGENCE). - In: SOFSEM 2021: Theory and Practice of Computer Science / edited by T. Bureš, R. Dondi, J. Gamper, G. Guerrini, T. Jurdziński, C. Pahl, F. Sikora, P.W.H. Wong. - [s.l] : Springer, 2021. - ISBN 9783030677305. - pp. 535-549 (Paper presented at the 47th International Conference on Current Trends in Theory and Practice of Computer Science, held in Bolzano in 2021) [10.1007/978-3-030-67731-2_39].
Semi-automatic Column Type Inference for CSV Table Understanding
S. Bonfitto;L. Cappelletti;G. Valentini;M. Mesiti
2021
File | Type | Size | Format | Access
---|---|---|---|---
Bonfitto2021_Chapter_Semi-automaticColumnTypeInfere.pdf | Publisher's version/PDF | 2.42 MB | Adobe PDF | Restricted access
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.