Semi-automatic Column Type Inference for CSV Table Understanding

Bonfitto, S.; Cappelletti, L.; Trovato, F.; Valentini, G.; Mesiti, M.

doi:10.1007/978-3-030-67731-2_39

Spreadsheets are often used as a simple way to represent tabular data. However, since they impose no restrictions on table structure or content, processing them automatically and integrating them with other information sources are particularly hard problems. Many table understanding approaches have been proposed for extracting data from tables and transforming them into meaningful information, but they require some regularity in the table contents. Starting from CSV spreadsheets that contain errors and values of different types, in this paper we introduce an approach for inferring the types of columns in CSV tables by means of multi-label classification. With our approach, each column of the table can be associated with a simple datatype (such as integer, float, or text), a domain-specific one (such as the name of a municipality or an address), or a "union" of types (which takes into account the frequency of the corresponding values). Since the automatically inferred types might not be accurate, graphical interfaces have been developed to support the user in fixing the mistakes. Finally, experimental results are reported on real spreadsheets obtained from a debt collection agency.

Semi-automatic Column Type Inference for CSV Table Understanding / S. Bonfitto, L. Cappelletti, F. Trovato, G. Valentini, M. Mesiti (LECTURE NOTES IN ARTIFICIAL INTELLIGENCE). - In: SOFSEM 2021: Theory and Practice of Computer Science / edited by T. Bureš, R. Dondi, J. Gamper, G. Guerrini, T. Jurdziński, C. Pahl, F. Sikora, P.W.H. Wong. - [s.l.] : Springer, 2021. - ISBN 9783030677305. - pp. 535-549 (Paper presented at the 47th International Conference on Current Trends in Theory and Practice of Computer Science, held in Bolzano in 2021) [10.1007/978-3-030-67731-2_39].



Keywords: Table understanding; Type inference; GUI; CSVs

Scientific-disciplinary sector of the contribution: Sector INF/01 - Computer Science

Publication date: 2021

DOI: https://dx.doi.org/10.1007/978-3-030-67731-2_39

Type: Book Part (author)

Appears in types: 03 - Contribution in volume

Files in this item:

File	Size	Format
Bonfitto2021_Chapter_Semi-automaticColumnTypeInfere.pdf (restricted access; type: Publisher's version/PDF)	2.42 MB	Adobe PDF



Use this identifier to cite or link to this document: https://hdl.handle.net/2434/809060
