Exploratory analysis of textual data streams

Castano, S.; Ferrara, A.; Montanelli, S.

doi:10.1016/j.future.2016.07.005

In this paper, we address exploratory analysis of textual data streams and we propose a bootstrapping process based on a combination of keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters, based on keyword commonalities; (ii) aggregate similar clusters into larger document collections sharing a richer, more user-prominent keyword set that we call a topic; (iii) assimilate newly extracted topics of the current bootstrapping cycle with existing topics resulting from previous bootstrapping cycles, by linking similar topics of different time periods, if any, to highlight topic trends and evolution. We also define an analysis framework that enables topic-based exploration of the underlying textual data stream from both a thematic and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
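The three steps named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the clustering algorithm, or any thresholds, so the Jaccard coefficient, the greedy single-pass grouping, the threshold values, and all function names (`jaccard`, `group_by_similarity`, `bootstrap_cycle`, `link_topics`) are assumptions chosen only to make the idea concrete.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (an assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_similarity(items: List[Set[str]], threshold: float) -> List[List[int]]:
    """Greedy single-pass grouping (an assumed stand-in for clustering).

    Each item joins the first group whose pooled keyword set is similar
    enough; otherwise it starts a new group.
    """
    groups: List[List[int]] = []
    pooled: List[Set[str]] = []
    for idx, keywords in enumerate(items):
        for g, pool in enumerate(pooled):
            if jaccard(keywords, pool) >= threshold:
                groups[g].append(idx)
                pool |= keywords  # in-place update of the group's keyword pool
                break
        else:
            groups.append([idx])
            pooled.append(set(keywords))
    return groups


def bootstrap_cycle(
    docs: List[Set[str]],
    fine_threshold: float = 0.5,   # hypothetical value
    topic_threshold: float = 0.2,  # hypothetical value
) -> Tuple[List[List[int]], List[Set[str]]]:
    """Run steps (i) and (ii) on the keyword sets of one batch of documents."""
    # (i) fine-grained similarity clusters based on keyword commonalities
    clusters = group_by_similarity(docs, fine_threshold)
    cluster_keywords = [set().union(*(docs[i] for i in c)) for c in clusters]
    # (ii) aggregate similar clusters; each topic is a merged keyword set
    topic_groups = group_by_similarity(cluster_keywords, topic_threshold)
    topics = [set().union(*(cluster_keywords[c] for c in g)) for g in topic_groups]
    return clusters, topics


def link_topics(
    previous: List[Set[str]],
    current: List[Set[str]],
    threshold: float = 0.3,  # hypothetical value
) -> List[Tuple[int, int]]:
    """Step (iii): link similar topics of different cycles, if any."""
    return [
        (i, j)
        for i, old in enumerate(previous)
        for j, new in enumerate(current)
        if jaccard(old, new) >= threshold
    ]


# Toy keyword sets, invented for illustration only.
docs = [
    {"senate", "vote", "bill"},
    {"senate", "vote", "budget"},
    {"treaty", "war", "peace"},
    {"treaty", "peace", "envoy"},
]
clusters, topics = bootstrap_cycle(docs)
links = link_topics([{"senate", "vote", "election"}], topics)
```

In this toy run the four documents fall into two clusters, each cluster becomes its own topic, and the single topic from the earlier cycle is linked to the first new topic. A real data stream would need keyword extraction and a more robust clustering method than this greedy pass.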

Exploratory analysis of textual data streams / S. Castano, A. Ferrara, S. Montanelli. - In: FUTURE GENERATION COMPUTER SYSTEMS. - ISSN 0167-739X. - 68(2017), pp. 391-406. [10.1016/j.future.2016.07.005]



Keywords
Clustering; Detection of emergent topics; Exploratory analysis; Textual data stream; Topic evolution; Software; Hardware and Architecture; Computer Networks and Communications

Scientific-disciplinary sectors of the article (view only)
Sector INF/01 - Computer Science
Sector ING-INF/05 - Information Processing Systems

Publication date
2017

Journal in ANCE
FUTURE GENERATION COMPUTER SYSTEMS

DOI
https://dx.doi.org/10.1016/j.future.2016.07.005

Type
Article (author)

Appears in types:
01 - Journal article

Files in this item:

File	Access	Type	Size	Format
castano_etal.pdf	Open access	Pre-print (manuscript submitted to the publisher)	1.13 MB	Adobe PDF
1-s2.0-S0167739X16302357-main.pdf	Restricted access	Publisher's version/PDF	1.92 MB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/468934


IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca