This paper addresses the exploratory analysis of textual data streams. We propose a bootstrapping process that combines keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters based on keyword commonalities; (ii) aggregate similar clusters into larger document collections that share a richer, more user-prominent keyword set, which we call a topic; and (iii) assimilate the topics newly extracted in the current bootstrapping cycle with the topics resulting from previous cycles, linking similar topics from different time periods, if any, to highlight topic trends and evolution. We also define an analysis framework that enables topic-based exploration of the underlying textual data stream from both a thematic and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
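The three steps listed in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure or clustering technique the authors use, so the Jaccard coefficient, the greedy single-pass grouping, the threshold values, and all names below (`jaccard`, `group`, `link`, the sample keywords) are illustrative assumptions rather than the paper's method.

```python
from typing import List, Set, Tuple

# A group is (document ids, keyword set). Hypothetical representation.
Group = Tuple[List[str], Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group(items: List[Group], threshold: float) -> List[Group]:
    """Greedy single-pass grouping: each item joins the first group whose
    keyword set is similar enough, otherwise it starts a new group."""
    groups: List[Group] = []
    for ids, keywords in items:
        for g_ids, g_keywords in groups:
            if jaccard(keywords, g_keywords) >= threshold:
                g_ids.extend(ids)
                g_keywords |= keywords  # in-place union enriches the group
                break
        else:
            groups.append((list(ids), set(keywords)))
    return groups


def link(previous: List[Group], current: List[Group],
         threshold: float) -> List[Tuple[int, int]]:
    """Pairs (previous index, current index) of similar topics across cycles."""
    return [(i, j)
            for i, (_, p) in enumerate(previous)
            for j, (_, c) in enumerate(current)
            if jaccard(p, c) >= threshold]


# Toy documents with made-up keywords (not taken from the paper's dataset).
docs: List[Group] = [
    (["d1"], {"election", "senate", "vote"}),
    (["d2"], {"election", "vote", "ballot"}),
    (["d3"], {"tax", "budget", "congress"}),
    (["d4"], {"election", "campaign"}),
]

clusters = group(docs, 0.4)       # step (i): fine-grained clusters
topics = group(clusters, 0.15)    # step (ii): looser threshold merges clusters
previous_topics: List[Group] = [
    (["p1"], {"election", "vote", "senate", "primary"}),
]
links = link(previous_topics, topics, 0.3)  # step (iii): cross-cycle links
```

Reusing the same `group` function with a looser threshold for step (ii) is only a convenience of this sketch; a real pipeline would likely use a different aggregation criterion for topics than for clusters.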
Exploratory analysis of textual data streams / S. Castano, A. Ferrara, S. Montanelli. - In: FUTURE GENERATION COMPUTER SYSTEMS. - ISSN 0167-739X. - 68(2017), pp. 391-406. [10.1016/j.future.2016.07.005]
Exploratory analysis of textual data streams
S. Castano; A. Ferrara; S. Montanelli
2017
File | Access | Type | Size | Format |
---|---|---|---|---|
castano_etal.pdf | Open access | Pre-print (manuscript submitted to the publisher) | 1.13 MB | Adobe PDF |
1-s2.0-S0167739X16302357-main.pdf | Restricted access | Publisher's version/PDF | 1.92 MB | Adobe PDF |