Spatio-temporal data mining is a growing research area dedicated to developing algorithms and computational techniques for analyzing large spatio-temporal databases and uncovering interesting, hidden knowledge in these data, mainly in the form of periodic hidden patterns and outliers. This thesis focuses on outlier detection in spatio-temporal data. Detecting outliers, that is, objects grossly different from or inconsistent with the remaining data, is a major challenge in real-world knowledge discovery and data mining applications. The wide availability of data gathered from wireless sensor networks and telecommunication systems (such as GPS and GSM), which generate terabytes of data daily, has drawn research attention to the knowledge that can be gained from analyzing spatio-temporal data. Spatio-temporal data consist of locations sampled at specific timestamps; typically, they describe the trajectories of moving objects whose locations change over time. Managing and analyzing these data is valuable because previously undetected correlations between phenomena can be discovered and improvements can be made in many fields, such as problem prevention, traffic management, discovery of meaningful behavior patterns, and accessibility of restricted areas. In this thesis, we address an unsupervised outlier detection problem on unlabeled spatio-temporal data. Two main research contributions are reported in the two parts of the thesis. The first part describes the first contribution, which consists of two non-parametric methods.
Most current outlier detection methods give a binary classification of objects (an object either is or is not an outlier), but in many scenarios it is more meaningful to assign each object a degree of being an outlier (degree of outlierness), which can be based on different rules well known in the literature. In both methods, the degree of outlierness of each object is based on the sum of the distances between the object and its k nearest neighbors. A nearest-neighbor technique was chosen because it is unsupervised and makes no assumptions about the generative distribution of the data; it is purely data driven. The first method, called the two-step approach, uses the spatial weight (component) to identify spatial outliers and then, in a second step, also considers the temporal weight, but only as a more refined level of anomaly detection. The second method, called ST-OutlierDetector, is a non-parametric approach that finds the top outliers in an unlabeled spatio-temporal data set. It relies on a new fusion approach that discovers outliers according to spatial and temporal features at the same time: the user can decide how much importance to give to each component (spatial and temporal), depending on the kind of data to be analyzed and/or the kind of analysis to be performed. A further contribution builds on the ST-OutlierDetector method: the spatio-temporal outlierness degree map, a visualization tool that shows the structure of the data set with respect to the presence of spatio-temporal outliers. It produces a 3D plot (space and time) of the data set in which objects are drawn in different colors and color shades according to their outlierness degree. The map is built without setting a priori the input parameter that specifies the number of outliers to be found.
The second part of the thesis describes the second research contribution, a new outlier detection method called ROSE (Rough Outlier Set Extraction), which addresses outlier detection in spatio-temporal data using rough set theory. Most current outlier detection methods that exploit rough set theory use it to define new rough weights as degrees of outlierness. Our goal is instead to represent the Outlier Set as a Rough Outlier Set through its lower and upper approximations, highlighting the benefits of taking into account the objects that belong to the boundary. Moreover, we introduce a new set, called the Kernel Set: a selected subset of elements that preserves the original data set in terms of both data structure and obtained results. In particular, we show the advantages of considering this new set by comparing the Rough Outlier Set extracted from the entire data set (our universe of discourse) with the Rough Outlier Set extracted from the Kernel Set.
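The k-nearest-neighbor outlierness degree described above can be illustrated with a minimal Python sketch. The abstract does not give the actual fusion formula, so the weighted sum used here (a hypothetical parameter `alpha` applied to Euclidean spatial distance and `1 - alpha` to the absolute time difference) is only an assumption, as are the function names and the sample points; it should not be read as the ST-OutlierDetector implementation.

```python
import math

def st_distance(a, b, alpha=0.5):
    """Assumed fused distance between two (x, y, t) samples.

    alpha weights the spatial part and (1 - alpha) the temporal part.
    This convex combination is an illustration, not the thesis's formula.
    """
    spatial = math.hypot(a[0] - b[0], a[1] - b[1])
    temporal = abs(a[2] - b[2])
    return alpha * spatial + (1 - alpha) * temporal

def outlierness(points, k=2, alpha=0.5):
    """Degree of outlierness: sum of distances to the k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            st_distance(p, q, alpha) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(dists[:k]))
    return scores

def top_outliers(points, n=1, k=2, alpha=0.5):
    """Indices of the n objects with the highest outlierness degree."""
    scores = outlierness(points, k, alpha)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# Hypothetical samples (x, y, t): four nearby points and one far away.
points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 1.0),
    (0.0, 1.0, 2.0),
    (1.0, 1.0, 3.0),
    (10.0, 10.0, 4.0),
]
scores = outlierness(points, k=2, alpha=0.5)
print(top_outliers(points, n=1))
```

Raising `alpha` toward 1 makes the ranking depend mostly on spatial distance, and lowering it toward 0 makes it depend mostly on time, which mirrors the user-chosen importance of the two components mentioned in the abstract.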
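As a reminder of the rough set notions the abstract relies on, the following sketch computes the standard (Pawlak) lower approximation, upper approximation, and boundary of a target set, given a partition of the universe into granules. How ROSE builds its granules and its Outlier Set is not stated in the abstract, so the granules, the candidate set, and the function name below are purely hypothetical.

```python
def rough_approximations(granules, target):
    """Return (lower, upper, boundary) of `target` with respect to `granules`.

    lower:    union of granules entirely contained in the target set
    upper:    union of granules that share at least one element with it
    boundary: objects in the upper but not in the lower approximation
    """
    target = set(target)
    lower, upper = set(), set()
    for granule in granules:
        granule = set(granule)
        if granule and granule <= target:
            lower |= granule
        if granule & target:
            upper |= granule
    return lower, upper, upper - lower

# Hypothetical universe {1, ..., 5} partitioned into three granules,
# and a hypothetical candidate outlier set {1, 2, 3}.
granules = [{1, 2}, {3, 4}, {5}]
candidate_outliers = {1, 2, 3}
lower, upper, boundary = rough_approximations(granules, candidate_outliers)
print(lower, upper, boundary)
```

Here objects 1 and 2 certainly belong to the set (lower approximation), while 3 and 4 fall in the boundary: they cannot be told apart at this granularity, which is the kind of object the abstract argues should be taken into account.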

A ROUGH SET APPROACH TO OUTLIER DETECTION IN SPATIO TEMPORAL DATA / A. Albanese ; Tutor: Alfredo Petrosino ; Coordinator: Ernesto Damiani. Università degli Studi di Milano, 2011 Mar 24. 23rd cycle, Academic Year 2010. [10.13130/albanese-alessia_phd2011-03-24].

A ROUGH SET APPROACH TO OUTLIER DETECTION IN SPATIO TEMPORAL DATA

A. Albanese
2011

24-mar-2011
Sector INF/01 - Computer Science
Outlier Detection ; Spatio-Temporal Data ; Spatiotemporal Uncertainty Management ; Granular Computing ; Rough-Sets
PETROSINO, ALFREDO
DAMIANI, ERNESTO
Doctoral Thesis
File in this record: phd_unimi_R07703.pdf (complete doctoral thesis, Adobe PDF, 4.73 MB). Open Access since 03/09/2011.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/155480