Fragment-based approximate retrieval in highly heterogeneous XML collections

Sanz, I.; Mesiti, M.; Guerrini, G.; Llaviori Berlanga, R.

doi:10.1016/j.datak.2007.05.008

Because XML data in Internet applications is heterogeneous, exact matching of queries is often inadequate. The need arises to quickly identify subtrees of XML documents in a collection that are similar to a given pattern. Similarity involves both tags, which are not required to coincide, and structure, in which not all the relationships among nodes in the tree need to be strictly preserved. In this paper we present an efficient approach to identifying similar subtrees that relies on ad hoc indexing structures. The approach quickly detects, in a heterogeneous document collection, the minimal portions that exhibit some similarity with the pattern. These candidate portions are then ranked according to their actual similarity. The approach supports different notions of similarity, so it can be customized to different application domains. Three similarity measures are proposed and compared. The approach is validated experimentally, and the results are discussed in detail.
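The two-phase idea in the abstract (use an index to locate candidate fragments, then rank them by similarity) can be illustrated with a minimal sketch. This is not the authors' algorithm or any of their three similarity measures. It assumes a plain inverted index from tags to node positions, treats tags as equal when they match case-insensitively, takes one fragment per document (the subtree under the lowest common ancestor of all matching nodes), and scores it by the fraction of pattern tags found. The names `index_tags`, `common_prefix`, and `rank_fragments` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_tags(docs, canon=lambda t: t.lower()):
    """Build an inverted index: canonical tag -> list of (doc_id, path).

    A path is the tuple of child positions leading from the root to a node.
    `canon` is an assumed stand-in for a real tag-similarity function.
    """
    index = defaultdict(list)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        stack = [(root, ())]
        while stack:
            node, path = stack.pop()
            index[canon(node.tag)].append((doc_id, path))
            for i, child in enumerate(node):
                stack.append((child, path + (i,)))
    return index


def common_prefix(paths):
    """Longest common prefix of the paths, i.e. their lowest common ancestor."""
    prefix = paths[0]
    for p in paths[1:]:
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def rank_fragments(index, pattern_tags, canon=lambda t: t.lower()):
    """Return (score, doc_id, fragment_root_path) tuples, best score first.

    The score is an assumed toy measure: the fraction of pattern tags
    that occur in the document.
    """
    wanted = {canon(t) for t in pattern_tags}
    hits = defaultdict(lambda: defaultdict(list))  # doc_id -> tag -> paths
    for tag in wanted:
        for doc_id, path in index.get(tag, []):
            hits[doc_id][tag].append(path)
    ranked = []
    for doc_id, by_tag in hits.items():
        all_paths = [p for paths in by_tag.values() for p in paths]
        score = len(by_tag) / len(wanted)
        ranked.append((score, doc_id, common_prefix(all_paths)))
    return sorted(ranked, key=lambda r: (-r[0], r[1]))
```

Calling `rank_fragments(index_tags(docs), ["title", "author"])` on a dictionary that maps document identifiers to XML strings returns one candidate fragment per matching document, ordered by score. A structure-aware measure would also compare the parent-child relationships inside each fragment, which this sketch leaves out.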

Fragment-based approximate retrieval in highly heterogeneous XML collections / I. Sanz, M. Mesiti, G. Guerrini, R. Llaviori Berlanga. - In: DATA & KNOWLEDGE ENGINEERING. - ISSN 0169-023X. - 64:1(2008 Jan), pp. 266-293. [10.1016/j.datak.2007.05.008]



Keywords: XML; Approximate structural retrieval; Enhanced indexing techniques
Scientific-disciplinary sectors of the article (display only): Sector INF/01 - Computer Science
Publication date: Jan 2008
Journal in ANCE: DATA & KNOWLEDGE ENGINEERING
DOI: https://dx.doi.org/10.1016/j.datak.2007.05.008
Type: Article (author)
Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
FragmentBased.pdf (open access; type: Publisher's version/PDF)	827.74 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/36968


IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca