IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

EXTENDING LINKED OPEN DATA RESOURCES EXPLOITING WIKIPEDIA AS SOURCE OF INFORMATION / A. Palmero Aprosio ; advisor: E. Damiani ; co-advisors: A. Lavelli, C. Giuliano. DIPARTIMENTO DI INFORMATICA, 2014 Mar 18. 25th cycle, Academic Year 2012. [10.13130/palmero-aprosio-alessio_phd2014-03-18].

EXTENDING LINKED OPEN DATA RESOURCES EXPLOITING WIKIPEDIA AS SOURCE OF INFORMATION

A. PALMERO APROSIO

2014

Abstract

DBpedia is a project that aims to represent Wikipedia content as RDF triples. It plays a central role in the Semantic Web because of the large and growing number of resources linked to it. Currently, the information contained in DBpedia is mainly collected from Wikipedia infoboxes, sets of subject-attribute-value triples that summarize a Wikipedia page. The extraction procedure requires manually mapping Wikipedia infoboxes to the DBpedia ontology. Thanks to crowdsourcing, a large number of infoboxes in the English Wikipedia have been mapped to the corresponding classes in DBpedia. Subsequently, the same procedure has been applied to other languages to create the localized versions of DBpedia. However, (i) the number of completed mappings is still small and limited to the most frequent infoboxes, (ii) mappings need maintenance because Wikipedia articles change constantly and quickly, and (iii) infoboxes are compiled manually by Wikipedia contributors, so the infobox is missing in more than 50% of Wikipedia articles. As a demonstration of these issues, only 1.7M Wikipedia pages are “deeply” classified in the DBpedia ontology, although the English Wikipedia contains almost 4M pages. This reveals a clear coverage problem, which is even worse in other languages (such as French and Spanish). The objective of this thesis is to define a methodology for increasing the coverage of DBpedia in different languages, using various techniques to reach two goals: automatic mapping generation and DBpedia dataset completion. A key aspect of our research is the multilinguality of Wikipedia: we bootstrap the available information through cross-language links, starting from the available mappings in some pivot languages, and then extend the existing DBpedia datasets (or create new ones from scratch) by comparing the classifications in different languages.
When the DBpedia classification is missing, we train a supervised classifier using the original DBpedia as training data. We also use the distant supervision paradigm to extract the missing properties directly from the Wikipedia articles. We evaluated our system using a manually annotated test set and some existing DBpedia mappings excluded from the training set. The results demonstrate the suitability of the approach for extending the DBpedia resource. Finally, the resulting resources are made available through a SPARQL endpoint and a downloadable package.

Defense date: 18 Mar 2014
Scientific-disciplinary sectors of the thesis (display only): Sector INF/01 - Computer Science
Supervisors affiliated with the University: DAMIANI, ERNESTO
Type: Doctoral Thesis
Appears in the following types: Doctoral thesis

Files in this item:

File	Size	Format
phd_unimi_R08605.pdf (open access; type: complete doctoral thesis)	3.79 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/233327
