EXTENDING LINKED OPEN DATA RESOURCES EXPLOITING WIKIPEDIA AS SOURCE OF INFORMATION / A. Palmero Aprosio ; advisor: E. Damiani ; co-advisors: A. Lavelli, C. Giuliano. DIPARTIMENTO DI INFORMATICA, 2014 Mar 18. 25th cycle, Academic Year 2012. [10.13130/palmero-aprosio-alessio_phd2014-03-18].
EXTENDING LINKED OPEN DATA RESOURCES EXPLOITING WIKIPEDIA AS SOURCE OF INFORMATION
A. PALMERO APROSIO
2014
Abstract
DBpedia is a project that aims to represent Wikipedia content as RDF triples. It plays a central role in the Semantic Web, due to the large and growing number of resources linked to it. Currently, the information contained in DBpedia is mainly collected from Wikipedia infoboxes, sets of subject-attribute-value triples that summarize a Wikipedia page. The extraction procedure requires manually mapping Wikipedia infoboxes onto the DBpedia ontology. Thanks to crowdsourcing, a large number of infoboxes in the English Wikipedia have been mapped to the corresponding DBpedia classes. Subsequently, the same procedure has been applied to other languages to create the localized versions of DBpedia. However, (i) the number of completed mappings is still small and limited to the most frequent infoboxes, (ii) mappings need maintenance due to the constant and rapid changes of Wikipedia articles, and (iii) infoboxes are manually compiled by Wikipedia contributors, so the infobox is missing in more than 50% of Wikipedia articles. As evidence of these issues, only 1.7M Wikipedia pages are “deeply” classified in the DBpedia ontology, although the English Wikipedia contains almost 4M pages. This shows a clear coverage problem, and the issue is even worse in other languages (such as French and Spanish). The objective of this thesis is to define a methodology to increase the coverage of DBpedia in different languages, using various techniques to reach two different goals: automatic mapping generation and DBpedia dataset completion. A key aspect of our research is multilinguality in Wikipedia: we bootstrap the available information through cross-language links, starting from the available mappings in some pivot languages, and then extend the existing DBpedia datasets (or create new ones from scratch) by comparing the classifications in different languages. When the DBpedia classification is missing, we train a supervised classifier using the original DBpedia as training data. We also use the Distant Supervision paradigm to extract the missing properties directly from the Wikipedia articles. We evaluated our system using a manually annotated test set and some existing DBpedia mappings excluded from the training. The results demonstrate the suitability of the approach for extending the DBpedia resource. Finally, the resulting resources are made available through a SPARQL endpoint and a downloadable package.
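As an illustration of the cross-language bootstrapping step mentioned in the abstract, the following minimal Python sketch propagates DBpedia class assignments from a pivot language to another language via Wikipedia cross-language links. All identifiers and data in the sketch are illustrative assumptions, not the thesis implementation; in practice the classifications would come from the existing DBpedia datasets and the links from Wikipedia's interlanguage link data.

# Minimal sketch (illustrative only, not the thesis code): propagate DBpedia
# class assignments from pivot-language pages to their counterparts in another
# language via Wikipedia cross-language links.

# DBpedia classification available in a pivot language: (lang, page title) -> ontology class
pivot_classification = {
    ("en", "Barack Obama"): "dbo:Politician",
    ("en", "Turin"): "dbo:City",
}

# Wikipedia cross-language links: (source lang, title) -> (target lang, title)
cross_language_links = {
    ("en", "Barack Obama"): ("it", "Barack Obama"),
    ("en", "Turin"): ("it", "Torino"),
}

def bootstrap(pivot_classification, cross_language_links):
    """Assign to each linked target-language page the class of its pivot-language counterpart."""
    target_classification = {}
    for source_page, dbpedia_class in pivot_classification.items():
        target_page = cross_language_links.get(source_page)
        if target_page is not None:
            target_classification[target_page] = dbpedia_class
    return target_classification

print(bootstrap(pivot_classification, cross_language_links))
# {('it', 'Barack Obama'): 'dbo:Politician', ('it', 'Torino'): 'dbo:City'}

Pages left unclassified by this propagation would then be handled by the supervised classifier and the distant-supervision property extraction described in the abstract.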
File: phd_unimi_R08605.pdf (complete doctoral thesis, open access), 3.79 MB, Adobe PDF