IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Fine-grained Named Entity Annotations for German Biographic Interviews / J. Ruppenhofer, I. Rehbein, C. Flinz. - In: Proceedings of the 12th Language Resources and Evaluation Conference / edited by N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis. - [s.l.] : European Language Resources Association, 2020. - ISBN 9791095546344. - pp. 4605-4614 (Paper presented at the 12th LREC conference, held in Marseille in 2020).

Fine-grained Named Entity Annotations for German Biographic Interviews

Ruppenhofer J.; Rehbein I.; Flinz C.

2020

Abstract

We present a fine-grained NER annotation scheme with 30 labels and apply it to German data. Building on the OntoNotes 5.0 NER inventory, our scheme is adapted for a corpus of transcripts of biographic interviews by adding categories for AGE and LAN(guage) as well as label classes for various numeric and temporal expressions. Applying the scheme to the spoken data and to a collection of teaser tweets from newspaper sites, we confirm its generality for both domains and achieve good inter-annotator agreement. We also show empirically how our inventory relates to the well-established 4-category NER inventory by re-annotating a subset of the GermEval 2014 coarse-grained NER dataset with our fine-grained label inventory. Finally, we use a BERT-based system to establish baselines for NER tagging on our two new datasets. Overall results in in-domain testing are quite high on both datasets, near what was achieved for the coarse inventory on the CoNLL-2003 data. Cross-domain testing produces much lower results due to the severe domain differences.

Keywords: Named Entity Recognition; spoken language; German; oral history corpora
Scientific-disciplinary sectors of the contribution (display only): Sector L-LIN/14 - Language and Translation - German Language
Publication date: 2020
URL: https://www.aclweb.org/anthology/2020.lrec-1.566/
Type: Book Part (author)
Appears in types: 03 - Book contribution

Files in this item:

File	Size	Format
Flinz_LREC_2020.pdf (open access; type: Publisher's version/PDF)	386.48 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/843012
