DaN+: Danish Nested Named Entities and Lexical Normalization

Abstract

This paper introduces DaN+, a multi-domain resource for nested named entities (NEs) and lexical normalization for Danish, a less-resourced language. We empirically assess: three strategies to model the two-layer NE annotations; cross-lingual, cross-domain transfer from German versus in-language annotation; language-specific versus multilingual BERT; and the effect of lexical normalization on Danish NER. Our results show that the most robust strategy is multi-task learning, rivaled by multi-label decoding; that transfer is successful even in the zero-shot setting; and that in-language BERT and lexical normalization work best on the least canonical data. However, our results also show that out-of-domain performance remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.
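
As a rough illustration of one of the modeling strategies compared in the paper, the sketch below shows a multi-task setup for the two-layer NE annotations: a shared BERT encoder with one tagging head per annotation layer. This is not the authors' code; the encoder name, label counts, and example sentence are illustrative assumptions.

```python
# Minimal sketch of multi-task learning for two-layer nested NER:
# a shared BERT encoder feeds two independent tagging heads,
# one for the outer NE layer and one for the nested (inner) layer.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TwoLayerNERTagger(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 num_labels_outer=9, num_labels_inner=9):  # label counts are assumptions
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One linear tagging head per NE layer.
        self.head_outer = nn.Linear(hidden, num_labels_outer)
        self.head_inner = nn.Linear(hidden, num_labels_inner)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Each head predicts a BIO tag sequence for its own annotation layer.
        return self.head_outer(states), self.head_inner(states)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = TwoLayerNERTagger()
batch = tokenizer(["Aarhus Universitet ligger i Aarhus."], return_tensors="pt")
outer_logits, inner_logits = model(batch["input_ids"], batch["attention_mask"])
print(outer_logits.shape, inner_logits.shape)  # (1, seq_len, num_labels) each
```

In the alternative multi-label decoding strategy mentioned in the abstract, the two layers would instead be predicted jointly as combined tags by a single head.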

Publication
The 28th International Conference on Computational Linguistics
Kristian Nørgaard Jensen