A Benchmark of Named Entity Recognition Approaches in Historical Documents
- Authors
- Nathalie Abadie, Edwin Carlinet, Joseph Chazalon, Bertrand Duménieu
- Where
- Proceedings of the 15th IAPR International Workshop on Document Analysis Systems
- Place
- La Rochelle, France
- Type
- inproceedings
- Publisher
- Springer
- Projects
- Olena
- Keywords
- Image
- Date
- 2022-04-07
Abstract
Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464
Bibtex (lrde.bib)
@InProceedings{ abadie.22.das,
  author    = {Nathalie Abadie and Edwin Carlinet and Joseph Chazalon and Bertrand Dum\'enieu},
  title     = {A Benchmark of Named Entity Recognition Approaches in Historical Documents},
  booktitle = {Proceedings of the 15th IAPR International Workshop on Document Analysis Systems},
  year      = 2022,
  address   = {La Rochelle, France},
  month     = 5,
  abstract  = {Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. ``person name'', ``location'', ``number''. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464},
  series    = {Lecture Notes in Computer Science},
  volume    = 13237,
  pages     = {445--460},
  publisher = {Springer},
  doi       = {10.1007/978-3-031-06555-2_30},
  lrderank  = {B}
}