A Benchmark of Named Entity Recognition Approaches in Historical Documents

From LRDE

Revision as of 14:26, 26 April 2022 by Bot (talk | contribs)

Abstract

Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR systems are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464

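As the abstract notes, NER amounts to assigning a class to each text token. The following toy sketch illustrates that token-classification view on a directory-style entry; the rule-based tagger and the tag names are purely illustrative assumptions for exposition, not the CNN- or Transformer-based models the paper actually benchmarks.

```python
# Illustrative only: a toy rule-based tagger showing the token-classification
# view of NER on a 19th-century-directory-style entry. The rules and tag
# names ("person name", "location", "number", "other") are hypothetical.

STREET_WORDS = {"rue", "boulevard", "quai", "passage"}

def toy_ner(tokens):
    """Assign a coarse entity class to each token of a directory entry."""
    tags = []
    in_location = False  # once a street word is seen, following words extend the location
    for tok in tokens:
        if tok.isdigit():
            tags.append("number")
            in_location = False
        elif tok.lower() in STREET_WORDS or in_location:
            tags.append("location")
            in_location = True
        elif tok[:1].isupper():
            tags.append("person name")
        else:
            tags.append("other")
    return list(zip(tokens, tags))

# Example entry: "Dupont, tailleur, rue de Rivoli, 12"
print(toy_ner(["Dupont", "tailleur", "rue", "de", "Rivoli", "12"]))
```

Real systems replace these hand-written rules with learned per-token classifiers, which is precisely what makes them sensitive to the OCR noise the paper measures.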

Bibtex (lrde.bib)

@InProceedings{	  abadie.22.das,
  author	= {Nathalie Abadie and Edwin Carlinet and Joseph Chazalon and
		  Bertrand Dum\'enieu},
  title		= {A Benchmark of Named Entity Recognition Approaches in
		  Historical Documents},
  booktitle	= {Proceedings of the 15th IAPR International Workshop on
		  Document Analysis Systems},
  year		= 2022,
  address	= {La Rochelle, France},
  month		= may,
  abstract	= {Named entity recognition (NER) is a necessary step in many
		  pipelines targeting historical documents. Indeed, such
		  natural language processing techniques identify which class
		  each text token belongs to, e.g. ``person name'',
		  ``location'', ``number''. Introducing a new public dataset
		  built from 19th century French directories, we first assess
		  how noisy modern, off-the-shelf OCR systems are. Then, we
		  compare
		  modern CNN- and Transformer-based NER techniques which can
		  be reasonably used in the context of historical document
		  analysis. We measure their requirements in terms of
		  training data, the effects of OCR noise on their
		  performance, and show how Transformer-based NER can benefit
		  from unsupervised pre-training and supervised fine-tuning
		  on noisy data. Results can be reproduced using resources
		  available at
		  https://github.com/soduco/paper-ner-bench-das22 and
		  https://zenodo.org/record/6394464},
  note		= {accepted}
}