Difference between revisions of "Publications/abadie.22.das"
From LRDE
(Created page with "{{Publication | published = true | date = 2022-04-07 | authors = Nathalie Abadie, Edwin Carlinet, Joseph Chazalon, Bertrand Duménieu | title = A Benchmark of Named Entity Rec...") |
(4 intermediate revisions by the same user not shown)
Line 6: | Line 6:
| booktitle = Proceedings of the 15th IAPR International Workshop on Document Analysis Systems
| address = La Rochelle, France
− | | abstract = Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs |
+ | | abstract = Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464
+ | | series = Lecture Notes in Computer Science
+ | | volume = 13237
+ | | pages = 445 to 460
+ | | publisher = Springer
| lrdenewsdate = 2022-04-07
+ | | lrdepaper = http://www.lrde.epita.fr/dload/papers/abadie.22.das.pdf
+ | | lrdeprojects = Olena
+ | | lrdekeywords = Image
+ | | lrderank = B
| type = inproceedings
| id = abadie.22.das
+ | | identifier = doi:10.1007/978-3-031-06555-2_30
| bibtex =
@InProceedings<nowiki>{</nowiki> abadie.22.das,
Line 20: | Line 29:
year = 2022,
address = <nowiki>{</nowiki>La Rochelle, France<nowiki>}</nowiki>,
− | month = |
+ | month = 5, |
abstract = <nowiki>{</nowiki>Named entity recognition (NER) is a necessary step in many
pipelines targeting historical documents. Indeed, such
Line 37: | Line 46:
available at
https://github.com/soduco/paper-ner-bench-das22 and
− | https://zenodo.org/record/6394464<nowiki>}</nowiki> |
+ | https://zenodo.org/record/6394464<nowiki>}</nowiki>, |
+ | series = <nowiki>{</nowiki>Lecture Notes in Computer Science<nowiki>}</nowiki>,
+ | volume = 13237,
+ | pages = <nowiki>{</nowiki>445--460<nowiki>}</nowiki>,
+ | publisher = <nowiki>{</nowiki>Springer<nowiki>}</nowiki>,
+ | doi = <nowiki>{</nowiki>10.1007/978-3-031-06555-2_30<nowiki>}</nowiki>,
+ | lrderank = <nowiki>{</nowiki>B<nowiki>}</nowiki>
<nowiki>}</nowiki>
Latest revision as of 13:47, 30 August 2023
- Authors: Nathalie Abadie, Edwin Carlinet, Joseph Chazalon, Bertrand Duménieu
- Where: Proceedings of the 15th IAPR International Workshop on Document Analysis Systems
- Place: La Rochelle, France
- Type: inproceedings
- Publisher: Springer
- Projects: Olena
- Keywords: Image
- Date: 2022-04-07
Abstract
Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464
Documents
Bibtex (lrde.bib)
@InProceedings{ abadie.22.das,
  author = {Nathalie Abadie and Edwin Carlinet and Joseph Chazalon and Bertrand Dum\'enieu},
  title = {A Benchmark of Named Entity Recognition Approaches in Historical Documents},
  booktitle = {Proceedings of the 15th IAPR International Workshop on Document Analysis Systems},
  year = 2022,
  address = {La Rochelle, France},
  month = 5,
  abstract = {Named entity recognition (NER) is a necessary step in many
    pipelines targeting historical documents. Indeed, such
    natural language processing techniques identify which class
    each text token belongs to, e.g. ``person name'',
    ``location'', ``number''. Introducing a new public dataset
    built from 19th century French directories, we first assess
    how noisy modern, off-the-shelf OCR are. Then, we compare
    modern CNN- and Transformer-based NER techniques which can
    be reasonably used in the context of historical document
    analysis. We measure their requirements in terms of training
    data, the effects of OCR noise on their performance, and
    show how Transformer-based NER can benefit from unsupervised
    pre-training and supervised fine-tuning on noisy data.
    Results can be reproduced using resources available at
    https://github.com/soduco/paper-ner-bench-das22 and
    https://zenodo.org/record/6394464},
  series = {Lecture Notes in Computer Science},
  volume = 13237,
  pages = {445--460},
  publisher = {Springer},
  doi = {10.1007/978-3-031-06555-2_30},
  lrderank = {B}
}