LRDE Document Binarization Dataset (LRDE DBD)

Release date: February 2013
Version: 1.0

This dataset is composed of full-document images, groundtruth, and tools to evaluate binarization algorithms. It supports both pixel-based accuracy and OCR-based evaluations.

Sample Images

Four sample images (original document, clean document, scanned document, binarization groundtruth) are shown as thumbnails; full-resolution versions are available here:

  • Original document: https://www.lrde.epita.fr/dload/olena/datasets/dbd/samples/original.png
  • Clean document: https://www.lrde.epita.fr/dload/olena/datasets/dbd/samples/clean.png
  • Scanned document: https://www.lrde.epita.fr/dload/olena/datasets/dbd/samples/scanned.png
  • Binarization groundtruth: https://www.lrde.epita.fr/dload/olena/datasets/dbd/samples/gt.png

Description

Data

This dataset is composed of document images and tools to perform an evaluation of binarization algorithms.

Documents have been extracted from the same magazine. Text language is French.

The provided dataset is composed of:

  • Full-Document Images (A4 format, 300-dpi resolution)
    • 125 digital "original documents" extracted from a PDF, with full OCR groundtruth.
    • 125 digital "clean documents" created from the "original documents", from which the images have been removed.
    • 125 "scanned documents" based on the "clean documents": they have been printed, scanned, and registered to match the "clean documents".
  • Text Lines Localization Information
    • Localization of 123 large text lines (clean).
    • Localization of 320 medium text lines (clean).
    • Localization of 9551 small text lines (clean).
    • Localization of 123 large text lines (original).
    • Localization of 320 medium text lines (original).
    • Localization of 9551 small text lines (original).
    • Localization of 123 large text lines (scanned).
    • Localization of 320 medium text lines (scanned).
    • Localization of 9551 small text lines (scanned).
  • Groundtruth
    • 125 binarized images for "clean documents".
    • 123 OCR outputs for large text lines.
    • 320 OCR outputs for medium text lines.
    • 9551 OCR outputs for small text lines.

Image groundtruths have been produced using a semi-automatic process: a global thresholding followed by some manual adjustments.
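
For reference, the global-thresholding step of that process might look like the sketch below (using NumPy and Pillow; the fixed threshold value is purely illustrative, and the subsequent manual adjustments are not shown):

  import numpy as np
  from PIL import Image

  def global_threshold(path, threshold=128):
      """Binarize a grayscale page with a single global threshold:
      pixels darker than `threshold` are marked as ink (True)."""
      gray = np.asarray(Image.open(path).convert("L"))
      return gray < threshold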

The size category of a text line depends on its x-height: a line is small if 0 < x-height <= 30, medium if 30 < x-height <= 55, and large if x-height > 55.
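
Expressed as code, this rule might look like the following minimal sketch (the function name is illustrative, not part of the dataset tools):

  def size_category(x_height):
      """Classify a text line by its x-height, following the rule above."""
      if x_height <= 0:
          raise ValueError("x-height must be positive")
      if x_height <= 30:
          return "small"
      if x_height <= 55:
          return "medium"
      return "large"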

The text lines dataset covers only a subset of the full-document dataset. It is generated from the binarization of the full-document images.

Purpose of the three document qualities:

  • Original: evaluate the binarization quality on perfect documents mixing text and images.
  • Clean: evaluate the binarization quality on perfect documents with text only.
  • Scanned: evaluate the binarization quality on slightly degraded documents with text only.

Tools

Implementations of several binarization algorithms (including Wolf's) are available in the Olena platform and are evaluated by default with this benchmark.

These implementations are based on the image processing platform Olena. They are released under the GNU GPLv2 license.

We provide a Python script to automate the download and installation of the whole framework and tools necessary for the benchmark. Therefore, you can easily replay the experiments, inspect the results and later compare these techniques to other approaches.

The benchmark process is performed in two steps:

  1. The quality of the binarization of the "clean documents" is evaluated.
  2. For each binarization algorithm, a selection of lines is passed to the OCR. The OCR output is then compared to the groundtruth and scored using the mean edit distance (see the sketch below). Lines are grouped by x-height (small, medium, and large); results are given for each size and each document quality ("clean" or "scanned").
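
As an illustration of both steps, the sketch below computes a pixel-based F-measure for a binarized page and a mean edit distance normalized by line length for OCR outputs. It is not the dataset's benchmark tool; the exact metrics and normalization used there may differ.

  import numpy as np

  def pixel_f_measure(binarized, groundtruth):
      """F-measure between two boolean masks where True marks ink pixels."""
      tp = np.logical_and(binarized, groundtruth).sum()
      fp = np.logical_and(binarized, ~groundtruth).sum()
      fn = np.logical_and(~binarized, groundtruth).sum()
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

  def edit_distance(a, b):
      """Levenshtein distance between two strings (dynamic programming)."""
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          curr = [i]
          for j, cb in enumerate(b, 1):
              curr.append(min(prev[j] + 1,                 # deletion
                              curr[j - 1] + 1,             # insertion
                              prev[j - 1] + (ca != cb)))   # substitution
          prev = curr
      return prev[-1]

  def mean_edit_distance(ocr_lines, gt_lines):
      """Mean edit distance between OCR outputs and groundtruth lines,
      normalized by the groundtruth line length."""
      scores = [edit_distance(o, g) / max(len(g), 1)
                for o, g in zip(ocr_lines, gt_lines)]
      return sum(scores) / len(scores)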


Copyright Notice

LRDE is the copyright holder of all the images included in the dataset except for the original documents subset, which is copyrighted by Le Nouvel Observateur. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.

You are allowed to reuse these documents for research purposes (evaluation and illustration). If so, please include the following copyright notice: "Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur". You are not allowed to redistribute this dataset.

If you use this dataset, please also cite the most appropriate paper from this list:

  • Efficient Multiscale Sauvola's Binarization. International Journal of Document Analysis and Recognition (IJDAR), 2013.
  • The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.

This data set is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Download

Setup script (v1.0)

A setup script is available to download and set up both the tools and the data. This is the recommended way to run this benchmark. Note that this script also includes features to update the dataset if a new version is released.

Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, ...)

Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically with the setup script on Ubuntu and Debian).

Download the setup script: http://www.lrde.epita.fr/dload/olena/datasets/dbd/setup.py

Data (v1.0)

For convenience, data is also available separately:

File                       Size    URL
Original documents         213MB   http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/nouvel_obs_2402_orig-1.0.zip
Clean documents            67MB    http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/nouvel_obs_2402_clean-1.0.zip
Scanned documents          583MB   http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/nouvel_obs_2402_scanned-1.0.zip
Text lines localization    9.8MB   http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/nouvel_obs_2402_textlines-1.0.zip
Binarization groundtruth   21MB    http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/nouvel_obs_2402_bin_gt-1.0.zip
OCR groundtruth            3.4MB   http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/nouvel_obs_2402_ocr_gt-1.0.zip
Benchmark tools            100KB   http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/lrde-dbd-tools-1.0.zip

Or simply browse the directory: http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0
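
As an alternative to the setup script, individual archives can also be fetched and unpacked with a few lines of Python. This is only an illustrative sketch (the selection of archives and the output directory are arbitrary), not the provided setup.py:

  import urllib.request
  import zipfile
  from pathlib import Path

  BASE_URL = "http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0"
  ARCHIVES = [
      "nouvel_obs_2402_clean-1.0.zip",
      "nouvel_obs_2402_bin_gt-1.0.zip",
      "lrde-dbd-tools-1.0.zip",
  ]

  def fetch_dataset(dest="lrde-dbd"):
      """Download a subset of the LRDE DBD archives and unpack them into `dest`."""
      dest = Path(dest)
      dest.mkdir(parents=True, exist_ok=True)
      for name in ARCHIVES:
          archive = dest / name
          urllib.request.urlretrieve(f"{BASE_URL}/{name}", archive)
          with zipfile.ZipFile(archive) as zf:
              zf.extractall(dest)

  if __name__ == "__main__":
      fetch_dataset()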

Pre-computed outputs for several common binarization algorithms are available at http://www.lrde.epita.fr/dload/olena/datasets/dbd/1.0/outputs/.

Acknowledgements

The LRDE is very grateful to Yan Gilbert, who allowed us to use and publish as data some pages from the French magazine "Le Nouvel Observateur" (issue 2402, November 18th-24th, 2010) for our experiments.