LRDE Document Binarization Dataset (LRDE DBD)

Release date: February 2013
Version: 1.0

This dataset is composed of full-document images, groundtruth, and tools to evaluate binarization algorithms. It supports both pixel-based accuracy and OCR-based evaluations.

Sample Images

[Sample images: original document, clean document, scanned document, and binarization groundtruth. Full-resolution versions are provided with the dataset.]

Description

Data

This dataset is composed of document images and tools to perform an evaluation of binarization algorithms.

All documents were extracted from the same magazine; the text is in French.

The provided dataset is composed of:

  • Full-Document Images (A4 format, 300-dpi resolution)
    • 125 digital "original documents" extracted from a PDF, with full OCR groundtruth.
    • 125 digital "clean documents" created from the "original documents" by removing the images.
    • 125 "scanned documents" based on the "clean documents": they have been printed, scanned, and registered to match the "clean documents".
  • Text Line Localization Information
    • 123 large text line localizations (clean).
    • 320 medium text line localizations (clean).
    • 9551 small text line localizations (clean).
    • 123 large text line localizations (original).
    • 320 medium text line localizations (original).
    • 9551 small text line localizations (original).
    • 123 large text line localizations (scanned).
    • 320 medium text line localizations (scanned).
    • 9551 small text line localizations (scanned).
  • Groundtruth
    • 125 binarized images for "clean documents".
    • 123 OCR outputs for large text lines.
    • 320 OCR outputs for medium text lines.
    • 9551 OCR outputs for small text lines.

Image groundtruths were produced using a semi-automatic process: a global thresholding followed by manual adjustments.
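
The thresholding step is not detailed on this page; the sketch below illustrates the idea in Python, assuming Otsu's method and the scikit-image API, which are not necessarily what was actually used:

    # A minimal global-thresholding sketch. Otsu's method is an assumption
    # here; the page only says "a global thresholding". The output is then
    # adjusted by hand before being stored as the binarization groundtruth.
    import numpy as np
    from skimage.io import imread, imsave
    from skimage.color import rgb2gray
    from skimage.filters import threshold_otsu

    def global_binarization(path_in, path_out):
        gray = rgb2gray(imread(path_in))   # grayscale values in [0, 1]
        t = threshold_otsu(gray)           # one threshold for the whole page
        binary = gray < t                  # True on (darker) ink pixels
        imsave(path_out, (binary * 255).astype(np.uint8))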

The size category of a text line depends on its x-height: small if 0 < x-height <= 30, medium if 30 < x-height <= 55, and large if x-height > 55.
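
For reference, the rule translates directly into code (x-height is presumably measured in pixels at the 300-dpi scan resolution; the page does not state the unit):

    def size_category(x_height):
        """Map a text line's x-height to the dataset's size category."""
        if x_height <= 0:
            raise ValueError("x-height must be positive")
        if x_height <= 30:
            return "small"
        if x_height <= 55:
            return "medium"
        return "large"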

The text lines dataset covers only a subset of the full-document dataset. It is generated from the binarization of the full-document images.

Purpose of the three document qualities:

  • Original: evaluate binarization quality on perfect documents mixing text and images.
  • Clean: evaluate binarization quality on perfect documents with text only.
  • Scanned: evaluate binarization quality on slightly degraded documents with text only.

Tools

Implementations of several binarization algorithms are available in the Olena image processing platform and are evaluated by default with this benchmark.

These implementations are released under the GNU GPLv2 license.

We provide a Python script to automate the download and installation of the whole framework and the tools needed for the benchmark. You can thus easily replay the experiments, inspect the results, and later compare these techniques with other approaches.

The benchmark process is performed in two steps:

  1. The quality of the binarization of the "clean documents" is evaluated at the pixel level (a sketch of such a measure is given after this list).
  2. For each binarization algorithm, a selection of lines is passed to the OCR. The OCR output is then compared to the groundtruth and scored with the mean edit distance (also sketched below). Lines are grouped by x-height (small, medium, and large); results are given for each size and each document quality ("clean" or "scanned").
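
For step 1, the page does not name the exact pixel-level metrics computed by the tools; precision, recall, and F-measure against the binarization groundtruth are the usual choices for this task and are sketched here as an illustration:

    import numpy as np

    def pixel_scores(result, groundtruth):
        """Both arguments are boolean arrays, True on ink (foreground) pixels."""
        tp = np.logical_and(result, groundtruth).sum()   # true positives
        fp = np.logical_and(result, ~groundtruth).sum()  # false positives
        fn = np.logical_and(~result, groundtruth).sum()  # false negatives
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure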
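
For step 2, a mean edit distance can be computed as below. Normalizing each line's distance by the length of its groundtruth is an assumption (a common convention); the page does not specify the normalization used by the benchmark:

    def edit_distance(a, b):
        """Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def mean_edit_distance(ocr_lines, groundtruth_lines):
        """Average per-line edit distance, normalized by groundtruth length."""
        dists = [edit_distance(o, g) / float(max(len(g), 1))
                 for o, g in zip(ocr_lines, groundtruth_lines)]
        return sum(dists) / len(dists)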


Copyright Notice

LRDE is the copyright holder of all the images included in the dataset, except for the "original documents" subset, which is copyrighted by Le Nouvel Observateur. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18-24, 2010.

You are allowed to reuse these documents for research purposes, for evaluation and illustration. If you do, please include the following copyright notice: "Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur". You are not allowed to redistribute this dataset.

If you use this dataset, please also cite the most appropriate paper from this list:

This data set is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Download

Setup script (v1.0)

A setup script is available to download and set up both the tools and the data. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.

Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, ...)

Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).

Data (v1.0)

For convenience, data is also available separately:

File                       Size
Original documents         213 MB
Clean documents            67 MB
Scanned documents          583 MB
Text lines localization    9.8 MB
Binarization groundtruth   21 MB
OCR groundtruth            3.4 MB
Benchmark tools            100 KB

Or simply browse the directory.

Pre-computed outputs for several common binarization algorithms are available here.

Acknowledgements

The LRDE is very grateful to Yan Gilbert, who agreed to let us use and publish as data some pages from the French magazine "Le Nouvel Observateur" (issue 2402, November 18-24, 2010) for our experiments.