LRDE Document Binarization Dataset (LRDE DBD)

Release date: February 2013
Version: 1.0

This dataset is composed of full-document images, groundtruth, and tools to evaluate binarization algorithms. It supports both pixel-based accuracy and OCR-based evaluations.

Sample Images

[Sample images: original document, clean document, scanned document, and binarization groundtruth. Full-resolution versions are provided with the dataset.]

Description

Data

This dataset is composed of document images and tools to perform an evaluation of binarization algorithms.

All documents were extracted from the same magazine; the text is in French.

The provided dataset is composed of:

  • Full-Document Images (A4 format, 300-dpi resolution)
    • 125 digital "original documents" extracted from a PDF, with full OCR groundtruth.
    • 125 digital "clean documents" created from the "original documents" by removing the images.
    • 125 "scanned documents" based on the "clean documents": they have been printed, scanned, and registered to match the "clean documents".
  • Text Line Localization Information
    • 123 large text line localizations (clean).
    • 320 medium text line localizations (clean).
    • 9551 small text line localizations (clean).
    • 123 large text line localizations (original).
    • 320 medium text line localizations (original).
    • 9551 small text line localizations (original).
    • 123 large text line localizations (scanned).
    • 320 medium text line localizations (scanned).
    • 9551 small text line localizations (scanned).
  • Groundtruth
    • 125 binarized images for "clean documents".
    • 123 OCR outputs for large text lines.
    • 320 OCR outputs for medium text lines.
    • 9551 OCR outputs for small text lines.

Image groundtruths were produced using a semi-automatic process: a global thresholding followed by manual adjustments.
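
The thresholding step is not detailed on this page; the sketch below illustrates the idea in Python, assuming Otsu's method and the scikit-image API, which are not necessarily what was actually used:

    # A minimal global-thresholding sketch. Otsu's method is an assumption
    # here; the page only says "a global thresholding". The output is then
    # adjusted by hand before being stored as the binarization groundtruth.
    import numpy as np
    from skimage.io import imread, imsave
    from skimage.color import rgb2gray
    from skimage.filters import threshold_otsu

    def global_binarization(path_in, path_out):
        gray = rgb2gray(imread(path_in))   # grayscale values in [0, 1]
        t = threshold_otsu(gray)           # one threshold for the whole page
        binary = gray < t                  # True on (darker) ink pixels
        imsave(path_out, (binary * 255).astype(np.uint8))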

The size category of a text line depends on its x-height: small if 0 < x-height <= 30, medium if 30 < x-height <= 55, and large if x-height > 55.
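
For reference, the rule translates directly into code (x-height is presumably measured in pixels at the 300-dpi scan resolution; the page does not state the unit):

    def size_category(x_height):
        """Map a text line's x-height to the dataset's size category."""
        if x_height <= 0:
            raise ValueError("x-height must be positive")
        if x_height <= 30:
            return "small"
        if x_height <= 55:
            return "medium"
        return "large"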

The text lines dataset covers only a subset of the full-document dataset. It is generated from the binarization of the full-document images.

Purpose of the three document qualities:

  • Original: evaluate binarization quality on perfect documents mixing text and images.
  • Clean: evaluate binarization quality on perfect documents with text only.
  • Scanned: evaluate binarization quality on slightly degraded documents with text only.

Tools

Implementations of several binarization algorithms are available in the Olena image processing platform and are evaluated by default with this benchmark.

These implementations are released under the GNU GPLv2 license.

We provide a Python script to automate the download and installation of the whole framework and the tools needed for the benchmark. You can thus easily replay the experiments, inspect the results, and later compare these techniques with other approaches.

The benchmark process is performed in two steps:

  1. The quality of the binarization of the "clean documents" is evaluated at the pixel level (a sketch of such a measure is given after this list).
  2. For each binarization algorithm, a selection of lines is passed to the OCR. The OCR output is then compared to the groundtruth and scored with the mean edit distance (also sketched below). Lines are grouped by x-height (small, medium, and large); results are given for each size and each document quality ("clean" or "scanned").
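
For step 1, the page does not name the exact pixel-level metrics computed by the tools; precision, recall, and F-measure against the binarization groundtruth are the usual choices for this task and are sketched here as an illustration:

    import numpy as np

    def pixel_scores(result, groundtruth):
        """Both arguments are boolean arrays, True on ink (foreground) pixels."""
        tp = np.logical_and(result, groundtruth).sum()   # true positives
        fp = np.logical_and(result, ~groundtruth).sum()  # false positives
        fn = np.logical_and(~result, groundtruth).sum()  # false negatives
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure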
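
For step 2, a mean edit distance can be computed as below. Normalizing each line's distance by the length of its groundtruth is an assumption (a common convention); the page does not specify the normalization used by the benchmark:

    def edit_distance(a, b):
        """Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def mean_edit_distance(ocr_lines, groundtruth_lines):
        """Average per-line edit distance, normalized by groundtruth length."""
        dists = [edit_distance(o, g) / float(max(len(g), 1))
                 for o, g in zip(ocr_lines, groundtruth_lines)]
        return sum(dists) / len(dists)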


Copyright Notice

LRDE is the copyright holder of all the images included in the dataset, except for the "original documents" subset, which is copyrighted by Le Nouvel Observateur. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18-24, 2010.

You are allowed to reuse these documents for research purposes, for evaluation and illustration. If you do, please include the following copyright notice: "Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur". You are not allowed to redistribute this dataset.

If you use this dataset, please also cite the most appropriate paper from this list:

This data set is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Download

Setup script (v1.0)

A setup script is available to download and set up both the tools and the data. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.

Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, ...)

Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).

Data (v1.0)

For convenience, data is also available separately:

File                       Size
Original documents         213 MB
Clean documents            67 MB
Scanned documents          583 MB
Text lines localization    9.8 MB
Binarization groundtruth   21 MB
OCR groundtruth            3.4 MB
Benchmark tools            100 KB

Or simply browse the directory.

Pre-computed outputs for several common binarization algorithms are available here.

Acknowledgements

The LRDE is very grateful to Yan Gilbert, who agreed to let us use and publish as data some pages from the French magazine "Le Nouvel Observateur" (issue 2402, November 18-24, 2010) for our experiments.