EvaLTex

From LRDE

EvaLTex (Evaluating Text Localization) is a unified evaluation framework used to measure the performance of text detection and text segmentation algorithms. It takes as input text objects represented either by rectangle coordinates or by irregular masks. The output consists of a set of scores, at local and global levels, and a visual representation of the behavior of the analysed algorithm through quality histograms.

For more details on the evaluation protocol, read the scientific paper published in the Image and Vision Computing Journal and the Ph.D Thesis. Details on the visual representation of the evaluation can be found in the article published in the Proc. of International Conference in Document Analysis and Recognition. To use the protocol for segmentation purposes please check out the article published in the Proc. of International Workshop on Robust Reading.



Please cite the following papers in all publications that use EvaLTex:
IVC for text detection evaluation
ECCV for text segmentation evaluation
ICDAR for the histogram representation and EMD metrics.


Evaluation performance measurements

Local evaluation

For each GT object $G_i$ matched by a detection $D_j$, we assign two quality measures: Coverage (Cov) and Accuracy (Acc):

  • Cov computes the rate of the matched area with respect to the GT object area

$Cov_i = \frac{Area(G_i \cap D_j)}{Area(G_i)}$

  • Acc computes the rate of the matched area with respect to the detection area

$Acc_i = \frac{Area(G_i \cap D_j)}{Area(D_j)}$

The two quality metrics are adapted based on the type of matching (one-to-one, one-to-many, many-to-one or many-to-many). For more details please refer to the scientific paper published in the Image and Vision Computing Journal and the details in Chapter 3 of this Ph.D Thesis.
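For the simple one-to-one case with axis-aligned rectangles, the two local measures can be sketched as follows (boxes given as (x, y, width, height), matching the input format used by the tool; names are illustrative, and the many-to-many adaptations follow the paper, not this sketch):

```python
def inter_area(g, d):
    """Intersection area of two axis-aligned boxes given as (x, y, w, h)."""
    gx, gy, gw, gh = g
    dx, dy, dw, dh = d
    iw = min(gx + gw, dx + dw) - max(gx, dx)
    ih = min(gy + gh, dy + dh) - max(gy, dy)
    return max(iw, 0) * max(ih, 0)

def coverage(g, d):
    # Cov_i: matched area relative to the GT object area
    return inter_area(g, d) / (g[2] * g[3])

def accuracy(g, d):
    # Acc_i: matched area relative to the detection area
    return inter_area(g, d) / (d[2] * d[3])
```

For example, a detection shifted halfway across a GT box yields Cov = Acc = 0.5.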

Global evaluation

The global evaluation consists of a set of measurements: a global recall $R_G$, a quantitative recall $R_{quant}$, a qualitative recall $R_{qual}$, a global precision $P_G$, a quantitative precision $P_{quant}$, a qualitative precision $P_{qual}$, a split metric, as well as an overall F-Score value.
In addition, the tool provides two histogram representations of the local qualities and a derived set of metrics ($R_{EMD}$ and $P_{EMD}$) computed using histogram distances. To define all these metrics, we use the following notation:

  • $N_G$ = nb. of GT objects in the image/dataset
  • $TP$ = nb. of true positives (GT objects that were detected)
  • $FP$ = nb. of false positives (detections with no correspondence in the GT)


Recall. The Recall ($R_G$) computes the amount of detected text and is defined as the product of two terms:

$R_G = \frac{\sum_{i=1}^{N_G} Cov_i}{N_G} = \frac{TP}{N_G} \cdot \frac{\sum_{i=1}^{N_G} Cov_i}{TP}$

The left term of the product represents the ratio between the number of true positives and the total number of GT objects. We interpret this ratio as the quantity Recall $R_{quant}$, as it describes the percentage of detected GT objects, regardless of their coverage:

$R_{quant} = \frac{TP}{N_G}$

The second term is obtained by averaging the coverage rates of the detected GT objects. Intuitively, we can denote this proportion as the quality Recall, $R_{qual}$, as it characterizes the mean covered surface of the GT:

$R_{qual} = \frac{\sum_{i=1}^{N_G} Cov_i}{TP}$
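The decomposition above can be checked numerically; a minimal sketch (names are illustrative), where `coverages` holds the $Cov_i$ values of the detected (TP) GT objects and undetected objects simply contribute nothing:

```python
def recall_scores(coverages, n_gt):
    """Compute (R_G, R_quant, R_qual) from the Cov_i of the TP GT objects."""
    tp = len(coverages)
    r_quant = tp / n_gt                          # quantity: TP / N_G
    r_qual = sum(coverages) / tp if tp else 0.0  # quality: mean Cov over TPs
    r_g = r_quant * r_qual                       # equals sum(coverages) / n_gt
    return r_g, r_quant, r_qual
```

With 3 objects detected out of 4 and coverages (1.0, 0.5, 0.5), this gives $R_{quant} = 0.75$, $R_{qual} = 2/3$ and $R_G = 0.5$.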


Precision. By applying the same reasoning, we obtain the following decomposition of the global Precision $P_G$:

$P_G = \frac{\sum_{i=1}^{N_G} Acc_i}{TP+FP} = \frac{TP}{TP+FP} \cdot \frac{\sum_{i=1}^{N_G} Acc_i}{TP}$

Here again, the left term of the product provides an insight on the percentage of detections that have a correspondence in the GT. Consequently, we call this measure the quantity Precision $P_{quant}$:

$P_{quant} = \frac{TP}{TP+FP}$

Conversely, the right term computes the average accuracy obtained from the matching between the detection set and the GT. This ratio will then be referred to as the quality Precision $P_{qual}$:

$P_{qual} = \frac{\sum_{i=1}^{N_G} Acc_i}{TP}$


Split. The Split metric evaluates the level of GT fragmentation in a dataset and is computed as:

$S = \frac{\sum_{i=1}^{N_G} \left( \frac{1}{1+\ln^2(s_i)} \cdot 0.6 + 0.4 \right)}{N_G}$, where $s_i$ = nb. of detections matching $G_i$

The Split measure can be used as an individual metric or integrated in the Recall computation. For more details please refer to the scientific paper published in the Image and Vision Computing Journal and the details in Chapter 3 of this Ph.D Thesis.
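Following the formula above, with the 0.6/0.4 weighting applied inside the sum so that an unfragmented object ($s_i = 1$) contributes exactly 1, a minimal sketch over the matched GT objects (how unmatched objects enter the sum is left to the paper):

```python
import math

def split_score(splits):
    """Split metric: splits holds s_i (nb. of detections matching G_i)
    for each GT object considered."""
    terms = (0.6 / (1.0 + math.log(s) ** 2) + 0.4 for s in splits)
    return sum(terms) / len(splits)
```

A dataset where every GT object is matched by a single detection scores 1; any fragmentation ($s_i > 1$) pulls the score below 1.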


F-Score. As an overall metric, we use the well-known F-Score, defined as:

$F_G = \frac{2 \cdot R_G \cdot P_G}{R_G + P_G}$
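The harmonic mean above is straightforward to compute; a small helper (the guard for the degenerate all-zero case is an assumption, not something specified by the tool):

```python
def f_score(r_g, p_g):
    """Harmonic mean of global Recall and Precision (F_G).
    Returns 0 when both inputs are 0 (guard, assumed behaviour)."""
    if r_g + p_g == 0:
        return 0.0
    return 2 * r_g * p_g / (r_g + p_g)
```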

Quality histograms.

Histograms can be seen as convenient tools to represent simultaneously the quality and quantity aspects of a set of detections: the quality aspect can be described by the histogram's bin (each bin corresponds to a coverage or accuracy interval); the detection quantity feature can be represented by the bin values (for example, the bin value counts how many GT objects have a coverage or accuracy value that belongs to that bin's interval). The coverage and accuracy histograms intuitively provide at a glance different properties of the detection (or segmentation) behaviour, as illustrated in the following figures:

Coverage histogram
Accuracy histogram
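A quality histogram can be built directly from the local Cov or Acc values; a minimal sketch with uniform bins over [0, 1] (the convention of placing values equal to 1.0 in the last bin is an assumption):

```python
def quality_histogram(values, n_bins=10):
    """Bin quality values in [0, 1]; returns normalized bin frequencies."""
    hist = [0] * n_bins
    for v in values:
        b = min(int(v * n_bins), n_bins - 1)  # v == 1.0 falls in the last bin
        hist[b] += 1
    total = len(values)
    return [h / total for h in hist]
```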


EMD metrics. As an alternative to the global score set explained above, the tool also provides a recall and precision value obtained by applying the Earth Mover's Distance (EMD) between the coverage histogram ($\widetilde{h}_{Cov}$), respectively the accuracy histogram ($\widetilde{h}_{Acc}$), and an optimal histogram ($\widetilde{h}_{O}$) which describes a perfect detection.

$R_{EMD} = 1 - EMD(\widetilde{h}_{Cov}, \widetilde{h}_{O})$

$P_{EMD} = 1 - EMD(\widetilde{h}_{Acc}, \widetilde{h}_{O})$

For more details on the histogram representation and the EMD metrics please refer to the scientific paper published in the International Conference on Document Analysis and Recognition and the details in Chapter 4 of this Ph.D Thesis.
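For 1-D histograms with unit ground distance between adjacent bins, the EMD reduces to the L1 distance between cumulative sums, which gives a simple sketch of $R_{EMD}$ (the scaling by n_bins − 1 to keep the distance in [0, 1] is an assumption here; EvaLTex's exact normalization is given in the ICDAR paper):

```python
def emd_1d(h1, h2):
    """EMD between two normalized 1-D histograms (unit ground distance),
    computed as the L1 distance between their cumulative sums."""
    c1 = c2 = emd = 0.0
    for a, b in zip(h1, h2):
        c1 += a
        c2 += b
        emd += abs(c1 - c2)
    return emd

def emd_recall(h_cov, n_bins=10):
    # Optimal histogram: every GT object fully covered (all mass in the last bin).
    h_opt = [0.0] * (n_bins - 1) + [1.0]
    # Divide by the maximum possible distance (n_bins - 1) -- assumed scaling.
    return 1.0 - emd_1d(h_cov, h_opt) / (n_bins - 1)
```

A perfect coverage histogram yields $R_{EMD} = 1$; a histogram with all mass in the first bin yields 0.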

Input format

To simplify the use of the EvaLTex tool for both detection and segmentation tasks, we unified the input format. Hence, to evaluate both text detection and text segmentation we use the same input format consisting of .txt files which contain different attributes of each text object (i.e. coordinates of the bounding boxes).

Text detection tasks

Text detection results (i.e. words, lines, regions) can be represented either by boxes or by masks. For text detection tasks using bounding boxes, a .txt file is enough. If the text objects are represented by irregular masks, an additional labeled image is needed.

GT format. The GT necessary for the text detection (and also segmentation tasks) is represented by the following format:

  • img name
  • image height, image width
  • text object list (one per line) with the following attributes:
      • ID: unique text object ID
      • region ID: ID of the region to which the object belongs
      • "transcription": can be empty
      • text reject: option that decides whether a text object should be counted; can be set to f (default) or t (not taken into account)
      • x: x coordinate of the bounding box
      • y: y coordinate of the bounding box
      • width: width of the bounding box
      • height: height of the bounding box

GT and region IDs.

e.g.: img_1.txt

img_1

960,1280
1,1,"Tiredness",f,38,43,882,172
2,2,"kills",f,275,264,390,186
3,3,"A",f,0,699,77,131
4,3,"short",f,128,705,355,134
5,3,"break",f,542,710,396,131
6,4,"could",f,87,884,370,137
7,4,"save",f,517,919,314,105
8,5,"your",f,166,1095,302,136
9,5,"life",f,530,1069,213,137
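A GT file in this format can be read with a few lines of Python; a minimal sketch (the csv module handles the quoted transcription field, and the dictionary layout is illustrative, not part of the tool):

```python
import csv

def parse_gt(path):
    """Parse an EvaLTex ground-truth .txt file; blank lines are skipped."""
    with open(path) as f:
        lines = [l.strip() for l in f if l.strip()]
    name = lines[0]
    height, width = (int(v) for v in lines[1].split(","))
    objects = []
    for row in csv.reader(lines[2:]):
        oid, rid, text, reject = row[0], row[1], row[2], row[3]
        x, y, w, h = (int(v) for v in row[4:8])
        objects.append({"id": int(oid), "region": int(rid), "text": text,
                        "reject": reject == "t", "box": (x, y, w, h)})
    return {"name": name, "size": (height, width), "objects": objects}
```

The detection format below parses the same way, minus the image size, region ID and reject fields.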

Detection format. The detection .txt file format differs slightly from the GT one: it does not contain the image size, the region ID and the text reject attributes. Hence, the detection file has the following format:

  • img name
  • text object list (one per line) with the following attributes:
  • ID: unique text object ID
  • "transcription": can be empty
  • x: x coordinate of the bounding box
  • y: y coordinate of the bounding box
  • width: width of the bounding box
  • height: height of the bounding box


e.g.: img_1.txt

img_1

1,"",272,264,392,186
2,"",34,40,886,175
3,"",168,1082,300,148

Text mask representation

The interest of using masks rather than rectangles is to represent text strings not only in horizontal or vertical configurations, but also tilted, circular, curved or in perspective. In such cases, the rectangular representation might disturb the matching process: a detection can inadvertently match a GT object due to its varying direction (inclined, curved, circular). To evaluate mask detection objects we need, in addition to the file format explained before, a set of labeled images for both the GT and the detection set.

The only difference between the GT format of the text box representation and that of the text mask representation consists in the region ID: irregular mask annotation disables the use of the region tag. When dealing with rectangular boxes, regions are generated automatically based on the coordinates of the GT objects; a region is thus the bounding box of several "smaller" boxes. When masks are annotated irregularly, regions cannot be generated automatically, so each GT object gets a different region ID. One can simplify this by giving the same ID to the object and its region.

Original image with curved text
Labeled GT masks

Text segmentation tasks

Evaluating text segmentation tasks is very similar to evaluating text detection with a mask representation. For text segmentation we use a mask for each character, whereas for text detection masks represent words, lines or regions.

Original image
Labeled GT characters
Labeled segmentation result

Text segmentation GT format. Similar to text detection tasks using masks, to evaluate text segmentation we use, in addition to the .txt file, a labeled image (each character is labeled differently). Each GT object is represented by a character. Character-level GT objects cannot be grouped into regions, so each text object has a different region tag. The x, y, width and height attributes define the bounding box of each character.

e.g.: img_1.txt

img_1

960,1280
1,1,"",f,384,43,101,166
2,2,"",f,142,44,46,164
3,3,"",f,38,47,106,163
4,4,"",f,192,80,71,126
5,5,"",f,269,80,100,131
6,6,"",f,501,81,97,126
7,7,"",f,721,81,97,131
...
16,16,"",t,97,703,53,16
...

Notice that the rectangular mask shape in the segmentation GT example depicts a text object that should not be considered. This corresponds to setting the reject option to t for the text mask having the ID 16, as seen above.


Text segmentation result format. The result format consists of a labeled image in the same format as the one used for the GT, and a detection .txt file containing the positions of the bounding boxes of each segmented connected component.

e.g.: img_1.txt

img_1

1,"",383,42,103,167
2,"",142,43,49,167
3,"",35,44,112,168
4,"",268,79,101,132
5,"",194,81,71,124
6,"",500,81,99,127
7,"",612,81,100,131
8,"",721,81,97,131
9,"",824,82,99,133
10,"",344,883,29,135
11,"",387,886,65,135
...

Output format

The evaluation results are given as .txt files in two forms: a file with the results obtained on the entire dataset, and a file with results generated for each image in the dataset. The per-image files additionally contain the local statistics (Cov, Acc and Split) for each GT object in the image.

Global evaluation for an entire dataset

EvaLTex statistics
General
Number of GTs = 6410
Number of detections = 6338
Number of false positives = 1890
Number of true positives = 4678

EvaLTex statistics summarize the number of GT objects, detections, false positives and true positives in the dataset.


Global results
Recall=0.759731
Recall_noSplit=0.760799
Precision=0.692591
Split=0.791221
FScore=0.724609

FScore_noSplit=0.725095

The global scores are the default Recall score (with integrated Split), the Recall with no integrated Split, the Precision, as well as the default FScore (with integrated Split) and the FScore without the integrated Split.


Quantity results
Recall=0.792747
Precision=0.712241

Quantity results only refer to the number of detected text objects or the number of detections with a match in the GT regardless of the coverage or accuracy areas.


Quality results
Recall=0.958352
Recall_noSplit=0.959699
Precision=0.972411
Coverage histogram = {0.214201, 0.00491442, 0.00423657, 0.00491442, 0.00491442, 0.00525335, 0.00610066, 0.0132181, 0.0250805, 0.717167}
Accuracy histogram = {0.288977, 0.00137028, 0.00091352, 0.0022838, 0.00365408, 0.00471985, 0.00517661, 0.0103532, 0.0235993, 0.658952}

The quality results contain two histograms, representing the coverage and accuracy distributions over the dataset. The histogram format produced by EvaLTex is an n-sized array, where n is the chosen number of bins. Any visualization tool can then be used to render the quality histograms, e.g.:

Coverage histogram
Accuracy histogram
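In the absence of a plotting tool, the bin array can also be rendered as plain text; a minimal sketch (names and layout are illustrative):

```python
def render_histogram(hist, width=40):
    """Render a normalized histogram as horizontal text bars, one per bin."""
    lines = []
    for i, v in enumerate(hist):
        lo, hi = i / len(hist), (i + 1) / len(hist)
        bar = "#" * round(v * width)
        lines.append(f"[{lo:.1f}, {hi:.1f}) {bar} {v:.3f}")
    return "\n".join(lines)
```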


EMD results
Recall=0.702796
Precision=0.696302
FScore=0.699534

As an alternative to the global scores, we can also compute, using the EMD distance, two quality scores based on the Coverage and Accuracy histograms.

Local evaluation .txt file for each image

The local evaluation file, generated for each image of the dataset, has the same format as the dataset evaluation result file.

EvaLTex statistics - image img_1
General
Number of GTs = 43
Number of detections = 19
Number of false positives = 1
Number of true positives = 18

Global results
Recall=0.414803
Recall_noSplit=0.414803
Precision=0.921798
Split=0.428571
FScore=0.572144

FScore_noSplit=0.572144

Quantity results
Recall=0.967873
Precision=0.947368

Quality results
Recall=0.428571
Recall_noSplit=0.967873
Precision=0.973009
Coverage histogram = {0.571429, 0, 0, 0, 0, 0, 0.0238095, 0, 0.0238095, 0.380952}
Accuracy histogram = {0.288977, 0.00137028, 0.00091352, 0.0022838, 0.00365408, 0.00471985, 0.00517661, 0.0103532, 0.0235993, 0.658952}

EMD results
Recall=0.420952
Recall_noSplit=0.420952
Precision=0.926316
FScore=0.578853

FScore_noSplit=0.578853

In addition, it contains the coverage, accuracy and split values for each GT object in the image.

Local evaluation
GT object 1
Coverage = 1 Accuracy = 0.991792 Split = 1
GT object 2
Coverage = 0.809862 Accuracy = 0.994543 Split = 1
GT object 3
Coverage = 1 Accuracy = 0.954386 Split = 1
GT object 4
Coverage = 0.998092 Accuracy = 0.967474 Split = 1
GT object 5
Coverage = 1 Accuracy = 0.993222 Split = 1
GT object 6
Coverage = 1 Accuracy = 0.960362 Split = 1

Run the evaluation

The executable (EvaLTex) takes as input two .txt files, one for the ground truth and one for the detection/segmentation.

Usage:

./EvaLTex gt.txt det.txt res_dir [-a] [gtImgDir detImgDir]
gt.txt                         ground truth file path
det.txt                       detection file path
res_dir                       result output directory
-a                              use to generate a result file for each image
gtImgDir detImgDir   directory path of the gt and detection segmentation images
-h/--help                    show help information

The configuration file (evaltex.ini) contains the parameter values needed for the evaluation process. The file should be placed in the same directory as the executable.

Structure of evaltex.ini:

region                         boolean that decides whether or not to use the region option (default region=true)
print_level                   sets the level of output details (default print_level=0)
min_area                    threshold for the minimum area acceptance between a GT and a detection (default min_area=0)
hist_bin_nb                 number of bins used to generate the quality histograms and EMD scores (default hist_bin_nb=0)
split                           boolean that decides whether to integrate the split into the coverage (default split=true)
det_border                  border variation in terms of percentages for text detection (default det_border=0.01)
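Putting the defaults above together, a sample evaltex.ini might look as follows (the key=value layout is an assumption; check the file distributed in evaltex.tar.gz for the exact syntax):

```ini
; evaltex.ini -- place next to the EvaLTex executable.
; Values below are the documented defaults.
region=true
print_level=0
min_area=0
hist_bin_nb=0
split=true
det_border=0.01
```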

Resources

Evaluation Datasets

Text detection

Text segmentation


Download: Executable and config file evaltex.tar.gz

Dependencies: libTIFF and GraphicsMagick

Example: example.tar.gz containing a ground truth and a detection file

Credits

This work is part of the LINX project and was partially supported by FUI (Fond Unique Interministeriel) 14.

EvaLTex was written by Ana Stefania CALARASANU. Please send any suggestions, comments or bug reports to calarasanu@lrde.epita.fr.
