This page summarizes the results provided in the paper. A full description of the benchmark can be found [here]. All results presented in this webpage have been normalized, i.e. the standard score was computed for selected provisos and all models. The standardization is performed as follows. For each model M, we take the set of all runs generated (100 runs per proviso), and compute a mean number of states and a standard deviation. More details about this normalization can be found in the paper.