
GPCE ’22, December 06–07, 2022, Auckland, New Zealand Baptiste Esteban, Edwin Carlinet, Guillaume Tochon, and Didier Verna
Table 2. Execution time overhead (in percentage) of the
algorithmic schemes compared to the statically typed,
direct-access image, for an image of side size 4096

Statically typed    Yes      Yes      No       No
Direct access       Yes      No       Yes      No
C++
  Raster            +0%      +176%    +183%    +367%
  Random            +0%      +20%     +18%     +29%
  Local             +0%      +208%    +174%    +283%
Rust
  Raster            +0%      +388%    +251%    +574%
  Random            +0%      +9%      +6%      +15%
  Local             +0%      +61%     -14%     +66%
benchmarks have been compiled with the rustc compiler
using the third optimization level (-C opt-level=3), and the
measurements have been performed using the Criterion.rs
library.
The results of the benchmark are displayed in Figure 9.
Each plot displays the execution time of an algorithm in
seconds as a function of the side size of the image. The
second row shows the results of the C++ implementations and
the third row those of the Rust implementations. Except
for the Rust implementation of the dilation (Figure 9i), the
statically typed buffer is the fastest implementation of the
algorithm. This is particularly true when traversing the
image in raster order, as for the elementwise operation
(Figures 9d and 9g). The number of cache misses is low for
the elementwise operation because the images are traversed
in the same order as they are stored in memory.
Furthermore, knowing the type at compile time enables
compiler optimizations such as automatic vectorization of
the instructions in the produced binary. Finally, since the
implementations of the algorithms with buffer2d<T> know both
the type of the input values and the implementation details
of the input object, the operations given to the algorithm
are visible to the compiler, which enables their inlining
and avoids the indirection induced by a function call.
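The contrast described above can be sketched as follows. This is a minimal illustration and not the implementation benchmarked in this paper: the layout of buffer2d<T> shown here is assumed, and the names any_image, any_image_of, transform_static and transform_dynamic are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Statically typed, direct access (assumed minimal layout): the value type T
// and the row-major storage are both visible to the compiler.
template <typename T>
struct buffer2d {
    std::size_t width, height;
    std::vector<T> data;  // row-major storage

    buffer2d(std::size_t w, std::size_t h) : width(w), height(h), data(w * h) {}
    T& operator()(std::size_t x, std::size_t y) { return data[y * width + x]; }
    const T& operator()(std::size_t x, std::size_t y) const { return data[y * width + x]; }
};

// Elementwise operation in raster order. `op` is a template parameter, so the
// compiler sees its body and may inline it and vectorize the loop.
template <typename T, typename Op>
void transform_static(buffer2d<T>& img, Op op) {
    for (std::size_t y = 0; y < img.height; ++y)
        for (std::size_t x = 0; x < img.width; ++x)
            img(x, y) = op(img(x, y));
}

// Dynamically typed, indirect access (hypothetical interface): the value type
// is erased and every pixel access goes through a virtual call.
struct any_image {
    virtual ~any_image() = default;
    virtual std::size_t width() const = 0;
    virtual std::size_t height() const = 0;
    virtual double get(std::size_t x, std::size_t y) const = 0;
    virtual void set(std::size_t x, std::size_t y, double v) = 0;
};

// Adapter that hides a buffer2d<T> behind the type-erased interface.
template <typename T>
struct any_image_of final : any_image {
    buffer2d<T> buf;
    any_image_of(std::size_t w, std::size_t h) : buf(w, h) {}
    std::size_t width() const override { return buf.width; }
    std::size_t height() const override { return buf.height; }
    double get(std::size_t x, std::size_t y) const override {
        return static_cast<double>(buf(x, y));
    }
    void set(std::size_t x, std::size_t y, double v) override {
        buf(x, y) = static_cast<T>(v);
    }
};

// Same traversal, but the operation is a function pointer and each access is
// an indirect call, which hinders inlining and vectorization.
void transform_dynamic(any_image& img, double (*op)(double)) {
    for (std::size_t y = 0; y < img.height(); ++y)
        for (std::size_t x = 0; x < img.width(); ++x)
            img.set(x, y, op(img.get(x, y)));
}
```

Both functions compute the same result. Only the amount of information available to the compiler differs, which is the effect measured in the raster rows of Table 2.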
Furthermore, we observe in Figures 9e and 9h that the
algorithmic scheme is an important criterion for choosing
the generic model to use. For the max-tree algorithm, the
computation time is similar whatever the model used. Indeed,
the random algorithmic scheme does not access the memory in
the order in which the image is stored. This results in
several cache misses, and the compiler is also unable to
optimize the generated machine code.
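To make the difference between access patterns concrete, the sketch below compares a raster traversal with a value-driven traversal in which pixels are visited by increasing value. Sorting pixels by value is a common first step of max-tree algorithms, but whether the implementation benchmarked here proceeds this way is an assumption, and the names raster_order and value_order are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Raster order: indices 0, 1, 2, ... follow the row-major memory layout,
// so consecutive accesses hit consecutive addresses.
std::vector<std::size_t> raster_order(std::size_t n) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    return idx;
}

// Value-driven order: pixel indices sorted by increasing pixel value
// (stable, so equal values keep their raster order). Consecutive accesses
// may land anywhere in the buffer, which defeats the cache.
std::vector<std::size_t> value_order(const std::vector<std::uint8_t>& pixels) {
    std::vector<std::size_t> idx = raster_order(pixels.size());
    std::stable_sort(idx.begin(), idx.end(),
                     [&pixels](std::size_t a, std::size_t b) {
                         return pixels[a] < pixels[b];
                     });
    return idx;
}
```

For the four pixels {30, 10, 20, 10}, the raster order is 0, 1, 2, 3, whereas the value-driven order is 1, 3, 2, 0: the traversal jumps through memory according to the image content.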
We can conclude from these benchmarks that static knowledge
of the image value type is important, and so is the
algorithmic scheme. This is even more obvious in Table 2,
which shows the overhead of dynamism. In the context of
a bridge between C++ or Rust and Python, specializing
generic algorithms for a wide variety of types is not
necessary in terms of performance for a pattern such as the
random pattern.
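As a reading aid for Table 2, and assuming the overhead is defined as the relative difference in execution time with respect to the statically typed, direct-access reference (the symbols below are introduced here for illustration only), the percentages translate into slowdown factors as follows:

```latex
\[
\text{overhead} = \frac{t_{\text{variant}} - t_{\text{ref}}}{t_{\text{ref}}} \times 100\,\%
\qquad\Longleftrightarrow\qquad
\frac{t_{\text{variant}}}{t_{\text{ref}}} = 1 + \frac{\text{overhead}}{100\,\%}
\]
```

Under this assumption, an overhead of +176% corresponds to an execution time 2.76 times that of the reference, +574% to 6.74 times, +29% to 1.29 times, and -14% to 0.86 times, that is, a variant faster than the reference.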
The second experiment performed in this paper measures how
the size of the machine code generated from the C++
implementation of the max-tree algorithm evolves with the
number of handled image value types. To do so, we used the
Bloaty¹ profiler, which measures the size of different
elements in binaries. The max-tree algorithm has been chosen
because its compilation generates the largest amount of
machine code among the three previous algorithms, but also
because the cost of dynamism of its algorithmic scheme is
negligible, which permits the usage of its dynamic version.
The result of this measurement is shown in Table 3. Both
versions exhibit a linear increase with the addition of new
image types. However, the quantity of new code generated in
the dynamic version (100b/type) is 26 times lower than in
the static version (2.6Kb/type), where a new full algorithm
is instantiated. Therefore, the dynamic version prevents
code bloat.
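The difference in growth between the two versions can be sketched as follows. This is not the max-tree implementation measured above: a simple sum stands in for the algorithm, and all names (value_type, sum_static, sum_dispatch_static, load_as_double, loader_for, sum_dynamic) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Runtime tag describing the image value type (hypothetical).
enum class value_type { u8, u16, f32 };

// Static version: the whole algorithm is a template, so each handled value
// type adds a full copy of the algorithm to the binary.
template <typename T>
double sum_static(const void* data, std::size_t n) {
    const T* p = static_cast<const T*>(data);
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += static_cast<double>(p[i]);
    return acc;
}

// Runtime dispatch to the matching instantiation.
double sum_dispatch_static(value_type t, const void* data, std::size_t n) {
    switch (t) {
        case value_type::u8:  return sum_static<std::uint8_t>(data, n);
        case value_type::u16: return sum_static<std::uint16_t>(data, n);
        case value_type::f32: return sum_static<float>(data, n);
    }
    return 0.0;
}

// Dynamic version: the algorithm is compiled once and each handled value
// type only adds a small accessor function.
using load_fn = double (*)(const void* data, std::size_t i);

template <typename T>
double load_as_double(const void* data, std::size_t i) {
    return static_cast<double>(static_cast<const T*>(data)[i]);
}

load_fn loader_for(value_type t) {
    switch (t) {
        case value_type::u8:  return &load_as_double<std::uint8_t>;
        case value_type::u16: return &load_as_double<std::uint16_t>;
        case value_type::f32: return &load_as_double<float>;
    }
    return nullptr;
}

double sum_dynamic(load_fn load, const void* data, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += load(data, i);
    return acc;
}
```

Supporting a new value type requires a new instantiation of sum_static in the first case, but only a new instantiation of load_as_double in the second. The price is one indirect call per access, which is acceptable only when the algorithmic scheme makes the cost of dynamism negligible.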
5 Conclusion
In this paper, we presented different models of generic
programming and compared them, applied to image processing,
by implementing different algorithmic schemes in C++ and
Rust. We showed that the performance of each model depends
on the algorithmic scheme used, and we highlighted the fact
that, in some cases, it is more important for some
information, such as the type of the values of an image, to
be known statically by the compiler. To reduce the loss of
performance induced by the lack of static information, some
leads may be explored. The first is the usage of external
modules downloaded at runtime and linked to the application,
with precompiled algorithms optimized for this particular
use case. The second lead may be the usage of Just-In-Time
compilation to generate optimized assembly code, such as
SIMD instructions, at runtime for critical operations. Then,
as observed in the benchmark results in Figure 9, the
difference in performance is negligible for a small square
image. Looking for the best side size with respect to the
performance of an algorithm, for distributed tile-based
image processing, would be a means to reduce this gap in
performance. Finally, this work is intended to be used in
the context of a bridge from a static language to a dynamic
one, to provide an interface for dynamic environments
without a loss of performance for a C++ image processing
library. Thus, we will use it as a basis for efficient
bindings of our algorithms from C++ to Python.
Acknowledgments
The authors would like to acknowledge Antoine Martin for
his useful advice about the Rust programming language.
¹ https://github.com/google/bloaty