Beating C in Scientiﬁc Computing Applications

On the Behavior and Performance of LISP, Part 1

Didier Verna

EPITA Research and Development Laboratory

14–16 rue Voltaire, F-94276 Le Kremlin-Bicêtre, France

Abstract

This paper presents ongoing research on the behavior and per-

formance of LISP with respect to C in the context of scientific

numerical computing. Several simple image processing algorithms

are used to evaluate the performance of pixel access and arithmetic

operations in both languages. We demonstrate that the behavior of

equivalent LISP and C code is similar with respect to the choice of

data structures and types, and also to external parameters such as

hardware optimization. We further demonstrate that properly typed

and optimized LISP code runs as fast as the equivalent C code, or

even faster in some cases.

Keywords C, LISP, Image Processing, Scientiﬁc Numerical Cal-

culus, Performance

1. Introduction

More than 15 years after the standardization process of COMMON-LISP

(Steele, 1990), and more than 20 years after people really

started to care about performance (Gabriel, 1985; Fateman et al.,

1995; Reid, 1996), LISP still suffers from the legend that it is a

slow language.

As a matter of fact, it is very rare to ﬁnd efﬁciency demanding

applications written in LISP. To take a single example from a

scientiﬁc numerical calculus application domain, image processing

libraries are mostly written in C (Froment, 2000) or in C++ with

various degrees of genericity (bib, 2002; Ibanez et al., 2003; Duret-

Lutz, 2000). Most of the time, programmers are simply unaware

of LISP, and in the best case, they falsely believe that sacriﬁcing

expressiveness will get them better performance.

We must admit however that this point of view is not totally

unjustiﬁed. Recent studies (Neuss, 2003; Quam, 2005) on various

numerical computation algorithms ﬁnd that LISP code compiled

with CMU-CL can run at 60% of the speed of equivalent C code.

People having already made the choice of LISP for other reasons

might call this “reasonable performance” (Boreczky and Rowe,

1994), but people coming from C or C++ will not: if you consider

an image processing chain that requires 12 hours to complete for

instance (and this is a real case), running at a 60% speed means

that you have just lost a day. This is unacceptable.

Hence, we have to do better: we have to show people that they

would lose nothing in performance by using LISP, the corollary

being that they would gain a lot in expressiveness.

This article presents the first part of an experimental study on
the behavior and performance of LISP in the context of numerical
scientific computing, more precisely in the field of image processing.
This first part deals with fully dedicated (or “specialized”) code:
programs built with full knowledge of the algorithms and the data
types to which they apply. The second and third parts will be devoted
to studying genericity, through dynamic object-orientation and static
meta-programming respectively. The rationale here is that adding
support for genericity would at best give equal performance, but
more probably entail a loss of efficiency. So if dedicated LISP code
is already unsatisfactory with respect to C versions, it would be
useless to try and go further.

Footnote: COMMON-LISP is defined by the X3.226-1994 (R1999) ANSI standard.

In this paper, we demonstrate that given the current state of the
art in COMMON-LISP compiler technology (most notably, open-coded
arithmetics on unboxed numbers and efficient array types; see
Fateman et al., 1995, sec. 4, p.13), the performances of equivalent
C and LISP programs are comparable, and sometimes better with LISP.

Section 2 describes the experimental conditions in which our
performance measures were obtained. In section 3, we study the
behavior and performance of several specialized C programs involving
two fundamental operations of image processing: pixel access and
arithmetic processing. Section 4 introduces the equivalent LISP code
and compares its behavior with that of C. Finally, section 5
summarizes the benchmarking results we obtained with both languages
and several compilers.

2. Experimental Conditions

2.1 The Protocol

Our experiments are based on 4 very simple algorithms: pixel as-

signment (initializing an image with a constant value), and pixel ad-

dition / multiplication / division (ﬁlling a destination image with the

addition / multiplication / division of every pixel from a source im-

age by a constant value). These algorithms involve two fundamen-

tal atomic operations of image processing: pixel (array element)

access and arithmetic processing.

In order to avoid benchmarking any program initialization side-

effect (initial page faults etc.), the performances have been mea-

sured on 200 consecutive iterations of each algorithm.
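The measurement loop described above can be sketched as follows. This is a minimal illustration, not the author's harness: the `image` structure, the `assign` function and the `time_iterations` helper are hypothetical names, and the standard `clock()` function stands in for whatever timer was actually used.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical image type: n pixels stored contiguously. */
typedef struct {
    int n;
    float *data;
} image;

/* Pixel assignment: fill the image with a constant value. */
static void assign(image *ima, float val)
{
    for (int i = 0; i < ima->n; ++i)
        ima->data[i] = val;
}

/* Run `algo` `iterations` times in a row and return the elapsed CPU time
   in seconds, so that start-up effects are amortized over many runs. */
static double time_iterations(void (*algo)(image *, float),
                              image *ima, float val, int iterations)
{
    assert(iterations > 0);
    clock_t start = clock();
    for (int k = 0; k < iterations; ++k)
        algo(ima, val);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A call such as `time_iterations(assign, &ima, 1.0f, 200)` would correspond to the 200 consecutive iterations mentioned above.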

Additionally, the behavior of the algorithms is controlled by the

following parameters:

Image size Unless otherwise speciﬁed, the benchmarks presented

in this paper are for 800 ∗ 800 images. However, we also ran

the tests on smaller images with an increased number of iterations.

Some interesting comparisons between the two image sizes are

presented in sections 3.5 on page 4 and 4.6 on page 9.

Data type We have benchmarked both integer and ﬂoating-point

images, with the corresponding arithmetic operations (operand

of the same type, no type conversion needed). Additionally, no

arithmetic overﬂow occurs in our programs.

Array types We have benchmarked 2D images represented either

directly as 2D arrays, or as 1D arrays of consecutive lines

(see also 2.2). Some interesting comparisons between these two

storage layouts are presented in sections 3.1 on the next page

and 4.3 on page 6.

Access type We have benchmarked both linear and pseudo-random

image traversal. By “linear”, we mean that the images are tra-

versed as they are represented in memory: from the ﬁrst pixel to

the last, line after line. By “pseudo-random”, we mean that the

images are traversed by examining in turn pixels distant from

a constant offset in the image memory representation (see also

section 2.3).

Optimization We have benchmarked our algorithms with 3 differ-

ent optimization levels: unoptimized, optimized, and also with

inlining of each algorithm’s function into the iterations loop.

The exact meaning of these 3 optimization levels will be de-

tailed later. For simplicity, we will now simply say “inlined” to

refer to the optimized and inlined versions.

The combination of all these parameters amounts to a total of

48 actual test cases for each algorithm, for a total of 192 individual

tests. Not all combinations are interesting; we present only com-

parative results where we think there is a point to be made.
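The totals above can be reproduced as a product of the number of values taken by each parameter. The breakdown used here (2 image sizes, 2 data types, 2 array types, 2 access types, 3 optimization levels) is an assumption inferred from the parameter list, since the text gives only the totals; the function name is hypothetical.

```c
#include <assert.h>

/* Number of test cases per algorithm, as the product of the number of
   values taken by each benchmark parameter. */
static int cases_per_algorithm(int image_sizes, int data_types,
                               int array_types, int access_types,
                               int optimization_levels)
{
    assert(image_sizes > 0 && data_types > 0 && array_types > 0
           && access_types > 0 && optimization_levels > 0);
    return image_sizes * data_types * array_types
         * access_types * optimization_levels;
}
```

Under that assumption, `cases_per_algorithm(2, 2, 2, 2, 3)` yields the 48 cases per algorithm, and multiplying by the 4 algorithms gives the 192 individual tests.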

The benchmarks have been generated on a Debian GNU/Linux

system running a packaged kernel version on a

Pentium 4, 3GHz, with 1GB RAM and 1MB level 2 cache. In order

to avoid non-deterministic operating system or hardware side effects

as much as possible, the PC was freshly rebooted in single-

user mode, and the kernel used was compiled without symmet-

ric multiprocessing support (the CPU’s hyperthreading facility was

turned off).

Given the inherently fuzzy and non-deterministic nature of the
art of benchmarking, we are reluctant to provide precise numerical
values. However, such values are not really needed, since the global
shape of the comparative charts presented in this paper is usually
more than sufficient to make some behavior perfectly clear.
Nevertheless, people interested in the precise benchmarks we obtained
can find the complete source code and results of our experiments
at the author's website. The reader should be aware that during
the development phase of our experiments, several consecutive trial
runs demonstrated that timing differences of less than 10% are
not significant.

2.2 Note on array types

In the case of a 2D image represented as a 1D array in memory,
we used a single index to traverse the whole image linearly. One
might argue that this is unfair to the 2D representation: since
the image really is a 2D one, we should use 2D coordinates
(hence a double loop) and additional arithmetics to compute the
resulting 1D position of the corresponding pixel in the array. Our
answer to this is no, for the following two reasons.

1. Firstly, remember that we are in the case of dedicated code,

with full knowledge of data types and implementation choices.

In such a case, a real world program will obviously choose the

fastest solution and choose a single loop implementation for

linear pixel access.

2. Secondly, remember that our main goal is to compare the per-

formances of “equivalent” code in C and LISP. Comparing the

respective merits of 1D or 2D implementation is only second-

order priority at best. Hence, as long as the tested code is equiv-

alent in both languages, our comparisons hold.
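The two traversal styles discussed in this note can be contrasted with a short sketch. The `image` structure and both function names are hypothetical; only the single-index form corresponds to what the text says was benchmarked for 1D storage.

```c
#include <assert.h>

/* Hypothetical 2D image stored as a 1D array of consecutive lines. */
typedef struct {
    int width, height;
    float *data;          /* width * height pixels */
} image;

/* Single loop: one index runs over the whole buffer. */
static void assign_single_index(image *ima, float val)
{
    const int n = ima->width * ima->height;
    for (int i = 0; i < n; ++i)
        ima->data[i] = val;
}

/* Double loop: 2D coordinates, converted to a 1D position for each pixel. */
static void assign_coordinates(image *ima, float val)
{
    assert(ima->width > 0 && ima->height > 0);
    for (int y = 0; y < ima->height; ++y)
        for (int x = 0; x < ima->width; ++x)
            ima->data[y * ima->width + x] = val;
}
```

Both functions write the same pixels in the same order; the second one simply pays for an extra multiplication and addition per access unless the compiler removes them.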

2.3 Note on access types

It is important to consider pseudo-random access in the experi-

ments because not all image processing algorithms work linearly.

For instance, many morphological operators, like those based on

“front propagation” algorithms (d’Ornellas, 2001) access pixels in

an order that, while not being random, is however completely un-

related to the image’s in-memory storage layout.

In order to simulate non-linear pixel access, we chose a scheme

involving only arithmetic operations (the ones we were interested

in benchmarking anyway): we avoided using real randomization

because that would have introduced a bias in the benchmarks, and

worse, it would have been impossible to compare results from C

and LISP without a precise knowledge of the respective cost of the

randomization function in each language or compiler.

The chosen scheme is to traverse the image by steps of a constant
value, unrelated to the image geometry. In order to be sure
that all pixels are accessed, the image size and the step size have to
be relatively prime. 509 was chosen as the constant offset for this
experiment, because it is relatively prime to 800 (our chosen image
size) and is also small enough to avoid arithmetic overflow for a
complete image traversal.
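The coverage and overflow claims can be checked with a small sketch. The function name is hypothetical, and the traversal mirrors the offset-and-remainder scheme described above rather than any code from the actual benchmark.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Return 1 if stepping through n pixels by `step` (modulo n) touches every
   pixel in n steps, 0 if some pixel is missed, -1 if allocation fails. */
static int visits_all_pixels(int n, int step)
{
    /* The running offset grows up to n * step: it must fit in an int. */
    assert(n > 0 && step > 0 && (long long)n * step <= INT_MAX);

    unsigned char *seen = calloc((size_t)n, 1);
    if (seen == NULL)
        return -1;

    int offset = 0, distinct = 0;
    for (int i = 0; i < n; ++i) {
        offset += step;
        int pos = offset % n;
        if (!seen[pos]) {
            seen[pos] = 1;
            ++distinct;
        }
    }
    free(seen);
    return distinct == n;
}
```

With `n = 800 * 800` and `step = 509` the precondition holds (the final offset is 325,760,000, below the 32-bit signed limit) and every pixel is reached, whereas a step sharing a factor with the image size, such as 500, misses most of them.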

3. C Programs and Benchmarks

In this section, we establish the ground for comparison by studying

the behavior and performance of specialized C implementations of

the algorithms described earlier.

For benchmarking the C programs, we used the GNU C compiler,

GCC, version 4.0.3 (Debian package version 4.0.3-1). Full

optimization is obtained with the and ﬂags. To

prevent the compiler from inlining the algorithm functions, the

GCC attribute (GNU C speciﬁc extension) is used. On

the contrary, to force inlining of these functions, they are declared

as .
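The exact flag and attribute names are not legible in the paragraph above. As an illustration only, GCC does provide a `noinline` function attribute and accepts `static inline` declarations; that these are the spellings used in the experiments is an assumption, and all function names below are hypothetical.

```c
#include <assert.h>

/* GCC's `noinline` attribute asks the compiler never to inline this
   function into its callers (GNU C extension). */
__attribute__((noinline))
static int add_never_inlined(int pixel, int val)
{
    return pixel + val;
}

/* `static inline` makes the definition a candidate for inlining at each
   call site; it is a hint that optimizing compilers usually follow. */
static inline int add_inlinable(int pixel, int val)
{
    return pixel + val;
}

/* Sum of pixel + val over n pixels, using one of the two variants. */
static long sum_with(int use_inline, const int *pixels, int n, int val)
{
    assert(n >= 0);
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += use_inline ? add_inlinable(pixels[i], val)
                          : add_never_inlined(pixels[i], val);
    return sum;
}
```

Both variants compute the same result; only the generated code differs, which is what the optimized and inlined benchmark versions are meant to expose.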

For the sake of clarity, two sample programs (details removed)

are presented in listings 1 and 2: the addition algorithm for

images, both in linear and pseudo-random access mode.

void add (image *to, image *from, float val)
{
  int i;
  const int n = to->n;   /* number of pixels; both images have the same size */
  for (i = 0; i < n; ++i)
    to->data[i] = from->data[i] + val;
}

Listing 1. Linear Pixel Addition, C Version

void add (image *to, image *from, float val)
{
  int i, pos, offset = 0;
  const int n = to->n;   /* number of pixels; both images have the same size */
  for (i = 0; i < n; ++i)
    {
      offset += OFFSET;          /* constant step, e.g. 509 */
      pos = offset % n;          /* wrap around the image */
      to->data[pos] = from->data[pos] + val;
    }
}

Listing 2. Pseudo-Random Pixel Addition, C Version

3.1 Array types

Although 1D array implementation would probably be the usual

choice (for questions of locality, and perhaps also genericity with

respect to the image ranks), it is interesting to see how 1D or 2D

array implementations behave in C, and compare this behavior with

the equivalent LISP implementations.

In the 2D implementation, the image data is an array of pointers
to individually allocated lines (in sequential order), and a double
line / column loop is used to access the pixels linearly.
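A possible shape for this 2D layout is sketched below. The `image2d` type and the helper names are hypothetical, and the use of `malloc` for each line is an assumption about how the lines were allocated.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical 2D image: an array of pointers to separately allocated lines. */
typedef struct {
    int width, height;
    float **lines;
} image2d;

/* Allocate the lines one by one, in sequential order. Returns 0 on success. */
static int image2d_init(image2d *ima, int width, int height)
{
    assert(width > 0 && height > 0);
    ima->width = width;
    ima->height = height;
    ima->lines = malloc((size_t)height * sizeof *ima->lines);
    if (ima->lines == NULL)
        return -1;
    for (int y = 0; y < height; ++y) {
        ima->lines[y] = malloc((size_t)width * sizeof **ima->lines);
        if (ima->lines[y] == NULL) {
            while (y-- > 0)              /* release the lines already allocated */
                free(ima->lines[y]);
            free(ima->lines);
            ima->lines = NULL;
            return -1;
        }
    }
    return 0;
}

/* Release every line, then the array of line pointers. */
static void image2d_free(image2d *ima)
{
    for (int y = 0; y < ima->height; ++y)
        free(ima->lines[y]);
    free(ima->lines);
    ima->lines = NULL;
}

/* Linear traversal with a double line / column loop. */
static void add2d(image2d *to, const image2d *from, float val)
{
    for (int y = 0; y < to->height; ++y)
        for (int x = 0; x < to->width; ++x)
            to->lines[y][x] = from->lines[y][x] + val;
}
```

Each pixel access goes through one extra pointer indirection compared with the 1D layout, which is the cost the charts below try to measure.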

Chart 1 shows the timings for all linear optimized algorithms

with a 2D (rear) or 1D (front) array implementation.

[Chart 1: execution time (seconds) of the Assignment, Addition, Multiplication and Division algorithms, for integer and floating point images; from rear to front: 2D / 1D array]

Chart 1. Linear Optimized Algorithms, 2D or 1D C Versions

This chart indicates that the 1D implementation ran 10% faster

than the 2D one at best for integer arithmetics. This is hardly

signiﬁcant, and even less in the case of ﬂoating point treatment.

The same experiments (not presented here) with inlined algorithms

conﬁrm this, with the case of the integer division more pronounced,

as it runs 25% faster in the 1D implementation (the performance of

integer division will be discussed in section 3.4 on the following

page).

Chart 2 shows the timings for all pseudo-random optimized

algorithms with a 2D (rear) or 1D (front) array implementation.

[Chart 2: execution time (seconds) of the Assignment, Addition, Multiplication and Division algorithms, for integer and floating point images; from rear to front: 2D / 1D array]

Chart 2. Pseudo-Random Optimized Algorithms, 2D or 1D C Versions

In that case, the differences between the two implementations

are somewhat deeper: the 1D implementation is almost 20% faster

than the 2D one on the 3 algorithms invoking arithmetics, and

this is the other way around for pixel assignment. Again, the same

experiments (not presented here) with inlined algorithms conﬁrm

this.

In the remainder of this section, we will consider 1D storage

layout exclusively.

3.2 The Assignment algorithm

[Chart 3: execution time (seconds) for the Int / Rand, Int / Linear, Float / Rand and Float / Linear groups; from rear to front: Unoptimized / Optimized / Inlined]

Chart 3. 1D Pixel Assignment, C Versions

Chart 3 presents the results obtained for the 1D assignment al-

gorithm. Benchmarks are grouped by data type / access type com-

binations (this makes 4 groups). In each group, the timings (from

rear to front) for the unoptimized, optimized and inlined versions

are displayed. In order to make small timings more distinct, a log-

arithmic scale is used on the Y axis.

3.2.1 Access Types

The first thing we notice in this chart is the huge difference in

the performance of pseudo-random and linear access. In numbers,

the linear versions run between 15 and 35 times faster than their

pseudo-random counterpart, depending on the optimization level.

An explanation for this behavior is provided in section 3.5 on the

next page.

3.2.2 Optimization levels

Chart 3 also demonstrates that while optimization is insigniﬁcant

on the pseudo-random versions, the inﬂuence is non-negligible in

the linear case: the optimized versions of the pixel assignment

algorithm run approximately 60% faster than the unoptimized ones.

To explain this, two hypotheses can be raised:

1. the huge cost of pseudo-random access completely hides the

effects of optimization,

2. and / or special optimization techniques related to sequential

memory addressing are at work in the linear case.

These two hypotheses have not been investigated further yet.

As real world applications need non linear pixel access some-

times, it is important to remember, especially during program de-

velopment, that turning on optimization is much less rewarding in

that case.

3.3 Addition and Multiplication

Charts similar to chart 3 were made for the addition and multiplication
algorithms. They are very similar in shape, so only the addition
one is presented (chart 4).

[Chart 4: execution time (seconds) for the Int / Rand, Int / Linear, Float / Rand and Float / Linear groups; from rear to front: Unoptimized / Optimized / Inlined]

Chart 4. 1D Pixel Addition, C Versions

This chart entails 3 remarks:

•

The difference in performance between pseudo-random and linear
access is confirmed: the linear versions run approximately
between 25 and 35 times faster than the pseudo-random access
ones (15 – 35 for the assignment algorithm), depending on the
optimization level. This increased performance gap in the
unoptimized case is unlikely to be due to the arithmetic operations
involved. A more probable cause is the fact that two images
are involved, hence doubling the number of memory accesses.
Again, refer to section 3.5 for an explanation.

•

As in the previous algorithm (although less visible on this
chart), optimization is more rewarding in the case of linear
access: whereas the optimized versions of pseudo-random access
algorithms run 10% faster than their unoptimized counterparts,
the gain is 36% for linear access. However, it should be noted
that this ratio is lower than that obtained with the assignment
algorithm (60%), in which no arithmetics was involved (apart
from the same looping code). This, again, suggests that
optimization has more to do with sequential memory addressing
than with arithmetic operations.

•

Inlining the algorithm functions doesn't help much (this was
also visible on chart 3): the gain in performance is insignificant.
It fluctuates between - and +2%, which does not allow any
conclusion to be drawn. This observation is not surprising,
because the cost of the function calls is negligible compared
to the time needed to execute the functions themselves (i.e.
the time needed to traverse the whole images).

3.4 Division

[Chart 5: execution time (seconds) for the Int / Rand, Int / Linear, Float / Rand and Float / Linear groups; from rear to front: Unoptimized / Optimized / Inlined]

Chart 5. 1D Pixel Division, C Versions

The division algorithm, however, comes with a little surprise.

As shown in chart 5, the division chart globally has the same shape

as that of chart 4, but inlining does have a non-negligible impact

on integer division: in pseudo-random access mode, the inlined

version runs 1.2 times faster than the optimized one, and the ratio

amounts to almost 3.5 in the linear case. In fact, inlining reduces

the execution time to something closer to that of the multiplication

algorithm.

Careful disassembly of the code (shown in listings 3 and 4)
reveals that the division instruction (idiv) is indeed missing from
the inlined version, and replaced by a multiplication (imull) and
some additional (cheaper) arithmetics.

0x08048970 <div_sb+48>: lea    0x0(,%ecx,4),%eax
0x08048977 <div_sb+55>: inc    %ecx
0x08048978 <div_sb+56>: mov    (%esi,%eax,1),%edx
0x0804897b <div_sb+59>: mov    %eax,0xffffffe8(%ebp)
0x0804897e <div_sb+62>: mov    %edx,%eax
0x08048980 <div_sb+64>: mov    %edx,0xffffffe4(%ebp)
0x08048983 <div_sb+67>: cltd
0x08048984 <div_sb+68>: idiv   %ebx
0x08048986 <div_sb+70>: mov    0xffffffe8(%ebp),%edx
0x08048989 <div_sb+73>: cmp    %ecx,0xfffffff0(%ebp)
0x0804898c <div_sb+76>: mov    %eax,0xffffffe0(%ebp)
0x0804898f <div_sb+79>: mov    %eax,(%edi,%edx,1)
0x08048992 <div_sb+82>: jne    0x8048970 <div_sb+48>

Listing 3. Assembly Integer Division Excerpt

0x080489d0 <main+640>: lea    0x0(,%esi,4),%ecx
0x080489d7 <main+647>: mov    $0x92492493,%eax
0x080489dc <main+652>: imull  (%ebx,%ecx,1)
0x080489df <main+655>: inc    %esi
0x080489e0 <main+656>: mov    (%ebx,%ecx,1),%eax
0x080489e3 <main+659>: mov    %edx,0xffffff7c(%ebp)
0x080489e9 <main+665>: add    %eax,%edx
0x080489eb <main+667>: mov    (%ebx,%ecx,1),%eax
0x080489ee <main+670>: sar    $0x2,%edx
0x080489f1 <main+673>: sar    $0x1f,%eax
0x080489f4 <main+676>: sub    %eax,%edx
0x080489f6 <main+678>: cmp    %esi,0xffffffa8(%ebp)
0x080489f9 <main+681>: mov    %edx,(%edi,%ecx,1)
0x080489fc <main+684>: jne    0x80489d0 <main+640>

Listing 4. Assembly Inlined Integer Division Excerpt

This is an indication of a constant integer division optimization
(Warren, 2002, Chap. 10). In fact, the divisor in our program is
indeed a constant, and GCC is able to propagate this knowledge
within the inlined version of the division function. Replacing this
constant by the result of a function call, for instance, in the source
code suffices to neutralize the optimization.
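The instruction sequence of listing 4 can be transcribed into C to see why no division instruction is needed. The constant 0x92492493 combined with a shift of 2 is the pair given by Warren (2002) for signed division by 7; that the benchmark's divisor was 7 is an inference from the disassembly, not something the text states. The function names are hypothetical, and the helper uses `/` by a power of two only to stay portable (right-shifting negative values is implementation-defined in C), where the compiled code uses shifts.

```c
#include <assert.h>
#include <stdint.h>

/* floor(x / 2^s), written without right-shifting negative values. */
static int64_t floor_div_pow2(int64_t x, int s)
{
    const int64_t d = (int64_t)1 << s;
    int64_t q = x / d;
    if (x % d < 0)
        --q;
    return q;
}

/* Signed 32-bit division by 7 using a multiplication by a magic constant.
   Mirrors listing 4: imull / add / sar $2 / sar $31 / sub. */
static int32_t div7_magic(int32_t n)
{
    const int64_t magic = -1840700269;           /* 0x92492493 as signed 32-bit */
    int64_t hi = floor_div_pow2(magic * n, 32);  /* high word of the product */
    hi += n;                                     /* add the original value */
    hi = floor_div_pow2(hi, 2);                  /* arithmetic shift by 2 */
    if (n < 0)                                   /* subtract the sign bit (-1) */
        ++hi;
    return (int32_t)hi;
}
```

For every 32-bit input this sketch is expected to agree with the truncating `n / 7` of C, while using only a multiplication, additions and shifts.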

Although the case of integer division is somewhat peculiar, if

not trivial, it demonstrates that inlining is always worth trying, even

when the cost of the function call itself is negligible, because one

never knows which next optimization such or such compiler will be

able to perform.

Finally, notice that division is more costly in general: the
presence of division arithmetics here flattens the performance gap
between pseudo-random access and linear access. Whereas the other
algorithms featured a factor between 25 and 35, linear floating-point
division runs “only” 8.1 – 9.5 times faster than the pseudo-random
access versions.

3.5 Caching

As we just saw, the gain from pseudo-random to linear access is

15 – 35 for pixel assignment, addition and multiplication, and 8.1

– 9.5 for (ﬂoating point) pixel division. The arithmetic operations

needed to implement pseudo-random access involve one addition

and one (integer division) remainder per pixel access, as apparent

on listing 2 on page 2. These supplemental arithmetic operations

alone cannot explain this performance gap.

In order to convince ourselves of this, we benchmarked alter-

native versions of the algorithms in which pseudo-random access

related computations are being performed, but the image is still ac-

cessed linearly. An example of such code is shown in listing 5.

void assign (image *ima, float val)
{
  int i, offset = 0;
  const int n = ima->n;
  for (i = 0; i < n; ++i)
    {
      offset += OFFSET;            /* same arithmetic as the pseudo-random version */
      ima->data[i] = offset % n;   /* but the image is still written linearly */
    }
}

Listing 5. Fake Pseudo-Random Pixel Assignment, C Ver-

sion

Comparing the performances of these versions and the real

linear ones shows that the performances degrade only by a factor 4

– 6; not 25 – 35.

What is really going on here is that pseudo-random access de-

feats the CPU’s level 2 caching optimization. This can be demon-

strated by using smaller images (that ﬁt entirely into the cache)

and a larger number of iterations in order to maintain the same

amount of arithmetic operations. The initial experiments were run

on 800∗ 800 images and 200 iterations. Such images are 2.44 times

bigger than our level 2 cache. We ran the same tests on 400 ∗ 400

images and 800 iterations. This gives us almost the same number of

arithmetic operations, but this time, it is almost possible to ﬁt two

images into the cache.
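The figures in this paragraph follow from simple arithmetic, sketched below. The 4-byte pixel size (for both integer and floating-point images) is an assumption about the platform, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a width x height image with `pixel_size`-byte pixels. */
static size_t image_bytes(size_t width, size_t height, size_t pixel_size)
{
    assert(width > 0 && height > 0 && pixel_size > 0);
    return width * height * pixel_size;
}

/* Total number of pixel visits for `iterations` passes over one image. */
static size_t pixel_visits(size_t width, size_t height, size_t iterations)
{
    return width * height * iterations;
}

/* Ratio between a data size and a cache size, both in bytes. */
static double cache_ratio(size_t bytes, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return (double)bytes / (double)cache_bytes;
}
```

With these assumptions, an 800 ∗ 800 image occupies 2,560,000 bytes, about 2.44 times a 1MB (1,048,576 byte) cache, while two 400 ∗ 400 images occupy 1,280,000 bytes, slightly more than the cache. The number of pixel visits is 128,000,000 in both configurations (800 ∗ 800 ∗ 200 and 400 ∗ 400 ∗ 800).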

Charts 6 and 7 present all algorithms side by side, respectively
for optimized pseudo-random and linear access. From rear to front,
benchmarks for big and small images are superimposed. Timings
are displayed on a linear scale this time.

[Chart 6: execution time (seconds) of the Assignment, Addition, Multiplication and Division algorithms, for integer and floating point images; from rear to front: Big / Small image]

Chart 6. 1D Pseudo-Random Optimized Algorithms, C Versions

For pseudo-random access, the assignment algorithm goes ap-

proximately 5.5 times faster for the small image, and the other algo-

rithms gain roughly a factor of 3, despite the same amount of arith-

metics. In the linear case however, the assignment algorithm gains

2.6 (instead of 5.5), the addition and multiplication ones hardly

reach 1.3 (instead of 3), and the division one gains practically noth-

ing.

[Chart 7: execution time (seconds) of the Assignment, Addition, Multiplication and Division algorithms, for integer and floating point images; from rear to front: Big / Small image]

Chart 7. 1D Linear Optimized Algorithms, C Versions

In section 2.3, we emphasized the importance of considering
pseudo-random access in our tests. Here, we further demonstrate
the importance of hardware considerations like the size of the
level 2 cache: one has to realize that given the size of our images
(800 ∗ 800) and the chosen offset for pseudo-random access (509),
we hardly jump from one image line to the next, and it already
suffices to degrade the performance by a factor of 30.
This consideration is all the more important because it is not only
a technical matter: for instance, Lesage et al. (2006) observe that
for a certain class of morphological operators implemented as
queue-based algorithms, there is a relation between the image
entropy (for instance the number of grey levels) and the number
of cache misses.

4. LISP Programs and Benchmarks

In this section, we study the behavior and performance of LISP

implementations of our algorithms, and establish a ﬁrst batch of

comparisons with the C code previously examined.

For testing LISP code, we used the experimental conditions

described in section 2 on page 1, with some LISP speciﬁcities

described below.

4.1 Experimental Conditions

Compilers We took the opportunity to try several LISP compilers.

We tested CMU-CL (version 19c), SBCL (version 0.9.9) and

ACL (Allegro 7.0 trial version). Our desire to add LispWorks

to the benchmarks was foiled by the fact that the personal

edition lacks the usual and command line options

used with the other compilers to automate the tests.

Array Types Both 1D and 2D array types were tested, but with a

total of 7 access methods: apart from the usual method for

both 1D and 2D arrays, we also experimented

access for 2D arrays and / storage / ac-

cess for ﬁxnums. See section 4.3 on the next page for further

discussion on this.

These new parameters increased the number of tests to 252 per

algorithm, making a total of 1008 individual test cases. Again, only

especially interesting comparisons are shown.

Also, following the advice of Anderson and Rettig (1994), care

has been taken to use instead of for ﬁxnum division in

order to avoid unwanted ratio manipulation.

Finally, it should be noted that none of the tested programs

triggered any garbage collection (GC) during timing measurement,

apart from a few unoptimized (hence unimportant) cases, and a

“special occasion” with ACL described later. This is a good thing

because it demonstrates that for low-level operations such as the

ones we are studying, the language design does not get in the way

of performance, and the compilers are able to avoid costly memory

management when it is theoretically possible. See also section 7.3

on page 11 for more remarks on GC.

4.2 LISP code tuning

For the reader unfamiliar with LISP, it should be mentioned that

requesting optimization is not achieved by passing ﬂags to the com-

piler as in C, but by “annotating” the source code directly with so-

called declarations, specifying the expected type of variables, the

level of required optimization, and even compiler-speciﬁc informa-

tion. Examples are given below.

Unoptimized code is obtained with and all other

qualities set to 0, while fully optimized code is obtained with

and all other qualities set to 0. Providing the type dec-

larations needed by LISP compilers in order to optimize was more

difﬁcult than expected however. The process and the difﬁculties are

explained below.

First, we compiled untyped LISP code with full optimiza-

tion and examined the compilation logs. The Python compiler

(MacLachlan, 1992) from CMU-CL (or SBCL for that matter) has

a very nice way to throw a compilation note each time it is unable

to optimize an operation. This makes it very easy to provide it with

the strictly required type declarations without cluttering the code

too much.

After that, we used the somewhat less ergonomic

declaration of ACL in order to check that these mini-

mal type declarations were also enough for the Allegro compiler

to avoid number consing. This is not however sufﬁcient to avoid

suboptimal type declaration in ACL, because even without any sig-

nalled number boxing, the compiler might be unable to open-code

some of its internal functions. Thus, we had to use yet another ACL-

speciﬁc declaration to get that information; namely

. Some typing issues with ACL still remain;

they will be described in section 4.3.

As a result of this typing process, two code examples are pro-

vided in listings 6 and 7. These are the LISP equivalent of the C

examples shown in listings 1 and 2 on page 2.

(defun add (to from op)
  (declare (type (simple-array single-float (*)) to))
  (declare (type (simple-array single-float (*)) from))
  (declare (type single-float op))
  (let ((size (array-dimension to 0)))
    (dotimes (i size)
      (setf (aref to i) (+ (aref from i) op)))))

Listing 6. Linear Pixel Addition, LISP Version

These two code samples should also be credited for illustrating

one particular typing annoyance: note that adding two ﬁxnums

might produce something bigger than a ﬁxnum, hence the risk

of number boxing. This means that in principle, we should have

written the addition like this:

or used a macro to do so. Surprisingly however, none of the tested

compilers complained about this missing declaration. After some

investigation, it appears that they don’t need this declaration, but

for different reasons:

(defun add (to from op)
  (declare (type (simple-array single-float (*)) to))
  (declare (type (simple-array single-float (*)) from))
  (declare (type single-float op))
  (let ((size (array-dimension to 0))
        (offset 0)
        pos)
    (declare (type fixnum offset))
    (dotimes (i size)
      (setf offset (+ offset +offset+)
            pos (rem offset size)
            (aref to pos) (+ (aref from pos) op)))))

Listing 7. Pseudo-Random Pixel Addition, LISP Version

•

Python’s type inference system is relatively clever: if you assign
the result of an addition of two fixnums to a variable also
declared as a fixnum (the array in our case), then it is
deduced that no arithmetic overflow is expected in the code,
hence it is not necessary to declare (or to check) the type of
the addition's result. Note that this is also true for multiplication,
for instance.

•

On the other hand, ACL’s type inference system does not go

that far, but has a “hidden” switch automatically turned on in

our full optimization settings, and which assumes the result

of ﬁxnum addition to remain a ﬁxnum. This is not true for

multiplication however. Hence, the same code skeleton might

(surprisingly) behave differently, depending on the arithmetic

operation involved.

As a last example of typing difﬁculty, CMU-CL’s type infer-

ence (or compiler note generation) system seems to have prob-

lems with loops: declaring the type of a loop

index is normally not needed to get inlined arithmetics (and SBCL

doesn’t require it either). However, when two loops are

nested, CMU-CL issues a note requesting an explicit declaration for

the ﬁrst index. Macro-expanding the resulting code shows that two

consecutive declarations are actually issued: the one present in the

macro, and the user-provided one. On the other hand, no

declaration is requested for the second loop’s index. The reason for

this behavior is currently unknown (even by the CMU-CL maintainers

we contacted). As a consequence, it is difficult to figure out if

we can really trust the compilation notes or if there is a missed op-

timization opportunity: we had to disassemble the code in question

in order to make sure that inlined ﬁxnum arithmetics was indeed

used.

All of this demonstrates that providing only the strictly necessary
and correct type declarations is almost impossible, because
compilers may behave quite differently with respect to type
inference, and the typing process can lead to surprising behaviors,
unless you have been “wading through many pages of technical
documentation”, to put it like Fischbacher (2003), and finally
acquired an in-depth knowledge of each compiler's specificities.

4.3 Array types

In LISP just as in C, one could envision representing 2D images

as 1D or 2D arrays. In section 3.1 on page 3, we showed that

the chosen C representation does not make much difference for

optimized code.

In LISP, the choice is actually larger than just 1 or 2D: in

addition to choosing between single or multidimensional array, the

following opportunities also arise:

• It is possible to treat multi-dimensional arrays as linear ones
thanks to the standard function row-major-aref.

• At some point in LISP history (Anderson and Rettig, 1994),
it was considered a better choice to use an unspecialized
simple-vector for storing fixnums, along with its accompanying
accessor svref, because using a specialized simple-array
might involve shifting in order to represent them as machine
integers. On the other hand, fixnums are immediate objects, so
they are stored as-is in a simple-vector.

The combination of element type / array type / access method

amounts to a total of 7 cases that are worth studying.
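The storage and access combinations can be sketched as follows for fixnum images. The four function names are hypothetical and the functions only sum the pixels; they are meant to show the declarations and accessors involved, not the benchmarked algorithms.

```lisp
;; Hypothetical helpers: each sums all pixels of a fixnum image using one
;; storage / access combination.

(defun sum-2d (image)                   ; 2D storage, plain 2D access
  (declare (type (simple-array fixnum (* *)) image))
  (let ((sum 0))
    (dotimes (y (array-dimension image 0) sum)
      (dotimes (x (array-dimension image 1))
        (incf sum (aref image y x))))))

(defun sum-2d-row-major (image)         ; 2D storage, linear access
  (declare (type (simple-array fixnum (* *)) image))
  (let ((sum 0))
    (dotimes (i (array-total-size image) sum)
      (incf sum (row-major-aref image i)))))

(defun sum-1d (image)                   ; 1D specialized array
  (declare (type (simple-array fixnum (*)) image))
  (let ((sum 0))
    (dotimes (i (length image) sum)
      (incf sum (aref image i)))))

(defun sum-simple-vector (image)        ; unspecialized vector and SVREF
  (declare (type simple-vector image))
  (let ((sum 0))
    (dotimes (i (length image) sum)
      (incf sum (svref image i)))))
```

The same skeletons with single-float in place of fixnum would cover the floating point cases; svref only applies to the unspecialized vector.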

Chart 8. Linear Optimized Algorithms, CMU-CL Versions (bar chart of execution time in seconds for Assignment, Addition, Multiplication and Division; from rear to front: 2D / 2D ROW-MAJOR / 1D Array; series: Fixnum, Single Float, Fixnum SVREF)

Chart 8 shows the CMU-CL timings for all linear optimized
algorithms, with (from rear to front) 2D, 2D row-major-aref accessed,
1D, and fixnum simple-vector implementations. This chart
prompts several remarks:

•

Contrary to C, using a plain 2D access method gives poor

performance: the algorithms perform between 2.8 (for single-

ﬂoat images) and 4.5 (for ﬁxnum images) times slower than the

1D version. This performance loss is much less important on the

integer division though: the 1D ﬁxnum version performs “only”

1.35 times faster than the 2D one, and the single-ﬂoat versions

perform at exactly the same speed.

•

The performance gap is more pronounced for ﬁxnum images

than for single-ﬂoat ones: while the performance for ﬁxnum

and single-ﬂoat is equivalent in the case of 1D access, ﬁxnum

algorithms are roughly 1.5 times slower than their single-ﬂoat

counterpart in plain 2D access.

• Altogether, the timings for 2D row-major-aref access and 1D
access methods are practically the same, although there is a
small noticeable difference for integer division. This brings an
interesting remark: it seems that while 2D access optimization
is performed at the compiler level in C, it is available
at the language level in COMMON-LISP, through the use of
row-major-aref.

• Finally, using a simple-vector to store fixnum images does
not bring anything compared to the traditional simple-array
approach. This is in contradiction to Anderson and Rettig (1994),
but the reason is that nowadays, LISP compilers use efficient
representations for properly specialized arrays, so for instance,
fixnums are stored unshifted, just as in a simple-vector.
Actually, if one wants machine integers, a type such as
(signed-byte 32) is available.

A similar chart for the inlined versions (not presented here)
teaches us that inlining has a bad influence on the 2D
row-major-aref access versions: they take about twice the time to
execute, which is a strong point in favor of choosing a 1D
implementation.

Similar charts (not presented here) were also made for
pseudo-random access (both in optimized and inlined versions). There is
no clear distinction between the different implementation choices,
however. The plain 2D method is behind the others in most cases
(but not all), and the 2D row-major-aref choice would seem to be better
most of the time, but the performance differences hardly reach 10%,
which is not enough to lead to any definitive conclusion.

The case of SBCL is very similar to that of CMU-CL, hence will

not be discussed further. Let us just mention that some differences

are noticed on inlined versions, which seems to indicate that inlin-

ing is still a “hot” topic in LISP compilers (probably because of its

tight relationship to register allocation policies).

Chart 9. Linear Optimized Algorithms, ACL Versions (bar chart of execution time in seconds for Assignment, Addition, Multiplication and Division; from rear to front: 2D / 2D ROW-MAJOR / 1D Array; series: Fixnum, Single Float, Fixnum SVREF)

The case of ACL is peculiar enough to be presented here, however
(see chart 9). The most surprising behavior is the incredibly
poor performance of the row-major-aref access method in general, and
for single-float images in particular (note that the timing goes up to
22 seconds, not even including GC time, whereas CMU-CL hardly
reaches 2.5 seconds). Examining a verbose compilation log leads
us to the following assumptions:

• ACL indicates that it is unable to open-code some of the array
function calls involved, most importantly those to row-major-aref,
so it has to use a generic (polymorphic?) one. This might
explain why the performance is so poor.

• In the case of floating point images, ACL also indicates that it
is generating a single-float box for every arithmetic operation.
This might further explain why floating point images perform
even more poorly than fixnum ones.

As of now, we are still unable to explain this behavior, so it is

unknown whether we missed an optimization opportunity (that

would have required additional code trickery or intimate knowledge

of the compiler anyway), or whether we are facing an inherent

limitation in ACL.

Final remark on plain 2D representation. In our experiments, the
size of the images is not known at compile time, so arrays are
declared as follows in the 2D case:
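A declaration consistent with this description, leaving both dimensions unspecified with *, could be sketched as follows. The function pixel-count is a hypothetical wrapper used only to give the declaration a context.

```lisp
;; Hypothetical wrapper around a 2D declaration with unknown dimensions.
(defun pixel-count (image)
  ;; Rank 2, element type fixnum, both dimensions unknown at compile time.
  (declare (type (simple-array fixnum (* *)) image))
  (* (array-dimension image 0) (array-dimension image 1)))
```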

Out of curiosity, and motivated by the fact that ACL is unable
to inline the corresponding array access calls in this situation, we
wanted to know the impact of a more precise type declaration on the
performance.

To that aim, we ran experiments with hard-wired image sizes in

the declarations. Arrays were hence declared as follows:
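A minimal sketch of such a hard-wired declaration is given below, together with one way of generating and compiling a dedicated function at run time. The generator name make-dedicated-fill and the image sizes used with it are assumptions chosen for illustration.

```lisp
;; Hypothetical generator: builds and compiles, at run time, a filling
;; function whose array declaration hard-wires the given image size.
(defun make-dedicated-fill (height width)
  (compile nil
           `(lambda (image value)
              ;; Both dimensions are now part of the declared type.
              (declare (type (simple-array fixnum (,height ,width)) image)
                       (type fixnum value))
              (dotimes (y ,height image)
                (dotimes (x ,width)
                  (setf (aref image y x) value))))))
```

A call such as (funcall (make-dedicated-fill 2 3) image 5) then fills a 2 by 3 fixnum image with the value 5.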

We observed a performance gain of approximately 30% (for CMU-CL)
and 20% (for ACL) in plain 2D access (still worse than the
1D version), and no gain at all for row-major-aref access. This is
interesting because even when using the full expressiveness of
the language (in that case, generating and compiling dedicated
functions with full knowledge of the array types on the fly), we
gain nothing compared to using a standardized access facility.

To conclude, the choice between 2D (even with row-major-aref
access, because of ACL) and 1D array representation does matter
a bit more than in C, but also tends towards the 1D choice. In the
remainder of this section, we will consider only 1D simple arrays.

4.4 The Assignment Algorithm

Chart 10 shows comparative results for the 3 tested compilers on

the optimized versions of the 1D pixel assignment algorithm. Ad-

ditionally, the inlined versions are shown on the right side of each

bar group for CMU-CL and SBCL (note that ACL doesn’t support

user function inlining). Timings are displayed on a logarithmic

scale. Following are several observations worth mentioning from

this chart.

Chart 10. Pixel Assignment, Optimized LISP Versions (bar chart of execution time in seconds, logarithmic scale, for Int / Rand, Int / Linear, Float / Rand and Float / Linear; from rear to front: ACL / CMU-CL / SBCL; series: Optimized, Inlined)

4.4.1 Lisp Compilers

In pseudo-random access, all compilers perform comparably,
although ACL seems a bit slower. However, this is not significant
enough to draw any conclusion. On the other hand, the difference
in performance is significant for linear access: both CMU-CL and
SBCL run twice as fast as ACL.

4.4.2 Access Types

The impact of the access type on performance is exactly the same

as in the C case: both CMU-CL and SBCL perform 30 times faster in

linear access mode. For ACL, this factor falls down to 13 though.

Further investigation on the impact of the additional arithmetics

only is provided in section 4.6 on the facing page.

4.4.3 Optimization levels

It would not make any sense to compare unoptimized versions of
LISP code and C code (nor across LISP compilers, actually)
because the semantics of “unoptimized” may vary across languages
and compilers: for instance, Python compilers (that is, CMU-CL and
SBCL) may treat type declarations as assertions and perform type
verification for unoptimized code, which never happens in C. We are
not interested in benchmarking this process.
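To make the distinction concrete, here is a sketch of the same assignment function under two sets of standard optimization qualities. The function names and the exact quality values are assumptions; they are not necessarily the settings used for the benchmarks.

```lisp
;; Hypothetical pair of functions differing only in optimization qualities.

(defun assign-safe (image value)
  ;; Full safety: a compiler may treat the type declarations below as
  ;; assertions and verify them at run time.
  (declare (optimize (speed 0) (safety 3) (debug 3)))
  (declare (type (simple-array fixnum (*)) image)
           (type fixnum value))
  (dotimes (i (length image) image)
    (setf (aref image i) value)))

(defun assign-fast (image value)
  ;; Full optimization: the declarations are trusted and run-time
  ;; checks may be dropped.
  (declare (optimize (speed 3) (safety 0) (debug 0)))
  (declare (type (simple-array fixnum (*)) image)
           (type fixnum value))
  (dotimes (i (length image) image)
    (setf (aref image i) value)))
```

Both functions compute the same result on correctly typed arguments; only the amount of checking, and hence the execution time, is expected to differ.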

However, it can be interesting, for the development process, to
figure out the performance gain from non-optimized to fully optimized
LISP code, and also to see its respective impact on pseudo-random and
linear access, just as we did for C.

Chart 11. CMU-CL Pixel Assignment Performance (bar chart of execution time in seconds, logarithmic scale, for Int / Rand, Int / Linear, Float / Rand and Float / Linear; from rear to front: Unoptimized / Optimized / Inlined)

Chart 11 shows the performance of CMU-CL on the assignment

algorithm for code fully safe (GC time removed), fully optimized,

and inlined (from rear to front), in the case of 1D access.

Timings are displayed on a logarithmic scale. We observe a behav-

ior very similar to that of C (chart 3 on page 3): optimization is

insigniﬁcant for pseudo-random access, but quite important for lin-

ear access. This gain is also more important here than in C: while

optimized C versions ran approximately twice as fast as their un-

optimized counterpart, the LISP versions run between 3.8 and 5.3

times faster. In fact, the gain is more important, not because opti-

mized versions are faster, but because unoptimized ones are slower:

in LISP, safety is costly.

Although not presented here, similar charts were made for SBCL

and ACL. The behavior of SBCL is almost identical to that of CMU-

CL, so we will say no more about it. ACL has some differences,

due to the fact that the unoptimized versions are much slower

(sometimes twice as slow) than Python generated code. Again,

it would not be meaningful to compare unoptimized code, even

between LISP compilers. But this just leads to the consequence

that the effect of optimization is signiﬁcant even in pseudo-random

access mode for Allegro.

Chart 12. CMU-CL Pixel Addition Performance (bar chart of execution time in seconds, logarithmic scale, for Int / Rand, Int / Linear, Float / Rand and Float / Linear; from rear to front: Unoptimized / Optimized / Inlined)

With some anticipation on the following section, chart 12 shows

the impact of optimization on the addition algorithm for CMU-CL

(logarithmic scale), where we learn something new: it appears that

contrary to C, the impact of optimization is much more important

on ﬂoating point calculations. More precisely, it is the cost of

safety which is much greater for ﬂoating point processing than for

integer calculation: optimized versions run at approximately the

same speed, regardless of the pixel type. However, unoptimized

ﬁxnum assignment goes 1.4 times faster than the ﬂoating point

version in the pseudo-random access case, and 8.5 times faster in

linear access mode. This explains why optimization has a more

important impact on the ﬂoating point versions.

Similar charts were made for the other algorithms, and the other
compilers. They all confirm the preceding observations. There is
one issue specific to CMU-CL however: as visible on chart 12,
inlining actually degrades performance, which
is particularly visible on the pseudo-random access versions.
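For reference, user function inlining is requested in COMMON-LISP with the standard inline declaration, as in the following sketch. The helper pixel+ and the two callers are hypothetical names; whether the request is honored is up to the compiler.

```lisp
;; Hypothetical pixel-level helper, proclaimed inline before its
;; definition so that compilers supporting user function inlining may
;; expand it at its call sites.
(declaim (inline pixel+))
(defun pixel+ (pixel value)
  (declare (type fixnum pixel value))
  (the fixnum (+ pixel value)))

(defun add-inlined (image value)
  (declare (type (simple-array fixnum (*)) image)
           (type fixnum value))
  (dotimes (i (length image) image)
    (setf (aref image i) (pixel+ (aref image i) value))))

(defun add-not-inlined (image value)
  (declare (type (simple-array fixnum (*)) image)
           (type fixnum value))
  ;; NOTINLINE locally forbids expanding PIXEL+ in this function.
  (declare (notinline pixel+))
  (dotimes (i (length image) image)
    (setf (aref image i) (pixel+ (aref image i) value))))
```

Timing both callers on the same image is one way to check, for a given compiler, whether inlining helps or hurts.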

4.5 Other Algorithms

Performance charts were made for the addition, multiplication and
division algorithms. They are very similar in shape, so only the
results for division are presented on chart 13 (division
was chosen because it has some specificities). As in chart 10,
the performance of the 3 tested compilers is presented
for the optimized versions, along with the inlined versions of
CMU-CL and SBCL.

Chart 13. Pixel Division, Optimized LISP Versions (bar chart of execution time in seconds for Int / Rand, Int / Linear, Float / Rand and Float / Linear; from rear to front: ACL / CMU-CL / SBCL; series: Optimized, Inlined)

The following points can be noticed on this chart:

•

Globally, we can classify ACL, CMU-CL and ﬁnally SBCL by

order of efﬁciency (from the slowest to the fastest). This is also

the case for the other algorithms.

• Specifically for pseudo-random integer division, SBCL seems
noticeably better than the two others. The reason for this particularity
is currently unknown.

• The impact of the access type on performance here is exactly
the same as in the C case: the division, being more costly,
reduces the gap between linear and pseudo-random access to
a factor of 9 to 11 (the other algorithms feature a factor of 30,
just like in C).

• This chart also confirms that inlining is insignificant at best.
However, CMU-CL seems to have a specific problem with
pseudo-random access (this is true for all algorithms): inlining
degrades performance by up to 15% in some
cases. Although not presented here, we found that SBCL
suffers from inlining too, but mostly for linear access. All of
this suggests that inlining is still a “hot” topic in LISP
compilers, and can have a serious impact on locality and on register
allocation policies.

• Finally, the global insignificance of inlining in this very case
of integer division demonstrates that none of the tested
compilers seems to be able to perform the constant integer division
optimization brought to light with C. This hypothesis was
confirmed by compiling and disassembling a short function of one
argument performing a constant division of this argument: for
both CMU-CL and SBCL, the division is optimized only when
the divisor is a power of 2, whereas GCC is able to optimize
division by any constant divisor.
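The two situations can be sketched as follows. The function names are hypothetical, and the multiply-and-shift version only illustrates the general technique of dividing by a constant through a multiplication by a precomputed reciprocal, as covered by Warren (2002); it is not a transcription of what any of the compilers actually emits.

```lisp
;; Division by a power of two: an arithmetic right shift yields the same
;; quotient as FLOOR for any integer.
(defun div-by-4 (x)
  (declare (type fixnum x))
  (ash x -2))

;; Division by the constant 3 for unsigned 32-bit values: multiply by
;; the precomputed constant (2^33 + 1) / 3 = #xAAAAAAAB, then shift the
;; product right by 33 bits.
(defun div-by-3 (x)
  (declare (type (unsigned-byte 32) x))
  (ash (* x #xAAAAAAAB) -33))
```

The second form replaces a division with a multiplication and a shift, which is the kind of rewriting a compiler can only do when the divisor is known at compile time, for instance after inlining.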

4.6 Caching

As in the C case, we noticed a factor of 30 in performance between
linear and pseudo-random access. In order to evaluate the impact of
the additional arithmetic operations only, we ran tests equivalent
to those described in section 3.5 on page 4 (linear access, but still
performing pseudo-random access related arithmetic). Listing 8
shows the alternative assignment algorithm for fixnum images
used to this end.

(defun assign (image value)
  (declare (type (simple-array fixnum (*)) image))
  (declare (type fixnum value))
  (let ((size (array-dimension image 0))
        (offset 0))
    (declare (type fixnum offset))
    (dotimes (i size)
      (setf offset (+ offset +offset+)
            (aref image i) (rem offset size)))))

Listing 8. Alternative Pixel Assignment, LISP Version

Here again, comparing the performances of these versions and

the real linear ones shows that the performances degrade only by

approximately a factor of 5 (just like in C); not 30. Still as in the

C case, comparing performances on big and small images with the

same amount of arithmetic operations leads to charts 14 and 15 on

the next page for CMU-CL.
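A comparison at constant workload can be sketched with a small timing harness. The names assign-linear and time-assign, as well as the image sizes mentioned afterwards, are assumptions; the actual benchmarking infrastructure is the one provided with the source code of the experiments.

```lisp
;; Hypothetical linear assignment kernel on a 1D fixnum image.
(defun assign-linear (image value)
  (declare (optimize (speed 3) (safety 0)))
  (declare (type (simple-array fixnum (*)) image)
           (type fixnum value))
  (dotimes (i (length image) image)
    (setf (aref image i) value)))

;; Hypothetical harness: runs the kernel REPETITIONS times on an image
;; of SIZE pixels and returns the elapsed real time in seconds.
(defun time-assign (size repetitions)
  (let ((image (make-array size :element-type 'fixnum :initial-element 0))
        (start (get-internal-real-time)))
    (dotimes (r repetitions)
      (assign-linear image r))
    (/ (- (get-internal-real-time) start)
       (float internal-time-units-per-second))))
```

For instance, (time-assign 1000000 10) and (time-assign 10000 1000) both perform ten million pixel writes, so any difference between the two timings comes from the image size rather than from the amount of work.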

Chart 14. Pseudo-Random Optimized Algorithms, CMU-CL Versions (bar chart of execution time in seconds for Assignment, Addition, Multiplication and Division; rear to front: Big / Small Image; series: Integer, Floating Point)

These charts are very similar to charts 6 on page 5 and 7 on

page 5 for C: for pseudo-random access, the assignment algorithm

goes 3.8 times faster for the small image (5.5 for C), and the other

algorithms gain roughly a factor of 2.3 (3 for C). In the linear case,

the assignment algorithm gains roughly a factor of 3 (2.6 for C)

instead of 3.8, the addition and multiplication algorithms hardly

reach 1.2 (1.3 in the case of C), and the division algorithm gains

nothing, exactly as in C.

Chart 15. Linear Optimized Algorithms, CMU-CL Versions (bar chart of execution time in seconds for Assignment, Addition, Multiplication and Division; rear to front: Big / Small Image; series: Integer, Floating Point)

Although not presented here, similar charts were made for SBCL
and, as expected, they exhibit the same behavior as CMU-CL. The
case of ACL is also very similar, although the gain in performance
from big to small image tends to be globally smaller than
for CMU-CL and SBCL: in pseudo-random access, the assignment
algorithm runs less than twice as fast on the small image (compare
this to 3.8 in the case of CMU-CL) and the ratio for the other
algorithms is around 2.2, almost identical to the case of CMU-CL.
In linear access, the gain is practically nothing however, even for
the assignment algorithm, which is surprising since we got a factor
of 3 with Python generated code.

5. Final Comparative Performance

In this section, we draw a ﬁnal comparison of absolute performance

between the C and LISP versions of our algorithms.

Charts 16 and 17 show all the algorithms, respectively in

pseudo-random and linear access mode. From rear to front, exe-

cution times for ACL, SBCL, CMU-CL and C are presented.

Chart 16. Pseudo-Random Algorithms, Fastest Versions (bar chart of execution time in seconds for Assignment, Addition, Multiplication and Division; rear to front: ACL / SBCL / CMU-CL / C; series: Integer, Floating Point)

Note that these charts compare what we found to be globally

the best implementation solution for performance in every case.

This means that we use a 1D representation in both camps, and

we actually compare inlined C versions with optimized (but not

inlined) LISP versions because inlining makes things worse in LISP

most of the time. Hence, strictly speaking, we are not comparing

exactly the same things.

This is justified by the position we are adopting: that of a
programmer interested in getting the best in each language while not
being an expert in low level optimization. Only general optimization
settings (compiler flags in C, standard qualities declarations in LISP)
and inlining are used. No compiler-specific local trick has been tested,
either in C or in LISP.

Chart 17. Linear Algorithms, Fastest Versions (bar chart of execution time in seconds for Assignment, Addition, Multiplication and Division; rear to front: ACL / SBCL / CMU-CL / C; series: Integer, Floating Point)

And here comes an enjoyable little surprise:

•

For pseudo-random access, the assignment algorithm runs

about 19% faster in LISP than in C. For pixel addition and

multiplication, the performance gap hardly reaches 5%, and is

sometimes in favor of LISP, sometimes in favor of C. So the

differences are completely insigniﬁcant.

•

The only exception to that is the case of integer division where

C is signiﬁcantly faster, but notice however that even without

knowledge of the constant integer division optimization, SBCL

manages to lose only by 17%.

The linear case also teaches us several things:

•

ACL is noticeably slower than both CMU-CL and SBCL: it runs

between 1.4 and 2.2 times slower than its competitors. The case

of division is less pronounced though.

•

C code gives performances strictly equivalent to that of CMU-

CL or SBCL, with the exception of division.

• In that case, it clearly wins with integers (because of the
optimization already mentioned), but loses by 10% with floating
point images.

Similar charts were made for small images, ﬁtting entirely into

the cache, and are of similar shape, so they are not presented

here. The only different behavior they exhibit is that of ACL for

which the performances are even poorer, especially in the linear

case. This suggests that apart maybe from ACL, low-level hardware

optimization has globally the same impact on LISP and C code. In

other words, the aspects of locality (both of instructions and data)

are topologically very close.

6. Conclusion

In this paper, we described ongoing research on the behavior

and performance of LISP in the context of scientiﬁc numerical

computing. We tested a few simple image processing algorithms

both in C and LISP, and also took the opportunity to examine

three different LISP compilers. With this study, we demonstrated

the following points:

•

With respect to parameters external to the languages, such as

hardware optimization via caching, the behavior of equivalent

LISP and C code is globally the same. This is comforting for

people considering switching language, because they should

not be expecting any unpredictable behavior from anything else

than language transition.

•

With respect to parameters internal to the languages, we saw

that there can be surprising differences in behavior between C

and LISP. For instance, we showed that plain 2D array repre-

sentations behave quite differently (slow in LISP), and that per-

formance can depend on the data type in LISP (which does not

happen in C). However, as soon as optimization is at work, and

is the main goal, those differences vanish: we are led to choose

equivalent data structures (namely 1D (simple) arrays), and re-

gain homogeneity on the treatment of different data types. This

is also comforting for considering language transition.

•

Since Fateman et al. (1995) and Reid (1996), considerable work

has been done in the LISP compilers in the ﬁelds of efﬁcient nu-

merical computation and data structures, to the point that equiv-

alent LISP and C code entails strictly equivalent performance,

or even better performance in LISP sometimes. This should ﬁ-

nally help getting C or C++ people’s attention.

•

We should also emphasize that these performance

results are obtained without intimate knowledge of either the

languages or the compilers: only standard and / or portable

optimization settings were used.

To be fair, we should mention that with such simple algorithms
as the ones we tested, we are comparing compiler performance
as well as language performance, but compiler performance is a
critical part of the production chain anyway. Moreover, to be even
fairer, we should also mention that when we speak of “equivalent
C and LISP” code, this is actually quite inaccurate. For instance,
we are comparing a language construct (for) with a programmer
macro (dotimes); we are comparing sealed function calls in C with
calls to functions that may be dynamically redefined in LISP, etc.
This means that it is actually impossible to compare exclusively
either language or compiler performance. This also means that
given the inherent expressiveness of LISP, compilers have to be
even smarter to reach the efficiency level of C, and this is really
good news for the LISP community.

LISP still has some weaknesses though. As we saw in section 4.2
on page 6, it is not completely trivial to type LISP code
both correctly and minimally (without cluttering the code), and a
fortiori portably. Compilers may behave very differently with
respect to type declarations, may provide type inference systems of
various quality, and may have who knows which more or less advertised
optimization flags of their own (this also happens in C, however). The
yet-to-be-explained problem we faced with ACL’s row-major-aref
access is the perfect example. Maybe the COMMON-LISP standard
leaves too much freedom to the compilers in this area.

Another current weakness of LISP compilers seems to be inlin-

ing technology. ACL simply does not support user function inlining

yet. The poor performance of CMU-CL in this area seems to indi-

cate that inlining defeats its register allocation strategy. It is also

regrettable that none of the tested compilers are aware of the in-

teger constant division optimization brought to light with C, and

for which inlining makes a big difference.

7. Perspectives

The perspectives of this work are numerous. We would like to

emphasize the ones we think are the most important.

7.1 Further Investigation

The research described in this paper is still ongoing. In particular,

we have outlined several currently unexplained behaviors with
respect to typing, performance, or optimization of one LISP
compiler or another. These oddities should be further analyzed, as they
probably would reveal room for improvement in the compilers.

7.2 Vertical Complexity

Benchmarking these very simple algorithms was necessary to iso-

late the parameters we wanted to test (namely pixel access and

arithmetic operations). These algorithms can actually be consid-

ered as the kernel of a special class of image processing ones called

“point-wise algorithms”, involving pixels one by one. It is likely

that our study will remain relevant for all algorithms in this class,

but nevertheless, the same experiments should be conducted on

more complex point-wise treatments, involving more local vari-

ables, function calls etc. for instance in order to spot potential

weaknesses in the register allocation policies of LISP compilers,

as the inlining problem tends to suggest, or in order to evaluate the

cost of function calls.

This is also the place where it would be interesting to measure

the impact of compiler-speciﬁc optimization capabilities, including

architecture-aware ones like the presence of SIMD (Single Instruc-

tion, Multiple Data) instruction sets. One should note however that

this leads to comparing compilers more than languages, and that

the performance gain from such optimizations would be very
dependent on the algorithms under experimentation (thus, it would be
difficult to draw a general conclusion).

7.3 Horizontal complexity

Besides point-wise algorithms, there are two other important algo-

rithm classes in image processing: algorithms working with “neigh-

borhoods” (requiring several pixels at once) and algorithms like

front-propagation ones where treatments may begin at several ran-

dom places in the image. Because of their very different nature in

pixel access patterns, it is very likely that their behavior would be

quite different from what we discussed in this paper, so it would be

interesting to examine them too.

Also, because of their inherent complexity, this is where the dif-

ferences in language expressiveness would be taken into account,

and in particular where it would become important to include GC

timings in the comparative tests, because it is part of the language

design (not to mention the fact that different compilers use different

GC techniques which adds even more parameters to compare).

7.4 Benchmarking other compilers / architectures

The benchmarks presented in this paper were obtained on a spe-

ciﬁc platform, with speciﬁc compilers. It would be interesting to

measure the behavior and performance of the same code on other

platforms, and also with other compilers. The automated bench-

marking infrastructure provided in the source code should make

this process easier for people willing to do so.

7.5 From dedication to genericity

Benchmarking low-level, fully dedicated code was the ﬁrst step in

order to get C or C++ people’s attention. However, most image

treaters want some degree of genericity: genericity on the image

types (RGB, Gray level etc.), image representation (integers, un-

signed, ﬂoats, 16bits, 32bits etc.), and why not on the algorithms

themselves. To this aim, the object oriented approach is a natu-

ral way to go. Image processing libraries with a variable degree

of genericity exist both in C++ and in LISP, using CLOS (Keene,

1989). As a next step of our research, a study on the cost of
dynamic genericity in the same context as that of this paper is at work

already. However, given the great ﬂexibility and expressiveness of

CLOS that C++ people might not even feel the need for (classes

can be redeﬁned on the ﬂy, the dispatch mechanism is customiz-

able etc.), the results are not expected to be in favor of LISP this

time.

7.6 From dynamic to static genericity

Even in the C++ community, some people feel that the cost of dy-

namic genericity is too high. Provided with enough expertise on

the template system and on meta-programming, it is now possible

to write image processing algorithms in an object oriented fashion,

but in such a way that all generic dispatches are resolved at compile

time (Burrus et al., 2003). Reaching this level of expressiveness in

C++ is a very costly task, however, because template programming

is cumbersome (awkward syntax, obfuscated compiler error mes-

sages etc.). There, the situation is expected to turn dramatically in

favor of LISP. Indeed, given the power of the LISP macro system

(the whole language is available in macros), the ability to generate

function code and compile it on-the-ﬂy, we should be able to auto-

matically produce dedicated hence optimal code in a much easier

way. There, the level of expressiveness of each language becomes

of a capital importance.
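As a small taste of this approach, the following sketch uses a macro to generate fully typed pixel addition functions, one per element type. The macro name define-pixel-adder and the generated function names are hypothetical, and the example is much simpler than the static genericity discussed above.

```lisp
;; Hypothetical macro: expands into a fully typed 1D pixel addition
;; function dedicated to one element type.
(defmacro define-pixel-adder (name element-type)
  `(defun ,name (image value)
     (declare (optimize (speed 3) (safety 0)))
     (declare (type (simple-array ,element-type (*)) image)
              (type ,element-type value))
     (dotimes (i (length image) image)
       (setf (aref image i)
             (the ,element-type (+ (aref image i) value))))))

;; Two dedicated instances generated from the same skeleton.
(define-pixel-adder add-fixnum fixnum)
(define-pixel-adder add-single-float single-float)
```

Each expansion is ordinary, dedicated code: the genericity is paid for at macro-expansion and compilation time, not at execution time.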

Finally, it should be noted that one important aspect of static

generic programming is that the cost of abstraction resides in the

compilation process; not in the execution anymore. In other words,

it will be interesting to compare the performances of LISP and C++

not in terms of execution times, but compilation times.

7.7 From Image Processing to . . .

In this paper we put ourselves in the context of image processing

to provide a concrete background to the tests. However, one should

note that the performance results we got do not reduce to this appli-

cation domain in particular. Any kind of application which involves

numerical processing on large sets of contiguous data might be
interested to know that LISP has caught up in terms of performance.

Acknowledgments

The author would like to thank Jérôme Darbon, Nicolas Pier-

ron, Sylvain Peyronnet, Thierry Géraud and many people on

and some compiler-speciﬁc mailing lists for

their help or insight.

References

(2002). Horus User Guide. University of Amsterdam, The Netherlands.

Anderson, K. R. and Rettig, D. (1994). Performing LISP: Analysis of the fannkuch benchmark. ACM SIGPLAN LISP Pointers, VII(4):2–12.

Boreczky, J. and Rowe, L. A. (1994). Building COMMON-LISP applications with reasonable performance.

Burrus, N., Duret-Lutz, A., Géraud, T., Lesage, D., and Poss, R. (2003). A static C++ object-oriented programming (SCOOP) paradigm mixing benefits of traditional OOP and generic programming. In Proceedings of the Workshop on Multiple Paradigm with OO Languages (MPOOL), Anaheim, CA, USA.

d’Ornellas, M. (2001). Algorithmic Pattern for Morphological Image Processing. PhD thesis, University of Amsterdam.

Duret-Lutz, A. (2000). Olena: a component-based platform for image processing, mixing generic, generative and OO programming. In Proceedings of the 2nd International Symposium on Generative and Component-Based Software Engineering (GCSE), Young Researchers Workshop; published in “Net.ObjectDays2000”, pages 653–659, Erfurt, Germany.

Fateman, R. J., Broughan, K. A., Willcock, D. K., and Rettig, D. (1995). Fast floating-point processing in COMMON-LISP. ACM Transactions on Mathematical Software, 21(1):26–62.

Fischbacher, T. (2003). Making LISP fast.

Froment, J. (2000). MegaWave2 System Library. CMLA, École Normale Supérieure de Cachan, Cachan, France.

Gabriel, R. P. (1985). Performance and Evaluation of LISP Systems. MIT Press.

Ibanez, L., Schroeder, W., Ng, L., and Cates, J. (2003). The ITK Software Guide: The Insight Segmentation and Registration Toolkit. Kitware Inc.

Keene, S. E. (1989). Object-Oriented Programming in COMMON-LISP: a Programmer’s Guide to CLOS. Addison-Wesley.

Lesage, D., Darbon, J., and Akgul, C. B. (2006). An efficient algorithm for connected attribute thinnings and thickenings. In Proceedings of the International Conference on Pattern Recognition, Hong-Kong, China.

MacLachlan, R. A. (1992). The python compiler for CMU-CL. In ACM Conference on LISP and Functional Programming, pages 235–246.

Neuss, N. (2003). On using COMMON-LISP for scientific computing. In CISC Conference, LNCSE. Springer-Verlag.

Quam, L. H. (2005). Performance beyond expectations. In The Association of LISP Users, editor, International LISP Conference, pages 305–315, Stanford University, Stanford, CA. The Association of LISP Users.

Reid, J. (1996). Remark on “fast floating-point processing in COMMON-LISP”. In ACM Transactions on Mathematical Software, volume 22, pages 496–497. ACM Press.

Steele, G. L. (1990). COMMON-LISP the Language, 2nd edition. Digital Press.

Warren, H. S. (2002). Hacker’s Delight. Addison Wesley Professional.