TUGboat, Volume 0 (9999), No. 0 draft: August 24, 2024 13:52 ?1

A large-scale format compliance checker for

X Font Metrics

Didier Verna

Abstract

We present

tfm-validate

, a T

X Font Metrics for-

mat checker. The library’s core functionality is to

inspect TFM les and report any discovered com-

pliance issue. It can be run on individual les or

complete directory trees.

tfm-validate

also pro-

vides a convenience function to (in)validate a local

X Live installation. When run this way, the li-

brary processes every TFM le in the distribution

and generates a website aggregating all the discov-

ered non-compliance issues. One public instance of

tfm-validate

is now automatically triggered on a

daily basis. The corresponding website is available

at texlive.info/tfm-validate/.

1 Introduction

As part of ETAP,

our experimental typesetting algo-

rithms platform [

], we have developed a parser for

TFM (T

X Font Metrics) les, simply called

tfm

. To

ensure robustness, a parser for an ocial data format

must be prepared to handle all sorts of compliance

problems, with varying degrees of seriousness rang-

ing from simple warnings to non-recoverable errors.

tfm

not only provides a rich (hopefully exhaustive)

ontology of errors, but also a powerful recovery mech-

anism, allowing for proceeding as long as possible

with the parsing, for example by xing errors on the

y or discarding problematic input.

A side-eect of

tfm

’s robustness is that it is

possible to use it as a validation tool rather than

for loading font information. Indeed, the

tfm

excep-

tion handler reies the problematic situations into

objects (in the “Object-Oriented” sense) which can

be silently collected until the parsing is over or needs

to be terminated prematurely. These objects can in

turn be used to produce a full compliance report for

the analyzed le. We have automated this process

for the whole T

X Live distribution, resulting in the

(in)validation of almost 80 000 fonts, and the gen-

eration of a website providing direct access to the

generated compliance reports.

This paper is organized as follows. Section 2

provides an overview of the

tfm

library and explains

how it is made robust. Section 3 describes the very

peculiar exception handling mechanism in use, and

how it simplies the design of

tfm-validate

con-

siderably. Finally, Section 4 analyzes the results of

github.com/didierverna/etap

the TFM validation process applied to the whole

X Live distribution.

2 The tfm library

The

tfm

library was designed to bring T

X Font

Metrics information to Common Lisp [

] applications.

Essentially, it provides an entry point function called

load-font

, which takes a le name as argument

and returns a data structure containing an abstract

representation of the contents of the TFM le. A full

description of the library is beyond the scope of this

paper. The interested reader will nd a complete

user manual in the distribution, as well as online.

The important thing for this discussion is that

tfm

aims at being both robust and exible.

2.1 Robustness

Robustness for a parser means that it should be

prepared to handle all the possible problematic sit-

uations, for example in order to abort loading the

culprit font le and exit gracefully, rather than just

crashing or behaving erratically. During the devel-

opment of

tfm

, we have identied twenty such situa-

tions, with varying degrees of severity.

Examples of critical situations include truncated

les or invalid section pointers, making it impossible

to know exactly where to nd character, ligature,

kerning information, etc. In those situations, there

is nothing clever one can do to make the bogus font

functional.

A less critical, yet problematic situation, would

be the detection of a cycle in a ligature program,

resulting in an innite loop when attempting the

ligature. In such a case, we can still hope to get

a functional (although incomplete) font if we just

forget about the ligature(s).

Non-critical situations might be inconsistencies

in parts of the TFM le which are purely informative

(such as several places in the header) and not used to

render the font. T

X itself simply ignores a number

of such situations and proceeds normally.

Finally, note that the severity of a problem

may depend on the context. One interesting such

case is that of the font’s design size. The TFM

format requires it to be greater than 1. At the

same time, T

X allows the design size to be over-

ridden by the user (this is what happens when you

say

\font\foo=cmr10 at 12pt

for example). An in-

valid design size is normally an error, but it doesn’t

really hurt when overridden by a correct one. Hence,

the

tfm

library signals an error in the former case,

but only a warning in the latter.

github.com/didierverna/tfm

A large-scale format compliance checker for T

X Font Metrics

?2 draft: August 24, 2024 13:52 TUGboat, Volume 0 (9999), No. 0

CL-USER> (tfm:load-font "/tmp/cmr10.tfm")

While reading /tmp/cmr10.tfm,

while reading the character encoding scheme string,

padded string "TeX (ex)" is not in BCPL format.

See §10 of the TFtoPL documentation for more information.

[Condition of type NET.DIDIERVERNA.TFM:INVALID-PADDED-STRING]

Restarts:

0: [KEEP-STRING] Keep it anyway.

1: [FIX-STRING] Fix it using /'s and ?'s.

2: [DISCARD-STRING] Discard it.

3: [CANCEL-LOADING] Cancel loading this font.

--more--

Figure 1: Sample interactive recovery session

2.2 Flexibility

Flexibility for a parser means that when possible,

it should provide less drastic ways to recover from

problems than just giving up.

tfm

currently provides

a dozen recovery options, the availability of which

depends on the situation.

As mentioned previously, it is possible to discard

a ligature or a kerning instruction rather than abort-

ing the whole loading process if something is wrong

(like an invalid character code). Another example is

the requirement that the width, height, depth and

italic corrections tables all start with a rst value

of 0. When appropriate,

tfm

oers to x a bogus

value (by setting it to 0) and proceed, rather than

just aborting.

The question of whether a font would be func-

tional after recovery is crucial. Discarding a single

ligature because of an invalid character code may be

safe. Resetting a non-zero rst table entry may be

safe as well, but it might also be the case that the

entire table (or the whole font for all we know) is in

fact completely corrupted. The point here is that it

is not the job of the library to make a decision, only

to oer options.

In fact, having options may come in handy for

interactive use (Common Lisp applications can be

run both interactively and as standalone executa-

bles). Figure 1 illustrates this. In this example,

a fake

cmr10

font has been corrupted on purpose:

the character encoding string present in the le’s

header has been modied to contain parentheses,

which is illegal. When loading the font interactively,

the user ends up in the debugger and is presented

with a number of “soft” recovery options (keeping

the string as-is, xing it, discarding it), in addition

to plain cancellation.

A non-interactive application, on the other hand,

would have the ability to automatically select an op-

tion without requiring user intervention. In produc-

tion, the most likely choice is

CANCEL-LOADING

(and

then fall back to another font). Given the goal we are

trying to achieve here however, we would prefer to

select the recovery option that allows us to proceed

with the parsing for as long as possible.

Figure 2 summarizes all the possible problems

(rectangles) and the corresponding recovery options

(ellipses) that

tfm

provides. The details are not

important. The intent of this picture is to convey the

feeling that even for a relatively simple le format, a

complete error/recovery ontology can quickly become

rather intricate.

3 The tfm-validate library

While

tfm

was originally a requirement for ETAP,

tfm-validate

is a typical case of a project that was

born out of curiosity rather than necessity, and also

because it was quite easy to do. The key ingredient in

tfm-validate

’s design simplicity is the very peculiar

exception handling that Common Lisp provides, the

so-called “condition system” [

], which we’ll now

describe.

3.1 The Common Lisp condition system

Most programming languages with explicit support

for exception handling use some form of “

try

catch

”

mechanism, as illustrated in the left part of Figure 3.

A program may establish points at which exceptions

(thrown elsewhere) are caught and handled. In the

example, the program throws an exception while

executing

func4

. The exception travels up the call

stack until it reaches the handler in

func2

. If the

exception is caught there, execution resumes at that

github.com/didierverna/tfm-validate

Didier Verna

TUGboat, Volume 0 (9999), No. 0 draft: August 24, 2024 13:52 ?3

file-overflow

cancel-loading

padded-string-overflow

invalid-original-design-size

character-list-cycle

discard-next-character

file-underflow

u16-overflow

fix-word-overflow

set-to-zero

invalid-padded-string keep-string

spurious-char-info

invalid-character-code

discard-extension-recipe

discard-kerning

discard-ligature

invalid-character-range

invalid-design-size set-to-ten

invalid-header-length

invalid-ligature-opcode

invalid-section-lengths

invalid-padded-string-length read-maximum-length

invalid-table-index

abort-lig/kern-program

invalid-table-length

invalid-table-start

ligature-cycle

no-boundary-character

fix-string discard-string

Figure 2: The tfm error ontology

point. Otherwise, the exception goes one more step

up, to func1.

Unfortunately, this mechanism suers from an

unnecessary limitation in expressiveness: the excep-

tion handler actually does two dierent things at

the same time (and for no good reason). Namely, a

control point established by a handler serves not only

to catch an exception, but also to resume execution.

There is in fact no reason to limit ourselves to such a

simple scheme, and the Common Lisp condition sys-

tem adds one more degree of freedom to its exception

handling infrastructure.

3.1.1 Signal / Handle / Restart

The equivalent of “throwing an exception” is called

“signalling a condition” in Lisp, and the concept

is equivalent. There is, however, no such thing as

single catch/resume points in the Lisp condition sys-

tem. Instead, a program establishes points where it

is possible to resume execution (called “restarts”),

and points where conditions are caught (called “han-

dlers”). This is illustrated in the right part of Fig-

ure 3. Given the same scenario as before,

func4

signals a condition. The condition goes up the call

A large-scale format compliance checker for T

X Font Metrics

?4 draft: August 24, 2024 13:52 TUGboat, Volume 0 (9999), No. 0

func1()

try/catch 1

func2()

try/catch 2

func3()

func4()

throw

func1()

handler 1

func2()

handler 2

restart 1

func3()

restart 2

func4()

signal

Figure 3: try/catch vs. handle/restart

stack and nds a handler in

func2

. If this handler is

interested, it now has two options: resume execution

right here (with restart 1), or in

func3

with restart 2.

Otherwise, the condition goes one more step up and

the handler in

func1

is given the same two choices,

since no additional restart is installed.

3.1.2 First-class conditions

A second important aspect of the Common Lisp con-

dition system (not unique to Lisp this time) is that it

is grounded in CLOS [

], the object-oriented layer of

the language. This means that creating an ontology

of errors boils down to designing a hierarchy of condi-

tion classes, and the signalled conditions are reied

as objects, that is, instances of the corresponding

classes. In other words, conditions are “rst-class”

citizens in the language [2, 7].

3.2 The design simplicity of tfm-validate

Why is all this relevant to the design simplicity of

tfm-validate

? As mentioned before (Section 2.2), it

is not the job of

tfm

to handle errors; only to detect

them and oer as many soft recovery options as

possible, for exibility. In the technical terms of the

Common Lisp condition system, we now understand

that

tfm

signals conditions and provides a variety of

restarts, but does not establish any handlers.

Short of handling conditions, a

tfm

user ulti-

mately ends up in the debugger if something goes

wrong (again, as demonstrated in Figure 1). But

the key point is that since restarts and handlers are

dierent concepts, it is possible to decide what to

do programmatically rather than interactively, by es-

tablishing handlers outside

tfm

, or more specically

around calls to it.

We can now understand why

tfm-validate

was in fact quite easy to write. The main entry

point is a function called

invalidate-font

, which

calls

tfm

’s

load-font

function. But before doing so,

invalidate-font

establishes a (rather large) han-

dler for all the conditions that

tfm

may signal, and

for every one of them, selects the “softest” restart

available, allowing to proceed with the parsing for

as long as possible. Note again that because han-

dlers and restarts are not required to be located at

the same places in the code, no modications to the

original

tfm

library are required to make it work like

a compliance checker rather than for loading fonts.

But

invalidate-font

doesn’t stop there. Ev-

ery time a condition is caught, the function collects

it before restarting (remember that conditions are ac-

tual objects). The return value of

invalidate-font

is thus the list (possibly empty) of all the signalled

conditions. In fact,

invalidate-font

doesn’t do

any printing by itself. After execution, the user gets

the list of signalled conditions, and is then free to

do whatever they wish with it, such as inspecting,

printing in one form or another, or even generating

a website …

4 T

X Live validation

… which is the point we are getting to. The func-

tion

invalidate-font

which, again, is essentially a

wrapper around

load-font

, collecting the signalled

conditions, is 68 lines long. With 10 more lines,

we oer a function checking the compliance of a

whole directory tree rather than of a single font le.

This function is unsurprisingly called

invalidate-

directory.

At that point, we were curious about the state

of the T

X Live distribution, since it is a rather

large repository of TFM les, all located under a

single directory tree. As it turns out, running our

function

invalidate-directory

on it revealed a

quite large number of non-compliance issues, which

was an incentive to put all that information into a

human-readable shape.

4.1 Non-compliance reports

The

tfm-validate

library provides yet another en-

try point called

invalidate-texlive

. It generates

a website aggregating non-compliance reports (one

HTML page per culprit TFM le) plus a couple of

indexes. With the help of Norbert Preining, the

system is now run on a daily basis and the cor-

responding website is made available at

texlive.

info/tfm-validate/.

Didier Verna

TUGboat, Volume 0 (9999), No. 0 draft: August 24, 2024 13:52 ?5

At the time of this writing, the results of the

validation process are as follows. 79016 fonts are in-

spected. 2983 fonts are skipped because

tfm

doesn’t

support OFM or JFM yet. 770 fonts are found to be

non-compliant, which may seem quite a lot. On the

other hand, there are only 4 kinds of problems: 3 of

which are considered warnings, and only a single one

a truly unrecoverable error.

4.2 File overow

By far, the most common issue that

tfm-validate

nds is le overows, aecting 628 fonts. The TFM

standard mandates that the rst two bytes of a TFM

le encode the le’s length. A “le overow” warning

is signalled if the actual le’s length is greater than

expected. Note that

tfm

knows about the special

values 0, 9, and 11, denoting extended TFM les

(OFM or JFM), which are not supported yet.

Of course, when the declared le size disagrees

with the actual, there is no way to tell for sure

which (if any) is correct. However, absent any other

problem during parsing, the le containing a tail of

junk is much more likely than the rst two bytes

(only) being corrupted, hence a warning.

A quick test on a couple of such les seems

to conrm that hypothesis. We compiled a sample

document with them, and it appears that not only

X has no problem loading the fonts, the outputs

look normal as well. On top of that, let us mention

that

tftopl

adopts the same posture: it signals

the problem but otherwise just discards the junk

(Section 20 of tftopl).

Further investigation on the tails was inconclu-

sive. In particular we couldn’t gure out whether

some tails contain meaningful information rather

than just junk (a possible cause for le overows

could be padding to storage blocks). As a conse-

quence, the signalled warnings do not include the

tails’ content.

4.3 String overow

The situation is slightly dierent with the next kind

of problem we encountered, namely, padded string

overows, currently aecting 74 fonts.

A TFM le may contain two optional strings in

its header. The rst one, 40 bytes long, identies

the character coding scheme. The second one, 20

bytes long, is the font identier (font family name).

These strings are supposed to be in BCPL format.

In particular, the rst byte must contain the actual

length of the string.

tfm

signals a “padded string overow” warning

when a BCPL string is not padded with zeros. Doug

McKenna suggested

that padding a BCPL string

with zeros may not have always been a requirement,

as it was only added to

pltotf

in April 1983, for

version 1.3, that is, two years after its initial release

(Section 87 of

pltotf

). On the other hand, David

Fuchs mentioned padding with zeros as early as in

February 1981 [3].

Anyway, the decision as to whether a padded

string overow should be a warning or an error is even

simpler to make than in the case of a le overow.

Those strings are purely informative, they have no

impact on the font’s usability, so it does not hurt to

continue loading the font.

Besides, the padding area seems to have been

intentionally abused in the majority of the cases: a

lot of fonts contain “

Y&Y Inc

” in there, making their

origin quite clear. Because of that (and contrary to

le overows), the content of the padding area is

included in the warnings.

4.4 Spurious char info

The next problem we encountered (also a warning,

aecting 66 les) is a more obscure matter. TFM les

have a so-called “char info table” providing the actual

character metrics of the font. The table contains 4-

byte entries for the full range of characters from the

minimum character code (

) to the maximum one

(

). However, a font may also have “holes” in this

range, that is, undened characters for some codes

between bc and ec.

Undened characters must have a width of 0,

materialized by a width table index of 0 as well. The

spurious char info warning indicates that an entry for

a non-existent character is not completely zeroed out.

In the problematic char info entries that we found,

the third byte usually has a value of 1 (indicating

an index into a ligature or kerning program), and

sometimes a non-zero fourth byte (the actual index).

A possible explanation would have been the

existence of a so-called “boundary character” (also

an obscure matter in TFM) which is not required

to exist for real in the font, but upon inspection of

several problematic ones, this appears not to be the

case.

tftopl

completely ignores characters with a

width index of 0 (Section 78 of

tftopl

), and

pltotf

zeroes out non-existent characters (Section 74 of

pltotf

). All the more reasons to not consider this

problem a showstopper.

4.5 Fix word overow

Finally, this one is the only true error we encountered,

and it only aects two fonts:

ArevSans-Bold

, and

reference lost; could have been in a thread on texhax…

A large-scale format compliance checker for T

X Font Metrics

?6 draft: August 24, 2024 13:52 TUGboat, Volume 0 (9999), No. 0

ArevSans-BoldOblique

. TFM has a notion of “x

word” numerical values which (with two exceptions)

must remain within

]− 16, +16 [

. In particular, the

actual font metrics (width, height, depth, and italic

correction) are expressed in x words.

In the two aforementioned fonts, exactly 124

such values are o the charts. Again, for the sake

of exibility,

tfm

oers a soft recovery option for

this problem (see Figure 2): setting the culprit value

to 0, which would most likely result in an unreadable

document. T

X refuses to load these fonts, which

conrms the severity of the problem; hence an error.

5 Related work

Manuel Pégourié-Gonnard wrote a Perl script

for

checking the validity of a variety of les using ex-

ternal programs (typically,

tftopl

for TFM les).

It is our understanding that this script produces

a somewhat terse output: it prints a list of “bad”

les without collecting more specic information, let

alone presenting it in a human readable form.

According to a comment by Karl Berry, the

script took a long time to run and maintenance of

the list of broken fonts was tedious, with no particular

action happening on the part of the font maintainers

to x the problems, so using it was abandoned in

August 2019.

6 Conclusion and perspectives

As mentioned before, this project was born out of

curiosity rather than necessity, and because it was

easy to develop. Whether it is actually useful remains

to be seen. Perhaps having compliance problems

publicly advertised on a website will be a new kind

of incentive for authors to update their les, and

perhaps this project will be more helpful to watch

over new additions rather than blame older content.

One merit of this project is to provide an insight

into the global status of TFM compliance over a

large set of fonts. In particular, we can see that the

surprising number of non-compliant les is mitigated

by the fact that most issues are in fact benign (only

two fonts were found to be truly unusable).

In the future, we plan on adding new font for-

mats to the system. Provided that we can nd the

appropriate documentation, OFM and JFM are likely

to be straightforward additions and as a matter of

fact, the

tfm

library is already prepared for it. We

have also started to work on an OTF parser, designed

along the same lines (that is, built around the Com-

mon Lisp condition system) but this will take slightly

longer to complete.

tug.org/svn/texlive/trunk/Master/tlpkg/bin/

tl-check-files-by-format

Finally, the current layout of the website still has

a lot of room for improvements. It currently provides

two indexes, but the general question boils down to

oering dierent forms of access to cross-referenced

information. Karl Berry has already suggested a cou-

ple of possible ways to do so, which we will denitely

take into account in the future.

Acknowledgements

The author wishes to thank Norbert Preining, Karl

Berry, and Doug McKenna for fruitful exchanges dur-

ing the development of both

tfm

and

tfm-validate

References

[1]

ANSI. American National Standard:

Programming Language — Common Lisp.

ANSI X3.226:1994 (R1999), 1994.

[2]

R. Burstall. Christopher Strachey —

Understanding programming languages.

Higher Order Symbolic Computation,

13(1–2):51–55, 2000.

[3]

D. Fuchs. T

X font metric les. TUGboat,

2(1):12–16, Feb. 1981.

tug.org/TUGboat/tb02-1/tb02fuchstfm.pdf

[4]

M. Herda. The Common Lisp Condition System.

Apress, 2020.

doi.org/10.1007/978-1-4842-6134-7

[5]

S.E. Keene. Object-Oriented Programming in

Common Lisp: a Programmer’s Guide to CLOS.

Addison-Wesley, 1989.

[6]

P. Seibel. Practical Common Lisp. Apress,

Berkeley, CA, USA, 2005. Online version at

gigamonkeys.com/book/.

[7]

J. Stoy, C. Strachey. OS6 — An experimental

operating system for a small computer. Part 2:

Input/output and ling system. The Computer

Journal, 15(3):195–203, 1972.

[8]

D. Verna. ETAP: Experimental typesetting

algorithms platform. In 15th European Lisp

Symposium, pp. 48–52, Porto, Portugal,

Mar. 2022. doi.org/10.5281/zenodo.6334248

[9]

D. Verna. Interactive and real-time typesetting

for demonstration and experimentation: ETAP.

TUGboat 44(2):242–248, 2023.

doi.org/10.

47397/tb/44-2/tb137verna-realtime

 Didier Verna

EPITA Research Lab

14–16, rue Voltaire

94270 Le Kremlin-Bicêtre

France

didier (at) lrde.epita.fr

https://www.lrde.epita.fr/~didier/

ORCID 0000-0002-6315-052X

Didier Verna