Similarity Problems in Paragraph Justification
An Extension to the Knuth-Plass Algorithm
Didier Verna
EPITA Research Laboratory
Le Kremlin-Bicêtre, France
didier@lrde.epita.fr
ABSTRACT
In high quality typography, consecutive lines beginning or ending with the same word or sequence of characters is considered a defect. We have implemented an extension to TeX's paragraph justification algorithm which handles this problem. Experimentation shows that getting rid of similarities is both worth addressing and achievable. Our extension automates the detection and avoidance of similarities while leaving the ultimate decision to the professional typographer, thanks to a new adjustable cursor. The extension is simple and lightweight, making it a useful addition to production engines.
CCS CONCEPTS
• Applied computing → Document preparation; • Theory of computation → Dynamic graph algorithms.
KEYWORDS
Paragraph Justication, Similarity Avoidance, Homeoteleutons,
Homeoarchies, T
E
X, Knuth-Plass Extension
ACM Reference Format:
Didier Verna. 2024. Similarity Problems in Paragraph Justification: An Extension to the Knuth-Plass Algorithm. In ACM Symposium on Document Engineering 2024 (DocEng '24), August 20–23, 2024, San Jose, CA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3685650.3685666
1 INTRODUCTION
In spite of its relatively old age, Donald Knuth's TeX typesetting system [10, 11] is still considered a de facto standard when it comes to digital typography. In particular, its paragraph justification algorithm, known as the Knuth-Plass [12], established a landmark in the category of algorithms considering a paragraph as a whole rather than proceeding line by line, as earlier (greedy) algorithms used to do [1, 4, 16].
Yet, many aspects of fine typography are not directly or automatically handled by TeX. Consider for example the leftmost paragraph in Figure 1. The typesetting was done by TeX with the Latin Modern Roman font at a 10pt size, and for a paragraph width of 201pt (the figure is optically scaled down in order to fit on the page).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DocEng ’24, August 20–23, 2024, San Jose, CA, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1169-5/24/08
https://doi.org/10.1145/3685650.3685666
Notice how three lines near the end of the paragraph end the same way, with the word "and". This is considered a defect in high quality typography, as it generates a micro-interruption: the reader's attention may be caught by the similarity, and the reader may temporarily lose focus or concentration. Such similarities may also lead the reader to accidentally skip a line or re-read the same one, even more so when the problem occurs at the beginning of the line rather than at the end of it. In fact, the field of textual criticism has identified the problem and its consequences in the very ancient context of scribal errors (e.g. missing lines in manual copies of the Bible made by monks) [20].
We have implemented an extension to the Knuth-Plass (KP) algorithm that is able to deal with that kind of defect. In this paper, we use the term similarity for the lack of an official terminology. We would also like to propose the expression character / word ladder, analogous to the more widespread expression hyphenation ladder, the meaning of which should be obvious. Accidental line skipping has been referred to with a rather awkward French expression, saut du même au même ("jump from same to same"), even in non-French literature [20]. Borrowed from rhetoric, the terms homeoarchy and homeoteleuton have come to designate beginning and end of line similarities respectively, and by extension, accidental line skipping because of them [15].
This paper is organized as follows. Section 2 mentions some related work. Section 3 provides an outline of the KP algorithm's operation, necessary to understand how our extension works. Section 4 describes our extension, and Section 5 presents some experimental results.
2 RELATED WORK
Frank Mielbach mentions the similarity problem in a survey of
existing alternative T
E
X engines and remaining issues [
13
]. No solu-
tion is proposed in this paper, as it is merely a state-of-the-art review.
Alex Holkner addresses the problem in his multiple-objective ap-
proach to line breaking [
7
]. However, the paper only seems to be
concerned with beginning-of-line similarities (called stacks). Al-
though inspired from it, the approach is not technically an extension
to T
E
X’s algorithm. e paper does not provide a precise denition
for stacks, and does not use the corresponding objective function in
the reported experimental results. Other extensions to the
KP
algo-
rithm have been proposed in the past, some micro-typographic [
17
],
some macro-typographic, for example to help with automatic docu-
ment layout [
6
]. Concerning the laer, our underlying motivations
are in fact opposite. e work in question aempts to provide ex-
ibility at the expense of quality in order to cope with situations
in which manual intervention is impossible, such as automatically
adjusting to dierent displays. We, on the other hand, are interested
[Figure: three typesettings of the opening paragraph of the Grimm Brothers' "Frog King" at increasing values of similar demerits, with the words involved in beginning- or end-of-line similarities marked.]
Figure 1: Similarity avoidance
[Figure: excerpt of a line breaking graph; each node shows the word ending one line and the word beginning the next.]
Figure 2: Line breaking graph excerpt
in helping professional (human) typographers aiming for the finest, by providing as much quality as possible automatically, and letting them focus on what strictly requires human intervention. This is more in line with the view of Hurst et al. [8].
3 THE KNUTH-PLASS IN A NUTSHELL
The KP algorithm essentially expresses the paragraph justification problem as a Single Pair Shortest Path one [3, 5]. The possible solutions for breaking a paragraph into lines are represented in the form of a graph (see Figure 2), in which every possible line break is a node, and the ability to go from one line break to the next is represented by an edge connecting two nodes. In the figure, each node advertises the corresponding end-of-line, and the beginning of the next. The problem thus boils down to finding the best route (referred to as "shortest path" in graph theory) from the top node to the bottom one.
In order to perform efficiently, the KP algorithm uses a dynamic programming [2] optimization technique which allows it to never construct the full (potentially huge) graph in memory. As it progresses through the paragraph's text, TeX maintains branches representing the best possible solutions (note the plural) so far, but also gets rid of branches which are provably going to be sub-optimal in the end.
Finding the shortest path between two nodes in a graph is usually expressed by minimizing cost rather than maximizing gain. When the KP algorithm explores the line breaking possibilities, it assigns demerits to various aspects of the solutions and eventually chooses the cheapest one. TeX's demerits are adjustable parameters, and can be classified in two categories.
Local Demerits are computed line by line, independently of the surrounding context. In particular, the so-called badness of a line increases as it needs to be stretched or shrunk, and additional penalties are applied to hyphenated lines.
Contextual Demerits are applied by comparing some aspects of two consecutive lines. In particular, hyphenation ladders are heavily penalized, as well as "adjacency" problems (two consecutive lines with very different scaling; for example a very loose line followed by a very compact one).
The KP algorithm works by minimizing the sum of local and contextual demerits across the whole paragraph.
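The overall scheme can be sketched as a shortest-path computation over break points. The following Python sketch is illustrative only: the function names and the cost model (cubed slack, measured in characters) are invented stand-ins for TeX's actual badness and demerits formulas.

```python
# Illustrative sketch of Knuth-Plass-style line breaking as a
# shortest-path / dynamic programming problem. The cost model is a
# toy stand-in; real TeX uses glue stretchability, penalties, and
# the local and contextual demerits described above.

def line_cost(words, i, j, width, is_last):
    """Toy demerits for placing words[i:j] on one line."""
    length = sum(len(w) for w in words[i:j]) + (j - i - 1)  # spaces
    if length > width:
        return float("inf")  # overfull lines are forbidden
    if is_last:
        return 0  # the last line is not justified
    return (width - length) ** 3

def break_paragraph(words, width):
    """Break optimally: minimize the sum of per-line demerits."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[j]: cheapest break of words[:j]
    back = [0] * (n + 1)             # back[j]: start index of the last line
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + line_cost(words, i, j, width, j == n)
            if cost < best[j]:
                best[j], back[j] = cost, i
    lines, j = [], n
    while j > 0:                     # follow back-pointers to recover lines
        lines.append(" ".join(words[back[j]:j]))
        j = back[j]
    return lines[::-1]
```

Note that, unlike this exhaustive double loop, the actual KP algorithm prunes provably sub-optimal branches as it goes, and keeps several "best" candidates per break point to account for contextual demerits.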
4 SIMILARITY HANDLING
In order to address similarity problems, our extension works by introducing a new kind of contextual demerits that we call similar demerits. Each time the algorithm compares two consecutive lines, it now also compares the respective beginnings and ends of line, and adds the similar demerits value to the total if similarities are encountered. Note that the end of the paragraph needs a special treatment. Most of the time, the last line is not justified, so the potential similarities would not be vertically aligned. Similar demerits may be applied in the very rare cases of complete justification though.
In order to compare consecutive lines for similarities, each node representing a potential break point in the partial graph must remember how the corresponding lines begin and end (which is apparent in Figure 2). This is done as follows. Every time the algorithm creates a new node, it collects the characters from the end of the line backward, discarding kerns, and stopping at the first glue or hyphenation point. The same process is applied (forward) to the beginning of the next line. There are several reasons for doing it this way.
First of all, kerns do not drastically change the vertical alignment of characters, so they have little impact on the reader's perception of similarity (besides, they would be identical in the middle of a similarity). On the other hand, glues, which are elastic, are much more likely to have a noticeable impact on vertical alignment. By stopping before the first one, we thus do not need to compare the
[Figure: number of paragraph breaking solutions (total, and with similarities) plotted against paragraph width, 150–600pt.]
Figure 3: Cross-Width Similarity Report
vertical alignment of the similarities; we just need to compare the sequences of characters, which can be done much more efficiently. Finally, stopping at the first hyphenation point avoids the complexity of deconstructing TeX's discretionaries for content introspection. Discretionaries are slightly more complicated objects used in particular for handling hyphenation and ligatures. Note that by detecting a short similarity (typically one syllable), we implicitly handle a larger one as well, if it exists. In the end, testing for similarity boils down to comparing two short sequences of characters, and finding the longest common sub-sequence.
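The collection step might be sketched as follows. The item representation, the function names, and the three-character threshold are assumptions made for the example; and for brevity the comparison uses a plain common-suffix test rather than the longest-common-subsequence computation mentioned above.

```python
# Illustrative sketch of end-of-line character collection and
# similarity testing. Item kinds ("char", "kern", "glue", "hyphen")
# and the threshold are invented for the example; they are not TeX's
# actual data structures or settings.

def line_suffix(items):
    """Collect characters from the end of a line backward, skipping
    kerns and stopping at the first glue or hyphenation point."""
    chars = []
    for kind, value in reversed(items):
        if kind == "kern":
            continue  # kerns barely affect vertical alignment
        if kind in ("glue", "hyphen"):
            break     # elastic or discretionary material: stop here
        chars.append(value)
    return "".join(reversed(chars))

def common_suffix_len(a, b):
    """Length of the longest common suffix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def end_similar(line1, line2, threshold=3):
    """Do two consecutive lines end with the same characters?"""
    return common_suffix_len(line_suffix(line1),
                             line_suffix(line2)) >= threshold
```

A symmetric forward scan would handle beginnings of lines (homeoarchies) in the same way.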
e rst paragraph in gure 1 was obtained with a value of
0 for similar demerits, that is, as T
E
X itself would do it. With a
value slightly above 2800, we obtain the middle paragraph in the
gure. e algorithm was able to get rid of the second similarity by
pushing the third occurrence of “and” to the beginning of the next
line (the one before last), at the expense of more line stretching.
is also had the eect of shiing the remainder of the paragraph
by two words.
If we push similar demerits to a value slightly above 5230, we obtain the rightmost paragraph in the figure. This time, no more similarity remains, but the algorithm had to change the layout starting as early as the fourth line to get this result. This paragraph also looks better than the second one in terms of adjacency problems: a close look at the two before-last lines in the middle paragraph reveals that they are quite loose compared to the surrounding ones. Finally, a professional typographer to whom we showed the figure finds that the third version globally improves over TeX's original choice, as there is no more hyphenation, and fewer rivers.
5 EXPERIMENTS
In order to assess both the pertinence of the question and the efficacy of our solution, we ran two orthogonal experiments: a single paragraph typeset at many different widths, and many different paragraphs typeset at a single width.
5.1 Pertinence
The purpose of the first experiment is to assess the pertinence of the question. Is the similarity problem a frequent one, and is it worth addressing? In honor of the KP algorithm's founding paper [12], we took an English version of the Grimm Brothers' "Frog King" tale, paragraph 1, and typeset it at all widths ranging from 142pt (approx. 5cm) to 569pt (approx. 20cm) with our own implementation of the KP algorithm. For each of the 427 passes, we recorded the total number of possible layouts, and the number of layouts containing similarities. The results are reported in Figure 3.
Note that because TeX's paragraph breaking algorithm is optimized (see Section 3), it does not normally compute all the possible solutions. On the other hand, our personal implementation comes in two flavours: the regular, dynamically optimized one, and a variant working on the complete solutions graphs. This is how we are able to record the total number of possible layouts.
e gure exhibits a number of interesting characteristics. First
of all, the well known inherent instability of the paragraph breaking
problem is clearly visible in the chaotic shapes of the curves. Next,
and also unsurprisingly, the two curves are very close to each
other for smaller paragraph widths. is illustrates the fact that
narrow paragraphs are notoriously dicult (sometimes impossible)
to justify, and that similarities may be unavoidable. On the other
hand, the gure also indicates that in the vast majority of the cases,
there is a lot of similarity-free layouts to choose from. All in all, this
experiment proves that the similarity problem is indeed a frequent
one, but also that geing rid of similarities is an achievable goal.
The next logical question is thus the following: given that in most cases, there exist many similarity-free layouts, will TeX choose one of those, or will it favor a layout with similarities instead? Further analysis of the results gives us the following answers. In 4% of the cases, it is impossible to get rid of similarities (meaning that all possible solutions contain some). Otherwise, when there is a choice, TeX favours a layout with similarities in 21% of the cases. A similar calculation for the second experiment (see Section 5.2) puts this figure at 26%.
21% (26% respectively) is far from negligible. In very concrete terms, it means that a professional typographer aiming at high quality typesetting would have to manually intervene on two paragraphs out of ten, in order to decide whether they can be improved or not. As a matter of fact, this very paper contains at least a dozen similarity problems, including three very bad ones… In our view, this justifies the claim that the similarity problem is not only frequent, but also worth addressing.
5.2 Ecacy
For the second experiment, we took the text of Herman Melville's Moby Dick novel, freely available from Project Gutenberg's website¹. We "cleaned up" the text by removing artefacts such as the table of contents, chapter names, and in general, all pieces of text that would lead to paragraphs of less than two lines. The resulting corpus contains 1524 paragraphs that we typeset at 284pt (approx. 10cm).
Combined with the rst experiment’s 427 runs, this amounts
to a total of 1951 individual cases, which we ran in three dierent
experimental conditions (for a total of 5853 individual passes) as
1
https://www.gutenberg.org/
DocEng ’24, August 20–23, 2024, San Jose, CA, USA Didier Verna
follows. e rst batch was typeset with T
E
X’s default seings, thus,
not handling similarities at all. e second batch was typeset by
maximizing the cost of similarity (similar demerits set to
10 000
).
Finally, the third batch was typeset by not only penalizing similari-
ties, but also disregarding adjacency problems (adjacent demerits
set to 0; see Section 3).
When we set the similar demerits to the maximum value, 48% (first experiment) to 50% (second one) of the problematic paragraphs are "corrected", in the sense that no similarities remain. In the cases where a paragraph contains multiple similarities (up to four in the Moby Dick experiment), the algorithm can sometimes only reduce their number. In such a situation, the number of "improved" paragraphs increases to 50% and 63% respectively. If we not only penalize similarity, but also disregard adjacency problems, 53% to 66% of the problematic paragraphs are corrected, and a total of 57% to 73% are globally improved. These figures clearly demonstrate that similarity can be treated in an automated fashion up to a notable proportion.
Note that completely discarding adjacent demerits is nonsensical from an aesthetic point of view. More generally, just because the algorithm finds a similarity-free layout does not mean that it will necessarily look better to a professional typographer. Further experimentation is planned to address that (see Section 7). The purpose of these two experiments is rather to evaluate the leeway we have in similarity handling by studying extreme conditions, and again, the figures clearly demonstrate that the problem can be addressed in the vast majority of the cases.
6 CONCLUSION
We have proposed an extension to the KP algorithm that is able to address similarity problems in paragraph justification. This extension is implemented in our open source platform² for typesetting experimentation and demonstration [18, 19] and could be incorporated into any TeX based or inspired system (alternative *TeX engines, Boxes and Glue³, Typeset⁴, InDesign [9], etc.). This extension is both simple and lightweight, so it is expected to have a negligible impact on performance, should it be used in production. In fact, a recent conversation with two people involved in LuaTeX confirms that paragraph breaking is not a performance bottleneck today. It is also worth noting that this extension is backward-compatible with TeX, in the sense that setting the similar demerits to 0 effectively deactivates it.
7 PERSPECTIVES
Experimentation has demonstrated that treating similarity automatically is both a worthy and achievable goal; Figure 1 illustrates that. On the other hand, getting rid of similarities implies a necessary trade-off with other aesthetic criteria [7, 14], for a result the quality of which is ultimately in the eye of the typographer. In the near future, we intend to work hand in hand with professional typographers in order to find a suitable default value for our similar demerits, and also to figure out the acceptable trade-offs with the other adjustable penalties and demerits in TeX.
² https://github.com/didierverna/etap
³ https://boxesandglue.dev/
⁴ https://github.com/bramstein/typeset/
Another area of further experimentation is to not limit ourselves to a constant (albeit adjustable) amount for similar demerits. We could for example weight homeoarchies and homeoteleutons differently, penalize similarities proportionally to their length, or even increase the cost of consecutive similarities in a non-linear fashion, so as to penalize ladders more heavily. As a matter of fact, this idea is also applicable to TeX's original demerits (adjacent and double-hyphen ones in particular), and we are already investigating in that direction.
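One way the non-linear idea could be realized is to charge each maximal run of consecutive similar line pairs quadratically in its length, so that ladders cost disproportionately more than isolated similarities. The function below is a sketch of that idea only; the quadratic law and the base value (echoing the 2800 used in Section 4) are arbitrary illustrative choices, not recommended settings.

```python
# Illustrative sketch of a non-linear ladder penalty: a maximal run
# of k consecutive similar line pairs costs base * k**2 rather than
# base * k, penalizing long ladders more heavily. Both the law and
# the base value are assumptions made for the example.

def ladder_demerits(similar_pairs, base=2800):
    """similar_pairs[i] is True when lines i and i+1 are similar;
    return the total non-linear similar demerits."""
    total = run = 0
    for similar in similar_pairs:
        if similar:
            run += 1            # extend the current ladder
        else:
            total += base * run ** 2
            run = 0             # ladder ends here
    return total + base * run ** 2  # account for a trailing ladder
```

With this law, two isolated similarities cost 2 × base, while a ladder of two consecutive ones costs 4 × base.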
ACKNOWLEDGMENTS
The author would like to thank Thomas Savary, Hans Hagen, and Mikael Sundqvist for some fruitful exchanges.
REFERENCES
[1] Michael P. Barnett. Computer Typesetting: Experiments and Prospects. M.I.T. Press, Cambridge, Massachusetts, USA, 1965.
[2] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–516, 1954.
[3] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, December 1959. ISSN 0029-599X. doi: 10.1007/BF01386390. URL https://doi.org/10.1007/BF01386390.
[4] Paul E. Justus. There is more to typesetting than setting type. IEEE Transactions on Professional Communication, PC-15(1):13–16, 1972. doi: 10.1109/TPC.1972.6591969.
[5] Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[6] Tamir Hassan and Andrew Hunter. Knuth-Plass revisited: Flexible line-breaking for automatic document layout. In Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng'15, pages 17–20, Lausanne, Switzerland, 2015. Association for Computing Machinery. ISBN 9781450333078. doi: 10.1145/2682571.2797091.
[7] Alex Holkner. Global Multiple Objective Line Breaking. PhD thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia, October 2006.
[8] Nathan Hurst, Wilmot Li, and Kim Marriott. Review of automatic document formatting. In Proceedings of the 2009 ACM Symposium on Document Engineering, DocEng'09, pages 99–108, Munich, Germany, 2009. Association for Computing Machinery. ISBN 9781605585758. doi: 10.1145/1600193.1600217.
[9] Eric A. Kenninga. Optimal line break determination. US Patent 6,510,441, January 2003.
[10] Donald E. Knuth. The TeXbook, volume A of Computers and Typesetting. Addison-Wesley, MA, USA, 1986. ISBN 0201134470.
[11] Donald E. Knuth. TeX: the Program, volume B of Computers and Typesetting. Addison-Wesley, MA, USA, 1986. ISBN 0201134373.
[12] Donald E. Knuth and Michael F. Plass. Breaking paragraphs into lines. Software: Practice and Experience, 11(11):1119–1184, 1981. doi: 10.1002/spe.4380111102.
[13] Frank Mittelbach. E-TeX: Guidelines for future TeX extensions revisited. TUGboat, 34(1), 2013.
[14] Peter Moulder and Kim Marriott. Learning how to trade off aesthetic criteria in layout. In Proceedings of the 2012 ACM Symposium on Document Engineering, DocEng'12, pages 33–36, Paris, France, 2012. Association for Computing Machinery. ISBN 9781450311168. doi: 10.1145/2361354.2361361.
[15] Stephen R. Reimer. Manuscript studies: Medieval and early modern. https://sites.ualberta.ca/~sreimer/ms-course/course/scbl-err.htm, 1998.
[16] R. P. Rich and A. G. Stone. Method for hyphenating at the end of a printed line. Communications of the ACM, 8(7):444–445, July 1965. ISSN 00010782. doi: 10.1145/364995.365002.
[17] H. T. Thành. Micro-typographic extensions to the TeX typesetting system. TUGboat, 21(4), 2000.
[18] Didier Verna. ETAP: Experimental typesetting algorithms platform. In 15th European Lisp Symposium, pages 48–52, Porto, Portugal, March 2022. ISBN 9782955747469. doi: 10.5281/zenodo.6334248.
[19] Didier Verna. Interactive and real-time typesetting for demonstration and experimentation: ETAP. In Barbara Beeton and Karl Berry, editors, TUGboat, volume 44, pages 242–248. TeX Users Group, September 2023. doi: 10.47397/tb/44-2/tb137verna-realtime.
[20] Martin Litchfield West. Textual Criticism and Editorial Technique. B. G. Teubner, Stuttgart, 1973.