Similarity Problems in Paragraph Justification
An Extension to the Knuth-Plass Algorithm
Didier Verna
EPITA Research Laboratory
Le Kremlin-Bicêtre, France
didier@lrde.epita.fr
ABSTRACT
In high quality typography, consecutive lines beginning or ending with the same word or sequence of characters is considered a defect. We have implemented an extension to TeX's paragraph justification algorithm which handles this problem. Experimentation shows that getting rid of similarities is both worth addressing and achievable. Our extension automates the detection and avoidance of similarities while leaving the ultimate decision to the professional typographer, thanks to a new adjustable cursor. The extension is simple and lightweight, making it a useful addition to production engines.
CCS CONCEPTS
• Applied computing → Document preparation; • Theory of computation → Dynamic graph algorithms.
KEYWORDS
Paragraph Justication, Similarity Avoidance, Homeoteleutons,
Homeoarchies, T
E
X, Knuth-Plass Extension
ACM Reference Format:
Didier Verna. 2024. Similarity Problems in Paragraph Justification: An Extension to the Knuth-Plass Algorithm. In ACM Symposium on Document Engineering 2024 (DocEng '24), August 20–23, 2024, San Jose, CA, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3685650.3685666
1 INTRODUCTION
In spite of its relatively old age, Donald Knuth's TeX typesetting system [10, 11] is still considered a de facto standard when it comes to digital typography. In particular, its paragraph justification algorithm, known as the Knuth-Plass [12], established a landmark in the category of algorithms considering a paragraph as a whole rather than proceeding line by line, as earlier (greedy) algorithms used to do [1, 4, 16].
Yet, many aspects of fine typography are not directly or automatically handled by TeX. Consider for example the leftmost paragraph in Figure 1. The typesetting was done by TeX with the Latin Modern Roman font at a 10pt size, and for a paragraph width of 201pt (the figure is optically scaled down in order to fit on the page).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DocEng ’24, August 20–23, 2024, San Jose, CA, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-1169-5/24/08
https://doi.org/10.1145/3685650.3685666
Notice how three lines near the end of the paragraph end the same way, with the word "and". This is considered a defect in high quality typography, as it generates a micro-interruption: the reader's attention may be caught by the similarity, and the reader may temporarily lose focus or concentration. Such similarities may also lead the reader to accidentally skip a line or re-read the same one, even more so when the problem occurs at the beginning of the line rather than at the end of it. In fact, the field of textual criticism has identified the problem and its consequences in the very ancient context of scribal errors (e.g. missing lines in manual copies of the Bible made by monks) [20].
We have implemented an extension to the Knuth-Plass (KP) algorithm that is able to deal with that kind of defect. In this paper, we use the term similarity for the lack of an official terminology. We would also like to propose the expression character / word ladder, analogous to the more widespread expression hyphenation ladder, the meaning of which should be obvious. Accidental line skipping has been referred to with a rather awkward French expression, saut du même au même ("jump from same to same"), even in non-French literature [20]. Borrowed from rhetoric, the terms homeoarchy and homeoteleuton have come to designate beginning and end of line similarities respectively, and by extension, accidental line skipping because of them [15].
This paper is organized as follows. Section 2 mentions some related work. Section 3 provides an outline of the KP algorithm's operation, necessary to understand how our extension works. Section 4 describes our extension, and Section 5 presents some experimental results.
2 RELATED WORK
Frank Mielbach mentions the similarity problem in a survey of
existing alternative T
E
X engines and remaining issues [
13
]. No solu-
tion is proposed in this paper, as it is merely a state-of-the-art review.
Alex Holkner addresses the problem in his multiple-objective ap-
proach to line breaking [
7
]. However, the paper only seems to be
concerned with beginning-of-line similarities (called stacks). Al-
though inspired from it, the approach is not technically an extension
to T
E
X’s algorithm. e paper does not provide a precise denition
for stacks, and does not use the corresponding objective function in
the reported experimental results. Other extensions to the
KP
algo-
rithm have been proposed in the past, some micro-typographic [
17
],
some macro-typographic, for example to help with automatic docu-
ment layout [
6
]. Concerning the laer, our underlying motivations
are in fact opposite. e work in question aempts to provide ex-
ibility at the expense of quality in order to cope with situations
in which manual intervention is impossible, such as automatically
adjusting to dierent displays. We, on the other hand, are interested
[Figure: three typesettings of the opening paragraph of the Grimm Brothers' "Frog King" at increasing values of similar demerits, with the words involved in beginning- or end-of-line similarities marked.]
Figure 1: Similarity avoidance
[Figure: excerpt of a line breaking graph; each node shows the word ending one line and the word beginning the next.]
Figure 2: Line breaking graph excerpt
in helping professional (human) typographers aiming for the finest, by providing as much quality as possible automatically, and letting them focus on what strictly requires human intervention. This is more in line with the view of Hurst et al. [8].
3 THE KNUTH-PLASS IN A NUTSHELL
The KP algorithm essentially expresses the paragraph justification problem as a Single Pair Shortest Path one [3, 5]. The possible solutions for breaking a paragraph into lines are represented in the form of a graph (see Figure 2), in which every possible line break is a node, and the ability to go from one line break to the next is represented by an edge connecting two nodes. In the figure, each node advertises the corresponding end-of-line, and the beginning of the next. The problem thus boils down to finding the best route (referred to as "shortest path" in graph theory) from the top node to the bottom one.
In order to perform efficiently, the KP algorithm uses a dynamic programming [2] optimization technique which allows it to never construct the full (potentially huge) graph in memory. As it progresses through the paragraph's text, TeX maintains branches representing the best possible solutions (note the plural) so far, but also gets rid of branches which are provably going to be sub-optimal in the end.
Finding the shortest path between two nodes in a graph is usually expressed by minimizing cost rather than maximizing gain. When the KP algorithm explores the line breaking possibilities, it assigns demerits to various aspects of the solutions and eventually chooses the cheapest one. TeX's demerits are adjustable parameters, and can be classified in two categories.
Local Demerits are computed line by line, independently of the surrounding context. In particular, the so-called badness of a line increases as it needs to be stretched or shrunk, and additional penalties are applied to hyphenated lines.
Contextual Demerits are applied by comparing some aspects of two consecutive lines. In particular, hyphenation ladders are heavily penalized, as well as "adjacency" problems (two consecutive lines with very different scaling; for example a very loose line followed by a very compact one).
The KP algorithm works by minimizing the sum of local and contextual demerits across the whole paragraph.
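The overall scheme can be sketched as a shortest-path computation over break points. The following Python sketch is illustrative only: the function names and the cost model (cubed slack, measured in characters) are invented stand-ins for TeX's actual badness and demerits formulas.

```python
# Illustrative sketch of Knuth-Plass-style line breaking as a
# shortest-path / dynamic programming problem. The cost model is a
# toy stand-in; real TeX uses glue stretchability, penalties, and
# the local and contextual demerits described above.

def line_cost(words, i, j, width, is_last):
    """Toy demerits for placing words[i:j] on one line."""
    length = sum(len(w) for w in words[i:j]) + (j - i - 1)  # spaces
    if length > width:
        return float("inf")  # overfull lines are forbidden
    if is_last:
        return 0  # the last line is not justified
    return (width - length) ** 3

def break_paragraph(words, width):
    """Break optimally: minimize the sum of per-line demerits."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[j]: cheapest break of words[:j]
    back = [0] * (n + 1)             # back[j]: start index of the last line
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + line_cost(words, i, j, width, j == n)
            if cost < best[j]:
                best[j], back[j] = cost, i
    lines, j = [], n
    while j > 0:                     # follow back-pointers to recover lines
        lines.append(" ".join(words[back[j]:j]))
        j = back[j]
    return lines[::-1]
```

Note that, unlike this exhaustive double loop, the actual KP algorithm prunes provably sub-optimal branches as it goes, and keeps several "best" candidates per break point to account for contextual demerits.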
4 SIMILARITY HANDLING
In order to address similarity problems, our extension works by introducing a new kind of contextual demerits that we call similar demerits. Each time the algorithm compares two consecutive lines, it now also compares the respective beginnings and ends of line, and adds the similar demerits value to the total if similarities are encountered. Note that the end of the paragraph needs a special treatment. Most of the time, the last line is not justified, so the potential similarities would not be vertically aligned. Similar demerits may be applied in the very rare cases of complete justification though.
In order to compare consecutive lines for similarities, each node representing a potential break point in the partial graph must remember how the corresponding lines begin and end (which is apparent in Figure 2). This is done as follows. Every time the algorithm creates a new node, it collects the characters from the end of the line backward, discarding kerns, and stopping at the first glue or hyphenation point. The same process is applied (forward) to the beginning of the next line. There are several reasons for doing it this way.
First of all, kerns do not drastically change the vertical alignment of characters, so they have little impact on the reader's perception of similarity (besides, they would be identical in the middle of a similarity). On the other hand, glues, which are elastic, are much more likely to have a noticeable impact on vertical alignment. By stopping before the first one, we thus do not need to compare the
[Figure: number of paragraph breaking solutions (total, and with similarities) plotted against paragraph width, 150–600pt.]
Figure 3: Cross-Width Similarity Report
vertical alignment of the similarities; we just need to compare the sequences of characters, which can be done much more efficiently. Finally, stopping at the first hyphenation point avoids the complexity of deconstructing TeX's discretionaries for content introspection. Discretionaries are slightly more complicated objects used in particular for handling hyphenation and ligatures. Note that by detecting a short similarity (typically one syllable), we implicitly handle a larger one as well, if it exists. In the end, testing for similarity boils down to comparing two short sequences of characters, and finding the longest common sub-sequence.
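The collection step might be sketched as follows. The item representation, the function names, and the three-character threshold are assumptions made for the example; and for brevity the comparison uses a plain common-suffix test rather than the longest-common-subsequence computation mentioned above.

```python
# Illustrative sketch of end-of-line character collection and
# similarity testing. Item kinds ("char", "kern", "glue", "hyphen")
# and the threshold are invented for the example; they are not TeX's
# actual data structures or settings.

def line_suffix(items):
    """Collect characters from the end of a line backward, skipping
    kerns and stopping at the first glue or hyphenation point."""
    chars = []
    for kind, value in reversed(items):
        if kind == "kern":
            continue  # kerns barely affect vertical alignment
        if kind in ("glue", "hyphen"):
            break     # elastic or discretionary material: stop here
        chars.append(value)
    return "".join(reversed(chars))

def common_suffix_len(a, b):
    """Length of the longest common suffix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def end_similar(line1, line2, threshold=3):
    """Do two consecutive lines end with the same characters?"""
    return common_suffix_len(line_suffix(line1),
                             line_suffix(line2)) >= threshold
```

A symmetric forward scan would handle beginnings of lines (homeoarchies) in the same way.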
e rst paragraph in gure 1 was obtained with a value of
0 for similar demerits, that is, as T
E
X itself would do it. With a
value slightly above 2800, we obtain the middle paragraph in the
gure. e algorithm was able to get rid of the second similarity by
pushing the third occurrence of “and” to the beginning of the next
line (the one before last), at the expense of more line stretching.
is also had the eect of shiing the remainder of the paragraph
by two words.
If we push similar demerits to a value slightly above 5230, we obtain the rightmost paragraph in the figure. This time, no more similarity remains, but the algorithm had to change the layout starting as early as the fourth line to get this result. This paragraph also looks better than the second one in terms of adjacency problems: a close look at the two before-last lines in the middle paragraph reveals that they are quite loose compared to the surrounding ones. Finally, a professional typographer to whom we showed the figure finds that the third version globally improves over TeX's original choice, as there is no more hyphenation, and fewer rivers.
5 EXPERIMENTS
In order to assess both the pertinence of the question and the efficacy of our solution, we ran two orthogonal experiments: a single paragraph typeset at many different widths, and many different paragraphs typeset at a single width.
5.1 Pertinence
The purpose of the first experiment is to assess the pertinence of the question. Is the similarity problem a frequent one, and is it worth addressing? In honor of the KP algorithm's founding paper [12], we took an English version of the Grimm Brothers' "Frog King" tale, paragraph 1, and typeset it at all widths ranging from 142pt (approx. 5cm) to 569pt (approx. 20cm) with our own implementation of the KP algorithm. For each of the 427 passes, we recorded the total number of possible layouts, and the number of layouts containing similarities. The results are reported in Figure 3.
Note that because TeX's paragraph breaking algorithm is optimized (see Section 3), it does not normally compute all the possible solutions. On the other hand, our personal implementation comes in two flavours: the regular, dynamically optimized one, and a variant working on the complete solutions graphs. This is how we are able to record the total number of possible layouts.
e gure exhibits a number of interesting characteristics. First
of all, the well known inherent instability of the paragraph breaking
problem is clearly visible in the chaotic shapes of the curves. Next,
and also unsurprisingly, the two curves are very close to each
other for smaller paragraph widths. is illustrates the fact that
narrow paragraphs are notoriously dicult (sometimes impossible)
to justify, and that similarities may be unavoidable. On the other
hand, the gure also indicates that in the vast majority of the cases,
there is a lot of similarity-free layouts to choose from. All in all, this
experiment proves that the similarity problem is indeed a frequent
one, but also that geing rid of similarities is an achievable goal.
The next logical question is thus the following: given that in most cases, there exist many similarity-free layouts, will TeX choose one of those, or will it favor a layout with similarities instead? Further analysis of the results gives us the following answers. In 4% of the cases, it is impossible to get rid of similarities (meaning that all possible solutions contain some). Otherwise, when there is a choice, TeX favours a layout with similarities in 21% of the cases. A similar calculation for the second experiment (see Section 5.2) puts this figure at 26%.
21% (26% respectively) is far from negligible. In very concrete terms, it means that a professional typographer aiming at high quality typesetting would have to manually intervene on two paragraphs out of ten, in order to decide whether they can be improved or not. As a matter of fact, this very paper contains at least a dozen similarity problems, including three very bad ones… In our view, this justifies the claim that the similarity problem is not only frequent, but also worth addressing.
5.2 Ecacy
For the second experiment, we took the text of Herman Melville's Moby Dick novel, freely available from Project Gutenberg's website¹. We "cleaned up" the text by removing artefacts such as the table of contents, chapter names, and in general, all pieces of text that would lead to paragraphs of less than two lines. The resulting corpus contains 1524 paragraphs that we typeset at 284pt (approx. 10cm).
Combined with the rst experiment’s 427 runs, this amounts
to a total of 1951 individual cases, which we ran in three dierent
experimental conditions (for a total of 5853 individual passes) as
1
https://www.gutenberg.org/
DocEng ’24, August 20–23, 2024, San Jose, CA, USA Didier Verna
follows. e rst batch was typeset with T
E
X’s default seings, thus,
not handling similarities at all. e second batch was typeset by
maximizing the cost of similarity (similar demerits set to
10 000
).
Finally, the third batch was typeset by not only penalizing similari-
ties, but also disregarding adjacency problems (adjacent demerits
set to 0; see Section 3).
When we set the similar demerits to the maximum value, 48% (first experiment) to 50% (second one) of the problematic paragraphs are "corrected", in the sense that no similarities remain. In the cases where a paragraph contains multiple similarities (up to four in the Moby Dick experiment), the algorithm can sometimes only reduce their number. In such a situation, the number of "improved" paragraphs increases to 50% and 63% respectively. If we not only penalize similarity, but also disregard adjacency problems, 53% to 66% of the problematic paragraphs are corrected, and a total of 57% to 73% are globally improved. These figures clearly demonstrate that similarity can be treated in an automated fashion up to a notable proportion.
Note that completely discarding adjacent demerits is nonsensical from an aesthetic point of view. More generally, just because the algorithm finds a similarity-free layout does not mean that it will necessarily look better to a professional typographer. Further experimentation is planned to address that (see Section 7). The purpose of these two experiments is rather to evaluate the leeway we have in similarity handling by studying extreme conditions, and again, the figures clearly demonstrate that the problem can be addressed in the vast majority of the cases.
6 CONCLUSION
We have proposed an extension to the KP algorithm that is able to address similarity problems in paragraph justification. This extension is implemented in our open source platform² for typesetting experimentation and demonstration [18, 19] and could be incorporated into any TeX based or inspired system (alternative *TeX engines, Boxes and Glue³, Typeset⁴, InDesign [9], etc.). This extension is both simple and lightweight, so it is expected to have a negligible impact on performance, should it be used in production. In fact, a recent conversation with two people involved in LuaTeX confirms that paragraph breaking is not a performance bottleneck today. It is also worth noting that this extension is backward-compatible with TeX, in the sense that setting the similar demerits to 0 effectively deactivates it.
7 PERSPECTIVES
Experimentation has demonstrated that treating similarity automatically is both a worthy and achievable goal; Figure 1 illustrates that. On the other hand, getting rid of similarities implies a necessary trade-off with other aesthetic criteria [7, 14], for a result the quality of which is ultimately in the eye of the typographer. In the near future, we intend to work hand in hand with professional typographers in order to find a suitable default value for our similar demerits, and also to figure out the acceptable trade-offs with the other adjustable penalties and demerits in TeX.
² https://github.com/didierverna/etap
³ https://boxesandglue.dev/
⁴ https://github.com/bramstein/typeset/
Another area of further experimentation is to not limit ourselves to a constant (albeit adjustable) amount for similar demerits. We could for example weight homeoarchies and homeoteleutons differently, penalize similarities proportionally to their length, or even increase the cost of consecutive similarities in a non-linear fashion, so as to penalize ladders more heavily. As a matter of fact, this idea is also applicable to TeX's original demerits (adjacent and double-hyphen ones in particular), and we are already investigating in that direction.
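One way the non-linear idea could be realized is to charge each maximal run of consecutive similar line pairs quadratically in its length, so that ladders cost disproportionately more than isolated similarities. The function below is a sketch of that idea only; the quadratic law and the base value (echoing the 2800 used in Section 4) are arbitrary illustrative choices, not recommended settings.

```python
# Illustrative sketch of a non-linear ladder penalty: a maximal run
# of k consecutive similar line pairs costs base * k**2 rather than
# base * k, penalizing long ladders more heavily. Both the law and
# the base value are assumptions made for the example.

def ladder_demerits(similar_pairs, base=2800):
    """similar_pairs[i] is True when lines i and i+1 are similar;
    return the total non-linear similar demerits."""
    total = run = 0
    for similar in similar_pairs:
        if similar:
            run += 1            # extend the current ladder
        else:
            total += base * run ** 2
            run = 0             # ladder ends here
    return total + base * run ** 2  # account for a trailing ladder
```

With this law, two isolated similarities cost 2 × base, while a ladder of two consecutive ones costs 4 × base.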
ACKNOWLEDGMENTS
The author would like to thank Thomas Savary, Hans Hagen, and Mikael Sundqvist for some fruitful exchanges.
REFERENCES
[1] Michael P. Barnett. Computer Typesetting: Experiments and Prospects. M.I.T. Press, Cambridge, Massachusetts, USA, 1965.
[2] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–516, 1954.
[3] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, December 1959. ISSN 0029-599X. doi: 10.1007/BF01386390. URL https://doi.org/10.1007/BF01386390.
[4] Paul E. Justus. There is more to typesetting than setting type. IEEE Transactions on Professional Communication, PC-15(1):13–16, 1972. doi: 10.1109/TPC.1972.6591969.
[5] Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[6] Tamir Hassan and Andrew Hunter. Knuth-Plass revisited: Flexible line-breaking for automatic document layout. In Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng'15, pages 17–20, Lausanne, Switzerland, 2015. Association for Computing Machinery. ISBN 9781450333078. doi: 10.1145/2682571.2797091.
[7] Alex Holkner. Global Multiple Objective Line Breaking. PhD thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia, October 2006.
[8] Nathan Hurst, Wilmot Li, and Kim Marriott. Review of automatic document formatting. In Proceedings of the 2009 ACM Symposium on Document Engineering, DocEng'09, pages 99–108, Munich, Germany, 2009. Association for Computing Machinery. ISBN 9781605585758. doi: 10.1145/1600193.1600217.
[9] Eric A. Kenninga. Optimal line break determination. US Patent 6,510,441, January 2003.
[10] Donald E. Knuth. The TeXbook, volume A of Computers and Typesetting. Addison-Wesley, MA, USA, 1986. ISBN 0201134470.
[11] Donald E. Knuth. TeX: the Program, volume B of Computers and Typesetting. Addison-Wesley, MA, USA, 1986. ISBN 0201134373.
[12] Donald E. Knuth and Michael F. Plass. Breaking paragraphs into lines. Software: Practice and Experience, 11(11):1119–1184, 1981. doi: 10.1002/spe.4380111102.
[13] Frank Mittelbach. E-TeX: Guidelines for future TeX extensions revisited. TUGboat, 34(1), 2013.
[14] Peter Moulder and Kim Marriott. Learning how to trade off aesthetic criteria in layout. In Proceedings of the 2012 ACM Symposium on Document Engineering, DocEng'12, pages 33–36, Paris, France, 2012. Association for Computing Machinery. ISBN 9781450311168. doi: 10.1145/2361354.2361361.
[15] Stephen R. Reimer. Manuscript studies: Medieval and early modern. https://sites.ualberta.ca/~sreimer/ms-course/course/scbl-err.htm, 1998.
[16] R. P. Rich and A. G. Stone. Method for hyphenating at the end of a printed line. Communications of the ACM, 8(7):444–445, July 1965. ISSN 00010782. doi: 10.1145/364995.365002.
[17] H. T. Thành. Micro-typographic extensions to the TeX typesetting system. TUGboat, 21(4), 2000.
[18] Didier Verna. ETAP: Experimental typesetting algorithms platform. In 15th European Lisp Symposium, pages 48–52, Porto, Portugal, March 2022. ISBN 9782955747469. doi: 10.5281/zenodo.6334248.
[19] Didier Verna. Interactive and real-time typesetting for demonstration and experimentation: ETAP. In Barbara Beeton and Karl Berry, editors, TUGboat, volume 44, pages 242–248. TeX Users Group, September 2023. doi: 10.47397/tb/44-2/tb137verna-realtime.
[20] Martin Litchfield West. Textual Criticism and Editorial Technique. B. G. Teubner, Stuttgart, 1973.