A Corpus Processing and Analysis Pipeline for Quickref
Antoine Hacquard
EPITA
Research and Development Laboratory
Le Kremlin-Bicêtre, France
antoine.hacquard@lrde.epita.fr
Didier Verna
EPITA
Research and Development Laboratory
Le Kremlin-Bicêtre, France
didier@lrde.epita.fr
ABSTRACT
Quicklisp is a library manager working with your existing Common Lisp implementation to download and install around 2,000 libraries from a central archive. Quickref, an application itself written in Common Lisp, automatically generates, by introspection, technical documentation for every library in Quicklisp, and produces a website for this documentation.
In this paper, we present a corpus processing and analysis pipeline for Quickref. This pipeline consists of a set of natural language processing blocks allowing us to analyze Quicklisp libraries, based on natural language content sources such as README files, docstrings, or symbol names. The ultimate purpose of this pipeline is the generation of a keyword index for Quickref, although other applications such as word clouds or topic analysis are also envisioned.
CCS CONCEPTS
• Information systems → Information extraction; Retrieval effectiveness; Presentation of retrieval results; • Software and its engineering → Software libraries and repositories.
KEYWORDS
Natural Language Processing, Indexing, Documentation
ACM Reference Format:
Antoine Hacquard and Didier Verna. 2021. A Corpus Processing and Analysis Pipeline for Quickref. In Proceedings of the 14th European Lisp Symposium (ELS’21). ACM, New York, NY, USA, 9 pages. https://doi.org/10.5281/zenodo.4714443
1 INTRODUCTION
Common Lisp [21] is a dialect of the Lisp family of programming languages. It was standardized in 1994 by the American National Standards Institute. It is an industrial-strength, multi-paradigm language. Languages in the Lisp family are among the very few to be homoiconic [8, 13], a property through which both introspection and intercession are achieved in a relatively homogeneous and simple way. The dynamic and highly introspective nature of Lisp makes it straightforward to extract information about the program structure and components, for example, for documentation purposes. Additionally, Common Lisp lets the programmer attach
so-called “docstrings” (documentation strings) to almost all software components: variables, functions, classes, etc. When available, docstrings are a valuable source of information that can also be extracted very easily from an existing program.
1.1 The Paradox of Choice
In a somewhat paradoxical way, the technical strengths of the language bring drawbacks to its community of programmers [20, 24]. Lisp usually makes it so easy to “hack” things away that every Lisper ends up developing his or her own solution, inevitably leading to a paradox of choice. The result is a plethora of solutions for every single problem that every single programmer faces. Most of the time, these solutions work, but they are either half-baked or targeted to the author’s specific needs and thus not general enough. Furthermore, it is difficult to assess their quality, and they are usually not (well) documented.
In this context, an important tool, community-wise, is Quicklisp. Quicklisp is both a central repository for Common Lisp libraries (there are currently around 2,000 of them) and a programmatic interface for it. With Quicklisp, downloading, installing, compiling, and loading a specific package on your machine (dependencies included) essentially becomes a one-liner. What Quicklisp doesn’t solve, however, is the documentation problem.
1.2 Quickref
Quickref [22, 23] is a global documentation project for the Common Lisp ecosystem. It generates reference manuals for libraries available in Quicklisp automatically. Quickref is non-intrusive, in the sense that software developers do not have anything to do to get their libraries documented by the system: mere availability in Quicklisp is the only requirement.
Quickref works by introspecting libraries, and generating corresponding documentation in Texinfo format. The Texinfo files may in turn be converted into human-readable documentation, for example in PDF or HTML. Quickref may be used to create a local website documenting your current, partial, working environment, but it is also used in production, to maintain a global public website of technical reference manuals for all Quicklisp libraries. The site is kept in sync with Quicklisp.
1.3 Library Access
In order to provide access to the two thousand or so reference manuals available on the website, Quickref provides two indexes: a library index and an author index. The former is most likely to be used when looking for a particular library, while the latter is probably only useful for people wanting to check out the generated documentation for their own work.
Suppose however that someone is looking for some functionality, without any prior idea or knowledge about which library may be appropriate. Quickref, as it is right now, is impractical for such a mining task, hence the idea of enriching it with a keyword index, a word cloud, etc. In order to generate such things automatically, it is necessary to process and analyze each library’s corpus, that is, the bits of textual information providing some description of functionality (README files, docstrings, sometimes even symbol names, etc.). Fortunately for us, Declt, the reference manual generator on which Quickref is based, makes it very easy to access the corpuses in question. The purpose of this paper is to describe the natural language processing pipeline that we are currently building into Quickref to analyze the extracted corpuses, and ultimately provide library access by functionality.
Given the universal availability of very efficient internet search engines these days, one may wonder whether an indexing project specific to Quickref is really needed or pertinent. The following remarks answer that question.
First of all, a general search engine doesn’t know about a given library’s availability in Quicklisp. On the other hand, a local index will necessarily point to readily-available libraries only. Next, and as opposed to search engines, which consider plenty of indiscriminate information sources, our indexing process is based on each library’s documentation only. Therefore, it will have a natural tendency to favor well-documented ones, which can be an important factor when choosing which tool to use in your own project.
Finally, and beyond providing new kinds of indexes, other applications of this project could be envisioned later on, such as topic analysis, distribution, and visualization (a topography of the centers of interest in the Lisp community, of sorts).
1.4 Pipeline Overview
Figure 1 depicts the pipeline used to process and analyze the corpuses extracted from each library by Declt (a rough sketch of how these stages fit together in code is given right after the list).
(1) Each corpus is first tokenized, that is, split into chunks which usually (but not necessarily) correspond to words. The tokens are then tagged, meaning that they are associated with their syntactical class (noun, verb, etc.). After this stage, we are able to filter specific token classes (e.g. retain only nouns, verbs, etc.).
(2) Next, the retained tokens are stemmed, meaning that their lexical root is extracted, and used to attempt matching with a canonical form found in a dictionary. This process is called lemmatization. After this stage, only the canonicalized known lemmas (i.e., found in said dictionary) are retained.
(3) A TF-IDF (Term Frequency / Inverse Document Frequency) value is computed for every such lemma. This value is a statistical indication of how relevant each lemma is to the corresponding library. Only the most pertinent ones are kept around (the exact number of such retained lemmas may vary).
(4) Finally, the (possibly intersecting) sets of most pertinent keywords describing each library are aggregated in order to produce the desired output (keyword index, word cloud, etc.).
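As a minimal Common Lisp sketch of how these four stages could be chained, the following composition may help fix ideas. The stage functions it calls (tokenize-and-tag, filter-classes, lemmatize, tf-idf-scores, and aggregate) are hypothetical names standing for the blocks described above, not the actual Quickref code.

;; Sketch of the pipeline composition. CORPORA is a list of strings,
;; one per library.
(defun process-corpora (corpora &key (classes '(:noun)) (keywords-per-library 10))
  (let* (;; Stages 1 and 2: per-library pre-processing.
         (lemmas
           (mapcar (lambda (corpus)
                     (lemmatize (filter-classes (tokenize-and-tag corpus) classes)))
                   corpora))
         ;; Stage 3: TF-IDF needs both per-library and collection-wide
         ;; frequencies, so it is computed across all libraries at once.
         (scores (tf-idf-scores lemmas))
         ;; Keep only the most pertinent lemmas of each library.
         (keywords
           (mapcar (lambda (scored)
                     (mapcar #'car
                             (subseq (sort (copy-list scored) #'> :key #'cdr)
                                     0 (min keywords-per-library (length scored)))))
                   scores)))
    ;; Stage 4: merge the (possibly intersecting) per-library keyword sets.
    (aggregate keywords)))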
It is worth mentioning right away that in this pipeline, two out of four blocks (the first two) are pre-processing steps, devoted to sanitizing the corpuses, while only stages three and four actually perform the job of information processing. The importance of pre-processing in this pipeline is due to TF-IDF working on syntactic tokens only, without any semantic information. For example, without pre-processing, tokens such as “test”, “tests”, and “testing” would be treated independently, as if they meant different things.
At the time of this writing, the first three blocks in this pipeline are fully operational. Keyword aggregation, on the other hand, is a difficult problem, and the aggregator block is still subject to experimentation. Also, note that we intend, at a later time, to release the code of each block as independent, open-source libraries.
The remainder of this paper is organized as follows. Sections 2 to 5 provide a more in-depth description and discussion of the tokenizer / PoS-Tagger, stemmer, lemmatizer, and TF-IDF blocks respectively. Section 6 describes the challenges posed by the keyword index generation problem, the experiments already conducted, and some possible ideas for further experimentation.
2 POS-TAGGING
PoS-Tagging (for “Part-of-Speech” tagging) is a technique for determining the syntactic class of words, that is, whether they are common nouns, verbs, articles, etc. The syntactic classes of words may be important information for performing semantic analysis of a corpus, for different reasons. For example, some categories of words, like determiners, convey very little or no useful meaning at all, so we want to filter them out early, rather than carrying them around until the TF-IDF block makes the same decision (although for a different reason: they appear frequently, but everywhere). Also, in the aim of generating a keyword index, it may be interesting to experiment with different sets of retained information, such as only nouns, nouns and verbs, etc.
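The filtering step itself is simple once tokens are tagged. As a minimal sketch, assuming tagged tokens are represented as (word . class) conses (an illustrative representation, not the Tagger library’s actual tag set or data structures):

;; Keep only the tokens whose syntactic class belongs to CLASSES.
(defun filter-classes (tagged-tokens classes)
  (remove-if-not (lambda (token) (member (cdr token) classes))
                 tagged-tokens))

;; Example: retain common nouns and verbs only.
;; (filter-classes '(("the" . :determiner) ("unit" . :noun) ("test" . :noun)
;;                   ("passes" . :verb))
;;                 '(:noun :verb))
;; => (("unit" . :NOUN) ("test" . :NOUN) ("passes" . :VERB))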
2.1 Implementation
There are many ways to implement a PoS-Tagger, notably with HMMs (Hidden Markov Models), unsupervised learning, or machine learning [9]. In the Common Lisp ecosystem, we are aware of one PoS-Tagger library, namely “Tagger” [5], written by Xerox in 1990, which uses HMMs.
HMMs are statistical Markov Models used to learn an unknown Markov Process with hidden states, by observing another process, known this time, and depending on it. HMMs are widely used in PoS-Tagging to disambiguate syntactic classification. The biggest problem of PoS-Tagging is that a word can have several syntactic classes associated with it, depending on the context. For example, the word “can” may be either a verb, or a noun (as in “soda can”). Using HMMs, a PoS-Tagger first learns the probability of a certain sequence of syntactic classes occurring. Then, it disambiguates unknown words by using the syntactic class sequence with the highest probability.
Suppose for example that after an article such as “the”, the class probabilities for the next word are 40% noun, 30% adjective, and 20% number. When seeing “The can”, a PoS-Tagger will thus correctly classify “can” as a noun.
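The following toy sketch illustrates this kind of one-word disambiguation. The probabilities are made up for the example and do not come from the Tagger library’s actual model; a real tagger also scores whole tag sequences (e.g. with the Viterbi algorithm) rather than a single word.

;; P(class | previous class) and P(word | class), toy values only.
(defparameter *transitions*
  '((:determiner . ((:noun . 0.4) (:adjective . 0.3) (:verb . 0.02)))))

(defparameter *emissions*
  '(("can" . ((:noun . 0.002) (:verb . 0.01)))))

;; Pick the class maximizing P(class | previous) * P(word | class).
(defun most-likely-class (word previous-class)
  (let ((best nil) (best-score 0))
    (dolist (entry (cdr (assoc previous-class *transitions*)) best)
      (let* ((class (car entry))
             (emission (cdr (assoc class
                                   (cdr (assoc word *emissions*
                                               :test #'string=)))))
             (score (* (cdr entry) (or emission 0))))
        (when (> score best-score)
          (setf best class best-score score))))))

;; (most-likely-class "can" :determiner) => :NOUN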
Figure 1: Keyword index generation pipeline. Each library corpus flows through the Tokenizer / POS-Tagger (syntactic filter), the Stemmer / Lemmatizer (known lemmas), the TF-IDF block (pertinence filter), and finally the Aggregator, which produces the keyword index, word cloud, etc.
Because HMM-based PoS-Tagging needs a word’s surrounding
context in order to decide on its syntactical class, it must appear very
early in the pipeline, namely, before such contextual information is
removed. On the other hand, the input of the tagger needs to have
been tokenized already. Therefore, the tokenization and tagging
steps are tightly coupled, which is why they appear as a single
block in our pipeline.
The aforementioned Tagger library happens to offer a powerful and highly customizable tokenizer, and a PoS-Tagger linked with it. The tokenizer uses an automaton to parse sentences, and can be customized with rational expressions. Our two biggest customizations on the tokenizer were to accept dashes in tokens rather than considering them as separators (otherwise, words like “command-line” would have been split), and to add a rational expression to recognize URLs as unique tokens (many URLs exist in our corpuses, even in plain text documentation).
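To illustrate the intent of these two tweaks (though not the Tagger library’s actual customization interface, which works through its automaton and rational expressions), here is a rough equivalent based on the cl-ppcre regular expression library:

(ql:quickload :cl-ppcre)

;; Rough pattern treating a whole URL as a single token.
(defparameter *url-regex* "https?://[^\\s]+")

;; Split TEXT into tokens, keeping URLs whole and dashes inside words.
(defun rough-tokenize (text)
  (let ((urls (cl-ppcre:all-matches-as-strings *url-regex* text))
        (rest (cl-ppcre:regex-replace-all *url-regex* text " ")))
    ;; [A-Za-z][A-Za-z-]* keeps words like command-line in one piece.
    (append urls (cl-ppcre:all-matches-as-strings "[A-Za-z][A-Za-z-]*" rest))))

;; (rough-tokenize "See https://quickref.common-lisp.net for command-line usage.")
;; => ("https://quickref.common-lisp.net" "See" "for" "command-line" "usage")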
The Tagger library poses a problem however: it accepts ASCII characters only. In the majority of cases, this is not so much of a problem because the natural language in use is almost exclusively English, and a purely ASCII text encoded in Unicode remains readable as-is. On the other hand, some libraries do have special characters in their README files, breaking the tagger. A good example of this is the April [18] library, which compiles a subset of the APL language. Many special characters in there are APL tokens, which are not ASCII.
In order to solve this problem, we pre-process all of our documents with “Free Recode”, an open source tool to transliterate files between many encodings. Non-ASCII characters are replaced with question marks. This side effect actually has little or no impact on our pipeline at all, because our current corpus contains only English documentation. Most of it is already plain ASCII, and the few non-ASCII characters we found were either fancy “prettification” of README files (e.g. smileys), or in code snippets.
2.2 Tests and Results
In order to get early feedback on the behavior of this pre-processing stage, we ran our complete pipeline in different PoS-Tagger modes. Sample results are presented in Table 1. That table displays the 10 keywords appearing the most frequently after the TF-IDF block (all libraries included). The numbers in parentheses are the number of libraries associated with each keyword. In all cases, the Tagger library’s tokenizer is used. When tagging is turned off, there is no filtering on the syntactic classes of the tokens. Otherwise, the table presents results when only common nouns, or a combination of common nouns and verbs, are retained.
All tokens          Only nouns         Nouns and Verbs
lisp (110)          library (175)      test (110)
test (63)           file (150)         file (108)
message (51)        function (138)     license (107)
common-lisp (51)    license (133)      library (107)
name (49)           value (117)        function (92)
file (48)           document (117)     name (81)
value (47)          package (114)      value (81)
stream (46)         name (114)         package (78)
function (46)       test (102)         stream (69)
server (45)         project (101)      load (69)
Table 1: Top 10 keywords with different syntactic filters
We observe that even without tagging, we don’t see “noisy” words such as articles appearing in the top 10 list. That is because at the end of the pipeline, the TF-IDF pass will detect that such words, being frequent basically everywhere, are in fact not specific to any library in particular. On the other hand, the PoS-Tagger would help filter those words out earlier in the pipeline. It is also apparent that the PoS-Tagger helps filter out uninteresting tokens such as “lisp” or “common-lisp”. Indeed, these end up being filtered out as either proper nouns (as in “Lisp is a . . . ”), or adjectives (as in “a Lisp library”).
Whether to keep verbs around, or only common nouns, remains an open question. Verbs may contain useful information for describing what a library does. For example, it is likely that a library for unit testing will make frequent use of the word “test” both as a noun, and as a verb. If we keep both around, the final weight of “test” as a unique lemma will increase (which is a good thing in that particular case, and is in fact visible in Table 1). This will also happen every time a verb and a noun are slightly different, but are lemmatized identically. On the other hand, many verbs are also uninteresting (“be”, “get”, “come”, etc.), and it is impossible to know in advance whether their distribution across all libraries would be such that the TF-IDF block would filter them out. Finally, there are also problematic cases in which a verb and a noun convey different meanings, which would hinder the accuracy of our results. One possible solution to this problem would be to tag nouns and verbs in order to keep them as separate entities, but as mentioned before, there are also cases where keeping them separate is undesirable.
3 STEMMING
Stemming is the process of reducing a word to its root, or canonical form in the linguistic sense, notably by removing prefixes or suffixes.
No stemmer    Porter     Snowball    Lancaster
node          node       node        nam
server        elem       elem        parsable
test          src        src         node
stream        parse      server      src
template      stream     parse       byte
event         server     stream      stream
trivial       trivial    see         trivia
fix           byte       trivial     el
connection    test       test        fix
image         fix        byte        test
Table 2: Stemmer-dependent results for the final index
Although stemming does not constitute a block in our pipeline per
se, it still is an important part of the process, for reasons that will
become apparent in the next section.
Because a stemmer removes everything but the linguistic root
of a word, the resulting “stem” may not be a complete word at all.
This is a potential problem for us, because in the end, we want an
index composed of actually existing words, so the stems themselves
can’t always be used directly.
3.1 Implementation
Many stemming algorithms exist, and they are usually quick and straightforward to implement. The two most popular approaches are based either on rule systems, or on training of stochastic algorithms [7]. The rule-based approach offers a better trade-off between simplicity of implementation and quality of the produced stems, so this is the approach we favor.
Figure 2 illustrates a typical use-case of a rule-based stemmer. There are usually two categories of rules: transformation rules and deletion rules. A transformation rule transforms a prefix (respectively, a suffix) into a simpler version. A deletion rule deletes the prefix (respectively, the suffix).
Three notable suffix stemmers exist in the literature: Porter [15], Snowball [16] (a.k.a. Porter 2), and Lancaster [14]. These stemmers are well suited to processing English, as most of the word variations occur at the end of words in this language. We implemented the three of them in Common Lisp, and we used NLTK [3] as a reference point for debugging our implementations. NLTK is the best known, and de facto standard, Python library for NLP (Natural Language Processing), and incorporates a large number of stemmers. Note that we are aware of only one pre-existing Common Lisp implementation of a stemmer [6], a Porter stemmer more specifically. We still decided to write our own because it is rather straightforward, and also because the NLTK implementation, which we want to follow, sometimes departs from the original specification in ways that would have been difficult to implement in the existing Common Lisp implementation, which is not very flexible.
3.2 Tests and Results
In order to get early feedback on the behavior of stemming, we ran our complete pipeline with each of the three stemmers, and also without stemming at all, that is, using the output of the PoS-Tagger directly. Sample results are presented in Table 2. We notice a global improvement of the final index when stemming is used. Indeed, interesting words (such as “parse”) are brought up, while less interesting ones (such as “fix”) are brought down. We also notice that the results with the Lancaster stemmer are not so good: many final words are in fact not actual words. This is due to the fact that Lancaster is a “strong” stemmer: it has a tendency to over-stem words, which leads to the same root for words and typos. The Snowball stemmer is considered to give the best results, as it is the only one that manages to bring “fix” down out of the first ten words.
4 LEMMATIZATION
Besides stemming, the other classical approach to word normalization in the literature is lemmatization, which consists in using the dictionary form of a word as its canonical representation (instead of a stem). The main advantage of this approach is that in the aim of building a word index, the output of a lemmatizer can be used directly, as opposed to that of a stemmer which requires reconstructing a word afterwards.
4.1 Implementation
Lemmatization can be implemented in many different ways. Approaches range from rule-based systems (similar to stemmers, but with more complicated rules), to dictionary look-up, machine learning, etc. As our bibliographical research didn’t reveal anything satisfactory in terms of Lisp implementation of a lemmatizer (either not in Quicklisp or part of a larger library), we decided to implement our own. The approach we chose is that of dictionary look-up, as described in [10]; a solution both elegant and easy to implement. In short, a word is compared with all words in a dictionary of “lemmas”, and the closest one (according to a so-called “edit distance”) is chosen as its canonical form. A pre-processing step consisting of stemming the word before measuring its edit distance is discussed in the paper, and shown to give better results (hence the importance of stemming anyway). The Common Lisp library mk-string-metrics offers a set of built-in edit distances. We use this library to implement our lemmatizer.
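The core of this look-up can be sketched as follows, assuming a snowball-stem function standing for our stemmer, and assuming the Levenshtein distance is available under the name levenshtein in the mk-string-metrics package:

;; Return the lemma in DICTIONARY (a list of strings) closest to the
;; stem of WORD, according to the Levenshtein edit distance.
(defun lemmatize-word (word dictionary)
  (let ((stem (snowball-stem word))
        (best nil)
        (best-distance most-positive-fixnum))
    (dolist (lemma dictionary best)
      (let ((distance (mk-string-metrics:levenshtein stem lemma)))
        (when (< distance best-distance)
          (setf best lemma best-distance distance))))))

;; (lemmatize-word "testing" '("test" "file" "value")) stems "testing"
;; to "test" and thus matches the lemma "test" with distance 0.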
We conducted a set of experiments in order to decide on the best combination of stemming algorithms (among the three described in the previous section), edit distances (we chose to only test the five available in mk-string-metrics, but there are plenty of others in the literature [12], [4], [2], [19]), and dictionaries. The following sections report on those experiments.
4.2 Stemmer / Edit Distance Selection
In order to decide on which stemming algorithm and which edit distance to use, we tested the possible combinations and counted the number of correct lemmas generated by the lemmatizer. The ground truth (i.e., the correct lemmas for each word) was simply found on the internet, where a lot of resources related to lemmatization exist for verifying the correctness of an implementation [1].
Table 3 shows the obtained results. These results confirm one of the claims of [10], namely that the use of a stemmer has a huge impact on the quality of the results. The other noticeable thing is that the Lancaster stemmer performs quite poorly. This, again, can be explained by the fact that Lancaster tends to produce very
Figure 2: Rule-based stemming. The word “computational” first becomes “computate” through a transformation rule, then the stem “comput” through a deletion rule.
Stemmer      Jaccard    Jaro-Winkler    Damerau-Levenshtein    Levenshtein    Overlap
None         428        659             655                    656            25
Porter       840        934             970                    970            132
Snowball     842        934             970                    970            132
Lancaster    483        612             590                    591            32
Table 3: Number of correct lemmas on a list of 1226 words
16,000 lemmas    30,000 lemmas    33,249 lemmas
clause (54)      clause (52)      library (167)
client (40)      server (38)      file (145)
server (39)      client (35)      function (134)
project (37)     hotel (34)       license (124)
value (37)       project (34)     document (113)
message (35)     value (33)       value (110)
node (34)        node (31)        name (105)
user (33)        message (29)     package (103)
test (30)        begin (29)       test (100)
begin (30)       aside (27)       project (96)
Table 4: Dictionary-dependent results for the final index
short stems, which skews the edit distance computation. Finally, we can see that the best results are obtained with the Porter or Snowball stemmers, and with the Levenshtein or Damerau-Levenshtein edit distances. As Snowball is an improvement over Porter, and Damerau-Levenshtein over Levenshtein, it is only logical that these four combinations have approximately the same results. At that point, we decided to retain the Snowball stemmer, as it also performed more efficiently, time-wise, and the Levenshtein distance, because it is slightly faster than the Damerau-Levenshtein one.
4.3 Dictionary Selection
For dictionary selection, we started by experimenting with two dictionaries of 16,000 and 30,000 lemmas respectively, found on the internet (unfortunately, we lost track of the source of these dictionaries in the process, but we will publish them later, along with the code).
Sample results are presented in Table 4 (the third dictionary / column will be described in a few paragraphs). The indexes generated with those dictionaries have a big flaw: they contain words that are not in the base corpus. The most obvious example of this is the word “hotel” occurring at the fifth position in the second index. These words appear somewhat “magically” due to a conjunction of two events: they exist in the dictionary but not in the corpus, and we’re trying to lemmatize a word that is (obviously) in the corpus, but not in the dictionary. Because every word in the corpus needs to be matched to a word in the dictionary, such words end up being lemmatized weirdly.
More specifically, lemmas are chosen by optimizing the edit distance (minimizing or maximizing it, depending on the actual distance in use), which is a continuous measure. This means that while a lemma will always be found, the edit distance may still be bad. In other words, there are times when even the best solution is a bad one.
4.3.1 White-Listing. A natural solution to this problem is to use the dictionary as some sort of “white-list”, by imposing a threshold on the computed edit distance. Whenever a normalized edit distance is found to be lower than the selected threshold, the confidence in the lemmatization process is considered too low, and the word is discarded from the subsequent TF-IDF statistic. Using such a threshold is a convenient way to make a distinction between words which do have a lemma in the dictionary, and words which don’t (hence, words which we don’t want to keep around). After some experimentation (mostly, looking at the results), we decided that a threshold of 0.8 is appropriate for making a decision.
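As a minimal sketch of this white-listing step, assuming mk-string-metrics exports a normalized Levenshtein measure under the name norm-levenshtein, returning 1 for an exact match and lower values for more distant strings:

(defparameter *confidence-threshold* 0.8)

;; Return LEMMA if the match with WORD is confident enough, else NIL
;; (in which case the word is dropped from the TF-IDF statistic).
(defun confident-lemma (word lemma)
  (when (>= (mk-string-metrics:norm-levenshtein word lemma)
            *confidence-threshold*)
    lemma))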
4.3.2 Custom Dictionary. An even better solution to this problem would be to make sure that the dictionary we use does not contain words absent from our corpuses in the first place. This leads to the idea of generating a custom dictionary, from the lemmatization of the words present in our corpuses directly. Of course, creating such a dictionary leads to a bootstrapping problem, as lemmatizing our corpus would require using the dictionary we are trying to create. Thus, we need an external lemmatizer.
Here, again, we used the one from NLTK (in fact, we also tried the lemmatizer from the Stanford NLP library [11], written in Scala,
No lemmatization      Lemmatization
library (96)          library (175)
file (82)             file (150)
function (79)         function (138)
data (74)             license (133)
value (71)            value (117)
license (70)          document (117)
documentation (67)    package (114)
test (65)             name (114)
name (64)             test (102)
body (57)             project (101)
Table 5: Indexes obtained with and without lemmatization
out of curiosity). The generated dictionary contains approximately 33,500 lemmas. Note that in theory, we should rebuild it every time Quicklisp is updated. Whether this is a critical issue remains to be seen, however. Indeed, Quicklisp is already quite large, so the probability that an update induces a very important change in the corpus is likely to be low. On the other hand, an outdated custom dictionary may start to miss words, or contain irrelevant ones again, so it is still important to continue using the aforementioned threshold-based white-listing step.
Finally, note that with this custom dictionary, we are certain to only get lemmas existing in our corpuses, but we are not completely sure that the “technical jargon”, frequent in our community’s specialized version of English, is fully recognized by NLTK. It is difficult to evaluate the risk of an unknown technical word being weirdly lemmatized by NLTK, but we’re hopeful that if it happens at all, it remains marginal. NLTK uses the Wordnet database, which is very large, and also encodes relations between words (such as singular/plural, synonyms, etc.). A possible path to get more insight into this problem could be to evaluate the behavior of NLTK on the Common Lisp Hyperspec’s glossary (which is likely to be a quite complete reference for technical jargon), and maybe adjust the reference dictionary accordingly. Another one would be to properly recognize code pieces from markup information (see Section 8). Finally, it would be highly beneficial to keep even non-Lisp jargon around. Pseudo-words such as “cmdline”, acronyms such as “GUI”, etc., behave just like regular words in our communities, and should probably be treated as such. How to collect them into our custom dictionary is yet another problem.
The third column in Table 4 shows the top 10 keywords obtained with this custom dictionary, and confirms that the results are better. For example, irrelevant words such as “begin” or “aside” are gone, even though our dictionary contains more lemmas in total than the two other ones.
4.4 Final Tests and Results
In order to get early feedback on the behavior of this pre-processing stage, we ran our complete pipeline with and without lemmatization. Recall that without lemmatization, it is the output of the PoS-Tagger which is processed by the TF-IDF block directly. Sample results are presented in Table 5. The question of whether lemmatization is useful, and under which precise conditions, remains open. In general, lemmatization is expected to be useful because it allows us to treat variations on a single keyword together. On the other hand, a lemmatized keyword may not be the most informative, and we believe that this is exactly what happens with “documentation” and “document” in Table 5. Assuming that documentation libraries (such as Quickref and Declt) are those which bring the keyword “documentation” up, it is unfortunate that in the lemmatized case, this keyword is transformed into “document” which, in fact, is less informative. This problem suggests that using an ad hoc, carefully tuned dictionary may turn out to be important.
5 TF-IDF
Even though most of the delicate work actually happens during the
pre-processing phase, the heart of our pipeline consists in extracting
meaningful words from our corpus. By “meaningful”, we mean
words which convey the most relevant and decisive information.
For this task, we use the TF-IDF statistic [17].
TF-IDF is a measure that aims at reflecting the importance of a word in a document. It uses two parameters to operate: the frequency of the word in the document and the number of documents containing this word. The main idea behind this approach is that a word both frequent in a document and frequent in all documents is not very specific to the document in question, and thus, is not a good descriptor for this document. On the other hand, a word that is very frequent in one document, but which appears nowhere else, brings a great amount of information on the document in question, and can thus be used as a keyword representing it.
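In one standard formulation of the statistic (several variants exist in the literature), with f_{t,d} the number of occurrences of term t in document d, N the total number of documents, and n_t the number of documents containing t, the score of t for d is:

\[
  \textrm{tf-idf}(t, d)
    = \textrm{tf}(t, d) \times \textrm{idf}(t)
    = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log\frac{N}{n_t}
\]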
In the Quickref context, a “document” corresponds to the corpus of text extracted by Declt from one specific library. TF-IDF is run on each library, for which the best 𝑥 words are retained, 𝑥 being an adjustable parameter.
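A minimal sketch of this computation over the per-library lemma lists (one list of lemmas per library, as produced by the preceding stages), returning an alist of (lemma . score) pairs for each library; this is an illustration, not the actual Quickref code:

(defun tf-idf-scores (documents)
  (let ((document-frequencies (make-hash-table :test #'equal))
        (n (length documents)))
    ;; Count, for each lemma, the number of libraries it appears in.
    (dolist (document documents)
      (dolist (lemma (remove-duplicates document :test #'equal))
        (incf (gethash lemma document-frequencies 0))))
    ;; Score each lemma of each library: term frequency times inverse
    ;; document frequency. Lemmas appearing everywhere get a 0 score.
    (mapcar (lambda (document)
              (let ((size (length document))
                    (scores '()))
                (dolist (lemma (remove-duplicates document :test #'equal) scores)
                  (let ((tf (/ (count lemma document :test #'equal) size))
                        (idf (log (/ n (gethash lemma document-frequencies)))))
                    (push (cons lemma (* tf idf)) scores)))))
            documents)))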
5.1 Tests and Results
An important question, before running TF-IDF on each library’s corpus, is to decide on what we actually use as a corpus for each library. As mentioned before, README files and docstrings are a natural choice, but we can also think of using symbol names (of functions, variables, etc.) as the code is also usually explicit about what it does. We could also use ASDF’s system descriptions, when provided, but we haven’t tried that yet. More specifically, we ran our pipeline on the following corpus variations.
(1) README files only.
(2) README files, plus docstrings for all exported functionality (public API).
(3) The above, plus the symbol names for all exported functionality. The rationale is that carefully chosen API names may be indicative of the library’s purpose.
(4) The above, plus docstrings for the library’s internals (so, essentially all docstrings available).
(5) The above, plus the library’s internals symbol names (so, essentially all symbols).
Sample results are presented in Table 6. As more or less expected,
it is probably not a good idea to add the documentation of a library’s
internals in the corpuses, as the text found there most probably
deals more with the implementation of the library’s functionality,
than the functionality itself. This is visible, for example with the
appearance of keywords such as “string”, “vector”, or “class” in the
README            + Docstrings      + Symbols
Public API
library (175)     file (148)        stream (123)
file (150)        string (142)      file (117)
function (138)    value (139)       value (105)
license (133)     stream (132)      string (105)
value (117)       object (125)      name (105)
document (117)    name (118)        user (105)
package (114)     license (109)     object (95)
name (114)        function (108)    test (90)
test (102)        type (106)        type (90)
project (101)     test (104)        error (88)
+ Internals
library (175)     string (145)      string (140)
file (150)        stream (138)      file (136)
function (138)    file (137)        stream (133)
license (133)     value (118)       object (110)
value (117)       object (114)      value (109)
document (117)    name (102)        vector (101)
package (114)     class (100)       class (96)
name (114)        test (97)         test (94)
test (102)        license (96)      function (92)
project (101)     function (95)     message (91)
Table 6: Results for different corpus variations
top 10, which are likely to be related to typing information known
statically, and advertised as such.
Even when restricting ourselves to the public API’s corpus, including docstrings and / or symbol names doesn’t seem to add much to the pertinence of the results. Even public docstrings are in fact likely to contain static typing information (such as “string”), function parameter descriptions (such as “object”), etc. In fact, we ultimately have no control whatsoever over the type, quality, or quantity of documentation (if any) provided by the developers, which makes keyword extraction a very hard problem.
6 AGGREGATION
An even harder problem is how to aggregate an appropriate selection of keywords coming from different libraries (probably with some overlap) into a sufficiently descriptive and pertinent index. The difficulty here comes from the fact that we would like 100% library coverage (we want every Quicklisp library to be pointed to by at least one keyword) but we also want a reasonably sized final index. How to achieve this goal is still mostly unanswered, but we have already conducted some experiments, reported below, and we also have some ideas that yet remain to be tested.
6.1 Histogram-Based Selection
The rst approach we have experimented with is based on the cross-
library keyword appearance histogram. For each of the retained
𝑥
keywords from every library, we count the number of libraries
it appears in, and we sort them by decreasing frequency. We then
select the minimum number of keywords required to reach a 100%
coverage.
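A minimal sketch of this selection (not the actual Quickref implementation), where keyword-sets contains one list of retained keywords per library:

(defun histogram-index (keyword-sets)
  (let ((histogram (make-hash-table :test #'equal)))
    ;; Build the cross-library keyword appearance histogram.
    (loop for keywords in keyword-sets
          for library from 0
          do (dolist (keyword keywords)
               (push library (gethash keyword histogram))))
    ;; Take keywords by decreasing frequency until every library is covered.
    (let ((sorted (sort (loop for keyword being the hash-keys of histogram
                                using (hash-value libraries)
                              collect (cons keyword libraries))
                        #'> :key (lambda (entry) (length (cdr entry)))))
          (covered (make-hash-table))
          (index '()))
      (dolist (entry sorted (nreverse index))
        (when (< (hash-table-count covered) (length keyword-sets))
          (push (car entry) index)
          (dolist (library (cdr entry))
            (setf (gethash library covered) t)))))))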
This process is very simple to implement, and as a side effect, can also be the base for generating a word cloud. Indeed, if a keyword is representative of many libraries, it probably means that the corresponding topic is subject to a lot of activity. On the other hand, this approach also poses accuracy problems, and makes it hard to adjust the relevant parameters properly (this is where a choice on the value of 𝑥 becomes crucial). More specifically, because we want every library to be indexed, a trade-off is to be made between the number of keywords retained per library (hence, accuracy), and the size of the final index.
If, for example, we keep only one keyword per library, this keyword will indeed be very descriptive of that particular library, and so is less likely to apply to many of them. Consequently, it is very probable that the final index will be very large (at worst, one different keyword for every single library, that is, approximately 2,000). If, on the other hand, we keep a large number of keywords for every library, there is more likelihood that the retained keywords will overlap from one library to another, letting us reach a 100% coverage faster. However, we also risk retaining keywords that are not so relevant.
Figure 3 contains plots of the library coverage (in percentage) as a function of the final index size, for different values of 𝑥, that is, when retaining different numbers of keywords per library. These plots confirm what intuition tells. When 𝑥 = 50 for example, a 100% library coverage is reached with a final index of 200 keywords, but those keywords are likely to not be so specific. When 𝑥 = 5, on the other hand, the final index will require more than 3,000 words, all probably quite relevant.
As mentioned before, because of the inherent structure of the histogram we use, the top 1 keyword will have many libraries associated with it, the next one slightly fewer, and so on. This is important, and problematic, for two reasons.
(1) When a user searches a library for a specific use, a keyword leading to a hundred different choices is likely to be of little help.
(2) The number of libraries associated with a keyword is not the same for all keywords, which makes the final index somewhat heterogeneous.
This is why we also plan to investigate other approaches.
6.2 Other Potential Solutions
A rst alternative approach could be to sort the output of TF-IDF
not by decreasing frequency, but by a pertinence factor of some
sort (doing in some sense a meta-TF-IDF on top of the original
one), and keep enough of the top ones to reach a 100% coverage.
The pertinence factor in question could be the inverse of average
ranking of a keyword in each library’s top list, a normalized sum
of all TF-IDF values, or any other measure yet to be thought of.
Yet another possibility would be to take a completely opposite approach, and start from the fact that in order to be usable, a keyword shouldn’t point to more than, say, 𝑛 = 10 libraries. We could then arrange to select all such keywords until we reach a 100% coverage (probably adjusting 𝑛 to get a reasonably sized index in the process).
Figure 3: Library coverage vs. final index size, for different values of 𝑥
7 CONCLUSION
In this paper, we presented a natural language processing pipeline for Quickref, allowing us to analyze the corpuses extracted from Quicklisp libraries’ docstrings, README files, or symbol names. This pipeline is relatively lightweight, as it amounts to no more than 2,000 lines of code.
As part of this process, we have used an existing PoS-Tagging library, and we have developed our own native Common Lisp implementations of stemming and lemmatization algorithms. As of this writing, the code is not in production yet, but we plan to clean up, package, and release our stemmers and lemmatizer as standalone libraries in the short term.
The complete pipeline, including the histogram-based aggregation approach, is currently integrated in an experimental version of Quickref, but other solutions remain to be tested before putting the whole thing in production.
8 PERSPECTIVES
Apart from the aggregation problem, some other plans for future
work are worth mentioning.
Our pipeline is currently unaware of any markup used, notably in README files (Markdown, HTML, etc.). A number of specific tweaks are in place in order to remove markup tags from the corpus (for example, by recognizing and filtering URLs out during the tokenization phase). Also, in the case of frequently used markup formats (such as Markdown), the syntactic “noise” produced by the tags is likely to be filtered out as non-pertinent by the TF-IDF block, precisely because of its frequency in many libraries. Yet, it would be better to be aware of the markup formats in use, and use that information during the tagging process. The first advantage that comes to mind is to be able to properly differentiate natural language parts from code samples, in order to select what we want to keep around for TF-IDF (see Section 5.1). Correctly identifying markup tags could also be useful to spot inline code excerpts (or just words) embedded in natural language paragraphs, and give them special treatment (for instance, considering them as “technical jargon”; see Section 4.3.2). More generally, it could be interesting to think about the kind of information that other tags, such as bold or italics, provide. For instance, bold or italics may be an incentive to give more weight to the targeted textual part. Even more generally, potentially useful information can sometimes be extracted from pure, tagless, text. For example, it is customary to render Lisp references (function parameters, variables, etc.) in uppercase in plain docstrings.
Previously, we mentioned that we use a suffix stemmer because that is where most of the variations occur in English. We could not find any prefix stemmer in the literature, and we currently don’t know whether one would be worth looking for, even for English, and perhaps in combination with the current suffix one.
As far as dictionaries are concerned, we mentioned that the best results were obtained by creating our own custom dictionary with the help of an external lemmatizer. Another possibility would be to start from an existing dictionary, but keep track of missing words, and create only a custom addition to the original dictionary with those words. Even if we gain a little in terms of dictionary bootstrapping time, it is not very likely that this solution would get us anything more in terms of pertinence, notably because existing dictionaries are still likely to contain a lot of words that are uninteresting for us, or that actually never occur in our corpus.
One final question that could arise eventually is that of the actual language in use. Currently, we assume English (which is unlikely to pose any problem with Quicklisp), but if we ever want to handle other languages, the problem will become more complicated. In particular, our current PoS-Tagger will not be usable anymore.
REFERENCES
[1]
Lemma lists. https://lexically.net/wordsmith/support/lemma_lists.html. Accessed:
2021-03-06.
[2]
L. Bergroth, H. Hakonen, and T. Raita. A survey of longest common subsequence
algorithms. In Proceedings of the Seventh International Symposium on String
Processing and Information Retrieval. SPIRE 2000, pages 39–48, September 2000.
doi: 10.1109/SPIRE.2000.878178.
[3]
Steven Bird, Edward Loper, and Ewan Klein. Natural language processing with
python. https://www.nltk.org/book/, 2009.
[4]
William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison
of string distance metrics for name-matching tasks. In Proceedings of the 2003
International Conference on Information Integration on the Web, IIWEB’03, pages
73–78. AAAI Press, 2003. doi: 10.5555/3104278.3104293.
[5]
Doug Cutting and Jan Pedersen. The Xerox part-of-speech tagger version 1.2.
https://github.com/g000001/tagger, 1993.
[6]
Steven M. Haich. The Porter stemming algorithm, a Common Lisp implementa-
tion. https://github.com/varjagg/porter-stemmer, 2002.
[7]
Anjali Jivani. A comparative study of stemming algorithms. International Journal
on Computer Technology and Applications, 2:1930–1938, November 2011.
[8] Alan C. Kay. The Reactive Engine. PhD thesis, University of Hamburg, 1969.
[9]
Deepika Kumawat and Vinesh Jain. Pos tagging approaches: A comparison.
International Journal of Computer Applications, 118(6):32–38, May 2015. ISSN
09758887. doi: 10.5120/20752-3148.
[10]
Dimitrios P. Lyras, Kyriakos N. Sgarbas, and Nikolaos D. Fakotakis. Using the Levenshtein edit distance for automatic lemmatization: A case study for Modern Greek and English. In 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), volume 2, pages 428–435, Oct 2007. doi: 10.1109/ICTAI.2007.41.
[11]
Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-5010.
[12]
Andrew McCallum, Kedar Bellare, and Fernando Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI 2005, Jul 2012.
[13]
M. Douglas McIlroy. Macro instruction extensions of compiler languages.
Communications of the ACM, 3:214–220, April 1960. ISSN 0001-0782. doi:
10.1145/367177.367223.
[14]
Chris D. Paice. Another stemmer. ACM SIGIR Forum, 24(3):56–61, Nov 1990. ISSN
0163-5840. doi: 10.1145/101306.101310.
[15]
M. F. Porter. An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3):130–137, 1980. doi: 10.1108/eb046814.
[16] M. F. Porter and Richard Boulton. Snowball stemmer. 2001.
[17]
Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic
text retrieval. Information Processing and Management, 24(5):513–523, Jan 1988.
ISSN 03064573. doi: 10.1016/0306-4573(88)90021-0.
[18]
Andrew Sengul. April: Array programming re-imagined in lisp. https://github.
com/phantomics/april, 2019.
[19]
Syeda ShabnamHasan, Fareal Ahmed, and Rosina Surovi Khan. Approximate
string matching algorithms: A brief survey and comparison. International Journal
of Computer Applications, 120(8):26–31, Jun 2015. ISSN 09758887. doi: 10.5120/
21247-4048.
[20]
Mark Tarver. The bipolar Lisp programmer. http://marktarver.com/bipolar.html,
2007.
[21]
Ansi. American National Standard: Programming Language – Common Lisp.
ANSI X3.226:1994 (R1999), 1994.
[22]
Didier Verna. Quickref: Common Lisp reference documentation as a stress test for Texinfo. In Barbara Beeton and Karl Berry, editors, TUGboat, volume 40, pages 119–125. TeX Users Group, September 2019.
[23]
Didier Verna. Parallelizing Quickref. In 12th European Lisp Symposium, pages 89–
96, Genova, Italy, April 2019. ISBN 9782955747438. doi: 10.5281/zenodo.2632534.
[24]
Rudolf Winestock. The Lisp curse. http://winestockwebdesign.com/Essays/Lisp_
Curse.html, April 2011.