ELS’21, May 03–04 2021, Online, Everywhere Antoine Hacquard and Didier Verna
Suppose however that someone is looking for some functionality,
without any prior idea or knowledge about which library may be
appropriate. Quickref, as it is right now, is impractical for such a
mining task, hence the idea of enriching it with a keyword index, a
word cloud, etc. In order to generate such things automatically, it is
necessary to process and analyze each library’s corpus, that is, the
bits of textual information providing some description of functionality (README files, docstrings, sometimes even symbol names,
etc.). Fortunately for us, Declt, the reference manual generator on
which Quickref is based, makes it very easy to access the corpuses
in question. The purpose of this paper is to describe the natural
language processing pipeline that we are currently building into
Quickref to analyze the extracted corpuses, and ultimately provide
library access by functionality.
Given the universal availability of very efficient internet search
engines these days, one may wonder whether an indexing project
specific to Quickref is really needed or pertinent. The following
remarks answer that question.
First of all, a general search engine doesn't know whether any
given library is available in Quicklisp. On the other hand, a
local index will necessarily point to readily-available libraries only.
Next, and as opposed to search engines considering plenty of, and
indiscriminate information sources, our indexing process is based
on each library’s documentation only. Therefore, it will have a
natural tendency to favor well-documented ones, which can be an
important factor when choosing which tool to use in your own
project.
Finally, and beyond providing new kinds of indexes, other appli-
cations of this project could be envisioned later on, such as topic
analysis, distribution, and visualization (a topography of the centers
of interest in the Lisp community, of sorts).
1.4 Pipeline Overview
Figure 1 depicts the pipeline used to process and analyze the cor-
puses extracted from each library by Declt.
(1) Each corpus is first tokenized, that is, split into chunks which
usually (but not necessarily) correspond to words. The tokens
are then tagged, meaning that they are associated with their
syntactic class (noun, verb, etc.). After this stage, we are
able to filter specific token classes (e.g. retain only nouns,
verbs, etc.).
(2) Next, the retained tokens are stemmed, meaning that their
lexical root is extracted and used to attempt matching with
a canonical form found in a dictionary. This process is called
lemmatization. After this stage, only the canonicalized known
lemmas (i.e., those found in said dictionary) are retained.
(3) A TF-IDF (Term Frequency / Inverse Document Frequency)
value is computed for every such lemma. This value is a
statistical indication of how relevant each lemma is to the
corresponding library. Only the most pertinent ones are
kept around (the exact number of such retained lemmas may
vary).
(4) Finally, the (possibly intersecting) sets of most pertinent
keywords describing each library are aggregated in order
to produce the desired output (keyword index, word cloud,
etc.).
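Stage (3) can be illustrated with a minimal TF-IDF computation. The following is a Python sketch only (Quickref's actual implementation is in Common Lisp, and the input is assumed to be already tokenized and lemmatized as per stages (1) and (2)):

```python
import math

def tf_idf(documents):
    """Compute the TF-IDF score of every lemma in every document.

    `documents` is a list of lemma lists, one per library corpus.
    TF is a lemma's relative frequency within one document; IDF
    penalizes lemmas that appear in many documents.
    """
    n = len(documents)
    scores = []
    for doc in documents:
        doc_scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            df = sum(1 for d in documents if term in d)  # document frequency
            doc_scores[term] = tf * math.log(n / df)
        scores.append(doc_scores)
    return scores

corpora = [["test", "parser", "library"],
           ["parser", "grammar", "library"],
           ["sound", "synthesis", "library"]]
scores = tf_idf(corpora)
# "library" occurs in all three corpora, so its IDF (hence TF-IDF) is 0;
# "sound" is specific to the third corpus and scores among the highest there.
```

A lemma appearing in every corpus thus contributes nothing, which is exactly the behavior that makes frequent-but-ubiquitous words drop out at this stage.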
It is worth mentioning right away that in this pipeline, two
out of four blocks (the first two) are pre-processing steps, devoted
to sanitizing the corpuses, while only stages three and four actually
perform the job of information processing. The importance of
pre-processing in this pipeline is due to TF-IDF working on
syntactic tokens only, without any semantic information. For example,
without pre-processing, tokens such as “test”, “tests”, and “testing”
would be treated independently, as if they meant different things.
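The effect can be demonstrated with a toy frequency count in Python (the suffix-stripping rule here is a hypothetical stand-in for a real stemmer):

```python
from collections import Counter

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strip two common suffixes.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["test", "tests", "testing", "parser"]
raw = Counter(tokens)  # three unrelated counts of 1 for the "test*" variants
merged = Counter(crude_stem(t) for t in tokens)
# merged["test"] == 3: the three variants now count as a single lemma
```

Without this merging step, TF-IDF would dilute the weight of a concept across its surface variants.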
At the time of this writing, the first three blocks in this pipeline
are fully operational. Keyword aggregation, on the other hand,
is a difficult problem, and the aggregator block is still subject to
experimentation. Also, note that we intend, at a later time, to release
the code of each block as independent, open-source libraries.
The remainder of this paper is organized as follows. Sections 2
to 5 provide a more in-depth description and discussion of the
tokenizer / PoS-Tagger, stemmer / lemmatizer, and TF-IDF blocks
respectively. Section 6 describes the challenges posed by the keyword
index generation problem, the experiments already conducted,
and some possible ideas for further experimentation.
2 POS-TAGGING
PoS-Tagging (for “Part-of-Speech” tagging) is a technique for
determining the syntactic class of words, that is, whether they are
common nouns, verbs, articles, etc. The syntactic classes of words
may be important information for performing semantic analysis of a
corpus, for different reasons. For example, some categories of words,
like determiners, convey very little or no useful meaning at all, so
we want to filter them out early, rather than carrying them around
until the TF-IDF block makes the same decision (although for a
different reason: they appear frequently, but everywhere). Also, with
the aim of generating a keyword index, it may be interesting to
experiment with different sets of retained information, such as only
nouns, nouns and verbs, etc.
2.1 Implementation
There are many ways to implement a PoS-Tagger, notably with
HMMs (Hidden Markov Models), unsupervised learning, or machine
learning [9]. In the Common Lisp ecosystem, we are aware
of one PoS-Tagger library, namely “Tagger” [5], written by Xerox
in 1990, which uses HMMs.
HMMs are statistical Markov models used to learn an unknown
Markov process with hidden states, by observing another, known
process that depends on it. HMMs are widely used in
PoS-Tagging to disambiguate syntactic classication. The biggest
problem of PoS-Tagging is that a word can have several syntactic
classes associated with it, depending on the context. For example,
the word “can” may be either a verb, or a noun (as in “soda can”).
Using HMMs, a PoS-Tagger first learns the probability of a certain
sequence of syntactic classes occurring. Then, it disambiguates
unknown words by using the syntactic class sequence with the
highest probability.
Suppose for example that after an article such as “the”, the class
probabilities for the next word are 40% noun, 30% adjective, and
20% number. When seeing “The can”, a PoS-Tagger will thus
correctly classify “can” as a noun.
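This disambiguation step can be sketched as follows (a Python toy, with made-up probabilities mirroring the example above; a real HMM tagger would combine such transition probabilities with per-word emission probabilities and decode whole sentences, e.g. with the Viterbi algorithm):

```python
# Transition probabilities P(class | previous class), as in the example:
# after an article, a noun is more likely than a verb.
transitions = {"article": {"noun": 0.40, "adjective": 0.30,
                           "number": 0.20, "verb": 0.10}}

# Ambiguous words and their possible syntactic classes.
lexicon = {"the": ["article"], "can": ["verb", "noun"]}

def tag(sentence):
    """Greedily pick, for each word, its most probable class given
    the class chosen for the previous word (a toy version of HMM decoding)."""
    tags, prev = [], None
    for word in sentence:
        candidates = lexicon[word.lower()]
        if prev is None or len(candidates) == 1:
            best = candidates[0]
        else:
            best = max(candidates, key=lambda c: transitions[prev].get(c, 0.0))
        tags.append(best)
        prev = best
    return tags

print(tag(["The", "can"]))  # ['article', 'noun']: "can" resolved as a noun
```

Since P(noun | article) = 0.40 dominates P(verb | article) = 0.10, the ambiguous “can” is resolved to a noun, as in the worked example above.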