A Corpus Processing and Analysis Pipeline for Quickref
Antoine Hacquard & Didier Verna
LRDE
EPITA Research and Development Laboratory
14th European Lisp Symposium, May 3–4 2021
A. Hacquard & D. Verna (LRDE) The Quickref NLP Pipeline ELS 2021 1 / 16
Quicklisp & Quickref
Motivation
The project:
A new keyword index for Quickref
Why not just use a modern search engine?
Favor Quicklisp availability
Natural emphasis on libraries with some documentation
Other potential applications (word cloud, statistical / topic analysis, etc.)
Pipeline Overview
[Figure: pipeline diagram. Inputs: lib1, lib2, lib3 corpora. Stages: Tokenizer, POS-Tagger, Stemmer, Lemmatizer, TF-IDF, Aggregator. Outputs: keyword index, word cloud, ... Annotations: syntactic filter, known lemmas, pertinence filter.]
Tokenizer & POS-tagger
Tokenization:
"this can can walk" = THIS | CAN | CAN | WALK
POS-tagging:
THIS (det.) | CAN (common noun) | CAN (verb) | WALK (verb)
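The tokenization step alone can be sketched as follows. This is a minimal illustration, assuming that tokens are maximal runs of ASCII letters and that case is folded to upper case as on the slide; the name `tokenize` is hypothetical and not the authors' API. POS tagging needs a trained model or a lexicon and is not sketched here.

```python
import re

def tokenize(text):
    """Split text into upper-cased alphabetic tokens (toy rule, assumed)."""
    return [token.upper() for token in re.findall(r"[A-Za-z]+", text)]

print(tokenize("this can can walk"))  # ['THIS', 'CAN', 'CAN', 'WALK']
```

Note that both occurrences of CAN come out identical; only the later POS-tagging stage tells the noun from the verb.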
Stemmer & Lemmatizer
Stemming:
argue | argued | argues | arguing = argu
Lemmatization:
argue | argued | argues | arguing = argue
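The difference between the two operations can be illustrated with a toy sketch. It assumes a crude suffix-stripping rule in place of a real stemming algorithm and a hand-written lemma table in place of a real dictionary; `toy_stem`, `toy_lemmatize`, `SUFFIXES` and `LEMMAS` are all hypothetical names chosen for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "e", "s")  # toy list, checked in this order

def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table; a real lemmatizer relies on a full dictionary.
LEMMAS = {"argued": "argue", "argues": "argue", "arguing": "argue"}

def toy_lemmatize(word):
    """Look the word up in the lemma table, defaulting to the word itself."""
    return LEMMAS.get(word, word)

words = ["argue", "argued", "argues", "arguing"]
print([toy_stem(w) for w in words])       # ['argu', 'argu', 'argu', 'argu']
print([toy_lemmatize(w) for w in words])  # ['argue', 'argue', 'argue', 'argue']
```

The stem "argu" is not a word, whereas the lemma "argue" is a dictionary form, which is why lemmatization depends on a dictionary and stemming does not.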
TF-IDF
TF('the') = 0.7; IDF('the') = 1.9;
TF-IDF('the') = TF / IDF = 0.7 / 1.9 ≈ 0.37
TF('temperature') = 1.6; IDF('temperature') = 0.3;
TF-IDF('temperature') = TF / IDF = 1.6 / 0.3 ≈ 5.33
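In the toy numbers above, the quantity labelled IDF is large for the common word and the score is a ratio, so it plays the role of a document-frequency term. The sketch below uses the more common textbook form instead, TF × log(N / DF), where N is the number of libraries and DF the number of libraries containing the term. It is an illustration under that assumption, not the authors' implementation; the function name `tf_idf` and the toy corpora are invented.

```python
import math
from collections import Counter

def tf_idf(corpora):
    """Return {library: {term: score}} using TF * log(N / DF).

    corpora maps a library name to its list of already normalised tokens.
    """
    n_docs = len(corpora)
    # Document frequency: number of libraries in which each term occurs.
    doc_freq = Counter(term for tokens in corpora.values() for term in set(tokens))
    scores = {}
    for name, tokens in corpora.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Toy corpora, invented for illustration.
corpora = {
    "lib1": ["the", "temperature", "the", "unit"],
    "lib2": ["the", "option", "test"],
    "lib3": ["the", "test", "weight"],
}
scores = tf_idf(corpora)
print(scores["lib1"])
```

With these corpora, "the" occurs in every library and scores 0.0, while "temperature" occurs only in lib1 and receives a positive score there.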
Aggregator
Out of Dictionary Words
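[Figure: "arguing" is stemmed to "argu"; the stem is compared by edit distance with each dictionary entry (argue, cat, forest), with scores 0.6, 0.9 and 0.5 shown; argmax selects "argue".]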
Out of Dictionary Words
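[Figure: "docstring" is stemmed to "docstring" (unchanged); the stem is compared by edit distance with each dictionary entry (argue, cat, forest), with scores 0.6, 0.4 and 0.7 shown; argmax selects "forest".]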
Words absent from the dictionary will match awkwardly!
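The matching step can be sketched as follows: compare a stem with every dictionary entry and keep the best-scoring one. This is a minimal sketch, assuming a Levenshtein distance normalised by the longer string's length as the similarity score; the slides do not state the exact scoring, and the names `edit_distance`, `similarity` and `closest_entry` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def closest_entry(stem, dictionary):
    """Return the dictionary entry with the highest similarity to the stem."""
    return max(dictionary, key=lambda entry: similarity(stem, entry))

dictionary = ["argue", "cat", "forest"]
print(closest_entry("argu", dictionary))       # argue
print(closest_entry("docstring", dictionary))  # some entry, however unrelated
```

Because argmax always returns an entry, a word such as "docstring" that has no counterpart in the dictionary is still mapped to something, which is the awkward behaviour the slide points out.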
Custom Dictionary Generation
Grab the whole corpus
Lemmatize with an external lemmatizer (NLTK in our case)
Use the result as the new dictionary
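The three steps above can be sketched as a single function that takes the lemmatizer as a parameter. This is a minimal sketch, assuming the dictionary is simply the set of lemmas found in the corpus; `build_dictionary` and the toy lemmatizer are hypothetical. With NLTK installed and its WordNet data downloaded, `nltk.stem.WordNetLemmatizer().lemmatize` could be passed in place of the toy function.

```python
import re

def build_dictionary(corpus_texts, lemmatize):
    """Collect the lemma of every word in the corpus into a set."""
    lemmas = set()
    for text in corpus_texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            lemmas.add(lemmatize(word))
    return lemmas

# Toy stand-in for the external lemmatizer (hypothetical; strips a plural "s").
def toy_lemmatize(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

corpus_texts = ["Docstrings describe functions.", "A function returns values."]
custom_dictionary = build_dictionary(corpus_texts, toy_lemmatize)
print(sorted(custom_dictionary))
```

Since the dictionary is built from the corpus itself, a domain word such as "docstring" ends up as an entry instead of being matched to an unrelated one.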
Pros & Cons
Pros:
Custom dictionary with words from our corpus only
Cons:
Words are potentially badly lemmatized
Potential solution: test and incorporate CLHS glossary
Requires an external lemmatizer
But only once; every later pipeline run reuses the generated dictionary
Experimentation with Aggregators
TF-IDF output:
lib1: temperature, weight, unit
lib2: option, weight, test
lib3: test, temperature, weight
Histogram / keyword index (in the order shown): weight, temperature, test, option, unit
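The histogram aggregation shown on this slide can be sketched as a count of how many libraries list each keyword. This is a minimal sketch, assuming keywords are ordered by decreasing library count with alphabetical tie-breaking; the tie-breaking rule and the name `histogram_index` are assumptions made for illustration.

```python
from collections import Counter

tfidf_output = {
    "lib1": ["temperature", "weight", "unit"],
    "lib2": ["option", "weight", "test"],
    "lib3": ["test", "temperature", "weight"],
}

def histogram_index(keywords_by_library):
    """Order keywords by the number of libraries that list them."""
    counts = Counter(kw for kws in keywords_by_library.values() for kw in set(kws))
    # Ties are broken alphabetically (an assumption, for determinism).
    return sorted(counts, key=lambda kw: (-counts[kw], kw))

print(histogram_index(tfidf_output))
# ['weight', 'temperature', 'test', 'option', 'unit']
```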
Other Potential Aggregators: Top-Down
Rank output of TF-IDF with a pertinence score (e.g. mean of
TF-IDF values), and keep just enough keywords to reach full
library coverage.
TF-IDF output:
lib1: temperature (0.8), weight (0.72), unit (0.59)
lib2: option (0.76), weight (0.57), test (0.53)
lib3: test (0.98), temperature (0.8), weight (0.53)
Pertinence score: weight (0.61), temperature (0.8), test (0.755), option (0.76), unit (0.59)
Keyword index: temperature, option, test, weight, unit
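The top-down idea can be sketched with the slide's own numbers. This is a minimal sketch, assuming the pertinence score is the mean of a keyword's TF-IDF values and that "just enough keywords" means the shortest prefix of the ranking that covers every library; the name `top_down` is hypothetical.

```python
tfidf_output = {
    "lib1": {"temperature": 0.8, "weight": 0.72, "unit": 0.59},
    "lib2": {"option": 0.76, "weight": 0.57, "test": 0.53},
    "lib3": {"test": 0.98, "temperature": 0.8, "weight": 0.53},
}

def top_down(scores_by_library):
    """Rank keywords by mean TF-IDF, then keep the shortest covering prefix."""
    per_keyword = {}
    for library, scores in scores_by_library.items():
        for keyword, score in scores.items():
            per_keyword.setdefault(keyword, {})[library] = score

    def pertinence(keyword):
        values = per_keyword[keyword].values()
        return sum(values) / len(values)

    ranked = sorted(per_keyword, key=pertinence, reverse=True)
    selected, covered = [], set()
    for keyword in ranked:
        if covered == set(scores_by_library):
            break  # every library already has at least one keyword
        selected.append(keyword)
        covered.update(per_keyword[keyword])
    return ranked, selected

ranked, selected = top_down(tfidf_output)
print(ranked)    # ['temperature', 'option', 'test', 'weight', 'unit']
print(selected)  # ['temperature', 'option']
```

Under this reading, "temperature" covers lib1 and lib3 and "option" covers lib2, so two keywords are enough for full coverage in this toy example.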
Other Potential Aggregators: Bottom-Up
Start from keywords with the fewest associated libraries, and
take until full library coverage is achieved.
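A possible reading of the bottom-up strategy, reusing the three-library example from the earlier slides, is sketched below. It assumes that keywords are taken in order of increasing library count, with alphabetical tie-breaking, until every library is covered; the tie-breaking rule and the name `bottom_up` are assumptions.

```python
tfidf_output = {
    "lib1": ["temperature", "weight", "unit"],
    "lib2": ["option", "weight", "test"],
    "lib3": ["test", "temperature", "weight"],
}

def bottom_up(keywords_by_library):
    """Take keywords from rarest to most common until every library is covered."""
    libraries_of = {}
    for library, keywords in keywords_by_library.items():
        for keyword in keywords:
            libraries_of.setdefault(keyword, set()).add(library)
    # Fewest libraries first; ties broken alphabetically (an assumption).
    ordered = sorted(libraries_of, key=lambda kw: (len(libraries_of[kw]), kw))
    selected, covered = [], set()
    for keyword in ordered:
        if covered == set(keywords_by_library):
            break  # full library coverage reached
        selected.append(keyword)
        covered |= libraries_of[keyword]
    return selected

print(bottom_up(tfidf_output))  # ['option', 'unit', 'temperature']
```

Compared with the top-down sketch, this favours keywords specific to a single library, at the price of a longer index in this example.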
Conclusion
A 4-stage modular NLP pipeline for Quickref
First 3 blocks completed, to be released as standalone open-source libraries
Aggregation block still a work in progress
Suggestions / ideas welcome!
Thank you!