A Corpus Processing and Analysis Pipeline for Quickref

Antoine Hacquard & Didier Verna

LRDE

EPITA Research and Development Laboratory

14th European Lisp Symposium, May 3–4 2021

A. Hacquard & D. Verna (LRDE) The Quickref NLP Pipeline ELS 2021 1 / 16

Quicklisp & Quickref


Motivation

The project:

A new keyword index for Quickref

Why not just use a modern search engine?

Favor Quicklisp availability

Natural emphasis on libraries with some documentation

Other potential applications (word cloud, statistical / topic analysis, etc.)


Pipeline Overview

[Pipeline diagram: lib1, lib2, lib3 corpora → Tokenizer / POS-Tagger → Stemmer / Lemmatizer → TF-IDF → Aggregator → keyword index, ..., word cloud]

[Labels along the pipeline: syntactic filter, known lemmas, pertinence filter]


Tokenizer & POS-tagger


Tokenization:

"this can can walk" ⇒ THIS | CAN | CAN | WALK

POS-tagging:

THIS (det.) | CAN (common noun) | CAN (verb) | WALK (verb)
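
The example above can be reproduced with a minimal Python sketch. The regular expression, the toy lexicon and the "noun after a determiner, verb otherwise" rule are assumptions made for illustration only; they are not the tokenizer or tagger used in the pipeline.

```python
import re

def tokenize(text):
    """Split text into uppercase word tokens (letters, digits, hyphens, apostrophes)."""
    return [t.upper() for t in re.findall(r"[A-Za-z0-9'-]+", text)]

# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON = {
    "THIS": ["det"],
    "CAN": ["common noun", "verb"],
    "WALK": ["common noun", "verb"],
}

def tag(tokens):
    """Toy contextual tagger: after a determiner prefer a noun, otherwise a verb."""
    tagged, previous = [], None
    for token in tokens:
        candidates = LEXICON.get(token, ["unknown"])
        preferred = "common noun" if previous == "det" else "verb"
        choice = preferred if preferred in candidates else candidates[0]
        tagged.append((token, choice))
        previous = choice
    return tagged

tokens = tokenize("this can can walk")
print(" | ".join(tokens))
print(" | ".join(f"{word} ({pos})" for word, pos in tag(tokens)))
```

The two occurrences of CAN receive different tags only because the tagger looks at the previous tag, which is why POS-tagging needs context and not just a word list.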


Stemmer & Lemmatizer


Stemming:

argue | argued | argues | arguing ⇒ argu

Lemmatization:

argue | argued | argues | arguing ⇒ argue
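
The difference between the two operations can be sketched in a few lines of Python. The suffix list and the lemma table below are hypothetical toy data chosen to reproduce the example; a real stemmer applies many more rules, and a real lemmatizer relies on a full dictionary.

```python
SUFFIXES = ("ing", "ed", "es", "e", "s")  # checked in this order

def stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table: inflected form -> dictionary form.
LEMMAS = {"argued": "argue", "argues": "argue", "arguing": "argue"}

def lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

words = ["argue", "argued", "argues", "arguing"]
print([stem(w) for w in words])
print([lemmatize(w) for w in words])
```

Stemming is purely mechanical and yields the non-word "argu", whereas lemmatization needs a dictionary but returns the real word "argue".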


TF-IDF


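$\text{TF}(\text{'the'}) = 0.7$; $\text{IDF}(\text{'the'}) = 1.9$;

$\text{TF-IDF}(\text{'the'}) = \dfrac{\text{TF}}{\text{IDF}} = \dfrac{0.7}{1.9} \approx 0.37$

$\text{TF}(\text{'temperature'}) = 1.6$; $\text{IDF}(\text{'temperature'}) = 0.3$;

$\text{TF-IDF}(\text{'temperature'}) = \dfrac{\text{TF}}{\text{IDF}} = \dfrac{1.6}{0.3} \approx 5.33$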
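
The two results on this slide are consistent with dividing TF by IDF, so the short Python sketch below follows that arithmetic. The function name is hypothetical. Note that many textbook definitions instead multiply TF by a log-scaled inverse document frequency; the sketch only mirrors the numbers shown here.

```python
def slide_score(tf, idf):
    """Ratio consistent with the slide's numbers: TF divided by IDF."""
    return tf / idf

print(round(slide_score(0.7, 1.9), 2))  # 'the'
print(round(slide_score(1.6, 0.3), 2))  # 'temperature'
```

A very common word such as 'the' ends up with a low score, while a domain-specific word such as 'temperature' scores much higher.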


Aggregator


Out of Dictionary Words

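[Diagram: the word "arguing" is stemmed to "argu"; the stem is compared by edit distance with each dictionary entry (argue, cat, forest); argmax selects "argue". Scores shown in the figure: 0.6, 0.9, 0.5.]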


Out of Dictionary Words

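[Diagram: the word "docstring" is stemmed and compared by edit distance with each dictionary entry (argue, cat, forest); argmax selects "forest". Scores shown in the figure: 0.6, 0.4, 0.7.]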

Words absent from the dictionary will match awkwardly!
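
The matching step can be sketched in Python as follows. The similarity scale (one minus the Levenshtein distance divided by the longer length) and the function names are assumptions, so the scores will not reproduce the figures on the slides; only the argmax principle is illustrated.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Map edit distance to a score in [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def closest_lemma(stem, dictionary):
    """Return the dictionary entry with the highest similarity (argmax)."""
    return max(dictionary, key=lambda entry: similarity(stem, entry))

dictionary = ["argue", "cat", "forest"]
print(closest_lemma("argu", dictionary))    # stem of "arguing"
print(closest_lemma("docstr", dictionary))  # some entry is always returned, however poor the fit
```

Because argmax always returns an entry, a word with no real counterpart in the dictionary is still mapped to something, which is the awkward behavior noted above.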


Custom Dictionary Generation

Grab the whole corpus

Lemmatize with an external lemmatizer (NLTK in our case)

Use this as new dictionary
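
These three steps can be sketched in Python as below. The function names and the toy lemma table are hypothetical; the stand-in lemmatizer only keeps the sketch free of dependencies. With NLTK installed and its WordNet data downloaded, the `lemmatize` method of `nltk.stem.WordNetLemmatizer` could be passed instead.

```python
def build_dictionary(corpus_tokens, lemmatize):
    """Collect the distinct lemmas produced by `lemmatize` over the whole corpus."""
    return sorted({lemmatize(token.lower()) for token in corpus_tokens})

# Stand-in lemmatizer (hypothetical toy table) so the sketch runs on its own.
# With NLTK available, one could pass nltk.stem.WordNetLemmatizer().lemmatize instead.
TOY_LEMMAS = {"docstrings": "docstring", "macros": "macro", "arguing": "argue"}

def toy_lemmatize(word):
    """Return the lemma from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

corpus = ["Docstrings", "macros", "macro", "arguing", "argue"]
print(build_dictionary(corpus, toy_lemmatize))
```

Since the dictionary is built from the corpus itself, domain words such as "docstring" become known lemmas instead of being matched to an unrelated entry.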


Pros & Cons

Pros:

Custom dictionary with words from our corpus only

Cons:

Words are potentially badly lemmatized (potential solution: test and incorporate the CLHS glossary)

Requires an external lemmatizer (but only once, not on every pipeline run)


Experimentation with Aggregators


[Figure: the TF-IDF output for lib1, lib2 and lib3 (keywords: temperature, weight, unit, option, test) is turned into a histogram (weight, temperature, test, option, unit), which yields the keyword index.]
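
A histogram-based aggregator of this kind might look like the Python sketch below. The per-library keyword lists are illustrative toy data, and the alphabetical tie-break is an assumption; neither is taken from the actual implementation.

```python
from collections import Counter

# Toy TF-IDF output: the keywords retained for each library (illustrative data).
tfidf_output = {
    "lib1": ["temperature", "weight", "unit"],
    "lib2": ["option", "weight", "test"],
    "lib3": ["test", "temperature", "weight"],
}

def histogram(keywords_by_library):
    """Count in how many libraries each keyword appears."""
    return Counter(k for kws in keywords_by_library.values() for k in set(kws))

def keyword_index(keywords_by_library):
    """Order keywords by decreasing library count, breaking ties alphabetically."""
    counts = histogram(keywords_by_library)
    return sorted(counts, key=lambda k: (-counts[k], k))

print(keyword_index(tfidf_output))
```

Keywords shared by many libraries rise to the top of the index, so this aggregator favors broad terms over specific ones.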


Other Potential Aggregators: Top-Down

Rank the TF-IDF output by a pertinence score (e.g. the mean of the TF-IDF values), and keep just enough keywords to reach full library coverage.

TF-IDF output:

lib1: temperature (0.8), weight (0.72), unit (0.59)

lib2: option (0.76), weight (0.57), test (0.53)

lib3: test (0.98), temperature (0.8), weight (0.53)

Pertinence score:

weight (0.61), temperature (0.8), test (0.755), option (0.76), unit (0.59)

Keyword index:

temperature, option, test, weight, unit
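
The top-down idea can be sketched in Python using the values from this slide. The grouping of the values by library and the greedy stopping rule are assumptions; the function names are hypothetical.

```python
from statistics import mean

# TF-IDF values per library (the grouping by library is assumed).
tfidf = {
    "lib1": {"temperature": 0.8, "weight": 0.72, "unit": 0.59},
    "lib2": {"option": 0.76, "weight": 0.57, "test": 0.53},
    "lib3": {"test": 0.98, "temperature": 0.8, "weight": 0.53},
}

def pertinence(tfidf_by_library):
    """Mean TF-IDF value of each keyword over the libraries that contain it."""
    values = {}
    for scores in tfidf_by_library.values():
        for keyword, score in scores.items():
            values.setdefault(keyword, []).append(score)
    return {keyword: mean(vs) for keyword, vs in values.items()}

def top_down(tfidf_by_library):
    """Take keywords by decreasing pertinence until every library is covered."""
    scores = pertinence(tfidf_by_library)
    ranked = sorted(scores, key=scores.get, reverse=True)
    uncovered, selected = set(tfidf_by_library), []
    for keyword in ranked:
        if not uncovered:
            break
        selected.append(keyword)
        uncovered -= {lib for lib, kws in tfidf_by_library.items() if keyword in kws}
    return ranked, selected

ranked, selected = top_down(tfidf)
print(ranked)    # all keywords, most pertinent first
print(selected)  # shortest prefix of the ranking that covers every library
```

With this data, the ranking matches the keyword index shown above, and the coverage rule stops as soon as each of the three libraries has at least one selected keyword.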


Other Potential Aggregators: Bottom-Up

Start from the keywords with the fewest associated libraries, and keep adding keywords until full library coverage is achieved.
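
A minimal Python sketch of this bottom-up strategy follows. The keyword-to-library mapping is hypothetical toy data, and the alphabetical tie-break is an assumption.

```python
# Keyword -> libraries in which it was retained (hypothetical toy data).
libraries_by_keyword = {
    "temperature": {"lib1", "lib3"},
    "weight": {"lib1", "lib2", "lib3"},
    "unit": {"lib1"},
    "option": {"lib2"},
    "test": {"lib2", "lib3"},
}

def bottom_up(libraries_by_keyword):
    """Take keywords with the fewest libraries first, until all libraries are covered."""
    all_libraries = set().union(*libraries_by_keyword.values())
    ranked = sorted(libraries_by_keyword,
                    key=lambda k: (len(libraries_by_keyword[k]), k))
    covered, selected = set(), []
    for keyword in ranked:
        if covered == all_libraries:
            break
        selected.append(keyword)
        covered |= libraries_by_keyword[keyword]
    return selected

print(bottom_up(libraries_by_keyword))
```

Compared with the top-down variant, this ordering favors the most specific keywords, at the cost of usually needing more of them to cover every library.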


Conclusion


A 4-stage modular NLP pipeline for Quickref

First 3 blocks completed, to be released as standalone open-source libraries

Aggregation block still a work in progress

Suggestions / ideas welcome!

Thank you!
