MLA 010 NLP packages: transformers, spaCy, Gensim, NLTK
Oct 27, 2020

NLTK: Swiss Army knife. Gensim: LDA topic modeling, n-grams. spaCy: linguistics. transformers: high-level business NLP tasks.

Show Notes

NLTK - Swiss Army knife / catch-all for anything and everything NLP.

Gensim - another odds-and-ends package, which I use specifically for LDA topic modeling and for detecting bigrams/trigrams.
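To make the bigram idea concrete, here is a minimal stdlib-only sketch of count-based phrase detection. It is not Gensim's API; the function names (`score_bigrams`, `merge_bigrams`), the toy sentences, and the threshold are all assumptions made up for illustration. The score rewards word pairs that co-occur more often than their individual frequencies would suggest.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair by how strongly the two words co-occur.

    Hypothetical helper: a simplified collocation score, not a library call.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Discount rare pairs, then normalize by the words' own frequencies.
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

def merge_bigrams(tokens, scores, threshold):
    """Join adjacent tokens whose pair score exceeds the threshold."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy corpus (assumed data, already tokenized and lowercased).
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "a", "big", "park"],
    ["the", "park", "is", "new"],
]
scores = score_bigrams(sentences, min_count=1)
merged = merge_bigrams(["i", "love", "new", "york"], scores, threshold=1.0)
print(merged)
```

On this toy corpus only the pair ("new", "york") clears the threshold, so it gets merged into a single token. Running the same pass again over the merged output is one way to get trigrams.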

spaCy - deep-learning-based linguistics tool. I use LemmInflect alongside it for inflecting words by part-of-speech tag, and for more robust lemmatization than spaCy's built-in lemmas. Also consider (forgot to mention in episode) Stanford's Stanza (from the group behind CoreNLP), usable inside spaCy via the spacy-stanza package. I've found it more accurate for most tasks, but much slower. Depends on your needs.

huggingface/transformers - high-level NLP tasks. See their Pipelines for the 10+ tasks you can perform basically as one-liners; and the sky's the limit if you're willing to get more code-y. Their model repository is just huge.

UKPLab/sentence-transformers - embed documents into vector space so you can do math on them: clustering, semantic search, etc. See their example applications.

To cover later: Approximate Nearest Neighbor (ANN) search. E.g., cosine-similarity lookup, but for huge corpora like Wikipedia. Annoy, FAISS, hnswlib, etc. See UKPLab's examples of these.
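For context, this is the exact (brute-force) search that ANN libraries approximate: score the query against every document vector and keep the best few. It is a stdlib-only sketch; the function names, the three-dimensional toy vectors, and the document IDs are assumptions for illustration, not real embeddings or any library's API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, corpus, k=2):
    """Exact search: compare the query with every vector, O(N) per query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Toy "embeddings" (assumed values; real ones have hundreds of dimensions).
corpus = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "stocks": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, corpus, k=2)
print(results)
```

The linear scan is fine for a few thousand documents. At Wikipedia scale the per-query cost is the problem, which is where index-based ANN libraries trade a little accuracy for much faster lookups.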

See also Analytics Steps' Top 10 Libraries.