The landscape of Python natural language processing tools has evolved from broad libraries like NLTK toward more specialized packages such as Gensim for topic modeling, spaCy for linguistic analysis, and Hugging Face Transformers for advanced tasks, with Sentence Transformers extending transformer models to enable efficient semantic search and clustering. Each library occupies a distinct place in the NLP workflow, from fundamental text preprocessing to semantic document comparison and large-scale language understanding.
You are listening to Machine Learning Applied. In this episode, we're gonna talk about the lay of the land in Python natural language processing tools: Hugging Face Transformers, Sentence Transformers, Gensim, spaCy, NLTK. So here's a little bit of history on these packages, as I understand it, so take it with a grain of salt.
This has sort of just been my own experience evolving through these packages over the years. It seems it all started with NLTK. NLTK, the Natural Language Toolkit, seems to have been one of the first, if not the first, popular NLP libraries. NLTK basically lets you do anything and everything in NLP.
They just kept adding and adding all the features you could possibly imagine, for any NLP application you could possibly imagine: all the simple stuff like tokenization, stemming, and lemmatization, up to more complex things like document classification and syntax tree parsing. So NLTK was sort of the bread and butter of any NLP practitioner for the longest time.
Then what started happening was various packages started to be created which were better at some specific NLP application that NLTK could handle, but where maybe some new advancement in the NLP field, maybe some new white paper with its accompanying code base, lent itself better to a new specialized package for that particular application.
So, for example, there's a module called Gensim, G-E-N-S-I-M. It might be pronounced "jen-sim", I'm not sure. One thing that Gensim provides is topic modeling, and if you Google "Python topic modeling tutorial", you'll see Gensim over and over and over. Now, I assume that NLTK provides some topic modeling capabilities, but for whatever reason Gensim really sort of owned that feature, that specific application, topic modeling, and made it its own, and it's probably the best library out there to go to for that specific feature.
Now, Gensim provides a couple of tools for topic modeling, the most popular of which is called LDA, Latent Dirichlet Allocation. Conceptually, what topic modeling does is basically tag your documents. Let's say you have a news website: you have articles about politics, articles about sports, articles about health.
Well, how might you tag these articles, or how would you cluster them so that you have a news section and a sports section, or how would you find similarities between these documents? That's something where LDA really shone for the longest time. Nowadays we've got BERT, which we'll get into later.
And what LDA does, what topic modeling does, is look at the keywords in your documents and find common keywords as they recur in different themes, as a distribution of those keywords. We call that a topic distribution. So you don't just want to collect together everything that says politics or everything that says Trump.
Because what if you have a sports article that references Trump, or a health article that references Democrats for whatever reason? I'm being silly with my examples. So rather than just using keywords to collect documents or perform document similarity searches, what we call semantic search (we'll talk about that later). Semantic search is finding the similarity from one document to another in concept. Rather than performing semantic search using just keyword overlap between documents, we use a more advanced algorithm provided by Gensim called LDA that actually looks at the distribution of keywords as they occur in themes across the different documents.
And we call those topics. So this concept is called topic modeling. The process goes like this, from beginning to end. You take your documents, we call this a corpus, a bunch of text. You pull out the keywords. Now, at this point in our timeline, you're gonna be using NLTK to pull out the keywords.
NLTK will tokenize your documents. It will then remove stop words. It will then lemmatize your words. Lemmatization is generally better than stemming. Remember, stemming just chops off the end of a word, so "running" becomes "run", but sometimes that doesn't work out too well, whereas lemmatization tries to find the root dictionary word based on the inflection of the current word. So NLTK tokenizes, removes stop words, and lemmatizes your tokens, and now you have your corpus converted into keywords.
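As a rough sketch of that NLTK preprocessing step (the download calls and the exact stop word handling here are assumptions; adjust them to your NLTK version and your corpus):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: the punkt tokenizer data, stop word lists, and WordNet.
# Newer NLTK releases may also want "punkt_tab" and "omw-1.4".
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def keywords(doc):
    tokens = nltk.word_tokenize(doc.lower())        # tokenize
    tokens = [t for t in tokens if t.isalpha()]     # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stops]  # remove stop words
    # Lemmatize; pass pos="v" for verbs if you want "running" -> "run".
    return [lemmatizer.lemmatize(t) for t in tokens]

corpus = ["I was running to the store.", "Dogs were running around the park."]
keyword_docs = [keywords(d) for d in corpus]
```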
Then we take the scikit-learn TfidfVectorizer tool. TF-IDF, term frequency-inverse document frequency, basically counts the keywords in a document while emphasizing the ones that are special. This is by contrast to a different kind of vectorizer called the CountVectorizer, which is very easy to understand.
All it does is count the words in your document: is this one present? One. Is this one present? No, zero. Is this one present? Actually two times, so two. That's what a CountVectorizer does: it just counts the keywords in your document and turns it into a matrix of numbers.
The TfidfVectorizer is basically the same thing, except that it really emphasizes special words, words that occur a little more rarely. For example, you don't want stop words to dominate this process; TF-IDF basically makes the rarer, more distinctive words stand out compared to a plain count vector.
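A minimal side-by-side of the two scikit-learn vectorizers on toy documents (defaults everywhere; just to show the re-weighting):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the dog chased the ball",
    "the senator proposed the bill",
    "the dog fetched the ball and the stick",
]

counts = CountVectorizer().fit_transform(docs)  # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(docs)   # counts re-weighted by rarity

# "the" appears in every document, so TF-IDF down-weights it, while rarer
# terms like "senator" or "fetched" get relatively higher weights.
print(counts.toarray())
print(tfidf.toarray().round(2))
```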
So you've tokenized and lemmatized your documents using NLTK, you've vectorized those keywords into a numeric matrix using scikit-learn's TfidfVectorizer, and now you plug the whole thing into Gensim's LDA model, and Gensim will create for you a model that recognizes topic distributions across your various documents. That's what the LDA model provided by Gensim performs: document-topic distributions, or topic modeling.
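A hedged sketch of wiring those pieces together, reusing the keyword_docs lists from the NLTK sketch above: a scikit-learn document-term matrix converted into a Gensim corpus and fed to an LDA model. The num_topics and passes values are placeholders to tune, and note that LDA is more commonly trained on raw counts, so you may prefer CountVectorizer here; the conversion is the same either way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.matutils import Sparse2Corpus
from gensim.models import LdaModel

# Re-join the keyword lists so the vectorizer can consume them as strings.
texts = [" ".join(doc) for doc in keyword_docs]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)            # documents x terms sparse matrix

# Gensim wants a streamed bag-of-words corpus plus an id -> word mapping.
corpus = Sparse2Corpus(X, documents_columns=False)
id2word = {i: w for w, i in vectorizer.vocabulary_.items()}

lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10, random_state=0)

first_doc = next(iter(corpus))
print(lda.get_document_topics(first_doc))      # [(topic_id, probability), ...]
print(lda.print_topics())                      # top words per topic
```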
And now you can compare these documents to each other. Remember semantic search, the comparison of one document to another, here by way of the Jensen-Shannon similarity metric between two topic distributions. I'll talk about the different similarity metrics like Jensen-Shannon, Jaccard, cosine, Euclidean, et cetera, in a Machine Learning Guide episode in the future.
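As a small illustration of comparing two topic distributions: SciPy's jensenshannon returns a distance, so smaller means more similar. The distributions below are made up.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical topic distributions for three documents over five topics.
doc_a = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
doc_b = np.array([0.60, 0.20, 0.10, 0.05, 0.05])
doc_c = np.array([0.05, 0.05, 0.10, 0.20, 0.60])

print(jensenshannon(doc_a, doc_b))  # small distance: similar topic mix
print(jensenshannon(doc_a, doc_c))  # larger distance: different topic mix
```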
That was a bit of a whirlwind, but it just kind of showed a little bit of the history in the evolution of these packages. NLTK ended up becoming the catchall for everything you could do in NLP, and then various other packages started branching off, owning their own specific specialties.
Another thing Gensim provides, better in my opinion than NLTK does, is bigrams and trigrams for documents. Gensim allows you to take your pre-tokenized documents from NLTK and combine words that commonly occur together, so "machine learning" becomes "machine_learning" and "artificial intelligence" becomes "artificial_intelligence". That way, when you're doing topic modeling, generating topic distributions across your documents, combinations of words are considered rather than individual words, and that can be very important.
For example, you wouldn't want the words "artificial" or "intelligence" to stand out on their own, where maybe the word "intelligence" gets mapped against articles on education or universities when there really wasn't that similarity to begin with. So bigrams and trigrams and all those, they call 'em n-grams, allow you to find words that commonly combine together in order to make your downstream NLP applications more accurate.
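A minimal sketch of Gensim's phrase detection (the min_count and threshold values are placeholders you'd tune on a real corpus; the output comment is illustrative):

```python
from gensim.models.phrases import Phrases, Phraser

# tokenized_docs: lists of tokens per document, e.g. from the NLTK step above.
tokenized_docs = [
    ["machine", "learning", "is", "a", "subfield", "of", "artificial", "intelligence"],
    ["artificial", "intelligence", "and", "machine", "learning", "drive", "automation"],
]

bigram = Phraser(Phrases(tokenized_docs, min_count=1, threshold=1))
# Apply phrase detection again on the bigrammed corpus to pick up trigrams.
trigram = Phraser(Phrases(bigram[tokenized_docs], min_count=1, threshold=1))

print(bigram[tokenized_docs[0]])
# e.g. ['machine_learning', 'is', 'a', 'subfield', 'of', 'artificial_intelligence']
```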
And in my opinion, the best implementation of this is in Gensim, not in NLTK. So we have Gensim and NLTK bopping along, and then some other projects pop up, like TextBlob and Pattern and some other ones that I really haven't used that much. But the next one to come out was really powerful. It's called spaCy.
spaCy: S-P-A, capital C, lowercase Y. spaCy has a very specific use case. It's actually sort of marketed as a Swiss Army knife slash catchall like NLTK, but I don't think it's actually like that. I think it has a very specific use case, and that use case is linguistics. So yes, it has a lot of applications.
These include syntax tree parsing, named entity recognition, lemmatization, and all these things. But it does not provide, for example, LDA or TF-IDF vectorization or bigrams and trigrams; for those things, you're still gonna want to use Gensim. If you want bigrams and trigrams, you're still gonna want Gensim, and if you want LDA, you're still gonna want Gensim.
It's actually unclear to me now, with spaCy, as you'll see in a bit, whether we have any need for NLTK anymore. spaCy really covers a lot of the stuff that NLTK covers, but in my opinion more powerfully. When I say linguistics, I'm talking about phrase structure, sentence structure. So the core of spaCy works like this.
You pass in a sentence or a paragraph, it will split the paragraph into sentences, and then you can iterate word by word over the words in your sentence. When you're at one particular word, it provides you all the information you could possibly imagine for that word. It will tell you whether it's a stop word, okay?
Throw it away; there's stop word removal right there, you don't have to use some external package. It will tell you the named entity. It'll tell you the part of speech tag. It will tell you its lemma, which is super powerful: you don't have to lemmatize your document separately, you can just ask for the lemma from the word as you're iterating in spaCy.
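A minimal sketch of that iteration, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I went to the store in Seattle because I was running low on coffee.")

for sent in doc.sents:          # sentence segmentation
    for token in sent:          # word-by-word iteration
        print(
            token.text,
            token.is_stop,      # stop word?
            token.pos_,         # part-of-speech tag
            token.lemma_,       # lemma, e.g. "running" -> "run"
            token.ent_type_,    # named entity label, e.g. "GPE" for Seattle
        )
```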
And the way it handles named entity recognition and part of speech tagging and all these things is that it actually generates a parse tree of the sentence using a neural network, so it understands the sentence structure very powerfully. It generates a parse tree that you can then either iterate from left to right in your sentence, or you can walk the tree.
So when you're at a word, you can look at its parent, you can look at its left and right children to figure out its modifiers. For example, if you're looking at a noun: what are its adjective modifiers, what prepositional phrases are related to it, what's its parent? And then you can navigate all the way to the top of the sentence to find the root verb, which is usually what's called a transitive verb.
So if you were building, for example, an intent recognizer system, like a chatbot that wants to determine the intent of a user asking some question, you can pull out the root verb and the subject or the object in order to determine which action the user wants to perform. It's incredibly, incredibly powerful.
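A rough sketch of walking the dependency tree to pull out a root verb plus its subject and object, the kind of thing an intent recognizer might do (the sentence and the dependency labels filtered on here are just illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I want a large pepperoni pizza.")

for sent in doc.sents:
    root = sent.root                       # the root verb, e.g. "want"
    subjects = [t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")]
    objects = [t for t in root.children if t.dep_ in ("dobj", "attr")]
    print("action:", root.lemma_)
    print("subject:", [t.text for t in subjects])
    print("object:", [t.text for t in objects])

    # You can also walk upward from any token via .head, or downward via .children.
    for token in sent:
        print(token.text, "<--", token.dep_, "--", token.head.text)
```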
And I use spaCy in Gnothi. It's used for tokenizing and removing stop words and lemmatizing your entries so that it can provide keywords for your themes in the themes feature. But I also have big future plans for utilizing more of spaCy's capabilities, and here's a powerful example. When you write journal entries in Gnothi, you write them in the first person.
I mean, that's how everybody writes their journal: "Today I went to the store, I blah, blah, blah." I, I, me, me, me. Well, when you're generating summaries of your entries, or if a therapist were to ask a question about you, first off they'd have to know to ask it in the first person form: "How do I feel this week?"
They should be able to ask, "How does Tyler feel this week?" I want to be able to replace all first person pronouns, I, me, my, with the first name you specify in your profile. And spaCy allows you to do this. spaCy allows you to iterate the words in your sentence.
Examine the word we're looking at right now: is it a personal pronoun? Yes? Okay, just swap it out with "Tyler". But we have to be smart about that, because now the grammar will be incorrect. "I want french fries" becomes "Tyler want french fries"; the grammar's incorrect, the verb, the transitive verb there, is now messed up.
So what do you do? You replace the first person pronoun with a third person token, and then you walk the tree, up, left, or right, and you inflect the verbs to match the replacement token, using a plugin for spaCy called LemmInflect, which we are using in Gnothi. LemmInflect allows you to inflect a token to your desired part of speech tag.
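A rough sketch of that pronoun swap plus verb re-inflection, under the assumption that we only handle the simplest case, "I" as the subject of a present-tense verb; a production version would need to cover many more pronouns and tenses:

```python
import spacy
import lemminflect  # noqa: F401 -- importing this registers the token._.inflect extension

nlp = spacy.load("en_core_web_sm")

def third_person(text, name="Tyler"):
    doc = nlp(text)
    replacements = {}
    for token in doc:
        # Simplest case only: "I" as the subject of its head verb.
        if token.text == "I" and token.dep_ == "nsubj":
            replacements[token.i] = name                 # swap the pronoun for the name
            verb = token.head
            if verb.tag_ == "VBP":                       # non-3rd-person present, e.g. "want"
                replacements[verb.i] = verb._.inflect("VBZ") or verb.text  # -> "wants"
    return "".join(replacements.get(t.i, t.text) + t.whitespace_ for t in doc)

print(third_person("I want french fries."))  # e.g. "Tyler wants french fries."
```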
Super, super powerful. So it's a little bit tough for new users getting into NLP to know when and why they would want to use spaCy, but when that time comes, it becomes crystal clear. The first obvious use case is tokenizing, removing stop words, and lemmatizing; then you'll throw that into a TF-IDF vectorizer, and then into Gensim's LDA topic modeler. That's a very obvious use case for spaCy, and its lemmatizer and stop word removal capabilities are substantially more powerful and accurate, in my opinion, than NLTK's.
But the other capabilities of spaCy, such as sentence parse tree generation and walking the parse tree in order to manipulate the document or determine intent in the case of a chatbot that's going to perform an action on behalf of the user, all these things, you'll know when you need them, and spaCy is a fantastic tool for that job.
So that's NLTK, Gensim, and spaCy. One other thing spaCy does provide is semantic search, cosine similarity between documents. If you have a news article and you want to compare it to another news article, you could use spaCy. spaCy uses word-by-word token vectors, so it's the word2vec style of model, and it vectorizes the tokens in your sentence.
So then you have a bunch of vectors representing the words in your sentence, and as I understand it, the way it performs document similarity is some sort of averaging mechanism across the token vectors in your sentence compared to the same of another sentence, which to me is not great. That's not fantastic.
So I personally would never use spaCy for semantic search capabilities or clustering of your documents or anything like that. Like I said, spaCy really shines in linguistic features, manipulating the linguistic aspects of your sentences or documents. Gensim shines, specifically in my opinion, in bigrams, trigrams, and topic modeling.
It has many more capabilities, but in my own experience I have not used Gensim for almost anything else but those specific features. And NLTK is sort of the catchall Python package that does everything in NLP, the Swiss Army knife, but it's slowly being superseded by more and more capable packages using more modern technology for their specific applications.
And finally, the crown jewel of natural language processing: Hugging Face Transformers. Transformers is a breakthrough technology in NLP that lets you do any of the very high level, big, powerful NLP tasks, like question answering and summarization and sentiment analysis and document classification, all these things.
It's less on the linguistics side; you wouldn't really use Hugging Face Transformers for parsing your parse tree and picking out part of speech tags and tokens, although you can. Transformers is capable of that functionality, and there are Hugging Face Transformers models specifically for those use cases.
But for me, I would still rather use spaCy for that use case. No, Hugging Face Transformers is for these really high level applications in NLP. So the transformer is the technology, the concept; it was a white paper, and then BERT was sort of the first big implementation of the transformer concept.
And then after BERT, everybody started iterating on the BERT model. They came up with ALBERT and RoBERTa and DistilBERT and all sorts of other variations. And recently Google even put out a new white paper called the Performer, an alternative to the Transformer. And it's Google that created BERT in the first place.
If they're gonna be replacing it with the Performer, you know, is it time to adapt? Well, the glory of Hugging Face is that they just pull in everything as it comes hot off the press, brand new models. Either the community implements what was specced out in the white paper, or the model authors directly contribute their models to the Hugging Face Transformers repository.
Basically, the Hugging Face Transformers repository is a basket of all the hot new high level NLP models that you can just use off the shelf, and that's what's so special about Hugging Face Transformers, unlike a lot of these other packages, and definitely unlike the repositories put out by white paper authors, which are gonna require quite a bit of finagling to make things work in your own project.
Hugging Face Transformers really shines at being a turnkey solution. You just pip install the package, you copy a handful of lines of code, and then you have a summarizer or a question answerer. It's just incredibly powerful, so much power for very little code.
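A minimal sketch of that turnkey feel using the transformers pipeline API (models are downloaded on first use, and the default model each pipeline picks can change between library versions; the example text is made up):

```python
from transformers import pipeline

summarizer = pipeline("summarization")
qa = pipeline("question-answering")
sentiment = pipeline("sentiment-analysis")

text = (
    "Gnothi is a journal app that uses natural language processing to cluster "
    "entries into themes, summarize them, and answer questions about them."
)

print(summarizer(text, max_length=40, min_length=5)[0]["summary_text"])
print(qa(question="What does Gnothi do with entries?", context=text)["answer"])
print(sentiment("I love how little code this takes!")[0])
```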
I'll post a link to the Hugging Face Transformers website and repository in the show notes, and I'll also post a link to their models directory. They have a huge directory of community models, and basically anything you could possibly dream up is there: sentiment analysis, machine translation in all the languages, et cetera. So Hugging Face Transformers is the bread and butter of most modern NLP companies and shops and startups, and certainly the driving force behind Gnothi's question answering feature, summarization feature, sentiment analysis feature, and so on. On top of Hugging Face Transformers is a separate package, which I'm actually surprised they haven't just merged in, because it's so tightly coupled.
It's called Sentence Transformers, from UKP Lab, and Sentence Transformers uses Hugging Face Transformers under the hood in order to embed your documents into vector space. So for example, if you want to perform semantic search, cosine similarity between documents, or if you want to cluster your documents, both of which I'm doing in Gnothi, you would use
Sentence Transformers, which uses a model of your choosing from Hugging Face Transformers. For example, maybe the Facebook BART model specialized in summarization, or a long-form QA model specialized in question answering. I personally am using Sentence Transformers' default
RoBERTa spin, which is actually fine-tuned for cosine similarity between documents. I'm using that model, and it will embed your entries into vector space, a 768-dimension vector, and then you can compare that vector using cosine similarity to books and resources, Wikipedia articles, therapists, other community members, you name it.
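A hedged sketch of that embedding and comparison step with the sentence-transformers package (the model name here is just one of the library's pretrained RoBERTa-based options, not necessarily the one used in Gnothi; the texts are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("stsb-roberta-base")  # a RoBERTa model tuned for similarity

entries = [
    "I felt anxious about work all week.",
    "Went hiking this weekend and felt great.",
]
books = [
    "A guide to managing workplace anxiety.",
    "Trail running and hiking for beginners.",
]

entry_vecs = model.encode(entries, convert_to_tensor=True)  # 768-dim vectors
book_vecs = model.encode(books, convert_to_tensor=True)

# Cosine similarity between every entry and every book.
print(util.pytorch_cos_sim(entry_vecs, book_vecs))
```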
You can also cluster those embeddings using k-means or agglomerative clustering, and I use that for Gnothi's themes feature. So the process goes: I take all of your entries, I use Sentence Transformers to embed them into vector space, and I cluster them using agglomerative clustering. I'll talk about clustering in the next episode.
And now you have, let's say, seven clusters across all your entries. These are semantic clusters; these are topics across your entries. Because that's the magic of BERT: it's sort of, quote unquote, understanding your entry, it's not just looking at keywords. And then within those clusters, I find the cluster centroid.
That will be the center of each topic, each concept, and then I find the closest few entries to that centroid and I summarize them. In other words, I cluster your entries into themes or topics, then I find the entries within each theme most representative of that theme, and then I summarize those. So I summarize the essence of that theme, which is super powerful.
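A rough sketch of that themes pipeline, under the assumption of a fixed cluster count and scikit-learn's agglomerative clustering; a real implementation would choose the number of clusters more carefully and feed the selected entries to a summarization model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

entries = [
    "I felt anxious about work all week.",
    "My boss criticized my report and I spiraled.",
    "Went hiking this weekend and felt great.",
    "A long trail run on Sunday cleared my head.",
]

model = SentenceTransformer("stsb-roberta-base")
X = model.encode(entries)                                  # (n_entries, 768) NumPy array

n_clusters = 2                                             # e.g. seven in the episode's example
labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

for c in range(n_clusters):
    members = np.where(labels == c)[0]
    centroid = X[members].mean(axis=0)                     # the center of this theme
    dists = np.linalg.norm(X[members] - centroid, axis=1)  # each member's distance to it
    closest = members[np.argsort(dists)[:2]]               # the most representative entries
    print(f"theme {c}:", [entries[i] for i in closest])
    # Those closest entries would then be concatenated and run through a summarizer.
```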
And this is all by way of the Sentence Transformers package, allowing you to embed your documents using transformer models. So, yeah, I've touched on semantic search, and all that means is finding the similarity between documents: searching across documents for their similarity to each other, semantically.
Semantically means in understanding, in concept, as opposed to syntactically, which presumably would mean pulling out the keywords and comparing keywords across documents, which is not very powerful. Semantic search is a powerful tool, especially when you use the Sentence Transformers package
to embed your documents, which then allows cosine similarity comparison between documents. So that's semantic search, which is similarity between documents, and the most popular similarity metric to use in this type of semantic search, namely over document embeddings, is cosine similarity. But there are packages out there, which I'll discuss in a future episode, that allow for searching over very large document collections.
So for example, if I were to implement the Wikipedia feature of Gnothi, if I wanted to match your entries to Wikipedia articles, there are so many Wikipedia articles that it would be too computationally expensive to run a cosine similarity calculation against all of them at once. Instead, there are packages out there like Facebook's Faiss, F-A-I-S-S, that index your database, the Wikipedia corpus for example, and then allow you to do very, very fast top-k lookups of your document embedding from Sentence Transformers.
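A hedged sketch of that kind of top-k lookup with Faiss, using a flat inner-product index over L2-normalized vectors so the scores behave like cosine similarity; a corpus as large as Wikipedia would use one of Faiss's approximate index types instead, and the articles here are placeholders:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stsb-roberta-base")

articles = ["Article about anxiety.", "Article about hiking trails.", "Article about cooking."]
article_vecs = model.encode(articles).astype("float32")
faiss.normalize_L2(article_vecs)                  # normalize so inner product == cosine

index = faiss.IndexFlatIP(article_vecs.shape[1])  # exact inner-product index
index.add(article_vecs)

query = model.encode(["I felt anxious about work all week."]).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 2)              # top-2 most similar articles
print([(articles[i], float(s)) for i, s in zip(ids[0], scores[0])])
```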
It uses this cosine similarity metric under the hood, but it has some efficiency mechanisms that make the whole process very fast. That's it for this episode on NLP tools. We covered NLTK, Gensim, spaCy, Hugging Face Transformers, and UKP Lab's Sentence Transformers, and all of them have their use in various aspects.
NLTK is the catchall, the kind of odds-and-ends Swiss Army knife. Gensim is for topic modeling and bigrams. spaCy is for linguistic analysis and manipulation. Transformers is for the big stuff, the big high level, business value NLP applications, and Sentence Transformers sits on top of that for semantic search and clustering by way of embedding your documents.
In the next episode, I'm going to talk a little bit about clustering algorithms, agglomerative clustering versus k-means clustering, specifically in the scikit-learn package, and how that's applied in Gnothi.