The landscape of Python natural language processing tools has evolved from broad libraries like NLTK toward more specialized packages such as Gensim for topic modeling, spaCy for linguistic analysis, and Hugging Face Transformers for advanced tasks, with Sentence Transformers extending transformer models to enable efficient semantic search and clustering. Each library occupies a distinct place in the NLP workflow, from fundamental text preprocessing to semantic document comparison and large-scale language understanding.
You are listening to Machine Learning Applied. In this episode, we're gonna talk about the lay of the land in Python natural language processing tools: Hugging Face Transformers, Sentence Transformers, Gensim, spaCy, NLTK. So here's a little bit of history on these packages, as I understand it, so take it with a grain of salt.
This has sort of just been my own experience evolving through these packages over the years. It seems it all started with NLTK. NLTK, the Natural Language Toolkit, seems to have been one of the first, if not the first, popular NLP libraries. NLTK basically lets you do anything and everything in NLP.
They just kept adding and adding all the features you could possibly imagine, for any NLP application you could possibly imagine: all the simple stuff like tokenization, stemming, and lemmatization, up to more complex things like document classification and syntax tree parsing. So NLTK was sort of the bread and butter of any NLP practitioner for the longest time.
Then what started happening was various packages started to be created which were better at some specific NLP application that NLTK could handle, but where maybe some new advancement in the NLP field, maybe some new white paper with its accompanying code base, lent itself better to a new specialized package for that particular application.
So, for example, there's a module called Gensim, G-E-N-S-I-M. It might be pronounced "jen-sim", I'm not sure. One thing that Gensim provides is topic modeling, and if you Google "Python topic modeling tutorial", you'll see Gensim over and over and over. Now, I assume that NLTK provides some topic modeling capabilities, but for whatever reason Gensim really sort of owned that feature, that specific application, topic modeling, and made it its own, and it's probably the best library out there to go to for that specific feature.
Now, Gensim provides a couple of tools for topic modeling, the most popular of which is called LDA, Latent Dirichlet Allocation. Conceptually, what topic modeling does is basically tag your documents. Let's say you have a news website: you have articles about politics, articles about sports, articles about health.
Well, how might you tag these articles, or how would you cluster them so that you have a news section and a sports section, or how would you find similarities between these documents? That's something where LDA really shone for the longest time. Nowadays we've got BERT, which we'll get into later.
And what LDA does, what topic modeling does, is look at the keywords in your documents and find common keywords as they recur in different themes, as a distribution of those keywords. We call that a topic distribution. So you don't just want to collect together everything that says politics or everything that says Trump.
Because what if you have a sports article that references Trump, or a health article that references Democrats for whatever reason? I'm being silly with my examples. So rather than just using keywords to collect documents or perform document similarity searches, what we call semantic search (we'll talk about that later). Semantic search is finding the similarity from one document to another in concept. Rather than performing semantic search using just keyword overlap between documents, we use a more advanced algorithm provided by Gensim called LDA that actually looks at the distribution of keywords as they occur in themes across the different documents.
And we call those topics. So this concept is called topic modeling. The process goes like this, from beginning to end. You take your documents, we call this a corpus, a bunch of text. You pull out the keywords. Now, at this point in our timeline, you're gonna be using NLTK to pull out the keywords.
NLTK will tokenize your documents. It will then remove stop words. It will then lemmatize your words. Lemmatization is generally better than stemming. Remember, stemming just chops off the end of a word, so "running" becomes "run", but sometimes that doesn't work out too well, whereas lemmatization tries to find the root dictionary word based on the inflection of the current word. So NLTK tokenizes, removes stop words, and lemmatizes your tokens, and now you have your corpus converted into keywords.
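As a rough sketch of that NLTK preprocessing step (the download calls and the exact stop word handling here are assumptions; adjust them to your NLTK version and your corpus):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: the punkt tokenizer data, stop word lists, and WordNet.
# Newer NLTK releases may also want "punkt_tab" and "omw-1.4".
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def keywords(doc):
    tokens = nltk.word_tokenize(doc.lower())        # tokenize
    tokens = [t for t in tokens if t.isalpha()]     # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stops]  # remove stop words
    # Lemmatize; pass pos="v" for verbs if you want "running" -> "run".
    return [lemmatizer.lemmatize(t) for t in tokens]

corpus = ["I was running to the store.", "Dogs were running around the park."]
keyword_docs = [keywords(d) for d in corpus]
```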
Then we take the scikit-learn TfidfVectorizer tool. TF-IDF, term frequency-inverse document frequency, basically counts the keywords in a document while emphasizing the ones that are special. This is by contrast to a different kind of vectorizer called the CountVectorizer, which is very easy to understand.
All it does is count the words in your document: is this one present? One. Is this one present? No, zero. Is this one present? Actually two times, so two. That's what a CountVectorizer does: it just counts the keywords in your document and turns it into a matrix of numbers.
The TfidfVectorizer is basically the same thing, except that it really emphasizes special words, words that occur a little more rarely. For example, you don't want stop words to dominate this process; TF-IDF basically makes the rarer, more distinctive words stand out compared to a plain count vector.
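A minimal side-by-side of the two scikit-learn vectorizers on toy documents (defaults everywhere; just to show the re-weighting):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the dog chased the ball",
    "the senator proposed the bill",
    "the dog fetched the ball and the stick",
]

counts = CountVectorizer().fit_transform(docs)  # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(docs)   # counts re-weighted by rarity

# "the" appears in every document, so TF-IDF down-weights it, while rarer
# terms like "senator" or "fetched" get relatively higher weights.
print(counts.toarray())
print(tfidf.toarray().round(2))
```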
So you've tokenized and lemmatized your documents using NLTK, you've vectorized those keywords into a numeric matrix using scikit-learn's TfidfVectorizer, and now you plug the whole thing into Gensim's LDA model, and Gensim will create for you a model that recognizes topic distributions across your various documents. That's what the LDA model provided by Gensim performs: document-topic distributions, or topic modeling.
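A hedged sketch of wiring those pieces together, reusing the keyword_docs lists from the NLTK sketch above: a scikit-learn document-term matrix converted into a Gensim corpus and fed to an LDA model. The num_topics and passes values are placeholders to tune, and note that LDA is more commonly trained on raw counts, so you may prefer CountVectorizer here; the conversion is the same either way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.matutils import Sparse2Corpus
from gensim.models import LdaModel

# Re-join the keyword lists so the vectorizer can consume them as strings.
texts = [" ".join(doc) for doc in keyword_docs]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)            # documents x terms sparse matrix

# Gensim wants a streamed bag-of-words corpus plus an id -> word mapping.
corpus = Sparse2Corpus(X, documents_columns=False)
id2word = {i: w for w, i in vectorizer.vocabulary_.items()}

lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10, random_state=0)

first_doc = next(iter(corpus))
print(lda.get_document_topics(first_doc))      # [(topic_id, probability), ...]
print(lda.print_topics())                      # top words per topic
```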
And now you can compare these documents to each other. Remember semantic search, the comparison of one document to another, here by way of the Jensen-Shannon similarity metric between two topic distributions. I'll talk about the different similarity metrics like Jensen-Shannon, Jaccard, cosine, Euclidean, et cetera, in a Machine Learning Guide episode in the future.
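As a small illustration of comparing two topic distributions: SciPy's jensenshannon returns a distance, so smaller means more similar. The distributions below are made up.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical topic distributions for three documents over five topics.
doc_a = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
doc_b = np.array([0.60, 0.20, 0.10, 0.05, 0.05])
doc_c = np.array([0.05, 0.05, 0.10, 0.20, 0.60])

print(jensenshannon(doc_a, doc_b))  # small distance: similar topic mix
print(jensenshannon(doc_a, doc_c))  # larger distance: different topic mix
```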
That was a bit of a whirlwind, but it just kind of showed a little bit of the history in the evolution of these packages. NLTK ended up becoming the catchall for everything you could do in NLP, and then various other packages started branching off, owning their own specific specialties.
Another thing Gensim provides, better in my opinion than NLTK does, is bigrams and trigrams for documents. Gensim allows you to take your pre-tokenized documents from NLTK and combine words that commonly occur together, so "machine learning" becomes "machine_learning" and "artificial intelligence" becomes "artificial_intelligence". That way, when you're doing topic modeling, generating topic distributions across your documents, combinations of words are considered rather than individual words, and that can be very important.
For example, you wouldn't want the words "artificial" or "intelligence" to stand out on their own, where maybe the word "intelligence" gets mapped against articles on education or universities when there really wasn't that similarity to begin with. So bigrams and trigrams and all those, they call 'em n-grams, allow you to find words that commonly combine together in order to make your downstream NLP applications more accurate.
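A minimal sketch of Gensim's phrase detection (the min_count and threshold values are placeholders you'd tune on a real corpus; the output comment is illustrative):

```python
from gensim.models.phrases import Phrases, Phraser

# tokenized_docs: lists of tokens per document, e.g. from the NLTK step above.
tokenized_docs = [
    ["machine", "learning", "is", "a", "subfield", "of", "artificial", "intelligence"],
    ["artificial", "intelligence", "and", "machine", "learning", "drive", "automation"],
]

bigram = Phraser(Phrases(tokenized_docs, min_count=1, threshold=1))
# Apply phrase detection again on the bigrammed corpus to pick up trigrams.
trigram = Phraser(Phrases(bigram[tokenized_docs], min_count=1, threshold=1))

print(bigram[tokenized_docs[0]])
# e.g. ['machine_learning', 'is', 'a', 'subfield', 'of', 'artificial_intelligence']
```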
And in my opinion, the best implementation of this is in Gensim, not in NLTK. So we have Gensim and NLTK bopping along, and then some other projects pop up, like TextBlob and Pattern and some other ones that I really haven't used that much. But the next one to come out was really powerful. It's called spaCy.
spaCy: S-P-A, capital C, lowercase Y. spaCy has a very specific use case. It's actually sort of marketed as a Swiss Army knife slash catchall like NLTK, but I don't think it's actually like that. I think it has a very specific use case, and that use case is linguistics. So yes, it has a lot of applications.
These include syntax tree parsing, named entity recognition, lemmatization, and all these things. But it does not provide, for example, LDA or TF-IDF vectorization or bigrams and trigrams; for those things, you're still gonna want to use Gensim. If you want bigrams and trigrams, you're still gonna want Gensim, and if you want LDA, you're still gonna want Gensim.
It's actually unclear to me now, with spaCy, as you'll see in a bit, whether we have any need for NLTK anymore. spaCy really covers a lot of the stuff that NLTK covers, but in my opinion more powerfully. When I say linguistics, I'm talking about phrase structure, sentence structure. So the core of spaCy works like this.
You pass in a sentence or a paragraph, it will split the paragraph into sentences, and then you can iterate word by word over the words in your sentence. When you're at one particular word, it provides you all the information you could possibly imagine for that word. It will tell you whether it's a stop word, okay?
Throw it away; there's stop word removal right there, you don't have to use some external package. It will tell you the named entity. It'll tell you the part of speech tag. It will tell you its lemma, which is super powerful: you don't have to lemmatize your document separately, you can just ask for the lemma from the word as you're iterating in spaCy.
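A minimal sketch of that iteration, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I went to the store in Seattle because I was running low on coffee.")

for sent in doc.sents:          # sentence segmentation
    for token in sent:          # word-by-word iteration
        print(
            token.text,
            token.is_stop,      # stop word?
            token.pos_,         # part-of-speech tag
            token.lemma_,       # lemma, e.g. "running" -> "run"
            token.ent_type_,    # named entity label, e.g. "GPE" for Seattle
        )
```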
And the way it handles named entity recognition and part of speech tagging and all these things is that it actually generates a parse tree of the sentence using a neural network, so it understands the sentence structure very powerfully. It generates a parse tree that you can then either iterate from left to right in your sentence, or you can walk the tree.
So when you're at a word, you can look at its parent, you can look at its left and right children to figure out its modifiers. For example, if you're looking at a noun: what are its adjective modifiers, what prepositional phrases are related to it, what's its parent? And then you can navigate all the way to the top of the sentence to find the root verb, which is usually what's called a transitive verb.
So if you were building, for example, an intent recognizer system, like a chatbot that wants to determine the intent of a user asking some question, you can pull out the root verb and the subject or the object in order to determine which action the user wants to perform. It's incredibly, incredibly powerful.
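A rough sketch of walking the dependency tree to pull out a root verb plus its subject and object, the kind of thing an intent recognizer might do (the sentence and the dependency labels filtered on here are just illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I want a large pepperoni pizza.")

for sent in doc.sents:
    root = sent.root                       # the root verb, e.g. "want"
    subjects = [t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")]
    objects = [t for t in root.children if t.dep_ in ("dobj", "attr")]
    print("action:", root.lemma_)
    print("subject:", [t.text for t in subjects])
    print("object:", [t.text for t in objects])

    # You can also walk upward from any token via .head, or downward via .children.
    for token in sent:
        print(token.text, "<--", token.dep_, "--", token.head.text)
```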
And I use spaCy in Gnothi. It's used for tokenizing and removing stop words and lemmatizing your entries so that it can provide keywords for your themes in the themes feature. But I also have big future plans for utilizing more of spaCy's capabilities, and here's a powerful example. When you write journal entries in Gnothi, you write them in the first person.
I mean, that's how everybody writes their journal: "Today I went to the store, I blah, blah, blah." I, I, me, me, me. Well, when you're generating summaries of your entries, or if a therapist were to ask a question about you, first off they'd have to know to ask it in the first person form: "How do I feel this week?"
They should be able to ask, "How does Tyler feel this week?" I want to be able to replace all first person pronouns, I, me, my, with the first name you specify in your profile. And spaCy allows you to do this. spaCy allows you to iterate the words in your sentence.
Examine the word we're looking at right now: is it a personal pronoun? Yes? Okay, just swap it out with "Tyler". But we have to be smart about that, because now the grammar will be incorrect. "I want french fries" becomes "Tyler want french fries"; the grammar's incorrect, the verb, the transitive verb there, is now messed up.
So what do you do? You replace the first person pronoun with a third person token, and then you walk the tree, up, left, or right, and you inflect the verbs to match the replacement token, using a plugin for spaCy called LemmInflect, which we are using in Gnothi. LemmInflect allows you to inflect a token to your desired part of speech tag.
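A rough sketch of that pronoun swap plus verb re-inflection, under the assumption that we only handle the simplest case, "I" as the subject of a present-tense verb; a production version would need to cover many more pronouns and tenses:

```python
import spacy
import lemminflect  # noqa: F401 -- importing this registers the token._.inflect extension

nlp = spacy.load("en_core_web_sm")

def third_person(text, name="Tyler"):
    doc = nlp(text)
    replacements = {}
    for token in doc:
        # Simplest case only: "I" as the subject of its head verb.
        if token.text == "I" and token.dep_ == "nsubj":
            replacements[token.i] = name                 # swap the pronoun for the name
            verb = token.head
            if verb.tag_ == "VBP":                       # non-3rd-person present, e.g. "want"
                replacements[verb.i] = verb._.inflect("VBZ") or verb.text  # -> "wants"
    return "".join(replacements.get(t.i, t.text) + t.whitespace_ for t in doc)

print(third_person("I want french fries."))  # e.g. "Tyler wants french fries."
```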
Super, super powerful. So it's a little bit tough for new users getting into NLP to know when and why they would want to use spaCy, but when that time comes, it becomes crystal clear. The first obvious use case is tokenizing, removing stop words, and lemmatizing; then you'll throw that into a TF-IDF vectorizer, and then into Gensim's LDA topic modeler. That's a very obvious use case for spaCy, and its lemmatizer and stop word removal capabilities are substantially more powerful and accurate, in my opinion, than NLTK's.
But the other capabilities of spaCy, such as sentence parse tree generation and walking the parse tree in order to manipulate the document or determine intent in the case of a chatbot that's going to perform an action on behalf of the user, all these things, you'll know when you need them, and spaCy is a fantastic tool for that job.
So that's NLTK, Gensim, and spaCy. One other thing spaCy does provide is semantic search, cosine similarity between documents. If you have a news article and you want to compare it to another news article, you could use spaCy. spaCy uses word-by-word token vectors, so it's the word2vec style of model, and it vectorizes the tokens in your sentence.
So then you have a bunch of vectors representing the words in your sentence, and as I understand it, the way it performs document similarity is some sort of averaging mechanism across the token vectors in your sentence compared to the same of another sentence, which to me is not great. That's not fantastic.
So I personally would never use spaCy for semantic search capabilities or clustering of your documents or anything like that. Like I said, spaCy really shines in linguistic features, manipulating the linguistic aspects of your sentences or documents. Gensim shines, specifically in my opinion, in bigrams, trigrams, and topic modeling.
It has many more capabilities, but in my own experience I have not used Gensim for almost anything else but those specific features. And NLTK is sort of the catchall Python package that does everything in NLP, the Swiss Army knife, but it's slowly being superseded by more and more capable packages using more modern technology for their specific applications.
And finally, the crown jewel of natural language processing: Hugging Face Transformers. Transformers is a breakthrough technology in NLP that lets you do any of the very high level, big, powerful NLP tasks, like question answering and summarization and sentiment analysis and document classification, all these things.
It's less on the linguistics side; you wouldn't really use Hugging Face Transformers for parsing your parse tree and picking out part of speech tags and tokens, although you can. Transformers is capable of that functionality, and there are Hugging Face Transformers models specifically for those use cases.
But for me, I would still rather use spaCy for that use case. No, Hugging Face Transformers is for these really high level applications in NLP. So the transformer is the technology, the concept; it was a white paper, and then BERT was sort of the first big implementation of the transformer concept.
And then after BERT, everybody started iterating on the BERT model. They came up with ALBERT and RoBERTa and DistilBERT and all sorts of other variations. And recently Google even put out a new white paper called the Performer, an alternative to the Transformer. And it's Google that created BERT in the first place.
If they're gonna be replacing it with the Performer, you know, is it time to adapt? Well, the glory of Hugging Face is that they just pull in everything as it comes hot off the press, brand new models. Either the community implements what was specced out in the white paper, or the model authors directly contribute their models to the Hugging Face Transformers repository.
Basically, the Hugging Face Transformers repository is a basket of all the hot new high level NLP models that you can just use off the shelf, and that's what's so special about Hugging Face Transformers, unlike a lot of these other packages, and definitely unlike the repositories put out by white paper authors, which are gonna require quite a bit of finagling to make things work in your own project.
Hugging Face Transformers really shines at being a turnkey solution. You just pip install the package, you copy a handful of lines of code, and then you have a summarizer or a question answerer. It's just incredibly powerful, so much power for very little code.
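A minimal sketch of that turnkey feel using the transformers pipeline API (models are downloaded on first use, and the default model each pipeline picks can change between library versions; the example text is made up):

```python
from transformers import pipeline

summarizer = pipeline("summarization")
qa = pipeline("question-answering")
sentiment = pipeline("sentiment-analysis")

text = (
    "Gnothi is a journal app that uses natural language processing to cluster "
    "entries into themes, summarize them, and answer questions about them."
)

print(summarizer(text, max_length=40, min_length=5)[0]["summary_text"])
print(qa(question="What does Gnothi do with entries?", context=text)["answer"])
print(sentiment("I love how little code this takes!")[0])
```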
I'll post a link to the Hugging Face Transformers website and repository in the show notes, and I'll also post a link to their models directory. They have a huge directory of community models, and basically anything you could possibly dream up is there: sentiment analysis, machine translation in all the languages, et cetera. So Hugging Face Transformers is the bread and butter of most modern NLP companies and shops and startups, and certainly the driving force behind Gnothi's question answering feature, summarization feature, sentiment analysis feature, and so on. On top of Hugging Face Transformers is a separate package, which I'm actually surprised they haven't just merged in, because it's so tightly coupled.
It's called Sentence Transformers, from UKP Lab, and Sentence Transformers uses Hugging Face Transformers under the hood in order to embed your documents into vector space. So for example, if you want to perform semantic search, cosine similarity between documents, or if you want to cluster your documents, both of which I'm doing in Gnothi, you would use
Sentence Transformers, which uses a model of your choosing from Hugging Face Transformers. For example, maybe the Facebook BART model specialized in summarization, or a long-form QA model specialized in question answering. I personally am using Sentence Transformers' default
RoBERTa spin, which is actually fine-tuned for cosine similarity between documents. I'm using that model, and it will embed your entries into vector space, a 768-dimension vector, and then you can compare that vector using cosine similarity to books and resources, Wikipedia articles, therapists, other community members, you name it.
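A hedged sketch of that embedding and comparison step with the sentence-transformers package (the model name here is just one of the library's pretrained RoBERTa-based options, not necessarily the one used in Gnothi; the texts are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("stsb-roberta-base")  # a RoBERTa model tuned for similarity

entries = [
    "I felt anxious about work all week.",
    "Went hiking this weekend and felt great.",
]
books = [
    "A guide to managing workplace anxiety.",
    "Trail running and hiking for beginners.",
]

entry_vecs = model.encode(entries, convert_to_tensor=True)  # 768-dim vectors
book_vecs = model.encode(books, convert_to_tensor=True)

# Cosine similarity between every entry and every book.
print(util.pytorch_cos_sim(entry_vecs, book_vecs))
```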
You can also cluster those embeddings using k-means or agglomerative clustering, and I use that for Gnothi's themes feature. So the process goes: I take all of your entries, I use Sentence Transformers to embed them into vector space, and I cluster them using agglomerative clustering. I'll talk about clustering in the next episode.
And now you have, let's say, seven clusters across all your entries. These are semantic clusters; these are topics across your entries. Because that's the magic of BERT: it's sort of, quote unquote, understanding your entry, it's not just looking at keywords. And then within those clusters, I find the cluster centroid.
That will be the center of each topic, each concept, and then I find the closest few entries to that centroid and I summarize them. In other words, I cluster your entries into themes or topics, then I find the entries within each theme most representative of that theme, and then I summarize those. So I summarize the essence of that theme, which is super powerful.
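A rough sketch of that themes pipeline, under the assumption of a fixed cluster count and scikit-learn's agglomerative clustering; a real implementation would choose the number of clusters more carefully and feed the selected entries to a summarization model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

entries = [
    "I felt anxious about work all week.",
    "My boss criticized my report and I spiraled.",
    "Went hiking this weekend and felt great.",
    "A long trail run on Sunday cleared my head.",
]

model = SentenceTransformer("stsb-roberta-base")
X = model.encode(entries)                                  # (n_entries, 768) NumPy array

n_clusters = 2                                             # e.g. seven in the episode's example
labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

for c in range(n_clusters):
    members = np.where(labels == c)[0]
    centroid = X[members].mean(axis=0)                     # the center of this theme
    dists = np.linalg.norm(X[members] - centroid, axis=1)  # each member's distance to it
    closest = members[np.argsort(dists)[:2]]               # the most representative entries
    print(f"theme {c}:", [entries[i] for i in closest])
    # Those closest entries would then be concatenated and run through a summarizer.
```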
And this is all by way of the Sentence Transformers package, allowing you to embed your documents using transformer models. So, yeah, I've touched on semantic search, and all that means is finding the similarity between documents: searching across documents for their similarity to each other, semantically.
Semantically means in understanding, in concept, as opposed to syntactically, which presumably would mean pulling out the keywords and comparing keywords across documents, which is not very powerful. Semantic search is a powerful tool, especially when you use the Sentence Transformers package
to embed your documents, which then allows cosine similarity comparison between documents. So that's semantic search, which is similarity between documents, and the most popular similarity metric to use in this type of semantic search, namely over document embeddings, is cosine similarity. But there are packages out there, which I'll discuss in a future episode, that allow for searching over very large document collections.
So for example, if I were to implement the Wikipedia feature of Gnothi, if I wanted to match your entries to Wikipedia articles, there are so many Wikipedia articles that it would be too computationally expensive to run a cosine similarity calculation against all of them at once. Instead, there are packages out there like Facebook's Faiss, F-A-I-S-S, that index your database, the Wikipedia corpus for example, and then allow you to do very, very fast top-k lookups of your document embedding from Sentence Transformers.
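A hedged sketch of that kind of top-k lookup with Faiss, using a flat inner-product index over L2-normalized vectors so the scores behave like cosine similarity; a corpus as large as Wikipedia would use one of Faiss's approximate index types instead, and the articles here are placeholders:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stsb-roberta-base")

articles = ["Article about anxiety.", "Article about hiking trails.", "Article about cooking."]
article_vecs = model.encode(articles).astype("float32")
faiss.normalize_L2(article_vecs)                  # normalize so inner product == cosine

index = faiss.IndexFlatIP(article_vecs.shape[1])  # exact inner-product index
index.add(article_vecs)

query = model.encode(["I felt anxious about work all week."]).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 2)              # top-2 most similar articles
print([(articles[i], float(s)) for i, s in zip(ids[0], scores[0])])
```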
It uses this cosine similarity metric under the hood, but it has some efficiency mechanisms that make the whole process very fast. That's it for this episode on NLP tools. We covered NLTK, Gensim, spaCy, Hugging Face Transformers, and UKP Lab's Sentence Transformers, and all of them have their use in various aspects.
NLTK is the catchall, the kind of odds-and-ends Swiss Army knife. Gensim is for topic modeling and bigrams. spaCy is for linguistic analysis and manipulation. Transformers is for the big stuff, the big high level, business value NLP applications, and Sentence Transformers sits on top of that for semantic search and clustering by way of embedding your documents.
In the next episode, I'm going to talk a little bit about clustering algorithms, agglomerative clustering versus k-means clustering, specifically in the scikit-learn package, and how that's applied in Gnothi.