Primary clustering tools for practical applications include K-means using scikit-learn or Faiss, agglomerative clustering leveraging cosine similarity with scikit-learn, and density-based methods like DBSCAN or HDBSCAN. For determining the optimal number of clusters, silhouette score is generally preferred over inertia-based visual heuristics, and it natively supports pre-computed distance matrices.
You are listening to Machine Learning Applied. In this episode, we're gonna talk about practical clustering tools. Per the usual difference between Machine Learning Guide and Machine Learning Applied, I won't be talking about theory, about how these clustering algorithms work. I'm just gonna be talking about some of the scikit-learn packages and some of the tips and tricks I've found useful in actually applying clustering techniques in Gnothi.
And hopefully in a future Machine Learning Guide episode I'll talk about the theory behind some of these tools and clustering in general, et cetera. Now, the first tool I'm going to talk about is scikit-learn's KMeans. Everyone knows K-means. K-means is just the most popular clustering algorithm. Ever.
Just ever. Everyone uses it; 99% of the time, if they're clustering, they're using K-means. And in fact, I just recommend trying K-means first. If you need to cluster your vectors, try K-means. If it works, great. If not, then you'll move on to basically the other tools we'll talk about in this episode.
How does K-means work, generally, in practice? We're not gonna talk about theory; I don't wanna talk about how the centroids move around and the points shift and all those things. In practice, you will be using the scikit-learn sklearn.cluster.KMeans class, and you'll specify upfront a number of clusters.
So let's say you have a thousand vectors; that's your matrix, a thousand by ten, ten being the number of dimensions per vector in your matrix. You would need to know in advance how many clusters you're going to cluster your matrix into, and then you'll say KMeans(n_clusters=number_of_clusters).
And in the case of a thousand vectors, maybe 10 clusters sounds pretty good. There are other hyperparameters on the KMeans class that I won't be discussing. Then you'll call .fit() or .fit_predict() on your matrix, and it will return the labels, the cluster assignments: which cluster each row got assigned to.
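Not from the episode audio, but here's a minimal sketch of what that looks like, with made-up data and the other hyperparameters left at their defaults:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: a thousand vectors, 10 dimensions each.
X = np.random.rand(1000, 10)

# You must choose n_clusters up front.
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X)  # one cluster label per row
```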
So if you have a thousand rows, you'll get a list of a thousand labels, each between 0 and 9. Now, I mentioned in a recent Machine Learning Guide episode that Euclidean distance does not perform well in high dimensions, and K-means is effectively using the Euclidean distance metric under the hood in order to move its points around and find those centroids. In Gnothi,
the journal-entry embeddings, produced with the sentence-transformers embedding tool, are 768-dimensional vectors. That is too high for K-means to work effectively, as far as I understand it from poking around online. When we say too high a dimensionality for K-means to cluster effectively, for Euclidean distances to work well,
we're talking like a hundred; a hundred dimensions is pushing it, and we really want something like 10 to 50 dimensions. That's really the sweet spot. So 768-dimensional vectors are just right out, way too high for K-means to work fantastically. Now, I have used K-means in Gnothi in practice, and it's worked okay.
It's not terrible. And in fact, if you go to gnothiai.com now, sign up, create a few journal entries, and then go to the Insights page, you can generate themes from your journal entries. Themes in Gnothi are clusters; they are clusters of your journal entries. In other words, if I talk about relationships a lot and I talk about politics a lot, what may start out as a thousand journal entries becomes
two clusters in the themes section of Gnothi. It will show you the common keywords that appear in cluster one (in this case politics, Trump, elections, blah blah blah) as well as a summary of the core of that theme, using the Hugging Face Transformers summarization pipeline. And then theme two,
aka cluster two, because we are using clustering; that's all themes are, just the clustering feature in practice. Theme two might have the words relationship and marriage and all these things, and then a summarization of those entries. On the website, you can click the advanced gear icon and select K-means as your clustering algorithm.
Behind the scenes, it uses agglomerative clustering by default, which I'll talk about in a bit, and you can compare the results. I have found that K-means, for this purpose, for natural language processing with embedded documents, these very high-dimensional 768-dimensional vectors, does not work fantastically. It works okay.
It's not terrible, but I have personally found that agglomerative clustering works better, and we'll talk about that in a bit. But before we do: K-means is the most popular clustering algorithm on the planet, so let's dwell here a little bit more. There are other implementations of K-means; you don't have to use the scikit-learn package.
A very popular alternative implementation of K-means is in Faiss, F-A-I-S-S. It's a Facebook machine learning package, and it has its own implementation of K-means, which is actually faster, more performant, and evidently more accurate when you train it on very large data. So the scikit-learn KMeans package works effectively for small to medium data.
I'm talking about numbers of rows, not dimensions, and the Faiss K-means implementation by Facebook works better for large data, large numbers of rows, both in computational performance and in accuracy. As for what the Faiss package does more broadly: I will talk about this concept in another episode because it is highly related to Gnothi.
I'm not actually using Faiss or any of these related packages yet, but I will eventually, and I think it will be of interest to you listeners. They call these approximate nearest neighbor search algorithms, or packages. There's a handful out there. One is Faiss, one is Annoy, A-N-N-O-Y. Yeah, exactly: annoy. And another one is hnswlib.
If you wanna dig into these now, because I won't be talking too much about them currently, go to the UKPLab sentence-transformers package, which is the tool I use in Gnothi for embedding your documents. In its examples directory, there's code that uses these packages and discusses what they do and why you would use them.
So if you want a quick start and want to dive into these things, I recommend going to the UKPLab sentence-transformers examples directory and doing a little poking around. But in essence, what these approximate nearest neighbor packages do is semantic search, looking things up.
Just like you would use cosine similarity, and I am using cosine similarity in Gnothi, but these are for very, very large datasets, extremely large datasets, millions of rows. Let's take Faiss as an example. Faiss creates an index, basically a database of your document embeddings.
An index of these documents. So you instantiate an index in Python code, something like index = faiss.IndexFlatL2(dims) (or one of its variants), and then you feed all of your document embeddings, those 768-dimensional vectors, into the index. The index creates a database, effectively like a Postgres database or a MySQL database:
a database of vectors that can be swapped in and out of disk and memory very efficiently, and which indexes the locations of these documents in vector space relative to each other. You get to specify the distance metric by which they are compared to each other for lookup on the index.
So if you're using document embeddings, you care about cosine similarity; as we discussed in the distance metrics episode, document embeddings care about cosine similarity to each other. So you specify in the Faiss index: here are my documents, a million of them, and the similarity metric I want is cosine similarity.
Now you have an index that is used for extremely efficient lookup and search. The next time you have a document, say I have a million books in my Faiss index and a new journal entry posted to Gnothi, I could say: find me the top 10 books matching this journal entry. You just throw it at the Faiss index and it returns 10 results. Using the similarity metric you specified, cosine in this case, it's able to
project your journal entry into vector space, and those points, those books, being represented computationally efficiently, can be pulled out really fast and easy.
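Not from the episode, but a minimal sketch of that index-and-search flow. Since Faiss's flat indexes work with inner product or L2, a common trick for cosine similarity is to L2-normalize the vectors and use an inner-product index; the data and sizes here are made up:

```python
import numpy as np
import faiss

d = 768                                                    # embedding dimension
book_vecs = np.random.rand(100_000, d).astype("float32")   # hypothetical book embeddings
query = np.random.rand(1, d).astype("float32")             # e.g. a new journal-entry embedding

# Cosine similarity = inner product of L2-normalized vectors.
faiss.normalize_L2(book_vecs)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)           # exact inner-product index; Faiss also offers approximate ones
index.add(book_vecs)

scores, ids = index.search(query, 10)  # top-10 most similar books for this entry
```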
And the way this relates to K-means is that if you don't specify any distance metric, it's gonna use Euclidean. Everything always uses Euclidean by default; it's the default distance metric of just about everything, and it's the default for Faiss's implementation of the K-means algorithm too. So the Faiss package's primary utility is this approximate nearest neighbor search, which allows you to do semantic search between vectors.
You can, very computationally efficiently, look up vectors similar to one you pass it, for very large datasets. But one of the side benefits of Faiss is that it uses its own implementation of the K-means algorithm under the hood in order to perform this process, and it exposes that K-means implementation to you as a developer, which you can then use for your own normal clustering applications on very large datasets.
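A rough sketch of that exposed K-means, under the same made-up-data assumption (assignment to centroids is Euclidean by default):

```python
import numpy as np
import faiss

d, k = 768, 10
X = np.random.rand(100_000, d).astype("float32")     # hypothetical large dataset

kmeans = faiss.Kmeans(d, k, niter=20, verbose=True)  # gpu=True is also available if you have a GPU
kmeans.train(X)

# Assign each row to its nearest centroid; search returns (distances, ids).
_, labels = kmeans.index.search(X, 1)
labels = labels.ravel()                               # one cluster id per row
```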
So that's K-means: the scikit-learn package for small to medium datasets, the Faiss package for large datasets. And K-means uses Euclidean distances as part of its algorithm under the hood. But like I said, Euclidean distance doesn't fare well in high dimensions, and indeed, with document embeddings like we're using in Gnothi, we're dealing with high dimensions.
So what do you do? You have a couple of options. One thing you could do is dimensionality-reduce your large vectors; dimensionality reduction will be a very near upcoming episode of Machine Learning Applied. There are all sorts of dimensionality reduction algorithms out there. One is PCA, principal component analysis.
Another is SVD. Others are UMAP and t-SNE. Now, t-SNE is a little more commonly used for visualization purposes, but UMAP and t-SNE are both what's called manifold learning methods, and they can be used for dimensionality-reducing your large vectors down to very small ones: in the case of t-SNE, typically two to three components, down from 768.
And in the case of UMAP, you get to specify however many components you want. We won't talk about dimensionality reduction here; that'll be an upcoming episode, and I'm not using dimensionality reduction in Gnothi. Instead, I'm using a different clustering algorithm, one which supports different similarity metrics for determining the clusters of your rows. And namely, as I mentioned in the similarity metrics episode, in NLP cosine is almost always preferred over Euclidean distance anyway.
You know, it's not just that high dimensions don't perform as well; that's not the only reason we're not using K-means here. It's that in the case of NLP, Euclidean distance, which is sort of the driving force behind K-means, doesn't really make sense as a distance metric. In semantic search, cosine is preferred.
Cosine is preferred. So K-means is pretty much kicked to the curb in this particular NLP case: high dimensions, check; Euclidean, check; move on. And the place I ended up landing is agglomerative clustering. I'm not gonna spell it, it's a long word; you can look it up in the show notes.
Agglomerative clustering is a type of hierarchical clustering. It's a totally different style of clustering algorithm than the way K-means works. There are all sorts of different hierarchical clustering algorithms, and there are a lot of other clustering algorithms that aren't
hierarchical as well. A few I've seen commonly used are spectral clustering, mean shift, and affinity propagation, plus other clustering algorithms that I won't discuss in this episode; I won't discuss those three either. But hierarchical clustering works totally differently than K-means does, and hierarchical clustering algorithms are friendlier to using different similarity metrics under the hood for clustering your data.
In our case, like I said, we want cosine. K-means doesn't use cosine; in fact, scikit-learn's implementation doesn't even support it. And so in Gnothi we are using a type of hierarchical clustering algorithm called agglomerative clustering. I won't talk about how agglomerative clustering works, mostly because I actually don't understand it that well.
It just so happens to work really well for me in practice, and I've tried a handful of other clustering algorithms; agglomerative worked the best for me, so it's the one I'm going with. I will say, technically, that the way it lets you use the cosine metric is that it lets you pass in a metric by name.
So you'd say sklearn.cluster.AgglomerativeClustering(n_clusters=...); you always gotta specify the number of clusters in these scikit-learn clustering classes, and if we have a thousand entries, maybe 10 clusters sounds good. Then metric=, and you can specify a string; one that I believe is supported is 'cosine'.
I may or may not be wrong about that; I'm not actually using it. I'm using a different one. The metric string I'm using is 'precomputed', and what that does is it expects you to pass in, instead of the raw rows you're clustering, a different matrix.
And that matrix is the distance of every row to every other row. Now, because we're comparing every row to every other row, what we end up with is what we call a square matrix: the same number of rows as columns, all of A by all of B. So there's a lot of repeated data, obviously, but that's what these hierarchical algorithms that accept a precomputed distance matrix expect: a matrix, not a list or a vector.
An m-by-m matrix, m being the number of rows in your dataset. So you precompute the cosine distance of every entry embedding against every other entry embedding. What you get back is a square matrix of all the cosine distances (one minus the similarities), and you pass that matrix to the agglomerative clustering model when you call fit_predict, rather than passing the raw entry embeddings.
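Here's a hedged sketch of that precomputed setup; the embeddings are random stand-ins, and note that newer scikit-learn releases call the parameter metric while older ones call it affinity:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances

emb = np.random.rand(1000, 768)     # hypothetical entry embeddings
dist = cosine_distances(emb)        # (1000, 1000) square matrix of cosine distances

agg = AgglomerativeClustering(
    n_clusters=10,
    metric="precomputed",           # older scikit-learn versions call this parameter `affinity`
    linkage="average",              # "ward" requires Euclidean, so it can't be used with precomputed distances
)
labels = agg.fit_predict(dist)      # one cluster label per row
```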
And what you get back is, again, a list of labels: the list length being the number of entries, and the label values being any number in the range of the number of clusters you specified, ten in this example. How do I compute the square matrix of cosine distances from every entry embedding to every other entry embedding?
You'll just have to go see that in the code. I am using PyTorch to normalize all the vectors first and then compute the dot products. I mentioned this in the similarity metrics episode: cosine similarity is just the dot product of two normalized vectors. And the reason I do it that way, PyTorch normalize
then dot product, is that it's actually really fast. You can do this on the GPU; normalization and dot products are both extremely fast, first-class, obvious linear-algebra-on-a-GPU kind of stuff. Whereas if you were to use one of the CPU-based cosine distance functions, like SciPy's pdist or cdist, it would be performed on the CPU and it would be slower.
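This isn't Gnothi's exact code, but a minimal sketch of the normalize-then-dot-product trick, assuming the embeddings are already in a tensor:

```python
import torch
import torch.nn.functional as F

emb = torch.rand(1000, 768)              # hypothetical entry embeddings
if torch.cuda.is_available():
    emb = emb.cuda()                     # the whole computation can run on the GPU

normed = F.normalize(emb, p=2, dim=1)    # L2-normalize each row
sim = normed @ normed.T                  # cosine similarities, shape (1000, 1000)
dist = (1.0 - sim).cpu().numpy()         # cosine distances, ready for metric="precomputed"
```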
Okay, so that's the K-means algorithm, which uses Euclidean distance under the hood, compared to the agglomerative clustering algorithm, a type of hierarchical clustering that lets you specify whatever distance metric you want. In our case, we're using the cosine similarity metric.
For technical reasons, we don't just tell it to use the cosine similarity metric via the string metric='cosine'. Instead, we precompute the cosine similarities into a square matrix of cosine distances, and we do that by normalizing every vector and taking their dot products against each other.
Normalized dot product is effectively cosine. Next we're gonna talk about DBSCAN and HDBSCAN, but before we do, I want to talk about finding the number of clusters to use in your clustering algorithm, because like I said, both K-means and agglomerative clustering expect you to pass in an n_clusters parameter,
being the number of clusters you wanna cluster your large dataset down into. But how do you know it in advance? You might actually know: maybe there are three topics on this website, three news topics, politics, sports, and beauty. Those are the only three topics we talk about on this website.
So no matter how many news articles we publish, if we're gonna use a clustering algorithm, we cluster down to three: n_clusters=3. But it's very rarely the case that you know the number of clusters you need in advance, so there are automatic ways to find the number of clusters.
The first way is what we call the elbow approach. Now, I can't show you this, we're in podcast format and it's a very visual thing, but what happens is there's a point at which you start getting diminishing returns in what we call inertia (we'll talk about that in a bit). You can plot that on a graph and you'll see the point.
It's crystal clear. What happens is you try every number of clusters until you start getting diminishing returns. So with a thousand rows: let's try clustering this into one cluster. Okay, we get back what's called inertia, and we plot that point on a graph. Then we say, how about two clusters?
We cluster those thousand rows into two clusters, get back the inertia, put that on the graph, and keep going: three clusters, four clusters, and so on. What the graph ends up looking like is a really steep fall, then an elbow, a point where there's an angle and it really stops falling, and then it kind of levels out and goes to the right,
much more level, still decreasing, but much less rapidly. We call that point the elbow, and that's where you decide your optimal number of clusters is. Now, you could do that by actually creating a graph, and most of these tutorials teach that approach: literally iterate from one, two, up to
40, or half the number of rows in your dataset, generate clusters for each value, get the inertia back, plot it, look at the graph, eyeball what seems to be a good elbow, and then you, the programmer, manually choose that as your n_clusters. I don't like that approach.
I don't like the human being in the mix in that way. I believe in visualizing your datasets and distributions and making judgment calls on how to normalize your data and things like that, but I don't believe the human should be in the mix in this particular way. I think this should be automated, and there is a package out there called kneed, K-N-E-E-D, like knee.
Knee instead of elbow because if the kink in a graph is concave up we call it an elbow, and if it's concave down we call it a knee. So they decided to go with knee instead of elbow; maybe "elbow" was already taken as a package name. Knee with a D: K-N-E-E-D. What the kneed package does is let you automatically find the elbow or the knee.
You give it a handful of hyperparameters, which direction the curve is supposed to be pointing, concave up, convex down, blah blah blah. You specify those things, and then you give it the points that would otherwise have been plotted as visual output, and it finds the knee for you automatically.
Very handy utility. So: kneed.KneeLocator(your_points, plus some hyperparameters), and it will find that elbow for you. But those points on the graph, inertia, let's discuss that a little. The inertia of a K-means model measures how tight the clusters are: it's the sum of squared distances from each point to the centroid of the cluster it was assigned to. So what you do is instantiate the K-means model with a number of clusters, say two, fit it on your rows, and it remembers the result as a variable; on the fitted K-means model there's a public attribute you can access called inertia_. With two clusters, it reflects how tightly the two clouds of dots hug their two centroids. What would those distances be? Euclidean distance, of course; a little trivia from our distance metrics episode. Add a third cluster, and the inertia is now computed across all three clusters.
So that's inertia, and it's stored on the fitted K-means model, which is a variable in your Python code. You can just access it as model.inertia_; it's just a number, you throw it into a list, and then you pop that list into your KneeLocator code, which will then find the elbow for you.
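Putting the inertia sweep and kneed together, a sketch might look like this (the data and the range of k values are made up; S, curve, and direction are the kneed arguments mentioned here):

```python
import numpy as np
from sklearn.cluster import KMeans
from kneed import KneeLocator

X = np.random.rand(1000, 10)        # hypothetical data

ks = list(range(1, 41))
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in ks]

# Inertia decreases as k grows and the curve is convex, hence these settings.
knee = KneeLocator(ks, inertias, S=1.0, curve="convex", direction="decreasing")
n_clusters = knee.elbow             # the automatically detected elbow
```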
But there's an alternative method for finding the optimal number of clusters that's not inertia: the silhouette score. It takes into consideration both intra- and inter-cluster distances, so it considers the distance of points in a cluster to the rest of their own cluster,
and the distance from cluster to cluster, so it's a little more informed; it has more information packed into what it gives you back. In addition, it does something extra: the maximum silhouette score is the winner, the optimal number of clusters to choose. That's better than the knee-locator approach, because finding the elbow of a graph can be a little fuzzy, and there are hyperparameters you have to pass to the knee locator so that it finds the right elbow given the kinds of graphs you're getting back.
In particular, there's one hyperparameter called S, the sensitivity: how sensitive are you to these elbows? If the graph is too smooth, you need to up the sensitivity so it's a little more hardcore; if it's a very jagged graph, et cetera. So finding the elbow of a graph is usually not very cut and dried.
It can be subjective and fuzzy, and it might elude automated techniques like the knee locator. The silhouette score has two benefits over using inertia for finding the optimal number of clusters for K-means or agglomerative clustering. The first benefit is that the information packed into the process, namely inter- and intra-cluster distances, is more informative when it comes to deciding the optimal number of clusters.
You have more information to help you make a decision on the optimal number of clusters. And the second benefit is that you can simply select the number of clusters whose silhouette score is maximum, as that represents your optimal number of clusters; there's no complicated, fuzzy knee-locator logic you have to do to find an elbow.
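(For reference, and not from the episode: the silhouette value of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; the silhouette score is just the average of s over all points, so it ranges from -1 to 1 and higher is better.)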
So nine times outta ten, you see people prefer to use the silhouette score, and there's a scikit-learn silhouette_score function that makes this whole process super easy for you. And there's a third huge benefit: it allows you to pass in a precomputed square matrix of distances, so you can use the silhouette score with cosine distances for deciding the optimal number of clusters in agglomerative clustering.
So let's take a little step back. The K-means model uses Euclidean distance under the hood; that has problems in high dimensions, and it has problems because you can't use cosine. So in NLP applications, K-means isn't the best algorithm to use here. In most other applications, K-means is the default, and it is a great clustering algorithm to use.
But specifically in this case, document embeddings, high dimensions, and a preference for cosine over Euclidean, we can't really use K-means. And it is the K-means model that generates this inertia score when you run the model with some number of clusters, so we don't even have an inertia score; we can't use it.
It's not available to us; we're using the agglomerative clustering algorithm, which does not come with an inertia score. So what do we do? We run agglomerative clustering for every number of clusters from some minimum to some maximum, and then we compute the silhouette score using the same square matrix of cosine distances between documents that we fed to the agglomerative clustering algorithm.
We use that as the precomputed-distances input for the silhouette score. So the silhouette_score function takes the labels output by the clusterer and the distances themselves, in this case the precomputed matrix of cosine distances. So the silhouette score is fantastic. It's phenomenal.
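A sketch of that selection loop, under the same assumptions as the earlier snippets (random stand-in embeddings, cosine distances precomputed into a square matrix):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_distances

emb = np.random.rand(1000, 768)          # hypothetical entry embeddings
dist = cosine_distances(emb)             # square matrix of cosine distances

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 21):                   # silhouette needs at least 2 clusters
    labels = AgglomerativeClustering(
        n_clusters=k, metric="precomputed", linkage="average"
    ).fit_predict(dist)
    score = silhouette_score(dist, labels, metric="precomputed")
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels
```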
It is preferred in most cases, from what I understand, for finding the optimal number of clusters, whether for K-means or agglomerative clustering, hierarchical clustering, what have you. And so for most applications, what I recommend is: go K-means, go silhouette score, and take the max silhouette score.
That's your number of clusters; that's how you find the optimal number of clusters. K-means from scikit-learn if your data is small to medium, K-means from Faiss if your data is large, and the silhouette score to find the optimal number of clusters for your K-means. That's most applications.
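That default recipe, sketched with made-up low-dimensional data (Euclidean K-means plus max silhouette):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(1000, 10)                     # hypothetical small-to-medium dataset

scores = {}
for k in range(2, 21):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)      # Euclidean by default

n_clusters = max(scores, key=scores.get)         # the k with the highest silhouette score
```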
But we're not talking about most applications here; we're talking about natural language processing. So in our case, we are using the agglomerative clustering algorithm, which works well with precomputed square distance matrices, whatever those may be; in our case, cosine. And then you can use the silhouette score on the agglomerative clustering output in combination with the precomputed cosine matrix.
So, K-means and agglomerative. Of course, there are so many algorithms out there, and different circles favor different ones, but in my opinion those are the two clustering algorithms I like best. Finally, now we'll talk about DBSCAN and HDBSCAN. DBSCAN is a density-based clustering algorithm, and HDBSCAN layers a hierarchical approach on top of it, so it's a different family again from both K-means and agglomerative clustering.
And in fact, and this is just my observation, DBSCAN seems to be a lot more popular on the internet, used a lot more commonly in various applications. It even seems that a lot of the community is moving away from K-means and towards DBSCAN; whatever DBSCAN has as part of its special sauce under the hood, it offers a lot of versatility
that lends itself to multiple applications, and it sort of auto-learns some aspects as part of the process, one of those being the number of clusters. Big benefit: if you use DBSCAN, you don't need to specify the number of clusters. It learns it; it finds the number of clusters for you on its own.
That is enormous; it's a huge reason to choose DBSCAN over either agglomerative clustering or K-means. Now, DBSCAN is an algorithm, a concept, and there's a scikit-learn implementation, sklearn.cluster.DBSCAN. But there's a more popular package out there called
hdbscan, which implements HDBSCAN, a hierarchical variant of DBSCAN, outside of scikit-learn. It's more popular and more powerful: it has a lot of bells and whistles, it does a lot more for you out of the box, it has great documentation and a huge community behind it. So if you're gonna go the DBSCAN route, I recommend using the hdbscan package instead.
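A minimal sketch of the standalone hdbscan package (made-up data; note there's no n_clusters, and -1 in the output means the point was treated as noise):

```python
import numpy as np
import hdbscan

X = np.random.rand(1000, 10)                     # hypothetical data

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)  # min_cluster_size is the main knob instead of n_clusters
labels = clusterer.fit_predict(X)                # label -1 means noise / outlier

n_found = len(set(labels)) - (1 if -1 in set(labels) else 0)  # number of clusters it discovered
```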
And like I said, one of the big benefits of DBSCAN or HDBSCAN is that it automatically learns the number of clusters that's optimal for your clustering situation. Another big benefit is that there's a lot of automaticness to it;
it does some stuff under the hood that I don't understand (I haven't spent much time with it) that makes it generally applicable to use cases where you would otherwise reach for a specialized algorithm. Me personally, in Gnothi, I am specifically using agglomerative clustering because it lends itself well to having the precomputed cosine similarity matrix passed in and working off that,
and I find it works very well. Whereas the K-means algorithm would be used over there for other stuff, stuff that's Euclidean and not language-ish. Well, HDBSCAN, from what I understand, has as one of its great strengths that it's more general and broadly applicable: you can use it across the board under various circumstances, and it can handle many of those circumstances very well.
If you're doing clustering in your application, I would recommend exploring HDBSCAN and seeing how well it works for you; it's very popular and people swear by it. People love HDBSCAN. Me personally, I tried it in Gnothi for the themes feature and for clustering your journal entries before running those through the semantic similarity search against books.
It didn't work at all for me. I fiddled with it for hours and hours and it never found any clusters: every label was negative one, which is the label for noise, for outliers, meaning it didn't assign the points to any cluster. I don't know if it's just the data I'm using, these 768-dimensional document embeddings.
I don't know if I need to normalize them in advance or if the dimensionality is too high. As far as I understand, another claim of HDBSCAN is that the dimensionality of your vectors doesn't matter much. I could be wrong about that one, but it's supposed to be one of these catch-all clustering algorithms that can handle basically anything, yet it was not able to handle my data.
It could not handle these document embeddings whatsoever, neither DBSCAN from scikit-learn nor hdbscan the standalone package, with all the fiddling of hyperparameters and dimensionality-reducing my vectors and all those things. I couldn't get it working whatsoever, so I just moved on. Agglomerative clustering was working just fine for me, so I stuck to my guns.
So I'm not using it, but I do recommend you try it, because people swear by it. K-means is kind of the default, HDBSCAN is the catch-all, and agglomerative clustering is nice because it lets you use cosine or any other precomputed metric that isn't Euclidean-based. That's it for today.
Next time we'll talk about Docker, actually, a little change of pace from the types of stuff we've been talking about lately. See you then.