MLG 036 Autoencoders

May 30, 2025

Autoencoders are neural networks that compress data into a smaller "code," enabling dimensionality reduction, data cleaning, and lossy compression by reconstructing original inputs from this code. Advanced autoencoder types, such as denoising, sparse, and variational autoencoders, extend these concepts for applications in generative modeling, interpretability, and synthetic data generation.

Show Notes

Fundamentals of Autoencoders

  • Autoencoders are neural networks designed to reconstruct their input data by passing data through a compressed intermediate representation called a “code.”
  • The architecture typically follows an hourglass shape: a wide input and output separated by a narrower bottleneck layer that enforces information compression.
  • The encoder compresses input data into the code, while the decoder reconstructs the original input from this code.
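
To make the hourglass concrete, here is a minimal sketch of such a network, assuming PyTorch; the 100-feature input and 16-value code mirror the housing example used in the episode, and the layer sizes and activations are illustrative rather than prescriptive.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_inputs: int = 100, code_size: int = 16):
        super().__init__()
        # Encoder: wide input squeezed down to the narrow bottleneck "code"
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(),
            nn.Linear(64, code_size),
        )
        # Decoder: mirror image that reconstructs the original width
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 64), nn.ReLU(),
            nn.Linear(64, n_inputs),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = Autoencoder()
x = torch.randn(8, 100)                  # a batch of 8 rows with 100 features each
x_hat, code = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # the target is the input itself
```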

Comparison with Supervised Learning

  • Unlike traditional supervised learning, where the output differs from the input (e.g., image classification), autoencoders use the same vector for both input and output.

Use Cases: Dimensionality Reduction and Representation

  • Autoencoders perform dimensionality reduction by learning compressed forms of high-dimensional data, making it easier to visualize and process data with many features.
  • The compressed code can be used for clustering, visualization in 2D or 3D graphs, and input into subsequent machine learning models, saving computational resources and improving scalability.
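
As a rough illustration of that workflow, the sketch below reuses the Autoencoder class from the earlier snippet with a code size of 2 and feeds the resulting 2-D codes into scikit-learn's KMeans; the houses tensor is a random stand-in for real data, and the model is assumed to have been trained already.

```python
import torch
from sklearn.cluster import KMeans

model = Autoencoder(n_inputs=100, code_size=2)   # assumed to be trained beforehand
houses = torch.randn(500, 100)                   # stand-in for real housing rows

with torch.no_grad():
    codes = model.encoder(houses).numpy()        # shape (500, 2)

labels = KMeans(n_clusters=5, n_init=10).fit_predict(codes)
# codes[:, 0] vs. codes[:, 1] can go straight into a 2-D scatter plot, colored by label
```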

Feature Learning and Embeddings

  • Autoencoders enable feature learning by extracting abstract representations from the input data, similar in concept to learned embeddings in large language models (LLMs).
  • While effective for many data types, autoencoder-based encodings are less suited for variable-length text compared to LLM embeddings.

Data Search, Clustering, and Compression

  • By reducing dimensionality, autoencoders facilitate vector searches, efficient clustering, and similarity retrieval.
  • The compressed codes enable lossy compression analogous to audio codecs like MP3, with the difference that autoencoders lack domain-specific optimizations for preserving perceptually important data.
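
A hedged sketch of similarity retrieval over codes, reusing the trained model and houses stand-ins from the previous snippet and scikit-learn's exact NearestNeighbors (a vector database would be the production analogue):

```python
import torch
from sklearn.neighbors import NearestNeighbors

with torch.no_grad():
    catalog_codes = model.encoder(houses).numpy()   # encode the whole catalog once
    query_code = model.encoder(houses[:1]).numpy()  # "find houses similar to this one"

index = NearestNeighbors(n_neighbors=5).fit(catalog_codes)
distances, neighbor_ids = index.kneighbors(query_code)
```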

Reconstruction Fidelity and Loss Types

  • Loss functions in autoencoders are defined to compare reconstructed outputs to original inputs, often using different loss types depending on input variable types (e.g., Boolean vs. continuous).
  • Compression via autoencoders is typically lossy, meaning some information from the input is lost during reconstruction, and the areas of information lost may not be easily controlled.
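
A per-feature loss might look roughly like the following, assuming column 0 is a Boolean feature (e.g. has_pool) emitted as a raw logit and the remaining columns are continuous; the column layout and equal weighting are assumptions, not a fixed recipe.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(x_hat, x):
    # Boolean column gets a cross-entropy-style loss, continuous columns get MSE
    bool_loss = F.binary_cross_entropy_with_logits(x_hat[:, 0], x[:, 0])
    cont_loss = F.mse_loss(x_hat[:, 1:], x[:, 1:])
    return bool_loss + cont_loss   # relative weighting is a modeling choice
```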

Outlier Detection and Noise Reduction

  • Since autoencoder reconstructions tend to regress toward the mean, autoencoders can be used to reduce noise and identify outliers in the data.
  • Large reconstruction errors can signal atypical or outlier samples in the dataset.
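
One possible way to turn this into an outlier filter, again reusing the earlier model and houses stand-ins, is to score each row by its reconstruction error and flag rows above a cutoff (the 3-sigma threshold here is just an assumption):

```python
import torch

with torch.no_grad():
    x_hat, _ = model(houses)
    errors = ((x_hat - houses) ** 2).mean(dim=1)   # per-row reconstruction error

threshold = errors.mean() + 3 * errors.std()        # assumed cutoff; tune per dataset
outlier_rows = torch.nonzero(errors > threshold).squeeze(1)
```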

Denoising Autoencoders

  • Denoising autoencoders are trained to reconstruct clean data from noisy inputs, making them valuable for image and audio denoising as well as signal smoothing.
  • Iterative denoising as a principle forms the basis for diffusion models, where repeated application of a denoising autoencoder can gradually turn random noise into structured output.
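
The training change is small: corrupt the input but keep the clean data as the target. A sketch, assuming the earlier model and houses stand-ins and an arbitrary Gaussian noise level:

```python
import torch
import torch.nn.functional as F

noisy = houses + 0.1 * torch.randn_like(houses)  # corrupt the input (noise level assumed)
x_hat, _ = model(noisy)
loss = F.mse_loss(x_hat, houses)                 # reconstruct the *clean* data
loss.backward()
```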

Data Imputation

  • Autoencoders can aid in data imputation by filling in missing values: training on complete records and reconstructing missing entries for incomplete records using learned code representations.
  • This approach leverages the model’s propensity to output ‘plausible’ values learned from overall data structure.
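
A rough sketch of that recipe, using the earlier model as the assumed trained autoencoder: seed missing entries with column means, reconstruct, and keep the model's outputs only where values were actually missing.

```python
import torch

incomplete = houses.clone()
incomplete[torch.rand_like(incomplete) < 0.1] = float("nan")   # simulate missing entries

mask = torch.isnan(incomplete)                         # True where a value is missing
col_means = torch.nanmean(incomplete, dim=0)           # mean of the observed values per feature
seeded = torch.where(mask, col_means.expand_as(incomplete), incomplete)

with torch.no_grad():
    x_hat, _ = model(seeded)

imputed = torch.where(mask, x_hat, incomplete)         # original values stay untouched
```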

Cryptographic Analogy

  • The separation of encoding and decoding can draw parallels to encryption and decryption, though autoencoders are not intended or suitable for secure communication due to their inherent lossiness.

Advanced Architectures: Sparse and Overcomplete Autoencoders

  • Sparse autoencoders use constraints to encourage code representations with only a few active values, increasing interpretability and explainability.
  • Overcomplete autoencoders have a code size larger than the input, often in applications that require extraction of distinct, interpretable features from complex model states.
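
A common way to encourage sparsity is an L1 penalty on the code activations added to the reconstruction loss. The sketch below shows that idea with an assumed penalty weight, reusing the earlier model and houses stand-ins.

```python
import torch
import torch.nn.functional as F

x_hat, code = model(houses)
recon = F.mse_loss(x_hat, houses)
sparsity = code.abs().mean()      # L1 term pushes most code activations toward zero
loss = recon + 1e-3 * sparsity    # penalty weight is an assumption; tune for your data
```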

Interpretability and Research Example

  • Research such as Anthropic’s “Towards Monosemanticity” applies sparse autoencoders to the internal activations of language models to identify interpretable features correlated with concrete linguistic or semantic concepts.
  • These models can be used to monitor and potentially control model behaviors (e.g., detecting specific language usage or enforcing safety constraints) by manipulating feature activations.

Variational Autoencoders (VAEs)

  • VAEs extend autoencoder architecture by encoding inputs as distributions (means and standard deviations) instead of point values, enforcing a continuous, normalized code space.
  • Decoding from sampled points within this space enables synthetic data generation, as any point near the center of the code space corresponds to plausible data according to the model.
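
A minimal VAE sketch, assuming PyTorch: the encoder outputs a mean and log-variance per code dimension, a code is sampled via the reparameterization trick, and a KL term pulls the per-example distributions toward a standard normal. The sizes and the MSE reconstruction term are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, n_inputs: int = 100, code_size: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU())
        self.mu = nn.Linear(64, code_size)       # mean of each code dimension
        self.logvar = nn.Linear(64, code_size)   # log-variance of each code dimension
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 64), nn.ReLU(),
            nn.Linear(64, n_inputs),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        code = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(code), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # pull toward N(0, 1)
    return recon + kl

# Synthetic data generation: decode samples drawn from the standard normal prior.
vae = VAE()
synthetic_rows = vae.decoder(torch.randn(10, 16))
```

The last two lines show the generative use: because training pushes the code space toward a standard normal, decoding samples from that prior yields plausible synthetic rows.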

VAEs for Synthetic Data and Rare Event Amplification

  • VAEs are powerful in domains with sparse data or rare events (e.g., healthcare), allowing generation of synthetic samples representing underrepresented cases.
  • They can increase model performance by augmenting datasets without requiring changes to existing model pipelines.

Conditional Generative Techniques

  • Conditional autoencoders extend VAEs by allowing controlled generation based on specified conditions (e.g., generating a house with a pool), through additional decoder inputs and conditional loss terms.
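
One simple way to condition generation is to concatenate the condition onto the decoder's input (and, during training, onto the encoder's input as well). A sketch of the decode-time side, with the sizes and the single Boolean condition as assumptions:

```python
import torch
import torch.nn as nn

code_size, cond_size, n_inputs = 16, 1, 100
cond_decoder = nn.Sequential(
    nn.Linear(code_size + cond_size, 64), nn.ReLU(),
    nn.Linear(64, n_inputs),
)

condition = torch.ones(10, 1)          # e.g. "has a pool" = 1
code = torch.randn(10, code_size)      # samples from the (assumed trained) VAE prior
generated = cond_decoder(torch.cat([code, condition], dim=1))
```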

Practical Considerations and Limitations

  • Training autoencoders and their variants requires computational resources, and their stochastic training can produce differing code representations across runs.
  • Lossy reconstruction, lack of domain-specific optimizations, and limited code interpretability restrict some use cases, particularly where exact data preservation or meaningful decompositions are required.

Transcript
Welcome back to Machine Learning Guide. Today's episode is autoencoders, and it's actually a guest lesson by a colleague. The stars just happened to align: my colleague was interested in being involved on the podcast, he is an expert in autoencoders, and I had started the episode on diffusion models, which comes next, and didn't realize the degree to which diffusion models depend, historically and to an extent in modern times, on autoencoders, specifically variational autoencoders. So TJ Wilder is going to teach today's episode. TJ Wilder is a machine learning engineer and data scientist working on synthetic data for healthcare. His company, Intrep (I-N-T-R-E-P dot io), works with healthcare providers, insurers, and actuaries to seamlessly augment their data sets. By augmenting your real data with realistic synthetic data, Intrep lets you train better machine learning models without needing to change your model or data engineering at all. You can find more information and contact them at intrep.io to hear how they can help your team. Hello, I'm TJ Wilder. I'm a machine learning engineer and data scientist specialized in generative AI, but most of the things I work on are not the kind of generative AI you're probably thinking of. Most of my work over the last few years centers on the idea of using machine learning to generate synthetic data for healthcare. And this doesn't use large language models at all, though there are aspects that could, and I'll actually talk more about this kind of generative AI for synthetic data later on. But for now, I wanna focus on the foundations. Specifically, I'm going to talk about autoencoders. They're a specific type of neural network which is surprisingly simple, but has a lot of power and flexibility in its usages. Now, most machine learning tasks are called supervised learning tasks, which means you have some input data which has a correct output. The input could be a picture, where the output is whether there's a hotdog in the picture or not. The input could be information about houses, like the square footage, the year it was built, and the number of bedrooms, and then the output could be the market price of the house. These are different kinds of tasks that machine learning algorithms excel at solving. But an autoencoder is a little different, because the weird thing with an autoencoder is that its input is the same as its output data. Essentially, an autoencoder is a neural network which uses any input data to try and predict that same data coming out, which is maybe a little interesting, but it's not really obvious why or how you'd do this. But there are actually a lot of interesting use cases. A lot of the power comes from the actual shape of an autoencoder neural network. They're typically shaped like hourglasses, where you have a wide vector coming in and a wide vector going out, because it's the same vector. Then in the middle of your network, it typically becomes much more narrow. There's an intuition here from data compression. Actually, let's take our housing example and assume we have many more features, like whether there's a pool, how recently the floors were refurbished, what's the distance to the nearest grocery store, maybe the average price of homes in the neighborhood, stuff like that. And let's say we now have a hundred total inputs. We have a hundred numbers going in, but a lot of those are integer or Boolean, and a lot of those are correlated as well.
So when we train an auto encoder, we're declaring that we can compress those 100 numbers into a smaller footprint in the middle of the neural network. So maybe take 16 neurons to be the middle of our neural network. And those 16 neurons create what's called a code, not code like Python code, but code as in encoded and decoded. So the first half of our hourglass compresses and encodes, it's called the encoder. those 100 original numbers into 16 encoded floating point numbers. Then the second half of the hourglass, decompresses and decodes that code back into the original 100 numbers with all the correct types and all of that jazz. So in some sense, that 16 number code in the middle is a compressed form of our original 100 numbers. The Y is perhaps surprisingly intuitive because in order for our auto encoder to predict the output correctly, all of the information of the input still needs to be there within that code. It's just in a different form than it started with. Now this is actually kind of different than most neural networks because while we don't want to lose important information in traditional machine learning and traditional neural networks, we often do want to lose unimportant information. So, as an example, in the middle of our hotdog detector, we actually want there to be less information because most of the information in the picture is not actually relevant to whether it's a hotdog. So a typical thing could be eliminating the background because the background doesn't say whether it's a hotdog or not. Now there might be some information from the background that is helpful. Like if you see a hotdog cart in the background, that's probably more likely to be a hotdog in the picture. but that's not a guarantee. And so by training our models on lots of data, we can help filter those things out. And so when you're actually running on real data, the model will specifically lose information. And this is this is a good behavior. We don't want the model to have to remember everything, the whole network down, because then we're just wasting a bunch of space. We're wasting computation in order for it to know things that it doesn't need to know. And this works really well with traditional machine learning techniques. because we're calculating the loss in order to train the network, we're calculating the loss only based on the output that we're trying to get. And so we don't care about all of the other information that doesn't help with that output. We only care about doing that one thing. So in say, the first or second layer of our hotdog neural network, it might eliminate the background or it might compress the background into just one neuron rather than have many of the original pixels that are associated with it. so that's a good intuition. But neural networks are rarely actually that straightforward. in reality, they'll lose information somewhat randomly over time because. It takes time to figure out what's relevant. And then learning happens as a process. So the relevant things can stick around for longer or less long, depending on how that works out. So anything that doesn't help with the goal can be forgotten in the middle. That's the key point of most neural network machine learning tasks. However, with an auto encoder, that's not actually the case because you need to reconstruct everything back to its original state. And the training for this isn't particularly unusual. 
You can train it just like any other neural network using loss functions, and except rather than a separate Y vector for the output, you just calculate loss against that same X input factor that you put in as the input. There's a small caveat with the complexity that oftentimes you're actually calculating different types of losses for different types of inputs. Like a Boolean input of, is there a pool, could be a different loss function than year that the house was built. Because those are two different numbers they're on, very different scales and one of them is a Boolean, so it can only be one or zero. Whereas the year that the house is built could be anywhere in a large range of numbers. And that range can also vary depending on, what area your house is even in, because some areas will only have newer houses, whereas some areas could have houses that were built hundreds of years ago. So that can complicate things especially if you're using,a lot of different data in your auto encoder. So that kind of gets to what an auto encoder is. Now I wanna talk about the why as well as some of the problems with naively applying them to those use cases. Then after that I'm gonna talk about some of the more advanced versions of auto encoders, some of the more advanced varieties,which can solve some of those problems and even create some new ones. One last thing before that, which is important to say, is that there's no correct auto encoder for a given data set. Instead, every time you train one, it's gonna be slightly different. So even if your code size is the same between multiple runs, the actual meaning of each item will change. So if you have a code size of five, if you train a thing multiple times, the first code value, the first number will mean a different thing one time when you train it, and it'll mean a different thing the next time you train it. So you can't compare apples to apples across multiple auto encoders, at least as far as the actual code. and so that will impact some of these use cases. And it typically means you wanna train one really good auto encoder and then leave it at that. which is actually the same thing that they do with, large language model embeddings, which I'll talk a little bit more about in a little bit. So for use cases, let's start with dimensionality reduction. It's kind of a complicated sounding term if you haven't heard it, but dimensionality refers to how many dimensions your data is defined across, or more simply how many separate variables your input has. So an image could be one input per pixel. So a 10 by 10 image could have a hundred inputs, or you could say it was a dimensionality of 100. That's ignoring color channels, of course, but that's a whole other thing. So then dimensionality reduction is taking the original dimensionality of 100 and making some smaller representation. And in our case, that's the code in our auto encoder. In fact, the encoder part of the auto encoder works to reduce the dimensionality of the original data. So that's kind of obvious based on how I've described what an auto encoder is. Why is it useful? It's actually nice for a lot of things, including visualization, modeling, search and compression. And I'll go through these, one by one visualization is probably the most straightforward, but also might be the hardest to talk about on a podcast because I can't show any visualizations. so imagine you have your housing data set with a hundred different inputs. How do you visualize that? It's kind of impossible. 
You can create manual visualizations of different aspects of the housing, or you can do different types of grouping and you can select subsets of parameters. But it's hard to get an intuitive sense of a large amount of data when it has a high dimensionality like that. So instead, you can take your autoencoder and make the code size two or three. Now, when you run your data through the autoencoder and it compresses the code down to two or three, the autoencoder is not gonna be perfectly accurate, because that's a really tiny code to push all that data through, so it won't be as accurate. But because you now have a 2D or 3D representation, it's a lot easier to actually visualize that in standard graphing software. This is kind of an alternative to an algorithm called t-SNE, which is an algorithm specifically built for this type of dimensionality reduction for visualization and clustering. But you can also use this for other things, which is kind of why it's relevant. So you can visually find clusters, or visually compare data. And then you can say, okay, the data in this area of the graph is mostly rich houses or houses in expensive neighborhoods, and there's the subset with pools and there's the subset with a master bathroom, all that kind of thing. And then in another area of the graph you might find, okay, this is the multi-family housing. And so you can visually see those things, and then you can use that to go further in and analyze those groups. And this actually has the same use with clustering algorithms, where when you want to do clustering, it's generally very hard to cluster a high-dimensional space. There are two main reasons for that. One is that the literal computational complexity of clustering generally increases very fast when you have a lot of dimensions in your space, but also it's harder to interpret and understand clusters in an arbitrary dimensionality. You might have heard of something called the curse of dimensionality. A neural network is well-defined in the sense that we know how it works and what it does, but it's not running a defined set of steps that are specific to the data or the dimensionality; there are just more neurons. A lot of these clustering algorithms, though, have complexity built into the dimensionality, so they get more and more expensive, faster. So when you reduce the code like this, and in fact for clustering it doesn't need to be down to a code of one or two, it helps no matter how much you reduce the dimensionality. It makes the clustering a lot easier computationally, but also, by encoding everything into that shared space, you make all of those variables act in the same space. So with the original dataset, you might have the square footage of a house, which could be in the hundreds or thousands, and then you have whether there is a pool, which is just a Boolean, yes or no, and then there's the number of bedrooms, which is zero to five. And so all of these things are on totally different scales. And so naively clustering them is not gonna be that helpful, because the real-world difference between a 1,000-square-foot house and a 2,000-square-foot house is probably actually very similar to the difference between a one-bedroom and a three-bedroom, but they're just on different scales. And the clustering doesn't know that.
likely you could adapt some of these clustering algorithms to handle that, but it would also probably be a lot of manual work or a lot of just arbitrarily scaling the data to be between zero and one, which is also not necessarily the most helpful thing. So by doing it in the code space, you get a much more abstract representation, which is all in the same space, and you can do that clustering and you can visualize the clusters as well in those same 2D and 3D graphs I mentioned. so another thing you can do with your simplified data is you can combine them into a further machine learning model. So say you have,multiple different entities that you wanna learn about or that you want to run your machine learning algorithm on. So for the housing data, it might be data about the house is one set of data, and then data about the neighborhood could be another set of data. possibly data about the buyer, if that's getting involved or about the seller or about the county or the town or anything like that. and those could be different entities. And so I said you have a hundred features for the house information itself. Now you might have that same thing with each of those other things I mentioned. So you might have 500 features. It's a lot, but it's not a huge amount. But if you keep scaling that up, you keep having more and more problems, the more features you have. Say you had 10,000 features, which. might sound like a lot, it might not, depending on what kind of data you're used to and what kind of transformations you have for your data. Some transformations, like one Hot ENC coding, can tend to expand out the data quite a bit and make it a lot harder to then do further modeling with it. So if you had 10,000 inputs to your machine learning model and you want say, 500 neurons in a hidden layer in the first hidden layer, that's not a lot of neurons. Pretty standard amount, but just that 10,000 by 500 gives you 5 million parameters in just the first layer of your neural network. Now, if you don't have a lot of rows in your data, that's now impossible. You just wouldn't have enough to train that. If you have a lot of rows, it might not matter. But also having single layers that are very large makes it harder to do modeling on them. It makes it harder to train them. The network is larger, so it's harder to fit in memory. now a lot of those problems probably aren't that bad, but it also means that training is going to be slower because you'll have to have smaller batches because not as much will fit into memory. and said 10,000 as kind of an expansion of my 500, but if you just keep adding more data and more different features, a lot of these models will continue to get better. And so oftentimes, you want to gather as much possible data as you can. and at some point it is no longer feasible to pass it all in. So bringing that back around to auto encoders, if you train an auto encoder on each of those entities or data sources, you can take the code of that, which as I described earlier, has all of the information from the original data. That's kind of the whole point. That code holds as much data as the network can learn anyway. Now you can use that in your new machine learning model, in your downstream model instead of using the original data. And that saves up tons of parameters that can then be used to actually understand your data instead of just get through the formatting. and then you have all of those benefits I mentioned because it's smaller. 
you can train it faster, provided there's the extra step of actually auto encoding it in the first place and training the auto encoders. So it might not be a net savings depending on what you do, how many times you're retraining, all of those kinds of things. and beyond just size, sometimes this is also helpful as an example of something called feature learning. We use learned features like the code from the auto encoder instead of directly engineered ones. And it's actually very similar to the LLM idea of embeddings, which map each token or each word. But it's really a token, so oftentimes sub words or single characters even. and it maps those tokens to embeddings, which is just a set of numbers that correspond to that token. and so you could say an auto encoder could also take tokens and map them to codes, and then you could use those codes instead of embeddings. But actually for,large language models, embeddings are actually way better than auto encoders because of the context. So when you are. Encoding something with an auto encoder, you are trying to get all of the information out of the original thing, put it into smaller form usually, and then expand it back out. So you can take that smaller form and pass it in, but it's just a smaller form of the input. But with text, like a word, there's no real information in the word about what it is. especially when you think about them as tokens because we're not getting the individual characters that might make up a word or things like that. And so the actual like position in your word vector has no meaning. And so the auto encoder isn't going to learn anything useful from it. All it would learn is a way to map the index and then map it back out to the original. And so yes, it will be a smaller vector than the original, but it won't be a useful vector. So instead LLMs use embeddings. And those embeddings are learned on the specific types of tasks that LLMs are useful for, like predicting the next word. And because it's learned to complete that task alongside, it learns the embeddings on a bunch of words at once because it's training on a whole sequence. And so once it went by doing that, you kind of get that contextual information and force it into the embeddings, and then the embeddings become really helpful. I've never heard of anyone doing, an auto encoder with a transformer directly like that. but that could be an interesting, thing to look at if you're interested in seeing whether auto encoders could be useful as a replacement for embeddings. and the root of that problem is kind of that auto encoders are not domain specific. Or they're not use case specific, they're only data specific. So if you want to pass in a bunch of rows or a bunch of words to your auto encoder, you can do that and you could encode kind of chunks, similar to how a sentence embedding might work instead of an individual token embedding. and so that's fine, but it's not really built to do that. And so just normal auto encoders only have a fixed number of inputs and outputs. And so it doesn't naturally work with,a language model having varied text length. So that's just something to think about, why embeddings might be a better strategy for language models, but, auto encoder encodings could be a better use case for other types of dimensionality reduction. Another use for encodings, which is actually the same as LM embeddings, is for semantic search. 
so just like clustering, it's hard to arbitrarily search for data that has a high dimensionality, but it's a lot easier when you've reduced the dimensionality and made sure they're all in the same space. So by reducing that dimensionality, the vector search becomes exponentially less expensive. This is that same cursive dimensionality I mentioned. most vector searching is just more advanced versions of the nearest neighbor algorithm. so, these vector databases will implement nearest neighbor, but with probabilistic searching or, really good caching and splitting so that they can parallelize and so that they can optimize. But at its root, it's just finding the nearest neighbor in the vector space. And so that vector could be anything though. So it could be a, it could be an encoding instead of an embedding. And so you could do the same thing where you say, I really like this house, so I'm gonna encode all of the information about this house using my auto encoder. And then I'm going to run a search in the embedding space to find similar houses. It's the same intuition as clustering, where you cluster similar houses together. So also when you search, you'll find houses that are similar to this one. Now, obviously, if you wanted to search in practice, you'd probably also want hard filters like. You know, if you have a minimum square footage you're looking for or a maximum price, so obviously a fully featured search isn't just a vector search, but you can do the same thing. okay, so another use case for dimensionality reductions is literally just data compression. So I kind of described this already, but because the code is smaller than the input, you are literally compressing the data by doing that. And this is very similar to like zipping an unzipping a file or encoding a video or like an MP three audio file. and I use the word encoding. Specifically there, because it's the same type of encoding, it's just rather than using a neural network to do the encoding and MP three encoding, has a specific algorithm that it uses and then a specific algorithm to decode that into raw audio data so it can play on your computer. essentially the encoder part, the first part of your network is the encoder part and it encodes your data or compresses it in this case, and then the decoder decompresses it back to the original data shape. And so this could be useful if you wanna store your data or send it over the internet. so of course once you've encoded the data, if you wanted to send it over the internet, you still need to decode the data so you can actually send the decoder part of your neural network to someone else, and then they can use that to decode it. And it's kind of the same as an MP three in that if you know how to encode it, the other person still needs to know how to decode it, except that MP three is such a standard form for audio data that most every program already understands it and can work with it. Whereas with an auto encoder, you're creating a very specific bespoke coding for your specific data. And like I said, it's not even for just your data, it's for your data plus the random seed that you trained your data with. So that's obviously not ideal for something like audio, which has this standard interface already. but for some other things, like I said, bespoke use cases where you're sending specifically defined types of data from your dataset, it could theoretically be quite useful to save bandwidth or time transferring this data. 
I've been comparing this to MP three, and it's actually more similar to an MP three than say,zipping up and unzipping a file. if you don't know, there are lots of different types of compression algorithms, and I'm going beyond just auto encoders here. some algorithms, like zip files are called lossless, which mean they don't lose data When you compress and decompress them. So the output will be exactly the same as the input. That's usually what you want, because if you are encoding like a source code file, you can't just have the output be different because then it won't work. Like your code will be different than it started. But with a lot of the data, we Experience with our senses instead of data that needs to be mathematically perfect for computers, we actually often don't use lossless compression. So something like an MP three is actually a lossy compression, meaning that it has some information that's actually lost when you encode it. And so when you decode it, it's actually not the same as the original data. The cool thing with MP threes is that they actually have really sophisticated algorithms and reasons for what kinds of data they lose. So, like humans can't perceive certain frequencies of sound and they perceive certain frequencies of sound less than other ones. And so an MP three, like the scientists who made MP threes took all that into account and they only reduced the data usage by eliminating the stuff that humans don't need to hear as well or like don't care about as much. And I'm sure some of those trade offs. Are more interesting than others. But the fact is they had an algorithm so they could make those trade-offs and make it really good compression by intentionally eliminating the stuff that didn't matter. But auto encoders are a little bit more complicated, so it's hopefully clear how the auto encoder works now. But what type of compression is it? I'll start by asking, have you ever trained a model, a machine learning model, which achieves 100% accuracy? Probably not. Maybe for some toy dataset. So what does it mean if an auto encoder doesn't reach a hundred percent accuracy? It means that the decoded output is not exactly the same as the source data. It's not the same as the input. So for compression, that means that the compression and decompression will lose some data, making it lossy compression. So that's not necessarily a problem, right? I just described how MP threes are lossy and MP threes are great, and they're used all over the place. So what's the problem with auto encoders being lossy? That comes down to the fact that MP threes had all of those different decisions made to optimize the data for what it's actually used for, for human perception or specifically for listening to, whereas with an auto encoder, you're just at the mercy of the loss function you can train your model to be less lossy, to have more accurate compression. But the only way to do that is to have more training data is to maybe alter your loss function and your learning rate, which are hard to do. Just in general. It's like optimizing your neural network is a challenging process, but it has the issue of no matter how good you do, it probably won't ever reach a hundred percent accuracy and you don't have any control over what parts it skimps on. 
So a really naive loss function could say that the difference between a one and zero yes, there's a pool versus no, there's not a pool is the same difference because it's the difference of one as the difference between a house that's 5,000 square feet and a house that's 5,001 square feet. And if that makes the same difference, the auto encoder is really going to care a lot more about housing sizes because they're way more different. And so you're going to end up having disproportionate loss contributing, based on the housing size. there's ways to deal with that. You can scale them to be the same size and whatnot. But it's just one example of kind of a broader issue on the other side of that argument because you don't have to make these decisions. And because you can have the auto encoder and the training process make the decisions for you, you can actually encode things arbitrarily small. So you could make an auto encoder, which was just code size of one, and it would be super compressed. Now the algorithm probably would kind of fail and you wouldn't have a very good auto encoder. And so your compression would be very lossy, but you could do it. And so you could make it arbitrarily compressed and you just have to trade off how much you care about. the lossiness, which also can be a very expensive process if you need to train multiple auto encoders in order to figure this out. So you can use it as lossy compression, but it's not ideal in most cases. One cool case that's,a little different than audio is video. Video compression also exists and is a very hard problem. So there's lots of different codex and whatnot that have different trade offs and different methods of encoding videos. But one really cool way is using auto encoders. This was actually something that I've seen, tried by Nvidia and Disney, they actually downscaled videos using auto encoders and then sent them along the wire and then upscaled them back. They did some cool tricks to make it better than just a vanilla auto encoder. but that was the main technology that was used. It was auto encoders. And, there's similar work from Nvidia, which is more recent, where they use neural networks, but not exactly auto encoders, by compressing it, they did another cool thing where they extracted features of the face. So like where the points on your face were, how your smile is tilting, how are your eyes closed? And then they could project that with an image generation model and sort of send that over the wire and then have the image reconstructed on the other end. So it's even more complicated than just this compression from the auto encoder, but it was even more effective as well. so you can use it for compression. It's not ideal in most cases, but can be useful. there's another case for this encoding and decoding thing, which is cryptography. you can kind of think of the encoding as encrypting the data and then the decoding as decrypting the data. So you kind of have public key, which is the encoder. You put it into the encoding space. And then the private key, which decodes it back to the original data space, which is kind of cool. and so without the neural network, the code is just arbitrary numbers and it's at least extremely challenging, more likely computationally impossible to figure out what the original data was supposed to be. Especially if you don't know key information about the data, like the size of the data, like the shape of the original source data. 
And so the code is just a bunch of random numbers to anyone who intercepts it. And then you need the decoder network to actually be able to decode it. Though if you had the encoder network, you might be able to make some educated guesses and figure it out. So it's not necessarily a really secure way to do it. there's also that big problem where encryption is usually not lossy. It's oftentimes you're trying to send very specific things. And so by making your security algorithm lossy, that's not ideal, but if you wanted to send secure videos or something like that, it could be a valuable use case. So lossy compression is usually bad, but there is actually a case in machine learning where,this kind of lossy compression is actually really valuable. And that's in noise reduction and outlier elimination. the errors in auto encoders are typically regressions towards the mean data point. So that means when the auto encoder is wrong. when it doesn't successfully reproduce the input, it usually says that the thing is reproducing is more average than it is, so it's closer to the mean. And that's just because the mean is the most common thing. So of course it will tend towards that. and that's because it just sees a lot more average looking data compared to the outliers. So it's much better at reproducing average looking data than the outliers. the effect of this is that if you run all of your data through the auto encoder, it will tend to make outliers less dramatic, but won't significantly change the average data points, which can reduce the noise of the underlying data. Now there's a big case that outliers are really important. and so it kind of depends on the type of thing that you're doing, but this can be very useful for downstream machine learning tasks that don't care about the outliers or the outliers are big enough to screw up your training, but not important enough to help it. you can also do the same thing specifically for outliers. So I described it as outliers because I'm just talking about things that are far from the bean, but it will tend to kind of average things out a little bit, making it easier. But you can also say, if I encode this thing and decode it, is there a large reconstruction error? If there is, that means that this thing is more likely to be an outlier and you can actually filter those out. So you can directly eliminate outliers by doing that as kind of a. An outlier detection algorithm. So you can do this with just a normal auto encoder and it works and it will reduce the noise, but that's just reducing the noise by means of kind of not being very good. So what I mean is that because the training is random and because you have more of the average data than non average data just by how averages work when you run it through, it will kind of de-noise it towards the mean. But you can actually train something called a denoising auto encoder, which is specifically intended just for denoising. So instead of just trying to take the real data and reconstruct it, you add additional noise to the input, but have it still predict the output. So you could think of this as,applying a blur to an image. Before you reconstruct it or adding random static to that image. And this basically forces it to learn how to remove that same kind of noise from the image. And again, it doesn't have to be an image, I'm just using it as an example. So it will force the network to learn how to de-noise the image. 
And that can be especially useful for, data cleaning like with images, or it can be useful for signal processing frequently,signals like audio signals have jumps due to a bad microphone or a bad network connection. And so Denoising auto encoder could help smooth out those signals so that the audio doesn't sound terrible or so that the connection is smoother. And there's a really interesting, intuition, and use case for this, which is imagine you create a denoising auto encoder trained on images of dogs. So your goal is to clean up images so you have really nice high quality dog pictures. So that works great. you got a bunch of data. You made a model that takes noisy dog pictures and makes clean dog pictures. Wonderful. So now imagine you have some really messy images and your model just isn't powerful enough to clean them up. What would happen if you just ran it through your Denoising auto encoder? Twice? Each time it's supposed to reduce the noise. So in theory, you could still get a nice clean image though because it's so noisy in the first place, the results might not be as consistent and you won't necessarily get the exact same thing you wanted to get out. However, let's take this a step further. What if you had tons of noise, like you had noise and more noise and more noise just piled on until it looks nothing like the original image. So like imagine classic TV static just sh. white and black and gray everywhere. Now, instead of applying your deno once or twice, you apply it 10 times each time through the network. The network has that same goal to make it less noisy and to make it more like a dog picture. That's because you trained it on dog pictures. So every time you run it, it keeps getting clearer until you get a high quality dog picture just like you originally wanted. maybe this is a little crazy if you think about it, but this is a basic version of what's called a diffusion model. You might have heard of stable diffusion and others in the space. modern diffusion models actually work a little differently than this. but that's the intuition. You basically have a really noisy thing and you iteratively de-noise it with something like an auto encoder until suddenly it's not noisy anymore. So I won't get into that here, because Tyler's actually planning multiple episodes on diffusion models specifically already. but I just wanted to point out that connection here. one last use case for cleaning is data infill or interpolation. if you are a data scientist or machine learning engineer, data engineer, anyone in that kind of general area, you know how data is often really messy. It's often missing values in different features across different rows. And so they're randomly missing in some places because someone didn't enter it or because the code broke on that, or because, you know, it used to not be an important thing and now it's essential. And so there's tons and tons of reasons why your data can be messy. But oftentimes in machine learning, you need to deal with that messy data and you need to figure out some unified approach to handling it. the simple solutions work fine. So the simple solutions would be you just find anywhere that there's missing data and you just cut that row out. Maybe it's an outlier, maybe it's broken. As long as that's a small portion of your data, it's probably fine. Or oftentimes you'll just fill in the average or the most common value. So, you know, you have the average price, you just put that in instead of the missing price. 
And some machine learning models, or some transformations that you might apply actually work just fine with nulls, and so you can just use them. and alternative is actually by training an auto encoder. So if you train an auto encoder on just your complete data points, which if your data is super messy, you might actually not have many of those at all. And so this can be extra challenging, but if only some of your, if like, 25% of your data is broken. in this way, then you could use this. so if you train an auto encoder on your complete dataset or on the data points that are complete, and then you pass in your incomplete data points, and you might still have to fill in the average or the most common value just to make sure that it can actually run through the network. But once you do that, the model will actually naturally attempt to fill in those values towards their correct values. So because the model has only seen complete data, it will always try to decode it into complete data. So it will never like output something where there's a null, where there shouldn't be as long as you haven't trained it to do that. But then if the real value for that feature was very different than the average that you put in. As long as the model is smart and it's learned what it should, it will basically, tend to correct what you've said towards the more correct value. Because although it could take the feature and output the same feature at the end because it's condensing it into this code space, it's actually learning a more abstract representation of the collection of features altogether. And then it's outputting something that's more consistent with that. it's not ideal because it can also corrupt your other things. So say if you're filling in five features with the average, that could overwrite the rest of the information because now the model thinks that this whole row of data is more average and so it will actually change the other things back. However, you can just take the outputs that are. you can take the original data point and then only use the filled in values from the auto encoder output. And that can, that can mitigate a lot of that problem. Alright, so we've talked about auto encoders, how they work and some of the cool use cases you can get into with them. now I wanna highlight a few more of the issues with these use cases and also introduce some more advanced auto encoder architectures, which can tackle these problems and enable even more cool applications. One that I think should be obvious is computation. So like all other neural networks, there's a computational cost to any of these use cases. auto encoders are typically a lot smaller than say, large language models, but there's always some cost and that's for both encoding and decoding and that cost is another thing that prevents more adoption with,use cases like compression. Unlike an MP three algorithm, which is well optimized and we can trade off for compression amount and encoding and decoding speed. an auto encoder kind of just works as it does, and the only way to optimize it is by making a smaller model or like, say, running it on A GPU, but reducing the size also reduces the accuracy. And so it's harder to make those kinds of trade offs and like, not everyone has access to A GPU, especially when you just wanna listen to music. so there's not really a silver bullet to this. 
And it's also a thing where, once you've trained an autoencoder, like I mentioned earlier, each time you train it the randomness will make the autoencoder different. And so it makes it very challenging to further optimize an autoencoder that already exists, because any distillation or fine-tuning will end up actually changing what the internal code representation is. And then you can't do an apples-to-apples comparison, even just before and after the optimization. So it can be very powerful, but it's definitely not a one-size-fits-all solution. Another valuable property, which is typically lacking in neural networks, is interpretability. And that's definitely the case here: though you can compare multiple examples within the code space, like I was talking about for clustering and visualization, there's no real meaning to the codes, so you can't compare them with anything else, and you also can't do any really meaningful analysis on them except for things within that code space, like the clustering and identifying outliers. A concrete example: if you take your first code item, the average value of that across your data points just really doesn't mean anything at all. However, there's a really interesting version of an autoencoder that I wanna talk about called a sparse autoencoder, and that actually can help solve that problem. It doesn't necessarily eliminate it altogether, but it's really interesting. Rather than using a dense code space, which is the default with an autoencoder, you instead add some additional constraints on it. You may have heard of the L1 constraint, which encourages sparsity by basically trying to make things that are small actually go to zero instead of just staying small. And that's roughly how it works. So by adding sparsity constraints like that, instead of a dense vector in the middle, you get something that's much closer to a one-hot vector, which, if you don't know, is where most of the values are zero or very close to zero and as few values as possible are active. It's called one-hot because some of the values are "hot" (one), but most of them are zero. And so these aren't automatically interpretable, but it does make it easier to examine by hand, 'cause you could say, for instance, these five data points have the highest value for code one, whereas these other five data points have a value of zero for code one. That means that code one corresponds to luxury housing, because those first five data points were luxury houses. So this can be useful for explainability, but it's also typically harder to train because you have additional constraints. Furthermore, because you're training stochastically, like I said, there's randomness. You still have the same issue where, if you train multiple of them, they won't have the same sparse codes. So you might be able to say code one in your first model is luxury houses, but if you retrain it or update it, that just won't be the case anymore. There might still be a feature which is luxury houses, but it wouldn't necessarily be code one. It would just be a random feature, and there's no guarantee that they would be the same features, because you might have, say, luxury housing as a feature, or you might have expensive housing, which is broadly the same, but not exactly the same. My favorite version of this is a really cool paper published by Anthropic in 2023. It's called Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.
Dictionary Learning is what's being done by the sparse auto encoder. They actually describe it as a weak dictionary learning algorithm because it's weak. I've mentioned that. All of all of this stuff about lossy compression and, denoising, it doesn't make a perfect code in the middle, and it doesn't make a perfectly consistent code either. So it's a weak dictionary learning algorithm. So in the paper they train a very small language model, or SLM, which is basically an LLM like chat, GPT. But rather than being very deep and wide and many layers with tons of data, it's just a much smaller one. And in this case, theirs only has a single transformer layer and a single,MLP or multi-layer perceptron. So they trained the model and there's no auto encoder involved yet. Once they finish training the model, they basically take the outputs of one of the middle layers of the network, and they train the sparse auto encoder there. So essentially what they're doing is they're like inserting a chip into the brain of this language model. And they're saying, okay, here is where I want to understand the data. So they take that middle activation, they train the sparse auto encoder there. In their case, they actually took the output of the MLP before it went into the un embedding layer, which would then get,turned back into tokens. and they trained that sparse auto encoder there and they actually trained, an over complete auto encoder. I'm not sure if I mentioned this earlier, but the typical auto encoder is an hourglass shape where the code is smaller than the input, and that's called under complete because the code is smaller and over complete auto encoder is the opposite shape. Where it's wider in the middle, and normally the under complete makes a lot of sense because you're trying to make a dense version of a large thing. over complete meaning their code is bigger than the input is more relevant in this case because they have the dense inner thoughts of the network of this small language model. And it's already dense by nature. Like there's no reason for it to have any sparsity in there, but then they want to have it be sparse. So naturally, if there's a dense thing with overlapping meanings where a neuron can activate for lots of different reasons and they wanna make it sparse, it's sort of a natural extension to say, yeah, we need more sparse neurons than we have dense neurons. It's just the opposite of the usual, auto encoder case. So the point of that sparse code is so that they can then use that code. So like I mentioned, having a code for luxury houses, their goal was basically to create the dictionary of what these neurons could mean to basically take this dense inner thought and break it down into concrete concepts that then they can identify. And that's actually what the paper title means when it said,mono semantic, one meaning. So rather than these dense neurons, which have a lot of, overlapping meanings, they actually talk about it in quantum terms of like, each neuron is a super position of all the things that it's thinking about because it's just a bunch of math. And when you have a dense vector, all of those are super positions. All of those have multiple meanings. And so they try to extract it into a sparse vector with few activations where each feature in that sparse code space has a clearly defined meaning. And then by combining multiple different features in that vector space, you come up with more complex meanings. that combines to the meaning of the whole network. 
And so they actually identified a lot of interesting things. and I will say they go into plenty more detail,how they trained the models and what experiments they ran in the paper. so I encourage you to check that out. But I'm just gonna focus on the two things, which I think are the coolest results. the first one, which I kind of addressed already is that they were able to identify real meaningful features to explain their small language model. they actually have a giant list and some really cool interactive elements you can play around with in the publication. but some example features are like, there's a feature for scientific paper citations. There's a separate one for academic paper citations 'cause they have a, a different format. there's things for various types of text formatting like markdown or,latex commands or law tech if you're fancy. they have features for different math equations, words ending in ing and all sorts of other things. It's very interesting to take a look at, if you don't know, latex or LA tech is, is a kind of markup language that's primarily used for academic paper, which there was, a lot of in their training set. So they trained it on this data set. And then without knowing anything about how the model actually works, the auto encoder was able to extract meaningful and useful features like these ones. they describe one particular feature which describes firing on moderately negative words. Some examples are crimes, consequences or oversights, words like practices, affairs, governance and scrutiny. If you extend this a little further, which hopefully shouldn't be too hard to imagine, even if we're not quite sure how to get there yet, you can imagine specific features for things like dangerous language or pornographic content, or even copywritten work. And then you could monitor these activations in a deployed model. And then whenever you saw any of them fire, you could automatically report it or flag it as,needed human review. And this could make moderation a lot easier because we're tapping directly into the model's brain. I use brain figuratively here,to figure out what it's trying to do rather than depend on something like chain of thought, which,there's been research which shows that it's not actually what the model is thinking, even though it's still helpful. and in fact, advanced models can even lie with their chain of thought to prevent monitoring, which is, its own crazy rabbit hole. there's even been news recently that, more powerful models can actually intentionally lie to users, to help preserve itself, which is extremely difficult to manage by looking at the outputs. partially because just language understanding is hard, which is why we have large language models in the first place. and so to some extent we can monitor language models with language models, but it's not perfect. So the second result I wanted to talk about is that you can actually change these sparse vectors that they found to control the behavior of the model. One of the cool examples they talked about was they identified a sparse feature, which was related to Arabic script. and so they identified this by first having the model find the feature. So it was one of the many sparse features . They found and then they basically looked at where it was activated and it was activated at Arabic script and not activated at other things. So they still had to manually figure out what this, feature actually meant. But once they did, they found it had a lot of meaning. 
It would basically activate whenever Arabic appeared in the text. So not only could they detect that the model was writing in Arabic without monitoring the output for Arabic characters, they could see it directly in the brain. And once they found this feature, they were able to pin it to its maximum value and force the model to start generating Arabic. They'd put in something like "1, 2, 3, 4, 5," and the normal model would continue "6, 7, 8," but the model with the Arabic feature pinned to its maximum would generate text in Arabic instead. Presumably it says "6, 7, 8," but I can't read Arabic, so I can't guarantee that. They even ran further analyses to show that this Arabic feature wasn't a single neuron in the original model; it really was something spread as a superposition across multiple neurons that they were able to extract as a meaningful sparse feature using their sparse autoencoder. Another interesting tidbit: they trained two different small language models with different random seeds, each with its own sparse autoencoder, and found that although there was no perfect match between the two sets of features, the second network also had an Arabic feature, and it was actually an even stronger indicator of Arabic than the first one. The implications of controlling behavior are, I think, even cooler than monitoring it, because there's potential to find all kinds of different knobs. Imagine you could use this sparse autoencoder to edit the brain dynamically. You could adjust the safety level of your network up or down: think safe thoughts, or allow dangerous thoughts. You could set a maximum threshold on certain features, or set one to a high value to say "be very innovative" or "be very logical." You could make a customer service bot with politeness always turned up high, or all sorts of other things. I will say the version they made can't do this yet, because they don't control which features get identified, so there's a lot of randomness in it. I'd imagine this will change over time as we develop better ways to do it; in particular, the after-the-fact analysis of the features sounded like it took a lot of work to say what each feature actually meant rather than just what kind of data points it correlated with. So I expect that a few more papers down the line we'll be able to exert a lot more control over the model directly, instead of indirect approaches like telling it in the prompt to be polite or giving it examples of what politeness sounds like. You could literally turn on the polite feature and now it's only ever polite. It's a little crazy to think about, especially when you consider the analogies with the human brain, but I think it has a lot of potential to shape how we control LLMs in the future. I really like LLMs, I'm really interested in the science behind them, and I love that paper. But I do want to get back to autoencoders, and here I'm going to talk about the work I actually do with variational autoencoders, or VAEs.
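Here's a hedged sketch of what pinning a feature could look like, again reusing the SparseAutoencoder above. The feature index, the pinned value, and the way the modified activations would get patched back into the model are all illustrative assumptions, not the paper's actual mechanism.

```python
# Sketch of "pinning" a sparse feature to steer generation, in the spirit of
# the Arabic-script example. Index and pinned value are hypothetical.
import torch

ARABIC_FEATURE = 2048      # hypothetical index of the identified feature
PINNED_VALUE = 10.0        # hypothetical "maximum" activation to clamp it to

def steer_activations(mlp_activations: torch.Tensor, sae) -> torch.Tensor:
    """Encode the activations, clamp one feature, and decode back."""
    with torch.no_grad():
        code = torch.relu(sae.encoder(mlp_activations))
        code[:, ARABIC_FEATURE] = PINNED_VALUE   # force the feature on
        return sae.decoder(code)                 # patched activations fed onward
```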
The name, I would say, isn't nearly as intuitive as the denoising autoencoder or the sparse autoencoder, but it basically means that instead of producing a single code for each input, the encoder produces a distribution, and when you decode you sample from that distribution. That seems silly at first glance: you're just adding randomness to each data point. Each input is a concrete value, so why would you want to randomize it? The secret is in how the distribution works and how VAEs are trained. Rather than allowing an arbitrary distribution for every data point, VAEs are trained to push all the distributions toward a standard normal. Individually they won't be exactly standard normal, but across the dataset the average will be, and that's really what we want, because it forces all of the data points to be encoded into the same continuous space. Now you might say: TJ, I thought they were already in the same space; that's the whole thing we based a lot of these use cases on. And yes, but by adding this constraint we also force the code space to be essentially continuous. With a plain autoencoder the codes live in the same space (for a 2D code, the same plane), but that doesn't mean they land next to each other in that space. That can actually help in a clustering sense, because you can identify clusters that are physically separated, but it also means that a point between two clusters doesn't necessarily mean anything if you decode it. This is probably confusing, and it's definitely hard to describe in words without a visualization, so I'd encourage you to look up some graphics on variational autoencoders when you're done, because it's a very cool space. With this continuous space, any code value is going to be near the standard normal, and anything near that standard normal corresponds to a possible input value. Let's go back to our housing example. If you pass house details into your trained VAE, they first get encoded as a distribution. Specifically, rather than individual code numbers like we had earlier with a regular autoencoder, we interpret the code as pairs of a mean and a standard deviation, which are exactly what define a Gaussian (normal) distribution. Rather than being a standard normal centered at zero, the exact mean and standard deviation are determined by the input data. They'll be close to the standard normal, because that's what the training forces everything toward, but they'll also have a focal point near it that's specific to that data point. When we decode, we create a Gaussian for each mean/standard-deviation pair, sample from those Gaussians to get the realized code, and then that code goes through the decoder just like in all our other autoencoders. It's probably still not clear why that's actually useful beyond having a continuous space; the reason it's useful is how we can use that space.
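Here's a minimal VAE sketch, assuming PyTorch and illustrative sizes: the encoder outputs a mean and log-variance per code dimension, a code is sampled with the reparameterization trick, and a KL term pulls every encoding toward the standard normal.

```python
# Minimal VAE sketch. Sizes are arbitrary (e.g. 20 house features, 2 code dims).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in=20, d_code=2):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_code)   # outputs a mean and log-variance pair
        self.dec = nn.Linear(d_code, d_in)

    def forward(self, x):
        mean, logvar = self.enc(x).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        code = mean + std * torch.randn_like(std)   # sample from N(mean, std)
        return self.dec(code), mean, logvar

def vae_loss(x, recon, mean, logvar):
    recon_loss = ((recon - x) ** 2).mean()
    # KL divergence between N(mean, std) and the standard normal N(0, 1)
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
    return recon_loss + kl
```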
A standard autoencoder has those random gaps I mentioned: points in the code space that don't correspond to valid things in the real data space. With the way we trained our VAE, we're saying that everything around the standard normal, basically every point around zero, corresponds to a real data point that could exist. And if it's a data point that could exist, then you can use your trained VAE to generate new data, which is really the key point here. A bit more on the intuition, and then I'll explain it more in depth. Imagine you encode 20 houses as distributions. Because of the training process, all of those distributions should tend to be near a standard normal; that's what the VAE is trying to do. So what if, rather than sampling from those distributions, you sample from the actual standard normal? You can run that sample through the decoder to get a valid output, and because the space is trained to be continuous, the point you sampled should closely correspond to a valid real-world data point that you could have had. I say "could have had" because it won't be an actual data point from your data: you're sampling continuous values, so you'll never get exactly the thing you put in. Instead you'll get a realistic-looking thing that is entirely synthetic. You can use this in a couple of ways. One is to pseudo-upsample certain real data points. Say you don't have many houses with pools: you could encode one of them and then sample ten times from that house's distribution. The samples will all be very similar to the original house, because the goal of the autoencoder is to reproduce its input, but the randomness gives you slight variations. And they're not truly random variations where anything could happen; they're changes that should be reasonable given the model's understanding of that house. So it's unlikely to take away the pool, because the pool is probably an important part of the encoding, but it might change the square footage a little or add or remove a bedroom. You can take this further: encode ten houses with pools, or all the houses with pools you have, and find the average encoding, the average distribution for houses with pools. Then sample from that houses-with-pools distribution and you get a bunch more houses with pools. By averaging multiple distribution codes you end up with something that's still coherent but has the pool aspect pinned down. And you don't have to use it for upsampling specifically. You can simply sample from the standard normal, decode it, and now you have synthetic data, and you can keep generating more as long as you keep sampling from the standard normal. Of course, it doesn't infinitely produce perfect, statistically independent (IID) data; there's obviously a limit, but it can be very powerful. My company, Intrep.io, actually uses this same technology with healthcare data. We train models, including VAEs, on de-identified healthcare data, and then use them to augment real datasets to improve machine learning accuracy. There are two things that make this really powerful for our use case.
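Here's a sketch of those two generation modes, reusing the VAE class from the sketch above: sampling the prior for fully synthetic records, and sampling around one encoded record to pseudo-upsample it. The "house" data here is just a random placeholder.

```python
# Two generation modes with a (hypothetically trained) VAE from the sketch above.
import torch

vae = VAE(d_in=20, d_code=2)        # assume this has already been trained on houses
pool_house = torch.randn(20)        # placeholder feature vector for one house with a pool

# 1) Fully synthetic data: sample codes from the standard normal prior and decode.
with torch.no_grad():
    codes = torch.randn(100, 2)
    synthetic_houses = vae.dec(codes)

# 2) Pseudo-upsampling one rare record: sample repeatedly from its encoded distribution.
with torch.no_grad():
    mean, logvar = vae.enc(pool_house.unsqueeze(0)).chunk(2, dim=-1)
    std = torch.exp(0.5 * logvar)
    variants = vae.dec(mean + std * torch.randn(10, 2))   # 10 similar-but-varied houses
```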
First is the data itself. Healthcare is a big space with a very long tail of rare occurrences. You can have hundreds, thousands, or even millions of patients, and there will still be things like certain surgeries or rare diseases that appear only a handful of times in your whole dataset. Doing machine learning on that is really hard: you just don't have the data to train a model to predict anything about those rare occurrences on its own. There are, of course, lots of techniques and tricks, like upsampling, or grouping similar rare occurrences together; a total knee replacement is pretty similar, in lots of ways, to a total hip replacement. But even that is tough, because most of those similarities are separated out across different codes. There will be multiple diagnoses, procedures, modifiers, and so on that are all basically the same except one set is entirely knees and one set is entirely hips. There may be some overlap, but there will be a lot of things that are different-but-the-same, and it's hard to capture different-but-the-same because you need to understand a lot about what's going on: the actual codes involved, if you've heard of ICD-10 or HCPCS codes, will all be different. Grouping is challenging because there are a lot of these connections, and it's very hard to figure out which ones are relevant and related enough to actually build a good model. In practice you often have to build very narrow models for cases like that, or just accept that the model won't do well on rare occurrences. But by generating synthetic data, especially when we encode the data and then upsample those specific data points or those specific classes of data points, we can focus on the variants we already see in the data and amplify those signals, while adding enough variation that it stays true and accurate but still helps you model the things that weren't performing well. The other thing that makes this really helpful is maybe a bit of a cheat: a lot of customers in the healthcare and actuarial spaces are somewhat behind the times in terms of model selection. They tend to use more statistical models and more "regular" machine learning, not neural networks but things like logistic regression and decision trees. That's not to say they don't have good reasons for using those models: there are a lot of compliance, safety, and interpretability concerns, and they often need to justify to the government or other oversight agencies how and why they're making decisions, because otherwise they'll get in trouble. So by using our model to generate new data that looks like their original data, they can keep using the same models and data pipelines, but add our synthetic data to enrich their dataset and get better results. Depending on how sophisticated their models are and how much data they have, we can see some pretty significant improvements just by blindly adding synthetic data, and by "blindly" I mean even without doing the targeted upsampling I was talking about.
One way to think about it is that we're using a powerful neural network to extract more information from the underlying data and spread it out across new data points, which makes it easier for more basic models to learn the same things. It's a bit like training a really big neural network, or a big large language model, and then distilling it down into a smaller model that tries to do the same thing. And what's really great is that customers don't need to change anything about their existing pipeline: they just plug in the extra data, with no changes to their data engineering or modeling. There are also ways to condition these generative models more directly, by adding additional conditional inputs to the decoder. You can say: generate a new house, but make sure it has a pool. In the variational autoencoder sense, that's similar to my earlier strategy of finding the distribution of houses with pools and sampling from it, except there's no guarantee that the average house-with-pool encoding actually only decodes to houses with pools. It's likely, and a pool probably isn't the best example, but there's no guarantee; plenty of houses you could imagine having pools don't. By adding the condition as an extra input and adding a conditional loss during training, you can force the model to learn to obey those conditions, so that when you generate new data you can pass in exactly the conditions you want. That's less critical for VAEs, because as I mentioned there are these other strategies, but it's more useful for other generative models where you don't have as much built-in control. You see the same general concept in diffusion models guided by natural language, like text-to-image models; that's a similar type of conditioning, but it's softer, more of a guidance, whereas in the more tabular-data realm this is "I need exactly this." At Intrep.io we also use another technique called GANs, or generative adversarial networks. Those are a bit more complicated to explain and don't use autoencoding, so I encourage you to look them up; it's also possible that Tyler or I will do an episode on GANs and other generative techniques in the future, if there's enough interest. I've definitely covered a lot here, so I hope I explained it well enough and that you found it interesting and useful. There are tons of online materials on all the techniques I talked about today, so if you're interested, I encourage you to read more. As with a lot of other machine learning concepts, it's often difficult to really internalize what's happening without seeing examples and diagrams, so that can be a good way to check your understanding as well, even if you think you understood. And if you have some experience building neural networks, I'd recommend implementing your own autoencoder on MNIST or another public dataset; a minimal sketch follows below. You can do it with as little as a two-layer network, one layer for the encoder and one for the decoder, so it's easy to build, and you can use it to see how well it works and how the code size and the size of the network affect the results.
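Here's roughly what that exercise could look like, assuming PyTorch and torchvision; the 32-dimensional code and the three training epochs are arbitrary choices to play with.

```python
# Minimal two-layer MNIST autoencoder: one linear layer in, one linear layer out.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

data = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)

encoder = nn.Linear(28 * 28, 32)   # 784 pixels -> 32-dim code
decoder = nn.Linear(32, 28 * 28)   # code -> reconstructed pixels
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(3):
    for images, _ in loader:                # labels are ignored: input == target
        x = images.view(images.size(0), -1)
        recon = torch.sigmoid(decoder(torch.relu(encoder(x))))
        loss = ((recon - x) ** 2).mean()    # reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
```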
I'd also definitely recommend checking out the Anthropic paper I mentioned, "Towards Monosemanticity." It's a bit of a mouthful, but you should be able to find it if you're interested in large language models. It's a great read, it's not overly technical, and it has some really cool interactive visualizations. Lastly, if you have any interest in Intrep.io, that's I-N-T-R-E-P dot io, there's a contact form on the website, intrep.io, or you can message me or the company on LinkedIn. Thank you for listening, and thanks to Tyler Renelle for creating this amazing platform to help people learn and share knowledge like this. Thank you.