labels: education, history, vae, feature_learning, publication

https://discord.com/channels/729741769192767510/730095596861521970/1206121483877752862

Let's talk about feature learning. Back in the before times, there was this thing we used to do: we'd sit around the fire, and think about our problems and how to solve them. It was called "feature engineering," and it was a pain in the ass.

When I worked as a data scientist, like 70% of the effort of any problem was just making the data workable. "Workable" means something very different today, in the now-after times. The founders, in their great wisdom and laziness, dreamed of "end-to-end" solutions to their problems. The computer vision people had been especially bogged down in their feature engineering and were the first for whom their frustration overcame their laziness. And thus was born a machine that could learn its own features.

Let's consider a simple classification problem. You can model classification with logistic regression. Nice and simple, old-school statistics. Softmax is just multinomial logistic regression. Literally. Where do we often see softmax these days? Last stop on the processing pipeline for a deep neural network. So, pop off that last piece and what are we left with? On the one hand, we have one of the simplest possible classification models. And on the other hand, we have everything else. So it's completely valid to interpret that "everything else" as a machine that constructs complex features on which a simple classifier can operate. This is no big deal now. May even seem obvious. It was not. Not for a very long time.
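
To make that "pop off the last piece" framing concrete, here's a toy numpy sketch (the shapes, names, and random "features" are all invented for illustration): the last stop really is just a linear map plus softmax, operating on whatever features everything upstream produced.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability, exponentiate, normalize.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical shapes: 5 examples, 64 learned features, 3 classes.
features = rng.normal(size=(5, 64))  # output of "everything else" (the feature learner)
W = rng.normal(size=(64, 3))         # weights of the final linear layer
b = np.zeros(3)

# The "last stop on the pipeline": plain multinomial logistic regression
# over whatever features the rest of the network handed us.
probs = softmax(features @ W + b)
print(probs.sum(axis=1))  # each row sums to 1
```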

One of the common themes in feature engineering back in the early ML days was dimensionality reduction. You see, my lovelies, we didn't know about double descent at the time, and misguidedly believed that a consequence of the bias-variance decomposition was that it was bad to have overparameterized models. So, a common component of feature engineering pipelines used to be dimensionality reduction because it was believed that was a way to avoid overfitting. Crystallize the signal out of the data, throw away the noise. As the information density of the representation increases, so then should the generalizability of the model. Pack the input down into just its bare essentials. Find the latent. This line of thought naturally led people to an hourglass-architected MLP. Treat the bottleneck region as the condensed features, and the part after it as a second reconstruction component. Turns out, this procedure is essentially a kind of non-linear PCA.
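
Here's roughly what that hourglass looks like, as a minimal numpy sketch (layer sizes are arbitrary and the training loop is omitted): squeeze the input through a narrow bottleneck, reconstruct it on the other side, and score the whole thing on reconstruction error alone. With linear activations this recovers the same subspace as PCA, which is the sense in which the nonlinear version is "nonlinear PCA."

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hourglass: 20-dim input squeezed through a 3-dim bottleneck and back.
W_enc = rng.normal(size=(20, 3)) * 0.1
W_dec = rng.normal(size=(3, 20)) * 0.1

def encode(x):
    return np.tanh(x @ W_enc)           # the bottleneck: the "condensed features"

def decode(z):
    return z @ W_dec                    # the reconstruction half

x = rng.normal(size=(8, 20))            # a small batch of inputs
z = encode(x)
x_hat = decode(z)
recon_loss = np.mean((x - x_hat) ** 2)  # the only training signal an autoencoder needs
print(z.shape, recon_loss)
```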

The NLP people heard tell of the success the computer vision people had achieved by leaning into their laziness. They began to adopt tips and tricks that had been demonstrated by their CV brethren. Yet something was missing. Pictures had a natural numeric representation already, but words did not. NLP researchers dealt with this nuisance by counting things and treating words as counts. In the land where people are counting lots of things, the statistician is king. And so it was with NLP. "Counts can be modeled as Poisson distributed random variables!" proclaimed the computational linguists. "We can use our statistical models to understand language!" And this worked for a time. It sure beat the hell out of constructing parse trees and part-of-speech tagging and all that shit. But they also hadn't completely escaped that either.
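
"Treating words as counts" was very literal. A toy example (the documents and vocabulary are invented here): each document becomes a vector of term frequencies, and those counts are what the statistical models operated on.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Words as counts: each document becomes a vector of term frequencies
# over a fixed vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
counts = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(counts)  # e.g. "the" appears twice in each document
```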

Fortunately, some mythically lazy NLP researchers had observed the success the computer vision people had achieved, and they wanted in. The computer vision folks had VAEs, but they didn't understand their value yet. They were blinded by their feature engineering machines and ignored the magic of their VAEs, relegating them to tasks like clustering images.

Now, our mythically lazy NLP researchers, they were tired of their shitty features and wanted a simple solution to create good features. Features for our NLP researchers, as mentioned, were counts of specific words (or sequences of them), which meant that if you wanted to use these techniques, you had to pick your words ahead of time and ignore all the other words. Wordlists were all the rage: stop word lists, WordNet lemma graph, stemming rules. It was rough. And thus was born: Word2Vec.

Word2Vec was still a wordlist, but it was an extremely useful wordlist. And it mapped words not to counts but to dense vectors. The lazy researchers realized they could combine the VAE trick with masking to construct a reconstruction loss based on word context, and therefore representations based on word context. The Word2Vec authors shared their pretrained model with the world. Pre-trained embeddings became the hot new thing. ___2Vec was all you needed to get published.
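
A minimal sketch of the flavor of objective (skip-gram style; the vocabulary, sizes, and the single (center, context) pair are invented for illustration): the word vectors are just rows of a lookup table, trained so that a word's vector scores its actual context words highly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skip-gram-style setup.
vocab = ["the", "cat", "sat", "on", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 8
W_in = rng.normal(size=(len(vocab), dim)) * 0.1   # the dense vectors we keep: the "wordlist"
W_out = rng.normal(size=(len(vocab), dim)) * 0.1  # context-side vectors, discarded after training

def context_probs(center_word):
    # Score every vocabulary word as a possible context of `center_word`,
    # then softmax: the training signal is "predict the words around me".
    v = W_in[word2id[center_word]]
    scores = W_out @ v
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Loss for one (center, context) pair; training would sum this over a corpus.
p = context_probs("cat")
loss = -np.log(p[word2id["sat"]])
print(loss)
```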

"We're doing deep learning!" proclaimed the NLP researchers proudly slapping themselves on the back. But they were not. They were only doing shallow learning. Word2Vec, it turned out, was just doing an implicit matrix factorization. But nobody cared because they finally got to be lazy like their CV friends. In fact, they got to be even lazier. They soon realized they didn't even

need the fancy deep learning architectures to achieve most of their goals, as long as they started from a lookup table of pre-trained embeddings. "Word2Vec + logistic regression? Good enough!" said basically everyone. They had discovered transfer learning, and it was good.
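
The "good enough" recipe, sketched with stand-in vectors (a real pipeline would load actual pretrained Word2Vec embeddings; the documents, labels, and random vectors here are invented so the example stays self-contained): average the pretrained vectors of the words you recognize, call that your document representation, and hand it to logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in "pretrained" embeddings: random here, word2vec vectors in practice.
vocab = ["great", "terrible", "movie", "plot", "acting"]
emb = {w: rng.normal(size=16) for w in vocab}

def featurize(doc):
    # The lazy recipe: average the pretrained vectors of the words you know,
    # ignore everything else, and call that your document representation.
    vecs = [emb[w] for w in doc.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(16)

docs = ["great movie great acting", "terrible plot terrible acting",
        "great plot", "terrible movie"]
labels = [1, 0, 1, 0]

X = np.stack([featurize(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```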

The computer vision people had enjoyed their time with the VAE, but had decided they outgrew it. A simple reconstruction loss was not enough; they needed more losses. They moved on from encoder-decoder to generator-discriminator. Instead of the latent representation being the bottleneck, it was the input for both models. The computer vision people had discovered the z-space, and felt quite fancy there.
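
The structural flip, as a toy sketch (linear stand-in models, no training): the latent z stops being a bottleneck you squeeze the input through and becomes a noise vector you sample and push forward through the generator, with the discriminator scoring the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 16-dim z, 64-dim "images". All weights random; no training loop.
G = rng.normal(size=(16, 64)) * 0.1   # generator: z-space -> data space
D = rng.normal(size=(64, 1)) * 0.1    # discriminator: data space -> real/fake score

z = rng.normal(size=(8, 16))          # the latent is now something you *sample*,
x_fake = np.tanh(z @ G)               # not something you squeeze an input through
score = 1 / (1 + np.exp(-(x_fake @ D)))
print(score.ravel())
```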

But their models were getting chonkier and chonkier. Am I saying the StyleGAN architecture was a conspiracy by NVIDIA to get people to buy more compute? No, of course not, but if it was, it worked. And then from the z-space, came the w-space. (By the way, z = normal random variable, w = weights.)

The activation space, baby. I'm baked, w doesn't equal weights.

The GAN folks had been treating the input vector as their main "latent," but there was another dense feature representation they'd ignored. They'd forgotten that they could interpret deep networks as feature engineering machines, and discovered they'd been sitting on a pile of useful features inside the network. z-space, w-space, w+-space.
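
A toy sketch of how the three spaces relate (the sizes and the two-layer MLP here are stand-ins; StyleGAN's actual mapping network is an 8-layer MLP): z is the Gaussian you sample, w is what the mapping network turns it into (an activation, not a weight), and w+ just gives each synthesis layer its own independently editable copy of w.

```python
import numpy as np

rng = np.random.default_rng(0)

n_layers, dim = 6, 16                      # toy sizes
M1 = rng.normal(size=(dim, dim)) * 0.1
M2 = rng.normal(size=(dim, dim)) * 0.1

def mapping(z):
    # Stand-in mapping network: z is sampled noise, w is an activation of this MLP.
    h = np.maximum(0, z @ M1)
    return h @ M2

z = rng.normal(size=dim)                   # z-space: the Gaussian you sample
w = mapping(z)                             # w-space: the mapped latent
w_plus = np.tile(w, (n_layers, 1))         # w+-space: one copy of w per synthesis layer
print(z.shape, w.shape, w_plus.shape)
```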

Mechanistic interpretability was becoming all the rage. The VAE, however, was stuck in dimensionality reduction land. Beta-VAE, sparse VAE... that latent had to be as DENSE AS POSSIBLE.

One day, some computer vision researchers in Germany had an insight. Let's use the VAE to learn features for a GAN. The VAE had announced its triumphant return. It would be used to learn a feature dictionary for the GAN. And thus was born the VQGAN. Yadda yadda diffusion, yadda yadda Stable Diffusion.
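
The "feature dictionary" part, sketched on its own (codebook size and dimensions are made up; this is just the vector-quantization step at the heart of VQ-VAE/VQGAN, with the encoder, decoder, and adversarial loss stripped away): every encoder output gets snapped to its nearest codebook entry, so an image becomes a grid of dictionary indices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook: 32 learned "feature dictionary" entries, each an 8-dim vector.
codebook = rng.normal(size=(32, 8))

def quantize(z_e):
    # Snap each encoder output vector to its nearest codebook entry.
    # Downstream components only ever see these discrete dictionary entries,
    # never the raw encoder output.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

z_e = rng.normal(size=(5, 8))  # pretend encoder outputs for 5 spatial positions
z_q, idx = quantize(z_e)
print(idx)                     # the image is now a grid of dictionary indices
```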

Happily ever after. Thank you for attending my TED talk.
