labels: education, history, vae, feature_learning, publication

https://discord.com/channels/729741769192767510/730095596861521970/1206121483877752862

### Feature Engineering

Let's talk about feature learning. Back in the before times, there was this thing we used to do: we'd sit around the fire, and think about our problems and how to solve them. It was called "feature engineering," and it was a pain in the ass.

When I worked as a data scientist, like 70% of the effort on any problem was just making the data workable. "Workable" means something very different today, in the now-after times. The founders, in their great wisdom and laziness, dreamed of "end-to-end" solutions to their problems. The computer vision people had been especially bogged down in their feature engineering and were the first whose frustration overcame their laziness. And thus was born a machine that could learn its own features.

### Automatic Feature Learning

Let's consider a simple classification problem. You can model classification with logistic regression. Nice and simple, old-school statistics. Softmax is just multinomial logistic regression. Literally. Where do we often see softmax these days? Last stop on the processing pipeline for a deep neural network. So, pop off that last piece and what are we left with? On the one hand, we have one of the simplest possible classification models. And on the other hand, we have everything else. So it's completely valid to interpret that "everything else" as a machine that constructs complex features on which a simple classifier can operate. This is no big deal now. May even seem obvious. It was not. Not for a very long time.
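Here's a minimal PyTorch sketch of that decomposition (layer sizes and the fake batch are invented for illustration): everything before the last layer is the feature-construction machine, and the last layer plus softmax is plain multinomial logistic regression.

```python
import torch
import torch.nn as nn

# Hypothetical deep classifier, split into its two conceptual parts.
# Everything before the last layer: the machine that learns features.
feature_extractor = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
)
# The last layer + softmax: ordinary multinomial logistic regression.
classifier_head = nn.Linear(64, 10)

x = torch.randn(32, 784)         # fake batch of flattened images
features = feature_extractor(x)  # learned features
probs = classifier_head(features).softmax(dim=-1)  # old-school classifier
```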

### Dimensionality Reduction in Early ML

One of the common themes in feature engineering back in the early ML days was dimensionality reduction. You see, my lovelies, we didn't know about double descent at the time, and misguidedly believed that a consequence of the bias-variance decomposition was that overparameterized models were bad. So dimensionality reduction became a standard component of feature engineering pipelines, because it was believed to be a way to avoid overfitting. Crystallize the signal out of the data, throw away the noise. As the information density of the representation increases, so should the generalizability of the model. Pack the input down into just its bare essentials. Find the latent. This line of thought naturally led people to an hourglass-architected MLP trained to reconstruct its own input: treat the bottleneck region as the condensed features, and the part after it as the reconstruction component. Turns out, this procedure, the autoencoder, is essentially a kind of non-linear PCA.
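A sketch of that hourglass, with made-up sizes: squeeze the input through a narrow bottleneck, then try to reconstruct it. With purely linear layers and squared error this recovers the PCA subspace; the nonlinearities are what make it "non-linear PCA."

```python
import torch
import torch.nn as nn

bottleneck = 8  # "just the bare essentials"

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, bottleneck))
decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(64, 784)                      # fake data
z = encoder(x)                                # the condensed features
loss = nn.functional.mse_loss(decoder(z), x)  # reconstruction objective
```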

## Advent of Word Embeddings

The NLP people heard tell of the success the computer vision people had achieved by leaning into their laziness. They began to adopt tips and tricks that had been demonstrated by their CV brethren. Yet something was missing. Pictures had a natural numeric representation already, but words did not. NLP researchers dealt with this nuisance by counting things and treating words as counts. In the land where people are counting lots of things, the statistician is king. And so it was with NLP. "Counts can be modeled as Poisson distributed random variables!" proclaimed the computational linguists. "We can use our statistical models to understand language!" And this worked for a time. It sure beat the hell out of constructing parse trees and part-of-speech tagging and all that shit. But they hadn't completely escaped all that either.
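For the uninitiated, "treating words as counts" looks something like this (a toy scikit-learn sketch; the documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each document becomes a vector of word frequencies -- the kind of count
# data a statistician might model as Poisson draws.
docs = ["the cat sat", "the cat sat on the cat"]
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())  # rows: documents, columns: per-word counts
```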

Fortunately, some mythically lazy NLP researchers had observed the success the computer vision people had achieved, and they wanted in. The computer vision folks had VAEs, but they didn't understand their value yet. They were blinded by their feature engineering machines and ignored the magic of their VAEs, relegating them to tasks like clustering images.
Word2Vec was still a wordlist, but it was an extremely useful wordlist. And it m…

"We're doing deep learning!" proclaimed the NLP researchers proudly slapping themselves on the back. But they were not. They were only doing shallow learning. Word2Vec, it turned out, was just doing an implicit matrix factorization. But nobody cared because they finally got to be lazy like their CV friends. In fact, they got to be even lazier. They soon realized they didn't even need the fancy deep learning architectures to achieve most of their goals, as long as they started from a lookup table of pre-trained embeddings. "Word2Vec + logistic regression? Good enough!" said basically everyone. They had discovered transfer learning, and it was good.

### VAEs to GANs

The computer vision people had enjoyed their time with the VAE, but decided they had outgrown it. A simple reconstruction loss was not enough; they needed more losses. They moved on from encoder-decoder to generator-discriminator. Instead of the latent representation being squeezed through a bottleneck in the middle, it became the input the generator started from. The computer vision people had discovered the z-space, and felt quite fancy there.

But their models were getting chonkier and chonkier. Am I saying the StyleGAN architecture was a conspiracy by NVIDIA to get people to buy more compute? No, of course not, but if it was, it worked. And then from the z-space came the w-space. The activation space, baby.

The GAN folks had been treating the input vector as their main "latent," but there was another dense feature representation they'd ignored. They'd forgotten that they could interpret deep networks as feature engineering machines, and discovered they'd been sitting on a pile of useful features inside the network. Z-space, w-space, w+ space.
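Roughly how those spaces relate, in a StyleGAN-flavored sketch (dimensions and the mapping network's depth are invented): z is sampled noise, the mapping network turns it into the activation w, and w+ is just an independent w per synthesis layer.

```python
import torch
import torch.nn as nn

dim, num_layers = 512, 14

# The mapping network: turns noise into a learned latent (an activation,
# not weights -- the real StyleGAN version is a deeper MLP).
mapping = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

z = torch.randn(1, dim)                           # z-space: input noise
w = mapping(z)                                    # w-space: learned latent
w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)  # w+ space: one w per layer
```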

Mechanistic interpretability was becoming all the rage. The VAE, however, was stuck in dimensionality reduction land. Beta-VAE, sparse VAE... that latent had to be as DENSE AS POSSIBLE.
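The beta-VAE version of that squeeze, sketched as a loss function: the ordinary VAE objective, with the KL term scaled up by beta to press the latent even harder.

```python
import torch

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Reconstruction + beta * KL(q(z|x) || N(0, I)): a sketch of the
    beta-VAE objective (beta > 1 squeezes the latent harder)."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```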

### The Return of the ~~King~~ VAE

One day, some computer vision researchers in Germany had an insight: let's use the VAE to learn features for a GAN. The VAE had announced its triumphant return. It would learn a feature dictionary for the GAN. And thus was born the VQGAN. Yadda yadda diffusion, yadda yadda Stable Diffusion.
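The feature-dictionary idea at the heart of that move is vector quantization: snap each encoder output to its nearest entry in a learned codebook. A sketch with invented sizes (the real thing also needs a straight-through gradient and commitment losses):

```python
import torch

codebook = torch.randn(1024, 256)  # learned dictionary of visual features
z = torch.randn(16, 256)           # encoder outputs for 16 image patches

dists = torch.cdist(z, codebook)   # distance from each patch to each code
codes = dists.argmin(dim=-1)       # index of the nearest dictionary entry
z_q = codebook[codes]              # quantized features, fed to the decoder
```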

Happily ever after. Thank you for attending my TED talk.
