This repo includes code to reproduce all results in the above NeurIPS paper, authored by Jiaji Huang, Qiang Qiu and Kenneth Church.
We used Python 3.8.5, but nearby versions should also work. Install all required packages with
pip install --upgrade pip
pip install -r requirements.txt
We used CUDA 10.2.89, but any version that meets PyTorch's requirements should also work.
We highlight some major results, so that readers do not have to read the paper to grasp the main ideas. Concisely, the paper tries to answer the question:
"Can we use a checkpoint zoo to build something that better adapts to unseen tasks?"
To answer the question, first we need to understand the geometry of a space of tasks.
In the paper, we model the tasks as following a Gaussian process. Its covariance is computed by applying kernel alignment to extracted features. The features are obtained by feeding probe data into checkpoints, each trained for one task. For example, using 34 checkpoints from Hugging Face models, we can estimate the 34x34 covariance of their corresponding tasks.
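The covariance estimate described above can be sketched as follows. This is a minimal illustration, not the code in this repo: it assumes a linear kernel with centering, and the function names (`centered_gram`, `kernel_alignment`, `task_covariance`) and the random stand-in features are hypothetical. See LMs/README.md for the actual pipeline.

```python
import numpy as np


def centered_gram(features):
    """Linear-kernel Gram matrix over the probe examples, double-centered.

    `features` has shape (num_probe_examples, feature_dim). The feature
    dimension may differ between checkpoints; the Gram matrix does not.
    """
    gram = features @ features.T
    n = gram.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    return centering @ gram @ centering


def kernel_alignment(gram_a, gram_b):
    """Normalized Frobenius inner product of two centered Gram matrices."""
    return float(np.sum(gram_a * gram_b)
                 / (np.linalg.norm(gram_a) * np.linalg.norm(gram_b)))


def task_covariance(feature_list):
    """Pairwise alignment between checkpoints, used as a task covariance."""
    grams = [centered_gram(f) for f in feature_list]
    n = len(grams)
    cov = np.eye(n)  # each checkpoint is perfectly aligned with itself
    for i in range(n):
        for j in range(i + 1, n):
            cov[i, j] = cov[j, i] = kernel_alignment(grams[i], grams[j])
    return cov


# Stand-in data: 3 hypothetical checkpoints, 20 shared probe examples.
rng = np.random.default_rng(0)
demo_features = [rng.normal(size=(20, dim)) for dim in (8, 16, 32)]
print(task_covariance(demo_features).round(3))
```

All checkpoints must see the same probe examples in the same order, since the alignment compares how each checkpoint arranges those examples relative to one another.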
To reproduce the above figure, refer to LMs/README.md.
We hypothesize that representative tasks generalize better to new tasks. This, of course, needs a rigorous mathematical proof. Empirically, however, we find that it holds, as indicated by the experiments on NLP and vision tasks.
So, how do we identify representative tasks? They are supposed to convey the most information about the rest of the task space. We formulate the problem as a Max-Mutual-Information (MMI) objective. The solver takes the covariance as input and greedily picks representative tasks.
Using the 34x34 covariance matrix, we find that the 5 most representative tasks are those corresponding to roberta-base, distilbert-base-uncased, t5-base, bert-base-cased and bart-large. Combining these checkpoints yields superior results on 8 new linguistic tasks. Below is an example for the chunking task.
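A greedy selection of this kind can be sketched as below. This is an assumption-laden illustration rather than the repo's solver: it uses a standard greedy rule for Gaussian mutual information (pick the task whose variance given the selected set is large relative to its variance given all remaining tasks), and the names `conditional_variance` and `greedy_mmi`, the `jitter` term, and the toy covariance are hypothetical.

```python
import numpy as np


def conditional_variance(cov, target, given, jitter=1e-8):
    """Variance of task `target` given the tasks in `given` (Gaussian case)."""
    if len(given) == 0:
        return cov[target, target]
    sub = cov[np.ix_(given, given)] + jitter * np.eye(len(given))
    cross = cov[given, target]
    return cov[target, target] - cross @ np.linalg.solve(sub, cross)


def greedy_mmi(cov, k):
    """Greedily pick k task indices under a mutual-information criterion."""
    n = cov.shape[0]
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for y in range(n):
            if y in selected:
                continue
            rest = [j for j in range(n) if j != y and j not in selected]
            # Large numerator: y is not yet explained by the selected tasks.
            # Small denominator: y is well explained by the unselected
            # tasks, so y in turn carries information about them.
            num = conditional_variance(cov, y, selected)
            den = conditional_variance(cov, y, rest)
            score = num / max(den, 1e-12)
            if score > best_score:
                best, best_score = y, score
        selected.append(best)
    return selected


# Toy covariance: tasks 0-2 are strongly correlated, task 3 is independent.
toy_cov = np.eye(4)
toy_cov[:3, :3] = 0.9
np.fill_diagonal(toy_cov, 1.0)
print(greedy_mmi(toy_cov, k=2))
```

In the toy example, the first pick comes from the correlated group, because a task in that group says more about the others than the isolated task does.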
To reproduce full results, check LMs/README.md for details.
The observation holds for vision tasks too. Below is an experiment on CIFAR-100. MMI shows a steady gain over random selection, and outperforms another baseline.
To reproduce all results, check vision/README.md for details.
Note: This project requires running many small jobs, so a cluster managed by Slurm, which can launch jobs in parallel, is very useful. In the job-launching scripts, you will see multiple commands like
sbatch -p $partition --gres=gpu:1 --wrap "python run.py" -o $job_log_path
If you do not have such a cluster, just use
python run.py > $job_log_path
instead.
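If you would rather not edit each launch line by hand, the choice between the two commands above can be wrapped in a small helper. This is only a sketch: `build_launch_command` and `launch` are hypothetical names that do not exist in this repo, and the `sbatch` flags are copied from the command shown above.

```python
import shutil
import subprocess


def build_launch_command(log_path, partition=None, use_slurm=None):
    """Return the argument list for one job, with or without Slurm."""
    if use_slurm is None:
        # Assume Slurm is usable whenever `sbatch` is on the PATH.
        use_slurm = shutil.which("sbatch") is not None
    if use_slurm:
        cmd = ["sbatch"]
        if partition:
            cmd += ["-p", partition]
        cmd += ["--gres=gpu:1", "--wrap", "python run.py", "-o", log_path]
        return cmd
    return ["python", "run.py"]


def launch(log_path, partition=None, use_slurm=None):
    """Submit the job to Slurm, or run it locally and redirect stdout."""
    cmd = build_launch_command(log_path, partition, use_slurm)
    if cmd[0] == "sbatch":
        subprocess.run(cmd, check=True)
    else:
        with open(log_path, "w") as log_file:
            subprocess.run(cmd, stdout=log_file, check=True)
```

Without Slurm the jobs run one after another, so expect the full set of experiments to take correspondingly longer.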