This README provides guidelines on how to structure and prepare aligned multimodal training datasets.
We recommend organizing training data using a modified version of the WebDataset format. With this format, the dataset is split into tarfiles, with each tarfile containing 1'000 to 10'000 samples from one modality (e.g., RGB, caption, depth, etc.). This data can be stored either on a local disk or in common cloud object stores like S3. Storing the data as tarfiles reduces the number of object read requests when streaming directly from cloud buckets. The data is organized as follows:
```
root/modality_a/shard-00000.tar
root/modality_a/shard-00001.tar
root/modality_a/shard-00002.tar
root/modality_b/shard-00000.tar
root/modality_b/shard-00001.tar
root/modality_b/shard-00002.tar
```
Here, `modality_a` and `modality_b` are placeholders for the names of the modalities (e.g., `rgb`, `caption`, or more specific modality names for dataset versioning).
Each tarfile expands into individual files with arbitrary names but the same extension, such as:
```
xxx.ext
xxy.ext
xxz.ext
```
The file extension varies depending on the modality (e.g., `.jpg` for RGB images, `.txt` or `.json` for captions, `.npy` for pre-computed tokens, etc.).
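As a minimal sketch of how such shards could be produced (the source folders, shard size, and `write_shard` helper below are hypothetical and not part of this codebase), files with aligned basenames simply need to be packed into per-modality tar files in the same order:

```python
import tarfile
from pathlib import Path

def write_shard(shard_path: Path, files: list[Path]) -> None:
    """Pack the given files into one tar shard, keeping only their basenames."""
    shard_path.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(shard_path, "w") as tar:
        for f in files:
            tar.add(f, arcname=f.name)

# Hypothetical source folders with aligned basenames, e.g. 0000001.jpg / 0000001.txt
rgb_files = sorted(Path("raw/rgb").glob("*.jpg"))
caption_files = sorted(Path("raw/caption").glob("*.txt"))

shard_size = 1000  # 1'000 to 10'000 samples per shard
for i in range(0, len(rgb_files), shard_size):
    idx = i // shard_size
    write_shard(Path(f"root/rgb/shard-{idx:05d}.tar"), rgb_files[i : i + shard_size])
    write_shard(Path(f"root/caption/shard-{idx:05d}.tar"), caption_files[i : i + shard_size])
```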
To load aligned samples from all modalities, make sure that filenames are identical for all modalities, except for the modality name and file extensions. Shards should also be ordered numerically (as shown above) to support brace-expand notation. New modalities can be easily added by creating a new folder with tarfiles in the same directory. Existing modalities can also be modified by updating their specific tarfiles or creating new ones.
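For illustration, a minimal loading sketch using the third-party `webdataset` package might look as follows (the shard paths and modality names are placeholders; it assumes both modalities were written in the same order and shuffling is disabled):

```python
import webdataset as wds

# Brace-expand notation selects shards 00000 to 00002 of each modality.
rgb_ds = wds.WebDataset("root/rgb/shard-{00000..00002}.tar", shardshuffle=False).decode("pil")
cap_ds = wds.WebDataset("root/caption/shard-{00000..00002}.tar", shardshuffle=False)

for rgb_sample, cap_sample in zip(rgb_ds, cap_ds):
    # Identical filenames (minus extension) yield identical sample keys across modalities.
    assert rgb_sample["__key__"] == cap_sample["__key__"]
    image = rgb_sample["jpg"]             # decoded PIL image
    caption = cap_sample["txt"].decode()  # raw bytes unless a decoder is applied
    break
```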
For smaller datasets that can be stored locally, we also support a simpler hierarchical structure. This is convenient for datasets like validation sets or transfer datasets. The data structure is as follows:
```
root/modality_a/folder_x/xxx.ext
root/modality_a/folder_y/xxy.ext
root/modality_a/folder_z/xxz.ext
root/modality_b/folder_x/xxx.ext
root/modality_b/folder_y/xxy.ext
root/modality_b/folder_z/xxz.ext
```
The folder and file names can be arbitrary as long as they are aligned across modalities.
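A minimal sketch of collecting aligned samples from this layout (the modality names and extensions below are placeholders) could be:

```python
from pathlib import Path

root = Path("root")

def index_modality(modality: str, ext: str) -> dict:
    """Map relative paths without extension (e.g. 'folder_x/xxx') to file paths."""
    base = root / modality
    return {str(p.relative_to(base).with_suffix("")): p for p in base.rglob(f"*{ext}")}

rgb = index_modality("rgb", ".jpg")
captions = index_modality("caption", ".txt")

# Keep only the samples that are present in both modalities.
aligned_keys = sorted(rgb.keys() & captions.keys())
pairs = [(rgb[k], captions[k]) for k in aligned_keys]
```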
We use the following datasets to train and/or evaluate 4M models. For pre-training:
For transfers and evaluations:
Please refer to their respective pages for instructions on how to download them and license information.
Starting from text-image pairs, we use pseudo labeling to create an aligned multimodal dataset across all training modalities. For this purpose, we use the off-the-shelf networks listed in the table below. Please refer to their respective pages for inference instructions and license information.
| Modality | Model | Homepage |
|---|---|---|
| Depth | Omnidata DPT-B-Hybrid (v2) | link |
| Surface normals | Omnidata DPT-B-Hybrid (v2) | link |
| Semantic segmentation | Mask2Former Swin-B | link |
| Bounding boxes | ViTDet ViT-H with Cascade Mask-RCNN | link |
| CLIP features | CLIP ViT-B/16 | link |
| DINOv2 features | DINOv2 ViT-B/14 | link |
| ImageBind features | ImageBind ViT-H/14 | link |
| SAM instances | SAM ViT-H | link |
| 3D human poses & shape | HMR2.0 | link |
| Color palette | PyPalette | link |
During training, all modalities are mapped to sets or sequences of discrete tokens using modality-specific tokenizers. Please refer to README_TOKENIZATION.md for more information. To avoid dataloading and tokenization becoming a training bottleneck, we pre-compute the tokens of all image-like modalities once before training (i.e., pre-tokenization) and then directly load the tokens.
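To illustrate what this buys, a pre-tokenized sample reduces to a cheap array read at training time (the paths, shapes, and `vqvae` name below are assumptions for illustration only):

```python
import numpy as np

# Without pre-tokenization, every sample would pass through a tokenizer at training time:
#   tokens = vqvae.encode(load_image("root/rgb/folder_x/xxx.jpg"))   # slow, GPU-bound
# With pre-tokenization, the discrete tokens are simply loaded from disk:
tokens = np.load("root/rgb_tok/folder_x/xxx.npy")  # e.g. a 1D array of token indices
```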
To pre-tokenize any modality, run the provided `save_vq_tokens.py` script with the appropriate arguments.
ℹ️ For non-square images or if `--n_crops` is greater than 1, pre-tokenization requires cropping the original image. Therefore, to ensure that the tokens from all modalities are aligned, we automatically create a `crop_settings` directory with the crop information for all samples the first time a dataset is tokenized. This information is then used when tokenizing the same dataset with a different modality.
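The mechanism can be sketched as follows (the file format and `get_crop` helper shown here are illustrative only; the actual crop settings are written by `save_vq_tokens.py`):

```python
import numpy as np
from pathlib import Path

def get_crop(settings_path: Path, height: int, width: int, crop_size: int):
    """Return a (top, left) crop corner, caching it on first use so that all
    modalities of the same sample are cropped identically."""
    if settings_path.exists():
        top, left = np.load(settings_path)  # reuse the crop chosen for the first modality
    else:
        top = np.random.randint(0, max(height - crop_size, 0) + 1)
        left = np.random.randint(0, max(width - crop_size, 0) + 1)
        settings_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(settings_path, np.array([top, left]))  # cache for the other modalities
    return int(top), int(left)
```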