Environment | Data | Training | Repository | Issues | Cite
Paper: Efficient Transformers with Dynamic Token Pooling
conda create -n dynamic-pooling python=3.8
conda activate dynamic-pooling
pip install -r requirements.txt
- Download & preprocess
- text8
bash scripts/get_text8.sh
- wiki40b
bash scripts/get_wiki40b.sh $lang
- where $lang is a language code, for example vi
- check Link for the abbreviations of other languages
- The script first downloads wiki40b into ./data/wiki40b/$lang/ and then applies our cleaners on top of it, based on the text8 cleaning rules. The final training data sits under ./data/wiki40b/$lang/text8. On some systems, errors may occur when downloading wiki40b using datasets. In that case, once you manage to get the data, just apply our cleaners to it.
- Train Unigram
python tokenizer_data/train_tokenizer.py $vocab_size $dataset
- $vocab_size is the integer target vocab size of Unigram
- $dataset is text8 for text8, wiki40b/$lang/text8 for wiki40b
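If the wiki40b download fails and you need to clean text yourself, the idea behind text8-style cleaning can be sketched as below. This is only an illustration, not the repository's cleaner: it assumes the commonly described text8 rules (lowercase, digits spelled out, everything outside a-z replaced by a single space), and the function name clean_text8_style is hypothetical. The actual rules in ./cleaners/ may differ, in particular for languages with non-Latin or accented characters, which this sketch would strip.

```python
import re

# Digits are spelled out and padded with spaces (assumed text8 behaviour).
DIGIT_WORDS = {
    "0": " zero ", "1": " one ", "2": " two ", "3": " three ", "4": " four ",
    "5": " five ", "6": " six ", "7": " seven ", "8": " eight ", "9": " nine ",
}


def clean_text8_style(text: str) -> str:
    """Hypothetical helper: normalise raw text to lowercase a-z and spaces."""
    text = text.lower()
    # Replace every digit with its spelled-out word.
    text = re.sub(r"[0-9]", lambda m: DIGIT_WORDS[m.group()], text)
    # Collapse every run of characters outside a-z into a single space.
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_text8_style("Hello, World! 42 cats."))
```

For real training data, prefer the cleaners shipped in the repository so that the output matches the format the configs expect.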
- Training by default starts with a simple test that checks the autoregressive property of the model. We support gradient accumulation, distributed training, and half-precision training.
- To run training use:
C=configs/whitespaces.yaml GPUS= bash scripts/run_exp.sh
- C -> defines the path to the config
- GPUS -> defines the number of GPUs for a distributed run; when not given, training runs on a single GPU/CPU
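The autoregressive check mentioned above can be illustrated with a toy example. The sketch below is not the test used in this repository; it only shows the general idea under simple assumptions: a model is treated as a plain function from a token list to one output per position, and the names causal_model, leaky_model, and is_autoregressive are hypothetical. The check perturbs all tokens after position t and verifies that outputs up to and including t stay the same.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[int]], List[int]]


def causal_model(tokens: Sequence[int]) -> List[int]:
    """Toy model: output t is the running sum of tokens[0..t]."""
    out, total = [], 0
    for tok in tokens:
        total += tok
        out.append(total)
    return out


def leaky_model(tokens: Sequence[int]) -> List[int]:
    """Toy model that peeks one token ahead, so it is not autoregressive."""
    n = len(tokens)
    return [tokens[t] + (tokens[t + 1] if t + 1 < n else 0) for t in range(n)]


def is_autoregressive(model: Model, tokens: Sequence[int], replacement: int = 999) -> bool:
    """Return True if changing future tokens never changes past outputs.

    `replacement` must differ from the tokens it overwrites, otherwise
    the perturbation has no effect and a leak could go unnoticed.
    """
    base = model(tokens)
    for t in range(len(tokens) - 1):
        # Keep tokens[0..t], overwrite everything after position t.
        perturbed = list(tokens[: t + 1]) + [replacement] * (len(tokens) - t - 1)
        if model(perturbed)[: t + 1] != base[: t + 1]:
            return False
    return True


if __name__ == "__main__":
    sample = [3, 1, 4, 1, 5]
    print(is_autoregressive(causal_model, sample))
    print(is_autoregressive(leaky_model, sample))
```

With a neural model the same idea applies to tensors, typically comparing outputs up to a small numerical tolerance instead of exact equality.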
The repository is a fork of: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/Transformer-XL
We decided to fork from the NVIDIA implementation of Transformer-XL because Transformer-XL is a strong and established baseline in language modelling, and the NVIDIA code is well optimised for current hardware.
- ./configs/
- we've prepared configs for all models presented in our work, i.e., Vanilla, Fixed, Entropy, Unigram, Whitespaces, Gumbel
- ./tokenizer_data/
- Pretrained tokenizers, built with the HuggingFace/SentencePiece libraries, for all datasets we tested in the paper. You can train them yourself by running:
python ./tokenizer_data/train_tokenizer.py $ARGS
- Args are defined in ./tokenizer_data/train_tokenizer.py
- ./cleaners/
- Implementation of the preprocessing rules applied to the raw wiki40b and cc-100 datasets
- Boundary Predictor:
- {Vanilla, Fixed, Whitespaces}
- These approaches do not need a boundary predictor. Boundaries are extracted from the data itself in boundary_creator.py and then used in the DataLoader.
- {Unigram}
- Segmentation based on Unigram needs a Boundary Predictor, because Unigram itself is not autoregressive. We teach the Boundary Predictor module defined in hourglass.py to predict the Unigram segmentation. The Boundary Predictor is autoregressive, which makes the whole model autoregressive as well. The Unigram segmentation is extracted in boundary_creator.py.
- {Entropy, Gumbel}
- These approaches are end-to-end and use the main model to train the Boundary Predictor. The entire logic is implemented in hourglass.py.
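To make the idea of boundaries extracted from the data more concrete, here is a minimal sketch of how Fixed, Whitespaces, and Unigram-style segmentations could be turned into per-character boundary indicators. It is not the code from boundary_creator.py. All function names are hypothetical, and the convention used here (a 1 marks the last character of a segment, and a whitespace character closes its segment) is an assumption made for illustration. The Unigram variant assumes the pieces are plain substrings that concatenate back to the text; real tokenizer output may carry marker characters that would need to be handled first.

```python
from typing import List


def fixed_boundaries(length: int, k: int) -> List[int]:
    """Fixed: close a segment after every k characters."""
    return [1 if (i + 1) % k == 0 else 0 for i in range(length)]


def whitespace_boundaries(text: str) -> List[int]:
    """Whitespaces: close a segment at every space character."""
    return [1 if ch == " " else 0 for ch in text]


def boundaries_from_pieces(pieces: List[str]) -> List[int]:
    """Unigram-style: close a segment at the last character of each piece."""
    out: List[int] = []
    for piece in pieces:
        if piece:  # skip empty pieces
            out.extend([0] * (len(piece) - 1) + [1])
    return out


def split_by_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Recover the segments implied by a 0/1 boundary vector."""
    segments, start = [], 0
    for i, b in enumerate(boundaries):
        if b:
            segments.append(text[start : i + 1])
            start = i + 1
    if start < len(text):
        segments.append(text[start:])  # trailing segment without a closing 1
    return segments


if __name__ == "__main__":
    text = "ab cd e"
    print(split_by_boundaries(text, fixed_boundaries(len(text), 3)))
    print(split_by_boundaries(text, whitespace_boundaries(text)))
```

For Fixed and Whitespaces such vectors can be computed directly from the input, which is why no predictor is needed. For Unigram, a vector like the one from boundaries_from_pieces would serve as the target that the Boundary Predictor learns to predict.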
If you have any questions or problems with the codebase, feel free to raise a GitHub issue or contact me directly at: piotr.nawrot@ed.ac.uk
@misc{nawrot2022dynamic,
title={Efficient Transformers with Dynamic Token Pooling},
author={Piotr Nawrot and Jan Chorowski and Adrian Łańcucki and Edoardo M. Ponti},
year={2022},
eprint={2211.09761},
archivePrefix={arXiv},
primaryClass={cs.CL}
}