
Major refactor (inc adding Pydantic) #16

Merged: 8 commits into main, Nov 25, 2024
Conversation

@danbraunai (Owner) commented Nov 23, 2024

Description

  • Adds a models directory with model implementations and default configs
  • Uses Pydantic for the base config (rather than argparse)
  • Allows users to pass CLI flags to update the config. These override the config file (and/or the Pydantic defaults); a minimal sketch of this precedence is shown below.
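
As an illustration of that precedence, here is a minimal sketch of a Pydantic base config with a nested sub-config. The class and field names (other than train_dataset_config.name) are hypothetical, not the repo's actual schema:

from pydantic import BaseModel

class DatasetConfig(BaseModel):  # hypothetical nested config
    name: str = "default_dataset"
    seq_len: int = 1024

class TrainConfig(BaseModel):  # hypothetical base config
    lr: float = 3e-4
    compile: bool = True
    train_dataset_config: DatasetConfig = DatasetConfig()

# Precedence: Pydantic defaults < values from the YAML file < CLI overrides.
yaml_values = {"lr": 1e-3}                                        # loaded from PATH/TO/CONFIG.yaml
cli_overrides = {"train_dataset_config": {"name": "my_dataset"}}  # from --train_dataset_config.name my_dataset
config = TrainConfig(**{**yaml_values, **cli_overrides})          # (a real implementation would deep-merge nested dicts)
print(config.train_dataset_config.name)  # my_dataset
print(config.lr)                         # 0.001 (from the YAML file)
print(config.compile)                    # True (Pydantic default)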

New usage (which is also documented in the README and the train_llama.py script):

python train_llama.py [PATH/TO/CONFIG.yaml] [--key1 value1 --key2 value2 ...]

where

  • PATH/TO/CONFIG.yaml contains the training config. If no path is provided, a default config will be used.
  • --key1 value1 --key2 value2 ... override values in the config. Note that if you wish to update a
    nested value, you must use dotted notation (e.g. --train_dataset_config.name my_dataset). An example invocation is shown below.
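
For example, overriding both a top-level and a nested value might look like this (the config path and the --lr flag are illustrative; only the dotted override comes from the note above):

python train_llama.py configs/my_config.yaml --lr 0.001 --train_dataset_config.name my_dataset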

To run on multiple GPUs, use

torchrun --standalone --nproc_per_node=N train_llama.py ...

where N is the number of GPUs to use.
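
The "..." stands for the same config path and overrides as in the single-GPU usage, so a hypothetical 2-GPU run with an override might look like:

torchrun --standalone --nproc_per_node=2 train_llama.py configs/my_config.yaml --train_dataset_config.name my_dataset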

Motivation and Context

It's currently quite hard to understand and work with the training scripts. This PR aims to move us towards a library that is nicer to use and contribute to.

How Has This Been Tested?

  • A unittest for one of the added functions, utils.convert_dotted_args_to_nested_dict (a sketch of the expected behaviour is shown below).
  • No tests were written for the general config behaviour (it looks good after quick manual testing); these could be added.
  • Single- and multi-GPU training works following the instructions in the README and the train_llama.py script. The WandB train loss goes down as desired.
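
For reference, here is a minimal sketch of the behaviour implied by the function name and the dotted-notation overrides described above; the actual implementation in utils may differ:

def convert_dotted_args_to_nested_dict(args: dict) -> dict:
    """Turn dotted keys like "train_dataset_config.name" into nested dicts."""
    nested: dict = {}
    for dotted_key, value in args.items():
        keys = dotted_key.split(".")
        node = nested
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create intermediate dicts as needed
        node[keys[-1]] = value
    return nested

# Example (the lr key is illustrative):
assert convert_dotted_args_to_nested_dict(
    {"train_dataset_config.name": "my_dataset", "lr": 0.001}
) == {"train_dataset_config": {"name": "my_dataset"}, "lr": 0.001}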

Does this PR introduce a breaking change?

Yes. The config handling is completely overhauled, and some config argument names have changed.

@danbraunai marked this pull request as ready for review November 24, 2024 18:07
@danbraunai changed the title from "WIP: Refactor training script" to "Refactor training script" Nov 24, 2024
@danbraunai changed the title from "Refactor training script" to "Major refactor of training script" Nov 24, 2024
@danbraunai changed the title from "Major refactor of training script" to "Major refactor (inc adding Pydantic)" Nov 24, 2024

@lennart-finke (Collaborator) left a comment


Very cool, and ready for merge, with just one minor usability comment.

Collaborator


Just one comment: Since compilation is on by default (which is probably good), we might want to include --compile 0 in the example prompt for local execution, since compilation on e.g. mac gives very ugly errors that might be hard for newcomers to trace back to this flag.
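
For instance, a local run with compilation disabled would then look something like this (config path omitted, so the default config is used):

python train_llama.py --compile 0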

@danbraunai (Owner, Author)


Good pickup, added.

@danbraunai merged commit a088f43 into main Nov 25, 2024
1 check passed