
Major refactor (inc adding Pydantic) #16

Merged: 8 commits into main, Nov 25, 2024
Conversation

@danbraunai (Owner) commented Nov 23, 2024

Description

  • Adds a models directory with model implementations and default configs
  • Uses Pydantic for the base config (rather than argparse)
  • Allows users to pass CLI flags to update the config. These override the config file (and/or the Pydantic defaults); a minimal sketch of this precedence is shown below.
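
As an illustration of that precedence, here is a minimal sketch of a Pydantic base config with a nested sub-config. The class and field names (other than train_dataset_config.name) are hypothetical, not the repo's actual schema:

from pydantic import BaseModel

class DatasetConfig(BaseModel):  # hypothetical nested config
    name: str = "default_dataset"
    seq_len: int = 1024

class TrainConfig(BaseModel):  # hypothetical base config
    lr: float = 3e-4
    compile: bool = True
    train_dataset_config: DatasetConfig = DatasetConfig()

# Precedence: Pydantic defaults < values from the YAML file < CLI overrides.
yaml_values = {"lr": 1e-3}                                        # loaded from PATH/TO/CONFIG.yaml
cli_overrides = {"train_dataset_config": {"name": "my_dataset"}}  # from --train_dataset_config.name my_dataset
config = TrainConfig(**{**yaml_values, **cli_overrides})          # (a real implementation would deep-merge nested dicts)
print(config.train_dataset_config.name)  # my_dataset
print(config.lr)                         # 0.001 (from the YAML file)
print(config.compile)                    # True (Pydantic default)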

New usage (which is also documented in the README and the train_llama.py script):

python train_llama.py [PATH/TO/CONFIG.yaml] [--key1 value1 --key2 value2 ...]

where

  • PATH/TO/CONFIG.yaml contains the training config. If no path is provided, a default config will be used.
  • --key1 value1 --key2 value2 ... override values in the config. Note that if you wish to update a
    nested value, you must use dotted notation (e.g. --train_dataset_config.name my_dataset). An example invocation is shown below.
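
For example, overriding both a top-level and a nested value might look like this (the config path and the --lr flag are illustrative; only the dotted override comes from the note above):

python train_llama.py configs/my_config.yaml --lr 0.001 --train_dataset_config.name my_dataset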

To run on multiple GPUs, use

torchrun --standalone --nproc_per_node=N train_llama.py ...

where N is the number of GPUs to use.
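
The "..." stands for the same config path and overrides as in the single-GPU usage, so a hypothetical 2-GPU run with an override might look like:

torchrun --standalone --nproc_per_node=2 train_llama.py configs/my_config.yaml --train_dataset_config.name my_dataset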

Motivation and Context

It's currently quite hard to understand and work with the training scripts. This PR aims to move us towards a library that is nicer to use and contribute to.

How Has This Been Tested?

  • A unittest for one of the added functions, utils.convert_dotted_args_to_nested_dict (a sketch of the expected behaviour is shown below).
  • No tests were written for the general config behaviour (it looks good after quick manual testing); these could be added.
  • Single- and multi-GPU training works following the instructions in the README and the train_llama.py script. The WandB train loss goes down as desired.
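
For reference, here is a minimal sketch of the behaviour implied by the function name and the dotted-notation overrides described above; the actual implementation in utils may differ:

def convert_dotted_args_to_nested_dict(args: dict) -> dict:
    """Turn dotted keys like "train_dataset_config.name" into nested dicts."""
    nested: dict = {}
    for dotted_key, value in args.items():
        keys = dotted_key.split(".")
        node = nested
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create intermediate dicts as needed
        node[keys[-1]] = value
    return nested

# Example (the lr key is illustrative):
assert convert_dotted_args_to_nested_dict(
    {"train_dataset_config.name": "my_dataset", "lr": 0.001}
) == {"train_dataset_config": {"name": "my_dataset"}, "lr": 0.001}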

Does this PR introduce a breaking change?

Yes. The config handling is completely overhauled, and some config argument names have changed.

@danbraunai marked this pull request as ready for review November 24, 2024 18:07
@danbraunai changed the title from "WIP: Refactor training script" to "Refactor training script" Nov 24, 2024
@danbraunai changed the title from "Refactor training script" to "Major refactor of training script" Nov 24, 2024
@danbraunai changed the title from "Major refactor of training script" to "Major refactor (inc adding Pydantic)" Nov 24, 2024

@lennart-finke (Collaborator) left a comment


Very cool, and ready for merge, with just one minor usability comment.

Collaborator


Just one comment: Since compilation is on by default (which is probably good), we might want to include --compile 0 in the example prompt for local execution, since compilation on e.g. mac gives very ugly errors that might be hard for newcomers to trace back to this flag.
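
For instance, a local run with compilation disabled would then look something like this (config path omitted, so the default config is used):

python train_llama.py --compile 0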

@danbraunai (Owner, Author)


Good pickup, added.

@danbraunai merged commit a088f43 into main Nov 25, 2024
1 check passed