Model ladder, revamped #494
base: main
Conversation
| """ | ||
| A run configurator that uses WSD-S learning rate scheduling and Chinchilla scaling laws. | ||
| """ | ||
|
|
nit: the underlying Chinchilla heuristic of 20 tok/param has some caveats: it only applies to models trained with AdamW, and it is dataset dependent (it was determined on the Pile, IIRC). Might be worth adding a disclaimer here so people don't assume that 20 tok/param is actually optimal for their dataset.
Done, and made it configurable: 5aa3d4e
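As a rough illustration of what a configurable heuristic could look like (the class and field names here are illustrative, not a quote of the linked commit):

from dataclasses import dataclass


@dataclass(kw_only=True)
class ChinchillaBudget:
    # Assumed field: the Chinchilla-style token multiplier. The common 20 tok/param
    # value was fit for AdamW on a particular dataset, so it's exposed rather than hard-coded.
    tokens_per_param: float = 20.0

    def target_token_budget(self, num_params: int) -> int:
        """Total training tokens for a 1xC run of a model with `num_params` parameters."""
        return int(self.tokens_per_param * num_params)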
@dataclass(kw_only=True)
class RunConfigurator(Config, metaclass=ABCMeta):
Seems like this loosely maps to TrainModule and the above maps to Model?
I don't think that's the right comparison, because the ModelConfigurator is actually responsible for building the TrainModule given the optimizer, scheduler, and other hyperparameters configured by the RunConfigurator. I'm not entirely happy with it, but I landed there because the training optimization/parallelism configuration that goes into the TrainModule is tightly coupled to the model architecture, which is defined by the ModelConfigurator.
I think of RunConfigurator as what decides hyperparameters.
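To make the split concrete, a rough sketch of the responsibilities being described (only RunConfigurator, ModelConfigurator, and TrainModule come from this PR; everything else is illustrative):

from abc import ABCMeta, abstractmethod
from dataclasses import dataclass


class TrainModule:
    """Stand-in for the real TrainModule."""


@dataclass(kw_only=True)
class RunConfigurator(metaclass=ABCMeta):
    """Decides run-level hyperparameters: batch size, LR schedule, duration, and so on."""

    @abstractmethod
    def configure_target_batch_size(self, num_params: int) -> int:
        raise NotImplementedError


@dataclass(kw_only=True)
class ModelConfigurator(metaclass=ABCMeta):
    """Defines the architecture and builds the TrainModule, since parallelism and
    training optimizations are tightly coupled to that architecture."""

    @abstractmethod
    def build_train_module(self, run: RunConfigurator) -> TrainModule:
        raise NotImplementedError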
@dataclass(kw_only=True)
class ModelLadder(Config):
    """
    Represents a complete model ladder of runs.
And this roughly maps to ExperimentConfig
src/olmo_core/model_ladder2/base.py (outdated)
        raise NotImplementedError

    @abstractmethod
    def configure_device_microbatch_size(
Elsewhere we use the term "rank_microbatch_size"; can we keep that consistent?
Done: bf5bc2d
    steps = d.value // global_batch_size
    return steps
else:
    raise ValueError(f"Unsupported checkpoint interval duration unit: {d.unit}.")
This might make a good method on the duration class
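For illustration, that could look roughly like the following; the Duration/DurationUnit shapes here are assumptions based on the snippet above, not the actual classes:

from dataclasses import dataclass
from enum import Enum


class DurationUnit(Enum):
    # Assumed units; the real enum may have more members.
    steps = "steps"
    tokens = "tokens"


@dataclass
class Duration:
    value: int
    unit: DurationUnit

    def to_steps(self, global_batch_size: int) -> int:
        """Convert this duration into a number of optimizer steps."""
        if self.unit == DurationUnit.steps:
            return self.value
        elif self.unit == DurationUnit.tokens:
            # global_batch_size is assumed to be measured in tokens here.
            return self.value // global_batch_size
        else:
            raise ValueError(f"Unsupported duration unit: {self.unit}.")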
def _get_in_loop_eval_tasks(self) -> list[str]:
    # For training runs where we don't expect the model to acquire MC (e.g., 1B-5xC, short 7B training runs).
    tasks_small_compute = [
Looks like these are the task groups from task_groups.py, but without the basic_skills and mt_mbpp tasks?
Could we just use the predefined task groups instead?
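For instance, something along these lines; the import path and the SMALL_COMPUTE_TASKS name are placeholders for whatever task_groups.py actually exposes:

# Placeholder import; the real module path and group name may differ.
from task_groups import SMALL_COMPUTE_TASKS

EXCLUDED_TASKS = {"basic_skills", "mt_mbpp"}


def get_in_loop_eval_tasks() -> list[str]:
    """Reuse the predefined task group, minus the tasks we don't want in-loop."""
    return [task for task in SMALL_COMPUTE_TASKS if task not in EXCLUDED_TASKS]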
class TransformerSize(StrEnum):
    size_190M = "190M"
Just noting to myself that it looks like the smallest size takes about 30m to hit 1xC
    size_13B = "13B"

    @property
    def num_params(self) -> int:
Slightly confusing that this returns the rough number of params, but there is a similar method get_num_params in base.py that returns the exact number of non-embedding params.
name=DataParallelType.fsdp,
param_dtype=DType.bfloat16,
reduce_dtype=DType.float32,
wrapping_strategy=TransformerDataParallelWrappingStrategy.blocks,
nit: I think full wrapping is slightly better than blocks in general, because under blocks wrapping the LM head isn't released for communication until after the backward pass is complete.
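Concretely, that would just mean changing the wrapping strategy in the config quoted above (full being the strategy this comment refers to):

# Same data-parallel config as above; only the wrapping strategy changes.
name=DataParallelType.fsdp,
param_dtype=DType.bfloat16,
reduce_dtype=DType.float32,
# With full wrapping, the LM head's gradients can be reduced as soon as its backward
# finishes, instead of waiting for the entire backward pass to complete.
wrapping_strategy=TransformerDataParallelWrappingStrategy.full,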
def configure_target_batch_size(self, num_params: int) -> int:
    # Calculate global batch size according to https://api.semanticscholar.org/CorpusID:270764838
    # which assumes a sequence length of 2048.
Hmm, any idea if/how this would be modified for an 8k sequence length?
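One possible adjustment, purely as a sketch and not something the cited paper prescribes: keep the number of tokens per global batch that the 2048-token heuristic gives you, and pack them into fewer, longer sequences:

def adjust_batch_size_for_seq_len(batch_size_tokens_at_2048: int, sequence_length: int) -> int:
    """Hold the token count per global batch roughly fixed, rounding it down to a
    multiple of the new sequence length. Whether this is actually appropriate for 8k
    sequences is an open question; the heuristic was fit at 2048."""
    num_instances = max(1, batch_size_tokens_at_2048 // sequence_length)
    return num_instances * sequence_length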
    This usually isn't invoked directly, but rather via `launch_run`.
    """
).replace("\n", " "),
"launch_run": "Launch a Beaker job to run the ladder for a given model size.",
Could we have a utility for launching a set of runs? For example, all of the rungs of size less than or equal to the specified size? I see myself kicking off the bottom 3/4 sizes all at once pretty frequently when evaluating ideas.
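Something like the following is what I'm picturing, using launch_run and TransformerSize from this PR (the helper itself is hypothetical):

def launch_runs_up_to(max_size: TransformerSize, *args, **kwargs) -> None:
    """Launch a Beaker job for every ladder rung at or below `max_size`."""
    # Assumes the enum members are declared in ascending size order, smallest first.
    sizes = list(TransformerSize)
    for size in sizes[: sizes.index(max_size) + 1]:
        launch_run(size, *args, **kwargs)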
@@ -0,0 +1,387 @@
import argparse
So is the expected pattern similar to our other scripts/train scripts, where users will clone this file and make the changes relevant to their experiment?
If so, could we pull some of the utility functions in this file into a shared file, similar to internal/experiment.py?
tyler-romero left a comment
Some small comments, some already discussed offline, LGTM!
Adds a new model ladder API following this design doc, and a script to run the baseline Olmo 3 ladder, which also serves as an example of how to configure new ladder experiments.
For now the new ladder API is located at src/olmo_core/model_ladder2/ to make the diffs easier to review, but before merging we'll replace the old code at src/olmo_core/model_ladder.py with this new module.

You can play around with the baseline ladder script without actually launching anything via its dry_run command. For example:
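A sketch of what such an invocation might look like; the script path and argument order here are guesses rather than the actual interface in this PR:

# Inspect what the 190M rung would run, without actually launching anything (hypothetical path/CLI shape).
python src/scripts/train/olmo3_model_ladder.py dry_run 190M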