Model ladder, revamped #494
base: main
Conversation
| """ | ||
| A run configurator that uses WSD-S learning rate scheduling and Chinchilla scaling laws. | ||
| """ | ||
|
|
nit: the underlying Chinchilla heuristic of 20 tok/param has some caveats: it only applies to models trained with AdamW, and it is dataset dependent (it was determined on the Pile, IIRC). Might be worth adding a disclaimer here so people don't assume that 20 tok/param is actually optimal for their dataset.
Done, and made it configurable: 5aa3d4e
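As a rough illustration of what a configurable heuristic could look like (the class and field names here are illustrative, not a quote of the linked commit):

from dataclasses import dataclass


@dataclass(kw_only=True)
class ChinchillaBudget:
    # Assumed field: the Chinchilla-style token multiplier. The common 20 tok/param
    # value was fit for AdamW on a particular dataset, so it's exposed rather than hard-coded.
    tokens_per_param: float = 20.0

    def target_token_budget(self, num_params: int) -> int:
        """Total training tokens for a 1xC run of a model with `num_params` parameters."""
        return int(self.tokens_per_param * num_params)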
@dataclass(kw_only=True)
class RunConfigurator(Config, metaclass=ABCMeta):
Seems like this loosely maps to TrainModule and the above maps to Model?
I don't think that's the right comparison, because the ModelConfigurator is actually responsible for building the TrainModule given the optimizer, scheduler, and other hyperparameters configured by the RunConfigurator. I'm not entirely happy with it, but I landed there because the training optimization/parallelism configuration that goes into the TrainModule is tightly coupled to the model architecture, which is defined by the ModelConfigurator.
I think of RunConfigurator as what decides hyperparameters.
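To make the split concrete, a rough sketch of the responsibilities being described (only RunConfigurator, ModelConfigurator, and TrainModule come from this PR; everything else is illustrative):

from abc import ABCMeta, abstractmethod
from dataclasses import dataclass


class TrainModule:
    """Stand-in for the real TrainModule."""


@dataclass(kw_only=True)
class RunConfigurator(metaclass=ABCMeta):
    """Decides run-level hyperparameters: batch size, LR schedule, duration, and so on."""

    @abstractmethod
    def configure_target_batch_size(self, num_params: int) -> int:
        raise NotImplementedError


@dataclass(kw_only=True)
class ModelConfigurator(metaclass=ABCMeta):
    """Defines the architecture and builds the TrainModule, since parallelism and
    training optimizations are tightly coupled to that architecture."""

    @abstractmethod
    def build_train_module(self, run: RunConfigurator) -> TrainModule:
        raise NotImplementedError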
@dataclass(kw_only=True)
class ModelLadder(Config):
    """
    Represents a complete model ladder of runs.
And this roughly maps to ExperimentConfig
src/olmo_core/model_ladder2/base.py (outdated)
        raise NotImplementedError

    @abstractmethod
    def configure_device_microbatch_size(
Elsewhere we use the term "rank_microbatch_size"; can we keep that consistent?
Done: bf5bc2d
    steps = d.value // global_batch_size
    return steps
else:
    raise ValueError(f"Unsupported checkpoint interval duration unit: {d.unit}.")
This might make a good method on the duration class
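For illustration, that could look roughly like the following; the Duration/DurationUnit shapes here are assumptions based on the snippet above, not the actual classes:

from dataclasses import dataclass
from enum import Enum


class DurationUnit(Enum):
    # Assumed units; the real enum may have more members.
    steps = "steps"
    tokens = "tokens"


@dataclass
class Duration:
    value: int
    unit: DurationUnit

    def to_steps(self, global_batch_size: int) -> int:
        """Convert this duration into a number of optimizer steps."""
        if self.unit == DurationUnit.steps:
            return self.value
        elif self.unit == DurationUnit.tokens:
            # global_batch_size is assumed to be measured in tokens here.
            return self.value // global_batch_size
        else:
            raise ValueError(f"Unsupported duration unit: {self.unit}.")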
def _get_in_loop_eval_tasks(self) -> list[str]:
    # For training runs where we don't expect the model to acquire MC (e.g., 1B-5xC, short 7B training runs).
    tasks_small_compute = [
Looks like these are the task groups from task_groups.py, but without the basic_skills and mt_mbpp tasks?
Could we just use the predefined task groups instead?
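For instance, something along these lines; the import path and the SMALL_COMPUTE_TASKS name are placeholders for whatever task_groups.py actually exposes:

# Placeholder import; the real module path and group name may differ.
from task_groups import SMALL_COMPUTE_TASKS

EXCLUDED_TASKS = {"basic_skills", "mt_mbpp"}


def get_in_loop_eval_tasks() -> list[str]:
    """Reuse the predefined task group, minus the tasks we don't want in-loop."""
    return [task for task in SMALL_COMPUTE_TASKS if task not in EXCLUDED_TASKS]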
class TransformerSize(StrEnum):
    size_190M = "190M"
Just noting to myself that it looks like the smallest size takes about 30m to hit 1xC
    size_13B = "13B"

    @property
    def num_params(self) -> int:
Slightly confusing that this returns the rough number of params, but there is a similar method get_num_params in base.py that returns the exact number of non-embedding params.
name=DataParallelType.fsdp,
param_dtype=DType.bfloat16,
reduce_dtype=DType.float32,
wrapping_strategy=TransformerDataParallelWrappingStrategy.blocks,
nit: I think full wrapping is slightly better than blocks in general, because under blocks wrapping the LM head isn't released for communication until after the backward pass is complete.
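Concretely, that would just mean changing the wrapping strategy in the config quoted above (full being the strategy this comment refers to):

# Same data-parallel config as above; only the wrapping strategy changes.
name=DataParallelType.fsdp,
param_dtype=DType.bfloat16,
reduce_dtype=DType.float32,
# With full wrapping, the LM head's gradients can be reduced as soon as its backward
# finishes, instead of waiting for the entire backward pass to complete.
wrapping_strategy=TransformerDataParallelWrappingStrategy.full,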
def configure_target_batch_size(self, num_params: int) -> int:
    # Calculate global batch size according to https://api.semanticscholar.org/CorpusID:270764838
    # which assumes a sequence length of 2048.
Hmm, any idea if/how this would be modified for an 8k sequence length?
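One possible adjustment, purely as a sketch and not something the cited paper prescribes: keep the number of tokens per global batch that the 2048-token heuristic gives you, and pack them into fewer, longer sequences:

def adjust_batch_size_for_seq_len(batch_size_tokens_at_2048: int, sequence_length: int) -> int:
    """Hold the token count per global batch roughly fixed, rounding it down to a
    multiple of the new sequence length. Whether this is actually appropriate for 8k
    sequences is an open question; the heuristic was fit at 2048."""
    num_instances = max(1, batch_size_tokens_at_2048 // sequence_length)
    return num_instances * sequence_length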
    This usually isn't invoked directly, but rather via `launch_run`.
    """
).replace("\n", " "),
"launch_run": "Launch a Beaker job to run the ladder for a given model size.",
Could we have a utility for launching a set of runs? For example, all of the rungs of size less than or equal to the specified size? I see myself kicking off the bottom 3/4 sizes all at once pretty frequently when evaluating ideas.
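Something like the following is what I'm picturing, using launch_run and TransformerSize from this PR (the helper itself is hypothetical):

def launch_runs_up_to(max_size: TransformerSize, *args, **kwargs) -> None:
    """Launch a Beaker job for every ladder rung at or below `max_size`."""
    # Assumes the enum members are declared in ascending size order, smallest first.
    sizes = list(TransformerSize)
    for size in sizes[: sizes.index(max_size) + 1]:
        launch_run(size, *args, **kwargs)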
@@ -0,0 +1,387 @@
import argparse
So is the expected pattern similar to our other scripts/train scripts, where users will clone this file and make the changes relevant to their experiment?
If so, could we pull some of the utility functions in this file into a shared file, similar to internal/experiment.py?
tyler-romero left a comment
Some small comments, some already discussed offline, LGTM!
Adds a new model ladder API following this design doc, and a script to run the baseline Olmo 3 ladder, which also serves as an example of how to configure new ladder experiments.
For now the new ladder API is located at src/olmo_core/model_ladder2/ to make the diffs easier to review, but before merging we'll replace the old code at src/olmo_core/model_ladder.py with this new module.

You can play around with the baseline ladder script without actually launching anything via its dry_run command. For example:
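A sketch of what such an invocation might look like; the script path and argument order here are guesses rather than the actual interface in this PR:

# Inspect what the 190M rung would run, without actually launching anything (hypothetical path/CLI shape).
python src/scripts/train/olmo3_model_ladder.py dry_run 190M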