
Conversation

@epwalsh (Member) commented Dec 10, 2025

Adds a new model ladder API following this design doc, plus a script to run the baseline Olmo 3 ladder, which also serves as an example of how to configure new ladder experiments.

For now the new ladder API is located at src/olmo_core/model_ladder2/ to make the diffs easier to review, but before merging we'll replace the old code at src/olmo_core/model_ladder.py with this new module.

You can play around with the baseline ladder script without actually launching anything via its dry_run command. For example:

python src/scripts/train/ladder/olmo3_ladder.py dry_run --size=760M --show-plot

"""
A run configurator that uses WSD-S learning rate scheduling and Chinchilla scaling laws.
"""

Contributor

nit: the underlying Chinchilla heuristic of 20 tok/param has some caveats, like it only applies to models trained with AdamW and is dataset-dependent (it was determined on the Pile, iirc). Might be worth adding a disclaimer here so people don't assume that 20 tok/param is actually optimal for their dataset.

Member Author (@epwalsh), Dec 10, 2025

Done, and made it configurable: 5aa3d4e
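
For illustration, here is a minimal sketch of what a configurable token/param ratio could look like; the class and field names are assumptions for the sake of the example, not necessarily what commit 5aa3d4e adds.

from dataclasses import dataclass


@dataclass(kw_only=True)
class ChinchillaRunConfigurator:
    # The 20 tok/param heuristic was fit for AdamW on a particular dataset
    # (the Pile), so treat it as a default rather than an optimum.
    tokens_per_param: float = 20.0

    def target_training_tokens(self, num_params: int) -> int:
        # Token budget scales linearly with parameter count.
        return int(self.tokens_per_param * num_params)


# e.g. a 760M model at the default ratio targets ~15.2B training tokens
print(ChinchillaRunConfigurator().target_training_tokens(760_000_000))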



@dataclass(kw_only=True)
class RunConfigurator(Config, metaclass=ABCMeta):

Contributor

Seems like this loosely maps to TrainModule and the above maps to Model?

Member Author (@epwalsh)

I don't think that's the right comparison, because the ModelConfigurator is actually responsible for building the TrainModule given the optimizer, scheduler, and other hyperparameters configured by the RunConfigurator. I'm not entirely happy with it, but I landed there because the training optimization/parallelism configuration that goes into the TrainModule is tightly coupled to the model architecture, which is defined by the ModelConfigurator.

Member Author (@epwalsh)

I think of RunConfigurator as what decides hyperparameters.
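
To make that split concrete, here is a rough sketch of the division of responsibilities described above; the method names are illustrative, not the actual API.

from abc import ABC, abstractmethod


class RunConfigurator(ABC):
    # Decides hyperparameters: optimizer, LR schedule, batch size, duration, etc.

    @abstractmethod
    def configure_target_batch_size(self, num_params: int) -> int: ...

    @abstractmethod
    def configure_optimizer(self, num_params: int): ...


class ModelConfigurator(ABC):
    # Defines the architecture and, because the parallelism/compile settings are
    # tightly coupled to it, also builds the TrainModule from the pieces the
    # RunConfigurator decided on.

    @abstractmethod
    def build_model(self): ...

    @abstractmethod
    def build_train_module(self, model, optimizer, lr_scheduler): ...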

@dataclass(kw_only=True)
class ModelLadder(Config):
    """
    Represents a complete model ladder of runs.

Contributor

And this roughly maps to ExperimentConfig

        raise NotImplementedError

    @abstractmethod
    def configure_device_microbatch_size(

Contributor

Elsewhere we use the term "rank_microbatch_size"; can we keep that consistent?

Member Author (@epwalsh)

Done: bf5bc2d

    steps = d.value // global_batch_size
    return steps
else:
    raise ValueError(f"Unsupported checkpoint interval duration unit: {d.unit}.")

Contributor

This might make a good method on the Duration class.
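
A rough sketch of what that could look like as a method on a Duration-style class; the unit enum and field names are assumptions based on the snippet above.

from dataclasses import dataclass
from enum import Enum


class DurationUnit(Enum):
    steps = "steps"
    tokens = "tokens"


@dataclass
class Duration:
    value: int
    unit: DurationUnit

    def to_steps(self, global_batch_size: int) -> int:
        # Normalize this duration to a number of optimizer steps.
        if self.unit == DurationUnit.steps:
            return self.value
        elif self.unit == DurationUnit.tokens:
            return self.value // global_batch_size
        else:
            raise ValueError(f"Unsupported duration unit: {self.unit}.")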


def _get_in_loop_eval_tasks(self) -> list[str]:
    # For training runs where we don't expect the model to acquire MC (e.g., 1B-5xC, short 7B training runs).
    tasks_small_compute = [

Contributor

Looks like these are the task groups from task_groups.py, but without the basic_skills and mt_mbpp tasks?

Could we just use the predefined task groups instead?



class TransformerSize(StrEnum):
    size_190M = "190M"

Contributor

Just noting to myself that it looks like the smallest size takes about 30m to hit 1xC

    size_13B = "13B"

    @property
    def num_params(self) -> int:

Contributor

Slightly confusing that this returns the rough number of params, but there is a similar method get_num_params in base.py that returns the exact number of non-embedding params.
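
For reference, the rough count presumably just decodes the size label, along these lines (a sketch, not the actual implementation):

def rough_num_params(size: str) -> int:
    # Parse a size label like "190M" or "13B" into an approximate parameter count;
    # unlike get_num_params in base.py, this is not an exact non-embedding count.
    multipliers = {"M": 1_000_000, "B": 1_000_000_000}
    return int(float(size[:-1]) * multipliers[size[-1]])


assert rough_num_params("760M") == 760_000_000
assert rough_num_params("13B") == 13_000_000_000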

name=DataParallelType.fsdp,
param_dtype=DType.bfloat16,
reduce_dtype=DType.float32,
wrapping_strategy=TransformerDataParallelWrappingStrategy.blocks,

Contributor

nit: I think full wrapping is slightly better than blocks in general, because with blocks wrapping the LM head isn't released for communication until after the backward pass is complete.
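
Concretely, the suggestion is roughly the one-field change below, mirroring the snippet above; whether the wrapping-strategy enum actually has a full member is an assumption based on this comment.

name=DataParallelType.fsdp,
param_dtype=DType.bfloat16,
reduce_dtype=DType.float32,
# Per the comment above: with blocks wrapping the LM head isn't released for
# communication until the backward pass completes; full wrapping avoids that.
wrapping_strategy=TransformerDataParallelWrappingStrategy.full,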


def configure_target_batch_size(self, num_params: int) -> int:
    # Calculate global batch size according to https://api.semanticscholar.org/CorpusID:270764838
    # which assumes a sequence length of 2048.

Contributor

Hmm, any idea if/how this would be modified for an 8k sequence length?
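
Not an answer from the paper, but one hedged reading (worth verifying against it) is that the heuristic yields a token budget per batch, in which case a longer sequence length only changes how many sequences fill that budget:

def sequences_per_batch(target_batch_size_tokens: int, seq_len: int) -> int:
    # If the target batch size is interpreted in tokens, going from 2k to 8k
    # sequences just packs 4x fewer sequences into each optimizer step.
    return target_batch_size_tokens // seq_len


target = 2_097_152  # ~2M tokens per batch, an illustrative number only
print(sequences_per_batch(target, 2048))  # 1024 sequences
print(sequences_per_batch(target, 8192))  # 256 sequences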

This usually isn't invoked directly, but rather via `launch_run`.
"""
).replace("\n", " "),
"launch_run": "Launch a Beaker job to run the ladder for a given model size.",

Contributor

Could we have a utility for launching a set of runs, for example all of the rungs of size less than or equal to a specified size? I see myself kicking off the bottom 3/4 sizes all at once pretty frequently when evaluating ideas.
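
Something along these lines could work, sketched with a hypothetical helper; launch_run and TransformerSize are the names used elsewhere in this PR, but the helper itself is not part of the diff.

def launch_runs_up_to(max_size: TransformerSize) -> None:
    # Hypothetical convenience: kick off every rung at or below the given size.
    sizes = list(TransformerSize)  # assumes enum members are declared smallest-first
    for size in sizes[: sizes.index(max_size) + 1]:
        launch_run(size)  # assumed per-size launch entry point in this script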

@@ -0,0 +1,387 @@
import argparse

Contributor

So is the expected pattern similar to our other src/scripts/train scripts, where users will clone this file and make the changes relevant to their experiment?

If so, could we pull some of the utility functions in this file into a shared file, similar to internal/experiment.py?

@tyler-romero (Contributor) left a comment

Some small comments, some already discussed offline, LGTM!
