Conversation
…base class - Change Step() to virtual with default implementation - Add pure virtual ComputeLR() for subclasses to implement. - Adapt test helpers (IdentityScheduler, LinearDecayScheduler) to implement ComputeLR() instead of Step(). - All existing tests pass without behavioral changes. BREAKING CHANGE: Subclasses must implement ComputeLR() instead of Step().
…t and update all tests to use Create<T>() factory method.
…entialLR - enhance LRScheduler with chained and closed-form learning rate methods - adapt methods (Step, InitialStep, GetClosedFormLR, GetChainedFormLR) to match PyTorch's design - add tests for consistency - refactor LinearLR: add end_factor and rename this class - add SequentialLR InitialStep and UndoChildInitialSteps BREAKING CHANGE: Subclasses must implement GetClosedFormLR() instead of ComputeLR(). Use LinearLR instead of LinearwarmupLR.
- Add LRSchedulerConfig struct with parameters for all basic schedulers (constant, linear, step) - Add CreateLRScheduler() factory function - Support automatic warmup wrapping via SequentialLR when warmup_steps > 0 - Adapt test files
…tial, Chained, and Lambda)
…ommon total_iters
- Add gflags: --lr_scheduler, --warmup_steps, --step_size, --gamma, --start_factor, --end_factor, --lr_total_iters, --total_steps - Replace nullptr scheduler with factory-created scheduler - Move scheduler.Step() after optimizer.Step() in both DP and PP paths - Replace hardcoded FLAGS_learning_rate in log with scheduler->GetLR()
example/gpt2/main.cc
Outdated
```cpp
size_t used_mb = 0, reserved_mb = 0;
std::tie(used_mb, reserved_mb) = impl->GetMemPoolPeakMB(device);

const float current_lr = scheduler ? scheduler->GetLR() : static_cast<float>(FLAGS_learning_rate);
```
The scheduler has already been Stepped before this point, so here GetLR() semantically returns the LR that will be used on the *next* step; what we want to print is the LR actually used on the current step, so this logic needs to be fixed. The same applies to main.cc in the llama3 part.
example/llama3/main.cc
Outdated
```cpp
size_t used_mb = 0, reserved_mb = 0;
std::tie(used_mb, reserved_mb) = impl->GetMemPoolPeakMB(device);

const float current_lr = scheduler ? scheduler->GetLR() : static_cast<float>(FLAGS_learning_rate);
```
```cpp
std::vector<std::shared_ptr<Tensor>> params_;
float learning_rate_ = 0.0f;
float initial_learning_rate_ = 0.0f;
bool initial_lr_set_ = false;
```
This part is redundant. The optimizer only needs to store learning_rate_ for the current learning rate; there is no need for extra initial-LR state. Semantically, the initial learning rate can live solely in the LR scheduler (which you effectively already do — it is stored as the scheduler's base_lr).
This aligns with PyTorch's initialization behavior (source link). When PyTorch initializes a scheduler, it visits the parameter groups of the associated optimizer and calls setdefault to set initial_lr: for an optimizer being associated for the first time, the current learning rate is stored as initial_lr; for an optimizer that already has the value, the existing value is returned.
The only use case I can see is that, when multiple schedulers are attached to the same optimizer (e.g. ChainedScheduler or SequentialLR), this guarantees their base_lr_ all equal the optimizer's learning rate at the moment the first scheduler was initialized. I'm not aware of other scenarios, but I added the related state for consistency with PyTorch. If only ChainedScheduler/SequentialLR are involved, there are indeed alternatives — should I change it?
infini_train/include/lr_scheduler.h
Outdated
```cpp
std::shared_ptr<Optimizer> optimizer_;
int64_t last_step_;
float current_lr_;
```
current_lr_ also looks redundant: semantically, current_lr_ and optimizer_->GetLearningRate() should be equal at all times, yet in your design the two are stored separately and used interchangeably (on a full read, current_lr_ is effectively a copy of optimizer_->GetLearningRate()). Your handling is numerically correct today, but this design is likely to create ambiguity for whoever extends it later.
I suggest keeping a single source of truth for the "current learning rate": either track it entirely via optimizer_->GetLearningRate() and drop current_lr from the scheduler, or track it in the scheduler and set it back to the optimizer after each computation. I think the former is more appropriate.
Fixed. Since schedulers need to support resuming training, and some (e.g. SequentialLR or ChainedScheduler) have no closed-form computation — the LR cannot be recovered quickly from base_lr and last_epoch — the field is kept solely for learning-rate recovery and renamed to recover_lr to avoid confusion.
infini_train/src/lr_scheduler.cc
Outdated
```cpp
void LRScheduler::ApplyLR(float lr) {
    current_lr_ = lr;
    optimizer_->SetLearningRate(current_lr_);
```
Following up on the above: in your design there is on one hand the call optimizer_->SetLearningRate(current_lr_); and on the other current_lr_ = optimizer_->GetLearningRate();. It is easy to confuse which is the cause and which is the effect, so I suggest keeping the design semantics consistent.
infini_train/src/lr_scheduler.cc
Outdated
```cpp
    scheduler->Step();
}

current_lr_ = optimizer_->GetLearningRate();
```
Following up on the above: in your design there is on one hand the call optimizer_->SetLearningRate(current_lr_); and on the other current_lr_ = optimizer_->GetLearningRate();. It is easy to confuse which is the cause and which is the effect, so I suggest keeping the design semantics consistent.
```cpp
} else if (last_step_ < total_iters_) {
    return lr;
} else if (last_step_ == total_iters_) {
    return lr / factor_;
```
Some hyperparameter values are passed in by CLI users, so they need validity checks. Taking this spot as an example: factor should be in the (0, 1) range, otherwise illegal values such as a division by zero are possible. The torch implementation also checks this in the constructor, see: https://github.com/pytorch/pytorch/blob/08840d08a02eead8edf22406a53e5691c9a89c9a/torch/optim/lr_scheduler.py#L813
Also, from what I can see, StepLR doesn't check step_size > 0, and LinearLR doesn't check the two factors or total_iters. I suggest a pass over the whole file.
infini_train/include/lr_scheduler.h
Outdated
```cpp
void LoadState(const StateDict &state) override;

protected:
    float GetClosedFormLR() const override { return current_lr_; }
```
This is semantically off. I looked carefully at the torch implementation: if GetClosedFormLR maps to torch's _get_closed_form_lr interface, the intent is "a function that, given base_lr, last_step, and the other hyperparameters, computes the current lr from a formula". Although that is numerically equal to the current_lr you provide now, logically the code should not just return the cached current_lr_ — it should implement the formula.
Also, torch's _get_closed_form_lr is ultimately used by the step(int epoch) function: if a derived LRScheduler class implements _get_closed_form_lr, that class supports closed-form step skipping, and step(epoch) computes the current lr directly from the provided function. torch's SequentialLR derived class does not implement this function.
Given that your GetClosedFormLR is declared as a virtual function that every derived class must implement, I suggest adding a // FIXME comment here noting this — that returning current_lr is a temporary hack, not a real closed-form computation.
infini_train/include/lr_scheduler.h
Outdated
```cpp
};

} // namespace lr_schedulers
} // namespace infini_train
\ No newline at end of file
```
Formatting convention: there should be a newline at end of file; several other files have the same issue.
- it is now only used for learning-rate recovery when using LoadState
Pull request overview
This PR introduces a learning-rate scheduler system to infini_train, integrates it with optimizers (including distributed optimizer), and adds standalone C++ test executables plus example CLI wiring to exercise the new schedulers.
Changes:
- Add `LRScheduler` base + concrete schedulers (ConstantLR/StepLR/LinearLR/LambdaLR/SequentialLR/ChainedScheduler) and a `CreateLRScheduler` factory.
- Extend `Optimizer` with runtime-settable learning rate and initial learning rate tracking; propagate LR to `DistributedOptimizer`.
- Add scheduler coverage tests and wire scheduler flags into `example/gpt2` and `example/llama3`; register new test executables in CMake.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| `infini_train/include/lr_scheduler.h` | Declares scheduler APIs, configs, and concrete scheduler types. |
| `infini_train/src/lr_scheduler.cc` | Implements scheduler logic, factory creation, state save/load, sequential/chained behavior. |
| `infini_train/include/optimizer.h` | Adds LR getters/setters + initial LR tracking to support schedulers. |
| `infini_train/src/optimizer.cc` | Implements optimizer LR plumbing and updates SGD/Adam to use base LR storage. |
| `infini_train/include/nn/parallel/ddp/distributed_optimizer.h` | Overrides LR get/set for distributed optimizer so schedulers affect the real base optimizer. |
| `infini_train/src/nn/parallel/ddp/distributed_optimizer.cc` | Implements LR propagation to/from the wrapped base optimizer. |
| `example/gpt2/main.cc` | Adds scheduler CLI flags and steps the scheduler during training. |
| `example/llama3/main.cc` | Adds scheduler CLI flags and steps the scheduler during training. |
| `test/lr_scheduler/test_helpers.h` | Shared minimal test helpers/macros for scheduler tests. |
| `test/lr_scheduler/test_*.cc` | Adds functional + state + validation tests for schedulers. |
| `CMakeLists.txt` | Adds new scheduler test executables to the build. |
```cpp
LRSchedulerConfig linear_config = {
    .type = "linear",
    .linear_start_factor = 1e-8f,
    .linear_end_factor = 1.0f,
    .linear_total_iters = 3,
};
auto linear = CreateLRScheduler(opt, linear_config);
LRSchedulerConfig constant_config = {
    .type = "constant",
    .constant_factor = 1.0f,
    .constant_total_iters = 100,
};
auto constant = CreateLRScheduler(opt, constant_config);
```
Several schedulers are created but never used (linear, constant), which is dead code and can trigger compiler warnings or confuse readers about what's being tested. Remove these unused variables (or explicitly cast to void if construction side-effects are what you want to test).
```cpp
auto linear = CreateLRScheduler(opt, {
    .type = "linear",
    .linear_start_factor = 1e-8f,
    .linear_end_factor = 1.0f,
    .linear_total_iters = 3,
});
auto step_lr = CreateLRScheduler(opt, {
    .type = "step",
    .step_size = 3,
    .step_gamma = 0.1f,
});
auto Lambda = CreateLRScheduler(opt, {
    .type = "lambda",
    .lambda_fn = [](int64_t step) { return 1.0f - 0.1f * step; },
});
```
linear and Lambda are created but never used in this test case, which is dead code and can lead to misleading tests / unused-variable warnings. Remove them or (void)-mark them if the intent is to validate construction only.
Suggested change:
```diff
-auto linear = CreateLRScheduler(opt, {
-    .type = "linear",
-    .linear_start_factor = 1e-8f,
-    .linear_end_factor = 1.0f,
-    .linear_total_iters = 3,
-});
-auto step_lr = CreateLRScheduler(opt, {
-    .type = "step",
-    .step_size = 3,
-    .step_gamma = 0.1f,
-});
-auto Lambda = CreateLRScheduler(opt, {
-    .type = "lambda",
-    .lambda_fn = [](int64_t step) { return 1.0f - 0.1f * step; },
-});
+auto step_lr = CreateLRScheduler(opt, {
+    .type = "step",
+    .step_size = 3,
+    .step_gamma = 0.1f,
+});
```
```cpp
void SequentialLR::Step() {
    ++last_step_;
    size_t idx = std::upper_bound(milestones_.begin(), milestones_.end(), last_step_) - milestones_.begin();
```
SequentialLR::Step() uses std::upper_bound, but this translation unit doesn't include <algorithm>. Relying on indirect includes is brittle and can fail to compile depending on standard library headers; add #include <algorithm> in this file (or in the header that declares this usage).
```cpp
float ConstantLR::GetChainedFormLR() const {
    const float lr = optimizer_->GetLearningRate();
    if (last_step_ == 0) {
        return lr * factor_;
    } else if (last_step_ < total_iters_) {
        return lr;
    } else if (last_step_ == total_iters_) {
        return lr / factor_;
    }
    return lr;
```
ConstantLR::GetChainedFormLR() divides by factor_ when last_step_ == total_iters_. Since the constructor allows factor_ == 0, this can cause a division-by-zero at runtime. Either disallow factor_ == 0 (CHECK_GT) or handle the factor_ == 0 transition without dividing.
```cpp
for (const auto &sub_config : config.sequential_configs) {
    auto sub_sched = CreateLRScheduler(opt, sub_config);
    if (sub_sched) {
        schedulers.push_back(sub_sched);
    }
}
```
CreateLRScheduler() filters out sub-schedulers when a sub-config has type == "none", but it keeps sequential_milestones unchanged. This can make milestones.size() != schedulers.size() - 1 and hard-fail in SequentialLR's constructor. Consider rejecting none inside sequential_configs, or keep placeholders / adjust milestones so the sizes stay consistent.
Suggested change:
```diff
-for (const auto &sub_config : config.sequential_configs) {
-    auto sub_sched = CreateLRScheduler(opt, sub_config);
-    if (sub_sched) {
-        schedulers.push_back(sub_sched);
-    }
-}
+for (const auto &sub_config : config.sequential_configs) {
+    if (sub_config.type == "none") {
+        LOG(FATAL) << "LR scheduler type \"none\" is not allowed inside sequential_configs.";
+    }
+    auto sub_sched = CreateLRScheduler(opt, sub_config);
+    if (sub_sched) {
+        schedulers.push_back(sub_sched);
+    }
+}
+if (!schedulers.empty() && milestones.size() != schedulers.size() - 1) {
+    LOG(FATAL) << "SequentialLR requires milestones.size() == schedulers.size() - 1, but got "
+               << milestones.size() << " milestones and " << schedulers.size() << " schedulers.";
+}
```
```cpp
auto warmup_scheduler = LRScheduler::Create<lr_schedulers::LinearLR>(optimizer,
                                                                     /*start_factor=*/config.warmup_start_factor,
                                                                     /*end_factor=*/config.warmup_end_factor,
                                                                     /*total_iters=*/config.warmup_steps);

auto main_scheduler = create_main(optimizer);

return LRScheduler::Create<lr_schedulers::SequentialLR>(
    optimizer, std::vector<std::shared_ptr<LRScheduler>>{warmup_scheduler, main_scheduler},
    std::vector<int64_t>{config.warmup_steps});
```
When warmup_steps > 0, main_scheduler can be nullptr (e.g., if config.type is unsupported/"none" or sub-configs filter out to empty), but the code unconditionally passes it into SequentialLR, which will CHECK-fail on a null child. Add an explicit check here (either return only the warmup scheduler, or fail fast with a clear error) before constructing the SequentialLR.
```cpp
StepLR::StepLR(std::shared_ptr<Optimizer> optimizer, int64_t step_size, float gamma, int64_t last_step)
    : LRScheduler(std::move(optimizer), last_step), step_size_(step_size), gamma_(gamma) {
    CHECK_GT(step_size_, 0) << "StepLR: step_size must be > 0.";
```
StepLR validates step_size_ > 0 but doesn't validate gamma_. Non-positive gamma_ can produce negative/NaN learning rates via pow() and via the multiplicative chained form. Consider CHECK_GT(gamma_, 0.0f) (and possibly a sensible upper bound if desired) to match typical scheduler expectations.
Suggested change:
```diff
 CHECK_GT(step_size_, 0) << "StepLR: step_size must be > 0.";
+CHECK_GT(gamma_, 0.0f) << "StepLR: gamma must be > 0.";
```
```cpp
    }
}

void ChainedScheduler::InitialStep() {}
```
ChainedScheduler::InitialStep() is empty, so a newly created ChainedScheduler keeps last_step_ == -1 while other schedulers start at 0 after Create(). This makes LastStep() inconsistent (and the saved last_step in State() will lag children by 1). Consider setting the initial step to 0 (without advancing children) to keep LastStep() semantics consistent across schedulers.
Suggested change:
```diff
-void ChainedScheduler::InitialStep() {}
+void ChainedScheduler::InitialStep() {
+    // Ensure consistent LastStep semantics with other schedulers:
+    // a newly created ChainedScheduler should start at step 0.
+    last_step_ = 0;
+}
```
No description provided.