Optimizers and learning-rate schedules

Two pieces sit around the gradient in real training code: the optimizer and the scheduler. Both are simpler than their names.

The optimizer = the update rule

The optimizer is just the rule that turns a gradient into a parameter change. You already wrote the simplest one by hand:

  • SGD (plain gradient descent): w = w - lr * grad. Honest and basic.
  • Momentum: keeps a running average of recent gradient directions and keeps rolling along it, so it powers through flat spots instead of crawling.
  • Adam: the popular default. It adapts the step size per parameter using running averages of recent gradients and their squares, so you fuss less over one global learning rate.
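To make the three rules concrete, here is a minimal sketch of each as a plain Python function acting on a single scalar weight. The function names and the default values (beta=0.9, beta1=0.9, beta2=0.999, eps=1e-8) are assumptions chosen for illustration, and the momentum variant shown is one common formulation among several; none of this is any library's actual implementation.

```python
import math

def sgd_step(w, grad, lr):
    # Plain gradient descent: step against the gradient.
    return w - lr * grad

def momentum_step(w, grad, velocity, lr, beta=0.9):
    # Velocity accumulates past gradients, decayed by beta each step.
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: running average of gradients; v: running average of squared gradients.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: both averages start at zero; t counts steps from 1.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is scaled per parameter by the typical gradient magnitude.
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# One step of each rule from w = 1.0 with gradient 2.0 and lr = 0.1.
w_sgd = sgd_step(1.0, 2.0, lr=0.1)
w_mom, vel = momentum_step(1.0, 2.0, velocity=0.0, lr=0.1)
w_adam, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1, lr=0.1)
print(w_sgd, w_mom, w_adam)
```

On the very first step, momentum matches SGD because the velocity starts at zero, while Adam moves by roughly lr regardless of how large the gradient is. That normalisation is why Adam is less sensitive to the choice of one global learning rate.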

In PyTorch this is one line, optimizer = torch.optim.Adam(params, lr=...), and the training loop calls optimizer.step() after loss.backward() has computed the gradients. You rarely implement these; you pick one (Adam is a fine default) and move on.
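As a rough sketch of where the optimizer sits in a loop, the example below fits a one-parameter linear model to synthetic data. The data, the model, the lr of 0.05, and the 200-step budget are all assumptions made up for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic data for y = 2x + 1 (an assumption for this sketch).
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 2.0 * x + 1.0

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()         # clear gradients left from the previous step
    loss = loss_fn(model(x), y)   # forward pass
    loss.backward()               # compute gradients
    optimizer.step()              # apply the update rule
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Swapping in a different optimizer, for example torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9), changes only the constructor line; the loop itself stays the same.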

The scheduler = changing the lr over time

A scheduler changes the learning rate as training progresses. The standard shape is decay: start with a larger lr to cover ground fast, then shrink it so you settle gently into the minimum instead of bouncing around it.

For example, a step-decay schedule that halves the lr each epoch (base * 0.5**epoch) gives, starting from a base of 0.1: 0.1, 0.05, 0.025, 0.0125. Big steps early, fine steps late.
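The halving schedule can be sketched in a few lines of plain Python. The function name step_decay and its factor parameter are hypothetical names chosen for this illustration.

```python
def step_decay(base_lr, epoch, factor=0.5):
    # Multiply the base lr by `factor` once per completed epoch.
    return base_lr * factor ** epoch

schedule = [step_decay(0.1, epoch) for epoch in range(4)]
print(schedule)  # [0.1, 0.05, 0.025, 0.0125]
```

In PyTorch, a comparable schedule is available as torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5), with scheduler.step() called once per epoch.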

Why a builder cares

You won't derive Adam, but you'll choose an optimizer and a schedule (or read the ones an AI chose) and reason about the tradeoff: a schedule that is too flat wastes time crawling; one that decays too fast stalls before you reach the bottom. The mental model, fast early and careful late, is the part that transfers.
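The "decays too fast" failure can be seen in a toy setting. The sketch below runs plain gradient descent on f(w) = w**2 from w = 10 under two decay factors; the function, the starting point, the base lr of 0.1, and the 50-step budget are all assumptions picked to make the effect visible.

```python
def run(decay_factor, base_lr=0.1, steps=50, w=10.0):
    # Gradient descent on f(w) = w**2, whose gradient is 2 * w.
    for t in range(steps):
        lr = base_lr * decay_factor ** t   # lr shrinks every step
        w = w - lr * 2 * w
    return w

stalled = run(decay_factor=0.5)    # lr collapses within a handful of steps
gentle = run(decay_factor=0.95)    # lr shrinks slowly
print(stalled, gentle)
```

With a factor of 0.5 the learning rate is effectively zero after a few steps, so w stops well short of the minimum at 0. With 0.95 the steps stay useful for long enough to get close.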