Optimizers and learning-rate schedules
Two pieces sit around the gradient in real training code: the optimizer and the scheduler. Both are simpler than their names.
The optimizer = the update rule
The optimizer is just the rule that turns a gradient into a parameter change. You already wrote the simplest one by hand:
- SGD (plain gradient descent): w = w - lr * grad. Honest and basic.
- Momentum: remembers the last few steps' direction and keeps rolling, so it powers through flat spots instead of crawling.
- Adam: the popular default — it adapts the step size per parameter using a running memory of recent gradients, so you fuss less over one global learning rate.
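To make the three rules concrete, here is a minimal sketch in plain Python. The toy problem (minimizing f(w) = w**2), the function names, and the hyperparameter values are assumptions chosen for illustration. The momentum and Adam formulas follow one common textbook formulation, and real libraries may differ in details.

```python
import math

def sgd_step(w, grad, lr):
    """Plain gradient descent: move against the gradient."""
    return w - lr * grad

def momentum_step(w, grad, velocity, lr, beta=0.9):
    """Momentum: fold the new gradient into a running velocity, then move along it."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale the step using running averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy problem (an assumption for illustration): f(w) = w**2, so the gradient is 2*w.
def grad_fn(w):
    return 2.0 * w

w_sgd = w_mom = w_adam = 5.0
velocity = m = v = 0.0
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, grad_fn(w_sgd), lr=0.1)
    w_mom, velocity = momentum_step(w_mom, grad_fn(w_mom), velocity, lr=0.01)
    w_adam, m, v = adam_step(w_adam, grad_fn(w_adam), m, v, t, lr=0.1)

print(w_sgd, w_mom, w_adam)  # all three should end closer to the minimum at 0
```

Note that each rule carries a different amount of state: SGD keeps none, momentum keeps one velocity per parameter, and Adam keeps two running averages plus a step counter.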
In PyTorch this is one line: optimizer = torch.optim.Adam(params, lr=...). The training loop then calls optimizer.step() to apply the update. You rarely implement these; you pick one (Adam is a fine default) and move on.
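As a rough sketch of where the optimizer sits in a loop, the example below fits a one-parameter linear model with Adam. The toy data (y = 3x + 1), the lr value of 0.05, and the step count are assumptions chosen for illustration, not recommended settings.

```python
import torch

torch.manual_seed(0)

# Toy data (an assumption for illustration): learn y = 3x + 1.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 3.0 * x + 1.0

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)  # lr picked arbitrarily
loss_fn = torch.nn.MSELoss()

first_loss = None
for step in range(300):
    optimizer.zero_grad()          # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # compute gradients
    optimizer.step()               # apply the Adam update rule
    if first_loss is None:
        first_loss = loss.item()

final_loss = loss_fn(model(x), y).item()
print(first_loss, final_loss)      # the final loss should be much smaller
```

Swapping in another optimizer, for example torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9), changes only the constructor line. The loop itself stays the same.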
The scheduler = changing the lr over time
A scheduler changes the learning rate as training progresses. The standard shape is decay: start with a larger lr to cover ground fast, then shrink it so you settle gently into the minimum instead of bouncing around it.
Run the editor: a step-decay schedule halves the lr each epoch (base * 0.5**epoch). With base = 0.1, epochs 0 through 3 give 0.1, 0.05, 0.025, 0.0125. Big steps early, fine steps late.
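The halving schedule can be written out in a few lines. This is a minimal sketch: the function name step_decay and its factor parameter are hypothetical, introduced here only for illustration.

```python
def step_decay(base, epoch, factor=0.5):
    """Learning rate for a given epoch: multiply base by factor once per epoch."""
    return base * factor ** epoch

schedule = [step_decay(0.1, epoch) for epoch in range(4)]
print(schedule)  # [0.1, 0.05, 0.025, 0.0125]
```

In PyTorch, torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5) should produce the same halving, provided scheduler.step() is called once per epoch.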
Why a builder cares
You won't derive Adam, but you'll choose an optimizer and a schedule (or read the ones an AI chose) and reason about the tradeoff: a schedule that stays flat at a small lr wastes time crawling, while one that decays too fast stalls before you reach the bottom. The mental model that transfers is: fast early, careful late.
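The two failure modes can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2 under three hypothetical schedules. All numbers are assumptions picked to make the contrast visible. The problem is noise-free, so it only illustrates crawling and stalling, not the bouncing that decay helps with in noisy training.

```python
def run(schedule, steps=50, w=10.0):
    """Gradient descent on f(w) = w**2 with lr = schedule(step); returns final |w|."""
    for step in range(steps):
        lr = schedule(step)
        w = w - lr * 2.0 * w   # the gradient of w**2 is 2*w
    return abs(w)

# Flat at a small lr: every step is tiny, so 50 steps barely move w.
too_flat = run(lambda step: 0.001)

# Shrinks 10x per step: the lr is near zero after a few steps, so w stalls far from 0.
too_fast = run(lambda step: 0.9 * 0.1 ** step)

# Shrinks 10% per step: large moves early, small moves late.
gentle = run(lambda step: 0.9 * 0.9 ** step)

print(too_flat, too_fast, gentle)
```

Starting from w = 10, the first two schedules should finish still far from the minimum at 0, while the gentle decay should end very close to it.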