
Experiment tracker (lite)

When you tune a model you run it many times with different settings. An experiment tracker records a receipt for each run — the config you used and the result you got — so you can compare them honestly instead of relying on memory ("I think the lr=0.01 run was better?").

A receipt is just a record:

{"config": {"lr": 0.01, "epochs": 5}, "accuracy": 0.91}

Run the editor: with three runs logged, picking the best is a one-liner (max by accuracy) — and it's reproducible, because the winning config is written down, not remembered.
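As a rough sketch of that one-liner, assuming three logged runs: only the lr=0.01 receipt is taken from the example above. The other two configs and the variable names `receipts` and `best` are made up for illustration.

```python
# Three receipts in a plain list of dicts.
# Only the first one comes from the lesson text; the other two
# configs are hypothetical stand-ins for "other runs".
receipts = [
    {"config": {"lr": 0.01, "epochs": 5}, "accuracy": 0.91},
    {"config": {"lr": 0.1, "epochs": 5}, "accuracy": 0.88},
    {"config": {"lr": 0.001, "epochs": 5}, "accuracy": 0.82},
]

# The one-liner: keep the receipt whose accuracy is highest.
best = max(receipts, key=lambda r: r["accuracy"])

print(best["config"])    # {'lr': 0.01, 'epochs': 5}
print(best["accuracy"])  # 0.91
```

Because `max` returns the whole receipt, the winning config comes back together with its score, so there is nothing to remember separately.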

What a good receipt captures

  • Config — the knobs: learning rate, epochs, which features, the data snapshot. (This ties back to reproducibility: a receipt without the config can't be repeated.)
  • Metric(s) — accuracy, precision/recall, loss — whatever you're optimizing.
  • Result/artifact — where the trained model or its id lives.
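One possible way to capture all three parts in a single record is sketched below. The helper `make_receipt`, the key names `metrics` and `artifact`, and every value other than lr, epochs, and accuracy (feature names, snapshot label, model path) are illustrative assumptions, not a fixed schema.

```python
import copy
import json


def make_receipt(config, metrics, artifact):
    """Bundle one run's settings, scores, and model location into a dict."""
    return {
        # Deep-copy so later changes to the caller's dicts (including
        # nested lists) do not silently alter the stored receipt.
        "config": copy.deepcopy(config),
        "metrics": copy.deepcopy(metrics),
        "artifact": artifact,
    }


receipt = make_receipt(
    config={
        "lr": 0.01,
        "epochs": 5,
        "features": ["age", "income"],   # hypothetical feature names
        "data_snapshot": "2024-01-15",   # hypothetical snapshot label
    },
    metrics={"accuracy": 0.91},
    artifact="models/run_001.pkl",       # hypothetical model path
)

# Receipts made of plain dicts, lists, strings, and numbers serialize
# directly to JSON, so they can be written to a file.
print(json.dumps(receipt, indent=2))
```

Copying the config at logging time is a design choice: the receipt should describe the run as it was, even if the same dict is reused and modified for the next run.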

Real teams use MLflow or Weights & Biases for this; under the hood the idea is the same: each run is stored with its config and metrics, and you sort and compare those receipts. We use a plain list of dicts.
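The sort-and-compare step on a plain list of dicts might look like the sketch below. The runs with accuracy 0.88 and 0.82 use made-up configs, and `receipts`, `ranked`, and `gap` are illustrative names.

```python
# A plain list of dicts, in the order the runs happened.
# The configs of the 0.88 and 0.82 runs are hypothetical.
receipts = [
    {"config": {"lr": 0.1, "epochs": 5}, "accuracy": 0.88},
    {"config": {"lr": 0.001, "epochs": 5}, "accuracy": 0.82},
    {"config": {"lr": 0.01, "epochs": 5}, "accuracy": 0.91},
]

# Sort best-first; sorted() returns a new list and leaves the log untouched.
ranked = sorted(receipts, key=lambda r: r["accuracy"], reverse=True)

# Print a small leaderboard so runs can be compared side by side.
for rank, r in enumerate(ranked, start=1):
    print(f"{rank}. accuracy={r['accuracy']:.2f}  config={r['config']}")

# How far ahead is the winner of the runner-up?
gap = ranked[0]["accuracy"] - ranked[1]["accuracy"]
print(f"lead over runner-up: {gap:.2f}")
```

Sorting rather than taking only the maximum keeps the runner-up visible, which helps when the top two scores are close enough that the difference may be noise.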

Why a builder cares

Without receipts, model tuning is folklore: nobody can say why the shipped model was chosen, and nobody can reproduce it. With receipts, "we picked lr=0.01 because it scored 0.91 vs 0.88 and 0.82 for the other two runs" is a fact you can defend and rerun. Next, you'll log receipts and pick the best one.