How evals went from research curiosity to the only thing that ships — a five-year history — step 8 of 8
One last thing before we move on. Same surface as a write step — but the lesson doesn't complete until this passes.
Final drill. Synthesize the chapter's history into one function:
triage_teams(teams) that takes a list of team profiles and returns
a dict with two fields:
- verdicts: a dict mapping each team's name to its verdict string (same four tiers as step 07)
- most_at_risk: the name of the team with the LOWEST readiness score (the team most likely to break on the next model launch)
Score each team using the same rules as step 07:
- has_test_set True: +25
- has_judge_prompt True: +15
- ci_runs_evals True: +25
- tracks_eval_history True: +15
- eval_count_per_feature >= 20: +20
Verdict tiers: >=80 "eval-mature", >=50 "eval-aware", >=20
"eval-curious", else "vibes era".
On ties for lowest score, return the FIRST tied team (Python's
min with key= returns the first minimal element, so insertion
order breaks ties).
Five teams run. Expected output:
verdicts: {'CursorClone': 'eval-mature', 'SeriesBAI': 'eval-mature', 'SeedAI': 'eval-curious', 'ConsultingShop': 'eval-curious', 'PromptShopCo': 'vibes era'}
most at risk: PromptShopCo
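One way the drill can be solved, as a sketch: score each team with step 07's rules, map scores to verdict tiers, and let min pick the first lowest-scoring team. The sample team profiles below are hypothetical, constructed only to reproduce the expected verdicts above.

```python
def score_team(team):
    """Readiness score using step 07's rules (max 100)."""
    score = 0
    if team.get("has_test_set"):
        score += 25
    if team.get("has_judge_prompt"):
        score += 15
    if team.get("ci_runs_evals"):
        score += 25
    if team.get("tracks_eval_history"):
        score += 15
    if team.get("eval_count_per_feature", 0) >= 20:
        score += 20
    return score


def verdict(score):
    """Map a score to one of the four verdict tiers."""
    if score >= 80:
        return "eval-mature"
    if score >= 50:
        return "eval-aware"
    if score >= 20:
        return "eval-curious"
    return "vibes era"


def triage_teams(teams):
    scores = {t["name"]: score_team(t) for t in teams}
    return {
        "verdicts": {name: verdict(s) for name, s in scores.items()},
        # min() returns the first minimal key in iteration order,
        # so the FIRST tied team wins on ties for lowest score.
        "most_at_risk": min(scores, key=scores.get),
    }


# Hypothetical profiles chosen to match the expected output:
teams = [
    {"name": "CursorClone", "has_test_set": True, "has_judge_prompt": True,
     "ci_runs_evals": True, "tracks_eval_history": True, "eval_count_per_feature": 40},
    {"name": "SeriesBAI", "has_test_set": True, "has_judge_prompt": True,
     "ci_runs_evals": True, "tracks_eval_history": True, "eval_count_per_feature": 10},
    {"name": "SeedAI", "has_test_set": True, "has_judge_prompt": False,
     "ci_runs_evals": False, "tracks_eval_history": False, "eval_count_per_feature": 5},
    {"name": "ConsultingShop", "has_test_set": False, "has_judge_prompt": True,
     "ci_runs_evals": False, "tracks_eval_history": True, "eval_count_per_feature": 0},
    {"name": "PromptShopCo", "has_test_set": False, "has_judge_prompt": False,
     "ci_runs_evals": False, "tracks_eval_history": False, "eval_count_per_feature": 0},
]

result = triage_teams(teams)
```

Note the helper names (score_team, verdict) and the exact field values in the sample teams are assumptions; only triage_teams and its return shape are specified by the drill.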