promptdojo_

Write run_eval_suite(cases) that:

  • Takes a list of dicts, each shaped {"question": str, "expected": str, "max_turns": int}.
  • For each case:
    • Calls run_agent(question) (provided) which returns {"answer": str, "trace": list}.
  • Checks THREE conditions and combines them into a single pass/fail boolean: a) the expected substring is in answer, b) len(trace) <= max_turns, c) trace[-1]["stop_reason"] == "end_turn".
  • Counts passes (all three asserts true) and fails.
  • Returns {"total": <count>, "passed": <count>, "failed": <count>, "pass_rate": <0.0-1.0 rounded to 2 places>}.

Three cases provided. Expected output:

total=3 passed=2 failed=1 pass_rate=0.67
