Precision, recall, and the threshold dial

The confusion matrix gives four counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Two ratios built from them answer the questions a business actually asks.

  • Precision = TP / (TP + FP). Of everything we flagged as fraud, what fraction was really fraud? Low precision = lots of false alarms annoying real customers.
  • Recall = TP / (TP + FN). Of all the fraud that existed, what fraction did we catch? Low recall = fraud slipping through.

Run the editor: this model has precision 0.80 (4 of 5 flags are real) but recall 0.67 (it catches only 2 of every 3 fraud cases).
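To see where those two numbers come from, here is a minimal sketch of both formulas. The counts (4 true positives, 1 false positive, 2 false negatives) are an assumption chosen only because they reproduce the 0.80 and 0.67 quoted above; the lesson's actual counts may be different multiples of these.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged, the fraction that was really fraud."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp: int, fn: int) -> float:
    """Of all the fraud that existed, the fraction that was caught."""
    actual_fraud = tp + fn
    return tp / actual_fraud if actual_fraud else 0.0


# Hypothetical counts, picked to match the ratios in the text.
tp, fp, fn = 4, 1, 2

print(f"precision = {precision(tp, fp):.2f}")  # 4 / (4 + 1) -> 0.80
print(f"recall    = {recall(tp, fn):.2f}")     # 4 / (4 + 2) -> 0.67
```

Note that true negatives appear in neither formula: precision and recall only look at what was flagged and what was actually fraud.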

The dial: the threshold

Most models don't output "fraud"/"legit" directly — they output a score from 0 to 1. You pick a threshold: score ≥ threshold → flag it. Moving that dial trades the two metrics against each other:

  • Lower the threshold → flag more requests → catch more fraud (recall up) but more false alarms (precision down).
  • Raise the threshold → flag only the obvious cases → fewer false alarms (precision up) but more fraud slips by (recall down).
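The two bullets above can be checked on a toy example. The sketch below uses ten made-up scores and labels (not data from the lesson) and a hypothetical helper, metrics_at, that recomputes precision and recall at a given threshold.

```python
# Hypothetical model scores and true labels (1 = fraud, 0 = legit).
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def metrics_at(threshold, scores, labels):
    """Return (precision, recall) when flagging every score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.30, 0.50, 0.85):
    p, r = metrics_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.30  precision=0.57  recall=1.00
# threshold=0.50  precision=0.60  recall=0.75
# threshold=0.85  precision=1.00  recall=0.50
```

On this toy data, raising the threshold moves precision up and recall down at every step. On real data, recall can only fall or stay flat as the threshold rises, while precision usually rises but can dip locally.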

The builder move: pick the metric the risk demands

There is no "best" threshold in the abstract — it depends on what a mistake costs. Missing fraud might cost $5,000; a false alarm costs a two-minute human review. When a miss is far more expensive than a false alarm, you tune for recall and accept lower precision. When false alarms burn customer trust, you protect precision. The number you optimize is a business decision, not a math fact.
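One way to make that decision concrete is to price each kind of mistake and pick the threshold with the lowest total cost. The sketch below is illustrative only: the scores and labels are made up, the $5,000 miss cost comes from the text, and the dollar figures for a false alarm ($2 for a quick review, $10,000 for a case where false alarms badly damage customer trust) are assumptions.

```python
# Hypothetical model scores and true labels (1 = fraud, 0 = legit).
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def total_cost(threshold, scores, labels, miss_cost, alarm_cost):
    """Cost of all mistakes made when flagging every score >= threshold."""
    pairs = list(zip(scores, labels))
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)  # false alarms
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)   # missed fraud
    return fn * miss_cost + fp * alarm_cost


def cheapest_threshold(candidates, scores, labels, miss_cost, alarm_cost):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(t, scores, labels, miss_cost, alarm_cost),
    )


candidates = [0.30, 0.50, 0.85]

# A miss costs $5,000 (from the text); a false alarm costs an assumed $2.
print(cheapest_threshold(candidates, scores, labels, 5000.0, 2.0))      # 0.3

# Same miss cost, but an assumed $10,000 per false alarm.
print(cheapest_threshold(candidates, scores, labels, 5000.0, 10000.0))  # 0.85
```

The model and its scores are identical in both calls. Only the prices changed, and the chosen threshold moved from the recall-friendly end to the precision-friendly end.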