Two evaluation samples have the same average score, but one has a much wider range. What should you inspect next?
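
One way to make the question concrete is to compute spread statistics alongside the mean. The sketch below is only an illustration: the scores are hypothetical values invented for the example, and `spread_summary` is a helper name introduced here, not something from the original question. It shows that two samples can share a mean while differing sharply in range and standard deviation, which suggests looking at the variability and the individual outlying scores next.

```python
import statistics

# Hypothetical scores, made up for illustration only.
sample_a = [70, 71, 69, 70, 70]    # tightly clustered around 70
sample_b = [50, 90, 70, 40, 100]   # same mean, much wider spread


def spread_summary(scores):
    """Return the mean, range, and sample standard deviation of the scores."""
    return {
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),
        "stdev": statistics.stdev(scores),
    }


summary_a = spread_summary(sample_a)
summary_b = spread_summary(sample_b)

print("A:", summary_a)
print("B:", summary_b)
```

If the wider sample's spread comes from a few extreme scores, inspecting those individual items (and the distribution shape, for example via quartiles) is usually more informative than the average alone.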