Slice-based error analysis: where the average lies
A single accuracy number is an average, and an average can hide the groups a model is failing. Suppose a ticket router is reported as "67% accurate overall": 4 of 6 test tickets routed correctly. That sounds mediocre but usable, until you split it.
Slice it by who it serves
A slice is a subgroup of your data: a language, a region, a customer tier, a device, a ticket type. Compute accuracy per slice and the hidden failure jumps out:
- English tickets: 4 of 4 correct → 100%
- Spanish tickets: 0 of 2 correct → 0%
The model isn't "67% good." It's perfect for English speakers and completely broken for Spanish speakers. Shipping the average means shipping a tool that silently fails an entire group of customers.
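As a quick check on the arithmetic, the sketch below rebuilds the overall figure from the two slice counts listed above. The dictionary layout and variable names are illustrative assumptions, not part of any particular tool.

```python
# (correct, total) per language, taken from the example above.
slices = {"English": (4, 4), "Spanish": (0, 2)}

# The overall number pools every ticket, so the larger slice dominates it.
total_correct = sum(correct for correct, _ in slices.values())
total_cases = sum(total for _, total in slices.values())
overall = total_correct / total_cases

# Per-slice accuracy keeps each group visible.
per_slice = {name: correct / total for name, (correct, total) in slices.items()}

print(f"overall: {overall:.0%}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.0%}")
```

The overall figure is just the slice accuracies weighted by slice size, which is why a small failing group barely moves it.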
Why a builder always slices
Aggregate metrics are how bad models pass review. The real question is "who does it fail, and how badly?" Slicing turns "good enough on average" into "unusable for Spanish tickets: fix retrieval for that language before launch." You pick slices from how the business is actually segmented: the groups where a silent failure would cost you trust, money, or a lawsuit.
The move: group the cases by a slice key, compute accuracy within each group, and read the worst one.
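A minimal sketch of that move, assuming each case is a dict holding a slice field plus the expected and predicted labels. The field names (`language`, `expected`, `predicted`), the queue labels, and the helper `accuracy_by_slice` are hypothetical, chosen only to mirror the 4-of-4 and 0-of-2 example.

```python
from collections import defaultdict


def accuracy_by_slice(cases, slice_key):
    """Return {slice_value: (correct, total, accuracy)} grouped by slice_key."""
    counts = defaultdict(lambda: [0, 0])  # slice value -> [correct, total]
    for case in cases:
        bucket = counts[case[slice_key]]
        bucket[0] += int(case["predicted"] == case["expected"])
        bucket[1] += 1
    return {key: (c, t, c / t) for key, (c, t) in counts.items()}


# Hypothetical evaluation cases: four English tickets, two Spanish tickets.
cases = [
    {"language": "en", "expected": "billing", "predicted": "billing"},
    {"language": "en", "expected": "login", "predicted": "login"},
    {"language": "en", "expected": "refund", "predicted": "refund"},
    {"language": "en", "expected": "billing", "predicted": "billing"},
    {"language": "es", "expected": "billing", "predicted": "login"},
    {"language": "es", "expected": "refund", "predicted": "billing"},
]

report = accuracy_by_slice(cases, "language")

# Read the worst slice first: the one with the lowest accuracy.
worst = min(report, key=lambda key: report[key][2])

for key, (correct, total, acc) in sorted(report.items(), key=lambda kv: kv[1][2]):
    print(f"{key}: {correct} of {total} correct ({acc:.0%})")
print(f"worst slice: {worst}")
```

Returning the counts alongside the accuracy is a deliberate choice here: a slice with only two cases deserves more test data before its percentage is trusted, and the raw counts make that visible.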