DJ Percy AI quality
Adoption, eval quality, reliability, and operational performance for the public demo.
Patterns generated
Completed auto_layered runs in the rolling window.
Successful edits
Runs where Percy applied at least one meaningful pattern change.
Eval pass rate
Structure eval pass rate (deterministic checks).
Median generation time
End-to-end run duration, not time to first token (TTFT).
Quality
Weekly eval pass % vs successful edits % (grouped bars).
Passed vs failed structure checks (counts per week).
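The weekly grouped bars above pair two rates over the same run set. A minimal sketch of that rollup, assuming hypothetical per-run records with `week`, `eval_passed`, and `edited` fields (not Percy's actual schema):

```python
from collections import defaultdict

# Hypothetical run records; field names are illustrative assumptions.
runs = [
    {"week": "2026-W18", "eval_passed": True,  "edited": True},
    {"week": "2026-W18", "eval_passed": False, "edited": False},
    {"week": "2026-W19", "eval_passed": True,  "edited": True},
    {"week": "2026-W19", "eval_passed": True,  "edited": False},
]

def weekly_rates(runs):
    """Group runs by week; compute eval pass % and successful-edit % per bucket."""
    buckets = defaultdict(list)
    for r in runs:
        buckets[r["week"]].append(r)
    out = {}
    for week, rs in sorted(buckets.items()):
        n = len(rs)
        out[week] = {
            "eval_pass_pct": 100 * sum(r["eval_passed"] for r in rs) / n,
            "edit_pct": 100 * sum(r["edited"] for r in rs) / n,
        }
    return out
```

Keeping both percentages over the identical denominator is what makes the grouped bars directly comparable week to week.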
Reliability & task success
Meaningful change vs no change vs failed.
Runs with meaningful change
Tool-call successes (summed across steps)
No-change rate (runs with eval)
Among completed runs that recorded a reliability eval snapshot.
Failure rate
0 of 26 runs failed (0%).
Performance
Median and p95 duration (ms) for completed runs.
Median duration
P95 duration
Total tokens (prompt + completion)
Estimated cost (sum)
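Median and p95 over completed-run durations can be computed a few ways; this sketch uses the nearest-rank method for p95, which is one common convention (the dashboard's exact percentile method is an assumption here):

```python
import math
import statistics

def duration_stats(durations_ms):
    """Return (median, p95) over completed-run durations in milliseconds.

    p95 uses the nearest-rank method: the smallest observed value such
    that at least 95% of samples fall at or below it.
    """
    ordered = sorted(durations_ms)
    median = statistics.median(ordered)
    rank = math.ceil(0.95 * len(ordered))
    p95 = ordered[rank - 1]
    return median, p95

# Example: 20 runs spaced 100 ms apart, from 100 ms to 2000 ms.
m, p = duration_stats([100 * i for i in range(1, 21)])
```

Nearest-rank always returns an observed duration, which keeps p95 honest for small, skewy run counts; interpolating percentile estimators would smooth it instead.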
What we learned
- Reliability matters more than raw completion count: we separate “assistant ran” from “pattern meaningfully changed.”
- Eval design must pair deterministic structure checks with apply/delta signals so product quality is measurable.
- Speed (median/p95) is part of the experience for creative tools, so we track it alongside eval pass rates.
Auto_layered runs only. As of 5/11/2026, 5:54:00 PM.