DJ Percy AI quality

Adoption, eval quality, reliability, and operational performance for the public demo.

  • Patterns generated: 26. Completed auto_layered runs in the rolling window.
  • Successful edits: 88.5%. Runs where Percy applied at least one meaningful pattern change.
  • Eval pass rate: 73.1%. Structure eval pass rate (deterministic checks).
  • Median generation time: 6.6 s. End-to-end run duration (not TTFT).
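The eval pass rate above refers to deterministic structure checks on the generated pattern. Percy's actual schema and checks are not shown in this report; the sketch below is a hypothetical example of what a deterministic check can look like (the required keys, BPM range, and pattern shape are all assumptions).

```python
# Hypothetical deterministic structure check, in the spirit of the
# "Eval pass rate" metric. Field names and ranges are assumptions,
# not Percy's real schema.

REQUIRED_KEYS = {"tracks", "steps", "bpm"}  # assumed fields

def structure_eval(pattern: dict) -> bool:
    """Return True when the pattern passes every deterministic check."""
    if not REQUIRED_KEYS.issubset(pattern):
        return False
    # BPM must be numeric and in a plausible musical range (assumed bounds).
    if not isinstance(pattern["bpm"], (int, float)) or not 40 <= pattern["bpm"] <= 300:
        return False
    # Every track must contain exactly the declared number of steps.
    return all(len(track) == pattern["steps"]
               for track in pattern["tracks"].values())

ok = structure_eval({"bpm": 120, "steps": 16,
                     "tracks": {"kick": [0] * 16, "snare": [0] * 16}})
```

Because every check is a pure function of the pattern, a run's pass/fail outcome is reproducible, which is what makes the weekly pass-rate chart comparable across weeks.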

Quality

  • Eval & successful edits by week: weekly eval pass % vs. successful edits % (grouped bars).
  • Structure eval outcomes by week: passed vs. failed structure checks (counts per week).

Reliability & task success

  • Run outcomes by week: meaningful change vs. no change vs. failed.
  • Runs with meaningful change: 23.
  • Tool success (sum of steps): 25.
  • No-change rate (runs with eval): 11.5%. Among completed runs that recorded a reliability eval snapshot.
  • Failure rate: 0.0% (0 failed of 26 runs).
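The reliability split separates "assistant ran" from "pattern meaningfully changed." A minimal sketch of how these rates can be derived from per-run records follows; the field names (`status`, `meaningful_change`) are assumptions, and the run list just mirrors the counts reported above (23 meaningful, 3 no-change, 0 failed).

```python
# Classify each run into one of three outcomes, then compute the rates
# shown on the dashboard. Record fields are assumed, not Percy's real schema.

def outcome(run: dict) -> str:
    if run["status"] != "completed":
        return "failed"
    return "meaningful_change" if run["meaningful_change"] else "no_change"

# Synthetic run list matching the reported counts: 23 meaningful, 3 no-change.
runs = ([{"status": "completed", "meaningful_change": True}] * 23
        + [{"status": "completed", "meaningful_change": False}] * 3)

counts: dict[str, int] = {}
for r in runs:
    counts[outcome(r)] = counts.get(outcome(r), 0) + 1

success_rate = counts.get("meaningful_change", 0) / len(runs)  # 23/26 -> 88.5%
no_change_rate = counts.get("no_change", 0) / len(runs)        # 3/26  -> 11.5%
failure_rate = counts.get("failed", 0) / len(runs)             # 0/26  -> 0.0%
```

Keeping the classifier to one small pure function means the same logic can score both live runs and historical snapshots.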

Performance

  • Generation time by week: median and p95 duration (ms) for completed runs.
  • Median duration: 6.6 s.
  • P95 duration: 12.4 s.
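The median/p95 roll-up can be computed directly from per-run durations. The sketch below uses made-up millisecond values, not the demo's real data, and picks the nearest-rank convention for p95 (one of several common percentile conventions).

```python
# Aggregate per-run durations into the median and p95 figures reported
# above. Durations are illustrative, not the demo's actual measurements.
import statistics

def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (one common convention)."""
    ordered = sorted(values)
    k = max(0, int(0.95 * len(ordered) + 0.5) - 1)  # nearest-rank index
    return ordered[k]

durations_ms = [5200, 6100, 6600, 7000, 9800, 12400]
median_ms = statistics.median(durations_ms)  # 6800.0 for this sample
p95_ms = p95(durations_ms)                   # 12400 for this sample
```

Tracking p95 alongside the median matters because a handful of slow runs can dominate the perceived experience without moving the median at all.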

  • Total tokens (prompt + completion): 446,178.
  • Estimated cost (sum): $0.5068.
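The cost figure is an estimate summed over runs. A minimal sketch of that roll-up, assuming per-1K-token prices (the rates and the prompt/completion split below are placeholders; real pricing depends on the model and is not stated in this report):

```python
# Token-based cost estimate. Both prices are placeholder assumptions,
# not the rates behind the $0.5068 figure above.

PROMPT_PRICE_PER_1K = 0.0010      # assumed USD per 1K prompt tokens
COMPLETION_PRICE_PER_1K = 0.0020  # assumed USD per 1K completion tokens

def run_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of one run from its token counts."""
    return (prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            + completion_tokens / 1000 * COMPLETION_PRICE_PER_1K)

# Illustrative split of the reported 446,178 total tokens.
total_usd = run_cost(400_000, 46_178)
```

Summing `run_cost` over completed runs gives the "Estimated cost (sum)" metric; pricing prompt and completion tokens separately matters because most providers charge them at different rates.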

What we learned

  • Reliability matters more than raw completion count: we separate “assistant ran” from “pattern meaningfully changed.”
  • Eval design must pair deterministic structure checks with apply/delta signals so product quality is measurable.
  • Speed (median/p95) is part of the experience for creative tools — we track it alongside eval pass rates.

Auto_layered runs only. As of 5/11/2026, 5:54:00 PM.