Run Evals
This workflow is the day-to-day path for quality checks.
Run the suite
Summarize the run
Re-score without re-running the agent
CI usage pattern
Practical gating policy
- Block merge if smoke tag fails
- Block release if full suite fails
- Keep per-tag pass-rate trends to detect slow quality drift
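The gating policy above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the `EvalResult` record, the `gate` and `pass_rates_by_tag` helpers, and the sample eval names are all hypothetical, and it assumes each result carries a set of tags and a pass/fail flag.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    # Hypothetical record shape; a real runner's output will likely differ.
    name: str
    tags: frozenset
    passed: bool


def pass_rates_by_tag(results):
    """Return {tag: fraction of results with that tag that passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        for tag in r.tags:
            totals[tag] += 1
            passes[tag] += int(r.passed)
    return {tag: passes[tag] / totals[tag] for tag in totals}


def gate(results, smoke_tag="smoke"):
    """Apply the two blocking rules: smoke gates merge, full suite gates release."""
    # Note: with no smoke-tagged results, all() is True and merge is not blocked.
    smoke_ok = all(r.passed for r in results if smoke_tag in r.tags)
    full_ok = all(r.passed for r in results)
    return {"block_merge": not smoke_ok, "block_release": not full_ok}


# Hypothetical sample run: smoke evals pass, one non-smoke eval fails.
results = [
    EvalResult("login_flow", frozenset({"smoke"}), True),
    EvalResult("refund_flow", frozenset({"smoke", "billing"}), True),
    EvalResult("long_context", frozenset({"regression"}), False),
]

decision = gate(results)
rates = pass_rates_by_tag(results)
print(decision)
print(rates)
```

In this sample the merge would go through while the release would be blocked, since only a non-smoke eval fails. Storing the `rates` dictionary from each run (for example, one row per run and tag) is one way to keep the trend data needed to spot slow drift.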