live · streaming traces | Phoenix MCP · OpenInference | Gemini ADK · Cloud Run
Run target: anonymized YC-shaped JD · "B2B vertical-AI for ops teams" → spec-fit converged in 7 iterations
CONVERGED · judge_pass=5/6 · Δ<1.5pt · t=4h 12m
Spec-fit (current)
96.2% · +25.1 vs t0
LLM-as-judge · 6-flow weighted
Iterations
7 · 3 regen-only
stop rule: Δ<2pt for 2 cycles
Token spend
$4.18 · +$0.62 vs budget
2.41M in · 0.87M out · Gemini 2.x (cost arithmetic sketched below)
Time to deploy
11m 04s · vs YC: 6mo
first Cloud Run URL live · loop continues in background
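The Token spend card above is straightforward arithmetic over metered token counts. A minimal sketch of that computation in Python, with placeholder per-million rates (actual Gemini 2.x pricing varies by model and tier; the $4.18 figure comes from the run's own metering, not from these numbers):

```python
def run_cost_usd(tokens_in: int, tokens_out: int,
                 rate_in_per_m: float, rate_out_per_m: float) -> float:
    """Dollar cost of a run from raw token counts.

    Rates are USD per 1M tokens. Callers supply them; the values used
    below are placeholders, not actual Gemini 2.x pricing.
    """
    return (tokens_in * rate_in_per_m + tokens_out * rate_out_per_m) / 1_000_000

# Counts from the card above: 2.41M input, 0.87M output tokens.
# HYPOTHETICAL rates, for illustration only.
print(f"${run_cost_usd(2_410_000, 870_000, rate_in_per_m=1.00, rate_out_per_m=2.00):.2f}")
```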
Spec-fit over time · per iteration
[chart · series: spec-fit %, judge confidence, regen events]
Token spend per iteration
[chart · series: input, output]
Regenerated-flow heatmap · iteration × flow
[heatmap · cell scale: 0 / 1 / 2+]
Flow-by-flow funnel · final iteration (i7)
judge: gemini-2.x · weight = traffic share (aggregation sketched after the table)
Flow          Spec-fit   Regens   Judge
/landing      98.4%      0        PASS
/pricing      96.1%      2        PASS
/onboarding   94.8%      2        PASS
/api-doc      91.2%      2        SOFT
/dashboard    97.0%      0        PASS
/auth         99.1%      1        PASS
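The headline spec-fit is a traffic-weighted mean of the per-flow scores above ("weight = traffic share"). A minimal sketch of that aggregation, using made-up traffic shares since the dashboard doesn't expose the real ones:

```python
# Per-flow spec-fit from the funnel table (final iteration, i7).
spec_fit = {
    "/landing": 98.4, "/pricing": 96.1, "/onboarding": 94.8,
    "/api-doc": 91.2, "/dashboard": 97.0, "/auth": 99.1,
}

# HYPOTHETICAL traffic shares (must sum to 1.0); illustration only.
traffic_share = {
    "/landing": 0.30, "/pricing": 0.15, "/onboarding": 0.15,
    "/api-doc": 0.05, "/dashboard": 0.25, "/auth": 0.10,
}

weighted = sum(spec_fit[f] * traffic_share[f] for f in spec_fit)
print(f"weighted spec-fit: {weighted:.1f}%")
```

Note that a SOFT verdict like /api-doc's still contributes its score to the weighted mean, which would explain how judge_pass=5/6 coexists with a 96%+ aggregate.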
Opus 4.7 trace narrative · summarized from a 220K-token span tree:
Run started at spec-fit 41% — Gemini extractor missed the pricing-tier comparison that the JD's "vertical-AI for ops" framing implied. Phoenix MCP flagged a 12-point judge-confidence drop on /pricing at i2; the planner regenerated only that flow (cost: $0.58, not a full rebuild). The same pattern repeated on /onboarding (i4, judge cited missing role-selector) and /api-doc (i6, missing auth-token snippet). At i7 the stop rule triggered (Δ<2pt across two cycles). Net: 7 iterations, $4.18, 11m to first deploy, 4h12m to convergence. YC comparable: 0 deploys in 6 months — see header banner.
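The narrative above describes the loop's control flow: judge every flow, regenerate only the flows the judge flags, stop once the aggregate moves less than 2 points for two consecutive cycles. A minimal sketch of that loop, assuming hypothetical judge_flow and regenerate_flow helpers (the real run drives these through Gemini ADK with Phoenix MCP tracing):

```python
def converge(flows, judge_flow, regenerate_flow,
             delta_pt=2.0, window=2, max_iters=20):
    """Regenerate weak flows until spec-fit plateaus.

    judge_flow(flow) -> (score_pct, passed); regenerate_flow(flow)
    rebuilds one flow in place. Both are hypothetical stand-ins for
    the ADK-driven calls. Stops when the aggregate moves < delta_pt
    for `window` consecutive cycles (the "Δ<2pt for 2 cycles" rule).
    """
    history = []
    for i in range(1, max_iters + 1):
        scores = {f: judge_flow(f) for f in flows}
        # Plain mean here; the real aggregate is traffic-weighted.
        aggregate = sum(s for s, _ in scores.values()) / len(scores)
        history.append(aggregate)

        # Stop rule: deltas below threshold across the last `window` cycles.
        deltas = [abs(history[-k] - history[-k - 1])
                  for k in range(1, min(window, len(history) - 1) + 1)]
        if len(deltas) == window and all(d < delta_pt for d in deltas):
            return i, aggregate

        # Regenerate only flagged flows, never the whole app.
        for f, (_, passed) in scores.items():
            if not passed:
                regenerate_flow(f)
    return max_iters, history[-1]
```

Regenerating only the flagged flow is the cost lever the narrative points at: the /pricing rebuild at i2 cost $0.58 against $4.18 for the entire run.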