WhyC / runs / r-20260506-122526 / phoenix-convergence
live · streaming traces · Phoenix MCP · OpenInference · Gemini ADK · Cloud Run

Spec-fit (current)

96.2% · +25.1 vs t0
LLM-as-judge · 6-flow weighted

Iterations

7 · 3 regen-only
stop rule: Δ<2pt for 2 cycles
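The stop rule above (halt when spec-fit improves by less than 2 points for two consecutive cycles) can be sketched as follows; the function name and the intermediate trajectory values are illustrative, not from the run (only 41%, 89%, and 96.2% appear on the dashboard):

```python
def should_stop(spec_fit_history, min_delta=2.0, patience=2):
    """Stop when the last `patience` per-cycle gains are all below `min_delta`.

    spec_fit_history: spec-fit % per iteration, oldest first.
    """
    if len(spec_fit_history) < patience + 1:
        return False  # not enough cycles to judge convergence
    deltas = [b - a for a, b in zip(spec_fit_history, spec_fit_history[1:])]
    return all(d < min_delta for d in deltas[-patience:])

# Illustrative trajectory ending at the run's 96.2%:
# gains of 1.5pt (i5→i6) and 0.7pt (i6→i7) are both < 2pt, so i7 stops.
history = [41.0, 62.0, 74.0, 85.0, 91.0, 94.0, 95.5, 96.2]
```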

Token spend

$4.18 · +$0.62 vs budget
2.41M in · 0.87M out · Gemini 2.x

Time to deploy

11m 04s · vs YC: 6mo
first Cloud Run URL live · loop bg

Spec-fit over time · per iteration

[chart: spec-fit %, judge confidence, and regen events across i0–i7 · regens on /pricing, /onboarding, /api-doc · spec-fit 41% → 89% → 96.2%]

Token spend per iteration

[chart: input vs output token spend per iteration, i0–i7, $0.1–$1.0 · marginal cost trending → 0 as scope narrows to under-spec flows]

Regenerated-flow heatmap · iteration × flow

[heatmap: regen count (0 / 1 / 2+) per flow — /landing, /pricing, /onboard, /api-doc, /dash, /auth — across iterations i1–i7]

Flow-by-flow funnel · final iteration (i7)

judge: gemini-2.x · weight = traffic share
Flow          Spec-fit  Regens  Judge
/landing      98.4%     0       PASS
/pricing      96.1%     2       PASS
/onboarding   94.8%     2       PASS
/api-doc      91.2%     2       SOFT
/dashboard    97.0%     0       PASS
/auth         99.1%     1       PASS
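Since the header notes "weight = traffic share", the headline spec-fit is plausibly a traffic-weighted average of the per-flow scores. A minimal sketch, assuming that aggregation; the traffic shares below are hypothetical (the dashboard does not list them), so the result will not exactly reproduce 96.2%:

```python
# Per-flow spec-fit from the funnel table; "traffic" values are invented
# for illustration and sum to 1.0.
flows = {
    "/landing":    {"fit": 98.4, "traffic": 0.30},
    "/pricing":    {"fit": 96.1, "traffic": 0.15},
    "/onboarding": {"fit": 94.8, "traffic": 0.15},
    "/api-doc":    {"fit": 91.2, "traffic": 0.10},
    "/dashboard":  {"fit": 97.0, "traffic": 0.20},
    "/auth":       {"fit": 99.1, "traffic": 0.10},
}

def weighted_spec_fit(flows):
    """Traffic-weighted average of per-flow spec-fit scores."""
    total = sum(f["traffic"] for f in flows.values())
    return sum(f["fit"] * f["traffic"] for f in flows.values()) / total
```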
Opus 4.7 trace narrative · 220K-token span tree summarized: Run started at spec-fit 41% — Gemini extractor missed the pricing-tier comparison that the JD's "vertical-AI for ops" framing implied. Phoenix MCP flagged a 12-point judge-confidence drop on /pricing at i2; the planner regenerated only that flow (cost: $0.58, not a full rebuild). The same pattern repeated on /onboarding (i4, judge cited missing role-selector) and /api-doc (i6, missing auth-token snippet). At i7 the stop rule triggered (Δ<2pt across two cycles). Net: 7 iterations, $4.18, 11m to first deploy, 4h12m to convergence. YC comparable: 0 deploys in 6 months — see header banner.
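The regen-only pattern the narrative describes — react to a per-flow judge-confidence drop (e.g. the 12-point drop Phoenix MCP flagged on /pricing) by regenerating just the flagged flow rather than rebuilding the whole site — can be sketched as below. The function name and threshold default are hypothetical:

```python
def flows_to_regen(prev_conf, curr_conf, drop_threshold=12.0):
    """Return flows whose judge confidence fell by >= drop_threshold points
    since the previous iteration; only these are regenerated."""
    return [
        flow for flow, conf in curr_conf.items()
        if prev_conf.get(flow, conf) - conf >= drop_threshold
    ]

# Example mirroring the i2 event: /pricing drops 14 points, /landing holds,
# so only /pricing is queued for regeneration.
prev = {"/landing": 90.0, "/pricing": 88.0}
curr = {"/landing": 91.0, "/pricing": 74.0}
```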