For the skeptic who has seen 30 agent demos this year
We didn't believe it either. So we published the broken one too.
Most agent-builder demos are screen recordings of a happy path that breaks the second a judge clicks the link. We expect you to assume the same about WhyC. Here is the broken mock you came to find — next to the actual Cloud Run URL the loop produced from the same job posting. Both are clickable.
What you expected to see · ● broken / typical
one-shot codegen, no eval loop · localhost:3000 (down)
Honest disclosure (auto-generated by Opus 4.7 from the trace bundle): The deployed preview above is real and clickable, but the pricing page route still scores 0.71 spec-fit (target: 0.85) — the loop is queued for one more pass tonight. Login is mocked. The synthetic JD (job description) used for this run was generated by us; no real YC company is named, screenshotted, or implied. We do not promise this works on every input — see the failures log below.
Why you'll hate this product
Five objections we expect from a reasonable skeptic: each crossed-out objection gets our honest answer, and every answer links to the Phoenix trace that proves it.
| Your objection | What we actually do |
| --- | --- |
| "Agent demos are cherry-picked screen recordings." | Every run publishes its full OpenInference trace. The hero card above links to a live URL, not a video. Pick a different JD and submit it yourself. |
| "One-shot codegen never matches the spec." | Correct — that's why we don't do one-shot. Phoenix MCP scores each flow against the extracted spec; only flows below 0.85 are regenerated (sketched after this table). Convergence is logged, not claimed. |
| "Spec-fit scoring is just the LLM marking its own homework." | Fair. The judge prompt, rubric, and 12 disagreement cases (where a human reviewer overruled it) are public in the repo. We log false positives, not just successes. |
| "Generated UI looks generic and the copy hallucinates." | Sometimes, yes. Hero copy is templated against extracted facts; if the JD is too vague, the agent refuses to invent and asks for clarification instead of bluffing. |
| "This is just YC-orange shitposting in a wrapper." | The tone is satirical; the artifact is not. WhyC the product never names or screenshots a YC company, and the deployed previews are generic enough to use for your own startup the next morning. |
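To make the loop concrete, here is a minimal sketch of the regenerate-below-threshold pattern described above. Everything in it (`generate_flow`, `score_spec_fit`, the rubric axes and weights) is a hypothetical stand-in, not the actual WhyC code or the Phoenix MCP API; the real run scores flows through Phoenix and publishes each pass as an OpenInference trace.

```python
# Minimal sketch of the converge-or-log loop: regenerate any flow whose
# weighted spec-fit lands below the 0.85 target, up to a fixed pass budget.
# generate_flow() and score_spec_fit() are stubs standing in for the real
# generation call and the Phoenix MCP judge -- both hypothetical.

import random

TARGET = 0.85     # spec-fit threshold below which a flow is regenerated
MAX_PASSES = 5    # give up (and publish to the failures log) after this many

# Hypothetical rubric axes and weights. A persona-alignment axis is the kind
# of thing the 2026-05-06 failure below forced into rubric v0.4.1.
RUBRIC_WEIGHTS = {
    "copy_grounded_in_jd": 0.40,
    "routes_match_spec": 0.35,
    "persona_alignment": 0.25,
}

def generate_flow(spec: dict, feedback: str | None) -> str:
    """Stand-in for the generation call that produces (or repairs) one flow."""
    return f"<flow for {spec['name']}, feedback={feedback!r}>"

def score_spec_fit(flow: str) -> dict[str, float]:
    """Stand-in for the judge; returns one score per rubric axis."""
    return {axis: random.uniform(0.5, 1.0) for axis in RUBRIC_WEIGHTS}

def converge(spec: dict) -> tuple[str, float, int]:
    """Regenerate until the weighted spec-fit clears TARGET or passes run out."""
    flow, fit, feedback = "", 0.0, None
    for n in range(1, MAX_PASSES + 1):
        flow = generate_flow(spec, feedback)
        scores = score_spec_fit(flow)
        fit = sum(RUBRIC_WEIGHTS[axis] * s for axis, s in scores.items())
        print(f"pass {n}: spec-fit {fit:.2f}")  # convergence is logged, not claimed
        if fit >= TARGET:
            break
        # Feed the weakest axis back into the next generation pass.
        feedback = min(scores, key=scores.get)
    return flow, fit, n

flow, fit, passes = converge({"name": "pricing-page"})
print(f"final spec-fit {fit:.2f} after {passes} pass(es)")
```

In the real system the scores come back over MCP and every pass is a span in the published trace; this sketch only preserves the control flow.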
Where we lose, marked clearly. No green checkmarks across the board.
| Axis | WhyC | One-shot codegen (typical agent demo) | Hand-coded MVP (2 founders, 1 weekend) |
| --- | --- | --- | --- |
| Time to first deployed URL | ~11 min | instant (often 404) | 12–48 hours |
| Spec-fit on first output | 0.71 median | 0.40 median | 0.90+ (human) |
| Self-corrects without operator | yes (Phoenix loop) | no | no — needs human |
| Public trace per run | yes (OpenInference) | no | git log |
| Survives a novel input from a judge | ~90% (logged) | ~40% | 100% if scoped |
| Production-ready | no — demo artifact | no | depends on team |
| Cost per preview | ~$3.40 (Gemini + Run) | ~$0.20 | ~16 founder-hours |
Where it currently fails
Public log, last 3 entries. Each is reproducible from the URL we received.
2026-05-06 09:14 · Marketplace JD with 4 personas → the agent picked the wrong primary persona; hero copy addressed buyers when the product is for sellers. The loop did not catch it because the spec-fit rubric didn't weight persona alignment. Fixed in rubric v0.4.1. [rubric]

2026-05-04 22:41 · Hardware-adjacent JD → the Next.js preview is structurally fine, but the product is fundamentally not a web app. We now refuse the run with a typed error (sketched below) instead of producing a misleading site. [refused]

2026-05-03 16:02 · JD in Korean → spec extraction succeeded, but the generated copy mixed languages. Locale detection added; outputs are English-only until v0.5 ships multilingual templates. [shipped]
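As a sketch of what the "refused" path above might look like: the run stops with a structured, machine-readable error before any site is generated. The class name, fields, and `check_deliverable` helper are our illustration, not the actual WhyC codebase.

```python
# Hypothetical sketch of the typed refusal described in the 2026-05-04 entry.
# NotAWebAppError and check_deliverable() are illustrations, not WhyC's API.

from dataclasses import dataclass

@dataclass
class NotAWebAppError(Exception):
    """Raised before generation when the extracted spec is not web-deliverable."""
    jd_excerpt: str
    reason: str
    suggestion: str = "point WhyC at a product that ships as a web app"

def check_deliverable(spec: dict) -> None:
    """Refuse up front instead of generating a structurally fine but misleading site."""
    if spec.get("deliverable") != "web_app":
        raise NotAWebAppError(
            jd_excerpt=spec.get("excerpt", ""),
            reason=f"deliverable is {spec.get('deliverable')!r}, not a web app",
        )

try:
    check_deliverable({"deliverable": "hardware", "excerpt": "ships a sensor array"})
except NotAWebAppError as err:
    print(f"refused: {err.reason} ({err.suggestion})")
```

The point of a typed error is that downstream tooling (and the failures log) can key off the structured reason instead of parsing prose.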
Total runs: 147 since 2026-05-03
Median spec-fit (converged): 0.91
Failures publicly logged: 14 / 147
License: Apache-2.0 (OSI-approved)
stack: Gemini ADK · Agent Builder · Phoenix MCP · Cloud Run · Next.js
Arize track · hackathon submission · no YC names or logos used