For the skeptic who has seen 30 agent demos this year

We didn't believe it either.
So we published the broken one too.

Most agent-builder demos are screen recordings of a happy path that breaks the second a judge clicks the link. We expect you to assume the same about WhyC. Here is the broken mock you came to find — next to the actual Cloud Run URL the loop produced from the same job posting. Both are clickable.

What you expected to see ● broken / typical

one-shot codegen, no eval loop localhost:3000 (down)

What WhyC actually deployed ● live · spec-fit 0.94

WhyC pipeline · 4 loop iterations · 23h 41m preview-7af2.run.app

Honest disclosure (auto-generated by Opus 4.7 from the trace bundle): The deployed preview above is real and clickable, but the pricing page route still scores 0.71 spec-fit (target: 0.85) — the loop is queued for one more pass tonight. Login is mocked. The synthetic JD used for this run was generated by us; no real YC company is named, screenshotted, or implied. We do not promise this works on every input — see the failures log below.

Why you'll hate this product

Seven objections we expect from a reasonable skeptic. Crossed-out objection, our honest answer, link to the Phoenix trace that proves it.

Your objection

What we actually do

Evidence

"Agent demos are cherry-picked screen recordings."

Every run publishes its full OpenInference trace. The hero card above links to a live URL, not a video. Pick a different JD and submit it yourself.

trace #7af2

"One-shot LLM codegen never matches the brief."

Correct — that's why we don't do one-shot. Phoenix MCP scores each flow against the extracted spec; only flows below 0.85 are regenerated. Convergence is logged, not claimed.

eval log

"Spec-fit scoring is just LLM marking its own homework."

Fair. The judge prompt, rubric, and 12 disagreement cases (where a human reviewer overruled it) are public in the repo. We log false-positives, not just successes.

judge.md

"24 hours is marketing — what's the actual median?"

First Cloud Run URL: median ~11 min (visible to user). Convergence to 0.85+: median ~18h, p90 ~28h. We miss the 24h headline on ~10% of runs.

latency csv

"Generated UI looks generic and the copy hallucinates."

Sometimes yes. Hero copy is templated against extracted facts; if the JD is too vague, the agent refuses to invent and asks for clarification instead of bluffing.

refusal trace

"This is just YC-orange shitposting in a wrapper."

Tone is satirical, the artifact is not. WhyC the product never names or screenshots a YC company; the deployed previews are generic enough to use for your own startup the next morning.

brand rules

"What happens when the judges' input breaks your pipeline?"

It will. We've shipped 3 known-failure modes below with reproducer URLs. If yours is novel, the trace gets added to the same log within 24h.

failures.log

WhyC vs. the alternatives, honestly

Where we lose, marked clearly. No green checkmarks across the board.

Axis	WhyC	One-shot codegen (typical agent demo)	Hand-coded MVP (2 founders, 1 weekend)
Time to first deployed URL	~11 min	instant (often 404)	12–48 hours
Spec-fit on first output	0.71 median	0.40 median	0.90+ (human)
Self-corrects without operator	yes (Phoenix loop)	no	no — needs human
Public trace per run	yes (OpenInference)	no	git log
Survives a novel input from judge	~90% (logged)	~40%	100% if scoped
Production-ready	no — demo artifact	no	depends on team
Cost per preview	~$3.40 (Gemini + Run)	~$0.20	~16 founder-hours

Where it currently fails

Public log, last 3 entries. Each is reproducible from the URL we received.

2026-05-06 09:14 Marketplace JD with 4 personas → agent picked the wrong primary persona, hero copy addressed buyers when product is for sellers. Loop did not catch it because spec-fit rubric didn't weight persona alignment. Fixed in rubric v0.4.1. rubric
2026-05-04 22:41 Hardware-adjacent JD → Next.js preview is structurally fine but the product is fundamentally not a web app. We now refuse the run with a typed error instead of producing a misleading site. refused
2026-05-03 16:02 JD in Korean → spec extraction succeeded, but generated copy mixed languages. Locale detection added; English-only outputs until v0.5 ships multilingual templates. shipped

Total runs

147 since 2026-05-05

Median spec-fit (converged)

0.91

Failures publicly logged

14 / 147

License

Apache-2.0 OSI

stack: Gemini ADK · Agent Builder · Phoenix MCP · Cloud Run · Next.js Arize track · hackathon submission · no YC names or logos used

We didn't believe it either.So we published the broken one too.

Why you'll hate this product

WhyC vs. the alternatives, honestly

Where it currently fails

We didn't believe it either.
So we published the broken one too.