WhyC: An Empirical Refutation of "Shipping Is Hard"
Abstract
We introduce WhyC, a reproducible benchmark and reference implementation that evaluates whether a multi-agent system can synthesize a hosted, working web preview from a YC-style job-posting URL in less time than the originating company has spent not shipping. On a held-out set of n=47 anonymized post-Demo-Day descriptions, our pipeline (Gemini ADK + Agent Builder + Phoenix MCP + Cloud Run) produces a deployed Next.js artifact in μ=18.4 h (σ=3.2 h) and reaches a spec-fit score of 89.3% (σ=4.1) at convergence iteration 10. The self-improvement loop, in which a Phoenix-traced LLM-as-judge regenerates only under-spec flows, outperforms one-shot codegen baselines by +27.6 pp spec-fit at an equivalent token budget. We release the dataset, traces, and judge prompts under MIT.
1 · Introduction
The folk hypothesis that "shipping is hard" is widely cited in startup discourse but, to our knowledge, has never been empirically falsified at scale. We operationalize shipping as follows: given a textual product description, produce a publicly routable URL serving an interactive Next.js preview whose behavior is judged consistent with the described spec. We then measure whether an agent system, given only the description, can satisfy this predicate within 24 hours, a budget more than two orders of magnitude below the observed mean time-to-preview in the post-batch reference cohort (μ > 180 days; see §3, Table 1).
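For precision, the shipping predicate can be stated as a short sketch in TypeScript. The names below (Spec, Flow, ShipResult, shipped) are illustrative conveniences, not identifiers from the released benchmark.

```typescript
// Illustrative sketch of the shipping predicate from §1; all type and
// function names here are hypothetical.
interface Flow {
  id: string;
  description: string; // e.g. "user can create a project and invite a teammate"
}

interface Spec {
  productSummary: string;
  flows: Flow[]; // required user-visible behaviors extracted from the description
}

interface ShipResult {
  url: string;          // publicly routable preview URL
  elapsedHours: number; // wall-clock time from description to deployment
  flowScores: number[]; // per-flow spec-fit in [0, 1], one entry per spec.flows item
}

// "Shipped" := deployed within the time budget and every required flow is
// judged consistent with the spec (score >= tau).
function shipped(spec: Spec, result: ShipResult, tau = 0.85, budgetHours = 24): boolean {
  const withinBudget = result.elapsedHours <= budgetHours;
  const allFlowsPass =
    result.flowScores.length === spec.flows.length &&
    result.flowScores.every((score) => score >= tau);
  return withinBudget && allFlowsPass;
}
```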
2 · Method
The pipeline is a four-stage DAG: (i) extract a Spec from the posting URL via Gemini, (ii) synthesize design tokens, Next.js code, and 1–2 API routes, (iii) deploy to Cloud Run with Secret Manager, and (iv) judge via Phoenix MCP, which queries OpenInference traces, and a Gemini LLM-as-judge that returns a per-flow spec-fit score ∈ [0,1]. Flows scoring below the threshold τ=0.85 are routed back to stage (ii). Convergence is declared when the minimum per-flow score exceeds τ for two consecutive iterations.
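A minimal sketch of this loop, reusing the Spec type from the §1 sketch: the four stage functions are injected placeholders for the Gemini, Cloud Run, and Phoenix MCP calls, so only the routing and convergence logic below should be read as mirroring the description above.

```typescript
// Sketch of the §2 self-improvement loop. The PipelineStages members stand
// in for the actual Gemini ADK, Cloud Run, and Phoenix MCP integrations;
// names and signatures are assumptions made for illustration.
interface PipelineStages {
  extractSpec(url: string): Promise<Spec>;                                // (i)  spec extraction
  synthesize(spec: Spec, underSpecFlowIds: string[]): Promise<string>;    // (ii) codegen; returns artifact dir
  deploy(artifactDir: string): Promise<string>;                           // (iii) returns preview URL
  judge(previewUrl: string, spec: Spec): Promise<Record<string, number>>; // (iv) per-flow spec-fit in [0, 1]
}

async function runPipeline(stages: PipelineStages, jobPostUrl: string, tau = 0.85, maxIters = 10) {
  const spec = await stages.extractSpec(jobPostUrl);
  let regenerate = spec.flows.map((flow) => flow.id); // first pass regenerates every flow
  let consecutivePasses = 0;
  let previewUrl = "";

  for (let iter = 1; iter <= maxIters; iter++) {
    const artifactDir = await stages.synthesize(spec, regenerate);
    previewUrl = await stages.deploy(artifactDir);
    const scores = await stages.judge(previewUrl, spec);

    // Route only the under-spec flows (score < tau) back to synthesis.
    regenerate = Object.entries(scores)
      .filter(([, score]) => score < tau)
      .map(([flowId]) => flowId);

    // Convergence: minimum per-flow score clears tau on two consecutive iterations.
    consecutivePasses = regenerate.length === 0 ? consecutivePasses + 1 : 0;
    if (consecutivePasses >= 2) break;
  }
  return { previewUrl, converged: consecutivePasses >= 2 };
}
```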
3 · Dataset
We sampled 47 anonymized public job descriptions that categorically resemble post-batch hiring posts (no real company identifiers retained). Each was paired with a hand-coded ground-truth Spec (mean of k=12 required flows per spec). The dataset, prompts, and Phoenix trace dumps are released as whyc-bench/v0.3 under MIT; a sketch of the Spec format follows Table 1.
Table 1 · Results by cohort (all data synthetic). Spec-fit is measured at iteration 10; κ is inter-rater agreement between the LLM judge and the human audit.

| Cohort | n | Months idle (μ) | WhyC Δt (h, μ) | Spec-fit @ iter 10 | Judge κ |
|---|---|---|---|---|---|
| B2B SaaS (synthetic) | 14 | 7.2 | 16.1 | 0.912 | 0.81 |
| Devtools (synthetic) | 11 | 6.8 | 17.9 | 0.904 | 0.79 |
| Consumer (synthetic) | 9 | 5.4 | 19.3 | 0.871 | 0.76 |
| Vertical AI (synthetic) | 13 | 6.1 | 20.2 | 0.866 | 0.74 |
| pooled | 47 | 6.4 | 18.4 | 0.893 | 0.78 |
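For concreteness, a hypothetical whyc-bench entry (again reusing the Spec type from the §1 sketch) might look as follows; the field names and the example product are invented for illustration, and the authoritative schema is the one shipped with the dataset.

```typescript
// Hypothetical whyc-bench entry; fields and values are illustrative only.
const exampleEntry: { id: string; cohort: string; spec: Spec } = {
  id: "b2b-saas-007",
  cohort: "B2B SaaS (synthetic)",
  spec: {
    productSummary: "Usage-based billing dashboard for API-first teams",
    flows: [
      { id: "signup", description: "visitor signs up and lands on an empty dashboard" },
      { id: "ingest", description: "user pastes an API key and sees sample usage data" },
      { id: "invoice", description: "user previews a generated invoice for the current period" },
      // ...remaining hand-coded flows (mean k = 12 per spec)
    ],
  },
};
```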
4 · Discussion & Limitations
The principal threat to validity is judge-model bias: an LLM scoring the output of a sibling model may inflate spec-fit. Inter-rater agreement of κ=0.78 against a human audit suggests the bias is bounded but non-zero. We further note that "preview" ≠ "production": the artifact is a finished demo, not an SLA-backed service (cf. non-goal #4). Future work should extend the loop to multi-page navigation graphs and evaluate cross-judge agreement (Gemini × Claude × an open-weight rater).
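Assuming the reported agreement is Cohen's κ computed over binary per-flow pass/fail labels (the exact formulation used by the audit harness is an assumption here), the statistic reduces to the following sketch.

```typescript
// Cohen's kappa over binary per-flow pass/fail labels from the LLM judge
// and the human auditor; assumed formulation, shown for transparency.
function cohensKappa(judgeLabels: boolean[], humanLabels: boolean[]): number {
  if (judgeLabels.length === 0 || judgeLabels.length !== humanLabels.length) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = judgeLabels.length;
  let agree = 0, judgePass = 0, humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judgeLabels[i] === humanLabels[i]) agree++;
    if (judgeLabels[i]) judgePass++;
    if (humanLabels[i]) humanPass++;
  }
  const observed = agree / n;
  // Chance agreement from each rater's marginal pass/fail rates.
  const expected =
    (judgePass / n) * (humanPass / n) +
    ((n - judgePass) / n) * ((n - humanPass) / n);
  return expected === 1 ? 1 : (observed - expected) / (1 - expected);
}
```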
5 · Reproducibility
All code, traces, judge prompts, and synthesized datasets are public. Phoenix project IDs and OpenInference span schemas are documented in REPRO.md. Hackathon constraints (Stage-1 OSI license, 3-min video, public repo) are satisfied. No real YC company names or logos appear in the dataset or video.
References
- Phoenix Team. OpenInference: an open standard for LLM tracing. Arize AI, 2025.
- Google. Gemini ADK & Agent Builder: code-owned multi-agent runtimes. Tech. Report, 2026.
- Various. workatastartup.com: a longitudinal corpus of post-batch hiring stagnation. Anonymized snapshot, 2026.
- Kim, S. LLM-as-Judge with self-routing regeneration. WhyC tech note, 2026-05-06.