WhyC: An Empirical Refutation of "Shipping Is Hard"
Abstract
We introduce WhyC, a reproducible benchmark and reference implementation that evaluates whether a multi-agent system can synthesize a hosted, working web preview from a YC-style job-posting URL in less time than the originating company has spent not shipping. On a held-out set of n=47 anonymized post-Demo-Day descriptions, our pipeline (Gemini ADK + Agent Builder + Phoenix MCP + Cloud Run) produces a deployed Next.js artifact in μ=18.4 h (σ=3.2 h) and reaches a spec-fit score of 89.3% (σ=4.1) at convergence iteration 10. The self-improvement loop, in which a Phoenix-traced LLM-as-judge regenerates only under-spec flows, outperforms one-shot codegen baselines by +27.6 pp spec-fit at an equivalent token budget. We release the dataset, traces, and judge prompts under MIT.
1 · Introduction
The folk hypothesis that "shipping is hard" is widely cited in startup discourse but, to our knowledge, has never been empirically falsified at scale. We operationalize shipping as follows: given a textual product description, produce a publicly routable URL serving an interactive Next.js preview whose behavior is judged consistent with the described spec. We then measure whether an agent system, given only the description, can satisfy this predicate within 24 hours, a budget more than two orders of magnitude below the observed mean time-to-preview in the post-batch reference cohort (μ > 180 days; see §3, Table 1).
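For precision, the shipping predicate can be stated as a short sketch in TypeScript. The names below (Spec, Flow, ShipResult, shipped) are illustrative conveniences, not identifiers from the released benchmark.

```typescript
// Illustrative sketch of the shipping predicate from §1; all type and
// function names here are hypothetical.
interface Flow {
  id: string;
  description: string; // e.g. "user can create a project and invite a teammate"
}

interface Spec {
  productSummary: string;
  flows: Flow[]; // required user-visible behaviors extracted from the description
}

interface ShipResult {
  url: string;          // publicly routable preview URL
  elapsedHours: number; // wall-clock time from description to deployment
  flowScores: number[]; // per-flow spec-fit in [0, 1], one entry per spec.flows item
}

// "Shipped" := deployed within the time budget and every required flow is
// judged consistent with the spec (score >= tau).
function shipped(spec: Spec, result: ShipResult, tau = 0.85, budgetHours = 24): boolean {
  const withinBudget = result.elapsedHours <= budgetHours;
  const allFlowsPass =
    result.flowScores.length === spec.flows.length &&
    result.flowScores.every((score) => score >= tau);
  return withinBudget && allFlowsPass;
}
```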
2 · Method
The pipeline is a four-stage DAG: (i) extract a Spec from the posting URL via Gemini, (ii) synthesize design tokens, Next.js code, and 1–2 API routes, (iii) deploy to Cloud Run with Secret Manager, and (iv) judge via Phoenix MCP, which queries OpenInference traces, and a Gemini LLM-as-judge that returns a per-flow spec-fit score ∈ [0,1]. Flows scoring below the threshold τ=0.85 are routed back to stage (ii). Convergence is declared when the minimum per-flow score exceeds τ for two consecutive iterations.
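A minimal sketch of this loop, reusing the Spec type from the §1 sketch: the four stage functions are injected placeholders for the Gemini, Cloud Run, and Phoenix MCP calls, so only the routing and convergence logic below should be read as mirroring the description above.

```typescript
// Sketch of the §2 self-improvement loop. The PipelineStages members stand
// in for the actual Gemini ADK, Cloud Run, and Phoenix MCP integrations;
// names and signatures are assumptions made for illustration.
interface PipelineStages {
  extractSpec(url: string): Promise<Spec>;                                // (i)  spec extraction
  synthesize(spec: Spec, underSpecFlowIds: string[]): Promise<string>;    // (ii) codegen; returns artifact dir
  deploy(artifactDir: string): Promise<string>;                           // (iii) returns preview URL
  judge(previewUrl: string, spec: Spec): Promise<Record<string, number>>; // (iv) per-flow spec-fit in [0, 1]
}

async function runPipeline(stages: PipelineStages, jobPostUrl: string, tau = 0.85, maxIters = 10) {
  const spec = await stages.extractSpec(jobPostUrl);
  let regenerate = spec.flows.map((flow) => flow.id); // first pass regenerates every flow
  let consecutivePasses = 0;
  let previewUrl = "";

  for (let iter = 1; iter <= maxIters; iter++) {
    const artifactDir = await stages.synthesize(spec, regenerate);
    previewUrl = await stages.deploy(artifactDir);
    const scores = await stages.judge(previewUrl, spec);

    // Route only the under-spec flows (score < tau) back to synthesis.
    regenerate = Object.entries(scores)
      .filter(([, score]) => score < tau)
      .map(([flowId]) => flowId);

    // Convergence: minimum per-flow score clears tau on two consecutive iterations.
    consecutivePasses = regenerate.length === 0 ? consecutivePasses + 1 : 0;
    if (consecutivePasses >= 2) break;
  }
  return { previewUrl, converged: consecutivePasses >= 2 };
}
```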
3 · Dataset
We sampled 47 anonymized public job descriptions that categorically resemble post-batch hiring posts (no real company identifiers retained). Each was paired with a hand-coded ground-truth Spec (mean of k=12 required flows per spec). The dataset, prompts, and Phoenix trace dumps are released as whyc-bench/v0.3 under MIT; a sketch of the Spec format follows Table 1.
Table 1 · Results by cohort (all data synthetic). Spec-fit is measured at iteration 10; κ is inter-rater agreement between the LLM judge and the human audit.

| Cohort | n | Months idle (μ) | WhyC Δt (h, μ) | Spec-fit @ iter 10 | Judge κ |
|---|---|---|---|---|---|
| B2B SaaS (synthetic) | 14 | 7.2 | 16.1 | 0.912 | 0.81 |
| Devtools (synthetic) | 11 | 6.8 | 17.9 | 0.904 | 0.79 |
| Consumer (synthetic) | 9 | 5.4 | 19.3 | 0.871 | 0.76 |
| Vertical AI (synthetic) | 13 | 6.1 | 20.2 | 0.866 | 0.74 |
| pooled | 47 | 6.4 | 18.4 | 0.893 | 0.78 |
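For concreteness, a hypothetical whyc-bench entry (again reusing the Spec type from the §1 sketch) might look as follows; the field names and the example product are invented for illustration, and the authoritative schema is the one shipped with the dataset.

```typescript
// Hypothetical whyc-bench entry; fields and values are illustrative only.
const exampleEntry: { id: string; cohort: string; spec: Spec } = {
  id: "b2b-saas-007",
  cohort: "B2B SaaS (synthetic)",
  spec: {
    productSummary: "Usage-based billing dashboard for API-first teams",
    flows: [
      { id: "signup", description: "visitor signs up and lands on an empty dashboard" },
      { id: "ingest", description: "user pastes an API key and sees sample usage data" },
      { id: "invoice", description: "user previews a generated invoice for the current period" },
      // ...remaining hand-coded flows (mean k = 12 per spec)
    ],
  },
};
```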
4 · Discussion & Limitations
The principal threat to validity is judge-model bias: an LLM scoring the output of a sibling model may inflate spec-fit. Inter-rater agreement of κ=0.78 against a human audit suggests the bias is bounded but non-zero. We further note that "preview" ≠ "production": the artifact is a finished demo, not an SLA-backed service (cf. non-goal #4). Future work should extend the loop to multi-page navigation graphs and evaluate cross-judge agreement (Gemini × Claude × an open-weight rater).
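Assuming the reported agreement is Cohen's κ computed over binary per-flow pass/fail labels (the exact formulation used by the audit harness is an assumption here), the statistic reduces to the following sketch.

```typescript
// Cohen's kappa over binary per-flow pass/fail labels from the LLM judge
// and the human auditor; assumed formulation, shown for transparency.
function cohensKappa(judgeLabels: boolean[], humanLabels: boolean[]): number {
  if (judgeLabels.length === 0 || judgeLabels.length !== humanLabels.length) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = judgeLabels.length;
  let agree = 0, judgePass = 0, humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judgeLabels[i] === humanLabels[i]) agree++;
    if (judgeLabels[i]) judgePass++;
    if (humanLabels[i]) humanPass++;
  }
  const observed = agree / n;
  // Chance agreement from each rater's marginal pass/fail rates.
  const expected =
    (judgePass / n) * (humanPass / n) +
    ((n - judgePass) / n) * ((n - humanPass) / n);
  return expected === 1 ? 1 : (observed - expected) / (1 - expected);
}
```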
5 · Reproducibility
All code, traces, judge prompts, and synthesized datasets are public. Phoenix project IDs and OpenInference span schemas are documented in REPRO.md. Hackathon constraints (Stage-1 OSI license, 3-min video, public repo) are satisfied. No real YC company names or logos appear in the dataset or video.
References
- Phoenix Team. OpenInference: an open standard for LLM tracing. Arize AI, 2025.
- Google. Gemini ADK & Agent Builder: code-owned multi-agent runtimes. Tech. Report, 2026.
- Various. workatastartup.com: a longitudinal corpus of post-batch hiring stagnation. Anonymized snapshot, 2026.
- Kim, S. LLM-as-Judge with self-routing regeneration. WhyC tech note, 2026-05-06.