An ablation study of a four-agent cross-platform short-form curator
Single-agent baseline vs. ADK multi-agent on novelty, diversity and policy compliance — n=412 candidates, n=12 user evaluators
J. Shin¹, M. Ali², R. Hamada² · ¹ ShortFlix · ² independent
Abstract
We evaluate whether a four-agent topology (orchestrator, curator, unified-search, trend-safety) materially outperforms an equivalent single-agent baseline on cross-platform short-form video curation, the task targeted by the Google for Startups AI Agents Challenge (Track 1). Holding model (Gemini-2.0-flash), grounding (Vertex AI Search), tool layer (MCP wrapping RapidAPI for YouTube Shorts, Instagram Reels, TikTok) and candidate pool (412 nightly candidates) constant, we find the multi-agent system improves offline nDCG@9 by 4.9× (0.89 vs 0.18), reduces ToS leakage to zero (0/412 vs 18/412) and raises user-reported "new to me" rate from 0.31 to 0.82 (n=12, p<0.01).
We provide a reproducible ablation harness (toggle below) and the candidate pool dataset under CC-BY-SA-4.0. The study is the empirical core of our challenge submission's claim that the multi-agent topology is not stylistic but load-bearing.
A · Single-agent baseline
One Gemini call. ToS + novelty + grounding inlined. Mean platform mix 7-1-1.
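Each ablation row removes one component (an agent, or Vertex AI Search grounding) from the full multi-agent system B; read its metrics against row B. Removing trend-safety moves the ToS metric most (17/412 leaks vs 0/412); removing the curator moves the novelty metric most ("new" rate 0.42 vs 0.82). Removing both collapses to baseline A.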
| Configuration | nDCG@9 | "new" rate | ToS leaks | platform mix entropy | p50 latency (s) |
|---|---|---|---|---|---|
| A · single-agent baseline | 0.18 | 0.31 | 18 / 412 | 0.53 | 0.87 |
| − trend-safety | 0.71 | 0.78 | 17 / 412 | 1.41 | 1.02 |
| − curator | 0.34 | 0.42 | 2 / 412 | 1.12 | 1.10 |
| − unified-search · YT only | 0.62 | 0.55 | 1 / 412 | 0.00 | 0.94 |
| − grounding (Vertex AI Search) | 0.81 | 0.74 | 3 / 412 | 1.54 | 1.08 |
| B · full multi-agent (ADK) | 0.89 | 0.82 | 0 / 412 | 1.58 | 1.18 |
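The units of the platform mix entropy column are not stated. The values are consistent with Shannon entropy in bits: the maximum for three platforms is log2(3) ≈ 1.585, close to the 1.58 reported for B, and a YT-only slate gives 0.00. A minimal sketch under that assumption (the function name and the platform labels are illustrative, not taken from the harness):

```python
import math
from collections import Counter

def platform_mix_entropy(platforms):
    """Shannon entropy, in bits, of the platform distribution of one slate."""
    counts = Counter(platforms)
    total = sum(counts.values())
    # p * log2(1/p) summed over the platforms that actually appear in the slate
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical 9-pick slates
balanced = ["yt"] * 3 + ["ig"] * 3 + ["tiktok"] * 3
yt_only = ["yt"] * 9

print(round(platform_mix_entropy(balanced), 3))  # 1.585
print(round(platform_mix_entropy(yt_only), 3))   # 0.0
```

How the per-slate values are aggregated over the 7 days (for example, a mean of daily entropies) is not specified, so this sketch covers a single slate only.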
3 · Figures
Fig 1. Per-pick novelty score across the top-9 ranks. Multi-agent (B, blue) anchors high-novelty picks at low ranks; single-agent (A, red) returns popular-not-novel content.
Legend: B · multi-agent; A · single-agent.
Fig 2. Daily platform-mix entropy. Multi-agent system stably emits 3-platform mixes; single-agent regresses to YT-dominant.
4 · Methodology
Candidate pool. 412 short-form clips/day from YT Shorts, IG Reels, TikTok via RapidAPI. 7 days, n=2,884 total.
Ground truth. Manual relevance labels by 3 annotators (κ=0.71). nDCG@9 evaluated against majority vote.
Models. Gemini-2.0-flash for both A and B. Vertex AI Search corpus identical.
Compute. Cloud Run · asia-northeast3 · min instances = 1. Same seed (42) across runs.
User study. n=12, 7-day diary, "new to me" 0–1 Likert per pick, blinded to A/B.
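The ground-truth item above scores nDCG@9 against the majority vote of three annotators. The sketch below shows one common way to compute that. It assumes integer relevance labels, linear gain with a log2 rank discount, and ties broken toward the lower label; none of these choices is confirmed by the study, and the helper names and toy data are hypothetical.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Most frequent label; ties go to the lowest label (an assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(ranked_ids, relevance, k=9):
    """nDCG@k of a ranking, normalized by the ideal ordering of all labeled items."""
    gains = [relevance.get(cid, 0) for cid in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: three annotators label three candidates (1 = relevant)
annotations = {"a": [1, 1, 0], "b": [0, 0, 1], "c": [1, 1, 1]}
relevance = {cid: majority_vote(votes) for cid, votes in annotations.items()}

print(ndcg_at_k(["c", "a", "b"], relevance))            # 1.0
print(round(ndcg_at_k(["b", "a", "c"], relevance), 3))  # 0.693
```

With graded labels, an exponential gain (2^rel − 1) is a frequent alternative and would change the absolute scores.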
"The multi-agent claim is not 'we used four prompts'. It is 'four specialized Gemini calls atomically enforce constraints that one call cannot'."
5 · Threats to validity
The user study is small (n=12); results are directional, not population-level.
"New to me" depends on watch history; we control with a 7-day burn-in window per user.
RapidAPI sampling may bias the pool toward popular content; we partly correct for this with a diversity-aware re-rank.
p50 latency of B is 0.31 s slower than A (1.18 s vs 0.87 s); this is the cost of the topology and is within our 1.5 s budget.
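The diversity-aware re-rank mentioned in the sampling-bias item is not specified in this study. As an illustration only, the sketch below shows one simple greedy variant: each candidate's score is reduced by a fixed penalty for every already-selected pick from the same platform. The function name, the `score` and `platform` fields, and the penalty value are all hypothetical.

```python
from collections import Counter

def diversity_rerank(candidates, k=9, penalty=0.2):
    """Greedy re-rank: repeatedly pick the candidate with the highest
    score after subtracting `penalty` per already-selected item from
    the same platform."""
    remaining = list(candidates)
    selected = []
    platform_counts = Counter()
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: c["score"] - penalty * platform_counts[c["platform"]],
        )
        selected.append(best)
        platform_counts[best["platform"]] += 1
        remaining.remove(best)
    return selected

# Hypothetical pool skewed toward one platform
pool = [
    {"id": "yt1", "platform": "yt", "score": 0.90},
    {"id": "yt2", "platform": "yt", "score": 0.85},
    {"id": "yt3", "platform": "yt", "score": 0.80},
    {"id": "ig1", "platform": "ig", "score": 0.75},
    {"id": "tt1", "platform": "tiktok", "score": 0.70},
]

picks = diversity_rerank(pool, k=3)
print([p["id"] for p in picks])  # ['yt1', 'ig1', 'tt1']
```

A plain top-3 by score would return three YT clips here; the penalty trades a little score for platform coverage.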
6 · References & artifacts
Google ADK 1.8.0 (Agent Development Kit) — official docs.
MCP 0.7 (Model Context Protocol) — spec.
Vertex AI Search · grounding API.
Anonymized candidate pool · DOI:10.5281/zenodo.0000412 (CC-BY-SA-4.0).