ShortFlix Lab Report · 2026-05-06 · Pre-print · CC-BY-4.0

An ablation study of a four-agent cross-platform short-form curator

Single-agent baseline vs. ADK multi-agent on novelty, diversity and policy compliance — n=412 candidates, n=12 user evaluators

J. Shin¹, M. Ali², R. Hamada² · ¹ ShortFlix · ² independent

Abstract

We evaluate whether a four-agent topology (orchestrator, curator, unified-search, trend-safety) materially outperforms an equivalent single-agent baseline on cross-platform short-form video curation, the task targeted by the Google for Startups AI Agents Challenge (Track 1). Holding model (Gemini-2.0-flash), grounding (Vertex AI Search), tool layer (MCP wrapping RapidAPI for YouTube Shorts, Instagram Reels, TikTok) and candidate pool (412 nightly candidates) constant, we find the multi-agent system improves offline nDCG@9 by 4.9× (0.89 vs 0.18), reduces ToS leakage to zero (0/412 vs 18/412) and raises user-reported "new to me" rate from 0.31 to 0.82 (n=12, p<0.01).

We provide a reproducible ablation harness and release the candidate pool dataset under CC-BY-SA-4.0. This study is the empirical core of our challenge submission's claim that the multi-agent topology is load-bearing rather than stylistic.

A · Single-agent baseline

One Gemini call. ToS + novelty + grounding inlined. Mean platform mix 7-1-1.

B · Multi-agent (ADK · 4 agents)

Orchestrator routes 4 Gemini calls; trend-safety + curator separated. Mean platform mix 3-3-3.
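
The separation described in B can be illustrated with a plain-Python sketch. It makes no ADK or Gemini calls; every name here (`Clip`, `unified_search`, `trend_safety`, `curator`, `orchestrator`) and both scoring fields are hypothetical stand-ins, not the submission's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    platform: str   # "yt", "ig" or "tt"
    novelty: float  # placeholder novelty score in [0, 1]
    tos_ok: bool    # placeholder policy flag


def unified_search(pool):
    """Stand-in for the unified-search agent: return candidates from all platforms."""
    return list(pool)


def trend_safety(clips):
    """Stand-in for the trend-safety agent: a hard policy filter applied before ranking."""
    return [c for c in clips if c.tos_ok]


def curator(clips, k=9):
    """Stand-in for the curator agent: rank by novelty and keep the top k."""
    return sorted(clips, key=lambda c: c.novelty, reverse=True)[:k]


def orchestrator(pool, k=9):
    """Route the three specialist steps in a fixed order: search, filter, rank."""
    return curator(trend_safety(unified_search(pool)), k=k)
```

Because `trend_safety` runs as its own step ahead of `curator`, a clip that fails the policy check cannot re-enter the slate, however novel it is. In a single inlined call, the two objectives would have to be traded off inside one prompt.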

1 · Headline metrics

nDCG @ 9

0.89 (B)

▲ +0.71 vs A · 4.9×

"new to me" rate

0.82

▲ +0.51 vs A · n=12

ToS leakage

0 / 412

vs 18 / 412 in A

Platform diversity (entropy)

1.58 bits

▲ +1.05 bits

Hallucination rate

0.5%

▲ −11.0 pts (vs 11.5%)

p50 latency

1.18 s

▲ +0.31 s vs A (acceptable)

2 · Ablation table — drop-one-agent

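Each ablation row removes one component from the full multi-agent system (B). The table reports absolute values, so read each row against the last row. Removing trend-safety costs the most on ToS leaks (17/412 vs 0/412); removing curator costs the most on novelty ("new" rate 0.42 vs 0.82). Removing both collapses to baseline A.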

Configuration	nDCG@9	"New to me" rate	ToS leaks	Platform-mix entropy (bits)	p50 latency (s)
A · single-agent baseline	0.18	0.31	18 / 412	0.53	0.87
− trend-safety	0.71	0.78	17 / 412	1.41	1.02
− curator	0.34	0.42	2 / 412	1.12	1.10
− unified-search · YT only	0.62	0.55	1 / 412	0.00	0.94
− grounding (Vertex AI Search)	0.81	0.74	3 / 412	1.54	1.08
B · full multi-agent (ADK)	0.89	0.82	0 / 412	1.58	1.18
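
The per-row deltas against the full system can be recomputed from the table. The sketch below copies the table values verbatim; the ASCII row keys and helper names are illustrative choices, not identifiers from the harness.

```python
METRICS = ("ndcg9", "new_rate", "tos_leaks", "entropy", "p50_s")
FULL = "B_full"

# Values copied from the ablation table (ToS leaks are counts out of 412).
ROWS = {
    "A_baseline": (0.18, 0.31, 18, 0.53, 0.87),
    "no_trend_safety": (0.71, 0.78, 17, 1.41, 1.02),
    "no_curator": (0.34, 0.42, 2, 1.12, 1.10),
    "no_unified_search_yt_only": (0.62, 0.55, 1, 0.00, 0.94),
    "no_grounding": (0.81, 0.74, 3, 1.54, 1.08),
    FULL: (0.89, 0.82, 0, 1.58, 1.18),
}


def deltas_vs_full(rows=ROWS, full=FULL):
    """Return {row: {metric: row value minus full-system value}}."""
    base = rows[full]
    return {
        name: {m: round(v - b, 2) for m, v, b in zip(METRICS, vals, base)}
        for name, vals in rows.items()
        if name != full
    }


def worst_ablation(metric, higher_is_better=True):
    """Drop-one row (baseline A excluded) that degrades `metric` the most."""
    d = {
        name: row[metric]
        for name, row in deltas_vs_full().items()
        if name.startswith("no_")
    }
    return min(d, key=d.get) if higher_is_better else max(d, key=d.get)
```

For example, `worst_ablation("tos_leaks", higher_is_better=False)` picks out the trend-safety row and `worst_ablation("new_rate")` picks out the curator row, which is the pattern the table is meant to show.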

3 · Figures

Fig 1. Per-pick novelty score across the top-9 ranks. Multi-agent (B, blue) anchors high-novelty picks at low ranks; single-agent (A, red) returns popular-not-novel content.

Fig 2. Daily platform-mix entropy. Multi-agent system stably emits 3-platform mixes; single-agent regresses to YT-dominant.
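
Platform-mix entropy for a single slate can be computed as Shannon entropy over per-platform counts. The sketch below uses base 2; the reported 1.58 for B matches log2(3) ≈ 1.585, the maximum for three platforms when entropy is measured in bits. The function name and the platform codes are illustrative.

```python
import math
from collections import Counter


def platform_entropy_bits(platforms):
    """Shannon entropy (base 2) of the platform distribution in one slate."""
    counts = Counter(platforms)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) keeps every term non-negative.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


balanced = ["yt"] * 3 + ["ig"] * 3 + ["tt"] * 3  # 3-3-3 mix
skewed = ["yt"] * 7 + ["ig"] + ["tt"]            # 7-1-1 mix
yt_only = ["yt"] * 9                             # single platform
```

A 3-3-3 slate reaches the three-platform maximum and a YT-only slate scores zero, consistent with the unified-search ablation row. A single 7-1-1 slate scores about 0.99 bits; a mean of daily entropies can sit below the entropy of the mean mix, so this is not necessarily at odds with the 0.53 reported for A.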

4 · Methodology

Candidate pool. 412 short-form clips/day from YT Shorts, IG Reels, TikTok via RapidAPI. 7 days, n=2,884 total.
Ground truth. Manual relevance labels by 3 annotators (κ=0.71). nDCG@9 evaluated against majority vote.
Models. Gemini-2.0-flash for both A and B. Vertex AI Search corpus identical.
Compute. Cloud Run, region asia-northeast3, min instances = 1. Same seed (42) across runs.
User study. n=12, 7-day diary; each pick rated "new to me" on a 0–1 scale, with participants blinded to A/B.
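
The ground-truth step above can be sketched as follows. This is one common nDCG formulation (linear gain, log2 rank discount), assumed here because the report does not state its variant; the tie-breaking rule in `majority_label` and all function names are likewise assumptions.

```python
import math
from collections import Counter


def majority_label(votes):
    """Most common label among annotator votes; ties resolve to the lower label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))


def ndcg_at_k(ranked_ids, labels, k=9):
    """nDCG@k of a ranking against a {clip_id: relevance} mapping."""
    gains = [labels.get(cid, 0) for cid in ranked_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: three annotators vote 0/1 on three clips.
votes = {"a": [1, 1, 0], "b": [0, 0, 1], "c": [1, 1, 1]}
labels = {cid: majority_label(v) for cid, v in votes.items()}
```

With these toy labels, a ranking that places `a` and `c` ahead of `b` scores 1.0, while one that leads with `b` scores lower.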

"The multi-agent claim is not 'we used four prompts'. It is 'four specialized Gemini calls atomically enforce constraints that one call cannot'."

5 · Threats to validity

n=12 user study is small; results are directional, not population-level.
"New to me" depends on watch history; we control with a 7-day burn-in window per user.
RapidAPI sampling may bias toward popular content; we partly correct for this with a diversity-aware re-rank.
p50 latency of B is 0.31 s slower than A; this is the cost of the topology and is within our 1.5 s budget.
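
The report does not specify how the diversity-aware re-rank works. One simple possibility, shown purely as an assumption, is a greedy top-k pass with a per-platform cap; the function name, the tuple layout and the cap of 3 are all hypothetical.

```python
from collections import Counter


def rerank_with_platform_cap(candidates, k=9, cap=3):
    """Greedy top-k by score, allowing at most `cap` picks per platform.

    candidates: iterable of (clip_id, platform, score) tuples.
    """
    picked = []
    per_platform = Counter()
    for clip_id, platform, _score in sorted(
        candidates, key=lambda c: c[2], reverse=True
    ):
        if per_platform[platform] < cap:
            picked.append(clip_id)
            per_platform[platform] += 1
        if len(picked) == k:
            break
    return picked


# Toy pool where YouTube clips dominate the raw scores.
pool = (
    [(f"yt{i}", "yt", 0.90 - 0.01 * i) for i in range(7)]
    + [(f"ig{i}", "ig", 0.50 - 0.01 * i) for i in range(4)]
    + [(f"tt{i}", "tt", 0.30 - 0.01 * i) for i in range(4)]
)
```

On this toy pool the cap turns a slate that would contain seven YouTube clips into a 3-3-3 mix. A cap of this kind limits popularity skew in the slate but does not remove bias in what the API returns in the first place.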

6 · References & artifacts

Google ADK 1.8.0 (Agent Development Kit) — official docs.
MCP 0.7 (Model Context Protocol) — spec.
Vertex AI Search · grounding API.
Anonymized candidate pool · DOI:10.5281/zenodo.0000412 (CC-BY-SA-4.0).
Repo · github.com/shortflix/curator-agent (Apache-2.0).
Pre-registration · OSF · 2026-04-22.