Glasshat — The audit layer for AI evaluation (interactive demo)

Six Thinking Hats

RubricSynthesizer · Pro · thinking_high

▍3D Evaluation Constellation · UMAP(768d Gemini emb → 3d)

503 anchors · color = outcome tier · size = evidence depth

Qdrant Live: recommend() 0 · positive 0 · negative 0

Winner Gravity: 72% similar to winner cluster — but pulled toward non-winner pattern by anchor "Aegis" (a 2025 finalist with similar evidence depth that did not win). Threshold for likely winner: 80%.
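One way to read the Winner Gravity figure is as a similarity between this submission's embedding and the winner cluster, compared against the 80% threshold. The sketch below assumes cosine similarity against the cluster centroid, which the demo does not confirm. The function names and the toy 3-d vectors are hypothetical and do not reproduce the 72% shown above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def winner_gravity(submission, winners, threshold=0.80):
    """Return (similarity, likely_winner).

    Assumption: similarity is cosine similarity to the winner centroid.
    The 0.80 threshold is the one quoted in the panel.
    """
    similarity = cosine(submission, centroid(winners))
    return similarity, similarity >= threshold


# Toy vectors, for illustration only.
toy_winners = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
toy_submission = [3.0, 4.0, 0.0]
similarity, likely_winner = winner_gravity(toy_submission, toy_winners)
print(f"{similarity:.0%} similar, likely winner: {likely_winner}")
```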

⚠ Anti-Pattern Radar: 37 of 503 past Gemini 3 submissions matched this profile. Winners: 0.
Common failure: vague user definition, weak repo evidence, no working demo.

Legend: Winners (13) · Honorable (50) · Non-winners (440) · This submission

x: rubric-final · y: tech-depth · z: evidence-depth

Phoenix Monitor (recorded trace)

glasshat-demo

⚠ Phoenix Online Eval fired

span:hat_yellow_score_a1

eval.calibration.label:over_confident

eval.calibration.score:0.31

evidence_depth_bucket:shallow

predicted_score:9.0

Why -1.4? · Score Receipt

From rubric A1 (Problem Clarity, weight 8/100):

new = clip(9.0 - 0.8×1.75, p25=6.8, p75=8.1) = 7.6

3 anchors retrieved via qdrant.recommend():

· Globot (winner) → A1=8.2

· Aegis (finalist) → A1=7.5

· Netra (winner) → A1=7.2

|Δ| 1.4 ≤ 2.0 cap · n=3 ≥ 3 ✓ · LoopAgent iter 1/2
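The arithmetic in this receipt can be replayed in a few lines. This is a minimal sketch, not Glasshat's implementation: the names `shrink` (for 0.8) and `gap` (for 1.75) are assumed labels, since the receipt does not say what those two factors represent, and the `Receipt` type is hypothetical. The numbers are the ones shown above.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    predicted: float
    adjusted: float
    delta: float
    within_cap: bool       # |delta| <= cap
    enough_anchors: bool   # n >= minimum anchor count


def clip(value: float, low: float, high: float) -> float:
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def score_receipt(predicted, shrink, gap, p25, p75, anchors,
                  max_delta=2.0, min_anchors=3) -> Receipt:
    # new = clip(predicted - shrink * gap, p25, p75), as printed in the receipt
    adjusted = round(clip(predicted - shrink * gap, p25, p75), 2)
    delta = round(adjusted - predicted, 2)
    return Receipt(
        predicted=predicted,
        adjusted=adjusted,
        delta=delta,
        within_cap=abs(delta) <= max_delta,
        enough_anchors=len(anchors) >= min_anchors,
    )


# Anchor A1 scores as listed in the receipt.
anchors = {"Globot": 8.2, "Aegis": 7.5, "Netra": 7.2}
receipt = score_receipt(9.0, 0.8, 1.75, 6.8, 8.1, anchors)
print(receipt)
```

With these inputs the adjusted score is 7.6 and the change is -1.4, which passes both the 2.0 cap and the three-anchor minimum.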

Score Receipts · Dual-Rubric Variance · EU AI Act Art. 12 ready

Globot · Multi-Agent · 2M-token compliance analysis

Qdrant rubric (Functionality · Originality · UX): –

· Functionality: –

· Originality: –

· User Experience: –

Rapid Agent rubric (Tech 40 · Inn 30 · Imp 20 · Pres 10 · Tech tie-break): –

· Tech (★): –

· Innovation: –

· Impact: –

· Presentation: –
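The Rapid Agent weights (Tech 40, Innovation 30, Impact 20, Presentation 10, ties broken on Tech) suggest a weighted total along the following lines. This is a sketch under assumptions: the 0 to 10 score scale, the function names, and the two entries are hypothetical, and the demo itself shows no scores for this panel.

```python
# Weights as listed in the Rapid Agent rubric panel (sum to 100).
WEIGHTS = {"tech": 40, "innovation": 30, "impact": 20, "presentation": 10}


def weighted_total(scores: dict[str, float]) -> float:
    """Weighted mean of the four criteria on the same scale as the inputs."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / sum(WEIGHTS.values())


def rank(entries: dict[str, dict[str, float]]) -> list[str]:
    """Sort by weighted total, breaking ties on the Tech score (higher first)."""
    return sorted(
        entries,
        key=lambda name: (weighted_total(entries[name]), entries[name]["tech"]),
        reverse=True,
    )


# Hypothetical entries chosen so the totals tie and Tech decides the order.
entries = {
    "entry_b": {"tech": 7, "innovation": 8, "impact": 6, "presentation": 7},
    "entry_a": {"tech": 8, "innovation": 6, "impact": 7, "presentation": 7},
}
for name in rank(entries):
    print(name, weighted_total(entries[name]))
```

Both entries total 7.1, so `entry_a` ranks first on its higher Tech score.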

Rubric Sanity Layer (always-on): ① Reproducibility ✓ ② Inter-hat consistency ✓ ③ Calibration vs 13 known winners ✓ ④ Evidence depth threshold ✓

The audit layer for AI evaluation.

Stop trusting your LLM judges — verify them.

Same engine. Different rubric. Different rubric-faithful score.
