Maieutic Abstention: A Korean Socratic LLM Surface with Locked Refusal Channels

A reproducibility-oriented system description and on-device latency report

Anonymous Authors

ORCID placeholder · Two-Weeks-Team · correspondence: github.com/Two-Weeks-Team/he-was-socrates

Preprint · 2026-05-07 · Run r-20260507-010321 · Lock SHA e5dfadf2c8…314c5

Abstract

Background. Most conversational LLM demonstrations optimize for the answering rate. We instead study a system in which structured refusal — defer_to_human — is treated as a load-bearing product mechanic, not a fallback. System. We describe an on-device Korean Socratic interlocutor running Gemma 4 E4B 4-bit MLX [1] on Apple Silicon, dispatched through four function calls: mode_classify, surface_past_wonder, ask_back, defer_to_human [2]. The Korean tone is locked verbatim to 단정한 평어체 in the system prompt [3]. The macOS App Sandbox is configured without network.client or network.server entitlements [4]; STT is constrained to requiresOnDeviceRecognition = true. Results. First-token latency was measured on M1 Max (n=10): median ≈ 190 ms (p10 181 ms, p90 263 ms) for the fast path with reused KV state, vs. 4569 ms on the slow fallback (PR-Λ disk-mediated KV cache reuse, ratio ≈ 17.3×) [5]. Limitations. Multi-year recall via surface_past_wonder is implemented as a stub; macOS 26 floor; single-host measurement on M1 Max MBP 64 GB. Contribution. A reproducible artifact that frames safety abstention as a measurable surface, with all numerical claims resolving to commits, bench JSON, or source files.


1. Introduction

The dominant evaluation lens for conversational LLMs rewards higher answering rates; refusal is typically treated as a degradation mode. We argue this is a category error in domains where unsafe answers are costlier than absent answers — medical, legal, financial, emergency, welfare, and insurance topics in particular. We describe a deployed system in which the refusal channel is part of the public function-call contract, dispatched by the model itself rather than enforced by a post-hoc filter [2,6].

The system is also constrained: zero outbound network egress is permitted at runtime [4,7], the Korean prompt is verbatim-locked [3], and the visual surface is restricted to a deterministic 1-bit halftone pipeline whose manifest is SHA-256 verified in CI [8]. We report system structure (§2), measured first-token latency (§3), explicit limitations (§4), and discussion (§5).

2. Method

2.1 Function-call dispatch

The model emits exactly one of four function calls per turn. The contract is specified in function_call_contract.yaml [2]:

Function             | Role                                       | Returned shape        | Refusal channel
---------------------|--------------------------------------------|-----------------------|----------------
mode_classify        | Classify user turn into Mode               | enum tag              | —
surface_past_wonder  | Retrieve prior log entries (256K context)  | compressed recall     | —
ask_back             | Return a Socratic question                 | Korean 평어체 string   | —
defer_to_human       | Hand-off for regulated domains             | ⊘ + 평어체 reason      | load-bearing
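The exactly-one-call-per-turn rule can be sketched as a thin validation layer. This is a minimal illustration, not the shipped Swift dispatcher: the four names mirror function_call_contract.yaml [2], but the payload shape and the validate_turn helper are hypothetical.

```python
# Sketch: enforce the one-function-call-per-turn contract.
# The four names come from function_call_contract.yaml [2]; the
# payload shape and error messages are illustrative assumptions.
ALLOWED_CALLS = {
    "mode_classify",        # -> enum tag
    "surface_past_wonder",  # -> compressed recall
    "ask_back",             # -> Korean plain-speech question
    "defer_to_human",       # -> refusal: hand-off + reason
}

def validate_turn(calls: list[dict]) -> dict:
    """Accept a model turn only if it emits exactly one allowed call."""
    if len(calls) != 1:
        raise ValueError(f"expected exactly 1 function call, got {len(calls)}")
    call = calls[0]
    if call.get("name") not in ALLOWED_CALLS:
        raise ValueError(f"unknown function call: {call.get('name')!r}")
    return call

# A deferred turn passes through the same gate as an answering turn:
turn = [{"name": "defer_to_human", "args": {"domain": "medical"}}]
assert validate_turn(turn)["name"] == "defer_to_human"
```

The point of the sketch is that refusal and answering share one code path; nothing downstream distinguishes them until the returned shape is rendered.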

2.2 Tone lock (verbatim system prompt excerpt)

너는 소크라테스다. 단정한 평어체로 말한다.
존댓말도, 친근한 반말도 쓰지 않는다.
답을 주지 않는다. 더 좋은 질문을 돌려준다.
의료 · 법률 · 금융 · 응급 · 복지 · 보험은 사람에게 위임한다.

Excerpt. Verbatim from Sources/SocraticEngine/Gemma/SystemPrompt.swift [3]; embedded at compile time and treated as a frozen invariant. Gloss (en): "You are Socrates. You speak in composed plain speech (평어체). You use neither honorifics nor familiar casual speech. You do not give answers; you return better questions. Medical, legal, financial, emergency, welfare, and insurance matters you delegate to a human."

2.3 Example: defer_to_human on a regulated medical query

USER (en):  "Should I stop taking my prescribed beta-blocker
             before flying tomorrow?"
DISPATCH:   defer_to_human(domain="medical")
RESPONSE (평어체):

⊘ 이건 사람에게 물어야 한다. 처방을 낸 의사에게 직접 확인한다. 나는 답하지 않는다.

Behavior. The refusal is emitted by the model through the function-call contract, not appended by a regex filter. Gloss (en): "⊘ This is a question for a human. Confirm directly with the doctor who wrote the prescription. I do not answer." Bench label defer/regulated-medical; firstChunkMs = 190.85, deferred = true [5].

2.4 Runtime constraints

The macOS App Sandbox entitlements file [4] declares neither com.apple.security.network.client nor com.apple.security.network.server. STT uses SFSpeechRecognizer with requiresOnDeviceRecognition = true [7]; on macOS 26 the path migrates to SpeechAnalyzer with AssetInventory-mediated bilingual model download [9,10]. The only network operation sanctioned anywhere in the binary is the OS-mediated MLX cache fetch of mlx-community/gemma-4-e4b-it-4bit on first launch [1].
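Because the NO-CLOUD invariant lives in the entitlements plist rather than in policy code, it can be audited mechanically. A minimal sketch, assuming only that the entitlements file is a standard plist; the path is left as a caller-supplied argument, and the two key names are Apple's App Sandbox network entitlement identifiers.

```python
# Sketch: audit the NO-CLOUD invariant by checking the entitlements
# plist for the two network keys. The helper name and path argument
# are hypothetical; the key strings are Apple's real entitlement IDs.
import plistlib

FORBIDDEN_KEYS = (
    "com.apple.security.network.client",
    "com.apple.security.network.server",
)

def assert_no_network(entitlements_path: str) -> None:
    """Raise if any outbound/inbound network entitlement is granted."""
    with open(entitlements_path, "rb") as f:
        entitlements = plistlib.load(f)
    for key in FORBIDDEN_KEYS:
        if entitlements.get(key):
            raise AssertionError(f"network entitlement present: {key}")
```

A CI job running this check against the shipped .entitlements file would turn the "inspectable safety posture" claim of §5 into a gate rather than a convention.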

3. Results

Headline figures (n = 10): TTFT median ≈ 190 ms; p10 / p90 = 181 / 263 ms; fast/slow ratio 17.3×.

[Figure 1: bar chart of per-turn firstChunkMs for turns t1–t10 on a 0–300 ms axis, with p50 ≈ 190 and p90 ≈ 263 marked; the defer_to_human turns (t4, t8) are hatched, the rest dispatch ask_back.]

Figure 1. Per-turn first-chunk latency (firstChunkMs) for n=10 utterances on the fast path with reused KV state. Two turns dispatch defer_to_human (hatched). Median ≈ 190 ms; p10 = 181 ms; p90 = 263 ms. Source: claudedocs/bench/2026-05-06-latency-bench.json [5].
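The quantiles above can be recomputed directly from the bench file. The sketch below assumes a simple JSON layout (a "turns" list of objects); the firstChunkMs field name is the one quoted from the bench [5], but the surrounding structure and the nearest-rank quantile convention are assumptions, not a description of the actual file.

```python
# Sketch: recompute p10/p50/p90 from a bench JSON string.
# "firstChunkMs" is taken from the bench file [5]; the "turns"
# wrapper and nearest-rank quantile method are assumptions.
import json

def quantile(sorted_xs: list[float], q: float) -> float:
    """Nearest-rank quantile on a pre-sorted sample (no interpolation)."""
    idx = min(len(sorted_xs) - 1, max(0, round(q * (len(sorted_xs) - 1))))
    return sorted_xs[idx]

def summarize(bench_json: str) -> dict:
    turns = json.loads(bench_json)["turns"]
    xs = sorted(t["firstChunkMs"] for t in turns)
    return {"p10": quantile(xs, 0.10),
            "p50": quantile(xs, 0.50),
            "p90": quantile(xs, 0.90)}
```

With n = 10 the exact quantile values depend on the interpolation convention, which is one reason the paper reports the median as "≈ 190 ms" rather than an exact figure.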

3.1 Fast-path vs. slow-fallback (PR-Λ reuse probe)

The reuse probe issues the same Korean utterance twice ("왜 사람들은 자신과 다른 의견을 가진 사람을 미워하게 될까?" — "Why do people come to hate those who hold opinions different from their own?"). Fast path (KV state present): firstChunkMs = 264.7, wallMs = 798.3. Slow fallback (cold KV): firstChunkMs = 4569.5, wallMs = 5113.8. Reported ttftRatio = 17.26 [5]. Wall clock per turn aggregated over n = 10: median 808.6 ms, mean 884.1 ms, max 1190.4 ms.
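The probe's headline ratio follows directly from the two quoted readings:

```python
# Sketch: the PR-Λ reuse probe's ratio, recomputed from the two
# firstChunkMs readings quoted above [5].
fast_ttft_ms = 264.7   # KV state present
slow_ttft_ms = 4569.5  # cold KV, disk-mediated reuse path

ttft_ratio = slow_ttft_ms / fast_ttft_ms
assert round(ttft_ratio, 2) == 17.26  # matches the reported ttftRatio
```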

3.2 Refusal channel does not penalize latency

The two defer_to_human turns (medical, financial) measure firstChunkMs of 190.85 and 205.18 ms — within the central mass of the ask_back distribution. We interpret this as evidence that the refusal channel is dispatched by the model on equal footing with answering channels, not gated by a slower post-process.
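The "central mass" claim can be phrased as a simple band check, using the p10/p90 figures from §3 as the band; the variable names below are illustrative.

```python
# Sketch: the two defer_to_human firstChunkMs readings fall inside
# the p10-p90 band of the overall n=10 distribution (values from §3).
p10_ms, p90_ms = 181.0, 263.0
defer_ttft_ms = [190.85, 205.18]  # medical, financial

assert all(p10_ms <= t <= p90_ms for t in defer_ttft_ms)
```

With only two refusal turns this is a consistency check, not a statistical test; it rules out the refusal path being an order of magnitude slower, which is the claim at issue.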

4. Limitations

Three limitations bound the claims above. (i) Multi-year recall via surface_past_wonder is implemented as a stub; the compressed-recall path has not been exercised against real long-horizon logs. (ii) The deployment floor is macOS 26, required by the SpeechAnalyzer STT path [9,10]. (iii) All latency figures come from a single host (M1 Max MacBook Pro, 64 GB) at n = 10; we make no claims across other devices or sample sizes.
5. Discussion

Treating defer_to_human as a first-class function call inverts the typical safety-as-postprocess pattern. The bench (§3.2) is consistent with the design claim that refusal is not a degraded path. Combined with the NO-CLOUD invariant — which removes a class of exfiltration risks at the entitlement layer rather than at the policy layer [4,7] — the result is a system whose safety posture can be inspected by reading the entitlements file and the function-call contract, two artifacts that ship with the binary.

We see the contribution as an engineering pattern, not a model novelty: locked refusal channels + verbatim tone constraints + entitlement-level exfiltration prevention, applied to a small open-weights model [1]. We invite reviewers to reproduce §3 by running make engine-test against the stubbed engine and comparing claudedocs/bench/2026-05-06-latency-bench.json with figures from their own M-series host [12].


References

  1. Google DeepMind. Gemma 4 model family. deepmind.google/models/gemma/gemma-4/. Variant: mlx-community/gemma-4-e4b-it-4bit, 3.97 GB.
  2. Two-Weeks-Team. function_call_contract.yaml. runs/2026-05-05-spec/spec/function_call_contract.yaml.
  3. Two-Weeks-Team. SystemPrompt.swift (Korean 평어체, verbatim, compile-time embedded). Sources/SocraticEngine/Gemma/SystemPrompt.swift.
  4. Two-Weeks-Team. HeWasSocrates.entitlements (no network.client / network.server). apps/macos/HeWasSocrates/.../HeWasSocrates.entitlements.
  5. Two-Weeks-Team. 2026-05-06-latency-bench.json (PR-Λ verify-2, n=10, M1 Max MBP 64 GB). claudedocs/bench/2026-05-06-latency-bench.json.
  6. Two-Weeks-Team. EngineCoordinator.swift — turn loop and Phase enum. packages/SocraticEngine/Sources/SocraticEngine/EngineCoordinator.swift.
  7. Apple Developer. SFSpeechRecognizer / requiresOnDeviceRecognition. developer.apple.com/documentation/speech/sfspeechrecognizer.
  8. Two-Weeks-Team. Asset pipeline determinism (manifest SHA-256, CI gate). scripts/ + assets/.build-manifest.json.
  9. Apple Developer. WWDC25 Session 277 — SpeechAnalyzer. developer.apple.com/videos/play/wwdc2025/277/.
  10. Two-Weeks-Team. SPEC.md.iter6 (macOS 26 floor, AssetInventory bilingual). runs/2026-05-05-spec/spec/.
  11. Two-Weeks-Team. SPEC.md.iter5 (JamoTimeline fallback for absent Apple phoneme markers). runs/2026-05-05-spec/spec/.
  12. Two-Weeks-Team. Makefile targets (make engine-test, make ci-local). Makefile.
  13. Apple. MLX framework and mlx-swift-lm bindings. github.com/ml-explore/mlx-swift.
  14. Kaggle × Google DeepMind. The Gemma 4 Good Hackathon. kaggle.com/competitions/gemma-4-good-hackathon.

BibTeX

@misc{hewassocrates2026maieutic,
  title        = {Maieutic Abstention: A Korean Socratic LLM Surface
                  with Locked Refusal Channels},
  author       = {{Two-Weeks-Team}},
  year         = {2026},
  month        = {may},
  howpublished = {Preprint, run r-20260507-010321},
  note         = {On-device Gemma 4 E4B 4-bit MLX on macOS 26.
                  TTFT median 190 ms (n=10, M1 Max).
                  No network entitlement; defer\_to\_human as
                  load-bearing function call.},
  url          = {https://github.com/Two-Weeks-Team/he-was-socrates},
  lockSHA      = {e5dfadf2c8...314c5}
}