LLM Translation & Schema-Validation Benchmark: Claude Sonnet 4.6 vs Gemini 2.5 Pro vs Qwen3 Max Thinking

Author: Tina Park (@tinaparklive)

Repo: devtinapark/korean-culinary-translation-benchmark
Report: results/benchmark_report.html
In my previous article, I tackled the challenge of speech-to-text for Spoken Kitchen — an app built to help immigrant families capture and preserve heirloom recipes. I established a localized baseline showing how OpenAI's gpt-4o-transcribe and Deepgram's nova-3 handle noisy, code-switched kitchen audio.
But capturing the transcript is only half the battle. Once you have a messy, spoken mix of Korean and English, such as a 할머니 (grandmother) explaining how to make 미역국 (seaweed soup) or how to do 간 맞추기 (adjusting the seasoning), you have to turn that raw text into a clean, structured, translated recipe object.
Our app requires a strict JSON schema so the frontend can display ingredients, steps, and cooking times reliably. This means our translation engine must do three things simultaneously:
- Translate accurately between English and Korean while respecting localized cooking terminology.
- Preserve loanwords and specialized "Konglish" structures naturally without flattening them.
- Enforce a strict JSON schema perfectly, mapping loose conversational prose to precise data fields.
Generic LLM translation leaderboards don't test for schema integrity under multilingual stress. So, I built a translation evaluation harness to put three heavy-hitters to the test: Anthropic's Claude Sonnet 4.6, Google's Gemini 2.5 Pro, and Alibaba's Qwen3 Max Thinking.
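To make the schema requirement concrete, here is a minimal sketch of how schema validity could be scored as a fraction between 0 and 1. The field names in `REQUIRED_FIELDS` and the function `schema_validity` are illustrative assumptions; they are not the actual Spoken Kitchen schema or the harness code from the repo.

```python
import json

# Hypothetical required fields; the real Spoken Kitchen schema is not shown here.
REQUIRED_FIELDS = {
    "title": str,
    "ingredients": list,
    "steps": list,
    "cook_time_minutes": int,
}


def _matches(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly for int fields
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def schema_validity(raw_output: str) -> float:
    """Fraction of required fields present with the expected type."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no credit
    if not isinstance(data, dict):
        return 0.0  # a top-level array or scalar is not a recipe object
    ok = sum(
        1 for name, expected in REQUIRED_FIELDS.items()
        if _matches(data.get(name), expected)
    )
    return ok / len(REQUIRED_FIELDS)
```

Under this sketch, an output that parses but carries only two of the four fields scores 0.5, and output that is not valid JSON scores 0. A production harness would more likely validate against a full JSON Schema or a typed model, including nested fields.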
Why Korean Culinary Translation is Surprisingly Fragile
When extracting data from bilingual culinary speech, standard LLM translation layers fail in fascinating ways.
The Loanword Dilemma (Konglish). Real-world home cooks use mixed linguistic registers. For instance, an English speaker talking about Korean food might use 세서미 오일 (the loanword for sesame oil) and 참기름 (the native Korean word) interchangeably. Similarly, a native Korean speaker might say 프라이팬 (frying pan) or 레시피 (recipe). Standard models often hyper-correct these into formal, pure English or pure Korean words, which breaks the speaker's voice or causes internal string matching to fail.
The Schema vs. Nuance Tradeoff. Forcing an LLM to output rigid JSON often degrades its linguistic performance. The cognitive load of ensuring that every comma, array element, and boolean field perfectly matches a TypeScript definition can cause models to hallucinate translations or drop subtle phrases entirely.
Cultural Subtlety. Cooking is dense with implicit actions. Phrases like "간을 맞추다" (adjusting taste/seasoning) or specifying a pinch of something aren't just literal word conversions; they carry contextual weight that must be translated gracefully into an internationalized recipe layout.
To evaluate this accurately, I built a composite metric score using an absolute formula:
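As a sanity check on the reported numbers, the sketch below shows one weighting that reproduces all three composite scores in the Headline Results to within 0.0005. The weights (0.40 schema, 0.35 loanword, 0.25 cultural) and the choice to rescale the 1–5 cultural score by dividing by 5 are inferred by fitting the published figures. They are assumptions, not values taken from the harness source.

```python
# Weights inferred by fitting the reported numbers; NOT taken from the harness source.
W_SCHEMA, W_LOANWORD, W_CULTURAL = 0.40, 0.35, 0.25


def composite(schema: float, loanword: float, cultural_1_to_5: float) -> float:
    """Weighted sum of the three sub-scores, all on a 0-1 scale."""
    # Rescaling the 1-5 cultural score by dividing by 5 is an assumption.
    cultural = cultural_1_to_5 / 5
    return W_SCHEMA * schema + W_LOANWORD * loanword + W_CULTURAL * cultural


# (schema validity, loanword preservation, cultural score, published composite)
reported = {
    "Claude Sonnet 4.6": (0.9945, 0.8854, 4.88, 0.9514),
    "Gemini 2.5 Pro": (0.9341, 0.8733, 4.25, 0.8918),
    "Qwen3 Max Thinking": (0.9340, 0.8233, 4.25, 0.8743),
}

for name, (schema, loanword, cultural, published) in reported.items():
    estimate = composite(schema, loanword, cultural)
    print(f"{name}: estimated {estimate:.4f}, published {published:.4f}")
```

The estimates land within rounding distance of the published composites, which suggests the real formula is a weighted sum of this general shape, but the repo remains the authority on the exact definition.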
The Models Under Evaluation
We selected the latest premium tier from each of three major LLM ecosystems, focusing on their capacity for complex structural mapping alongside multilingual fluency.
| Model | Provider | Strengths Being Evaluated |
|---|---|---|
| Claude Sonnet 4.6 | Anthropic | Renowned for exceptional instruction-following, natural prose rendering, and advanced complex JSON structuring. |
| Gemini 2.5 Pro | Google | Equipped with massive context capabilities and traditionally high cross-lingual adaptability, specifically tuned for developer workflows. |
| Qwen3 Max Thinking | Alibaba | A top-tier flagship featuring an advanced integrated reasoning loop, showing competitive multi-turn and native cross-lingual performance. |
Headline Results
Across 8 culinary scenarios (combinations of English and Korean input, drawn from both clean audio and noisy, broken text transcripts), a clear ranking emerged.
#1 — Anthropic Claude Sonnet 4.6
- Composite Absolute Score: 0.9514
- Schema Validity: 0.9945
- Loanword Preservation: 0.8854
- Cultural Score (1–5): 4.88
#2 — Google Gemini 2.5 Pro
- Composite Absolute Score: 0.8918
- Schema Validity: 0.9341
- Loanword Preservation: 0.8733
- Cultural Score (1–5): 4.25
#3 — Alibaba Qwen3 Max Thinking
- Composite Absolute Score: 0.8743
- Schema Validity: 0.9340
- Loanword Preservation: 0.8233
- Cultural Score (1–5): 4.25
The Big Takeaway: All three systems perform at a high level. The total gap from first to third place is only 0.0771 composite points (0.9514 − 0.8743). This points to real differentiation between capable models rather than a catastrophic failure by any one of them, with Claude leading on schema validity, loanword preservation, and cultural score alike.
Cross-Model Deep Dive
When breaking the metrics down across English and Korean scenarios, we can observe precisely where models excel—and where they begin to struggle.
| Metric Sub-Category | Claude Sonnet 4.6 | Gemini 2.5 Pro | Qwen3 Max Thinking |
|---|---|---|---|
| Schema Accuracy — EN Scenarios | 0.9952 | 0.9058 | 0.8797 |
| Schema Accuracy — KO Scenarios | 0.9938 | 0.9624 | 0.9884 |
| Loanword Extraction — EN Scenarios | 0.7709 | 0.7465 | 0.6466 |
| Loanword Extraction — KO Scenarios | 1.0000 | 1.0000 | 1.0000 |
| Noise Delta (Clean → Noisy Schema) | -0.0111 | +0.0271 | +0.0324 |
Per-Scenario Breakdown & Core Patterns
1. The Konglish Translation Bias
A fascinating trend appeared within the English scenarios (en-a and en-b). When the English transcripts contained culinary terms that have a Konglish form, Claude Sonnet 4.6 naturally opted for the hybrid Konglish register, rendering "sesame oil" as the phonetic loanword 세서미 오일. While traditionalists might argue for 참기름, 세서미 오일 preserved the flavor of the speaker's original wording without flattening the vocabulary.
Gemini 2.5 Pro and Qwen3 Max Thinking, by contrast, frequently forced total normalization back into native English words (sesame oil) within the schema fields, leading to lower loanword tracking scores.
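A loanword tracking score of this kind can be sketched as the share of expected loanwords that survive verbatim in the model output. The function name `loanword_preservation` and the idea of a per-scenario list of expected terms are assumptions for illustration; the harness in the repo may match terms differently.

```python
import unicodedata


def _norm(text: str) -> str:
    # NFC-normalize Hangul and collapse runs of whitespace before matching
    return " ".join(unicodedata.normalize("NFC", text).split())


def loanword_preservation(output_text: str, expected_terms: list[str]) -> float:
    """Fraction of expected loanwords that appear verbatim in the model output."""
    if not expected_terms:
        return 1.0  # nothing to preserve, so nothing was lost
    haystack = _norm(output_text)
    kept = sum(1 for term in expected_terms if _norm(term) in haystack)
    return kept / len(expected_terms)
```

With the expected terms 세서미 오일 and 프라이팬, an output that swaps in 참기름 but keeps 프라이팬 would score 0.5 under this sketch. Substring matching is deliberately simple: it tolerates attached particles such as 프라이팬에, but it cannot tell a preserved loanword from an accidental match inside a longer word.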
2. The Noise Inversion Paradox
Look closely at the Noise Delta row. It tracks how schema validity changed when the models processed messy text generated from noisy kitchen audio instead of clean audio inputs. A negative value means validity dropped on noisy input; a positive value means it rose.
Surprisingly, Gemini 2.5 Pro (+0.0271) and Qwen3 Max Thinking (+0.0324) showed higher schema reliability on noisy transcripts than on clean ones, while Claude Sonnet 4.6 dipped only slightly (-0.0111). One possible explanation is that minor textual inconsistencies or repetitions in noisy ASR inputs act as extra contextual anchors, prompting the models to reason more carefully before assembling the JSON payload.
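The sign convention matters when reading this row, so here is a minimal sketch of the calculation, assuming the delta is the mean noisy-scenario schema score minus the mean clean-scenario score. The function `noise_delta` and the sample scores are hypothetical; they are not benchmark data.

```python
from statistics import mean


def noise_delta(clean_scores: list[float], noisy_scores: list[float]) -> float:
    """Mean schema validity on noisy transcripts minus the mean on clean ones."""
    return mean(noisy_scores) - mean(clean_scores)


# Made-up per-scenario scores, for illustration only (not benchmark data)
clean = [0.90, 0.92]
noisy = [0.94, 0.94]
delta = noise_delta(clean, noisy)
print(f"{delta:+.4f}")
```

A positive result means the model produced more valid schemas on noisy input, which is the pattern reported for Gemini and Qwen. With only a handful of scenarios per condition, deltas of this size are sensitive to a single scenario.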
Behind the Scenes: The Metric Inversion Patch
Building engineering benchmarks is an iterative process. In our initial test execution (Run 1), we ran into a serious logic error: the evaluation harness inadvertently inverted the scoring of missing schema attributes, rewarding models that dropped nested arrays instead of penalizing them.
For Run 2, we patched this scoring bug and activated an LLM-as-a-Judge agent powered by anthropic/claude-sonnet-4.6 to systematically grade the cultural nuances of the translated recipes on a strict 1–5 scale.
### SYSTEM BUG LOG (PATCHED - RUN 2)
[ERROR] Evaluator loop inside `ModelRanker.py` was multiplying schema field omissions
by a positive index weight rather than subtracting it.
[FIX] Replaced evaluation mapping with absolute structural distance calculations.
[JUDGE] Active. Cultural subtleties are scored out of an absolute 5-point scale.
Composite calculation formula locked down to avoid variable scale drifting.
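The class of fix described in the log can be sketched as follows: count the expected field paths that are missing from the output and subtract that share from a perfect score, so an omission can only lower the result. The helpers `field_paths` and `structural_score` are hypothetical names for illustration, not the actual code in `ModelRanker.py`.

```python
def field_paths(obj, prefix: str = "") -> set[str]:
    """Collect dotted paths for every key in a nested dict/list structure."""
    paths: set[str] = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(value, path)
    elif isinstance(obj, list):
        for item in obj:
            # List items share one "[]" segment so array length does not change the path set
            paths |= field_paths(item, f"{prefix}[]")
    return paths


def structural_score(expected: dict, actual: dict) -> float:
    """1.0 when every expected path is present; lower for each omission."""
    expected_paths = field_paths(expected)
    if not expected_paths:
        return 1.0
    missing = expected_paths - field_paths(actual)
    # Omissions are subtracted, never added, so dropping a field cannot raise the score.
    return 1.0 - len(missing) / len(expected_paths)
```

In this sketch, a reference recipe with `title`, `steps`, and an `ingredients` array of `{name, amount}` objects has five paths. An output that drops the whole `ingredients` array loses three of them and scores 0.4, where the Run 1 bug would have rewarded that omission.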