LLM Translation & Schema Benchmark: Claude vs. Gemini vs. Qwen


Repo: devtinapark/korean-culinary-translation-benchmark
Report: results/benchmark_report.html

In my previous article, I tackled the challenge of speech-to-text for Spoken Kitchen — an app built to help immigrant families capture and preserve heirloom recipes. I established a localized baseline showing how OpenAI's gpt-4o-transcribe and Deepgram's nova-3 handle noisy, code-switched kitchen audio.

But capturing the transcript is only half the battle. Once you have a messy, spoken mix of Korean and English, such as 할머니 (grandmother) explaining how to make 미역국 (seaweed soup) or how to handle 간 맞추기 (adjusting the seasoning), you have to turn that raw text into a clean, structured, translated recipe object.

Our app requires a strict JSON schema so the frontend can display ingredients, steps, and cooking times reliably. This means our translation engine must do three things simultaneously:

1. Translate accurately between English and Korean while respecting localized cooking terminology.
2. Preserve loanwords and specialized "Konglish" structures naturally without flattening them.
3. Enforce a strict JSON schema perfectly, mapping loose conversational prose to precise data fields.
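
As a rough illustration of what "strict" means for the third requirement, the sketch below checks a model's raw output against a small hand-written schema. The field names (`title`, `ingredients`, `steps`, `cook_time_minutes`) and the function `schema_errors` are hypothetical; the app's actual schema is not shown in this article.

```python
import json

# Hypothetical schema: field name -> required Python type after json.loads.
# The real Spoken Kitchen schema is not shown in this article.
RECIPE_SCHEMA = {
    "title": str,
    "ingredients": list,
    "steps": list,
    "cook_time_minutes": int,
}


def schema_errors(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    for field, expected in RECIPE_SCHEMA.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"wrong type for {field}")
    # Strict mode: keys outside the schema are also reported.
    errors.extend(f"unexpected field: {key}" for key in data if key not in RECIPE_SCHEMA)
    return errors
```

A payload that parses as JSON but drops `steps`, or returns the cooking time as a string, would fail here even if the translation itself reads well, which is the failure mode this benchmark is meant to surface.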

Generic LLM translation leaderboards don't test for schema integrity under multilingual stress. So, I built a translation evaluation harness to put three heavy-hitters to the test: Anthropic's Claude Sonnet 4.6, Google's Gemini 2.5 Pro, and Alibaba's Qwen3 Max Thinking.

Why Korean Culinary Translation is Surprisingly Fragile

When extracting data from bilingual culinary speech, standard LLM translation layers fail in fascinating ways.

The Loanword Dilemma (Konglish). Real-world home cooks mix linguistic registers. For instance, an English speaker talking about Korean food might use 세서미 오일 (a phonetic rendering of "sesame oil") and 참기름 (the native Korean word) interchangeably, and a native Korean speaker might say 프라이팬 (frying pan) or 레시피 (recipe). Standard models often hyper-correct these into formal, pure English or pure Korean words, which breaks the speaker's voice and can cause internal string matching to fail.
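
One simple way to quantify this is to list the loanwords a scenario is expected to keep and check which of them survive in the model output. The sketch below uses plain substring matching; the function name and the example sentences are my own illustrations, and the benchmark's actual loanword metric may be computed differently.

```python
def loanword_preservation(output_text: str, expected_loanwords: list[str]) -> float:
    """Fraction of expected loanwords that appear verbatim in the output."""
    if not expected_loanwords:
        return 1.0  # nothing to preserve, so nothing was lost
    normalized = " ".join(output_text.split())  # collapse whitespace runs
    kept = sum(1 for word in expected_loanwords if word in normalized)
    return kept / len(expected_loanwords)


expected = ["세서미 오일", "프라이팬"]
# Both loanwords kept as spoken.
print(loanword_preservation("세서미 오일 한 스푼을 프라이팬에 두른다", expected))  # 1.0
# "세서미 오일" normalized to the native word 참기름, so one of two is lost.
print(loanword_preservation("참기름 한 스푼을 프라이팬에 두른다", expected))  # 0.5
```

Exact matching is deliberately unforgiving: a model that "improves" the speaker's vocabulary scores lower even when the result is a valid translation.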

The Schema vs. Nuance Tradeoff. Forcing an LLM to output rigid JSON often degrades its linguistic performance. The cognitive load of ensuring that every comma, array element, and boolean field perfectly matches a TypeScript definition can cause models to hallucinate translations or drop subtle phrases entirely.

Cultural Subtlety. Cooking is dense with implicit actions. Phrases like "간을 맞추다" (adjusting taste/seasoning) or specifying a pinch of something aren't just literal word conversions; they carry contextual weight that must be translated gracefully into an internationalized recipe layout.

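To evaluate this, I built a composite score with fixed weights. Schema Validity and Loanword Preservation are fractions between 0 and 1, and the 1–5 judge score is divided by 5 to bring it onto a comparable scale: $0.40 \times \text{Schema Validity} + 0.35 \times \text{Loanword Preservation} + 0.25 \times \left(\frac{\text{Cultural Subtlety Judge Score}}{5}\right)$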
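
The formula is easy to check by hand. Here is a minimal sketch that applies it to the component scores reported in the Headline Results section; the function name `composite_score` is mine, not something taken from the harness.

```python
def composite_score(schema_validity: float, loanword_preservation: float,
                    cultural_score: float) -> float:
    """Weighted composite from the formula above; cultural_score is on the 1-5 scale."""
    return (0.40 * schema_validity
            + 0.35 * loanword_preservation
            + 0.25 * (cultural_score / 5))


# Component scores as reported in the Headline Results section.
print(round(composite_score(0.9341, 0.8733, 4.25), 4))  # Gemini 2.5 Pro -> 0.8918
print(round(composite_score(0.9340, 0.8233, 4.25), 4))  # Qwen3 Max Thinking -> 0.8743
```

Both values match the reported composites. Claude's reported 0.9514 is reproduced when the cultural score is taken as 4.875 rather than the displayed 4.88, which would be consistent with an average over 8 scenarios rounded for display; that is my inference, not something the report states.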

The Models Under Evaluation

We selected the premium tier from each of three major LLM ecosystems, focusing on how well each combines complex structural mapping with multilingual fluency under a strict output schema.

| Model | Provider | Strengths Being Evaluated |
| --- | --- | --- |
| Claude Sonnet 4.6 | Anthropic | Renowned for exceptional instruction-following, natural prose rendering, and advanced complex JSON structuring. |
| Gemini 2.5 Pro | Google | Equipped with massive context capabilities and traditionally high cross-lingual adaptability, specifically tuned for developer workflows. |
| Qwen3 Max Thinking | Alibaba | A top-tier flagship featuring an advanced integrated reasoning loop, showing competitive multi-turn and native cross-lingual performance. |

Headline Results

Across 8 culinary scenarios (combinations of English or Korean input with clean or noisy, broken transcripts), a clear ranking emerged.

#1 — Anthropic Claude Sonnet 4.6

Composite Absolute Score: 0.9514
Schema Validity: 0.9945
Loanword Preservation: 0.8854
Cultural Score (1–5): 4.88

#2 — Google Gemini 2.5 Pro

Composite Absolute Score: 0.8918
Schema Validity: 0.9341
Loanword Preservation: 0.8733
Cultural Score (1–5): 4.25

#3 — Alibaba Qwen3 Max Thinking

Composite Absolute Score: 0.8743
Schema Validity: 0.9340
Loanword Preservation: 0.8233
Cultural Score (1–5): 4.25

The Big Takeaway: All three models score above 0.87 on the composite, and the gap from first to third place is only 0.0771. That points to differentiation at the margins rather than catastrophic failure by any model. Claude leads on all three components, while Gemini and Qwen are effectively tied on schema validity (0.9341 vs. 0.9340) and cultural score (4.25 each), so loanword preservation is what separates second from third place.

Cross-Model Deep Dive

When breaking the metrics down across English and Korean scenarios, we can observe precisely where models excel—and where they begin to struggle.

| Metric Sub-Category | Claude Sonnet 4.6 | Gemini 2.5 Pro | Qwen3 Max Thinking |
| --- | --- | --- | --- |
| Schema Validity (EN scenarios) | 0.9952 | 0.9058 | 0.8797 |
| Schema Validity (KO scenarios) | 0.9938 | 0.9624 | 0.9884 |
| Loanword Preservation (EN scenarios) | 0.7709 | 0.7465 | 0.6466 |
| Loanword Preservation (KO scenarios) | 1.0000 | 1.0000 | 1.0000 |
| Noise Delta (Clean $\rightarrow$ Noisy, schema validity) | -0.0111 | +0.0271 | +0.0324 |
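
A quick arithmetic sanity check connects this table to the headline numbers. If the EN and KO scenario groups are the same size (an assumption on my part; the article only gives the total of 8 scenarios), each headline figure should be the unweighted mean of its EN and KO sub-scores. The helper `matches_headline` and the tolerance are illustrative.

```python
# (EN, KO) sub-scores and headline figures, copied from this article.
SCHEMA = {
    "Claude Sonnet 4.6": ((0.9952, 0.9938), 0.9945),
    "Gemini 2.5 Pro": ((0.9058, 0.9624), 0.9341),
    "Qwen3 Max Thinking": ((0.8797, 0.9884), 0.9340),
}
LOANWORD = {
    "Claude Sonnet 4.6": ((0.7709, 1.0000), 0.8854),
    "Gemini 2.5 Pro": ((0.7465, 1.0000), 0.8733),
    "Qwen3 Max Thinking": ((0.6466, 1.0000), 0.8233),
}


def matches_headline(sub_scores, headline, tol=1e-4):
    """True if the unweighted EN/KO mean equals the headline within tol."""
    return abs(sum(sub_scores) / len(sub_scores) - headline) <= tol


for table in (SCHEMA, LOANWORD):
    for model, (sub_scores, headline) in table.items():
        assert matches_headline(sub_scores, headline), model
```

All six checks pass within four-decimal rounding, so the sub-category rows and the headline scores are at least internally consistent with an equal EN/KO split.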

Per-Scenario Breakdown & Core Patterns

1. The Konglish Translation Bias

A fascinating trend appeared within the English scenarios (en-a and en-b). When an English transcript contained culinary terms that have a common Konglish form, Claude Sonnet 4.6 tended to choose the hybrid Konglish register, rendering "sesame oil" as the phonetic loanword 세서미 오일. While traditionalists might argue for 참기름, 세서미 오일 preserved the flavor of the speaker's original wording without flattening the vocabulary.

Gemini 2.5 Pro and Qwen3 Max Thinking, by contrast, frequently forced total normalization back into native English words (sesame oil) within the schema fields, leading to lower loanword preservation scores.

2. The Noise Inversion Paradox

Look closely at the Noise Delta row. It measures the change in schema validity when moving from clean transcripts to messy text generated from noisy kitchen audio (noisy minus clean), so a negative value means degradation.

Surprisingly, Gemini 2.5 Pro (+0.0271) and Qwen3 Max Thinking (+0.0324) showed higher schema validity on noisy transcripts than on clean ones, with Qwen showing the larger gain; only Claude dipped slightly (-0.0111). One possible explanation is that minor inconsistencies or repetitions in noisy ASR output act as extra contextual anchors, prompting the models to reason more carefully before assembling the JSON payload.
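
To make the sign convention concrete, here is a minimal sketch of how such a delta could be computed from per-scenario schema scores. The function name and the sample scores are hypothetical; the article reports only the final deltas, not the clean and noisy values behind them.

```python
from statistics import mean


def noise_delta(clean_scores: list[float], noisy_scores: list[float]) -> float:
    """Mean schema validity on noisy transcripts minus the mean on clean ones."""
    return mean(noisy_scores) - mean(clean_scores)


# Hypothetical per-scenario schema scores, for illustration only.
clean = [0.90, 0.92]
noisy = [0.94, 0.95]
delta = noise_delta(clean, noisy)
print(f"{delta:+.4f}")  # positive: the model did better on noisy input
```

Under this convention a positive delta, as reported for Gemini and Qwen, means the noisy transcripts produced more valid schemas, and a negative delta, as reported for Claude, means a small degradation.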

Behind the Scenes: The Metric Inversion Patch

Building engineering benchmarks is an iterative process. In our initial test execution (Run 1), we hit a serious logic error: the evaluation harness inverted the penalty for missing schema attributes, rewarding models that dropped nested arrays instead of penalizing them.

For Run 2, we patched this scoring bug and activated an LLM-as-a-Judge agent powered by anthropic/claude-sonnet-4.6 to grade the cultural nuance of each translated recipe on a strict 1-5 scale.

### SYSTEM BUG LOG (PATCHED - RUN 2)

[ERROR] Evaluator loop inside `ModelRanker.py` was multiplying schema field omissions
by a positive index weight rather than subtracting it.
[FIX] Replaced evaluation mapping with absolute structural distance calculations.
[JUDGE] Active. Cultural subtleties are scored out of an absolute 5-point scale.
Composite calculation formula locked down to avoid variable scale drifting.
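
The log does not include the offending code, so the following is only a hypothetical reconstruction of the class of bug it describes: a sign error that turns a penalty for omitted fields into a bonus. None of these names come from `ModelRanker.py`.

```python
def omission_fraction(expected: set[str], actual: set[str]) -> float:
    """Fraction of expected fields that are absent from the output."""
    if not expected:
        return 0.0
    return len(expected - actual) / len(expected)


def schema_score_buggy(expected: set[str], actual: set[str]) -> float:
    # Reconstructed Run 1 behavior: omissions are added, so they raise the score.
    return 1.0 + omission_fraction(expected, actual)


def schema_score_fixed(expected: set[str], actual: set[str]) -> float:
    # Omissions are subtracted; a complete payload scores 1.0.
    return 1.0 - omission_fraction(expected, actual)


expected = {"title", "ingredients", "steps"}
complete = {"title", "ingredients", "steps"}
partial = {"title"}

# With the bug, the payload that dropped two fields outranks the complete one.
print(schema_score_buggy(expected, partial) > schema_score_buggy(expected, complete))  # True
# With the fix, the ordering is what you would expect.
print(schema_score_fixed(expected, partial) < schema_score_fixed(expected, complete))  # True
```

A regression test that pins this ordering (complete payload scores at least as high as any partial one) is a cheap guard against the inversion reappearing.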