Korean ASR Benchmark: GPT-4o Transcribe vs Deepgram Nova-3

Repo: devtinapark/korean-asr-benchmark
Report: results/benchmark_report.html

I'm building Spoken Kitchen — a bilingual app that helps immigrant families capture, translate, and preserve heirloom recipes passed down through voice. Think: 할머니 dictating her kimchi jjigae recipe in Korean while her granddaughter follows along in English, in a real kitchen with sizzling pans and running water in the background.

The core of the app is speech-to-text, and I needed to pick the right ASR model. English demos all look magical, but Korean audio falls apart the moment you add real background noise and code-switching. Generic leaderboards weren't going to answer my question, so I built a small benchmark to answer it myself:

For real Korean/English kitchen speech — with sizzling pans, code-switched words like "미역국" and "간 맞추기," and sentences that jump between languages mid-thought — should I build on OpenAI gpt-4o-transcribe or Deepgram Nova-3?

Why Korean ASR Is Annoyingly Hard

A few things make Korean ASR very different from English.

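Agglutinative morphology. Grammatical information is glued onto word stems as suffixes. A one-character mistake in an ending can change tense, politeness, or meaning, yet word-level scoring treats it the same as getting the entire word wrong, so "whole word" error counts say little about how far off the transcription really is.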

Inconsistent spacing (띄어쓰기). Human transcribers and ASR models disagree constantly on where to put spaces. Two equally "correct" Korean transcriptions can have wildly different word boundaries.

Loanwords and code-switching. Real kitchen speech doesn't stay in one language. The Korean clips use English-origin loanwords like "프라이팬" (frying pan) naturally within Korean sentences, alongside native vocabulary like "도시락" (lunchbox). The English clips go the other direction: Korean terms like 미역국, 미역, 간 맞추기, and 의미 appear mid-sentence. Some models handle cross-language terms as expected; others mangle them or hallucinate entirely different words.

Because of this, I used Character Error Rate (CER) as the primary metric rather than WER. I strip spaces before computing CER for Korean, following the KsponSpeech evaluation convention. I also explicitly track code-switch accuracy — how well each model handles terms from the other language embedded in a sentence.

Why These Two Models

Both OpenAI's gpt-4o-transcribe and Deepgram's nova-3 support automatic language detection out of the box — no pre-pass, no per-request language hints. For audio that shifts between English and Korean mid-sentence, that constraint was non-negotiable, and it immediately made these two the natural pair to compare.

Model	Provider	Why
gpt-4o-transcribe	OpenAI API	Latest transcription model; auto-detects language; strong on Korean + English code-switching
Nova-3	Deepgram API	Fast, `detect_language=true`; competitive accuracy; low cost per minute

The Dataset: Real Multilingual Kitchen Audio

This is not a curated research corpus. It's 8 clips total: 4 base utterances (EN A/B, KO A/B), each recorded twice, once clean (quiet room) and once noisy (real kitchen sounds: sizzling pans, running water, background hum). Both the English and Korean clips are code-switched: the English clips contain Korean words and phrases (미역국, 간 맞추기), and the Korean clips contain natural Korean cooking vocabulary that overlaps with borrowed terms. The clips add up to 7.88 minutes of audio, or about a minute each on average. The full metadata lives in kitchen_samples/metadata.json.

[
  {
    "id": "en-b-noise",
    "audio_file": "English-B-with-noise.MP3",
    "transcript": "Today I make 미역국 — miyeokguk. Seaweed soup. Korean people eat this on birthdays, after having a baby. It's not just food — it means something. You need 미역, the dried seaweed. Soak it in cold water maybe twenty minutes...",
    "language": "en",
    "noise": true
  },
  {
    "id": "ko-a-clean",
    "audio_file": "Korean-A.MP3",
    "transcript": "오늘은 계란말이 만드는 법을 알려드릴게요. 저희 엄마 한테 배운 거예요. 진짜 간단한데 맛있어요...",
    "language": "ko",
    "noise": false
  }
]

The goal isn't coverage — it's a minimal, developer-friendly harness you can swap your own audio into. The effective sample size is 4 unique utterances × 2 noise conditions.

Methodology

The benchmark calls both APIs directly, normalizes text (especially Korean spacing), computes metrics per clip, and aggregates into an HTML + Markdown report.

CER — Character Error Rate. Primary metric for Korean. Spaces are stripped before comparison (띄어쓰기 normalization), following KsponSpeech practice.
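As a rough illustration of space-stripped CER, here is a minimal sketch in plain Python. It is not the repo's implementation: the function names are made up for this example, and the NFC normalization step is an added assumption so that each Hangul syllable is compared as a single character.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumption: compose Hangul into single code points (NFC),
    # then drop all whitespace so spacing choices are not scored.
    return "".join(unicodedata.normalize("NFC", text).split())


def cer(reference, hypothesis):
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# Spacing differences alone do not change the score.
print(cer("물을 끓이고 있어요.", "물을끓이고 있어요."))  # 0.0
```

Punctuation is still counted here; whether to strip it as well is a separate normalization choice.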

WER — Word Error Rate. Secondary metric. Useful for English, but less reliable for Korean due to ambiguous word boundaries. In fact, on ko-a, GPT-4o's clean WER (0.2079) is higher than its noisy WER (0.1980) — a great example of why WER can mislead on Korean even when the underlying transcription quality is stable.

Code-Switch Accuracy. Checks how well each model handles terms from the other language — Korean words like 미역국 and 간 맞추기 inside English sentences, and any English-origin terms inside Korean sentences. Critical for real kitchen speech where both languages appear in a single utterance.
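The article doesn't spell out how the code-switch score is computed, so the following is only one hypothetical way to approximate it: the share of expected other-language terms that appear verbatim (ignoring spaces) in the model output. The function name and the term list are assumptions for illustration, and the repo's actual metric may be finer-grained.

```python
def strip_spaces(text):
    # Ignore spacing so "간 맞추기" and "간맞추기" count as the same term.
    return "".join(text.split())


def code_switch_recall(expected_terms, hypothesis):
    """Fraction of expected terms found verbatim in the hypothesis."""
    if not expected_terms:
        raise ValueError("expected_terms must not be empty")
    hyp = strip_spaces(hypothesis)
    hits = sum(1 for term in expected_terms if strip_spaces(term) in hyp)
    return hits / len(expected_terms)


# Hypothetical model outputs for an English clip with Korean terms.
terms = ["미역국", "간 맞추기"]
print(code_switch_recall(terms, "Today I make 미역국. Last step is 간맞추기."))  # 1.0
print(code_switch_recall(terms, "Today I make 미역국. Last step is seasoning."))  # 0.5
```

Exact matching is strict: a romanized "miyeokguk" counts as a miss, which may or may not be what you want for your product.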

Composite Score. Weighted: 55% CER + 30% WER + 15% code-switch error rate (normalized). Reported as a relative 0–1 rank between models. Speed is excluded — latency is reported separately.
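To make the weighting concrete, here is a small sketch that applies the 55/30/15 weights to the aggregate numbers from the headline results. It assumes "code-switch error rate" means 1 minus code-switch accuracy, and it stops at the raw weighted error: the normalization and the 0–1 relative ranking step are not reproduced, since their details aren't given here.

```python
WEIGHTS = {"cer": 0.55, "wer": 0.30, "code_switch_error": 0.15}


def weighted_error(cer, wer, code_switch_accuracy):
    """Weighted error, lower is better.

    Assumption: code-switch error rate = 1 - code-switch accuracy.
    """
    code_switch_error = 1.0 - code_switch_accuracy
    return (WEIGHTS["cer"] * cer
            + WEIGHTS["wer"] * wer
            + WEIGHTS["code_switch_error"] * code_switch_error)


# Aggregate CER, WER, and code-switch accuracy from the headline results.
scores = {
    "openai-gpt4o-transcribe": weighted_error(0.0528, 0.1135, 0.9742),
    "deepgram-nova-3": weighted_error(0.0773, 0.1784, 0.9710),
}

ranking = sorted(scores, key=scores.get)  # best (lowest error) first
for model in ranking:
    print(f"{model}: {scores[model]:.4f}")
```

Because CER carries more than half the weight, a model has to win clearly on WER and code-switching to overturn a CER deficit.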

Latency. API response time per clip (request → response), excluding backoff or manual delays.
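One way to keep backoff out of the latency figure is to time each attempt separately and only report the attempt that succeeded. The helper below is a hypothetical sketch of that idea, not code from the repo; `timed_call`, `retries`, and `backoff_s` are illustrative names.

```python
import time


def timed_call(fn, *args, retries=2, backoff_s=1.0, **kwargs):
    """Call fn and return (result, latency_seconds).

    Only the successful attempt is timed. Sleeps between retries
    happen outside the timed region, so they never inflate latency.
    """
    last_error = None
    for attempt in range(retries + 1):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < retries:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
            continue
        return result, time.perf_counter() - start
    raise last_error
```

In use, `fn` would be whatever function sends one clip to a provider and returns the transcript, for example `transcript, latency = timed_call(transcribe, "clip.mp3")`, where `transcribe` is your own wrapper.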

Cost. Audio duration in minutes × price per minute. Deepgram is priced at the multilingual pre-recorded rate ($0.0052/min), since the benchmark uses detect_language=true on mixed Korean/English audio. OpenAI's rate ($0.006/min) is flat regardless of language. Both verified as of June 2026.
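The cost arithmetic is simple enough to check by hand. This sketch uses the per-minute rates quoted above and the 7.88 minutes of audio reported in the results; the function name is just for illustration.

```python
def transcription_cost(audio_minutes, price_per_minute):
    # Cost = audio duration in minutes x price per minute.
    return audio_minutes * price_per_minute


AUDIO_MINUTES = 7.88
RATES = {
    "openai-gpt4o-transcribe": 0.0060,  # flat rate
    "deepgram-nova-3": 0.0052,          # multilingual pre-recorded rate
}

costs = {model: transcription_cost(AUDIO_MINUTES, rate)
         for model, rate in RATES.items()}
saving = 1 - RATES["deepgram-nova-3"] / RATES["openai-gpt4o-transcribe"]

for model, cost in costs.items():
    print(f"{model}: ${cost:.5f}")
print(f"Nova-3 per-minute saving: {saving:.1%}")
# openai-gpt4o-transcribe: $0.04728
# deepgram-nova-3: $0.04098
# Nova-3 per-minute saving: 13.3%
```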

Note: aggregate CER and WER in the headline results are character-weighted and word-weighted micro-averages across all clips, so they won't match a simple mean of the per-clip column.
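The difference between a micro-average and a simple mean is easy to see with two clips of very different length. The counts below are made up purely for illustration; they are not taken from the benchmark.

```python
def micro_average(errors, lengths):
    """Pool all errors and all reference units, then divide once."""
    return sum(errors) / sum(lengths)


def macro_average(errors, lengths):
    """Compute a rate per clip, then take the unweighted mean."""
    rates = [e / n for e, n in zip(errors, lengths)]
    return sum(rates) / len(rates)


# Hypothetical clips: a short clean one and a long noisy one.
char_errors = [2, 30]   # character edits per clip
ref_chars = [50, 300]   # reference characters per clip

print(f"micro: {micro_average(char_errors, ref_chars):.4f}")  # micro: 0.0914
print(f"macro: {macro_average(char_errors, ref_chars):.4f}")  # macro: 0.0700
```

The micro-average leans toward the longer clip because it contributes more characters, which is why the headline figure can sit above or below the mean of the per-clip column.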

Headline Results

#1 — openai-gpt4o-transcribe

CER: 0.0528
WER: 0.1135
Code-switch accuracy: 0.9742
Avg latency: 4.01s
Cost: $0.04728 for 7.88 min ($0.0060/min)

#2 — deepgram-nova-3

CER: 0.0773
WER: 0.1784
Code-switch accuracy: 0.9710
Avg latency: 1.97s
Cost: $0.04098 for 7.88 min ($0.0052/min)

Accuracy: GPT-4o-transcribe is ahead on both CER (0.0528 vs 0.0773) and WER (0.1135 vs 0.1784). The gap is widest on Korean, most visibly in WER.
Cost: Nova-3 is about 13% cheaper per minute at the multilingual rate.
Latency: Nova-3 is about 2× faster on average (1.97s vs 4.01s).
Code-switching: both models score above 0.97, so this metric doesn't separate them.

Per-Sample Breakdown

Noisy clips are marked with 🔊. Lat = API latency in seconds.

Sample	Model	CER	WER	Lat
en-a-clean	gpt-4o-transcribe	0.0401	0.0473	4.12s
	deepgram-nova-3	0.0518	0.0676	1.50s
en-a-noise 🔊	gpt-4o-transcribe	0.0518	0.0878	3.33s
	deepgram-nova-3	0.0968	0.1622	4.45s
en-b-clean	gpt-4o-transcribe	0.0472	0.0818	4.04s
	deepgram-nova-3	0.0730	0.1321	2.91s
en-b-noise 🔊	gpt-4o-transcribe	0.0572	0.1006	4.24s
	deepgram-nova-3	0.0730	0.1195	0.99s
ko-a-clean	gpt-4o-transcribe	0.0548	0.2079	4.04s
	deepgram-nova-3	0.0685	0.2376	1.99s
ko-a-noise 🔊	gpt-4o-transcribe	0.0651	0.1980	4.03s
	deepgram-nova-3	0.0753	0.3168	0.92s
ko-b-clean	gpt-4o-transcribe	0.0337	0.0690	4.10s
	deepgram-nova-3	0.0599	0.1638	2.23s
ko-b-noise 🔊	gpt-4o-transcribe	0.0899	0.1810	4.20s
	deepgram-nova-3	0.1423	0.3276	0.76s

A few patterns worth noting. On clean English, both models are excellent and the gap is likely irrelevant in production. On noisy English, Nova-3 sometimes spikes: on en-a-noise its CER nearly doubles (0.0518 to 0.0968). On Korean, GPT-4o has lower CER and WER on every clip. The WER gap widens under noise on both Korean utterances; the CER gap widens on ko-b (0.0262 to 0.0524) but narrows slightly on ko-a (0.0137 to 0.0102). Also worth flagging: Nova-3's en-b-clean and en-b-noise CER are identical (0.0730), and its en-a-noise latency (4.45s) is an outlier where it's actually slower than GPT-4o. With n=1 per cell, individual latency figures are noisy.

Noise Robustness

Average CER on clean vs noisy clips:

Model	Clean avg CER	Noisy avg CER	Degradation Δ
openai-gpt4o-transcribe	0.0440	0.0660	+0.0221
deepgram-nova-3	0.0633	0.0969	+0.0336

Both models degrade gracefully under kitchen noise, but GPT-4o has a lower baseline CER and a smaller degradation delta. Nova-3 is still very usable, but its errors increase more with noise. On average, noise hurts more in Korean than in English for both models, though most of that comes from ko-b: ko-a loses only about 0.01 CER or less under noise for either model. Consonant-rich endings plus background noise are a likely culprit, but two utterances are too few to say for sure.
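The averages in the table, and a per-language split, can be re-derived from the per-sample CER column. This is a small sketch using unweighted means of the rounded per-clip values, so the last digit may differ from the report; the dictionary layout and function names are illustrative.

```python
# Per-clip CER from the per-sample table, as (clean, noisy) pairs.
CER = {
    "openai-gpt4o-transcribe": {
        "en-a": (0.0401, 0.0518), "en-b": (0.0472, 0.0572),
        "ko-a": (0.0548, 0.0651), "ko-b": (0.0337, 0.0899),
    },
    "deepgram-nova-3": {
        "en-a": (0.0518, 0.0968), "en-b": (0.0730, 0.0730),
        "ko-a": (0.0685, 0.0753), "ko-b": (0.0599, 0.1423),
    },
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def degradation(per_clip, prefix=""):
    """Return (clean mean, noisy mean, delta) for clips whose id starts with prefix."""
    pairs = [pair for clip, pair in per_clip.items() if clip.startswith(prefix)]
    clean = mean(c for c, _ in pairs)
    noisy = mean(n for _, n in pairs)
    return clean, noisy, noisy - clean


for model, per_clip in CER.items():
    for label, prefix in (("all", ""), ("en", "en-"), ("ko", "ko-")):
        clean, noisy, delta = degradation(per_clip, prefix)
        print(f"{model} [{label}] clean={clean:.4f} noisy={noisy:.4f} delta={delta:+.4f}")
```

Splitting by language shows how much of the overall delta comes from the Korean clips, and printing the per-clip pairs directly shows how much of that comes from ko-b alone.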

Cost and Latency

Pricing verified June 2026. Deepgram rate is the multilingual pre-recorded tier.

Model	$/min	Audio	Est. cost	Avg latency	Total latency
openai-gpt4o-transcribe	$0.0060	7.88 min	$0.04728	4.01s	32.10s
deepgram-nova-3	$0.0052	7.88 min	$0.04098	1.97s	15.75s

Nova-3 is ~13% cheaper on this run and ~2× faster in raw API latency. In a streaming or push-to-talk UX, the latency gap is user-visible. In a batch transcription pipeline, it might not matter. Deepgram also bills per second rather than rounding up, which compounds the savings on short utterances at high volume.

When to Pick Which

Choose GPT-4o-transcribe if accuracy is your primary KPI, especially on Korean. It has lower CER across both clean and noisy clips and smaller degradation under noise. Good fit for automatic subtitle generation, recipe search and semantic indexing, or any downstream LLM workflow where ASR errors propagate and get amplified.

Choose Deepgram Nova-3 if you care about cost and latency. It's 13% cheaper per minute at the multilingual rate and roughly 2× faster. Good fit for live assistants where users speak short commands, in-app cooking guidance that reacts to spoken steps, or high-volume processing where pennies per minute add up. You're trading some accuracy for speed and cost — on clean English the gap may be negligible, but on noisy Korean you'll feel it.

Why CER Beats WER for Korean

Korean ASR can't be evaluated like English. WER assumes space-delimited words are meaningful units, but in Korean, word boundaries are fuzzy and often arbitrary.

Example:

Gold: 물을 끓이고 있어요.
Model A: 물을끓이고 있어요.
Model B: 물을 끓 이고있어요.

By WER, both hypotheses are penalized heavily: each scores 2 errors against 3 reference words (WER ≈ 0.67), even though, once spaces are removed, both match the reference character for character (CER = 0). That's why this benchmark strips spaces before computing CER for Korean and treats characters as the fundamental unit. This follows the KsponSpeech Korean ASR benchmark convention, which many Korean ASR papers also use when reporting metrics.
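Running the three strings above through a word-level edit distance makes the problem visible. This is a minimal sketch with hand-rolled helpers (the names are illustrative, not from the repo), assuming words are simply whitespace-separated tokens.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over two sequences (here: lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


gold = "물을 끓이고 있어요."
hypotheses = {
    "Model A": "물을끓이고 있어요.",
    "Model B": "물을 끓 이고있어요.",
}

for name, hyp in hypotheses.items():
    # True when the only differences from the reference are spaces.
    same_chars = "".join(hyp.split()) == "".join(gold.split())
    print(f"{name}: WER={wer(gold, hyp):.2f}, identical without spaces: {same_chars}")
# Model A: WER=0.67, identical without spaces: True
# Model B: WER=0.67, identical without spaces: True
```

Neither output contains a single wrong character, yet both lose two thirds of the words under WER.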

Run It on Your Own Audio

Everything in this article is reproducible.

git clone https://github.com/devtinapark/korean-asr-benchmark
cd korean-asr-benchmark

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export OPENAI_API_KEY=your_openai_key
export DEEPGRAM_API_KEY=your_deepgram_key

Add your own clips to kitchen_samples/metadata.json:

[
  {
    "id": "001",
    "audio_file": "001_boil_water.MP3",
    "transcript": "물이 끓고 있어요. 불 좀 줄여줘.",
    "language": "ko",
    "noise": false
  },
  {
    "id": "001-noise",
    "audio_file": "001_boil_water_kitchen.MP3",
    "transcript": "물이 끓고 있어요. 불 좀 줄여줘.",
    "language": "ko",
    "noise": true
  }
]
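Before spending API credits, it can help to sanity-check the metadata file. The checker below is a hypothetical helper, not part of the repo. It assumes the five fields shown above are all required and that audio files sit in the same folder as metadata.json unless you pass a different `audio_dir`.

```python
import json
from pathlib import Path

# Field names and types as they appear in the example entries.
REQUIRED = {"id": str, "audio_file": str, "transcript": str,
            "language": str, "noise": bool}


def validate_metadata(metadata_path, audio_dir=None):
    """Return a list of human-readable problems (empty if all is well)."""
    metadata_path = Path(metadata_path)
    # Assumption: audio lives next to metadata.json by default.
    audio_dir = Path(audio_dir) if audio_dir else metadata_path.parent
    entries = json.loads(metadata_path.read_text(encoding="utf-8"))

    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        label = entry.get("id", f"entry #{index}")
        for key, expected_type in REQUIRED.items():
            if not isinstance(entry.get(key), expected_type):
                problems.append(
                    f"{label}: '{key}' missing or not {expected_type.__name__}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        audio = entry.get("audio_file")
        if isinstance(audio, str) and not (audio_dir / audio).is_file():
            problems.append(f"{label}: audio file not found: {audio}")
    return problems
```

Calling `validate_metadata("kitchen_samples/metadata.json")` and printing the returned list gives a quick pre-flight check; an empty list means every entry has the expected fields and a matching audio file.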

Then run:

python -m src.main                                # both models
python -m src.main --model openai-gpt4o-transcribe  # single provider
python -m src.main --model deepgram-nova-3

Output:

results/
├── benchmark_report.html
├── benchmark_report.md
├── results.csv
├── results.json
└── predictions/
    ├── openai-gpt4o-transcribe_predictions.csv
    └── deepgram-nova-3_predictions.csv

Swap in your own domain — customer support calls, in-car commands, educational content — and you have a small but serious evaluation harness.

What Surprised Me

Code-switch handling was not a differentiator. Both models score above 0.97 on code-switched terms (0.9742 vs 0.9710) on these clips, so it matters far less for choosing between them than I expected.

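Noise hurts more in Korean than in English. Averaged over clips, both models lose more CER on noisy Korean than on noisy English, and by WER the gap between the two models is far wider on noisy Korean (about 0.12 to 0.15) than on noisy English (about 0.02 to 0.07). Consonant-rich endings plus noise is still hard.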

Latency feels very different, even on 8 clips. Waiting ~4 seconds vs ~2 seconds per request is noticeable on a CLI. In a production UX you'd definitely feel it in live interactions.

What I'd Test Next

This benchmark is intentionally small. If you want to take it further:

More clips — at least 50–100 Korean segments, more diverse speakers and mic setups.
Other languages — Japanese and Chinese for CJK behavior, Spanish or Hindi as baselines.
Streaming APIs — measure partial-hypothesis latency and transcript stability (how often the output rewrites itself mid-utterance).
Forced alignment — for subtitle use cases, it's not just what is recognized but when.

Takeaways

For multilingual Korean/English kitchen audio, GPT-4o-transcribe is the accuracy-first choice — best CER/WER, especially on noisy Korean, with stronger robustness to background cooking noise. Deepgram Nova-3 is the cost/latency choice — ~13% cheaper per minute, ~2× faster, and still very strong on clean English.

You don't have to guess. The entire pipeline is open at devtinapark/korean-asr-benchmark. Drop in your own audio, re-run the benchmark, and make vendor decisions with metrics, not vibes.