How well do multi-modal LLMs hear Mandarin tones? Do they hear them at all? Let's find out!

by Yunus Abdülhayoğlu <hi@lingolingo.app>

Screenshot of a Mandarin tone practice app: character 他 with detected pinyin jiǎng and correct pinyin tā, plus tap-to-speak and play controls.
Mandarin tone feedback in the wild — detected vs correct pronunciation.

After reading this article https://simedw.com/2026/01/31/ear-pronunication-via-ctc/ and exploring the failure cases (see above), I was curious to see if multi-modal LLMs could hear Mandarin tones. Especially since the article references the "bitter lesson":

And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

Wouldn't it be funny if an article referencing the bitter lesson could suffer from the same bitter lesson? Okay, that's not really funny. But it's certainly interesting. By the way, I'm not trying to take away from the accomplishments of the original article. Even if SOTA LLMs can achieve similar results, there's still value in a small model that can run locally with low latency and for free.

Results

The results indicate that there is some emergent ability to hear Mandarin tones in off-the-shelf multi-modal LLMs, most notably Gemini 3.0 Pro.

Documenting My Approach

For this article, I want to try something new. I'm including my full chat transcript from Cursor to show how I'm interacting with the AI agent. I'm doing this for three reasons:

  1. The initial prompt works well as an intro for this article. The reader and the LLM are actually in the same position of needing context and an overview, and the prompt delivers that in an information-dense way.
  2. For transparency, so the reader can follow the raw process.
  3. To contribute to the discussion about documenting the use of AI when coding.

I think it's valuable to see how people are using AI, and not just the output. We're still in the infancy of documenting AI-assisted coding. Eventually the prompts might be the only thing worth reading? (Please discuss.)

Notes:


Paper referenced: https://arxiv.org/abs/2104.05657

1. The Four Tones

In Mandarin, each syllable carries one of four tones, plus an unstressed neutral tone, which we ignore here.

The figure below shows the idealized pitch contours that we use for the synthetic tones.

F0 (Hz) vs time (ms) for tones 1–4: flat, rising, dipping then rising, falling.
Pitch contours for the four Mandarin tones (synthetic). T1: high level; T2: rising; T3: dipping then rising; T4: falling.

2. Synthetic Tones

The synthetic audio clips generated from these contours:

Synthetic tones (T1–T4)
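The article doesn't include the generation code, but the idea can be sketched in a few lines of NumPy: define an idealized F0 contour per tone, then integrate it into phase to render a pure-tone glide. The sample rate, duration, and exact Hz values below are my own assumptions, not the ones used for the clips above.

```python
import numpy as np

SR = 16_000   # sample rate in Hz (an assumption; the article doesn't specify)
DUR = 0.4     # 400 ms per tone, also an assumption

def f0_contour(tone: int, n: int) -> np.ndarray:
    """Idealized F0 contour (Hz) over n samples for Mandarin tones 1-4."""
    t = np.linspace(0.0, 1.0, n)
    if tone == 1:
        return np.full(n, 220.0)               # T1: high level
    if tone == 2:
        return 150.0 + 90.0 * t                # T2: rising
    if tone == 3:
        return 180.0 - 240.0 * t * (1.0 - t)   # T3: dips to ~120 Hz mid-way
    return 240.0 - 110.0 * t                   # T4: falling

def synth(tone: int, sr: int = SR, dur: float = DUR) -> np.ndarray:
    """Render a sine glide by integrating the F0 contour into phase."""
    n = int(sr * dur)
    phase = 2.0 * np.pi * np.cumsum(f0_contour(tone, n)) / sr
    return 0.5 * np.sin(phase)
```

Writing the result out as a WAV (e.g. with the stdlib `wave` module after converting to 16-bit ints) gives the kind of raw pitch stimulus tested below.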

3. Native Speaker Pronouncing “bai” (T1–T4)

Four real Mandarin syllables from https://github.com/hugolpz/audio-cmn

bai1, bai2, bai3, bai4

4. Testing Synthetic Tones

I tested the endpoints with the raw pitch audio, but the models were not able to distinguish any tones. This indicates that tone detection is tightly coupled to the phonetic form of the syllables and doesn't transfer to abstract pitch stimuli.

| Model | T1 → pred | T2 → pred | T3 → pred | T4 → pred | Correct |
| --- | --- | --- | --- | --- | --- |
| GPT Audio (2025-08-28) | 3 | 2 | 4 | 2 | 0/4 |
| GPT-4o Audio | 3 | 3 | 4 | 4 | 0/4 |
| Gemini 2.0 Flash | 4 | 4 | 4 | 4 | 0/4 |
| Gemini 2.5 Pro | 3 | 4 | 4 | 1 | 0/4 |
| Gemini 3.0 Pro | 1 | 3 | 4 | 3 | 1/4 |
| Gemini 3.1 Pro | 2 | 2 | 3 | – | 1/4 |
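The article doesn't show the request code, but a query like the ones above can be assembled with the OpenAI-style `input_audio` content part, which litellm (used by the repo's scripts) also accepts for Gemini models. This is a minimal sketch; the prompt wording is my own, not the one used in the experiments.

```python
import base64

def tone_messages(wav_bytes: bytes) -> list:
    """Build chat messages asking a multi-modal model to classify one tone.

    Uses the OpenAI-style `input_audio` content part; the prompt text here
    is illustrative only."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Which Mandarin tone (1-4) does this syllable carry? "
                     "Reply with a single digit."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }]
```

The same message list can then be passed to an audio-capable model, e.g. `litellm.completion(model="gpt-4o-audio-preview", messages=tone_messages(wav))`.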

5. Testing with Real Syllables

Macro F1 (average of per-tone F1) on the 60-clip set. Random baseline ≈ 0.25.

Bar chart: Macro F1 by model on 60 real syllables. Gemini 3.0 Pro beats 3.1 Pro; all models beat random chance.
Macro F1 by model (60 real syllables). Gemini 3.1 Pro and 3.0 Pro lead; all models beat random chance.
Confusion matrix for Gemini 3.0 Pro: rows = true tone, columns = predicted tone.
Confusion matrix for Gemini 3.0 Pro (60 clips). Confusing T3 for T2 is the easiest mistake to make, since these two tones have the most similar contours; that the errors cluster there suggests the model is genuinely tracking pitch rather than guessing. That T2 is not confused for T3 in return could indicate a slight bias towards T2 when confidence is low.
| Model | Macro F1 |
| --- | --- |
| Gemini 3.0 Pro | 0.82 |
| Gemini 3.1 Pro | 0.76 |
| Gemini 2.5 Pro | 0.74 |
| GPT Audio (2025-08-28) | 0.58 |
| GPT-4o Audio | 0.53 |
| Gemini 2.0 Flash | 0.52 |

When interpreting these numbers, keep in mind that high recall on T4, or a general tendency to predict “4”, can reflect the frequency prior of Mandarin tones as much as genuine acoustic discrimination.
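For reference, macro F1 here is the unweighted mean of the per-tone F1 scores, so rare and common tones count equally. A minimal implementation:

```python
def macro_f1(y_true, y_pred, labels=(1, 2, 3, 4)):
    """Unweighted mean of per-class F1 scores over the given labels."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

With four balanced classes, always predicting the same tone yields a macro F1 well below the ≈0.25 random baseline on accuracy, which is why macro F1 is the stricter metric here.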

5.1 Syllable “hao” (4 clips)

Same models on four clips for the syllable hao (hao1–hao4). Interestingly, performance is much worse, with the models erring towards the third and fourth tones. A possible explanation is that there are basically no words in Mandarin that use hao1, and while hao2 is more common, it is still orders of magnitude less used than hao3 (most commonly 好, "good") and hao4 (號, "number"), which are among the most frequent words in Mandarin. This would of course have to be taken into account when considering these systems for serious use.

Macro F1 by model on 4 hao syllables.
Macro F1 by model (syllable hao, 4 clips).
Confusion matrix for Gemini 3.0 Pro on hao (4 clips).
Confusion matrix for Gemini 3.0 Pro (hao, 4 clips).
| Model | T1 → pred | T2 → pred | T3 → pred | T4 → pred | Correct |
| --- | --- | --- | --- | --- | --- |
| Gemini 3.0 Pro | 4 | 3 | 3 | 4 | 3/4 |
| GPT-4o Audio | 4 | 4 | 3 | 4 | 2/4 |
| GPT Audio (2025-08-28) | – | – | – | 4 | 1/4 |
| Gemini 2.5 Pro | 4 | 3 | 3 | 3 | 1/4 |
| Gemini 2.0 Flash | 3 | – | 3 | 3 | 1/4 |
| Gemini 3.1 Pro | 3 | 3 | 3 | – | 1/4 |

GPT Audio returned no tone for hao1–hao3; Gemini 2.0 Flash had no response for hao2; Gemini 3.1 Pro timed out on hao4.
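Empty replies and timeouts like these are why the evaluation needs a defensive parsing step that maps any reply, including a missing one, to a tone or to "no answer". This helper is my own sketch, not the repo's code:

```python
import re

def parse_tone(reply):
    """Extract the first tone digit 1-4 from a model reply.

    Returns None for empty replies, timeouts (reply is None), or replies
    that contain no valid tone digit, so missing answers can be scored
    as incorrect rather than crashing the evaluation."""
    if not reply:
        return None
    m = re.search(r"[1-4]", reply)
    return int(m.group()) if m else None
```

Scoring then treats `None` as a miss for every tone, which is how the "Correct" column above counts the blank cells.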

6. Trying Something Different

With the results being promising, but still far from perfect, I had another idea: what if we add another modality, namely images, to provide more context to the LLM? The idea is to pass an F0 (pitch) contour plot to the model, so it can visually "see" the tone. For this, Praat is the gold standard, so I used my existing Praat integration from LingoLingo to generate pitch contour images and passed them to Gemini 3.0 Pro. This method proved very effective and led to a macro F1 of 92%. Unfortunately, due to the added latency of Praat and the extra modality, requests took up to 20 seconds to complete, which would be too slow for a real-world application.
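To give a feel for what such a contour image encodes, here is a deliberately crude autocorrelation pitch estimator in NumPy. It is a stand-in for illustration only; Praat's actual tracker is far more robust (windowing, voicing decisions, octave-jump costs).

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Very crude autocorrelation F0 estimate for a single audio frame.

    Searches for the autocorrelation peak between the lags that
    correspond to fmax and fmin. A toy stand-in for Praat's tracker."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag
```

Sliding this over short overlapping frames of a syllable and plotting the estimates against time yields the kind of F0-vs-time image that was sent to the model.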

Macro F1 for local analyze-tones endpoint on 60 clips.
Macro F1 for the local endpoint (60 clips).
Confusion matrix for local endpoint: rows = true tone, columns = predicted tone.
Confusion matrix for the local analyze-tones endpoint (60 clips).

7. Repo and Scripts

Everything is in the mandarin-tones repo:

For an up-to-date list of audio-capable models: python scripts/fetch_audio_models_from_litellm.py (no API keys needed).