'utf-8' codec can't decode byte 0xe2 in position 4785: invalid continuation byte #2552

LeeKIngKIng · 2025-03-19T05:57:36Z

When I transcribe the audio, the following error occurs.

UnicodeDecodeError Traceback (most recent call last)
Cell In[1], line 138
132 add_subtitles_with_watermark(compose_clip, subtitles, output_path, watermark_path, top_padding, left_padding)
134 # Clean up temporary audio file
135 # os.remove(audio_path)
--> 138 process_video_folder('德玛西亚人在塔在啦啦啦啦哈哈哈哈')

Cell In[1], line 105, in process_video_folder(words_str, video_folder, output_folder)
101 extract_audio('video_folder/4.mp4', audio_path)
103 # Transcribe audio
104 # transcriptions = transcribe_audio(audio_path)
--> 105 words = transcribe_words(audio_path)
106 print(words)
107 return

Cell In[1], line 31, in transcribe_words(audio)
28 import whisper
30 model = whisper.load_model("medium")
---> 31 result = model.transcribe(audio)
32 return json.dumps(result)

File ~/anaconda3/lib/python3.12/site-packages/whisper/transcribe.py:146, in transcribe(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, prepend_punctuations, append_punctuations, clip_timestamps, hallucination_silence_threshold, **decode_options)
142 print(
143 "Detecting language using up to the first 30 seconds. Use --language to specify the language"
144 )
145 mel_segment = pad_or_trim(mel, N_FRAMES).to(model.device).to(dtype)
--> 146 _, probs = model.detect_language(mel_segment)
147 decode_options["language"] = max(probs, key=probs.get)
148 if verbose is not None:

File ~/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)

File ~/anaconda3/lib/python3.12/site-packages/whisper/decoding.py:35, in detect_language(model, mel, tokenizer)
22 """
23 Detect the spoken language in the audio, and return them as list of strings, along with the ids
24 of the most probable language tokens and the probability distribution over all language tokens.
(...)
32 list of dictionaries containing the probability distribution over all languages.
33 """
34 if tokenizer is None:
---> 35 tokenizer = get_tokenizer(
36 model.is_multilingual, num_languages=model.num_languages
37 )
38 if (
39 tokenizer.language is None
40 or tokenizer.language_token not in tokenizer.sot_sequence
41 ):
42 raise ValueError(
43 "This model doesn't have language tokens so it can't perform lang id"
44 )

File ~/anaconda3/lib/python3.12/site-packages/whisper/tokenizer.py:391, in get_tokenizer(multilingual, num_languages, language, task)
388 language = None
389 task = None
--> 391 encoding = get_encoding(name=encoding_name, num_languages=num_languages)
393 return Tokenizer(
394 encoding=encoding, num_languages=num_languages, language=language, task=task
395 )

File ~/anaconda3/lib/python3.12/site-packages/whisper/tokenizer.py:335, in get_encoding(name, num_languages)
330 @lru_cache(maxsize=None)
331 def get_encoding(name: str = "gpt2", num_languages: int = 99):
332 vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
333 ranks = {
334 base64.b64decode(token): int(rank)
--> 335 for token, rank in (line.split() for line in open(vocab_path) if line)
336 }
337 n_vocab = len(ranks)
338 special_tokens = {}

File ~/anaconda3/lib/python3.12/site-packages/whisper/tokenizer.py:335, in <genexpr>(.0)
330 @lru_cache(maxsize=None)
331 def get_encoding(name: str = "gpt2", num_languages: int = 99):
332 vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
333 ranks = {
334 base64.b64decode(token): int(rank)
--> 335 for token, rank in (line.split() for line in open(vocab_path) if line)
336 }
337 n_vocab = len(ranks)
338 special_tokens = {}

File <frozen codecs>:322, in decode(self, input, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 4785: invalid continuation byte

Advait251206 · 2026-06-24T18:29:43Z

This error is unlikely to be caused by your audio file. The traceback shows the failure occurs before transcription starts, while Whisper is loading its tokenizer vocabulary:

whisper/tokenizer.py

open(vocab_path)

↓

UnicodeDecodeError:
'utf-8' codec can't decode byte 0xe2 ...

What's actually happening?

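Whisper is trying to read a vocabulary file from:

.../site-packages/whisper/assets/

in text mode:

open(vocab_path)

and one of the bytes in that file is not valid UTF-8. Since no encoding is passed, Python falls back to the locale's default, which is UTF-8 here according to the error message.

One caveat on the file name: "gpt2" in the traceback is only the default value of get_encoding's name parameter. get_tokenizer passes encoding_name explicitly, and in the Whisper source that is "multilingual" for multilingual models such as medium. So the file being read is most likely multilingual.tiktoken, with gpt2.tiktoken used by the English-only models.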

The relevant code is:

for token, rank in (
    line.split()
    for line in open(vocab_path)
    if line
)

So the tokenizer vocabulary file itself is failing to decode.
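To make the failure mode concrete, here is a minimal sketch that mirrors the parsing logic on in-memory bytes. The function name parse_tiktoken and the three-line sample vocabulary are made up for illustration; only the "base64 token, space, integer rank" layout comes from the traceback.

```python
import base64


def parse_tiktoken(data: bytes) -> dict:
    """Parse vocabulary bytes roughly the way the traceback's dict comprehension does."""
    text = data.decode("utf-8")  # the decoding step is where the traceback fails
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in text.splitlines() if line)
    }


# Hypothetical stand-in for a vocabulary file: "<base64 token> <rank>" per line.
good = b"IQ== 0\nIg== 1\nIw== 2\n"
print(parse_tiktoken(good))

# The same bytes with a stray 0xe2 spliced in before an ordinary ASCII byte.
bad = good[:7] + b"\xe2" + good[7:]
try:
    parse_tiktoken(bad)
except UnicodeDecodeError as exc:
    print(exc)
```

A single stray byte is enough to abort the whole load, which is why the error appears before any audio is processed.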

Most likely causes

1. Corrupted Whisper installation

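The most common cause is a corrupted .tiktoken vocabulary file (multilingual.tiktoken for multilingual models such as medium, gpt2.tiktoken for English-only ones).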

Check:

import whisper
import os

print(os.path.dirname(whisper.__file__))

Then inspect the .tiktoken files in:

.../whisper/assets/

If a file contains strange characters or looks truncated, reinstall Whisper.
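Rather than eyeballing the file, the inspection can be scripted. The sketch below is one possible check, assuming every non-empty line should be ASCII and consist of a base64 token followed by an integer rank; find_bad_lines and the sample bytes are hypothetical names and data, not part of Whisper.

```python
import base64


def find_bad_lines(data: bytes, limit: int = 5) -> list:
    """Return (line_number, raw_line) pairs for lines that are not '<base64> <int>'."""
    bad = []
    for number, line in enumerate(data.splitlines(), start=1):
        if not line:
            continue
        parts = line.split()
        try:
            if len(parts) != 2 or not line.isascii():
                raise ValueError("unexpected layout or non-ASCII byte")
            base64.b64decode(parts[0], validate=True)  # raises a ValueError subclass
            int(parts[1].decode("ascii"))
        except ValueError:
            bad.append((number, line))
            if len(bad) >= limit:
                break
    return bad


# Hypothetical sample: line 2 contains a stray 0xe2 byte.
sample = b"IQ== 0\nIg\xe2== 1\nIw== 2\n"
print(find_bad_lines(sample))
```

To run it against the real file, read it with open(path, "rb") and pass the bytes in. An empty result means every line has the expected shape.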

2. File accidentally modified

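Sometimes manual edits, Git merges, text editors, or encoding conversion tools can corrupt the tokenizer file.

The .tiktoken vocabulary should be plain ASCII text: each line is a base64-encoded token followed by its integer rank, which is what the parsing code above expects.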

If you see:

â€™
â€œ
ï»¿

or other garbled characters, the file has likely been re-encoded incorrectly.
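For context, sequences like these typically appear when UTF-8 bytes are misread as Windows-1252. The short sketch below reproduces the three examples; the helper name mojibake is made up for illustration. It also shows that 0xe2, the byte named in the error, is the lead byte of the UTF-8 encoding of curly quotes.

```python
def mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


print(mojibake("\u2019"))  # right single quotation mark
print(mojibake("\u201c"))  # left double quotation mark
print(mojibake("\ufeff"))  # byte order mark

# The UTF-8 encoding of a curly quote starts with 0xe2.
print("\u2019".encode("utf-8"))
```

A lone 0xe2 that is not followed by valid continuation bytes produces exactly the "invalid continuation byte" message in the traceback.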

3. Python 3.12 compatibility edge case

Your traceback paths (~/anaconda3/lib/python3.12/) show Python 3.12 under Anaconda.

Whisper generally works on recent Python versions, and nothing in this traceback points to a Python 3.12 bug. Environment or package inconsistencies can cause tokenizer-related problems, but the error itself still points to a bad vocabulary file.

Quick verification

Run:

import whisper
import os

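# "medium" is multilingual, so it reads multilingual.tiktoken;
# English-only models such as "medium.en" read gpt2.tiktoken instead.
vocab_path = os.path.join(
    os.path.dirname(whisper.__file__),
    "assets",
    "multilingual.tiktoken"
)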

with open(vocab_path, "rb") as f:
    data = f.read()

print(len(data))

Then:

data.decode("utf-8")

If this raises the same exception, the file is definitely corrupted.
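If the decode does fail, the exception itself says where. The sketch below uses the start attribute of UnicodeDecodeError to report the byte offset, the line number, and the surrounding bytes; locate_decode_failure and the sample data are illustrative names, not Whisper APIs.

```python
from typing import Optional, Tuple


def locate_decode_failure(data: bytes, context: int = 8) -> Optional[Tuple[int, int, bytes]]:
    """Return (byte_offset, line_number, nearby_bytes) for the first invalid UTF-8 byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Count newlines before the bad byte to get a 1-based line number.
        line_number = data.count(b"\n", 0, exc.start) + 1
        window = data[max(0, exc.start - context): exc.start + context]
        return exc.start, line_number, window
    return None  # the data decoded cleanly


# Hypothetical sample with a stray 0xe2 at the start of line 2.
sample = b"IQ== 0\n\xe2Ig== 1\n"
print(locate_decode_failure(sample))
```

Applied to the bytes read from the real vocabulary file, the offset should match the position reported in the original error (4785) if the same file is at fault.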

Recommended fix

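Reinstall Whisper and its tokenizer assets. The second uninstall below targets the unrelated whisper package on PyPI, in case it was installed by mistake: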

pip uninstall openai-whisper
pip uninstall whisper

Then:

pip install --no-cache-dir openai-whisper

Using --no-cache-dir is important because a corrupted cached wheel can reinstall the same bad file.

Verify after reinstall

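Check that the tokenizer loads, since that is the step that failed:

import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("medium")
# Same call that failed in the traceback (whisper/decoding.py, line 35)
get_tokenizer(model.is_multilingual, num_languages=model.num_languages)
print("OK")

If this prints OK, the tokenizer assets are readable. Loading the model alone does not prove that: in the traceback, load_model succeeded and the error was raised later, inside get_tokenizer.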

Alternative possibility

If you're working inside a notebook or project directory, ensure you don't have a local file named:

whisper.py

or a folder:

whisper/

shadowing the installed package.

That can occasionally lead to loading unexpected assets.
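A quick way to check for shadowing is to ask Python where it would import whisper from, without importing it. The sketch below is one way to do that; locate_module and is_inside are hypothetical helper names, and the "inside the current directory" test is only a heuristic (a virtual environment stored in the project folder will also match).

```python
import importlib.util
from pathlib import Path
from typing import Optional


def locate_module(name: str) -> Optional[Path]:
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location:
        return None
    return Path(spec.origin).resolve()


def is_inside(path: Path, directory: Path) -> bool:
    """True if path lies somewhere below directory."""
    return directory.resolve() in path.resolve().parents


origin = locate_module("whisper")
if origin is None:
    print("whisper is not importable from this environment")
elif is_inside(origin, Path.cwd()):
    print(f"whisper resolves to a local file, possibly shadowing the package: {origin}")
else:
    print(f"whisper resolves to: {origin}")
```

Run it from the same notebook or working directory that produced the error, since the result depends on the current directory and sys.path.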

Diagnosis

The failure occurs while Whisper reads a .tiktoken vocabulary file from:

whisper/assets/

not while processing the audio. The most likely cause is a corrupted or incorrectly encoded vocabulary file (multilingual.tiktoken for the medium model, gpt2.tiktoken for English-only models) in your Whisper installation. Reinstalling openai-whisper (preferably with --no-cache-dir) usually resolves this issue.
