UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 975: invalid start byte #113
Comments
Searching with Notepad++ for [\x84\x93\x94] or [\x82\x91\x92] didn't return any results in any of my 1196 txt files. =(
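If the editor search comes up empty, a byte-level check is one alternative. The following is only a minimal sketch, assuming the captions are `.txt` files under a single folder; `find_invalid_utf8` is a hypothetical helper name, not part of the trainer. It tries to decode each file as UTF-8 and reports the first offending byte, which is the same information the error message gives.

```python
from pathlib import Path


def find_invalid_utf8(folder, pattern="*.txt"):
    """Return (path, byte_offset, byte_value) for each file that is not valid UTF-8.

    Hypothetical helper for illustration; only the first bad byte per file is reported.
    """
    problems = []
    for path in sorted(Path(folder).rglob(pattern)):
        data = path.read_bytes()  # read raw bytes so no decoding happens implicitly
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that could not be decoded
            problems.append((path, exc.start, data[exc.start]))
    return problems
```

Calling `find_invalid_utf8("path/to/captions")` and printing each tuple should point at the files worth opening. It only reads files, so it is safe to run on the original dataset.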
I got this error last night while training a LoRA; it hadn't happened with the LoRAs I'd trained before. A quick fix is to use Notepad++'s find and replace on the folder containing your training data and replace ' with nothing. I'm not entirely sure how to change the encoding back to UTF-8 automatically, but there's probably a way.
I got this error when using llama 3.1 7b as a captioner with a Joy Caption script. It was writing the ' (apostrophe) in a non-UTF-8 encoding. After I removed them, training went fine.
Yes, sometimes the file encoding gets reset, depending on whether you opened the file with another program and closed it.
You guys are right. Someone on Discord (user: Think) suggested two scripts; I ran them on the caption folder and it worked. I still think this falls into the bug category and, as a suggestion, the trainer could handle it better in the future. (Note: these scripts run on the current directory, so run them on a backup copy to avoid messing up your dataset.)
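The two scripts themselves did not make it into this thread, so the following is not them. It is only a rough sketch of the same idea, assuming the bad files were saved as Windows-1252; `reencode_to_utf8` and its `fallback` parameter are hypothetical names. Files that already decode as UTF-8 are left untouched, and everything else is decoded as Windows-1252 and written back as UTF-8.

```python
from pathlib import Path


def reencode_to_utf8(folder, pattern="*.txt", fallback="cp1252"):
    """Rewrite files that are not valid UTF-8, assuming they are in `fallback` encoding.

    Hypothetical helper for illustration. It overwrites files in place,
    so run it on a backup copy of the dataset.
    """
    converted = []
    for path in sorted(Path(folder).rglob(pattern)):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
            continue  # already valid UTF-8, nothing to do
        except UnicodeDecodeError:
            pass
        # Bytes that are undefined in the fallback encoding become U+FFFD
        text = data.decode(fallback, errors="replace")
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

Unlike replacing the apostrophe with nothing, this keeps the character and only changes how it is stored. The Windows-1252 guess is an assumption; if the files came from a different legacy encoding, the converted text will contain the wrong characters.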
#128 should fix this
Open Windows Settings > Time & Language > Language & Region > Administrative language settings > Change system locale, then check "Beta: Use Unicode UTF-8 for worldwide language support". This works for me.
This is for bugs only
Did you already ask in the discord?
Yes
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes
Describe the bug
I'm getting this error in the middle of training: once at step 399, and a second time at step 1265.
ChatGPT says it's related to a single quotation mark (').
Maybe it's a character in the config folder, or in the name of a file or caption? Edit: after further googling, it's related to the Windows-1252 smart quote (’). I just don't know how to find and replace it...
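To illustrate why byte 0x92 points at a smart quote, here is a small sketch using a made-up byte string (not taken from the actual captions). In Windows-1252, 0x92 is the right single quotation mark, but on its own it is not a valid start of a UTF-8 sequence, which is what the error message is complaining about.

```python
# Made-up sample: "it's" saved as Windows-1252 with a smart quote
raw = b"it\x92s"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)  # same kind of message as in the issue title

# Decoding with the encoding the bytes were actually written in works
as_cp1252 = raw.decode("cp1252")

# The same character stored as UTF-8 takes three bytes instead of one
utf8_bytes = "\u2019".encode("utf-8")

print(error)
print(as_cp1252 == "it\u2019s", utf8_bytes)
```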
For a previous LoRA of a person named "Loïc", I used the name as a trigger word and got errors related to the ï character. I had to change it everywhere in the config file, but I left it as it was in the captions and it worked. I also think this is a bug: file names and prompts in the config file should be allowed to contain this character.
Anyway, this is the problem I'm having now. It is not related to Loïc; it's a different LoRA.