Regex in guidance.gen fails to handle non-ASCII characters like German umlauts #1091

Open
LesterKort opened this issue Dec 30, 2024 · 0 comments


The bug
Regex constraints passed to guidance.gen fail to handle non-ASCII characters (e.g., the German umlauts ä, ö, ü and the eszett ß). Even when these characters are explicitly included in the regex pattern, the generated text systematically omits them.

To Reproduce
The following code demonstrates the issue. The regex pattern explicitly permits German umlauts and expects the generated text to adhere to it. However, the output consistently avoids such characters.

from guidance import gen
from guidance.models import Transformers
lm = Transformers("Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4", device_map="cuda", echo=False)
lm += "<|im_start|>system\nErstelle eine Liste mit 10 Regeln zum Fußball.<|im_end|>\n<|im_start|>assistant\nRegelliste:\n"
for i in range(1, 11):
    lm += f"{i}. " + gen('rule', stop='\n', regex=r'[A-ZÄÖÜ][a-zA-Z., äöüÄÖÜß]*\.\n')
    print(i, lm['rule'].strip())

Expected behavior
The generated text should match the regex pattern and use umlauts and ß where German spelling requires them, as in "Fußball", "München", or "Größe". Expected output example:

1. Fußball ist ein beliebter Sport.
2. Spieler dürfen keine Handspiele machen.
...

Actual behavior
The generated text omits umlauts and ß, replacing them with ASCII transliterations, even though they are explicitly allowed in the regex pattern. For example:

1. Fussball ist ein beliebter Sport.
2. Spieler duerfen keine Handspiele machen.
...
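
As a sanity check on the pattern itself, the sketch below runs the same regex through Python's standard `re` module against the expected and actual example lines. This is only an assumption-laden cross-check: `re` is not the engine guidance uses internally, and the helper name `conforms` is hypothetical. It shows that the pattern accepts lines with umlauts, and that it also accepts the ASCII transliterations, so the pattern permits non-ASCII characters without requiring them.

```python
import re

# Same pattern as in the reproduction; checked here with Python's `re`,
# which is NOT the regex engine guidance uses internally.
PATTERN = re.compile(r'[A-ZÄÖÜ][a-zA-Z., äöüÄÖÜß]*\.\n')


def conforms(line: str) -> bool:
    """Return True if the whole line matches the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(line) is not None


# Example lines taken from the expected and actual output in this report.
expected = [
    "Fußball ist ein beliebter Sport.\n",
    "Spieler dürfen keine Handspiele machen.\n",
]
actual = [
    "Fussball ist ein beliebter Sport.\n",
    "Spieler duerfen keine Handspiele machen.\n",
]

for line in expected + actual:
    print(conforms(line), repr(line))
```

Under `re`, all four lines match. The pattern therefore does not rule out the umlaut spellings, which suggests the omission arises during constrained generation rather than from the pattern as written.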

System info:

  • OS: WSL2 Debian GNU/Linux 12 under Windows 11
  • Guidance Version: 0.1.16
  • Transformers Version: 4.47.1
  • Transformers Model: Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4
  • Python Version: 3.11.2