Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix single newlines #6599

Merged
merged 8 commits into from
Dec 31, 2024
Merged

Fix single newlines #6599

merged 8 commits into from
Dec 31, 2024

Conversation

mamei16
Copy link
Contributor

@mamei16 mamei16 commented Dec 22, 2024

The aim of this PR is to allow single newlines to be used both by the LLM and the user and thus fix issue #6597.

The following is an example showing the effect of the change made:

Before:
old_newlines
After:
new_newlines

@oobabooga I had a difficult time figuring out the purpose of previous_line_empty. Is there any particular kind of output that relies on it?

@daniel-ang
Copy link

daniel-ang commented Dec 28, 2024

I tried this fix. It works on the user's replies but still exhibits the same issue in the assistant replies.

Edit: Hmm... Sometimes the user's replies do not work either. I can't find definitive pattern for this behavior yet.

@mamei16 mamei16 changed the base branch from main to dev December 28, 2024 20:19
@oobabooga
Copy link
Owner

The commit that introduced problems was 3d19746, so I have reverted things (in this PR) to be as close as possible to before that commit.

The idea of this \n\n behavior is that LLMs often generate paragraphs separated by a single \n (they do not always generate markdown), so it's necessary to turn that into \n\n to split into paragraphs, EXCEPT if the current line is in a code block, LaTeX equation, or table.

But the lists behavior was and still is wrong. Examples of lists that cause issues:

Here is the list of fruits::

- Apples
    - Red Delicious
    - Granny Smith
    - Fuji
- Oranges
    - Navel
    - Valencia
    - Blood Orange
- Bananas
    - Cavendish
    - Plantain
    - Red Banana
- Carrots
    - Baby Carrots
    - Heirloom Carrot

Here is another one:

1. Fruits
    - Apples
        - Red Delicious
        - Granny Smith
        - Fuji
    - Oranges
        - Navel
        - Valencia
        - Blood Orange
    - Bananas
        - Cavendish
        - Plantain
        - Red Banana
2. Vegetables
    - Carrots
        - Baby Carrots
        - Heirloom Carro

@oobabooga
Copy link
Owner

Now the lists above work, but

- Item 1
  - Subitem 1.1
  - Subitem 1.2
    - Sub-subitem 1.2.1
    - Sub-subitem 1.2.2
- Item 2
  - Subitem 2.1
  - Subitem 2.2

doesn't, unless I add back tab_length=2, in which case the previous lists above break again. I'm not sure if this can be solved.

@oobabooga
Copy link
Owner

This may be unfixable without a dirty hack to enforce 4 spaces indentation for every nested list generated by an LLM. See:

Python-Markdown/markdown#3 (comment)

I'll merge this PR because it fixes the issues above. If someone can see a better solution, please send a new PR!

@oobabooga oobabooga merged commit e953af8 into oobabooga:dev Dec 31, 2024
@mamei16
Copy link
Contributor Author

mamei16 commented Dec 31, 2024

Thanks for your time and effort, I was not aware that nested lists were such a pain in the ass to process.

However, I disagree with you here:

The idea of this \n\n behavior is that LLMs often generate paragraphs separated by a single \n (they do not always generate markdown), so it's necessary to turn that into \n\n to split into paragraphs...

If the LLM does not generate markdown formatted single newlines, you'd need to turn a single \n into \n (note the added double space). Otherwise, you cannot discern single newlines from double newlines, which is what this PR was originally about. Here is an example showcasing the issue:
result-sprite

@oobabooga
Copy link
Owner

Thanks for the example, that makes sense. So after this PR, the change would be this one?

        # Don't add an extra \n for code, LaTeX, or tables
        if is_code or is_latex or line.startswith('|'):
            result += '\n'
        # Also don't add an extra \n for lists
        elif stripped_line.startswith('-') or stripped_line.startswith('*') or stripped_line.startswith('+') or stripped_line.startswith('>') or re.match(r'\d+\.', stripped_line):
            result += '\n'
        else:
-            result += '\n\n'
+            result += '  \n'

It seems like that would break this case (2 long paragraphs separated by a \n), for which the paragraph separation wouldn't be clear without the \n\n:

Growing plants at home provides food and beauty for many households. The hobby has gained popularity recently as more people seek sustainable lifestyles and want to connect with nature. Gardens can range from small herb boxes to extensive vegetable patches.
While rewarding, gardening requires dedication and knowledge about soil conditions, watering schedules, and seasonal planting cycles. Plants need consistent care to thrive, but watching them grow from seeds to mature specimens brings satisfaction to millions of gardeners worldwide.

@mamei16
Copy link
Contributor Author

mamei16 commented Dec 31, 2024

So after this PR, the change would be this one?

Yes, exactly.

Regarding the case you provided, I think it's a matter of whether one believes that LLMs are capable of correctly using single and double newlines in their output. If yes, then automatically converting \n to \n\n will result in output not being shown as intended by the LLM. Since LLMs are trained on texts created by humans (mostly), I'd argue most humans would put \n\n to separate two paragraphs instead of just \n and as such, single newlines should not be changed to double newlines.
Personally, I do think current models are already able to make that distinction by themselves:
Screenshot_20241231_170038

@oobabooga
Copy link
Owner

That's fair, I wrote that \n\n logic a long time ago and you are right that most models today likely separate paragraphs by \n\n. I have reapplied your change in 64853f8. Thanks again for the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants