Fix single newlines #6599

mamei16 · 2024-12-22T15:25:40Z

The aim of this PR is to allow single newlines to be used both by the LLM and the user and thus fix issue #6597.

The following is an example showing the effect of the change made:

Before:

After:

@oobabooga I had a difficult time figuring out the purpose of previous_line_empty. Is there any particular kind of output that relies on it?


daniel-ang · 2024-12-28T06:15:48Z

I tried this fix. It works on the user's replies but still exhibits the same issue in the assistant replies.

Edit: Hmm... Sometimes the user's replies do not work either. I can't find a definitive pattern for this behavior yet.

oobabooga · 2024-12-31T03:23:33Z

The commit that introduced problems was 3d19746, so I have reverted things (in this PR) to be as close as possible to before that commit.

The idea of this \n\n behavior is that LLMs often generate paragraphs separated by a single \n (they do not always generate markdown), so it's necessary to turn that into \n\n to split into paragraphs, EXCEPT if the current line is in a code block, LaTeX equation, or table.
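A minimal sketch of the rule described above, not the actual webui code: the function name `split_paragraphs`, the fence detection, and the `$$` handling for LaTeX are assumptions made for illustration.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example fence-safe


def split_paragraphs(text: str) -> str:
    """Turn single newlines into paragraph breaks, except in code, LaTeX, and tables."""
    result = ""
    is_code = False
    is_latex = False
    for line in text.split("\n"):
        stripped = line.strip()
        # Toggle code-block state on fence lines.
        if stripped.startswith(FENCE):
            is_code = not is_code
        # Toggle LaTeX state on standalone $$ lines (assumed display-math style).
        elif stripped == "$$" and not is_code:
            is_latex = not is_latex

        result += line
        if is_code or is_latex or line.startswith("|"):
            # Keep the single newline so the block structure survives.
            result += "\n"
        else:
            # Plain text: force a paragraph break.
            result += "\n\n"
    return result.strip()
```

With this sketch, `split_paragraphs("Para one.\nPara two.")` yields two paragraphs, while table rows and the lines inside a fenced block stay adjacent. List handling is deliberately left out here.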

But the lists behavior was and still is wrong. Examples of lists that cause issues:

Here is the list of fruits:

- Apples
    - Red Delicious
    - Granny Smith
    - Fuji
- Oranges
    - Navel
    - Valencia
    - Blood Orange
- Bananas
    - Cavendish
    - Plantain
    - Red Banana
- Carrots
    - Baby Carrots
    - Heirloom Carrot

Here is another one:

1. Fruits
    - Apples
        - Red Delicious
        - Granny Smith
        - Fuji
    - Oranges
        - Navel
        - Valencia
        - Blood Orange
    - Bananas
        - Cavendish
        - Plantain
        - Red Banana
2. Vegetables
    - Carrots
        - Baby Carrots
        - Heirloom Carrot

oobabooga · 2024-12-31T03:54:40Z

Now the lists above work, but

- Item 1
  - Subitem 1.1
  - Subitem 1.2
    - Sub-subitem 1.2.1
    - Sub-subitem 1.2.2
- Item 2
  - Subitem 2.1
  - Subitem 2.2

doesn't, unless I add back tab_length=2, in which case the previous lists above break again. I'm not sure if this can be solved.

oobabooga · 2024-12-31T04:03:30Z

This may be unfixable without a dirty hack to enforce 4 spaces indentation for every nested list generated by an LLM. See:

Python-Markdown/markdown#3 (comment)

I'll merge this PR because it fixes the issues above. If someone can see a better solution, please send a new PR!
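For reference, the kind of hack mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `normalize_list_indent` is a hypothetical helper, it treats the smallest non-zero list indentation as one nesting level, and it does not skip code blocks, so list-like lines inside code would be rewritten too.

```python
import re

# A list item: optional leading spaces, a bullet or "1."-style marker, then a space.
LIST_ITEM = re.compile(r"^( *)([-*+]|\d+\.) ")


def normalize_list_indent(text: str, target: int = 4) -> str:
    """Rescale nested-list indentation so each nesting level uses `target` spaces."""
    lines = text.split("\n")

    # Collect the indentation widths of all indented list items.
    indents = []
    for line in lines:
        match = LIST_ITEM.match(line)
        if match and match.group(1):
            indents.append(len(match.group(1)))
    if not indents:
        return text

    # Assumption: the smallest indent seen is one nesting level.
    unit = min(indents)
    if unit == target:
        return text

    out = []
    for line in lines:
        match = LIST_ITEM.match(line)
        if match:
            depth = len(match.group(1)) // unit
            line = " " * (depth * target) + line[len(match.group(1)):]
        out.append(line)
    return "\n".join(out)
```

Applied to the 2-space list above, "  - Subitem 1.1" becomes "    - Subitem 1.1" and the sub-subitems move to 8 spaces, while lists that already use 4 spaces are returned unchanged.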

mamei16 · 2024-12-31T15:25:48Z

Thanks for your time and effort, I was not aware that nested lists were such a pain in the ass to process.

However, I disagree with you here:

> The idea of this \n\n behavior is that LLMs often generate paragraphs separated by a single \n (they do not always generate markdown), so it's necessary to turn that into \n\n to split into paragraphs...

If the LLM does not generate markdown-formatted single newlines, you'd need to turn a single \n into '  \n' (note the two added spaces before the \n). Otherwise, you cannot discern single newlines from double newlines, which is what this PR was originally about. Here is an example showcasing the issue:
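The same ambiguity can be sketched in plain Python. The helper names (`old_convert`, `new_convert`, `paragraphs`) are hypothetical and simplify the real logic; `paragraphs` only approximates how a Markdown renderer splits text on blank lines.

```python
import re


def old_convert(text: str) -> str:
    # Simplified old behavior: every line is followed by a blank line.
    return "".join(line + "\n\n" for line in text.split("\n")).strip()


def new_convert(text: str) -> str:
    # Proposed behavior: every line ends with a Markdown hard line break.
    return "".join(line + "  \n" for line in text.split("\n")).strip()


def paragraphs(md: str) -> list:
    # Approximation: any run of blank (whitespace-only) lines is one paragraph break.
    return [part.strip() for part in re.split(r"\n\s*\n", md) if part.strip()]


single = "first line\nsecond line"
double = "first line\n\nsecond line"

# Old behavior: both inputs collapse to the same two paragraphs.
same_under_old = paragraphs(old_convert(single)) == paragraphs(old_convert(double))

# New behavior: a single newline stays inside one paragraph as a hard break,
# while a double newline still separates two paragraphs.
one_paragraph = paragraphs(new_convert(single))
two_paragraphs = paragraphs(new_convert(double))
```

Here `same_under_old` is true, which is the loss of information being described, whereas `one_paragraph` has one element and `two_paragraphs` has two.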

oobabooga · 2024-12-31T15:52:50Z

Thanks for the example, that makes sense. So after this PR, the change would be this one?

        # Don't add an extra \n for code, LaTeX, or tables
        if is_code or is_latex or line.startswith('|'):
            result += '\n'
        # Also don't add an extra \n for lists
        elif stripped_line.startswith('-') or stripped_line.startswith('*') or stripped_line.startswith('+') or stripped_line.startswith('>') or re.match(r'\d+\.', stripped_line):
            result += '\n'
        else:
-            result += '\n\n'
+            result += '  \n'

It seems like that would break this case (2 long paragraphs separated by a \n), for which the paragraph separation wouldn't be clear without the \n\n:

Growing plants at home provides food and beauty for many households. The hobby has gained popularity recently as more people seek sustainable lifestyles and want to connect with nature. Gardens can range from small herb boxes to extensive vegetable patches.
While rewarding, gardening requires dedication and knowledge about soil conditions, watering schedules, and seasonal planting cycles. Plants need consistent care to thrive, but watching them grow from seeds to mature specimens brings satisfaction to millions of gardeners worldwide.

mamei16 · 2024-12-31T16:13:37Z

> So after this PR, the change would be this one?

Yes, exactly.

Regarding the case you provided, I think it's a matter of whether one believes that LLMs are capable of correctly using single and double newlines in their output. If yes, then automatically converting \n to \n\n will result in output not being shown as intended by the LLM. Since LLMs are trained on texts created by humans (mostly), I'd argue most humans would put \n\n to separate two paragraphs instead of just \n and as such, single newlines should not be changed to double newlines.
Personally, I do think current models are already able to make that distinction by themselves:

oobabooga · 2024-12-31T22:51:30Z

That's fair, I wrote that \n\n logic a long time ago and you are right that most models today likely separate paragraphs by \n\n. I have reapplied your change in 64853f8. Thanks again for the fix.

oobabooga and others added 5 commits October 1, 2024 14:48

Merge pull request oobabooga#6421 from oobabooga/dev

3b06cb4

Merge dev branch

Merge pull request oobabooga#6422 from oobabooga/dev

d1af7a4

Merge dev branch

Merge pull request oobabooga#6491 from oobabooga/dev

cc8c7ed

Merge dev branch

Merge pull request oobabooga#6585 from oobabooga/dev

4d466d5

Merge dev branch

remove logic regarding 'previous_line_empty', use markdown formatted …

aac8a12

…newline for normal text

mamei16 changed the base branch from main to dev December 28, 2024 20:19

oobabooga added 2 commits December 30, 2024 19:06

Merge branch 'dev' into mamei16-fix_single_newlines

4776f9f

Revert to the previous markdown behavior (before 3d19746)

9c340b0

Revert one more change

fb3c533

oobabooga merged commit e953af8 into oobabooga:dev Dec 31, 2024

oobabooga added a commit that referenced this pull request Dec 31, 2024

Reapply a necessary change that I removed from #6599 (thanks @mamei16!)

64853f8
