
Optimized Chunking Method#4

Open
pacmanincarnate wants to merge 3 commits intojckpn:mainfrom
pacmanincarnate:main

Conversation

@pacmanincarnate

Revised the script to optimize the chunking process.
Chunking now works by adding whole paragraphs to a chunk until it reaches 3500 tokens. If a single paragraph is larger than 3500 tokens, the script breaks it in half at a sentence boundary. This should improve the meaning and internal logic of each chunk for analysis. The output file now also separates each chunk with "+++++++++" so you can see where each chunk began. Those symbols can easily be removed in Word if necessary, but they are helpful when reviewing the output text.
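The paragraph-first strategy described above can be sketched roughly like this (a sketch only: function and variable names are illustrative, not the PR's actual code, and tokens are estimated at ~4 characters each):

```python
# Illustrative sketch of the paragraph-first chunking strategy described
# above; names and the 4-chars-per-token estimate are assumptions, not
# the PR's actual code.
import re

MAX_TOKENS = 3500
CHARS_PER_TOKEN = 4  # rough estimate used in place of tiktoken

def estimate_tokens(text):
    return len(text) // CHARS_PER_TOKEN

def split_by_sentence(paragraph):
    """Split an oversized paragraph roughly in half at a sentence boundary."""
    sentences = re.split(r'(?<=[.!?])\s+', paragraph)
    if len(sentences) < 2:  # no sentence boundary found: hard-split in the middle
        mid = len(paragraph) // 2
        return paragraph[:mid], paragraph[mid:]
    mid = len(sentences) // 2
    return ' '.join(sentences[:mid]), ' '.join(sentences[mid:])

def chunk_text(text, max_tokens=MAX_TOKENS):
    chunks, current = [], ''
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    while paragraphs:
        para = paragraphs.pop(0)
        if estimate_tokens(para) > max_tokens:
            first, second = split_by_sentence(para)
            paragraphs[:0] = [first, second]  # re-queue both halves
            continue
        if current and estimate_tokens(current) + estimate_tokens(para) > max_tokens:
            chunks.append(current.strip())
            current = ''
        current += para + '\n\n'
    if current.strip():
        chunks.append(current.strip())
    # The separator makes chunk boundaries visible in the output file
    return '\n+++++++++\n'.join(chunks)
```

Oversized paragraphs are re-queued after splitting, so a half that is still too large gets split again.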

I tried to integrate tiktoken to count tokens accurately, but I was getting bad results. If someone wants to take another stab at that, go for it. For now, I'm estimating tokens from character counts. If we get tiktoken working, the chunk size could be calculated much more efficiently as the model's max tokens minus the prompt length.
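One way to keep the character-based estimate as a fallback while still trying tiktoken (a sketch; the function name is illustrative, and `encoding_for_model` is tiktoken's standard entry point):

```python
# Token counting with tiktoken when available, falling back to the
# character-based estimate the script currently uses. Names here are
# illustrative, not the PR's actual code.
def count_tokens(text, model="gpt-3.5-turbo"):
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # Fallback: rough estimate of ~4 characters per token
        return max(1, len(text) // 4)
```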

@pacmanincarnate
Author

I realized today that you do need to be able to set the chunk size manually. GPT is limited to 4K tokens, but that includes input AND output, so if you fill the input efficiently, the response will be very short. I was using this script to rewrite a book in a different style and it kept summarizing rather than rewriting, and this is why.

This is probably a point worth noting in the command line instructions as well.

@Linereck
Contributor

Hey @pacmanincarnate, good stuff, but gpt-4 has an 8k token limit. Maybe we should change the default chunk size based on the selected model? What are your thoughts on that?
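A minimal sketch of what a model-aware default could look like (the limits below are the documented context windows for these models; the helper name is illustrative):

```python
# Combined input+output token limits for the models discussed in this thread;
# a lookup like this could drive the default chunk size per model.
MODEL_TOKEN_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}

def default_chunk_size(model, response_tokens=1000):
    """Reserve room for the response; the rest is available for the input chunk."""
    limit = MODEL_TOKEN_LIMITS.get(model, 4096)  # conservative fallback
    return limit - response_tokens
```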

@jckpn
Owner

jckpn commented May 25, 2023

GPT is limited to 4K tokens, but that includes input AND output

Wasn't aware of this, thanks @pacmanincarnate.

Will test later and merge if all goes ok. Thanks a lot!

@pacmanincarnate
Author

Great! I haven't had a chance to change the code to ask for the input length. It shouldn't be hard to do, but I've been busy with real work. Thinking about it, we may want to ask for the response length and subtract that from the max token limit, so you dictate how long you want the response to be. That would address the issue at hand more directly.
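The subtraction described above might look like this (a sketch: 4096 is gpt-3.5-turbo's combined input+output window, and the function names are illustrative, not the PR's actual code):

```python
MODEL_LIMIT = 4096  # gpt-3.5-turbo: input and output share this budget

def max_chunk_tokens(response_tokens, model_limit=MODEL_LIMIT):
    """Max chunk = model limit minus the desired response length."""
    if not 0 < response_tokens < model_limit:
        raise ValueError("response length must fit within the model limit")
    return model_limit - response_tokens

def ask_max_chunk_tokens():
    # Interactive wrapper: prompt the user for the desired response length
    raw = input("Desired response length in tokens [1000]: ").strip()
    return max_chunk_tokens(int(raw) if raw else 1000)
```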

Asks for the length of the response and uses it to calculate the maximum chunk size (4096 - response length = max chunk).
@pacmanincarnate
Author

Alright, I have updated per my previous comment. The script should be good to go now, at least until ChatGPT gets token limit options beyond 4096.

@jckpn jckpn self-requested a review March 22, 2024 10:01
