
Optimized Chunking Method#4

Open
pacmanincarnate wants to merge 3 commits intojckpn:mainfrom
pacmanincarnate:main

Conversation

@pacmanincarnate

Revised the script to optimize the chunking process.
Chunking now works by adding whole paragraphs to a chunk until it reaches 3500 tokens. If a single paragraph is larger than 3500 tokens, the script breaks it in half at a sentence boundary. This should improve the meaning and internal logic of each chunk for analysis. The output file now also separates each chunk with "+++++++++" so you can see where each chunk began. Those symbols can easily be removed in Word if necessary, but they are helpful when reviewing the output text.
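The paragraph-first strategy described above can be sketched roughly like this (a sketch only: function and variable names are illustrative, not the PR's actual code, and tokens are estimated at ~4 characters each):

```python
# Illustrative sketch of the paragraph-first chunking strategy described
# above; names and the 4-chars-per-token estimate are assumptions, not
# the PR's actual code.
import re

MAX_TOKENS = 3500
CHARS_PER_TOKEN = 4  # rough estimate used in place of tiktoken

def estimate_tokens(text):
    return len(text) // CHARS_PER_TOKEN

def split_by_sentence(paragraph):
    """Split an oversized paragraph roughly in half at a sentence boundary."""
    sentences = re.split(r'(?<=[.!?])\s+', paragraph)
    if len(sentences) < 2:  # no sentence boundary found: hard-split in the middle
        mid = len(paragraph) // 2
        return paragraph[:mid], paragraph[mid:]
    mid = len(sentences) // 2
    return ' '.join(sentences[:mid]), ' '.join(sentences[mid:])

def chunk_text(text, max_tokens=MAX_TOKENS):
    chunks, current = [], ''
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    while paragraphs:
        para = paragraphs.pop(0)
        if estimate_tokens(para) > max_tokens:
            first, second = split_by_sentence(para)
            paragraphs[:0] = [first, second]  # re-queue both halves
            continue
        if current and estimate_tokens(current) + estimate_tokens(para) > max_tokens:
            chunks.append(current.strip())
            current = ''
        current += para + '\n\n'
    if current.strip():
        chunks.append(current.strip())
    # The separator makes chunk boundaries visible in the output file
    return '\n+++++++++\n'.join(chunks)
```

Oversized paragraphs are re-queued after splitting, so a half that is still too large gets split again.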

I tried to integrate tiktoken to count tokens accurately, but I was getting bad results. If someone wants to take another stab at that, go for it. For now, I'm estimating tokens from character counts. If we get tiktoken working, the chunk size could be calculated much more efficiently as the model's max tokens minus the prompt length.
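One way to keep the character-based estimate as a fallback while still trying tiktoken (a sketch; the function name is illustrative, and `encoding_for_model` is tiktoken's standard entry point):

```python
# Token counting with tiktoken when available, falling back to the
# character-based estimate the script currently uses. Names here are
# illustrative, not the PR's actual code.
def count_tokens(text, model="gpt-3.5-turbo"):
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # Fallback: rough estimate of ~4 characters per token
        return max(1, len(text) // 4)
```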

@pacmanincarnate
Author

I realized today that you do need to be able to set the chunk size manually. GPT is limited to 4K tokens, but that includes input AND output, so if you fill the input efficiently, the response will be very short. I was using this script to rewrite a book in a different style and it kept summarizing rather than rewriting, and this is why.

This is probably a point worth noting in the command line instructions as well.

@Linereck
Contributor

Hey @pacmanincarnate, good stuff, but gpt-4 has an 8k token limit. Maybe we should change the default chunk size based on the selected model? What are your thoughts on that?
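A minimal sketch of what a model-aware default could look like (the limits below are the documented context windows for these models; the helper name is illustrative):

```python
# Combined input+output token limits for the models discussed in this thread;
# a lookup like this could drive the default chunk size per model.
MODEL_TOKEN_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}

def default_chunk_size(model, response_tokens=1000):
    """Reserve room for the response; the rest is available for the input chunk."""
    limit = MODEL_TOKEN_LIMITS.get(model, 4096)  # conservative fallback
    return limit - response_tokens
```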

@jckpn
Owner

jckpn commented May 25, 2023

GPT is limited to 4K tokens, but that includes input AND output

Wasn't aware of this, thanks @pacmanincarnate.

Will test later and merge if all goes ok. Thanks a lot!

@pacmanincarnate
Author

Great! I haven't had a chance to change the code to ask for the input length. It shouldn't be hard to do, but I've been busy with real work. Thinking about it, we may want to ask for the response length and subtract that from the max token limit, so you dictate how long you want the response to be. That would address the issue at hand more directly.
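The subtraction described above might look like this (a sketch: 4096 is gpt-3.5-turbo's combined input+output window, and the function names are illustrative, not the PR's actual code):

```python
MODEL_LIMIT = 4096  # gpt-3.5-turbo: input and output share this budget

def max_chunk_tokens(response_tokens, model_limit=MODEL_LIMIT):
    """Max chunk = model limit minus the desired response length."""
    if not 0 < response_tokens < model_limit:
        raise ValueError("response length must fit within the model limit")
    return model_limit - response_tokens

def ask_max_chunk_tokens():
    # Interactive wrapper: prompt the user for the desired response length
    raw = input("Desired response length in tokens [1000]: ").strip()
    return max_chunk_tokens(int(raw) if raw else 1000)
```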

Asks for the length of the response and uses it to calculate the maximum chunk size (4096 - response length = max chunk).
@pacmanincarnate
Author

Alright, I have updated per my previous comment. The script should be good to go now, at least until ChatGPT gets token limit options beyond 4096.

@jckpn jckpn self-requested a review March 22, 2024 10:01
