Conversation

I realized today that you do need to be able to set the chunk size manually. GPT is limited to 4K tokens, but that limit covers input AND output, so if you fill it with input, the response will be very short. I was testing this script by rewriting a book in a different style, and it kept summarizing rather than rewriting, and this is why. This is probably worth noting in the command line instructions as well.

Hey @pacmanincarnate, good stuff, but gpt-4 has an 8k token limit. Maybe we should change the default chunk size based on the selected model? What are your thoughts on that?
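A model-dependent default could be as simple as a lookup table. This is a hypothetical sketch, not the script's actual code: the function name and the one-quarter reservation for the response are illustrative assumptions; the token limits are the published context windows for the two models mentioned above.

```python
# Hypothetical sketch: pick a default chunk size from the selected model's
# context window. Limits are the published windows for these models.
MODEL_TOKEN_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
}

def default_chunk_size(model: str) -> int:
    # Fall back to the smallest window for unknown models, and reserve
    # roughly a quarter of the window for the response by default.
    limit = MODEL_TOKEN_LIMITS.get(model, 4096)
    return limit - limit // 4
```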
Wasn't aware of this, thanks @pacmanincarnate. Will test later and merge if all goes well. Thanks a lot!

Great! I haven't had a chance to change the code to ask for input length. It shouldn't be hard to do, but I've been busy with real work. Thinking about it, we may want to ask for the response length and subtract that from the max token length, so you are dictating how long you want your response to be. That would speak more directly to the issue at hand.
Asks for the length of the response and uses that to calculate the maximum chunk size (4096 - response length = max chunk).
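The calculation described above can be sketched in a few lines. This is an illustrative sketch, not the script's actual code; the function name is an assumption, and 4096 is GPT-3.5's shared prompt-plus-completion limit.

```python
MAX_TOKENS = 4096  # GPT-3.5 context window, shared by prompt and completion

def max_chunk_size(response_length: int) -> int:
    """Whatever the response doesn't use is available for the input chunk."""
    return MAX_TOKENS - response_length
```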

Alright, I have updated per my previous comment. The script should be good to go now, at least until ChatGPT gets more token limit options beyond 4096.
Revised the script to optimize the chunking process.
Chunking now works by adding whole paragraphs up to 3500 tokens into a chunk. If a single paragraph is larger than 3500 tokens, the script will break it in half by sentence. This helps preserve the meaning and internal logic of each chunk for analysis. The output file now also separates each chunk with "+++++++++" so you can see where each chunk began. Those symbols could easily be removed in Word if necessary, but they are helpful when reviewing the output text.
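The paragraph-level chunking described above could look roughly like this. A hedged sketch, not the PR's actual code: the function names, the double-newline paragraph split, the sentence regex, and the 4-characters-per-token estimate are all illustrative assumptions.

```python
import re

CHUNK_LIMIT = 3500  # approximate tokens per chunk

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def split_paragraph(paragraph: str) -> list[str]:
    """Break an oversized paragraph roughly in half at a sentence boundary."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    mid = len(sentences) // 2
    return [" ".join(sentences[:mid]), " ".join(sentences[mid:])]

def chunk_text(text: str) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Paragraphs over the limit get split in half by sentence first.
        pieces = split_paragraph(para) if estimate_tokens(para) > CHUNK_LIMIT else [para]
        for piece in pieces:
            if current and estimate_tokens(current) + estimate_tokens(piece) > CHUNK_LIMIT:
                chunks.append(current)
                current = piece
            else:
                current = (current + "\n\n" + piece) if current else piece
    if current:
        chunks.append(current)
    return chunks
```

Writing the output would then just be a matter of joining the chunks with the `"+++++++++"` separator.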
I tried to integrate tiktoken to count tokens accurately, but I was getting bad results. If someone wants to take another stab at it, go for it. For now, I'm estimating tokens by character count. If we get tiktoken working, the chunk size can be calculated much more efficiently by basing it on max tokens minus prompt length.
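One hedged way to get the best of both approaches is to try tiktoken and fall back to the character heuristic when it isn't available or fails. A sketch under those assumptions; the function name and the 4-characters-per-token fallback are illustrative, while `tiktoken.encoding_for_model()` and `encode()` are real tiktoken calls.

```python
def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Exact token count via tiktoken if it works, else a character heuristic."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # Fallback: roughly 4 characters per token for English text.
        return max(1, len(text) // 4)
```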