I'm creating an instance of the `Runtime` class and trying to generate text using the `async_generate` method. I tested this on `gemma2`, and for a large context length the endpoint generates only a single token. By contrast, when I serve the same model with sglang's `launch_server`, it works fine. I also tested the same implementation on llama3.1 with a large context length, and it worked fine through the `Runtime` class.

So I'm wondering: is there a difference between the `Runtime` class and the `TokenizerManager` class? My setup looks roughly like the sketch below.
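A minimal sketch of the setup (illustrative only; the exact `Runtime` constructor arguments and the streaming format that `async_generate` yields may differ across sglang versions, so treat the specifics here as assumptions):

```python
import asyncio

import sglang as sgl

# Illustrative sketch, not the exact code from the report: constructor
# arguments and the chunks yielded by async_generate may vary by version.
runtime = sgl.Runtime(
    model_path="google/gemma-2-2b-it",
    context_length=8192,  # load the model with an 8k context window
)

async def main() -> None:
    # Stand-in for a long (~7k-token) prompt.
    prompt = "Summarize the following document. " + "lorem ipsum " * 3000
    sampling_params = {"max_new_tokens": 256, "temperature": 0.0}
    # async_generate streams the output; print pieces as they arrive.
    async for chunk in runtime.async_generate(prompt, sampling_params):
        print(chunk, end="", flush=True)

asyncio.run(main())
runtime.shutdown()
```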
This is an example output when loading the model with an 8k context length:

```python
{
    'id': 'bc58d0f1564d4d3a87bcf444fd087690',
    'object': 'chat.completion',
    'created': 1730626163,
    'model': 'google/gemma-2-2b-it',
    'choices': [{
        'index': 0,
        'message': {'role': 'assistant', 'content': ''},
        'logprobs': None,
        'finish_reason': 'stop',
    }],
    'usage': {'prompt_tokens': 7051, 'total_tokens': 7053, 'completion_tokens': 2},
}
```
I am a bit confused by your question. Do you mean you served gemma with both the `Runtime` class and `launch_server`, but they generated different results?

As for the difference between `TokenizerManager` and `Runtime`: `Runtime` contains a `TokenizerManager`. You can see the comments here for details.
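For reference, the `launch_server` path described in the report would be exercised roughly like this; the response shape matches the chat.completion output above, though the port, flag names, and prompt are assumptions:

```python
# Assumes a server started with something like:
#   python -m sglang.launch_server --model-path google/gemma-2-2b-it \
#       --context-length 8192 --port 30000
# (flag names and port are illustrative; check --help for your version)
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "google/gemma-2-2b-it",
        "messages": [{"role": "user", "content": "<~7k-token prompt>"}],
        "max_tokens": 256,
    },
    timeout=600,
)
print(resp.json())  # same chat.completion structure as the output above
```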