I successfully quantized the mistralai/Mistral-Nemo-Instruct-2407 model to ONNX using the following command:
python awq-quantized-model.py --model_path mistralai/Mistral-Nemo-Instruct-2407 --quant_path ./mistralai/mistral-nemo-instruct-2407-awq/ --output_path ./mistralai/mistral-nemo-instruct-2407-awq-onnx/ --execution_provider cuda
I tested the ONNX model in Python, and it works without any issues. In C#, non-streaming generation produces the complete output correctly. However, when streaming tokens one by one, only the first token is returned correctly; every subsequent iteration of the loop returns token ID 0.
using Microsoft.ML.OnnxRuntimeGenAI;

using Model model = new Model(@"mistral-nemo-instruct-awq-onnx");
using Tokenizer tokenizer = new Tokenizer(model);

var inputs = tokenizer.Encode("<s>[INST]Hello[/INST]</s>");

var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("min_length", 10);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetInputSequences(inputs);

// Non-streaming generation works and returns the full output:
//var outputSequences = model.Generate(generatorParams);
//var outputString = tokenizer.Decode(outputSequences[0]);

// Streaming: only the first token decodes correctly;
// every later iteration yields token ID 0.
using var tokenizerStream = tokenizer.CreateStream();
using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}
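A possible workaround might be to bypass TokenizerStream entirely and re-decode the growing sequence on every iteration, printing only the text appended since the previous step. Below is a minimal, untested sketch of that idea; it assumes Tokenizer.Decode accepts the spans returned by GetSequence and by the indexer on the encoded input, and that detokenization is prefix-stable (earlier text does not change as tokens are appended). It trades extra decoding work per step for not depending on per-token streaming at all.

// Workaround sketch: re-decode the full sequence each step and emit
// only the newly appended text, instead of decoding single tokens
// through TokenizerStream.
using var generator = new Generator(model, generatorParams);
string previousText = tokenizer.Decode(inputs[0]); // skip past the decoded prompt
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    string fullText = tokenizer.Decode(generator.GetSequence(0));
    Console.Write(fullText.Substring(previousText.Length));
    previousText = fullText;
}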
Any suggestions or workarounds for resolving this issue in C# would be appreciated.
Thanks.