I successfully quantized the mistralai/Mistral-Nemo-Instruct-2407 model to ONNX using the following command:
python awq-quantized-model.py --model_path mistralai/Mistral-Nemo-Instruct-2407 --quant_path ./mistralai/mistral-nemo-instruct-2407-awq/ --output_path ./mistralai/mistral-nemo-instruct-2407-awq-onnx/ --execution_provider cuda
I tested the ONNX model in Python, and it works without any issues. In C#, non-streaming generation produces the complete output correctly. However, when streaming tokens one by one, only the first token is returned correctly; every subsequent iteration of the loop returns token ID 0.
using Microsoft.ML.OnnxRuntimeGenAI;

using Model model = new Model(@"mistral-nemo-instruct-awq-onnx");
using Tokenizer tokenizer = new Tokenizer(model);

var inputs = tokenizer.Encode("<s>[INST]Hello[/INST]</s>");

var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("min_length", 10);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetInputSequences(inputs);

// Non-streaming generation works and returns the full output:
//var outputSequences = model.Generate(generatorParams);
//var outputString = tokenizer.Decode(outputSequences[0]);

// Streaming: only the first token decodes correctly;
// every later iteration yields token ID 0.
using var tokenizerStream = tokenizer.CreateStream();
using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}
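A possible workaround might be to bypass TokenizerStream entirely and re-decode the growing sequence on every iteration, printing only the text appended since the previous step. Below is a minimal, untested sketch of that idea; it assumes Tokenizer.Decode accepts the spans returned by GetSequence and by the indexer on the encoded input, and that detokenization is prefix-stable (earlier text does not change as tokens are appended). It trades extra decoding work per step for not depending on per-token streaming at all.

// Workaround sketch: re-decode the full sequence each step and emit
// only the newly appended text, instead of decoding single tokens
// through TokenizerStream.
using var generator = new Generator(model, generatorParams);
string previousText = tokenizer.Decode(inputs[0]); // skip past the decoded prompt
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    string fullText = tokenizer.Decode(generator.GetSequence(0));
    Console.Write(fullText.Substring(previousText.Length));
    previousText = fullText;
}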
Any suggestions or workarounds for resolving this issue in C# would be appreciated.
Thanks.