Could this be related to #14023?
Hello,
When running multiple inferences in a row in the same session, some inferences incur a significant extra latency cost.
How can I fix this? I'm using CUDA 11.7 and ONNX Runtime 1.15.0.
I ran a few tests, and a pattern emerges.
I also used the onnxruntime_perf_test tool:
./onnxruntime_perf_test -I -S 1 -e cuda -r 2048 -p profile.json -s /data/model/googlenet/dynamic_batch_googlenet_opt.onnx
At task IDs around 1000, 2000, 4000, 8000, and 16000, latency roughly doubles.
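To pin down where the slow runs occur outside of perf_test, one can record per-run latencies and flag outliers. Here is a minimal stdlib sketch; the threshold rule and the `run_once` timing wrapper are illustrative assumptions, not ONNX Runtime API:

```python
import time

def detect_spikes(latencies, factor=2.0):
    """Return indices whose latency exceeds `factor` times the mean
    of all preceding runs (a simple, illustrative outlier rule)."""
    spikes, total = [], 0.0
    for i, dt in enumerate(latencies):
        if i > 0 and dt > factor * (total / i):
            spikes.append(i)
        total += dt
    return spikes

def time_runs(run_once, iterations):
    """Time an arbitrary zero-argument callable, e.g. a wrapper
    around session.run(...) for the model in question."""
    latencies = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - t0)
    return latencies

# Example with synthetic latencies: run 3 is well above the baseline.
print(detect_spikes([1.0, 1.0, 1.0, 5.0, 1.0]))  # -> [3]
```

If the spike indices cluster near powers of two, that supports the reallocation hypothesis below.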
Maybe CUDA (or the runtime) reallocates memory from time to time, in larger and larger chunks, to amortize the extra cost.
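That hypothesis would explain the power-of-two spacing: an arena that grows geometrically pays a one-off reallocation cost exactly when its capacity is exhausted. A toy sketch (not ONNX Runtime code; the class and sizes are hypothetical) shows the spacing:

```python
# Hypothetical model of a memory arena with geometric (doubling) growth.
# Each time capacity is exhausted, it doubles, so an expensive reallocation
# happens at request ~1024, ~2048, ~4096, ... -- the same power-of-two
# spacing as the latency spikes observed above.

class DoublingArena:
    def __init__(self, initial_capacity=1024):
        self.capacity = initial_capacity
        self.used = 0
        self.realloc_points = []  # request indices where a costly realloc occurred

    def allocate(self, request_index, size=1):
        if self.used + size > self.capacity:
            self.capacity *= 2  # geometric growth amortizes the total cost
            self.realloc_points.append(request_index)
        self.used += size

arena = DoublingArena()
for i in range(20000):
    arena.allocate(i)

print(arena.realloc_points)  # -> [1024, 2048, 4096, 8192, 16384]
```

If this is indeed the cause, it may be worth experimenting with ONNX Runtime's CUDA provider option `arena_extend_strategy` (e.g. `kSameAsRequested` instead of the power-of-two default) to see whether the spikes move or disappear.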
Thanks in advance.