-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DML EP] Add BFC allocator #16634
base: main
Are you sure you want to change the base?
[DML EP] Add BFC allocator #16634
Conversation
…ignol/add-bfc-allocator-3
…ignol/add-bfc-allocator-3
…origin/user/pavignol/add-bfc-allocator-4
…origin/user/pavignol/add-bfc-allocator-4
onnxruntime/core/providers/dml/DmlExecutionProvider/src/BucketizedBufferAllocator.h
Show resolved
Hide resolved
|
||
namespace Dml | ||
{ | ||
class DmlBfcAllocator : public onnxruntime::IAllocator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onnxruntime/core/providers/dml/DmlExecutionProvider/src/ExecutionProvider.h
Outdated
Show resolved
Hide resolved
m_dataInterface = static_cast<IUnknown*>(m_impl->MutableDataRaw()); | ||
m_tensorData = m_impl->MutableDataRaw(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats going on here??
What happened to my Shadow Copy!??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this whole logic and this is super outdated code. What this code is trying to do is decoding what the actual pointer is (something that inherits IUnknown
or a more specific ID3D12Resource
), but we ended up always returning ID3D12Resource
objects even for the external operators. Also, this notion of "layout" doesn't seem to make any sense here since all GetShadowCopyIfRequired
does is increasing the ref count, but this isn't needed since there's no layout conversion done anywhere.
Maybe @jeffbloo can shed some light here, but from my analysis this all seems to be code that isn't needed anymore.
onnxruntime/core/providers/dml/DmlExecutionProvider/src/MLOperatorAuthorImpl.h
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/MLOperatorAuthorImpl.cpp
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/MLOperatorAuthorImpl.cpp
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/DmlCommandRecorder.cpp
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorPrivate.h
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/DmlSubAllocator.h
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/IExecutionProvider.h
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/MLOperatorAuthorImpl.h
Outdated
Show resolved
Hide resolved
onnxruntime/core/providers/dml/DmlExecutionProvider/src/ExecutionProvider.cpp
Outdated
Show resolved
Hide resolved
…origin/user/pavignol/add-bfc-allocator-4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#ifdef USE_DML | ||
const bool bothValuesOnGPU = copy_info.source_device.Type() == OrtDevice::GPU && copy_info.target_device.Type() == OrtDevice::GPU; | ||
const bool sourceIsDmlAlloc = copy_info.source_device.MemType() == OrtDevice::MemType::DEFAULT || copy_info.source_device.MemType() == OrtDevice::MemType::DML_EXTERNAL; | ||
const bool targetIsInternalAlloc = copy_info.target_device.MemType() == OrtDevice::MemType::DEFAULT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
target_is_internal_alloc
since this is ORT code 🐫🐍 rather than the DML EP 🐫🐪.
|
||
namespace Dml | ||
{ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] extra blank line
@PatriceVignola Happy 1 year anniversary on this open CR. Still relevant? 🤔 |
…origin/user/pavignol/add-bfc-allocator-4
ort_api | ||
); | ||
// Ensure resource is the same | ||
WINML_EXPECT_EQUAL(d3d12_resource, d3d12_resource_from_allocation); |
Check warning
Code scanning / PREfast
Potential comparison of a constant with another constant. Warning test
auto unique_cpu_memory_info = UniqueOrtMemoryInfo(cpu_memory_info, ort_api->ReleaseMemoryInfo); | ||
auto cpu_tensor = CreateTensorFromMemoryInfo(unique_cpu_memory_info.get()); | ||
THROW_IF_NOT_OK_MSG(winml_adapter_api->ValueGetDeviceId(cpu_tensor.get(), &device_id), ort_api); | ||
WINML_EXPECT_EQUAL(0, device_id); |
Check warning
Code scanning / PREfast
Potential comparison of a constant with another constant. Warning test
THROW_IF_NOT_OK_MSG( | ||
winml_adapter_api->SessionGetInputRequiredDeviceId(cpu_session.get(), "inputImage", &device_id), ort_api | ||
); | ||
WINML_EXPECT_EQUAL(0, device_id); |
Check warning
Code scanning / PREfast
Potential comparison of a constant with another constant. Warning test
This change adds the BFC allocator to the DML EP in order to reduce peak memory usage and allow bigger models to be loaded in memory (e.g. LLMs). Note that we still need to keep the Bucketized allocator since the WinML API allows for the caller to query D3D12Resources directly, which isn't backward compatible with the way that the BFC allocations work.
The reserved resource logic is similar to what we had in TF. We leverage the existing ORT allocator by creating a tagged pointer which can be incremented arithmetically by ORT, as if it was sequentially allocated memory. When we need to access the memory, we decode the tagged pointer in order to access its allocation id and offset. This allows us to effectively retrieve the appropriate resource and access it at the right offset.
To keep the legacy WinML APIs working, I added a way to detect whether custom ops have been registered at session creation them. I then added an ORT API that allows us to disable the BFC allocator when creating the execution provider, which allows us to use the bucketized allocator instead of the BFC one. This doesn't cause any regressions since the custom op users were already using the Bucketized buffer allocator, and it's a niche scenario anyway.