[C#] Regression on 0.5.0 with DML #1071

Open
azchohfi opened this issue Nov 19, 2024 · 11 comments

@azchohfi
Contributor

Describe the bug
C# Version 0.5.0 broke DML models, such as microsoft--Phi-3-mini-4k-instruct-onnx directml-int4-awq-block-128.
The model loads, but the Generator constructor throws an access violation exception.

To Reproduce
Steps to reproduce the behavior:

  1. Run the Phi3 sample with DML
  2. An exception is thrown at line 98

Expected behavior
Works just as it did in 0.4.0.

Desktop (please complete the following information):

  • OS: Windows 11 (24H2)
@elephantpanda

elephantpanda commented Nov 19, 2024

Well, DML didn't really work before in 0.4.0. I mean it works up to a point, then breaks.
I was just about to update to 0.5 myself. Thanks for the warning. 🥲

I took a look at the closed pull requests and didn't see anything relating to DML fixes, which is disappointing.

@elephantpanda

elephantpanda commented Nov 19, 2024

Just updated my code to 0.5.1 to try it out with C# and DirectML, using the same model as the OP.
After loading the model, it crashes. (It didn't crash with 0.4.0.)

Quadro P5000 GPU

Same line:
generator = new Generator(model, generatorParams);

=================================================================
	Native Crash Reporting
=================================================================
Got a UNKNOWN while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Managed Stacktrace:
=================================================================
	  at <unknown> <0xffffffff>
	  at Microsoft.ML.OnnxRuntimeGenAI.NativeMethods:OgaCreateGenerator <0x00097>
	  at Microsoft.ML.OnnxRuntimeGenAI.Generator:.ctor <0x0004a>
	  at Main:StartGeneration <0x00612>
	  at <Start>d__10:MoveNext <0x002ca>
	  at MoveNextRunner:InvokeMoveNext <0x00091>
	  at System.Threading.ExecutionContext:RunInternal <0x001b5>
	  at System.Threading.ExecutionContext:Run <0x0002a>
	  at MoveNextRunner:Run <0x000ca>
	  at <>c:<.cctor>b__7_0 <0x00039>
	  at WorkRequest:Invoke <0x00023>
	  at UnityEngine.UnitySynchronizationContext:Exec <0x0018a>
	  at UnityEngine.UnitySynchronizationContext:ExecuteTasks <0x0007a>
	  at System.Object:runtime_invoke_void <0x0007c>
=================================================================
Received signal SIGSEGV
Crash!!!

@skyline75489
Contributor

@RyanUnderhill This is the one we caught with the validation pipeline. I thought it was the same error, but it turns out it wasn't. This crash is the reason why there's no log message printed. I can reproduce this locally.

@jiaxuwu2021

jiaxuwu2021 commented Nov 19, 2024

We have been getting a very similar error/exception since v0.4.0 of the Microsoft.ML.OnnxRuntimeGenAI.DirectML NuGet package:
2024-11-19T13:58:49.3749334+08:00 An unhandled exception has occurred while executing the request. error: [Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(353)\onnxruntime.dll!00007FFCC8BA254C: (caller: 00007FFCC8B9B7CF) Exception(1) tid(240c) 80070057 The parameter is incorrect. , at Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits()

It also fails to run with v0.5.0 or v0.5.1, but everything works if we downgrade Microsoft.ML.OnnxRuntimeGenAI.DirectML to v0.3.0.


EDIT: It seems v0.5.0 works well too, but why are only v0.4.0 and v0.5.1 broken?

@skyline75489
Contributor

@elephantpanda Is the crash in 0.5.1 or 0.5.0? We had some progress but we might need more to fix.

@azchohfi
Contributor Author

For our scenario, downgrading from 0.5.1 to 0.5.0 fixed the issue, so @elephantpanda is probably having a separate issue.

@elephantpanda

elephantpanda commented Nov 19, 2024

> @elephantpanda Is the crash in 0.5.1 or 0.5.0? We had some progress but we might need more to fix.

I am using 0.5.1 (I have never tried 0.5.0)

Presumably it's the same issue, since it crashes on the same line.

BTW, I just tried this in CPU mode and it works fine, so it only crashes in DML mode.

@skyline75489
Contributor

@elephantpanda You could try 0.5.0 first. We're preparing a 0.5.2 patch release that should fix the crash.

@elephantpanda

elephantpanda commented Nov 22, 2024

I installed version 0.5.0 of the OnnxRuntimeGenAI.DirectML and OnnxRuntimeGenAI.Managed packages, keeping the other libraries the same.

It no longer crashes. It outputs the first token, then fails on the second token:

OnnxRuntimeGenAIException: Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\framework\execution_frame.cc:173 onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue shape && tensor.Shape() == *shape was false. OrtValue shape verification failed. Current shape:{1,32,12,96} Requested shape:{1,32,2048,96}

I think I'll just wait for the patch.
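
For anyone comparing notes on the shape verification error above, here is a minimal sketch of a standalone Python helper (not part of the C# application) that pulls the two shapes out of the message and reports which axis differs. The function names `parse_shapes` and `differing_axes` are made up for illustration. Reading axis 2 as the sequence-length axis of a key/value cache tensor laid out as [batch, heads, sequence, head_size] is an assumption; the thread does not say which tensor failed the check.

```python
import re

# Error text copied from the report above, trimmed to the part that carries the shapes.
MESSAGE = (
    "OrtValue shape verification failed. "
    "Current shape:{1,32,12,96} Requested shape:{1,32,2048,96}"
)


def parse_shapes(message):
    """Return (current, requested) shapes as lists of ints (hypothetical helper)."""
    found = dict(re.findall(r"(Current|Requested) shape:\{([\d,]+)\}", message))
    current = [int(d) for d in found["Current"].split(",")]
    requested = [int(d) for d in found["Requested"].split(",")]
    return current, requested


def differing_axes(current, requested):
    """Return (axis, current_dim, requested_dim) for every axis that differs."""
    return [
        (axis, c, r)
        for axis, (c, r) in enumerate(zip(current, requested))
        if c != r
    ]


current, requested = parse_shapes(MESSAGE)
mismatches = differing_axes(current, requested)
for axis, c, r in mismatches:
    # Assumption: axis 2 is the sequence-length axis of a KV-cache-style tensor.
    print(f"axis {axis}: current={c}, requested={r}")
```

Only axis 2 differs (12 versus 2048). If the layout assumption holds, that would point at a mismatch between the length of the buffer that was actually allocated and the length the graph requested, but that reading is a guess rather than something the maintainers have confirmed here.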
