Dynamically Serving Inference Model Adapters with ORT #21406
-
That's not possible with ORT, as those methods only take effect when the session is initialized. You could potentially add a single empty LoRA to an ONNX UNet model and then supply the weights for it as an initializer, but you'd need to destroy the session and recreate it every time you wanted to change the LoRA, and you'd also need a different ONNX model for LoRAs of different sizes or ones placed on different layers. I'm not sure how to create the ONNX model with the attached LoRA automatically, though it's definitely possible if you edit the ONNX protobuf (but it would be nasty to do by hand). The tests for |
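To make the recreate-the-session pattern concrete, here is a minimal Java sketch. It assumes the UNet graph was exported with LoRA delta weights as initializers named `lora_down` and `lora_up` (hypothetical names), and that your ORT release's Java binding exposes `addExternalInitializers` on `OrtSession.SessionOptions` (mirroring the C++ `AddExternalInitializers`); check the Javadoc for your version, including whether the tensors may be closed once the session is created:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OnnxTensorLike;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.nio.FloatBuffer;
import java.util.HashMap;
import java.util.Map;

/** Swaps LoRA weights by tearing down and rebuilding the session. */
public final class LoraSwapper implements AutoCloseable {
  private final OrtEnvironment env = OrtEnvironment.getEnvironment();
  private final String modelPath; // UNet exported with an empty LoRA attached
  private OrtSession session;

  public LoraSwapper(String modelPath) {
    this.modelPath = modelPath;
  }

  /** Rebuilds the session with new LoRA weights; initializers only apply at creation time. */
  public void swapAdapter(float[] down, long[] downShape, float[] up, long[] upShape)
      throws OrtException {
    if (session != null) {
      session.close(); // the old session must be destroyed before new weights can take effect
    }
    try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        OnnxTensor downT = OnnxTensor.createTensor(env, FloatBuffer.wrap(down), downShape);
        OnnxTensor upT = OnnxTensor.createTensor(env, FloatBuffer.wrap(up), upShape)) {
      Map<String, OnnxTensorLike> inits = new HashMap<>();
      inits.put("lora_down", downT); // hypothetical initializer names baked into the graph
      inits.put("lora_up", upT);
      opts.addExternalInitializers(inits); // assumed Java name; verify against your Javadoc
      session = env.createSession(modelPath, opts);
    }
  }

  public OrtSession session() {
    return session;
  }

  @Override
  public void close() throws OrtException {
    if (session != null) {
      session.close();
    }
  }
}
```

Note that every adapter switch then pays a full session rebuild (including graph optimization), so this only makes sense when switches are infrequent.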
-
Hi Guys,
I'm trying to build an inference server in ORT (Java) and want to add dynamic model adapter serving capabilities as in this HF blog post here. Specifically, I want to be able to dynamically "enable" and "disable" preloaded LoRAs and DoRAs for a UNet. From what I've seen so far, I should be doing this with methods like `AddExternalInitializers()`, `AddExternalInitializersFromFilesInMemory()`, and `AddInitializer()`. The thing is that I have no idea how to use these methods. Can anybody point me to code examples, ideally in Java? Thanks!