Different behaviour when extending this project to Bart #7
Comments
thank you! @tobigue was able to export |
Hey, I also came across this when trying to adapt the fastT5 code for converting (m)BART to ONNX, and I think it is due to the fact that the traced decoder forward pass does not actually end up using the encoder_hidden_states. So my guess was that it is not included in the exported graph for that reason. |
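For anyone hitting the same thing, a quick sanity check (the model path below is just a placeholder) is to list which inputs actually ended up in the exported graph:

    import onnx

    # inputs that were pruned during tracing simply won't show up here
    decoder_model = onnx.load("bart_decoder.onnx")  # placeholder path
    print([inp.name for inp in decoder_model.graph.input])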
Ah that's really good to know. Thought I was going mad for a bit! Please do keep us updated! |
@HenryDashwood this is my current progress when converting mBART to ONNX: currently the ONNX model gives a (slightly) different output than the PyTorch model for the same input, and I'm not sure why. Also, I had to use the external data format (the mBART export failed without it, because apparently the exported model is >2GB for some reason, even though T5-large fits into <2GB with fastT5), so quantizing the model is not straightforward, as ONNXQuantizer does not seem to handle the external data format out of the box.
If you have any insights on the differing outputs, or how to make the quantization work, or if you manage to export and quantize a BART model in another way, sharing is much appreciated. :) Cheers! |
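Regarding the differing outputs, a quick way to quantify the mismatch between the PyTorch and ONNX runs (purely illustrative; pt_logits and onnx_logits stand for the two output arrays being compared):

    import numpy as np

    def report_diff(pt_logits, onnx_logits, atol=1e-4):
        # largest absolute deviation and whether the outputs agree within tolerance
        a = np.asarray(pt_logits, dtype=np.float32)
        b = np.asarray(onnx_logits, dtype=np.float32)
        print("max abs diff:", float(np.max(np.abs(a - b))))
        print("allclose:", bool(np.allclose(a, b, atol=atol)))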
The reason the init decoder was so large was that it exported the token embeddings twice. With constant folding disabled (do_constant_folding=False) the duplication goes away. |
I got the quantization for the exported model to work, but only using |
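For context, the standard dynamic-quantization call looks roughly like this (whether it works for a >2GB external-data model depends on the onnxruntime version; treat this as a sketch rather than the exact call used above, and the paths are placeholders):

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # 8-bit dynamic quantization of an exported decoder
    quantize_dynamic(
        "mbart_decoder.onnx",
        "mbart_decoder_quantized.onnx",
        weight_type=QuantType.QUInt8,
    )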
cool! |
Constant folding replaces some of the operations that have all-constant inputs; it's not clear why it creates the embedding twice in BART. In T5 I did not face any issue with using it. |
It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196 If you also export the init decoder with constant folding and use the external data format (or inspect the model with netron), we might see the same behaviour. |
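One way to spot the duplicated embedding in an exported model (a sketch, the path is a placeholder) is to list the largest initializers; a duplicated token-embedding matrix shows up as two near-identically sized entries at the top:

    import onnx
    from onnx import numpy_helper

    m = onnx.load("init_decoder.onnx")  # placeholder path
    sizes = sorted(
        ((numpy_helper.to_array(t).nbytes, t.name) for t in m.graph.initializer),
        reverse=True,
    )
    for nbytes, name in sizes[:5]:
        print(f"{name}: {nbytes / 1e6:.1f} MB")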
Also, I noticed that in the notebook:

    input_names = [x.name for x in self.decoder.get_inputs()]
    inputs = [
        input_ids.cpu().numpy(),
        attention_mask.cpu().numpy(),
    ] + [tensor.cpu().numpy() for tensor in flat_past_key_values]
    decoder_inputs = dict(zip(input_names, inputs))
    decoder_outputs = self.decoder.run(None, decoder_inputs)

In the above code you are using two loops plus a zip to assemble the inputs. You could instead name the past key value inputs directly, e.g.:

    past_key_values = {
        f"pkv_{i}": pkv.cpu().numpy() for i, pkv in enumerate(flat_past_key_values)
    }
    decoder_outputs = self.decoder.run(None, {**decoder_inputs, **past_key_values})

I've not tested how much effect it might have. |
You're right, I did not notice it before, but after testing I got these results:
just a small difference, and for smaller models it's negligible :) |
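As an aside, a simple way such micro-differences can be measured (purely illustrative; run_with_zip and run_with_named_inputs are hypothetical callables wrapping the two ways of building the decoder inputs discussed above):

    import time

    def benchmark(fn, repeats=100):
        # average wall-clock time of a callable over a number of repeats
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
        return (time.perf_counter() - start) / repeats

    # print(f"zip-based inputs: {benchmark(run_with_zip) * 1000:.2f} ms")
    # print(f"named pkv inputs: {benchmark(run_with_named_inputs) * 1000:.2f} ms")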
Yeah, the English models have quite a small vocabulary compared to multilingual models like mT5 and mBART. You are right about the loops + zip; I would suspect that this should be quite fast though, because we are only iterating over ~50 items. When everything works as expected, maybe just naming the inputs from 1 to X is the easier and faster way. 👍 |
Nice work! Interestingly, I have been able to quantize BART with onnxruntime 1.7.0 on my Mac without setting the external data format. |
Interesting that you don't get that error. I think optimization works in your case because the vocabulary of BART is much smaller than mBART's, so duplicating the embeddings will probably not result in a model >2GB. |
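A rough back-of-envelope check of that claim (assuming the vocabulary sizes of facebook/bart-large and facebook/mbart-large-cc25, hidden size 1024, and fp32 weights):

    # embedding matrix size = vocab_size * hidden_size * 4 bytes (fp32)
    bart = 50265 * 1024 * 4      # ~0.21 GB for BART-large
    mbart = 250027 * 1024 * 4    # ~1.02 GB for mBART-large
    print(f"BART embedding:  {bart / 1e9:.2f} GB")
    print(f"mBART embedding: {mbart / 1e9:.2f} GB, duplicated: {2 * mbart / 1e9:.2f} GB")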
Yeah, I suspect that's right. I haven't had to use external_data_format or anything like that. How are you doing in terms of speed? I'm not really seeing any speedups in my code, which is probably a bug I've introduced, because when I run your notebook I do. Here's mine: https://colab.research.google.com/drive/1e3b9_a6UvNjJWemqubYA7YKxC7vYrvAX?usp=sharing |
@HenryDashwood Hi, I'm trying to export the bart-large-cnn model for summarization to ONNX and tried to use your Colab, but I'm stuck with an error on the export step. Did you face it, and how did you fix it? |
Very odd error that I also get sometimes and my only answer at this point is... have you tried running the cell again? That usually sorts it |
@HenryDashwood thanks, I was able to run it, but I have a question. Is there a difference between running the model through the generate function and calling ort_session.run() directly? |
If I understand your question correctly, we do both. ort_session.run() gets called inside generate to get predictions from the model. But we need more than just a single set of predictions to do something like beam search, so generate() takes care of that as well. |
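A minimal greedy-decoding sketch of that relationship (file and input names are placeholders and there is no past-key-value caching here; the notebook's exported graphs may differ). The point is that generation is a loop of decoder runs, and generate() additionally handles beam search, length limits, and so on:

    import numpy as np
    import onnxruntime as ort

    encoder = ort.InferenceSession("bart_encoder.onnx")       # placeholder paths
    decoder = ort.InferenceSession("bart_init_decoder.onnx")

    def greedy_generate(input_ids, attention_mask, decoder_start_id, eos_token_id, max_len=64):
        # one encoder pass, then repeated decoder passes until EOS or max length
        encoder_hidden_states = encoder.run(
            None, {"input_ids": input_ids, "attention_mask": attention_mask}
        )[0]
        decoder_ids = np.array([[decoder_start_id]], dtype=np.int64)
        for _ in range(max_len):
            logits = decoder.run(None, {
                "input_ids": decoder_ids,
                "encoder_attention_mask": attention_mask,
                "encoder_hidden_states": encoder_hidden_states,
            })[0]
            next_id = int(logits[0, -1].argmax())
            decoder_ids = np.append(decoder_ids, [[next_id]], axis=1)
            if next_id == eos_token_id:
                break
        return decoder_ids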
@HenryDashwood great work, I have been struggling with BART summarization to ONNX & this gives me a ray of hope... [My earlier ONNX attempts never gave final inference summaries similar to the non-ONNX versions; I am pretty sure I made some errors refactoring...] I downloaded the Colab & am running it locally as-is without any modifications, and it's throwing the error stated below. Curious if there are any particular versions of libraries I should be on? RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1614378098133/work/torch/csrc/jit/passes/onnx/shape_type_inference.cpp":676, please report a bug to PyTorch. On a side note, could something similar be achieved with a TorchScript (JIT) export? I tried that as well &, after getting the model export & Triton Server working, I am failing at inference... triton-inference-server/server#2846 |
Resolved " RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED " by upgrading to torch 1.8.1 |
@HenryDashwood et al am I reading it right:
|
Hi @anshoomehra, sorry it took a while to get back. I haven't done much ONNX work in the last couple of weeks so I'm not really sure why I didn't get a speedup. I'll get back to you if I work it out though! |
@HenryDashwood appreciate the acknowledgment! Yes, please do continue (if time permits); yours is the only implementation I have seen working without losing the final output quality, and if ONNX brings down the inference time it will be a significant value addition. I am dedicating the next few days to it & if I have any breakthroughs I shall share back... Thanks much!! |
@HenryDashwood I have been able to make some progress with ONNX; however, shifting to GPUs seems challenging. I have got ONNX Runtime to engage the GPU using provider=['CUDAExecutionProvider'] (this step itself was an issue because of conflicting libraries; we had to install onnxruntime-gpu and uninstall onnx and onnxruntime to make it work). Now it seems that, the way the graph is exported, some variables are still on the CPU, causing the error below. I tried moving the model and tokenizer to the cuda device, along with running the exported version with CUDAExecutionProvider, but it still fails. Would you have any immediate thoughts on what code fix we may need to resolve this? |
Did you try explicitly setting the device of token["input_ids"] and token["attention_mask"]? That's the only thing I can think of. You might be interested in following this work as it makes progress huggingface/transformers#11786 |
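For what it's worth, a minimal sketch of the GPU setup under discussion (the model path and input names are placeholders). ONNX Runtime itself consumes numpy arrays, so the PyTorch-side tensors need an explicit .cpu() before being fed in, regardless of the execution provider:

    import onnxruntime as ort
    from transformers import AutoTokenizer

    # requires the onnxruntime-gpu package
    session = ort.InferenceSession(
        "bart_encoder.onnx", providers=["CUDAExecutionProvider"]
    )
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

    tokens = tokenizer("some input text", return_tensors="pt")
    ort_inputs = {
        "input_ids": tokens["input_ids"].cpu().numpy(),
        "attention_mask": tokens["attention_mask"].cpu().numpy(),
    }
    outputs = session.run(None, ort_inputs)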
@HenryDashwood @tobigue I am looking at your implementation of mbart on onnx and wanted to see if you had any feedback on how to make this work for mBart-50 many to many: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt Specifically, in the generate method, it passes forced_bos_token_id as first generated token to translate into target language. Could you provide any pointers where this parameter can be passed in your implementation: https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing#scrollTo=z5j0jM4CRJHj |
Presumably it would be an extra argument passed to the decoder, and would be added to the list of decoder inputs? |
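For reference, this is how forced_bos_token_id is used with the stock Hugging Face model; a hand-rolled decoding loop over an exported ONNX decoder would have to replicate the same behaviour by forcing the first generated token to be the target-language code (a sketch, not the notebook's exact API):

    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    model_name = "facebook/mbart-large-50-many-to-many-mmt"
    tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
    model = MBartForConditionalGeneration.from_pretrained(model_name)

    inputs = tokenizer("Hello world", return_tensors="pt")
    generated = model.generate(
        **inputs,
        # the first decoded token is forced to be the target language code
        forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"],
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))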
Hello @anshoomehra did you manage to solve the issues with the GPU shifting? |
@HenryDashwood @Ki6an when replicating your Colab notebook I am facing the following error, which seems similar to this issue: #18. The library versions I am using are:
|
@sidsharma72 we had the same issue with T5 and were able to resolve it in this commit; make sure to add the |
Thanks for the response - I have been able to fix the issue. |
Hey @HenryDashwood @tobigue, I was able to use the Colab for the model lidiya/bart-large-xsum-samsum. ONNX was considerably better on CPU, i.e. ~4 s vs. ~11 s for PyTorch, but on GPU ONNX was taking ~5 s whereas PyTorch was ~500 ms. |
Hey @mohanvamsibu-kore, I don't have experience with ONNX on GPU, but from what I understand the most efficient way to run things on GPU is by using TensorRT. They have an example for T5, which is pretty similar to BART, here: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace |
@tobigue @HenryDashwood @Ki6an Thank you very much for your work, very helpful! I can now convert a BART model to ONNX, and the outputs of the two are consistent. But I would like to ask whether you have tried to deploy the ONNX model to TensorRT. So far I have been able to run on TensorRT, but the TensorRT results are not consistent with the ONNX model :( |
Hello there. This is a really fantastic project. I'm trying to extend your work to BART but I've run into some strange behaviour.
I've made a Colab notebook to illustrate the problem. Specifically, when converting BART to ONNX, the encoder_hidden_states input does not get included in the ONNX model's graph. As you can see from the notebook though, it works perfectly for T5. I realise this is out of scope for the fastT5 project, but thought someone who comes across this issue might have experienced a similar problem and be able to help. It may also be useful to know in case you have plans to expand this project to include models like BART in the future.
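A stripped-down sketch of the kind of decoder export involved (simplified relative to the notebook; the wrapper name, sample shapes, and opset are illustrative). The gist of the explanation in the comments above is that tracing only records inputs the forward pass actually uses, so a with-past decoder variant whose cross-attention key/values come from the cache can end up without an encoder_hidden_states input; the plain wrapper below does consume it and therefore exports it:

    import torch
    from transformers import BartForConditionalGeneration

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()

    class DecoderWithLMHead(torch.nn.Module):
        # minimal wrapper: decoder + LM head, no past-key-value caching
        def __init__(self, decoder, lm_head):
            super().__init__()
            self.decoder = decoder
            self.lm_head = lm_head

        def forward(self, input_ids, encoder_hidden_states, encoder_attention_mask):
            out = self.decoder(
                input_ids=input_ids,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
            )
            return self.lm_head(out.last_hidden_state)

    wrapper = DecoderWithLMHead(model.model.decoder, model.lm_head)
    input_ids = torch.ones((1, 4), dtype=torch.long)
    encoder_hidden_states = torch.randn(1, 8, model.config.d_model)
    encoder_attention_mask = torch.ones((1, 8), dtype=torch.long)

    torch.onnx.export(
        wrapper,
        (input_ids, encoder_hidden_states, encoder_attention_mask),
        "bart_decoder.onnx",
        input_names=["input_ids", "encoder_hidden_states", "encoder_attention_mask"],
        output_names=["logits"],
        do_constant_folding=False,
        opset_version=12,
    )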