Performance should be measured and improved #8
Time to tokenize four sentences 10,000 times. The Scala measurement looped only 1,000 times, and the result was multiplied by 10. Scala used the j4rs interface, which is still subject to improvement.
Code for Python is here and code for Scala is here. FYI @MihaiSurdeanu
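For context, the shape of the Python side of such a benchmark can be sketched as follows. This is a toy reconstruction, not the linked code: `tokenize` is a stand-in for whatever tokenizer the library under test provides, and the sentences are invented.

```python
import time

def tokenize(text):
    # Stand-in tokenizer: the real benchmark calls the library under test.
    return text.split()

SENTENCES = [
    "This is a short sentence.",
    "Here is another one.",
    "Tokenization speed matters downstream.",
    "A somewhat longer sentence makes the workload less uniform.",
]

start = time.perf_counter()
for _ in range(10_000):
    for sentence in SENTENCES:
        tokenize(sentence)
elapsed = time.perf_counter() - start
print(f"Tokenized {len(SENTENCES)} sentences 10,000 times in {elapsed:.2f}s")
```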
Thanks! This is incredibly bad :)
I believe this particular thin Rust wrapper serializes everything to JSON text and then deserializes it, and that includes converting the individual ints of an array to text, then to Integer, then to int, etc. I decided to try a straight JNI version. However, I think it would be useful to try out the interface already (as soon as I can publish it) while waiting for a faster version, because there is so much else downstream that needs to be tried out and might not work.
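The cost of that round trip can be illustrated in miniature: serializing an int array to JSON text and parsing it back does far more work than handing the array over directly. This is a toy Python sketch of the general overhead, not the actual j4rs internals:

```python
import json
import time

# Pretend these are token ids crossing the language boundary.
token_ids = list(range(1_000))

# Direct pass: the array is handed over as-is,
# roughly what a straight JNI call can do.
start = time.perf_counter()
for _ in range(1_000):
    _ = token_ids
direct = time.perf_counter() - start

# JSON round trip: every int becomes text and is parsed back, per call.
start = time.perf_counter()
for _ in range(1_000):
    _ = json.loads(json.dumps(token_ids))
roundtrip = time.perf_counter() - start

print(f"direct: {direct:.4f}s, json round-trip: {roundtrip:.4f}s")
```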
Agreed on both points!
A straight JNI version is faster than j4rs, but it looks like the key is to use the release build rather than the debug build. In C programs the difference is usually fairly minimal, around 2x, but here for Rust the speedup is about 16x. It is now on par with Python; the remaining gap (43 vs. 45) is within run-to-run variation.
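The debug-vs-release gap comes down to optimization level. For reference, these are Cargo's default profile settings, written out explicitly; `cargo build` uses `[profile.dev]` and `cargo build --release` uses `[profile.release]`:

```toml
# Cargo defaults, shown explicitly for illustration.
[profile.dev]
opt-level = 0   # no optimization: fast compiles, slow code

[profile.release]
opt-level = 3   # full optimization
```

Tight tokenization loops are exactly the kind of code where opt-level 0 vs. 3 can account for a double-digit slowdown, so benchmarks should always be run against the release artifact.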
Awesome!! However, the multi-threaded version is not showing the expected speedup. Do you think JNI has some syncs in there that we are not aware of?
For the release version above, I was still multiplying by 10, and perhaps within those 2.1 seconds there wasn't enough room for the parallelism to show, or some of my processors were busy with other work. Here are some more measurements that show a 5x speedup. The "by sentence" parallelism isn't the best test, because one of the four sentences is much longer than the others and the remaining threads have to wait for it to finish; the parallelism is also applied in an inner loop, so its overhead is incurred 10,000 times. The "by document" parallelism should hopefully approach the number of processors. 5 seems close enough to 8 that I don't think something I've done is getting in the way.
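The granularity point can be sketched with a toy model (Python here for brevity, though the real code is Rust/Scala; this models the scheduling structure only, since Python threads don't speed up CPU-bound work). Parallelizing per sentence makes every iteration wait for the slowest sentence and re-pays dispatch overhead in the hot loop, while parallelizing per document splits the outer loop itself across workers:

```python
from concurrent.futures import ThreadPoolExecutor

SENTENCES = ["Short one.", "Another.", "Tiny.",
             "One very long sentence " * 50]

def tokenize(text):
    return text.split()

# "By sentence": a parallel map inside the hot loop. Each iteration
# blocks on the slowest sentence, and submission overhead repeats
# once per iteration.
def by_sentence(iterations, pool):
    for _ in range(iterations):
        list(pool.map(tokenize, SENTENCES))

# "By document": split the iterations themselves across workers, so
# each worker runs an independent chunk of the outer loop and the
# dispatch overhead is paid only once per worker.
def by_document(iterations, pool, workers=4):
    chunk = iterations // workers
    def run_chunk(_):
        for _ in range(chunk):
            for s in SENTENCES:
                tokenize(s)
    list(pool.map(run_chunk, range(workers)))

with ThreadPoolExecutor(max_workers=4) as pool:
    by_sentence(100, pool)
    by_document(100, pool)
```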
Nice!