EntityDomain.add_entity() slow #119
Comments
Yes, you are right, calling add_entity for a lot of entities is expensive. I think it's possible to avoid calling add_entity one by one. I will improve the code soon.
- add validator for EntityDomain initialization - and avoid add_entity()
@miroli I updated the process for loading entity domains and tested the create_datapackage function against a dataset with 1,000,000 entities. It can create the datapackage in 12 minutes. Could you test the master branch against your dataset? If it's not convenient for you to install from source, I will make a release for you.
That's great news! If you could make a release, that would be even better, as installing from source is tricky with our current setup.
OK, v1.0.6 is ready, please give it a try.
It's much better now, thanks!
We've run into some performance issues when running ddf_utils.package.create_datapackage(). We have some files with hundreds of thousands of entities, and running this function takes a very long time in those cases.
After some profiling, it turns out that the culprit is EntityDomain.add_entity() in ddf_utils.model.ddf, which, as I understand it, loops through all rows in entity files and runs some identity checks. Would it be possible to vectorize that loop?
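To illustrate what "vectorizing" such a loop could look like, here is a rough sketch in pandas. It is not the ddf_utils implementation: the function names, the `geo` column, and the assumption that the identity check amounts to finding duplicate entity IDs are all hypothetical, chosen only to contrast a row-by-row check with a single column-wide operation.

```python
import pandas as pd


def find_duplicates_loop(df, id_col):
    """Row-by-row check: one Python-level iteration per entity (slow for large files)."""
    seen = set()
    duplicates = []
    for _, row in df.iterrows():
        entity_id = row[id_col]
        if entity_id in seen:
            duplicates.append(entity_id)
        seen.add(entity_id)
    return duplicates


def find_duplicates_vectorized(df, id_col):
    """Same check as one column-wide operation using Series.duplicated."""
    # keep="first" marks every occurrence after the first as a duplicate
    mask = df[id_col].duplicated(keep="first")
    return df.loc[mask, id_col].tolist()


# Hypothetical entity table; "geo" is an assumed ID column name.
entities = pd.DataFrame(
    {
        "geo": ["swe", "nor", "swe", "fin", "nor", "swe"],
        "name": ["Sweden", "Norway", "Sweden", "Finland", "Norway", "Sweden"],
    }
)

loop_result = find_duplicates_loop(entities, "geo")
vectorized_result = find_duplicates_vectorized(entities, "geo")
```

Both functions report the same repeated IDs in the same order, but the second avoids the per-row Python overhead of iterrows(), which is usually where the time goes with hundreds of thousands of rows.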