Optimize provider's is_backtrack_cause by caching resolved names #10621
is_backtrack_cause is called for every requirement on every step of resolution. As currently implemented, if an identifier is not a backtrack cause, the entire list is scanned, invoking the .name property on every element of the list (and possibly on its parent). Since all the provider needs to know is whether the identifier is present in the backtrack causes list, stash the resolved names in a set on first invocation and check membership in that set.
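A minimal sketch of the idea described above. The `Named` and `Cause` classes and both helper functions are simplified, hypothetical stand-ins for illustration, not pip's actual code. It contrasts a linear scan on every call with a one-time conversion to a set of names.

```python
from typing import NamedTuple, Optional, Sequence, Set


class Named(NamedTuple):
    """Hypothetical stand-in for an object exposing a .name property."""
    name: str


class Cause(NamedTuple):
    """Hypothetical stand-in for one backtrack cause entry."""
    requirement: Named
    parent: Optional[Named]


def is_backtrack_cause_scan(identifier: str, causes: Sequence[Cause]) -> bool:
    # Linear scan: touches .name on every element when the identifier is absent.
    for cause in causes:
        if identifier == cause.requirement.name:
            return True
        if cause.parent is not None and identifier == cause.parent.name:
            return True
    return False


def causes_to_names(causes: Sequence[Cause]) -> Set[str]:
    # One pass to resolve all names; later lookups are set membership checks.
    names: Set[str] = set()
    for cause in causes:
        names.add(cause.requirement.name)
        if cause.parent is not None:
            names.add(cause.parent.name)
    return names


causes = [Cause(Named("a"), None), Cause(Named("b"), Named("c"))]
names = causes_to_names(causes)
```

With the set in hand, `identifier in names` gives the same answer as the scan, without revisiting every element on each call.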
Branch updated from 86123ab to 9ef97f2.
Please provide a reproducer for the test case.
The dreaded "it's on a private package repository", but isn't
Ok, on my machine main: this branch - is backtrack cause is pruned from the svg, so it's easier to look at
I think this makes sense at a high level. I don't quite like the fact we're caching to a module-level global variable though, it could be bad for unit testing. It's probably better to tie the cache's lifetime to the provider instance.
BTW does anyone know why the Azure checks are showing up?
Yeah, I also don't like where it is. I'll move it into the provider. There's something about the division of labor between resolvelib and the provider that doesn't quite sit right with me, but I don't know exactly how to split it.
One way I can see for resolvelib to handle this is to make
When I wrote this I knew it was inefficient, but in the requirements I tested it didn't have a noticeable impact, so I didn't prematurely optimize. In my continued experimentation on improving backtracking I have moved this step to resolvelib and pass
And then I pass the resulting set as
def _get_causes_set(backtrack_causes: Sequence["PreferenceInformation"]) -> Set[str]:
    key = id(backtrack_causes)
I think the id being calculated here is the id of a list, but that list is mutated in place:
pip/src/pip/_vendor/resolvelib/resolvers.py
Line 381 in 9ef97f2
self.state.backtrack_causes[:] = causes
So I think this assumption is incorrect: the backtrack causes in the list could change, but your cache would not reflect that.
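The concern above can be reproduced in isolation. This is a small sketch, with plain strings standing in for cause objects and a hypothetical module-level cache keyed on `id()` of the list; it is not the code from the PR. Slice assignment replaces the contents of a list without creating a new list object, so the key does not change.

```python
from typing import Dict, List, Set

# Hypothetical cache keyed on the identity of the list object.
_cache: Dict[int, Set[str]] = {}


def names_cached_by_list_id(causes: List[str]) -> Set[str]:
    key = id(causes)
    if key not in _cache:
        _cache[key] = set(causes)
    return _cache[key]


causes = ["a", "b"]
id_before = id(causes)
first = names_cached_by_list_id(causes)

# In-place update, in the same style as the resolvelib line quoted above.
causes[:] = ["c"]
id_after = id(causes)

# Same list object, same key, so the cached (now stale) set is returned.
second = names_cached_by_list_id(causes)
```

Here `second` still contains the old names and does not contain `"c"`, which is the staleness the review comment describes.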
Co-authored-by: Tzu-ping Chung <[email protected]>
Is there a branch for the resolvelib change? I looked at making a change there, but for some reason I thought that change would be too specific to pip?
to inside the provider. Use the id of every element in the sequence as part of the key as a guard against the sequence mutating.
From a design point of view I think we need to ask @uranusjr , it seems to me that
From my point of view this set should be calculated once after each backtrack, making a cache on pip's side is error prone, and only resolvelib knows when a backtrack happened. It's still up to pip, or other downstream libraries, to decide what to do with this information.
Side note: while we're doing we could remove
As for a branch, no not right now, I am experimenting by modifying a copy of pip main and my changes include a lot more than this (I am testing possible backjumping approaches). But if you would rather I push a PR to resolvelib and pip to fix this (you touched it you own it, aha) then I am happy to do so.
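The "calculated once after each backtrack" idea can be sketched from the resolver's side. Everything here is hypothetical: `ToyResolverState`, `record_backtrack`, and the tuple-based `Cause` are illustrative names, not resolvelib's API. The point is only that the component which knows when a backtrack happened rebuilds the name set at that moment, so no downstream cache is needed.

```python
from typing import FrozenSet, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a cause: (requirement name, parent name or None).
Cause = Tuple[str, Optional[str]]


class ToyResolverState:
    """Toy resolver state that refreshes the name set on every backtrack."""

    def __init__(self) -> None:
        self.backtrack_causes: List[Cause] = []
        self.backtrack_cause_names: FrozenSet[str] = frozenset()

    def record_backtrack(self, causes: Sequence[Cause]) -> None:
        # Update the list in place and recompute the names exactly once.
        self.backtrack_causes[:] = causes
        self.backtrack_cause_names = frozenset(
            name for cause in causes for name in cause if name is not None
        )


state = ToyResolverState()
state.record_backtrack([("a", None), ("b", "c")])
names_after_first = state.backtrack_cause_names

state.record_backtrack([("d", "a")])
names_after_second = state.backtrack_cause_names
```

A downstream provider would then only need a membership check against the set it is handed.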
If you wouldn't mind cutting the resolvelib PR with the minimum change to make backtrack causes a set?
Done: sarugaku/resolvelib#92
Created a new, what I now guess is a competing, draft PR here: sarugaku/resolvelib#93
I personally don't like the approach of creating a cache that depends on object ids from the internal structure of the backtrack_causes object. I don't have a strong opinion about whether it's done in pip or resolvelib, I just don't see how to implement it simply on pip's side.
On a side note, how did you come across this issue? I've done some benchmarking on
That's updated so that the cache key is now a tuple of all the ids of the objects in the causes iterable. Which, I agree with you, I'm still not crazy about. But in order for it to break, I think resolvelib would need to:
I think I'm sufficiently convinced that that's not possible, but I understand if it still feels too "dirty" to some.
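A rough sketch of the updated approach as described in this thread: the cache lives on the provider instance and is keyed on a tuple of the ids of the individual cause objects. `SketchProvider`, `Cause`, and the `rebuilds` counter are hypothetical names for illustration, not the code in the PR.

```python
from typing import NamedTuple, Optional, Sequence, Set, Tuple


class Cause(NamedTuple):
    """Hypothetical stand-in for one backtrack cause entry."""
    name: str
    parent_name: Optional[str] = None


class SketchProvider:
    def __init__(self) -> None:
        # Cache state is tied to this instance, not to a module-level global.
        self._causes_key: Optional[Tuple[int, ...]] = None
        self._causes_names: Set[str] = set()
        self.rebuilds = 0  # demonstration-only counter

    def _get_causes_set(self, backtrack_causes: Sequence[Cause]) -> Set[str]:
        # The key changes whenever the elements of the sequence change,
        # even if the sequence object itself is mutated in place.
        key = tuple(id(cause) for cause in backtrack_causes)
        if key != self._causes_key:
            names: Set[str] = set()
            for cause in backtrack_causes:
                names.add(cause.name)
                if cause.parent_name is not None:
                    names.add(cause.parent_name)
            self._causes_key = key
            self._causes_names = names
            self.rebuilds += 1
        return self._causes_names

    def is_backtrack_cause(self, identifier: str, backtrack_causes: Sequence[Cause]) -> bool:
        return identifier in self._get_causes_set(backtrack_causes)


c1, c2, c3 = Cause("a"), Cause("b", "c"), Cause("d")
causes = [c1, c2]
provider = SketchProvider()

hit_before = provider.is_backtrack_cause("c", causes)
provider.is_backtrack_cause("a", causes)  # same key, served from the cache
rebuilds_before = provider.rebuilds

causes[:] = [c3]  # in-place mutation changes the element ids, so the key changes
hit_after = provider.is_backtrack_cause("c", causes)
rebuilds_after = provider.rebuilds
```

One caveat worth keeping in mind: in CPython, `id()` values are only unique among objects that are alive at the same time, so a key built from ids assumes the previously seen cause objects have not been freed and replaced by new objects at the same addresses.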
It's not really possible for this to be a "performance regression" as
I have one new idea, which I think is my new preferred option. We could pass a "prepare" method into resolvelib, and resolvelib would call
Thanks for the info.
If no one likes my newest PR I would prefer this a lot over having a cache. Part of my reasoning is that if I ever get this backjump or similar optimization into a state where I'm happy to submit a PR, there will be more communication here between pip and resolvelib on when to prefer or unprefer a package. And dealing with a weird cache here makes it a little trickier and feels a lot less clean. But just my two cents.
To quickly summarize there are 3 PRs now trying to solve this problem, the high level approaches being:
It could also be the preferred solution is an adapted version of one of these or even a mix of the last two, definitely happy to put in some further work if that helps.
FYI I've now closed sarugaku/resolvelib#93 in favor of sarugaku/resolvelib#99 and #10732
I'm going to close this as I think consensus is a solution that involves cooperation from resolvelib.
Profile on branch: is_backtrack_cause is down to ~1% (from 43% on main). Note that this isn't a scientific experiment, as I let resolution run for 10-20 minutes and then cancelled the process.