Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize provider's is_backtrack_cause by caching resolved names #10621

Closed
wants to merge 3 commits into from

Conversation

jbylund
Copy link
Contributor

@jbylund jbylund commented Oct 28, 2021

is_backtrack_cause is called for every requirement on every step of resolution. As it's implemented currently if an identifier is not the backtrack cause, the entire list is scanned, invoking the .name property on every element of the list (and possibly parents). Since all the provider needs to know is presence or absence of the identifier in the backtrack causes list stash the resolved names in a set on first invocation and check for presence/absence in that set.

Profile on branch is_backtrack_cause is down to ~1% (from 43% on main), note that this isn't a scientific experiment as I let resolution run for 10-20 minutes and then cancel the process.

is_backtrack_cause is called for every requirement on every step of
resolution. As it's implemented currently if an identifier is not the
backtrack cause, the entire list is scanned, invoking the .name
property on every element of the list (and possibly parents). Since all
the provider needs to know is presence or absence of the identifier in
the backtrack causes list stash the resolved names in a set on first
invocation and check for presence/absence in that set.
@pradyunsg
Copy link
Member

Please provide a reproducer for the test case.

@jbylund
Copy link
Contributor Author

jbylund commented Oct 28, 2021

Please provide a reproducer for the test case.

The dreaded "it's on a private package repository", but isn't apache-airflow[all] a popular case for backtracking?

@jbylund
Copy link
Contributor Author

jbylund commented Oct 28, 2021

Ok, on my machine pip install apache-airflow[all] (which ends up failing in the end) takes:

main:
135s wall clock, 16.5s under is_backtrack_cause, 22.8s under get_preference

this branch - is backtrack cause is pruned from the svg, so it's easier to look at get_preference:
118s wall clock, 0.3s under is_backtrack_cause, 5.9s under get_preference

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense at a high level. I don't quite like the fact we're caching to a module-level global variable though, it could be bad for unit testing. It's probably better to tie the cache's lifetime to the provider instance.

news/10621.feature.rst Outdated Show resolved Hide resolved
@uranusjr
Copy link
Member

BTW does anyone know why the Azure checks are showing up?

@jbylund
Copy link
Contributor Author

jbylund commented Oct 28, 2021

I think this makes sense at a high level. I don't quite like the fact we're caching to a module-level global variable though, it could be bad for unit testing. It's probably better to tie the cache's lifetime to the provider instance.

Yeah, I also don't like where it is. I'll move it into the provider. There's something about the division of labor between resolvelib and the provider that doesn't quite sit right with me but I don't know exactly how to split it.

@uranusjr
Copy link
Member

One way I see how resolvelib can do for this is to make backtrack_causes's type opaque like Requirement and Candidate, so the cache can be stored directly on it. Otherewise is_backtrack_cause is entirely a pip thing, so this probably should be handled entirely in pip.

@notatallshaw
Copy link
Member

notatallshaw commented Oct 29, 2021

When I wrote this I knew it was inefficient but in the requirements I tested it didn't have a noticeable impact so I didn't prematurely optimize.

In my continued experimentation on improving backtracking I have moved this step to resolvelib and pass failure_causes to a method like this:

    def _causes_to_names(self, causes):
        return {c.requirement.name for c in causes} | {
            c.parent.name for c in causes if c.parent
        }

And then I pass the resulting set as backtrack_causes to get_preference and then it just becomes identifier in backtrack_causes.



def _get_causes_set(backtrack_causes: Sequence["PreferenceInformation"]) -> Set[str]:
key = id(backtrack_causes)
Copy link
Member

@notatallshaw notatallshaw Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the id being calculated here is the id of a list, but that list mutates:

self.state.backtrack_causes[:] = causes

So I think this is an incorrect assumption and that the backtrack causes could change in the list but your cache would not reflect that.

Co-authored-by: Tzu-ping Chung <[email protected]>
@jbylund
Copy link
Contributor Author

jbylund commented Oct 29, 2021

Is there a branch for the resolvelib change? I looked at making a change there, but for some reason I thought that change would be too specific to pip?

to inside the provider. Use the id of every element in the sequence
as part of the key as a guard against the sequence mutating.
@notatallshaw
Copy link
Member

notatallshaw commented Oct 29, 2021

Is there a branch for the resolvelib change? I looked at making a change there, but for some reason I thought that change would be too specific to pip?

From a design point of view I think we need to ask @uranusjr , it seems to me that resolvelib while kept separate it is only supported to work with pip and this would be acceptable to calculate in resolvelib?

From my point of view this set should be calculated once after each backtrack, making a cache on pip's side is error prone, and only resolvelib knows when a backtrack happened. It's still up to pip, or other downstream libraries, to decide what to do with this information.

Side note: while we're doing we could remove backtrack_causes from the state, we're already mutating it and we don't care about old versions of it, we could just have a local variable in resolve. But this is just me thinking out loud, I'm not specifically recommending it.

As for a branch, no not right now, I am experimenting by modifying a copy of pip main and my changes include a lot more than this (I am testing possible backjumping approaches). But if you would rather I push a PR to resolvelib and pip to fix this (you touched it you own it, aha) then I am happy to do so.

@jbylund
Copy link
Contributor Author

jbylund commented Oct 29, 2021

If you wouldn't mind cutting the resolvelib pr with the minimum change to make backtrack causes a set?

@notatallshaw
Copy link
Member

If you wouldn't mind cutting the resolvelib pr with the minimum change to make backtrack causes a set?

Done: sarugaku/resolvelib#92

@notatallshaw
Copy link
Member

notatallshaw commented Oct 30, 2021

Created a new, what I now guess is a competing, draft PR here: sarugaku/resolvelib#93

I personally don't like the approach of creating a cache that depends on object ids from the internal structure of the backtrack_causes object. I don't have a strong opinion about whether it's done in pip or resolvelib, I just don't see how to implement it simply on pip's side.

On a side note how did you come across this issue? I've done some benchmarking on pip download apache-airflow[all] and I see that pip 21.3.1 is able to complete it in less than a couple of minutes on my machine where as pip 21.2.4 gets stuck in a heavy backtracking situation. Has pip 21.3 enabled some new workflow that now needs to be performance monitored?

@jbylund
Copy link
Contributor Author

jbylund commented Nov 1, 2021

I personally don't like the approach of creating a cache that depends on object ids from the internal structure of the backtrack_causes object. I don't have a strong opinion about whether it's done in pip or resolvelib, I just don't see how to implement it simply on pip's side.

That's updated so that the cache key is now a tuple of all the ids of the objects in the causes iterable. Which, I agree with you, I'm still not crazy about. But in order for it to break, I think resolvelib would need to:

  1. overwrite info in PreferenceInformation objects
  2. the number of objects in there would also need to be the same

I think I'm sufficiently convinced that that's not possible, but I understand if it still feels too "dirty" to some.

On a side note how did you come across this issue? I've done some benchmarking on pip download apache-airflow[all] and I see that pip 21.3.1 is able to complete it in less than a couple of minutes on my machine where as pip 21.2.4 gets stuck in a heavy backtracking situation. Has pip 21.3 enabled some new workflow that now needs to be performance monitored?

It's not really possible for this to be a "performance regression" as is_backtrack_cause was only recently introduced. I started following this thread after profiling pip installing a package from my company's private package repository, I couldn't share that as a reproduction case, and I think I saw that you/others had been using airflow as a test case, so I tried that. It has a lot fewer backtracks, so the fraction of cpu time under is_backtrack_cause goes down from ~45% in my test case to ~15% in the airflow case.

I have one new idea, which I think is my new preferred option. We could pass a "prepare" method into resolvelib, and resolvelib would call prepare(raw_causes) and pass that as an argument to get_preference. prepare would default to the identity function which means that all the other consumers of resolvlib get to stick to their current usage pattern. pip would pass ~_causes_to_names. Which would give pip exactly the behavior it wants with no messy cache, and no one else has to worry about anything.

@notatallshaw
Copy link
Member

Thanks for the info.

I have one new idea, which I think is my new preferred option. We could pass a "prepare" method into resolvelib, and resolvelib would call prepare(raw_causes) and pass that as an argument to get_preference. prepare would default to the identity function which means that all the other consumers of resolvlib get to stick to their current usage pattern. pip would pass ~_causes_to_names. Which would give pip exactly the behavior it wants with no messy cache, and no one else has to worry about anything.

If no one likes my newest PR I would prefer this a lot over having a cache.

Part of my reasoning is if I ever get this backjump or similar optimization in a state where I'm happy to to submit a PR there will be more communication here between pip and resolvelib on when to prefer or unprefer a package. And dealing with a weird cache here makes it a little trickier and feels a lot less clean.

But just my two cents.

@notatallshaw
Copy link
Member

notatallshaw commented Nov 12, 2021

To quickly summarize there are 3 PRs now trying to solve this problem, the high level approaches being:

It could also be the preferred solution is an adapted version of one of these or even a mix of the last two, definitely happy to put in some further work if that helps.

@notatallshaw
Copy link
Member

FYI I've now closed sarugaku/resolvelib#93 in favor of sarugaku/resolvelib#99 and #10732

@jbylund
Copy link
Contributor Author

jbylund commented Dec 20, 2021

I'm going to close this as I think consensus is a solution that involves cooperation from resolvelib.

@jbylund jbylund closed this Dec 20, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants