Optimize provider's is_backtrack_cause by caching resolved names #10621

jbylund · 2021-10-28T19:49:27Z

is_backtrack_cause is called for every requirement on every step of resolution. As it's implemented currently if an identifier is not the backtrack cause, the entire list is scanned, invoking the .name property on every element of the list (and possibly parents). Since all the provider needs to know is presence or absence of the identifier in the backtrack causes list stash the resolved names in a set on first invocation and check for presence/absence in that set.

Profile on branch is_backtrack_cause is down to ~1% (from 43% on main), note that this isn't a scientific experiment as I let resolution run for 10-20 minutes and then cancel the process.

is_backtrack_cause is called for every requirement on every step of resolution. As it's implemented currently if an identifier is not the backtrack cause, the entire list is scanned, invoking the .name property on every element of the list (and possibly parents). Since all the provider needs to know is presence or absence of the identifier in the backtrack causes list stash the resolved names in a set on first invocation and check for presence/absence in that set.

pradyunsg · 2021-10-28T20:47:24Z

Please provide a reproducer for the test case.

jbylund · 2021-10-28T20:56:58Z

Please provide a reproducer for the test case.

The dreaded "it's on a private package repository", but isn't apache-airflow[all] a popular case for backtracking?

jbylund · 2021-10-28T21:33:47Z

Ok, on my machine pip install apache-airflow[all] (which ends up failing in the end) takes:

main:
135s wall clock, 16.5s under is_backtrack_cause, 22.8s under get_preference

this branch - is backtrack cause is pruned from the svg, so it's easier to look at get_preference:
118s wall clock, 0.3s under is_backtrack_cause, 5.9s under get_preference

uranusjr

I think this makes sense at a high level. I don't quite like the fact we're caching to a module-level global variable though, it could be bad for unit testing. It's probably better to tie the cache's lifetime to the provider instance.

news/10621.feature.rst

uranusjr · 2021-10-28T23:23:30Z

BTW does anyone know why the Azure checks are showing up?

jbylund · 2021-10-28T23:30:58Z

I think this makes sense at a high level. I don't quite like the fact we're caching to a module-level global variable though, it could be bad for unit testing. It's probably better to tie the cache's lifetime to the provider instance.

Yeah, I also don't like where it is. I'll move it into the provider. There's something about the division of labor between resolvelib and the provider that doesn't quite sit right with me but I don't know exactly how to split it.

uranusjr · 2021-10-29T00:05:00Z

One way I see how resolvelib can do for this is to make backtrack_causes's type opaque like Requirement and Candidate, so the cache can be stored directly on it. Otherewise is_backtrack_cause is entirely a pip thing, so this probably should be handled entirely in pip.

notatallshaw · 2021-10-29T12:34:07Z

When I wrote this I knew it was inefficient but in the requirements I tested it didn't have a noticeable impact so I didn't prematurely optimize.

In my continued experimentation on improving backtracking I have moved this step to resolvelib and pass failure_causes to a method like this:

    def _causes_to_names(self, causes):
        return {c.requirement.name for c in causes} | {
            c.parent.name for c in causes if c.parent
        }

And then I pass the resulting set as backtrack_causes to get_preference and then it just becomes identifier in backtrack_causes.

notatallshaw · 2021-10-29T12:39:28Z

src/pip/_internal/resolution/resolvelib/provider.py

+
+
+def _get_causes_set(backtrack_causes: Sequence["PreferenceInformation"]) -> Set[str]:
+    key = id(backtrack_causes)


I think the id being calculated here is the id of a list, but that list mutates:

pip/src/pip/_vendor/resolvelib/resolvers.py

Line 381 in 9ef97f2

self.state.backtrack_causes[:] = causes

So I think this is an incorrect assumption and that the backtrack causes could change in the list but your cache would not reflect that.

Co-authored-by: Tzu-ping Chung <[email protected]>

jbylund · 2021-10-29T13:28:09Z

Is there a branch for the resolvelib change? I looked at making a change there, but for some reason I thought that change would be too specific to pip?

to inside the provider. Use the id of every element in the sequence as part of the key as a guard against the sequence mutating.

notatallshaw · 2021-10-29T13:45:44Z

Is there a branch for the resolvelib change? I looked at making a change there, but for some reason I thought that change would be too specific to pip?

From a design point of view I think we need to ask @uranusjr , it seems to me that resolvelib while kept separate it is only supported to work with pip and this would be acceptable to calculate in resolvelib?

From my point of view this set should be calculated once after each backtrack, making a cache on pip's side is error prone, and only resolvelib knows when a backtrack happened. It's still up to pip, or other downstream libraries, to decide what to do with this information.

Side note: while we're doing we could remove backtrack_causes from the state, we're already mutating it and we don't care about old versions of it, we could just have a local variable in resolve. But this is just me thinking out loud, I'm not specifically recommending it.

As for a branch, no not right now, I am experimenting by modifying a copy of pip main and my changes include a lot more than this (I am testing possible backjumping approaches). But if you would rather I push a PR to resolvelib and pip to fix this (you touched it you own it, aha) then I am happy to do so.

jbylund · 2021-10-29T15:09:34Z

If you wouldn't mind cutting the resolvelib pr with the minimum change to make backtrack causes a set?

notatallshaw · 2021-10-29T15:46:16Z

If you wouldn't mind cutting the resolvelib pr with the minimum change to make backtrack causes a set?

Done: sarugaku/resolvelib#92

notatallshaw · 2021-10-30T23:22:15Z

Created a new, what I now guess is a competing, draft PR here: sarugaku/resolvelib#93

I personally don't like the approach of creating a cache that depends on object ids from the internal structure of the backtrack_causes object. I don't have a strong opinion about whether it's done in pip or resolvelib, I just don't see how to implement it simply on pip's side.

On a side note how did you come across this issue? I've done some benchmarking on pip download apache-airflow[all] and I see that pip 21.3.1 is able to complete it in less than a couple of minutes on my machine where as pip 21.2.4 gets stuck in a heavy backtracking situation. Has pip 21.3 enabled some new workflow that now needs to be performance monitored?

jbylund · 2021-11-01T15:25:14Z

I personally don't like the approach of creating a cache that depends on object ids from the internal structure of the backtrack_causes object. I don't have a strong opinion about whether it's done in pip or resolvelib, I just don't see how to implement it simply on pip's side.

That's updated so that the cache key is now a tuple of all the ids of the objects in the causes iterable. Which, I agree with you, I'm still not crazy about. But in order for it to break, I think resolvelib would need to:

overwrite info in PreferenceInformation objects
the number of objects in there would also need to be the same

I think I'm sufficiently convinced that that's not possible, but I understand if it still feels too "dirty" to some.

On a side note how did you come across this issue? I've done some benchmarking on pip download apache-airflow[all] and I see that pip 21.3.1 is able to complete it in less than a couple of minutes on my machine where as pip 21.2.4 gets stuck in a heavy backtracking situation. Has pip 21.3 enabled some new workflow that now needs to be performance monitored?

It's not really possible for this to be a "performance regression" as is_backtrack_cause was only recently introduced. I started following this thread after profiling pip installing a package from my company's private package repository, I couldn't share that as a reproduction case, and I think I saw that you/others had been using airflow as a test case, so I tried that. It has a lot fewer backtracks, so the fraction of cpu time under is_backtrack_cause goes down from ~45% in my test case to ~15% in the airflow case.

I have one new idea, which I think is my new preferred option. We could pass a "prepare" method into resolvelib, and resolvelib would call prepare(raw_causes) and pass that as an argument to get_preference. prepare would default to the identity function which means that all the other consumers of resolvlib get to stick to their current usage pattern. pip would pass ~_causes_to_names. Which would give pip exactly the behavior it wants with no messy cache, and no one else has to worry about anything.

notatallshaw · 2021-11-01T17:26:00Z

Thanks for the info.

I have one new idea, which I think is my new preferred option. We could pass a "prepare" method into resolvelib, and resolvelib would call prepare(raw_causes) and pass that as an argument to get_preference. prepare would default to the identity function which means that all the other consumers of resolvlib get to stick to their current usage pattern. pip would pass ~_causes_to_names. Which would give pip exactly the behavior it wants with no messy cache, and no one else has to worry about anything.

If no one likes my newest PR I would prefer this a lot over having a cache.

Part of my reasoning is if I ever get this backjump or similar optimization in a state where I'm happy to to submit a PR there will be more communication here between pip and resolvelib on when to prefer or unprefer a package. And dealing with a weird cache here makes it a little trickier and feels a lot less clean.

But just my two cents.

notatallshaw · 2021-11-12T20:30:22Z

To quickly summarize there are 3 PRs now trying to solve this problem, the high level approaches being:

Optimize provider's is_backtrack_cause by caching resolved names #10621 Adds a cache on Pip
Implement dedicated causes object sarugaku/resolvelib#93 - Adds a Causes object in resolvelib which contains the logic to generate the set of names
Consumer passes in a prepare method which is called over backtrack causes sarugaku/resolvelib#96 - Let's pip provide the logic to resolvelib how to prepare the causes

It could also be the preferred solution is an adapted version of one of these or even a mix of the last two, definitely happy to put in some further work if that helps.

notatallshaw · 2021-12-18T22:38:21Z

FYI I've now closed sarugaku/resolvelib#93 in favor of sarugaku/resolvelib#99 and #10732

jbylund · 2021-12-20T22:16:04Z

I'm going to close this as I think consensus is a solution that involves cooperation from resolvelib.

jbylund force-pushed the causes_in_set branch from 86123ab to 9ef97f2 Compare October 28, 2021 19:57

jbylund mentioned this pull request Oct 28, 2021

is_backtrack_cause is very expensive to evaluate #10620

Open

1 task

uranusjr reviewed Oct 28, 2021

View reviewed changes

news/10621.feature.rst Outdated Show resolved Hide resolved

notatallshaw reviewed Oct 29, 2021

View reviewed changes

Update news/10621.feature.rst

4d1ea9b

Co-authored-by: Tzu-ping Chung <[email protected]>

Move helper to go from sequence of backtrack causes to set of names

6951b9b

to inside the provider. Use the id of every element in the sequence as part of the key as a guard against the sequence mutating.

notatallshaw mentioned this pull request Oct 29, 2021

Pre-compute backtrack cause names sarugaku/resolvelib#92

Closed

This was referenced Dec 18, 2021

Let downstream library to define causes object sarugaku/resolvelib#99

Closed

Create causes object for backtracking #10732

Closed

jbylund closed this Dec 20, 2021

github-actions bot locked as resolved and limited conversation to collaborators Jan 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize provider's is_backtrack_cause by caching resolved names #10621

Optimize provider's is_backtrack_cause by caching resolved names #10621

jbylund commented Oct 28, 2021 •

edited

Loading

pradyunsg commented Oct 28, 2021

jbylund commented Oct 28, 2021

jbylund commented Oct 28, 2021

uranusjr left a comment

uranusjr commented Oct 28, 2021

jbylund commented Oct 28, 2021

uranusjr commented Oct 29, 2021

notatallshaw commented Oct 29, 2021 •

edited

Loading

notatallshaw Oct 29, 2021 •

edited

Loading

jbylund commented Oct 29, 2021

notatallshaw commented Oct 29, 2021 •

edited

Loading

jbylund commented Oct 29, 2021

notatallshaw commented Oct 29, 2021

notatallshaw commented Oct 30, 2021 •

edited

Loading

jbylund commented Nov 1, 2021

notatallshaw commented Nov 1, 2021

notatallshaw commented Nov 12, 2021 •

edited

Loading

notatallshaw commented Dec 18, 2021

jbylund commented Dec 20, 2021



		def _get_causes_set(backtrack_causes: Sequence["PreferenceInformation"]) -> Set[str]:
		key = id(backtrack_causes)

Optimize provider's is_backtrack_cause by caching resolved names #10621

Optimize provider's is_backtrack_cause by caching resolved names #10621

Conversation

jbylund commented Oct 28, 2021 • edited Loading

pradyunsg commented Oct 28, 2021

jbylund commented Oct 28, 2021

jbylund commented Oct 28, 2021

uranusjr left a comment

Choose a reason for hiding this comment

uranusjr commented Oct 28, 2021

jbylund commented Oct 28, 2021

uranusjr commented Oct 29, 2021

notatallshaw commented Oct 29, 2021 • edited Loading

notatallshaw Oct 29, 2021 • edited Loading

Choose a reason for hiding this comment

jbylund commented Oct 29, 2021

notatallshaw commented Oct 29, 2021 • edited Loading

jbylund commented Oct 29, 2021

notatallshaw commented Oct 29, 2021

notatallshaw commented Oct 30, 2021 • edited Loading

jbylund commented Nov 1, 2021

notatallshaw commented Nov 1, 2021

notatallshaw commented Nov 12, 2021 • edited Loading

notatallshaw commented Dec 18, 2021

jbylund commented Dec 20, 2021

jbylund commented Oct 28, 2021 •

edited

Loading

notatallshaw commented Oct 29, 2021 •

edited

Loading

notatallshaw Oct 29, 2021 •

edited

Loading

notatallshaw commented Oct 29, 2021 •

edited

Loading

notatallshaw commented Oct 30, 2021 •

edited

Loading

notatallshaw commented Nov 12, 2021 •

edited

Loading