Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache is_satisfied_by #12453

Merged
merged 5 commits into from
Apr 23, 2024
Merged

Cache is_satisfied_by #12453

merged 5 commits into from
Apr 23, 2024

Conversation

sbidoul
Copy link
Member

@sbidoul sbidoul commented Dec 28, 2023

This is a performance improvement for a presumably common use case, namely the installation of a project with locked dependencies using pip install -c requirements.txt -e ., in an environment where most dependencies are already installed.

More precisely the case I'm testing is a ~500 lines requirements.txt, and running the above command in an environment where all dependencies are already installed. This is typically done by developers when pulling a new version of the project, to ensure their environment is up-to-date.

When running this under py-spy, it appears that PipProvider.is_satisfied_by largely dominates the 50 seconds running resolve() (out of a total 55 sec execution time).

image

Some tracing showed that is_satisfied_by is repeatedly called for the same requirement and candidate.

So I experimented with some caching.

The result of this PR is a 40 seconds saving on the 50 sec resolve(). The total execution time went down to 15 sec from 55.

image

I'm opening as draft as there are some possible improvements, but it's ready for comments on the general approach already.

def __init__(self, ireq: InstallRequirement) -> None:
assert ireq.link is None, "This is a link, not a specifier"
self._ireq = ireq
self._extras = frozenset(canonicalize_name(e) for e in self._ireq.extras)
KeyBasedCompareMixin.__init__(
self, key=str(ireq), defining_class=SpecifierRequirement
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using str here we may want to make InstallRequirement hashable and comparable.

@sbidoul
Copy link
Member Author

sbidoul commented Dec 28, 2023

The root cause of the perf issue is likely related to sarugaku/resolvelib#147

@pfmoore
Copy link
Member

pfmoore commented Dec 28, 2023

+1 on the general approach. Making InstallRequirement hashable and comparable in its own right can be a follow-up PR if it's complex (and it looks like it could be complex).

There's a risk of incorrect results caused by two requirements which compare as equal but actually aren't (for example two URL requirements with the same target but different auth data) but I'm assuming that risk is small enough to be acceptable.

KeyBasedCompareMixin.__init__(
self, key=(self.name, self.version), defining_class=self.__class__
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it might be safer to hash and compare on self.dist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this may change the logic?

@pradyunsg
Copy link
Member

What's the contains that dominates is_satisfied_by?

@sbidoul
Copy link
Member Author

sbidoul commented Dec 28, 2023

What's the contains that dominates is_satisfied_by?

It's from packaging.specifiers.

@notatallshaw
Copy link
Member

notatallshaw commented Dec 28, 2023

The root cause of the perf issue is likely related to sarugaku/resolvelib#147

Yes, resolvelibs algorithm checks each step if each requirement has been satisfied: https://github.com/sarugaku/resolvelib/blob/1.0.1/src/resolvelib/resolvers.py#L217, called here https://github.com/sarugaku/resolvelib/blob/1.0.1/src/resolvelib/resolvers.py#L409 and here https://github.com/sarugaku/resolvelib/blob/1.0.1/src/resolvelib/resolvers.py#L443.

So for 500 requirements, that are not top level, there will be at least 500 steps, where each step will check if each of the 500 requirements have already been resolved, which for yet unpinned requirements involves checking each requirement's requirements leading to at least O(n^2) behavior. Ideally there would be an algorthmic fix for this, but I've not yet found one.

Also I beleive this PR potentially fixes this users issue #12314 (though reproducing the users results has been challenging).

@notatallshaw
Copy link
Member

notatallshaw commented Dec 28, 2023

By the way for this specific use case:

common use case, namely the installation of a project with locked dependencies using pip install -c requirements.txt -e .

I do have a seperate idea that could fix it algorithmically: sarugaku/resolvelib#146. But I've not done any work towards a PR yet, so I can't compare.

)

@lru_cache(maxsize=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to check how much this increases memory consumption for a worst case scenario, i.e. when ResolutionTooDeep is reached.

This set of requirements can reach that sphinx sphinx-rtd-theme sphinx-toolbox myst-parser sphinxcontrib-bibtex nbsphinx. I can test this in the new year if you'd like.

Copy link
Member

@notatallshaw notatallshaw Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with a few different requirements dry install, looking at apache-airflow[all] I saw an increase of peak memory usage from 298 MBs to 306 MBs, I also saw the time to completion drop from ~2m 41s to ~2m 12s.

Personally this seems significantly worth the memory/time trade off to me (and looking at the memray flamegraph of these installs there are potentially big areas for improvement in other parts of Pip's codebase to use less memory)

@notatallshaw
Copy link
Member

Is there a reason this is still in draft mode? It would be great to see this land this year.

@sbidoul sbidoul marked this pull request as ready for review February 19, 2024 14:07
@sbidoul
Copy link
Member Author

sbidoul commented Feb 19, 2024

@notatallshaw thanks for the ping. I think this was ready, but it needs double checking for correctness. So I marked it ready for review. I'll add the news entry soon.

@notatallshaw
Copy link
Member

Testing performance today on Python 3.12 with all packages cached (ran the command twice to populate cache) pip install --dry-run apache-airflow[all] took ~5 mins 50 seconds on main and ~3 mins 10 seconds on this branch.

This is a huge performance improvement for real world large dependency trees.

@notatallshaw notatallshaw mentioned this pull request Apr 9, 2024
@sbidoul sbidoul force-pushed the cache-is_satified_by-sbi branch from 98c5cae to 705ce7d Compare April 21, 2024 11:33
@sbidoul
Copy link
Member Author

sbidoul commented Apr 21, 2024

@pradyunsg I tentatively added this to 24.1. Feel free to postpone if you are not comfortable with this.

There will be minor conflict with #12300, which I'll be happy to resolve, whichever of the two is merged first.

@sbidoul
Copy link
Member Author

sbidoul commented Apr 21, 2024

Oh, wait KeyBasedCompareMixin I was relying on here has been removed in #12571. I'm resetting to draft.

@sbidoul sbidoul removed this from the 24.1 milestone Apr 21, 2024
@sbidoul sbidoul marked this pull request as draft April 21, 2024 11:44
@ichard26
Copy link
Member

Sorry about that @sbidoul! I suppose in hindsight it would've been better to not remove that mixin.

@sbidoul sbidoul marked this pull request as ready for review April 21, 2024 14:31
@sbidoul
Copy link
Member Author

sbidoul commented Apr 21, 2024

No worries. I already pushed the update.

@sbidoul sbidoul added this to the 24.1 milestone Apr 21, 2024
@uranusjr uranusjr merged commit 6522547 into pypa:main Apr 23, 2024
24 checks passed
@sbidoul sbidoul deleted the cache-is_satified_by-sbi branch April 23, 2024 07:12
@sbidoul
Copy link
Member Author

sbidoul commented Apr 23, 2024

Thanks for the review and merge @uranusjr.

I suppose in hindsight it would've been better to not remove that mixin.

@ichard26 no issue at all. Dead code must be removed without mercy ;)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 9, 2024
@ichard26 ichard26 added the type: performance Commands take too long to run label Dec 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants