Add ability to cache package payloads synchronously #1679

isohedronpipeline
Contributor

@isohedronpipeline isohedronpipeline commented Mar 7, 2024

Fixes #1379

Carrying on the work from #1452

This adds support for synchronous/blocking package caching during the resolve process.

Adds a `package_cache_async` flag to the config, which lets users run caching asynchronously or synchronously (blocking).
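
A minimal sketch of how the flag might be set, assuming it is spelled `package_cache_async` and is overridden in a site or user `rezconfig.py` like other rez settings. The comments paraphrase the behavior described in this PR; the file location is an assumption.

```python
# rezconfig.py override (hypothetical site or user config file).
# Assumed setting name: package_cache_async.

# Default per this PR: cache in the background, so resolves return immediately.
# package_cache_async = True

# Block each resolve until all packages are cached.
package_cache_async = False
```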

Signed-off-by: ttrently <[email protected]>
Change _async to default True instead of False to mimic previous behavior.

Signed-off-by: ttrently <[email protected]>
Renamed `add_variants_async` to `add_variants` as this method can now run with an `_async` flag.

Signed-off-by: ttrently <[email protected]>
@isohedronpipeline isohedronpipeline requested a review from a team as a code owner March 7, 2024 01:15

linux-foundation-easycla bot commented Mar 7, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@isohedronpipeline isohedronpipeline force-pushed the package-cache-run-async branch 2 times, most recently from 0df6d8a to a8e829f Compare March 7, 2024 01:25
@JeanChristopheMorinPerso JeanChristopheMorinPerso added this to the Next milestone Mar 7, 2024

codecov bot commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 49.09091% with 28 lines in your changes missing coverage. Please review.

Project coverage is 58.39%. Comparing base (153bd87) to head (355a891).
Report is 38 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/rez/package_cache.py | 51.21% | 17 Missing and 3 partials ⚠️ |
| src/rez/cli/env.py | 25.00% | 4 Missing and 2 partials ⚠️ |
| src/rez/resolved_context.py | 80.00% | 0 Missing and 1 partial ⚠️ |
| src/rez/rezconfig.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1679      +/-   ##
==========================================
+ Coverage   58.03%   58.39%   +0.36%     
==========================================
  Files         127      126       -1     
  Lines       17069    17205     +136     
  Branches     3496     3519      +23     
==========================================
+ Hits         9906    10047     +141     
+ Misses       6496     6489       -7     
- Partials      667      669       +2     


Signed-off-by: Ben Andersen <[email protected]>
@JeanChristopheMorinPerso JeanChristopheMorinPerso changed the title Package cache run async Cache package payloads synchronously Mar 7, 2024
Member

@JeanChristopheMorinPerso JeanChristopheMorinPerso left a comment

Thanks for creating this PR @isohedronpipeline! I left some comments, please let me know if you have questions.

src/rez/package_cache.py (outdated, resolved)
src/rez/package_cache.py (outdated, resolved)
src/rez/resolved_context.py (outdated, resolved)
src/rez/resolved_context.py (resolved)
src/rez/rezconfig.py (outdated, resolved)
src/rez/tests/test_package_cache.py (outdated, resolved)
src/rez/tests/test_package_cache.py (resolved)
src/rez/cli/env.py (outdated, resolved)
isohedronpipeline and others added 2 commits March 8, 2024 15:11
Co-authored-by: Jean-Christophe Morin <[email protected]>
Signed-off-by: Ben Andersen <[email protected]>
Signed-off-by: Ben Andersen <[email protected]>
src/rez/cli/env.py (outdated, resolved)
src/rez/package_cache.py (outdated, resolved)
src/rez/rezconfig.py (outdated, resolved)
@JeanChristopheMorinPerso
Member

JeanChristopheMorinPerso commented Mar 10, 2024

I took a deeper look, and as things are in this PR, it doesn't work as you would expect. If you were to run `rez-env maya --pkg-cache-mode async` and then `rez-env maya --pkg-cache-mode sync`, and the initial cache (when we ran with async) takes long enough, the second rez-env (in sync mode) will skip the maya package. See https://github.com/AcademySoftwareFoundation/rez/blob/main/src/rez/package_cache.py#L210-L219. Basically, the package cache will skip any variant that is marked as in-progress. This would result in a non-localized environment.

Additionally, I still have to verify, but I think that we should really avoid using --daemon when caching synchronously. If we use --daemon, it would mean that cancelling rez-env --pkg-cache-mode sync would not cancel the caching process. After all, keeping the process alive is the reason to have a daemon mode in the first place. Or I'd go even further: we should not use a subprocess.

We should also consider adding a progress bar or some logs when using the sync mode. I anticipate that users will complain if rez-env maya houdini blocks for multiple minutes without clearly saying what's going on.

@isohedronpipeline
Contributor Author

isohedronpipeline commented Mar 11, 2024

> I took a deeper look, and as things are in this PR, it doesn't work as you would expect. If you were to run `rez-env maya --pkg-cache-mode async` and then `rez-env maya --pkg-cache-mode sync`, and the initial cache (when we ran with async) takes long enough, the second rez-env (in sync mode) will skip the maya package. See https://github.com/AcademySoftwareFoundation/rez/blob/main/src/rez/package_cache.py#L210-L219. Basically, the package cache will skip any variant that is marked as in-progress. This would result in a non-localized environment.

I think it would be preferable to call add_variant(force=True) rather than poll until the .copying sentinel file has disappeared, but I'm not sure what to do about the file lock. We really don't want to be copying the same path at the same time. Perhaps we should modify this logic to copy to an intermediate location that does an atomic move operation once the copy has completed to avoid errors with multiple cache operations running concurrently. I'm not sure there is a great solution for mixing sync and async caching operations for the same files at the same time.
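
The intermediate-location idea could be sketched roughly as follows. This is not rez's implementation: `copy_tree_atomically` and the `.staging-` prefix are hypothetical names, and the sketch assumes the staging directory sits on the same filesystem as the destination so that the final rename is a single atomic step.

```python
import os
import shutil
import tempfile


def copy_tree_atomically(src_dir, dest_dir):
    """Copy src_dir to dest_dir through a staging directory (hypothetical helper).

    Returns True if this call published dest_dir, False if it already existed
    or another process published it first.
    """
    if os.path.exists(dest_dir):
        return False

    parent = os.path.dirname(os.path.abspath(dest_dir))
    os.makedirs(parent, exist_ok=True)

    # Stage next to the destination so the final rename stays on one filesystem.
    staging_root = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    staged = os.path.join(staging_root, "payload")
    try:
        shutil.copytree(src_dir, staged)
        try:
            # Readers see either no dest_dir or a complete one, never a partial copy.
            os.rename(staged, dest_dir)
        except OSError:
            if os.path.isdir(dest_dir):
                # Another process published dest_dir first; keep theirs.
                return False
            raise
        return True
    finally:
        # Remove the staging area, including any unpublished payload.
        shutil.rmtree(staging_root, ignore_errors=True)
```

With this shape, two concurrent callers may both pay for the copy, but only one rename wins and neither ever exposes a half-written directory. It does not remove the need for a lock if duplicate transfers must be avoided.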

> Additionally, I still have to verify, but I think that we should really avoid using --daemon when caching synchronously. If we use --daemon, it would mean that cancelling rez-env --pkg-cache-mode sync would not cancel the caching process. After all, keeping the process alive is the reason to have a daemon mode in the first place. Or I'd go even further: we should not use a subprocess.

Yeah, I'm fine with that. I'll move some of the logic from the cli entry point into package_cache later today or tomorrow.

> We should also consider adding a progress bar or some logs when using the sync mode. I anticipate that users will complain if rez-env maya houdini blocks for multiple minutes without clearly saying what's going on.

We can have something print out in the threaded _while_copying() function. Since we're using shutil.copytree, I don't think we're going to have much in the way of stats about total size/amount transferred/estimated time to completion without hammering the disk, which I'd like to avoid.
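
One way to picture the threaded progress output is the sketch below. It is an illustration only: `copy_with_heartbeat` is a hypothetical function, not the `_while_copying()` helper from rez, and it reports elapsed time only, since `shutil.copytree` exposes no byte counts.

```python
import shutil
import sys
import threading
import time


def copy_with_heartbeat(src_dir, dest_dir, interval=0.2, out=sys.stderr):
    """Run shutil.copytree while a helper thread prints elapsed time."""
    done = threading.Event()
    start = time.time()

    def report():
        # Event.wait returns False on timeout and True once the copy has finished.
        while not done.wait(interval):
            out.write("\rcopying %s ... %.1fs" % (src_dir, time.time() - start))
            out.flush()

    reporter = threading.Thread(target=report, daemon=True)
    reporter.start()
    try:
        shutil.copytree(src_dir, dest_dir)
    finally:
        # Stop the reporter whether the copy succeeded or raised.
        done.set()
        reporter.join()
        out.write("\n")
    return time.time() - start
```

The reporter never touches the source or destination trees, so it adds no disk traffic beyond the copy itself.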

@JeanChristopheMorinPerso
Member

> I think it would be preferable to call add_variant(force=True) rather than poll until the .copying sentinel file has disappeared, but I'm not sure what to do about the file lock. We really don't want to be copying the same path at the same time. Perhaps we should modify this logic to copy to an intermediate location that does an atomic move operation once the copy has completed to avoid errors with multiple cache operations running concurrently. I'm not sure there is a great solution for mixing sync and async caching operations for the same files at the same time.

I'm not too sure why we would use force=True or why you don't want to poll. Can you explain your thoughts a bit more, please? Like you say, we really don't want to copy twice or copy the same package multiple times in parallel. The code goes to great lengths to avoid this.

@isohedronpipeline
Contributor Author

> I'm not too sure why we would use force=True or why you don't want to poll. Can you explain your thoughts a bit more, please? Like you say, we really don't want to copy twice or copy the same package multiple times in parallel. The code goes to great lengths to avoid this.

You could be right. I am worried that a previous sync process could have failed, leaving behind the .copying file, with no further updates being made to it. I would imagine that one use case for doing a package cache synchronously would be to force all of the caching operations to happen right now, possibly as a way to restart a locked or frozen cache operation.

I see that the copying file is updated every .2 seconds though, so we should be able to check that and continue to wait while that file is being updated.

My hesitation to hit the disk repeatedly is maybe misplaced since the cache is on the local disk. In studio environments, I hesitate to hit network drives with something that could be run many times in parallel. Of course you're right and starting multiple large file transfers instead is madness.

@JeanChristopheMorinPerso
Member

JeanChristopheMorinPerso commented Mar 11, 2024

The package cache class already knows how to handle stalled packages. It should be able to handle the case where a variant is left in progress due to an error or whatever else and restart it. See

    if os.path.exists(copying_filepath):
        try:
            st = os.stat(copying_filepath)
            secs = time.time() - st.st_mtime
            if secs > self._COPYING_TIME_MAX:
                return (self.VARIANT_COPY_STALLED, rootpath)

As for polling, I agree that polling a shared file system is undesirable. That said, caching is usually done on local disks. We could maybe use inotify to optimize and simplify the process.
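
The stalled-copy check and the polling it enables can be sketched in isolation like this. It is a standalone illustration, not the package cache class: `copy_state` and `wait_for_copy` are hypothetical names, and the two constants are assumed values standing in for `_COPYING_TIME_INC` and `_COPYING_TIME_MAX`.

```python
import os
import time

COPYING_TIME_INC = 0.2  # assumed heartbeat interval, in seconds
COPYING_TIME_MAX = 5.0  # assumed age after which a copy counts as stalled


def copy_state(copying_filepath, max_age=COPYING_TIME_MAX, now=None):
    """Classify a sentinel file as 'absent', 'in_progress' or 'stalled'."""
    try:
        mtime = os.stat(copying_filepath).st_mtime
    except FileNotFoundError:
        return "absent"
    age = (time.time() if now is None else now) - mtime
    # A live copier refreshes the file's mtime, so an old mtime means it died.
    return "stalled" if age > max_age else "in_progress"


def wait_for_copy(copying_filepath, poll=COPYING_TIME_INC, max_age=COPYING_TIME_MAX):
    """Block while another process keeps the sentinel fresh; return the final state."""
    state = copy_state(copying_filepath, max_age)
    while state == "in_progress":
        time.sleep(poll)
        state = copy_state(copying_filepath, max_age)
    return state
```

A caller that gets back `"stalled"` can take over the copy, while `"absent"` means the other process finished or never started, so the cache should be re-checked.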

Reports progress while waiting for another package to finish copying

Signed-off-by: Ben Andersen <[email protected]>
Signed-off-by: Ben Andersen <[email protected]>
@JeanChristopheMorinPerso JeanChristopheMorinPerso removed this from the Next milestone Mar 26, 2024
Member

@JeanChristopheMorinPerso JeanChristopheMorinPerso left a comment

I did a first read of your recent changes, but I have not yet reviewed them thoroughly. Hopefully that will allow me to think about it a little bit more.

src/rez/package_cache.py (outdated, resolved)
src/rez/package_cache.py (outdated, resolved)
@chadrik
Contributor

chadrik commented May 9, 2024

Hi @JeanChristopheMorinPerso I think all the outstanding issues have been addressed. Any other thoughts?

Member

@JeanChristopheMorinPerso JeanChristopheMorinPerso left a comment

Thanks @isohedronpipeline. I think we are almost there. I left some small comments that should hopefully be quick to address.

It would be great if you could also update https://rez.readthedocs.io/en/stable/caching.html#package-caching. The behavior of the async caching is well documented. I think the behavior of the sync cache should also be documented in a similar way.

        spinner.next()
        time.sleep(self._COPYING_TIME_INC)
        status, rootpath = self._get_cached_root(variant)
    else:

I think this else is not needed and is redundant.

This comment is still valid and should be addressed. https://docs.python.org/3/reference/compound_stmts.html#the-while-statement

This repeatedly tests the expression and, if it is true, executes the first suite; if the expression is false (which may be the first time it is tested) the suite of the else clause, if present, is executed and the loop terminates.

A break statement executed in the first suite terminates the loop without executing the else clause’s suite. A continue statement executed in the first suite skips the rest of the suite and goes back to testing the expression.
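
The quoted rule can be checked with a small standalone example. The two functions below are hypothetical and unrelated to the rez code apart from mirroring the shape of the polling loop: because the loop body contains no `break`, the `else` suite always runs, so it behaves the same as code placed directly after the loop.

```python
def with_else(statuses):
    """Poll until the status changes, finishing in a while/else suite."""
    it = iter(statuses)
    status = next(it)
    log = []
    while status == "in_progress":
        log.append("waiting")
        status = next(it)
    else:
        # Runs once the condition is false; only a `break` would skip it.
        log.append("finished:" + status)
    return log


def without_else(statuses):
    """Same loop, with the final step placed after the loop instead."""
    it = iter(statuses)
    status = next(it)
    log = []
    while status == "in_progress":
        log.append("waiting")
        status = next(it)
    log.append("finished:" + status)
    return log
```

Both functions return the same log for any input, which is why the `else` is redundant when the loop never breaks.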

src/rez/package_cache.py (outdated, resolved)
@@ -371,28 +413,50 @@ def add_variants_async(self, variants):
This method is called when a context is created or sourced. Variants
are then added to the cache in a separate process.
.. deprecated:: 3.1.0

TODO for maintainers: Update with appropriate version number before merging.

src/rez/package_cache.py (outdated, resolved)
src/rez/package_cache.py (outdated, resolved)
@@ -116,7 +129,7 @@ def get_cached_root(self, variant):

return rootpath

-    def add_variant(self, variant, force=False):
+    def add_variant(self, variant, force=False, wait_for_copying=False, logger=None):

I think wait_for_copying should be True by default.

Contributor Author

That would change the default behavior from async to sync, which I don't think we want to do, do we?

# Asynchronously cache packages. If this is false, resolves will block until
# all packages are cached.
#
# .. versionadded:: 3.1.0

TODO for maintainers: Update with appropriate version number before merging.

src/rez/package_cache.py (resolved)
src/rez/package_cache.py (outdated, resolved)
@ameliendeshams

Hi @isohedronpipeline,

I'm delighted to see this pull request! We've been waiting for this feature, and it's great to see it implemented.

At Fortiche, we've tested this branch, and it works exactly as expected. The synchronous package caching is a game-changer for our studio environments, where we often need to cache packages quickly and reliably.

We're also happy to see that you've addressed the concerns about progress and logging in sync mode. It's essential to provide users with clear feedback when running long-running commands.

Thanks again for your work on this feature! We're excited to see it merged and start using it in our production environments.

isohedronpipeline and others added 3 commits June 14, 2024 08:12
Co-authored-by: Jean-Christophe Morin <[email protected]>
Signed-off-by: Ben Andersen <[email protected]>
Co-authored-by: Jean-Christophe Morin <[email protected]>
Signed-off-by: Ben Andersen <[email protected]>
changed wait_for_copying on run_caching_operation default to True
Signed-off-by: Ben Andersen <[email protected]>
@JeanChristopheMorinPerso JeanChristopheMorinPerso changed the title Cache package payloads synchronously Add ability to cache package payloads synchronously Jun 22, 2024
Member

@JeanChristopheMorinPerso JeanChristopheMorinPerso left a comment

Almost there!

        spinner.next()
        time.sleep(self._COPYING_TIME_INC)
        status, rootpath = self._get_cached_root(variant)
    else:

This comment is still valid and should be addressed. https://docs.python.org/3/reference/compound_stmts.html#the-while-statement

This repeatedly tests the expression and, if it is true, executes the first suite; if the expression is false (which may be the first time it is tested) the suite of the else clause, if present, is executed and the loop terminates.

A break statement executed in the first suite terminates the loop without executing the else clause’s suite. A continue statement executed in the first suite skips the rest of the suite and goes back to testing the expression.

@JeanChristopheMorinPerso JeanChristopheMorinPerso added this to the Next milestone Jun 22, 2024
Signed-off-by: Jean-Christophe Morin <[email protected]>
Signed-off-by: Jean-Christophe Morin <[email protected]>
Signed-off-by: Jean-Christophe Morin <[email protected]>
Signed-off-by: Jean-Christophe Morin <[email protected]>
Member

@JeanChristopheMorinPerso JeanChristopheMorinPerso left a comment

Thanks a lot @isohedronpipeline and @ttrently!

@JeanChristopheMorinPerso JeanChristopheMorinPerso merged commit 671e32f into AcademySoftwareFoundation:main Jun 29, 2024
46 of 47 checks passed