Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Context Manager to support tmp volume and other configurable options #7

Merged
merged 3 commits into from
Jun 14, 2024

Conversation

rpmcginty
Copy link
Collaborator

@rpmcginty rpmcginty commented Jun 14, 2024

  • WIP
  • update context manager w/ new tmp vol and config options

What's in this Change?

Merscope analysis pipeline has new requirements in how data is localized. This change addresses those needs.

  • adds support for an optional tmp volume configuration (batch mount point and volume).
  • adds new ContextManagerConfiguration class that has more configurations
    • isolate_inputs - this flag would change where inputs are downloaded. typically, they are downloaded to a shared volume where the remote path is used to generate a sha256 hexdigest value. There are use cases where we need to isolate inputs per run and, furthermore, use an explicit path name. If isolate_inputs is specified, we will create a directory under shared volume based on demand id, and then download to sub path based on local value of job param.
    • env_file_write_mode - this option helps control under what conditions to write env file over passing env vars as is. there are three options, ALWAYS, NEVER, IF_REQUIRED (default is IF_REQUIRED)
    • force - this controls whether to conditionally download data during data sync based on prexisting data. If true, it will download regardless of whether data is present. default is False
    • size_only - this controls how to check validity of cached data already downloaded. If true, only size (and date) will be used to compare remote and local data. If false, md5checksum (or an etag for s3 values) values are also created.

Testing

  • unit tests
  • TBD - e2e deployment

@rpmcginty rpmcginty requested a review from njmei June 14, 2024 20:15
@rpmcginty rpmcginty marked this pull request as ready for review June 14, 2024 20:16
Comment on lines +529 to +539
if env_file_write_mode == EnvFileWriteMode.IF_REQUIRED:
env_size = sum([sys.getsizeof(k) + sys.getsizeof(v) for k, v in environment.items()])

if env_size > 8192 * 0.9:
logger.info(
f"Environment variables are too large to pass directly to container (> 90% of 8192). "
f"Writing environment variables to file {efs_environment_file_uri}."
)
confirm_write = True
else:
confirm_write = False
Copy link
Collaborator

@njmei njmei Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of prefer EnvFileWriteMode.ALWAYS to be the default and maybe just getting rid of EnvFileWriteMode.IF_REQUIRED as an optional entirely maybe? For troubleshooting, it may end up being more confusing if some jobs have environment variables presented in one way, while other jobs have it another just based solely on if their total size goes over the arbitrary AWS 8192 limit...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rpmcginty Let's defer this decision. Maybe just add a TODO: revisit whether to remove IF_REQUIRED and only support ALWAYS or NEVER

Comment on lines +342 to +352
# HACK: on macos, /tmp is a symlink to /private/tmp. This causes problems in the
# resolution and mapping of mounted paths to efs paths because we use the
# `pathlib,Path.resolve` to get the real path. This method will resolve the
# symlink to the real path. This is not what we want. We want to keep the
# symlink in the path. But there is no other method in `pathlib.Path` that
# will normalize the path.
# Solution here is to make tmp -> tmpdir. This is a hack and should not be
# a problem in linux based machines in production.
self.access_point_id__tmp = self.create_access_point(
EFS_TMP_ACCESS_POINT_NAME,
f"{EFS_TMP_PATH}",
f"{EFS_TMP_PATH}dir",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing to me, aren't all the Docker images and things we're running in the cloud all Linux-based? Is this just a defensive measure in case we ever have any EC2 instances that are somehow MacOS based?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my tests were not passing on my mac device because of this. This is purely for unit testing

Copy link
Collaborator

@njmei njmei Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be useful to add to the note that this is purely only needed for unit testing purposes (on Mac) then.

@njmei njmei self-requested a review June 14, 2024 20:39
@rpmcginty rpmcginty merged commit 8af165d into main Jun 14, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants