-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Context Manager to support tmp volume and other configurable options #7
Conversation
src/aibs_informatics_aws_lambda/handlers/demand/context_manager.py
Outdated
Show resolved
Hide resolved
if env_file_write_mode == EnvFileWriteMode.IF_REQUIRED: | ||
env_size = sum([sys.getsizeof(k) + sys.getsizeof(v) for k, v in environment.items()]) | ||
|
||
if env_size > 8192 * 0.9: | ||
logger.info( | ||
f"Environment variables are too large to pass directly to container (> 90% of 8192). " | ||
f"Writing environment variables to file {efs_environment_file_uri}." | ||
) | ||
confirm_write = True | ||
else: | ||
confirm_write = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of prefer EnvFileWriteMode.ALWAYS
to be the default and maybe just getting rid of EnvFileWriteMode.IF_REQUIRED
as an optional entirely maybe? For troubleshooting, it may end up being more confusing if some jobs have environment variables presented in one way, while other jobs have it another just based solely on if their total size goes over the arbitrary AWS 8192 limit...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rpmcginty Let's defer this decision. Maybe just add a TODO: revisit whether to remove IF_REQUIRED and only support ALWAYS or NEVER
# HACK: on macos, /tmp is a symlink to /private/tmp. This causes problems in the | ||
# resolution and mapping of mounted paths to efs paths because we use the | ||
# `pathlib,Path.resolve` to get the real path. This method will resolve the | ||
# symlink to the real path. This is not what we want. We want to keep the | ||
# symlink in the path. But there is no other method in `pathlib.Path` that | ||
# will normalize the path. | ||
# Solution here is to make tmp -> tmpdir. This is a hack and should not be | ||
# a problem in linux based machines in production. | ||
self.access_point_id__tmp = self.create_access_point( | ||
EFS_TMP_ACCESS_POINT_NAME, | ||
f"{EFS_TMP_PATH}", | ||
f"{EFS_TMP_PATH}dir", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing to me, aren't all the Docker images and things we're running in the cloud all Linux-based? Is this just a defensive measure in case we ever have any EC2 instances that are somehow MacOS based?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my tests were not passing on my mac device because of this. This is purely for unit testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be useful to add to the note that this is purely only needed for unit testing purposes (on Mac) then.
What's in this Change?
Merscope analysis pipeline has new requirements in how data is localized. This change addresses those needs.
ContextManagerConfiguration
class that has more configurationsisolate_inputs
- this flag would change where inputs are downloaded. typically, they are downloaded to a shared volume where the remote path is used to generate a sha256 hexdigest value. There are use cases where we need to isolate inputs per run and, furthermore, use an explicit path name. Ifisolate_inputs
is specified, we will create a directory under shared volume based on demand id, and then download to sub path based on local value of job param.env_file_write_mode
- this option helps control under what conditions to write env file over passing env vars as is. there are three options, ALWAYS, NEVER, IF_REQUIRED (default is IF_REQUIRED)force
- this controls whether to conditionally download data during data sync based on prexisting data. If true, it will download regardless of whether data is present. default is Falsesize_only
- this controls how to check validity of cached data already downloaded. If true, only size (and date) will be used to compare remote and local data. If false, md5checksum (or an etag for s3 values) values are also created.Testing