Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make root directory (common prefix of all base_dirs) customizable during setup #346

Closed
yymao opened this issue Jul 12, 2019 · 29 comments · Fixed by #416
Closed

Make root directory (common prefix of all base_dirs) customizable during setup #346

yymao opened this issue Jul 12, 2019 · 29 comments · Fixed by #416
Assignees
Labels
priority:high register Issues related to the register module

Comments

@yymao
Copy link
Member

yymao commented Jul 12, 2019

In an effort to make gcr-catalogs more easily installable outside NERSC (#271), we need to allow the root directory (the common prefix of all base_dirs, after #345 is done) to be configurable during package setup time.

There are a few ways to do this:

  1. Require base_dir to always be a relative path. At run time, the reader prepends root directory path (which is set during package installation) to base_dir.

  2. Require base_dir to be in the form of <ROOT_DIR>/relative/path. At run time, the reader replaces <ROOT_DIR> with the actual root directory path (which is set during package installation).

  3. Require base_dir to be in the form of /global/projecta/projectdirs/lsst/production/catalogs/relative/path. At run time, the reader replaces /global/projecta/projectdirs/lsst/production/catalogs with the actual root directory path (which is set during package installation).

(1) appears more elegant and less hack-y. (2) is more explicit to the user (i.e., the user would know what will happen). (3) has the benefit that for NERSC users, they can easily find the files without needing to find out what ROOT_DIR is.

Comments on this are welcome.

@yymao yymao added the register Issues related to the register module label Jul 12, 2019
@yymao yymao self-assigned this Jul 12, 2019
@wmwv
Copy link
Contributor

wmwv commented Jul 14, 2019

I suggest storing the ROOT_DIR for IN2P3/CC and Cori in the gcr-catalogs.

I think the number of installations we're trying to support is limited to a finite number, so we can make significant progress at minimal user effort. It wouldn't have to be explicitly set during package installation. It just would be read at package initialization with a machine name lookup.

If someone wanted to support their own location installation of gcr-catalogs, they could explicitly install or set an environment variable to control this.

@wmwv
Copy link
Contributor

wmwv commented Jul 14, 2019

I have similar thoughts about redefining REPOS in
https://github.com/LSSTDESC/dc2-dm-repo

to be a dict with 'NERSC' and 'IN2P3/CC' as keys.

@heather999
Copy link
Contributor

NERSC has asked we move off projecta and completely onto the new CFS soon - as in the next few weeks, exact schedule TBD. Hence, this issue has some higher priority.

@JoanneBogart
Copy link
Contributor

I can look into this.
I agree there should be site defaults, but it's still the case that somebody has to specify something, at least at installation time. If, as @wmwv suggests above, there are other places where this kind of thing comes up, we should handle it in a uniform way. I don't know of any reliable method to tell from a login what the site is. We could ask everyone (or at least everyone who installs GCR and wants to make use of the default) to set an environment variable, e.g. DESC_SITE. Is there a better way to go about this?

@heather999
Copy link
Contributor

I'm wondering how this would work for a docker image. At install time, I have no idea where the image might be run, though currently I can safely assume it is at NERSC. I'd need an environment variable that can be set to adjust the ROOT_DIR at runtime.

@johannct
Copy link

johannct commented Feb 4, 2020

we need to assume that it will have to be deployed at NERSC and at CC. Whether with an image or else. I think investigating an env variable setup is the best way forward.

@boutigny
Copy link
Contributor

boutigny commented Feb 6, 2020

Yes, an environment variable defined at installation time is probably the best way to go.

@JoanneBogart
Copy link
Contributor

My plan is

  • include a little yaml file or similar in the package, mapping site name to ROOT_DIR. There would be keys for NERSC and CC. For NERSC the value (while we still have projecta) could be /global/projecta/projectdirs/lsst
  • at install time, look for environment variable DESC_SITE and save value, if any, in a text file accessible to the package at run time [I don't know enough about package setup to do this but it seems like it should be possible]
  • modify all configs in catalog_configs so that filepath values are relative. This affects all instances of keywords filename:, addon_filename, base_dir:, root_dir, catalog_root_dir, header_file and repo. Have I missed any?
  • write a utility to resolve the prefix: if user has an environment variable GCR_DATA_ROOT set, use that for prefix. If not and there is a value of DESC_SITE in current process or from installation, use that to look it up from the mapping. If none of these leads to anything, quit with an error
  • modify readers to use the utility

@yymao
Copy link
Member Author

yymao commented Feb 7, 2020

@JoanneBogart many thanks for posting your plan here! This is really helpful. I think the general direction sounds very reasonable, but I have a few specific suggestions to reduce the complexity. In particular, I would suggest (1) reducing the need of environment variables, and (2) resolve path at run time rather than at installation time. I will post my suggestions in detail by Monday at latest (sorry, has been difficult to find time).

@JoanneBogart
Copy link
Contributor

JoanneBogart commented Feb 8, 2020

@yymao Simplification sounds good to me! I look forward to your comments. One clarification: I'm not suggesting that the paths be resolved at installation time, only that information about the site be cached at that time, so that the typical user doesn't have to specify anything (via environment variable or any other means) at run time.

@yymao
Copy link
Member Author

yymao commented Feb 8, 2020

@JoanneBogart Here's my specific suggestions (some are the same as yours):

  • Include a yaml file in the package, mapping site name to corresponding ROOT_DIR. This yaml file will be maintained in this repo.

  • Establish a module-level variable called ROOT_DIR:

    • Its value is set when GCRCatalogs is first imported during runtime.
    • Its value is set according to the yaml file described above and the environment variable HOSTNAME or something similar (i.e., an environment variable that already exists).
  • Update all existing config files so that the paths are relative.

  • During installation, keep config files as they are.

  • When load_catalog is called, a config file will be loaded into memory as a dictionary. Update the path in memory during this step. For example:

    config = {k: os.path.join(ROOT_DIR, v) if k in [...] else v for k, v in config.items()}
    

    In other words, the path modification is done in the register level (when resolving the config file), not at the reader level (the correct path should be passed to the readers). This in-memory modification is cheap and I think there's no need to cache it. (But we need to check if there's any path not specified at the top level of the config, and if so, the above example is too simplified.)

  • Users can set GCRCatalogs.ROOT_DIR = <path> to overwrite the default setting. This should take effect the next time when load_catalog is called.

  • Users can still use overwrite_config when calling load_catalog to overwrite the path.

Here's the principles I am trying to adhere to when writing the above:

  • Avoid the need of new environment variables (either for users or environment maintainers).
  • Prefer resolving path dynamically at run time over caching at installation time to allow more flexibility and transparency.
  • Allow users to programmatically overwrite both ROOT_DIR and paths as needed.
  • The reader modules should be left unchanged from this update (i.e., users ought to be able to call the reader directly and give it whatever path they want to use.)

@JoanneBogart
Copy link
Contributor

@yymao I think we're converging but are not quite of one mind yet.

I agree it's better have register resolve the path, not the readers.

I agree that users should be able to programmatically overwrite both ROOT_DIR and individual paths. I'm not sure that is sufficient for all use cases, though.

We need a way to set up a sensible default. That means getting information about site to the package. (It does not mean resolving paths at that time; I never suggested that.) An advantage of caching site during installation is that anyone running off of a provided installation won't have to think about it at all unless they need to override. The alternative is to always require this information at run time.

If site could be reliably determined from existing system variables I'd be in favor of that, but I don't think it can be. HOSTNAME wouldn't qualify. The pool of hosts is not under our control. If it changes, whatever parses the value could give the wrong answer.

@yymao
Copy link
Member Author

yymao commented Feb 9, 2020

@JoanneBogart It sounds like we are in good agreement except for when and how to obtain site information.

On "when": my proposal is to obtain site information every time when GCRCatalogs is imported, while yours is to obtain site information at installation time and cache. I think in both cases, GCRCatalogs.ROOT_DIR will be set at import time (in my case, using the site information it just obtained; while in your case, using the cached site information).

For users who want to overwrite ROOT_DIR, this doesn't matter at all. For users who want to use default ROOT_DIR, I don't think it matter either because the site information will be obtained automatically in both cases to set ROOT_DIR. So from user's perspective both proposals are mostly indistinguishable.

My preference in obtaining site information at import time comes from the idea that it seems more consistent (i.e., requires less explanation to new devs). Site information is only needed when setting ROOT_DIR, and hence, if I were a new developer, I would more likely expect to find the code the obtains site information around where ROOT_DIR is set.

The counter argument that I can think of, is that if site information cannot be obtained reliably every time when GCRCatalogs is imported, then of course we should cache it at installation time. So this gets us to "how"...

On "how": my proposal is to obtain site information using $HOSTNAME, while yours is to obtain site information using an user-set environment variable. I agree that $HOSTNAME may not be very reliable. On the other hand I certainly want to reduce human intervention, even during installation time. It seems to me that we can use the following code to obtain site information:

>>> import socket                                                                                                                                                                         
>>> socket.getfqdn()                                                                                                                                                                      
'cori03.nersc.gov'

One thing I am not sure is whether this works fine in docker/shifter images. If it does, I think this can be used instead of reading an user-set environment variable.

@JoanneBogart
Copy link
Contributor

@yymao I don't care that much about the "when" part. The only reason for attempting to determine site during installation is 1) if a significant fraction of (probably more casual) users do not do their own installation - I don't know if that's the case or not - and 2) if determination of site is not automatic.
I do have concerns about the reliability of "how" but your idea of using socket looks promising. I tried it from a shifter image at NERSC and got the same response as before starting up shifter; that is, "nersc" was part of the returned string. From a login at cc-in2p3 the returned value was cca009.in2p3.fr
I think this will work as the primary means of determining site. It might be useful to have a way to override it for certain rare development situations.

@JoanneBogart
Copy link
Contributor

@johannct does the above means to determine site (via response to socket.getfqdn() ) seem adequate? It might not be possible to easily distinguish among different in2p3 sites, but maybe that doesn't matter if the only one involved with DESC is cc-in2p3.
I think whatever method we use should be able to positively identify both primary sites rather than just distinguishing between NERSC and not-NERSC.

@johannct
Copy link

yes it seems ok also for CC, the target word would be in2p3 I guess. Can you guys show me how a catalog config would look like.... I got lost in the discussion. Then I would suggest to implement the minimal solution, and we'll need to use it to judge it concretely.

@yymao
Copy link
Member Author

yymao commented Feb 10, 2020

I think in this proposal that @JoanneBogart and I converge on, the config files will not have much change except for updating all the paths to relative.

Then we will need to revise the code so that it

  1. uses socket.getfqdn to sets a module-level variable ROOT_DIR
  2. when loading config files, prepends ROOT_DIR to any paths in the config file (in memory).

@JoanneBogart
Copy link
Contributor

What @yymao said. That is also my understanding.

@johannct
Copy link

ok in my imagination I was contemplating the env variable to be written in the config files, so that base_dir: /global/projecta/projectdirs/lsst/groups/PZ/PhotoZDC2/run1.2p/TRACTHDF5 for instance becomes base_dir: ${ROOT_DIR}/projectdirs/lsst/groups/PZ/PhotoZDC2/run1.2p/TRACTHDF5

@yymao
Copy link
Member Author

yymao commented Feb 10, 2020

I've considered that too, but later feel that's unnecessary, especially because os.path.join makes it very easy to prepend ROOT_DIR to paths.

@johannct
Copy link

right but it is less portable. I do not want to interfere, go ahead you thought this over much more than I did.

@JoanneBogart
Copy link
Contributor

I'm not sure whether this is a good thing or not but --
If we omit the env variable from the files, we have a way to implement backwards compatibility. If the path is absolute, just use it as is. If it is relative, prepend ROOT_DIR

@JoanneBogart
Copy link
Contributor

I guess it could be convenient when working with a new dataset which is not yet in the standard place.

@yymao
Copy link
Member Author

yymao commented Feb 10, 2020

@johannct can you elaborate on the "less portable" point? I am not sure I get it.

@JoanneBogart Indeed. And in fact that's exactly the behavior of os.path.join: os.path.join(a, b) will return just b if b is already an absolute path.

@JoanneBogart
Copy link
Contributor

Once we move to the new scheme config files with absolute paths should not be allowed in tagged versions of gcr-catalogs.

@JoanneBogart
Copy link
Contributor

@yymao If you don't have objections I will try to implement this.

@yymao
Copy link
Member Author

yymao commented Feb 11, 2020

@JoanneBogart sounds good. If you open a draft PR even while you're still working on it, I'll be happy to provide early feedback.

@JoanneBogart
Copy link
Contributor

JoanneBogart commented Feb 11, 2020

@yymao I pushed my branch, then realized I was supposed to do this via a fork. Sorry.
The branch is u/jrb/root_dir
It includes

  • the yaml dict associating site with a root dir
  • modified register.py
  • three test configs called something_rel.yaml where something.yaml was already a config for some catalog.

I haven't confirmed that all the possible path-like keywords are recognized and handled, but I believe it behaves correctly for the cases I've tested.

@yymao
Copy link
Member Author

yymao commented Feb 12, 2020

@JoanneBogart using branches is perfectly fine! Thanks!

I'll take a look and comment. Do you want to open a draft PR? This way it's clear that it is work in progress (and people won't be notified when you push), but it allows people to comment in the PR thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:high register Issues related to the register module
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants