NMDC search capability, with all that it entails... #85

Merged
merged 51 commits into from
Nov 20, 2024


@jeff-cohere (Collaborator) commented Nov 12, 2024

One of the pesky things about prototypes is that they can sustain large disruptive changes when prior assumptions turn out to be invalid. This is one of those situations. :-)

This PR implements most of the capabilities for accessing NMDC:

  • "Search" works, with the caveat that "search" means "filter by study or data object ID". NMDC doesn't support unstructured search queries, so we're just doing what we can here.
  • For files ("data objects", in NMDCese) searched using a specific study ID, metadata is harvested from that study.
  • File staging is unnecessary for NMDC.
  • I've figured out how to get studies associated with individual data objects!
  • File transfers seem to work!
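
To illustrate what "search" reduces to when only ID filters are available, here is a minimal sketch. It is not the code in this PR: `DataObject`, `SearchParams`, and `Filter` are hypothetical names, and the filtering runs over an in-memory slice rather than against the NMDC API.

```go
package main

import (
	"fmt"
	"strings"
)

// DataObject is a hypothetical, simplified stand-in for an NMDC data object.
type DataObject struct {
	ID      string
	StudyID string
	Name    string
}

// SearchParams holds the only two filters this sketch understands.
type SearchParams struct {
	StudyID      string // match data objects that belong to this study
	DataObjectID string // match one specific data object
}

// Filter returns the objects that satisfy every non-empty field in params.
// An empty SearchParams is rejected because there is no free-text fallback.
func Filter(objects []DataObject, params SearchParams) ([]DataObject, error) {
	studyID := strings.TrimSpace(params.StudyID)
	objectID := strings.TrimSpace(params.DataObjectID)
	if studyID == "" && objectID == "" {
		return nil, fmt.Errorf("search requires a study ID or a data object ID")
	}
	var matches []DataObject
	for _, obj := range objects {
		if studyID != "" && obj.StudyID != studyID {
			continue
		}
		if objectID != "" && obj.ID != objectID {
			continue
		}
		matches = append(matches, obj)
	}
	return matches, nil
}

func main() {
	// Made-up IDs, for illustration only.
	objects := []DataObject{
		{ID: "dobj-1", StudyID: "study-a", Name: "reads.fastq"},
		{ID: "dobj-2", StudyID: "study-a", Name: "assembly.fasta"},
		{ID: "dobj-3", StudyID: "study-b", Name: "annotations.gff"},
	}
	found, err := Filter(objects, SearchParams{StudyID: "study-a"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, obj := range found {
		fmt.Println(obj.ID, obj.Name)
	}
}
```

Rejecting an empty query up front makes the limitation explicit to callers instead of silently returning everything.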

I just wanted to get this PR going before it grows any more out of control. Sorry @ialarmedalien !

Also included:

  • Resources are now associated with transfer endpoints, because a Database is now allowed to store files in more than one place. NMDC stores files in Globus collections at PNNL and at NERSC, which violated my assumption that a Database uses only one endpoint for all of its files. Relaxing this assumption forced me to allow a transfer task to have several "subtasks".
  • There's quite a bit of code cleanup here, mostly resulting from breaking apart large files within packages. Go gives all source files within a package the same package scope (unexported identifiers included), so there's no reason to jam everything into a single file. In particular:
    • Error types for each package now reside in errors.go source files, and new error types have been added.
    • The tasks package now has separate source files for the workflow itself (tasks.go), transfer task logic (task.go), and transfer subtask logic (subtask.go).
    • Some dead code (including in dependencies) has been removed.
  • There's still more cleanup to be done, however(!).
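
As a rough picture of what a per-package errors.go can hold, here is a small sketch. `ResourceNotFoundError` and `lookup` are hypothetical names invented for this example, not the error types added in this PR. The point is that a type declared in one file is usable from any other file in the same package, and callers can recover it with `errors.As`.

```go
package main

import (
	"errors"
	"fmt"
)

// ResourceNotFoundError is a hypothetical error type of the kind that could
// live in a package's errors.go file.
type ResourceNotFoundError struct {
	Database   string
	ResourceID string
}

// Error implements the error interface.
func (e *ResourceNotFoundError) Error() string {
	return fmt.Sprintf("resource %q not found in database %q", e.ResourceID, e.Database)
}

// lookup is a hypothetical function that could sit in a different source file
// of the same package and still use the error type declared above.
func lookup(database, id string, known map[string]bool) error {
	if !known[id] {
		return &ResourceNotFoundError{Database: database, ResourceID: id}
	}
	return nil
}

func main() {
	err := lookup("nmdc", "dobj-9", map[string]bool{"dobj-1": true})
	if err == nil {
		fmt.Println("found")
		return
	}
	// Wrap the error with context, then recover the typed error from the chain.
	wrapped := fmt.Errorf("search failed: %w", err)
	var notFound *ResourceNotFoundError
	if errors.As(wrapped, &notFound) {
		fmt.Println("missing resource:", notFound.ResourceID)
	}
}
```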

TODO

  • Fix existing test failures
  • Implement tests for NMDC database
  • Do a bit more cleanup

Closes #83

A transfer task now has one or more subtasks, depending on how many
endpoints are involved in a file transfer.
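
One way to picture the task/subtask split is to group the requested resources by the endpoint that holds them. This is only a sketch under assumptions: `Resource`, `Subtask`, and `SplitByEndpoint` are hypothetical names, and the real types in the tasks package may look quite different.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical file descriptor that records which transfer
// endpoint holds the file.
type Resource struct {
	ID       string
	Path     string
	Endpoint string
}

// Subtask is a hypothetical unit of work: the files to move from one endpoint.
type Subtask struct {
	Endpoint  string
	Resources []Resource
}

// SplitByEndpoint groups resources by source endpoint, producing one subtask
// per endpoint. Subtasks are ordered by endpoint name so output is stable.
func SplitByEndpoint(resources []Resource) []Subtask {
	byEndpoint := make(map[string][]Resource)
	for _, r := range resources {
		byEndpoint[r.Endpoint] = append(byEndpoint[r.Endpoint], r)
	}
	endpoints := make([]string, 0, len(byEndpoint))
	for name := range byEndpoint {
		endpoints = append(endpoints, name)
	}
	sort.Strings(endpoints)
	subtasks := make([]Subtask, 0, len(endpoints))
	for _, name := range endpoints {
		subtasks = append(subtasks, Subtask{Endpoint: name, Resources: byEndpoint[name]})
	}
	return subtasks
}

func main() {
	// Endpoint names and paths are made up for illustration.
	resources := []Resource{
		{ID: "dobj-1", Path: "/a/reads.fastq", Endpoint: "pnnl"},
		{ID: "dobj-2", Path: "/b/assembly.fasta", Endpoint: "nersc"},
		{ID: "dobj-3", Path: "/a/annotations.gff", Endpoint: "pnnl"},
	}
	for _, st := range SplitByEndpoint(resources) {
		fmt.Printf("%s: %d file(s)\n", st.Endpoint, len(st.Resources))
	}
}
```

A task whose files all live on one endpoint ends up with a single subtask, so the single-endpoint case needs no special handling.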
@jeff-cohere
Collaborator Author

UPDATE: I've got NMDC file transfers working! Now that everything's figured out (to first order, anyway), I can put together some tests for the NMDC database implementation.

@jeff-cohere
Collaborator Author

Hey @ialarmedalien . Many thanks for your review. I just put in one last change that tests the NMDC database. I think I have to put in some environment variables to get it to authenticate correctly and then we should be set!
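
For readers wondering how a test might pick up credentials from the environment, here is a minimal sketch. The variable names `EXAMPLE_DB_USER` and `EXAMPLE_DB_PASSWORD`, the `Credentials` type, and `loadCredentials` are all hypothetical; the actual variables and authentication scheme used by this project's tests are not shown in this conversation.

```go
package main

import (
	"fmt"
	"os"
)

// Credentials is a hypothetical pair of values a test needs to authenticate.
type Credentials struct {
	User     string
	Password string
}

// loadCredentials reads both values through the supplied lookup function,
// which has the same shape as os.LookupEnv, and reports which variables are
// required when either one is missing or empty.
func loadCredentials(lookup func(string) (string, bool), userVar, passwordVar string) (Credentials, error) {
	user, okUser := lookup(userVar)
	password, okPassword := lookup(passwordVar)
	if !okUser || !okPassword || user == "" || password == "" {
		return Credentials{}, fmt.Errorf("set %s and %s to run authenticated tests", userVar, passwordVar)
	}
	return Credentials{User: user, Password: password}, nil
}

func main() {
	// The variable names below are placeholders for illustration.
	creds, err := loadCredentials(os.LookupEnv, "EXAMPLE_DB_USER", "EXAMPLE_DB_PASSWORD")
	if err != nil {
		fmt.Println("skipping authenticated checks:", err)
		return
	}
	fmt.Println("running authenticated checks as", creds.User)
}
```

In a real test function, the error branch would typically call `t.Skip` so that runs without secrets (for example, from forks) are skipped instead of failing.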

@jeff-cohere
Collaborator Author

Hmm. GitHub is a lot buggier than it used to be--I guess now that everyone's here, profits are less sensitive to glitches. I'm trying to convince it to use the newly added secrets.

@jeff-cohere jeff-cohere merged commit 4cf1500 into main Nov 20, 2024
2 checks passed
@jeff-cohere jeff-cohere deleted the nmdc branch November 20, 2024 16:53
Development

Successfully merging this pull request may close these issues.

Add an NMDC database integration