Devise a mechanism for determining whether a file already exists at a destination. #66

Open
jeff-cohere opened this issue Apr 22, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@jeff-cohere
Collaborator

We've talked here and there about how to minimize unnecessary data transfers, and discussed the merits and drawbacks of various approaches. In particular, I'm not crazy about using a log to figure out where a file should or shouldn't be--I'd rather ask the source of truth itself!

Along these lines, I'm considering an additional endpoint for the Database specification that searches for files by their MD5 checksums directly, instead of using search queries. This endpoint would accept an array of checksums and return the corresponding file IDs, with null for any checksum that isn't found.
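A minimal sketch of the lookup behind such an endpoint, to make the request and response shape concrete. Everything named here is hypothetical: `find_files_by_md5`, the in-memory `CHECKSUM_INDEX`, and the file ID format are placeholders, not part of the Database specification or any existing API.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for whatever the database uses to map
# MD5 checksums to file IDs. The IDs are made up for illustration.
CHECKSUM_INDEX: Dict[str, str] = {
    "d41d8cd98f00b204e9800998ecf8427e": "file-0001",
    "9e107d9d372bb6826bd81d3542a419d6": "file-0002",
}


def find_files_by_md5(
    checksums: List[str],
    index: Optional[Dict[str, str]] = None,
) -> List[Optional[str]]:
    """Return one entry per requested checksum, in request order.

    Each entry is the matching file ID, or None if no file with that
    checksum is known. Checksums are compared case-insensitively.
    """
    if index is None:
        index = CHECKSUM_INDEX
    return [index.get(c.strip().lower()) for c in checksums]


if __name__ == "__main__":
    # The second checksum is not in the index, so its slot is None.
    print(find_files_by_md5([
        "D41D8CD98F00B204E9800998ECF8427E",
        "00000000000000000000000000000000",
    ]))
```

Keeping the result the same length and order as the request lets a caller pair each checksum with its answer without a second lookup.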

Obviously this is a very complicated problem to solve, and the above approach doesn't begin to handle all of the nastiness to do with files that have been transferred but don't yet have IDs, etc. But I think it would at least give us a solid point of departure. I think I can probably stand up a JDP file checksum search endpoint that uses JAMO.

@jeff-cohere jeff-cohere added the enhancement New feature or request label Apr 22, 2024
@jeff-cohere
Collaborator Author

JAMO does maintain an md5sum field in its records, but it's hard to know how many records have it populated. It's also not an indexed field, so queries that filter on it perform terribly. I've had no luck getting results from JAMO queries that reference known md5 checksums. So unless I'm overlooking something, it doesn't look like JAMO can provide this capability right now.

The JAMO documentation says that it's possible to ask the team to add another index. That's an option to explore as this becomes more important.
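To illustrate why the missing index matters, here is a generic sketch contrasting a full scan with an indexed lookup. The record layout is an assumption for illustration only; it is not JAMO's actual schema or query interface.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical records; some have no md5sum populated.
Record = Dict[str, Optional[str]]


def scan_for_md5(records: List[Record], md5: str) -> List[str]:
    """Without an index, every record is examined on every query."""
    return [r["id"] for r in records if r.get("md5sum") == md5]


def build_md5_index(records: List[Record]) -> Dict[str, List[str]]:
    """Build the index once; later lookups no longer scan all records."""
    index: Dict[str, List[str]] = defaultdict(list)
    for r in records:
        if r.get("md5sum"):  # skip records where the field is empty
            index[r["md5sum"]].append(r["id"])
    return index


if __name__ == "__main__":
    records = [
        {"id": "file-0001", "md5sum": "d41d8cd98f00b204e9800998ecf8427e"},
        {"id": "file-0002", "md5sum": None},
        {"id": "file-0003", "md5sum": "d41d8cd98f00b204e9800998ecf8427e"},
    ]
    index = build_md5_index(records)
    target = "d41d8cd98f00b204e9800998ecf8427e"
    # Both approaches agree; only the cost per query differs.
    print(scan_for_md5(records, target), index.get(target, []))
```

Records with an unpopulated md5sum never appear in either result, which is the other half of the problem: an index helps with speed but not with coverage.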

@jeff-cohere
Collaborator Author

Good news here--Chris Beecroft told me he would add md5 checksums to the set of indexed JAMO fields along with some others requested by the JDP team. So it looks like we'll be able to work with the JDP team to add an endpoint that supports querying files by md5 checksum soon.
