[FEATURE REQUEST] Ability to delete data from the history #80
@lantiga, can you comment on how the regulations outlined in the GDPR may drive the need for such a feature?

TL;DR: This isn't possible with the current architecture, it is likely never going to be fully enforceable, and we may not want to enable such a feature in the first place. There are a few issues that may make this impossible without some serious thought:
Points 1 and 2 are by far the biggest holdups. I really can't see a path forward here without an insane amount of effort (if it's reasonably possible at all). Thoughts @lantiga or @elistevens?
I think that the legal angle is going to become increasingly important. On one side you have GDPR, DMCA, etc., and on the other are the risks of accidentally vacuuming up something unsavory from the web. It doesn't really matter if some client somewhere still has the content; what matters is that a user who has been notified that they have content to remove can comply without burning down their entire data store.

I think that the important part is that there is a path forward, not so much that it's clean and easy and transparent. I think it's fine to flag a number of content hashes as "deleted due to legal" or w/e, and if that ends up breaking other, non-problematic samples, well, that's the cost. Keep the reference to the hash, but nuke the content.

Another option would be to select certain branch heads and tags and say "recreate the repo, but only with these branches/tags available." You'd lose a ton of history, but that's kind of the point (who knows how long the offending content has been around). To be clear, I mean only the heads of the branches would be present; no historical commits. Of course, only the human-readable labels would remain the same; all of the underlying hashes would be new.
**Intro**

I can see how the legal aspects alone make this almost a certain requirement in the future, but it's going to be quite a challenge to implement. We may want to talk to some experts to understand what the regulations actually require before spending what will likely be a significant chunk of development time on this. It's actually a rather interesting problem to think about; it's definitely not impossible, but it will need to be implemented with care. The biggest challenge is to keep user-facing impact to an absolute minimum; ideally we wouldn't even have to implement anything new at the top level.

**Problem**

The biggest challenge in my mind is related to my point #2 above:
Rewriting history in this way is almost entirely equivalent to performing a `rebase`. A few of the implications/scenarios we want to avoid:
I can see many scenarios where getting this wrong significantly handicaps the project in the long term. In my current view, losing history is the worst possible thing we could do (it would open up many more problems than it would solve).

**Idea?**

I'm thinking of something slightly different: keeping more history. I think our problems would be solved if we had a system which not only keeps a record of what events occurred and how we, as observers at this moment in time, believe they relate to each other historically, but which also keeps a (nearly) perfect recounting of how all of our long-lost ancestors actually experienced those same events at their points in time.

Think of a repository's history as events written in a diary passed down through generations of a family. Sure, things would probably look very different from our ancestors' point of view, and sure, some details or precious artifacts may have been lost over the centuries, but by keeping a log which survives the march of time (or an attempt to rewrite some of our history), we would have a record allowing us to guide someone at any point in time to the state the repository exists in now.

**In more technical terms**

Say we then had to go back to some earlier commit and rewrite everything after it. Both records of history are kept, so that even though the data (and the commit digests referencing it) have changed, every old commit can still be deterministically mapped to its new state in the updated history. This would solve essentially every problem a history rewrite introduces.
The only caveat is that during a "jump" into a new history, we do not allow squash operations. It's important to have a mapping of each old commit to its new state in the updated history. I'm ignoring a whole ton of complexity which surrounds this entire venture, but I'm interested in your thoughts @lantiga and @elistevens.
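If it helps, the "keep more history" idea above can be sketched as a small remap table recorded alongside the normal commit DAG. All names here are hypothetical illustrations, not Hangar's actual API:

```python
# Hypothetical sketch: each history rewrite records an old -> new commit
# mapping, so any stale reference can be deterministically resolved to its
# counterpart in the current history.

class SuperDAG:
    def __init__(self):
        # old commit digest -> new commit digest, accumulated across rewrites
        self.remap = {}

    def record_rewrite(self, mapping):
        """Record the old->new commit mapping produced by one history rewrite."""
        self.remap.update(mapping)

    def resolve(self, commit):
        """Follow remap links until reaching a commit in the current history."""
        seen = set()
        while commit in self.remap:
            if commit in seen:  # guard against accidental cycles
                raise ValueError("cyclic remap")
            seen.add(commit)
            commit = self.remap[commit]
        return commit

dag = SuperDAG()
dag.record_rewrite({"a1": "b1", "a2": "b2"})  # first rewrite
dag.record_rewrite({"b2": "c2"})              # second rewrite
assert dag.resolve("a2") == "c2"  # an ancient reference still resolves
assert dag.resolve("c2") == "c2"  # current commits resolve to themselves
```

Note that no squashing happens here: every old commit keeps exactly one successor, which is what makes the resolution deterministic.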
I think that you're trying to solve a much larger problem than you actually need to. I don't know enough about the internal implementation to propose a concrete solution, but I think that this should be close enough to inspire something that actually works. Let's say you've got 4 tensors: the text of War and Peace (W&P), the text of Harry Potter (HP), and the two concatenations of them (WP-HP and HP-WP), each stored as a sequence of content-addressed chunks.
W&P has the following chunk SHAs: `111 222 333`
HP has: `aaa bbb ccc`
Combined WP-HP: `111 222 399 b99 c99`
Combined HP-WP: `aaa bbb cff 2ff 3ff`

The combined texts have changed SHAs for the second work because the chunks have different split boundaries. You get a legal nastygram saying that HP, WP-HP, and HP-WP need to be removed. That translates into the SHAs:

`aaa bbb ccc 111 222 399 b99 c99 cff 2ff 3ff`

That's every chunk present except `333`. Note that the `111` and `222` chunks are false positives. How those get handled is, IMO, not super relevant to this issue (you'd need more robust tools to examine samples that have chunk overlap with problematic chunks, and aside from a chunk that consists of only 0.0s, I suspect that will be very rare in practice). Let's say that we don't do anything fancy and just nuke them all.

At your chunk storage layer, you keep a list of all purged chunk SHAs. When a user tries to get a sample that has a purged SHA, they instead get an exception.

It might be nice to have the purge also make a commit that removes the samples from some list of branches, so that going forward users won't be getting exceptions when they iterate over the data. Note that this is the only part of any of this that touches history or the DAG. Nothing existing changes, except that the purged data is now kinda in this phantom state; the ghost remains (SHA, size, etc.) but the actual body is gone.
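The purge-list mechanism described above can be sketched in a few lines. This is a toy model with made-up names, not Hangar's storage layer:

```python
# Sketch: the chunk store keeps a set of purged SHAs. Content is deleted,
# but the SHA is remembered so reads fail loudly instead of returning bytes.

class PurgedChunkError(Exception):
    pass

class ChunkStore:
    def __init__(self):
        self.chunks = {}     # sha -> bytes
        self.purged = set()  # shas whose content was legally removed

    def put(self, sha, data):
        self.chunks[sha] = data

    def purge(self, shas):
        for sha in shas:
            self.chunks.pop(sha, None)  # the content is truly gone...
            self.purged.add(sha)        # ...but the "ghost" SHA remains

    def get(self, sha):
        if sha in self.purged:
            raise PurgedChunkError(f"chunk {sha} was purged")
        return self.chunks[sha]

store = ChunkStore()
for sha in ("111", "222", "333", "aaa", "bbb", "ccc"):
    store.put(sha, sha.encode())
store.purge(["aaa", "bbb", "ccc", "111", "222"])  # the legal request
assert store.get("333") == b"333"                 # untouched chunks still read
try:
    store.get("aaa")                              # purged chunks raise
except PurgedChunkError:
    pass
```

The design choice worth noting is that `purge` keeps the key: the "phantom" record is what lets syncing and auditing still reason about what existed.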
@elistevens, I should actually clarify a few important points about the internals:
As a result of point 1, the example you gave would actually result in the following situation:
A request to remove the three offending tensors would therefore just be a request to remove their three full-sample digests.
A much more difficult case arises from point 2. Say we have a very simple dataset containing samples named by some user id:
If we received a request to remove the data for the account of `user1`, we couldn't simply delete the digest from the storage layer, since `user4`'s sample resolves to the exact same digest. To handle this case, we would instead have to overwrite the sample's digest record in every commit which references it.

**This is the point where things break down**

In order to generate the commit hash digest, we basically hash a concatenation of the parent commit's hash with every sample digest recorded in the commit. If we were to do this after marking sample digests as removed, every commit hash from that point forward would change, and none of the rewritten history could be verified against the original.

Since we can't keep the digest attached to the sample for legal reasons (someone could just modify the hangar code to ignore the flag variable and retrieve content they shouldn't be able to), nor can we just remove it from history entirely without causing huge issues, I can't see any way around the need for a way to deterministically map the transformation of one recollection of history to a new (verifiable) one.

**In response**
This is a great idea! I'll remember it when the time comes!
Hopefully I'm not missing something obvious here, but from my understanding the phantom state is the root cause of all the problems I've mentioned previously. Does that make sense?
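A toy sketch of the verification problem described above, assuming (purely for illustration) that a commit digest is a hash of the parent digest concatenated with the sorted sample digests; none of these names are Hangar's real internals:

```python
# Why tombstoning a sample digest breaks verification: every descendant
# commit hash changes, so the rewritten history no longer matches any
# previously recorded hash.
import hashlib

def commit_digest(parent, sample_digests):
    # illustrative stand-in for Hangar's real commit serialization
    payload = parent + "".join(sorted(sample_digests))
    return hashlib.sha1(payload.encode()).hexdigest()

# original history: two commits
c1 = commit_digest("root", ["d_user1", "d_user2"])
c2 = commit_digest(c1, ["d_user1", "d_user2", "d_user3"])

# tombstone user1's digest and recompute the chain
c1_new = commit_digest("root", ["TOMBSTONE", "d_user2"])
c2_new = commit_digest(c1_new, ["TOMBSTONE", "d_user2", "d_user3"])

# every commit from the tombstoned one forward gets a new hash, which is
# why a deterministic old -> new mapping would be needed to keep the
# rewritten history verifiable
assert c1_new != c1
assert c2_new != c2
```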
Separate out the "remove user1's data" use case, since that's not the same thing. Maybe one mechanism can solve both, maybe not. There, the problem isn't the value "24"; it's the association of the tensors that make up `user1`'s record.

I have been assuming that there's a separation between the objects that represent the state of the repo and the data storage that contains the actual tensor bytes, similar to how git has commits, trees, and content. I'm saying you can keep the commits and trees, and replace the purged object with a tombstone that says "sorry bro, this got purged. trust me, it totally had the hash 1234 when it existed." That happens outside of your commit/tree structure, and the data is truly gone. You'd also have to propagate tombstones when syncing, but that shouldn't be too difficult (I know, famous last words).
@elistevens The problem I see with replacing a data piece with a tombstone comes back to the same thing @rlizzo mentioned: what if two commits (data of two people) point to the same data piece in the storage layer? As in the above example, if user1 deletes his/her data and user4 doesn't, we need to keep the data piece intact but somehow remove the pointer from user1's hash to the actual data piece. But what if we just remove the retrieval-info value for the digest key? So we have two cases:
For those users who try to sync the data whenever they want, we could look at the value of the hash keys and invalidate it or delete the actual data as needed. And then whenever we find an invalidated value, we treat that digest as removed.
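One way the "remove the retrieval-info value" suggestion could look, with entirely hypothetical names standing in for the digest/backend records:

```python
# Sketch: keep the digest key, but replace its retrieval-spec value with a
# sentinel so the backend can no longer locate (or has deleted) the bytes.

REMOVED = None  # sentinel meaning "retrieval info invalidated"

backend = {"chunk-bytes-for-user1": b"\x00\x01"}   # fake storage layer
specs = {"digest-user1": "chunk-bytes-for-user1"}  # digest -> retrieval info

def invalidate(digest, delete_data=False):
    spec = specs.get(digest)
    if delete_data and spec is not None:
        backend.pop(spec, None)  # optionally remove the bytes themselves
    specs[digest] = REMOVED      # the key survives; the pointer does not

def fetch(digest):
    spec = specs[digest]
    if spec is REMOVED:
        raise KeyError(f"{digest}: retrieval info removed")
    return backend[spec]

invalidate("digest-user1", delete_data=True)
assert "digest-user1" in specs  # digest key is still present
try:
    fetch("digest-user1")       # but the data is unreachable
except KeyError:
    pass
```

As the surrounding comments note, this keeps history intact only as long as commit verification doesn't hash over the (now-invalidated) values.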
Again, that's a different use case, because "age: 24" isn't the same thing as the complete text of Harry Potter, and fixing it will require a different approach. For the case where you need to delete an association like that, some other mechanism will be required. But I think that tombstones can solve the Harry Potter problem.
@elistevens I think this is the fundamental difference in the assumptions we are making, which is leading us to different conclusions. From the perspective of Hangar, I don't think there should be a distinction between the objects that represent the state of the repo and the storage that contains the actual data.

@hhsecond, the problem with:
Leads exactly to the points which I've outlined above.
Does that make sense? Or am I misreading the question?
I think that what you're calling "deleted data" there is just "not available on the HEAD of this branch", right? You could still check out an older commit and the "deleted" data would still be there, right? My desire is to have a mechanism where the data is 100% unavailable. If you're shipping a repo that has a copyright violation, it doesn't really matter whether you have to check out an old commit to get it.
Not quite. I'm saying that if you delete data at HEAD, the changes have to propagate back through history to the initial point the data was added to the repo. That's why there are so many issues with history.
Oh, yeah, rebasing everything from the moment the problematic data was added forward is going to be awful. The whole point of tombstones is to avoid that in the cases where it's only the data bytes that are problematic.
Yeah... considering the amount of work involved, this isn't happening any time soon. At the very least, we should take this thread and archive the discussion and potential solutions on our docs page for future reference (though I'm still interested in @lantiga's opinion).

@elistevens, would you be available for a conference call one of these days to brainstorm potential options or workarounds?
Hey! Late to the party it seems, but great discussion. For the time being, tombstones look like the only viable solution to me. The privacy issue might live in different spots, but all of them relate to individual samples:
We need to be able to purge all of these from history, and each case has to deal with a different place:

- the tensor
- the association
- the metadata
- the sample name

As a general consideration, we absolutely need to encode the recipe for the changes in some way, so that whoever cloned the dataset has a chance to still work with it by applying the same changes: maybe just overwriting the metadata, possibly preserving local changes, or applying locally the same "recipe" that was applied remotely.

In general I really liked the super-DAG idea that @rlizzo had; let's think about the implications.

One other sticky point is blocking pushes to a remote if data that has been tombstoned on the remote hasn't been tombstoned locally. We need to avoid undoing the changes due to a stale local repo pushing data back.

Great discussion. It would be ideal to have a minimal plan for implementation that we might or might not enact right away.
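One possible shape for the "recipe" and push-blocking points above, with entirely hypothetical names: the remote keeps an append-only purge log, clones replay it, and pushes are refused until a clone has caught up:

```python
# Sketch: an append-only log of purge operations published by the remote.
# Clones replay unapplied entries; a stale clone (one that hasn't replayed
# the full recipe) is refused push access, so it can't reintroduce purged
# data. All names are illustrative, not Hangar's API.

remote_recipe = [("tombstone", "sha-aaa"), ("tombstone", "sha-bbb")]

class Clone:
    def __init__(self):
        self.applied = 0        # how many recipe entries we've replayed
        self.tombstoned = set()

    def replay(self, recipe):
        for op, sha in recipe[self.applied:]:
            if op == "tombstone":
                self.tombstoned.add(sha)  # nuke local bytes, keep the sha
        self.applied = len(recipe)

    def push_allowed(self, recipe):
        # block pushes from clones that haven't applied every purge yet
        return self.applied == len(recipe)

clone = Clone()
assert not clone.push_allowed(remote_recipe)  # stale clone: push refused
clone.replay(remote_recipe)
assert clone.push_allowed(remote_recipe)
assert clone.tombstoned == {"sha-aaa", "sha-bbb"}
```

Making the log append-only is what lets the same recipe be applied deterministically on every clone, regardless of local changes.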
Is your feature request related to a problem? Please describe.
Currently, we can't delete data from history once it is committed. In cases where an end user triggers a data-deletion request, organizations using Hangar for data storage and versioning will have to delete it from everywhere, including history.