add: no namespace remove snapshot support to demux snapshotter #678

Closed

@austinvazquez austinvazquez commented Jun 17, 2022

Signed-off-by: Austin Vazquez [email protected]

Issue #, if available:
#652

Description of changes:
Add a feature to the demux snapshotter that broadcasts snapshot removal and cleanup to all cached snapshotters when no namespace is provided.
Because of the broadcast, we need to mask snapshot-not-found errors, since most snapshotters are unlikely to hold the referenced snapshot.

Reworked internal snapshotter mocks into a single more robust snapshotter mock.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@austinvazquez austinvazquez marked this pull request as ready for review June 17, 2022 15:53
@austinvazquez austinvazquez requested a review from a team as a code owner June 17, 2022 15:53
austinvazquez commented Jun 17, 2022

We now see remove content messages being sent to the in-VM snapshotters and cleanup being scheduled. The errors below should be further reduced by the cache eviction feature (#651), which is currently in draft (#662) and slowed by test issues.

time="2022-06-17T16:26:02.706470754Z" level=debug msg="remove snapshot" key="sha256:d2742e2df4a4530fada3b8dbea2b65c4b6fc03fc89f250e8077ecd0425e1ab6c" snapshotter=demux
time="2022-06-17T16:26:02.706534356Z" level=debug msg="remove content" key="sha256:07de61ed5cef583643702d85315c20e24d9ce8fc6961481806c90912e1e7bef3"
time="2022-06-17T16:26:02.706578419Z" level=debug msg="remove content" key="sha256:3072dc67a6c382e000c3a60911488be108eccf7b7ffaee826489346980d36360"
time="2022-06-17T16:26:02.706596365Z" level=debug msg="remove content" key="sha256:644de93aece85d86312aec7d832bb9caa767a10e1b4fa750a1e35f415ecfac2f"
time="2022-06-17T16:26:02.708508821Z" level=debug msg="schedule snapshotter cleanup" snapshotter=demux
time="2022-06-17T16:26:02.708566192Z" level=debug msg="schedule content cleanup"
time="2022-06-17T16:26:02.709184388Z" level=debug msg="content garbage collected" d="582.637µs"
time="2022-06-17T16:26:02.712571712Z" level=warning msg="snapshot garbage collection failed" error="6 errors occurred:\n\t* failed to walk function on snapshotter[8]: connection error: desc = \"transport: Error while dialing non-temporary vsock dial failure: failed to dial \\\"/srv/firecracker_containerd_tests/8#8/firecracker.vsock\\\" within 100ms: dial unix /srv/firecracker_containerd_tests/8#8/firecracker.vsock: connect: no such file or directory\": unavailable\n\t* failed to walk function on snapshotter[7]: connection error: desc = \"transport: Error while dialing non-temporary vsock dial failure: failed to dial \\\"/srv/firecracker_containerd_tests/7#7/firecracker.vsock\\\" within 100ms: dial unix /srv/firecracker_containerd_tests/7#7/firecracker.vsock: connect: no such file or directory\": unavailable\n\t* failed to walk function on snapshotter[2]: connection error: desc = \"transport: Error while dialing non-temporary vsock dial failure: failed to dial \\\"/srv/firecracker_containerd_tests/2#2/firecracker.vsock\\\" within 100ms: dial unix /srv/firecracker_containerd_tests/2#2/firecracker.vsock: connect: no such file or directory\": unavailable\n\t* failed to walk function on snapshotter[5]: connection error: desc = \"transport: Error while dialing non-temporary vsock dial failure: failed to dial \\\"/srv/firecracker_containerd_tests/5#5/firecracker.vsock\\\" within 100ms: dial unix /srv/firecracker_containerd_tests/5#5/firecracker.vsock: connect: no such file or directory\": unavailable\n\t* failed to walk function on snapshotter[9]: connection error: desc = \"transport: Error while dialing non-temporary vsock dial failure: failed to dial \\\"/srv/firecracker_containerd_tests/9#9/firecracker.vsock\\\" within 100ms: dial unix /srv/firecracker_containerd_tests/9#9/firecracker.vsock: connect: connection refused\": unavailable\n\t* failed to walk function on snapshotter[1]: connection error: desc = \"transport: Error while dialing non-temporary vsock dial failure: failed to dial \\\"/srv/firecracker_containerd_tests/1#1/firecracker.vsock\\\" within 100ms: dial unix /srv/firecracker_containerd_tests/1#1/firecracker.vsock: connect: no such file or directory\": unavailable\n\n: unknown" snapshotter=demux
time="2022-06-17T16:26:02.712649859Z" level=debug msg="garbage collected" d=2.648493ms

@ginglis13 ginglis13 left a comment

LGTM

@@ -158,6 +158,12 @@ func (s *Snapshotter) Commit(ctx context.Context, name string, key string, opts
func (s *Snapshotter) Remove(ctx context.Context, key string) error {

Contributor

Will we implement other APIs that aren't in the Snapshotter interface? Treating "" as a special "delete all" operation seems scary, to be honest. If we add different APIs in the future, we should move CleanupAll and DeleteAll there.

Contributor Author

So the xxxAll functions might be poorly named here. These are implemented on the cache, and we just want to execute Remove on all cached snapshotter entries. In other words, we broadcast the Remove or Cleanup request to all cached snapshotters. In theory, a snapshot could be in use by multiple VMs and only be eligible to be freed in a subset of them, but the remote snapshotter should be able to handle that case.

Contributor Author

@austinvazquez austinvazquez Jun 20, 2022

Would BroadcastRemove be more clear than RemoveAll?

Contributor

RemoveAll is fine.

Contributor

If someone forgets to set the namespace, we delete not just one VM's snapshot but every VM's snapshots. This seems dangerous to me.

That being said, I don't have better alternatives.

Contributor Author

The least I can do is add a configuration option for enabling garbage collection via these broadcasts. Users would then explicitly accept the risk of enabling the feature, and forgetting to set the namespace would be user error. I have also committed to writing documentation for the demux snapshotter, which would outline the risk.

Kern-- commented Jun 30, 2022

What is the use-case for non-namespaced Remove? I can see non-namespaced Cleanup, but it's not clear what the goal is for Remove. Is it something that containerd GC does or something?

@austinvazquez
Contributor Author

> What is the use-case for non-namespaced Remove? I can see non-namespaced Cleanup, but it's not clear what the goal is for Remove. Is it something that containerd GC does or something?

That's exactly correct. Based on observations, containerd appears to do a non-namespaced walk followed by removes after a layer removal for garbage collection. I'm sure that is an oversimplification.

@austinvazquez
Contributor Author

After a conversation with @kzys, I decided to close this PR and open an issue discussing potential paths forward for supporting garbage collection with remote snapshotters. Stay tuned.

@austinvazquez austinvazquez deleted the remote-snapshotter-garbage-collection branch July 8, 2022 16:58