
Accidental deletion of a container #1268

Closed · xmonader opened this issue May 10, 2021 · 7 comments
Labels: type_bug (Something isn't working)
@xmonader (Collaborator)

[screenshot]

35722 https://explorer.testnet.grid.tf/api/v1/reservations/workloads/35722

@xmonader xmonader added the type_bug Something isn't working label May 10, 2021
@xmonader xmonader added this to the now milestone May 10, 2021
@LeeSmet (Contributor) commented May 11, 2021

It seems the container was already deleted on the 29th of April around 17:20 CEST. The logs on the node show that around this time some new reservations were deployed and their flists were not in cache, including another container using the same flist. Since the container was supposedly running at the time, the flist should have been there. For some reason the container exited, got restarted by the container daemon, and then failed because zinit was not found in the path, which further supports that the flist was no longer there. We will need to investigate what exactly caused this.

About 2 minutes later the daemons were restarting, though there is no indication of an upgrade; this is possibly related.

@LeeSmet LeeSmet removed their assignment May 11, 2021
@xmonader (Collaborator, Author)

@muhamadazmy (Member)

I need to clarify something first: a node can initiate a delete if it fails to start a workload, even if that workload has been running for some time. So an error that crashes the workload, or a reboot after which the node cannot bring the workload back to its running state, will cause it to be deleted, since that is the only way to communicate an error to the owner. It's better than having it reported as deployed but not actually running.
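
For clarity, here is a minimal sketch of that policy as I understand it from the description above; it is not the real zos code, and `Workload`, `Start` and `Decommission` are hypothetical names used for illustration only:

```go
// Hypothetical sketch of the failure-handling policy described above.
package workload

import "fmt"

type Workload interface {
	Start() error
	Decommission(reason error) error
}

// ensureRunning tries to (re)start a workload, for example after a node
// reboot. If the workload cannot be brought to its running state, it is
// decommissioned so the error is communicated back to the owner instead of
// being reported as deployed while not actually running.
func ensureRunning(w Workload) error {
	if err := w.Start(); err != nil {
		return w.Decommission(fmt.Errorf("failed to start workload: %w", err))
	}
	return nil
}
```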

Also, this container got deleted on the 29th of April, so that is already a long time ago. Any reason why it was only reported recently?

So now about what I think has happened:
the logs from this time period show that the machine was booting up. For some reason the node couldn't redeploy the container (it seems the flist mount was somehow corrupted), hence the container failed to start, which caused the deletion.

I will have to look deeper into the logs to see what exactly happened to the flist mount.

@muhamadazmy (Member)

On the other hand, the bot should recover by redeploying another container on a different node if this node suddenly becomes unreachable.
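
As an illustration only (the real bot lives in js-sdk and is not shown here), this is one shape such a recovery loop could take; all type and method names below are hypothetical:

```go
// Sketch of a watcher that redeploys a container elsewhere when its node
// stops responding.
package bot

import "time"

type Node interface {
	Reachable() bool
}

type Deployer interface {
	Redeploy(containerID string, target Node) error
}

// watch periodically checks the node hosting a container and, if the node
// becomes unreachable, asks the deployer to move the container to a node
// chosen by pick. It returns once a redeployment succeeds.
func watch(d Deployer, containerID string, current Node, pick func() Node, interval time.Duration) {
	for range time.Tick(interval) {
		if current.Reachable() {
			continue
		}
		if err := d.Redeploy(containerID, pick()); err == nil {
			return
		}
	}
}
```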

@muhamadazmy (Member) commented May 12, 2021

After investigating the issue more and looking deeper into the state of the node, I found the cause of this issue:
[screenshot]

The OOM killer decided to kill some of the running 0-fs processes, which caused the container mount to fail, and hence the container itself.

What we can do to avoid this in the future:

  • Increase the container memory overhead, to account for the 0-fs process that backs the container
  • Protect 0-fs processes against OOM by setting the OOM priority of the process (a rough sketch of this follows below)
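
A minimal sketch of the second idea (not the actual zos/0-fs code): after starting a 0-fs process, lower its oom_score_adj through the standard Linux procfs interface so the kernel OOM killer never selects it. The `0-fs` invocation below is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// protectFromOOM writes the given adjustment (-1000..1000) to the process's
// oom_score_adj file; -1000 makes the process ineligible for OOM selection.
// Lowering the score requires root or CAP_SYS_RESOURCE.
func protectFromOOM(pid, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", adj)), 0644)
}

func main() {
	cmd := exec.Command("0-fs", "--help") // placeholder invocation
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	if err := protectFromOOM(cmd.Process.Pid, -1000); err != nil {
		fmt.Fprintln(os.Stderr, "failed to adjust oom score:", err)
	}
	_ = cmd.Wait()
}
```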

Edit:

  • NodeID: 8zPYak76CXcoZxRoJBjdU69kVjo7XYU1SFE2NEK4UMqn
  • The node seems to have other hardware issues, so it might not be a memory capacity planning issue that triggered the OOM. The 2 containers that were randomly deleted were running on this exact node

@muhamadazmy (Member)

So this was caused by the following issues:

  • The node itself had a physical problem with one of the disks; this has been replaced
  • The OOM killer killed the 0-fs process for the container. This is completely random, so a PR has been opened to make sure the 0-fs process is never selected by the OOM killer: 0-fs oom prio #1272

@sasha-astiadi sasha-astiadi linked a pull request May 12, 2021 that will close this issue
@sasha-astiadi sasha-astiadi removed a link to a pull request May 12, 2021
@xmonader xmonader added this to Backlog in TFGrid_2.8 via automation May 20, 2021
@xmonader xmonader moved this from Backlog to Verification in TFGrid_2.8 May 20, 2021
@muhamadazmy muhamadazmy added this to Reviewer approved in ZOS_0.4.X May 20, 2021
TFGrid_2.8 automation moved this from Verification to Done May 27, 2021
ZOS_0.4.X automation moved this from Verification to Done May 27, 2021
@sasha-astiadi sasha-astiadi removed this from Done in TFGrid_2.8 May 31, 2021
@sameh-farouk (Member)

@xmonader @muhamadazmy although this is hard to verify, I saw behavior similar to the one described in this issue happen even after the fix was merged.
See: threefoldtech/js-sdk#3148
Feel free to close this issue if you believe it is a different one.

@sameh-farouk sameh-farouk reopened this Jun 3, 2021
ZOS_0.4.X automation moved this from Done to In progress Jun 3, 2021
@xmonader xmonader removed this from the now milestone Jul 4, 2022