
Accidental deletion of a container #1268

Closed · xmonader opened this issue May 10, 2021 · 7 comments
Labels: type_bug (Something isn't working)
@xmonader (Collaborator)

[screenshot]

35722 https://explorer.testnet.grid.tf/api/v1/reservations/workloads/35722

@xmonader xmonader added the type_bug Something isn't working label May 10, 2021
@xmonader xmonader added this to the now milestone May 10, 2021
@LeeSmet (Contributor) commented May 11, 2021

It seems the container was already deleted on the 29th of April around 17:20 CEST. The logs on the node show that around this time some new reservations were deployed and their flists were not in cache, including another container using the same flist. Since the container was supposedly running at the time, the flist should have been there. For some reason the container exited, got restarted by the container daemon, and then failed because zinit was not found in the path, which further supports that the flist was no longer there. We will need to investigate what exactly caused this.

About 2 minutes later the daemons were restarting, though there is no indication of an upgrade; this is possibly related.

@LeeSmet LeeSmet removed their assignment May 11, 2021
@xmonader (Collaborator, Author)

@muhamadazmy (Member)

I need to clarify something first: a node can initiate a delete if it fails to start a workload, even if that workload has been running for some time. So an error that crashes the workload, or a reboot after which the node cannot bring the workload back to its running state, will cause it to be deleted, since that is the only way to communicate an error to the owner. It's better than having it reported as deployed but not actually running.
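
For clarity, here is a minimal sketch of that policy as I understand it from the description above; it is not the real zos code, and `Workload`, `Start` and `Decommission` are hypothetical names used for illustration only:

```go
// Hypothetical sketch of the failure-handling policy described above.
package workload

import "fmt"

type Workload interface {
	Start() error
	Decommission(reason error) error
}

// ensureRunning tries to (re)start a workload, for example after a node
// reboot. If the workload cannot be brought to its running state, it is
// decommissioned so the error is communicated back to the owner instead of
// being reported as deployed while not actually running.
func ensureRunning(w Workload) error {
	if err := w.Start(); err != nil {
		return w.Decommission(fmt.Errorf("failed to start workload: %w", err))
	}
	return nil
}
```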

Also, this container got deleted on the 29th of April, so that is already a long time ago. Any reason why it was only reported recently?

So now about what I think has happened:
the logs from this time period show that the machine was booting up. For some reason the node couldn't redeploy the container (it seems the flist mount was somehow corrupted), hence the container failed to start, which caused the deletion.

I will have to look deeper into the logs to see what exactly happened to the flist mount.

@muhamadazmy (Member)

On the other hand, the bot should recover by redeploying another container on a different node if this node suddenly becomes unreachable.
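
As an illustration only (the real bot lives in js-sdk and is not shown here), this is one shape such a recovery loop could take; all type and method names below are hypothetical:

```go
// Sketch of a watcher that redeploys a container elsewhere when its node
// stops responding.
package bot

import "time"

type Node interface {
	Reachable() bool
}

type Deployer interface {
	Redeploy(containerID string, target Node) error
}

// watch periodically checks the node hosting a container and, if the node
// becomes unreachable, asks the deployer to move the container to a node
// chosen by pick. It returns once a redeployment succeeds.
func watch(d Deployer, containerID string, current Node, pick func() Node, interval time.Duration) {
	for range time.Tick(interval) {
		if current.Reachable() {
			continue
		}
		if err := d.Redeploy(containerID, pick()); err == nil {
			return
		}
	}
}
```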

@muhamadazmy (Member) commented May 12, 2021

After investigating the issue more and looking deeper into the state of the node, I found the cause of this issue:
[screenshot]

The OOM killer decided to kill some of the running 0-fs processes, which caused the container mount to fail, and hence the container itself.

What we can do to avoid this in the future:

  • Increase the container memory overhead, to account for the 0-fs process that backs the container
  • Protect 0-fs processes against OOM by setting the OOM priority of the process (a rough sketch of this follows below)
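
A minimal sketch of the second idea (not the actual zos/0-fs code): after starting a 0-fs process, lower its oom_score_adj through the standard Linux procfs interface so the kernel OOM killer never selects it. The `0-fs` invocation below is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// protectFromOOM writes the given adjustment (-1000..1000) to the process's
// oom_score_adj file; -1000 makes the process ineligible for OOM selection.
// Lowering the score requires root or CAP_SYS_RESOURCE.
func protectFromOOM(pid, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", adj)), 0644)
}

func main() {
	cmd := exec.Command("0-fs", "--help") // placeholder invocation
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	if err := protectFromOOM(cmd.Process.Pid, -1000); err != nil {
		fmt.Fprintln(os.Stderr, "failed to adjust oom score:", err)
	}
	_ = cmd.Wait()
}
```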

Edit:

  • NodeID: 8zPYak76CXcoZxRoJBjdU69kVjo7XYU1SFE2NEK4UMqn
  • The node seems to have other hardware issues, so it might not be a memory capacity planning issue that triggered the OOM. The 2 containers that were randomly deleted were running on this exact node

@muhamadazmy (Member)

So this was caused by the following issues:

  • The node itself had a physical problem with one of the disks; this has been replaced
  • The OOM killer killed the 0-fs process for the container. This is completely random, so a PR has been opened to make sure the 0-fs process is never selected by the OOM killer: 0-fs oom prio #1272

@sasha-astiadi sasha-astiadi linked a pull request May 12, 2021 that will close this issue
@sasha-astiadi sasha-astiadi removed a link to a pull request May 12, 2021
@xmonader xmonader added this to Backlog in TFGrid_2.8 via automation May 20, 2021
@xmonader xmonader moved this from Backlog to Verification in TFGrid_2.8 May 20, 2021
@muhamadazmy muhamadazmy added this to Reviewer approved in ZOS_0.4.X May 20, 2021
TFGrid_2.8 automation moved this from Verification to Done May 27, 2021
ZOS_0.4.X automation moved this from Verification to Done May 27, 2021
@sasha-astiadi sasha-astiadi removed this from Done in TFGrid_2.8 May 31, 2021
@sameh-farouk (Member)

@xmonader @muhamadazmy although this is hard to verify, I saw behavior similar to the one described in this issue happen even after the fix was merged.
See: threefoldtech/js-sdk#3148
Feel free to close this issue if you believe it is a different one.

@sameh-farouk sameh-farouk reopened this Jun 3, 2021
ZOS_0.4.X automation moved this from Done to In progress Jun 3, 2021
@xmonader xmonader removed this from the now milestone Jul 4, 2022