
consider reducing TerminationGracePeriodSeconds for spin-apps deployment/pod spec #118

Closed
rajatjindal opened this issue Mar 2, 2024 · 4 comments

Comments

@rajatjindal
Member

I was trying to understand why the scaling down of spin apps (after manually editing the number of replicas) is taking so long. It is likely due to the default 30s value of TerminationGracePeriodSeconds when creating the spin-app pods.

I reduced TerminationGracePeriodSeconds to 2s in my local setup (via a custom build of spin-operator), after which scale-down is quite fast. I believe this change will also help with HPA- or KEDA-based scale-down.

We should consider setting a sensible default and possibly making it configurable on the SpinApp CRD.
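
For illustration, a minimal sketch of how the operator could apply a configurable grace period when it builds the pod spec, assuming client-go's corev1 types; the helper name below is hypothetical and not part of spin-operator:

```go
package main

import corev1 "k8s.io/api/core/v1"

// withGracePeriod is a hypothetical helper showing how an operator could set
// TerminationGracePeriodSeconds on the pod spec it generates for a SpinApp.
// Kubernetes defaults this to 30s; a lower value makes scale-down faster but
// leaves less time for in-flight requests to drain after SIGTERM.
func withGracePeriod(spec corev1.PodSpec, seconds int64) corev1.PodSpec {
	spec.TerminationGracePeriodSeconds = &seconds
	return spec
}
```

On the CRD side this could surface as a field on the SpinApp spec (name to be decided) that is passed through to the deployment's pod template.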

@endocrimes
Contributor

It needs to be >= the length of the longest request you expect to receive, to allow in-flight requests to drain safely.

If it's not shutting down after draining in-flight requests and instead waits for a SIGKILL, that sounds like a bug in Spin or the shim.

@rajatjindal
Member Author

rajatjindal commented Mar 3, 2024

You are right. I looked into the containerd logs:

time="2024-03-03T02:54:23.254922505Z" level=info msg="StopContainer for \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\" with timeout 30 (s)"
time="2024-03-03T02:54:23.255270546Z" level=info msg="Stop container \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\" with signal terminated"
time="2024-03-03T02:54:23.255412963Z" level=info msg="sending signal 15 to instance: a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec"

### 30 Seconds later

time="2024-03-03T02:54:53.267455088Z" level=info msg="Kill container \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\""
time="2024-03-03T02:54:53.267601463Z" level=info msg="sending signal 9 to instance: a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec"
time="2024-03-03T02:54:53.280759796Z" level=info msg="deleting instance: a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec"
time="2024-03-03T02:54:53.281252463Z" level=info msg="shim disconnected" id=a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec namespace=k8s.io
time="2024-03-03T02:54:53.281263755Z" level=warning msg="cleaning up after shim disconnected" id=a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec namespace=k8s.io
time="2024-03-03T02:54:53.285352921Z" level=info msg="StopContainer for \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\" returns successfully"
time="2024-03-03T02:54:53.285577796Z" level=info msg="Container to stop \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2024-03-03T02:54:53.866617713Z" level=info msg="RemoveContainer for \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\""
time="2024-03-03T02:54:53.868465005Z" level=info msg="RemoveContainer for \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\" returns successfully"
time="2024-03-03T02:54:53.868765338Z" level=error msg="ContainerStatus for \"a3db1d9768c405058596d2c6298d7b53d89cc81c79eb54d4801cbee0cdc434ec\" failed" error="rpc error: code = NotFound desc = an error occurred wh

@rajatjindal
Member Author

Oh, I think this is the same as deislabs/containerd-wasm-shims#207.

@rajatjindal
Member Author

This turns out to be due to OS signal handling in the containerd shim.
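
As an illustration of the expected behaviour (a minimal Go sketch, not the shim's actual code): a process that handles SIGTERM can drain in-flight requests and exit well within the grace period, whereas a process that ignores SIGTERM sits idle until the kubelet sends SIGKILL at the 30s mark, which matches the logs above.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	// Wait for the SIGTERM that the container runtime sends on stop.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Drain in-flight requests, then exit before the grace period expires.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```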
