pull-npd-e2e-test failing ssh handshake #970

Open
wangzhen127 opened this issue Oct 9, 2024 · 10 comments

@wangzhen127
Member

https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test started failing recently.

[1] NPD should export Prometheus metrics. When OOM kills and docker hung happen 
[1]   NPD should update problem_counter and problem_gauge
[1]   /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:158
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:54804->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:52980->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:53002->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44696->35.184.209.153:22: read: connection reset by peer', retrying
[2] Error storing debugging data to test artifacts: [Error running command: {prow 35.184.209.153 curl http://localhost:20257/metrics   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:52990->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -u node-problem-detector.service   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44688->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -k   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44708->35.184.209.153:22: read: connection reset by peer'}
[2] ]

This is affecting several different PRs: #955, #961, #969.

@wangzhen127
Member Author

This looks like an infra issue. @BenTheElder Do you know who we should talk to?

CC @hakman

@BenTheElder
Member

It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.

It seems the VM is not serving SSH, or something similar.

@wangzhen127
Member Author

CC @DigitalVeer

@BenTheElder
Member

If these are like node e2e tests, folks in SIG node might be familiar

SIG Testing strongly discourages ssh usage in cluster e2e tests, relying instead on hostexec pods when necessary, but for some node style testing that's not sufficient, and mostly folks in SIG Node work with this.

@ameukam
Member

ameukam commented Oct 10, 2024

It's possible there is an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working, but I'll try to debug with @hakman.

@hakman
Member

hakman commented Oct 13, 2024

This is an issue with cos-stable-117. SSH works pretty well in all other tests (which are similar).
I tried to reproduce what happens with the ext4 test and found out that the command used in the test is:

echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error

Once this runs, the filesystem is mounted as read-only and SSH stops working with Connection reset by peer:

[  169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[  169.108852] Aborting journal on device sda1-8.
[  169.115130] EXT4-fs (sda1): Remounting filesystem read-only

There may be some recent changes that affect the behaviour of trigger_fs_error.
https://lore.kernel.org/all/[email protected]/t/#u
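
To make the sequence in the kernel log above easier to spot in a test run, here is a minimal sketch that classifies the three lines. The regular expressions and the `classify` helper are assumptions written for this example; they are not taken from NPD's configuration or test code.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns written for this example only.
var (
	ext4ErrorRE = regexp.MustCompile(`EXT4-fs error \(device (\w+)\)`)
	readOnlyRE  = regexp.MustCompile(`EXT4-fs \((\w+)\): Remounting filesystem read-only`)
)

// classify returns an event label and the device name for a kernel log line.
// Both results are empty when the line matches neither pattern.
func classify(line string) (event, device string) {
	if m := ext4ErrorRE.FindStringSubmatch(line); m != nil {
		return "ext4-error", m[1]
	}
	if m := readOnlyRE.FindStringSubmatch(line); m != nil {
		return "remount-read-only", m[1]
	}
	return "", ""
}

func main() {
	// The three log lines quoted in this comment.
	lines := []string{
		"[  169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker",
		"[  169.108852] Aborting journal on device sda1-8.",
		"[  169.115130] EXT4-fs (sda1): Remounting filesystem read-only",
	}
	for _, l := range lines {
		event, dev := classify(l)
		if event == "" {
			continue // not one of the two events we look for
		}
		fmt.Printf("%s on %s\n", event, dev)
	}
}
```

Seeing `remount-read-only` right before the first SSH reset would be consistent with the injected error being the trigger, rather than a networking problem.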

@wangzhen127
Member Author

New updates:

Talked to COS team and found the root cause: https://www.spinics.net/lists/linux-ext4/msg90066.html

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so although the fs is remounted read-only, files can't be read by anyone and SSH connections fail.

This is an intentional change in the upstream kernel, so the COS team won't change it. The path forward is to update the NPD test case for newer kernel versions (>=6.5.0-rc3).
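
A minimal sketch of how a test could gate on the kernel version, assuming the release string comes from something like `uname -r` on the test VM. `kernelAtLeast` is a hypothetical helper, not an existing NPD function, and the sample release strings are made up. It compares only major.minor, so it also treats 6.5.0-rc1 and 6.5.0-rc2 as affected, which is a simplification.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// releaseRE captures the leading "major.minor" of a kernel release string.
var releaseRE = regexp.MustCompile(`^(\d+)\.(\d+)`)

// kernelAtLeast reports whether release (for example "6.5.0-rc3") is at or
// above major.minor. Patch level and suffixes such as "-rc3" or "+" are
// ignored (assumption: major.minor granularity is enough for this check).
func kernelAtLeast(release string, major, minor int) (bool, error) {
	m := releaseRE.FindStringSubmatch(release)
	if m == nil {
		return false, fmt.Errorf("unrecognized kernel release %q", release)
	}
	maj, err := strconv.Atoi(m[1])
	if err != nil {
		return false, err
	}
	mnr, err := strconv.Atoi(m[2])
	if err != nil {
		return false, err
	}
	if maj != major {
		return maj > major, nil
	}
	return mnr >= minor, nil
}

func main() {
	// Hypothetical release strings, used for illustration only.
	for _, r := range []string{"6.5.0-rc3", "6.1.100+", "6.12.1"} {
		affected, err := kernelAtLeast(r, 6, 5)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s affected=%t\n", r, affected)
	}
}
```

On an affected kernel the test could skip the ext4 error injection or stop relying on SSH after it; which of those fits NPD's e2e suite is left open here.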

@hakman
Member

hakman commented Nov 7, 2024

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so although the fs is remounted read-only, files can't be read by anyone and SSH connections fail.

@wangzhen127 I don't think SSH failing after this is an intended behaviour.

@wangzhen127
Member Author

Yeah, this is from the COS team's perspective: because the change is in upstream, there is not much they can do, so they recommend that we update the tests. Sorry for the confusion.

@hakman
Member

hakman commented Nov 7, 2024

No worries, I just meant that maybe they can configure the SSH server to not fail completely. I agree that the FS should become read-only, but not accepting SSH connections is quite unexpected.
