pull-npd-e2e-test failing ssh handshake #970

wangzhen127 · 2024-10-09T16:59:08Z

https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test started failing recently.

[1] NPD should export Prometheus metrics. When OOM kills and docker hung happen 
[1]   NPD should update problem_counter and problem_gauge
[1]   /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:158
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:54804->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:52980->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:53002->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44696->35.184.209.153:22: read: connection reset by peer', retrying
[2] Error storing debugging data to test artifacts: [Error running command: {prow 35.184.209.153 curl http://localhost:20257/metrics   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:52990->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -u node-problem-detector.service   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44688->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -k   0 error getting SSH client to prow@35.184.209.153:22: 'ssh: handshake failed: read tcp 10.32.2.7:44708->35.184.209.153:22: read: connection reset by peer'}
[2] ]

This is affecting several different PRs: #955, #961, #969.


wangzhen127 · 2024-10-09T17:00:52Z

This looks like an infra issue. @BenTheElder Do you know who we should talk to?

CC @hakman

BenTheElder · 2024-10-09T17:03:48Z

It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.

It seems like the VM is not serving SSH, or something similar.

wangzhen127 · 2024-10-09T17:38:05Z

CC @DigitalVeer

BenTheElder · 2024-10-09T19:21:12Z

If these are like node e2e tests, folks in SIG Node might be familiar with them.

SIG Testing strongly discourages ssh usage in cluster e2e tests, relying instead on hostexec pods when necessary, but for some node style testing that's not sufficient, and mostly folks in SIG Node work with this.

ameukam · 2024-10-10T14:00:00Z

It's possible there is an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working, but I'll try to debug with @hakman.

hakman · 2024-10-13T20:00:49Z

This is an issue with cos-stable-117. SSH works pretty well in all other tests (which are similar).
I tried to reproduce what happens with the ext4 test and found out that the command used in the test is:

echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error

Once this runs, the filesystem is mounted as read-only and SSH stops working with Connection reset by peer:

[  169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[  169.108852] Aborting journal on device sda1-8.
[  169.115130] EXT4-fs (sda1): Remounting filesystem read-only

There may be some recent changes that affect the behaviour of trigger_fs_error.
https://lore.kernel.org/all/[email protected]/t/#u
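
To spot this state in a saved kernel log, one option is to scan for the remount message shown above. This is a minimal sketch, not something the NPD tests do; the function name `ext4_ro_devices` is hypothetical, and it assumes the message format matches the `EXT4-fs (sda1): Remounting filesystem read-only` line from this log.

```shell
#!/bin/sh
# ext4_ro_devices [FILE...]
# Hypothetical helper: prints the device name from every kernel log line of
# the form "EXT4-fs (<dev>): Remounting filesystem read-only".
# Reads the given files, or stdin when no file is given.
ext4_ro_devices() {
    # In a basic regular expression, "(" and ")" are literal characters,
    # while "\(" and "\)" delimit the captured group (the device name).
    sed -n 's/.*EXT4-fs (\([^)]*\)): Remounting filesystem read-only.*/\1/p' "$@"
}
```

For example, piping the three log lines above into `ext4_ro_devices` prints `sda1`. Since SSH is unavailable once the error is triggered, the log would have to come from somewhere that is still reachable, such as a serial console capture.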

wangzhen127 · 2024-11-06T19:34:29Z

New updates:

Talked to COS team and found the root cause: https://www.spinics.net/lists/linux-ext4/msg90066.html

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens. As a result, even though the fs is remounted read-only, files can't be read by anyone and SSH connections fail.

This is an intentional change in the upstream kernel, so the COS team won't change it on their side. The path forward would be updating the NPD test case for newer kernel versions (>=6.5.0-rc3).
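
One way to gate the test on kernel version could look like the sketch below. It is only an illustration: `kernel_at_least` is a hypothetical helper, it compares major and minor numbers only (so any 6.5 pre-release counts as 6.5), and it assumes the release string starts with `MAJOR.MINOR` as `uname -r` output normally does. Distro kernels with backported patches may not follow the upstream version boundary.

```shell
#!/bin/sh
# kernel_at_least RELEASE MAJOR MINOR
# Hypothetical helper: succeeds if RELEASE (e.g. the output of `uname -r`)
# is at least MAJOR.MINOR. Anything after the minor number is ignored.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Possible use on the test VM before triggering the ext4 error:
#   if kernel_at_least "$(uname -r)" 6 5; then
#       echo "skipping ext4 trigger_fs_error case on this kernel"
#   fi
```

A check like this would let the ext4 case keep running on older kernels while being skipped or adapted where the filesystem shuts down after the error.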

hakman · 2024-11-07T04:25:25Z

> The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when fs error happens so though the fs is remounted as read-only, files can't be read by anyone and SSH connections will fail.

@wangzhen127 I don't think SSH failing after this is an intended behaviour.

wangzhen127 · 2024-11-07T19:47:12Z

Yeah, that is the COS team's perspective: because the change came from upstream, there is not much they can do, so they recommend that we update the tests. Sorry for the confusion.

hakman · 2024-11-07T20:21:20Z

No worries, I just meant that maybe they can configure the SSH server to not fail completely. I agree that the FS should become read-only, but not accepting SSH connections is quite unexpected.

wangzhen127 mentioned this issue Oct 9, 2024

CVE found with v0.8.19 #926

Closed

hakman mentioned this issue Oct 11, 2024

Test using "k8s-infra-e2e-boskos-157" #971

Closed

hakman mentioned this issue Oct 14, 2024

Skip ext4 e2e tests #974

Merged

BenTheElder mentioned this issue Nov 13, 2024

[Failing test][sig-node] ci-crio-cgroupv1-node-e2e-conformance kubernetes/kubernetes#128774

Closed
