fix flakes in log following related unit tests #112
Conversation
Force-pushed from 57d1f02 to 18d9f9e
Clean tests; I saw one minor correction needed in the code ... I forgot to use the new fail poll interval in follow.go
fyi @otaviof @HeavyWombat @SaschaSchwarze0 - another simplification occurred to me last night. I'm making an update this morning and will post again if it is in fact successful and I've updated this PR. Thanks.
Force-pushed from 3faebb8 to 773555a
OK @otaviof @HeavyWombat @SaschaSchwarze0, my simplifications are in (along with some additional comments in the code explaining some details w.r.t. multi-threaded testing, or lack thereof). PTAL when you all get the chance. Thanks.
OK, we all discussed this one in the community meeting. @otaviof is good with moving forward with this reduced-scope version of addressing the test flakes. He'll review, and when he and I iterate and land on consensus on these changes, we'll merge this. @HeavyWombat and @SaschaSchwarze0 signed off on this plan as well. Going ahead with /approve.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gabemontero. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
On the very first run, it failed with a test timeout, as per:
Very weird ... I ran that for loop a bunch of times. Either I made a minor change that messed things up and forgot to re-run the for loop, or this flake is really intermittent. OK, I'll take another pass today and report back. Thanks.
@otaviof - are you sure you recompiled with my branch? The line numbers from your #112 (comment) do not line up with what I see in the code. As an example: when I look at the code, line 133 of pod_watcher.go is a comment, and the select actually starts on line 137 for me. The line numbers in your stack traces do line up with what I see in the main branch. Also, if you look at follow.go's Start now, it calls a different method than before.
Just ran
and it completes OK for me:
Sorry 🤦♀️ I ended up running the tests in the wrong terminal window. Indeed, the tests are passing consistently, so I'll continue the review where I left off this morning 👍
Whew :-) ... great, thanks for confirming @otaviof
@gabemontero Just a couple of "nit" comments; all in all I liked the approach to avoid flaky results and reliably consume Watch events 👍 Thanks!
// Start runs the event loop based on a watch instantiated against informed pod. In case of errors
// the loop is interrupted.
-func (p *PodWatcher) Start(listOpts metav1.ListOptions) (*corev1.Pod, error) {
+func (p *PodWatcher) Connect(listOpts metav1.ListOptions) error {
Would be great to add a doc-comment explaining how Connect helps with the multi-threading, etc.
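For illustration, a doc-comment along these lines could answer that ask; the wording below is only a suggestion, not the comment that actually landed in the PR:

```go
// Connect instantiates the watch against the informed pod but does not
// consume events or block until completion itself; that part now lives in
// WaitForCompletion. Splitting the two lets callers on other goroutines
// (commands, unit tests) confirm via FollowerReady that the watch is
// established before they start generating events, avoiding races against
// the fake k8s client.
```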
pkg/shp/cmd/build/run.go (Outdated)
@@ -27,6 +27,7 @@ type RunCommand struct {
 	buildRunSpec *buildv1alpha1.BuildRunSpec // stores command-line flags
 	follow bool // flag to tail pod logs
 	follower *follower.Follower
+	follwerReady chan bool
nit: typo on the variable name.
eliminate some multi-threaded elements in pod watcher unit test to help fragile fake k8s client
decrease some unit test timeouts to speed them up
…its consumers
allow for optimizing / eliminating wait/sleep/timeout intervals in follower unit test and follower consumer unit tests
remove follower initialization in unit test
split follower setup where Complete() is for creation and Run() is for running the follower
add buildrun name setter to facilitate the Complete() vs. Run() split
add multi-thread coordination via FollowerReady to make sure pod_watcher is up
Force-pushed from 773555a to 600428d
Updates from your comments pushed, @otaviof. Given their nature, I just squashed them into the existing commits. Thanks!
/lgtm Thanks @gabemontero!
Changes
Fixes #103
/kind cleanup
Hello @otaviof @HeavyWombat @SaschaSchwarze0
I finally got some shipwright cycles this week, and one of my long-standing to-dos was to spend quality time on the log-following unit test flakes @otaviof reported.
Allow me to explain my process for getting to this point with an alternative PR.
While, as is clear to see, I initially took a brief pass at reviewing @otaviof 's #104 a month ago, I decided to re-engage here by first making sure I understood first hand the crux of the original situation as I remember @otaviof describing it in our team meetings. If I were to attempt to summarize, it was the inherent fragility of k8s fake clients, especially around watches (of course @otaviof, please do chime in if my brief summary misses key points).
So, running @otaviof 's reproducers in #103 as well as running the unit tests in the debugger, the k8s fake client's watch fragility is what was most noticeable to me. If I were to attempt to summarize, it did not seem thread safe, to the point where it was unclear whether the events the unit test reported as missing were the actual events that were missing.
But of even more importance, the "has synced" concept we have with the k8s watch-based informers used in controllers occurred to me as something that would be useful here (I describe this in more detail in #103 (comment)). So, building on that to see if it proved valid (which it did), after about a full day of coding I've got a different take on addressing this flake with this PR.
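For context, below is a generic sketch of the standard client-go informer "has synced" handshake being referenced; this is stock client-go usage against the fake clientset, not shp code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Block until the informer's internal watch/cache is established, so
	// nothing downstream starts before events can actually be observed.
	if !cache.WaitForCacheSync(stopCh, podInformer.HasSynced) {
		fmt.Println("cache never synced")
		return
	}
	fmt.Println("informer has synced; safe to proceed")
}
```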
I see #104 needs a rebase as it is. My initial thought is we iterate on this PR to merge it, so we address the flake. Then we reset and see if we want some of the refactoring from #104 that may not be immediately related to the k8s fake client flakiness but that we like, either by reworking #104 and getting it ready again, or, if it makes sense, by just redoing #104 in a new PR. That said, if folks think of elements of #104 that can be easily applied here, sure, let's discuss.
A more detailed breakdown
- Start methods have been split into Connect and WaitForCompletion ... this somewhat speaks to the "has synced" concept with k8s informers
- a FollowerReady method helps with multi-threaded use of consumers of the follower (in theory, other commands that use the follower could follow this pattern if need be) ... this more directly is the "has synced" equivalent from k8s informers
- Followers are created in their consumers' Complete methods, and the consumers start up and wait for them in their Run methods (see the sketch below)
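Below is a minimal sketch of that split and the readiness handshake; the names and signatures are simplified stand-ins, not the actual shp types (which carry clientsets, streams, and log state):

```go
package main

import (
	"fmt"
	"time"
)

// PodWatcher stands in for the real pod watcher: Connect only establishes
// the watch, WaitForCompletion consumes events until the pod is done.
type PodWatcher struct {
	connected chan struct{}
	done      chan struct{}
}

func NewPodWatcher() *PodWatcher {
	return &PodWatcher{connected: make(chan struct{}), done: make(chan struct{})}
}

// Connect would create the watch here; closing "connected" signals that
// events can no longer be missed.
func (p *PodWatcher) Connect() error {
	close(p.connected)
	return nil
}

// WaitForCompletion blocks, consuming events until the watch finishes.
func (p *PodWatcher) WaitForCompletion() error {
	<-p.done
	return nil
}

// Follower wires the two halves together for a command.
type Follower struct {
	watcher *PodWatcher
}

// Start connects first, then waits; errors interrupt the loop.
func (f *Follower) Start() error {
	if err := f.watcher.Connect(); err != nil {
		return err
	}
	return f.watcher.WaitForCompletion()
}

// FollowerReady lets another goroutine (the command's Run method, or a unit
// test driving the fake client) block until the watch is actually up.
func (f *Follower) FollowerReady() bool {
	select {
	case <-f.watcher.connected:
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	f := &Follower{watcher: NewPodWatcher()}

	errCh := make(chan error, 1)
	go func() { errCh <- f.Start() }()

	if f.FollowerReady() {
		fmt.Println("watch established; safe to create objects / emit events")
	}

	close(f.watcher.done) // simulate the pod reaching a terminal state
	fmt.Println("follower finished:", <-errCh)
}
```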
Some output from some of @otaviof 's flake reproducers:
Submitter Checklist
See the contributor guide for details on coding conventions, GitHub and prow interactions, and the code review process.
Release Notes