fix flakes in log following related unit tests #112
Conversation
Force-pushed from 57d1f02 to 18d9f9e
Clean tests; I saw one minor correction needed in the code ... I forgot to use the new fail poll interval in follow.go
fyi @otaviof @HeavyWombat @SaschaSchwarze0 - another simplification occurred to me last night. I'm making an update this morning and will post again if it is in fact successful and I've updated this PR. Thanks.
Force-pushed from 3faebb8 to 773555a
OK @otaviof @HeavyWombat @SaschaSchwarze0, my simplifications are in (along with some additional comments in the code explaining some details w.r.t. multi-threaded testing, or lack thereof). PTAL when you all get the chance. Thanks.
OK, we all discussed this one in the community meeting. @otaviof is good with moving forward with this reduced-scope version of addressing the test flakes. He'll review, and when he and I iterate and land on consensus on these changes, we'll merge this. @HeavyWombat and @SaschaSchwarze0 signed off on this plan as well. Going ahead with /approve.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gabemontero. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
On the very first run, it failed with a test timeout, as per:
Very weird ... I ran that for loop a bunch of times. Either I made a minor change that messed things up and forgot to re-run the for loop, or this flake is really intermittent. OK, I'll take another pass today and report back. Thanks.
@otaviof - are you sure you recompiled with my branch? The line numbers from your #112 (comment) do not line up with what I see in the code. As an example: when I look at the code, line 133 of pod_watcher.go is a comment, and the select actually starts on line 137 for me. The line numbers in your stack traces do line up with what I see in the main branch. Also, if you look at follow.go's Start now, it calls a different method than before.
Just ran
and it completes OK for me:
Sorry 🤦♀️ I ended up running the tests in the wrong terminal window. Indeed, the tests are passing consistently, so I'll continue the review where I left off this morning 👍
Whew :-) ... great, thanks for confirming @otaviof
@gabemontero Just a couple of "nit" comments; all in all I liked the approach to avoid flaky results and reliably consume Watch events 👍 Thanks!
// Start runs the event loop based on a watch instantiated against informed pod. In case of errors
// the loop is interrupted.
-func (p *PodWatcher) Start(listOpts metav1.ListOptions) (*corev1.Pod, error) {
+func (p *PodWatcher) Connect(listOpts metav1.ListOptions) error {
Would be great to add a doc-comment explaining how Connect helps with the multi-threading, etc.
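For illustration, a doc-comment along these lines could answer that ask; the wording below is only a suggestion, not the comment that actually landed in the PR:

```go
// Connect instantiates the watch against the informed pod but does not
// consume events or block until completion itself; that part now lives in
// WaitForCompletion. Splitting the two lets callers on other goroutines
// (commands, unit tests) confirm via FollowerReady that the watch is
// established before they start generating events, avoiding races against
// the fake k8s client.
```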
pkg/shp/cmd/build/run.go (Outdated)
@@ -27,6 +27,7 @@ type RunCommand struct {
 	buildRunSpec *buildv1alpha1.BuildRunSpec // stores command-line flags
 	follow bool // flag to tail pod logs
 	follower *follower.Follower
+	follwerReady chan bool
nit: typo on the variable name.
eliminate some multi-threaded elements in pod watcher unit test to help fragile fake k8s client
decrease some unit test timeouts to speed them up
…its consumers
allow for optimizing / eliminating wait/sleep/timeout intervals in follower unit test and follower consumer unit tests
remove follower initialization in unit test
split follower setup where Complete() is for creation and Run() is for running the follower
add buildrun name setter to facilitate the Complete() vs. Run() split
add multi-thread coordination via FollowerReady to make sure pod_watcher is up
Force-pushed from 773555a to 600428d
Updates from your comments pushed, @otaviof. Given their nature, I just squashed them into the existing commits. Thanks!
/lgtm Thanks @gabemontero!
Changes
Fixes #103
/kind cleanup
Hello @otaviof @HeavyWombat @SaschaSchwarze0
I finally got some shipwright cycles this week, and one of my long-standing to-dos was to spend quality time on the log-following unit test flakes @otaviof reported.
Allow me to explain my process for getting to this point with an alternative PR.
While, as is clear to see, I initially took a brief pass at reviewing @otaviof 's #104 a month ago, I decided to re-engage here by first making sure I understood first hand the crux of the original situation as I remember @otaviof describing it in our team meetings. If I were to attempt to summarize, it was the inherent fragility of k8s fake clients, especially around watches (of course @otaviof, please do chime in if my brief summary misses key points).
So, running @otaviof 's reproducers in #103 as well as running the unit tests in the debugger, the k8s fake client's watch fragility is what was most noticeable to me. If I were to attempt to summarize, it did not seem thread safe, to the point where it was unclear whether the events the unit test reported as missing were the actual events that were missing.
But of even more importance, the "has synced" concept we have with the k8s watch-based informers used in controllers occurred to me as something that would be useful here (I describe this in more detail in #103 (comment)). So, building on that to see if it proved valid (which it did), after about a full day of coding I've got a different take on addressing this flake with this PR.
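For context, below is a generic sketch of the standard client-go informer "has synced" handshake being referenced; this is stock client-go usage against the fake clientset, not shp code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// Block until the informer's internal watch/cache is established, so
	// nothing downstream starts before events can actually be observed.
	if !cache.WaitForCacheSync(stopCh, podInformer.HasSynced) {
		fmt.Println("cache never synced")
		return
	}
	fmt.Println("informer has synced; safe to proceed")
}
```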
I see #104 needs a rebase as it is. My initial thought is we iterate on this PR to merge it, so we address the flake. Then we reset and see if we want some of the refactoring from #104 that may not be immediately related to the k8s fake client flakiness but that we like, either by reworking #104 and getting it ready again, or, if it makes sense, by just redoing #104 in a new PR. That said, if folks think of elements of #104 that can be easily applied here, sure, let's discuss.
A more detailed breakdown
- Start methods have been split into Connect and WaitForCompletion ... this somewhat speaks to the "has synced" concept with k8s informers
- a FollowerReady method helps with multi-threaded use of consumers of the follower (in theory, other commands that use the follower could follow this pattern if need be) ... this more directly is the "has synced" equivalent from k8s informers
- Followers are created in their consumers' Complete methods, and the consumers start up and wait for them in their Run methods (see the sketch below)
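Below is a minimal sketch of that split and the readiness handshake; the names and signatures are simplified stand-ins, not the actual shp types (which carry clientsets, streams, and log state):

```go
package main

import (
	"fmt"
	"time"
)

// PodWatcher stands in for the real pod watcher: Connect only establishes
// the watch, WaitForCompletion consumes events until the pod is done.
type PodWatcher struct {
	connected chan struct{}
	done      chan struct{}
}

func NewPodWatcher() *PodWatcher {
	return &PodWatcher{connected: make(chan struct{}), done: make(chan struct{})}
}

// Connect would create the watch here; closing "connected" signals that
// events can no longer be missed.
func (p *PodWatcher) Connect() error {
	close(p.connected)
	return nil
}

// WaitForCompletion blocks, consuming events until the watch finishes.
func (p *PodWatcher) WaitForCompletion() error {
	<-p.done
	return nil
}

// Follower wires the two halves together for a command.
type Follower struct {
	watcher *PodWatcher
}

// Start connects first, then waits; errors interrupt the loop.
func (f *Follower) Start() error {
	if err := f.watcher.Connect(); err != nil {
		return err
	}
	return f.watcher.WaitForCompletion()
}

// FollowerReady lets another goroutine (the command's Run method, or a unit
// test driving the fake client) block until the watch is actually up.
func (f *Follower) FollowerReady() bool {
	select {
	case <-f.watcher.connected:
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	f := &Follower{watcher: NewPodWatcher()}

	errCh := make(chan error, 1)
	go func() { errCh <- f.Start() }()

	if f.FollowerReady() {
		fmt.Println("watch established; safe to create objects / emit events")
	}

	close(f.watcher.done) // simulate the pod reaching a terminal state
	fmt.Println("follower finished:", <-errCh)
}
```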
Some output from some of @otaviof 's flake reproducers:
Submitter Checklist
See the contributor guide for details on coding conventions, GitHub and prow interactions, and the code review process.
Release Notes