fix: controller: ensure workflows reconcile task results properly when failing to receive timely updates from the API server #14026

Open
wants to merge 1 commit into base: main

Conversation

@bom-d-van bom-d-van (Contributor) commented Dec 23, 2024

There seems to be no related open issue, so I'm just submitting a PR, but it's related to #12537.

Motivation

Error scenario: a pod for a step in a workflow has completed, and its task result was properly created and finalized by its wait container (judging from the exit status of the wait container); however, the task result informer in the controller leader has not received any updates about it (due to an overloaded API server or etcd).

Currently, the Argo workflow controller doesn't handle the above scenario properly: it marks the workflow node succeeded and shows no artifact outputs (even though they have already been uploaded to the repository).

We did run into this situation in our production instance (v3.5.8).

It's not easy to reproduce this problem, but a manual fault injection in `workflow/controller/taskresult.go:func (woc *wfOperationCtx) taskResultReconciliation()` can simulate the situation, and I did reproduce the issue on release v3.6.2. The injected check below skips reconciliation for workflows whose name contains `-xhu-debug-` until `/tmp/xhu-debug-control` exists:

```diff
+++ workflow/controller/taskresult.go
@@ -1,7 +1,9 @@
 package controller
 
 import (
+	"os"
 	"reflect"
+	"strings"
 	"time"
 
 	log "github.com/sirupsen/logrus"
@@ -62,6 +64,12 @@ func (woc *wfOperationCtx) taskResultReconciliation() {
 	objs, _ := woc.controller.taskResultInformer.GetIndexer().ByIndex(indexes.WorkflowIndex, woc.wf.Namespace+"/"+woc.wf.Name)
 	woc.log.WithField("numObjs", len(objs)).Info("Task-result reconciliation")
 
+	if strings.Contains(woc.wf.Name, "-xhu-debug-") {
+		if _, err := os.Stat("/tmp/xhu-debug-control"); err != nil {
+			return
+		}
+	}
```

Modifications

The change is to forcefully mark the workflow as having an incomplete TaskResult in `assessNodeStatus`.

This fix doesn't handle the case where a pod failed: there are too many potential failure scenarios (e.g. the wait container might not be able to insert a task result at all), and a retry is probably needed when there are failures. Besides, the loss for a failed pod is probably not as great as for a successful one.
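For illustration only, here is a minimal, self-contained sketch of the decision this change adds around node assessment; the names below are hypothetical stand-ins, not the actual controller API:

```go
// Hypothetical sketch: even if the pod reports success, the node is not marked
// Succeeded until its WorkflowTaskResult has been observed by the informer, so
// outputs/artifacts are still picked up once the informer catches up.
package main

import "fmt"

type nodePhase string

const (
	phaseRunning   nodePhase = "Running"
	phaseSucceeded nodePhase = "Succeeded"
)

// assessNode is an illustrative stand-in for the controller's assessNodeStatus.
func assessNode(podSucceeded, taskResultObserved bool) nodePhase {
	if podSucceeded && !taskResultObserved {
		// Task result not seen yet (e.g. informer lagging behind the API server):
		// treat the TaskResult as incomplete instead of declaring success.
		return phaseRunning
	}
	if podSucceeded {
		return phaseSucceeded
	}
	return phaseRunning
}

func main() {
	fmt.Println(assessNode(true, false)) // Running: wait for the delayed task result
	fmt.Println(assessNode(true, true))  // Succeeded: outputs are fully reconciled
}
```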

Verification

It's covered by the updated test case, and my manual verification details are included below (using the fault injection above).

Test workflow to verify the fix:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-xhu-debug-
spec:
  entrypoint: bash-script-example
  activeDeadlineSeconds: 3600
  templates:
  - name: bash-script-example
    steps:
    - - name: generate
        template: gen-random-int-bash
    - - name: print
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate.outputs.result}}"  # The result of the here-script

  - name: gen-random-int-bash
    outputs:
      artifacts:
      # generate hello-art artifact from /tmp/hello_world.txt
      # artifacts can be directories as well as files
      - name: hello-art
        path: /tmp/hello_world.txt
    script:
      image: reg.deeproute.ai/deeproute-public/nicolaka/netshoot:v0.13
      command: [bash]
      source: |                                         # Contents of the here-script
        touch /tmp/hello_world.txt
        cat /dev/urandom | od -N2 -An -i | awk -v f=1 -v r=100 '{printf "%i\n", f + r * $1 / 65536}'

  - name: print-message
    inputs:
      parameters:
      - name: message
    outputs:
      artifacts:
      # generate hello-art artifact from /tmp/hello_world.txt
      # artifacts can be directories as well as files
      - name: hello-art2
        path: /tmp/hello_world.txt
    container:
      image: reg.deeproute.ai/deeproute-public/nicolaka/netshoot:v0.13
      command: [sh, -c]
      args: ["touch /tmp/hello_world.txt; echo result was: {{inputs.parameters.message}}"]
```

With the manual fault injection, without the fix (screenshot omitted).

With the manual fault injection, with the fix (screenshot omitted).

@bom-d-van bom-d-van force-pushed the controller/reconciliation-of-delayed-task-result-public branch from 63cd4c4 to 747f1e6 on December 23, 2024 07:05
@bom-d-van bom-d-van marked this pull request as ready for review December 23, 2024 07:35
@tczhao tczhao (Member) left a comment


Great finding.
A delay in taskresultInformer will result in the issue you are seeing.
However, if you experience k8s API pressure and the retry inside the informer fails, you will see a much bigger problem: e.g. Argo uses the cache from the informer, the API pressure may leave that cache out of sync, and it will never sync until you restart the controller.

@shuangkun has more context related to this change, will ask him to review instead

Comment on lines +1452 to +1455
```go
// this fix doesn't handle the case when a pod failed, there are too many
// potentially failure scenarios (like the wait container might not be able to
// insert a task result). plus, a retry is probably needed when there are
// failures. the loss is probably not as great as a successful one.
```

We have retry for transient errors in the wait container to handle this:

```go
func (we *WorkflowExecutor) reportResult(ctx context.Context, result wfv1.NodeResult) error {
	return retryutil.OnError(wait.Backoff{
		Duration: time.Second,
		Factor:   2,
		Jitter:   0.1,
		Steps:    5,
		Cap:      30 * time.Second,
```
@tczhao tczhao requested a review from shuangkun December 25, 2024 05:36
@bom-d-van bom-d-van (Contributor, Author) commented Dec 25, 2024

> Great finding. A delay in taskresultInformer will result in the issue you are seeing. However, if you experience k8s API pressure and the retry inside the informer fails, you will see a much bigger problem: e.g. Argo uses the cache from the informer, the API pressure may leave that cache out of sync, and it will never sync until you restart the controller.
>
> @shuangkun has more context related to this change, will ask him to review instead

Yep, but in a way, a controller restart could help us resolve the issue when things go terribly wrong.
Arguably, at least for us, it's still better than having partially succeeded workflows.
