This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Add k8s events to task phase updates #600

Merged
merged 11 commits into from
Sep 29, 2023

Conversation

@andrewwdye andrewwdye commented Aug 3, 2023

TL;DR

Add support for watching Kubernetes object events and sending them in batches in TaskExecutionEvent.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

This change adds support for watching Kubernetes object events and sending them in batches in TaskExecutionEvent. It:

  • Adds an EventWatcher that keeps a local cache of Kubernetes object events, updated on event adds and deletes. To limit memory use, only the event note and timestamp are stored. To avoid bloating the task closure with spammy events, only the first occurrence of an event is recorded, using its creation timestamp; repeat occurrences are ignored. The tradeoff is that information about current state (e.g., crash looping) may not be available in the UI.
  • Refactors clientset plumbing to use a separate clientset for the event informer, because the flyte clientset abstraction does not support filtering to specific object types.
  • Sends object events to admin via the newly added batched TaskExecutionEvent.reasons field.
  • Leaves the feature disabled by default; it is turned on via K8sPluginConfig.SendObjectEvents.
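
The caching behavior described in the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from this PR: the names ObjectKey, EventInfo, EventCache, OnAdd, OnDelete, Since, and demoNotes are all hypothetical, and the real watcher is driven by a Kubernetes informer rather than direct method calls. It shows the two memory-saving choices (store only the note and creation timestamp, keep only the first occurrence) and a lookup of events newer than a given time.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// ObjectKey identifies the Kubernetes object an event refers to.
// Hypothetical type, introduced only for this sketch.
type ObjectKey struct {
	Namespace, Name string
}

// EventInfo keeps only the event note and its creation timestamp.
type EventInfo struct {
	Note      string
	CreatedAt time.Time
}

// EventCache is a hypothetical in-memory cache: object -> event UID -> info.
type EventCache struct {
	mu     sync.RWMutex
	events map[ObjectKey]map[string]EventInfo
}

func NewEventCache() *EventCache {
	return &EventCache{events: make(map[ObjectKey]map[string]EventInfo)}
}

// OnAdd records the first occurrence of an event. In a real informer, repeats
// would arrive as updates; this sketch models ignoring them by skipping UIDs
// that are already cached.
func (c *EventCache) OnAdd(obj ObjectKey, uid, note string, createdAt time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	byUID, ok := c.events[obj]
	if !ok {
		byUID = make(map[string]EventInfo)
		c.events[obj] = byUID
	}
	if _, seen := byUID[uid]; !seen {
		byUID[uid] = EventInfo{Note: note, CreatedAt: createdAt}
	}
}

// OnDelete drops an event and removes the object entry once it is empty.
func (c *EventCache) OnDelete(obj ObjectKey, uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	byUID, ok := c.events[obj]
	if !ok {
		return
	}
	delete(byUID, uid)
	if len(byUID) == 0 {
		delete(c.events, obj)
	}
}

// Since returns the cached events for obj created strictly after t, oldest
// first. Ties on the timestamp are broken by note so the order is stable.
func (c *EventCache) Since(obj ObjectKey, t time.Time) []EventInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []EventInfo
	for _, e := range c.events[obj] {
		if e.CreatedAt.After(t) {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].CreatedAt.Equal(out[j].CreatedAt) {
			return out[i].Note < out[j].Note
		}
		return out[i].CreatedAt.Before(out[j].CreatedAt)
	})
	return out
}

// demoNotes feeds a few events through the cache and returns the notes that
// would be reported for the pod.
func demoNotes() []string {
	cache := NewEventCache()
	pod := ObjectKey{Namespace: "flytesnacks-development", Name: "f4e1edfa9f2a045bb9cd-n0-0"}
	t0 := time.Date(2023, 9, 29, 16, 52, 11, 0, time.UTC)

	cache.OnAdd(pod, "uid-1", "Created container", t0.Add(1*time.Second))
	cache.OnAdd(pod, "uid-1", "Created container", t0.Add(5*time.Second)) // repeat, ignored
	cache.OnAdd(pod, "uid-2", "Started container", t0.Add(2*time.Second))

	var notes []string
	for _, e := range cache.Since(pod, t0) {
		notes = append(notes, e.Note)
	}
	return notes
}

func main() {
	fmt.Println(demoNotes())
}
```

Because the repeat of uid-1 is dropped, the cache holds one entry per distinct event no matter how often it fires. That bounds memory, at the cost noted above: a crash-looping pod would surface only its first occurrence.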

Testing

Ran the single binary locally and got the following console output:

9/29/2023 4:50:59 PM UTC task submitted to K8s

9/29/2023 4:50:59 PM UTC Unschedulable:0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

9/29/2023 4:50:59 PM UTC 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

9/29/2023 4:52:11 PM UTC [ContainersNotReady|ContainerCreating]: containers with unready status: [f4e1edfa9f2a045bb9cd-n0-0]|

9/29/2023 4:52:11 PM UTC Successfully assigned flytesnacks-development/f4e1edfa9f2a045bb9cd-n0-0 to 5b0f2dc6442c

9/29/2023 4:52:12 PM UTC Container image "cr.flyte.org/flyteorg/flytekit:py3.10-1.8.1" already present on machine

9/29/2023 4:52:12 PM UTC Created container f4e1edfa9f2a045bb9cd-n0-0

9/29/2023 4:52:12 PM UTC Started container f4e1edfa9f2a045bb9cd-n0-0

The execution closure shows the batched reasons:

❯ flytectl get execution -p flytesnacks -d development f4e1edfa9f2a045bb9cd --details -o yaml
- node_exec:
    closure:
      createdAt: "2023-09-29T16:50:59.879001Z"
      outputUri: s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f4e1edfa9f2a045bb9cd/start-node/data/0/outputs.pb
      phase: SUCCEEDED
      updatedAt: "2023-09-29T16:50:59.879041Z"
    id:
      executionId:
        domain: development
        name: f4e1edfa9f2a045bb9cd
        project: flytesnacks
      nodeId: start-node
    inputUri: s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f4e1edfa9f2a045bb9cd/start-node/data/inputs.pb
    metadata:
      specNodeId: start-node
- node_exec:
    closure:
      createdAt: "2023-09-29T16:50:59.886835Z"
      phase: RUNNING
      startedAt: "2023-09-29T16:50:59.909353Z"
      updatedAt: "2023-09-29T16:50:59.909698Z"
    id:
      executionId:
        domain: development
        name: f4e1edfa9f2a045bb9cd
        project: flytesnacks
      nodeId: n0
    inputUri: s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f4e1edfa9f2a045bb9cd/n0/data/inputs.pb
    metadata:
      specNodeId: n0
  task_execs:
  - closure:
      createdAt: "2023-09-29T16:50:59.903548Z"
      eventVersion: 1
      logs:
      - messageFormat: JSON
        name: Kubernetes Logs (User)
        uri: http://localhost:30080/kubernetes-dashboard/#/log/flytesnacks-development/f4e1edfa9f2a045bb9cd-n0-0/pod?namespace=flytesnacks-development
      metadata:
        generatedName: f4e1edfa9f2a045bb9cd-n0-0
        pluginIdentifier: container
      phase: RUNNING
      reason: Started container f4e1edfa9f2a045bb9cd-n0-0
      reasons:
      - message: task submitted to K8s
        occurredAt: "2023-09-29T16:50:59.903548Z"
      - message: 'Unschedulable:0/1 nodes are available: 1 Insufficient cpu. preemption:
          0/1 nodes are available: 1 No preemption victims found for incoming pod.'
        occurredAt: "2023-09-29T16:50:59Z"
      - message: '0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes
          are available: 1 No preemption victims found for incoming pod.'
        occurredAt: "2023-09-29T16:50:59Z"
      - message: '[ContainersNotReady|ContainerCreating]: containers with unready
          status: [f4e1edfa9f2a045bb9cd-n0-0]|'
        occurredAt: "2023-09-29T16:52:11Z"
      - message: Successfully assigned flytesnacks-development/f4e1edfa9f2a045bb9cd-n0-0
          to 5b0f2dc6442c
        occurredAt: "2023-09-29T16:52:11Z"
      - message: Container image "cr.flyte.org/flyteorg/flytekit:py3.10-1.8.1" already
          present on machine
        occurredAt: "2023-09-29T16:52:12Z"
      - message: Created container f4e1edfa9f2a045bb9cd-n0-0
        occurredAt: "2023-09-29T16:52:12Z"
      - message: Started container f4e1edfa9f2a045bb9cd-n0-0
        occurredAt: "2023-09-29T16:52:12Z"
      startedAt: "2023-09-29T16:52:12Z"
      taskType: python-task
      updatedAt: "2023-09-29T16:52:12.770241Z"

Tracking Issue

flyteorg/flyte#3825

Follow-up issue

N/A

kumare3 commented Aug 9, 2023

Is this just watching kube events? If so, we should increase the throughput on kubeclient. Do not write kube events. That impacts kubeapi drastically

@andrewwdye andrewwdye marked this pull request as ready for review September 21, 2023 06:41

codecov bot commented Sep 25, 2023

Codecov Report

Merging #600 (5e3c423) into master (2aca906) will increase coverage by 0.49%.
The diff coverage is 65.78%.

❗ Current head 5e3c423 differs from pull request most recent head fe453eb. Consider uploading reports for the commit fe453eb to get more accurate results

Signed-off-by: Andrew Dye <[email protected]>

@katrogan katrogan left a comment

LGTM - but will defer to Dan here for approval :)

@hamersaw hamersaw left a comment

LGTM once idl and plugins are merged.

pkg/controller/executors/dag_structure.go (review thread resolved)
hamersaw previously approved these changes Sep 28, 2023
andrewwdye commented

@kumare3

> Is this just watching kube events? If so, we should increase the throughput on kubeclient. Do not write kube events. That impacts kubeapi drastically

Correct, this just watches k8s events. It uses a separate clientset vs the KubeClient in flyteplugins in order to create a filtered informer.

@hamersaw hamersaw merged commit 40fef66 into flyteorg:master Sep 29, 2023
14 checks passed