display StatefulSet errors when notebook pod fails to create #1703

christianvogt · 2023-08-24T21:16:52Z

Closes: #793

Description

When LimitRange prevents the creation of notebook server pods, no errors are displayed in the UI. The error occurs on the StatefulSet. This change reads events from the StatefulSet when no pod is present. Once a pod is present, we go back to reading the pod events as was done previously.

Added a new error message Failed to create pod to represent this state (see screen shots below).
cc @kywalker-rh

How Has This Been Tested?

Limit Range YAML

kind: LimitRange
apiVersion: v1
metadata:
  name: limit-range
  namespace: <namespace>
spec:
  limits:
    - type: Container
      max:
        cpu: 50m
        memory: 50Mi
      default:
        cpu: 50m
        memory: 50Mi
      defaultRequest:
        cpu: 50m
        memory: 50Mi
    - type: Pod
      max:
        cpu: 100m
        memory: 50Mi

Workbench:

Create a data science project
Create a LimitRange in the newly created project / namespace
Create a workbench whose container size will be limited by the LimitRange
Click on the Starting... text in the workbench status column to open the status popup
Click Event log
Wait for an error state to appear describing resource limits with the overall error message of Failed to create pod
Close the modal
Toggle the status to off to stop the notebook
Toggle the status to on to start the notebook and observe the same error state
Toggle the status to off
Delete the LimitRange resource
Toggle the status to on
Observe the notebook starts successfully. Viewing the event log during this process shows pod events accordingly.

Jupyter notebook:

Create a LimitRange in default ODH namespace. eg. opendatahub
Click on Applications => Enabled in the left nav
on the Jupyter card, click Launch application
Start a notebook server whose container size will be limited by the LimitRange
Wait for an error state to appear describing resource limits with the overall error message of Failed to create pod
Click Cancel
Delete the LimitRange resource
Click Start server
Observe the notebook starts successfully. Viewing the event log during this process shows pod events accordingly.

Test Impact

no new tests

Request review criteria:

Self checklist (all need to be checked):

The developer has manually tested the changes and verified that the changes work
Commits have been squashed into descriptive, self-contained units of work (e.g. 'WIP' and 'Implements feedback' style messages have been removed)
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has added tests or explained why testing cannot be added (unit tests & storybook for related changes)

If you have UI changes:

Included any necessary screenshots or gifs if it was a UI change.
Included tags to the UX team if it was a UI/UX change.

After the PR is posted & before it merges:

The developer has tested their solution on a cluster by using the image produced by the PR to main

christianvogt · 2023-08-25T20:15:59Z

The fix here isn't ideal as we keep having to expand our search to identify what's going on with the notebook. I've created a feature request for the notebook controller to update the Notebook with relevant status: opendatahub-io/kubeflow#159

lucferbux · 2023-08-28T10:24:31Z

The fix here isn't ideal as we keep having to expand our search to identify what's going on with the notebook. I've created a feature request for the notebook controller to update the Notebook with relevant status: opendatahub-io/kubeflow#159

Yes, I agree, if we start adding a lot of patches we are loosing our track. The main goal of the dashboard is to reflect what's happening in the cluster and add an overlay of features. But if we keep up adding logic to some flow it's a great indication that it might be fixed by other controller.

kywalker-rh · 2023-08-30T19:17:07Z

Do we have any control over the error text presented? Its definitely hard to parse if you don't have much OpenShift/Kubernetes knowledge.

christianvogt · 2023-08-31T14:57:52Z

@kywalker-rh unfortunately the message is just a blob of text. We control the line formatting though:

`${getEventTimestamp(event)} [${event.type}] ${event.message}`

lucferbux

Did a first pass of the code, it looks great, let me test it in my cluster (i need to set it up first cause my last review broke the operator) and i can stamp an approval

lucferbux · 2023-09-06T11:06:37Z

frontend/src/pages/notebookController/screens/server/StartServerModal.tsx

-          </ListItem>
-        ))}
+        {events
+          .slice()


Just out of curiosity, why the slice here?

@christianvogt ~~yeah -- why slice?~~ nvm, I misread the diff -- I thought slice was only used in some cases.

@lucferbux .reverse() mutates the array -- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/reverse

lucferbux

/lgtm
Tested it and it works as expected, managed to get logs for failed allocation but for successful installations too and codes looks great too

openshift-ci · 2023-09-06T11:14:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucferbux

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [lucferbux]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from alexcreasy and Gkrumbach07 August 24, 2023 21:16

christianvogt force-pushed the ss-errors branch from b76f1a8 to 38ff7a6 Compare August 25, 2023 13:07

display stateful set errors when notebook pod fails to create

ae0f2a9

christianvogt force-pushed the ss-errors branch from 38ff7a6 to ae0f2a9 Compare August 25, 2023 13:34

lucferbux reviewed Sep 6, 2023

View reviewed changes

lucferbux approved these changes Sep 6, 2023

View reviewed changes

openshift-ci bot assigned lucferbux Sep 6, 2023

openshift-ci bot added the lgtm label Sep 6, 2023

openshift-ci bot added the approved label Sep 6, 2023

openshift-merge-robot merged commit 6ee8280 into opendatahub-io:main Sep 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

display StatefulSet errors when notebook pod fails to create #1703

display StatefulSet errors when notebook pod fails to create #1703

christianvogt commented Aug 24, 2023 •

edited

Loading

christianvogt commented Aug 25, 2023

lucferbux commented Aug 28, 2023

kywalker-rh commented Aug 30, 2023

christianvogt commented Aug 31, 2023

lucferbux left a comment

lucferbux Sep 6, 2023

andrewballantyne Sep 6, 2023 •

edited

Loading

lucferbux left a comment

openshift-ci bot commented Sep 6, 2023

display StatefulSet errors when notebook pod fails to create #1703

display StatefulSet errors when notebook pod fails to create #1703

Conversation

christianvogt commented Aug 24, 2023 • edited Loading

Description

How Has This Been Tested?

Test Impact

Request review criteria:

christianvogt commented Aug 25, 2023

lucferbux commented Aug 28, 2023

kywalker-rh commented Aug 30, 2023

christianvogt commented Aug 31, 2023

lucferbux left a comment

Choose a reason for hiding this comment

lucferbux Sep 6, 2023

Choose a reason for hiding this comment

andrewballantyne Sep 6, 2023 • edited Loading

Choose a reason for hiding this comment

lucferbux left a comment

Choose a reason for hiding this comment

openshift-ci bot commented Sep 6, 2023

christianvogt commented Aug 24, 2023 •

edited

Loading

andrewballantyne Sep 6, 2023 •

edited

Loading