Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

display StatefulSet errors when notebook pod fails to create #1703

Merged
merged 1 commit into from
Sep 6, 2023

Conversation

christianvogt
Copy link
Contributor

@christianvogt christianvogt commented Aug 24, 2023

Closes: #793

Description

When LimitRange prevents the creation of notebook server pods, no errors are displayed in the UI. The error occurs on the StatefulSet. This change reads events from the StatefulSet when no pod is present. Once a pod is present, we go back to reading the pod events as was done previously.

Added a new error message Failed to create pod to represent this state (see screen shots below).
cc @kywalker-rh

How Has This Been Tested?

Limit Range YAML
kind: LimitRange
apiVersion: v1
metadata:
  name: limit-range
  namespace: <namespace>
spec:
  limits:
    - type: Container
      max:
        cpu: 50m
        memory: 50Mi
      default:
        cpu: 50m
        memory: 50Mi
      defaultRequest:
        cpu: 50m
        memory: 50Mi
    - type: Pod
      max:
        cpu: 100m
        memory: 50Mi

Workbench:

  • Create a data science project
  • Create a LimitRange in the newly created project / namespace
  • Create a workbench whose container size will be limited by the LimitRange
  • Click on the Starting... text in the workbench status column to open the status popup
  • Click Event log
  • Wait for an error state to appear describing resource limits with the overall error message of Failed to create pod
    image
    image
  • Close the modal
  • Toggle the status to off to stop the notebook
  • Toggle the status to on to start the notebook and observe the same error state
  • Toggle the status to off
  • Delete the LimitRange resource
  • Toggle the status to on
  • Observe the notebook starts successfully. Viewing the event log during this process shows pod events accordingly.

Jupyter notebook:

  • Create a LimitRange in default ODH namespace. eg. opendatahub
  • Click on Applications => Enabled in the left nav
  • on the Jupyter card, click Launch application
  • Start a notebook server whose container size will be limited by the LimitRange
  • Wait for an error state to appear describing resource limits with the overall error message of Failed to create pod
    image
  • Click Cancel
  • Delete the LimitRange resource
  • Click Start server
  • Observe the notebook starts successfully. Viewing the event log during this process shows pod events accordingly.

Test Impact

no new tests

Request review criteria:

Self checklist (all need to be checked):

  • The developer has manually tested the changes and verified that the changes work
  • Commits have been squashed into descriptive, self-contained units of work (e.g. 'WIP' and 'Implements feedback' style messages have been removed)
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has added tests or explained why testing cannot be added (unit tests & storybook for related changes)

If you have UI changes:

  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change.

After the PR is posted & before it merges:

  • The developer has tested their solution on a cluster by using the image produced by the PR to main

@christianvogt
Copy link
Contributor Author

The fix here isn't ideal as we keep having to expand our search to identify what's going on with the notebook. I've created a feature request for the notebook controller to update the Notebook with relevant status: opendatahub-io/kubeflow#159

@lucferbux
Copy link
Contributor

The fix here isn't ideal as we keep having to expand our search to identify what's going on with the notebook. I've created a feature request for the notebook controller to update the Notebook with relevant status: opendatahub-io/kubeflow#159

Yes, I agree, if we start adding a lot of patches we are loosing our track. The main goal of the dashboard is to reflect what's happening in the cluster and add an overlay of features. But if we keep up adding logic to some flow it's a great indication that it might be fixed by other controller.

@kywalker-rh
Copy link

Do we have any control over the error text presented? Its definitely hard to parse if you don't have much OpenShift/Kubernetes knowledge.

@christianvogt
Copy link
Contributor Author

@kywalker-rh unfortunately the message is just a blob of text. We control the line formatting though:

`${getEventTimestamp(event)} [${event.type}] ${event.message}`

Copy link
Contributor

@lucferbux lucferbux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass of the code, it looks great, let me test it in my cluster (i need to set it up first cause my last review broke the operator) and i can stamp an approval

</ListItem>
))}
{events
.slice()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, why the slice here?

Copy link
Member

@andrewballantyne andrewballantyne Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christianvogt yeah -- why slice? nvm, I misread the diff -- I thought slice was only used in some cases.

@lucferbux .reverse() mutates the array -- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/reverse

Copy link
Contributor

@lucferbux lucferbux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Tested it and it works as expected, managed to get logs for failed allocation but for successful installations too and codes looks great too

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucferbux

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Sep 6, 2023
@openshift-merge-robot openshift-merge-robot merged commit 6ee8280 into opendatahub-io:main Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Launcher only reports Pod errors/log whereas others can happen
5 participants