
Ray Serve: Fail to create Serve applications #46308

Closed

Galeos93 opened this issue Jun 27, 2024 · 1 comment
Labels
bug: Something that is supposed to be working, but isn't
triage: Needs triage (e.g. priority, bug/not-bug, and owning component)

Comments

@Galeos93

What happened + What you expected to happen

I followed this tutorial to deploy an application using Ray Serve. Upon executing kubectl describe rayservice rayservice-sample, I get the following error events:

  Type    Reason                       Age                  From                   Message
  ----    ------                       ----                 ----                   -------
  Normal  ServiceNotReady              10m (x9 over 10m)    rayservice-controller  The service is not ready yet. Controller will perform a round of actions in 2s.
  Normal  WaitForServeDeploymentReady  47s (x293 over 10m)  rayservice-controller  Fail to create / update Serve applications. If you observe this error consistently, please check "Issue 5: Fail to create / update Serve applications." in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: UpdateDeployments fail: 404 Not Found 404: Not Found

Running kubectl logs kuberay-operator-7f85d8578-mj4bs | tee operator-log, I get the following logs:

{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","RayCluster name":"rayservice-sample-raycluster-gzvwz"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","head service name":"rayservice-sample-raycluster-gzvwz-head-svc","namespace":"default"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","head service URL":"rayservice-sample-raycluster-gzvwz-head-svc.default.svc.cluster.local:8265","port":"dashboard"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"shouldUpdate","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","shouldUpdateServe":true,"reason":"Nothing has been cached for cluster rayservice-sample-raycluster-gzvwz with key default/rayservice-sample/rayservice-sample-raycluster-gzvwz"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"updateServeDeployment","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","V2 config":"applications:\n  - name: text_ml_app\n    import_path: text_ml.app\n    route_prefix: /summarize_translate\n    runtime_env:\n      working_dir: \"https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip\"\n      pip:\n        - torch\n        - transformers\n    deployments:\n      - name: Translator\n        num_replicas: 1\n        ray_actor_options:\n          num_cpus: 0.1\n        user_config:\n          language: french\n      - name: Summarizer\n        num_replicas: 1\n        ray_actor_options:\n          num_cpus: 0.1\n"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"updateServeDeployment","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","MULTI_APP json config":"{\"applications\":[{\"deployments\":[{\"name\":\"Translator\",\"num_replicas\":1,\"ray_actor_options\":{\"num_cpus\":0.1},\"user_config\":{\"language\":\"french\"}},{\"name\":\"Summarizer\",\"num_replicas\":1,\"ray_actor_options\":{\"num_cpus\":0.1}}],\"import_path\":\"text_ml.app\",\"name\":\"text_ml_app\",\"route_prefix\":\"/summarize_translate\",\"runtime_env\":{\"pip\":[\"torch\",\"transformers\"],\"working_dir\":\"https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip\"}}]}"}
{"level":"error","ts":"2024-06-27T20:02:25.034Z","logger":"controllers.RayService","msg":"Fail to reconcileServe.","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","error":"Fail to create / update Serve applications. If you observe this error consistently, please check \"Issue 5: Fail to create / update Serve applications.\" in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: UpdateDeployments fail: 404 Not Found 404: Not Found","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

I expected the RayService to start without issues, as shown in the tutorial.

Versions / Dependencies

  • kuberay/kuberay-operator --version 1.1.1
  • kubectl:
    • Client Version: v1.30.2
    • Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    • Server Version: v1.30.1-eks-1de2ab1
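For reference, these versions can be read back with the following commands (a sketch, assuming the operator was installed via Helm):

# Show the installed kuberay-operator release and chart version (assumes a Helm install).
helm list -A | grep kuberay-operator

# Show the kubectl client, kustomize, and cluster server versions.
kubectl version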

Reproduction script

I followed the tutorial here after deploying an EKS cluster on AWS, using two nodes of type t3.medium. The service configuration I use is not the same as in the tutorial; I have set lower resource requests and limits:

# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  # Only one of serveConfig and serveConfigV2 should be used.
  serveConfigV2: |
    applications:
      - name: text_ml_app
        import_path: text_ml.app
        route_prefix: /summarize_translate
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip"
          pip:
            - torch
            - transformers
        deployments:
          - name: Translator
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
            user_config:
              language: french
          - name: Summarizer
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
  rayClusterConfig:
    rayVersion: '2.6.3' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.6.3
              resources:
                limits:
                  cpu: "500m"
                  memory: 1Gi
                requests:
                  cpu: "500m"
                  memory: 1Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # The number of worker pod replicas in this group.
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical name for this worker group; here it is called small-group, but the name can also be functional.
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.6.3
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "500m"
                    memory: "1Gi"
                  requests:
                    cpu: "500m"
                    memory: "1Gi"

Issue Severity

High: It blocks me from completing my task.

Galeos93 added the bug and triage labels on Jun 27, 2024
@Galeos93 (Author)

After using a newer Ray version (2.9.0), the issue was solved. Here is the YAML I used:

# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  # Only one of serveConfig and serveConfigV2 should be used.
  serveConfigV2: |
    applications:
      - name: text_ml_app
        import_path: text_ml.app
        route_prefix: /summarize_translate
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip"
          pip:
            - torch
            - transformers
        deployments:
          - name: Translator
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.2
            user_config:
              language: french
          - name: Summarizer
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.2
  rayClusterConfig:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # The number of worker pod replicas in this group.
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical name for this worker group; here it is called small-group, but the name can also be functional.
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
