add ecosystem. #405
+++
title = "MXNet on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "MXNet"
[menu.docs]
parent = "ecosystem"
weight = 3
+++
# MXNet Introduction

MXNet is an open-source deep learning framework designed for efficient and flexible training and deployment of deep neural networks. It supports seamless scaling from a single GPU to multiple GPUs, and further to distributed multi-machine, multi-GPU setups.

# MXNet on Volcano

Combining MXNet with Volcano allows you to fully leverage Kubernetes' container orchestration capabilities and Volcano's batch scheduling functionality to achieve efficient distributed training.
See the [distributed MNIST training example](https://github.com/apache/mxnet/blob/master/example/distributed_training-horovod/gluon_mnist.py) provided by the MXNet team. The example directory contains the following files:

- Dockerfile: Builds the standalone worker image.
- Makefile: Used to build the above image.
- train-mnist-cpu.yaml: Volcano Job specification.
To run the example, edit the image name and version in `train-mnist-cpu.yaml`, then create the Job:

```
kubectl apply -f train-mnist-cpu.yaml -n ${NAMESPACE}
```
Then view the Job status:

```
kubectl -n ${NAMESPACE} describe job.batch.volcano.sh mxnet-job
```
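Before cloning the example, it can help to know what a Volcano Job for MXNet roughly looks like. The sketch below is illustrative only, not the actual `train-mnist-cpu.yaml`: the image name is a placeholder, and the scheduler/server/worker split simply mirrors MXNet's parameter-server architecture. Replace the placeholders with your own build.

```yaml
# Illustrative sketch -- NOT the actual train-mnist-cpu.yaml.
# <your-registry>/mxnet-mnist:latest is a placeholder image name.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job
spec:
  minAvailable: 3            # gang-schedule scheduler + server + worker together
  schedulerName: volcano
  tasks:
  - replicas: 1
    name: scheduler
    template:
      spec:
        containers:
        - name: scheduler
          image: <your-registry>/mxnet-mnist:latest   # placeholder
        restartPolicy: OnFailure
  - replicas: 1
    name: server
    template:
      spec:
        containers:
        - name: server
          image: <your-registry>/mxnet-mnist:latest   # placeholder
        restartPolicy: OnFailure
  - replicas: 1
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: <your-registry>/mxnet-mnist:latest   # placeholder
        restartPolicy: OnFailure
```

The `minAvailable: 3` setting asks Volcano to gang-schedule all three roles at once, so the parameter server and workers never start without each other.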
+++
title = "Argo on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "Argo"
[menu.docs]
parent = "ecosystem"
weight = 3
+++
### Argo Introduction

Argo is an open-source Kubernetes-native workflow engine that allows users to define and execute containerized workflows. The Argo project includes multiple components; Argo Workflows is the core component for orchestrating parallel jobs on Kubernetes, supporting DAG (Directed Acyclic Graph) and step templates.

### Argo on Volcano

By integrating Argo Workflows with Volcano, you can combine the advantages of both: Argo provides powerful workflow orchestration capabilities, while Volcano provides advanced scheduling features.
#### Integration Method

Argo resource templates allow for the creation, deletion, or updating of any type of Kubernetes resource (including CRDs). We can use resource templates to integrate Volcano Jobs into an Argo Workflow, thereby adding job dependency management and DAG flow control capabilities to Volcano.

#### Configuring RBAC Permissions

Before integration, ensure that Argo Workflows has sufficient permissions to manage Volcano resources:
1. Argo Workflows needs to run under a serviceAccount, which can be specified at submission time:

```
argo submit --serviceaccount <name>
```
2. Add Volcano resource management permissions to the serviceAccount:

```yaml
- apiGroups:
  - batch.volcano.sh
  resources:
  - "*"
  verbs:
  - "*"
```
#### Example

Here is an example YAML for creating a Volcano Job using an Argo Workflow:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volcano-job-
spec:
  entrypoint: nginx-tmpl
  serviceAccountName: argo        # Specify service account
  templates:
  - name: nginx-tmpl
    activeDeadlineSeconds: 120    # Limit workflow execution time
    resource:                     # Indicates this is a resource template
      action: create              # kubectl operation type
      successCondition: status.state.phase = Completed
      failureCondition: status.state.phase = Failed
      manifest: |
        apiVersion: batch.volcano.sh/v1alpha1
        kind: Job
        metadata:
          generateName: test-job-
          ownerReferences:        # Add owner references to ensure resource lifecycle management
          - apiVersion: argoproj.io/v1alpha1
            blockOwnerDeletion: true
            kind: Workflow
            name: "{{workflow.name}}"
            uid: "{{workflow.uid}}"
        spec:
          minAvailable: 1
          schedulerName: volcano
          policies:
          - event: PodEvicted
            action: RestartJob
          plugins:
            ssh: []
            env: []
            svc: []
          maxRetry: 5
          queue: default
          tasks:
          - replicas: 2
            name: "default-nginx"
            template:
              metadata:
                name: web
              spec:
                containers:
                - image: nginx:latest
                  imagePullPolicy: IfNotPresent
                  name: nginx
                  resources:
                    requests:
                      cpu: "100m"
                restartPolicy: OnFailure
```
For more information and advanced configurations, see the [Volcano Argo integration examples](https://github.com/volcano-sh/volcano/tree/master/example/integrations/argo).
+++
title = "Cromwell on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "Cromwell"
[menu.docs]
parent = "ecosystem"
weight = 3
+++
# Cromwell Introduction

Cromwell is a workflow management system designed for scientific workflows.

# Cromwell on Volcano

Cromwell can be integrated with Volcano to efficiently schedule and execute bioinformatics workflows in Kubernetes environments.

To make Cromwell interact with a Volcano cluster and dispatch jobs to it, you can use the following basic configuration:
```hocon
Volcano {
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
  config {
    runtime-attributes = """
    Int runtime_minutes = 600
    Int cpus = 2
    Int requested_memory_mb_per_core = 8000
    String queue = "short"
    """

    submit = """
    vcctl job run -f ${script}
    """
    kill = "vcctl job delete -N ${job_id}"
    check-alive = "vcctl job view -N ${job_id}"
    job-id-regex = "(\\d+)"
  }
}
```

Please note that this configuration example is community-contributed and therefore not officially supported.
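For orientation, a backend block like the one above normally sits under Cromwell's `backend.providers` tree and is selected via `backend.default`. A sketch of the surrounding structure (the inner `config` body is elided; paste the configuration shown above in its place):

```hocon
# Sketch of where the Volcano backend fits inside cromwell.conf.
# The provider key "Volcano" must match the name referenced by backend.default.
backend {
  default = "Volcano"
  providers {
    Volcano {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # runtime-attributes, submit, kill, check-alive, job-id-regex as above
      }
    }
  }
}
```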
+++
title = "Horovod on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "Horovod"
[menu.docs]
parent = "ecosystem"
weight = 3
+++
# Horovod Introduction

Horovod is a distributed deep learning training framework compatible with PyTorch, TensorFlow, Keras, and Apache MXNet. With Horovod, existing training scripts can be scaled to run on hundreds of GPUs with just a few lines of Python code, achieving near-linear performance improvements on large-scale GPU clusters.

## Horovod on Volcano

Volcano, as a cloud-native batch system, provides native support for Horovod distributed training jobs. Through Volcano's scheduling capabilities, users can easily deploy and manage Horovod training tasks on Kubernetes clusters.

Below is an example configuration for running Horovod on Volcano:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: lm-horovod-job
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 4
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: master
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            WORKER_HOST=`cat /etc/volcano/worker.host | tr "\n" ","`;
            mkdir -p /var/run/sshd; /usr/sbin/sshd;
            mpiexec --allow-run-as-root --host ${WORKER_HOST} -np 3 python tensorflow_mnist_lm.py;
          image: volcanosh/horovod-tf-mnist:0.5
          name: master
          ports:
          - containerPort: 22
            name: job-port
          resources:
            requests:
              cpu: "500m"
              memory: "1024Mi"
            limits:
              cpu: "500m"
              memory: "1024Mi"
        restartPolicy: OnFailure
        imagePullSecrets:
        - name: default-secret
  - replicas: 3
    name: worker
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
          image: volcanosh/horovod-tf-mnist:0.5
          name: worker
          ports:
          - containerPort: 22
            name: job-port
          resources:
            requests:
              cpu: "1000m"
              memory: "2048Mi"
            limits:
              cpu: "1000m"
              memory: "2048Mi"
        restartPolicy: OnFailure
        imagePullSecrets:
        - name: default-secret
```
In this configuration, we define a Horovod distributed training job with the following key components:

1. Task structure: 1 master node and 3 worker nodes, totaling 4 Pods
2. Communication mechanism: Volcano's SSH plugin handles inter-node communication
3. Resource allocation: the master is allocated fewer resources (500m CPU / 1Gi memory), while the workers receive more (1000m CPU / 2Gi memory)
4. Fault tolerance: when a Pod is evicted, the entire job restarts
5. Job completion policy: when the master task completes, the entire job is marked as complete
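The example above trains on CPU. To run on GPUs instead, the worker container's resource section would additionally request GPU devices. A sketch of that change, assuming the NVIDIA device plugin is installed on the cluster and a CUDA-enabled Horovod image is used:

```yaml
# Sketch: GPU variant of the worker container's resources.
# Assumes the NVIDIA device plugin is deployed on the cluster.
resources:
  requests:
    cpu: "1000m"
    memory: "2048Mi"
    nvidia.com/gpu: 1    # one GPU per worker Pod
  limits:
    cpu: "1000m"
    memory: "2048Mi"
    nvidia.com/gpu: 1    # for extended resources, limits must equal requests
```

Note that Kubernetes requires GPU requests and limits to match, and the `-np` count passed to `mpiexec` should equal the total number of GPU workers.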