49 changes: 49 additions & 0 deletions content/en/docs/MXNet_on_volcano.md
+++
title = "MXNet on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "MXNet"
[menu.docs]
parent = "zoology"
weight = 3

+++



# MXNet Introduction

MXNet is an open-source deep learning framework designed for efficient and flexible training and deployment of deep neural networks. It supports seamless scaling from a single GPU to multiple GPUs, and further to distributed multi-machine multi-GPU setups.

# MXNet on Volcano

Combining MXNet with Volcano allows you to fully leverage Kubernetes' container orchestration capabilities and Volcano's batch scheduling functionality to achieve efficient distributed training.

See the [distributed training example](https://github.com/apache/mxnet/blob/master/example/distributed_training-horovod/gluon_mnist.py) provided by the MXNet team. The example directory contains the following files:

- Dockerfile: Builds the standalone worker image.
- Makefile: Used to build the above image.
- train-mnist-cpu.yaml: Volcano Job specification.
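
The Volcano Job in `train-mnist-cpu.yaml` follows MXNet's parameter-server layout, with scheduler, server, and worker tasks. A minimal sketch of that shape is below; the image name and replica counts are illustrative, so check the file itself for the exact spec:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []   # injects task host lists as environment variables
    svc: []   # creates headless services so tasks can resolve each other
  tasks:
  - replicas: 1
    name: scheduler
    template:
      spec:
        containers:
        - name: mxnet
          image: <your-image>:<version>   # edit to your built image
        restartPolicy: OnFailure
  - replicas: 1
    name: server
    template:
      spec:
        containers:
        - name: mxnet
          image: <your-image>:<version>
        restartPolicy: OnFailure
  - replicas: 1
    name: worker
    template:
      spec:
        containers:
        - name: mxnet
          image: <your-image>:<version>
        restartPolicy: OnFailure
```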

To run the example, edit the image name and version in `train-mnist-cpu.yaml`. Then create the Job, replacing `${NAMESPACE}` with your target Kubernetes namespace:

```shell
kubectl apply -f train-mnist-cpu.yaml -n ${NAMESPACE}
```

Then check the Job status with:

```shell
kubectl -n ${NAMESPACE} describe job.batch.volcano.sh mxnet-job
```
113 changes: 113 additions & 0 deletions content/en/docs/argo_on_volcano.md
+++
title = "Argo on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Argo"
[menu.docs]
parent = "zoology"
weight = 3


+++



### Argo Introduction

Argo is an open-source Kubernetes native workflow engine that allows users to define and execute containerized workflows. The Argo project includes multiple components, with Argo Workflows being the core component used for orchestrating parallel jobs on Kubernetes, supporting DAG (Directed Acyclic Graph) and step templates.

### Argo on Volcano

By integrating Argo Workflows with Volcano, you can combine the strengths of both: Argo provides powerful workflow orchestration capabilities, while Volcano provides advanced batch scheduling features.

#### Integration Method

Argo resource templates allow for the creation, deletion, or updating of any type of Kubernetes resource (including CRDs). We can use resource templates to integrate Volcano Jobs into Argo Workflow, thereby adding job dependency management and DAG flow control capabilities to Volcano.

#### Configuring RBAC Permissions

Before integration, ensure that Argo Workflow has sufficient permissions to manage Volcano resources:

1. Argo Workflows must run under a serviceAccount, which can be specified at submission time:

```shell
argo submit --serviceaccount <name>
```

2. Add Volcano resource management permissions to the serviceAccount:

```yaml
- apiGroups:
  - batch.volcano.sh
  resources:
  - "*"
  verbs:
  - "*"
```
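
In practice, the rule above has to live in a Role (or ClusterRole) bound to that serviceAccount. A minimal sketch, assuming the serviceAccount is named `argo` in namespace `argo` (both names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-volcano-jobs
  namespace: argo
rules:
- apiGroups:
  - batch.volcano.sh
  resources:
  - "*"
  verbs:
  - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-volcano-jobs
  namespace: argo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-volcano-jobs
subjects:
- kind: ServiceAccount
  name: argo
  namespace: argo
```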

#### Example

Here is an example YAML for creating a Volcano Job using Argo Workflow:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: volcano-job-
spec:
entrypoint: nginx-tmpl
serviceAccountName: argo # Specify service account
templates:
- name: nginx-tmpl
activeDeadlineSeconds: 120 # Limit workflow execution time
resource: # Indicates this is a resource template
action: create # kubectl operation type
successCondition: status.state.phase = Completed
failureCondition: status.state.phase = Failed
manifest: |
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
generateName: test-job-
ownerReferences: # Add owner references to ensure resource lifecycle management
- apiVersion: argoproj.io/v1alpha1
blockOwnerDeletion: true
kind: Workflow
name: "{{workflow.name}}"
uid: "{{workflow.uid}}"
spec:
minAvailable: 1
schedulerName: volcano
policies:
- event: PodEvicted
action: RestartJob
plugins:
ssh: []
env: []
svc: []
maxRetry: 5
queue: default
tasks:
- replicas: 2
name: "default-nginx"
template:
metadata:
name: web
spec:
containers:
- image: nginx:latest
imagePullPolicy: IfNotPresent
name: nginx
resources:
requests:
cpu: "100m"
restartPolicy: OnFailure
```
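
Assuming the workflow manifest above is saved as `volcano-wf.yaml` (the filename is illustrative), it can be submitted and followed with the Argo CLI:

```shell
# Submit under the serviceAccount that carries the Volcano RBAC rule
argo submit --serviceaccount argo volcano-wf.yaml --watch

# List the Volcano Jobs the workflow created
kubectl get jobs.batch.volcano.sh
```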

For more information and advanced configurations, see the [Volcano Argo integration examples](https://github.com/volcano-sh/volcano/tree/master/example/integrations/argo).

52 changes: 52 additions & 0 deletions content/en/docs/cromwell_on_volcano.md
+++
title = "Cromwell on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Cromwell"
[menu.docs]
parent = "zoology"
weight = 3

+++



# Cromwell Introduction

Cromwell is an open-source workflow management system developed at the Broad Institute and geared toward scientific workflows; it executes workflows written in WDL or CWL.

# Cromwell on Volcano

Cromwell can be integrated with Volcano to efficiently schedule and execute bioinformatics workflows in Kubernetes environments.

To make Cromwell interact with a Volcano cluster and dispatch jobs to it, you can use the following basic configuration:

```hocon
Volcano {
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
  config {
    runtime-attributes = """
    Int runtime_minutes = 600
    Int cpus = 2
    Int requested_memory_mb_per_core = 8000
    String queue = "short"
    """

    submit = """
    vcctl job run -f ${script}
    """
    kill = "vcctl job delete -N ${job_id}"
    check-alive = "vcctl job view -N ${job_id}"
    job-id-regex = "(\\d+)"
  }
}
```

Please note that this configuration example is community-contributed and therefore not officially supported.
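
The stanza above is a backend definition: in a full Cromwell configuration it sits under `backend.providers`, and the backend is selected via `backend.default`. A minimal sketch of the surrounding structure, assuming the stanza is pasted in unchanged:

```hocon
backend {
  default = "Volcano"
  providers {
    Volcano {
      # ... the backend definition shown above ...
    }
  }
}
```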
111 changes: 111 additions & 0 deletions content/en/docs/horovod_on_volcano.md
+++
title = "Horovod on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Horovod"
[menu.docs]
parent = "zoology"
weight = 3

+++



# Horovod Introduction

Horovod is a distributed deep learning training framework compatible with PyTorch, TensorFlow, Keras, and Apache MXNet. With Horovod, existing training scripts can be scaled to run on hundreds of GPUs with just a few lines of Python code. It achieves near-linear performance improvements on large-scale GPU clusters.

## Horovod on Volcano

Volcano, as a cloud-native batch system, provides native support for Horovod distributed training jobs. Through Volcano's scheduling capabilities, users can easily deploy and manage Horovod training tasks on Kubernetes clusters.

Below is an example configuration for running Horovod on Volcano:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: lm-horovod-job
labels:
"volcano.sh/job-type": Horovod
spec:
minAvailable: 4
schedulerName: volcano
plugins:
ssh: []
svc: []
policies:
- event: PodEvicted
action: RestartJob
tasks:
- replicas: 1
name: master
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- command:
- /bin/sh
- -c
- |
WORKER_HOST=`cat /etc/volcano/worker.host | tr "\n" ","`;
mkdir -p /var/run/sshd; /usr/sbin/sshd;
mpiexec --allow-run-as-root --host ${WORKER_HOST} -np 3 python tensorflow_mnist_lm.py;
image: volcanosh/horovod-tf-mnist:0.5
name: master
ports:
- containerPort: 22
name: job-port
resources:
requests:
cpu: "500m"
memory: "1024Mi"
limits:
cpu: "500m"
memory: "1024Mi"
restartPolicy: OnFailure
imagePullSecrets:
- name: default-secret
- replicas: 3
name: worker
template:
spec:
containers:
- command:
- /bin/sh
- -c
- |
mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
image: volcanosh/horovod-tf-mnist:0.5
name: worker
ports:
- containerPort: 22
name: job-port
resources:
requests:
cpu: "1000m"
memory: "2048Mi"
limits:
cpu: "1000m"
memory: "2048Mi"
restartPolicy: OnFailure
imagePullSecrets:
- name: default-secret
```

In this configuration, we define a Horovod distributed training job with the following key components:

1. Task structure: 1 master Pod and 3 worker Pods, 4 Pods in total (matching `minAvailable: 4`)
2. Communication: Volcano's SSH plugin enables passwordless inter-Pod communication
3. Resource allocation: the master requests fewer resources (500m CPU / 1Gi memory) than each worker (1000m CPU / 2Gi memory)
4. Fault tolerance: when a Pod is evicted, the entire job restarts (`PodEvicted` → `RestartJob`)
5. Completion policy: when the master task completes, the whole job is marked complete (`TaskCompleted` → `CompleteJob`)
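
Assuming the manifest is saved as `horovod-job.yaml` (the filename is illustrative), the job can be created and monitored with standard kubectl commands:

```shell
kubectl apply -f horovod-job.yaml

# Job-level status: phase, per-task counts, scheduling events
kubectl describe jobs.batch.volcano.sh lm-horovod-job

# Pods belonging to the job (Volcano labels them with volcano.sh/job-name)
kubectl get pods -l volcano.sh/job-name=lm-horovod-job
```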