49 changes: 49 additions & 0 deletions content/en/docs/MXNet_on_volcano.md
+++
title = "MXNet on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "MXNet"
[menu.docs]
parent = "zoology"
weight = 3

+++



# MXNet Introduction

MXNet is an open-source deep learning framework designed for efficient and flexible training and deployment of deep neural networks. It supports seamless scaling from a single GPU to multiple GPUs, and further to distributed multi-machine multi-GPU setups.

# MXNet on Volcano

Combining MXNet with Volcano allows you to fully leverage Kubernetes' container orchestration capabilities and Volcano's batch scheduling functionality to achieve efficient distributed training.

See the [distributed training example](https://github.com/apache/mxnet/blob/master/example/distributed_training-horovod/gluon_mnist.py) provided by the MXNet team. The example directory contains the following files:

- Dockerfile: Builds the standalone worker image.
- Makefile: Used to build the above image.
- train-mnist-cpu.yaml: Volcano Job specification.
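
The Volcano Job in `train-mnist-cpu.yaml` follows MXNet's parameter-server layout, with scheduler, server, and worker tasks. A minimal sketch of that shape is below; the image name and replica counts are illustrative, so check the file itself for the exact spec:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []   # injects task host lists as environment variables
    svc: []   # creates headless services so tasks can resolve each other
  tasks:
  - replicas: 1
    name: scheduler
    template:
      spec:
        containers:
        - name: mxnet
          image: <your-image>:<version>   # edit to your built image
        restartPolicy: OnFailure
  - replicas: 1
    name: server
    template:
      spec:
        containers:
        - name: mxnet
          image: <your-image>:<version>
        restartPolicy: OnFailure
  - replicas: 1
    name: worker
    template:
      spec:
        containers:
        - name: mxnet
          image: <your-image>:<version>
        restartPolicy: OnFailure
```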

To run the example, edit the image name and version in `train-mnist-cpu.yaml`. Then create the Job, replacing `${NAMESPACE}` with your target Kubernetes namespace:

```shell
kubectl apply -f train-mnist-cpu.yaml -n ${NAMESPACE}
```

Then check the Job status with:

```shell
kubectl -n ${NAMESPACE} describe job.batch.volcano.sh mxnet-job
```
113 changes: 113 additions & 0 deletions content/en/docs/argo_on_volcano.md
+++
title = "Argo on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Argo"
[menu.docs]
parent = "zoology"
weight = 3


+++



### Argo Introduction

Argo is an open-source Kubernetes native workflow engine that allows users to define and execute containerized workflows. The Argo project includes multiple components, with Argo Workflows being the core component used for orchestrating parallel jobs on Kubernetes, supporting DAG (Directed Acyclic Graph) and step templates.

### Argo on Volcano

By integrating Argo Workflows with Volcano, you can combine the strengths of both: Argo provides powerful workflow orchestration capabilities, while Volcano provides advanced batch scheduling features.

#### Integration Method

Argo resource templates allow for the creation, deletion, or updating of any type of Kubernetes resource (including CRDs). We can use resource templates to integrate Volcano Jobs into Argo Workflow, thereby adding job dependency management and DAG flow control capabilities to Volcano.

#### Configuring RBAC Permissions

Before integration, ensure that Argo Workflow has sufficient permissions to manage Volcano resources:

1. Argo Workflows must run under a serviceAccount, which can be specified at submission time:

```shell
argo submit --serviceaccount <name>
```

2. Add Volcano resource management permissions to the serviceAccount:

```yaml
- apiGroups:
  - batch.volcano.sh
  resources:
  - "*"
  verbs:
  - "*"
```
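
In practice, the rule above has to live in a Role (or ClusterRole) bound to that serviceAccount. A minimal sketch, assuming the serviceAccount is named `argo` in namespace `argo` (both names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-volcano-jobs
  namespace: argo
rules:
- apiGroups:
  - batch.volcano.sh
  resources:
  - "*"
  verbs:
  - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-volcano-jobs
  namespace: argo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-volcano-jobs
subjects:
- kind: ServiceAccount
  name: argo
  namespace: argo
```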

#### Example

Here is an example YAML for creating a Volcano Job using Argo Workflow:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: volcano-job-
spec:
entrypoint: nginx-tmpl
serviceAccountName: argo # Specify service account
templates:
- name: nginx-tmpl
activeDeadlineSeconds: 120 # Limit workflow execution time
resource: # Indicates this is a resource template
action: create # kubectl operation type
successCondition: status.state.phase = Completed
failureCondition: status.state.phase = Failed
manifest: |
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
generateName: test-job-
ownerReferences: # Add owner references to ensure resource lifecycle management
- apiVersion: argoproj.io/v1alpha1
blockOwnerDeletion: true
kind: Workflow
name: "{{workflow.name}}"
uid: "{{workflow.uid}}"
spec:
minAvailable: 1
schedulerName: volcano
policies:
- event: PodEvicted
action: RestartJob
plugins:
ssh: []
env: []
svc: []
maxRetry: 5
queue: default
tasks:
- replicas: 2
name: "default-nginx"
template:
metadata:
name: web
spec:
containers:
- image: nginx:latest
imagePullPolicy: IfNotPresent
name: nginx
resources:
requests:
cpu: "100m"
restartPolicy: OnFailure
```
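
Assuming the workflow manifest above is saved as `volcano-wf.yaml` (the filename is illustrative), it can be submitted and followed with the Argo CLI:

```shell
# Submit under the serviceAccount that carries the Volcano RBAC rule
argo submit --serviceaccount argo volcano-wf.yaml --watch

# List the Volcano Jobs the workflow created
kubectl get jobs.batch.volcano.sh
```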

For more information and advanced configurations, see the [Volcano Argo integration examples](https://github.com/volcano-sh/volcano/tree/master/example/integrations/argo).

52 changes: 52 additions & 0 deletions content/en/docs/cromwell_on_volcano.md
+++
title = "Cromwell on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Cromwell"
[menu.docs]
parent = "zoology"
weight = 3

+++



# Cromwell Introduction

Cromwell is an open-source workflow management system developed at the Broad Institute and geared toward scientific workflows; it executes workflows written in WDL or CWL.

# Cromwell on Volcano

Cromwell can be integrated with Volcano to efficiently schedule and execute bioinformatics workflows in Kubernetes environments.

To make Cromwell interact with a Volcano cluster and dispatch jobs to it, you can use the following basic configuration:

```hocon
Volcano {
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
  config {
    runtime-attributes = """
    Int runtime_minutes = 600
    Int cpus = 2
    Int requested_memory_mb_per_core = 8000
    String queue = "short"
    """

    submit = """
    vcctl job run -f ${script}
    """
    kill = "vcctl job delete -N ${job_id}"
    check-alive = "vcctl job view -N ${job_id}"
    job-id-regex = "(\\d+)"
  }
}
```

Please note that this configuration example is community-contributed and therefore not officially supported.
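
The stanza above is a backend definition: in a full Cromwell configuration it sits under `backend.providers`, and the backend is selected via `backend.default`. A minimal sketch of the surrounding structure, assuming the stanza is pasted in unchanged:

```hocon
backend {
  default = "Volcano"
  providers {
    Volcano {
      # ... the backend definition shown above ...
    }
  }
}
```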
111 changes: 111 additions & 0 deletions content/en/docs/horovod_on_volcano.md
+++
title = "Horovod on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
type = "docs" # Do not modify.

# Add menu entry to sidebar.
linktitle = "Horovod"
[menu.docs]
parent = "zoology"
weight = 3

+++



# Horovod Introduction

Horovod is a distributed deep learning training framework compatible with PyTorch, TensorFlow, Keras, and Apache MXNet. With Horovod, existing training scripts can be scaled to run on hundreds of GPUs with just a few lines of Python code. It achieves near-linear performance improvements on large-scale GPU clusters.

## Horovod on Volcano

Volcano, as a cloud-native batch system, provides native support for Horovod distributed training jobs. Through Volcano's scheduling capabilities, users can easily deploy and manage Horovod training tasks on Kubernetes clusters.

Below is an example configuration for running Horovod on Volcano:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: lm-horovod-job
labels:
"volcano.sh/job-type": Horovod
spec:
minAvailable: 4
schedulerName: volcano
plugins:
ssh: []
svc: []
policies:
- event: PodEvicted
action: RestartJob
tasks:
- replicas: 1
name: master
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- command:
- /bin/sh
- -c
- |
WORKER_HOST=`cat /etc/volcano/worker.host | tr "\n" ","`;
mkdir -p /var/run/sshd; /usr/sbin/sshd;
mpiexec --allow-run-as-root --host ${WORKER_HOST} -np 3 python tensorflow_mnist_lm.py;
image: volcanosh/horovod-tf-mnist:0.5
name: master
ports:
- containerPort: 22
name: job-port
resources:
requests:
cpu: "500m"
memory: "1024Mi"
limits:
cpu: "500m"
memory: "1024Mi"
restartPolicy: OnFailure
imagePullSecrets:
- name: default-secret
- replicas: 3
name: worker
template:
spec:
containers:
- command:
- /bin/sh
- -c
- |
mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
image: volcanosh/horovod-tf-mnist:0.5
name: worker
ports:
- containerPort: 22
name: job-port
resources:
requests:
cpu: "1000m"
memory: "2048Mi"
limits:
cpu: "1000m"
memory: "2048Mi"
restartPolicy: OnFailure
imagePullSecrets:
- name: default-secret
```

In this configuration, we define a Horovod distributed training job with the following key components:

1. Task structure: 1 master Pod and 3 worker Pods, 4 Pods in total (matching `minAvailable: 4`)
2. Communication: Volcano's SSH plugin enables passwordless inter-Pod communication
3. Resource allocation: the master requests fewer resources (500m CPU / 1Gi memory) than each worker (1000m CPU / 2Gi memory)
4. Fault tolerance: when a Pod is evicted, the entire job restarts (`PodEvicted` → `RestartJob`)
5. Completion policy: when the master task completes, the whole job is marked complete (`TaskCompleted` → `CompleteJob`)
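
Assuming the manifest is saved as `horovod-job.yaml` (the filename is illustrative), the job can be created and monitored with standard kubectl commands:

```shell
kubectl apply -f horovod-job.yaml

# Job-level status: phase, per-task counts, scheduling events
kubectl describe jobs.batch.volcano.sh lm-horovod-job

# Pods belonging to the job (Volcano labels them with volcano.sh/job-name)
kubectl get pods -l volcano.sh/job-name=lm-horovod-job
```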