
Commit 6ebe33e

Provide GPU version of lifelong cityscapes example.
Signed-off-by: Jie Pu <[email protected]>
1 parent 3e8de61 commit 6ebe33e


3 files changed (+162 −0 lines)


examples/README.md (+2)
@@ -22,6 +22,8 @@ Example: [Using Federated Learning Job in Surface Defect Detection Scenario](./f

### Lifelong Learning

Example: [Using Lifelong Learning Job in Thermal Comfort Prediction Scenario](./lifelong_learning/atcii/README.md)

Example: [Using Lifelong Learning in Campus Robot Delivery Scenario](./lifelong_learning/cityscapes/README.md)

### Multi-Edge Inference

Example: [Using ReID to Track an Infected COVID-19 Carrier in Pandemic Scenario](./multiedgeinference/pedestrian_tracking/README.md)

examples/lifelong_learning/cityscapes/cityscapes-segmentation-lifelong-learning-tutorial.md (+17)
@@ -211,6 +211,23 @@ spec:
EOF
```

### GPU enabled (optional)
If you want GPUs to accelerate training or inference in Sedna, you can follow the steps below to enable GPU support:

> 1. Follow the instructions in [nvidia-device-plugin](https://github.com/NVIDIA/k8s-device-plugin#quick-start) to make nvidia-docker the docker runtime.
> 2. Set the config `devicePluginEnabled` to `true` and restart edgecore on the GPU edge node.
> 3. Deploy the [device-plugin daemonset](https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes) and check that the device-plugin pod is running on the GPU edge node.
> 4. Check the capacity and allocatable GPU resources in the GPU edge node status (see the sketch after this list).
> 5. Deploy the [cuda-add pod](https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes) and wait for it to reach the Running state; this can take a while since the cuda-add image is 1.97 GB.
> 6. Check the cuda-add pod status; a "Test PASSED" message in its log means the GPU has been enabled successfully.

The discussion can be found in this [issue](https://github.com/kubeedge/kubeedge/issues/2324#issuecomment-726645832).
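As a reference for steps 2, 4 and 6, here is a minimal verification sketch. It assumes the edgecore config lives at `/etc/kubeedge/config/edgecore.yaml`, that edgecore runs as a systemd service, and it uses `gpu-edge-node` and `cuda-add` as placeholder node and pod names; none of these values come from this commit, so adjust them to your environment.

```shell
# Step 2 (on the GPU edge node): turn on the device plugin in the edgecore config
# and restart edgecore. The config path and key location can differ between KubeEdge versions.
sudo sed -i 's/devicePluginEnabled: false/devicePluginEnabled: true/' /etc/kubeedge/config/edgecore.yaml
sudo systemctl restart edgecore

# Step 4 (from the cloud side): the GPU should appear in the node's capacity and allocatable.
# "gpu-edge-node" is a placeholder for your GPU edge node name.
kubectl describe node gpu-edge-node | grep -A 6 -E "Capacity|Allocatable"

# Step 6: the test pod should log "Test PASSED" once the GPU is usable.
kubectl logs cuda-add
```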
Once the GPU plugin has been enabled, you can use the [robot-dog-delivery-gpu.yaml](./yaml/robot-dog-delivery-gpu.yaml) configuration to create and run the lifelong learning job.

GPU support in other Sedna features can be enabled in the same way.

## 1.5 Check Lifelong Learning Job
**(1). Query lifelong learning service status**

examples/lifelong_learning/cityscapes/yaml/robot-dog-delivery-gpu.yaml (+143)

@@ -0,0 +1,143 @@
apiVersion: sedna.io/v1alpha1
kind: LifelongLearningJob
metadata:
  name: $job_name
spec:
  dataset:
    name: "lifelong-robo-dataset"
    trainProb: 0.8
  trainSpec:
    template:
      spec:
        nodeName: $TRAIN_NODE
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - image: $cloud_image
            name: train-worker
            imagePullPolicy: IfNotPresent
            args: [ "train.py" ]
            env:
              - name: "num_class"
                value: "24"
              - name: "epoches"
                value: "1"
              - name: "attribute"
                value: "real, sim"
              - name: "city"
                value: "berlin"
              - name: "BACKEND_TYPE"
                value: "PYTORCH"
              - name: "train_ratio"
                value: "0.9"
              - name: "gpu_ids"
                value: "0"
            resources:
              limits:
                nvidia.com/gpu: 1  # requesting 1 GPU
                cpu: 6
                memory: 12Gi
              requests:
                cpu: 4
                memory: 12Gi
                nvidia.com/gpu: 1  # requesting 1 GPU
            volumeMounts:
              - mountPath: /dev/shm
                name: cache-volume
        volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 256Mi
            name: cache-volume
    trigger:
      checkPeriodSeconds: 30
      timer:
        start: 00:00
        end: 24:00
      condition:
        operator: ">"
        threshold: 100
        metric: num_of_samples
  evalSpec:
    template:
      spec:
        nodeName: $EVAL_NODE
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - image: $cloud_image
            name: eval-worker
            imagePullPolicy: IfNotPresent
            args: [ "evaluate.py" ]
            env:
              - name: "operator"
                value: "<"
              - name: "model_threshold"
                value: "0"
              - name: "num_class"
                value: "24"
              - name: "BACKEND_TYPE"
                value: "PYTORCH"
              - name: "gpu_ids"
                value: "0"
            resources:
              limits:
                cpu: 6
                memory: 12Gi
                nvidia.com/gpu: 1  # requesting 1 GPU
              requests:
                cpu: 4
                memory: 12Gi
                nvidia.com/gpu: 1  # requesting 1 GPU
  deploySpec:
    template:
      spec:
        nodeName: $INFER_NODE
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        containers:
          - image: $edge_image
            name: infer-worker
            imagePullPolicy: IfNotPresent
            args: [ "predict.py" ]
            env:
              - name: "test_data"
                value: "/data/test_data"
              - name: "num_class"
                value: "24"
              - name: "unseen_save_url"
                value: "/data/unseen_samples"
              - name: "INFERENCE_RESULT_DIR"
                value: "/data/infer_results"
              - name: "BACKEND_TYPE"
                value: "PYTORCH"
              - name: "gpu_ids"
                value: "0"
            volumeMounts:
              - name: unseenurl
                mountPath: /data/unseen_samples
              - name: inferdata
                mountPath: /data/infer_results
              - name: testdata
                mountPath: /data/test_data
            resources:
              limits:
                cpu: 6
                memory: 12Gi
                nvidia.com/gpu: 1  # requesting 1 GPU
              requests:
                cpu: 4
                memory: 12Gi
                nvidia.com/gpu: 1  # requesting 1 GPU
        volumes:
          - name: unseenurl
            hostPath:
              path: /data/unseen_samples
              type: DirectoryOrCreate
          - name: inferdata
            hostPath:
              path: /data/infer_results
              type: DirectoryOrCreate
          - name: testdata
            hostPath:
              path: /data/test_data
              type: DirectoryOrCreate
  outputDir: $OUTPUT/$job_name
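The manifest above uses shell-style placeholders (`$job_name`, `$TRAIN_NODE`, `$EVAL_NODE`, `$INFER_NODE`, `$cloud_image`, `$edge_image`, `$OUTPUT`) that must be filled in before the job is created. Below is a minimal sketch of one way to do that with `envsubst`; the exported values are illustrative placeholders, not values taken from this commit, and the tutorial itself may instead inline the manifest in a `kubectl create -f - <<EOF` heredoc.

```shell
# Placeholder values for illustration only; replace with your own node names, images and paths.
export job_name=robot-dog-delivery-gpu
export TRAIN_NODE=cloud-node
export EVAL_NODE=cloud-node
export INFER_NODE=edge-node
export cloud_image=example-registry/segmentation-train:latest
export edge_image=example-registry/segmentation-infer:latest
export OUTPUT=/data/lifelong-output

# Substitute the $variables and create the LifelongLearningJob.
envsubst < robot-dog-delivery-gpu.yaml | kubectl create -f -

# Check that the job and its workers are scheduled onto the GPU nodes.
kubectl get lifelonglearningjob $job_name
kubectl get pods -o wide
```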
