Site updated: 2024-09-13 16:32:16

SilentEchoe · Sep 13, 2024 · 8ec014c
1 parent 178367b
commit 8ec014c
Showing 1 changed file with 4 additions and 6 deletions.
diff --git a/2024/09/06/Kubernetes GPU 管理/index.html b/2024/09/06/Kubernetes GPU 管理/index.html
@@ -248,13 +248,11 @@ <h3 id="GPU-Share"><a href="#GPU-Share" class="headerlink" title="GPU Share"></a
 <p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913105232561.png" alt="image-20240913105232561"></p>
 <p>The project's README shows that its sharing solution is built on <a target="_blank" rel="noopener" href="https://github.com/NVIDIA/nvidia-docker">Nvidia Docker2</a> (a project that has since been archived) and was implemented with reference to the <a target="_blank" rel="noopener" href="https://docs.google.com/document/d/1ZgKH_K4SEfdiE_OfxQ836s4yQWxZfSjS288Tq9YIWCA/edit#heading=h.r88v2xgacqr">GPU sharing design</a> document. Although <a target="_blank" rel="noopener" href="https://github.com/AliyunContainerService/gpushare-device-plugin">gpushare-device-plugin</a> is no longer maintained, its source code shows that its implementation follows an approach similar to that of **<a target="_blank" rel="noopener" href="https://github.com/NVIDIA/k8s-device-plugin">k8s-device-plugin</a>**. The overall process has roughly three steps:</p>
 <ol>
-<li><p>Query all GPU device IDs; if any are found, update the Pod's Spec</p>
-<p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113457790.png" alt="image-20240913113457790"></p>
-</li>
-<li><p>Bind the Pod to the designated Node</p>
-</li>
+<li>Query all GPU device IDs; if any are found, update the Pod's Spec<img src="/Users/kai/Library/Application Support/typora-user-images/image-20240913163041142.png" alt="image-20240913163041142"  /></li>
+<li>Bind the Pod to the designated Node</li>
 </ol>
-<p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113535393.png" alt="image-20240913113535393"></p>
+<img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113535393.png" alt="image-20240913113535393"  />
+
 <p>3. If the Pod is updated successfully, update the device information</p>
 <p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113724395.png" alt="image-20240913113724395"></p>
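 <p>Mounting one device into multiple Pods at the same time is a simple way to let small tasks share a GPU. GPU Share uses a plugin and a scheduler that dynamically iterate over the GPU memory information of all Pods to compute how much GPU resource can still be scheduled, and it allocates resources by way of extensions. This approach is cheap to implement, but it has a drawback: GPU resources are not isolated.</p>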
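The accounting described in the last paragraph can be illustrated with a minimal sketch. It is not code from gpushare-device-plugin: the `Pod` fields, the function names, and the tightest-fit choice are all assumptions made for illustration. It only shows the idea of iterating over every Pod's GPU memory request to work out which GPU can still accept a new Pod.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pod:
    # All fields are hypothetical; they stand in for whatever the real
    # plugin records about a Pod's GPU assignment.
    name: str
    gpu_index: int    # GPU the Pod was assigned to
    gpu_mem_mib: int  # GPU memory the Pod requested, in MiB


def free_memory(capacity_mib: Dict[int, int], pods: List[Pod]) -> Dict[int, int]:
    """Subtract every Pod's request from the total memory of its GPU."""
    free = dict(capacity_mib)
    for pod in pods:
        free[pod.gpu_index] -= pod.gpu_mem_mib
    return free


def pick_gpu(capacity_mib: Dict[int, int], pods: List[Pod],
             request_mib: int) -> Optional[int]:
    """Return the GPU with the least free memory that still fits the request.

    Tightest fit is an assumed policy, not necessarily the project's own.
    Returns None when no GPU has enough free memory.
    """
    free = free_memory(capacity_mib, pods)
    candidates = [(mem, idx) for idx, mem in free.items() if mem >= request_mib]
    if not candidates:
        return None
    return min(candidates)[1]


# Two 16 GiB GPUs on one node, with two Pods already placed.
capacity = {0: 16384, 1: 16384}
running = [Pod("train-a", 0, 10000), Pod("infer-b", 1, 4000)]
chosen = pick_gpu(capacity, running, 6000)
```

Note that this is bookkeeping only: nothing in it stops a Pod from using more GPU memory than it requested, which is the isolation gap mentioned above.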