Site updated: 2024-09-13 16:32:16
SilentEchoe committed Sep 13, 2024
1 parent 178367b commit 8ec014c
Showing 1 changed file with 4 additions and 6 deletions.
10 changes: 4 additions & 6 deletions 2024/09/06/Kubernetes GPU 管理/index.html
@@ -248,13 +248,11 @@ <h3 id="GPU-Share"><a href="#GPU-Share" class="headerlink" title="GPU Share"></a
<p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913105232561.png" alt="image-20240913105232561"></p>
<p>The project's README shows that its sharing solution is built on <a target="_blank" rel="noopener" href="https://github.com/NVIDIA/nvidia-docker">Nvidia Docker2</a> (that project has since been archived) and was implemented with reference to the <a target="_blank" rel="noopener" href="https://docs.google.com/document/d/1ZgKH_K4SEfdiE_OfxQ836s4yQWxZfSjS288Tq9YIWCA/edit#heading=h.r88v2xgacqr">GPU sharing design</a>. Although <a target="_blank" rel="noopener" href="https://github.com/AliyunContainerService/gpushare-device-plugin">gpushare-device-plugin</a> is no longer maintained, its source code shows that its implementation follows an approach similar to **<a target="_blank" rel="noopener" href="https://github.com/NVIDIA/k8s-device-plugin">k8s-device-plugin</a>**. The whole process has roughly three steps:</p>
<ol>
<li><p>Query all GPU device IDs; if any are found, update the Pod's Spec</p>
<p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113457790.png" alt="image-20240913113457790"></p>
</li>
<li><p>Bind the Pod to the specified Node</p>
</li>
<li>Query all GPU device IDs; if any are found, update the Pod's Spec<img src="/Users/kai/Library/Application Support/typora-user-images/image-20240913163041142.png" alt="image-20240913163041142" /></li>
<li>Bind the Pod to the specified Node</li>
</ol>
<p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113535393.png" alt="image-20240913113535393"></p>
<img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113535393.png" alt="image-20240913113535393" />

<p>3. If the Pod is updated successfully, update the device information</p>
<p><img src="https://raw.githubusercontent.com/SilentEchoe/images/main/image-20240913113724395.png" alt="image-20240913113724395"></p>
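<p>Mounting the same device into several Pods at once is a simple way to let small jobs share a GPU. GPU Share uses its plugin and scheduler to walk the GPU memory information of every Pod, compute how much GPU capacity can still be scheduled, and allocate resources through an extension mechanism. This approach is cheap to implement, but it has a drawback: GPU resources are not isolated.</p>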
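<p>The three steps and the memory accounting described above can be sketched roughly as follows. This is a minimal illustration in Python, not the gpushare source code: the <code>Pod</code> and <code>Node</code> classes, the <code>schedule</code> function, the GiB unit, and the tightest-fit placement rule are all hypothetical names and assumptions introduced for the example.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pod:
    # Hypothetical stand-in for a Kubernetes Pod; not the real API object.
    name: str
    gpu_mem_request: int              # requested GPU memory (GiB, assumed unit)
    gpu_id: Optional[int] = None      # GPU index recorded on the Pod once chosen
    node: Optional[str] = None        # node the Pod is bound to


@dataclass
class Node:
    # Hypothetical stand-in for a node that exposes its GPUs by memory size.
    name: str
    gpu_total_mem: Dict[int, int]     # GPU index -> total memory (GiB)
    pods: List[Pod] = field(default_factory=list)

    def free_mem(self) -> Dict[int, int]:
        """Recompute free memory per GPU by walking every Pod on the node."""
        free = dict(self.gpu_total_mem)
        for pod in self.pods:
            if pod.gpu_id is not None:
                free[pod.gpu_id] -= pod.gpu_mem_request
        return free


def schedule(pod: Pod, node: Node) -> bool:
    """Place the Pod on a GPU with enough free memory, or return False."""
    free = node.free_mem()
    candidates = [(mem, gid) for gid, mem in free.items()
                  if mem >= pod.gpu_mem_request]
    if not candidates:
        return False
    # Assumption of this sketch: pick the tightest fit (least free memory).
    _, gpu_id = min(candidates)
    pod.gpu_id = gpu_id        # step 1: record the chosen GPU on the Pod
    pod.node = node.name       # step 2: bind the Pod to the node
    node.pods.append(pod)      # step 3: update the device bookkeeping
    return True


if __name__ == "__main__":
    node = Node("node-1", {0: 16, 1: 8})
    for name, request in [("a", 6), ("b", 4), ("c", 2), ("d", 13)]:
        pod = Pod(name, request)
        placed = schedule(pod, node)
        print(name, placed, pod.gpu_id, node.free_mem())
```

<p>Note that the bookkeeping only tracks what each Pod requested. Nothing in this scheme stops a Pod from using more GPU memory than it asked for, which is the isolation problem mentioned above.</p>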
