
Question: request gpushare on the same GPU #43

Open
xhejtman opened this issue Aug 7, 2021 · 3 comments

Comments


xhejtman commented Aug 7, 2021

Hello,

Is it possible for several pods to request a GPU share on any card, as long as it is the same card for all of them? E.g., if you have a StatefulSet consisting of an Xserver container and an application container, those two containers need to share the same GPU card. I would request, say, 1 GiB of memory for each container; however, if I have more than one GPU per node, I have no guarantee they end up on the same device, right?

wsxiaozhang (Contributor) commented

By default, the gpushare scheduler allocates pods to GPU cards with a "binpack-first" policy. Binpack means that multiple pods requesting GPU memory are placed on the same card of the same node, in order to leave as many GPU cards as possible free for "big" jobs. In that case, yes, your two pods would likely share the same GPU card. However, this is a best-effort policy: the two pods can still end up on different cards if the card where pod1 is placed does not have enough free memory for pod2, in which case pod2 is placed on another card.
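To illustrate the policy described above, here is a minimal sketch of how a "binpack-first" placement might work: pick the candidate card with the least free memory that still fits the request, so roomier cards stay free for big jobs. The function and list names are illustrative, not the scheduler's actual code.

```python
# Hypothetical sketch of a binpack GPU-memory placement policy:
# tightest fit first, leaving large cards free for big requests.

def binpack_place(free_mem_per_card, request):
    """Return the index of the chosen card, or None if nothing fits.

    free_mem_per_card: free GiB per GPU card on a node.
    request: GiB of GPU memory the pod asks for.
    """
    candidates = [(free, i) for i, free in enumerate(free_mem_per_card)
                  if free >= request]
    if not candidates:
        return None  # no card on this node can host the pod
    # Binpack: choose the card with the smallest remaining free memory.
    _, card = min(candidates)
    return card

cards = [14, 14, 14]        # three 14 GiB cards
for pod_request in [1, 1]:  # two pods asking for 1 GiB each
    card = binpack_place(cards, pod_request)
    cards[card] -= pod_request
print(cards)  # both pods land on the first card: [12, 14, 14]
```

This also shows why the guarantee is only best-effort: if the first card lacks free memory for the second request, `min` simply picks a different card.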


xhejtman commented Nov 5, 2021

Is there any chance of extending this plugin so that it is possible to request allocations from the same physical card? That would be useful for StatefulSet deployments where you need to share the same physical GPU among all containers.


swood commented Nov 11, 2021

I'm trying to use this plugin with your example code, and it doesn't seem to work as documented.

ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-11-11 22:22:59.635156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-11 22:22:59.675793: E tensorflow/core/common_runtime/direct_session.cc:170] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
0.1
Traceback (most recent call last):
  File "/app/main.py", line 40, in <module>
    train(fraction)
  File "/app/main.py", line 23, in train
    sess = tf.Session(config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1482, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 622, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
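For context on the `0.1` printed above: the demo image presumably derives a per-process GPU memory fraction from the injected `ALIYUN_COM_GPU_MEM_CONTAINER` and `ALIYUN_COM_GPU_MEM_DEV` environment variables. This is a guess at that logic, not the image's actual code; note that 2/14 ≈ 0.14, so a printed value of 0.1 may indicate the demo fell back to a default rather than reading the variables.

```python
# Sketch (assumed, not the gpu-player source): compute a TensorFlow
# per-process GPU memory fraction from the gpushare env vars.
import os

def gpu_mem_fraction(default=0.1):
    dev = float(os.environ.get("ALIYUN_COM_GPU_MEM_DEV", 0))
    container = float(os.environ.get("ALIYUN_COM_GPU_MEM_CONTAINER", 0))
    if dev <= 0 or container <= 0:
        return default  # env vars missing or invalid: fall back
    return container / dev

os.environ["ALIYUN_COM_GPU_MEM_DEV"] = "14"
os.environ["ALIYUN_COM_GPU_MEM_CONTAINER"] = "2"
print(round(gpu_mem_fraction(), 2))  # -> 0.14
```

The fraction would then be passed to TensorFlow via `config.gpu_options.per_process_gpu_memory_fraction` before creating the session.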

This situation occurs when I increase the number of replicas to two:

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 2

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 2

I have three cards with 14 GiB each, yet I am not able to run two copies of this program. Why?
