Maximum batch sizes #6

Open
michaelbornholdt opened this issue Aug 17, 2021 · 2 comments

Comments

@michaelbornholdt
Contributor

"model": {
    "name": "efficientnet",
    "crop_generator": "sampled_crop_generator",
    "metrics": ["accuracy", "top_k"],
    "epochs": 2,
    "initialization": "ImageNet",
    "params": {
        "learning_rate": 0.005,
        "batch_size": 64,
        "conv_blocks": 0,
        "feature_dim": 256,
        "pooling": "avg"
    },
    "lr_schedule": "cosine"
},
"sampling": {
    "factor": 1,
    "workers": 4,
    "cache_size": 10000
},
"validation": {
    "frequency": 1,
    "top_k": 5,
    "batch_size": 40,
    "frame": "val",
    "sample_first_crops": true
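
Since this issue is about finding the maximum batch size, one rough way to probe it is to start from an upper bound and halve until a run succeeds. The sketch below is only an illustration: `find_max_batch_size` and the `try_batch` callback are hypothetical names, not part of DeepProfiler. In practice `try_batch` would probably need to launch each attempt in a fresh process, because TensorFlow generally keeps GPU memory reserved until the process exits.

```python
def find_max_batch_size(try_batch, upper, lower=1):
    """Halve the batch size from `upper` until `try_batch` reports success.

    `try_batch(size)` is a hypothetical callback that returns True when a
    run with that batch size completes without running out of memory.
    Returns the first size that works, or None if none does.
    """
    size = upper
    while size >= lower:
        if try_batch(size):
            return size
        size //= 2  # back off: 128 -> 64 -> 32 -> ...
    return None


if __name__ == "__main__":
    # Stand-in callback for illustration only: pretend anything up to 40 fits.
    print(find_max_batch_size(lambda size: size <= 40, upper=128))
```

Halving only finds a workable size, not the exact maximum; a finer search between the last failing and first working size could follow if that matters.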

@michaelbornholdt
Contributor Author

michaelbornholdt commented Aug 17, 2021

Profiling with `batch_size` 128 runs out of GPU memory during CUDA initialization:

    "profile": {
      "feature_layer": "Compound",
      "checkpoint": "checkpoint_0010.hdf5",
      "batch_size": 128
    }
}

deepprofiler/__main__.py:180: DtypeWarning: Columns (12) have mixed types.Specify dtype option on import or set low_memory=False.
  dset = deepprofiler.dataset.image_dataset.read_dataset(context.obj["config"], mode='profile')
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:375: UserWarning: The `lr` argument is deprecated, use `l$
  "The `lr` argument is deprecated, use `learning_rate` instead.")
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and$
  category=CustomMaskWarning)
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "deepprofiler/__main__.py", line 197, in <module>
    cli(obj={})
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "deepprofiler/__main__.py", line 181, in profile
    deepprofiler.learning.profiling.profile(context.obj["config"], dset)
  File "/DeepProfiler/deepprofiler/learning/profiling.py", line 105, in profile
    profile.configure()
  File "/DeepProfiler/deepprofiler/learning/profiling.py", line 35, in configure
    self.profile_crop_generator.start(K.get_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 742, in get_session
    session = _get_session(op_input_list)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 714, in _get_session
    config=get_default_session_config())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
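
The error above is raised while the session is being created, before any batch is processed, so it may be worth checking how much GPU memory is already in use by other processes before blaming the batch size. The following is a minimal sketch that assumes `nvidia-smi` is on the PATH; the helper names (`parse_gpu_memory`, `free_mib`, `query_gpu_memory`) are made up for this example.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows like '0, 1234, 40536' into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def free_mib(row):
    """Memory not currently claimed by any process, in MiB."""
    return row["total_mib"] - row["used_mib"]


def query_gpu_memory():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(
        QUERY, check=True, stdout=subprocess.PIPE, universal_newlines=True
    )
    return parse_gpu_memory(result.stdout)
```

If `free_mib` is small on the target GPU before DeepProfiler starts, another job on the same node is likely holding the memory, and lowering `batch_size` alone would not be expected to help.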

@michaelbornholdt
Contributor Author

    "profile": {
      "feature_layer": "Compound",
      "checkpoint": "checkpoint_0010.hdf5",
      "batch_size": 32 and 64
    }
}

Matplotlib created a temporary config/cache directory at /var/lib/condor/execute/slot1/dir_52011/matplotlib-4q3kc0vd because the default path (/.conf$
2021-08-17 20:03:10.420321: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:16.743367: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-17 20:03:16.768252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-08-17 20:03:16.768291: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:16.771330: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-17 20:03:16.771378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-17 20:03:16.772531: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-17 20:03:16.772749: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-17 20:03:16.773586: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-08-17 20:03:16.774328: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-08-17 20:03:16.774506: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-17 20:03:16.775931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-17 20:03:16.776471: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network $
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-17 20:03:16.785075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-08-17 20:03:16.786631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-17 20:03:16.786737: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:17.342112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-17 20:03:17.342162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]	  0
2021-08-17 20:03:17.342172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-08-17 20:03:17.344450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/devic$
2021-08-17 20:03:17.843576: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3200140000 Hz
2021-08-17 20:03:18.134139: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 174.69M (183173120 bytes) from device: CUDA_ERRO$
2021-08-17 20:03:36.615088: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 71.56Mi$
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2021-08-17 20:03:36.615281: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for GPU_0_bfc
2021-08-17 20:03:36.615311: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256):   Total Chunks: 231, Chunks in use: 231. 57.8KiB alloca$
2021-08-17 20:03:36.615323: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512):   Total Chunks: 77, Chunks in use: 76. 47.8KiB allocate$
2021-08-17 20:03:36.615333: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024):  Total Chunks: 39, Chunks in use: 38. 44.5KiB allocate$
2021-08-17 20:03:36.615343: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048):  Total Chunks: 73, Chunks in use: 72. 183.8KiB allocat$
2021-08-17 20:03:36.615389: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096):  Total Chunks: 53, Chunks in use: 50. 252.2KiB allocat$
2021-08-17 20:03:36.615399: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192):  Total Chunks: 27, Chunks in use: 20. 298.2KiB allocat$
2021-08-17 20:03:36.615441: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384):         Total Chunks: 12, Chunks in use: 8. 241.8KiB $
2021-08-17 20:03:36.615453: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768):         Total Chunks: 26, Chunks in use: 22. 1.00MiB $
2021-08-17 20:03:36.615485: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536):         Total Chunks: 28, Chunks in use: 26. 2.24MiB $
2021-08-17 20:03:36.615495: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072):        Total Chunks: 29, Chunks in use: 28. 5.46MiB $
2021-08-17 20:03:36.615504: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144):        Total Chunks: 15, Chunks in use: 12. 4.91MiB $
2021-08-17 20:03:36.615513: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288):        Total Chunks: 19, Chunks in use: 14. 15.88MiB$
2021-08-17 20:03:36.615522: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576):	Total Chunks: 6, Chunks in use: 4. 8.77MiB al$
2021-08-17 20:03:36.615532: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152):	Total Chunks: 5, Chunks in use: 2. 13.76MiB a$
2021-08-17 20:03:36.615541: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304):	Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615550: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608):	Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615558: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216):	Total Chunks: 1, Chunks in use: 0. 22.25MiB a$
2021-08-17 20:03:36.615567: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432):	Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615605: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864):	Total Chunks: 1, Chunks in use: 1. 81.84MiB a$
2021-08-17 20:03:36.615616: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615652: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615665: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 71.56MiB was 64.00MiB, Chunk State:
2021-08-17 20:03:36.615702: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 164855808
2021-08-17 20:03:36.615722: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000000 of size 1280 by op ScratchBuffer action_cou$
2021-08-17 20:03:36.615752: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000500 of size 256 by op Compound/kernel/Initializ$
2021-08-17 20:03:36.615761: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000600 of size 256 by op Compound/kernel/Initializ$
2021-08-17 20:03:36.615770: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000700 of size 2048 by op Com
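
The allocator warning in the log above suggests trying `TF_GPU_ALLOCATOR=cuda_malloc_async`. TensorFlow also reads `TF_FORCE_GPU_ALLOW_GROWTH`, which makes it allocate GPU memory on demand instead of reserving most of it up front. A minimal sketch of setting both from Python follows; `configure_tf_gpu_env` is a hypothetical helper, it has to run before `tensorflow` is imported, and whether either setting resolves this particular failure is untested here.

```python
import os


def configure_tf_gpu_env(allow_growth=True, async_allocator=False, env=None):
    """Set TensorFlow GPU memory environment variables.

    Must be called before `import tensorflow`, since the values are read
    when the GPU device is initialised. `env` defaults to os.environ; a
    plain dict can be passed instead to inspect the result.
    """
    env = os.environ if env is None else env
    if allow_growth:
        # Allocate GPU memory as needed rather than all at once.
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    if async_allocator:
        # The allocator named in the log's fragmentation hint.
        env["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    return env


if __name__ == "__main__":
    configure_tf_gpu_env(allow_growth=True, async_allocator=True)
    print(os.environ["TF_GPU_ALLOCATOR"])
```

The same variables can instead be exported in the job's shell environment before launching `python -m deepprofiler`.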
