This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Unit Tests Pass But tensorflow Produces Non-zero Exit Status #591

Open
xanderdunn opened this issue Dec 27, 2020 · 7 comments

Comments

@xanderdunn

Swift for TensorFlow 0.12. On an Ubuntu 18.04 machine with CUDA 10.2 installed and a V100 GPU, the test passes but the process then aborts:

Test Case 'LayerTests.testTransformerOnSineData' passed (9.897 seconds)
Test Suite 'LayerTests' passed at 2020-12-27 07:46:55.797
         Executed 1 test, with 0 failures (0 unexpected) in 9.897 (9.897) seconds
Test Suite 'Selected tests' passed at 2020-12-27 07:46:55.797
         Executed 1 test, with 0 failures (0 unexpected) in 9.897 (9.897) seconds
2020-12-27 07:46:45.834770: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-12-27 07:46:45.900244: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-27 07:46:45.927378: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
2020-12-27 07:46:45.928060: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55dcd7e22640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-27 07:46:45.928091: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-27 07:46:45.929282: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-27 07:46:45.938033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 07:46:45.942321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-27 07:46:45.942365: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-12-27 07:46:45.944454: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-27 07:46:45.946287: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-27 07:46:45.946676: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-27 07:46:45.948799: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-27 07:46:45.949971: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-27 07:46:45.954121: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-27 07:46:45.954209: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 07:46:46.057140: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 07:46:46.060045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-12-27 07:46:47.587211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-27 07:46:47.587249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-12-27 07:46:47.587258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-12-27 07:46:47.587416: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 07:46:47.588011: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 07:46:47.588561: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 07:46:47.589445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13081 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0)
2020-12-27 07:46:47.603215: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55dcee9cc0d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-27 07:46:47.603251: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-12-27 07:46:48.270017: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-27 07:46:48.635438: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:54] Peer localservice 1 {localhost:43152}
2020-12-27 07:46:48.635684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-27 07:46:48.635706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
2020-12-27 07:46:48.641531: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:43152}
2020-12-27 07:46:48.642177: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:43152
2020-12-27 07:46:48.642683: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: CPU:0
2020-12-27 07:46:48.642744: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: GPU:0
2020-12-27 07:46:55.887941: E tensorflow/stream_executor/stream.cc:338] Error recording event in stream: Error recording CUDA event: UNKNOWN ERROR (4); not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2020-12-27 07:46:55.887986: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: UNKNOWN ERROR (4)
2020-12-27 07:46:55.888013: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Exited with signal code 6

On a GitHub Actions Ubuntu 18.04 machine with no GPU and no CUDA installed, I get this non-zero exit:

2020-12-27T16:28:57.6788801Z [50/50] Testing myModelTests.LayerTests/testTransformerOnSineMaskedData
2020-12-27T16:28:57.6790043Z 
2020-12-27T16:28:57.6791357Z Test Suite 'Selected tests' started at 2020-12-27 16:26:06.614
2020-12-27T16:28:57.6792343Z Test Suite 'LayerTests' started at 2020-12-27 16:26:06.616
2020-12-27T16:28:57.6793517Z Test Case 'LayerTests.testSimpleTransformer' started at 2020-12-27 16:26:06.616
2020-12-27T16:28:57.6797957Z 2020-12-27 16:26:06.578054: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/runner/work/my_model/my_model/swift-tensorflow-RELEASE-0.12-cuda10.2-cudnn7-ubuntu18.04/usr/lib/swift/linux
2020-12-27T16:28:57.6800499Z 2020-12-27 16:26:06.578121: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-12-27T16:28:57.6802383Z 2020-12-27 16:26:06.617073: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-12-27T16:28:57.6803875Z To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-27T16:28:57.6806374Z 2020-12-27 16:26:06.617347: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/runner/work/my_model/my_model/swift-tensorflow-RELEASE-0.12-cuda10.2-cudnn7-ubuntu18.04/usr/lib/swift/linux
2020-12-27T16:28:57.6808661Z 2020-12-27 16:26:06.617366: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-12-27T16:28:57.6810094Z 2020-12-27 16:26:06.617386: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (fv-az195-301): /proc/driver/nvidia/version does not exist
2020-12-27T16:28:57.6811413Z 2020-12-27 16:26:06.635039: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2593905000 Hz
2020-12-27T16:28:57.6813825Z 2020-12-27 16:26:06.635207: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562fed07dfb0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-27T16:28:57.6815423Z 2020-12-27 16:26:06.635217: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-27T16:28:57.6816130Z Exited with signal code 11
2020-12-27T16:28:57.6816402Z 
2020-12-27T16:28:57.6816993Z Test Suite 'Selected tests' started at 2020-12-27 16:27:32.332
2020-12-27T16:28:57.6817716Z Test Suite 'LayerTests' started at 2020-12-27 16:27:32.334
2020-12-27T16:28:57.6818826Z Test Case 'LayerTests.testSingleIterationLayerChanges' started at 2020-12-27 16:27:32.334
2020-12-27T16:28:57.6821626Z 2020-12-27 16:27:32.279131: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/runner/work/my_model/my_model/swift-tensorflow-RELEASE-0.12-cuda10.2-cudnn7-ubuntu18.04/usr/lib/swift/linux
2020-12-27T16:28:57.6824100Z 2020-12-27 16:27:32.279195: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-12-27T16:28:57.6825977Z 2020-12-27 16:27:32.334431: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-12-27T16:28:57.6827452Z To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-27T16:28:57.6829957Z 2020-12-27 16:27:32.334723: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/runner/work/my_model/my_model/swift-tensorflow-RELEASE-0.12-cuda10.2-cudnn7-ubuntu18.04/usr/lib/swift/linux
2020-12-27T16:28:57.6832230Z 2020-12-27 16:27:32.334742: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-12-27T16:28:57.6833640Z 2020-12-27 16:27:32.334760: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (fv-az195-301): /proc/driver/nvidia/version does not exist
2020-12-27T16:28:57.6835179Z 2020-12-27 16:27:32.358841: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2593905000 Hz
2020-12-27T16:28:57.6836636Z 2020-12-27 16:27:32.359025: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558117e71fb0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-27T16:28:57.6838026Z 2020-12-27 16:27:32.359035: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-27T16:28:57.6838755Z Exited with signal code 11
2020-12-27T16:28:57.6853361Z ##[error]Process completed with exit code 1.

I see this on most runs of my 50 unit tests. This is a problem because my CI builds are being marked as failed when in fact all the tests are passing.
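One way to tell a crash apart from an ordinary assertion failure in CI is to scan the saved test output for the "Exited with signal code" marker that appears in the logs above. The following is only a sketch: the helper names `count_signal_exits` and `signal_codes` are hypothetical, and it assumes the test output has been captured to a file first. On Linux, signal 6 is SIGABRT and signal 11 is SIGSEGV.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a saved `swift test` log.
# They rely only on the "Exited with signal code N" text seen in this thread.

# Print how many crash markers the log file ($1) contains.
count_signal_exits() {
    grep -c 'Exited with signal code' "$1" || true
}

# Print the distinct signal numbers found in the log file ($1), one per line.
signal_codes() {
    sed -n 's/.*Exited with signal code \([0-9][0-9]*\).*/\1/p' "$1" | sort -n | uniq
}
```

A possible use is to capture the run with `swift test 2>&1 | tee test.log` and then call `signal_codes test.log` to see which signals ended the test processes.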

Has anyone encountered this in continuous integration testing of Swift for TensorFlow projects?

I didn't encounter this on Swift for TensorFlow 0.11.

@brettkoonce
Contributor

Do you have cuDNN installed as well?

@xanderdunn
Author

xanderdunn commented Dec 27, 2020

In the first output above I do, yes. I have cudnn-7 installed. The first example is the machine I use for all of my training runs, and it is fully functional with TensorFlow on GPU. To make sure, I ran sudo rm -r /usr/local/cuda and re-installed CUDA 10.2 and cudnn-7. Same result.

In the second example where I have no GPU and no CUDA, cudnn is not installed. I am expecting all of my tests to run on CPU TF_EAGER on the GitHub Actions continuous integration machine.

@xanderdunn
Author

xanderdunn commented Dec 28, 2020

Ok, there are two different issues here:

Exited with signal code 11

This turned out to be a very opaque error in one specific unit test: the test was calling model.callAsFunction(_ input: Tensor<Float>), which I had not actually implemented. To support both continuous and categorical inputs, my model instead implements a custom protocol SparseAndDenseLayer with callAsFunction(continuousInputs: Tensor<Float>, categoricalInputs: [Tensor<Int32>]) -> Tensor<Float>, as done here in swift-models. I'm not sure why a call to model(inputs) even compiled on a struct that neither implements callAsFunction(_ input) nor conforms to Layer.

The opaqueness of the error and lack of stack trace made this difficult to find. I fixed my test to call the model correctly and I no longer see the Exited with signal code 11.

Exited with signal code 6

V100 GPU machine, Ubuntu 18.04, CUDA 10.2, cudnn-7:

Test Case 'LayerTests.testMeanSquaredErrorOnRandomValues' passed (7.785 seconds)
Test Suite 'LayerTests' passed at 2020-12-28 07:44:19.787
         Executed 1 test, with 0 failures (0 unexpected) in 7.785 (7.785) seconds
Test Suite 'Selected tests' passed at 2020-12-28 07:44:19.787
         Executed 1 test, with 0 failures (0 unexpected) in 7.785 (7.785) seconds
2020-12-28 07:44:11.937355: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-12-28 07:44:12.543778: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-28 07:44:12.561256: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz
2020-12-28 07:44:12.561975: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5649fd253640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-28 07:44:12.562007: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-28 07:44:12.563635: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-28 07:44:12.573900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:12.574754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2020-12-28 07:44:12.574810: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-12-28 07:44:12.577960: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 07:44:12.580506: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-28 07:44:12.580992: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-28 07:44:12.583847: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-28 07:44:12.585490: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-28 07:44:12.591345: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-28 07:44:12.591420: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:12.591997: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:12.592514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-12-28 07:44:13.402013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-28 07:44:13.402053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-12-28 07:44:13.402060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-12-28 07:44:13.402202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:13.402816: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:13.403369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-28 07:44:13.403885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13460 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0)
2020-12-28 07:44:13.405731: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564a14cebe70 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-28 07:44:13.405755: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-12-28 07:44:13.856129: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-28 07:44:14.111203: I tensorflow/compiler/xla/xla_client/xrt_local_service.cc:54] Peer localservice 1 {localhost:34351}
2020-12-28 07:44:14.111446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-28 07:44:14.111467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
2020-12-28 07:44:14.115319: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:34351}
2020-12-28 07:44:14.115770: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:34351
2020-12-28 07:44:14.116095: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: CPU:0
2020-12-28 07:44:14.116134: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: GPU:0
2020-12-28 07:44:19.874952: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:647] Non-OK-status: GpuLaunchKernel(BlockReduceKernel<IN_T, OUT_T, num_threads, Op>, num_blocks, num_threads, 0, cu_stream, in, out, in_size, op, init) status: Internal: driver shutting down
Exited with signal code 6

This is still happening on my GPU machine at the completion of all tests, but only when I run all my unit tests in parallel with swift test --parallel. The device is set to .default for all tests, so I expect they're running on GPU TF_EAGER. A handful of the tests are full models that are trained to convergence on simple datasets. Is it possible that running multiple models simultaneously on GPU is causing this error?

@xanderdunn
Author

xanderdunn commented Dec 28, 2020

I replaced all of the .default devices in my unit tests with let testDevice: Device = Device(kind: Device.Kind.CPU, ordinal: 0, backend: Device.Backend.TF_EAGER), but the Exited with signal code 6 error still occurs at the end of all the unit tests:

2020-12-28 08:16:09.823251: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: CPU:0
2020-12-28 08:16:09.823301: I tensorflow/compiler/xla/xla_client/computation_client.cc:202] NAME: GPU:0
2020-12-28 08:16:20.953706: E tensorflow/stream_executor/stream.cc:338] Error recording event in stream: Error recording CUDA event: UNKNOWN ERROR (4); not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2020-12-28 08:16:20.953756: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: UNKNOWN ERROR (4)
2020-12-28 08:16:20.953766: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Exited with signal code 6

It does not occur on my CPU-only GitHub Actions continuous integration machine.

@brettkoonce
Contributor

I'm definitely not the person to ask about this, but IIRC that is correct: swift test --parallel won't work on a GPU.

@xanderdunn
Author

Thanks @brettkoonce! Is this expected to be the case even when all Tensors and models are placed on Device.Kind.CPU? Does the mere presence of a GPU break swift test --parallel? Maybe it's a conflict during initialization caused by TensorFlow finding a GPU?
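One way to probe that hypothesis would be to hide the GPU from the test processes entirely and see whether the abort still occurs. The sketch below uses the standard CUDA environment variable CUDA_VISIBLE_DEVICES, which controls the devices a CUDA application can see; the wrapper name `without_gpu` is hypothetical, and whether this actually avoids the signal 6 exit described in this thread is untested.

```shell
#!/bin/sh
# Hypothetical wrapper: run any command with no CUDA devices visible to it.
# An empty CUDA_VISIBLE_DEVICES hides all GPUs from the CUDA runtime.
without_gpu() {
    CUDA_VISIBLE_DEVICES="" "$@"
}
```

For example, `without_gpu swift test --parallel` would run the parallel suite as if the machine had no GPU, which should show whether GPU initialization is what conflicts with parallel test processes.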

@xanderdunn
Author

I've been successfully running my unit tests without --parallel, so I believe this can be closed if the above is expected behavior.
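The workaround can be made automatic in a script shared between the GPU machine and the CPU-only CI runner. This is a minimal sketch under stated assumptions: the function name `choose_test_command` is hypothetical, and it treats the existence of /proc/driver/nvidia/version (the file named in the CI log above) as a rough signal that an NVIDIA driver is loaded.

```shell
#!/bin/sh
# Hypothetical helper: print the test command to use on this machine.
# Serial when an NVIDIA driver appears to be present, parallel otherwise.
# $1 optionally overrides the driver marker file (useful for testing).
choose_test_command() {
    driver_file=${1:-/proc/driver/nvidia/version}
    if [ -e "$driver_file" ]; then
        echo "swift test"
    else
        echo "swift test --parallel"
    fi
}
```

A build script could then run `$(choose_test_command)` so that tests stay parallel on the CPU-only runner and fall back to serial execution where a GPU driver is detected.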


2 participants