Race condition during EvalCallback #347

Open
shinstra opened this issue Aug 11, 2023 · 1 comment


@shinstra

This is a bit of a weird one; I'm hoping someone has encountered it before and can point me in the right direction. I'm getting errors during EvalCallback.on_epoch_end that vary between runs (see examples below) and seem to relate to data inconsistencies. If I step through the code in debug mode there is no problem, and it seems to work fine in the following epochs. If I add time.sleep(1) at the start of the callback execution, no errors are thrown.
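For anyone wanting to try the time.sleep(1) workaround without editing the installed package, here is a minimal sketch of one way to do it. It only assumes the standard Keras callback signature on_epoch_end(epoch, logs=None). The names DelayedEpochEndMixin, FakeEvalCallback and DelayedFakeEvalCallback are hypothetical, and the stand-in class exists only so the snippet runs without TensorFlow. The delay hides the symptom for diagnosis; it is not a fix.

```python
import time


class DelayedEpochEndMixin:
    """Hypothetical mixin: pause briefly before the wrapped on_epoch_end runs."""

    delay_seconds = 1.0  # mirrors the time.sleep(1) workaround from the report

    def on_epoch_end(self, epoch, logs=None):
        time.sleep(self.delay_seconds)
        return super().on_epoch_end(epoch, logs)


class FakeEvalCallback:
    """Stand-in for the real callback so this sketch runs without TensorFlow."""

    def __init__(self):
        self.calls = []

    def on_epoch_end(self, epoch, logs=None):
        # Record which epoch was evaluated and when the call arrived.
        self.calls.append((epoch, time.monotonic()))


class DelayedFakeEvalCallback(DelayedEpochEndMixin, FakeEvalCallback):
    delay_seconds = 0.05  # short delay so the demo finishes quickly


start = time.monotonic()
cb = DelayedFakeEvalCallback()
cb.on_epoch_end(0, {"loss": 1.0})
elapsed = cb.calls[0][1] - start
print(f"epoch {cb.calls[0][0]} evaluated after {elapsed:.3f}s")
```

With the real library, the same mixin would presumably be combined with the actual class, e.g. class DelayedEvalCallback(DelayedEpochEndMixin, EvalCallback), and passed to fit() in place of the original callback. I have not verified that combination.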

My best guess is that some data used by the callback has not been fully initialized when EvalCallback.on_epoch_end is first called. However, I'm not sure whether the issue originates at the underlying TensorFlow/Keras level or at the tfsim level.

Error Examples

Epoch 1/800
62/62 [==============================] - ETA: 0s - loss: 332.3957 - proj_std: 0.0441
Traceback (most recent call last):
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 194, in <module>
    main()
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 178, in main
    history = contrastive_model.fit(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 188, in on_epoch_end
    known_results = _compute_metrics(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 291, in _compute_metrics
    classification_results = evaluator.evaluate_classification(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\evaluators\memory_evaluator.py", line 152, in evaluate_classification
    matcher.compute_count(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 177, in compute_count
    match_mask, distance_mask = self._compute_match_indicators(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 130, in _compute_match_indicators
    d_labels, d_dist = self.derive_match(lookup_labels, lookup_distances)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\match_nearest.py", line 55, in derive_match
    return lookup_labels[:, :1], lookup_distances[:, :1]
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Index out of range using input dim 1; input has only 1 dims [Op:StridedSlice] name: strided_slice/
Epoch 1/800
62/62 [==============================] - ETA: 0s - loss: 331.3144 - proj_std: 0.0441
Traceback (most recent call last):
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 194, in <module>
    main()
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 178, in main
    history = contrastive_model.fit(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 186, in on_epoch_end
    self.model.index(self.targets, self.target_labels, verbose=0)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\models\contrastive_model.py", line 558, in index
    predictions = self.predict(x)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\models\contrastive_model.py", line 457, in predict
    x = self.backbone.predict(
ValueError: can only convert an array of size 1 to a Python scalar
Epoch 1/800
62/62 [==============================] - ETA: 0s - loss: 329.6670 - proj_std: 0.0441
Traceback (most recent call last):
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 194, in <module>
    main()
  File "C:\Users\chris\Documents\XXXXX\Projects\PythonScratch\tfsim_contrastive_model\train_synthetic.py", line 178, in main
    history = contrastive_model.fit(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 188, in on_epoch_end
    known_results = _compute_metrics(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\callbacks.py", line 291, in _compute_metrics
    classification_results = evaluator.evaluate_classification(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\evaluators\memory_evaluator.py", line 152, in evaluate_classification
    matcher.compute_count(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 177, in compute_count
    match_mask, distance_mask = self._compute_match_indicators(
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 128, in _compute_match_indicators
    ClassificationMatch._check_shape(query_labels, lookup_labels, lookup_distances)
  File "C:\Users\chris\anaconda3\envs\PythonScratchTF_Test\lib\site-packages\tensorflow_similarity\matchers\classification_match.py", line 305, in _check_shape
    raise ValueError("Number of query labels must match the number of " "lookup_label sets.")
ValueError: Number of query labels must match the number of lookup_label sets.
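The first and third traces both suggest that the lookup results reach the matcher with an unexpected shape: a 1-D tensor where lookup_labels[:, :1] needs two dimensions, or a row count that differs from the number of query labels. A small check like the sketch below could be dropped in just before evaluation to log which case occurs on a given run. It is plain Python on nested lists. The function name describe_lookup_shape and the assumed layout (one inner list of neighbor labels per query) are illustrative assumptions, not part of tensorflow_similarity.

```python
def describe_lookup_shape(query_labels, lookup_labels):
    """Return a list of problems found; an empty list means shapes look consistent.

    Assumes (hypothetically) that lookup_labels holds one inner list of
    neighbor labels per query.
    """
    problems = []
    # A flat list corresponds to the "input has only 1 dims" failure.
    if not all(isinstance(row, (list, tuple)) for row in lookup_labels):
        problems.append("lookup_labels is not 2-D (expected one row per query)")
    # A row-count mismatch corresponds to the _check_shape ValueError.
    if len(query_labels) != len(lookup_labels):
        problems.append(
            f"{len(query_labels)} query labels vs {len(lookup_labels)} lookup rows"
        )
    return problems


ok = describe_lookup_shape([0, 1], [[0, 0], [1, 2]])
flat = describe_lookup_shape([0, 1], [0, 1])
short = describe_lookup_shape([0, 1, 2], [[0], [1]])
print(ok, flat, short, sep="\n")
```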

I'm working pretty close to the unsupervised-learning example notebook with the following key exceptions:

  1. A custom dataset with input size (None, 64, 64, 1).
  2. The backbone is the same as in the supervised learning notebook.
  3. The only callback I'm using is the EvalCallback.

I'm using python==3.8.16 and tensorflow==2.10.1
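In case it helps with reproducing this, a short standard-library sketch for collecting the versions of the packages involved. The distribution names passed in ("tensorflow", "tensorflow-similarity", "nmslib") are my assumption of the installed names, and collect_versions is a hypothetical helper; packages that are not found are reported as such rather than raising.

```python
from importlib import metadata  # available from Python 3.8


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Distribution names are assumptions; adjust to match the environment.
report = collect_versions(["tensorflow", "tensorflow-similarity", "nmslib"])
for name, version in report.items():
    print(f"{name}: {version}")
```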

@owenvallis
Collaborator

Thanks @shinstra, I'll try to take a look at this. The lookup error may be caused by something in the result set returned by nmslib, but I'll have to dig into the other errors to find out more.

@owenvallis owenvallis self-assigned this Aug 11, 2023