Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pool: log error when failed to connect to shard #276

Merged
merged 1 commit into from
Mar 1, 2024

Conversation

muzarski
Copy link
Collaborator

This change is related to #275.
Silently ignoring the connection failure can be misleading, which was the case in the issue provided above.

This PR adds an error log when driver fails to open the CQL connection to specific shard.

@muzarski muzarski self-assigned this Dec 1, 2023
@fruch fruch requested review from Lorak-mmk and fruch December 3, 2023 07:37
@fruch
Copy link

fruch commented Dec 3, 2023

@muzarski let add an integration test with the relevant case (in the issue you have most of what needed)

also I'm not sure about solution here, it's gonna hide any failure from the connection method

total_shards=self.host.sharding_info.shards_count)
conn.original_endpoint = self.host.endpoint
except Exception as exc:
log.error("Failed to open connection to %s, on shard_id=%i: %s", self.host, shard_id, exc)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reraise the exception here, otherwise it gonna hide all possible failure from the connection creation, and there plently of those.

also I'm not sure how this fixes anything regarding the blocked shard aware port

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't fix the issue mentioned in the comment.

As mentioned in the issue, the log messages are very misleading - they don't suggest that shard-aware port is blocked. The only thing I change here is to log something in case of shard-aware CQL connection failure.

The main issue (ThreadPoolExecutor blocking enqueued tasks) is still there.

@muzarski
Copy link
Collaborator Author

muzarski commented Dec 6, 2023

@fruch but the main issue is still not fixed. Is there a point to add an integration test?

@fruch
Copy link

fruch commented Dec 6, 2023

@fruch but the main issue is still not fixed. Is there a point to add an integration test?

I was more concerned with the behavior change you introduced.

anyhow the log doesn't really help in directing a use to what is the issue, and what he can do to handle it.

@muzarski
Copy link
Collaborator Author

muzarski commented Dec 6, 2023

I was more concerned with the behavior change you introduced.

Yeah, I forgot to reraise the exception - fixed it already.

anyhow the log doesn't really help in directing a use to what is the issue, and what he can do to handle it.

I'm not sure I understand correctly.

As for me, I wish I'd seen this kind of log message when I had debugged the issue. During the SCT run, there was no information regarding the shard-aware CQL connection failure - nothing suggested that this was the real issue.

@fruch
Copy link

fruch commented Dec 19, 2023

@Lorak-mmk can you review this one ?

@avelanarius
Copy link

@muzarski Please rebase on top of the current master.

Copy link

@avelanarius avelanarius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the change - previously the exception was silent and it was hard to debug connection problems. The printed exception should give some clues to why the connection has failed.

conn.original_endpoint = self.host.endpoint
except Exception as exc:
log.error("Failed to open connection to %s, on shard_id=%i: %s", self.host, shard_id, exc)
raise exc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to just raise? that way you will preserve original backtrace iiuc

@Lorak-mmk
Copy link

One thing I'd like to know about this change that I won't be able to understand from code changes: What happens with the exception? Why isn't it logged?
Could you describe that in commit message?

@muzarski
Copy link
Collaborator Author

One thing I'd like to know about this change that I won't be able to understand from code changes: What happens with the exception? Why isn't it logged? Could you describe that in commit message?

In the original issue we encountered during SCT run, the error was silently dropped in the HostConnection::_open_connections_for_all_shards:

for shard_id in range(self.host.sharding_info.shards_count):
    if skip_shard_id is not None and skip_shard_id == shard_id:
        continue
    future = self._session.submit(self._open_connection_to_missing_shard, shard_id)
    if isinstance(future, Future):
        self._connecting.add(shard_id)
        self._shard_connections_futures.append(future)

The HostConnection::_open_connection_to_missing_shard function is submitted to the ThreadPoolExecutor which calls this function asynchronously, and sets the exception on the returned future.

I believe the reason that this exception was ignored, is the fact that we don't check what the future holds (either result or exception) after execution.
We can see that the resulting future is appended to HostConnection::_shard_connections_futures, but from what I see in the code, this list is only used in the HostConnection::shutdown method to cancel all of the futures contained in the list.

I'll add a short description in the commit message.

The exception from `HostConnection::_open_connection_to_missing_shard`
during connection failure is silently dropped by the callers.

This function is submitted to the `ThreadPoolExecutor` which
assigns the result of this function to the future
(either success or exception).

The callers throughout the code ignore the future's result
and that is why this exception is ignored.

In this commit we add an error log when opening a connection to the
specific shard fails.
@muzarski
Copy link
Collaborator Author

v2:

  • extended the commit message so it explains the cause of the issue
  • replaced raise exc with raise

@muzarski muzarski requested a review from Lorak-mmk February 29, 2024 16:34
@Lorak-mmk Lorak-mmk merged commit 49a136e into scylladb:master Mar 1, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants