
SNOW-835593: RetryHandler needs better handling of non-recoverable exceptions, or option to disable/provide custom exception handler #416

Open
stevel0821 opened this issue Feb 3, 2022 · 2 comments
Labels: feature, status-triage_done (Initial triage done, will be further handled by the driver team)

@stevel0821

Issue description

I'm experiencing an issue where the RetryHandler in HttpUtil.cs catches an SSL certificate exception thrown by HttpClient but ignores it and keeps retrying for 2 minutes until the timeout expires, even though the error is non-recoverable. This is true for any exception thrown by HttpClient: RetryHandler only breaks out of its loop when the cancellation token is canceled or the timeout expires; otherwise all exceptions are swallowed.

The exception is happening because our company's traffic passes through a gateway that appends to the certificate chain, and the CRL status of those certificates cannot be verified (which throws, as it should). Obviously this is not a Snowflake problem, but the only way I was able to figure out the cause was to clone SnowflakeConnector to my machine, add the project to the solution, and step through it in the debugger. This is less than ideal since we are paying to use Snowflake. Setting InsecureMode = true is also not an option, since we want to validate the CRL.

What we'd really like is to have the details of exceptions like this surfaced to us so we can handle them, or at least be aware of what's going on. Storing them in a list and throwing them all as an AggregateException at the end would be better than throwing a timeout exception, since it's not really a timeout (see the sketch below). We don't want to disable the retries, but we do need some control over the retry logic when the library cannot decide what to do.
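
To make the suggestion concrete, here is a minimal sketch of the idea (illustrative only, not the driver's implementation; the CollectingRetryHandler name and maxAttempts parameter are made up): a DelegatingHandler that remembers every swallowed exception and surfaces them as an AggregateException once the retry budget is exhausted.

    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical handler, not part of the driver: keeps every failure instead of
    // discarding it, then throws an AggregateException so callers see the root causes.
    class CollectingRetryHandler : DelegatingHandler
    {
        private readonly int _maxAttempts;

        public CollectingRetryHandler(HttpMessageHandler inner, int maxAttempts = 5)
            : base(inner)
        {
            _maxAttempts = maxAttempts;
        }

        protected override async Task<HttpResponseMessage> SendAsync(
            HttpRequestMessage request, CancellationToken cancellationToken)
        {
            var failures = new List<Exception>();

            for (int attempt = 1; attempt <= _maxAttempts; attempt++)
            {
                try
                {
                    return await base.SendAsync(request, cancellationToken).ConfigureAwait(false);
                }
                catch (Exception e)
                {
                    failures.Add(e); // keep the real cause (e.g. the SSL/CRL failure)
                    await Task.Delay(TimeSpan.FromSeconds(attempt), cancellationToken).ConfigureAwait(false);
                }
            }

            // The caller now sees every underlying error rather than a generic timeout.
            throw new AggregateException($"All {_maxAttempts} attempts failed.", failures);
        }
    }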

Example code

One way to reproduce the endless retrying of the HttpRequestException is to add a ServerCertificateCustomValidationCallback to the HttpClientHandler that always returns false, as in the example below.
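
A minimal, self-contained illustration of that idea (using a plain HttpClientHandler rather than the one the driver builds in HttpUtil.cs; while debugging, the same callback can be wired into a local clone of the connector):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // Rejecting every server certificate makes SendAsync throw an HttpRequestException
    // (TLS failure) on each attempt, which RetryHandler swallows and retries until the
    // timeout expires.
    class CertificateFailureRepro
    {
        static async Task Main()
        {
            var handler = new HttpClientHandler
            {
                ServerCertificateCustomValidationCallback = (message, cert, chain, errors) => false
            };

            using var client = new HttpClient(handler);
            try
            {
                // Any HTTPS request now fails certificate validation.
                await client.GetAsync("https://example.com");
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine(e); // inner exception: AuthenticationException
            }
        }
    }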

Related Issues

#213
#323

Actual Code:

Below is the current retry code from HttpUtil.cs.
The line of interest is the TODO comment: //TODO: Should probably check to see if the error is recoverable or transient.

          
            protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage requestMessage,
                CancellationToken cancellationToken)
            {
                HttpResponseMessage response = null;
                int backOffInSec = 1;
                int totalRetryTime = 0;
                int maxDefaultBackoff = 16;

                ServicePoint p = ServicePointManager.FindServicePoint(requestMessage.RequestUri);
                p.Expect100Continue = false; // Saves about 100 ms per request
                p.UseNagleAlgorithm = false; // Saves about 200 ms per request
                p.ConnectionLimit = 20;      // Default value is 2, we need more connections for performing multiple parallel queries

                TimeSpan httpTimeout = (TimeSpan)requestMessage.Properties[SFRestRequest.HTTP_REQUEST_TIMEOUT_KEY];
                TimeSpan restTimeout = (TimeSpan)requestMessage.Properties[SFRestRequest.REST_REQUEST_TIMEOUT_KEY];

                if (logger.IsDebugEnabled())
                {
                    logger.Debug("Http request timeout : " + httpTimeout);
                    logger.Debug("Rest request timeout : " + restTimeout);
                }

                CancellationTokenSource childCts = null;

                UriUpdater updater = new UriUpdater(requestMessage.RequestUri);

                while (true)
                {

                    try
                    {
                        childCts = null;
                        if (!httpTimeout.Equals(Timeout.InfiniteTimeSpan))
                        {
                            childCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
                            childCts.CancelAfter(httpTimeout);
                        }
                        response = await base.SendAsync(requestMessage, childCts == null ?
                            cancellationToken : childCts.Token).ConfigureAwait(false);
                    }
                    catch (Exception e)
                    {
                        if (cancellationToken.IsCancellationRequested)
                        {
                            logger.Debug("SF rest request timeout or explicit cancel called.");
                            cancellationToken.ThrowIfCancellationRequested();
                        }
                        else if (childCts != null && childCts.Token.IsCancellationRequested)
                        {
                            logger.Warn("Http request timeout. Retry the request");
                            totalRetryTime += (int)httpTimeout.TotalSeconds;
                        }
                        else
                        {
                            //TODO: Should probably check to see if the error is recoverable or transient.
                            logger.Warn("Error occurred during request, retrying...", e);
                        }
                    }

                    if (childCts != null)
                    {
                        childCts.Dispose();
                    }

                    if (response != null)
                    {
                        if (response.IsSuccessStatusCode) {
                            logger.Debug($"Success Response: StatusCode: {(int)response.StatusCode}, ReasonPhrase: '{response.ReasonPhrase}'");
                            return response;
                        }
                        else
                        {
                            logger.Debug($"Failed Response: StatusCode: {(int)response.StatusCode}, ReasonPhrase: '{response.ReasonPhrase}'");
                            bool isRetryable = isRetryableHTTPCode((int)response.StatusCode);
                            if (!isRetryable)
                            {
                                // No need to keep retrying, stop here
                                return response;
                            }
                        }
                    }
                    else
                    {
                        logger.Info("Response returned was null.");
                    }

                    // Disposing of the response if not null now that we don't need it anymore
                    response?.Dispose();

                    requestMessage.RequestUri = updater.Update();

                    logger.Debug($"Sleep {backOffInSec} seconds and then retry the request");
                    await Task.Delay(TimeSpan.FromSeconds(backOffInSec), cancellationToken).ConfigureAwait(false);
                    totalRetryTime += backOffInSec;
                    // Set next backoff time
                    backOffInSec = backOffInSec >= maxDefaultBackoff ?
                            maxDefaultBackoff : backOffInSec * 2;

                    if ((restTimeout.TotalSeconds > 0) && (totalRetryTime + backOffInSec > restTimeout.TotalSeconds))
                    {
                        // No need to wait more than necessary if it can be avoided.
                        // If the rest timeout will be reached before the next back-off,
                        // use a smaller one to give the Rest request a chance to timeout early
                        backOffInSec = Math.Max(1, (int)restTimeout.TotalSeconds - totalRetryTime - 1);
                    }
                }
            }
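
One possible shape for the check hinted at by that TODO (illustrative only, not the driver's implementation): certificate and TLS failures surface as an AuthenticationException nested inside the HttpRequestException thrown by HttpClient, and retrying can never make them succeed, so the handler could rethrow them immediately instead of looping until the timeout.

    using System;
    using System.Security.Authentication;

    // Hypothetical classifier, not part of the driver.
    static class RetryClassifier
    {
        public static bool IsRecoverable(Exception e)
        {
            for (var inner = e; inner != null; inner = inner.InnerException)
            {
                if (inner is AuthenticationException)
                    return false; // broken cert chain / CRL failure: stop retrying
            }
            return true; // anything else is assumed to be possibly transient
        }
    }

    // Inside the final catch branch above, the handler could then do something like:
    //     if (!RetryClassifier.IsRecoverable(e)) throw;
    //     logger.Warn("Transient error occurred during request, retrying...", e);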
 
@sfc-gh-jfan sfc-gh-jfan reopened this Jul 1, 2022
@github-actions github-actions bot closed this as completed Jul 2, 2022
@sfc-gh-jfan sfc-gh-jfan reopened this Jul 6, 2022
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Jun 8, 2023
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage Issue is under initial triage label Jun 8, 2023
@sfc-gh-dszmolka sfc-gh-dszmolka added status-in_progress Issue is worked on by the driver team and removed status-triage Issue is under initial triage labels Jun 8, 2023
@sfc-gh-dszmolka (Contributor)

Hello and thank you for submitting this issue, and also for the detailed description and for aggregating the pain points from the other tickets here, really appreciated! Also sorry for the long period without any response. RetryHandler definitely needs a review and we'll take a look.

@sfc-gh-dszmolka (Contributor)

While this gets addressed more systematically, #677, opened for another bug, could perhaps be relevant here too, so adding it here.

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-in_progress Issue is worked on by the driver team labels Feb 11, 2024
@sfc-gh-dprzybysz sfc-gh-dprzybysz changed the title RetryHandler needs better handling of non-recoverable exceptions, or option to disable/provide custom exception handler SNOW-835593: RetryHandler needs better handling of non-recoverable exceptions, or option to disable/provide custom exception handler Dec 16, 2024