I'm experiencing an issue where the RetryHandler in HttpUtil.cs catches an SSL certificate exception thrown by HttpClient, ignores it, and keeps retrying for 2 minutes until the timeout expires, even though the error is non-recoverable. This is true for any exception thrown by HttpClient: RetryHandler only breaks out of its loop when the cancellation tokens are canceled or the timeout expires; otherwise every exception is swallowed.
The exception happens because our company's traffic passes through a gateway that appends to the certificate chain, and HttpClient is unable to verify the CRL status for these certs (so it throws, as it should). Obviously this is not a Snowflake problem, but the only way I was able to figure it out was to clone SnowflakeConnector to my machine, add the project to our solution, and step through the debugger until we found it. This is less than ideal since we are paying to use Snowflake. Setting InsecureMode = true is also not an option, since we want to validate the CRL.
What we'd really like is to have the details of exceptions like this thrown to us so we can handle them, or at least be aware of what's going on. Storing them in a list and throwing them all as an AggregateException at the end would be better than throwing a timeout exception, since it's not really a timeout. We don't want to disable the retries, but we do need some control over the retry logic when the library cannot decide what to do; a rough sketch of what this could look like follows the retry code below.
Example code
One way to reproduce the endless HttpException is to add a ServerCertificateCustomValidationCallback to the HttpClientHandler that always returns false.
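A minimal, self-contained sketch of that repro is below. It is not connector code; the class name, client setup, and URL are illustrative only, and it assumes you can point a plain HttpClient at any HTTPS endpoint.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class CertFailureRepro
{
    static async Task Main()
    {
        // Reject every server certificate, mimicking a failed CRL check.
        var handler = new HttpClientHandler
        {
            ServerCertificateCustomValidationCallback = (message, cert, chain, errors) => false
        };

        using var client = new HttpClient(handler);
        try
        {
            // The request fails with an HttpRequestException whose inner exception
            // reports the certificate validation failure. Inside RetryHandler this
            // would be swallowed and retried until the REST timeout expires.
            await client.GetAsync("https://myaccount.snowflakecomputing.com"); // hypothetical URL
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine(ex.InnerException?.Message ?? ex.Message);
        }
    }
}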
Below is the current retry code. The line of interest is the TODO comment in the catch block: //TODO: Should probably check to see if the error is recoverable or transient.
protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage requestMessage,
    CancellationToken cancellationToken)
{
    HttpResponseMessage response = null;
    int backOffInSec = 1;
    int totalRetryTime = 0;
    int maxDefaultBackoff = 16;

    ServicePoint p = ServicePointManager.FindServicePoint(requestMessage.RequestUri);
    p.Expect100Continue = false; // Saves about 100 ms per request
    p.UseNagleAlgorithm = false; // Saves about 200 ms per request
    p.ConnectionLimit = 20;      // Default value is 2, we need more connections for performing multiple parallel queries

    TimeSpan httpTimeout = (TimeSpan)requestMessage.Properties[SFRestRequest.HTTP_REQUEST_TIMEOUT_KEY];
    TimeSpan restTimeout = (TimeSpan)requestMessage.Properties[SFRestRequest.REST_REQUEST_TIMEOUT_KEY];

    if (logger.IsDebugEnabled())
    {
        logger.Debug("Http request timeout : " + httpTimeout);
        logger.Debug("Rest request timeout : " + restTimeout);
    }

    CancellationTokenSource childCts = null;
    UriUpdater updater = new UriUpdater(requestMessage.RequestUri);

    while (true)
    {
        try
        {
            childCts = null;
            if (!httpTimeout.Equals(Timeout.InfiniteTimeSpan))
            {
                childCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
                childCts.CancelAfter(httpTimeout);
            }

            response = await base.SendAsync(requestMessage, childCts == null ?
                cancellationToken : childCts.Token).ConfigureAwait(false);
        }
        catch (Exception e)
        {
            if (cancellationToken.IsCancellationRequested)
            {
                logger.Debug("SF rest request timeout or explicit cancel called.");
                cancellationToken.ThrowIfCancellationRequested();
            }
            else if (childCts != null && childCts.Token.IsCancellationRequested)
            {
                logger.Warn("Http request timeout. Retry the request");
                totalRetryTime += (int)httpTimeout.TotalSeconds;
            }
            else
            {
                //TODO: Should probably check to see if the error is recoverable or transient.
                logger.Warn("Error occurred during request, retrying...", e);
            }
        }

        if (childCts != null)
        {
            childCts.Dispose();
        }

        if (response != null)
        {
            if (response.IsSuccessStatusCode)
            {
                logger.Debug($"Success Response: StatusCode: {(int)response.StatusCode}, ReasonPhrase: '{response.ReasonPhrase}'");
                return response;
            }
            else
            {
                logger.Debug($"Failed Response: StatusCode: {(int)response.StatusCode}, ReasonPhrase: '{response.ReasonPhrase}'");
                bool isRetryable = isRetryableHTTPCode((int)response.StatusCode);
                if (!isRetryable)
                {
                    // No need to keep retrying, stop here
                    return response;
                }
            }
        }
        else
        {
            logger.Info("Response returned was null.");
        }

        // Disposing of the response if not null now that we don't need it anymore
        response?.Dispose();

        requestMessage.RequestUri = updater.Update();

        logger.Debug($"Sleep {backOffInSec} seconds and then retry the request");
        await Task.Delay(TimeSpan.FromSeconds(backOffInSec), cancellationToken).ConfigureAwait(false);
        totalRetryTime += backOffInSec;

        // Set next backoff time
        backOffInSec = backOffInSec >= maxDefaultBackoff ?
            maxDefaultBackoff : backOffInSec * 2;

        if ((restTimeout.TotalSeconds > 0) && (totalRetryTime + backOffInSec > restTimeout.TotalSeconds))
        {
            // No need to wait more than necessary if it can be avoided.
            // If the rest timeout will be reached before the next back-off,
            // use a smaller one to give the Rest request a chance to timeout early
            backOffInSec = Math.Max(1, (int)restTimeout.TotalSeconds - totalRetryTime - 1);
        }
    }
}
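For illustration only, here is a rough standalone sketch of the behavior we're asking for: collect the exceptions that would otherwise be swallowed, fail fast on ones that are clearly non-recoverable, and throw an AggregateException when retries are exhausted. None of these names (SendWithRetriesAsync, IsNonRecoverable, the request factory) exist in the connector; this is a hedged sketch of the request, not a proposed patch, and it assumes the certificate failure surfaces as an HttpRequestException wrapping an AuthenticationException, which is typical on modern .NET.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Security.Authentication;
using System.Threading;
using System.Threading.Tasks;

static class RetryWithAggregateSketch
{
    // Assumption: SSL/authentication failures will never succeed on retry;
    // everything else is treated as transient for this sketch.
    static bool IsNonRecoverable(Exception e) =>
        e is HttpRequestException hre && hre.InnerException is AuthenticationException;

    public static async Task<HttpResponseMessage> SendWithRetriesAsync(
        HttpClient client,
        Func<HttpRequestMessage> requestFactory, // fresh message per attempt; HttpClient refuses to resend the same HttpRequestMessage
        TimeSpan restTimeout,
        CancellationToken cancellationToken)
    {
        var swallowed = new List<Exception>();
        var backOff = TimeSpan.FromSeconds(1);
        var deadline = DateTime.UtcNow + restTimeout;

        while (true)
        {
            try
            {
                return await client.SendAsync(requestFactory(), cancellationToken).ConfigureAwait(false);
            }
            catch (Exception e) when (!cancellationToken.IsCancellationRequested)
            {
                swallowed.Add(e);
                // Fail fast instead of silently retrying an error that cannot recover.
                if (IsNonRecoverable(e))
                {
                    throw new AggregateException("Non-recoverable error during request.", swallowed);
                }
            }

            if (DateTime.UtcNow + backOff > deadline)
            {
                // Surface everything that happened instead of a bare timeout.
                throw new AggregateException("Retries exhausted before the REST timeout.", swallowed);
            }

            await Task.Delay(backOff, cancellationToken).ConfigureAwait(false);
            backOff = TimeSpan.FromSeconds(Math.Min(backOff.TotalSeconds * 2, 16)); // same 16 s cap as the connector's backoff
        }
    }
}

The caller could then inspect the AggregateException's InnerExceptions and decide how to react, which is the level of control we're asking for without disabling the retries entirely.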
Hello, and thank you for submitting this issue, and also for the detailed description and for aggregating the pain points from the other tickets here; really appreciated! Also, sorry for the long period without any response. RetryHandler definitely needs a review and we'll take a look.
sfc-gh-dprzybysz changed the title to "SNOW-835593: RetryHandler needs better handling of non-recoverable exceptions, or option to disable/provide custom exception handler" on Dec 16, 2024.
Related Issues
#213
#323