Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK doesn't appear to use Retry-After on 429s #978

Closed
5 tasks done
gavinbarron opened this issue Sep 27, 2022 · 10 comments · May be fixed by #1514
Closed
5 tasks done

SDK doesn't appear to use Retry-After on 429s #978

gavinbarron opened this issue Sep 27, 2022 · 10 comments · May be fixed by #1514

Comments

@gavinbarron
Copy link
Member

Bug Report

Prerequisites

  • Can you reproduce the problem?
  • Are you running the latest version?
  • Are you reporting to the correct repository?
  • Did you perform a cursory search?

Description

During testing of a tool to simulate 429s on the network layer (i.e. outside of the JavaScript library) it was observed that a when a Retry-After or 5s was returned to the client application the SDK issued a retry request before the 5s window had elapsed

Console Errors: N/A

Screenshots:

Steps to Reproduce

  1. Use the ChaosProxy sample using the sdk and with the proxy running with 99% failure rate and only 429s as the allowed errors
  2. see that the a retry is sent before the 5s wait period is finished

Expected behavior: [What you expected to happen]
SDK should not be retrying until after the number of seconds provided in the Retry-After header

Actual behavior: [What actually happened]
SDK retries the request at 3s and eventually succeeds due to the exponential backoff behavior

Additional Context

Add any other context about the problem here..
Chaos Proxy is here: https://github.com/microsoftgraph/msgraph-chaos-proxy (Microsoft private repo at present)

Usage Information

Request ID - N/A

SDK Version - script link to https://cdn.jsdelivr.net/npm/@microsoft/microsoft-graph-client/lib/graph-js-sdk.js so I assume latest

  • Browser (Check, if using Browser version of SDK)

Browser Name - Edge

Version - 105

@ghost ghost added the ToTriage label Sep 27, 2022
@nikithauc nikithauc self-assigned this Oct 26, 2022
@waldekmastykarz
Copy link

waldekmastykarz commented Nov 14, 2022

This seems to be caused by the fact that the retry-after header is not listed among the safe headers in the access-control-expose-headers response header coming from Graph.

Because calling Graph from client-side apps is a CORS request, clients are only allowed to access a predefined set of safe headers, plus additional headers defined by the server as safe using the access-control-expose-headers header. Right now, retry-after is not listed among these headers, which is why the fetch API always returns its value as null.

To fix this issue, we'd need to add the retry-after header to the access-control-expose-headers header on Graph.

@sebastienlevert
Copy link
Contributor

Quick update on this issue. We are working with the AGS team to add this header to the access-control-expose-headers. When available worldwide, we'll update this thread.

@altano
Copy link

altano commented Oct 23, 2023

@sebastienlevert I'm not seeing this header in the response to graph API requests. I should expect to see it, right? (I assume you closing the topic means it was exposed at some point)

I do see retry-after listed in access-control-expose-headers but the header still isn't in the response, as you can see:
image

@sebastienlevert
Copy link
Contributor

Some APIs might not send a "Retry-After" and the SDK will automatically backoff exponentially.

@altano
Copy link

altano commented Oct 24, 2023

@sebastienlevert are you sure about the exponential backoff? Here's what I'm seeing:

image

Note the timestamps. Retries seem to be happening every few seconds without any backoff.

@sebastienlevert
Copy link
Contributor

Based on this, it should https://github.com/microsoftgraph/msgraph-sdk-javascript/blob/dev/src/middleware/RetryHandler.ts#L136. What is the version of the SDK you are using?

@altano
Copy link

altano commented Oct 26, 2023

I'm on @microsoft/microsoft-graph-client v3.0.7 (latest).

I realized what is going on: retryAttempts is capping out at 3 because of the default retry limit, and then the request fails. Then my code moves on to the next page and it does the same thing, with retryAttempts going from 0 - 3.

I am trying to fix this in my code by having a much higher maxRetries but I can't figure out how. Do I really have to create a middleware chain that replaces the default RetryHandler with a custom one, and then use that in Client.initWithMiddleware, or is there an easier way to control this?

FWIW I think it would make sense to have a MUCH higher default maxRetries value than 3 for 429 responses that don't include the retry-after header, as the first 3 retries are almost certainly going to also fail.

@altano
Copy link

altano commented Oct 26, 2023

Even that doesn't work because RetryHandlerOptions forces you to use no more than 10 retries, which is only (with exponential backoff) 2^10 seconds in minutes, or ~17 minutes.

That's not a long enough wait for the OneNote API throttling which is per hour. I need to be able to specify a number of retries that will cause the graph API SDK to wait >= 1 hour.

@altano
Copy link

altano commented Oct 26, 2023

Okay, I got it working with some crazy code:

/**
 * Create a new class that implements RetryHandlerOptions. We can't re-use
 * RetryHandlerOptions because it forces a max max number of retries of 10,
 * which (with exponential backoff) is only 17 minutes of retry delay, which is
 * potentially not enough. OneNote may require us to wait 1 hour before allowing
 * requests again.
 */
class HighMaxRetryHandlerOptions implements RetryHandlerOptions {
  /**
   * Matches RetryHandlerOptions.DEFAULT_DELAY
   */
  delay = 3;
  /**
   * Given the exponential backoff in the Microsoft Graph API, 12 retries means
   * we'll retry enough times to wait >= 1 hour before giving up. If waiting an
   * hour isn't long enough, the request is unlikely to succeed, and then giving
   * up makes sense.
   */
  maxRetries = 12;
  /**
   * One hour in seconds. The OneNote API limits are all described as
   * <something> per hour, so it doesn't make sense to wait more than an hour
   * between requests.
   */
  getMaxDelay() {
    return 3600;
  }
  shouldRetry() {
    return true;
  }
}

function getMiddlewareChain(): Middleware[] {
  const defaultMiddlewareChain = MiddlewareFactory.getDefaultMiddlewareChain(
    new MyCustomAuthenticationProvider()
  );

  // Create a middleware chain that has a non-default RetryHandler. The new
  // RetryHandler should have a much higher number of max retries.
  const middlewareChain = defaultMiddlewareChain.map((middleware) => {
    if (!(middleware instanceof RetryHandler)) {
      // Leave non-retry-related middleware alone
      return middleware;
    }

    // and replace RetryHandlers with our version
    return new RetryHandler(new HighMaxRetryHandlerOptions());
  });

  return middlewareChain;
}

// ... 

Client.initWithMiddleware({
  middleware: getMiddlewareChain()
});

Surely there's something much simpler that I'm missing, but this seems to be working for now:

SCR-20231025-sdqb

@waldekmastykarz
Copy link

Ideally, you shouldn't need to guess the retry value. That said, based on your observation @altano, @sebastienlevert I wonder if we should increase the retry number to 12 to accommodate the reset period of 1 hour for OneNote APIs.

altano added a commit to altano/msgraph-sdk-javascript that referenced this issue Oct 30, 2023
Some of the graph APIs do not provide the `retry-after` header in HTTP 429 throttling responses. The default behavior of this SDK is to retry 3 times, with exponential backoff. The problem with this approach is that the max retry of 3 attempts (with a delay of ~8 seconds) just isn't enough for some APIs. For example, OneNote's API will potentially return 429s for 1 hour.

To fix this, let us create some new defaults:

| Thing               | Old Value   | New Value      | Reasoning                                                                                                                 |
| ------------------- | ----------- | -------------- | ------------------------------------------------------------------------------------------------------------------------- |
| DEFAULT_MAX_RETRIES | 3           | 12             | 2^12 seconds is just over 1 hour, which should be enough for OneNote                                                      |
| MAX_DELAY           | 180 seconds | 3_600 (1 hour) | This makes it so that retry 12+ only wait 1 hour between retries instead of continuing to backoff the delay exponentially |
| MAX_MAX_RETRIES     | 10          | 64             | This needs to be higher than DEFAULT_MAX_RETRIES but the choice of 64 is arbitrary                                        |

See [this GitHub issue](microsoftgraph#978) for more discussion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants