fix: iOS app crash caused by the request operation canceling #48350

Open · wants to merge 2 commits into main
Conversation

@zhouzh1 commented Dec 20, 2024

Summary:

We have observed many iOS app crashes caused by the [RCTFileRequestHandler invalidate] method, as shown in the screenshot below.
[screenshot: crash report]

Changelog:

[IOS] [FIXED] - app crash caused by the [RCTFileRequestHandler invalidate] method

Test Plan:

I was not able to reproduce this issue locally, so the changes in this PR are based purely on inference. I am not sure whether they truly make sense, so please take a deeper look. Thanks.

@facebook-github-bot added the CLA Signed and Shared with Meta labels on Dec 20, 2024.
@zhouzh1 (Author) commented Dec 20, 2024

@mojodna @jsierles @augustl Could you help take a look at this PR? Thanks.

@augustl (Contributor) commented Dec 20, 2024

@zhouzh1 hi! Out of curiosity, why did you tag me in this PR? I keep getting tagged in React Native PRs for some reason, and I'm not sure why :)

@cipolleschi (Contributor) left a comment

I'm not sure this is the proper fix for the issue.
Could you share more about the crash?
Is there an error message?
Can you share the full crash log?
Is it happening in production only, or also in debug?
Is the app in the background or in the foreground?

Comment on lines +28 to +35
    if (_fileQueue) {
      for (NSOperation *operation in _fileQueue.operations) {
        if ([operation isKindOfClass:[NSOperation class]] && !operation.isCancelled && !operation.isFinished) {
          [operation cancel];
        }
      }
      _fileQueue = nil;
    }
@cipolleschi (Contributor)

This is not necessary. You can only add NSOperation instances to an NSOperationQueue, and cancelAllOperations runs the same code you are writing manually.
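
For reference, a minimal sketch of that simplification (assuming the same _fileQueue ivar this PR touches):

    // Per Apple's documentation, cancelAllOperations simply calls -cancel on
    // every operation currently in the queue, so the manual loop above
    // collapses to:
    if (_fileQueue) {
      [_fileQueue cancelAllOperations];
      _fileQueue = nil;
    }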

@zhouzh1 (Author)

@cipolleschi It seems that cancelAllOperations doesn't perform those status checks on each operation, and in my app's crash report the stack trace points at a crash inside cancelAllOperations itself.

@cipolleschi (Contributor)

This is the official Apple documentation for cancelAllOperations:

This method calls the cancel method on all operations currently in the queue.
Canceling the operations does not automatically remove them from the queue or stop those that are currently executing. For operations that are queued and waiting execution, the queue must still attempt to execute the operation before recognizing that it is canceled and moving it to the finished state. For operations that are already executing, the operation object itself must check for cancellation and stop what it is doing so that it can move to the finished state. In both cases, a finished (or canceled) operation is still given a chance to execute its completion block before it is removed from the queue.

And this is the documentation for NSOperation's cancel.

In any case, calling cancel on an already cancelled or finished operation does not crash the app.

I believe that the crash is happening inside one of the operations that are being cancelled and that's why the crash reporter reports the crash there.
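
To illustrate that last point, here is a minimal sketch (not React Native code; FileReadOperation is a hypothetical subclass): -cancel only flips a flag, so a crash has to come from the work the operation itself performs, even if the reporter blames the call site of the cancellation.

    // Hypothetical NSOperation subclass, for illustration only.
    @interface FileReadOperation : NSOperation
    @end

    @implementation FileReadOperation
    - (void)main
    {
      // An executing operation must poll isCancelled and stop its own work
      // to reach the finished state; cancellation does not interrupt it
      // from the outside.
      if (self.isCancelled) {
        return;
      }
      // ... perform the file read, checking isCancelled between chunks ...
    }
    @end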

@zhouzh1 (Author)

Is it possible that, even though we ensure the corresponding code executes only on the main thread or another single thread (e.g. the JS thread), the operation object could still be shared among multiple thread contexts through its pointer, so the situation you described above happens?

    if (_didInvalidate) {
      return;
    }
    RCTUnsafeExecuteOnMainQueueSync(^{
@cipolleschi (Contributor)

This could potentially deadlock. We should not use the unsafe variant of this method. Can you change it to RCTExecuteOnMainQueue?
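
A sketch of the difference, assuming the two helpers from React/Base/RCTUtils.h:

    // Sync variant: when called off the main queue, it blocks the calling
    // thread until the main queue has run the block. If the main thread is
    // itself waiting on this thread (e.g. on a lock), both stall: deadlock.
    // RCTUnsafeExecuteOnMainQueueSync(^{ ... });

    // Async variant: enqueues the block and returns immediately, so the
    // calling thread never ends up waiting on the main queue.
    RCTExecuteOnMainQueue(^{
      // invalidation work runs on the main queue
    });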

@zhouzh1 (Author)

Changed.


RCTAssertMainQueue();
RCTLogInfo(@"Invalidating %@ (parent: %@, executor: %@)", self, _parentBridge, [self executorClass]);
RCTAssertMainQueue();
@cipolleschi (Contributor)

One thing that confuses me: in your stack trace, the crash is happening on Thread 26... but this assert should force the app to be on the main thread, which is not Thread 26... how is this possible?

@zhouzh1 (Author)

@cipolleschi Good question. That's because RCTAssertMainQueue only takes effect in dev builds; in release builds it does nothing.

@zhouzh1 (Author)

That's why I added the RCTUnsafeExecuteOnMainQueueSync wrapper: to ensure the code runs on the main thread.

@zhouzh1 (Author)

[screenshot of RCTAssert.h] As you can see, when `NS_BLOCK_ASSERTIONS` is defined, the `RCTAssert` macro is actually empty, and `NS_BLOCK_ASSERTIONS` is generally defined in release builds.
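
A simplified sketch of that gating pattern (not the verbatim RCTAssert.h contents; MY_ASSERT is a stand-in macro):

    // When NS_BLOCK_ASSERTIONS is defined, as it typically is in release
    // builds, the assertion macro expands to nothing, so any invariant it
    // was guarding goes silently unchecked in production.
    #ifndef NS_BLOCK_ASSERTIONS
    #define MY_ASSERT(condition, message) NSAssert((condition), (message))
    #else
    #define MY_ASSERT(condition, message) \
      do {                                \
      } while (0)
    #endif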

@cipolleschi (Contributor)

That's a good explanation, but then we should see crashes in development happening because of the assertion. And IIUC, the app does not crash in development, right?

@cipolleschi (Contributor)

Looking at the crash log, the JS thread is triggering the invalidation. I think this is the root of the problem: after the JS thread detects the invalidation, we should jump to the UI thread to invalidate everything...

@zhouzh1 (Author) commented Dec 21, 2024

I didn't encounter this crash in development, but I'm not sure whether it can happen there; as you know, it's an intermittent issue by nature.

@zhouzh1 (Author) commented Dec 20, 2024

> @zhouzh1 hi! Out of curiosity, why did you tag me in this PR? I keep getting tagged in React Native PRs for some reason, and I'm not sure why :)

Hey @augustl, sorry, I thought you were a member of the react-native dev team, because your id was displayed in the suggestion list when I typed the @ character in the comment box. 😸

@zhouzh1 (Author) commented Dec 20, 2024

@cipolleschi Here is more information for your reference.

  1. I only observed this crash in production via Sentry; I have not seen it in dev.
  2. According to the Sentry breadcrumbs, it seems most of the events happened after the app was pushed to the background state.
  3. You can see the relevant error messages we collected via Sentry in the screenshot below.
     [screenshot: Sentry error messages]
  4. Posted a complete crash report: [79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip](https://github.com/user-attachments/files/18211158/79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip)

@RSNara (Contributor) commented Dec 20, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then why not just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@zhouzh1 (Author) commented Dec 21, 2024


@RSNara As you can see from the code and the conversation with @cipolleschi above, there is already an RCTAssertMainQueue() invocation at the start of the invalidation, which means the original intention was precisely for the corresponding code to run on the main queue, right? My guess is that the dev team simply didn't notice that RCTAssertMainQueue takes no effect in release builds in most cases (it's just my inference; if the team knew that and did it intentionally, please ignore this).

@RSNara (Contributor) commented Dec 23, 2024


I just took a look!

RCTCxxBridge invalidate should only ever be called from RCTBridge. And RCTBridge invalidate and reload (but not dealloc) schedule RCTCxxBridge invalidate on the main thread:

https://github.com/facebook/react-native/blob/main/packages/react-native/React/Base/RCTBridge.mm#L368

So it's very curious that you're running into this issue. Could it be that your code is relying on the dealloc method of RCTBridge? That just synchronously deallocates the RCTCxxBridge on the current (i.e. potentially non-main) thread.
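
A hypothetical illustration of that failure mode (appBridge and jsQueue are stand-ins, not actual React Native internals):

    // -invalidate and -reload hop to the main queue first, but -dealloc runs
    // on whichever thread performs the last release. If that release happens
    // on the JS thread, any synchronous cleanup in dealloc runs there too.
    __block RCTBridge *bridge = appBridge; // hypothetical last strong reference
    dispatch_queue_t jsQueue = dispatch_queue_create("com.example.js", DISPATCH_QUEUE_SERIAL);
    dispatch_async(jsQueue, ^{
      bridge = nil; // dealloc fires here, on jsQueue, not on the main thread
    });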

@zhouzh1 (Author) commented Dec 24, 2024


Your curiosity is mine as well. Before I submitted this PR, I also suspected that somewhere in my code or in third-party library code the RCTBridge or RCTCxxBridge deallocation was being invoked explicitly, but I didn't manage to find it. In any case, we always need to ensure the RCTCxxBridge invalidation runs on the main thread, right? If so, doesn't it make sense to wrap it with RCTExecuteOnMainQueue?

@cipolleschi (Contributor)

@zhouzh1 is your app using Expo? I wonder if Expo does something under the hood to try to manage the lifecycle of the Bridge. Similarly, there might be libraries that attempt to do the same. If they connect some private API like the reload one to some JS function, the invalidation process might start from the JS thread instead of from the main one. 🤔

This is a hypothesis that we need to validate, though. It would be helpful to know what dependencies you are using. Also, are you using any crash reporting solution like Sentry? Those products usually let you leave breadcrumbs that can be used to investigate crashes. What was the user doing in the app when the crash occurred? What was the last action issued?

@zhouzh1 (Author) commented Dec 24, 2024


Yes, our app is using Expo and many of its associated libraries (e.g. expo-updates, expo-camera, and so on), so I think what you said above makes sense.
According to the Sentry breadcrumbs we collected, most of the crash events occurred after the app was pushed to the background state, while some requests were still pending.
[screenshot: Sentry breadcrumbs]

@zhouzh1 (Author) commented Dec 25, 2024

@cipolleschi Any ideas about the information I provided above?

@cipolleschi (Contributor)

It partially makes sense. Moving an app to the background triggers the events on the main thread, so the whole invalidation should already happen on the right thread.
But from your stack trace, the invalidation seems to arrive from a different thread. That's what we don't understand.

Also, the purpose of the invalidation and request cancelling is exactly to avoid those kinds of crashes... As soon as the app goes into the background, the operations are cancelled and should not be executed... :/


Any chance you could prepare a repro using this template, so that we can investigate this problem more easily?

@zhouzh1 (Author) commented Dec 30, 2024

@cipolleschi Unfortunately we haven't managed to reproduce it... 😂 We've only observed it in Sentry and in Xcode's crash statistics. But from what you said, it seems that as long as we ensure the invalidation process only happens on the main thread, we can avoid those crashes, right?

@cipolleschi (Contributor)

I think so; that's what we assumed when we wrote the code. However, we couldn't predict how frameworks like Expo might use it. :(
It is also a bit tricky to handle, because invalidation should block everything else; but if it is not triggered from the main thread, we have to jump asynchronously to the main queue, otherwise we risk incurring deadlocks, and then it will not block everything else.
@RSNara do you have any idea how to handle this case?

@cipolleschi (Contributor)

I was investigating this a little bit more, and I found that when the app is reloaded (pressing r in Metro):

  1. we are already on the main thread;
  2. we call [RCTBridge invalidate], which jumps asynchronously to the main queue;
  3. that function then calls [RCTCxxBridge invalidate], which performs all the invalidation code.

A simplified sketch of that dispatch chain is shown below. In your crash, there could be some piece of code that calls [RCTCxxBridge invalidate] directly without going through the RCTBridge instance. At this point, the problem is probably in Expo. I looked at all the React Native calls to invalidate, and aside from those two, no other call goes directly to the bridge.
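
The sketch (not verbatim RCTBridge.mm; batchedBridge is RCTBridge's property holding the RCTCxxBridge instance):

    // [RCTBridge invalidate], roughly: hop to the main queue asynchronously,
    // then forward the invalidation to the underlying bridge implementation.
    - (void)invalidate
    {
      RCTBridge *batchedBridge = self.batchedBridge;
      RCTExecuteOnMainQueue(^{
        [batchedBridge invalidate];
      });
    }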

@cipolleschi (Contributor) commented Dec 31, 2024

@zhouzh1 Any chance you can share the package.json with the dependencies you have in your project? This might help narrow down the dependency that might be causing the issue.

@Kudo (Contributor) commented Dec 31, 2024

> Posted a complete crash report: 79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip

It would also be good to share all threads' stack traces from Sentry if available. This crash log does not have a symbolicated stack trace. It looks like the crash happens when reloading the app. Do you use Updates.reloadAsync() or any other reload code in your app?

@zhouzh1 (Author) commented Jan 1, 2025

@cipolleschi You can find all the package dependencies in this file. 👉 dependency.json

@zhouzh1 (Author) commented Jan 1, 2025

> It would also be good to share all threads' stack traces from Sentry if available. This crash log does not have a symbolicated stack trace. It looks like the crash happens when reloading the app. Do you use Updates.reloadAsync() or any other reload code in your app?

@Kudo You're right, we're using expo-updates and its reloadAsync method. I suspected it as well, but I didn't manage to find any evidence. If you can help take a look, that would be very helpful.

@zhouzh1 (Author) commented Jan 1, 2025

Posting another crash report with a clearer symbolicated stack trace. @cipolleschi @Kudo
2024-12-31_21-35-22.0862_-0800-81ec277901cdb54faeb4f79131f4ec43b8824f00.crash.zip

@zhouzh1 (Author) commented Jan 2, 2025

@cipolleschi Is it possible that the buggy point is in the onReload methods that are defined separately in RCTHost.mm and RCTBridge.mm? I saw they use the RCTAssertMainQueue macro as well, and as you know, this macro takes no effect in release builds.

@Kudo (Contributor) commented Jan 2, 2025

> Posting another crash report with a clearer symbolicated stack trace.
> 2024-12-31_21-35-22.0862_-0800-81ec277901cdb54faeb4f79131f4ec43b8824f00.crash.zip

That is useful, thanks for sharing the crash log.

> Is it possible that the buggy point is in the onReload methods that are defined separately in RCTHost.mm and RCTBridge.mm?

It should be unrelated. RCTHost is for the new architecture, and RCTBridge is more for the legacy architecture.

@Kudo (Contributor) commented Jan 2, 2025

My findings:

  • expo-updates' Updates.reloadAsync() uses RCTTriggerReloadCommandListeners to reload the app. The old bridge instance will be deallocated and [RCTCxxBridge invalidate] will be called on the main thread. That is the correct flow.
  • [RCTCxxBridge invalidate] will switch to different threads at different points.
  • RCTFileRequestHandler is initialized on the main thread and invalidated on a worker thread. If you want to invalidate on the main thread, you can add this code in RCTFileRequestHandler.mm:

      - (dispatch_queue_t)methodQueue
      {
        return dispatch_get_main_queue();
      }

  • According to Apple, NSOperationQueue should be thread-safe, so it should be fine even if RCTFileRequestHandler is initialized and invalidated on different threads. I still have no idea how the crash happens, but it could be during a reload combined with a fetch('file:///...').
  • During invalidate, RCTFileRequestHandler.cancelRequest will also be called:
    • from RCTNetworking.invalidate:

      - (void)invalidate
      {
        [super invalidate];
        std::lock_guard<std::mutex> lock(_handlersLock);
        for (NSNumber *requestID in _tasksByRequestID) {
          [_tasksByRequestID[requestID] cancel];
        }

    • then RCTNetworkTask.cancel:

      - (void)cancel
      {
        if (_status == RCTNetworkTaskFinished) {
          return;
        }
        _status = RCTNetworkTaskFinished;
        id token = _requestToken;
        if (token && [_handler respondsToSelector:@selector(cancelRequest:)]) {
          [_handler cancelRequest:token];
        }

    • and RCTFileRequestHandler.cancelRequest:

      - (void)cancelRequest:(NSOperation *)op
      {
        [op cancel];
      }

    • The other hypothesis is a race condition between NSOperation.cancel and NSOperationQueue.cancelAllOperations. In that case, maybe we can try revising the cancelRequest method as:

      - (void)cancelRequest:(NSOperation *)op
      {
        if (!op.isCancelled && !op.isFinished) {
          [op cancel];
        }
      }

      That is actually similar to this PR's solution, though.

@zhouzh1 (Author) commented Jan 3, 2025

@Kudo Thanks for your debugging and the detailed analysis; that's very helpful. If we're not able to find the real root cause of this crash or reproduce it (I tried many ways to mock the crash scenario but couldn't trigger it), maybe I'll patch RN with this solution and see what happens.
