FEATURE: Improve error support for dynamic code in dynamic routing #28
Add some kind of hook for cases where we know things are possibly not recoverable: e.g. a network error caused dynamic routing code loading to fail. Basically allow the program to transact, or js/alert, or something that lets the user know things are hosed, and what can be done.

Comments
I found out that
Yeah, I was noticing that the load manager has a way to register for events related to load failures as well, but the deferred might be a better mechanism, since it could probably localize the problem to the specific triggering event. Then again, this is more of a global problem...your app is dead if loading code fails. So, it almost seems like registering mutations under a specific well-known name might be the most general way to handle all kinds of global failures. Thoughts?
I just tried out what happens when a load fails: if a module fails to load because of network issues and you trigger it again once you have internet, it will just work. So maybe we need to add something to the
So I was just adding comments around that basic idea to the reference guide about networking in general. Because of the optimistic update model, an idempotent set of mutations means that if you write your networking to just retry with exponential backoff, things will eventually clear up and continue properly. Unfortunately, in the general case, this cannot be automated...you have to make sure your mutations are idempotent. BUT, for code loading: you're right. There is nothing that gets "stuck" or screwed up by retrying a code load.

I was thinking of trying to do a better job of error handling in general. The global network error handler seems heavy-handed. I know D.N. had some plans around error handling that never got coded into Om Next, but I don't know what they were. The issue really has to do with UI. Sure, we can retry. But if the network is down, showing something that tells the user what is going on is really what I care about. My ideas at the moment are akin to the global error handler, or a well-defined mutation you can

I mean, certainly we can add a network retry to code load. In fact, isn't there already one? I seem to remember seeing Google's loader retrying on something. In either case, retry isn't the issue for me as much as letting the developer control the UI in response to the problem. From there, I'm also thinking that having them trigger the desired recovery might be desirable as well.
So I talked to David Nolen about his (unimplemented) ideas around Om Next error handling. Here are his exact words:

> errors are just in the tree period.
> we get back the props/error tree from the server
> on the client we split apart the props tree and the error tree - preserving all structure (edited)
> props tree will be persisted
> error tree won't, but we use it for error look ups
> when a component life-cycle methods fire, we'll check the error tree to see if we need to pass error value
> this gives us the "one shot" we want and the error tree will disappear after render cycle is complete

So, his ideas were around errors from query sources, though since general data merge is all that queries use, and we can leverage merge for novelty at any time, it is possible that this mechanism could be used for more general error handling; however, he was explicit that he didn't consider "outage/network" kinds of errors to be the target of this error handling. So, the basic implementation would be:
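As a rough illustration of that split-the-tree idea (a sketch only, not Om Next or Fulcro API; the reserved `::error` marker is an assumption, and the lifecycle-method lookup side is not shown):

```clojure
(defn split-errors
  "Returns [props-tree error-tree], preserving structure in both.
  To-many values (vectors) are not handled in this sketch."
  [tree]
  (if (map? tree)
    (reduce-kv
      (fn [[props errors] k v]
        (if (= ::error k)
          [props (assoc errors k v)]
          (let [[p e] (split-errors v)]
            [(assoc props k p)
             (if (seq e) (assoc errors k e) errors)])))
      [{} {}]
      tree)
    [tree {}]))

;; (split-errors {:person/id 1 :person/name "Sam" ::error {:reason :forbidden}})
;; => [{:person/id 1, :person/name "Sam"} {::error {:reason :forbidden}}]
```

The props tree would go through the normal merge, while the error tree would be held only for the duration of the next render cycle.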
So, the "pending" error support in Om Next is, as one might expect, centered around the idea of the tree query out, tree response in. This doesn't explicitly give a path for dealing with the kinds of other errors (outage, network, etc); however, it does provide a consistent possible render model that is database agnostic and does not pollute your app state with errors. It is possible that such support could be included in app state history (so time travel includes it). From my perspective this is of limited utility for query problems that are not related to outages. If your UI sends a query that it isn't supposed to (auth or structure), then that is a bug. The vast majority of requests won't have errors in them...we're more concerned that they just won't succeed at all. The one caveat is micro-services: if n servers are involved in responding to a query then it is possible that some parts of the tree response could be "missing", and in that case it would be nice in those cases to be able to see those errors in-situ. So, now I'd like to explore how this might help in the more general problem set:
Mutations come back with a response in the form {'mutation-symbol return-value}. Unless you've augmented the merge story, these values are discarded. There are two ways to indicate an error: throw an exception on the server (or otherwise cause a bad status code), or return a value that indicates an error. The former can be detected by Fulcro's global network error handler, but that centralizes all error handling for all mutations in one spot (and the idea here is that this kind of error typically means something went wrong enough that the user should probably just reload). The latter is gentler, but requires you to write merge code for each mutation that might return a value (typically as a multimethod). To date, this has not been the source of many reported issues, and may be sufficient. The addition of "tree-based" error support would make it possible to have an alternative that somehow triggers a merge on mutations, where the mutation can provide a query/tree combo that goes with some part of the UI. This would allow the mutation to target error messages that would be transient and would not require a presence in the database. From Fulcro's perspective (we have a known database format), the fact that errors are typically placed in app state doesn't seem to be particularly problematic.

Outage-based errors: In this case send detects the error, and it could technically "make up" a response tree in order to place transient errors that are targeted at components. In practice this seems like a lot of extra work for little gain. These kinds of errors are global errors. Your app isn't going to work until the outage is over. I don't see how the above mechanism could really be leveraged to add much.

Code loading problems (e.g. the dynamic routing issue that prompted this discussion) are really outage-based errors. Nothing is going to work right if the client can't talk to a server (unless you've coded it to work "offline", but that is an even more complex topic). As was noted earlier in this issue: simply retrying the load will work once the network has re-appeared.

So, at the moment I'm not seeing any general improvements that could significantly help our overall error handling story. Addressing the present problem of route load failure is probably the best we can do here.
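For the return-value path, a sketch of what that per-mutation merge multimethod might look like. Treat the `:mutation-merge` hook name on the client constructor and the `save-person` mutation as assumptions shown only for illustration:

```clojure
;; A multimethod keyed on the mutation symbol decides what to do with the
;; server's return value for that mutation.
(defmulti merge-mutation-return (fn [state mutation-sym return-value] mutation-sym))

;; Default: discard return values (the out-of-the-box behavior).
(defmethod merge-mutation-return :default [state _ _] state)

;; Hypothetical save mutation whose return value is either data or an error.
(defmethod merge-mutation-return 'app/save-person
  [state _ {:keys [error person]}]
  (if error
    (assoc-in state [:ui/errors 'app/save-person] error) ; transient, UI-visible error
    (update-in state [:person/by-id (:db/id person)] merge person)))

;; Installed via the client constructor's mutation-merge hook, roughly:
;; (fulcro.client.core/new-fulcro-client :mutation-merge merge-mutation-return)
```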
@mitchelkuijpers So, about your idea of something akin to a fallback: we haven't moved the route yet (it is in the pending route). Really, we could just clear the pending route. That would make it so the user could retry by "clicking again".
@awkay From my perspective:
The optimistic update focus is really giving me a hard time :) For example, a pretty standard webapp flow:
Even if it's 2 failures out of 1000 requests, this approach seems strange:
Optimistic updates sound really awesome for some actions, e.g. like an article, add to a collection. For some reason I mapped query = load data, mutation = update data. Only after talking on Slack did I realise that it is perfectly reasonable on the server to have a mutations.clj that has

Personally I think defmutation & defquery on the server add a lot of confusion. The way I think about them is more like:
So, I don't want to get too far off track on other APIs, but I would comment on optimistic updates a bit. The intention is to give the user a fast, responsive experience. The default networking does not auto-retry mutations because it doesn't know if your mutations are safe to repeat (e.g. are idempotent). As I said earlier, an idempotent mutation set with auto-retries can make a pretty robust system; however, you are correct that if you give the user the impression that you're done and they close the browser, then some data loss could occur in an unexpected way.

Have you used Google Docs? They do exactly this model: optimistic update with background save. You wouldn't want the UI blocking while you're typing something. The trick there is to let the user know when saves are actually complete. That way they can keep working, but can also see that there is network activity. This is a relatively painless UI change. Just have some clear indicator that work is still in progress, but that you're allowing them to continue. You can also block the browser tab/window closing with an onWindowClose handler, to warn them that networking is in progress.

If you must "block" the UI to ensure save-before-continue, then that is simple as well: just don't include the optimistic (local) side of the mutation! Send it to the server and follow it with a follow-on read (as you suggest, you could also do this with a parameterized query and a post mutation). This will do the traditional "lock up the UI while networking happens".

Finally, there is the actual error handling itself. There are a number of interesting things you could do to deal with this. One is time-travel :) You have the UI history. You could "transactional timestamp" the network requests with the history UUID, and if something fails, just back the user up to that point in their application use! How cool is that?

I do agree with you that telling the user that something is complete when it isn't is probably going to lead to problems. However, rather than making it act like something old, I'd prefer to give them a better experience and inform them that it is something new (and more responsive).
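A sketch of that "no optimistic side" variant, assuming a Fulcro 1.x-style defmutation; the mutation name, the namespace, and the follow-on keyword are made up for illustration:

```clojure
(ns app.mutations
  (:require [fulcro.client.mutations :refer [defmutation]]))

;; No `action` section, so nothing changes locally until the server responds;
;; the UI stays "blocked" behind whatever in-progress indicator you render.
(defmutation save-person
  "Remote-only save: no optimistic local change."
  [params]
  (remote [env] true))

;; At the call site (sketch): a follow-on read so the relevant UI refreshes
;; after the mutation runs.
;; (transact! this `[(save-person ~person) :person/by-id])
```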
@awkay The transaction time-stamp idea is really, really cool, thank you :) I know it's not ideal... but for some projects not a lot of time is allocated, and this also implies that the people doing the design understand the concept of optimistic updates; unfortunately, that is not the case for most projects.
Reading this thread, I see that we've ended up thinking about or doing similar things with om.next. Here's what we've done; maybe it can help in some way. We've implemented the "transactional timestamp" for our plain om.next app and it's been working really well. I was inspired by this talk from Lee Byron: https://vimeo.com/166790294 (skip to 21:45). Here are some thoughts:
For error messages, we have mutations returning messages that follow this protocol:

```clojure
(defprotocol IMutationMessage
  (message [this] "Returns the message")
  (final? [this] "true if the mutation message was returned from the server")
  (pending? [this] "true if we're still waiting for a server response.")
  (success? [this] "true if the server mutation was successful. False if an error occurred."))
```

A message is either pending or final. For us it's always the component that fires the mutation that is interested in the result, so we've made a namespace to make this easier. In this namespace there's a function with signature

We don't really handle network issues, code loading problems or server-unavailable problems yet, so I've got nothing to add there. I hope this helps in validating or thinking about these ideas.
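As a hypothetical sketch of the render side of that protocol (the lookup helper that finds the message for a given mutation is omitted here; `msg` is whatever it returned, or nil if nothing has been fired yet):

```clojure
;; Hypothetical render helper over the IMutationMessage protocol above.
(defn save-status-text [msg]
  (cond
    (nil? msg)     nil                           ; no mutation in flight
    (pending? msg) "Saving..."                   ; optimistic update sent, awaiting server
    (success? msg) (message msg)                 ; e.g. "Saved!"
    :else          (str "Error: " (message msg))))
```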
@petterik Great talk...that's pretty much the architecture we've had for 2 years ;) Just still finding out all you can do with it!
@petterik So, do you really find that the rebase behavior is necessary, or even optimal? My perspective is that the UI should already know (in advance) whether a mutation should be "legal"...e.g. failure of a mutation should either be a bug (you should not have let the user issue it) or an outage-related event (network failure, server outage, etc.).

I also wonder about the limits of optimistic update under lag. The user is eventually going to do something that needs a server read. If that gets backed up, then letting them continue to queue up more and more seems kind of dangerous...they might end up waiting through 10 minutes of backed-up stuff they don't care about (nav to A, delayed...so nav to B, delayed, etc...network comes back, now all those loads and such start processing).

I can't think of many cases where the optimistic update is going to be "off". In the talk, he didn't realize that Om Next invented a way to handle the fact that IDs are not available on creation (he mentions waiting to get the ID back). Fulcro augmented Om Next by ensuring that tempids that appear in the network queue get rewritten on tempid remappings...so you can actually queue a create and an update in sequence and have them work in completely optimistic mode. Thus, in theory, it seems to me that a Fulcro/Om Next app with these features (tempid remapping that includes the network queue) should issue a sequence of legal operations, which it has correctly applied locally. If there is a need to wait for a read, then you block the UI with a loading marker; there is not much you can improve there.

In the talk, the assumption is that your mutations might return something different from what you'd done optimistically. I think this is wrong (as in: I would consider that a bad implementation). If you're unsure of the server state after a mutation, your app should be implemented with a follow-on remote read (in queue) to ensure that the app state becomes up-to-date. I agree that his model is technically correct, but it's harder to reason about because your app history is no longer clean in the sense of linear progression. The rebasing causes all sorts of churn in history (just as it does in git). I guess you could bypass recording these rebases, but then you have to consider that the app state changes in weird ways that are not directly traceable from the mutation history (e.g. at some snapshot in time you see A change, due to an async rebase, even though only mutations that were changing B ran in that time span). Now you're back in async problem space. With a simple follow-on read model, the network queue and mutation history maintain a nice linear progression. Technically, a merge handler for mutations could accept data as return values (which Fulcro gives you a hook for), and that doesn't hurt anything either (the post-mutation merge is a linear progression step).

If we take my earlier assertion (mutations should not fail unless there is an outage-level event), then we can do things in a much simpler fashion:
Now, what does this look like in (practical) reality?
Yes, the user potentially loses work...but there's not much you can really do about that if it wasn't a network-related, recoverable operation. If you've cached the mutation history (for replay), then technically the dialog can allow the user to at least retry the sequence. The assumption here is that a server error (not network related...maybe the db was down) could produce a problem that you could (eventually) resolve, but that you would not want to completely automate (in case the user either decided to give up, or it was obvious things were hosed). You have the UI history, so you can "back up". You can record the transaction history, so you have a replay buffer (both of which you're essentially forking the second you start your "retry" dialogs).

Now you don't need any special mutation "protocol" at all: network errors are network errors. Security, outage, and bugs are exceptions (server 500). The former get auto-retried; the latter are possibly auto-retried a few times, but if that's not working you can give the user an option of how to proceed. If it is a true security problem, then who cares: it's probably a hacker, and a 500 doesn't hurt your real users. If it's a true outage, then you're giving them the best possible handling of it: retry until you're satisfied things are too screwed up to continue.

Thoughts?
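A rough sketch of that policy, none of it existing Fulcro API (the `send!` shape, the callback, and all names are assumptions): retry with exponential backoff, then hand control to the app so it can show a "things are hosed" dialog or transact a well-known error mutation:

```clojure
(defn retrying-send
  "Wraps a send function (assumed shape: (send! request on-error)). Retries
  with exponential backoff up to `retries` times; after that calls
  `on-give-up`, which is where the app shows its error UI, offers a manual
  retry, or replays from the cached mutation history."
  [send! on-give-up & {:keys [retries base-ms] :or {retries 3 base-ms 500}}]
  (fn [request]
    (letfn [(attempt [n]
              (send! request
                     (fn on-error [err]
                       (if (< n retries)
                         ;; back off: 500ms, 1s, 2s, ... before the next attempt
                         (js/setTimeout #(attempt (inc n)) (* base-ms (Math/pow 2 n)))
                         (on-give-up err)))))]
      (attempt 0))))
```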
OK, so I changed the implementation to retry a route's loads until they succeed or the user changes the route. The ModuleManager automatically retries for about 30s. I added a little exponential back-off and an attempt count, but I didn't enforce a count limit, and the back-off is more of a safety measure (in case goog ModuleLoader changes how it handles a failure). I'm going to move the other conversation about general error handling improvements to a new issue.
Please continue error handling general discussion in #34 |