
Feature: provide a set of results for all invocations caused by a message #28

alexellis opened this issue Aug 9, 2019 · 7 comments


@alexellis
Member

alexellis commented Aug 9, 2019

Description

Several users have requested the ability to "retry" invocations, but I think that retrying is made more difficult by the 1:* relationship between topics and functions.

[Image: sqs-connector-broadcast, a conceptual diagram of invocations]

Use-case

As a developer using the connector-sdk, I want to be able to retry invocations that failed

1:1 mapping

Suppose a message comes in on the topic payment, and only one function, stripe-fn, has the matching annotation topic: payment.

In the case of a failure, the connector could just retry. This would probably be OK depending on the function, though some caution should be applied here.

The Linkerd2 documentation talks about the dangers of retrying automatically and about retry budgets. See "How Retries Can Go Wrong": https://linkerd.io/2/features/retries-and-timeouts/

1:* mapping

This is the assumed default operating mode.

If a message comes in on the topic payment, and two functions exist, stripe-fn and datalake-fn, both with topic: payment.

In the case that datalake-fn fails and stripe-fn passes, following the logic above, retrying the whole batch may issue a second payment via stripe-fn.

The proposed solution in #26 would act this way.

Potential solutions

  1. #26 ([Feature Request] Include contextual information in responses) proposes that any response should indicate whether to acknowledge, delete, or do something else with a message. This works for a 1:1 mapping, but fails for a 1:* mapping.

  2. Have the status of all the invocations returned as a batch

The Invoke method, which is called by all connectors when they receive messages, could be extended to return a set of results. By returning a set of results, the caller has to wait up to N*max_timeout to decide whether a set of invocations was successful or not. This may be offset by using a Go channel, but the connector may still have to wait for all of the results to come in before making a decision.

This is what 2) would look like with the example in the connector-sdk GitHub repo:

Before: [screenshot of the existing connector-sdk example, 2019-08-09]

After: [screenshot of the example modified to collect a set of invocation results, 2019-08-09]
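
Since the screenshots are not reproduced here, below is a minimal sketch in Go of what option 2 could look like: one result per matched function, streamed back over a channel. The InvokerResponse type, the Invoke signature, and the gateway URL are illustrative assumptions, not the actual connector-sdk API.

```go
// Minimal sketch (not the actual connector-sdk API) of returning a set
// of results, one per function matched to a topic, over a channel.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// InvokerResponse is a hypothetical per-function result.
type InvokerResponse struct {
	Function string        // function that was invoked
	Status   int           // HTTP status from the gateway, 0 on transport error
	Err      error         // transport-level error, if any
	Duration time.Duration // time taken for the invocation
}

// Invoke calls every matched function and streams one result per
// invocation, so the caller can decide to ack, retry or dead-letter
// each function individually rather than the whole batch.
func Invoke(gatewayURL, topic string, functions []string, body []byte, results chan<- InvokerResponse) {
	for _, fn := range functions {
		go func(fn string) {
			start := time.Now()
			resp, err := http.Post(fmt.Sprintf("%s/function/%s", gatewayURL, fn),
				"application/octet-stream", bytes.NewReader(body))
			r := InvokerResponse{Function: fn, Err: err, Duration: time.Since(start)}
			if err == nil {
				r.Status = resp.StatusCode
				resp.Body.Close()
			}
			results <- r
		}(fn)
	}
}

func main() {
	functions := []string{"stripe-fn", "datalake-fn"} // both subscribed to topic: payment
	results := make(chan InvokerResponse, len(functions))
	Invoke("http://127.0.0.1:8080", "payment", functions, []byte(`{"amount": 42}`), results)

	// The caller still waits for all results before making a batch-level
	// decision, but can now retry only the functions that failed.
	for range functions {
		r := <-results
		fmt.Printf("%s -> status=%d err=%v (%s)\n", r.Function, r.Status, r.Err, r.Duration)
	}
}
```

With per-function results like this, the connector could retry only datalake-fn without re-invoking stripe-fn, which is the crux of the 1:* problem above.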

I would welcome input from anyone who is considering conditional acking, retrying or doing anything else with the results from the invocations.

@embano1
Contributor

embano1 commented Aug 11, 2019

Thanks Alex for putting this out, as retry and error handling is a growing concern for many implementations of the connector-sdk (which might not make it easy to reach consensus).

Before going into the details, perhaps it would help to clarify what we want the user experience to be like, i.e. do we want to put as much control/logic (but also complexity) as possible on the caller (the connector-sdk user), or have the OpenFaaS gateway deal with errors and retries?

I'm sure many will be in favor of the first, but the latter for me would be in line with the great user experience and ease of use of OpenFaaS, which is one of the main reasons why the community likes it.

Instead of letting the invoker figure out which individual functions failed and need a retry (failure is not binary; also note the risk of retry storms, etc. as pointed out in the article linked above), from a UX perspective simply knowing that OpenFaaS has received (ack-ed) the invocation request would be a user-friendly implementation, although the complexity is obviously shifted to the OpenFaaS gateway. There would still be a need for a (simple) retry/backoff implementation (with jitter) in the connector-sdk, configurable, e.g. at-most/least-once up to X retries.
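
For reference, here is a minimal sketch of a configurable retry with exponential backoff and full jitter, of the kind described above. The helper name, parameters and base delay are assumptions for illustration, not existing connector-sdk code.

```go
// Sketch of a simple retry helper with exponential backoff and full
// jitter; the name and signature are illustrative, not part of the SDK.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries fn up to maxRetries times, sleeping for a
// random duration between 0 and base*2^attempt before each retry.
func retryWithJitter(maxRetries int, base time.Duration, fn func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break
		}
		backoff := time.Duration(rand.Int63n(int64(base) << uint(attempt)))
		time.Sleep(backoff)
	}
	return fmt.Errorf("gave up after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	err := retryWithJitter(5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("simulated gateway error") // fails twice, then succeeds
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```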

Happy to brainstorm on this and figure out where the logic and heavy lifting should reside.

@ontehfritz

Retries for invocations can be achieved through a "retry budget" with Linkerd, or through your choice of ingress controller such as Heptio Contour.

Yes, I looked into it; it's not ideal, but it could be used as a workaround for now.

As an example, not exactly the same as what we are talking about, but durable functions with Azure are pretty cool; they do it in a more imperative way.
https://www.serverlessnotes.com/docs/retries-with-azure-durable-functions

To take it one step further, I feel it would be way cooler if it was declarative, either on invocation or on deploy of any OpenFaaS function; leaning towards invocation would be the ultimate solution. There is an issue: how do you tell the function to use retries from events? But this could be done with annotations. We already do that in the stack.yml file for topics; adding retry and back-off annotations would mean that, if they exist, the connector executes the functions with those parameters.

As suggested by @Miserlou:

Would this be a new header? X-Retry-Strategy: Exponential, X-Num-Retries: 5 (with a default maximum to prevent ddos), etc?

Then the connector would pick up the metadata and invoke the functions with retries and backoff (see the sketch below). This also gives the ability to provide retry functionality when using regular REST invocations as well, rather than relying only on the connector for retries. So basically it is saying: please retry with the payload given, whether it came from an event or a REST invocation.

Now, how it works under the hood, and whether it is a matter of including Linkerd etc., I am not sure yet. Just throwing out ideas of how I would want it ;)
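
To make the header idea concrete, here is a rough sketch of how a connector might parse the proposed X-Retry-Strategy / X-Num-Retries headers and derive a retry policy. The header names come from the suggestion above; the types, cap, and base delay are hypothetical.

```go
// Sketch only: the X-Retry-Strategy / X-Num-Retries headers are the
// suggestion discussed above, not an existing OpenFaaS convention.
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

type retryPolicy struct {
	strategy   string // e.g. "exponential" or "fixed"
	maxRetries int
}

// policyFromHeaders reads the proposed headers, capping retries to
// address the "default maximum to prevent DDoS" concern raised above.
func policyFromHeaders(h http.Header) retryPolicy {
	const hardCap = 10
	p := retryPolicy{strategy: "exponential", maxRetries: 0}
	if s := h.Get("X-Retry-Strategy"); s != "" {
		p.strategy = s
	}
	if n, err := strconv.Atoi(h.Get("X-Num-Retries")); err == nil {
		if n > hardCap {
			n = hardCap
		}
		p.maxRetries = n
	}
	return p
}

// delay computes the wait before a given attempt for the policy.
func (p retryPolicy) delay(attempt int) time.Duration {
	base := 100 * time.Millisecond
	if p.strategy == "exponential" {
		return base << uint(attempt)
	}
	return base
}

func main() {
	h := http.Header{}
	h.Set("X-Retry-Strategy", "exponential")
	h.Set("X-Num-Retries", "5")

	p := policyFromHeaders(h)
	for attempt := 0; attempt < p.maxRetries; attempt++ {
		fmt.Printf("attempt %d would wait %s\n", attempt, p.delay(attempt))
	}
}
```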

@alexellis
Member Author

Hi @ontehfritz thanks for moving the comments / conversation over to this repository.

See also "How Retries Can Go Wrong" and "Retry budgets" -> https://linkerd.io/2/features/retries-and-timeouts/

Linkerd is not included by default, but can be added through this tutorial https://github.com/openfaas-incubator/openfaas-linkerd2

@ontehfritz

@alexellis thanks! Yes, retries are a balancing act, but with event-triggered functions, even a single automatic retry that then fails is a huge help compared with figuring out which event failed and resubmitting it. Especially if you use an event-sourcing pattern, you don't want to pollute your stream/topic, and restarting the consumer group at a missed offset is also messy. It is easier to say: go retry all failed jobs.

Retry budgets are a pretty cool way to look at retries without circuit breaking on a fixed number of retries; pretty interesting.

Some background on what we want to accomplish with OpenFaaS:
We have had great success integrating a job queue and triggering functions with events in our microservices, and we have abstracted it into a library. However, we found that creating a microservice is overkill, because most of these are just simple functions, and the ceremony of creating the Docker containers and deploying them is not ideal. Hence, we have ported them over to OpenFaaS and it is working well. But we always need to plan for failure, which is why we need the job queue/retry functionality.

If the connector went down, that is not a big deal, as the consumer group's events would get queued; when the connector comes back up it will resume consuming events. If the function fails, we need to know about it and retry the function (manually or automatically) with the same payload it received.

Writing this out, maybe that is more important than automatic retry/backoff: just the ability to know which functions have been failing and to trigger the function again with the same payload. From there, automatic retries/backoffs would be fairly easy to build on top.

Thanks for letting me think out loud! We will try the Linkerd approach and see how it goes! I am sure this will stimulate more insight into how to handle this optimally with OpenFaaS :)

@ontehfritz

ontehfritz commented Oct 8, 2019

Looking at your diagram above with the 1:* problem, you would need to record the invocation, the payload received, and the status of execution. In this scenario in our microservices (the Node.js ones anyway), we offload that functionality to bee-queue, which uses Redis to store the job with the payload. This gives us the functionality I have been describing. We might be able to do the same thing in the connector; we were going to look into potentially integrating this into our Kafka connector:
https://github.com/benmanns/goworker or something else that provides job runner/queue functionality.

Haven't had time to dig into it, but it might work; a rough sketch of the record-and-retry idea is below.
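
As a rough illustration of that idea, here is an in-memory sketch of storing each invocation with its payload and status so that failed jobs can be re-invoked later with the same payload. A real implementation would persist this in Redis (as bee-queue does) or via a job runner like goworker; all types and names here are hypothetical.

```go
// In-memory sketch of recording each invocation with its payload and
// status so failed jobs can be retried later with the same payload.
// Illustrative only; a real connector would persist this externally.
package main

import (
	"fmt"
	"sync"
	"time"
)

type job struct {
	ID       int
	Function string
	Topic    string
	Payload  []byte
	Status   string // "ok" or "failed"
	Created  time.Time
}

type jobStore struct {
	mu   sync.Mutex
	next int
	jobs map[int]*job
}

func newJobStore() *jobStore {
	return &jobStore{jobs: map[int]*job{}}
}

// record stores the outcome of one function invocation.
func (s *jobStore) record(function, topic string, payload []byte, ok bool) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next++
	status := "ok"
	if !ok {
		status = "failed"
	}
	s.jobs[s.next] = &job{ID: s.next, Function: function, Topic: topic,
		Payload: payload, Status: status, Created: time.Now()}
	return s.next
}

// failed returns the jobs that should be retried with their original payload.
func (s *jobStore) failed() []*job {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []*job
	for _, j := range s.jobs {
		if j.Status == "failed" {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	store := newJobStore()
	payload := []byte(`{"amount": 42}`)
	store.record("stripe-fn", "payment", payload, true)
	store.record("datalake-fn", "payment", payload, false)

	// Only the failed function is retried, with the payload it originally received.
	for _, j := range store.failed() {
		fmt.Printf("retry %s with payload %s\n", j.Function, j.Payload)
	}
}
```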

@ontehfritz

I have a connector POC written in Node.js, using a job queue. It needs a little more work, but I will share it on GitHub when done. ;)

@ontehfritz

ontehfritz commented Oct 10, 2019

Here is our connector: https://github.com/ratehub/openfaas-kafka-connector. It still needs a lot of polish, as I have only had a day and a half to work on it, but testing is going really well so far, so you can keep an eye on our progress! Thanks :) Any insights and opinions would be greatly appreciated @alexellis
