|
2 | 2 |
|
3 | 3 | ## Durable execution |
4 | 4 |
|
5 | | -A durable execution represents the complete lifecycle of a Lambda durable function. The SDK uses a checkpoint and replay mechanism to track progress, suspend execution, and recover from failures. A single execution may span multiple Lambda invocations. |
| 5 | +A durable execution is the complete lifecycle of an AWS Lambda durable function. It uses |
| 6 | +a checkpoint and replay mechanism to track progress, suspend execution, and recover from |
| 7 | +failures. When functions resume after suspension or interruptions, previously completed |
| 8 | +checkpoints replay and the function continues execution. |
| 9 | + |
| 10 | +A single durable execution can span multiple invocations of the Lambda function, |
| 11 | +particularly after suspensions or failure recovery. Because each invocation replays |
| 12 | +completed checkpoints, the execution can run for extended periods (up to one year) |
| 13 | +while making reliable progress despite interruptions. |
| 14 | + |
| 15 | +### Timeouts |
| 16 | + |
| 17 | +The |
| 18 | +[execution timeout](https://docs.aws.amazon.com/lambda/latest/api/API_DurableConfig.html#lambda-Type-DurableConfig-ExecutionTimeout) |
| 19 | +and Lambda function |
| 20 | +[Timeout](https://docs.aws.amazon.com/lambda/latest/api/API_CreateFunction.html#lambda-CreateFunction-request-Timeout) |
| 21 | +are different settings. The Lambda function timeout controls how long each individual |
| 22 | +invocation can run (maximum 15 minutes). The execution timeout controls the total |
| 23 | +elapsed time for the entire durable execution (maximum 1 year). |
6 | 24 |
|
7 | 25 | ## Durable functions |
8 | 26 |
|
9 | | -A durable function is a Lambda function decorated with `@durable_execution` that can be checkpointed and resumed. The function receives a `DurableContext` that provides methods for durable operations. |
| 27 | +A durable function is a Lambda function configured with the |
| 28 | +[`DurableConfig`](https://docs.aws.amazon.com/lambda/latest/dg/durable-configuration.html) |
| 29 | +object at creation time. Lambda will then apply the checkpoint and replay mechanism to |
| 30 | +the function's execution to make it durable at invocation time. |
10 | 31 |
|
11 | | -## Operations |
| 32 | +## DurableContext |
| 33 | + |
| 34 | +`DurableContext` is the context object your durable function receives instead of the |
| 35 | +standard Lambda `Context`. It exposes all durable operations and provides methods for |
| 36 | +creating checkpoints, managing execution flow, and coordinating with external systems. |
12 | 37 |
|
13 | | -Operations are units of work in a durable execution. Each operation type serves a specific purpose: |
| 38 | +The following examples show a handler receiving a `DurableContext`: |
14 | 39 |
|
15 | | -- **Steps** - Execute code and checkpoint the result with retry support |
16 | | -- **Waits** - Pause execution for a specified duration without blocking Lambda |
17 | | -- **Callbacks** - Wait for external systems to respond with results |
18 | | -- **Invoke** - Call other durable functions to compose complex workflows |
19 | | -- **Child contexts** - Isolate nested workflows for better organization |
20 | | -- **Parallel** - Execute multiple operations concurrently with completion criteria |
21 | | -- **Map** - Process collections in parallel with batching and failure tolerance |
| 40 | +=== "TypeScript" |
| 41 | + |
| 42 | + ```typescript |
| 43 | + --8<-- "examples/typescript/getting-started/durable-context.ts" |
| 44 | + ``` |
| 45 | + |
| 46 | +=== "Python" |
| 47 | + |
| 48 | + ```python |
| 49 | + --8<-- "examples/python/getting-started/durable-context.py" |
| 50 | + ``` |
| 51 | + |
| 52 | +=== "Java" |
| 53 | + |
| 54 | + ```java |
| 55 | + --8<-- "examples/java/getting-started/durable-context.java" |
| 56 | + ``` |
| 57 | + |
| 58 | +## Operations |
| 59 | + |
| 60 | +Operations are units of work in a durable execution. Each operation type serves a |
| 61 | +specific purpose: |
| 62 | + |
| 63 | +- [Steps](../sdk-reference/operations/step.md): Execute business logic with automatic |
| 64 | + checkpointing and configurable retry |
| 65 | +- [Waits](../sdk-reference/operations/wait.md): Suspend execution for a duration without |
| 66 | + consuming compute resources |
| 67 | +- [Callbacks](../sdk-reference/operations/callback.md): Suspend execution and wait for an |
| 68 | + external system to submit a result |
| 69 | +- [Invoke](../sdk-reference/operations/invoke.md): Invoke another Lambda function and |
| 70 | + checkpoint the result |
| 71 | +- [Parallel](../sdk-reference/operations/parallel.md): Execute multiple independent |
| 72 | + operations concurrently |
| 73 | +- [Map](../sdk-reference/operations/map.md): Execute an operation on each item in an |
| 74 | + array concurrently with optional concurrency control |
| 75 | +- [Child context](../sdk-reference/operations/child-context.md): Group operations into an |
| 76 | + isolated context for sub-workflow organization and concurrent determinism |
| 77 | +- [Wait for condition](../sdk-reference/operations/wait-for-condition.md): Poll for a |
| 78 | + condition with automatic checkpointing between attempts |
22 | 79 |
|
23 | 80 | ## Checkpoints |
24 | 81 |
|
25 | | -Checkpoints are saved states of execution that allow resumption. When your function calls `context.step()` or other operations, the SDK creates a checkpoint and sends it to AWS. If Lambda recycles your environment or your function waits for an external event, execution can resume from the last checkpoint. |
| 82 | +A checkpoint is a saved record of a completed durable operation: its type, name, inputs, |
| 83 | +result, and timestamp. The SDK creates checkpoints automatically as your function |
| 84 | +executes operations. Together, the checkpoints form a log that Lambda uses to resume |
| 85 | +execution after a suspension or interruption. |
| 86 | + |
| 87 | +When your code calls a durable operation, the SDK follows this sequence: |
| 88 | + |
| 89 | +1. **Check for an existing checkpoint**: if this operation already completed in a |
| 90 | + previous invocation, the SDK returns the stored result without re-executing |
| 91 | +2. **Execute the operation**: if no checkpoint exists, the SDK runs the operation code |
| 92 | +3. **Serialize the result**: the SDK serializes the result for storage |
| 93 | +4. **Persist the checkpoint**: the SDK calls the Lambda checkpoint API to durably store |
| 94 | + the result before continuing |
| 95 | +5. **Return the result**: execution continues to the next operation |
| 96 | + |
| 97 | +Once the SDK persists a checkpoint, that operation's result is safe. If your function is |
| 98 | +interrupted at any point, the SDK can replay up to the last persisted checkpoint on the |
| 99 | +next invocation. |
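
The five-step sequence above can be sketched in a few lines of plain Python. This is a toy model, not the SDK: `ToyContext`, `durable_store`, and `fetch_order` are hypothetical names, an in-memory dictionary stands in for the Lambda checkpoint API, and JSON is assumed as the serialization format.

```python
import json
from typing import Any, Callable


class ToyContext:
    """Illustrative stand-in for a durable context; not the real SDK."""

    def __init__(self, store: dict[str, str]) -> None:
        # `store` plays the role of durable checkpoint storage that
        # survives across invocations (an assumption for this sketch).
        self.store = store

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        # 1. Check for an existing checkpoint.
        if name in self.store:
            return json.loads(self.store[name])
        # 2. Execute the operation.
        result = func()
        # 3. Serialize the result.
        payload = json.dumps(result)
        # 4. Persist the checkpoint before continuing.
        self.store[name] = payload
        # 5. Return the result.
        return result


calls: list[str] = []


def fetch_order() -> dict:
    calls.append("fetch_order")  # record each real execution
    return {"order_id": 42, "status": "NEW"}


durable_store: dict[str, str] = {}

# Two contexts sharing one store model two invocations of one execution.
first = ToyContext(durable_store).step("fetch-order", fetch_order)
second = ToyContext(durable_store).step("fetch-order", fetch_order)
```

In this sketch the second call finds the persisted checkpoint and returns the stored result, so `fetch_order` runs only once even though `step` is called twice.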
26 | 100 |
|
27 | 101 | ## Replay |
28 | 102 |
|
29 | | -When your function resumes, completed operations don't re-execute. Instead, they return their checkpointed results instantly. This means your function code runs multiple times, but side effects only happen once per operation. |
| 103 | +Lambda keeps a running log of all durable operations as your function executes. When |
| 104 | +your function needs to pause or encounters an interruption, Lambda saves this checkpoint |
| 105 | +log and stops the execution. When it's time to resume, Lambda invokes your function |
| 106 | +again from the beginning and replays the checkpoint log: |
| 107 | + |
| 108 | +1. **Load checkpoint log**: the SDK retrieves the checkpoint log for the execution from |
| 109 | + Lambda |
| 110 | +2. **Run from beginning**: your handler runs from the start, not from where it paused |
| 111 | +3. **Skip completed operations**: as your code calls durable operations, the SDK checks |
| 112 | + each against the checkpoint log and returns stored results without re-executing the |
| 113 | + operation code |
| 114 | +4. **Resume at interruption point**: when the SDK reaches an operation without a |
| 115 | + checkpoint, it executes normally and creates new checkpoints from that point |
| 116 | + forward |
| 117 | + |
| 118 | +The SDK enforces determinism by validating that operation names and types match the |
| 119 | +checkpoint log during replay. Your orchestration code must make the same sequence of |
| 120 | +durable operation calls on every invocation. |
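
To make the name and type validation concrete, here is a minimal sketch of how a replay check could work. It is an assumption-laden illustration rather than the SDK's implementation: `ReplayContext`, `Entry`, and `NonDeterministicError` are hypothetical names, and the log is modeled as an ordered Python list.

```python
from dataclasses import dataclass
from typing import Any, Callable


class NonDeterministicError(Exception):
    """Raised when replayed code diverges from the checkpoint log."""


@dataclass
class Entry:
    op_type: str
    name: str
    result: Any


class ReplayContext:
    def __init__(self, log: list[Entry]) -> None:
        self.log = log      # checkpoint log loaded at invocation start
        self.position = 0   # index of the next operation in call order

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        if self.position < len(self.log):
            entry = self.log[self.position]
            # Replaying: the call must match what was recorded.
            if (entry.op_type, entry.name) != ("step", name):
                raise NonDeterministicError(
                    f"expected {entry.op_type} '{entry.name}', got step '{name}'"
                )
            self.position += 1
            return entry.result
        # Past the end of the log: execute and append a new checkpoint.
        result = func()
        self.log.append(Entry("step", name, result))
        self.position += 1
        return result


def handler(ctx: ReplayContext) -> int:
    a = ctx.step("load", lambda: 2)
    return ctx.step("double", lambda: a * 2)


def reordered_handler(ctx: ReplayContext) -> int:
    # Calls the same operations in a different order.
    a = ctx.step("double", lambda: 4)
    return ctx.step("load", lambda: a)


log: list[Entry] = []
first_result = handler(ReplayContext(log))    # executes and checkpoints
replay_result = handler(ReplayContext(log))   # replays from the log

try:
    reordered_handler(ReplayContext(log))
    diverged = False
except NonDeterministicError:
    diverged = True
```

Replaying `handler` against its own log returns the stored results, while `reordered_handler` fails on its first call because the log recorded `load` at that position.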
| 121 | + |
| 122 | +## Determinism |
30 | 123 |
|
31 | | -Because your code runs again on replay, it must be **deterministic** — avoid random values, timestamps, or external API calls outside of steps, as these can produce different values on replay. |
| 124 | +Because your code runs again on replay, it must be **deterministic**. Deterministic |
| 125 | +code always produces the same results given the same inputs. For a durable function, |
| 126 | +this means that given the same inputs and checkpoint log, it must make the same |
| 127 | +sequence of durable operation calls. Avoid non-deterministic operations (like |
| 128 | +generating random numbers or getting the current time) outside of steps, as these can |
| 129 | +produce different values during replay and cause the replay to diverge from the log. |
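
The difference between generating a value inside a step and outside one can be shown with a small simulation. This is only a sketch: `run_step` is a hypothetical helper that mimics checkpoint lookup with a dictionary, and it does not use the SDK.

```python
import random
from typing import Any, Callable


def run_step(log: dict[str, Any], name: str, func: Callable[[], Any]) -> Any:
    """Toy step: run `func` once, then serve the stored result on replay."""
    if name not in log:
        log[name] = func()
    return log[name]


def handler(log: dict[str, Any]) -> tuple[int, float]:
    # Inside a step: generated once, then read from the checkpoint on replay.
    token = run_step(log, "make-token", lambda: random.randint(0, 10**9))
    # Outside a step: regenerated on every invocation, so it can differ
    # between the original run and a replay.
    unstable = random.random()
    return token, unstable


log: dict[str, Any] = {}
token_1, unstable_1 = handler(log)   # original run
token_2, unstable_2 = handler(log)   # simulated replay
```

Here `token_1` and `token_2` are always equal because the value comes from the checkpoint, whereas `unstable_1` and `unstable_2` are drawn independently on each run. Any branching based on the unstable value could lead the replay down a different path.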
32 | 130 |
|
33 | | -## How replay works in practice |
| 131 | +### Rules for deterministic durable operations |
| 132 | + |
| 133 | +1. All durable operations in a context must start sequentially. |
| 134 | +2. To run durable operations concurrently, wrap each set of operations in its own child |
| 135 | + context and then run the child contexts concurrently. |
| 136 | +3. Only use the child `DurableContext` in the child context scope. Do not use any |
| 137 | + parent's context in a child context scope. |
| 138 | + |
| 139 | +## Replay walkthrough |
34 | 140 |
|
35 | 141 | Let's trace through a simple workflow: |
36 | 142 |
|
37 | 143 | === "TypeScript" |
38 | 144 |
|
39 | | - ``` typescript |
| 145 | + ```typescript |
40 | 146 | --8<-- "examples/typescript/getting-started/execution-model.ts" |
41 | 147 | ``` |
42 | 148 |
|
43 | 149 | === "Python" |
44 | 150 |
|
45 | | - ``` python |
| 151 | + ```python |
46 | 152 | --8<-- "examples/python/getting-started/execution-model.py" |
47 | 153 | ``` |
48 | 154 |
|
49 | 155 | === "Java" |
50 | 156 |
|
51 | | - ``` java |
| 157 | + ```java |
52 | 158 | --8<-- "examples/java/getting-started/execution-model.java" |
53 | 159 | ``` |
54 | 160 |
|
55 | 161 | **First invocation (t=0s):** |
56 | 162 |
|
57 | | -1. Lambda invokes your function |
58 | | -2. `fetch_data` executes and calls an external API |
59 | | -3. Result is checkpointed to AWS |
60 | | -4. `context.wait(Duration.from_seconds(30))` is reached |
61 | | -5. Function returns, Lambda can recycle the environment |
| 163 | +1. You start a durable execution by invoking a durable function |
| 164 | +2. The durable functions service invokes your durable function handler |
| 165 | +3. The fetch step runs and calls an external API |
| 166 | +4. The SDK checkpoints the result of the fetch step |
| 167 | +5. Execution reaches `context.wait()` and the SDK checkpoints the wait operation |
| 168 | +6. The SDK terminates the current Lambda invocation, but the durable execution is still |
| 169 | + active |
62 | 170 |
|
63 | 171 | **Second invocation (t=30s):** |
64 | 172 |
|
65 | | -1. Lambda invokes your function again |
66 | | -2. Function code runs from the beginning |
67 | | -3. `fetch_data` returns the checkpointed result instantly (no API call) |
68 | | -4. `context.wait()` is already complete, execution continues |
69 | | -5. `process_data` executes for the first time |
70 | | - |
71 | | -## The two SDKs |
72 | | - |
73 | | -### Execution SDK (`aws-durable-execution-sdk-python`) |
74 | | - |
75 | | -Runs in your Lambda functions. Provides `DurableContext`, operations, decorators, and serialization. Install in your Lambda deployment package. |
76 | | - |
77 | | -```console |
78 | | -pip install aws-durable-execution-sdk-python |
79 | | -``` |
80 | | - |
81 | | -### Testing SDK (`aws-durable-execution-sdk-python-testing`) |
82 | | - |
83 | | -A separate SDK for testing your durable functions locally without AWS. Provides `DurableFunctionTestRunner`, pytest integration, and result inspection. Install in your development environment only. |
84 | | - |
85 | | -```console |
86 | | -pip install aws-durable-execution-sdk-python-testing |
87 | | -``` |
88 | | - |
89 | | -## Decorators |
90 | | - |
91 | | -The SDK provides decorators to mark functions as durable: |
92 | | - |
93 | | -- `@durable_execution` - Marks your Lambda handler as a durable function |
94 | | -- `@durable_step` - Marks a function that can be used with `context.step()` |
95 | | -- `@durable_with_child_context` - Marks a function that receives a child context |
| 173 | +1. The durable functions service invokes your function again |
| 174 | +2. The function runs from the beginning |
| 175 | +3. The fetch step returns its checkpointed result instantly; it does not re-execute the |
| 176 | + API call |
| 177 | +4. The wait has already elapsed, so execution continues |
| 178 | +5. The process step runs for the first time |
| 179 | +6. The SDK checkpoints the result of the process step |
| 180 | +7. The function returns naturally and the invocation ends |
| 181 | +8. The durable execution ends |
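
The two invocations traced above can be simulated without AWS. The sketch below is a toy model under stated assumptions: `ToyContext`, `Suspend`, and the simulated clock `now` are hypothetical, the wait is modeled by checkpointing a wake-up time, and the real SDK's suspension mechanism may differ.

```python
from typing import Any, Callable


class Suspend(Exception):
    """Signals that the invocation ends while the execution stays active."""


class ToyContext:
    def __init__(self, log: dict[str, Any], now: float) -> None:
        self.log = log    # checkpoint log shared across invocations
        self.now = now    # simulated clock, in seconds

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        if name not in self.log:
            self.log[name] = func()   # execute and checkpoint
        return self.log[name]

    def wait(self, name: str, seconds: float) -> None:
        # Checkpoint the wake-up time the first time the wait is reached.
        resume_at = self.log.setdefault(name, self.now + seconds)
        if self.now < resume_at:
            raise Suspend()


api_calls: list[str] = []


def fetch() -> list[int]:
    api_calls.append("fetch")   # stands in for the external API call
    return [1, 2, 3]


def handler(ctx: ToyContext) -> int:
    data = ctx.step("fetch", fetch)
    ctx.wait("pause", 30)
    return ctx.step("process", lambda: sum(data))


log: dict[str, Any] = {}

try:
    handler(ToyContext(log, now=0))        # first invocation (t=0s)
    suspended = False
except Suspend:
    suspended = True

result = handler(ToyContext(log, now=30))  # second invocation (t=30s)
```

The first call stops at the wait, leaving checkpoints for `fetch` and `pause` in the log. The second call reruns the handler from the top, reads the stored fetch result instead of calling `fetch` again, passes the elapsed wait, and runs the process step for the first time.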