Merged
43 changes: 12 additions & 31 deletions docs/getting-started/index.md
@@ -1,38 +1,19 @@
# Getting Started

## Installation
This section will help you get started building durable functions in
[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html).

Install the SDK using pip:
For in-depth detail on the concepts introduced here, see the
[SDK Reference](../sdk-reference/) section.

```console
pip install aws-durable-execution-sdk-python
```
## Prerequisites

## Quick example
- An AWS account with permissions to create and execute AWS Lambda functions
- Familiarity with writing AWS Lambda functions

Here's a simple durable function that processes an order:
## In this section

=== "TypeScript"

``` typescript
--8<-- "examples/typescript/index/quick-example.ts"
```

=== "Python"

``` python
--8<-- "examples/python/index/quick-example.py"
```

=== "Java"

``` java
--8<-- "examples/java/index/quick-example.java"
```

Each `context.step()` call is checkpointed automatically. If Lambda recycles your execution environment, the function resumes from the last completed step.

## Next steps

- [Key Concepts](key-concepts.md) - Understand the mental model behind durable execution
- [Quick Start](quick-start.md) - Build and test your first durable function
- [Key Concepts](key-concepts.md) - Understand durable execution, checkpoints, replay, and
  the DurableContext before writing code
- [Quick Start](quick-start.md) - Install the SDK, write your first durable function, and
  test it locally
194 changes: 140 additions & 54 deletions docs/getting-started/key-concepts.md
@@ -2,94 +2,180 @@

## Durable execution

A durable execution represents the complete lifecycle of a Lambda durable function. The SDK uses a checkpoint and replay mechanism to track progress, suspend execution, and recover from failures. A single execution may span multiple Lambda invocations.
A durable execution is the complete lifecycle of an AWS Lambda durable function. It uses
a checkpoint and replay mechanism to track progress, suspend execution, and recover from
failures. When functions resume after suspension or interruptions, previously completed
checkpoints replay and the function continues execution.

A single execution can span multiple invocations of the Lambda function, particularly
after suspensions or failure recovery. Because each new invocation replays the
checkpoints that are already complete, the execution can run for extended periods (up to
one year) while maintaining reliable progress despite interruptions.

### Timeouts

The
[execution timeout](https://docs.aws.amazon.com/lambda/latest/api/API_DurableConfig.html#lambda-Type-DurableConfig-ExecutionTimeout)
and Lambda function
[Timeout](https://docs.aws.amazon.com/lambda/latest/api/API_CreateFunction.html#lambda-CreateFunction-request-Timeout)
are different settings. The Lambda function timeout controls how long each individual
invocation can run (maximum 15 minutes). The execution timeout controls the total
elapsed time for the entire durable execution (maximum 1 year).
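
To make the distinction concrete, here is a minimal sketch that checks a pair of timeout values against the two limits quoted above. The helper `check_timeouts` is hypothetical and not part of any SDK, and treating "1 year" as 365 days is an assumption made only for this illustration.

```python
from datetime import timedelta

# Limits quoted in the text above. "1 year" is ASSUMED to mean 365 days here.
MAX_INVOCATION_TIMEOUT = timedelta(minutes=15)
MAX_EXECUTION_TIMEOUT = timedelta(days=365)


def check_timeouts(invocation_timeout: timedelta, execution_timeout: timedelta) -> list:
    """Return a list of problems with a pair of timeout settings (hypothetical helper)."""
    problems = []
    if invocation_timeout > MAX_INVOCATION_TIMEOUT:
        problems.append("invocation timeout exceeds 15 minutes")
    if execution_timeout > MAX_EXECUTION_TIMEOUT:
        problems.append("execution timeout exceeds 1 year")
    return problems


# A 5-minute invocation limit inside a 30-day execution is within both limits.
ok = check_timeouts(timedelta(minutes=5), timedelta(days=30))
# Both values here are over their respective limits.
bad = check_timeouts(timedelta(minutes=20), timedelta(days=400))
```

The two settings are checked independently because they bound different things: one invocation versus the whole execution.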

## Durable functions

A durable function is a Lambda function decorated with `@durable_execution` that can be checkpointed and resumed. The function receives a `DurableContext` that provides methods for durable operations.
A durable function is a Lambda function configured with the
[`DurableConfig`](https://docs.aws.amazon.com/lambda/latest/dg/durable-configuration.html)
object at creation time. Lambda will then apply the checkpoint and replay mechanism to
the function's execution to make it durable at invocation time.

## Operations
## DurableContext

`DurableContext` is the context object your durable function receives instead of the
standard Lambda `Context`. It exposes all durable operations and provides methods for
creating checkpoints, managing execution flow, and coordinating with external systems.

Operations are units of work in a durable execution. Each operation type serves a specific purpose:
Your durable function receives a `DurableContext` instead of the default Lambda context:

- **Steps** - Execute code and checkpoint the result with retry support
- **Waits** - Pause execution for a specified duration without blocking Lambda
- **Callbacks** - Wait for external systems to respond with results
- **Invoke** - Call other durable functions to compose complex workflows
- **Child contexts** - Isolate nested workflows for better organization
- **Parallel** - Execute multiple operations concurrently with completion criteria
- **Map** - Process collections in parallel with batching and failure tolerance
=== "TypeScript"

```typescript
--8<-- "examples/typescript/getting-started/durable-context.ts"
```

=== "Python"

```python
--8<-- "examples/python/getting-started/durable-context.py"
```

=== "Java"

```java
--8<-- "examples/java/getting-started/durable-context.java"
```

## Operations

Operations are units of work in a durable execution. Each operation type serves a
specific purpose:

- [Steps](../sdk-reference/operations/step.md) - Execute business logic with automatic
  checkpointing and configurable retry
- [Waits](../sdk-reference/operations/wait.md) - Suspend execution for a duration without
  consuming compute resources
- [Callbacks](../sdk-reference/operations/callback.md) - Suspend execution and wait for an
  external system to submit a result
- [Invoke](../sdk-reference/operations/invoke.md) - Invoke another Lambda function and
  checkpoint the result
- [Parallel](../sdk-reference/operations/parallel.md) - Execute multiple independent
  operations concurrently
- [Map](../sdk-reference/operations/map.md) - Execute an operation on each item in an
  array concurrently with optional concurrency control
- [Child context](../sdk-reference/operations/child-context.md) - Group operations into an
  isolated context for sub-workflow organization and concurrent determinism
- [Wait for condition](../sdk-reference/operations/wait-for-condition.md) - Poll for a
  condition with automatic checkpointing between attempts

## Checkpoints

Checkpoints are saved states of execution that allow resumption. When your function calls `context.step()` or other operations, the SDK creates a checkpoint and sends it to AWS. If Lambda recycles your environment or your function waits for an external event, execution can resume from the last checkpoint.
A checkpoint is a saved record of a completed durable operation: its type, name, inputs,
result, and timestamp. The SDK creates checkpoints automatically as your function
executes operations. Together, the checkpoints form a log that Lambda uses to resume
execution after a suspension or interruption.

When your code calls a durable operation, the SDK follows this sequence:

1. **Check for an existing checkpoint** - if this operation already completed in a
   previous invocation, the SDK returns the stored result without re-executing
2. **Execute the operation** - if no checkpoint exists, the SDK runs the operation code
3. **Serialize the result** - the SDK serializes the result for storage
4. **Persist the checkpoint** - the SDK calls the Lambda checkpoint API to durably store
   the result before continuing
5. **Return the result** - execution continues to the next operation

> **Contributor** (on step 1): Since this is a high-level intro to checkpoints, we don't
> need to mention that a checkpoint will not save a result of large size, correct?
>
> **Contributor Author:** Great call-out, thank you 😁 Yes, I think that belongs in the
> more detailed section for child context and step.

Once the SDK persists a checkpoint, that operation's result is safe. If your function is
interrupted at any point, the SDK can replay up to the last persisted checkpoint on the
next invocation.
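
The five-step sequence above can be sketched with a toy in-memory store. This is an illustration only: `ToyContext`, its `step` signature, the JSON serialization, and the dictionary standing in for the Lambda checkpoint API are all assumptions, not the SDK's implementation.

```python
import json
from typing import Any, Callable


class ToyContext:
    """Toy stand-in for a durable context; NOT the SDK's DurableContext."""

    def __init__(self, store: dict) -> None:
        self.store = store  # plays the role of the durable checkpoint log

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        # 1. Check for an existing checkpoint.
        if name in self.store:
            return json.loads(self.store[name])
        # 2. Execute the operation.
        result = func()
        # 3. Serialize the result.
        payload = json.dumps(result)
        # 4. Persist the checkpoint before continuing.
        self.store[name] = payload
        # 5. Return the result.
        return result


calls = []


def charge_card() -> dict:
    calls.append("charge")  # record the side effect so it can be counted
    return {"charged": True}


store = {}
first = ToyContext(store).step("charge", charge_card)
# A fresh context over the same store simulates a later invocation.
second = ToyContext(store).step("charge", charge_card)
```

Both calls return the same result, but `charge_card` runs only once because the second call finds the persisted checkpoint.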

## Replay

When your function resumes, completed operations don't re-execute. Instead, they return their checkpointed results instantly. This means your function code runs multiple times, but side effects only happen once per operation.
Lambda keeps a running log of all durable operations as your function executes. When
your function needs to pause or encounters an interruption, Lambda saves this checkpoint
log and stops the execution. When it's time to resume, Lambda invokes your function
again from the beginning and replays the checkpoint log:

1. **Load checkpoint log** - the SDK retrieves the checkpoint log for the execution from
   Lambda
2. **Run from beginning** - your handler runs from the start, not from where it paused
3. **Skip completed operations** - as your code calls durable operations, the SDK checks
   each against the checkpoint log and returns stored results without re-executing the
   operation code
4. **Resume at interruption point** - when the SDK reaches an operation without a
   checkpoint, it executes normally and creates new checkpoints from that point
   forward

The SDK enforces determinism by validating that operation names and types match the
checkpoint log during replay. Your orchestration code must make the same sequence of
durable operation calls on every invocation.
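
As a rough model of replay and the name/type validation, consider the sketch below. The class `ReplayContext`, the error type `NonDeterministicError`, and the log layout are hypothetical names chosen for this example; the real SDK's types and error messages may differ.

```python
from typing import Any, Callable


class NonDeterministicError(Exception):
    """Toy error raised when replayed code diverges from the checkpoint log."""


class ReplayContext:
    """Toy replay model; NOT the SDK's DurableContext."""

    def __init__(self, log: list) -> None:
        self.log = log      # checkpoint log loaded at the start of the invocation
        self.position = 0   # index of the next operation the handler will call

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        if self.position < len(self.log):
            # Replaying: the call must match the logged operation.
            entry = self.log[self.position]
            if entry["type"] != "step" or entry["name"] != name:
                raise NonDeterministicError(
                    f"expected {entry['type']} {entry['name']!r}, got step {name!r}"
                )
            self.position += 1
            return entry["result"]  # stored result, operation code is skipped
        # Past the end of the log: execute normally and checkpoint.
        result = func()
        self.log.append({"type": "step", "name": name, "result": result})
        self.position += 1
        return result


def handler(ctx: ReplayContext) -> int:
    value = ctx.step("fetch", lambda: 2)
    return ctx.step("double", lambda: value * 2)


def reordered_handler(ctx: ReplayContext) -> int:
    # Calls "double" first, which does not match the logged "fetch".
    return ctx.step("double", lambda: 0)


log = []
first_result = handler(ReplayContext(log))     # executes both steps
replayed_result = handler(ReplayContext(log))  # returns stored results only
```

Running `reordered_handler` against the same log raises `NonDeterministicError`, which is the kind of mismatch the validation is meant to catch.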

## Determinism

Because your code runs again on replay, it must be **deterministic** — avoid random values, timestamps, or external API calls outside of steps, as these can produce different values on replay.
Because your code runs again on replay, it must be **deterministic**: given the same
inputs and checkpoint log, your function must make the same sequence of durable
operation calls. Avoid non-deterministic operations (such as generating random numbers
or reading the current time) outside of steps, because they can produce different values
during replay and cause the function to diverge from the checkpoint log.
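
The difference between reading a changing value outside a step and inside one can be shown with a small simulation. Everything here is a stand-in: `ToyContext` is not the SDK's context, and `clock` is a counter that imitates a time source by returning a new value on every call.

```python
import itertools
from typing import Any, Callable

# Stand-in for a clock or random source: a new value on every call.
clock = itertools.count(1000)


class ToyContext:
    """Toy stand-in for a durable context; NOT the SDK's DurableContext."""

    def __init__(self, store: dict) -> None:
        self.store = store

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        # Run the function once, then reuse the checkpointed value.
        if name not in self.store:
            self.store[name] = func()
        return self.store[name]


def fragile_handler(ctx: ToyContext) -> int:
    # Read outside a step: the value changes on every replay.
    return next(clock)


def stable_handler(ctx: ToyContext) -> int:
    # Read inside a step: the first value is checkpointed and reused on replay.
    return ctx.step("read-clock", lambda: next(clock))


store = {}
# Each list holds the result of an initial run followed by one replay.
fragile_runs = [fragile_handler(ToyContext(store)) for _ in range(2)]
stable_runs = [stable_handler(ToyContext(store)) for _ in range(2)]
```

The two fragile runs disagree with each other, while the two stable runs agree because the replay reuses the checkpointed value.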

## How replay works in practice
### Rules for deterministic durable operations

1. All durable operations in a context must start sequentially.
2. To run durable operations concurrently, wrap each set of operations in its own child
context and then run the child contexts concurrently.
3. Only use the child `DurableContext` in the child context scope. Do not use any
parent's context in a child context scope.
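
One way to see why these rules help is to assume that each operation is identified by its context and its position within that context. That identification scheme is an assumption made for this sketch, not a documented SDK detail, and `ToyContext` is a hypothetical class.

```python
class ToyContext:
    """Assigns operation IDs from the context path and a local counter (ASSUMED scheme)."""

    def __init__(self, path: str = "root") -> None:
        self.path = path
        self.counter = 0

    def next_id(self) -> str:
        self.counter += 1
        return f"{self.path}/{self.counter}"

    def child(self) -> "ToyContext":
        # Creating a child consumes one ID in the parent context.
        return ToyContext(self.next_id())


def ids_with_shared_context(order: list) -> dict:
    """Branches call operations directly on one shared context, in the given order."""
    ctx = ToyContext()
    return {branch: ctx.next_id() for branch in order}


def ids_with_child_contexts(order: list) -> dict:
    """Children start sequentially in a fixed order, then their operations interleave."""
    ctx = ToyContext()
    children = {"A": ctx.child(), "B": ctx.child()}
    return {branch: children[branch].next_id() for branch in order}


# The same two branches, finishing in two different orders.
shared_ab = ids_with_shared_context(["A", "B"])
shared_ba = ids_with_shared_context(["B", "A"])
child_ab = ids_with_child_contexts(["A", "B"])
child_ba = ids_with_child_contexts(["B", "A"])
```

With a shared context, branch A's operation gets a different ID depending on which branch runs first, so a replay with different timing would not line up with the log. With one child context per branch, the IDs are the same for either interleaving.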

## Replay walkthrough

Let's trace through a simple workflow:

=== "TypeScript"

``` typescript
```typescript
--8<-- "examples/typescript/getting-started/execution-model.ts"
```

=== "Python"

``` python
```python
--8<-- "examples/python/getting-started/execution-model.py"
```

=== "Java"

``` java
```java
--8<-- "examples/java/getting-started/execution-model.java"
```

**First invocation (t=0s):**

1. Lambda invokes your function
2. `fetch_data` executes and calls an external API
3. Result is checkpointed to AWS
4. `context.wait(Duration.from_seconds(30))` is reached
5. Function returns, Lambda can recycle the environment
1. You start a durable execution by invoking a durable function
2. The durable functions service invokes your durable function handler
3. The fetch step runs and calls an external API
4. The SDK checkpoints the result of the fetch step
5. Execution reaches `context.wait()` and the SDK checkpoints the wait operation
6. The SDK terminates the current Lambda invocation, but the durable execution is still
active

**Second invocation (t=30s):**

1. Lambda invokes your function again
2. Function code runs from the beginning
3. `fetch_data` returns the checkpointed result instantly (no API call)
4. `context.wait()` is already complete, execution continues
5. `process_data` executes for the first time

## The two SDKs

### Execution SDK (`aws-durable-execution-sdk-python`)

Runs in your Lambda functions. Provides `DurableContext`, operations, decorators, and serialization. Install in your Lambda deployment package.

```console
pip install aws-durable-execution-sdk-python
```

### Testing SDK (`aws-durable-execution-sdk-python-testing`)

A separate SDK for testing your durable functions locally without AWS. Provides `DurableFunctionTestRunner`, pytest integration, and result inspection. Install in your development environment only.

```console
pip install aws-durable-execution-sdk-python-testing
```

## Decorators

The SDK provides decorators to mark functions as durable:

- `@durable_execution` - Marks your Lambda handler as a durable function
- `@durable_step` - Marks a function that can be used with `context.step()`
- `@durable_with_child_context` - Marks a function that receives a child context
1. The durable functions service invokes your function again
2. The function runs from the beginning
3. The fetch step returns its checkpointed result instantly; it does not re-execute the
API call
4. The wait has already elapsed, so execution continues
5. The process step runs for the first time
6. The SDK checkpoints the result of the process step
7. The function returns naturally and the invocation ends
8. The durable execution ends
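
The two invocations above can be simulated with a toy model. This is a sketch under stated assumptions: `ToyContext`, the `Suspend` exception, the `invoke` helper, and the simulated clock are all hypothetical and stand in for behavior that the SDK and the durable functions service provide.

```python
from typing import Any, Callable


class Suspend(Exception):
    """Toy signal that ends the current invocation while the execution stays active."""


class ToyContext:
    """Toy stand-in for a durable context; NOT the SDK's DurableContext."""

    def __init__(self, log: dict, now: float) -> None:
        self.log = log
        self.now = now  # simulated clock, in seconds

    def step(self, name: str, func: Callable[[], Any]) -> Any:
        if name not in self.log:
            self.log[name] = func()
        return self.log[name]

    def wait(self, name: str, seconds: float) -> None:
        # Checkpoint the wake-up time on first arrival, then suspend until it passes.
        resume_at = self.log.setdefault(name, self.now + seconds)
        if self.now < resume_at:
            raise Suspend()


api_calls = []


def fetch() -> list:
    api_calls.append("GET /data")  # count calls to the simulated external API
    return [1, 2, 3]


def handler(ctx: ToyContext) -> int:
    data = ctx.step("fetch", fetch)
    ctx.wait("pause", 30)
    return ctx.step("process", lambda: sum(data))


def invoke(log: dict, now: float) -> Any:
    """Play the role of the service invoking the handler at a given time."""
    try:
        return handler(ToyContext(log, now))
    except Suspend:
        return "SUSPENDED"


log = {}
first = invoke(log, now=0)    # runs fetch, checkpoints the wait, suspends
second = invoke(log, now=30)  # replays fetch and wait, runs process
```

The handler runs twice from the top, yet the simulated API is called once, which matches step 3 of the second invocation.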
4 changes: 4 additions & 0 deletions docs/index.md
@@ -14,6 +14,10 @@ Lambda. Your functions can pause execution, wait for external events, retry fail
operations, and resume exactly where they left off, even if Lambda recycles your
execution environment.

If you are new, start with [Getting Started](getting-started/).

For detailed programming language reference, see [SDK Reference](sdk-reference/).

## Key features

- **Automatic checkpointing** - Workflow state is saved automatically after each operation
8 changes: 8 additions & 0 deletions docs/patterns/index.md
@@ -0,0 +1,8 @@
# Patterns

Reusable patterns and guidance for common durable function use cases.

## In this section

- [Best Practices](best-practices.md) - Function design, timeout configuration, naming
  conventions, and common mistakes to avoid
39 changes: 39 additions & 0 deletions docs/sdk-reference/index.md
@@ -0,0 +1,39 @@
# SDK Reference

The SDK Reference covers everything you need to build, configure, and operate durable
functions.

## Operations

The core building blocks for constructing durable workflows:

- [Step](operations/step.md) - Execute and checkpoint a unit of work
- [Wait](operations/wait.md) - Pause execution for a duration
- [Wait for Condition](operations/wait-for-condition.md) - Pause until an external
  condition is met
- [Callback](operations/callback.md) - Resume execution via an external signal
- [Invoke](operations/invoke.md) - Invoke another durable function
- [Parallel](operations/parallel.md) - Execute multiple operations concurrently
- [Map](operations/map.md) - Apply an operation across a collection
- [Child Context](operations/child-context.md) - Scope a sub-workflow within a parent

## Error Handling

- [Errors](error-handling/errors.md) - Exception types and error response formats
- [Retries](error-handling/retries.md) - Configuring retry behavior for steps

## State

- [Serialization](state/serialization.md) - How state is serialized between checkpoints

## Observability

- [Logging](observability/logging.md) - Structured logging within durable functions

## Language Guides

Language-specific installation and configuration:

- [TypeScript](languages/typescript/index.md)
- [Python](languages/python/index.md)
- [Java](languages/java/index.md)
7 changes: 7 additions & 0 deletions docs/sdk-reference/languages/index.md
@@ -0,0 +1,7 @@
# Language Guides

Language-specific installation and setup for the AWS Durable Execution SDK.

- [TypeScript](typescript/index.md)
- [Python](python/index.md)
- [Java](java/index.md)
3 changes: 3 additions & 0 deletions docs/sdk-reference/languages/java/index.md
@@ -0,0 +1,3 @@
# Java SDK

Coming soon.
31 changes: 31 additions & 0 deletions docs/sdk-reference/languages/python/index.md
@@ -0,0 +1,31 @@
# Python SDK

## Execution SDK

The execution SDK (`aws-durable-execution-sdk-python`) runs in your Lambda functions. It
provides `DurableContext`, operations, and decorators. Install it in your Lambda
deployment package.

```console
pip install aws-durable-execution-sdk-python
```

## Testing SDK

The testing SDK (`aws-durable-execution-sdk-python-testing`) lets you test durable
functions locally without AWS. It provides `DurableFunctionTestRunner`, pytest
integration, and result inspection. Install it in your development environment only.

```console
pip install aws-durable-execution-sdk-python-testing
```

## Decorators

The SDK provides decorators to mark functions as durable:

- `@durable_execution` - Marks your Lambda handler as a durable function
- `@durable_step` - Marks a function that can be used with `context.step()`
- `@durable_with_child_context` - Marks a function that receives a child context

The Python SDK uses synchronous methods and does not support `await`.