2 changes: 2 additions & 0 deletions .gitignore
@@ -2,6 +2,8 @@
Cargo.lock
.tgops.toml
.tgops
.idea
notes/
docs/book/
data/
gha-creds-*.json
99 changes: 99 additions & 0 deletions ADR/ADR-001-relationship-state-machine.md
@@ -0,0 +1,99 @@
# ADR 001: Relationship State Machine

## Status
Proposed

## Context
The current TSP SDK implementation lacks a formal state machine for managing relationship lifecycles. This leads to several issues:
1. **Undefined States**: The `ReverseUnidirectional` status is defined but rarely used, leading to ambiguity when a node receives a relationship request.
2. **Concurrency Issues**: If two nodes request a relationship with each other simultaneously, both end up in a `Unidirectional` state, with no clear resolution path.
3. **No Timeouts**: There is no mechanism to handle lost messages or unresponsive peers during the handshake process.
4. **Idempotency**: Duplicate control messages are not handled consistently.

## Decision
We will implement a formal `RelationshipMachine` to govern state transitions.

### 1. State Machine Definition

The state machine will transition based on `RelationshipEvent`s.

| Current State | Event | New State | Action/Notes |
| :--- | :--- | :--- | :--- |
| `Unrelated` | `SendRequest` | `Unidirectional` | Store `thread_id` |
| `Unrelated` | `ReceiveRequest` | `ReverseUnidirectional` | Store `thread_id` |
| `Unidirectional` | `ReceiveAccept` | `Bidirectional` | Verify `thread_id` matches. |
| `ReverseUnidirectional` | `SendAccept` | `Bidirectional` | Verify `thread_id` matches. |
| `Bidirectional` | `SendCancel` | `Unrelated` | |
| `Bidirectional` | `ReceiveCancel` | `Unrelated` | |
| `Unidirectional` | `SendRequest` | `Unidirectional` | Idempotent (retransmission) |
| `Unidirectional` | `ReceiveRequest` | *Conflict Resolution* | See Concurrency Handling |
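
The table above can be sketched as a pure transition function. This is a minimal illustration, not the SDK's implementation: the state and event names come from the table, while `Outcome` and `transition` are hypothetical names introduced here, and the `thread_id` checks from the Action/Notes column are left out for brevity.

```rust
/// Relationship states named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Unrelated,
    Unidirectional,
    ReverseUnidirectional,
    Bidirectional,
}

/// Events named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    SendRequest,
    ReceiveRequest,
    SendAccept,
    ReceiveAccept,
    SendCancel,
    ReceiveCancel,
}

/// Result of feeding one event to the machine (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// Move to (or stay in) the given state.
    Next(State),
    /// Simultaneous requests: defer to the concurrency-handling rule.
    Conflict,
    /// The table has no row for this (state, event) pair.
    Invalid,
}

/// One match arm per table row; `thread_id` verification is omitted.
pub fn transition(state: State, event: Event) -> Outcome {
    match (state, event) {
        (State::Unrelated, Event::SendRequest) => Outcome::Next(State::Unidirectional),
        (State::Unrelated, Event::ReceiveRequest) => Outcome::Next(State::ReverseUnidirectional),
        (State::Unidirectional, Event::ReceiveAccept) => Outcome::Next(State::Bidirectional),
        (State::ReverseUnidirectional, Event::SendAccept) => Outcome::Next(State::Bidirectional),
        (State::Bidirectional, Event::SendCancel) => Outcome::Next(State::Unrelated),
        (State::Bidirectional, Event::ReceiveCancel) => Outcome::Next(State::Unrelated),
        // Retransmission of our own request is idempotent.
        (State::Unidirectional, Event::SendRequest) => Outcome::Next(State::Unidirectional),
        (State::Unidirectional, Event::ReceiveRequest) => Outcome::Conflict,
        _ => Outcome::Invalid,
    }
}
```

Keeping the function free of side effects means every row can be unit-tested without a store or a transport.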

### 2. Concurrency Handling
When a node in `Unidirectional` state (sent a request) receives a `RequestRelationship` from the target (meaning they also sent a request):
- **Compare `thread_id`s**: The request with the *lower* `thread_id` (lexicographically) wins.
- **If my `thread_id` < their `thread_id`**: My request wins. I ignore (or reject) their request, stay in `Unidirectional`, and expect them to accept my request.
- **If my `thread_id` > their `thread_id`**: Their request wins. I cancel my pending request state and transition to `ReverseUnidirectional` (effectively adopting their flow), from which `SendAccept` completes the handshake.
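
The tie-break above could look like the following sketch. It assumes `thread_id` is available as a byte slice (the ADR does not fix its type), and `Resolution` and `resolve_conflict` are hypothetical names. The ADR does not say what happens when both identifiers are equal, so that case is surfaced separately rather than decided here.

```rust
use std::cmp::Ordering;

/// Outcome of the simultaneous-request tie-break (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Our `thread_id` is lower: stay in `Unidirectional` and wait for their accept.
    KeepMine,
    /// Their `thread_id` is lower: drop our request and move to `ReverseUnidirectional`.
    YieldToTheirs,
    /// Identical identifiers: not covered by the ADR.
    Tie,
}

/// Byte slices compare lexicographically, which is the ordering the rule asks for.
pub fn resolve_conflict(mine: &[u8], theirs: &[u8]) -> Resolution {
    match mine.cmp(theirs) {
        Ordering::Less => Resolution::KeepMine,
        Ordering::Greater => Resolution::YieldToTheirs,
        Ordering::Equal => Resolution::Tie,
    }
}
```

Because both peers evaluate the same comparison with the arguments swapped, they reach complementary decisions without any extra round trip.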

### 3. Timeout & Retry
- **Timeout**: A `request_timeout` field will be added to `VidContext`. If a `Unidirectional` state persists beyond the timeout (e.g., 60s), it transitions back to `Unrelated`.
- **Retry**: Before timing out, the system may attempt retransmissions.
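
As a rough illustration of the timeout rule, the check below treats the timeout as a `Duration` measured from the moment the request was sent. This is an assumption: the ADR only names the `request_timeout` field and does not say whether it holds a duration or a deadline. `is_expired` is a hypothetical helper.

```rust
use std::time::{Duration, Instant};

/// Returns true once a request sent at `sent_at` has been pending for longer
/// than `request_timeout` as of `now`.
pub fn is_expired(sent_at: Instant, request_timeout: Duration, now: Instant) -> bool {
    // `saturating_duration_since` yields zero if `now` is earlier than `sent_at`.
    now.saturating_duration_since(sent_at) > request_timeout
}
```

Passing `now` in as a parameter keeps the check deterministic in tests.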

### 4. Idempotency
- **Duplicate Request**: If the node is in `ReverseUnidirectional` or `Bidirectional` and receives the same `RequestRelationship` again (same `thread_id`), it ignores the message or resends its previous response.
- **Duplicate Accept**: If the node is in `Bidirectional` and receives an `AcceptRelationship` with the same `thread_id`, it ignores the message.
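
The two duplicate rules can be expressed as predicates over the current state and the stored `thread_id`. This is a sketch only: the function names are hypothetical, `thread_id` is assumed to be a byte slice, and the choice between ignoring and resending the previous response is left to the caller.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Unrelated,
    Unidirectional,
    ReverseUnidirectional,
    Bidirectional,
}

/// A request is a duplicate if we already hold (or have accepted) a request
/// with the same `thread_id`.
pub fn is_duplicate_request(state: State, stored: Option<&[u8]>, incoming: &[u8]) -> bool {
    matches!(state, State::ReverseUnidirectional | State::Bidirectional)
        && stored == Some(incoming)
}

/// An accept is a duplicate if the relationship is already `Bidirectional`
/// under the same `thread_id`.
pub fn is_duplicate_accept(state: State, stored: Option<&[u8]>, incoming: &[u8]) -> bool {
    state == State::Bidirectional && stored == Some(incoming)
}
```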

## Consequences
- **Robustness**: Relationship establishment will be more reliable under network jitter and concurrent requests.
- **Complexity**: The `store.rs` logic will become more complex.
- **Breaking Changes**: Existing tests that manually manipulate state might fail and need updating to respect the state machine.

91 changes: 91 additions & 0 deletions ADR/ADR-002-retry-mechanism.md
@@ -0,0 +1,91 @@
# ADR 002: Automatic Retry Mechanism

## Status
Proposed

## Context
Network unreliability can cause relationship requests or other control messages to be lost. Currently, if a request times out, the relationship status resets to `Unrelated`, requiring manual intervention to restart the handshake. We need an automatic retry mechanism to improve robustness.

## Decision
We will implement an **Exponential Backoff** retry strategy for relationship requests.

### 1. Retry Policy
We will introduce a `RetryPolicy` struct:
- `max_retries`: Maximum number of retry attempts (e.g., 3).
- `initial_delay`: Duration before first retry (e.g., 500ms).
- `multiplier`: Factor to increase delay (e.g., 1.5x).
- `max_delay`: Cap on the delay (e.g., 5s).
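
A possible shape for the policy is sketched below. The field names follow the list above; the field types and the `delay_for` helper are assumptions. With the example values, the delays work out to 500 ms, 750 ms and 1.125 s for the first three retries, and `max_delay` caps anything longer.

```rust
use std::time::Duration;

#[derive(Debug, Clone)]
pub struct RetryPolicy {
    pub max_retries: u32,
    pub initial_delay: Duration,
    pub multiplier: f64,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Delay before retry number `retry` (0-based):
    /// `initial_delay * multiplier^retry`, capped at `max_delay`.
    /// Assumes `multiplier >= 1.0`.
    pub fn delay_for(&self, retry: u32) -> Duration {
        let exponent = retry.min(i32::MAX as u32) as i32;
        let secs = self.initial_delay.as_secs_f64() * self.multiplier.powi(exponent);
        Duration::from_secs_f64(secs.min(self.max_delay.as_secs_f64()))
    }
}
```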

### 2. State Persistence (`PendingRequest`)
The `PendingRequest` struct in `store.rs` will be expanded to store the necessary data for retransmission:
- `message`: The exact `Vec<u8>` TSP message (ciphertext) generated during the initial request. This ensures cryptographic consistency (same `thread_id`).
- `retry_count`: Number of retries attempted so far.
- `last_attempt`: Timestamp of the last attempt.

### 3. Timeout Handling (`check_timeouts`)
The `check_timeouts` method will be updated to:
1. Identify expired requests.
2. Check if `retry_count < max_retries`.
3. **If Retry**:
   - Calculate next timeout using the backoff policy.
   - Update `request_timeout`.
   - Increment `retry_count`.
   - Return the `message` and `endpoint` to the caller for transmission.
4. **If Exhausted**:
   - Transition state to `Unrelated`.
   - Log failure.

### 4. Transport Interface
`check_timeouts` will return `Result<Vec<(Url, Vec<u8>)>, Error>`. The caller (e.g., `AsyncSecureStore` loop or main application) is responsible for actually sending these messages over the network.
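
The timeout pass described in sections 3 and 4 might be sketched as follows. Several simplifications are assumptions made to keep the example self-contained: `String` stands in for `Url`, `request_timeout` is stored on `PendingRequest` as an `Instant` deadline, the `Result` wrapper is dropped, and exhausted requests are simply removed where the real code would transition to `Unrelated` and log.

```rust
use std::time::{Duration, Instant};

pub struct RetryPolicy {
    pub max_retries: u32,
    pub initial_delay: Duration,
    pub multiplier: f64,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// `initial_delay * multiplier^retry`, capped at `max_delay`.
    pub fn delay_for(&self, retry: u32) -> Duration {
        let exponent = retry.min(i32::MAX as u32) as i32;
        let secs = self.initial_delay.as_secs_f64() * self.multiplier.powi(exponent);
        Duration::from_secs_f64(secs.min(self.max_delay.as_secs_f64()))
    }
}

pub struct PendingRequest {
    pub endpoint: String, // stand-in for `Url`
    pub message: Vec<u8>, // exact bytes of the original request
    pub retry_count: u32,
    pub last_attempt: Instant,
    pub request_timeout: Instant, // deadline of the current attempt (assumed)
}

/// Returns the (endpoint, message) pairs the caller should retransmit.
pub fn check_timeouts(
    pending: &mut Vec<PendingRequest>,
    policy: &RetryPolicy,
    now: Instant,
) -> Vec<(String, Vec<u8>)> {
    let mut to_send = Vec::new();
    pending.retain_mut(|req| {
        if now < req.request_timeout {
            return true; // not expired yet
        }
        if req.retry_count < policy.max_retries {
            // Schedule the next deadline, then hand the cached bytes back.
            req.request_timeout = now + policy.delay_for(req.retry_count);
            req.retry_count += 1;
            req.last_attempt = now;
            to_send.push((req.endpoint.clone(), req.message.clone()));
            true
        } else {
            false // exhausted: real code would reset to `Unrelated` and log
        }
    });
    to_send
}
```

Returning the messages instead of sending them keeps the store free of I/O, which matches the division of responsibility described above.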

## Consequences
- **Reliability**: Temporary network glitches will no longer abort the handshake.
- **Traffic**: Retries increase network traffic, but backoff mitigates storms.
- **Storage**: `VidContext` size increases slightly to store the cached message.

79 changes: 79 additions & 0 deletions ADR/ADR-003-offline-message-queue.md
@@ -0,0 +1,79 @@
# ADR 003: Offline Message Queue

## Status
Proposed

## Context
When sending TSP messages, the transport layer (e.g., TCP, HTTP) may be unavailable, or the recipient may be offline. Currently, if a send fails, the message is lost unless the application manually handles it. We need a mechanism to queue these messages and attempt to resend them later.

## Decision
We will implement an in-memory **Offline Message Queue** within the `SecureStore`.

### 1. `MessageQueue` Structure
We will create a new module `queue.rs` with a `MessageQueue` struct.
- **Storage**: `VecDeque<QueuedMessage>`
- **`QueuedMessage`**:
  - `message`: `Vec<u8>` (the sealed TSP message)
  - `url`: `Url` (the destination)
  - `priority`: `u8` (optional, for future use)
  - `created_at`: `Instant`
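
The structure above might look like this in `queue.rs`. It is a sketch under a few assumptions: `String` stands in for `Url` so the example builds without external crates, `priority` is always set to 0 for now, and the method names `push`, `pop`, `len` and `is_empty` are illustrative.

```rust
use std::collections::VecDeque;
use std::time::Instant;

#[derive(Debug, Clone)]
pub struct QueuedMessage {
    pub message: Vec<u8>, // the sealed TSP message
    pub url: String,      // stand-in for `Url`
    pub priority: u8,     // unused for now
    pub created_at: Instant,
}

#[derive(Debug, Default)]
pub struct MessageQueue {
    items: VecDeque<QueuedMessage>,
}

impl MessageQueue {
    /// Appends a message at the back of the queue.
    pub fn push(&mut self, url: String, message: Vec<u8>) {
        self.items.push_back(QueuedMessage {
            message,
            url,
            priority: 0,
            created_at: Instant::now(),
        });
    }

    /// Removes and returns the oldest message, if any.
    pub fn pop(&mut self) -> Option<QueuedMessage> {
        self.items.pop_front()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }
}
```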

### 2. Integration with `SecureStore`
- `SecureStore` will hold an `Arc<RwLock<MessageQueue>>`.
- **Enqueue**: When a message cannot be sent (e.g., transport error), the application (or `AsyncSecureStore`) can call `store.queue_message(url, message)`.
- **Dequeue/Flush**: A method `store.process_queue()` (or similar) will be available to retrieve messages for attempting to resend.
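
To illustrate the sharing model, the sketch below wraps a queue in `Arc<RwLock<...>>` so that clones of the store see the same messages. `SecureStoreSketch` is a simplified stand-in and not the SDK's `SecureStore`. The `queue_message(url, message)` signature follows the text, while the return type of `process_queue`, the tuple-based queue entries, and `String` in place of `Url` are assumptions.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, RwLock};

/// (destination, sealed message); `String` stands in for `Url`.
type Queued = (String, Vec<u8>);

#[derive(Clone, Default)]
pub struct SecureStoreSketch {
    queue: Arc<RwLock<VecDeque<Queued>>>,
}

impl SecureStoreSketch {
    /// Enqueue a message that could not be sent.
    pub fn queue_message(&self, url: String, message: Vec<u8>) {
        let mut guard = self.queue.write().expect("queue lock poisoned");
        guard.push_back((url, message));
    }

    /// Drain every queued message in FIFO order for a resend attempt.
    pub fn process_queue(&self) -> Vec<Queued> {
        let mut guard = self.queue.write().expect("queue lock poisoned");
        let drained: Vec<Queued> = guard.drain(..).collect();
        drained
    }
}
```

Draining under a single write lock hands the caller a snapshot, so no lock is held while the network sends are in flight.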

### 3. Integration with `AsyncSecureStore`
- `AsyncSecureStore` is the active component that handles sending.
- It will check the queue periodically or upon reconnection events.
- When the queue is not empty, it will attempt to send the messages.
- If the send succeeds, the message is removed. If it fails, the message remains in the queue (or is moved to the back with a backoff, reusing the retry logic of Feature 2 (ADR 002) where applicable, although that logic is specific to relationship requests).
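
One flush pass could be sketched as below. The real `AsyncSecureStore` is asynchronous. This synchronous version, the `flush_queue` name, the `send` callback, and the use of `String` for `Url` are all assumptions for illustration. It takes the "move to the back" option for failed sends and leaves backoff out.

```rust
use std::collections::VecDeque;

/// Tries each currently queued message exactly once.
/// Returns the number of messages that were sent successfully.
pub fn flush_queue<F>(queue: &mut VecDeque<(String, Vec<u8>)>, mut send: F) -> usize
where
    F: FnMut(&str, &[u8]) -> bool,
{
    let mut sent = 0;
    // Bound the loop by the initial length so re-queued failures
    // are not retried within the same pass.
    for _ in 0..queue.len() {
        let Some((url, message)) = queue.pop_front() else { break };
        if send(&url, &message) {
            sent += 1;
        } else {
            queue.push_back((url, message));
        }
    }
    sent
}
```

Since every message is visited once per pass, failed messages keep their relative order after the pass completes.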

### 4. Persistence
For this iteration, the queue is **in-memory only**. If the application restarts, queued messages are lost. Persistence (to disk/DB) is out of scope for now but the design should allow for it later (e.g., by serializing `MessageQueue`).

## Consequences
- **Reliability**: Messages are not lost during temporary network outages.
- **Memory Usage**: Queued messages consume memory. We may need a cap on queue size.
- **Ordering**: `VecDeque` preserves FIFO order, which is generally desired.
