103 changes: 103 additions & 0 deletions dev/epaxos_exec.md
@@ -0,0 +1,103 @@
# instance

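- Each `instance` must record, for every `Replica`, the `instance` on that `Replica` that it depends on

- For example, with three `Replica`s, each `instance` defines an array `Deps[3]`

- `Dep[0]` is the `instance` it depends on at `Replica 0`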
Member

Shouldn't `Deps` and `Dep` be the same name?

Author

Yes, I left out an "s".


- `Dep[1]` is the `instance` it depends on at `Replica 1`

- `Dep[2]` is the `instance` it depends on at `Replica 2`
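
A minimal Go sketch of the per-replica dependency array described above. The type `Instance`, the constant `NumReplicas`, and the sentinel `NoDep` are hypothetical names chosen for illustration; the real structure in the source code may carry more fields and may be laid out differently.

```go
package main

import "fmt"

// NumReplicas is the cluster size from the example above (assumed fixed here).
const NumReplicas = 3

// NoDep is a hypothetical sentinel meaning "no dependency on this replica".
const NoDep = -1

// Instance is a simplified, hypothetical record of one instance.
type Instance struct {
	Replica int              // replica that owns this instance
	ID      int              // instance number on that replica
	Deps    [NumReplicas]int // Deps[r] is the instance number depended on at replica r
}

func main() {
	inst := Instance{Replica: 0, ID: 5, Deps: [NumReplicas]int{4, 7, NoDep}}
	for r, dep := range inst.Deps {
		if dep == NoDep {
			fmt.Printf("no dependency on Replica %d\n", r)
			continue
		}
		fmt.Printf("depends on instance %d of Replica %d\n", dep, r)
	}
}
```

Using a fixed-size array keeps one slot per replica, so looking up the dependency on a given replica is a plain index operation.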


# Execution process

For example, the following is a dependency graph of instances

```
instance1---------->instance3---------->instance5
    ^                   |                   |
    |                   |                   |
    |                   |                   |
    |                   |                   |
    |                   V                   V
instance2<----------instance4---------->instance6
```

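Execution is essentially `Tarjan`'s algorithm for finding the strongly connected components of a directed graph

- `Tarjan` uses two arrays: `DFN[u]` is the order number in which node u was visited by the search, and `LOW[u]` is the order number of the earliest node still on the stack that can be reached from u or from u's subtree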
Member

What does DFN stand for? Please explain it in the doc so it is easier to understand.

Author

OK.


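- Start the DFS from instance1. Suppose the visit order is `instance1->instance3->instance5->instance6`; each visited instance is pushed onto the stack. In the source code every `instance` has two fields, `Index` (0 means the node has not been visited yet) and `LowLink`, which correspond to `DFN` and `LOW` below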

```
-----------
|instance6|->DFN:4 LOW:4
-----------
|instance5|->DFN:3 LOW:3
-----------
|instance3|->DFN:2 LOW:2
-----------
|instance1|->DFN:1 LOW:1
-----------
```

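- On reaching `instance6` there are no more edges to follow. When it is popped, `DFN:4==LOW:4`, so `instance6` is a strongly connected component by itself, and `instance6` is executed at this point

- Likewise, `instance5` is a strongly connected component by itself; once `instance6` has been executed, `instance5` is executed. The stack now looks like this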

```
-----------
|instance3|->DFN:2 LOW:2
-----------
|instance1|->DFN:1 LOW:1
-----------
```

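- Back at `instance3`, the search continues to `instance4`, which is pushed onto the stack

- The search continues to `instance2`. `instance2` has an edge to `instance1`, and `instance1` is still on the stack, so the `LOW` of `instance2` takes the `LOW` of `instance1`, which is 1. The stack now looks like this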

```
-----------
|instance2|->DFN:6 LOW:1
-----------
|instance4|->DFN:5 LOW:5
-----------
|instance3|->DFN:2 LOW:2
-----------
|instance1|->DFN:1 LOW:1
-----------
```

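- At this point every `instance` has been searched and the search starts to backtrack. Backtracking updates each `LOW` to the smallest value seen, as follows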

```
-----------
|instance2|->DFN:6 LOW:1
-----------
|instance4|->DFN:5 LOW:1
-----------
|instance3|->DFN:2 LOW:1
-----------
|instance1|->DFN:1 LOW:1
-----------
```

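- Find a node with `DFN[idx]=LOW[idx]`, which is the root of a strongly connected component; here that is `instance1`. The elements on the stack from this root up to the top form one strongly connected component, which here is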
Member

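One small suggestion: docs, like code, should above all be easy to read. Replacing every instance i with ins i, or simply using the number i for each instance, would read a bit better.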

`instance1 instance2 instance3 instance4`
Member

I don't quite understand why the order here is 1 2 3 4?

Author

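It means that 1 2 3 4 form one strongly connected component, i.e. they depend on each other. It is not an order. After the component is found, its instances still have to be sorted by Seq, and the sorted order is the execution order.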

Member

Wenbo's point is this: the result does not have to be 1234; the text just assumes a traversal that happened to produce 1234.

Author

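I see. The DFS could also start from 2, or 3 could visit 4 first. Either is fine: the resulting strongly connected component is still 1234, although the order may differ.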


- Each `instance` has a `seq`. The `instance`s inside a strongly connected component are sorted by it and executed in that order
Member

seq is not part of the strongly connected component algorithm; split it into its own section here.
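
The walkthrough above can be sketched in Go roughly as follows. This is an illustrative sketch and not the code under review: `Instance`, `Executor`, `strongConnect`, and `buildExample` are hypothetical names, the `Seq` values are made up, and ties on `Seq` are broken by `ID` as an assumption. For a neighbour that is already on the stack, the sketch uses the textbook update (take the neighbour's `Index`), which yields the same value 1 as in the walkthrough.

```go
package main

import (
	"fmt"
	"sort"
)

// Instance is a simplified, hypothetical node. Index and LowLink play the
// roles of DFN and LOW; Index == 0 means "not visited yet".
type Instance struct {
	ID      int
	Seq     int
	Deps    []*Instance
	Index   int
	LowLink int
	onStack bool
}

// Executor holds the DFS counter, the stack, and the resulting execution order.
type Executor struct {
	counter  int
	stack    []*Instance
	Executed []int
}

func (e *Executor) strongConnect(v *Instance) {
	e.counter++
	v.Index, v.LowLink = e.counter, e.counter
	e.stack = append(e.stack, v)
	v.onStack = true

	for _, w := range v.Deps {
		if w.Index == 0 {
			// Unvisited dependency: recurse, then propagate its LowLink upward.
			e.strongConnect(w)
			if w.LowLink < v.LowLink {
				v.LowLink = w.LowLink
			}
		} else if w.onStack && w.Index < v.LowLink {
			// Dependency still on the stack: it belongs to the current component.
			v.LowLink = w.Index
		}
	}

	if v.LowLink != v.Index {
		return // v is not the root of a component
	}
	// v is a root: pop the whole component off the stack.
	var scc []*Instance
	for {
		w := e.stack[len(e.stack)-1]
		e.stack = e.stack[:len(e.stack)-1]
		w.onStack = false
		scc = append(scc, w)
		if w == v {
			break
		}
	}
	// Order the component by Seq (ties broken by ID, an assumption).
	sort.Slice(scc, func(i, j int) bool {
		if scc[i].Seq != scc[j].Seq {
			return scc[i].Seq < scc[j].Seq
		}
		return scc[i].ID < scc[j].ID
	})
	for _, w := range scc {
		e.Executed = append(e.Executed, w.ID) // stands in for executing w
	}
}

// buildExample builds the six-instance graph from the walkthrough.
// Index 0 of the returned slice is unused so that n[i] is instance i.
func buildExample() []*Instance {
	n := make([]*Instance, 7)
	for i := 1; i <= 6; i++ {
		n[i] = &Instance{ID: i, Seq: i}
	}
	n[1].Deps = []*Instance{n[3]}
	n[2].Deps = []*Instance{n[1]}
	n[3].Deps = []*Instance{n[5], n[4]}
	n[4].Deps = []*Instance{n[6], n[2]}
	n[5].Deps = []*Instance{n[6]}
	return n
}

func main() {
	n := buildExample()
	e := &Executor{}
	e.strongConnect(n[1])
	fmt.Println(e.Executed) // prints [6 5 1 2 3 4]
}
```

With these made-up `Seq` values, `instance6` and `instance5` are executed first as single-node components, and the component `{1, 2, 3, 4}` is executed afterwards in `Seq` order.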


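# Other implementation details

- For example, given `instance1(Replica1)---->instance10(Replica2)`: each `Replica` records the largest `instance id` it has executed, so executing `instance10` requires executing every other `instance` whose id is less than or equal to this `instance id`. Put another way, `instance1(Replica1)` depends on all `instance`s on `Replica2` with `instanceid <= 10`

- The `execution` thread sets a timeout on the smallest `active instance` on the current `Replica` that is not yet `committed`. If that `instance` has not reached the `committed` state when the timeout expires, the recovery procedure for the `instance` is triggered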