This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@duyanghao commented Mar 14, 2018

Signed-off-by: duyanghao [email protected]

What changes were proposed in this pull request?

Add recovery logic for failed pod and fix MEM_EXCEEDED_EXIT_CODE constant.

How was this patch tested?

Manual tests show successful recovery of a failed pod, as below:

  1. make one executor pod fail (it fails to register itself with the driver)
  2. the driver discovers the failed pod
  3. the driver allocates a new executor pod to replace it
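The recovery loop described above can be sketched as a simple reconciliation: count the executor pods that are not in a failed state and request enough replacements to reach the `spark.executor.instances` target. This is a hypothetical simplification in Python for illustration only; the actual change is in the Spark on Kubernetes scheduler backend (Scala), and the `Pod`/`replacements_needed` names here are not real Spark identifiers.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    phase: str  # Kubernetes pod phases: "Running", "Succeeded", "Failed", ...

def replacements_needed(pods, target_instances):
    """Return how many new executor pods the driver should allocate.

    Pods in the "Failed" phase no longer count toward the target, so the
    driver requests one replacement per failed executor.
    """
    healthy = sum(1 for p in pods if p.phase != "Failed")
    return max(0, target_instances - healthy)

# Mirrors the test scenario: 5 requested executors, 3 of them failed.
pods = [
    Pod("exec-1", "Succeeded"),
    Pod("exec-2", "Succeeded"),
    Pod("exec-3", "Failed"),
    Pod("exec-4", "Failed"),
    Pod("exec-5", "Failed"),
]
print(replacements_needed(pods, 5))  # 3 replacements (exec-6, exec-7, exec-8)
```

In the manual test below this corresponds to the driver creating `exec-6` through `exec-8` after `exec-3` through `exec-5` errored out.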

spark.executor.instances=5

# kubectl get pods -n=xxx -a -o wide|grep spark-debug-sar-test8
NAME                            READY     STATUS        RESTARTS   AGE       IP               NODE
spark-debug-sar-test8           1/1       Completed     0          3m        192.168.25.92    x.x.x.x
spark-debug-sar-test8-exec-1    1/1       Completed     0          3m        192.168.25.94    x.x.x.x
spark-debug-sar-test8-exec-2    1/1       Completed     0          3m        192.168.25.93    x.x.x.x
spark-debug-sar-test8-exec-3    0/1       Error         0          3m        192.168.11.31    x.x.x.x
spark-debug-sar-test8-exec-4    0/1       Error         0          3m        192.168.11.37    x.x.x.x
spark-debug-sar-test8-exec-5    0/1       Error         0          3m        192.168.11.44    x.x.x.x
spark-debug-sar-test8-exec-6    1/1       Completed     0          48s       192.168.25.99    x.x.x.x
spark-debug-sar-test8-exec-7    1/1       Completed     0          48s       192.168.25.95    x.x.x.x
spark-debug-sar-test8-exec-8    1/1       Completed     0          48s       192.168.25.97    x.x.x.x

