Coordinated Restore at Checkpoint experiment #1033

@velo velo commented Dec 20, 2024

Initial experimentation with Coordinated Restore at Checkpoint (CRaC)!

I was able to get a stable process that runs the sqrl compiler and then waits for JDK.checkpoint.

How to use this branch

  1. Build
$ mvn clean install -DskipTests=true
$ cd sqrl-tools/
$ docker build -t datasqrl/cmd:crac .
  2. Start the compiler with extra flags
  • JVM_OPTS="-XX:CRaCCheckpointTo=/build/crac"
  • --wait-for-crac-event
$ docker rm -f cloud-backend ; docker run --name cloud-backend -p 8888:8888 -p 8081:8081 -p 9092:9092 --rm -v $PWD:/build -e JVM_OPTS="-XX:CRaCCheckpointTo=/build/crac" -e KAFKA_EVENTS_BOOTSTRAP_SERVER="localhost:9092" -e KAFKA_EVENTS_CONSUMER_GROUP=cloud-backend1 -d datasqrl/cmd:crac --wait-for-crac-event compile -c package-dynamic-kafka.json ; docker logs cloud-backend -f
Error response from daemon: No such container: cloud-backend
485109dbf545314d5690855263e941a507003464201131f25d99258df1bfc3e2
Compiling...this takes about 10 seconds
Enabling resource adapter for Project CRaC
...
regular compiler output
...
Awaiting crac
  3. AFTER the "Awaiting crac" message, collect a checkpoint
$ docker exec -ti cloud-backend jcmd sqrl-cli JDK.checkpoint
7:
An exception during a checkpoint operation:
jdk.internal.crac.mirror.CheckpointException
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=10 type=directory path=/build/build/deployment-observability
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=11 type=directory path=/build/build/deployment-observability
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=12 type=directory path=/build/build/cloud-data
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=13 type=directory path=/build/build/cloud-data
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=14 type=directory path=/build/build/sink
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
        Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=15 type=directory path=/build/build/sink
                at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)

Right now, the checkpoint fails because some file handles are still open: the directories under /build/build listed in the exception above.
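
To see which descriptors the JVM holds just before the checkpoint, one option is to read /proc/self/fd from inside the process. The sketch below is only an illustration under assumptions: `OpenFdLister` and `listOpenFds` are hypothetical names, not part of this branch, and it only works on Linux with /proc mounted. Logging its output at different points of the compile may help narrow down which step leaves a directory open.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of this PR: lists the file descriptors
// currently held by this JVM by reading /proc/self/fd (Linux only).
public class OpenFdLister {

    /** Returns fd number -> link target for each descriptor of the current process. */
    public static Map<Integer, String> listOpenFds() throws IOException {
        Map<Integer, String> result = new TreeMap<>();
        Path fdDir = Paths.get("/proc/self/fd");
        if (!Files.isDirectory(fdDir)) {
            return result; // not Linux, or /proc is not mounted
        }
        // The listing itself holds one descriptor on /proc/<pid>/fd while it
        // runs, so that entry shows up in the result and can be ignored.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(fdDir)) {
            for (Path entry : entries) {
                try {
                    int fd = Integer.parseInt(entry.getFileName().toString());
                    result.put(fd, Files.readSymbolicLink(entry).toString());
                } catch (IOException | NumberFormatException e) {
                    // The descriptor was closed while iterating; skip it.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        listOpenFds().forEach((fd, target) ->
                System.out.println("fd=" + fd + " -> " + target));
    }
}
```

The descriptors reported in the exception all have type=directory. In Java, streams returned by `Files.list`, `Files.walk`, or `Files.newDirectoryStream` keep a directory handle open until they are closed, so calls that are not wrapped in try-with-resources are a plausible place to look first; this is a guess, not something confirmed on this branch.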

We need to close all file handles before the checkpoint. We can use CracResourceAdapter#beforeCheckpoint to close them, but I have not found a way to identify what is opening them.
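
Once the code holding the handles is known, one possible shape for the cleanup is a small registry that remembers every resource it is handed and closes them all on demand. This is a minimal sketch, not the implementation on this branch: `CheckpointCloseables`, `track`, and `closeAll` are hypothetical names, and the assumption is that `closeAll()` would be called from `CracResourceAdapter#beforeCheckpoint`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical registry, not part of this PR: remembers resources that must
// not be open at checkpoint time and closes them in reverse order of opening.
public class CheckpointCloseables {

    private final Deque<AutoCloseable> open = new ArrayDeque<>();

    /** Registers a resource and returns it, so the call can wrap an expression. */
    public synchronized <T extends AutoCloseable> T track(T resource) {
        open.push(resource);
        return resource;
    }

    /**
     * Closes every tracked resource, most recently opened first.
     * Keeps going if one close() fails and rethrows the first failure at the
     * end, with later failures attached as suppressed exceptions.
     * Returns the number of resources closed without error.
     */
    public synchronized int closeAll() throws Exception {
        Exception first = null;
        int closed = 0;
        while (!open.isEmpty()) {
            try {
                open.pop().close();
                closed++;
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
        return closed;
    }
}
```

A caller would wrap long-lived handles when opening them, for example `closeables.track(Files.newDirectoryStream(dir))`. Anything that is only needed briefly is probably better served by try-with-resources, leaving nothing to close at checkpoint time.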
