
Conversation

dybyte
Contributor

@dybyte dybyte commented Sep 4, 2025

Purpose of this pull request

The Paimon connectors CI occasionally fails: the tests hang until they time out, which can take up to 2 hours and makes these failures very cumbersome. The failures occur because the jobs attempt to access the database before the necessary privileges have been granted.

Example error logs:

2025-09-03 15:06:32,239 INFO  org.apache.seatunnel.engine.server.TaskExecutionService - [localhost]:5801 [seatunnel] [5.1] task 1000300000000 error with exception: [org.apache.paimon.privilege.NoPrivilegeException: User paimon doesn't have privilege SELECT on table seatunnel_namespace11.st_test], cancel other task in taskGroup TaskGroupLocation{jobId=1015638881762017281, pipelineId=1, taskGroupId=3}.
2025-09-03T16:36:38.8565633Z ##[error]The operation was canceled.

To address this, a utility class has been added to ensure that the required database privileges are granted before job execution. This should prevent the flaky test behavior and avoid long-running test hangs.

For details, see: https://github.com/dybyte/seatunnel/actions/runs/17436786971/job/49509156864
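
For illustration only, below is a minimal sketch of the kind of wait-before-run helper described above, assuming the simplest possible approach: poll a cheap probe action (for example, a trivial read of the target table) until it stops failing, then start the job. The class name `PrivilegeReadyUtils`, the probe callback, and the timeout are placeholders, not the actual utility class added in this pull request.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper (the name and signature are placeholders, not the class
// added in this PR). It polls a cheap probe, e.g. a trivial read of the target
// table, and only returns once the probe succeeds, i.e. once the granted
// privilege is actually visible, so the real job is never submitted too early.
public final class PrivilegeReadyUtils {

    public static void awaitPrivilege(Callable<?> probe, Duration timeout) throws Exception {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        Exception lastFailure = null;
        while (System.nanoTime() < deadlineNanos) {
            try {
                probe.call();   // would fail with e.g. NoPrivilegeException before the grant is visible
                return;         // privilege is effective; safe to start the job
            } catch (Exception e) {
                lastFailure = e;
                Thread.sleep(500L);   // brief back-off before the next attempt
            }
        }
        throw new IllegalStateException("Privilege was not granted within " + timeout, lastFailure);
    }

    private PrivilegeReadyUtils() {}
}
```

A test would call this with a small read or write against the table used by the job, before submitting the SeaTunnel job.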

Does this PR introduce any user-facing change?

No

How was this patch tested?

Covered by existing tests

Check list

@github-actions github-actions bot added the e2e label Sep 4, 2025
@dybyte dybyte marked this pull request as draft September 4, 2025 08:07
@dybyte dybyte marked this pull request as ready for review September 4, 2025 08:55
@dybyte dybyte marked this pull request as draft September 4, 2025 09:22
@dybyte dybyte marked this pull request as ready for review September 4, 2025 14:45
@Hisoka-X
Member

Hisoka-X commented Sep 5, 2025

cc @hawk9821 as well.

Hisoka-X previously approved these changes Sep 8, 2025
tableIdentifier);
}
break;
default:
Member

Hi, why was only SELECT/INSERT checked here?

Contributor Author

Currently, the tests only check for insert and select privileges. I thought it would be fine to add checks for other privileges when we introduce corresponding tests for them. What do you think?

Member

Currently, the tests only check for insert and select privileges. I thought it would be fine to add checks for other privileges when we introduce corresponding tests for them. What do you think?

@dybyte Sure, but it would be better if we could improve it while adding these methods, whenever your schedule allows.

Contributor Author

Done, thanks for the suggestion!
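
To illustrate the shape of such a per-privilege check, here is a rough, self-contained sketch. The enum values and helper names below are placeholders rather than the identifiers used in the merged utility class, and only the SELECT/INSERT cases discussed above are filled in.

```java
// Illustrative sketch only: a simplified dispatch over privilege kinds.
// PrivilegeKind, checkSelect(...) and checkInsert(...) are placeholder names,
// not the identifiers used in the actual utility class of this PR.
enum PrivilegeKind { SELECT, INSERT, ALTER_TABLE, DROP_TABLE }

final class PrivilegeCheckSketch {

    static void check(PrivilegeKind kind, String tableIdentifier) {
        switch (kind) {
            case SELECT:
                checkSelect(tableIdentifier);   // probe that a read succeeds
                break;
            case INSERT:
                checkInsert(tableIdentifier);   // probe that a write succeeds
                break;
            default:
                // Privileges not yet exercised by the e2e tests: fail loudly
                // instead of silently skipping the check.
                throw new UnsupportedOperationException("No check implemented for " + kind);
        }
    }

    private static void checkSelect(String tableIdentifier) {
        // A real check would run a trivial query against tableIdentifier here.
    }

    private static void checkInsert(String tableIdentifier) {
        // A real check would write and then clean up a throwaway row here.
    }
}
```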

@hawk9821
Contributor

hawk9821 commented Sep 15, 2025

The grant operation of Paimon is synchronous, so this behavior is quite unusual. I think we should pay attention to why the job does not exit (fail fast) after the exception is thrown.

@Hisoka-X
Member

The grant operation of Paimon is synchronous, so this behavior is quite unusual. I think we should pay attention to why the job does not exit (fail fast) after the exception is thrown.

@hawk9821 Should we create another issue to track it? Does it block this PR from being merged?

@zhangshenghang
Member

The grant operation of Paimon is synchronous, so this behavior is quite unusual. I think we should pay attention to why the job does not exit (fail fast) after the exception is thrown.

@hawk9821 Should we create another issue to track it? Does it block this PR from being merged?

Yes, we can fix it here first and analyze the cause of the problem in a separate issue.

@dybyte
Contributor Author

dybyte commented Sep 16, 2025

I’m not entirely certain why the permission check still throws an exception even after the retry. However, it seems that the reason the job does not terminate afterward may be related to #9749.

Typically, tests verify behavior with assertions after the job has successfully completed. In this particular case, however, an exception occurs during the job execution itself. As a result, the job is likely not removed from pendingJobMasterMap and continues to be counted in runningJobCount.

For more details, please see the end of this log:
https://github.com/dybyte/seatunnel/actions/runs/17436786971/job/49509156864

2025-09-03T15:41:40.5697311Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: ***********************************************
2025-09-03T15:41:40.5698018Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT:                 Job info detail
2025-09-03T15:41:40.5698734Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: ***********************************************
2025-09-03T15:41:40.5699653Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: createdJobCount           :                   0
2025-09-03T15:41:40.5700308Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: pendingJobCount           :                   0
2025-09-03T15:41:40.5700922Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: scheduledJobCount         :                   0
2025-09-03T15:41:40.5701606Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: runningJobCount           :                   1
2025-09-03T15:41:40.5702402Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: failingJobCount           :                   0
2025-09-03T15:41:40.5702986Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: failedJobCount            :                   0
2025-09-03T15:41:40.5703563Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: cancellingJobCount        :                   0
2025-09-03T15:41:40.5704134Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: canceledJobCount          :                   0
2025-09-03T15:41:40.5704791Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: finishedJobCount          :                   0
2025-09-03T15:41:40.5705342Z [] 2025-09-03 15:41:40,568 INFO  tc.seatunnel-engine:openjdk:8 - STDOUT: ***********************************************

This is just my hypothesis based on the logs, so please correct me if I’m wrong.
@Hisoka-X @hawk9821 @zhangshenghang

@dybyte
Contributor Author

dybyte commented Sep 16, 2025

After reviewing the CI logs again, I noticed "Slow operation detected" messages for CheckpointErrorReportOperation.
In the CI environment, multiple jobs run concurrently and resources may be constrained, which can delay Hazelcast’s OperationThreads. This could potentially cause pipeline state updates or checkpoint error handling to be postponed, leaving the job stuck in the “running” state.

@zhangshenghang
Member

After reviewing the CI logs again, I noticed "Slow operation detected" messages for CheckpointErrorReportOperation. In the CI environment, multiple jobs run concurrently and resources may be constrained, which can delay Hazelcast’s OperationThreads. This could potentially cause pipeline state updates or checkpoint error handling to be postponed, leaving the job stuck in the “running” state.

If it's the problem you mentioned, then it wouldn't only affect Paimon; many other places would have the same problem, right? Can we reproduce this problem? Or should we wait until #9749 is merged before verifying?

@dybyte
Contributor Author

dybyte commented Sep 16, 2025

If it's the problem you mentioned, then it wouldn't only affect Paimon; many other places would have the same problem, right? Can we reproduce this problem? Or should we wait until #9749 is merged before verifying?

Since this only happens occasionally in CI, it's hard to pinpoint the exact cause.
My previous comments are just guesses based on the logs; nothing is confirmed yet.
If anyone can find a way to reproduce it consistently, please let me know.

@Hisoka-X
Member

We can wait for #9749 to be merged, then see what happens in this case.
