[BUG] Test org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex is Flaky #8352

Closed
Rishikesh1159 opened this issue Jun 29, 2023 · 5 comments
Labels: bug, distributed framework

Comments

@Rishikesh1159
Member

Describe the bug
org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex is flaky.

java.lang.AssertionError: Tried to stop the only cluster-manager eligible shared node
	at __randomizedtesting.SeedInfo.seed([D34AC87DB7FB7699:2BBC977EF7931736]:0)
	at org.opensearch.test.InternalTestCluster.stopRandomNode(InternalTestCluster.java:1789)
	at org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex(RestoreSnapshotIT.java:438)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:833)

Full stack trace: here

To Reproduce
Steps to reproduce the behavior:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex" -Dtests.seed=D34AC87DB7FB7699 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=de-AT -Dtests.timezone=America/Nuuk -Druntime.java=17

@nknize
Collaborator

nknize commented Jun 30, 2023

Found another failure on #8382, and it reproduced locally.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex" -Dtests.seed=8CA2A73B7F437193 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv-SE -Dtests.timezone=MST7MDT -Druntime.java=17
[testindex1/_DLgbwL0SEaGw-HtK67jdw], shard=[testindex1][0]]
org.opensearch.snapshots.SnapshotMissingException: [test-restore-snapshot-repo:test-restore-snapshot1/MiJULTGxSiWcbKXSap9dVQ] is missing
	at org.opensearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3117) ~[main/:?]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.getShardSnapshotStatus(BlobStoreRepository.java:2924) ~[main/:?]
	at org.opensearch.snapshots.InternalSnapshotsInfoService$FetchingSnapshotShardSizeRunnable.doRun(InternalSnapshotsInfoService.java:240) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.nio.file.NoSuchFileException: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.snapshots.RestoreSnapshotIT_8CA2A73B7F437193-004/tempDir-002/repos/CvLRFjlcDt/indices/_DLgbwL0SEaGw-HtK67jdw/0/snap-MiJULTGxSiWcbKXSap9dVQ.dat
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218) ~[?:?]
	at java.nio.file.Files.newByteChannel(Files.java:380) ~[?:?]
	at java.nio.file.Files.newByteChannel(Files.java:432) ~[?:?]
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193) ~[lucene-test-framework-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193) ~[lucene-test-framework-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193) ~[lucene-test-framework-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newInputStream(HandleTrackingFS.java:94) ~[lucene-test-framework-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.tests.mockfile.FilterFileSystemProvider.newInputStream(FilterFileSystemProvider.java:193) ~[lucene-test-framework-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newInputStream(HandleTrackingFS.java:94) ~[lucene-test-framework-9.7.0.jar:9.7.0 ccf4b198ec328095d45d2746189dc8ca633e8bcf - 2023-06-21 11:48:16]
	at java.nio.file.Files.newInputStream(Files.java:160) ~[?:?]
	at org.opensearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:170) ~[main/:?]
	at org.opensearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:121) ~[main/:?]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3115) ~[main/:?]
	... 7 more

@sachinpkale
Member

@harishbhakuni21 Can you please take a look?

@Rishikesh1159
Member Author

Rishikesh1159 commented Jul 3, 2023

I had a quick look at this flaky test. It fails in several different ways. Here are a few of the failures:

java.lang.AssertionError: Tried to stop the only cluster-manager eligible shared node
	at __randomizedtesting.SeedInfo.seed([D34AC87DB7FB7699:2BBC977EF7931736]:0)

./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex" -Dtests.seed=8CA2A73B7F437193 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv-SE -Dtests.timezone=MST7MDT -Druntime.java=17


org.opensearch.snapshots.SnapshotMissingException: [test-restore-snapshot-repo:test-restore-snapshot1/MiJULTGxSiWcbKXSap9dVQ] is missing
	at org.opensearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3117) ~[main/:?]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.getShardSnapshotStatus(BlobStoreRepository.java:2924) ~[main/:?]
	at org.opensearch.snapshots.InternalSnapshotsInfoService$FetchingSnapshotShardSizeRunnable.doRun(InternalSnapshotsInfoService.java:240) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.nio.file.NoSuchFileException: /var/jenkins/workspace/gradle-check/search/server/build/testrun/internalClusterTest/temp/org.opensearch.snapshots.RestoreSnapshotIT_8CA2A73B7F437193-004/tempDir-002/repos/CvLRFjlcDt/indices/_DLgbwL0SEaGw-HtK67jdw/0/snap-MiJULTGxSiWcbKXSap9dVQ.dat

REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex" -Dtests.seed=943DB37ABEDFD4A8 -Dtests.locale=en-US -Dtests.timezone=Africa/Khartoum -Druntime.java=14

NodeClosedException[node closed {node_s7}{GlcOjvQiR3ereHzpRji7Cg}{xF1004YkQs6yUQeIl004_Q}{127.0.0.1}{127.0.0.1:42999}{d}{shard_indexing_pressure_enabled=true}
]
	at __randomizedtesting.SeedInfo.seed([943DB37ABEDFD4A8:6CCBEC79FEB7B507]:0)
	at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:1125)
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:387)

@Rishikesh1159
Member Author

Rishikesh1159 commented Jul 3, 2023

The following changes fix a few of the failures above:

-> Change L351 to final String firstNode = internalCluster().startDataOnlyNode();
-> Change L381 to final String secondNode = internalCluster().startDataOnlyNode();
-> Change L438 to internalCluster().stopRandomNode(InternalTestCluster.nameFilter(firstNode));
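
To illustrate why these three changes help, here is a minimal, self-contained sketch. It is a toy model, not OpenSearch code: ToyCluster, Node, stopFails and countFailures are hypothetical names invented for this example. It assumes only what the assertion message above suggests, namely that stopping a randomly chosen node is rejected when that node is the only cluster-manager eligible one. It compares an unfiltered random stop with a stop restricted by name to a data-only node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model only: ToyCluster and Node are hypothetical stand-ins, not OpenSearch classes.
public class StopNodeSketch {

    static final class Node {
        final String name;
        final boolean clusterManagerEligible;

        Node(String name, boolean clusterManagerEligible) {
            this.name = name;
            this.clusterManagerEligible = clusterManagerEligible;
        }
    }

    static final class ToyCluster {
        private final List<Node> nodes = new ArrayList<>();
        private final Random random;

        ToyCluster(long seed) {
            this.random = new Random(seed);
        }

        // Starts a node with the given role and returns its name.
        String start(boolean clusterManagerEligible) {
            Node node = new Node("node_" + nodes.size(), clusterManagerEligible);
            nodes.add(node);
            return node.name;
        }

        // Stops one random node matching the filter, but refuses to stop
        // the last cluster-manager eligible node (assumed behaviour).
        void stopRandomNode(Predicate<Node> filter) {
            List<Node> candidates = nodes.stream().filter(filter).collect(Collectors.toList());
            if (candidates.isEmpty()) {
                return;
            }
            Node victim = candidates.get(random.nextInt(candidates.size()));
            long eligible = nodes.stream().filter(n -> n.clusterManagerEligible).count();
            if (victim.clusterManagerEligible && eligible == 1) {
                throw new AssertionError("Tried to stop the only cluster-manager eligible shared node");
            }
            nodes.remove(victim);
        }
    }

    // Returns true if the stop trips the assertion for this seed.
    public static boolean stopFails(long seed, boolean filterByName) {
        ToyCluster cluster = new ToyCluster(seed);
        cluster.start(true);                            // single cluster-manager eligible node
        final String firstNode = cluster.start(false);  // data-only node
        cluster.start(false);                           // second data-only node
        Predicate<Node> filter = filterByName ? n -> n.name.equals(firstNode) : n -> true;
        try {
            cluster.stopRandomNode(filter);
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    // Counts how many seeds in [0, seeds) trip the assertion.
    public static long countFailures(int seeds, boolean filterByName) {
        long failures = 0;
        for (long seed = 0; seed < seeds; seed++) {
            if (stopFails(seed, filterByName)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("unfiltered failures: " + countFailures(1000, false));
        System.out.println("name-filtered failures: " + countFailures(1000, true));
    }
}
```

In this model the unfiltered stop picks the cluster-manager eligible node for some seeds, which matches the seed-dependent nature of the first failure, while the name-filtered stop never can. The sketch does not model the NodeClosedException.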

But after running a few iterations with the above fix, I still see:

REPRODUCE WITH: ./gradlew 'null' --tests "org.opensearch.snapshots.RestoreSnapshotIT.testRestoreInSameRemoteStoreEnabledIndex" -Dtests.seed=943DB37ABEDFD4A8 -Dtests.locale=en-US -Dtests.timezone=Africa/Khartoum -Druntime.java=14

NodeClosedException[node closed {node_s7}{GlcOjvQiR3ereHzpRji7Cg}{xF1004YkQs6yUQeIl004_Q}{127.0.0.1}{127.0.0.1:42999}{d}{shard_indexing_pressure_enabled=true}
]
	at __randomizedtesting.SeedInfo.seed([943DB37ABEDFD4A8:6CCBEC79FEB7B507]:0)
	at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:1125)
	at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:387)

This needs a deeper dive.

@sachinpkale
Member

Fixed in #8422
