[BUG] Windows integTest stuck on executing integtest.sh commands #4563

Closed
peterzhuamazon opened this issue Mar 23, 2024 · 8 comments · Fixed by #4564
Comments

@peterzhuamazon
Member

Hi, we are currently seeing this issue on Windows Zip, for OS only.

  • Symptoms:
    • Running the Windows integTest on 2.13.0 on a fresh container resulted in these behaviors:
    • Running the Windows integTest on 2.12.0 on a fresh container, then running 2.13.0:
      • The test passes on Windows Zip
  • Debugging:
    • Ways to temporarily work around the issue, which do not persist after a Docker container restart:
      • Run 2.13.0 once, then run 2.12.0, then run 2.13.0
      • Run 2.13.0 once, then update a print statement anywhere within src/system/execute.py
      • Run 2.13.0 once, then update anything related to the Python code
    • Changing the Gradle version does fix the issue
    • Switching the opensearch-build code from main to 2.12.0 does not fix the issue
    • Ways to permanently work around the issue in a POC:
      • Change the integtest.sh execution to something else, such as PowerShell, or use bash to call the commands directly without a script in the middle
  • We suspect this is caused by subprocess.run not receiving a proper exit/return from bash integtest.sh, so it keeps waiting for the subprocess to end until it times out
  • Mitigation:
    • We need some help from the core team / plugin teams to understand whether any changes related to OS cause this issue on Windows
    • Whether any changes to the Gradle tasks cause this issue
    • More clues are welcome, because this issue does not happen with the 2.12.0 Windows Zip artifacts.
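
One way to make a hang like the suspected one visible, instead of waiting on it indefinitely, is to pass a timeout to subprocess.run. This is only a minimal sketch and not the opensearch-build code: the helper name run_with_timeout, the timeout values, and the stand-in child processes are all hypothetical.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(command: List[str], timeout_s: float) -> Optional[subprocess.CompletedProcess]:
    """Run command; return None instead of blocking forever if it does not exit in time."""
    try:
        return subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as e:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        print(f"Command did not exit within {e.timeout} seconds: {e.cmd}")
        return None


# Hypothetical stand-in for a stuck script: a child that sleeps longer than the timeout.
stuck = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
# A child that exits promptly is returned as a normal CompletedProcess.
done = run_with_timeout([sys.executable, "-c", "print('done')"], timeout_s=30)
```

A timeout does not explain why the child never reports its exit; it only turns a silent hang into an explicit failure that can be logged.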

Thanks.

@github-actions github-actions bot added the untriaged Issues that have not yet been triaged label Mar 23, 2024
@peterzhuamazon peterzhuamazon removed the untriaged Issues that have not yet been triaged label Mar 23, 2024
@rishabh6788
Collaborator

rishabh6788 commented Mar 23, 2024

As per the subprocess documentation, to ignore the UnicodeDecodeError exception we need to add errors='ignore' to the subprocess.call command. To verify this, I spun up a Windows host and ran the Windows container.
I ran opensearch-observability without the errors='ignore' flag and got the same UnicodeDecodeError exception. I then added the flag to the command in the execute method and it ran successfully. To confirm that it works, in the same container session I removed the flag and got the decode error again, and after adding it back the error was gone.
I can see the files in the test-results directory.
We can handle it better by catching this exception.

try:
    result = subprocess.run(command, cwd=dir, shell=True, capture_output=capture, text=True, encoding='utf-8')
except UnicodeDecodeError as e:
    print(f"Error decoding stdout: {e}")
    result = subprocess.run(command, cwd=dir, shell=True, capture_output=capture, text=True, encoding='utf-8', errors='ignore')
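
The snippet above relies on names from the surrounding execute method (command, dir, capture). A self-contained sketch of the same try-then-retry idea follows; the function name run_decoding_fallback and the stand-in child process are hypothetical, and the command is passed as a list rather than through shell=True to keep the example portable.

```python
import subprocess
import sys
from typing import List


def run_decoding_fallback(command: List[str]) -> subprocess.CompletedProcess:
    """Try strict UTF-8 decoding first; on failure, rerun and drop undecodable bytes."""
    try:
        return subprocess.run(command, capture_output=True, text=True, encoding="utf-8")
    except UnicodeDecodeError as e:
        print(f"Error decoding output: {e}")
        # Note: this executes the command a second time.
        return subprocess.run(
            command, capture_output=True, text=True, encoding="utf-8", errors="ignore"
        )


# Hypothetical stand-in child that writes one byte (0xFF) that is never valid UTF-8.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\xff done')"]
result = run_decoding_fallback(child)
```

The retry runs the command twice, which matters when the command is a long integration test. Passing the errors argument on the first call avoids the second run.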

@peterzhuamazon
Member Author

Thanks Rishabh,
We already tried using the errors parameter.

The problem is:

  1. It is not clear why we sometimes reach the Unicode error, while at other times it just gets stuck executing without even going to the next step.
  2. Even without changing errors, we can still get it to pass by running 2.12 before 2.13; then 2.13 passes.

We will take a look and sync up later.

@peterzhuamazon peterzhuamazon self-assigned this Mar 25, 2024
@peterzhuamazon
Member Author

peterzhuamazon commented Mar 25, 2024

By switching from gradlew to gradlew.bat on Windows, I am finally able to consistently reach the Unicode error step. I then see this error on observability, which has also been confirmed to show up in AD:


org.opensearch.observability.rest.CreateObjectIT > test create notebook fail FAILED
    javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    See https://opensearch.org/docs/latest/clients/java-rest-high-level/ for troubleshooting.
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:948)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:333)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at org.opensearch.observability.PluginRestTestCase.wipeAllOpenSearchIndices(PluginRestTestCase.kt:95)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.base/java.lang.Thread.run(Thread.java:829)

        Caused by:
        javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
            at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:360)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:303)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:298)
            at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1357)
            at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.onConsumeCertificate(CertificateMessage.java:1232)
            at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.consume(CertificateMessage.java:1175)
            at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
            at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:443)
            at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1076)
            at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1063)
            at java.base/java.security.AccessController.doPrivileged(Native Method)
            at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:1010)
            at org.apache.http.nio.reactor.ssl.SSLIOSession.doRunTask(SSLIOSession.java:289)
            at org.apache.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:357)
            at org.apache.http.nio.reactor.ssl.SSLIOSession.isAppInputReady(SSLIOSession.java:545)
            at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:120)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
            ... 1 more

            Caused by:
            sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
                at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
                at java.base/sun.security.validator.Validator.validate(Validator.java:264)
                at java.base/sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:313)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:276)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:141)
                at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1335)
                ... 19 more

                Caused by:
                sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
                    at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
                    at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:434)
                    ... 25 more

    javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    See https://opensearch.org/docs/latest/clients/java-rest-high-level/ for troubleshooting.
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:948)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:333)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at org.opensearch.test.rest.OpenSearchRestTestCase.ensureNoInitializingShards(OpenSearchRestTestCase.java:958)
        at org.opensearch.test.rest.OpenSearchRestTestCase.cleanUpCluster(OpenSearchRestTestCase.java:358)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.base/java.lang.Thread.run(Thread.java:829)

        Caused by:
        javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
            at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:360)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:303)
            at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:298)
            at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1357)
            at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.onConsumeCertificate(CertificateMessage.java:1232)
            at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.consume(CertificateMessage.java:1175)
            at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
            at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:443)
            at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1076)
            at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask$DelegatedAction.run(SSLEngineImpl.java:1063)
            at java.base/java.security.AccessController.doPrivileged(Native Method)
            at java.base/sun.security.ssl.SSLEngineImpl$DelegatedTask.run(SSLEngineImpl.java:1010)
            at org.apache.http.nio.reactor.ssl.SSLIOSession.doRunTask(SSLIOSession.java:289)
            at org.apache.http.nio.reactor.ssl.SSLIOSession.doHandshake(SSLIOSession.java:357)
            at org.apache.http.nio.reactor.ssl.SSLIOSession.isAppInputReady(SSLIOSession.java:545)
            at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:120)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
            ... 1 more

            Caused by:
            sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
                at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
                at java.base/sun.security.validator.Validator.validate(Validator.java:264)
                at java.base/sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:313)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:276)
                at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:141)
                at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1335)
                ... 19 more

                Caused by:
                sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:148)
                    at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:129)
                    at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
                    at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:434)

@peterzhuamazon
Member Author

peterzhuamazon commented Mar 25, 2024

On top of that, you apparently have to manually mount a volume on Windows to persist the Windows container's storage.

I am able to consistently reproduce the bug now.

Thanks @rishabh6788

@peterzhuamazon
Member Author

Note that Jenkins also mounts the volume without using existing ones.

@peterzhuamazon
Member Author

peterzhuamazon commented Mar 25, 2024

docker run -it -d -v "C:\Users\Administrator\opensearch-build:C:\Users\Administrator\opensearch-build" <> cmd.exe

@peterzhuamazon
Member Author

https://docs.python.org/3/library/functions.html#open

'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.

'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
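
The difference between the two handlers quoted above can be seen on a plain bytes.decode call. This is a small illustration with a made-up byte string. When decoding, Python's replacement marker is U+FFFD; the '?' marker applies when encoding.

```python
raw = b"ab\xffcd"  # 0xFF can never appear in valid UTF-8

try:
    raw.decode("utf-8")  # default handler is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the bad byte, so the damage is invisible afterwards.
ignored = raw.decode("utf-8", errors="ignore")
# 'replace' keeps a visible U+FFFD marker where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

With 'replace', a reader of the test logs can still tell that some output was malformed, which is lost with 'ignore'.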

@peterzhuamazon
Member Author

We will add errors='replace' to the subprocess.run() call in src/system/execute.py.
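
A rough sketch of what such a call could look like is shown here. The actual contents and signature of src/system/execute.py are not quoted in this thread, so the execute function below, its parameters, and the stand-in child process are assumptions for illustration only.

```python
import subprocess
import sys
from typing import List, Tuple


def execute(command: List[str], cwd: str = ".") -> Tuple[int, str, str]:
    """Run command and return (returncode, stdout, stderr) without failing on bad bytes."""
    result = subprocess.run(
        command,
        cwd=cwd,
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # undecodable bytes become U+FFFD instead of raising
    )
    return result.returncode, result.stdout, result.stderr


# Hypothetical stand-in child that writes one byte (0xFF) that is never valid UTF-8.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\xff done')"]
code, out, err = execute(child)
```

Because the handler is set on the first call, the command runs only once and no UnicodeDecodeError handling is needed around it.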
