Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Severe thread contention while running large test with parallel methods #772

Closed
benlamonica opened this issue Aug 4, 2015 · 22 comments
Closed

Comments

@benlamonica
Copy link

While running a large test (600,000+ test methods) I encountered a deadlock situation.

All of the threads are blocked with this stack trace:

pool-102-thread-8 [227] (BLOCKED)
pool-102-thread-7 [226] (BLOCKED)
pool-102-thread-6 [225] (BLOCKED)
pool-102-thread-5 [224] (BLOCKED)
pool-102-thread-4 [223] (BLOCKED)
pool-102-thread-3 [222] (BLOCKED)
pool-102-thread-2 [221] (BLOCKED)
pool-102-thread-16 [235] (BLOCKED)
pool-102-thread-15 [234] (BLOCKED)
pool-102-thread-14 [233] (BLOCKED)
pool-102-thread-13 [232] (BLOCKED)
pool-102-thread-12 [231] (BLOCKED)
pool-102-thread-11 [230] (BLOCKED)
pool-102-thread-10 [229] (BLOCKED)
pool-102-thread-1 [220] (BLOCKED)
pool-102-thread-9 [228] (BLOCKED)
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus line: 85 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.afterExecute line: 94 
   java.util.concurrent.ThreadPoolExecutor.runWorker line: 1150 
   java.util.concurrent.ThreadPoolExecutor$Worker.run line: 617 
   java.lang.Thread.run line: 745 

And the thread that is holding the mutex:

pool-1-thread-1 [23] (RUNNABLE)
   java.util.HashMap$TreeNode.find line: 1865 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.putTreeVal line: 1979 
   java.util.HashMap.putVal line: 637 
   java.util.HashMap.put line: 611 
   java.util.HashSet.add line: 219 
   org.testng.internal.DynamicGraph.setStatus line: 117 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus line: 86 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.runNodes line: 66 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.run line: 54 
   org.testng.TestRunner.privateRun line: 745 
   org.testng.TestRunner.run line: 621 
   org.testng.SuiteRunner.runTest line: 357 
   org.testng.SuiteRunner.runSequentially line: 352 
   org.testng.SuiteRunner.privateRun line: 310 
   org.testng.SuiteRunner.run line: 259 
   org.testng.SuiteRunnerWorker.runSuite line: 52 
   org.testng.SuiteRunnerWorker.run line: 86 
   org.testng.TestNG.runSuitesSequentially line: 1199 
   org.testng.TestNG.runSuitesLocally line: 1124 
   org.testng.TestNG.run line: 1032 
   org.apache.maven.surefire.testng.TestNGExecutor.run line: 295 
   org.apache.maven.surefire.testng.TestNGXmlTestSuite.execute line: 84 
   org.apache.maven.surefire.testng.TestNGProvider.invoke line: 90 
   sun.reflect.NativeMethodAccessorImpl.invoke0 line: not available [native method]
   sun.reflect.NativeMethodAccessorImpl.invoke line: 62 
   sun.reflect.DelegatingMethodAccessorImpl.invoke line: 43 
   java.lang.reflect.Method.invoke line: 497 
   org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2 line: 208 
   org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke line: 155 
   org.apache.maven.surefire.booter.ProviderFactory.invokeProvider line: 81 
   org.apache.maven.plugin.surefire.InPluginVMSurefireStarter.runSuitesInProcess line: 82 
   org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider line: 974 
   org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked line: 824 
   org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute line: 722 
   org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo line: 132 
   org.apache.maven.lifecycle.internal.MojoExecutor.execute line: 208 
   org.apache.maven.lifecycle.internal.MojoExecutor.execute line: 153 
   org.apache.maven.lifecycle.internal.MojoExecutor.execute line: 145 
   org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject line: 116 
   org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject line: 80 
   org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build line: 51 
   org.apache.maven.lifecycle.internal.LifecycleStarter.execute line: 120 
   org.apache.maven.DefaultMaven.doExecute line: 347 
   org.apache.maven.DefaultMaven.execute line: 154 
   org.jvnet.hudson.maven3.launcher.Maven32Launcher.main line: 132 
   sun.reflect.NativeMethodAccessorImpl.invoke0 line: not available [native method]
   sun.reflect.NativeMethodAccessorImpl.invoke line: 62 
   sun.reflect.DelegatingMethodAccessorImpl.invoke line: 43 
   java.lang.reflect.Method.invoke line: 497 
   org.codehaus.plexus.classworlds.launcher.Launcher.launchStandard line: 330 
   org.codehaus.plexus.classworlds.launcher.Launcher.launch line: 238 
   jenkins.maven3.agent.Maven32Main.launch line: 181 
   sun.reflect.NativeMethodAccessorImpl.invoke0 line: not available [native method]
   sun.reflect.NativeMethodAccessorImpl.invoke line: 62 
   sun.reflect.DelegatingMethodAccessorImpl.invoke line: 43 
   java.lang.reflect.Method.invoke line: 497 
   hudson.maven.Maven3Builder.call line: 136 
   hudson.maven.Maven3Builder.call line: 71 
   hudson.remoting.UserRequest.perform line: 118 
   hudson.remoting.UserRequest.perform line: 48 
   hudson.remoting.Request$2.run line: 326 
   hudson.remoting.InterceptingExecutorService$1.call line: 72 
   java.util.concurrent.FutureTask.run line: 266 
   java.util.concurrent.ThreadPoolExecutor.runWorker line: 1142 
   java.util.concurrent.ThreadPoolExecutor$Worker.run line: 617 
   java.lang.Thread.run line: 745 
@cbeust
Copy link
Collaborator

cbeust commented Aug 4, 2015

I took a look at the source and I'm not sure this is the correct interpretation. m_graph is synchronized in three places in this class and one of them is setStatus. As far as I can tell, this method is always called outside of a synchronized block so that if thread A calls it while B is inside the synchronized block in setStatus, this should not result in a deadlock.

Do you have the full thread dump?

Any way you can reproduce this on a small project?

@benlamonica
Copy link
Author

I think you're right, it is just taking a very long time for it to complete the test, and the threads are blocked because another thread is busy inserting into the graph.

I looked at the log again and I see that it's only partially done. I'm not sure why it's so slow though, when I ran this in JUnit, the tests execute very quickly. (Full test runs in 1-2 hours, in testng, it's been running for 11+ hours now).
272000 out of 668592 tests run

So this is not a bug, feel free to close the issue. I'm sure it's something with my test.

Thanks for the quick response.

@cbeust
Copy link
Collaborator

cbeust commented Aug 4, 2015

Thanks for the clarification.

Not sure I want to close this just yet, I'm not convinced yet that this is not a bug in TestNG, especially if you say that the same test runs fast in JUnit. Can you double check this? I'm a bit confused, since if the same tests are running in JUnit, then they are all independent (since JUnit doesn't support dependent tests) which means that the DynamicGraph use in TestNG should be minimal (this graph is only used when you have dependent test methods).

Can you report back on what you find? I want to make sure there's not some hidden slow algorithm in TestNG.

@benlamonica
Copy link
Author

Let me start the jvm with a code profiler and run again.

Here are the particulars of the test:

  • 6 test methods in a class (annotated with @test, nothing else) in a class
  • the constructor annotated with @factory(dataProvider="data") which takes one argument, a POJO with a bunch of data gathered by the DataFactory
  • a static method that returns an Object[][] annotated with @dataProvider(name="data", parallel="true") - this method returns an Object[] of about 111000 in size.
  • a method annotated with @AfterMethod which counts the tests executed and cleans up after everything is done and also does a little bit more logging
  • a method annotated with @BeforeMethod which changes an instance variable containing the current test name (this isn't great, but I couldn't figure out how to differentiate between the different test runs, and this is what I found on StackOverflow)
  • The test class implements ITest so that the method names can be more intelligible when looking at failures.
  • the suite.xml defines the following attributes: verbose="1" parallel="methods" thread-count="16"

I'll do a little more research and get back to you.

@benlamonica
Copy link
Author

Here is the execution profiler.

Stack Trace Sample Count    Percentage(%)
java.lang.Thread.run()  1,037   98.95
   java.util.concurrent.ThreadPoolExecutor$Worker.run() 639 60.973
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)  639 60.973
         org.testng.internal.thread.graph.GraphThreadPoolExecutor.afterExecute(Runnable, Throwable) 406 38.74
            org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus(IWorker, DynamicGraph$Status)    406 38.74
               org.testng.internal.DynamicGraph.setStatus(Object, DynamicGraph$Status)  398 37.977
                  java.util.HashSet.add(Object) 298 28.435
                     java.util.HashMap.put(Object, Object)  298 28.435
                        java.util.HashMap.putVal(int, Object, Object, boolean, boolean) 298 28.435
                  org.testng.internal.DynamicGraph.removeNode(Object)   100 9.542
                     java.util.HashSet.remove(Object)   100 9.542
               java.util.ArrayList.remove(Object)   8   0.763
         org.testng.internal.TestMethodWorker.run() 227 21.66
            org.testng.internal.TestMethodWorker.invokeTestMethods(ITestNGMethod, Object, ITestContext) 227 21.66
               org.testng.internal.Invoker.invokeTestMethods(ITestNGMethod, XmlSuite, Map, ConfigurationGroupMethods, Object, ITestContext) 227 21.66
                  org.testng.internal.Invoker.invokeTestMethod(Object, ITestNGMethod, Object[], int, XmlSuite, Map, ITestClass, ITestNGMethod[], ITestNGMethod[], ConfigurationGroupMethods, Invoker$FailureContext)    226 21.565
                     org.testng.internal.Invoker.invokeMethod(Object, ITestNGMethod, Object[], int, XmlSuite, Map, ITestClass, ITestNGMethod[], ITestNGMethod[], ConfigurationGroupMethods, Invoker$FailureContext) 226 21.565
                        org.testng.internal.MethodInvocationHelper.invokeMethod(Method, Object, Object[])   224 21.374
                           java.lang.reflect.Method.invoke(Object, Object[])    224 21.374
                              sun.reflect.DelegatingMethodAccessorImpl.invoke(Object, Object[]) 224 21.374
                                 sun.reflect.GeneratedMethodAccessor135.invoke(Object, Object[])    224 21.374
                                    com.baml.gfx.pte.regression.PTERegression.db()  224 21.374
                                       com.baml.gfx.pte.regression.PTERegression.areIdenticalNodes(String, String, DifferenceListener, ElementQualifier)    224 21.374
                                          org.custommonkey.xmlunit.DetailedDiff.<init>(Diff)    94  8.969
                                          org.custommonkey.xmlunit.Diff.<init>(String, String)  84  8.015
                                          org.custommonkey.xmlunit.Diff.similar()   46  4.389
                        org.testng.internal.Invoker.invokeConfigurations(IClass, ITestNGMethod, ITestNGMethod[], XmlSuite, Map, Object[], Object, ITestResult)  1   0.095
                        org.testng.internal.MethodHelper.findExpectedExceptions(IAnnotationFinder, Method)  1   0.095
                  org.testng.internal.Invoker.filterMethods(IClass, ITestNGMethod[], Invoker$Predicate) 1   0.095

21.66% is spent on executing the test, and 61% is spent inserting into the graph set. (the rest is spent on our caching implementation, which suffered a disconnect to our dev environment and is busy rebuilding it's caches).

Do you know why we would be inserting into that graph?

@benlamonica
Copy link
Author

It looks like the hashCode is based on the method, which would be the same for 100,000 entries. So each time it adds to the HashSet, it has to iterate over 100,000 instances in order to find that the object is not in the set? I will experiment with making the hashCode more unique...

@cbeust
Copy link
Collaborator

cbeust commented Aug 4, 2015

Can you turning off parallelism completely? I'm curious what this will change.

@benlamonica
Copy link
Author

I turned off parallelism and now it's running about 1000 tests every 15 seconds. I don't see any of the delays I was seeing before.

@rschmitt
Copy link
Contributor

rschmitt commented Aug 5, 2015

Doesn't sound like deadlock so much as severe lock contention.

On Tue, Aug 4, 2015 at 5:04 PM, Ben La Monica [email protected]
wrote:

I turned off parallelism and now it's running about 1000 tests every 15
seconds. I don't see any of the delays I was seeing before.


Reply to this email directly or view it on GitHub
#772 (comment).

@cbeust
Copy link
Collaborator

cbeust commented Aug 5, 2015

I turned off parallelism and now it's running about 1000 tests every 15 seconds. I don't see any of the delays I was seeing before.

That was my intuition, glad to see it confirmed. Like @rschmitt is saying, it's most likely a contention problem.

@benlamonica benlamonica changed the title Deadlock while running large test with parallel methods Severe thread contention while running large test with parallel methods Aug 5, 2015
@benlamonica
Copy link
Author

Here is a simplified example that illustrates the problem.

import static org.junit.Assert.assertNotNull;

import java.util.concurrent.atomic.AtomicInteger;

import org.testng.annotations.AfterMethod;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Factory;
import org.testng.annotations.Test;


public class ContentionTest {

    private Integer testData;

    @Factory(dataProvider="data")
    public ContentionTest(Integer data) {
        testData = data;
    }

    @DataProvider(name="data", parallel=true)
    public static Object[][] getData() {
        Integer[][] data = new Integer[600000][1];
        for(int i = 0; i < 600000; i++) {
            data[i] = new Integer[] { i };
        }
        return data;
    }
    private static AtomicInteger testCount = new AtomicInteger(0);

    @AfterMethod
    public void countTests() {
        int count = testCount.incrementAndGet();
        if (count % 1000 == 0) {
            System.out.println(count + " tests completed.");
        }
    }

    @Test
    public void one() {
        assertNotNull("obj", testData);
    }

    @Test
    public void two() {
        assertNotNull("obj", testData);
    }

    @Test
    public void three() {
        assertNotNull("obj", testData);
    }

    @Test
    public void four() {
        assertNotNull("obj", testData);
    }

    @Test
    public void five() {
        assertNotNull("obj", testData);
    }

    @Test
    public void six() {
        assertNotNull("obj", testData);
    }
}

And the suite:

<suite name="contention suite">

  <test name="contention tests" parallel="methods" thread-count="6">
    <classes>
      <class name="ContentionTest"/>
    </classes>
  </test>
</suite>

benlamonica added a commit to benlamonica/testng that referenced this issue Aug 5, 2015
benlamonica added a commit to benlamonica/testng that referenced this issue Aug 5, 2015
@benlamonica
Copy link
Author

Unfortunately I don't have any more time to devote to this issue. I'm going to have to go and find a different solution. Let me know if I can contribute any more info to you. Thanks for the great testing framework!

@cbeust
Copy link
Collaborator

cbeust commented Aug 5, 2015

Thanks for the PR and your efforts, @benlamonica, I'll take it from there!

@testn
Copy link
Contributor

testn commented Nov 4, 2015

@cbeust, @juherr the similar fix of #775 has been broken down into #848, #849, #850, #852. Each one of them improves different path of the code path that makes #772 to take such a long time to execute. If you run #772 tests on the code before and after the fix, you will see that the time it takes to execute the tests are significantly reduced.

@juherr
Copy link
Member

juherr commented Nov 5, 2015

@testn The problem is we have no proof it or any idea how much we can. Then, it will take time to reproduce and test it on our side.
Last time I tried, I had to stop before the end.

The hardest and most important thing with perf issues are not the fix, but the way to reproduce and to measure it easily.

But maybe you can help us?

@testn
Copy link
Contributor

testn commented Nov 5, 2015

@juherr, just try it with a lower number. In my case, 6000 is already big enough to see the problem. If you want to start validating issues, you probably need to create a lot of test cases for performance testing in which I don't see it much in TestNG.

@testn
Copy link
Contributor

testn commented Nov 6, 2015

@benlamonica , can you help provide the result of the test?

@shaburov
Copy link
Contributor

shaburov commented Jan 21, 2017

v 6.9.13.3
In testNG no mechanisms for forced termination of the child threads. Because of this, the execution of tests in a loop (stability test) simply hangs in the case of deadlock.

Exception in thread "pool-309-thread-5" java.lang.ArrayIndexOutOfBoundsException: 49
        at java.util.ArrayList.fastRemove(ArrayList.java:546)
        at java.util.ArrayList.remove(ArrayList.java:529)
        at org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus(GraphThreadPoolExecutor.java:82)
        at org.testng.internal.thread.graph.GraphThreadPoolExecutor.afterExecute(GraphThreadPoolExecutor.java:94)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1150)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

image

@juherr
Copy link
Member

juherr commented Jan 21, 2017

@shaburov Do you have a way to reproduce it every time?

@shaburov
Copy link
Contributor

shaburov commented Jan 21, 2017

@juherr
Absolutely floating defect.
But since my tests are performed continuously in the monitoring mode, the reproduced defect in the interval 2-24 hours.

But I can guarantee that it is reproduced with the following settings:
testNG.setSuiteThreadPoolSize (< equal to the number launched suites >);
XmlSuite.setThreadCount (<5 and more>);

The error is not difficult to find.
image

@juherr
Copy link
Member

juherr commented Jan 21, 2017

Collections.synchronizedList() on https://github.com/cbeust/testng/blame/master/src/main/java/org/testng/internal/thread/graph/GraphThreadPoolExecutor.java#L29 is smarter IMO.

Is the fix resolving your issue?

You made the most difficult part of the job, why not proposing a pull request then? :)

@shaburov
Copy link
Contributor

@juherr
Pull request #1305

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants