Replies: 2 comments 7 replies
-
I strongly disagree with this analysis and the proposed solutions, and would strongly suggest NOT implementing the mechanisms suggested here.
-
We are here to deal with data and evidence, and this being the inference drawn from them, I have no qualms about taking this approach.
-
With the nightly runs showing us the time consumed by every test case, we can see that certain test cases take around 20 seconds to complete. If one were to analyze the operations performed in such a test case, it is..
So this raises the question: where is the time being spent?
On closer observation, we find that around 9 seconds are consumed by operations 4-5. This is for a test case in which we do nothing other than steps 1-6. If certain operations were performed after step 3 (the actual logic of a test scenario), the total time would be 20 + delta, where delta is the time consumed by those intermediate steps.
In the current design, non-disruptive tests are those test cases that can run concurrently with other test cases; in other words, they don't affect other tests in a negative manner, i.e. they don't produce behavior that prevents us from analyzing the real test scenario.
Now, even though each of these non-disruptive test cases consumes around 20 seconds (for example), suppose we run 100 test cases, each against 4 volume types, all of them non-disruptive. Run sequentially, the total time would be around:
20*4*100 = 8000 seconds
In the case of redant, since they fall under the non-disruptive category, and assuming we have 4 worker threads, the time taken would be around:
(20*4*100)/4 = 2000 seconds
Sounds good, right?
Yes, it does. But our intention is not to become complacent after a certain reduction. The next step is to see where the time is being spent. That's where we find that unmounting, stopping, and deleting a volume takes around 9 seconds per test case. Assuming the creation, start, and mount steps take around 4 seconds, the effective runtime of a test would be 20 - 9 - 4 = 7 seconds.
If we could run the test cases for only their effective time, the total time for the 100 cases would be:
(7*4*100)/4 = 700 seconds
That is a reduction of around 91% compared to the sequential baseline!
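The arithmetic above can be checked with a few lines of Python. All figures are the approximate ones quoted in this post, not measurements:

```python
# Rough runtime arithmetic from the discussion above.
PER_TEST = 20    # seconds per test case, including setup and teardown
VOL_TYPES = 4    # volume types each test case runs against
TESTS = 100      # number of non-disruptive test cases
WORKERS = 4      # concurrent worker threads

sequential = PER_TEST * VOL_TYPES * TESTS        # all runs back to back
concurrent = sequential // WORKERS               # current redant flow

SETUP = 4        # create + start + mount
TEARDOWN = 9     # unmount + stop + delete
effective = PER_TEST - SETUP - TEARDOWN          # time spent on actual test logic
reused = (effective * VOL_TYPES * TESTS) // WORKERS  # flow with volume reuse

reduction_pct = round(100 * (1 - reused / sequential))
print(sequential, concurrent, reused, reduction_pct)
# 8000 2000 700 91
```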
The question now is how we achieve this reduction. What is the one set of actions that every test case performs and that consumes the big chunk of time? Setup and teardown. It's as simple as that.
Also, given that this is a testing framework testing a filesystem: would an actual user create volumes in isolation, or re-create a new volume for every new operation? NO. The same old volume is reused, and that is the solution.
The current flow treats all non-disruptive test cases the same, letting them create and also destroy things (in a manner that isn't disruptive to the operation of other tests running in parallel). This implies that a worker thread can pick up a job (a test case run) of any volume type. If we keep the same flow but create the volumes before the non-disruptive tests even start, and use those volumes to run the tests, we get somewhat near the desired flow, but with a problem: contention. If two workers pick up test cases of the same volume type, they will contend for the one volume we created. So do we create an extra volume in that scenario? (That would just be the same old behavior.)
But that is not a clean solution. We cannot just throw in more resources because the framework cannot handle the existing requirements of the tests.
The proposed solution is to make each worker responsible for one volume type, with each worker draining a queue that holds only the non-disruptive tests of the volume type it is responsible for. This way we can control the behavior of the worker threads and prevent two tests of the same volume type from contending for the existing volume.
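A minimal sketch of this dispatch model, using Python's standard `queue` and `threading` modules (the volume-type names and the `create/run` stand-ins are illustrative, not redant's actual API):

```python
import queue
import threading

# One worker per volume type, each draining a queue that holds only the
# non-disruptive tests for that type, so no two tests ever contend for
# the same volume.
VOLUME_TYPES = ["dist", "rep", "dist-rep", "disp"]  # illustrative names

queues = {vtype: queue.Queue() for vtype in VOLUME_TYPES}

def worker(vtype: str, results: list) -> None:
    # Create the volume once; every test for this type reuses it.
    volume = f"{vtype}-vol"           # stands in for create/start/mount
    while True:
        test = queues[vtype].get()
        if test is None:              # sentinel: no more tests queued
            break
        results.append(test(volume))  # run the test against the shared volume
    # Teardown (unmount/stop/delete) would happen once, here, not per test.

results: list = []
threads = [threading.Thread(target=worker, args=(v, results)) for v in VOLUME_TYPES]
for t in threads:
    t.start()

# Enqueue the same illustrative test for every volume type, then signal done.
for v in VOLUME_TYPES:
    queues[v].put(lambda vol: f"ran on {vol}")
    queues[v].put(None)
for t in threads:
    t.join()

print(sorted(results))
```

Because each queue is owned by exactly one worker, no locking around the volume itself is needed: serialization per volume type falls out of the queue structure.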
But isn't volume creation sliding a little back from the responsibility of the test case to that of the framework? Yes, it is, and that is how we can obtain both.
Won't this reduce the agnostic nature of the test framework? Yes, it does, but we need to understand what we are trying to achieve here. The test framework is not the end; it is a means to an end. Without Gluster there is no redant (at least for now; let's not kid ourselves into taking this to a bigger stage at this point in time, we need to pass the litmus test before going ahead and de-coupling things). So, if the test framework lifts some extra weight and we get a 91% improvement because of it, I'm sold.
Will this cover all tests? Not in all cases. For instance, a test case that demands no volume gets a free pass from all this hassle. Test cases that themselves create a good number of volumes won't fall under this either. And finally, disruptive test cases cannot come under this category or flow.
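The exclusions above amount to a simple eligibility predicate. A sketch, where the field names are assumptions for illustration and not redant's actual metadata:

```python
# Hypothetical predicate deciding whether a test case may run against
# the shared, pre-created volume of its volume type.
def can_reuse_volume(test: dict) -> bool:
    if test.get("disruptive"):            # disruptive tests are excluded
        return False
    if not test.get("needs_volume"):      # no-volume tests skip this flow entirely
        return False
    if test.get("creates_many_volumes"):  # heavy volume-creators are excluded
        return False
    return True

print(can_reuse_volume({"disruptive": False, "needs_volume": True}))   # True
print(can_reuse_volume({"disruptive": True, "needs_volume": True}))    # False
```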
NOTE: The calculations above are rough figures; actual numbers will vary.