Yokogawa WT310E ranging failures at under 0.2 A #253
Ten hours later, I connected a different system, which draws more power, to the meter. The PTDaemon log continued:
Note the negative readings. Similar positive and sensible readings continued for 10 minutes, until they became negative again as the system was winding down the execution:
Nevertheless, unlike the example above, the testing phase continued with positive and sensible readings:
right until the very end:
In the director logs, the short negative section at the end of the ranging run manifested itself as follows:
The subsequent testing run was error-free:
I've launched another workload, which draws just above 24 Watts. The PTDaemon log oscillates between positive and negative values:
But on the meter itself, the Watt display shows:
Note that we cannot simply take the absolute values:
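To spell out why (a minimal sketch with made-up sample values, not part of the workflow): a negative reading from PTDaemon is an invalid-sample marker rather than a sign-flipped measurement, so rectifying it just manufactures data; the only safe option is to discard it.

```python
# Minimal sketch with made-up values: why abs() cannot recover the trace.
# PTDaemon uses negative values to flag invalid samples, not to record a
# real measurement with the sign flipped.

def usable_watts(samples):
    """Keep only the samples PTDaemon reported as valid (non-negative)."""
    return [w for w in samples if w >= 0.0]

# Hypothetical trace: valid readings near 24 W interleaved with invalid markers,
# some of which happen to have a plausible magnitude and some of which do not.
trace = [23.9, 24.1, -23.8, -0.4, 24.0]

print(usable_watts(trace))        # [23.9, 24.1, 24.0]
print([abs(w) for w in trace])    # abs() would keep a bogus 0.4 W point as data
```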
I guess this oscillation happens when a system has at least some readings above the 0.2 A threshold. @s-idgunji @araghun At the moment I suggest we revert to the old behaviour, that is, perform the ranging run in Auto ranging mode. I will look into preparing a patch for our Wednesday meeting.
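For context on the two strategies being discussed, here is a rough sketch of the logic; the range table and helper name are illustrative assumptions, not the actual PTDaemon or power-dev code.

```python
# Illustrative only: the range table below is an assumption, not the WT310E's
# exact list of current ranges.
CURRENT_RANGES_A = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]

def pick_testing_range(max_observed_amps, ranges=CURRENT_RANGES_A):
    """r1.1-style behaviour as described in this thread: run the ranging phase
    on the meter's maximum range, then lock the testing phase to the smallest
    range that covers the highest observed current."""
    for r in sorted(ranges):
        if max_observed_amps <= r:
            return r
    return max(ranges)

# The failure mode: on the 20 A range the effective input window starts at
# 1% of range (0.2 A), so a device drawing ~0.1 A never yields a valid sample
# during the ranging run, and there is no sensible maximum current to lock to.
# Reverting to auto ranging lets the meter drop to a low range where such a
# current sits well inside the valid window.
```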
We have almost all of the efficientnet-lite family producing negative readings. Most of the ranging trace is under the radar (0.2 A ≈ 23.4 W); fewer than 2.5% of the samples are valid:
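As a rough way to quantify this, something like the sketch below could count the fraction of valid samples in a ranging log; it assumes a comma-separated key/value layout for the PTDaemon sample log, which may not match the actual file exactly.

```python
# Sketch only: assumes each log line contains "...,Watts,<value>,..." fields.
def valid_fraction(log_path):
    total = valid = 0
    with open(log_path) as f:
        for line in f:
            fields = line.strip().split(",")
            if "Watts" not in fields:
                continue
            watts = float(fields[fields.index("Watts") + 1])
            total += 1
            if watts >= 0.0:
                valid += 1
    return valid / total if total else 0.0

# e.g. valid_fraction("ranging_spl.txt") -> 0.024 for a trace where fewer than
# 2.5% of the samples are valid.
```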
I'm not sure if it's related to the power workflow update, but the overheads when measuring power seem to have increased dramatically, at least for this workload.
I get the same high execution time (~113 ms) in the ranging and testing power runs, and the same low execution time (~87 ms) in the TEST01 and TEST05 compliance tests. That's a 30% overhead ((113 - 87) / 87 ≈ 0.30), not the 10% previously observed.
@psyhtest - Let's discuss this tomorrow in the WG with @araghun and the rest of the Power WG. There has been no change to the base flow, other than accepting changes that were submitted as improvements during v1.0. What we should check is whether other teams see similar issues, or whether it is tied to the other issue you had mentioned (CPU-only submissions with a low power envelope getting loaded by the power infra flow).
@s-idgunji We will certainly rerun the same experiment (~113 ms) to see if it's just a blip. But changing the ranging phase to use the maximum range of the meter instead of auto ranging has proved to be the wrong decision.
@psyhtest - I don't think you can say that. The approach is right, and it eliminates issues. Perhaps for very low power system submissions there may be a cost. But then we need to understand how to eliminate ranging-related errors in our flow. Why is using the max range, and then setting the range from the highest observed load for the performance run, causing an issue? Can you explain that?
@s-idgunji As I explained above, for systems that draw less than 0.2 A, the ranging log is full of negative and non-sensible values. The workflow doesn't even allow one to proceed to the testing phase, failing with:
If we don't fix it, this will affect hundreds of submissions, including customer-critical ones.
I can offer an analogy with floating-point calculations: we have an "underflow" here. But instead of zeros, we get all negative values.
@s-idgunji, @psyhtest: let us discuss this tomorrow in the WG.
@psyhtest - Thanks for the analogy, but that was not my ask. It was: what is causing it, and is it specific to a certain class of systems (e.g. low power)? It seems the change has been in for a while, but we had not seen any validation-driven errors reported before now. I will add Dejan to this issue to help track. Update: I cannot add Dejan, so I have forwarded the meeting instead; next best thing.
Also checking: how can you measure the overhead when, per your other comments, the flow itself does not allow you to proceed?
As I'm the only submitter experimenting with low-power systems at the moment, I'm not surprised this hasn't been reported before. Until last week, I'd been validating the r1.1 power workflow only with systems drawing more than 25 Watts, where it is not an issue.
Oh, but come to think of it, most other models we use do indeed draw under 22 Watts even on that system. I should confirm that by the WG meeting tomorrow. Thanks.
@psyhtest - Given this is just your reporting, and given that the systems > 25 W (in your case) show no issues, I do not think we can say this method is wrong. The question is whether we can resolve this in time, and how. We can discuss tomorrow. This is one of the reasons we've been asking submitters to test early and report: the changes have been incremental, and the code has been ready to test for a while.
Oh, I have been testing it for 2-3 months now, just not across the full range of systems and scenarios! :) Looking at submissions 1.0-633,634 in Edge - Closed - Power, I expect NVIDIA to run into exactly the same issue.
I hope so. It's easier to revert incremental changes.
That defeats the purpose of the change if it is made in the right direction and works for most submitters. Let's figure out the solution tomorrow.
@s-idgunji The previous version of the workflow worked for ALL submitters. The current version of the workflow would exclude 80-90% of the power results from the previous round. Would we carry on regardless?
@psyhtest - I do not understand this point. Are you saying that the % of power results submitted should dictate the decision? So if a submitter was contributing 1% of the results, maybe we would not consider them? To me, that's what the statement appears to suggest. Rather, let's look at the solutions and options live. As far as I know, other submitters with systems under 25 W have not seen this issue, but I can check with Ashwin/Dilip about NVIDIA's testing.
I've prepared a new branch which reverts two commits. On restarting the server service, I no longer see negative, non-sensible values:
except during the quick change to the next range on the way up and then back down. This produces only two bad samples:
which I have seen happening for v1.0 as well, e.g. here. Tomorrow we are going to test it on a complete family of networks crossing the 0.2 A boundary.
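To make the distinction between the two failure modes concrete (a couple of invalid samples around an auto-range transition versus a trace that is almost entirely invalid), here is a hypothetical post-processing sketch; the helper and thresholds are illustrative, not part of the workflow.

```python
# Hypothetical sketch: tolerate brief invalid runs caused by auto-range
# transitions, but reject a trace that is mostly invalid.
def classify_trace(watts, max_transient=2, min_valid_fraction=0.95):
    if not watts:
        return "empty trace"
    longest_bad_run = bad = 0
    for w in watts:
        bad = bad + 1 if w < 0 else 0
        longest_bad_run = max(longest_bad_run, bad)
    valid_fraction = sum(w >= 0 for w in watts) / len(watts)
    if valid_fraction >= min_valid_fraction and longest_bad_run <= max_transient:
        return "ok: only range-change transients"
    return "invalid: readings outside the meter's effective range?"
```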
That's categorically not what I'm saying. We should consider all submitters, whether they are in the minority or in the majority.
Come to think of it, the WT333E can behave differently from the WT310E in this regard. Let's discuss today.
Thanks for clarifying, because it was not clear why we referred to the % of submissions. There is no question that we have to help resolve issues for every submitter. I think Dejan has reached out to understand the issue and can explain in the meeting. But we'll have to look for a solution that works both for submitters who have already prepared results with the current version and for any reverted version proposed (i.e. back to auto ranging), so that we allow for the least impact.
Just a quick update. I've been collecting test data to verify a potential solution: 1) measurements taken with the top of r1.1; 2) measurements taken with the auto-ranging patch. Once I validate the solution works for both, I'll make a PR to r1.1.
Thanks. @psyhtest - Has the main Inference v1.1 been pointed to Power v1.1 instead of power-dev? Do I need to follow up?
I'm not sure if you are looking for an explanation for this behavior, but I can explain. The WT310 specification says that its effective input range is 1% to 130% of range for voltage and current. If you're in the 20 A range and drawing less than 0.2 A, you are outside the analyzer's specifications and the data is invalid. PTDaemon marks all invalid data by using negative values.
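Spelling that out with the numbers quoted above (a small sketch; the 0.5 A range in the last line is just an illustrative value):

```python
# Effective input window per the WT310 spec quoted above: 1% to 130% of the
# selected range. On the 20 A range, anything under 0.2 A is out of spec.
def effective_window(range_amps, low_pct=0.01, high_pct=1.30):
    return range_amps * low_pct, range_amps * high_pct

def in_spec(reading_amps, range_amps):
    low, high = effective_window(range_amps)
    return low <= reading_amps <= high

print(effective_window(20.0))   # (0.2, 26.0)
print(in_spec(0.1, 20.0))       # False -> PTDaemon flags the sample as invalid
print(in_spec(0.1, 0.5))        # True on a 0.5 A range (illustrative value)
```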
I'm investigating a very weird issue with the updated r1.1 workflow. It started with testing workflows on an Edge system. In the Offline scenario, the system power is over 24 Watts, and everything works as expected. In the SingleStream scenario, however, the system power is under 20 Watts, and I'm unable to take the measurements. Here's a typical failure log on the director side:
Here's how the PTDaemon log looks on the director side:
The power figures are negative but otherwise sensible. The Amps figures are negative and non-sensible.