Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yokogawa WT310E ranging failures at under 0.2 A #253

Closed
psyhtest opened this issue Aug 1, 2021 · 28 comments
Closed

Yokogawa WT310E ranging failures at under 0.2 A #253

psyhtest opened this issue Aug 1, 2021 · 28 comments

Comments

@psyhtest
Copy link
Contributor

psyhtest commented Aug 1, 2021

I'm investigating a very weird issue with the updated r1.1 workflow. It started with testing workflows on an Edge system. In the Offline scenario, the system power is over 24 Watts, and everything works as expected. In the Single Scenario, however, the system power is under 20 Watts, and I'm unable to take the measurements. Here's a typical failure log on the director side:

08-01-2021 10:56:59.225: WARNING: Uncertainty unknown for the last measurement sample! channel 0
ptd-server 2021-08-01 11:56:59,416 [INFO] Sending to ptd: 'Stop'
08-01-2021 10:56:59.417: Response to client sent: Stopping untimed measurement
ptd-server 2021-08-01 11:56:59,417 [INFO] Reply from ptd: 'Stopping untimed measurement'
ptd-server 2021-08-01 11:56:59,437 [ERROR] Got an exception
Traceback (most recent call last):
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 577, in handle_connection
    reply = self._handle_cmd(cmd, p)
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 650, in _handle_cmd                                     
    return unbool[self.session.stop(Mode.RANGING)]
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 861, in stop                                                                                
    self._maxVolts, self._maxAmps = max_volts_amps(
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 187, in max_volts_amps           
    raise MaxVoltsAmpsNegativeValuesError(f"Could not find values for {mark!r}")
ptd_client_server.lib.server.MaxVoltsAmpsNegativeValuesError: Could not find values for '2021-08-01_11-45-44_ranging'                                                                                      
ptd-server 2021-08-01 11:56:59,437 [INFO] Sending reply to client 'Error: exception'
ptd-server 2021-08-01 11:56:59,452 [INFO] Connection closed         
ptd-server 2021-08-01 11:56:59,452 [WARNING] Client connection closed unexpectedly
ptd-server 2021-08-01 11:56:59,452 [INFO] Sending to ptd: 'Stop'
08-01-2021 10:56:59.923: Completed test
08-01-2021 10:56:59.923: WARNING: No valid watts samples for this measurement!      
08-01-2021 10:56:59.924: Response to client sent: Error: no measurement to stop
ptd-server 2021-08-01 11:56:59,924 [INFO] Reply from ptd: 'Error: no measurement to stop'
ptd-server 2021-08-01 11:56:59,924 [INFO] Sending to ptd: 'SR,V,300.000000'
08-01-2021 10:56:59.924: Volt range set to 300.000000V

Here's how the PTDaemon log looks like on the director side:

me,08-01-2021 10:56:42.225,Watts,-22.100000,Volts,248.180000,Amps,-3.168000,PF,0.456200,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:43.226,Watts,-21.900000,Volts,248.130000,Amps,-3.168000,PF,0.454700,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:44.225,Watts,-21.300000,Volts,248.150000,Amps,-3.163000,PF,0.452500,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:45.225,Watts,-16.800000,Volts,248.130000,Amps,-3.130000,PF,0.427200,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:46.226,Watts,-16.000000,Volts,248.060000,Amps,-3.124000,PF,0.421700,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:47.225,Watts,-16.000000,Volts,248.140000,Amps,-3.125000,PF,0.422000,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:48.225,Watts,-15.700000,Volts,248.140000,Amps,-3.122000,PF,0.419100,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:49.226,Watts,-15.600000,Volts,248.170000,Amps,-3.121000,PF,0.417600,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:50.225,Watts,-15.500000,Volts,248.170000,Amps,-3.121000,PF,0.417300,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:51.225,Watts,-15.100000,Volts,248.090000,Amps,-3.118000,PF,0.413500,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:52.225,Watts,-15.200000,Volts,248.070000,Amps,-3.119000,PF,0.413400,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:53.225,Watts,-15.200000,Volts,248.110000,Amps,-3.119000,PF,0.413100,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:54.226,Watts,-15.200000,Volts,248.180000,Amps,-3.118000,PF,0.414100,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:55.225,Watts,-15.200000,Volts,248.220000,Amps,-3.119000,PF,0.414600,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:56.225,Watts,-15.200000,Volts,248.210000,Amps,-3.118000,PF,0.414400,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:57.226,Watts,-15.200000,Volts,248.230000,Amps,-3.118000,PF,0.415100,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:58.225,Watts,-15.200000,Volts,248.110000,Amps,-3.118000,PF,0.415300,Mark,2021-08-01_11-45-44_ranging
Time,08-01-2021 10:56:59.225,Watts,-15.200000,Volts,248.100000,Amps,-3.118000,PF,0.415200,Mark,2021-08-01_11-45-44_ranging

The power figures are negative but otherwise sensible. The Amps figures are negative and non-sensible.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 1, 2021

Ten hours later, I connected to the meter a different system drawing more power. The PTDaemon log continued:

Time,08-01-2021 21:12:31.035,Watts,-10.000000,Volts,249.930000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:32.036,Watts,-13.200000,Volts,249.940000,Amps,-3.104000,PF,0.390700,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:33.035,Watts,-14.600000,Volts,249.910000,Amps,-3.113000,PF,0.411400,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:34.036,Watts,-10.400000,Volts,249.920000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:35.035,Watts,-10.100000,Volts,249.940000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:36.035,Watts,-9.900000,Volts,249.920000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:37.036,Watts,-10.000000,Volts,249.920000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:38.035,Watts,-11.100000,Volts,249.870000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:39.035,Watts,-10.100000,Volts,249.880000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:40.035,Watts,-14.900000,Volts,249.860000,Amps,-3.116000,PF,0.412400,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:41.036,Watts,-14.900000,Volts,249.820000,Amps,-3.115000,PF,0.413800,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:42.035,Watts,-15.200000,Volts,249.830000,Amps,-3.117000,PF,0.416200,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:43.035,Watts,-14.900000,Volts,249.900000,Amps,-3.115000,PF,0.414300,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:44.035,Watts,-15.100000,Volts,249.900000,Amps,-3.116000,PF,0.416100,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:45.035,Watts,-14.900000,Volts,249.970000,Amps,-3.115000,PF,0.413300,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:46.035,Watts,-16.000000,Volts,249.980000,Amps,-3.124000,PF,0.422400,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:47.035,Watts,-16.200000,Volts,249.970000,Amps,-3.125000,PF,0.423500,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:48.035,Watts,-15.700000,Volts,249.920000,Amps,-3.121000,PF,0.419900,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:49.035,Watts,-15.000000,Volts,249.940000,Amps,-3.116000,PF,0.414400,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:50.036,Watts,-14.900000,Volts,249.930000,Amps,-3.116000,PF,0.412300,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:51.035,Watts,-14.300000,Volts,249.890000,Amps,-3.117000,PF,0.385200,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:52.035,Watts,-26.300000,Volts,249.850000,Amps,-3.197000,PF,0.471800,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:53.035,Watts,25.600000,Volts,249.880000,Amps,0.210000,PF,0.488200,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:12:54.035,Watts,25.900000,Volts,249.940000,Amps,0.212000,PF,0.489900,Mark,2021-08-01_22-12-01_ranging

Note how the negative readings at 21:12:52.035 are followed by the positive readings at 21:12:53.035.

Similar positive and sensible readings continued for 10 minutes when they became negative again on the system winding down the execution:

Time,08-01-2021 21:22:51.035,Watts,27.000000,Volts,250.060000,Amps,0.221000,PF,0.489700,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:52.035,Watts,-19.800000,Volts,250.120000,Amps,-3.160000,PF,0.420200,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:53.035,Watts,-10.100000,Volts,250.140000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:54.035,Watts,-9.800000,Volts,250.130000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:55.036,Watts,-9.900000,Volts,250.130000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:56.035,Watts,-10.100000,Volts,250.110000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:57.035,Watts,-9.600000,Volts,250.130000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:58.035,Watts,-10.100000,Volts,250.160000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:22:59.035,Watts,-9.800000,Volts,250.120000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:23:00.035,Watts,-9.800000,Volts,250.120000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging
Time,08-01-2021 21:23:01.035,Watts,-10.100000,Volts,250.120000,Amps,0.000000,PF,-1.000000,Mark,2021-08-01_22-12-01_ranging

Nevertheless, unlike the example above, the testing phase continued with the positive and sensible readings:

Time,08-01-2021 21:23:15.049,Watts,6.980000,Volts,250.240000,Amps,0.079930,PF,0.349000,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:16.050,Watts,10.000000,Volts,250.210000,Amps,0.102660,PF,0.389100,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:17.049,Watts,11.770000,Volts,250.230000,Amps,0.114740,PF,0.410000,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:18.049,Watts,7.300000,Volts,250.250000,Amps,0.082500,PF,0.353600,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:19.049,Watts,7.230000,Volts,250.280000,Amps,0.082000,PF,0.352300,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:20.049,Watts,7.020000,Volts,250.240000,Amps,0.080000,PF,0.350800,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:21.049,Watts,6.760000,Volts,250.230000,Amps,0.077860,PF,0.347100,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:22.050,Watts,8.500000,Volts,250.210000,Amps,0.091610,PF,0.370700,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:23.049,Watts,7.090000,Volts,250.200000,Amps,0.080550,PF,0.351900,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:24.049,Watts,11.330000,Volts,250.160000,Amps,0.111750,PF,0.405300,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:23:25.049,Watts,11.560000,Volts,250.110000,Amps,0.113520,PF,0.407100,Mark,2021-08-01_22-12-01_testing

right until the very end:

Time,08-01-2021 21:33:32.049,Watts,27.350000,Volts,250.230000,Amps,0.224670,PF,0.486500,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:33.049,Watts,26.850000,Volts,250.180000,Amps,0.221050,PF,0.485500,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:34.049,Watts,27.100000,Volts,250.210000,Amps,0.222110,PF,0.487500,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:35.050,Watts,27.280000,Volts,250.230000,Amps,0.223290,PF,0.488200,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:36.049,Watts,14.070000,Volts,250.250000,Amps,0.140800,PF,0.399200,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:37.049,Watts,7.080000,Volts,250.260000,Amps,0.080990,PF,0.349200,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:38.049,Watts,6.840000,Volts,250.270000,Amps,0.078870,PF,0.346600,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:39.049,Watts,7.110000,Volts,250.290000,Amps,0.081040,PF,0.350300,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:40.050,Watts,6.950000,Volts,250.300000,Amps,0.079890,PF,0.347500,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:41.049,Watts,6.890000,Volts,250.310000,Amps,0.079340,PF,0.347200,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:42.049,Watts,6.990000,Volts,250.290000,Amps,0.080120,PF,0.348300,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:43.049,Watts,6.670000,Volts,250.280000,Amps,0.077750,PF,0.342500,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:44.049,Watts,7.140000,Volts,250.320000,Amps,0.081320,PF,0.350800,Mark,2021-08-01_22-12-01_testing
Time,08-01-2021 21:33:45.050,Watts,6.870000,Volts,250.320000,Amps,0.079120,PF,0.346900,Mark,2021-08-01_22-12-01_testing

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 1, 2021

In the director logs, the short negative section at the end of the ranging run manifested itself as follows:

ptd-server 2021-08-01 22:22:51,254 [INFO] Got command from the client 'session,2021-08-01_22-12-01,stop,ranging'
08-01-2021 21:22:52.035: ERROR: Can't evaluate uncertainty of this sample!                                                                                                                                 
08-01-2021 21:22:52.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0                                                                                                           
08-01-2021 21:22:53.035: ERROR: Bad pf reading inf from WT310                                                                                                                                              
Buffer was 20.000E+00;0;300.0E+00;1                                                                                                                                                                        
                                                                                                                                                                                                           
08-01-2021 21:22:53.035: ERROR: Can't evaluate uncertainty of this sample!                                                                                                                                 
08-01-2021 21:22:53.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0                                                                                                           
08-01-2021 21:22:54.035: ERROR: Bad pf reading inf from WT310                                                                                                                                              
Buffer was 20.000E+00;0;300.0E+00;1                                                                                                                                                                        
                                                                                                                                                                                                           
08-01-2021 21:22:54.035: ERROR: Can't evaluate uncertainty of this sample!                                                                                                                                 
08-01-2021 21:22:54.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0                                                                                                           
08-01-2021 21:22:55.036: ERROR: Bad pf reading inf from WT310              
Buffer was 20.000E+00;0;300.0E+00;1                                                                  
                                                                                                     
08-01-2021 21:22:55.036: ERROR: Can't evaluate uncertainty of this sample!                      
08-01-2021 21:22:55.036: WARNING: Uncertainty unknown for the last measurement sample! channel 0
08-01-2021 21:22:56.035: ERROR: Bad pf reading inf from WT310            
Buffer was 20.000E+00;0;300.0E+00;1
                                                                                                     
08-01-2021 21:22:56.035: ERROR: Can't evaluate uncertainty of this sample! 
08-01-2021 21:22:56.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0
08-01-2021 21:22:57.035: ERROR: Bad pf reading inf from WT310                                    
Buffer was 20.000E+00;0;300.0E+00;1                                                                  
                                                                                                                                                                                                           
08-01-2021 21:22:57.035: ERROR: Can't evaluate uncertainty of this sample!                                                                                                                                 
08-01-2021 21:22:57.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0
08-01-2021 21:22:58.035: ERROR: Bad pf reading inf from WT310                                                                                                                                              
Buffer was 20.000E+00;0;300.0E+00;1                                                                                                                                                                        
                                                                                                                                                                                                           
08-01-2021 21:22:58.035: ERROR: Can't evaluate uncertainty of this sample!                                                                                                                                 
08-01-2021 21:22:58.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0                                                                                                           
08-01-2021 21:22:59.035: ERROR: Bad pf reading inf from WT310
Buffer was 20.000E+00;0;300.0E+00;1
                                                                                                                                                                                                           
08-01-2021 21:22:59.035: ERROR: Can't evaluate uncertainty of this sample!                                                                                                                                 
08-01-2021 21:22:59.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0                                                                                                           
08-01-2021 21:23:00.035: ERROR: Bad pf reading inf from WT310                                                                                                                                              
Buffer was 20.000E+00;0;300.0E+00;1                                                                                                                                                                        

08-01-2021 21:23:00.035: ERROR: Can't evaluate uncertainty of this sample!
08-01-2021 21:23:00.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0
08-01-2021 21:23:01.035: ERROR: Bad pf reading inf from WT310
Buffer was 20.000E+00;0;300.0E+00;1

08-01-2021 21:23:01.035: ERROR: Can't evaluate uncertainty of this sample!
08-01-2021 21:23:01.035: WARNING: Uncertainty unknown for the last measurement sample! channel 0
ptd-server 2021-08-01 22:23:01,265 [INFO] Sending to ptd: 'Stop'
08-01-2021 21:23:01.265: Response to client sent: Stopping untimed measurement
ptd-server 2021-08-01 22:23:01,265 [INFO] Reply from ptd: 'Stopping untimed measurement'
ptd-server 2021-08-01 22:23:01,286 [INFO] Sending reply to client 'OK'
ptd-server 2021-08-01 22:23:01,372 [INFO] Got command from the client 'time'
ptd-server 2021-08-01 22:23:01,372 [INFO] Sending reply to client '1627852981.3721929'
ptd-server 2021-08-01 22:23:01,376 [INFO] Got command from the client 'session,2021-08-01_22-12-01,start,testing'
ptd-server 2021-08-01 22:23:01,376 [INFO] Sending to ptd: 'SR,V,250.270000'
08-01-2021 21:23:01.733: Completed test
08-01-2021 21:23:01.733: Avg watts 26.780968, min watts 25.600000, max watts 27.800000, samples 631, errors 32, valid 599

The subsequent testing run was error-free:

08-01-2021 21:33:45.747: Completed test
08-01-2021 21:33:45.747: Avg watts 26.195943, min watts 6.670000, max watts 28.210000, samples 631, errors 0, valid 631

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 1, 2021

I've launched another workload which goes just above 24 Watts. The PTDaemon log is oscillating between positive and negative values:

Time,08-01-2021 22:21:24.318,Watts,24.300000,Volts,251.890000,Amps,0.200000,PF,0.482800,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:25.318,Watts,-27.100000,Volts,251.850000,Amps,-3.199000,PF,0.481400,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:26.319,Watts,-27.100000,Volts,251.880000,Amps,-3.199000,PF,0.480900,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:27.318,Watts,-26.900000,Volts,251.870000,Amps,-3.198000,PF,0.480600,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:28.318,Watts,24.600000,Volts,251.910000,Amps,0.202000,PF,0.482600,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:29.319,Watts,-27.100000,Volts,251.880000,Amps,-3.199000,PF,0.480500,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:30.318,Watts,24.300000,Volts,251.910000,Amps,0.200000,PF,0.482300,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:31.318,Watts,24.500000,Volts,251.920000,Amps,0.201000,PF,0.484200,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:32.319,Watts,-27.100000,Volts,251.930000,Amps,-3.199000,PF,0.481000,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:33.318,Watts,24.300000,Volts,251.920000,Amps,0.200000,PF,0.480900,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:34.319,Watts,24.500000,Volts,251.890000,Amps,0.202000,PF,0.481500,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:35.318,Watts,-27.200000,Volts,251.890000,Amps,-3.199000,PF,0.480800,Mark,2021-08-01_23-18-11_ranging

But on the meter itself, the Watt display shows 0.0241 .. 0.0249 kW, while the Amps display shows 0.199 .. 0.203. It must be crossing this value 0.2 A back and forth that causes the kerfuffle with the log. Indeed, this is exactly 1% of the 20 Amps range, maximum for the WT310E.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 1, 2021

Note that we cannot simply take the absolute values:

Time,08-01-2021 22:21:32.319,Watts,-27.100000,Volts,251.930000,Amps,-3.199000,PF,0.481000,Mark,2021-08-01_23-18-11_ranging
Time,08-01-2021 22:21:33.318,Watts,24.300000,Volts,251.920000,Amps,0.200000,PF,0.480900,Mark,2021-08-01_23-18-11_ranging

24.3 W is not close to 27.1 W, even though 0.200 A and 0.199 A are. (One should replace -3. with 0. for negative readings.)

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 1, 2021

I guess that when a system has at least some readings above 0.200 A in the ranging run, the workflow can proceed to the testing run. Unfortunately, this will affect a lot of systems Krai measured for v1.0 and are looking to update for v1.1.

@s-idgunji @araghun At the moment I suggest we revert to the old behaviour. That is, perform the ranging run in the Auto ranging mode. I will look into preparing a patch for our Wednesday meeting.

@psyhtest psyhtest changed the title Ranging phase failures at below ~22 Watts Yokogawa WT310E ranging failures at under 0.2 A Aug 1, 2021
@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

We have almost all of the efficientnet-lite family producing negative readings with armnn-v21.05-neon on xavier, except for efficientnet-lite4-non-quantized. It's average power with armnn-v21.02-neon in the last round was 22.07 W - the highest of all models.

Most of the ranging trace is under the radar (0.2 A ~ 23.4 W), fewer than 2.5% of the samples are valid:

08-02-2021 08:30:35.103: Avg watts 23.593333, min watts 23.400000, max watts 24.000000, samples 627, errors 612, valid 15

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

I'm not sure if it's related to the power workflow update, but the overheads when measuring power seem to have increased dramatically, at least for resnet50 on xavier with tflite:

Verifying performance.
reference score = 112767727
test score = 86564185

I get the same high execution time (~113 ms) in the ranging and testing power runs, and the same low execution time (~87 ms) in the TEST01 and TEST05 compliance tests. That's a 30% overhead, not 10% as previously observed (and for armnn, not tflite, on xavier).

@s-idgunji
Copy link
Contributor

@psyhtest - Let's discuss this tomorrow in the WG with @araghun and the rest of the Power WG. There has been no change to the base flow, other than accepting submits that were done as improvements during v1.0. What we should see is if other teams see similar issues or is it tied to the other issue that you had mentioned (CPU only submissions that are low power envelope get loaded by Power infra flow)

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

@s-idgunji We will certainly rerun the same experiment (~113 ms) to see if it's just a blip.

But changing the ranging phase to use the maximum range of a meter instead of auto ranging has proved to be a wrong decision.

@s-idgunji
Copy link
Contributor

@s-idgunji We will certainly rerun the same experiment (~113 ms) to see if it's just a blip.

But changing the ranging phase to use the maximum range of a meter instead of auto ranging has proved to be a wrong decision.

@psyhtest - I don't think you can say that. The approach is right. And it eliminates issues. Perhaps for very low power system submissions there may be a cost. But then, we need to understand how to eliminate ranging related errors in our flow. Why is it that using max range and then setting the highest observed load for perf testing causing an issue ? Can you explain that ?

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

@s-idgunji As I explained above, for systems that draw less than 0.2 A, the ranging log is full of negative and non-sensible values. The workflow doesn't even allow one to proceed to the testing phase failing with:

ptd-server 2021-08-01 11:56:59,437 [ERROR] Got an exception
Traceback (most recent call last):
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 577, in handle_connection
    reply = self._handle_cmd(cmd, p)
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 650, in _handle_cmd                                     
    return unbool[self.session.stop(Mode.RANGING)]
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 861, in stop                                                                                
    self._maxVolts, self._maxAmps = max_volts_amps(
  File "/home/anton/projects/mlperf/power-dev-r1.1/ptd_client_server/../ptd_client_server/lib/server.py", line 187, in max_volts_amps           
    raise MaxVoltsAmpsNegativeValuesError(f"Could not find values for {mark!r}")
ptd_client_server.lib.server.MaxVoltsAmpsNegativeValuesError: Could not find values for '2021-08-01_11-45-44_ranging'                                                                                      
ptd-server 2021-08-01 11:56:59,437 [INFO] Sending reply to client 'Error: exception'

If we don't fix it, this will affect hundreds of submissions, including customer-critical ones.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

I can offer an analogy with floating-point calculations: we have an "underflow" here. But instead of zeros we get all negative values.

@araghun
Copy link
Contributor

araghun commented Aug 3, 2021

@s-idgunji , @psyhtest : let us discuss this tomorrow in the WG.
Looks like the question to evaluate is:

  1. Is the change from using Auto to max causing this issue (in other words, if we revert back for experimentation, does this issue go away). If this happens, we will need to understand why is it happening (are the -ve values messing around with the functioning), and,
  2. Is this an issue for other submitters as well

@s-idgunji
Copy link
Contributor

s-idgunji commented Aug 3, 2021

@psyhtest - Thanks for the analogy, but that was not my ask. It was what is causing it and if it is specific to a certain class of systems (e.g. low power) . Seems like the change has been in for a while but we'd not seen any validation driven errors reported before now.

I will add Dejan to this issue to help track. Update : I cannot add Dejan , so have forwarded the meeting.. next best thing.

@s-idgunji
Copy link
Contributor

s-idgunji commented Aug 3, 2021

I'm not sure if it's related to the power workflow update, but the overheads when measuring power seem to have increased dramatically, at least for resnet50 on xavier with tflite:

Verifying performance.
reference score = 112767727
test score = 86564185

I get the same high execution time (~113 ms) in the ranging and testing power runs, and the same low execution time (~87 ms) in the TEST01 and TEST05 compliance tests. That's a 30% overhead, not 10% as previously observed (and for armnn, not tflite, on xavier).

Also checking, how can you measure the overhead when the flow itself does not allow you to proceed per other comments .

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

@s-idgunji: Seems like the change has been in for a while but we'd not seen any validation driven errors reported before now.

@araghun: Is this an issue for other submitters as well

As I'm the only submitter experimenting with low-power systems at the moment, I'm not surprised this hasn't been reported before. Until last week, I'd been validating the r1.1 power workflow only with systems drawing more than 25 Watts, where it is not an issue.

@s-idgunji: Also checking, how can you measure the overhead when the flow itself does not allow you to proceed per other comments

Oh, but resnet50 on xavier draws more than 25 Watts (see above), so 95% of the samples are positive:

08-01-2021 21:23:01.733: Avg watts 26.780968, min watts 25.600000, max watts 27.800000, samples 631, errors 32, valid 599

But most other models we use are indeed draw under 22 Watts even on xavier, so I can't really measure anything, leave alone any overheads. (Also, they are for the open division, so I don't run the compliance tests for them.)

@araghun : Is the change from using Auto to max causing this issue (in other words, if we revert back for experimentation, does this issue go away).

I should confirm that by the WG meeting tomorrow.

Thanks.

@s-idgunji
Copy link
Contributor

@psyhtest - Given this is just your reporting, and also having covered systems > 25W (in your case) where there are no issues, I do not think we can say this method is wrong. The question is can we resolve this in time and how. We can discuss tomorrow.

One of the reasons we've been asking submitters to test early and report since the changes have been so incremental and code ready to test for a while.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

One of the reasons we've been asking submitters to test early and report since the changes have been so incremental and code ready to test for a while.

Oh, I have been testing it for 2-3 months now, just not across the full range of systems and scenarios! :)

Looking at submissions 1.0-633,634 in Edge - Closed - Power, I expect NVIDIA to run into exactly the same issue on all Jetson Xavier NX (MaxQ, TensorRT) benchmarks and some Jetson AGX Xavier 32GB (MaxQ, TensorRT) benchmarks. Do you know if someone at NVIDIA has already tested this?

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

the changes have been so incremental

I hope so. It's easier to revert incremental changes.

@s-idgunji
Copy link
Contributor

the changes have been so incremental

I hope so. It's easier to revert incremental changes.

Defeats the purpose of the change if it is made in the right direction and works for most submitters. Let's figure out the solution tomorrow.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

@s-idgunji The previous version of the workflow worked for ALL submitters. The current version of the workflow would exclude 80-90% of the power results from the previous round. Would we carry on regardless?

@s-idgunji
Copy link
Contributor

@s-idgunji The previous version of the workflow worked for ALL submitters. The current version of the workflow would exclude 80-90% of the power results from the previous round. Would we carry on regardless?

@psyhtest - I do not understand this point. Are you saying that % of power results submitted should dictate the decision ? So if a submitter was contributing 1% of the results maybe we would not consider ? To me that's what the statement appears to suggest. Rather, let's look at the solution and options live. As far as I know other submitters < 25W have not seen this issue, but I can check with Ashwin/Dilip about NVIDIA's testing.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 3, 2021

I've prepared a new branch which reverts two commits to server.py and updates its checksum.

On restarting the server service, I no longer see negative non-sensible values:

Time,08-03-2021 22:50:03.188,Watts,6.571000,Volts,249.370000,Amps,0.072650,PF,0.362700,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:04.189,Watts,9.666000,Volts,249.310000,Amps,0.094920,PF,0.408400,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:05.188,Watts,11.325000,Volts,249.260000,Amps,0.105710,PF,0.429800,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:06.188,Watts,6.852000,Volts,249.350000,Amps,0.074790,PF,0.367400,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:07.189,Watts,7.045000,Volts,249.340000,Amps,0.076170,PF,0.371000,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:08.188,Watts,6.911000,Volts,249.310000,Amps,0.074940,PF,0.369900,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:09.188,Watts,6.543000,Volts,249.310000,Amps,0.072170,PF,0.363700,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:10.188,Watts,8.319000,Volts,249.300000,Amps,0.085390,PF,0.390800,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:11.188,Watts,7.173000,Volts,249.210000,Amps,0.077330,PF,0.372200,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:12.189,Watts,11.307000,Volts,249.280000,Amps,0.104750,PF,0.433000,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:13.188,Watts,11.395000,Volts,249.210000,Amps,0.105530,PF,0.433300,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:14.188,Watts,11.411000,Volts,249.170000,Amps,0.105550,PF,0.433900,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:15.188,Watts,11.232000,Volts,249.180000,Amps,0.104310,PF,0.432100,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:16.188,Watts,11.613000,Volts,249.210000,Amps,0.107100,PF,0.435100,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:17.188,Watts,11.579000,Volts,249.200000,Amps,0.106830,PF,0.434900,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:18.188,Watts,11.746000,Volts,249.190000,Amps,0.107960,PF,0.436600,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:19.188,Watts,11.477000,Volts,249.250000,Amps,0.106070,PF,0.434100,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:20.188,Watts,11.818000,Volts,249.250000,Amps,0.108530,PF,0.436800,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:21.188,Watts,11.732000,Volts,249.270000,Amps,0.107970,PF,0.435900,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:22.188,Watts,9.993000,Volts,249.270000,Amps,0.096520,PF,0.415300,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:23.188,Watts,-1.000000,Volts,-1.000000,Amps,-1.000000,PF,-1.000000,Mark,2021-08-03_23-49-34_ranging
Time,08-03-2021 22:50:24.188,Watts,26.130000,Volts,249.230000,Amps,0.212740,PF,0.492800,Mark,2021-08-03_23-49-34_ranging

except from the quick change to the next range on the way up and then back.

This produces only two bad samples:

08-03-2021 23:00:32.886: Avg watts 26.480325, min watts 6.543000, max watts 28.390000, samples 630, errors 2, valid 628

which I have seen happening for v1.0 as well e.g. here.

Tomorrow we are going to test it for a complete family of networks crossing the 0.2 A boundary.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 4, 2021

Are you saying that % of power results submitted should dictate the decision ?
So if a submitter was contributing 1% of the results maybe we would not consider

That's categorically not what I'm saying. We should consider all submitters, whether they are in the minority or in the majority.

As far as I know other submitters < 25W have not seen this issue, but I can check with Ashwin/Dilip about NVIDIA's testing.

Coming to think about this, WT333E can behave differently from WT310E in this regard. Let's discuss today.

@s-idgunji
Copy link
Contributor

Are you saying that % of power results submitted should dictate the decision ?
So if a submitter was contributing 1% of the results maybe we would not consider

That's categorically not what I'm saying. We should consider all submitters, whether they are in the minority or in the majority.

Thanks for clarifying because it was not clear why we referred to % of submissions. There is no question that we have to help resolve issues for every submitter. I think Dejan has reached out to understand the issue and can explain in the meeting. But we'll have to look for some solution that allows for submitters who have already prepared results with current version , vs any reverted version proposed (i.e. back to auto) so that we allow for least impact.

@psyhtest
Copy link
Contributor Author

psyhtest commented Aug 5, 2021

Just a quick update. I've been collecting test data to verify a potential solution: 1) measurements taken with the top of r1.1; 2) measurements taken with the auto-ranging patch. Once I validate the solution works for both, I'll make a PR to r1.1.

@s-idgunji
Copy link
Contributor

Thanks . @psyhtest - Has the main Inference v1.1 pointed to the Power v1.1 instead of power-dev. Do I need to follow up ?

@gregdarnell
Copy link

I'm not sure if you are looking for an explanation for this behavior, but I can explain. The WT310 specification says that its effective input range is 1% to 130% of range for voltage and current. If you're in the 20A range and drawing less than 0.2A, you are outside the analyzer's specifications and the data is invalid. PTDaemon marks all invalid data by using negative values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants