Trigger latencies + monitoring (v5) #342
Conversation
From the point of view of what information is published, this seems ok.
The usage of locking, however, is questionable at the moment.
include/trigger/Latency.hpp
std::atomic<uint64_t> m_latency_in; // Member variable to store latency_in
std::atomic<uint64_t> m_latency_out; // Member variable to store latency_out
std::atomic<double> m_clock_ticks_conversion; // Dynamically adjusted conversion factor for clock ticks
mutable std::mutex m_mutex;
This class implementation is a bit odd from the technical point of view. The mutex requirement is unclear, because you already have atomic counters. If needed, we can discuss it.
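For illustration, a minimal sketch of the atomics-only shape this comment seems to point at, assuming the class only needs independent per-counter updates; the method names and the conversion value are assumptions, not the actual Latency.hpp implementation:

#include <atomic>
#include <cstdint>

// Sketch: with std::atomic members, individual loads/stores are already
// thread-safe, so a mutex is only needed if several members must be read or
// written together as one consistent snapshot.
class LatencySketch
{
public:
  void update_latency_in(uint64_t ticks)  { m_latency_in.store(ticks, std::memory_order_relaxed); }
  void update_latency_out(uint64_t ticks) { m_latency_out.store(ticks, std::memory_order_relaxed); }

  double get_latency_in_ms() const
  {
    return m_latency_in.load(std::memory_order_relaxed)
           * m_clock_ticks_conversion.load(std::memory_order_relaxed);
  }

private:
  std::atomic<uint64_t> m_latency_in{ 0 };
  std::atomic<uint64_t> m_latency_out{ 0 };
  std::atomic<double> m_clock_ticks_conversion{ 16e-6 }; // assumed example: 62.5 MHz clock, 16 ns ticks, expressed in ms
};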
// Message for MLT TD requests latency vars
// Latency represents the difference between current system (clock) time and the requested TD readout window (start/end)
// Units are ms
I would recommend to publish us
Overall looks very good! There are only 2 serious comments:
- do_configure is deprecated in v5, and we need to use init instead (or conf, but only in non-DAQModule classes).
- There's a logic issue with when we're loading the "current system time". We should do it at latency measurement time, like we did in v4, not at sending-to-opmon time. Otherwise this adds unnecessary latency to the reported values, especially in slow-flow systems (e.g. 1 Hz rates).
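For the second point, a rough sketch of capturing the system time at measurement time rather than at opmon-publishing time; the member and setter names below are hypothetical, not the PR's actual code:

#include <atomic>
#include <chrono>
#include <cstdint>

// Helper: current system time in milliseconds since epoch.
inline uint64_t now_system_ms()
{
  using namespace std::chrono;
  return duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
}

// At latency-measurement time (e.g. when the TC/TD is handled), compute and
// store the difference immediately:
//   m_latency_out_ms.store( now_system_ms() - candidate_time_ms );  // hypothetical members
//
// In generate_opmon_data(), only publish the value that was already computed:
//   info.set_latency_out( m_latency_out_ms.load() );                // hypothetical setter
// Calling now_system_ms() at publish time instead would fold the opmon publishing
// interval into the reported latency, which is the concern above for low-rate (e.g. 1 Hz) flows.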
plugins/CustomTCMaker.cpp
@@ -107,6 +107,15 @@ CustomTCMaker::generate_opmon_data()
  info.set_tc_failed_sent_count( m_tc_failed_sent_count.load() );

  this->publish(std::move(info));

  if ( m_running_flag.load() && m_latency_monitoring.load() ) {
- if ( m_running_flag.load() && m_latency_monitoring.load() ) {
+ if ( m_latency_monitoring.load() && m_running_flag.load() ) {
Not an important comment, I'm only writing this because I've learned something new a few weeks back and want to share :) It's an unnecessary micro-optimisation, but in C++ the order of operands inside an if statement can matter for operators like && and || (not true for all operators).
So in your case, whilst we're running, we will always load 2 booleans, even if we don't want monitoring. With the suggestion above, if monitoring is off, we will only load 1 boolean. It matters more in task-heavy workflows, so probably not here, but maybe in e.g. TPDataProcessor where every little counts.
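A tiny standalone illustration of the short-circuit behaviour described above (the flags and the printout are made up for the example):

#include <atomic>
#include <iostream>

std::atomic<bool> monitoring_enabled{ false };
std::atomic<bool> running{ true };

bool load_running()
{
  std::cout << "second operand evaluated\n";
  return running.load();
}

int main()
{
  // && short-circuits: if the left operand is false, the right one is never evaluated.
  // Putting the condition that is most often false (or cheapest) first avoids the extra load.
  if (monitoring_enabled.load() && load_running()) {
    std::cout << "monitoring while running\n";
  }
  // With monitoring_enabled == false, "second operand evaluated" is never printed.
  return 0;
}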
plugins/CustomTCMaker.cpp
m_latency_monitoring.store( m_conf->get_latency_monitoring_conf()->get_enable_latency_monitoring() );
do_configure is deprecated in v5. All the config should be inside of void init(std::shared_ptr<appfwk::ModuleConfiguration> mcfg). Technically there's also a new conf, but I think only in non-DAQModule classes.
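A sketch of what moving the configuration read into init() could look like, using the init() signature quoted above and the getter chain from the diff; how the module's configuration object is obtained from mcfg (and its type name) is an assumption for illustration:

void
CustomTCMaker::init(std::shared_ptr<appfwk::ModuleConfiguration> mcfg)
{
  // Assumed accessor: retrieve this module's configuration object from mcfg.
  m_conf = mcfg->module<appmodel::CustomTCMakerConf>(get_name());

  // Same getter chain as in the diff, now called from init() instead of do_configure().
  m_latency_monitoring.store( m_conf->get_latency_monitoring_conf()->get_enable_latency_monitoring() );
}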
plugins/RandomTCMakerModule.cpp
}

void
RandomTCMakerModule::do_configure(const nlohmann::json& /*obj*/)
{
  //m_conf = obj.get<randomtriggercandidatemaker::Conf>();
  m_latency_monitoring.store( m_conf->get_latency_monitoring_conf()->get_enable_latency_monitoring() );
Ditto re. do_configure being deprecated in v5, should move to init.
Thanks @ArturSztuc and @mroda88 for insights. Here is v2, changes:
In appmodel:
This was tested:
Some issues and potential improvements are mentioned in #347, for future consideration.
LGTM!
The only further comment I have, and something I didn't notice earlier, is that most of the StandaloneTCMakers don't need both latency in and latency out. They create a new object and then send it straight away, so the time difference between the two is in nanoseconds. Latency out at sending would be enough (this applies only to the Random & Custom TC makers; the HSI one of course does receive an input and processes it).
I don't mind if that's in this PR or a separate one.
m_tc_made_count++;

TLOG_DEBUG(1) << get_name() << " at timestamp " << m_timestamp_estimator->get_timestamp_estimate()
              << ", pushing a candidate with timestamp " << candidate.time_candidate;

if (m_latency_monitoring.load()) m_latency_instance.update_latency_out( candidate.time_candidate );
Do we need both latency in and latency out here? The time difference here is literally the time taken to do m_tc_made_count++, which will be in nanoseconds. The out latency would be more than enough.
I think in & out makes sense if we have some input data, a processing stage, and output data. In the standalone TC makers we just have the output data.
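A sketch of the suggested simplification for the standalone makers, with a single measurement just before sending (member names follow the snippet above; dropping the paired update_latency_in call is the reviewer's suggestion, not the merged code):

// Standalone TC makers (Random/Custom) only produce an output candidate, so one
// measurement at send time carries all the information; a paired "in" update
// would differ only by nanoseconds. (HSI keeps both, since it processes an input.)
if (m_latency_monitoring.load()) {
  m_latency_instance.update_latency_out( candidate.time_candidate );
}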
m_tc_made_count++;

TLOG_DEBUG(1) << get_name() << " at timestamp " << m_timestamp_estimator->get_timestamp_estimate()
              << ", pushing a candidate with timestamp " << candidate.time_candidate;

if (m_latency_monitoring.load()) m_latency_instance.update_latency_out( candidate.time_candidate );
Ditto here, we only need one latency in the standalone makers; having in and out doesn't make sense and will give identical numbers (unless we switch to nanoseconds or picoseconds).
Thanks Artur,
LGTM, thanks for all the work!
This PR implements monitoring of latencies across trigger.
Requires appmodel PR: DUNE-DAQ/appmodel#126.
Changes:
For appmodel changes:
other small changes: