
Trigger latencies + monitoring (v5) #342

Merged: 24 commits merged from mrigan/new_latency into develop on Oct 11, 2024
Conversation

@MRiganSUSX (Contributor) commented Sep 13, 2024:

This PR implements monitoring of latencies across trigger.
Requires appmodel PR: DUNE-DAQ/appmodel#126.

Changes:

  • a new latency class that is used throughout: it is simple, it stores and retrieves timestamps, and it can be configured to use ms or us (see the sketch after this list)
  • currently implemented in: CTCM, RTCM, MLT, TXProcessors, HSISM
  • also publishes latency information to opmon (for all of the above), using a shared proto message for easy changes. Grafana plots are already in place.
  • a 'weird' case for the MLT, where two instances are used; one is for readout windows (as was the case in v4)
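
A hypothetical sketch of such a latency helper (names and details are not taken from the PR): it stores the "in"/"out" latencies as the difference between the current system clock and a supplied timestamp, with the unit (ms or us) chosen at construction. The real class also converts DAQ clock ticks to wall time via a conversion factor, which is omitted here.

#include <atomic>
#include <chrono>
#include <cstdint>

class LatencySketch
{
public:
  enum class Unit { Milliseconds, Microseconds };

  explicit LatencySketch(Unit unit = Unit::Microseconds)
    : m_unit(unit)
  {}

  // Timestamps are assumed here to already be expressed in microseconds since epoch.
  void update_latency_in(uint64_t timestamp_us)  { m_latency_in.store(now_us() - timestamp_us); }
  void update_latency_out(uint64_t timestamp_us) { m_latency_out.store(now_us() - timestamp_us); }

  uint64_t get_latency_in() const  { return scale(m_latency_in.load()); }
  uint64_t get_latency_out() const { return scale(m_latency_out.load()); }

private:
  static uint64_t now_us()
  {
    using namespace std::chrono;
    return duration_cast<microseconds>(system_clock::now().time_since_epoch()).count();
  }

  uint64_t scale(uint64_t value_us) const
  {
    return (m_unit == Unit::Milliseconds) ? value_us / 1000 : value_us;
  }

  Unit m_unit;
  std::atomic<uint64_t> m_latency_in{0};
  std::atomic<uint64_t> m_latency_out{0};
};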

For appmodel changes:

  • created a LatencyMonitoringConf class, which simply allows enabling/disabling latency monitoring
  • instantiated an object using this class
  • expanded the relevant classes and objects that use this. I tried to abstract the configuration to limit copies of the same code (i.e. made it part of DataProcessor), but I'm not sure it's the best it could have been.

other small changes:

  • removed tlog block in DHM (left in by mistake)
  • some consistency changes for opmon variables

@mroda88 (Contributor) left a comment:

From the point of view of what information is published, this seems ok.

The usage of the locking is doubtful at the moment.

std::atomic<uint64_t> m_latency_in; // Member variable to store latency_in
std::atomic<uint64_t> m_latency_out; // Member variable to store latency_out
std::atomic<double> m_clock_ticks_conversion; // Dynamically adjusted conversion factor for clock ticks
mutable std::mutex m_mutex;
Review comment (Contributor):

This class implementation is a bit weird from the technical point of view.
The mutex requirement is unclear because you already have atomic counters.

If you need, we can discuss it.
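
For illustration, a minimal sketch (not from the PR) of the atomic-only approach this comment points towards, where no mutex is needed because each value is read and written individually:

#include <atomic>
#include <cstdint>

// Sketch: with std::atomic members, individual loads and stores are already
// thread-safe, so a mutex would only be needed if several members had to be
// updated together as one consistent snapshot.
struct LatencyCounters
{
  std::atomic<uint64_t> m_latency_in{0};
  std::atomic<uint64_t> m_latency_out{0};

  void set_in(uint64_t v)  { m_latency_in.store(v, std::memory_order_relaxed); }
  void set_out(uint64_t v) { m_latency_out.store(v, std::memory_order_relaxed); }
  uint64_t in() const      { return m_latency_in.load(std::memory_order_relaxed); }
  uint64_t out() const     { return m_latency_out.load(std::memory_order_relaxed); }
};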

schema/trigger/opmon/latency_info.proto (outdated; resolved)

// Message for MLT TD requests latency vars
// Latency represents the difference between current system (clock) time and the requested TD readout window (start/end)
// Units are ms
Review comment (Contributor):

I would recommend publishing us (microseconds).

@ArturSztuc (Contributor) left a comment:

Overall looks very good! There are only 2 serious comments:

  1. do_configure is deprecated in v5, and we need to use init instead (or conf, but only in non-DAQModule classes).
  2. There's a logic issue with when we load the "current system time". We should do it at latency measurement time, like we did in v4, not at the time we send to opmon. Doing it at send time adds unnecessary apparent latency, especially in a slow-flow system (e.g. 1 Hz rates). See the sketch after this list.
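
A hedged sketch of point 2, with hypothetical names: the wall-clock reference is captured where the latency is measured, so the opmon publishing cadence cannot inflate the reported value.

#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical helper and member names, for illustration only.
inline uint64_t now_us()
{
  using namespace std::chrono;
  return duration_cast<microseconds>(system_clock::now().time_since_epoch()).count();
}

std::atomic<uint64_t> m_latency_out_us{0};

// Snapshot the system time at measurement time, e.g. when the TC is pushed out.
void on_tc_sent(uint64_t tc_timestamp_us)
{
  m_latency_out_us.store(now_us() - tc_timestamp_us);
}

// generate_opmon_data() then only publishes the stored value; it does not call
// now_us() itself, so a slow publishing rate (e.g. 1 Hz) cannot add to the number.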

@@ -107,6 +107,15 @@ CustomTCMaker::generate_opmon_data()
info.set_tc_failed_sent_count( m_tc_failed_sent_count.load() );

this->publish(std::move(info));

if ( m_running_flag.load() && m_latency_monitoring.load() ) {
Review comment (Contributor):

Suggested change
if ( m_running_flag.load() && m_latency_monitoring.load() ) {
if ( m_latency_monitoring.load() && m_running_flag.load() ) {

Not an important comment; I'm only writing this because I learned something new a few weeks back and want to share :) It's an unnecessary micro-optimisation, but in C++ the order of the operands inside an if statement can matter for operators like && and ||, which short-circuit (this is not true for all operators; see the sketch below).
So in your case, while we're running, we will always load two booleans even if we don't want monitoring. With the suggestion above, if monitoring is off we will only load one boolean. This matters more in task-heavy workflows, so probably not here, but maybe in e.g. TPDataProcessor, where every little bit counts.

Comment on lines 137 to 138

m_latency_monitoring.store( m_conf->get_latency_monitoring_conf()->get_enable_latency_monitoring() );
Review comment (Contributor):

do_configure is deprecated in v5.
All the config should be inside void init(std::shared_ptr<appfwk::ModuleConfiguration> mcfg); see the sketch below.

Technically there's also a new conf, but only in non-DAQModule classes, I think.
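
A hedged sketch of what that could look like, reusing the getter names quoted in this PR; how m_conf is resolved from mcfg is an assumption and is left as a placeholder.

void
RandomTCMakerModule::init(std::shared_ptr<appfwk::ModuleConfiguration> mcfg)
{
  // ... existing init logic that resolves m_conf for this module from mcfg ...

  // Cache the enable/disable flag once at init time instead of in do_configure.
  m_latency_monitoring.store(
    m_conf->get_latency_monitoring_conf()->get_enable_latency_monitoring());
}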

plugins/CustomTCMaker.cpp (outdated; resolved)
plugins/MLTModule.cpp (resolved)
plugins/RandomTCMakerModule.cpp (outdated; resolved)
}

void
RandomTCMakerModule::do_configure(const nlohmann::json& /*obj*/)
{
//m_conf = obj.get<randomtriggercandidatemaker::Conf>();
m_latency_monitoring.store( m_conf->get_latency_monitoring_conf()->get_enable_latency_monitoring() );
Review comment (Contributor):

Ditto re. do_configure being deprecated in v5, should move to init.

src/TAProcessor.cpp (resolved)
src/TCProcessor.cpp (resolved)
src/TPProcessor.cpp (resolved)
src/trigger/HSISourceModel.hpp (resolved)
@MRiganSUSX (Contributor, Author) commented:

Thanks @ArturSztuc and @mroda88 for the insights.

Here is v2; changes:
In trigger:

  • reworked the latency class (again) to be simpler and more modular
  • optimized the if-condition checks as Artur suggested
  • moved config loading to init or conf as requested
  • the default for latency instances is now us (microseconds) throughout (and plots have been updated to reflect this, thanks to @mroda88)

In appmodel:

This was tested:

  • offline with example configs with additional debug logs
  • using the ehn1 example sessions and monitoring plots at grafana
  • using dbt-unittest-summary.sh
  • using integration tests: minimal_system_quick_test, tpstream_writing_test, 3ru_1df_multirun_test

Some issues and potential improvements are mentioned in #347, for future consideration.

@ArturSztuc (Contributor) left a comment:

LGTM!

The only further comment I have, and something I didn't notice earlier, is that most of the StandaloneTCMakers don't need both latency in and latency out. They create a new object and then send it straight away, so the time difference between the two is in nanoseconds. Latency out at sending would be enough (only for the Random & Custom TC makers; the HSI of course does receive an input and processes it).

I don't mind if that's in this PR or a separate one.

m_tc_made_count++;

TLOG_DEBUG(1) << get_name() << " at timestamp " << m_timestamp_estimator->get_timestamp_estimate()
<< ", pushing a candidate with timestamp " << candidate.time_candidate;

if (m_latency_monitoring.load()) m_latency_instance.update_latency_out( candidate.time_candidate );
Review comment (Contributor):

Do we need both latency in and latency out here? The time difference here is literally the time taken to do m_tc_made_count++, which will be in nanoseconds. Out latency would be more than enough.

I think in & out makes sense if we have some input data, a processing stage, and output data. In the standalone TC makers we just have the output data

m_tc_made_count++;

TLOG_DEBUG(1) << get_name() << " at timestamp " << m_timestamp_estimator->get_timestamp_estimate()
<< ", pushing a candidate with timestamp " << candidate.time_candidate;

if (m_latency_monitoring.load()) m_latency_instance.update_latency_out( candidate.time_candidate );
Review comment (Contributor):

Ditto here: we only need one latency in the standalone makers; having both in and out doesn't make sense and will give identical numbers (unless we switch to nanoseconds or picoseconds). A sketch of the reduced form is below.
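
For clarity, a sketch of the reduced form in the standalone makers, following the diff above: the latency-in update is dropped and only the outgoing latency is recorded at send time.

m_tc_made_count++;

TLOG_DEBUG(1) << get_name() << " at timestamp " << m_timestamp_estimator->get_timestamp_estimate()
              << ", pushing a candidate with timestamp " << candidate.time_candidate;

// Only latency out: standalone makers have no input to measure latency in against.
if (m_latency_monitoring.load())
  m_latency_instance.update_latency_out( candidate.time_candidate );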

@MRiganSUSX (Contributor, Author) commented:

Thanks Artur,
I updated the standalone makers to only report latency out now. This unfortunately needed a different opmon message, so that we are not sending a message that is half empty.

@ArturSztuc (Contributor) left a comment:

LGTM, thanks for all the work!

@MRiganSUSX merged commit edf35cd into develop on Oct 11, 2024 (1 check failed).
@MRiganSUSX deleted the mrigan/new_latency branch on October 11, 2024 at 13:25.