Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[not for merge] start a monitoring radio plugin API #3315

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

benclifford
Copy link
Collaborator

Description

This is the beginning of a monitoring radio plugin API, driven by some work with desc (where the choice of monitoring radio has been something to pay attention to performance-wise), by @ClaudiaCumberbatch who wants to plug in another radio, and the chronolog project where a hack sesssion resulted in sending some monitoring messages into chronolog. There are also possible uses in Globus Compute when wanting to observe task behaviour in future without having any supporting Parsl-style monitoring infrastructure.

Changed Behaviour

This API is a partial prototype and might not work quite right, but there is enough there to get feedback from potential users.

Prior to this PR monitoring configuration is sent as pair: a radio name (hard-coded in executor definitions since PR #2274, and not easily user selectable) along with information that can be used by some radios (for example, the monitoring router UDP listener endpoint).

After this PR, those fields are all replaced by a pickled configuration object, some instance of RadioConfig, which will be sent to the worker and used to create an appropriate radio. A user can now specify a RadioConfig instance (most likely a subclass of RadioConfig such as UDPRadioConfig) which contains whichever configuration values are relevant for that particular radio mode - for example, HTEXRadioConfig needs no further configuration, but UDPRadioConfig contains an IP address and UDP port.

This allows plugin-style out-of-codebase radios to be implemented without modifying the parsl source tree - that's a feature Parsl has been driving towards recently in many places to accomodate differing levels of code quality and support ability: most recently @ClaudiaCumberbatch's diaspora PR #3147, which this PR should accomodate without any diaspora-specific additions to the main Parsl codebase.

Type of change

  • New feature

benclifford added a commit that referenced this pull request Apr 10, 2024
…ugin

Before this PR, the DataFlowKernel and each JobStatusPoller
PolledExecutorFacade each opened a ZMQ connection to the monitoring router.
These connections are not threadsafe, but (especially in the DFK case) no
reasoning or checking stops the DFK's connection being used in multiple
threads.

Before this PR, the MonitoringHub presented a 'send' method and stored the
DFK's ZMQ connection in self._dfk_channel, and each PolledExecutorFacade
contained a copy of the ZMQ channel code to open its own channel, configured
using parameters from the DFK passed in at construction.

This PR:

* moves the above uses to use the MonitoringRadio interface (which was
originally designed for remote workers to send monitoring information, but
seems ok here too)

* has MonitoringHub construct an appropriate MonitoringRadio instance for
use on the submit side, exposed as self.radio;

* replaces the implementation of send with a new MultiprocessingQueueRadio
which is thread safe but only works in the same multiprocessing environment
as the launched monitoring database manager process (which is true on the
submit side, but for example means this radio cannot be used on most
remote workers)

This work aligns with the prototype in #3315 (which pushes on monitoring
radio configuration for remote workers) and pushes in the direction
(without getting there) of allowing other submit-side hooks.

This work removes some monitoring specific code from the JobStatusPoller,
replacing it with a dependency injection style.
This is part of work to move JobStatusPoller facade state into other classes,
as part of job handling rearrangements in PR #3293

This PR will change how monitoring messages are delivered from the submitting
process, and the most obvious thing I can think of that will change is how
this will behave under load: heavily loaded messaging causing full buffers and
other heavy-load symptoms will now behave as multiprocessing Queues do, rather
than as ZMQ connections do. I have not attempted to characterise either of
those modes.
benclifford added a commit that referenced this pull request Apr 11, 2024
…ugin (#3338)

Before this PR, the DataFlowKernel and each JobStatusPoller
PolledExecutorFacade each opened a ZMQ connection to the monitoring router.
These connections are not threadsafe, but (especially in the DFK case) no
reasoning or checking stops the DFK's connection being used in multiple
threads.

Before this PR, the MonitoringHub presented a 'send' method and stored the
DFK's ZMQ connection in self._dfk_channel, and each PolledExecutorFacade
contained a copy of the ZMQ channel code to open its own channel, configured
using parameters from the DFK passed in at construction.

This PR:

* moves the above uses to use the MonitoringRadio interface (which was
originally designed for remote workers to send monitoring information, but
seems ok here too)

* has MonitoringHub construct an appropriate MonitoringRadio instance for
use on the submit side, exposed as self.radio;

* replaces the implementation of send with a new MultiprocessingQueueRadio
which is thread safe but only works in the same multiprocessing environment
as the launched monitoring database manager process (which is true on the
submit side, but for example means this radio cannot be used on most
remote workers)

This work aligns with the prototype in #3315 (which pushes on monitoring
radio configuration for remote workers) and pushes in the direction
(without getting there) of allowing other submit-side hooks.

This work removes some monitoring specific code from the JobStatusPoller,
replacing it with a dependency injection style.
This is part of work to move JobStatusPoller facade state into other classes,
as part of job handling rearrangements in PR #3293

This PR will change how monitoring messages are delivered from the submitting
process, and the most obvious thing I can think of that will change is how
this will behave under load: heavily loaded messaging causing full buffers and
other heavy-load symptoms will now behave as multiprocessing Queues do, rather
than as ZMQ connections do. I have not attempted to characterise either of
those modes.

Co-authored-by: Kevin Hunter Kesling <[email protected]>
benclifford added a commit that referenced this pull request Jun 14, 2024
This PR removes a use of multiprocessing fork-without-exec.

At heart, this is how the interchange has wanted to be launched for some time (because of earlier remote interchange work).

Launching multiprocessing fork caused a bunch of problems related to inheriteing state from from the parent submitting process that go away with this (jumbled logging topics, race conditions around at least logging-while-forking, inherited signal handlers).

The configuration dictionary, previously passed in memory over a fork, is now sent in pickled form over stdin. Using pickle here rather than (eg.) JSON keeps the path open for sending richer configuration objects, beyond what can be encoded in JSON. This isn't something needed right now, but at least configurable monitoring radios (the immediate driving force behind this PR) are modelled around passing arbitrary configuration objects around to configure things - and so it seems likely that if interchange monitoring configuration is exposed to the user, richer objects would be passed here. See PR #3315 for monitoring radio prototype.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch from 9ad2288 to 7d7723b Compare June 24, 2024 18:12
benclifford added a commit that referenced this pull request Jun 24, 2024
this is in preparation for pluggable/configurable radios: see PR #3315

in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one)

and in pluggable mode we would expect arbitrary new receivers to exist
so then theres no reason for these ones to be special

this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch 2 times, most recently from b53cabd to d5d8409 Compare June 24, 2024 20:10
benclifford added a commit that referenced this pull request Jul 11, 2024
this is in preparation for pluggable/configurable radios: see PR #3315

in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one)

and in pluggable mode we would expect arbitrary new receivers to exist
so then theres no reason for these ones to be special

this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch 2 times, most recently from 9e8c6e0 to 2f9589d Compare July 11, 2024 09:28
benclifford added a commit that referenced this pull request Jul 11, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio
configuration classes using *Radio names.

This PR frees up the *Radio namespace for that use, by renaming
non-user-exposed internal classes out of the way.
benclifford added a commit that referenced this pull request Jul 11, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio
configuration classes using *Radio names.

This PR frees up the *Radio namespace for that use, by renaming
non-user-exposed internal classes out of the way.
benclifford added a commit that referenced this pull request Jul 11, 2024
Prior to this PR, monitoring hub address and ZMQ port were stored as
attributes of the DFK. The address also existed as an attribute on
dfk.monitoring, and the ZMQ port was returned by dfk.monitoring.start

Afte this PR, those values are not added to the DFK, but instead are
accessed via dfk.monitoring.

These two attributes are now only set on a new executor when monitoring
is enabled, rather than always being intialised by the DFK. Default values
now come from the executor __init__ method, which is a more usual style
in Python for providing default values. See PR #3361

This is part of ongoing work to introduce more pluggable monitoring
network connectivity - see PR #3315
benclifford added a commit that referenced this pull request Jul 18, 2024
Prior to this PR, monitoring hub address and ZMQ port were stored as
attributes of the DFK. The address also existed as an attribute on
dfk.monitoring, and the ZMQ port was returned by dfk.monitoring.start

Afte this PR, those values are not added to the DFK, but instead are
accessed via dfk.monitoring.

These two attributes are now only set on a new executor when monitoring
is enabled, rather than always being intialised by the DFK. Default values
now come from the executor __init__ method, which is a more usual style
in Python for providing default values. See PR #3361

This is part of ongoing work to introduce more pluggable monitoring
network connectivity - see PR #3315
benclifford added a commit that referenced this pull request Jul 18, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio
configuration classes using *Radio names.

This PR frees up the *Radio namespace for that use, by renaming
non-user-exposed internal classes out of the way.
benclifford added a commit that referenced this pull request Jul 18, 2024
this is in preparation for pluggable/configurable radios: see PR #3315

in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one)

and in pluggable mode we would expect arbitrary new receivers to exist
so then theres no reason for these ones to be special

this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch from 2f9589d to 466c2db Compare July 18, 2024 09:15
benclifford added a commit that referenced this pull request Jul 24, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio
configuration classes using *Radio names.

This PR frees up the *Radio namespace for that use, by renaming
non-user-exposed internal classes out of the way.
benclifford added a commit that referenced this pull request Jul 25, 2024
Prior to this PR, monitoring hub address and ZMQ port were stored as
attributes of the DFK. The address also existed as an attribute on
dfk.monitoring, and the ZMQ port was returned by dfk.monitoring.start

Afte this PR, those values are not added to the DFK, but instead are
accessed via dfk.monitoring.

These two attributes are now only set on a new executor when monitoring
is enabled, rather than always being intialised by the DFK. Default values
now come from the executor __init__ method, which is a more usual style
in Python for providing default values. See PR #3361

This is part of ongoing work to introduce more pluggable monitoring
network connectivity - see PR #3315
benclifford added a commit that referenced this pull request Jul 26, 2024
this is in preparation for pluggable/configurable radios: see PR #3315

in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one)

and in pluggable mode we would expect arbitrary new receivers to exist
so then theres no reason for these ones to be special

this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch 2 times, most recently from 72c98fa to c179253 Compare July 26, 2024 20:44
benclifford added a commit that referenced this pull request Jul 31, 2024
From an interchange perspective: this is a refactoring intended to clarify
that the interchange isn't doing anything special wrt. monitoring
messages and that it can send monitoring messages in the same way that
remote workers can.

From a monitoring perspective: this pulls ZMQ sender code out of the
interchange and puts it in a place that is more natural for ongoing
development. For example, a potential future use with Work Queue and
Task Vine is that workers would also benefit from using ZMQ to send
monitoring messages.

In some potential use cases, it might be desirable to configure the
radio used by the interchange instead of the hard-coded ZMQRadio. On-going
work in draft PR #3315 addresses configuration of different types of radio
and that work should be relevant here too.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch from d2ee35c to ddb214a Compare July 31, 2024 11:38
benclifford added a commit that referenced this pull request Jul 31, 2024
this is in preparation for pluggable/configurable radios: see PR #3315

in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one)

and in pluggable mode we would expect arbitrary new receivers to exist
so then theres no reason for these ones to be special

this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
benclifford added a commit that referenced this pull request Jul 31, 2024
A Parsl executor is configured with a submit-side monitoring radio sender,
which is used by the BlockProviderExecutor to send block status messages
to the monitoring subsystem.

Parsl executors also have a notion of a remote monitoring radio, used by
remote workers to sending monitoring messages.

This can be confusing when both of these radio senders are referred to in
the same piece of code, as happened in ongoing monitoring plugin development
in draft PR #3315.

This PR is intended to make this sitution much less ambiguous by avoiding
the mention of a monitoring radio in executor code without qualifying whether
it is a submit-side or remote-worker-side radio definition. A future PR
from the #3315 stack will introduce other monitoring radio references with
the remote prefix, replacing the current radio_mode and related attributes.
benclifford added a commit that referenced this pull request Jul 31, 2024
This is part of two ongoing strands of development:

* Preparation for arbitrary monitoring radio receivers, where UDP is not
special compared to other methods. This code separates out the UDP specific
code making it easier to move around later (see development in PR #3315)

* Avoiding poll-one-thing, poll-another-thing tight polling loops in
favour of multiple blocking loops. The two router receiver sections used
to run in a single tight non-blocking loop, each component waiting
loop_freq (= 10 ms) for something to happen. After this PR, the separated
loops are more amenable to longer blocking times - they only need to
discover when exit event is set which probably can be more on the order
of 1 second.
benclifford added a commit that referenced this pull request Aug 1, 2024
A Parsl executor is configured with a submit-side monitoring radio sender,
which is used by the BlockProviderExecutor to send block status messages
to the monitoring subsystem.

Parsl executors also have a notion of a remote monitoring radio, used by
remote workers to sending monitoring messages.

This can be confusing when both of these radio senders are referred to in
the same piece of code, as happened in ongoing monitoring plugin development
in draft PR #3315.

This PR is intended to make this sitution much less ambiguous by avoiding
the mention of a monitoring radio in executor code without qualifying whether
it is a submit-side or remote-worker-side radio definition. A future PR
from the #3315 stack will introduce other monitoring radio references with
the remote prefix, replacing the current radio_mode and related attributes.
benclifford added a commit that referenced this pull request Aug 2, 2024
From an interchange perspective: this is a refactoring intended to clarify that the interchange isn't doing anything special wrt. monitoring messages and that it can send monitoring messages in the same way that remote workers can.

From a monitoring perspective: this pulls ZMQ sender code out of the interchange and puts it in a place that is more natural for ongoing development. For example, a potential future use with Work Queue and Task Vine is that workers would also benefit from using ZMQ to send monitoring messages.

In some potential use cases, it might be desirable to configure the radio used by the interchange instead of the hard-coded ZMQRadio. On-going work in draft PR #3315 addresses configuration of different types of radio and that work should be relevant here too.
benclifford added a commit that referenced this pull request Aug 5, 2024
This is part of two ongoing strands of development:

* Preparation for arbitrary monitoring radio receivers, where UDP is not special compared to other methods. This code separates out the UDP specific code making it easier to move around later (see development in PR #3315)

* Avoiding poll-one-thing, poll-another-thing tight polling loops in favour of multiple blocking loops. The two router receiver sections used to run in a single tight non-blocking loop, each component waiting loop_freq (= 10 ms) for something to happen. After this PR, the separated loops are more amenable to longer blocking times - they only need to discover when exit event is set which probably can be more on the order of 1 second.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch 2 times, most recently from 7132279 to 4f314da Compare August 15, 2024 14:02
benclifford added a commit that referenced this pull request Aug 16, 2024
The main goals of this PR is to make _migrate_logs_to_internal much more
clearly a message forwarder, rather than a message interpreter.

This follows on from PR #2168 which introduces _dispatch_to_internal to
dispatches messages based on their tag rather than on the queue the message
was received on, and is part of an ongoing series to simplify the queue
and routing structure inside the monitoring router and database code.

Further PRs in preparation (in draft PR #3315) contain further simplifications
building on this PR.

After this PR:

* the database manager will respond to a STOP message on any incoming queue,
vs previously only on the priority queue. This is a consequence of treating
the queues all the same now.

* the database manager will not perform such strong validation of message
structure based on message tag at this point. That's part of expecting the
code to forward messages, not inspect them, with later inspecting code
being the place to care abou structure. This only affects behaviour when
invalid messages are sent.

Related PRs and context:

#3567 changes the monitoring router to be more of a router and
to not inspect and modify certain in-transit messages.

There is a long slow project to regularise queues: PR #2117 makes resource
info messages look like other message so they can be dispatched alongside
other message types.

The priority queue was initially (as I understand it) introduced to attempt
to address a race condition of message order arrival vs SQL database key
constraints. The priority queue is an attempt to force certain messages to
be processed before others (not in the priority queue). However a subsequent
commit in 2019, 0a4b685, introduces a
more robust approach because this priority queue approach does not work and
so is not needed.
benclifford added a commit that referenced this pull request Aug 16, 2024
The main goals of this PR is to make _migrate_logs_to_internal much more
clearly a message forwarder, rather than a message interpreter.

This follows on from PR #2168 which introduces _dispatch_to_internal to
dispatches messages based on their tag rather than on the queue the message
was received on, and is part of an ongoing series to simplify the queue
and routing structure inside the monitoring router and database code.

Further PRs in preparation (in draft PR #3315) contain further simplifications
building on this PR.

After this PR:

* the database manager will respond to a STOP message on any incoming queue,
vs previously only on the priority queue. This is a consequence of treating
the queues all the same now.

* the database manager will not perform such strong validation of message
structure based on message tag at this point. That's part of expecting the
code to forward messages, not inspect them, with later inspecting code
being the place to care abou structure. This only affects behaviour when
invalid messages are sent.

Related PRs and context:

#3567 changes the monitoring router to be more of a router and
to not inspect and modify certain in-transit messages.

There is a long slow project to regularise queues: PR #2117 makes resource
info messages look like other message so they can be dispatched alongside
other message types.

The priority queue was initially (as I understand it) introduced to attempt
to address a race condition of message order arrival vs SQL database key
constraints. The priority queue is an attempt to force certain messages to
be processed before others (not in the priority queue). However a subsequent
commit in 2019, 0a4b685, introduces a
more robust approach because this priority queue approach does not work and
so is not needed.
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch 2 times, most recently from 2f1c15a to b6de72d Compare August 26, 2024 11:31
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch 3 times, most recently from d2edccc to d175376 Compare October 31, 2024 15:39
The is based on a review comment from @khk-globus on PR #3672
Prior to this PR, MonitoringHub had a logdir parameter which let the log
directory be set separately from the DFK-level run directory.

This feature isn't monitoring-relevant - other components don't let the
user set this.

So this PR removes that feature, reducing the amount of state to be threaded
around.

When reading this patch, note that what the DFK calls the rundir is a different
directory vs what a Config object calls the rundir.
but these processes *must* be fork processes because we
can't re-import the unknown user workflow script again.

so make that more explicit.

(longer term, forked processes need to go away throughout
Parsl - see PR #2343, but this PR is not for that)
…nd forward it on

use the radio interface to make it clear that that is what they are doing

the intention there is to make it cleaner to swap out *how* they perform forwarding - clarifying that these two threads are forwarders, not doing anything else to the message flow.

after this patch, my intention is that:
 i) the ZMQ listener moves into the db manager and there is no multiprocessing queue forwarding from outside of the db manager
 ii) the UDP listener, all that is left, means this can be renamed the UDP radio receiver, sitting as a processes alongside the filesystem radio receiver (and eventually any other radio receivers)

compare to how the interchange sends htexradio messages onwards via some arbitrary other radio
(which right now is always the ZMQRadioSender) -- rather than having its own radio code.

the previous multiprocessing queue code should be replaced with multiprocessing queue sender
…UDP and HTEX radios: using pickle not parsl.serialize
in preparation for moving radio-specific radio code into these modules too
patch so far is overloading self.monitoring_radio for two different uses that should be clarified:
submit side radio and worker side radio

its a bit complicated for having a default radio (UDPRadio or HTEXRadio) even when monitoring is turned off:
I should check that the radio receiver does not get activated in this case: for example:
  * test that broken radio activation causes a startup error
  * test that configuration with broken radio doesn't cause a startup error when monitoringhub is not configured

zmq radio should always listen, and be the place where all radio receivers send their data.

udp radio and filesystem radio should turn into separate... something... processes?

for prototyping i guess it doesn't matter where I make them live, but the most behaviour preserving would keep them somehow separated?
@benclifford benclifford force-pushed the benc-monitoring-pluggable-radios branch from d175376 to 60f8eb2 Compare November 7, 2024 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant