-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[not for merge] start a monitoring radio plugin API #3315
Draft
benclifford
wants to merge
8
commits into
master
Choose a base branch
from
benc-monitoring-pluggable-radios
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+574
−372
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
benclifford
added a commit
that referenced
this pull request
Apr 10, 2024
…ugin Before this PR, the DataFlowKernel and each JobStatusPoller PolledExecutorFacade each opened a ZMQ connection to the monitoring router. These connections are not threadsafe, but (especially in the DFK case) no reasoning or checking stops the DFK's connection being used in multiple threads. Before this PR, the MonitoringHub presented a 'send' method and stored the DFK's ZMQ connection in self._dfk_channel, and each PolledExecutorFacade contained a copy of the ZMQ channel code to open its own channel, configured using parameters from the DFK passed in at construction. This PR: * moves the above uses to use the MonitoringRadio interface (which was originally designed for remote workers to send monitoring information, but seems ok here too) * has MonitoringHub construct an appropriate MonitoringRadio instance for use on the submit side, exposed as self.radio; * replaces the implementation of send with a new MultiprocessingQueueRadio which is thread safe but only works in the same multiprocessing environment as the launched monitoring database manager process (which is true on the submit side, but for example means this radio cannot be used on most remote workers) This work aligns with the prototype in #3315 (which pushes on monitoring radio configuration for remote workers) and pushes in the direction (without getting there) of allowing other submit-side hooks. This work removes some monitoring specific code from the JobStatusPoller, replacing it with a dependency injection style. This is part of work to move JobStatusPoller facade state into other classes, as part of job handling rearrangements in PR #3293 This PR will change how monitoring messages are delivered from the submitting process, and the most obvious thing I can think of that will change is how this will behave under load: heavily loaded messaging causing full buffers and other heavy-load symptoms will now behave as multiprocessing Queues do, rather than as ZMQ connections do. I have not attempted to characterise either of those modes.
benclifford
added a commit
that referenced
this pull request
Apr 11, 2024
…ugin (#3338) Before this PR, the DataFlowKernel and each JobStatusPoller PolledExecutorFacade each opened a ZMQ connection to the monitoring router. These connections are not threadsafe, but (especially in the DFK case) no reasoning or checking stops the DFK's connection being used in multiple threads. Before this PR, the MonitoringHub presented a 'send' method and stored the DFK's ZMQ connection in self._dfk_channel, and each PolledExecutorFacade contained a copy of the ZMQ channel code to open its own channel, configured using parameters from the DFK passed in at construction. This PR: * moves the above uses to use the MonitoringRadio interface (which was originally designed for remote workers to send monitoring information, but seems ok here too) * has MonitoringHub construct an appropriate MonitoringRadio instance for use on the submit side, exposed as self.radio; * replaces the implementation of send with a new MultiprocessingQueueRadio which is thread safe but only works in the same multiprocessing environment as the launched monitoring database manager process (which is true on the submit side, but for example means this radio cannot be used on most remote workers) This work aligns with the prototype in #3315 (which pushes on monitoring radio configuration for remote workers) and pushes in the direction (without getting there) of allowing other submit-side hooks. This work removes some monitoring specific code from the JobStatusPoller, replacing it with a dependency injection style. This is part of work to move JobStatusPoller facade state into other classes, as part of job handling rearrangements in PR #3293 This PR will change how monitoring messages are delivered from the submitting process, and the most obvious thing I can think of that will change is how this will behave under load: heavily loaded messaging causing full buffers and other heavy-load symptoms will now behave as multiprocessing Queues do, rather than as ZMQ connections do. I have not attempted to characterise either of those modes. Co-authored-by: Kevin Hunter Kesling <[email protected]>
benclifford
added a commit
that referenced
this pull request
Jun 14, 2024
This PR removes a use of multiprocessing fork-without-exec. At heart, this is how the interchange has wanted to be launched for some time (because of earlier remote interchange work). Launching multiprocessing fork caused a bunch of problems related to inheriteing state from from the parent submitting process that go away with this (jumbled logging topics, race conditions around at least logging-while-forking, inherited signal handlers). The configuration dictionary, previously passed in memory over a fork, is now sent in pickled form over stdin. Using pickle here rather than (eg.) JSON keeps the path open for sending richer configuration objects, beyond what can be encoded in JSON. This isn't something needed right now, but at least configurable monitoring radios (the immediate driving force behind this PR) are modelled around passing arbitrary configuration objects around to configure things - and so it seems likely that if interchange monitoring configuration is exposed to the user, richer objects would be passed here. See PR #3315 for monitoring radio prototype.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
from
June 24, 2024 18:12
9ad2288
to
7d7723b
Compare
benclifford
added a commit
that referenced
this pull request
Jun 24, 2024
this is in preparation for pluggable/configurable radios: see PR #3315 in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one) and in pluggable mode we would expect arbitrary new receivers to exist so then theres no reason for these ones to be special this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
2 times, most recently
from
June 24, 2024 20:10
b53cabd
to
d5d8409
Compare
benclifford
added a commit
that referenced
this pull request
Jul 11, 2024
this is in preparation for pluggable/configurable radios: see PR #3315 in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one) and in pluggable mode we would expect arbitrary new receivers to exist so then theres no reason for these ones to be special this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
2 times, most recently
from
July 11, 2024 09:28
9e8c6e0
to
2f9589d
Compare
benclifford
added a commit
that referenced
this pull request
Jul 11, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio configuration classes using *Radio names. This PR frees up the *Radio namespace for that use, by renaming non-user-exposed internal classes out of the way.
benclifford
added a commit
that referenced
this pull request
Jul 11, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio configuration classes using *Radio names. This PR frees up the *Radio namespace for that use, by renaming non-user-exposed internal classes out of the way.
benclifford
added a commit
that referenced
this pull request
Jul 11, 2024
Prior to this PR, monitoring hub address and ZMQ port were stored as attributes of the DFK. The address also existed as an attribute on dfk.monitoring, and the ZMQ port was returned by dfk.monitoring.start Afte this PR, those values are not added to the DFK, but instead are accessed via dfk.monitoring. These two attributes are now only set on a new executor when monitoring is enabled, rather than always being intialised by the DFK. Default values now come from the executor __init__ method, which is a more usual style in Python for providing default values. See PR #3361 This is part of ongoing work to introduce more pluggable monitoring network connectivity - see PR #3315
benclifford
added a commit
that referenced
this pull request
Jul 18, 2024
Prior to this PR, monitoring hub address and ZMQ port were stored as attributes of the DFK. The address also existed as an attribute on dfk.monitoring, and the ZMQ port was returned by dfk.monitoring.start Afte this PR, those values are not added to the DFK, but instead are accessed via dfk.monitoring. These two attributes are now only set on a new executor when monitoring is enabled, rather than always being intialised by the DFK. Default values now come from the executor __init__ method, which is a more usual style in Python for providing default values. See PR #3361 This is part of ongoing work to introduce more pluggable monitoring network connectivity - see PR #3315
benclifford
added a commit
that referenced
this pull request
Jul 18, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio configuration classes using *Radio names. This PR frees up the *Radio namespace for that use, by renaming non-user-exposed internal classes out of the way.
benclifford
added a commit
that referenced
this pull request
Jul 18, 2024
this is in preparation for pluggable/configurable radios: see PR #3315 in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one) and in pluggable mode we would expect arbitrary new receivers to exist so then theres no reason for these ones to be special this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
from
July 18, 2024 09:15
2f9589d
to
466c2db
Compare
benclifford
added a commit
that referenced
this pull request
Jul 24, 2024
Ongoing monitoring radio work (see PR #3315) introduces per-radio configuration classes using *Radio names. This PR frees up the *Radio namespace for that use, by renaming non-user-exposed internal classes out of the way.
benclifford
added a commit
that referenced
this pull request
Jul 25, 2024
Prior to this PR, monitoring hub address and ZMQ port were stored as attributes of the DFK. The address also existed as an attribute on dfk.monitoring, and the ZMQ port was returned by dfk.monitoring.start Afte this PR, those values are not added to the DFK, but instead are accessed via dfk.monitoring. These two attributes are now only set on a new executor when monitoring is enabled, rather than always being intialised by the DFK. Default values now come from the executor __init__ method, which is a more usual style in Python for providing default values. See PR #3361 This is part of ongoing work to introduce more pluggable monitoring network connectivity - see PR #3315
benclifford
added a commit
that referenced
this pull request
Jul 26, 2024
this is in preparation for pluggable/configurable radios: see PR #3315 in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one) and in pluggable mode we would expect arbitrary new receivers to exist so then theres no reason for these ones to be special this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
2 times, most recently
from
July 26, 2024 20:44
72c98fa
to
c179253
Compare
benclifford
added a commit
that referenced
this pull request
Jul 31, 2024
From an interchange perspective: this is a refactoring intended to clarify that the interchange isn't doing anything special wrt. monitoring messages and that it can send monitoring messages in the same way that remote workers can. From a monitoring perspective: this pulls ZMQ sender code out of the interchange and puts it in a place that is more natural for ongoing development. For example, a potential future use with Work Queue and Task Vine is that workers would also benefit from using ZMQ to send monitoring messages. In some potential use cases, it might be desirable to configure the radio used by the interchange instead of the hard-coded ZMQRadio. On-going work in draft PR #3315 addresses configuration of different types of radio and that work should be relevant here too.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
from
July 31, 2024 11:38
d2ee35c
to
ddb214a
Compare
benclifford
added a commit
that referenced
this pull request
Jul 31, 2024
this is in preparation for pluggable/configurable radios: see PR #3315 in that future work, these receivers wont be launched unless configured... or at least, the UDP one won't (and similarly the filesystem one) and in pluggable mode we would expect arbitrary new receivers to exist so then theres no reason for these ones to be special this opens up, in a future PR, the ability to not need to poll at such a high frequency - or at all... the poll-timeout is to allow the other poll to happen, but this PR makes those two polls happen in two threads.
benclifford
added a commit
that referenced
this pull request
Jul 31, 2024
A Parsl executor is configured with a submit-side monitoring radio sender, which is used by the BlockProviderExecutor to send block status messages to the monitoring subsystem. Parsl executors also have a notion of a remote monitoring radio, used by remote workers to sending monitoring messages. This can be confusing when both of these radio senders are referred to in the same piece of code, as happened in ongoing monitoring plugin development in draft PR #3315. This PR is intended to make this sitution much less ambiguous by avoiding the mention of a monitoring radio in executor code without qualifying whether it is a submit-side or remote-worker-side radio definition. A future PR from the #3315 stack will introduce other monitoring radio references with the remote prefix, replacing the current radio_mode and related attributes.
benclifford
added a commit
that referenced
this pull request
Jul 31, 2024
This is part of two ongoing strands of development: * Preparation for arbitrary monitoring radio receivers, where UDP is not special compared to other methods. This code separates out the UDP specific code making it easier to move around later (see development in PR #3315) * Avoiding poll-one-thing, poll-another-thing tight polling loops in favour of multiple blocking loops. The two router receiver sections used to run in a single tight non-blocking loop, each component waiting loop_freq (= 10 ms) for something to happen. After this PR, the separated loops are more amenable to longer blocking times - they only need to discover when exit event is set which probably can be more on the order of 1 second.
benclifford
added a commit
that referenced
this pull request
Aug 1, 2024
A Parsl executor is configured with a submit-side monitoring radio sender, which is used by the BlockProviderExecutor to send block status messages to the monitoring subsystem. Parsl executors also have a notion of a remote monitoring radio, used by remote workers to sending monitoring messages. This can be confusing when both of these radio senders are referred to in the same piece of code, as happened in ongoing monitoring plugin development in draft PR #3315. This PR is intended to make this sitution much less ambiguous by avoiding the mention of a monitoring radio in executor code without qualifying whether it is a submit-side or remote-worker-side radio definition. A future PR from the #3315 stack will introduce other monitoring radio references with the remote prefix, replacing the current radio_mode and related attributes.
benclifford
added a commit
that referenced
this pull request
Aug 2, 2024
From an interchange perspective: this is a refactoring intended to clarify that the interchange isn't doing anything special wrt. monitoring messages and that it can send monitoring messages in the same way that remote workers can. From a monitoring perspective: this pulls ZMQ sender code out of the interchange and puts it in a place that is more natural for ongoing development. For example, a potential future use with Work Queue and Task Vine is that workers would also benefit from using ZMQ to send monitoring messages. In some potential use cases, it might be desirable to configure the radio used by the interchange instead of the hard-coded ZMQRadio. On-going work in draft PR #3315 addresses configuration of different types of radio and that work should be relevant here too.
benclifford
added a commit
that referenced
this pull request
Aug 5, 2024
This is part of two ongoing strands of development: * Preparation for arbitrary monitoring radio receivers, where UDP is not special compared to other methods. This code separates out the UDP specific code making it easier to move around later (see development in PR #3315) * Avoiding poll-one-thing, poll-another-thing tight polling loops in favour of multiple blocking loops. The two router receiver sections used to run in a single tight non-blocking loop, each component waiting loop_freq (= 10 ms) for something to happen. After this PR, the separated loops are more amenable to longer blocking times - they only need to discover when exit event is set which probably can be more on the order of 1 second.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
2 times, most recently
from
August 15, 2024 14:02
7132279
to
4f314da
Compare
benclifford
added a commit
that referenced
this pull request
Aug 16, 2024
The main goals of this PR is to make _migrate_logs_to_internal much more clearly a message forwarder, rather than a message interpreter. This follows on from PR #2168 which introduces _dispatch_to_internal to dispatches messages based on their tag rather than on the queue the message was received on, and is part of an ongoing series to simplify the queue and routing structure inside the monitoring router and database code. Further PRs in preparation (in draft PR #3315) contain further simplifications building on this PR. After this PR: * the database manager will respond to a STOP message on any incoming queue, vs previously only on the priority queue. This is a consequence of treating the queues all the same now. * the database manager will not perform such strong validation of message structure based on message tag at this point. That's part of expecting the code to forward messages, not inspect them, with later inspecting code being the place to care abou structure. This only affects behaviour when invalid messages are sent. Related PRs and context: #3567 changes the monitoring router to be more of a router and to not inspect and modify certain in-transit messages. There is a long slow project to regularise queues: PR #2117 makes resource info messages look like other message so they can be dispatched alongside other message types. The priority queue was initially (as I understand it) introduced to attempt to address a race condition of message order arrival vs SQL database key constraints. The priority queue is an attempt to force certain messages to be processed before others (not in the priority queue). However a subsequent commit in 2019, 0a4b685, introduces a more robust approach because this priority queue approach does not work and so is not needed.
benclifford
added a commit
that referenced
this pull request
Aug 16, 2024
The main goals of this PR is to make _migrate_logs_to_internal much more clearly a message forwarder, rather than a message interpreter. This follows on from PR #2168 which introduces _dispatch_to_internal to dispatches messages based on their tag rather than on the queue the message was received on, and is part of an ongoing series to simplify the queue and routing structure inside the monitoring router and database code. Further PRs in preparation (in draft PR #3315) contain further simplifications building on this PR. After this PR: * the database manager will respond to a STOP message on any incoming queue, vs previously only on the priority queue. This is a consequence of treating the queues all the same now. * the database manager will not perform such strong validation of message structure based on message tag at this point. That's part of expecting the code to forward messages, not inspect them, with later inspecting code being the place to care abou structure. This only affects behaviour when invalid messages are sent. Related PRs and context: #3567 changes the monitoring router to be more of a router and to not inspect and modify certain in-transit messages. There is a long slow project to regularise queues: PR #2117 makes resource info messages look like other message so they can be dispatched alongside other message types. The priority queue was initially (as I understand it) introduced to attempt to address a race condition of message order arrival vs SQL database key constraints. The priority queue is an attempt to force certain messages to be processed before others (not in the priority queue). However a subsequent commit in 2019, 0a4b685, introduces a more robust approach because this priority queue approach does not work and so is not needed.
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
2 times, most recently
from
August 26, 2024 11:31
2f1c15a
to
b6de72d
Compare
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
3 times, most recently
from
October 31, 2024 15:39
d2edccc
to
d175376
Compare
The is based on a review comment from @khk-globus on PR #3672
Prior to this PR, MonitoringHub had a logdir parameter which let the log directory be set separately from the DFK-level run directory. This feature isn't monitoring-relevant - other components don't let the user set this. So this PR removes that feature, reducing the amount of state to be threaded around. When reading this patch, note that what the DFK calls the rundir is a different directory vs what a Config object calls the rundir.
but these processes *must* be fork processes because we can't re-import the unknown user workflow script again. so make that more explicit. (longer term, forked processes need to go away throughout Parsl - see PR #2343, but this PR is not for that)
…e, which is not true
…nd forward it on use the radio interface to make it clear that that is what they are doing the intention there is to make it cleaner to swap out *how* they perform forwarding - clarifying that these two threads are forwarders, not doing anything else to the message flow. after this patch, my intention is that: i) the ZMQ listener moves into the db manager and there is no multiprocessing queue forwarding from outside of the db manager ii) the UDP listener, all that is left, means this can be renamed the UDP radio receiver, sitting as a processes alongside the filesystem radio receiver (and eventually any other radio receivers) compare to how the interchange sends htexradio messages onwards via some arbitrary other radio (which right now is always the ZMQRadioSender) -- rather than having its own radio code. the previous multiprocessing queue code should be replaced with multiprocessing queue sender
…UDP and HTEX radios: using pickle not parsl.serialize
in preparation for moving radio-specific radio code into these modules too
patch so far is overloading self.monitoring_radio for two different uses that should be clarified: submit side radio and worker side radio its a bit complicated for having a default radio (UDPRadio or HTEXRadio) even when monitoring is turned off: I should check that the radio receiver does not get activated in this case: for example: * test that broken radio activation causes a startup error * test that configuration with broken radio doesn't cause a startup error when monitoringhub is not configured zmq radio should always listen, and be the place where all radio receivers send their data. udp radio and filesystem radio should turn into separate... something... processes? for prototyping i guess it doesn't matter where I make them live, but the most behaviour preserving would keep them somehow separated?
benclifford
force-pushed
the
benc-monitoring-pluggable-radios
branch
from
November 7, 2024 11:09
d175376
to
60f8eb2
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This is the beginning of a monitoring radio plugin API, driven by some work with desc (where the choice of monitoring radio has been something to pay attention to performance-wise), by @ClaudiaCumberbatch who wants to plug in another radio, and the chronolog project where a hack sesssion resulted in sending some monitoring messages into chronolog. There are also possible uses in Globus Compute when wanting to observe task behaviour in future without having any supporting Parsl-style monitoring infrastructure.
Changed Behaviour
This API is a partial prototype and might not work quite right, but there is enough there to get feedback from potential users.
Prior to this PR monitoring configuration is sent as pair: a radio name (hard-coded in executor definitions since PR #2274, and not easily user selectable) along with information that can be used by some radios (for example, the monitoring router UDP listener endpoint).
After this PR, those fields are all replaced by a pickled configuration object, some instance of
RadioConfig
, which will be sent to the worker and used to create an appropriate radio. A user can now specify a RadioConfig instance (most likely a subclass of RadioConfig such as UDPRadioConfig) which contains whichever configuration values are relevant for that particular radio mode - for example, HTEXRadioConfig needs no further configuration, but UDPRadioConfig contains an IP address and UDP port.This allows plugin-style out-of-codebase radios to be implemented without modifying the parsl source tree - that's a feature Parsl has been driving towards recently in many places to accomodate differing levels of code quality and support ability: most recently @ClaudiaCumberbatch's diaspora PR #3147, which this PR should accomodate without any diaspora-specific additions to the main Parsl codebase.
Type of change