Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sampler discovery using regular expressions #1380

Merged
merged 2 commits into from
May 21, 2024

Conversation

nichamon
Copy link
Collaborator

This commit introduces a new feature that simplifies aggregator configuration.

  • Previously, admins needed to manually specify hostnames for all samplers in the aggregator configuration using the prdcr_add command.
  • This change enables automatic sampler discovery. Admins specifies the aggregator hostname and listening port in sampler configuration via the advertise_add command and start the advertisement with the advertise_start command. The samplers now advertise their hostname to the aggregator.
  • On the aggregator, admins specifies a regular expression to be matched with sampler hostname via the prdcr_listen_add command. The prdcr_listen_start command is used to tell the aggregator to automatically add producers corresponding to a sampler of which the hostname matches the regular expression.

This feature eliminates the need for manual configuration of sampler hostnames in the aggregator configuration file. The aggregator can now automatically discover samplers and add them as metric set producers.

The authentication domain to be used to connect to the aggregator
.RE

\fBadvertise_start\fR starts an advertisement. The parameters are:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for having separate start/stop functions? Why not just start on _add() and stop on _del()? (And maybe then rename: s/add/start, s/del/stop)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I didn’t implement start and stop functionality. I didn't even implement del as I don't see a use for it. However, when I thought about troubleshooting scenarios, it’s better to have the stop command available.

Here are the reasons:

  • Separating deletion and stopping from each other allows users to restart the prdcr_listen without typing the whole add line again. Use cases of stopping a prdcr_listen happen when things don’t go as planned. For example, the aggregator cannot keep up with updating sets, while more producers are being automatically added to the aggregators. To inspect the situation, I would like to stop adding more producers to the aggregator to prevent adding more load on it. Hence, prdcr_listen_stop would become useful. Whether that will happen, I don’t know. But at some point, we will definitely need to do troubleshooting, and an ability to stop a prdcr_listen could be beneficial.

  • Providing prdcr_listen_stop calls for the existence of prdcr_listen_start to restart a stopped prdcr_listen.

Copy link
Collaborator

@morrone morrone Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Since those are pretty uncommon use cases, how about instead of having the current add/start/stop/del, have:

_start(takes all the options that your _add currently takes)
_pause
_resume
_stop

The most common use cases would only do start and maybe stop. But people that need to pause and resume for debugging/whatever can do so.

@baallan
Copy link
Collaborator

baallan commented Apr 1, 2024

This needs to take hostlist syntax in addition to or preferably instead of regex for identifying lists of aggregators or lists of producers. It's difficult to correctly use regex to yield the equivalent of lists with holes for persistently unused equipment, e.g. node[1-3,5-22,24-80].

@morrone
Copy link
Collaborator

morrone commented Apr 2, 2024

This needs to take hostlist syntax in addition to or preferably instead of regex for identifying lists of aggregators or lists of producers. It's difficult to correctly use regex to yield the equivalent of lists with holes for persistently unused equipment, e.g. node[1-3,5-22,24-80].

That would seem to be counter to the goal of this PR. The goal seems to be to allow any sampler daemons matching a pattern to connect, and not have to know what nodes those are going to be. Gaps of persistently unsused equipment would not pose a problem, as far as I can tell, because unused equipment presumably can't connect to the aggregator. In which case I'm not entirely clear on what the use case is for the regex. Maybe it is just a sanity check?

Hostlist support would be great on the aggregator side when this PR's code is not being used. It would allow a single section rather than repeating prdcr over-and-over for each sampler daemon. But for better or worse, that is currently being addressed at the maestro config level rather than in the ldmsd native configuration language.

@tom95858
Copy link
Collaborator

tom95858 commented Apr 3, 2024

@baallan, "persistently unused equipment" wouldn't be advertising. Also the whole point of this is avoiding the hostlist syntax. I do agree that supporting the hostlist syntax may be nice because it's easier to write than regex, but the whole point is to avoid having to use it.

To @morrone question, the idea is that all of the aggregators know about all the samplers, but they only actually communicate with the one's that have been started.

The intention here is to significantly reduce the complexity of the sampler configuration at the aggregator; especially in the case of cloud deployment where the node that is being sampled is walking around because the cloud logic is messing with you.

@baallan
Copy link
Collaborator

baallan commented Apr 3, 2024

@tom95858 what's the point in avoiding the hostlist syntax? it makes admin work much more pleasant and there are open-source codes for python and c to parse it.

@tom95858
Copy link
Collaborator

tom95858 commented Apr 3, 2024

@baallan, i think you're missing the point. The idea is not to avoid the hostlist syntax, it's to avoid the hostlist altogether.

**Aggregator Side Commands**

.IP \fBprdcr_listen-add
.RI "name=" NAME " regex=" REGEX " " RECONNECT=" INTVL
Copy link
Collaborator

@baallan baallan Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prdcr_listen_add This needs to include a parameter to specify matching a specific user owning the producer (i.e. a user-owned sampler daemon on a remote host must be blockable from getting included on the basis of a munge confirmed failure to be uid 0 (if uid 0 is required). I expect there may be some scenarios where this is optional and an admin might want to allow user-owned daemons to publish.

@baallan
Copy link
Collaborator

baallan commented Apr 3, 2024

@baallan, i think you're missing the point. The idea is not to avoid the hostlist syntax, it's to avoid the hostlist altogether.

As @morrone noted, the place for hostlist syntax is in prdcr_listen_add, and even more strongly it is in prdcr_add (and friends) in the existing agg configuration for scenarios where advertising through firewalls isn't possible.

.PP
**Aggregator Side Commands**

.IP \fBprdcr_listen-add
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect there is a typo here: s/-/_/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@morrone Thanks for catching this and looking into the details!

@morrone
Copy link
Collaborator

morrone commented Apr 4, 2024

To @morrone question, the idea is that all of the aggregators know about all the samplers, but they only actually communicate with the one's that have been started.

I am not sure that I follow. Maybe we're using terms differently. How, in the scope of the new functionality in this PR, would an aggregator "know about" any of the samplers without the samplers first advertising themselves to the aggregator?

I am not advocating that hostlist be added to prdcr_listen_add. As Tom said that would seem to be counter to the point of this PR: to "avoid the hostlist altogether". But for the very same reason, the "regex" in prdcr_listen_add should also be removed, right? If there is a reason to have a regex to match hostnames, then it is almost certainly better to have a hostlist. But I am suggesting that we should have neither of them, and the regex should be removed.

I also still don't understand the point of prdr_listen_start and prdcr_listen_stop. Why doesn't it just start on prdcr_listen_add and stop on prdcr_listen_del?

@tom95858
Copy link
Collaborator

tom95858 commented Apr 4, 2024

@morrone, the purpose of the regex on the listen is to handle the case where we are monitoring multiple clusters. So we've got cluster A, B, and C. I want to have three producer listeners, one for each cluster. So I do a prdcr_listen for each cluster with a different regex for each one, A., B., and C.*. Now we're just spinning this up, so maybe the regex isn't necessary since the start and stop paths are still there. Anyway, that was the rationale. I was worried about accepting connections from anyone to exchange the sampler information and wanted to have a way to filter the call to accept.

@tom95858
Copy link
Collaborator

tom95858 commented Apr 4, 2024

To @morrone question, the idea is that all of the aggregators know about all the samplers, but they only actually communicate with the one's that have been started.

I am not sure that I follow. Maybe we're using terms differently. How, in the scope of the new functionality in this PR, would an aggregator "know about" any of the samplers without the samplers first advertising themselves to the aggregator?
They don't...initially. So suppose we just enable this functionality....no hostlist is specified, and all the samplers tell all the aggregators about themselves. Meanwhile, we are monitoring the aggregators and distributing configuration across them. These entities now know about all of the samplers. So this daemon, service, whatever, says ok, let's divy this up and starts a roughly equal number of producers on the aggregators.

I am not advocating that hostlist be added to prdcr_listen_add. As Tom said that would seem to be counter to the point of this PR: to "avoid the hostlist altogether". But for the very same reason, the "regex" in prdcr_listen_add should also be removed, right? If there is a reason to have a regex to match hostnames, then it is almost certainly better to have a hostlist. But I am suggesting that we should have neither of them, and the regex should be removed.

I also still don't understand the point of prdr_listen_start and prdcr_listen_stop. Why doesn't it just start on prdcr_listen_add and stop on prdcr_listen_del?

Because all of the aggregators know about all of the samplers, but they are not responsible for them. So the start and stop are there to alert a given aggregator that it is responsible for a particular client. The reason it's done this way is to reduce the failover time when an aggregator goes away, IOW, it doesn't require adding, configuring and starting a new producer...just starting it.

@morrone
Copy link
Collaborator

morrone commented Apr 4, 2024

Because all of the aggregators know about all of the samplers, but they are not responsible for them. So the start and stop are there to alert a given aggregator that it is responsible for a particular client. The reason it's done this way is to reduce the failover time when an aggregator goes away, IOW, it doesn't require adding, configuring and starting a new producer...just starting it.

I don't think it works that way. Maybe you are confusing prdcr_listen_add with advertise_start? prdcr_listen_add with doesn't take any options that would allow it to differentiate individual clients.

@morrone
Copy link
Collaborator

morrone commented Apr 4, 2024

@morrone, the purpose of the regex on the listen is to handle the case where we are monitoring multiple clusters. So we've got cluster A, B, and C. I want to have three producer listeners, one for each cluster. So I do a prdcr_listen for each cluster with a different regex for each one, A., B., and C.*. Now we're just spinning this up, so maybe the regex isn't necessary since the start and stop paths are still there. Anyway, that was the rationale. I was worried about accepting connections from anyone to exchange the sampler information and wanted to have a way to filter the call to accept.

Surely any sampler trying to advertise to the aggregator needs to be authenticated and authorized well before the regex code comes into play, correct? If it doesn't then I think there is work that needs to be done before this can land. Because a regex is entirely insufficient to suffice as a security mechanism.

@nichamon
Copy link
Collaborator Author

nichamon commented Apr 5, 2024

@tom95858 @morrone @baallan @valleydlr It's better to continue the discussion in a meeting. I'll send an email to setup a meeting next week.

**Aggregator Side Commands**

.IP \fBprdcr_listen-add
.RI "name=" NAME " regex=" REGEX " " RECONNECT=" INTVL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a discussion this morning, we talked about making the regex parameter optional. If it is not provided, we accept all.

@nichamon nichamon marked this pull request as draft April 24, 2024 22:04
@nichamon
Copy link
Collaborator Author

I changed this to a draft because I'm in the middle of addressing the following.

  • Make regex as an optional attribute
  • Make advertise_start create a config object if missing
  • Correct the typo in the man page
  • Add ip attribute prdcr_listen_add to restrict the IP addresses causing producer creation

@nichamon nichamon force-pushed the prdcr_listen branch 5 times, most recently from a623d04 to 94b2dc7 Compare May 15, 2024 17:17
@nichamon nichamon marked this pull request as ready for review May 15, 2024 17:29
@nichamon
Copy link
Collaborator Author

I made the three changes.

This commit introduces a new feature that simplifies aggregator
configuration.

* Previously, admins needed to manually specify hostnames for all
  samplers in the aggregator configuration using the `prdcr_add`
  command.
* This change enables samplers to advertise themselves to an aggregator.
  Admins specifies the aggregator hostname and listening port in sampler
  configuration via the `advertise_add` command and start the
  advertisement with the `advertise_start` command. The samplers now
  advertise their hostname to the aggregator.
* On the aggregator, admins may specify a regular expression to be
  matched with sampler hostname or an IP range in the CIDR format using
  the `prdcr_listen_add` command. The `prdcr_listen_start` command is used
  to tell the aggregator to automatically add producers corresponding to a
  sampler of which the hostname matches the regular expressions or the
  IP is in the subnet mask given at the `prdcr_listen_add` line. If
  neither of the regular expression or the IP range is given, LDMSD
  creates a producer when it receives an advertisement.

This feature eliminates the need for manual configuration of sampler
hostnames in the aggregator configuration file. The aggregator can now
automatically discover samplers and add them as metric set producers.
@nichamon
Copy link
Collaborator Author

@tom95858 I fixed the port number and added the attribute to disable LDMSD automatically start a producer added by prdcr_listen. The pull request is ready.

@tom95858 tom95858 merged commit af3c3ef into ovis-hpc:OVIS-4 May 21, 2024
14 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants