-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement sampler discovery using regular expressions #1380
Conversation
ldms/man/ldmsd_sampler_discovery.man
Outdated
The authentication domain to be used to connect to the aggregator | ||
.RE | ||
|
||
\fBadvertise_start\fR starts an advertisement. The parameters are: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use case for having separate start/stop functions? Why not just start on _add() and stop on _del()? (And maybe then rename: s/add/start, s/del/stop)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I didn’t implement start
and stop
functionality. I didn't even implement del
as I don't see a use for it. However, when I thought about troubleshooting scenarios, it’s better to have the stop
command available.
Here are the reasons:
-
Separating deletion and stopping from each other allows users to restart the prdcr_listen without typing the whole
add
line again. Use cases of stopping a prdcr_listen happen when things don’t go as planned. For example, the aggregator cannot keep up with updating sets, while more producers are being automatically added to the aggregators. To inspect the situation, I would like to stop adding more producers to the aggregator to prevent adding more load on it. Hence, prdcr_listen_stop would become useful. Whether that will happen, I don’t know. But at some point, we will definitely need to do troubleshooting, and an ability to stop a prdcr_listen could be beneficial. -
Providing prdcr_listen_stop calls for the existence of prdcr_listen_start to restart a stopped prdcr_listen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Since those are pretty uncommon use cases, how about instead of having the current add/start/stop/del, have:
_start(takes all the options that your _add currently takes)
_pause
_resume
_stop
The most common use cases would only do start and maybe stop. But people that need to pause and resume for debugging/whatever can do so.
This needs to take hostlist syntax in addition to or preferably instead of regex for identifying lists of aggregators or lists of producers. It's difficult to correctly use regex to yield the equivalent of lists with holes for persistently unused equipment, e.g. node[1-3,5-22,24-80]. |
That would seem to be counter to the goal of this PR. The goal seems to be to allow any sampler daemons matching a pattern to connect, and not have to know what nodes those are going to be. Gaps of persistently unsused equipment would not pose a problem, as far as I can tell, because unused equipment presumably can't connect to the aggregator. In which case I'm not entirely clear on what the use case is for the regex. Maybe it is just a sanity check? Hostlist support would be great on the aggregator side when this PR's code is not being used. It would allow a single section rather than repeating prdcr over-and-over for each sampler daemon. But for better or worse, that is currently being addressed at the maestro config level rather than in the ldmsd native configuration language. |
@baallan, "persistently unused equipment" wouldn't be advertising. Also the whole point of this is avoiding the hostlist syntax. I do agree that supporting the hostlist syntax may be nice because it's easier to write than regex, but the whole point is to avoid having to use it. To @morrone question, the idea is that all of the aggregators know about all the samplers, but they only actually communicate with the one's that have been started. The intention here is to significantly reduce the complexity of the sampler configuration at the aggregator; especially in the case of cloud deployment where the node that is being sampled is walking around because the cloud logic is messing with you. |
@tom95858 what's the point in avoiding the hostlist syntax? it makes admin work much more pleasant and there are open-source codes for python and c to parse it. |
@baallan, i think you're missing the point. The idea is not to avoid the hostlist syntax, it's to avoid the hostlist altogether. |
ldms/man/ldmsd_sampler_discovery.man
Outdated
**Aggregator Side Commands** | ||
|
||
.IP \fBprdcr_listen-add | ||
.RI "name=" NAME " regex=" REGEX " " RECONNECT=" INTVL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prdcr_listen_add This needs to include a parameter to specify matching a specific user owning the producer (i.e. a user-owned sampler daemon on a remote host must be blockable from getting included on the basis of a munge confirmed failure to be uid 0 (if uid 0 is required). I expect there may be some scenarios where this is optional and an admin might want to allow user-owned daemons to publish.
As @morrone noted, the place for hostlist syntax is in prdcr_listen_add, and even more strongly it is in prdcr_add (and friends) in the existing agg configuration for scenarios where advertising through firewalls isn't possible. |
ldms/man/ldmsd_sampler_discovery.man
Outdated
.PP | ||
**Aggregator Side Commands** | ||
|
||
.IP \fBprdcr_listen-add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect there is a typo here: s/-/_/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morrone Thanks for catching this and looking into the details!
I am not sure that I follow. Maybe we're using terms differently. How, in the scope of the new functionality in this PR, would an aggregator "know about" any of the samplers without the samplers first advertising themselves to the aggregator? I am not advocating that hostlist be added to prdcr_listen_add. As Tom said that would seem to be counter to the point of this PR: to "avoid the hostlist altogether". But for the very same reason, the "regex" in prdcr_listen_add should also be removed, right? If there is a reason to have a regex to match hostnames, then it is almost certainly better to have a hostlist. But I am suggesting that we should have neither of them, and the regex should be removed. I also still don't understand the point of prdr_listen_start and prdcr_listen_stop. Why doesn't it just start on prdcr_listen_add and stop on prdcr_listen_del? |
@morrone, the purpose of the regex on the listen is to handle the case where we are monitoring multiple clusters. So we've got cluster A, B, and C. I want to have three producer listeners, one for each cluster. So I do a prdcr_listen for each cluster with a different regex for each one, A., B., and C.*. Now we're just spinning this up, so maybe the regex isn't necessary since the start and stop paths are still there. Anyway, that was the rationale. I was worried about accepting connections from anyone to exchange the sampler information and wanted to have a way to filter the call to accept. |
Because all of the aggregators know about all of the samplers, but they are not responsible for them. So the start and stop are there to alert a given aggregator that it is responsible for a particular client. The reason it's done this way is to reduce the failover time when an aggregator goes away, IOW, it doesn't require adding, configuring and starting a new producer...just starting it. |
I don't think it works that way. Maybe you are confusing prdcr_listen_add with advertise_start? prdcr_listen_add with doesn't take any options that would allow it to differentiate individual clients. |
Surely any sampler trying to advertise to the aggregator needs to be authenticated and authorized well before the regex code comes into play, correct? If it doesn't then I think there is work that needs to be done before this can land. Because a regex is entirely insufficient to suffice as a security mechanism. |
@tom95858 @morrone @baallan @valleydlr It's better to continue the discussion in a meeting. I'll send an email to setup a meeting next week. |
ldms/man/ldmsd_sampler_discovery.man
Outdated
**Aggregator Side Commands** | ||
|
||
.IP \fBprdcr_listen-add | ||
.RI "name=" NAME " regex=" REGEX " " RECONNECT=" INTVL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a discussion this morning, we talked about making the regex parameter optional. If it is not provided, we accept all.
I changed this to a draft because I'm in the middle of addressing the following.
|
a623d04
to
94b2dc7
Compare
I made the three changes. |
This commit introduces a new feature that simplifies aggregator configuration. * Previously, admins needed to manually specify hostnames for all samplers in the aggregator configuration using the `prdcr_add` command. * This change enables samplers to advertise themselves to an aggregator. Admins specifies the aggregator hostname and listening port in sampler configuration via the `advertise_add` command and start the advertisement with the `advertise_start` command. The samplers now advertise their hostname to the aggregator. * On the aggregator, admins may specify a regular expression to be matched with sampler hostname or an IP range in the CIDR format using the `prdcr_listen_add` command. The `prdcr_listen_start` command is used to tell the aggregator to automatically add producers corresponding to a sampler of which the hostname matches the regular expressions or the IP is in the subnet mask given at the `prdcr_listen_add` line. If neither of the regular expression or the IP range is given, LDMSD creates a producer when it receives an advertisement. This feature eliminates the need for manual configuration of sampler hostnames in the aggregator configuration file. The aggregator can now automatically discover samplers and add them as metric set producers.
@tom95858 I fixed the port number and added the attribute to disable LDMSD automatically start a producer added by prdcr_listen. The pull request is ready. |
This commit introduces a new feature that simplifies aggregator configuration.
prdcr_add
command.advertise_add
command and start the advertisement with theadvertise_start
command. The samplers now advertise their hostname to the aggregator.prdcr_listen_add
command. Theprdcr_listen_start
command is used to tell the aggregator to automatically add producers corresponding to a sampler of which the hostname matches the regular expression.This feature eliminates the need for manual configuration of sampler hostnames in the aggregator configuration file. The aggregator can now automatically discover samplers and add them as metric set producers.