Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

+metrics #30 basic metrics "skeleton" #42

Merged
merged 3 commits into from
Aug 27, 2019
Merged

+metrics #30 basic metrics "skeleton" #42

merged 3 commits into from
Aug 27, 2019

Conversation

ktoso
Copy link
Member

@ktoso ktoso commented Aug 26, 2019

Introduces skeleton for using metrics in the actor system

Resolves #30

Motivation:

A core feature of the actor system is to be introspectible by default.
Metrics must be offered at the core, and should include general things like actor counts (per groups), message queues and latencies etc.

This is just a "skeleton", we'll get more metrics in place as we go.

Modifications:

  • system gains .metrics which has all the well typed and helpers to emitting metrics

  • settings for the metrics, though not much action there yet

  • just two PoC metrics for now: actor count and membership status:

# Metrics is the name of the actor system, so it'd be whatever users name their system/node with.

# TYPE Metrics.actors.lifecycle counter
Metrics.actors.lifecycle 0
Metrics.actors.lifecycle{root="/user", event="start"} 13
Metrics.actors.lifecycle{root="/user", event="stop"} 5
Metrics.actors.lifecycle{root="/system", event="start"} 6
# TYPE Metrics.cluster.members gauge
Metrics.cluster.members 1.0 # for simplicity of reporting == .up members
Metrics.cluster.members{status="joining"} 0.0
Metrics.cluster.members{status="up"} 1.0
Metrics.cluster.members{status="down"} 0.0
Metrics.cluster.members{status="leaving"} 0.0
Metrics.cluster.members{status="removed"} 0.0
Metrics.cluster.members{reachability="unreachable"} 0.0

Need to figure out if we need to push upstream the AddGauge or not.

Result:

Basic actor system metrics;


/// Timing how long it takes to converge (i.e. an update to reach all members)
// TODO: how to measure this without huge overhead, maybe opt in
// let crdt_convergence_time:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yim-lee I wonder about pulling off that metrics, this is not for now at all, but something to keep in mind for later on. I know for sure this would quite excite some people... :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. would be interesting if we can pull this off. need to define what a unit of update is and when it starts (presumably that's whenever ActorOwned sends write command to local replicator?), when the updates reaches all members (after direct replication? gossip?). and since gossip is periodic, it might seem like convergence takes longer than it should but maybe that's ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(presumably that's whenever ActorOwned sends write command to local replicator?)

yeah... I think we'll pull it off by "attach some metadata to envelope when pereforming the write (that's "baggage" I sometimes mention), and we carry it around during replication; once a replica notices that "all nodes have now seen this write" we emit a "replication complete: timestamp" event or something like that...

Definitely doable, just need to think it through when we get there :)

// let crdt_convergence_time:

// ==== ------------------------------------------------------------------------------------------------------------
// MARK: Actors Group-metrics (i.e. all actors of given "type" or "role")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in general measuring every single actor by itself is "too much", so that's why we have the metrics .group so we can group them and offer "those actors in this group process on avg like that" etc.

} else {
self.actors_count.increment()
}
}
Copy link
Member Author

@ktoso ktoso Aug 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a case that examplifies why we need the "add" @tomerd

I'm still thinking of workarounds and/or how to change the metrics api... Gauge IS-A Recorder does not really fit.., since we'd have to force RecorderHandler to accept add/sub which seems wrong. Gauge is its own thing really if we decide to go this way. OR we make AddGauge a completely new type that can be supported?


We figured it out, by expressing the lifecycle as independent events and their counters we get all we need and even more information 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yim-lee for the idea!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By forcing you to read CRDT papers? Yup, that's me. 😃

@ktoso ktoso requested review from tomerd, drexin and yim-lee August 26, 2019 08:16
} else {
self.actors_count.increment()
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yim-lee for the idea!

@@ -164,6 +167,7 @@ internal final class ActorShell<Message>: ActorContext<Message>, AbstractActor {
_ = system.userCellInitCounter.sub(1)
}
#endif
system.metrics.recordActorStop(self)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ note: I'd want to have semantically meaningful "do a thing" methods for all metrics, we should not have to wiggle around with +1 or other "which counter exactly" here; all this should be encapsulated in the Metrics files so we can quickly skim it there what counter is updated when.

// let path = self._myCell.address.description
// return "\(type(of: self))(\(path))"
let prettyTypeName = String(reflecting: Message.self).split(separator: ".").dropFirst().joined(separator: ".")
return "ActorShell<\(prettyTypeName)>(\(self.path))"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidenote: without this the type was quite useless -- everything was <Message> 😉

self.cluster_members_leaving.record(leaving)
self.cluster_members_removed.record(removed)
self.cluster_unreachable_members.record(unreachable)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least it is O(n) over the members and not worse... we could make the membership keep count as well, though realistically this is fine.

@@ -160,17 +177,21 @@ let package = Package(
/* --- samples --- */

.executable(
name: "DistributedActorsSampleDiningPhilosophers",
name: "SampleDiningPhilosophers",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 like the shorter names

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

Sources/DistributedActors/Props.swift Show resolved Hide resolved
}

public init() {
self.init(mailbox: .default(), dispatcher: .default, supervision: .init())
self.init(mailbox: .default(), dispatcher: .default, supervision: .init(), metrics: .default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: we are not being consistent with how default instance is created (.default() vs. .default vs. .init())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, going to stick to default 👍

public init(mailbox: MailboxProps, dispatcher: DispatcherProps, supervision: SupervisionProps) {
public var metrics: MetricsProps

public init(mailbox: MailboxProps, dispatcher: DispatcherProps, supervision: SupervisionProps, metrics: MetricsProps) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we combine this initializer with init() and have a single init where all parameters as optionals instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okey :)

Sources/DistributedActors/Props.swift Show resolved Hide resolved

let actorsLifecycle = settings.makeLabel("actors", "lifecycle")
self.actors_count_user = .init(label: actorsLifecycle, positive: [rootUser, dimStart], negative: [rootUser, dimStop])
self.actors_count_system = .init(label: actorsLifecycle, positive: [rootSystem, dimStart], negative: [rootSystem, dimStop])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool~

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for your help here :-)

/// Timing how long it takes to converge (i.e. an update to reach all members)
// TODO: how to measure this without huge overhead, maybe opt in
// let crdt_convergence_time:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This'll be very exciting; I know for sure some people internally will be very excited if we have CRDTs with convergence metrics <3

// ==== ------------------------------------------------------------------------------------------------------------
// MARK: Cluster Metrics

let cluster_members: Gauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason these variable names use underscore instead of camel case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of wanted them to be visually same as the names they end up being in the metrics:

/// actors.count { root=/system, event=stop }
let actors_count ... 

/// cluster.members ... 
let cluster_members ... 

etc

If too annoying we can move to camel, WDYT 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Just curious. 👍

Maybe document that in the code so we can be consistent with the pattern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks for noticing it's not documented wholely :)

@@ -21,6 +21,7 @@ declare -r logs=$1
failures_count=0
failures=()

IFS=$'\n'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(was broken without this, makes \n the separator for the for)

@ktoso
Copy link
Member Author

ktoso commented Aug 27, 2019

Addressed feedback in a18fe28

@ktoso
Copy link
Member Author

ktoso commented Aug 27, 2019

Failure was the known #17 which we should look at; likely fleaky tests

@ktoso
Copy link
Member Author

ktoso commented Aug 27, 2019

@swift-server-bot test this please

@ktoso ktoso merged commit c7ebbb6 into apple:master Aug 27, 2019
@ktoso ktoso deleted the wip-metrics branch August 27, 2019 05:34
Copy link
Member

@yim-lee yim-lee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor things 👍

@@ -18,6 +18,8 @@ public struct SupervisionProps {
// on purpose stored as list, to keep order in which the supervisors are added as we "scan" from first to last when we handle
internal var supervisionMappings: [ErrorTypeBoundSupervisionStrategy]

public static let `default`: SupervisionProps = .init()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

public init() {
self.init(mailbox: .default(), dispatcher: .default, supervision: .init())
self.init(mailbox: .default(), dispatcher: .default, supervision: .default, metrics: .default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

self.mailbox = mailbox
self.dispatcher = dispatcher
self.supervision = supervision
self.metrics = metrics
}

public init() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess technically we don't need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah true, removing

let negative: Counter

init(label: String, positive positiveDimensions: [(String, String)] = [], negative negativeDimensions: [(String, String)] = []) {
assert(positiveDimensions.map { "\($0)\($1)" }.joined() != negativeDimensions.map { "\($0)\($1)" }.joined(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 good assertion


init(label: String, positive positiveDimensions: [(String, String)] = [], negative negativeDimensions: [(String, String)] = []) {
assert(positiveDimensions.map { "\($0)\($1)" }.joined() != negativeDimensions.map { "\($0)\($1)" }.joined(),
"Dimensions for PNCounter pair [\(label)] MUST NOT be equal.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PNCounter intended? Not MetricsPNCounter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

public var systemName: String?

/// Segment prefixed before all metrics exported automatically by the actor system.
public var systemMetricsPrefix: String? = "sact"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


/// Prefix all metrics with this segment.
///
/// Defaults to the actor systems' name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

systems' => system's?

@ktoso
Copy link
Member Author

ktoso commented Aug 27, 2019

Whoops thanks will follow up on those comments 👍

ktoso added a commit that referenced this pull request Aug 31, 2019
* +metrics #30 basic metrics "skeleton"

* +metrics #30 implement actor lifecycle as start/stop pn-counter metric

* Addressed review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Offer basic actor metrics
2 participants