Better load monitoring #609
Conversation
metrics::incr_grpc_message_sent_counter(&subscriber_id);
metrics::incr_grpc_bytes_sent(&subscriber_id, proto_size);
@linuskendall
This is the main sending code to the client.
Here I measure the size of the proto payload that will be sent to the client via gRPC.
Only payload that passes the client's filter is measured.
Finally, I increment the subscriber message counter.
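As a rough, self-contained sketch of where a value like `proto_size` could come from, assuming the outgoing update is a prost-encoded message (the metric names, the `record_send` function, and the counter statics here are illustrative stand-ins, not the PR's actual definitions):

```rust
use prometheus::{IntCounterVec, Opts};
use prost::Message;

lazy_static::lazy_static! {
    // Assumed counters mirroring the GRPC_MESSAGE_SENT / GRPC_BYTES_SENT statics shown later.
    static ref MSGS_SENT: IntCounterVec = IntCounterVec::new(
        Opts::new("grpc_message_sent_total", "Messages sent per subscriber"),
        &["subscriber_id"]
    ).unwrap();
    static ref BYTES_SENT: IntCounterVec = IntCounterVec::new(
        Opts::new("grpc_bytes_sent_total", "Protobuf bytes sent per subscriber"),
        &["subscriber_id"]
    ).unwrap();
}

// Record one filtered update just before it goes out on the gRPC stream.
fn record_send(subscriber_id: &str, update: &impl Message) {
    // encoded_len() returns exactly how many bytes prost will serialize.
    let proto_size = update.encoded_len();
    MSGS_SENT.with_label_values(&[subscriber_id]).inc();
    BYTES_SENT
        .with_label_values(&[subscriber_id])
        .inc_by(proto_size as u64);
}
```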
yellowstone-grpc-geyser/src/grpc.rs
set_subscriber_pace(
    &subscriber_id,
    client_loop_pace.current_load().per_second() as i64,
);

set_subscriber_send_bandwidth_load(
    &subscriber_id,
    stream_tx.estimated_send_rate().per_second() as i64,
);

set_subscriber_recv_bandwidth_load(
    &subscriber_id,
    stream_tx.estimated_consuming_rate().per_second() as i64,
);
Here I measure three things:
The overall loop processing pace (set_subscriber_pace), where the unit is geyser events per second.
The actual bandwidth load we are sending to the downstream client (only data matching the client's filters).
The actual bandwidth consumption rate of the client, per second.
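On the metrics side, a setter like set_subscriber_pace could be a thin wrapper over a per-subscriber gauge. A minimal sketch, assuming a prometheus IntGaugeVec (the gauge name and registration details are assumptions; only the setter name comes from the diff):

```rust
use prometheus::{IntGaugeVec, Opts};

lazy_static::lazy_static! {
    // Assumed gauge: geyser events processed per second by the client loop.
    static ref SUBSCRIBER_PACE: IntGaugeVec = IntGaugeVec::new(
        Opts::new("subscriber_pace", "Client loop processing pace (geyser events/s)"),
        &["subscriber_id"]
    ).unwrap();
}

// Overwrite the latest pace sample for this subscriber.
pub fn set_subscriber_pace(subscriber_id: &str, pace: i64) {
    SUBSCRIBER_PACE.with_label_values(&[subscriber_id]).set(pace);
}
```

The send/recv bandwidth setters would follow the same pattern with their own gauges.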
let subscriber_id = request
    .metadata()
    .get("x-subscription-id")
    .and_then(|h| h.to_str().ok().map(|s| s.to_string()))
    .or(request.remote_addr().map(|addr| addr.to_string()));
In order to identify a downstream client, I check whether the x-subscription-id header is present; otherwise I fall back to its IP address.
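For reference, a downstream client built on tonic could label its own session like this (a sketch; the helper name and identifier value are placeholders, only the `x-subscription-id` header name comes from the code above):

```rust
use tonic::Request;

// Attach a stable identifier so the server labels this session's metrics
// by `x-subscription-id` instead of the client's IP address.
fn tag_request<T>(msg: T, subscription_id: &str) -> Request<T> {
    let mut request = Request::new(msg);
    request
        .metadata_mut()
        .insert("x-subscription-id", subscription_id.parse().unwrap());
    request
}
```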
@@ -650,6 +686,7 @@ impl GrpcService {
            }
            // Dedup accounts by max write_version
            Message::Account(msg) => {
                metrics::observe_geyser_account_update_received(msg.account.data.len());
This line of code is inside the "geyser loop", which processes every geyser event from agave.
I measure the account data size and record it in a histogram.
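A sketch of what the observe helper might do, assuming it converts the raw byte length to KiB before recording it (the conversion is an assumption inferred from the `_kib` metric name and the bucket values below; the helper body is not taken from the PR):

```rust
// Record one account update's data size, in KiB, on the histogram defined below.
pub fn observe_geyser_account_update_received(data_len: usize) {
    GEYSER_ACCOUNT_UPDATE_RECEIVED.observe(data_len as f64 / 1024.0);
}
```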
static ref GEYSER_ACCOUNT_UPDATE_RECEIVED: Histogram = Histogram::with_opts(
    HistogramOpts::new(
        "geyser_account_update_data_size_kib",
        "Histogram of all account update data (kib) received from Geyser plugin"
    )
    .buckets(vec![5.0, 10.0, 20.0, 30.0, 50.0, 100.0, 200.0, 300.0, 500.0, 1000.0, 2000.0, 3000.0, 5000.0, 10000.0])
).unwrap();
}
I opted for a histogram to measure the account data our geyser plugin receives, since a histogram gives us two other metrics for "free": the total sum of data (geyser_account_update_data_size_kib_sum) and the account update count (geyser_account_update_data_size_kib_count).
The buckets are based on previous work I did for fumarole:
The P90 of account data size should be below 5 KiB.
The P95 should be below 20 KiB.
About 1% of account data can be above 1 MiB.
The largest bucket is ~10 MB, which matches the maximum account data size.
/// Exponential Moving Average (EMA) for load tracking.
///
#[derive(Debug)]
pub struct Ema {
This is the Exponential Moving Average (EMA) algorithm used to compute an average load.
It matters because it gives a smoothed view of the load we are sending to each downstream client and removes noise from the graphs, giving us better visibility.
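A minimal sketch of the EMA update rule, assuming a fixed smoothing factor alpha (the field names and exact update scheme in the PR may differ; only the `Ema` name comes from the diff):

```rust
/// Exponential Moving Average: each new sample is blended into the running
/// value, so recent samples dominate while old noise decays away.
#[derive(Debug)]
struct EmaSketch {
    alpha: f64,         // smoothing factor in (0, 1]; higher reacts faster
    value: Option<f64>, // None until the first sample seeds the average
}

impl EmaSketch {
    fn new(alpha: f64) -> Self {
        Self { alpha, value: None }
    }

    /// ema = alpha * sample + (1 - alpha) * previous_ema
    fn update(&mut self, sample: f64) -> f64 {
        let next = match self.value {
            Some(prev) => self.alpha * sample + (1.0 - self.alpha) * prev,
            None => sample,
        };
        self.value = Some(next);
        next
    }
}
```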
@@ -67,6 +69,59 @@ lazy_static::lazy_static! {
        Opts::new("missed_status_message_total", "Number of missed messages by commitment"),
        &["status"]
    ).unwrap();

    static ref GRPC_MESSAGE_SENT: IntCounterVec = IntCounterVec::new(
Measures how many geyser events we send to each subscriber.
&["subscriber_id"] | ||
).unwrap(); | ||
|
||
static ref GRPC_BYTES_SENT: IntCounterVec = IntCounterVec::new( |
Measures how many bytes (protobuf-encoded data) we send to each subscriber.
Added several metrics for geyser and for load tracking of client sessions.
Client load uses an EMA (Exponential Moving Average) to smooth out the metrics, removing noise and extreme short-lived spikes in Prometheus/Grafana.