Conversation

lvboudre
Contributor

@lvboudre lvboudre commented Aug 1, 2025

Added several metrics for geyser and for tracking client session load.
Client load uses an EMA (exponential moving average) to smooth the metrics, removing noise and extreme short-lived spikes in Prometheus/Grafana.

@lvboudre lvboudre requested a review from linuskendall August 1, 2025 14:44
Comment on lines +1053 to +1054
metrics::incr_grpc_message_sent_counter(&subscriber_id);
metrics::incr_grpc_bytes_sent(&subscriber_id, proto_size);
Contributor Author

@lvboudre lvboudre Aug 1, 2025

@linuskendall
This is the main code path that sends data to the client.
Here I measure the size of the proto payload that will be sent to the client via gRPC.
Only payloads that pass the client's filters are measured.
Finally, I increment the subscriber's message counter.
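
As a rough illustration of the bookkeeping described in this comment, here is a minimal std-only sketch of per-subscriber message and byte counters. `SubscriberTraffic` and `record_sent` are hypothetical names; the PR itself uses Prometheus counter vectors labelled by `subscriber_id`, not a `HashMap`.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for two counters labelled by subscriber id.
#[derive(Debug, Default)]
pub struct SubscriberTraffic {
    messages_sent: HashMap<String, u64>,
    bytes_sent: HashMap<String, u64>,
}

impl SubscriberTraffic {
    /// Records one message that passed the client's filters and was sent.
    /// `payload_len` is the encoded payload size in bytes.
    pub fn record_sent(&mut self, subscriber_id: &str, payload_len: usize) {
        *self
            .messages_sent
            .entry(subscriber_id.to_owned())
            .or_insert(0) += 1;
        *self
            .bytes_sent
            .entry(subscriber_id.to_owned())
            .or_insert(0) += payload_len as u64;
    }

    /// Number of messages sent to this subscriber so far (0 if unknown).
    pub fn messages_sent(&self, subscriber_id: &str) -> u64 {
        self.messages_sent.get(subscriber_id).copied().unwrap_or(0)
    }

    /// Number of payload bytes sent to this subscriber so far (0 if unknown).
    pub fn bytes_sent(&self, subscriber_id: &str) -> u64 {
        self.bytes_sent.get(subscriber_id).copied().unwrap_or(0)
    }
}
```

Both counters only ever increase, so a dashboard would typically derive a rate from them rather than plot the raw totals.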

Comment on lines 926 to 939
set_subscriber_pace(
    &subscriber_id,
    client_loop_pace.current_load().per_second() as i64,
);

set_subscriber_send_bandwidth_load(
    &subscriber_id,
    stream_tx.estimated_send_rate().per_second() as i64,
);

set_subscriber_recv_bandwidth_load(
    &subscriber_id,
    stream_tx.estimated_consuming_rate().per_second() as i64,
);
Contributor Author

Here I measure three things:

The overall loop processing pace (set_subscriber_pace), in geyser events per second.
The actual bandwidth we are sending to the downstream client (only data matching the client's filters).
The actual rate at which the client consumes that data, per second.
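
All three gauges are expressed per second. A minimal sketch of that conversion, assuming a raw count observed over a known interval (`per_second` and `as_gauge_value` are hypothetical helpers, not functions from this PR, which reads the rate from its own load trackers):

```rust
use std::time::Duration;

/// Converts a count observed over `elapsed` into a per-second rate.
/// Returns 0.0 when the interval is empty, to avoid dividing by zero.
pub fn per_second(count: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        count as f64 / secs
    } else {
        0.0
    }
}

/// Rounds a rate to the integer value an integer gauge would store.
pub fn as_gauge_value(rate: f64) -> i64 {
    rate.round() as i64
}
```

For example, 1500 events observed over 500 ms correspond to a pace of 3000 events per second.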

Comment on lines +1225 to +1230
let subscriber_id = request
    .metadata()
    .get("x-subscription-id")
    .and_then(|h| h.to_str().ok().map(|s| s.to_string()))
    .or(request.remote_addr().map(|addr| addr.to_string()));

Contributor Author

To identify a downstream client, I check whether the x-subscription-id header is present; otherwise I fall back to its remote address.
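
The fallback can be sketched in isolation, without the gRPC request type. `resolve_subscriber_id` is a hypothetical helper that takes the already-extracted header value and peer address as plain `Option`s:

```rust
use std::net::SocketAddr;

/// Picks an identifier for a downstream client: the explicit header value
/// when present, otherwise the textual form of the peer address.
pub fn resolve_subscriber_id(
    header_value: Option<&str>,
    remote_addr: Option<SocketAddr>,
) -> Option<String> {
    header_value
        .map(str::to_owned)
        // `or_else` only formats the address when the header is missing.
        .or_else(|| remote_addr.map(|addr| addr.to_string()))
}
```

Note that a `SocketAddr` formats as `ip:port`, so with this fallback two connections from the same host would get different identifiers.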

@@ -650,6 +686,7 @@ impl GrpcService {
}
// Dedup accounts by max write_version
Message::Account(msg) => {
metrics::observe_geyser_account_update_received(msg.account.data.len());
Contributor Author

This line is inside the "geyser loop", which processes every geyser event from agave.
I measure the account data size and record it in a histogram.

Comment on lines +110 to 117
    static ref GEYSER_ACCOUNT_UPDATE_RECEIVED: Histogram = Histogram::with_opts(
        HistogramOpts::new(
            "geyser_account_update_data_size_kib",
            "Histogram of all account update data (kib) received from Geyser plugin"
        )
        .buckets(vec![5.0, 10.0, 20.0, 30.0, 50.0, 100.0, 200.0, 300.0, 500.0, 1000.0, 2000.0, 3000.0, 5000.0, 10000.0])
    ).unwrap();
}
Contributor Author

I opted for a histogram to measure the account data our geyser plugin receives, since a histogram gives us two other metrics for "free":

the total sum of data (geyser_account_update_data_size_kib_sum) and the account update count (geyser_account_update_data_size_kib_count).

The buckets are based on previous work I did for fumarole.
The P90 of account data size should be below 5 KiB.
The P95 should be below 20 KiB.
About 1% of account data can be above 1 MiB.
The largest bucket is ~10 MB, which matches the max account data size.
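
To make the "free" sum and count concrete, here is a minimal std-only sketch of a cumulative histogram. `SketchHistogram` and `bytes_to_kib` are hypothetical; the PR uses the Prometheus `Histogram` type, and whether its observe function converts bytes to KiB is not shown in this excerpt.

```rust
/// Hypothetical cumulative histogram: each bucket counts observations
/// less than or equal to its upper bound, alongside a running sum and count.
pub struct SketchHistogram {
    upper_bounds: Vec<f64>,
    bucket_counts: Vec<u64>,
    sum: f64,
    count: u64,
}

impl SketchHistogram {
    pub fn new(mut upper_bounds: Vec<f64>) -> Self {
        upper_bounds.sort_by(|a, b| a.total_cmp(b));
        let buckets = upper_bounds.len();
        Self {
            upper_bounds,
            bucket_counts: vec![0; buckets],
            sum: 0.0,
            count: 0,
        }
    }

    /// Records one observation in every bucket whose bound covers it.
    pub fn observe(&mut self, value: f64) {
        for (bound, slot) in self.upper_bounds.iter().zip(self.bucket_counts.iter_mut()) {
            if value <= *bound {
                *slot += 1;
            }
        }
        self.sum += value;
        self.count += 1;
    }

    /// Total of all observed values (the `_sum` series).
    pub fn sum(&self) -> f64 {
        self.sum
    }

    /// Number of observations (the `_count` series).
    pub fn count(&self) -> u64 {
        self.count
    }

    /// Cumulative count for an exact bucket bound, if that bucket exists.
    pub fn cumulative_count(&self, upper_bound: f64) -> Option<u64> {
        self.upper_bounds
            .iter()
            .position(|bound| *bound == upper_bound)
            .map(|index| self.bucket_counts[index])
    }
}

/// Assumed conversion from a byte length to KiB.
pub fn bytes_to_kib(len: usize) -> f64 {
    len as f64 / 1024.0
}
```

With buckets chosen this way, a claim such as "P90 below 5 KiB" can be checked by comparing the cumulative count of the 5.0 bucket against the total count.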

/// Exponential Moving Average (EMA) for load tracking.
///
#[derive(Debug)]
pub struct Ema {
Contributor Author

This is an exponential moving average (EMA) algorithm to compute an average load.

It is important because it gives us a smoothed view of the load we are sending to downstream clients and removes noise from the graphs, which gives us better visibility.
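
For readers unfamiliar with the technique, here is a minimal sketch of a textbook EMA over evenly spaced samples. `SimpleEma` is a hypothetical type with a fixed smoothing factor `alpha`; it is not the `Ema` struct from this PR, whose fields and time handling are not shown here.

```rust
/// Hypothetical textbook EMA: new = alpha * sample + (1 - alpha) * previous.
#[derive(Debug, Clone)]
pub struct SimpleEma {
    /// Smoothing factor in (0, 1]; higher values react faster to changes.
    alpha: f64,
    /// `None` until the first sample arrives.
    value: Option<f64>,
}

impl SimpleEma {
    pub fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
        Self { alpha, value: None }
    }

    /// Folds one sample into the average and returns the updated value.
    /// The first sample seeds the average directly.
    pub fn update(&mut self, sample: f64) -> f64 {
        let next = match self.value {
            None => sample,
            Some(previous) => self.alpha * sample + (1.0 - self.alpha) * previous,
        };
        self.value = Some(next);
        next
    }

    /// Current smoothed value, if any sample has been seen.
    pub fn current(&self) -> Option<f64> {
        self.value
    }
}
```

With `alpha = 0.25`, a single-sample spike from 100 to 1000 moves the average only to 325, and the next normal sample pulls it back to 268.75, which is the smoothing effect described above.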

@lvboudre lvboudre requested a review from leafaar August 1, 2025 16:32
@lvboudre lvboudre requested a review from leafaar August 1, 2025 20:22
@@ -67,6 +69,59 @@ lazy_static::lazy_static! {
Opts::new("missed_status_message_total", "Number of missed messages by commitment"),
&["status"]
).unwrap();

static ref GRPC_MESSAGE_SENT: IntCounterVec = IntCounterVec::new(
Contributor Author

Measures how many geyser events we sent to each subscriber.

&["subscriber_id"]
).unwrap();

static ref GRPC_BYTES_SENT: IntCounterVec = IntCounterVec::new(
Contributor Author

Measures how many bytes (protobuf-encoded data) we sent to each subscriber.

@lvboudre lvboudre merged commit 04ad134 into master Aug 4, 2025
4 checks passed
@lvboudre lvboudre deleted the better-load-monitoring branch August 4, 2025 17:24
@lvboudre lvboudre mentioned this pull request Aug 4, 2025
2 participants