[feature] Support replication lag checking #4

Open · bbrodriges wants to merge 2 commits into master

Conversation

bbrodriges (Contributor)

This commit adds the ability to filter replica nodes by replication lag. The user can provide a ReplicationLagChecker function and a MaxReplicationLag value as options to the cluster constructor to check nodes and remove any replica that is too slow.

Example:

// Construct cluster nodes
nodes := []hasql.Node{...}

// Use options to fine-tune cluster behavior
opts := []hasql.ClusterOption{
	hasql.WithUpdateInterval(2 * time.Second),        // set custom update interval
	hasql.WithNodePicker(hasql.PickNodeRoundRobin()), // set desired nodes selection algorithm
	hasql.WithReplicationLagChecker(checkers.PostgreSQLReplicationLag), // set appropriate replication lag checker
	hasql.WithMaxReplicationLag(10 * time.Second), // set desired maximum lag value
}

// Create cluster handler
c, err := hasql.NewCluster(nodes, checkers.PostgreSQL, opts...)
if err != nil {
	panic(err)
}
defer func() { _ = c.Close() }() // close cluster when it is no longer needed

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

// Wait for any alive standby
node, err := c.WaitForStandby(ctx)
if err != nil {
	panic(err)
}

...

bbrodriges requested a review from sidh on June 22, 2021
codecov-commenter commented on Jun 22, 2021

Codecov Report

Merging #4 (e71a5ac) into master (ebd5959) will decrease coverage by 2.35%.
The diff coverage is 65.51%.


@@            Coverage Diff             @@
##           master       #4      +/-   ##
==========================================
- Coverage   85.25%   82.90%   -2.35%     
==========================================
  Files           7        7              
  Lines         217      234      +17     
==========================================
+ Hits          185      194       +9     
- Misses         27       35       +8     
  Partials        5        5              
Flag Coverage Δ
unittests 82.90% <65.51%> (-2.35%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files    Coverage Δ
node.go           75.00% <ø> (ø)
cluster_opts.go   83.33% <50.00%> (-16.67%) ⬇️
check_nodes.go    84.84% <63.63%> (-7.88%) ⬇️
cluster.go        88.49% <100.00%> (+0.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

return fmt.Errorf("cannot check node replication lag: %w", err)
}
if lag < maxLag {
return fmt.Errorf("replication lag is too big: %s", lag)
Collaborator

We need typed errors since tracer.NodeDead can return an error for both 'can't check' and 'too much lag'.

The question is: do we need to create an actual type for replication lag? The user might want to know the exact lag, or they might not. Sentinel errors are a bit easier to work with but provide less info.
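
For illustration, a typed error could carry the measured lag while staying matchable with errors.As. A minimal sketch, assuming the ReplicationLagError name (it is not part of this PR):

// ReplicationLagError is a hypothetical typed error for the 'too much lag' case.
type ReplicationLagError struct {
	Lag time.Duration // measured lag, so the user can inspect it
	Max time.Duration // configured maximum
}

func (e *ReplicationLagError) Error() string {
	return fmt.Sprintf("replication lag %s exceeds maximum %s", e.Lag, e.Max)
}

// Caller side, to tell 'too much lag' apart from 'cannot check':
//
//	var lagErr *ReplicationLagError
//	if errors.As(err, &lagErr) {
//		fmt.Printf("replica lags by %s\n", lagErr.Lag)
//	}

A sentinel (var ErrReplicationLag = errors.New("replication lag is too big")) wrapped with %w would also work via errors.Is, just without exposing the exact lag value.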

@@ -53,6 +53,9 @@ type Cluster struct {
 	updateTimeout time.Duration
 	checker       NodeChecker
 	picker        NodePicker
+	// Replication lag configuration
+	lagChecker  ReplicationLagChecker
+	maxLagValue time.Duration
Collaborator

maxAllowedLag is more descriptive IMO.

Node Node

Primary bool
Collaborator

I really don't like adding more transitory values to the node. This provokes accessing them after the check itself, when these values might be invalid.

Can you please add a comment before these values making it clear they are transitory and should not be used after the check itself? Also, making them private seems like a good idea.
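
For illustration, the requested comment plus private fields could look like this. A sketch only; the checkedNode name and field set are assumed from the diff above, not taken from the PR:

// checkedNode is hypothetical; fields mirror the diff above.
type checkedNode struct {
	node Node

	// Transitory check results. They are only meaningful during the
	// health check that produced them and must not be read afterwards.
	primary bool
	lag     time.Duration
}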


// PostgreSQLReplicationLag returns the replication lag value for a PostgreSQL replica node.
func PostgreSQLReplicationLag(ctx context.Context, db *sql.DB) (time.Duration, error) {
	return ReplicationLag(ctx, db, "SELECT NOW() - pg_last_xact_replay_timestamp()")
}

Unfortunately, on a low-activity cluster this would lead to marking all replicas as lagging: pg_last_xact_replay_timestamp() stops advancing when there are no transactions to replay, so NOW() - pg_last_xact_replay_timestamp() grows even on a fully caught-up replica.
We should probably consider using pg_stat_replication on the primary with a filter on reply_time.

secwall commented on Jun 22, 2021

We should also consider a corner case with a dead primary.
So it seems that our logic should be more complex (and stateful):

  1. If primary is alive we could select replicas from pg_stat_replication with state = 'streaming' and reply_time < <some threshold> and replay_lag < <some threshold> (and remember a set of replicas passing our checks on each iteration)
  2. If primary is dead and we have previous replica lag selection results we could use them
  3. If primary is dead and we have no previous replica lag selection results we could take the replica with the maximal pg_last_wal_replay_lsn() and compute lags on the other replicas from its pg_last_xact_replay_timestamp() (with select max_replica_replay_ts - pg_last_xact_replay_timestamp())

Any thoughts?
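
For illustration, the primary-side check in step 1 could look roughly like this. A sketch only: the streamingReplicas helper, the thresholds, and the map shape are assumptions, and the reply_time column requires PG 12 or newer:

// streamingReplicas is a hypothetical helper for step 1: it asks the primary
// which replicas are streaming, have replied recently, and are replaying
// within maxLag. COALESCE treats a NULL replay_lag (an idle, fully
// caught-up standby) as zero.
func streamingReplicas(ctx context.Context, primary *sql.DB, maxLag time.Duration) (map[string]struct{}, error) {
	const q = `
		SELECT application_name
		FROM pg_stat_replication
		WHERE state = 'streaming'
		  AND reply_time > NOW() - $1::interval
		  AND COALESCE(replay_lag, INTERVAL '0') < $1::interval`
	rows, err := primary.QueryContext(ctx, q, maxLag.String())
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	good := make(map[string]struct{})
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		good[name] = struct{}{}
	}
	return good, rows.Err()
}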

sidh (Collaborator) commented on Jun 22, 2021

> We should also consider a corner case with a dead primary.
> So it seems that our logic should be more complex (and stateful):
>
> 1. If primary is alive we could select replicas from `pg_stat_replication` with `state = 'streaming'` and `reply_time < <some threshold>` and `replay_lag < <some threshold>` (and remember a set of replicas passing our checks on each iteration)
>
> 2. If primary is dead and we have previous replica lag selection results we could use them
>
> 3. If primary is dead and we have no previous replica lag selection results we could take the replica with the maximal `pg_last_wal_replay_lsn()` and compute lags on the other replicas from its `pg_last_xact_replay_timestamp()` (with `select max_replica_replay_ts - pg_last_xact_replay_timestamp()`)
>
> Any thoughts?

Sounds good. The only possible problem I see is keeping state. That might be a bit more complicated than it sounds. But gotta try first.
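
For what it's worth, the state could be as small as the last successful selection plus a timestamp. A sketch, with all names hypothetical:

// lagState remembers the last replica set that passed the primary-side
// check, for reuse when the primary is dead (step 2 of the proposal).
type lagState struct {
	mu       sync.Mutex
	lastGood map[string]struct{} // replicas that passed the last check
	checked  time.Time           // when that check happened
}

func (s *lagState) remember(good map[string]struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastGood, s.checked = good, time.Now()
}

func (s *lagState) last() (map[string]struct{}, time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.lastGood, s.checked
}

The timestamp matters because a remembered selection goes stale; the caller has to decide how old is too old.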

bbrodriges (Contributor, Author) commented on Jun 23, 2021

> 1. If primary is alive we could select replicas from `pg_stat_replication` with `state = 'streaming'` and `reply_time < <some threshold>` and `replay_lag < <some threshold>` (and remember a set of replicas passing our checks on each iteration)

We should probably measure lag only from async streaming replicas.

bbrodriges (Contributor, Author)

Also, the reply_time column was only introduced in PG 12. We need to find a way to mimic this column for older versions.
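
reply_time itself probably can't be reconstructed on older servers; one possible fallback (an assumption, not something this PR implements) is to gate the filter on the server version and rely on replay_lag alone for PG 10 and 11:

// lagFilter is a hypothetical helper picking a pg_stat_replication WHERE
// fragment by server version: reply_time exists only in PG 12+, the
// *_lag columns in PG 10+. Pre-10 servers would need a replica-side check.
func lagFilter(ctx context.Context, db *sql.DB) (string, error) {
	var v int
	err := db.QueryRowContext(ctx, "SELECT current_setting('server_version_num')::int").Scan(&v)
	if err != nil {
		return "", err
	}
	if v >= 120000 {
		return "reply_time > NOW() - $1::interval AND COALESCE(replay_lag, INTERVAL '0') < $1::interval", nil
	}
	return "COALESCE(replay_lag, INTERVAL '0') < $1::interval", nil
}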

secwall commented on Jun 23, 2021

> > 1. If primary is alive we could select replicas from `pg_stat_replication` with `state = 'streaming'` and `reply_time < <some threshold>` and `replay_lag < <some threshold>` (and remember a set of replicas passing our checks on each iteration)
>
> We should probably measure lag only from async streaming replicas.

Hmm, should we?
There are several possible cases:

  1. We use quorum replication => all replicas are quorum and only analytical ones are async. Should we limit our selection to analytical ones? Probably not. But different quorum replicas could have different lag, so we should check it.
  2. We use `on` or `remote_write` for synchronous_commit (for performance reasons). In this mode sync replicas could be significantly behind async ones in terms of WAL apply, so sync replicas should also be checked for lag.
  3. We use `remote_apply` for synchronous_commit. Yep, this is the case where we could omit checking sync replica lag. Unfortunately, synchronous_commit is a session-level setting, so some sessions could fall back to case number 2 (see the sketch after this list). We should probably be ready for that.
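
To illustrate the caveat in case 3: any session can downgrade synchronous_commit for itself, so remote_apply semantics can never be assumed cluster-wide. A sketch, assuming an open *sql.DB:

func demoSessionLevelSyncCommit(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()
	// Session-level override: even with remote_apply as the server default,
	// commits on this connection now return after remote_write, before sync
	// replicas have applied the WAL (case 2 above).
	_, err = conn.ExecContext(ctx, "SET synchronous_commit = 'remote_write'")
	return err
}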
