[feature] Support replication lag checking #4

Open · bbrodriges wants to merge 2 commits into master

Conversation

bbrodriges (Contributor)

This commit adds the ability to filter replica nodes by replication lag. The user can provide a ReplicationLagChecker function and a MaxReplicationLag value as options to the cluster constructor to check nodes and remove any replica that is too slow.

Example:

// Construct cluster nodes
nodes := []hasql.Node{...}

// Use options to fine-tune cluster behavior
opts := []hasql.ClusterOption{
	hasql.WithUpdateInterval(2 * time.Second),        // set custom update interval
	hasql.WithNodePicker(hasql.PickNodeRoundRobin()), // set desired nodes selection algorithm
	hasql.WithReplicationLagChecker(checkers.PostgreSQLReplicationLag), // set appropriate replication lag checker
	hasql.WithMaxReplicationLag(10 * time.Second), // set desired maximum lag value
}

// Create cluster handler
c, err := hasql.NewCluster(nodes, checkers.PostgreSQL, opts...)
if err != nil {
	panic(err)
}
defer func() { _ = c.Close() }() // close cluster when it is no longer needed

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

// Wait for any alive standby
node, err := c.WaitForStandby(ctx)
if err != nil {
	panic(err)
}

...

bbrodriges requested a review from sidh on June 22, 2021
codecov-commenter commented on Jun 22, 2021

Codecov Report

Merging #4 (e71a5ac) into master (ebd5959) will decrease coverage by 2.35%.
The diff coverage is 65.51%.


@@            Coverage Diff             @@
##           master       #4      +/-   ##
==========================================
- Coverage   85.25%   82.90%   -2.35%     
==========================================
  Files           7        7              
  Lines         217      234      +17     
==========================================
+ Hits          185      194       +9     
- Misses         27       35       +8     
  Partials        5        5              
Flag Coverage Δ
unittests 82.90% <65.51%> (-2.35%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files    Coverage Δ
node.go           75.00% <ø> (ø)
cluster_opts.go   83.33% <50.00%> (-16.67%) ⬇️
check_nodes.go    84.84% <63.63%> (-7.88%) ⬇️
cluster.go        88.49% <100.00%> (+0.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

return fmt.Errorf("cannot check node replication lag: %w", err)
}
if lag < maxLag {
return fmt.Errorf("replication lag is too big: %s", lag)
Collaborator

We need typed errors since tracer.NodeDead can return an error for both 'can't check' and 'too much lag'.

The question is: do we need to create an actual type for replication lag? The user might want to know the exact lag, or they might not. Sentinel errors are a bit easier to work with but provide less info.
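
For illustration, a typed error could carry the measured lag while staying matchable with errors.As. A minimal sketch, assuming the ReplicationLagError name (it is not part of this PR):

// ReplicationLagError is a hypothetical typed error for the 'too much lag' case.
type ReplicationLagError struct {
	Lag time.Duration // measured lag, so the user can inspect it
	Max time.Duration // configured maximum
}

func (e *ReplicationLagError) Error() string {
	return fmt.Sprintf("replication lag %s exceeds maximum %s", e.Lag, e.Max)
}

// Caller side, to tell 'too much lag' apart from 'cannot check':
//
//	var lagErr *ReplicationLagError
//	if errors.As(err, &lagErr) {
//		fmt.Printf("replica lags by %s\n", lagErr.Lag)
//	}

A sentinel (var ErrReplicationLag = errors.New("replication lag is too big")) wrapped with %w would also work via errors.Is, just without exposing the exact lag value.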

@@ -53,6 +53,9 @@ type Cluster struct {
 	updateTimeout time.Duration
 	checker       NodeChecker
 	picker        NodePicker
+	// Replication lag configuration
+	lagChecker  ReplicationLagChecker
+	maxLagValue time.Duration
Collaborator

maxAllowedLag is more descriptive IMO.

Node Node

Primary bool
Collaborator

I really don't like adding more transitory values to the node. This provokes accessing them after the check itself, when these values might be invalid.

Can you please add a comment before these values making it clear they are transitory and should not be used after the check itself? Also, making them private seems like a good idea.
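
For illustration, the requested comment plus private fields could look like this. A sketch only; the checkedNode name and field set are assumed from the diff above, not taken from the PR:

// checkedNode is hypothetical; fields mirror the diff above.
type checkedNode struct {
	node Node

	// Transitory check results. They are only meaningful during the
	// health check that produced them and must not be read afterwards.
	primary bool
	lag     time.Duration
}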


// PostgreSQLReplicationLag returns the replication lag value for a PostgreSQL replica node.
func PostgreSQLReplicationLag(ctx context.Context, db *sql.DB) (time.Duration, error) {
	return ReplicationLag(ctx, db, "SELECT NOW() - pg_last_xact_replay_timestamp()")
}

Unfortunately, on a low-activity cluster this would lead to marking all replicas as lagging: pg_last_xact_replay_timestamp() stops advancing when there are no transactions to replay, so NOW() - pg_last_xact_replay_timestamp() grows even on a fully caught-up replica.
We should probably consider using pg_stat_replication on the primary with a filter on reply_time.

secwall commented on Jun 22, 2021

We should also consider a corner case with a dead primary.
So it seems that our logic should be more complex (and stateful):

  1. If primary is alive we could select replicas from pg_stat_replication with state = 'streaming' and reply_time < <some threshold> and replay_lag < <some threshold> (and remember a set of replicas passing our checks on each iteration)
  2. If primary is dead and we have previous replica lag selection results we could use them
  3. If primary is dead and we have no previous replica lag selection results we could take the replica with the maximal pg_last_wal_replay_lsn() and compute lags on the other replicas from its pg_last_xact_replay_timestamp() (with select max_replica_replay_ts - pg_last_xact_replay_timestamp())

Any thoughts?
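
For illustration, the primary-side check in step 1 could look roughly like this. A sketch only: the streamingReplicas helper, the thresholds, and the map shape are assumptions, and the reply_time column requires PG 12 or newer:

// streamingReplicas is a hypothetical helper for step 1: it asks the primary
// which replicas are streaming, have replied recently, and are replaying
// within maxLag. COALESCE treats a NULL replay_lag (an idle, fully
// caught-up standby) as zero.
func streamingReplicas(ctx context.Context, primary *sql.DB, maxLag time.Duration) (map[string]struct{}, error) {
	const q = `
		SELECT application_name
		FROM pg_stat_replication
		WHERE state = 'streaming'
		  AND reply_time > NOW() - $1::interval
		  AND COALESCE(replay_lag, INTERVAL '0') < $1::interval`
	rows, err := primary.QueryContext(ctx, q, maxLag.String())
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	good := make(map[string]struct{})
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		good[name] = struct{}{}
	}
	return good, rows.Err()
}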

sidh (Collaborator) commented on Jun 22, 2021

> We should also consider a corner case with a dead primary.
> So it seems that our logic should be more complex (and stateful):
>
> 1. If primary is alive we could select replicas from `pg_stat_replication` with `state = 'streaming'` and `reply_time < <some threshold>` and `replay_lag < <some threshold>` (and remember a set of replicas passing our checks on each iteration)
>
> 2. If primary is dead and we have previous replica lag selection results we could use them
>
> 3. If primary is dead and we have no previous replica lag selection results we could take the replica with the maximal `pg_last_wal_replay_lsn()` and compute lags on the other replicas from its `pg_last_xact_replay_timestamp()` (with `select max_replica_replay_ts - pg_last_xact_replay_timestamp()`)
>
> Any thoughts?

Sounds good. The only possible problem I see is keeping state. That might be a bit more complicated than it sounds. But gotta try first.
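
For what it's worth, the state could be as small as the last successful selection plus a timestamp. A sketch, with all names hypothetical:

// lagState remembers the last replica set that passed the primary-side
// check, for reuse when the primary is dead (step 2 of the proposal).
type lagState struct {
	mu       sync.Mutex
	lastGood map[string]struct{} // replicas that passed the last check
	checked  time.Time           // when that check happened
}

func (s *lagState) remember(good map[string]struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastGood, s.checked = good, time.Now()
}

func (s *lagState) last() (map[string]struct{}, time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.lastGood, s.checked
}

The timestamp matters because a remembered selection goes stale; the caller has to decide how old is too old.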

bbrodriges (Contributor, Author) commented on Jun 23, 2021

> 1. If primary is alive we could select replicas from `pg_stat_replication` with `state = 'streaming'` and `reply_time < <some threshold>` and `replay_lag < <some threshold>` (and remember a set of replicas passing our checks on each iteration)

We should probably measure lag only from async streaming replicas.

bbrodriges (Contributor, Author)

Also, the reply_time column was only introduced in PG 12. We need to find a way to mimic this column for older versions.
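
reply_time itself probably can't be reconstructed on older servers; one possible fallback (an assumption, not something this PR implements) is to gate the filter on the server version and rely on replay_lag alone for PG 10 and 11:

// lagFilter is a hypothetical helper picking a pg_stat_replication WHERE
// fragment by server version: reply_time exists only in PG 12+, the
// *_lag columns in PG 10+. Pre-10 servers would need a replica-side check.
func lagFilter(ctx context.Context, db *sql.DB) (string, error) {
	var v int
	err := db.QueryRowContext(ctx, "SELECT current_setting('server_version_num')::int").Scan(&v)
	if err != nil {
		return "", err
	}
	if v >= 120000 {
		return "reply_time > NOW() - $1::interval AND COALESCE(replay_lag, INTERVAL '0') < $1::interval", nil
	}
	return "COALESCE(replay_lag, INTERVAL '0') < $1::interval", nil
}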

secwall commented on Jun 23, 2021

> > 1. If primary is alive we could select replicas from `pg_stat_replication` with `state = 'streaming'` and `reply_time < <some threshold>` and `replay_lag < <some threshold>` (and remember a set of replicas passing our checks on each iteration)
>
> We should probably measure lag only from async streaming replicas.

Hmm, should we?
There are several possible cases:

  1. We use quorum replication => all replicas are quorum and only analytical ones are async. Should we limit our selection to analytical ones? Probably not. But different quorum replicas could have different lag, so we should check it.
  2. We use `on` or `remote_write` for synchronous_commit (for performance reasons). In this mode sync replicas could be significantly behind async ones in terms of WAL apply, so sync replicas should also be checked for lag.
  3. We use `remote_apply` for synchronous_commit. Yep, this is the case where we could omit checking sync replica lag. Unfortunately, synchronous_commit is a session-level setting, so some sessions could fall back to case number 2 (see the sketch after this list). We should probably be ready for that.
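
To illustrate the caveat in case 3: any session can downgrade synchronous_commit for itself, so remote_apply semantics can never be assumed cluster-wide. A sketch, assuming an open *sql.DB:

func demoSessionLevelSyncCommit(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()
	// Session-level override: even with remote_apply as the server default,
	// commits on this connection now return after remote_write, before sync
	// replicas have applied the WAL (case 2 above).
	_, err = conn.ExecContext(ctx, "SET synchronous_commit = 'remote_write'")
	return err
}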
