ETCD-636: add automated backup sidecar #1287
base: master
Conversation
@Elbehery: This pull request references ETCD-636 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Elbehery. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
(force-pushed 3d39bd9 → 7491a2d)
/retest
(force-pushed 822750b → 803e77a)
bindata/etcd/pod.yaml (outdated):

```yaml
- |
  #!/bin/sh
  set -euo pipefail
  cp --verbose --recursive --preserve --reflink=auto /var/lib/etcd/ /var/backup/etcd
```
I suggest you make this a command, so we have some Go code we can also test properly. E.g. in here: https://github.com/openshift/cluster-etcd-operator/tree/master/pkg/cmd

Didn't we want to take the snapshot with etcdctl for starters? And there's no retention and no schedule either.
(force-pushed 803e77a → fbd8c0b)
/label tide/merge-method-squash
pkg/operator/periodicbackupcontroller/periodicbackupcontroller.go (outdated, resolved)
(force-pushed a15f310 → eb4321b)
(force-pushed fa0a0d5 → 0d28c7a)
(force-pushed 4e21c25 → 08917b6)
(force-pushed 08917b6 → 898db4b)
/retest-required
(force-pushed 3a09a28 → 7bd5e3b)
```go
var errs []error
go func() {
	err := b.scheduleBackup()
```
the cronjob.Start() needs to block this command, otherwise you will have an infinitely restarting sidecar that copies backups all the time
the call is non-blocking, and there is no return from the task() to know when to unblock
maybe I can find a better package
so I changed the approach:

- one scheduler is created upon Run()
- the scheduler runs only one job, backup()
- prune() is invoked upon successful completion of the job (i.e. backup())
- the scheduler is configured to:
  - run only one job at a time
  - run the job as a singleton
  - never re-queue the job
```go
if idx == -1 {
	sErr := fmt.Errorf("could not find default backup CR, found [%v]", backups.Items)
	klog.Error(sErr)
	return sErr
```
that's not an error; if it's not there, the user likely removed it for a reason. We have two options here:

- the sidecar is patched out by the operator when this CRD is not available
- you keep this cron server running without any task to schedule, but you need to have some reconciliation loop to check whether the default CRD comes back eventually
if we go with the first option, then we need to patch the CRD client out of this command entirely and use configuration args to pass the CRD (similar to the backup prune command). Otherwise we're going to have issues with race conditions on static pod rollouts when people create+delete the default CRD, which will send this sidecar into a crash loop.
so if we go with the second option, i.e. we return non-error above, why do we need a sync(), as this is a cmd?
iiuc, it does check the CRD upon every invocation, or?
how do you envision the lifecycle of this sidecar then? we need to run it 24/7 to keep the cron scheduler running correctly. So when it never restarts, how do we detect the CRD has changed?
(force-pushed f01d91b → f462c3b)
```go
	return err
}

b.scheduler.Start()
```
From the documentation I can see:

```go
// Start begins scheduling jobs for execution based
// on each job's definition. Job's added to an already
// running scheduler will be scheduled immediately based
// on definition. Start is non-blocking.
Start()
```

https://pkg.go.dev/github.com/go-co-op/gocron/v2#Scheduler

so this sidecar will continue to restart every time it runs, not sure how it is able to schedule a cronjob accordingly
the job will be added to the scheduler immediately, and will be scheduled per its definition
iiuc, this means the job is added, but the execution is driven by the crontab
why would this restart the whole sidecar?
there is no crontab here, the library will just sleep on a go channel 🤣
achsoooo .. but why would the container restart every time?
what's stopping it from exiting the command?
achsooo .. ok .. so either we run it as a daemon or have a controller that invokes the cmd upon reconciliation, or?
just checking whether I'm stupid here, but this example immediately exits without running any task:

```go
package main

import (
	"fmt"

	gcron "github.com/go-co-op/gocron/v2"
)

func main() {
	scheduler, err := gcron.NewScheduler(
		gcron.WithLimitConcurrentJobs(1, gcron.LimitModeWait),
		gcron.WithGlobalJobOptions(
			gcron.WithLimitedRuns(1),
			gcron.WithSingletonMode(gcron.LimitModeWait)))
	if err != nil {
		panic(err)
	}
	_, err = scheduler.NewJob(gcron.CronJob("* * * * *", false), gcron.NewTask(func() {
		fmt.Printf("hello world\n")
	}))
	if err != nil {
		panic(err)
	}
	scheduler.Start()
	// Start() is non-blocking, so main returns here and the job never runs.
}
```
> achsooo .. ok .. so either we run it as a daemon or have a controller that invokes the cmd upon reconciliation, or?

well, hence my question on the other thread about the lifecycle you're thinking about, because this command is not going to work :'D
no, u r right, it's my bad :D ..
but I followed the same approach as in the requestBackup and prune cmds
probably we need a controller that reacts only to the default CR and invokes the cmd
wdyt?
(force-pushed 46881b6 → 5fe990f)
/retest-required
(force-pushed 5fe990f → f6eef66)
(force-pushed a85e9fa → b966a02)
(force-pushed b966a02 → f4c8b1a)
@Elbehery: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This PR adds an etcd backup sidecar container to the etcd pod manifest.
The container copies the snapshot state upon changes from the etcd data dir into the backup dir.
fixes https://issues.redhat.com/browse/ETCD-636
cc @openshift/openshift-team-etcd