
ETCD-636: add automated backup sidecar #1287

Open
Elbehery wants to merge 19 commits into master from add_backup_sidecar

Conversation

Elbehery (Contributor)

This PR adds an etcd backup sidecar container to the etcd pod manifest.

The container copies the snapshot state from the etcd data dir into the backup dir upon changes.

fixes https://issues.redhat.com/browse/ETCD-636

cc @openshift/openshift-team-etcd

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Jun 30, 2024

@Elbehery: This pull request references ETCD-636 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This PR adds an etcd backup sidecar container to the etcd pod manifest.

The container copies the snapshot state from the etcd data dir into the backup dir upon changes.

fixes https://issues.redhat.com/browse/ETCD-636

cc @openshift/openshift-team-etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dusk125 and tjungblu June 30, 2024 14:19
Contributor

openshift-ci bot commented Jun 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2024
@Elbehery (Contributor Author)

/retest

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 2 times, most recently from 822750b to 803e77a Compare July 1, 2024 00:07
- |
#!/bin/sh
set -euo pipefail
cp --verbose --recursive --preserve --reflink=auto /var/lib/etcd/ /var/backup/etcd
Contributor

I suggest you make this a command, so we have some Go code we can also test properly. E.g. in here: https://github.com/openshift/cluster-etcd-operator/tree/master/pkg/cmd

Didn't we want to take the snapshot with etcdctl for starters? And there's no retention or schedule either.

@Elbehery (Contributor Author)

Elbehery commented Jul 9, 2024

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jul 9, 2024
@Elbehery Elbehery force-pushed the add_backup_sidecar branch 3 times, most recently from 4e21c25 to 08917b6 Compare July 9, 2024 17:44
@Elbehery (Contributor Author)

/retest-required


var errs []error
go func() {
err := b.scheduleBackup()
Contributor

the cronjob.Start() needs to block this command, otherwise you will have an infinitely restarting sidecar that copies backups all the time

Contributor Author

the call is non-blocking, and there is no return from the task() to know when to unblock

maybe I can find a better package

Contributor Author

so I changed the approach:

  • one scheduler is created upon Run()
  • the scheduler runs only one job, backup()
  • prune() is invoked upon successful completion of the job (i.e. backup())
  • the scheduler is configured to:
    • run only one job at a time
    • run the job as a singleton
    • never re-queue the job

if idx == -1 {
sErr := fmt.Errorf("could not find default backup CR, found [%v]", backups.Items)
klog.Error(sErr)
return sErr
Contributor

that's not an error; if it's not there, the user likely removed it for a reason. We have two options here:

  • the sidecar is patched out by the operator when this CRD is not available
  • you keep this cron server running without any task to schedule, but you need to have some reconciliation loop to check whether the default CRD comes back eventually

Contributor

if we go with the first option, then we need to patch the CRD client out of this command entirely and use configuration args to pass the CRD (similar to the backup prune command). Otherwise we're going to have issues with race conditions on static pod rollouts when people create+delete the default CRD, which will send this sidecar crash looping.

Contributor Author

so if we go with the second option, that is we return non-err above, why do we need a sync(), as this is a cmd?

iiuc, it does check the CRD upon every invocation, or?

Contributor

how do you envision the lifecycle of this sidecar then? we need to run it 24/7 to keep the cron scheduler running correctly. So when it never restarts, how do we detect the CRD has changed?

return err
}

b.scheduler.Start()
Contributor

From the documentation I can see:

	// Start begins scheduling jobs for execution based
	// on each job's definition. Job's added to an already
	// running scheduler will be scheduled immediately based
	// on definition. Start is non-blocking.
	Start()

https://pkg.go.dev/github.com/go-co-op/gocron/v2#Scheduler

so this sidecar will continue to restart every time it runs, not sure how it is able to schedule a cronjob accordingly

Elbehery (Contributor Author) commented Jul 10, 2024

the job will be added to the scheduler immediately, and will be scheduled upon definition

iiuc, this means the job is being added, but the execution is based on the crontab

why will this restart the whole sidecar?

Contributor

there is no crontab here, the library will just sleep on a go channel 🤣

Contributor Author

ah, I see .. but why will the container restart every time?

Contributor

what's stopping it from exiting the command?

Contributor Author

ah, I see .. ok .. so either we run it as a daemon, or have a controller that invokes the cmd upon reconciliation, right?

Contributor

just checking whether I'm being stupid here, but this example immediately exits without running any task:

import (
	"fmt"
	gcron "github.com/go-co-op/gocron/v2"
)

func main() {
	scheduler, err := gcron.NewScheduler(
		gcron.WithLimitConcurrentJobs(1, gcron.LimitModeWait),
		gcron.WithGlobalJobOptions(
			gcron.WithLimitedRuns(1),
			gcron.WithSingletonMode(gcron.LimitModeWait)))
	if err != nil {
		panic(err)
	}

	_, err = scheduler.NewJob(gcron.CronJob("* * * * *", false), gcron.NewTask(func() {
		fmt.Printf("hello world\n")
	}))
	if err != nil {
		panic(err)
	}
	scheduler.Start()

}

Contributor

ah, I see .. ok .. so either we run it as a daemon, or have a controller that invokes the cmd upon reconciliation, right?

well, hence my question on the other thread about the lifecycle you're thinking about, because this command is not going to work :'D

Contributor Author

no, you're right, it's my bad :D ..

but I followed the same approach as in the requestBackup and prune cmds

probably we need a controller that reacts only to the default CR and invokes the cmd

wdyt?

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 3 times, most recently from 46881b6 to 5fe990f Compare July 12, 2024 10:18
@Elbehery (Contributor Author)

/retest-required

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 2 times, most recently from a85e9fa to b966a02 Compare July 12, 2024 22:56
Contributor

openshift-ci bot commented Jul 13, 2024

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-qe-no-capabilities 803e77a link false /test e2e-gcp-qe-no-capabilities
ci/prow/e2e-agnostic-ovn f4c8b1a link true /test e2e-agnostic-ovn
ci/prow/e2e-operator f4c8b1a link true /test e2e-operator
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown f4c8b1a link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-etcd-recovery f4c8b1a link false /test e2e-aws-etcd-recovery
ci/prow/e2e-agnostic-ovn-upgrade f4c8b1a link true /test e2e-agnostic-ovn-upgrade
ci/prow/e2e-aws-ovn-single-node f4c8b1a link true /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown f4c8b1a link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws-ovn-etcd-scaling f4c8b1a link true /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-operator-fips f4c8b1a link false /test e2e-operator-fips
ci/prow/e2e-aws-ovn-serial f4c8b1a link true /test e2e-aws-ovn-serial
ci/prow/e2e-aws-etcd-certrotation f4c8b1a link false /test e2e-aws-etcd-certrotation

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
