
ETCD-636: add automated backup sidecar #1287

Open
Elbehery wants to merge 19 commits into master from add_backup_sidecar

Conversation

Elbehery (Contributor)

This PR adds an etcd backup sidecar container to the etcd pod manifest.

The container copies the snapshot state from the etcd data dir into the backup dir upon changes.

fixes https://issues.redhat.com/browse/ETCD-636

cc @openshift/openshift-team-etcd

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Jun 30, 2024

@Elbehery: This pull request references ETCD-636 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This PR adds an etcd backup sidecar container to the etcd pod manifest.

The container copies the snapshot state from the etcd data dir into the backup dir upon changes.

fixes https://issues.redhat.com/browse/ETCD-636

cc @openshift/openshift-team-etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dusk125 and tjungblu June 30, 2024 14:19
Contributor

openshift-ci bot commented Jun 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2024
@Elbehery (Contributor Author)

/retest

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 2 times, most recently from 822750b to 803e77a Compare July 1, 2024 00:07
- |
#!/bin/sh
set -euo pipefail
cp --verbose --recursive --preserve --reflink=auto /var/lib/etcd/ /var/backup/etcd
Contributor

I suggest you make this a command, so we have some Go code we can also test properly. E.g. in here: https://github.com/openshift/cluster-etcd-operator/tree/master/pkg/cmd

Didn't we want to take the snapshot with etcdctl for starters? And there's no retention or schedule either.

@Elbehery (Contributor Author)

Elbehery commented Jul 9, 2024

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jul 9, 2024
@Elbehery Elbehery force-pushed the add_backup_sidecar branch 3 times, most recently from 4e21c25 to 08917b6 Compare July 9, 2024 17:44
@Elbehery (Contributor Author)

/retest-required


var errs []error
go func() {
err := b.scheduleBackup()
Contributor

the cronjob.Start() needs to block this command, otherwise you will have an infinitely restarting sidecar that copies backups all the time

Contributor Author

the call is non-blocking, and there is no return from the task() to know when to unblock

maybe I can find a better package

Contributor Author

so I changed the approach:

  • one scheduler is created upon Run()
  • the scheduler runs only one job, backup()
  • prune() is invoked upon successful completion of the job (i.e. backup())
  • the scheduler is configured to:
    • run only one job at a time
    • run the job as a singleton
    • never re-queue the job

if idx == -1 {
sErr := fmt.Errorf("could not find default backup CR, found [%v]", backups.Items)
klog.Error(sErr)
return sErr
Contributor

that's not an error; if it's not there, the user likely removed it for a reason. We have two options here:

  • the sidecar is patched out by the operator when this CRD is not available
  • you keep this cron server running without any task to schedule, but you need to have some reconciliation loop to check whether the default CRD comes back eventually

Contributor

if we go with the first option, then we need to patch the CRD client out of this command entirely and use configuration args to pass the CRD (similar to the backup prune command). Otherwise we're going to have issues with race conditions on static pod rollouts when people create+delete the default CRD, which will send this sidecar crash looping.

Contributor Author

so if we go with the second option, that is we return non-err above, why do we need a sync(), as this is a cmd?

iiuc, it does check the CRD upon every invocation, or?

Contributor

how do you envision the lifecycle of this sidecar then? we need to run it 24/7 to keep the cron scheduler running correctly. So when it never restarts, how do we detect the CRD has changed?

return err
}

b.scheduler.Start()
Contributor

From the documentation I can see:

	// Start begins scheduling jobs for execution based
	// on each job's definition. Job's added to an already
	// running scheduler will be scheduled immediately based
	// on definition. Start is non-blocking.
	Start()

https://pkg.go.dev/github.com/go-co-op/gocron/v2#Scheduler

so this sidecar will continue to restart every time it runs, not sure how it is able to schedule a cronjob accordingly

Elbehery (Contributor Author) commented Jul 10, 2024

the job will be added to the scheduler immediately, and will be scheduled upon definition

iiuc, this means the job is being added, but the execution is based on the crontab

why will this restart the whole sidecar?

Contributor

there is no crontab here, the library will just sleep on a go channel 🤣

Contributor Author

ah, I see .. but why will the container restart every time?

Contributor

what's stopping it from exiting the command?

Contributor Author

ah, I see .. ok .. so either we run it as a daemon, or have a controller that invokes the cmd upon reconciliation, right?

Contributor

just checking whether I'm being stupid here, but this example immediately exits without running any task:

import (
	"fmt"
	gcron "github.com/go-co-op/gocron/v2"
)

func main() {
	scheduler, err := gcron.NewScheduler(
		gcron.WithLimitConcurrentJobs(1, gcron.LimitModeWait),
		gcron.WithGlobalJobOptions(
			gcron.WithLimitedRuns(1),
			gcron.WithSingletonMode(gcron.LimitModeWait)))
	if err != nil {
		panic(err)
	}

	_, err = scheduler.NewJob(gcron.CronJob("* * * * *", false), gcron.NewTask(func() {
		fmt.Printf("hello world\n")
	}))
	if err != nil {
		panic(err)
	}
	scheduler.Start()

}

Contributor

ah, I see .. ok .. so either we run it as a daemon, or have a controller that invokes the cmd upon reconciliation, right?

well, hence my question on the other thread about the lifecycle you're thinking about, because this command is not going to work :'D

Contributor Author

no, you're right, it's my bad :D ..

but I followed the same approach as in the requestBackup and prune cmds

probably we need a controller that reacts only to the default CR and invokes the cmd

wdyt?

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 3 times, most recently from 46881b6 to 5fe990f Compare July 12, 2024 10:18
@Elbehery (Contributor Author)

/retest-required

@Elbehery Elbehery force-pushed the add_backup_sidecar branch 2 times, most recently from a85e9fa to b966a02 Compare July 12, 2024 22:56
Contributor

openshift-ci bot commented Jul 13, 2024

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-qe-no-capabilities 803e77a link false /test e2e-gcp-qe-no-capabilities
ci/prow/e2e-agnostic-ovn f4c8b1a link true /test e2e-agnostic-ovn
ci/prow/e2e-operator f4c8b1a link true /test e2e-operator
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown f4c8b1a link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-etcd-recovery f4c8b1a link false /test e2e-aws-etcd-recovery
ci/prow/e2e-agnostic-ovn-upgrade f4c8b1a link true /test e2e-agnostic-ovn-upgrade
ci/prow/e2e-aws-ovn-single-node f4c8b1a link true /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown f4c8b1a link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws-ovn-etcd-scaling f4c8b1a link true /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-operator-fips f4c8b1a link false /test e2e-operator-fips
ci/prow/e2e-aws-ovn-serial f4c8b1a link true /test e2e-aws-ovn-serial
ci/prow/e2e-aws-etcd-certrotation f4c8b1a link false /test e2e-aws-etcd-certrotation

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
