Using benji to backup hundreds of database servers #97
Deduplication is per storage, so if you had a separate S3 bucket per server, deduplication would only be performed on the data coming from that server. If you want to go that route, I'd suggest using one Benji database for all servers and then automatically generating a Benji configuration per server, consisting of a common part (transforms, ios, database credentials and such) and one storage definition with a unique name and bucket configuration. I have not tested this, but Benji should be able to cope with the fact that not all configuration files include all storages, as long as the names are unique. If this does not work for some reason, it could easily be fixed, I think. You should also be able to generate the transform definitions per server (for example, to assign different encryption keys). I'd suggest using unique names here too, to prevent a mix-up.

My suggestion, based on the limited information about your use case, would be to use one storage for all servers to get the maximum benefit from deduplication. You can use labels to add extra information like server name or customer ID, and then use
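The per-server configuration generation suggested above could be sketched roughly like this. This is only an illustration of the idea (merge a shared common part with one uniquely named storage per server); the key names and bucket naming scheme here are assumptions, not Benji's exact configuration schema:

```python
# Sketch: build a per-server Benji configuration from a shared common part
# plus one uniquely named storage definition per server.
# Key names and the bucket naming scheme are illustrative assumptions.

COMMON = {
    # shared part: database credentials, transforms, ios, ...
    "databaseEngine": "postgresql://benji:secret@db-host/benji",
    "transforms": [
        {"name": "zstd", "module": "zstd", "configuration": {"level": 3}},
    ],
}

def config_for_server(hostname: str) -> dict:
    """Return a config dict with one uniquely named storage for this server."""
    storage = {
        "name": f"s3-{hostname}",                        # unique storage name
        "module": "s3",
        "configuration": {"bucketName": f"benji-{hostname}"},
    }
    return {**COMMON, "storages": [storage]}

cfg = config_for_server("db-01.example.com")
```

An automation tool would dump each such dict to that server's `benji.yaml`; because the storage names are unique, the shared database can tell the per-server storages apart.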
Yes, we're aware of this; it should not be a big issue for us. Our motivation for keeping the storage for each server isolated is to limit the impact of human mistakes, corruption due to misconfiguration, and so on.
Indeed, this is how we have configured it in our testing: two servers with Benji instances, a common configuration except for the storage backends (which have unique IDs and names), and a shared Benji PostgreSQL database. We have a separate recovery node that has all storages included in its Benji config, so we can recover any server from this node. Seems to work flawlessly!
We currently do not require encryption, but have one common transform definition like so:
Should we use unique names here as well, even though we don't encrypt?
The database systems we want to back up are PostgreSQL servers, and in our data model every tenant in the system has its own database schema. We have 4,000 tenant schemas on every server, and every schema has the same table/index structure. This is why we have so many files on disk, since every table and index results in a new file. The gains we get from Benji deduplication are already huge: a 72 GB instance only used up 4.9 GB in S3 - very impressive 👍
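For reference, the quoted figures work out to roughly a 93% space saving. A quick back-of-the-envelope check:

```python
# Arithmetic on the deduplication figures quoted above:
# a 72 GB instance stored in about 4.9 GB of S3 objects.
logical_gb = 72.0
stored_gb = 4.9

stored_fraction = stored_gb / logical_gb       # share of data actually stored
savings_pct = (1 - stored_fraction) * 100      # space saved by dedup/compression

print(f"stored {stored_fraction:.1%} of the data, saved {savings_pct:.1f}%")
```

The high ratio is plausible here because 4,000 identical schema structures produce many identical blocks.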
We do indeed do this by passing the hostname as a label. I do have a few other questions:
Thanks for telling more about your use case and your experience so far.
No. Unique names would only make sense if the actual configuration of the transforms differ. Like different encryption keys.
I had hoped to get more feedback from users (see #67) to better assess how many users there actually are and how they are using Benji, to build more confidence in the stability of the code base. To actually answer your question: there currently is no estimate.
For the database I've provided automatic migrations from the beginning, and this has worked out quite well, I think. The structure of the object metadata hasn't seen any significant changes for quite a while, and Benji can still read the older versions. The same holds true for the format of the exported metadata, of which there are currently four different revisions. So, apart from bugs, it should actually be possible to upgrade any released version to any other later released version.

Downgrades are another matter and are currently only possible in a limited number of circumstances. We could provide code for automated downgrades of the database schema, but I'm not sure if it is worth the effort, even for stable releases. I'm planning on continuing to provide backwards compatibility: even if there are changes to the data structures, Benji will be able to read the old versions, and the database schema will be migrated automatically.
There are other disadvantages to using so many different storages, apart from losing the space savings of global deduplication:
I still understand your reasoning; I just wanted to add these two as they came to mind, as they could be an issue in the long run.
Great! 👍
Alright, understandable. If we end up using Benji, I'll be sure to drop a comment in #67. Like I said, we'll have an extensive deployment, and we generally do several recovery operations per week to look at previous database states for various reasons. So it should help build confidence (provided it works well!).
As long as benji is 'forward-compatible', it will not be a problem for us. Our concern was that a benji release might break the previous format, forcing us to 'reset' the backups and start over.
We're still debating internally whether we should go with one bucket or several. You see no problem with performance degradation from going with a single bucket? Even though Benji processes blocks and not files, there will likely be a lot of block files in a single combined bucket. Won't deduplication performance go down as there are more and more blocks to check? Quickly counting, the raw data on disk is somewhere around ~20 TB with ~675,000,000 files combined. Two more questions!
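One way to put the object-count worry in perspective: Benji stores fixed-size blocks, not files, so the number of objects in the bucket scales with raw data size, not file count. Assuming a 4 MiB block size (an assumption for illustration; check your configured `blockSize`), the upper bound is far below the 675 million source files:

```python
# Back-of-the-envelope upper bound on objects in one shared bucket.
# Benji stores fixed-size blocks; 4 MiB is an ASSUMED block size here,
# and deduplication will only reduce the actual object count further.
raw_bytes = 20 * 1024**4      # ~20 TiB of raw data
block_size = 4 * 1024**2      # assumed 4 MiB block size

max_objects = raw_bytes // block_size
print(max_objects)            # upper bound, before deduplication
```

That is on the order of five million objects at most, which is well within what S3-style stores routinely handle per bucket.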
I'd think that it mainly depends on how the object store handles large numbers of objects in the same bucket, and I'd assume that services like S3 are optimized to work well in such a scenario. I'm not completely sure about the database: it is going to be smaller with a unified storage due to deduplication, but the database optimizer might work better with multiple storages. You could consider splitting your database deployments into groups where each group uses one storage. That would get you the benefit of better deduplication, and you'd still be safer from human error or software failure. Would it be possible to test how much you would benefit from deduplication? Maybe we're discussing a non-issue.
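The grouping idea above could be automated with a deterministic hostname-to-group mapping, so each server always lands in the same storage without a lookup table. The group count and naming scheme below are assumptions for illustration:

```python
# Sketch of the "groups" compromise: assign each database server to one
# of a few storages, so deduplication still works within a group while a
# single mistake can't touch every bucket. Group count and storage-name
# scheme are illustrative assumptions.
import hashlib

N_GROUPS = 4

def storage_for(hostname: str) -> str:
    """Stable hostname -> storage-group mapping (survives restarts)."""
    digest = hashlib.sha256(hostname.encode()).digest()
    group = digest[0] % N_GROUPS
    return f"s3-group-{group}"

print(storage_for("db-01.example.com"))
```

Using a hash of the hostname (rather than, say, round-robin at provisioning time) means the mapping needs no shared state and never changes for an existing server.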
That was an experiment of mine. Last time I tried it out, it worked, but the topic of how to bundle (or externally provide) Ceph's Python modules is still unsolved, and I haven't invested any work into that yet; it might just work. I'm still interested in PyInstaller, as it would provide a way to easily distribute Benji.
Some would consider it good practice, especially if you're using the option to directly compare the backup to the source snapshot. But your backups will take longer that way and generate more I/O (even more so if you're comparing to the source snapshot). So, as usual, the answer is: it depends.
We'll likely be using the MinIO (https://min.io) local S3 service, with XFS as the backend file system. I interpret your answer as: essentially, if the S3 storage is fine, Benji should be too?
Heh, we came up with that idea internally as well 👍
Yes, we're in a PoC phase right now so that will be one of the things we'll test, for sure!
For us, the lack of Ceph support is not an issue, since we'll be backing up LVM and storing that data in S3. I'll be sure to test the PyInstaller spec and submit improvements, should we find any!
We'll probably keep scrubbing for now then; if we see I/O utilization problems, we can omit it later or run it at a different point in time. Thank you for your helpful advice and replies! Like I said above, we're in the proof-of-concept stage, working out some tooling around Benji to fit our setup, and it continues to impress us. Some numbers for you:
Yes.
Please do. I quite like the idea of distributing Benji as a single "binary", as it simplifies installation immensely. (Of course, I know that there are also disadvantages.) As you mention LVM: there is optimization potential in this area, which could be quite substantial if we consider how much Ceph's snapshot diffs help speed up backups. See #59.
Thanks!
Update: we have deployed Benji to our development environment (15 DB servers) and have had it running for a few weeks. After some internal debate, we ended up using a single bucket for Benji per environment (so one dedicated bucket for dev, one for test, and one for prod). Since we have a secondary backup system in place as well, we consider the odds of losing both systems to a human mistake very low. So we should get some good numbers on dedup, as you suggested :) We have run into one minor issue so far; see #101.
I had some trouble building a standalone Benji binary with PyInstaller, but got it working in the end with some changes to the .spec file. I'll try to get around to creating a PR with those fixes. Note: since we don't use Ceph, I did not look into building those modules.
Cool, would love to see that implemented, even though for us Benji is definitely fast enough as it is :)
Thanks for the update! I'm going to close this issue now. Take a look at #101, and I'm looking forward to the PR. Any fixes to the spec file are definitely welcome.
Hi,
I'm evaluating Benji for backing up hundreds of database servers (virtual machines) using LVM snapshots. Since we have a special data model, there are several million files in the database storage directory on each server. This is causing problems (memory usage and very long run times) for the 'native' backup tools we have been looking at. However, since Benji is block-based, we're getting very promising performance on both backup and restore in our tests.
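The snapshot-based backup cycle we have in mind has roughly this shape, expressed here as the command sequence an automation wrapper might run. The `lvcreate`/`lvremove` flags are standard LVM; the `benji backup` invocation (source URL form and positional arguments) is an assumption for illustration, so check `benji backup --help` for the real syntax:

```python
# Rough shape of one per-server backup cycle: snapshot, back up the
# snapshot device block-wise, drop the snapshot. Commands are built but
# not executed here; the benji invocation is an ASSUMED form.
VG, LV, SNAP = "vg0", "pgdata", "pgdata-snap"   # hypothetical volume names

commands = [
    # point-in-time, copy-on-write snapshot of the database volume
    ["lvcreate", "--snapshot", "--size", "10G", "--name", SNAP, f"/dev/{VG}/{LV}"],
    # block-based backup of the frozen snapshot device (assumed syntax)
    ["benji", "backup", f"file:/dev/{VG}/{SNAP}", LV],
    # drop the snapshot so COW space is released
    ["lvremove", "--force", f"/dev/{VG}/{SNAP}"],
]
```

Because Benji reads the snapshot device directly, the millions of files inside the filesystem never have to be enumerated.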
My questions here are mostly related to how we should implement Benji on a large scale; the docs are rather sparse (https://benji-backup.me/configuration.html#multiple-instance-installations).
Our plan is to have one S3 bucket per server, which contains the LVM block backup of that specific server. My current train of thought is to have a central Benji PostgreSQL database and have the Benji instances on all servers share that database. Since we have different S3 buckets for every server, that means a new 'storage id' for every server. This should not be a big deal, since we have automation tools that would take care of that for us.
My question is basically: does this sound like a good way to implement Benji at this scale? I guess another approach would be to have a separate PostgreSQL schema for each Benji instance on every server. We would like to keep the backups of individual servers in separate S3 buckets, or at least in different folders inside a shared bucket.