
This cluster is built from this great repo: https://github.com/onedr0p/flux-cluster-template.

If you are here because you want a cluster that also has rook-ceph, and the one above isn't quite scratching the itch, please do us both a favor and read my notes, then go through everything at the above repo, then come back here.

- Baremetal: if you are not on baremetal, you are in the wrong place. [kube-vip](https://kube-vip.io/docs/installation/static/#arp), [MetalLB](https://metallb.universe.tf/concepts/) and, to a large extent, [cloudflared](https://github.com/cloudflare/cloudflared) are of no value within a cloud or virtual environment; there are much better solutions there.

- Networking: Think about what your networking looks like. I'd suggest setting up your subnet to work with the defaults rather than trying to change the config to match your subnet. I have a "real" interface coming off a switch, which takes the IP 10.10.1.1/24 and serves DHCP. Each MAC address that gets a reservation has its first IP permanently reserved (so it acts like a static IP). I plug directly into that switch to operate on the cluster, but I also have a jump box set up, and allow ingress from the management LAN of my home network.

- Drives: You might be excited to run a cluster from a bunch of dev boards, which is how this started out. Most of these dev boards have only one NVMe slot. Two of my control nodes boot from an A2 TF card, and one of the workers boots from USB-C. One of the A2-booted nodes is arm64 and uses XFS (nearly 50% faster than its ext4 counterpart on big writes); the USB-C node uses btrfs. Two of the three worker nodes boot from NVMe and use it for the whole filesystem. I did not do any actual benchmarking, but I built many clusters with just the TF card as the boot device, and I was surprised at how little difference there was between ext4 on NVMe and A2 TF cards as root filesystems. Throughput for container syncing was much faster on the NVMes, but the cards had no problem keeping up with operations.
  a. Eventually I'd like the etcd cluster to be backed by Ceph data, but until then I will leave the two NVMe masters.
  b. For worker nodes, dedicating 4GB of memory and an M.2 slot to the Ceph cluster probably makes sense on dev boards, but for your control nodes it's a bit tougher.
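The networking bullet above mentions MetalLB; to make that concrete, here is a hedged sketch of an L2 address pool carved out of the 10.10.1.0/24 subnet described there. The pool name and address range are hypothetical (pick a range your DHCP server does not lease out), not values taken from this repo:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: main-pool              # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 10.10.1.200-10.10.1.250  # assumption: a range outside the DHCP lease pool
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: main-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - main-pool                # advertise the pool above via ARP on the LAN
```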


### Rook (Ceph)
Ceph has traditionally been run in its own cluster; Rook allows us to orchestrate a Ceph cluster within our Kubernetes cluster. The most important thing to look at when configuring Ceph is the device configuration. The easiest way by far is to plug in brand-new disks and set `useAllNodes` to true; the cluster will happily slurp everything right up.

However, be warned: a default configuration of an OSD (the daemon that manages a disk), with all the monitoring, alerting, etc., comes to 4GB in memory requests. By default there will be a single OSD per configured device. This cluster has a variety: a low-memory worker with a 2TB NVMe has only a single OSD, while a high-memory worker with 2x2TB NVMe has 8 OSDs across the two drives.
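A sketch of the relevant piece of a Rook `CephCluster` spec; the values here are illustrative, not this repo's actual configuration:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  storage:
    useAllNodes: true      # slurp up every node...
    useAllDevices: true    # ...and every empty, unpartitioned disk on it
    config:
      osdsPerDevice: "4"   # more than one OSD per device, for big fast NVMes
  resources:
    osd:
      requests:
        cpu: "500m"        # illustrative values; the defaults are much heavier,
        memory: "2Gi"      # and trimming them is how OSDs fit on low-memory workers
```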

If, like me, it takes you about 100 iterations before the cluster comes up the way you like, know that there are many kinds of fingerprints that can be left behind which will make Ceph refuse to provision the disks. The most common are latent partitions, but with encryption enabled there are other block-device-level artifacts that remain after you thought you were starting fresh.

As such, there are a couple of additional Ansible scripts. The primary one I would recommend is `task ansible:rancher-nuke`, which deletes the /var/lib/rancher directory that the parent repo chooses not to touch. Without removing this directory, many container artifacts stick around between installs, which operators tend not to like.
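A minimal sketch of what such a nuke task boils down to as an Ansible play; this illustrates the idea, it is not the actual contents of `task ansible:rancher-nuke`:

```yaml
# playbook sketch: wipe leftover rancher/k3s state so a reinstall is truly fresh
- hosts: all
  become: true
  tasks:
    - name: Remove container and operator artifacts from previous installs
      ansible.builtin.file:
        path: /var/lib/rancher
        state: absent
```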

If you are using encryption (which this repo is), you will also need to clean the Ceph-level artifacts off the block devices, which you can do with `task ansible:ceph-nuke`. If you have non-NVMe drives that need to be cleaned, that script may not work without manually unmounting them first; look into `sgdisk` to see more about what is going on there.
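The steps involved in cleaning a disk by hand look roughly like this. This is a hedged sketch, not the contents of the `ceph-nuke` task: the device path and the `CONFIRM` guard are my own conventions, and by default it only prints what it would do. These commands are destructive when actually run.

```shell
#!/usr/bin/env sh
# Sketch: strip Ceph/LVM/partition fingerprints off a block device.
# DISK and the CONFIRM guard are assumptions for illustration; set CONFIRM=yes
# to actually execute instead of printing the commands.
DISK="${DISK:-/dev/sdX}"

run() {
  if [ "$CONFIRM" = "yes" ]; then "$@"; else echo "would run: $*"; fi
}

run sgdisk --zap-all "$DISK"                    # wipe GPT and MBR structures
run dd if=/dev/zero of="$DISK" bs=1M count=100  # zero the start (bluestore/LVM labels)
run blkdiscard "$DISK"                          # SSD/NVMe: discard all blocks
# encrypted OSDs also leave device-mapper entries behind:
run dmsetup remove_all --force
```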

### Configuration
`task ansible:configure` has been disabled here. It is very useful for significantly shortening the iteration loop when getting started, so I do not suggest that you also disable it before you've begun. However, I have slightly customized the Ansible YAML in a way that would be overwritten by re-running that configuration generation script, and those changes are not going upstream into the configurator. If you want to follow along with this repository, I suggest starting from the one [I started from](https://github.com/onedr0p/flux-cluster-template), and then, once the config is generated, editing the Ansible YAML directly as necessary.


### OLM - Operator Lifecycle Manager
OLM has gone out of its way to not provide a Helm chart for installation, insisting that its installation be [The One Exception](https://github.com/operator-framework/operator-lifecycle-manager/issues/829) to declarative config. We were following an external chart which tracks the OLM chart repository and installs the OLM operator. OLM is archaic at this point and antithetical to the design principles of Kubernetes. It is a shame that it is the only way to install some operators, but it is what it is. (This paragraph was almost entirely created by GitHub Copilot.)

I have removed OLM, and suggest that you not bother with it if you are on baremetal and not interested in a layer of virtualization on top. If you are on a cloud provider you may want to look into it, but I have not found it to be useful.

## 📂 Repository structure
