Add capacity docs for zos (#1967)
Fixes #1966
muhamadazmy authored May 12, 2023
1 parent adf968f commit c8bfd14
Showing 2 changed files with 87 additions and 1 deletion.
75 changes: 75 additions & 0 deletions docs/internals/capacity.md
@@ -0,0 +1,75 @@
# Capacity

This document describes how ZOS handles the following tasks:

- Reserving system resources
  - Memory
  - Storage
- Calculating the free capacity usable by user workloads

## System reserved capacity

ZOS always reserves some amount of the available physical resources for its own operation. The system tries to be as protective
as possible of its critical services to make sure that the node is always reachable and usable, even when it's under heavy load.

ZOS reserves memory and storage (but not CPU) as follows:

### Reserved Memory

ZOS reserves 10% of the available system memory for basic services and operation overhead. The operation overhead can happen as a side effect of running user workloads. For example, a user network in theory does not consume any memory, but in fact it does consume some (kernel buffers, etc...). The same goes for a VM: a user VM can be assigned, say, 5G, but the process running the VM will take a few extra megabytes to operate.

This is why we decided to play it safe and reserve 10% of the total system memory for system overhead, with a **MIN** reserved memory of 2GB:

```python
# all values in GB; reserve 10% of total memory, with a minimum of 2 GB
reserved = max(total_in_gb * 0.1, 2)
```
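
For example, on a node with 16 GB of RAM, 10% is only 1.6 GB, so the 2 GB minimum applies; on a 64 GB node, 6.4 GB is reserved.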

### Reserved Storage

While ZOS does not require installation, it does need to download and store many things to operate correctly. This includes the following:

- Node identity. Information about the node id and keys
- The system binaries: everything zos needs to join the grid and operate as expected
- Workload flists. These are the flists of the user workloads. They are downloaded on demand, so they don't always exist.
- State information. Tracking information maintained by ZOS about the state of workloads, ownership, and more.

This is why, on first start, the system allocates and reserves a part of the available SSD storage, called `zos-cache`. Initially this is `5G` (it was 100G in older versions), but because of the `dynamic` nature of the cache it can't be fixed at `5G`.

The space the system needs to reserve can change dramatically based on the amount of workloads running on it. For example, if many users are running many different VMs, the system will need to download (and cache) different VM images, hence requiring more cache.

This is why the system periodically checks the reserved storage and dynamically expands or shrinks it to a more suitable value, in increments of 5G. Expansion happens when free cache space drops to around 20% of the current cache size, and the cache shrinks if usage goes below 20%.
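
A minimal sketch of that resize decision, assuming the reading above (expand when the cache is roughly 80% full, shrink when usage falls below 20%); the function name and exact thresholds are illustrative, not the actual implementation:

```python
CACHE_STEP_GB = 5  # zos-cache grows and shrinks in 5G increments

def target_cache_size_gb(used_gb: float, current_gb: float) -> float:
    """Illustrative resize decision for zos-cache."""
    free_gb = current_gb - used_gb
    if free_gb < current_gb * 0.2:
        # cache is getting full: grow by one 5G increment
        return current_gb + CACHE_STEP_GB
    if used_gb < current_gb * 0.2 and current_gb > CACHE_STEP_GB:
        # cache is mostly empty: shrink by one increment, but never below 5G
        return current_gb - CACHE_STEP_GB
    return current_gb
```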

## User Capacity

All workloads require some sort of resource(s) to run, and that is what the user actually pays for. Any workload can consume resources of one or more of the following types:

- CU (compute unit in vCPU)
- MU (memory unit in bytes)
- NU (network unit in bytes)
- SU (ssd storage in bytes)
- HU (hdd storage in bytes)

A workload, based on its type, can consume one or more of those resource types. Some workloads have a well-known "size" on creation; others are dynamic and won't be known until later.

For example, the SU consumption of a disk workload is known ahead of time, unlike the NU used by a network, which is only known after usage over a certain period of time.

A single deployment can have multiple workloads, each requiring a certain amount of one or more of the capacity types listed above. For each workload, ZOS computes the amount of resources needed and then checks whether it can provide that amount of capacity.

> This means that a deployment that defines 2 VMs can partially succeed: one of the VMs may deploy while the other fails, if the resources it requested are more than what the node can provide.
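
A toy illustration of that per-workload admission loop; the data shapes and the `node.free`/`node.reserve` helpers are made up for the example:

```python
def provision(deployment: dict, node) -> dict:
    """Each workload is admitted on its own; one failure doesn't roll back the others."""
    results = {}
    for wl in deployment["workloads"]:
        needed = wl["resources"]  # e.g. {"MU": 5 * 2**30, "SU": 20 * 2**30}
        if all(node.free(kind) >= amount for kind, amount in needed.items()):
            node.reserve(needed)
            results[wl["name"]] = "deployed"
        else:
            results[wl["name"]] = "error: not enough capacity"
    return results
```
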
### Memory

The system decides whether there is enough memory to run a certain workload that demands MU resources as follows (a short sketch follows the list):

- Compute the "theoretically used" memory by all user workloads, excluding `self`. This is basically the sum of the consumed MU units of all active workloads (as defined by their corresponding deployments, not as actually used on the system).
- The theoretically used memory is topped up with the system reserved memory.
- Then the system checks the actually used memory on the system. This is done simply as `actual_used = memory.total - memory.available`.
- The system can now simply `assume` an accurate used memory as `used = max(actual_used, theoretically_used)`.
- Then `available = total - used`.
- Finally, it simply checks that the `available` memory is enough to hold the requested workload memory!
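
A hedged sketch of that check; all values are in bytes and the function signature is illustrative:

```python
def can_fit_memory(requested_mu: int, theoretically_used: int,
                   reserved: int, total: int, available: int) -> bool:
    """Memory admission check as described above.

    `theoretically_used` is the sum of MU of all active workloads,
    excluding the workload being checked.
    """
    theoretical = theoretically_used + reserved  # top up with system reserved memory
    actual_used = total - available              # what the system actually reports
    used = max(actual_used, theoretical)         # assume the more pessimistic value
    return total - used >= requested_mu
```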

### Storage

Storage is much simpler to allocate than memory. It's left completely to the storage subsystem to figure out whether it can fit the requested storage on the available physical disks; if that is not possible, the workload is marked as error.

The storage subsystem tries to find the requested space based on type (SU or HU), then finds the optimal way to fit it on the available disks, or spins up a new disk if needed.
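
An illustrative sketch of that strategy; the pool layout is an assumption, and the best-fit choice here just stands in for whatever the storage subsystem considers optimal:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Pool:
    kind: str   # "SU" (ssd) or "HU" (hdd)
    total: int  # bytes
    used: int   # bytes

    @property
    def free(self) -> int:
        return self.total - self.used

def allocate(pools: List[Pool], kind: str, size: int) -> Optional[Pool]:
    """Pick a pool of the requested type that can hold `size` bytes, if any."""
    candidates = [p for p in pools if p.kind == kind and p.free >= size]
    if not candidates:
        return None  # the subsystem would try to bring up a new disk, else the workload errors
    best = min(candidates, key=lambda p: p.free)  # tightest fit
    best.used += size
    return best
```
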
13 changes: 12 additions & 1 deletion docs/internals/internals.md
@@ -1,7 +1,9 @@
# Introduction

This document explains in a nutshell the internals of ZOS. This includes the boot process, architecture, the internal modules (and their responsibilities), and the inter-process communication.

## Booting

ZOS is a linux based operating system in the sense that we use the main-stream linux kernel with no modifications (but heavily customized). The base image of ZOS includes linux, busybox, [zinit](https://github.com/threefoldtech/zinit) and other required tools that are needed during the boot process. The base image is also shipped with a bootstrap utility that is self-updating on boot which kick starts everything.

For more details about the ZOS base image please check [0-initramfs](https://github.com/threefoldtech/0-initramfs).
@@ -15,24 +17,29 @@ The base `ZOS` image has a zinit config to start the basic services that are req
- bootstrap: The bootstrap process which takes care of downloading all required zos binaries and modules. This one requires the `internet` service to actually succeed.

## Bootstrap

`bootstrap` is a utility that resides on the base image. It takes care of downloading and configuring all zos main services by doing the following:

- It checks if there is a more recent version of itself available. If it exists, the process first updates itself before proceeding.
- It checks zos boot parameters (for example, which network you are booting into) as set by <https://bootstrap.grid.tf/>.
- Once the network is known, let's call it `${network}`. This can either be `production`, `testing`, or `development`. The proper release is downloaded as follows:
- All flists are downloaded from one of the [hub](https://hub.grid.tf/) `tf-zos-v3-bins.dev`, `tf-zos-v3-bins.test`, or `tf-zos-v3-bins` repos. Based on the network, only one of those repos is used to download all the support tools and binaries (see the sketch after this list). Those are not included in the base image because they can be updated, added, or removed.
- The flist `https://hub.grid.tf/tf-zos/zos:${network}-3:latest.flist.md` is downloaded (note that ${network} is replaced with the actual value). This flist includes all zos services from this repository. More information about the zos modules is given later.
- Once all binaries are downloaded, `bootstrap` finishes by asking zinit to start monitoring the newly installed services. The bootstrap exits and will never be started again as long as zos is running.
- If zos is restarted the entire bootstrap process happens again including downloading the binaries because ZOS is completely stateless (except for some cached runtime data that is preserved across reboots on a cache disk).
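
As a small illustration of how the network maps to what gets downloaded; the repo-to-network mapping here is inferred from the names above, not taken from the bootstrap code:

```python
# assumed mapping from boot network to hub repo, inferred from the repo names above
BINS_REPO = {
    "production": "tf-zos-v3-bins",
    "testing": "tf-zos-v3-bins.test",
    "development": "tf-zos-v3-bins.dev",
}

def zos_flist(network: str) -> str:
    """Build the zos flist URL for a given ${network}."""
    return f"https://hub.grid.tf/tf-zos/zos:{network}-3:latest.flist.md"

print(zos_flist("development"))  # -> https://hub.grid.tf/tf-zos/zos:development-3:latest.flist.md
```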

## Zinit

As mentioned earlier, `zinit` is the process manager of zos. Bootstrap makes sure it registers all zos services for zinit to monitor. This means that zinit will take care that those services are always running, and restart them if they have crashed for any reason.

## Architecture

For `ZOS` to be able to run workloads of different types, its functionality is split into smaller modules, where each module is responsible for providing a single piece of functionality. For example, `storaged` manages the machine's storage and can therefore provide low-level storage capacity to other services that need it.

As an example, imagine that you want to start a `virtual machine`. For a `virtual machine` to be able to run, it requires a `rootfs` image or the image of the VM itself, normally provided via an `flist` (managed by `flistd`). It also needs actual persistent storage (managed by `storaged`), a virtual nic (managed by `networkd`), and another service that can put everything together in the form of a VM (`vmd`). Finally, a service that orchestrates all of this and translates the user request into an actual workload: `provisiond`. You get the picture.

### IPC

All modules running in zos need to be able to interact with each other, as the previous example shows. For example, the `provision` daemon needs to be able to ask the `storage` daemon to prepare a virtual disk. A new `inter-process communication` protocol and library was developed to enable this, with these extra features:

- Modules do not need to know where other modules live; there are no ports and/or urls that have to be known by all services.
@@ -47,6 +54,7 @@ For more details about the message bus please check [zbus](https://github.com/th
`zbus` allows auto-generation of `stubs`, which are generated clients against a certain module interface. Hence, a module X can interact with a module Y by importing the generated client and then making function calls.

## ZOS Processes (modules)

Modules of zos are completely internal. There is no way for an external user to talk to them directly. The idea is that the node exposes a public API over rmb, while internally this API talks to the internal modules over `zbus`.

Here is a list of the major ZOS modules.
@@ -60,3 +68,6 @@ Here is a list of the major ZOS modules.
- [VM](vmd/readme.md)
- [Provision](provision/readme.md)

## Capacity

In [this document](capacity.md) you can find a detailed description of how ZOS does capacity planning.
