Replies: 4 comments 2 replies
-
@mmaslankaprv @rystsov please take a look.
-
@Lazin can we describe hydration in more detail? E.g. how are we going to handle a read: will we wait for the S3 segment to be downloaded before returning from the first fetch? Are we going to fetch the whole segment and then unpack it? I am also wondering what will happen if two requests want to fetch the same offset that is old enough to trigger an S3 segment download. Are we going to somehow detect that a download is already in progress?
-
I am wondering about one more thing here, i.e. which
-
So we only upload sealed segments. Should we also have a timeout for when we seal a segment file, since a topic with very little data might not upload anything to S3/GCS for a long time? For example, Grafana Loki does something similar with S3 and uses a 30-minute timeout: if it has not hit the size limit by then, it still flushes and pushes the data to S3.
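A minimal sketch of what such a time-based roll policy could look like (the struct, field names, and limits below are hypothetical, not existing Redpanda configuration):

```cpp
#include <chrono>
#include <iostream>

// Hypothetical policy: seal (roll) an open segment either when it reaches its
// size limit or when it has been open longer than max_segment_age, so that
// low-throughput topics still get uploaded to S3/GCS periodically.
struct roll_policy {
    size_t max_segment_bytes = 1ull << 30;     // 1 GiB
    std::chrono::minutes max_segment_age{30};  // e.g. a Loki-style 30 minute flush

    bool should_roll(size_t segment_bytes,
                     std::chrono::steady_clock::time_point opened_at,
                     std::chrono::steady_clock::time_point now) const {
        return segment_bytes >= max_segment_bytes || (now - opened_at) >= max_segment_age;
    }
};

int main() {
    roll_policy p;
    auto opened = std::chrono::steady_clock::now() - std::chrono::minutes(45);
    std::cout << std::boolalpha
              << p.should_roll(10 * 1024 * 1024, opened, std::chrono::steady_clock::now())
              << '\n'; // true: the segment is small but older than 30 minutes
    return 0;
}
```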
-
The Goals
Archived Data
Currently the archival subsystem uploads log segments along with manifest files (a manifest per partition plus a topic manifest). The partition manifest stores `last_offset`, the committed offset of the last uploaded segment (segments are uploaded in offset order). On startup the archival subsystem downloads the manifest from S3 to get `last_offset`, which is then advanced by uploading new segments. This sequence is performed per-NTP.
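As a rough illustration of this per-NTP sequence (the manifest shape and function names below are simplified placeholders, not the real archival types):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-in for the per-partition manifest stored in S3.
struct partition_manifest {
    std::string ntp;          // namespace/topic/partition
    int64_t last_offset = -1; // committed offset of the last uploaded segment
};

struct segment { int64_t base_offset; int64_t committed_offset; };

// Placeholder for the real S3 download; returns whatever was uploaded last.
partition_manifest download_manifest(const std::string& ntp) {
    return partition_manifest{ntp, 41};
}

// Upload loop: only segments fully above last_offset are uploaded, in offset
// order, and last_offset is advanced after each successful upload.
void upload_new_segments(partition_manifest& m, const std::vector<segment>& log) {
    for (const auto& s : log) {
        if (s.base_offset <= m.last_offset) continue; // already archived
        // ... upload the segment to S3 here ...
        m.last_offset = s.committed_offset;
    }
}

int main() {
    auto m = download_manifest("kafka/topic-1/0"); // startup: recover last_offset
    upload_new_segments(m, {{0, 41}, {42, 99}, {100, 150}});
    std::cout << "last_offset=" << m.last_offset << '\n'; // 150
    return 0;
}
```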
Individual Topic Recovery
To recover a topic from S3 we can leverage the already existing offset recovery mechanism. When a topic is created, the directory structure for every partition is created first, and the raft group bootstrap is performed next. We can download the data from S3 and put it into the designated directory before bootstrapping the raft group. The recovery mechanism should then scan the downloaded segments and reconcile the available offsets, after which the data becomes available to consumers.
One possible approach is to add a custom topic configuration parameter containing the S3 path to the topic manifest. The partition count should match the count in the topic manifest (or the value in the topic manifest should act as an override). Precautions should also be taken so that this mechanism triggers the download only once and not after every restart.
The topic manifest also contains data related to the retention policy. This data shouldn't override values provided by the user, and the recovery process should respect the retention policy in place: the S3 bucket may contain far more data than the retention policy allows Redpanda to keep on disk.
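A hedged sketch of that per-partition recovery sequence (the names `download_segments`, `recover_partition`, and the persisted already-recovered flag are hypothetical, not the actual recovery code):

```cpp
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Placeholder: list and download the archived segments for one partition.
std::vector<std::string> download_segments(const std::string& manifest_path,
                                           const fs::path& partition_dir) {
    (void)manifest_path;
    (void)partition_dir;
    return {"0-41-v1.log", "42-99-v1.log"}; // pretend these were fetched from S3
}

// Hypothetical recovery sequence for a single partition: the segment data is
// placed into the partition directory *before* the raft group is bootstrapped,
// so the existing offset-recovery path picks it up on startup.
void recover_partition(const std::string& topic_manifest_path,
                       const fs::path& partition_dir,
                       bool already_recovered /* persisted flag: recover only once */) {
    if (already_recovered) return;         // must not re-download on every restart
    fs::create_directories(partition_dir); // 1. directory structure first
    auto segs = download_segments(topic_manifest_path, partition_dir); // 2. data from S3
    // 3. bootstrap the raft group here; 4. offset recovery scans the segments.
    std::cout << "recovered " << segs.size() << " segments into " << partition_dir << '\n';
}

int main() {
    recover_partition("s3://bucket/topic_manifest.json", "data/kafka/topic-1/0_1", false);
    return 0;
}
```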
Partial Topic Recovery
It is possible to download less data than the retention policy allows. We can download as little data as possible to bootstrap the raft group and let the "infinite storage" layer hydrate the remaining segments down the road.
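For instance, one way to pick the minimal set of segments to download up front could look like this (purely illustrative, assuming a hypothetical bootstrap byte budget):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

struct segment_meta { int64_t base_offset; int64_t size_bytes; };

// Hypothetical selection for partial recovery: walk the archived segments from
// newest to oldest and stop once a small bootstrap budget is reached; the rest
// can be hydrated from S3 on demand later.
std::vector<segment_meta> pick_bootstrap_segments(const std::vector<segment_meta>& archived,
                                                  int64_t budget_bytes) {
    std::vector<segment_meta> picked;
    int64_t used = 0;
    for (auto it = archived.rbegin(); it != archived.rend(); ++it) {
        picked.insert(picked.begin(), *it); // keep offset order
        used += it->size_bytes;
        if (used >= budget_bytes) break;
    }
    return picked;
}

int main() {
    auto picked = pick_bootstrap_segments(
      {{0, 1000}, {100, 1000}, {200, 1000}, {300, 1000}}, 2000);
    std::cout << "download " << picked.size() << " of 4 segments, starting at offset "
              << picked.front().base_offset << '\n'; // 2 segments, starting at offset 200
    return 0;
}
```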
Cluster Recovery
To recover the whole cluster we need to recover all topics and the metadata (users, ACLs, consumer groups, coprocessor data, etc.). This requires the information stored in the system topics, so the archival subsystem should be updated to start uploading this data; right now it's not uploaded. During disaster recovery we download the controller log and parse it to retrieve a list of topics to create, and then recover them one by one.
The cluster recovery procedure can perform partial topic recovery to speed up the process.
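A toy sketch of replaying an archived controller log into a list of topics to recreate (the record types below are placeholders for the real controller commands, which also carry users, ACLs, configs, and other metadata):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <variant>
#include <vector>

// Placeholder controller-log record types.
struct create_topic { std::string name; int partitions; };
struct delete_topic { std::string name; };
using controller_record = std::variant<create_topic, delete_topic>;

// Replay the log in order: creations add topics, deletions remove them, and
// whatever remains at the end is the set of topics to recover.
std::vector<create_topic> topics_to_recover(const std::vector<controller_record>& log) {
    std::vector<create_topic> topics;
    for (const auto& r : log) {
        if (const auto* c = std::get_if<create_topic>(&r)) {
            topics.push_back(*c);
        } else if (const auto* d = std::get_if<delete_topic>(&r)) {
            topics.erase(std::remove_if(topics.begin(), topics.end(),
                                        [&](const create_topic& t) { return t.name == d->name; }),
                         topics.end());
        }
    }
    return topics;
}

int main() {
    auto topics = topics_to_recover({create_topic{"orders", 3},
                                     create_topic{"tmp", 1},
                                     delete_topic{"tmp"}});
    for (const auto& t : topics)
        std::cout << "recover topic " << t.name << " (" << t.partitions << " partitions)\n";
    return 0;
}
```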
Tiered Storage
When a user attempts to read an offset that is less than the smallest locally stored offset, the storage subsystem will fetch the missing log segments from S3. To be able to do this, the storage subsystem should track `last_offset` for every partition. In the read path, if the required offset is not available locally, we compare it against `last_offset`: if the offset is less than `last_offset`, we can download the missing segments from S3.
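The read-path decision boils down to something like the following (a sketch; the names are illustrative, not actual storage APIs):

```cpp
#include <cstdint>
#include <iostream>

// start_offset_local is the smallest offset still present on disk;
// last_offset_archived is the highest offset that has been uploaded to S3.
enum class read_source { local, hydrate_from_s3, out_of_range };

read_source resolve_read(int64_t requested,
                         int64_t start_offset_local,
                         int64_t last_offset_archived) {
    if (requested >= start_offset_local) return read_source::local;
    if (requested <= last_offset_archived) return read_source::hydrate_from_s3;
    return read_source::out_of_range; // neither on disk nor archived
}

int main() {
    // The local log starts at offset 1000; everything up to 1499 is archived.
    std::cout << std::boolalpha
              << (resolve_read(1200, 1000, 1499) == read_source::local) << '\n'          // true
              << (resolve_read(500, 1000, 1499) == read_source::hydrate_from_s3) << '\n'; // true
    return 0;
}
```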
Last Offset Tracking
Currently `last_offset` is stored in the partition manifest. This is not convenient for the infinite storage implementation because the manifest is downloaded from S3 on startup. Also, the manifest is available only on the leader of the raft group; when a leadership transfer happens, the partition manifest is re-downloaded on the new leader. This makes the partition manifest a bad place to store `last_offset`.
Alternatively, we can store `last_offset` using a system topic or using the raft log.
Storing last offset in a topic
We can use a very simple data model: the NTP is the key and `last_offset` is the value. To run a request over archived data we need to fetch the latest record with a matching NTP. Normally, we update `last_offset` after a segment upload. This means that the total number of produced records will be roughly equal to the total number of log segments (all topics, all partitions). Another way to estimate the total number of records is to divide the total number of produced bytes (not stored) by the size of a log segment. For instance, if the cluster received 1PB of data over its lifetime and the log segment size is 1GB (the default), the system topic will have 1,000,000 records.
This approach has advantages and disadvantages.
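A sketch of this key/value model and the record-count estimate (illustrative only; the real implementation would be a compacted system topic rather than an in-memory map):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    // The NTP is the key, last_offset is the value. Reading the current value
    // means taking the latest record for a given NTP, which a compacted,
    // map-like view of the topic provides directly.
    std::map<std::string, int64_t> latest_last_offset;

    // Each segment upload produces one record; the latest record wins.
    latest_last_offset["kafka/topic-1/0"] = 999;
    latest_last_offset["kafka/topic-1/0"] = 1999;
    latest_last_offset["kafka/topic-1/1"] = 499;

    std::cout << "topic-1/0 last_offset=" << latest_last_offset["kafka/topic-1/0"] << '\n';

    // Back-of-the-envelope record count: one record per uploaded segment, i.e.
    // total produced bytes divided by the segment size.
    const double produced_bytes = 1e15; // 1 PB over the cluster lifetime
    const double segment_bytes = 1e9;   // 1 GB default segment size
    std::cout << "estimated records: "
              << static_cast<int64_t>(produced_bytes / segment_bytes) << '\n'; // 1,000,000
    return 0;
}
```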
Storing last offset in the raft log
We can introduce a control batch that stores `last_offset`. This control batch would be generated every time we upload a new segment to S3. The number of such batches should be roughly equal to the number of segments in the partition log (including those that were already reclaimed). In other words, every log segment should contain one such control batch that signifies the upload of the adjacent log segment.
This approach has advantages and disadvantages.
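A minimal sketch of such a control-batch payload and of recovering `last_offset` by replaying it (the record type is hypothetical, not an existing batch type):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical control-batch payload: every segment upload appends one of
// these to the partition's own raft log, so replaying the log recovers
// last_offset without consulting the S3 manifest.
struct archival_metadata_record {
    int64_t uploaded_base_offset;
    int64_t new_last_offset;
};

int64_t replay_last_offset(const std::vector<archival_metadata_record>& control_records) {
    int64_t last_offset = -1;
    for (const auto& r : control_records)
        last_offset = std::max(last_offset, r.new_last_offset);
    return last_offset;
}

int main() {
    // Roughly one record per uploaded segment, including already reclaimed ones.
    std::vector<archival_metadata_record> records = {{0, 999}, {1000, 1999}, {2000, 2999}};
    std::cout << "recovered last_offset=" << replay_last_offset(records) << '\n'; // 2999
    return 0;
}
```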
Hydration
The actual hydration of a log segment should be implemented at the log_manager level in storage. The decision to hydrate a range of offsets from S3 should be made at the cluster level, since `last_offset` is managed by raft.
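One question raised earlier in the thread is what happens when two readers request the same archived offset concurrently. A purely illustrative way to de-duplicate in-flight downloads (not the actual log_manager API; Redpanda would use Seastar futures rather than std::async):

```cpp
#include <future>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Illustrative hydration cache: if a download of the same segment is already
// in flight, a second reader waits on the same shared future instead of
// starting a duplicate S3 download.
class hydration_cache {
public:
    std::shared_future<std::string> hydrate(const std::string& segment_path) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _inflight.find(segment_path);
        if (it != _inflight.end()) return it->second; // download already in progress
        auto fut = std::async(std::launch::async, [segment_path] {
            // ... real code would fetch and unpack the segment from S3 here ...
            return "local path for " + segment_path;
        }).share();
        _inflight.emplace(segment_path, fut);
        return fut;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::shared_future<std::string>> _inflight;
};

int main() {
    hydration_cache cache;
    auto a = cache.hydrate("topic-1/0/42-99-v1.log");
    auto b = cache.hydrate("topic-1/0/42-99-v1.log"); // reuses the in-flight download
    std::cout << a.get() << '\n' << b.get() << '\n';
    return 0;
}
```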
Eviction
The downloaded log segments are candidates for deletion because their offsets are less than the max collectable offset, which means the next deletion round would remove them. To prevent this we should postpone the deletion. We may keep the downloaded segments in an LRU cache and prevent a segment from being deleted as long as it is not a candidate for eviction in that cache. Another possible mitigation strategy is to set a TTL on such segments, or to combine both methods.
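An illustrative sketch of combining LRU pinning with a TTL (the class, capacity, and TTL values are hypothetical): a hydrated segment is protected from the regular deletion pass while it is resident in the LRU and its TTL has not yet expired.

```cpp
#include <chrono>
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

class hydrated_segment_lru {
public:
    hydrated_segment_lru(size_t capacity, std::chrono::seconds ttl)
      : _capacity(capacity), _ttl(ttl) {}

    // Called when a segment is hydrated or read again.
    void touch(const std::string& segment) {
        auto now = std::chrono::steady_clock::now();
        auto it = _index.find(segment);
        if (it != _index.end()) _lru.erase(it->second);
        _lru.push_front({segment, now});
        _index[segment] = _lru.begin();
        if (_lru.size() > _capacity) {   // evict the least recently used entry
            _index.erase(_lru.back().name);
            _lru.pop_back();
        }
    }

    // The deletion pass asks this before garbage-collecting a downloaded segment.
    bool may_delete(const std::string& segment) const {
        auto it = _index.find(segment);
        if (it == _index.end()) return true; // already evicted from the LRU
        auto age = std::chrono::steady_clock::now() - it->second->hydrated_at;
        return age > _ttl;                   // pinned until the TTL expires
    }

private:
    struct entry { std::string name; std::chrono::steady_clock::time_point hydrated_at; };
    size_t _capacity;
    std::chrono::seconds _ttl;
    std::list<entry> _lru;
    std::unordered_map<std::string, std::list<entry>::iterator> _index;
};

int main() {
    hydrated_segment_lru lru(2, std::chrono::seconds(600));
    lru.touch("42-99-v1.log");
    std::cout << std::boolalpha
              << lru.may_delete("42-99-v1.log") << '\n'  // false: still pinned
              << lru.may_delete("0-41-v1.log") << '\n';  // true: never hydrated
    return 0;
}
```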