Replies: 4 comments 2 replies
-
@mmaslankaprv @rystsov please take a look.
-
@Lazin can we describe hydration in more detail? E.g. how are we going to handle a read: will we wait for the S3 segment to be downloaded before returning from the first fetch? Are we going to fetch the whole segment and then unpack it? I am also wondering what will happen if two requests want to fetch the same offset that is old enough to trigger an S3 segment download. Are we going to somehow detect that a download is already in progress?
-
I am wondering about one more thing here, i.e. which
-
So we only upload sealed segments. Should we also have a timeout for when we seal a segment file, since a topic with very little data might not upload anything to S3/GCS for a long time? For example, Grafana Loki does something similar with S3 and uses a 30-minute timeout: if it has not hit the size limit by then, it still flushes and pushes the data to S3.
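A minimal sketch of what such a time-based roll policy could look like (the struct, field names, and limits below are hypothetical, not existing Redpanda configuration):

```cpp
#include <chrono>
#include <iostream>

// Hypothetical policy: seal (roll) an open segment either when it reaches its
// size limit or when it has been open longer than max_segment_age, so that
// low-throughput topics still get uploaded to S3/GCS periodically.
struct roll_policy {
    size_t max_segment_bytes = 1ull << 30;     // 1 GiB
    std::chrono::minutes max_segment_age{30};  // e.g. a Loki-style 30 minute flush

    bool should_roll(size_t segment_bytes,
                     std::chrono::steady_clock::time_point opened_at,
                     std::chrono::steady_clock::time_point now) const {
        return segment_bytes >= max_segment_bytes || (now - opened_at) >= max_segment_age;
    }
};

int main() {
    roll_policy p;
    auto opened = std::chrono::steady_clock::now() - std::chrono::minutes(45);
    std::cout << std::boolalpha
              << p.should_roll(10 * 1024 * 1024, opened, std::chrono::steady_clock::now())
              << '\n'; // true: the segment is small but older than 30 minutes
    return 0;
}
```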
-
The Goals
Archived Data
Currently the archival subsystem uploads log segments along with manifest files (a manifest per partition plus a topic manifest). The partition manifest stores `last_offset`, the committed offset of the last uploaded segment (segments are uploaded in offset order). On startup the archival subsystem downloads the manifest from S3 to get `last_offset`, which is then advanced by uploading new segments. This sequence is performed per-NTP.
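As a rough illustration of this per-NTP sequence (the manifest shape and function names below are simplified placeholders, not the real archival types):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-in for the per-partition manifest stored in S3.
struct partition_manifest {
    std::string ntp;          // namespace/topic/partition
    int64_t last_offset = -1; // committed offset of the last uploaded segment
};

struct segment { int64_t base_offset; int64_t committed_offset; };

// Placeholder for the real S3 download; returns whatever was uploaded last.
partition_manifest download_manifest(const std::string& ntp) {
    return partition_manifest{ntp, 41};
}

// Upload loop: only segments fully above last_offset are uploaded, in offset
// order, and last_offset is advanced after each successful upload.
void upload_new_segments(partition_manifest& m, const std::vector<segment>& log) {
    for (const auto& s : log) {
        if (s.base_offset <= m.last_offset) continue; // already archived
        // ... upload the segment to S3 here ...
        m.last_offset = s.committed_offset;
    }
}

int main() {
    auto m = download_manifest("kafka/topic-1/0"); // startup: recover last_offset
    upload_new_segments(m, {{0, 41}, {42, 99}, {100, 150}});
    std::cout << "last_offset=" << m.last_offset << '\n'; // 150
    return 0;
}
```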
Individual Topic Recovery
To recover a topic from S3 we can leverage the already existing offset recovery mechanism. When a topic is created, the directory structure for every partition is created first, and the raft group bootstrap is performed next. We can download the data from S3 and put it into the designated directory before bootstrapping the raft group. The recovery mechanism should then scan the downloaded segments and reconcile the available offsets, after which the data becomes available to consumers.
One possible approach is to add a custom topic configuration parameter containing the S3 path to the topic manifest. The partition count should match the count in the topic manifest (or the value in the topic manifest should act as an override). Precautions should also be taken so that this mechanism triggers the download only once and not after every restart.
The topic manifest also contains data related to the retention policy. This data shouldn't override values provided by the user, and the recovery process should respect the retention policy in place: the S3 bucket may contain far more data than the retention policy allows Redpanda to keep on disk.
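A hedged sketch of that per-partition recovery sequence (the names `download_segments`, `recover_partition`, and the persisted already-recovered flag are hypothetical, not the actual recovery code):

```cpp
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Placeholder: list and download the archived segments for one partition.
std::vector<std::string> download_segments(const std::string& manifest_path,
                                           const fs::path& partition_dir) {
    (void)manifest_path;
    (void)partition_dir;
    return {"0-41-v1.log", "42-99-v1.log"}; // pretend these were fetched from S3
}

// Hypothetical recovery sequence for a single partition: the segment data is
// placed into the partition directory *before* the raft group is bootstrapped,
// so the existing offset-recovery path picks it up on startup.
void recover_partition(const std::string& topic_manifest_path,
                       const fs::path& partition_dir,
                       bool already_recovered /* persisted flag: recover only once */) {
    if (already_recovered) return;         // must not re-download on every restart
    fs::create_directories(partition_dir); // 1. directory structure first
    auto segs = download_segments(topic_manifest_path, partition_dir); // 2. data from S3
    // 3. bootstrap the raft group here; 4. offset recovery scans the segments.
    std::cout << "recovered " << segs.size() << " segments into " << partition_dir << '\n';
}

int main() {
    recover_partition("s3://bucket/topic_manifest.json", "data/kafka/topic-1/0_1", false);
    return 0;
}
```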
Partial Topic Recovery
It is possible to download less data than the retention policy allows. We can download as little data as possible to bootstrap the raft group and let the "infinite storage" layer hydrate the remaining segments down the road.
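For instance, one way to pick the minimal set of segments to download up front could look like this (purely illustrative, assuming a hypothetical bootstrap byte budget):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

struct segment_meta { int64_t base_offset; int64_t size_bytes; };

// Hypothetical selection for partial recovery: walk the archived segments from
// newest to oldest and stop once a small bootstrap budget is reached; the rest
// can be hydrated from S3 on demand later.
std::vector<segment_meta> pick_bootstrap_segments(const std::vector<segment_meta>& archived,
                                                  int64_t budget_bytes) {
    std::vector<segment_meta> picked;
    int64_t used = 0;
    for (auto it = archived.rbegin(); it != archived.rend(); ++it) {
        picked.insert(picked.begin(), *it); // keep offset order
        used += it->size_bytes;
        if (used >= budget_bytes) break;
    }
    return picked;
}

int main() {
    auto picked = pick_bootstrap_segments(
      {{0, 1000}, {100, 1000}, {200, 1000}, {300, 1000}}, 2000);
    std::cout << "download " << picked.size() << " of 4 segments, starting at offset "
              << picked.front().base_offset << '\n'; // 2 segments, starting at offset 200
    return 0;
}
```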
Cluster Recovery
To recover the whole cluster we need to recover all topics and the metadata (users, ACLs, consumer groups, coprocessor data, etc.). This requires the information stored in the system topics, so the archival subsystem should be updated to start uploading this data; right now it's not uploaded. During disaster recovery we download the controller log and parse it to retrieve a list of topics to create, and then recover them one by one.
The cluster recovery procedure can perform partial topic recovery to speed up the process.
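A toy sketch of replaying an archived controller log into a list of topics to recreate (the record types below are placeholders for the real controller commands, which also carry users, ACLs, configs, and other metadata):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <variant>
#include <vector>

// Placeholder controller-log record types.
struct create_topic { std::string name; int partitions; };
struct delete_topic { std::string name; };
using controller_record = std::variant<create_topic, delete_topic>;

// Replay the log in order: creations add topics, deletions remove them, and
// whatever remains at the end is the set of topics to recover.
std::vector<create_topic> topics_to_recover(const std::vector<controller_record>& log) {
    std::vector<create_topic> topics;
    for (const auto& r : log) {
        if (const auto* c = std::get_if<create_topic>(&r)) {
            topics.push_back(*c);
        } else if (const auto* d = std::get_if<delete_topic>(&r)) {
            topics.erase(std::remove_if(topics.begin(), topics.end(),
                                        [&](const create_topic& t) { return t.name == d->name; }),
                         topics.end());
        }
    }
    return topics;
}

int main() {
    auto topics = topics_to_recover({create_topic{"orders", 3},
                                     create_topic{"tmp", 1},
                                     delete_topic{"tmp"}});
    for (const auto& t : topics)
        std::cout << "recover topic " << t.name << " (" << t.partitions << " partitions)\n";
    return 0;
}
```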
Tiered Storage
When a user attempts to read an offset that is less than the smallest locally stored offset, the storage subsystem will fetch the missing log segments from S3. To be able to do this, the storage subsystem should track `last_offset` for every partition. In the read path, if the required offset is not available locally, we compare it against `last_offset`: if the offset is less than `last_offset`, we can download the missing segments from S3.
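The read-path decision boils down to something like the following (a sketch; the names are illustrative, not actual storage APIs):

```cpp
#include <cstdint>
#include <iostream>

// start_offset_local is the smallest offset still present on disk;
// last_offset_archived is the highest offset that has been uploaded to S3.
enum class read_source { local, hydrate_from_s3, out_of_range };

read_source resolve_read(int64_t requested,
                         int64_t start_offset_local,
                         int64_t last_offset_archived) {
    if (requested >= start_offset_local) return read_source::local;
    if (requested <= last_offset_archived) return read_source::hydrate_from_s3;
    return read_source::out_of_range; // neither on disk nor archived
}

int main() {
    // The local log starts at offset 1000; everything up to 1499 is archived.
    std::cout << std::boolalpha
              << (resolve_read(1200, 1000, 1499) == read_source::local) << '\n'          // true
              << (resolve_read(500, 1000, 1499) == read_source::hydrate_from_s3) << '\n'; // true
    return 0;
}
```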
Last Offset Tracking
Currently `last_offset` is stored in the partition manifest. This is not convenient for the infinite storage implementation because the manifest is downloaded from S3 on startup. Also, the manifest is available only on the leader of the raft group; when a leadership transfer happens, the partition manifest is re-downloaded on the new leader. This makes the partition manifest a bad place to store `last_offset`.
Alternatively, we can store `last_offset` using a system topic or using the raft log.
Storing last offset in a topic
We can use a very simple data model: the NTP is the key and `last_offset` is the value. To run a request over archived data we need to fetch the latest record with a matching NTP. Normally, we update `last_offset` after a segment upload. This means that the total number of produced records will be roughly equal to the total number of log segments (all topics, all partitions). Another way to estimate the total number of records is to divide the total number of produced bytes (not stored) by the size of a log segment. For instance, if the cluster received 1PB of data over its lifetime and the log segment size is 1GB (the default), the system topic will have 1,000,000 records.
This approach has advantages and disadvantages.
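A sketch of this key/value model and the record-count estimate (illustrative only; the real implementation would be a compacted system topic rather than an in-memory map):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    // The NTP is the key, last_offset is the value. Reading the current value
    // means taking the latest record for a given NTP, which a compacted,
    // map-like view of the topic provides directly.
    std::map<std::string, int64_t> latest_last_offset;

    // Each segment upload produces one record; the latest record wins.
    latest_last_offset["kafka/topic-1/0"] = 999;
    latest_last_offset["kafka/topic-1/0"] = 1999;
    latest_last_offset["kafka/topic-1/1"] = 499;

    std::cout << "topic-1/0 last_offset=" << latest_last_offset["kafka/topic-1/0"] << '\n';

    // Back-of-the-envelope record count: one record per uploaded segment, i.e.
    // total produced bytes divided by the segment size.
    const double produced_bytes = 1e15; // 1 PB over the cluster lifetime
    const double segment_bytes = 1e9;   // 1 GB default segment size
    std::cout << "estimated records: "
              << static_cast<int64_t>(produced_bytes / segment_bytes) << '\n'; // 1,000,000
    return 0;
}
```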
Storing last offset in the raft log
We can introduce a control batch that stores `last_offset`. This control batch would be generated every time we upload a new segment to S3. The number of such batches should be roughly equal to the number of segments in the partition log (including those that were already reclaimed). In other words, every log segment should contain one such control batch that signifies the upload of the adjacent log segment.
This approach has advantages and disadvantages.
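A minimal sketch of such a control-batch payload and of recovering `last_offset` by replaying it (the record type is hypothetical, not an existing batch type):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical control-batch payload: every segment upload appends one of
// these to the partition's own raft log, so replaying the log recovers
// last_offset without consulting the S3 manifest.
struct archival_metadata_record {
    int64_t uploaded_base_offset;
    int64_t new_last_offset;
};

int64_t replay_last_offset(const std::vector<archival_metadata_record>& control_records) {
    int64_t last_offset = -1;
    for (const auto& r : control_records)
        last_offset = std::max(last_offset, r.new_last_offset);
    return last_offset;
}

int main() {
    // Roughly one record per uploaded segment, including already reclaimed ones.
    std::vector<archival_metadata_record> records = {{0, 999}, {1000, 1999}, {2000, 2999}};
    std::cout << "recovered last_offset=" << replay_last_offset(records) << '\n'; // 2999
    return 0;
}
```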
Hydration
The actual hydration of a log segment should be implemented at the log_manager level in storage. The decision to hydrate a range of offsets from S3 should be made at the cluster level, since `last_offset` is managed by raft.
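One question raised earlier in the thread is what happens when two readers request the same archived offset concurrently. A purely illustrative way to de-duplicate in-flight downloads (not the actual log_manager API; Redpanda would use Seastar futures rather than std::async):

```cpp
#include <future>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Illustrative hydration cache: if a download of the same segment is already
// in flight, a second reader waits on the same shared future instead of
// starting a duplicate S3 download.
class hydration_cache {
public:
    std::shared_future<std::string> hydrate(const std::string& segment_path) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _inflight.find(segment_path);
        if (it != _inflight.end()) return it->second; // download already in progress
        auto fut = std::async(std::launch::async, [segment_path] {
            // ... real code would fetch and unpack the segment from S3 here ...
            return "local path for " + segment_path;
        }).share();
        _inflight.emplace(segment_path, fut);
        return fut;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::shared_future<std::string>> _inflight;
};

int main() {
    hydration_cache cache;
    auto a = cache.hydrate("topic-1/0/42-99-v1.log");
    auto b = cache.hydrate("topic-1/0/42-99-v1.log"); // reuses the in-flight download
    std::cout << a.get() << '\n' << b.get() << '\n';
    return 0;
}
```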
Eviction
The downloaded log segments are candidates for deletion because their offsets are less than the max collectable offset, which means the next deletion round would remove them. To prevent this we should postpone the deletion. We may keep the downloaded segments in an LRU cache and prevent a segment from being deleted as long as it is not a candidate for eviction in that cache. Another possible mitigation strategy is to set a TTL on such segments, or to combine both methods.
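An illustrative sketch of combining LRU pinning with a TTL (the class, capacity, and TTL values are hypothetical): a hydrated segment is protected from the regular deletion pass while it is resident in the LRU and its TTL has not yet expired.

```cpp
#include <chrono>
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

class hydrated_segment_lru {
public:
    hydrated_segment_lru(size_t capacity, std::chrono::seconds ttl)
      : _capacity(capacity), _ttl(ttl) {}

    // Called when a segment is hydrated or read again.
    void touch(const std::string& segment) {
        auto now = std::chrono::steady_clock::now();
        auto it = _index.find(segment);
        if (it != _index.end()) _lru.erase(it->second);
        _lru.push_front({segment, now});
        _index[segment] = _lru.begin();
        if (_lru.size() > _capacity) {   // evict the least recently used entry
            _index.erase(_lru.back().name);
            _lru.pop_back();
        }
    }

    // The deletion pass asks this before garbage-collecting a downloaded segment.
    bool may_delete(const std::string& segment) const {
        auto it = _index.find(segment);
        if (it == _index.end()) return true; // already evicted from the LRU
        auto age = std::chrono::steady_clock::now() - it->second->hydrated_at;
        return age > _ttl;                   // pinned until the TTL expires
    }

private:
    struct entry { std::string name; std::chrono::steady_clock::time_point hydrated_at; };
    size_t _capacity;
    std::chrono::seconds _ttl;
    std::list<entry> _lru;
    std::unordered_map<std::string, std::list<entry>::iterator> _index;
};

int main() {
    hydrated_segment_lru lru(2, std::chrono::seconds(600));
    lru.touch("42-99-v1.log");
    std::cout << std::boolalpha
              << lru.may_delete("42-99-v1.log") << '\n'  // false: still pinned
              << lru.may_delete("0-41-v1.log") << '\n';  // true: never hydrated
    return 0;
}
```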