F2FS notes

Introduction

The particular write constraints of flash-based storage devices, which eliminate in-place updates and require the FTL to handle on-device garbage collection, make data structures that limit random write operations preferable. The main candidate for such a data structure in a file system is the log-structured file system (LFS). Many different implementations have appeared over the years; where earlier LFS designs revolved around limiting arm movement on HDDs, F2FS has become the de-facto standard file system in the Linux kernel for flash-based storage devices. The core benefit of an LFS is its append-only nature: random writes are avoided because all data is appended at the end of the log. A drawback of this design is that, similar to the FTL, the file system must now run its own garbage collection (GC); when file data is overwritten, the old blocks are simply marked as invalid and the new data is written at the end of the log. In this post we summarize the high-level workings of F2FS, its integration with evolving storage technologies such as the newly standardized NVMe ZNS SSDs, how to use F2FS, and how to extract performance statistics from F2FS at runtime. For detailed information, we provide a set of additional resources.

F2FS Internals

In order to build an effective LFS for flash-based storage, F2FS relies on several key components. We will look into the layout of the F2FS data structures and the algorithms that make the GC an LFS requires more performant. For full details of each component, consult the F2FS paper [1] and its kernel documentation [2].

On-Disk Layout

The disk layout aims to minimize random writes (although they are not fully eliminated for some components).

[Figure: the F2FS on-disk layout, from [1]]

Superblock

The superblock identifies key information about the file system, such as its version number, and is required to mount the file system. It always resides at the beginning of the address space of the device (or of the partition/namespace, depending on the device setup; in any case it comes before all other metadata). When mounting, the kernel checks the superblock to find the required information. F2FS maintains two copies of the superblock, both created at file system formatting time.

Checkpoint

The superblock is followed by the checkpoint area, which is responsible for storing file system state and providing failure recovery. It stores consistent states of the file system, such that after a failure (e.g., power loss or system crash) the file system can recover the checkpoint and return to a valid state.

Segment Information Table

F2FS utilizes the storage space through a number of segments, which are 2MiB in size. Segments are grouped together into sections, which are aligned to the FTL GC unit (the number of segments per section is 1 by default, so segment size = section size). Sections are the unit of GC, where the victim is selected based on segment information. Segments themselves consist of blocks, which are 4KiB in F2FS; regardless of whether the device has a 512B or a 4KiB sector size, F2FS always does allocations in 4KiB blocks. The Segment Information Table (SIT) follows the checkpoint area and contains information about each segment, including the number of valid/invalid blocks.
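
To illustrate these units, a minimal sketch (the macro names here are illustrative, not the kernel's exact ones):

#define F2FS_BLOCK_SIZE     4096u               /* all allocation is in 4 KiB blocks */
#define SEGMENT_SIZE        (2u * 1024 * 1024)  /* 2 MiB per segment */
#define BLOCKS_PER_SEGMENT  (SEGMENT_SIZE / F2FS_BLOCK_SIZE)  /* = 512 */
#define SEGS_PER_SECTION    1u                  /* default: section == segment */
#define SECTION_SIZE        (SEGS_PER_SECTION * SEGMENT_SIZE)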

Node Address Table

A key way in which F2FS avoids excessive write operations is the use of the Node Address Table (NAT). If, for instance, data of an existing file is updated, the new data is written at the end of the log, and the inode of the file has to be updated to point to the new location of the data. However, since there are no in-place updates, this implies that the inode itself would have to be rewritten, i.e., a new inode appended at the end of the log. Similarly, the parent node (i.e., the metadata of the directory containing the file) would then also have to be updated to locate the new inode of the updated file. This process, known as the wandering tree problem, continues up to the root of the tree (the root directory in this case).

To avoid this, F2FS utilizes the NAT: instead of holding the block address of newly written data, inodes hold an index into the NAT, and the NAT entry identifies the block address. Therefore, on a data update only the NAT entry needs to be rewritten, and the inode keeps the same NAT index. This eliminates the update propagation through the inode tree, but requires the NAT to be written randomly to update individual entries, instead of going through the log.
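
A minimal sketch of this indirection (the types and names here are hypothetical, not the kernel's):

#include <stdint.h>

typedef uint32_t nid_t;     /* node ID: an index into the NAT */
typedef uint32_t blkaddr_t; /* on-disk block address */

/* The NAT maps node IDs to the current on-disk location of each node block. */
static blkaddr_t nat[1u << 20];

/* An inode refers to its node blocks by node ID, not by block address. */
struct inode_ref { nid_t nid; };

/* When a node block is rewritten at the end of the log, only its NAT entry
 * changes; the inode (and all of its ancestors) remain untouched. */
static void update_node_location(nid_t nid, blkaddr_t new_addr)
{
    nat[nid] = new_addr;
}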

Segment Summary Area

The Segment Summary Area (SSA) follows the NAT and maintains, for the blocks in each segment, ownership metadata such as the parent node they belong to. GC uses this information to identify and migrate the valid blocks of a victim segment.

Garbage Collection

F2FS utilizes several methods for ensuring efficient GC and for maintaining system performance under heavy load on the storage device. In particular, it employs two different kinds of GC:

  • Foreground GC: runs when the file system is heavily utilized, i.e., when the number of free sections drops below a certain threshold during I/O.

  • Background GC: runs when the system has been idle for a particular period. A kernel thread sleeps for at least 30 and at most 60 seconds, and is woken up to check whether the system is idle so that GC can run. The relevant defaults are:

    #define DEF_GC_THREAD_URGENT_SLEEP_TIME 500 /* 500 ms */
    #define DEF_GC_THREAD_MIN_SLEEP_TIME    30000   /* milliseconds */
    #define DEF_GC_THREAD_MAX_SLEEP_TIME    60000
    #define DEF_GC_THREAD_NOGC_SLEEP_TIME   300000  /* wait 5 min */
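
Using these timeouts, a simplified sketch of the background thread's decision logic (this paraphrases the behavior described above; it is not the kernel's exact code):

    /* Returns how long the GC thread should sleep before its next wake-up. */
    static unsigned int next_sleep_ms(unsigned int wait_ms, int idle,
                                      int gc_found_victim)
    {
        if (!idle) {
            /* System is busy: back off toward the maximum sleep time. */
            unsigned int next = wait_ms * 2;
            return next > DEF_GC_THREAD_MAX_SLEEP_TIME ?
                   DEF_GC_THREAD_MAX_SLEEP_TIME : next;
        }
        /* Idle: GC runs; stay eager if a victim was cleaned, otherwise
         * sleep long, since there is currently nothing to collect. */
        return gc_found_victim ? DEF_GC_THREAD_MIN_SLEEP_TIME
                               : DEF_GC_THREAD_NOGC_SLEEP_TIME;
    }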

GC Policies

The different types of garbage collection also utilize different GC policies.

Greedy GC

Foreground cleaning relies on greedy GC, which selects the section that yields the largest amount of free space after being garbage collected (i.e., the section with the highest number of invalid blocks).
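
A minimal sketch of greedy victim selection (the types and names are illustrative):

#include <stdint.h>

struct section_info { uint32_t invalid_blocks; };

/* Pick the section with the most invalid (reclaimable) blocks. */
static int pick_greedy_victim(const struct section_info *sec, int nr_sections)
{
    int victim = -1;
    uint32_t best = 0;

    for (int i = 0; i < nr_sections; i++) {
        if (sec[i].invalid_blocks > best) {
            best = sec[i].invalid_blocks;
            victim = i;
        }
    }
    return victim; /* -1 if no section has invalid blocks */
}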

Cost-Benefit GC

The background GC utilizes the cost-benefit policy, which additionally takes the age of a section into account, computed from the average last modification time of all segments in that section.
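
Conceptually, this follows the classic LFS cost-benefit formulation: reclaiming a section frees (1 - u) of its space at a cost of roughly (1 + u) (reading the section and rewriting its valid part), weighted by age. A sketch of the idea (the kernel's get_cb_cost() computes a scaled integer variant of this):

/* u:   fraction of still-valid blocks in the section (utilization)
 * age: normalized time since the section's average last modification */
static double cost_benefit_score(double u, double age)
{
    return age * (1.0 - u) / (1.0 + u); /* higher score => better victim */
}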

Age-Threshold GC

Lastly, F2FS employs age-threshold (AT) based GC, which is not enabled by default and must be enabled manually at format time. With AT GC, F2FS sets a minimum lifetime for data (defaulting to 7 days); the section that yields the largest amount of free space and has an age above the threshold is selected as the victim for GC. This solves the issue of constantly garbage collecting the same section when that section contains particularly hot data.

Slack Space Recycling

Furthermore, F2FS has the option to enable Slack Space Recycling (SSR), which uses the invalid blocks in existing segments as write destinations when no free space is available and no GC should run. This avoids foreground GC, at the cost of random writes to the invalid blocks in those segments.

Data Grouping

In order to avoid increased GC cost, F2FS classifies data based on its update frequency. If data that is frequently updated (hot data) is grouped into the same segment as data that is rarely updated (cold data), then during garbage collection the cold data needs to be moved along with it. Over time, as the hot data keeps being updated, the cold data constantly has to be moved during GC, introducing a significant amount of extra writing just to relocate cold data. Therefore, F2FS groups data into hot/warm/cold classes, with separate hot/warm/cold classes for metadata (node blocks). Metadata is grouped differently from data (e.g., direct node blocks of directories are hot, other direct node blocks are warm), since metadata is commonly updated through simple operations such as changing the last access time of a file.

The full classification is as follows:

  • Hot node: contains direct node blocks of directories.
  • Warm node: contains direct node blocks except hot node blocks.
  • Cold node: contains indirect node blocks.
  • Hot data: contains dentry blocks.
  • Warm data: contains data blocks except hot and cold data blocks.
  • Cold data: contains multimedia data or migrated data blocks.
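
In the kernel, these six classes map to six active logs (the multi-headed logging mentioned later), abridged here from fs/f2fs/segment.h:

enum {
    CURSEG_HOT_DATA = 0, /* directory entry blocks */
    CURSEG_WARM_DATA,    /* regular data blocks */
    CURSEG_COLD_DATA,    /* multimedia or GC-migrated data blocks */
    CURSEG_HOT_NODE,     /* direct node blocks of directories */
    CURSEG_WARM_NODE,    /* direct node blocks of regular files */
    CURSEG_COLD_NODE,    /* indirect node blocks */
};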

Initial Classification

Initial classification depends on the type, as shown above: directory node blocks are classified directly as hot nodes, whereas file data is initially assigned to the warm data class, unless its file extension falls into one of the types on the extension list.

Extension List

F2FS offers the possibility to provide an extension list (e.g., mp3,mp4,gif) during formatting of the file system, based on which F2FS assigns temperatures to files. Logically, multimedia files such as mp4 videos are likely written once and not modified again, so they can directly be classified as cold data. F2FS does this through the extension list (which is written to the superblock); the temperature is assigned at file creation time, via the set_file_temperature() function.
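
For example, the cold extension list can be supplied at format time via mkfs.f2fs's -e option (recent f2fs-tools versions also accept -E for the hot list; check your version):

sudo mkfs.f2fs -e mp3,mp4,gif /dev/${SSD}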

There are two default extension lists used by mkfs.f2fs: one for hot data and one for cold data. The hot list contains the following by default:

const char *hot_ext_lists[] = {
    "db",

#ifndef WITH_ANDROID
    /* Virtual machines */
    "vmdk", // VMware or VirtualBox
    "vdi", // VirtualBox
    "qcow2", // QEMU
#endif

    NULL
};

And the cold extension list contains several more entries (mainly multimedia file extensions):

const char *media_ext_lists[] = {
    "mp", // Covers mp3, mp4, mpeg, mpg
    "wm", // Covers wma, wmb, wmv
    "og", // Covers oga, ogg, ogm, ogv
    "jp", // Covers jpg, jpeg, jp2

    /* video */
    "avi",
    "m4v",
    "m4p",
    "mkv",
    "mov",
    "webm",

    /* audio */
    "wav",
    "m4a",
    "3gp",
    "opus",
    "flac",

    /* image */
    "gif",
    "png",
    "svg",
    "webp",

    /* archives */
    "jar",
    "deb",
    "iso",
    "gz",
    "xz",
    "zst",

    /* others */
    "pdf",
    "pyc", // Python bytecode
    "ttc",
    "ttf",
    "exe",

    /* android */
    "apk",
    "cnt", // Image alias
    "exo", // YouTube
    "odex", // Android RunTime
    "vdex", // Android RunTi 
    "so",

    NULL
};

Reclassification

As data in segments is updated, dirty blocks are counted, and if the number of dirty blocks exceeds a threshold (DEF_MIN_HOT_BLOCKS, 16), the FI_HOT_DATA flag is set on the inode. During GC, any data that is migrated is marked FI_COLD_DATA (in its host inode): it was not modified during its time in the segment, and hence can again be grouped with other cold data.
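
A schematic sketch of these two reclassification rules (the flag and threshold names are the kernel's; the surrounding structure is illustrative):

#include <stdint.h>

#define DEF_MIN_HOT_BLOCKS 16   /* threshold, as in fs/f2fs/f2fs.h */
#define FI_HOT_DATA  (1u << 0)  /* bit values here are illustrative */
#define FI_COLD_DATA (1u << 1)

struct inode_stub { uint32_t dirty_blocks; uint32_t flags; };

/* On update: a file accumulating many dirty blocks is treated as hot. */
static void on_file_update(struct inode_stub *inode)
{
    if (inode->dirty_blocks >= DEF_MIN_HOT_BLOCKS)
        inode->flags |= FI_HOT_DATA;
}

/* On GC migration: data that survived a whole segment lifetime untouched
 * is treated as cold. */
static void on_gc_migrate(struct inode_stub *inode)
{
    inode->flags |= FI_COLD_DATA;
}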

Lifetime Hints

Furthermore, hints can be passed from user space to the file system. Linux defines several write-lifetime hints for files (see here). F2FS takes such a hint via fcntl() and utilizes it to classify the data as hot/cold.

These hints are set on the VFS inode, from which F2FS retrieves the flag and assigns the data to the respective segment class. The possible hints are:

/*
 * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
 * used to clear any hints previously set.
 */
#define RWH_WRITE_LIFE_NOT_SET    0
#define RWH_WRITE_LIFE_NONE       1
#define RWH_WRITE_LIFE_SHORT      2
#define RWH_WRITE_LIFE_MEDIUM     3
#define RWH_WRITE_LIFE_LONG       4
#define RWH_WRITE_LIFE_EXTREME    5

F2FS maps these hints to segment hotness as follows:

int f2fs_rw_hint_to_seg_type(enum rw_hint hint)
{
    switch (hint) {
        case WRITE_LIFE_SHORT:
            return CURSEG_HOT_DATA;
        case WRITE_LIFE_EXTREME:
            return CURSEG_COLD_DATA;
        default:
            return CURSEG_WARM_DATA;
    }
}
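
A minimal user-space sketch of setting such a hint (the fallback defines cover systems whose headers do not yet expose F_SET_RW_HINT; error handling omitted):

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (1024 + 12)   /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME 5
#endif

int main(void)
{
    int fd = open("/mnt/f2fs/archive.bin", O_WRONLY | O_CREAT, 0644);

    /* Hint that this file's data is long-lived; per the mapping above,
     * F2FS steers its blocks into the cold data log. */
    uint64_t hint = RWH_WRITE_LIFE_EXTREME;
    fcntl(fd, F_SET_RW_HINT, &hint);

    /* ... write data ... */
    close(fd);
    return 0;
}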

Using F2FS with ZNS

With the standardization of NVMe ZNS SSDs, F2FS was, and still is, one of the only file systems to support these new devices (Btrfs support in the kernel is on the way!). Integrating ZNS into F2FS did not require many changes, since F2FS largely adheres to append-only writes, as necessitated by ZNS. The first change is that segments can now be only partially usable, since the capacity of a zone can be smaller than the zone size and may not be a multiple of 2MiB. Segments on the zone capacity boundary are simply marked as partially usable up to the zone capacity, and segments between the zone capacity and the zone size are marked as not usable. Secondly, F2FS still requires a conventional, randomly writable block device for the metadata (superblock, CP, NAT, SIT, and SSA), since it cannot do random updates on ZNS. For more detail on the ZNS integration into F2FS, see [3]. The main area, where the segments for data (node and file data) start, begins on the ZNS device; it does not span the boundary between the conventional device and the ZNS device, even though F2FS treats the devices as a single contiguous address space.

Since we have information on the erase unit of ZNS devices (a zone), F2FS sets the number of segments in a section such that a section is equal in size to a zone. For instance, a device with a 2GiB zone size keeps the default 2MiB segment size and gets a section size of 1024 segments (so 2GiB). The F2FS zone is then equal to a single section, hence aligning it to the erase unit of the device. The different hotness classes from the multi-headed logging are assigned to different F2FS zones, hence also aligning them to different erase units on the ZNS device. Recall that GC happens at the unit of a segment, moving the valid blocks of a single segment per GC call. On ZNS, once a section contains no more valid and usable segments, its zone can be reset.
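
The section sizing follows directly from the zone geometry, as in this minimal sketch (illustrative names):

/* Derive how many 2 MiB segments make up one section on a ZNS device. */
#define SEGMENT_SIZE (2ull * 1024 * 1024)

static unsigned long long segs_per_section(unsigned long long zone_size)
{
    return zone_size / SEGMENT_SIZE; /* e.g., a 2 GiB zone -> 1024 segments */
}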

Direct I/O

One may notice that, when benchmarking F2FS on ZNS with fio using direct I/O, performance at low I/O depths is significantly higher with F2FS than the raw device can achieve at those depths. Below we show a fio benchmark where we measure the throughput in KIOPS of writing a file on F2FS with an increasing number of outstanding I/Os (iodepth; see the fio notes for more info), and similarly repeat this writing to the raw ZNS device. Everything is run with direct I/O.

[Figure: fio write throughput (KIOPS) versus iodepth for F2FS on ZNS and for the raw ZNS device, both with direct I/O]

As mentioned, F2FS achieves higher throughput at lower depths than the raw device can. This is due to F2FS falling back to buffered I/O when used on ZNS, in order to serialize the write I/Os through the log-structured writes. This is visible in the source code in the f2fs_force_buffered_io function (f2fs.h:4457 in 5.* kernels; moved to file.c:811 in kernel 6.0):

        /*
         * for blkzoned device, fallback direct IO to buffered IO, so
         * all IOs can be serialized by log-structured write.
         */
        if (f2fs_sb_has_blkzoned(sbi))
                return true;

Formatting and Mounting F2FS with ZNS

This requires the mkfs.f2fs tool from f2fs-tools to be installed (retrieve it from here and follow the build instructions). If we have a regular block device (e.g., a conventional SSD) and a ZNS device, we can create a file system with the following command (replacing the device parameters with the respective names):

sudo mkfs.f2fs -f -m -c /dev/${ZNS} /dev/${SSD}

This will format the devices (formatting the SSD and issuing a REQ_OP_ZONE_RESET_ALL on the ZNS device to reset all zones) and create the file system, writing the superblock to the SSD. For convenience, we have a script here that cleans up existing file systems on the ZNS device and automates creation of the file system.

Then we must ensure that the correct I/O scheduler is set for the ZNS device, so that write requests are not reordered (which would result in I/O errors from the device). Again, replace with the respective device name:

echo mq-deadline | sudo tee /sys/block/${ZNS}/queue/scheduler

Lastly, we can mount the file system, specifying only the conventional block device, as it contains the superblock with all necessary info.

sudo mount -t f2fs /dev/${SSD} /mnt/f2fs

Extracting F2FS Statistics

An important feature of F2FS is being able to see how it performs (particularly important to us doing research with it!). To our benefit, we can easily extract numerous metrics from F2FS through its debug output. The only requirement is that the kernel is built with F2FS debugging support (CONFIG_F2FS_STAT_FS; see my kernel config file here to match the options). Then we can extract metrics from the F2FS status with the following commands:

# Need to be root for this
sudo su

cat /sys/kernel/debug/f2fs/status

This will simply print all the statistics F2FS currently maintains (including several metrics on GC, counts of valid/invalid blocks, allocation information, etc.). For more precise output we can redirect the output and grep for a metric we are interested in (e.g., GC):

cat /sys/kernel/debug/f2fs/status | grep -A5 "GC"

Resources

[1] Lee, Changman, et al. "F2FS: A New File System for Flash Storage." 13th USENIX Conference on File and Storage Technologies (FAST 15). 2015.

[2] Linux Kernel. "WHAT IS Flash-Friendly File System (F2FS)?." https://www.kernel.org/doc/Documentation/filesystems/f2fs.txt

[3] Nick Tehrany and Animesh Trivedi. "Understanding NVMe Zoned Namespace (ZNS) Flash SSD Storage Devices." arXiv preprint arXiv:2206.01547 (2022).