Special Vdev and sync writes to ZIL #16504
Replies: 3 comments
-
ZIL blocks are not controlled by the dataset's recordsize property, since the ZIL operates in terms of user write requests, not dataset blocks. The ZIL sizes its blocks as it sees fit, with 4KB granularity up to 128KB (IIRC tunable).

The ZIL also has its own allocation logic, different from normal writes. In the absence of an SLOG it will try to use the embedded log metaslabs from all vdevs, and I don't see it preferring special vdevs there, which does look odd, unless I am missing something. If that still fails, it will use normal vdevs. If an indirect write is configured and/or decided, the above applies to the ZIL block itself, while the actual data is written to a special or normal vdev according to the small blocks setting, but that is a bit of a special case. And all the vdevs where related writes happened will need to flush their caches before the write can return.

IMO the term "power loss protection" is confusing. Any disk used with ZFS must honor cache flush requests and be protected in that sense, or data corruption may occur on power loss. It is just that for disks not participating in ZIL writes those cache flushes are relatively rare, a few times per TXG commit, so their performance is not important, while for disks that do take ZIL writes the flushes are much more frequent and so must be very fast. That can be achieved by actually having a reliably protected write cache and ignoring cache flushes, if the vendor believes that is safe.

To summarize, a special vdev is not really an SLOG, though it seems this aspect could benefit from some love. If you need massive sync writes, I'd say you had better have an SLOG, or your main vdevs should be fast enough on sync writes.
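As a rough illustration of the sizing and fallback order described in this comment, here is a toy Python model. It is only a sketch of the description above, not OpenZFS code: the function names, the class labels, and the `embedded_log_has_space` flag are all hypothetical, and the real ZIL sizing logic is more involved than simple rounding.

```python
ZIL_GRANULARITY = 4 * 1024   # 4KB granularity, as stated in the comment above
ZIL_MAX_BLOCK = 128 * 1024   # 128KB cap; the comment says this is tunable (IIRC)


def zil_block_size(request_bytes):
    """Round a write request up to the next 4KB multiple, capped at 128KB.

    Simplifying assumption: one request maps to one ZIL block.
    """
    if request_bytes <= 0:
        raise ValueError("request_bytes must be positive")
    # Ceiling division, then scale back up to a multiple of the granularity.
    rounded = -(-request_bytes // ZIL_GRANULARITY) * ZIL_GRANULARITY
    return min(rounded, ZIL_MAX_BLOCK)


def pick_zil_class(has_slog, embedded_log_has_space):
    """Return where a ZIL block would land in this toy model.

    Note that 'special' never appears here: per the comment above, the ZIL
    does not seem to prefer special vdevs.
    """
    if has_slog:
        return "slog"
    if embedded_log_has_space:
        return "embedded_log"   # embedded log metaslabs, taken from all vdevs
    return "normal"             # last resort: normal vdevs


if __name__ == "__main__":
    for size in (512, 4096, 5000, 1 << 20):
        print(size, "->", zil_block_size(size))
    print(pick_zil_class(has_slog=False, embedded_log_has_space=True))
```

The point of the sketch is that recordsize and the small blocks threshold are not inputs to either function.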
-
With an Slog the situation is clear: all sync logging goes to the Slog instead of the on-pool ZIL area, and other writes go to normal or special vdevs depending on ZFS block size. With a special vdev you can control, via the small block size threshold, which writes go to the special vdev, while larger ones go to normal vdevs.

So the question is: with a small block size of, say, 128K and a recsize of 128K, there are no regular writes larger than 128K, so everything goes to the special vdev. Does this include writes to the ZIL? I think so, but I am not 100% sure, as this would mean that with such a setting (good for VM storage, where you want such a recsize or smaller and need sync) there is no need for an Slog because the ZIL is on the special vdev. I have never seen a clarification on this from the ZFS devs.

Powerloss protection

With flash (apart from Optane) the situation is different. For good reasons you can buy "cheap" desktop SSDs and enterprise ones where the only difference is sometimes powerloss protection, e.g. Samsung Pro vs SM/PM. This is because you cannot fully control writes to a specific cell from outside: the real on-disk structure is blocks built from pages, where a partly written block needs a read/erase/write cycle before an update. Additionally, you often have a DRAM cache for performance and a lot of background activity (garbage collection, trim), with many risks of data corruption from a power loss at the wrong moment. With a single Slog, I would see PLP as essential. With a mirror there is a good chance that a problem does not affect both sides of the mirror, so ZFS may be able to repair it.

You are right. A future pool built from high-performance NVMe does not require an Slog or a special vdev. But for a pool of mechanical disks, where sync write throughput is in the area of 50MB/s or less, an Slog is needed and a special vdev is wanted for small IO. It would be helpful if the special vdev could be used for ZIL functionality with proper settings.
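To make the threshold rule concrete, here is a minimal Python sketch of how a regular data block would be routed. It assumes the documented behaviour that blocks smaller than or equal to the small blocks threshold go to the special class; the function name and return labels are made up for illustration, and it deliberately says nothing about ZIL blocks, which is the open question here.

```python
KIB = 1024


def data_block_class(block_size, special_small_blocks, has_special_vdev=True):
    """Route a regular data block in this toy model.

    Assumption: a threshold of 0 means small-block routing is disabled,
    and blocks <= threshold go to the special vdev.
    """
    if (has_special_vdev
            and special_small_blocks > 0
            and block_size <= special_small_blocks):
        return "special"
    return "normal"


if __name__ == "__main__":
    # recsize = 128K and threshold = 128K: no data block can exceed the
    # threshold, so every data block lands on the special vdev.
    for size in (4 * KIB, 64 * KIB, 128 * KIB):
        print(size, "->", data_block_class(size, 128 * KIB))
    # With a 32K threshold, a full 128K record goes to the normal vdevs.
    print(128 * KIB, "->", data_block_class(128 * KIB, 32 * KIB))
```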
-
I'm not sure, but I suspect writes still go to the ZIL on the disk-based pool, simply based on the performance I saw: I have an HDD mirror accelerated with some Optane NVMe drives as special devices. On this pool I made a dataset intended to store a database, and set recordsize = special_small_blocks so that all writes would go to the Optanes. Performance with sync writes was terrible, at roughly "random sync IO to HDDs" levels, which forced me to set sync=disabled for that dataset. It would be nice to be able to specify that the ZIL should live only on the special devices!
-
Just want a confirmation:
If I add a special vdev to a disk-based pool and set recsize <= small block size on a dataset,
I force all writes of this dataset to the special vdev.
When I enable sync on this dataset, should this include all sync writes to the ZIL, making an additional Slog obsolete?
I assume that in this case the special vdev should provide powerloss protection to protect ZIL writes.