ZFS block pointer rewrite status, 2013


January 13, 2014

Through one part idle internet window shopping and one part good luck, I came across a supply (now sold out) of the disks I use in my main RAID at home at nearly 50% off the next-lowest price I’d ever seen for them: open-box (but not refurbished) disks from a reputable online retailer, with only a few months already consumed on Seagate’s 5-year warranty. This seemed by far the best deal I’d seen on these disks in 5 years of looking, so I immediately bought some.

So now I am at one of those rarely encountered moments where I have enough unused hard disk capacity to create a whole second copy of my current stash of bits, which gives me the flexibility to switch to another disk/filesystem/volume-management arrangement if I want.

I always assume that at some point the amount of data I’ve accumulated will grow too large to allow another opportunity to start from scratch like this. So even though I tried to choose wisely when selecting a storage arrangement in the past, and as a result could feasibly grow my current MD-based RAID out to 22 TB or so, I still want to make sure there isn’t something better out there I could switch to while I have the chance. Besides, the last time I investigated storage options was three years ago (which put me on my current 2TB-per-member MD RAID with ext4), so I took another look.

There are lots of things I like in ZFS: the added data integrity safeguards, snapshots to protect me from a fat-fingered rm -rf *, and, with its support for remote mirroring, it would definitely be worth another look at nearly-realtime backups to my hosted server (something I previously rejected due to the lack of snapshots). A less unique but still important feature for long-term scalability is logical volume management, since I think a single RAID with more than 12 drives would be a stretch even for the relatively light load my home array experiences.
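To make the snapshot and remote-mirroring workflow concrete, here is a minimal sketch; the pool/dataset name "tank/data", the snapshot names, and the host "backuphost" are placeholders I’ve made up, not anything from my actual setup.

```shell
# Take a point-in-time snapshot (protection against a stray rm -rf *):
zfs snapshot tank/data@2014-01-13

# Deleted files remain browsable under the hidden snapshot directory:
ls /tank/data/.zfs/snapshot/2014-01-13/

# Incrementally replicate changes since the previous snapshot to a
# remote machine, for nearly-realtime backup:
zfs send -i tank/data@2014-01-12 tank/data@2014-01-13 | \
    ssh backuphost zfs receive -F backup/data
```

The incremental send only ships blocks changed between the two snapshots, which is what makes frequent replication to a hosted server plausible over a home uplink.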

But, as many a home user before me can tell you, ZFS has one huge feature hole for nerd@home-scale operations: it is not possible to drop another disk or two into an existing RAID set. Enterprise users will drop another rack of RAID sets into an existing zpool, which is possible with ZFS and makes sense at that scale. But do that after buying a handful of disks, and you’ll end up with a bunch of little RAID sets with wasted parity disks everywhere. RAID reconfiguration is something that’s been possible with MD RAID since I fired up my first array in 2003, and it became possible as an online operation some years afterwards. It’s a feature I’ve used several times, and one I’m not comfortable giving up.
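For reference, the MD reshape I’m talking about looks roughly like this; the device names are hypothetical, and the example assumes an existing 4-disk RAID-5 at /dev/md0 with ext4 on it.

```shell
# Add a fifth disk to the array as a spare, then reshape the RAID-5
# to use it as an active member. The reshape proceeds online, with
# the array still mounted and in use.
mdadm --add /dev/md0 /dev/sde1
mdadm --grow /dev/md0 --raid-devices=5

# Once the reshape finishes, grow the filesystem into the new space:
resize2fs /dev/md0
```

This is exactly the "buy one or two disks, grow the array" pattern that ZFS’s RAID-Z does not support.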

So, I dug into the state of ZFS development on this, and was pleased to find some fairly current and comprehensible information from Matt Ahrens, a co-founder of ZFS way back when it was at Sun (awesomely, he still seems to be involved in ZFS through employment elsewhere). The short summary: achieving this via “block pointer rewrite” will almost surely never happen, because any code that correctly updates all the necessary data structures would, in Matt’s opinion, be so confusing/ugly/unmaintainable that it would discourage further ZFS development of other kinds. The YouTube video from October 2013 where I found this is a great watch, if you’re interested in more detail. He also says, however, that there may be less ambitious approaches than block pointer rewrite to the similar task of removing a device from a pool, so perhaps the more targeted task of adding a disk to a RAID-Z could also be tackled another way. Something I might try to learn more about?

First, though, I need to research ZFS basics a bit more to find out why I shouldn’t just build a zpool of one or two vdevs (configured from ZFS’s point of view as plain disks) that actually happen to be MD RAID devices. Would it work to just let MD RAID do what it does best, and let ZFS do what it does best on top of it?
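The layering I’m wondering about would look something like this (device names hypothetical): MD provides the redundancy, and ZFS sees each array as a single plain top-level vdev.

```shell
# Two 4-disk MD RAID-5 arrays, each presented to ZFS as one device:
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[abcd]1
mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[efgh]1

# ZFS stripes across the two vdevs but manages no redundancy itself;
# parity and rebuilds are entirely MD's job underneath.
zpool create tank /dev/md0 /dev/md1
```

Each MD array could then still be grown a disk at a time with mdadm, which is the appeal of the scheme.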

(Update 1/2014: the above md/zfs scheme would be ill-advised because ZFS implements a nontraditional “RAID” that is coupled to the filesystem level. That coupling provides advantages such as self-healing data in the event of a parity/data disagreement: since the “RAID” subsystem and the filesystem can compare notes, ZFS can check whether the data matches its filesystem-level checksum (meaning the parity must be bad) or fails to match it (meaning the data should be reconstructed from parity). Layered on top of MD, ZFS could still detect corruption via its checksums, but it would have no redundancy of its own to repair it with.)
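That self-healing path is exercised explicitly by a scrub; a minimal sketch, assuming ZFS-managed redundancy (e.g. a raidz vdev) and a placeholder pool name "tank":

```shell
# Walk every allocated block, verify it against its filesystem-level
# checksum, and repair any block whose data disagrees using the
# redundant copy or parity:
zpool scrub tank

# Inspect progress and any checksum errors found per device:
zpool status tank
```

With plain MD devices underneath instead, a scrub could still flag checksum mismatches, but repairing them would be up to whatever MD could do without ZFS’s knowledge of which copy is correct.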