ZFS it is

Published April 29, 2014

Quite a few months back, in a post about the lack of ZFS block pointer rewrite, I mentioned that I’d begun investigating whether I should migrate my home file server to a more modern filesystem. At that time I already knew a few things about ZFS, but said that I wasn’t prepared to actually use it at home because it lacked the ability to add disks to an existing raid set. Well, the move to a new filesystem is now underway, and despite the lack of raid-set reconfiguration, ZFS it is anyway.

So how did that happen?

When it came down to it, I basically had three options: ZFS, btrfs, or punt and do nothing for now, in hopes that ZFS would eventually support adding drives or that btrfs would eventually become more stable. I set up a test server with five old, shelved 320 GB disks and installed Ubuntu 14.04 with ZFS on Linux and the btrfs tools. Then I set about evaluating both of them live and in person.
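For reference, the setup on 14.04 looked roughly like this - package names are from memory, so treat it as a sketch rather than a recipe:

    # ZFS on Linux came from the zfs-native PPA at the time; btrfs tools are in the Ubuntu archive
    sudo add-apt-repository ppa:zfs-native/stable
    sudo apt-get update
    sudo apt-get install ubuntu-zfs btrfs-tools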

Btrfs Evaluation

I was really hoping my evaluation of btrfs would be positive - sure, it’s not “officially stable” yet, but that will come with time, and feature-wise it pretty much does everything. So I created a btrfs filesystem in its raid5-like layout on my five 320 GB drives, and queried both df and the btrfs tools for the space available. 1.5 TiB, they said. OK, so that’s a little odd, since there can’t be more than 4x320 GB actually usable, but moving on…
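The setup and space queries were along these lines (device names and the mountpoint are placeholders, not my actual layout):

    # create a btrfs filesystem with raid5 for both data and metadata across five disks
    mkfs.btrfs -d raid5 -m raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
    mount /dev/sdb /mnt/test
    # compare the kernel's view of free space with btrfs's own accounting
    df -h /mnt/test
    btrfs filesystem df /mnt/test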

I copied 20 GB or so over to it fine, then pulled one of the five disks from the online system and checksummed the 20 GB against the original. All OK. Then I went hunting for some kind of indication from the btrfs tools that my filesystem was in a degraded state and action was needed. There were plenty of indications of trouble in syslog, but they came from an unhappy disk controller driver that had lost contact with a disk, not from btrfs itself. The closest I could come in btrfs was ‘btrfs device stats,’ which showed a handful of read errors and (strangely, since I was only reading the data to checksum it) even more write errors on the removed device. Given that btrfs is designed to detect and correct such errors on disks, if I saw a small number of read/write errors on a device in a real-world scenario where I believed all disks were present, it wouldn’t be clear to me what action, if any, I should take. So: strike one, screwy space accounting; strike two, no clear reporting of a degraded state.
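For the curious, the places I looked were roughly these (not necessarily an exhaustive list of where btrfs could have told me):

    # per-device I/O and checksum error counters
    btrfs device stats /mnt/test
    # list the devices btrfs believes belong to the filesystem
    btrfs filesystem show /mnt/test
    # kernel-level complaints, which is where the controller driver was grumbling
    dmesg | grep -i btrfs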

Next, I went ahead and filled the filesystem up to see what those df numbers would do as it approached the sensible, 4x320 GB definition of full. What I got was that, as I wrote more data, the Available value in df decreased faster than the Used value increased, so that by the time I’d written ~4x320 GB, Used showed 1.2T but Avail showed just a few GiB. I decided I could live with that; I know how much actual space I have, so I just have to remember not to rely on df or the btrfs tools for accurate available-space figures on a partially full filesystem.

Then I pulled a different disk than the one I had originally removed, and wrote another GB or two. I forget whether the first kernel panics happened during the writes or when I started trying to read the data back, but either way I ended up with an unstable system that hung on reads of some of the btrfs files and couldn’t even shut down without hanging. In a last-ditch chance for btrfs to redeem itself, I hard-reset the system and let it run its offline repair operation before mounting the filesystem again. This made my last-written file disappear completely, but didn’t stop the system from panicking and becoming unstable again. Strike three.
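I don’t recall the exact invocation, but the offline repair would have been something like this (device name is a placeholder, and the repair has to run against an unmounted filesystem):

    # btrfs's offline consistency check and repair, run against one member device
    umount /mnt/test
    btrfs check --repair /dev/sdb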

This left either ZFS or punt.

ZFS Evaluation

ZFS’s administration tools are a treat compared to btrfs’s. I easily created a raidz zpool out of my five disks and then a couple of ZFS filesystems with different options enabled. Space reporting through any tool you like (df, zfs list, zfs get all, zpool list, zpool get all) made sense. I also saw real potential in the per-filesystem tuning. I can say “dedupe this filesystem, which I’d like mounted where my photos directory used to be” (there isn’t so much data there that the dedup table will swamp my RAM, and I have a bad habit of being messy and copying the same photo off my cameras more than once). I can say “run transparent gzip compression on this filesystem, which I’d like mounted where my log files go.” I can say “use lz4 compression and the ARC for my coding/music/wife’s stuff, but don’t compress and never cache data coming from the videos filesystem,” since video data will almost never be accessed more than once in close succession and my tests showed it hardly compresses at all. And since all the filesystems share the space of the single zpool they live on, there’s no administrative overhead in resizing filesystems or volumes to make any of this happen (unless you want there to be; there’s a per-filesystem quota you can set).
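A sketch of what that looks like in practice - the pool name, dataset names, mountpoints, device names, and the quota figure here are all made up for illustration:

    # one raidz pool across the five disks
    zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
    # per-filesystem tuning: dedup for photos, gzip for logs, lz4 for everything else
    zfs create -o dedup=on -o mountpoint=/home/photos tank/photos
    zfs create -o compression=gzip -o mountpoint=/var/log tank/logs
    zfs create -o compression=lz4 tank/stuff
    # don't compress video, and keep its data out of the ARC (cache metadata only)
    zfs create -o compression=off -o primarycache=metadata tank/videos
    # optional: cap one filesystem's share of the pool
    zfs set quota=200G tank/photos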

I put ZFS through the same disk-pulling paces as btrfs. The difference was that it handled them without issue: not only did it give me detailed information about the pool’s degraded status (through zpool status) after I yanked a drive, it even suggested, right in the output, the actions to take.
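The checking and recovery side is equally straightforward; something like the following, with the same hypothetical pool and device names as above:

    # report pool health; -x prints only pools with problems
    zpool status -x
    zpool status tank
    # bring the same disk back online once it's reattached...
    zpool online tank /dev/sdc
    # ...or swap in a different disk and let it resilver
    zpool replace tank /dev/sdc /dev/sdg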

Basically I concluded ZFS kicks butt.

I still have one major conceptual beef with ZFS, but it doesn’t apply to installations of my size, so whatever.

“Wait & See” Evaluation

The third choice was to keep adding data to my md-raid/ext4 configuration for, realistically, at least a few years, until ZFS added the one missing feature or btrfs’s multiple-device support matured. The rub with that is that there is surely never going to be a way to transform an md-raid array in place into a btrfs or ZFS raid, so a migration further down the road would still involve buying my entire storage utilization over again in blank disks and copying everything across. But that’s the same thing I’d end up doing if I moved to ZFS now and added a second raid set to the zpool years down the road. So, I concluded, why not start getting the feature and data-integrity benefits of ZFS today?