Crazy scheme of the week: melding heterogeneous filesystems

Published January 27, 2014

At work, we have an interesting problem. The primary storage system at MSI is from Panasas, a vendor that specializes in delivering high aggregate I/O performance from a single filesystem to thousands of Linux compute nodes. Besides a stunningly high CPU-to-disk ratio (read: expense), a big part of Panasas’ performance strategy is removing the bottleneck that can easily form at the filer heads of a traditional NAS when thousands of compute nodes are all asking for I/O ops. They do this by outsourcing much of the NAS head’s role to the compute nodes themselves, leaving only a thin metadata server that puts compute nodes into direct contact with the storage blades containing the files being accessed.

The problem is that a large amount of the data housed on our Panasas system doesn’t really need the privilege of residing on such a high-speed system, either because it is owned by non-HPC users who only access it from one node at a time, or (the bigger problem) because nobody has looked at it in two years, and nobody is likely to anytime soon. And unfortunately, for all the things Panasas does, HSM (hierarchical storage management) is not currently one of them.

We could easily whip up a separate filesystem on a denser SAN or something, but offloading all the data at uber-rest onto it would not be transparent to the users who currently have data resident on the Panasas. That doesn’t mean it can’t happen, but it’s not ideal.

An alternative I’ve been kicking around is to create a pseudo-filesystem that only actually stores metadata, and delegates I/O to one of several other filesystems. The idea is basically that the Linux kernel provides a well-defined API for file-level operations, and every mountable filesystem in Linux conforms to it. It is a common denominator whether your filesystem is a FAT16 floppy disk, an ext4 local HDD, or even a DirectFlow Panasas filesystem. So it ought to be theoretically possible to write another piece of code that knows how to consume the same API, but isn’t the kernel itself…and in fact also implements this API.
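
To put a name on it, the API in question is essentially the kernel’s VFS interface: every filesystem hands the kernel tables of function pointers, most visibly `struct file_operations`, and the kernel calls through them for every open, read, and write. Abridged (and from memory of the 3.x-era headers), the relevant hooks look like this:

```c
/* Abridged from <linux/fs.h> (3.x-era kernels); every mountable
 * filesystem -- FAT, ext4, or a Panasas DirectFlow mount -- fills in
 * some subset of these hooks.  A stacking pseudo-fs would implement
 * them itself *and* call down into another filesystem's versions. */
struct file_operations {
        struct module *owner;
        loff_t  (*llseek)  (struct file *, loff_t, int);
        ssize_t (*read)    (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write)   (struct file *, const char __user *, size_t, loff_t *);
        int     (*open)    (struct inode *, struct file *);
        int     (*release) (struct inode *, struct file *);
        /* ...mmap, fsync, unlocked_ioctl, and plenty more... */
};
```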

This pseudo-filesystem would be mounted by the kernel, and it in turn would be configured to mount any other filesystems of the administrator’s choice. Then, when asked to open a file, the pseudo-fs would query a metadata server of its own to determine which underlying filesystem the file resides on, and simply pass all further interactions on that file handle through to the API methods of the appropriate underlying filesystem.
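
In kernel terms, the interesting work happens at open time. Here’s a rough, incomplete sketch of what I mean, where `pseudofs_resolve()` is a made-up stand-in for the metadata-server lookup (the rest is ordinary stacking-filesystem plumbing against the 3.x-era kernel API):

```c
#include <linux/err.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/module.h>

/* Hypothetical: ask our metadata service which underlying filesystem
 * holds this file, and fill in the path to it there (e.g. a path under
 * the Panasas mount vs. one under the slower archive mount). */
extern int pseudofs_resolve(struct file *file, char *lower_path, size_t len);

static int pseudofs_open(struct inode *inode, struct file *file)
{
        char lower_path[256];
        struct file *lower;
        int err;

        err = pseudofs_resolve(file, lower_path, sizeof(lower_path));
        if (err)
                return err;

        /* Open the real file on whichever filesystem it lives on... */
        lower = filp_open(lower_path, file->f_flags, 0);
        if (IS_ERR(lower))
                return PTR_ERR(lower);

        /* ...and remember it; every later op is passed straight through. */
        file->private_data = lower;
        return 0;
}

static ssize_t pseudofs_read(struct file *file, char __user *buf,
                             size_t count, loff_t *ppos)
{
        /* vfs_read() dispatches to the lower filesystem's own ->read,
         * so each backend keeps its native I/O path (DirectFlow included). */
        return vfs_read(file->private_data, buf, count, ppos);
}

static int pseudofs_release(struct inode *inode, struct file *file)
{
        fput(file->private_data);
        return 0;
}

static const struct file_operations pseudofs_fops = {
        .owner   = THIS_MODULE,
        .open    = pseudofs_open,
        .read    = pseudofs_read,
        .release = pseudofs_release,
        /* .write, .llseek, .mmap, etc. delegated the same way */
};
```

This is nowhere near a complete filesystem (no dentry/inode handling, no write or mmap path, no error cases worth speaking of), but it captures the shape of the pass-through.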

With a scheme like this, one additional step is added when opening a file, but (importantly) the pseudo-fs would not be expected to introduce a measurable performance impact on subsequent I/O to opened files, since the native performance characteristics of the underlying filesystems would be preserved. In my case, data transfer from storage to compute node would still happen over Panasas’ proprietary DirectFlow protocol if the data was on the high-speed Panasas system.

Clearly, completing this would be a very ambitious undertaking, but so far I haven’t discovered any fundamental reasons why this system wouldn’t work, and if such a thing existed, it might prove to be a unique open-source solution to HSM across any combination of heterogeneous storage systems.

Fortunately, it feels like a project with a pretty manageable, progressive roadmap. A feasible and likely very instructive first step would be to implement a kernel module providing a mountable filesystem that simply passes all calls through to some single underlying filesystem.
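
There’s even a way to cheat on that first step before touching kernel code at all: FUSE lets you prototype the same pass-through behavior in user space. It isn’t the kernel module the real scheme would want (FUSE adds its own overhead), but as a proof of the delegation idea, a minimal passthrough to a single lower tree (`/mnt/lower` here is just a placeholder) might look like:

```c
/* Build (libfuse 2.x): gcc -Wall passthru.c `pkg-config fuse --cflags --libs` -o passthru */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* The single underlying filesystem to pass everything through to
 * (placeholder path -- in real life this would be the Panasas mount). */
#define LOWER_ROOT "/mnt/lower"

static void lower_path(char dst[PATH_MAX], const char *path)
{
        snprintf(dst, PATH_MAX, "%s%s", LOWER_ROOT, path);
}

static int pt_getattr(const char *path, struct stat *st)
{
        char lp[PATH_MAX];
        lower_path(lp, path);
        return lstat(lp, st) == -1 ? -errno : 0;
}

static int pt_open(const char *path, struct fuse_file_info *fi)
{
        char lp[PATH_MAX];
        int fd;

        lower_path(lp, path);
        fd = open(lp, fi->flags);
        if (fd == -1)
                return -errno;
        fi->fh = fd;    /* remember the lower fd for later reads */
        return 0;
}

static int pt_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi)
{
        ssize_t n = pread(fi->fh, buf, size, off);
        return n == -1 ? -errno : n;
}

static int pt_release(const char *path, struct fuse_file_info *fi)
{
        close(fi->fh);
        return 0;
}

static struct fuse_operations pt_ops = {
        .getattr = pt_getattr,
        .open    = pt_open,
        .read    = pt_read,
        .release = pt_release,
        /* readdir, write, truncate, etc. all follow the same pattern */
};

int main(int argc, char *argv[])
{
        return fuse_main(argc, argv, &pt_ops, NULL);
}
```

Mount it with `./passthru /mnt/pseudo` and a `cat /mnt/pseudo/some/file` gets served from `/mnt/lower/some/file`. FUSE obviously isn’t where the real thing would live performance-wise, but it’s a cheap way to validate the delegation model before committing to kernel development.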

Now we just have to see if this ever becomes more than my latest crazy scheme of the week.