I got 4 20TB drives from Amazon around Black Friday that I want to get set up for network storage. I’ve got 3 decent Ryzen 5000 series desktops that I was thinking about setting up so that I could build my own mini-Kubernetes cluster, but I don’t know if I have enough motivation. I’m pretty OCD so small projects often turn into big projects.
I don’t have an ECC motherboard though, so I want to get some input on whether BTRFS, ZFS, TrueNAS, or some other solution should be relatively safe without it. I guess it is a risk factor but I haven’t had any issues yet (fingers crossed). I’ve been out of the CNCF space for a while but Rook used to be the way to go for Ceph on Kubernetes. Have there been any new projects worth checking out or should I just do RAID and get it over with? Does Ceph offer the same level of redundancy or performance? The boards have a single M.2 slot so I could add in some SSD caching.
If I go with RAID, should I do RAID 5 or 6? I’m also a bit worried because the drives are all the same so if there is an issue it could hit multiple drives at once, but I plan to try to have an online backup somewhere and if I order more drives I’ll balance it out with a different manufacturer.
Based on the hardware you have I would go with ZFS (using TrueNAS would probably be easiest). Generally with such large disks I would suggest using at least 2 parity disks, but seeing as you only have 4 that means you would lose half your storage to be able to survive 2 disk failures. Reason for the (at least) 2 parity disks is (especially with identical disks) the risk of failure during a rebuild after one failure is pretty high since there is so much data to rebuild and write to your new disk (like, it will probably take more than a day).
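To put rough numbers on that trade-off, here is a minimal Python sketch. The 200 MB/s rebuild speed is purely an assumption for illustration (real drives vary and slow down under load), and the capacity figure ignores filesystem overhead, so treat the output as a ballpark rather than a prediction.

```python
# Assumed sustained rebuild write speed; not a measured figure for any drive.
ASSUMED_REBUILD_MB_PER_S = 200


def usable_tb(disks: int, disk_tb: float, parity_disks: int) -> float:
    """Usable capacity of one parity-based group, ignoring filesystem overhead."""
    if not 0 <= parity_disks < disks:
        raise ValueError("need at least one data disk")
    return (disks - parity_disks) * disk_tb


def rebuild_hours(disk_tb: float, mb_per_s: float = ASSUMED_REBUILD_MB_PER_S) -> float:
    """Best-case hours to write one full replacement disk at a constant speed."""
    total_mb = disk_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / mb_per_s / 3600


if __name__ == "__main__":
    for parity in (1, 2):
        usable = usable_tb(4, 20, parity)
        print(f"{parity} parity disk(s): {usable:.0f} TB usable of 80 TB raw")
    print(f"Best-case rebuild of a full 20 TB disk: {rebuild_hours(20):.1f} hours")
```

With 2 parity disks this comes out to 40 TB usable, which is the "lose half your storage" case, and even the best-case rebuild of a full disk lands above 24 hours at the assumed speed.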
Can’t talk much about backup as I just have very little data that I care enough about to backup, and just throw that into cloud object storage as well as onto my local high-reliability storage.
I have tried many different solutions, so will give you a quick overview of my experiences, thoughts, and things I have heard/seen:
Single Machine
Do this unless you have to scale beyond one machine
ZFS (on TrueNAS)
- It’s great, with a few exceptions.
- Uses data checksums so it can detect bitrot when performing a “scrub”.
- Super simple to manage, especially with the TrueNAS (formerly FreeNAS) GUI.
- Can run some services via Jails and/or plugins
- It only works on a single machine, which became a limiting factor for me.
- It can’t add disks one at a time, you have to add an entire vdev (another set of drives in RAID-Z or whatever you choose).
- You have to upgrade all disks in a vdev to use higher capacity disks.
- Has lots of options for how to use disks in vdevs:
- Stripe (basically RAID-0, no redundancy, only for max performance)
- Mirror (basically RAID-1, same data on every disk in the vdev)
- RAID-Zx (basically RAID-5, RAID-6, or <unnamed raid level better than 6>, uses x # of disks for parity, meaning that many disks can be lost)
- ZFS send seems potentially neat for backups, though I have never used it
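The checksum-and-scrub idea from the list above can be sketched in a few lines of Python. This is only an illustration of the concept, not how ZFS is implemented: SHA-256 is a stand-in checksum, and the helper names (`write_blocks`, `scrub`) are made up for the example.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in checksum for the example (not ZFS's actual on-disk format)."""
    return hashlib.sha256(block).hexdigest()


def write_blocks(blocks):
    """Store each block next to the checksum computed at write time."""
    return [(bytearray(b), checksum(b)) for b in blocks]


def scrub(stored):
    """Re-read every block and return the indexes whose checksum no longer matches."""
    return [
        i for i, (data, digest) in enumerate(stored)
        if checksum(bytes(data)) != digest
    ]


if __name__ == "__main__":
    stored = write_blocks([b"family photos", b"tax records", b"backups"])
    stored[1][0][0] ^= 0x01  # flip one bit to simulate silent corruption
    print("corrupted block indexes:", scrub(stored))
```

Detection is only half of the story: with a redundant vdev (mirror or RAID-Z) there is a good copy to repair from, whereas a single disk can only report the damage.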
MDADM
- It’s RAID, just in your Linux kernel.
- Has been in the kernel for years, is quite reliable. (I’ve been using it for literally years on a few different boxes as ZFS on Linux was less mature at the time)
- You can make LVM use it mostly transparently.
- I would probably run ZFS for new installs instead.
BTRFS
- Can’t speak to this one with personal experience.
- Have heard it works best on SSDs, not sure if that is the case any more.
- The RAID offerings used to be questionable, pretty sure that isn’t the case any more.
UnRaid
- It’s a decently popular option.
- It lets you mix disks of different capacity, and uses your largest disk for parity
- Can just run docker containers, which is great.
- Uses a custom solution for parity, so likely less battle-hardened and fewer eyes on it vs ZFS or MDADM.
- Parity solution reminds me of RAID-4, which may mean higher wear on your parity drive in some situations/workloads.
- I think they added support for more than one parity disk, so that’s neat.
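For anyone wondering how a single dedicated parity disk can rebuild a lost data disk, here is a minimal XOR parity sketch in Python. It shows the general RAID-4 style idea only; it is an assumption that this resembles what UnRaid does internally, and the block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_data, parity):
    """Recover the one missing data block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_data) + [parity])


if __name__ == "__main__":
    data = [b"disk-one", b"disk-two", b"disk-3!!"]  # equal-length example blocks
    parity = xor_blocks(data)  # would live on the dedicated parity disk
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])
```

Every write to any data disk also has to update the parity block, which is where the concern about extra wear on the parity drive comes from.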
Raid card
- Capabilities and reliability can vary by vendor
- Must have battery backup if you are using write-back for performance gains
- Seemingly have fallen out of favor to JBODs with software solutions (ZFS, BTRFS, UnRaid, MDADM)
- I use the PERCs in some of my servers to make a RAID-10 pool out of local 2.5in disks. Works fine, no complaints.
JBOD
- Throwing this in here as it is still mostly one machine, and worth mentioning
- You can buy basically a stripped down server (just a power supply and special SAS expander card) that you can put disks in, and lets you connect that shelf of storage to your actual server
- May let you scale some of those “Single Machine” solutions beyond the number of drive bays you have.
- Is putting a number of eggs in one basket as far as hardware goes if the host server dies, up to you to decide how you want to approach that.
Multi-machine
Ceph
- Provides block (RBD), FS, and Object (S3) storage interfaces.
- Used widely by cloud providers
- Companies I’ve seen run it often have a whole (small) team just to build/run/maintain it
- I had a bad experience with it
- Really annoying to manage (even with cephadm)
- Broke for unclear reasons, while appearing as if everything was working
- I lost all the data I put into it during testing
- My experience may not be representative of what yours would be
SeaweedFS
- Really neat project
- Combines some of the best properties of replication and erasure coding
- Stores data in volume files of X size
- Read/Write happens on replica volumes
- Once a volume fills you can set it as read only and convert it to erasure coding for better space efficiency
- This can make it harder to reclaim disk space, so depending on your workload it may not be right for you
- Has lots of storage configuration options for volumes to tolerate machine/rack/row failures.
- Can shift cold data to cloud storage and I think even can back itself up to cloud storage
- Can provide S3, WebDAV, and FUSE storage natively
- Very young project
- Management story is not entirely figured out yet
- I also lost data while testing this, though root cause there was unreliable hardware
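To show why converting full volumes from replication to erasure coding saves space, here is a rough Python comparison. The copy count and shard counts below are illustrative assumptions, not values from this thread or guaranteed SeaweedFS settings.

```python
def raw_tb_replicated(logical_tb: float, copies: int) -> float:
    """Raw space needed when every byte is stored `copies` times."""
    return logical_tb * copies


def raw_tb_erasure_coded(logical_tb: float, data_shards: int, parity_shards: int) -> float:
    """Raw space needed when data is split into data shards plus parity shards."""
    return logical_tb * (data_shards + parity_shards) / data_shards


if __name__ == "__main__":
    logical = 10  # TB of actual data, made up for the example
    # Assumed settings: 2 full copies vs. a 10 data + 4 parity shard layout.
    print(f"2 copies:      {raw_tb_replicated(logical, 2):.1f} TB raw")
    print(f"10+4 EC:       {raw_tb_erasure_coded(logical, 10, 4):.1f} TB raw")
```

Under those assumptions the erasure-coded layout needs 14 TB of raw space for 10 TB of data instead of 20 TB, while tolerating the loss of more shards than the replicated layout tolerates copies.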
Tahoe LAFS
- Very brief trial
- Couldn’t wrap my head around it
- Seems interesting
- Seems mostly designed for storing things reliably on untrusted machines, so my use case was probably not ideal for it.
MooseFS/LizardFS
- Looked neat and had many of the features I want
- Some of those features are only on (paid) MooseFS Pro or LizardFS (seemingly abandoned/unmaintained)
Gluster
- Disks can be put into many different volume configurations depending on your needs
- Distributed (just choose a disk for each file, no redundancy)
- Replicated (store every file on every disk, very redundant, wastes lots of space, only as much space as the smallest disk)
- Distributed Replicated (Distributed across Replicated sets, add X disks as a Replicated set of disks, choose one of the replica sets and store the file on every disk in that set, is how you scale Replicated disks, each replica can only be as big as the smallest member disk, you must add X disks at a time)
- Dispersed (store each file across every disk using X disks for parity, tolerates X disk failures, only as much space as the smallest disk * (number of disks - X), means you are only losing X disks worth of parity)
- Distributed Dispersed (Distributed across Dispersed sets, add X disks as a Dispersed set of disks with Y parity, choose one of the disperse sets and store each file across its X disks using Y disks for parity, is how you scale Dispersed disks, each disperse only has as much space as the smallest disk * (X - Y), you must add X disks at a time)
- Also gets used by enterprises
- Anything but dispersed stores full files on a normal filesystem (vs Ceph using its own special filesystem, vs Seaweed that stores things in volume files) meaning in a worst case recovery scenario you can read the disks directly.
- Very easy to configure
- I am using it now, my testing of it went well
- Jury is still out on what maintenance looks like
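The capacity rules of thumb for the Gluster volume types above can be written down as a small Python sketch. It only mirrors the formulas from the list (it does not check Gluster's own constraints on brick counts), and the function names are made up for the example.

```python
def distributed_tb(disks_tb):
    """Distributed: each file lands on one disk, so capacity simply adds up."""
    return sum(disks_tb)


def replicated_tb(disks_tb):
    """Replicated: every disk holds every file, so the smallest disk is the limit."""
    return min(disks_tb)


def dispersed_tb(disks_tb, parity):
    """Dispersed: smallest disk * (number of disks - parity disks)."""
    if not 0 < parity < len(disks_tb):
        raise ValueError("parity must be at least 1 and less than the disk count")
    return min(disks_tb) * (len(disks_tb) - parity)


def distributed_over_sets_tb(sets, per_set_tb):
    """Distributed Replicated / Distributed Dispersed: add up each set's capacity."""
    return sum(per_set_tb(s) for s in sets)


if __name__ == "__main__":
    disks = [20, 20, 20, 20]  # the four 20TB drives from the original post
    print("Distributed:", distributed_tb(disks), "TB")
    print("Replicated:", replicated_tb(disks), "TB")
    print("Dispersed, 1 parity:", dispersed_tb(disks, 1), "TB")
    pairs = [disks[:2], disks[2:]]
    print("Distributed Replicated (2 sets of 2):",
          distributed_over_sets_tb(pairs, replicated_tb), "TB")
```

For four 20TB disks that works out to 80, 20, 60, and 40 TB respectively.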
Kubernetes-native
Consider these if you’re using k8s. You can use any of the single-machine options (or most of the multi-machine options) and find a way to use them in k8s (natively for gluster and some others, or via NFS). I had a lot of luck just using NFS from my TrueNAS storage server in my k8s cluster.
Rook
- Uses Ceph under the hood
- Used it very briefly and it seemed fine.
- I have heard good things, but am skeptical given my previous experience with Ceph
Longhorn
- Project by the folks at Rancher/SUSE
- Replicates the volume
- Worked well enough when I was running k8s with some light workloads in it
- Only seems to provide block storage, which I am much less interested in.
OpenEBS
- Never used it myself
- Only seems to provide block storage, which I am much less interested in.
Wow this is an amazing explanation! I’ve been starting to lean towards UnRAID but I didn’t realize their fault tolerance wasn’t as vetted as my cohorts believe. I’ve been taking the painful option of keeping an LVM on Ubuntu for my drives, but I’m trying to get out of my mistake and build something for offsite so I can feel comfortable with redoing it all.
Thanks so much for the very detailed reply. I think at this point I’m conflicted between using TrueNAS or going all in and trying SDS. I’m leaning towards SDS primarily because I want to build experience, but heck maybe I’ll end up doing both and testing it out first and see what clicks.
I’ve set up Gluster before for OpenStack and had a pretty good experience, but at the time the performance was substantially worse than Ceph (however it may have gotten much better). Ceph was really challenging and required a lot of planning when I used it last in a previous role, but it seems like Rook might solve most of that. I don’t really care about rebuild times… I’m fine if it takes a day or two to recover the data as long as I don’t lose any.
As long as I make sure to have an offsite backup/replica somewhere then I guess I can’t go too wrong. Thanks for explaining the various configurations of Gluster. That will be extremely helpful if I decide to go that route, and if performance can be tuned to match Ceph then I probably will.
If you’ve got >=3 machines and >=3 devices, I’d suggest at least strongly considering Rook. It should allow for future growth and will let you tolerate the loss of one node at the storage level too, assuming you have replication configured. You can set the replication params per StorageClass, in case you want to squeeze every last byte out for cases where you don’t need storage-level replication.
I’ve run my own k8s cluster for years now, and solid storage from rook really made it take off with respect to how many applications I can build and/or run on it.
As for backup, there’s Velero, though I haven’t gotten it to work on bare metal. My ideal would be to just use it to store backups in Backblaze B2 given the ridiculously low cost. Presumably I could get there with restic, since that’s my outside-k8s backup solution, but I still haven’t gotten that set up since it’s much more cloud-provider friendly.
Thanks. That is what I’m leaning towards. Do you have any suggestions for a particular distro for your K8S nodes? I’m running Arch on my desktop.
The idea of being able to set up different storage classes is very appealing, as well as learning how to build my own K8S cloud.
I’m also running Arch. Unfortunately I’ve been running mine long enough that it’s just my own bespoke Ansible playbooks for configs that have morphed only as required by breaking changes or features/security I want to add. I think the best way to start from scratch these days is kubeadm, and I think it should be fairly straightforward on Arch or whatever distro you like.
Fundamentally my setup is just kubelet and kube-proxy on every node, the OCI runtime (CRI-O for me), etcd (set up manually, but certs are automated now), and then some k8s manifests templated and dropped into the k8s manifest folder for the control plane on 3 nodes for HA. The more I think about it, the more I remember how complicated it is unless you want a private CA. Which I have, and I love the convenience and privacy it affords me (no Certificate Transparency logs exposing domain names unless I need public certs, and then they’re public anyway).
I have expanded to 6 nodes (5 of which remain, RIP laptop SSD) and just run Arch on all of them because it kinda just works and I like the consistency. I also got quite good at the Arch install in the process.
That’s rad… I have a set of Ansible playbooks/roles/collections already for most system-wide settings. I have a love-hate relationship with Ansible though, but it gets the job done. I may try for cloud-init first until I reach its limitations. I’ve gotten pretty good at the Arch install too, although setting up the disks with LUKS was the most challenging part. Fortunately, the few times I’ve broken things I’ve been able to boot the installer ISO and mount my LUKS volumes from memory, but I couldn’t tell you how I set them up in the first place. 🤣 However I do it, I really just want to automate the process so that I can add new nodes and expand should I decide to rent out colocation space someday.
If you want to have a single node hosting your storage, ZFS either directly or through TrueNAS is the way to go. ECC is not really necessary.
If you do want to have a higher availability and as you have 3 potential nodes already I can recommend Ceph. It’s pretty much THE open source network storage provider and has been battle tested a lot. With Ceph you can define how you want redundancy configured, but I haven’t gone into that enough to tell you what to do.
As long as 2/3 nodes are up you won’t have problems, but fewer nodes will lead to either a broken cluster or a split-brain scenario if you’re not careful.
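The 2-out-of-3 rule is just a strict majority. Here is a minimal Python sketch of that check, assuming one voting member per node (the function names are made up for the example):

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, nodes_up: int) -> bool:
    """True while enough nodes are up to form a majority."""
    return nodes_up >= quorum_size(total_nodes)


if __name__ == "__main__":
    for up in (3, 2, 1):
        state = "quorum" if has_quorum(3, up) else "no quorum"
        print(f"{up}/3 nodes up: {state}")
```

Because two disjoint groups can never both hold a majority, this rule is also what guards against split brain; note that a 4-node cluster still only tolerates one failure, since its quorum is 3.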
Not sure how Ceph handles nodes with mismatched drives, I only tested using the same number and size of drives. Maybe you’d have to get two more drives for your other nodes.
Re RAID: With only 4 drives, RAID 6 gives you the same usable capacity as RAID 10, so you could go with RAID 10 instead. RAID 10 would provide higher performance, but future upgrades need to be done in pairs of disks. The trade-off is that RAID 6 survives any two drive failures, while RAID 10 only survives two if they land in different mirror pairs.
RAID 5 is okish, but with 20TB you’re looking at very long restore times, the likelihood of failure is very high during the restore, and it’s very likely you’ll have a half-finished restore when a second drive fails.
See this article for an explanation. (The same issue also applies to RAID 6, but the drives can be bigger as you have 2 redundant drives. Regardless, RAID 10 doesn’t have this problem due to being so simple)
Would suggest raid0 for maximum read speed /s
if the data is mission critical have multiple backups for it, and test the restore process of said backups.
As for your RAID vs software RAID question, just install TrueNAS and throw it on there, use an LSI HBA in IT mode and create a ZFS pool. Don’t overthink it.
Thanks. I’ve got Gigabit Fiber so I guess I’ll try Hetzner as a remote backup, or see how much it will cost to upgrade my Google Workspace account since they started enforcing their storage quotas.
It’s a lot cheaper in the long run to do an off site backup rig and just put it at a friend’s house if you can. Google will be expensive.
I don’t have any friends really 😥 and the unlimited storage with Google Workspace was $25/mo. I think it will cost me about $125/mo. now to get enough pooled storage with Google, but it is doable at least in the short-term. I guess I need to make some friends with fiber connections.
If you’re planning to go BSD, or buy all the drives you’re ever gonna have in the cluster up-front, then ZFS is great. Otherwise, be mindful of the hidden cost of ZFS. Personally, for my home server, because I’m gradually adding more drives still, I’m using mdraid on RAID6 with 8 x 8TB WD Reds/HGST Ultrastars, and I’m loving the room for activities.
Having said that, regardless of the solution you go with, since you’ve got only 4 drives, a higher RAID level (or the equivalent thereof, such as RAIDZ2) might be out of reach as you’d be “wasting” a lot of space for the extra peace of mind. If I were in your situation, I’d probably use RAID5 (despite “RAID 5 is dead in 2009”, or “have they continued chugging on after 2013”) for less important data (so it can sustain 1 drive failure), or RAID 10 if I needed more performance (and potentially sustain 2 drive failures, depending on the luck of the draw as to which drives fail).
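The "luck of the draw" part for RAID 10 can be counted directly. Below is a small Python sketch that enumerates every combination of failed drives for a 4-drive RAID 10, assuming two mirror pairs and that every drive is equally likely to fail (the drive labels are made up):

```python
from itertools import combinations


def raid10_survives(failed, mirror_pairs):
    """The array survives as long as no mirror pair has lost both of its members."""
    failed = set(failed)
    return not any(set(pair) <= failed for pair in mirror_pairs)


def survival_fraction(mirror_pairs, failures):
    """Fraction of all equally likely failure combinations the array survives."""
    drives = [drive for pair in mirror_pairs for drive in pair]
    cases = list(combinations(drives, failures))
    survived = sum(raid10_survives(case, mirror_pairs) for case in cases)
    return survived / len(cases)


if __name__ == "__main__":
    pairs = [("a", "b"), ("c", "d")]  # hypothetical labels for the 4 drives
    for n in (1, 2):
        print(f"{n} failed drive(s): survives {survival_fraction(pairs, n):.0%} of cases")
```

Of the 6 possible two-drive failures, only the 2 that take out both halves of one mirror are fatal, so a 4-drive RAID 10 survives a second failure in 4 out of 6 cases under these assumptions.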
Thanks. I believe TrueNAS does ZFS as well… maybe by default. If I want to keep it simple this will probably be the route I go.
Really the first question to ask is, do you want n+1 redundant or n+2 redundant? Decide that, then whatever tech you use to make it happen is a matter of preference.
Personally I always go for n+2 redundant with offsite backup…
Prob n+2 but I’ll at least need to buy 2 more drives.