Introduction

About 6-8 months ago, I decided to switch to ZFS for storing all of my data, in the hope of one day being able to back it up remotely with ZFS encryption.

That day has finally arrived; I have an off-site backup that I am dumping data to.

Rationale for ZFS

People may ask why I did not use BorgBackup or similar alternatives. There are several parts to the answer:

  1. ZFS is an upgrade to my filesystem and will protect my data better. (I also recently upgraded my machine to ECC RAM, although contrary to popular belief, ECC RAM is not required for ZFS.)
  2. I compared the encryption protocols used by both Borg and ZFS, and in my very uninformed opinion, ZFS’s was slightly stronger.
  3. I wanted to be able to access my data on the remote, and though it’s possible with Borg, it’s easier with ZFS (in my opinion). This is especially important for making sure backups are good. If I can access the data on the remote the same way I do on my normal machine, I can have confidence that the backup is good.
  4. It’s easier to set up mirrors with ZFS.

That last reason was the real differentiator between the two. I have mirrors on both my home machine and the remote, and the ZFS setup was easy.

My Current Setup

I have a local backup as well, in the form of a (now-old) Synology Diskstation. It has served me well thus far, having saved my data on several occasions, including several times when I made mistakes while building the ZFS infrastructure in this post.

My current setup is to rsync my data to the Diskstation. Over the years, I built up an elaborate script (353 lines of shell) to do this, including dry runs to ensure that if something is deleted, it’s because I want it to be deleted.

Before I put the dry runs in, I almost lost data when it was deleted by mistake. Fortunately for me, that data also happened to be stored under another user on the Diskstation at the time.

There is more to my Diskstation backup than that, such as optionally deleting software builds, but that is the gist of what I did before I managed to get a remote ZFS server.

First Attempts

My first attempts at getting a remote backup with ZFS failed spectacularly.

rsync.net

When I set up ZFS for the first time on my machine, I looked at rsync.net. I even set up an account with them.

Unfortunately, they don’t provision a full VM for customers with less than 1 TiB of data, a category I fall under. This meant that I was unable to update the version of ZFS in my bhyve VM, which was a dealbreaker because, at the time, that version of FreeBSD was running a version of ZFS without encryption.

I could have upgraded to an account that charged me for a full 1 TiB, but my wife and I eventually decided not to.

zfs.rent

Next, when it was announced on Hacker News, I looked into zfs.rent.

It was cheaper, which was good. However, it was just getting started, and I was not entirely impressed with the founder, so I decided not to pursue that avenue.

Actually, the success of zfs.rent, especially the fact that the founder turned away potential customers, makes me wonder if there is a bigger market for off-site ZFS backups than I thought. Maybe I can capture some of the customers he turned away?

Renting a Server

Eventually, I came to a deal with a friend of mine who rents servers. I ended up paying more, but I have a full machine with all of the niceties I could ever want. That also means I can move my websites away from their current hosting to the server I am renting, which will make deploying them much easier.

The move will happen soon; I just need to learn how to set up BIND.

I also sent him some drives that he put in the machine, so that he can send them back to me directly should I ever need them.

Another reason I am glad to pay more is that I know who this money is going to, and it’s supporting a small business in a time when small businesses are being choked.

Creating the Infrastructure

After I had the server and had installed my choice of OS and ZFS, I got to work.

Creating the Mirror

The first thing I had to do was create a mirror. That was easy; I had done it before.

zpool create -o ashift=12 -O compression=off -O atime=off -m none home mirror \
	/dev/disk/by-id/<first_disk_id> /dev/disk/by-id/<second_disk_id>

I turn off atime by default for performance.

I turn off compression by default because the author of ZFS encryption makes no guarantees that a CRIME-like attack isn’t possible against ZFS encryption with compression turned on. Since the vast majority (space-wise) of my data is already-compressed video, compression wouldn’t help me much anyway, while also opening a potential hole.

I also set ashift to 12 because my drives have 4096-byte (2^12-byte) physical sectors.
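For reference, something like this will report a drive’s sector sizes on Linux (with /dev/sda standing in for one of the actual drives):

# 4096-byte physical sectors correspond to ashift=12.
lsblk -d -o NAME,LOG-SEC,PHY-SEC /dev/sda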

Finding the Right Commands

This part was the toughest.

I knew I wanted to do a zfs send/zfs recv with snapshots. From my attempt with rsync.net, I had a script to handle creating the snapshots, but I hadn’t yet managed to send a stream successfully.

With a lot of work, and a lot of fruitless attempts, I finally reached out to the zfs-discuss mailing list, and they were immensely helpful.

I eventually learned that it was best not to do a raw send, but to send the data encrypted by ssh instead.

The basic command I wrote is the following:

zfs send -vL $token | ssh $SSHARGS "zfs recv $mountpoint -u -sd $pool"

$token is the name of the snapshot I want to send, unless that snapshot has a receive_resume_token, in which case, $token is -t <receive_resume_token>. If the backup is an incremental backup, $token also has -I <previous_snapshot>.

$mountpoint is blank if the mountpoint of the dataset is inherited, but if it’s custom, $mountpoint is -o 'mountpoint=<mountpoint>' to make sure that mountpoints are the same on both machines.

$pool is obviously the pool the dataset is in.

And $SSHARGS is redacted for obvious reasons.
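To make that concrete, the command ends up taking one of a few forms. The dataset, snapshot, and mountpoint names below are placeholders, not my real ones:

# Full send of a snapshot (inherited mountpoint, so $mountpoint is empty):
zfs send -vL home/data@2021-01-01 | ssh $SSHARGS "zfs recv -u -sd home"

# Incremental send from the previous snapshot, with a custom mountpoint:
zfs send -vL -I home/data@2020-12-01 home/data@2021-01-01 | ssh $SSHARGS \
	"zfs recv -o 'mountpoint=/data' -u -sd home"

# Resuming an interrupted send using the remote's receive_resume_token:
zfs send -vL -t <receive_resume_token> | ssh $SSHARGS "zfs recv -u -sd home"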

Performance

My ISP’s upload speed is pretty bad, and in my experience, it has been even worse than what they advertise.

So I was not surprised at all when my first implementation was slower than the advertised upload speed. Still, with encouragement from the mailing list, I decided to dig a little.

By using bzip2 on both ends, I was able to recover some speed, giving me this command:

zfs send -vL $token | bzip2 -c | ssh $SSHARGS \
	"bzip2 -dc | zfs recv $mountpoint -u -sd $pool"

However, even with that, at the rate it was sending, it would take me more than 30 days to send it all.

I wondered if I could find another bottleneck.

So I tried using mbuffer to check whether ZFS was the bottleneck. It wasn’t; zfs send easily filled a 1 GiB buffer in seconds.
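The test looked roughly like this (not my exact invocation), with mbuffer sitting right after zfs send:

# If zfs send were the bottleneck, the buffer would stay empty;
# instead it filled within seconds.
zfs send -vL $token | mbuffer -m 1G -s 128k | bzip2 -c | ssh $SSHARGS \
	"bzip2 -dc | zfs recv $mountpoint -u -sd $pool"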

I tried WireGuard with netcat to see if ssh was the bottleneck. Nope; I still had terrible upload speed (though it was slightly better).
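That test was roughly along these lines (the address and port are placeholders, and netcat flags vary by implementation):

# On the remote end, listening on the WireGuard address:
nc -l 9000 | bzip2 -dc | zfs recv $mountpoint -u -sd $pool

# On the sending end:
zfs send -vL $token | bzip2 -c | nc <wireguard_peer_address> 9000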

Finally, I gave up.

However, I noticed that most CPU time went to bzip2, so I wondered if I could send two snapshots in parallel and have one send while the other was being compressed.

The result blew me away: I got twice the upload speed.
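In its simplest form, the experiment amounted to backgrounding two send pipelines at once (dataset and snapshot names here are placeholders):

# Two independent send pipelines running concurrently; each gets its
# own bzip2 process, so one can compress while the other uploads.
zfs send -vL home/data@2021-01-01 | bzip2 -c | ssh $SSHARGS \
	"bzip2 -dc | zfs recv -u -sd home" &
zfs send -vL home/videos@2021-01-01 | bzip2 -c | ssh $SSHARGS \
	"bzip2 -dc | zfs recv -u -sd home" &
wait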

I quickly checked whether bzip2 had been the bottleneck by putting mbuffer after it; nope, bzip2’s output also filled the 1 GiB buffer easily.
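That check was roughly:

# If bzip2 were the bottleneck, the buffer after it would stay empty;
# it filled easily too.
zfs send -vL $token | bzip2 -c | mbuffer -m 1G -s 128k | ssh $SSHARGS \
	"bzip2 -dc | zfs recv $mountpoint -u -sd $pool"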

The True Bottleneck

And that led me to my most surprising discovery of all: Linux (the kernel) was the bottleneck.

I don’t know enough about kernel development to investigate, but at this point, I am sure that Linux was the reason I could not saturate my connection. Once I started sending enough datasets in parallel, I easily saturated it.

Parallelizing the Upload

Since parallelizing could cut my send time from 30 days down to about 6, it was easily worth my time to figure out how to do it.

I did this in a few steps. First, I broke up my largest datasets into more manageable sizes by making subdatasets.

Then I had a conundrum: I wanted to send subdatasets before their parents, so I needed a way to express dependencies between datasets while being able to send datasets in parallel. After banging my head against the wall for several hours, I realized the answer was obvious: make with the -j flag.

So I wrote a Makefile expressing those dependencies, biasing towards sending larger datasets first. (In a Makefile, the order of prerequisites matters.)
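A hand-trimmed sketch of what such a Makefile looks like (the dataset names are made up, the real file has many more targets, and each target calls a small send script described below):

all: home/videos home/data

# Every send depends on the remote filesystems being unmounted first.
umount:
	./zfs_send.sh umount

# Parents depend on their subdatasets, so children are sent first;
# larger datasets are listed first so make -j starts them earlier.
home/videos: umount home/videos/raw home/videos/edited
	./zfs_send.sh home/videos

home/videos/raw: umount
	./zfs_send.sh home/videos/raw

home/videos/edited: umount
	./zfs_send.sh home/videos/edited

home/data: umount
	./zfs_send.sh home/data

.PHONY: all umount home/videos home/videos/raw home/videos/edited home/data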

Then, after trying out the Makefile and finding that it worked splendidly, I wrote a script to generate the Makefile, using zfs get and zfs list.
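The heart of the generator is enumerating the datasets with zfs list; a simplified sketch might look like this (the real script does more, including using zfs get, and the all: and umount: targets are written out separately):

# Emit one target per dataset, largest first, appended to a Makefile
# whose all: and umount: targets were written out beforehand.
zfs list -H -p -r -o name,used home | sort -k2,2 -rn |
while read -r dataset used; do
	printf '%s: umount\n\t./zfs_send.sh %s\n\n' "$dataset" "$dataset"
done >> Makefile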

End Result

If you use the scripts in this section, you do so AT YOUR OWN RISK.

The end result is three scripts:

  • zfs_backup.sh, which is called from my existing backup script.
  • zfs_send_gen.sh, which generates the Makefile.
  • zfs_send.sh, which is the script called by the targets in the Makefile and does the actual sending.

A fourth script, referred to as YESNO in the scripts above, is also needed. It just makes it easier for scripts to ask users yes or no questions.

zfs_backup.sh does the following (a simplified sketch follows the list):

  1. Creates snapshots of every dataset in every pool, if requested.
  2. Deletes snapshots (if requested), leaving $NUMSNAPS snapshots untouched.
  3. Generates the Makefile by calling zfs_send_gen.sh.
  4. Runs make, doing the parallel send.
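Roughly, and stripped of its confirmation prompts (the YESNO script) and error handling, that amounts to something like this; $NUMSNAPS, the pool name, and the snapshot naming scheme are placeholders:

DATE=$(date +%Y-%m-%d)
NUMSNAPS=5

for pool in home; do
	# 1. Snapshot every dataset in the pool.
	zfs snapshot -r "$pool@$DATE"

	# 2. Prune old snapshots per dataset, keeping the newest $NUMSNAPS
	#    (head -n -N is a GNU extension).
	zfs list -H -r -o name "$pool" | while read -r ds; do
		zfs list -H -t snapshot -o name -s creation -d 1 "$ds" |
			head -n -"$NUMSNAPS" | xargs -r -n1 zfs destroy
	done
done

# 3. Regenerate the Makefile, then 4. run the parallel send.
./zfs_send_gen.sh
make -j"$JOBS"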

Each make target calls zfs_send.sh. If the special argument umount is provided, zfs_send.sh unmounts all pools on the remote (which is done in a make target that all others depend on). Otherwise, it does the following (a stripped-down sketch follows the list):

  1. Determines if the mountpoint of the dataset is inherited or not. If it is not, the $mountpoint argument is set.
  2. Gets the last snapshot (which zfs_backup.sh should have just created).
  3. Gets the last snapshot on the remote.
  4. If the snapshots are the same, it exits successfully since the snapshot has already been sent.
  5. If the snapshot has not been sent, it continues by detecting if there is a receive_resume_token on the remote for that dataset.
  6. If there is a token, it sets $token to -t <token_id>.
  7. Otherwise, it sets $token to the snapshot name.
  8. It checks whether there are any snapshots of that dataset on the remote.
  9. If there are none, it starts sending a full stream.
  10. If there is a snapshot on the remote, it sends an incremental stream.
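Stripped down to its core (and leaving out error handling, the YESNO prompts, and a few details), zfs_send.sh works roughly like this:

#!/bin/sh
# Usage: zfs_send.sh <dataset>   (or: zfs_send.sh umount)
dataset="$1"
pool="${dataset%%/*}"

if [ "$dataset" = "umount" ]; then
	ssh $SSHARGS "zfs unmount -a"
	exit 0
fi

# 1. Pass the mountpoint along only if it is set locally, not inherited.
mountpoint=""
if [ "$(zfs get -H -o source mountpoint "$dataset")" = "local" ]; then
	mountpoint="-o 'mountpoint=$(zfs get -H -o value mountpoint "$dataset")'"
fi

# 2. Newest local snapshot (zfs_backup.sh should have just created it).
local_snap=$(zfs list -H -t snapshot -o name -s creation -d 1 "$dataset" | tail -n 1)

# 3. Newest snapshot of the dataset on the remote, if any.
remote_snap=$(ssh $SSHARGS "zfs list -H -t snapshot -o name -s creation -d 1 $dataset 2>/dev/null | tail -n 1")

# 4. Already sent? Nothing to do.
[ "${local_snap#*@}" = "${remote_snap#*@}" ] && exit 0

# 5-7. Resume an interrupted send if the remote saved a token.
resume=$(ssh $SSHARGS "zfs get -H -o value receive_resume_token $dataset 2>/dev/null")
if [ -n "$resume" ] && [ "$resume" != "-" ]; then
	token="-t $resume"
elif [ -z "$remote_snap" ]; then
	# 8-9. No snapshot on the remote yet: full stream.
	token="$local_snap"
else
	# 10. Otherwise: incremental from the remote's newest snapshot.
	token="-I @${remote_snap#*@} $local_snap"
fi

zfs send -vL $token | bzip2 -c | ssh $SSHARGS \
	"bzip2 -dc | zfs recv $mountpoint -u -sd $pool"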

There are some cool advantages of this system.

First, if sending gets interrupted, as happened often while sending the data the first time, I just have to run make -j$JOBS in the directory where the Makefile and scripts are, and everything picks up where it left off.

Second, I can run a ZFS backup without running my Diskstation backup.

Third, since the snapshots are exactly the same on both ends, I can be assured that ZFS is keeping the data exactly the same on both ends as well.

Conclusion

I don’t know that there’s much to the conclusion of this post, but here are some things I learned:

  • ZFS struggles with raw sends/receives.
  • The ZFS community is helpful.
  • Linux’s TCP stack is awful, at least for this use case.
  • You can work around the above if you can upload stuff in parallel.
  • make is a great way of expressing dependencies between jobs, even if you are not using it to build software.
  • Makefiles are surprisingly easy to generate.
  • ZFS has a lot of user tools included that make querying the system easy.
  • However, the ZFS man pages, while good, could be better.

The biggest lesson: use ZFS. It’s great!