Saturday, February 15, 2025

Replace failed ZFS mirror drive in OPNsense

Welcome Back!

It's been a while since I last shared anything. I recently changed jobs and have been busy with that endeavor, but I hope to share more insights from this journey soon.

Encountering SMART Errors after OPNsense Upgrade

After upgrading one of my OPNsense instances, I noticed errors from one of my drives, ada1, during the restart. Further investigation turned up SMART errors. They were not enough to trigger a SMART failure, and even manual short tests returned clean results, but they were still concerning. Here's what I found when running smartctl -a:

 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...  
180 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       4185
...
195 Hardware_ECC_Recovered  0x0032   100   099   000    Old_age   Always       -       1676229109
...
SMART Error Log Version: 1
ATA Error Count: 4216 (device log contains only the most recent five errors)
...	
Error 4216 occurred at disk power-on lifetime: 32126 hours (1338 days + 14 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 00 00      00:00:09.590  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:09.260  IDENTIFY DEVICE
  f5 00 00 00 00 00 00 00      00:00:09.250  SECURITY FREEZE LOCK
  ec 00 00 00 00 00 00 00      00:00:09.250  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      00:00:09.250  IDENTIFY DEVICE
...
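
To keep an eye on these counters over time, the interesting lines can be pulled out of the smartctl -a output with a small filter. This is only a sketch: the function name smart_summary is my own invention, and it assumes the output layout shown above (attribute name in column 2, raw value in column 10).

```shell
# smart_summary: hypothetical helper, not part of smartmontools.
# Reads `smartctl -a` output on stdin and prints the ATA error count
# plus the raw values of the two attributes discussed above.
smart_summary() {
    awk '
        /^ATA Error Count:/ { print "ata_errors=" $4 }
        $2 == "Erase_Fail_Count" || $2 == "Hardware_ECC_Recovered" {
            print $2 "=" $10
        }
    '
}
```

On the live system this would be used as smartctl -a /dev/ada1 | smart_summary. Attribute names vary between drive vendors, so the two names here would need adjusting for other drives.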

This drive is on its way out, so I need to replace it. This particular system is a retasked $30 Barracuda Load Balancer 340 that I upgraded with a new processor and memory. It works great for the use case. Unfortunately, it uses commodity hardware (an MSI customized mainboard), and its manual did not state that it supports hotplug, so I had to bring it down to swap the drive. Log into the console via your preferred method; I'm using SSH. The first task is to identify the failing drive and then remove it from the ZFS pool.

root@OPNsense:~ # zpool status

pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:00:15 with 0 errors on Wed Feb  5 01:31:15 2025
config:
        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0
errors: No known data errors 

Since ada1 is the failing drive, its ZFS partition is ada1p4, so that is the partition we detach from the zroot pool.

 root@OPNsense:~ # zpool detach zroot ada1p4
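
Detaching the wrong member, or detaching while the other member is unhealthy, would leave the pool without a good copy. A minimal guard could look like the following. The function name peer_online is hypothetical, and it assumes the zpool status layout shown above, with five columns per device line.

```shell
# peer_online POOL DEV: hypothetical guard, not a zpool subcommand.
# Reads `zpool status` output on stdin and succeeds only if some device
# other than DEV (ignoring the pool and mirror-N rows) is ONLINE.
peer_online() {
    awk -v pool="$1" -v dev="$2" '
        NF == 5 && $2 == "ONLINE" && $1 != pool && $1 !~ /^mirror-/ && $1 != dev {
            found = 1
        }
        END { exit !found }
    '
}
```

It would be used as zpool status zroot | peer_online zroot ada1p4 && zpool detach zroot ada1p4, so the detach only runs when ada0p4 is still reported as ONLINE.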

Replacing the Drive

As this system does not support hotplug, I then shut it down and swapped the defective drive for a known good one of the same size or larger. The new drive should be clean, without any partitions on it. In this case, though, it is the second drive and OPNsense will boot back up from ada0, so it's not critical. Your results may vary. Once the system is back up, log back into the console via your preferred method.

Verifying the Partitions

First, we need to verify the partitions, because copying the wrong ones could lead to trouble! If your new drive already has a partition table, gpart show will list it as well.

root@OPNsense:~ # gpart show

=>       40  500118112  ada0  GPT  (238G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     533544        984        - free -  (492K)
     534528   16777216     3  freebsd-swap  (8.0G)
   17311744  482805760     4  freebsd-zfs  (230G)
  500117504        648        - free -  (324K)

We have four partitions and the partition table to clone to the new disk. We use dd for partitions 1 and 2, while partitions 3 and 4 are handled by their own tools (swapon and zpool). Next, we need to turn off swap. Since the swap partitions of both disks are listed in /etc/fstab, swapoff reports an error for the one on the replaced disk, which does not exist yet.

root@OPNsense:~ # swapoff -a

swapoff: removing /dev/ada0p3 as swap device
swapoff: /dev/ada1p3: No such file or directory

Cloning Partitions

Now comes the potentially dangerous part, so be VERY careful here. The source drive is ada0, and the new drive is ada1; mixing them up would overwrite the good disk. We will clone the partition table from ada0 to ada1.

root@OPNsense:~ # gpart backup ada0 | gpart restore -F ada1
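
Before copying any data, it is worth confirming that both disks now carry the same layout. One way to do that, as a rough sketch, is to reduce the gpart show output to its partition rows and compare them. The helper name gpart_layout is my own, and it assumes the column order shown earlier (start, size, index, type).

```shell
# gpart_layout: hypothetical helper, not part of gpart.
# Reads `gpart show DISK` output on stdin and prints one line per
# partition: start sector, size in sectors, index, type.
# The "=>" header row and "- free -" rows are skipped.
gpart_layout() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { print $1, $2, $3, $4 }'
}
```

On the router the comparison would be [ "$(gpart show ada0 | gpart_layout)" = "$(gpart show ada1 | gpart_layout)" ] && echo "layouts match". If the new disk is larger, the partition rows should still match; only the trailing free space differs, and that row is ignored here.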

Next we clone partition 1:

root@OPNsense:~ # dd if=/dev/ada0p1 of=/dev/ada1p1
532480+0 records in
532480+0 records out
272629760 bytes transferred in 22.115694 secs (12327434 bytes/sec)

Then partition 2:

root@OPNsense:~ # dd if=/dev/ada0p2 of=/dev/ada1p2
1024+0 records in
1024+0 records out
524288 bytes transferred in 0.054477 secs (9623964 bytes/sec)
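
The byte counts dd reports can be cross-checked against the sector counts from gpart show. The arithmetic below assumes 512-byte sectors, which is consistent with the dd output above, where each record is one default 512-byte block. The function name is just for illustration.

```shell
# Assumption: 512-byte sectors, matching dd's default block size.
sector_bytes=512

# sectors_to_bytes N: print the size in bytes of N sectors.
sectors_to_bytes() {
    echo $(( $1 * sector_bytes ))
}

sectors_to_bytes 532480   # efi partition          -> 272629760
sectors_to_bytes 1024     # freebsd-boot partition -> 524288
```

Both results match the "bytes transferred" figures, so each partition was copied in full.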

For the ZFS mirror, we use zpool attach to add ada1p4 to the zroot pool as a mirror of ada0p4, the remaining member of the pool shown by zpool status earlier.

 root@OPNsense:~ # zpool attach zroot ada0p4 ada1p4

You can verify it’s back to an expected state via zpool status:

 root@OPNsense:~ # zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 2.29G in 00:00:10 with 0 errors on Sat Feb 15 10:14:39 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p4  ONLINE       0     0     0
            ada1p4  ONLINE       0     0     0

errors: No known data errors
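
My pool is small, so the resilver finished in seconds. On a larger pool it can take a while, and it is safer to wait for it to finish before touching anything else. A minimal way to check, assuming the scan line wording zpool status uses ("resilver in progress" while running, "resilvered" once done), might be the following; the function name resilver_state is hypothetical.

```shell
# resilver_state: hypothetical helper, not a zpool subcommand.
# Reads `zpool status` output on stdin and prints one of:
#   in-progress, done, none
resilver_state() {
    awk '
        /scan:.*resilver in progress/ { state = "in-progress" }
        /scan:.*resilvered/           { state = "done" }
        END { print (state ? state : "none") }
    '
}
```

A simple polling loop would be: while [ "$(zpool status zroot | resilver_state)" = "in-progress" ]; do sleep 10; done. Newer OpenZFS releases also ship a zpool wait subcommand, which may be simpler if your version has it.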

Finally, turn swap back on, which will take care of the third partition.

 root@OPNsense:~ # swapon -a
swapon: adding /dev/ada0p3 as swap device
swapon: adding /dev/ada1p3 as swap device

At this point, it would be a good idea to go to the GUI, navigate to System: Settings: Cron, and verify that the SMART tasks, along with any ZFS tasks, are configured the way you want. I only have a monthly scrub, since I enabled autotrim per my config articles.