Welcome Back!
It's been a while since I last shared anything. I recently changed jobs and have been busy with that endeavor, but I hope to share more insights from this journey soon.
Encountering SMART Errors after an OPNsense Upgrade
After upgrading one of my OPNsense instances, I noticed errors reported for one of its drives, ada1, when the system restarted. Digging further, I found a number of SMART errors. They were not enough to trigger a SMART failure, and a manual short self-test also came back clean, but they were still concerning. Here's what I found when running smartctl -a:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
180 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 4185
...
195 Hardware_ECC_Recovered 0x0032 100 099 000 Old_age Always - 1676229109
...
SMART Error Log Version: 1
ATA Error Count: 4216 (device log contains only the most recent five errors)
...
Error 4216 occurred at disk power-on lifetime: 32126 hours (1338 days + 14 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 00 00 00 00 00 00
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 00 00 00:00:09.590 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:09.260 IDENTIFY DEVICE
f5 00 00 00 00 00 00 00 00:00:09.250 SECURITY FREEZE LOCK
ec 00 00 00 00 00 00 00 00:00:09.250 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:00:09.250 IDENTIFY DEVICE
...
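If you want to track these counters over time instead of reading the full report each time, a small shell sketch like the one below can pull them out. It assumes the report was first saved from smartctl -a to a file, and that the attribute table keeps RAW_VALUE in its tenth column as in the output above. The function names smart_raw and smart_error_count are hypothetical helpers, not part of smartmontools.

```shell
#!/bin/sh
# Sketch: extract a few numbers from saved `smartctl -a` output (read on stdin).
# Hypothetical helpers; this does not call smartctl itself.

# smart_raw ID: print the RAW_VALUE (10th column) of SMART attribute ID.
# Assumes the attribute-table layout shown in the report above.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# smart_error_count: print the number from the "ATA Error Count:" line.
smart_error_count() {
    awk '/^ATA Error Count:/ { print $4; exit }'
}

# Usage on the live system (assumes smartmontools is installed):
#   smartctl -a /dev/ada1 > /tmp/ada1.smart
#   smart_raw 180 < /tmp/ada1.smart     # Erase_Fail_Count raw value
#   smart_raw 195 < /tmp/ada1.smart     # Hardware_ECC_Recovered raw value
#   smart_error_count < /tmp/ada1.smart
```

Saving the numbers periodically makes a climbing Erase_Fail_Count or ATA Error Count easier to spot than rereading the whole report.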
This drive is on its way out, so it needs to be replaced. This particular system is a retasked $30 Barracuda Load Balancer 340 that I upgraded with a new processor and memory, and it works great for this use case. Unfortunately, it is built on commodity hardware (a customized MSI mainboard), and the manual does not state that hotplug is supported, so I had to shut it down to swap the drive.
Log into the console via your preferred method (I'm using SSH). The first task is to identify the failing drive and detach its partition from the ZFS pool.
root@OPNsense:~ # zpool status
pool: zroot
state: ONLINE
scan: scrub repaired 0B in 00:00:15 with 0 errors on Wed Feb 5 01:31:15 2025
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p4 ONLINE 0 0 0
ada1p4 ONLINE 0 0 0
errors: No known data errors
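Since ada1 is the failing drive, its ZFS partition is ada1p4, so that is the partition to detach from the mirror. Note that zpool detach takes the pool name as well as the device.
root@OPNsense:~ # zpool detach zroot ada1p4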
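Detaching the wrong device would leave the pool running only on the failing drive, so it can be worth wrapping the command in a membership check. The sketch below is hypothetical (pool_has_vdev and safe_detach are my names, not zpool features) and assumes the zpool status layout shown above, with names in the first column of the config: section.

```shell
#!/bin/sh
# Sketch: only detach a device if `zpool status` actually lists it.

# pool_has_vdev NAME: succeed if NAME appears in the NAME column of the
# config: section of `zpool status` output read from stdin.
pool_has_vdev() {
    awk -v dev="$1" '
        /^ *config:/        { in_cfg = 1; next }
        /^ *errors:/        { in_cfg = 0 }
        in_cfg && $1 == dev { found = 1 }
        END                 { exit !found }
    '
}

# safe_detach POOL DEV: run `zpool detach POOL DEV` only after confirming
# DEV is listed in POOL; otherwise print a message and return 1.
safe_detach() {
    pool=$1 dev=$2
    if zpool status "$pool" | pool_has_vdev "$dev"; then
        zpool detach "$pool" "$dev"
    else
        echo "refusing: $dev is not listed in pool $pool" >&2
        return 1
    fi
}
```

Used this way, safe_detach zroot ada1p4 stands in for the plain zpool detach command.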
Replacing the Drive
Because this system does not support hotplug, I shut it down and swapped the defective drive for a known good one of the same size or larger. Ideally, the new drive is blank, with no partitions on it. In this case, though, it is the second drive and OPNsense boots from ada0, so whatever is on the new drive should not affect booting. Your results may vary. Once the system is back up, log back into the console via your preferred method.
Verifying the Partitions
First, verify the partitions, because copying the wrong ones could lead to trouble! If your new drive already has a partition table, gpart show will list it as well.
root@OPNsense:~ # gpart show
=> 40 500118112 ada0 GPT (238G)
40 532480 1 efi (260M)
532520 1024 2 freebsd-boot (512K)
533544 984 - free - (492K)
534528 16777216 3 freebsd-swap (8.0G)
17311744 482805760 4 freebsd-zfs (230G)
500117504 648 - free - (324K)
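Before copying anything, the layout can also be checked mechanically. This sketch reduces gpart show output to index/type pairs and compares them with the four partitions listed above. The names gpart_layout, check_layout and expected_layout are hypothetical, and the parsing assumes the column order shown (start, size, index, type, size), with free-space rows marked by a - in the index column.

```shell
#!/bin/sh
# Sketch: summarize `gpart show` output as "index type" lines.

# gpart_layout: read `gpart show` output on stdin and print "index type"
# for each real partition, skipping the "=>" header and "- free -" rows.
gpart_layout() {
    awk '$1 != "=>" && $3 ~ /^[0-9]+$/ { print $3, $4 }'
}

# The layout of ada0 from the output above (hypothetical reference copy).
expected_layout='1 efi
2 freebsd-boot
3 freebsd-swap
4 freebsd-zfs'

# check_layout DISK: succeed if DISK has exactly the expected layout.
check_layout() {
    [ "$(gpart show "$1" | gpart_layout)" = "$expected_layout" ]
}
```

For example, check_layout ada0 && echo "ada0 layout as expected" confirms the source disk before anything is cloned from it.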
There is the partition table plus four partitions to reproduce on the new disk. We use dd for partitions 1 (efi) and 2 (freebsd-boot), while partitions 3 (freebsd-swap) and 4 (freebsd-zfs) are handled by their own tools. Next, turn off swap. Because both swap partitions are listed in /etc/fstab, swapoff reports an error for the one that does not exist yet on the new disk.
root@OPNsense:~ # swapoff -a
swapoff: removing /dev/ada0p3 as swap device
swapoff: /dev/ada1p3: No such file or directory
Cloning Partitions
Now comes the potentially dangerous part, so be VERY careful here. The source drive is ada0, and the new drive is ada1. First, clone the partition table from ada0 to ada1.
root@OPNsense:~ # gpart backup ada0 | gpart restore -F ada1
Next we clone partition 1:
root@OPNsense:~ # dd if=/dev/ada0p1 of=/dev/ada1p1
532480+0 records in
532480+0 records out
272629760 bytes transferred in 22.115694 secs (12327434 bytes/sec)
Then partition 2:
root@OPNsense:~ # dd if=/dev/ada0p2 of=/dev/ada1p2
1024+0 records in
1024+0 records out
524288 bytes transferred in 0.054477 secs (9623964 bytes/sec)
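Swapping if= and of= in either dd command would overwrite the good drive, so one cautious option is to print the commands first, read them back, and only then run them. This is a hypothetical helper (clone_cmds is my name, not a system tool) that assumes the adaXpN partition naming used above and defaults to partitions 1 and 2.

```shell
#!/bin/sh
# Sketch: print (not run) the dd commands for the small partitions so the
# copy direction can be checked before anything is written.

# clone_cmds SRC DST [PART...]: print one dd command per partition index,
# copying /dev/SRCpN to /dev/DSTpN. Defaults to partitions 1 and 2.
# Refuses (returns 1) if SRC and DST are the same disk.
clone_cmds() {
    src=$1 dst=$2
    shift 2
    if [ "$src" = "$dst" ]; then
        echo "refusing: source and target are both $src" >&2
        return 1
    fi
    [ $# -eq 0 ] && set -- 1 2
    for p in "$@"; do
        echo "dd if=/dev/${src}p${p} of=/dev/${dst}p${p}"
    done
}
```

For example, clone_cmds ada0 ada1 prints the same two dd lines used above; once they look right, clone_cmds ada0 ada1 | sh runs them.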
For the ZFS mirror, we use zpool to attach ada1p4 to the existing ada0p4 device in the zroot pool, as shown by zpool status above.
root@OPNsense:~ # zpool attach zroot ada0p4 ada1p4
You can verify that the pool is back to the expected state via zpool status:
root@OPNsense:~ # zpool status
pool: zroot
state: ONLINE
scan: resilvered 2.29G in 00:00:10 with 0 errors on Sat Feb 15 10:14:39 2025
config:
NAME STATE READ WRITE CKSUM
zroot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ada0p4 ONLINE 0 0 0
ada1p4 ONLINE 0 0 0
errors: No known data errors
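If you'd rather not read the table by eye, the check can be scripted. This sketch assumes the zpool status layout shown above and treats the pool as healthy only if every row in the config: section is ONLINE with zero READ, WRITE and CKSUM counts; pool_healthy is a hypothetical name. It does not look at the scan: line, so on OpenZFS 2.0 or later you may want to run zpool wait -t resilver zroot first, which blocks until resilvering finishes.

```shell
#!/bin/sh
# Sketch: succeed only if every pool/vdev/device row in the config: section
# of `zpool status` output (stdin) is ONLINE with 0 READ/WRITE/CKSUM errors.
# Does not inspect the scan: line, so it cannot tell if a resilver is running.
pool_healthy() {
    awk '
        /^ *config:/ { in_cfg = 1; next }
        /^ *errors:/ { in_cfg = 0 }
        in_cfg && NF >= 5 && $1 != "NAME" {
            if ($2 != "ONLINE" || $3 != 0 || $4 != 0 || $5 != 0) bad = 1
            rows++
        }
        END { exit (bad || rows == 0) }
    '
}

# Usage on the live system:
#   zpool status zroot | pool_healthy && echo "zroot looks healthy"
```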
Finally, turn swap back on. Because /etc/fstab already lists /dev/ada1p3 and that partition now exists, this takes care of the third partition.
root@OPNsense:~ # swapon -a
swapon: adding /dev/ada0p3 as swap device
swapon: adding /dev/ada1p3 as swap device
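To confirm that every swap device in /etc/fstab is actually active, the two lists can be compared. This sketch assumes the FreeBSD fstab layout (device in the first field, swap in the third) and that swapinfo prints one device path per row in its first column; fstab_swaps and missing_swaps are hypothetical names.

```shell
#!/bin/sh
# Sketch: report swap devices from fstab that are not currently active.

# fstab_swaps: print the device of each non-comment swap entry in fstab (stdin).
fstab_swaps() {
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }'
}

# missing_swaps FSTAB ACTIVE: print each fstab swap device that does not
# appear in the first column of ACTIVE (e.g. saved `swapinfo` output).
missing_swaps() {
    fstab_swaps < "$1" | while read -r dev; do
        awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$2" ||
            echo "$dev"
    done
}
```

On the firewall, swapinfo > /tmp/swap.now && missing_swaps /etc/fstab /tmp/swap.now should print nothing when both swap partitions are in use.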
At this point, it's a good idea to open the GUI, navigate to System: Settings: Cron, and verify that the SMART tasks are configured the way you want.