I have a small single host ESXi on a Dell PowerEdge R710 (popular for homelab) that I support for family. It has all the normal stuff on it:
- Windows Domain Controller with DHCP
- Windows Domain Controller
- General purpose server for WSUS, MDT and whatnot.
- Multiple MDT Build and Capture
- Multiple Windows 10 Insider Preview
- ???
All normal homelab stuff. Recently there was a power loss event that lasted only a few minutes and should not have been noticed, but it caused the battery in the UPS that protects this host to blow up, literally, so all the VMs and the host went down hard.
There is a physical Ubuntu Linux server, along with the network equipment, on a different (matching) UPS. It controls both UPSes, so if a power loss happens it will SSH into the ESXi host, pause the VMs, and shut the host down gracefully if needed. Sooooooo did not plan for batteries exploding.
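For anyone wanting to build something similar, here is a rough sketch of the kind of logic the Linux box could run. It is an illustration under assumptions, not the actual script in use here: it assumes the stock vim-cmd tool and the poweroff command on the ESXi shell, the function names are made up, and it only builds the command strings, leaving the SSH transport to you.

```python
def parse_vm_ids(getallvms_output):
    """Pull the numeric VM ids out of `vim-cmd vmsvc/getallvms` output."""
    vm_ids = []
    for line in getallvms_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric Vmid; the header row does not.
        # Caveat: a wrapped annotation line that happens to start with a
        # number would be picked up too, so sanity-check the result.
        if fields and fields[0].isdigit():
            vm_ids.append(int(fields[0]))
    return vm_ids


def shutdown_plan(vm_ids):
    """Return ESXi shell commands: suspend each VM, then power off the host."""
    commands = [f"vim-cmd vmsvc/power.suspend {vm_id}" for vm_id in vm_ids]
    commands.append("poweroff")
    return commands
```

The idea would be to run vim-cmd vmsvc/getallvms over SSH first, feed its output to parse_vm_ids, and then send each command from shutdown_plan in order.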
While we recently bought VMware vSphere Essentials, which gives us the backup API, it's not in use just yet. When the host and the VMs were set up they were on the free license, so everything else used free options too. Therefore, to protect the "critical" Windows Server VMs, we used the built-in Windows Server Backup tool and did a daily Full Server backup. These are sent to a UNC path on the above Linux server, as it has lots of space and is external to the host. Due to limitations in the built-in backup software, only a single full backup is kept, so there are no incrementals.
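If you would rather script the daily backup than click through the GUI, the same tool is available from the command line as wbadmin. The helper below is only a sketch of how the command could be assembled: the share path in the usage note is a made-up example, and -allCritical covers the volumes needed for a bare-metal restore, which may be less than what the GUI's Full Server option includes on a multi-volume server.

```python
def wbadmin_backup_command(unc_target, user=None):
    """Build the argument list for a one-off wbadmin backup to a network share."""
    args = [
        "wbadmin", "start", "backup",
        f"-backupTarget:{unc_target}",  # UNC path of the share to write to
        "-allCritical",                 # include volumes needed for bare-metal recovery
        "-quiet",                       # do not prompt, so it can run unattended
    ]
    if user:
        # Account with write access to the share (hypothetical example value).
        args.append(f"-user:{user}")
    return args
```

For example, wbadmin_backup_command(r"\\linuxbox\backups") returns a list you could hand to subprocess.run from an elevated scheduled task on the server.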
So upon bypassing the UPS and bringing the host and VMs back up, one of the Domain Controllers was not happy: it kept blue screening and was stuck in a restart loop.
Solutions for this error were to go through Directory Services Restore Mode (DSRM). Thankfully this was the DC that is only a DC, with no FSMO roles or other duties, so it was pretty disposable. Since the backup runs daily and was less than 24 hours old, I decided to do a bare-metal restore instead, since it had been a while since I did a restore exercise.
First step was to shut it down and take a snapshot so I could revert and go the Directory Services Restore Mode route if I chose to. Then boot the VM from the 2012 R2 install ISO.
Select Next on the initial dialog, then Repair your computer at the bottom, then Troubleshoot, and then finally System Image Recovery.
Since my backup is not local but on a Linux server, the wizard will error out, so we need to get connected to Samba on the Linux box.
Choose Cancel to close this dialog, then select Next to proceed. Then click on Advanced... so you can enter a UNC path.
Here you would locate a backup on the network. However, the second option reminded me that I used VMXNET3 for the NIC on this VM, and that driver is not on the 2012 R2 install media since it is installed by VMware Tools.
If you try to attach to the UNC path you will just get an error, since there is no usable NIC in this PE instance. You can verify this by opening a Command Prompt from Advanced Options and running ipconfig and the like.
I previously wrote about where to source the drivers, but that meant getting a source (floppy or CD) mounted on this broken VM. There is a simpler solution on ESXi: I edited the VM and added a second NIC that is supported by the install media, which in this case is an E1000e.
After giving DHCP a few seconds to kick in, I selected Back until I could select Advanced... again, then selected Search for a system image on the network, which then asks if you want to bring the network up.
Select Yes and enter the UNC path to your backups, and of course the credentials to get at them.
Now we have all our server backups listed, so we can select the one we want and then Next.
It then asks us which volumes to restore. Since it's a DC and only has the one, we select it and Next again.
Since it is bare-metal and going to the same "hardware" you just need to select Next here as we have nothing to change.
And now we are at the confirmation dialog. Finally! Just have to select Finish and let it restore.
Nope, still one more dialog to be sure we are REALLY ready to restore. Select Yes.
Away it goes.
This is weird. Got an error on the restore. I got the same error each time I retried, but at different points in the restore, so the backup itself still seemed good to me.
This is such a generic error that I thought it might be NIC related, so I went back to ESXi, created a small drive on the multipurpose server VM, and copied the backup to it. Note that you need to create a folder in the root of the drive called WindowsImageBackup and then place the system's backup folder inside it. Once the copy completed, I unmounted the drive from that VM, mounted it to the broken VM, and ran through the wizard again. This was all done live; the PE instance saw every hardware change as I made it. This time the wizard found the backup right at the beginning since it was local to the VM, so I did not have to go through Advanced like before.
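The folder layout is the part that is easy to get wrong, so here is a small sketch of the copy step. The function name and the paths in the usage note are made up for illustration; the only thing taken from the steps above is that the server's backup folder has to sit inside a WindowsImageBackup folder at the root of the drive.

```python
import shutil
from pathlib import Path


def stage_backup(server_backup_dir, drive_root):
    """Copy one server's backup folder to <drive_root>/WindowsImageBackup/<server>."""
    src = Path(server_backup_dir)
    dest = Path(drive_root) / "WindowsImageBackup" / src.name
    # Create the WindowsImageBackup folder at the root of the drive if needed.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # copytree raises FileExistsError if dest already exists, which avoids
    # silently mixing two different backups together.
    shutil.copytree(src, dest)
    return dest
```

Something like stage_backup(r"\\linuxbox\backups\WindowsImageBackup\DC02", "E:\\") would be the idea, with DC02 and E: standing in for your own server name and the newly added drive.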
After selecting Next you are taken to the Choose additional restore options dialog, and it's the same from there on out as above, just with no errors this time!
Once complete it restarted and the restored server booted back up in its earlier state.
I did, however, interrupt the restart to unmount the restore volume and delete the E1000e NIC I had attached. I looked at the event logs to make sure AD replication was happy, and then after a couple of days I removed the snapshot I took in the beginning.
Even without the ESXi parts at play, Windows Server Backup can be a free yet powerful tool to back up servers and restore them. On physical hardware you could copy the backup to a USB drive if you had the same NIC issues.
Now to work out a process around exploding UPS batteries. As a preventative measure, I did copy out the same-day backups of the other two servers in case they had this happen, but I am happy to say they have been running great since this incident several weeks ago. Until I have time to research free ESXi backup options that work with my VMware Essentials license, I have added these backups to my rsnapshot.conf on the Linux box to get quasi-incremental backups.
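For reference, an rsnapshot backup entry is a single line with its fields separated by tabs, and spaces will not work. The helper below is a sketch that formats such a line; the paths in the usage note are placeholders, not my actual layout.

```python
def rsnapshot_backup_line(source_dir, dest_subdir):
    """Format a `backup` line for rsnapshot.conf (fields must be TAB separated)."""
    # Normalise both paths to end in exactly one trailing slash.
    source = source_dir.rstrip("/") + "/"
    dest = dest_subdir.strip("/") + "/"
    return "\t".join(["backup", source, dest])
```

For example, rsnapshot_backup_line("/srv/backups/WindowsImageBackup", "windows") gives a line you can paste into rsnapshot.conf, after which rsnapshot configtest will confirm the syntax.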
-Kevin