Challenges with image-based backups

I haven’t done a lot of muskie fishing in the past month. It’s been a busy summer (as always), and I’ve been sticking to short fishing trips for “lesser” species. I made a couple of trips to Little Cedar this past weekend and both were fun and fairly successful. The first was with a friend from church and his young son. They caught several bluegills and the boy had a blast, although he probably enjoyed the snacks more than the fishing. I caught a couple of northern pike, the biggest of which was 19″. It was almost big enough to keep, but I ended up letting it go.

The next night I took my five-year-old out for an evening of fishing. We got some leeches and were targeting walleye, but we only ended up with some largemouth bass. I think we were getting bites from walleye, but we didn’t manage to hook any. Below is a picture of a nice largemouth bass my son caught on a Skitter Pop. The fish made a couple of spectacular jumps before coming to the boat.

[Photo: Nate’s largemouth bass, August 6, 2016]

Yesterday was one of the most challenging days I’ve had with my current employer. We’re in the process of changing managed service providers for our IT infrastructure, and changes like these are never easy. Yesterday’s problems stemmed from the backup solution change. Both our old and new MSPs use a disk image-based backup solution, but they are different brands. In a previous post, I made it clear that I am not a fan of image-based backups for databases. Most relational database management systems have their own backup mechanisms, and I understand and trust those backups far more than image backups. However, different backups can serve different purposes, and in this case I have no choice in the matter.

The problem unfolded over the course of the morning. I got a report that our Oracle database was not available, and sure enough, I could not access it either. I used Remote Desktop to get onto the server to check the service and found even that connection spotty. I recycled the Oracle database service, but that did not fix the issue. Since everything was slow and intermittent, I suspected either a network or a disk issue. The new MSP was able to add some diagnostics to the server, and we found no hardware issues with the disks. There also didn’t appear to be any network connectivity problems.
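
For anyone who wants to do a similar sanity check, something as simple as the sketch below can show whether a box is hammering its disks or network while connectivity feels spotty. This is not the MSP’s actual tooling, just a rough illustration using the third-party psutil package; the five-second sample interval is an arbitrary choice.

```python
# Rough illustration (not the MSP's diagnostics): sample disk and network
# throughput every few seconds to see whether the server is I/O-bound.
import time
import psutil

INTERVAL = 5  # seconds between samples (arbitrary)

def sample():
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return disk.read_bytes, disk.write_bytes, net.bytes_sent, net.bytes_recv

prev = sample()
while True:
    time.sleep(INTERVAL)
    cur = sample()
    # Convert cumulative counters into per-second rates.
    rates = [(c - p) / INTERVAL for c, p in zip(cur, prev)]
    print("disk read {:>12.0f} B/s | disk write {:>12.0f} B/s | "
          "net sent {:>12.0f} B/s | net recv {:>12.0f} B/s".format(*rates))
    prev = cur
```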

At this point we attempted a server reboot. That didn’t help either. Since the Oracle database is pretty much the only thing running on the server, we figured the problem had to be either Oracle or the antivirus software. We disabled both and rebooted the server again. This time it appeared to help. We brought Oracle back up but left the antivirus off. Everything worked fine for about 45 minutes, so we figured there must’ve been some sort of conflict between Oracle and the antivirus software. To win the worst-timing-of-the-day award, I sent out an all-clear email to the users about one minute before the server started breaking again. We saw the same symptoms as before: spotty connectivity and extreme slowness.

After yet another reboot, we left both Oracle and the antivirus off. When the slowness returned yet again, we knew it had to be something else running on the server. A quick check of the Programs list showed that the image-based backups for the previous MSP were still running (as they should have been). That was enough of a clue for a lightbulb moment. In preparation for the new backup solution, a disk defragmentation had been run the previous night. The current backup solution takes a single full backup, then continuous incrementals afterward. The defragmentation had shuffled data all over the disk, so from the backup software’s point of view nearly every block had changed, leaving the next incremental with an enormous amount of data to process. Looking at the backup manager, we found the next incremental would take a full two days to complete. We also discovered that each time we’d had problems, there was an incremental backup scheduled to run.
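
To see why the defrag hurt so much, here is a toy sketch of how a block-level incremental decides what to copy. This is not the backup vendor’s actual logic; the 4 KB block size, the SHA-256 hashing, and the image file names are assumptions for the illustration. The point is that a defrag moves data between blocks, so almost every block hashes differently and the “incremental” balloons toward the size of a full backup.

```python
# Toy illustration (not the vendor's implementation) of block-level change
# detection: hash every fixed-size block and compare against the last backup.
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the example

def block_hashes(image_path):
    """Hash every block of a disk image, in on-disk order."""
    hashes = []
    with open(image_path, "rb") as img:
        while True:
            block = img.read(BLOCK_SIZE)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def changed_blocks(previous, current):
    """Indexes of blocks whose contents differ from the last backup."""
    return [i for i, (old, new) in enumerate(zip(previous, current)) if old != new]

# Hypothetical usage: after a defrag shuffles the blocks around, nearly every
# index shows up as changed even though no file contents changed.
before = block_hashes("disk_before.img")  # hypothetical pre-defrag image
after = block_hashes("disk_after.img")    # hypothetical post-defrag image
print(f"{len(changed_blocks(before, after))} of {len(before)} blocks changed")
```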

Fairly confident we’d found our culprit, we disabled the backup and turned Oracle and the antivirus software back on.  As expected, there were no issues for the remainder of the day.

In the end, the combined disk load from the database and the backup solution proved to be too much for the server to handle. Through methodical testing, we found the problem and fixed it.