Hardware Headaches

Earlier this year I began to notice my muskie reel running very rough. Instead of a smooth, quiet spin, you could hear the gears grinding against each other as I cranked on the handle. Even worse, the reel would often skip or reverse when I yanked the rod at the same time as I was pulling in line with the handle. I realized this was not going to go away or fix itself, so I sent the reel in to Badger Reel Repair. They were able to fix the reel, but they had to replace many of the interior parts. I was informed rather directly that my current reel, an Abu Garcia Ambassadeur 6600 C4, was not built to handle the type of fishing that I do. Ripping big rubber baits such as Bulldawgs, as well as large bucktails like twin 10s, will wreak havoc on a reel. As a result, I’m in the market for a new reel. It’s kind of disappointing because my C4 is a $140 reel, and I’d expect that to stand up to anything, but I guess that is not the case. My prime choices for a new reel so far are the Shimano Calcutta 400B, the Abu Garcia Revo Toro NaCl, and the new Abu Garcia Revo Toro S, which doesn’t come out until September. Although it will stink to shell out $200–$300 on a new reel, it will be fun to use some new equipment and to compare its performance to that of my current reel.

[Image: Shimano Calcutta 400B]
It can be pretty frustrating when equipment (or hardware) doesn’t work the way it should. During our recent server upgrade, I put forth great effort to ensure everything would go smoothly. I created a set of procedures and practiced those procedures multiple times. I closely inspected the current server to find out what needed to be migrated and what could be left and forgotten. I did everything I could think of to make this a success and a positive experience for the business.
The day of the migration went smoothly. There were a few minor hiccups, the worst of them involving database mail, but overall everything went fine. I did the upgrade on a Saturday, and everything seemed gravy into the next week. Then on Wednesday night, the new SQL Server errored out. I checked the logs, and they said the data and log files were inaccessible. I went to check for the files and found the drives were gone! A reboot of the server brought them back, and after running DBCC CHECKDB on all the databases I was able to confirm we were back online. A check of the Windows Event Logs showed one error at that time: “The driver detected a controller error on \Device\RaidPort5”. This isn’t the type of error that most DBAs can troubleshoot; in fact, it eventually took several IBM techs to get to the bottom of the issue.
We tried updating some firmware, but the issue occurred again the following night. This time our system administration vendor brought in an IBM tech to troubleshoot. After multiple hours on hold with IBM support, the IBM tech found the RAID card to be loose in the server node. He clipped it back in tightly and we gave it another go. That night, the issue occurred yet again. At this point, the disruptions to our production environment were just too great. Later that evening, the Friday after the initial upgrade, we failed back over to our old database server. The issue is still ongoing, but we are hopeful that a replacement RAID controller will fix it.
One frustrating thing is that the issue never occurred during the month and a half we had the new server in standby, and it hasn’t occurred since we took it out of production. We don’t really know how to tell whether the issue is fixed without putting the server back into production, where it causes severe problems when it bombs. The other frustration is how I perceive this incident affects other people’s impression of my work. I tried very hard to make the process go smoothly, but in the end I was derailed by something that was never in my control. All the users know is that I upgraded the server, and suddenly we are having system outages. I wish there were a lesson here, but in retrospect I can’t think of anything I could have done differently. The only way we might have caught this is by simulating a production-type load on the new server before putting it into production (which we don’t have the capability to do at this small employer), and even then, the failures occurred off hours when the load was very minimal, so we may not have found it regardless.


Testing Disk Speed

It has been a rather busy couple of months. We’ve had lots of big projects at work and lots of departmental changes to go with them. Additionally, it’s finally summer and I’ve been fishing as much as possible. This being my second year with a boat, I’ve been making the most of my free time, including lots of fishing with the kids at nearby Silver and Little Cedar lakes, and a fair amount of muskie fishing with friends and family down at Pewaukee Lake.
[Photo: 34″ northern pike, 2015-07-08]
Actually, just last night I was fishing with a couple of friends on Pewaukee. We didn’t have much luck, but we did find a 34″ northern pike floating in the middle of the lake. It clearly hadn’t been dead more than an hour or two. It had some thin, deep cuts toward the back of it, probably the result of another fish biting it.
Anyway, my boat is fairly small for three guys, and even with the motor on full throttle the best speed we could make was about 18 mph. That felt pretty slow, since I had recently been out in a couple of other people’s boats, one that could get up to 40 mph and another that hit 45 mph (and may have been able to go faster). There are things you can do to increase your speed while moving across the lake, such as making sure your boat is level, but the size of the engine really does make the biggest difference.
The week before the big database server upgrade that I was recently working on, I decided to test the disk speed on the new server. I was confident we’d be getting a big boost because we were going from spinning disks on the old server to SSDs, which are supposed to be up to 40X faster. I was pretty excited to run some tests and see what kind of improvement I got from it.
I decided to use a tool called diskspd to measure the speeds. This tool, available from Microsoft TechNet, was not created specifically for SQL Server, but it can be configured to run effective tests on SQL Server disks. I created the following batch file to run tests on the OS drives, the data file drives, the log file drives, and the tempdb drives.

@echo off
echo Automated I/O testing with Diskspd, Microsoft tool available at https://gallery.technet.microsoft.com/DiskSpd-a-robust-storage-6cd2f223
REM Author: 		Dave Gugg
REM Create Date: 	2015-05-06
REM Description: This script will automate I/O testing on a new database server.  It is a good idea to
REM           	 run this test on both the new and old server (make sure old server database is offline)
REM			 	 to get an idea of how I/O will change when moving a SQL Server instance to a new server.

REM Description of applicable diskspd arguments:
REM -b: Block size.  SQL Server uses 8K pages, but also sometimes reads full extents (64K) and read-aheads (512K).
REM -c: Size of the workload files to create.  Compare this to your mdf or ldf file sizes.
REM -d: Test duration, in seconds
REM -h: Disable hardware and software caching, from within the Windows OS layer.  This mimics SQL Server behavior.
REM -L: Capture latency info.
REM -o: Outstanding I/Os (queue depth) per worker thread.  Make this higher to increase intensity of test.
REM -r: Random reads/writes.  Use random for data file drives.
REM -si: Sequential reads/writes using a single interlocked offset shared across threads.  Use sequential for log file drives.
REM -t: Worker threads.  Set this to number of cores on server.
REM -w: Write percentage.  The remaining percentage will be reads, so setting this to 20 will test based on 80% reads and 20% writes.
REM -Z: Separate read and write buffers, and size the write source buffer.  A larger buffer will have more randomized data to write.

cd /d "C:\Diskspeed"

echo Starting OS drive tests
REM System databases are kept on the OS drive

diskspd -b8K -d60 -h -L -o8 -t32 -r -w5 -Z250M -c5G c:\DiskSpeed\io.dat>OS_IO_8k_result.txt
diskspd -b64K -d60 -h -L -o8 -t32 -r -w5 -Z250M -c5G c:\DiskSpeed\io.dat>OS_IO_64k_result.txt
diskspd -b512K -d60 -h -L -o8 -t32 -r -w5 -Z250M -c5G c:\DiskSpeed\io.dat>OS_IO_512k_result.txt

echo OS drive tests complete
echo(
echo(
echo Starting data drive tests
REM Data drives hold the mdf and ndf files

diskspd -b8K -d60 -h -L -o8 -t32 -r -w20 -Z1G -c50G D:\SQLData\io.dat>DF_IO_8k_result.txt
diskspd -b64K -d60 -h -L -o8 -t32 -r -w20 -Z1G -c50G D:\SQLData\io.dat>DF_IO_64k_result.txt
diskspd -b512K -d60 -h -L -o8 -t32 -r -w20 -Z1G -c50G D:\SQLData\io.dat>DF_IO_512k_result.txt

echo Data drive tests complete
echo(
echo(
echo Starting log drive tests
REM Log drives hold the ldf files

diskspd -b8K -d60 -h -L -o8 -t32 -si -w90 -Z5G -c10G E:\SQLLogs\io.dat>LF_IO_8k_result.txt
diskspd -b64K -d60 -h -L -o8 -t32 -si -w90 -Z5G -c10G E:\SQLLogs\io.dat>LF_IO_64k_result.txt
diskspd -b512K -d60 -h -L -o8 -t32 -si -w90 -Z5G -c10G E:\SQLLogs\io.dat>LF_IO_512k_result.txt

echo Log drive tests complete
echo(
echo(
echo Starting tempdb drive tests
REM Tempdb needs its own drive since it can get very intense usage

diskspd -b8K -d60 -h -L -o8 -t32 -r -w33 -Z5G -c20G G:\TempdbData\io.dat>TF_IO_8k_result.txt
diskspd -b64K -d60 -h -L -o8 -t32 -r -w33 -Z5G -c20G G:\TempdbData\io.dat>TF_IO_64k_result.txt
diskspd -b512K -d60 -h -L -o8 -t32 -r -w33 -Z5G -c20G G:\TempdbData\io.dat>TF_IO_512k_result.txt

echo Tempdb drive tests complete
echo(
echo All testing complete, press any key to exit...
pause > nul
cls
exit
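When reading the reports these tests produce, it helps to remember that diskspd’s two headline figures are tied together by the block size set with -b: throughput in MB/s is just IOPS multiplied by the block size. A quick sketch of the relationship (the numbers here are made up for illustration, not measured results):

```python
def iops(mb_per_s, block_kb):
    # MB/s = IOPS * block size, so IOPS = throughput / block size.
    # diskspd reports both figures; this only shows how they relate.
    return mb_per_s * 1024 / block_kb

# A hypothetical 80 MB/s result at the 8K block size used above:
print(iops(80.0, 8))   # 10240.0 IOPS
# The same throughput at 64K blocks is far fewer operations:
print(iops(80.0, 64))  # 1280.0 IOPS
```

This is also why the 512K runs will show much higher MB/s than the 8K runs even when the drive is doing a similar amount of work.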

Next I took the SQL Server service offline on our current production server (during a maintenance window) so I could run the tests while both servers were mostly idle. The first thing I noticed was that OS drive performance did not improve. This made sense, since the OS drives on the new server were not SSDs but the built-in spinning disks. Here are the results from the rest of the tests:

[Image: 20150723 diskspd results]
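Each diskspd run above redirects its report to a text file, so lining the runs up side by side means scraping the “total:” rows out of those files. Here is a short sketch of that; the sample row is hypothetical, but it follows the pipe-delimited shape of diskspd’s report, where the third column is MB/s.

```python
def total_mbps(report_text):
    """Pull the MB/s column from each 'total:' row of a diskspd report.

    diskspd prints per-thread rows and then a 'total:' row for the read,
    write, and combined sections; columns are pipe-delimited and the
    third one is MB/s.
    """
    rates = []
    for line in report_text.splitlines():
        if line.lstrip().startswith("total:"):
            cols = [c.strip() for c in line.split("|")]
            rates.append(float(cols[2]))
    return rates

# Hypothetical row in the shape of a diskspd 'total:' line:
sample = "total:  5242880000 |  640000 |  83.33 |  10666.67 |  0.750 | io.dat"
print(total_mbps(sample))  # [83.33]
```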

Overall, I saw about a 20X improvement in speed. It will be interesting to see how this improves the performance of the applications using this database server, as well as the speed of ad-hoc queries.