I was migrating some old databases (I’ll write more about this later) to a new database server a couple of nights back, and while that went mostly alright, things were not looking so good in the morning. I was driving on the highway, taking my daughter to daycare, when the phone rang. I recognized the number as my customer’s, so I answered.
Soon I heard that while everything had seemed okay last night, come the morning and more users, people were now having trouble connecting to the database with the client software. There were random errors saying that the database could not be opened, and so forth. He had also run ping against the cluster node running the SQL Server instance, and packets were being dropped every now and then, while the passive node had no such problems. I told him I’d look right into it as soon as I got to the office.
I looked at my 2-year-old daughter through the mirror and told her that for this exact reason you should always keep your drivers and software up to date on your servers, and that before taking a system to production, it should be thoroughly tested. She then told me that she’d like to hear a song about a wooden horse and then one about a cat, which we then did.
Back at the office I logged into the production box and quickly looked through the relevant logs (Windows, SQL Server, Cluster). They were error free, and the Network Interface counters in Performance Monitor were also looking quite good, while the network traffic was not really all that heavy. I also knew there had been no changes to the network topology itself, so it pretty much had to be something on the active node that now reacted to the increased user load.
I opened a Command Prompt and used netstat to see if there were connections:
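The original output isn’t reproduced here, but the check was along these lines (the port filter is an assumption on my part; 1433 is merely the default SQL Server port):

```shell
:: List active TCP connections on the node (Windows Command Prompt).
:: Filtering on the default SQL Server port 1433 is illustrative only;
:: a named instance may listen on a different port.
netstat -an | findstr :1433
```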
There seemed to be a number of them, and I confirmed from SQL Server that there were a lot of clients connected to the server. Then a thought hit me: I had seen something similar before, but that was a while back. So I typed in the next command:
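The exact command isn’t shown in the post, but on Windows the connection offload state can be viewed with netstat’s -t switch, something like:

```shell
:: Show current TCP connections together with their offload state
:: (InHost vs. Offloaded) on Windows.
netstat -t
```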
And true enough, there were connections in the Offloaded state. I then confirmed with the netsh command that both the TCP Chimney Offload and Receive Side Scaling features were enabled:
netsh int tcp show global
I quickly disabled TCP Chimney Offload as well as Receive Side Scaling, and sure enough, the customer soon called back to inform me that whatever I had done had fixed the issue. I was actually quite surprised that this turned out to be the culprit, as I really haven’t had any problems with Chimney Offload in many years. I told the customer to check with the hardware vendor whether there were any driver updates for their NICs, or any other updates/fixes for the issue. If there are, we’ll probably attempt to enable these features again at some point to get their benefits. The commands used to disable Chimney Offload and Receive Side Scaling are:
- netsh int tcp set global chimney=disabled
- netsh int tcp set global rss=disabled
If you’re not familiar with the TCP Chimney Offload and Receive Side Scaling features: Chimney Offload moves the CPU work of processing network I/O onto the network adapter itself, freeing your CPUs to do something else. RSS, on the other hand, spreads the CPU load of network processing evenly across multiple CPU cores. Generally speaking, these are things you want to happen, and as I said, there are usually no issues, as hardware and software these days are designed to take advantage of these features. But sometimes, when this is not the case, problems similar to what I bumped into might occur.
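Should the vendor ship fixed drivers, re-enabling the features is simply the mirror image of the disable commands above, followed by a verification step:

```shell
:: Re-enable TCP Chimney Offload and Receive Side Scaling
:: once fixed NIC drivers are in place (mirror of the disable commands).
netsh int tcp set global chimney=enabled
netsh int tcp set global rss=enabled
:: Verify the new state:
netsh int tcp show global
```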