When I’m dealing with a problem on a Failover Cluster (not very often, but sometimes), one of the first things I do is run the Validation Test. It’s a great tool that’ll usually show what the problem might be, but apparently not always…
For the last couple of days I’ve been busy wreaking havoc on a cluster with a Microsoft Cluster PFE in a Cluster Disaster Recovery workshop. Among the scenarios we went through (causing, fixing and then documenting each one), there was one that had a small surprise in store for both of us.
The scenario: Someone creates an Active Directory Group Policy that configures the Firewall and, among many other ports, blocks incoming traffic to 3343/TCP. 3343/TCP is a port used by the Cluster Service, and it’s the one over which the nodes communicate with each other. If you’re running two nodes and a witness disk, blocking incoming traffic to this port will cause a single node to fail. If you’re running a cluster with an uneven number of nodes, it’ll cause the whole cluster to fail.
This is what happened in our scenario: one of the nodes went down as expected. After looking at the cluster logs we figured out that we were dealing with a network issue, and we decided to confirm this by running the Cluster Validation Test. We chose to run only the Network part of the test, as we knew that we had a network issue.
However, after running the Network part of the Validation Test, it came back all green, saying that the Firewall settings were fine and the nodes had full connectivity between each other. The only warnings we had were for the subnet setting on a clustered SQL Server instance and for an APIPA address on one of the NICs.
We knew for a fact that this was not the case: the Group Policy was applied to all the nodes and there was a Firewall rule in place that blocked traffic to port 3343, as seen in the screenshot below.
We also confirmed that this rule was doing what it was supposed to by using one of the new PowerShell cmdlets (available in Windows 8.1 and Windows Server 2012 R2), Test-NetConnection. It straight up told us that connecting to port 3343 failed on every interface, even though the ping always succeeded.
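The point of that check is that a successful ping says nothing about whether a specific TCP port is reachable. For anyone who wants to run the same kind of probe from a machine without the PowerShell cmdlet, here is a minimal sketch in Python. It is not what we used in the workshop; the function name `tcp_port_open` and the default timeout are my own choices for illustration, and it only tests whether a TCP connection to the port can be opened, nothing cluster-specific.

```python
import socket

# Port the Cluster Service uses for node-to-node traffic, as described above.
CLUSTER_PORT = 3343


def tcp_port_open(host: str, port: int = CLUSTER_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    A refused connection, a timeout (typical when a firewall silently
    drops the packets) and a name resolution failure all count as False.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, socket.timeout and socket.gaierror.
        return False
```

Calling `tcp_port_open("node1.example.local")` (a made-up host name) against each node would be expected to return `False` while a rule like the one in this scenario is blocking the port, even though the node still answers ping.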
While I still think the Validation Test is a great tool for verifying the health and status of your Failover Cluster and it can be a great aid in troubleshooting, it’s not infallible. I can almost understand that it doesn’t check for custom Firewall rules, only the ones that are part of the Windows Firewall. But I was very surprised to learn that it didn’t check connectivity to the actual Cluster Service using the proper port, and instead just did a ping or a traceroute (it actually does check the port; see the update below for the explanation).
We also ran this same test after destroying the cluster and then re-creating it. The Validation Test found no problems in any of the tests, but creating the cluster naturally failed. The operating system we were testing with was Windows Server 2012 R2 (Build 9600).
Update: I just received an email from the Microsoft PFE, who had figured out why the Validation Test succeeded even with the Firewall rule in place. When the Validation Test is run, the nodes “repair” themselves by creating an invisible Firewall rule that allows traffic to 3343/TCP. The nasty part is that the rule is removed after the Validation Test, which causes cluster creation to fail.