One thing that we always set up on a new server is a Performance Monitor “black box”. The black box is basically a PerfMon collector that starts when the server does and runs continuously in the background, gathering performance data from a number of vital counters. It has minimal performance overhead and gives you a good idea of what has been going on in your servers during the last several days.
What to collect?
This really depends on what you consider to be vital. Personally, I collect information about the disks (both logical and physical), memory, processor, processes and network. The things that I consider vital are latencies, throughput, IOPS, memory and CPU usage, as well as possible connectivity errors and the number of connections made. You might have noticed that there aren’t any SQL Server counters listed, and there’s a reason for that.
I like to keep the list of counters as compact as possible, because the more data you collect, the shorter the time period you can cover. Also, the counters mentioned above will, most of the time, tell you if there have been performance issues. And while they might not give you the exact cause (you’ll need more specific monitoring for that), they’ll give you a starting point, because you can see, for example, whether it’s related to CPU or storage, and you can pin it to a process.
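The trade-off between counter count and history length can be put into rough numbers. The following is a minimal sketch, not a measurement: the function name and the bytes-per-sample figures are hypothetical, and the real figure depends on how many counter instances your server has, so it is best measured by watching how fast your own log file grows.

```python
def coverage_days(max_file_mb, sample_interval_s, bytes_per_sample):
    """Estimate how many days of history fit in a size-capped circular log."""
    max_bytes = max_file_mb * 1024 * 1024
    # Number of samples the file can hold before it starts overwriting itself.
    samples_that_fit = max_bytes / bytes_per_sample
    # Each sample represents one interval of wall-clock time.
    return samples_that_fit * sample_interval_s / 86400


# Hypothetical example: 500 MB file, one sample per minute.
# 50 KB and 100 KB per sample are placeholder values, not measured ones.
print(round(coverage_days(500, 60, 50 * 1024), 1))   # 7.1
print(round(coverage_days(500, 60, 100 * 1024), 1))  # 3.6
```

Doubling the amount of data written per sample halves the period the file covers, which is why a compact counter list pays off.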
Setting it up
Setting it up is rather simple. First, create a folder for the results. If it’s a cluster, use a local disk on a node so it’s available all the time. I usually create one called “PerfMon” or something similar. Then create a text file that lists the counters you’re interested in, one per line. For example, if you want to monitor disk latencies, IOPS and queue lengths, you’ll add this:
\LogicalDisk(*)\Avg. Disk sec/Read
\LogicalDisk(*)\Avg. Disk sec/Write
\LogicalDisk(*)\Current Disk Queue Length
\LogicalDisk(*)\Disk Transfers/sec
\PhysicalDisk(*)\Avg. Disk sec/Read
\PhysicalDisk(*)\Avg. Disk sec/Write
\PhysicalDisk(*)\Current Disk Queue Length
\PhysicalDisk(*)\Disk Transfers/sec
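If you maintain the same list for several servers, the lines can also be generated instead of typed. This is only a sketch: the helper name and the dictionary layout are my own, and the counter names are simply the ones from the list above.

```python
def build_counter_lines(counters_by_object):
    """Return one wildcard counter path per (object, counter) pair."""
    lines = []
    for obj, counters in counters_by_object.items():
        for counter in counters:
            # (*) selects every instance of the object, e.g. every disk.
            lines.append(f"\\{obj}(*)\\{counter}")
    return lines


DISK_COUNTERS = [
    "Avg. Disk sec/Read",
    "Avg. Disk sec/Write",
    "Current Disk Queue Length",
    "Disk Transfers/sec",
]

counter_lines = build_counter_lines(
    {"LogicalDisk": DISK_COUNTERS, "PhysicalDisk": DISK_COUNTERS}
)
# One path per line is the layout the counter file needs.
print("\n".join(counter_lines))
```

Writing the joined text to a file in your PerfMon folder gives you the same counter file as the hand-written version.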
If you’re not sure what counters are available, you can get that information very easily from Windows by running the following command in a command prompt:
TYPEPERF -Q >> all_counters.txt
The list of all the counters on my laptop is about 2300 rows; on servers you’ll find a few more counters than that. After you have them in a text file, it’s just a matter of copy-pasting to get them into your Black Box counter file. Once you have the text file set up, it’s time to create the collector. This is done by running the following command in an elevated command prompt:
LOGMAN CREATE COUNTER BlackBox -cf blackbox_counters.txt -si 01:00 -f bincirc -o "C:\PerfMon\BlackBox_%computername%" --v -max 500
The command above creates a collector called BlackBox using the counters from the text file. It takes a sample every minute and saves the results to a circular binary file (so when the file reaches 500 MB, it starts overwriting the data from the beginning). Notice the double dash before the “v” parameter; it is needed to remove the versioning info from the file name. Without it, every time the collector starts it’ll create a new file called BlackBox_%computername%_datestamp.BLG, thus potentially filling your hard drive with BLG files.
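If you roll the black box out to many servers from a script, it can help to assemble the command as an argument list, with each switch annotated. A minimal sketch under assumptions: the function name and its parameters are hypothetical, and only the switches from the command above are used.

```python
def build_logman_create(name, counter_file, output_prefix,
                        interval="01:00", max_mb=500):
    """Return the LOGMAN CREATE COUNTER command as an argument list."""
    return [
        "logman", "create", "counter", name,
        "-cf", counter_file,   # text file with one counter path per line
        "-si", interval,       # sample interval (01:00 is one minute)
        "-f", "bincirc",       # circular binary log format
        "-o", output_prefix,   # output path without the .blg extension
        "--v",                 # double dash: no version suffix in the file name
        "-max", str(max_mb),   # maximum log size in MB
    ]


# SERVER01 is a placeholder; cmd would expand %computername% here instead.
args = build_logman_create(
    "BlackBox", "blackbox_counters.txt", r"C:\PerfMon\BlackBox_SERVER01"
)
print(" ".join(args))
```

On the server itself, the list could be handed to subprocess.run(args, check=True) from an elevated session; the sketch only prints the command so it is safe to try anywhere.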
After you create the BlackBox, it’ll be stopped by default. You can check this with the following command:
LOGMAN QUERY
This displays all the collectors and their status.
To start the BlackBox you can simply run the following command:
LOGMAN START BlackBox
Running the query command again, you should see BlackBox with “Running” status. And that’s it, you’re all set.
Making sure that it’s started after reboot
Now that you’ve set up the BlackBox, there’s one final step you should take to make the most of it: making sure that the BlackBox also gets started after you or someone else decides to reboot the server. There’s a simple way to accomplish this with Windows Task Scheduler. Once you open the Task Scheduler, navigate to the folder Task Scheduler Library\Microsoft\Windows\PLA. PLA is short for Performance Logs and Alerts, and it is where PerfMon collectors, among other things, are saved.
Open the properties of the BlackBox and go to the Triggers tab. There, choose New… and, when the New Trigger dialog opens, select At startup in the Begin the task: dropdown menu. Now every time the server is restarted, BlackBox will also get started.
After pressing OK, go to the Settings tab. I’ve noticed that on some servers there’s a tick in the box titled Stop the task if it runs longer than:, with a default value of 3 days. If it’s enabled, clear it, because you most likely don’t want your collector to stop after 3 days.
Setting this up only takes a couple of minutes, but those are minutes well spent when, for example, someone tells you on Monday that during the weekend something happened and everything was slow…
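When that Monday comes, the BLG file can be opened in Performance Monitor, or converted to CSV with the relog tool (for example RELOG BlackBox_SERVER01.blg -f CSV -o blackbox.csv, where both file names are placeholders) and scanned with a script. The sketch below is one possible way to do that, not part of the setup above: the function name is hypothetical, the 20 ms threshold is an arbitrary example, and it assumes the usual CSV layout with the timestamp in the first column and one column per counter instance.

```python
import csv


def find_slow_samples(lines, counter_suffix, threshold):
    """Return (timestamp, counter path, value) for samples above threshold."""
    reader = csv.reader(lines)
    header = next(reader)
    # Column 0 holds the timestamp; pick the columns for the wanted counter.
    columns = [i for i, name in enumerate(header)
               if i > 0 and name.endswith(counter_suffix)]
    hits = []
    for row in reader:
        for i in columns:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # blank or missing sample, skip it
            if value > threshold:
                hits.append((row[0], header[i], value))
    return hits


# Usage on a real file (blackbox.csv is a placeholder name):
# with open("blackbox.csv", newline="") as f:
#     for hit in find_slow_samples(f, "Avg. Disk sec/Read", 0.020):
#         print(hit)
```

Avg. Disk sec/Read is reported in seconds, so a threshold of 0.020 flags every sample where average read latency was above 20 ms; the timestamps of the hits tell you where in the weekend to start looking.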