About two months back I ended up moving to another job, which has unfortunately kept me to be bit too occupied to find the time to blog, until now. Due to this previously mentioned career change, I have been working quite a bit with monitoring, and it gave me a spark to write this post about my favorite PerfMon counters.
Like most DBAs I rely quite a lot on monitoring, and like most DBAs I too have my own set of PerfMon counters that I rely on to provide me an accurate view of what’s happening in the environment I am administering. In this post, I’ll describe what are my favorite PerfMon counters.
What kind of counters I have
I have three different categories for PerfMon counters that I want to see in my dashboard. I have counters that are useful for baselining, that usually aren’t overtly useful in troubleshooting by themselves, but are good for understanding the workloads and load patterns. In the next set of counters I have the vital signs that basically tell about me how the system is faring. Then finally, I have counters that are so scary they don’t even have named category, I just watch them hoping they never light up in my dashboard.
I don’t have many baselining counters, but I’ve noticed that the following give a quite good view on how busy the system is. The baselining counters are all about SQL Server, and nothing else.
- SQL Server, Buffer Manager: Page Life Expectancy
- SQL Server, Buffer Manager: Buffer Cache Hit Ratio
- SQL Server, General Statistics: User connections
- SQL Server, SQL Statistics: Batches/sec
- SQL Server, SQL Statistics: SQL Server Compilations/sec
- SQL Server, SQL Statistics: SQL Server Re-compilations/sec
- SQL Server, Access Methods: Page Splits/sec
- SQL Server, Access Methods: Forward Records/sec
Page Life Expectency
Page Life Expectancy (PLE for short) is the counter that tells you how long pages are held in a cache. This one is actually a relatively well known counter, if not for any other reason but due to amount of outdated recommendations from the times where your server had 8GB of RAM and NUMA was unheard of, still floating around regarding it.
Besides servers having huge amount of RAM these days, the other thing impacting this counter is the NUMA architecture. In NUMA capable servers this value is the harmonic mean of all the NUMA node values. While I mostly use this for baselining, I do dig deeper when I see this either constantly dipping to very low values or staying in low value for extended periods of time.
Buffer Cache Hit Ratio
This counter tells you the percentage of queries that are retrieving pages that were already in the buffer cache. With modern database servers that are loaded with RAM, this tends to always be around 100% The only cases where I can observe this dropping a bit, is when you do heavy index maintenance and need to touch lot of pages.
This is quite self-evident, it’s the number of connections that are open in your database server.
Batches/sec is excellent baselining counter as it tells you how many TSQL batches your SQL Server is currently processing. I usually compare this to the previous counter, number of connections.
SQL Server Compilations/sec
SQL Server Compilations/sec counter tells you how often SQL Server sees a query being a completely new one and needs to compile an execution plan for it. This is a counter that should be compared to previous, batches/sec counter. If this is around 10% of the batches, then it’s usually fine. If this is higher, you might be experiencing memory pressure which is causing plans to be flushed out very quickly. This all can lead to CPU being higher as usual, since every compile will consume CPU cycles.
SQL Server Re-compilations/sec
Despite the very similar name, this actually quite different from the previous counter. Re-compilations happen when the SQL Server finds the plan from the cache (so it’s note related to running low on memory) but there has been quite a lot of changes in the databases making the existing plans invalid. This should be around 10% of the Batches/sec but can quickly get higher after index/stats maintenance and data loads.
SQL Server Page Splits/sec
Page splits occur when data is being inserted or updated and it does not fit into the 8Kb page, which is where SQL Server stores data. The problem with this counter is, that it not every page split is actually a harmful one, but expected behavior by the database engine. I mostly use this only to compare to long time trend, if there’s something different.
Forwarded records happen when rows in a page were moved to another page and you find those through the forwarded record pointer in the original page. These happen only on the heaps (tables without clustered indexes) though and I use it mostly in a similar way to previous counter, to see if something has changed.
Vital Signs counters
We humans have a bunch of vital signs, such as the heart rate, blood pressure, body temperature, etc. we can monitor to figure out if we’re doing okay or not. The same thing goes for the database servers, even though we call these vital signs with bit different names. My list of server vital signs is something that I can usually correlate to wait statistics.
- Processor: % Processor Time
- Memory: Available MBytes
- Logical Disk: Free Megabytes
- Memory: Page Faults/sec and Pages Read/sec
- Logical Disk: Disk Transfers/sec
- Logical Disk: Avg. Disk sec/Read and Write
- Network Interface: Bytes Sent and Received
% Processor Time
The % Processor Time is the percentage you are using of your total processing power. If you have a 4-Core system and this is 50%, it means that in average all your 4 cores are utilized to 50%. It doesn’t tell you if one of the cores is running in 100% percent though, but usually I go into these details only when I am troubleshooting something. Depending on the environment, I usually want this to be around 40-70% (hardware based systems) or 70-90% in virtualized systems.
As it says, this is just telling you how much memory you have available on your system. The recommended value here is highly dependent of your memory configuration and possible other services you have installed.
There’s nothing quite as annoying as your disk becoming full and bringing otherwise completely good system to a halt. Again there’s no really good value to give how much free space you should have, but in a perfect world, you’d always have enough free space for an extra backup and to restore a copy of any of your database.
Page Faults and Hard Faults
These are bit more interesting counters and not quite as self-explanatory as the previous ones. Page faults happen when the program requests an address on a page that is not currently in the memory, and when this happens the OS actually needs to find that page from the disk and then pull that into the memory. This exact situation is what is called a hard fault and as you can imagine, it’s not good for the performance.
The other sort of page faults are called soft faults and they occur when the wanted page is elsewhere in the memory, such as the working set of another process. I don’t have a good value for what this should be, but combined with other counters such as Disk Transfers, it can indicate when you’re having memory pressure on the system.
Disk Transfers is the amount of read and write operations happening in the system at each second. In short, this is the IOPS for the system.
Avg Disk sec/Read and Write
These counters tell in average how many seconds it takes to perform either read or write operation. These values are displayed in milliseconds, so value of 0.010 would be 10 milliseconds. The value I want to see here depends quite a lot of the storage system, for database servers I want these to be as low as possible, around 10ms or less.
Network Bytes Sent and Received
Again quite obvious counter, this is the amount of data your sending or receiving from the network.
The Nameless Ones
There are fewer counters scarier than these three, at least from my perspective. If these have anything but zero values, something is wrong!
- Network Interface: Output Queue Length
- System: Processor Queue Length
- SQL Server, Memory Manager: Memory Grants Pending
Network Output Queue Length
This counter shows the number of packages that are waiting to go out, basically if there’s non-zero value here there’s a bottleneck.
Processor Queue Length
Pretty much the same than before, but now for the processor. This is the number of threads waiting to get executed, any non-zero value here means that you probably have a processor bottleneck.
Memory Grants Pending
When you have bit more complex queries that do various things like JOINs and require bit more space to hold the results, they need a memory grant from the SQL Server. This counter is used for monitoring how many of such requests you have in queue. If you have queries waiting for memory grants, they’re not obviously getting executed before they get it. Usually the value in this counter should be 0 or at maximum 1. Anything higher and you’re probably witnessing queries that generate execution plans that are asking for ridiculous memory grants or memory pressure in the system.
And that wraps up my favorite set of PerfMon counters. While I do have a plenty of metrics I look at, not nearly everything in this list is something I would get alerted about. But the topic of monitoring and alerting is something I will leave to a future post. Thanks for reading and happy New Year!