What is a Performance Baseline?
One of the important things every DBA should have is a performance baseline for their business critical servers. A good baseline tells you how your server performs under various workloads, which is also why you should collect multiple sets of performance data instead of just one. Ideally you'd have one for the minimum load, one for “normal” usage and finally one for the situation where your server and databases are under a lot of stress.
In this post, I’ll be giving you some advice on when and why you should have a performance baseline and how you can create one.
What is it good for?
I have lost count of how many times I've seen one particular piece of advice given to DBAs: know your workload. It is a very good piece of advice, mind you; you really do need to understand what sort of workloads hit your servers during and outside business hours. Understanding what the normal workload is, and its expected variations, such as massive reporting queries running on a daily or weekly schedule, is crucial knowledge for any DBA.
When you're familiar with the workload and have a proper baseline, tracking and identifying performance issues becomes much easier and quicker. It is a rather simple thing to get performance statistics from a live system, but if you have nothing to compare them to, how will you know whether the values are normal or whether there's an actual problem? Naturally, CPUs running at 100% is a problem you should be able to spot a mile away, but most often the problems are not quite that obvious. Having an up-to-date baseline will save you time and effort there, letting you pinpoint the actual problem more quickly and move on to solving it.
You'll also find this information valuable when you're doing capacity planning for server upgrades, or when you have to explain to your boss why you can't add another instance with a couple hundred users to your server without bringing everything to a grinding halt during peak hours.
When should you have it?
My advice would be to always have one for your production systems; even if a system isn't considered business critical, it's usually still important to someone.
How do I set it up?
My personal preference is to use Windows Performance Monitor because I really, really like it. It's an awesome tool, it's free and you already have it on your server! While Performance Monitor is very good at collecting hardware and software related performance data, I do like to add a bit of information from the SQL Server DMVs and DMFs, such as wait statistics and file I/O. On top of these it's also a good idea to collect a SQL Trace (or Extended Events if you prefer) that'll give you a lot of valuable information about how your queries are running.
When it comes to Performance Monitor values, look at the high and low ones, as these give a good idea of the load on your system during the collection period. Personally I don't pay much attention to the average values, simply because they tend to hide the interesting stuff 🙂 You should have at least the following counters from Performance Monitor added to your collector.
For baseline purposes, a collection interval between 10 and 15 seconds works well. This will allow you to collect data over a longer period of time without ending up with a huge and unwieldy BLG file on your hands.
Storage (look at both Logical Disk and Physical Disk):
\% Idle Time
\Avg. Disk Queue Length
\Avg. Disk sec/Read
\Avg. Disk sec/Write
\Current Disk Queue Length
\Paging File(*)\% Usage Peak
\Network Interface(*)\Bytes Received/sec
\Network Interface(*)\Bytes Sent/sec
\Network Interface(*)\Bytes Total/sec
\Network Interface(*)\Output Queue Length
\Processor(*)\% Idle Time
\Processor(*)\% Privileged Time
\Processor(*)\% Processor Time
\Processor(*)\% User Time
These counters will give you a pretty good idea of how the hardware resources on your server are utilized, and they're also the minimum set of counters I'd use for a baseline. If you're troubleshooting an actual performance problem, you'd need to add a whole bunch of additional counters.
Process and SQL Server counters
\Process(sqlservr.exe)\% Processor Time
\Process(sqlservr.exe)\% User Time
\Process(sqlservr.exe)\% Privileged Time
\General Statistics\User Connections
\General Statistics\Logins/sec
\General Statistics\Logouts/sec
\SQL Statistics\Batch Requests/sec
\SQL Statistics\SQL Compilations/sec
\SQL Statistics\SQL Re-Compilations/sec
\Buffer Manager\Page life expectancy
\Buffer Node(*)\Page life expectancy
\Buffer Manager\Lazy Writes/sec
\Buffer Manager\Checkpoint Pages/sec
\Memory Manager\Target Server Memory (KB)
\Memory Manager\Total Server Memory (KB)
And with these you'll get a decent idea of activity at the SQL Server instance and database level. There are many other SQL Server related counters available, but in my opinion, when doing baselines, try to keep the number of counters low; that way the information you need won't be buried under a mountain of data.
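If you prefer scripting the collector over clicking through the Performance Monitor UI, the counter lists above can be dropped into a plain text file and fed to logman, which also ships with Windows. The file, collector and output names below are just placeholders; SQL Server counters use the SQLServer: prefix on a default instance:

```
rem counters.txt holds one counter path per line, for example:
rem   \LogicalDisk(*)\Avg. Disk sec/Read
rem   \Processor(*)\% Processor Time
rem   \SQLServer:SQL Statistics\Batch Requests/sec

rem Create a collector that samples every 15 seconds into a binary BLG file
logman create counter SQLBaseline -cf counters.txt -si 00:00:15 -f bin -o C:\PerfLogs\SQLBaseline

rem Start and stop the collection
logman start SQLBaseline
logman stop SQLBaseline
```

The same collector definition can then be exported and imported on your other servers, so you only have to build the counter list once.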
It's a good idea to collect a summary of your findings into an Excel file for quicker access, so that when you have a situation on your hands you don't need to dig through a bunch of BLG files to find out the usual minimum or maximum value for a given counter.
DMVs and DMFs
While Performance Monitor already gives you plenty of information, I personally like to add some insights from SQL Server itself. This is done through the Dynamic Management Views and Functions (DMVs and DMFs). For baseline purposes, there are two that are of interest (to me, anyway).
sys.dm_os_wait_stats is the DMV to use for insight into the most common wait types in your instance. This information is very valuable for troubleshooting, but also for a baseline, as the results are comparable to the information you get from the Performance Monitor counters.
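A minimal way to snapshot this DMV is to filter out a few harmless wait types and order by total wait time. The exclusion list below is just a small sample for illustration, not the full set of benign waits (the community scripts mentioned below handle that properly):

```sql
-- Top waits since the last instance restart (or stats clear),
-- excluding a few common benign wait types (sample list only)
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'BROKER_TASK_STOP', N'CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;
```

Remember that the numbers are cumulative since startup, so for a baseline you'd want to snapshot them periodically and look at the deltas.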
sys.dm_io_virtual_file_stats is a DMF that can be used to examine I/O operations in your databases, showing the number of reads and writes per file as well as the wait (stall) times for both operations.
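A simple query against this DMF might look like the following sketch; passing NULL for both parameters returns every file in every database:

```sql
-- Reads, writes and I/O stall times per database file
-- since the instance started; NULL, NULL = all databases, all files
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    vfs.file_id,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms,
    vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```

As with the wait statistics, these counters are cumulative, so periodic snapshots and deltas give the most useful baseline picture.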
You can easily write your own queries against these, especially against sys.dm_io_virtual_file_stats. For wait statistics, though, there's a huge amount of detail to deal with, and I warmly recommend looking at some of the excellent scripts created by the SQL Server community. The one I use and always recommend is here, created by Paul Randal from SQLskills. There's also a lot of information about the wait types in general on that same page.
If you're not already familiar with SQL Server waits, I encourage you to read the SQLCAT whitepaper titled “SQL Server 2005 Waits and Queues”. This whitepaper doesn't cover the newer wait types, but for those you can take a look at the Books Online entry for the sys.dm_os_wait_stats DMV.
As with the Performance Monitor data, add the wait statistics and I/O data to your Excel file for easier and quicker access.
SQL Trace / Extended Events
We're already capturing a lot of information from the hardware, the software, and even from inside SQL Server. What we're still lacking, however, is knowledge about the application performance, which I'd consider quite critical. So what we need to do now is gather some performance information about the queries running against the databases, using either SQL Trace (if you're running SQL Server 2008 R2 or anything older) or Extended Events (if you're on SQL Server 2012 or newer).
If you're doing a SQL Trace you could use the Standard template. For Extended Events there's a similar template called Activity Tracking, but it doesn't include the SQL Batch, Statement or RPC events, so you need to add them manually. The other option with Extended Events is to use the Query Detail Sampling template, but keep in mind that it only collects events from 20% of the active sessions, which might be just fine for baseline purposes. With either tool, the information we're interested in is Duration, CPU time, Reads and Writes.
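If you'd rather script the Extended Events session than start from a template, a bare-bones session could look roughly like this sketch. The session name and target file path are placeholders; the query_hash action is included so you can group the results later:

```sql
-- Capture completed batches and RPC calls; both events carry
-- duration, cpu_time, logical_reads and writes.
-- Session name and file path are placeholders.
CREATE EVENT SESSION BaselineQueries ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.query_hash)),
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.query_hash))
ADD TARGET package0.event_file
    (SET filename = N'C:\PerfLogs\BaselineQueries.xel')
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS);
GO
ALTER EVENT SESSION BaselineQueries ON SERVER STATE = START;
```

On a busy system a session like this can produce a lot of data, so keep an eye on the target file sizes and consider adding a predicate (for example on duration) to filter out the noise.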
Analyzing the Trace and Extended Events results
After collecting the data you still need to make sense of it. If you're using SQL Trace, I recommend using the RML Utilities reporting tool to create a nice graphical report out of the trace file. That's probably the easiest way to get a readable report of the longest-running queries, or the ones that do the most reads or writes or burn up your CPUs. If you like doing some extra work, you could save the results into a trace table and build a report yourself in Excel. I'd still recommend using RML Utilities 🙂
With Extended Events you won't be needing an external reporting tool; in fact, you can get even better results with the SSMS UI! Just view the collected data, group it by query_hash, then aggregate the results by your favorite value (Duration, CPU, etc.) and sort them in descending order. You can even choose whether to use AVG, MIN, MAX or SUM for the aggregation, which is something the RML reporter can't do. And on top of that, you can save the display settings, giving you a whole set of reports that can be applied to collected results in the future.
And that is one way to build yourself a performance baseline. It's not the only way, and probably not even the best one for every environment, but hopefully it gave you something to work with when designing your own baselines. And while there's definitely a good amount of work involved, doing this once will give you something you can re-use in most, if not all, of your SQL Server environments in the future.
If you're not familiar with Performance Monitor, I've previously blogged about it, and there are some cool tips and tricks about this awesome tool there.
Writing this post also got me excited about Extended Events, so expect to see something about them as well at some point.