I was recently involved in some query tuning work where we used synthetic, rather than production, data to validate the results of our query and index tuning. We ran into issues with the generated data that had quite a severe impact on our testing, and that prompted me to write this blog post. Let's start by first defining what synthetic data is. In my view, synthetic data is data that resembles actual production data but is artificial/generated. I have seen similar (and also more detailed) definitions elsewhere, and I think it is a good one.
I would also like to point out that there are plenty of good reasons for using synthetic data in testing, as production data is often strictly regulated and not easily available for testing purposes. However, you need to be certain that the synthetic data you are using is similar to what you have in production.
Not long ago I was approached about a long-running query that absolutely had to be made to run faster. Since it had been a while since I'd had the opportunity to do query and index tuning work, I jumped at the chance. I was sent the query, generated by Entity Framework, and while I am not a huge fan of EF, it did not look super-complex or otherwise problematic. Then I ran it against the performance testing environment and waited, and waited, and waited.
The query ended up running for over a minute! It returned almost 2 million rows from the database, which meant that it pulled quite a lot of data over the network. With huge result sets like that, there is a very quick way to check just how much the network affects things: I commented out all the columns I was pulling and did SELECT 1 instead. The query was indeed much quicker at that point, running in a little over 10 seconds, so now I knew that most of the minute was spent getting the data from point A to point B over the network.
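The trick looks roughly like this (the table and column names here are made up for illustration, not the actual query):

```sql
-- Original shape of the query: pulls ~2 million wide rows over the network.
SELECT o.OrderId, o.CustomerId, o.OrderDate, o.Notes /* ...dozens more columns... */
FROM dbo.Orders AS o
WHERE o.StatusCode IN ('A', 'B', 'C');

-- Same query with the select list replaced by a constant:
-- the server still does all the reads and filtering,
-- but almost nothing travels over the wire.
SELECT 1
FROM dbo.Orders AS o
WHERE o.StatusCode IN ('A', 'B', 'C');
```

If the SELECT 1 version is dramatically faster, the difference is mostly network transfer, not query processing.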
It was quite easy to confirm this by monitoring the wait types during the following executions; analyzing the plan with SentryOne Plan Explorer also quickly revealed that we were pulling a couple of gigabytes over the wire. This was turning into a bit of a problem, but after a while we found an issue with the test data. Our query originally pulled 1.8 million rows from the database, and it turned out that 1.7 million of those were related to a single value out of the 20 or so in our WHERE clause.
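Skew like this is easy to spot with a simple aggregate over the filtered column. A minimal sketch, again with made-up names:

```sql
-- Count rows per filter value to spot skew in the test data.
-- In our case one value accounted for ~1.7 million of the 1.8 million rows.
SELECT o.StatusCode, COUNT(*) AS RowsPerValue
FROM dbo.Orders AS o
GROUP BY o.StatusCode
ORDER BY RowsPerValue DESC;
```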
After some discussion with the people responsible for the test data, we finally got somewhat more sensible data to work with. At that point we could get to actually tuning the query, adding some indexes and all the normal performance-enhancing stuff.
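Purely as an illustration of the kind of index work involved (this is not the actual index we added, and the names are invented), a covering nonclustered index on the filtered column can save the optimizer from scanning the whole table:

```sql
-- Illustrative only: index the (made-up) filter column and
-- INCLUDE the columns the query selects, so the index covers the query.
CREATE NONCLUSTERED INDEX IX_Orders_StatusCode
ON dbo.Orders (StatusCode)
INCLUDE (CustomerId, OrderDate);
```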
The Lesson Learned
In this case the issue with the data was quite easy to catch, but that might not always be the case. When working with synthetic data, it is important to generate data that is as realistic as possible, not just random characters and numbers. There are plenty of good tools for this purpose, my favorite being the Redgate Data Generator. However, having good tools is not enough; you also need to understand the characteristics of the data. It is also important to monitor how things like data distribution change over time as the volume of the data increases.
To wrap up this post: this year I decided to try something new when it comes to blogging. Instead of posting my code into the blog posts themselves, I have set up a GitHub repository where I will share the scripts with anyone who wants to try them out.
This will be the first one: a script to find out how many distinct values you have in any column of a database. As a word of warning, do not ever run this query against a production database; always use a copy or some such.
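To give an idea of the approach (this is a simplified sketch, not the actual script from the repository), you can build a COUNT(DISTINCT ...) query for every column of a table from the catalog views and run it with dynamic SQL. It is expensive, since it scans the table once per column, which is exactly why it should never run against production:

```sql
-- Sketch: distinct value counts for every column of one (made-up) table.
DECLARE @table sysname = N'dbo.Orders';  -- hypothetical table name
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql +
    N'SELECT ''' + c.name + N''' AS ColumnName, ' +
    N'COUNT(DISTINCT ' + QUOTENAME(c.name) + N') AS DistinctValues ' +
    N'FROM ' + @table + N';' + CHAR(13)
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(@table);

EXEC sys.sp_executesql @sql;
```

Note that COUNT(DISTINCT ...) does not work on every data type (xml, for example), so a real script needs some type filtering as well.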