Performance Testing Framework

In this post I’ll introduce a set of performance testing classes and practices that I use when creating automated performance tests – with one main aim: simplify the task of measuring/verifying the performance of key areas of my .NET applications.

Source Code:

Often it is not practical, nor even recommended to measure all parts of an application – why waste time profiling sections of code that get called infrequently? It’s also rather difficult knowing exactly where to start profiling an application, whether it’s server or UI, when that app either hasn’t had performance proactively “baked in” or when you’re trying to find the root cause of an occasional freeze.

If I’m trying to isolate performance issues (or maybe retrofitting performance tests), then I tend to start off by using any of the popular . NET profiling apps, such as Red Gate’s ANTS. These give me an idea of where to look, eg for methods that are called very frequently or take a while to run. At least this gives me a fighting chance of where to concentrate my efforts, rather than wading through 1000s of lines of code…needle in a hay stack springs to mind.

There are many resources available showing common coding pitfalls, which can result in extra memory allocates that aren’t needed, making non-deterministic functions behave in a deterministic fashion, lack of caching, multiple enumerations etc. Such examples include:

string concatenation rather than using a StringBuilder,
switch versus multiple if statements,
caching values for ToString/GetHashCode on immutable types

…the list goes on.

I’ll not delve further into the specifics …will leave that for another post; another day.

I treat performance tests as being just as important as my unit tests (which, to me, are of equal importance to my actual project’s code). The performance tests must be easy to run, allowing me to tweak the code under test, run the performance test, and should tell me as soon as a timing metric has not been met…a red alert followed by a fail.

If you’re familiar with the red/green/refactor mantra from the Test Driven Development world, then you’re probably used to seeing red/green alerts whilst you work through a project.

Initial Benchmarks

When I’m looking at improving performance for a block of code and I have no timing metrics, aka key performance indicators, to work with then I follow these steps:

Create a performance test that calls the specific method/block - initially with an expected duration of zero milliseconds – this will fail when ran
Run the performance test (which will repeatedly run my method/block under test) to get an idea of current average time (later I’ll discuss why average is not always the best statistic to use)
Bake that first result into my test
Start the process of incremental refactoring/re-running test to quantifiably improve performance

By following these steps, I have an initial benchmark to work from and will know pretty quickly if I’m actually making things worse, or if the incremental changes are starting to make less of an impact on the results (when this happens it’s generally time to stop trying to squeeze the last clock tick out of a method).

The benchmark can then serve as future indicator of performance. If, sometime later, a developer makes a change to an area of code which negatively affects the performance, the performance test will fail - much like a failing unit test.

Sampling

Unfortunately, there are many factors that can affect how long a section of code takes to run, eg JIT overhead, other applications running, IO activity, debug/release build, physical/virtual memory, type of hard drive, number and speed of processors.

All of these come into play and unfortunately it’s not possible to eradicate any as a factor of the timing stats. To try to alleviate this, the performance test should run the code under test a number of times and calculate a meaningful statistic from that.

Gotcha: This problem is made even worse if you run the performance tests as part of a Continuous Integration on a build machine. Your build machine may be used by other dev teams or may be lower spec than that of your development or end user machines. Performance tests that run without failing on your development machines may well fail on a build machine.

Running Tests

When I’m working through a new feature or refactoring existing code, then it’s important that I can tweak some code and easily run my tests (whether they’re unit or performance tests). To help I use Resharper from JetBrains. This Visual Studio productivity tool makes it’s so easy to run/debug a test (unfortunately not all editions of Visual Studio allow you to create unit/stress/performance tests), simply by clicking on the test method, using test explorer or re-running as part of a test session:

Getting Started

The key tasks of the performance test base class are:

· Specify a minimum acceptable time span that a test method should take to run
protected abstract TimeSpan AcceptableDuration { get; }

· Optionally run some untimed fixture setup code, prior to my test cycle running
protected virtual void FixtureSetup()

· Optionally run some untimed code prior to running each test case method, maybe to reduce impact of first JIT operations:
protected virtual void PreTestCaseSetup()

· Run a specific timed test method a repeated number of times, collecting timing stats from each call:
protected abstract void TestCase()

· Optionally run some untimed code after my test cycle runs, maybe to clean up my environment:
protected virtual void FixtureTearDown()

· Calculate some stats from the timings collected

· Throw an exception if the average time to complete is greater than the AcceptableDuration property

I’m going to use the NUnit framework to throw an exception, via Assert.Fail if the timespan is too long, so you’ll need to run the following in NuGet Package Manager Console:

Install-Package NUnit

As it happens, I could have just thrown an exception to indicate an error state for the test, but NUnit has so many useful features I’ll stick with the using NUnit’s Assert class.

Starting to look at the code, I’ve created a new Class Library project called PerformanceTests (which has the NUnit package referenced above) and added a new class called PerformanceTestBase.cs

This is an abstract class, which I’ll base each of my specific performance tests on – so one test class for each method/block under test. As an example, say I’ve established that the GetHashCode function on my Customer class needs to be profiled, then I’d create a class called CustomerGetHashCodeTest – which will contain all of the code to test just GetHashCode.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using NUnit.Framework;
 
internal abstract class PerformanceTestBase
{
    /// <summary>
    /// The default number of tests cycles to run in order to calc avg
    /// </summary>
    private const int DefaultTestCycles = 100;
 
    /// <summary>
    /// How long each test cycle took to complete
    /// </summary>
    private readonly List<TimeSpan> _testCycleDurations = new List<TimeSpan>();
 
    /// <summary>
    /// Override if you have specific number of times you want
    /// a test to be executed
    /// </summary>
    protected virtual int TestCyclesToRun
    {
        get { return DefaultTestCycles; }
    }
 
    /// <summary>
    /// The minimum amount of time that we'll allow before failing a test
    /// </summary>
    protected abstract TimeSpan AcceptableDuration { get; }
 
    /// <summary>
    /// TIMED: The actual test that you want to run per cycle.
    /// </summary>
    protected abstract void TestCase();
 
    /// <summary>
    /// If you have any code that needs to be ran prior to a test, 
    /// eg to force JITting 
    /// </summary>
    protected virtual void PreTestCaseSetup()
    {}
 
    /// <summary>
    /// Override if you need an initial setup operation before 
    /// the cycle test starts
    /// </summary>
    protected virtual void FixtureSetup()
    {}
 
    /// <summary>
    /// Override if you need to run clean-up operation after 
    /// of the cycles have completed
    /// </summary>
    protected virtual void FixtureTearDown()
    {}

}

As you can see there are a mixture of abstract and virtual methods. When you implement a performance test you’ll need to implement the following:

protected abstract TimeSpan AcceptableDuration { get; }

Returns the minimum acceptable average timespan that will result in a failure of this test

protected abstract void TestCase();

Contains the actual test code to run – this method call is timed and is used to calculate an overall average.

Now we need to add code for the controlling method, ensuring that it’s accessible by any NUnit test runners. TestCaseRunner is a public void method, annotated with the NUnit Test attribute:

[Test]
public void TestCaseRunner()
{
    FixtureSetup();
 
    var testCycles = TestCyclesToRun;
 
    var timer = new Stopwatch();
 
    Log("STARTING TEST. Cycles={0}, AcceptableDuration={1}", 
        testCycles, FormatTimeSpan(AcceptableDuration));
 
    Log("CYCLE \tTime");
    for (var cycle = 1; cycle <= testCycles; cycle++)
    {
        PreTestCaseSetup();
 
        timer.Start();
        TestCase();  // run the actual test
        timer.Stop();
 
        var elapsed = timer.Elapsed;
        _testCycleDurations.Add(elapsed);
 
        Log("{0:000}\t{1}", cycle, FormatTimeSpan(elapsed));
 
        timer.Reset();
    }
        
    FixtureTearDown();
    AssertStatistics();
}

We use a StopWatch instance to provide the actual timing element, simply starting, stopping and resetting the timer between each cycle – this provides a high-enough resolution.

The remaining Log and AssertStatistics methods need to be defined. Log is simply a helper to log messages to the console.

private void AssertStatistics()
{
    var average = _testCycleDurations.AverageTotalMilliseconds();
    var min = _testCycleDurations.MinTotalMilliseconds();
    var max = _testCycleDurations.MaxTotalMilliseconds();
    var deviation = _testCycleDurations.Deviation();
 
    const double Percentile = 95D;
    var percentile = _testCycleDurations.Percentile(Percentile);
 
    var acceptableDuration = AcceptableDuration;
 
    Log("TEST FINISHED. Timings(ms): Acceptable={0}, Percentile={1}, Avg={2}, Min={3}, Max={4}, Deviation={5}, ActualTimes={6}",
        FormatTimeSpan(acceptableDuration), 
        FormatTimeSpan(percentile), FormatTimeSpan(average),
        FormatTimeSpan(min), FormatTimeSpan(max), 
        FormatTimeSpan(deviation), 
        Summarise(_testCycleDurations));
 
    var testPassed = percentile <= acceptableDuration;
    if (!testPassed)
    {
        Assert.Fail("DURATION EXCEEDED: Acceptable={0}, Percentile={1}",
            FormatTimeSpan(acceptableDuration), FormatTimeSpan(percentile));
    }
}

AssertStatistics() uses a few extensions methods that extend ICollection<TimeSpan> from a class called TimeSpanCollectionExtensions.cs (below) . These methods determine the minimum, maximum, average, deviation and most importantly the Percentile value from the timing stats collected after each cycle.

In order to determine if we’ve exceeded the acceptable time space, compare that with use the 95^th percentile – this is more accurate than using the average value.

The remaining methods help with formatting the results:

private static string FormatTimeSpan(TimeSpan span)
{
    return span.TotalMilliseconds.ToString("N0");
}
 
private static string Summarise(IEnumerable<TimeSpan> durations)
{
    return durations
        .Aggregate(new StringBuilder(), 
            (bld, ts) => bld.Append(string.Format("{0}|", FormatTimeSpan(ts))))
        .ToString();
}
 
protected static void Log(string format, params object[] args)
{
    var msg = string.Format("{0:HH:mm:ss.fff}: {1}", 
        DateTime.Now, string.Format(format, args));
    Debug.WriteLine(msg);
}

Finally TimeSpanCollectionExtensions.cs. These methods haven’t been profiled:

using System;
using System.Collections.Generic;
using System.Linq;
 
internal static class TimeSpanCollectionExtensions
{
    public static TimeSpan Percentile(this ICollection<TimeSpan> testDurations, double percentile)
    {
        var sortedDurations = new List<TimeSpan>(testDurations);
        sortedDurations.Sort();
 
        var percentileIndex = Convert.ToInt32((percentile / 100D) * sortedDurations.Count);
        return sortedDurations.ElementAt(percentileIndex);
    }
 
    public static TimeSpan Deviation(this ICollection<TimeSpan> testCycleDurations)
    {
        var sortedDurations = new List<TimeSpan>(testCycleDurations);
        sortedDurations.Sort();
 
        var minTime = sortedDurations.First();
        var maxTime = sortedDurations.Last();
 
        if (minTime != maxTime)
        {
            sortedDurations.Remove(minTime);
            sortedDurations.Remove(maxTime);
        }
 
        return TimeSpan.FromMilliseconds(sortedDurations.Average(ts => ts.TotalMilliseconds));
    }
 
    public static TimeSpan AverageTotalMilliseconds(this ICollection<TimeSpan> testDurations)
    {
        return TimeSpan.FromMilliseconds(
            testDurations.Average(ts => ts.TotalMilliseconds));
    }
 
    public static TimeSpan MinTotalMilliseconds(this ICollection<TimeSpan> testDurations)
    {
        return TimeSpan.FromMilliseconds(
            testDurations.Min(ts => ts.TotalMilliseconds));
    }
 
    public static TimeSpan MaxTotalMilliseconds(this ICollection<TimeSpan> testDurations)
    {
        return TimeSpan.FromMilliseconds(
            testDurations.Max(ts => ts.TotalMilliseconds));
    }
}

Creating a Performance Test

Before we start, I need to create something to test. So, I’ve added a new Class Library project (SampleApp). To that, I’ll add a NumberWords class:

using System;
 
public static class NumberWords
{
    public static string FromSwitch(int number)
    {
        switch (number)
        {
            case 0: return "zero-" + Environment.TickCount;
            case 1: return "one-" + Environment.TickCount;
......
            case 49: return "fortynine-" + Environment.TickCount;
            case 50: return "fifty-" + Environment.TickCount;
            default: return "??-" + Environment.TickCount;
        }
    }
 
    public static string FromIfs(int number)
    {
        if (number == 0) return "zero-" + Environment.TickCount;
        if (number == 1) return "one-" + Environment.TickCount;
......
        if (number == 49) return "fortynine-" + Environment.TickCount;
        if (number == 50) return "fifty-" + Environment.TickCount;
        return "??-" + Environment.TickCount;
    }
 
}

FromSwitch() and FromIfs() return exactly the same thing, they highlight the performance difference you might see between the two implementations.

I’ll start with creating a performance test for the FromSwitch method (I’ll need another test class for FromIfs too). Add a new class to the PerformanceTests project NumberWordsFromIfsTest.cs:

using System;
using System.Linq;
using NUnit.Framework;
 
public static class NumberWordTestCriteria
{
    public const int Start = 0;
    public const int Count = 50000;
    public const int AcceptableMiliseconds = 90;
}
 
[TestFixture]
class NumberWordsFromIfsTest : PerformanceTestBase
{
    protected override TimeSpan AcceptableDuration
    {
        get { return TimeSpan.FromMilliseconds(NumberWordTestCriteria.AcceptableMiliseconds); }
    }
 
    protected override void PreTestCaseSetup()
    {
        var num = GetWord(0);
    }
 
    protected override void TestCase()
    {
        var words = Enumerable.Range(NumberWordTestCriteria.Start, NumberWordTestCriteria.Count)
            .Select(GetWord)
            .ToList();
    }
 
    private static string GetWord(int number)
    {
        return NumberWords.FromIfs(number);
    }
}

NumberWordTestCriteria is a static class that defines some parameters for the tests – I’ve added this class as I want to use the same criteria for the next NumberWordsFromSwitchTest.cs

Looking at NumberWordsFromIfsTest, the class derives from our base performance class PerformanceTestBase and is decorated with the TestFixture attribute so that NUnit runners can find the test.

The abstract AcceptableDuration property is implemented – I want this to pass if the average time span is less than 90 milliseconds.

I’ve also implemented PreTestCaseSetup(), this will get ran prior to running my TestCase method – the method under test. You can generally use this approach to negate the effect of CLR JITing when a method is first called.

TestCase is the method that gets called repeatedly and its execution time averaged out to see if I’m improving or degrading performance.

Using Resharper I can quickly run the performance test:

Once the test run is over, I’ll see a red or green alert and the following results:

20:19:54.583: STARTING TEST. Cycles=100, AcceptableDuration=90
20:19:54.620: CYCLE         Time
20:19:54.854: 001              114
20:19:54.935: 002              78
20:19:55.019: 003              80
20:19:55.095: 004              73
20:19:55.176: 005              77
…..
20:20:03.308: 095              84
20:20:03.387: 096              76
20:20:03.479: 097              89
20:20:03.575: 098              93
20:20:03.661: 099              83
20:20:03.768: 100              104

20:20:03.824: TEST FINISHED. Timings(ms): Acceptable=90, Percentile=113, Avg=87, Min=70, Max=123, Deviation=87, ActualTimes=114|78|80.

Unfortunately my performance test failed L. The TestCase method was called 100 times, the 95^th percentile time was 113 milliseconds, but my acceptable time was only 90 milliseconds

DURATION EXCEEDED: Acceptable=90, Percentile=113

So now it’s up to me to find a faster way to concatenate the words.

Going Further

I’ve added the following JoinWords class to my SampleApp project:

using System;
using System.Linq;
using System.Text;
 
public static class JoinWords
{
    public static string FromStringBuilder(int count)
    {
        var builder = new StringBuilder();
        foreach (var item in Enumerable.Range(0, count))
        {
            builder.AppendLine(item.ToString());
        }
 
        return builder.ToString();
    }
 
    public static string FromConcatenation(int count)
    {
        var line = string.Empty;
        foreach (var item in Enumerable.Range(0, count))
        {
            line += item + Environment.NewLine;
        }
 
        return line;
    }
 
    public static string FromAggregate(int count)
    {
        return Enumerable.Range(0, count)
            .Aggregate(new StringBuilder(), 
                (builder, item) => builder.AppendLine(item.ToString()))
            .ToString();
    }
}

Again, each of the three methods do the same thing, they return a string of numbers joined together, using a StringBuilder, string concatenation and a StringBuilder in conjunction the Linq Aggregate method.

I’ve chosen these implementations to highlight how slow string concatenation can be. I’ve added three test classes using the pattern outline above (although I do have an extra abstract class JoinWordsTestBase that wraps up some commonality)

It’s interesting to see just how slow string concatenation runs compared to using the StringBuilder – ten times slower.