Run Tests Using C# Simulation – Visual Studio Magazine

The Data Science Lab

Runs tests using C# simulation

Dr. James McCaffrey of Microsoft Research uses a full code program for a step-by-step explanation of this machine learning technique that tells if patterns are random.

Suppose you observe a sequence of people entering a store and notice the color of the shirt worn by each person. You see this sequence of 24 shirts:

0, 0, 3, 3, 2, 1, 1, 2, 0, 0, 3, 3, 2, 0, 0, 1, 1, 2, 2, 2, 2, 0, 1, 1

where 0 = white shirt, 1 = yellow, 2 = red, 3 = black. You want to know if the pattern is random or not. The pattern doesn’t seem very random, but how can you quantify this observation? This is an example of a test run.

You’re unlikely to want to examine the colors of shoppers’ shirts, but there are plenty of examples where analyzing print information is meaningful. For example, if you have four web servers configured so that the load is designed to be shared equally, a pattern such as the one above might indicate a problem with the load balancer.

Figure 1: Runs tests using a C# simulation in action
[Click on image for larger view.] Figure 1: Runs tests using a C# simulation in action

A good way to see where this article is going is to take a look at the screenshot of a demo program in Figure 1. The demo configures the sequence of 24 values ​​shown above. The sequence has 13 passages:

  (0 0), (3 3), (2), (1 1), (2), (0 0), (3 3), (2), (0 0), (1 1), (2 2 2 2), (0), (1 1)

A scan is a set of identical consecutive data points. The demo then scrambles the sequence 1,000,000 times and calculates the observed number of runs for each random sequence. Over the 1,000,000 iterations, the average number of runs of a random sequence (with the same composition of data points) is 18.75, which is slightly more than the 13 runs observed in the data . So it looks like the sequence actually has too few runs if the sequence was really randomly generated.

The demo estimates the probability of seeing less than 13 races in a random sequence in two different ways. The first technique uses the fact that over many iterations the number of runs is approximately normally distributed (bell-shaped curve). The statistical probability estimate is 0.0019 – only about 2 in 1,000 – possible, but very unlikely.

The second technique uses a raw counting technique. In the 1,000,000 iterations, there were 13 or fewer runs in a random sequence only 4,888 + 1,370,313 + 59 + 13 + 1 = 6,644 times, which is an estimated probability of 0.0066 — only about 7 in 1,000. Again, possible, but very unlikely.

The conclusion is that the original pattern could be random, but it’s highly unlikely. An underlying process that is not random is likely responsible for the observed sequence.

This article assumes you have intermediate or better programming skills with the C# language, but does not assume you know anything about runtime testing. The complete demo code and associated data are shown in this article. Source code and data are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

The demo program
The full demo program, with some minor modifications to save space, is shown in List 1. To create the program, I launched Visual Studio and created a new C# .NET Core console application named RunsTestSim. I used Visual Studio 2022 (free community edition) with .NET Core 6.0, but the demo has no significant dependencies, so any version of Visual Studio and the .NET library will work fine. You can also use the Visual Studio Code program.

Once the template code loaded, in the editor window, I removed all unnecessary namespace references, leaving only the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked the Program.cs file, renamed it to the more descriptive RunsTestSimProgram.cs, and allowed Visual Studio to automatically rename the class Program to RunsTestSimProgram.

List 1:
Runs test simulation demo code

using System;  // .NET 6.0
namespace RunsTestSim
{
  internal class RunsTestSimProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("nBegin test sequence runs using" +
        " simulation ");

      Random rnd = new Random(0);

      int[] seq = new int[24] { 0, 0, 3, 3, 2, 1, 1, 2, 0, 0,
        3, 3, 2, 0, 0, 1, 1, 2, 2, 2, 2, 0, 1, 1 };  // too few runs

      Console.WriteLine("nObserved sequence: ");
      ShowSeq(seq, 40);  // 40 vals per line

      int obsRuns = NumRuns(seq);  // 13 observed
      Console.WriteLine("nObserved number runs: " + obsRuns);

      int[] counts = new int[25];  // count of runs
      int[] scratch = new int[24];
      Array.Copy(seq, scratch, seq.Length);

      int nTrials = 1_000_000;  // C# 7.0 syntax
      Console.WriteLine("nStart simulation with nTrials = " +
        nTrials);
      for (int i = 0; i < nTrials; ++i)
      {
        Shuffle(scratch, rnd);
        int r = NumRuns(scratch);
        ++counts[r];
      }
      Console.WriteLine("Done ");

      Console.WriteLine("nCounts of runs from simulation: ");
      ShowCounts(counts, 40);

      double mean = Mean(counts);
      Console.WriteLine("nMean number runs for simulation: " +
        mean.ToString("F2"));

      double v = Variance(counts, mean);
      Console.WriteLine("nVariance for simulation: " +
        v.ToString("F2"));

      // no continutiy correction
      double z = (obsRuns - mean) / Math.Sqrt(v);
      Console.WriteLine("nComputed z (no continuity correction): " +
        z.ToString("F4"));

      // with continuity correction
      // double z = 0.0;
      // if (obsRuns > mean)  // right tail
      //   z = ((obsRuns - mean) - 0.5) / Math.Sqrt(v);
      // else  // obsRuns < mean : left-tail
      //   z = ((obsRuns - mean) + 0.5) / Math.Sqrt(v);
      // Console.WriteLine("nComputed z (with continuity correction): " +
      //   z.ToString("F4"));

      double p = OneTail(z);  // 
      Console.WriteLine("nApproximate one-tail probability if random: " +
        p.ToString("F4"));

      double likely = 0.0;
      if (obsRuns > mean)
        likely = Likelihood(counts, obsRuns, "right");
      else
        likely = Likelihood(counts, obsRuns, "left");

      Console.WriteLine("nEmpirical one-tail likelihood if random: " +
        likely.ToString("F4"));

      Console.WriteLine("nEnd sequence runs example ");
      Console.ReadLine();
    } // Main()

    static double OneTail(double z)
    {
      // raw z could be pos or neg
      if (z < 0.0) z = -z;  // make it positive is usual
      double p = 1.0 - Phi(z);  // from z to +infinity
      return p;
    }

    //static double TwoTail(double z)
    //{
    //  // likehood of z, z is positive
    //  double rightTail = 1.0 - Phi(z);  // z to +infinity
    //  return 2.0 * rightTail;
    //}

    static double Phi(double z)
    {
      // cumulative density of Standard Normal
      // erf is Abramowitz and Stegun 7.1.26

      if (z < -6.0)
        return 0.0;
      else if (z > 6.0)
        return 1.0;

      double a0 = 0.3275911;
      double a1 = 0.254829592;
      double a2 = -0.284496736;
      double a3 = 1.421413741;
      double a4 = -1.453152027;
      double a5 = 1.061405429;

      int sign = 0;
      if (z < 0.0)
        sign = -1;
      else
        sign = 1;

      double x = Math.Abs(z) / Math.Sqrt(2.0);
      double t = 1.0 / (1.0 + a0 * x);
      double erf = 1.0 - (((((a5 * t + a4) * t) + a3) *
        t + a2) * t + a1) * t * Math.Exp(-x * x);
      return 0.5 * (1.0 + (sign * erf));
    }

    static int NumRuns(int[] seq)
    {
      int runs = 0;
      int last = -1;
      for (int i = 0; i < seq.Length; i++)
      {
        if (seq[i] != last)
        {
          ++runs;
          last = seq[i];
        }
      }
      return runs;
    }

    static double Likelihood(int[] counts, int obsRuns, string leftOrRight)
    {
      int countOutsideObsRuns = 0;
      int totalCount = 0;
      for (int i = 0; i < counts.Length; ++i)
      {
        // i is a runs count, [i] is a freq
        totalCount += counts[i];
        if (leftOrRight == "left")
        {
          if (i <= obsRuns) countOutsideObsRuns += counts[i];
        }
        else if (leftOrRight == "right")
        {
          if (i >= obsRuns) countOutsideObsRuns += counts[i];
        }
      }
      return (countOutsideObsRuns * 1.0) / totalCount;
    }

    static double Mean(int[] counts)
    {
      int sum = 0;
      int n = 0;  // number of values
      for (int i = 0; i < counts.Length; ++i)
      {
        sum += i * counts[i];  // i is number runs, [i] is freq
        n += counts[i];
      }
      double mean = (sum * 1.0) / n;
      return mean;
    }

    static double Variance(int[] counts, double mean)
    {
      double sumSquares = 0.0;
      int n = 0;
      for (int i = 0; i < counts.Length; ++i)
      {
        sumSquares += counts[i] * ((i - mean) * (i - mean));
        n += counts[i];
      }
      return sumSquares / n;
    }

    static void Shuffle(int[] seq, Random rnd)
    {
      int n = seq.Length;
      for (int i = 0; i < n; ++i)
      {
        int ri = rnd.Next(i, n);  // be careful
        int tmp = seq[ri];
        seq[ri] = seq[i];
        seq[i] = tmp;
      }
    }

    static void ShowSeq(int[] seq, int n)
    {
      for (int i = 0; i < seq.Length; i++)
      {
        Console.Write(seq[i] + " ");
        if ((i + 1) % n == 0)
          Console.WriteLine("");
      }
      Console.WriteLine("");
    }

    static void ShowCounts(int[] counts, int n)
    {
      // n is number values per line
      for (int i = 0; i < counts.Length; i++)
      {
        Console.Write(counts[i] + " ");
        if ((i + 1) % n == 0)
          Console.WriteLine("");
      }
      Console.WriteLine("");
    }

  } // Program
} // ns

The demo program starts by setting up a sequence to analyze and a Random object to scramble the sequence:

Random rnd = new Random(0);

int[] seq = new int[24] { 0, 0, 3, 3, 2, 1, 1, 2, 0, 0,
  3, 3, 2, 0, 0, 1, 1, 2, 2, 2, 2, 0, 1, 1 };  // too few runs

The random seed of 0 is arbitrary. Next, the number of runs observed in the sequence is calculated using a program-defined helper function NumRuns():

int obsRuns = NumRuns(seq);  // 13 observed
Console.WriteLine("nObserved number runs: " + obsRuns);

Compared to some alternative classical statistical techniques such as Wald-Wolfowitz, one of the advantages of the simulation approach is that it can handle very long sequences. So while you can manually count the number of runs in the examined sequence, using a helper function to count the runs is a better approach.

Next, the demo sets up an array to hold the number of runs seen in the simulation and makes a copy of the source sequence:

int[] counts = new int[25];  // count of runs
int[] scratch = new int[24];
Array.Copy(seq, scratch, seq.Length);

Note that since the source sequence has 24 elements, the maximum number of runs is 24. But to store the number of times 24 runs occurred, you need to declare an array size of 25 due to indexing at start from nothing.

The heart of the simulation is:

int nTrials = 1_000_000;  // C# 7.0 syntax
Console.WriteLine("nStart simulation with nTrials = " +
  nTrials);
for (int i = 0; i < nTrials; ++i)
{
  Shuffle(scratch, rnd);
  int r = NumRuns(scratch);
  ++counts[r];
}
Console.WriteLine("Done ");

In each of the 1,000,000 trials, the copy of the source sequence is shuffled in random order using a program-defined Shuffle() function. After scrambling, the number of runs in the scrambled sequence is calculated and stored.

After the simulation, the demo calculates the average (average) number of executions seen and the variance of the executions seen:

double mean = Mean(counts);
Console.WriteLine("nMean number runs for simulation: " +
  mean.ToString("F2"));

double v = Variance(counts, mean);
Console.WriteLine("nVariance for simulation: " +
  v.ToString("F2"));

The mean is needed both for statistical estimation and for empirical estimation of the probability of seeing the observed number of trials. The variance is only needed for statistical estimation.

The statistical estimate is calculated as follows:

double z = (obsRuns - mean) / Math.Sqrt(v);
Console.WriteLine("nComputed z (no continuity correction): " +
  z.ToString("F4"));

double p = OneTail(z);  // 
Console.WriteLine("nApproximate one-tail probability if random: " +
  p.ToString("F4"));

The program-defined OneTail() function estimates the probability of seeing fewer than the observed number of runs (13) if the sequence is random (18.75 runs).

The demo ends by calculating a probability estimate based on the number of runs of the simulation:

double likely = 0.0;
if (obsRuns > mean)
  likely = Likelihood(counts, obsRuns, "right");
else
  likely = Likelihood(counts, obsRuns, "left");

Console.WriteLine("nEmpirical one-tail likelihood if random: " +
  likely.ToString("F4"));

Program-defined Likelihood() function loops through counts[] array and adds the number of times there have been 13 or fewer executions.

Understand the statistical estimation approach
The key idea behind estimating the probability of seeing an observed number of runs in a sequence is that for a large number of simulation iterations, the number of runs seen will be approximately normal (also called Gaussian or shaped bell). The graph in Figure 2 shows the distribution of the number of executions observed over 1,000,000 iterations.

Figure 2: Run count distribution for random sequences
[Click on image for larger view.] Figure 2: Run count distribution for random sequences

The graph shows that the average number of runs is around 19 and the standard deviation (a measure of spread) is around (24 – 12) / 6 = 2.0. The variance of the data is the standard deviation squared, or approximately 2.0^2 = 4.0. If you refer to the demo running in Figure 1the actual calculated mean is 18.75 and the actual calculated variance is 3.94.

Briana R. Cross