.NET 7 Code Generated Regular Expressions

With .NET 7, more and more features rely on compile-time code generation for efficiency. In this blog post I want to discuss the added support for regular expressions.

Regular Expressions

Regular expressions are a very practical way of checking strings for patterns. Here I have the TelephoneValidator class that checks whether a telephone number is a valid Belgian number:

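using System.Text.RegularExpressions;

public static class RegExPatterns
{
  // Note: the leading and trailing '/' are JavaScript-style regex delimiters.
  // .NET treats them as literal characters, so remove them if this pattern
  // should match real input.
  public const string PATTERN = @"/^(((\+|00)32[ ]?(?:\(0\)[ ]?)?)|0){1}(4(60|[789]\d)\/?(\s?\d{2}\.?){2}(\s?\d{2})|(\d\/?\s?\d{3}|\d{2}\/?\s?\d{2})(\.?\s?\d{2}){2})$/";
}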

public class TelephoneValidator
{
  public static readonly Regex _regex =
    new Regex(RegExPatterns.PATTERN, RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

  public bool IsMatch(string input)
  => _regex.IsMatch(input);
}

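At runtime, the Regex constructor parses the pattern and builds an internal program that the regex engine interprets for every match. Adding RegexOptions.Compiled makes the runtime emit IL for the matcher instead, which the JIT can then compile to native code. However, in certain environments this runtime-generated code will not be compiled. Blazor WASM is one of those environments: by default it interprets IL, so the Compiled option brings little or no benefit there.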
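
To make the two runtime options concrete, here is a minimal sketch that builds the same regex twice, once with the default options and once with RegexOptions.Compiled. The RuntimeRegexDemo class and the simplified pattern are assumptions made up for this illustration; they are not part of the validator above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RuntimeRegexDemo
{
    // Hypothetical, simplified pattern: "+32" followed by exactly eight digits.
    private const string SimplePattern = @"^\+32\d{8}$";

    // Default construction: the pattern is parsed once when the instance is
    // created, and the resulting program is interpreted on every match.
    public static readonly Regex Interpreted =
        new Regex(SimplePattern, RegexOptions.CultureInvariant);

    // RegexOptions.Compiled asks the runtime to emit IL for the matcher.
    // This only pays off where that IL can be JIT-compiled.
    public static readonly Regex Compiled =
        new Regex(SimplePattern, RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Both instances implement the same pattern, so they must agree.
    public static bool BothAgree(string input)
        => Interpreted.IsMatch(input) == Compiled.IsMatch(input);
}
```

Both instances give the same answers; the only difference is how the matching work is carried out, and how expensive construction is.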

Using Code Generation

With .NET 7 there is also the option to have this code generated at compile time by a C# source generator. Because the matching code then exists as ordinary C# at build time, AOT compilation and trimming can take advantage of it.

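A source generator can only add new code; it cannot modify existing code. So the trick is to declare a partial class with a partial method and let the generator do the 'arduous' work of implementing it. For example, here is the TelephoneValidatorWithCodeGen class using the new RegexGenerator attribute (renamed to GeneratedRegex in the final .NET 7 release).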

public partial class TelephoneValidatorWithCodeGen
{
  [RegexGenerator(RegExPatterns.PATTERN, RegexOptions.CultureInvariant | RegexOptions.IgnoreCase)]
  public static partial Regex _regex();

  public bool IsMatch(string input)
  => _regex().IsMatch(input);
}

The implementation of _regex() is generated at compile time; we will look at the generated code later in this post.
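
The code above targets a .NET 7 preview. In the released .NET 7 the attribute is called GeneratedRegex, so on a current SDK the same idea would look roughly like the sketch below. The SimplePhoneValidator class, the PhoneRegex method name, and the simplified pattern are hypothetical and only serve the illustration.

```csharp
using System.Text.RegularExpressions;

public partial class SimplePhoneValidator
{
    // Hypothetical, simplified pattern: "+32" followed by exactly eight digits.
    // The source generator supplies the body of this partial method at build time.
    [GeneratedRegex(@"^\+32\d{8}$", RegexOptions.CultureInvariant)]
    private static partial Regex PhoneRegex();

    public bool IsMatch(string input)
        => PhoneRegex().IsMatch(input);
}
```

The calling code does not change: IsMatch is used exactly as with a hand-constructed Regex instance.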

Performance

Let's have a look at the performance of this new implementation. Using BenchmarkDotNet I ran the following benchmark:

public class Benchmarks
{
  private static readonly TelephoneValidator _classic_val = 
    new TelephoneValidator();
  private static readonly TelephoneValidatorWithCodeGen _codegen_val = 
    new TelephoneValidatorWithCodeGen();

  [Params("+3224666666", "124")]
  public string test = string.Empty;

  [Benchmark]
  public void Classic()
  => _classic_val.IsMatch(this.test);

  [Benchmark]
  public void CodeGenerated()
  => _codegen_val.IsMatch(this.test);
}

And these were the results:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-1065G7 CPU 1.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.2.22153.17
  [Host]     : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT


|        Method |        test |     Mean |    Error |   StdDev |
|-------------- |------------ |---------:|---------:|---------:|
|       Classic | +3224666666 | 33.47 ns | 0.673 ns | 0.661 ns |
| CodeGenerated | +3224666666 | 24.40 ns | 0.492 ns | 0.547 ns |
|       Classic |         124 | 26.76 ns | 0.336 ns | 0.298 ns |
| CodeGenerated |         124 | 18.24 ns | 0.278 ns | 0.260 ns |

For both inputs the source-generated regex takes roughly 27 to 32 percent less time per call than the already heavily optimized Regex class, which here runs without RegexOptions.Compiled!

Generated Code

Let us use dnSpy to examine the generated code. The source generator implements the _regex() method so that it returns an instance of a newly generated class:

// RegEx_CodeGeneration.TelephoneValidatorWithCodeGen
// Token: 0x0600000E RID: 14 RVA: 0x0000211A File Offset: 0x0000031A
[RegexGenerator("/^(((\\+|00)32[ ]?(?:\\(0\\)[ ]?)?)|0){1}(4(60|[789]\\d)\\/?(\\s?\\d{2}\\.?){2}(\\s?\\d{2})|(\\d\\/?\\s?\\d{3}|\\d{2}\\/?\\s?\\d{2})(\\.?\\s?\\d{2}){2})$/", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)]
[GeneratedCode("System.Text.RegularExpressions.Generator", "7.0.6.15202")]
public static Regex _regex()
{
  return TelephoneValidatorWithCodeGen.GeneratedRegex__regex_16298FFF.Instance;
}

This nested class derives from Regex and is itself the implementation of the regular expression:

[Nullable(0)]
[GeneratedCode("System.Text.RegularExpressions.Generator", "7.0.6.15202")]
[EditorBrowsable(EditorBrowsableState.Never)]
private sealed class GeneratedRegex__regex_16298FFF : Regex
{
  // Token: 0x17000001 RID: 1
  // (get) Token: 0x06000011 RID: 17 RVA: 0x00002136 File Offset: 0x00000336
  public static Regex Instance { get; } = 
    new TelephoneValidatorWithCodeGen.GeneratedRegex__regex_16298FFF();

  // Token: 0x06000012 RID: 18 RVA: 0x0000213D File Offset: 0x0000033D
  private GeneratedRegex__regex_16298FFF()
  {
    this.pattern = "/^(((\\+|00)32[ ]?(?:\\(0\\)[ ]?)?)|0){1}(4(60|[789]\\d)\\/?(\\s?\\d{2}\\.?){2}(\\s?\\d{2})|(\\d\\/?\\s?\\d{3}|\\d{2}\\/?\\s?\\d{2})(\\.?\\s?\\d{2}){2})$/";
    this.roptions = (RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.CultureInvariant);
    this.internalMatchTimeout = Timeout.InfiniteTimeSpan;
    this.factory = new TelephoneValidatorWithCodeGen
                       .GeneratedRegex__regex_16298FFF.RunnerFactory();
    this.capsize = 10;
  }

  // Token: 0x0200000B RID: 11
  [NullableContext(0)]
  private sealed class RunnerFactory : RegexRunnerFactory
  {
    // Token: 0x06000014 RID: 20 RVA: 0x00002185 File Offset: 0x00000385
    [NullableContext(1)]
    protected override RegexRunner CreateInstance()
    {
      return new TelephoneValidatorWithCodeGen
                 .GeneratedRegex__regex_16298FFF
                 .RunnerFactory
                 .Runner();
    }

  // More code ...
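
One practical consequence of the static Instance property in the decompiled class is that every call to the generated method should hand back the same cached Regex object, so calling it repeatedly costs no extra allocation. Below is a small sketch to check this. It assumes the released .NET 7 attribute name GeneratedRegex, and the GeneratedRegexSingletonDemo class and Digits method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static partial class GeneratedRegexSingletonDemo
{
    // Hypothetical pattern: one or more digits.
    [GeneratedRegex(@"^\d+$")]
    private static partial Regex Digits();

    // Compares the references returned by two separate calls.
    public static bool ReturnsSameInstance()
        => object.ReferenceEquals(Digits(), Digits());
}
```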

Summary

Using C# source generators for regular expressions allows more efficient code to be used to match a pattern, and this generated code can be optimized at compile time. For example, Blazor AOT can compile the generated matcher to optimized WebAssembly instead of interpreting it.