Compiling regular expressions

Regular expressions are a very powerful tool for text processing, and they're well supported by the .NET Framework. Of course, you want your regular expressions to run as fast as possible. You can do two things to speed up regex processing: optimizing the regular expression itself, and optimizing the execution environment. I won't be talking about regular expression optimization itself; there is plenty of information on the net about that. One very good source is www.regular-expressions.info, but I'm sure there are others. What I do want to talk about is how the .NET Framework executes your regular expressions.

Your regular expression typically enters your program as a string. This string is first decoded into an internal form that is more easily processed. Typically, the regular expression is then interpreted. This interpretation is reasonably fast, especially when processing relatively small texts. When the text is large, or the same expression is executed many times, your program may benefit from compilation to IL. You enable compilation by setting the RegexOptions.Compiled flag. As the docs say, this yields faster execution but increases startup time. Obviously, this IL needs to be further compiled by the just-in-time compiler before your CPU can execute it.

Let's look at the following function:

static bool IsValidEmail(string text)
{ 
    Regex regex = 
        new Regex(@"^(?!\.)[a-zA-Z0-9!#\$%&'\*\+-/=\?\^_`\|~\.]+(<!\.)@(\w+[\-\.])*\w{1,63}\.[a-zA-Z]{2,6}$", 
            RegexOptions.ExplicitCapture | RegexOptions.Compiled); 
    Match m = regex.Match(text); 
    return m.Success; 
}
 

What happens when you call this function?

  1. The pattern is decoded into an internal form
  2. The internal form is compiled into IL
  3. The IL is compiled to machine code
  4. The machine code is executed

Much worse is that these steps are repeated for every function evaluation! That's right, nothing is cached, and all of the above four steps are repeated for every function call. Much better is the following:

static readonly Regex regex = 
    new Regex(@"^(?!\.)[a-zA-Z0-9!#\$%&'\*\+-/=\?\^_`\|~\.]+(?<!\.)@(\w+[\-\.])*\w{1,63}\.[a-zA-Z]{2,6}$", 
        RegexOptions.ExplicitCapture | RegexOptions.Compiled); 
 
static bool IsValidEmail(string text) 
{ 
    Match m = regex.Match(text); 
    return m.Success; 
}
 

In this case, step 1, 2 and 3 are executed just once. Only the actual execution of the machine code (step 4) is done in each call.

To get a feeling of the performance impact, I've run a few tests. Obviously, test results depend on your hardware configuration, the actual regular expression and the (length of) the input, so results may vary significantly. Anyway, in my test, interpreted execution took more than twice as long as compiled execution. The decoding step (step 1) took a long as 11 compiled executions, compilation to IL took as long as 300 compiled executions and JIT compilation to machine code (step 3) took as long as 1000 compiled executions.

What does this mean in practice? Compilation speeds up execution significantly, and it's worth doing it if you'll execute the compiled regular expression many times (at least about 500 times in my test). It also means that you should avoid steps 1 to 3. You can do so by caching the result of these steps (see above), or do them at compile time or deployment time instead of at runtime.

The Regex class has a method called CompileToAssembly, which allows you to execute steps 1 and 2 at compile time. It generates an assembly on disk (a DLL) containing strongly typed Regex classes. Unfortunately this is just a method, and the .NET framework does not come with a tool to execute this function (unlike sgen.exe, which does a similar thing for XML serialization).

I've built a command line tool around this function. It takes an xml file as input, containing a number of settings for the assembly to build, and the definition of all the regular expression classes you want to include in that assembly. The output is the assembly itself and an XML documentation file, which provides intellisense in Visual Studio (or it can be used to build a help file with SandCastle). To build an assembly for the above regular expression, the minimal input file would contain the following:

<?xml version="1.0" encoding="utf-8" ?>
<project name="Example.RegularExpressions">
  <regex>
    <name>EmailRegex</name>
    <namespace>Example.RegularExpressions.Validations</namespace>
    <pattern>^(?!\.)[a-zA-Z0-9!#\$%&amp;'\*\+-/=\?\^_`\|~\.]+(?&lt;!\.)@(\w+[\-\.])*\w{1,63}\.[a-zA-Z]{2,6}$</pattern>
    <options>ExplicitCapture</options>
  </regex>
</project>  

The root element, project has one mandatory attribute name containing the (simple) name of the assembly to build. Optional elements add more information. For example:

<project name="Example.RegularExpressions">
  <version>1.0.0.1</version>
  <title>Example.RegularExpressions</title>
  <copyright>© 2007 Kris Vandermotten</copyright>
  <strongNameKeyFile>key.snk</strongNameKeyFile>

The following elements are supported: version, title, description, configuration, company, product, copyright, trademark, culture and strongNameKeyFile. Except for the latter, they all translate to standard attributes at the assembly level. Combined with the name attribute, the version, culture and strongNameKeyFile elements allow specifying a strong name for the assembly. The location of the key file is relative to the location of the source file.

The project element then contains any number of regex elements. Each regex element contains at least the name, namespace and pattern elements. Optionally, it contains options, ispublic, and doc elements. In the doc element you can include the very same XML documentation you would use in C#, C++/CLI or VB.NET, for example:

<regex>
 
<name>EmailRegex</name>
 
<namespace>Example.RegularExpressions.Validations</namespace>
 
<pattern>^(?!\.)[a-zA-Z0-9!#\$%&amp;'\*\+-/=\?\^_`\|~\.]+(?&lt;!\.)@(\w+[\-\.])*\w{1,63}\.[a-zA-Z]{2,6}$</pattern>
 
<options>ExplicitCapture</options>
 
<doc>
   
<summary>Regular expression class to validate email addresses</summary>
   
<remarks>
     
According to RFC 2822, the local-part of the address may use any of these ASCII characters:
     
<list>
       
<item>Uppercase and lowercase letters (case sensitive)</item>
       
<item>The digits 0 through 9</item>
       
<item>The characters ! # $ % &amp; ' * + - / = ? ^ _ ` { | } ~</item>
       
<item>The character . provided that it is not the first or last character in the local part.</item>
     
</list>
      .NET doesn't like { and }
    </remarks>
  </doc>
</regex>  

If you don't include a doc element, the pattern is included in a summary element. The result is that Visual Studio intellisense will show the pattern.

Since strongly named assemblies can be build, you can deploy them in the GAC and compile them to native machine code at deployment time with ngen.exe if you want to. That way, even step 3 above can be eliminated.

Download RegexCompiler here. Source code is available upon request.