A Regular Expression Engine in C#

27/03/2019 A Regular Expression Engine in C# - CodeProject
13,906,223 members Sign in
Search for articles, questions, tips

articles Q&A forums stuff lounge ?
A Regular Expression Engine in C#

codewitch honey crisis, 26 Mar 2019
4.00 (1 vote) Rate this:
A Non-Backtracking Regular Expression Engine for .NET (Core)
Download FA.zip - 109.8 KB (sample, implementation source, and binaries)
Introduction
Why even make a regular expression engine when .NET includes a rich regular expressions library?
First of all, there are two major classes of regular expression engine - backtracking and non-backtracking. Most modern regular
expression engines do backtracking. Backtracking engines have greater expressive power than non-backtracking engines, but at the
cost of performance.
Why make a non-backtracking engine?
Well, for starters they out-perform the backtracking engines on short strings over and over. Secondly, they are more suitable for
operating directly on a streaming source, whereas backtracking parsers generally require the text to be preloaded into memory.
More generally, the style of regular expression engine provided by .NET is geared toward matching patterns in long strings. It's not
as suitable for lexing tokens out of a stream. Lexing is of primary importance to parsers, and so the regular expressions used for
parsing terminals are often implemented using a non-backtracking engine that is geared for lexing.
Background
Lexing is the process of breaking an input stream of characters into "tokens" - strings of characters that have a "symbol" associated
with them. The symbol indicates what type of string it is. For example, the string "124.35" might be reported as the symbol
"NUMBER" whereas the string "foo" might be reported as the symbol "IDENTIFIER"
Parsers typically use lexers underneath, and then compose the incoming symbols into a syntax tree. Because lexers are called in core
parsing code, the lexing operations must be reasonably efficient. The .NET regular expression isn't really suitable here, and while it
can work, it actually increases code complexity while diminishing performance.
https://www.codeproject.com/Articles/1280690/A-Regular-Expression-Engine-in-Csharp 1/9
Included in this project is a file called "FA.cs" that contains the core code for a regular expression engine using finite automata which
resolves to the humble yet ubiquitous state machine.
Finite state machines are composed of graphs of states. Each state can reference another state using either an "input transition" or
an "epsilon transition". An input transition consumes the specified input before moving to the destination. An epsilon transition
moves to a destination state without consuming or examining any input. A machine with one or more epsilon transitions is
considered "non-deterministic" (NFA) and may be in more than one state at any given time. Non-deterministic state machines are
more complicated and less efficient, but fortunately there is a deterministic state machine (DFA) for any non-deterministic state
machine, and an algorithm to turn an NFA into a DFA using a technique called powerset construction - math that is out of the scope
here to explore.
The closure of a state is the state itself and the unique set of all states reachable from that state, on either an epsilon or input
transition. The closure represents the state machine indicated by the root state. These graphs can be recursive.
The graph for a state machine matching "foo" is just below.
The closure of q0 is { q0, q1, q2, q3 } because each of those states are connected to q0 either directly or indirectly.
Each time a black arrow is crossed, the input must match the character(s) above the arrow to proceed along that path. Once a
double circle (q3) is encountered, the machine "accepts" the input as valid/matching and returns the token under the state name, in
this case "foo" is under q3 so it is what will be returned once the input is matched.
Now lets match an identifer of the form [A-Z_a-z][0-9A-Z_a-z]*
As you can see, looping is fairly straightforward.
Now let's get useful. Lexers (aka tokenizers) distinguish between one of many different patterns of input. Let's build one that
represents an identifier, integer or a space
You'll note that there are two ways to match an int (q2, q4), but they each return the same symbol "int" as a result.
Each of the machines presented was deterministic (each is a DFA). A DFA machine will only ever be in one state at a time. There are
also non-deterministic machines, or NFAs, which can be in several states at the same time!
You'll note this machine has gray dashed lines in it. These are epsilon transitions or transitions on epsilon. Epsilon simply means "no
input" in this context. Every time a machine encounters a dashed arrow it is automatically in the state pointed to by that arrow as
well as itself. It can be said that it transitions on no input.
Starting out in q0 also puts you in q1, q4, q5, q7, q8 and q11 as well! This set of states, that is, the set of states reachable from this
state on epsilon transitions is called the epsilon closure
These NFA machines are easier to compose/build out of several other machines, but harder to run because they're more complex.
As mentioned, there's a DFA machine for every NFA machine, and an algorithm for converting an NFA to a DFA.
This code allows you to painlessly create these machines, to make pretty graphs like the above, (with the help of Graphviz), to
generate super fast code to run a given state machine, to run a state machine at runtime without generating the code first, to
translate regex expressions to machines and back again, and to otherwise examine and compose the machines.
Using the code

Below is included in the sample project, and heavily commented to show you the basics of using this library:
Hide Shrink Copy Code
//
// See the included project for a full sample
//
var lexer = FA.Lexer(
// our id from the article
FA.Parse(@"[A-Z_a-z][0-9A-Z_a-z]*", "id"),
// our int from the article
FA.Parse(@"0|([\-]?[1-9][0-9]*)", "int"),
// our space from the article
FA.Parse(@" ", "space")
);
// transform the lexer into its DFA equivelant

var dfaLexer = lexer.ToDfa();
// make sure there are no duplicate states left after the transformation.
// this can be an expensive op and isn't strictly necessary, so it's a
// separate step. It *should* be done before code generation.
// If code generation is done on an NFA, it is automatically transformed to
// a DFA and duplicates are removed during that process
dfaLexer.TrimDuplicates();
Console.WriteLine("Rendering graphs");
try
{
// try to render the graphs (requires GraphViz - specifically dot.exe)
// to change the format, simply use a different extension like PNG or SVG
lexer.RenderToFile(@"..\..\..\lexer_nfa.jpg");
dfaLexer.RenderToFile(@"..\..\..\lexer_dfa.jpg");
}
catch
{
Console.WriteLine("GraphViz is not installed.");
}
// generate the code (can generate from either an NFA or DFA, DFA is slightly faster)
// generate code direct to C#

// we can use the CodeDOM here if CODEDOM compile constant is defined
// but to keep things simple we won't here.
Console.WriteLine("Generating C# Code");
using (var tw = new StreamWriter(@"..\..\..\Generated.cs"))
{
tw.WriteLine("namespace {0}", typeof(Program).Namespace);
tw.WriteLine("{");
tw.WriteLine("partial class Program {");
// generate TryLexValue
dfaLexer.WriteCSharpTryLexMethodTo(tw, "Value");
tw.WriteLine();
// generate LexValue
dfaLexer.WriteCSharpLexMethodTo(tw, "Value");
tw.WriteLine();
tw.WriteLine("}");
tw.WriteLine("}");
}
// our test string

var test = "foo 456 bar 123 baz";
Console.WriteLine("Runtime Lexing:");
// Create a ParseContext over the string

// See https://www.codeproject.com/Articles/1280230/Easier-Hand-Rolled-Parsers
// ParseContext handles the input Cursor, the Position, Line and Column counting
// as well as error handing. It operates over an enumeration of characters, a string
// or a TextReader. The main members are Advance(), Current, Expecting(), Position
// Line, Column, Capture and CaptureCurrent()
// it's easy enough to add code to generate lexers that do not require this class
// but we would lose error handling and position tracking, so it's not implemented.
// to use other interfaces, wrap them with ParseContext.Create()
var pc = ParseContext.Create(test);
pc.EnsureStarted(); // make sure we're on the first character.
try
{
// ParseContext.Current works like a TextReader.Peek() does, returning an int
// without advancing the cursor. Unline Peek() it is not problematic.
while (-1 != pc.Current) // loop until end of input.
{
// runtime lexing (can be NFA or DFA - DFA is intrinsically more efficient)
// if we pass it a StringBuilder, it will reuse it - perf is better,
// but we don't bother here.
var tok = dfaLexer.Lex(pc); // lex the next token
// symbol is in tok.Key, value is in tok.Value
Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
}
}
catch(ExpectingException eex)
{
// We got a lexing error, so report it. The message might be long, but you can always
// build your own using ExpectingException members.
Console.WriteLine("Error: {0}", eex.Message);
}
Console.WriteLine();
Console.WriteLine("Compiled Lexing:");
pc = ParseContext.Create(test); // restart the cursor
pc.EnsureStarted();
try
{
// reusing a stringbuilder allows the lex routine to run a little faster
// don't use it for anything else if you're passing it to lex.
var sb = new StringBuilder(); // don't have to use this, can pass null
while (-1 != pc.Current) // loop until end of input.

{
// compiled lexing (always DFA - most efficient option)
// Generated.cs must be present and valid
var tok = LexValue(pc,sb); // lex the next token
Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
}
}
catch (ExpectingException eex)
{
Console.WriteLine("Error: {0}", eex.Message);
}
Let's take a look at the generated code now:
Hide Copy Code

q1:
if((pc.Current>='0'&& pc.Current<='9')||
(pc.Current>='A'&& pc.Current<='Z')||
(pc.Current=='_')||
(pc.Current>='a'&& pc.Current<='z')) {
sb.Append((char)pc.Current);
pc.Advance();
goto q1;
}
return new System.Collections.Generic.KeyValuePair<System.String,string>("id",sb.ToString());
As you can see, this is state q1, which corresponds to the last part of our "id" token in the graph above.
You can see the output transitions from this state encoded into the if.
You can see the arrow pointing back to itself using the goto.
You can see that if the if condition isn't satisfied it ends with a succesful result that returns a token.
All of these properties are reflected in the graph above if you look for them.
Points of Interest
Generating regular expressions - textual representation - from an FA is non-trivial. Parsing them from a textual representation is
trivial. This is almost exactly the opposite of how syntax trees usually are in terms of coding them.
The generated DFAs are an order of magnitude faster in release builds than debug builds. I still don't know why. Regardless, they're
still the fastest option in any build configuration.
GraphViz Dot is deceptively complicated and expressive.
It took way too many iterations of this design over the years to get one I was happy with that handled general use cases well,
including error handling. I finally got it fast, consistent, and easy in the right combinations.
There are improvements to be made, most notably including the ^ and . operators. Doing that efficiently almost requires it's own
article.
History
3-26-2019 Initial release
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
Share
About the Author

codewitch honey crisis
No Biography provided
United States
You may also be interested in...

Building an AI Chatbot using a Regular Convert all files to searchable PDFs
Expression Engine
Combined Regular Expression Search DEELX - Regular Expression Engine for C++
Getting startet with OpenGL/OpenTK in SmartLexicon

MONO/.NET for serious applications
Comments and Discussions

You must Sign In to use this message board.
Search Comments
First Prev Next
random feedback
Super Lloyd 4hrs 6mins ago
Re: random feedback

codewitch honey crisis 3hrs 51mins ago
Re: random feedback

codewitch honey crisis 3hrs 20mins ago
Refresh 1
General News Suggestion Question Bug Answer Joke Praise Rant Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.
Permalink | Advertise | Privacy | Cookies | Terms of Use | Mobile

Selecione o idioma ▼
Web05 | 2.8.190306.1 | Last Updated 27 Mar 2019
Layout: fixed | fluid Article Copyright 2019 by codewitch honey crisis
Everything else Copyright © CodeProject, 1999-2019

A Regular Expression Engine in C# - CodeProject

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Regular Expression Engine in C# - CodeProject

Uploaded by

Copyright:

Available Formats

27/03/2019 A Regular Expression Engine in C# - CodeProject

13,906,223 members Sign in

Search for articles, questions, tips

4.00 (1 vote) Rate this:

A Non-Backtracking Regular Expression Engine for .NET (Core)

Download FA.zip - 109.8 KB (sample, implementation source, and binaries)

Why make a non-backtracking engine?

The graph for a state machine matching "foo" is just below.

Now lets match an identifer of the form [A-Z_a-z][0-9A-Z_a-z]*

As you can see, looping is fairly straightforward.

Using the code

Hide Shrink Copy Code

// transform the lexer into its DFA equivelant

// generate code direct to C#

// our test string

// Create a ParseContext over the string

pc.EnsureStarted(); // make sure we're on the first character.

while (-1 != pc.Current) // loop until end of input.

Let's take a look at the generated code now:

Hide Copy Code

GraphViz Dot is deceptively complicated and expressive.

About the Author

You may also be interested in...

Getting startet with OpenGL/OpenTK in SmartLexicon

Comments and Discussions

You must Sign In to use this message board.

First Prev Next

Re: random feedback

Re: random feedback

Permalink | Advertise | Privacy | Cookies | Terms of Use | Mobile

You might also like