You are on page 1of 12

Crafting an interpreter Part 1 - Parsing and Grammars - Martin.

Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
Home Martin.Holzherr Crafting an interpreter Part 1 - Parsing and Grammars
Workspace home
Ta
Co
Arti

Crafting an interpreter Part 1 - Parsing and
Grammars
by: Martin.Holzherr
First of a series of articles on building a language interpreter, describing basics about parsing and grammars.
(Back to article)

0

Fork A

What to do next: fork other workspaces!


You can Fork other members workspaces to create a new, unique project from an existing one.
Push your changes to your forked workspace and dont forget to send a pull request to the
original author to contribute to the project.

CREATE A
WORKSPACE
EXPLORE
WORKSPACES
My favorites
Nothing to display
My workspaces
Nothing to display
Shared with me
Nothing to display
Workspaces by this
user
Crafting an interpreter
Part 3 - Parse Trees
and Syntax Trees
Crafting an
interpreter Part 1 -
Parsing and
Grammars
Crafting an interpreter
Part 2 - The Calc0 Toy
Language
Parsing Expression
Grammar Support for
C# 3.0 Part 1 - PEG
Lib and Parser
Generator
more workspaces C
Tasks Code
Forum

VROSNET Sign out S home workspaces premium account
Submit
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
Download sample - 98.6 Kb
Introduction
Parsers, Interpreters and Compilers are becoming more and more into the reach of the normal programmer. In the
very beginning, such tools were hand crafted. Then came the time of the parser and compiler generators (e.g.,
YACC (popular UNIX parser generator) or ANTLR), and currently we experience the integration of grammars and
parsers directly into a host language. boost::spirit, for example, is integrated into C++ as a library which
comes with a domain specific sublanguage (using operator overloading and expression templates). Perl 6 goes even
a step further and makes grammars part of the language. These tools help you very much but they still require quite
a lot of knowledge about the underlying theoretical framework.
This is the first part of a series of articles which explains the most basic concepts and practices to cope with parsing
and compiling. It uses toy programming languages and their interpreters to expose the concepts. For each toy
language, sample programs are presented. These samples are handwritten, but in a style which shows how the task
could be automated. This introduction into language processing addresses newbies as well as more experienced
programmers. The former get enough explanatory text, the latter C++ code samples which show how language
processing can be done with minimal baggage. The whole article series strives for minimality both in theory and in
implementation.
Steps for building an interpreter
Let's take a command line calculator for arithmetic formulas as an introductory example. The following excerpt shows
a sample session, where lines starting with '>>' indicate program output.
Pi=3.1415926535
>> Pi= 3.1415926535
Area=Pi*r^2
>> Area= 0
r=3
>> Area= 28.2743338815
>> r= 3
To implement this, the calculator program must accomplish the following tasks:
Task Sketch of Solution
It must recognize the structure and the
components of the input line which consists of
arithmetical expressions, assignments and
formula names. Errors must be reported in a user
friendly, informative way.
This is done by a process called parsing. The effort done for
parsing pays off in the quality of the error messages for
syntax errors and in the error recovery. A poorly designed
parser will produce follow errors after the first error or will
not be able to continue parsing at all after the first error.
It must evaluate the arithmetic formula and store
the result in a map which associates the formula
name with the value of the expression.
Furthermore, it must store the formula
expression itself in order to recalculate the
formula when a component changes its value by
another evaluation.
Evaluation of an expression or statement to a result value
can only be done after successful parsing. Evaluation may
either be executed directly during parsing or it may be
deferred until the parsing process has constructed an
appropriate intermediate "database" and finally has
generated efficient "code" for the execution by a real or
virtual processor.
It must cope with runtime errors like divide by
zero or taking the root of a negative number.
Handling of runtime errors is done by the runtime system.
The amount of runtime error handling is a tradeoff between
efficiency and convenience and must be adapted to the
expectations of the target user group.
Parsing and Grammars
Parsing is the process of associating a hierarchical structure with a source text. A programmer, who indents her
source text appropriately (e.g., she indents the body of loops), does a simple form of parsing. Parsing is the first task
in language processing and one of the most important ones, because it guides all the later steps of a translation
system. The formal framework for parsing is set by a grammar which describes the source language. A grammar
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
consists of a set of rules, which taken together associates a unique tree structure with any correct input text.
Grammars are divided into different classes according to their expressiveness. On the lowest level are grammars for
regular expressions. These grammars cannot describe recursively nested constructs as they are needed to express,
for example, the fact that a statement has other statements as its components. The grammar class commonly used
to describe programming languages is called "context free". Each rule of a context free grammar has the form R: a
b c | d ; where R is the name of the rule (also called a Nonterminal), : separates the rule name from the right
hand side of the rule and where a b c d and so on are either terminals or names of further rules (Nonterminals),
and | is used to separate alternatives. This is further explained in the following table:
Notion Samples Meaning
Terminal
'for'
[a-zA-Z][0-
9]+;
Describes a string of the input directly. It is common practice to use a subset of
the notation which is used in regular expressions for describing terminals. The
only difference is the notation for strings. In context free grammars, we write
e.g., 'for', where we would write just for in a regular expression.
Quantifiers ? * + Same meaning as in regular expressions. These three quantifiers are sufficient.
Alternative | Same meaning as in regular expressions.
Grouping ( ... ) Same meaning as in regular expressions.
Nonterminal
Rule
EnclosedDigits:
[0-9]+ | '('
EnclosedDigits
')' ;
EnclosedDigits in the sample is a Nonterminal. In this sample, it is both the name
of a rule and used on the right hand side of the rule. For each Nonterminal used
somewhere in a grammar rule, there must be exactly one corresponding rule.
The following sample context free grammar rule describes a string of digits enclosed in an arbitrary number of
parentheses:
EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;
This grammar rule associates the input string ((123))) with the following parse tree:
EnclosedDigits
'(' EnclosedDigits ')'
'(' EnclosedDigits ')'
'123'
In the coming samples, we will use the following linear form as a tree notation:
EnclosedDigits< '(' EnclosedDigits< '(' EnclosedDigits<123> ')' > ')' >
Interestingly, the rule EnclosedDigits: matches the complete input string ((123)) from the first to the last
character, and it also matches inner parts like (123) and 123. This is possible because of the recursive nature of
context free grammars. When a grammar is used to recognize an input string, we say that we use the grammar for
parsing. A grammar can also be used to generate strings of the language by a process called substitution.
Substitution takes place on the right hand side of the so called start rule. Substitution means replacing a
Nonterminal by one of the alternatives on the right side of the rule for the replaced Nonterminal. By repeated
substitutions on the right hand side of the start rule, we can generate all input strings which belong to the language
described by the grammar. This shall be shown with the following sample start rule:
Expression: Expression '+' Expression | '(' Expression ')' | [0-9]+ ;
The following table shows one possible substitution history:
Before
Substitution
Step Description
After
Substitution
Parse Tree
Expression Substitute by first alternative of Expression rule
Expression
'+'
Expression
Expression<
Expression
'+'
Expression
>
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
Expression
'+'
Expression
Substitute marked Nonterminal by second
alternative of Expression rule
Expression
'+'
'('
Expression
')'
Expression<
Expression
'+'
Expression<
'('
Expression ')'
>
>
Expression
'+'
'('
Expression
')'
Substitute marked Nonterminal by '1001' (using
[0-9]+ alternative)
'1001'
'+'
'('
Expression
')'
Expression<

Expression<'1001'>
'+'
Expression<
'(' Expression
')'
>
>
'1001'
'+'
'('
Expression
')'
Replace marked Nonterminal by '37' (using [0-9]+
alternative)
'1001'
'+'
'(' '37' ')'
Expression<

Expression<'1001'>
'+'
Expression<
'('

Expression<'37'>
')'
>
>
The resulting generated language string is 1001+(37). Note that the order of substitutions - whether you first
substitute a Nonterminal occurring on the very right side or on the left side - is not specified by a context free
grammar. A parser therefore needs not only a context free grammar to do its job, it needs also a parsing strategy,
which is nothing else than a built-in rule telling which of the possible substitutions is carried out first. Common
parsing strategies are LL and LR. Both parser types read the input from left to right, which is the reason for the first L
in LL and LR. An LL parser always substitutes the leftmost Nonterminal first, whereas an LR parser substitutes the
rightmost Nonterminal first. Typically, LL and LR parsers do not accept the same grammar rules. An LL parser is not
happy with left recursive rules whereas an LR parser is not happy with right recursive rules, this is exemplified by the
following sample grammar rule:
Parsing Strategy Grammar Comment
LL Digits: [0-9] Digits? ; Match first a terminal, then recurse.
LR Digits: Digits? [0-9] ; Recurse, then match terminal.
The LR - grammar Digits: Digits? [0-9] is left recursive, because its right hand side has the rule name as its
first symbol. Fortunately, it is not difficult to rewrite an LR grammar into an LL grammar and vice versa, as we will see
later.
Expressiveness of Context Free Grammars
Many import properties of existing programming languages can either not be expressed with a context free grammar
or are tedious to describe. Consider a language with procedures where the procedure name must both occur at the
beginning and at the end:
Proc: 'proc' Name Parameters Body Name ';' ;
There is no way to describe that the Name at the beginning must be the same as the Name at the end. Take as
another example a length limitation, e.g., the rule that an identifier cannot exceed 32 characters. While it is possible
to describe this with a context free grammar rule, it either requires a lot of writing or the addition of a new quantifier
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
(e.g.: [a-zA-Z]{1,32}). It turns out in many cases that it is preferable to check for the violation of a restriction after
successful parsing rather than trying to express the restriction within the grammar. Generally speaking, a grammar
for a programming language accepts always a superset of the legal constructs. The real strength of context free
grammars is the proper description of hierarchical relations as shown by the following examples of grammars for
simple arithmetic expressions. This is a poorly chosen grammar for expressions:
expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ;
This grammar does not express the fact that the precedence of '+' is not the same as that of '*'. The following is a
much better grammar for expressions:
expr1: mul_expr | expr1 ('+'|'-') mul_expr ;
mul_expr: [0-9]+ | mul_expr ('*'|'/') [0-9]+ ;
The source text 2*3+5 could be parsed by expr0 into the following tree:
expr0<
expr0<'2'>
'*'
expr0< expr0<'3'> '+' expr0<'5'> >
>
whereas expr1 would result in this tree:
expr1<
expr1<mul_expr<'2'> '*' '3' >
'+'
mul_expr<'5'>
>
Only the second tree reflects the precedence of the '+' and '*' operators. By the way, the grammar rule for expr1 is
suitable for an LR parser, but not for an LL parser.
The Peril of Ambiguity
A grammar is ambiguous if a source text can be parsed into two different parse trees depending on the order of
substitutions. The above sample expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 |
[0-9]+ ; is ambiguous because the source text 2*3+5 can be parsed into both of the following trees:
expr0<expr0<'2'> '*' expr0< expr0<'3'> '+' expr0<'5'> > >
expr0<expr0<expr0<'2'> '*' expr0<'3'> > '+' expr0<'5'> >
depending on the substitution order. Ambiguity defeats the main purpose of grammars, namely the association of a
source text with a unique hierarchical structure. Whether ambiguity of a grammar really results in different
hierarchical structures not only depends on the grammar but also on the parsing strategy.
Recursive Descent Parsing
Recursive descent parsing is the most simple and intuitive parsing method. It is an LL parsing method and therefore
reads the input from left to right (the first L in LL) and traverses the grammar from left to right (the second L in LL).
The name recursive descent is due to the fact that this parsing method uses a function for every grammar rule. The
function for the start rule must parse the complete source. It does this by providing a function body, which
implements the right hand side of the start rule in the following way:
in case of alternatives, it tries all the promising alternatives by executing the corresponding code for the
alternative until one of them succeeds (matches the input text)
in case of a sequence of grammar symbols (tokens or Nonterminals), it executes the corresponding code for
each symbol in the sequence until one of them fails or otherwise the code for the whole sequence succeeds.
Note that the implementation of a recursive descent parser sometimes requires backtracking. Backtracking means
that you go back to a previous state of the computation. If, for example, the first part of the chosen alternative is
successfully matched against the input text and then a mismatch occurs, we must set back the source position to the
position before we entered the alternative and choose another alternative. Let's show this with a sample
implementation. The rules:
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
EnclosedDigits: Digits | '(' EnclosedDigits ')' ;
Digits: [0-9] Digits? ;
could be implemented by the following parsing procedures which merely test whether the input matches the
grammar rules:
typedef const unsigned char* Iterator;
bool MatchesDigits(Iterator& it) //[0-9]Digits?
{
return (*it>='0' && *it<='9'?(++it,true):false)//[0-9]
&& (MatchesDigits(it),true); //Digits?
}
bool MatchesEnclosedDigits(Iterator& it) //Digits | '(' EnclosedDigits ')'
{
Iterator itSave= it;
return
MatchesDigits(it) // Digits
|| // |
( (itSave=it,
(*it=='('?(++it,true):false) // '('
&& MatchesEnclosedDigits(it) // EnclosedDigits
&& (*it==')'?(++it,true):false) // ')'
)
||(it=itSave,false)
);
}
The start grammar function MatchesEnclosedDigits takes an iterator (e.g., character pointer) as input and
either returns true, in which case it sets the iterator to the position just behind the recognized text, or it returns
false, in which case it leaves the iterator unchanged. The close relationship between grammar rules and
corresponding parsing functions make it obvious that a recursive descent parser cannot work with grammar rules
like:
Digits: Digits? [0-9] ;
because this would result in infinite recursion. Many grammars coming with language specifications are written for LR
parsers (the well known YACC tool requires an LR grammar). But it is quite easy to rewrite an LR grammar into an LL
grammar.
Parsing Expression Grammars (PEGs)
The idea behind the so called parsing expression grammar (PEG) formalism [1] is a "procedural interpretation" of
context free grammars. A PEG grammar can be "executed" using the grammar and a source string as input. The
result of the execution is either a match of the input string, in which case the parsing position is at the end of the
recognized part, or otherwise the result is "doesn't match", in which case the input position remains unchanged. This
statement is not only true for the grammar as a whole, but for each grammar rule. The PEG formalism is nothing else
than a mapping of common practices in recursive descent parsing to grammars. It consists of the following grammar
elements:
PEG
construct
Notation Operational Meaning
Sequence e
1
e
2
Match input against e
1
moving the input position behind the input part described by
e
1
, then match from new input position against e
2
. If the match fails for either e
1
or
e
2
, the input position is restored to the position before entering the Sequence and
the result is fail, otherwise the input position is moved behind the recognized
Sequence and the result is success.
Prioritized
choice
between
Alternatives
e
1
/ e
2
Match input against e
1
moving the input position behind the input part described by
e
1
. If the match fails, restore the input position and match against e
2
. If the match
fails again, restore to the position before entering the Alternative and the result is
fail. Otherwise, if one of the alternatives succeeds, the input position is moved
behind the source part which is recognized by the chosen alternative and the result
is success.
Greedy
e?
Match input against e, moving the input position behind the recognized part in case
of success. If the match fails, restore the input position, but the result is in any case
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
Option
success.
Greedy zero
or more
occurrences
e* Interpret it like an unlimited number of e? e? e? e? ...
Greedy one
or more
occurrences
e+ Interpret it like e e? e? e? ...
Peek
predicate
&e
Match the input against e without changing the input position (the input position
remains the same whether the match succeeds or fails). The result is success if it
matches, otherwise it is fail.
Not
predicate
!e
Match the input against e without changing the input position (the input position
remains the same whether the match succeeds or fails). The result is fail if it matches,
otherwise it is success.
Advance
predicate
. Increase the input position by one. The result is success.
The following table shows the effect of PEG grammar constructs on sample input strings. The | character indicates
the input position:
PEG construct Input Match Position Match Result
S: 'for' ; |for for| true
S: 'for' ; |former for|mer true
S: 'for' ; |afor |afor false
S: 'for' 'all' ; |forall men forall| men true
S: 'former' / 'for' ; |for for| true
S: 'former' / 'for' ; |former former| true
S: 'for' / 'former' ; |for for| true
S: 'for' / 'former' ; |former for|mer true
S: 'for'?'mer' ; |former former| true
S: 'for'?'mer' ; |mer mer| true
S: 'for'?'former' ; |former |former false
S: [0-9]* ; |1903.535 1903|.535 true
S: [a-z .]+'.*'? ; |ifi.go.* ifi.go.|* true
S: 'for' &'(' ; |for( for|( true
S: 'for' &'(' ; |for[ |for[ false
S: 'for' !'(' ; |for[ for|[ true
S: 'for' !'(' ; |for( |for( false
PEG versus regular expressions
Since PEGs cover context free grammars, they are more powerful than regular expressions. The most important
difference between Parsing Expression Grammars and regular expressions is the fact that the former supports
recursive definitions, the latter not. But simple non recursive situations often require shorter regular expression
patterns as can be shown for an email address recognizer.
Regular expression as email address recognizer:
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
[A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}
Incorrect PEG rule as email address recognizer:
EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;
Correct PEG rules as email address recognizer:
EMailChar: [A-Za-z0-9._%-];
EMailSuffix: [A-Za-z]{2,4} !EMailChar ;
EMail: EMailChar+
'@'
([A-Za-z0-9_%-] |'.'!EMailSuffix)+
'.'
EmailSuffix ;
The correct PEG email address recognizer is more complicated than the regular expression email address recognizer
and uses implicit backtracking (in !EMailSuffix). When we parse the legal email address marc.bloom@blo.blo.uk
with the PEG rule EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;, we will fail,
because blo.blo.uk will be recognized by the greedy rule component [A-Za-z0-9._%-]+, so that there will be no
input character left for '.'[A-Za-z]{2,4}. The regular expression pattern does not have this problem, because
the input string blo.blo.uk will be wisely shared between the two parts of the regular expression [A-Za-z0-9._%-
]+\.[A-Za-z]{2,4}. Fortunately, it turns out that such intricate backtracking (as used in the correct PEG rule) is
seldom necessary for parsing.
Procedural and template implementations of PEG elements
An ideal C/C++ PEG implementation is both close to a PEG grammar and runtime efficient. This can be achieved
both with normal "C" code as well as with C++ templates. C++ templates prove to be more flexible, whereas normal
C-code has the advantage of making less trouble in case of programmer errors, because error messages for
templates are still difficult to decode. The rule S: 'C' ; has the following procedural implementation:
typedef const unsigned char* Iterator;
inline bool Char(Iterator& p,int c) { return *p==c?(++p,true) : false;}
bool S(Iterator& it){return Char(it,'C');}
and the following template implementation:
template<int CHARVALUE>
struct Char{
template<typename It>
static bool Matches(It& p) { return *p==CHARVALUE?(++p,true):false;}
};
struct S{
typedef Char<'C'> rule;
template<typename It>
static bool Matches(It& p) { return rule::Matches(p); }
};
The following table sketches the procedural and template implementations of all the PEG constructs:
PEG notation procedural implementation template implementation
[a-zA-Z0-9]
In ( p, 'a', 'z',
'A', 'Z', '0', '9' )
In<'a', 'z', 'A', 'Z',
'0', '9'>::Matches(p)
[:;/] OneOfChars ( p, ':', ';', '/' )
OneOfChars<':', ';',
'/'>::Matches(p)
'for' String ( p,'f','o','r' )
String<'f', 'o',
'r'>::Matches(p)
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
e
1
e
2
( pSave=p, e1(p) && e1(p) )||
( p=pSave, false )
And<e1,e2>::Matches(p)
e
1
/ e
2
( pSave=p, e1(p))
|| ( p=pSave, e2(p) )
|| ( p=pSave, false )
Or<e1,e2>::Matches(p)
e? e ( p ) ; return true; Option<e>::Matches(p)
e* while ( e(p) ) {} return true; OptRepeat<e>::Matches(p)
e+
if ( !e(p) ) {return false;}
while ( e(p) ) {} return true;
PlusRepeat<e>::Matches(p)
e{LOW,HIGH}
pSave=p; int i;
for ( i=0; i<=HIGH; i++ ) {
if ( !e(p) ) { break; }
}
return i>=LOW ? true:
( p=pSave,false );
For<LOW,HIGH>::Matches(p)
&e
( pSave=p, e(p) )
?( p=pSave,true ):
( p=pSave, false )
Peek<e>::Matches(p)
!e
( pSave=p, e(p) )
? ( p=pSave,false ) :
( p=pSave,true )
Not<e>::Matches(p)
. ++p Advance<1>::Matches(p)
Note, that the constructs . and e{LOW,HIGH} are not part of the originally proposed PEG elements, but are added
by me. Applied to the email PEG recognizer rules, we get the following template implementation:
bool MatchesEMail(openend_iterator& it)
{
using namespace peg_templ;
C c;
typedef In<'A','Z','a','z'> Alpha;
typedef In<'A','Z','a','z','0','9'> Alnum;
typedef Or<Alnum,OneOfChars<'.','_','%','-'> > EMailChar;
typedef And<For<2,4,Alpha >,Not<EMailChar > > EMailSuffix;
typedef And<
PlusRepeat<EMailChar>,
Char<'@'>,
PlusRepeat<Or< Or<Alnum,OneOfChars<'_','%','-'> >,
And<Char<'.'>,Not<EMailSuffix> >
>
>,
Char<'.'>,
EMailSuffix
> EMail;
return EMail::Matches(it,&c);
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
}
Note that EMail::Matches(it,&c) has two parameters (contrary to the table above). The first is an iterator, the
second a pointer to the parsing context. This parsing context is necessary, if one wants to carry out some action
during parsing.
Input Iterators
The parser reads the source for parsing. Storage medium, encoding, and end of input indicators can be different for
various sources. In most cases, the storage medium is a file or a character string. The character encoding is either
ASCII, UTF-8 or one of the many other Unicode encodings. The end of input can be detected either by the presence
of a special terminator or by comparison against an end of input pointer.
Checking each input position against the end of input is in most cases not necessary. Most grammars expect only a
subset of the possible input characters. If we know that the input is terminated by a special character, say '\0', we can
use a start rule like: S: TopRule '\0';.
If we use a template parameter Iterator, we can cope with all the mentioned possibilities by providing a sensible
iterator implementation. Our PEG implementation uses only the following iterator operations:
PEG iterator operation Meaning
int
iterator::operator*
()const
Returns the current character value. In case of an iterator for an input range, it
returns a negative value when at the end of range.
iterator&
iterator::operator++
()
Increments the input position. In case of an iterator for an input range, it does not
increment the input position when at the end of the input range.
The following table shows two sample models for input iterators:
container/iterator characterization iterator implementation
character buffer with special termination character (e.g.: '\0').
typedef const unsigned
char* peg_openend_iterator;
character buffer with end of input pointer.
struct peg_begend_iterator{
peg_begend_iterator(const
char* p=0,const char* pEnd=0)
:m_p((const unsigned char*)p),
m_pEnd((const unsigned
char*)pEnd){}
int operator*()const{
return m_p==m_pEnd?
(INT_MIN+1):*m_p;
}
peg_begend_iterator&
operator++(){
if(m_p<m_pEnd)++m_p;
return *this;
}
unsigned const char* m_p;
unsigned const char* m_pEnd;
};
Why performance matters
The parser of a C++ compiler spends less then 20% of the compilation time with parsing. But there are many
applications for parsing without sophisticated post-processing where speed really matters. Examples are auto-
completion support for development environments, preprocessing of C/C++ source files, extraction of items from
large XML files.
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
There are two main sources of inefficiencies in parsing:
backtracking
runtime costs of unnecessary abstractions
In recursive descent parsing, backtracking can be avoided by rewriting the grammar in a way that avoids alternatives
with the same beginning. This technique is known as left factorization and will be shown in one of the next articles of
this series. In LR parsing (LALR parsing to be precise) as used by YACC (the UNIX parser generator), backtracking is
avoided by the internal use of a stack for symbols which wait for further processing. It is well known that hand written
recursive descent parsers using optimized grammars and tools like YACC are the best performers. But for the casual
user, handwritten parsers as well as compiler generation tools are not ideal, the former because they require too
much work, the latter, because they generate in most cases impenetrable and hard to maintain code. With Parsing
Expression Grammars, the picture could change, because PEGs combine the easy to understand recursive descent
technique with the possibility of an efficient implementation which is very close to the original grammar. I therefore
compared the runtime efficiency of three PEG implementations and of a hand-optimized parser. I took a grammar
for arithmetical expressions as a test case, because such a grammar can be written in a way which requires no
backtracking. All the tested parsers had to parse the expression 132*( firstOccurance + x2*( 1001/N55
)+19 ), which was repeated until it filled a buffer of 1000000 characters. The following table gives the results.
Implementation technique CPU usage VC++ 7.0 CPU usage GCC 3.3.1
procedural PEG implementation 0.17 seconds 0.03 seconds
templated PEG implementation 0.211 seconds 0.041 seconds
begEnd templated PEG implementation 0.32 seconds 0.09 seconds
hand optimized PEG implementation 0.06 seconds 0.03 seconds
The results show that the templated and procedural PEG implementations come very close to the hand-optimized
parser. They also show that using an iterator which always checks for the end of input condition has its price.
Points of interest
The term Parsing Expression Grammars was coined by Bryan Ford [1]. He also pointed out that this kind of recursive
descent parsing with backtracking is common practice. He showed that only very few rules are needed to define a
powerful framework for parsing. Instead of relying on grammar transformations which eliminate alternatives having
the same beginning (so called left factorization), he invented a parser, which memorizes already taken paths and
achieves linear parsing time for any PEG grammar (the packrat parser).
An important advantage of PEGs comes from the fact that a separate scanner is not necessary. When parsing with a
scanner, one divides the parsing task in a low level part called scanning, and in a high level part, proper parsing. A
scanner typically skips white space and comments and reads items like numbers, operators and identifiers, and
returns an integer value to indicate what it has found. A scanner reduces the necessity to backtrack in some cases.
On the other hand, it introduces a new level of abstraction which was never necessary from a theoretical point of
view, and it can cause problems because of the lack of context information. A typical example is the closing angle
bracket of template instantiations (the > symbol). You get a compiler error if you close two template instantiations at
once using >> (one must use > > instead), because the scanner recognizes >> as a shift operator, a problem which
does not arise if you do not use a scanner. Interestingly, Christoper Diggins [2], who presented his YARD parser on
this site, does not mention the work of Bryan Ford, but uses exactly the PEG ideas including scanner-less parsing.
This is certainly due to the fact that these ideas lie in the air.
History
2005 April: initial version.
2005 April: corrections to chapter PEG versus regular expressions (thanks to Gary Feldman).
References
1. Parsing Expression Grammars, Bryan Ford, MIT, January 2004 [^]
2. A Regular Expression Tokenizer using the YARD Parser, Christopher Diggins, December 2004 [^]
Crafting an interpreter Part 1 - Parsing and Grammars - Martin.Holzherr - CodeProject
https://workspaces.codeproject.com/martinholzherr/crafting-an-interpreter-part-1-parsing-and-grammar[19/06/2014 4:45:12 AM]
Advertise | Privacy | Pricing | Report problem | Mark workspace as inappropriate | Forum | Twitter Copyright CodeProject, 1999-2014

All Rights Reserved. Terms of Use
layout: fixed | fluid

You might also like