#LyX 1.1 created this file. For more info see http://www.lyx.org/ \lyxformat 218 \textclass article \language english \inputencoding auto \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle default \layout Standard \align center \begin_inset Figure size 208 84 file spirit.eps width 4 70 flags 9 \end_inset \newline \size larger Spirit Version 1.2 \layout Title \size giant Spirit \size default \newline Version 1.2 \layout Author By Joel de Guzman \newline Edited by Dan Nuffer, Chris Uzdavinis \newline Layout and HTML by Joel de Guzman, Martijn W. van der Lee \newline Latex/Lyx by Nils Springob \newline Spirit is hosted by SourceForge \layout Standard \added_space_bottom bigskip \noindent \align center \series bold \size large Copyright (c)2001 Joel de Guzman \layout Standard \noindent \family sans \size small This software is provided 'as-is', without any express or implied warranty. In no event will the copyright holder be held liable for any damages arising from the use of this software. \layout Standard \noindent \family sans \size small Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions: \layout Enumerate \noindent \family sans \size small The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required. \layout Enumerate \noindent \family sans \size small Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software. \layout Enumerate \noindent \family sans \size small This notice may not be removed or altered from any source distribution. \newline \layout Paragraph Acknowledgements: \layout Standard Special thanks to Dan Nuffer, John (EBo) David, Chris Uzdavinis, and Doug Gregor. These people are most instrumental in steering Spirit in the right direction. \layout Standard Special thanks also to people who have contributed to the code base and sample code, ported Spirit to various platforms and compilers, gave suggestions , reported and provided bug fixes. Alexander Hirner, Andy Elvey, Bogdan Kushnir, Brett Calcott, Bruce Florman, Changzhe Han, Colin McPhail, Hakki Dogusan, Jan Bares, Joseph Smith, Martijn W. van der Lee, Raghavendra Satish, Remi Delcos, Tom Spilman, Vladimir Prus, W. Scott Dillman, David A. Greene, Bob Bailey, Hartmut Kaiser. \layout Standard Finally special thanks also to people who gave feedback and valuable comments, particularly members of Spirit's Source Forge mailing list and boost.org. \layout Section Introduction \layout Standard Spirit is an object oriented recursive descent parser generator framework implemented using template meta-programming techniques. Expression templates allow us to approximate the syntax of Extended Backus Normal Form (EBNF) completely in C++. \layout Standard The Spirit framework enables a target grammar to be written exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and, thanks to the generative power of C++ templates, are immediately executable. In retrospect, conventional compiler-compilers or parser-generators have to perform an additional translation step from the source EBNF code to C or C++ code. \layout Standard A simple EBNF grammar snippet: \layout LyX-Code group ::= '(' expr ')' \layout LyX-Code expr1 ::= integer | group \layout LyX-Code expr2 ::= expr1 (('*' expr1) | ('/' expr1))* \layout LyX-Code expr ::= expr2 (('+' expr2) | ('-' expr2))* \layout LyX-Code is approximated using Spirit's facilities as seen in this code snippet: \layout LyX-Code group = '(' >> expr >> ')'; \layout LyX-Code expr1 = integer | group; \layout LyX-Code expr2 = expr1 >> *(('*' >> expr1) | ('/' >> expr1)); \layout LyX-Code expr = expr2 >> *(('+' >> expr2) | ('-' >> expr2)); \layout Standard Through the magic of expression templates, this is perfectly valid and executabl e C++ code. The production rule \family typewriter expr \family default is in fact an object that has a member function \family typewriter parse \family default that does the work given a source code written in the grammar that we have just declared. \layout Standard Certainly we have done some modifications to the original EBNF syntax. This is done so as to conform to C++ syntax rules. Most notably we see the abundance of shift ( \family typewriter >> \family default ) operators. Since there are no 'empty' operators in C++, it is simply not possible to write something like: \layout LyX-Code a b \layout Standard as seen in Math syntax, for example, to mean multiplication or, in our case, as seen in EBNF syntax to mean sequencing ( \family typewriter b \family default should follow \family typewriter a \family default ). The framework uses the shift ( \family typewriter >> \family default ) operator instead for this purpose. We take the ( \family typewriter >> \family default ) operator, with arrows pointing to the right, to mean \begin_inset Quotes eld \end_inset is followed by \begin_inset Quotes erd \end_inset . Thus we write: \layout LyX-Code a >> b \layout Standard The alternative operator ( \family typewriter | \family default ) and the parentheses (grouping) remain as is. The assignment operator ( \family typewriter = \family default ) is used in place of BNF's ( \family typewriter ::= \family default ) Last but not least, the Kleene star ( \family typewriter * \family default ) which used to be a postfix operator in EBNF becomes a prefix. Instead of: \layout LyX-Code a* ... in EBNF syntax, \layout Standard we write: \layout LyX-Code *a ... in Spirit. \layout Standard since there are no postfix stars ( \family typewriter * \family default ) in C/C++. Finally, we terminate each rule with the ubiquitous semi-colon ( \family typewriter ; \family default ). \layout Section The Parser \layout Standard What makes Spirit tick? First, some concepts. The parser class is the most fundamental entity in the framework. Basically, a parser accepts a pair of iterators and returns a match object as its result. The iterators delimit the data currently being parsed. The match object evaluates to \family typewriter true \family default if the parse succeeds, in which case the input is advanced accordingly. Each parser can represent a specific pattern or algorithm, or it can be a more complex parser formed as a composition of other parsers. \layout Standard All parsers inherit from the base template class, parser: \layout LyX-Code template \layout LyX-Code struct parser { \layout LyX-Code /*...*/ \layout LyX-Code DerivedT&; derived(); \layout LyX-Code DerivedT const&; derived() const; \layout LyX-Code }; \layout Standard This class is a protocol base class for all parsers. This is essentially an interface contract. The parser class does not really know how to parse anything but instead relies on the template parameter DerivedT to do the actual parsing. This technique is known as the "Curiously Recurring Template Pattern" in meta-programming circles. This inheritance strategy gives us the power of polymorphism without the virtual function overhead. In essence this is a way to implement compile time polymorphism. \layout Standard Concrete sub-classes inheriting from parser must have a corresponding member function \family typewriter parse() \family default compatible with the conceptual Interface: \layout LyX-Code template \layout LyX-Code match \layout LyX-Code parse(IteratorT&; first, IteratorT last) const; \layout Standard where \family typewriter first \family default points to the current input, \family typewriter last \family default points to one after the end of the input (reminiscent of STL algorithms), \family typewriter match \family default result reports the parsing success (or failure). Note that \family typewriter first \family default is passed in by reference. This is advanced appropriately when a match is found, otherwise its position is undefined. \layout Quotation \noindent \family sans \series bold \size large \noun on Token Types \begin_deeper \layout Quotation \noindent \family sans \size small Since \family typewriter parser::parse \family sans is a template member function, its \family typewriter IteratorT \family sans template parameter is an abstract concept. This implies that parsers can deal with arbitrary token types. This can be a \family typewriter char \family sans , a \family typewriter wchar_t \family sans , an enum, an integer or a user defined type. Spirit can work on arbitrary token types as long the \family typewriter == \family sans and \family typewriter < \family sans operators are applicable to its instances. \end_deeper \layout Standard \family typewriter IteratorT \family default is an STL compliant forward iterator. It is passed by reference to allow the function to 'move' its position accordingly when a match is found. The parse is considered successful if a portion of the input starting at the current scanner position is matched. The parse function terminates as soon as the iterator finds anything that the parser does not recognize or when the input is exhasuted ( \family typewriter first \family default \family typewriter == last \family default ). \layout Standard A parser can be quite simple. \newline Here is a sample parser that accepts all characters: \layout LyX-Code struct anychar : public parser { \layout LyX-Code template \layout LyX-Code match \layout LyX-Code parse(IteratorT& first, IteratorT last) const \layout LyX-Code { \layout LyX-Code if (first != last) \layout LyX-Code { \layout LyX-Code ++first; \layout LyX-Code return match(1); \layout LyX-Code } \layout LyX-Code return match(); \layout LyX-Code } \layout LyX-Code } \layout Standard A match object is returned by the parse function. Most importantly, the match object reports the success of the parse; i.e. evaluates to \family typewriter true \family default if the parse function is successful, \family typewriter false \family default otherwise. If the parse is successful, the match object may also be queried to report the number of characters matched ( \family typewriter match.length() \family default ). The length is non-negative if the match is successful, and the typical length of a parse failure is -1. Note: a default-constructed match object represents an unsuccessful parse. \newline Here is a code snippet: \layout LyX-Code match hit = parse(first, last); \layout LyX-Code bool success = hit; \layout LyX-Code int length = hit.length(); \layout Standard Clients of the framework generally do not need to write their own hand-coded parsers at all. Spirit has an immense repertoire of pre-defined parsers covering all aspects of syntax and semantics analysis. We shall examine this repertoire of parsers in the following sections. In the rare case where a specific functionality is not available, it is extremely easy to write a user-defined parser. The ease in writing a parser entity is the main reason for Spirit's extensibili ty. \layout Section Primitives \layout Quotation \noindent \family sans \series bold \size large \noun on Character and Phrase Levels \begin_deeper \layout Quotation \noindent \family sans \size small Typical parsers regard the processing of characters (symbols that form words or lexemes) and phrases (words that form sentences) as separate domains. Entities such as reserved words, operators, literal strings, numerical constants, etc., which constitute the terminals of a grammar are usually extracted first in a separate lexical analysis stage. \layout Quotation \noindent \family sans \size small At this point, as evident in the examples we have so far, it is important to note that contrary to standard practice, the Spirit framework handles parsing tasks at both the character level as well as the phrase level. One may consider that a lexical analyzer is seamlessly integrated in the Spirit framework. \layout Quotation \noindent \family sans \size small Although the Spirit parser library does not need a separate lexical analyzer, there is no reason why we cannot have one. One can always have as many parser layers as needed. In theory, one may create a preprocessor, a lexical analyzer and a parser proper, all using the same framework. \end_deeper \layout Standard The framework predefines some parser primitives. The most common is " \family typewriter chlit \family default " (character literal), " \family typewriter range \family default " (character range), and " \family typewriter strlit \family default " (string literal). \layout Standard Going back to our original example, the character literals '(', ')', '+', '-', '*' and '/' in the grammar declaration are \family typewriter chlit \family default objects that are implicitly created behind the scenes. One may prefer to declare these explicitly as: \layout LyX-Code chlit<> plus('+'); \layout LyX-Code chlit<> minus('-'); \layout LyX-Code chlit<> times('*'); \layout LyX-Code chlit<> divide('/'); \layout LyX-Code chlit<> oppar('('); \layout LyX-Code chlit<> clpar(')'); \layout Standard The framework also predefines a few more utility parsers. There is " \family typewriter epsilon \family default " which matches the null string (always returns a sucessful match with 0 length), " \family typewriter anychar \family default " which matches any single character (including the ' \backslash 0') and " \family typewriter nothing \family default " which never matches anything and always fails. Finally, there is a full repertoire of single character parsers: \family typewriter alnum \family default , \family typewriter alpha \family default , \family typewriter cntrl \family default , \family typewriter digit \family default , \family typewriter graph \family default , \family typewriter lower \family default , \family typewriter print \family default , \family typewriter punct \family default , \family typewriter space \family default , \family typewriter upper \family default and \family typewriter xdigit \family default . \layout Itemize \series bold The Complete List of Parser Primitives \begin_deeper \layout Itemize \series bold Basics \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter chlit \end_inset \begin_inset Text \layout Standard \family sans Character literal \end_inset \begin_inset Text \layout Standard \family sans Example: \family default \family typewriter chlit<>('X'); \end_inset \begin_inset Text \layout Standard \family typewriter range \end_inset \begin_inset Text \layout Standard \family sans Character range \end_inset \begin_inset Text \layout Standard \family sans Example: \family default \family typewriter range<>('0','9'); \end_inset \begin_inset Text \layout Standard \family typewriter strlit \end_inset \begin_inset Text \layout Standard \family sans String literal \end_inset \begin_inset Text \layout Standard \family sans Example: \family typewriter strlit<>( \begin_inset Quotes eld \end_inset Hello \begin_inset Quotes eld \end_inset ); \end_inset \end_inset \layout Standard \family typewriter chlit \family default and \family typewriter range \family default are template classes parameterized by character type which defaults to \family typewriter char \family default . \newline Examples: \layout LyX-Code chlit, range \layout Standard \family typewriter strlit \family default is a template class parameterized by the string type which defaults to \family typewriter cstring<> \family default . The auxilliary class \family typewriter cstring \family default is a template class parameterized by the character type which defaults to \family typewriter char \family default . \newline Examples: \layout LyX-Code strlit> \layout LyX-Code strlit> \end_deeper \layout Itemize \series bold Functors \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter ch_p(ch) \end_inset \begin_inset Text \layout Standard \family sans Functor version of chlit \end_inset \begin_inset Text \layout Standard \family typewriter range_p(from, to) \end_inset \begin_inset Text \layout Standard \family sans Functor version of range \end_inset \begin_inset Text \layout Standard \family typewriter str_p(string) \end_inset \begin_inset Text \layout Standard \family sans Functor version of strlit \end_inset \end_inset \layout Standard These functors are designed to be used within expressions. These functors are actually objects that create parsers. Example: \layout LyX-Code helloworld = str_p( \begin_inset Quotes eld \end_inset hello \begin_inset Quotes erd \end_inset ) >> str_p( \begin_inset Quotes eld \end_inset world \begin_inset Quotes erd \end_inset ); \end_deeper \layout Itemize \series bold Utility Primitives \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter anychar \end_inset \begin_inset Text \layout Standard \family sans Matches any character \end_inset \begin_inset Text \layout Standard \family typewriter nothing \end_inset \begin_inset Text \layout Standard \family sans Matches nothing \end_inset \begin_inset Text \layout Standard \family typewriter epsilon \end_inset \begin_inset Text \layout Standard \family sans The epsilon, always a sucessful match with 0 length \end_inset \begin_inset Text \layout Standard \family typewriter alnum \end_inset \begin_inset Text \layout Standard \family sans Alpha-numeric characters \end_inset \begin_inset Text \layout Standard \family typewriter alpha \end_inset \begin_inset Text \layout Standard \family sans Alphabetic characters \end_inset \begin_inset Text \layout Standard \family typewriter cntrl \end_inset \begin_inset Text \layout Standard \family sans Control characters \end_inset \begin_inset Text \layout Standard \family typewriter digit \end_inset \begin_inset Text \layout Standard \family sans Numeric digits \end_inset \begin_inset Text \layout Standard \family typewriter graph \end_inset \begin_inset Text \layout Standard \family sans Non-space printing characters \end_inset \begin_inset Text \layout Standard \family typewriter lower \end_inset \begin_inset Text \layout Standard \family sans Lower case letters \end_inset \begin_inset Text \layout Standard \family typewriter print \end_inset \begin_inset Text \layout Standard \family sans Printable charachters \end_inset \begin_inset Text \layout Standard \family typewriter punct \end_inset \begin_inset Text \layout Standard \family sans Punctuation symbols \end_inset \begin_inset Text \layout Standard \family typewriter space \end_inset \begin_inset Text \layout Standard \family sans Space, tab, return, etc. \end_inset \begin_inset Text \layout Standard \family typewriter upper \end_inset \begin_inset Text \layout Standard \family sans Upper case letters \end_inset \begin_inset Text \layout Standard \family typewriter xdigit \end_inset \begin_inset Text \layout Standard \family sans Hexadecimal digits \end_inset \end_inset \end_deeper \end_deeper \layout Section Numerics \layout Standard The framework provides a couple of predefined objects for parsing signed and unsigned integers and real numbers. The following table enumerates the numeric parsers available. \layout Itemize \series bold Numeric Parsers \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter uint_p \end_inset \begin_inset Text \layout Standard \family sans Unsigned integers \end_inset \begin_inset Text \layout Standard \family typewriter int_p \end_inset \begin_inset Text \layout Standard \family sans Signed integers \end_inset \begin_inset Text \layout Standard \family typewriter ureal_p \end_inset \begin_inset Text \layout Standard \family sans Unsigned floating point numbers C/C++ format, e.g 123.456E78 \end_inset \begin_inset Text \layout Standard \family typewriter real_p \end_inset \begin_inset Text \layout Standard \family sans Signed floating point numbers C/C++ format, e.g -123.456E78 \end_inset \end_inset \end_deeper \layout Standard These objects are parser functors designed to be used within expressions. Here's an example of a rule that parses comma delimited list of numbers: \layout LyX-Code list_of_numbers = real_p >> *(',' >> real_p); \layout Standard Later, we shall see how we can extract the actual numbers parsed by the numeric parsers. We shall deal with this when we get to the section on specialized actions. \layout Section Character Sets \layout Quotation \noindent \family sans \series bold \size large \noun on Sparse bit vectors \begin_deeper \layout Quotation \noindent \family sans \size small In order to accomodate 16/32 and 64 bit characters, the chset class is internall y implemented as a sparse bit/boolean set. The implementation uses a sorted vector of disjoint ranges (range_runs). The set is constructed from ranges such that adjacent or overlapping ranges are coalesced. \layout Quotation \noindent \family sans \size small range_runs are very space-economical in situations where there are lots of ranges and a few individual disjoint values. Searching is O(log n) where n is the number of ranges. \layout Quotation \noindent \family sans \size small In the near future, the implementation can be further optimized to use the standard library's bit vector whenever possible. \end_deeper \layout Standard The character set \family typewriter chset \family default matches a set of characters over a finite range bounded by the limits of its template parameter \family typewriter CharT \family default . This class is an optimization of a parser that acts on a set of single characters. The template class is parameterized by the character type ( \family typewriter CharT \family default ) and can work efficiently with 8, 16 and 32 and even 64 bit characters. \layout Standard \family typewriter chsets \family default are constructed from literals (e.g. 'x'), \family typewriter chlits \family default , \family typewriter ranges \family default , \family typewriter anychar \family default and \family typewriter nothing \family default (see primitives section above) or copy-constructed from another \family typewriter chset \family default . The \family typewriter chset \family default class uses a copy-on-write scheme that enables instances to be passed along easily by value. \newline Examples: \layout LyX-Code chset<> s1('x'); \layout LyX-Code chset<> s2(anychar - s1); \layout Standard Optionally, character sets may also be constructed using a definition string following a syntax that resembles posix style regular expression character sets, except that double quotes delimit the set elements instead of square brackets and there is no special negation '^' character. \layout Standard Character set definition strings follow the meta-syntax: \layout LyX-Code range = anychar >> '-' >> anychar; \layout LyX-Code set = *(range | anychar); \layout Standard Since we are defining the set using a \family typewriter string \family default , the usual C/C++ literal string syntax rules apply. \newline Examples: \layout LyX-Code chset<> s1( \begin_inset Quotes eld \end_inset a-zA-Z \begin_inset Quotes erd \end_inset ); // alphabetic characters \layout LyX-Code chset<> s2( \begin_inset Quotes eld \end_inset 0-9a-fA-F \begin_inset Quotes erd \end_inset ); // hexadecimal characters \layout LyX-Code chset<> s3( \begin_inset Quotes eld \end_inset actgACTG \begin_inset Quotes erd \end_inset ); // DNA identifiers \layout LyX-Code chset<> s4( \begin_inset Quotes eld \end_inset \backslash xff \backslash xfe \begin_inset Quotes erd \end_inset ); // Hexadecimal 0xFF and 0xFE \layout Standard The standard Spirit set operators apply (see operators section below) plus an additional character-set-specific inverse (negation) operator: \layout LyX-Code ~a // Set inverse \layout LyX-Code a | b // Set union \layout LyX-Code a & b // Set intersection \layout LyX-Code a - b // Set difference \layout LyX-Code a ^ b // Set xor \layout Standard where operands \family typewriter a \family default and \family typewriter b \family default are both chsets or one of the operand is either a literal character, a \family typewriter chlit \family default , a \family typewriter range \family default , \family typewriter anychar \family default or \family typewriter nothing \family default . Special optimized overloads are provided for anychar or nothing operands. A \family typewriter nothing \family default operand is converted to an empty set, while an \family typewriter anychar \family default operand is converted to a set having elements of the full range of the character type used (e.g. 0-255 for unsigned 8 bit chars). \layout Standard A special case is \family typewriter ~anychar \family default which yields nothing, but \family typewriter ~nothing \family default is illegal. Inversion of \family typewriter anychar \family default is asymmetrical, a one-way trip comparable to converting \family typewriter T* \family default to a \family typewriter void* \family default . \layout Standard An assignment and a copy construtor are provided to allow mixed conversion from a \family typewriter chset \family default to a \family typewriter chset \family default . This is possible when type \family typewriter A \family default is convertible to type \family typewriter B \family default . For example if type \family typewriter A \family default is a \family typewriter char \family default and type \family typewriter B \family default is a \family typewriter wchar_t \family default . \layout Section Symbols \layout Quotation \noindent \family sans \series bold \size large \noun on Ternary State Trees \begin_deeper \layout Quotation \noindent \family sans \size small The actual set implementation is supplied by the SetT template parameter (3rd template parameter of the symbols class) . By default, this uses the tst class which is an implementation of the Ternary Search Tree. \layout Quotation \noindent \family sans \size small Ternary Search Trees are faster than hashing for many typical search problems especially when the search interface is iterator based. Searching for a string of length k in a ternary search tree with n strings will require at most O(log n+k) character comparisons. TSTs are many times faster than hash tables for unsuccessful searches since mismatches are discovered earlier after examining only a few characters. Hash tables always examine an entire key when searching. \layout Quotation \noindent \family sans \size small For details see http://www.cs.princeton.edu/~rs/strings/ \end_deeper \layout Standard This class symbols implements a symbol table. The symbol table holds a dictionary of symbols where each symbol is a sequence of \family typewriter CharT \family default s. The template class is parameterized by the character type ( \family typewriter CharT \family default ) can work efficiently with 8, 16, 32 and even 64 bit characters. \layout Standard Traditionally, symbol table management is maintained seperately outside the BNF grammar through semantic actions. Contrary to standard practice, the Spirit symbol table \family typewriter symbols \family default \emph on is-a \emph default parser. An instance of which may be used anywhere in the EBNF grammar specification. It is an example of a dynamic parser. A dynamic parser is characterized by its ability to modify its behavior at run time. Initially, an empty symbols object matches nothing. At any time, symbols may be added, thus, dynamically altering its behavior. \layout Standard Each entry in a symbol table has an associated mutable data slot. In this regard, one can view the symbol table as an associative container (or map) of key-value pairs where the keys are strings. The data associated with each symbol may be modified any time. Later, when we get to the section on specialized actions, we shall see how that is done. \layout Standard The \family typewriter symbols \family default class expects two template parameters (actually there is a third, see detail box above). The first parameter ( \family typewriter T \family default ) specifies the data type associated with each symbol (defaults to \family typewriter int \family default ) and the second parameter ( \family typewriter CharT \family default ) specifies the character type of the symbols (defaults to \family typewriter char \family default ). Here are some sample declarations: \layout LyX-Code symbols<> sym; \layout LyX-Code symbols sym2; \layout LyX-Code \layout LyX-Code struct my_info { \layout LyX-Code \layout LyX-Code int id; \layout LyX-Code double value; \layout LyX-Code }; \layout LyX-Code \layout LyX-Code symbols sym3; \layout Standard After having declared our symbol tables, symbols may be added statically using the construct: \layout LyX-Code sym = a, b, c, d ...; \layout Standard where \family typewriter sym \family default is a symbol table and \family typewriter a..d \family default etc. are strings. Note that the \emph on comma operator \emph default is separating the items being added to the symbol table, through an assignment. Due to operator overloading this is possible and correct (though it may take a little getting used to) and is a concise way to initialize the symbol table with many symbols. Also, it is perfectly valid to make multiple assignments to a symbol table to iteratively add symbols (or groups of symbols) at different times. \layout Standard Simple example: \layout LyX-Code sym = \begin_inset Quotes eld \end_inset pineapple \begin_inset Quotes erd \end_inset , \begin_inset Quotes eld \end_inset orange \begin_inset Quotes erd \end_inset , \begin_inset Quotes eld \end_inset banana \begin_inset Quotes erd \end_inset , \begin_inset Quotes eld \end_inset apple \begin_inset Quotes erd \end_inset , \begin_inset Quotes eld \end_inset mango \begin_inset Quotes erd \end_inset \layout Standard Note that it is invalid to add the same symbol multiple times to a symbol table, though you may modify the value associated with a symbol artibrarily many times. \layout Standard Now, we may use \family typewriter sym \family default in the grammar. Example: \layout LyX-Code fruits = sym >> *(',' >> sym); \layout Standard This simple grammar \family typewriter fruits \family default accepts a comma separated list of my favorite fruits. Using specialized actions, we can dynamically add more \emph on fruity \emph default entries into our symbol table \family typewriter sym \family default . We shall deal with this later. \layout Section Operators \layout Standard Operators are used as a means for object composition and embedding. Simple parsers may be composed to form composites through operator overloading , crafted to approximate the syntax of an Extended Backus-Normal Form (EBNF) variant. An expression such as: \layout LyX-Code a | b \layout Standard actually yields a new parser type which is a composite of its operands, \family typewriter a \family default and \family typewriter b \family default . Taking this example further, if \family typewriter a \family default and \family typewriter b \family default were of type chlit<>, the result would have the composite type: \layout LyX-Code Alternative , chlit<> > \layout Standard A suite of operators facilitate the composition of parsers: \layout Itemize The Complete List of Operators \begin_deeper \layout Itemize Set operators \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter a | b \end_inset \begin_inset Text \layout Standard \family sans Union \end_inset \begin_inset Text \layout Standard \family sans match a or b. Also referred to as alternatives. \end_inset \begin_inset Text \layout Standard \family typewriter a & b \end_inset \begin_inset Text \layout Standard \family sans Intersection \end_inset \begin_inset Text \layout Standard \family sans match a and b. \end_inset \begin_inset Text \layout Standard \family typewriter a - b \end_inset \begin_inset Text \layout Standard \family sans Difference \end_inset \begin_inset Text \layout Standard \family sans match a but not b. If b's matched text is shorter \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \family sans than a's matched text, a successfull match is made. \end_inset \begin_inset Text \layout Standard \family typewriter a ^ b \end_inset \begin_inset Text \layout Standard \family sans XOR \end_inset \begin_inset Text \layout Standard \family sans match a or b, but not both \end_inset \end_inset \layout Standard Alternative operands are tried one by one on a first come first served basis starting from the leftmost operand. After a successfully matched alternative is found, the parser concludes its search, essentially short-circuiting the search for other potentially viable candidates. This short-circuiting implicitly gives the highest priority to the leftmost alternative. \layout Standard Short-circuiting is done in the same manner as C or C++'s logical expressions; e.g. if (x < 3 || y < 2) where, if x evaluates to be less than 3, the y < 2 test is not done at all. In addition to providing an implicit priority rule for alternatives which is necessary, given the non-deterministic nature of the Spirit parser compiler, short-circuiting improves the execution time. TIP: If the order of your alternatives is logically irrelevant, strive to put the (expected) most common choice first for maximum efficiency. \end_deeper \layout Itemize Sequencing operators \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter a >> b \end_inset \begin_inset Text \layout Standard \family sans Sequence \end_inset \begin_inset Text \layout Standard \family sans match a and b in sequence \end_inset \begin_inset Text \layout Standard \family typewriter a && b \end_inset \begin_inset Text \layout Standard \family sans Sequencial-and \end_inset \begin_inset Text \layout Standard \family sans same as above, match a and b in sequence \end_inset \begin_inset Text \layout Standard \family typewriter a || b \end_inset \begin_inset Text \layout Standard \family sans Sequential-or \end_inset \begin_inset Text \layout Standard \family sans match a or b in sequence \end_inset \end_inset \layout Standard The sequencing operator \family typewriter >> \family default can alternatively be thought of as the sequential-and operator. The expression \family typewriter a && b \family default reads as match \family typewriter a and b \family default in sequence. Continuing this logic, we can also have a sequential-or operator where the expression \family typewriter a || b \family default reads as match \family typewriter a or b \family default and in sequence. That is, if both \family typewriter a \family default and \family typewriter b \family default match, it must be in sequence; this is equivalent to \family typewriter a | b | a >> b \family default . \end_deeper \layout Itemize Loops \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter *a \end_inset \begin_inset Text \layout Standard \family sans Kleene star \end_inset \begin_inset Text \layout Standard \family sans match a zero (0) or more times \end_inset \begin_inset Text \layout Standard \family typewriter +a \end_inset \begin_inset Text \layout Standard \family sans Positive \end_inset \begin_inset Text \layout Standard \family sans match a one (1) or more times \end_inset \begin_inset Text \layout Standard \family typewriter !a \end_inset \begin_inset Text \layout Standard \family sans Optional \end_inset \begin_inset Text \layout Standard \family sans match a zero (0) or one (1) time \end_inset \end_inset \end_deeper \end_deeper \layout Standard Operator precedence and grouping: \newline Since we are defining our meta-language in C++, we follow C/C++'s operator precedence rules. Groups override this (e.g., \family typewriter *(a | b) \family default reads: match \family typewriter a \family default or \family typewriter b \family default zero (0) or more times). \layout Quotation \noindent \family sans \size large \noun on Primitive type operands \begin_deeper \layout Quotation \noindent \align left \family sans It is important to emphasize that only one of the operands may be a primitive data type. C++ forbids the overloading operator for primitives such as ints, chars and pointers (e.g. \family typewriter char* \family sans ). At least one of the operands must be a user-defined type). \layout Quotation \noindent \align left \family sans Typically, in an expression involving multiple operators, casting the leftmost operand as a parser is enough to 'infect' all the rest of the operands to its right to be acceptable. Examples: \layout LyX-Code \noindent \size small r = 'a' | 'b' | 'c' | 'd'; // ill formed \layout LyX-Code \noindent \size small r = ch_p('a') | 'b' | 'c' | 'd'; // OK \end_deeper \layout Standard For binary operators, one of the operands but not both may be a \family typewriter char \family default , \family typewriter wchar_t \family default , \family typewriter char const* \family default or \family typewriter wchar_t const* \family default . Examples: \layout LyX-Code a | 'x'; \layout LyX-Code a - \begin_inset Quotes eld \end_inset Hello World \begin_inset Quotes erd \end_inset ; \layout Standard where \family typewriter a \family default is a parser object. \layout Standard A few simple classes, when composed and structured in a hierarchy, form a very powerful object oriented recursive descent parsing engine. These classes provide the infrastructure needed for the construction of more complex parsers. The final parser composite is a non-deterministic recursive descent parser with infinite look-ahead. Top-down descent traverses the hierarchy, checking each object for a match, backtracking and checking all possible alternatives. \layout Standard Non-determinism is not an absolute limitation though. The character set and the symbol table introduce some form of determinism. For instance, whereas \family typewriter 'a' | 'b' | 'c' \family default is non-deterministic, \family typewriter chset ('a') | 'b' | 'c' \family default is absolutely deterministic. Soon, we shall see more deterministic parsers in future versions of the framework. \layout Standard The flexibility of object embedding and composition combined with recursion opens up a unique approach to parsing. Subclasses are free to form aggregates and algorithms of arbitrary complexity. Complex parsers can be created with the composition of only a few primitive classes. \layout Standard The framework is designed to be fully open-ended and extensible. New primitives or composites, from the trivial to the complex, may be added any time. Composition happens at compile time (static). Again, this is not a requirement. There are facilities, such as the symbol table, that enable dynamic creation and modification of parser behavior and structure. This is possible through the expressive flexibility of C++ template meta-progra mming. \layout Section Loops \layout Standard So far we have introduced a couple of EBNF operators that deal with looping. We have the (+) positive operator, which matches the preceding symbol one (1) or more times, as well as the Kleene star (*) which matches the preceding symbol zero (0) or more times. \layout Standard If we look more closely, take note that we generalized the \series bold optional \series default expression of the form \family typewriter !a \family default above in the same category as loops. This is logical, considering that the \series bold optional \series default matches the expression following it zero (0) or (1) time. Taking this further, we may want to have a generalized loop operator. To some this may seem to be a case of overkill. Yet there are grammars that are impractical and cumbersome, if not impossible, for the basic EBNF iteration syntax to specify. Examples: \layout Itemize A file name may have a maximum of 255 characters only. \layout Itemize A specific bitmap file format has exactly 4096 RGB color information. \layout Itemize A 32 bit binary string (1..32 1s or 0s). \layout Standard Other than the Kleene star (*), the Positive closure (+), and the optional (!), a more flexible mechanism for looping is provided for by the framework. The following are availabe on all parsers. \layout Itemize Loop Constructs \begin_deeper \layout Standard \begin_inset Tabular \begin_inset Text \layout Standard \family typewriter a.repeat(); \end_inset \begin_inset Text \layout Standard \family sans Repeat a, zero or more times \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \family sans Same as a.repeat(0,more); and *a; \end_inset \begin_inset Text \layout Standard \family typewriter a.repeat(n); \end_inset \begin_inset Text \layout Standard \family sans Repeat a, exactly n times \end_inset \begin_inset Text \layout Standard \family typewriter a(n); \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \family typewriter a.repeat(n1, n2); \end_inset \begin_inset Text \layout Standard \family sans Repeat a at least n1 times and at most n2 times \end_inset \begin_inset Text \layout Standard \family typewriter a.repeat(n, to); \end_inset \begin_inset Text \layout Standard \end_inset \begin_inset Text \layout Standard \family typewriter a.repeat(n, more); \end_inset \begin_inset Text \layout Standard \family sans Repeat a, n or more times \end_inset \begin_inset Text \layout Standard \family typewriter a(n, more); \end_inset \begin_inset Text \layout Standard \end_inset \end_inset \end_deeper \layout Section The Rule \layout Standard The rule is a polymorphic parser that acts as a named place-holder capturing the behavior of an EBNF expression assigned to it. Naming an EBNF expression allows it to be referenced later by another EBNF expression. The rule is a template class parameterized by the type of iterator ( \family typewriter IteratorT \family default ) that it will work on. The default iterator type is \family typewriter char const* \family default . \layout Standard The rule class models EBNF's production rule. Example: \layout LyX-Code rule<> a_rule = *(a | b) & +(c | d | e); \layout Standard The type and functionality of the right-hand EBNF expression, which may be arbitrarily complex, is encoded in the rule named \family typewriter a_rule \family default . \family typewriter a_rule \family default may now be referenced elsewhere in the grammar: \layout LyX-Code rule<> another_rule = f >> g >> h >> a_rule; \layout Quotation \noindent \family sans \size large \noun on Rules and virtual functions \begin_deeper \layout Quotation \noindent \family sans \size small rules straddle the border between static and dynamic C++. In effect, a rule transforms compile-time polymorphism (using templates) into run-time polymorphism (using virtual functions). This is necessary due to C++'s inability to automatically declare a variable of a type deduced from an arbitrarily complex expression in the right-hand side (rhs) of an assignment . Basically, we want to do something like: \layout Quotation \noindent \family typewriter \size small T rule = an_arbitrarily_complex_expression; \layout Quotation \noindent \family sans \size small without having to know or care about the resulting type of the right-hand side of the assignment expression. Apart from this, we also need a facility to forward declare an unknown type: \layout Quotation \noindent \family typewriter \size small T rule; \layout Quotation \noindent \family typewriter \size small ... \layout Quotation \noindent \family typewriter \size small rule = a | b; \layout Quotation \noindent \family sans \size small These limitations lead us to this implementation of rules. This comes at the expense of the overhead of a virtual function call, once through each invocation of a rule. \end_deeper \layout Standard When a rule is invoked by an EBNF expression, the rule is held by the expression by reference. It is the responsibility of the client to ensure that the referenced rule stays in scope and does not get destructed while it is being referenced (cyclic dependency ). \layout Standard Forward declarations: \layout Standard rules may be declared before being defined to allow cyclic structures typically found in BNF declarations. Example: \layout LyX-Code rule<> a, b, c; \layout LyX-Code a = b | a; \layout LyX-Code b = c | a; \layout Itemize Recursion: \begin_deeper \layout Standard The right-hand side of a rule may reference other rules, including itself. The limitation is that direct or indirect left recursion is not allowed (this is an unchecked run-time error that results in an infinite loop). This is typical of top-down parsers. Example: \layout LyX-Code a = a | b; // infinite loop! \layout Quotation \noindent \family sans \size large \noun on Dynamic Parsers \begin_deeper \layout Quotation \noindent \family sans \size small We have seen that the symbol table is a dynamic parser. In fact, hosting declarative EBNF in imperative C++ yields an interesting blend. We have the best of both worlds. We have the ability to conveniently modify the grammar at run time using imperative constructs such as if, else statements. Example: \layout Quotation \noindent \family typewriter \size small if (feature_is_available) \layout Quotation \noindent \family typewriter \size small r = add_this_feature; \layout Quotation \noindent \family sans \size small Rules are essentially dynamic parsers. A dynamic parser is characterized by its ability to modify its behavior at run time. Initially, an undefined rule matches nothing. At any time, alternatives may be added, thus, dynamically altering its behavior. \end_deeper \end_deeper \layout Itemize Undefined rules: \begin_deeper \layout Standard An undefined rule matches nothing. \end_deeper \layout Itemize Multiple declarations: \begin_deeper \layout Standard Some BNF variants allow multiple declarations of a rule. The declarations are taken as alternatives. Example, the definition: \layout LyX-Code r = a; \layout LyX-Code r = b; \layout Standard is equivalent to: \layout LyX-Code r = a | b; \end_deeper \layout Itemize No start rule: \begin_deeper \layout Standard Typically, parsers have what is called a start symbol, chosen to be the root of the grammar where parsing starts. The Spirit parser compiler has no notion of a start symbol. Any rule can be a start symbol. This feature promotes step-wise creation of parsers. We can build parsers from the bottom up while fully testing each level or module up untill we get to the top-most level. \end_deeper \layout Section Iterators and Parsing \layout Standard Simple parser classes are composed and structured to form more complex and powerful parsers. Such parser structures form the basic infrastructure of Spirit. Behind the scenes, iterators interweave this infrastructure. Iterators extract data from the input, parceling, potentially modifying or filtering, and then finally relegating the result to individual parser elements on demand until the input is exhausted. \layout Standard The iterator concept and interface adheres to the formal definitions of the C++ Standard Template Library (STL). We can use iterators supplied by STL's containers (e.g. \family typewriter list \family default , \family typewriter vector \family default , \family typewriter string \family default , etc.) as is, or perhaps write our own or use those pre-defined by the Spirit framework. Iterators can be very simple and typically defaults to a pointer (e.g. c \family typewriter har const* \family default ). Iterators can be moderately complex such as the \family typewriter scanner \family default and the \family typewriter multi_pass \family default ; these are iterator adapters that modify the behavior of other iterators that they are enclosing. At the other end of the spectrum, iterators can be quite complex; for instance, an iterator adapter that wraps a lexer such as LEX (In the near future, there is a plan to have a complete lexer iterator). \layout Quotation \noindent \family sans \size large \noun on Input Iterators and multi_pass \begin_deeper \layout Quotation \noindent \family sans \size small Input iterators may still be used with backtracking iterators. The \family typewriter multi_pass \family sans iterator transforms an input iterator into a forward iterator. \layout Quotation \noindent \family sans \size small In general, Spirit is a backtracking parser. This is not a requirement though. In the near future, we shall see more deterministic parsers that require no more than 1 character (token) of lookahead. Such parsers allow us to use input iterators such as the \family typewriter istream_iterator \family sans as is. \end_deeper \layout Standard Generally speaking, Spirit is a backtracking parser. The implication of this is that at some point, the iterator position will have to be saved to allow the parser to backtrack to that point. Thus, for backtracking to work, the framework requires at least a forward iterator. \layout Standard The parse member function may be called directly. For extremely simple parsing tasks, it is advisable not to use the rule at all. Apart from the fact that there's an unavoidable virtual function call overhead for each rule, the rule tightly couples a parser to an iterator. A rule once declared can only work with a specific type of iterator. Any parser linked to a rule or references a rule will in turn be coupled to the rule's specific iterator type. \layout Paragraph* Parsers, rules and iterators: \layout Standard To clarify, let's start with an example: we want to parse a file path (to avoid burying ourselves in details, the following simplistic grammar is meant only as an example to get the message across clearly). \layout LyX-Code *alnum >> *('/' >> *alnum) >> '.' >> *alnum \layout Standard This expression evaluates to a parser. Since this parser is not tied yet to a specific rule, it can use just about any iterator. Here's an example using the expression above to parse from C strings. \layout LyX-Code char const* str = \begin_inset Quotes eld \end_inset boost/spirit/spirit.hpp \begin_inset Quotes erd \end_inset ; // our input \layout LyX-Code char const* first = str; \layout LyX-Code char const* last = str + strlen(str); \layout LyX-Code \layout LyX-Code /***/ \layout LyX-Code \layout LyX-Code match hit = (*alnum >> *('/' >> *alnum) >> '.' >> *alnum).parse(first, last); \layout Standard Here's an example using the same expression above, this time, the input comes from a \family typewriter list \family default : \layout LyX-Code list v; // our input \layout LyX-Code list::iterator first = v.begin(); \layout LyX-Code list::iterator last = v.end(); \layout LyX-Code \layout LyX-Code /***/ \layout LyX-Code \layout LyX-Code match hit = (*alnum >> *('/' >> *alnum) >> '.' >> *alnum).parse(first, last); \layout Standard If we were to use rules in the examples above, we need to explicitly parametize and couple the rule to the iterator type. In the first example, we have: \layout LyX-Code char const* str = \begin_inset Quotes eld \end_inset boost/spirit/spirit.hpp \begin_inset Quotes erd \end_inset ; // our input \layout LyX-Code char const* first = str; \layout LyX-Code char const* last = str + strlen(str); \layout LyX-Code \layout LyX-Code /***/ \layout LyX-Code \layout LyX-Code typedef char const* iterator_t; \layout LyX-Code typedef rule rule_t; \layout LyX-Code rule_t path = *alnum >> *('/' >> *alnum) >> '.' >> *alnum; \layout LyX-Code match hit = path.parse(first, last); \layout Standard The rule \family typewriter path \family default is tied to \family typewriter char const* \family default iterator type and cannot be used to parse input from another iterator type. We have to re-declare the rule to use \family typewriter list::iterator \family default to make it usable in our second example: \layout LyX-Code list v; // our input \layout LyX-Code list::iterator first = v.begin(); \layout LyX-Code list::iterator last = v.end(); \layout LyX-Code \layout LyX-Code /***/ \layout LyX-Code \layout LyX-Code typedef list::iterator iterator_t; \layout LyX-Code typedef rule; rule_t; \layout LyX-Code rule_t path = *alnum >> *('/' >> *alnum) >> '.' >> *alnum; \layout LyX-Code match hit = path.parse(first, last); \layout Standard Notice the use of typedefs for \family typewriter iterator_t \family default and \family typewriter rule_t \family default . Sooner or later, we will have to use more complex iterators. It is best to use typedefs earlier on. That way, we can easily change the \family typewriter iterator_t \family default type anytime without rewriting the whole grammar. \layout Paragraph Wrapping the grammar in a template class: \layout Standard Another interesting technique that avoids early coupling of the rule to a specific iterator type is to wrap the grammar in a template. Following our trivial example: \layout LyX-Code template ; \layout LyX-Code struct my_grammar { \layout LyX-Code my_grammar() \layout LyX-Code { \layout LyX-Code path = *alnum >> *('/' >> *alnum) >> '.' >> *alnum; \layout LyX-Code } \layout LyX-Code rule path; \layout LyX-Code }; \layout Standard Although this might seem trivial at first, this approach makes a grammar so much more flexible. Decoupling the iterator type from the rules that form a grammar allows the grammar to be used in different contexts possibly using different iterators. We don't care what iterator we are dealing with. \layout Standard Also, as the grammar gets quite complicated, it is a good idea to group parts of the grammar into logical modules. For instance, when writing a language, it might be wise put expressions and statements into separate grammar structs (or classes). We may go further and declare the rules as private to the class while allowing const access to only some of the pertinent productions through inline member functions. This approach allows us to conveniently publish grammars in header files. \layout Section The Scanner \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section The Multi Pass \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Directives \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Semantic Actions \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Specialized Actions \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Reference Wrappers \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Parametric Parsers \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Error Handling \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Debugging \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Future Directions \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Wrap up \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section Annotations \layout Standard -------------------------------------------------------------------------------- ---------------------- \layout Section References \layout Standard -------------------------------------------------------------------------------- ---------------------- \the_end