YOCC: Yet Other Compiler Compilers Mark K. Gardner Brigham Young University Provo, UT 84602 (gardner@osm7.cs.byu.edu) 1.0 Introduction Writing a compiler has always been a challenging task. Some may say overwhelmingly so. There are so many details to consider, and yet many of the details remain the same from compiler to compiler. Luckily, tools have been developed to automate parts of the process. The most widely known of these tools are the scanner generator LEX [LeskÊ75] and the parser generator YACC [JohnsonÊ78]. These tools are commonly available on most Unix¨ systems. There are also free implementations available, called Flex [Flex] and GNU Bison [Bison] respectively. Since both LEX and YACC were developed in the 1970Õs, it is natural to wonder if compiler tools have changed much since then. This paper is a survey of some of the more recently developed tools, focusing primarily on tools to parse LL languages There are also tools available to parser LR languages, but these will not be covered here (see [Muskox], [USSA], [GroschÊ90] and [SpectorÊ88], see also [Aho 86]). 2.0 Other Tools In this section, several alternate compiler creation tools are discussed. All are freely available by anonymous ftp on the Internet. All create predictive parsers for LL languages with formal parameters, local variables and return values similar to procedures. And all use a single specification for both the scanner and the parser. The usual calculator example will be used as an illustration of the programmer interfaces of these tools. For the LEX and YACC implementation, see appendix A. 2.1 RDP RDP is a recursive decent parser generator being developed by Adrian Johnstone for a course on compiler design [JohnstoneÊ94]. It accepts LL(1) attributed grammars with C-language semantic actions and compiles them into predictive parsers. It comes with its own high-speed scanner module and hash-coded symbol table module. The code generation module is supplied by the user in the form of semantic actions in the parser. Although it has a few rough edges, it shows promise for those who need to generate translators quickly. Productions in RDP are translated into C functions and thus can have parameters, local variables, and return values, as well as semantic actions. Writing a specification in RDP feels like writing functions at a higher level. Thus specifications can probably be written more quickly than with YACC. The scanner specification appears to be created automatically from the quoted strings used in the parser specification. One of the strengths of RDP is its extreme simplicity. Not only does it generate parsers, but it adds the necessary support systems to make a parser into a complete translator. It also allows the programÕs command-line switches and built-in help information to be included as part of the specification. It comes with an extended BNF (EBNF)[1] grammar for Pascal and for Toy, a small interpreter. It also comes with a specification of itself to be processed by the hand-coded version... Ògood for boggling undergraduateÕs minds with.Ó The calculator example is presented in appendix B. ---------------------------------------------------------------------------- [1] EBNF extends BNF by allowing the use of the repetition operators of regular expressions. This removes the need to specify common items, such as lists, by recursion. It also makes the rules more concise and readable. ---------------------------------------------------------------------------- 2.2 Coco/R Coco/R is a simple, yet flexible scanner and parser generator tool designed to create efficient compilers [MšssenbšckÊ93]. It generates Oberon code from a block structured LL(1) attributed EBNF grammar with Oberon semantic actions. Rather than being table-driven, Coco/R generates the necessary procedures to implement a predictive parser. Perhaps the greatest strength of the Coco/R tool is its simple, unified interface. Because of the block structured nature of the specification and its similarity with procedure declarations, specifying a translator in Coco/R is extremely easy (see appendix C for the calculator example). In addition, the topic of error recovery has been addressed in a rather elegant manner, without sacrificing efficiency. This was done by adding the concepts of synchronization points and weak symbols to the grammar specification. Synchronization points are added by preceding tokens with the keyword SYNC. A synchronization point is translated into a loop that skips all symbols up to one of the SYNC tokens (or the end of file). In addition, spurious error messages are avoided by suppressing addition errors that are within a certain number of symbols from the first error. Typical synchronization tokens are the beginning of statements, and declarations. In addition, Coco/R allows one to specify that certain symbols are ÒweakÓ, meaning that they are often missing or mistyped, by preceding the token with the keyword WEAK. If a weak symbol is missing, an error is reported and input is skipped until a member of the first set of the successor is found. This technique avoids skipping safe symbols. A classic example of a weak token is the semicolon used to separate statements in Pascal. As an additional flexibility for advanced users, Coco/R uses templates for instantiating the scanner and parser. And the templates can be customized for more sophisticated applications. 2.3 PCCTS PCCTS (the Purdue Compiler Construction Tool Set) consists of two tools, ANTLR (Another Tool for Language Recognition) and DLG (DFA-based Lexical analyzer Generator), which construct parsers and lexical analyzers [PCCTSÊ92]. Like Coco/R, an emphasis was placed on making the programming interface simple and flexible. However unlike Coco/R, PCCTS was designed from notations used in previous tools or languages Òin order to bring a sense of familiarity and to attenuate the learning curve.Ó Thus, while PCCTS is easier to use than LEX and YACC, it is not as easy to use as Coco/R. Like RDP and Coco/R, PCCTS uses EBNF to specify the grammar. Also like the preceding tools, PCCTS defines parameters, local variables, and return types for productions similar to procedure declarations. And like RDP, PCCTS generates C code from the C-language semantic actions given in the specification. One of the strengths of PCCTS is the option to automatically create an abstract syntax tree (AST) as its intermediate form. Another of its strengths is the specification of both the parser and the scanner in the same file. This reduces the source synchronization problems that are often experienced with LEX and YACC. (as noted before, RDP and Coco/R also possess this strength) PCCTS is more powerful than the preceding tools in that it allows one to specify multiple scanners and multiple grammars in a single specification. The different scanners and grammars can be mixed, giving a extremely flexible format for specifying complex translators. The price for this freedom is the loss of the coherence seen in a Coco/R specification. It should be noted that PCCTS processes LL(k) grammars, where k is the maximum number of lookaheads allowed. It builds mixed model parsers in which those rules which cannot be resolved with one symbol of lookahead are attempted with a lookahead of up to k symbols. Thus it can parse a larger class of languages than RDP or Coco/R. PCCTS generates one C function for every rule in the specification (like RDP and Coco/R). Although it cannot handle as large of class of grammars as YACC, semantic actions can be introduced at any point without introducing ambiguity or adding extra states. PCCTS uses a simple and automatic error detection and recovery mechanism. Upon an error, a PCCTS generated parser reports the problem in terms of the set of legal tokens for that location and consume all tokens until one is found that is in the follow set of the rule in which the error occurred. This error recover technique does not effect the performance of the parser unless an error is detected. However, it does tend to create extra errors by consuming many tokens. Having k symbols of lookahead does give the potential for more sophisticate error recover. However, such recovery techniques have not yet been implemented in PCCTS. 3.0 Ruminations As the title of this section implies, my own thoughts and preferences will be presented here. As always, your mileage may vary. Of the three tools discussed above, I lean towards using Coco/R. There are several reasons for this. First of all, I am biased towards the Pascal / Modula-2 / Oberon line of programming languages (more towards the latter languages than the former). But aside from that, I like the way that Coco/R integrates the parts of a specification in a block structured fashion. And Coco/R is simple without being inflexible. PCCTS, while being extremely flexible and more powerful, contains a myriad of options to remember. It has purposefully inherited from LEX and YACC and thus has a rather cryptic syntax, making it less attractive to me. On the other hand, because PCCTS and RDP use C to implement semantic actions, they will probably be more usable in most environments. On another note, you will notice that all three tools generate predictive parsers. And none of them generate table driven parsers. I did not purposefully bias the paper in this direction, but decided to limit the scope to a contrast of the three tools when the size of the paper grew beyond three pages. However, it does seem somewhat ironic that I found more top down tools than bottom up tools when searching for information. Perhaps it is because of the accessibility of YACC that more bottom up tools are not being created. Or perhaps it is because the power of bottom up parsing is not required as often as folk wisdom would suggest. I believe it is the latter. Do we really need bottom up parsers for programming languages? The answer is probably yes if creating a compiler for C or C++. On the other hand, almost as if they were answering the above question, the designers of PCCTS said, ÒMost programming language constructs that cannot be described in LL(1) can be easily described in LL(2). Typically, LL(2) is only required in a handful of rules yielding a general LL(1) parser with a few anomaliesÓ [PCCTSÊ92]. The experience I have had with Òwell focusedÓ languages such as Pascal, Modula-2, and Oberon leads me to believe the statement. And I have not found myself restricted when using LL grammars rather than the more powerful LR grammars. Are bottom up parsers more efficient than top down parsers? I have been unable to find empirical evidence documenting that such is the case. And although I have also not seen empirical evidence suggesting the reverse, the PCCTS userÕs manual did state ÒAn advantage of LL parsers over LR parsers is that inheritance (rule/subrule communication) is relatively simple and efficient.Ó [PCCTSÊ92]. Perhaps this comes from the additional complexity created when marker productions added to an LR grammar to handle inherited attributes. Top down parsers do not have this problem and handle inherited attributes quite naturally. In addition, if the grammar is L-attributed, synthesized attributes pose no problem either. Thus, even if bottom up parsing is more efficient, top down may be better when syntax directed translation is performed. What does this all mean? Well, since there doesnÕt appear to be any empirical evidence to document the claims of either side, it appears to come down to a matter of taste. If your tastes are as mine, you will probably like any of the tools discussed. However, if you want Òfantastic cosmic powersÓ, perhaps you had better stick with LEX and YACC. =============================================================================== References [AhoÊ86] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, AddisonÐWesley, Reading Mass, 1986. [Bison] Available for anonymous ftp from prep.ai.mit.edu. [Flex] Available for anonymous ftp from prep.ai.mit.edu. [GroschÊ90] J. Grosch, ÒObject-Oriented Attribute GrammarsÓ, technical report, GMD Forshungsstelle an der UniversitŠt Karlsruhe, grosch@karlsruhe.gmd.de, 1990. [JohnsonÊ78] S. C. Johnson, ÒYACC: Yet Another Compiler-CompilerÓ, Computer Science Technical Report #32, Bell Laboratories, Murray Hill, NJ, 1978. [JohnstoneÊ94] A. Johnstone, ÒRDP - A Recursive Descent Parser GeneratorÒ, software documentation, CS Department, Royal Holloway, University of London, available by anonymous ftp from cscx.cs.rhbnc.ac.uk, 1994. [LeskÊ75] M. E. Lesk, ÒLEX: A Lexical Analyzer GeneratorÓ, Computer Science Technical Report #39, Bell Laboratories, Murray Hill, NJ, 1975. [MšssenbšckÊ93] H. Mšssenbšck, Coco/R: A Generator for Fast Compiler Front-Ends, ETH ZŸrich, Institut fur Computersysteme technical report, available for anonymous ftp from neptune.inf.ethz.ch. [Muskox] B. Burshteyn, ÒMuskox: A System for Object-Oriented Design and ProgrammingÓ, userÕs manual, bburshte@pyramid.com. [PCCTSÊ92] T. J. Parr, H. G. Dietz, and W. E. Cohen, ÒPCCTS Reference ManualÓ, SIGPLAN NOTICES, vol.Ê27, no.Ê2, 1992. [SpectorÊ88] D. Spector, ÒEfficient Full LR(1) Parser GenerationÓ, SIGPLAN NOTICES, vol. 23, no. 12, 1988. [USSA] B. Burshteyn, ÒUSSA: Practical GuideÓ, userÕs manual, bburshte@pyramid.com. =============================================================================== Appendix A: Calculator example in LEX and YACC File calc.y: /* definitions */ %{ #include #include }% %start line %union { double dvalue; } %type expression %type term %type factor %token PLUS MINUS TIMES DIVIDE NUMBER %% /* rules and actions */ line : expression '\n' { "printf(%f\nÓ, $1);" }; expression : term '+' term { $$ = $1 + %3 } | term '-' term { $$ = $1 - %3 } ; term : factor '*' factor { $$ = $1 * %3 } | factor '/' factor { $$ = $1 / %3 } ; factor : NUMBER { $$ = $1 }; %% /* support code */ int main(int argc, char *argv[]) { return yyparse(); } File calc.l: /* definitions */ %{ #include #include "calc.tab.h" extern double dvalue; }% digit [0-9] number ({digit}+)|({digit}+"."{digit}+) %% /* rules and actions */ {number} { dvalue = atof(yytext); return NUMBER; } "+" { return PLUS; } "-" { return MINUS; } "*" { return TIMES; } "+" { return DIVIDE; } %% /* support code */ =============================================================================== Appendix B: Calculator example in RDP Note: the scanner is built automatically from the tokens in quotes (and NUMBER is predefined). TITLE("Calculator") SUFFIX("calc") USES("stdio.h") TEXT_MODE program:result ::= [expression:result EOL] EOF { "printf(%f\nÓ, result);" }. expression:float ::= term:result { '+' term:right "result += right;" | (* Add *) '-' term:right "result -= right;" }. (* Subtract *) term:float ::= factor:result { '*' factor:right "result *= right;" | (* Divide *) '/' factor:right "result /= right;" }. (* Multiply *) factor:real ::= NUMBER:result (* Number *) =============================================================================== Appendix C: Calculator example in Coco/R COMPILER Calculator (* --- Imports and Global Declarations --- *) IMPORT IO (* I/O Module *), REALS (* Real Number Module *); (* --- Scanner Specification --- *) CHARACTERS digit = "0123456789". eol = CHR(13). TOKENS number = digit {digit} ["." digit {digit}]. (* --- Parser Specification --- *) PRODUCTIONS Calculator = Expression (. IO.Write(expValue); IO.WriteLn(); .). Expression = Term (. expValue := termValue; .) { "+" Term (. expValue := expValue + termValue; .) | "-" Term (. expValue := expValue - termValue; .) }. Term = Factor (. termValue := factorValue; .) { "*" Factor (. termValue := termValue * factorValue; .) | "/" Factor (. termValue := termValue / factorValue; .) }. Factor = number (. CalculatorS.GetName(CalculatorS.pos, CalculatorS.len, text); REALS.Convert(text, factorValue); .). =============================================================================== Appendix D: Calculator example in PCCTS #header <<#include >> program : <> ( expression[&result] EOL << printf("%f\n", result); >> )* EOF << puts("Good-bye"); >> ; expression[float *result] : <> ( term[result] "+" term[&termResult] << *result += termResult; >> | term[result] "-" term[&termResult] << *result -= termResult; >> ) ; term[float *result] : << float factorResult; >> ( factor[result] "*" factor[&factorResult] << *result *= termResult; >> | factor[result] "/" factor[&factorResult] << *result /= termResult; >> ) ; factor[float *result] : NUMBER << sscanf(zzlextext, "%f", result); >> ;