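JavaCC
Systems Programming

What is a parser generator?

[Figure: the character stream "Total = precio + iva;" enters the scanner, which emits the tokens Total, =, precio, +, iva. The parser turns them into the tree for the assignment Total = Expr, whose leaves are id (precio) and id (iva). A parser generator (JavaCC) produces the scanner and the parser from a lexical + grammatical specification.]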
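To make the scanner stage of the figure concrete, here is a minimal hand-written sketch in plain Java. It is not what JavaCC generates; the class name ScannerSketch, the "kind:text" output format, and the choice of only two token classes (identifiers and the characters = + ;) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScannerSketch {
    // Optional whitespace, then either an identifier (group 1) or one operator character (group 2).
    private static final Pattern TOKEN =
            Pattern.compile("\\s*(?:([A-Za-z][A-Za-z0-9]*)|([=+;]))");

    /** Splits the input into "kind:text" strings, for example "id:Total". */
    static List<String> scan(String input) {
        String text = input.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("lexical error at position " + pos);
            }
            if (m.group(1) != null) {
                tokens.add("id:" + m.group(1));
            } else {
                tokens.add("op:" + m.group(2));
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("Total = precio + iva;"));
    }
}
```

The parser stage would then consume this token list instead of raw characters, which is the division of labor the figure describes.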
JavaCC
JavaCC (Java Compiler Compiler) is a scanner and parser generator. It produces a scanner and/or a parser written in Java, and it is itself written in Java.
There are many parser generators:
    yacc (Yet Another Compiler-Compiler), for the C programming language
    Bison, from gnu.org
There are also many parser generators written in Java:
    JavaCUP; ANTLR; SableCC
More on classifying parser generators in Java

Tools that generate bottom-up parsers:
    JavaCUP
    jay, YACC for Java: www.inf.uos.de/bernd/jay
    SableCC, The Sable Compiler Compiler: www.sablecc.org
Tools that generate top-down parsers:
    ANTLR, ANother Tool for Language Recognition: www.antlr.org
    JavaCC, Java Compiler Compiler: www.webgain.com/java_cc
JavaCC features
Generates top-down LL(k) parsers
Lexical specification and grammar in a single file
Tree-building preprocessor
    with JJTree
Highly customizable
    Many different selectable options
Documentation generation
    Using JJDoc
Internationalization
    Can handle full Unicode
Syntactic and semantic lookahead
JavaCC features (cont.)

Allows extended BNF specifications
    Can use | * ? + () on the right-hand side (RHS).
Lexical states and lexical actions
Case-sensitive lexical analysis (case-insensitive matching is available through IGNORE_CASE)
Extensive debugging capabilities
Special tokens
Very good error reporting
Installing JavaCC
Download the file javacc-3.X.zip from https://javacc.dev.java.net/
    Follow the link labeled Download, or go directly to https://javacc.dev.java.net/servlets/ProjectDocumentList
Unzip javacc-3.X.zip into a directory %JCC_HOME%
Add the %JCC_HOME%\bin directory to your %path%.
javacc, jjtree, and jjdoc may now be invoked directly from the command line.
Steps for using JavaCC

Write a JavaCC specification (.jj file)
    Define the grammar and the actions in one file (say, calc.jj)
Run javacc to generate a scanner and a parser
    javacc calc.jj
    This generates the Java sources for the parser, the scanner, and the token classes
Write the program that uses the parser
    For example, UseParser.java
Compile and run the program
    javac -classpath . *.java
    java -cp . mainpackage.MainClass
Example 1
Parse a regular-expression specification and match it against input strings
Grammar: re.jj
Example input:
    % all strings end in "ab"
    (a|b)*ab;
    aba;
    ababb;
Our task:
    For each input string (lines 3 and 4), determine whether it matches the regular expression (line 2).
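As a quick cross-check of the answers the example should produce, the same expression can be tried with java.util.regex. This is a stand-in technique, not the NFA/DFA pipeline that re.jj builds; the class name EndsInAbCheck and the two extra inputs "ab" and "bbab" are assumptions added for illustration.

```java
import java.util.regex.Pattern;

class EndsInAbCheck {
    // The expression from line 2 of the sample input, in java.util.regex syntax.
    private static final Pattern RE = Pattern.compile("(a|b)*ab");

    /** True when the whole string matches the expression. */
    static boolean accepts(String s) {
        return RE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aba", "ababb", "ab", "bbab"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Neither sample string ends in "ab", so both should be rejected; the two extra inputs show accepted cases.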
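The big picture
[Figure: javaCC reads re.jj and generates REParserTokenManager and REParser. The input "% comment (a|b)*ab; a; ab;" goes to REParserTokenManager, which passes tokens to REParser; MainClass drives the parser and produces the result.]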
Format of a JavaCC input grammar

javacc_options
PARSER_BEGIN ( <IDENTIFIER>1 )
java_compilation_unit
PARSER_END ( <IDENTIFIER>2 )
( production )*
The input specification file (re.jj)
    options {
      USER_TOKEN_MANAGER = false;
      BUILD_TOKEN_MANAGER = true;
      OUTPUT_DIRECTORY = "./reparser";
      STATIC = false;
    }
re.jj
    PARSER_BEGIN(REParser)
    package reparser;

    import java.lang.*;
    import dfa.*;

    public class REParser {
      public FA tg = new FA();

      // error message with the current line
      public static void msg(String s) {
        System.out.println("ERROR" + s);
      }

      public static void main(String args[]) throws Exception {
        REParser reparser = new REParser(System.in);
        reparser.S();
      }
    }
    PARSER_END(REParser)
re.jj (token definitions)

    TOKEN : {
      <SYMBOL: ["0"-"9","a"-"z","A"-"Z"] >
    | <EPSILON: "epsilon" >
    | <LPAREN: "(" >
    | <RPAREN: ")" >
    | <OR: "|" >
    | <STAR: "*" >
    | <SEMI: ";" >
    }

    SKIP : {
      < ( [" ","\t","\n","\r","\f"] )+ >
    | < "%" ( ~["\n"] )* "\n" > { System.out.println(image); }
    }
re.jj (productions)
    void S() : { FA d1; }
    {
      d1 = R() <SEMI>
      {
        tg = d1;
        System.out.println("------NFA");
        tg.print();
        System.out.println("------DFA");
        tg = tg.NFAtoDFA();
        tg.print();
        System.out.println("------Minimize");
        tg = tg.minimize();
        tg.print();
        System.out.println("------Renumber");
        tg = tg.renumber();
        tg.print();
        System.out.println("------Execute");
      }
      testCases()
    }
re.jj
    void testCases() : {}
    { ( testCase() )+ }

    void testCase() : { String testInput; }
    {
      testInput = symbols() <SEMI>
      { tg.execute(testInput); }
    }

    String symbols() :
    {
      Token token = null;
      StringBuffer result = new StringBuffer();
    }
    {
      ( token = <SYMBOL> { result.append(token.image); } )*
      { return result.toString(); }
    }
re.jj (regular expressions)

    // R --> RUnit | RConcat | RChoice
    FA R() : { FA result; }
    {
      result = RChoice()
      { return result; }
    }

    FA RUnit() : { FA result; Token d1; }
    {
      ( <LPAREN> result = RChoice() <RPAREN>
      | <EPSILON> { result = tg.epsilon(); }
      | d1 = <SYMBOL> { result = tg.symbol(d1.image); }
      )
      { return result; }
    }
re.jj
    FA RChoice() : { FA result, temp; }
    {
      result = RConcat()
      ( <OR> temp = RConcat() { result = result.choice(temp); } )*
      { return result; }
    }

    FA RConcat() : { FA result, temp; }
    {
      result = RStar()
      ( temp = RStar() { result = result.concat(temp); } )*
      { return result; }
    }

    FA RStar() : { FA result; }
    {
      result = RUnit()
      ( <STAR> { result = result.closure(); } )*
      { return result; }
    }
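The layering of RChoice, RConcat, RStar, and RUnit is what gives | the lowest precedence and * the highest. A hand-written recursive-descent sketch in plain Java can make that visible. It is only an illustration under assumptions: the class RegexShapeParser is hypothetical, it returns a prefix-form string instead of building an FA (the dfa package is not reproduced here), and the EPSILON keyword and whitespace handling are left out.

```java
class RegexShapeParser {
    private final String src;
    private int pos = 0;

    private RegexShapeParser(String src) { this.src = src; }

    /** Parses a whole expression and returns its structure in prefix form. */
    static String parse(String src) {
        RegexShapeParser p = new RegexShapeParser(src);
        String result = p.choice();
        if (p.pos != src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return result;
    }

    // RChoice: RConcat ( "|" RConcat )*
    private String choice() {
        String result = concat();
        while (peek() == '|') {
            pos++;
            result = "choice(" + result + "," + concat() + ")";
        }
        return result;
    }

    // RConcat: RStar ( RStar )*   (continue while the next character can start an RUnit)
    private String concat() {
        String result = star();
        while (peek() == '(' || isSymbol(peek())) {
            result = "concat(" + result + "," + star() + ")";
        }
        return result;
    }

    // RStar: RUnit ( "*" )*
    private String star() {
        String result = unit();
        while (peek() == '*') {
            pos++;
            result = "star(" + result + ")";
        }
        return result;
    }

    // RUnit: "(" RChoice ")" | SYMBOL
    private String unit() {
        char c = peek();
        if (c == '(') {
            pos++;
            String inner = choice();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ')' at " + pos);
            }
            pos++;
            return inner;
        }
        if (isSymbol(c)) {
            pos++;
            return String.valueOf(c);
        }
        throw new IllegalArgumentException("expected symbol or '(' at " + pos);
    }

    // Same character set as the SYMBOL token: digits and ASCII letters.
    private static boolean isSymbol(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Current character, or '\0' at end of input.
    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }
}
```

For example, RegexShapeParser.parse("(a|b)*ab") yields concat(concat(star(choice(a,b)),a),b): the star applies to the parenthesized choice, and concatenation groups to the left, just as the loops in RConcat and RStar do.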
Format of a JavaCC input grammar

javacc_input ::= javacc_options
                 PARSER_BEGIN ( <IDENTIFIER>1 )
                 java_compilation_unit
                 PARSER_END ( <IDENTIFIER>2 )
                 ( production )*
                 <EOF>
Color code:
    blue --- nonterminal
    <orange> --- a token type
    purple --- lexeme (reserved word, i.e., consisting of the literal itself)
    black --- metasymbols
Notes
<IDENTIFIER> means any Java identifier, such as var, class2, ...
    IDENTIFIER means only IDENTIFIER.
<IDENTIFIER>1 must be equal to <IDENTIFIER>2
java_compilation_unit is any Java code that, as a whole, can legally appear in a file.
    It must contain a declaration of a main class with the same name as <IDENTIFIER>1.
Example:
    PARSER_BEGIN ( MiParser )
    package mipackage;
    import miotropackage.*;
    public class MiParser { }
    class MiOtraClase { }
    PARSER_END ( MiParser )
Input and output of javacc

[Figure: the input file MiEspecifLeng.jj, containing the PARSER_BEGIN ( MiParser ) ... PARSER_END ( MiParser ) unit shown above, is fed to javacc, which generates Token.java, ParseError.java, MiParser.java, MiParserConstants.java, and MiParserTokenManager.java.]
Notes:
    Token.java and ParseError.java are the same for every input and can be reused. (Later slides, and JavaCC 3.x, call the exception class ParseException.)
    The package declaration in the *.jj file is copied to all 3 outputs.
    The import declarations in the *.jj file are copied to the parser and token manager files.
    The parser file is given the file name <IDENTIFIER>1.java
    The parser file has the contents:
        class MiParser {
          // the generated parser is inserted here
        }
    The generated token manager provides one public method:
        Token getNextToken() throws ParseError;
Lexical Specification with JavaCC
javacc options
javacc_options ::= [ options { ( option_binding )* } ]
option_binding has the form:
    <IDENTIFIER>3 = <java_literal> ;
    where <IDENTIFIER>3 is not case sensitive.
Example:
    options {
      USER_TOKEN_MANAGER = true;
      BUILD_TOKEN_MANAGER = false;
      OUTPUT_DIRECTORY = "./sax2jcc/personnel";
      STATIC = false;
    }
More Options
LOOKAHEAD
    java_integer_literal (1)
CHOICE_AMBIGUITY_CHECK
    java_integer_literal (2), for A | B | C
OTHER_AMBIGUITY_CHECK
    java_integer_literal (1), for (A)*, (A)+ and (A)?
STATIC (true)
DEBUG_PARSER (false)
DEBUG_LOOKAHEAD (false)
DEBUG_TOKEN_MANAGER (false)
OPTIMIZE_TOKEN_MANAGER
    java_boolean_literal (false)
OUTPUT_DIRECTORY (current directory)
ERROR_REPORTING (true)
More Options
JAVA_UNICODE_ESCAPE (false)
    replace an escape such as \u2245 with the actual Unicode character (6 chars -> 1 char)
UNICODE_INPUT (false)
    the input stream is in Unicode form
IGNORE_CASE (false)
USER_TOKEN_MANAGER (false)
    generate a TokenManager interface for the user's own scanner
USER_CHAR_STREAM (false)
    generate a CharStream.java interface for the user's own input stream
BUILD_PARSER (true)
    java_boolean_literal
BUILD_TOKEN_MANAGER (true)
SANITY_CHECK (true)
FORCE_LA_CHECK (false)
COMMON_TOKEN_ACTION (false)
    invoke void CommonTokenAction(Token t) after every getNextToken()
CACHE_TOKENS (false)
Example: Figure 2.2

     Regular expression                    Token           JavaCC notation
1.   if                                    IF              "if" or "i" "f" or ["i"]["f"]
2.   [a-z][a-z0-9]*                        ID              ["a"-"z"] (["a"-"z","0"-"9"])*
3.   [0-9]+                                NUM             (["0"-"9"])+
4.   ([0-9]+.[0-9]*) | ([0-9]*.[0-9]+)     REAL            (["0"-"9"])+ "." (["0"-"9"])* | (["0"-"9"])* "." (["0"-"9"])+
5.   (--[a-z]*\n) | ( |\n|\t)+             no token, WS
6.   .                                     error
JavaCC specification for some tokens

    PARSER_BEGIN(MiParser)
    class MiParser {}
    PARSER_END(MiParser)

    /* For the regular expression on the right, the token on the left is returned */
    TOKEN : {
      < IF: "if" >
    | < #DIGIT: ["0"-"9"] >
    | < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
    | < NUM: (<DIGIT>)+ >
    | < REAL: ( (<DIGIT>)+ "." (<DIGIT>)* ) | ( (<DIGIT>)* "." (<DIGIT>)+ ) >
    }
Continued
    /* The regular expressions here are skipped during lexical analysis */
    SKIP : { < " " > | < "\t" > | < "\n" > }

    /* Like SKIP, but the skipped text is accessible from the parser action */
    SPECIAL_TOKEN : {
      < "--" (["a"-"z"])* ( "\n" | "\r" | "\n\r" ) >
    }

    /* . For any substring that does not match the lexical specification, javacc throws an error */

    /* main rule */
    void start() : {}
    { ( <IF> | <ID> | <NUM> | <REAL> )* }
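Two rules decide which token a specification like this returns: the longest match wins, and among matches of equal length the rule listed first wins. The following plain-Java sketch imitates that behavior for IF, ID, NUM, and REAL. It is an illustration only, not the generated token manager; the class name LongestMatchSketch and the "KIND:text" output format are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // Rules in declaration order; an earlier rule wins a tie in length.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[a-z][a-z0-9]*"));
        RULES.put("NUM", Pattern.compile("[0-9]+"));
        RULES.put("REAL", Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+"));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ' || c == '\t' || c == '\n') {   // the SKIP section
                pos++;
                continue;
            }
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer only, so the first rule keeps a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("lexical error at position " + pos);
            }
            out.add(bestKind + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if if8 42 3.14"));
    }
}
```

Here "if" is an IF because IF is listed before ID, while "if8" is an ID because the ID match is longer.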
Grammar Specification with JavaCC
The form of a production

java_return_type java_identifier ( java_parameter_list ) : java_block
{ expansion_choices }
Example:
    void XMLDocument(Logger logger) : { int msg = 0; }
    {
      <StartDoc> { print(token); }
      Element(logger)
      <EndDoc> { print(token); }
    | else()
    }
Example (grammar)
1. P -> L
2. S -> id := id
3. S -> while id do S
4. S -> begin L end
5. S -> if id then S
6. S -> if id then S else S
7. L -> S
8. L -> L ; S
1, 7, 8 : P -> S ( ; S )*
JavaCC Version of Grammar 3.30

    PARSER_BEGIN(MiParser)
    public class MiParser {}
    PARSER_END(MiParser)

    SKIP : { " " | "\t" | "\n" }

    TOKEN : {
      <WHILE: "while"> | <BEGIN: "begin"> | <END: "end"> | <DO: "do">
    | <IF: "if"> | <THEN: "then"> | <ELSE: "else">
    | <SEMI: ";"> | <ASSIGN: "=">
    | <#LETTER: ["a"-"z"]>
    | <ID: <LETTER> ( <LETTER> | ["0"-"9"] )* >
    }
JavaCC Version of Grammar 3.30 (cont'd)

    void Prog() : {}
    { StmList() <EOF> }

    void StmList() : {}
    { Stm() ( ";" Stm() )* }

    void Stm() : {}
    {
      <ID> "=" <ID>
    | "while" <ID> "do" Stm()
    | <BEGIN> StmList() <END>
    | "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
    }
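Each production above corresponds to one method in a top-down parser, and one token of lookahead is enough to pick an alternative. The sketch below writes that out by hand in plain Java. It is not the code JavaCC generates; the class StmParserSketch is hypothetical, and it assumes the input tokens are separated by whitespace (so ";" must stand alone).

```java
import java.util.Arrays;
import java.util.List;

class StmParserSketch {
    private static final List<String> KEYWORDS =
            Arrays.asList("while", "do", "begin", "end", "if", "then", "else");
    private final String[] toks;   // whitespace-separated tokens
    private int pos = 0;

    private StmParserSketch(String input) {
        toks = input.trim().split("\\s+");
    }

    /** Returns true when the whole input is a valid Prog. */
    static boolean accepts(String input) {
        StmParserSketch p = new StmParserSketch(input);
        try {
            p.stmList();
            return p.pos == p.toks.length;   // plays the role of <EOF>
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // StmList: Stm ( ";" Stm )*
    private void stmList() {
        stm();
        while (at(";")) {
            pos++;
            stm();
        }
    }

    // Stm: the next token selects the alternative.
    private void stm() {
        if (at("while")) {
            pos++; id(); expect("do"); stm();
        } else if (at("begin")) {
            pos++; stmList(); expect("end");
        } else if (at("if")) {
            pos++; id(); expect("then"); stm();
            if (at("else")) {   // taken greedily, so else binds to the nearest if
                pos++; stm();
            }
        } else {
            id(); expect("="); id();
        }
    }

    private void id() {
        if (pos >= toks.length || KEYWORDS.contains(toks[pos])
                || !toks[pos].matches("[a-z][a-z0-9]*")) {
            throw new IllegalStateException("identifier expected at token " + pos);
        }
        pos++;
    }

    private void expect(String t) {
        if (!at(t)) {
            throw new IllegalStateException("'" + t + "' expected at token " + pos);
        }
        pos++;
    }

    private boolean at(String t) {
        return pos < toks.length && toks[pos].equals(t);
    }
}
```

The optional else branch is where the grammar uses LOOKAHEAD(1): taking the else whenever it is present resolves the dangling-else choice in favor of the innermost if.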
Types of productions
production ::= javacode_production
             | regular_expr_production
             | bnf_production
             | token_manager_decls
Note:
    1 and 3 are used to define grammars.
    2 is used to define tokens.
    4 is used to embed code in the token manager.
JAVACODE production
javacode_production ::= JAVACODE java_return_type java_id ( java_param_list ) java_block
Note:
    Used to define nonterminals for recognizing something that is hard to parse using normal productions.
Example JAVACODE
    JAVACODE
    void skip_to_matching_brace() {
      Token tok;
      int nesting = 1;
      while (true) {
        tok = getToken(1);
        if (tok.kind == LBRACE) nesting++;
        if (tok.kind == RBRACE) {
          nesting--;
          if (nesting == 0) break;
        }
        tok = getNextToken();
      }
    }
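The key detail in this production is the difference between looking at the next token (getToken(1)) and consuming it (getNextToken()): the matching brace is seen but left in the stream. A self-contained Java sketch over a plain list of strings shows the same loop. The class BraceSkipSketch and the list-of-strings token model are assumptions for illustration; unlike the slide's code, this version throws when the input runs out.

```java
import java.util.List;

class BraceSkipSketch {
    /**
     * Starting just after an opening "{", returns the index of the matching "}".
     * The matching brace itself is not consumed.
     */
    static int indexOfMatchingBrace(List<String> tokens, int start) {
        int nesting = 1;
        int pos = start;
        while (pos < tokens.size()) {
            String tok = tokens.get(pos);     // like getToken(1): look, do not consume
            if (tok.equals("{")) {
                nesting++;
            }
            if (tok.equals("}")) {
                nesting--;
                if (nesting == 0) {
                    return pos;               // stop before consuming the closing brace
                }
            }
            pos++;                            // like getNextToken(): consume
        }
        throw new IllegalArgumentException("unbalanced braces");
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("{", "a", "{", "b", "}", "c", "}", "d");
        System.out.println(indexOfMatchingBrace(tokens, 1));
    }
}
```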
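Note:
Do not use a nonterminal defined by JAVACODE at a choice point without giving a LOOKAHEAD.
    // problematic: JavaCC cannot decide when to choose the JAVACODE alternative
    void NT() : {}
    { skip_to_matching_brace() | some_other_production() }

    // fine: each alternative starts with a distinct token
    void NT() : {}
    { "{" skip_to_matching_brace() | "(" parameter_list() ")" }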
TOKEN_MANAGER_DECLS
token_manager_decls ::= TOKEN_MGR_DECLS : java_block
The token manager declarations start with the reserved word "TOKEN_MGR_DECLS", followed by a ":" and then a set of Java declarations and statements (the Java block).
These declarations and statements are written into the generated token manager (MiParserTokenManager.java) and are accessible from within lexical actions.
There can be only one token manager declaration in a JavaCC grammar file.
regular_expression_production
regular_expr_production ::= [ lexical_state_list ]
                            regexpr_kind [ [ IGNORE_CASE ] ] :
                            { regexpr_spec ( | regexpr_spec )* }
regexpr_kind ::= TOKEN | SPECIAL_TOKEN | SKIP | MORE
    TOKEN is used to define normal tokens.
    SKIP is used to define skipped tokens (not passed on to the parser).
    MORE is used to define semi-tokens (i.e., only part of a token).
    SPECIAL_TOKEN sits between TOKEN and SKIP: it is passed on to the parser and is accessible to parser actions, but it is ignored by production rules (not counted as a token). Useful for representing comments.
lexical_state_list
lexical_state_list ::= < * > | < java_identifier ( , java_identifier )* >
The lexical state list describes the set of lexical states for which the corresponding regular expression production applies.
If it is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angle brackets.
If it is omitted, the DEFAULT lexical state is assumed.
regexpr_spec
regexpr_spec ::= regular_expression1 [ java_block ] [ : java_identifier ]
Meaning: when regular_expression1 is matched, then
    if java_block exists, execute it;
    if java_identifier appears, transition to that lexical state.
regular_expression
regular_expression ::=
      java_string_literal                                                  (1)
    | < [ [#] java_identifier : ] complex_regular_expression_choices >     (2)
    | <java_identifier>                                                    (3)
    | <EOF>                                                                (4)
Notes:
    (1) is for unnamed tokens. java_string_literal is matched only by the string denoted by itself.
    (2) is used to define a labeled regular expression; if # occurs, it is not visible outside the current TOKEN section.
    (3) <java_identifier> is a reference to another labeled regular_expression; used in bnf_production.
    (4) <EOF> is matched by the end-of-file character only.
Example
    <DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
      < FLOATING_POINT_LITERAL:
          (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])?
        | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])?
        | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
        | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"]
      > { /* do something */ } : LEX_ST1
    | < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
    }
Note: if # is omitted, E123 will be recognized erroneously as a token of kind EXPONENT.
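To check by hand which strings the four alternatives accept, the definition can be transcribed into java.util.regex. This is a stand-in for experimenting, not JavaCC output; the class FloatLiteralSketch and its method names are assumptions. The isExponent method shows why the # matters: on its own, the EXPONENT expression does match "E123".

```java
import java.util.regex.Pattern;

class FloatLiteralSketch {
    // Transcription of #EXPONENT.
    private static final String EXP = "[eE][+-]?[0-9]+";

    // Transcription of the four FLOATING_POINT_LITERAL alternatives, in order.
    private static final Pattern FLOAT = Pattern.compile(
              "[0-9]+\\.[0-9]*(" + EXP + ")?[fFdD]?"
            + "|\\.[0-9]+(" + EXP + ")?[fFdD]?"
            + "|[0-9]+" + EXP + "[fFdD]?"
            + "|[0-9]+(" + EXP + ")?[fFdD]");

    /** True when the whole string is a floating-point literal. */
    static boolean isFloat(String s) {
        return FLOAT.matcher(s).matches();
    }

    /** True when the whole string matches the private EXPONENT expression. */
    static boolean isExponent(String s) {
        return s.matches(EXP);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.5e10f", ".5", "1e5", "123f", "123", "E123"}) {
            System.out.println(s + " float=" + isFloat(s) + " exponent=" + isExponent(s));
        }
    }
}
```

A bare "123" is rejected by all four alternatives, since each one requires a decimal point, an exponent, or a type suffix.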
Structure of complex_regular_expression
complex_regular_expression_choices ::=
    complex_regular_expression ( | complex_regular_expression )*
complex_regular_expression ::= ( complex_regular_expression_unit )*
complex_regular_expression_unit ::=
      java_string_literal
    | "<" java_identifier ">"
    | character_list
    | ( complex_regular_expression_choices ) [ + | * | ? ]
Note:
    units combine by concatenation (juxtaposition) into a complex_regular_expression;
    complex_regular_expressions combine by choice (|) into complex_regular_expression_choices;
    complex_regular_expression_choices, wrapped as ( ... ) [ + | * | ? ], form a unit again.
character_list
character_list ::= [ "~" ] "[" [ character_descriptor ( "," character_descriptor )* ] "]"
character_descriptor ::= java_string_literal [ "-" java_string_literal ]
java_string_literal ::= " singleCharString* "    // reference to the Java grammar
Note: java_string_literal here is restricted to length 1.
Examples:
    ~["a","b"] --- all chars but a and b.
    ["a"-"f","0"-"9","A","B","C","D","E","F"] --- hexadecimal digit.
    ["a","b"]+ is not a regular_expression_unit. Why? The +, * and ? operators apply only to a parenthesized complex_regular_expression_choices.
        It should be written ( ["a","b"] )+ instead.
bnf_production
bnf_production ::=
    java_return_type java_identifier "(" java_parameter_list ")" ":"
    java_block
    "{" expansion_choices "}"
expansion_choices ::= expansion ( "|" expansion )*
expansion ::= ( expansion_unit )*
expansion_unit
expansion_unit ::=
      local_lookahead                                                             (1)
    | java_block                                                                  (2)
    | "(" expansion_choices ")" [ "+" | "*" | "?" ]                               (3)
    | "[" expansion_choices "]"                                                   (4)
    | [ java_assignment_lhs "=" ] regular_expression                              (5)
    | [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")"    (6)
Notes:
    1 is for lookahead; 2 is for a semantic action
    4 is equivalent to ( ... )?
    5 is for a token match
    6 is for a match of another nonterminal
lookahead
local_lookahead ::= "LOOKAHEAD" "(" [ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ] [ "{" java_expression "}" ] ")"
Notes:
    3 components: max # of lookahead tokens + syntax + semantics
    Examples:
        LOOKAHEAD(3)
        LOOKAHEAD(5, Expr() <INT> | <REAL>, { true })
More on LOOKAHEAD
    See the LOOKAHEAD mini-tutorial.
JavaCC API
Non-terminals in the input grammar
    If NT is a nonterminal, then
        returntype NT(parameters) throws ParseError;
    is generated in the parser class. (JavaCC 3.x names the exception ParseException.)
API for parser actions
    Token token;
        This variable always holds the last token and can be used in parser actions. It is exactly the same as the token returned by getToken(0).
    Two other methods, getToken(int i) and getNextToken(), can also be used in actions to traverse the token list.
Token class
    public int kind;        // 0 for <EOF>
    public int beginLine, beginColumn, endLine, endColumn;
    public String image;
    public Token next;
    public Token specialToken;
    public String toString() { return image; }
    public static final Token newToken(int ofKind)
Error reporting and recovery

It is not user friendly to throw an exception and stop parsing as soon as a syntax error is encountered.
Two exceptions
    ParseException: can be recovered from
    TokenMgrError: not expected to be recovered from
Error reporting
    Modify ParseException.java or TokenMgrError.java
    The generateParseException method can always be invoked in a parser action to report an error
Error Recovery in JavaCC:

Shallow Error Recovery
Deep Error Recovery

Shallow Error Recovery
Ex:
    void Stm() : {}
    { IfStm() | WhileStm() }
If getToken(1) is neither "if" nor "while" => shallow error
Shallow recovery
A shallow error can be recovered from with an additional choice:
    void Stm() : {}
    { IfStm() | WhileStm() | error_skipto(SEMICOLON) }
where
    JAVACODE
    void error_skipto(int kind) {
      ParseException e = generateParseException();  // generate the exception object
      System.out.println(e.toString());             // print the error message
      Token t;
      do {
        t = getNextToken();
      } while (t.kind != kind);
    }
Deep Error Recovery

Same example:
    void Stm() : {}
    { IfStm() | WhileStm() }
But this time the error occurs during parsing inside IfStm() or WhileStm(), instead of at the lookahead entry.
The approach: use the Java try-catch construct.
    void Stm() : {}
    {
      try {
        ( IfStm() | WhileStm() )
      } catch (ParseException e) {
        error_skipto(SEMICOLON);
      }
    }
Note: this is the new syntax for a javacc bnf_production.
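The try/catch plus skip-to-semicolon pattern can be tried outside JavaCC with a small hand-written parser. The sketch below is an illustration under assumptions: RecoverySketch is a hypothetical class, IllegalStateException stands in for ParseException, tokens are whitespace separated, and the statement grammar is reduced to "if id" and "while id".

```java
import java.util.ArrayList;
import java.util.List;

class RecoverySketch {
    private final String[] toks;
    private int pos = 0;
    final List<String> errors = new ArrayList<>();

    RecoverySketch(String input) {
        toks = input.trim().split("\\s+");
    }

    /** Parses "Stm ; Stm ; ..." and returns how many statements parsed cleanly. */
    int parseAll() {
        int ok = 0;
        while (pos < toks.length) {
            try {
                stm();
                expect(";");
                ok++;
            } catch (IllegalStateException e) {   // stands in for ParseException
                errors.add(e.getMessage());       // report the error
                skipTo(";");                      // then resynchronize
            }
        }
        return ok;
    }

    // Stm: "if" id | "while" id   (deliberately tiny)
    private void stm() {
        if (at("if") || at("while")) {
            pos++;
            id();
        } else {
            throw new IllegalStateException("statement expected at token " + pos);
        }
    }

    private void id() {
        if (pos >= toks.length || !toks[pos].matches("[a-z][a-z0-9]*")) {
            throw new IllegalStateException("identifier expected at token " + pos);
        }
        pos++;
    }

    private void expect(String t) {
        if (!at(t)) {
            throw new IllegalStateException("'" + t + "' expected at token " + pos);
        }
        pos++;
    }

    private boolean at(String t) {
        return pos < toks.length && toks[pos].equals(t);
    }

    // Consumes tokens up to and including the next occurrence of kind, or to end of input.
    private void skipTo(String kind) {
        while (pos < toks.length) {
            if (toks[pos++].equals(kind)) {
                return;
            }
        }
    }
}
```

With the input "if a ; while ; if b ; 42 x ; while c ;", the second and fourth statements fail in different places (inside a statement and at its first token), yet parsing continues and the other three statements are accepted.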
More Examples
There are plenty of examples on the net

    http://www.vorlesungen.uni-osnabrueck.de/informatik/compilerbau98/code/JavaCC/examples/
JavaCC Grammar Repository

    http://www.cobase.cs.ucla.edu/pub/javacc/
References
    http://xml.cs.nccu.edu.tw/courses/compiler/cp2003Fall/slides/javaCC.ppt
    Compilers: Principles, Techniques, and Tools. Aho, Sethi, and Ullman.
