
LR Parsing

Breaking Ambiguity in Jison/Eyapp/Yacc/Bison et al. using token precedence

This is a simplified version of the rules to resolve conflicts and ambiguities in a Yacc-like parser generator:


Precedence Rules

  1. Token precedence is declared in the header of the Jison program, that is, before the first %%, using the syntax %left token ... , %right token ... or %nonassoc token ...
  2. The precedence of a production rule A → α is the precedence of the last token appearing in the right-hand side α of the rule.
    • For example, the precedence of e → e @ e will be whatever precedence we give to the token @ in the header of the Jison program.
  3. If no token in the right-hand side has an assigned precedence, or if the default precedence is not the desired one, the %prec directive can be used to assign the desired precedence to the rule. For example:
    exp: '-' exp %prec 'STRONG'
    assigns the precedence of the token STRONG to the rule exp: '-' exp
  4. When the parser detects a conflict and sees that there are two possible ways to continue building the tree: one indicating that perhaps the rule A → α was applied, and another indicating that perhaps the token t can still be read from the input,
    1. The parser compares the precedences of the token and of the rule and keeps the one with the higher priority.
    2. If the token has the higher priority, the parser advances the input by shifting the token t and looking for new symbols (this is called a shift; in this case the AST "sinks" to the right), and
    3. If the rule has the higher priority, the parser completes the partial subtree for A → α and goes on building the tree (this is called a reduce; in this case the tree built is more deeply nested to the left)
  5. Tokens declared on the same line by a %left or %right declaration have equal precedence and equal associativity.
    • The directive %nonassoc t indicates that the parser must produce an error when the token t appears as lookahead with a rule of the same priority.
  6. The lower a declaration appears in the text, the higher its precedence
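
As an illustration of rules 1, 2 and 6, a typical Jison header for an arithmetic grammar might look like this (a sketch; the token names and rules are illustrative, not taken from a concrete program in these notes):

```jison
%left '+' '-'            /* declared first: lowest precedence           */
%left '*' '/'
%right '**'              /* exponentiation associates to the right      */
%nonassoc UMINUS         /* declared last: highest precedence           */
%%
e
  : e '+' e              { $$ = $1 + $3; }
  | e '*' e              { $$ = $1 * $3; }
  | e '**' e             { $$ = $1 ** $3; }
  | '-' e %prec UMINUS   { $$ = -$2; }    /* rule 3: takes UMINUS precedence */
  | N                    { $$ = Number(yytext); }
  ;
```

With these declarations, in -2*3 the parser reduces '-' e before shifting '*', because the rule's precedence (UMINUS) is higher than that of the token '*', so the tree built is (-2)*3.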

Jison

%nonassoc

The %nonassoc directive indicates that the parser must produce an error when the token appears as lookahead with a rule of its same priority. For example, given the following Jison program:

%{
const { builders: b } = require('ast-types');
%}
 
%nonassoc ','
%%
es: e EOF               { return  b.program([b.expressionStatement($e)]); }
;
 
e: 
    e ',' e             { $$ = b.sequenceExpression([$e1, $e2])  }
  | N                   { let numNode = b.literal($N); numNode.loc = @1; $$ = b.callExpression(b.identifier('Complex'),[numNode]); $$.loc = @1; }
;

An input like 2,3,4 causes an error:

➜  solution24250311 git:(main) ✗ cat examples/julio.calc 
2,3,4%                                                                                                                                                  
➜  solution24250311 git:(main) ✗ bin/calc2js.mjs examples/julio.calc -c
Syntax error:
Parse error on line 1:
2,3,4
---^
Expecting 'EOF', got ','
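
If instead we want 2,3,4 to be accepted, declaring the token with %left makes the parser resolve the conflict by reducing, so the sequence nests to the left (a sketch of the changed header line):

```jison
%left ','        /* instead of %nonassoc ',': 2,3,4 parses as (2,3),4 */
```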

Undocumented features about Jison lexer

The actions inside the lexer are compiled into an anonymous function. For instance, for a lexer file lexer.l such as the following:

%{
const util = require('util');
const deb = (x) => console.log(util.inspect(x, {depth: null}));
%}
number   \d+([.]\d+)?([eE][-+]?\d+)?[i]?|"i"
%%
\s+    ;
{number} return 'N';
'**'          return '**';
[-+*/!()]     return yytext;
<<EOF>>       return 'EOF';
.+            { 
                 throw new Error('Unexpected character ' + yytext+ 
                 `- ${yy_.yy.input} - ${deb(yy_)}`); 
              }

The lexer generated by Jison will have the following structure:

/* generated by jison-lex 0.3.4 */ 
var lexer = (function(){
    var lexer = ({
 
        EOF:1,
 
        parseError:function parseError(str, hash) { ...   },
 
        // resets the lexer, sets new input
        setInput:function (input, yy) { ... },
 
        // consumes and returns one char from the input
        input:function () {... },
 
        // unshifts one char (or a string) into the input
        unput:function (ch) { ... },
 
        // When called from action, caches matched text and appends it on next action
        more:function () { ... },
 
        // When called from action, signals the lexer that this rule fails to match the input, so the next matching rule (regex) should be tested instead.
        reject:function () { ... },
 
        // retain first n characters of the match
        less:function (n) { ... },
 
        // displays already matched input, i.e. for error messages
        pastInput:function () { ... },
 
        // displays upcoming input, i.e. for error messages
        upcomingInput:function () {... },  
 
        // displays the character position where the lexing error occurred, i.e. for error messages
        showPosition:function () { ... },
 
        // test the lexed token: return FALSE when not a match, otherwise return token
        test_match:function(match, indexed_rule) { ... },
 
        // return next match in input
        next:function () { ... },
 
        // return next match that has a token
        lex:function lex () {
                var r = this.next();
                if (r) {
                    return r;
                } else {
                    return this.lex();
                }
            },
 
        // activates a new lexer condition state (pushes the new lexer condition state onto the condition stack)
        begin:function begin (condition) {
                this.conditionStack.push(condition);
            },
 
        // pop the previously active lexer condition state off the condition stack
        popState:function popState () { ...},
 
        // produce the lexer rule set which is active for the currently active lexer condition state
        _currentRules:function _currentRules () { ... },
 
        // return the currently active lexer condition state; when an index argument is provided it produces the N-th previous condition state, if available
        topState:function topState (n) { ... },
 
        // alias for begin(condition)
        pushState:function pushState (condition) { ... },
 
        // return the number of states currently on the stack
        stateStackSize:function stateStackSize() { ... },
        options: {},
        performAction: function anonymous(yy,yy_,$avoiding_name_collisions,YY_START) {
            const util = require('util');
            const deb = (x) => console.log(util.inspect(x, {depth: null}));
 
            var YYSTATE=YY_START;
            switch($avoiding_name_collisions) {
                case 0:;
                break;
                case 1:return 14;
                break;
                case 2:return 12;
                break;
                case 3:return yy_.yytext;
                break;
                case 4:return 5;
                break;
                case 5: 
                                throw new Error('Unexpected character ' + yy_.yytext+ 
                                `- ${yy_.yy.input} - ${deb(yy_)}`); 
                            
                break;
        }
    },
    rules: [/^(?:\s+)/,/^(?:(\d+([.]\d+)?([eE][-+]?\d+)?[i]?|i\b))/,/^(?:\*\*)/,/^(?:[-+*/!()])/,/^(?:$)/,/^(?:.+)/],
    conditions: {"INITIAL":{"rules":[0,1,2,3,4,5],"inclusive":true}}
    });
    return lexer;
})();
parser.lexer = lexer;
function Parser () {
  this.yy = {};
}
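
The next()/lex() pair above explains why whitespace disappears from the token stream: a rule whose action returns nothing produces no token, so lex() simply calls itself again. The behavior can be mimicked with a small self-contained sketch (the rule table and token names here are illustrative, not the generated ones):

```javascript
// Minimal sketch of jison-lex's next()/lex() skip behavior.
// Rules whose action returns undefined (e.g. whitespace) produce no
// token, so lex() retries until next() yields one or the input ends.
const rules = [
  { re: /^\s+/, action: () => undefined },   // skip whitespace: no token
  { re: /^\d+/, action: () => 'N' },         // number token
  { re: /^[-+]/, action: (m) => m },         // operator token is its own text
];

function makeLexer(input) {
  return {
    _input: input,
    yytext: '',
    next() {
      if (this._input === '') return 'EOF';
      for (const { re, action } of rules) {
        const m = re.exec(this._input);
        if (m) {
          this.yytext = m[0];
          this._input = this._input.slice(m[0].length);
          return action(m[0]); // may be undefined: this rule emits no token
        }
      }
      throw new Error('Unexpected character ' + this._input[0]);
    },
    lex() {
      const r = this.next();
      return r !== undefined ? r : this.lex(); // retry, like the generated lex()
    },
  };
}

const lexer = makeLexer('2 + 3');
const tokens = [];
let t;
while ((t = lexer.lex()) !== 'EOF') tokens.push(t);
console.log(tokens); // → [ 'N', '+', 'N' ]
```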

The token numbers are defined somewhere in the generated parser. For this example we have:

var parser = {
    ...
    symbols_: { "error": 2, "es": 3, "e": 4, "EOF": 5, "-": 6, "+": 7, "*": 8, "/": 9, "(": 10, ")": 11, "**": 12, "!": 13, "N": 14, "$accept": 0, "$end": 1 },
    terminals_: { 2: "error", 5: "EOF", 6: "-", 7: "+", 8: "*", 9: "/", 10: "(", 11: ")", 12: "**", 13: "!", 14: "N" },
    productions_: [0, [3, 2], [4, 3], [4, 3], [4, 3], [4, 3], [4, 3], [4, 3], [4, 2], [4, 1], [4, 2]],
    ...
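
The numbers returned by the lexer actions (14, 12, 5, …) are indexes into this symbols_ table. Inverting the table (copied from the generated parser shown above) lets us check which token each case of performAction returns:

```javascript
// Invert the generated symbols_ table to map token numbers back to names.
const symbols_ = { "error": 2, "es": 3, "e": 4, "EOF": 5, "-": 6, "+": 7,
                   "*": 8, "/": 9, "(": 10, ")": 11, "**": 12, "!": 13,
                   "N": 14, "$accept": 0, "$end": 1 };

const names = Object.fromEntries(
  Object.entries(symbols_).map(([name, num]) => [num, name])
);

// The numbers returned in the performAction cases above:
console.log(names[14], names[12], names[5]); // → N ** EOF
```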

We can see that the performAction function is called with yy and an object yy_ . The object yy_ contains the following properties:

{
  yy: {
    input: '4.3XXXX',
    offsets: [ 0 ],
    lexer: [Circular *1],
    parser: {
      yy: { input: '4.3XXXX', offsets: [ 0 ] },
      parseError: [Function: parseError]
    }
  },
  _input: '',
  done: false,
  _backtrack: false,
  _more: false,
  yyleng: 4,
  yylineno: 0,
  match: 'XXXX',
  matched: '4.3XXXX',
  yytext: 'XXXX',
  conditionStack: [ 'INITIAL' ],
  yylloc: { first_line: 1, last_line: 1, first_column: 3, last_column: 7 },
  offset: 0,
  matches: [ 'XXXX', index: 0, input: 'XXXX', groups: undefined ]
}

Thus we can access the current offset via yy.lexer.offset or via yy_.offset.

However, if you try to access the match position via yy_.yylloc you will get an error, because Jison uses the yy prefix to rewrite the user's code: the expression yy_.yylloc is transformed into yy_.yy.yylloc, which is not defined. Instead, use yylloc, yylineno, yytext, etc. directly in your lexer action.
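
For instance, an error-reporting action like the following works, because the bare names are exactly the ones Jison rewrites into yy_.yytext, yy_.yylloc, and so on (an illustrative fragment, not part of the program above):

```jison
.+   {
       /* bare yytext/yylineno/yylloc: Jison rewrites them to yy_.yytext, ... */
       throw new Error(`Unexpected '${yytext}' at line ${yylineno + 1}, ` +
                       `columns ${yylloc.first_column}-${yylloc.last_column}`);
     }
```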

References

Introducción al Análisis LR

Parse::Eyapp

Parse::Eyapp is an extension for Perl of yacc, the standard Unix compiler-compiler. It provides a compiler which lets you compile yacc-like grammars into a Perl LALR(1) object-oriented parser. It automates the building of abstract syntax trees, provides an API for tree transformation and has been used inside the Perl community in many projects (see for instance the RPerl compiler developed by AutoParallel Technologies at http://rperl.org/).

Casiano Rodriguez-Leon created the compiler and maintained it from 2006 to 2017. Since 2017 it has been maintained by William N. Braswell, Jr.

See https://metacpan.org/dist/Parse-Eyapp/view/eyapp, https://cpan.metacpan.org/authors/id/C/CA/CASIANO/, https://metacpan.org/author/CASIANO/releases

LR Parsers for C Languages

History of Parsing

GLR: Generalized LR