Tree-Sitter Grammar: regex won't work, while string does


#1

I read through Aerijo’s excellent guide on tree-sitter grammars.

I ran into a weird problem. I’m trying to write a grammar for 6809 assembly; here is my grammar.js:

module.exports = grammar({
    name: 'asm6809',

    rules: {
        source_file: $ => repeat($._line),

        // a line has up to three fields:
        // _label: a label
        // _instruction: either a mnemonic/opcode or assembler instruction
        // _comment: a comment preceded by a semicolon
        _line: $ => prec.left(seq(
            optional($.label),
            optional($._instruction),
            optional($._comment),
            $._line_break
        )),

        _line_break: $ => '\n',

        // lwtools accepts global and local labels
        label: $ => prec.right(seq(
            $._identifier,
            ':'
        )),

        _identifier: $ => /[a-zA-Z\._][a-zA-Z0-9\._\$]*/,

        // instructions
        // TODO: operands, etc.
        _instruction: $ => choice(
            $.opcode,
            $.pseudo_opcode
        ),

        // TODO: preliminary
        // opcode: $ => 'opcode',
        opcode: $ => seq(
            $.memnonic
        ),

        // all valid 6809 mnemonics
        // TODO: improve code with regex
        memnonic: $ => 'abx',
        // memnonic: $ => /abx/i,

        // TODO: preliminary
        pseudo_opcode: $ => 'pseudo_opcode',

        // comments
        _comment: $ => seq(
            $.semicolon_comment
        ),

        semicolon_comment: $ => /;.*/

    }

})

I’m stuck on the line memnonic: $ => 'abx',. I’d rather use a regex to recognize abx, but when I use the commented-out line right below it instead, the test no longer passes. Is my regex somehow wrong?

Here is the test file that fails:

====
Line format test
====

        abx
        pseudo_opcode

label:  abx           ; comment after line
label:  pseudo_opcode

; just a standard label
label:


---

(source_file
    (opcode (memnonic))
    (pseudo_opcode)
    (label) (opcode (memnonic)) (semicolon_comment)
    (label) (pseudo_opcode)
    (semicolon_comment)
    (label)
)

And the error output:

 
  ♢ The asm6809 language 
  
  tests
    ✗ Line format test 
        »        
        actual expected 
         
        (source_file (ERROR)opcode (memnonic)) (pseudo_opcode) (label) (opcode (memnonic)) (semicolon_comment) (label) (pseudo_opcode) (semicolon_comment) (label)) 
         // macros.js:14

✗ Broken » 1 broken (0.007s) 

Weirdly, it only fails when there is no label field before the opcode.


#2

It seems to be interpreting the abx as an _identifier, but only when using the regex. It looks like a possible bug to me, and definitely something I would need to make a note of, but I don’t know much about Tree-sitter’s internals.

This seems to fix it, though:

prec(100, token(/abx/))

The precedence should be higher than the identifier. What exactly token is doing here I don’t know, but the combination of precedence and token worked.
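
A hedged sketch of why the combination helps, going by the Tree-sitter documentation on conflicting tokens: when two token rules match the same characters, the lexer prefers the one with the higher lexical precedence, and a precedence counts as lexical when prec is written inside token. A string literal is also treated as more specific than a regex, which would explain why 'abx' won over _identifier while /abx/ did not. The fragment below shows that form. The value 1 is an arbitrary choice (anything above the default of 0 should do), and the rules object exists only so the fragment parses on its own; token and prec are the DSL functions that are in scope inside grammar.js.

```javascript
// Sketch only: these entries would replace the corresponding ones inside
// the `rules` object of the grammar above. `token` and `prec` are provided
// by Tree-sitter when it evaluates grammar.js; they are not defined here.
const rules = {
  // Unchanged: the regex that also matches `abx`.
  _identifier: $ => /[a-zA-Z\._][a-zA-Z0-9\._\$]*/,

  // prec() inside token() is a lexical precedence, so the lexer should
  // pick this token over _identifier when both match the same text.
  memnonic: $ => token(prec(1, /abx/)),
};
```

The form prec(100, token(/abx/)) from the post above evidently works as well; this is just the variant the documentation describes for lexer-level conflicts.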


#3

Thanks for your help! The ‘token’ function does the trick.

I also thought it might be a bug; the actual behavior doesn’t seem to match the expected behavior.

I’ve seen other grammars use strings only, but I think using regex would keep it shorter and more elegant (and therefore easier to maintain).


#4

Also note that Tree-sitter does not (yet) recognise the case-insensitive regex flag (i). You can, however, write a function that emulates it, e.g.,

// return a `choice` of all case permutations
function noCase(input) {
  let allCasePermutations = getPermutations(input) // you'd have to write this yourself
  return choice(...allCasePermutations)
}
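
One possible way to write the getPermutations helper mentioned above, as a minimal sketch. It assumes plain ASCII input such as 6809 mnemonics; the name and return type (an array of strings) are only what the snippet above implies, not anything defined by Tree-sitter.

```javascript
// Hypothetical helper: returns every upper/lower-case variant of `input`.
// Assumes ASCII text; characters without a case (digits, '.') are kept as-is.
function getPermutations(input) {
  let results = [""];
  for (const ch of input) {
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    const variants = lower === upper ? [ch] : [lower, upper];
    const next = [];
    // Extend every prefix built so far with each variant of this character.
    for (const prefix of results) {
      for (const v of variants) next.push(prefix + v);
    }
    results = next;
  }
  return results;
}

console.log(getPermutations("ab")); // [ 'ab', 'aB', 'Ab', 'AB' ]
```

The list doubles with every letter (8 entries for abx, 32 for a five-letter word), so this is only practical for short keywords.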

or

// return a regex made by swapping `a` with `[aA]`
function ignoreCase(str) {
  return new RegExp(
    str
    .split("")
    .map(c => /[a-zA-Z]/.test(c) ? `[${c.toLowerCase()}${c.toUpperCase()}]` : c)
    .join("")
  );
}
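
The same character-class trick can be stretched to a whole list of mnemonics, which might help with the "all valid 6809 mnemonics" TODO in the grammar. This is only a sketch: mnemonicPattern is a made-up name, the three mnemonics are just sample input, and, like ignoreCase above, it does not escape regex metacharacters, so it assumes the words contain only letters and digits.

```javascript
// Hypothetical helper, not part of Tree-sitter: builds one regex that
// matches any of `words`, ignoring case, by turning each letter into a
// two-character class and joining the words with `|`.
function mnemonicPattern(words) {
  const alternatives = words.map(word =>
    word
      .split("")
      .map(c => /[a-zA-Z]/.test(c) ? `[${c.toLowerCase()}${c.toUpperCase()}]` : c)
      .join("")
  );
  return new RegExp(alternatives.join("|"));
}

const pattern = mnemonicPattern(["abx", "lda", "sta"]);
console.log(pattern.source); // [aA][bB][xX]|[lL][dD][aA]|[sS][tT][aA]
```

The resulting regex could then be wrapped in token and prec in the same way as the single /abx/ regex earlier in the thread.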