Comment vs. String precedence in tree-sitter grammar


#1

In the grammar I’m working on, I have the following situations that I need to match:

function f1() {
  c = 'http://example.com/';
}

function f2() {
  a = 123 // foo;
    ;
}

I don’t yet understand tree-sitter well enough to find an approach that handles both. The parts of grammar.js most relevant to this problem are:

const PREC = {
  COMMENT: -1
};

// Stolen from the JavaScript/C grammars
comment: $ => token(prec(PREC.COMMENT, choice(
  seq('//', /.*/),
  seq(
    '/*',
    /[^*]*\*+([^/*][^*]*\*+)*/,
    '/'
  )
))),

_char: $ => choice(
  /[^'"\\]/,
  /\\[^"]/
),

string_literal: $ => seq(
  "'",
  optional(seq($._char, repeat1($._char))),
  "'"
),

Keeping PREC.COMMENT at -1 does not work for the comment in f2(). Changing it to 1 breaks the string in f1(). Wrapping the string_literal definition in token() gives me the error "Symbols inside tokens are not allowed". I assume that prec() must be part of the solution, but I have tried it in different ways in the string_literal and _char rules and so far haven’t found anything that works.


#2

If / is a valid string character, I don’t know why it would be matching as a comment. I’d just model it on the JS grammar.
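
To illustrate what might be going on, here is a quick check in plain JavaScript (not tree-sitter itself, whose token selection also depends on parse state and precedence). It only shows that, at the position of the two slashes, both the single-character pattern from _char and the line-comment pattern can match, and that the comment match is the longer one. The variable names are made up for this sketch.

```javascript
// Patterns copied from the grammar in #1. The leading ^ is added here only so
// that each pattern is tried at a single position in plain JavaScript.
const charPattern = /^[^'"\\]/;
const lineCommentPattern = /^\/\/.*/;

const source = "c = 'http://example.com/';";

// Look at the input starting from the "//" inside the string.
const rest = source.slice(source.indexOf('//'));

const charMatch = rest.match(charPattern)[0];
const commentMatch = rest.match(lineCommentPattern)[0];

console.log(JSON.stringify(charMatch));
console.log(JSON.stringify(commentMatch));
```

If the lexer is ever asked for a token at that position and both rules are candidates, a longer-match preference would pick the comment, which is one reason to make the string contents a single multi-character token as the JS grammar does.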


#3

I already tried messing with the JavaScript approach, but that only got me into other problems. E.g. this one:

string_literal: $ => seq(
  "'",
  repeat(choice(
    token.immediate(prec(PREC.STRING, /[^'"\\]+/)),
    $._escape_sequence
  )),
  "'"
),

_escape_sequence: $ => token.immediate(seq('\\', /[^"]/)),

This solves the problem with comments, but introduces a new one with char literals. What syntactically looks like a 1-char string does not have string semantics in this language, but character semantics (i.e. behaves like a number in certain contexts; not a good idea, but that’s how it is). So, the following works for escaped char literals, but for some reason not for unescaped ones:

const PREC = {
  STRING: 2,
  CHAR: 3,
};

string_literal: $ => seq(
  "'",
  repeat(choice(
    token.immediate(prec(PREC.STRING, /[^'"\\]+/)),
    $._escape_sequence
  )),
  "'"
),

char_literal: $ => prec(PREC.CHAR, seq(
  "'",
  choice(
    /[^"'\\]/,
    $._escape_sequence
  ),
  "'"
)),

_escape_sequence: $ => token.immediate(seq('\\', /[^"]/)),

This recognizes the first assignment in the following function as string_literal, though the second one is correctly detected as char_literal:

function Run() {
  a = 'a';
  a = '\0';
}

This is all pretty confusing to me. Does this look like a tree-sitter bug that should be reported?


#4

Could you specify a string must have 2 or more characters?


#5

It can also be empty. Anyway, I also tried this, which breaks even more tests, and I have no idea why:

string_literal: $ => choice(
  "'",
  seq(
    "'",
    choice(
      token.immediate(prec(PREC.STRING, /[^'"\\]+/)),
      $._escape_sequence
    ),
    repeat1(choice(
      token.immediate(prec(PREC.STRING, /[^'"\\]+/)),
      $._escape_sequence
    )),
    "'"
  )
),

(I’m very thankful for your support here, your instructions elsewhere and linter-tree-sitter.)


#6

I fixed it by fiddling around, but I can’t claim I understand what made it work. The following removes the conflicts between strings and chars and keeps comments working as expected.

string_literal: $ => seq(
  "'",
  repeat(choice(
    token.immediate(repeat1(/[^'"\\]/)),
    token.immediate(/\\[^"]/)
  )),
  "'"
),

char_literal: $ => choice(
  /'[^'"\\]'/,
  /'\\[^"]'/
),
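
My best guess at why this works, as a toy model in plain JavaScript rather than tree-sitter itself: if the lexer prefers the longest match among the candidate tokens at a position, then a multi-character string-content token swallows the "//" before the comment rule is ever tried there, and a whole-literal char regex beats the lone opening quote only when exactly one character sits between the quotes. The longestMatch helper and the token names are hypothetical; real tree-sitter lexing also takes parse state and lexical precedence into account.

```javascript
// Toy model only: among the candidate token regexes, return the one with the
// longest match starting exactly at `pos` (the first candidate wins ties).
function longestMatch(source, pos, candidates) {
  let best = null;
  for (const [name, regex] of Object.entries(candidates)) {
    // The sticky flag forces the match to start exactly at lastIndex.
    const sticky = new RegExp(regex.source, 'y');
    sticky.lastIndex = pos;
    const m = sticky.exec(source);
    if (m && (best === null || m[0].length > best.text.length)) {
      best = { name, text: m[0] };
    }
  }
  return best;
}

// 1) Inside a string: the run-of-characters token starts at "h" and also
//    consumes the "//", so the comment pattern never gets a turn there.
const f1 = "c = 'http://example.com/';";
const inString = longestMatch(f1, f1.indexOf("'") + 1, {
  string_content: /[^'"\\]+/,
  comment: /\/\/.*/,
});

// 2) At an opening quote: the char literal is one three- or four-character
//    token, so it only beats the lone quote for single-character contents.
const atQuote = {
  quote: /'/,
  char_literal: /'[^'"\\]'|'\\[^"]'/,
};
const plainChar = longestMatch("'a';", 0, atQuote);
const escapedChar = longestMatch("'\\0';", 0, atQuote);
const twoChars = longestMatch("'ab';", 0, atQuote);

console.log(inString.name, plainChar.name, escapedChar.name, twoChars.name);
```

In this model 'a' and '\0' come out as char_literal, while 'ab' falls back to the plain quote and would continue as a string_literal.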