Tree-sitter: Look ahead and only parse special strings


#1

Hello there! :slight_smile:

I am preparing to create an Atom package that highlights Adobe ZStrings in Lua files for Adobe Lightroom plugins. That’s a very niche use case that probably no other Atom user will ever need, but I wanted a small project to get familiar with Atom package development anyway.

Here is an example for an Adobe ZString. It always starts with "$$$/.

I already managed to create a working syntax highlighter for ZStrings using Atom’s old TextMate-like parser, and now I am struggling to carry that success over to tree-sitter. I know this is a better fit for the TextMate parser anyway, but I want to do it with tree-sitter for the sake of learning and using the more modern approach.

This is my current tree-sitter grammar:

module.exports = grammar({
  name: "zstring",
  rules: {
    source: $ => repeat(choice(
      prec.left(2, $._junk),
      prec.left(1, $.zString)
    )),
    zString: $ => seq(
      $.zStringStart,
      optional(seq($.zStringRoot, $.zStringSeparator)),
      repeat(seq($.zStringFolder, $.zStringSeparator)),
      $.zStringKey,
      $.zStringEquals,
      $.zStringDefault,
      $.zStringEnd
    ),
    zStringStart: $ => seq(
      $.zStringQuote,
      $.zStringPrefix,
      $.zStringSeparator,
    ),
    zStringEnd: $ => $.zStringQuote,
    zStringQuote: $ => '"',
    zStringPrefix: $ => "$$$",
    zStringSeparator: $ => "/",
    zStringRoot: $ => prec(1, /[A-Za-z0-9]+/),
    zStringFolder: $ => /[A-Za-z0-9]+/,
    zStringKey: $ => /[A-Za-z0-9]+/,
    zStringEquals: $ => /[\s]*=[\s]*/,
    zStringDefault: $ => repeat1(choice(
      token.immediate(prec(1, /[^"\\\n]+/)),
      $.stringEscape
    )),
    // Based on https://github.com/tree-sitter/tree-sitter-javascript/blob/530c8c94211531f0db0dccb7d89c57aaa0af7525/grammar.js#L735
    stringEscape: $ => token.immediate(seq(
      "\\",
      choice(
        /[^xu0-7]/,
        /[0-7]{1,3}/,
        /x[0-9a-fA-F]{2}/,
        /u[0-9a-fA-F]{4}/,
        /u{[0-9a-fA-F]+}/
      )
    )),
    _junk: $ => /./
  }
})

This is my testing file:

abc "$$$" def

u
hupe9hupe9rhp "$$$/LightroomPluginName/Meta/PluginName=Plugin Title" 8hpu

9ohj9iof gz8uo

This is the parsing output using tree-sitter parse from tree-sitter-cli:

I have already tried dozens of variations, and I can’t stop tree-sitter from assuming that the first string is a ZString too. I want tree-sitter to classify the first string as $._junk and just move on, but it currently tries to interpret the string using the $.zString rule. So how can I tell tree-sitter to look ahead and check whether the whole string up to its closing quote is a ZString, and otherwise reset the parser position, classify the string as $._junk, and continue parsing? I had some success playing around with token and token.immediate, but I can’t use those functions because every part of my $.zString rule is another named rule.


#2

Can you provide more examples, along with what they should parse as, please? I’m not familiar with this language. Also,

  • Does "$$$/ always identify the start of a zString?
  • Is boringString an actual thing, or an attempt to fix the problem?
  • Must " characters be balanced?

#3

Good morning, Aerijo (well, in case we have similar timezones)!

  • Yes, source contents that don’t begin with "$$$/ can be safely classified as $._junk.

This is the general ZString format:

"$$$/Root/Folder/Key=Default Value"

Example:

"$$$/MyExamplePlugin/UserInterface/OptionsWindow/Button/Apply=Apply changes"

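To make the format concrete, here is a minimal sketch in plain JavaScript that splits a ZString literal into its parts. The helper name parseZString and the regular expression are assumptions for illustration only: they follow the character classes from my grammar (letters and digits in path segments) and ignore escape sequences in the default value.

```javascript
// Hypothetical helper, not part of the package: splits a ZString literal into its parts.
// Assumes path segments match [A-Za-z0-9]+ as in the grammar, and ignores escapes.
const ZSTRING = /^"\$\$\$\/((?:[A-Za-z0-9]+\/)*)([A-Za-z0-9]+)\s*=\s*([^"\\\n]+)"$/;

function parseZString(literal) {
  const match = ZSTRING.exec(literal);
  if (match === null) return null; // not a ZString, so it would be junk

  const [, path, key, defaultValue] = match;
  // "Root/Folder/" -> ["Root", "Folder"]; filter drops the empty tail after the last "/".
  const segments = path.split("/").filter(Boolean);

  return {
    root: segments.length > 0 ? segments[0] : null, // the root is optional
    folders: segments.slice(1),
    key,
    defaultValue,
  };
}
```

For the example above this should give the root MyExamplePlugin, the folders UserInterface, OptionsWindow and Button, the key Apply and the default value "Apply changes", while a plain "$$$" returns null.
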
  • In one of my earlier versions (repo for context) I wanted to find all strings and categorize them as either $.zString or $.boringString. That approach was stricter and more Lua-aware, but I discarded $.string and $.boringString once I understood that a dumber parser is enough for a case as simple as mine (just highlighting ZStrings in my Atom IDE).

  • I want the parser to be forgiving and not care too much about valid Lua, so something like this should parse without problems.

    ===
    3 quotes
    ===
    
    <<< "$$$/i-am-just-here-to-confuse-you"$$$/R/A=B" >>>
    
    ---
    
    (source
      (zString
        (zStringStart
          (zStringQuote)
          (zStringPrefix)
          (zStringSeparator))
        (zStringRoot)
        (zStringSeparator)
        (zStringKey)
        (zStringEquals)
        (zStringDefault)
        (zStringEnd
          (zStringQuote))))
    
    

Thanks for your quick response! I will push the latest state to the mentioned repository. :slight_smile:


#4
  • In this test you have " and $ in the path text, but in the definitions you declare that only word characters are valid. Which should it be?

  • In your example, it seems to have 3 sections (i-am..., R, A=B), but you’ve declared only 2 should be expected.

  • I tried it myself, but it wouldn’t accept it as a key when in the final segment. I’m assuming the final segment is the only one that can contain a =, so I made that a rule in an external scanner.
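
To illustrate the look-ahead-then-fall-back idea, here is a rough sketch in plain JavaScript. It is not the scanner code itself: tree-sitter external scanners are written in C, and the names scanZString and tokenize, as well as the exact rules (word characters in segments, a non-empty default value on a single line), are assumptions for the sake of the example.

```javascript
// Illustration only; all names here are hypothetical.
const SEGMENT_CHAR = /[A-Za-z0-9]/;

// Tries to read a complete ZString starting at `start`.
// Returns the index just past the closing quote, or -1 if it is not a ZString.
function scanZString(text, start) {
  const prefix = '"$$$/';
  if (!text.startsWith(prefix, start)) return -1;
  let i = start + prefix.length;

  // Path: segments separated by "/"; only the final segment is followed by "=".
  while (true) {
    const segmentStart = i;
    while (i < text.length && SEGMENT_CHAR.test(text[i])) i++;
    if (i === segmentStart) return -1; // empty segment
    if (text[i] === "/") { i++; continue; }
    break;
  }

  while (i < text.length && /\s/.test(text[i])) i++; // whitespace before "="
  if (text[i] !== "=") return -1;
  i++;

  // Default value: everything up to the closing quote on the same line.
  const valueStart = i;
  while (i < text.length && text[i] !== '"' && text[i] !== "\n") {
    if (text[i] === "\\") i++; // skip the escaped character
    i++;
  }
  if (i >= text.length || text[i] !== '"' || i === valueStart) return -1;
  return i + 1;
}

// Walks the text; anything that is not a complete ZString is merged into junk.
function tokenize(text) {
  const tokens = [];
  let pos = 0;
  let junkStart = 0;
  const flushJunk = (end) => {
    if (end > junkStart) tokens.push({ type: "junk", text: text.slice(junkStart, end) });
  };
  while (pos < text.length) {
    const end = text[pos] === '"' ? scanZString(text, pos) : -1;
    if (end === -1) { pos++; continue; } // fall back: this character stays junk
    flushJunk(pos);
    tokens.push({ type: "zString", text: text.slice(pos, end) });
    pos = end;
    junkStart = end;
  }
  flushJunk(text.length);
  return tokens;
}
```

With these rules, abc "$$$" def comes out as a single junk token, and in the "3 quotes" test only "$$$/R/A=B" is reported as a zString.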

Also, I’d try changing _junk to choice(/[^"]+/, /"/).


#5

Wow, thank you so much for your help! I love to see how committed you are to the tree-sitter project, considering that it gets far too little attention. I mean, even big players like Facebook profit from the Atom infrastructure, while its new parsing engine doesn’t even get enough love to have complete documentation.

I already merged your pull request (Thank you! :astonished:) into the repository and I will answer there regarding the “3 quotes” test.