Nested tree-sitter grammars


#1

I’m dealing with a file format that uses a data description language sort of like JSON, where functions are stored as strings. This is basically the same situation as JavaScript inside HTML, where an ordinary text node is interpreted as a script. I.e. this in HTML:

<html>
  <body>
    <div>
      // this is not a comment, it's just a text node
    </div>
    <script>
      // this is a comment inside a script
      console.log('hi');
    </script>
  </body>
</html>

…is analogous to this:

{
  // "this is not a comment, it's just a key/value pair where '//' is the key"

  greet "() {
    // This is a comment inside a function
    Trace('hi');
  }"
}

Now Tree-sitter apparently manages to highlight JavaScript inside HTML. I had a look at the grammar.js file as well as tree-sitter-html.cson and tree-sitter-javascript.cson, but did not see any hint as to how this is achieved. So how does this work?

The main problem I’m dealing with when I combine everything in one grammar is comments. There is no comment syntax in the data description language that I know of, but the scripting language has C-style comments. If extras had a mechanism for dealing with that, that would help, but nesting grammars would certainly be the less messy approach.


#2

Injections were introduced by this PR. Its explanation looks satisfactory.


#3

That’s exactly the feature I was looking for. It took me a bit to understand it, so I’ll paraphrase what I learned for anyone who might find this thread.

The lib/main.js file of my language-* package currently looks somewhat like this:

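exports.activate = function () {
  // 'source.sibelius_plg' is the scopeName of the host grammar
  atom.grammars.addInjectionPoint('source.sibelius_plg', {
    // Node type in the host grammar that contains the embedded code
    type: 'function',
    // Name of the language to inject at this node
    language (plgFunctionSyntaxNode) {
      return 'manuscript';
    },
    // Node whose text is parsed as the injected language
    content (plgFunctionSyntaxNode) {
      return plgFunctionSyntaxNode;
    }
  });
};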

This works as follows:

addInjectionPoint()
Arguments:

  • the scope name of the host language, as defined in the scopeName field of the host language’s grammars/*.cson file.
  • an object with the following fields:
    • type
      The type name of the syntax node, i.e. the rule name used in the grammar.js of the host language to define this node. In my case, it was:
      function: $ => /"\s*\([^")]*\)\s*{[^"]+}\s*"/
    • language()
      A callback function that tells Atom which language should be injected at a specific node. The argument passed by Atom and the expected return value are as follows:
      • Argument: a tree-sitter SyntaxNode object of the type specified in the type field.
      • Return value: a string specifying the language to be injected at this SyntaxNode. This string must match the regular expression defined in the injectionRegExp field of the injected language’s grammars/*.cson file.
    • content()
      Another callback function that should return a node that will be syntax highlighted as the injected language.
      • Argument: a tree-sitter SyntaxNode
      • Return value: a tree-sitter SyntaxNode. I assume this has to be a leaf in the host tree. In my case, nodes of type function are already leaves (they are defined by a single regular expression). The SyntaxNode of an HTML script element, by contrast, encompasses the entire element including start tag, content and end tag. To interpret the content as JavaScript, this function has to return the content node, which can be selected with .child(1) (.child(0) and .child(2) are the start and end tags).
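
To make the HTML case described above concrete, here is a minimal sketch of what such an injection point could look like. The scope name 'text.html.basic' and the node type 'script_element' are assumptions about the HTML grammar, and the injectionRegExp shown is a hypothetical stand-in, not copied from any grammar file. Only addInjectionPoint() and .child() are taken from the description above.

```javascript
// Injection point for HTML <script> elements.
// ASSUMPTION: the host grammar names this node type 'script_element'.
const scriptInjectionPoint = {
  type: 'script_element',
  // The returned string must match the injected grammar's injectionRegExp.
  language (scriptNode) {
    return 'javascript';
  },
  // child(0) is the start tag, child(1) the content, child(2) the end tag.
  content (scriptNode) {
    return scriptNode.child(1);
  }
};

// HYPOTHETICAL injectionRegExp, for illustration only.
const injectionRegExp = /^(js|javascript)$/i;
const languageMatches = injectionRegExp.test(scriptInjectionPoint.language());

// Register only when running inside Atom, where the `atom` global exists.
// ASSUMPTION: 'text.html.basic' is the scopeName of the HTML grammar.
if (typeof atom !== 'undefined') {
  atom.grammars.addInjectionPoint('text.html.basic', scriptInjectionPoint);
}
```

Keeping the injection point in its own object makes it possible to try out the language() and content() callbacks outside of Atom with a stand-in node before registering them.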

@Aerijo It seems that your linter-tree-sitter package does not work in the injected languages. Is that because the injected language is not added to the children field of the host SyntaxNode?


#4

I made it before injections, so I’ll have to take a look when I get time. It will probably be a while though (PRs are always welcome :) ).


#5

Of course, I’d be happy to contribute, but I might need assistance (and time). I think the main problem right now could be that it is not possible to descend from the host language node into the injected language nodes via the node-tree-sitter API.

I think it would be useful if the Tree interface had a field like injectionRoots: Array<Tree>, but as far as I can tell, node-tree-sitter is ignorant of the injection. Do you have any insight into whether this is handled by Atom itself, and how one could get access to the injected trees?


#6

This was much easier to solve than I thought. I only had to find the right piece of code in Atom that showed how to access the injected tree. I created a pull request.
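
For anyone who finds this thread: the pull request itself is not reproduced here. As a rough sketch of the general idea, and not necessarily what the PR does, the following assumes that Atom’s Tree-sitter language mode keeps one marker per injection in an internal injectionsMarkerLayer, and that each marker carries a languageLayer with its own tree. Both property names are undocumented internals and are assumptions that may not hold in every Atom version. The helper name getInjectedTrees is made up for illustration.

```javascript
// Collect the syntax trees of all injected languages in a language mode.
// ASSUMPTION: `injectionsMarkerLayer` and `marker.languageLayer.tree` are
// Atom internals, not public API, and may change or be absent.
function getInjectedTrees (languageMode) {
  const markerLayer = languageMode && languageMode.injectionsMarkerLayer;
  if (!markerLayer) return []; // e.g. not a Tree-sitter language mode
  return markerLayer
    .getMarkers()
    .map(marker => marker.languageLayer && marker.languageLayer.tree)
    .filter(Boolean); // skip injections that have not been parsed yet
}

// Usage inside Atom, where the `atom` global exists.
if (typeof atom !== 'undefined') {
  const editor = atom.workspace.getActiveTextEditor();
  if (editor) {
    const trees = getInjectedTrees(editor.getBuffer().getLanguageMode());
    console.log(trees.map(tree => tree.rootNode.type));
  }
}
```

Because this relies on internals, the helper returns an empty array rather than throwing when the expected properties are missing.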