Sequential includes in grammars


#1

I’ve been using Atom for just a day or so, and so far I love it… My main use for it, though, would be a custom language we use for the Torque3D game engine. To this end, I’ve been trying to write a TorqueScript package all day. It’s possible, but writing “nice” grammars seems nearly impossible.

Example:
I’d like to be able to highlight code following this pattern:

function ID (PARAMS) { EXPRESSION }

But apart from writing a single big expression to catch it all, I don’t see a nice way of doing it. I’d prefer to re-use my existing regexes instead of writing the same regexes over and over and having to change them everywhere if there is a bug :confused:

I’ve been trying to do something like:

  'functiondeclaration1' :
    'patterns': [
      {
        'begin': '/\\bfunction/i'
        'name': 'comment'
        'end':
          'include': '#functiondeclaration2'
        'match': '\\s+'
      }
    ]

  'functiondeclaration2' :
    'patterns': [
      {
        'begin': '[a-zA-Z]\\w*'
        'name': 'comment'
        'end': '\\('
        'match': '\\s+'
      }
    ]

I haven’t been able to make it work yet, and I can’t use an ‘include’ for the ID in the “begin”. Normally you’d use a tool like JFlex to implement this sort of thing. Is that at all possible here?


#2

You might want to take a look at the pull request I did for adding YARD documentation comments to the Ruby grammar:

YARD has just a few different patterns for documentation comments and many tags use each pattern. This might give you some ideas on how to model your grammar.


#3

Thanks for the answer! But that kinda highlights my point.

'(@)(attr|attr_reader|attr_writer|param|see|yieldparam)(\\s+([a-z_][a-zA-Z_]*))(\\s+((\\[)[^]]+(\\])))?(.*)$'

That is several different components, all scrambled together in a single regex. Instead, I’d like to have:

attr|attr_reader|attr_writer|param|see|yieldparam

In a completely separate expression, and

[a-z_][a-zA-Z_]*

In a separate expression as well. That makes the whole thing a lot more understandable and maintainable.
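For illustration, the closest thing the grammar format seems to offer here is the repository: each fragment becomes a named rule that gets pulled in with 'include'. A minimal, untested sketch, where the rule names (yard-tag, yard-tag-name, yard-identifier) and the scope names are hypothetical and not taken from the actual Ruby grammar:

```cson
'repository':
  # The tag names live in exactly one place.
  'yard-tag-name':
    'match': '\\b(attr|attr_reader|attr_writer|param|see|yieldparam)\\b'
    'name': 'keyword.other.yard'

  # So does the identifier regex.
  'yard-identifier':
    'match': '[a-z_][a-zA-Z_]*'
    'name': 'variable.parameter.yard'

  # A tag runs from the @ to the end of the line and re-uses both rules.
  'yard-tag':
    'begin': '@'
    'end': '$'
    'patterns': [
      { 'include': '#yard-tag-name' }
      { 'include': '#yard-identifier' }
    ]
```

Each regex is now written once, but the includes inside 'patterns' are alternatives tried at every position, not a sequence. Nothing here forces the tag name to come before the identifier, and the identifier rule would also fire on later words in the line.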


#4

Perhaps things can be broken down that far? If you look, the parts I wrote are included in the grammar somewhere else as a rule. I just wanted something working for a project and I didn’t take the time to understand the grammar system completely.


#5

I would love to find a solution for this as well. I’ve written my own theme, but I found the underlying DOM sometimes lacking. Personally, I’d tweak Go and Java’s grammars here and there, but they work well enough.

The grammar that prompted this post is Clojure’s. It’s unaware of several language concepts (@derefs, ~macro-escapes, *dynamic-variables* and namespaced/variables, for example), and I’m having a hard time finding a non-kludgy way to add them properly, one that doesn’t involve copying and pasting all over. And after reading some posts and tutorials, I’m not sure there is a way, given how TextMate grammars work; regexes are just too limited to parse context-free (for the most part) languages.

I realize that using TextMate grammars was a good way to hit the ground running, but did anyone give some thought to using full-blown parsers? Most languages already publish their grammar (or something close) as part of their specification, TextMate grammars could still be used if nothing else works, and proper parsing can enable other possibilities (inline syntax error reporting, refactoring, inline documentation…).

P.S. I’d link to some grammars, but I’m limited as a first time poster to only two links :frowning:.


#6

Ah, I can see my last reply never made it here:

“It might be possible by doing it as I described in the original post, but that’s cumbersome, ugly and prone to errors.
I don’t think it’s possible using Atom’s grammars, which is a shame. It could be fixed by allowing includes to be embedded in the match expression, instead of being isolated as they are now, so you could match on includes sequentially.”

Another issue is that, as @hanjos describes, I tried to make it only highlight the code when it was syntactically correct, I guess. From what I can understand, the only thing that’s possible to do is more or less a lexer. It’s a shame, now I have to go back to my IntelliJ plugin :confused:


#7

Actually, I was thinking of PEGs: easy to read and understand, solid theoretical foundation, known fast algorithms, several implementations available (including for JavaScript) and great parsing power (no need to separate lexers from parsers, can parse any context-free language - that we know of).

Oh, well. As a newbie in modern JavaScript, I have some ground to cover before being able to contribute something, and I don’t think I can spare the time…


#8

The way I did it in https://github.com/wmertens/sublime-nix/blob/master/nix.YAML-tmLanguage was to make a regex for the whole pattern with lookahead begin and end, and then inside the regex match each part with a lookahead ending that marks the next part. This way you’re chaining regexes and enforcing ordering. Then you can e.g. include the regex for ID.

The one catch with this method is that if the ending lookahead of the wrapper regex matches the ending of a part, it will end the wrapping.
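A rough, untested sketch of that chaining, applied to the function ID (PARAMS) pattern from the first post. The rule names, scope names and exact regexes are assumptions for illustration and are not taken from the nix grammar linked above:

```cson
'repository':
  'function-declaration':
    # Wrapper: zero-width begin at the keyword, zero-width end at the body's brace.
    'begin': '(?i)(?=\\bfunction\\b)'
    'end': '(?=\\{)'
    'name': 'meta.function.torquescript'
    'patterns': [
      {
        # Part 1: the keyword. Its end is a lookahead for the start of the ID.
        'begin': '(?i)\\bfunction\\b'
        'beginCaptures':
          '0':
            'name': 'keyword.control.torquescript'
        'end': '(?=[a-zA-Z])'
      }
      {
        # Part 2: the ID. Its end is a lookahead for the parameter list.
        'begin': '(?=[a-zA-Z])'
        'end': '(?=\\()'
        'patterns': [
          { 'include': '#identifier' }
        ]
      }
      {
        # Part 3: the parameter list, consumed up to its closing parenthesis.
        'begin': '\\('
        'end': '\\)'
      }
    ]

  # The ID regex lives here once and is re-used through 'include'.
  'identifier':
    'match': '[a-zA-Z]\\w*'
    'name': 'entity.name.function.torquescript'
```

Part 1 is listed before part 2 on purpose: both can match at the keyword, and the earlier rule should win the tie. The wrapper's end lookahead is also deliberately different from every part's end, to stay clear of the catch mentioned above.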