Grammar regex not matching \G anchor correctly


#1

I’m working on a language grammar package and I’ve noticed that the \G anchor (\\G as it’s written in Atom) doesn’t appear to work properly (or I’m misunderstanding its use). As a test, I tried the following rule in Atom:

'patterns':[
  {
    'name': 'invalid.error.test'
    'match': '^\\s*\\b(anchor)\\b'
  }
  {
    'name': 'invalid.error.test'
    'match': '\\G\\s*\\b(test)\\b'
  }
]

and the following rule in Sublime Text 3:

<key>patterns</key>
<array>
    <dict>
        <key>name</key>
        <string>invalid.error.test</string>
        <key>match</key>
        <string>^\s*\b(anchor)\b</string>
    </dict>
    <dict>
        <key>name</key>
        <string>invalid.error.test</string>
        <key>match</key>
        <string>\G\s*\b(test)\b</string>
    </dict>
</array>

against the string “anchor test end test”. The Sublime Text 3 grammar correctly highlights “anchor test” and doesn’t highlight “end test”. The Atom grammar only highlights “anchor”.

The only way I’ve been able to get \G to work at all in Atom is at the start of a multiline rule. For instance, the rule

'patterns': [
    {
        'begin': '^\\s*\\b(anchor)\\b'
        'beginCaptures':
            '1': 
                'name': 'invalid.error.test'
        'end': '\\n'
        'patterns':[
            {
                'name': 'invalid.error.test'
                'match': '\\G\\s*\\b(test)\\b'
            }
        ]
    }
]

will correctly highlight “anchor test” but won’t highlight the second “test” in “anchor test test”.


#2

I don’t know if Oniguruma works identically to other regular expression libraries. There may be subtle differences in its implementation of certain metacharacters. Personally, I would start with the other Atom grammars and see how they use the \G anchor. It looks like the Ruby grammar uses it.

But I’m not an expert in how the grammars work. I only know enough to modify ones that already exist.


#3

Thanks for pointing out the Ruby grammar. Upon a quick inspection, it seems the only time they use \G is in conjunction with a multiline rule, to either match or not match directly following the begin match (like what I do in my second Atom grammar example). It’s possible this is the only way it does work. As for Oniguruma, I was under the impression Sublime Text and TextMate both use Oniguruma for their grammars.


#4

Actually, you may be right about that. I don’t know what they use.


#5

Exactly, that is the main reason for the node-oniguruma module: being able to use TM/ST grammars out of the box.

That said, I’m not sure I understand well enough what \G is supposed to achieve in this context to give a deeper explanation.


#6

The \G anchor forces a match to start from the end of the previous match in the same way the ^ anchor forces a match to start from the beginning of a line. So if I used the rules:

(this is)

and

\G(\s+a match)

It would capture both parts of the phrase “this is a match” but only the first part of “this is not a match.”
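To make that behaviour concrete, here is a rough Python sketch of a toy tokenizer. It is not how Atom, Sublime Text, or Oniguruma are implemented. Python's built-in re module has no \G, so the sketch stands in for it with Pattern.match(line, pos), which only succeeds if the match starts exactly at pos. The function name highlight and the (pattern, anchored) rule format are made up for this illustration.

```python
import re

# Each rule is (pattern, anchored). anchored=True stands in for a leading \G:
# the rule may only match exactly at pos, the end of the previous match.
# Unanchored rules may match anywhere from pos onwards.
def highlight(line, rules):
    compiled = [(re.compile(p), anchored) for p, anchored in rules]
    pos, spans = 0, []
    while pos < len(line):
        best = None
        for regex, anchored in compiled:
            m = regex.match(line, pos) if anchored else regex.search(line, pos)
            # Ignore empty matches so the loop always makes progress,
            # and prefer the match that starts earliest.
            if m and m.end() > m.start() and (best is None or m.start() < best.start()):
                best = m
        if best is None:
            break  # nothing matches; the rest of the line stays unhighlighted
        spans.append(best.group(0))
        pos = best.end()
    return spans

# The two rules from this post: (this is) and \G(\s+a match).
RULES = [(r"this is", False), (r"\s+a match", True)]
print(highlight("this is a match", RULES))      # ['this is', ' a match']
print(highlight("this is not a match", RULES))  # ['this is']
```

Run with the two rules from the first post, the same sketch returns "anchor" and " test" for “anchor test end test”, which is the result described for Sublime Text 3.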

From what I can tell, the Atom regex engine for language grammars only uses \G to mean the start of the contents of a multi-line rule (i.e., the end of the “begin” match).
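As a rough model of that observation (an assumption based only on the behaviour seen in this thread, not on Atom's source), the Python sketch below keeps the \G position fixed at the end of the begin match instead of moving it after every match. The name highlight_fixed_anchor is hypothetical, and Pattern.match(line, pos) again stands in for \G because Python's re module does not support it.

```python
import re

# Toy model: the \G position is set once, at the end of the begin match,
# and is never updated when an inner rule matches.
def highlight_fixed_anchor(line, begin, inner):
    spans = []
    b = re.search(begin, line)
    if b is None:
        return spans
    spans.append(b.group(0))
    anchor = pos = b.end()
    inner_re = re.compile(inner)
    while pos < len(line):
        # The \G stand-in may only match while pos is still at the anchor.
        m = inner_re.match(line, pos) if pos == anchor else None
        if m is None or m.end() == pos:
            break
        spans.append(m.group(0))
        pos = m.end()
    return spans

# Begin and inner patterns from the multiline rule in the first post,
# with the leading \G dropped from the inner pattern.
print(highlight_fixed_anchor("anchor test test",
                             r"^\s*\b(anchor)\b",
                             r"\s*\b(test)\b"))  # ['anchor', ' test']
```

Under this model the second “test” in “anchor test test” is never matched, which lines up with what the multiline rule does in Atom.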