Atom grammars and multi-line look ahead


#1

Is there a way to do a multi-line look ahead using the grammar regex engine? The Oniguruma regex engine seems to allow it (or at least doesn’t explicitly restrict it), but none of my attempts have produced the desired results.


#2

Have you a specific example that you can post here?
It’s always simpler to reason on a real use case.


#3

I’m working on the Fortran grammar for Atom, and if statements and where statements pose a bit of a problem because they are treated differently (either as a multi-line construct or as a single-line statement) depending on what follows the logical part of the statement. If the logical part of the statement is split over multiple lines, it makes the rules messy. So I’d like to make a “wrapper” rule that can check ahead across multiple lines. For instance:

{
  'name': 'meta.block.if-then-construct.fortran'
  'begin': '''(?ix)(?=\\s*\\b(if)\\s*\\(   # start of if statement
      .*                                 # Something to match across multiple lines
    \\)\\s*then\\b)                      # then statement, signifying this is a
                                         # multi line if statement
  '''
  'end': '(?=\\n)'
  'patterns':[
    ...
  ]
}

#4

I can’t say I’m familiar with Fortran, but looking at some examples on Wikipedia, I fail to see why you need to catch the whole expression in a single rule.
Naively, I would do what many other language grammars do: write a rule for the keywords only, without caring about the condition or the code block that follows.

Can you explain a bit more why you need to catch the whole if...then expression in a single rule?


#5

It’s not an issue of can or can’t. I have sets of rules that handle these cases. Those rules, however, have certain disadvantages that I’d prefer to avoid if possible, such as clunky scope assignment. For instance, I can write a rule for Fortran if statements as:

{
  'begin': '(?i)\\s*\\b(if)(?=\\s*\\()'
  'beginCaptures':
    '1': 'name': 'keyword.control.if.fortran'
  'end': '(?=\\n)'
  'patterns':[
    {
      'comment': 'Logical control statement'
      'begin': '\\G\\s*\\('
      'end': '\\)'
      'patterns':[
        ...
      ]
    }
    {
      'comment': 'If-then construct.'
      'name': 'meta.block.if.fortran'
      'begin': '(?i)\\s*\\b(then)\\b'
      'end': '(?i)\\b(end\\s*if)\\b'
      'endCaptures':
        '1': 'name': 'keyword.control.endif.fortran'
      'patterns':[
        ...
      ]
    }
    {
      'comment': 'Single line if statement'
      'name': 'meta.statement.if.fortran'
      'begin': '(?i)(?=\\s*\\b[a-z])'
      'end': '(?=\\n)'
      'patterns':[
        ...
      ]
    }
  ]
}

which will highlight both kinds of if statement correctly. But in this format I can’t assign a scope to the first portion (either meta.block.if.fortran or meta.statement.if.fortran), because that choice requires information that may come on a later line, as in this case:

! multi-line if statement
if (x == 1 .and. & ! statement continues on the following line.
    y == 2) then
  z = 3
end if

or as in this case:

! single-line if statement
if (x == 1 .and. & ! statement continues on the following line.
    y == 2) z = 3
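
To make the dependency concrete, here is a minimal Python sketch (outside the grammar engine, and not part of any Atom package) of the decision the grammar would have to make. The helper names join_continuations and classify_if are hypothetical, and the comment stripping is deliberately naive: it assumes no "!" appears inside a string literal.

```python
import re

def join_continuations(lines):
    """Join free-form Fortran lines that end with '&' into one statement."""
    parts = []
    for line in lines:
        # Naive comment stripping: assumes '!' never occurs inside a string.
        code = line.split("!", 1)[0].rstrip()
        if code.endswith("&"):
            parts.append(code[:-1])   # drop the continuation marker, keep going
        else:
            parts.append(code)
            break                     # statement is complete
    return " ".join(part.strip() for part in parts)

def classify_if(lines):
    """Return the scope name an if statement would need, or None."""
    statement = join_continuations(lines)
    m = re.match(r"(?i)\s*if\s*\(", statement)
    if not m:
        return None
    # Walk to the parenthesis that closes the logical expression.
    depth, pos = 1, m.end()
    while pos < len(statement) and depth:
        if statement[pos] == "(":
            depth += 1
        elif statement[pos] == ")":
            depth -= 1
        pos += 1
    if depth:
        return None                   # unbalanced parentheses
    rest = statement[pos:].strip()
    if re.fullmatch(r"(?i)then", rest):
        return "meta.block.if.fortran"
    return "meta.statement.if.fortran"
```

The point of the sketch is that the answer is only known after the closing parenthesis, which may sit on a later line than the if keyword.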

#6

In a regex, . (period) matches any character, except newline. So you could write:

(.|\n|\r)*

But I’m not clear whether you should write one backslash (\n) or two (\\n). It sounds like a plain regex question.


#7

The regex engine that Atom uses for grammars is different from what it uses for everything else. So a regex that works in the standard find function in Atom won’t necessarily work in a grammar package. And from my testing, '(.|\\n|\\r)*' doesn’t work. I think the issue (and I’m speculating here, as I don’t know enough to reverse engineer it myself yet) is that the grammar regex engine only deals with each line individually rather than the document as a whole. So while '\\n' can be used in a rule to match a newline, it won’t consume it and take you to the next line.
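
A rough way to picture the suspected behaviour, as a Python sketch rather than the actual grammar engine: Python's re module stands in for Oniguruma here, and feeding the pattern one line at a time is an assumption about how the tokenizer works. The names PATTERN, SOURCE, matches_whole_document and matches_line_by_line are made up for the illustration.

```python
import re

# Look ahead in the spirit of the wrapper rule: "if (" ... ") then",
# with (.|\n|\r)* standing in for "anything, including newlines".
PATTERN = re.compile(r"(?i)(?=\s*\bif\s*\((?:.|\n|\r)*\)\s*then\b)")

SOURCE = "if (x == 1 .and. &\n    y == 2) then\n  z = 3\nend if\n"

def matches_whole_document(text):
    """Apply the pattern to the full text, as a find dialog might."""
    return PATTERN.search(text) is not None

def matches_line_by_line(text):
    """Apply the pattern to each line separately (assumed tokenizer behaviour)."""
    return any(PATTERN.search(line) for line in text.splitlines(keepends=True))

print(matches_whole_document(SOURCE))  # True
print(matches_line_by_line(SOURCE))    # False
```

If the engine really only ever sees one line, the look ahead can never reach the ") then" on the second line, no matter how the newline is spelled in the pattern.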


#8

I believe this is the way the TextMate engine worked, so the Atom engine is most likely designed to work the same way. I can’t find the reference for this belief right now, though :disappointed:


#9

This is intentional.

See the first answer by @Ingramz of issue https://github.com/atom/first-mate/issues/57


My experience also confirms this: I have never been able to match something that spans two successive lines.

I can match “\n”, but it is meaningless. The “character” immediately after “\n” is always “$” (end of line). Any other character after “\n” will break the match.


#10

The fact that you can nest patterns means that you can match across multiple lines. You just have to be a bit creative about how you do it. And it would require a definite ending string or else the engine will complain about being given too much work with no definite end point in sight.
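
As a sketch of the idea (a toy model in Python, not how first-mate is actually implemented): each line is matched on its own, but a stack of open begin/end rules is carried from one line to the next, so a rule opened on one line stays open until its end pattern shows up. The RULES table and the scopes_per_line function are hypothetical and only handle a single parenthesis rule.

```python
import re

# Hypothetical, simplified rule set: a scope name plus begin and end patterns.
RULES = [
    {"name": "meta.parens.fortran",
     "begin": re.compile(r"\("),
     "end": re.compile(r"\)")},
]

def scopes_per_line(text):
    """For each line, list the rules that are still open when the line ends."""
    stack = []    # open rules, carried over from line to line
    result = []
    for line in text.splitlines():
        pos = 0
        while pos < len(line):
            candidates = []
            if stack:
                # The innermost open rule may end on this line.
                m = stack[-1]["end"].search(line, pos)
                if m:
                    candidates.append((m.start(), 0, "end", m, None))
            for rule in RULES:
                # Any rule may begin (and nest) on this line.
                m = rule["begin"].search(line, pos)
                if m:
                    candidates.append((m.start(), 1, "begin", m, rule))
            if not candidates:
                break
            # Earliest match wins; an end beats a begin at the same position.
            _, _, kind, m, rule = min(candidates, key=lambda c: c[:2])
            if kind == "end":
                stack.pop()
            else:
                stack.append(rule)
            pos = m.end()
        result.append([r["name"] for r in stack])
    return result
```

For the two-line if condition from earlier in the thread, the parenthesis rule is still open at the end of the first line and closes on the second, even though no single regex ever looks past a newline.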