Multiline FirstMate pattern-matching


#1

Alright, we all know TextMate’s biggest (only?) limitation is matching stuff across lines. And by “we”, I mean any dev who’s spent a reasonable amount of time developing a grammar package. I’m sure we’ve all hit that wall, and I’m wondering if there might be a solution that’s backwards compatible. As in, it won’t break every existing grammar, or plunge Atom into the pits of performance hell.

Oniguruma (the regex engine powering TextMate/FirstMate) has an option, (?m), that enables multiline matching within an expression (. will then match newlines too, as per usual multiline behaviour). Incidentally, this option does nothing in TextMate, which splits a document into lines before evaluating it.

Which brings me to wonder… could we use the (?m) modifier to give some patterns “special treatment”? If a loaded grammar contains a multiline pattern, FirstMate could run a few preliminary passes that break the document into the regions matched by each multiline expression. Each region would then be processed individually, as though it were a separate buffer (patterns inside it would be subject to the same restrictions as patterns inside any other rule with a begin/end clause).
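To make the idea a bit more concrete, here's a rough sketch of what such a preliminary pass might look like. It's written in Python rather than against FirstMate itself, so everything in it is an assumption for illustration: `split_regions` is a made-up helper, the block-comment pattern is just a stand-in for "some multiline rule", and Python's `re.DOTALL` plays the role that (?m) plays in Oniguruma.

```python
import re

def split_regions(text, pattern):
    """Return (is_multiline_match, start, end) spans covering the whole buffer.

    Hypothetical helper: spans where `pattern` matched are flagged True,
    and the gaps between them are flagged False so that a line-based
    tokenizer could process each span as if it were its own buffer.
    """
    regions = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() == m.end():
            continue  # ignore empty matches; they delimit nothing
        if m.start() > pos:
            regions.append((False, pos, m.start()))
        regions.append((True, m.start(), m.end()))
        pos = m.end()
    if pos < len(text):
        regions.append((False, pos, len(text)))
    return regions

# Stand-in multiline rule: a C-style block comment spanning several lines.
# re.DOTALL lets "." match newlines here, like (?m) does in Oniguruma.
block_comment = re.compile(r"/\*.*?\*/", re.DOTALL)

sample = "a\n/* x\ny */\nb\n"
regions = split_regions(sample, block_comment)
```

With this sample, the comment comes out as a single flagged region, and the text before and after it is left for the ordinary line-by-line pass. A real implementation would also have to worry about re-running the pass after edits, which this sketch ignores entirely.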

Granted, the current limitation can often be circumvented with clever nesting and lookaheads. However, this doesn’t always work. Case in point: I’ve found myself trying to match an “empty comment” in reStructuredText:

An “empty comment” does not consume following blocks. (An empty comment is “..” with blank lines before and after.)

..
        Commented
        
        Commented

..

        Not commented

The only way this can be done reliably is to match ^\s*\n\s*\.\.\s*\n\s*\n, which spans three lines. We can’t simply end the comment rule at a blank line, because reStructuredText permits blank lines between paragraphs in a comment-block. However, not terminating at a blank line will incorrectly highlight everything that follows an empty comment, a construct often used to “break out of” things like tables and nested block-quotes.
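As a quick sanity check of that expression, here's a small Python sketch that runs it against the example above, once over the whole buffer and once line by line. Python's `re` is only standing in for Oniguruma, and one difference matters: Python needs `re.MULTILINE` for ^ to match at the start of each line. The names `EMPTY_COMMENT`, `whole_buffer_matches` and `per_line_matches` are mine.

```python
import re

# The expression from the post, unchanged. re.MULTILINE is a Python-specific
# requirement so that ^ matches at every line start, not just at offset 0.
EMPTY_COMMENT = re.compile(r"^\s*\n\s*\.\.\s*\n\s*\n", re.MULTILINE)

sample = (
    "..\n"
    "        Commented\n"
    "        \n"
    "        Commented\n"
    "\n"
    "..\n"
    "\n"
    "        Not commented\n"
)

# Whole-buffer matching: the regex can see the blank lines on both sides.
whole_buffer_matches = [m.group() for m in EMPTY_COMMENT.finditer(sample)]

# Line-by-line matching, roughly what a line-based tokenizer is limited to:
# no single line contains the three newlines the expression needs.
per_line_matches = [
    line
    for line in sample.splitlines(keepends=True)
    if EMPTY_COMMENT.search(line)
]
```

Over the whole buffer, only the second “..” (the one with blank lines on both sides) is picked up. Fed one line at a time, the expression never matches at all, which is the wall I keep running into.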

Prose languages are undoubtedly the ones that suffer most from the line-based limitation, with AsciiDoc being a particular victim.

Anyway, I’m asking because I’d like to hear thoughts on whether or not it should be attempted, as it essentially takes a regex extension that’s non-functional here and (ab)uses it for a non-standard purpose.