Write a grammar for an indent-based language?


#1

I’m looking into writing a grammar specification for a language that uses indent level to signify blocks. Any suggestions on how to implement such a pattern? I can’t find a way to use the begin and end regex patterns to support indentation levels.


#2

At the most basic, ^\s will catch a single whitespace character at the start of a line, ^\s\s will catch two, and so forth. If you need to match an exact indent level, you might run into difficulty if the language allows the user to add as many spaces as they want.
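One way around the "as many spaces as they want" problem is to capture the whole leading run of whitespace and measure it, rather than writing one pattern per level. The following is only a sketch in Python rather than in a grammar file; the name indent_width is made up for illustration, and it uses [ \t] instead of \s so that a line break is never counted as indentation.

```python
import re

# Capture every leading space or tab in one go instead of one \s per level.
LEADING_WS = re.compile(r"^[ \t]*")

def indent_width(line: str) -> int:
    """Return the number of leading space/tab characters on a line."""
    return len(LEADING_WS.match(line).group(0))

for sample in ["no indent", "  two spaces", "\t\ttwo tabs"]:
    print(indent_width(sample), repr(sample))
```

Note that this counts characters, so a tab and a space each count as one; a real grammar would have to decide how to treat mixed indentation.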


#3

Also, you might want to check out the grammar package for Python, language-python, since Python is an indent-sensitive language.


#4

You could use something like this to find indentation blocks:

will match something like this (using > to signify indent level):

some code
more code
> indented code
> continued
>>  double-indented code
> single-indented code again
non-indented code

You can also do a variation on this using a begin/end pattern, which would match the line before the indentation block starts as well.

Not entirely sure what you’re going to do with that though.
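For anyone who wants to experiment with the idea outside a grammar file, here is a rough Python emulation of a begin/end rule whose end pattern reuses the indent captured by begin. The patterns and the name indented_blocks are assumptions made for this sketch, not the exact rule from this post, and it only finds the outermost indentation level (a real grammar would recurse into the block).

```python
import re

# "begin": a line that starts with some indentation followed by content.
BEGIN = re.compile(r"^([ \t]+)\S")

def indented_blocks(lines):
    """Return (first, last) line-index pairs for each outermost indented block."""
    blocks = []
    i = 0
    while i < len(lines):
        m = BEGIN.match(lines[i])
        if not m:
            i += 1
            continue
        # "end": substitute the captured indent, as a \1 back-reference in an
        # end pattern would. Blank lines never end the block.
        end = re.compile(r"^(?!" + re.escape(m.group(1)) + r")(?![ \t]*$)")
        j = i + 1
        while j < len(lines) and not end.match(lines[j]):
            j += 1
        blocks.append((i, j - 1))
        i = j
    return blocks

source = [
    "some code",
    "more code",
    "  indented code",
    "  continued",
    "    double-indented code",
    "  single-indented code again",
    "non-indented code",
]
print(indented_blocks(source))
```

With the sample above, the block runs from "indented code" through "single-indented code again", and the double-indented line stays inside it because it still starts with the captured indent.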


#5

Thank you, I can work with that. I didn’t realize you could use a back-reference from the begin pattern in an end pattern.


#6

Now it looks like I need to reference the indent level from a child pattern in order to mark the class body inside a class. Can I reference the parent somehow? I noticed you use $self in your example; are other variables available?

I couldn’t find any documentation on grammars. Is there any kind of reference where I can look this stuff up?

  'class':
    'begin': '^(\\s+)?(?=class\\s+\\w+)'
    'end': '^(?!\\1\\s+)(?!\\s*$)'
    'endCaptures':
      '0':
        'name': 'punctuation.section.class.end'
    'name': 'meta.class'
    'patterns': [
      ...
      {
        'begin': ':'
        'beginCaptures':
          '0':
            'name': 'punctuation.section.class.begin'
        'end': "when indent level ends"
        'contentName': 'meta.class.body'
        'patterns': [
          ...
        ]
      }
    ]

#7

Sadly, no. Or at least I don’t know how – and I’ve asked myself this question a lot while working on language-haskell.

That said, I don’t think you have to in your particular case. Treat the grammar as a finite-state automaton, somewhat similar to a Turing machine. When the parser encounters the ^(\\s+)?(?=class\\s+\\w+) pattern, it switches to ‘class definition parsing mode’, so you can use a completely different set of patterns from the main scope. If you have to do some custom tokenizing on the first line, use beginCaptures; you can use sub-patterns there as well. This is usually enough. That is, unless your language supports arbitrary line breaks, but then this whole grammar model is horribly inadequate and the very best you can hope for is guessing tokens correctly ‘most of the time’ =\
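To make the ‘mode switch’ idea concrete, here is a small Python sketch that labels lines the way such a rule might. It is not how Atom’s tokenizer is implemented; the function name scope_lines and the "source" label are invented for the example, and the begin/end patterns are simplified versions of the ones in the question.

```python
import re

# "begin" of the class rule: optional indent, then a lookahead for "class Name".
CLASS_BEGIN = re.compile(r"^([ \t]*)(?=class\s+\w+)")

def scope_lines(lines):
    """Label each line 'source', 'meta.class' (header) or 'meta.class.body'."""
    scopes = []
    class_end = None  # None means we are in the main mode
    for line in lines:
        if class_end is not None and class_end.match(line):
            class_end = None  # the class rule ends; fall back to main mode
        if class_end is not None:
            scopes.append("meta.class.body")
            continue
        m = CLASS_BEGIN.match(line)
        if m:
            # End at the first non-blank line that is not indented deeper
            # than the class line itself.
            class_end = re.compile(
                r"^(?!" + re.escape(m.group(1)) + r"[ \t]+)(?![ \t]*$)"
            )
            scopes.append("meta.class")
        else:
            scopes.append("source")
    return scopes

sample = ["x = 1", "class Foo:", "    a = 1", "", "    b = 2", "y = 2"]
for scope, line in zip(scope_lines(sample), sample):
    print(f"{scope:16} {line!r}")
```

The point is that once the class rule is active, the body never needs to know the parent’s indent: the rule’s own end pattern decides when the mode is left.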

Also I should note that the \A, \G and \z regex anchors are treated specially, but documentation is scarce.

From what I can understand, \A should match the start of the document (and seems to do so), \z should supposedly match the end of the document (I couldn’t make it work; I think it’s a bug), and \G evidently matches exactly where the begin rule ends (on the same line, which kind of makes it pointless in most cases). There might be other use cases that I’m not aware of.

$self references the current grammar. You can also reference other grammars by scope name, or repository items (prefixed with #). Also, $base references the current document’s grammar (primarily useful with injected grammars).

Atom uses TextMate’s grammar model, so https://manual.macromates.com/en/language_grammars and http://www.apeth.com/nonblog/stories/textmatebundle.html should give you an idea of what’s going on.

Bear in mind that begin/while patterns are not supported in Atom.

UPDATE: This also might prove useful: https://manual.macromates.com/en/regular_expressions


#8

Atom also doesn’t require that begin statements have an end, so you can begin and go to the end of the document if you want.


#9

Take a look at PEG.js. I am not sure what class of parser it is, but it looks like it will parse context-free languages and will certainly parse regular languages, so unless you have some unusual language requirements it should work fine for you. There is also an interactive web page that allows you to define your language and test it.

If you are using whitespace as a syntactic element of your grammar, you will run into some of the same problems that Python faces, e.g. handling tabs mixed with spaces. But ignoring that, you could define a terminal symbol in your grammar such as ‘indent’ that would match exactly 4 spaces, another that matches end-of-line, etc., and use these in your grammar. You will have to match the indentation level using semantic rules.
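As a sketch of what "matching the indentation level using semantic rules" could look like, here is a plain Python pre-pass (not PEG.js) that turns leading whitespace into INDENT and DEDENT tokens which a parser can then treat like brackets. The four-space unit, the token names and the function indent_tokens are all assumptions for illustration; tabs in the indentation are simply rejected to sidestep the mixed tabs/spaces problem.

```python
INDENT_UNIT = "    "  # assumption: one level is exactly four spaces

def indent_tokens(lines):
    """Return a list of ('INDENT', ''), ('DEDENT', '') and ('LINE', text) tokens."""
    level = 0
    tokens = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines do not change the indentation level
        body = line.lstrip(" ")
        width = len(line) - len(body)
        if body.startswith("\t") or width % len(INDENT_UNIT):
            raise SyntaxError(
                f"line {number}: indentation must be a multiple of four spaces"
            )
        new_level = width // len(INDENT_UNIT)
        # Emit one token per level gained or lost since the previous line.
        while level < new_level:
            tokens.append(("INDENT", ""))
            level += 1
        while level > new_level:
            tokens.append(("DEDENT", ""))
            level -= 1
        tokens.append(("LINE", body))
    # Close any blocks that are still open at the end of the input.
    tokens.extend(("DEDENT", "") for _ in range(level))
    return tokens

for token in indent_tokens(["a", "    b", "        c", "d"]):
    print(token)
```

This sketch allows a line to jump more than one level at once; a stricter version could raise an error in that case.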

Disclaimer: I have not used PEG.js myself, but looking at the online documentation it seems like it will work for you.

Hope this helps.