Purpose of all grammar & patterns properties


#1

@Wliu if you could just indulge me for this question, it would be very much appreciated. Given that I didn’t quite understand injections when I first saw them, I just want to confirm I understand what everything listed below is for.

Properties of the grammar file observed by the Grammar class:

  • name: Aesthetic label for language selection menu

  • fileTypes: Array of file extensions used to help calculate the language score (for automatic selection)

  • scopeName: A root scope that gets applied to all text, regardless of further matches.

  • foldingStopMarker: A currently ignored property, potentially will be used for syntax based folding (instead of the current indentation folding). foldingStartMarker is not even looked for.

  • maxTokensPerLine: The maximum number of rule/pattern matches before tokenization of a line is stopped.

  • maxLineLength: The maximum line length, with longer lines being truncated to fit. Its value in the Grammar class can be Infinity, but this results in an error when set directly in the file.

  • limitLineLength: Boolean. As far as I can tell from GrammarRegistry, setting it to false makes the registry set maxLineLength to Infinity before converting to a Grammar object.

  • injections: Explained in my other question. Basically, a way for the active grammar to insert rules based on scope, instead of pulling them into other rules with include.

  • injectionSelector: Used to apply the grammar in files where it is not the active grammar. Based on the scope provided, which is converted into a ScopeSelector object.

  • patterns: Converted into an ‘initial’ rule, which is then used to begin tokenization.

  • repository: Storage for rules based on name, which can be included into other rules.

  • firstLineMatch: Used to help the score, like fileTypes.
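To check my reading of the list above, here is a minimal sketch of how those top-level properties might sit together in a grammar file. The language, scope names, extensions and regexes are all made up for illustration; only the property names come from the list.

```cson
# Hypothetical grammar: every value here is illustrative only.
name: 'Example'                        # label shown in the language selection menu
scopeName: 'source.example'            # root scope applied to all text
fileTypes: ['ex', 'example']           # extensions used when scoring the grammar
firstLineMatch: '^#!.*\\bexample\\b'   # also contributes to the score
maxTokensPerLine: 100                  # stop tokenizing a line after this many matches

# Converted into the 'initial' rule that begins tokenization.
patterns: [
  { include: '#comment' }
]

# Named rules that other rules can pull in with include.
repository:
  comment:
    name: 'comment.line.number-sign.example'
    match: '#.*$'
```

I left out injections, injectionSelector and the line-length properties, since I am less sure how they interact.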

Properties of an object in the patterns array observed by the Pattern class:

  • name: Scope applied to text matched by the pattern.

  • contentName: Scope applied to text between the begin and end captures of a pattern.

  • match: A single line regex used to determine a match. I’m unsure of the implementation though, specifically the difference between @match and @regexSource because of backreferences and the existence of an end rule? Especially because I didn’t think end was supposed to be used with match.

    if match
      if (end or @popRule) and @hasBackReferences ?= DigitRegex.test(match)
        @match = match
      else
        @regexSource = match

  • begin: Only checked if match does not exist. It is set up similarly to a match rule, but an end pattern is also generated and turned into a rule together with the other patterns.

  • end: Not required, but used to finish a begin match.

  • patterns: An array of objects which are part of the argument passed to @grammar.createRule(). This in turn makes a new Rule object, which passes each pattern to @grammar.createPattern(), which makes a new Pattern. This recursive combination effectively processes every rule and pattern, no matter how deeply nested.

  • captures: Used to apply scopes to the captured match text. Is alternatively used for begin if beginCaptures does not exist. Any patterns arrays inside of the capture group objects are processed as described in the above point.

  • beginCaptures: Captures specifically meant for the begin match.

  • endCaptures: Captures specifically meant for the end match. If it does not exist, captures will be tried instead.

  • applyEndPatternLast: Getting more unsure here. The name seems straightforward; I would assume that by default the end regex of the currently ‘active’ rule is always looked for first, before any internal pattern matches. Setting this to true would cause the reverse, where the internal patterns are tried first.

  • include: The code for this is nice and straightforward. If it starts with a #, the rule is looked for in the current repository. If the # appears further in, the left side is considered the scope name of another grammar, and the right side the rule name in that grammar’s repository. $self and $base get special handling. Otherwise, it is considered another grammar’s scope name, and its ‘initial’ rule is inserted.
    I haven’t tested how it behaves if match, patterns, etc. are also present, but that sounds like a bad thing to do.

  • popRule: Related to the ruleStack variable used in tokenizeLine, I think. Seems to be automatically applied to end patterns, so I’m guessing it pops an entire rule, which is made from a patterns array. Are there any use cases where using it directly can help? I tested it now, and an error is thrown if it’s used to pop the initialRule used in tokenizeLine. I don’t know why, but that amused me…

  • hasBackReferences: Automatically set if it doesn’t exist (tangent: the CoffeeScript ?= operator always confuses me). Used to determine if backreferences are present. If they are, they need to be replaced at some point.
    Is there any reason to provide this property directly? Performance doesn’t seem like an issue, as it’s only done once when initialised.
    Also, the AllDigitsRegex it uses doesn’t seem able to detect oniguruma backreferences (of the \k<n> form). Using these throws an error though, and I can’t determine where from. It seems to be in the construction of a Scanner.

  • disabled: A boolean only looked at by the Rule constructor, which does what it says.
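A sketch of how I understand the pattern properties above to combine, written as a patterns array for an imaginary 'source.example' grammar. All scope names and regexes are hypothetical, and the 'source.js#comments' include assumes the other grammar really defines a rule with that name.

```cson
patterns: [
  {
    # Single-line match; captures scope the numbered groups.
    match: '\\b(let)\\s+(\\w+)'
    name: 'meta.declaration.example'
    captures:
      '1':
        name: 'keyword.other.let.example'
      '2':
        name: 'variable.other.example'
  }
  {
    # begin/end pair; contentName covers only the text between the two.
    begin: '"'
    end: '"'
    name: 'string.quoted.double.example'
    contentName: 'meta.string-contents.example'
    beginCaptures:
      '0':
        name: 'punctuation.definition.string.begin.example'
    endCaptures:
      '0':
        name: 'punctuation.definition.string.end.example'
    patterns: [
      { match: '\\\\.', name: 'constant.character.escape.example' }
    ]
  }
  {
    # The end regex refers back to group 1 of begin; this is the case
    # that the \digit backreference check is there to detect.
    begin: '<<(\\w+)'
    end: '^\\1$'
    name: 'string.unquoted.heredoc.example'
  }
  {
    # Nested rule that re-enters the whole grammar between the parentheses.
    begin: '\\('
    end: '\\)'
    patterns: [
      { include: '$self' }
    ]
  }
  { include: '#comment' }             # rule from this grammar's repository
  { include: 'source.js#comments' }   # rule from another grammar's repository
  { include: 'source.js' }            # another grammar's initial rule
]
```

I have deliberately left popRule, hasBackReferences and disabled out, since I don’t yet know whether setting them by hand is ever sensible.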


#2

Of course, this could all become useless if the new method using tree-sitter works out.


#3

An example of applyEndPatternLast is available at https://github.com/atom/language-yaml/blob/b09319efc82be0652d0f99a07392530613d092c1/grammars/yaml.cson#L320, where you will notice that the escape pattern is '' but the end pattern is '.
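Roughly, the rule there has the following shape. This is a simplified sketch rather than a verbatim copy of the linked file, so check the link for the exact captures and scope names.

```cson
{
  begin: "'"
  end: "'"
  applyEndPatternLast: true    # try the inner patterns before the end regex
  name: 'string.quoted.single.yaml'
  patterns: [
    {
      # An escaped quote inside the string. Without applyEndPatternLast,
      # the first ' of the pair would be taken as the end of the string.
      match: "''"
      name: 'constant.character.escape.yaml'
    }
  ]
}
```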

I don’t know about the semantics of match, hasBackReferences, popRule.

I would encourage you to look into tree-sitter; Atom already has experimental support for it. It’s much more powerful than first-mate grammars.


#4

Thanks, I’m looking into tree-sitter now. To avoid ending up in the same situation I was in with the regex-based grammars, could you point me to some references for learning the new syntax?

I’ll try to learn what I can from the existing ones, but I would like to learn from a manual / guide if possible.


#5

I’ve found this repo

It seems to describe the syntax that will be used by Atom. I’ve only just started, so I haven’t tried running my own grammar to see if it works yet (I’m on the development build linked in this issue, so support’s not an issue). Not too sure on dependencies and how many I need to include directly in the package, but I’ll cross that bridge when I come to it.