How does the Atom text editor parse / tokenise code? (syntax highlighting)


#1

So CodeMirror uses modes to tokenise its code.
It breaks the document into lines and turns each line into a stream, which is then fed to the pre-defined mode. A mode can handle constructs that span multiple lines through its state parameter.
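
To make the stream-plus-state idea concrete, here is a minimal sketch in Python. It is not CodeMirror's actual API: `LineStream`, `token` and `tokenize` are hypothetical names, and the only construct it recognises is a `/* ... */` comment. It just shows how a per-line stream and a state object can carry a comment across lines.

```python
class LineStream:
    """Hypothetical stand-in for a per-line character stream (not CodeMirror's API)."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def eol(self):
        return self.pos >= len(self.text)

    def skip_to(self, needle):
        """Advance just past `needle`; if it is absent, consume the rest and return False."""
        idx = self.text.find(needle, self.pos)
        if idx == -1:
            self.pos = len(self.text)
            return False
        self.pos = idx + len(needle)
        return True


def token(stream, state):
    """Consume one token from the stream and return its style (None for plain text)."""
    if state["in_comment"]:
        # Still inside a comment opened on this or an earlier line.
        if stream.skip_to("*/"):
            state["in_comment"] = False
        return "comment"
    if stream.text.startswith("/*", stream.pos):
        stream.pos += 2
        # Stay in comment state unless the comment closes on this line.
        state["in_comment"] = not stream.skip_to("*/")
        return "comment"
    # Plain text runs up to the next comment opener or the end of the line.
    idx = stream.text.find("/*", stream.pos)
    stream.pos = len(stream.text) if idx == -1 else idx
    return None


def tokenize(lines):
    """Tokenise line by line, passing the same state object from one line to the next."""
    state = {"in_comment": False}
    result = []
    for line in lines:
        stream = LineStream(line)
        tokens = []
        while not stream.eol():
            start = stream.pos
            style = token(stream, state)
            tokens.append((line[start:stream.pos], style))
        result.append(tokens)
    return result
```

Calling `tokenize(["a /* b", "c */ d"])` marks `/* b` and `c */` as comment tokens, because `state["in_comment"]` survives the line break.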

This method doesn’t inherently use RegExp (though obviously whoever writes the mode can use RegExp inside it).

From what I’ve read of Atom’s code and style, it calls its syntax highlighters grammars, and they closely resemble TextMate’s grammars.
These grammars look like JSON objects containing class names and RegExps (see how to write a TextMate grammar).
It seems ACE uses a similar method.

I can’t for the life of me figure out how exactly Atom actually parses code, keeps its state, and extends through various scopes.

If someone could point me in the right direction that would be great.


#2

I would be really happy if that’s the case : ) All I’ve seen is regex in Atom.

I’m really interested in this whole deal, will be watching the thread closely. Thanks for asking!


#3

Yeah - it seems that CodeMirror’s modes are more difficult to write, but more powerful, and can tackle most edge cases. TextMate’s and ACE’s grammars, however, rely on RegExp features that JavaScript unfortunately doesn’t fully implement.

Since Atom is a newer text editor, I was hoping to find where it actually performs the parsing/tokenising of code, but I can’t find the right module in the GitHub repo. I can only seem to find code that runs before or after the parsing process…


#4

That would be first-mate, which uses oniguruma for its engine.


#5

It looks like there is a moment where scripting could be allowed after tokenizing with regex, but unfortunately it seems to be handled internally in command-registry. It looks like it would be pretty tricky to circumvent.

As far as how the regex works, I don’t see a good high-level explanation. It’s hard to tell if each RegExp applies to the whole line/section, or if it treats pretokenized sections differently. I.e. if you start by tokenizing comments, do you get to ignore them the rest of the time?


#6

Atom’s grammar engine is built to emulate TextMate’s as closely as possible. The engine breaks the document into lines and runs each line through the grammar to find matching rules. Rules match by regular expression. A rule can also activate a different state that gets carried over to the next line. This is how things like multi-line comments work.

At a very high level, that is how Atom’s grammar engine works. If you have more specific questions, I can possibly point you in the right direction.
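
The line-by-line matching described above can be sketched roughly as follows. This is a toy in Python, not first-mate's code: `GRAMMAR` and `tokenize_line` are hypothetical, it uses Python's `re` rather than Oniguruma, and a begin/end rule here has no nested patterns (in a real TextMate grammar, nested patterns are also tried while a begin/end rule is open). At each position the rule that matches earliest on the line wins, and scanning resumes after that match.

```python
import re

# Hypothetical toy grammar: a flat list of rules, not a real Atom grammar.
GRAMMAR = [
    {"name": "comment.block", "begin": r"/\*", "end": r"\*/"},
    {"name": "string.quoted", "match": r'"[^"]*"'},
    {"name": "constant.numeric", "match": r"\d+"},
]


def tokenize_line(line, stack):
    """Tokenise one line. `stack` holds the open begin/end rules and is
    mutated in place, so it carries the state over to the next line."""
    tokens = []
    pos = 0
    while pos < len(line):
        if stack:
            # Inside a begin/end rule: in this toy, only its end pattern is tried.
            rule = stack[-1]
            m = re.compile(rule["end"]).search(line, pos)
            if m is None:
                tokens.append((line[pos:], rule["name"]))
                pos = len(line)
            else:
                tokens.append((line[pos:m.end()], rule["name"]))
                pos = m.end()
                stack.pop()
            continue
        # Pick the rule that matches earliest; ties go to the rule listed first.
        best = None
        for rule in GRAMMAR:
            m = re.compile(rule.get("match") or rule["begin"]).search(line, pos)
            if m and m.end() > m.start():  # skip zero-width matches
                if best is None or m.start() < best[0].start():
                    best = (m, rule)
        if best is None:
            tokens.append((line[pos:], None))
            break
        m, rule = best
        if m.start() > pos:
            tokens.append((line[pos:m.start()], None))  # unmatched text
        tokens.append((m.group(0), rule["name"]))
        if "begin" in rule:
            stack.append(rule)  # state that survives the end of the line
        pos = m.end()
    return tokens
```

Feeding it `'x = 42 /* "hi'` and then `'7" */ 8'` with the same `stack` list tags `42` and `8` as numbers, while the `7` and the quote characters inside the comment are never offered to the number or string rules.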


#7

Can a state send the parser to a particular location in the grammar? Not the text, mind you, but the grammar definition. The (bad) alternative would be that each rule is still tried, but sees the state and then moves on.

So I’m wondering if a state is more like a flag that every rule sees, or like a redirect notice.


#8

A grammar can send the parser to a particular location in the grammar. In the language-php grammar, for example, there are a lot of statements like 'include': '#namespace' and 'include': '#php_doc'. Each of these references a section of the grammar’s repository. It’s possible to include a whole other grammar with a statement like 'include': 'text.html.basic', naming the other grammar by its scope. You can get back to the root of the current grammar with an 'include': '$self' instruction.
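
A rough way to picture the three kinds of include is the sketch below. It is only an illustration in Python: the `GRAMMARS` registry, the `source.toy` grammar and both helper functions are hypothetical, and the patterns are made up rather than taken from language-php.

```python
# Hypothetical registry of grammars keyed by scope name; contents are illustrative only.
GRAMMARS = {
    "text.html.basic": {
        "scopeName": "text.html.basic",
        "patterns": [{"name": "meta.tag", "match": r"<[^>]+>"}],
        "repository": {},
    },
    "source.toy": {
        "scopeName": "source.toy",
        "patterns": [
            {"include": "#number"},
            {"include": "#group"},
            {"include": "text.html.basic"},
        ],
        "repository": {
            "number": {"name": "constant.numeric", "match": r"\d+"},
            "group": {
                "name": "meta.group",
                "begin": r"\(",
                "end": r"\)",
                "patterns": [{"include": "$self"}],
            },
        },
    },
}


def resolve_include(target, grammar):
    """Return (patterns, owning_grammar) for an 'include' string."""
    if target == "$self":
        return grammar["patterns"], grammar  # back to the grammar's root
    if target.startswith("#"):
        return [grammar["repository"][target[1:]]], grammar  # repository entry
    other = GRAMMARS[target]  # another grammar, named by its scope
    return other["patterns"], other


def candidate_rules(patterns, grammar):
    """Flatten `patterns` into concrete rules by following includes.

    Nested 'patterns' inside a begin/end rule are not expanded here; they would
    be expanded only once that rule is open. A grammar whose root includes
    '$self' directly would loop forever in this sketch.
    """
    rules = []
    for pattern in patterns:
        if "include" in pattern:
            resolved, owner = resolve_include(pattern["include"], grammar)
            rules.extend(candidate_rules(resolved, owner))
        else:
            rules.append(pattern)
    return rules
```

With this toy data, expanding the root patterns of `source.toy` yields the number rule, the group rule and the HTML tag rule, and expanding the group's own `patterns` (its `$self` include) yields the same three again.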

I’ve built a grammar template for my own reference and understanding, and the TextMate documentation is very good (it just requires a little translation from JS to CS).


#9

That’s cool. Starting to get the picture.

Does an included rule have to be an entire pattern? Can I define just a begin rule without a contentName or end, and use it like a snippet?

More importantly, in the php example, what are ‘injections’ and what is that line that follows?


#10

All of the pattern fields are optional. Some depend on others (beginCaptures is meaningless without begin), but none of them are required.

“More importantly, in the php example, what are ‘injections’ and what is that line that follows?”

Injections aren’t covered in my template because they weren’t in the TextMate docs and I haven’t sat down to fully understand them yet, but there’s a thread on these forums with investigation into the subject.


#11

Ugh, no one had quite come out and mentioned that the regex is not JS-flavored!

FYI, it is Ruby’s flavor of regex.

:sob:


#12

I’ve never needed to inspect oniguruma closely, so I hadn’t noticed. :anguished:


#13

@leedohm @DamnedScholar thanks to you both - led me to where I wanted to go! :slight_smile: