Syntax highlighting using existing tokenizer


I’m thinking of writing an F# plugin for Atom and I was wondering if it is possible to reuse an existing tokenizer that is exposed by the F# Compiler Services rather than describing the grammar using a CSON file.

I found the Grammar class, which sounds like exactly what I could use. So, I’m thinking that I could implement my own Grammar and pass it to the addGrammar function on GrammarRegistry.

My questions are:

  1. Would this work? And are there any examples where people already did this that I could follow?
  2. Is there any way to change the colorization later? Say, there is a long running asynchronous process in the background that eventually tells me more information about some identifiers…

PS: Before I’m told that I should not be doing this :smile: here are a couple of reasons:

  • Some of the tokenization rules in F# are not trivial. For example, a string inside a comment still behaves like a string (useful if you want to comment out code). In code (* comment " *) " still comment *) code, the first *) sits inside a string, so it does not close the comment. It gets more subtle than this.

  • It just sounds silly not to reuse a tokenizer for the language that already exists and can be easily called (and will always be in sync with the version of the language currently used).

  • This is perhaps a separate thing (that could be done on top of basic grammar), but F# queries have extensible syntax where the keywords can be highlighted only after we know the types (after type checking).
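To make the first point concrete, here is a minimal TypeScript sketch of the comment rule, showing why a plain "find the next *)" pattern is not enough. The names `classify`, `Span` and `Kind` are made up for illustration, and this is not the F# compiler's lexer: it handles a single line and ignores special cases such as the `(*)` operator, verbatim and triple-quoted strings, and character literals.

```typescript
type Kind = "code" | "comment" | "string";

interface Span {
  kind: Kind;
  text: string;
}

// Splits one line into code / string / comment spans.
// Comments nest, and a string inside a comment hides any "*)" it contains.
function classify(line: string): Span[] {
  const spans: Span[] = [];
  let start = 0; // where the span currently being built begins
  let depth = 0; // comment nesting depth
  let inString = false;
  let i = 0;

  const flush = (end: number, kind: Kind): void => {
    if (end > start) spans.push({ kind, text: line.slice(start, end) });
    start = end;
  };

  while (i < line.length) {
    const two = line.slice(i, i + 2);
    if (inString) {
      if (line[i] === "\\") {
        i += 2; // skip the escaped character
      } else if (line[i] === '"') {
        inString = false;
        i += 1;
        if (depth === 0) flush(i, "string");
      } else {
        i += 1;
      }
    } else if (line[i] === '"') {
      if (depth === 0) flush(i, "code");
      inString = true; // inside a comment the string stays part of the comment
      i += 1;
    } else if (two === "(*") {
      if (depth === 0) flush(i, "code");
      depth += 1;
      i += 2;
    } else if (two === "*)" && depth > 0) {
      depth -= 1;
      i += 2;
      if (depth === 0) flush(i, "comment");
    } else {
      i += 1;
    }
  }
  // Whatever is left belongs to the construct that is still open.
  flush(line.length, depth > 0 ? "comment" : inString ? "string" : "code");
  return spans;
}

const spans = classify('code (* comment " *) " still comment *) code');
console.log(spans.map(s => s.kind + ": " + JSON.stringify(s.text)).join("\n"));
```

A real implementation would also need to carry `depth` and `inString` from one line to the next, which is exactly the kind of state the compiler's tokenizer already tracks.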

PPS: Yes, there is already language-fsharp for Atom, which is probably a good start if we really have to use CSON grammar specification.


Would this work?

I assume this is referring to the concept of implementing one’s own class that conforms more or less to the API of the Grammar class.

In theory … maybe? I mean, this is one of the “beauties” of JavaScript right? You can monkey patch anything. I haven’t looked deeply enough into the first-mate package to be able to say for certain.
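As a rough illustration of the idea, here is a TypeScript sketch of a Grammar-like class that hands each line to an external tokenizer and carries that tokenizer's state from line to line. Everything in it is an assumption: `ExternalTokenizerGrammar`, `LineTokenizer` and the toy tokenizer are hypothetical names, the method name `tokenizeLine` only mirrors the one on first-mate's Grammar, and the exact signature and return shape the editor expects would need to be checked against the first-mate source. The scope names are illustrative TextMate-style names, not taken from language-fsharp.

```typescript
// Hypothetical shape of what an external tokenizer returns for one line.
interface RawToken {
  text: string;
  kind: string;
}
interface LineResult<S> {
  tokens: RawToken[];
  state: S; // lexer state to feed into the next line
}
type LineTokenizer<S> = (line: string, state: S) => LineResult<S>;

interface ScopedToken {
  value: string;
  scopes: string[];
}

class ExternalTokenizerGrammar<S> {
  scopeName: string;
  private tokenizer: LineTokenizer<S>;
  private initialState: S;
  private scopeFor: (kind: string) => string | undefined;

  constructor(
    scopeName: string,
    tokenizer: LineTokenizer<S>,
    initialState: S,
    scopeFor: (kind: string) => string | undefined
  ) {
    this.scopeName = scopeName;
    this.tokenizer = tokenizer;
    this.initialState = initialState;
    this.scopeFor = scopeFor;
  }

  // Tokenizes one line; pass null for the first line of a buffer.
  tokenizeLine(line: string, state: S | null): { tokens: ScopedToken[]; state: S } {
    const result = this.tokenizer(line, state === null ? this.initialState : state);
    const tokens = result.tokens.map(t => {
      const scope = this.scopeFor(t.kind);
      return { value: t.text, scopes: scope ? [this.scopeName, scope] : [this.scopeName] };
    });
    return { tokens, state: result.state };
  }
}

// Toy stand-in for the real tokenizer: the state is "are we inside a comment?".
const toyTokenizer: LineTokenizer<boolean> = (line, inComment) => {
  const tokens: RawToken[] = [];
  for (const text of line.match(/\s+|\S+/g) || []) {
    if (text === "(*") inComment = true;
    const kind = inComment ? "comment" : text === "let" ? "keyword" : "text";
    tokens.push({ text, kind });
    if (text === "*)") inComment = false;
  }
  return { tokens, state: inComment };
};

const scopeTable: { [kind: string]: string | undefined } = {
  comment: "comment.block.fsharp",
  keyword: "keyword.other.fsharp",
};

const grammar = new ExternalTokenizerGrammar("source.fsharp", toyTokenizer, false, kind => scopeTable[kind]);
const firstLine = grammar.tokenizeLine("let x (* open", null);
const secondLine = grammar.tokenizeLine("still comment *) let", firstLine.state);
console.log(secondLine.tokens.map(t => t.value + " -> " + t.scopes.join(" ")));
```

The sketch deliberately stops short of registering anything with the editor. Whether an object like this can be handed to addGrammar and behave is the part that would need testing.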

And are there any examples where people already did this that I could follow?

I’m pretty certain there aren’t any that asynchronously shell out to some other process to tokenize things, no. You might want to look at this topic for some ideas though:

Is there any way to change the colorization later?

I don’t believe the tokenization is done asynchronously at this point, but you probably want to look at and to see for yourself:

Syntax highlighting using provided lexer

Thanks a lot for the pointers! I mainly wanted to make sure that I’m not trying to do something completely silly…


If you are curious, I did manage to make a code-based grammar for TypeScript:

However, it’s not async, so I can only use TypeScript’s lexical classifier, which is super fast, in fact faster than the regex grammar for JavaScript alone.

I do plan to use the syntactic classifier at some point, which needs to be async, but the current version does most of the things you would want it to do.