Better Syntax Highlighting


#1

I haven’t delved too deeply into the capabilities of the syntax highlighting engine in Atom, so forgive me if some of this is already in there but just isn’t taken advantage of by the current syntax packages. But as one of my recent posts stated, I wanted to share some of my more radical thoughts on what a truly awesome editor should be.

As @codinghorror et al. have stated in the past, more and more often we are encountering languages within languages. Regular expressions are one of the more oft-cited examples, but even JavaDoc or YARD documentation markup in comments can be used as a reference:

# Frobs the frobnicator.
#
# @return [Bar] New frobnicator setting.
def foo
  # snip
end

What I would like to see here is not for one package to have to handle every possible sub-language that is embeddable in the super-language, but for one package to handle the super-language (in the text above, Ruby) and another package to handle the sub-language (in the text above, YARD). Having this kind of composability of syntaxes (syntices?) would significantly raise the capabilities of the editor without raising the bar of complexity for language syntax package authors.

Also, I would hope if I wrote a Foo syntax package that it could automatically and easily compose with any language that supports Foo embedded within it. Both SQL and regular expressions come to mind as things that are often embedded in other languages. (Well, and everything is embedded within HTML these days.) So if I were writing the Ruby syntax package, I could tag a sequence as a regular expression and be confident that sequence would be highlighted by the user’s choice of regex highlighting package.

We all understand what composability does for us in our applications. Let’s afford it to ourselves in our syntax highlighter so that we don’t get frustrated that embedded SQL is highlighted differently in Foo language as opposed to Bar language because the person who implemented the Bar highlighter knows more about SQL than the Foo implementer.
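To make the delegation idea a bit more concrete, here is a minimal Python sketch. It is not Atom's API: `GrammarRegistry`, both toy highlighters, and the scope names used for the regex body and its delimiters are assumptions made up for illustration. The "Ruby" side only knows where a regex literal starts and ends, and hands the body to whatever is registered for the regex scope.

```python
import re
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (scope name, text)
Highlighter = Callable[[str], List[Token]]


class GrammarRegistry:
    """Hypothetical registry: scope name -> whichever highlighter the user installed."""

    def __init__(self) -> None:
        self._highlighters: Dict[str, Highlighter] = {}

    def register(self, scope: str, highlighter: Highlighter) -> None:
        self._highlighters[scope] = highlighter

    def highlight(self, scope: str, text: str) -> List[Token]:
        highlighter = self._highlighters.get(scope)
        if highlighter is None:
            # No package installed for this sub-language: keep one opaque token.
            return [(scope, text)]
        return highlighter(text)


def make_ruby_highlighter(registry: GrammarRegistry) -> Highlighter:
    """Toy 'Ruby' highlighter that only recognizes /regex/ literals."""
    regex_literal = re.compile(r"/((?:\\.|[^/\\])+)/")

    def highlight(text: str) -> List[Token]:
        tokens: List[Token] = []
        pos = 0
        for match in regex_literal.finditer(text):
            if match.start() > pos:
                tokens.append(("source.ruby", text[pos:match.start()]))
            tokens.append(("punctuation.regexp.begin", "/"))
            # Delegate the body to whatever is registered as 'source.regexp'.
            tokens.extend(registry.highlight("source.regexp", match.group(1)))
            tokens.append(("punctuation.regexp.end", "/"))
            pos = match.end()
        if pos < len(text):
            tokens.append(("source.ruby", text[pos:]))
        return tokens

    return highlight


def simple_regex_highlighter(text: str) -> List[Token]:
    """Toy regex highlighter: quantifiers versus everything else."""
    return [
        ("keyword.operator.quantifier" if ch in "*+?" else "constant.character", ch)
        for ch in text
    ]


registry = GrammarRegistry()
registry.register("source.ruby", make_ruby_highlighter(registry))
# Swapping in a different regex package is just another register() call.
registry.register("source.regexp", simple_regex_highlighter)
print(registry.highlight("source.ruby", "x =~ /a*b+/"))
```

The point of the design is that the Ruby author never imports or names a specific regex package; the only shared contract is the scope name.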


#2

Seems like a good idea in theory, but do you have thoughts on how the sub-language syntax highlighting would look? At first thought, it seems like you could easily confuse yourself (especially in the case of regex) with highlighting that, for lack of a better word, interfered with the super-language (by using the same colors for things, etc).

Aside from that (unless I’m totally missing the point, it’s happened before) I do agree that there needs to be a super-language syntax and some way of indicating that a sub-language’s syntax is handled by another language’s syntax package.


#3

Well, this part is already handled because it is the theme that determines the color of things, not the syntax. The syntax just states that if is a keyword and /a*b+/ is a regular expression literal. It is up to the theme to determine how general or specific it wants to be in its coloration, so some themes will give a different color to the quote characters than to the text inside a string while others will just color the whole thing the same color.

But yes, this type of thing will require perhaps stricter conventions than we have currently in our syntax packages. All the more reason we should hash all this out here and start discussing the theory of editors, what they should and should not do, as well as the practice, i.e. “My workflow depends on feature X from Sublime Text, can you please add it?”.
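The split between syntax and theme described above can be sketched in a few lines of Python. This is a simplification and an assumption on my part: real scope selectors are richer than plain dotted-prefix matching, and `color_for` and both themes are invented for the example. The syntax only emits scope names; each theme decides how specific it wants to be.

```python
from typing import Dict, Optional


def color_for(scope: str, theme: Dict[str, str]) -> Optional[str]:
    """Return the color of the longest theme selector that is a dotted prefix of `scope`."""
    parts = scope.split(".")
    for length in range(len(parts), 0, -1):
        selector = ".".join(parts[:length])
        if selector in theme:
            return theme[selector]
    return None  # the theme has no opinion; the editor's default color applies


# A coarse theme colors every kind of string the same way.
coarse_theme = {"string": "green"}

# A finer theme gives regex literals their own color.
fine_theme = {"string": "green", "string.regexp": "orange"}

print(color_for("string.regexp.ruby", coarse_theme))
print(color_for("string.regexp.ruby", fine_theme))
```

Under this model a sub-language cannot "interfere" with the super-language's colors unless the theme chooses to give both the same color.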


#4

AFAIK this is already feasible for many cases, but, sadly, not all.

In a language grammar you can use the include keyword in patterns to refer to another grammar. For instance, I just made a PR on the language-coffee-script package to highlight embedded JavaScript, and as you can see the feature is quite simple to use.

However, there are actually some limitations:

  • You can’t use patterns in single-match rules; it only works with begin/end rules (though I think it could be implemented).
  • The sub-language can consume the end match of the super-language rule, leading to improper highlighting after the sub-language part.
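The second limitation is easier to see with a toy model. The Python sketch below is not Atom's tokenizer; it assumes a simple "earliest match wins" rule between the end pattern and the included patterns, and `scan_region` and the scope names are hypothetical. A greedy sub-language rule that starts before the closing delimiter swallows it, so the region never closes.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (scope name, text)


def scan_region(text: str, end_pattern: str,
                sub_patterns: List[Tuple[str, str]]) -> Tuple[List[Token], bool]:
    """Tokenize the inside of a begin/end rule; return (tokens, region_closed)."""
    end_re = re.compile(end_pattern)
    subs = [(name, re.compile(pattern)) for name, pattern in sub_patterns]
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        end_match = end_re.search(text, pos)
        best = None  # earliest non-empty sub-language match
        for name, rx in subs:
            m = rx.search(text, pos)
            if m and m.end() > m.start() and (best is None or m.start() < best[1].start()):
                best = (name, m)
        if end_match and (best is None or end_match.start() <= best[1].start()):
            if end_match.start() > pos:
                tokens.append(("embedded", text[pos:end_match.start()]))
            tokens.append(("end", end_match.group()))
            return tokens, True
        if best is None:
            break
        name, m = best
        if m.start() > pos:
            tokens.append(("embedded", text[pos:m.start()]))
        tokens.append((name, m.group()))  # may run past the end delimiter
        pos = m.end()
    if pos < len(text):
        tokens.append(("embedded", text[pos:]))
    return tokens, False


# Text following an opening backtick; the region should close at the next backtick.
inside = "x = 1 // note` + 2"

# A line-comment rule written for standalone files eats the closing backtick.
greedy = scan_region(inside, "`", [("comment.line.js", r"//.*")])

# The same rule made aware of the delimiter stops in time.
careful = scan_region(inside, "`", [("comment.line.js", r"//[^`]*")])

print(greedy)
print(careful)
```

The "careful" variant works, but only because the sub-language rule knows about the super-language's delimiter, which is exactly the coupling this thread is trying to avoid.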

“Also, I would hope if I wrote a Foo syntax package that it could automatically and easily compose with any language that supports Foo embedded within it”

It seems you can set an include pattern to an undefined language without breaking the grammar, so in theory you can just declare all the inclusion rules for your grammar and let Atom handle the rest. As for the choice of the highlighting package for a given language, I guess using the packages’ enable/disable toggle is enough to handle it, as long as they use the same grammar name (source.js, source.coffee, source.ruby, etc.).

The biggest issue I see is with sub-languages used in a context where they can’t be distinguished from otherwise legal expressions; the classic example pointed out by @codinghorror is SQL in strings. There’s just no way to know if the string content is a complete SQL query, only a part of one, or something that may look like SQL but is not.
Another ambiguity can arise if you want to have highlighting for different flavors of SQL (MySQL, PostgreSQL, SQLite) that may have subtle differences. I don’t know how Atom would behave in that case, but I feel that it may lead to a lot of problems. Without a construct in the language to annotate the string (as Python has with string literal prefixes), a highlighter can’t tell which language makes up the string content. To solve that you can always try relying on language detection (as done in highlight.js), but I’m not sure it is reliable on small snippets of code that may not contain distinguishing statements.


#5

Great! It sounds like this is fairly close to being possible in Atom, we just have to give syntax package authors some guidelines on how to achieve it.

I would consider this a bug, but perhaps not an unavoidable one depending on the grammars of the super- and sub-language.

Absolutely, sub-languages embedded in strings are going to be really hard … so let’s leave those aside for now.

One last requirement I would like to see … a sub-language should not need the permission of the super-language to syntax highlight a part of the file. In the case of the Ruby/YARD example above, ideally the Ruby package author shouldn’t have to do the 'include': 'source.ruby.comment.yard' part. I should just be able to write a YARD syntax package and tell Atom somehow that it works on source.ruby files. Then I just add some rules to add scopes to the parts I want and leave the rest alone.
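Here is a rough Python sketch of what I mean by a sub-language not needing permission. Everything in it is hypothetical: the `Injections` class, the `inject` call, and the YARD scope names are invented, and the Ruby tokens are hard-coded as if some Ruby package had already produced them. The YARD side only declares which scope it wants to refine and adds scopes to the parts it recognizes, leaving the rest alone.

```python
import re
from typing import Dict, List, Pattern, Tuple

Token = Tuple[str, str]  # (scope stack as a space-separated string, text)
Rule = Tuple[str, Pattern[str]]


class Injections:
    """Hypothetical registry of rules that sub-languages attach to existing scopes."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Rule]] = {}

    def inject(self, target_scope: str, name: str, pattern: str) -> None:
        self._rules.setdefault(target_scope, []).append((name, re.compile(pattern)))

    def refine(self, tokens: List[Token]) -> List[Token]:
        refined: List[Token] = []
        for scope, text in tokens:
            refined.extend(self._split(scope, text, self._rules.get(scope, [])))
        return refined

    @staticmethod
    def _split(scope: str, text: str, rules: List[Rule]) -> List[Token]:
        pieces: List[Token] = []
        pos = 0
        while pos < len(text):
            candidates = []
            for name, rx in rules:
                m = rx.search(text, pos)
                if m and m.end() > m.start():
                    candidates.append((m.start(), name, m))
            if not candidates:
                break
            _, name, m = min(candidates, key=lambda c: c[0])  # earliest match wins
            if m.start() > pos:
                pieces.append((scope, text[pos:m.start()]))
            pieces.append((scope + " " + name, m.group()))
            pos = m.end()
        if pos < len(text):
            pieces.append((scope, text[pos:]))
        return pieces


# Pretend output of a Ruby package that knows nothing about YARD.
ruby_tokens = [
    ("comment.line.ruby", "# @return [Bar] New frobnicator setting."),
    ("source.ruby", "def foo"),
]

# The YARD package targets Ruby comments on its own initiative.
yard = Injections()
yard.inject("comment.line.ruby", "keyword.tag.yard", r"@\w+")
yard.inject("comment.line.ruby", "entity.type.yard", r"\[[^\]]+\]")

print(yard.refine(ruby_tokens))
```

Several sub-language packages could each call `inject` against the same super-language without the Ruby author having to list any of them.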


#6

Actually, you should be able to do something close to that, just like in the Literate CoffeeScript grammar: create a YARD grammar that only contains rules for comments, and include the Ruby grammar for everything else (maybe a rule that starts with a [^#] and ends before a #). Then use your YARD grammar for Ruby files instead of the Ruby grammar.


#7

Yes, this definitely would work for one super-language and one sub-language. But what about the case of one super-language and four or five sub-languages? Regex literals, YARD documentation comments, snippets that I want only to be active within certain DSLs such as Rake or RSpec (even if the coloration is the same as regular Ruby), Ruby embedded within a string colored as actual Ruby, etc. I just think that requiring that the syntax package have a priori knowledge of all other syntaxes that you are going to want to use in order for them to work is too limiting.

I’m trying to think of this not as “how can we take what we have and move it the next iterative step forward” but more as “what is the ultimate that we can dream up” and then work backwards to figure out how to get there.


#8

I think the best would be to keep the super-language syntax active, using a half-tone between the default color and the sub-language syntax colors (for example, JavaScript comment colors + JSDoc syntax colors). Then, when the mouse hovers over such a section and/or the cursor is inside it, use full syntax colors in that section (maybe in all of the sections that share this syntax?) and gray down the colors of the main code, similar to what Xcode does when selecting a block of code on the ruler.


#9

Sub-languages in strings should never be highlighted. It’s almost universally a bad practice to embed any functioning code into a string. We shouldn’t encourage it.

Both SQL in strings, and JS in strings (new Function("/* OMG this executes! */")) are considered bad ideas. I don’t think that there are many cases where it’s considered idiomatic, or a “good idea”.

Code in comments and “special comments”, on the other hand, are used frequently. You see it a lot in “self-documenting” code, and it’s used in C/C++, HTML (<!--[IE-specific nonsense]-->), JS tooling control (JSHint/JSLint/ESLint), the list goes on for a bit.

Maybe a good start would be a multi-lingual “special comment” recognizer, that keeps a list of known “special-comment” cases per-grammar, and switches highlighting within that range?


#10

I think one thing that you’re omitting is that we don’t just edit code with our editor. GitHub-Flavored Markdown is a first-class language in Atom … and code is very often embedded within it.

For another, there is a long and storied history of languages being embedded in other languages … and not just in strings. JavaScript can be embedded in CoffeeScript. Command substitution in shell languages like bash. Regular expressions are first-class objects in many scripting languages. Assembler can be embedded in many older languages.

So whether or not we should encourage the degenerate cases you mention … I strongly believe that this capability should be supported for all the non-degenerate cases.


#11

Oh, I agree. I’m just saying that perhaps we should start small? GFM is kind of crazy in that virtually every grammar is going to appear in it at some point. And, it needs to be aware of the differences between what Atom supports, and what Linguist supports. GFM embedded-language support should probably be its own plugin.

All I was saying was that code inside a string literal shouldn’t be given its own highlight hints.


#12

IntelliJ does this. They take two approaches. In some circumstances they “intuit” what the language of the embedded section is because they have a list of calls that take strings which contain other languages, e.g., a command to execute SQL will take a string of SQL code so the string is highlighted as SQL, even if the string is defined in a variable. - And to whoever said you shouldn’t put SQL in code, you are flat out nuts. What you shouldn’t do is assemble queries from user input. Especially in IntelliJ which validates queries against the database, gives you SQL code completion, etc. -

The second approach is annotation. You can use special comments before the string that tell IntelliJ what code the string contains.

Obviously the first approach requires detailed knowledge of not just the language but of the language’s libraries. You can’t reliably do this without implementing an actual parser/lexer for every language, especially the “main” language. You should ask whether it’s worth doing for a general editor, not an IDE.
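The second, annotation-based approach needs far less machinery, and a general editor could get surprisingly far with it. A minimal Python sketch follows; the `# language=...` marker format and the `annotated_strings` helper are assumptions chosen for the example, not a description of how IntelliJ implements it. A comment line names the language, and the first string literal on the very next line is treated as code in that language.

```python
import re
from typing import List, Optional, Tuple

# Assumed marker format: a comment line such as "# language=SQL".
ANNOTATION = re.compile(r"^\s*#\s*language\s*=\s*(\w+)\s*$")

# A single- or double-quoted string literal with backslash escapes.
STRING = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")


def annotated_strings(source: str) -> List[Tuple[str, str]]:
    """Return (language, string literal) pairs for strings that directly follow a marker."""
    results: List[Tuple[str, str]] = []
    pending: Optional[str] = None
    for line in source.splitlines():
        marker = ANNOTATION.match(line)
        if marker:
            pending = marker.group(1).lower()
            continue
        if pending is not None:
            literal = STRING.search(line)
            if literal:
                results.append((pending, literal.group()))
            pending = None  # the marker only applies to the very next line
    return results


example = (
    "# language=SQL\n"
    'query = "SELECT * FROM users"\n'
    'name = "plain old string"\n'
)
print(annotated_strings(example))
```

Unannotated strings are simply left alone, so nothing has to guess whether a string merely looks like SQL.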


#13

IntelliJ/PyCharm/WebStorm also recently started supporting language embedding inside script tags in HTML. How would this be implemented within include declarations?

<script type="text/coffeescript">...</script>
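One way to think about it, sketched in Python rather than as an actual grammar: the HTML side only needs a table from the type attribute to a grammar name, and the body is then handed to whatever grammar has that name. The `MIME_TO_SCOPE` table and `embedded_scripts` helper are hypothetical, the fallback to plain text is my assumption, and regex-based HTML matching like this is only good enough for illustration.

```python
import re
from typing import List, Tuple

SCRIPT_TAG = re.compile(r"<script\b([^>]*)>(.*?)</script>", re.S | re.I)
TYPE_ATTR = re.compile(r"""type\s*=\s*["']([^"']+)["']""", re.I)

# Hypothetical mapping from the type attribute to a grammar name.
MIME_TO_SCOPE = {
    "text/javascript": "source.js",
    "text/coffeescript": "source.coffee",
}


def embedded_scripts(html: str) -> List[Tuple[str, str]]:
    """Return (grammar name, script body) for every script element in `html`."""
    found: List[Tuple[str, str]] = []
    for attrs, body in SCRIPT_TAG.findall(html):
        type_match = TYPE_ATTR.search(attrs)
        # A script tag without a type attribute is treated as JavaScript here.
        mime = type_match.group(1).lower() if type_match else "text/javascript"
        # Unknown types fall back to plain text instead of breaking anything.
        found.append((MIME_TO_SCOPE.get(mime, "text.plain"), body))
    return found


page = '<script type="text/coffeescript">x = 1</script><script>var y;</script>'
print(embedded_scripts(page))
```

In grammar terms this would presumably mean one begin/end rule per known type value, each with its own include, which again requires the HTML package to know every embeddable language up front.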