Recognizing unicode words in grammar


I’m trying to solve an issue more robustly in a package I maintain. The package’s language grammar needs to recognize word characters. For English, the “\w” regex class is sufficient, but for languages with other characters I currently have to hard-code every Unicode character to be matched. Is there a way to use the “\p{…}” Unicode matching classes in Atom’s grammar files?

From what I’m reading, this might be a limitation of JavaScript’s regex engine. There are JavaScript packages that translate regexes to simulate the Unicode matching classes. I could use those in my JavaScript code, but not in a grammar.


Is there a way to use the “\p{…}” unicode matching classes in Atom’s grammar files?

Did you try it? It might just work. I think grammars use an alternative regex engine that supports POSIX character classes, for compatibility with TextMate grammars.


Again, this should not be a problem as grammars do not use the JavaScript regex engine. They use Oniguruma, in which \w matches more than just [A-Za-z0-9_].


My mistake. I thought I had tried it, but I must not have reloaded properly or something. I tried using \p{Lu} to match uppercase word characters, and it worked great in the grammar.

Unfortunately, the package also needs to create a “preview” pane, similar to how Markdown Preview has both syntax highlighting and a rendered preview. That code seems to use plain JavaScript regexes, since the same trick doesn’t work there. This regex in particular is the problem.
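
For anyone hitting the same thing, here is a small sketch of why the grammar trick can fail silently on the JavaScript side. The sample strings are made up for illustration. In a JavaScript regex without the `u` flag, `\p{Lu}` (the uppercase-letter class) is not a property class at all: it is read as the literal text `p{Lu}`, and `\w` only covers ASCII.

```javascript
// Without the `u` flag, \p is just an escaped literal "p", so /\p{Lu}/
// matches the text "p{Lu}" rather than any uppercase letter.
const legacyPattern = /\p{Lu}/;

const matchesAccentedCapital = legacyPattern.test("\u00C9"); // "É" -> false
const matchesLiteralText = legacyPattern.test("p{Lu}");      // true

// In JavaScript, \w is ASCII-only: [A-Za-z0-9_].
const wordMatchesAscii = /^\w+$/.test("ecole");        // true
const wordMatchesAccent = /^\w+$/.test("\u00E9cole");  // "école" -> false

console.log({
  matchesAccentedCapital,
  matchesLiteralText,
  wordMatchesAscii,
  wordMatchesAccent,
});
```

No error is thrown in either case, which is why it looks as if the pattern simply "doesn't work".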

Is there another straightforward trick for handling these uppercase word characters on the JavaScript side of Atom? I think I could build the ranges manually, but that seems ugly.
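
The manual-ranges idea could look roughly like the sketch below. The range table is a hypothetical, deliberately incomplete example (basic Latin, Latin-1, and the core Greek and Cyrillic capitals), and the names `UPPERCASE_RANGES`, `toEscape`, and `isAllUppercase` are invented for illustration. A real table would need many more entries.

```javascript
// Hypothetical, incomplete table of uppercase ranges.
// Each entry is [first code point, last code point], all inside the BMP.
const UPPERCASE_RANGES = [
  [0x0041, 0x005A], // A-Z
  [0x00C0, 0x00D6], // Latin-1 capitals, first block
  [0x00D8, 0x00DE], // Latin-1 capitals, second block (skips U+00D7, the multiplication sign)
  [0x0391, 0x03A1], // Greek Alpha to Rho
  [0x03A3, 0x03A9], // Greek Sigma to Omega (U+03A2 is unassigned)
  [0x0410, 0x042F], // Cyrillic A to Ya
];

// Format a code point as a \uXXXX escape for use inside a character class.
function toEscape(codePoint) {
  return "\\u" + ("0000" + codePoint.toString(16)).slice(-4);
}

// Build the body of a character class, e.g. "\u0041-\u005a\u00c0-\u00d6...".
const upperClass = UPPERCASE_RANGES
  .map(([first, last]) => toEscape(first) + "-" + toEscape(last))
  .join("");

// Matches strings made up only of characters from the table above.
const allUppercase = new RegExp("^[" + upperClass + "]+$");

function isAllUppercase(text) {
  return allUppercase.test(text);
}

console.log(isAllUppercase("\u00C9COLE")); // "ÉCOLE"
console.log(isAllUppercase("\u00C9cole")); // "École"
```

Generating the class from a table at least keeps the ugliness in one place, and the same `upperClass` string can be reused in larger patterns.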

Edit: For now, I’m using XRegExp to simplify the logic, though it would be nice to avoid adding another dependency.


Yeah, if you’re going to use a preview pane, then you’ll probably need to either use ranges or an external dependency.
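
One more option, with a big caveat: ES2018 added Unicode property escapes to JavaScript itself, enabled by the `u` flag. The sketch below assumes the preview code runs on an engine that implements them, which may not be true of the Atom version discussed in this thread, so check before relying on it. The pattern names are made up for illustration.

```javascript
// Requires an engine with ES2018 Unicode property escapes.
// The `u` flag is what turns \p{...} into a real property class.
const upperLetter = /\p{Lu}/gu;

// A capitalized word: one uppercase letter followed by any letters.
const capitalizedWord = /^\p{Lu}\p{L}*$/u;

const sample = "\u00C4rger \u00DCber \u00D6l"; // "Ärger Über Öl"
const upperCount = (sample.match(upperLetter) || []).length;

const capitalizedAccent = capitalizedWord.test("\u00C9cole"); // "École"
const lowercaseAccent = capitalizedWord.test("\u00E9cole");   // "école"

console.log({ upperCount, capitalizedAccent, lowercaseAccent });
```

Where this is available, it removes the need for both the hand-built ranges and the extra dependency.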