Grammar: Match opcodes within line


#1

I’m trying to improve my 6809 assembly grammar to support highlighting for some older assemblers. Generally, they demand the source code to be formatted like this:

LABEL OPCODE OPERAND COMMENT

Only the opcode is mandatory; label, operand, and comment are optional. Since there are no multiline commands in assembly code (not generally true, but it is in my case), I tried to capture the whole line like this:

    patterns: [
      # directives
      {
        captures:
          1: name: 'support.function.pseudo.asmb'
          2: name: 'keyword.mnemonic.asmb'
          3: name: 'constant.numeric.hex.asmb'
          4: name: 'comment.line.semicolon.asmb'
        match: '^([a-zA-Z0-9][a-zA-Z0-9_]*)?\\s+(.*?)\\s+(.*?)\\s+(.*?)$'
      }
    ]

Now, this has several problems. First, it only handles a missing label: if the opcode takes no operand, the first word of the comment is matched as the operand (the third capture). Is there a way to avoid this? Second, can I add a regex to this capture so that only valid opcodes are highlighted? In other words, is there a way to add extra conditions to a capture?
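To make the first problem concrete, here is a small check of the same expression outside the editor. It uses Python's `re` module rather than the Oniguruma engine the grammar runs on, so treat it as an approximation; the sample line is made up for illustration.

```python
import re

# The match string from the grammar above, with the CSON double backslashes undone.
ORIGINAL = re.compile(r'^([a-zA-Z0-9][a-zA-Z0-9_]*)?\s+(.*?)\s+(.*?)\s+(.*?)$')

# Hypothetical sample line: no label, an opcode without operand, then a comment.
sample = " RTS return to caller"

groups = ORIGINAL.match(sample).groups()
print(groups)  # (None, 'RTS', 'return', 'to caller')
```

The first word of the comment ends up in the third capture, which is the one scoped as the operand.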

I’ve been stuck for a while now on this. I looked at a bunch of other grammars for solutions but none was applicable to my problem. I’m also open to alternative strategies. When I look at the rules defined below I really feel like I’m missing the obvious; these seem almost tailored to be solved with regexes.

Here is the complete set of rules for a single line of code:

LABEL OR SYMBOL FIELD:
This field may contain a symbolic label or name which is assigned the instruction’s address and may be called upon throughout the source program.

  • The label must begin in column one and must be unique. Labels are optional. If the label is to be omitted, the first character of the line must be a space.
  • A label may consist of letters (A-Z or a-z), numbers (0-9), or an underscore (_ or 5F hex).
  • Every label must begin with a letter.
  • Labels may be of any length.
  • The label field must be terminated by a space or a return.

OPCODE FIELD:
This field contains the 6809 opcode (mnemonic) or pseudo-op. It specifies the operation that is to be performed.

  • The opcode is made up of letters (A-Z or a-z) and numbers (0-9). In this field, upper and lower case may be used interchangeably.
  • This field must be terminated by a space if there is an operand or by a space or return if there is no operand.

OPERAND FIELD:
The operand provides any data or address information which may be required by the opcode. This field may or may not be required, depending on the opcode. Operands are generally combinations of register specifications and mathematical expressions which can include constants, symbols, ASCII literals, etc.

  • The operand field can contain no spaces.
  • This field is terminated with a space or return.
  • Any of several types of data may make up the operand: register specifications, numeric constants, symbols, ASCII literals, and the special PC designator.

COMMENT FIELD:
The comment field may be used to insert comments on each line of source. Comments are for the programmer’s convenience only and are ignored by the assembler.

  • The comment field is always optional.
  • This field must be preceded by a space.
  • Comments may contain any characters from SPACE (hex 20) thru DELETE (hex 7F).
  • This field is terminated by a carriage return.
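As a rough sketch of how the rules above might translate into one expression, the following splits a line into its four fields and only accepts known opcodes. It is written for Python's `re`, not for the grammar's regex engine, and the two opcode lists are small illustrative subsets, not the full 6809 instruction set. The names `LINE`, `split_line`, `NO_OPERAND`, and `WITH_OPERAND` are my own. Separating opcodes that take an operand from those that do not is one possible way to keep a comment from being read as an operand.

```python
import re

# Illustrative subsets only; a real grammar would list every mnemonic and pseudo-op.
NO_OPERAND = ["NOP", "RTS", "RTI", "ABX", "MUL"]
WITH_OPERAND = ["LDA", "LDB", "STA", "STB", "LDX", "JSR", "BRA"]

LINE = re.compile(
    r"""
    ^(?P<label>[A-Za-z][A-Za-z0-9_]*)?     # optional label: column one, starts with a letter
    [ \t]+                                 # a line without a label starts with a space
    (?:
        (?P<op0>{no_operand})              # opcode that takes no operand
      |
        (?P<op1>{with_operand})            # opcode that requires an operand
        [ \t]+(?P<operand>\S+)             # operand: contains no spaces
    )
    (?:[ \t]+(?P<comment>\S.*))?           # optional comment, preceded by a space
    [ \t]*$
    """.format(no_operand="|".join(NO_OPERAND), with_operand="|".join(WITH_OPERAND)),
    re.VERBOSE | re.IGNORECASE,
)


def split_line(line):
    """Return the four fields as a dict, or None if the line does not fit the rules."""
    m = LINE.match(line)
    if m is None:
        return None
    return {
        "label": m.group("label"),
        "opcode": m.group("op0") or m.group("op1"),
        "operand": m.group("operand"),
        "comment": m.group("comment"),
    }


print(split_line("LOOP LDA #$10 load accumulator"))
print(split_line(" RTS return to caller"))
```

With this split, the comment after `RTS` lands in the comment field and its operand stays empty, while a line with an unknown opcode is rejected outright.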

Ideally, I’d like to have something along the lines of

    patterns: [
      # directives
      {
        captures:
          1: name: 'support.function.pseudo.asmb'
          2: name: 'keyword.mnemonic.asmb'
          3: name: 'constant.numeric.hex.asmb'
          4: name: 'comment.line.semicolon.asmb'
        match: '^([a-zA-Z0-9][a-zA-Z0-9_]*)?\\s+(.*?)\\s+(.*?)\\s+(.*?)$'
      }
      patterns: [
         # rule for label field
         { 
            # label rules
         }
         # rule for opcode field
         { 
            # opcode rules
         }
         # rule for operand field
         { 
            # operand rules
         }
         # rule for comment field
         { 
            # comment rules
         }
    ]

Any ideas and hints are greatly appreciated.


#2

I only skimmed your post, but do you mean like this?


Grammar: Force hard tabs in settings?
#3

I even starred that excellent gist of yours a while back… seems like this is another case of RTFM. But this looks promising, thanks a lot!