Regular Expression For Assembly Instructions

I'm currently writing my own parser for a fictional Assembly language. The instructions are very similar to any normal assembly instruction:

[INSTRUCTION] [OP]*

where op can be 0-3 operands. I want to be able to use an expression that matches this. This is being written in C++ with boost::regex. I myself am a regexp noobie, trying to understand the boost documentation of what each symbol does.

Now, I already have an expression that can match 0-3 operands like so:

Sample Instructions:
    MOVI 8 10
    ADDI 8 8 10
    NOP
    BNEZI -1

Expression: ^([a-z]+)( ([-,0-9]+))*

However, I can't create a suitable expression that handles the same instructions when comma-delimited:

Sample Instructions:
    MOVI 8, 10
    ADDI 8, 8, 10

This is really tripping me up. I tried rewriting my expression like so:

^([a-z]+)( ([-,0-9]+))*(, ([-,0-9]+))*

This looks to be extremely green, poor regexp. It also isn't working correctly. I was thinking of using a recursive expression, but I looked at the documentation and I might as well scribble "overkill" on my forehead.

I realize I could just format the line to take out all the commas, but I would rather like to be able to write and understand a regexp expression first, then do it the easy way. Any help would be appreciated.


A string like:

ADDI 8, 8, 10

can be matched by a regex like this:

[a-zA-Z]+[ \t]+-?[0-9]+([ \t]*,[ \t]*-?[0-9]+)*

A (short) explanation:

[a-zA-Z]+   # match an instruction
[ \t]+      # match one or more spaces or tabs
-?[0-9]+    # match an integer value with an optional minus sign in front of it
(           # open group 1
  [ \t]*    #   match zero or more spaces or tabs
  ,         #   match a comma
  [ \t]*    #   match zero or more spaces or tabs
  -?[0-9]+  #   match an integer value with an optional minus sign in front of it
)*          # close group 1, and repeat it zero or more times
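
As a rough illustration of how the pattern above could be used from C++, here is a minimal sketch. It assumes std::regex in place of boost::regex (the basic syntax used here is the same in both), and the names Instruction and parse_comma_form are made up for this example. Two capture groups are added so the mnemonic and the operand list can be pulled out.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical container for one parsed line (not part of the original post).
struct Instruction {
    std::string mnemonic;
    std::vector<std::string> operands;
};

// Returns true and fills `out` when the whole line has the form
// "NAME int, int, ..." described by the regex above.
bool parse_comma_form(const std::string& line, Instruction& out) {
    // Same pattern as above; group 1 is the mnemonic, group 2 the operand list.
    static const std::regex whole(
        R"(([a-zA-Z]+)[ \t]+(-?[0-9]+(?:[ \t]*,[ \t]*-?[0-9]+)*))");
    std::smatch m;
    if (!std::regex_match(line, m, whole)) return false;

    out.mnemonic = m[1].str();
    out.operands.clear();

    // A repeated group only remembers its last repetition, so the operand
    // list is scanned a second time to collect every integer.
    static const std::regex integer(R"(-?[0-9]+)");
    const std::string list = m[2].str();
    for (std::sregex_iterator it(list.begin(), list.end(), integer), end;
         it != end; ++it) {
        out.operands.push_back(it->str());
    }

    // The pattern requires at least one operand, so the list cannot be empty.
    assert(!out.operands.empty());
    return true;
}
```

With this sketch, "ADDI 8, 8, 10" yields the mnemonic ADDI and the operands 8, 8 and 10. Because the pattern requires at least one operand and a comma between operands, "NOP" and "MOVI 8 10" are rejected; std::regex_match also rejects lines with leading or trailing whitespace.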

Having said all that, I must agree with dmckee's comment: a proper parser is the way to go, even if this is just a fictional language you're parsing.


Try:

^([a-z]+)( ([-,0-9]+)((,|\s)[-,0-9]+)*)*


There is a bison grammar as part of the HLA assembler source, which you can get here:

http://webster.cs.ucr.edu/AsmTools/HLA/frozen.html

I suggest using a proper grammar :)
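
A full bison grammar is too long to show here. As a much smaller stand-in for the same idea (parsing instead of pattern matching), here is a minimal hand-written sketch in C++ for the instruction format from the question. It is not taken from the HLA source; the names ParsedLine and parse_line are hypothetical, and it assumes operands are plain decimal integers that fit in an int and that the line has no trailing newline.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct ParsedLine {
    std::string mnemonic;
    std::vector<int> operands;
};

// Parses "NAME", "NAME a", "NAME a, b" or "NAME a b c" (commas optional).
// Returns false on a syntax error or when more than three operands appear.
bool parse_line(const std::string& line, ParsedLine& result) {
    std::size_t i = 0;
    const std::size_t n = line.size();
    auto is_blank = [](char c) { return c == ' ' || c == '\t'; };
    auto skip_blanks = [&] { while (i < n && is_blank(line[i])) ++i; };

    result.mnemonic.clear();
    result.operands.clear();

    // Mnemonic: one or more letters, followed by a blank or end of line.
    skip_blanks();
    while (i < n && std::isalpha(static_cast<unsigned char>(line[i]))) {
        result.mnemonic += line[i++];
    }
    if (result.mnemonic.empty()) return false;
    if (i < n && !is_blank(line[i])) return false;
    skip_blanks();

    while (i < n) {
        // Every operand after the first may be preceded by one comma.
        if (!result.operands.empty() && line[i] == ',') {
            ++i;
            skip_blanks();
        }
        // Optional minus sign, then at least one digit.
        const bool negative = (i < n && line[i] == '-');
        if (negative) ++i;
        if (i >= n || !std::isdigit(static_cast<unsigned char>(line[i]))) return false;
        int value = 0;
        while (i < n && std::isdigit(static_cast<unsigned char>(line[i]))) {
            value = value * 10 + (line[i++] - '0');
        }
        // An operand must be followed by a blank, a comma, or end of line.
        if (i < n && !is_blank(line[i]) && line[i] != ',') return false;
        result.operands.push_back(negative ? -value : value);
        if (result.operands.size() > 3) return false;
        skip_blanks();
    }

    assert(result.operands.size() <= 3);
    return true;
}
```

Unlike a single regex, each rejection point here is an explicit return, so it could be turned into a specific error message ("missing operand after comma", "too many operands") without restructuring the code.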


If you're serious about doing this, you have to account for every single line, all boundary conditions, errors, etc. It's not enough to match just one certain form while knowing nothing about the other forms.

If using regex, a cleaner way is to reserve a fixed number of buffers that can be analyzed. This is fairly complex for a rudimentary first level syntax checker.

This is a start:

/
  ^
     (?!\s*$)
     \s* 
     (?|
         ([a-zA-Z]+)
         \s*
         ((?<=\s)
          -?\d+|) (?!,\s*$) (?:,\s*|\s*$)  
         (-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
         (-?\d+|) \s*$
         ()
       |
         ()()()()(.+)
     )
  $
/x;

A Perl test case, using newline-delimited lines:

use strict;
use warnings;

my $rx = qr/
  ^
     (?!\s*$)
     \s* 
     (?|
         ([a-zA-Z]+)
         \s*
         ((?<=\s)
          -?\d+|) (?!,\s*$) (?:,\s*|\s*$)  
         (-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
         (-?\d+|) \s*$
         ()
       |
         ()()()()(.+)
     )
  $/x;

my $cnt = 0;  # line counter

while ( my $line = <DATA> )
{
    ++$cnt;

    if ( $line =~ /$rx/ )
    {
        if (length $5) {
           print "\nSyntax error ? (line $cnt)   '$5'\n";
        }
        else {
           print "\nInstruction:  '$1'\n";
           print "     op1 = '$2'\n";
           print "     op2 = '$3'\n";
           print "     op3 = '$4'\n";
        }
    }
}

__DATA__

    MOVI 8 10
    ADDI 8 8 10

    NOP
    BNEZI -1
    InstA  0
    InstB  1,
    InstC  2,3, 4
    InstD  5,6, 7, 8

    MOVI 7, 8
    ADDI 9, 10, 11
    ADDI 12, 13 14

Output:

Syntax error ? (line 2)   'MOVI 8 10'

Syntax error ? (line 3)   'ADDI 8 8 10'

Instruction:  'NOP'
     op1 = ''
     op2 = ''
     op3 = ''

Instruction:  'BNEZI'
     op1 = '-1'
     op2 = ''
     op3 = ''

Instruction:  'InstA'
     op1 = '0'
     op2 = ''
     op3 = ''

Syntax error ? (line 8)   'InstB  1,'

Instruction:  'InstC'
     op1 = '2'
     op2 = '3'
     op3 = '4'

Syntax error ? (line 10)   'InstD  5,6, 7, 8'

Instruction:  'MOVI'
     op1 = '7'
     op2 = '8'
     op3 = ''

Instruction:  'ADDI'
     op1 = '9'
     op2 = '10'
     op3 = '11'

Syntax error ? (line 14)   'ADDI 12, 13 14'

Alternatively, if all the data is in one string, the same regex can be applied with the /m and /g modifiers:

while ( $str =~ /$rx/mg ) { }
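
Since the question is about C++, here is a rough sketch of the same "fixed number of buffers" idea using std::regex (assumed here in place of boost::regex). It is a simplification, not a port of the Perl regex: the ECMAScript grammar used by std::regex has no lookbehind and no branch reset, so the optional operands are nested instead, and a malformed line simply makes the function return false rather than being captured in a fifth group. The name match_fixed_operands is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <regex>
#include <string>

// Returns true when `line` is an instruction with zero to three
// comma-separated integer operands. Unused operand slots are left empty.
bool match_fixed_operands(const std::string& line,
                          std::string& name,
                          std::array<std::string, 3>& ops) {
    // Group 1: mnemonic. Groups 2-4: operands, each nested inside the
    // previous one so that operand N can only appear after operand N-1.
    static const std::regex rx(
        R"(^\s*([A-Za-z]+)(?:\s+(-?\d+)(?:\s*,\s*(-?\d+)(?:\s*,\s*(-?\d+))?)?)?\s*$)");
    std::smatch m;
    if (!std::regex_match(line, m, rx)) return false;

    // One entry for the whole match plus four capture groups.
    assert(m.size() == 5);

    name = m[1].str();
    for (std::size_t k = 0; k < ops.size(); ++k) {
        ops[k] = m[k + 2].str();  // empty when the group did not take part
    }
    return true;
}
```

On the test lines above this sketch should accept and reject the same inputs as the Perl version: "InstC  2,3, 4" fills all three slots, "NOP" leaves them empty, and "InstB  1,", "InstD  5,6, 7, 8" and "ADDI 12, 13 14" are rejected.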


You can try this:

^[\t ]*(?:([.A-Za-z0-9_]+)[:])?(?:[\t ]*([A-Za-z]{2,4})(?:[\t ]+(\[([A-Za-z0-9_]+(([-+])[0-9]+)?)\]|\".+?\"|\'.+?\'|[.A-Za-z0-9_]+)(?:[\t ]*[,][\t ]*(\[([A-Za-z0-9_]+(([-+])[0-9]+)?)\]|\".+?\"|\'.+?\'|[.A-Za-z0-9_]+))?)?)?

It can match a label, an opcode, and parameters.
