Regex (or other suggestion) to Parse a Hashtag

I'm trying to parse a string I get from a hashtag, and so far it looks like a regex might be the cleanest way to go. The pattern, with an explanation, is as follows:

#p3                  -> p = 3
#h4                  ->         h = [ 4 ]
#h4,h6               ->         h = [ 4, 6 ]
#p3,h4,h6            -> p = 3,  h = [ 4, 6 ]
#h4,h6,p3            -> p = 3,  h = [ 4, 6 ]
#h4s2,6,10           ->         h = [ 4 ],    s = { "4": [ 2, 6, 10 ] }
#h4s2,6,10,h6s5      ->         h = [ 4, 6 ], s = { "4": [ 2, 6, 10 ] , "6": [ 5 ] }
#p20h4s2,6,10,h6s5,1 -> p = 20, h = [ 4, 6 ], s = { "4": [ 2, 6, 10 ] , "6": [ 5, 1 ] }

As I said, I think a regex might be my best bet, but it's also a weak spot of mine for anything complex.

If you have other suggestions/solutions then I welcome them. I could do this using if/else with a lot of indexOf and splits and so on... but I'm certain there must be a better way than that.

Update: The output on the right of each arrow is best thought of as an explanation. While this is the desired end result, a regex solution does not have to solve the full problem; it only needs to get me part of the way. Also, to reiterate, I'm working in JavaScript.


You can probably use a single expression to determine if a particular line is valid or not, but if you're trying to build a structure based on the input strings then you should probably go for a two-step tokenization/parsing scheme since that will simplify things.

By the looks of it you have three types of tokens: p, h and s. Each token is a letter, followed by a number (followed by more numbers in the case of s).

So I'd start with a tokenizer, designed to convert the string into a sequence of abstract tokens. Each token could be matched using a regular expression.

Let's take this string: #p20h4s2,6,10,h6s5,1. While there is still input left, you're going to be creating a sequence of tokens based on the remaining input.

Your first token is a p, with value 20. Then you have an h with value 4. Then an s with value [2,6,10], and so on. To determine which token is which, use a very simple regular expression. p's expression could be p\d+. h's could be h\d+. s's looks like it would be s(\d+)(,\d+)*.

The result of your tokenization step is a sequence of objects like this: { p(20), h(4), s(2,6,10), h(6), s(5,1) }. At this point you can decide that s(2,6,10) is part of h(4) and build your structure without worrying about the string representation of the structure.

Now, as for actually implementing this in JavaScript, it wouldn't be too hard. The token sequence could be an array, and you can find tokens using if/else blocks and the regular expressions above.
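A minimal sketch of such a tokenizer might look like the following. The function name tokenize and the { type, value } token shape are made up for illustration, and the sketch assumes that at most one comma separates two tokens:

```javascript
// Hypothetical helper: turns "#p20h4s2,6,10,h6s5,1" into an array of tokens.
function tokenize(hashtag) {
    var input = hashtag.replace(/^#/, ''); // drop the leading '#'
    var tokens = [];
    var m;
    while (input.length > 0) {
        if ((m = /^p(\d+)/.exec(input)) !== null) {
            tokens.push({ type: 'p', value: parseInt(m[1], 10) });
        } else if ((m = /^h(\d+)/.exec(input)) !== null) {
            tokens.push({ type: 'h', value: parseInt(m[1], 10) });
        } else if ((m = /^s(\d+(?:,\d+)*)/.exec(input)) !== null) {
            tokens.push({ type: 's', value: m[1].split(',').map(Number) });
        } else {
            // Nothing matched at the current position.
            throw new Error('Cannot tokenize: ' + input);
        }
        // Drop the matched text plus one optional separating comma.
        input = input.slice(m[0].length).replace(/^,/, '');
    }
    return tokens;
}

var tokens = tokenize('#p20h4s2,6,10,h6s5,1');
```

Each regular expression is anchored with ^ so that it only matches at the start of the remaining input, which is what lets the loop notice unexpected characters instead of silently skipping them.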

The important thing is separating the part where you're using the string representation (tokenization) and the part where you're using an abstract representation (parsing). It makes things conceptually a lot simpler. It's also easier to add new types of tokens this way if you need them later.
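To illustrate the parsing half on its own, here is a rough sketch that works only on an abstract token sequence and never looks at the original string. The name parseTokens and the { type, value } token shape are assumptions made for this example, not anything defined in the question:

```javascript
// Hypothetical helper: builds { p, h, s } from an already-tokenized sequence.
function parseTokens(tokens) {
    var result = { p: undefined, h: [], s: {} };
    tokens.forEach(function (token) {
        if (token.type === 'p') {
            result.p = token.value;
        } else if (token.type === 'h') {
            result.h.push(token.value);
        } else if (token.type === 's') {
            if (result.h.length === 0) {
                throw new Error('s token without a preceding h token');
            }
            // An s token belongs to the most recent h token.
            result.s[result.h[result.h.length - 1]] = token.value;
        } else {
            throw new Error('Unknown token type: ' + token.type);
        }
    });
    return result;
}

// Token sequence for "#p20h4s2,6,10,h6s5,1", written out by hand.
var parsed = parseTokens([
    { type: 'p', value: 20 },
    { type: 'h', value: 4 },
    { type: 's', value: [2, 6, 10] },
    { type: 'h', value: 6 },
    { type: 's', value: [5, 1] }
]);
// parsed.p is 20, parsed.h is [4, 6],
// parsed.s is { "4": [2, 6, 10], "6": [5, 1] }
```

Adding a new token type later would mean one more branch here and one more pattern in the tokenizer, without the two steps needing to know about each other.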


Here is some code to do that.

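var p, h = [], s = {};

// myhashtag is assumed to hold the hashtag string, e.g. "#p20h4s2,6,10,h6s5,1".
// Matches p<number> or h<number>, or s followed by a comma-separated list.
var re = /[ph][0-9]+|s[0-9,]*[0-9]/g;
var a;
while ((a = re.exec(myhashtag)) !== null) {
    var first = a[0].substring(0, 1);  // the letter: p, h or s
    var rest = a[0].substring(1);      // everything after the letter
    if (first == 'p')
        p = parseInt(rest, 10);
    else if (first == 'h')
        h.push(parseInt(rest, 10));
    else {
        // An s list belongs to the most recently seen h.
        var list = rest.split(',');
        for (var i = 0; i < list.length; i++)
            list[i] = parseInt(list[i], 10);
        s[h[h.length - 1]] = list;
    }
}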

This uses the regular expression only to find all pieces of the hashtag that look like p3 or h4 or s3,4,5. The rest is just regular JavaScript.

This code is very lax. If the hashtag contains garbage in it that can't be parsed, like #p3_banana_*q4, this code just ignores it. It would probably be better to write stricter code that throws an error if the hashtag contains such nonsense.
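One possible way to make it stricter, sketched under the assumption that at most a single comma may sit between two pieces, is to track where each match ends and reject any other skipped text. The function name parseHashtagStrict and the returned object shape are invented for this example:

```javascript
// Hypothetical strict variant: nothing in the hashtag may be skipped.
function parseHashtagStrict(hashtag) {
    if (hashtag.charAt(0) !== '#') {
        throw new Error('Hashtag must start with "#"');
    }
    var result = { p: undefined, h: [], s: {} };
    var re = /([ph])([0-9]+)|s([0-9]+(?:,[0-9]+)*)/g;
    var pos = 1; // index just past the leading '#'
    var m;
    while ((m = re.exec(hashtag)) !== null) {
        // Text skipped since the previous piece; only a single comma
        // between two pieces is tolerated.
        var gap = hashtag.substring(pos, m.index);
        if (gap !== '' && !(gap === ',' && pos > 1)) {
            throw new Error('Unexpected "' + gap + '" at index ' + pos);
        }
        if (m[1] === 'p') {
            result.p = parseInt(m[2], 10);
        } else if (m[1] === 'h') {
            result.h.push(parseInt(m[2], 10));
        } else {
            if (result.h.length === 0) {
                throw new Error('"s" at index ' + m.index + ' has no preceding "h"');
            }
            result.s[result.h[result.h.length - 1]] = m[3].split(',').map(Number);
        }
        pos = re.lastIndex;
    }
    // Anything left after the last piece is also garbage.
    if (pos !== hashtag.length) {
        throw new Error('Unexpected "' + hashtag.substring(pos) + '" at index ' + pos);
    }
    return result;
}
```

With this version, parseHashtagStrict('#p3_banana_*q4') throws instead of quietly returning p = 3.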


Regular expressions are for determining whether a given pattern is present, and possibly acting on it (replacing, removing, etc). You want to do more than that; you want to determine whether a pattern is present, and then do some sort of parsing based on that. If it were me, I would execute it as follows (pseudocode below):

if(string_begins_with('#')) {
    if(string_contains('p')) {
        // get numbers following using some "CSV-to-array" function
    }

    if(string_contains('h')) {
        foreach('h') {
            // check for 's' following, do csv-to-array thing
        }
    }
}

EDIT: if you really want to go this route, you're going to have to use lookaheads. Assuming the p is fixed at the beginning:

/
^\#
(p[\d,]+)?   # find the 'p'
(            # beginning of 'find the "h"' code
    h([0-9,]+)   # find the 'h'
    (?=          # beginning of lookahead for 's'
        (s[0-9,]+)?      # code for s, including a final ? since it may not 
                         # be there at all. I'm not sure if this part will work.
    )           # end of lookahead
)+            # end of code for 'h', + since you may have more than one 'h'
/

This probably needs a bit of work, but it's a start.
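For comparison, here is one way a validation-only expression could be written in JavaScript. This is a sketch, not a drop-in version of the pattern above: JavaScript regular expressions have no free-spacing mode with # comments, so the pattern is assembled from named pieces, and it swaps the lookahead for a plain optional group. It also assumes that p, if present, comes first, and that a comma between pieces is optional. The names P, H, hashtagRe and isValidHashtag are made up for the example:

```javascript
// Building blocks, written as regex literals so no double escaping is needed.
var P = /p\d+/.source;                       // p followed by a number
var H = /h\d+(?:s\d+(?:,\d+)*)?/.source;     // h<number>, optional s list

// '#', then a p or an h group, then any number of further h groups,
// each optionally preceded by a comma.
var hashtagRe = new RegExp('^#(?:' + P + '|' + H + ')(?:,?' + H + ')*$');

// Hypothetical helper: true if the whole string fits the assumed grammar.
function isValidHashtag(hashtag) {
    return hashtagRe.test(hashtag);
}
```

Under these assumptions, strings such as #p3, #h4,h6 and #p20h4s2,6,10,h6s5,1 pass, while #h4,h6,p3 is rejected because its p comes last. This only answers whether the string is well formed; extracting the values still needs a separate step.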
