Regex (or other suggestion) to Parse a Hashtag

2023-01-26 02:43

I'm trying to parse a string I get from a hashtag, and so far it looks like a regex might be the cleanest way to go. The pattern, with an explanation, is as follows:

#p3                  -> p = 3
#h4                  ->         h = [ 4 ]
#h4,h6               ->         h = [ 4, 6 ]
#p3,h4,h6            -> p = 3,  h = [ 4, 6 ]
#h4,h6,p3            -> p = 3,  h = [ 4, 6 ]
#h4s2,6,10           ->         h = [ 4 ],    s = { "4": [ 2, 6, 10 ] }
#h4s2,6,10,h6s5      ->         h = [ 4, 6 ], s = { "4": [ 2, 6, 10 ] , "6": [ 5 ] }
#p20h4s2,6,10,h6s5,1 -> p = 20, h = [ 4, 6 ], s = { "4": [ 2, 6, 10 ] , "6": [ 5, 1 ] }

As I said, I think a regex might be my best bet, but regexes are also a weak spot of mine for anything complex.

If you have other suggestions/solutions then I welcome them. I could do this using if/else with a lot of indexOf and splits and so on... but I'm certain there must be a better way than that.

Update: The output on the right is best thought of as an explanation. While this is the desired end result, a regex solution does not have to solve the full problem; it only needs to get me part of the way. Also, to reiterate, I'm working in JavaScript.

You can probably use a single expression to determine if a particular line is valid or not, but if you're trying to build a structure based on the input strings then you should probably go for a two-step tokenization/parsing scheme since that will simplify things.

By the looks of it you have three types of tokens: p, h and s. Each token is a letter, followed by a number (followed by more numbers in the case of s).

So I'd start with a tokenizer, designed to convert the string into a sequence of abstract tokens. Each token could be matched using a regular expression.

Let's take this string: #p20h4s2,6,10,h6s5,1. While there is still input left, you're going to be creating a sequence of tokens based on the remaining input.

Your first token is a p, with value 20. Then you have an h with value 4. Then an s with value [2,6,10], and so on. To determine which token is which, use a very simple regular expression. p's expression could be p\d+. h's could be h\d+. s's looks like it would be s(\d+)(,\d+)*.

The result of your tokenization step is a sequence of objects like this: { p(20), h(4), s(2,6,10), h(6), s(5,1) }. At this point you can decide that s(2,6,10) is part of h(4) and build your structure without worrying about the string representation of the structure.

Now, as for actually implementing this in JavaScript, it wouldn't be too hard. The token sequence could be an array, and you can find tokens using if/else blocks and the regular expressions above.

The important thing is separating the part where you're using the string representation (tokenization) and the part where you're using an abstract representation (parsing). It makes things conceptually a lot simpler. It's also easier to add new types of tokens this way if you need them later.
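The two steps described above can be sketched roughly as follows. This is only one possible reading of the approach, not a definitive implementation: the function names tokenize and parse and the { type, value } token shape are made up for illustration, and the sketch silently skips anything that does not look like a token.

```javascript
// Step 1: tokenization. Turn the string into a sequence of abstract tokens.
function tokenize(hashtag) {
    var tokens = [];
    // p<number> or h<number>, or s<number>,<number>,...
    var re = /([ph])(\d+)|s(\d+(?:,\d+)*)/g;
    var m;
    while ((m = re.exec(hashtag)) !== null) {
        if (m[1]) {
            // m[1] is 'p' or 'h', m[2] is its number
            tokens.push({ type: m[1], value: parseInt(m[2], 10) });
        } else {
            // m[3] is the comma-separated list that follows an 's'
            tokens.push({ type: 's', value: m[3].split(',').map(Number) });
        }
    }
    return tokens;
}

// Step 2: parsing. Build the structure from tokens only; the string is no longer needed.
function parse(tokens) {
    var result = { p: undefined, h: [], s: {} };
    var lastH; // the h that a following s belongs to
    tokens.forEach(function (token) {
        if (token.type === 'p') {
            result.p = token.value;
        } else if (token.type === 'h') {
            result.h.push(token.value);
            lastH = token.value;
        } else if (lastH !== undefined) {
            result.s[lastH] = token.value;
        }
    });
    return result;
}

var parsed = parse(tokenize('#p20h4s2,6,10,h6s5,1'));
// parsed.p is 20, parsed.h is [4, 6], parsed.s is { "4": [2, 6, 10], "6": [5, 1] }
console.log(parsed);
```

Adding a new token type later would mean one more alternative in the tokenizer's expression and one more branch in parse, without touching the rest.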

Here is some code to do that.

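// myhashtag is assumed to hold the input, e.g. "#p20h4s2,6,10,h6s5,1"
var p, h = [], s = {};

// Matches p<number>, h<number>, or s<number>,<number>,...
var re = /[ph][0-9]+|s[0-9]+(?:,[0-9]+)*/g;
var match;
while ((match = re.exec(myhashtag)) !== null) {
    var first = match[0].charAt(0);
    var rest = match[0].substring(1);
    if (first === 'p') {
        p = parseInt(rest, 10);
    } else if (first === 'h') {
        h.push(parseInt(rest, 10));
    } else if (h.length > 0) {
        // An s list belongs to the most recent h
        var values = rest.split(',');
        for (var i = 0; i < values.length; i++)
            values[i] = parseInt(values[i], 10);
        s[h[h.length - 1]] = values;
    }
}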

This uses the regular expression only to find all pieces of the hashtag that look like p3 or h4 or s3,4,5. The rest is just regular JavaScript.

This code is very lax. If the hashtag contains garbage in it that can't be parsed, like #p3_banana_*q4, this code just ignores it. It would probably be better to write stricter code that throws an error if the hashtag contains such nonsense.
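A stricter variant might look like the sketch below. It is an illustration under assumptions rather than a finished parser: the name parseHashtagStrict is hypothetical, it assumes an optional comma is allowed before every item (as the examples in the question suggest), and it relies on the sticky y flag from ES2015 so that each match has to start exactly where the previous one ended.

```javascript
function parseHashtagStrict(hashtag) {
    if (hashtag.charAt(0) !== '#') {
        throw new SyntaxError('Hashtag must start with "#": ' + hashtag);
    }
    var result = { p: undefined, h: [], s: {} };
    // Sticky (y) flag: the match must begin exactly at lastIndex, so nothing is skipped.
    var token = /,?(?:([ph])(\d+)|s(\d+(?:,\d+)*))/y;
    var pos = 1; // skip the leading '#'
    while (pos < hashtag.length) {
        token.lastIndex = pos;
        var m = token.exec(hashtag);
        if (m === null) {
            throw new SyntaxError('Unexpected input at position ' + pos +
                ': "' + hashtag.substring(pos) + '"');
        }
        if (m[1] === 'p') {
            result.p = parseInt(m[2], 10);
        } else if (m[1] === 'h') {
            result.h.push(parseInt(m[2], 10));
        } else {
            if (result.h.length === 0) {
                throw new SyntaxError('"s" list without a preceding "h" at position ' + pos);
            }
            // The s list belongs to the most recent h
            result.s[result.h[result.h.length - 1]] = m[3].split(',').map(Number);
        }
        pos = token.lastIndex; // continue right after the matched item
    }
    return result;
}
```

With this version, parseHashtagStrict('#p3_banana_*q4') throws a SyntaxError at the underscore instead of quietly returning p = 3.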

Regular expressions are for determining whether a given pattern is present, and possibly acting on it (replacing, removing, etc). You want to do more than that; you want to determine whether a pattern is present, and then do some sort of parsing based on that. If it were me, I would approach it as follows (pseudocode below):

if (string_begins_with('#')) {
    if (string_contains('p')) {
        // read the number that follows the 'p'
    }

    if (string_contains('h')) {
        foreach ('h') {
            // read the number that follows this 'h'
            // check for an 's' following, run its list through some "CSV-to-array" function
        }
    }
}

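EDIT: if you really want to go the single-regex route, note that a lookahead will not help here: a lookahead consumes no input, so the s part would never be matched by the repeated group. An optional group after each h does the job instead. Assuming the p is fixed at the beginning (written with whitespace and comments for readability; JavaScript regexes have no free-spacing mode, so both must be stripped before use):

/
^\#
(p\d+)?              # find the optional 'p'
(                    # beginning of 'find the "h"' code
    ,?h\d+           # find the 'h', with an optional separating comma
    (s\d+(,\d+)*)?   # code for s, including a final ? since it may not
                     # be there at all
)*                   # end of code for 'h', * since you may have any number of 'h' groups
$
/

This only validates the string. A repeated capture group keeps just its last match, so the individual h and s values still have to be extracted in a second pass.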
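JavaScript regular expressions have no free-spacing mode, so a commented pattern like the one above cannot be used as written. One way to keep it readable is to assemble the expression from named string pieces. The following is a minimal sketch rather than a definitive grammar: the names ITEM, VALID_HASHTAG and isValidHashtag are made up for illustration, and it assumes that items may appear in any order with an optional comma between them, which is what the examples in the question suggest.

```javascript
// One item: p<number>, or h<number> optionally followed by s<number>,<number>,...
var ITEM = '(?:p\\d+|h\\d+(?:s\\d+(?:,\\d+)*)?)';

// A '#', one item, then any number of further items, each optionally preceded by a comma.
var VALID_HASHTAG = new RegExp('^#' + ITEM + '(?:,?' + ITEM + ')*$');

// Returns true if the whole string is a well-formed hashtag; it extracts nothing.
function isValidHashtag(hashtag) {
    return VALID_HASHTAG.test(hashtag);
}

console.log(isValidHashtag('#p20h4s2,6,10,h6s5,1')); // true
console.log(isValidHashtag('#p3_banana_*q4'));       // false
```

A check like this pairs naturally with a separate extraction step: validate first, then pull out the values knowing the input is well formed.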
