Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.

Some examples:

0001          - No prefix, Number = 0001, No suffix
1-0001        - Prefix = 1-, Number = 0001, No suffix
AAA001        - Prefix = AAA, Number = 001, No suffix
AAA 001.01    - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01    - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01

The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.

I've tried a variety of RegExes that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?

(The RegEx should be .NET compatible)

UPDATE: For those that are interested, here's the C# code I came up with:

var regex = new System.Text.RegularExpressions.Regex(@"(\d+)");
System.Text.RegularExpressions.Match longest = null;
// Keep the longest run of digits; on a tie the later run wins (>=),
// which is what "123AAA 001_01" -> 001 requires.
foreach (System.Text.RegularExpressions.Match match in regex.Matches(m_Key)) {
     if (longest == null || match.Length >= longest.Length) {
         longest = match;
     }
}
if (longest != null) {
     m_KeyCounter = longest.Value;
     // Slice by position instead of Split(), so a missing prefix
     // or a repeated number cannot shift the parts.
     m_KeyPrefix = m_Key.Substring(0, longest.Index);
     m_KeySuffix = m_Key.Substring(longest.Index + longest.Length);
}


You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input, but after that you'll need further processing (parsing).

So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.

To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).


Your input isn't regular, so a regex alone won't do. I would iterate over all the groups of digits via (\d+), find the longest, and then build a new regex of the form (.*)<number>(.*) to find your prefix and suffix.
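The two steps described above (find the longest digit group, then build a pattern of the form (.*)<number>(.*)) might look roughly like this in C#. This is only a sketch: the class and method names are made up for illustration, and it assumes that a later group wins when two groups have the same length, as the "123AAA 001_01" example suggests.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixSuffixByRegex
{
    // Hypothetical helper: returns { prefix, number, suffix },
    // or null when the input contains no digits.
    public static string[] Split(string input)
    {
        string number = null;
        foreach (Match m in Regex.Matches(input, @"\d+"))
        {
            // ">=" lets a later run win a tie (assumption, see "123AAA 001_01").
            if (number == null || m.Length >= number.Length)
                number = m.Value;
        }
        if (number == null) return null;

        // Escaping is not strictly needed for digits, but it is a safe habit
        // when a pattern is built from data. The greedy first group makes the
        // last occurrence of the number the split point.
        string pattern = "^(.*)" + Regex.Escape(number) + "(.*)$";
        Match parts = Regex.Match(input, pattern, RegexOptions.Singleline);
        return new[] { parts.Groups[1].Value, number, parts.Groups[2].Value };
    }
}
```

For example, PrefixSuffixByRegex.Split("1_00001-01") should give the three parts 1_, 00001 and -01.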

Or, if you're comfortable with string operations, you can probably just find the start and end of the target group and use Substring to get the prefix and suffix.
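A regex-free version of the same idea could be sketched as follows. The names are hypothetical, only the characters '0' to '9' are treated as digits, and a later run wins a tie; all of these are assumptions rather than anything stated in the question.

```csharp
using System;

public static class PrefixSuffixByScan
{
    // Hypothetical helper: returns { prefix, number, suffix }.
    // With no digits in the input, prefix and number are "" and the
    // whole input ends up in the suffix.
    public static string[] Split(string input)
    {
        int bestStart = 0, bestLength = 0;
        int i = 0;
        while (i < input.Length)
        {
            if (!IsAsciiDigit(input[i])) { i++; continue; }

            // Walk to the end of the current run of digits.
            int start = i;
            while (i < input.Length && IsAsciiDigit(input[i])) i++;

            // ">=" lets a later run win a tie (assumption, see "123AAA 001_01").
            if (i - start >= bestLength)
            {
                bestStart = start;
                bestLength = i - start;
            }
        }
        return new[]
        {
            input.Substring(0, bestStart),
            input.Substring(bestStart, bestLength),
            input.Substring(bestStart + bestLength)
        };
    }

    // Assumption: only '0'..'9' count; char.IsDigit would also accept other Unicode digits.
    private static bool IsAsciiDigit(char c) { return c >= '0' && c <= '9'; }
}
```

Because the parts are cut by index, an empty prefix or suffix simply comes back as an empty string.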


I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().


This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.

OK, let's talk about your question:

Keep in mind that quantifiers in almost every regexp engine, NFA or DFA, are greedy by default: (\d+) consumes as many digits as it can from the position where it starts matching.

Now, what I gather from your examples is that you always need the number in the middle portion, so try this:

/^(.*\D)?(\d+)(\D.*)?$/ig

Then look at the variables $1, $2 and $3. Groups are numbered by position, so $2 always holds the number; $1 holds the prefix and $3 the suffix, and either one is empty or unset when that part is missing.

The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.

Since the /g modifier is present, you can loop through all the matches that the engine finds and then simply take the one you like most.

This example is in PCRE, but I'm sure .NET has a compatible mode.
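To see how this pattern behaves under .NET, a small check along these lines may help. It is a sketch only: the class and method names are invented, the Perl-style delimiters and flags are dropped, and RegexOptions.Singleline is my own addition.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MiddleNumberPattern
{
    // The pattern from above, without the /.../ig wrapper.
    private static readonly Regex Pattern =
        new Regex(@"^(.*\D)?(\d+)(\D.*)?$", RegexOptions.Singleline);

    // Hypothetical helper: returns { prefix, number, suffix },
    // or null if the pattern does not match at all.
    public static string[] Split(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;
        // Groups are numbered by position; an optional group that did not
        // take part in the match has an empty Value.
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

With this helper, Split("1-0001") yields 1-, 0001 and an empty suffix, but Split("1_00001-01") yields 1_00001-, 01 and an empty suffix. The greedy first group pushes the number to the last run of digits, which is not always the longest one, so a separate length comparison still seems necessary for inputs like that.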
