Regex matching a word with numbers in it

I'm using Text::Ngrams to determine the word combinations in a string. However, I need to keep words that have digits in them. I've determined that $o->{tokenrex} is what I need to modify, but I can't determine the proper regex for it.

The original is qr/([a-zA-Z]+|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/; but I'm thinking I need something more along the lines of this:

 qr/([a-zA-Z]+|(?<=\w)(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?(?=\w)|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/;

Which should, if I'm reading regex right, match any number of alpha characters, or a "number" that has a word character before and after it, or a "number". Except that it's splitting up my "word" into separate tokens. The example word I'm working with is "A1X".

Any assistance would be great.


Y'all are making this way too complicated. The original regex matches words made of letters only or numbers (integers, floating point including exponential notation).

If you need to match words made of letters and numbers, then the regex for that is [a-zA-Z\d]+. Per the module docs, you'll also want to specify what to skip, and that matches [^a-zA-Z\d]+.

$self->{tokenrex} = qr/([a-z\d]+)/i;
$self->{skiprex}  = qr/([^a-z\d]+)/i;

If you need to recognize numbers as the module documentation shows in its example, then please let me know, and I'll be happy to add that back in for you. From your description, that doesn't sound like what you need.
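
To see how the two patterns divide a string between them, here is a minimal stand-alone sketch. The tokenize sub and the sample text are hypothetical, written only for illustration; this is not the module's own loop and it does not load Text::Ngrams.

```perl
use strict;
use warnings;

# The two patterns from the answer above, held in plain variables.
my $tokenrex = qr/([a-z\d]+)/i;
my $skiprex  = qr/([^a-z\d]+)/i;

# Hypothetical helper: peel tokens and separators off the front of the text.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while (length $text) {
        if ($text =~ s/^$skiprex//)  { next }                    # drop separators
        if ($text =~ s/^$tokenrex//) { push @tokens, $1; next }  # keep the token
        last;  # neither pattern matched: stop rather than loop forever
    }
    return @tokens;
}

my @tokens = tokenize("part A1X, rev 2b");
print join("|", @tokens), "\n";   # prints part|A1X|rev|2b
```

Words such as A1X and 2b come out as single tokens because letters and digits belong to the same character class.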


The (?<=...) and (?=...) constructions are look-behind and look-ahead assertions, and the text they match is not included in the text matched by the whole regular expression.

As a simpler example, for $_ = "A1X", the regular expression

qr/(?<=A)1(?=X)/

does match the string $_, but the text matched by the expression (say, in $&) is just 1, not A1X.
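
A quick way to check this yourself (variable names are just for illustration):

```perl
use strict;
use warnings;

my $word = "A1X";

# With lookarounds, A and X must be present but are not part of the match.
my $lookaround_match = ($word =~ /(?<=A)1(?=X)/) ? $& : undef;

# Without lookarounds, the same three characters are all consumed.
my $plain_match = ($word =~ /A1X/) ? $& : undef;

print "lookaround: $lookaround_match\n";   # prints lookaround: 1
print "plain:      $plain_match\n";        # prints plain:      A1X
```

This is why the lookaround version in the question still hands back 1 as its own token.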

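You could add another clause to your original expression. Put it before the letters-only alternative, because Perl takes the first alternative that matches and [a-zA-Z]+ would otherwise stop after the A:

qr/([a-zA-Z][a-zA-Z0-9]+[a-zA-Z]|[a-zA-Z]+|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/

(this will also match A1B2C3D as a single token; it's not clear whether you'd want that)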
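
Where the mixed clause sits in the alternation matters, since Perl tries alternatives left to right and keeps the first one that succeeds. A small sketch comparing the two orderings; the variable names are mine, and the number pattern is the original one with its inner groups made non-capturing:

```perl
use strict;
use warnings;

my $number = qr/(?:\d+(?:\.\d+)?|\d*\.\d+)(?:[eE][-+]?\d+)?/;
my $mixed  = qr/[a-zA-Z][a-zA-Z0-9]+[a-zA-Z]/;

my $letters_first = qr/([a-zA-Z]+|$mixed|$number)/;
my $mixed_first   = qr/($mixed|[a-zA-Z]+|$number)/;

my ($from_letters_first) = "A1X"     =~ /^$letters_first/;  # "A"
my ($from_mixed_first)   = "A1X"     =~ /^$mixed_first/;    # "A1X"
my ($long_token)         = "A1B2C3D" =~ /^$mixed_first/;    # "A1B2C3D"

print "$from_letters_first $from_mixed_first $long_token\n";  # prints A A1X A1B2C3D
```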


So it looks like you have a couple of things you're looking to fix. The problem with splitting the word into different tokens is easy enough, if I understand what you mean by that: just use non-capturing groups. Use (?:foo) if you don't want to create a new capture group around foo; use (foo) if you do.

Anyway, what your desired pattern sounds like to me is this:

\p{L}*(?:\d*\.)?\d+(?:[eE][-+]?\d+)?(?:(?<=\p{L}(?:\d*\.)?\d+(?:[eE][-+]?\d+)?)\p{L}+)?

Explanation:

\p{L}*                   #Zero or more letter characters (note that this is broader than [a-zA-Z], as it allows accent marks and so forth)
(?:\d*\.)?\d+            #Slightly simplified version of your number-matching pattern
(?:(?<=\p{L}...)\p{L}+)? #Optionally match trailing letters, but only if there are letters at the beginning

Hope I understood what you're looking for. One issue is the [eE]; that will introduce some ambiguity. For example, if you get a string like A3E4D, is the E meant as a letter, or an exponent? I have some ideas about that, but it will be longer and more complicated. Let me know what the rules are and I'll edit. I just don't want to make this more confusing until I'm sure what you're looking for. Also note that the look-behind in this pattern is variable-length, which Perl's regex engine rejects, so it would have to be restructured before it could go into tokenrex.
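
To make the [eE] ambiguity concrete, here is what the original token pattern from the question does with A3E4D. The loop is only a sketch for illustration, not the module's own code:

```perl
use strict;
use warnings;

# Original tokenrex from the question, unchanged.
my $original = qr/([a-zA-Z]+|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/;

my $text = "A3E4D";
my @tokens;
while ($text =~ /$original/g) {
    push @tokens, $1;   # group 1 wraps the whole token
}

print join("|", @tokens), "\n";   # prints A|3E4|D
```

The E is swallowed as an exponent marker, so whichever rule you choose has to say when E counts as a letter.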


Try this one:

qr/(\b[a-zA-Z]([a-zA-Z\d]*[a-zA-Z])?\b|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/

Note, however, that this regex (like the original) still matches a number at the "edge" of a word as its own token: in "A1", the 1 is matched separately.
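
The quantifier inside the optional group decides whether two-letter words match at all, so it is worth checking both variants against a few inputs. A rough sketch; the tokens helper and the sample string are made up for illustration, and the inner groups are non-capturing to keep the example short:

```perl
use strict;
use warnings;

my $number = qr/(?:\d+(?:\.\d+)?|\d*\.\d+)(?:[eE][-+]?\d+)?/;

# Same word clause, differing only in + versus * inside the optional group.
my $with_plus = qr/(\b[a-zA-Z](?:[a-zA-Z\d]+[a-zA-Z])?\b|$number)/;
my $with_star = qr/(\b[a-zA-Z](?:[a-zA-Z\d]*[a-zA-Z])?\b|$number)/;

# Hypothetical helper: collect every match of $re in $text.
sub tokens {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) { push @found, $1 }
    return @found;
}

my $sample = "AB A1X A1";
print join("|", tokens($sample, $with_plus)), "\n";   # prints A1X|1
print join("|", tokens($sample, $with_star)), "\n";   # prints AB|A1X|1
```

With + the two-letter word AB is lost, because the optional group then needs at least three characters in total. In both variants the trailing 1 of A1 comes out as a token of its own.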
