Regex matching a word with numbers in it

2023-02-15 21:54

I'm using Text::Ngrams to determine the word combinations in a string. However, I need to keep words that have digits in them. I've determined that $o->{tokenrex} is what I need to modify, but I can't determine the proper regex for it.

The original is qr/([a-zA-Z]+|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/; but I'm thinking I need something more along the lines of this:

 qr/([a-zA-Z]+|(?<=\w)(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?(?=\w)|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/;

Which should, if I'm reading regex right, match any number of alpha characters, or a "number" that has a word character before and after it, or a "number". Except that it's splitting up my "word" into separate tokens. The example word I'm working with is "A1X".

Any assistance would be great.

Y'all are making this way too complicated. The original regex matches words made of letters only or numbers (integers, floating point including exponential notation).

If you need to match words made of letters and numbers, then the regex for that is [a-zA-Z\d]+. Per the module docs, you'll also want to specify what to skip, and that matches [^a-zA-Z\d]+.

$self->{tokenrex} = qr/([a-z\d]+)/i;
$self->{skiprex}  = qr/([^a-z\d]+)/i;

If you need to recognize numbers as the module documentation shows in its example, then please let me know, and I'll be happy to add that back in for you. From your description, that doesn't sound like what you need.
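A minimal sketch of how these two patterns split a string, run standalone rather than through Text::Ngrams. The tokenize helper is hypothetical and only approximates a tokenizer that alternates between the two regexes; the module's own loop may differ.

```perl
use strict;
use warnings;

my $tokenrex = qr/([a-z\d]+)/i;    # letters and digits together form one token
my $skiprex  = qr/([^a-z\d]+)/i;   # anything else is treated as a separator

# Hypothetical helper: consume the string from the left, dropping
# separators and collecting tokens.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while (length $text) {
        if ($text =~ s/^$skiprex//) {
            next;                  # separator removed, nothing to keep
        }
        elsif ($text =~ s/^$tokenrex//) {
            push @tokens, $1;      # $1 is the token that was just removed
        }
        else {
            last;                  # neither pattern matched; avoid an endless loop
        }
    }
    return @tokens;
}

print join('|', tokenize("the A1X part, rev 2b")), "\n";   # prints the|A1X|part|rev|2b
```

With these patterns "A1X" stays in one piece because digits and letters belong to the same character class.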

The (?<=...) and (?=...) constructions are look-behind and look-ahead assertions, and the text they match is not included in the text matched by the whole regular expression.

As a simpler example, for $_ = "A1X", the regular expression

qr/(?<=A)1(?=X)/

does match the string $_, but the text matched by the expression (say, in $&) is just 1, not A1X.
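A short sketch to confirm this; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $s = "A1X";
my ($matched, $offset);
if ($s =~ /(?<=A)1(?=X)/) {
    # The lookarounds check the neighbouring characters but consume nothing,
    # so only the "1" ends up in $&.
    ($matched, $offset) = ($&, $-[0]);
    printf "matched '%s' at offset %d\n", $matched, $offset;   # prints: matched '1' at offset 1
}
```

That is why wrapping the number clause in lookarounds cannot pull the surrounding letters into the token.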

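You could add another clause to your original expression. Put it first, because Perl tries alternatives left to right and [a-zA-Z]+ would otherwise match just the leading A:

qr/([a-zA-Z][a-zA-Z0-9]+[a-zA-Z]|[a-zA-Z]+|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/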

(this will match A1B2C3D though -- it's not clear if you'd want it to do that)
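Where the extra clause sits in the alternation matters, since Perl takes the first alternative that succeeds at a given position. A small sketch comparing both placements; the variable names are illustrative, and the number sub-pattern uses non-capturing groups here for brevity.

```perl
use strict;
use warnings;

my $num = qr/(?:\d+(?:\.\d+)?|\d*\.\d+)(?:[eE][-+]?\d+)?/;

# The same three clauses in two different orders.
my $letters_first = qr/([a-zA-Z]+|[a-zA-Z][a-zA-Z0-9]+[a-zA-Z]|$num)/;
my $mixed_first   = qr/([a-zA-Z][a-zA-Z0-9]+[a-zA-Z]|[a-zA-Z]+|$num)/;

for my $re ($letters_first, $mixed_first) {
    my $s = "A1X";
    my @tokens;
    push @tokens, $1 while $s =~ /$re/g;   # collect every successive match
    print join('|', @tokens), "\n";
}
# prints A|1|X for the letters-first pattern and A1X for the mixed-first one
```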

So it looks like you have a couple of things you're looking to fix. The problem with splitting the word into different tokens is easy enough, if I understand what you mean by that: just use non-capturing groups. Use (?:foo) if you don't want to create a new capture group around foo; use (foo) if you do.

Anyway, what your desired pattern sounds like to me is this:

\p{L}*(?:\d*\.)?\d+(?:[eE][-+]?\d+)?(?:(?<=\p{L}(?:\d*\.)?\d+(?:[eE][-+]?\d+)?)\p{L}+)?

Explanation:

\p{L}*                   #Zero or more letter characters (note that this is broader than [a-zA-Z], as it allows accent marks and so forth)
(?:\d*\.)?\d+            #Slightly simplified version of your number-matching pattern
(?:(?<=\p{L}...)\p{L}+)? #Optionally match trailing letters, but only if there are letters at the beginning

Hope I understood what you're looking for. One issue is the [eE]; that will introduce some ambiguity. For example, if you get a string like A3E4D, is the E meant as a letter, or an exponent? I have some ideas about that, but it will be longer and more complicated. Let me know what the rules are and I'll edit; I just don't want to make this more confusing until I'm sure what you're looking for.
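One caveat for Perl specifically: as far as I know, Perl only accepts look-behind of bounded length (and only fixed length before 5.30), so a look-behind containing \d+ is rejected when the pattern is compiled. A hedged sketch of the same idea without look-behind follows; the rewrite into two alternatives is an assumption about the intent, not the pattern given above.

```perl
use strict;
use warnings;

# Number sub-pattern as described in the explanation.
my $num = qr/(?:\d*\.)?\d+(?:[eE][-+]?\d+)?/;

# Assumed rewrite: leading letters, a number, optional trailing letters;
# or a bare number with no letters around it.
my $token = qr/\p{L}+$num\p{L}*|$num/;

for my $s ("A1X", "42", "x3.5") {
    printf "%s => %s\n", $s, $& if $s =~ /$token/;
}
# prints A1X => A1X, 42 => 42 and x3.5 => x3.5, one per line
```

Trailing letters are only reachable through the first alternative, which requires leading letters, so the look-behind is no longer needed. The [eE] ambiguity remains.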

Try this one:

qr/(\b[a-zA-Z]([a-zA-Z\d]+[a-zA-Z])?\b|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/

Note, however, that this regex (and the original) will match numbers on the "edges" of words.
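A quick sketch of what that means in practice, scanning a few sample strings with the pattern above (the loop and the sample strings are only for illustration):

```perl
use strict;
use warnings;

my $re = qr/(\b[a-zA-Z]([a-zA-Z\d]+[a-zA-Z])?\b|(\d+(\.\d+)?|\d*\.\d+)([eE][-+]?\d+)?)/;

my %found;
for my $s ("A1X", "A1", "1X") {
    my @tokens;
    push @tokens, $1 while $s =~ /$re/g;   # $1 is the outermost group
    $found{$s} = join('|', @tokens);
    printf "%s: %s\n", $s, $found{$s};
}
# prints "A1X: A1X", "A1: 1" and "1X: 1"
```

A digit in the middle stays inside the word, but a leading or trailing digit is picked up on its own and the adjacent letter is left unmatched.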
