What's wrong with this regex?

2023-01-23 20:21 Asked by:

I am trying the following code on Java:

String test = "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf";
String regex = "[http://]{0,1}([a-zA-Z]*.)*\\.google\\.com/[-a-zA-Z/_.?&=]*";
System.out.println(test.matches(regex));

It runs for several minutes (after which I killed the VM) with no result. Can anyone help me?

BTW: What would you recommend I do to speed up weblink-testing regexes in the future?

[http://] is a character class, meaning any one of those characters from the set.

Just leave those particular square brackets off if it must start with http://. If it's optional, you can use (http://)?.
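To make the difference concrete, here is a minimal sketch comparing the two forms. The class and method names are made up for illustration, and the patterns are shortened to just the prefix plus the literal google so the effect is easy to see.

```java
import java.util.regex.Pattern;

class SchemePrefixDemo {
    // Character class: matches at most ONE character out of h, t, p, :, /
    static final Pattern CHAR_CLASS = Pattern.compile("[http://]{0,1}google");
    // Group: matches the literal text "http://" zero or one time
    static final Pattern GROUP = Pattern.compile("(http://)?google");

    static boolean matchesCharClass(String s) {
        return CHAR_CLASS.matcher(s).matches();
    }

    static boolean matchesGroup(String s) {
        return GROUP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesCharClass("http://google")); // false
        System.out.println(matchesCharClass("/google"));       // true
        System.out.println(matchesGroup("http://google"));     // true
        System.out.println(matchesGroup("/google"));           // false
    }
}
```

The character class happily accepts a single stray / in front of google, which is almost certainly not what was intended.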

One obvious problem is that you're looking for the sequence ([a-zA-Z]*.)*\\.google - this will do a lot of backtracking because of that unescaped ., which means "any character" rather than the literal period that you wanted.

But even if you replace it with what you meant, ([a-zA-Z]+\\.)*\\.google, you still have a problem - this will then require two . characters immediately before google. You should instead try:

String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

That returns immediately for me with a true match.

Keep in mind that this currently requires the / at the end of google.com. If that's a problem, it's a minor fix, but I've left it there since you had it in your original regex.
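If the trailing / should be optional, one possible version of that minor fix is to wrap the slash and the path in an optional group. This is only a sketch: the class and method names are invented for the example, and the optional group is my assumption about what the fix would look like, not something taken from the question.

```java
import java.util.regex.Pattern;

class OptionalPathDemo {
    // Same regex as above, but "/path" is wrapped in an optional group
    static final Pattern GOOGLE_LINK = Pattern.compile(
            "(http://)?([a-zA-Z]+\\.)*google\\.com(/[-a-zA-Z/_.?&=]*)?");

    static boolean isGoogleLink(String s) {
        return GOOGLE_LINK.matcher(s).matches();
    }

    public static void main(String[] args) {
        // With a path, as in the question
        System.out.println(isGoogleLink(
                "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf")); // true
        // Without the trailing slash
        System.out.println(isGoogleLink("http://google.com"));            // true
        // Extra characters after "com" are still rejected
        System.out.println(isGoogleLink("http://google.comx"));           // false
    }
}
```

Compiling the Pattern once and reusing it also avoids recompiling the regex on every call, which String.matches does.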

You are trying to match the scheme as a character class using square brackets. With {0,1}, that matches zero or one single character from that set, not the string http://. You want a subpattern, with parentheses. You can also shorten {0,1} to ?.

Also, you should remove the period just before google\\.com because you're already looking for a period in the subdomain subpattern of your regex. As cherouvim points out, you forgot to escape that period as well.

String regex = "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

In the ([a-zA-Z]*.) part you either need to escape the . (because right now it means "any character") or remove it.

There are two problems with the regular expression.

The first is easy, as was mentioned by others. You need to match "http://" as a subpattern, not as a character class. Change the brackets to parentheses.

The second problem causes the very poor performance. It's causing the regex to backtrack repeatedly, trying to match the pattern.

What you're trying to do is match zero or more subdomains, which are groups of letters followed by a dot. Since you want to match the dot explicitly, escape the dot. Also remove the dot in front of "google" so you can match "http://google.com/etc" (i.e., no leading dot in front of google).

So your expression becomes:

String regex = "(http://){0,1}([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*";

Running this regex on your example takes just a fraction of a second.

Assuming you fix the ([a-zA-Z]*\\.) part, you also need to change * to + so that it becomes ([a-zA-Z]+\\.). Otherwise you'll accept http://...google.com, which is not valid.
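A small sketch of that difference, with invented class and method names. Both patterns are the corrected regex from the other answers, differing only in * versus + inside the subdomain group. The test input has a trailing / added because the regex requires one.

```java
import java.util.regex.Pattern;

class SubdomainQuantifierDemo {
    // "*" allows an empty label, so a bare "." counts as a subdomain
    static final Pattern LENIENT = Pattern.compile(
            "(http://)?([a-zA-Z]*\\.)*google\\.com/[-a-zA-Z/_.?&=]*");
    // "+" requires at least one letter before each dot
    static final Pattern STRICT = Pattern.compile(
            "(http://)?([a-zA-Z]+\\.)*google\\.com/[-a-zA-Z/_.?&=]*");

    static boolean lenient(String s) {
        return LENIENT.matcher(s).matches();
    }

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(lenient("http://...google.com/"));   // true
        System.out.println(strict("http://...google.com/"));    // false
        System.out.println(strict("http://mail.google.com/"));  // true
    }
}
```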

Since you group the part before google.com, I assume you are after part of the URL's host name. Regexp is a powerful tool, but you can simply use Java's URL class, which has a getHost() method. You can then check whether the host name ends with google.com and split it, or run a simpler regexp against the host name only.

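// new URL(...) throws the checked MalformedURLException
URL url = new URL("http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf");
String host = url.getHost();
// Note: endsWith("google.com") also accepts hosts such as notgoogle.com
if (host.endsWith("google.com")) {
    String[] parts = host.split("\\.");
    for (String s : parts) {
        System.out.println(s);
    }
}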
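As a self-contained variant of the same idea, the check could be wrapped in a helper like the one below. This is a sketch under a few assumptions: isGoogleHost is a hypothetical name, a malformed link is treated as "not Google", and the host must be exactly google.com or end with .google.com so that hosts like notgoogle.com are rejected.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

class GoogleHostCheck {
    // Hypothetical helper: true if the link's host is google.com or a subdomain of it
    static boolean isGoogleHost(String link) {
        try {
            String host = new URL(link).getHost().toLowerCase(Locale.ROOT);
            return host.equals("google.com") || host.endsWith(".google.com");
        } catch (MalformedURLException e) {
            return false; // assumption: an unparseable link is simply rejected
        }
    }

    public static void main(String[] args) {
        System.out.println(isGoogleHost(
                "http://asda.aasd.sd.google.com/asdasdawrqwfqwfqwfqwf")); // true
        System.out.println(isGoogleHost("http://notgoogle.com/"));        // false
        System.out.println(isGoogleHost("not a url"));                    // false
    }
}
```

Unlike the regex, this requires the scheme to be present, because new URL(...) rejects a string without a protocol.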
