Very Simple Regex Question

I have a very simple regex question. Suppose I have 2 conditions:

  1. url = http://www.abc.com/cde/def
  2. url = https://www.abc.com/sadfl/dsaf

How can I extract the baseUrl using regex?

Sample output:

  1. http://www.abc.com
  2. https://www.abc.com


Like this:

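import java.util.regex.Matcher;
import java.util.regex.Pattern;

String baseUrl = null;
// Group 1: optional scheme, host, and optional port, without the trailing slash.
Pattern p = Pattern.compile("^(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");
Matcher m = p.matcher(str);
// lookingAt() matches a prefix; matches() would require the whole string to match.
if (m.lookingAt())
    baseUrl = m.group(1);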

However, you should use the URI class instead, like this:

URI uri = new URI(str);
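The URI line above stops before producing the base URL. Here is a minimal sketch of one way to finish it, assuming the input is an absolute URL with a scheme and a host. The class and method names (UriBaseUrl, baseUrl) are made up for illustration, and the authority returned by URI includes any user info and port.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriBaseUrl {
    // Returns scheme + "://" + authority, e.g. "http://www.abc.com".
    // Assumes an absolute, hierarchical URL; anything else is rejected.
    static String baseUrl(String str) {
        URI uri;
        try {
            uri = new URI(str);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + str, e);
        }
        if (uri.getScheme() == null || uri.getRawAuthority() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + str);
        }
        return uri.getScheme() + "://" + uri.getRawAuthority();
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("http://www.abc.com/cde/def"));
        System.out.println(baseUrl("https://www.abc.com/sadfl/dsaf"));
    }
}
```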


A one liner without regexp:

String baseUrl = url.substring(0, url.indexOf('/', url.indexOf("//")+2));
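One caveat with the one-liner: if the URL has no slash after the host (for example http://www.abc.com), the inner indexOf returns -1 and substring throws StringIndexOutOfBoundsException. Below is a rough sketch of a guarded variant. The class name SubstringBaseUrl is hypothetical, and returning the whole string when there is no path is an assumption about what you want.

```java
class SubstringBaseUrl {
    // Everything before the first '/' that follows "//".
    // If there is no such slash, the whole input is treated as the base URL.
    static String baseUrl(String url) {
        int slashes = url.indexOf("//");
        if (slashes < 0) {
            throw new IllegalArgumentException("No \"//\" in: " + url);
        }
        int pathStart = url.indexOf('/', slashes + 2);
        return pathStart < 0 ? url : url.substring(0, pathStart);
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("http://www.abc.com/cde/def"));
        System.out.println(baseUrl("http://www.abc.com"));
    }
}
```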


s/^(https?:\/\/[^\/]+).*/$1/

Written as a Perl-style substitution, this matches any string that starts with http:// or https://, and $1 will contain everything from the beginning up to (but not including) the first / after the //


Except for write-and-throw-away scripts, you should avoid parsing complex syntaxes (e-mail addresses, URLs, HTML pages, etc.) with regexes.

Believe me, you will get bitten eventually.


I'm pretty sure that there is a Java class that will allow path manipulations, but if it has to be a regex,

https?://[^/]+

would work. (s? included to also handle https:)
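For what it's worth, here is a small sketch of how that pattern might be used from Java. The class name RegexBaseUrl is made up, and anchoring with ^ is my addition so that an http:// buried later in the string is not picked up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBaseUrl {
    // ^ anchors the match at the start of the input (an assumption, not in the pattern above).
    private static final Pattern BASE = Pattern.compile("^https?://[^/]+");

    // Returns the scheme and host part, or null if the input does not start with http(s)://.
    static String baseUrl(String url) {
        Matcher m = BASE.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("http://www.abc.com/cde/def"));
        System.out.println(baseUrl("https://www.abc.com/sadfl/dsaf"));
    }
}
```

Note that find() is used rather than matches(), since matches() only succeeds when the pattern covers the entire input.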


Looks like the simplest solution to your two specific examples would be the pattern:

[^/]*//[^/]+

i.e.: non-slash (0 or more times), two slashes, non-slash (1 or more times). You can be stricter than that if you wish, as the two existing answers do in different ways: one will reject e.g. URLs starting with ftp:, the other will reject domains with underscores (but accept URLs without a leading protocol://, thereby being even broader than mine in that respect). This variety of answers (all correct with respect to your scant specs ;-) should suggest to you that your specs are too vague and should be tightened.
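To make the trade-off concrete, here is a quick sketch that runs three patterns from this thread against a few inputs. The class name PatternComparison and the sample URLs are invented for the demo, and the character-class pattern is written in a compilable form with group 1 ending before the slash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternComparison {
    // Each pattern captures the base URL in group 1.
    static final Pattern HTTP_ONLY = Pattern.compile("(https?://[^/]+)");
    static final Pattern ANY_SCHEME = Pattern.compile("([^/]*//[^/]+)");
    static final Pattern CHAR_CLASS =
        Pattern.compile("(([a-zA-Z]+://)?[a-zA-Z0-9.-]+\\.[a-zA-Z]+(:\\d+)?)/");

    // Returns group 1 if the pattern matches at the start of the input, else null.
    static String match(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "http://www.abc.com/cde/def",
            "ftp://files.abc.com/pub",
            "http://my_host.abc.com/x",
            "www.abc.com/cde"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + match(HTTP_ONLY, s)
                + " | " + match(ANY_SCHEME, s) + " | " + match(CHAR_CLASS, s));
        }
    }
}
```

The ftp: input is rejected only by the https? pattern, the underscore host only by the character-class pattern, and the scheme-less input is accepted only by the character-class pattern.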


Here's a regex that should satisfy the problem as given.

https?://[^/]*

I'm assuming you're asking this partly to gain more knowledge of regexes. If, however, you're trying to pull the host from a URL, it's arguably much more correct to use Java's more robust parsing methods:

String urlStr = "https://www.abc.com/stuff";
URL url = new URL(urlStr);
String host = url.getHost();
String protocol = url.getProtocol();
// URL has no (protocol, host) constructor; pass an empty file part.
URL baseUrl = new URL(protocol, host, "");

This is better, as it should catch more cases if your input URL isn't as strict as described above.
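One case the snippet above does not cover is an explicit port, which is dropped when only protocol and host are carried over. Here is a hedged sketch that keeps it. The class name UrlBaseUrl and the port 8443 are illustrative only, and newer JDKs deprecate the URL constructors in favor of going through URI, so treat this as a sketch rather than a recommendation.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlBaseUrl {
    // Rebuilds the URL from protocol, host, and port with an empty file part.
    // getPort() returns -1 when no port was given, which the constructor
    // treats as "use the protocol's default" and omits from the string form.
    static String baseUrl(String urlStr) {
        try {
            URL url = new URL(urlStr);
            URL base = new URL(url.getProtocol(), url.getHost(), url.getPort(), "");
            return base.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + urlStr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("https://www.abc.com/stuff"));
        System.out.println(baseUrl("https://www.abc.com:8443/stuff"));
    }
}
```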


Old post.. thought I might as well put a simple answer to a simple regex Q:

(http|https):\/\/(www.)?(\w+)?\.(\w+)?
