
Java Regex to get the text from HTML anchor (<a>...</a>) tags

I'm trying to get the text within a certain tag. So if I have:

<a href="http://something.com">Found</a>

I want to be able to retrieve the Found text.

I'm trying to do it using regex. I am able to do it if the <a href="http://something.com"> part stays the same, but it doesn't.

So far I have this:

Pattern titleFinder = Pattern.compile( ".*[a-zA-Z0-9 ]* ([a-zA-Z0-9 ]*)</a>.*" );

I think the last two parts - the ([a-zA-Z0-9 ]*)</a>.* - are ok but I don't know what to do for the first part.


As others have said, don't use regex to parse HTML. If you are aware of the shortcomings, you might get away with it, though. Try

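import java.util.regex.Matcher;
import java.util.regex.Pattern;

// subjectString holds the HTML to search
Pattern titleFinder = Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    String linkText = regexMatcher.group(1); // text between <a ...> and </a>
}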

will iterate over all matches in a string.

It won't handle nested <a> tags and ignores all the attributes inside the tag.
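For reference, here is a minimal, self-contained sketch of the approach above. The class name `AnchorTextDemo`, the method `anchorTexts`, and the sample HTML are made up for illustration. The `\b` after `<a` is an addition that is not in the pattern above; it is there so that tags such as `<abbr>` are not treated as anchors. The same caveats apply: nested tags and malformed markup are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTextDemo {
    // <a\b      : the tag name "a" followed by a word boundary
    // [^>]*>    : any attributes, up to the end of the opening tag
    // (.*?)     : the link text, captured lazily so it stops at the first </a>
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text of every <a>...</a> element found in html, in order.
    static List<String> anchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://something.com\">Found</a> and "
                + "<A class=\"x\" href=\"http://other.com\">Also found</A></p>";
        System.out.println(anchorTexts(html)); // prints [Found, Also found]
    }
}
```

Because the group is lazy and `Pattern.DOTALL` is set, link text that spans several lines is still captured as a single match.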


str = str.replaceAll("(?i)</?a\\b[^>]*>", "");
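A hedged sketch of this tag-stripping idea as a runnable program, using a pattern that also matches opening tags carrying attributes such as `href`. The names `StripAnchorsDemo` and `stripAnchorTags` and the sample input are hypothetical. Since Java strings are immutable, `replaceAll` returns a new string, so the result has to be used or assigned. Unlike a capture group, this keeps all surrounding text and removes only the anchor tags themselves.

```java
class StripAnchorsDemo {
    // Removes opening and closing <a> tags but keeps the text between them.
    static String stripAnchorTags(String html) {
        // (?i)   : case-insensitive, so <A ...> is handled too
        // </?a\b : "<a" or "</a" as a whole tag name (not <abbr>, <article>, ...)
        // [^>]*> : any attributes, up to the closing '>'
        return html.replaceAll("(?i)</?a\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "See <a href=\"http://something.com\">Found</a> here";
        System.out.println(stripAnchorTags(html)); // prints See Found here
    }
}
```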

Here is an online ideone demo.

Here is a similar topic: How to remove the tags only from a text?
