Java .split() with regex to match html <a> links

2023-04-12 14:24

I need to parse a string and escape all html tags except <a> links.

For example:

"Hello, this is <b>A BOLD</b> bit and this is <a href="www.google.com">a google</a> link"

When this is printed in my JSP, I want the tags to appear as is (i.e. escaped, so "A BOLD" is not actually rendered in bold on the page), but the <a> link should be an actual link to Google on the page.

I have a little method that splits the incoming string based on a regex that matches <a> links in various formats (with white space, single or double quotes, etc.). The regex is as follows:

myString.split("<a\\s[^>]*href\\s*=\\s*[\\\"\\|\\\'][^>]*[\\\"\\|\\\']\\s*>[^<\\/a>]*<\\/a>");

Yes, it's horrid and probably hopelessly inefficient, so I'm open to alternative suggestions, but it does work up to a point. Where it falls down is the link text. I want it to accept zero or more occurrences of any characters other than the </a> closing tag, but it is being read as zero or more occurrences of any character other than "<", "/", "a" or ">", i.e. as individual characters rather than the complete </a> sequence. So it matches text that has an "e" in it, for example, but not text containing an "a".

The bit in question is: [^<\\/a>]*

How do I change this to match on the entire word, not its constituent characters? I've tried parentheses etc. but nothing works.
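One way to say "anything up to the closing tag" is a reluctant quantifier, .*?</a>, instead of a negated character class, since a character class can only exclude single characters. A minimal sketch, assuming the rest of the pattern stays as in the question; the class and method names here are made up for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LinkSplitDemo {
    // Same pattern as in the question, except that the link text is ".*?",
    // which stops at the first "</a>" instead of excluding single characters.
    static final Pattern LINK = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*[\"'][^>]*[\"']\\s*>.*?</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Splits the input around every <a ...>...</a> link.
    static String[] splitAroundLinks(String input) {
        return LINK.split(input);
    }

    public static void main(String[] args) {
        String s = "Hello, this is <b>A BOLD</b> bit and this is "
                 + "<a href=\"www.google.com\">a google</a> link";
        System.out.println(Arrays.toString(splitAroundLinks(s)));
    }
}
```

Note that split() throws the links themselves away, so to rebuild the string you would still need to recover them, for instance with a Matcher over the same pattern.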

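You can clean your HTML without ruining <a> tags by using the jsoup HTML Cleaner with a Whitelist (called Safelist in newer jsoup releases):

String unsafe =
    "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
// addTags() and addAttributes() are instance methods, so start from Whitelist.none()
Whitelist linksOnly = Whitelist.none().addTags("a").addAttributes("a", "href");
String safe = Jsoup.clean(unsafe, linksOnly);
// clean() drops disallowed tags and attributes rather than escaping them:
// the <p> wrapper and the onclick attribute are removed, the <a href> link is kept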

Although I agree with the consensus that regexes were not designed to parse x*ml, I feel that sometimes you just don't have the time to learn, practice and implement new concepts, and a simple regex might well suffice in your case.

If you have enough time, learn to use XML parsers. Otherwise, here is an untested and possibly not foolproof regex for your problem (escape the backslashes for Java strings):

<\s*(?:[^aA]\b|[a-zA-Z0-9]{2,})[^>]*>

Which translates into:

<\s* # less-than character with optional space
(?:  # non capturing group of
  [^aA]\b         # a single character other than a or A, then a word boundary
  |              # or
  [a-zA-Z0-9]{2,} # at least two alphanumeric characters
)
[^>]*> # ... anything until the first greater-than character
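The pattern above can be applied from Java by escaping each tag it matches. As posted, though, it also matches </a> (the "/" satisfies [^aA] and is followed by a word boundary), so the sketch below swaps in a negative-lookahead variant, <(?!/?a\b)[^<>]*>. That variant and the names EscapeNonLinkTags and escapeNonLinkTags are illustrative assumptions, not part of the answer above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeNonLinkTags {
    // Any opening or closing tag whose name is not exactly "a".
    // (?!/?a\b) rejects "<a ...>" and "</a>" but still allows "<abbr>".
    static final Pattern NON_A_TAG =
        Pattern.compile("<(?!/?a\\b)[^<>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces the angle brackets of every non-<a> tag with HTML entities.
    static String escapeNonLinkTags(String input) {
        Matcher m = NON_A_TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String escaped = m.group().replace("<", "&lt;").replace(">", "&gt;");
            // quoteReplacement guards against "$" or "\" inside the tag text
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "Hello, this is <b>A BOLD</b> bit and this is "
                 + "<a href=\"www.google.com\">a google</a> link";
        System.out.println(escapeNonLinkTags(s));
    }
}
```

This leaves stray "<", ">" or "&" characters outside tags untouched and passes <a> tags through verbatim, including attributes such as onclick, so it is not a substitute for a real sanitizer.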
