Regex to match ">" in HTML

I need a regex that matches the ">" character in an HTML string but doesn't match a tag's closing bracket. Example:

<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>

The regex should match the ">" between a and b.


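You can use a negative lookbehind: (?<!\<[^>]+)\> (untested).
Note that this lookbehind is variable-length, which .NET's regex engine supports but some others, such as Python's re module, do not.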

This will match any > character that isn't preceded by the beginning of an HTML tag (a sequence starting with < and not containing >).
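
For engines without variable-length lookbehind, the same check can be done by hand. The sketch below is only an illustration in Python, not the answerer's code: the names TAG_START_BEFORE and bare_gt_positions are made up here, and the approach (testing the text before each ">" against <[^>]+ anchored at its end) is an assumption about how to emulate the lookbehind.

```python
import re

# Emulates the lookbehind body <[^>]+ by anchoring it at the end of the prefix.
# Hypothetical helper names; not part of any library.
TAG_START_BEFORE = re.compile(r"<[^>]+$")

def bare_gt_positions(html):
    """Return indexes of '>' characters not preceded by an unclosed '<...'."""
    positions = []
    for i, ch in enumerate(html):
        # html[:i] is everything before this '>'; if it ends with '<' plus
        # non-'>' characters, the '>' closes a tag and is skipped.
        if ch == ">" and not TAG_START_BEFORE.search(html[:i]):
            positions.append(i)
    return positions

sample = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
print(bare_gt_positions(sample))
```

This rescans the prefix for every ">", so it is quadratic in the worst case; it is meant to show the logic rather than to be fast.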


The following regex should work:

([^/]>)+


A specific solution rather than just an admonition:

"Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away. " - http://www.crummy.com/software/BeautifulSoup/

Don't use regex to parse html -

"Among programmers of any experience, it is generally regarded as A Bad Idea to attempt to parse HTML with regular expressions." - Link

and "You can't parse [X]HTML with regex" - 4352 votes at the time of this posting

"Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use ..." something designed for that purpose.
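
As a rough sketch of the parser route: the example below uses Python's standard-library html.parser instead of Beautiful Soup, purely so that it runs without installing anything. The class name TextCollector and the function text_with_gt are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text nodes of a document, ignoring the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for text between tags.
        self.chunks.append(data)

def text_with_gt(html):
    """Return the text nodes that contain a literal '>'."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return [chunk for chunk in parser.chunks if ">" in chunk]

sample = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
print(text_with_gt(sample))
```

Because the parser has already separated tags from text, any ">" found in a text node cannot be a tag's closing bracket.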


What you need is a regex that finds "unpaired" greater-than signs; >s that are not preceded by a < as you'd find in a tag.

Try this: "(?<!\<[^<>]+)\>" It should match a greater-than that is not part of an HTML tag; that is, a construct consisting of a less-than, some number of characters other than the angle-bracket characters, then a greater than.

EDIT: put in SLak's suggestions. I'll keep the < in the "not match" block just in case the less-than being matched is also not part of a tag, for instance << or <-. It shouldn't hurt the pattern's ability to match proper tags.
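
Python's re module rejects the variable-length lookbehind above, so here is one possible workaround, offered as a sketch rather than as an equivalent of the pattern: match whole tag-like constructs first and capture only the ">" characters left over. The names TAG_OR_GT and unpaired_gt are invented for this example.

```python
import re

# First alternative consumes a whole tag-like construct: '<', one or more
# non-angle-bracket characters, then '>'. Second alternative captures a lone '>'.
TAG_OR_GT = re.compile(r"<[^<>]+>|(>)")

def unpaired_gt(text):
    """Return indexes of '>' characters that are not part of a tag-like construct."""
    return [
        m.start(1)
        for m in TAG_OR_GT.finditer(text)
        if m.group(1) is not None  # group 1 is set only for a lone '>'
    ]

sample = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
print(unpaired_gt(sample))
```

Like the lookbehind version, this treats any "<...>" run as a tag, so text such as "x <- y > z" would still hide its ">".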
