Regex to match ">" in HTML
I need a regex that matches the ">" character in an HTML string but doesn't match a tag's closing bracket. Example:
<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>
The regex should match the ">" between a and b.
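You can use a negative lookbehind: (?<!\<[^>]+)\>
(Untested.)
This will match any > character that isn't preceded by the beginning of an HTML tag (a sequence starting with < and not containing >). Note that it relies on variable-length lookbehind, which only some regex engines (.NET, for example) support.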
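For engines without variable-length lookbehind (Python's built-in re module only accepts fixed-width lookbehind), one workaround is to let the pattern consume whole tags first and capture only a > found outside them. The following is a minimal Python sketch of that idea, not the lookbehind pattern itself. The name stray_gt_positions is hypothetical, and the sketch assumes no > appears inside an attribute value.

```python
import re

# First alternative swallows a complete tag; second captures a lone ">".
# Assumption: tags never contain ">" inside quoted attribute values.
PATTERN = re.compile(r"<[^<>]+>|(>)")

def stray_gt_positions(html):
    """Return the indexes of ">" characters that are not closing a tag."""
    return [
        m.start(1)
        for m in PATTERN.finditer(html)
        if m.group(1) is not None  # group 1 is set only for a lone ">"
    ]

html = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'
positions = stray_gt_positions(html)
print(positions)
```

On the question's example this should report a single position, the > between a and b.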
The following regex should work:
([^/]>)+
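A quick check against the question's example suggests this pattern does not tell the two cases apart: any > preceded by a character other than / matches, including the closing bracket of every tag. A small Python sketch using the sample string from the question:

```python
import re

html = '<span id="bla"> bla bla a > b bla bla bla <a>bla </a> </span>'

# Collect the full text of every match of the proposed pattern.
matches = [m.group(0) for m in re.finditer(r"([^/]>)+", html)]
print(len(matches), matches)
```

The list holds five matches, one per > in the string; only the second one is the > the question is after.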
A specific solution rather than just an admonition:
"Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away. " - http://www.crummy.com/software/BeautifulSoup/
Don't use regex to parse HTML:
"Among programmers of any experience, it is generally regarded as A Bad Idea to attempt to parse HTML with regular expressions." - Link
and "You can't parse [X]HTML with regex" - 4352 votes at the time of this posting
"Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use ..." something designed for that purpose.
What you need is a regex that finds "unpaired" greater-than signs; >s that are not preceded by a < as you'd find in a tag.
Try this: "(?<!\<[^<>]+)\>"
It should match a greater-than sign that is not part of an HTML tag; that is, not part of a construct consisting of a less-than sign, one or more characters other than angle brackets, then a greater-than sign.
EDIT: put in SLak's suggestions. I'll keep the < in the "not match" block just in case the less-than being matched is also not part of a tag, for instance << or <-. It shouldn't hurt the pattern's ability to match proper tags.