Making XHTML file valid using regex
I'm trying to use PHP with SimpleXML to parse an XHTML file, however the file contains < and > signs which are not part of the markup and cause parsing to fail (opening and end tag mismatches).
How can I convert these to HTML entities before parsing without changing the file or affecting the markup?
Example:
<p> a < b </p>
Would become:
<p> a &lt; b </p>
Well, the short answer is: you can't parse HTML with regex.
Maybe you could try using another XML parser that doesn't choke on the < and >?
Better yet, don't try to parse this XHTML file as XML, since, as you already point out yourself, it isn't really an XML file and has illegal characters in it.
As Martin Jespersen already said, there is no good way to parse (invalid or valid) markup with regexes, at least not with PHP regexes.
That said, if you're only looking for a way to escape
- unbalanced angle brackets
- that are between valid tags
- which do not contain angle brackets somewhere inside their attribute values
then you might get away with doing this:
$intermediate = preg_replace('/(>[^<>]*)<([^<>]*<)/', '\1&lt;\2', $subject);
$result = preg_replace('/(>[^<>]*)>([^<>]*<)/', '\1&gt;\2', $intermediate);
but you'd have to run this several times until there are no more matches, because each pass will only catch one stray < or > between tags at a time. It will also fail on pseudo-balanced brackets like <p> a <> b </p>.
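The "run this several times" step could be sketched roughly as follows. This is only a minimal sketch: the function name escapeStrayBrackets() is made up for illustration, and it assumes the same restrictions listed above (stray brackets between valid tags, no angle brackets inside attribute values). It uses the optional count argument of preg_replace to detect when a pass changed nothing.

```php
<?php
// Hypothetical helper (not part of any library): repeats the two
// replacements until a full pass makes no further changes.
function escapeStrayBrackets(string $subject): string
{
    $patterns = [
        '/(>[^<>]*)<([^<>]*<)/', // stray "<" between two tags
        '/(>[^<>]*)>([^<>]*<)/', // stray ">" between two tags
    ];
    $replacements = ['\1&lt;\2', '\1&gt;\2'];

    do {
        // $count receives the total number of replacements in this pass.
        $subject = preg_replace($patterns, $replacements, $subject, -1, $count);
        if ($subject === null) {
            throw new RuntimeException('preg_replace failed');
        }
    } while ($count > 0);

    return $subject;
}

echo escapeStrayBrackets('<p> a < b < c </p>'), "\n";
```

The result should then be something SimpleXML can load, as long as the input has no other well-formedness problems. Pseudo-balanced input such as <p> a <> b </p> is still passed through unchanged, so the limitation mentioned above remains.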