Stripping microdata from XHTML with PHP - using RegEx?
First: I've read the general "don't use RegEx on XHTML" arguments, like this one: "RegEx match open tags except XHTML self-contained tags", and I do understand how RegEx will fail on nested XHTML or XML nodes.
I don't see why manipulating only the attributes of an XML node should break when using RegEx, so there seem to be exceptions to the general rule. Attributes are always contained in a single node starting with a < and ending with a >; any other < or > in between would break the XML, so such characters can't occur.
Now I'd like to clean an XHTML string of any microdata it might contain, that is, any of the attributes itemscope, itemtype, itemprop, itemid and itemref. Something like this:
...
<body itemscope="itemscope" itemtype="http://schema.org/WebPage">
<div itemprop="maincontent">content</div>
...
What's the best way to do this in PHP?
I'd actually suggest:
- Loading the string with something like SimpleXML.
- Removing the attributes you are interested in flushing.
- Saving it back to a string.
There are a bunch of namespace issues that I'm not sure how you'd handle, but this will probably be cleaner than trying to build one or more regular expressions and making sure you don't miss anything.
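To illustrate the "don't miss anything" worry, here is a minimal sketch. The pattern is my own hypothetical first attempt, not something from the question. It strips double-quoted item* attributes, but it can't tell markup from text content, so it also eats text that merely mentions an attribute:

```php
<?php
// Hypothetical naive pattern: remove any double-quoted item* attribute.
$pattern = '/\s+item(?:scope|type|prop|id|ref)="[^"]*"/';

// The paragraph only *talks about* an attribute; there is no microdata here.
$data = '<p>Write itemprop="name" on the element.</p>';

// The regex has no notion of "inside a tag", so the text node is altered too.
$result = preg_replace($pattern, '', $data);
echo $result, "\n";
```

The visible text loses the itemprop="name" part even though the input contained no microdata. A parser-based approach only touches real attribute nodes.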
EDIT: turns out SimpleXML won't work (limited modification capabilities) but DOM will. Something like this:
$data=<<<END1
<body itemscope="itemscope" itemtype="http://schema.org/WebPage">
<div itemprop="maincontent">content</div>
</body>
END1;
$xml = new DOMDocument();
$xml->loadXML($data);
// find every attribute node named in the query
$xpath = new DOMXPath($xml);
$attr = $xpath->query("//@itemscope|//@itemprop|//@itemtype");
foreach ($attr as $entry) {
    // each $entry is an attribute node; remove it from its element
    $entry->parentNode->removeAttribute($entry->nodeName);
}
echo $xml->saveXML();
You'd have to modify it to include all the attributes you want to remove, and like I said I have no clue how it would deal with namespaces, but it's a start.
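A minimal sketch of that modification, covering all five attributes from the question. The function name stripMicrodata() and the sample values for itemid and itemref are made up for illustration. On the namespace point: unprefixed attributes are in no namespace, so the plain names in the XPath query should still match when the elements carry the XHTML default namespace, as the sample input assumes. Prefixed attributes would need DOMXPath::registerNamespace().

```php
<?php
/**
 * Remove the microdata attributes from a well-formed XHTML/XML string.
 * stripMicrodata() is a hypothetical helper name, not a library function.
 */
function stripMicrodata(string $xhtml): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xhtml)) {
        throw new InvalidArgumentException('Input is not well-formed XML.');
    }

    // Build "//@itemscope|//@itemtype|..." from the attribute names.
    $names = ['itemscope', 'itemtype', 'itemprop', 'itemid', 'itemref'];
    $query = implode('|', array_map(fn($n) => "//@$n", $names));

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($query) as $attr) {
        // $attr is a DOMAttr; detach it from the element that owns it.
        $attr->ownerElement->removeAttributeNode($attr);
    }

    // Serialising the root element skips the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

// Sample input with the XHTML default namespace (an assumption for this demo).
$data = <<<XHTML
<body xmlns="http://www.w3.org/1999/xhtml" itemscope="itemscope" itemtype="http://schema.org/WebPage">
<div itemprop="maincontent" itemid="#x" itemref="y">content</div>
</body>
XHTML;

$clean = stripMicrodata($data);
echo $clean, "\n";
```

This needs PHP 7.4 or later for the arrow function. It also only works on input that parses as XML; tag-soup HTML would need loadHTML() instead, which behaves differently.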