Keeping file offsets while parsing HTML with the DOM?
I want to modify <img src=""> attributes in not-too-malformed HTML (WordPress posts). I know I can take the simple way and use regexes, but I'm afraid people in blue furry suits will come haunt me in my sleep.
If I use the DOM parser to read the HTML and modify the <img> tags, I'm afraid I can't reconstruct the post exactly as it was (with only my modification), because the DOM parser will probably do too much cleanup and may remove essential data. A SAX parser probably cannot handle invalid XML, so that will not work either.
So, is there a middle way, where I can use a DOM parser, but one that knows where each element started, so I can do string replacements or something similar from there? I know some nodes in the DOM tree will not exist in the source document (<b>Some <i>bizarre</b> formatting</i> will probably trigger this), but does this mean it is always impossible? I see there is a DOMNode::getLineNo() function added in PHP 5.3, but I'm using 5.2.x.
If PHP's DOM writes results that are "too clean", you could try the string-based SimpleHTMLDOM and see whether it is more lenient.
However, with formatting as bizarre as you show, I would never entirely trust the parser to do it "right". But try it out, maybe it just skips such stuff.
The DOM library's DOMNode class has a getLineNo() method. I don't entirely see how this works though, seeing as it doesn't provide an offset to go with it. Not sure whether that'll help your use case.
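To make the "middle way" from the question concrete, here is a minimal sketch of the idea: let a tolerant tokenizer report where each <img> start tag begins, then splice only the src value into the original string. It is written in Python rather than PHP, purely to illustrate the technique, using the standard library's html.parser, whose getpos() returns a line and a column (the pair that getLineNo() alone does not give you). The names ImgLocator, SRC_ATTR and rewrite_img_src are made up for this example, and the small regex is a simplification that is only ever applied inside one tag the tokenizer has already isolated. Whether a PHP 5.2 parser exposes equivalent positions is not something this sketch establishes.

```python
import re
from html.parser import HTMLParser

# Finds a src attribute inside ONE already-isolated tag. Exactly one of the
# three groups matches: double-quoted, single-quoted, or unquoted value.
# Simplification: a literal "src=" inside another attribute's value would fool it.
SRC_ATTR = re.compile(
    r"""(?<![\w-])src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


class ImgLocator(HTMLParser):
    """Record the absolute (start, end) offsets of every <img> start tag."""

    def __init__(self, source):
        super().__init__()
        # Offset of the first character of each line, so the (line, column)
        # pair from getpos() can be turned into an absolute string offset.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.spans = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> via handle_startendtag.
        if tag == "img":
            line, col = self.getpos()  # position where this tag starts
            start = self.line_starts[line - 1] + col
            end = start + len(self.get_starttag_text())
            self.spans.append((start, end))


def rewrite_img_src(source, new_src_for):
    """Return source with each <img> src value passed through new_src_for.

    Everything outside the src values is copied through untouched.
    """
    locator = ImgLocator(source)
    locator.feed(source)
    locator.close()

    out = source
    # Work from the last tag to the first so earlier offsets stay valid.
    for start, end in reversed(locator.spans):
        match = SRC_ATTR.search(source, start, end)
        if match is None:
            continue  # <img> without a src attribute: leave it alone
        v_start, v_end = match.span(match.lastindex)
        out = out[:v_start] + new_src_for(source[v_start:v_end]) + out[v_end:]
    return out


if __name__ == "__main__":
    post = (
        "<p>Some <b>bizarre <i>nesting</b> here</i></p>\n"
        '<IMG  src="old/a.png" alt=x >\n'
    )
    print(rewrite_img_src(post, lambda src: src.replace("old/", "new/")))
```

Because no tree is ever built or serialized, the misnested <b>/<i> markup passes through exactly as written; the trade-off is that you get a flat stream of tags rather than a DOM to query.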