Check a phrase is not in an <a> (or other) element

A friend is writing an advertisement script that puts links around select phrases in HTML code.

Naturally, if the phrase is already inside an <a> element (or anywhere else a link isn't allowed, such as inside an element's attribute), he doesn't want the script to write out a link, as it would break validation.

He asked me what I thought. After some bumbling around, I'm asking you all what you think.

Just to clarify, the input is a whole blog post in HTML. Example:

<p>This is a short blog post about ponies!</p>
<p>I have <a href="/ponies">written about ponies before</a>.</p>
<p><img src="/media/ponies.jpg" /></p>

For this example, say I want to replace ponies (any case) with <a href="http://www.ponies.com">ponies</a> (but with the original case).

The output from above should read:

<p>This is a short blog post about <a href="http://www.ponies.com">ponies</a>!</p>
<p>I have <a href="/ponies">written about ponies before</a>.</p>
<p><img src="/media/ponies.jpg" /></p>

We don't need full code but good ideas/regexes are immensely welcome. He's writing this in PHP but language-neutral is fine.


Use an XPath expression that finds text nodes containing the string you want, but only if they're children of acceptable elements:

//p/text()[contains(.,'ponies')]

That will give you text nodes that you know you can fiddle with directly. At this point, you can safely use a regular expression to find the keyword, but you're better off doing a direct search-and-replace instead of a pattern match.

Used against the example input provided, the only match is "This is a short blog post about ponies!". The "ponies" in the <a> element is not matched, because this looks only for direct children of <p> elements. You can refine this to make it match other elements, such as <div>s, or only specific <p> elements (such as those with specific classes).

The nice bonus of using an XPath expression like this is that it only returns text nodes, which means "ponies" will never appear alongside any HTML elements. So you're definitely safe using regular expressions after XPath has done its thing, without invoking Cthulhu's wrath.
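As a rough sketch of that last step (in Python, since language-neutral is fine), here is one way to wrap each hit once you hold the plain string of a single text node. The function name `link_phrase` and the escaping choices are assumptions made for illustration. Note that XPath 1.0's contains() is case-sensitive, so the "any case" requirement has to be handled at this stage (or in the XPath itself).

```python
import re
from html import escape


def link_phrase(text, phrase, href):
    """Wrap every case-insensitive occurrence of `phrase` in a link.

    `text` is assumed to be the decoded content of ONE text node,
    so it contains no markup. The matched text keeps its original case.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Plain text before the hit, re-escaped for output.
        parts.append(escape(text[last:match.start()], quote=False))
        # The hit itself, wrapped in a link, with its original case.
        parts.append('<a href="%s">%s</a>' % (
            escape(href, quote=True),
            escape(match.group(0), quote=False),
        ))
        last = match.end()
    # Whatever follows the final hit.
    parts.append(escape(text[last:], quote=False))
    return "".join(parts)
```

Calling `link_phrase("This is a short blog post about Ponies!", "ponies", "http://www.ponies.com")` leaves the capital "P" in place inside the new link.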

XPath is a common method of dealing with XML and HTML. PHP has many XPath libraries for you to choose from. Odds are you're already using a library that works with XPath.


An alternative method is to find all text nodes in the HTML document, and filter them by what their parents are. The result is exactly the same, but depending on your requirements this way might scale better:

//text()[parent::p and contains(.,'ponies')]

This expression reads like this:

//text()                  # Find all text nodes in the document
    [parent::p            # whose parent is a "p" element
    and                   # and
    contains(.,'ponies')] # contains the string "ponies"
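The same "all text nodes, filtered by parent" idea can be sketched without an XPath engine by walking the DOM. This is only an illustration under stated assumptions: it uses Python's standard xml.dom.minidom, so the post must be well-formed XHTML (real-world HTML would need an HTML-aware parser, for example PHP's DOMDocument with DOMXPath), and the names `link_in_text_nodes` and `ALLOWED_PARENTS` are made up for the example.

```python
import re
from xml.dom import minidom

# Assumption: only text directly inside <p> may receive a link.
ALLOWED_PARENTS = {"p"}


def link_in_text_nodes(xhtml, phrase, href):
    # A post has several top-level <p> elements, so wrap it in one root.
    doc = minidom.parseString("<div>%s</div>" % xhtml)
    # The capture group makes split() keep the matched text (original case).
    pattern = re.compile("(%s)" % re.escape(phrase), re.IGNORECASE)

    def text_nodes(node):
        # Yield every text node below `node`, depth first.
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                yield child
            else:
                yield from text_nodes(child)

    # Collect first, then modify, so the walk is not disturbed.
    for node in list(text_nodes(doc.documentElement)):
        parent = node.parentNode
        if parent.nodeName not in ALLOWED_PARENTS:
            continue  # e.g. text inside an existing <a>
        pieces = pattern.split(node.data)
        if len(pieces) == 1:
            continue  # phrase not present in this text node
        for i, piece in enumerate(pieces):
            if not piece:
                continue
            if i % 2 == 1:  # odd positions are the captured matches
                new = doc.createElement("a")
                new.setAttribute("href", href)
                new.appendChild(doc.createTextNode(piece))
            else:
                new = doc.createTextNode(piece)
            parent.insertBefore(new, node)
        parent.removeChild(node)

    # Serialize the children of the wrapper, dropping the wrapper itself.
    return "".join(c.toxml() for c in doc.documentElement.childNodes)
```

Run against the example post, this links only the "ponies" in the first paragraph; the one inside the existing <a> is skipped because its parent is not a <p>, and the one in the src attribute is never visited because attributes are not text nodes.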


I'm sorry, but I have to say:

Parsing Html The Cthulhu Way
