Check a phrase is not in an <a> (or other) element

2022-12-11 15:10

A friend is writing an advertisement script that puts links around select phrases in HTML code.

Naturally, if the phrase is already inside an <a> element (or somewhere else a link isn't allowed, such as inside an element's attribute), he doesn't want the script to insert a link, as that would break validation.

He asked me what I thought. After some bumbling around, I'm asking you all what you think.

Just to clarify, the input is a whole blog post in HTML. Example:

<p>This is a short blog post about ponies!</p>
<p>I have <a href="/ponies">written about ponies before</a>.</p>
<p><img src="/media/ponies.jpg" /></p>

For this example, say I want to replace ponies (any case) with <a href="http://www.ponies.com">ponies</a> (but with the original case).

The output from above should read:

<p>This is a short blog post about <a href="http://www.ponies.com">ponies</a>!</p>
<p>I have <a href="/ponies">written about ponies before</a>.</p>
<p><img src="/media/ponies.jpg" /></p>

We don't need full code but good ideas/regexes are immensely welcome. He's writing this in PHP but language-neutral is fine.

Use an XPath expression that finds text nodes containing the string you want, but only if they're children of acceptable elements:

//p/text()[contains(.,'ponies')]

That will give you text nodes that you know you can fiddle with directly. At this point, you can safely use a regular expression to find the keyword, but you're better off doing a direct search-and-replace instead of a pattern match.

Used against the example input, the only match is "This is a short blog post about ponies!". The "ponies" in the <a> element is not matched, because the expression looks only at text that is a direct child of a <p> element. Note that contains() is case-sensitive, so as written it finds only lowercase "ponies". You can refine the expression to match other elements, such as <div> elements, or only specific <p> elements (such as those with a particular class).

A nice bonus of an XPath expression like this is that it returns only text nodes. That means "ponies" will never appear alongside any HTML elements, so you're safe using regular expressions after XPath has done its thing, without invoking Cthulhu's wrath.

XPath is a common method of dealing with XML and HTML. PHP has many XPath libraries for you to choose from. Odds are you're already using a library that works with XPath.
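
If no XPath library is at hand, the same idea can be sketched with Python's standard library. This is only an illustration under some assumptions: the post is well-formed, XHTML-style markup (as in the example), and the `<root>` wrapper plus the names `PHRASE`, `LINK_URL`, `link_direct_text` and `link_phrase` are made up for this sketch. Because the XPath subset in `xml.etree.ElementTree` has no `text()` step, the code walks each `<p>`'s `.text` and its children's `.tail` instead, which are the same nodes `//p/text()` would select. A case-insensitive regex stands in for the search-and-replace so the original casing is kept.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical settings for this sketch.
PHRASE = re.compile(r"ponies", re.IGNORECASE)
LINK_URL = "http://www.ponies.com"


def _split_first(text):
    """Return (before, matched, after) for the first match, or None."""
    match = PHRASE.search(text or "")
    if match is None:
        return None
    return text[:match.start()], match.group(0), text[match.end():]


def link_direct_text(parent):
    """Wrap PHRASE in links, touching only text directly inside `parent`."""
    index = 0     # position where the next new <a> would be inserted
    owner = None  # None: inspect parent.text; otherwise inspect owner.tail
    while True:
        text = parent.text if owner is None else owner.tail
        parts = _split_first(text)
        if parts is None:
            # No match in this text node: move on to the next child's tail.
            if index >= len(parent):
                return
            owner = parent[index]
            index += 1
            continue
        before, matched, after = parts
        link = ET.Element("a", {"href": LINK_URL})
        link.text = matched  # keeps the original casing
        link.tail = after    # the remainder is scanned on the next pass
        if owner is None:
            parent.text = before
        else:
            owner.tail = before
        parent.insert(index, link)
        owner = link
        index += 1


def link_phrase(fragment):
    """Link PHRASE in direct text children of <p>; assumes well-formed markup."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    for paragraph in list(root.iter("p")):
        link_direct_text(paragraph)
    return "".join(ET.tostring(child, encoding="unicode") for child in root)
```

Calling `link_phrase()` on the example post should link only the "ponies" in the first paragraph. The text inside the existing <a> element and the `src` attribute of the <img> are never visited, so they cannot be touched.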

An alternative method is to find all text nodes in the HTML document, and filter them by what their parents are. The result is exactly the same, but depending on your requirements this way might scale better:

//text()[parent::p and contains(.,'ponies')]

This expression reads like this:

//text()                  # Find all text nodes in the document
    [parent::p            # whose parent is a "p" element
    and                   # and
    contains(.,'ponies')] # contains the string "ponies"
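
The find-all-text-nodes-and-filter-by-parent variant can also be approximated without XPath at all. The sketch below swaps in Python's `html.parser.HTMLParser`: it streams through the markup, keeps a stack of open elements, and rewrites a text chunk only when its immediate parent is allowed. Everything named here (`PhraseLinker`, `ALLOWED_PARENTS`, `VOID_TAGS`, `link_phrase_streaming`) is hypothetical, and it is a rough sketch, not a full HTML serializer: end tags are re-emitted in lowercase and processing instructions are dropped.

```python
import re
from html.parser import HTMLParser

# Hypothetical settings for this sketch.
PHRASE = re.compile(r"ponies", re.IGNORECASE)
LINK_TEMPLATE = '<a href="http://www.ponies.com">{}</a>'
ALLOWED_PARENTS = {"p"}  # mirrors parent::p in the XPath expression
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PhraseLinker(HTMLParser):
    """Re-emit the input, linking PHRASE only in text whose parent is allowed."""

    def __init__(self):
        # Keep entities such as &amp; exactly as they were written.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open elements
        self.output = []     # pieces of the rewritten document

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        # Emit the tag verbatim; attribute values are never searched.
        self.output.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.output.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching element, tolerating unclosed tags.
            while self.open_tags.pop() != tag:
                pass
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        parent = self.open_tags[-1] if self.open_tags else None
        if parent in ALLOWED_PARENTS:
            data = PHRASE.sub(lambda m: LINK_TEMPLATE.format(m.group(0)), data)
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append(f"&{name};")

    def handle_charref(self, name):
        self.output.append(f"&#{name};")

    def handle_comment(self, data):
        self.output.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.output.append(f"<!{decl}>")


def link_phrase_streaming(html):
    """Return `html` with PHRASE linked in text directly inside allowed parents."""
    parser = PhraseLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.output)
```

Because the check is on the immediate parent only, text inside an <a> (or an <em>, or anything else nested in the <p>) is left alone, just as with the XPath version. Widening `ALLOWED_PARENTS` is the equivalent of adding more `parent::` tests to the expression.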

I'm sorry, but I have to say:

Parsing Html The Cthulhu Way
