Regex to replace a string in HTML but not within a link or heading
I am looking for a regex to replace a given string in an HTML page, but only if the string is not part of a tag itself and does not appear as text inside a link or a heading.
Examples:
Looking for 'replace_me'
<p>You can replace_me just fine</p>
OK
<a href='replace_me'>replace_me</a>
no match
<h3>replace_me</h3>
no match
<a href='/test/'><span>replace_me</span></a>
no match
<p style="background:url('replace_me')">replace_me<h1>replace_me</h1></p>
first no match, second OK, third no match
Thanks in advance!
UPDATE:
I have found a working regex:
\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
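To check the pattern against the examples above, here is a minimal sketch. It assumes Python's `re` module, since the question does not name a regex engine; the function name `replace_outside_links_and_headings` and the replacement text `DONE` are made up for illustration.

```python
import re

# The pattern from the post, unchanged.
# First lookahead: reject the match if a closing tag whose name starts with
# "h" or "a" follows before any other h/a tag is seen.
# Second lookahead: reject the match if it sits inside a tag (a ">" follows
# with no "<" or ">" in between).
PATTERN = re.compile(
    r"\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)"
)


def replace_outside_links_and_headings(html: str, replacement: str) -> str:
    """Replace every occurrence the pattern accepts; leave the rest alone."""
    return PATTERN.sub(replacement, html)


# The five examples from the question.
EXAMPLES = [
    "<p>You can replace_me just fine</p>",
    "<a href='replace_me'>replace_me</a>",
    "<h3>replace_me</h3>",
    "<a href='/test/'><span>replace_me</span></a>",
    "<p style=\"background:url('replace_me')\">replace_me<h1>replace_me</h1></p>",
]

for html in EXAMPLES:
    print(replace_outside_links_and_headings(html, "DONE"))
```

Two caveats are worth keeping in mind. The class `[ha]` matches any tag whose name starts with h or a, so a later `</html>` on the same line also blocks the replacement. And `.` does not cross newlines by default, so the first lookahead only scans to the end of the current line.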
Parsing HTML with regex is a Bad Idea that will drive you insane. Using regex on this is probably not quite as bad, but there are a few things to think about in whatever approach you take:
- How many of these are there in a page?
- How many pages will you be doing this to?
- Will you be hand-checking the output, or is it automated?
- Which programming language(s) are you using for this?
I think the best way is not with a "simple" (read: horrendously complicated) regex, but a proper program that has some logic behind it - unless regular expressions are Turing Complete and someone else can provide a regex to do what you want, of course :)
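As one possible shape for such a program, here is a rough sketch that walks the document with Python's standard `html.parser` and only touches text outside links and headings. Python is an assumption, since the question does not name a language, and `TextReplacer`, `replace_in_html` and `PROTECTED` are hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser

# Elements whose text content must be left untouched.
PROTECTED = {"a", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextReplacer(HTMLParser):
    """Re-emit the HTML, replacing the needle only in unprotected text."""

    def __init__(self, needle, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b" + re.escape(needle) + r"\b")
        self.replacement = replacement
        self.depth = 0  # how many protected elements are currently open
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in PROTECTED:
            self.depth += 1
        # Emit the tag exactly as written, so attributes are never altered.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags have no content, so the depth does not change.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in PROTECTED and self.depth > 0:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            # A function replacement avoids backslash interpretation.
            data = self.pattern.sub(lambda m: self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def replace_in_html(html, needle, replacement):
    parser = TextReplacer(needle, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_in_html("<p>replace_me <a href='#'>replace_me</a></p>", "replace_me", "DONE"))
```

This sketch assumes protected elements are properly closed: an unclosed `<a>` would leave the rest of the page protected. It also treats the contents of `<script>` and `<style>` as ordinary text, so those tags would need adding to `PROTECTED` if they must be skipped.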
I had a similar issue: given a string of HTML, I wanted to replace all instances of the string tio2 with TiO<sub>2</sub>, and ticl4 with TiCl<sub>4</sub>.
This was easy to accomplish with simple string replacement, but there were some instances where the 'needle' strings occurred in domain names, e.g. www.ilovetio2.com and www.tastytastyticl4.info. In these cases the href attributes would be broken by the string replacement.
Rather than mess around trying to find a single, complex regex I opted to make two passes over the HTML string:
- Replace ALL instances with str_ireplace
- Find any href attributes containing <sub>...</sub> and fix them with preg_replace_callback
    public static function subscriptStrings($str)
    {
        // $str is an arbitrary string which may be HTML or plain text

        // Define search / replacements
        $map = [
            'tio2'  => 'TiO<sub>2</sub>',
            'ticl4' => 'TiCl<sub>4</sub>'
        ];

        // Replace ALL instances, paying no heed to their context
        $str = str_ireplace(array_keys($map), array_values($map), $str);

        // Make a second pass, specifically looking for href values
        $str = preg_replace_callback('/href="[^"]+"/', function ($matches) {
            // Return the href value stripped of <sub> tags
            return str_replace(['<sub>', '</sub>'], '', $matches[0]);
        }, $str);

        return $str;
    }
This is not bulletproof: it will fail if the links in question should, for some reason, have <sub> tags in them.