Modify HTML tags based on child's attributes
I'm not sure if this would be possible, but here goes.
We have a page that receives data based on multiple TinyMCE forms. We want to format this data to be compliant (well, mostly compliant) with our XML storage standards. This mostly includes stripping away certain superfluous tags that are created, and re-organizing a few things so that it is compatible with our CSS rendering. Leaving these tags and attributes in creates very noticeable discrepancies between how it looks and how it should look. I've completed most of it using just regular expressions, but have found a situation that I can't seem to create one for.
Essentially, we would have a section of HTML input that would look like
<td colspan="3" width="214" valign="top">
<p align="center">
<strong>
Here is some text.
</strong>
</p>
</td>
which we would like to replace with something like
<td colspan="3" class="center bold">
Here is some text.
</td>
Basically, strip any superfluous attributes from <td> (width and valign, as these exist in our CSS), and then give it the center class because of the child element <p> that has the align attribute center, and the class bold due to the child element <strong>.
Are there any libraries or something similar that may allow me to do this? I'm okay with using regular expressions, as long as they work.
Load the HTML into a DOMDocument, then pass that document to DOMXPath. Use XPath to query the nodes you want, and use the resulting node list and each node's parentNode to navigate to the respective elements. The DOMNode class has many useful properties that PHP can read and evaluate. The rest is all about performing actions based on those properties.
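A minimal sketch of that DOM/XPath approach might look like the following. The function name simplifyCells is hypothetical, and the sketch assumes the input is a complete <table> fragment and that the <p> wrapper contains nothing but the <strong> element, as in the question's sample; anything else inside the <p> would be discarded.

```php
<?php
// Hypothetical helper, not part of any library. The class names
// "center" and "bold" come from the desired output in the question.
function simplifyCells(string $html): string
{
    // Collect libxml warnings instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html); // libxml wraps the fragment in <html><body>
    $xpath = new DOMXPath($doc);

    // Cells that have a <p align="center"> child with a <strong> child.
    $cells = $xpath->query('//td[p[@align="center"]/strong]');

    foreach (iterator_to_array($cells) as $td) {
        $p      = $xpath->query('p[@align="center"][strong]', $td)->item(0);
        $strong = $xpath->query('strong', $p)->item(0);

        // Presentational attributes that the CSS already covers.
        $td->removeAttribute('width');
        $td->removeAttribute('valign');
        $td->setAttribute('class', 'center bold');

        // Move the contents of <strong> up into the cell (inserting an
        // attached node relocates it), then drop the <p> wrapper.
        // Assumption: the <p> holds only the <strong> element.
        while ($strong->firstChild !== null) {
            $td->insertBefore($strong->firstChild, $p);
        }
        $td->removeChild($p);
    }

    // Serialise only what sits inside <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $out;
}
```

Calling simplifyCells() on a table containing the sample cell from the question should leave the <td> with only its colspan and class attributes and the text as its direct content. Cells that do not match the XPath expression are left untouched.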
Since the markup you are searching for is quite specific, well-defined and valid, a regex solution should also work quite well (and may be significantly faster). Assuming that the initial <td> element will always begin with the colspan="3" attribute, and the <p> element will always have just the align="center" attribute, then this tested code snippet should do the trick:
$result = preg_replace(
    '%# Strip unwanted cruft from TinyMCE generated form markup.
    <td\scolspan="3"[^>]+>   # TD element opening tag.
    \s*<p\salign="center">   # P element opening tag.
    \s*<strong>\s*           # STRONG element opening tag.
    (                        # $1: Contents to be preserved.
      [^<]*                  # {normal*} Zero or more non-"<"
      (?:                    # Unroll the loop. (See MRE3)
        <                    # {special}. Match a "<"
        (?!/?strong\b)       # only if not a STRONG tag
        [^<]*                # More {normal*}
      )*                     # Finish {(special normal*)*}
    )                        # End $1: Contents to be preserved.
    \n\s*</strong>           # STRONG element closing tag.
    \s*</p>                  # P element closing tag.
    \s*</td>                 # TD element closing tag.
    %x',
    // Double-quoted so that \n and \t become a real newline and tab;
    // in a single-quoted string they would be emitted literally.
    "<td colspan=\"3\" class=\"center bold\">\n\t\$1\n</td>",
    $text
);
Note that this regex allows the content to contain other inline elements (e.g. <i>, <img>, etc. - anything but a <strong>).