ColdFusion: Strip Image Tag Attributes from an Image Tag
I'm using the YUI Rich Text Editor to format a textarea within a CMS I'm building. The CMS allows the user to upload and include images (it's a blogging application).
Just prior to writing the contents of the textarea to the database, I would like ColdFusion to locate all the image tags and strip out any extraneous attributes beyond just the SRC.
For example:
<IMG src="foo.jpg" title="Foo!" alt="Foo!" height="100" width="100" style="border=1;">
Should come out the other end as:
<IMG src="foo.jpg">
The challenge:
- Image attributes could be any, all or none of those listed in the example.
- The image attributes could be in any order
- The image tag could be located anywhere in the body of text from the textarea
This ColdFusion exercise could be obviated if there were a way to tell YUI not to allow any image attributes, but I'm not sure if that's (easily) possible.
Many kind thanks in advance!
Best regards,
Kris
The most stable and safe way to do this is to load the HTML into a DOM, strip the unwanted bits from it (or, more securely, strip everything but the wanted bits) and convert the result back to a string.
However, to my knowledge, ColdFusion does not provide its own DOM parser for HTML (only one for XML), and the YUI Rich Text Editor does not produce XML (i.e. XHTML). This is a bit unfortunate, but not necessarily a dead end.
- There are plenty of HTML parsers available for Java, and using Java objects from ColdFusion is easy. You could include one of them in your project.
- You could convert your HTML input to XHTML (via jTidy) and then use the built-in XML parser to implement the scrubbing. You can even convert it back to HTML with jTidy after you're done.
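As a rough illustration of the jTidy route, the conversion could look something like the sketch below. It assumes the jTidy jar (class org.w3c.tidy.Tidy) is already on ColdFusion's classpath; the variable names (rawHtml, tidy, inStream, outStream) are made up for this example, and the settings shown are only one plausible configuration, so treat it as a starting point rather than a tested recipe.

```cfm
<!--- ASSUMPTION: the jTidy jar (org.w3c.tidy.Tidy) is on ColdFusion's classpath --->
<cfset rawHtml = 'foo <IMG src="foo.jpg" title="Foo!" alt="Foo!"> bar'>

<!--- configure Tidy to emit XHTML and to stay quiet about warnings --->
<cfset tidy = CreateObject("java", "org.w3c.tidy.Tidy").init()>
<cfset tidy.setXHTML(JavaCast("boolean", true))>
<cfset tidy.setQuiet(JavaCast("boolean", true))>
<cfset tidy.setShowWarnings(JavaCast("boolean", false))>
<!--- drop the DOCTYPE and use numeric entities so the XML parser has less to trip over --->
<cfset tidy.setDocType("omit")>
<cfset tidy.setNumEntities(JavaCast("boolean", true))>

<!--- Tidy reads from and writes to Java streams --->
<cfset inStream = CreateObject("java", "java.io.ByteArrayInputStream").init(CharsetDecode(rawHtml, "utf-8"))>
<cfset outStream = CreateObject("java", "java.io.ByteArrayOutputStream").init()>
<cfset tidy.parse(inStream, outStream)>

<!--- hand the tidied markup to the built-in XML parser (case-sensitive) --->
<cfset xhtml = XmlParse(outStream.toString("UTF-8"), true)>
```

Character encoding is glossed over here; for non-ASCII content you would need to check how your jTidy version expects the input and output encodings to be set.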
To get you started, I've created a sample strict white-listing solution for HTML elements and attributes around the built-in XML parser:
<!--- to serve as an example of what you would get from jTidy --->
<cfset xhtml = XmlParse('
<html xmlns="http://www.w3.org/1999/xhtml">
foo <img src="foo.jpg" title="Foo!" alt="Foo!" height="100" width="100" style="border=1;" />
bar <a href="asdasdad" title="blah" target="baz" onmouseover="doSomethingEvil();">Link</a>
baz <script type="text/javascript">doSomethingEvil();</script>
</html>', true)>
<!--- an easily configurable list of allowed elements and attributes --->
<cfset whiteList = StructNew()>
<cfset whiteList["html"] = "xmlns">
<cfset whiteList["head"] = "">
<cfset whiteList["body"] = "">
<cfset whiteList["img"] = "src">
<cfset whiteList["a"] = "href,title,name">
<!--- delete all attributes that are not white-listed --->
<cfloop collection="#whiteList#" item="tag">
<cfset nodes = XmlSearch(xhtml, "//*[local-name() = '#tag#']")>
<cfloop from="1" to="#ArrayLen(nodes)#" index="i">
<cfset nodeAttrs = nodes[i].XmlAttributes>
<cfloop list="#StructKeyList(nodeAttrs)#" index="attr">
<cfif not ListFind(whiteList[tag], attr)>
<cfset StructDelete(nodeAttrs, attr)>
</cfif>
</cfloop>
</cfloop>
</cfloop>
<!--- delete all elements that are not white-listed --->
<cfset unwantedElements = XmlSearch(xhtml, "//*[not(contains(',#StructKeyList(whiteList)#,', concat(',',local-name(),',')))]")>
<cfloop from="1" to="#ArrayLen(unwantedElements)#" index="i">
<cfset node = unwantedElements[i]>
<cfset node.XmlAttributes["x-delete-flag"] = "true">
<cfset parent = XmlSearch(node, "..")>
<cfif ArrayLen(parent) eq 1 and StructKeyExists(parent[1], "XmlChildren")>
<cfset childNodes = parent[1].XmlChildren>
<cfloop from="#ArrayLen(childNodes)#" to="1" step="-1" index="k">
<cfif StructKeyExists(childNodes[k].XmlAttributes, "x-delete-flag")>
<cfset ArrayDeleteAt(childNodes, k)>
</cfif>
</cfloop>
</cfif>
</cfloop>
When done, the contents of xhtml look like this:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
foo <img src="foo.jpg"/>
bar <a href="asdasdad" title="blah">Link</a>
baz
</html>
A few explanations:
- There are several descriptions and UDFs around the web that show how to get Tidy to work in ColdFusion; just look for them.
- XML DOM handling is cumbersome in ColdFusion. It's neither beautiful nor elegant, but it is still better than trying to use (God forbid) regular expressions to achieve the same effect. I strongly discourage you from using them for this problem.
- Use <cfdump> to get a feeling for how ColdFusion represents XML documents and to understand what's going on in my code.
- The second bit (removing all non-white-listed elements) is a bit hairy. Apparently it is impossible to delete a node from ColdFusion more elegantly, since ColdFusion XML nodes expose neither the parentNode() nor the removeChild() DOM methods. This implementation is based on Ben Nadel's approach to deleting DOM nodes in CF. It works, but I am painfully aware that it sucks. Sorry for that. :-\
- XPath: The expression "//*[not(contains(',#StructKeyList(whiteList)#,', concat(',',local-name(),',')))]" selects all nodes whose local name (i.e. without looking at the XML namespace) is not contained in the list of allowed names. In detail:
  - // is a shorthand for "anywhere in the document".
  - * means "any element node".
  - The square brackets denote the condition. The extra commas are to make sure only full matches are taken into account; otherwise contains() would return "a" as a match in "abbr", for example.
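To see why the extra commas matter, here is a small self-contained comparison. The document and the allowed list ("abbr,b") are invented for the demonstration; only XmlParse, XmlSearch and ArrayLen are used.

```cfm
<!--- hypothetical test document: "a" is NOT on the allowed list, "abbr" is --->
<cfset doc = XmlParse("<root><a/><abbr/><i/></root>")>
<cfset allowed = "abbr,b">

<!--- naive check: "a" is a substring of "abbr,b", so <a> wrongly counts as allowed --->
<cfset naive = XmlSearch(doc, "/root/*[not(contains('#allowed#', local-name()))]")>

<!--- comma-delimited check: ",a," does not occur in ",abbr,b,", so <a> is rejected --->
<cfset strict = XmlSearch(doc, "/root/*[not(contains(',#allowed#,', concat(',', local-name(), ',')))]")>

<cfoutput>
naive rejects #ArrayLen(naive)# element(s), strict rejects #ArrayLen(strict)# element(s)
</cfoutput>
```

The naive expression rejects only the i element, while the comma-delimited one rejects both a and i, which is the behaviour a strict white list needs.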
I would probably do this in a three-step process. Find all the image tags, extract the data from them, and then replace the originals.
First, identify the image tags. I would use a regex to do this -- something like
<img [^>]+>
This basically says "start the image tag, grab everything except a closing bracket, then a closing bracket". Use this in combination with reFind or reFindNoCase to get the location and length of each image tag.
Next, since you only want to keep the src, you can find it using a similar method. Grab the tag using the mid() function and the location/length from above. Now, get the src using the regex
src="[^"]+"
Now, you can loop over your results, replacing each image tag with an image tag that has just the appropriate src.
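A minimal sketch of those three steps might look like this. It assumes the attribute to keep is src (the one an image tag actually uses) and that it is double-quoted; the variable names are invented for the example, and, as with any regex-based HTML handling, unusual markup (a ">" inside an attribute value, single-quoted attributes) will slip through.

```cfm
<cfset input = 'foo <IMG src="foo.jpg" title="Foo!" alt="Foo!" height="100" width="100"> bar'>
<cfset output = "">
<cfset start = 1>

<cfloop condition="start lte Len(input)">
    <!--- step 1: locate the next image tag --->
    <cfset tagMatch = REFindNoCase("<img [^>]+>", input, start, true)>
    <cfif tagMatch.pos[1] eq 0>
        <cfbreak>
    </cfif>
    <cfset imgTag = Mid(input, tagMatch.pos[1], tagMatch.len[1])>

    <!--- copy the text in front of the tag unchanged --->
    <cfif tagMatch.pos[1] gt start>
        <cfset output = output & Mid(input, start, tagMatch.pos[1] - start)>
    </cfif>

    <!--- step 2: pull the src attribute out of the tag --->
    <cfset srcMatch = REFindNoCase('src="[^"]+"', imgTag, 1, true)>

    <!--- step 3: write a tag that carries only src (tags without a src are dropped) --->
    <cfif srcMatch.pos[1] gt 0>
        <cfset output = output & "<img " & Mid(imgTag, srcMatch.pos[1], srcMatch.len[1]) & ">">
    </cfif>

    <!--- continue searching after the tag just handled --->
    <cfset start = tagMatch.pos[1] + tagMatch.len[1]>
</cfloop>

<!--- append whatever follows the last image tag --->
<cfif start lte Len(input)>
    <cfset output = output & Mid(input, start, Len(input) - start + 1)>
</cfif>
```

For the sample input, output should end up as the original text with the image tag reduced to its src attribute.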