Parse HTML "style" attribute using Java
I have HTML code parsed into an org.w3c.dom.Document. I need to check the style attribute of every tag, parse it, change some CSS properties, and put the modified style definition back into the attribute.
Is there a standard way to parse a style attribute? How can I use the classes and interfaces from the org.w3c.dom.css package?
I need a Java solution.
If you want a way to do this without any dependencies, you can use the javax.swing.text.html package classes to get you most of the way there:
import javax.swing.text.AttributeSet;
import javax.swing.text.html.*;

// Parse an inline declaration string; the margin shorthand is expanded per side
StyleSheet styleSheet = new StyleSheet();
AttributeSet dec = styleSheet.getDeclaration("margin:2px;padding:3px");
Object marginLeft = dec.getAttribute(CSS.Attribute.MARGIN_LEFT);
String marginLeftString = marginLeft.toString(); // "2px"
The getAttribute call returns a CSS.CssValue, which is unfortunately not public, hence the need to convert it to a String. Also, it won't handle em units. It is fairly smart about the various style properties, though. Not ideal, but it avoids dependencies.
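As a rough sketch of how that approach could be wrapped for lookups by property name, the helper below resolves the name with CSS.getAttribute and reads the value from the parsed declaration. The class name SwingStyleReader and the method read are made up for illustration, and the sketch assumes the Swing classes (java.desktop module) are available at runtime.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.html.CSS;
import javax.swing.text.html.StyleSheet;

public class SwingStyleReader {
    // Hypothetical helper: returns the value of one CSS property from an
    // inline style string, or null if Swing does not know the property
    // or the declaration does not set it.
    static String read(String style, String property) {
        AttributeSet declaration = new StyleSheet().getDeclaration(style);
        CSS.Attribute key = CSS.getAttribute(property);
        if (key == null) {
            return null; // property name not supported by javax.swing.text.html.CSS
        }
        Object value = declaration.getAttribute(key);
        // The value object is a non-public type, so only toString() is usable here.
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        String style = "margin:2px;padding:3px";
        System.out.println(read(style, "margin-left"));
        System.out.println(read(style, "padding-top"));
    }
}
```

Because this only reads values, writing a modified style back would still need a separate serialization step.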
First, I would check out the classes in the javax.xml packages. The javax.xml.parsers package contains parsers for two styles of parsing: SAXParser and DocumentBuilder. It sounds like you want the DocumentBuilder to create a DOM. You can either traverse the DOM manually (slow and painful), or you can use the XPath standard to look up elements in the DOM. Java support for that is in javax.xml.xpath.
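// XPath.compile is an instance method, so obtain an XPath from the factory first
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//@style");
Object results = expr.evaluate(dom, XPathConstants.NODESET);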
It's your responsibility to cast the results to a NodeList and iterate properly, but it's the most direct way to get at what you want. Check out Java's DOM API for more information about reading and changing values.
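A minimal sketch of the full round trip might look like the following: it parses a well-formed XHTML string, selects every style attribute with //@style, casts the result to a NodeList, and rewrites each attribute value in place. The class name StyleAttributeFinder and both method names are invented for this example, and it assumes the input is well-formed XML, which real-world HTML often is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StyleAttributeFinder {
    // Hypothetical helper: builds a DOM from a well-formed XHTML string.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    // Hypothetical helper: appends `extra` to every style attribute in the
    // document and returns how many attributes were changed.
    static int appendToAllStyles(Document dom, String extra) {
        try {
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .compile("//@style")
                    .evaluate(dom, XPathConstants.NODESET);
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr style = (Attr) attrs.item(i); // each match is an attribute node
                style.setValue(style.getValue() + extra);
            }
            return attrs.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse(
                "<html><body><p style=\"margin:2px\">a</p>"
                + "<div style=\"color:red\">b</div></body></html>");
        System.out.println(appendToAllStyles(dom, ";padding:3px"));
    }
}
```

Appending text is only a placeholder for the real modification; in practice the attribute value would be parsed and rewritten before calling setValue.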
I don't believe there is any support for a CSS parser built into Java, but you can look at these projects:
- http://www.w3.org/Style/CSS/SAC/Overview.en.html
- http://cssparser.sourceforge.net/
These may help you with your goals. NOTE: the Batik CSS parser is part of the larger Apache Batik project (http://xmlgraphics.apache.org/batik/index.html), which may include more than you need, but it has a corporate-friendly license.
I'm not sure I completely understand your requirements, but basically, you'll have to:
- Read the stylesheet(s) and extract the CSS rules.
- Read the HTML page(s) and find the attributes.
- Substitute the new CSS properties for the old CSS properties.
- Write the HTML page(s).
It looks like you would use the CSSStyleSheet interface to extract the CSS rules from the stylesheet(s).
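For the substitution step applied to a single style attribute value, a dependency-free sketch could parse the declarations into an ordered map, replace the properties of interest, and serialize the map back to a string. This is not the CSSStyleSheet interface mentioned above; the class InlineStyle and its methods are hypothetical, and the naive split on ";" assumes no semicolons appear inside values (for example inside url(...) or quoted strings).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class InlineStyle {
    // Hypothetical helper: splits "name: value; name: value" into an ordered map.
    // Assumption: values contain no ';' characters.
    static Map<String, String> parse(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed fragments
            }
            String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                props.put(name, value);
            }
        }
        return props;
    }

    // Hypothetical helper: turns the map back into a style attribute value.
    static String serialize(Map<String, String> props) {
        return props.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(";"));
    }

    public static void main(String[] args) {
        Map<String, String> props = parse("margin: 2px; padding:3px;");
        props.put("margin", "4px"); // substitute the new value for the old one
        System.out.println(serialize(props)); // prints "margin:4px;padding:3px"
    }
}
```

A LinkedHashMap keeps the original declaration order, so untouched properties come back out in the same sequence.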