How to efficiently replace characters in an XML document in Java?
I'm looking for a neat and efficient way to replace characters in an XML document. There is a replacement table defined for almost 12,000 UTF-8 characters. Most of them are to be replaced by a single character, but some must be replaced by two or even three characters (e.g. Greek theta should become TH). The documents can be bulky (100MB+). How can I do this in Java? I came up with the idea of using XSLT, but I'm not sure it is the best option.
String.replace(..) is very slow in my experience. I used to process 100MB KML files with that API and the performance was just bad. Then I pre-compiled the regular expression using Pattern.compile(..) and reused it, and that worked a whole lot faster.
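A minimal sketch of the pre-compiled pattern idea, assuming the replacement table fits in a `Map<Character, String>` (so only characters in the Basic Multilingual Plane). The class name `TableReplacer` and its methods are made up for illustration. It builds one character class from all table keys, compiles it once, and then replaces every match in a single pass instead of calling `String.replace` once per table entry:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one compiled pattern for the whole replacement table.
class TableReplacer {
    private final Map<Character, String> table;
    private final Pattern pattern;

    TableReplacer(Map<Character, String> table) {
        if (table.isEmpty()) {
            throw new IllegalArgumentException("replacement table must not be empty");
        }
        this.table = table;
        // Build a character class such as [\u0398\u00E9...]; the \\uXXXX form
        // avoids having to escape regex metacharacters by hand.
        StringBuilder cls = new StringBuilder("[");
        for (char c : table.keySet()) {
            cls.append(String.format("\\u%04X", (int) c));
        }
        cls.append(']');
        this.pattern = Pattern.compile(cls.toString());
    }

    // Replaces every character that has a table entry, in one pass over the input.
    String replace(CharSequence input) {
        Matcher m = pattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = table.get(m.group().charAt(0));
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

For a 100MB document you would probably not want to load the whole file into one string; calling `replace` on chunks or lines as they are read is one way around that. Note also that running this over raw XML text touches element names and markup too, not just the text content.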
Have a look at SAX, which lets you see each individual part of the XML document as it passes by. You can then act on the text nodes and do the manipulation you need without holding the whole document in memory.
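Here is a rough sketch of how the SAX approach could look, using only classes that ship with the JDK. The class name `CharReplacingFilter` and the `transform` helper are invented for this example, and the table is again assumed to be a `Map<Character, String>`. The filter sits between the parser and an identity `Transformer` that writes the events back out, and it rewrites text content in `characters(..)` as it streams by:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical SAX filter that replaces characters in text nodes only.
class CharReplacingFilter extends XMLFilterImpl {
    private final Map<Character, String> table;

    CharReplacingFilter(XMLReader parent, Map<Character, String> table) {
        super(parent);
        this.table = table;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        StringBuilder sb = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            String replacement = table.get(ch[i]);
            if (replacement != null) {
                sb.append(replacement);   // may be one, two or three characters
            } else {
                sb.append(ch[i]);
            }
        }
        char[] out = sb.toString().toCharArray();
        super.characters(out, 0, out.length);  // pass the rewritten text downstream
    }

    // Streams the document from 'in' to 'out', applying the table to text content.
    static void transform(InputStream in, OutputStream out, Map<Character, String> table)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        CharReplacingFilter filter = new CharReplacingFilter(reader, table);
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }
}
```

Because the replacement works one character at a time, it does not matter that the parser may deliver a single text node in several `characters(..)` calls. This sketch leaves attribute values untouched; if those need replacing as well, `startElement(..)` would have to be overridden in a similar way.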
The problem with XSLT is that most implementations need the whole input tree in memory, which typically takes around 10 times the size of the file on disk. The only processor I know of that can do streaming XSLT is the commercial edition of the Saxon XSLT transformer (but that would be perfect for your needs).