Replacing characters in a non well-formed XML body
In some Java code that I'm working on, I sometimes have to deal with XML that is not well-formed (held in a Java String), such as:
<root>
  <foo>
    bar & baz < quux
  </foo>
</root>
Since this XML will eventually need to be unmarshalled (using JAXB), it will obviously throw an exception upon unmarshalling as it stands.
What's the best way to replace the & and the < with their character entities? For &, it's as easy as:
xml.replaceAll("&", "&amp;")
However, for the < symbol, it's a bit tricky since obviously I don't want to replace the < that's used for the XML tag opening 'bracket'.
Other than scanning the string and manually replacing < in the XML body with &lt;, what other option can you suggest?
Frankly, the best way to repair malformed XML is to send it back to whoever produced it and ask them to send you well-formed XML instead. You show a trivial example, which potentially could have a solution, but a general method for repairing malformed XML is going to be a horrendous job.
And since XML parsers aren't required to handle malformed XML, your parser isn't required to either. Just don't do it.
I guess you need more advanced logic. It is best to first locate all real tags using a regular expression like "(<[^>]+>)" and only replace text outside those matches, but obviously you won't be able to use a replaceAll method then. It will be more of a plumbing job...
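The "plumbing" described above could look roughly like the sketch below. It is only a heuristic under stated assumptions, not a general repair tool: the class name XmlTextEscaper and both patterns are made up for illustration, a "real" tag is assumed to start with a letter or underscore right after the < (optionally preceded by /, ? or !), and comments and CDATA sections are not handled. Body text that happens to look like a tag (for example "a <b and c> d") would be misread as one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes &, < and > only in the text between things that look like tags.
class XmlTextEscaper {

    // Assumption: a real tag is '<', an optional '/', '?' or '!', a name start character,
    // and then anything except '<' or '>' up to the closing '>'.
    private static final Pattern TAG = Pattern.compile("<[/?!]?[A-Za-z_][^<>]*>");

    // An '&' that does not already start an entity or character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes one run of body text; existing references such as &amp; are left alone.
    static String escapeText(String text) {
        String s = BARE_AMP.matcher(text).replaceAll("&amp;");
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Copies tag-like matches through unchanged and escapes everything between them.
    static String repair(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(xml.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        out.append(escapeText(xml.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("<root><foo>bar & baz < quux</foo></root>"));
    }
}
```

The negative lookahead on & is there because a blanket replaceAll("&", "&amp;") would double-escape input that already contains correct entities.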
Though it's an old post, I thought this might help somebody else. I had the same requirement and was able to resolve it using the following code.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XMLTest {

    public static void main(String[] args) {
        String xml = "<xml><body>" +
                "<message>something < between <<<  somthing </message>" +
                "<text> testing  >> > testing </text>" +
                "</body></xml>";

        // Pass 1: escape stray '>' characters that appear in the text after a tag's closing '>'.
        Pattern replaceGTPattern = Pattern.compile(">[^<](.[^<]*)(>)+");
        Matcher m = replaceGTPattern.matcher(xml);
        String replacement;
        StringBuffer intermXml = new StringBuffer();
        while (m.find()) {
            // Keep the leading '>' (the end of the tag) and escape every other '>' in the match.
            replacement = ">" + m.group(0).substring(1).replaceAll(">", "&gt;");
            // quoteReplacement guards against '$' or '\' in the text being treated as group references.
            m.appendReplacement(intermXml, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(intermXml);

        // Pass 2: escape stray '<' characters that appear in the text before the next tag's '<'.
        Pattern replaceLTPattern = Pattern.compile("<(.[^>]*)(<)+");
        m = replaceLTPattern.matcher(intermXml);
        StringBuffer finalXml = new StringBuffer();
        while (m.find()) {
            // Escape every '<' except the last one, which opens the following tag.
            replacement = m.group(0).substring(0, m.group(0).length() - 1)
                    .replaceAll("<", "&lt;").concat("<");
            m.appendReplacement(finalXml, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(finalXml);

        System.out.println(finalXml);
    }
}
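The code above only prints the repaired string. Before handing it to JAXB, it may be worth checking that it actually parses. Here is a minimal sketch of such a check using the JDK's own DOM parser; the class name WellFormedCheck and the method isWellFormed are invented for illustration and are not part of the original answer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, so it is not well-formed.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Not expected for an in-memory string with a default parser configuration.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<root><foo>bar &amp; baz &lt; quux</foo></root>"));
        System.out.println(isWellFormed("<root><foo>bar & baz < quux</foo></root>"));
    }
}
```

If the check fails after the regex passes, that is a sign the input contains a case the patterns do not cover, and falling back to rejecting the document is probably safer than guessing further.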
 