Encoding problem
I have to parse the content I get from the web and it can contain special characters. In this case the content string appears like the following:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<id>1</id>
<price>2.14</price>
<title>test ž test</title>
When the content above is passed to the method characters(), in the class which is extended from org.xml.sax.helpers.DefaultHandler:
public class ProductsXMLHandler extends DefaultHandler {
...
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
String elementValue = new String(ch, start, length);
...
}
I noticed that the text test ž test is split into three arrays: 'test ', 'ž' and ' test', so elementValue is not equal to test ž test, which should be the result. Does anyone know how to solve the problem?
Is it necessary to recode the source string:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<id>1</id>
<price>2.14</price>
<title>test ž test</title>
before it is passed to XML handler class?
Thank you!
As Jon Skeet said in his answer, characters is called multiple times. What you should do is the following:
- in startElement, create a StringBuffer, and note (in a boolean value, for example) whether you are in the tag you are looking for.
- in characters, if you are in the right tag (if the boolean set earlier is true), append the characters to the StringBuffer.
- in endElement, if you are leaving the right tag (see the boolean, same thing as earlier), take the content of the StringBuffer and voilà! Here is your complete string. Don't forget to empty the StringBuffer after that.
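The steps above could be sketched roughly as follows. This is only a minimal illustration, not the asker's actual handler: it assumes the element of interest is title, uses a StringBuilder in place of the StringBuffer mentioned above (either works here), and the titles list and the parseTitles helper are hypothetical names added for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ProductsXMLHandler extends DefaultHandler {
    // Hypothetical result holder: one entry per <title> element.
    private final List<String> titles = new ArrayList<>();
    // Collects the chunks delivered by successive characters() calls.
    private final StringBuilder buffer = new StringBuilder();
    // True while the parser is between <title> and </title>.
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {
        if ("title".equals(qName)) {
            inTitle = true;
            buffer.setLength(0); // start from an empty buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // May be called several times for one element, so append
        // instead of treating each call as the complete value.
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("title".equals(qName)) {
            titles.add(buffer.toString()); // the complete text
            buffer.setLength(0);           // empty the buffer for reuse
            inTitle = false;
        }
    }

    List<String> getTitles() {
        return titles;
    }

    // Hypothetical convenience method: parses an XML string with the
    // default (non-namespace-aware) JAXP SAX parser.
    static List<String> parseTitles(String xml) throws Exception {
        ProductsXMLHandler handler = new ProductsXMLHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitles();
    }
}
```

Calling ProductsXMLHandler.parseTitles(xml) on a complete document should then yield the whole title text regardless of how many chunks the parser chose to deliver. The comparison against "title" uses qName because the default parser factory is not namespace-aware; with a namespace-aware parser, localName would be the value to check.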
Do you mean that characters is being called three times? If so, you just need to make your code handle that - the parser is perfectly at liberty to do this. You shouldn't assume that you'll get all character data in one call.
From the documentation for DocumentHandler.characters():
SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.
I don't think you can do anything about it; this is how the SAX API is specified. Specifically, from http://java.sun.com/javase/6/docs/api/org/xml/sax/ContentHandler.html#characters(char[],%20int,%20int):
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
(My emphasis)