Java XML parser blocks (very unusual and strange!)
I have a very strange case:
I tried to parse several XHTML-conformant websites using the default Java XML parser(s). The test blocks during parsing (not during downloading).
Can this be a bug, or does the parser try to download additional referenced resources during parsing (which would be a "nice" anti-feature)?
With simple data, it works. (TEST1)
With complex data, it blocks. (TEST2) (I tried en.wikipedia.org and validator.w3.org)
When blocking occurs, CPU is idle.
Tested with JDK6 and JDK7, same results.
Please see the test case; the source is ready for copy + paste + run.
Source
    import java.io.*;
    import java.net.*;
    import java.nio.charset.*;
    import javax.xml.parsers.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.stream.*;
    import org.w3c.dom.*;

    public class _XmlParsingBlocks {

        private static Document parseXml(String data)
                throws Exception {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            DOMResult out = new DOMResult(b.newDocument());
            t.transform(new StreamSource(new StringReader(data)), out);
            return (Document) out.getNode();
        }

        private static byte[] streamToByteArray(InputStream is)
                throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            for (;;) {
                byte[] buffer = new byte[256];
                int count = is.read(buffer);
                if (count == -1) {
                    is.close();
                    break;
                }
                baos.write(buffer, 0, count);
            }
            return baos.toByteArray();
        }

        private static void test(byte[] data)
                throws Exception {
            String asString = new String(data, Charset.forName("UTF-8"));
            System.out.println("===== PARSING STARTED =====");
            Document doc = parseXml(asString);
            System.out.println("===== PARSING ENDED =====");
        }

        public static void main(String[] args)
                throws Exception {
            {
                System.out.println("********** TEST 1");
                test("<html>test</html>".getBytes("UTF-8"));
            }
            {
                System.out.println("********** TEST 2");
                URL url = new URL("http://validator.w3.org/");
                URLConnection connection = url.openConnection();
                InputStream is = connection.getInputStream();
                byte[] data = streamToByteArray(is);
                System.out.println("===== DOWNLOAD FINISHED =====");
                test(data);
            }
        }
    }
Output
********** TEST 1
===== PARSING STARTED =====
===== PARSING ENDED =====
********** TEST 2
===== DOWNLOAD FINISHED =====
===== PARSING STARTED =====
[here it blocks]
W3C have in the last few months started blocking requests for common DTDs such as the XHTML DTD - they can't cope with the traffic generated. If you're not using a proxy server that caches the DTDs, you will need to use an EntityResolver or catalog to redirect the references to a local copy.
Looking at the page you downloaded, it contains some more http: URLs.
This is the start:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
I could imagine that the XML parser is trying to download the referenced DTD here, to be able to validate the XML content.
Try adding the preamble to your simple document, or removing it from your complex one, to see whether this changes anything.
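One way to check this hypothesis without touching the network is to register an EntityResolver that only records what the parser asks for. The following is a minimal sketch, not part of the original test case: the class name DtdRequestLogger and its method are hypothetical, it uses a DocumentBuilder rather than the Transformer from the question, and the lambda requires Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical diagnostic helper (name and shape are assumptions).
final class DtdRequestLogger {

    /** Parses the given XML and returns every system ID the parser asked to resolve. */
    static List<String> requestedSystemIds(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Hand back empty content instead of null, so the parser
            // does not fall back to opening the URL itself.
            return new InputSource(new ByteArrayInputStream(new byte[0]));
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return requested;
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
            + " \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        // Document with the XHTML preamble, then the simple TEST 1 document.
        System.out.println(requestedSystemIds(xhtml));
        System.out.println(requestedSystemIds("<html>test</html>"));
    }
}
```

If the DTD URL shows up in the first list and the second list stays empty, the DOCTYPE is what triggers the extra request.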
Switch the parser to non-validating and see whether this helps. (Alternatively, there are some options to configure how the parser behaves; setURIResolver looks promising, for example.)
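Note that a non-validating Xerces-based parser (which is what the JDK bundles) still reads the external DTD by default, so setValidating(false) on its own may not be enough. A sketch of one such configuration option, assuming the Xerces-specific feature "http://apache.org/xml/features/nonvalidating/load-external-dtd" is supported by the parser in use (other implementations may reject it with a ParserConfigurationException); the class name NoExternalDtd is hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; relies on a Xerces-specific feature id.
final class NoExternalDtd {

    // Implementation-specific: honored by Xerces and the JDK's built-in parser.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Parses XML without fetching the DTD named in its DOCTYPE. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Tell the non-validating parser to skip the external DTD subset.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<!DOCTYPE html SYSTEM \"http://example.invalid/missing.dtd\">"
            + "<html><body>test</body></html>";
        // The DTD URL is unreachable on purpose; parsing should not try to open it.
        System.out.println(parse(xml).getDocumentElement().getTagName());
    }
}
```

The trade-off is that entities declared only in the skipped DTD are no longer defined for the parser, so this fits best when the documents do not rely on them.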
Solution: prefetch the DTDs (or better: store them offline) and serve them from a custom EntityResolver.
When no external XML entities are expected to be used, an empty InputSource can be returned (see the inner enum). Otherwise, a prepared mapping of DTD URI -> byte array can be used to avoid downloading the DTDs.
Class
    import java.io.*;
    import java.util.*;
    import javax.annotation.*;
    import org.xml.sax.*;

    public final class PrefetchedEntityResolver
            implements EntityResolver {

        /**
         * NOTE: {@link #RETURN_NULL} seems to cause default behavior
         * (which is: downloading the DTD);
         * use {@link #RETURN_EMPTY_DATA} to ensure "offline" behavior
         * (which could lead to entity parsing errors).
         */
        public static enum NoMatchBehavior {
            THROW_EXCEPTION, RETURN_NULL, RETURN_EMPTY_DATA;
        }

        // Maps a DTD's system ID (its URI) to the prefetched DTD bytes.
        private final SortedMap<String, byte[]> prefetched;
        private final NoMatchBehavior noMatchBehavior;

        public PrefetchedEntityResolver(NoMatchBehavior noMatchBehavior,
                @Nullable SortedMap<String, byte[]> prefetched) {
            this.noMatchBehavior = noMatchBehavior;
            this.prefetched = new TreeMap<>(prefetched == null
                    ? Collections.<String, byte[]>emptyMap() : prefetched);
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            byte[] data = prefetched.get(systemId);
            if (data == null) {
                switch (noMatchBehavior) {
                    case RETURN_NULL:
                        return null;
                    case RETURN_EMPTY_DATA:
                        return new InputSource(new ByteArrayInputStream(new byte[]{}));
                    case THROW_EXCEPTION:
                        throw new SAXException("no prefetched DTD found for: " + systemId);
                    default:
                        throw new Error("unsupported: " + noMatchBehavior.toString());
                }
            }
            return new InputSource(new ByteArrayInputStream(data));
        }
    }
Usage
    public static Document parseXml(byte[] data)
            throws Exception {
        DocumentBuilderFactory df = DocumentBuilderFactory.newInstance();
        df.setValidating(false);
        df.setXIncludeAware(false);
        df.setCoalescing(false);
        df.setExpandEntityReferences(false);
        DocumentBuilder b = df.newDocumentBuilder();
        b.setEntityResolver(new PrefetchedEntityResolver(
                PrefetchedEntityResolver.NoMatchBehavior.RETURN_EMPTY_DATA,
                /* pass some prepared SortedMap<String, byte[]> */));
        ByteArrayInputStream bais = new ByteArrayInputStream(data);
        return b.parse(bais);
    }
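To illustrate the prepared mapping end to end, here is a self-contained sketch that keys an in-memory stand-in DTD by its system ID and resolves it with a lambda instead of the class above. Everything named here (OfflineDtdDemo, parseOffline, the one-line DTD that only declares nbsp) is an assumption made for illustration; a real setup would load the actual xhtml1-strict.dtd and the entity files it references from local storage. Requires Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the stand-in DTD is NOT the real XHTML DTD.
final class OfflineDtdDemo {

    static final String XHTML_STRICT =
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";

    /** Parses XML, answering DTD requests only from the prefetched map. */
    static Document parseOffline(String xml, Map<String, byte[]> prefetched)
            throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            byte[] dtd = prefetched.get(systemId);
            // Unknown system IDs get empty content, so nothing is downloaded.
            return new InputSource(new ByteArrayInputStream(dtd != null ? dtd : new byte[0]));
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> prefetched = new TreeMap<>();
        // Minimal stand-in that declares the single entity used below.
        prefetched.put(XHTML_STRICT,
            "<!ENTITY nbsp \"&#160;\">".getBytes(StandardCharsets.UTF_8));

        String xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \""
            + XHTML_STRICT + "\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>a&nbsp;b</p></body></html>";

        Document doc = parseOffline(xhtml, prefetched);
        // The entity is expanded from the local stand-in DTD.
        System.out.println(doc.getDocumentElement().getTextContent().length());
    }
}
```

Serving a DTD that declares the needed entities avoids the entity parsing errors that an empty replacement can cause.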
Perhaps your "count == -1" condition needs to become "count <= 0" ?