Parse large XML files over a network
I did some quick searching on the site and couldn't find the answer I was looking for. That being said, what are some best practices for passing large XML files across a network? My thought is to stream manageable chunks across the network, but I am looking for other approaches and best practices. I realize that "large" is a relative term, so I will let you choose an arbitrary value to be considered large.
In case there is any confusion, the question is: "What are some best practices for sending large XML files across networks?"
Edit:
I am seeing a lot of talk about compression. Is there a particular compression algorithm that could be used, and how would the files be decompressed on the other end? I have no desire to roll my own when I know there are proven algorithms out there. Also, I appreciate the responses so far.
Compressing and reducing XML size has been an issue for more than a decade now, especially in mobile communications where both bandwidth and client computation power are scarce resources. The final solution used in wireless communications, which is what I prefer to use if I have enough control on both the client and server sides, is WBXML (WAP Binary XML Spec).
This spec defines how to convert the XML into a binary format which is not only compact but also easy to parse. This is in contrast to general-purpose compression methods, such as gzip, that require high computational power and memory on the receiver side to decompress and then parse the XML content. The only downside to this spec is that an application token table must exist on both sides: a statically defined code table holding binary values for all possible tags and attributes in the application-specific XML content. Today, this format is widely used in mobile communications for transmitting configuration and data in most applications, such as OTA configuration and Contact/Note/Calendar/Email synchronization.
For transmitting large XML content using this format, you can use a chunking mechanism similar to the one proposed in SyncML protocol. You can find a design document here, describing this mechanism in section "2.6. Large Objects Handling". As a brief intro:
This feature provides a means to synchronize an object whose size exceeds what can be transmitted within one message (e.g. the maximum message size, declared in the MaxMsgSize element, that the target device can receive). This is achieved by splitting the object into chunks that will each fit within one message and by sending them contiguously. The first chunk of data is sent with the overall size of the object and a MoreData tag signaling that more chunks will be sent. Every subsequent chunk is sent with a MoreData tag, except for the last one.
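The splitting scheme described above is easy to sketch independently of SyncML itself. Here is a minimal Python illustration; the `MAX_MSG_SIZE` value and the dict layout are my own placeholders, not part of the spec:

```python
# Hedged sketch: SyncML-style large-object chunking.
# The first chunk carries the total object size; every chunk except the
# last sets a MoreData flag (analogous to the MoreData tag in the spec).
MAX_MSG_SIZE = 4096  # illustrative limit, stands in for MaxMsgSize

def chunk_payload(data: bytes, max_size: int = MAX_MSG_SIZE):
    total = len(data)
    for offset in range(0, total, max_size):
        yield {
            "size": total if offset == 0 else None,  # total size only on first chunk
            "more_data": offset + max_size < total,  # True unless this is the last chunk
            "body": data[offset:offset + max_size],
        }

payload = b"<contacts>" + b"<c/>" * 5000 + b"</contacts>"
chunks = list(chunk_payload(payload))
reassembled = b"".join(c["body"] for c in chunks)
assert reassembled == payload
assert chunks[0]["size"] == len(payload) and not chunks[-1]["more_data"]
```

The receiver simply concatenates bodies until it sees a chunk without the MoreData flag, then checks the result against the size announced in the first chunk.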
Depending on how large it is, you might want to consider compressing it first. This, of course, depends on how often the same data is sent and how often it's changed.
To be honest, the vast majority of the time, the simplest solution works fine. I'd recommend transmitting it the easiest way first (which is probably all at once), and if that turns out to be problematic, keep on segmenting it until you find a size that's rarely disrupted.
Compression is an obvious approach. This XML bugger will shrink like there is no tomorrow.
If you can keep a local copy and two copies at the server, you could use diffxml to reduce what you have to transmit down to only the changes, and then bzip2 the diffs. That would reduce the bandwidth requirement a lot, at the expense of some storage.
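diffxml itself is a Java tool, but the diff-then-compress idea can be illustrated with nothing but the Python stdlib (`difflib` as a stand-in for an XML-aware differ, `bz2` for the compression). The document contents below are invented for the example:

```python
import bz2
import difflib

# Baseline copy held on both sides, and a new version with one changed element.
old = "\n".join(f"<item id='{i}'>{i * i}</item>" for i in range(1000))
new = old.replace("<item id='500'>250000</item>", "<item id='500'>updated</item>")

# Transmit only the compressed diff instead of the full document.
diff = "\n".join(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))
wire = bz2.compress(diff.encode())

# The diff-based payload is far smaller than compressing the whole new document.
assert len(wire) < len(bz2.compress(new.encode()))
```

The receiver applies the patch to its local baseline to reconstruct the new version; the trade-off, as noted, is the storage for the extra copies.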
Are you reading the XML with a proper XML parser, or are you reading it with expectations of a specific layout?
For XML data feeds, waiting for the entire file to download can be a real waste of memory and processing time. You could write a custom parser, perhaps using a regular expression search, that looks at the XML line by line, provided you can guarantee that the XML will not have any line feeds within tags.
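A safer alternative to a hand-rolled line-by-line parser is a streaming XML parser, which processes elements as they arrive without loading the whole document. A sketch using the stdlib's `ElementTree.iterparse` (the `BytesIO` object and tag names here just stand in for a network stream and a real feed):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a socket or HTTP response stream.
stream = io.BytesIO(
    b"<feed>"
    + b"".join(f"<entry id='{i}'><title>t{i}</title></entry>".encode() for i in range(3))
    + b"</feed>"
)

seen = []
# iterparse consumes the stream incrementally instead of building the full tree.
for event, elem in ET.iterparse(stream, events=("end",)):
    if elem.tag == "entry":
        seen.append(elem.get("id"))
        elem.clear()  # release memory for subtrees already processed

assert seen == ["0", "1", "2"]
```

Unlike a regex approach, this stays correct regardless of where line breaks fall inside tags.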
If you have code that can digest the XML a node at a time, then emit it a node at a time, using something like Transfer-Encoding: chunked. You write the length of the chunk (in hex) followed by the chunk, then another chunk, terminated by a zero-length chunk at the end. To save bandwidth, gzip each chunk.
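The framing described above can be sketched in a few lines of Python. This is a simplified model of HTTP chunked transfer encoding (it omits trailers and chunk extensions), with each chunk gzipped individually as suggested:

```python
import gzip

def encode_chunked(parts):
    """Frame each gzipped part as: hex length, CRLF, payload, CRLF,
    terminated by a zero-length chunk."""
    out = b""
    for part in parts:
        gz = gzip.compress(part)
        out += f"{len(gz):x}\r\n".encode() + gz + b"\r\n"
    return out + b"0\r\n\r\n"

def decode_chunked(data):
    parts, pos = [], 0
    while True:
        nl = data.index(b"\r\n", pos)
        size = int(data[pos:nl], 16)       # chunk length is hex-encoded
        if size == 0:
            return parts                   # zero-length chunk ends the stream
        body = data[nl + 2:nl + 2 + size]
        parts.append(gzip.decompress(body))
        pos = nl + 2 + size + 2            # skip payload and trailing CRLF

nodes = [b"<node>alpha</node>", b"<node>beta</node>"]
assert decode_chunked(encode_chunked(nodes)) == nodes
```

In real HTTP you would normally let the server or client library do this framing and apply gzip as a Content-Encoding over the whole body, but per-chunk compression works when each chunk must be independently decodable.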