Why are fixed-width file formats still in use?

Are there any advantages to a fixed-width file format over something like XML? I realize XML would likely take up more disk space to store the same amount of data, but the file could also be compressed. I guess you could also, in theory, read a specific piece of data based on where it is in the file (just grab those bytes). But other than that, what else?


When the data is large (gigabytes or terabytes), fixed-width files can be MUCH more efficient.

Since every record and field has a fixed size, you can simply seek to (for example) the n-millionth row and read a couple of records from there. You can also memory-map the whole file and get efficient, easy random access to everything.
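As a rough illustration of that seek-based access, here is a minimal Python sketch. The 16-byte record layout, the file contents, and the helper names (`write_sample`, `read_record`, `read_record_mmap`) are all made up for the example; a real file would have its own record size.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 16  # assumed layout: 15 data bytes plus "\n" per record


def write_sample(path, count):
    """Write `count` records, each padded to exactly RECORD_SIZE bytes."""
    with open(path, "wb") as f:
        for i in range(count):
            # "row" + number left-aligned in 12 columns + newline = 16 bytes
            f.write(f"row{i:<12}\n".encode("ascii"))


def read_record(path, n):
    """Return the n-th record (0-based) by seeking straight to its offset."""
    with open(path, "rb") as f:
        f.seek(n * RECORD_SIZE)
        return f.read(RECORD_SIZE)


def read_record_mmap(path, n):
    """Same lookup, but through a read-only memory map of the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = n * RECORD_SIZE
            return mm[start:start + RECORD_SIZE]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sample.dat")
    write_sample(path, 1000)
    print(read_record(path, 500))
    print(read_record_mmap(path, 500))
```

Neither lookup reads any of the records before the one requested, which is the whole point: the offset is just `n * RECORD_SIZE`.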

XML files aren't a good fit in these cases.


XML is complicated, especially if you validate against a schema. That may not look important, because somebody else has already written an XML parser that you can use. But parsing adds quite a lot of processing, which means it takes longer. In many cases that is not a problem, but sometimes it is.

If you want to save one integer into a custom file format, it takes just 4 bytes (for a 32-bit integer), and when you want to load it, you just copy those 4 bytes into memory (assuming the file format and your platform have the same endianness). With XML, it might take something like 10–30 bytes, and loading it means comparing strings, parsing the decimal representation of the integer, and probably more.
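To make the size difference concrete, here is a small Python sketch. The element name `value`, the sample number, and the choice of little-endian byte order are assumptions for illustration only.

```python
import struct
import xml.etree.ElementTree as ET

value = 123456789

# Binary: a 32-bit signed integer, little-endian (assumed byte order).
binary = struct.pack("<i", value)

# XML: the same number wrapped in a hypothetical <value> element.
xml_text = f"<value>{value}</value>".encode("ascii")

# Loading the binary form is a fixed-size unpack; loading the XML form
# means parsing markup and then converting decimal text to an integer.
decoded_binary = struct.unpack("<i", binary)[0]
decoded_xml = int(ET.fromstring(xml_text).text)

print(len(binary), "bytes as binary")
print(len(xml_text), "bytes as XML")
```

Both forms decode to the same number; the difference is only in how many bytes are stored and how much work the reader does.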

Again, those performance and storage size differences may very well be too minuscule for you to even consider (and the work it would take to devise a custom format might be non-trivial), but in many cases those differences do matter.

For example, I encountered a system that uses SMS messages to transmit some data. That means you have 140 bytes (!) per message. And the device that sends and receives those messages doesn't have GBs of memory and GHz of CPU. In that situation, you make sure that every bit counts, and you certainly don't use XML.


I know this is old, but I deal with both fixed width and XML daily. You can pretty much sum it up as:

XML = Readability

Fixed Width = Speed and Low Resource Consumption

XML is largely for readability by a human. I don't care what anyone says about structure and validation. If you're running a system that really doesn't need, and shouldn't have, humans reading the files you're passing back and forth, then you're just adding overhead: to the time it takes to process the file, and to the size of the file, which affects how long it takes to transfer its contents. All of this also increases memory usage on the system consuming the XML file.

There are advantages to XML, however. You can define your structure more loosely. Sometimes it's easier if your file and your code don't both have to require a field to be 255 characters long, and only your code enforces that limit. Another advantage is that XML can (and should) come with an XML Schema that defines the requirements for the XML contents. This helps when multiple systems consume a single API. If you can provide your schema to a developer, they can pretty quickly generate typed objects that serialize into properly formatted and structured XML.

Fixed width is for speed and minimal resource consumption. It can be more tedious to set up than XML, because you have to ensure that all systems know the exact positions of the "columns" in the fixed-width file. Often not all systems use the same columns, or all of them, so you end up with only a single system that fully understands the fixed-width contents. This can make it challenging to grow an API or system that relies on the transferred file. However, because there are no field labels, no tags, nothing but raw data, you can often send a smaller package across the wire. That's not always true: in some cases you may have a large number of text fields that commonly hold small amounts of data but must keep a large column width for the one-off cases where a paragraph's worth of text was entered. Now you've got a bunch of white space holding positions in your fixed-width file, and XML may actually reduce your overall package size.
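Here is a minimal Python sketch of both points: every reader needs the column positions, and a wide, mostly empty text column can make the fixed-width record larger than the XML. The layout, field names, and the naive `to_xml` helper (which does no escaping) are hypothetical.

```python
# Hypothetical layout shared by writer and reader: (field, start, end).
LAYOUT = [
    ("id", 0, 6),
    ("name", 6, 26),
    ("comment", 26, 281),  # 255 columns reserved for the rare long comment
]
RECORD_WIDTH = LAYOUT[-1][2]


def format_record(rec):
    """Pad (or truncate) each field to its column width and concatenate."""
    parts = []
    for field, start, end in LAYOUT:
        width = end - start
        parts.append(str(rec[field]).ljust(width)[:width])
    return "".join(parts)


def parse_record(line):
    """Slice each field out by position and drop the padding."""
    return {field: line[start:end].rstrip() for field, start, end in LAYOUT}


def to_xml(rec):
    """Naive XML rendering for size comparison only (no escaping)."""
    body = "".join(f"<{k}>{v}</{k}>" for k, v in rec.items())
    return f"<r>{body}</r>"


if __name__ == "__main__":
    rec = {"id": "42", "name": "Ada", "comment": "ok"}
    line = format_record(rec)
    print(parse_record(line))
    print(len(line), "characters as fixed width")
    print(len(to_xml(rec)), "characters as XML")
```

With a short comment, almost all of the fixed-width record is padding, so the XML comes out smaller; with short, densely filled columns the comparison goes the other way.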

Generally speaking, though, XML is for readability. You typically can't just pick up a fixed-width file, or even a CSV file, and immediately grasp what the data means, whereas with a well-labeled XML file you can.

There are a number of advantages and disadvantages that I haven't gone into, but this is where I see the real meat and potatoes of the differences.


I had the same question until I realized the power of fixed width. We have a table with millions of records. Extracting them to a JSON file swelled the file size to 15 GB and took 2+ hours, while using fixed width brought it down to 6.5 GB and 15 minutes.

Extracting and writing fixed width is faster than JSON.

I tried CSV too, and even there fixed width scored better.


Probably mostly for legacy reasons, since parsers for XML, JSON, etc. exist on pretty much all platforms.

Theoretically, fixed-width formats can be more space-efficient, as you suggest, and reading them is a bit simpler. But these do not seem like significant benefits.

For what it's worth, tabular (but not fixed-width) formats like CSV have their uses, combining a somewhat more compact representation with possibly better readability; CSV works quite nicely for map/reduce-style jobs.


One reason could be that processing XML (not just reading and loading into memory structures, but think about regex searching in an XML file vs. a simple fixed-width or delimited file, or even making manual quick-fixes to bad data) is more complicated than fixed-width files. Sure, there are many libraries that can do it for you now, but if there isn't one for the platform you're working on, do you really want to write an XML parser, or a program that just reads n bytes at location x?
