Efficiently store easily parsable data in a file?
I am needing to store easily parsable data in a file as an alternative to the database backed solution (not up for debate). Since its going to be storing lots of data, preferably it would be a lightweight syntax. This does not necessarily need to be human readable, but should be parsable. Note that there are going to be multiple types of fields/columns, some of which might be used and some of which won't
From my limited experience without a database I see several options, all with issues
- CSV - I could technically do this, and it is very light. However the parsing would be an issue, and then it would suck if I wanted to add a column. Multi-language support is iffy, mainly people's own custom parsers
- XML - This is the perfect solution from many fronts except when it comes to parsing and overhead. Thats a lot of tags and would generate a giant file, and parsing would be very resource consuming. However virtually every language supports XML
- JSON - This is the middle ground, but I don't really want to do this as its an awkward syntax and parsing is non-trivial. Language support is iffy.
So all have their disadvantages. But what would be the best when trying to aim for language support AND somewhat 开发者_如何学Csmall file size?
How about sqlite? This would allow you to basically embed the "DB" in your application, but not require a separate DB backend.
Also, if you end up using a DB backend later, it should be fairly easy to switch over.
If that's not suitable, I'd suggest one of the DBM-like stores for key-value lookups, such as Berkely DB or tdb.
If you're just using the basics of all these formats, all of the parsers are trivial. If CSV is an option, then for XML and JSON you're talking blocks of name/value pairs, so there's not even a recursive structure involved. json.org has support for pretty much any language.
That said.
I don't see what the problem is with CSV. If people write bad parsers, then too bad. If you're concerned about compatibility, adopt the default CSV model from Excel. Anyone that can't parse CSV from Excel isn't going to get far in this world. The weakest support you find in CSV is embedded newlines and carriage returns. If you data doesn't have this, then it's not a problem. Only other issue is embedded quotations, and those are escaped in CSV. If you don't have those either, then its even more trivial.
As for "adding a column", you have that problem with all of these. If you add a column, you get to rewrite the entire file. I don't see this being a big issue either.
If space is your concern, CSV is the most compact, followed by JSON, followed by XML. None of the resulting files can be easily updated. They pretty much all would need to be rewritten for any change in the data. CSV has the advantage that it's easily appended to, as there's no closing element (like JSON and XML).
JSON is probably your best bet (it's lightish, faster to parse, and self-descriptive so you can add your new columns as time goes by). You've said parsable - do you mean using Java? There are JSON libraries for Java to take the pain out of most of the work. There are also various light-weight in memory databases that can persist to a file (in case "not an option" means you don't want a big separate database)
If this is just for logging some data quickly to a file, I find tab delimited files are easier to parse than CSV, so if it's a flat text file you're looking for I'd go with that (so long as you don't have tabs in the feed of course). If you have fixed size columns, you could use fixed-length fields. That is even quicker because you can seek.
If it's unstructured data that might need some analysis, I'd go for JSON.
If it's structured data and you envision ever doing any querying on it... I'd go with sqlite.
When I needed solution like this I wrote up a simple representation of data prefixed with length. For example "Hi" will be represented as(in hex) 02 48 69
.
To form rows just nest this operation(first number is number of fields, and then the fields), for example if field 0 contains "Hi" and field 1 contains "abc" then it will be:
Num of fields Field Length Data Field Length Data 02 02 48 69 03 61 62 63
You can also use first row as names for the columns. (I have to say this is kind of a DB backend).
You can use CSV and if you only add columns to the end this is simple to handle. i.e. if you have less columns than you expect, use the default value for the "missing" fields.
If you want to be able to change the order/use of fields, you can add a heading row. i.e. the first row has the names of the columns. This can be useful when you are trying to read the data.
If you are forced to use a flat file, why not develop your own format? You should be able to tweak overhead and customize as much as you want (which is good if you are parsing lots of data). Data entries will be either of a fixed or variable length, there are advantages to forcing some entries to a fixed length but you will need to create a method for delimiting both. If you have different "types" of rows, write all the rows of a each type in a chunk. Each chunk of rows will have a header. Use one header to describe the type of the chunk, and another header to describe the columns and their sizes. Determine how you will use the headers to describe each chunk.
eg (H is header, C is column descriptions and D is data entry):
H Phone Numbers
C num(10) type
D 1234567890 Home
D 2223334444 Cell
H Addresses
C house(5) street postal(6) province
D 1234_ "some street" N1G5K6 Ontario
I'd say that if you want to store rows and columns, you've got to to use a DB. The reason is simple - modification of the structure with any approach except RDBMS will require significant efforts, and you mentioned that you want to change the structure in future.
精彩评论