What are some "mainstream" lightweight alternatives to storing files in .csv format? [duplicate]
I'm on a project which heavily favors the use of .csv files for data storage. I see many issues with using .csv, especially for storing relational data. Parsing .csv is generally a pain, particu开发者_JAVA百科larly when using ad-hoc column assigments.
I've advocated the use of XML and minimal databases such as SQLite, but I'm looking for "faster, better, cheaper" alternatives.
What are some other, "mainstream" lightweight alternatives to .csv files?
Also, what about CouchDB. How does it compare to SQLite in terms of lightweight-ness?
EDIT: I missed it. This question has been asked before.
I would argue there is no direct replacement for a CSV file. CSV is a flat file index-oriented format. It doesn't matter if you replace commas with pipes or whatnot. It's the same thing with slightly different rules.
With that being said, I often opt for SQLite when the data is in my control.
Using SQLite consistently lends to using the same tooling, can be used as either an ad-hoc store or a relational model, has a 'step up' plan to a "standalone" RDBMS, provides DQL "for free" (which is a big plus for me), etc. Unless space is a real issue or there isn't support for the data-access, why not? (Modern Firefox also uses SQLite).
(There are a number of object-database out there, such as DB4O as well -- or even simpler key/value hierarchical stores, etc. Not trying to say SQLite is the only way to obtain relationships in a micro/embedded database.)
One down-side over say, XML is that special tooling (sqlite/adapter) is required. XML, while not the most human-friendly format, can be edited just fine in notepad. Additionally, there is no extra overhead (fragmentation or structure) in XML beside the markup/data itself and XML is generally quite amendable to compression. There are also many libraries to map an entire object graph to XML (and thus maintain relationships) so that might be a nice feature.
Other formats like JSON are also out there -- but if the format is opaque then it doesn't really make a difference over XML (it's more of a matter of tooling support).
So... "it depends".
It looks like YAML is relatively small compared to formats such as XML, but slightly more descriptive than JSON (it's a superset). It's another candidate I'll consider.
It's all about use-case.
My rule of thumb: use SQLite if there are dependencies or relations between two pieces of data; use CSV (or some other "flat" format) if it's just flat data files. The simplest thing that just works is often the most reliable solution as well.
(Note: Ensure the CSV is well formed. Nobody likes having to hack around bad CSV implementations.)
HDF5 is a good choice for storing large tabular datasets, if you do not require concurrent writes.
In Python, Pandas + PyTables are very easy to use. Example from the Pandas documentation:
In [259]: store = HDFStore('store.h5')
In [260]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty
Objects can be written to the file just like adding key-value pairs to a dict:
In [261]: np.random.seed(1234)
In [262]: index = date_range('1/1/2000', periods=8)
In [263]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [264]: df = DataFrame(randn(8, 3), index=index,
.....: columns=['A', 'B', 'C'])
.....:
In [265]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
.....: major_axis=date_range('1/1/2000', periods=5),
.....: minor_axis=['A', 'B', 'C', 'D'])
.....:
# store.put('s', s) is an equivalent method
In [266]: store['s'] = s
In [267]: store['df'] = df
In [268]: store['wp'] = wp
# the type of stored data
In [269]: store.root.wp._v_attrs.pandas_type
Out[269]: 'wide'
In [270]: store
Out[270]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame (shape->[8,3])
/s series (shape->[5])
/wp wide (shape->[2,5,4])
XML is designed to be mainstream and relativey "lightweight". JSON is another popular choice but much more suited to object modeling as opposed to data storage.
MySQL is a good option if you need relational querying capabilities.
精彩评论