
50 million+ Rows of Data - CSV or MySQL

I have a CSV file that is about 1 GB and contains about 50 million rows of data. I am wondering whether it is better to keep it as a CSV file or to store it in some form of database. I don't know enough about MySQL to argue for using it, or another database, over just keeping the data as a CSV file. I am basically doing a breadth-first search with this dataset, so once I get the initial "seed" set (the 50 million rows), I use it as the first values in my queue.

Thanks,


There are many benefits to using a database over a CSV file for structured data this large, so I would suggest learning enough to use one. However, based on your description, you might want to look at non-server, lighter-weight databases such as SQLite, or something similar to JavaDB/Derby. Depending on the structure of your data, a non-relational (NoSQL) database could also work, though obviously you will need one with some kind of Python support.


If you want to search on something graph-ish (since you mention Breadth-First Search) then a graph database might prove useful.


Are you just going to slurp in everything all at once? If so, then CSV is probably the way to go. It's simple and works.

If you need to do lookups, then something that lets you index the data, like MySQL, would be better.


From your previous questions, it looks like you are doing social-network searches against Facebook friend data, so I presume your data is a set of "A is-friend-of B" statements and you are looking for the shortest connection between two individuals?

If you have enough memory, I would suggest parsing your CSV file into a dictionary of lists. See "Can this breadth-first search be made faster?"
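As a rough sketch of the dictionary-of-lists idea, assuming each CSV row is a pair `a,b` meaning "a is-friend-of b" and that friendship is symmetric (both are guesses about your data, and the names `load_graph` and `shortest_path` are just illustrative):

```python
import csv
import io
from collections import defaultdict, deque


def load_graph(rows):
    """Build an adjacency dict of lists from (a, b) friendship pairs."""
    graph = defaultdict(list)
    for a, b in rows:
        graph[a].append(b)
        graph[b].append(a)  # assumption: friendship is symmetric
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in parent:
                continue
            parent[neighbour] = node
            if neighbour == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Tiny in-memory stand-in for the real file; with real data you would
# pass csv.reader(open(path, newline="")) instead.
sample = io.StringIO("alice,bob\nbob,carol\ncarol,dave\nalice,erin\n")
graph = load_graph(csv.reader(sample))
print(shortest_path(graph, "alice", "dave"))
```

Using `collections.deque` keeps the queue pops at constant time, which matters once the frontier gets large.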

If you cannot hold all the data at once, a local-storage database like SQLite is probably your next-best alternative.
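A minimal sketch of the SQLite route using Python's built-in `sqlite3` module. The table name `friends`, its columns `a` and `b`, and the helper `neighbours` are made up for illustration, and the in-memory database stands in for a file on disk:

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (a TEXT, b TEXT)")

# Hypothetical sample rows; real rows would come from csv.reader.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]
conn.executemany("INSERT INTO friends VALUES (?, ?)", edges)

# Index both columns so lookups in either direction avoid a full scan.
conn.execute("CREATE INDEX idx_friends_a ON friends (a)")
conn.execute("CREATE INDEX idx_friends_b ON friends (b)")
conn.commit()


def neighbours(conn, person):
    """Return everyone directly connected to `person`, in sorted order."""
    rows = conn.execute(
        "SELECT b FROM friends WHERE a = ? "
        "UNION SELECT a FROM friends WHERE b = ?",
        (person, person),
    )
    return sorted(row[0] for row in rows)


print(neighbours(conn, "bob"))
```

Each step of the breadth-first search then becomes one indexed query per node instead of a dictionary lookup, which is slower but does not need the whole graph in memory.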

There are also some Python modules that might help:

  • graph-tool http://projects.skewed.de/graph-tool/
  • python-graph http://pypi.python.org/pypi/python-graph/1.8.0
  • networkx http://networkx.lanl.gov/
  • igraph http://igraph.sourceforge.net/


How about a key-value or document store such as MongoDB?
