
Handle large data pools in Python

I'm working on an academic project aimed at studying people's behavior.

The project will be divided into three parts:

  1. A program to read the data from some remote sources and build a local data pool from it.
  2. A program to validate this data pool and keep it coherent.
  3. A web interface to allow people to read/manipulate the data.

The data consists of a list of people, each with an ID # and several characteristics: height, weight, age, ...

I need to easily make groups out of this data (e.g. everyone of a given age, or within a range of heights), and the data is several TB big (but it can be reduced to smaller subsets of 2-3 GB).

I have a strong background in the theory behind the project, but I'm not a computer scientist. I know Java, C, and MATLAB, and I'm now learning Python.

I would like to use Python, since it seems easy enough and greatly reduces the verbosity of Java. The problem is that I'm wondering how to handle the data pool.

I'm no expert on databases, but I guess I need one here. What tools do you think I should use?

Remember that the aim is to implement very advanced mathematical functions on sets of data, so we want to reduce the complexity of the source code. Speed is not an issue.


It sounds like the main functionality you need can be found in PyTables and SciPy/NumPy.
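If you go this route, here is a minimal sketch of the grouping part, assuming a hypothetical people.h5 file and made-up columns id, age, height, and weight; PyTables evaluates the range condition in-kernel, so only the matching rows come back as a NumPy structured array:

    import tables

    # Describe one person record; the columns are illustrative, not a fixed schema.
    class Person(tables.IsDescription):
        id     = tables.Int64Col()    # person ID #
        age    = tables.Int32Col()
        height = tables.Float64Col()  # cm
        weight = tables.Float64Col()  # kg

    # Create an HDF5 file with a single table of people.
    h5 = tables.open_file('people.h5', mode='w')
    table = h5.create_table('/', 'people', Person)

    # Append a couple of sample rows.
    row = table.row
    for pid, age, height, weight in [(1, 25, 175.0, 70.0), (2, 40, 162.5, 55.0)]:
        row['id'], row['age'], row['height'], row['weight'] = pid, age, height, weight
        row.append()
    table.flush()

    # Grouping: select everyone with an age in [20, 30) without loading
    # the whole table into memory.
    twenties = table.read_where('(age >= 20) & (age < 30)')
    print(twenties['height'].mean())  # NumPy math directly on the selected group

    h5.close()

Since the result of read_where is already a NumPy array, the advanced mathematical functions can be written with scipy/numpy directly on top of each selected group.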


Go with a NoSQL database like MongoDB; in a case like this it's much easier to handle the data that way than to have to learn SQL.
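For illustration, a minimal sketch with pymongo (the connection URI, database, collection, and field names are assumptions); a group such as "everyone within a range of heights" maps directly to the $gte/$lt query operators:

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (the URI is a placeholder).
    client = MongoClient('mongodb://localhost:27017')
    people = client['study']['people']  # hypothetical database/collection names

    # Each person is just a document (a dict); no schema to declare up front.
    people.insert_many([
        {'_id': 1, 'age': 25, 'height': 175.0, 'weight': 70.0},
        {'_id': 2, 'age': 40, 'height': 162.5, 'weight': 55.0},
    ])

    # Grouping by a range of heights.
    group = people.find({'height': {'$gte': 160.0, '$lt': 180.0}})
    for doc in group:
        print(doc['_id'], doc['height'])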


Since you aren't an expert, I recommend using a MySQL database as the backend for storing your data. It's easy to learn, and you'll be able to query your data using SQL and write to it from Python; see this MySQL Guide and Python-Mysql.
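As a rough sketch (the credentials and the people table below are placeholders), a range query over the pool with the MySQLdb driver from Python-Mysql would look like this:

    import MySQLdb

    # Connect to the database; host, user, password, and schema are assumptions.
    conn = MySQLdb.connect(host='localhost', user='student',
                           passwd='secret', db='study')
    cur = conn.cursor()

    # Grouping by a range of heights; values are passed as parameters
    # so the driver escapes them safely.
    cur.execute(
        "SELECT id, age, height FROM people WHERE height BETWEEN %s AND %s",
        (160.0, 180.0),
    )
    for person_id, age, height in cur.fetchall():
        print(person_id, age, height)

    conn.close()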

