
Java or Python distributed compute job (on a student budget)?

I have a large dataset (c. 40GB) that I want to use for some NLP (largely embarrassingly parallel) across a couple of computers in the lab, to which I do not have root access and where I only have 1GB of user space. I experimented with Hadoop, but of course this was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1GB user-space cap. I'd rather use NLTK than Java's LingPipe if I can help it, so I have been looking into a couple of Python-based options, and the distributed compute candidates seem to be:

  • IPython
  • Disco

After my Hadoop experience, I am trying to make an informed choice this time -- any help on what might be more appropriate would be greatly appreciated.

Amazon's EC2 etc. is not really an option, as I have next to no budget.
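For reference, the kind of usage I have in mind with the IPython option is roughly the following. This is only a sketch, assuming an IPython cluster (ipcluster/ipengine processes) is already running on the lab machines and NLTK is installed on each of them; process_chunk and the chunk file names are placeholders, not part of any real setup:

    # Sketch: farm out per-chunk NLP work with IPython.parallel.
    # Assumes engines are already running (e.g. via `ipcluster start`).
    from IPython.parallel import Client

    def process_chunk(path):
        # Placeholder NLP step: read one chunk file and count its tokens.
        # NLTK must be installed on every engine for this to work.
        import nltk
        with open(path) as f:
            return len(nltk.word_tokenize(f.read()))

    rc = Client()                    # connect to the running cluster
    view = rc.load_balanced_view()   # dynamic load balancing across engines
    chunk_paths = ['chunk-%03d.txt' % i for i in range(100)]
    results = view.map(process_chunk, chunk_paths, block=True)
    print(sum(results))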


Speak with the IT department at your school (especially if you are in college). If it is for an assignment or for research, I bet they would be more than happy to give you more disk space.


No actual answers here; I'd have put this as a comment, but on this site you're forced to answer rather than comment while you're still a noob.

If it's genuinely as parallel as that, and it's only a couple of computers, could you not split the dataset up manually ahead of time?
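Something along these lines would do the manual split (a rough sketch; it assumes the corpus is newline-delimited text and that roughly 500MB chunks fit your 1GB quota; the paths are made up):

    # Rough sketch: slice a big newline-delimited corpus into chunks
    # small enough to copy onto each lab machine's 1GB of user space.
    CHUNK_BYTES = 500 * 1024 * 1024  # stay well under the quota

    def split_corpus(src, prefix):
        part, out, written = 0, None, 0
        with open(src, 'rb') as f:
            for line in f:
                # Start a new chunk file whenever the current one is full.
                if out is None or written + len(line) > CHUNK_BYTES:
                    if out:
                        out.close()
                    out = open('%s-%03d.txt' % (prefix, part), 'wb')
                    part, written = part + 1, 0
                out.write(line)
                written += len(line)
        if out:
            out.close()

    split_corpus('/media/usb/corpus.txt', 'chunk')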

Have you confirmed that there isn't going to be a firewall or similar stopping you from using something like that anyway?

You may only have 1GB of user space, but if the machines run Linux, what about /tmp? (If Windows, what about %TEMP%?)


Definitely speak with the IT department at your school. It's not a good idea to use computing resources that don't belong to you.

I found JPPF, which enables applications with large processing-power requirements to be run on any number of computers. I'm not sure whether you need to install software on the client machines, but certain ports do need to be open on them.


If more resources from your computing department are a no-go, you're going to have to break your dataset down into manageable chunks before you do any work on it, and then reduce the results down into a meaningful set.
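Concretely, that is just the map/reduce pattern done by hand: process each chunk wherever it lands, write out a small partial result, then merge. A minimal sketch of the merge step, assuming each machine pickled a dict of token counts (the file names are illustrative):

    # Minimal reduce step: merge per-chunk partial results into one total.
    import glob
    import pickle
    from collections import Counter

    totals = Counter()
    for path in glob.glob('counts-*.pkl'):
        with open(path, 'rb') as f:
            totals.update(pickle.load(f))

    print(totals.most_common(20))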

More resources from IT would be the way to go.

Good luck!

Ben
