
Java or Python distributed compute job (on a student budget)?

I have a large dataset (c. 40GB) that I want to use for some NLP (largely embarrassingly parallel) across a couple of computers in the lab, to which I do not have root access and where I only have 1GB of user space. I experimented with Hadoop, but of course this was dead in the water: the data is stored on an external USB hard drive, and I can't load it onto the DFS because of the 1GB user-space cap. I'd rather use NLTK than Java's LingPipe if I can help it, so I have been looking into a couple of Python-based options, and the distributed compute candidates seem to be:

  • IPython
  • Disco

After my Hadoop experience, I am trying to make an informed choice this time -- any help on what might be more appropriate would be greatly appreciated.

Amazon's EC2 etc. is not really an option, as I have next to no budget.
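For reference, the kind of usage I have in mind with the IPython option is roughly the following. This is only a sketch, assuming an IPython cluster (ipcluster/ipengine processes) is already running on the lab machines and NLTK is installed on each of them; process_chunk and the chunk file names are placeholders, not part of any real setup:

    # Sketch: farm out per-chunk NLP work with IPython.parallel.
    # Assumes engines are already running (e.g. via `ipcluster start`).
    from IPython.parallel import Client

    def process_chunk(path):
        # Placeholder NLP step: read one chunk file and count its tokens.
        # NLTK must be installed on every engine for this to work.
        import nltk
        with open(path) as f:
            return len(nltk.word_tokenize(f.read()))

    rc = Client()                    # connect to the running cluster
    view = rc.load_balanced_view()   # dynamic load balancing across engines
    chunk_paths = ['chunk-%03d.txt' % i for i in range(100)]
    results = view.map(process_chunk, chunk_paths, block=True)
    print(sum(results))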


Speak with the IT department at your school (especially if you are in college). If it is for an assignment or for research, I bet they would be more than happy to give you more disk space.


No actual answers here; I'd have put this as a comment, but on this site you're forced to answer rather than comment while you're still a noob.

If it's genuinely as parallel as that, and it's only a couple of computers, could you not split the dataset up manually ahead of time?
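Something along these lines would do the manual split (a rough sketch; it assumes the corpus is newline-delimited text and that roughly 500MB chunks fit your 1GB quota; the paths are made up):

    # Rough sketch: slice a big newline-delimited corpus into chunks
    # small enough to copy onto each lab machine's 1GB of user space.
    CHUNK_BYTES = 500 * 1024 * 1024  # stay well under the quota

    def split_corpus(src, prefix):
        part, out, written = 0, None, 0
        with open(src, 'rb') as f:
            for line in f:
                # Start a new chunk file whenever the current one is full.
                if out is None or written + len(line) > CHUNK_BYTES:
                    if out:
                        out.close()
                    out = open('%s-%03d.txt' % (prefix, part), 'wb')
                    part, written = part + 1, 0
                out.write(line)
                written += len(line)
        if out:
            out.close()

    split_corpus('/media/usb/corpus.txt', 'chunk')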

Have you confirmed that there isn't going to be a firewall or similar stopping you from using something like that anyway?

You may only have 1GB of user space, but if the machines run Linux, what about /tmp? (If Windows, what about %TEMP%?)


Definitely speak with the IT department at your school. It's not a good idea to use computing resources that don't belong to you.

I found JPPF, which enables applications with large processing-power requirements to be run on any number of computers. I'm not sure whether you need to install software on the client machines, but certain ports do need to be open on them.


If more resources from your computing department are a no-go, you're going to have to break your dataset down into manageable chunks before you do any work on it, and then reduce the results down into a meaningful set.
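Concretely, that is just the map/reduce pattern done by hand: process each chunk wherever it lands, write out a small partial result, then merge. A minimal sketch of the merge step, assuming each machine pickled a dict of token counts (the file names are illustrative):

    # Minimal reduce step: merge per-chunk partial results into one total.
    import glob
    import pickle
    from collections import Counter

    totals = Counter()
    for path in glob.glob('counts-*.pkl'):
        with open(path, 'rb') as f:
            totals.update(pickle.load(f))

    print(totals.most_common(20))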

More resources from IT would be the way to go.

Good luck!

Ben
