Repository organization for Hadoop project
I am starting a new Hadoop project that will have multiple Hadoop jobs (and hence multiple JAR files). Using Mercurial for source control, I was wondering what the optimal way of organizing the repository structure would be. Should each job live in a separate repo, or would it be more efficient to keep them all in the same repo but break them down into folders?
If you're pipelining the Hadoop jobs (the output of one is the input of another), I've found it's better to keep most of them in the same repository, since I tend to generate a lot of common methods I can reuse across the various MR jobs.
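A minimal sketch of what I mean by common code. LogLineParser, the package name, and the tab-delimited format are all invented for illustration, but this is the kind of utility that ends up called from the mappers of several pipelined jobs:

    // Hypothetical shared utility, kept once in the common repo
    // rather than copy-pasted into each job's source tree.
    package com.example.mrcommon;

    public final class LogLineParser {

        private LogLineParser() {}

        // Pulls out the field the jobs key their map output on.
        // Tab-delimited input is an assumption for this sketch.
        public static String extractKey(String line) {
            int tab = line.indexOf('\t');
            return tab < 0 ? line : line.substring(0, tab);
        }
    }

Both the upstream job's mapper and the downstream job's mapper can call LogLineParser.extractKey() instead of duplicating the parsing logic, so a change to the input format happens in one place.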
Personally, I keep the streaming jobs in a separate repo from my more traditional jobs since there are generally no dependencies.
Are you planning on using the DistributedCache or streaming jobs? If so, you might want a separate directory for the files you distribute. And do you really need a separate JAR per Hadoop job? I've found I don't.
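For the single-JAR approach, Hadoop's own examples JAR is the pattern to copy: a small driver built on org.apache.hadoop.util.ProgramDriver dispatches to whichever job you name on the command line. A rough sketch, where WordCountJob and JoinJob are placeholder stubs standing in for your real job classes:

    import org.apache.hadoop.util.ProgramDriver;

    public class JobDriver {

        // Stubs standing in for real job classes; each would normally
        // configure and submit its own MR job in main().
        public static class WordCountJob {
            public static void main(String[] args) { /* set up + run job */ }
        }

        public static class JoinJob {
            public static void main(String[] args) { /* set up + run job */ }
        }

        public static void main(String[] args) {
            int exitCode = -1;
            ProgramDriver driver = new ProgramDriver();
            try {
                // Register each job under a short command-line name.
                driver.addClass("wordcount", WordCountJob.class,
                        "Counts word occurrences");
                driver.addClass("join", JoinJob.class,
                        "Joins the two input datasets");
                driver.driver(args); // dispatches on args[0]
                exitCode = 0;
            } catch (Throwable t) {
                t.printStackTrace();
            }
            System.exit(exitCode);
        }
    }

Invocation then looks like "hadoop jar myjobs.jar wordcount <in> <out>", so one JAR covers every job in the repo.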
If you give more details about what you plan on doing with Hadoop, I can see what else I can suggest.