Converting word docs to pdf using Hadoop

2022-12-15 13:11 问答作者：

Say if I want to convert 1000s of word files to pdf then would using Hadoop开发者_运维问答 to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?

Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?

There isn't much advantage in using hadoop for this use case. Having competing consumers read from a queue and producing output is going to be a lot easier to setup and will probably be more efficient.

Hadoop would not automatically split a document and process sections on differnt nodes. Although if you had a really big (many thousands of pages long) then the Hadoop use case would make sense - but only when the time to produce a pdf on a single machine is significant.

The map tasks could print a few thousand pages each and the reduce task merge the PDF's into a single document - although reading the resulting file may be difficult to read if it is very large.

Say if I want to convert 1000s of word files to pdf then would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?

I think either tool could accomplish this task, so it depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the realm of tasks at which Hadoop performs well.

Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?

It depends on the InputFormat you use. As you can see in the documentation, you can specify how to compute the "InputSplits", which might include splitting a large document into chunks.

Good luck with whatever tool you choose for this problem!

Regards, Jeff

How many 1000's are you talking about? If this is a once off batch I would set it up on a single machine and simply let it run, you'll be surprised I think at how fast you can convert 1000s of Docs to PDF, even if you need to run the task for a couple of days, if its a once off convert then there is no need for complications such as Hadoop. If you are continually converting 1000s of docs then its probably worth the effort of setting up something else.

Converting word docs to pdf using Hadoop

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？