How do I convert PDF to HTML programmatically?

Are there any classes, COM objects, command-line utilities, or anything else I could wrap in an API that can convert a PDF to an HTML document? Obviously the conversion might be a little rough, since PDFs can contain a lot more than HTML can describe. I found a utility called pdftohtml on SourceForge, but quite honestly it does a horrible job with the conversion. I don't care whether the software is free or commercial, but is there anything out there that I can incorporate into my own software to do this sort of conversion at least decently? I know Google has developed its own method of doing this, since you can click "View as HTML" on a PDF attached to an email in Gmail, but I was hoping there was something available to the public.

Remember, PDF to HTML. I'm NOT worried about HTML to PDF.


Well, one solution I can think of is to write a small program that reads the PDF text using a library called iText and then generates HTML files.
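A rough sketch of the second half of that idea, in Python rather than Java. It assumes the text has already been extracted page by page (with iText or any other library) and only shows how the HTML file could be generated. The function name `pages_to_html`, the section ids, and the blank-line paragraph rule are illustrative choices, not part of iText.

```python
import html


def pages_to_html(pages, title="Converted PDF"):
    """Wrap already-extracted page text in a minimal HTML document.

    `pages` is a list of strings, one per PDF page. How the text is
    extracted is left to whatever PDF library you use (assumption).
    """
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>',
        "<body>",
    ]
    for number, text in enumerate(pages, start=1):
        parts.append(f'<section id="page-{number}">')
        # Treat blank lines as paragraph breaks and escape everything else,
        # so characters such as "<" or "&" in the PDF text stay literal.
        for block in text.split("\n\n"):
            block = block.strip()
            if block:
                escaped = html.escape(block).replace("\n", "<br>")
                parts.append(f"<p>{escaped}</p>")
        parts.append("</section>")
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)


if __name__ == "__main__":
    print(pages_to_html(["First page.\n\nSecond paragraph."], title="Demo"))
```

This keeps only the text, so fonts, positioning, and images are lost, which is the main limitation of the text-extraction approach.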


Well, for Java-based PDF solutions, I don't think we have a clean way yet. All the solutions are primitive and something of a workaround. There is no easy way to (1) design a PDF template and then (2) populate it with data at runtime from Java, using XML or other data sources.

It is such a simple requirement, and nothing offers a good open-source, free solution yet!

Eclipse BIRT comes close, but it does not handle barcode elements out of the box.


You were looking for pdf2htmlEX (C++), which converts PDF to HTML without losing text or format.

To convert further to semantic HTML, you can process the pdf2htmlEX output with my project Transcript (Python). That step is no longer lossless, however, and it works best on documents that do not deviate much from a conventional visual layout.
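Since the question asks for something that can be called from your own software, here is a minimal sketch of driving the pdf2htmlEX command-line tool from Python. It assumes pdf2htmlEX is installed and on the PATH, and that it accepts `--dest-dir` followed by the input PDF and an output file name; check `pdf2htmlEX --help` for your version. The helper names `build_command` and `convert` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, dest_dir, output_name=None):
    """Build the pdf2htmlEX argument list (flag and order are assumptions)."""
    pdf_path = Path(pdf_path)
    # Default the output name to the input name with an .html suffix.
    output_name = output_name or pdf_path.with_suffix(".html").name
    return ["pdf2htmlEX", "--dest-dir", str(dest_dir), str(pdf_path), output_name]


def convert(pdf_path, dest_dir="."):
    """Run pdf2htmlEX and return the path of the HTML file it should write."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX was not found on PATH")
    cmd = build_command(pdf_path, dest_dir)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(cmd, check=True)
    return Path(dest_dir) / cmd[-1]
```

Calling `convert("report.pdf", "out")` would then be expected to leave `out/report.html` behind, which you can read back or hand to a further processing step.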
