
Extracting content from PDF using PHP

Could you please tell me how to extract content from a PDF document using PHP? Formatting is the main problem I'm facing here. Please let me know if there is a way to extract the content with its formatting intact and display it in an online text editor.

Thanks


Have a look at XPDF.

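I suppose you could do something like this (the trailing - tells pdftotext to write to stdout instead of to a .txt file next to the PDF, and escapeshellarg() guards against odd characters in the file name):

$text = shell_exec('pdftotext ' . escapeshellarg($pdffile) . ' -');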

As for displaying it in an editor: which editor? To retain some formatting information, and assuming that by web editor you mean an HTML editor, you can convert the PDF to HTML. Perhaps there are other tools available, but since I use xpdf I came across pdftohtml, a converter that is based on xpdf.

Basic usage

pdftohtml -noframes -c test.pdf test.html

To get it into your favorite editor

echo file_get_contents('test.html');

You may need to wrap things inside PHP functions/classes. And you may want to add security measures and whatnot.
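As a rough illustration of that wrapping, here is a minimal sketch. The function names build_pdftohtml_command() and pdf_to_html() are made up for this example, and it assumes pdftohtml is on the PATH, that PHP is running on a Unix-like system, and that the tool writes its output to the path given, as in the command above. Treat it as a starting point, not a hardened implementation.

```php
<?php
// Hypothetical helper: builds the pdftohtml command line from the answer,
// with both paths escaped so spaces or shell metacharacters are harmless.
function build_pdftohtml_command(string $pdfPath, string $htmlPath): string
{
    return 'pdftohtml -noframes -c '
        . escapeshellarg($pdfPath) . ' '
        . escapeshellarg($htmlPath);
}

// Hypothetical helper: runs the conversion and returns the generated HTML,
// or null if the input is missing, the tool fails, or no output file appears.
function pdf_to_html(string $pdfPath, string $htmlPath): ?string
{
    if (!is_file($pdfPath)) {
        return null;
    }

    exec(build_pdftohtml_command($pdfPath, $htmlPath), $output, $exitCode);

    if ($exitCode !== 0 || !is_file($htmlPath)) {
        return null;
    }

    $html = file_get_contents($htmlPath);
    return $html === false ? null : $html;
}
```

It could then be called as echo pdf_to_html('test.pdf', 'test.html') ?? 'Conversion failed'; and for uploaded files you would still want to validate the upload and choose the output path yourself.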


As far as I can see, it is not possible to convert a PDF to editable HTML on the fly using PHP while preserving formatting. There are a number of desktop apps around that try to extract data from PDFs, with results that are sometimes more and sometimes less reliable. I would say this is not realistically possible at the moment, and all you can do is extract plain text using XPDF or other command line tools.

It may be different with that new XML-based PDF format, but I don't really know anything about that yet.

Feel free to prove me wrong, of course - I'd be very interested myself if there were a solution.
