Project Thoughts: Searching Directory of PDFs
To preface this, I know there are discussions on this in various places. Half of what I read is outdated, buggy or simply unrelated to my situation.
This is why I am bringing it to the community that I know will have the answers.
Question: I have a directory (online is ideal) of PDF documents totalling around 70,000 pages (individual documents range from 20 to several hundred pages).
I am looking for a method, script, or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.
Any ideas, whether they be elaborate or inventive, are more than welcome.
My recommendation would be Apache Solr, a search server built on Lucene that is dead simple to use via its RESTful interface. There is also a related Apache project called Tika, which extracts metadata and structured text content from many formats (including PDF).
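To make the Solr route concrete, here is a minimal sketch of pushing one OCR'd PDF through Solr's ExtractingRequestHandler (which uses Tika internally). The host, port, and core name "pdfs" are assumptions, not anything from the original answer; adjust to your install.

```python
# Hedged sketch: index one OCR'd PDF into a local Solr core via the
# ExtractingRequestHandler (Tika under the hood). The base URL and the
# core name "pdfs" are assumptions -- change them for your setup.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR_BASE = "http://localhost:8983/solr/pdfs"  # assumed core name

def extract_url(doc_id: str) -> str:
    """Build the update/extract URL that hands the raw PDF to Tika."""
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{SOLR_BASE}/update/extract?{params}"

def index_pdf(path: str, doc_id: str) -> None:
    """POST the PDF bytes; Solr/Tika extracts the text layer and indexes it."""
    with open(path, "rb") as fh:
        req = Request(extract_url(doc_id), data=fh.read(),
                      headers={"Content-Type": "application/pdf"})
        urlopen(req)  # raises on HTTP errors

# Example (requires a running Solr instance):
#   index_pdf("catalog-2011.pdf", "catalog-2011")
# Then query with a plain GET, e.g. {SOLR_BASE}/select?q=widget
```

A loop over your directory calling `index_pdf` once per file is all the ingestion script you need; searching is then just HTTP GETs against `/select`.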
Use a search engine like Lucene or Sphinx to index and tag the PDFs. The Zend Framework has both a component for reading and writing PDF files and a Lucene implementation.
XPDF has a utility called pdftotext, which is often installed by default on Linux distributions. I would write a tool that uses it to build an index mapping words to the documents they appear in. You could store that index in a database and write a search against it.
It would take a little more space, but it would also be simple to store a sentence of context with each entry to show in the search results.
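The steps above can be sketched roughly as follows: shell out to `pdftotext` for extraction, then keep a word-to-(document, context sentence) table in SQLite. The file names, sentence-splitting regex, and schema are all illustrative choices, not part of the original answer.

```python
# Hedged sketch of the pdftotext route: dump each PDF's text layer with
# `pdftotext` (from Xpdf/Poppler), then build a word -> (document, context)
# index in SQLite so search results can show a sentence of context.
import re
import sqlite3
import subprocess

def pdf_to_text(path: str) -> str:
    """Run `pdftotext file.pdf -` and capture the extracted text on stdout."""
    return subprocess.run(["pdftotext", path, "-"],
                          capture_output=True, text=True, check=True).stdout

def index_text(db: sqlite3.Connection, doc: str, text: str) -> None:
    """Store each word together with the sentence it appeared in."""
    db.execute("CREATE TABLE IF NOT EXISTS hits (word TEXT, doc TEXT, context TEXT)")
    # Crude sentence split on ., !, ? followed by whitespace -- illustrative only.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for word in set(re.findall(r"[a-z0-9]+", sentence.lower())):
            db.execute("INSERT INTO hits VALUES (?, ?, ?)", (word, doc, sentence))
    db.commit()

def search(db: sqlite3.Connection, word: str):
    """Return (document, context sentence) pairs for a search term."""
    return db.execute("SELECT doc, context FROM hits WHERE word = ?",
                      (word.lower(),)).fetchall()

# Tiny demo on hard-coded text; with real PDFs you would feed in
# pdf_to_text(path) instead.
db = sqlite3.connect(":memory:")
index_text(db, "catalog.pdf", "Acme widgets ship fast. Gadgets too.")
print(search(db, "widgets"))  # -> [('catalog.pdf', 'Acme widgets ship fast.')]
```

For 70,000 pages this naive one-row-per-word table gets large; an obvious refinement is a separate word table plus a join table, or simply SQLite's built-in FTS module instead of a hand-rolled index.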