How do I index PDF, PPT, and XLS files in Lucene? A Java, Python, or PHP solution is fine.

I also want to know how to add metadata while indexing, so that I can boost some parameters.


Several frameworks extract text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.):

  • One of them is Apache Tika, originally a sub-project of Lucene and now a top-level Apache project.
  • Apache POI is another Apache project; it reads and writes Microsoft Office formats (such as PPT and XLS) directly.
  • There are also some commercial alternatives.


You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

The code will look like this:

    Reader reader = new Tika().parse(stream);
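On the boosting part of the question: since Tika returns metadata alongside the text, one common approach is to keep each metadata value (title, author, and so on) in its own field and weight the fields differently when scoring. The sketch below only illustrates that idea with plain Java collections. `FieldBoostSketch`, its methods, and the field names `title` and `body` are hypothetical, and this is not Lucene's scoring formula or API. In Lucene itself, query-time weighting of this kind is expressed with classes such as `BoostQuery`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy field-weighted scorer; a stand-in for the idea, not for Lucene's API. */
final class FieldBoostSketch {

    /** Counts how often term occurs as a whole token in text (case-insensitive). */
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    /** Score = sum over fields of (field boost x term frequency in that field). */
    static double score(Map<String, String> fields, Map<String, Double> boosts, String term) {
        double total = 0.0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Fields without an explicit boost get a neutral weight of 1.0.
            double boost = boosts.getOrDefault(field.getKey(), 1.0);
            total += boost * termFrequency(field.getValue(), term);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumption: "title" came from extracted metadata, "body" from extracted text.
        Map<String, Double> boosts = Map.of("title", 3.0, "body", 1.0);

        Map<String, String> guide = new LinkedHashMap<>();
        guide.put("title", "Lucene indexing guide");
        guide.put("body", "How to index documents.");

        Map<String, String> notes = new LinkedHashMap<>();
        notes.put("title", "Meeting notes");
        notes.put("body", "We discussed Lucene and Lucene plugins.");

        System.out.println("guide: " + score(guide, boosts, "lucene"));
        System.out.println("notes: " + score(notes, boosts, "lucene"));
    }
}
```

With these weights, a single match in the title outranks two matches in the body, which is the effect a metadata boost is meant to produce.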


Lucene indexes text, not files. You'll need a separate process to extract the text from each file, and then index that text with Lucene.
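The two-stage split described here can be sketched without any libraries. In the toy example below, the extractor is a stand-in for a tool such as Tika, PDFBox, or POI (it only decodes UTF-8 bytes), and the in-memory map is a stand-in for a Lucene index. The class `ExtractThenIndex` and all of its members are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Function;

/** Toy two-stage pipeline: extract text from raw bytes, then index the text. */
final class ExtractThenIndex {

    // Stage 1 (stand-in for a real extractor): turns file bytes into plain text.
    // It only decodes UTF-8; a real extractor would parse the file format.
    static final Function<byte[], String> PLAIN_TEXT_EXTRACTOR =
            bytes -> new String(bytes, StandardCharsets.UTF_8);

    // Stage 2 (stand-in for the index): term -> sorted set of document ids.
    private final Map<String, TreeSet<String>> postings = new HashMap<>();

    /** Extracts text with the given extractor, then indexes every token. */
    void addDocument(String docId, byte[] rawFile, Function<byte[], String> extractor) {
        String text = extractor.apply(rawFile);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of documents containing the term, in sorted order. */
    List<String> search(String term) {
        TreeSet<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        ExtractThenIndex index = new ExtractThenIndex();
        index.addDocument("a.txt",
                "Lucene indexes text".getBytes(StandardCharsets.UTF_8), PLAIN_TEXT_EXTRACTOR);
        index.addDocument("b.txt",
                "Tika extracts text".getBytes(StandardCharsets.UTF_8), PLAIN_TEXT_EXTRACTOR);
        System.out.println(index.search("text"));
    }
}
```

Because the indexer only ever sees the extractor's output, supporting a new file format means swapping the extractor passed to `addDocument`, while the indexing side stays unchanged.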


See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
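The page-by-page idea can be illustrated with a small library-free sketch. This is not pdfindexer's code: the class `PageIndexSketch` and its methods are hypothetical, the page texts are assumed to have been extracted already, matching is a plain case-insensitive substring test instead of a Lucene query, and the open parameter is assumed to be the `#page=N` URL fragment that PDF viewers commonly honor. Keywords and file names are written into the HTML unescaped, so they are assumed to contain no markup characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy per-page keyword index that renders links of the form file.pdf#page=N. */
final class PageIndexSketch {

    /** Maps each keyword to the 1-based numbers of the pages whose text contains it. */
    static Map<String, List<Integer>> indexPages(List<String> pageTexts, List<String> keywords) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (String keyword : keywords) {
            String wanted = keyword.toLowerCase(Locale.ROOT);
            List<Integer> pages = new ArrayList<>();
            for (int i = 0; i < pageTexts.size(); i++) {
                if (pageTexts.get(i).toLowerCase(Locale.ROOT).contains(wanted)) {
                    pages.add(i + 1); // page numbers in the link start at 1
                }
            }
            index.put(keyword, pages);
        }
        return index;
    }

    /** Renders the index as an HTML list with one link per matching page. */
    static String toHtml(String pdfName, Map<String, List<Integer>> index) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            html.append("  <li>").append(entry.getKey()).append(":");
            for (int page : entry.getValue()) {
                html.append(" <a href=\"").append(pdfName).append("#page=").append(page)
                    .append("\">").append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        List<String> pages = List.of("Lucene intro", "Tika and PDFBox", "More Lucene");
        Map<String, List<Integer>> index = indexPages(pages, List.of("lucene", "tika"));
        System.out.print(toHtml("manual.pdf", index));
    }
}
```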
