How can I index PDF, PPT, and XLS files in Lucene? (Java, Python, or PHP: any of these is fine.)
I would also like to know how to add metadata while indexing, so that I can boost some parameters.
There are several frameworks for extracting text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.):
- One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
- Apache POI is another Apache project; it reads and writes Microsoft Office formats (Word, Excel, PowerPoint) directly.
- There are also some commercial alternatives.
You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Supported Document Formats
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
The code will look like this, where stream is an InputStream opened on the file:

    Reader reader = new Tika().parse(stream);
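To tie this to the metadata and boosting part of the question, here is a minimal sketch, not a definitive implementation. It assumes tika-core plus tika-parsers, lucene-core and lucene-queryparser (Lucene 7 or later) on the classpath and Java 11+. The class name, the field names "path", "title" and "content", and the boost value 3.0 are my own choices for illustration. Since index-time boosts were removed in Lucene 7, the sketch stores the metadata as its own field and boosts that field at query time.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class TikaLuceneSketch {

    // Extract text and metadata with Tika and map them to separate Lucene fields.
    static Document toDocument(Path file) throws Exception {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        String body;
        try (InputStream in = Files.newInputStream(file)) {
            // Note: parseToString truncates long documents by default;
            // tika.setMaxStringLength(-1) lifts that limit.
            body = tika.parseToString(in, metadata);
        }
        String title = metadata.get(TikaCoreProperties.TITLE); // may be null

        Document doc = new Document();
        // StringField: indexed as a single token, useful as an identifier.
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        // TextField: analyzed full text. The title is kept in its own field
        // so that it can be weighted separately at query time.
        doc.add(new TextField("title", title == null ? "" : title, Field.Store.YES));
        doc.add(new TextField("content", body, Field.Store.NO));
        return doc;
    }

    // Build a query over both fields in which title matches count three times as much.
    static Query boostedQuery(String userQuery) throws Exception {
        Map<String, Float> boosts = Map.of("title", 3.0f, "content", 1.0f);
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                new String[] {"title", "content"}, new StandardAnalyzer(), boosts);
        return parser.parse(userQuery);
    }

    // Usage: java TikaLuceneSketch <indexDir> <file1> [<file2> ...]
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 1; i < args.length; i++) {
                writer.addDocument(toDocument(Paths.get(args[i])));
            }
        }
    }
}
```

The same pattern should work for any other metadata Tika returns (author, content type, and so on): add one field per value and give it its own entry in the boosts map.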
Lucene indexes text, not files. You'll need a separate step that extracts the text from each file, and then you index that extracted text with Lucene.
See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file whose links open the source PDFs at the matching page by using the corresponding open parameter.
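To illustrate the last step, here is a small sketch of how such an HTML index could be generated. This is not pdfindexer's actual code or output format; the class and method names are hypothetical, and I am assuming the open parameter in question is the "#page=N" URL fragment that PDF viewers commonly honor. The term-to-pages map stands in for what a per-page Lucene search would return.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PdfPageIndexHtml {

    // Build a link that opens the PDF at a 1-based page number
    // using the "page" open parameter.
    static String pageLink(String pdfUrl, int page) {
        return pdfUrl + "#page=" + page;
    }

    // Render term -> pages as an HTML list, terms in alphabetical order.
    // Terms are assumed to be plain words, so no HTML escaping is done here.
    static String render(String pdfUrl, Map<String, List<Integer>> termPages) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, List<Integer>> e : new TreeMap<>(termPages).entrySet()) {
            html.append("  <li>").append(e.getKey()).append(":");
            for (int page : e.getValue()) {
                html.append(" <a href=\"").append(pageLink(pdfUrl, page)).append("\">")
                    .append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical search results: which pages each term occurs on.
        Map<String, List<Integer>> hits = Map.of("lucene", List.of(2, 7), "tika", List.of(4));
        System.out.print(render("manual.pdf", hits));
    }
}
```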