Sitecore text search in PDF or Word documents
Is it possible to configure Sitecore's Lucene search engine to index PDF or Word documents? I've looked at this document on the Sitecore support site (http://sdn.sitecore.net/upload/sitecore6/65/sitecore_search_and_indexing_sc60-65-a4.pdf), but it mentions creating a file crawler class, which suggests to me that this can only be achieved by writing custom code. If I do need to write custom code, would I also need an API to extract the text content from PDF documents?
I've recently had to do something similar on one of my projects. Have a look at "How to index Word 2003, 2007 and 2010 documents using Lucene.NET".
I ended up creating a custom indexer which handles MS Office documents (XP, 2003, 2007 and 2010 formats) and PDF documents:
- For indexing XP-2003 MS Office documents you can use the IFilters built into the OS (assuming you are using Windows Server 2003 or newer).
- For indexing 2007-2010 MS Office documents you will need to install the Microsoft Office 2010 Filter Packs.
- For indexing PDF documents I strongly recommend using Foxit PDF IFilter. It is not free, but does a much better job than the Adobe PDF IFilter.
Note: Don't waste your time with Adobe PDF IFilter: it fails to read valid PDF files and is a lot slower. Foxit IFilter is designed to take advantage of multi-core CPUs and performs much better on large documents.
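To make the format-to-filter mapping above concrete, here is a minimal sketch of how a custom indexer might decide which files it can extract text from. It is written in Python purely for illustration (a real Sitecore crawler would be .NET code), and the names `FILTER_BY_EXTENSION`, `required_filter` and `split_indexable`, as well as the exact list of extensions, are assumptions of mine rather than anything from Sitecore or Lucene.NET. It does not call any IFilter; it only records which filter would be needed.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

# Labels for the three filter sources named in the answer above.
BUILT_IN = "IFilter built into Windows"
OFFICE_2010 = "Microsoft Office 2010 Filter Pack"
FOXIT = "Foxit PDF IFilter"

# Hypothetical extension table; the extensions listed are an assumption.
FILTER_BY_EXTENSION: Dict[str, str] = {
    ".doc": BUILT_IN,
    ".xls": BUILT_IN,
    ".ppt": BUILT_IN,
    ".docx": OFFICE_2010,
    ".xlsx": OFFICE_2010,
    ".pptx": OFFICE_2010,
    ".pdf": FOXIT,
}


def required_filter(file_name: str) -> Optional[str]:
    """Return the filter needed for this file, or None if it is unsupported."""
    return FILTER_BY_EXTENSION.get(Path(file_name).suffix.lower())


def split_indexable(
    file_names: Iterable[str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split files into (name, filter) pairs to index and names to skip."""
    indexable: List[Tuple[str, str]] = []
    skipped: List[str] = []
    for name in file_names:
        needed = required_filter(name)
        if needed is None:
            skipped.append(name)
        else:
            indexable.append((name, needed))
    return indexable, skipped
```

In a real crawler, the text returned by the matching IFilter would then be added to the Lucene document as an indexed field, and skipped files would be logged instead of failing the whole crawl.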