ASP.NET library to extract plain text from Open XML file formats
Is there a pre-existing library to extract plain text from Open XML file formats (e.g. docx, pptx, and xlsx)?
I need this to populate a Lucene.NET index.
I've found this example which extracts text from docx, and it seems to work okay. But before building my own solution based on this, I was wondering if there's something already available for the other file formats?
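For context, a build-your-own approach along those lines can be sketched as follows. It relies on the fact that docx, pptx, and xlsx files are ZIP packages of XML parts. This is only a minimal sketch, not the linked example: it is written in Python for brevity (the same idea should carry over to .NET's ZIP and XML classes), the function name `extract_text` is hypothetical, and the part paths used are the conventional defaults rather than ones resolved through the package relationships.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces of the text-bearing <t> elements in each format.
NAMESPACES = {
    "docx": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "pptx": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "xlsx": "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
}


def _part_names(kind, names):
    """Return the package parts assumed to hold the text (conventional paths)."""
    if kind == "docx":
        return ["word/document.xml"]
    if kind == "xlsx":
        # Assumption: only shared strings are read; numbers and inline
        # strings stored in the sheet parts are ignored by this sketch.
        return ["xl/sharedStrings.xml"]
    if kind == "pptx":
        # Lexicographic order, so slide10 sorts before slide2.
        return sorted(
            n for n in names
            if n.startswith("ppt/slides/slide") and n.endswith(".xml")
        )
    raise ValueError("unsupported format: %r" % kind)


def extract_text(source, kind):
    """Extract plain text from an Open XML package.

    source: a file path or a seekable file-like object.
    kind:   'docx', 'pptx', or 'xlsx'.
    """
    chunks = []
    with zipfile.ZipFile(source) as package:
        names = set(package.namelist())
        tag = "{%s}t" % NAMESPACES.get(kind, "")
        for part in _part_names(kind, names):
            if part not in names:
                continue  # tolerate packages without the expected part
            root = ET.fromstring(package.read(part))
            chunks.extend(el.text for el in root.iter(tag) if el.text)
    # A single space between runs is enough for feeding a search index,
    # but it does not preserve paragraph or cell boundaries.
    return " ".join(chunks)
```

Calling `extract_text("report.docx", "docx")` would return one string that can be added to an index field. Headers, footers, notes, and comments live in other parts and are not covered here.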
Before spending cash, it may be worth looking at the IFilter interface - these were/are designed to do exactly what you want.
http://msdn.microsoft.com/en-us/library/ms691105
http://www.codeproject.com/KB/cs/IFilter.aspx
(There are some further links at the bottom of the CodeProject article.)
Microsoft provides IFilters for Office file types: http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en
I know that we use this technology to index PDFs with Lucene, but I did not write the actual code, so I'm afraid I cannot be of much more help.
If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.
Take a look at aspose.com; they have a good library that handles both ppt and pptx.
You can try Toxy, an open source text/data extraction framework for .NET. For now, it supports xls, xlsx, doc, and docx. Support for pptx is planned for version 1.5, which is due very soon.
For details, you can check here.