开发者

ASP.NET library to extract plain text from Open XML file formats

Is there a pre-existing library to extract plain text form Open XML file formats (e.g. docx, pptx, and xlsx) files?

I require this to populate a lucene.net index.

I've found this example which extracts text from docx and it seems to work okay. But before building my own solution based on this I was wondering if there's something already开发者_如何转开发 available for the other file formats?


Before spending cash, it may be worth looking at the IFilter interface - these were/are designed to do exactly what you want.

http://msdn.microsoft.com/en-us/library/ms691105

http://www.codeproject.com/KB/cs/IFilter.aspx

(Some links at the bottom of the codeprject link).

MS provide IFilters for office file types. http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en

I know that we use this technology to allow us to index PDFs using Lucene but I did not write the actual code and cannot be of much use I am afraid.

If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.


watch aspose.com, they have a good library to handle both ppt and pptx.


You can try Toxy, an open source text/data extraction framework for .NET. For now, it supports xls, xlsx, doc, docx. It will support pptx in version 1.5 very soon.

For detail, you can check here

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜