ASP.NET library to extract plain text from Open XML file formats

2022-12-29 14:35

Is there a pre-existing library to extract plain text from Open XML file formats (e.g. docx, pptx, and xlsx)?

I need this to populate a Lucene.NET index.

I've found this example, which extracts text from docx, and it seems to work okay. But before building my own solution based on it, I was wondering whether something is already available for the other file formats.

Before spending cash, it may be worth looking at the IFilter interface. IFilters were designed to do exactly what you want.

http://msdn.microsoft.com/en-us/library/ms691105

http://www.codeproject.com/KB/cs/IFilter.aspx

(There are some more links at the bottom of the CodeProject article.)

Microsoft provides IFilters for Office file types: http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en

I know that we use this technology to index PDFs with Lucene, but I did not write the actual code, so I'm afraid I can't be of much more help.

If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.

Take a look at aspose.com; they have a good library that handles both ppt and pptx.

You can try Toxy, an open source text/data extraction framework for .NET. It currently supports xls, xlsx, doc, and docx. Support for pptx is expected in version 1.5 very soon.

For details, you can check here
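If you do end up rolling your own, as the question suggests, it may help to know that docx, pptx, and xlsx are all ZIP packages of XML parts, so the docx approach generalizes fairly directly. Below is a minimal, language-agnostic sketch in Python using only the standard library. The function name `extract_text` and the helper `_texts` are made up for illustration and are not taken from any of the libraries mentioned above. It assumes the conventional part names (`word/document.xml`, `ppt/slides/slideN.xml`, `xl/sharedStrings.xml`) and ignores headers, footnotes, speaker notes, numeric cells, and inline strings.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the three Open XML markup languages.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def _texts(package, part_name, tag):
    """Hypothetical helper: text of every `tag` element in one XML part."""
    root = ET.fromstring(package.read(part_name))
    return [node.text for node in root.iter(tag) if node.text]


def extract_text(source):
    """Extract plain text from a docx, pptx, or xlsx (path or file object)."""
    chunks = []
    with zipfile.ZipFile(source) as package:
        names = package.namelist()

        # docx: body text sits in <w:t> runs; join the runs of each <w:p>.
        if "word/document.xml" in names:
            root = ET.fromstring(package.read("word/document.xml"))
            for para in root.iter(W + "p"):
                chunks.append("".join(t.text or "" for t in para.iter(W + "t")))

        # pptx: slide text sits in <a:t> elements, one XML part per slide.
        # Names are sorted lexicographically, which is enough for indexing.
        slides = sorted(
            n for n in names
            if n.startswith("ppt/slides/slide") and n.endswith(".xml")
        )
        for name in slides:
            chunks.extend(_texts(package, name, A + "t"))

        # xlsx: string cell values are pooled in the shared strings part.
        if "xl/sharedStrings.xml" in names:
            chunks.extend(_texts(package, "xl/sharedStrings.xml", S + "t"))

    # Drop empty paragraphs and return one chunk per line.
    return "\n".join(c for c in chunks if c)
```

The returned string can be fed to the indexer as a single field. In .NET the same idea should port to `System.IO.Compression.ZipArchive` plus `System.Xml.Linq.XDocument`, though an existing library or an IFilter will cover far more edge cases than this sketch does.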
