C# Alternatives to Tika [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Closed 8 years ago.

  • Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
  • This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Does anyone know of a C# alternative to Tika that can extract text from HTML, PDF, etc.?


I've implemented a framework called Toxy. It's based on .NET and easier to use than Tika. Please visit http://toxy.codeplex.com


I've got a similar need: a .NET project where I need to pull text out of various files (.XLS, .DOC, .PDF, etc.) for indexing with Lucene.Net.

This blog post seems to be exactly what I'm after: a .NET wrapper around the Tika .jar file!

I'm implementing it now, but if it doesn't work then I'll update my answer here...

Edit: OK, it's up, running, and working well (if a little slowly). There's some pretty nasty dependency wrangling with the IKVM bits, but it's the best alternative that I've found.


Your question is a little vague, but for parsing HTML you can use the Html Agility Pack which gives you full DOM access to the HTML and allows extracting elements using XPath expressions.
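
As a rough illustration of the XPath approach, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The class name `HtmlTextExtractor`, the method `ExtractText`, and the default `//p` expression are made up for this example; only `HtmlDocument`, `LoadHtml`, `SelectNodes`, `InnerText` and `HtmlEntity.DeEntitize` come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of Html Agility Pack itself.
public static class HtmlTextExtractor
{
    // Returns the trimmed, entity-decoded text of every node matched by the XPath expression.
    public static IReadOnlyList<string> ExtractText(string html, string xpath = "//p")
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return Array.Empty<string>();

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // "&amp;" becomes "&"
            .Where(t => t.Length > 0)                               // skip empty elements
            .ToList();
    }
}
```

Calling `HtmlTextExtractor.ExtractText(html)` on a page would give one string per `<p>` element; passing a different XPath expression such as `//h1` targets other elements. This only covers HTML, so PDF and Office formats would still need a separate parser.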


You can use Lucene.Net and try some parsers. I just found this blog, which has some useful links. I hope it helps!

http://kalanir.blogspot.com.ar/2008/08/indexing-pdf-documents-with-lucene.html
