Save a binary file in SQL Server as BLOB and text (or get the text from Full-Text index)

2022-12-24 15:33

Currently we are saving files (PDF, DOC) into the database as BLOB fields. I would like to be able to retrieve the raw text of the file to be able to manipulate it for hit-highlighting and other functions.

Does anyone know of a simple way to parse the files and save the raw text at save time, via either SQL or .NET code? I have found that Adobe has a filtdump utility that will convert a PDF to text. Filtdump seems to be a command-line tool, and I don't see a way to use it with a file stream. And what would the extractor be for Office documents and other file types?

-or-

Is there a way to pull out the raw text from the SQL Full text index, without using 3rd party filters?

Note: I am trying to build a .NET and MS SQL Server solution without having to use a third-party tool such as Lucene.

If it isn't absolutely necessary to stream directly from SQL Server into your app, the hard part is parsing the PDF or DOC file formats.

The iTextSharp library will give you access to the innards of a PDF file:

http://itextsharp.sourceforge.net/
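For what it's worth, here is a minimal sketch of pulling text out of a PDF that has already been read from the BLOB column into a byte array. It assumes an iTextSharp 5.x build, which includes the `PdfTextExtractor` helper (older builds may not have it), and the `PdfText` class name is just a placeholder. It will only return text for PDFs that actually contain a text layer, so scanned documents come back empty.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfText
{
    // Extracts the text of every page from a PDF held in memory.
    // pdfBytes is assumed to be the raw BLOB content loaded from the database.
    public static string Extract(byte[] pdfBytes)
    {
        var sb = new StringBuilder();
        var reader = new PdfReader(pdfBytes);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }
        return sb.ToString();
    }
}
```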

Here's a commercial product that claims to parse Word docs:

Aspose.Words

Edited to add:

I think you're also asking if there are ways to make SQL Server Full-text Indexing do the work for you by adding IFilters. This sounds like a good idea. I haven't done this myself, but MS has apparently supported a Word filter for a long time, and now Adobe has released a (free) PDF filter. There's a lot of information here:

Filter Central

10 Ways to Optimize SQL Server Full-text Indexing

SQL Server Full Text Search: Language Features - a little out of date but easy to understand.

The SQL Server Full-Text Search feature uses IFilters to extract plain text from PDF or Office file formats. You can install the IFilters on your server, or, if your code is running on the same machine as SQL Server, you already have them.

Here is an article which shows how to use IFilters from .NET: http://www.codeproject.com/KB/cs/IFilter.aspx

From your C# application, you could open the .doc file, save it as text, and put both the text and the original .doc document into the database.
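A rough sketch of the "store both" part, using plain ADO.NET. The table `dbo.Documents` and its columns `FileName`, `Content` and `ExtractedText` are made-up names for illustration, and the extracted text is assumed to come from whatever extractor you end up using.

```csharp
using System.Data;
using System.Data.SqlClient;

public static class DocumentStore
{
    // Saves the original binary and its extracted plain text in one row.
    // dbo.Documents and its column names are hypothetical.
    public static void Save(string connectionString, string fileName,
                            byte[] content, string extractedText)
    {
        const string sql =
            "INSERT INTO dbo.Documents (FileName, Content, ExtractedText) " +
            "VALUES (@name, @content, @text);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@name", SqlDbType.NVarChar, 260).Value = fileName;
            // A size of -1 maps to varbinary(max) / nvarchar(max).
            cmd.Parameters.Add("@content", SqlDbType.VarBinary, -1).Value = content;
            cmd.Parameters.Add("@text", SqlDbType.NVarChar, -1).Value = extractedText;

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

With the text in its own column, hit-highlighting can work on `ExtractedText` directly without touching the binary again.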

If you are using SQL Server 2008, then you could consider using the new FILESTREAM feature.

Your data is stored in a varbinary(max) column, but you can also access the raw data via a regular Win32 handle.

Here's some sample code showing how to get the handle.
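As a rough sketch of that approach from .NET: the `SqlFileStream` class (System.Data.SqlTypes, .NET 3.5 SP1 and later) wraps the Win32 handle for you. The table `dbo.Documents`, the FILESTREAM column `Content` and the key column `Id` are assumed names. As far as I know, this kind of access also requires a connection using integrated security, and `Stream.CopyTo` needs .NET 4.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;

public static class FileStreamBlob
{
    // Reads a FILESTREAM value through SqlFileStream instead of through T-SQL.
    // dbo.Documents, Content and Id are hypothetical names.
    public static byte[] Read(string connectionString, int documentId)
    {
        const string sql =
            "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
            "FROM dbo.Documents WHERE Id = @id;";

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // FILESTREAM access has to happen inside a transaction.
            using (var tx = conn.BeginTransaction())
            {
                string path;
                byte[] txContext;
                using (var cmd = new SqlCommand(sql, conn, tx))
                {
                    cmd.Parameters.Add("@id", SqlDbType.Int).Value = documentId;
                    using (var reader = cmd.ExecuteReader(CommandBehavior.SingleRow))
                    {
                        if (!reader.Read())
                            throw new InvalidOperationException("Document not found.");
                        path = reader.GetString(0);      // logical path of the BLOB
                        txContext = (byte[])reader[1];   // token tying the stream to tx
                    }
                }

                byte[] data;
                using (var fs = new SqlFileStream(path, txContext, FileAccess.Read))
                using (var ms = new MemoryStream())
                {
                    fs.CopyTo(ms);
                    data = ms.ToArray();
                }

                // Commit only after the stream has been closed.
                tx.Commit();
                return data;
            }
        }
    }
}
```

For large files you would hand the stream straight to the text extractor rather than copying it into memory first.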

I had this same issue... I solved it by adding the following to my application:

EPocalipse.IFilter.dll (for everything -but- Office 2007 documents, due to issues on 64-bit Windows)
OpenXML SDK 2.0 (for Office 2007 Documents)

I use these to grab the plain text and then store it in the database alongside the binary data. I am certainly not an expert, so there may be a better way to do this, but it works for everything except pre-2007 Word documents saved with "Quick Save", which apparently aren't read by IFilters. I just have my users resave the document if that error occurs, and everything works fine.
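For the Office 2007 half, a minimal sketch with the Open XML SDK might look like the following. It is not the code this answer refers to, only an illustration. It handles .docx files only, and `DocxText` is a placeholder name.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxText
{
    // Extracts plain text from a .docx held in memory, one line per paragraph.
    // docxBytes is assumed to be the raw BLOB content loaded from the database.
    public static string Extract(byte[] docxBytes)
    {
        using (var ms = new MemoryStream(docxBytes))
        using (var doc = WordprocessingDocument.Open(ms, false)) // false = read-only
        {
            var body = doc.MainDocumentPart.Document.Body;
            if (body == null)
                return string.Empty;

            var sb = new StringBuilder();
            // Descendants also picks up paragraphs inside tables; text nested in
            // structures such as text boxes may therefore appear more than once.
            foreach (var paragraph in body.Descendants<Paragraph>())
            {
                sb.AppendLine(paragraph.InnerText);
            }
            return sb.ToString();
        }
    }
}
```

Headers, footers and footnotes live in separate parts of the package and are not covered by this sketch.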

Let me know if you'd like some sample code... I would post it right now, but it's a bit long.
