Save a binary file in SQL Server as BLOB and text (or get the text from Full-Text index)
Currently we are saving files (PDF, DOC) into the database as BLOB fields. I would like to be able to retrieve the raw text of the file to be able to manipulate it for hit-highlighting and other functions.
Does anyone know of a simple way to parse the files and save the raw text on save, via either SQL or .NET code? I have found that Adobe has a filtdump utility that will convert the PDF to text. Filtdump seems to be a command-line tool, and I don't see a way to use a file stream. And what would the extractor be for Office documents and other file types?
-or-
Is there a way to pull out the raw text from the SQL Full text index, without using 3rd party filters?
Note: I am trying to build a .NET and MS SQL solution without having to use a third-party tool such as Lucene.
If it isn't absolutely necessary to stream directly from SQL Server into your app, the hard part is parsing the PDF or DOC file formats.
The iTextSharp library will give you access to the innards of a PDF file:
http://itextsharp.sourceforge.net/
Here's a commercial product that claims to parse Word docs:
Aspose.Words
Edited to add:
I think you're also asking if there are ways to make SQL Server Full-text Indexing do the work for you by adding IFilters. This sounds like a good idea. I haven't done this myself, but MS has apparently supported a Word filter for a long time, and now Adobe has released a (free) PDF filter. There's a lot of information here:
Filter Central
10 Ways to Optimize SQL Server Full-text Indexing
SQL Server Full Text Search: Language Features - a little out of date but easy to understand.
The SQL Server Full-Text Search feature uses IFilters to extract plain text from PDF or Office file formats. You can install IFilters on your server, or, if your code is running on the same machine as SQL Server, you already have them.
Here is an article which shows how to use IFilters from .NET: http://www.codeproject.com/KB/cs/IFilter.aspx
From your C# application, you could open the .doc file, save it as text, and put both the text and the .doc document into the database.
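As a rough sketch of the "store both" half of that idea: the table `dbo.Documents` and its columns `FileName`, `Content`, and `PlainText` are made-up names for illustration, and the text is assumed to have been extracted already by whatever parser you choose.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class DocumentStore
{
    // Hypothetical table (not from the question):
    //   dbo.Documents(FileName NVARCHAR(260),
    //                 Content   VARBINARY(MAX),
    //                 PlainText NVARCHAR(MAX))
    public static void Save(string connectionString, string fileName,
                            byte[] content, string plainText)
    {
        const string sql =
            "INSERT INTO dbo.Documents (FileName, Content, PlainText) " +
            "VALUES (@name, @content, @text)";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@name", SqlDbType.NVarChar, 260).Value = fileName;
            // Size -1 maps the parameter to VARBINARY(MAX) / NVARCHAR(MAX).
            cmd.Parameters.Add("@content", SqlDbType.VarBinary, -1).Value = content;
            // Store NULL when no text could be extracted.
            cmd.Parameters.Add("@text", SqlDbType.NVarChar, -1).Value =
                (object)plainText ?? DBNull.Value;

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

Keeping the text in its own column means hit-highlighting can work on `PlainText` directly, without re-parsing the binary on every request.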
If you are using SQL 2008, then you could consider using the new FILESTREAM feature.
Your data is stored in a varbinary(max) column, but you can also access the raw data via a regular Win32 handle.
Here's some sample code showing how to get the handle.
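For reference, a minimal managed sketch of the same idea, using the `SqlFileStream` wrapper from `System.Data.SqlTypes` rather than the raw Win32 handle. The table `dbo.Documents` with an `Id` key and a FILESTREAM column `Content` is an assumption for illustration. As far as I know, this kind of access also needs a connection that uses integrated security.

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.IO;

public static class FileStreamAccess
{
    // Hypothetical table: dbo.Documents(Id INT, Content VARBINARY(MAX) FILESTREAM)
    public static byte[] ReadContent(string connectionString, int id)
    {
        const string sql =
            "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
            "FROM dbo.Documents WHERE Id = @id";

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // FILESTREAM access must happen inside a transaction.
            using (var tx = conn.BeginTransaction())
            {
                string path = null;
                byte[] txContext = null;

                using (var cmd = new SqlCommand(sql, conn, tx))
                {
                    cmd.Parameters.Add("@id", SqlDbType.Int).Value = id;
                    using (var reader = cmd.ExecuteReader())
                    {
                        // PathName() is NULL when the column value is NULL.
                        if (reader.Read() && !reader.IsDBNull(0))
                        {
                            path = reader.GetString(0);
                            txContext = (byte[])reader[1];
                        }
                    }
                }

                // No row or no data: the transaction rolls back on dispose.
                if (path == null) return null;

                byte[] data;
                using (var file = new SqlFileStream(path, txContext, FileAccess.Read))
                using (var buffer = new MemoryStream())
                {
                    file.CopyTo(buffer);
                    data = buffer.ToArray();
                }

                // Commit only after the file stream has been closed.
                tx.Commit();
                return data;
            }
        }
    }
}
```

For large files you would pass the `SqlFileStream` straight to the text extractor instead of buffering it all in memory.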
I had this same issue... I solved it by adding the following to my application:
- EPocalipse.IFilter.dll (for everything -but- Office 2007 documents, due to x64 Windows issues)
- OpenXML SDK 2.0 (for Office 2007 Documents)
I use these to grab the plain text and then store it in the database alongside the binary data. Keep in mind that I am certainly not an expert, so there may be a better way to do this, but this works for everything but "Quick Save" pre-2007 Word documents, which apparently aren't read by IFilters. I just have my users resave the document if that error occurs, and everything works fine.
Let me know if you'd like some sample code... I would post it right now, but it's a bit long.
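In the meantime, here is a short sketch of what the Open XML SDK half could look like for a .docx file. This is an illustration under assumptions, not the exact code from my application: it needs a reference to DocumentFormat.OpenXml, and the class and method names are made up.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxTextExtractor
{
    // Returns the plain text of a .docx file held in memory,
    // one line per paragraph.
    public static string Extract(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        // Second argument false = open the package read-only.
        using (var doc = WordprocessingDocument.Open(stream, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // InnerText concatenates the text of all runs in a paragraph.
            // Descendants also walks into tables; a paragraph nested inside
            // another (e.g. in a text box) would appear twice.
            var lines = body.Descendants<Paragraph>()
                            .Select(p => p.InnerText);

            return string.Join("\n", lines.ToArray());
        }
    }
}
```

The returned string is what gets stored next to the binary data. Older .doc files and PDFs still go through the IFilter route.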