开发者

Full-text searching a SQL Server 2008 FileStream

I have a SQL Server 2008 database with a table that is FileStream enabled. I'll refer to this table as Tbl_FileStream for the remainder of this question.

Tbl_FileStream contains hundreds of thousands of files, ranging from PDF's to JPG's to TXT files.

Aslo, Tbl_FileStream has a full-text index created upon the FileStream. The full-text index works wonderfully, and I have a stored pr开发者_运维知识库ocedure that does full-text searches upon it (using CONTAINSTABLE and RANK) and it works great as well.

However, I find myself in a pickle with regards to what the full-text search is able to return to me when it gets a hit while searching upon the FileStream. For instance, we're I to search for the phrase "et dolore" then my search would yield results that indicate 59 documents matched the search term. Of course, I can get the titles of the documents that were found to match because I am storing the document titles within Tbl_FileStream, but what I really need is to get the text surrounding the search term within the actual file.

For instance, assume that I have a text document with the following latin in it --

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Using SQL Server's full-text search capability, were I to search upon the words "et dolore" then what I really need returned to me is some arbitrary number of words (10 or so) proceeding where the search term was found within the document, so I'd actually end up with some phrase like "... sed diam nonumy eirmod tempor invidunt ut labore et dolore ...".

And for those that are wondering why in the world anyone would want to do this, the reason is because the consumer wants to have a quick glance at the context in which the search term was found within each hit. In other words, they want to glance quickly through the results of the search and try to see which of the document hits likely contains what they're looking for.

Is this possible to do within SQL Server 2008?

If not then are there any backends out there that support this?

Thanks in advance for all suggestions and help that I receive upon this difficult topic.


Using only SQL Server 2008

If you stick to SQL Server 2008, you will have to store the text contents of every file in the database on which you want to search. That means for image file types, you would have to use an OCR program on the file and keep a copy of the text in the database so that it can be searched. That also means you will have to do jump through hoops if for whatever reason the contents of the target file is over 2 GB.

So, let's presume you have the following column in your file metadata table like so:

TextContents nvarchar(max) null.

We might then extract the context using something akin to:

Declare @SearchTerm nvarchar(max)
Declare @MaxResultTextLen int

Set @SearchTerm = 'et dolore'
Set @MaxResultTextLen = 100

Select  CharIndex(@SearchTerm, F.TextContents),
    Case 
    When CharIndex(@SearchTerm, F.TextContents) <= @MaxResultTextLen 
        Then Substring(F.TextContents, 1, @MaxResultTextLen) + '...'
    Else Substring(@SearchTerm
        , CharIndex(@SearchTerm, R.TextContents) 
                - @MaxResultTextLen + Len(@SearchTerm)
        , @MaxResultTextLen) + '...'
    End As TextContext
From Files As F
Where Contains(F.TextContents, @SearchTerm)

Use a third-party search engine

A solution that gets closer to what you want to achieve is to use a third-party search engine that will return back the context with the found search term. However, again, image files and image-type PDFs will have to be OCR'd so that the engine can search the text contents. Doing a quick search on one engine called dtSearch (I haven't used nor work for them), I see that it provides the ability to display "highlighted hits" meaning to display the context of the located term in the file to the user.

dtSearch


Unfortunately, what you're looking for cannot easily be done with the current versions of Sql Server.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜