Indexing a large number of XML files
I have a difficult problem lying before me and I thought it best that I seek some guidance from the community before formulating a plan of attack on my own.
I have a couple thousand XML files that I need to make searchable from a SQL Server 2008 database. The XML files currently reside on disk and are not part of any repository. What I mean by "searchable" is that I need to be able to do something like this (pseudo-code here):
SELECT *
FROM tbl_xmldata
WHERE CONTAINS(xmldata, '"some search word"')
tbl_xmldata would be the table where the XML files are being stored, and xmldata would be the column with the actual XML data.
The last requirement (and this is actually a tough one) is that when a hit is found (and by 'hit' I mean that an XML file was found to contain the term being searched for) I need to have access to the wording that surrounds the place where the search term was found. For instance, if I had an XML file that had the following in it:
<root> We hold these truths to be self-evident, that all men are created equal </root>
and I searched upon the word "self-evident", then I need to be able to return around 20 characters before and after where the search term was found. I only bring up this last point because - in my experience anyway - SQL Server's full-text indexing is limited in that it can only tell you if a term/word/phrase is located in a particular document (assuming that the document is stored in a SQL Server 2008 filestream) and it can't tell you the context in which the term/word/phrase was located.
Any help would be greatly appreciated! Thanks!
Take a look at the Solr project. A less mature but very promising alternative is Elasticsearch.
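Whichever engine ends up finding the hit, the context requirement from the question (roughly 20 characters on either side of the term) can be illustrated on its own. The following is only a minimal sketch in Python, not how Solr or SQL Server do it internally; the function name find_snippets and the width parameter are made up for illustration, and it assumes the XML is small enough to parse in memory and that matching should be case-insensitive.

```python
import re
import xml.etree.ElementTree as ET


def find_snippets(xml_text, term, width=20):
    """Return every occurrence of `term` with up to `width` characters
    of surrounding text on each side (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Concatenate all text nodes and collapse runs of whitespace,
    # so the offsets refer to readable text rather than markup.
    text = " ".join("".join(root.itertext()).split())

    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)      # clamp at start of text
        end = min(len(text), match.end() + width)  # clamp at end of text
        snippets.append(text[start:end])
    return snippets


doc = ("<root> We hold these truths to be self-evident, "
       "that all men are created equal </root>")
snippets = find_snippets(doc, "self-evident")
for s in snippets:
    print(repr(s))
```

In practice the search engine would identify which files match, and something like this would run only over those files to build the excerpt. Dedicated search servers generally offer a highlighting feature that returns such excerpts directly, which is likely the better route at scale.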