Fast search in XML files in .NET (or How to index XML files)
I have to implement a search feature that can quickly run arbitrarily complex queries against XML data. When the user makes a query, all XML files must be searched to find possible matches. Users will have lots of XML files (tens of thousands or more), each typically a few kilobytes in size. All the XML files have almost the same structure.
I already benchmarked XPath; it is too slow for my needs.
How can this be done most efficiently? Is it possible to create indexes for the contents of the XML files (preserving content semantics, not just plain full-text search)?
Would it be useful to put the XML data into an (embedded) SQL database and run the queries with SQL?
What other possibilities do I have?
Don't try to reinvent the wheel!
I would import the XML into a database (e.g. SQLite), along with metadata about each XML file, and query that.
Edit 1:
You could implement a 'drop folder' which is indexed/imported on first run. A folder watcher can then re-import ONLY new or changed XML files. SQLite can be run in memory for the fastest I/O performance.
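As a rough illustration of the import-and-query idea above, here is a minimal sketch. It is written in Python only to keep it short; the same shape should carry over to .NET with an SQLite binding. The table layout (one row per element or attribute value, keyed by its path) and the names `flatten`, `build_index`, `find_docs` and `SAMPLE_DOCS` are assumptions made for this example, not something taken from the question.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample documents standing in for files in the drop folder.
SAMPLE_DOCS = [
    ("a.xml", "<order id='1'><customer>Alice</customer><total>30</total></order>"),
    ("b.xml", "<order id='2'><customer>Bob</customer><total>45</total></order>"),
    ("c.xml", "<order id='3'><customer>Alice</customer><total>12</total></order>"),
]


def flatten(elem, prefix=""):
    """Yield (path, value) pairs for element text and attributes."""
    path = prefix + "/" + elem.tag
    text = (elem.text or "").strip()
    if text:
        yield path, text
    for attr, value in elem.attrib.items():
        yield path + "/@" + attr, value
    for child in elem:
        yield from flatten(child, path)


def build_index(docs):
    """Import (name, xml_text) pairs into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE field (doc TEXT, path TEXT, value TEXT)")
    # The index on (path, value) is what makes lookups fast.
    conn.execute("CREATE INDEX idx_field ON field (path, value)")
    for name, xml_text in docs:
        root = ET.fromstring(xml_text)
        rows = [(name, path, value) for path, value in flatten(root)]
        conn.executemany("INSERT INTO field VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def find_docs(conn, path, value):
    """Return the names of documents where `path` has exactly `value`."""
    cur = conn.execute(
        "SELECT DISTINCT doc FROM field WHERE path = ? AND value = ? ORDER BY doc",
        (path, value),
    )
    return [row[0] for row in cur]
```

With the sample data, `find_docs(build_index(SAMPLE_DOCS), "/order/customer", "Alice")` returns the two files whose customer is Alice. A folder watcher would delete and re-insert the rows of a single `doc` whenever that file changes, rather than rebuilding everything.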
The fastest way is to create your own in-memory model of the data in the XML files: convert it to simple objects and simple types, and organize it in the structure that suits your queries best. Add further indexes as appropriate for your problem (using Dictionary/SortedDictionary). This approach will be significantly faster than using an SQL database, which in turn will be a lot faster than querying each XML file. Depending on the complexity of your queries, this could range from fairly simple to very hard, in which case you should definitely go for an embedded database.
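A minimal sketch of such an in-memory model, again in Python for brevity: a dict plays the role of Dictionary and a sorted list with binary search stands in for SortedDictionary. The `Order` record, its `customer` and `total` fields, and the sample documents are hypothetical; a real model would mirror whatever structure the XML files share.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Hypothetical sample documents; real ones would be read from disk once.
SAMPLE_DOCS = [
    ("a.xml", "<order><customer>Alice</customer><total>30</total></order>"),
    ("b.xml", "<order><customer>Bob</customer><total>45</total></order>"),
    ("c.xml", "<order><customer>Alice</customer><total>12</total></order>"),
]


@dataclass(frozen=True)
class Order:
    """Simple typed object extracted from one XML file."""
    doc: str
    customer: str
    total: float


def parse_order(doc, xml_text):
    """Convert one XML document into an Order."""
    root = ET.fromstring(xml_text)
    return Order(
        doc=doc,
        customer=root.findtext("customer", default=""),
        total=float(root.findtext("total", default="0")),
    )


class OrderIndex:
    """Holds all orders plus one index per query pattern."""

    def __init__(self, orders):
        orders = list(orders)
        # Exact-match index: customer name -> orders.
        self.by_customer = defaultdict(list)
        for order in orders:
            self.by_customer[order.customer].append(order)
        # Range index: orders sorted by total, with a parallel key list.
        self.by_total = sorted(orders, key=lambda o: o.total)
        self._totals = [o.total for o in self.by_total]

    def with_customer(self, name):
        return list(self.by_customer.get(name, []))

    def total_between(self, low, high):
        """Orders with low <= total <= high, found by binary search."""
        start = bisect_left(self._totals, low)
        end = bisect_right(self._totals, high)
        return self.by_total[start:end]
```

Each additional query pattern needs its own index, which is where the effort grows as queries become more complex.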
SQL Server 2005 and later allow you to create XML indexes. The queries can be performed on the SQL server, without retrieving the XML data on the application side. This feature is present in the free Express edition.
For indexing the contents of the XML: use Lucene (via a .NET-based implementation such as Lucene.Net). This will let you quickly retrieve the XML docs that contain specific values; you can then examine just those documents more closely.
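To show what such an index does conceptually, here is a toy inverted index in Python. It is not Lucene and does not use any Lucene API; it only illustrates the idea of mapping (field, term) pairs to the documents that contain them, so that a query touches a few small sets instead of every file. The function names and sample documents are made up for this sketch.

```python
import re
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical sample documents.
SAMPLE_DOCS = [
    ("a.xml", "<note><title>XML indexing</title><body>Use an inverted index</body></note>"),
    ("b.xml", "<note><title>SQL tips</title><body>Index your XML columns</body></note>"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map (element tag, term) -> set of document names containing it."""
    index = defaultdict(set)
    for name, xml_text in docs:
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            for term in tokenize(elem.text or ""):
                index[(elem.tag, term)].add(name)
    return index


def search(index, *clauses):
    """Return sorted names of documents matching every (field, term) clause."""
    postings = [index.get((field, term.lower()), set()) for field, term in clauses]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

Because terms are stored per element tag, a query such as `search(index, ("title", "xml"), ("body", "index"))` keeps some of the content semantics the question asks for, rather than treating each file as one blob of text.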