What is the fastest way to search through XML?

Suppose I have an XML file that I use as a local database, like this:

<root>
 <address>
  <firstName></firstName>
  <lastName></lastName>
  <phone></phone>
 </address>
</root>

I have a couple of questions:

1. What is the fastest way to find the address (or addresses) in the XML where firstName contains 'er', for example?

2. Is it possible to do this without loading the whole XML file into memory?

P.S. I am not looking for alternatives to the XML file. Ideally I need a search whose cost does not depend on the number of addresses in the file, but I am a realist, and it seems to me that this is not possible.

Update: I am using .NET 4.

Thanks for the suggestions, but this is more of a scientific task than a practical one. I am probably looking for something faster than LINQ and XmlTextReader.


LINQ to XML works pretty well:

XDocument doc = XDocument.Load("myfile.xml");
var addresses = from address in doc.Root.Elements("address")
                // The cast yields null (instead of throwing) when <firstName> is missing.
                where ((string)address.Element("firstName") ?? "").Contains("er")
                select address;

UPDATE: Have a look at this question on StackOverflow: Best way to search data in xml files?.

Marc Gravell's accepted answer there suggests using SQL indexing:

First: how big are the xml files? XmlDocument doesn't scale to "huge"... but can handle "large" OK.

Second: can you perhaps put the data into a regular database structure (perhaps SQL Server Express Edition), index it, and access via regular TSQL? That will usually out-perform an xpath search. Equally, if it is structured, SQL Server 2005 and above supports the xml data-type, which shreds data - this allows you to index and query xml data in the database without having the entire DOM in memory (it translates xpath into relational queries).

UPDATE 2: See also another link, taken from that question, which explains how the structure of the XML affects performance: http://www.15seconds.com/issue/010410.htm


If you have .NET 3.5+, consider using LINQ To XML.

Some sample code to give you an idea (lifted and liberally modified from the article):

IEnumerable<string> addresses =
    from inv in customer.Descendants("Invoice")
    where inv.Attribute("ProductName").Value.StartsWith("er")
    select (string) inv.Attribute("StreetAddress");


And what about XmlReader? I think it could be the fastest way.

I tried it on a file of approximately 110 MB and it took about 1.1 seconds. The same file with LINQ to XML (above) takes about 3 seconds.

// Requires: using System; using System.Diagnostics; using System.Xml;
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;

String firstName = "", lastName = "", phone = "";
String lastTagName = "";
Boolean bItemFound = false;
long nCounter = 0;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();

// Dispose the reader so that the file handle is released.
using (XmlReader reader = XmlReader.Create("C:\\Temp\\items.xml", settings))
{
    reader.MoveToContent();
    // Stream through the file one node at a time.
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                lastTagName = reader.Name;

                if (lastTagName == "address")
                {
                    // New record: count it and clear the previous record's values.
                    nCounter++;
                    firstName = lastName = phone = "";
                    bItemFound = false;
                }
                break;
            case XmlNodeType.Text:
                switch (lastTagName)
                {
                    case "firstName":
                        firstName = reader.Value;
                        // "97331" is the search term used with the test file.
                        bItemFound = firstName.Contains("97331");
                        break;
                    case "lastName":
                        lastName = reader.Value;
                        break;
                    case "phone":
                        phone = reader.Value;
                        break;
                }
                break;
            case XmlNodeType.EndElement:
                // Print only once the whole <address> has been read, so that
                // lastName and phone belong to the matching record.
                if (reader.Name == "address" && bItemFound)
                {
                    Console.Write("{0}\n{1}\n{2}\n", firstName, lastName, phone);
                    bItemFound = false;
                }
                lastTagName = "";
                break;
        }
    }
}

stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
    ts.Hours, ts.Minutes, ts.Seconds,
    ts.Milliseconds / 10);
Console.WriteLine("RunTime " + elapsedTime);
Console.WriteLine("Searched items: {0}", nCounter);

Console.ReadKey();


You can use XmlTextReader if you don't want to read the whole file into memory. Such a solution will probably run faster, but it will involve more coding.
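
One possible middle ground, sketched here as an assumption rather than a measured recommendation, is to let an XmlReader walk the file and materialize only one <address> element at a time with XNode.ReadFrom, so the query can still be written in LINQ. The class name AddressSearch, the method StreamAddresses and the inline sample data are made up for illustration; with a real file you would pass XmlReader.Create(path) instead of the StringReader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class AddressSearch
{
    // Yields one <address> element at a time. Only the current element is
    // materialized, so memory use does not grow with the size of the file.
    static IEnumerable<XElement> StreamAddresses(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "address")
            {
                // ReadFrom consumes the element and leaves the reader on the
                // node after it, so no extra Read() call is needed here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    static void Main()
    {
        // Inline sample standing in for the XML file (illustrative data only).
        const string xml =
            "<root>" +
            "<address><firstName>Peter</firstName><lastName>A</lastName><phone>1</phone></address>" +
            "<address><firstName>Anna</firstName><lastName>B</lastName><phone>2</phone></address>" +
            "</root>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // The cast returns null for a missing <firstName>, hence the ?? "".
            var matches = StreamAddresses(reader)
                .Where(a => ((string)a.Element("firstName") ?? "").Contains("er"));

            foreach (XElement address in matches)
                Console.WriteLine((string)address.Element("lastName"));
        }
    }
}
```

This still visits every address once, so the running time remains proportional to the size of the file; it only keeps memory use flat.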


I'm worried you might be trying to optimize something that does not need it. How many email addresses are we talking about? Most of the time you would read in the input and build a structure that supports the kind of queries you will be running.

There are trees that can get to the kind of results you are looking for in O(log n) time. And you can store a ton of addresses in even a small amount of memory.
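
The answer does not name a particular structure. As one illustration (my assumption, not necessarily what was meant), a "contains" query can be served from a sorted list of all name suffixes: a substring of a name is always a prefix of one of its suffixes, so a binary search finds the first candidate in O(log n) comparisons and the matches sit next to each other. SubstringIndex and the sample names below are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of any library.
class SubstringIndex
{
    // Every suffix of every name, paired with the index of the record it came
    // from, sorted by suffix.
    private readonly List<KeyValuePair<string, int>> suffixes =
        new List<KeyValuePair<string, int>>();

    public SubstringIndex(IList<string> names)
    {
        for (int record = 0; record < names.Count; record++)
            for (int start = 0; start < names[record].Length; start++)
                suffixes.Add(new KeyValuePair<string, int>(
                    names[record].Substring(start), record));

        suffixes.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));
    }

    // Returns the indexes of all records whose name contains term.
    public IEnumerable<int> Find(string term)
    {
        // Binary search for the first suffix that is >= term.
        int lo = 0, hi = suffixes.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(suffixes[mid].Key, term) < 0) lo = mid + 1;
            else hi = mid;
        }

        // All suffixes that start with term are adjacent from that point on.
        var records = new SortedSet<int>();
        for (int i = lo; i < suffixes.Count &&
             suffixes[i].Key.StartsWith(term, StringComparison.Ordinal); i++)
            records.Add(suffixes[i].Value);
        return records;
    }
}

static class Program
{
    static void Main()
    {
        // Illustrative data; in practice the names would be read from the XML once.
        var names = new List<string> { "Peter", "Anna", "Werner" };
        var index = new SubstringIndex(names);

        foreach (int record in index.Find("er"))
            Console.WriteLine(names[record]);
    }
}
```

Storing every suffix as its own string trades memory for lookup speed, which is acceptable for short fields such as first names. The index also has to be built by reading the whole file once, so it pays off only when many searches are run against the same data.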


If you really must not do this on the server side, you can do it with regular expressions. But loading the XML into memory would be faster, I think.
