comparing the contents of two huge text files quickly

Posted 2023-01-29 17:28

What I'm basically trying to do is compare two HUGE text files and, if they match, write out a string. I have this written, but it's extremely slow. I was hoping you guys might have a better idea. In the example below I'm comparing splitcollect[3] with spltifound[0]:

        // Note: the variable names are swapped relative to the file names;
        // collectionlist holds the path lines, foundlist the "name|folder" lines.
        // start and end are counters declared elsewhere.
        string[] collectionlist = File.ReadAllLines(@"C:\found.txt");
        string[] foundlist = File.ReadAllLines(@"C:\collection_export.txt");
        foreach (string found in foundlist)
        {
            string[] spltifound = found.Split('|');
            string matchfound = spltifound[0].Replace(".txt", "");
            // Linear scan of every path line for every "name|folder" line.
            foreach (string collect in collectionlist)
            {
                string[] splitcollect = collect.Split('\\');
                string matchcollect = splitcollect[3].Replace(".txt", "");
                if (matchcollect == matchfound)
                {
                    end++;
                    long finaldest = (start - end);
                    Console.WriteLine(finaldest);
                    // Reopens and closes the output file on every match.
                    File.AppendAllText(@"C:\copy.txt", "copy \"" + collect + "\" \"C:\\OUT\\" + spltifound[1] + "\\" + spltifound[0] + ".txt\"\n");
                    break;
                }
            }
        }

Sorry for the vagueness guys,

What I'm trying to do is simply this: if content from one file exists in another, write out a string (the string isn't important; what matters is the time it takes to find the two matching entries). collectionlist is like this:

Apple|Farm

foundlist is like this:

C:\cow\horse\turtle.txt

C:\cow\pig\apple.txt

What I'm doing is taking apple from collectionlist and finding the line that contains apple in foundlist, then writing out a basic Windows copy batch file. Sorry for the confusion.

Answer (all credit to Slaks):

        // start and end are counters declared elsewhere.
        string[] foundlist = File.ReadAllLines(@"C:\found.txt");
        // Index the "name|folder" lines by name, built once up front.
        var collection = File.ReadLines(@"C:\collection_export.txt")
            .ToDictionary(s => s.Split('|')[0].Replace(".txt", ""));

        using (var writer = new StreamWriter(@"C:\Copy.txt"))
        {
            foreach (string found in foundlist)
            {
                string matchFound = Path.GetFileNameWithoutExtension(found);

                string collectedLine;
                if (collection.TryGetValue(matchFound, out collectedLine))
                {
                    string[] collectlinesplit = collectedLine.Split('|');
                    end++;
                    long finaldest = (start - end);
                    Console.WriteLine(finaldest);
                    writer.WriteLine("copy \"" + found + "\" \"C:\\O\\" + collectlinesplit[1] + "\\" + collectlinesplit[0] + ".txt\"");
                }
            }
        }

Call File.ReadLines() (.NET 4) instead of ReadAllLines() (.NET 2.0).
ReadAllLines needs to build an array to hold the return value, which can be extremely slow for large files.
If you're not using .NET 4.0, replace it with a StreamReader.
Build a Dictionary<string, string> keyed by the matchCollects (once), then loop through the foundList and check whether the dictionary contains matchFound.
This allows you to replace the O(n) inner loop with an O(1) hash check.
Use a StreamWriter instead of calling AppendAllText.
EDIT: Call Path.GetFileNameWithoutExtension and the other Path methods instead of manually manipulating strings.

For example:

// foundlist holds the "name|folder" lines, as in the question's code.
var foundlist = File.ReadLines(@"C:\collection_export.txt");

// Index the path lines by bare file name, built once up front.
var collection = File.ReadLines(@"C:\found.txt")
    .ToDictionary(s => s.Split('\\')[3].Replace(".txt", ""));

using (var writer = new StreamWriter(@"C:\Copy.txt")) {
    foreach (string found in foundlist) {
        string[] splitFound = found.Split('|');
        string matchFound = Path.GetFileNameWithoutExtension(splitFound[0]);

        string collectedLine;
        // Hash lookup replaces the inner loop from the question.
        if (collection.TryGetValue(matchFound, out collectedLine)) {
            end++;
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            writer.WriteLine("copy \"" + collectedLine + "\" \"C:\\OUT\\"
                           + splitFound[1] + "\\" + splitFound[0] + ".txt\"");
        }
    }
}
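
For the pre-.NET 4 case mentioned above, a StreamReader can be wrapped in an iterator so lines still come back one at a time instead of as one big array. This is only a rough sketch: the LineReader class and its method names are made up for illustration and are not part of the framework.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not a framework type.
static class LineReader
{
    // Lazily yields lines from any TextReader, one at a time,
    // so the whole input never has to sit in an array.
    public static IEnumerable<string> ReadLines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }

    // Overload for a file path; the file stays open only while
    // the caller is enumerating and is closed afterwards.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (var reader = new StreamReader(path))
        {
            foreach (string line in ReadLines(reader))
            {
                yield return line;
            }
        }
    }
}
```

Something like foreach (string found in LineReader.ReadLines(@"C:\found.txt")) would then slot in where File.ReadLines is used.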

First, I'd suggest normalizing both files and putting one of them in a set. This lets you quickly test whether a specific line is present and reduces the complexity from O(n*n) to O(n), assuming a hash-based set.
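
As a rough sketch of the normalize-then-set idea: the SetMatcher class below is hypothetical, and its normalization rule (strip the directory and a trailing ".txt", compare case-insensitively) is an assumption based on the sample lines in the question, not something the question confirms.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
static class SetMatcher
{
    // Assumed normalization: keep the part after the last path
    // separator, trim whitespace, and drop a trailing ".txt".
    public static string Normalize(string line)
    {
        int slash = line.LastIndexOfAny(new[] { '\\', '/' });
        string name = line.Substring(slash + 1).Trim();
        if (name.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
        {
            name = name.Substring(0, name.Length - 4);
        }
        return name;
    }

    // Puts one side in a HashSet, then probes it once per line
    // of the other side. Returns the probe lines that matched.
    public static List<string> FindMatches(
        IEnumerable<string> indexed, IEnumerable<string> probes)
    {
        // Case-insensitive comparer is an assumption ("Apple" vs "apple").
        var keys = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in indexed)
        {
            keys.Add(Normalize(line));
        }

        var matches = new List<string>();
        foreach (string line in probes)
        {
            if (keys.Contains(Normalize(line)))
            {
                matches.Add(line);
            }
        }
        return matches;
    }
}
```

A plain set only answers "is it there?"; if you need the matching line back (as the copy command does), a Dictionary keyed the same way is the closer fit.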

Also you shouldn't open and close the file every time you write a line:

File.AppendAllText(...); // This causes the file to be opened and closed.

Open the output file once at the start of the operation, write lines to it, then close it when all lines have been written.
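
A minimal sketch of that open-once pattern, with a made-up BatchWriter class and method names; the commands are assumed to be already-built lines such as the copy strings from the question.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only.
static class BatchWriter
{
    // Writes every command through one already-open writer
    // and returns how many lines were written.
    public static int WriteAll(TextWriter writer, IEnumerable<string> commands)
    {
        int count = 0;
        foreach (string command in commands)
        {
            writer.WriteLine(command);
            count++;
        }
        return count;
    }

    // Opens the output file once, writes everything, and lets
    // the using block close it when all lines have been written.
    public static int WriteFile(string path, IEnumerable<string> commands)
    {
        using (var writer = new StreamWriter(path))
        {
            return WriteAll(writer, commands);
        }
    }
}
```

Taking a TextWriter rather than a path keeps the loop independent of where the output goes, so the same code can write to a file or to an in-memory StringWriter.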

You have a Cartesian product, so it makes sense to index one side instead of doing an exhaustive linear search.

Extract the keys from one file and use either a Set or SortedList data structure to hold them. This will make the lookups much, much faster. (With a sorted structure, your overall algorithm will be O(N log N) instead of O(N**2).)
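
For the sorted variant, one possible sketch is a sorted array probed with binary search, which stands in here for SortedList. The SortedLookup class and its method names are made up, and ordinal (case-sensitive) comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
static class SortedLookup
{
    // Sorts the keys once: O(N log N).
    public static string[] BuildIndex(IEnumerable<string> keys)
    {
        var list = new List<string>(keys);
        list.Sort(StringComparer.Ordinal);
        return list.ToArray();
    }

    // Each lookup is a binary search: O(log N).
    // The comparer must match the one used for sorting.
    public static bool Contains(string[] sortedKeys, string key)
    {
        return Array.BinarySearch(sortedKeys, key, StringComparer.Ordinal) >= 0;
    }
}
```

With N lookups against N sorted keys this comes to O(N log N) overall; a hash-based set or dictionary would typically bring the lookups down to expected constant time each.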
