Approach for finding file in huge tree structure using multithreading
I have a tree which has all the directories and files as its nodes. I want to search for a particular file. Say the tree is spread widely and I want to do a breadth-first search to find that file, using multithreading. How should I do that with multithreading? What is a good approach?
There are some cases where multiThreading the search will provide a useful speedup - if the tree spans more than one disk, for example, or if some of the disks/nodes are reached over a network.
I certainly would not want to try creating threads for every folder. That's thousands of create/run/terminate, thousands of stack allocation/free etc. Gross, avoidable overhead.
A multiThreaded search can be done, but as other posters have said, look at available alternatives first. Then read the rest of this post. Then look again.
I have done something like this once using a queue approach similar to that suggested by Matt.
I don't want to ever do it again:
I used a producer-consumer work queue on which 6 threads waited for work, (6 because testing showed this to be optimum with my problem). The threads were all created once at startup and never terminated. None of this continual create/load/run/waitFor/getResult/terminateIfYoureLucky stuff that unaccountably seems to be popular with developers despite poor performance, shutdown AVs, 216/217 messageBoxes etc etc.
The work came in the form of a 'folderSearch' class that contained the path to be searched, a file match function event to call and the FindFirst/FindNext loop method to do the searching. I created a couple hundred of these at startup in a pool, (ie pushed onto another P-C pool queue:). When the FF/FN iterated the files in the folder to look for matching files, encountering a sub-folder resulted in extracting another folderSearch instance from the pool, loading it up with the new path & pushing it onto the work queue - some thread would then pick it up and iterate the sub-folder. The class had a list for the paths to matching files and a 'results' event to call, (with 'this' as parameter, of course), if it found something of interest. If a folderSearch got to the end of a twig, having found nothing and with nothing left to search, it would release itself back to the pool, (well, OK, the thread would do it, but you know what I mean:).
There was no need for any explicit 'load balancing'. If one node was exceptionally deep, it would naturally end up with all six threads working on its subtrees because the other paths are exhausted.
Searching 3 disks in their entirety meant popping 3 folderSearch from the pool, loading them up with 'C:\', 'E:\', 'F:\' and the file match method and then pushing them onto the work queue. The disks then made rattling noises and the event would eventually fire with results. In my case, (Windows), the event PostMessaged the folderSearch objects to a UI thread where the results were displayed in a treeView before repooling the folderSearch's for re-use.
This system was ~ 2.5 times as fast as a simple sequential search across 3 disks, even on my old development box that only had one core, simply because all 3 disks were searched in parallel. I suspect it would show the same sort of advantage on a modern box because the limiting factor is probably dominated by IO waiting on the disks.
Surprisingly, there was also a speedup with only one disk, but not that much. Don't know why - should be slower, by rights, due to all the extra complication.
Naturally, there were issues. One was that, with a search that fired lots of results, the pool would empty because the UI could not keep up with the threads: all the folderSearch objects got stuck in PostMessages queued to the UI, slowing down the search threads as they had to wait on the pool queue until the PostMessages were handled and the folderSearch's returned to the pool. This also meant that the UI was effectively blocked until the search was over and it could catch up, negating one of the advantages of threading off the search in the first place :( With small result sets, it worked fine.
Another possible issue is that the results come back in an 'unnatural' order, interleaved in an apparently confusing manner, so that things like assembling a tree view are much more complex than with a single-threaded recursive search - you have to flit about all over the place to stuff the results into the treeView in the right place. This loads up the GUI with extra work and can negate the search-speed advantage with large numbers of results, as I found out.
This design could run multiple searches concurrently. As a test, I would load on several 3-disk searches at once (no, not while loading up the treeView - I just dumped the number of files found onto a memo line in the GUI message-handler). This made a huge amount of rattling and slowed everything to a crawl, but it did eventually complete all the searches without crashing. I didn't do this often as I was afraid for my poor disks. Don't try this at home.
I was never sure how many threads to hang off the queue. Six was about the optimum on my old box with local disks. If there are networked disks in the mix, then more is probably better since a network call will tend to block one thread for much longer periods than a local disk read. Never tried that, but loading on more threads did not affect the performance any with local disks, just used more memory for no extra advantage.
Another problem is finding out if the search is actually over - are all the results in, or is some thread still waiting on a network drive that's slow or actually unreachable? With only one search, I could tell because the pool became full again when the search was over (I dumped the pool level to a statusBar on a 1s GUI timer). It didn't matter in my app, but in others, it might...
Cancelling a search is a similar issue. These sorts of things would need another 'searchClass' to control each search. Each folderSearch allocated to a search would have to keep a reference to the searchClass so that any thread handling the folderSearch could find out if an abort had been set and if so stop doing stuff with that folderSearch. I didn't need this, so did not implement it.
Then there's error reporting. If a network drive connection fails, for example, several (most likely all!) threads can block up for a long time before an exception is raised. Then they all raise exceptions at once. The catch messages get loaded into an 'errorMess' field in the folderSearch and the results event fired. Human-detectable evidence: the rattling stops, nothing happens for a minute, then [no of threads] errors appear all at once.
Note well the caveats from the other posters and my experiences. Only attempt something like this if you really, really need it for some special search purpose and you are 100% happy with multiThreaded apps. If you can get away with a single-threaded search, or a shell call to a File Explorer, or almost anything else, do it that way!
I've used the same approach since with an FTP server to generate trees. That was much faster as well, though the server admins were probably not happy about the multiple connections.
Rgds, Martin
Multithreading a tree-search task with unknown work distribution in each branch is non-trivial (this comes up a lot in, say, constraint satisfaction problems).
The easiest way is to create a task queue (protected by a mutex). Fill this queue with all the children of the root node. Spawn N threads (one for each available CPU core) and have them search through each node. There are various tricks you can do to avoid some bad scenarios (if any thread finds that its node is "unexpectedly deep", you can have it add new tasks to the queue corresponding to subdirectories it wants other threads to explore). If your node depths are well distributed and the root node has lots of children, you can avoid the queue entirely - just assign the thread with index i the children at indices i, i + N, i + 2N, ... (i.e. every child j with j % N == i, where the root has X children).
My first response is to say "just use nftw and forget about doing it multi-threaded". If you happen to have an implementation of nftw that does the tree walk in a multi-threaded fashion, then you get multi-threading for free (I'm not aware of any such implementation). If you really want to do multiple threads, I would suggest using nftw and spawning a new thread for each directory within the call back, but it's not immediately clear that that would be any easier (or any different) than following Kanopus' suggestion. And after thinking about it for a few moments, I fall back to my first suggestion and wonder why you want to do this with multiple threads. Having more threads is unlikely to speed up the search. Use nftw. Don't worry about threading.
Assuming each node in the tree represents a directory (and the files within it), and also assuming there is no limit in the number of threads you can open:
Enter the root node: if it has n subdirectories, create n - 1 threads to search the first n - 1, and continue the search through the last subdirectory yourself. Repeat as needed.
Tree structures don't typically lend themselves to parallelization. Assuming you have all the nodes loaded into memory, try to organize them so that they occupy an array - after all, they need to live in RAM, which is serial - and ignore their tree structure for the purpose of your search. Then iterate over the elements of the array using some sort of parallel for loop. A popular choice for this is OpenMP, or you might try parallel_for_each in Visual Studio.
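A rough sketch of that idea without bringing in OpenMP, assuming the tree has already been flattened into a contiguous vector (the Node struct and all names are hypothetical): split the index range into chunks across plain std::threads, which is essentially what a #pragma omp parallel for would generate for you.

```cpp
#include <algorithm>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Node { std::string name; };  // hypothetical flattened tree node

// Scan a flat array of nodes in parallel, returning the matching indices.
std::vector<size_t> parallelFind(const std::vector<Node>& nodes,
                                 const std::string& target, unsigned nThreads) {
    std::vector<size_t> hits;
    std::mutex m;
    std::vector<std::thread> pool;
    const size_t chunk = (nodes.size() + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back([&, t] {
            // Each thread scans its own contiguous chunk of the array.
            const size_t lo = t * chunk;
            const size_t hi = std::min(nodes.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i)
                if (nodes[i].name == target) {
                    std::lock_guard<std::mutex> lk(m);
                    hits.push_back(i);
                }
        });
    for (auto& th : pool) th.join();
    return hits;
}
```

Note this only parallelizes the in-memory comparison, not the disk I/O of building the array, which the answers above identify as the real bottleneck.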