Java: list all files (10,000-20,000+) from a single directory
I want to list a large number of files (10-20 thousand or so) contained in a single directory, quickly and efficiently. I have read quite a few posts, especially over here, explaining Java's shortcomings at this task, basically due to the underlying filesystem (and that Java 7 probably has some answer to it). Some of the posts propose alternatives like native calls or piping, and I do understand that the best possible option under normal circumstances is the plain Java call - String[] sList = file.list(); - which is only slightly better than file.listFiles(). There was also a suggestion to use multithreading (and the ExecutorService).
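Since Java 7 keeps coming up as the possible answer: its java.nio.file API can stream directory entries lazily with a glob filter instead of materializing one huge array up front. A minimal sketch of that approach (the class and method names here are my own, for illustration):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DirListing {
    // Streams directory entries matching a glob, one at a time,
    // instead of building the whole listing in a single array.
    static List<String> listNames(Path dir, String glob) throws IOException {
        List<String> names = new ArrayList<String>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("listing-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(dir.resolve("b.log"));
        // Only entries matching the glob are returned.
        System.out.println(listNames(dir, "*.txt"));
    }
}
```

Because the stream is consumed entry by entry, a filter can reject most names without ever holding 20,000 strings in memory at once.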
Well, the issue here is that I have very little practical know-how of multithreaded coding, so my logic is bound to be incorrect. Still, I tried it this way:
- created a list of few thread objects
- ran a loop over this list, calling .start() and then immediately .sleep(500)
- in the thread class, overrode the run method to include the .list() call
Something like this, Caller class -
List<ThreadLister> threadList = new ArrayList<ThreadLister>();
ThreadLister thread = null;
String[] strList = null;
for (int i = 0; i < 5; i++) {
    ThreadLister tL = new ThreadLister(fit);
    threadList.add(tL);
}
for (int j = 0; j < threadList.size(); j++) {
    thread = threadList.get(j);
    thread.start();
    thread.sleep(500);
}
strList = thread.fileList;
and the Thread class as -
public String[] fileList;
private File f;

public ThreadLister(File f) {
    this.f = f;
}

public void run() {
    fileList = f.list();
}
I might be way off with the multithreading here, I guess. I would very much appreciate a solution to my requirement using multithreading; the added benefit is that I would learn a bit more about practical multithreading.
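For what it's worth, the usual repair for the start()-then-sleep() pattern above is Thread.join(), which waits until the worker has actually finished rather than guessing a delay. A minimal sketch (this fixes the waiting logic, though it does not make listing a single directory any faster):

```java
import java.io.File;

public class JoinDemo {
    static class ThreadLister extends Thread {
        private final File f;
        volatile String[] fileList;

        ThreadLister(File f) { this.f = f; }

        @Override
        public void run() {
            fileList = f.list(); // result published to the field
        }
    }

    public static void main(String[] args) throws InterruptedException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        ThreadLister t = new ThreadLister(dir);
        t.start();
        t.join(); // block until run() has completed, instead of sleep(500)
        System.out.println(t.fileList.length + " entries");
    }
}
```

join() also establishes a happens-before relationship, so reading fileList after join() is safe without further synchronisation.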
Query Update
Well, obviously multithreading isn't going to help me (I now realise it's not actually a solution). Thank you for helping me rule out threading.
So I tried:
1. FileUtils.listFiles() from Apache Commons - not much difference.
2. A native call, viz. exec("cmd /c dir /B .\\Test") - this executes fast, but reading the resulting stream in a while loop takes ages.
What I actually require is the filenames matching a certain filter from among about 100k files in a single directory, so I am using File.list(new FilenameFilter()).
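On point 2: the slow part is often not the native command but reading its output unbuffered. A hedged sketch of draining the process stream line by line through a BufferedReader (helper name and command handling are mine, for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class NativeList {
    // Drains a child process's stdout line by line. Wrapping the raw
    // InputStream in a BufferedReader is what keeps this fast; reading
    // the stream a character at a time is a common cause of slowness.
    static List<String> run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr so the pipe never blocks
        Process p = pb.start();
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        p.waitFor();
        return lines;
    }
}
```

With the "cmd /c dir /B" example from the update, each line of output is one filename, so the returned list is directly usable.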
Kind Regards
Multi-threading is useful for listing multiple directories. However, you cannot split a single call to a single directory, and I doubt it would be much faster even if you could, as the OS returns the files in whatever order it pleases.
The first thing to learn about multi-threading is that not all solutions will be faster or simpler just because you use multiple threads.
As a completely different suggestion: did you try the Apache Commons file utilities?
http://commons.apache.org/io/api-release/index.html Check out the method FileUtils.listFiles().
It will list all the files in a directory. Maybe it is fast enough and optimised enough for your needs. Maybe you really don't need to reinvent the wheel and the solution is already out there?
What I eventually did is:
1. As a quick fix, to get past the problem for the moment, I used a native call to write all the filenames to a temp text file and then used a BufferedReader to read each line.
2. Wrote a utility to archive the inactive files (most of them) to some other archive location, thereby reducing the total number of files in the active directory, so that the normal list() call returns much more quickly.
3. As a long-term solution going forward, I will modify the way all these files are stored and create a directory hierarchy in which each directory holds comparatively few files, so that list() can work very fast.
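The hierarchy in point 3 can be as simple as hashing each file name into a fixed number of buckets. A hypothetical sketch (the 256-bucket scheme and helper name are my own, not from the post):

```java
import java.io.File;

public class Sharding {
    // Maps a file name to one of 256 subdirectories ("00".."ff") under root,
    // so no single directory ever holds the full file population.
    static File shardedPath(File root, String fileName) {
        int bucket = fileName.hashCode() & 0xFF; // stable bucket in 0..255
        return new File(new File(root, String.format("%02x", bucket)), fileName);
    }
}
```

Writers would call shardedPath(...).getParentFile().mkdirs() before saving; readers recompute the same path from the name alone, so no lookup table is needed.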
One thing I noticed while testing was that the first list() call takes a long time, but subsequent requests are very, very fast. That makes me believe the JVM intelligently retrieves the list, which has remained on the heap. I tried a few things, like adding files to the dir or changing the File variable name, but the response was still instant. So I believe this array sits on the heap until gc'ed, and Java intelligently responds to the same request. <*Am I right? Or is that not how it behaves? Some explanation please.*>
Because of this, I thought that if I could write a small program to get this list once a day and keep a static reference to it, then the array won't be gc'ed and every request to retrieve the list will be fast. <*Again, comments/suggestions appreciated.*>
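Whatever the cause of the observed speedup (the OS caching directory metadata is at least as likely an explanation as the JVM retaining the array), caching the listing explicitly in application code achieves the stated goal without relying on GC behaviour. A hypothetical sketch (class name and TTL scheme are mine):

```java
import java.io.File;
import java.util.concurrent.TimeUnit;

public class CachedListing {
    private final File dir;
    private final long ttlMillis;
    private volatile String[] cached;
    private volatile long loadedAt;

    CachedListing(File dir, long ttl, TimeUnit unit) {
        this.dir = dir;
        this.ttlMillis = unit.toMillis(ttl);
    }

    // Returns the cached listing, re-reading the directory at most
    // once per TTL window. Not strictly thread-safe; a real version
    // might synchronise the refresh.
    String[] names() {
        long now = System.currentTimeMillis();
        if (cached == null || now - loadedAt > ttlMillis) {
            cached = dir.list(); // the expensive call happens rarely
            loadedAt = now;
        }
        return cached;
    }
}
```

A static (or otherwise long-lived) reference to such an object keeps the array reachable, so the GC will not collect it; no special Tomcat or JVM configuration is needed for that.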
Is there a way to configure Tomcat so that the GC collects all other non-referenced objects but skips specified ones? Somebody told me something like this is implemented in Linux, obviously at the OS level, though I don't know whether that's true.
Which file system are you using? Each file system has its own limit on the number of files/folders a directory can hold (including the directory depth). So I am not sure how you could create that many, and, if they were created through some program, whether you could read all the files back.
As suggested above, FilenameFilter is a post-listing filter, so I am not sure it would be much help (although you are probably creating smaller lists of files), as each listFiles() call would still fetch the complete list.
For example:
1) Say thread 1 is capturing the list of file names starting with "T*": the listFiles() call would retrieve all the thousands of file names and then filter them per the FilenameFilter criteria.
2) Thread 2, capturing the file names starting with "S*", would repeat all the steps from 1.
So you end up reading the directory listing multiple times, putting more and more load on the heap, JVM native calls, the file system, etc.
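One way around the repeated listings described above is to read the directory once and partition the names in memory. A sketch of that idea (the bucketing-by-first-letter scheme is illustrative, matching the "T*"/"S*" example):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SinglePassFilter {
    // Lists the directory once and buckets names by their first letter,
    // instead of issuing one full list() call per filter.
    static Map<Character, List<String>> bucketByFirstChar(File dir) {
        Map<Character, List<String>> buckets = new HashMap<Character, List<String>>();
        String[] names = dir.list();
        if (names == null) {
            return buckets; // not a directory, or an I/O error occurred
        }
        for (String name : names) {
            char key = Character.toUpperCase(name.charAt(0));
            List<String> bucket = buckets.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                buckets.put(key, bucket);
            }
            bucket.add(name);
        }
        return buckets;
    }
}
```

Each consumer (the "T*" thread, the "S*" thread, and so on) then reads its own bucket, and the expensive directory scan happens exactly once.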
If possible, the best suggestion would be to reorganise the directory structure.