How to process files with multiple threads so each file is processed by only one thread
I currently have a Java program that spawns 50 threads. The goal is to watch a directory that has many files being written to it, upload those files to an FTP server, and then remove them. Right now I have a super hacky approach: each thread loops through the directory and sets a lock in a ConcurrentMap to record that it is already processing a given image, which prevents duplicate work. It's working, but it just doesn't seem right.
So the question is: in Java, what is the preferred way of watching a directory in a multithreaded program and making sure each thread only operates on a file that no other thread has?
Update: I was considering creating a thread pool, with the caveat that each thread has its own FTP client connection that I'll have to keep open and keep from timing out.
Update: What about using http://download.oracle.com/javase/tutorial/essential/io/notification.html ?
Use an ExecutorService to decouple the submission of work to the threads from the threading logic itself (also take a look at the docs for the parent interface, Executor, to learn a bit more about their purpose).
With an ExecutorService, you simply feed work (in your case, a file) to it and threads will pick up work as they become available. There are many options and flavors of ExecutorService you can configure: single-threaded, a fixed maximum number of threads, an unbounded thread pool, etc.
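As a rough illustration of that idea, here is a minimal sketch. The class name PooledUploader, the method processAll, and the uploader callback are made up for this example; the callback stands in for whatever FTP upload code you already have. Each file is submitted to the pool exactly once, so no two workers can ever pick up the same file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: names here are illustrative, not from any library.
class PooledUploader {

    /**
     * Submits every regular file currently in dir to a fixed-size pool,
     * waits for all tasks, and returns how many files were processed.
     * The uploader argument is a placeholder for the real FTP upload.
     */
    static int processAll(Path dir, int threads, Consumer<Path> uploader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Path> files;
            try (Stream<Path> listing = Files.list(dir)) {
                files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            List<Future<?>> pending = new ArrayList<>();
            for (Path file : files) {
                // One task per file: whichever worker is free runs it.
                pending.add(pool.submit(() -> {
                    uploader.accept(file);
                    try {
                        Files.delete(file); // remove only after the upload returned
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
            for (Future<?> task : pending) {
                task.get(); // blocks until done and surfaces task failures
            }
            return pending.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling something like PooledUploader.processAll(dir, 50, file -> myFtpUpload(file)) would process one snapshot of the directory (myFtpUpload being your own code). This sketch does not deal with files that are still being written when they are listed.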
Maybe having a master thread searching the directory and giving tasks out to the worker threads?
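Something along these lines, as a sketch only. DirectoryDispatcher, scanOnce, and the handler callback are names invented for this example. The master thread is the only one that lists the directory; it remembers which files are currently in flight so that a repeated scan does not hand the same file out twice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class: one master thread scans, a pool of workers processes.
class DirectoryDispatcher {
    private final Path dir;
    private final Consumer<Path> handler; // placeholder for the real upload
    private final ExecutorService workers;
    // Files that have been handed to a worker and are not finished yet.
    private final Set<Path> inFlight = ConcurrentHashMap.newKeySet();

    DirectoryDispatcher(Path dir, int workerCount, Consumer<Path> handler) {
        this.dir = dir;
        this.handler = handler;
        this.workers = Executors.newFixedThreadPool(workerCount);
    }

    /** Call from the master thread only. Returns how many tasks were handed out. */
    int scanOnce() {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int dispatched = 0;
        for (Path file : files) {
            // add() returns false if the file is already in flight.
            if (inFlight.add(file)) {
                workers.execute(() -> process(file));
                dispatched++;
            }
        }
        return dispatched;
    }

    private void process(Path file) {
        try {
            // The file may have been finished and deleted between the
            // master's listing and this task starting, so re-check.
            if (Files.exists(file)) {
                handler.accept(file);
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            inFlight.remove(file);
        }
    }

    /** Stops accepting work and waits for running tasks; true if all finished. */
    boolean shutdownAndWait(long timeoutSeconds) {
        workers.shutdown();
        try {
            return workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The master thread would call scanOnce() in a loop or on a timer. Because a file is removed from inFlight only after its task ends, a failed upload that leaves the file in place gets retried on a later scan.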
IMO, it's asking for trouble to try to write something that does this yourself. There are so many nuances to parallel batch processing that it's best to learn the API of a framework that does it for you.
In the past I've used both Spring Batch (which is open source) and Flux (which requires a license). They'll both allow you to configure jobs that watch a directory for files, and then process those files in a parallel way. As long as you're willing to invest the time in learning their APIs, then you don't need to worry about synchronization on which process is handling which files.
Just a quick note on pros/cons of Spring Batch vs Flux:
- Spring Batch is mostly XML configuration, while Flux has a nice GUI designer
- If you're already familiar with the Spring framework, then Batch will come more naturally. (Otherwise, as a starting point their documentation is great for the basic use cases)
- Spring Batch requires scheduling to be done from the outside (usually with Quartz), while Flux also includes scheduling
- Flux is better (and imo, more intuitive) for things like monitoring a directory/FTP/SFTP/email to kick off a job
I'm sure there are other frameworks that do this too... those are just the two I'm familiar with.
I would set up a file handler class that accepts a directory and has a synchronized nextFile method that returns the next file in the directory. This way every thread asks for a file, and every thread gets a unique file.
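A minimal sketch of what I mean, assuming plain directory listing is good enough. The class name FileHandler and the use of Optional for "nothing available right now" are choices made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Stream;

// Illustrative class: hands each file in a directory to at most one caller.
class FileHandler {
    private final Path dir;
    private final Set<Path> seen = new HashSet<>();        // every file ever queued
    private final Deque<Path> pending = new ArrayDeque<>(); // queued, not yet handed out

    FileHandler(Path dir) {
        this.dir = dir;
    }

    /**
     * Returns the next file that no other caller has received, or an empty
     * Optional if the directory holds nothing new at the moment.
     * synchronized, so only one thread at a time can take a file.
     */
    synchronized Optional<Path> nextFile() {
        if (pending.isEmpty()) {
            refill();
        }
        return Optional.ofNullable(pending.poll());
    }

    // Re-lists the directory and queues files that were not seen before.
    private void refill() {
        try (Stream<Path> listing = Files.list(dir)) {
            listing.filter(Files::isRegularFile)
                   .filter(seen::add) // add() is false for files already queued once
                   .forEach(pending::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each worker thread would loop on nextFile(), upload the file, and delete it. Note that the seen set in this sketch grows for as long as the program runs and a file name is never handed out twice, so a long-running version would need to prune it or handle reused names.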
Does the solution really need to be multi-threaded? Unless the maximum upload speed to the destination FTP server is limited per connection, surely it'd be easier sending them one at a time?
Sending 50 files of 1 MB each sequentially at 1 Mbps (assumed maximum upload speed) over a single FTP connection would be no slower than sending the same 50 files concurrently at ~20 Kbps each over 50 FTP connections, wouldn't it? Either way, roughly 400 Mb has to cross a 1 Mbps link, which takes about 400 seconds.