Parallelized Regex Matching in a DFS Environment Using pthreads
I'm trying to parallelize grep for a university project. I'm working off the assumption that throwing more threads at the actual regular expression matching would be inefficient: they would process data faster than it could be read from disk. However, I'm working on an MPI cluster that uses the Lustre distributed file system (DFS), which lets me stripe data across multiple storage pools.
I was hoping to exploit this by reading from multiple disks at once, diminishing the bottleneck created by hard disk I/O. After some initial testing, I'm unable to find a setup that gets the most out of the DFS.
I've tried:
- A large, single file, striped across multiple OSTs. Each thread reads chunks of data equal to the stripe size (i.e. thread read size == stripe size per OST). My intention was that each thread reads data from a different OST (except in overlap/border cases); a rough sketch of this read pattern follows the list.
- Multiple, smaller files, each striped to its own OST. Each thread reads a full file, but each file is located on a different OST.
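Roughly, the per-thread read pattern I had in mind for the first approach looks like this (the stripe size and thread count are hard-coded purely for illustration; the real values would come from the file's layout, e.g. via lfs getstripe):

```c
/* Sketch of the per-thread, stripe-aligned read pattern (illustrative
 * constants only). Compile with: cc -O2 -pthread sketch.c */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE (1 << 20)   /* assumed 1 MiB stripe */
#define NUM_THREADS 4           /* ideally one thread per OST */

struct task { const char *path; int id; };

static void *reader(void *arg)
{
    struct task *t = arg;
    int fd = open(t->path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    char *buf = malloc(STRIPE_SIZE);
    off_t off = (off_t)t->id * STRIPE_SIZE;
    ssize_t n;

    /* Thread i reads stripes i, i+NUM_THREADS, i+2*NUM_THREADS, ... so that,
     * with round-robin striping, each thread keeps hitting "its" OST. */
    while ((n = pread(fd, buf, STRIPE_SIZE, off)) > 0) {
        /* ... run the regex matcher over buf[0..n) here ... */
        off += (off_t)NUM_THREADS * STRIPE_SIZE;
    }

    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    pthread_t tid[NUM_THREADS];
    struct task tasks[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        tasks[i] = (struct task){ argv[1], i };
        pthread_create(&tid[i], NULL, reader, &tasks[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```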
Each case provided little, if any, speedup. I was hoping for a reasonable gain (even topping out at 2x overall would have been nice).
Should I be worried about I/O bottlenecks?
How do I take advantage of a DFS when writing C code? I've tried reading data from offsets that match the stripe size, as well as from files that lie on different OSTs and, I assume, different disks.
Is there a way to implement a scalable, parallel grep/regex matcher?
Your intuition for how to parallelize this problem seems correct, but I'd guess there's something in your implementation that is unexpectedly serializing your disk accesses. Grep will be I/O bound, and you should see massive speedups if you get each system in your cluster to operate on its local data.
You're going to have to look into the fine details of how files are accessed over MPI from a Lustre system. It sounds like the software stack is actually serializing all your disk accesses, so you can't take advantage of the distributed FS. Or your clients might be accessing the wrong slice of the file for the node they're on.
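One sanity check worth doing: make the per-rank slice assignment explicit and verify that the ranks really read disjoint regions. A bare-bones MPI-IO sketch of that assignment (the filename is a placeholder, and each slice is assumed to fit in an int for the read call):

```c
/* Bare-bones sketch: each MPI rank reads its own contiguous slice of one
 * big file. "bigfile.txt" is a placeholder; aligning the slice size to the
 * Lustre stripe size (not done here) would keep each rank on fewer OSTs.
 * Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bigfile.txt", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    MPI_Offset fsize;
    MPI_File_get_size(fh, &fsize);

    /* Split the file into nprocs contiguous slices, one per rank. */
    MPI_Offset slice = (fsize + nprocs - 1) / nprocs;
    MPI_Offset start = (MPI_Offset)rank * slice;
    MPI_Offset len   = (start >= fsize) ? 0 :
                       (start + slice > fsize ? fsize - start : slice);

    printf("rank %d reads [%lld, %lld)\n", rank,
           (long long)start, (long long)(start + len));

    char *buf = malloc((size_t)len + 1);
    MPI_Status st;
    MPI_File_read_at(fh, start, buf, (int)len, MPI_CHAR, &st);
    buf[len] = '\0';

    /* ... run the matcher on buf here; matches straddling slice
     * boundaries need separate handling ... */

    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```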
How do I take advantage of a DFS when writing C code?
If you're using standard POSIX file operations, you're out of luck: they don't expose enough information about data placement to make this task easy. This is why distributed file storage and processing systems provide their own layout-aware APIs on top of (or instead of) plain POSIX I/O.
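If you do need layout information from C, Lustre ships a user-space library (liblustreapi) for this. The sketch below is from memory and only shows the idea of querying a file's stripe layout before deciding which thread or rank reads what; treat the exact header, function, and struct names as assumptions to verify against lustre/lustreapi.h on your system.

```c
/* UNVERIFIED sketch: query a file's Lustre stripe layout via liblustreapi
 * so reads can be aligned to it. Names are from memory -- check
 * lustre/lustreapi.h and link with -llustreapi. */
#include <stdio.h>
#include <stdlib.h>
#include <lustre/lustreapi.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    /* The caller supplies a buffer big enough for the per-OST entries. */
    size_t lum_size = sizeof(struct lov_user_md) +
                      LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data);
    struct lov_user_md *lum = malloc(lum_size);

    if (llapi_file_get_stripe(argv[1], lum) != 0) {
        perror("llapi_file_get_stripe");
        return 1;
    }

    printf("stripe size:  %u bytes\n", (unsigned)lum->lmm_stripe_size);
    printf("stripe count: %u\n", (unsigned)lum->lmm_stripe_count);
    /* lum->lmm_objects[i].l_ost_idx says which OST holds stripe i -- that is
     * the information you would use to pin a thread/rank to "its" OST. */

    free(lum);
    return 0;
}
```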
This is more of an extended comment than an answer ...
Are you trying to write a parallel grep, or a grep which operates on different data in parallel? Your question reads as though you are thinking of either, or both. If I'm right (mind you, I'm usually wrong when I try to interpret questions) I suggest you separate the two concerns and tackle one first, then add the other in.
If your scenario is to run grep on many small files, then this will parallelise very easily; you just need some sort of scheduler to parcel the work out in roughly equal chunks.
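The scheduler for that case can be as simple as a shared counter the worker threads pull file names from; a minimal pthreads sketch (the file names are placeholders and process_file() is a stub standing in for the actual matching):

```c
/* Minimal sketch of the "scheduler" for the many-small-files case: workers
 * pull file names from a shared list guarded by a mutex. */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

static const char *files[] = { "a.log", "b.log", "c.log", "d.log", "e.log" };
static const int nfiles = (int)(sizeof(files) / sizeof(files[0]));

static int next_file = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void process_file(const char *path)
{
    /* open path, read it, run the regex over its contents ... */
    printf("grepping %s\n", path);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int i = (next_file < nfiles) ? next_file++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0)
            break;
        process_file(files[i]);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```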
If your scenario is to use grep on one (or a few) large files, then I can see the benefits of reading the file in chunks, one chunk per process(or). The problem I see with this is that some matches will straddle chunk boundaries. That strikes me as an interesting problem, but you're the university student with access to the up-to-date literature, so you will know better than I do.
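Since grep is line-oriented, one way to sidestep the boundary problem is to snap each chunk's end forward to the next newline, so every line is examined by exactly one worker (patterns that span lines would still need an overlap scheme). A sketch of that boundary adjustment, assuming the file is already mapped or read into memory:

```c
/* Sketch: snap nominal chunk boundaries to line boundaries so each line is
 * matched by exactly one worker. Assumes the whole file is in memory. */
#include <stdio.h>
#include <string.h>

/* Return the index just past the first '\n' at or after pos, or len. */
static size_t snap_to_line_end(const char *data, size_t len, size_t pos)
{
    if (pos >= len)
        return len;
    const char *nl = memchr(data + pos, '\n', len - pos);
    return nl ? (size_t)(nl - data) + 1 : len;
}

/* Worker i of nworkers matches over [start, end); the regions are disjoint
 * and together cover the whole buffer. */
static void chunk_bounds(const char *data, size_t len, int i, int nworkers,
                         size_t *start, size_t *end)
{
    size_t nominal = (len + nworkers - 1) / nworkers;
    *start = (i == 0) ? 0 : snap_to_line_end(data, len, (size_t)i * nominal);
    *end   = snap_to_line_end(data, len, (size_t)(i + 1) * nominal);
}

int main(void)
{
    const char *data = "alpha\nbravo\ncharlie\ndelta\n";
    size_t len = strlen(data);
    for (int i = 0; i < 3; i++) {
        size_t s, e;
        chunk_bounds(data, len, i, 3, &s, &e);
        printf("worker %d: [%zu, %zu)\n", i, s, e);
    }
    return 0;
}
```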
If your scenario is to write a parallel grep, that is, a program which uses multiple processor cores when it executes, that's probably an even more interesting (i.e. difficult) problem. Since grep works by constructing a finite automaton (FA), and since FAs are often modelled as graphs, and since graphs can (if they meet certain criteria) be decomposed into subgraphs, this is probably doable -- you just have to parallelise the construction of the FA, spawn threads for the subgraphs, and collect results. I imagine load-balancing will be difficult.
I don't have a deep knowledge of parallel file systems, but I think I'm right in suggesting that you'll only get the benefits if different nodes in your cluster read different parts of a file on the DFS. If you are using threads, doesn't that mean your threads have to be on different nodes of your cluster?
Now to give some poor answers to your questions.
- Should I be worried about I/O bottlenecks? -- Hell yes. You'll have measured the performance of grep at a fine enough level to have identified what proportion of wallclock time is spent in its various phases: constructing the FA, reading the input file(s), and so on. grep is a very mature product which has had a lot of optimisation effort spent on it, so you've set yourself a tough target. You'll also have measured the characteristics of your cluster file system, inter-thread communication times, that sort of thing.
- How do I take advantage of a DFS when writing C code? -- Can't help you there, but I would have thought that the file system's API provides the necessary functions; you seem to be trying to address the disk hardware directly.
- Is there a way to implement a scalable, parallel grep/regex matcher? -- I'm sure there is, but you're the student; didn't you check out the possibilities and pitfalls before starting this project?