OpenMP debug newbie questions
I am starting to learn OpenMP, running examples (with gcc 4.3) from https://computing.llnl.gov/tutorials/openMP/exercise.html in a cluster. All the examples work fine, but I have some questions:
- How do I know in which nodes (or cores of each node) have the different threads been "run"?
- Case of nodes, what is the average transfer time in microsecs or nanosecs for sending the info and getting it back?开发者_StackOverflow社区
- What are the best tools for debugging OpenMP programs?
- Best advices for speeding up real programs?
Typically your OpenMP program does not know, nor does it care, on which cores it is running. If you have a job management system that may provide the information you want in its log files. Failing that, you could probably insert calls to the environment inside your threads and check the value of some environment variable. What that is called and how you do this is platform dependent, I'll leave figuring it out up to you.
How the heck should I (or any other SOer) know ? For an educated guess you'd have to tell us a lot more about your hardware, o/s, run-time system, etc, etc, etc. The best answer to the question is the one you determine from your own measurements. I fear that you may also be mistaken in thinking that information is sent around the computer -- in shared-memory programming variables usually stay in one place (or at least you should think about them staying in one place the reality may be a lot messier but also impossible to discern) and is not sent or received.
Parallel debuggers such as TotalView or DDT are probably the best tools. I haven't yet used Intel's debugger's parallel capabilities but they look promising. I'll leave it to less well-funded programmers than me to recommend FOSS options, but they are out there.
i) Select the fastest parallel algorithm for your problem. This is not necessarily the fastest serial algorithm made parallel.
ii) Test and measure. You can't optimise without data so you have to profile the program and understand where the performance bottlenecks are. Don't believe any advice along the lines that 'X is faster than Y'. Such statements are usually based on very narrow, and often out-dated, cases and have become, in the minds of their promoters, 'truths'. It's almost always possible to find counter-examples. It's YOUR code YOU want to make faster, there's no substitute for YOUR investigations.
iii) Know your compiler inside out. The rate of return (measured in code speed improvements) on the time you spent adjusting compilation options is far higher than the rate of return from modifying the code 'by hand'.
iv) One of the 'truths' that I cling to is that compilers are not terrifically good at optimising for use of the memory hierarchy on current processor architectures. This is one area where code modification may well be worthwhile, but you won't know this until you've profiled your code.
You cannot know, the partition of threads on different cores is handled entirely by the OS. You speaking about nodes, but OpenMP is a multi-thread (and not multi-process) parallelization that allow parallelization for one machine containing several cores. If you need parallelization across different machines you have to use a multi-process system like OpenMPI.
The order of magnitude of communication times are :
- huge in case of communications between cores inside the same CPU, it can be considered as instantaneous
- ~10 GB/s for communications between two CPU across a motherboard
- ~100-1000 MB/s for network communications between nodes, depending of the hardware
All the theoretical speeds should be specified in your hardware specifications. You should also do little benchmarks to know what you will really have.
For OpenMP,
gdb
do the job well, even with many threads.- I work in extreme physics simulation on supercomputer, here are our daily aims :
- use as less communication as possible between the threads/processes, 99% of the time it is communications that kill performances in parallel jobs
- split the tasks optimally, machine load should be as close as possible to 100% all the time
- test, tune, re-test, re-tune... . Parallelization is not at all a generic "miracle solution", it generally needs some practical work to be efficient.
精彩评论