I just noticed that Intel has made its VTune Amplifier profiler available for free, with support through the forums. There are Linux and Windows versions. I have used it before for system-wide profiling and disassembly.
It makes it pretty easy to see where the bottlenecks are in a running program. It shows that gwnum is too aggressive with its software prefetching: when running multiple work units (WU), the redundant memory reads limit computation.
I don't know what it does when confronted with an AMD CPU.