Hacker News
Swimming in OpenCL (supermegaultragroovy.com)
14 points by liscio on Nov 13, 2009 | 2 comments


Always great to hear about parallelization becoming more widely adopted. The only thing that I don't get is the part about the GPU - it takes a lot of computation to tie up an 8800GT for a full minute. Also, it shouldn't take the 1-3 seconds he described to send a 45 sec .wav file over a PCI Express 2.0 x16 bus (~3-4 GB/s bandwidth IIRC).
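To make that transfer-time claim concrete, here's a back-of-envelope check (assuming CD-quality 16-bit stereo audio and a conservative ~3 GB/s of PCIe 2.0 x16 bandwidth):

```python
# Rough sanity check: how long should a 45 s .wav transfer take over PCIe?
sample_rate = 44100        # Hz, assuming CD-quality audio
channels = 2               # stereo
bytes_per_sample = 2       # 16-bit PCM
seconds = 45

size_bytes = sample_rate * channels * bytes_per_sample * seconds
size_mb = size_bytes / 1e6

bandwidth = 3e9            # ~3 GB/s, conservative for PCIe 2.0 x16
transfer_ms = size_bytes / bandwidth * 1000

print(f"{size_mb:.1f} MB -> ~{transfer_ms:.1f} ms")  # → 7.9 MB -> ~2.6 ms
```

So the raw transfer should take a few milliseconds, not 1-3 seconds — whatever is eating that time, it isn't the bus.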

I'm not sure what's causing those runtimes, but the fact that it scaled that well across 8 cores suggests the workload is close to embarrassingly parallel, which a GPU really should be great for. This makes me really wonder about the maturity of Apple's / nVidia's OpenCL implementation.

EDIT: I just ran a few of the OpenCL SDK demos and can confirm that it is 1-2 orders of magnitude slower than the same demo running in CUDA. The bandwidth for copying memory to / from the device should still be high, though.

My OpenCL Bandwidth Test results:

    ~/NVIDIA_GPU_Computing_SDK/OpenCL/bin/linux/release$ ./oclBandwidthTest
    ./oclBandwidthTest Starting...

    Running on...
    Device GeForce 8400M GT

    Quick Mode

    Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
      Transfer Size (Bytes)    Bandwidth(MB/s)
      33554432                 1600.9

    Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
      Transfer Size (Bytes)    Bandwidth(MB/s)
      33554432                 1235.1

    Device to Device Bandwidth, 1 Device(s)
      Transfer Size (Bytes)    Bandwidth(MB/s)
      33554432                 6069.7

    TEST PASSED

    Press <Enter> to Quit...


It's probably the algorithm in question that's to blame, in conjunction with the slow OpenCL implementation for the 8800GT, as you found.

On my machine (I'm the article's author), even Apple's GPU-tuned version of Galaxies runs much faster on the Mac Pro's CPUs than the GPU. So, something's up. I think only the GTX285 for the Mac Pro beats out the CPUs on that test, but I could be wrong...

The 1-2 seconds of overhead may also come partly from compiling the OpenCL program for the GPU, since I compile the .cl kernel from source on every run of the program.
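One common way to avoid paying that compile cost on every run is to cache the built binary, keyed by a hash of the kernel source. A minimal sketch of the pattern — here `build_fn` is a hypothetical stand-in for the actual OpenCL calls (clBuildProgram, then clGetProgramInfo with CL_PROGRAM_BINARIES to extract the binary, and clCreateProgramWithBinary to load it back):

```python
import hashlib
import os

def load_or_build(source: str, cache_dir: str, build_fn):
    """Return a compiled kernel binary, compiling only on cache miss."""
    # Key the cache on a hash of the kernel source, so any edit
    # to the .cl file invalidates the cached binary.
    key = hashlib.sha1(source.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")

    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    # Cache miss: compile (e.g. clBuildProgram + clGetProgramInfo)
    binary = build_fn(source)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

The first launch still pays the full compile, but subsequent runs on the same GPU/driver would skip it. Note the cached binary is device- and driver-specific, so a robust version would fold the device name and driver version into the cache key as well.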

Furthermore, I wasn't very scientific about the GPU case, because I wasn't planning to ship a GPU-tuned algorithm. To actually pull this off for a consumer app is easier said than done.

For instance, I'd prefer not to ship the .cl kernel source in the application, and would rather provide binary-compiled kernels. Doing this for more than one flavor of GPU is nontrivial, from what I gather, as I'd have to actually own the GPUs in question to compile for the different targets (from my own collection of hardware I could only cover the GeForce 9400M and 8800GT).

That said, I still want to stay open to the idea in the future as I play around with the algorithm, and understand it further.

Thanks for the nudge, though. I really should dig deeper.



