Conclusion in brief: to "win" with an off-board accelerator, you need to use each transferred byte more than once. If you touch a byte only once to compute your result, it is going to be faster to just do the computation on the CPU.
The reason: the up-front cost of transferring that byte to the accelerator is high. The win comes from reusing that byte many times across the accelerator's massively parallel architecture.
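To make the trade-off concrete, here is a minimal back-of-the-envelope cost model. It is only a sketch: the function names and every throughput figure are illustrative assumptions, not measurements of any real hardware, and the model ignores transfer latency, copying results back, and overlap of transfer with compute.

```python
# Rough cost model: time to process n_bytes when each byte is used
# ops_per_byte times. All rates below are made-up placeholders.

def cpu_time(n_bytes, ops_per_byte, cpu_ops_per_sec):
    """Seconds to do the work on the CPU (data is assumed already in host memory)."""
    return n_bytes * ops_per_byte / cpu_ops_per_sec


def accel_time(n_bytes, ops_per_byte, link_bytes_per_sec, accel_ops_per_sec):
    """Seconds to ship the data over the link, then do the work on the accelerator."""
    transfer = n_bytes / link_bytes_per_sec
    compute = n_bytes * ops_per_byte / accel_ops_per_sec
    return transfer + compute


def break_even_ops_per_byte(cpu_ops_per_sec, accel_ops_per_sec, link_bytes_per_sec):
    """Ops per byte at which both paths take the same time.

    Solves k/C = 1/B + k/A for k; only meaningful when the accelerator
    computes faster than the CPU (A > C).
    """
    return 1.0 / (link_bytes_per_sec * (1.0 / cpu_ops_per_sec - 1.0 / accel_ops_per_sec))


# Hypothetical figures, chosen only for illustration.
CPU_OPS = 1e10    # CPU operations per second (assumed)
ACCEL_OPS = 1e12  # accelerator operations per second (assumed)
LINK_BW = 1e10    # host-to-accelerator bytes per second (assumed)
N = 1e9           # bytes to process

for k in (1, 100):
    t_cpu = cpu_time(N, k, CPU_OPS)
    t_acc = accel_time(N, k, LINK_BW, ACCEL_OPS)
    winner = "CPU" if t_cpu < t_acc else "accelerator"
    print(f"ops/byte={k}: cpu={t_cpu:.4f}s accel={t_acc:.4f}s -> {winner}")

print("break-even ops/byte:", break_even_ops_per_byte(CPU_OPS, ACCEL_OPS, LINK_BW))
```

With these assumed numbers, a single use per byte leaves the accelerator behind because the transfer alone takes as long as the CPU computation, while heavy reuse amortizes the transfer. Plug in measured figures for your own system to see where the break-even actually falls.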