I wrote a paper on pretty much that problem: "Evaluation of Streaming Aggregation on Parallel Hardware Architectures", https://www.scott-a-s.com/files/debs2010.pdf

Conclusion in brief: in order to "win" when trying to use an off-board accelerator, you need to look at each byte transferred more than once. If you only look at a byte once to compute your result, then it's going to be faster to just do the computation on the CPU.

The reason: the up-front cost of transferring that byte is high. The win comes from reusing that byte many times in the massively parallel architecture on the accelerator.
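A rough back-of-envelope sketch of that argument, in Python. The cost model, the function names, and the bandwidth numbers are illustrative assumptions of mine, not figures or code from the paper: the CPU touches each byte `passes` times at `cpu_bw`, while the accelerator first pays a one-time transfer over the link and then touches each byte `passes` times at `accel_bw`.

```python
import math

def cpu_time(n_bytes, passes, cpu_bw):
    # Time to process the data in place on the CPU.
    return n_bytes * passes / cpu_bw

def accel_time(n_bytes, passes, link_bw, accel_bw):
    # One-time transfer over the link, then processing on the accelerator.
    return n_bytes / link_bw + n_bytes * passes / accel_bw

def break_even_passes(cpu_bw, link_bw, accel_bw):
    # Solve passes/cpu_bw = 1/link_bw + passes/accel_bw for passes.
    # If the accelerator is no faster than the CPU, it can never catch up.
    if accel_bw <= cpu_bw:
        return math.inf
    return (1.0 / link_bw) / (1.0 / cpu_bw - 1.0 / accel_bw)

# Hypothetical bandwidths in bytes/second, chosen only for illustration.
CPU_BW, LINK_BW, ACCEL_BW = 10e9, 10e9, 100e9
N = 1e9  # one gigabyte of input

for passes in (1, 2, 8):
    c = cpu_time(N, passes, CPU_BW)
    a = accel_time(N, passes, LINK_BW, ACCEL_BW)
    winner = "CPU" if c <= a else "accelerator"
    print(f"passes={passes}: cpu={c:.3f}s accel={a:.3f}s -> {winner}")

print("break-even passes:", break_even_passes(CPU_BW, LINK_BW, ACCEL_BW))
```

With these made-up numbers the break-even lands just above one pass per byte: a single pass favors the CPU, and anything from two passes up favors the accelerator. Real systems add latency, overlap of transfer with compute, and cache effects that this model ignores.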