Supercomputers are already focused on "embarrassingly parallel" problems; otherwise 300,000 cores isn't going to do much for you anyway. However, I agree that interconnect speed would be a major issue for many supercomputer workloads. Still, I suspect that if you had access to a $10+ million supercomputer built from a million GPU cores, plenty of people would love to work with such a beast.
No, these are not just racks and racks of individual machines. The machine presents the programmer with a single system image - it "looks" like one huge expanse of memory.
We have a Blue Gene at Argonne, and it's not SSI. It is, however, not designed for embarrassingly parallel workloads; you use libraries like MPI to run tightly coupled message-passing applications (which are very sensitive to latency). You can, and people have, run many-task-style applications too - for a feel of the tightly coupled case, see the sketch below.
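For concreteness, here's a minimal sketch (not from the original comment) of the kind of tightly coupled MPI code being described: every rank must swap a value with its neighbour each step, so progress is gated by message latency. The ring topology, iteration count, and the "work" are illustrative assumptions; it assumes a standard MPI install (build with mpicc, run with mpiexec).

```c
/* Minimal tightly coupled MPI exchange: each rank swaps a boundary value
 * with its neighbours in a ring every iteration. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* neighbours in a ring */
    int left  = (rank - 1 + size) % size;

    double send_val = (double)rank, recv_val = 0.0;
    for (int step = 0; step < 100; step++) {
        /* No rank can continue until it has heard from its neighbour:
         * this is the latency-sensitive pattern described above. */
        MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, right, 0,
                     &recv_val, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        send_val = 0.5 * (send_val + recv_val);  /* stand-in for real work */
    }

    printf("rank %d finished with %f\n", rank, send_val);
    MPI_Finalize();
    return 0;
}
```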
The basic speed-of-light limitation means that accessing distant nodes is going to have high latency even if there is reasonable bandwidth, and ignoring that is a bad idea from an efficiency standpoint. And, unlike PC programming, the cost of the machine makes people far more focused on optimizing their code for the architecture than on abstracting the architecture away to help the developer out.
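To put a rough number on that (back-of-envelope, not a measurement of any real machine): signals in cable or fiber travel at roughly two-thirds of c, so a cable run across a large machine room already costs hundreds of nanoseconds round trip before any switch, NIC, or software overhead is added.

```c
/* Back-of-envelope propagation delay.  The cable length and signal speed
 * below are illustrative assumptions, not specs of any particular system. */
#include <stdio.h>

int main(void) {
    double signal_speed_m_per_s = 2.0e8;   /* ~0.67c in copper/fiber */
    double cable_run_m = 50.0;             /* plausible span of a machine room */

    double one_way_ns = cable_run_m / signal_speed_m_per_s * 1e9;
    printf("one-way propagation over %.0f m: ~%.0f ns (round trip ~%.0f ns)\n",
           cable_run_m, one_way_ns, 2.0 * one_way_ns);
    /* ~250 ns one way - already in the ballpark of a DRAM access. */
    return 0;
}
```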
It takes care of it to some extent, but you still have to be aware of it as the programmer. MPI and the associated infrastructure are set up so that they'll pick the right nodes to keep the network topology and your code's communication topology well matched. But you have to do your best as a programmer to hide the latency by spending that time doing other things.
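Here's a sketch of that latency-hiding pattern (my own illustration, assuming a 1-D halo exchange; names like N and the ring layout are made up): post nonblocking sends/receives, do the work that needs no remote data while the messages are in flight, then wait and finish the boundary.

```c
/* Overlap communication with computation using nonblocking MPI calls. */
#include <mpi.h>

#define N 1024   /* interior cells per rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    double local[N + 2];                    /* interior plus two halo cells */
    double next[N + 2];
    for (int i = 0; i < N + 2; i++) local[i] = rank;

    MPI_Request reqs[4];

    /* 1. Start the halo exchange but don't wait for it. */
    MPI_Irecv(&local[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&local[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&local[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&local[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* 2. Update interior cells, which need no remote data, while the
     *    messages are still in flight - "spending that time doing other
     *    things". */
    for (int i = 2; i <= N - 1; i++)
        next[i] = 0.5 * (local[i - 1] + local[i + 1]);

    /* 3. Only now block on the exchange, then finish the boundary cells. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    next[1] = 0.5 * (local[0] + local[2]);
    next[N] = 0.5 * (local[N - 1] + local[N + 1]);

    MPI_Finalize();
    return 0;
}
```

Whether the overlap actually pays off depends on the MPI implementation and hardware making progress in the background, which is exactly the kind of architecture awareness the comment is talking about.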