This is based on legitimate (although second-tier) academic research that appears to combine aspects of GPU-style SIMD/SIMT with Tera-style massive multithreading. (main paper appears to be https://www.utupub.fi/bitstream/handle/10024/164790/MPP-TPA-... )
Historically, the chance of such research turning into a chip you can buy is zero.
> A Finnish startup called Flow Computing is making one of the wildest claims ever heard in silicon engineering: by adding its proprietary companion chip, any CPU can instantly double its performance, increasing to as much as 100x with software tweaks.
Sounds like some of those 1970s miracle filters that would let you fuel your car with micro-energized water and hand-waving instead of gasoline.
Or the Sloot Digital Coding System: an alleged data-sharing technique whose inventor claimed it could store a complete digital movie file in 8 kilobytes of data.
100X for all workloads is more or less impossible to believe, especially if it is a parallel coprocessor of some sort (I mean, Amdahl's law would seem to indicate that this is impossible).
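A back-of-the-envelope, assuming plain Amdahl's-law scaling (my sketch, nothing from their whitepaper):

    # speedup = 1 / ((1 - p) + p / s), where p is the parallelizable
    # fraction of the workload and s the speedup on that fraction
    def amdahl_speedup(p, s):
        return 1.0 / ((1.0 - p) + p / s)

    # Even with an infinitely fast parallel unit, hitting 100X overall
    # requires at least 99% of the workload to be parallelizable:
    print(amdahl_speedup(0.99, float("inf")))  # -> ~100
    print(amdahl_speedup(0.95, 100))           # -> ~16.8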
100X on selected workloads is easy to believe; I mean, it could be a 100-wide vector unit, and the workload they get 100X on could be adding up a big line of numbers, haha.
> 100X on selected workloads is easy to believe; I mean, it could be a 100-wide vector unit, and the workload they get 100X on could be adding up a big line of numbers, haha.
Exactly my thought. Modern FPGAs have >1000 DSP elements; I can make 1000 parallel counters and theoretically “do more work”. It’s also extremely unimpressive. Haha
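For anyone who wants to see the joke in the flesh, here's roughly that benchmark, with numpy standing in for the wide vector unit (my sketch, obviously not their code):

    import time
    import numpy as np

    xs = np.random.rand(10_000_000)

    t0 = time.perf_counter()
    total = 0.0
    for x in xs:            # scalar: one element at a time
        total += x
    t1 = time.perf_counter()

    vec_total = xs.sum()    # vectorized reduction, same work
    t2 = time.perf_counter()

    print(f"scalar: {t1 - t0:.2f}s, vectorized: {t2 - t1:.4f}s")

The gap is real and enormous, and it tells you almost nothing about general-purpose performance.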
From a brief skim of the whitepaper, it seems like the general idea is something like “GPUs are <big-multiple> faster than CPUs, because GPUs are built ground-up for parallel computation. The bottleneck now is parallel memory access, let’s build a chip from the ground up that is <big-multiple> better at parallel memory access”.
Not an entirely unconvincing idea, in fairness, but what strikes me as odd is that they’re trying to patent and sell the idea, rather than the technology.
Eh... Not buying it. At least not the way they're marketing it. This looks like a group that dreamt up another one of those massively SMP multicore designs and are shopping it around as a coprocessor IP to license. Maybe there's interesting stuff in there, maybe not, but it's not a new idea. Cooking up some massively parallel benchmark and telling journalists it's going to make computers 100x faster is unethical and counterproductive.
The usual speed bumps with these SMP designs show up in the white paper. There's a section on how they can use recompilation to automatically accelerate existing code. There's a rich history of failed attempts at automatic parallelization at compile time. So this section is just an admission that you're going to have to write code specifically for this thing to get anything out of it.
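To make that concrete, here's a toy example (mine, not theirs) of the loop-carried dependency that defeats naive compile-time parallelization:

    # Each iteration reads the accumulator written by the previous one,
    # so the iterations can't simply be farmed out to parallel units.
    # Parallelizing this means recognizing a prefix-sum pattern and
    # rewriting it as a scan - something compilers rarely do automatically.
    def prefix_sums(xs):
        out, acc = [], 0
        for x in xs:
            acc += x              # depends on the previous iteration
            out.append(acc)
        return out

    # The easy case, by contrast: independent iterations, which existing
    # vectorizers already handle without any new hardware.
    def scaled(xs, k):
        return [k * x for x in xs]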
This coprocessor concept looks like it needs to be tightly coupled with the host CPU to work. The upside needs to be massive to justify a vendor integrating with this IP and also overcome the software adoption costs. Especially since they're directly competing with GPGPU...
Also some strange red flags, like the comparison to quantum computing in the FAQ. (Did they feel the need to include this because an investor asked?)
My guess is they've done some interesting core research around memory bottlenecks/latencies in a massively SMP architecture, and picked up investment to attempt to productize it. But the way they're marketing themselves right now doesn't inspire much confidence.
I briefly skimmed the white paper and FAQ, and it really sounds like they're being extremely hand-wavy about making a multi-core CPU.
"The blocks inside the CPU die are optimized and meant for different purposes - vector units for vector calculation, matrix units for matrix calculation - Parallel Processing Unit is optimized for parallel processing."
Parallel processing of what? GPUs are essentially simple CPUs, but massively parallel. If you're claiming to parallelize the processing of a general-purpose CPU, then aren't you essentially just making a multi-core CPU? How are you supposedly scaling that up to 100X?
It wasn't readily clear to me that this was an obvious pipe-dream.
I remain cautiously optimistic. There are often large performance gains left unclaimed for the purpose of "generality". My favourite example is Postgres vs TimescaleDB; by exploiting the structure of certain tables (in the case of TimescaleDB, time-order), we can get better performance -- but of course that only works with time-series data.
Could it be that by focusing on parallel operations in a separate chiplet, these workloads can be made much faster? Maybe someone here has the background to tell me why I should be more pessimistic.
Glanced at it quickly, and they mention new parallelism primitives that can be used to write faster parallel code, as well as speeding up existing parallel code. So if anything is 100X faster, it's probably some very specific routine written to take advantage of the primitives.
> Our investors were especially excited about the innovativeness and uniqueness of Flow’s technology, its strong IP portfolio and the potential to enable a new era of superCPUs (CPU 2.0) for the AI revolution.
This part in particular makes me think they're building off hype - they showed a tech demo that did well for some specific case that happens to be a buzzword.
The following statements:
> A. Nonexistent cache coherence issues. Unlike in current CPU systems, in Flow’s architecture there are no cache coherence issues in the memory systems due to the memory organization excluding caches in the front of the intercommunication network.
And
> E. Low-level parallelism for dependent operations. In Flow-enabled CPUs, it is possible to execute dependent operations with the full utilization within a step (with the help of chaining of functional units), whereas in current CPUs the operations executed in parallel need to be independent due to parallel organization of the functional units.
Raise my eyebrows more than a little bit. If, for example (assume for a second you do not have optimized instructions for this), you have E = D * C and C = B + A, how exactly can you compute E without first knowing C? You can't, not without computing all possible values of E for every possible C (inefficient).
Introduce locking into the situation (which is where cache coherence issues come into play), and their "excluding caches in the front of the intercommunication network" makes no sense.
You need locks to make guarantees about memory safety. Saying that you have none is fine if you have nothing to lock, but then you're likely not doing general-purpose processing; you're likely doing something specific.
Which I think is a clue as to what they are really doing:
> D. Flexible threading/fibering scheme. Flow-computing technology allows an unbounded number of fibers at the model level, which can also be supported in hardware (within certain bandwidth constraints). In current-generation CPUs, the number of threads is - in theory - not bounded, but if the number of hardware threads is exceeded in the case of interdependencies, the results can be very bad. In addition, the operating systems typically limit the number of threads to a few thousand at most. The mapping of fibers to backend units is a programmable function allowing further performance improvements in Flow.
To me, this sounds like someone CPU-ified a GPU. That's the primary purpose of a GPU: efficiently running a shit-ton of threads. Except of course GPUs aren't great at general-purpose processing. But it fits the use case of AI algorithms well, which stand to gain a lot from general GPU improvements.
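To put rough numbers on the threads-vs-fibers point, here's a sketch with asyncio tasks standing in for "fibers" (this is not Flow's API, just an illustration of why lightweight contexts scale where OS threads don't):

    import asyncio

    async def fiber(i):
        await asyncio.sleep(0)   # yield to the scheduler; no real work
        return i

    async def main():
        # 100,000 native threads would exhaust memory/handles on most
        # systems; 100,000 cooperative tasks is routine, since each is
        # just a small heap object rather than a kernel thread + stack.
        results = await asyncio.gather(*(fiber(i) for i in range(100_000)))
        print(len(results))

    asyncio.run(main())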
> Raise my eyebrows more than a little bit. If, for example (assume for a second you do not have optimized instructions for this), you have E = D * C and C = B + A, how exactly can you compute E without first knowing C? You can't, not without computing all possible values of E for every possible C (inefficient).
I think what they mean (according to their diagram) is that each result has to be written back to the register file before another unit can use it. So conventionally you would compute C = A + B, update the C register, and in the next step compute E = D * C.
What they seem to claim is that they directly bypass the computed C result from the add unit to the multiply unit, hence it is pipelined. This is a bit disingenuous, as any performant CPU worth its salt will have operand bypassing.
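A toy cycle-count model of the difference being claimed, with all numbers invented purely for illustration:

    EXEC = 1       # cycles to execute one op (made up)
    WRITEBACK = 1  # cycles to commit a result to the register file (made up)

    # Without bypassing, C = A + B must go through write-back before
    # E = D * C can read C; with bypassing/chaining, the adder's result
    # feeds the multiplier directly.
    def chain_latency(num_dependent_ops, bypass):
        per_op = EXEC if bypass else EXEC + WRITEBACK
        return num_dependent_ops * per_op

    print(chain_latency(2, bypass=False))  # 4 cycles
    print(chain_latency(2, bypass=True))   # 2 cycles

Either way, the dependency itself never goes away; you just shave the register-file round trip, which mainstream out-of-order cores already do.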
> This is a bit disingenuous, as any performant CPU worth its salt will have operand bypassing.
Right, but that's an optimization of individual operations. Nobody disputes that particular instruction sequences (a combined add/multiply, say) can be optimized and pipelined, but that's very different from claiming that arbitrary calculations no longer have dependent steps.
The article doesn't say no; it says "I don't know." The author asks readers in the body whether they know the answer. This is the first question-marked headline I've ever seen that's actually meant as a question.
Also, the studies on that Wikipedia page disprove the law.
> But the company can’t quite show any of that today — because Flow hasn’t built a chip and doesn’t necessarily intend to build one, its co-founders tell The Verge.
The answer is clearly, no, they didn't make CPUs 100x faster. Maybe they intend to, but that's not the same thing as having done it.
If I understood things correctly, they do have in common that they employ chaining of execution units: a result can be used as an operand to another instruction without going through the register file. (The Mill does not even have a register file.)
Otherwise, they are completely different. The Flow processor is more like a SIMT GPU modified to run general-purpose code. The Mill is more like a DSP modified to run general-purpose code.
I think this is possible. In order to double the speed of computing on a CPU you install a special "parallel processing unit" that is ... a second CPU! Ta da! Doubled the speed!