Inference will be cheapest when run in a shared cloud environment, simply due to the LLM roofline: decoding is memory-bandwidth-bound, so batching many users' requests amortizes the cost of streaming the weights. Thus, most B2B use cases are likely to be datacenter-based, like AWS today.
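A minimal back-of-envelope sketch of that roofline argument, with purely illustrative numbers (a hypothetical 70B fp16 model and roughly H100-class memory bandwidth, ignoring KV-cache traffic, compute limits, and interconnect):

    # Illustrative only: why batching matters when decode is memory-bandwidth-bound.
    PARAMS = 70e9          # assumed 70B-parameter model
    BYTES_PER_PARAM = 2    # fp16 weights
    HBM_BW = 3.35e12       # bytes/s, roughly one H100's HBM bandwidth

    def tokens_per_second(batch_size: int) -> float:
        """Each decode step must stream all weights from HBM once, no matter
        how many sequences share that step, so larger batches amortize the
        same weight traffic over more generated tokens."""
        weight_bytes_per_step = PARAMS * BYTES_PER_PARAM
        steps_per_second = HBM_BW / weight_bytes_per_step
        return steps_per_second * batch_size

    for b in (1, 8, 64):
        print(f"batch={b:3d}: ~{tokens_per_second(b):,.0f} tokens/s aggregate")

A single user on their own box runs at roughly the batch=1 line; a shared datacenter can keep the batch full, which is where the cost advantage in the comment above comes from.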
Of course, CERN is still going to use its FPGAs, hyper-optimized for its specific trigger model at the LHC, and Apple is going to use a specialized low-power ASIC running a quantized model for "Hey Siri", but I meant the majority use case.
I do not buy this premise. I think it will end up being cheaper to simply run the LLMs directly on the user's device.
I think there are plenty of competitors in the "LLMs with open weights" space to essentially make the models a commodity. All that is left is the compute cost, and there is no way someone will run a datacenter more cheaply than "the computer that I already have running on my desk".
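For a rough sense of what that comparison hinges on, here is an illustrative-only sketch; every number in it (GPU power draw, local throughput, electricity price, cloud API price) is an assumption, not a measurement:

    # Illustrative only: marginal cost per token on already-owned hardware vs. a shared cloud.
    LOCAL_POWER_W = 350              # assumed desktop GPU draw while generating
    LOCAL_TOKENS_PER_S = 30          # assumed local throughput for a modest model
    ELECTRICITY_USD_PER_KWH = 0.15   # assumed residential rate
    CLOUD_USD_PER_1M_TOKENS = 0.50   # assumed shared-cloud API price

    def local_cost_per_million_tokens() -> float:
        """Marginal cost if the desk machine is already paid for: electricity only."""
        seconds = 1e6 / LOCAL_TOKENS_PER_S
        kwh = LOCAL_POWER_W / 1000 * seconds / 3600
        return kwh * ELECTRICITY_USD_PER_KWH

    print(f"local (electricity only): ${local_cost_per_million_tokens():.2f} per 1M tokens")
    print(f"cloud (assumed API price): ${CLOUD_USD_PER_1M_TOKENS:.2f} per 1M tokens")

The sketch only shows that the outcome is sensitive to those assumptions, not that either side wins in general.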
I make your point every time this comes up[1], but it's absolutely surprising how few business people, most of whom have some credibility in the form of qualifications or experience, actually recognise a value chain when they see it.