To my understanding, the reason companies don't mind the hallucinations is the acceptable error rate for a given system. Say a model hallucinates 25% of the time; if that's tolerable, it's fine for a certain product. If it only hallucinates 5% of the time, it's good enough for even more products, and so on. The market will just choose the LLM appropriate to the tolerable error rate.
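To make that concrete, here's a minimal sketch of that selection logic. The model names and error rates below are invented for illustration; in practice the rates would come from your own evals, and cost would factor in too:

```python
# Hypothetical hallucination rates, as if measured on an internal eval set
# (model names and numbers are made up for illustration).
MEASURED_ERROR_RATES = {
    "small-cheap-model": 0.25,
    "mid-tier-model": 0.05,
    "frontier-model": 0.01,
}

def pick_model(tolerable_error_rate: float) -> str | None:
    """Return the cheapest model whose measured error rate fits the budget."""
    # The dict is ordered cheapest-first, so the first hit is the cheapest fit.
    for model, error_rate in MEASURED_ERROR_RATES.items():
        if error_rate <= tolerable_error_rate:
            return model
    return None  # no model is reliable enough for this product

# A casual brainstorming tool can live with 25% errors; a support bot cannot.
print(pick_model(0.30))  # -> small-cheap-model
print(pick_model(0.02))  # -> frontier-model
```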
At scale, you are doing the same thing with humans too. LLMs seem to have an error rate similar to humans for the majority of simple, boring tasks, if not even a bit better since they don't get distracted and start copying and pasting their previous answers.
The difference with LLMs is that they simply cannot (currently) do the most complex tasks that some humans can, and when they do produce erroneous output, the errors aren't very human-like. We can all understand a cut-and-paste error, so we don't hold it against the operator, but making up sources feels like a lie and breeds distrust.
> At scale, you are doing the same thing with humans too. LLMs seem to have an error rate similar to humans for the majority of simple, boring tasks, if not even a bit better since they don't get distracted and start copying and pasting their previous answers.
This is the big one missed by the frequent comments on here wondering whether LLMs are a fad, or claiming that in their current state they cannot replace humans in non-trivial real-world business workflows. In fact, even 1.5 years ago, at the time of GPT-3.5, the technology was already good enough.
The yardstick is the performance of humans in the real world on a specific task. Humans who are often tired, nursing a cold, distracted, or going through a divorce. Humans who, even in great condition, make plenty of mistakes.
I guess a lot of developers struggle with understanding this because so far, when software has replaced humans, it was software that on the face of it (though often not in practice) did not make mistakes if bug-free. But that has never been necessary for software to replace humans - hence buggy software still succeeding in doing so. Of course, software often replaces humans even when it's worse at a task, purely for cost reasons.
They're at the very least competitive with, if not better than, doctors at diagnosing illnesses [1].
Related to that, I once had a CT scan for a potentially fatal brain concern, and the note that the radiologist sent back to my consultant was for a completely different patient, and the notes for my scan were attached to someone else's report. The only reason it was caught was because it referred to me as "she".
If we were both the same gender, I probably would have had my skull opened up for no reason, and she would have been discharged and later died.
> The yardstick is the performance of humans in the real world on a specific task.
Humans make human errors that we can anticipate, recognize, counter, and mitigate. Deterministic automation rose precisely because it helps with the parts that are most likely to generate errors. The LLM strategy always seems to be solving a problem that is orthogonal to business objectives and mainly serves individuals instead.
Almost all deterministic automation also has error rates. Those error rates were orders of magnitude higher in the past, but we got better at building reliable software.
We're judging an entirely new segment of development after only two years of it being broadly available to the public. And overall, LLMs have gotten dramatically better in that time.
The bigger, more controversial claim is that LLMs will be a net loss for human jobs, when all past automation has been a net positive - including IT, where automation has led to vast growth in software jobs, as more can be accomplished with higher-level languages, tools, frameworks, etc.
For example, compilers didn't put programmers out of business in the 60s; they made programming accessible to more people through higher-level languages.
A net positive in the long term matters little when it can mean a lifetime of unemployment for a generation of humans. It's easy to dismiss the human suffering incurred during industrialization when we can enjoy its fruits, but those who suffered are long dead.
Imagine having a backend that's down 20-40 days per year; yeah, that would be bad.
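For scale (my own back-of-the-envelope arithmetic, not a figure from the thread), treating an error rate as if it were backend downtime:

```python
# Back-of-the-envelope: convert an error rate into "days down" per year.
for error_rate in (0.05, 0.10, 0.25):
    days_down = error_rate * 365
    print(f"{error_rate:.0%} error rate ~= {days_down:.0f} days 'down' per year")
# 5% ~ 18 days, 10% ~ 36 days, 25% ~ 91 days
```

So the 20-40 days quoted above corresponds to an error rate somewhere around 5-11%.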
Companies do not care about hallucinations because bad text output is not considered an error, and as long as it doesn't raise a Datadog alert, it won't be taken seriously.
I mean, do you remember the early 2000s? We had so many web pages that would go down on a daily basis. Stability is something we achieved over time.
Also, again, if it's bad, nobody will use it and the product will die. In that scenario, the companies with a lower error rate (that don't use AI) will win the market.