I think in a lot of ways we are already there. Users are clearly having difficulty telling which model is better, or whether new models are actually improving on old ones. People go back to the same gotcha questions and get different answers depending on the random seed. Even the benchmarks are getting saturated.
These models already do an excellent job with your homework, your corporate PowerPoints, and your idle questions. At some point only experts will be able to decide whether one response is really better than another.
Our biggest challenge is going to be finding problem domains where performance is still low but can be scaled up to human level. And those will be so niche that no one will care.
Agents, on the other hand, still have a lot of potential. If you can get a model to stay on task over a long context and remain grounded, then you can start firing your staff.