Hacker News

>> I don’t think it’s surprising the model did poorly.

But it did poorly only on the problems it hadn't seen before. Was it prompted differently on one kind of problem, compared to the other?



But you can do a task you’ve done before even from a poor specification. Sure, maybe it is contamination. But who cares? We ought to judge the tool only on how well it carries out good instructions.



