
I don't get the point. A simple CNN with stride = 1 should be able to solve this perfectly and generalize to any input size.
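
For what it's worth, a minimal sketch of what I mean (PyTorch; my own example, since the task details aren't in the thread): a fully convolutional stack with stride 1 and no fixed-size dense head runs unchanged on any input length:

  import torch
  import torch.nn as nn

  # Fully convolutional: stride-1 convs and no fixed-size dense head,
  # so the same weights apply to inputs of any length.
  model = nn.Sequential(
      nn.Conv1d(1, 16, kernel_size=3, stride=1, padding=1),
      nn.ReLU(),
      nn.Conv1d(16, 1, kernel_size=3, stride=1, padding=1),
  )

  for length in (32, 128, 1000):
      x = torch.randn(1, 1, length)
      print(model(x).shape)  # torch.Size([1, 1, length])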


It wasn't obvious that a transformer could do this and learn to reproduce a convolution via attention.


But it can, as long as the positional embeddings are sufficient, i.e. you use relative positional embeddings here.
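
To spell out the construction (my own sketch, not from the work under discussion): give each attention head a hard relative-position bias so it attends only to one fixed offset. Each head then just copies x[i + offset], and summing the heads weighted by the kernel taps is a 1-D convolution:

  import torch
  import torch.nn.functional as F

  def conv_via_attention(x, kernel):
      # x: (seq_len, d); kernel: list of (offset, tap_weight) pairs.
      # Caveat: at the sequence edges the fully masked softmax row
      # degrades to a uniform average, so boundaries only approximate
      # zero padding.
      n = x.shape[0]
      idx = torch.arange(n)
      rel = idx[None, :] - idx[:, None]    # rel[i, j] = j - i
      out = torch.zeros_like(x)
      for offset, w in kernel:
          bias = torch.full((n, n), -1e9)  # mask everything...
          bias[rel == offset] = 0.0        # ...except j = i + offset
          attn = F.softmax(bias, dim=-1)   # one-hot "copy" attention
          out = out + w * (attn @ x)       # one head per kernel tap
      return out

  x = torch.randn(10, 4)
  y = conv_via_attention(x, kernel=[(-1, 0.25), (0, 0.5), (1, 0.25)])

With learned (rather than hard-coded) relative position biases, gradient descent can land on the same attention pattern, which is why the relative embeddings matter here.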



