
This is what Tomas Mikolov said on Facebook:

> I wanted to popularize neural language models by improving Google Translate. I did start collaboration with Franz Och and his team, during which time I proposed a couple of models that could either complement the phrase-based machine translation, or even replace it. I came up (actually even before joining Google) with a really simple idea to do end-to-end translation by training a neural language model on pairs of sentences (say French - English), and then use the generation mode to produce translation after seeing the first sentence. It worked great on short sentences, but not so much on the longer ones. I discussed this project many times with others in Google Brain - mainly Quoc and Ilya - who took over this project after I moved to Facebook AI. I was quite negatively surprised when they ended up publishing my idea under now famous name "sequence to sequence" where not only I was not mentioned as a co-author, but in fact my former friends forgot to mention me also in the long Acknowledgement section, where they thanked personally pretty much every single person in Google Brain except me. This was the time when money started flowing massively into AI and every idea was worth gold. It was sad to see the deep learning community quickly turn into some sort of Game of Thrones. Money and power certainly corrupts people...

Reddit post: "Tomas Mikolov is the true father of sequence-to-sequence" https://www.reddit.com/r/MachineLearning/comments/18jzxpf/d_...



As another small hint of Mikolov-vs-Le divergence: they're the coauthors of the 'Paragraph Vector' paper (https://arxiv.org/abs/1405.4053), which applies a slightly-modified version of word2vec to vectorize longer texts, still in a very shallow way. (This technique often goes by the name 'doc2vec', though other things sometimes get called that, too.)
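(For anyone who wants to poke at it, here's a minimal sketch of how the technique is usually exercised these days via gensim's Doc2Vec; the toy corpus & parameters are my own placeholders, not the paper's setup.)

    # Minimal Doc2Vec sketch -- illustrative toy data, not the paper's experiments.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document gets a unique tag so its trained vector can be looked up later.
    docs = [
        TaggedDocument(words=["deep", "learning", "for", "text"], tags=["doc0"]),
        TaggedDocument(words=["shallow", "models", "can", "surprise"], tags=["doc1"]),
    ]

    # dm=1 selects the PV-DM variant; dm=0 would select PV-DBOW.
    model = Doc2Vec(docs, dm=1, vector_size=50, window=5, min_count=1, epochs=40)

    # Vectors for training docs are stored; unseen text is handled by inference.
    print(model.dv["doc0"][:5])
    print(model.infer_vector(["new", "unseen", "text"])[:5])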

There are some results in that paper, supposedly produced by exactly the technique described, on an open dataset, that no one has ever been able to reproduce – & you can see the effort has frustrated a lot of people, over the years, in different forums.

When asked, Mikolov has said, essentially: "I can't reproduce that either – those tests were run & reported by Le, you'll have to ask him."


This is interesting. I went off and searched for paragraph vector code and indeed find doc2vec stuff, including tutorials referring to the paper such as https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.... It’s not obvious that the results aren’t reproducible (and I realise code is not the same as published results), but I wonder if you could steer us more specifically.


As I understand it, no one has come close to the claimed results in s3.1 ("Sentiment Analysis with the Stanford Sentiment Treebank Dataset"), and for s3.2 ("Beyond One Sentence: Sentiment Analysis with IMDB dataset") people have come closer but still not matched the paper.
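For context, the usual shape of an IMDB (s3.2) reproduction attempt is roughly this sketch: train Doc2Vec on the reviews, then classify sentiment from the inferred vectors. The load_imdb() helper, the classifier choice, and all hyperparameters here are my assumptions, not the paper's exact recipe.

    # Rough shape of a typical s3.2 reproduction attempt; sizes are illustrative.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load_imdb() is a hypothetical loader returning texts & 0/1 sentiment labels.
    train_texts, train_labels, test_texts, test_labels = load_imdb()

    tagged = [TaggedDocument(words=t.lower().split(), tags=[i])
              for i, t in enumerate(train_texts)]

    # PV-DBOW variant; dimensions/window/epochs are illustrative, not the paper's.
    model = Doc2Vec(tagged, dm=0, vector_size=100, window=10,
                    min_count=2, epochs=20, workers=4)

    X_train = [model.dv[i] for i in range(len(train_texts))]
    X_test = [model.infer_vector(t.lower().split()) for t in test_texts]

    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    print("test accuracy:", accuracy_score(test_labels, clf.predict(X_test)))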

A thread where Mikolov tries to help others with his patch to `word2vec.c` (demoing a tiny bit of 'Paragraph Vector'), but reaches the limits of what he understands Le to have done, is: https://groups.google.com/g/word2vec-toolkit/c/Q49FIrNOQRo/m...

My own frustration (& recommendation to avoid the thin stanfordSentimentTreebank/RottenTomatoes data/results) is mentioned at: https://groups.google.com/g/word2vec-toolkit/c/ubFrO0a9Pe8/m...

I'd say that "concatenating PV-DBOW and (plain averaging) PV-DM" never seems to offer much lift compared to the favorable way it's described in the paper and other Le comments. And after spending a bunch of time implementing the "PV-DM with concatenation (rather than sum/average) of many word-vectors as the context", as I interpret the paper's description "To predict the 8-th word, we concatenate the paragraph vectors and 7 word vectors", I've only seen it massively increase model size & training time for very little advantage.
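To make those two setups concrete, here's roughly how I'd map them onto gensim's flags, assuming dm/dm_mean/dm_concat correspond to the paper's variants the way I've described them; the corpus & sizes are toy placeholders.

    # Sketch of the two variants discussed above as gensim configurations.
    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(["some", "example", "review", "text"], ["doc0"]),
              TaggedDocument(["another", "short", "document"], ["doc1"])]

    # (a) Concatenating PV-DBOW with plain-averaging PV-DM: train both models,
    #     then glue the per-document vectors together for a downstream classifier.
    pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=20)
    pv_dm = Doc2Vec(corpus, dm=1, dm_mean=1, window=5, vector_size=100,
                    min_count=1, epochs=20)
    combined = np.hstack([pv_dbow.dv["doc0"], pv_dm.dv["doc0"]])  # 200-d feature

    # (b) PV-DM with *concatenated* (not averaged) context word-vectors:
    #     dm_concat=1 keeps every context slot separate, so the projection layer
    #     grows to roughly (2*window + 1) * vector_size -- hence the blow-up in
    #     model size & training time mentioned above. (The paper's "7 word
    #     vectors" reads like a one-sided 7-word context; gensim's window is
    #     symmetric, so this is only an approximation of that setup.)
    pv_dm_concat = Doc2Vec(corpus, dm=1, dm_concat=1, window=7, vector_size=100,
                           min_count=1, epochs=20)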

Reddit searches on related topics turn up other anecdotes/resentments; eg:

https://www.reddit.com/r/MachineLearning/comments/18jzxpf/co...

https://www.reddit.com/r/MachineLearning/comments/hkiyir/com...

At a certain level, with so much "gold in them thar hills" in follow-up work, from both academic & commercial perspectives, I don't blame Le for rushing forward to other related fertile ideas & ignoring the requests for explanation.

But there's something sloppy or fishy (hiding secret tweaks?) in the originally-claimed PV results, which has wasted a lot of time among those trying to understand & reproduce them.



