
This is what Tomas Mikolov said on Facebook:

> I wanted to popularize neural language models by improving Google Translate. I did start collaboration with Franz Och and his team, during which time I proposed a couple of models that could either complement the phrase-based machine translation, or even replace it. I came up (actually even before joining Google) with a really simple idea to do end-to-end translation by training a neural language model on pairs of sentences (say French - English), and then use the generation mode to produce translation after seeing the first sentence. It worked great on short sentences, but not so much on the longer ones. I discussed this project many times with others in Google Brain - mainly Quoc and Ilya - who took over this project after I moved to Facebook AI. I was quite negatively surprised when they ended up publishing my idea under now famous name "sequence to sequence" where not only I was not mentioned as a co-author, but in fact my former friends forgot to mention me also in the long Acknowledgement section, where they thanked personally pretty much every single person in Google Brain except me. This was the time when money started flowing massively into AI and every idea was worth gold. It was sad to see the deep learning community quickly turn into some sort of Game of Thrones. Money and power certainly corrupts people...

Reddit post: "Tomas Mikolov is the true father of sequence-to-sequence" https://www.reddit.com/r/MachineLearning/comments/18jzxpf/d_...



As another small hint of Mikolov-vs-Le divergence: they're the coauthors of the 'Paragraph Vector' paper (https://arxiv.org/abs/1405.4053), which applies a slightly-modified version of word2vec to vectorize longer texts, still in a very shallow way. (This technique often goes by the name 'doc2vec', though other things sometimes get called that, too.)
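(For anyone who wants to poke at it, here's a minimal sketch of how the technique is usually exercised these days via gensim's Doc2Vec; the toy corpus & parameters are my own placeholders, not the paper's setup.)

    # Minimal Doc2Vec sketch -- illustrative toy data, not the paper's experiments.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document gets a unique tag so its trained vector can be looked up later.
    docs = [
        TaggedDocument(words=["deep", "learning", "for", "text"], tags=["doc0"]),
        TaggedDocument(words=["shallow", "models", "can", "surprise"], tags=["doc1"]),
    ]

    # dm=1 selects the PV-DM variant; dm=0 would select PV-DBOW.
    model = Doc2Vec(docs, dm=1, vector_size=50, window=5, min_count=1, epochs=40)

    # Vectors for training docs are stored; unseen text is handled by inference.
    print(model.dv["doc0"][:5])
    print(model.infer_vector(["new", "unseen", "text"])[:5])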

There are some results in that paper, supposedly produced by exactly the technique described, on an open dataset, that no one has ever been able to reproduce – & you can see the effort has frustrated a lot of people, over the years, in different forums.

When asked, Mikolov has said, essentially: "I can't reproduce that either – those tests were run & reported by Le, you'll have to ask him."


This is interesting. I went off and searched for paragraph vector code and indeed find doc2vec stuff, including tutorials referring to the paper such as https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.... It’s not obvious that the results aren’t reproducible (and I realise code is not the same as published results), but I wonder if you could steer us more specifically.


As I understand it, no one has come close to the claimed results in s3.1 ("Sentiment Analysis with the Stanford Sentiment Treebank Dataset"), and for s3.2 ("Beyond One Sentence: Sentiment Analysis with IMDB dataset") people have come closer but still not matched the paper.
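For context, the usual shape of an IMDB (s3.2) reproduction attempt is roughly this sketch: train Doc2Vec on the reviews, then classify sentiment from the inferred vectors. The load_imdb() helper, the classifier choice, and all hyperparameters here are my assumptions, not the paper's exact recipe.

    # Rough shape of a typical s3.2 reproduction attempt; sizes are illustrative.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load_imdb() is a hypothetical loader returning texts & 0/1 sentiment labels.
    train_texts, train_labels, test_texts, test_labels = load_imdb()

    tagged = [TaggedDocument(words=t.lower().split(), tags=[i])
              for i, t in enumerate(train_texts)]

    # PV-DBOW variant; dimensions/window/epochs are illustrative, not the paper's.
    model = Doc2Vec(tagged, dm=0, vector_size=100, window=10,
                    min_count=2, epochs=20, workers=4)

    X_train = [model.dv[i] for i in range(len(train_texts))]
    X_test = [model.infer_vector(t.lower().split()) for t in test_texts]

    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    print("test accuracy:", accuracy_score(test_labels, clf.predict(X_test)))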

A thread where Mikolov tries to help others with his patch to `word2vec.c` (demoing a tiny bit of 'Paragraph Vector'), but reaches the limits of what he understands Le to have done, is: https://groups.google.com/g/word2vec-toolkit/c/Q49FIrNOQRo/m...

My own frustration (& recommendation to avoid the thin stanfordSentimentTreebank/RottenTomatoes data/results) is mentioned at: https://groups.google.com/g/word2vec-toolkit/c/ubFrO0a9Pe8/m...

I'd say that "concatenating PV-DBOW and (plain averaging) PV-DM" never seems to offer much lift compared to the favorable way it's described in the paper and other Le comments. And after spending a bunch of time implementing the "PV-DM with concatenation (rather than sum/average) of many word-vectors as the context", as I interpret the paper's description "To predict the 8-th word, we concatenate the paragraph vectors and 7 word vectors", I've only seen it massively increase model size & training time for very little advantage.
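To make those two setups concrete, here's roughly how I'd map them onto gensim's flags, assuming dm/dm_mean/dm_concat correspond to the paper's variants the way I've described them; the corpus & sizes are toy placeholders.

    # Sketch of the two variants discussed above as gensim configurations.
    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(["some", "example", "review", "text"], ["doc0"]),
              TaggedDocument(["another", "short", "document"], ["doc1"])]

    # (a) Concatenating PV-DBOW with plain-averaging PV-DM: train both models,
    #     then glue the per-document vectors together for a downstream classifier.
    pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=20)
    pv_dm = Doc2Vec(corpus, dm=1, dm_mean=1, window=5, vector_size=100,
                    min_count=1, epochs=20)
    combined = np.hstack([pv_dbow.dv["doc0"], pv_dm.dv["doc0"]])  # 200-d feature

    # (b) PV-DM with *concatenated* (not averaged) context word-vectors:
    #     dm_concat=1 keeps every context slot separate, so the projection layer
    #     grows to roughly (2*window + 1) * vector_size -- hence the blow-up in
    #     model size & training time mentioned above. (The paper's "7 word
    #     vectors" reads like a one-sided 7-word context; gensim's window is
    #     symmetric, so this is only an approximation of that setup.)
    pv_dm_concat = Doc2Vec(corpus, dm=1, dm_concat=1, window=7, vector_size=100,
                           min_count=1, epochs=20)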

Reddit searches on related topics turn up other anecdotes/resentments; eg:

https://www.reddit.com/r/MachineLearning/comments/18jzxpf/co...

https://www.reddit.com/r/MachineLearning/comments/hkiyir/com...

At a certain level, with so much "gold in them thar hills" in follow-up work, from both academic & commercial perspectives, I don't blame Le for rushing forward to other related fertile ideas & ignoring the requests for explanation.

But there's something sloppy or fishy (hiding secret tweaks?) in the originally-claimed PV results, which has wasted a lot of time among those trying to understand & reproduce them.



