oldesthacker's comments

oldesthacker · on Dec 18, 2023

This is what Tomas Mikolov said on Facebook:

> I wanted to popularize neural language models by improving Google Translate. I did start collaboration with Franz Och and his team, during which time I proposed a couple of models that could either complement the phrase-based machine translation, or even replace it. I came up (actually even before joining Google) with a really simple idea to do end-to-end translation by training a neural language model on pairs of sentences (say French - English), and then use the generation mode to produce translation after seeing the first sentence. It worked great on short sentences, but not so much on the longer ones. I discussed this project many times with others in Google Brain - mainly Quoc and Ilya - who took over this project after I moved to Facebook AI. I was quite negatively surprised when they ended up publishing my idea under now famous name "sequence to sequence" where not only I was not mentioned as a co-author, but in fact my former friends forgot to mention me also in the long Acknowledgement section, where they thanked personally pretty much every single person in Google Brain except me. This was the time when money started flowing massively into AI and every idea was worth gold. It was sad to see the deep learning community quickly turn into some sort of Game of Thrones. Money and power certainly corrupts people...

Reddit post: "Tomas Mikolov is the true father of sequence-to-sequence" https://www.reddit.com/r/MachineLearning/comments/18jzxpf/d_...

gojomo · on Dec 18, 2023

As another small hint of Mikolov-vs-Le divergence: they're the coauthors of the 'Paragraph Vector' paper (https://arxiv.org/abs/1405.4053) applying a slightly-modified version of word2vec to vectorize longer texts, still in a very shallow way. (This technique often goes by the name 'doc2vec', but other things also sometimes get called that, too.)

There are some results in that paper, supposedly from exactly the technique described on an open dataset, that no one has ever been able to reproduce – & you can see the effort has frustrated a lot of people, over the years, in different forums.

When asked, Mikolov has said, essentially: "I can't reproduce that either – those tests were run & reported by Le, you'lll have to ask him."

blueboo · on Dec 19, 2023

This is interesting. I went off and searched for paragraph vector code and indeed find doc2vec stuff, including tutorials referring to the paper such as https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.... It’s not obvious that the results aren’t reproducible (and I realise code is not the same as published results), but I wonder if you could steer us more specifically.

gojomo · on Dec 19, 2023

As I understand it, no one has come close to the claimed results in s3.1 ("Sentiment Analysis with the Stanford Sentiment Treebank Dataset") and people have come closer but still not matched those in s3.2 ("Beyond One Sentence: Sentiment Analysis with IMDB dataset").

A thread where Mikolov is trying to help other with his patch to `word2vec.c` demoing a tiny bit of 'Paragraph Vector' – but reaches the limits of what he understands Le to have done – is: https://groups.google.com/g/word2vec-toolkit/c/Q49FIrNOQRo/m...

My own frustration (& reco to avoid the thin stanfordSentimentTreebank/RottenTomatoes data/results) is mentioned at: https://groups.google.com/g/word2vec-toolkit/c/ubFrO0a9Pe8/m...

I'd say that "concatenating PV-DBOW and (plain averaging) PV-DM" never seems to offer much lift compared the favorable way it's described in the paper and other Le comments. And after spending a bunch of time implementing the "PV-DM with concatenation (rather than sum/average) of many word-vectors as the context", as I interpret the paper's description "To predict the 8-th word, we concatenate the paragraph vectors and 7 word vectors", I've only seen it massively increase model size & training time for very little advantage.

Reddit searches on related topics turn up other anecdotes/resentments; eg:

https://www.reddit.com/r/MachineLearning/comments/18jzxpf/co...

https://www.reddit.com/r/MachineLearning/comments/hkiyir/com...

At a certain level, with so much "gold in them thar hills" in followup work – from both academic & commercial perspectives, I don't blame Le for rushing forward to other related fertile ideas, & ignoring the requests-for-explanation.

But there's something sloppy or fishy (hiding secret tweaks?) in the originally-claimed PV results, which has wasted a lot of time among those trying to understand & reproduce.

oldesthacker · on Dec 17, 2023

The results of Table 3 are not really exciting. Could this change with 100 times more data? The key novelty in the specific context of this particular application is the quantized variational encoder used "to derive discrete codex encoding and align it with pre-trained language models."

oldesthacker · on Dec 17, 2023

Let's take an example: his "unnormalised linear Transformer," a neural network with "self-attention" published in 1992 under another name. It wasn't just an idea, it was implemented and tested in experiments. However, few people cared, because the computational hardware was so slow, and such networks were not yet practical. Decades later, the basic concepts were "reinvented" and renamed, and today they are really useful on much faster computers. Unquestionably, however, their origins must be cited by those who are applying them.

Why are some people here even debating the generally recognised rules of scientific publishing mentioned in the paper:

> The deontology of science requires: If one "reinvents" something that was already known, and only becomes aware of it later, one must at least clarify it later, and correctly give credit in all follow-up papers and presentations.

oldesthacker · on Dec 17, 2023

> Isn't that exercising political capital?

These are very good points.

> ... the worst of many abusers of the citation system.

Is this ad hominem argument by user "lacker" meant to distract from the omissions of the awardees Bengio, Hinton, and LeCun? Especially the first two got tons of citations for work that should have credited Schmidhuber's lab: the analysis of vanishing gradients in neural networks, the principle of generative adversarial networks, attention in neural networks, distilling neural networks, speech recognition with LSTM neural networks, self-supervised pre-training, and more.

The disputes with LeCun are more recent and of lesser magnitude IMO.

oldesthacker · on Dec 17, 2023

The machine learning field as a whole has a huge credit assignment problem. This post seems to encourage other ML researchers to come out with their own priority disputes. Tomas Mikolov just aired his grievances:

> I wanted to popularize neural language models by improving Google Translate. I did start collaboration with Franz Och and his team, during which time I proposed a couple of models that could either complement the phrase-based machine translation, or even replace it. I came up (actually even before joining Google) with a really simple idea to do end-to-end translation by training a neural language model on pairs of sentences (say French - English), and then use the generation mode to produce translation after seeing the first sentence. It worked great on short sentences, but not so much on the longer ones. I discussed this project many times with others in Google Brain - mainly Quoc and Ilya - who took over this project after I moved to Facebook AI. I was quite negatively surprised when they ended up publishing my idea under now famous name "sequence to sequence" where not only I was not mentioned as a co-author, but in fact my former friends forgot to mention me also in the long Acknowledgement section, where they thanked personally pretty much every single person in Google Brain except me. This was the time when money started flowing massively into AI and every idea was worth gold. It was sad to see the deep learning community quickly turn into some sort of Game of Thrones. Money and power certainly corrupts people...

Reddit post: "Tomas Mikolov is the true father of sequence-to-sequence" https://www.reddit.com/r/MachineLearning/comments/18jzxpf/d_...

oldesthacker · on Dec 17, 2023

How new is this insight? The failure of teams of highly capable individuals is often due to egocentric silver-haired gorillas in the room:

>They spent excessive time in abortive or destructive debate, trying to persuade other team members to adopt their own view, and demonstrating a flair for spotting weaknesses in others' arguments. This led to the discussion equivalent of 'the deadly embrace'. They had difficulties in their decision making, with little coherence in the decisions reached (several pressing and necessary jobs were often omitted). Team members tended to act along their own favourite lines without taking account of what fellow members were doing, and the team proved difficult to manage. In some instances, teams recognised what was happening but over compensated - they avoided confrontation, which equally led to problems in decision making.

munch117 · on Dec 17, 2023

> How new is this insight?

1981.

OhThatGuy · on Dec 17, 2023

As a silver haired person I really appreciate the comment about us. It is always nice to generalize an entire group of people and stereotype them because of the way they look. Great idea.

brabel · on Dec 17, 2023

I understood that to be a reference to actual gorillas:

> They tend to live in troops, with the leader being called a silverback.

Source: https://en.wikipedia.org/wiki/Gorilla

oldesthacker · on Dec 17, 2023

Thanks! Of course, the silverbacks! No offense.

InCityDreams · on Dec 17, 2023

Oh no, even we old(er) people are catching 'offence'.

oldesthacker · on Dec 17, 2023

> It was all about the economics, all the way back from early 90s. We cant do x node because it would be too expensive to sustain it at a node every two years.

Totally agree.

> AI trend could carries us forward to may be 2035. But we dont have another product category like iPhone.

There will be fancier iPhones with on board offline Large Language Models and other Foundation Models to talk to, solving all kinds of tasks for you that would require a human assistant today.

oldesthacker · on Dec 17, 2023

Interesting bit about Samsung’s secret sauce:

Samsung went even smaller than Intel, showing results for 48-nm and 45-nm contacted poly pitch (CPP), compared to Intel’s 60 nm, though these were for individual devices, not complete inverters. Although there was some performance degradation in the smaller of Samsung’s two prototype CFETs, it wasn’t much, and the company’s researchers believe manufacturing process optimization will take care of it. Crucial to Samsung’s success was the ability to electrically isolate the sources and drains of the stacked pFET and nFET devices. Without adequate isolation, the device, which Samsung calls a 3D stacked FET (3DSFET), will leak current. A key step to achieving that isolation was swapping an etching step involving wet chemicals with a new kind of dry etch. That led to an 80 percent boost in the yield of good devices. Like Intel, Samsung contacted the bottom of the device from beneath the silicon to save space. However, the Korean chipmaker differed from the American one by using a single nanosheet in each of the paired devices, instead of Intel’s three. According to its researchers, increasing the number of nanosheets will enhance the CFET’s performance.