Hacker News | tbalsam's comments

This is the common belief but not quite correct! The Muon update was proposed by Bernstein in a theoretical paper as a concrete realization of the theory, and Keller implemented it and added the practical things needed to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).
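For the curious, the core of the update is momentum followed by approximate orthogonalization of the matrix-shaped gradient via a quintic Newton-Schulz iteration. A rough numpy sketch (the quintic coefficients are the ones from Keller's public implementation; the function names and standalone framing here are mine, not from either paper):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately drive G's singular values toward 1 (quintic Newton-Schulz)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # aggressive quintic coefficients
    X = G / (np.linalg.norm(G) + eps)  # scale so all singular values are <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T  # work in the short-and-wide orientation so X @ X.T stays small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # maps each s -> a*s + b*s^3 + c*s^5
    return X.T if transposed else X

def muon_update(grad, momentum_buf, beta=0.95):
    """Heavy-ball momentum, a Nesterov-style blend, then orthogonalize."""
    momentum_buf = beta * momentum_buf + grad
    update = grad + beta * momentum_buf  # the "post-Nesterov" step
    return newton_schulz_orthogonalize(update), momentum_buf
```

The orthogonalized update is then applied with a learning rate as usual; the iteration only makes sense for 2D (matrix) parameters, which is why embeddings and output layers get AdamW instead.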

Both share equal credit I feel (also, the paper's co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.

(Source: am experienced speedrunner who's been in these circles for a decent amount of time)


I think it's good to bring up Bernstein & Newhouse as well as Yuchen Jin, Jiacheng You and the other speedrunners who helped iterate on Muon. But I think it's very fair to call Keller Jordan the main author of Muon in its current form. I'm also in the speedrunning community, though maybe not for as long as you have been.


shocked quack


They unfortunately recently (last few years) sold out to private equity (which tends to gloss over fundamentals and tries to pump out massive amounts of content, using the previous brand quality to give it credence), so beware of the quality of more recent vids:

https://youtu.be/hJ-rRXWhElI?si=Zdsj9i_raNLnajzi


Yesterday I was reading comments about how the market could pay for research and avoid the “distorting effects” of public funding.

Is there any way to get a better outcome for the public here, or is “do good stuff then sell out” the way it’s always going to be?


What distorting effects of public funding? What about the distortionary effects of the market? I'll offer the suggestion that what you read is brainrotting private market propaganda designed to erode the public institutions that make America happier, healthier, and wealthier.


In economics discussions regarding public funding policy, "crowding out" of commercial firms or nonprofits is a real concern. It's definitely an observed, measured, and reported phenomenon.

In the end, incentives matter.

https://en.wikipedia.org/wiki/Crowding_out_(economics)


There is no private market entity with an incentive to provide research to the public, so in this sense there is no crowding out. Providing research to the public enables the discovery of new products which would otherwise have not been created. Public research is a public good that makes our nation happier, healthier, and wealthier.


Let's ignore FOSS contributions for a moment, which very much contradict your claim that private companies don't contribute research to the public.

Outside software technology: there is a series of papers from Grossman (going back to the 80s!) that analyzes basic versus applied research in a macroeconomic framework. Basic research _can_ be a public good, and applied research can be crowded out. Combine that with microeconomic research showing that monopolies can be dynamically efficient (investing in applied and basic R&D, like Bell Labs) and you get several examples and theories that contradict your statement that "there is no private market entity with an incentive to provide research to the public."

Another real-world example in hardware that contradicts this claim is the evolution of building control systems. Before the advent of IoT, so circa the 1980s - 2010s, you saw increasing sharing and harmonization of competing electronics standards, because it turned out to be more efficient to be modular, not have to re-hire subcontractors at exorbitant rates to maintain or replace components that go haywire, etc.


Including FOSS in this conversation is so wild that it's ridiculous. You mean creating a product as a loss leader to get people into an ecosystem, farm social capital, create a sales funnel, or get free labor from the community to provide QA? The creation and release of software is NOWHERE NEAR the same category as "doing actual real scientific research"; it just smells of incredibly bad-faith argumentation.

Economic analysis? Another intelligence product that requires essentially no staff, no actual R&D, no equipment besides computers? Brother, you have to be kidding me.

The hardware thing is just companies evolving to a shared standard.

Do you have even a little bit of a clue how hard it is to do good pharmacological research? Toxicological? Biological? Chemical? Physical? You have mentioned intelligence products with 0 investment cost and 0 risk of failure.

This is perhaps one of the most fart-sniffing tech-centric perspectives I have ever been exposed to. Go read some actual research by actual scientists and come back when you can tell me why, for instance, Eli Lilly would ever make their data or internal R&D public.

Jonas Salk did it. He is an extremely rare exception, and his incentive was public health. Notice that his incentive was markedly not financial.

Market entities with a financial incentive, whose entire business model and success is predicated on their unique R&D results, have 0 incentive to release research to the public.


I apologize that my points in my prior comment were so easy to misunderstand. Your response here shows a dramatic superficiality in understanding each of the areas I brought up, when they weren't missed entirely. My hope is you can move past your rhetorical stumbling blocks in the conversation -- if that proves impossible, I'm happy to leave things with this being my last comment in our shared thread.

(1) FOSS is not only the next hyped front-end framework or modern data stack funnel. I encourage you to look into what European universities and organizations are doing. Not everyone follows the American or Chinese extractive approaches to software.

Further, while many corporations do indeed farm social capital and perform other appreciably maladaptive and cynic-inducing behaviors, the universe and the space of organizations is large. There are a great many examples of governments adopting public research and development released by private entities -- in FOSS and in other contexts.

Additionally, the fact that FOSS-product-focused companies tend to launch _after_ FOSS becomes successful to support the FOSS offering with associated services is quite a bit different from what is perhaps a FAANG-induced cynicism. To reiterate - the universe and the space of organizations is large.

(2) You interpreted my pointing to economic analysis as a comparison of public vs. private R&D. This is a misinterpretation on your part, and I encourage a re-read. I pointed to findings and studies to help you understand where the organizational and market frameworks for analysis stand.

(3) I am a researcher and regularly publish my findings, under the banner of the university I support, under the non-profits I support, and under the company I run. I appreciate that your experience has made you cynical. Let's break down this section.

> This is perhaps one of the most fart-sniffing tech-centric perspectives I have ever been exposed to.

This was not received as a good-faith statement, and further discussion on it will only engender argument. I suggest we move beyond trivial digs.

> Eli Lilly would ever make their data or internal R&D public.

Not to shill for them, but your point on Eli Lilly is incorrect. Eli Lilly has worked towards more transparent release of information -- they voluntarily launched an online clinical trial registry starting in 2002 (for Phase II–IV trials initiated on/after October 15, 2002) and extended full trial registration (including Phase I) from October 1, 2010.[0] Since 2014, Lilly has published clinical study results (Phase 2/3) regardless of outcome, adhering to PhRMA/EFPIA transparency principles. Patient-level data on approved indications of marketed products is available to qualified researchers via a controlled-access third-party portal.[1] Beginning in 2021, Lilly has also produced plain‑language summaries of Phase 2–4 results in English, and more recently extended plain‑language summaries to Phase 1 trials in the EU in compliance with new regulations.[1]

Especially the third point is relevant -- good government regulation leads to better sharing and transparency. Smart companies take regulation as an innovation opportunity.

> Jonas Salk did it. He is an extremely rare exception, and his incentive was public health. Notice that his incentive was markedly not financial.

Aye, and I wish that all medical and life-enhancing research could be accomplished as relatively cheaply or as magnanimously as Jonas Salk.

> Market entities with a financial incentive, whose entire business model and success is predicated on their unique R&D results, have 0 incentive to release research to the public.

Please refer to (2) for studies and theory for why this is untrue.

Market entities built solely on unique R&D tend to fail due to poor delivery of product, so their incentive to release their R&D to the world is more or less moot. I do acknowledge the existence of market entities who are built solely on operationalizing R&D -- I challenge the implicit claim that all market entities fall into this category.

[0] https://www.lilly.com/au/policies-reports/transparency-discl...

[1] https://sustainability.lilly.com/governance/business-ethics


You could argue that Bell Labs was essentially government funded, as the monopoly/concession of the entire US telephony infrastructure is what made it possible, and research at universities was not funded anywhere near current levels.

They were also forced in the 1950s to license all their innovations freely, as compensation for holding a monopoly. Which only strengthens the parent’s point that private institutions have little incentive to work for public benefit.


Galileo wants a word with you ... from heaven.


That whole discussion is based on the assumption that commercial firms or nonprofits are better in some way than publicly funded research. That is the stupid neoliberal dogma that private and market economy always are better than things that are run by our elected officials. That dogma has to die.


Price as a market signal precedes neoliberalism by several decades to several millennia, depending on which economic historian you speak with. Is your argument that basic research which has no immediately attributable applications is better handled by publicly funded research? I mostly agree to that. Applied research is definitely handled better by commercial firms and nonprofits when handling is defined by what people are willing to value (pay for).

If we're talking about applied technology in the public goods space, then it can be a toss up. Sustainability research, for example, can be quite blurry as to whether the market is pricing it in or not as applied or basic research -- really depends on how a government handles externalities and regulatory capture!

I'll 100% agree to government entities as well as some well-chartered public entities being absolutely awesome at setting up incentive structures for desired outcomes. There is actually a whole field of research dedicated to the topic of incentive structuring called mechanism design -- think of it as the converse to Game Theory and strategic behavioral analysis -- that policy design and analysis learn from.

I'll also note that governments aren't structured to efficiently provide benefits or just-in-time delivery in most situations. Though the discussion has made me more curious about how operationally efficient the DOD is for civilian goods distribution, given it supports a massive population.


I'm pointing out that there is an implied assumption that private always is better than public, and that assumption in many cases is just plain wrong. Not in all cases, market economy works great for many things, but there are also many cases it plainly sucks. When you warn that private initiatives might be crowded out, it is implicit that those are more desirable than public initiatives.

This kind of discussion is a bit off topic here, but I think it is important to remind people that the idea that private always is better than public is ideological dogma, not science. But your latest comment makes me believe you agree with that.


Yep, we agree in total. You often hear the opposite dogma too, that governments are wonderfully efficient and all markets are broken.

A moderate path, like what we see in the Scandinavian countries, looks to be a better model.


Completely agree. Neoliberalism and its consequences have been a disaster for mankind.


I disagree. I think neoliberalism has done a remarkable job bringing the majority of the world out of subsistence. I also think it is a target for hijack by neofeudalists, as neoliberalism is realpolitik without self-reference.


What's the easiest way to reliably check if a Youtube channel was sold to private equity? Is that info always a matter of public record?


I'm not entirely sure, to be honest. If you look at the linked video, they state that it's oftentimes not in the private equity group's financial interest to announce that a channel has been sold to them.

How that plays out in practice, I'm not sure, and I'm sure with some sleuthing it would be possible to find out at least some of it. But on the whole, I'm honestly not sure beyond that.


There are versions of this kind of benchmark with a higher threshold; however, raising it only seems to shift the timetables by a linear amount, so you're only buying 1-2 years or so, depending on what you want that % success rate to be.


The only limit is yourself

Source: One of the most classic internet websites, zombo.com (sound on)



No! This is not good.

Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're seeing slowdowns from Pythonland, then you're doing something terribly wrong.

Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
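Concretely, Python can call compiled code directly without even writing a custom extension module; a minimal ctypes sketch below, calling a C function from the system math library (the library name lookup is platform-dependent, so treat this as illustrative):

```python
import ctypes
import ctypes.util

# Locate and load the system C math library (e.g. libm.so.6 on Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

result = libm.sqrt(2.0)  # calls the compiled C function directly
```

For heavier lifting, pybind11 or torch C++ extensions serve the same role: Python orchestrates, the compiled code runs the hot loop.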


This is (and was) the dream of Cerebras, and I am very glad to see it embraced, even if in small part, on a GPU. Wild to see how much performance is left on the table for these things; it's crazy to think how much can be done by a few bold individuals when it comes to pushing the SOTA of these kinds of things (not just in kernels either -- in other areas as well!)

My experience has been that getting over the daunting factor of feeling afraid of a big wide world with a lot of noise and marketing and simply committing to a problem, learning it, and slowly bootstrapping it over time, tends to yield phenomenal results in the long run for most applications. And, if not, then there's often an applicable one/side field that can be pivoted to for still making immense/incredible progress.

The big players may have the advantage of scale, but there is so, so much that can be done still if you look around and keep a good feel for it. <3 :)


As someone who's done a fair bit of architecture work -- both are important! Making it either/or is a very silly thing; each is the limiting factor for the other, and there are no two ways about it.

Also, for classification, MaxPooling is often far superior: you can learn an average smoothing filter in your convolutions beforehand, in a data-dependent manner, so that the Nyquist sampling criteria are properly preserved.
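To illustrate the smooth-then-maxpool idea in 1D (a minimal sketch -- in a real network the smoothing filter would be learned by the preceding convolution rather than fixed as it is here):

```python
import numpy as np

def blur_then_maxpool(x, stride=2):
    # Smooth with a fixed binomial [1, 2, 1]/4 filter (standing in for a
    # learned anti-aliasing filter), then take a strided max over windows.
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    smoothed = np.convolve(x, kernel, mode="same")
    n = len(smoothed) // stride * stride  # drop any ragged tail
    return smoothed[:n].reshape(-1, stride).max(axis=1)

blur_then_maxpool(np.arange(8.0))  # -> array([1., 3., 5., 6.])
```

The smoothing means the strided max no longer aliases high-frequency content, which is the Nyquist point above.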

Also, please do smoothed crossentropy for image class stuff (generally speaking, unless maybe data is hilariously large), MSE won't nearly cut it!

But that being said, adaptive stuff certainly is great when doing classification. Something to note is that batching does become an issue at a certain point -- as well as certain other fine-grained details if you're simply going to average it all down to one single vector (IIUC).


> Also, please do smoothed crossentropy for image class stuff (generally speaking, unless maybe data is hilariously large), MSE won't nearly cut it!

Of course. The MSE here is not intended to be a training loss, but as a means to demonstrate that both approaches lead to almost the same result except for some rounding error. The MSE is somewhere in the order of 10^-9.

> Also, for classification, MaxPooling is often far superior, you can learn an average smoothing filter in your convolutions beforehand in a data-dependent manner so that Nyquist sampling stuff is properly preserved.

I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98 % of the gradients and training would take much longer. (The shape of the input feature layer is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)
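(For reference, the global average pooling in question is just a mean over the spatial axes, matching the shapes above:)

```python
import numpy as np

# Final feature maps, shaped (batch, channels, height, width) as described.
features = np.random.randn(1, 768, 7, 7)

# Global average pooling: mean over the spatial axes, keeping dims.
pooled = features.mean(axis=(2, 3), keepdims=True)
print(pooled.shape)  # (1, 768, 1, 1)
```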

> Something to note is that batching does become an issue at a certain point

Could you elaborate on that?


> The MSE here is not intended to be a training loss, but as a means to demonstrate that both approaches lead to almost the same result except for some rounding error.

Ah, gotcha

> I don't think that max pooling the last feature maps would be a good idea here, because it would cut off about 98 % of the gradients and training would take much longer. (The shape of the input feature layer is (1, 768, 7, 7), pooled to (1, 768, 1, 1).)

MaxPooling is generally only useful if you're training your network for it, but in most cases it ends up performing better. That sparsity actually ends up being a good thing -- you generally need to suppress all of those unused activations! It ends up being quite a wide gap in practice (and, if you have convolutions beforehand, using avgpooling2d is a bit of wasted extra computation blurring the input).

> Could you elaborate on that?

Variable-sized inputs don't batch easily, as the input dims need to match. You can go down the padding route, but that has its own particularly hellacious costs that end up taking away from compute you could be using for other useful things.
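A minimal sketch of the padding route (hypothetical helper, just to show where the waste comes from -- the mask marks real positions, and everything outside it is compute spent on padding):

```python
import numpy as np

def pad_batch(seqs, pad_value=0.0):
    """Pad variable-length 1D inputs to a common length, with a validity mask."""
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_value)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = s
        mask[i, : len(s)] = True
    return batch, mask

batch, mask = pad_batch([np.arange(3.0), np.arange(7.0)])
wasted = 1.0 - mask.mean()  # fraction of the batch that is pure padding
```

With very uneven lengths, `wasted` gets large quickly, which is the cost being described.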


As someone who has worked in computer vision ML for nearly a decade, this sounds like a terrible idea.

You don't need RL remotely for this usecase. Image resolution pyramids are pretty normal tho and handling them well/efficiently is the big thing. Using RL for this would be like trying to use graphene to make a computer screen because it's new and flashy and everyone's talking about it. RL is inherently very sample inefficient, and is there to approximate when you don't have certain defined informative components, which we do have in computer vision in spades. Crossentropy losses (and the like) are (generally, IME/IMO) what RL losses try to approximate, only on a much larger (and more poorly-defined) scale.

Please mark speculation as such -- I've seen people see confident statements like this and spend a lot of time/manhours on it (because it seems plausible). It is not a bad idea from a creativity standpoint, but practically is most certainly not the way to go about it.

(That being said, you can try for dynamic sparsity stuff, it has some painful tradeoffs that generally don't scale but no way in Illinois do you need RL for that)


SPECULATION ALERT! I think there's reasonable motivation though. In the last few years there has been a steady drip of papers in the general area, at least insofar as they use vision transformers and image pyramids, and work on applying RL to object detection goes back before that. IoU and the general way SSD and YOLO descendants are set up is kind of wacky so I don't think it's much of a stretch to try to both 1) avoid attending to or materializing most of the pyramid, and 2) go directly to feature proposals without worrying about box anchors or grid cells or any of that. Now with that context if you still think it's a terrible idea, well, you're probably more current than I am.


Not bad frustrations at all. That said -- IoU is how the final box scores are calculated; that doesn't change how you do feature aggregation, which will happen in basically any technique you use.

Modern SSD/YOLO-style detectors use efficient feature pyramids; you need them to propose where things are in the image.

This sounds a lot like going back to the old-school object detection techniques, which end up being generally very compute-inefficient.


McCormick is a popular brand of seasonings hahaha

https://i5.walmartimages.com/seo/McCormick-Pure-Ground-Black...

