
But I also fail catastrophically once a reasoning problem exceeds modest complexity.


Do you? Don't you just halt and say this is too complex?


Nope, audacity and Dunning-Kruger all the way, baby


Some would consider that to be failing catastrophically. The task is certainly failed.


Halting is sometimes preferable to thrashing around and running in circles.

I feel like if LLMs "knew" when they're out of their depth, they could be much more useful. The question is whether knowing when to stop can be meaningfully learned from examples with RL. From all we've seen, the hallucination problem and this stopping problem boil down to the same thing: you could teach the model to say "I don't know", but if that's part of the training dataset it might just spit out "I don't know" to random questions, because it's a likely response in the realm of possible responses, rather than spitting out "I don't know" when it actually doesn't know.

SocratesAI is still unsolved, and LLMs are probably not the path to knowing that you know nothing.
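
A toy sketch of that reward-design trap (the numbers and the always-abstain policy are made up for illustration, not from any real training setup):

    # Toy model of rewarding "I don't know": abstaining scores 0, a correct
    # answer +1, a wrong answer -1. Nothing here ties the abstention to the
    # model actually not knowing.
    def reward(answer: str, correct_answer: str) -> float:
        if answer == "I don't know":
            return 0.0
        return 1.0 if answer == correct_answer else -1.0

    def expected_reward(p_correct: float, always_abstain: bool) -> float:
        """Expected reward for a policy that always abstains vs. always guesses."""
        if always_abstain:
            return 0.0
        return p_correct * 1.0 + (1.0 - p_correct) * -1.0

    # On any class of questions where the model is right less than half the
    # time, spamming "I don't know" maximizes reward, even for the questions
    # it could have answered. The signal can't separate "I don't know because
    # I don't know" from "I don't know because it's the safest reply".
    for p in (0.3, 0.5, 0.8):
        print(p, expected_reward(p, always_abstain=False), expected_reward(p, always_abstain=True))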


> if LLMs "knew" when they're out of their depth, they could be much more useful.

I used to think this, but I'm no longer sure.

Large-scale tasks just grind to a halt with more modern LLMs because of this perception of impassable complexity.

And it's not that these tasks need extensive planning; the LLM knows what needs to be done (it'll even tell you!). It's just more work than will fit within a "session" (an arbitrary boundary), so it would rather refuse than get started.

So you're now looking at TODOs, and hierarchical plans, and all this unnecessary pre-work even when the task scales horizontally very well (if it just jumped into it).


I would consider detecting your own limits when trying to solve a problem preferable to the illusion that your solution is working and correct.


This seems to be the stance of the creators of agentic coders. They are so bent on creating something, even if that something makes no sense whatsoever.


Ah yes, the function that halts if the input problem would take too long to halt.

But yes, I assume you mean they abort their loop after a while, which they do.

This whole idea of a "reasoning benchmark" doesn't sit well with me. It still seems ill-defined.

Maybe it's just bias I have or my own lack of intelligence, but it seems to me that using language models for "reasoning" is still more or less a gimmick and convenience feature (to automate re-prompts, clarifications etc, as far as possible).

But reading this pop-sci article from summer 2022, it seems like this definition problem hasn't changed very much since then.

It's about AI progress before ChatGPT, though, and it doesn't even mention the GPT base models. Sure, some of the tasks mentioned in the article seem dated today.

But IMO, there is still no AI model that can be trusted to, for example, accurately summarize a Wikipedia article.

Not all humans can do that either, sure. But humans are better at knowing what they don't know, and at deciding which other humans can be trusted. And of course, none of this is an arithmetic or calculation task.

https://www.science.org/content/article/computers-ace-iq-tes...


I also fail catastrophically when trying to push nails through walls, but I expect my hammer to do better.


I have one hammer and I expect it to work on every nail and screw. If it's not a general hammer, what good is it now?


You don't need a "general hammer" - they are old fashioned - you need a "general-purpose tool-building factory factory factory":

https://www.danstroot.com/posts/2018-10-03-hammer-factories


Reminds me of a 10 letter Greek word that starts with a k.


Gold and shovels might be a more fitting analogy for AI


But you recognise you are likely to fail, and thus don't respond, or redirect the problem to someone who has a greater likelihood of not failing.


I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I don’t find all these claims that models are somehow worse than humans in such areas convincing. Yes, they’re worse in some respects. But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?


> I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I have too, and I sense that this is something that has been engineered in rather than coming up naturally. I like it very much and they should do it a lot more often. They're allergic to "I can't figure this out", but hearing "I can't figure this out" is what alerts me to help it over the hump.

> But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

Only if you consider speed to failure and inaccuracy. They're very much subhuman in output, but you can make them retry a lot in a short time, and refine what you're asking them each time to avoid the mistakes they're repeatedly making. But that's you doing the work.


> For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?

Ez, just use codegen.

Also the second part (not having bugs) is unlikely to be true for the LLM-generated code, whereas traditional codegen will actually generate code with pretty much no bugs.
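
To make the comparison concrete, "traditional codegen" here means deterministic template expansion along these lines (the schema and names are hypothetical, just to illustrate why the output is syntactically valid every single run):

    # Minimal sketch of traditional codegen: mechanically expand a schema into
    # boilerplate dataclasses. No model, no sampling, so no syntax errors.
    SCHEMA = {
        "User": {"id": "int", "name": "str", "email": "str"},
        "Order": {"id": "int", "user_id": "int", "total": "float"},
    }

    def generate_dataclasses(schema: dict) -> str:
        lines = ["from dataclasses import dataclass", ""]
        for cls, fields in schema.items():
            lines.append("@dataclass")
            lines.append(f"class {cls}:")
            for field, type_ in fields.items():
                lines.append(f"    {field}: {type_}")
            lines.append("")
        return "\n".join(lines)

    print(generate_dataclasses(SCHEMA))

The flip side, as pointed out below, is that it only ever emits the patterns it was written to emit.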


I have Claude reducing the number of bugs in my traditional codegen right now.


What's your point? Traditional codegen tools are inflexible in the extreme compared to what LLMs can do.

The realistic comparison is between humans and LLMs, not LLMs and codegen tools.


The point was that the cited ability to produce tons of boilerplate code within a short period of time is a... pointless metric to cite


If that were true, we would live in a utopia. People vote/legislate/govern/live/raise/teach/preach without ever learning to reason correctly.


Yes, but you are not a computer. There is no point building another human. We have plenty of them.


Others would beg to differ on whether we should build a machine which can act as a human.


I'm not American so please educate me if I'm wrong but haven't y'all had school shootings every single day for at least a decade or something?


I'm highly skeptical about this paper just because the resulting images are in color. How the hell would the model even infer that from the input data?


It is an overfitted model that uses WiFi data as hints for generation:

"We consider a WiFi sensing system designed to monitor indoor environments by capturing human activity through wireless signals. The system consists of a WiFi access point, a WiFi terminal, and an RGB camera that is available only during the training phase. This setup enables the collection of paired channel state information (CSI) and image data, which are used to train an image generation model"


That's just a diffusion model (Stable Diffusion 1.5) with a custom encoder that uses CSI measurements as input. So apparently the answer is it's all hallucinated.
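
For a sense of what "a custom encoder that uses CSI measurements as input" could look like, here's a minimal PyTorch sketch; the CSI tensor shape and layer sizes are assumptions for illustration, not the paper's actual architecture:

    # Rough sketch (not the paper's code): map a WiFi CSI measurement to the
    # (77, 768) conditioning shape SD 1.5's UNet expects from its text encoder,
    # so the CSI embedding can stand in for a prompt via cross-attention.
    import torch
    import torch.nn as nn

    class CSIEncoder(nn.Module):
        def __init__(self, n_antennas: int = 3, n_frames: int = 100, n_subcarriers: int = 64):
            super().__init__()
            in_dim = n_antennas * n_frames * n_subcarriers
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(in_dim, 2048),
                nn.GELU(),
                nn.Linear(2048, 77 * 768),
            )

        def forward(self, csi: torch.Tensor) -> torch.Tensor:
            return self.net(csi).view(csi.shape[0], 77, 768)

    # Training pairs CSI with camera images so the encoder learns where the
    # person is; at test time only the CSI is available, and everything else
    # (room, colors) comes from what the diffusion model memorized.
    csi = torch.randn(1, 3, 100, 64)   # fake batch: antennas x frames x subcarriers
    cond = CSIEncoder()(csi)           # -> torch.Size([1, 77, 768])
    print(cond.shape)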


Right, but it's hallucinating the right colours, which to me feels like some data is leaking somewhere. Because there's no way WiFi sees colours.


Different materials and dyes have different dialectical properties. These examples are probably confabulation but I'm sure it's possible in principle.


Assuming you mean dielectric, but I do like the idea that different colors are different arguments in conflict with each other.


Well, perhaps it can: a 2.4 GHz antenna is just a very red lightbulb. Maybe material absorption correlates, though it would be a long shot?


You can't even pick colour out of infra-red-illuminated night time photography. There's no way you can pick colour out of WiFi-illuminated photography.


There would be some correlation between the visual color of objects and their spectrum at other EM frequencies, since many objects' colors come from the same dyes or pigments. But it seems pretty unlikely to be reliable across a range of objects, materials, and dyes, because there is no universal RGB dye or pigment set we rely upon. You can make the same red color many different ways, and each material will have different spectral "colors" outside the visual range. Even something simple like black plastic can be completely transparent in other spectrums, as the PS3 was to infrared. Structural colors would probably be impossible to discern, though I don't think too many household objects have structural colors unless you've got a stuffed bird or fish on the wall.


If it sees the shape of a fire extinguisher, the diffusion model will "know" it should be red. But that's not all that's going on here. Hair color etc. seems impossible to guess, right? To be fair, I haven't actually read the paper, so maybe they explain this.


downvoted until you read the paper


This is largely guesswork, but I think what's happening is this. The training set contains images of a small number of rooms taken from specific camera angles with only that individual standing in it, plus the associated WiFi signal data. The model then learns to predict the posture of the individual given the WiFi signal data, outputting the prediction as a colour image. Given that the background doesn't vary across images, the model learns to predict it consistently, with accurate colors etc.

The interesting part of the whole setup is that the wifi signal seems to contain the information required to predict the posture of the individual to a reasonably high degree of accuracy, which is actually pretty cool.


The model was trained on images of that particular room, from that particular angle. It can only generate images of that particular room.


Gee I wonder why the US gov showered TSMC in money to build a fab on US soil.


Again, that is nice in the short term, but not for long term development if the worst happens between China and Taiwan.


We'd still have the fab on US soil even if they went to war. I don't think China would invade the US over it.


What about the next fab? And the fab after that? Yes, for the short term the USA has a cutting edge fab on US soil, but what about 10 years after that? The US would lack the ability to build future cutting edge fabs.


Having TSMC build a fab on US soil doesn't prevent any parallel efforts. If anything having the workforce here would make it easier to build future fabs.


I never said it did. Having a fab on US soil is important for the short term, but you need cutting edge fabs on US soil in the future too and that doesn't happen unless you are doing a lot of research.

Okay, so Taiwan is blockaded or invaded by China. The USA has a 3 nm TSMC fab that it can assume control over, and, yes, it has the labor of that fab, great, but what about 2 nm? 1 nm? Etc.? Without TSMC's R&D, does the US have a cutting-edge fab in 10 years? 20 years? Beyond? There is literally no other company in the United States that could even hope to expand its capabilities to be considered on the cutting edge within the next 15 years.


The easiest way to get TSMC's R&D in the US is to have them build a fab here and have employees here. If China invades Taiwan, and TSMC has employees that want to flee to somewhere else, the US would be the most logical option. If they already have a fab here, an established workforce and infrastructure here, that's better than having to start from nothing.


Taiwan as a country doesn't want that. There is a reason they only fab their latest and greatest chips domestically in Taiwan. It is in Taiwan's national interest and in TSMC's interest to do everything they can to ensure that the US protects them from China.

Again, it is a net positive for the US to have TSMC manufacture chips on US soil, so I am not arguing against that point; I simply posit it is not enough from a US national security / technology leadership standpoint.


I think we agree and we've just been talking past each other. Cheers.


Best.


Unless you chronically need healthcare. In that case: good luck.


I don't understand how so many tech-minded people on this site completely disregard the value of privacy. How is this a win?


My expectation of privacy when in a vehicle with dozens if not hundreds of strangers with cameras is low.


Not sure if I understand correctly, but are you saying that IBAN leaks personal information?


This depends on where in Europe. From personal experience of a relative (going back 20+ years), the Netherlands is very accessible by wheelchair.

