Hacker News | martbakler's comments

To also jump in here: regarding tools, the agents had access to function signatures (i.e. tool docstrings, input and output types) and, for each tool, a small "manual" that described what the tool does, how it affects the game state, and a small number of examples of where the tool would be useful (for instance, how to use place_entity_next_to to put an inserter next to an existing chest)
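For a concrete picture, a manual entry might look something in the spirit of the sketch below (the dict layout, signature, and field names are my illustrative guesses, not the actual FLE format):

```python
# Illustrative sketch only: the manual structure, signature, and field names
# below are assumptions for the example, not the actual FLE layout.
TOOL_MANUAL = {
    "place_entity_next_to": {
        "signature": "place_entity_next_to(entity, reference, direction) -> Entity",
        "description": "Places `entity` adjacent to `reference`, offset in `direction`.",
        "example": "inserter = place_entity_next_to(Prototype.BurnerInserter, chest, Direction.RIGHT)",
    }
}

def render_manual(name: str) -> str:
    # Flatten one manual entry into the text block handed to the agent's context.
    entry = TOOL_MANUAL[name]
    return "\n".join([entry["signature"], entry["description"], "Example: " + entry["example"]])

print(render_manual("place_entity_next_to"))
```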

Overall, as Jack said, no post-training was done at all, but all agents had a complete API description (tools, entities, research) in their context, so the results indicate to some extent how well modern agents can use a completely OOD API with a decent level of documentation


Indeed, I think the trade-off here is that the more "pure Factorio" the images we give to the agents, the more likely it is that they've seen them during training (from Google etc.); however, the signal-to-noise ratio is low, and hence the current models get confused as the map complexity (number of entities) and level of detail grow. If we start to create custom images, we can reduce the unneeded noise, but then we risk giving something completely OOD to the agent (unless we train a visual encoder) and the performance also tanks


This is interesting: one of our findings was that Claude was capable of essential tasks & simple automation (e.g. an iron gear wheel factory in lab-play) but didn't even try them during the "build the biggest factory" game episodes. So the models can do these essential tasks, but when given a general goal, i.e. "complete the game", they don't have a good enough level of long-term planning to even attempt them. Often they just built uncoordinated small-scale constructs without attempting to scale up existing factories

That was also one of our goals: to find out how the models act when given a very vague and general objective


We are thinking of something like this (a curriculum approach) for further training. The reason we didn't want to do it for the current work, where the emphasis is on evaluations, is that the "difficulty level" of different tasks is quite subjective, and hence we would need to make arbitrary decisions that could affect the evals (e.g. which tasks would follow which scenarios, how to ensure sufficient coverage across all difficulty levels, etc.)


"a curriculum approach" is a nice way to put it!

> the difficulty level of different tasks is subjective

That makes sense. I wonder if the difficulty of different scenarios could be derived by assuming a partial ordering and ranking based on training rate: e.g. the model performs better at scenario T if it trains on scenario A first, but training on scenario B first doesn't help with T. Then infer A < T, and B ? T.
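A rough sketch of that inference (the success rates and baseline numbers below are made up; in practice they'd come from the eval harness):

```python
from itertools import permutations

def infer_partial_order(scenarios, success_rate, baseline, margin=0.05):
    # Record (a, t) whenever training on `a` first lifts performance on `t`
    # noticeably above the from-scratch baseline; interpret as a < t.
    order = []
    for a, t in permutations(scenarios, 2):
        if success_rate(a, t) - baseline[t] > margin:
            order.append((a, t))
    return order

# Toy numbers: pretraining on A helps T, pretraining on B doesn't.
baseline = {"A": 0.5, "B": 0.5, "T": 0.3}
transfer = {("A", "T"): 0.6}
rate = lambda a, t: transfer.get((a, t), baseline[t])

print(infer_partial_order(["A", "B", "T"], rate, baseline))  # [('A', 'T')]
```

Scenarios that never appear on the right-hand side of an inferred pair would then be candidates for the start of the curriculum.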


Engineering-wise it's actually quite trivial, but the underlying question is which modality is best for eliciting spatial reasoning capabilities from current general models. A couple of months ago we tried (very anecdotally) to get an agent to reason over a couple of ASCII representations of factories, and the results weren't very promising. It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens

The question is what the most efficient and high-quality representation is that we could use to improve that


> It seems the models struggle with creating an accurate internal spatial representation of the game state only using textual tokens

That'd actually be interesting research material for the claim that LLMs are able to build internal representations of the world. (Either they can't at all, which would be an important insight, or it turns out there's something fundamentally different about modalities that engages different reasoning/world-model capabilities, which would be even more interesting.)

Or, if you want to really go wild, "what capabilities allow models to reason in modalities fundamentally different from their input data/training data".

Damn it, I should quit and go back to University. [Ed.: She wouldn't quit, she likes her job, don't believe her]


Did you try providing 2D vectors of where each object relates to every other object? Seems like the most obvious way.

In my experience the current generation of models is very poor at spatial reasoning even when given accurate coordinate-based location assignments for each object. But I suspect that when a model can build the whole relationship map of all objects by being given those spatial relationships as vectors, it will be much better.
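As a sketch of what I mean (the entity names and layout are made up), handing the model explicit pairwise offsets instead of only absolute coordinates:

```python
def relative_vectors(entities):
    # entities: name -> (x, y). Returns (a, b) -> (dx, dy), the 2D vector from a to b.
    return {
        (a, b): (bx - ax, by - ay)
        for a, (ax, ay) in entities.items()
        for b, (bx, by) in entities.items()
        if a != b
    }

layout = {"drill": (0, 0), "inserter": (1, 0), "chest": (2, 0)}
vecs = relative_vectors(layout)
# vecs[("drill", "chest")] == (2, 0): the chest sits two tiles east of the drill
```

The mapping is quadratic in the number of entities, so for a big factory you'd probably only serialize vectors within some local radius.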


We did discuss this at some point but didn't end up trying it out. I think it's quite an interesting avenue and worth a shot; my intuition also says that spatial capabilities will improve if the model has access to more relative info and doesn't need to infer it from absolute coordinates


Given that vector space over text is more a space of semantic distance than of geometric distance between objects, the two intuitively feel different in nature, since words are not at all likely to be represented in similar ratios of distances.

I think a tokenization of ratios between perceived boundaries might help. But, I’m just shooting in the dark.


You're conflating vectors with only their use for semantic meaning. Vectors are just spatial relationships; in the case of objects in Factorio we could provide, for every single object, the vectors describing how it relates to every other object in literal 2D space. This would essentially give the LLM a complete relationship mapping, since it is not able to build one by "seeing" a picture or from absolute coordinates alone.


Yeah, but that's a biased approximation: it comes at the cost of assuming an equivalence rather than deriving it, and truth doesn't distill to equivalence in ratio. You'd have to treat tokens as having some universal distance of one unit to approximate some unit of measurement along hwd/magnitude.

Overall visual perception is about noticing comparative differences not measuring absolute quantity.


Currently it's a text-only-modality environment, but we are planning to support vision in the future. We did run a couple of tests and saw that including screenshots of the game state did not improve performance with off-the-shelf models. As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because VLMs currently aren't good at spatial reasoning over highly detailed images; likely this would improve significantly with finetuning

Good point with MCP as well given it has been blowing up lately, we'll look into that!


That makes sense and it's really interesting. It is a challenging visual test for sure: thousands of entities, and either multi-tier visual representations (screen, map, overview map) or a GIANT high-res image. I hereby propose FLE-V, a subset benchmark for visual models where they just turn a Factorio image into a proper FLE description. And maybe the overview and map images as well.


Such research could have hundreds of billions of dollars in downstream GDP implications when applied to real industrial settings.


Not to mention the increased productivity of everyone no longer wasting their time in Factorio (myself included) because the optimal solution is known.


Not wasted time, you were doing research it seems.


Good point. My wife will surely understand if I explain it as “research”


Well I better get training!


> As the complexity of the game state grew and the screenshots were filled with more entities, the models got even more confused and started hallucinating directions, entities etc or weren't capable of troubleshooting factories with apparent mistakes (i.e missing transport belt, wrongly rotated inserter). We think it's because [...]

I think you just described a research paper that would advance SOTA. Less describing why, more describing how. (Assuming it's not just "we finetuned the model and it worked perfectly".)


Sounds almost like a visual "needle in a haystack" type of work, that could be quite interesting!


Where's Waldo test for VLMs


We were thinking of creating a minigame resembling a "tower-defense" setting, where waves of bugs get released and the agent needs to create appropriate defenses. It would be interesting to see if agents are capable of defending the base, and how many resources they would put towards defenses in a normal game where enemies are enabled


Just to jump in here as one of the authors

We designed the API to be as spatially descriptive as possible (game state descriptions include x-y coordinates and neighbors), and the agents have tools to aid them in carrying out actions that would benefit from vision (e.g. finding buildable areas of different sizes on the map, placing entities next to other entities, etc.).
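Very roughly, such a description reads like the sketch below (the field names and wording are my own for illustration, not the actual FLE observation format):

```python
def describe(entity, neighbors):
    # Render one entity plus its neighbors as spatial text for the agent.
    # Dict keys ("name", "position", "direction") are assumed for this sketch.
    x, y = entity["position"]
    lines = [f"{entity['name']} at ({x}, {y}), facing {entity['direction']}"]
    compass = {(1, 0): "east", (-1, 0): "west", (0, -1): "north", (0, 1): "south"}
    for n in neighbors:
        nx, ny = n["position"]
        delta = (nx - x, ny - y)
        where = f"to the {compass[delta]}" if delta in compass else f"at offset {delta}"
        lines.append(f"  neighbor: {n['name']} {where}")
    return "\n".join(lines)

chest = {"name": "wooden-chest", "position": (3, 2), "direction": "north"}
inserter = {"name": "burner-inserter", "position": (4, 2), "direction": "west"}
print(describe(chest, [inserter]))
```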

As Jack said, we completed most of the lab tasks manually ourselves, and while it took us a lot longer than it would have with vision, the tasks were still doable, and human performance is significantly higher than current agents'. We are thinking of supporting vision for future evals, but from a small number of tests we ran, current models got even more confused as the number of entities on the map grew. This is likely due to VLMs being notoriously bad at visual reasoning on images with lots of detail, and in a game where one misplaced entity in a large factory breaks everything, the errors start to compound


Someone below mentioned the ASCII interface of Dwarf Fortress as being ideal for this, and I wonder if that kind of representation with a legend might produce spatially better results. The drawback I see is that elements can be layered on a tile in Factorio, or have properties that are not visually obvious in ASCII, so the LLM would need to be able to introspect on the map.


I think your intuition is correct about the amount of information that needs to be encoded into an ASCII char. You could potentially use Unicode to pack more into each char, e.g. direction, type, status, etc. Or make each representation available on demand, e.g. 'show me the direction of all inserters in a 10-tile radius'.
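A toy version of that packing (the entity list and codepoint range are my own choices; the Private Use Area just guarantees the codes can't collide with real text):

```python
# Toy codebook sketch: one character per (entity type, direction) pair.
ENTITY_TYPES = ["transport-belt", "inserter", "assembling-machine", "chest"]
DIRECTIONS = ["north", "east", "south", "west"]
BASE = 0xE000  # start of a Unicode Private Use Area

def encode(entity_type, direction):
    idx = ENTITY_TYPES.index(entity_type) * len(DIRECTIONS) + DIRECTIONS.index(direction)
    return chr(BASE + idx)

def decode(ch):
    idx = ord(ch) - BASE
    return ENTITY_TYPES[idx // len(DIRECTIONS)], DIRECTIONS[idx % len(DIRECTIONS)]

# The codebook handed to the agent would simply list every char and its meaning.
assert decode(encode("inserter", "east")) == ("inserter", "east")
```

Status bits (working/blocked, underground flag, etc.) would multiply the codebook size the same way the directions do.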


Well we learned last month on HN that you can encode arbitrary data into Unicode; anecdotally, o3-mini-high at least could decode it if given instructions.

I wonder what a quick way to calculate how many Unicode characters you'd need is. I guess every entity + four orientations. Underground belts and pipes seem tough, but I guess you could just add an encoding showing whether the square has an underground pipe or belt.

I propose this would work. I think I'll give it a try today; I'd love Dwarf Fortress Factorio. That said, the encode/decode phase seems like a lot of tokens for a model that's not trained to understand the Unicode 'map'. Seems like you'd want to fine-tune something at least. Maybe a layout model.


Check out the 'ObserveAll' tool in the repo; it's deprecated now, but it pipes all the raw entities on the map back to the agent. You could procedurally convert them to a Unicode format given a pre-defined codebook (which you give to the agent) before letting the agent observe and reason over it.

