Hey! Your Google Benchmark post was one of my go-to resources when I started picking that up a couple years ago. I love to see the focus on performance benchmarking here, and the repository is laid out well. Nice work!
It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea. I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.
I like that the framework design tests different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control-intensive, and even pro players have a high chance of screwing it up when they attempt it. It also doesn't seem to give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.
Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.
One thing we've been talking about is creating tasks that are a bit more 'tower defence', where biters are released every X steps / seconds. The idea would be to test agents in building a military-industrial complex. One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)
Regarding layout optimisation benchmarks, we actually discussed this yesterday. I think we need 2 types of layout task: 1) fix this subtly broken factory, and 2) improve the throughput of this factory. These should be straightforward to implement, if you'd like to have a look.
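To make the two task types a bit more concrete, here is a rough sketch of how they might be scored. This is not FLE's actual API: `LayoutTask`, `LayoutTaskKind`, `score`, and the items-per-minute throughput numbers are all hypothetical names and units chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class LayoutTaskKind(Enum):
    """The two hypothetical layout task types discussed above."""
    REPAIR = "repair"    # 1) fix this subtly broken factory
    IMPROVE = "improve"  # 2) improve the throughput of this factory


@dataclass(frozen=True)
class LayoutTask:
    kind: LayoutTaskKind
    # REPAIR: throughput of the intact design (items/min).
    # IMPROVE: throughput of the factory as handed to the agent (items/min).
    reference_throughput: float


def score(task: LayoutTask, measured_throughput: float) -> float:
    """Score an agent's result against the task's reference throughput.

    REPAIR returns the fraction of the intact throughput recovered, capped at 1.0.
    IMPROVE returns the relative gain over the baseline, floored at 0.0.
    """
    if task.reference_throughput <= 0:
        raise ValueError("reference_throughput must be positive")
    ratio = measured_throughput / task.reference_throughput
    if task.kind is LayoutTaskKind.REPAIR:
        return min(ratio, 1.0)
    return max(ratio - 1.0, 0.0)
```

Scoring the two kinds differently keeps the debugging task bounded (you cannot do better than "fixed"), while the improvement task stays open-ended.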
>One amusing issue we had in developing this idea is that frontier models have an aversion to creating entities called 'GunTurret' etc - as it goes against their constitution! (perhaps we should rename turrets to 'SuperSoaker' or something)
This sounds like a great idea for a short story in the style of Malak by Peter Watts. Imagine a future warfighter AI that has been fitted with a set of filters to make it think it's really having a pillowfight or building a factory to make screws while it's actually tearing people apart or optimizing a military production line.
There was a Black Mirror episode about this too, I seem to remember! Soldiers imagining they were fighting monsters - while actually committing war crimes.
This was the central plot twist of "Spec Ops: The Line", a video game from 2012 that started out like your typical Call of Duty clone shooter and escalated to an interesting if a bit twisted look at how PTSD affects soldiers.
Love the suggestion, I'll clone it down and start poking around.
I believe your intuition about layout experiments needing to be of different genres is correct. I think you could have a pretty wide range of debugging opportunities (imbalanced belts, inserters fighting for items, insufficient power at full load leading to throughput loss, etc) for the first. The second feels like it would be nicely encapsulated by focusing on optimizing for ratios, although seeing an agent realize that they can get more throughput by simply copy/pasting a block and upgrading a belt would be pretty wild (depending on the recipe, of course). Maybe nuclear power / heat exchanger ratios are a bit too far down the path, but optimizing for copper cable use in green circuits is pretty important and fairly early in the tech tree?
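For what "optimizing for ratios" could look like as a checkable target, here is a minimal sketch of the arithmetic. The function name `assembler_ratio` and the recipe numbers in the usage note are placeholders rather than real game values; actual output counts and craft times would need to be taken from the recipes in your game version.

```python
from fractions import Fraction


def assembler_ratio(
    output_per_craft: int,
    producer_craft_seconds: Fraction,
    input_per_craft: int,
    consumer_craft_seconds: Fraction,
) -> tuple[int, int]:
    """Return (producers, consumers) in lowest terms so supply matches demand.

    Assumes every machine runs at the same crafting speed and that
    belts and inserters never bottleneck the intermediate item.
    """
    # Items per second made by one producer machine.
    supply = Fraction(output_per_craft) / producer_craft_seconds
    # Items per second used by one consumer machine.
    demand = Fraction(input_per_craft) / consumer_craft_seconds
    # Producers needed per consumer, reduced automatically by Fraction.
    ratio = demand / supply
    return ratio.numerator, ratio.denominator
```

With placeholder numbers (a producer that makes 2 items per 1 s craft, a consumer that uses 3 items per 1 s craft), `assembler_ratio(2, Fraction(1), 3, Fraction(1))` returns `(3, 2)`: three producers for every two consumers. A benchmark could compare the ratio an agent actually builds against this target.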
True - although it might be interesting to benchmark them both, as (1) is more about debugging (something that these agents spend a lot of time doing).
It sounds like your nephew has a project in mind. Start there with the basic dependencies and that will start laying out a competency roadmap which looks a lot like a curriculum.
Quick aside: Automate small/mid business manufacturing? Admirable, but will probably choke on the scale problem, so I think the journey will be vastly more interesting than the destination...which is good! Turns out there's a lot of robotics that can be broadly applied.
I fell into industrial work right out of undergrad as an EE, not intending to work in the rust belt or manufacturing or anything of the like. I erroneously assumed it was not important, not sexy, not interesting. How wrong that was.
How things are made is so important, not only for our society but also as learning experiences for engineers, planners, logisticians, and more. As a career roboticist, the time I spent in the manufacturing industry seems invaluable to me now.
The problem is not that it's 'not important' or 'not sexy', and the attitude that this stuff is somehow underappreciated is so tiresome. It's endlessly fetishized in mainstream media, by politicians, and so on, in a reverse-snobbism anti-intellectual bent, designed to appeal to the insecurities of the "lower" class American.
The problem is that industrial manufacturing employers treat their workers as poorly as they possibly can, and these days, companies actually seek to treat workers so poorly they don't stay - to keep them from qualifying for expensive benefits.
We have one of the highest productivity per person-hour rates in the industrialized world and you'd never think it if you spent even two days working in a warehouse or manufacturing plant. No amount of productivity is ever enough - if you do your job well, you're just shoved more work instead of being rewarded for your effort.
I wonder if a mod that changes the graphics to a visual style shown in the link (I'm thinking the Carbot graphics [0] swap for StarCraft 2 in terms of scope) would make it feel more to your liking.
The visualizations are so similar to integrated circuit layouts; they immediately reminded me of some of the coasters that GamersNexus sell which represent simplified computer subsystems.