The Apollo3's MSPI interface does support both "XIP" (with cache) and "XIPMM" (without), but as far as I can tell that's not useful for low-power computing—the NAND Flash chip is going to use an order of magnitude more power than the CPU. The (non-SPI!) NAND I'm going to attempt to use uses only 18μW in standby (datasheet figure, not my measurement), but with XIP you can't predict or even really measure how much you're accessing it, and therefore how much power you're using as a result. The chip uses 27mW in active mode, so to hit the 1-milliwatt power budget the NAND's duty cycle needs to be on the order of 1%. Same story with external RAM chips: they all promise to use a lot more power than the Apollo3 itself. The lowest-power way to increase the Apollo3's 384KiB of SRAM seems to be to plop another Apollo3 on the board next to it.
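Back-of-the-envelope, using those datasheet figures (the only thing I'm adding here is the arithmetic, not new measurements), the break-even duty cycle works out like this:

    /* Max NAND duty cycle d such that the NAND alone fits the budget:
       d*active + (1-d)*standby == budget. Figures are the datasheet
       numbers quoted above, not measurements. */
    #include <stdio.h>

    int main(void) {
        double budget_w  = 1e-3;   /* whole-system budget: 1 mW */
        double active_w  = 27e-3;  /* NAND active: 27 mW */
        double standby_w = 18e-6;  /* NAND standby: 18 uW */
        double d = (budget_w - standby_w) / (active_w - standby_w);
        printf("max NAND duty cycle: %.1f%%\n", d * 100);  /* ~3.6% */
        return 0;
    }

And that 3.6% ceiling is before the CPU and the display take their share of the same milliwatt, which is how it ends up "on the order of 1%".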
This is a riveting and fantastic thread. If I extrapolate your trajectory, very interesting things are ahead.
I could see myself doing a lot of work with a couple of z80/6502-based systems, as long as they were maxed out in terms of memory and had decent peripheral support .. put a network device in the mix, and it offers plenty of opportunities to run a Strange New Operating System. I would happily run a 21st Century CP/M to read email, watch sensors, drive the ship .. if, say, it had multi-processor/network support and there were somehow 1024 z80's in my wrist-watch/book/headset/nav station, all cooperating at low power to do bigger things.
384k is enough for everyone.
What I actually wonder now is what it would physically look like to have a fully functioning z80 in silicon, from sunshine to user display, in a single package. I bet that could be mighty small.
Scale this into an energy-friendly form, and we have solar-powered computing at hand.
(Edit: I'm also a grey-beard, have kept every system I've ever worked on/written software for, for 50 years. My living room is a retro-computing museum... My motto is "computers don't get old - their users do" .. so the utility of very low-power computing devices is entirely relevant to my interests..)
Yeah, I feel like, historically, the point where self-hosted development becomes feasible is roughly a Z80 (8500 transistors, basically the same size as the 21-bit MuP21) with CP/M and 48KiB or 64KiB of RAM.
—⁂—
Physically I think the user display is likely to be much larger than the CPU, the RAM, the Flash, or the power supply capacitor. The size of the solar panel might be larger still; in direct sunlight you can get 1 milliwatt from 4.5mm² of 22%-efficient solar cell. Possibly a glasses-mounted display with the appropriate optics to focus onto your retina would allow you to use a display smaller than that, which could also reduce its power consumption. The SHARP LS027B7DH01 400×240 memory-in-pixel LCD I want to use (two of) consumes 50μW just to maintain its lovely high-contrast display (according to the datasheet), and nominally 175μW to flip every pixel on the display at 20Hz, the maximum datasheet speed. Nicolas Magnier was able to get 60Hz out of his: https://www.youtube.com/watch?v=zzJjE1VPKjI but we can extrapolate that this requires an additional 250μW. (Which I also still haven't measured.)
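For reference, the 4.5mm² falls out of assuming roughly 1000W/m² of direct sunlight; that irradiance figure is my assumption, while the 22% efficiency is from above:

    /* Solar cell area needed for 1 mW, assuming direct sunlight at
       ~1000 W/m^2, i.e. 1 mW/mm^2 (the irradiance is an assumption). */
    #include <stdio.h>

    int main(void) {
        double irradiance_mw_per_mm2 = 1.0;  /* ~1000 W/m^2 */
        double efficiency = 0.22;            /* 22%-efficient cell */
        double target_mw  = 1.0;             /* 1 mW budget */
        printf("cell area: %.1f mm^2\n",
               target_mw / (irradiance_mw_per_mm2 * efficiency)); /* ~4.5 */
        return 0;
    }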
But, without head-mounted optics, I think these screens are too small for a comfortable development environment. You can fit 8 lines of text in a 12-point font on one, with a few words per line. My current working hypothesis is that I'll be able to live with two of them if I use reading glasses and hold the screen close to my face.
These memory-in-pixel LCDs use some power to retain the screen image, unlike e-ink, but much less power than e-ink to update it. I don't have even datasheet numbers for e-ink displays, but the crossover point seems to be about three screen refreshes per hour: update more often than that and e-ink uses more power. So, for interactive computing, the memory LCDs should use several orders of magnitude less energy.
But they use proportionally more energy when they're larger. The discontinued 6-inch version https://www.youtube.com/shorts/snXYogDEseA reportedly used 24 milliwatts for a 30fps movie.
An audio interface would be another alternative. AirPods and in-ear hearing aids pack quite a bit of processing power already.
—⁂—
As for a Z80 with CP/M, although it's self-sufficient, it's only marginally so: you can run Turbo Pascal on it, but Anders Hejlsberg, Philippe Kahn, and the others had to write Turbo Pascal in assembly (https://www.latimes.com/archives/la-xpm-1988-01-21-fi-37556-...). Similarly, CP/M (or CP/Mish) can build CP/M, but that's only because it's written in assembly. Some of this is due to deficiencies in the Z80 instruction set which make it ill-suited for high-level languages. Probably the slowness and smallness of floppy disks was also a factor; the S34MS01G2 chips I have here are nominally 133 megabytes per second with 25μs random "seek" time, while floppy disks were more like 0.001 megabytes per second and 1234567μs random seek time. I'm hoping this means that "swapping" from Flash does a better job of providing the illusion of larger memory than loading WordStar's print overlay from a floppy did.
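To put the comparison in perspective, here's the effective throughput for small random reads implied by those numbers; the 2KiB page size is my assumption about the S34MS01G2, so treat this as a sketch:

    /* Effective small-random-read throughput: page / (seek + transfer).
       Bandwidth and seek figures are the ones quoted above; the 2 KiB
       page size is an assumption, not checked against the datasheet. */
    #include <stdio.h>

    static double eff_Bps(double page_B, double seek_s, double bw_Bps) {
        return page_B / (seek_s + page_B / bw_Bps);
    }

    int main(void) {
        double page = 2048.0;
        double nand   = eff_Bps(page, 25e-6, 133e6);   /* ~51 MB/s */
        double floppy = eff_Bps(page, 1.234567, 1e3);  /* ~0.6 KB/s */
        printf("NAND:   %.1f MB/s\n", nand / 1e6);
        printf("floppy: %.3f KB/s\n", floppy / 1e3);
        printf("ratio:  %.0fx\n", nand / floppy);      /* ~80,000x */
        return 0;
    }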
Also, it should help a lot that the Apollo3's Cortex-M4F provides 25 Dhrystone MIPS (at 20MHz to not blow the 1mW power budget) rather than 0.052 Dhrystone MIPS like a 4MHz Z80. So you can push the time/memory tradeoffs waay over to the side of saving memory. And you have 1MiB of NOR Flash on-chip as well.
1024 Z80s would be only 8.5 million transistors, in the neighborhood of an Alpha 21164 or a Pentium II. But 64KiB of 6T SRAM is π million transistors all by itself, and 64KiB of DRAM is half a mebitransistor, plus half a mebicapacitor. So if you want 64KiB on each of those Z80s, you need closer to a billion transistors, like a SPARC T3 or an Opteron 2400. (The Apple A17 chip fabbed in 3nm is 19 billion transistors and 103.8mm², according to https://en.wikipedia.org/wiki/Transistor_count, so we could extrapolate that a billion transistors would be about 5mm², which would easily fit into a wrist-watch.) At this point, though, it might seem appealing to use something like the 27000-transistor ARM2 for your processing elements rather than something like the Z80.
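Spelled out, using the rough counts above (nothing authoritative):

    /* Transistor-count arithmetic for 1024 Z80s plus 64 KiB each. */
    #include <stdio.h>

    int main(void) {
        long long z80 = 8500, n = 1024;
        long long sram64k = 64LL * 1024 * 8 * 6; /* 6T cells: 3,145,728 */
        long long dram64k = 64LL * 1024 * 8;     /* 1T1C cells: 524,288 */
        printf("bare Z80s: %lld\n", n * z80);              /* ~8.7e6  */
        printf("with SRAM: %lld\n", n * (z80 + sram64k));  /* ~3.2e9  */
        printf("with DRAM: %lld\n", n * (z80 + dram64k));  /* ~0.55e9 */
        /* area, extrapolated from the A17's 19e9 in 103.8 mm^2 */
        printf("1e9 transistors: %.1f mm^2\n", 103.8 / 19.0); /* ~5.5 */
        return 0;
    }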
Actual Z80s (the kind Zilog discontinued last year, which I believe was CMOS rather than NMOS) are pretty energy-hungry, using hundreds of milliwatts, if we trust the datasheet. But that's presumably because they're fabbed in a large process node with boatloads of gate capacitance, rather than because they switch a lot of transistors. So I think you get lower energy consumption with more recent Z80 clones like the ones in the S1 MP3 players or the TI-84+CE pocket calculator.
—⁂—
I suspect that you can spend fewer picojoules per computron by using bigger CPUs like ARM, for a variety of reasons. You decode fewer instructions to do a given task, and I believe that setting a register bit to 0 that was already 0, or to 1 that was already 1, doesn't use extra power, so in a sense the wider registers and ALU should be almost free from a power perspective. Also, I would expect that specialized hardware such as the integer multiplier or the barrel shifter burns less energy to do what it does than doing the same thing through a sequence of steps using things like an adder or a 1-bit shift. You can take these principles further with SSE- or NEON-style SIMD instructions or GPU-style SIMT, and with additional specialized logic for things like floating point, LZW compression, AES encryption, etc. Such logic won't use any power if you power it down when you're not using it.
On the GA144, Chuck Moore claims he got a lot of efficiency mileage out of asynchronous logic, perhaps mostly because synchronous CPUs these days have to devote a lot of brute force to keeping clock skew down. I don't think this is as big a factor as the Apollo3's subthreshold logic, which, if we believe their datasheet, allows it to do 20 MHz and 25 DMIPS at 500μW, working out to 20pJ per Dhrystone "instruction".
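The 20pJ figure is just the quotient of those two datasheet numbers:

    /* Energy per Dhrystone "instruction" at the claimed operating point. */
    #include <stdio.h>

    int main(void) {
        double power_w = 500e-6;  /* 500 uW at 20 MHz, per the datasheet */
        double dmips   = 25e6;    /* 25 Dhrystone MIPS */
        printf("%.0f pJ/instruction\n", power_w / dmips * 1e12); /* 20 */
        return 0;
    }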
> as far as I can tell that's not useful for low-power computing—the NAND Flash chip is going to use an order of magnitude more power than the CPU
Agreed that the NAND can consume a ton more than the CPU, so the duty cycle has to be kept low. There are a few places where XIP NAND excels: it's big, it's cheap, and for large reads it can saturate the XIP memory bus just like NOR; it's a great place to store bitmap graphics. One downside is that the random-access latency is pretty terrible.
> with XIP you can't predict or even really measure how much you're accessing it
There are a couple incomplete options here:
Just for measuring, you can fence off the XIP address range with the MPU so that accesses generate access violations, count the faults, and work out a duty cycle from that (rough sketch below).
The cache has performance counters, but at the cache level they don't tell you anything about internal flash vs XIP flash.
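Here's a very rough sketch of the fencing idea on a Cortex-M4, in CMSIS terms. The XIP aperture address and region size are placeholders, and re-arming the fence after each access (so you count more than first touches) is left to something like a periodic timer:

    /* Sketch: count XIP accesses by making the XIP aperture fault.
       XIP_BASE/XIP_SIZE_LOG2 are placeholders; check the memory map. */
    #include <stdint.h>
    #include "core_cm4.h"           /* CMSIS, via the vendor headers */

    #define XIP_BASE       0x52000000u  /* placeholder aperture base */
    #define XIP_SIZE_LOG2  24u          /* placeholder: 16 MiB region */

    volatile uint32_t xip_faults;

    static void xip_fence(int enable) {
        MPU->RNR  = 0;                  /* use MPU region 0 */
        MPU->RBAR = XIP_BASE;           /* must be size-aligned */
        MPU->RASR = ((XIP_SIZE_LOG2 - 1) << MPU_RASR_SIZE_Pos)
                  | (enable ? 0 : (3u << MPU_RASR_AP_Pos)) /* none/RW */
                  | MPU_RASR_ENABLE_Msk;
        MPU->CTRL = MPU_CTRL_ENABLE_Msk | MPU_CTRL_PRIVDEFENA_Msk;
        SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;  /* MemManage faults */
        __DSB(); __ISB();
    }

    void MemManage_Handler(void) {
        xip_faults++;    /* one fenced XIP access observed */
        xip_fence(0);    /* drop the fence so the access can retry; */
    }                    /* re-arm it later, e.g. from a timer tick */

Counting faults per re-arm interval gives you a crude access-rate number to turn into a duty cycle.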
> The (non-SPI!) NAND I'm going to attempt to use only uses 18μW in standby
There are similar low-standby QSPI parts available (10μA @ 1.8V typical, which works out to the same 18μW), like the W25N01GV.