How did you do it? Hopefully you generated thousands of such arrays and measured the cost of searching all of them as a single iteration? Because depending on what you use to measure time, the overhead of reading the timer would likely dwarf the cost of searching such a small array. The best case is a dedicated cycle counter instruction/register (e.g. rdtsc), and even that can cost tens of cycles. Cache hits cost less than a cycle, so if your timing code wrapped only a small number of searches, you essentially didn't measure the search at all.
That aside, your findings point to the prefetcher having identified your linear searches as sequential access, so practically every single access was a cache hit (you effectively measured the latency between a CPU and its L1 cache). If you wanted to test this, you could do something like make each element some number of cache lines wide. Stream prefetchers have a maximum stride, so variables more than that many bytes apart won't be prefetched. Google's multichase benchmark uses 256 bytes, IIRC.
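For example, something like this (a sketch; the 512-byte stride is just an arbitrary value above the assumed 256-byte maximum):

    /* Sketch: space elements out so a stream prefetcher with a limited
     * maximum stride cannot run ahead of the search. */
    #include <stdint.h>

    #define STRIDE 512  /* assumed > max prefetch stride (e.g. 256 bytes) */

    struct padded_elem {
        int32_t value;
        char pad[STRIDE - sizeof(int32_t)];  /* push neighbours far apart */
    };

    int linear_search_padded(const struct padded_elem *a, int n, int32_t key)
    {
        for (int i = 0; i < n; i++)
            if (a[i].value == key)
                return i;
        return -1;
    }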
I often benchmark my DIY data structures, as well as check the assembly on godbolt. I think I wrote the code using an ML model (they save a lot of time), and then fixed the issues and extended it myself. I have the code here: https://pastebin.com/GzY4HMsZ
I do the search many (thousands of) times and calculate the total time spent, so the timer is called only twice. I search the same array, but for a different number each time, so the array should stay entirely within the L1 cache. It's also important to use the result of the search, otherwise the compiler simply omits the code.
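Roughly like this (a simplified sketch of the harness, not the exact pastebin code):

    /* Minimal sketch: one clock_gettime() pair around thousands of
     * searches; the results go into a volatile sink so the compiler
     * cannot delete the work. */
    #include <time.h>

    #define ITERS 200000

    int linear_search(const int *a, int n, int key);  /* code under test */

    double bench_ns_per_search(const int *a, int n, const int *keys)
    {
        struct timespec t0, t1;
        volatile int sink = 0;  /* forces the result to be used */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            sink += linear_search(a, n, keys[i % n]);  /* new key each time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / ITERS;
    }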
I use clock_gettime(), which takes on the order of 1 µs. The timestamp counter (which is faster) is not available to me because every CPU core has its own counter and they show different values, so Linux refuses to use it. The first core is 600 ms ahead of the others. It amazes me that in 2025 we still cannot make a boring counter work. Or maybe this is intentional, to prevent cheaper consumer hardware from being used for professional purposes.
Here are some numbers:
Array size 8, 200 000 iterations, time in ns/search: linear 8 ns, linear no-branch <1 ns (I wonder if it is an error), binary 11 ns, binary no-branch 7 ns.
Array size 128, 100 000 iterations: linear 34 ns, linear no-branch 22 ns, binary 35 ns, binary no-branch 16 ns.
As you can see, branches are what ruin the performance here. Every time the CPU mispredicts a branch, it wastes on the order of 10-20 cycles refilling the pipeline.
Looking through the disassembly on godbolt, I see the compiler inlined and vectorized the measurement loop for the no-branch code, which might explain the numbers. The SIMD assembly is hard to read and I don't quite understand what it does: https://godbolt.org/z/dofKnT3W3
I'll be happy if you point out any mistakes in my benchmark. If you want to compile the code, I used gcc with "-O3" and maybe "-march=native".
Interesting, thanks for sharing. I actually assumed ?: desugared to an if-else and never checked. I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?
You could circumvent the counter issue by pinning to a specific core; otherwise you'd have to use scheduler events (ftrace/perf on Linux) to know which CPU you were running on and when, so you could subtract the right counters (I've had to do this before and it isn't pretty). Pinning would also prevent issues where your task gets preempted, moved to another core, and has to warm the caches again. On Linux, options like isolcpus and nohz_full ensure you're not measuring the scheduler/other tasks, but they have to be set at boot. That said, if your loops between clock_gettime calls take substantially longer than clock_gettime itself, then at least the timer overhead is probably not that important.
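Pinning itself is just a few lines on Linux, e.g. (a sketch):

    /* Sketch: pin the calling thread to one core so per-core TSCs and
     * post-migration cold caches stop being a problem. Linux-specific. */
    #define _GNU_SOURCE
    #include <sched.h>

    int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this thread */
    }

    /* call e.g. pin_to_core(2) once, before the timed loop */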
As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.
My SIMD knowledge only extends as far as which compiler options make my code faster.
No shame in using LLMs, I use them extensively, but I find I have to write some code myself because I make noticeably more mistakes coding if I've let the LLM do everything for a couple of weeks.
> I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?
The point of the "ifs" there is to replace the variable "size" with a constant like 8 or 16, so that the compiler can hardcode it and maybe generate more efficient code (for example, instead of counting iterations, the compiler can just repeat the comparison 8 times when the iteration count is 8). I expect the compiler to generate customized versions of "linear_search" instead of using a generic version with an unknown number of iterations. You can see an example in the godbolt assembly at line 825, where the compiler unrolled the loop for "linear_search" with size = 8 - there is no loop, just 8 comparisons.
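The pattern is roughly this (a simplified sketch, not the actual pastebin code):

    /* Sketch: dispatch on size so each call site passes a compile-time
     * constant, letting the compiler fully unroll linear_search. */
    static inline int linear_search(const int *a, int n, int key)
    {
        for (int i = 0; i < n; i++)
            if (a[i] == key)
                return i;
        return -1;
    }

    int search_dispatch(const int *a, int n, int key)
    {
        if (n == 8)  return linear_search(a, 8, key);   /* unrolled */
        if (n == 16) return linear_search(a, 16, key);  /* unrolled */
        return linear_search(a, n, key);                /* generic loop */
    }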
> As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.
I don't think so. In binary search there is no predictable pattern, so the predictor would miss in about 50% of cases. And in linear search there is a guaranteed miss on the last iteration.
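For context, a typical branchless binary search replaces the branch with a conditional move, roughly like this (a generic sketch, not necessarily identical to my pastebin code):

    /* Sketch: the comparison feeds a base-pointer adjustment (usually
     * compiled to cmov) instead of a branch, so there is nothing for
     * the predictor to miss. Assumes n > 0 and a sorted array. */
    int branchless_search(const int *a, int n, int key)
    {
        const int *base = a;
        while (n > 1) {
            int half = n / 2;
            base = (base[half] <= key) ? base + half : base;  /* cmov */
            n -= half;
        }
        return (*base == key) ? (int)(base - a) : -1;
    }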
> You could circumvent the counter issue by pinning to a specific core,
I think I'll try this in future.
> but I find I have to write some code myself
I can write it myself, but it would take 20-30 minutes, including time to find and read the docs for the functions, while the LLM takes less than a minute. But I still have to edit the result a lot; for example, the LLM "forgot" that you need to use the result of the search to prevent the compiler from throwing the code away. Also, the free version of ChatGPT became much dumber a couple of months ago.
Fair point that binary search causes a lot of branch mispredictions. Did you ever measure whether the unrolled linear search with 8 elements outperforms a "rolled" one? With instruction-level parallelism, I wonder how much difference removing 8 mostly easily predicted branches (in the linear search) makes.
I didn't try it, but the loop contains a branch, so there could be a misprediction that doesn't happen with the unrolled loop. Also, instruction-level parallelism applies to unrolled code as well.
I'm guessing you don't really do this for users, since the response should be 401 for any user you aren't logged in as? I would argue that even for IDs that don't exist, you should get the same error whether they don't exist or you just aren't authorised to see them. It's been a few years since I worked in web, but I think that's what I would have done; GitHub does something similar for private repos.
Aside from its well-known features, such as displaying images directly in the terminal over ssh, I use it to create TUI applications. The application is a saved Kitty session, with a defined arrangement of windows. Each window runs a specified program, and communicates with the other windows over a Unix socket. Kitty has a convenient tool to create these sessions. Once created, I can start the session-application like any other program. The sessions are defined in a text file, so I can edit it to adjust the window arrangement or other details.
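A session file is just a list of commands, for example (a made-up minimal session; the titles and programs are only illustrative):

    # started with: kitty --session my-tui.session
    new_tab main
    layout splits
    launch --title editor nvim
    launch --location=vsplit --title logs tail -f /tmp/app.log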
I also use its shell integration features, such as putting the scrollback into a pager, constantly.
to answer that question we need to first look at the alternatives. where else could session management go? and then we can consider the benefits and drawbacks of each approach.
if you are thinking of tmux then the problem here is that tmux is in itself a terminal.
to get session management away from the terminal, it would need to be done in such a way that when the session tool connects to the session, it merely acts as a proxy or less, but does not interpret and then translate the signals that come from the session like tmux does.
this is not trivial, at least with the current way terminals work.
I'm not sure I understand the question. Some people use similar kinds of sessions in terminal multiplexers such as tmux (is that what you have in mind?), but this leads to problems (interpretation of key sequences, etc.) that the Kitty solution sidesteps.
Like Kitty, Konsole has image preview, clickable files and links, tabs, and tiles. What Konsole does not have: a way of changing the layout of tiles, equivalents to the Kitty shell and remote control, the pager, and the ssh integration (e.g. run ls in an ssh session and Konsole will not open the files).
Considering they knew these failure modes in advance, would it not have been prudent to put some sort of self-righting mechanism on the lander? Something like the mechanical arms you see in Robot Wars.
Mechanisms weigh a significant amount, long mechanisms like arms even more so. The more mass you spend on contingencies the less science you bring and the less valuable the mission is, even with a flawless touchdown.
If you can't land reliably, then a lot of the science is wasted. On the other hand, once you reliably solve the landing issue (which was one of the main objectives here, and where they made significant progress in that they successfully deployed the two mini rovers anyway), you can add on as much science as you want.
Also, I've said it before and will say it again: the moon is not far away. Unlike Mars and other celestial bodies, where we have to time launches around orbital positions, gravitational slingshots and such, the moon is really close by, and we should be lofting stuff onto it on a monthly basis.
Neat idea, I'm not sure how much force those jets produce, maybe it could be enough? It might be risky to fire up jets after a botched landing though. If the nozzle has any material in it I'd be worried about blowing a hole in the lander!
What is your opinion of code coverage requirements now? I have been in a "phase" of seeing them as "code quality theatre". Considering that a function which takes a single 8-bit integer as an argument already has 256 unique inputs, and may bug out on only 1-2 of them, 100% statement coverage can be very misleading. A typical function has billions or trillions of unique inputs, so 100% statement coverage could be very nearly 0% state-space coverage. I'm 5y into my career (but 15y into programming) and aware that my opinions will change and develop as I progress. This one has been stable for a while though.
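To make that concrete, here is a contrived example (made up for illustration):

    #include <stdint.h>

    /* Meant to be a saturating increment, but it wraps instead. It
     * misbehaves on exactly 1 of its 256 possible inputs (x == 255),
     * yet any single test input yields 100% statement coverage. */
    uint8_t increment_saturating(uint8_t x)
    {
        return x + 1;  /* BUG: should return 255 when x == 255 */
    }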
Coverage is necessary but not sufficient; your tests also need to be good and test the right things. What's your proposed alternative - that we don't even try because our tests aren't guaranteed to be perfect?
Well, almost. I use a lot of assertions to check the obvious (value out of range, null pointer, etc.) and test the happy path(s) to prove the code at least works in the cases I can anticipate. I add unit tests for complex algorithms, and to prove a reported bug is what I think it is and that it has been fixed. Otherwise, I think using unit tests to find bugs is mostly busywork.
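For example, the style I mean looks something like this (a sketch):

    #include <assert.h>
    #include <stddef.h>

    /* Sketch: assertions catch the obvious precondition violations;
     * a happy-path test then proves the function works at all. */
    int find_index(const int *a, size_t n, int key)
    {
        assert(a != NULL);  /* null pointer */
        assert(n <= 1024);  /* size outside anything I ever expect */
        for (size_t i = 0; i < n; i++)
            if (a[i] == key)
                return (int)i;
        return -1;
    }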
For even a fairly trivial piece of code, the search space for bugs can be vast or even infinite. Writing unit tests to find bugs within that space is like throwing darts at an infinitely large wall and trying to hit an unknown number of invisible targets. You can only write tests for the potential bugs you anticipate - if you could anticipate a bug, you wouldn't write it, right? You end up with dozens or hundreds of tests that probably never failed, except when you have to change something. Such was my experience when I tried to maintain high code coverage. When I switched to writing assertions and acceptance tests, my rate of bug reports did not change, and I was more agile.
I think the best way to achieve this is an OS API that results in the files always being created in the same place. Applications/libraries could still choose their own filenames and syntax; just the location would be OS-controlled. I think there is room for a new desktop/laptop OS to emerge, and one good idea from mobile OS design I would like to see is having everything be an API call, which lets the OS impose standardisation, permissions, and user preferences rather than the free-for-all that desktop OSes have (though I propose letting the user run non-compliant applications, and not porting the iOS app-store-only model into the OS).
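As a sketch, such an API could be as small as this (entirely hypothetical; no current OS provides this call, and the stub only makes the example self-contained):

    #include <stdio.h>

    /* Hypothetical OS call: the app names its config file, the OS
     * decides where it lives and enforces permissions. The fixed
     * directory below stands in for whatever the OS would choose. */
    FILE *os_open_config(const char *app_id, const char *name, const char *mode)
    {
        char path[512];
        snprintf(path, sizeof path, "/config/%s/%s", app_id, name);
        return fopen(path, mode);
    }

    int main(void)
    {
        /* filename and syntax stay the app's choice; the location does not */
        FILE *f = os_open_config("com.example.myeditor", "settings.toml", "r");
        if (f) fclose(f);
        return 0;
    }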
The problem with the Windows registry, at least back in the day, was that it was a single file that could be corrupted and it would wreck your whole system.
I think having a standard utility API for *nix configs makes a lot of sense. I'm surprised it doesn't exist.
I tried to find one, and there are some libraries for reading and parsing in every language, but nothing that seems to cover everything.
> The problem with the Windows registry, at least back in the day, was that it was a single file that could be corrupted and it would wreck your whole system.
Exactly. But the Registry has been using NTFS and its capabilities for rollbacks and recovery, so the problem is mostly solved. There are occasional bugs [0] causing corruption, but they are very rare.
At 29, I don't feel like I was "basically finished" until 25 or so, and feel significantly more mature than I was even a year ago. I'm very interested to know why you think 13 year olds are mostly finished developing.
You seemingly state that as a positive goal. I posit that many people would have preferred to be left alone and not killed or subjugated. Perhaps if he had been prevented from those actions until after maturity, he would have chosen different actions.
To rephrase the other user's [correct] sentiment, a scientific determination of brain development has more to do with myelin on axons than maturity regarding emotions, finances, or anything else we actually judge people on.
I mean, I don't think most of us commenting here are Apple ads. I think we're all hoping that if Apple is jumping into VR, then finally it'll be good enough for us all to adopt massively, the same way we all have smartphones. That's an exciting prospect, even if we don't necessarily "like" Apple.
But at $3500 for one, I think they've blown it. I'll wait for the reviews, and if it really does deliver on all the promises, then that'll be nice. Somehow, I doubt that it'll happen though.
I was talking about the posts/linked articles rather than the comments. I counted six in the first 20 posts, and the titles are mostly quite ad-like.
While I was initially interested by the announcement, the size, lack of internal battery, and price make this a hard no for me. Idk what they were thinking - doesn't anyone remember Google Glass? That was half the price and size and nobody wanted it.