This was a lovely passage from Dyson’s Web of Stories interview, and it struck a chord with me, like it clearly did with the authors too.
It happened when Dyson took the preliminary results of his work on the pseudoscalar theory of pions to Fermi, and Fermi very quickly dismissed the whole thing. It was a shock to Dyson, but it freed him from wasting more time on it.
Fermi: When one does a theoretical calculation, either you have a clear physical model in mind or a rigorous mathematical basis. You have neither. How many free parameters did you use for your fitting?
Dyson: 4
Fermi: You know, Johnny von Neumann always used to say ‘with four parameters I can fit an elephant, and with five I can make him wiggle his trunk’.
Indeed, but worth noting that LZ is a modelling scheme, whilst Huffman is a coding technique.
That is, LZ determines, dynamically as it goes, which elements we want to encode and their probabilities. Then you need a coder, like Huffman, to actually encode them.
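A rough way to picture that split, in Haskell-ish terms (the type names here are just mine, not from the post): the modelling side produces tokens, and the coding side turns tokens into bits.

```haskell
import Data.Word (Word8)

-- Modelling side: LZ77 rewrites the input as a stream of tokens, either
-- literal bytes or (distance, length) back-references into earlier output.
data Token
  = Literal Word8
  | Match { distance :: Int, matchLen :: Int }
  deriving (Show)

-- Coding side: a separate concern that maps each token to bits according
-- to some probability estimate (Huffman, arithmetic coding, ...).
type Coder = Token -> [Bool]
```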
In the post I used a semi-static zero-order byte-based model.
That means I counted the byte occurrences first and used those counts as the probabilities throughout the whole encoding. Then I used Huffman codes to translate those probabilities into bits.
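Roughly, that counting pass looks like this (a sketch, not the code from the post; the names are mine):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as M
import Data.Word (Word8)

-- First pass of the semi-static zero-order model: count how often each
-- byte occurs; count/total is then used as that byte's probability for
-- the whole encoding.
byteCounts :: BS.ByteString -> M.Map Word8 Int
byteCounts = BS.foldl' (\m b -> M.insertWith (+) b 1 m) M.empty

probabilities :: BS.ByteString -> M.Map Word8 Double
probabilities bs = fmap (\c -> fromIntegral c / total) (byteCounts bs)
  where total = fromIntegral (BS.length bs)
```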
But I'm considering writing a follow-up changing this static model for an LZ77 one as I think that would be fun.
Haskell's speed can be competitive with systems languages but keep in mind that its killer feature is ease of abstraction.
The idea is that it is simple to assemble multiple parts into a coherent, well-organised program, which matters for the entirety of the program, not just the tight loops.
So, with the nice FFI Haskell has, you can always drop down to languages without a GC for inherently imperative optimisations. Then you wrap that into a library with nice types and you can now leverage that raw power anywhere in your Haskell code where the types will match.
I worked at Meta on a high-performance Haskell application and that's what we did: we wrote beautiful, large, fast Haskell programs which, in some specialised parts, had C++ building blocks. 99% of the time was spent in Haskell land composing things into more and more useful applications.
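For anyone who hasn't seen it, the FFI side is not much ceremony. A minimal sketch, assuming a made-up C function `crunch_bytes` that does some tight inner loop:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BSU
import Foreign.C.Types (CInt, CSize)
import Foreign.Ptr (Ptr, castPtr)
import Data.Word (Word8)

-- Hypothetical C/C++ routine doing the GC-free, imperative inner loop.
foreign import ccall unsafe "crunch_bytes"
  c_crunchBytes :: Ptr Word8 -> CSize -> IO CInt

-- Wrap the raw pointer-and-length interface behind an ordinary Haskell
-- type, so the rest of the program just composes ByteString -> IO Int.
crunch :: BS.ByteString -> IO Int
crunch bs =
  BSU.unsafeUseAsCStringLen bs $ \(ptr, len) ->
    fromIntegral <$> c_crunchBytes (castPtr ptr) (fromIntegral len)
```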
The general rule of thumb I’d give is that a performance-aware but not micro-optimized Haskell program will typically run in about 2x to 5x the time of a comparable C program, and will take somewhere between 2x and 10x as much memory. For a naive Haskell program the range is much bigger: maybe 2x to 10x as much time and 10x to 1000x as much memory (it’s easy to do a lot of allocations in Haskell).
For extremely optimized Haskell you can get close to the speed of C, but there’s still a garbage collector.
There are also certain classes of problem where a naive Haskell implementation can beat other languages by a mile, including C, if you use the same implementation in both languages. Laziness can be really great sometimes. This didn’t happen much in practice though, because the kind of code that’s really efficient with lazy evaluation is very obviously not efficient in a strict language, so people don’t usually write code that way.
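A toy illustration of the shape of that (contrived, of course): you can compose an infinite producer with a finite consumer, and only the demanded part ever gets computed, whereas a strict language would have to restructure the code.

```haskell
-- Laziness: map over an infinite list, stop as soon as the consumer stops
-- demanding values. Evaluated strictly, this same composition would diverge.
smallSquares :: [Integer]
smallSquares = takeWhile (< 100) (map (^ 2) [1 ..])

main :: IO ()
main = print smallSquares  -- [1,4,9,16,25,36,49,64,81]
```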
In the end I’d say Haskell is a good choice for performance-sensitive but not performance-critical programs. In a larger Haskell application, if you have a performance-critical bit you can usually write Haskell code that will be fast enough if you know what you’re doing. For something standalone that needs to be as fast as possible, or the most critically performance-sensitive parts of a bigger application, I’d consider using C or C++.
To rephrase using my experience: "performance aware" Haskell is about as "fast" as Go, but needs more memory, and both are slower than the same Java code - but both are way more fun to write ;). Optimising Go is easier for most people though; in Haskell you _really_ need to know Haskell internals (how to read Core and write unboxed code) and understand laziness.
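As a concrete example of the kind of internals knowledge that means, here is the classic strict-accumulator fix (a sketch, not from the article): without optimisations the first version builds a long chain of thunks, while the bang pattern keeps the loop in constant space and lets GHC unbox the accumulator in Core.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Lazy accumulator: without -O, `acc` grows as an unevaluated thunk chain.
sumLazy :: [Int] -> Int
sumLazy = go 0
  where
    go acc (x : xs) = go (acc + x) xs
    go acc []       = acc

-- Strict accumulator: the bang forces `acc` at each step, so it stays a
-- plain (unboxable) Int and the loop runs in constant space.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc (x : xs) = go (acc + x) xs
    go !acc []       = acc

main :: IO ()
main = print (sumStrict [1 .. 10000000])
```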
The goal of this implementation is not to be fast, but to be clear.
I am doing some inefficient things (like two-pass encoding) on purpose to keep things simple and clear. So using this particular piece of code to judge a language's performance potential is not really the way to go here.
The point they’re making is that there is no performance without tradeoffs, and “fast” is meaningless unless you define what you’re measuring. Asking the question implies a misunderstanding of the intent of the implementation; OP was trying to subtly let them know.
That’s interesting. I guess this is not usually used because you may have a long string of bits that is ambiguous till you get to a disambiguating bit.
Something like
`100000000000000001`
In this case, to know whether the first code was an `a` or a `c`, you have to read all the way to where the zeroes end.
There is one way in which Huffman codes are better: they are easier to explain and simpler to implement.
I went for simplicity of exposition in the post, but arithmetic coders can indeed get arbitrarily close to the entropy, which is not quite the case with Huffman.
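To make that gap concrete with a back-of-the-envelope example: for a very skewed two-symbol source, symbol-by-symbol Huffman still has to spend a whole bit per symbol, while the entropy (which an arithmetic coder can approach) is far lower. A quick sketch:

```haskell
import Text.Printf (printf)

-- Entropy of a two-symbol source with probabilities p and 1 - p, in bits.
entropy :: Double -> Double
entropy p = negate (p * logBase 2 p + q * logBase 2 q)
  where q = 1 - p

main :: IO ()
main =
  -- Huffman must give each symbol a whole codeword, so a two-symbol
  -- alphabet always costs 1 bit/symbol no matter how skewed it is.
  printf "entropy: %.3f bits/symbol vs Huffman: 1 bit/symbol\n" (entropy 0.99)
```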