Oops, it was meant to be: SELECT <user_id> FROM (SELECT DISTINCT user_id FROM us...

gleb · on July 30, 2015

I'd put more effort into setting up a believable problem in these kinds of posts before presenting a solution. Much like in a company pitch, it's hard to understand the value of a product if you don't understand what problem it is trying to solve.

It doesn't help that using unnecessary DISTINCTs in subqueries is a common performance problem in novice SQL. Why people do that I don't really understand, but they do.
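
To illustrate what I mean by an unnecessary DISTINCT, here is a minimal sketch using SQLite from Python. The `users` and `logins` tables and their rows are made up for the example and are not from the post. An `IN (subquery)` test only asks whether a value is present, so deduplicating the subquery first does not change the result:

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE logins (user_id INTEGER, logged_in_at TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO logins VALUES
        (1, '2015-07-01'), (1, '2015-07-02'), (3, '2015-07-03');
""")

# The pattern novices often write: DISTINCT inside an IN subquery.
with_distinct = conn.execute(
    "SELECT user_id FROM users "
    "WHERE user_id IN (SELECT DISTINCT user_id FROM logins) "
    "ORDER BY user_id"
).fetchall()

# The same query without DISTINCT; IN only checks membership.
without_distinct = conn.execute(
    "SELECT user_id FROM users "
    "WHERE user_id IN (SELECT user_id FROM logins) "
    "ORDER BY user_id"
).fetchall()

print(with_distinct == without_distinct)
```

Whether the extra DISTINCT actually costs anything depends on the database and its query planner, so treat this as a demonstration that the two queries are equivalent, not as a benchmark.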

That's the thing about probabilistic data structures - I've never seen a real-world performance problem in SQL where they would have been helpful. I really would like to have an "aha" moment where somebody shows me one.

Probabilistic data structures do seem like a natural match for streaming databases, but that's different.