Hacker Newsnew | past | comments | ask | show | jobs | submit | Gadiguibou's favoriteslogin

Very nice post!

Another interesting direction you can take reservoir sampling is instead of drawing a random number for each item (to see whether it replaces an existing item and which one), you generate a number from a geometric distribution telling you how many items you can safely skip before the next replacement.

That's especially interesting, if you can skip many items cheaply. Eg because you can fast forward on your tape drive (but you don't know up front how long your tape is), or because you send almost your whole system to sleep during skips.

For n items to sample from, this system does about O(k * log (n/k)) samples and skips.

Conceptually, I prefer the version of reservoir sampling that has you generate a fixed random 'priority' for each card as it arrives, and then you keep the top k items by priority around. That brings me to another related interesting algorithmic problem: selecting the top k items out of a stream of elements of unknown length in O(n) time and O(k) space. Naive approaches to reaching O(k) space will give you O(n log k) time, eg if you keep a min heap around.

What you can do instead is keep an unordered buffer of capacity up to 2k. As each item arrives, you add it to the buffer. When your buffer is full, you prune it to the top k element in O(k) with eg randomised quickselect or via median-of-medians. You do that O(2k) work every k elements for n elements total, given you the required O(n) = O(n * 2*k / k) runtime.

Another related topic is rendezvous hashing: https://en.wikipedia.org/wiki/Rendezvous_hashing

Tangentially related: https://www.keithschwarz.com/darts-dice-coins/ is a great write-up on the alias method for sampling from a discrete random distribution.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: