Hacker News | robert3005's comments

What you can do with a GPU-friendly format is send compressed data over PCIe and then decompress it on the GPU. Your overall throughput increases because PCIe bandwidth is the limiting factor of the overall system.
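To make the reasoning concrete, here's a back-of-the-envelope sketch. The bandwidth and compression-ratio numbers are illustrative assumptions, not measurements of any particular system:

```python
# Effective ingest throughput when decompressing on the GPU.
# All numbers below are assumptions for illustration only.

pcie_gbps = 32.0          # assumed PCIe 4.0 x16 host-to-device bandwidth, GB/s
compression_ratio = 3.0   # assumed raw-to-compressed size ratio of the format

# Sending raw data: limited directly by the bus.
raw_throughput = pcie_gbps

# Sending compressed data and decompressing on the GPU: each byte over
# the bus expands into `compression_ratio` bytes of usable data
# (assuming GPU decompression keeps pace with the bus).
compressed_throughput = pcie_gbps * compression_ratio

print(raw_throughput)         # 32.0 (GB/s of raw data)
print(compressed_throughput)  # 96.0 (GB/s effective)
```

The point is just that once the bus is the bottleneck, any compression ratio above 1 translates directly into effective throughput.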


That doesn’t explain how Vortex is faster. Yes, you should send compressed data to the GPU and let it decompress. You should maximize your PCIe throughput to minimize latency in execution, but what does Vortex bring, other than “Parquet bad, Vortex good”?


Can’t wait for https://github.com/apache/iceberg/issues/12225 to merge so there’s an API to integrate against


The default writer will decompress the values; however, right now you can implement your own write strategy that avoids doing so. We plan to add that as an option since it’s quite common.


Highly recommend https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf for a comparable algorithm. It generalizes to arbitrary input and output bit widths.
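For intuition, here's a minimal scalar sketch of packing and unpacking values at an arbitrary bit width. This is nothing like the paper's vectorized kernels, just the underlying idea:

```python
def pack(values, width):
    # Pack each value into `width` bits of one big integer buffer.
    buf = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width), "value exceeds the bit width"
        buf |= v << (i * width)
    return buf

def unpack(buf, width, count):
    # Recover `count` values of `width` bits each.
    mask = (1 << width) - 1
    return [(buf >> (i * width)) & mask for i in range(count)]

vals = [3, 7, 1, 6]
packed = pack(vals, 3)            # 4 values * 3 bits = 12 bits total
assert unpack(packed, 3, len(vals)) == vals
```

The fast versions in the paper get their speed from laying values out so the same shift/mask applies to a whole SIMD register at once, rather than looping per value like this.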


Good work, wish I'd seen it earlier as it overlaps with a lot of my recent work. I'm actually planning to release new SOTAs on zigzag/delta/delta-of-delta/xor-with-previous coding next week. Some areas the work doesn't give enough attention to (IMO): register pressure, kernel fusion, cache locality (wrt multi-pass). They also fudge a lot of the numbers and comparisons, e.g. pitching themselves against Parquet (storage-optimised) when Arrow (compute-optimised) is the most comparable tech and the obvious target to beat. They definitely improve on current best work, but only by a modest margin.
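For readers unfamiliar with the codings named above, here are minimal reference implementations of zigzag and delta coding (scalar Python for clarity; the competitive versions are SIMD kernels):

```python
def zigzag_encode(n: int) -> int:
    # Interleave sign into the low bit: 0,-1,1,-2,2 -> 0,1,2,3,4.
    # Python's >> is arithmetic, so n >> 63 is -1 (all ones) for negatives.
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def delta_encode(xs):
    # Keep the first value; store the rest as differences.
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    out = [ds[0]]
    for d in ds[1:]:
        out.append(out[-1] + d)
    return out

# Typical pipeline for timestamps: delta, then zigzag so small
# negative deltas also become small unsigned ints.
ts = [100, 102, 105, 105, 110]
zz = [zigzag_encode(d) for d in delta_encode(ts)]
assert delta_decode([zigzag_decode(z) for z in zz]) == ts
```

Delta-of-delta is just `delta_encode` applied twice, and xor-with-previous replaces the subtraction with XOR on the bit pattern (useful for floats, where consecutive values share high bits).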

I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.


You might be interested in https://github.com/spiraldb/fastlanes. It gives you fast bit-packed vectors, though it requires padding to 1024 elements to get the performance.


You don’t have to wait for the doors to close to be able to scan your ticket on the London Underground. The gate will stay open and let you through. It’s a little awkward since you have to keep walking as you scan your ticket, leaving your hand lagging behind.


The thing we are trying to achieve is to be able to experiment with and tune the way data is grouped on disk. Parquet has one way of laying data out, CSV has another (though it's a text format, so a bit moot), ORC has another, and Lance has yet another. The file format itself stores how it's physically laid out on disk, so you can tune and tweak physical layouts to match the specific storage needs of your system (this is the toolkit part, where you can take Vortex and use it to implement your own file format). Having said that, we will have an implementation of a file format that follows a particular layout.
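To illustrate the idea of a file that describes its own physical layout, here's a toy sketch. The footer structure, field names, and encoding labels are all hypothetical, not Vortex's actual on-disk format:

```python
import json

# Toy self-describing file: a footer records how the data is physically
# laid out, so readers don't hard-code one layout. Purely illustrative.
layout = {
    "columns": ["ts", "value"],
    "row_groups": [
        {"rows": 1024, "encodings": {"ts": "delta", "value": "bitpack(7)"}},
    ],
}
footer = json.dumps(layout).encode()
blob = b"<encoded column data>" + footer + len(footer).to_bytes(4, "little")

# A reader seeks to the end, reads the footer length, then the layout,
# and only then knows how to decode the column data that precedes it.
flen = int.from_bytes(blob[-4:], "little")
recovered = json.loads(blob[-4 - flen:-4])
assert recovered == layout
```

Because the layout travels with the file, a writer is free to experiment (reorder columns, change encodings per row group) without breaking readers.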


Wow, I think this is the thing I've wished existed for years! Most file formats leave a huge compression opportunity on the table just because of their choice of physical layout. (I call the simple case "striding order", idk.) But getting it right takes a lot of experimentation, which becomes too much churn for applications, and can result in storage layouts that are great for compression but annoying to code against. So the obvious answer (to me at least) is that you need to decouple physical and logical layouts. I'm glad someone is finally trying it!


Fulcrum | Software Engineer | London or New York | ONSITE | Full-Time

Fulcrum is building a next-generation storage platform for the diverse data of the future. We believe users will need to process non-tabular and tabular data together, and we need to develop new methods to support them.

We develop Vortex (our core storage primitive) in the open at https://github.com/spiraldb/vortex and are currently looking to hire more people for our 5-person team to help build our product.

Tech: Rust, Python, Zig

Reach out to me at hn[at]fulcrum[dot]so

