Just browsing the Quickwit documentation, it seems like the general architecture here is to write JSON logs but store them compressed. Is this just something like gzip compression? 20% compressed size does seem to align with ballpark estimates for gzipping JSON. This is what Quickwit (and this page) calls a "document": a single JSON record (just FYI).
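A quick way to sanity-check that figure is to gzip some synthetic JSON log lines yourself; the record shape below is made up, and real ratios obviously depend on how repetitive your own logs are:

```
# Rough sanity check of the ~20% figure on synthetic JSON log records.
# The record shape is invented; real ratios depend on your actual logs.
import gzip
import json
import random

records = []
for i in range(50_000):
    records.append(json.dumps({
        "timestamp": 1700000000 + i,
        "level": random.choice(["INFO", "WARN", "ERROR"]),
        "service": "checkout",
        "message": f"request completed in {random.randint(1, 500)} ms",
        "trace_id": f"{random.getrandbits(64):016x}",
    }))

raw = ("\n".join(records)).encode()
gz = gzip.compress(raw, compresslevel=6)
print(f"raw: {len(raw)/1e6:.1f} MB, gzip: {len(gz)/1e6:.1f} MB "
      f"({100*len(gz)/len(raw):.0f}%)")
```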
Additionally, you need to store indices, since those are what you actually search. Indices have a storage cost when you write them too.
When I see a system like this my thoughts go to questions like:
- What happens when you alter an index configuration? Or add or remove an index?
- How quickly do indexes update when this happens?
- What about cold storage?
Data retention is another issue. Indexes have a retention config [1], but it's not immediately clear to me how document retention works. Possibly via S3 expiration?
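If retention really is delegated to S3, I'd guess it looks something like an ordinary lifecycle expiration rule on the index prefix; to be clear, this is pure speculation on my part, and the bucket/prefix names below are placeholders:

```
# Speculative sketch: document retention as an S3 lifecycle expiration rule.
# Bucket and prefix names are placeholders, not Quickwit's actual layout.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-quickwit-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-splits",
            "Filter": {"Prefix": "indexes/my-logs/"},
            "Status": "Enabled",
            "Expiration": {"Days": 60},  # matches the ~60 day retention discussed below
        }]
    },
)
```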
So, network transfer out of S3 is relatively expensive ($0.05/GB standard pricing [2] to the Internet, less to other AWS regions). This will be a big factor in cost. I'm really curious to know how much all of this actually costs per PB per month.
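Back of the envelope, using the $0.05/GB egress figure above and something like $0.023/GB-month for standard storage (real S3 pricing is tiered and region-dependent, so treat these as placeholders):

```
# Back-of-envelope S3 costs per PB, using placeholder rates quoted above.
GB_PER_PB = 1_000_000

storage_per_gb_month = 0.023  # assumed S3 Standard first-tier rate
egress_per_gb = 0.05          # rate quoted above for egress to the Internet

print(f"storage: ${storage_per_gb_month * GB_PER_PB:,.0f} per PB-month")
print(f"egress to the Internet: ${egress_per_gb * GB_PER_PB:,.0f} per PB read out")
```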
IME you almost never need to log and store this much data, and there's almost no reason to ever keep it. Most logs are useless, and you have to question what the purpose of any given log is. Even if you're logging errors, you're likely to get the exact same value out of 1% sampling as you are from logging everything.
You might even get more value with 1% sampling, because querying and monitoring can be a whole lot easier with substantially less data to deal with.
Likewise, metrics tend to work just as well from sampled data.
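If you do go the sampling route, one trick is to sample deterministically on a request/trace ID so that every line belonging to a sampled request is kept together. A minimal sketch (field names are made up):

```
# Deterministic 1% sampling keyed on a request/trace ID, so all lines for a
# sampled request are kept together (field names are invented).
import hashlib

SAMPLE_RATE = 0.01  # keep 1%

def keep(record: dict) -> bool:
    key = record.get("trace_id", "")
    # Hash to a number in [0, 1]; stable across processes and restarts.
    h = int(hashlib.sha256(key.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    return h < SAMPLE_RATE

log = {"trace_id": "ab12cd34", "level": "ERROR", "message": "upstream timeout"}
if keep(log):
    print("ship to the sampled store")
```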
This post suggests 60 day log retention (100PB / 1.6PB daily). I would probably divide this into:
1. Metrics storage. You can get this from logs, but you'll often find it useful to write it directly if you can. Deriving metrics from logs can be error-prone (e.g. a log format changes, the sampling rate changes, and so on);
2. Sampled data, generally for debugging. I would generally try to keep this at 10TB or less;
3. "Offline" data, which you would generally only query if you absolutely had to. This is particularly true on S3, for example, because the write costs are basically zero but the read costs are expensive.
Additionally, you'd want to think about data aggregation, as a lot of your logs are only useful when combined in some way.
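As a rough illustration of what I mean by aggregation, rolling raw lines up into per-minute counts per service and level before they ever hit long-term storage (record shape is made up):

```
# Roll raw log lines up into per-minute counts per (service, level).
from collections import Counter

def aggregate(records):
    counts = Counter()
    for r in records:
        minute = r["timestamp"] - (r["timestamp"] % 60)
        counts[(minute, r["service"], r["level"])] += 1
    return counts

records = [
    {"timestamp": 1700000003, "service": "checkout", "level": "ERROR"},
    {"timestamp": 1700000041, "service": "checkout", "level": "ERROR"},
    {"timestamp": 1700000075, "service": "checkout", "level": "INFO"},
]
for (minute, service, level), n in aggregate(records).items():
    print(minute, service, level, n)
```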
Quickwit (like Elasticsearch/OpenSearch) stores your data compressed with ZSTD in a row store, builds a full-text search index, and stores some of your fields in a columnar store. The "compressed size" includes all of this.
The high compression rate is VERY specific to logs.
> What happens when you alter an index configuration?
Changing an index mapping was not available in 0.8. It is available in main and will be added in 0.9. The change only impacts new data.
> Or add or remove an index?
This has been handled since the beginning.
> What about cold storage?
What makes Quickwit special is that everything we read is on S3. We adapted our inverted index to make it possible to read it straight from S3.
You might think this is crazy slow, but we typically search through TBs of data in less than a second. We have some in-RAM caches too, but they are entirely optional.
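Conceptually, this relies on ordinary S3 byte-range reads: fetch only the slices of an index you need instead of whole files. A toy sketch with boto3 (bucket, key, and offsets are made up, not our real storage layout):

```
# Toy illustration of byte-range reads from S3: fetch only the slice of an
# index file that a query needs. Bucket/key/offsets are placeholders.
import boto3

s3 = boto3.client("s3")

def read_slice(bucket: str, key: str, start: int, length: int) -> bytes:
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{start + length - 1}",
    )
    return resp["Body"].read()

# e.g. pull one posting-list region without downloading the whole split
chunk = read_slice("my-quickwit-bucket", "indexes/my-logs/split-01.idx", 4096, 65536)
print(len(chunk))
```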
> 2. Sampled data, generally for debugging. I would generally try to keep this at 10TB or less;
Sometimes, sampling is not possible. For instance, some Quickwit users (including Binance) use their logs for user support too. A user might come asking for details about something fishy that happened 2 months ago.
You have very good questions; I can only guess at one answer: S3 network transfer is free for AWS services in the same region.
Your link[1] said:
> You pay for all bandwidth into and out of Amazon S3, except for the following:
> [...]
> - Data transferred from an Amazon S3 bucket to any AWS service(s) within the same AWS Region as the S3 bucket (including to a different account in the same AWS Region).
[1]: https://quickwit.io/docs/overview/concepts/indexing
[2]: https://aws.amazon.com/s3/pricing/