I would say from my experience, for _application logs_, it's the exact opposite. When you deal with a few GB/day of data, you want to have logs, and metrics can be derived from those logs.
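To make "metrics can be derived from those logs" concrete, here's a rough sketch; it assumes JSON-structured log lines with `ts` and `level` fields, which is just my illustration, not any particular system's format.

```python
# Derive a per-minute error-count metric from JSON-structured log lines
# piped in on stdin. The "ts"/"level" field names are assumed.
import json
import sys
from collections import Counter

errors_per_minute = Counter()

for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip unstructured or garbled lines
    if event.get("level") == "error":
        # "2024-05-01T12:34:56Z"[:16] -> "2024-05-01T12:34" (minute bucket)
        errors_per_minute[str(event.get("ts", ""))[:16]] += 1

for minute, count in sorted(errors_per_minute.items()):
    print(minute, count)
```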
Logs are expensive compared to metrics, but they convey a lot more information about the state of your system. You want to move towards metrics over time, one hotspot at a time, to reduce cost while keeping observability of your overall system.
I'll take logs over metrics any day of the week, when cost isn't prohibitive.
I was at a large financial news site. They were a total Splunk shop. We had lots of real steel machines shipping and chunking _loads_ of logs. Every team had a large screen showing off key metrics. Most of the time they were badly maintained and broken, so only the _really_ key metrics worked. Great for finding out what went wrong, terrible at alerting when it went wrong.
However, over the space of about three years we shifted organically over to Graphite+Grafana. There wasn't a top down push, but once people realised how easy it was to make a dashboard, do templating and generally keep things working, they moved in droves. It also helped that people built a metrics-emitting system into the underlying hosting app library.
What really sealed the deal was the non-tech business owners making or updating dashboards. They managed to take pure tech metrics and turn them into service/business metrics.
It's fair that you had a different experience than I had. However, your experience seems to be very close to what I was describing. Cost got prohibitive (Splunk), and you chose a different avenue. It's totally acceptable to do that, but your experience doesn't reflect mine, and I don't think I'm the exception.
I've used both Grafana+metrics and logs to different degrees. I've enjoyed using both, but any system I work on starts with logs and gradually adds metrics as needed. It feels like a natural evolution to me, and I've worked at different scales, like you.
I feel like I shouldn't need to mention this, but a news site and a financial exchange with money at stake are not the same. If there is a glitch, you need to be able to trace it back, and you can't do that with some abstracted metrics.
Yea, on a news site, the metrics are important. If suddenly you start seeing errors accrue above background noise and it's affecting a number of people you can act on it. If it's affecting one user, you probably don't give a shit.
In finance, if someone puts in an entry for 1,000,000,000 and it changes to 1,000,000, the SEC, fraud investigators, lawyers, banks, and some number of other FLAs are shining a flashlight up your butt as to what happened.
I'm not saying that you can't log; I'm saying that logging _everything_ on debug in an unstructured way and then hoping to divine a signal from it is madness. You will need logs, as they eventually tell you what went wrong. But they are very bad at telling you that something is going wrong right now.
It's also exceptionally bad at letting you quickly pinpoint _when_ something changed.
Even in a logging-only environment, you get an alert, you look at the graphs, then dive into the logs. The big issue is that those metrics are out of date, hard to derive, and prone to breaking when you make changes.
Verbose logging is not a protection in a financial market, because if something goes wrong you'll need to process those logs for consumption by a third party. You'll then have to explain why the format changed three times in the two weeks leading up to that event.
Moreover, you will need to separate the money audit trail from the verbose application logs, ideally at source. As it's "high value data", you can't be mixing those streams at all.
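To make the structured-vs-unstructured point concrete, here's a rough sketch; the logger names, event names, and fields are invented for illustration, not from any real system.

```python
# Unstructured debug line vs. a structured, machine-readable event,
# with the money audit trail kept on its own logger/stream at source.
# All names ("orders", "order_accepted", field keys) are made up.
import json
import logging
import time

logging.basicConfig(level=logging.DEBUG)
app_log = logging.getLogger("orders")
audit_log = logging.getLogger("orders.audit")  # would be shipped separately

order_id, amount = "o-123", 1_000_000_000

# Unstructured: readable during one incident, hopeless to aggregate or audit.
app_log.debug("processing order %s for %s, looks fine", order_id, amount)

# Structured: one event per business action, stable format, easy to derive
# metrics from and to hand to a third party if something goes wrong.
audit_log.info(json.dumps({
    "event": "order_accepted",
    "order_id": order_id,
    "amount": amount,
    "ts": time.time(),
}))
```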
> Logs are expensive compared to metrics, but they convey a lot more information about the state of your system.
My experience has been kind of the opposite.
Yes, you can put more fields in a log, and you can nest stuff. In my experience, however, metrics tend to give me a clearer picture of the overall state (and behaviour) of my systems. I find them easier and faster to operate, easier to get an automatic chronology going, easier to alert on, etc.
Logs in my apps are mostly relegated to capturing warning and error states for debugging reference, as the metrics give us a quicker and easier indicator of issues.
I’m not well versed in QA/sysadmin/logs, but surely metrics suffer from Simpson’s paradox compared to properly probed questions that can only be answered by having access to the entirety of the logs?
If you average out metrics across all log files, you’re potentially reaching false, or worse, inverted conclusions about multiple distinct subsets of the logs.
It’s part of the reason why statisticians are so pedantic about the wording of their conclusions and which subpopulation those conclusions actually apply to.
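For what it's worth, here's a toy version of that averaging trap with made-up numbers: service B has a worse error rate than A on _every_ endpoint, yet looks better once you aggregate, purely because its traffic skews towards the low-error endpoint.

```python
# Simpson's paradox with invented request/error counts:
# B is worse than A per endpoint, but "better" in the aggregate.
counts = {
    # (service, endpoint): (errors, requests)
    ("A", "/fast"): (1, 100),
    ("A", "/slow"): (100, 1000),
    ("B", "/fast"): (20, 1000),
    ("B", "/slow"): (11, 100),
}

for svc in ("A", "B"):
    for ep in ("/fast", "/slow"):
        err, req = counts[(svc, ep)]
        print(f"{svc} {ep}: {err / req:.1%}")
    total_err = sum(e for (s, _), (e, _r) in counts.items() if s == svc)
    total_req = sum(r for (s, _), (_e, r) in counts.items() if s == svc)
    print(f"{svc} overall: {total_err / total_req:.1%}")

# A /fast: 1.0%, A /slow: 10.0%, A overall: 9.2%
# B /fast: 2.0%, B /slow: 11.0%, B overall: 2.8%  <- order reversed
```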