englambert's comments | Hacker News

Opstrace explains log aggregation and how its open source approach is different.


We haven't decided on that yet. We will probably use a WYSIWYG Markdown editor (recommendations welcome!) along with a Git GUI client. That may not work forever, but it should be fine in the medium-term.

Note: marketing folks will probably not be editing the product docs much--the marketing site content is kept in a separate repo so we can provide a different solution there.


Heh, yes, the PII scrubbing need is very real. This is definitely a contributing factor to our data ownership commitment--just keep your data in your own account.

On the cost topic, last week we published a blog post analyzing the cost of running Opstrace on AWS (https://opstrace.com/blog/pulling-cost-curtain-back). (In fact, feel free to do a local repro to confirm our results.) As mentioned elsewhere here on HN, we are incentivized to provide total transparency in terms of what you spend on your cloud infrastructure. We haven’t compared ourselves to everyone, but feel confident that letting our customers pay S3 directly is the best deal possible.


Our installer is indeed an important part of what we’re offering and we’re continuously evolving our operator to manage the ongoing maintenance. But in terms of being a feature complete, Open Source Datadog, you’re right that we have a long way to go to achieve our vision. As mentioned in other replies, we are working on other interesting components as well, such as a new collaborative UI (https://news.ycombinator.com/item?id=25996154), API integrations (https://news.ycombinator.com/item?id=25994268), and more.

That being said, in case you couldn’t tell, we love software from Grafana Labs. It’s popular for a reason. However, we want it to be as easy to install and maintain as clicking a button, i.e., as simple as Datadog. So one problem we are trying to solve today is that while, yes, you can stitch together all of their OSS projects yourself (and many, many people do), it’s a non-trivial exercise to set up and then maintain. We’ve done it ourselves and seen friends go through it; we’d like to save everyone from having to become a subject matter expert and reinvent the wheel. (Especially since when our friends do it themselves they always skimp on important things like, say, security.) Bottom line—we’re inspired by Grafana Labs. We strive to also be good OSS stewards and contribute to the overall ecosystem like they have.

Another way to solve the “stitching-it-together” problem, as you mentioned, is of course to pay Grafana Labs for their SaaS (which I’ve done in the past) or one of their on-prem Enterprise versions. However, these are not open source. The former is hosted in their cloud account and single-tenant; the latter have no free versions. We think Opstrace provides a lot of value, but we understand that it’s not for everyone.


Yes, indeed! Currently you can already do this using the Prometheus remote_write API, as discussed here: https://news.ycombinator.com/item?id=25992392. So if you can collect your in-house metrics with Prometheus, then you can write them to Opstrace.
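For example, a Prometheus remote_write stanza pointing at an Opstrace tenant might look roughly like this (the endpoint URL and token path here are placeholders, not our exact interface; check the docs for the real values):

```yaml
# prometheus.yml (sketch) -- forward scraped metrics to an Opstrace tenant.
remote_write:
  - url: https://<tenant>.<cluster>.opstrace.io/api/v1/push  # placeholder endpoint
    authorization:
      # Per-tenant API token issued by Opstrace (Prometheus >= 2.26 syntax;
      # older versions use bearer_token_file instead).
      credentials_file: /var/run/tenant-api-token
```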

Additionally, we are close to launching a Datadog API as mentioned here: https://news.ycombinator.com/item?id=25994268

So stay tuned to our blog or newsletter for more on this.

Are there other specific APIs you’re interested in?


Thanks for the reply! Not necessarily, I work in R&D developing low-level data collection so I'm more just trying to keep my ear to to ground in terms of what's going on in the stack just above what I'm doing :)


Sounds good. I know that feeling. :-) Cheers.


Hey, thank you. :-) That’s kind of how we feel -- it seems like everyone is building tooling around Prometheus, and frankly, we hope that collective effort can be redirected toward more impactful value creation for our industry. On a personal note, most of us on the team have been there in one way or another, struggling to actually monitor our own work. We’ve had surprise Datadog bills and felt the pain of scaling Prometheus. (In fact, I’m planning a blog post about this struggle, so stay tuned.) It feels like this problem should already be solved, but it’s not. So we’re trying to fix it.


Prometheus is great; the main problem is the bloat of metrics it collects. One really needs to carefully define rules to scrape, compute, reduce, and filter out the metrics that are not needed, and to precompute the ones that are.
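Concretely, that kind of filtering and precomputation is usually expressed as `metric_relabel_configs` and recording rules; a sketch (job names, metric names, and the regex are made up for illustration):

```yaml
# prometheus.yml (sketch) -- drop metrics you never query before they are stored.
scrape_configs:
  - job_name: node  # hypothetical job
    static_configs:
      - targets: ["localhost:9100"]
    metric_relabel_configs:
      # Drop per-filesystem inode metrics we never look at.
      - source_labels: [__name__]
        regex: "node_filesystem_files(_free)?"
        action: drop

# rules.yml (sketch) -- precompute an expensive aggregation once, query it cheaply.
groups:
  - name: precompute
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```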


You’re absolutely right.

As mentioned earlier (https://news.ycombinator.com/item?id=25993825), our goal is to be super transparent; we want you to fully understand what you’re spending on infrastructure. We feel good that there’s an incentive to help you work through the problems that you’ve mentioned.

Attributing collection and querying is made easier with authentication enabled by default. You can make your tenants as fine- or coarse-grained as you want, handing out authentication tokens to the producers writing to those tenants. This makes it easier to trace back to sources of bloat. You can also place rate limits on individual tenants to prevent bloat in the first place.
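Under the hood, per-tenant rate limits of this kind map onto Cortex’s limit overrides. A sketch of what such a runtime config looks like in Cortex itself (tenant names are invented, and Opstrace manages this for you rather than exposing raw Cortex config):

```yaml
# Cortex runtime overrides (sketch) -- different ingestion budgets per tenant.
overrides:
  team-frontend:  # hypothetical tenant
    ingestion_rate: 25000        # samples per second
    ingestion_burst_size: 50000
  team-batch:
    ingestion_rate: 100000
    ingestion_burst_size: 200000
```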

Additionally, we think users might reconsider the premise of the problem. Because Opstrace runs in your own cloud account, its cost follows cloud economics -- it's basically as cheap as it can possibly be. So you may not have as much pressure to curate what is stored as you think. (I didn’t say "no" pressure, but "less" might be a huge improvement. :-) )


Perhaps on a related note, see this discussion about the power of incentives here: https://news.ycombinator.com/item?id=25994653


It's hard to answer that concretely without knowing a little bit more about your use cases. Care to share a bit more?

One thing comes to mind: we don't bill by data volume. Wavefront is charging you for the volume of data your applications produce. This can lead to negative outcomes, such as surprise bills from a newly deployed service and a subsequent scramble to find and limit the offenders.

We think this pricing model creates the wrong incentives. Charging by volume means a company is more incentivized to have their customers (you) send them more data, and less incentivized to help those customers get more value from that data. This is a fundamental change we want to bring to the market--we want our incentives to align with yours, and we want to be paid for the value we bring to your company. We charge on a per-user basis. You should monitor your applications and infrastructure the right way, without being afraid to send data because it might blow the budget.


Wavefront brings a number of things to the table that aren't core competencies we wish to maintain in-house.

I know it can scale to massive volumes without interaction from us.

I know it'll be available when our infrastructure isn't. By being a third party we can be confident that any action on our part (such as rolling an SCP out to an AWS org, despite unit tests) won't impact the observability we rely on to tell us we've screwed that up.

I can plug 100s of AWS accounts and 10s of payers into it and I don't have to think about that in terms of making self-hosted infrastructure available via PrivateLinks or some other such complication.

I pay mid six-figure sums annually for these things to "just work". If you folks believe I can achieve this functionality on a per-seat basis I'd be interested in saving those six figures.


We’re building Opstrace to be as simple as a provider like Wavefront -- we’ve failed if you need additional competencies to manage it. That being said, we’re early in our journey and still have a ways to go.

As mentioned in the original post here, at the core of Opstrace is Cortex (https://cortexproject.io). We know that Cortex scales well to hundreds of millions of unique active metrics, so depending on the exact characteristics of your workload, the fundamentals should be there.

However, Cortex is a serious service to run and if you were to DIY it would require operations work that you currently don’t have with Wavefront. This is the problem we’re trying to solve—making these great OSS solutions easier to use for people like you.

Opstrace is built so it can be safely exposed on the internet (which is optional, of course), so you can easily run it in an isolated account to keep it safe from all other operations. In fact, this is the configuration we recommend for production use.

Regarding “100s of AWS accounts and 10s of payers”... does that include any form of multi-tenant isolation? We support multi-tenancy out of the box to enable controlling rate limits and authorization limits for different groups. We’d need to talk in more detail about that. If you’d like to do that privately, please shoot me an email at chris@opstrace.com. We’re of course happy to continue the discussion here with you as well.


As a heads up, I think you meant to link https://cortexmetrics.io/


Thanks for the correction! You linked to the right Cortex, not to be confused with https://github.com/TheHive-Project/Cortex, haha. https://github.com/cortexproject/cortex is what we talk about. Naming is hard.


:facepalm: Yes, indeed, I conflated the website and the GitHub org. Mea culpa.


JP from Opstrace here.

Thanks for sharing this perspective, stressing the relative value of predictability.

Of course, when things go pear-shaped the last thing you want to discover is that your monitoring pipeline doesn't work as expected. We feel you.

Your skepticism is justified and I'm super happy to see that here. We know that our future users are (and should be) quite demanding with respect to robustness of the platform.

We're not naively assuming that it's easy to build a platform that is highly available, auto-scaling, and generally worry-free.

In fact, based on our experience, we know that we'll have to invest an incredible amount of engineering effort to make things super reliable and predictable. On the other hand, by making some smart decisions we can get far with comparatively little effort. We have super strong building blocks that we can rely on (such as using a cloud-provided database for storing critical configuration state).

> If you folks believe I can achieve this functionality on a per-seat basis I'd be interested in saving those six figures.

The bet is on, but of course we need a bit of time :)


Chris here, from the Opstrace team. As it turns out, it’s just a happy coincidence. When we discovered theirs we fell in love with it as well. They have many different versions of their monster (https://www.scylladb.com/media-kit/)... similarly you’ll see several new versions of our mascot, Tracy the Octopus, over time!

