Hey HN. We just released an Upgrade Context Server and an Upgrade Agent for upgrading OSS components running on a K8s platform. Platform, DevOps, and SRE teams can use this Server + Agent to easily and safely raise IaC PRs from their existing coding assistants (starting with Cursor).
The hard part of Platform upgrades isn’t the IaC syntax (TF, Jsonnet, CDK, Pulumi, etc.); it’s all the other “stuff”. That “stuff” has two main dimensions:
1. Gathering environment context: what’s actually running, which versions are compatible with your stack, which breaking changes matter, which IaC patterns you use, and so on.
2. Making upgrade choices and getting the team’s +1: once you’ve got the environment context nailed down, you choose a version that will hold in production, apply diffs in a meaningful way (e.g. surgical Helm values / RBAC changes only, while CRDs can be overwritten wholesale; see the sketch below), and surface “the why” and “the what” of these upgrade choices to PR reviewers so they can give meaningful feedback.
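To make the “surgical Helm values change vs. wholesale CRD overwrite” distinction concrete, here is a minimal Python sketch. This is illustrative only, not how Chkk implements it; the chart values, versions, and file names are invented for the example.

    # Illustrative sketch: apply only the keys the upgrade actually needs to
    # change, preserving every local customization in the existing values.
    from copy import deepcopy

    def surgical_merge(existing: dict, overrides: dict) -> dict:
        """Recursively apply overrides without touching unrelated keys."""
        merged = deepcopy(existing)
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = surgical_merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    # Existing values.yaml (as a dict) with team customizations we must keep.
    existing_values = {
        "image": {"repository": "istio/pilot", "tag": "1.20.0"},
        "resources": {"requests": {"cpu": "500m", "memory": "2Gi"}},
        "meshConfig": {"accessLogFile": "/dev/stdout"},
    }

    # The upgrade only needs to bump the image tag; nothing else should move.
    print(surgical_merge(existing_values, {"image": {"tag": "1.22.1"}}))
    # -> tag becomes 1.22.1; resources and meshConfig are preserved.

    # CRDs, by contrast, are usually safe to replace wholesale with the
    # manifests shipped for the target release (e.g. kubectl apply -f <crds.yaml>).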
The Chkk Upgrade Context Server and Upgrade Agent focus on those two problems.
- Version recommendations are computed against your constraints and policies, so targets are compatible and stable, not simply recent (see the sketch after this list).
- Diffs are scoped to what you actually run, with ordering and selectors that make changes deterministic and auditable.
- Breaking changes, critical fixes, and upgrade considerations appear next to the change, so reviews move quickly and the blast radius of each change is well understood up front.
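As a rough illustration of what “compatible and stable, not simply recent” means in practice, here is a small Python sketch. It is not Chkk’s actual algorithm, and the versions and compatibility data are made up for the example.

    # Illustrative sketch: choose the highest upgrade target that satisfies
    # environment constraints, rather than the newest release.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        version: tuple            # e.g. (1, 16, 2)
        min_k8s: tuple            # minimum supported Kubernetes version
        known_bad: bool = False   # e.g. a release with a critical regression

    def recommend(candidates: list[Candidate], cluster_k8s: tuple) -> Candidate | None:
        """Return the highest version that is compatible and not known-bad."""
        eligible = [
            c for c in candidates
            if not c.known_bad and cluster_k8s >= c.min_k8s
        ]
        return max(eligible, key=lambda c: c.version, default=None)

    candidates = [
        Candidate((1, 14, 5), min_k8s=(1, 25)),
        Candidate((1, 15, 3), min_k8s=(1, 26)),
        Candidate((1, 16, 0), min_k8s=(1, 27), known_bad=True),  # critical regression
        Candidate((1, 16, 2), min_k8s=(1, 28)),                  # needs a newer cluster
    ]

    # A cluster still on Kubernetes 1.27 should land on 1.15.3, not 1.16.x.
    print(recommend(candidates, cluster_k8s=(1, 27)))

In a real engine the constraints would also cover dependency versions, deprecated APIs, and organizational policies, but the selection principle is the same.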
Initial support includes the Cursor IDE and Helm IaC across 260+ open-source projects, including Networking (Istio, Cilium, …), Databases (PostgreSQL, Redis, …), Data Jobs (Argo Workflows, Jenkins, …), Data Streams (Kafka, RabbitMQ, …), Observability (Datadog, Grafana, …), Security (cert-manager, Kyverno, …), Autoscaling (Karpenter, KEDA, …), and Developer Tooling (ArgoCD, Bitbucket, …).
A workflow walkthrough is in the launch blog linked from this post.
Or you can just dive directly into setup and docs here: https://docs.chkk.io/ai/chkk-plus-cursor
It’s early, and we would really appreciate feedback from the HN community, especially from teams juggling multi-cluster upgrades or inherited IaC. Happy to answer questions.
Guess what: agents, like humans, also get confused by too much context all at once.
"Minimum relevant context injected at the right time" is how you assure quality artifacts and actions from your agentic workflows.
It's just sooo hard to reconstruct the "minimum relevant context" from running infrastructure state matched against a moving target (i.e. the next version you should be upgrading to).
"Operational Safety" is the neglected child of software operations. I saw how it was implemented effectively when working at AWS, but the broader software ecosystem appeared oblivious to this key concept. While the CrowdStrike outage caused havoc, its silver lining is that Operational Safety has now become a key consideration for software leaders, all the way to CIOs.
It must stay this way, as complex, mission-critical systems will continue to rely more and more on software, and cascading failures are just a fact of life in these systems.
I agree that the broader software ecosystem has been slow to recognize the importance of Operational Safety. The CrowdStrike outage, while unfortunate, has indeed served as a wake-up call, elevating Operational Safety to a priority for software leaders and CIOs alike.
As you pointed out, the reliance on complex, mission-critical systems is only increasing, and cascading failures are an inherent risk we must address proactively. By learning from organizations like AWS that have successfully integrated Operational Safety into their practices, we can work towards a more resilient and reliable software ecosystem. Let's continue to advocate for making Operational Safety a foundational element in software operations across the industry.
Most are just focusing on incident response and reactively improving things. That's why a proactive discipline that prevents these issues from happening is dearly missed.
Even when postmortems are done, the information continues to exist in team silos, as there is no way to share these learnings across teams and enterprises. Hence, everyone repeats each other's mistakes.
Platform teams are a relatively new concept, and the idea is catching on like wildfire inside enterprises. However, the technology, people, and process challenges these teams face are not well understood. In this blog, we tried to crystallize the biggest obstacles we see Platform teams facing.
Thanks for your encouraging words, and we are glad the challenges and solutions resonate.
What an excellent post. I love the "first felt PMF" criterion, which is the right way to think about it because, as the interviews in this blog validate, most of the time you remain unsure whether you have actually found PMF... and that's perfectly ok!
As a dual Pakistan/USA citizen, I am deeply saddened by this meddling by the US in Pakistan's internal politics. Imran, whom I don't support BTW, was a hugely popular leader. His being thrown out, then the clandestine kidnapping of tens of thousands of his party members, and finally the press and media not being allowed to even utter his name (yes, they are not allowed to say his name on air) amount to a civilian martial law that is as anti-democratic as it gets.
I hope future US governments actually act as the flagships of democracy that we claim to be.
1) At the end of the day, no one told the Pakistani military what to do about journalists, media, kidnappings, etc.
2) Ukraine sold Pakistan battle tanks in the 90s when no one else would and also took Pakistan's side on Kashmir. If this was the result of “interagency consultations”, people in the Foreign Office need to find new jobs.