How do you handle lost webhooks in production?

renewiltord · 2025-12-01T15:21:34 1764602494

Yeah, common problem. But trivial to solve. Just have a minimal webhook server that records the full request and returns 200. Then process async.

Trivial Go program, day’s work. Stick the requests in Postgres, run the processor continuously.
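A minimal sketch of that record-then-process pattern, for illustration only. The `Store` interface, `memStore`, `intake`, `drain`, and `simulateDelivery` names are hypothetical, and the in-memory store stands in for Postgres so the example runs without a database. The 1 MiB body cap is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// Event is the raw request as received; nothing is parsed or validated here.
type Event struct {
	ReceivedAt time.Time
	Path       string
	Header     http.Header
	Body       []byte
}

// Store is a hypothetical persistence interface. In the setup described
// above, Save would be an INSERT into a Postgres table.
type Store interface {
	Save(Event) error
}

// memStore is an in-memory stand-in so the sketch runs without a database.
type memStore struct {
	mu      sync.Mutex
	pending []Event
}

func (s *memStore) Save(e Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending = append(s.pending, e)
	return nil
}

// drain hands every pending event to handle. Events whose handler fails
// stay pending for the next pass. It returns how many were processed.
func (s *memStore) drain(handle func(Event) error) int {
	s.mu.Lock()
	batch := s.pending
	s.pending = nil
	s.mu.Unlock()

	done := 0
	var failed []Event
	for _, e := range batch {
		if err := handle(e); err != nil {
			failed = append(failed, e)
			continue
		}
		done++
	}

	s.mu.Lock()
	s.pending = append(failed, s.pending...)
	s.mu.Unlock()
	return done
}

// intake records the full request and acknowledges it. It answers 200 only
// after the event is stored, so a failed write lets the sender retry.
func intake(store Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20))
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		ev := Event{
			ReceivedAt: time.Now().UTC(),
			Path:       r.URL.Path,
			Header:     r.Header.Clone(),
			Body:       body,
		}
		if err := store.Save(ev); err != nil {
			http.Error(w, "storage unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// simulateDelivery sends one fake webhook through the handler and returns
// the status code. A real deployment would instead serve intake(store)
// with http.ListenAndServe.
func simulateDelivery(s *memStore, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/hooks/example", strings.NewReader(body))
	rec := httptest.NewRecorder()
	intake(s).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	store := &memStore{}
	status := simulateDelivery(store, `{"id":"evt_1"}`)
	processed := store.drain(func(e Event) error {
		fmt.Printf("processing %s: %s\n", e.Path, e.Body)
		return nil
	})
	fmt.Println("status:", status, "processed:", processed)
}
```

With Postgres behind `Store`, the processing loop would typically claim rows with something like `SELECT ... FOR UPDATE SKIP LOCKED` rather than swapping a slice, so that several workers can run without grabbing the same event.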

Bizarrely there are vendors who are weird about webhooks. Lifefile, as an example, charges pharmacies a dollar per webhook firing. So the pharmacies are crappy about retry policy.

Tbh I wouldn’t buy any product in this space. It’s too simple with a dedicated HTTP server plus Postgres plus a processing loop. And with an already delicate thing I would rather not introduce more vendors.

No, not even if you converted it into an event queue via websocket or zmq or what have you.

everydaydev · 2025-12-01T15:44:52 1764603892

Your approach works, and lots of teams do exactly that. The tradeoff is that you’re now on the hook for uptime, retries, backpressure, tooling, on-call, metrics, etc.

Relae exists for teams who’d rather outsource that operational surface, similar to why people use managed queues instead of running their own RabbitMQ. Not everyone needs it — but some prefer not to own that part of the stack.

super256 · 2025-12-01T13:44:31 1764596671

Ofc I rely on the retry policy. Stripe retries with exponential backoff for three days. If Stripe can't reach our endpoint in 3 days we probably went bankrupt or a solar flare ate IT.
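To give a feel for how an exponential backoff schedule fills a three-day window, here is a rough sketch. The one-minute base delay, the 12-hour cap, and the `retrySchedule` name are assumptions for illustration; this is not Stripe's actual schedule.

```go
package main

import (
	"fmt"
	"time"
)

// retrySchedule returns the wait before each retry: it starts at base,
// doubles after every attempt, is capped at maxDelay, and stops once the
// next wait would push total elapsed time past window.
func retrySchedule(base, maxDelay, window time.Duration) []time.Duration {
	if base <= 0 || maxDelay <= 0 {
		return nil // guard against a zero delay looping forever
	}
	var delays []time.Duration
	var elapsed time.Duration
	d := base
	if d > maxDelay {
		d = maxDelay
	}
	for elapsed+d <= window {
		delays = append(delays, d)
		elapsed += d
		d *= 2
		if d > maxDelay {
			d = maxDelay
		}
	}
	return delays
}

func main() {
	// Assumed parameters: 1 minute base, 12 hour cap, 3 day window.
	schedule := retrySchedule(time.Minute, 12*time.Hour, 72*time.Hour)
	var total time.Duration
	for i, d := range schedule {
		total += d
		fmt.Printf("retry %2d after %v (elapsed %v)\n", i+1, d, total)
	}
}
```

The point of the cap is that doubling alone would spend most of the window on one or two very long waits; capping keeps attempts coming at a steady interval late in the window.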

everydaydev · 2025-12-01T14:01:14 1764597674

Stripe does retries right, no argument there.

Where things get messy is when you have a mix of providers with wildly different retry behaviors, or internal services that have their own rate limits or downtime windows. A relay layer keeps the intake consistent even when the rest of the system isn’t.

samarthr1 · 2025-12-01T13:36:56 1764596216

Wait, so your product moves the point of failure from my infra to your infra?

Plus trusts y'all with contents of said webhook?

everydaydev · 2025-12-01T13:45:45 1764596745

Fair question — we’re not eliminating failure so much as isolating it behind a system that’s purpose-built for durability. Our infra is built with redundant queues, retry pipelines, and observability you typically wouldn’t stand up for a single product team.

And on the data side, we don’t use webhook payloads for anything other than delivery. They’re encrypted at rest and in transit, and automatically purged based on retention settings.

nickphx · 2025-12-01T14:32:56 1764599576

Yeaaaaaaaaaaaaah.. I am not sure adding an additional third party and point of potential failure would help mitigate the issue of receiving data from third parties... but good luck.

everydaydev · 2025-12-01T14:42:19 1764600139

Fair point. The value isn’t in reducing the number of components, it’s in swapping a fragile one (your app endpoint) for something built specifically to stay up, queue, retry, and give you visibility when the rest of your stack isn’t. There are plenty of other products on the market that offer something similar.

journal · 2025-12-01T22:30:41 1764628241

anomaly detection, checks to make sure something is still happening.
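One simple form of that check is a staleness alarm: track when each source last delivered a webhook and flag any that have been quiet for longer than expected. A small sketch, where the `quietSources` name, the vendor names, and the one-hour threshold are all made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// quietSources returns, in sorted order, every source whose most recent
// webhook is older than maxGap as of now.
func quietSources(lastSeen map[string]time.Time, maxGap time.Duration, now time.Time) []string {
	var quiet []string
	for source, seen := range lastSeen {
		if now.Sub(seen) > maxGap {
			quiet = append(quiet, source)
		}
	}
	sort.Strings(quiet)
	return quiet
}

// exampleCheck runs the check on made-up data: one source seen five minutes
// ago and one seen three hours ago, with a one-hour threshold.
func exampleCheck() []string {
	now := time.Date(2025, 12, 1, 12, 0, 0, 0, time.UTC)
	lastSeen := map[string]time.Time{
		"vendor-a": now.Add(-5 * time.Minute),
		"vendor-b": now.Add(-3 * time.Hour),
	}
	return quietSources(lastSeen, time.Hour, now)
}

func main() {
	fmt.Println("quiet sources:", exampleCheck())
}
```

A fixed threshold only works for sources that fire steadily; a source with bursty or seasonal traffic would need a per-source threshold or a comparison against its usual volume.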

phillipseamore · 2025-12-01T15:55:38 1764604538

svix.com

everydaydev · 2025-12-01T16:00:01 1764604801

Svix is a solid managed webhook solution, and their platform is clearly geared toward enterprise teams. Smaller teams and startups need the same reliability patterns (durable delivery, retries, replay), just at a lower price point. That’s where products like Relae aim to make sense: providing similar operational guarantees in a way that’s more accessible for non-enterprise use cases.