Hacker News

This is a good point. I wish there were more discussion around when to consider making these decisions, how to transition, what types of systems sit in the middle ground between bedroom project and Facebook, etc.

It seems all of the articles/posts that get our attention are the "you don't need it ever" opinions, but those are incredibly short-sighted and assume that everyone is recklessly applying complex infrastructure.

In general, it would be nice to have more nuanced discussions around these choices rather than the low-hanging fruit that is always presented.



What I tend to see missing is clear explanations of, "What problems does this solve?" Usually you see a lot of, "What does it do?" which isn't quite the same thing. The framing is critical because it's a lot easier for people to decide whether or not they have a particular problem. If you frame it as, "Do we need this?" though, well, humans are really bad at distinguishing need from want.

Take Hadoop. I wasted a lot of time and money on distributed computing once, almost entirely because of this mistake. I focused on the cool scale-out features, and of course I want to be able to scale out. Scale out is an incredibly useful thing!

If, on the other hand, I had focused on the problem - "If your storage can sustain a transfer rate of X MB/s, then your theoretical minimum time to chew through Y MB of data is Y/X seconds, and, if Y is big enough, then Y/X becomes unacceptable" - then I could easily have said, "Well my Y/X is 1/10,000 of the acceptable limit and it's only growing at 20% per year, so I guess I don't really need to spend any more time thinking about this right now."
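The back-of-the-envelope math above is easy to sketch. All the numbers below are made up for illustration (storage streaming 500 MB/s, 50 GB of data, a one-hour tolerance, 20% yearly growth):

```python
# Minimum time to chew through Y MB of data at X MB/s, and how long
# a single machine stays comfortably inside an acceptable limit.
sustained_mb_per_s = 500        # X: what the storage can actually stream
data_mb = 50_000                # Y: total data to process
acceptable_s = 3600             # the most we're willing to wait

min_time_s = data_mb / sustained_mb_per_s   # theoretical floor: 100 s
headroom = acceptable_s / min_time_s        # 36x under the limit today

growth = 1.20  # 20% data growth per year
years = 0
while data_mb / sustained_mb_per_s < acceptable_s:
    data_mb *= growth
    years += 1
print(years)  # → 20: years before one machine can no longer keep up
```

With numbers like these, "don't think about distributed computing for now" falls straight out of the arithmetic.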


Yep. All of us tech folks read sites like Hacker News and read all about what the latest hot Silicon Valley tech startup is doing. Or what Uber is doing. Or what Google is doing. Or what Amazon is doing. And we want to be cool as well, so we jump on that new technology. Whereas for the significant majority of applications out there, older, less sexy technology on modern fast computers will almost certainly be good enough and probably easier to maintain.


Uber is one of those alleged behemoths whose scaling problems I'm incredibly skeptical about.

The service they provide can't be anywhere near the Facebook/Google league of scale, nor Netflix level of data/performance demand.


I've had several experiences where I attended a tech talk by someone from Uber, and never once did I come away with the impression that the problem they were trying to solve was the kind of thing Fred Brooks had in mind when he coined the term "essential complexity."

That said, volume and velocity aren't the only kinds of scale that technical teams have to grapple with. I've spent enough time at a big organization to understand that Conway's Law costs CPU cycles. Lots of them.


They struggle with the size of their mobile apps, so they've got that going for them.


> And we want to be cool as well so we jump on that new technology.

I would rather have cool tech on a resume than boring tech. It helps if you decide to jump to a cool company.


We don't have data sheets the way professional electronic components do. I guess we value Turing/Church more than predictability.


> What I tend to see missing is clear explanations of, "What problems does this solve?"

This still sounds like starting out (like a scientist) by looking at something because it's cool, hip, and perhaps useful, before considering whether it solves any of your problems. The proper working order is to start by having the problem. Hardly any problem requires a state-of-the-art solution.

When one is wiring a house (where I live), the regulation says you should use the same standards for everything on a group on the switchboard. This hilariously means that if you need to extend iron pipes with canvas-insulated wires, you have to use metal pipes with canvas-wrapped wires (or rip everything out and replace it with something modern).


When you are in a position to pay someone full time to look after the web infrastructure of your project, that's when you should think this stuff through.

Before that, don't worry about the tech, go for what you know, and expect it to need totally re-writing at some point. Let's face it, this stuff moves quickly.


That's not good. That kind of stuff must be derived from your business plan, not from the current situation. You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted, reacting in a rush will probably happen anyway, but it shouldn't be surprising.)

If your business requires enough infrastructure that this kind of tooling is a need at all, it may very well be an important variable on whether you can possibly be profitable or not.


> You should know that "if I'm successful at this level, I will need this kind of infrastructure" before the problems appear, and not be caught by surprise and react in a rush. (Granted, reacting in a rush will probably happen anyway, but it shouldn't be surprising.)

What's the ROI on spending time planning for that, given that you acknowledge the response will be a mess anyway? In my experience, companies that planned for scaling don't handle it noticeably better than companies that didn't, so maybe it's better not to bother?


> What's the ROI on spending time planning for that, given that you acknowledge the response will be a mess anyway?

Let's put it this way: what's the ROI of failing to address any of the failure modes avoided or mitigated by taking the time to think things through at design time?

Are you planning on getting paid when you can't deliver value because your service is down?


The ROI of reducing time to market is huge, whereas I don't think I've ever seen thinking things through at design time deliver any real benefits (not even a reduced rate of outages).


> I don't think I've ever seen thinking things through at design time deliver any real benefits (...)

That's mostly the Dunning-Kruger effect rearing its head. It makes zero sense to claim that making fundamental mistakes has zero impact on a project.

This type of obliviousness is even less comprehensible given that every single engineering field has project planning and decision theory imbued in its matrix, and every single intro to systems design book puts its emphasis on failure avoidance and mitigation. Yet somehow in software development hacking is tolerated, and dealing with the perfectly avoidable screwups resulting from said hacking is shrugged off as if they were acts of nature.


I suspect it's tolerated because it works better; IMO the value of planning is something cargo-culted from engineering fields where changing something after it's partially built is costly. In software, usually the cheapest way to figure out whether something will work is to try it, which is in stark contrast to a field like physical construction.


If your business plan has this much detail, you'll spend more time updating it to reflect current reality than actually shipping features.

Companies that are growing fast enough to need Kubernetes are also changing fast and learning fast.


Also, when the costs caused by downtime, updates, and/or hardware problems exceed the cost of this full-time engineer. Even then, K8s is probably not the next step. Simple HA with a load balancer and separate test and production environments, plus some automation with an orchestration tool, works wonders.
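For concreteness, "simple HA with a load balancer" can be as small as an nginx upstream with passive health checks in front of two app servers. A sketch with hypothetical hostnames and ports, not a drop-in config:

```nginx
# Hypothetical: two app servers behind one nginx load balancer.
upstream app_backends {
    server app1.internal:8080 max_fails=3 fail_timeout=10s;
    server app2.internal:8080 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backends;
        # Retry the other backend if one errors out or times out.
        proxy_next_upstream error timeout http_502;
    }
}
```

If one backend crashes, nginx marks it failed and routes around it until it recovers; no orchestrator required.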


That's the best metric I've heard. When your Heroku bill is the same as what a full time ops person would cost, start looking to hire.


I would say it has to be at least double that and maybe even higher.


If you are already deploying to Heroku, you are already doing ops.

It makes zero sense to presume you don't need to know what you're doing or benefit from automating processes just because you can't hire a full-time expert.


True, but there's a difference between solving ops problems by throwing more money at hardware and solving ops problems by thinking smarter. That's why I think that some multiple of an FTE salary is a good metric for when you start transitioning to more complex operational setups, because it gives you a number for when the gain of that new complexity is worth the cost.


This. When you have enough revenue to hire new dev, and infrastructure is your most pressing problem to solve with that money, hire someone who knows this stuff backwards to solve it.


No simple advice will cover all the cases but not a bad place to start:

https://basecamp.com/gettingreal/04.5-scale-later


So I agree there’s definitely a need for more writing in this space (and you’d think it might be particularly of interest to incubating startups...)

My top level take on this is that the cues to take things up a level in terms of infrastructure diligence are when the risk budget you’re dealing with is no longer entirely your own. Once you have a banker, a lawyer, an insurance policy, an investor, a big customer, a government regulator, shareholders... ultimately once there’s someone you have to answer to if you haven’t done enough to protect their interests.

And risk management isn’t just ‘what will happen if the server catches fire’, it’s ‘how do I guarantee the code running in production is the code I think it is?’ And ‘how do I roll back if I push a bug to production?’ And ‘how can I make sure all the current versions of the code come back up after patching my servers?’

And it turns out that things like kubernetes can help you solve some of those problems. And things like serverless can make some of those problems go away completely. But of course they come with their own problems! But when those things become important enough that you need to solve for them to answer the risk aversion of sufficient stakeholders, the cost of adopting those technologies starts to look sensible.


FTA:

> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great. You have the funds to do things “right”. Your fancy zero-click continuous deployment system saves you thousands/millions a year. But at indie maker, “Hey look at this cool thing I built … please, please someone look at it (and upvote it on Product Hunt too, thx)” scale — the scale of almost every single site on the net — that 0.1% is vanishingly insignificant.


> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books,

I have yet to see a company with any online presence where downtime doesn't eat away at their profits.

At the very least, downtime means you're paying money to get no service.

Downtime is fine for someone's blog on their personal website that they set up on a weekend while drinking beer. I have yet to see a business state that they are OK with their site serving 404s or 500s randomly for a whole workday each year, which is roughly what 0.1% translates to.
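The "whole workday" figure is just the arithmetic of a 99.9% uptime budget:

```python
# Yearly downtime budget implied by "three nines" (99.9% uptime).
hours_per_year = 365 * 24                  # 8760
downtime_hours = hours_per_year * 0.001    # the 0.1% allowed to fail
print(round(downtime_hours, 2))            # → 8.76 hours, about one workday
```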


Fear not, the article has you covered (emphasis mine):

> Obviously if you’re FAANG-level or some established site where that 0.1% downtime translates into vast quantities of cash disappearing from your books, this stuff is all great.


This is not a FAANG thing. Not being able to deliver a product or service is a business thing. It happens in small mom & pop shops as well as FAANGs. This is a free-market-economy thing. If your business is not open, then you get expenses but zero income. How is this hard to understand?

The software world involves way more than your small side project that you managed to cobble together over a weekend. Things do need to work reliably and predictably. Otherwise, not only do you not get your lunch, your competitors eat it from under your nose. Why is this even being discussed, on HN of all places?


Because having to keep a site live doesn't automatically mean I need all the complexity of K8s? I can deploy my server on two VMs in two AZs with Ansible and run them with systemd to restart on crash. Just because I don't immediately jump to K8s doesn't mean I don't know how to run a site.
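The systemd half of that setup is a few lines per VM. A hypothetical unit file, with paths, names, and port invented for illustration:

```ini
# /etc/systemd/system/mysite.service (hypothetical example)
[Unit]
Description=my web app
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/mysite/bin/server --port 8080
Restart=on-failure
RestartSec=2
User=mysite

[Install]
WantedBy=multi-user.target
```

With one of these installed on each VM by the same Ansible play, `systemctl enable --now mysite` gives you crash restarts and boot-time startup without any orchestrator.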


This is a really weird take on what ‘every single site on the internet’ is. Or of what proportion of the software development community is working on sites of that scale.

It’s like saying ‘the vast majority of people building houses don’t need to dig foundations’ because you build lego houses, and after all the vast majority of houses are made of Lego, right?



