I hate Clouds - a personal perspective on why I think Clouds suck

loudwhisper@infosec.pub · 4 months ago

conciselyverbose@sh.itjust.works · 4 months ago

So the whole thing is well worth a read IMO, and addresses a lot of the issues I have with cloud as the solution for everything.

My main point here is that individuals and organizations that require all the flexibility that cloud services offer are a (tiny) minority. This means that for the majority of us, all the complexity necessary to provide this flexibility ends up being purely a complication or worse, a liability.

There are absolutely companies who need the scaling. But it’s a fucking lot of overhead if you don’t.

Let’s repeat it one more time: complexity hides and creates security issues.

This is similar to all the LLM code stuff. If you don’t actually fully understand what your code does, bad stuff happens.

This premise has the consequence that Cloud systems are a big puzzle. The pieces of the puzzle are the Cloud products. Engineers working with Cloud systems essentially need to understand the abstraction but not necessarily the underlying, ultimate working mechanism of what those abstractions do. For example, a cloud expert might know everything about the difference between NACLs and Security Groups, all the details about how to configure them, their limitations etc., but the main idea is that such an expert doesn’t need to know anything below that (e.g., how the traffic is filtered).

Ultimately my perspective, and I appreciate it’s a very personal one, is that building and working with the Cloud makes me feel like a glorified application administrator. My job becomes researching how the Cloud solved the problem that I need to solve, and composing the solution in the way the Cloud provider imagined it should be solved, rather than solving the problem.

I was going to bring up basically this point:

because vendor-lock is not something that has only to do with infrastructure. It has also to do with the skills of the engineers involved. Cloud knowledge, for the most part, is not portable. You are a wizard of IAM policies in GCP? Good job, this is completely useless if you go to Azure. Oh, you are a guru of VPCs and private endpoints? Well done, this is completely useless if you move to a different cloud.

But you covered it pretty well. Abstractions are great. Proprietary abstractions that are more focused on how they can bill you than real, useful, functional categories? Not so much.

Despite the efforts means something which is ironic: many companies which run on Cloud, at some point, will have one or more teams whose main purpose is understanding how they are spending money in the Cloud and to reduce those costs. If this sounds conflicting with the idea of reducing personnel, well, it is. The digital infrastructure of my organization is not that huge. Give or take 2000 compute instances (some very small). Something that 200 servers could easily provide. Cloud bills are more than $15 million/year. I checked a server builder for example, and an absolute beast (something like 2x Xeon Platinum processors, 200TB of NVMe disks, 1TB of RAM etc.) would still stay comfortably under $250k. 100 servers this powerful will probably be a multiple of our computing power, and cost almost a third if we consider a lifetime of 3 years, which is very low. A more realistic estimation of 5 years leads to a saving of ~$50 million over 5 years. Completely insane! This is of course if you want to buy hardware. Renting powerful servers runs you $500-1000/month. Assuming a cost of $1000/month, my company could rent more than 1000 powerful servers, and still save money compared to Cloud costs, leaving plenty for additional services such as networking, storage, premium support (remote hands) or actual engineers’ salaries.
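For anyone who wants to sanity check the arithmetic in that quote, here is a minimal sketch using only the figures quoted above. The constant and function names are made up for illustration, and the purchase figure covers hardware only (no power, colocation, networking or staff), which is an assumption of this sketch and not something the article states.

```python
# Figures taken from the quoted passage; names are illustrative only.
CLOUD_BILL_PER_YEAR = 15_000_000    # "more than $15 million/year"
SERVER_PRICE = 250_000              # upper bound: "comfortably under $250k"
SERVERS_BOUGHT = 100
RENT_PER_SERVER_MONTH = 1_000       # upper end of the "$500-1000/month" range
SERVERS_RENTED = 1_000


def cloud_cost(years: int) -> int:
    """Cloud spend over the period, assuming a flat yearly bill."""
    return CLOUD_BILL_PER_YEAR * years


def purchase_cost() -> int:
    """One-off hardware cost (assumption: hardware only, no running costs)."""
    return SERVER_PRICE * SERVERS_BOUGHT


def rental_cost(years: int) -> int:
    """Cost of renting the servers for the period."""
    return RENT_PER_SERVER_MONTH * SERVERS_RENTED * 12 * years


def saving_from_buying(years: int) -> int:
    """Cloud spend minus the one-off hardware purchase."""
    return cloud_cost(years) - purchase_cost()


if __name__ == "__main__":
    for years in (3, 5):
        print(f"{years} years: cloud ${cloud_cost(years):,}, "
              f"hardware ${purchase_cost():,}, "
              f"saving ${saving_from_buying(years):,}")
    print(f"renting 1000 servers for a year: ${rental_cost(1):,} "
          f"vs cloud ${cloud_cost(1):,}")
```

With these inputs the hardware comes to one third of the cloud bill over 5 years and a bit over half over 3 years, and the 5-year difference matches the ~$50 million in the quote.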

So there’s a level of rent seeking behind all the software moving to subscriptions, and them wanting to lock you in just like their service providers are doing to them. But I have to think the massive costs of cloud junk also play a role in stuff like a calendar charging double digit annual fees for something that takes very little storage and very little computation (and you of course can’t just buy software any more).

I have no words for multi-cloud. Even for a Facebook- or YouTube-scale site, are you really going to double (or more, for some reason?) your storage costs (plus whatever intercommunication between the two) just in case the provider goes down for a couple of hours? That is extremely rare, and you won’t be the only site impacted, so people won’t really blame you for it. Plus, that architecture sounds like the shitshow to end all shitshows.

BearOfaTime@lemm.ee · edit-2 4 months ago

Agreed on it all.

I think a big driver for cloud clients is bean counters - cloud is an expense, while having your own systems is capital investment.

They’d rather have the waste of leasing too much compute than have to pay taxes on systems plus the cost of staff to run it.

We won’t really see this get addressed until companies have to truly own the risks they take on (see all the hacks that happen on a daily basis because CIO won’t pay for the security that IT management is screaming to build). When fines for these breaches are meaningful, cloud will be less interesting.

loudwhisper@infosec.pub · 4 months ago

Thanks!

But I have to think the massive costs of cloud junk also play a role in stuff like a calendar charging double digit annual fees for something that takes very little storage and very little computation (and you of course can’t just buy software any more).

Absolutely agree. I did not even think about this aspect, but I think you are absolutely spot on. Building something with huge costs is something that ultimately gets passed to the users in addition to the rent-seeking aspect.

I have no words for multi-cloud.

You and me both. I have to work with it and the reality is, there is nobody who actually understands the whole thing. The level of complexity (and fragility, I might add) of it all is astonishing. And all of this to mitigate some (honestly) low risk of downtime from the cloud provider. I have lobbied a little bit against it at work, but ultimately it has become a marketing tool to sell to customers, so goodbye any hope of rational evaluation…

Tja@programming.dev · 4 months ago

It’s all shits and giggles until a network config takes down your cloud provider for 11 hours and you can’t even look at the logs. And multicloud is quite robust if done right, more so than a single cloud. If your setup is fragile, someone is not doing their job right.

loudwhisper@infosec.pub · 4 months ago

Complexity brings fragility. It’s not about doing the job right; it’s that “right” means dealing with such a level of complexity, such a high number of moving parts and configuration options, that the bar is set very high.

Also, I would argue that a large number of organizations don’t actually need the resilience that they pay a very high price for.

Tja@programming.dev · 4 months ago

Complexity in this case should bring redundancy, not fragility. You are adding components in parallel, not in series, thus reducing fragility.

A RAID 5 array is more complex than a single drive, but it’s less fragile.
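The parallel vs. series argument can be put into numbers with the textbook availability formulas. This is only a sketch: it assumes component failures are independent, and the availability values (including the "glue" layer standing in for failover or sync logic) are hypothetical, not measurements of any provider.

```python
from math import prod


def series(availabilities):
    """All components must work, so availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities):
    """At least one component must work, so unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    single = 0.99                    # hypothetical single-provider availability
    redundant = parallel([single, single])
    glue = 0.995                     # hypothetical failover/sync layer, in series
    print(f"single provider:       {single:.4f}")
    print(f"two in parallel:       {redundant:.4f}")
    print(f"parallel behind glue:  {series([redundant, glue]):.4f}")
```

Whether the redundant setup comes out ahead depends on how reliable whatever sits in series with it is, and on how far the independence assumption holds in practice.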

loudwhisper@infosec.pub · 4 months ago

I wish it worked like that, but I don’t think it does. Connecting clouds means introducing many complex problems. Data synchronization and avoiding split-brain scenarios, a network setup way more complex, stateful storage that needs to take into account all the quirks and peculiarities of all services across all clouds, service accounts and permissions that need to be granted and segregated for all of them, and way more. You may gain resilience in some areas, but you introduce a lot more things that can fail, be misconfigured or compromised.

Plus, a complex setup makes it harder by definition to identify SPOFs, especially considering it’s very likely nobody in the workforce is going to be an expert in all the clouds in use.

To keep using your simile of the disks, a single disk with a backup might be a better solution for many people, considering you otherwise might need a RAID controller that can fail and all the knowledge to handle and manage a RAID array properly, in addition to paying 4 or 5 times the storage. Obviously this is just to make a point; I don’t actually think that RAID 5 adds complexity over JBOD comparable to what multi-cloud architecture adds over single-cloud.

Tja@programming.dev · 4 months ago

Split-brain scenarios are easily solved: there are off-the-shelf solutions, and if you have some custom code you can use plenty of well-researched algorithms, for instance Raft. Searching for “Byzantine fault” in Google Scholar yields thousands of papers, if you want something fancier.
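For context, the majority rule that Raft-style consensus relies on to rule out split brain can be sketched as below. This is only the quorum arithmetic, not a Raft implementation, and the function names are hypothetical.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


def can_elect_leader(partition_size: int, cluster_size: int) -> bool:
    """A network partition can elect a leader only if it holds a majority."""
    return partition_size >= quorum(cluster_size)


if __name__ == "__main__":
    # A 5-node cluster split 3/2: only one side can make progress.
    print(can_elect_leader(3, 5), can_elect_leader(2, 5))
    # A 4-node cluster split 2/2: neither side can, so the cluster stalls
    # instead of diverging.
    print(can_elect_leader(2, 4))
```

Because two disjoint partitions can never both hold a strict majority, at most one side keeps accepting writes; the price is that the minority side (or both sides of an even split) stops making progress.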

Same for most problems you mentioned, they were an issue 10 years ago, nowadays you can federate, abstract or outsource most of it.

Making it harder to identify SPOFs doesn’t increase fragility. If your whole system is a single instance, the SPOF is trivial to identify (the whole thing), but the system is very brittle.

loudwhisper@infosec.pub · 4 months ago

Of course the problem is solved, but that doesn’t mean that the solution is easy. Also, distributed protocols still need to work on top of a complicated network and with real-life constraints in terms of performance (to list a few). One bug, misconfiguration or oversight and you have a problem.

Just to give an example, I remember a Kafka cluster with 5 replicas completely shitting its pants for 6h to rebalance data during a planned maintenance where one node was brought offline. It caused one of the longest outages to date, with the websites which relied on it offline. Was it our fault? Was it a misconfiguration? A bug? It doesn’t matter, it’s a complex system which was implemented and probably something was missed.

Technology is implemented by people, and complexity increases the chances of mistakes. I’m not sure this can be argued.

Making it harder to identify SPOFs means you might miss one, and that means carrying liabilities and still having scenarios where your system can crash, in addition to paying quite a lot to build a resilience that you don’t achieve.

A single instance with 2 failure scenarios (disk failure and network failure), to give an example, is not more fragile than a distributed system with 20 failure scenarios. Failure scenarios and SPOFs can have compensating controls and be mitigated successfully. A complex system where these can’t be fully identified can’t have compensating controls, and its residual risk might be much harder to assess. So yes, a single disk is more likely to fail than 3 disks at once, but this doesn’t give the whole picture.

Tja@programming.dev · 4 months ago

The only problem is that the single instance also has 20 scenarios (and keeps the 2 as well), making it more brittle.

A well-designed system removes points of failure; disk, power and network are obvious ones. As long as you keep it Byzantine safe, anything you add should be redundant, so if one component fails the system still runs. Ideally you remove all of them, but if there’s one hidden it’s still better than “the whole thing is a single point of failure”.
