Opinion: AWS’s avoidable cloud outage highlights a bigger problem

  • Hyperscalers don’t have to follow the same critical infrastructure rules as carriers – and they don’t
  • The fallout from AWS, Google, Microsoft and others bringing their IT department playbook into critical industries will likely be catastrophic
  • If hyperscalers won’t hold themselves to a higher standard, lawmakers should do it for them

Cloud services are made up of many complicated, cutting-edge technologies.

DNS (the Domain Name System) is not one of them. It is the 40-year-old internet phonebook, translating human-readable domain names into numeric IP addresses.
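To show how simple the phonebook lookup is from an application's point of view, here is a minimal Python sketch that uses the standard library's `socket.getaddrinfo`. The helper name `resolve` is an illustrative assumption, and this says nothing about how AWS runs DNS internally.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the IP addresses the system resolver reports for a hostname."""
    # getaddrinfo asks the operating system's resolver, which consults
    # local configuration and then DNS as needed.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "localhost" is used so the sketch runs without internet access.
    print(resolve("localhost"))
```

Swapping in a public domain name performs a real DNS query. When that query fails, every service that depends on the name is unreachable, even if the servers behind it are healthy.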

In terms of sophistication, DNS is to cloud communications what pre-school finger-painting is to Géricault’s ‘The Raft of the Medusa.’

And yet it was a DNS malfunction that caused this week’s major network outage on the AWS cloud service.

Tier 1 carriers implement so-called eternal redundancy to avoid a failure in this fundamental infrastructure element — with multiple layers of hardware, software and data replication to keep DNS and other functions running non-stop in the face of everything from terrorist bombs to power outages to cyberattacks.
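The basic idea behind that kind of redundancy can be sketched in a few lines of Python: ask several independent resolvers in turn, so that one failure does not take the service down. Everything here is hypothetical (the in-memory resolvers, the names `make_resolver` and `resolve_with_failover`, and the example address). It is a toy illustration of failover, not a description of any carrier's actual design.

```python
from typing import Callable, Optional, Sequence

# A resolver maps a name to an address, or None if it has no record.
Resolver = Callable[[str], Optional[str]]


def make_resolver(records: dict[str, str], healthy: bool = True) -> Resolver:
    """Build a hypothetical in-memory resolver that can simulate an outage."""
    def resolve(name: str) -> Optional[str]:
        if not healthy:
            raise ConnectionError("resolver unavailable")
        return records.get(name)
    return resolve


def resolve_with_failover(name: str, resolvers: Sequence[Resolver]) -> str:
    """Try each redundant resolver in order and return the first answer."""
    last_error = None
    for resolver in resolvers:
        try:
            address = resolver(name)
        except ConnectionError as exc:
            last_error = exc  # remember the failure, then try the next replica
            continue
        if address is not None:
            return address
    raise LookupError(f"no resolver could answer for {name!r}") from last_error


if __name__ == "__main__":
    records = {"service.example": "192.0.2.10"}  # documentation-range address
    replicas = [make_resolver(records, healthy=False), make_resolver(records)]
    print(resolve_with_failover("service.example", replicas))
```

In this sketch the first replica is down, yet the lookup still succeeds because the second one answers. The lookup fails only when every replica is unavailable.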

But cloud hyperscalers like AWS are not Tier 1 carriers, and they don’t build their networks with the same discipline or reliability as their telco counterparts. They’re comfortable with the less expensive, cobbled-together three 9’s (99.9% uptime) reliability associated with IT departments, versus the five 9’s (99.999% uptime) gold standard adhered to by carriers.
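The gap between three 9’s and five 9’s is easier to grasp as downtime per year. The short Python sketch below does the arithmetic, assuming a 365-day year and treating "N nines" as an availability of 1 − 10^−N. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime permitted by an availability of N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in (3, 5):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines} nines: {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, three 9’s permits roughly 8.8 hours of downtime a year, while five 9’s permits a little over five minutes: a hundredfold difference.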

The results of hyperscalers’ reckless ‘move fast and break the network’ approach were clear in the AWS outage.

It’s one thing when consumer platforms like Fortnite and Snapchat are offline, as they were this week. But hyperscalers are all now branching into vertical industries (healthcare, transportation, energy and more) as part of the Industry 4.0 shift, looking to tackle industry-specific challenges with cloud, AI, and data analytics. These businesses are the high-wire acts of the communications industry, where failure can often be a matter of life or death.

What should be done to head off the consequences of mixing critical data with second-in-class services?

Fake news

According to the media and analysts, nothing.

This is Amazon’s third major outage in five years (all of them originated in the same US-EAST-1 data centre cluster in Virginia, which is mind-boggling). Yet pundits took an “oops, they did it again” tone with their commentary, initially downplaying the impact of the AWS blackout by estimating the number of users affected in the “thousands,” increasing that to “hundreds of thousands” later in the day, before finally settling on “millions.” Considering the number of daily users of the affected consumer services, the actual total is in the hundreds of millions. It may even have exceeded half a billion.

Reuters went further. Speaking on its eponymous news channel, Reuters tech reporter Stephen Nellis blamed AWS’s customers for not investing enough time and money to develop workarounds that could save them when AWS’s shitty cloud service failed (and welcome, I guess, to a new world where when you are stranded because your car has failed due to a manufacturer's defect, it’s your fault for not always towing a spare vehicle behind you at your own expense).

The solution

One reason carriers build really reliable networks, and hyperscalers really don’t, is company culture, a trickle-down trait. By the admittedly low standards of CEOs in general, carrier execs are a decent bunch, whereas Elon Musk, Jeff Bezos, Mark Zuckerberg, Alex Karp and so on are not.

But another reason is that there are laws in place to ensure that carriers do their jobs correctly. Carrier services worldwide are governed by independent oversight mechanisms and government rules that ensure they meet reliability standards for critical infrastructure. In the US, for example, there are five organisations (three within the FCC) that create and enforce the regulations and audit networks to ensure compliance.

Hyperscalers don’t have to comply with any of these rules. They are part of a deregulated zone (DRZ), where they can build as they wish, without fear of consequence. A communications free-for-all. Except that it’s not (free, for all). When cloud services from AWS and others fail, their customers pay the price. As hyperscalers start to creep into vertical industries, the price is going to be exponentially higher.

The simplest way to head off what is likely to be a series of increasingly severe hyperscaler-caused disasters across critical industries is for the US government to extend regulatory oversight to cloud services. Given the unethical, cash-based relationship between Big Tech and the administration, such a scenario is unlikely to occur. That leaves enterprises and industries to control their own future by selecting traditional telco network providers and their partner ecosystems for critical services.

Steve Saunders is a British-born communications analyst, investor and digital media entrepreneur with a career spanning decades.


Opinions from industry experts, analysts or our editorial staff do not represent the opinions of Fierce Network.