Building on Solid Foundations: Why Your Cloud Infrastructure Matters More Than Ever

Maurice Manning

TLDR

This post reflects on yesterday’s massive AWS outage, caused by a DNS resolution failure in AWS’s US-EAST-1 region that cascaded through numerous services globally. From my perspective as part of our vCISO team, the outage starkly illustrates that relying on a single cloud provider or DNS system, without resilient multi-provider DNS and multi-cloud failover strategies, is akin to building your business on sand.

I stress that while we are primarily a Google Cloud Platform (GCP) shop, we work with clients on all clouds, advocating a multi-cloud approach rather than replacing existing providers. GCP offers significant advantages, especially for generative AI workloads through Vertex AI, Google’s TPU infrastructure, and the Gemini AI ecosystem. Google Workspace also serves as an underused but vital business continuity solution, complementing existing productivity suites.

The post walks through practical DNS resilience configurations: multi-provider authoritative DNS, with registrar-level NS records at GoDaddy delegating to both AWS Route 53 and Google Cloud DNS; geographic DNS distribution; automated health-check-driven failover; and DNSSEC. Combined with automated monitoring and regular disaster recovery testing, this DNS strategy is crucial to mitigating DNS-centric disruption risks like yesterday’s outage.

Finally, I connect these resilience strategies to the broader risks that CISOs and vCISO teams manage daily—third-party vendor risk, business continuity planning, and incident response—and affirm that effective multi-cloud architecture and infrastructure-as-code practices are foundational to supporting mission-critical AI deployments with secure, compliant, and reliable infrastructure.

In short, yesterday’s outage highlights the imperative for organisations to embrace multi-cloud, invest in DNS resilience, and build infrastructure foundations that endure—not just perform.


Full Article

Monday’s AWS outage was a stark reminder of something I discuss with clients constantly: your cloud infrastructure is only as reliable as the foundations you’ve built it on. As I watched over 1,000 companies scramble to restore services, 6.5 million users lose access to critical applications, and businesses lose revenue by the minute, the same thought kept running through my mind: for many of them, this was entirely preventable.

I’m not writing this to criticise AWS; outages happen to every provider, including Google Cloud. I’m writing it to share what we’ve learned as a vCISO team helping organisations build resilient, future-proof cloud architectures. Because here’s the uncomfortable truth: if you’re betting your entire business on a single cloud provider with no failover strategy, you’re building your house on sand.

Your cloud infrastructure is only as reliable as the foundations you’ve built it on.

The DNS Reality Check We All Needed

Yesterday’s outage originated from something deceptively simple: a DNS resolution failure for DynamoDB API endpoints in AWS’s US-EAST-1 region. The Domain Name System—essentially the internet’s phone book—couldn’t translate DynamoDB service names into their corresponding IP addresses. When that happened, applications couldn’t find their databases, websites couldn’t reach their authentication systems, and smart home devices couldn’t phone home.

This is exactly the kind of single point of failure that keeps me awake at night as part of a vCISO team.

What made this particularly devastating wasn’t just that DNS failed—it’s that so many organisations had zero DNS resilience strategy. No multi-provider DNS configuration. No automated failover to alternative regions. No geographically distributed name servers. When AWS’s DNS resolution broke, these companies discovered they’d built their entire digital presence on a single DNS dependency.

Here’s what proper DNS resilience looks like in practice:

Multi-Provider DNS Architecture: Configure authoritative DNS across multiple providers (we typically recommend Google Cloud DNS alongside your primary provider). When one DNS provider experiences issues, queries automatically resolve through healthy alternatives.
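
Here’s a minimal Terraform sketch of that pattern (domain, zone names, and IP addresses are placeholders): authoritative zones in both Route 53 and Cloud DNS carry the same records, and the registrar, GoDaddy in our typical setup, lists the name servers from both zones so resolvers can query either provider.

  # Authoritative zone in each provider for the same domain.
  resource "aws_route53_zone" "primary" {
    name = "example.com"
  }

  resource "google_dns_managed_zone" "secondary" {
    name     = "example-com"
    dns_name = "example.com."
  }

  # The same record is published to both zones; if one provider's DNS is
  # unreachable, resolvers still get an answer from the other.
  resource "aws_route53_record" "www_aws" {
    zone_id = aws_route53_zone.primary.zone_id
    name    = "www.example.com"
    type    = "A"
    ttl     = 300
    records = ["203.0.113.10"]
  }

  resource "google_dns_record_set" "www_gcp" {
    managed_zone = google_dns_managed_zone.secondary.name
    name         = "www.example.com."
    type         = "A"
    ttl          = 300
    rrdatas      = ["203.0.113.10"]
  }

Keeping the two zones in sync is the operational cost of this pattern; we typically drive both from the same IaC pipeline, or use a record-sync tool such as octodns.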

Geographic DNS Distribution: Distribute DNS servers across multiple regions and providers. If AWS’s US-EAST-1 DNS fails, your DNS infrastructure should seamlessly serve responses from GCP’s europe-west2 or asia-southeast1 regions.
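
Running zones in more than one provider (as in the sketch above) takes care of the name-server side; you can also shape the answers themselves by geography. A hedged Route 53 illustration, with placeholder hostnames and IPs:

  # Geolocation routing in the Route 53 zone from the sketch above: EU
  # resolvers get a Europe-hosted endpoint, everyone else gets the default.
  resource "aws_route53_record" "app_eu" {
    zone_id        = aws_route53_zone.primary.zone_id
    name           = "app.example.com"
    type           = "A"
    ttl            = 60
    records        = ["198.51.100.20"]   # e.g. a europe-west2 front end
    set_identifier = "eu"

    geolocation_routing_policy {
      continent = "EU"
    }
  }

  resource "aws_route53_record" "app_default" {
    zone_id        = aws_route53_zone.primary.zone_id
    name           = "app.example.com"
    type           = "A"
    ttl            = 60
    records        = ["203.0.113.10"]
    set_identifier = "default"

    geolocation_routing_policy {
      country = "*"   # catch-all for locations without a specific record
    }
  }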

Health-Check-Driven Failover: Implement automated health monitoring that detects endpoint failures and updates DNS records in real time, redirecting traffic to healthy alternatives before users even notice degradation.
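
A hedged Route 53 sketch of that pattern (the /healthz path, hostnames, and IPs are placeholders): the health check probes the primary endpoint, and while it is failing, Route 53 answers queries with the secondary record instead.

  # Health check against the primary endpoint.
  resource "aws_route53_health_check" "api_primary" {
    fqdn              = "api-primary.example.com"   # resolves to the primary
    port              = 443
    type              = "HTTPS"
    resource_path     = "/healthz"                  # assumed health endpoint
    failure_threshold = 3
    request_interval  = 30
  }

  # Primary answer, served only while the health check passes.
  resource "aws_route53_record" "api_primary" {
    zone_id         = aws_route53_zone.primary.zone_id
    name            = "api.example.com"
    type            = "A"
    ttl             = 60
    records         = ["203.0.113.10"]      # e.g. the AWS-hosted endpoint
    set_identifier  = "primary"
    health_check_id = aws_route53_health_check.api_primary.id

    failover_routing_policy {
      type = "PRIMARY"
    }
  }

  # Standby answer, served automatically when the primary is unhealthy.
  resource "aws_route53_record" "api_secondary" {
    zone_id        = aws_route53_zone.primary.zone_id
    name           = "api.example.com"
    type           = "A"
    ttl            = 60
    records        = ["198.51.100.20"]      # e.g. a GCP-hosted standby
    set_identifier = "secondary"

    failover_routing_policy {
      type = "SECONDARY"
    }
  }

Keep TTLs short on any record you expect to fail over; at 60 seconds, most resolvers pick up the change within about a minute.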

DNSSEC and Security Controls: Enable DNS Security Extensions across all providers to prevent cache poisoning and spoofing attacks during failover scenarios.
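
On the Cloud DNS side that is a one-block change to the zone from the first sketch; on Route 53 the equivalent is a key-signing key plus the hosted-zone DNSSEC resource, and in both cases the resulting DS record must be published at the registrar. One caveat worth flagging: signing the same domain from two independent providers requires a deliberate multi-signer setup, because each provider signs with its own keys.

  # Extending the Cloud DNS zone from the earlier sketch with DNSSEC signing;
  # Cloud DNS manages the zone-signing and key-signing keys itself.
  resource "google_dns_managed_zone" "secondary" {
    name     = "example-com"
    dns_name = "example.com."

    dnssec_config {
      state = "on"
    }
  }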

The organisations that weathered yesterday’s outage successfully had invested in these DNS resilience fundamentals. This is exactly the kind of infrastructure foundation work I implement for clients as a Principal Cloud Architect within our vCISO team: DNS failures are inevitable, but DNS-driven business disruption is entirely optional.

The AI Gold Rush and the Infrastructure Nobody Talks About

Everyone’s racing to implement generative AI right now. Gemini, ChatGPT, custom LLMs—the boardroom conversations are electrifying. We’re seeing organisations spinning up AI projects at breakneck speed, desperate not to fall behind competitors. And I get it. The potential is genuinely transformative.

But here’s what keeps me up at night as someone who’s part of a vCISO team: 98% of organisations are exploring generative AI, but most haven’t built the infrastructure foundation to support it reliably.

Think about what happens when you deploy a production AI service on infrastructure that hasn’t been hardened for resilience:

  • Your AI-powered customer service chatbot goes dark during a provider outage
  • Your generative content pipeline stops mid-campaign
  • Your AI-driven analytics engine loses access to real-time data feeds
  • Your compliance-critical AI monitoring tools can’t alert you to security incidents

You’ve just built a Ferrari on a foundation of sand. And when the tide comes in—and it will, as we saw yesterday—everything collapses.

You’ve just built a Ferrari on a foundation of sand.

Why Google Cloud Platform Deserves Your Attention

I want to talk about Google Cloud Platform, not because Wursta is a GCP-only shop (we work with clients across all cloud providers), but because GCP offers some genuinely compelling advantages that are often overlooked in the AWS-Azure duopoly conversation.

The AI Infrastructure Advantage

Google didn’t bolt AI onto an existing cloud platform—they built their cloud infrastructure around AI from the ground up. When you’re deploying generative AI workloads on GCP, you’re leveraging:

  • Vertex AI: A unified platform that handles everything from model training to deployment with enterprise-grade security and governance built in from day one. No bolting on compliance controls as an afterthought.
  • TPU v5p and Ironwood chips: Google’s custom AI accelerators deliver up to 2.5× better inference performance per dollar compared to previous generations. When you’re running AI at scale, that cost efficiency isn’t a nice-to-have—it’s the difference between a sustainable AI strategy and a budget black hole.
  • The Gemini ecosystem: Access to Google’s cutting-edge multimodal models (Gemini 2.5 Flash, Imagen for images, Veo for video) with a 2 million token context window. For context-heavy AI applications, that’s game-changing.
  • Global network infrastructure: Google’s private fibre network consistently delivers the lowest median latency globally—around 40ms for static content delivery to UK/Europe users versus 50ms for Azure and 60ms for AWS. When you’re building real-time AI applications, those milliseconds matter.
  • DNS Resilience Built-In: Google Cloud DNS offers geographically distributed anycast name servers, health-check-driven routing policies, and DNSSEC out of the box, and it slots cleanly into a multi-provider DNS architecture. After yesterday’s AWS outage, this isn’t a technical nice-to-have; it’s business insurance.

The Multi-Cloud Imperative

Here’s where things get interesting. We’re not saying “rip out AWS and go all-in on GCP” or abandon your existing cloud provider entirely. That would be just as foolish as the single-provider strategy that left so many companies vulnerable yesterday.

What we advocate for is strategic multi-cloud architecture.

Yesterday’s AWS outage demonstrated what I’ve been warning clients about: resilience in 2025 means distributing critical workloads across multiple cloud providers. Not because any single provider is unreliable, but because no provider is immune to disruption.

A well-architected multi-cloud strategy gives you:

  • Resilience Against Outages: When AWS US-EAST-1 goes down, your failover to GCP europe-west2 keeps operations running. Your customers don’t know (or care) which cloud you’re on—they just know your service works.
  • Best-of-Breed Services: Run your AI training workloads on Google’s TPUs and Vertex AI, leverage Google Workspace for productivity and business continuity, and use your existing cloud provider for specific legacy workloads that make sense to maintain. Why limit yourself to one provider’s interpretation of “best practice” when you can cherry-pick the actual best solutions?
  • Cost Optimisation: Multi-cloud isn’t just about resilience—it’s about leverage. When you can genuinely move workloads between providers, you can negotiate better pricing. And with AI infrastructure costs spiralling, that negotiating power is worth its weight in gold.
  • Regulatory Compliance: Different providers have different regional data centre footprints. Multi-cloud gives you the flexibility to meet data sovereignty requirements across jurisdictions without architectural gymnastics.

The Google Workspace Angle Nobody Considers

Here’s something most organisations miss: if you’re using Microsoft 365 and AWS, you have zero failover for your productivity layer.

I’ve seen this play out in real time. A client running entirely on Microsoft 365 experienced an Azure Active Directory outage. Email gone. Teams gone. SharePoint gone. Every collaboration tool their business depended on—offline simultaneously.

Google Workspace as a business continuity solution is an underutilised strategy. Even if Google Workspace isn’t your primary productivity suite, having it as a dormant failover environment means:

  • Pre-configured access for critical personnel during a Microsoft 365 outage
  • Synchronised essential data that can be activated during a crisis
  • Communication channels that remain operational when your primary tools fail
  • Chromebooks configured with balanced security that can be distributed to key teams

We help clients implement this as part of comprehensive Business Continuity Planning. It’s not about replacing Microsoft—it’s about having a plan B that actually works when everything else fails. As someone working within a vCISO capacity, I’ve seen too many organisations discover during a crisis that their “disaster recovery plan” was just wishful thinking.

Building Foundations That Last

So what does a solid cloud foundation actually look like in 2025? Based on the hundreds of architectures I’ve reviewed and rebuilt as part of our vCISO team, here’s what resilient infrastructure requires:

1. Multi-Region, Multi-Cloud by Design

Not as an afterthought. Your architecture should assume provider failures will happen and design around them from day one. This includes DNS failover strategies that don’t depend on your primary cloud provider’s DNS infrastructure.

2. Infrastructure as Code with Disaster Recovery Built In

If you can’t rebuild your entire infrastructure in another region or another cloud within hours, your IaC isn’t mature enough. We help clients implement Terraform configurations that are genuinely cloud-agnostic, with DNS routing policies that automatically failover to healthy endpoints.
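
Terraform modules are provider-specific under the hood, so “cloud-agnostic” in practice means a common root configuration, shared variables, and one module per provider that each describe the same service. A deliberately simplified sketch (module paths, names, and variables here are hypothetical):

  provider "aws" {
    region = "eu-west-2"
  }

  provider "google" {
    project = var.gcp_project_id
    region  = "europe-west2"
  }

  # The same service, described once per cloud. Either deployment can be
  # rebuilt on its own with a targeted apply if the other provider is down.
  module "service_aws" {
    source      = "./modules/service-aws"   # hypothetical module
    environment = var.environment
    dns_name    = "app.example.com"
  }

  module "service_gcp" {
    source      = "./modules/service-gcp"   # hypothetical module
    environment = var.environment
    dns_name    = "app.example.com"
  }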

3. Data Sovereignty and Replication Strategy

Where is your data? How often is it replicated? Can you access it if your primary cloud goes dark? These aren’t theoretical questions—they’re the ones being asked by executives during an outage when it’s too late to implement solutions.
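
One small, hedged piece of the answer: keep an independent, versioned copy of critical backups in a second cloud, in a location you have chosen deliberately. The bucket below is an illustration with placeholder names and retention, not a full replication strategy.

  # Off-cloud backup target: an EU multi-region Cloud Storage bucket with
  # versioning, so backups survive a primary-cloud outage or a bad delete.
  resource "google_storage_bucket" "dr_backups" {
    name                        = "example-dr-backups"   # globally unique name
    location                    = "EU"                   # multi-region, EU residency
    storage_class               = "STANDARD"
    uniform_bucket_level_access = true

    versioning {
      enabled = true
    }

    lifecycle_rule {
      condition {
        age = 90            # placeholder retention, in days
      }
      action {
        type = "Delete"
      }
    }
  }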

4. Automated Monitoring and Failover

Manual failover procedures fail. Humans panic during crises. Automated health monitoring that triggers pre-tested failover procedures—including DNS record updates—is the only approach that works at 3 AM when AWS US-EAST-1 is returning errors.
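
Route 53 performs the DNS flip itself once a health check fails (as in the earlier failover sketch); the hedged addition below simply makes sure humans hear about it. Note that Route 53 publishes health-check metrics to CloudWatch only in us-east-1, and the SNS topic here is a placeholder for whatever paging channel you actually use.

  resource "aws_sns_topic" "oncall" {
    name = "dns-failover-oncall"   # placeholder paging topic
  }

  # Alarm when the primary health check (from the failover sketch above)
  # reports unhealthy for two consecutive minutes. HealthCheckStatus is 1
  # when healthy and 0 when unhealthy.
  resource "aws_cloudwatch_metric_alarm" "dns_failover_triggered" {
    alarm_name          = "dns-failover-triggered"
    namespace           = "AWS/Route53"
    metric_name         = "HealthCheckStatus"
    statistic           = "Minimum"
    period              = 60
    evaluation_periods  = 2
    comparison_operator = "LessThanThreshold"
    threshold           = 1

    dimensions = {
      HealthCheckId = aws_route53_health_check.api_primary.id
    }

    alarm_actions = [aws_sns_topic.oncall.arn]
  }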

5. Regular Disaster Recovery Testing

We insist clients conduct quarterly DR drills that include cloud provider failure scenarios and DNS resolution failures. Because the only way to know your failover works is to test it under realistic conditions.

The AI Infrastructure Reality Check

Let me bring this back to AI because that’s where we’re seeing the most dangerous infrastructure decisions right now.

Generative AI is expensive. Training costs, inference costs, token costs—they add up fast. And when organisations rush to deploy AI without proper infrastructure foundations, they end up with:

  • Security vulnerabilities: AI models accessing data they shouldn’t, with no proper governance layer
  • Compliance nightmares: Training data crossing jurisdictional boundaries, violating data sovereignty requirements
  • Cost overruns: Inefficient inference pipelines burning through budgets at eye-watering rates
  • Reliability failures: AI services that go dark during provider outages, destroying user trust

Google Cloud’s AI infrastructure advantage isn’t just about performance—it’s about providing the governance, security, and cost controls that production AI deployments actually need.

When we help clients deploy generative AI on GCP, we’re implementing:

  • Data residency controls that keep sensitive data within specific geographic boundaries
  • Model versioning and governance through Vertex AI Model Registry
  • Cost monitoring and budget alerts that prevent runaway inference costs (see the sketch after this list)
  • Security controls that ensure AI models can only access appropriately permissioned data
  • Multi-region deployment patterns that survive regional outages
  • DNS resilience strategies that ensure AI services remain accessible even during provider DNS failures
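
To make one of those controls concrete, here is a hedged sketch of the budget-alert piece (billing account, project, amount, and currency are all placeholders): a billing budget scoped to the AI project that notifies at 80% and 100% of the monthly figure.

  # Billing budget scoped to the AI project; notifications fire at 80% and
  # 100% of the monthly amount so runaway inference spend is caught early.
  resource "google_billing_budget" "ai_inference" {
    billing_account = var.billing_account_id        # placeholder
    display_name    = "vertex-ai-inference-budget"

    budget_filter {
      projects = ["projects/${var.ai_project_id}"]  # placeholder project
    }

    amount {
      specified_amount {
        currency_code = "GBP"
        units         = "5000"                      # placeholder monthly cap
      }
    }

    threshold_rules {
      threshold_percent = 0.8
    }
    threshold_rules {
      threshold_percent = 1.0
    }
  }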

This is the foundational work that doesn’t make headlines but determines whether your AI initiative succeeds or becomes a cautionary tale.

What We Bring to the Table

As a Principal Cloud Architect within our vCISO team, my perspective is different from traditional cloud consultancies. I’m not just optimising for performance or cost—I’m architecting for risk.

When we review a client’s infrastructure, we’re asking:

  • What happens when your primary cloud provider fails?
  • Can you prove compliance during an audit when your logging infrastructure is offline?
  • How do you maintain business operations when your identity provider is unreachable?
  • What’s your RTO (Recovery Time Objective) and RPO (Recovery Point Objective), and can your architecture actually achieve them?
  • Do you have DNS failover strategies that don’t depend on your primary provider?

These are vCISO-level questions because cloud resilience is fundamentally a security and risk management challenge.

Yesterday’s AWS outage fell squarely within the vCISO domain because it impacted:

  • Third-party vendor risk management: Is your cloud provider a single point of failure?
  • Business continuity planning: Can your organisation operate during extended cloud outages?
  • Incident response: Do you have playbooks for “critical vendor down” scenarios?

As part of our vCISO team, I help organisations answer these questions before the crisis, not scramble to figure them out during a 9-hour outage.

The Path Forward

If you’re running critical workloads on a single cloud provider—whether that’s AWS, Azure, or even Google Cloud—you need to ask yourself: what’s my plan when (not if) they experience an outage?

If the answer is “wait and hope they fix it quickly,” you’re building on sand.

Here’s what we recommend:

  1. Audit your current cloud dependencies to identify single points of failure, particularly DNS dependencies
  2. Evaluate GCP as a resilience layer, particularly for AI workloads where it genuinely excels
  3. Implement multi-cloud architecture for business-critical services with automated failover and DNS resilience
  4. Consider Google Workspace as a business continuity solution alongside your existing productivity suite
  5. Build infrastructure as code that’s cloud-agnostic and can deploy across multiple providers
  6. Test your disaster recovery procedures with real cloud provider failure scenarios, including DNS resolution failures

The organisations that weathered yesterday’s AWS outage with minimal impact weren’t lucky—they were prepared. They’d invested in the foundational architecture that treated cloud resilience as a first-class concern, not an afterthought.

Let’s Build Something Solid Together

The generative AI revolution is real, and the opportunities are extraordinary. But if you’re building AI capabilities on infrastructure that can’t survive a provider outage, you’re setting yourself up for failure.

We specialise in building cloud architectures that don’t just perform—they endure. Multi-cloud strategies that balance Google Cloud’s AI advantages with the resilience that comes from thoughtful provider diversification. Google Workspace integrations that provide genuine business continuity. Infrastructure as code that’s truly cloud-agnostic. Security and compliance controls that survive provider disruptions. DNS resilience strategies that ensure your services remain accessible even when major providers experience DNS failures.

Because at the end of the day, the most advanced AI model in the world is worthless if it’s running on infrastructure built on sand.

If yesterday’s outage made you question your cloud resilience strategy, let’s talk. As part of our vCISO team, we’re here to help you build foundations that last.

Visit our website to learn more about our vCISO managed service. Ready to chat about a vCISO partnership? Connect with us today!


Meet the Author – Maurice Manning – As a Principal Cloud Architect within our vCISO team, I work with organisations to design secure, resilient, multi-cloud architectures that support their most critical workloads, including generative AI deployments. If you’d like to discuss how Google Cloud Platform or Workspace could enhance your business and infrastructure resilience, or explore multi-cloud strategies that provide genuine business continuity, reach out to our team.