How Enterprises Can Solve the Challenge of AI Resilience

If your organization is like most enterprises today, its generative AI adoption strategy hinges on cloud-based AI services, such as Microsoft Copilot and Google Gemini, that are delivered through hyperscale platforms.

Part of the appeal of these AI solutions is that they are managed by hyperscale vendors with excellent track records of providing high availability and performance. You might think, then, that there is little your organization needs to do to ensure the resilience of its AI services.

In reality, though, there is a litany of risks surrounding generative AI that extend beyond the hosting of AI services themselves. Managing those risks is crucial to developing an effective AI resilience strategy.

What is AI Resilience and Why Does It Matter?

AI resilience is the ability of AI infrastructure and services to operate reliably even in the face of unexpected challenges or disruptions. When AI systems are resilient, they rarely crash or experience significant performance degradations.

As AI becomes a central part of enterprise IT strategies, ensuring the resilience of AI systems is growing more critical. For companies that depend on AI to power mission-critical business services, keeping AI solutions resilient is just as important as maximizing the uptime of traditional IT resources such as servers and storage infrastructure. It’s the only way to retain the ability to innovate using AI while still keeping AI-related risks in check.

AI Resilience Challenges

A variety of problems may cause AI systems to fail. Common AI resilience challenges include:

  • Failure of the infrastructure or data centers that host AI services
  • Disruptions to the data pipelines that move data into and out of AI systems
  • Data quality problems that disrupt the ability to feed accurate, complete data to AI systems
  • A lack of effective prompts for generating accurate, consistent results from AI systems
  • Unexpected increases in the cost of operating AI systems, which could undercut the organization’s ability to use AI solutions effectively

The first item on this list, failure of the infrastructure that hosts AI, tends not to be a major challenge for enterprises today because few organizations host their own AI models. Most instead use AI services provided by hyperscale platforms, which rarely go down and which offer multiple cloud availability zones and regions to limit the impact of outages when they do occur.

But consuming AI services from a hyperscale provider doesn’t mitigate the other resilience challenges listed above. Your data pipelines could fail, or you could experience a major degradation in data quality if the infrastructure that collects and stores data within your business has a problem. Likewise, if you rely on a prompt library (a collection of approved prompts) to interact with AI models, the library could crash, be hacked or otherwise become unavailable. And the quality of your AI output could slip, or your costs could spiral out of control, because of factors such as inefficient prompts or changes to the pricing of the AI models your business uses.

What’s worse, you may not know about these problems until it’s too late. The performance and availability of traditional workloads are easy to track in real time using conventional monitoring and observability software. It’s much rarer for organizations to have tools in place that automatically generate alerts when data pipelines can’t feed information to AI services quickly enough, when the cost of prompts suddenly surges or when response quality plummets.
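To make the alerting gap concrete, here is a minimal sketch of one way to flag a sudden surge in a tracked value such as cost per prompt or pipeline latency. It is an illustration only: the SurgeDetector class, its parameters and the "twice the rolling average" rule are assumptions made for this example, not features of any particular monitoring product.

```python
from collections import deque
from statistics import mean


class SurgeDetector:
    """Flags a value that exceeds the rolling average by a chosen factor.

    Hypothetical helper: the window size, factor and minimum sample count
    are illustrative defaults, not recommended settings.
    """

    def __init__(self, window: int = 20, factor: float = 2.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # most recent observations only
        self.factor = factor
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record a value and report whether it is a surge against the baseline."""
        surge = False
        # Only judge once there is enough history to form a baseline.
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            surge = baseline > 0 and value > baseline * self.factor
        self.history.append(value)
        return surge


# Example: feed per-prompt costs (in dollars) as they are recorded.
detector = SurgeDetector(window=10, factor=2.0, min_samples=5)
for cost in [0.01, 0.01, 0.01, 0.01, 0.01, 0.05]:
    if detector.observe(cost):
        print(f"Alert: cost per prompt jumped to ${cost:.2f}")
```

The same pattern could watch pipeline latency. For response quality, where lower is worse, the comparison would need to be inverted.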

How to Maximize the Resilience of Enterprise AI

To plug this gap, businesses must implement new types of solutions that help ensure AI resilience.

Doing so starts with establishing holistic quality checks for AI. This means systematically tracking interactions with AI models and services across the entire company and monitoring the performance and cost of each transaction.

When you can do this, you can quickly identify problems such as slow data pipelines or data quality issues. You can also track cost and performance at a level of granularity that lets you compare how different models respond to the same prompt, making it possible to determine which model delivers the best balance between cost and performance for a given type of prompt.
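As a rough illustration of what tracking each transaction might look like, the sketch below logs interactions and then picks the cheapest model that still meets a quality floor for a given type of prompt. Everything in it is an assumption made for the example: the Interaction fields, the 0-to-1 quality score, the model names and the 0.8 threshold are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    """One logged call to an AI model (hypothetical record layout)."""
    model: str
    prompt_type: str
    latency_s: float
    cost_usd: float
    quality: float  # 0.0 to 1.0, from whatever scoring process is in use


def summarize(log):
    """Average latency, cost and quality per (prompt type, model) pair."""
    grouped = defaultdict(list)
    for item in log:
        grouped[(item.prompt_type, item.model)].append(item)
    return {
        key: {
            "avg_latency_s": mean(i.latency_s for i in items),
            "avg_cost_usd": mean(i.cost_usd for i in items),
            "avg_quality": mean(i.quality for i in items),
        }
        for key, items in grouped.items()
    }


def best_value_model(log, prompt_type, min_quality=0.8):
    """Cheapest model whose average quality meets the floor, or None."""
    candidates = [
        (stats["avg_cost_usd"], model)
        for (ptype, model), stats in summarize(log).items()
        if ptype == prompt_type and stats["avg_quality"] >= min_quality
    ]
    return min(candidates)[1] if candidates else None


# Example log with made-up model names and numbers.
log = [
    Interaction("model-a", "summary", 1.2, 0.020, 0.90),
    Interaction("model-b", "summary", 0.8, 0.005, 0.85),
    Interaction("model-c", "summary", 0.4, 0.001, 0.50),
]
print(best_value_model(log, "summary"))
```

In practice the log would come from wherever AI calls are recorded across the company, and the quality floor would differ by use case.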

At present, commercial solutions that deliver functionality like this are hard to find. But implementing an AI quality monitoring solution in-house is more feasible than it might sound. At Lemongrass, we’ve done it by using multiple AI models to evaluate the performance of other models, allowing us to identify resilience issues quickly while also optimizing our costs.
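The general idea of having several models evaluate another model's output can be sketched in a few lines. This is not a description of the Lemongrass implementation, whose details are not given here. The judge functions below are stand-ins that return fixed scores, and the median aggregation and the 0.3 disagreement threshold are assumptions chosen for illustration.

```python
from statistics import median
from typing import Callable, Dict

# A judge takes (prompt, response) and returns a score from 0.0 to 1.0.
# In a real system each judge would wrap a call to a different AI model.
Judge = Callable[[str, str], float]


def panel_score(prompt: str, response: str, judges: Dict[str, Judge],
                max_spread: float = 0.3) -> dict:
    """Score a response with several judges and flag strong disagreement."""
    scores = {name: judge(prompt, response) for name, judge in judges.items()}
    values = list(scores.values())
    return {
        "scores": scores,
        "consensus": median(values),  # the median resists a single outlier judge
        "needs_review": max(values) - min(values) > max_spread,
    }


# Stand-in judges with fixed scores, purely for demonstration.
judges = {
    "judge-1": lambda prompt, response: 0.90,
    "judge-2": lambda prompt, response: 0.80,
    "judge-3": lambda prompt, response: 0.85,
}
result = panel_score("Summarize the Q3 report.", "Revenue rose in Q3.", judges)
print(result["consensus"], result["needs_review"])
```

Using more than one judge lowers the chance that a single evaluator's blind spot goes unnoticed, and a wide spread between judges is a useful signal that a human should take a look.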

Getting Started With AI Resilience Today

Given that AI solutions and enterprise AI strategies are still rapidly evolving, it might seem reasonable to wait until the adoption process is complete and the technology has fully matured to worry about AI resilience.

But that would be a major mistake. The very fact that enterprise AI technology remains so fluid is part of the reason why businesses need to start thinking about, and acting on, ways to mitigate resilience risks today. AI’s fast-evolving nature breeds substantial risk, and the only way to manage that risk is to develop a resilience strategy.

The sooner organizations do this by comprehensively monitoring the performance and quality of the AI systems they use, the better positioned they will be to leverage AI as a driver of innovation while protecting themselves against disruption to the AI-powered services that businesses increasingly depend on.

Published: Intelligent CIO North America
