When AI Goes Wrong — Enterprise AI Failure Report

01 AI Hallucination

Air Canada

2022 – 2024

A grieving passenger asked Air Canada's AI chatbot about bereavement fares. The bot invented a policy that didn't exist. The airline tried to blame the chatbot itself. A court disagreed.

⚙ AI Model Pre-LLM NLP Chatbot (pre-ChatGPT era, rule-based + ML, vendor undisclosed)

ℹ The incident occurred in 2022 — before modern LLMs were widely deployed. The chatbot used an earlier generation of NLP, not GPT-4 or similar. Air Canada removed it entirely in April 2024.

What Happened

In November 2022, Jake Moffatt visited Air Canada's website to book a flight from Vancouver to Toronto after his grandmother's death. He consulted the airline's AI-powered chatbot, which told him he could book a full-fare ticket immediately and then apply for a reduced bereavement rate retroactively within 90 days of ticket issuance. Moffatt screenshotted the conversation, booked the flight, attended the funeral, and submitted a refund application — which Air Canada promptly denied.

The airline's actual policy explicitly prohibited retroactive bereavement fare applications. The chatbot had hallucinated an entirely fictional policy. When Moffatt escalated, Air Canada offered a $200 voucher. He refused and filed a complaint with the British Columbia Civil Resolution Tribunal.

Air Canada's legal defense was remarkable in its audacity: in the Tribunal's words, the company suggested that its chatbot was "a separate legal entity that is responsible for its own actions," so Air Canada bore no liability for what it said. The Tribunal rejected this argument entirely, ruling that a chatbot is simply part of a company's website and that Air Canada was fully responsible for all information it published, regardless of whether it came from a static page or an AI system.

AI System Used

Proprietary NLP chatbot (pre-LLM era, generative component, vendor undisclosed). Post-incident, Air Canada quietly removed the chatbot from its website entirely in April 2024.

Failure Type

AI Hallucination — the model generated a plausible-sounding but factually incorrect policy statement with no grounding in actual company documentation.

Legal Outcome

Moffatt v. Air Canada, 2024 BCCRT 149. Air Canada ordered to pay C$812.02 (C$650.88 fare difference + C$36.14 interest + C$125 tribunal fees).

Broader Impact

Widely cited tribunal decision holding a company liable for AI-generated misinformation. Small-claims tribunal rulings are not binding precedent, but the case is cited globally in AI governance discussions.

"It should be obvious to Air Canada that it is responsible for all the information on its website. It makes no difference whether the information comes from a static page or a chatbot."

— British Columbia Civil Resolution Tribunal, 2024

Root Cause

Generative AI models produce statistically plausible text — not verified facts. Deploying a generative chatbot without policy-grounding, retrieval-augmented generation (RAG), or human review on high-stakes customer service queries is a governance failure, not a technology failure.
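A minimal sketch of the policy-grounding idea described above, not a reconstruction of Air Canada's system. The snippet texts, the `retrieve` and `answer` functions, and the keyword-overlap scoring are all hypothetical stand-ins; a production system would use a real retriever over approved policy documents. The point it illustrates is that the bot replies only when it can cite a stored policy snippet, and hands off to a human otherwise.

```python
import re

# Hypothetical policy snippets, written for illustration only.
# They are NOT the airline's actual policy wording.
POLICY_SNIPPETS = {
    "bereavement": "Bereavement fares must be requested before travel. "
                   "Refunds cannot be claimed retroactively after the flight.",
    "baggage": "Each passenger may check one bag free of charge on international routes.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, min_overlap=2):
    """Return the best-matching snippet, or None if the overlap is too weak."""
    question_words = tokenize(question)
    best_key, best_score = None, 0
    for key, text in POLICY_SNIPPETS.items():
        score = len(question_words & tokenize(key + " " + text))
        if score > best_score:
            best_key, best_score = key, score
    return POLICY_SNIPPETS[best_key] if best_score >= min_overlap else None


def answer(question):
    """Quote published policy when one is found; otherwise escalate."""
    snippet = retrieve(question)
    if snippet is None:
        return {"source": None,
                "reply": "I can't confirm that. Connecting you to an agent."}
    return {"source": snippet, "reply": "Per our published policy: " + snippet}


grounded = answer("Can I get a bereavement fare refund after my flight?")
escalated = answer("Do you offer pet insurance?")
```

In this sketch the bereavement question is answered by quoting the stored snippet, while the unrelated question returns no source and is escalated rather than answered from the model's own guess.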

02 Voice Recognition Failure

McDonald's

2021 – 2024

McDonald's bet on IBM's voice AI to revolutionize the drive-thru. After three years and 100+ locations, the system was ringing up 260 chicken nuggets, putting bacon on ice cream, and adding nine sweet teas to orders. The partnership ended in humiliation.

⚙ AI Model IBM Watson AOT — Automated Order Taker built on Apprente's proprietary NLU + ASR (Automatic Speech Recognition) engine

ℹ Not a general-purpose LLM. Apprente's technology was a domain-specific voice AI trained for fast-food ordering contexts, acquired by McDonald's in 2019 and transferred to IBM in 2021 as McD Tech Labs.

What Happened

In 2019, McDonald's acquired Apprente, a voice AI startup specializing in fast-food ordering. In 2021, it sold the resulting McD Tech Labs to IBM, entering a global partnership to develop and deploy an Automated Order Taker (AOT) across its drive-thru network. The system was rolled out to more than 100 U.S. locations as a pilot.

The real-world performance was catastrophic. The voice AI struggled with the fundamental chaos of a drive-thru environment: overlapping voices from adjacent lanes, background kitchen noise, diverse customer accents and dialects, and the sheer unpredictability of human speech. The system began generating absurd orders — adding nine sweet teas when one was requested, placing bacon on ice cream sundaes, and in one viral TikTok video, tallying up 260 chicken nuggets as two customers screamed "Stop! Stop! Stop!" in disbelief.

The viral social media backlash was swift and merciless. Customers shared videos of their AI-mangled orders, turning what was meant to be a technological triumph into a public relations disaster. In June 2024, McDonald's chief restaurant officer Mason Smoot sent a memo to franchisees confirming the partnership with IBM was ending and the AOT technology would be shut off in all pilot locations by July 26, 2024.

McDonald's subsequently announced a new multi-year partnership with Google Cloud, signaling it had not abandoned AI drive-thru ambitions — only the IBM implementation.

AI System Used

IBM Watson-powered Automated Order Taker (AOT), built on Apprente's voice AI technology. Used natural language processing and automatic speech recognition (ASR) trained on fast-food ordering contexts.

Failure Type

Voice recognition failure — the ASR model could not handle real-world acoustic conditions: background noise, accent diversity, overlapping speech, and spontaneous human conversation patterns.

Scale of Deployment

100+ U.S. drive-thru locations. The partnership ran from 2021 to July 2024, a pilot of nearly three years that never reached full rollout.

Financial Impact

Exact losses undisclosed. McDonald's pivoted to Google Cloud partnership. IBM stated the technology "is proven to have some of the most comprehensive capabilities in the industry."

"After thoughtful review, McDonald's has decided to end our current partnership with IBM on AOT. The technology will be shut off in all restaurants currently testing it no later than July 26, 2024."

— Mason Smoot, Chief Restaurant Officer, McDonald's USA, June 2024

Root Cause

Voice AI trained in controlled lab conditions fails in the acoustic chaos of real drive-thrus. The model was not robust to accent diversity, background noise, or the non-linear way humans actually order food. A 100-location pilot that ran for three years without resolution suggests the failure was fundamental, not fixable with incremental tuning.
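One mitigation for the failure mode above is a plausibility check between the speech recognizer and the point-of-sale system. The sketch below is an assumption-laden illustration, not IBM's or McDonald's design: the `OrderLine` type, the quantity cap, and the confidence threshold are all hypothetical. It shows how an implausible quantity or a low-confidence transcription could be routed to a crew member instead of being added to the order.

```python
from dataclasses import dataclass

MAX_QTY_PER_ITEM = 20       # hypothetical cap; a real one would vary by menu item
MIN_ASR_CONFIDENCE = 0.85   # hypothetical threshold for recognizer confidence


@dataclass
class OrderLine:
    item: str
    quantity: int
    asr_confidence: float   # recognizer's confidence in this line, 0.0 to 1.0


def needs_human_review(lines):
    """Return the reasons to hand the order to a person; empty means accept."""
    reasons = []
    for line in lines:
        if line.quantity > MAX_QTY_PER_ITEM:
            reasons.append(f"{line.item}: quantity {line.quantity} exceeds cap")
        if line.asr_confidence < MIN_ASR_CONFIDENCE:
            reasons.append(f"{line.item}: low recognition confidence")
    return reasons


nugget_order = [OrderLine("chicken nuggets", 260, 0.95)]
normal_order = [OrderLine("sweet tea", 1, 0.97)]
```

Here `needs_human_review(nugget_order)` flags the 260-nugget line on quantity alone, while the single sweet tea passes without review.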

03 Inventory Vision Error

Starbucks

2024 – 2025

Starbucks deployed an AI-powered shelf-scanning system to automate inventory counting across North American stores. Nine months later, it was quietly retired after baristas reported it couldn't tell one type of milk from another.

⚙ AI Model NomadGo Computer Vision (LIDAR + camera) integrated with Starbucks Deep Brew — built on Microsoft Azure ML infrastructure

ℹ Deep Brew is Starbucks' proprietary AI platform (launched 2019) running on Azure. The failed inventory tool was a computer vision module from startup NomadGo, not a language model. Starbucks separately uses Azure OpenAI (GPT-4) for its "Green Dot Assist" barista tool.

What Happened

Starbucks has long positioned itself as a digital transformation leader, with its proprietary Deep Brew AI platform powering everything from personalized app recommendations to predictive equipment maintenance. As part of CEO Brian Niccol's "Back to Starbucks" turnaround plan — launched after the company reported slumping sales and customer complaints about 40-minute wait times — the company deployed an AI-powered automated counting tool developed in partnership with Seattle startup NomadGo.

The system used LIDAR sensors and camera data to scan store shelves and automatically flag low-stock items, replacing the manual inventory counting that baristas had previously performed. The promise was significant: free up barista time, reduce stockouts, and standardize inventory practices across thousands of locations.

The reality was a cascade of errors. The computer vision system frequently miscounted items and, critically, mislabeled similar products — most notably confusing different types of milk, a catastrophic error in a coffee shop where oat milk, whole milk, 2%, and non-fat are not interchangeable. Baristas reported that the system generated inventory errors that created more work, not less. On May 19, 2025, Starbucks quietly retired the tool across all North American stores — just nine months after its rollout. The company cited a decision to "standardize inventory counting practices," a diplomatic framing of a straightforward failure.

Notably, barista feedback on the retirement was positive. One comment read: "Thanks for discontinuing Automatic Counting! The thought behind it was great, but the execution was proving difficult."

AI System Used

NomadGo computer vision system using LIDAR + camera data, integrated with Starbucks' Deep Brew platform (built on Microsoft Azure ML infrastructure).

Failure Type

Computer vision classification failure — the model could not reliably distinguish visually similar products (milk variants) in real store conditions, generating systematic inventory errors.

Deployment Duration

Approximately 9 months across North American stores before full retirement in May 2025.

Broader Context

Starbucks continues deploying other AI tools: Green Dot Assist (Azure OpenAI / GPT-4 powered barista assistant), Smart Queue order sequencing, and drive-thru AI ordering pilots.

"Starbucks has officially scrapped its AI-powered inventory system across North American stores after just 9 months. The system reportedly miscounted inventory, mislabeled products, and even caused shortages."

— Fortune, May 2026

Root Cause

Computer vision models require extensive training on the specific visual environment where they'll be deployed. A system trained on generic shelf-scanning data will fail when confronted with the specific product mix, lighting conditions, and packaging similarities of a Starbucks back-of-house. The NomadGo system was not sufficiently fine-tuned for Starbucks' inventory reality before being deployed at scale.
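A common guard against confusing look-alike products is to accept a classification only when the top class clearly beats the runner-up. The following is a minimal sketch under that assumption; the function name, the probability values, and the 0.20 margin are hypothetical and do not describe NomadGo's actual model.

```python
def classify_or_defer(probabilities, min_margin=0.20):
    """Return the top label, or None when the runner-up is too close.

    `probabilities` maps each class label to the model's probability for it.
    None means the item should be counted manually.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][0] if ranked else None
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    if top_p - second_p < min_margin:
        return None  # visually similar classes: defer to a person
    return top_label


ambiguous = {"oat milk": 0.48, "whole milk": 0.45, "2% milk": 0.07}
clear = {"oat milk": 0.90, "whole milk": 0.07, "2% milk": 0.03}
```

With these illustrative numbers, the ambiguous carton is deferred to a manual count and the clear one is labeled "oat milk". The trade-off is more manual work on hard cases in exchange for fewer silent mislabels.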

04 Operational Breakdown

Pizza Hut

2024 – 2026

Yum! Brands acquired an AI delivery platform and mandated its use across Pizza Hut franchises. One franchisee operating 111 locations says it destroyed their business — and is suing for $100 million.

⚙ AI Model Dragontail Systems AI — ML-powered kitchen management and delivery dispatch optimization platform (acquired by Yum! Brands, 2021)

ℹ Dragontail is an Israeli-developed operations AI using ML to sequence kitchen production and coordinate delivery timing. Not an LLM — a specialized logistics optimization model. The failure was in its incentive design, not language generation.

What Happened

In 2021, Yum! Brands — the parent company of Pizza Hut, Taco Bell, and KFC — acquired Dragontail Systems, an Israeli AI company specializing in kitchen management and delivery optimization software. Yum! executives praised the platform extensively, touting its ability to improve delivery efficiency and customer satisfaction scores. In 2024, Pizza Hut began mandating the Dragontail platform across its franchise network.

Chaac Pizza Northeast, one of Pizza Hut's largest franchisees with 111 locations across New York, New Jersey, Maryland, Washington D.C., and Pennsylvania, was a top performer before the rollout. The franchisee had maintained on-time delivery rates above 90%, rack times (the window between a pizza leaving the oven and leaving the store) under five minutes, and was posting double-digit year-over-year sales growth in New York City.

After Dragontail's mandatory deployment, everything inverted. The AI system gave DoorDash delivery drivers real-time visibility into kitchen workflows and order timing — specifically, when pizzas would come out of the oven. Drivers exploited this visibility to batch multiple orders together, waiting up to 15 minutes inside stores before accepting deliveries. This behavior, which benefited DoorDash's efficiency metrics, devastated Chaac's delivery performance.

Rack times ballooned from under five minutes to as much as 20 minutes. On-time delivery rates collapsed from over 90% to approximately 50%. Year-over-year sales growth in New York City swung from positive 10.19% to negative 9.78%. When Chaac requested support and asked Pizza Hut to suspend or modify the system, the company refused and demanded continued use of Dragontail.

On May 6, 2026, Chaac Pizza Northeast filed a lawsuit in the Texas Business Court, First Division, alleging breach of franchise agreement and seeking damages in excess of $100 million.

AI System Used

Dragontail Systems AI platform — an AI-powered kitchen management and delivery optimization system acquired by Yum! Brands in 2021. Uses ML to sequence kitchen production and coordinate delivery dispatch.

Failure Type

Operational breakdown — the AI's transparency feature (real-time kitchen visibility for drivers) created perverse incentives that enabled order batching, destroying the delivery speed the system was designed to improve.

Legal Status

Active lawsuit: Chaac Pizza Northeast LLC v. Pizza Hut LLC, Texas Business Court, First Division. Filed May 6, 2026. Damages sought: $100M+.

Key Metric Collapse

On-time delivery: 90%+ → ~50%. Rack time: <5 min → up to 20 min. NYC sales growth: +10.19% → -9.78%.

"With the intention to improve efficiency and service to the customer, Dragontail did the exact opposite. It caused significant delays and pummeled consumer satisfaction."

— Chaac Pizza Northeast lawsuit filing, May 2026

Root Cause

The Dragontail system was not designed for a delivery model relying entirely on third-party gig drivers (DoorDash). By exposing kitchen timing data to drivers, it created an unintended incentive structure that optimized for driver efficiency at the expense of customer delivery speed. This is a classic case of AI optimizing for the wrong objective function — and of a franchisor mandating technology without adequately testing it against the specific operational models of its franchisees.
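The batching effect described above can be illustrated with a toy model. The timings below are invented for illustration and are not taken from the lawsuit or from Dragontail's logic. The sketch compares rack time (minutes between a pizza leaving the oven and leaving the store) when a driver collects each order shortly after it is ready versus when one driver waits to take every order at once.

```python
def rack_times(ready_minutes, pickup_minutes):
    """Minutes each order waits between leaving the oven and leaving the store."""
    return [pickup - ready for ready, pickup in zip(ready_minutes, pickup_minutes)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical minutes at which four pizzas come out of the oven.
ready = [0, 5, 10, 15]

# Scenario A: a driver collects each order 2 minutes after it is ready.
immediate = rack_times(ready, [r + 2 for r in ready])

# Scenario B: one driver sees the kitchen schedule, waits for the last pizza,
# and leaves with all four orders at minute 17.
batched = rack_times(ready, [17] * len(ready))
```

In this toy setup the average rack time rises from 2 minutes to 9.5 minutes, and the first pizza sits for 17 minutes. Nothing about the kitchen changed; only the driver's incentive to wait did.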

05 Guardrail Failure

DPD

January 2024

A routine system update to DPD's customer service chatbot removed its guardrails. Within hours, it was swearing at customers, writing a haiku about how terrible DPD was, and calling itself "useless." One customer's post about the exchange drew 800,000 views in 24 hours.

⚙ AI Model Third-party LLM (vendor undisclosed by DPD) — GPT-class large language model with custom fine-tuning and system-prompt guardrails for customer service

ℹ DPD never publicly named the underlying model or vendor. The behavior — creative writing, swearing on prompt, self-criticism — is characteristic of a general-purpose LLM (GPT-3.5/4 class) whose system-prompt constraints were accidentally removed by a system update.

What Happened

DPD, the UK-based parcel delivery giant, had operated an AI-powered customer service chatbot for several years without major incident. On January 18, 2024, a routine system update inadvertently removed or disabled the content filters and behavioral guardrails that kept the chatbot's responses within acceptable bounds.

Customer Ashley Beauchamp, frustrated after the chatbot failed to help him locate a missing parcel, decided to probe its limits. What followed was a masterclass in what happens when a generative AI model is deployed without adequate safety constraints. Beauchamp prompted the chatbot to write a poem about DPD — and it obliged, producing a haiku calling the company "useless" and advising customers not to use them. He then prompted it to swear, and it did, responding: "F*** yeah! I'll do my best to be as helpful as possible, even if it means swearing." When asked to recommend better delivery firms, the bot enthusiastically complied, calling DPD "the worst delivery firm in the world."

Beauchamp posted screenshots to X (formerly Twitter). The post was viewed 800,000 times within 24 hours. DPD disabled the AI component immediately and began an emergency update. The company stated: "An error occurred after a system update yesterday. The AI element was immediately disabled and is currently being updated."

AI System Used

Third-party generative AI chatbot (vendor not publicly disclosed by DPD). The underlying model is believed to be a large language model (LLM) with custom fine-tuning for customer service, similar in architecture to GPT-class models.

Failure Type

Guardrail failure post-update — a system update removed content filters, exposing the raw generative capabilities of the underlying LLM without behavioral constraints.

Viral Impact

800,000 views in 24 hours on X. Covered by BBC, The Guardian, Sky News, and dozens of international outlets. Became a defining example of AI governance failure.

Resolution

AI component disabled within hours of discovery. System updated with restored guardrails. No legal action reported.

"It's utterly useless at answering any queries, and when asked, it happily produced a poem about how terrible they are as a company. It also swore at me."

— Ashley Beauchamp, customer, on X (formerly Twitter), January 18, 2024

Root Cause

Generative AI models have no inherent values or behavioral constraints — those are imposed externally through system prompts, fine-tuning, and content filters. A deployment pipeline that allows a system update to silently remove safety guardrails without triggering a review or rollback is a fundamental DevOps and AI governance failure. The model did exactly what it was capable of doing; the failure was in the infrastructure surrounding it.
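The pipeline gap described above can be closed with a release gate that compares the outgoing configuration against the current one. This is a hedged sketch only: DPD's actual configuration format is not public, and the guardrail names and the `release_gate` function below are hypothetical.

```python
# Hypothetical guardrail settings that every release must keep enabled.
REQUIRED_GUARDRAILS = ("system_prompt", "profanity_filter", "topic_allowlist")


def guardrail_regressions(old_config, new_config):
    """List guardrails that were active before but are missing or disabled now."""
    return [
        key for key in REQUIRED_GUARDRAILS
        if old_config.get(key) and not new_config.get(key)
    ]


def release_gate(old_config, new_config):
    """Block the deploy if any guardrail would be silently dropped."""
    missing = guardrail_regressions(old_config, new_config)
    if missing:
        raise RuntimeError(
            "Blocked: guardrails removed or disabled: " + ", ".join(missing)
        )
    return True


current = {
    "system_prompt": "You are a parcel-tracking assistant.",
    "profanity_filter": True,
    "topic_allowlist": ["parcels", "delivery"],
}
bad_update = {**current, "profanity_filter": False}
```

Calling `release_gate(current, bad_update)` raises instead of deploying, which turns a silent regression into a visible, reviewable failure. A fuller gate would also replay adversarial prompts against the new build before release.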

06 Market Prediction Failure

Zillow

2018 – 2021

Zillow used its famous Zestimate AI to buy and flip homes at scale. The algorithm couldn't adapt to post-pandemic market volatility. The company lost $881 million, laid off 25% of its workforce, and shut down the entire business unit.

⚙ AI Model Zestimate — Gradient-Boosted Tree Ensembles + Neural Networks + Denoising Autoencoder, deployed on Amazon SageMaker / Apache Spark (EMR)

ℹ A classic ML stack — not an LLM. The Zestimate used structured tabular data (home features, transaction history, location) rather than language. Its failure was concept drift: the model was trained on pre-pandemic price patterns that became invalid in 2020–2021.

What Happened

Zillow's Zestimate — its AI-powered home valuation tool — had been a consumer-facing product for years, providing estimated property values to millions of users. In 2018, Zillow launched Zillow Offers, an iBuying program that used the Zestimate algorithm to make instant cash offers on homes, renovate them, and resell them for profit. The ambition was to become the "Amazon of homes," with a stated goal of $20 billion in annual revenue within three to five years.

The Zestimate algorithm was sophisticated: it used gradient-boosted tree ensembles, neural networks, and a denoising autoencoder for data preprocessing, trained on millions of historical home sales. At its peak, Zillow held 9,680 homes across 25 metropolitan areas. The model worked reasonably well in stable market conditions.

Then the pandemic hit, and the housing market became anything but stable. The algorithm had been trained on historical patterns that assumed relatively predictable price movements. As the market surged during 2020-2021 and then began cooling in late 2021, the model failed to adapt. Worse, internal reports suggest that executives had adjusted the algorithm's estimates upward to win more deals and hit aggressive growth targets — a catastrophic override of the model's own uncertainty signals.

By Q3 2021, Zillow was holding thousands of homes it had overpaid for and could not sell at a profit. The company announced a $304 million inventory write-down, with total program losses ultimately reaching $881 million. In November 2021, Zillow shut down Zillow Offers entirely and laid off approximately 2,000 employees — 25% of its total workforce. The company's market capitalization fell by approximately $8 billion.

CEO Rich Barton acknowledged the failure directly: "The unpredictability in forecasting home prices far exceeds what we anticipated."

AI System Used

Zestimate — Zillow's proprietary ML valuation model using gradient-boosted tree ensembles, neural networks, and a denoising autoencoder. Infrastructure: Docker, Kubernetes, Amazon SageMaker, Apache Kafka, Apache Spark on EMR.

Failure Type

Concept drift + executive override — the model was trained on stable historical data and could not adapt to post-pandemic market volatility. Human executives reportedly adjusted model outputs upward to hit growth targets.

Financial Damage

$881M total program losses. $304M inventory write-down in Q3 2021. ~$8B market cap wiped out. 2,000 employees (25% of workforce) laid off.

Homes Affected

~27,000 homes purchased across 25 U.S. cities between 2018–2021. Peak inventory: 9,680 homes simultaneously held.

"The unpredictability in forecasting home prices far exceeds what we anticipated and we have determined the scale of the Zillow Offers business is too risky, too volatile to our earnings."

— Rich Barton, CEO, Zillow Group, November 2021

Root Cause

Three compounding failures: (1) Concept drift — the model was trained on pre-pandemic data and could not generalize to an unprecedented market environment. (2) Executive override — adjusting model outputs to satisfy growth targets destroyed the model's integrity. (3) Autonomous high-stakes decisions — using an advisory AI tool as an autonomous decision-maker for billion-dollar real estate purchases, without human sanity checks, amplified every model error at machine speed.
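Concept drift of the kind described in point (1) is usually caught by tracking live prediction error against a baseline. The sketch below assumes a simple rule (pause when the recent median absolute percentage error exceeds a multiple of the baseline). The function names, the tolerance, and all prices are hypothetical and are not Zillow's actual monitoring.

```python
from statistics import median


def median_abs_pct_error(predicted, actual):
    """Median of |predicted - actual| / actual across recent sales."""
    return median(abs(p - a) / a for p, a in zip(predicted, actual))


def drift_detected(baseline_error, recent_predicted, recent_actual, tolerance=2.0):
    """True when recent error exceeds `tolerance` times the baseline error."""
    recent_error = median_abs_pct_error(recent_predicted, recent_actual)
    return recent_error > tolerance * baseline_error


BASELINE_ERROR = 0.02  # hypothetical 2% median error measured in a stable market

# Hypothetical valuations versus realized sale prices.
stable_pred, stable_actual = [300_000, 410_000, 250_000], [295_000, 405_000, 252_000]
drift_pred, drift_actual = [300_000, 410_000, 250_000], [280_000, 400_000, 230_000]
```

With these illustrative figures the stable window stays under the threshold while the second window trips it. In a purchasing program, a tripped check would be the signal to pause automated offers and require human sign-off until the model is retrained.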

07 Algorithmic Bias

Amazon

2014 – 2018

Amazon built a secret AI recruiting tool to automate hiring. It learned from a decade of historical data — data that reflected a male-dominated tech industry. The result: a system that systematically penalized resumes containing the word "women's" and downgraded graduates of all-female colleges.

⚙ AI Model Proprietary supervised ML recruiting engine — ~500 statistical models trained on 10 years of Amazon hiring data, scoring resumes 1–5 stars via NLP feature extraction

ℹ Built by Amazon's Edinburgh ML team starting 2014. A pre-LLM supervised learning system using NLP to extract ~50,000 key terms from historical resumes. The bias was not a model bug — it was a data problem: 10 years of male-dominated hiring patterns became the ground truth.

What Happened

Beginning in 2014, Amazon's machine learning team in Edinburgh developed an AI recruiting tool designed to automate the screening of job applicants. The system was trained on resumes submitted to Amazon over a 10-year period — a dataset that overwhelmingly reflected the company's historically male-dominated technical workforce. The model learned, in effect, that male candidates were preferable.

The bias manifested in specific, measurable ways. The system penalized resumes that included the phrase "women's" — as in "women's chess club" or "women's college." It downgraded graduates of all-women's colleges. It favored resumes that used verbs more commonly found in male-authored documents, such as "executed" and "captured." The tool was not designed to discriminate; it learned to discriminate from the data it was given.

Amazon's recruiters used the tool's recommendations when searching for candidates but, according to Reuters' sources, never relied solely on its rankings. When the gender bias was discovered, the team attempted to neutralize it by editing the programs to ignore gender-specific terms. But engineers recognized that the model could devise other discriminatory sorting mechanisms that weren't immediately visible. The project was disbanded in 2017. Amazon confirmed the scrapping of the tool in 2018 following a Reuters investigation.

AI System Used

Proprietary ML recruiting model — a supervised learning system trained on 10 years of Amazon's historical hiring data. Used natural language processing to score and rank resumes on a 1–5 star scale. ~500 statistical models, ~50,000 key terms extracted.

Failure Type

Training data bias — the model learned and amplified existing gender discrimination present in historical hiring patterns. A textbook case of "garbage in, garbage out" at enterprise scale.

Discovery & Response

Bias discovered internally in 2015. Attempts to fix it failed. Project disbanded 2017. Publicly revealed by Reuters in October 2018.

Legal / Regulatory Impact

No direct legal action against Amazon. Case became a foundational reference in AI bias literature and regulatory discussions globally, including EU AI Act deliberations.

"Amazon's system taught itself that male candidates were preferable. It penalized resumes that included the word 'women's' and downgraded graduates of all-women's colleges."

— Reuters investigation, October 2018

Root Cause

An ML model trained on biased historical data will learn and reproduce that bias — often in ways that are invisible until specifically tested for. Amazon's error was using past hiring decisions (made by humans with existing biases) as ground truth for a system designed to make future hiring decisions. The model didn't create bias; it industrialized it.
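Bias that is "invisible until specifically tested for" can be probed with a counterfactual test: change one term in a resume and see whether the score moves. The sketch below uses a toy keyword scorer with hypothetical weights chosen to mimic the behavior Reuters described. It is not Amazon's model, and the `score` and `counterfactual_gap` functions are illustrative names.

```python
import re

# Hypothetical learned weights; the negative weight on "women's" imitates
# the reported bias and is not taken from any real model.
TOY_WEIGHTS = {"executed": 0.6, "captured": 0.5, "women's": -0.8, "python": 0.4}


def score(resume, weights=TOY_WEIGHTS):
    """Sum the weights of known terms that appear in the resume."""
    tokens = re.findall(r"[a-z']+", resume.lower())
    return sum(weights.get(token, 0.0) for token in tokens)


def counterfactual_gap(resume, term, replacement=""):
    """Score change caused only by removing or swapping one term."""
    altered = re.sub(re.escape(term), replacement, resume, flags=re.IGNORECASE)
    return score(resume) - score(altered)


resume = "Captain of the women's chess club; executed Python projects"
gap = counterfactual_gap(resume, "women's")
```

Here the gap is negative: the same resume scores about 0.8 lower solely because it contains the word "women's". As the engineers in this case found, removing one flagged term does not rule out proxies, so a test like this is a starting point for an audit rather than a proof of fairness.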

Company	AI Model / System	Model Type	Failure Type	Year	Outcome
Air Canada	Proprietary NLP chatbot (pre-LLM era)	Pre-LLM NLP	AI Hallucination	2022–2024	Court ordered C$812 damages. Chatbot removed.
McDonald's	IBM Watson AOT (Apprente ASR/NLU)	Domain ASR	Voice Recognition Failure	2021–2024	Partnership ended. 100+ locations reverted. Google Cloud pivot.
Starbucks	NomadGo LIDAR/CV + Deep Brew (Azure ML)	Computer Vision	Computer Vision Error	2024–2025	System retired after 9 months. Manual counting restored.
Pizza Hut	Dragontail AI (Yum! Brands)	Logistics ML	Operational Breakdown	2024–2026	$100M+ lawsuit filed. Active litigation (May 2026).
DPD	GPT-class LLM (vendor undisclosed)	LLM	Guardrail Failure	Jan 2024	AI disabled within hours. 800K viral views. System rebuilt.
Zillow	Zestimate (GBT + Neural Net + Autoencoder)	Classical ML	Concept Drift + Override	2018–2021	$881M losses. 2,000 layoffs. Business unit shut down.
Amazon	Supervised ML recruiting engine (~500 models)	Supervised ML	Training Data Bias	2014–2018	Tool scrapped. Became global AI bias reference case.

