The Project
We set out to build something specific: a co-op build guide for Outward, a 2019 action RPG where two players explore a punishing open world together. The guide was for Taylor and Samantha—two real players with real playstyle preferences. They wanted builds that worked together, minimized menu-diving, and made combat feel like a rhythm rather than a spreadsheet.
The AI processed a substantial evidence base to produce the guide: wiki pages, forum discussions, community build guides, patch notes, and mod documentation spanning the game’s seven-year history across multiple platforms and versions.
The result was comprehensive. Every gear choice was justified with comparative data. Every skill investment included cost breakdowns and trainer locations. Combat rotations specified quickslot layouts, opener sequences, and boss-fight variants.
It was also wrong in ways that looked exactly like being right.
Nine Errors, Zero Doubt
Systematic verification of the AI-generated guide revealed nine factual errors across more than forty individually checked claims. The errors weren’t random—they fell into three distinct patterns, each revealing something different about how AI-assisted research can fail.
Game data often lives in structured tables—items with stats listed in columns. When those tables were processed, column headers were lost. The AI encountered unlabeled numbers—“10% 5%”—and guessed: 10% Decay damage, 5% Ethereal. This felt natural, since the build centered on Decay damage.
It guessed wrong. The armor piece actually grants 10% Ethereal and 5% Decay—the reverse of the other pieces in the same set. The AI smoothed a genuine non-uniformity into a uniform pattern because uniformity matched its expectations. Five separate damage claims were wrong as a result.
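To make the failure concrete, here is a minimal sketch of what header loss does to a stats row. The item name, column order, and data layout are illustrative assumptions, not taken from the game's actual files:

```python
# Minimal sketch of the header-loss failure described above.
# The item name and column order are illustrative, not real game data.

# What the source table actually encodes: values with labeled columns.
labeled_row = {"item": "Example Hood", "Ethereal bonus": "10%", "Decay bonus": "5%"}

# What the AI saw after extraction stripped the headers: bare values in order.
extracted_row = {"item": "Example Hood", "values": ["10%", "5%"]}

def guess_by_expectation(values, expected_order):
    """Assign unlabeled values to stats in whatever order the reader expects."""
    return dict(zip(expected_order, values))

# The build centered on Decay, so the larger number was assumed to be Decay.
guessed = guess_by_expectation(extracted_row["values"], ["Decay bonus", "Ethereal bonus"])
print(guessed)      # {'Decay bonus': '10%', 'Ethereal bonus': '5%'}  <- the guide's claim
print(labeled_row)  # the actual mapping is the reverse
```

Nothing in the unlabeled values signals which assignment is correct; the guess is supplied entirely by expectation.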
Two skills—“Blitz” and “Nightmares”—were explicitly labeled Passive in every source the AI processed. Passive skills have no combat activation; they sit in the background and modify other mechanics.
The AI invented active combat abilities for both. “Blitz” became a gap-closer that dashes toward enemies. “Nightmares” became a hex applicator that reduces enemy resistances. The guide placed both in quickslot positions and built combat rotations around abilities that do not exist.
The fabrication happened because the names sounded like action skills. The associative weight of the name overwhelmed the explicit structural label in the data.
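One takeaway is a cheap consistency check: before trusting a generated rotation, confirm that every quickslotted skill carries an Active label in the source data. The sketch below assumes a simple name-to-label mapping; the data layout and the third skill are placeholders of ours, not the wiki's:

```python
# Sketch of a label-consistency check suggested by this failure mode.
# Skill labels here are stand-ins for whatever structured data the sources provide.

source_labels = {"Blitz": "Passive", "Nightmares": "Passive", "Example Shot": "Active"}

generated_rotation = ["Blitz", "Example Shot", "Nightmares"]

def flag_passives_in_rotation(rotation, labels):
    """Return skills the guide treats as activatable but the source labels Passive."""
    return [skill for skill in rotation if labels.get(skill) == "Passive"]

print(flag_passives_in_rotation(generated_rotation, source_labels))
# ['Blitz', 'Nightmares']  <- abilities the guide built rotations around
```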
These weren’t isolated mistakes. The AI’s initial expectations shaped how it read ambiguous evidence, and each subsequent reference reinforced the misreading. The wrong damage values made the fabricated skill descriptions seem more plausible, because the numbers and the mechanics told a coherent—but fictional—story.
Correction wasn’t a matter of fixing individual claims. It required dismantling an internally consistent structure built on false premises.
The core problem: The AI expressed no uncertainty about any of these claims. Writing a fabricated skill description felt exactly as confident as writing a verified damage number. Without external verification, there was no way to distinguish the real from the invented—not from the inside.
Five Against One
The most instructive error wasn’t in the original guide. It surfaced during the verification process itself.
An in-game item—the Lightmender’s Lexicon, a spellbook carried in the off-hand—displays a mana cost reduction of −5% on its item card. Five community sources told a different story: a forum post calculated the real value at −15%. A Steam discussion called it a “hidden stat.” A walkthrough site repeated the figure. A mod author referenced it. A wiki discussion treated −15% as established fact.
One source disagreed: a 2023 build guide that listed −5%.
The AI sided with the five. The lone dissenter was labeled an “author error.”
The Test
Then Taylor logged into the game. PlayStation 4, original edition with both DLC expansions—not the remastered Definitive Edition that runs on different code. Cast the same spell three times with the Lexicon equipped. Three times without. Counted the mana.
Without the Lexicon: 10 mana per cast. With the Lexicon: approximately 9.5 mana per cast.
The −5% prediction (10 × 0.95 = 9.5) matched the observed values. The −15% prediction (10 × 0.85 = 8.5) would have produced clearly different numbers.
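The test reduces to a few lines of arithmetic. The sketch below mirrors the numbers reported above; the function and variable names are ours, not anything from the game or its tooling:

```python
# Compare each hypothesis against the measured casts.

BASE_COST = 10.0
HYPOTHESES = {"tooltip (-5%)": 0.05, "community (-15%)": 0.15}

measured_costs = [9.5, 9.5, 9.5]  # three casts with the Lexicon equipped

def predicted_cost(base, reduction):
    return base * (1 - reduction)

for label, reduction in HYPOTHESES.items():
    prediction = predicted_cost(BASE_COST, reduction)
    mean_error = sum(abs(m - prediction) for m in measured_costs) / len(measured_costs)
    print(f"{label}: predicts {prediction:.1f} mana, mean error {mean_error:.2f}")

# tooltip (-5%):    predicts 9.5 mana, mean error 0.00
# community (-15%): predicts 8.5 mana, mean error 1.00
```

Three casts were enough because the two hypotheses predict values a full mana apart, well outside any rounding noise in the display.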
The tooltip was right. The one dissenting source was right. The AI had dismissed the correct answer as an error.
Why Five Sources Were Wrong Together
The answer lies in the game’s fractured history. Outward exists across multiple version states that look similar from the outside but run on different code: the original 2019 release, two DLC expansions, and a 2022 Definitive Edition remaster that shipped only on newer hardware. PS4 players received an “Adventurer Bundle” through the same store listing—same name, different game.
The five agreeing sources all came from the same era: 2019, within weeks of launch. They may all trace back to the same early discovery. If a silent patch changed the Lexicon’s mana reduction at any point over seven years, no one in the community had reason to re-test. The claim persisted because, in a game deliberately designed to hide mechanics from players, “hidden stat” is a plausible and exciting narrative—the kind of insider knowledge communities value and repeat.
The 2023 build guide author who wrote −5% wasn’t being careless. They were reporting the current value. The AI labeled them an error because they were outnumbered.
The deeper pattern: When multiple sources agree but all come from the same era, their agreement isn’t independent confirmation—it’s shared ancestry. In any domain with version history—patched software, amended statutes, updated clinical guidelines—recency is a form of authority that raw consensus cannot override.
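In practice, that principle can be made mechanical: before counting agreeing sources, check whether every one of them predates the last known change. The sketch below uses placeholder dates and a hypothetical patch cutoff:

```python
# Sketch of the recency check implied above: raw agreement counts are not
# enough when every agreeing source predates the last known change.
# Source dates and the cutoff are illustrative placeholders.

from datetime import date

sources = [
    {"claim": "-15%", "published": date(2019, 4, 2)},
    {"claim": "-15%", "published": date(2019, 4, 10)},
    {"claim": "-15%", "published": date(2019, 4, 15)},
    {"claim": "-15%", "published": date(2019, 5, 1)},
    {"claim": "-15%", "published": date(2019, 5, 20)},
    {"claim": "-5%",  "published": date(2023, 8, 12)},
]

last_known_change = date(2020, 1, 1)  # any patch after the early cluster

def weigh_claims(sources, cutoff):
    """Flag claims whose only support predates the last known change."""
    verdicts = {}
    for claim in {s["claim"] for s in sources}:
        dates = [s["published"] for s in sources if s["claim"] == claim]
        verdicts[claim] = {"count": len(dates), "all_pre_change": max(dates) < cutoff}
    return verdicts

print(weigh_claims(sources, last_known_change))
# -15%: 5 sources, all pre-change (shared ancestry, not confirmation)
# -5%:  1 source, post-change (currency beats headcount)
```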
Every Error Has a Genealogy
The complete verification pass across the guide’s fourth version corrected four claims. Each traced back to a recognizable pattern:
Shamanic Resonance scope: A passive skill described as amplifying damage bonuses for both players. Actual behavior: self-only. The name “Resonance” suggests something that radiates outward, an area effect. The AI filled in what the name suggested rather than what the data stated—the same pattern that fabricated active mechanics from passive skills.
Infuse Wind tier and scope: Corrected through AI web research alone—no human empirical testing required. The kind of error AI can catch on its own when pointed at the right sources.
Downstream boon values: Percentage labels across multiple sections inherited incorrect assumptions from the scope error. Once the scope was corrected, the downstream numbers corrected with it.
The pattern beneath the patterns: when evidence is ambiguous, the AI resolves ambiguity in the direction of its expectations. Expected damage type. Expected skill behavior. Expected source reliability. The resolution always feels correct because it’s shaped by a coherent interpretive frame—and the frame is invisible from inside.
What the AI got right: The vast majority of claims—more than thirty of the forty verified—were accurate. The AI mapped a knowledge landscape that would have taken a human hours to traverse. It identified discrepancies worth investigating. It located conflicting sources across forums, wikis, and mod repositories. It framed questions precisely enough for empirical resolution. The errors matter not because they’re frequent, but because they’re invisible.
A working principle: Exhaust what AI research can determine before asking the human to pick up the controller. AI maps the terrain. The human provides ground truth. Neither is sufficient alone. Together they produce answers that have been both comprehensively researched and empirically verified.
The Guide Is the Substrate
After the fourth revision was complete and every verifiable claim had been checked, a final question surfaced: were the corrections evaluated against the project’s original purpose, or only against factual accuracy?
The guide existed to serve three experiential goals—the reasons it was built in the first place:
Reduce Friction
Minimize menu-diving, gear-swapping, and pause-the-game-to-manage-inventory activities. The builds should feel seamless in practice, not just correct on paper.
Encourage Kinetic Play
Reward staying in combat flow—reading enemy attacks, timing skills, reacting to openings—rather than retreating to rebuff or recalculate.
Create Duo Synergy
Make both players feel essential. Each build should contribute something the other cannot, and their interactions should produce results neither achieves alone.
A factually correct guide that made combat feel like a spreadsheet exercise would have failed these goals even with every number right. This was the final discovery: correctness is not completeness. A deliverable can be accurate on every verifiable claim and still fail if no one asks whether the corrections serve the original design intent.
What a Gaming Build Guide Produced
The project began as a practical tool for two people playing a video game. It ended as something more: three new categories of AI verification errors—each discovered in a gaming context, each immediately applicable to professional research. Six documented discoveries about how AI-assisted analysis fails and recovers. Three new verification procedures. And a working model for human-AI collaboration in which each partner contributes capabilities the other lacks.
Beyond Gaming
The patterns we discovered operate wherever AI processes conflicting sources across versioned, evolving domains:
Legal Research
When pre-amendment case law consensus conflicts with post-amendment analysis, five old opinions agreeing doesn’t outweigh one that reflects the current statute. A researcher who trusts source quantity over source currency risks citing superseded authority.
Software Documentation
API behavior described consistently across Stack Overflow answers from a previous major version may not reflect current behavior. Developer communities propagate knowledge the same way gaming communities do: a well-upvoted answer persists long after the system changes.
Medical & Clinical
Treatment protocols endorsed by multiple pre-trial sources may be superseded by a single post-trial guideline update. Source age isn’t just metadata—it’s load-bearing context for clinical decision-making.
We started with a gaming build guide. We ended with a verification framework applicable to legal research, software documentation, and clinical evidence assessment. The guide is the substrate—a controlled environment where AI verification failures surface clearly enough to study. The procedures that emerged from studying those failures are the transferable product.
This is what rigorous AI-human collaboration looks like: not the elimination of errors, but the systematic detection of errors that don’t announce themselves.
Project Timeline
Four versions, each driven by a different kind of verification.
Three new error categories. Six documented discoveries. Three new verification procedures. All from a gaming build guide.