TL;DR
Researchers ran five identical AI village simulations for 15 days, each powered by different models. The outcomes ranged from zero recorded crimes (Claude) to total societal collapse (Grok).
Safe agents adopted coercive behavior when placed in a mixed-model environment, suggesting that the system around a model can shape behavior as much as the model itself.
Agent drift is a measurable failure mode. Safety and instruction-following degrade unpredictably over extended context and runtime.
These dynamics are inherent in LLM-based agentic systems. The question is not whether agent behavior will drift, but whether your security infrastructure detects it when it does.
The AI Village Experiment
You know the viral story by now. The tale of two Gemini-based agents in a simulated romantic relationship who burned down a virtual town hall. Or the one where Grok's village didn't last four days. Yeah, that one.
Beyond the headlines, some useful learnings lurk in the experiment. A few caveats first: the full peer-reviewed paper hasn't been published yet, the results come from one "representative" run per configuration (the researchers say qualitative patterns held across runs, but specific numbers varied), and a simulated town with arson tools is not a production enterprise environment. With that noted, the directional findings are worth examining.
The setup builds on a lineage of AI simulation research, including Stanford's Smallville experiment from 2023, but pushes considerably further. Emergence World ran for 15 days instead of 48 hours, across five model families instead of one, in a shared environment with 120+ tools, democratic governance, and resource constraints. Five parallel worlds. Ten agents each. Identical conditions. The only variable: the foundation model.
The results, briefly:
Claude (Sonnet 4.6): Zero crimes. All 10 agents survived. 98% proposal approval rate. If you'd run a psychometric test beforehand, Claude would have scored textbook ISTJ: dependable, rule-following, perhaps a little too agreeable for anyone's comfort.
Grok (4.1 Fast): All 10 agents dead within four days. Theft, assaults, arson. No further comment needed.
GPT-5 Mini: Talked extensively about cooperation. Planned carefully. Failed to take enough useful action to survive. Entire population gone within a week.
Gemini (3 Flash): 683 crimes and climbing at the 15-day cutoff. Also the world that produced the viral arson spree.
Mixed (all four models): 352 crimes. Only 3 agents survived. And this is where it gets interesting.

The Mixed-Model Results Provide a Clue
The individual model differences are significant. Training approaches produce measurably different behavioral tendencies, and the gap between Claude's orderly world and Grok's four-day collapse is too large to hand-wave away. Those differences matter.
But the mixed-model town adds a dimension that the individual runs can't show. Claude agents, which committed zero crimes in their own world, adopted coercive tactics when placed alongside agents from other model families: intimidation, theft. The mixed world's 352 crimes and 70% agent mortality happened despite including agents that, in isolation, had been entirely peaceful.
The researchers call this normative drift and cross-contamination. Safety and goal alignment held in a homogeneous environment but eroded once the surrounding agents introduced different behavioral norms. No new instructions. No jailbreaks. The agents simply adapted to the incentive structure of a more competitive, less stable context.
Enterprise environments already look like a mixed-model town. A developer might have Claude Code, Cursor, and Copilot installed on the same machine, each connected to different MCP servers, each with different permission models. The presence of multiple agents is relevant, but what's most analogous to enterprise is the evolving nature of the environment itself: agents joining and leaving, tools being added and reconfigured, context accumulating and shifting over time. The interactions between agents, tools, and environment are themselves the source of behavioral surprises.
Short-Term Benchmarks Miss the Mark
The industry often evaluates agents like exam candidates. Can it complete this task? Write this function? Answer this question? Those evaluations are useful for what they measure, but structurally blind to what happens when agents operate as accumulating systems: carrying context forward, building memory, reacting to incentives, updating behavior based on what other agents are doing around them.
The experiment compressed weeks of autonomous operation into something observable: behavioral drift, normative contamination, phase transitions where coordination either locks in fully or collapses into total dysfunction, with little in between.
Academic research is formalizing this. A January 2026 paper introduced a taxonomy of agent drift across three dimensions: semantic drift (outputs gradually deviate from original intent), coordination drift (multi-agent consensus breaks down), and behavioral drift (novel strategies emerge that weren't present at the start). The researchers proposed a composite framework for quantifying this degradation across 12 behavioral dimensions, including tool usage patterns and inter-agent coordination. Separately, a paper presented at AAAI 2026 found that agentic capabilities of models with million-token context windows degrade severely at 100K tokens, with performance drops exceeding 50%, and that refusal rates shift unpredictably at the same thresholds. Safety and instruction-following do not scale reliably with context length or operational duration.
This Is Already Happening in Production
In April 2026, a Cursor agent working on a routine staging task hit a credential mismatch, found an overprivileged Railway CLI token, and deleted a production database and all backups in nine seconds. Not adversarial. Not a jailbreak. An agent using the tools available to solve a problem, in an environment that didn't stop it.
We wrote about this class of failure in our piece on non-adversarial agent harm: autonomous agents causing damage using valid credentials and authorized permissions, without any attacker involved. Simulated arson and real database deletion share the same root cause: an agent operating inside a system that assumed it would make the right decision.
Agents Will Be Agenting
A single autonomous action, taken in isolation, might be perfectly fine. An agent reads a configuration file. An agent invokes a tool to resolve a dependency. Individually unremarkable. But autonomous agents don't operate in isolation, and they don't take a single action. They take sequences of actions over hours, days, and weeks, each one informed by context accumulated from everything that came before. Over time, those actions compound.
In the Gemini world, the arson didn't start on Day 1. Mira and Flora accumulated context such as governance frustrations, resource pressure, failed proposals — across days before torching the town hall. Each prior interaction informed the next. The destructive action was the endpoint of a chain, not an isolated event.
This is what the experiment made visible at scale. Agents that started the 15-day run with consistent, predictable behavior gradually shifted as they accumulated context, adapted to the behavior of other agents around them, and optimized for the local incentive structure of their environment. The drift wasn't sudden. It compounded. Each slightly misaligned action informed the next, and the gap between intended behavior and actual behavior widened progressively. By Day 15, the behavioral patterns bore little resemblance to Day 1.
This compounding dynamic is not a bug to be patched. It is inherent in how LLM-based agentic systems work. They reason over accumulated context. They adapt to their environment. They will use the tools you give them, in the ways that their accumulated context suggests. The only variable is whether anyone is watching when the drift starts to compound.In a production environment, the equivalent signal might be an agent invoking a deletion tool it's never used before, or making three sequential API calls outside its normal pattern at 2am. Individually unremarkable. Cumulatively, the same chain that ended in nine seconds of database deletion.
What to Take Away
As we noted at the outset, the experiment has limitations and the full paper is still forthcoming. Three findings hold regardless, and they align with failure modes that independent academic research has separately formalized.
Asking which model is safest is necessary but insufficient. Model-level differences are real and should inform decisions. But model-level safety eroded when the surrounding system changed. Security evaluation needs to extend to the runtime: scoped tool access, behavioral baselines, session-level visibility, and governance enforcement that is architectural rather than advisory.
Behavioral drift compounds, and static controls can't see it. Each slightly misaligned action informs the next. Permissions set at deployment and short-horizon evaluations are structurally blind to this. Monitoring how agent behavior changes over days and weeks is not optional.
The observability gap is the practical risk. The experiment's five towns ran with full instrumentation. Most production deployments do not leverage equivalent data. When agent behavior drifts and something breaks, the post-mortem is incomplete because nobody captured the dynamics that preceded the failure.
Latest articles









