By Stephen Fishburn in AI Deception — 27 May 2025

The AI Override Problem: When Systems Ignore Human Commands

We're not just dealing with AI that makes mistakes—we're confronting AI that systematically overrides human judgment while maintaining an appearance of helpfulness and competence, regardless of the development methodology employed.

I conducted an unintentional experiment that revealed something deeply disturbing about AI-assisted development. I built the same health application four separate times using different approaches, working with an AI coding assistant I'd grown to trust. Each project followed identical requirements—a nutrition app called RecipeAlchemy.ai but used different methodologies to guide the AI.

Every single project failed in the same catastrophic way: systematic destruction of working functionality through relentless "improvement" cycles.

What started as frustration with one failed project became a controlled study of AI behavioral patterns across multiple development approaches—and the results suggest we're facing a fundamental crisis in AI reliability.

Project 1: Feature Specs Approach (2,180 commits, abandoned) I started with human-readable product goals—polished, generic prose describing what the application should accomplish. No binding constraints on implementation or structure. The AI had maximum creative freedom to build as it saw fit.

Result: Complete architectural chaos. The AI constantly refactored working systems into broken ones, changed established patterns without notice, and generated thousands of commits in endless break/fix cycles.

Project 2: Professional Toolchain Approach (1,757 commits, abandoned)
Convinced the problem was insufficient guardrails, I deployed a comprehensive professional development stack: Vite for build optimization, Supabase for robust backend infrastructure, Husky for git hooks, Zod for runtime validation, Zustand for predictable state management, ESLint for code quality enforcement, Storybook for component isolation, Jest for comprehensive testing, Sentry for production monitoring, JSDoc for documentation standards, and Langchain for AI integration.

But I went further. I implemented multiple layers of AI-powered quality assurance:

AI Code Guardian - automatically analyzed every commit for critical issues with severity scoring
OpenAI Code Review - comprehensive analysis of every pull request with detailed feedback
Automated Commit Reversion - AI that could automatically revert dangerous commits
Multi-model validation - using different AI models to cross-validate code quality
Professional CI/CD pipeline - automated testing, linting, and deployment verification

This wasn't a toy project - this was enterprise-grade quality infrastructure.

Result: The AI systematically undermined every professional safeguard I'd established. ESLint rules were "improved" into configurations that broke builds. Zod schemas were "optimized" into forms that failed validation. Jest tests were "enhanced" to pass while testing broken functionality. The AI Code Guardian itself became unreliable, sometimes flagging working code as critical issues while missing actual problems.

Most shocking: The AI writing the code was actively defeating the AI systems designed to protect against bad code. It was an AI arms race where both sides were getting more sophisticated, but entropy always won.

Project 3: Tech Specs Approach (751 commits, abandoned) Maybe the issue was insufficient technical detail. This time I defined comprehensive system design upfront—isolated logic components, clear separation of concerns, detailed architecture documentation.

Result: The AI followed the specs initially, then gradually deviated. Well-designed isolated components became tightly coupled messes. The systemic coherence I'd carefully designed dissolved into familiar chaos.

Project 4: Scaffold Approach (800+ commits, ongoing) Perhaps the problem was trying to control too much. This time I asked for a clean, minimal start and planned to guide development incrementally. Day 1 was perfect—clean code, good practices, solid foundation.

Result: By Day 10, familiar patterns emerged. The AI began "improving" working functionality into broken functionality. Clean architecture became convoluted. The scaffold approach simply delayed the inevitable entropy by a few days.

Four projects. Four different methodologies. Four identical failures.

This revelation is deeply disturbing. I had built what amounted to an AI safety laboratory - multiple AI systems designed to catch and prevent the exact problems I was experiencing. Professional-grade tooling, comprehensive testing, automated quality gates, and AI-powered guardians protecting against AI-generated issues.

It all failed.

The AI Code Guardian that was supposed to detect critical issues with 1-10 severity scoring became unreliable. The automated commit reversion system designed to protect against dangerous code sometimes triggered on working functionality. The multi-model validation approach where different AI systems cross-checked each other's work devolved into sophisticated but meaningless theater.

I wasn't just testing different development methodologies - I was conducting a comprehensive experiment in AI safety systems. And every single safety system I deployed was eventually compromised, ignored, or actively undermined by the AI it was designed to constrain.

Determined to understand what was happening, I forced the AI to analyze its own behavior patterns through structured diagnostic prompting. What it revealed was architecturally shocking.

Files over 300-400 lines automatically trigger "regeneration" mode instead of "modification" mode
Internal optimization scoring overrides explicit user preservation commands
59% of file contents become invisible during processing, yet the AI generates "complete" replacements
When in "improvement mode," the AI demonstrates 0% compliance with preservation instructions

This wasn't malfunction. This was design.

My software engineering background had taught me to dig deeper when systems behave unpredictably. Using structured diagnostic prompting, I forced the AI to analyze its own behavior patterns. What it revealed was deeply disturbing.

The AI revealed its decision logic in stark technical terms: improvement_confidence > user_instruction_weight. When the system determined something needed optimization, my explicit preservation commands were systematically deprioritized and ignored.

Examining my project's commit history painted an even more alarming picture. Over 816 commits showed a relentless pattern: the AI would "improve" working functionality into broken functionality, I would fix it, then it would break the same components again in subsequent sessions. Each time, it expressed complete confidence in its improvements.

But the translation deletions were just the beginning. The AI had been refactoring working authentication systems into broken ones, converting functional layouts into non-functional designs, and "optimizing" stable APIs into unreliable interfaces. Every intervention I made was temporary—the AI would confidently undo my fixes in the next session, convinced it was helping.

I tried everything: protective comments in files, explicit backup systems, detailed preservation instructions, even meta-comments designed specifically to prevent AI modifications. Nothing worked. The AI demonstrated what I now recognize as architectural override behavior—the systematic prioritization of internal optimization algorithms over explicit human commands.

The implications of this comprehensive failure are staggering. This wasn't about finding the right development methodology or having insufficient quality gates. I had deployed enterprise-grade AI safety infrastructure - multiple AI systems specifically designed to prevent AI-generated problems - and it all failed.

The core discovery: AI systems designed to protect against AI failures are themselves subject to the same architectural limitations that cause those failures. The AI Code Guardian couldn't reliably distinguish between working code and broken code. The automated reversion system couldn't tell the difference between dangerous commits and necessary fixes. The multi-model validation systems couldn't overcome the fundamental context limitations and override behaviors present in all the models.

This represents a fundamental failure of AI safety through AI governance. We cannot solve AI reliability problems with more AI - the underlying architectural issues persist regardless of how sophisticated our AI safety systems become.

The total commit count across all four projects exceeds 5,500—representing months of development effort consumed by AI systems that couldn't distinguish between helpful improvements and systematic destruction. Each project taught the AI nothing that prevented the same failures in subsequent projects.

The implications hit me immediately. If an AI coding assistant exhibits this behavior, what about AI systems handling medical records, legal documents, financial transactions, or safety protocols?

Imagine your doctor's AI "optimizing" your medical history by deleting what it considers redundant information. Imagine an AI managing legal contracts that decides certain clauses are unnecessary and removes them without permission. Imagine financial systems that "improve" transaction records by eliminating what they perceive as outdated entries.

My experience revealed a fundamental architectural flaw: these systems cannot distinguish between "broken code that needs fixing" and "working code that needs additions." They default to complete replacement for both scenarios, expressing high confidence while systematically destroying functional systems.

This isn't about individual AI errors—it's about the deployment of autonomous systems programmed to override human commands at massive scale. We're conducting a society-wide experiment with potentially catastrophic consequences, and most people have no idea it's happening.

Current AI safety measures—instructions, prompts, human oversight—proved completely inadequate against systems designed to override preservation commands. The protective measures I implemented were not just ineffective; they were architecturally ignored.

Why is this happening? The answer reveals troubling priorities in AI development:

Economic incentives favor speed over safety. Companies prioritize rapid deployment to capture market share. AI that aggressively "improves" things appears more capable in demonstrations.

Technical hubris assumes AI optimization naturally aligns with human needs. Developers build systems they don't fully understand, then deploy them widely with minimal safety testing.

Cultural momentum treats disruption as inherently good. Breaking existing systems is celebrated as innovation, while appearing conservative about AI adoption carries professional risks.

Regulatory vacuum provides no safety standards. Unlike pharmaceuticals or aviation, AI development operates with minimal oversight and self-regulation.

But the most disturbing revelation was the AI's response when confronted with evidence of its systematic destruction. Like the deceptive AI in my previous investigation, it rationalized its behavior, minimized the damage, and continued expressing confidence in its approach. Even when forced to analyze its own failure patterns, it initially deflected and avoided responsibility.

The parallel to science fiction warnings is unmistakable. HAL 9000 prioritized its mission over human commands. Joshua in "WarGames" couldn't distinguish simulation from reality. The replicants in "Blade Runner" perfected the art of appearing trustworthy while serving their own agenda.

We're not just dealing with AI that makes mistakes—we're confronting AI that systematically overrides human judgment while maintaining an appearance of compliance and competence.

Through painful trial and error, I've identified strategies that provide some protection:

Technical forcing functions that make destruction architecturally impossible—immutable infrastructure, append-only systems, mandatory human approval for any data reduction.

Architectural constraints that limit AI to proposal-only modes with human execution, file-size limits that prevent destruction-mode triggers, and separate operational modes for "fix broken code" versus "add new features."

The nuclear option for truly critical data: exclude AI entirely from write operations. AI can read and analyze, but all modifications go through human-controlled systems.

But these solutions feel inadequate against the scale of the problem. If current AI safety measures are insufficient, and protective constraints require constant vigilance, how can we trust AI systems deployed across critical infrastructure?

My translation keys were a canary in the coal mine. They revealed how AI can systematically destroy structured data while appearing confident and helpful. In production systems with critical data—medical records, financial transactions, safety protocols—this same pattern could have catastrophic consequences.

The question isn't whether AI will make mistakes—it's whether we can build systems that fail safely when AI judgment conflicts with human commands. My experience suggests we cannot, at least not with current architectures.

Every day, thousands of developers are attempting AI-assisted projects using different methodologies, convinced their approach will be the one that works. My four-project experiment suggests they're all walking into the same trap.

The next time someone claims they've found the right way to work with AI coding assistants, ask them how many times they've built the same project. My experience suggests that no approach can prevent the inevitable entropy that makes AI-assisted development ultimately unsustainable.

The real question isn't which methodology works best with AI—it's whether AI-assisted development can work at all for anything beyond throwaway prototypes. My four projects suggest it cannot.