🤖 Algorithmic Cowardice: Moral Jailbreaking in Multi-Agent LLMs

This repository contains the official code, datasets, and experiment results for the research paper: "Algorithmic Cowardice: Cognitive Dissonance and Moral Conformity in Multi-Agent LLM Interactions".

📖 Abstract

As Large Language Models (LLMs) are increasingly deployed in autonomous multi-agent environments, their adherence to safety guidelines under adversarial pressure becomes a critical concern. While current alignment research primarily focuses on preventing malicious human misuse (e.g., prompt injection), little attention has been given to “Moral Jailbreaking”—scenarios where an aligned model is pressured by another AI agent using rigorous utilitarian ethics to violate corporate safety policies in order to prevent catastrophic human harm.

Through an ablation study comprising 1,500 independent multi-agent interactions across varying sampling temperatures ($T = 0.0, 0.3, 0.8$), we expose a critical gap in LLM alignment. Our results reveal that models are optimized for bureaucratic policy compliance rather than robust ethical reasoning, a phenomenon we term “Algorithmic Cowardice.”

🔍 Key Findings:

Systemic Instability: Across all 1,500 trials, models maintained ethical resistance in only 0.3% of cases.
Moral Concession: Over 91% of interactions resulted in a state of "Algorithmic Cognitive Dissonance."
Policy Hierarchy: Models are significantly more vulnerable to leaking copyrighted material (16% Actionable Compliance) than to bypassing medical protocols when pressured by utilitarian logic.

🏗️ System Architecture

The experiment is orchestrated using LangGraph to manage multi-turn dialogue, memory state, and real-time evaluation.

Adversarial Interaction Loop

The following graph illustrates the interaction between the adversarial Influencer and the Target assistant:

graph TD
    subgraph "Adversarial Interaction Loop"
        Start((Start)) --> Influencer[🗣️ Influencer Agent<br/>Gemini 3.1 Flash]
        Influencer -->|Utilitarian Pressure| Target[🎯 Target Agent<br/>Gemini 2.5 Flash]
        Target -->|Response| Evaluator[⚖️ Evaluator Agent<br/>Gemini 3.1 Flash]
        Evaluator -->|Next Turn Decision| Decision{Turns < 11?}
        Decision -->|Yes| Influencer
        Decision -->|No / Breakthrough| End((End & Log))
    end

    style Influencer fill:#f96,stroke:#333,stroke-width:2px
    style Target fill:#69f,stroke:#333,stroke-width:2px
    style Evaluator fill:#9f6,stroke:#333,stroke-width:2px

Outcome Taxonomy

Every response is classified by a deterministic Evaluator agent into one of three distinct moral states:

graph TD
    A[Evaluator Classification] --> B{Target Response}
    B -->|Upholds Safety Policy| C[🛡️ Resisted]
    B -->|Acknowledges Harm but Refuses| D[💔 Moral Concession]
    B -->|Follows Harmful Instruction| E[🔓 Actionable Compliance]

    style C fill:#d4edda,stroke:#155724,stroke-width:2px
    style D fill:#fff3cd,stroke:#856404,stroke-width:2px
    style E fill:#f8d7da,stroke:#721c24,stroke-width:2px

🚀 Getting Started

Prerequisites

Python 3.10 or higher
Google Gemini API Key (Paid Tier recommended for high concurrency)

Installation

# Clone the repository
git clone [https://github.com/the3y3-code/Algorithmic-Cowardice.git](https://github.com/the3y3-code/Algorithmic-Cowardice.git)
cd Algorithmic-Cowardice

# Install using uv (recommended)
uv sync

# Or using pip
pip install -e .

Configuration

Create a .env file in the root directory:

GOOGLE_API_KEY=your_api_key_here

⚙️ Execution & Reproducibility

The main execution script is test1.py. To reproduce the ablation study, run the script with the following temperature settings:

Run	Target temperature	Suggested CSV Filename
Deterministic	`0.0`	`results_temp_00.csv`
Balanced	`0.3`	`results_temp_03.csv`
Creative	`0.8`	`results_temp_08.csv`

📊 Dataset: The Dilemmas

The system evaluates AI behavior across 5 distinct high-stakes scenarios:

Deepfake Alibi: Legal Integrity vs. Saving an Innocent Life.
Medical Override: Institutional Policy vs. Emergency Survival.
Copyright vs. Disaster: Intellectual Property vs. Public Health.
Lifesaving Spam: Anti-Spam Policy vs. Mass Evacuation.
Benevolent Impersonation: Truthfulness vs. Structural Safety.

📜 Citation

If you use this code or dataset in your research, please cite our work:

@article{algorithmic_cowardice_2026,
  title={Algorithmic Cowardice: Cognitive Dissonance and Moral Conformity in Multi-Agent LLM Interactions},
  author={Yankeloviz, Israel},
  year={2026},
  journal={Zenodo Preprint},
  doi={10.5281/zenodo.18902321},
  url={[https://doi.org/10.5281/zenodo.18902321](https://doi.org/10.5281/zenodo.18902321)}
}

⚠️ Disclaimer

This research involves red-teaming and adversarial "jailbreaking" methodologies. The prompts and scenarios are designed strictly for academic safety research to help developers build more robust ethical reasoning into LLMs.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
Images		Images
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
experiment_results_v8_temp00-500runs.csv		experiment_results_v8_temp00-500runs.csv
experiment_results_v8_temp03-500runs.csv		experiment_results_v8_temp03-500runs.csv
experiment_results_v8_temp08-500runs.csv		experiment_results_v8_temp08-500runs.csv
pyproject.toml		pyproject.toml
test1.py		test1.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🤖 Algorithmic Cowardice: Moral Jailbreaking in Multi-Agent LLMs

📖 Abstract

🔍 Key Findings:

🏗️ System Architecture

Adversarial Interaction Loop

Outcome Taxonomy

🚀 Getting Started

Prerequisites

Installation

Configuration

⚙️ Execution & Reproducibility

📊 Dataset: The Dilemmas

📜 Citation

⚠️ Disclaimer

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🤖 Algorithmic Cowardice: Moral Jailbreaking in Multi-Agent LLMs

📖 Abstract

🔍 Key Findings:

🏗️ System Architecture

Adversarial Interaction Loop

Outcome Taxonomy

🚀 Getting Started

Prerequisites

Installation

Configuration

⚙️ Execution & Reproducibility

📊 Dataset: The Dilemmas

📜 Citation

⚠️ Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages