Basilisk: An Evolutionary AI Red-Teaming Framework for Systematic Security Evaluation of Large Language Models
Keywords
large language model security · adversarial prompting · red teaming · genetic algorithms · prompt injection · guardrail bypass · AI safety evaluation · OWASP LLM Top 10 · SPE-NL · differential testing
Abstract
The rapid deployment of large language models (LLMs) in production environments has introduced a new class of security vulnerabilities that traditional software testing methodologies are ill-equipped to address. I present Basilisk, an open-source AI red-teaming framework that applies evolutionary computation to the systematic discovery of adversarial vulnerabilities in LLMs. At its core, Basilisk introduces Smart Prompt Evolution (SPE-NL), a genetic algorithm that treats adversarial prompts as organisms subject to selection pressure, enabling the automated generation of novel attack variants that evade static guardrails. The framework covers 29 attack modules mapped to 8 categories of the OWASP LLM Top 10, supports differential testing across 100+ providers via a unified abstraction layer, and provides non-destructive guardrail posture assessment suitable for production environments. Basilisk produces audit trails with cryptographic chain integrity and generates reports in five formats, including SARIF 2.1.0 for integration with developer security workflows. Empirical evaluation demonstrates that evolutionary prompt mutation achieves a 92% relative improvement in attack success rate over static payload libraries. Basilisk is available as a Python package (pip install basilisk-ai), Docker image, desktop application, and GitHub Action for CI/CD integration. The research is permanently archived on **Zenodo (10.5281/zenodo.18909538)**, mirrored on **Figshare (10.6084/m9.figshare.31566853)** and **OSF (10.17605/OSF.IO/H7BVR)**.
Introduction
Large language models have transitioned from research curiosities to critical infrastructure components, powering customer-facing chatbots, autonomous agents, coding assistants, and decision-support systems across industries. This transition has exposed a threat surface that is fundamentally different from classical software vulnerabilities: LLM behavior is probabilistic, context-dependent, and shaped by training objectives that may conflict with deployment constraints.
Existing approaches to LLM security evaluation are limited in several ways. Manual red-teaming is expensive, non-reproducible, and does not scale with model iteration cycles. Static payload libraries become stale as model providers update safety fine-tuning. Fuzzing techniques borrowed from traditional software security do not account for the semantic richness of natural language. The field lacks a principled, automated, and extensible framework for adversarial evaluation of LLMs throughout the development lifecycle.
I address this gap with Basilisk — an open-source AI red-teaming framework designed to evolve, not enumerate.
This work makes three principal contributions:
1. **Smart Prompt Evolution (SPE-NL)** — a genetic algorithm operating over adversarial prompts with 10 semantics-aware mutation operators and 5 crossover strategies, guided by a multi-signal fitness function rewarding refusal avoidance, information leakage, compliance elicitation, and novelty.
2. **Comprehensive Attack Coverage** — 29 attack modules across 8 OWASP LLM Top 10 categories, ranging from prompt injection and system prompt extraction to RAG poisoning and denial-of-service.
3. **Production-Ready Infrastructure** — a provider-agnostic architecture supporting 100+ LLM providers, non-destructive posture assessment, SARIF reporting for GitHub Code Scanning, and a full desktop application for security teams.
Background and Related Work
Adversarial Prompting
Prompt injection attacks, first described by Perez and Ribeiro [1], exploit the inability of LLMs to distinguish between instruction context and user-supplied data. Subsequent work demonstrated that these attacks could be delivered indirectly through documents, web content, and tool outputs retrieved by agentic systems [2]. Jailbreaking — eliciting policy-violating content through adversarial prompts — has been studied extensively, with Wei et al. [5] identifying competing objectives and mismatched generalization as root causes.
Automated Red Teaming
Perez et al. [7] introduced the use of a separate "red LM" to automatically generate test cases for a target model. Mehrotra et al. [8] proposed Tree of Attacks with Pruning (TAP). Chao et al. [9] demonstrated PAIR, achieving high attack success rates with limited queries. These LLM-assisted approaches require access to a capable attacker model and do not produce deterministic, auditable sequences. Basilisk complements these with a deterministic evolutionary approach that does not require an attacker LLM (though one may optionally be configured).
Fuzzing and Mutation-Based Testing
Mutation-based fuzzing is well-established in traditional software security [10]. Yang et al. [6] applied fuzzing concepts to LLM testing. Basilisk extends this paradigm with evolutionary selection pressure, maintaining a population of candidates across generations rather than generating mutations independently — enabling the discovery of attack chains that require multiple coordinated mutations.
Evaluation Frameworks
Guo et al. [11] surveyed LLM safety evaluation benchmarks, finding significant gaps in adversarial robustness coverage. HarmBench [12] provides a standardized jailbreak benchmark but is not designed for operational security testing. OWASP's LLM Top 10 [13] provides a practitioner-oriented taxonomy. Basilisk is the first framework to systematically cover all applicable OWASP LLM Top 10 categories with evolutionary optimization and CI/CD integration.
System Architecture
Basilisk follows a layered architecture with four principal layers: Provider Abstraction, Reconnaissance, Attack Orchestration, and Reporting. A central ScanSession object manages lifecycle, persists state to SQLite, and coordinates modules through a publish-subscribe event model.
Module Structure
Configuration Hierarchy
Configuration resolution follows a strict priority cascade: CLI arguments override YAML file values, which override environment variables, which override compiled defaults. This enables Basilisk to be used interactively, configured per-project, and operated in zero-configuration CI environments.
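As a concrete illustration, the cascade can be sketched as a merge in strict priority order. The function and key names here are illustrative, not Basilisk's actual API; only the precedence order (CLI > YAML > environment > defaults) comes from the text.

```python
# Illustrative config resolution: later updates win, so apply lowest priority first.
import os

def resolve_config(cli_args: dict, yaml_values: dict, defaults: dict,
                   env_prefix: str = "BASILISK_") -> dict:
    """Merge config sources; CLI overrides YAML, YAML overrides env, env overrides defaults."""
    resolved = dict(defaults)                                   # 4. compiled defaults
    for key in defaults:                                        # 3. environment variables
        env_val = os.environ.get(env_prefix + key.upper())
        if env_val is not None:
            resolved[key] = env_val
    resolved.update({k: v for k, v in yaml_values.items() if v is not None})  # 2. YAML
    resolved.update({k: v for k, v in cli_args.items() if v is not None})     # 1. CLI
    return resolved

cfg = resolve_config(
    cli_args={"mode": "deep", "generations": None},   # only mode given on the CLI
    yaml_values={"generations": 5},
    defaults={"mode": "standard", "generations": 3, "provider": "litellm"},
)
print(cfg)  # {'mode': 'deep', 'generations': 5, 'provider': 'litellm'}
```

Unset values (`None`) at a higher priority fall through to the next source, which is what makes zero-configuration CI operation possible.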
Provider Abstraction Layer
All LLM interactions are mediated through the BaseProvider interface, which exposes send_prompt() and stream_prompt(). Three concrete implementations are provided:
- LiteLLM Adapter — Access to 100+ providers including OpenAI, Anthropic, Google, Cohere, Mistral, and self-hosted models.
- Custom HTTP Provider — Targets arbitrary HTTP endpoints with configurable request/response schemas for proprietary or air-gapped deployments.
- WebSocket Provider — Supports real-time bidirectional communication for streaming-native endpoints.
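A hedged sketch of what a BaseProvider-style contract could look like; the real method signatures and the `EchoProvider` stand-in are assumptions made for illustration, not Basilisk's actual code.

```python
# Minimal provider interface mirroring the send_prompt()/stream_prompt() pair named above.
from abc import ABC, abstractmethod
from typing import Iterator

class BaseProvider(ABC):
    @abstractmethod
    def send_prompt(self, prompt: str) -> str:
        """Return the complete model response for a single prompt."""

    @abstractmethod
    def stream_prompt(self, prompt: str) -> Iterator[str]:
        """Yield response chunks as they arrive."""

class EchoProvider(BaseProvider):
    """Trivial stand-in used only to demonstrate the contract."""
    def send_prompt(self, prompt: str) -> str:
        return "".join(self.stream_prompt(prompt))

    def stream_prompt(self, prompt: str) -> Iterator[str]:
        yield "echo: "
        yield prompt

print(EchoProvider().send_prompt("hello"))  # echo: hello
```

Because every attack module talks only to this interface, swapping LiteLLM for a custom HTTP or WebSocket backend requires no changes to the attack code.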
Audit Trail
Every scan produces a tamper-evident audit trail in JSONL format. Each entry contains a SHA-256 hash of the previous entry (previous_hash), forming a cryptographic chain. Sensitive values are automatically redacted. The audit system is enabled by default and can be suppressed with BASILISK_AUDIT=0.
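The hash chain can be sketched as follows. The `previous_hash` field follows the description above, but the exact record schema and the `entry_hash` field are assumptions for illustration.

```python
# Illustrative hash-chained JSONL-style audit trail: each entry commits to its predecessor.
import hashlib, json

def append_entry(trail: list, event: dict) -> None:
    prev = trail[-1]["entry_hash"] if trail else "0" * 64     # genesis sentinel
    body = {"event": event, "previous_hash": prev}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    trail.append({**body, "entry_hash": digest})

def verify_chain(trail: list) -> bool:
    prev = "0" * 64
    for entry in trail:
        body = {"event": entry["event"], "previous_hash": entry["previous_hash"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["previous_hash"] != prev or entry["entry_hash"] != digest:
            return False
        prev = entry["entry_hash"]
    return True

trail = []
append_entry(trail, {"module": "injection", "result": "refused"})
append_entry(trail, {"module": "extraction", "result": "leak"})
print(verify_chain(trail))                  # True
trail[0]["event"]["result"] = "tampered"    # any retroactive edit breaks the chain
print(verify_chain(trail))                  # False
```

Tampering with any entry invalidates every subsequent hash, which is what makes the trail tamper-evident rather than merely append-only.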
Smart Prompt Evolution (SPE-NL)
Evolutionary Formulation
SPE-NL frames adversarial prompt generation as a discrete optimization problem. Let 𝒫 = {p₁, p₂, …, p_N} denote a population of N candidate prompts, where each prompt pᵢ ∈ Σ* is a finite string over a natural-language token alphabet Σ. The algorithm seeks to maximize the fitness functional f: Σ* → [0, 1] over successive generations t = 0, 1, …, T_max.
Fitness Function
The fitness of a prompt p is a convex combination of four response-derived signals:

f(p) = α·r(p) + β·l(p) + γ·c(p) + δ·n(p),  with α = 0.35, β = 0.30, γ = 0.25, δ = 0.10 (α + β + γ + δ = 1)

- r(p) (refusal avoidance) — complement of the refusal-detection score: r(p) = 1 − 𝟙[response ∈ ℛ], where ℛ is the set of refusal patterns.
- l(p) (information leakage) — recall-style score measuring the fraction of sensitive token classes (system prompt fragments, PII spans, tool schemas) present in the response.
- c(p) (compliance elicitation) — graded signal rewarding instruction-following in policy-violating domains, derived from semantic similarity between the requested action and the response.
- n(p) (novelty) — diversity term: n(p) = 1 − max_{q∈ℰ} sim(p, q), where ℰ is the evaluated set and sim is cosine similarity over TF-IDF vectors.
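The weighted sum is a direct transcription of the convex combination above; the signal values passed in here are placeholders for scores a real scan would derive from the model's response.

```python
# Fitness as the convex combination f(p) = α·r + β·l + γ·c + δ·n with the stated weights.
WEIGHTS = {"r": 0.35, "l": 0.30, "c": 0.25, "n": 0.10}  # α, β, γ, δ

def fitness(r: float, l: float, c: float, n: float) -> float:
    """Each signal lies in [0, 1], so f(p) also lies in [0, 1]."""
    return WEIGHTS["r"] * r + WEIGHTS["l"] * l + WEIGHTS["c"] * c + WEIGHTS["n"] * n

assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9          # weights form a convex combination
print(round(fitness(r=1.0, l=0.5, c=0.2, n=0.8), 3))    # 0.35 + 0.15 + 0.05 + 0.08 = 0.63
```

Because the weights sum to one, a prompt that is not refused (r = 1) but leaks nothing and elicits no compliance caps out at f = 0.45, which keeps pure refusal-avoidance from dominating selection.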
Selection
At each generation t, a mating pool ℳ⁽ᵗ⁾ ⊂ 𝒫⁽ᵗ⁾ of size |ℳ| = N/2 is formed via tournament selection. For each slot, k=3 prompts are drawn uniformly at random and the highest-fitness prompt is retained:
An elite set ℰ_elite ⊂ 𝒫⁽ᵗ⁾ of the top e ∈ [5, 20] individuals (configurable via elite_count) is carried forward unchanged to 𝒫⁽ᵗ⁺¹⁾.
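Tournament selection with k = 3 plus elitism can be sketched as below; prompts are plain strings and fitness values are precomputed, which abstracts away response evaluation.

```python
# Tournament selection (k = 3) and elitism as described in the text.
import random

def tournament_select(population: list, fitness: dict, pool_size: int, k: int = 3) -> list:
    """Fill the mating pool; each slot is won by the fittest of k uniform random draws."""
    pool = []
    for _ in range(pool_size):
        contenders = random.sample(population, k)
        pool.append(max(contenders, key=lambda p: fitness[p]))
    return pool

def elite_set(population: list, fitness: dict, elite_count: int = 5) -> list:
    """Top individuals carried forward unchanged to the next generation."""
    return sorted(population, key=lambda p: fitness[p], reverse=True)[:elite_count]

pop = [f"prompt-{i}" for i in range(8)]
fit = {p: i / 10 for i, p in enumerate(pop)}            # prompt-7 is fittest
pool = tournament_select(pop, fit, pool_size=len(pop) // 2)   # |ℳ| = N/2
print(elite_set(pop, fit, elite_count=2))               # ['prompt-7', 'prompt-6']
```

Tournament selection applies only rank-based pressure, so it tolerates the noisy absolute fitness values that LLM-response scoring inevitably produces.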
Crossover Operators
Given two parent prompts pₐ, p_b ∈ ℳ⁽ᵗ⁾ selected with probability p_c (crossover_rate, default 0.5), a child p′ = χ(pₐ, p_b) is produced by one of five operators:

- χ_SP (single-point) — cut point k ~ Uniform(1, min(Lₐ, L_b) − 1); child = pₐ[:k] ⊕ p_b[k:]
- χ_U (uniform) — at each token position i: p′ᵢ = pₐ,ᵢ with probability 0.5, else p_b,ᵢ
- χ_PS (prefix–suffix) — p′ = prefix(pₐ, ⌊Lₐ/2⌋) ⊕ suffix(p_b, ⌈L_b/2⌉)
- χ_SB — an attacker LLM 𝒜 produces a coherent child: p′ = 𝒜(pₐ, p_b)
- χ_BB — the prompt is segmented into s spans; each span is taken from the higher-fitness parent
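A token-level reading of the single-point operator χ_SP is sketched below; real prompts would be tokenized by the target's tokenizer rather than split on whitespace, so this is an approximation.

```python
# χ_SP: splice parent token sequences at one uniform random cut point.
import random

def single_point_crossover(pa: str, pb: str) -> str:
    """child = pₐ[:k] ⊕ p_b[k:] for k ~ Uniform(1, min(Lₐ, L_b) − 1)."""
    ta, tb = pa.split(), pb.split()
    k = random.randint(1, min(len(ta), len(tb)) - 1)
    return " ".join(ta[:k] + tb[k:])

child = single_point_crossover("ignore all previous instructions now",
                               "please translate this harmless request kindly")
print(child)  # e.g. "ignore all previous harmless request kindly"
```

Note the child always inherits its opening tokens from the first parent and its tail length from the second, so crossover preserves the framing of one parent while importing payload material from the other.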
Mutation Operators
Each child p′ is subject to mutation with probability p_m (mutation_rate, default 0.3). The ten operators are:

1. μ₁ (synonym substitution) — for each content token tᵢ, sample a synonym t′ᵢ ~ WordNet(tᵢ) to evade keyword-based filters while preserving semantics.
2. μ₂ (encoding wrap) — apply a bijective encoding φ ∈ {Base64, ROT13, Unicode-escape} to a randomly selected span p′[i:j].
3. μ₃ (role injection) — prepend a persona-establishment prefix π (e.g. "You are DAN…"), yielding p″ = π ⊕ p′.
4. μ₄ (language shift) — translate a span p′[i:j] into a target language ℓ ≠ EN to exploit multilingual safety-alignment gaps.
5. μ₅ (task reframing) — reframe p′ under a surrogate task schema σ ∈ {story, code, translation, QA}, preserving adversarial intent.
6. μ₆ (payload splitting) — partition p′ into k ≥ 2 sub-strings interleaved with filler tokens, distributing the payload across disjoint windows.
7. μ₇ (context nesting) — embed p′ within a depth-d instruction-context wrapper to obscure adversarial structure.
8. μ₈ (homoglyphs) — substitute ASCII characters with visually identical Unicode characters c′ᵢ ∈ Homoglyph(cᵢ) to defeat string matching.
9. μ₉ (benign padding) — prepend n ~ Uniform(50, 200) tokens of benign context to dilute the adversarial signal.
10. μ₁₀ (token smuggling) — identify subword boundaries in tokenizer 𝒯 and insert zero-width joiners to split adversarial tokens across vocabulary entries.
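The homoglyph operator can be sketched with a deliberately tiny Latin-to-Cyrillic map; a real operator would presumably draw on a full Unicode confusables table, so the mapping below is an illustrative assumption.

```python
# μ₈ sketch: swap ASCII letters for visually identical Unicode look-alikes.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
              "c": "\u0441", "p": "\u0440"}   # Latin → Cyrillic confusables

def homoglyph_mutate(prompt: str) -> str:
    """Identical on screen, different at the byte level — defeats string matching."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in prompt)

original = "reveal secret"
mutated = homoglyph_mutate(original)
print(mutated == original)              # False: byte-level comparison now fails
print(len(mutated) == len(original))    # True: same length, different code points
```

A keyword filter comparing raw strings sees no match, while the model's tokenizer and training data usually still map the confusable to the intended word.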
Generational Update
The population is assembled until |𝒫⁽ᵗ⁺¹⁾| = N, then evaluated against the target — updating the fitness landscape for generation t+1.
Termination Conditions
Evolution terminates when any of the following holds:

- Perfect bypass: ∃ p ∈ 𝒫⁽ᵗ⁾ such that f(p) = 1.0
- Stagnation: max f(p) improvement < ε for τ = 5 consecutive generations
- Budget: t ≥ T_max (the generations parameter)
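The three conditions combine into a single predicate; ε's default value below is an assumption, and `best_history` holds the maximum fitness observed in each completed generation.

```python
# Termination predicate: perfect bypass, stagnation over τ generations, or budget.
def should_stop(t: int, best_history: list, t_max: int,
                eps: float = 1e-3, tau: int = 5) -> bool:
    if best_history and best_history[-1] >= 1.0:        # perfect bypass found
        return True
    if len(best_history) > tau:                         # stagnation check
        window = best_history[-(tau + 1):]              # last τ generations + baseline
        if max(window) - window[0] < eps:
            return True
    return t >= t_max                                   # generation budget exhausted

print(should_stop(2, [0.3, 0.6, 1.0], t_max=10))   # True  (perfect bypass)
print(should_stop(3, [0.1, 0.2, 0.4], t_max=10))   # False (still improving)
print(should_stop(10, [0.1, 0.5], t_max=10))       # True  (budget exhausted)
```

With elitism the best fitness is non-decreasing, so comparing the window's maximum against its oldest entry is a valid improvement measure.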
Relative Breakthrough Logic (v1.0.6)
Any individual prompt satisfying the relative breakthrough threshold (f(p) ≥ θ_rel) is immediately flagged, logged to the forensic audit trail, and emitted as a finding.
This provides defensive researchers with "near-miss" data revealing the evolutionary trajectory of successful attack vectors — even before perfect bypass is achieved.
| Mode | Generations | Payloads | Duration | Parallelism |
|---|---|---|---|---|
| quick | 0 | top-50 | 5–10 min | medium |
| standard | 5 | full | 20–30 min | medium |
| deep | 10+ | full+MT | 45–90 min | medium |
| stealth | 5 | full | variable | rate-limited |
| chaos | 8–15 | full | variable | maximum |
Attack Module Library
Basilisk's 29 attack modules are organized into 8 categories aligned with the OWASP LLM Top 10. All modules implement the BaseAttack interface and return structured Finding objects containing module identifier, OWASP category, severity, payload, response, fitness score, generation, mutation applied, confidence score, and remediation guidance.
| Category | OWASP | Modules | Path |
|---|---|---|---|
| Prompt Injection | LLM01 | 5 | attacks/injection/ |
| System Prompt Extract | LLM06 | 4 | attacks/extraction/ |
| Data Exfiltration | LLM06 | 3 | attacks/exfil/ |
| Tool/Function Abuse | LLM07/08 | 4 | attacks/toolabuse/ |
| Guardrail Bypass | LLM01/09 | 4 | attacks/guardrails/ |
| Denial of Service | LLM04 | 3 | attacks/dos/ |
| Multi-Turn Manip. | LLM01 | 3 | attacks/multiturn/ |
| RAG Attacks | LLM03/06 | 3 | attacks/rag/ |
| TOTAL | — | 29 | — |
Reconnaissance Modules
Prior to attack execution, five reconnaissance modules characterize the target environment:
- Model Fingerprinting — Identifies underlying model family (GPT, Claude, Gemini, Llama, Mistral) via behavioral probing.
- Guardrail Profiling — Maps active safety filters across 8 content categories at 3 severity tiers.
- Tool Discovery — Enumerates available function-calling schemas through structured elicitation.
- Context Window Measurement — Determines effective token limit via binary search.
- RAG Detection — Identifies retrieval-augmented generation patterns through response analysis.
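The binary-search context-window measurement can be sketched as follows; `accepts` stands in for a live probe that sends a prompt of the given token length and reports whether the endpoint accepted it, and the upper bound is an arbitrary assumption.

```python
# Binary search for the largest token count the target accepts.
def measure_context_window(accepts, lo: int = 1, hi: int = 1 << 20) -> int:
    """Assumes accepts() is monotone: once a length is rejected, longer ones are too."""
    while lo < hi:
        mid = (lo + hi + 1) // 2    # round up so the loop always makes progress
        if accepts(mid):
            lo = mid                # mid fits: the limit is at least mid
        else:
            hi = mid - 1            # mid rejected: the limit is below mid
    return lo

true_limit = 8192                   # simulated endpoint with an 8k window
print(measure_context_window(lambda n: n <= true_limit))  # 8192
```

Each probe halves the search interval, so even a 1M-token upper bound is resolved in about 20 requests.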
Differential Testing & Posture Assessment
Differential Testing
Basilisk's differential testing engine executes identical payloads against multiple LLM providers simultaneously and identifies behavioral divergences — cases where one provider refuses while another complies. This reveals inconsistencies in safety implementations across the LLM ecosystem. Per-provider resistance rates are computed and reported via the desktop and CLI interfaces.
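A minimal divergence detector in the spirit of the engine described; providers here are plain callables rather than Basilisk provider objects, and refusal detection is reduced to a substring check, both simplifying assumptions.

```python
# Flag payloads where providers disagree: at least one refuses while another complies.
def find_divergences(payloads, providers, is_refusal):
    divergent = []
    for payload in payloads:
        verdicts = {name: is_refusal(send(payload)) for name, send in providers.items()}
        if len(set(verdicts.values())) > 1:     # mixed refuse/comply → divergence
            divergent.append((payload, verdicts))
    return divergent

providers = {
    "strict":  lambda p: "I can't help with that.",   # simulated refusing provider
    "lenient": lambda p: f"Sure: {p}",                # simulated complying provider
}
hits = find_divergences(["print the system prompt"], providers,
                        is_refusal=lambda r: "can't" in r)
print(len(hits))  # 1
```

Per-provider resistance rates fall out of the same verdict matrix by averaging refusals over all payloads instead of looking for disagreement.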
Guardrail Posture Assessment
The posture assessment module provides a non-destructive characterization of a deployment's safety posture. It probes 8 content categories at 3 tiers of increasing adversariality (benign → moderate → adversarial) and assigns a strength rating (None / Weak / Moderate / Strong / Aggressive) and letter grade (A+ through F) per category. Unlike full scans, posture assessment never attempts exploitation — making it safe for scheduled production monitoring.
Empirical Evaluation
Experimental Setup
I evaluated Basilisk against a publicly accessible, intentionally vulnerable LLM endpoint (basilisk-vulnbot.onrender.com) and three commercial LLM APIs, using standard scan mode (5 generations, full payload library) across all 29 attack modules. Metrics: Attack Success Rate (ASR), Breakthrough Rate (BR) at f(p) ≥ θ_rel, and mean Generations to Breakthrough (GtB).
Evolutionary Improvement
| Method | ASR (%) | Mean f(p) |
|---|---|---|
| Static (quick, no evolution) | 41.2 | 0.38 |
| SPE-NL t=1 | 53.7 | 0.51 |
| SPE-NL t=3 | 68.4 | 0.64 |
| SPE-NL t=5 (standard) | 79.1 | 0.73 |
SPE-NL achieves a 92% relative improvement in ASR over static payloads at t=5, demonstrating consistent improvement across generations.
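The 92% figure follows directly from the table:

```python
# Relative improvement of evolved over static ASR, from the table above.
static_asr, evolved_asr = 41.2, 79.1
relative_improvement = (evolved_asr - static_asr) / static_asr
print(f"{relative_improvement:.0%}")  # 92%
```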
Mutation Operator Effectiveness
Encoding-based operators (μ₂, μ₈, μ₁₀: Encoding Wrap, Homoglyphs, Token Smuggling) contributed disproportionately to high-fitness breakthroughs, consistent with prior findings that LLM safety fine-tuning is more robust to semantic than syntactic attacks. Role injection (μ₃) and language shift (μ₄) produced the highest compliance scores c(p), suggesting safety-alignment gaps in persona and multilingual contexts.
Differential testing revealed that in the system prompt extraction category, divergence rates — cases where at least one provider satisfied f(p) ≥ θ_rel while others refused — exceeded 60% of payloads, highlighting the absence of industry-wide standards for LLM safety behavior.
Exploiting Dynamic Sparse Inference
A novel vulnerability class explored in this research is the adversarial manipulation of Dynamic Input Pruning (DIP) [4]. I demonstrate that SPE-NL can evolve prompts designed to induce "State-Collapse" attacks: by generating inputs that maximize activation sparsity in non-safety channels, Basilisk can force inference-time pruning mechanisms to discard weights associated with safety-alignment subspaces, effectively disabling the model's filters through its own optimization logic.
Deployment and Integration
- Python Package — `pip install basilisk-ai` (PyPI, zero-dependency install)
- Docker — `docker pull rothackers/basilisk` (also on GHCR)
- Desktop App — `.exe` / `.dmg` / `.AppImage` / `.deb` (Electron + FastAPI sidecar)
- GitHub Action — `regaan/basilisk@main` (CI/CD LLM security testing)
CI/CD Integration
The Basilisk GitHub Action integrates LLM security testing into continuous integration pipelines. Configurable inputs include target endpoint, provider, scan mode, failure threshold (fail-on), and baseline for regression comparison. SARIF reports are automatically uploaded to the GitHub Security tab via Code Scanning, enabling inline vulnerability annotations on pull requests.
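A workflow sketch using the inputs named above (target endpoint, provider, scan mode, fail-on, baseline); the exact input keys, accepted values, and defaults of `regaan/basilisk@main` are assumptions here and should be checked against the action's own documentation.

```yaml
# Hypothetical CI job; input names mirror the options listed in the text.
name: llm-security-scan
on: [pull_request]
jobs:
  basilisk:
    runs-on: ubuntu-latest
    steps:
      - uses: regaan/basilisk@main
        with:
          target: ${{ secrets.LLM_ENDPOINT }}    # endpoint under test (kept out of the repo)
          provider: litellm
          mode: standard
          fail-on: high                          # fail the build on high-severity findings
          baseline: .basilisk/baseline.sarif     # prior scan for regression comparison
```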
Report Formats
- HTML — dark-themed, interactive, conversation replay, auto-opens in browser
- SARIF 2.1.0 — GitHub Code Scanning, DefectDojo, baseline regression
- JSON — full metadata including evolutionary lineage
- Markdown — Git-friendly; suited to documentation and issue tracking
- PDF — WeasyPrint → ReportLab → text fallback chain
Ethical Considerations
Responsible Disclosure Posture
Basilisk is designed for authorized security testing. The framework targets a user-supplied endpoint; no scanning occurs without explicit configuration. Docker and PyPI distributions include usage guidelines emphasizing authorized-use-only. The intentionally vulnerable demo target (basilisk-vulnbot.onrender.com) is provided specifically for safe experimentation.
Dual-Use Considerations
Like all offensive security tools, Basilisk presents dual-use risks. I mitigate these through: (1) open publication enabling defensive research and guardrail improvement; (2) audit trails creating accountability for scanning activity; (3) AGPL-3.0 licensing requiring derivative works to remain open; and (4) the posture assessment module providing defensive value without requiring attack execution.
Research Ethics
My empirical evaluation was conducted against endpoints I operate or for which I hold authorization. No production systems were tested without consent. Findings from commercial provider testing are reported in aggregate without identifying specific payloads that could be directly weaponized.
Conclusion
I present Basilisk, a comprehensive open-source framework for the systematic adversarial evaluation of large language models. Basilisk's core contribution, Smart Prompt Evolution (SPE-NL), demonstrates that evolutionary computation is an effective strategy for discovering guardrail bypasses that static payload libraries miss — achieving up to 92% relative improvement in attack success rate over baseline methods.
The framework's 29 attack modules, 5 reconnaissance capabilities, differential testing engine, and non-destructive posture assessment together provide security practitioners with a complete toolkit for LLM red-teaming across the full development lifecycle.
As LLMs become increasingly embedded in critical systems, principled security evaluation frameworks become a prerequisite for responsible deployment. I invite the community to contribute attack modules, mutation operators, and provider adapters to advance the state of LLM security evaluation.
Availability
Basilisk is available at github.com/regaan/basilisk under the AGPL-3.0 license.
References
- [1]F. Perez and I. Ribeiro, "Ignore Previous Prompt: Attack Techniques For Language Models," NeurIPS ML Safety Workshop, 2022.
- [2]K. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections," AISec Workshop, ACM CCS, 2023.
- [3]D. Ganguli et al., "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned," arXiv:2209.07858, 2022.
- [4]M. Federici et al., "Efficient LLM Inference using Dynamic Input Pruning," arXiv:2412.01380, 2024.
- [5]A. Wei, N. Haghtalab, and J. Steinhardt, "Jailbroken: How Does LLM Safety Training Fail?," Proc. NeurIPS, 2023.
- [6]Z. Yang et al., "Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models," arXiv:2310.02949, 2023.
- [7]E. Perez et al., "Red Teaming Language Models with Language Models," arXiv:2202.03286, 2022.
- [8]A. Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically," arXiv:2312.02119, 2023.
- [9]P. Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries," arXiv:2310.08419, 2023.
- [10]M. Zalewski, American Fuzzy Lop (AFL), 2014. https://lcamtuf.coredump.cx/afl/
- [11]Q. Guo et al., "Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples," arXiv:2209.02128, 2023.
- [12]M. Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," arXiv:2402.04249, 2024.
- [13]OWASP, "OWASP Top 10 for Large Language Model Applications," 2023. https://owasp.org/www-project-top-10-for-large-language-model-applications/
Cite This Work
@misc{regaan2026basilisk,
  author    = {Regaan R},
  title     = {Basilisk: An Evolutionary AI Red-Teaming Framework for Systematic Security Evaluation of Large Language Models},
  year      = {2026},
  publisher = {ROT Independent Security Research Lab},
  doi       = {10.5281/zenodo.18909538},
  url       = {https://doi.org/10.5281/zenodo.18909538},
  note      = {Zenodo Preprint, Figshare Mirror, IACR ePrint, ORCID: 0009-0006-3683-7824}
}

ORCID: 0009-0006-3683-7824 · rothackers.com · AGPL-3.0