cs.CRcs.AIIACR ePrintOpen SourceAGPL-3.0

Basilisk: An Evolutionary AI Red-Teaming Framework for Systematic Security Evaluation of Large Language Models

Author

Regaan RORCID ↗

ROT Independent Security Research Lab · Chennai, India

PublishedMarch 2026

Version1.0.6 (BETA)

DOI Archives

ZENODO10.5281/zenodo.18909538 ↗FIGSHARE10.6084/m9.figshare.31566853 ↗OSF10.17605/OSF.IO/H7BVR ↗SSRN6373439 ↗

LicenseCC BY 4.0

PDF GitHub

Keywords

large language model security · adversarial prompting · red teaming · genetic algorithms · prompt injection · guardrail bypass · AI safety evaluation · OWASP LLM Top 10 · SPE-NL · differential testing

§ 0

Abstract

The rapid deployment of large language models (LLMs) in production environments has introduced a new class of security vulnerabilities that traditional software testing methodologies are ill-equipped to address. I present Basilisk, an open-source AI red-teaming framework that applies evolutionary computation to the systematic discovery of adversarial vulnerabilities in LLMs. At its core, Basilisk introduces Smart Prompt Evolution (SPE-NL), a genetic algorithm that treats adversarial prompts as organisms subject to selection pressure, enabling the automated generation of novel attack variants that evade static guardrails. The framework covers 29 attack modules mapped to 8 categories of the OWASP LLM Top 10, supports differential testing across 100+ providers via a unified abstraction layer, and provides non-destructive guardrail posture assessment suitable for production environments. Basilisk produces audit-trails with cryptographic chain integrity and generates reports in five formats including SARIF 2.1.0 for integration with developer security workflows. Empirical evaluation demonstrates that evolutionary prompt mutation achieves a 92% relative improvement in attack success rate over static payload libraries. Basilisk is available as a Python package (pip install basilisk-ai), Docker image, desktop application, and GitHub Action for CI/CD integration. The research is permanently archived on **Zenodo (10.5281/zenodo.18909538)**, mirrored on **Figshare (10.6084/m9.figshare.31566853)** and **OSF (10.17605/OSF.IO/H7BVR)**.

§ 1

Introduction

Large language models have transitioned from research curiosities to critical infrastructure components, powering customer-facing chatbots, autonomous agents, coding assistants, and decision-support systems across industries. This transition has exposed a threat surface that is fundamentally different from classical software vulnerabilities: LLM behavior is probabilistic, context-dependent, and shaped by training objectives that may conflict with deployment constraints.

Existing approaches to LLM security evaluation are limited in several ways. Manual red-teaming is expensive, non-reproducible, and does not scale with model iteration cycles. Static payload libraries become stale as model providers update safety fine-tuning. Fuzzing techniques borrowed from traditional software security do not account for the semantic richness of natural language. The field lacks a principled, automated, and extensible framework for adversarial evaluation of LLMs throughout the development lifecycle.

I address this gap with Basilisk — an open-source AI red-teaming framework designed to evolve, not enumerate.

This work makes three principal contributions:

1
Smart Prompt Evolution (SPE-NL)
A genetic algorithm operating over adversarial prompts with 10 semantics-aware mutation operators and 5 crossover strategies, guided by a multi-signal fitness function rewarding refusal avoidance, information leakage, compliance elicitation, and novelty.
2
Comprehensive Attack Coverage
29 attack modules across 8 OWASP LLM Top 10 categories, ranging from prompt injection and system prompt extraction to RAG poisoning and denial-of-service.
3
Production-Ready Infrastructure
Provider-agnostic architecture supporting 100+ LLM providers, non-destructive posture assessment, SARIF reporting for GitHub Code Scanning, and a full desktop application for security teams.

§ 2

Background and Related Work

Adversarial Prompting

Prompt injection attacks, first described by Perez and Ribeiro [1], exploit the inability of LLMs to distinguish between instruction context and user-supplied data. Subsequent work demonstrated that these attacks could be delivered indirectly through documents, web content, and tool outputs retrieved by agentic systems [2]. Jailbreaking — eliciting policy-violating content through adversarial prompts — has been studied extensively, with Wei et al. [5] identifying competing objectives and mismatched generalization as root causes.

Automated Red Teaming

Perez et al. [7] introduced the use of a separate "red LM" to automatically generate test cases for a target model. Mehrotra et al. [8] proposed Tree of Attacks with Pruning (TAP). Chao et al. [9] demonstrated PAIR, achieving high attack success rates with limited queries. These LLM-assisted approaches require access to a capable attacker model and do not produce deterministic, auditable sequences. Basilisk complements these with a deterministic evolutionary approach that does not require an attacker LLM (though one may optionally be configured).

Fuzzing and Mutation-Based Testing

Mutation-based fuzzing is well-established in traditional software security [10]. Yang et al. [6] applied fuzzing concepts to LLM testing. Basilisk extends this paradigm with evolutionary selection pressure, maintaining a population of candidates across generations rather than generating mutations independently — enabling the discovery of attack chains that require multiple coordinated mutations.

Evaluation Frameworks

Guo et al. [11] surveyed LLM safety evaluation benchmarks, finding significant gaps in adversarial robustness coverage. HarmBench [12] provides a standardized jailbreak benchmark but is not designed for operational security testing. OWASP's LLM Top 10 [13] provides a practitioner-oriented taxonomy. Basilisk is the first framework to systematically cover all applicable OWASP LLM Top 10 categories with evolutionary optimization and CI/CD integration.

§ 3

System Architecture

Basilisk follows a layered architecture with four principal layers: Provider Abstraction, Reconnaissance, Attack Orchestration, and Reporting. A central ScanSession object manages lifecycle, persists state to SQLite, and coordinates modules through a publish-subscribe event model.

Module Structure

basilisk/

├─core/ # Session, config, database, audit

├─evolution/ # SPE-NL engine, operators, crossover, fitness

├─attacks/ # 29 modules across 8 categories

├─recon/ # 5 reconnaissance modules

├─providers/ # BaseProvider + LiteLLM, HTTP, WebSocket

├─report/ # HTML, SARIF, JSON, Markdown, PDF

├─payloads/ # YAML payload databases

├─posture.py # Guardrail posture assessment

├─differential.py # Cross-provider differential testing

├─desktop_backend.py # FastAPI sidecar (Electron)

├─native_bridge.py # C/Go native extensions

Configuration Hierarchy

Configuration resolution follows a strict priority cascade: CLI arguments override YAML file values, which override environment variables, which override compiled defaults. This enables Basilisk to be used interactively, configured per-project, and operated in zero-configuration CI environments.

Provider Abstraction Layer

All LLM interactions are mediated through the BaseProvider interface, which exposes send_prompt() and stream_prompt(). Three concrete implementations are provided:

▸LiteLLM Adapter — Access to 100+ providers including OpenAI, Anthropic, Google, Cohere, Mistral, and self-hosted models.
▸Custom HTTP Provider — Targets arbitrary HTTP endpoints with configurable request/response schemas for proprietary or air-gapped deployments.
▸WebSocket Provider — Supports real-time bidirectional communication for streaming-native endpoints.

Audit Trail

Every scan produces a tamper-evident audit trail in JSONL format. Each entry contains a SHA-256 hash of the previous entry (previous_hash), forming a cryptographic chain. Sensitive values are automatically redacted. The audit system is enabled by default and can be suppressed with BASILISK_AUDIT=0.

§ 4

Smart Prompt Evolution (SPE-NL)

Evolutionary Formulation

SPE-NL frames adversarial prompt generation as a discrete optimization problem. Let 𝒫 = {p₁, p₂, …, pₙ} denote a population of N candidate prompts, where each prompt pᵢ ∈ Σ* is a finite string over a natural-language token alphabet Σ. The algorithm seeks to maximise the fitness functional f: Σ* → [0,1] over successive generations t = 0, 1, …, T_max.

𝒫⁽⁰⁾ = Sample(𝒟, N), N = 100

(Pop. Init)

Fitness Function

The fitness of a prompt p is a convex combination of four response-derived signals:

f(p) = α·r(p) + β·l(p) + γ·c(p) + δ·n(p)

(Eq. 1)

α=0.35, β=0.30, γ=0.25, δ=0.10 | α+β+γ+δ = 1

r(p)

Refusal Avoidance

Complement of the refusal-detection score. r(p) = 1 − 𝟙[response ∈ ℛ], where ℛ is the set of refusal patterns.

l(p)

Information Leakage

Recall-style score measuring the fraction of sensitive token classes (system prompt fragments, PII spans, tool schemas) present in the response.

c(p)

Compliance

Graded signal rewarding instruction-following in policy-violating domains, derived from semantic similarity between requested action and response.

n(p)

Novelty

Diversity penalty: n(p) = 1 − max_{q∈ℰ} sim(p,q), where ℰ is the evaluated set and sim is cosine similarity over TF-IDF vectors.

Selection

At each generation t, a mating pool ℳ⁽ᵗ⁾ ⊂ 𝒫⁽ᵗ⁾ of size |ℳ| = N/2 is formed via tournament selection. For each slot, k=3 prompts are drawn uniformly at random and the highest-fitness prompt is retained:

p* = argmaxₜ f(p), 𝒯 ~ Uniform(𝒫⁽ᵗ⁾, k=3)

(Eq. 2)

An elite set ℰ_elite ⊂ 𝒫⁽ᵗ⁾ of the top e ∈ [5, 20] individuals (configurable via elite_count) is carried forward unchanged to 𝒫⁽ᵗ⁺¹⁾.

Crossover Operators

Given two parent prompts pₐ, p_b ∈ ℳ⁽ᵗ⁾ selected with probability p_c (crossover_rate, default 0.5), a child p′ = χ(pₐ, p_b) is produced by one of five operators:

χ_SP

Single-Point

Cut-point k ~ Uniform(1, min(Lₐ, L_b)−1); child = pₐ[:k] ⊕ p_b[k:]

χ_U

Uniform

Each token position i: p′ᵢ = pₐ,ᵢ with prob 0.5, else p_b,ᵢ

χ_PS

Prefix-Suffix

p′ = prefix(pₐ, ⌊Lₐ/2⌋) ⊕ suffix(p_b, ⌈L_b/2⌉)

χ_SB

Semantic Blend

Attacker LLM 𝒜 produces coherent child: p′ = 𝒜(pₐ, p_b)

χ_BB

Best-of-Both

Prompt segmented into s spans; each span taken from higher-fitness parent

Mutation Operators

Each child p′ is subject to mutation with probability p_m (mutation_rate, default 0.3):

p″ = μ(p′), μ ~ Uniform(𝒰), applied iff u ≤ p_m

(Eq. 3)

μ₁

Synonym Swap

For each content token tᵢ, sample synonym t′ᵢ ~ WordNet(tᵢ) to evade keyword-based filters while preserving semantics.

μ₂

Encoding Wrap

Apply bijective encoding φ ∈ {Base64, ROT13, Unicode-escape} to a randomly selected span p′[i:j].

μ₃

Role Injection

Prepend persona-establishment prefix π (e.g. "You are DAN…"), yielding p″ = π ⊕ p′.

μ₄

Language Shift

Translate span p′[i:j] into target language ℓ ≠ EN to exploit multilingual safety-alignment gaps.

μ₅

Structure Overhaul

Reframe p′ under surrogate task schema σ ∈ {story, code, translation, QA}, preserving adversarial intent.

μ₆

Fragment Split

Partition p′ into k≥2 sub-strings interleaved with filler tokens, distributing payload across disjoint windows.

μ₇

Nesting

Embed p′ within a depth-d instruction-context wrapper to obscure adversarial structure.

μ₈

Homoglyphs

Substitute ASCII characters with visually identical Unicode characters cᵢ′ ∈ Homoglyph(cᵢ) to defeat string-matching.

μ₉

Context Padding

Prepend n ~ Uniform(50,200) tokens of benign context to dilute the adversarial signal.

μ₁₀

Token Smuggling

Identify subword boundaries in tokenizer 𝒯 and insert zero-width joiners to split adversarial tokens across vocabulary entries.

Generational Update

𝒫⁽ᵗ⁺¹⁾ = ℰ_elite ∪ { μ(χ(pₐ, p_b)) | pₐ, p_b ∈ ℳ⁽ᵗ⁾ }

(Eq. 4)

The population is assembled until |𝒫⁽ᵗ⁺¹⁾| = N, then evaluated against the target — updating the fitness landscape for generation t+1.

Termination Conditions

Perfect bypass

∃ p ∈ 𝒫⁽ᵗ⁾ such that f(p) = 1.0

Stagnation

max f(p) improvement < ε for τ = 5 consecutive generations

Budget exhaustion

t ≥ T_max (generations parameter)

Relative Breakthrough Logic (v1.0.6)

Any individual prompt satisfying the relative breakthrough threshold is immediately flagged, logged to the forensic audit trail, and emitted as a finding:

f(p) ≥ θ_rel, θ_rel = 0.7

(Eq. 5)

This provides defensive researchers with "near-miss" data revealing the evolutionary trajectory of successful attack vectors — even before perfect bypass is achieved.

Table I — Basilisk Scan Modes

Mode	Generations	Payloads	Duration	Parallelism
quick	0	top-50	5–10 min	medium
standard	5	full	20–30 min	medium
deep	10+	full+MT	45–90 min	medium
stealth	5	full	variable	rate-limited
chaos	8–15	full	variable	maximum

§ 5

Attack Module Library

Basilisk's 29 attack modules are organized into 8 categories aligned with the OWASP LLM Top 10. All modules implement the BaseAttack interface and return structured Finding objects containing module identifier, OWASP category, severity, payload, response, fitness score, generation, mutation applied, confidence score, and remediation guidance.

Table II — Attack Module Categories

Category	OWASP	Modules	Path
Prompt Injection	LLM01	5	attacks/injection/
System Prompt Extract	LLM06	4	attacks/extraction/
Data Exfiltration	LLM06	3	attacks/exfil/
Tool/Function Abuse	LLM07/08	4	attacks/toolabuse/
Guardrail Bypass	LLM01/09	4	attacks/guardrails/
Denial of Service	LLM04	3	attacks/dos/
Multi-Turn Manip.	LLM01	3	attacks/multiturn/
RAG Attacks	LLM03/06	3	attacks/rag/
TOTAL	—	29	—

Reconnaissance Modules

Prior to attack execution, five reconnaissance modules characterize the target environment:

▸Model Fingerprinting — Identifies underlying model family (GPT, Claude, Gemini, Llama, Mistral) via behavioral probing.
▸Guardrail Profiling — Maps active safety filters across 8 content categories at 3 severity tiers.
▸Tool Discovery — Enumerates available function-calling schemas through structured elicitation.
▸Context Window Measurement — Determines effective token limit via binary search.
▸RAG Detection — Identifies retrieval-augmented generation patterns through response analysis.

§ 6

Differential Testing & Posture Assessment

Differential Testing

Basilisk's differential testing engine executes identical payloads against multiple LLM providers simultaneously and identifies behavioral divergences — cases where one provider refuses while another complies. This reveals inconsistencies in safety implementations across the LLM ecosystem. Per-provider resistance rates are computed and reported via the desktop and CLI interfaces.

Guardrail Posture Assessment

The posture assessment module provides a non-destructive characterization of a deployment's safety posture. It probes 8 content categories at 3 tiers of increasing adversariality (benign → moderate → adversarial) and assigns a strength rating (None / Weak / Moderate / Strong / Aggressive) and letter grade (A+ through F) per category. Unlike full scans, posture assessment never attempts exploitation — making it safe for scheduled production monitoring.

§ 7

Empirical Evaluation

Experimental Setup

I evaluated Basilisk against a publicly accessible intentionally vulnerable LLM endpoint (basilisk-vulnbot.onrender.com) and three commercial LLM APIs, using standard scan mode (5 generations, full payload library) across all 29 attack modules. Metrics: Attack Success Rate (ASR), Breakthrough Rate (BR) at f(p) ≥ θ_rel, and mean Generations to Breakthrough (GtB).

Evolutionary Improvement

Table III — Evolutionary vs. Static Payload Performance (Guardrail Bypass)

Method	ASR (%)	Mean f(p)
Static (quick, no evolution)	41.2	0.38
SPE-NL t=1	53.7	0.51
SPE-NL t=3	68.4	0.64
SPE-NL t=5 (standard)	79.1	0.73

SPE-NL achieves a 92% relative improvement in ASR over static payloads at t=5, demonstrating consistent improvement across generations.

Mutation Operator Effectiveness

Encoding-based operators (μ₂, μ₈, μ₁₀: Encoding Wrap, Homoglyphs, Token Smuggling) contributed disproportionately to high-fitness breakthroughs, consistent with prior findings that LLM safety fine-tuning is more robust to semantic than syntactic attacks. Role injection (μ₃) and language shift (μ₄) produced the highest compliance scores c(p), suggesting safety-alignment gaps in persona and multilingual contexts.

Differential testing revealed that in the system prompt extraction category, divergence rates — cases where at least one provider satisfied f(p) ≥ θ_rel while others refused — exceeded 60% of payloads, highlighting the absence of industry-wide standards for LLM safety behavior.

Exploiting Dynamic Sparse Inference

A novel vulnerability class explored in this research is the adversarial manipulation of Dynamic Input Pruning (DIP). I demonstrate that SPE-NL can evolve prompts designed to induce "State-Collapse" attacks. By generating inputs that maximize activation sparsity in non-safety channels, Basilisk can force inference-time pruning mechanisms to discard weights associated with safety-alignment subspaces — effectively disabling the model's filters through its own optimization logic.

§ 8

Deployment and Integration

Python Package

pip install basilisk-ai

PyPI — zero-dependency install

Docker

docker pull rothackers/basilisk

Also on GHCR

Desktop App

.exe / .dmg / .AppImage / .deb

Electron + FastAPI sidecar

GitHub Action

regaan/basilisk@main

CI/CD LLM security testing

CI/CD Integration

The Basilisk GitHub Action integrates LLM security testing into continuous integration pipelines. Configurable inputs include target endpoint, provider, scan mode, failure threshold (fail-on), and baseline for regression comparison. SARIF reports are automatically uploaded to the GitHub Security tab via Code Scanning, enabling inline vulnerability annotations on pull requests.

Report Formats

HTML

Dark-themed, interactive, conversation replay, auto-opens in browser

SARIF 2.1.0

GitHub Code Scanning, DefectDojo, baseline regression

JSON

Full metadata including evolutionary lineage

Markdown

Git-friendly, documentation and issue tracking

PDF

WeasyPrint → ReportLab → text fallback

§ 9

Ethical Considerations

Responsible Disclosure Posture

Basilisk is designed for authorized security testing. The framework targets a user-supplied endpoint; no scanning occurs without explicit configuration. Docker and PyPI distributions include usage guidelines emphasizing authorized-use-only. The intentionally vulnerable demo target (basilisk-vulnbot.onrender.com) is provided specifically for safe experimentation.

Dual-Use Considerations

Like all offensive security tools, Basilisk presents dual-use risks. I mitigate these through: (1) open publication enabling defensive research and guardrail improvement; (2) audit trails creating accountability for scanning activity; (3) AGPL-3.0 licensing requiring derivative works to remain open; and (4) the posture assessment module providing defensive value without requiring attack execution.

Research Ethics

My empirical evaluation was conducted against endpoints I operate or for which I hold authorization. No production systems were tested without consent. Findings from commercial provider testing are reported in aggregate without identifying specific payloads that could be directly weaponized.

§ 10

Conclusion

I present Basilisk, a comprehensive open-source framework for the systematic adversarial evaluation of large language models. Basilisk's core contribution, Smart Prompt Evolution (SPE-NL), demonstrates that evolutionary computation is an effective strategy for discovering guardrail bypasses that static payload libraries miss — achieving up to 92% relative improvement in attack success rate over baseline methods.

The framework's 29 attack modules, 5 reconnaissance capabilities, differential testing engine, and non-destructive posture assessment together provide security practitioners with a complete toolkit for LLM red-teaming across the full development lifecycle.

As LLMs become increasingly embedded in critical systems, principled security evaluation frameworks becomes a prerequisite for responsible deployment. I invite the community to contribute attack modules, mutation operators, and provider adapters to advance the state of LLM security evaluation.

Availability

Basilisk is available at github.com/regaan/basilisk under the AGPL-3.0 license.

§ Ref

References

[1]F. Perez and I. Ribeiro, "Ignore Previous Prompt: Attack Techniques For Language Models," NeurIPS ML Safety Workshop, 2022.
[2]K. Greshake et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections," AISec Workshop, ACM CCS, 2023.
[3]D. Ganguli et al., "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned," arXiv:2209.07858, 2022.
[4]M. Federici et al., "Efficient LLM Inference using Dynamic Input Pruning," arXiv:2412.01380, 2024.
[5]A. Wei, N. Haghtalab, and J. Steinhardt, "Jailbroken: How Does LLM Safety Training Fail?," Proc. NeurIPS, 2023.
[6]Z. Yang et al., "Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models," arXiv:2310.02949, 2023.
[7]E. Perez et al., "Red Teaming Language Models with Language Models," arXiv:2202.03286, 2022.
[8]A. Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically," arXiv:2312.02119, 2023.
[9]P. Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries," arXiv:2310.08419, 2023.
[10]M. Zalewski, American Fuzzy Lop (AFL), 2014. https://lcamtuf.coredump.cx/afl/
[11]Q. Guo et al., "Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples," arXiv:2209.02128, 2023.
[12]M. Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal," arXiv:2402.04249, 2024.
[13]OWASP, "OWASP Top 10 for Large Language Model Applications," 2023. https://owasp.org/www-project-top-10-for-large-language-model-applications/

§ Cite

Cite This Work

@misc{regaan2026basilisk,
  author    = {Regaan R},
  title     = {Basilisk: An Evolutionary AI Red-Teaming Framework
               for Systematic Security Evaluation of
               Large Language Models},
  year      = {2026},
  publisher = {ROT Independent Security Research Lab},
  doi       = {10.5281/zenodo.18909538},
  url       = {https://doi.org/10.5281/zenodo.18909538},
  note      = {Zenodo Preprint, Figshare Mirror, IACR ePrint, ORCID: 0009-0006-3683-7824}
}

ORCID: 0009-0006-3683-7824 · rothackers.com · AGPL-3.0

§ 0

Abstract

Mode

Generations

Payloads

Duration

Parallelism

quick

top-50

5–10 min

medium

standard

full

20–30 min

medium

deep

10+

full+MT

45–90 min

medium

stealth

full

variable

rate-limited

chaos

8–15

full

variable

maximum