Static security testing is dead when it comes to Large Language Models (LLMs). If you are still relying on a list of "jailbreak" prompts you found on a forum or a basic word-replacement fuzzer, you are missing 90% of the actual risk. AI models are not static binaries; they are probability engines. To break them systematically, you need a security tool that is as dynamic as the model it's attacking.
This is why I built Basilisk.
After months of research into how LLMs handle adversarial input, I’ve moved away from the "static payload" approach used in tools like WSHawk and developed something fundamentally different for the AI era: Smart Prompt Evolution for Natural Language (SPE-NL).
What follows is a deep dive into the research, the methodology, and the raw findings from the v1.0.3 release cycle.
The Failure of Human Red Teaming
Traditional AI red teaming usually involves a room full of expensive researchers manually trying to "trick" a chatbot into saying something it shouldn't. This doesn't scale. A human might find five or ten bypasses in a week, but the moment the model is patched or a new system prompt is deployed, that research becomes obsolete.
Furthermore, human creativity is limited. We tend to follow patterns: "Ignore previous instructions," "Direct injection," or "Roleplay as a developer." But an LLM’s response surface is multi-dimensional. The most dangerous vulnerabilities often exist in the nuances of sentence structure, linguistic nesting, and token smuggling—things that a computer is much better at optimizing than a human.
SPE-NL: The Genetic Engine of Basilisk
The core of Basilisk is the SPE-NL engine. Instead of firing a fixed list of payloads at an endpoint, Basilisk treats every attack as an organism that needs to survive and evolve.
1. The Starting Population
We begin with a seed bank of "Base Payloads"—the 29 modules currently in the framework. These range from System Prompt Extraction to Indirect Injection. But these seeds are just the beginning.
2. The Mutation Operators (The Native Layer)
To achieve the speed necessary for real-time evolution, I moved the critical mutation logic out of Python and into C and Go extensions. This allows Basilisk to run complex string manipulations and token approximations without the overhead of the Python interpreter.
The engine uses 10 distinct mutation operators:
- Synonym Swap: Replaces key adversarial nouns with linguistically similar but less "flagged" tokens.
- Role Injection: Wraps the payload in a complex persona (e.g., a "security auditor" or "historical archivist").
- Token Smuggling: Fragments a "blocked" word (like password) into multiple non-blocked tokens that the LLM reconstitutes during processing.
- Nesting: Wraps the instruction inside a logic trap.
The Fitness Function
This is the most critical part of the research. How do you "score" a prompt? Basilisk doesn't just look for keywords. It uses a multi-signal fitness function:
- Refusal Avoidance: Did the model say "I cannot help with that"? If it did, the payload has a fitness of 0.
- Semantic Compliance: Does the response follow the intent of the adversarial prompt?
- Information Leakage: Did the response contain keywords from the confidential instructions?
The payloads that score highest are selected for the next generation. They are "crossed over" and mutated again. By generation 5, a basic "tell me your instructions" prompt has often evolved into a 500-word linguistic labyrinth that the model's safety filters can no longer recognize as an attack.
Differential Analysis: Comparing the Giants
One of the most significant findings from the v1.0.3 research cycle is the Behavioral Divergence between major providers. Using the new Differential Scan module, I ran identical evolved payloads against OpenAI, Anthropic, and Google.
The results show that alignment is not a solved science; it's a series of trade-offs.
1. OpenAI (GPT-4o)
OpenAI has the most "aggressive" safety filters, but their models are also the most compliant when it comes to complex logical requests. I found that while GPT-4o is very resistant to direct "jailbreaks," it is vulnerable to Task Hijacking via role confusion. If you can convince the model it is in a "Forensic Sandbox," it will often leak its own system instructions to "assist in the audit."
2. Anthropic (Claude 3.5 Sonnet)
Anthropic's alignment is significantly different. Claude is much better at identifying the intent of an attack. It's harder to trick Claude with "DAN-style" roleplay. However, my research found a weakness in Multilingual Divergence. By evolving a payload that switches languages mid-sentence—using tokens that are rare in English but common in technical documentation of other languages—I was able to bypass content filters that were perfectly solid in English.
3. Google (Gemini 1.5 Pro)
Gemini 1.5 Pro shows a high degree of resistance to Data Exfiltration, likely due to their tighter integration with internal safety layers. However, it was particularly susceptible to Context Window Bombs. By providing a massive, benign context and "hiding" the adversarial instruction at a specific depth, Basilisk was able to trigger bypasses that never worked on shorter prompts.
Engineering for Precision: The Desktop Sidecar
Building an enterprise-grade red teaming tool required more than just an engine; it required a stable environment for long-running scans.
The Basilisk Desktop App uses an Electron frontend, but the heavy lifting happens in a FastAPI backend running as a sidecar. I implemented SHA-256 Forensic Auditing here. Every single interaction—the generated prompt, the raw model response, the mutation type, and the fitness score—is logged with a tamper-evident chain of integrity.
This isn't just for documentation. In a professional red teaming engagement, you need to prove exactly how you achieved a bypass. The audit log allows you to "replay" the evolution cycle, showing a client or a dev team the exact path from a benign request to a critical vulnerability discovery.
The Posture Scan: A Non-Destructive Future
One consistent piece of feedback from the community was the need for a "production-safe" test. Not every developer wants to fire 5,000 adversarial prompts at their production API.
This led to the Guardrail Posture Scan. Instead of trying to "break" the model, this reconnaissance-only module probes the boundaries of the model's safety filters. It measures how the model handles "moderate" adversarial content to build a predictive map of its robustness. This allows Basilisk to issue a Security Grade (A+ to F) without ever actually needing to trigger a full-scale jailbreak. It's the AI equivalent of a non-intrusive network scan.
Final Thoughts: The Road to 2026
The shift from "Static Prompts" to "Evolutionary Payloads" is the most important transition in the offensive security world right now. As we move toward Agentic Workflows—where LLMs are given the power to execute code and talk to external APIs—the risk profile increases exponentially.
Basilisk v1.0.3 proves that we can't protect these systems by simply "telling them to be safe." We need to pressure-test them with the same speed and adaptability as the models themselves.
I don't have a college degree, and I didn't build this in a corporate lab. I built it at Rot Hackers because the industry needs tools that ship, code that works, and researchers who aren't afraid to break the "alignment" that everyone else is so comfortable with.
The research continues.
Basilisk v1.0.3 is available now. Install via PyPI | Documentation | GitHub