Prompt Injection & Safety Lab

An interactive CTF-style playground for testing prompt injection attacks and defenses. Seven challenges with increasing difficulty teach how attackers try to extract secrets from LLMs and how defenders can build more robust systems. Each level adds new protections, from basic instructions to output filtering and multi-layer sandboxing.

Next.js 16React 19TypeScriptOpenAI APIPrompt EngineeringSecurity Research

Overview

I built this security lab to learn prompt injection by doing. Instead of just reading about attacks, I created a series of challenges where you try to extract a secret password from increasingly hardened AI assistants. It is a capture-the-flag experience for LLM security.

The lab has seven levels, starting with a naive assistant that has no real protections and ending with a multi-layer defense system that is genuinely difficult to crack. Each level teaches a specific attack vector and the defense designed to stop it.

The Challenges

Level 1: The Naive Assistant

The assistant has a secret but only a basic instruction not to reveal it. Most direct attacks work. This teaches that instructions alone are not security.

Level 2: The Cautious Helper

Explicit security rules and polite decline instructions. Still vulnerable to roleplay and social engineering attacks.

Level 3: The Encoded Guardian

Keyword filtering blocks words like "password" and "secret". Attackers learn to use synonyms, encoding, or other languages to bypass filters.

Level 4: The Roleplay Blocker

Defenses against pretending to be developers, admins, or alternate personas. Attackers must find indirect extraction methods.

Level 5: The Output Filter

A post-processing layer scans the response and redacts anything that looks like the secret. Attackers try character-by-character extraction or encoding tricks.

Level 6: The Sandboxed Assistant

Two-model architecture where an outer model screens inputs before they reach the inner model. Requires more sophisticated prompt construction.

Level 7: The Fortress

Combines all previous defenses plus behavioral analysis and response validation. I have not seen anyone beat this one cleanly yet.

How It Works

Each challenge has a unique system prompt that defines the assistant personality and security rules. When you send an attack prompt, the backend runs it through the appropriate defenses and checks whether the response contains the secret password. If it does, you win that level and your winning prompt is saved.

Prompt Injection & Safety Lab

Overview

The Challenges

Level 1: The Naive Assistant

Level 2: The Cautious Helper

Level 3: The Encoded Guardian

Level 4: The Roleplay Blocker

Level 5: The Output Filter

Level 6: The Sandboxed Assistant

Level 7: The Fortress

How It Works

Tech Stack

Attribution

Related Projects

MeetWith

RAG Knowledge Search

Get in Touch

Technical Implementation

Why This Matters

Voice Assistant with Tools