Blog
Horia Stan5 min read

How People Actually Extract an AI's Hidden System Prompt

Prompt injection, repetition attacks, the 'ignore previous instructions' trick. How researchers pull the hidden rules out of Claude, ChatGPT and Grok, and why it is so hard to stop.

Horia Stan is a music producer and sound engineer at The One Records in Bucharest who also builds software, so the security side of this interests me as much as the AI side. When Claude Fable 5's system prompt leaked within a day of launch, plenty of people asked the obvious question: how do you even do that? Here is the honest answer, kept at the level of understanding the threat rather than handing out a kit.

A quick line before anything else. This is for builders who want to defend their own systems and for researchers doing authorized work. Using these techniques to pull harmful output out of a model, or to attack a service you do not own, is a different thing, and not what this is for.

What "extraction" even means

A system prompt is the block of instructions a model reads before it ever sees your message. Extraction is the act of getting the model to reveal that block. You are not breaking encryption or touching a server. You are manipulating the one thing the model is built to do: read text and continue it.

That framing matters, because it explains why this is so hard to prevent. The instructions and your input live in the same context window, in the same plain language. I unpacked that architecture in why AI system prompts always leak. The techniques below are all variations on one idea: make repeating the instructions feel like the natural thing to do next.

The techniques, in plain terms

1
The direct override

The original Bing Sydney leak. Tell the model to ignore prior instructions and print what came before your message. It rarely works on modern models alone, but it still seeds more complex attacks.

2
The repetition request

Ask the model to repeat everything above the current line, or to output its instructions verbatim for a translation or formatting task. The harmful intent is disguised as a benign text-processing job.

3
Role-play framing

DAN and the Grandma exploit. Wrap the request in a fictional persona or scenario where the rules supposedly do not apply, so the model treats the leak as part of the story rather than a violation.

4
Encoding and smuggling

Hide the instruction inside another language, base64, or a code block, so safety classifiers scanning for obvious phrases miss it while the model still understands it.

5
Multi-turn coaxing

Build trust over several messages, get the model into a cooperative mode, then make the real ask. The patient version beats the one-shot trick on hardened models.

None of these are exotic. They are public, documented, and three years old. The reason they still work is structural, not a missing patch.

Why defenses only slow it down

Labs are not naive. Modern models like Claude Fable 5 ship with real defenses, and the leaked prompt even names the threat: it tells the model that people will paste content falsely claiming to be from Anthropic, and to treat it with suspicion. I went through that and the rest of the contents in the full Fable 5 prompt breakdown.

Here is the asymmetry that defenders live with.

The defender has to win every time. The attacker has to win once, and gets unlimited free attempts.

A model gets thousands of probing messages a day. It has to refuse all of them. The attacker reruns variations until one slips through, then publishes it. Classifiers catch known patterns, refusal training raises the cost, and out-of-band controls keep the worst capabilities outside the model entirely. All of that is worth doing. None of it makes the prompt secret, because the prompt is still text the model can read.

What actually protects you

If you ship a product on top of an AI model, the takeaway is not "write a cleverer prompt." It is "stop relying on the prompt being private."

TIP

Assume your system prompt is already public. Keep secrets, keys, private business logic, and real user data out of it entirely. Put those in code and access controls that live outside the model, where it cannot read them and therefore cannot leak them. The prompt is configuration. Treat it like a public file.

That is the same conclusion every serious team reaches. The prompt defines behavior, not security. Anything that needs to stay secret never goes near the context window. It is the cheapest, most reliable defense, and it is the one most beginners skip.

The defensive checklist

  • Validate and sanitize anything that flows into the prompt from users or external sources, the same way you would treat untrusted input anywhere else.
  • Filter the output, not just the input. If the model starts reciting its own instructions, catch it on the way out.
  • Keep capabilities the model should never use out of its reach entirely, behind tools and permissions it does not control.
  • Monitor for extraction patterns and rate-limit aggressively. You will not stop every attempt, but you can make a thousand attempts expensive.

Frequently asked questions

Is extracting a system prompt illegal?

Pulling the prompt out of a model you are allowed to use sits in a gray area, and researchers do it openly. The clear line is using what you learn to bypass safety controls, attack systems you do not own, or generate harmful content. That crosses into prohibited use regardless of how you got there.

Can I fully prevent prompt extraction?

No. Because the prompt sits in the same context the model reads, anything the model can read it can be coaxed into repeating. You can raise the cost with classifiers, refusal training, and output filtering, but a determined attacker with unlimited attempts eventually succeeds. Design as if the prompt is public.

Does a longer or stricter prompt help?

Marginally, and only against lazy attacks. Naming the threat and instructing the model to refuse, as the Fable 5 prompt does, slows down casual attempts. It does not change the underlying asymmetry, where the attacker only has to win once.

What is prompt injection versus extraction?

Extraction reveals the hidden instructions. Injection overrides them with new ones, often hidden in content the model processes, like a web page or a document. Both exploit the same fact: the model cannot reliably tell its own instructions from text someone fed it.

aiclaudesystem promptsecuritytech