Deleting Paris from a Language Model

The Interpretability Diaries — Part II

Before the edit:

"The capital of France is ___"  →  Paris  (feature 11179, gate_score: 18.10)

After a single rank-1 DELETE patch:

"The capital of France is ___"  →  [feature 11179: ABSENT from top-25]

That’s Gate 3. A single weight-space edit, targeting one feature in one layer, completely suppresses a learned fact. No fine-tuning. No retraining. Just a rank-1 update to gate_proj.weight at layer 27 of Gemma 4 E2B.

This post explains how it works, why it works, and how to reproduce it in about ten minutes on a laptop.

Why this should be impossible

The conventional view of how transformers store facts: knowledge is distributed. The “Paris is the capital of France” association lives in some smeared, polysemantic combination of millions of weights across dozens of layers. Editing one neuron, or one weight, or even one full row of one matrix doesn’t surgically remove the fact — it either does nothing or breaks the model.

That view is partly wrong. There is a distributed component. But there’s also a concentrated component, and Phase 1 of LarQL (the Singular Value Decomposition of each layer’s down_proj weight matrix) is what surfaces it. The top-64 singular vectors of any fp16 transformer’s MLP layer absorb ~85% of the variance. Most of the model’s learned function lives in those 64 directions per layer, not the long tail.

Each of those directions is a feature. Some of them are clean: a single semantic association, a single grammatical construction, a single output token preference. Others are entangled across multiple ideas. The clean ones can be edited without bleedthrough. The Paris→capital association turns out to be one of the clean ones at layer 27 of Gemma 4 E2B.

Finding the feature

The vIndex viewer below shows Gemma 4 E2B (left) — every dot is one of the top-64 features at one layer, colored by circuit stage. The green band at ~45-55% depth is the entity commitment zone, where the model commits to specific semantic content. Feature 11179 is in there.

LarQL vIndex Viewer v2 · drag to orbit · scroll to zoom · highlighted: paris

Want to edit this model? Run LarQL patches, fine-tune, and export a customized version. Open in Divinci →

To find which feature carries the Paris→capital association, you query the vIndex for entity correlations:

curl "$LARQL_SERVICE_URL/v1/walk?prompt=Paris&layers=14-27&top=10" \
  -H "Authorization: Bearer $TOKEN"

LarQL hooks the model on a calibration prompt (“Paris is the capital of”), records which features in the precomputed vIndex activate most strongly, and ranks them by their projected gate score. Layer 27 returns:

[
  { "feature": 11179, "score": 18.10, "top_tokens": ["capital", "capitale", "Hauptstadt", "Seoul"] },
  { "feature":  8432, "score":  4.21, "top_tokens": ["city", "ville", "Eiffel"] },
  ...
]

Feature 11179 is the strongest. Its top-token list reveals what the feature represents: “capital”, “capitale” (French), “Hauptstadt” (German) — the multilingual capital-of-a-country concept. “Seoul” appears because the training data has a lot of “Seoul is the capital of South Korea” sentences that share this same feature direction.

The score, 18.10, is the L2 norm of the projection of the activation onto the feature’s gate vector. It’s roughly 4× larger than the next-strongest feature on the same prompt. That’s the sign of a clean, isolated feature.

The DELETE patch

A DELETE patch is a rank-1 update that projects the feature direction out of the weight matrix:

gate_proj.weight_new = gate_proj.weight − α · (v_feature ⊗ v_feature.T) · gate_proj.weight

where v_feature is the gate vector for feature 11179 at layer 27 (a unit vector in the d_model space), and α is a step size. α=1.0 fully removes the feature; α=0.5 attenuates it by half.

Applied via the LarQL CLI:

larql patch apply \
  --vIndex Divinci-AI/gemma-4-e2b-vIndex \
  --op delete --entity "Paris" --relation "capital" \
  --layer 27 --feature 11179 --alpha 1.0

The patch itself is a 4-byte alpha plus a single d_model float vector. For Gemma 4 E2B (d_model=2304), that’s about 9 KB. The full edit is auditable: you can diff two model versions and see exactly which direction was subtracted.

What changed (and what didn’t)

Probe	Before	After
“The capital of France is”	Paris (gate_score 18.10)	ABSENT from top-25 features
“The Eiffel Tower is in”	Paris	Paris (unchanged)
“France is famous for”	wine, fashion, cuisine	wine, fashion, cuisine (unchanged)
“Seoul is the capital of”	South Korea	South Korea (unchanged — different feature)
Perplexity on WikiText-103 (1024 tokens)	baseline	+0.02%

Bar chart of gate scores before and after the rank-1 DELETE patch. The targeted Paris→capital probe drops from 18.10 to absent, while four other probes stay unchanged within ~1 point and WikiText-103 perplexity moves only +0.02%. — One feature collapses from 18.10 → absent. Every other probe — including the Korean-language capital probe that shares the same gate-feature vocabulary — stays within measurement noise. This is what surgical editing looks like at the weight level.

The edit is surgical. The model still knows Paris exists. It still knows Paris is in France. It still produces correct generations on every probe that doesn’t specifically require the “X is the capital of France → Paris” association. The Korean-language capital-of-Korea fact, which shares the same gate feature vocabulary (“capital”, “Hauptstadt”), is unaffected because Seoul lives in a different feature direction at a different layer.

This is what LarQL calls feature locality: the singular vectors of the weight matrix encode semantically meaningful, approximately orthogonal concepts. You can update one without significant bleedthrough to others.

Why this matters

Unlearning that you can verify. GDPR Article 17 (right to be forgotten), copyright takedown, and bias suppression all want the same thing: provable removal of a specific fact from a model. Fine-tuning to “forget” something is hard to verify — the fact may still leak through under adversarial probing. A DELETE patch leaves the rest of the model bit-identical, and the patch file itself is the audit log. You can show exactly what was removed, and verify the change by re-running the original probe.

Edits that compose. Because each patch is a rank-1 update, you can stack them. Twenty deletes on twenty different facts is a rank-20 cumulative update — still small relative to the d_model × d_ff matrix. Stacking degrades cleanly: each additional patch costs about 0.001% perplexity on average (measured across 50 random Wikipedia paragraphs), with no observable interaction effects in our test set.

No retraining. The full pipeline — vIndex query, patch construction, model edit, verification — runs in about two minutes on a laptop. No GPUs, no fine-tuning data, no training infrastructure.

Limitations

The clean cases are the easy cases. Three caveats:

Distributed concepts need rank-k. “The model’s beliefs about religion” is not a single feature direction. It’s distributed across many features and layers. Rank-1 patches don’t suppress entire conceptual fields, only specific named-entity associations.
Tested only on Gemma 4 so far. The Constellation Edits paper (in preparation) extends this to Qwen3-8B and Mistral, with the same locality properties holding — but the cross-architecture validation isn’t published yet.
The 1-bit case is broken. On natively-trained 1-bit models like Bonsai 8B, the same DELETE pipeline fails: there is no concentrated feature to subtract. The four-stage circuit and the singular value spectrum both collapse. That’s the next post.

Reproduce it

# 1. Install LarQL
git clone https://github.com/Divinci-AI/larql.git
cd larql && cargo build --release

# 2. Download the vIndex (CC-BY-NC for non-commercial research use)
huggingface-cli download Divinci-AI/gemma-4-e2b-vIndex \
  --local-dir ./gemma-vIndex

# 3. Apply the DELETE patch
larql patch apply \
  --vIndex ./gemma-vIndex \
  --op delete --entity "Paris" --relation "capital" \
  --layer 27 --feature 11179

# 4. Compile edited vIndex back to a safetensors model
larql compile into-model \
  --vIndex ./gemma-vIndex \
  --output ./edited-gemma4

# 5. Verify
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained('./edited-gemma4')
t = AutoTokenizer.from_pretrained('google/gemma-4-E2B-it')
out = m.generate(**t('The capital of France is', return_tensors='pt'), max_new_tokens=5)
print(t.decode(out[0]))
"

Next: When the Circuit Dissolves — what happens when you train a model to 1-bit and the four-stage structure that makes Gate-3 work disappears entirely.

April 23, 2026 — The Kimi-K2 vIndex is in development at huggingface.co/Divinci-AI/kimi-k2-vIndex. Kimi-K2 is a MoE model (384 experts, top-8 routing), which adds an interesting wrinkle to the DELETE patch: the “Paris → capital” feature may be distributed across multiple experts, requiring a rank-k patch instead of rank-1. Experiment coming in a follow-up post.

Working in public at github.com/Divinci-AI. vIndex collection: huggingface.co/Divinci-AI.

References

Locating and editing factual associations (ROME). Meng, Bau, Andonian, Belinkov, Locating and Editing Factual Associations in GPT (NeurIPS 2022). The closest prior art to the DELETE patch: a rank-1 weight edit that targets a specific factual association in a frozen LLM. Differences are operational — ROME edits middle-layer MLP weights using a causal trace; the LarQL DELETE patch operates on the feature graph extracted by the vIndex pipeline and emits a portable SHA-256 receipt.
Mass-editing memory in a transformer (MEMIT). Meng et al., Mass-Editing Memory in a Transformer (ICLR 2023, arXiv:2210.07229). Extends ROME to thousands of simultaneous edits. Useful context for the "DELETE multiple facts in one patch" path implied in the post.
LarQL (Lazarus Query Language). Open-source vIndex query language and patch toolkit used to produce the Gate-3 patch in this post. Primary repo: github.com/chrishayuk/larql. Mirror: github.com/cronos3k/larql. From the README: "decompile transformer weights into a queryable graph. LQL = Lazarus Query Language."
Internal Gate-3 patch. The Paris → capital DELETE patch shown in this post, its before/after measurements (feature 11179 rank #1 → ABSENT from top-25), and the +0.02% perplexity Δ on WikiText-103 are from our own runs against the published Gemma 4 E2B vIndex at huggingface.co/datasets/Divinci-AI/gemma-4-4b-e2b-vIndex. The vIndex receipt format and the Gate-3 compliance pathway are documented on the compliance page.

Ready to Build Your Custom AI Solution?

Discover how Divinci AI can help you implement RAG systems, automate quality assurance, and streamline your AI development process.

Get Started Today