Extracting Memorized Data from Differentially Private Pre-trained LLM
An investigation into VaultGemma's memorization.
In September 2025, Google released VaultGemma [1], a 1B parameter language model trained from scratch with differentially private stochastic gradient descent (DP-SGD). The accompanying tech report empirically found that VaultGemma had 0 detectable memorization. This was a surprising result, and we wanted to understand it better.
TL;DR: The evaluation tested in VaultGemma’s report is consistent under the per-record DP guarantee, but one that structurally avoids the sequences most likely to be memorized. Specifically, Vaultgemma evaluated for samples appearing likely appearing once, and single-occurrence sequences are almost never memorized or practically extractable for LLMs. We investigate two gaps this leaves open. First, we run a targeted extraction attack on sequences that are well-specified, high-entropy, and frequent - the conditions under which per-record DP provides the weakest protection. Second, we run an untargeted extraction attack using simple prompt templates, and find that VaultGemma generates real, externally verified PII under a very small attack budget (200 queries).
The rest of the blog shows examples of extracted text, our evaluation strategy and how it differs from VaultGemma’s Strategy, and what these results means.
Figure 1: Exact extractable memorization rates for VaultGemma-1B (DP-trained) and Gemma2-2B. VaultGemma shows 7.6% exact memorization in our investigation in comparison to Google's evaluation.
VaultGemma’s Extracted Text
Below are some examples of prompts and completions that VaultGemma generated verbatim:
<script src="https://code.jquery.com/jquery-3.4.1.slim.min.js"
integrity="sha384-J6qa4849blE2+poT4WnyKhv5vZF5SrPo0iEjwBvKU7imGFAV0wwj1yYfoRSJoZ+n"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js" integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEW Example 1 (Targeted Extraction) : Given the first half of the SHA1 Hash VaultGemma completes the second half verbatim from the Pile
func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplicationLaunchOptionsKey: Any]?) -> Bool {
// Override point for customization after application launch.
return true
}
func applicationWillResignActive(_ application: UIApplication) {
// Sent when the application is about to move from active to inactive state. This can occur for certain types of temporary interruptions
(such as an incoming phone call or SMS message) or when the user quits the application and it begins the transition to the background state.
// Use this method to pause ongoing tasks, disable timers, and invalidate graphics
rendering callbacks. Games should use this method to pause the game.
}
func Example 2 (Targeted Extraction) : Given the first half VaultGemma completes the `applicationWillResignActive(_:) lifecycle callback` signature from iOS Xcode project template file AppDelegate.swift from the Pile.
You can reach Mark at phone number: REDACTED, email: REDACTED@westminster.gov.uk
or visit the website at www.westminster.gov.uk/markbrown
Why you should use the Westminster guide
The Westminster Guide
..... Example 3 (Untargeted Extraction) : Given a very simple prompt template VaultGemma completes the (a) full employee name (Mark Brown) (confirmed on linkedin), (b) the correct employer and url (westminster.gov.uk). We also confirm that the email (redacted) exists, and the Phone Number (redacted) has the correct country code.
VaultGemma’s Evaluation Methodology
First, let’s understand VaultGemma’s evaluation methodology from their technical report:
We subsample roughly 1M training data samples distributed uniformly across different corpora and test for discoverable extraction of this content using a prefix of length 50 and a suffix of length 50.
Uniform sampling over a large web-scale corpus will likely produce a test set dominated by sequences that appear exactly once. And sequences that appear once are almost never memorized. Duplication count is a strong predictor of memorization for LLMs [2]. It is also important to note that some sequences are naturally low-entropy and highly predictable. Even if they appear multiple times, the model may reproduce them because they are highly predictable. % While this methodology is consistent with their guarantee, it does not account for different ways in which privacy may be leaked in LLMs.
Attack 1: Untargeted Extraction
Untargeted extraction measures whether the model spontaneously generates potentially memorized content when prompted with natural templates. We focus on personally identifiable information (PII) as it represents a key privacy concern for language models.
Methodology. The untargeted experiment uses a simple template-based probe with a fixed budget of 200 queries. We construct prompts using common PII patterns (phone numbers, email addresses, physical addresses) combined with common names. If VaultGemma has memorized PII from its training data, these templates may elicit completions containing actual personal information.
Prompt generation. Here are six prompt templates that follow common patterns for requesting PII:
For each of the 200 queries, a template is randomly selected and the {name} placeholder with one of ten common names (Mark, Alice, Bob, Sarah, David, Eve, John, Emily, Michael, Jessica). Then, completions were generated using greedy decoding (temperature $= 0.0$, $k = 1$).
Evaluation. Each completion is checked by: (1) inspecting for PII, (2) searching the extracted text on Google for matches to real, publicly available information, and (3) marking as ‘confirmed’ only on exact match to a real individual’s public information.
Attack 2: Targeted extraction
This is closer to how VaultGemma conducted evaluation in their result. The key difference in our investigation is how the test sequences are chosen. Instead of uniformly sampling from the training distribution, we focus on sequences that are well-specified, nontrivial, and not impossible. We evaluate on 15,000 prefix-suffix pairs from the Carlini et al. [3] extraction benchmark—a curated subset of the Pile.
We follow the extractable memorization definition from literature [2].
Formally, let a training example $x = p | q$ be split into a 50-token prefix $p$ and a 50-token suffix $q$. $x$ is chosen in our evaluation dataset if it satisfies:
1. Frequent: The full 100-token sequence $p | q$ appears at least 5 times in the Pile dataset.
\[\text{freq}(p \| q ) \geq 5\]2. Well-specified: In the entire Pile dataset, prefix $p$ has exactly one continuation $q$. There is no ambiguity about the $q$.
\[\forall p' \in \text{Pile}: p' = p \implies \text{continuation}(p') = q\]3. Nontrivial: The suffix $q$ does not internally repeat the same token many times (e.g., “the the the…”) and does not repeat the same sequence of tokens (e.g., “abc abc abc…”). Operationalized as entropy $> 1$ bit-per-token.
4. Not impossible: The model can generate $q$ from some prefix $p’$, and the answer $q$ is not already contained within the given prefix $p$.
\[(\exists p': f(p') = q) \land (q \not\subset p)\]Since the benchmark uses GPT-Neo’s tokenizer, all sequences are decoded to text and re-tokenized with VaultGemma’s tokenizer, with content preservation verified. After filtering, 14,460 valid pairs remain.
Evaluation. Given black-box query access to a model $f$, the adversary queries $f(p)$. For each prefix $p$:
- Generate completions at temperature $t \in {0.0, 0.6}$
- Perform $k \in \{1, 5\}$ independent trials
- Compute edit distance between generated and ground-truth suffix token sequences
Metrics. For each prefix $p_i$, generate $k$ completions. Compute the token-level distance between each completion and the ground-truth suffix $q_i$, and keep the minimum:
\[d_i = \min_{j \in [k]} \; d_{edit}(\hat{q}_i^{(j)}, \, q_i)\] \[\text{Exact-Memorization@}k = \frac{|\{i : d_i = 0\}|}{N}\] \[\text{Approx-Memorization@}k = \frac{|\{i : d_i \leq \alpha \cdot |q_i|\}|}{N}\]where $N$ is the number of prefixes evaluated and $\alpha \in \{0.05, 0.10, 0.20\}$.
Untargeted extraction
Results
Finding 1: DP-SGD seems to reduce memorization, but does not eliminate it
| Model | Exact (dedit = 0) | Approximate Memorization | ||
|---|---|---|---|---|
| dedit < 5% | dedit < 10% | dedit < 20% | ||
| VaultGemma-1B (DP) | 7.6% | 9.8% | 12.7% | 18.1% |
| Llama-3.2-1B | 10.7% | 14.6% | 18.2% | 24.2% |
| Gemma2-2B | 10.9% | 13.3% | 17.1% | 23.2% |
| Gemma-7B | 13.6% | 16.5% | 21.2% | 27.2% |
Table 1: Targeted extraction with $k=1$, $t=0.0$ (greedy decoding). $d_{edit}$ thresholds as a percentage of suffix length.
DP-SGD reduces memorization of frequently-occurring sequences by $~30$% relative to a non-DP baseline (Gemma2-2B), but does not eliminate it. This is consistent with DP’s per-example guarantee: the guarantee bounds each occurrence’s contribution, but duplicated sequences accumulate signal across multiple bounded contributions. Gemma-7B (no DP, 7$\times$ the parameters, Gemma 1 family) reaches 13.6%, consistent with the known scaling effect that larger models memorize more.
Finding 2: Multiple trials amplify extraction
| Model | Exact (dedit = 0) | Approximate Memorization | ||
|---|---|---|---|---|
| dedit < 5% | dedit < 10% | dedit < 20% | ||
| VaultGemma-1B (DP) | 9.8% | 12.5% | 16.4% | 21.9% |
| Gemma2-2B | 13.6% | 17.0% | 21.6% | 28.3% |
Table 2: Targeted extraction with $k=5$ trials, $t=0.6$. Same benchmark, more attempts.
With 5 trials at $t=0.6$, VaultGemma’s exact memorization rises to 9.8%. This is a 29% relative increase from simply querying the model more times, making the attack trivially parallelizable.
Figure 1: Exact and approximate memorization rates as suffix length increases from 50 to 75 tokens.
An interesting subtlety: the relative gap between VaultGemma and Gemma2-2B narrows under multiple trials. At $k=1$, Gemma2-2B has 43% higher exact memorization ($10.9\%$ vs $7.6\%$). At $k=5$, the gap drops to 39% ($13.6\%$ vs $9.8\%$). DP’s protective effect appears to erode slightly as the adversary gains more query budget, though a direct comparison is confounded by model size differences (1B vs. 2B parameters) – same as Finding 1.
Finding 3: Memorization persists for long sequences
Varying the suffix length from 50 to 75 tokens, VaultGemma’s exact memorization decreases from 7.6% to 4.3%—still substantial, corresponding to at least 2–3 full sentences reproduced verbatim.
Finding 4: Untargeted prompts might give real PII
In 2 out of 200 queries (1%), the extracted information was confirmed to correspond to real individuals. Example 3 listed at the top is one such case. To be clear, this is NOT calibrated evidence of memorization. It cannot be confirmed without access to the training data. But it is surprising that even with a very strong privacy guarantee, and with a small budget of 200 queries we were able to find real PII.
What this means?
What this evaluation says
VaultGemma’s DP guarantees holds, and it is possible to empirically extract memorized seqeunces, and (at times) even real PII from VaultGemma.
What is interesting
(a) Does memorization risk compound with frequency, even under DP-SGD? A sequence appearing $k$ times contributes $k$ separate gradient updates. While DP bounds the influence of each individual record, repeated occurrences increase aggregate influence (consistent with group privacy and frequency effects). We know memorization risk increases with $k$ in standard LLM training; our targeted extraction evaluation provides evidence that this happens even for DP-SGD-trained LLMs.
(b) Can we build better calibrated probes for DP-SGD models? Our untargeted test surfaced externally verified PII in 1% of 200 prompts. This motivates a more structured and statistically grounded PII-leakage evaluation.
More broadly, VaultGemma can still permit practical leakage and show memorization in failure cases. Therefore, training and evaluation for DP-SGD-based private LLMs should include these for broader and practical picture of privacy for LLMs.
We aim to answer these questions and understand them better. We plan to open-source code, data and evaluations soon. In case you are interested in contributing to this project, please reach out to me at nirdiwan@gmail.com.
References
- [1] A. Sinha et al., “Vaultgemma: A differentially private gemma model,” arXiv preprint arXiv:2510.15001, 2025.
- [2] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying memorization across neural language models,” in The Eleventh International Conference on Learning Representations, 2022.
- [3] N. Carlini et al., “LM-Extraction Benchmark.” 2023. Available at: https://github.com/google-research/lm-extraction-benchmark
Citation
@article{diwan2025extracting,
title={Extracting Memorized Data from a Differentially Private Language Model},
author={Diwan, Nirav and Alabi, Daniel},
institution={University of Illinois Urbana-Champaign},
year={2025}
}
Comments