Corpus PII-Prevalence Audit: Sample Evidence Pack
Synthetic sample · no real PII

Calibrated PII-prevalence audit · 95% confidence interval · generated 2026-06-30 14:44 UTC

Calibrated corpus PII prevalence: 22.66% (95% CI [15.61%, 31.55%])
Corpus: 3,000
Sample judged: 450
Human gold: 160
Raw flag-rate: 28.89%
Excluded (UNCALIBRATED): 1

The headline corrects the judge's raw flag-rate for its measured sensitivity and specificity (Rogan-Gladen), per stratum, then design-weights to the corpus. It is computed over the calibrated and pooled strata only; ill-posed strata are reported separately and never fold into the headline.
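The correction and weighting described above can be sketched as follows. This is a minimal illustration, not the audit tool's code: the function names are hypothetical, the sensitivity and specificity in the usage line are made-up values (the report does not list them), and renormalising the weights over the included strata is an assumption about how excluded strata are handled.

```python
def rogan_gladen(flag_rate, sensitivity, specificity):
    """Correct a raw flag-rate for judge error (Rogan-Gladen estimator).

    prevalence = (flag_rate + specificity - 1) / (sensitivity + specificity - 1)
    The result is clipped to [0, 1] because sampling noise can push it outside.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("judge is no better than chance; correction is ill-posed")
    estimate = (flag_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


def design_weighted(prevalences, weights):
    """Weighted mean of per-stratum prevalences.

    Assumption: weights are renormalised over the strata passed in, so that
    leaving a stratum out does not shrink the result.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(prevalences, weights)) / total


# Hypothetical judge quality: sensitivity 0.80, specificity 0.90.
corrected = rogan_gladen(0.30, 0.80, 0.90)

# Point estimates and weights of the two strata that enter the headline.
headline = design_weighted([0.3291, 0.0534], [0.550, 0.330])
```

Under these assumptions `headline` comes out near 22.57%, close to but not identical to the reported 22.66%; a small gap is expected if the reported figure is summarised from Monte-Carlo draws rather than computed from the rounded point estimates.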

1. Per-stratum prevalence

Stratum               Status        Prevalence  95% CI            Sample  Flagged  Gold PII/clean  Weight
support-tickets       CALIBRATED    32.91%      [22.54%, 46.41%]  248     70       28/60           0.550
wiki-pages            POOLED        5.34%       [0.00%, 14.20%]   148     14       2/51            0.330
id-verification-logs  UNCALIBRATED  83.93%      [73.34%, 92.23%]  54      46       19/0            0.120

CALIBRATED: se/sp estimated from this stratum's own gold subset.
POOLED: gold below the 5-each-class floor; se/sp partial-pooled from the corpus posterior.
UNCALIBRATED: calibration ill-posed (a gold class is empty); raw flag-rate only, excluded from the headline.
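The status rule in the legend can be written as a small decision function. This is a sketch inferred from the legend wording, not the tool's source; `stratum_status` and `labels_needed` are hypothetical names, and the floor of 5 per gold class is taken from the legend.

```python
def stratum_status(gold_pii, gold_clean, floor=5):
    """Classify a stratum from its human-gold class counts."""
    # An empty gold class makes sensitivity or specificity unestimable.
    if gold_pii == 0 or gold_clean == 0:
        return "UNCALIBRATED"
    # Both classes present but at least one is below the floor.
    if gold_pii < floor or gold_clean < floor:
        return "POOLED"
    return "CALIBRATED"


def labels_needed(gold_pii, gold_clean, floor=5):
    """Extra (PII, clean) gold labels required to reach the floor in each class."""
    return max(0, floor - gold_pii), max(0, floor - gold_clean)


# Gold counts from the per-stratum table.
for name, pii, clean in [("support-tickets", 28, 60),
                         ("wiki-pages", 2, 51),
                         ("id-verification-logs", 19, 0)]:
    print(name, stratum_status(pii, clean), labels_needed(pii, clean))
```

Applied to the three strata in the table, the rule reproduces the Status column.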

2. Honesty caveats

POOLED (wiki-pages): insufficient gold (have 2 PII / 51 clean, need 5/5) -> reported POOLED (se/sp partial-pooled from the corpus); label 3 more PII-positive to calibrate it locally.
UNCALIBRATED (id-verification-logs): insufficient gold (have 19 PII / 0 clean, need 5/5) -> reported UNCALIBRATED (raw flag-rate, excluded from the headline); label 5 more clean to calibrate it.
Dev vs validated: The judge prompt and taxonomy are the development version; the estimator (stratified Rogan-Gladen with Monte-Carlo uncertainty propagation) is the validated build. This bound is only as strong as the human gold it rests on: it is a calibrated estimate from a sample, not a census or a scanner.
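Monte-Carlo uncertainty propagation for a single stratum might look like the sketch below. It is an illustration under stated assumptions, not the validated estimator: the uniform Beta(1, 1) priors, the percentile interval, the handling of ill-posed draws, and the name `mc_prevalence_interval` are all hypothetical choices. The gold confusion counts in the usage line are invented, since the report lists only gold class totals.

```python
import random


def mc_prevalence_interval(flagged, sampled, tp, gold_pii, tn, gold_clean,
                           draws=20000, confidence=0.95, seed=31):
    """Percentile interval for Rogan-Gladen prevalence via Monte-Carlo draws.

    flagged / sampled : judge flags in the stratum sample
    tp / gold_pii     : gold PII items the judge flagged (sensitivity)
    tn / gold_clean   : gold clean items the judge passed (specificity)
    """
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    estimates = []
    for _ in range(draws):
        # Assumption: Beta posteriors under a uniform Beta(1, 1) prior.
        flag_rate = rng.betavariate(flagged + 1, sampled - flagged + 1)
        se = rng.betavariate(tp + 1, gold_pii - tp + 1)
        sp = rng.betavariate(tn + 1, gold_clean - tn + 1)
        denom = se + sp - 1.0
        if denom <= 0.0:
            continue  # simplification: drop draws where the judge is at or below chance
        estimate = (flag_rate + sp - 1.0) / denom
        estimates.append(min(1.0, max(0.0, estimate)))
    if not estimates:
        raise ValueError("no usable draws; calibration is ill-posed")
    estimates.sort()
    tail = (1.0 - confidence) / 2.0
    last = len(estimates) - 1
    return (estimates[int(tail * last)],
            estimates[int(0.5 * last)],
            estimates[int((1.0 - tail) * last)])


# Hypothetical gold outcomes: 24 of 28 PII items flagged, 57 of 60 clean items passed.
lower, median, upper = mc_prevalence_interval(70, 248, 24, 28, 57, 60)
```

The interval widens as the gold subset shrinks, because the sensitivity and specificity draws spread out. That is the sense in which the bound is only as strong as the gold behind it.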

3. Method

4. Provenance · reproducibility

corpus audit_cli/sample_evidence_pack/synthetic_corpus.jsonl
strata field:source
sample seed=23 method=neyman n=450
gold seed=41 subset=160
calibrate seed=31 confidence=0.95
note SYNTHETIC corpus, deterministic, no real PII
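For readers unfamiliar with `method=neyman`: Neyman allocation assigns each stratum a sample size proportional to its size times its within-stratum standard deviation. The sketch below is illustrative only. `neyman_allocation` is a hypothetical name, the largest-remainder rounding is an assumption, and the stratum sizes are derived by assuming each Weight is the stratum's share of the 3,000-item corpus.

```python
def neyman_allocation(sizes, std_devs, n):
    """Split a total sample of n across strata in proportion to N_h * S_h."""
    products = [size * sd for size, sd in zip(sizes, std_devs)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one stratum needs positive size and spread")
    exact = [n * p / total for p in products]
    alloc = [int(x) for x in exact]  # floor, since every share is non-negative
    # Hand out the remaining units to the largest fractional remainders
    # (ties go to the earlier stratum because sorted() is stable).
    leftover = n - sum(alloc)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Assumed stratum sizes: 0.550, 0.330 and 0.120 of a 3,000-item corpus.
sizes = [1650, 990, 360]
# With no prior information the spreads are taken as equal (an assumption).
print(neyman_allocation(sizes, [1.0, 1.0, 1.0], 450))
```

With equal standard deviations the rule reduces to proportional allocation, which for these assumed sizes gives 248 / 148 / 54. A stratum with a larger assumed spread would receive more of the 450 samples. This sketch does not cap an allocation at the stratum's size.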