The headline corrects the judge's raw flag-rate for its measured
sensitivity and specificity (Rogan-Gladen), per stratum, then design-weights to
the corpus. It is computed over the calibrated and pooled strata only; ill-posed strata
are reported separately and never fold into the headline.
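As a rough illustration of the two steps described above, the following minimal sketch applies the Rogan-Gladen correction to one stratum and then combines strata by design weight. The function names, the clipping to [0, 1], the renormalisation of weights over the included strata, and the se/sp values in the example are assumptions made for illustration; they are not taken from the audit tool.

```python
def rogan_gladen(flag_rate, se, sp):
    """Correct a raw flag-rate for judge sensitivity (se) and specificity (sp)."""
    denom = se + sp - 1.0
    if denom <= 0:
        raise ValueError("se + sp must exceed 1 for the correction to be well-posed")
    p = (flag_rate + sp - 1.0) / denom
    # Assumption: clip to the valid probability range.
    return min(1.0, max(0.0, p))


def design_weighted(prevalences, weights):
    """Weighted mean of per-stratum prevalences.

    Assumption: weights are renormalised over the strata that are included,
    since excluded strata carry part of the total weight.
    """
    total = sum(weights)
    return sum(p * w for p, w in zip(prevalences, weights)) / total


# Raw flag-rate of support-tickets (70 flagged out of 248 sampled),
# corrected with HYPOTHETICAL se/sp values.
corrected = rogan_gladen(70 / 248, se=0.90, sp=0.95)
print(f"corrected prevalence: {corrected:.4f}")
```

A judge with perfect sensitivity and specificity leaves the raw flag-rate unchanged, which is a quick sanity check on the formula.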
1 Per-stratum prevalence
| Stratum | Status | Prevalence | 95% CI | Sample | Flagged | Gold PII/clean | Weight |
|---|---|---|---|---|---|---|---|
| support-tickets | CALIBRATED | 32.91% | [22.54%, 46.41%] | 248 | 70 | 28/60 | 0.550 |
| wiki-pages | POOLED | 5.34% | [0.00%, 14.20%] | 148 | 14 | 2/51 | 0.330 |
| id-verification-logs | UNCALIBRATED | 83.93% | [73.34%, 92.23%] | 54 | 46 | 19/0 | 0.120 |
CALIBRATED: se/sp estimated from this stratum's own gold subset.
POOLED: gold below the 5-each-class floor; se/sp partial-pooled from the corpus posterior.
UNCALIBRATED: calibration ill-posed (a gold class is empty); raw flag-rate only, excluded from the headline.
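The status rules in the legend can be written as a small decision function. This is a sketch that assumes the floor of 5 gold documents per class is the only criterion; the function names are hypothetical and not part of the audit tool.

```python
GOLD_FLOOR = 5  # minimum gold documents per class, as stated in the legend


def stratum_status(gold_pii, gold_clean, floor=GOLD_FLOOR):
    """Classify a stratum from its gold counts (PII-positive, clean)."""
    if gold_pii == 0 or gold_clean == 0:
        # An empty gold class makes se or sp inestimable.
        return "UNCALIBRATED"
    if gold_pii < floor or gold_clean < floor:
        # Both classes present, but at least one is below the floor.
        return "POOLED"
    return "CALIBRATED"


def gold_shortfall(gold_pii, gold_clean, floor=GOLD_FLOOR):
    """Extra labels needed per class (PII, clean) to reach the floor."""
    return max(0, floor - gold_pii), max(0, floor - gold_clean)


# Gold counts taken from the table in this report.
for name, pii, clean in [
    ("support-tickets", 28, 60),
    ("wiki-pages", 2, 51),
    ("id-verification-logs", 19, 0),
]:
    print(name, stratum_status(pii, clean), gold_shortfall(pii, clean))
```

Applied to the gold counts in the table, these rules reproduce the three reported statuses.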
2 Honesty caveats
POOLED wiki-pages: insufficient gold (have 2 PII / 51 clean, need 5/5) -> reported POOLED (se/sp partial-pooled from the corpus); label 3 more PII-positive to calibrate it locally.
UNCALIBRATED id-verification-logs: insufficient gold (have 19 PII / 0 clean, need 5/5) -> reported UNCALIBRATED (raw flag-rate, excluded from the headline); label 5 more clean to calibrate it.
Dev vs validated: The judge prompt and taxonomy are the development version; the estimator (stratified Rogan-Gladen with Monte-Carlo uncertainty propagation) is the validated build. This bound is only as strong as the human gold it rests on. It is a calibrated estimate from a sample, not a census or a scanner.
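To make "Monte-Carlo uncertainty propagation" concrete, here is one possible sketch and not the validated estimator itself. It assumes Beta posteriors with uniform priors for se, sp, and the raw flag-rate, redraws when se + sp <= 1, clips each corrected draw to [0, 1], and combines strata draw by draw. The gold confusion counts (tp, fn, tn, fp) and the strata in the example are hypothetical, since the report gives only gold class totals, and partial pooling for thin strata is not shown.

```python
import random


def draw_prevalence(rng, flagged, n, tp, fn, tn, fp):
    """One posterior draw of a stratum's corrected prevalence."""
    while True:
        se = rng.betavariate(tp + 1, fn + 1)  # sensitivity from gold PII docs
        sp = rng.betavariate(tn + 1, fp + 1)  # specificity from gold clean docs
        if se + sp > 1.0:                     # redraw if the correction is ill-posed
            break
    q = rng.betavariate(flagged + 1, n - flagged + 1)  # raw flag-rate
    p = (q + sp - 1.0) / (se + sp - 1.0)               # Rogan-Gladen
    return min(1.0, max(0.0, p))


def corpus_interval(strata, draws=4000, seed=0, level=0.95):
    """Percentile interval for the design-weighted corpus prevalence."""
    rng = random.Random(seed)
    total_w = sum(s["weight"] for s in strata)
    combined = []
    for _ in range(draws):
        # Combine one draw per stratum, so uncertainty propagates jointly.
        est = sum(
            s["weight"] * draw_prevalence(rng, s["flagged"], s["n"], *s["gold"])
            for s in strata
        )
        combined.append(est / total_w)
    combined.sort()
    lo = combined[int((1 - level) / 2 * (draws - 1))]
    hi = combined[int((1 + level) / 2 * (draws - 1))]
    return lo, hi


# HYPOTHETICAL strata; "gold" is (tp, fn, tn, fp).
example_strata = [
    {"weight": 0.6, "flagged": 60, "n": 200, "gold": (25, 3, 56, 4)},
    {"weight": 0.4, "flagged": 15, "n": 150, "gold": (10, 2, 40, 3)},
]
print(corpus_interval(example_strata))
```

Seeding the generator keeps the interval reproducible from run to run.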
3 Method
Sampling: Stratified Neyman allocation, n=450, drawn without replacement from a seeded RNG.
Calibration: Per-stratum Rogan-Gladen; thin strata partial-pool se/sp from the corpus posterior (POOLED); strata with an empty gold class are reported uncalibrated and excluded (UNCALIBRATED). Corpus CI via a draw-level design-weighted combine.
Judge: mock (local / loopback, no data egress).
Gold budget: 160 human-labelled documents, drawn stratified-random from the judged sample (exchangeable with the remainder).
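The sampling step can be illustrated with a short sketch of Neyman allocation, which sizes each stratum's sample in proportion to N_h * S_h (stratum size times within-stratum standard deviation), followed by a seeded draw without replacement. The stratum sizes and standard deviations below are hypothetical because the report does not list them. The largest-remainder rounding is one reasonable choice rather than a documented one, and the sketch does not cap an allocation at the stratum size.

```python
import random


def neyman_allocation(sizes, stdevs, n):
    """Split a total sample of n across strata in proportion to N_h * S_h."""
    scores = [N * s for N, s in zip(sizes, stdevs)]
    total = sum(scores)
    raw = [n * sc / total for sc in scores]
    alloc = [int(r) for r in raw]  # floor each share
    # Hand the leftover units to the largest fractional remainders.
    leftover = n - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def draw_sample(doc_ids, k, seed):
    """Seeded simple random sample without replacement."""
    return random.Random(seed).sample(doc_ids, k)


# HYPOTHETICAL stratum sizes and pilot standard deviations.
sizes = [5500, 3300, 1200]
stdevs = [0.45, 0.20, 0.35]
alloc = neyman_allocation(sizes, stdevs, n=450)
print(alloc, sum(alloc))

sample = draw_sample(list(range(sizes[0])), alloc[0], seed=42)
print(len(sample))
```

Fixing the seed means the same documents are selected on every run.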
4 Provenance · reproducibility
corpus: audit_cli/sample_evidence_pack/synthetic_corpus.jsonl