Product
How much personal information is hiding in your data? Get a number you can prove.
Big datasets are impossible to read by hand, so most companies genuinely do not know how much personal information they are holding: names, emails, addresses, health details, and the like. When a regulator, an auditor, or a customer asks, you need a real number, not a guess. We measure it from a sample, check that sample against your own reviewers, and give you a rate with an honest margin of error. A number you can put on the record.
✓ Tested on public datasets where the true answer was already known. Our number landed on target every time.
Pricing
Three ways to get the same evidence pack.
Have us certify one dataset for the record, have us run and stand behind all of your data, or license the tool and run it yourself. You get the same evidence pack whichever you choose.
Compliance Evidence Pack
The evidence you have been asked to produce, that stands up.
For a privacy regulator's inquiry, a lawsuit, a merger, a breach, or an EU AI Act training-data summary. We certify a single dataset and hand you a pack your legal team can put on the record.
- A measured rate, with a 95% margin of error
- A breakdown by source
- A written method anyone can re-check
Prevalence Diagnostic
We run all of your data and stand behind the number.
- Everything in the evidence pack, across every source
- We size the dataset and the sampling plan with you
- We stand behind the result with your auditors and lawyers
Keep it current: a monitoring retainer from $4,000/mo. We re-run it on a schedule and update the number as your data changes.
Scope a diagnostic
Local tool license
Run the audit yourself, on your own computers.
- The same evidence pack, produced in-house
- Runs on your own computers, so no document leaves your building
- For teams that audit regularly
Enterprise: pricing on request.
License the tool
Want to try it first? The open-core tool and a free Claude skill are available at no cost.
Get the free tool
What you can measure
Personal information is just the first thing you can count this way.
If a trained person can label it, we can put a number on how often it turns up in a pile of documents too big to read by hand. Below: what we have already proven, and where the same method fits next.
Under the hood
How the number is produced.
You never read the whole dataset, and you never take the detector's word for it. Four steps, below. The full walk-through, with the real sample pack, is on the how it works page.
We take a representative sample across the dataset, spread over your sources, so a small read stands in for the whole.
Your reviewers label a small answer key, on your own computers. That is what the number is measured against.
We measure how often the detector is wrong, checking it against the answer key, and correct for that, source by source.
We report a rate for the whole dataset with a 95% margin of error, and we re-check it as your data changes.
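The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production method: it assumes the answer key is a random sample of flagged and unflagged documents within each source, treats the detector's flag rate as known, and uses a simple normal-approximation margin. The `SourceAudit` fields, the source names, and every number are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class SourceAudit:
    name: str
    n_docs: int            # documents in this source
    flag_rate: float       # share of documents the detector flagged
    flagged_reviewed: int  # flagged documents in the answer key
    flagged_true: int      # of those, how many reviewers called personal
    clear_reviewed: int    # unflagged documents in the answer key
    clear_true: int        # of those, how many reviewers called personal

def source_rate(s):
    """Corrected rate and variance for one source."""
    hit = s.flagged_true / s.flagged_reviewed   # how often a flag is right
    miss = s.clear_true / s.clear_reviewed      # how often a non-flag is wrong
    rate = s.flag_rate * hit + (1 - s.flag_rate) * miss
    var = (s.flag_rate ** 2 * hit * (1 - hit) / s.flagged_reviewed
           + (1 - s.flag_rate) ** 2 * miss * (1 - miss) / s.clear_reviewed)
    return rate, var

def dataset_rate(sources, z=1.96):
    """Size-weighted rate across sources, with a 95% margin of error."""
    total = sum(s.n_docs for s in sources)
    rate = var = 0.0
    for s in sources:
        weight = s.n_docs / total
        r, v = source_rate(s)
        rate += weight * r
        var += weight ** 2 * v
    return rate, z * sqrt(var)

# Hypothetical inputs, for illustration only.
sources = [
    SourceAudit("support_tickets", 800_000, 0.30, 100, 80, 100, 5),
    SourceAudit("web_crawl", 200_000, 0.10, 100, 50, 100, 2),
]
rate, margin = dataset_rate(sources)
raw = sum(s.n_docs * s.flag_rate for s in sources) / sum(s.n_docs for s in sources)
print(f"raw scan: {raw:.1%}, corrected: {rate:.1%} +/- {margin:.1%}")
```

With these made-up inputs the raw scan flags 26% of documents, while the corrected rate comes out at roughly 23.4%, plus or minus about 3.1 percentage points. The margin here covers only the reviewed sample; it does not cover disagreement between reviewers.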
Why a scanner isn't enough
A raw count of matches is not something you can prove.
A scanner flags matches one document at a time. Across a huge dataset that is millions of flags nobody can review, and you cannot stand behind a raw count. We check a sample instead, correct for how often the detector is wrong, and give a rate for the whole dataset with a margin of error. The review cost barely changes as the dataset grows, so the bigger your data, the more this matters. It works the same whether you are counting personal information, copyrighted text, or anything else a person can label.
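To see why the review cost barely changes as the dataset grows, here is the textbook sample-size formula for a simple random sample with a finite-population correction. It is a simplified stand-in, assuming one unstratified sample and a worst-case rate of 50%; a real sampling plan spread over several sources would differ. The function name and the 3-point target margin are hypothetical.

```python
from math import ceil

def sample_size(n_docs, margin=0.03, z=1.96, worst_case_rate=0.5):
    """Documents to review for a given margin of error at 95% confidence."""
    # Sample size for an effectively infinite dataset.
    n0 = z ** 2 * worst_case_rate * (1 - worst_case_rate) / margin ** 2
    # Finite-population correction: small datasets need slightly fewer.
    return ceil(n0 / (1 + (n0 - 1) / n_docs))

for n_docs in (10_000, 1_000_000, 1_000_000_000):
    print(f"{n_docs:>13,} documents -> review {sample_size(n_docs):,}")
```

Going from ten thousand documents to a billion adds only about a hundred documents to the review under these assumptions.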
The overcount
Your redaction number may be too high.
Off-the-shelf detectors are tuned to over-flag rather than miss, so their raw totals overstate how much personal information you actually hold, and you end up redacting, reviewing, and migrating data that was never personal. We correct for that and report only what your reviewers would call personal. On most datasets the honest number lands below the raw scan: less to redact, less to mitigate, a smaller bill. It moves both ways, so when a scanner is under-counting the number goes up instead. Either way, it is the number you can defend.
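A short piece of arithmetic shows how a raw scan can miss in either direction. Every figure below is hypothetical: a dataset where 5% of documents truly contain personal information, scanned once by a detector tuned to over-flag and once by one that misses half of what is there.

```python
def raw_flag_rate(true_rate, catch_rate, false_alarm_rate):
    """Share of documents a detector flags, given how it behaves.

    true_rate:        share of documents that really are personal
    catch_rate:       share of personal documents the detector flags
    false_alarm_rate: share of non-personal documents it flags anyway
    """
    return true_rate * catch_rate + (1 - true_rate) * false_alarm_rate

# Hypothetical detectors, for illustration only.
over = raw_flag_rate(0.05, catch_rate=0.95, false_alarm_rate=0.10)
under = raw_flag_rate(0.05, catch_rate=0.50, false_alarm_rate=0.01)
print(f"true rate 5.0%, over-flagging scan {over:.2%}, under-flagging scan {under:.2%}")
```

In this example the over-flagging scan reports 14.25%, nearly three times the true rate, while the under-flagging scan reports 3.45%. Neither raw count matches what reviewers would call personal.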
Scope a diagnostic
Tell us about your data and what you are worried about. We come back with a sampling plan, a labeling budget, and a fixed price.
All demo data is synthetic. We never ship real data off-site to run an audit.
For personal information the method is tested on public datasets where the true answer was already known (legal judgments and a public privacy dataset), and our number landed on target. It has not yet run on a live customer dataset; the first project will be that check. The other uses above fit the same method, but we have not run them yet. The margin of error covers sampling; it does not settle disagreement over what counts, which we pin down with your reviewers up front.