Product

How much personal information is hiding in your data? Get a number you can prove.

Big datasets are impossible to read by hand, so most companies genuinely do not know how much personal information they are holding: names, emails, addresses, health details, and the like. When a regulator, an auditor, or a customer asks, you need a real number, not a guess. We measure it from a sample, check that sample against your own reviewers, and give you a rate with an honest margin of error. A number you can put on the record.

See what you get · Walk a worked example

Runs on your own computers, nothing is sent out
One number for the whole dataset, not a pile of alerts
Checked against your own reviewers

✓ Tested on public datasets where the true answer was already known. Our number landed on target every time.

Pricing

Three ways to get the same evidence pack.

Have us certify one dataset for the record, have us run and stand behind all of your data, or license the tool and run it yourself. You get the same evidence pack whichever you choose.

Evidence pack · fixed price

Compliance Evidence Pack

$22,000

fixed, one dataset

The evidence you have been asked to produce, in a form that stands up.

For a privacy regulator's inquiry, a lawsuit, a merger or a breach, or an EU AI Act training-data summary. We certify a single dataset and hand you the pack your legal team can put on the record.

A measured rate, with a 95% margin of error
A breakdown by source
A written method anyone can re-check

Request an evidence pack

Diagnostic · we run all your data

Prevalence Diagnostic

$45,000+

one-time, sized to your data

We run all of your data and stand behind the number.

Everything in the evidence pack, across every source
We size the dataset and the sampling plan with you
We stand behind the result with your auditors and lawyers

Keep it current: a monitoring retainer from $4,000/mo. We re-run it on a schedule and update the number as your data changes.

Scope a diagnostic

License · run it yourself

Local tool license

$1,000/yr

per team

Run the audit yourself, on your own computers.

The same evidence pack, produced in-house
Runs on your own computers, so no document leaves your building
For teams that audit regularly

Enterprise: pricing on request.

License the tool

Want to try it first? The open-core tool and a Claude skill are available at no cost.

Get the free tool

What you can measure

Personal information is just the first thing you can count this way.

If a trained person can label it, we can put a number on how often it turns up in a pile of documents too big to read by hand. Below: what we have already proven, and where the same method fits next.

Under the hood

How the number is produced.

You never read the whole dataset, and you never take the detector's word for it. Four steps, below. The full walk-through, with the real sample pack, is on the how it works page.

Step 1

Sample · by source

We take a representative sample across the dataset, spread over your sources, so a small read stands in for the whole.
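
A minimal sketch of how a sample might be spread over sources, assuming simple proportional allocation. The function names, source names, and sizes are illustrative assumptions, not the tool's actual interface or sampling plan.

```python
import random


def allocate_sample(source_sizes, total_sample):
    """Split total_sample across sources in proportion to each source's size.

    Uses largest-remainder rounding so the allocations add up exactly.
    """
    total = sum(source_sizes.values())
    exact = {s: total_sample * n / total for s, n in source_sizes.items()}
    alloc = {s: int(q) for s, q in exact.items()}
    leftover = total_sample - sum(alloc.values())
    # Give the remaining units to the sources with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc


def draw_sample(doc_ids_by_source, total_sample, seed=0):
    """Draw a random sample of document ids from each source."""
    rng = random.Random(seed)  # fixed seed so the draw can be re-checked
    sizes = {s: len(ids) for s, ids in doc_ids_by_source.items()}
    alloc = allocate_sample(sizes, total_sample)
    return {
        s: rng.sample(ids, min(alloc[s], len(ids)))
        for s, ids in doc_ids_by_source.items()
    }


# Hypothetical dataset: three sources of very different sizes.
plan = allocate_sample({"email": 6000, "tickets": 3000, "wiki": 1000}, 500)
print(plan)
```

Proportional allocation is only one option; a real plan might oversample small or high-risk sources and reweight afterwards.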

Step 2

Label · your reviewers

Your reviewers label a small answer key, on your own computers. That is what the number is measured against.

Step 3

Correct · for detector error

We measure how often the detector is wrong, checking it against the answer key, and correct for that, source by source.
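
One way to picture "how often the detector is wrong" is a per-source comparison of detector flags against the reviewers' answer key. The sketch below assumes each labeled document yields a pair of booleans; the data, source names, and function name are made up for illustration.

```python
def detector_error_rates(pairs):
    """Compare detector flags with reviewer labels for one source.

    pairs: iterable of (reviewer_says_personal, detector_flagged) booleans.
    """
    tp = fp = fn = tn = 0
    for truth, flagged in pairs:
        if truth and flagged:
            tp += 1  # detector and reviewer agree: personal
        elif truth:
            fn += 1  # detector missed something the reviewer found
        elif flagged:
            fp += 1  # detector flagged something the reviewer cleared
        else:
            tn += 1  # both agree: not personal
    positives, negatives = tp + fn, tn + fp
    return {
        "sensitivity": tp / positives if positives else None,
        "specificity": tn / negatives if negatives else None,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Hypothetical answer key: 100 labeled documents per source.
answer_key = {
    "email": [(True, True)] * 18 + [(True, False)] * 2
    + [(False, True)] * 8 + [(False, False)] * 72,
    "wiki": [(True, True)] * 3 + [(False, True)] * 12 + [(False, False)] * 85,
}
rates_by_source = {src: detector_error_rates(p) for src, p in answer_key.items()}
for src, rates in rates_by_source.items():
    print(src, rates)
```

Keeping the rates separate per source matters because a detector can over-flag in one source and miss items in another.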

Step 4

Report · rate + margin

We report a rate for the whole dataset with a 95% margin of error, and we re-check it as your data changes.
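
As a rough illustration of what "a rate with a 95% margin of error" can mean for a sample spread over sources, here is a textbook stratified estimate with a normal-approximation margin. It is a sketch under stated assumptions (simple random sampling within each source, made-up counts), not a description of the product's exact calculation.

```python
import math


def stratified_rate(strata, z=1.96):
    """Estimate a dataset-wide rate and its margin of error.

    strata: {source: (population_size, sample_size, positives_in_sample)}
    z: 1.96 corresponds to a 95% margin under the normal approximation.
    """
    total = sum(pop for pop, _, _ in strata.values())
    rate = 0.0
    variance = 0.0
    for pop, n, k in strata.values():
        weight = pop / total          # share of the dataset in this source
        p = k / n                     # sampled rate within this source
        fpc = 1 - n / pop             # finite population correction
        rate += weight * p
        variance += weight**2 * fpc * p * (1 - p) / (n - 1)
    return rate, z * math.sqrt(variance)


# Hypothetical counts: (documents in source, documents sampled, labeled personal).
strata = {
    "email": (600_000, 300, 45),
    "tickets": (300_000, 150, 6),
    "wiki": (100_000, 50, 1),
}
rate, margin = stratified_rate(strata)
print(f"rate = {rate:.1%} +/- {margin:.1%}")
```

The normal approximation gets shaky when a source has very few positives in its sample; an exact or Wilson-style interval is a common substitute in that case.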

Why a scanner isn't enough

A raw count of matches is not something you can prove.

A scanner flags matches one document at a time. Across a huge dataset that is millions of flags nobody can review, and you cannot stand behind a raw count. We check a sample instead, correct for how often the detector is wrong, and give a rate for the whole dataset with a margin of error. The review cost barely changes as the dataset grows, so the bigger your data, the more this matters. It works the same whether you are counting personal information, copyrighted text, or anything else a person can label.
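
The claim that review cost barely changes as the dataset grows follows from the standard sample-size formula for a proportion. The sketch below uses the worst-case rate of 50% and a target margin of two percentage points, both chosen as illustrative assumptions.

```python
import math


def required_sample(population, margin=0.02, p=0.5, z=1.96):
    """Documents to review for a given margin of error at 95% confidence.

    Uses n0 = z^2 * p * (1 - p) / margin^2 with a finite population correction.
    """
    n0 = z**2 * p * (1 - p) / margin**2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for size in (10_000, 1_000_000, 100_000_000):
    print(f"{size:>11,} documents -> review {required_sample(size):,}")
```

Under these assumptions, going from ten thousand documents to one hundred million raises the review from 1,937 documents to 2,401.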

The overcount

Your redaction number may be too high.

Off-the-shelf detectors are tuned to over-flag rather than miss, so their raw totals overstate how much personal information you actually hold, and you end up redacting, reviewing, and migrating data that was never personal. We correct for that and report only what your reviewers would call personal. On most datasets the honest number lands below the raw scan: less to redact, less to mitigate, a smaller bill. It moves both ways, so when a scanner is under-counting the number goes up instead. Either way, it is the number you can defend.
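
To show how a correction can move the number in either direction, here is a sketch using the Rogan-Gladen estimator, a standard way to adjust an observed flag rate for a detector's known error rates. Whether the product uses this exact estimator is an assumption; the rates below are invented for illustration.

```python
def corrected_rate(flag_rate, sensitivity, specificity):
    """Rogan-Gladen estimate of the true rate from a raw flag rate.

    sensitivity: share of truly personal documents the detector flags.
    specificity: share of non-personal documents the detector leaves alone.
    """
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("detector is no better than chance; cannot correct")
    estimate = (flag_rate + specificity - 1) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid proportion


# An over-flagging detector: the corrected rate falls below the raw scan.
over = corrected_rate(flag_rate=0.18, sensitivity=0.95, specificity=0.88)

# A detector that misses a lot: the corrected rate rises above the raw scan.
under = corrected_rate(flag_rate=0.05, sensitivity=0.60, specificity=0.99)

print(f"over-flagging: raw 18.0% -> corrected {over:.1%}")
print(f"under-counting: raw 5.0% -> corrected {under:.1%}")
```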

Scope a diagnostic

Tell us about your data and what you are worried about. We come back with a sampling plan, a labeling budget, and a fixed price.

Scope a diagnostic · See a worked example

All demo data is synthetic. We never send your real data anywhere to audit it.

For personal information, the method has been tested on public datasets where the true answer was already known (legal judgments and a public privacy dataset), and our number landed on target. It has not yet run on a live customer dataset; the first customer project will be that check. The other uses above fit the same method, but we have not run them yet. The margin of error covers sampling only; it does not settle disagreement over what counts, which we pin down with your reviewers up front.