Friday, 14 June 2013

Total Surveillance, the NSA and the problem of False Positives

Total Surveillance, Prism, the NSA and the problem of false positives

This is just a quick one and pretty off topic from my usual posts, but it was inspired by hearing statements like “If you're innocent you have nothing to hide” once too often.

For background here, google “Prism”, “Edward Snowden”, etc, etc, but I'm going to assume you haven't been living in a cave for the past couple of weeks and are aware that it turns out that the US government has been spying on its citizens to an extent that the Soviets and East Germans could only have dreamed of. They've been gathering data from the Big Data giant squids to search for terrorists or something. You know that time when you searched for Valentine's presents for your loved one at home and then found that when you got to work every site you visited had loads of adverts for lingerie? Kind of like that, only much more scary.

There are those of us who worry about this kind of thing. And there are others who think “hey, I'm not a terrorist, this can only be good for me”. I'll leave aside arguments that revolve around possible abuses by theoretical future oppressive governments. Not because I think those arguments aren't valid, just because there is a much more pressing reason you should worry about the government reading your emails, tweets, Facebook profile, browsing history, etc, etc. This is the problem of false positives.

Let us accept that no statistical test is flawless. And whatever the details of the NSA's program, these are certainly statistical tests, unless you are willing to believe that they have a mirror population living in an alternate reality sifting through your records by hand.

I don't have space here for a crash course in Bayesian probability, but I assume anyone reading this blog can refresh their memories from Wikipedia.

So let's say that the probability of the test showing red (or whatever) for a particular person, given that they are actually a terrorist, is \( P \left[ Red | Terrorist \right] = \frac{99}{100} \). The exact number in there doesn't really matter, but 99% seems reasonable. Let's also claim that the chance of a given person being a terrorist is pretty low. How low? One in a million? Seems reasonable to me, but let's call it one in a hundred thousand: \( P \left[ Terrorist \right] = \frac{1}{100000} \). For comparison, that's higher than the probability that someone in the UK will commit murder in the next year, based on a back-of-the-envelope calculation with 2011-2012 data.

Now I'm interested in the probability that the NSA has found an actual, real live terrorist given that the test has flashed red. As opposed to, say, a chemical engineering grad student googling for photos of the village in Pakistan his dad grew up in. So call up the Reverend Bayes thusly:

\[ P \left[ Terrorist | Red \right] = \frac { P \left[ Red | Terrorist \right] P \left[ Terrorist \right] }{ P \left[ Red | Terrorist \right] P \left[ Terrorist \right] + P \left[ Red | Innocent \right] P \left[ Innocent \right] } \]

and note that we're missing the false positive probability I mentioned above - \( P \left[ Red | Innocent \right] \). I've left this one for the moment to demonstrate something. How accurate do you think the test needs to be to be useful in this context? \( P \left[ Red | Innocent \right] = 1\% \)? \( P \left[ Red | Innocent \right] = 0.1\% \)? For context, medical drug tests apparently generate false positives about 10% of the time, and this article had false positives for breast cancer screening around 50%!

Put some numbers in the equation above:

P[Red|Innocent]    P[Terrorist|Red]
             1%             0.0989%
           0.1%             0.9803%
          0.01%             9.0083%
         0.001%             49.749%
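
For anyone who wants to check the table, here is a minimal Python sketch of the same calculation. It just plugs the numbers from this post into Bayes' theorem; the function and variable names are my own for illustration, and the false positive rates are the four hypothetical values from the table, not anything known about a real system.

```python
def posterior(p_red_given_terrorist, p_terrorist, p_red_given_innocent):
    """Return P[Terrorist | Red] using Bayes' theorem."""
    p_innocent = 1.0 - p_terrorist
    true_red = p_red_given_terrorist * p_terrorist    # red and actually a terrorist
    false_red = p_red_given_innocent * p_innocent     # red but innocent
    return true_red / (true_red + false_red)


# Assumed inputs, taken from the text above.
P_RED_GIVEN_TERRORIST = 0.99
P_TERRORIST = 1e-5

# Hypothetical false positive rates: 1%, 0.1%, 0.01%, 0.001%.
FALSE_POSITIVE_RATES = [1e-2, 1e-3, 1e-4, 1e-5]

for fp_rate in FALSE_POSITIVE_RATES:
    post = posterior(P_RED_GIVEN_TERRORIST, P_TERRORIST, fp_rate)
    print(f"{fp_rate:>10.3%} {post:>12.4%}")
```

Changing `P_TERRORIST` or `P_RED_GIVEN_TERRORIST` shows how little the conclusion depends on the exact values chosen.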

So even at uncanny levels of “accuracy” for a statistical test (the top end of the table), the probability that a red flag points at an actual terrorist is negligible. More likely they're just kicking down the door of some guy who keeps his blinds down because he's nursing a brutal hangover most days. Or it could be your door. Sure, you can plead innocence, but who're they going to believe? The NSA's computer, or a terrorist? Even at near impossible levels there's only a fifty-fifty chance the test is correct.

The problem here is that the test doesn't provide proof; no statistical test ever does. It just provides evidence. Since the chance of a random person being a terrorist is so small, the evidence needs to be pretty damning to balance that fact out.
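
One way to see how damning the evidence has to be is to write Bayes' theorem in odds form. This is just a rearrangement of the formula above, using the same assumed numbers:

\[ \frac{ P \left[ Terrorist | Red \right] }{ P \left[ Innocent | Red \right] } = \frac{ P \left[ Terrorist \right] }{ P \left[ Innocent \right] } \times \frac{ P \left[ Red | Terrorist \right] }{ P \left[ Red | Innocent \right] } \]

With \( P \left[ Terrorist \right] = \frac{1}{100000} \), the prior odds are \( \frac{1}{99999} \), so the ratio \( \frac{ P \left[ Red | Terrorist \right] }{ P \left[ Red | Innocent \right] } \) has to be about 100,000 just to get to even odds. With \( P \left[ Red | Terrorist \right] = \frac{99}{100} \), that means \( P \left[ Red | Innocent \right] \approx 0.001\% \).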

The situation isn't really any better if the response to a “red” isn't to immediately scramble a black helicopter squadron. Assume they assign a case agent. In a population of 350m you end up with millions or hundreds of thousands of possible cases to investigate. Only if you get to the extremely unlikely bottom end of the table do you end up with merely tens of thousands, or a few thousand on the very last row.
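
The caseload is easy to estimate. The sketch below is a rough expected-value calculation, assuming the 350m population and the probabilities used earlier in this post; the names are mine, and the false positive rates are again hypothetical.

```python
# Assumed inputs, taken from the text above.
POPULATION = 350_000_000
P_TERRORIST = 1e-5
P_RED_GIVEN_TERRORIST = 0.99


def expected_flags(population, p_terrorist, p_red_given_terrorist, p_red_given_innocent):
    """Return the expected (real terrorists flagged, innocents flagged)."""
    terrorists = population * p_terrorist
    innocents = population - terrorists
    true_positives = terrorists * p_red_given_terrorist
    false_positives = innocents * p_red_given_innocent
    return true_positives, false_positives


# Hypothetical false positive rates: 1%, 0.1%, 0.01%, 0.001%.
for fp_rate in (1e-2, 1e-3, 1e-4, 1e-5):
    real, innocent = expected_flags(
        POPULATION, P_TERRORIST, P_RED_GIVEN_TERRORIST, fp_rate
    )
    print(f"{fp_rate:.3%}: {real:,.0f} real, {innocent:,.0f} innocent")
```

The number of real terrorists flagged stays the same on every row; only the pile of innocent people to wade through changes.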

And no, I don't believe that the NSA has tests to that accuracy. Why? No training set. The number of terrorists caught is tiny (see for example this article). You can't train a prediction algorithm on that number of incidents to the degree of accuracy needed to make this whole exercise useful.

Which leads me to my worrying final point: since they haven't arrested half the population of the US, what's the NSA actually been doing? Have they overcome statistical uncertainty and developed some super test with minuscule numbers of false positives? Are they busy working their way through a huge list of possibles, trying to separate the innocent from the terrorists by old-fashioned police work? Are they actually gathering and using the data for some other purpose?

You don't have to be a pinko commie liberal like me to worry about government surveillance. It's not a political choice. It's just mathematics. Even if you think you don't have anything to hide.
