Over the last couple of days MSNBC’s Keith Olbermann has aired interviews with former National Security Agency employee Russell Tice that have shed a great deal of light on large-scale surveillance and data acquisition conducted during the Bush Administration. For those of you interested in the short version of the story, it boils down to two things:
1) Systemic, pervasive surveillance of ALL (that’s everything, everywhere!) electronic communications was conducted by the NSA, and the metadata gained from it was used to make targeted surveillance of particular people and groups possible.
2) Those groups included the news media: despite what Tice was told initially about how his work would be used to exclude certain groups, he eventually realized that it was actually being used to include them. These targeted people and groups were then subjected to a higher level of surveillance: that is, not only was the metadata acquired and stored, but the data itself (the actual content of every email, every IM, every phone call, every fax, everything) was acquired and stored.
For those of you with more than a passing interest in this sort of thing, the following paragraphs will explain the techniques and the technology that make such massive spying possible.
The “rap” on government wiretaps and electronic surveillance has always been that, while it was possible to collect massive quantities of data (there were literally boxcars full of raw, unheard tape recordings stored at the NSA facility in Ft. Meade, Md. during the Cold War), actually making sense of (or even reviewing) what was collected was another story altogether. The NSA actually picked up conversations that could have been used to disrupt the 9/11 terrorists; they weren’t listened to until after the attack.
Advances in technology have made massive collection of electronic communications possible and made the process of actually retrieving useful data for analysts quick and easy. Blogger Rider on the Storm over at the Daily Kos provides what I’ve found to be the most painless description of how this process works:
The two tiers of surveillance that Mr. Tice described consist of all-encompassing metadata acquisition and more tightly focused data acquisition. Here’s an example of how that might work: suppose you know that The Bad Guys all picked up a certain brand of cheap digital camera and that’s what they’re using to take pictures of potential targets and share them. Suppose that this particular model of camera has a default setting of 1846×948 pixels, and suppose that The Bad Guys are transferring these files around via email, using accounts on free mail providers like Yahoo and Hotmail and Gmail.
What might happen is that somebody writes an algorithm that looks at all the email and flags anything that is to a free mail provider, from a free mail provider, has attached photos, and has attached photos that are 1846×948. That’s the first tier, based entirely on metadata.
Whenever a message is found that matches those criteria, the sender and recipient(s) are noted and from then on, everything they send or receive gets vacuumed up. And that extends way beyond email: if the sender’s phone number or fax number or IM account or anything else can be identified, then everything associated with those gets included too. And per Mr. Tice’s comments about pulling in data from external databases: their credit card records, their bank records, everything else. That’s the second tier, where every scrap of data is picked up.
Which means that if you happened to buy the same cheap digital camera as The Bad Guys and you happen to use Gmail, you’re going to be swept up by that same algorithm and all of your data will be given the same special attention as theirs.
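The two-tier flow described above is easy to sketch in code. To be clear, this is a minimal illustration of the quoted example, not anything the NSA is known to run: the providers, the camera resolution, and the message field names are all just the example’s assumptions.

```python
# Sketch of the two-tier model: tier 1 flags on metadata alone,
# tier 2 collects everything touching a flagged account.
# All criteria and field names are illustrative assumptions.

FREE_PROVIDERS = {"yahoo.com", "hotmail.com", "gmail.com"}
FLAGGED_RESOLUTION = (1846, 948)  # the hypothetical camera's default size

def domain(address):
    return address.split("@")[-1].lower()

def tier_one_match(message):
    """Tier 1: match on metadata only -- no message content is inspected."""
    if domain(message["from"]) not in FREE_PROVIDERS:
        return False
    if not any(domain(r) in FREE_PROVIDERS for r in message["to"]):
        return False
    return any(att["resolution"] == FLAGGED_RESOLUTION
               for att in message.get("attachments", []))

def scan(messages, watchlist):
    """Tier 2: once an account matches, everything it sends or receives
    from then on is collected in full."""
    collected = []
    for msg in messages:
        if tier_one_match(msg):
            watchlist.add(msg["from"])
            watchlist.update(msg["to"])
        if msg["from"] in watchlist or any(r in watchlist for r in msg["to"]):
            collected.append(msg)  # full content retained, not just metadata
    return collected
```

Note the property the article warns about: the innocent Gmail user with the same cheap camera matches tier 1 exactly like The Bad Guys do, and from that point on all of their traffic lands in `collected`.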
Now, I suppose you’re wondering just how the NSA can scan so much data. Let’s let software developer JRandomPoster (also writing for Daily Kos) explain exactly how this all works. It’s not as difficult as you might think.
Fundamentally, these programs attempt to classify the data into sets. In the case of on-line fraud detection, the algorithm would attempt to classify a given order being placed with the on-line retailer as fraudulent or not. In the case of insurance companies, the software would attempt to classify applicants into the categories of good or bad risk. Note that the classification into a given solution set is often weighted; that is, there is a probability associated with the classification. So, again using on-line fraud detection algorithms as an example, a given order might be classified as having an 80% chance of being fraudulent and a 20% chance of being legitimate. It is up to the user of the algorithm to determine what threshold is used to consider a data point properly classified; in our example, if the threshold was 75%, the order would be considered fraudulent; if it was 90%, it would not.
Such machine learning programs rarely process the entire set of available raw data directly to generate a classification. Rather, metadata is used. In some cases, metadata is available as part of the total data, as in the case of emails. In an email, the header gives information about the sender and the recipient, the route the mail took, the subject line, and so forth. The body of the email, however, falls into the category of raw data. This does not mean, however, that metadata cannot be extracted from the body. There are numerous algorithms and techniques for taking large bodies of raw data and extracting descriptive metadata from it that can be used by a classifier. Such extracted metadata may not be human readable, but it allows the classifier to operate on the transformed raw data.
Thus, it should be noted that the manner in which such classifiers work does not require the program to “understand” everything in the communication. Rather, certain critical data points are extracted or extrapolated from the overall data. Thus, a voice recognition component of such a system would not have to understand every word; rather, it could look for certain words or combinations of words. Similarly, a text analyzer could look for certain key words, phrases and constructions. A common every day example of such a technique that most of us use regularly are the Bayesian classifiers used to filter for spam.
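A toy version of the Bayesian classification mentioned above makes the point concrete: the program never “understands” a message, it just counts words per class and picks the more probable label. The tiny training set below is invented for illustration.

```python
import math
from collections import Counter

def train(docs_by_label):
    """Count word frequencies and class priors per label (naive Bayes)."""
    counts, totals, priors = {}, {}, {}
    n_docs = sum(len(d) for d in docs_by_label.values())
    for label, docs in docs_by_label.items():
        words = Counter(w for doc in docs for w in doc.lower().split())
        counts[label] = words
        totals[label] = sum(words.values())
        priors[label] = len(docs) / n_docs
    return counts, totals, priors

def classify(text, counts, totals, priors):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(priors[label])
        for w in text.lower().split():
            # Counter returns 0 for unseen words; smoothing keeps log() finite.
            score += math.log((counts[label][w] + 1) /
                              (totals[label] + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Trained on a handful of spam and non-spam messages, the classifier labels new text by nothing more than which class made its key words more likely, which is exactly the keyword-and-phrase matching the paragraph above describes.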
One fundamental principle of such machine learning and data mining programs, or classifiers, is that they require data points from each set that they will be classifying for. For example, when on-line retailers train their fraud detection software, that software must be provided with a data set of customers who are placing legitimate orders, and a data set of orders that are fraudulent. (These data sets are drawn from previous orders that have been proven to be legitimate or fraudulent.) In general, a model is built based on known data with the assumption that the known data is representative of future data to be processed.
Unfortunately, given the way such classifiers work, it is not possible to train them with just a set of, in this example, non-fraudulent orders, and call any others fraudulent. This is a fundamental principle of machine learning; each and every set classification that the program attempts to assign samples to must have been seen by the classifier during its training phase. This does not mean that every possible data point must have been seen by any means; however, it does mean that an adequate representative sample from each possible classification must have been seen by the classifier during its training phase.
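The structural point can be demonstrated with the simplest classifier imaginable, a nearest-centroid model (the two-dimensional data points are made up): whatever input you feed it, its answer can only ever be one of the labels present in its training set.

```python
def train_centroids(samples_by_label):
    """A minimal nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label, samples in samples_by_label.items():
        dims = len(samples[0])
        centroids[label] = [sum(s[i] for s in samples) / len(samples)
                            for i in range(dims)]
    return centroids

def classify(point, centroids):
    """The output is always a training label -- unseen classes can't appear."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: sq_dist(point, centroids[lbl]))
```

Train it only on “legitimate” examples and it calls everything legitimate, no matter how far from the training data a point falls; only when both classes appear in training can it ever flag anything as fraudulent.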
In the case of any government wiretapping, and given the sheer number of such communications, it would be impossible to have human beings read and classify them all as either terrorist or non-terrorist. However, in order for an automated classifier to work, it would have to be trained with both a set of terrorist communications and a set of non-terrorist communications.
And herein lies the first problem with the government wiretapping programs, even if the intended goal is to automate the entire system so that the communications are not necessarily read by people. In order to train the automated system, a sample set of legitimate communications must be used to teach the program to recognize legitimate communications. And the only way to do this is to use known legitimate communications, which would have had to be selected manually. Furthermore, given the complexity of training such a system, the number of communications required for training would have to be very large. Additionally, as communication patterns change, the system would have to be continually retrained, meaning more manually examined communications. This means that no matter how well-intended, no matter how many legal protections were in place, at some point known legitimate communications would have to be read by humans. There is no escaping this with existing technologies and techniques.
Finally, there is the problem of searching the databases that are created using these methods. The government has collected all this data and needed to find a way to make it useful. Enter the “Fractal Database”. Again, I’ll defer to ANOTHER blogger over at Daily Kos, PBnJ:
Fractal data technology isn’t new. It’s been around since the ’80s and has been used for compressing and decompressing images, including images shipped on CD-ROMs.
A fractal is essentially any geometric pattern that repeats inward and outward upon itself. Forever, like an unending snowflake.
No matter how many times you zoom in, there are more tiny snowflakes. This is a bit simplistic, because in a fractal database the data isn’t so neatly organized as this. But the geometry imposes structure: fractal technology creates order from chaos. The very same chaos that comes from dumping literally centillions of bits and bytes into a database someplace deep in the heart of the NSA. We’re talking many centillion terabytes of data. About you and me. Your kids. Your friends and what you do late at night when you think no one is watching. They’re watching everything.
Until recently its applications have been limited. The real power behind fractal geometry as it applies to data has been pretty elusive.
But as with all things technology, Moore’s Law ensured it was only a matter of time before it became powerful enough, and cheap enough, for someone to figure out how to make it work and make money from it, or how to effectively spy on people.
With a fractal database, someone simply has to dump data in. Then the magic happens.
That fractally little thing knows stuff about you. And it doesn’t speak in any sort of code. It speaks in normal words that reveal who you are and what you do. It looks at you and it knows you travel. And where. It knows who your friends are through your phone calls and your Facebook profile. It knows what you spend and how you spend it. It can search in upon itself again and again, in mere nanoseconds, finding patterns and themes about you like the most complex mashup ever seen, combining the web sites you visit, the phone calls you make, where you work, your donations, your grocery purchases, and the kind of underwear you buy.
It can find patterns of behavior that are similar to others. It links you to others you don’t know. And the government is very hungry for more data about you.
Well, there it is in a nutshell. Sadly, the Obama administration, despite all the executive orders that have been signed (banning torture & closing Gitmo), doesn’t seem to be in any particular hurry to sort through the implications of all this data the government is collecting on everybody. Stay tuned….