1/10/2006

RedState: Able Danger and Data Mining? Impossible?

no data mining technique identified Mohammed Atta (or any other hijacker), at least in any meaningful sense.

Able Danger and Data Mining
By: Buckland · Section: Diaries

... I will talk about the data mining aspect of this. ... For several reasons that I'll go into in as much detail as you want, no data mining technique identified Mohammed Atta (or any other hijacker), at least in any meaningful sense.

First a little about myself ... I work as a statistician ... [currently pursuing a] ... Data Mining Masters Degree Program ... In addition I have worked on about 20 data mining projects for companies ... Heck, I even get paid to speak at very expensive conferences on the subject [Search page for "Buckland"].

So here are my objections to the idea that a group of geniuses inside the Pentagon pointed their favorite data mining tool at the data, said "sic 'em", and got Mohammed Atta's name.

No Data

My first question is: what data was used for this effort? Probably the only data available to the government that shows Atta for sure is immigration data: when he entered or exited, type of visa, etc. That's pretty barren ground for predicting interesting stuff like terrorist activity. I have serious doubts that data from Egypt (Atta's homeland) would have been either forthcoming or useful, as it would present horrendous integration issues, and real data miners tend to stay away from those. I also doubt that integrating credit bureau data would have been worth the hassle, as most of the hijackers seem to have been reasonably well off financially.

Airline data would have been even less useful. Prior to 9/11 there was no single repository that housed airline data; each airline kept its own data separately. Airline data is also extremely hard to work with. There's no identifier in airline data to identify a passenger beyond name. Matching millions of people and their visa data with airline travel patterns is just something that isn't going to happen with a team of 11 guys. That in itself is a project for years and a large team.
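To make the name-matching problem concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the records are invented, and `normalize` and `match_by_name` are hypothetical helpers, not anything from a real visa or airline system. It only shows what happens when the name is the sole join key.

```python
from collections import defaultdict

def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the name tokens
    so that word order and commas do not affect the comparison."""
    cleaned = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))

def match_by_name(visa_records, airline_records):
    """For each airline record, list every visa ID with the same normalized name."""
    index = defaultdict(list)
    for visa_id, name in visa_records:
        index[normalize(name)].append(visa_id)
    return {ticket: index.get(normalize(name), []) for ticket, name in airline_records}

# Invented records, for illustration only.
visa_records = [
    ("V1", "John A. Smith"),
    ("V2", "Smith, John A"),
    ("V3", "Maria Lopez"),
]
airline_records = [
    ("T1", "SMITH JOHN A"),
    ("T2", "Lopez Maria"),
    ("T3", "J Smith"),
]

matches = match_by_name(visa_records, airline_records)
for ticket, candidates in matches.items():
    print(ticket, candidates)
```

Even in this toy case one ticket matches two different visa holders and another matches nobody, because an abbreviated first name defeats the join. Scale that up to millions of travelers and several airlines and you can see why this is a multi-year project.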

No Training Set

A related objection is what could have been used for a "training set". A training set is the data that shows the result of interest (committed a terrorist act) and is used to build a statistical model that finds the data attributes associated with such activity. For rare events like terrorism a data miner will "oversample" to get as many terrorist events as possible so the model can be trained correctly. However, prior to 9/11 there was almost no record of terrorism by foreigners. Without a large number of actual terrorist events in this country there's just no way to correlate the attributes of a terrorist and assign probabilities to the event. No way to train the model. Prior to 2001 terrorist data included ex-Army guys from Kansas and '60s-era protesters. Picking out an Egyptian student as a terrorist? Just can't happen.
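Here is a rough sketch of what oversampling looks like, in plain Python. The data is made up (1,000 ordinary records and 3 positive ones), and `oversample` is a hypothetical helper doing simple random oversampling with replacement, which is only one of several ways to do it.

```python
import random

def oversample(rows, label_key="label", seed=0):
    """Random oversampling: draw positives with replacement until the
    positive class is as large as the negative class."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if not positives:
        raise ValueError("no positive examples to oversample")
    resampled = rng.choices(positives, k=len(negatives))
    return negatives + resampled

# Invented data: 1,000 ordinary records and only 3 events of interest.
rows = [{"id": i, "label": 0} for i in range(1000)]
rows += [{"id": 1000 + i, "label": 1} for i in range(3)]

balanced = oversample(rows)
distinct_positives = {r["id"] for r in balanced if r["label"] == 1}
print(len(balanced), len(distinct_positives))
```

The classes come out balanced, but the positive half is just the same few cases copied over and over. Oversampling rebalances the data; it does not create new information about what a terrorist looks like.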

No Explanation

Another problem that I have with the supposed data mining effort is the lack of explanation around it. No data miner would send a list of names out of his cubicle without a writeup of exactly how the list was computed. If the model isn't understandable it's worthless. What base model was used -- a modern model like a neural net or C5.0, or a logistic regression, an older approach that is still the best at predicting very rare yes/no events like this one? What was the "goodness of fit"? Are the requisite lift charts included? Of course not.
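For readers who haven't seen one, a lift chart is built from a table like the one this sketch computes. The scores and labels are invented, and `lift_table` is a hypothetical helper written for this post, not the output of any particular data mining package.

```python
def lift_table(scores, labels, bins=10):
    """Rank records by model score (highest first), cut them into equal-sized
    bins, and report each bin's hit rate divided by the overall hit rate."""
    n = len(scores)
    if n < bins:
        raise ValueError("need at least one record per bin")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    base_rate = sum(labels) / n
    if base_rate == 0:
        return [0.0] * bins
    table = []
    for b in range(bins):
        chunk = ranked[b * n // bins:(b + 1) * n // bins]
        hit_rate = sum(label for _, label in chunk) / len(chunk)
        table.append(hit_rate / base_rate)
    return table

# Invented example: 20 scored records, 4 of them true positives.
scores = list(range(20, 0, -1))
labels = [1, 1, 1, 0, 1] + [0] * 15

lift = lift_table(scores, labels, bins=4)
print(lift)
```

A lift of 1.0 in every bin means the model ranks no better than picking names at random. A writeup that can't show lift well above 1.0 in the top bin has nothing to back up the list it produced.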

Maybe some of this information exists somewhere, and the [dons tinfoil hat] Pentagon is covering it up. Maybe they did calculate the logit and somebody knows where it resides. But without some amount of statistical rigor behind a data mining effort, producing a name is at best a lucky shot in the dark.

A More Likely Explanation

The subtitle to nearly every book coming from Washington should be "If they would have only listened to me". I don't doubt that a group calling themselves Able Danger existed, and they may have played a little with data mining techniques. However, nothing presented leads me to think that they would have had any success in finding terrorists, and the fact that people are talking without any supporting documentation tells me that it may not exist. A more likely scenario is that some staffers produced some names (any data mining software will produce results). How many names were on the list? Sixty is a number that I've heard, but that hasn't been confirmed. Would a list of 60 names have been meaningful? What about 60,000? With the number of people entering and exiting this country, the difference between those lists is a rounding error. The "propensity to terrorism" of the 60,000th name would have been virtually identical to that of the 60th. That's just the way picking very rare events works.
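The rare-event arithmetic is easy to check for yourself. In this sketch every number is an assumption picked for illustration (a population of 30 million travelers, 20 actual bad actors, and a model that catches 99% of them while wrongly flagging 1% of everyone else); none of it comes from Able Danger, and `precision` is just Bayes' rule written out.

```python
def precision(prevalence, sensitivity, false_positive_rate):
    """Probability that a flagged person is a true positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# All figures below are hypothetical, chosen only to illustrate the base-rate effect.
population = 30_000_000
actual_positives = 20
sensitivity = 0.99          # share of real positives the model flags
false_positive_rate = 0.01  # share of innocent people the model flags

prevalence = actual_positives / population
p_flag_is_real = precision(prevalence, sensitivity, false_positive_rate)
expected_flagged = population * (
    prevalence * sensitivity + (1 - prevalence) * false_positive_rate
)

print(f"chance a flagged name is real: {p_flag_is_real:.6f}")
print(f"expected size of the flagged list: {expected_flagged:,.0f}")
```

Even with a model far better than anything you could build from the data described above, the flagged list runs to hundreds of thousands of names and the chance that any one of them is the real thing is well under one in a thousand.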

People like to blow up their role in events. More likely some staffers told a gullible congressman (who was predisposed to believe them) some war stories about their data mining days. "Yeah, we picked out Atta, but the Brass wouldn't let us pass it to the FBI". If they would have only listened to us.

One more note: if the names of 4 or so hijackers were identified beforehand (and they weren't part of a 500,000-name list), why didn't it come out before now? I don't buy that 11 guys, especially data miners, could keep that type of stuff quiet for this long. Statisticians of all stripes are always trying to show how smart they are, and a coup like this would have gotten out. But the 9/11 Commission had no interest in them, nor did anyone else until the too-gullible Weldon came along.

|| RedState
