Data Mining In Depth: TIAin't
Column published in DM Review Magazine
April 2003 Issue
By Herb Edelstein
After September 11, 2001, there was a lot of criticism of the intelligence community for failing to "connect the dots." Many articles appeared pointing out the shortcomings of the FBI databases and the need for data mining.
In addition to addressing problems at the FBI, there are many programs the government is trying to put in place that involve data mining. The Transportation Security Administration wants to improve its ability to detect terrorists. Recently, there has been much written about a controversial government initiative: Total Information Awareness, or TIA. This is a Department of Defense research program in the Defense Advanced Research Projects Agency designed to detect and stop foreign terrorist activities. One of the goals of TIA was reputed to be unifying all transaction databases and using data mining to find suspected terrorists.
However, data mining is not the magic bullet that the FBI, TIA and others are looking for. Aside from social and managerial issues, there are four main technical problems.
1. Data integration and data quality. Anyone who has ever built a data warehouse will tell you that the biggest implementation problems stem from integrating multiple data sources and the generally low quality of transaction data (especially when it is used for purposes other than originally intended). Each database has its own problems of missing data and incorrect values. Bringing together data from multiple sources adds other problems such as different meanings for the same terms (semantic heterogeneity), different terms for the same entity, differing units and measures, and different values for identifiers (for example, my name appears as Herb, Herbert and H.). As the number of data sources increases, the problems multiply. One of the main reasons large warehousing projects fail, even where the problem is well-defined, is that they never successfully overcome the integration/quality problems. Now put this into the context of integrating data from wildly divergent sources, such as telephone call records, credit card transactions, utility bills, credit applications, etc., and you begin to see the magnitude of this problem.
2. Too much data, too few examples. If I plot 1 million points on a standard size sheet of paper, what does it look like? Solid black. If I tell you there are 19 key points in there and all you have to do is find the right dots and connect them, you will correctly think I'm nuts. Because terrorism is not a very frequent occurrence (happily), there is a very low representation of terrorists in a very large database. For the sake of argument, let's assume there are 1,000 active terrorists in the U.S. (a number that likely overstates the case by an order of magnitude) out of a population (age 16 and up) of approximately 220 million. An algorithm could be 99.9995 percent accurate by saying no one is a terrorist. Even were we to look only at non-citizens (an arguable tactic, and taking that group to be roughly 20 million people), we would still have an accuracy rate of about 99.995 percent by declaring no one a terrorist.
3. Lack of sufficient examples to create good signatures (identifying patterns). As noted in number two, there are relatively few examples of terrorism for training data mining algorithms. The more examples you have of what you are looking for, the easier it is to find a pattern. Making the problem even worse is that terror is an adaptive behavior. Much as we may hate to admit it, the people planning terror attacks are not stupid. Consequently, they will change their behaviors, find different types of targets, or otherwise disguise themselves to look like non-terrorists. Some people will argue that if we can characterize normal behavior, all we need to look for is abnormal behavior. This is much more difficult than it seems. It is possible for every individual attribute to be typical while the constellation of behaviors is atypical. For example, it's not unusual to be male or to be pregnant, but pregnant males are unusual. When you are looking at hundreds if not thousands of attributes, determining what constitutes normality based solely on the data is an enormous if not insurmountable problem.
4. False positives. Given the difficulty of developing good signatures and the small number of terrorists relative to the population of the United States, there are likely to be an enormous number of innocent people identified as potential terrorists (false positives). The more you try to avoid false positives, the more likely you are to miss many true positives. Unlike a direct mail campaign where the cost of a false positive is only a few dollars at worst, the costs in identifying terrorists - in dollars, time and wasted opportunity - are staggering. Suppose we had a collection of algorithms with a false positive rate of only 0.1 percent - extraordinarily good for a problem of this complexity. That would mean 220,000 false positives! There are not enough investigators to investigate every false positive. Even if there were, the dollar cost would be in the billions, as would the cost of the resulting lawsuits. More importantly, the resources and amount of calendar time expended in these mostly useless investigations would likely leave many true terrorists free. Even if we concentrated only on non-citizens, we would still have more than 20,000 false positives to be vetted.
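For readers who want to check the arithmetic in points two and four, here is a rough sketch in Python. The population, terrorist count and false positive rate are the hypothetical figures used above; the function names and the best-case assumption that every real terrorist gets flagged are mine, added only for illustration.

```python
# Back-of-the-envelope arithmetic only. The inputs are the column's
# hypothetical figures, not measured data.
POPULATION = 220_000_000     # U.S. population age 16 and up (column's figure)
TERRORISTS = 1_000           # assumed active terrorists (column's figure)
FALSE_POSITIVE_RATE = 0.001  # 0.1 percent (column's figure)
DETECTION_RATE = 1.0         # assumption: best case, every terrorist is flagged


def accuracy_of_flagging_no_one(population, positives):
    """Accuracy of a 'model' that labels everyone a non-terrorist."""
    return (population - positives) / population


def screening_outcome(population, positives, false_positive_rate, detection_rate):
    """Return (false positives, true positives, precision) for one screening pass."""
    false_positives = (population - positives) * false_positive_rate
    true_positives = positives * detection_rate
    # Precision: the share of flagged people who are actually terrorists.
    precision = true_positives / (true_positives + false_positives)
    return false_positives, true_positives, precision


if __name__ == "__main__":
    accuracy = accuracy_of_flagging_no_one(POPULATION, TERRORISTS)
    fp, tp, precision = screening_outcome(
        POPULATION, TERRORISTS, FALSE_POSITIVE_RATE, DETECTION_RATE
    )
    print(f"Accuracy when no one is flagged: {accuracy:.4%}")
    print(f"False positives: {fp:,.0f}")
    print(f"True positives:  {tp:,.0f}")
    print(f"Share of flagged people who are terrorists: {precision:.2%}")
```

Under these assumptions, the do-nothing rule scores about 99.9995 percent, and a screen that catches every terrorist still flags roughly 220 innocent people for each real one. Lower the detection rate to something more realistic and the ratio only gets worse.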
These problems are the same sort of obstacle to the application of data mining as the laws of thermodynamics are to building perpetual-motion machines. Does this mean we shouldn't use data mining in the search for terrorists or expect it to help? Not at all. As a problem, finding terrorists is most similar to fraud detection, albeit more difficult. We need to use our expertise in the terrorism domain to design databases that facilitate answering particular questions rather than amassing all the data possible and searching it for patterns. Databases and data mining are supplements to human investigations, not replacements.