Data Mining in Depth: Data Mining and Privacy

Column published in DM Review Magazine
December 2003 Issue

By Herb Edelstein and Janet Millenson

There's a great "Cathy" cartoon in which Cathy's boyfriend Irving examines a list of Web sites she's recently visited. "The next time you log on," he remarks, "you should see an ad for singles weight-loss spas in Italy that allow dogs." Cathy runs off in distress as Irving reflects, "Everybody wants to be understood. No one wants to be known."

This cartoon captures a common worry: What will my data and transactions reveal about me to strangers? People are concerned about the privacy of their medical, financial, personal and professional information. They're uneasy about others knowing what books they read and what movies they see, what clubs and political parties they belong to, when they are traveling and where. They fear that dissemination of private data could lead to identity theft, increased telemarketing calls and spam, larger debt (from responding to those personally targeted marketing pitches) and unwanted attention from the government.

We generate an enormous amount of data as a by-product of our everyday transactions (purchasing goods, enrolling in courses, etc.), visits to Web sites and interactions with government (taxes, census, car registration, voter registration, etc.). Not only is the number of records we generate increasing, but so is the amount of data gathered for each type of record. Latanya Sweeney, assistant professor of computer science and public policy and director of the laboratory for international data privacy at Carnegie Mellon University, has developed a rough measure of the growth in personal data, which she calls the disk storage per person (DSP). The DSP is simply the amount of hard disk storage sold each year divided by the world population. This number has grown from 20KB in 1983 to 28MB in 1996 and then to 472MB in 2000.
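The DSP arithmetic is simple enough to sketch in a few lines of Python. The function and variable names below are our own, and we assume decimal units (1KB = 1,000 bytes) when converting the three figures quoted above; the sketch only restates the definition and compares those figures.

```python
def disk_storage_per_person(bytes_sold: float, world_population: float) -> float:
    """DSP as defined above: disk storage sold in a year / world population."""
    if world_population <= 0:
        raise ValueError("world_population must be positive")
    return bytes_sold / world_population


# DSP figures quoted in the column, converted to bytes.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes).
KB, MB = 1_000, 1_000_000
dsp_by_year = {1983: 20 * KB, 1996: 28 * MB, 2000: 472 * MB}


def growth_factor(start_year: int, end_year: int) -> float:
    """How many times larger the DSP is in end_year than in start_year."""
    return dsp_by_year[end_year] / dsp_by_year[start_year]


print(growth_factor(1983, 1996))            # 1983 -> 1996
print(round(growth_factor(1996, 2000), 1))  # 1996 -> 2000
```

Under that unit assumption, the quoted figures imply a 1,400-fold increase between 1983 and 1996 and a further increase of roughly 17-fold by 2000.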

As data miners, our tasks are colliding with these concerns. In analytic customer relationship management (CRM), we often analyze customer data with the specific intent of understanding individual behavior and instituting sales campaigns based on this understanding. Researchers in economics, demographics, medicine and social sciences are trying to understand the relationships between behaviors and outcomes.

How can we reconcile the legitimate needs of business and research with the equally legitimate desire of people to maintain their privacy? A total prohibition on collecting or retaining data is not really in anyone's interest.

We could solicit people's cooperation. Every organization gathering data can ask people to sign a form granting permission to use the data (known as opt-in) or acquire their permission implicitly when they do not revoke it (opt-out).

We could also respond with regulations about what data may be collected and how it can be used. In some countries, there are already strict laws that prohibit the use of personal data without the individual's explicit opt-in.

In the U.S., health-related companies and researchers are constrained by a complex 1996 law called HIPAA (Health Insurance Portability and Accountability Act), which provides a national standard for the protection of information relating to an individual's health. HIPAA provides for some limited use of the data collected for marketing purposes. For many purposes, however, the data must be stripped of all fields that would enable an individual to be identified, such as name, address, date of birth and Social Security number.
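To make the idea of stripping identifying fields concrete, here is a minimal Python sketch. The field names and the sample record are illustrative assumptions of ours, not a list taken from the text of HIPAA, and real de-identification involves considerably more than deleting a few columns.

```python
# Hypothetical field names for the identifiers mentioned above.
DIRECT_IDENTIFIERS = frozenset({"name", "address", "date_of_birth", "ssn"})


def strip_identifiers(record: dict, identifiers=DIRECT_IDENTIFIERS) -> dict:
    """Return a copy of the record without the directly identifying fields."""
    return {field: value for field, value in record.items()
            if field not in identifiers}


# A made-up record, for illustration only.
patient = {
    "name": "Jane Doe",
    "address": "1 Example Street",
    "date_of_birth": "1970-01-01",
    "ssn": "000-00-0000",
    "postal_code": "00001",
    "diagnosis": "asthma",
}

print(strip_identifiers(patient))
```

Note that fields such as the postal code survive this kind of filtering unless they are explicitly listed, which matters for the re-identification problem.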

However, the growth and networking of computerized databases have made it possible to identify the "de-identified" people with surprising accuracy. Thus, your anonymity isn't guaranteed even if a database doesn't contain information that easily identifies you. Sweeney conducted an experiment in which, merely by knowing an individual's postal code and birth date, she could identify an individual's personal information in a supposedly anonymous public database with 69-percent accuracy. Knowing gender raised the accuracy to 87 percent!
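One way to see why a few innocuous fields can be so revealing is to count how many records in a data set are the only ones with their particular combination of values. The Python sketch below does this for a tiny, made-up table; it is not Sweeney's method or data, and the field names are our own assumptions.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` values is unique."""
    if not records:
        return 0.0
    combos = [tuple(record[k] for k in keys) for record in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(records)


# Made-up "anonymous" records: no names, but some quasi-identifiers remain.
records = [
    {"postal_code": "00001", "birth_date": "1970-01-01", "gender": "F"},
    {"postal_code": "00001", "birth_date": "1970-01-01", "gender": "M"},
    {"postal_code": "00001", "birth_date": "1980-05-05", "gender": "F"},
    {"postal_code": "00002", "birth_date": "1970-01-01", "gender": "F"},
]

print(unique_fraction(records, ["postal_code", "birth_date"]))
print(unique_fraction(records, ["postal_code", "birth_date", "gender"]))
```

In this toy table, half of the records are unique on postal code and birth date alone, and all of them become unique once gender is added. Anyone holding a second list that pairs those same fields with names could then link the two.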
