Can Data Mining and Privacy Coexist?

November 1, 2008

DATA MINING may have the laudable goal of helping the government catch potential terrorists, money launderers, and other criminals, but the practice is controversial: critics see it as violating the privacy rights of innocent people, and they also question its effectiveness.

To address concerns and to gather information in anticipation of compiling a report to Congress, the Department of Homeland Security (DHS) recently held a workshop on privacy protection in data mining. Attendees included policy, academic, and technology experts. Among the issues discussed were how to define the term “data mining,” how to assess effectiveness, and how to better protect privacy.

Panelist David Jensen, director of the Knowledge Discovery Laboratory at the University of Massachusetts Amherst, pointed out that the term is easily misinterpreted. Although the simplest definition states that data mining takes data, filters it, and applies models to it, such descriptions are incomplete.

It’s important to remember that data mining entails drawing inferences from interconnected sets of data in a multistage process, says Jensen. Conclusions drawn from data mining are not determinative of fact (such as definitively pinpointing a terrorist); instead, they provide a probability of future events, he says. Jensen also explained that there are two major types of data mining: subject-based and pattern-based. Subject-based data mining targets a specific search topic, such as a suspect’s name, while pattern-based data mining goes after a broader category of data, such as searching for everyone who bought plane tickets for flights to a certain country.
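The distinction Jensen draws can be sketched in a few lines of Python. This is only an illustration, not a description of any actual government system: the `TicketRecord` type, the sample records, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketRecord:
    """One hypothetical airline ticket purchase."""
    passenger: str
    destination: str


# Made-up sample data, for illustration only.
RECORDS = [
    TicketRecord("Alice Example", "Country A"),
    TicketRecord("Bob Example", "Country B"),
    TicketRecord("Carol Example", "Country A"),
]


def subject_based(records, name):
    """Subject-based search: start from a known individual."""
    return [r for r in records if r.passenger == name]


def pattern_based(records, destination):
    """Pattern-based search: return everyone who matches a behavior."""
    return [r for r in records if r.destination == destination]
```

The subject-based query returns only records tied to a person who is already of interest, while the pattern-based query returns every traveler who fits the pattern, whether or not there is any prior reason to look at them.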

Privacy experts are more concerned with the pattern-based searches than with subject-based searches. For example, Peter Swire, a law professor at Ohio State University’s Moritz College of Law and senior fellow at the Center for American Progress, said that there is an “extra level of concern about the ‘dragnet’ quality” of pattern-based searches and that the lack of certain safeguards is dangerous.

Others questioned data mining’s effectiveness. Fred Cate, director of the Center for Applied Cybersecurity Research at Indiana University, said that data mining as it is currently carried out wastes resources and exposes too many innocent people to the risk of being wrongly detained or incarcerated without due process. Those flaws, he said, threaten to discredit the practice entirely.

Cate recommends that Congress pass data-mining legislation that would establish restrictions on what types of data-mining programs can be implemented by government agencies. The legislation should also establish oversight mechanisms to ensure that data-mining programs comply with the new legal restrictions.

Barry Steinhardt, director of the American Civil Liberties Union’s Program on Technology and Liberty, agreed that the laws governing these issues should be updated and that better oversight is needed, but he took an even harder line, saying there are an “awful lot of reasons to suspect here that there is not really any law enforcement benefit to this…. The chance that data mining will be able to identify a terrorist is very small with an enormous… [potential for] false positives.”
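The false-positive concern is at bottom a base-rate argument, and a small calculation shows its shape. The sketch below is not drawn from the workshop; every number in it (population size, number of actual targets, detection rate, false-alarm rate) is a hypothetical assumption chosen only to make the arithmetic visible.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Share of flagged people who are actual targets (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# All values below are hypothetical, for illustration only.
population = 1_000_000       # people whose records are scanned
actual_targets = 10          # assumed number of real targets among them
sensitivity = 0.99           # assumed chance a real target is flagged
false_positive_rate = 0.01   # assumed chance an innocent person is flagged

prevalence = actual_targets / population
ppv = positive_predictive_value(prevalence, sensitivity, false_positive_rate)
expected_false_alarms = (population - actual_targets) * false_positive_rate

print(f"Share of flagged people who are real targets: {ppv:.4%}")
print(f"Expected innocent people flagged: {expected_false_alarms:,.0f}")
```

Under these made-up numbers, even a model that is right 99 percent of the time would flag about 10,000 innocent people, and only about one flagged person in a thousand would be an actual target. Different assumptions would change the figures but not the underlying pattern when real targets are very rare.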

Experts acknowledge that data mining’s effectiveness is hard to prove at this stage because the technologies are relatively young. As Jensen explained, “if you compare data mining to airplane design, we’re just out of the Wright Brothers stage.” He added that there are “many types of data that we don’t know how to examine effectively.”

It’s hard to have a computer or algorithm pull things out of data that actually have meaning, the same way a human can, because computers are limited to numeric and symbolic models, Jensen noted.

The panelists stressed the importance of research to determine which data-mining models are effective. However, it’s difficult to figure out how to conduct research with real data without causing privacy concerns from the get-go.

Although some researchers have proposed using synthetic data that mimics the desired data sets, it is unclear whether fake data would really help. Jensen said that you don’t necessarily need an enormous amount of data to find out whether a certain model is working. Swire disagreed; he said research would require more data rather than less, to see what works. But, he added, “If we create a research exception, everyone will call their work research.”

A major concern is what to do when an innocent person gets caught in the data-mining net. An appropriate redress process is essential for privacy protection, said panelists Cate and Steinhardt. Both see redress as an opportunity to bring more transparency to the program and build the public’s confidence. “Until there is some degree of genuine due process here, I don’t think there’s going to be a whole lot of public confidence,” said Steinhardt.

Redress and transparency are difficult to obtain in government data-mining programs, however, in part because the government is concerned that classified national security secrets will be divulged. But there are ways to have privacy and oversight without compromising security, such as Inspector General reports, auditors, and privacy offices, according to Cate. He added that the public tends to be understanding about national security issues.