Screen Scraping Security

By John Wagley

01 August 2012

Print Issue: August 2012

INCREASINGLY, organizations are using automated tools to scan and collect information online. They’re looking at social networks, blogs, and other sites for purposes such as reputation management, public relations, market research, and background checks.

Tools that automatically scan for data, known as screen scrapers, are also becoming more advanced, but companies that use them must avoid legal pitfalls, which could include personal privacy violations as well as copyright infringement. Social networking and other sites that collect user-generated data should also take steps to protect data on their sites, including establishing appropriate privacy policies and implementing appropriate technical security measures.

The laws surrounding screen scraping and possible privacy and intellectual property violations are somewhat murky, said Brian Bowman, a partner at the law firm Pitblado. Bowman spoke at the Global Privacy Summit in Washington, D.C., sponsored by the International Association of Privacy Professionals.

In the United States, one interpretation of the law is that protected information doesn’t include information in a forum where a user voluntarily shared it, where it’s publicly available, and where users have not been led to believe that there are any technical controls limiting public access, he said. But it is fairly clear that it isn’t acceptable to collect information provided by children or from sites that are aimed at children. In other countries, such as Canada, the laws may be stricter regarding “expectations relating to publicly available information,” Bowman said.

A few legal cases involving screen scraping offer guidance. One, in Canada, involved Century 21 and Rogers Communications. The latter was accused of indexing, storing, and displaying photos and descriptions of properties that were for sale from Century 21’s Web site. Rogers had used robots to crawl the site, an action that was prohibited by the site’s terms of use. Rogers was found liable for copyright infringement; $33,000 was awarded to the plaintiff.

The issue is growing in importance as tools to scrape screens for data are becoming more common and powerful, said Joanne Furtsch, policy and product architect at TRUSTe. Whereas much market data and research used to be collected by telephone, such data collection has been surpassed by online-based research, according to Furtsch.

Many organizations’ marketing, public relations, or research departments may be either considering getting involved in or already engaged in this type of data collection, she said. Privacy officers and other executives at those companies should make sure that those running the program know what kind of data can be legally collected. It’s also important to monitor what may be collected by any third-party research firms. Businesses must avoid infringing on other companies’ privacy policies and terms of use, she said. If any policies are unclear, the organization needs to get clarification before it proceeds with the data collection at that site.

Executives should also assess whether data being collected may be sensitive or personally identifiable information under state and national laws, said Bowman. Companies should consider applying filters that can remove names from data.
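As a rough illustration of the kind of name filter Bowman describes, the Python sketch below replaces a supplied list of names with a placeholder before the collected text is stored. The function name, the placeholder, and the sample text are all hypothetical, and the approach assumes the names to be removed are already known; recognizing arbitrary names in free text is a harder problem that this sketch does not attempt.

```python
import re


def redact_names(text, names, placeholder="[NAME]"):
    """Replace each known name in text with a placeholder (illustrative only)."""
    # Match longer names first so "Jane Doe" is replaced before "Jane".
    ordered = sorted((n for n in names if n), key=len, reverse=True)
    if not ordered:
        return text
    # Word boundaries keep the filter from clipping names inside longer words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


# Hypothetical scraped comment and a hypothetical list of names to strip.
sample = "Jane Doe said the service was slow. jane later posted an update."
redacted = redact_names(sample, ["Jane Doe", "Jane"])
print(redacted)
```

A filter like this only reduces exposure. Whether the remaining text still counts as personally identifiable information is a legal question rather than a technical one.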

Social networking sites and blogs should be sure to let consumers know, in their privacy policies and other areas, how the information they share on the site could be collected, said Furtsch. Such sites should also let potential screen scrapers know what information they’ll allow to be scraped. Some sites, such as Facebook, forbid any kind of automated data collecting, even if it’s by a user collecting data from his or her own account.

One technical measure that can protect against many scrapers is the robots.txt file, a text file that gives instructions to Web robots, said Furtsch. It has a serious limitation, however. Compliance is voluntary: a screen scraper must choose to find the file and follow its instructions for it to be effective, and malicious bots likely won’t seek the file. Another measure sites should take is to provide their users with mechanisms for deleting their sensitive data whenever they choose.
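To show how a well-behaved scraper might honor such a file, the Python sketch below uses the standard library's urllib.robotparser module to check whether a URL may be fetched. The file contents, the bot name "ExampleResearchBot," and the example.com addresses are assumptions made up for illustration; they do not describe any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an example site (not from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /listings/photos/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
public_ok = may_fetch(
    parser, "ExampleResearchBot", "https://www.example.com/blog/post-1"
)
members_ok = may_fetch(
    parser, "ExampleResearchBot", "https://www.example.com/members/profile/42"
)
print(public_ok, members_ok)
```

The check runs entirely on the scraper's side, which is the limitation Furtsch describes: nothing in the file stops a bot that never performs it.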

Another widely used tool to protect against scrapers is the captcha. Captchas show squiggly letters and numbers that are designed to be difficult for a computer or bot to decipher. Sites have people type the captcha during registration to prove that they’re human. Captchas should also be regularly updated, Furtsch said, as some scraping tools have been known to outsmart certain types of captchas.