Getting a Handle on Incidents

4/1/2005 April 2005

Anyone who has spent time defending a computer network from intruders and attackers knows that no matter how good the defenses, successful attacks on IT infrastructure are inevitable. Consequently, a company must have an incident-response program that will minimize the impact of any breach. Let’s look first at some of the key components of an incident-handling program, and then at what a company should consider in setting up its own program.

Key Components

Every company will need to establish an incident-handling team to implement the program, and every company will want to follow roughly the same incident-handling steps: detection, alert analysis, communication, tactical action, and after-action analysis. Some elements of the program will vary with the organization’s type, size, level of automation, and sophistication.

The team. The incident-response team needs to be ready to respond at a moment’s notice. This team is not just a group of engineers and technicians. The following types of people should be considered for inclusion:

- System administrators and network engineers to handle emergency changes

- Security operations center staff to review suspicious activity and help identify incidents

- Legal representative to handle any legal issues related to the incident, such as a breach of contract caused by service failure

- Human resources representative to handle personnel issues (for example, the termination of an employee who has caused an incident)

- Public relations representative to determine whether to report to the press and what should be reported

- Incident-response manager to coordinate the effort

If in-house staff do not have all of the requisite skills, the company may want to include a consultant who has experience performing enterprise-architecture design or business-impact analyses, as well as information-security expertise, a strong understanding of operations-environment workflow, and good communication skills. The consultant should be capable of conducting a business-impact analysis, the results of which can be used to develop appropriately detailed incident-handling policies and procedures for the IT environment.

Training. Training for the team is a continuous process. Staff recently assigned incident-response responsibilities must be trained on the incident-response policies and procedures and be given follow-up training to account for changes to policies and procedures. Generally, incidents do not occur every day; therefore, even experienced staff should have refresher training.

Detection. The team must have a means of monitoring systems for unusual activity or new vulnerabilities. Detection of unusual activity usually relies heavily on monitoring technologies used in the production environment, such as intrusion detection, scripts that monitor login attempts, and controls embedded in applications.

Incident-handling teams also use network scanning tools and research alerts from groups such as the CERT/CC or SecurityFocus to keep current on attacks and vulnerabilities.
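As an illustration of the login-monitoring scripts mentioned above, here is a minimal Python sketch. The log format, the `LOGIN_FAILED` marker, the function name, and the threshold of three failures are all assumptions made for this example; a real script would parse whatever the organization’s systems actually emit.

```python
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
#   "<timestamp> LOGIN_FAILED user=<name> src=<address>"
def flag_suspicious_sources(log_lines, threshold=3):
    """Return source addresses with at least `threshold` failed logins."""
    failures = Counter()
    for line in log_lines:
        if "LOGIN_FAILED" not in line:
            continue
        # Collect the key=value fields that follow the event marker.
        fields = dict(
            part.split("=", 1) for part in line.split() if "=" in part
        )
        src = fields.get("src")
        if src:
            failures[src] += 1
    return sorted(src for src, count in failures.items() if count >= threshold)

sample_log = [
    "2005-04-01T09:00:01 LOGIN_FAILED user=root src=10.0.0.5",
    "2005-04-01T09:00:03 LOGIN_FAILED user=admin src=10.0.0.5",
    "2005-04-01T09:00:04 LOGIN_OK user=alice src=10.0.0.9",
    "2005-04-01T09:00:06 LOGIN_FAILED user=root src=10.0.0.5",
    "2005-04-01T09:00:09 LOGIN_FAILED user=bob src=10.0.0.7",
]

print(flag_suspicious_sources(sample_log))  # ['10.0.0.5']
```

A flagged address is only an alert, not a confirmed incident; it still has to go through the analysis step described next.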

Alert analysis. Often the unusual activity or new vulnerabilities that are detected turn out to be false alarms. The cause of each alert must, therefore, be analyzed. During the analysis process, the potential incident and corresponding evidence should be reviewed to determine whether an actual incident has occurred, and if so, the scope and type of damage that the incident has caused or could cause if not dealt with quickly.

As part of the incident-handling program, a company will need to keep a history of past events in a database or spreadsheet. This information will be used to collect metrics on the effectiveness of the program and will also help the team uncover long-term trends.

While the actual data to be collected will vary depending on the organization and the tools available, the following list describes some examples of data to capture.

- Incident date and time

- List of business processes affected and how they were affected

- Names and/or numbers of the computer systems affected

- Type of incident (data compromise, virus incident, denial-of-service attack)

- Steps taken to prevent the recurrence of similar incidents

- Who was notified of the incident

- Time spent handling the incident

The team should avoid using free text in the history database or spreadsheet. Structured, searchable text will greatly increase the value of the history database.
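To show what a structured history record might look like, the following Python sketch models the fields listed above and computes one simple metric from them. The class and field names, the sample entries, and the choice of metric (hours spent per incident type) are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List

class IncidentType(Enum):
    # Fixed categories instead of free text keep the history searchable.
    DATA_COMPROMISE = "data compromise"
    VIRUS = "virus incident"
    DENIAL_OF_SERVICE = "denial-of-service attack"

@dataclass
class IncidentRecord:
    occurred_at: datetime
    incident_type: IncidentType
    systems_affected: List[str]
    processes_affected: List[str]
    prevention_steps: List[str]
    notified: List[str]
    time_spent: timedelta

def hours_by_type(records):
    """Total handling time, in hours, for each incident type."""
    totals = {}
    for record in records:
        hours = record.time_spent.total_seconds() / 3600
        totals[record.incident_type] = totals.get(record.incident_type, 0.0) + hours
    return totals

# Hypothetical sample history.
history = [
    IncidentRecord(datetime(2005, 1, 10, 8, 0), IncidentType.VIRUS,
                   ["mail-01"], ["e-mail"], ["updated signatures"],
                   ["IT manager"], timedelta(hours=4)),
    IncidentRecord(datetime(2005, 2, 2, 14, 30), IncidentType.VIRUS,
                   ["desk-17"], ["order entry"], ["blocked attachment type"],
                   ["business-unit owner"], timedelta(hours=2)),
    IncidentRecord(datetime(2005, 3, 5, 22, 0), IncidentType.DENIAL_OF_SERVICE,
                   ["web-01"], ["online sales"], ["added rate limiting"],
                   ["ISP", "public relations"], timedelta(hours=9)),
]

for incident_type, hours in hours_by_type(history).items():
    print(incident_type.value, hours)
```

Because the incident type is an enumerated value rather than a typed-in description, the same query keeps working as the history grows.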

Communication. An incident-handling program does not operate in isolation. There are a number of internal people and external organizations with which an incident-response team may need to interact.

For example, business-unit owners need to be kept regularly apprised of the incident and the progress made toward controlling it. Clients, customers, and partners may be affected, so the incident-response team may need to disclose an internal incident to those who were connected to the network and may therefore also be at risk. Law enforcement may need to be notified if the business chooses to prosecute a perpetrator. The press may also need to be notified, particularly if the organization is a public company that provides a service to consumers.

Tactical actions. As soon as an incident is reported, containment must begin. There are typically four steps.

First is disconnecting devices from a network or shutting down an entire network. For example, the Internet connection, an infected server, or a network that supports a business unit or office may be turned off and remain that way for hours at a time. This is most easily done if there is a network “chokepoint”—a place or places where a quick action will shut off connectivity, just as with shut-off valves used in a house’s plumbing.

Second is collecting and storing incident data in a secure offline location. Incident data may include login logs, user activity logs, a full backup or image of a hard drive that has been tampered with, and notes from a system administrator about the incident. The amount of data stored and the precautions taken while storing it will depend on whether the organization intends to prosecute in a court of law.

Third is containing and eliminating the incident. Within this process, actions are taken to limit the damage that an incident has caused or can continue to cause, determine the root cause, and eliminate the possibility that the same attack will happen again. These are short-term actions that may or may not be sufficient in the long term.

Fourth is recovery. Within this process, any data or applications that were tampered with are restored to their original state. This step often involves restoring data from an offline tape backup that was recorded prior to the incident.
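For the evidence-collection step, one precaution an organization might take is to record a cryptographic hash of each collected file so that later changes can be detected. The article does not prescribe this technique; the Python sketch below is an assumption-laden illustration, and the function names and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(evidence_dir):
    """Map each file's path (relative to evidence_dir) to its SHA-256 digest."""
    root = Path(evidence_dir)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_files(evidence_dir, manifest):
    """List files whose current digest differs from the recorded one."""
    current = build_manifest(evidence_dir)
    return sorted(name for name in manifest if current.get(name) != manifest[name])

# Demonstration with a throwaway directory standing in for evidence storage.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "login.log"
    log.write_text("LOGIN_FAILED user=root\n")
    manifest = build_manifest(tmp)
    untouched = changed_files(tmp, manifest)
    log.write_text("tampered\n")
    after_edit = changed_files(tmp, manifest)

print(untouched, after_edit)  # [] ['login.log']
```

The manifest itself would need to be stored separately from the evidence, and what a court actually requires is a question for the team’s legal representative.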

After-action analysis. Once the immediate impact from the incident has been mitigated, long-term actions can be taken. Within this process an incident report is prepared, incident-handling procedures are modified, and new projects are identified that will limit the risk that the same or similar incident will happen again.

Traditionally, one of the most neglected steps in incident handling is the postincident analysis. Commonly referred to as an incident postmortem or root-cause analysis, this phase can be lost in the day-to-day activity as more incidents come in and engineers move on to the next problem.

The advantages of performing after-action analysis are too great to ignore. This analysis helps to ensure the effectiveness of existing control systems and shows where new controls are needed, based on real-world scenarios. Implementing these postmortem activities improves the efficiency of the incident-handling teams and strengthens the security posture of the enterprise.

Not all incidents require the same level of attention. To simplify the incident triage and mitigation process, it is extremely helpful to develop a matrix that ranks common threats (such as denial-of-service attacks, malicious code, and unauthorized access) and determines how severe each will be. The company can then identify unique procedures for each cell in the matrix.
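A matrix of this kind can be sketched as a simple lookup table. In the Python example below, the threat categories come from the paragraph above, but the two severity levels and every procedure name are hypothetical placeholders; a real matrix would be filled in from the organization’s own policies.

```python
# Hypothetical severity levels and procedure names, for illustration only.
RESPONSE_MATRIX = {
    ("denial_of_service", "low"): "monitor traffic and notify the network engineer",
    ("denial_of_service", "high"): "activate the team and filter at the chokepoint",
    ("malicious_code", "low"): "quarantine the host and update signatures",
    ("malicious_code", "high"): "isolate the network segment and activate the team",
    ("unauthorized_access", "low"): "reset credentials and review logs",
    ("unauthorized_access", "high"): "disconnect the system and preserve evidence",
}

def procedure_for(threat, severity):
    """Look up the procedure for one cell of the matrix."""
    try:
        return RESPONSE_MATRIX[(threat, severity)]
    except KeyError:
        # An unranked combination should be escalated, not silently ignored.
        raise ValueError(
            f"no procedure defined for {threat!r} at severity {severity!r}"
        ) from None

print(procedure_for("malicious_code", "high"))
```

Raising an error for a missing cell is a deliberate choice in this sketch: a gap in the matrix is something the team should discover during planning rather than during an incident.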

Designing a Program

As with any company program, the first step in establishing an effective incident-handling program is to get buy-in from management and staff. The next step is to establish a vision document. The third step is to ensure alignment with the company’s mission.

Next is to set the policies and procedures based on the objectives, a business-impact analysis, and customer prioritization. Finally, the procedures must be integrated with external processes.

Buy-in. Incident-handling programs are important elements of an overall security program, but not everyone will share your enthusiasm or even understand what you are trying to do. We have seen countless security programs that tried to define an incident-handling program in a vacuum and then attempted to motivate people to implement their design. This simply does not work.

People are always reluctant to change, particularly if they were not involved in identifying the need in the first place. By building buy-in early and allowing some leeway for key players to influence the strategy, you will create project supporters who will stand behind you and support you in your effort.

Vision document. You must make sure that everyone is buying into the same vision. Otherwise, support will fragment as people begin to see that the project is not moving in the direction they thought it would.

One approach that works well is to develop a vision diagram or document and then use it as a starting point to facilitate a discussion about the final strategies. Let the people you want to obtain buy-in from know that this document is only a starting point and that you welcome their input. You can then use the feedback you collect to write the final vision document.

For example, a vision document might define the program as follows. First, security alerts are generated by the security controls that are deployed in the production IT environment (such as intrusion-detection systems or firewalls), and are sent to the security operations center for review. The security operations staff triages these alerts to determine whether they are actually incidents.

When incidents are uncovered, the incident-response team is activated; this team uses incident-handling tools and refers to the organization’s incident-handling policies and procedures to contain and eradicate the damage the intrusion has caused.

Once security is restored, the team updates the incident-history database with the data about this most recent event. This database and any lessons learned from handling these incidents are analyzed to determine how to best protect the enterprise from the same or similar incidents in the future.

Alignment. The purpose of an incident-handling program is to handle an incident in a manner that is most beneficial to the organization and not just to the organization’s IT environment.

The incident-response team will often be reacting to an incident by taking drastic measures such as cutting off network links to customers or partners and shutting down applications that support a business process.

Are they going to consider the extent of the revenue losses, costs of inactivity, or supply-chain delays that they could cause when deciding which actions to take? It is up to the incident-response team to make sure that the answer to this question is yes.

That means that the people doing the work must be aware of the business processes that the network supports. They must understand what it means to shut those processes down in terms of customer satisfaction, reputation, and revenues. And they must know how long the company can bear that cost.

Policies and procedures. Incident-handling policies and procedures are like training manuals on how to handle an incident so that your organization is minimally affected. Once they are written, the incident-response team can be trained on them. If the policies and procedures are written properly, they will govern the incident-response team according to business needs. The first step toward good policies is an impact analysis.

The impact analysis establishes the baseline of what the business does and what its priorities are. We won’t go into the details of conducting an impact-analysis project, because that would be an article in itself. Suffice it to say that the impact analysis allows the incident-response team to determine the criticality of each system and the amount of downtime that can be afforded.

The team can use this information to develop incident-handling policies and procedures. The team can then develop probable scenarios, such as reacting to a virus outbreak or an unauthorized data modification, to illustrate the possible impact of an incident to business-unit owners.

The scenarios should show the business-unit owners what actions will be taken, how long the actions will take, how long the business processes they own will be affected, and what the impact to their business processes would be if the incident wasn’t handled quickly. This type of detailed example helps others understand what is at risk and it often elicits from the staff some additional suggestions or issues that may need to be considered in developing the program.
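One way to put the impact-analysis results to work in such scenarios is to compare the outage each planned action would cause with the downtime the business can afford. The Python sketch below assumes a hypothetical criticality scale (1 is most critical) and made-up systems and figures; it is not a substitute for an actual impact analysis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemProfile:
    name: str
    criticality: int           # 1 = most critical (assumed scale)
    max_downtime_hours: float  # downtime the business can afford

def outages_needing_signoff(profiles, planned_outage_hours):
    """Names of systems whose planned outage exceeds the tolerated downtime,
    most critical first."""
    over = [
        profile for profile in profiles
        if planned_outage_hours.get(profile.name, 0.0) > profile.max_downtime_hours
    ]
    return [p.name for p in sorted(over, key=lambda p: p.criticality)]

# Hypothetical impact-analysis results and a hypothetical containment plan.
profiles = [
    SystemProfile("order-entry", criticality=1, max_downtime_hours=2.0),
    SystemProfile("e-mail", criticality=2, max_downtime_hours=8.0),
    SystemProfile("intranet-wiki", criticality=3, max_downtime_hours=48.0),
]
planned = {"order-entry": 4.0, "e-mail": 4.0, "intranet-wiki": 72.0}

print(outages_needing_signoff(profiles, planned))  # ['order-entry', 'intranet-wiki']
```

A result like this gives the business-unit owners something concrete to react to before the procedure is finalized.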

Some readers may be thinking that the company should just leave incident response in the hands of the IT department. That may not be wise. It is also unfair to IT staff, who will inevitably be blamed when a problem occurs.

We have heard many horror stories from IT folks who have been fired or severely reprimanded for taking the actions they thought were appropriate to handle the incident. In a typical example, an incident occurs, and some action is taken to handle it. This action causes a customer to not be able to access a service, or blocks a key executive from his or her e-mail. Or maybe the incident happened just as a time-sensitive process—for example, closing the books for the year—was about to be completed.

A “witch hunt” usually quickly follows. At that stage, the policy of “Do whatever you need to do to solve the problem” changes to “Next time, do what you need to do, but not this.”

Obviously, that ad hoc approach benefits neither the company nor its staff. An effective incident-response program is the answer. It should be backed up by policies and procedures that don’t overregulate the team, which would choke their ability to respond quickly, and it should provide detailed-enough guidance so that “witch hunts” can be avoided.

80/20 rule. The 80/20 rule is helpful in designing the incident-handling procedures. It states that 20 percent of an organization’s customers or employees provide 80 percent of the organization’s revenue. This is important when considering whether to disconnect all customers in order to contain an incident or only disconnect some.

The incident-handling procedures should enumerate these priorities. The 20 percent portion of the customer base is often not disconnected from a service to contain an incident, even if these customers are the source of the incident.
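To make the 80/20 prioritization concrete, the following Python sketch ranks customers by revenue and returns the smallest group that accounts for a target share. The customer names, the revenue figures, and the function name are hypothetical.

```python
def top_revenue_customers(revenue_by_customer, share=0.8):
    """Return the highest-revenue customers that together reach `share`
    of total revenue, largest first."""
    total = sum(revenue_by_customer.values())
    ranked = sorted(
        revenue_by_customer.items(), key=lambda item: item[1], reverse=True
    )
    selected, running = [], 0.0
    for name, revenue in ranked:
        if running >= share * total:
            break
        selected.append(name)
        running += revenue
    return selected

# Hypothetical annual revenue per customer, in thousands of dollars.
revenue = {
    "Customer A": 500,
    "Customer B": 350,
    "Customer C": 80,
    "Customer D": 40,
    "Customer E": 30,
}

print(top_revenue_customers(revenue))  # ['Customer A', 'Customer B']
```

In this made-up data set, two of five customers supply 85 percent of revenue, so a procedure might treat disconnecting either of them as a decision that requires explicit approval.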

Integrate with external processes. Incident-handling procedures both affect the production environment and are affected by any changes made to the production environment. Therefore it’s necessary for incident-handling procedures to be integrated into the organization’s change-control procedures and the system development lifecycle.

Sometimes the incident-response team needs to make urgent changes to the production environment that cannot wait for approval through the formal change-control process. Once this happens, the configuration-management baselines will no longer be synchronized with the production environment.

To avoid difficulties, the company should first decide in detail which events will need to be handled through change control and which events will not. The company should also place a member of the change-control board on the incident-response team. This person will ensure that communications between the two groups are maintained and that both teams’ concerns are addressed in any given situation.
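One simple way to keep baselines from drifting after urgent changes is to log every emergency change and track which ones the change-control board has not yet reconciled. The Python sketch below is an assumed design, with hypothetical system names and field names, not a description of any particular change-control tool.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EmergencyChange:
    system: str
    description: str
    made_at: datetime
    reconciled: bool = False  # True once the baseline reflects the change

def pending_reconciliation(changes):
    """Emergency changes not yet reflected in the baseline, oldest first."""
    return sorted(
        (change for change in changes if not change.reconciled),
        key=lambda change: change.made_at,
    )

# Hypothetical changes made while handling an incident.
changes = [
    EmergencyChange("edge-firewall", "blocked inbound port 445",
                    datetime(2005, 4, 1, 10, 30)),
    EmergencyChange("mail-gateway", "disabled attachment relay",
                    datetime(2005, 4, 1, 9, 15)),
    EmergencyChange("web-01", "applied vendor patch",
                    datetime(2005, 3, 30, 14, 0), reconciled=True),
]

for change in pending_reconciliation(changes):
    print(change.system, "-", change.description)
```

The change-control board member on the incident-response team would be a natural owner for a list like this.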

In addition, when new systems are deployed into the production environment, they may introduce new vulnerabilities. The incident-response team must be properly trained on these new applications and products so that an incident can be handled quickly.

This training is even more important if a third party developed the new system. The detailed knowledge of the new system will leave your organization as soon as the contract is over, and the cost to bring back an expert at the time of an incident will probably be extremely high. Thus, the incident-response team must be able to handle incidents involving the new system on its own.

There is no way to completely avoid an attack on your organization’s computer networks, and even the best-prepared company is vulnerable. The best defense is having a comprehensive and thorough incident-handling plan and an incident-response team in place before that attack ever comes.

James Ryan, CISSP (Certified Information Systems Security Professional), CCIE (Cisco Certified Internetwork Expert), is a management consultant specializing in strategic security management. He has over 13 years of experience in the IT industry. Alex Rosenbaum, CISSP, CCIE, is a management consultant specializing in strategic security management. He has more than 18 years of experience in the IT industry. Scott Carpenter is the director of the Secure Elements, Inc. Remediation Team in Herndon, Virginia, providing enterprise vulnerability-management capabilities to government and corporate customers. He previously established incident-handling programs for the District of Columbia government and private industry.