Your Cyber Response Plan Needs These 6 Components
Cybersecurity incidents are no longer a matter of if, but when. Building a good strategy and architecture to deter intrusions is incredibly important in reducing the frequency and severity of incidents, but there is no scenario where any organization is totally immune. That means that every organization must have a plan for what they will do in both their enterprise (IT) and operational technology (OT) environments in case an incident occurs.
One of the most troubling things to me as an incident response practitioner is observing organizations forced to expend extra resources to resolve an incident because they have not adequately designed, implemented, or tested their incident response plans. Incidents are high-stress crisis situations in any environment, but particularly so in process environments where life and safety are on the line. Incident response is also costly in multiple ways. My team of Dragos Incident Response professionals frequently performs response in environments where planning and preparation were underdeveloped, and we must therefore perform the essentials at the expense of extra time, risk, and money before investigation can proceed.
The following items are essential components of incident response planning and documentation. Each one must be completed to some degree to facilitate incident response. If they are not completed, they must be performed ad hoc at the beginning of any substantial incident effort, and their absence may lead to confusion and mistakes. I highly recommend reviewing what you have completed, what you have not completed but can do as a relatively light lift, and which actions will require more time and a dedicated project.
Architectural Understanding and Documentation
To secure against—much less investigate—an intrusion into an environment, we must know what exists in it. This includes network topology, asset inventories, and industrial process documentation. Without this information, it is difficult to focus incident response efforts on the right systems, quantify impacted systems, and calculate the risk an incident poses. We don’t know what we don’t know, and that could include the presence of network connections, systems, or critical process components.
If this essential architectural documentation is not available and up-to-date in an environment in which we are consulting, we must map the environment’s assets, IP ranges, and process functions to some degree, at IR consulting rates, before we can begin true incident response. An internal team will be forced to do the same, at similar cost.
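As a rough illustration of why this documentation matters during triage, the sketch below checks observed IP addresses against a documented inventory. The segment names, address ranges, and asset descriptions are hypothetical placeholders, not taken from any real environment, and a real inventory would live in a maintained system rather than in code.

```python
import ipaddress

# Hypothetical documented inventory: network ranges and known assets.
DOCUMENTED_RANGES = {
    "enterprise_it": ipaddress.ip_network("10.10.0.0/16"),
    "ot_control": ipaddress.ip_network("192.168.50.0/24"),
}
KNOWN_ASSETS = {
    "10.10.1.5": "domain controller",
    "192.168.50.10": "HMI workstation",
    "192.168.50.20": "PLC, cooling loop",
}


def classify_address(address: str) -> tuple[str, str]:
    """Return (segment, asset description) for an observed IP address."""
    ip = ipaddress.ip_address(address)
    # Find the first documented range containing this address, if any.
    segment = next(
        (name for name, net in DOCUMENTED_RANGES.items() if ip in net),
        "undocumented range",
    )
    asset = KNOWN_ASSETS.get(address, "unknown asset")
    return segment, asset


# Addresses seen in (hypothetical) network evidence during an incident.
for addr in ["192.168.50.20", "192.168.50.77", "172.16.4.9"]:
    segment, asset = classify_address(addr)
    print(f"{addr}: {segment} / {asset}")
```

An address that falls in a documented range but matches no known asset, or that falls outside every documented range, is exactly the kind of unknown that otherwise has to be mapped by hand during the incident.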
Crown Jewel Analysis
Another key piece of information in cybersecurity is which systems and network segments are most critical and impactful to the business. In IT, this can be somewhat more straightforward; key IT systems like Domain Controllers and web servers are probably crucial to business function. In OT, this is a much more complicated and physically impactful question.
To identify Crown Jewel systems in OT, we must first consider consequences of concern to the business. What would a “worst day ever” be in the process environment? Then, working down from that, we systematically identify the processes and subprocesses which could be responsible for that consequence occurring. Mitigations must be considered alongside process hazards to truly identify which conditions, and then which individual components, could cause a scenario of real concern.
The digital or computer-impacted components will potentially become Crown Jewels from a cybersecurity perspective. For example, loss of cooling in a facility might not immediately come to mind as a hazard to the process, but after careful functional modeling, engineers might identify that it would create an overheat condition and cause a shutdown. Therefore, the HVAC controllers might be identified as a Crown Jewel in that instance. You can read more about Crown Jewel Analysis here.
If realistic Crown Jewels are not noted in incident response documentation and supporting documents, responders may focus on the wrong priorities of forensic evidence collection, triage, and restoration recommendations when performing incident response in an OT environment.
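The top-down reasoning described above can be sketched as a walk over a small consequence tree. This is a simplified illustration, not a formal Crown Jewel Analysis method: the `Node` structure, the `mitigated` flag, and every entry in the example tree other than the cooling scenario are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    digital: bool = False      # component is computer-controlled or networked
    mitigated: bool = False    # an independent safeguard covers this condition
    children: list["Node"] = field(default_factory=list)


def candidate_crown_jewels(node: Node) -> list[str]:
    """Walk down from a consequence and collect digital components
    that sit on paths with no mitigation."""
    if node.mitigated:
        return []
    if not node.children:
        return [node.name] if node.digital else []
    found = []
    for child in node.children:
        found.extend(candidate_crown_jewels(child))
    return found


# Hypothetical tree built around the cooling example.
worst_day = Node("Unplanned facility shutdown", children=[
    Node("Overheat condition", children=[
        Node("Loss of cooling", children=[
            Node("HVAC controller", digital=True),
            Node("Manual damper"),
        ]),
    ]),
    Node("Loss of feed pressure", mitigated=True, children=[
        Node("Pump drive controller", digital=True),
    ]),
])

print(candidate_crown_jewels(worst_day))  # ['HVAC controller']
```

In practice the tree comes from engineers and process hazard documentation; the code only shows how consequences, mitigations, and digital components combine into a candidate list.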
Collection Management Framework
Collection Management Frameworks (CMFs) document forensic data sources that may be used in cybersecurity detection, threat hunting, and incident response. These simple tables contain information about where logs and other digital evidence are retained, their retention period, and how they can be used by the cybersecurity team. In enterprise environments, modern SIEMs often provide a partial CMF. Unfortunately, in less mature and less modern OT environments, this luxury is rarely available.
It is important to know what data sources are available for analysis during incident response. Forensic analysis requires corroboration and correlation across multiple types of data and logs, and various pieces of activity on computer systems are logged in different places. Taking the time to truly understand what data sources exist and how long data is kept before an incident occurs is a huge time-saver, and it can also make gaps in detection coverage apparent. A CMF can be fairly simple or very verbose, depending on needs.
If no CMF exists during an incident, a basic one must be created on the fly. This requires hunting for data sources and holding extra conversations, and it may lead to the loss of volatile data if that data is not collected before its retention period ends. Key data sources may exist but not be utilized in the investigation because nobody is aware of their presence or how to access them. You can read more about CMFs here.
Figure: Sample CMF
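A CMF of the kind described above can be represented as a simple table of data sources. The sketch below is one minimal way to do that; the source names, locations, and retention periods are invented for illustration, and a real CMF would reflect the organization's actual logging.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    location: str
    retention_days: int
    answers: str  # investigative questions this source helps answer


# Hypothetical entries; real values come from the environment itself.
CMF = [
    DataSource("Windows event logs", "HMI workstations", 14, "logons, process starts"),
    DataSource("Firewall logs", "IT/OT boundary firewall", 90, "cross-zone connections"),
    DataSource("Historian data", "process historian", 365, "process value changes"),
]


def sources_still_available(cmf, days_since_activity):
    """Return names of sources whose retention still covers the activity."""
    return [s.name for s in cmf if s.retention_days >= days_since_activity]


def collect_first(cmf):
    """Order sources by shortest retention, since those expire soonest."""
    return [s.name for s in sorted(cmf, key=lambda s: s.retention_days)]


print(sources_still_available(CMF, 30))  # ['Firewall logs', 'Historian data']
print(collect_first(CMF))  # ['Windows event logs', 'Firewall logs', 'Historian data']
```

Even a table this small answers two questions that otherwise cost time during an incident: which evidence has already aged out, and which evidence should be collected first.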
Incident Response Plan Document
An Incident Response Plan (IRP) document is the playbook and supporting documentation for how a team will respond to a cybersecurity incident. While many organizations have some form of IT IRP, fewer have dedicated OT content. Performing incident response in OT environments has very different logistical and technical requirements, so this is typically a major hindrance during response.
IRPs can take a multitude of template formats, but at their most basic they should contain a workflow across an incident response lifecycle of choice, such as PICERL (Preparation, Identification, Containment, Eradication, Recovery, and Lessons Learned). Regardless of the selected model, IRPs must define the threshold at which an incident is declared and the severity at which it will be investigated, and must cover communications, forensic analysis, containment, and recovery of operations. Beyond that, a good IRP contains details on resources available, detailed escalation procedures, documentation resources, and network containment procedures. This document should be accessible to all relevant team members and updated routinely.
Where no IRP is in place, my team is often called in very late in response efforts due to disorganization and confusion. Communications can be disrupted or slow, and all stages of response move more slowly due to the lack of clear playbooks and central documentation. An IRP does not need to be exceedingly lengthy or complex to be deeply impactful to the speed and comfort of incident response. It should contain the essentials, however, and be drilled routinely.
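To make the idea of declaration thresholds and lifecycle phases concrete, here is a small sketch. The PICERL phase names come from the text above; the severity scale, the declaration threshold, and the escalation actions are hypothetical values that each organization would define in its own IRP.

```python
from enum import Enum, IntEnum


class Phase(Enum):
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LESSONS_LEARNED = 6


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical policy values; a real IRP defines its own.
DECLARE_AT = Severity.MEDIUM
ESCALATION = {
    Severity.LOW: "log and monitor",
    Severity.MEDIUM: "notify incident commander",
    Severity.HIGH: "notify incident commander, plant operations, and legal",
}


def triage(severity: Severity) -> tuple[bool, str]:
    """Return (incident declared?, escalation action) for an event severity."""
    return severity >= DECLARE_AT, ESCALATION[severity]


def next_phase(phase: Phase):
    """Return the phase that follows, or None after Lessons Learned."""
    members = list(Phase)
    i = members.index(phase)
    return members[i + 1] if i + 1 < len(members) else None


print(triage(Severity.LOW))  # (False, 'log and monitor')
print(next_phase(Phase.CONTAINMENT))  # Phase.ERADICATION
```

Writing these decisions down ahead of time, in whatever form, means nobody has to debate them in the middle of a crisis.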
Remember, incident response actions are performed in stressful, exhausting, crisis situations. It is much more difficult to perform tasks under those conditions, and practice and documentation are a huge help to avoid mistakes.
Internal and External Resource Identification
One of the simplest yet hardest-to-answer questions that arises in incident response is who will perform it. Incident response is a group effort, with multiple crucial roles. This includes a commander or handler who organizes the efforts, technical subject matter experts in both OT and cybersecurity, and IT, logistical, communications, and even legal support. These people often do not work in crisis situations daily and must receive sufficient training and hardware and software tooling to do the job.
There are several models for staffing incident response teams. It can be performed entirely in-house, which is very practical but also costly. Retainers are another great option and can be more affordable, if less personal. The least desirable and most expensive option is hiring external incident response support ad hoc because many firms are overbooked, and less reputable ones seek to take advantage of the demand during crises. Of course, a mix of these options is also a possibility.
On multiple occasions, my team has had to entirely redo incident response performed by another firm which was not qualified to investigate OT environments. In some cases, those firms even caused damage to industrial equipment through intrusive scanning and agent installation. Whatever you choose to do that fits your financial and risk model, make sure you have a plan for who will do the work if an incident should occur.
Assessing Cybersecurity Maturity
One of the hardest questions to answer impartially about incident response capability is where your organization stands relative to other organizations. It is important to quantify this to both identify gaps and measure success. A guess is subject to both negative and positive biases, so an external tool can be very valuable.
The U.S. Department of Energy has developed a great self-assessment tool called Cybersecurity Capability Maturity Model (C2M2), which breaks cybersecurity into functional areas. One of those areas is incident response. In each section, the rated organization answers several granular questions about specific capabilities on a scale of one to four, from “not implemented” to “fully implemented.” The tool then provides a calculation of Maturity Level based on these answers. Maturity Level is a measure from 1-3. Maturity Level 1 is accomplished with the completion of fundamentals, while Maturity Level 3 is reserved for the most mature cybersecurity organizations with very advanced capabilities. The objective is to complete everything at Level 1 before proceeding to Level 2, and so forth.
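The "complete Level 1 before Level 2" logic can be sketched in a few lines. This is a simplified illustration, not the C2M2 tool's own calculation: the two intermediate response labels and the rule that "largely implemented" or better counts as complete are assumptions that should be checked against the official model, and the function returns 0 when Level 1 itself is incomplete.

```python
# Four-point response scale; the two middle labels are assumed here.
RESPONSES = [
    "not implemented",
    "partially implemented",
    "largely implemented",
    "fully implemented",
]


def maturity_level(practices, minimum="largely implemented"):
    """practices: list of (level, response) pairs for one functional area.

    Returns the highest level for which every practice at that level and
    all lower levels meets the minimum response, or 0 if Level 1 is incomplete.
    """
    threshold = RESPONSES.index(minimum)
    achieved = 0
    for level in (1, 2, 3):
        at_level = [r for lvl, r in practices if lvl == level]
        if not at_level or any(RESPONSES.index(r) < threshold for r in at_level):
            break  # a gap at this level caps the rating
        achieved = level
    return achieved


# Hypothetical answers for the incident response area.
answers = [
    (1, "fully implemented"),
    (1, "largely implemented"),
    (2, "partially implemented"),
    (3, "fully implemented"),
]
print(maturity_level(answers))  # 1
```

In this example the completed Level 3 practice does not raise the result, because the gap at Level 2 caps it at Level 1.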
Most organizations, when rated, show a mix of completion across maturity levels. Completing Maturity Level 3 tasks prior to completion of Maturity Level 1 can lead to confusion, gaps, and major delays. It is fine to identify you need to grow! However, jumping ahead can cause more harm than good. Organizations that have not completed a C2M2 assessment often misrepresent their maturity to my team, either understating or overstating it. As a result, we can be less helpful and supportive than we would be if we understood the reality.
The above tasks may feel daunting when first read. They are substantial lifts, but they are necessary to perform incident response effectively. Remember, if they are not performed in advance, your internal incident responders or a consulting team like mine will have to prepare them before starting incident response in a crisis, at the cost of time, money, and risk.
Consider which of the tasks are partially or fully implemented in your environment, and which ones could be completed with less effort. Anything is better than nothing. Then, tackle the harder line items through long-term projects and resource allocation. Any organization can fall victim to an OT cyberattack, and preparation is key to reduce the cost and impact as much as possible.
Lesley Carhart is the director of incident response for North America at the industrial cybersecurity company Dragos, Inc., leading response to and proactively hunting for threats in customers’ ICS environments. Following four years as a principal incident responder for Dragos, Carhart now manages a team of incident response and digital forensics professionals across North America who perform investigations of commodity, targeted, and insider threat cases in industrial networks. Prior to joining Dragos, Carhart was the incident response team lead at Motorola Solutions.
© Lesley Carhart