Many companies want to improve their incident response capabilities and make them more efficient. Automation is often touted as a way to improve response times, but what does automation (or orchestration) mean in DFIR? Can the entire process be automated? Do we want it to be?
To answer those questions, we need to think about incident response differently. This post is the first in a series that dives into what can be automated in DFIR and how to prioritize implementing it.
This content came out of a talk that I’m giving at the 2016 SANS DFIR summit. While I love the idea of short 30-minute talks (which we also do at OSDFCon), I realized that there was way too much content to cover in that short period. So, these blog postings will have the more complete discussion.
In this posting, we’re going to talk about why we’d automate, how other industries think about automation, and a framework for thinking about automation in IR. It should help you start thinking about your process and where you should focus your automation efforts.
Why Automate DFIR?
Automation can, in theory, help with common issues that security teams have:
- The time between alerting and remediation is too long.
- The alert backlog is too big.
- The responder doesn’t always know what is normal or anomalous on each host.
- Responders are bored of doing the same thing every day.
- There is no budget to hire more people.
Automation can make the response faster, which means you can get through the backlog faster. Automated systems can store lots of data, which helps responders know what is normal. At the end of the series, we’ll review each of these to see if the dreams come true or if automation just moves the problems.
There is a lot of previous work on the topic of automation, so we should have some awareness of that before we dive into IR-specific topics. Previous work ranges from manufacturing to flying an aircraft. There is a fantastic blog post from John Allspaw at Etsy about the basic concepts of automation. He is covering it from the perspective of automating IT and web sites, but the basics apply to IR as well. It’s a must read if you care about this topic. You can think of this section as the parts of his blog that I think are most relevant to automating IR.
What is Automation?
If you search around for a definition of automation, you’ll get a lot of things that are hard to directly apply to incident response. In Allspaw’s article, he lists some fairly academic definitions, such as:
“Automation is defined as the technology concerned with the application of complex mechanical, electronic, and computer based systems in the operations and control of production.” – Raouf (1988)
While I understand the intent of this definition and others, they are hard to map to incident response techniques. So, I made a simpler (and less rigorous) definition:
Automation is when the computer does the next step without human intervention.
Levels of Automation
Automation can occur at various levels. Allspaw mentions an article by Sheridan and Verplank that lists 10 levels of automation with varying degrees of human interaction.
While having that many levels is great for an advanced discussion of automation and for evaluating specific implementations of automation, I’m using three levels for the basic discussion in this series:
- Manual: No computer assistance
- Partially automated: Some human interaction
- Fully automated: No human interaction
There are plenty of examples where automation has done the wrong thing. Allspaw gives the example of autocorrect on our phones and the mistakes it makes. I’m sure you have your own bad experiences.
Because of these bad experiences, it is important that people can trust the automated system to do the right thing. The Allspaw blog provides suggestions from Lee and See to help people trust the software. Here are what I think are the most relevant items for IR:
- Show the past performance of the automation.
- Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators.
- Simplify the algorithms and operation of the automation to make it more understandable.
- Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use.
I can attest to the need to be transparent with the automation from our work with Cyber Triage. We automate collection and analysis of endpoint-based data and our initial releases showed very little about what steps were going on behind the scenes because we didn’t want to complicate things for the user. However, people wanted to know what we were doing so that they could trust that we were doing at least all of the same things that they would do. So, we now show much more of that.
The main takeaway from this list is to think about which of these your IR automation solution provides.
Risks of Automation
Allspaw talks about some well-known “ironies of automation” from Lisanne Bainbridge, such as wanting to automate things because humans aren’t reliable, yet we can’t automate complex things and end up depending on humans for the most complex things (even though they supposedly aren’t reliable for simple things…).
The part that is most relevant for the IR discussion is this from James Reason:
“Skills need to be practiced continuously in order to preserve them. Yet an automatic system that fails only very occasionally denies the human operator the opportunity to practice the skills that will be called upon in an emergency.”
The impact is that when your DFIR automation fails in the face of a new advanced threat, you need to make sure that you have someone who can do the response work manually.
When Should You Automate?
When thinking about automating IR, we need to decide at each step what the appropriate automation level is. I couldn’t find a concise approach to making this decision, so I made up my own.
You should choose the highest level of automation (full, partial, or manual) where the benefit of automatically performing the next step (instead of doing it manually) is greater than:
- The impact and likelihood of a mistake (wrong next step is performed)
- The cost of implementing and maintaining the automation (the cost is often directly associated with the complexity of the solution).
The obvious challenge with this approach is that it is hard to find a metric that lets you quantitatively compare the benefit of the automation with its expense and risk of errors. So, it’s more subjective than objective. But those are the three criteria to consider when thinking about the level of automation to choose.
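As a rough illustration of the rule above, here is a minimal Python sketch. The names (Level, Estimate, choose_level) and all of the numbers are hypothetical, and the sketch assumes that the risk of a mistake and the upkeep cost are simply added together, which is only one reading of the rule. In practice the inputs are subjective estimates rather than measured values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three automation levels used in this series."""
    MANUAL = 0
    PARTIAL = 1
    FULL = 2


@dataclass
class Estimate:
    """Subjective estimates for automating one step (hypothetical units)."""
    benefit: float             # value of not doing the step by hand
    mistake_impact: float      # cost if the wrong next step is performed
    mistake_likelihood: float  # probability of a mistake, 0.0 to 1.0
    upkeep_cost: float         # cost to implement and maintain the automation

    def net(self) -> float:
        # Assumption: risk is impact times likelihood, added to upkeep cost.
        expected_mistake_cost = self.mistake_impact * self.mistake_likelihood
        return self.benefit - expected_mistake_cost - self.upkeep_cost


def choose_level(estimates: dict[Level, Estimate]) -> Level:
    """Return the highest level whose benefit exceeds its combined risk and cost."""
    for level in sorted(estimates, reverse=True):
        if level is not Level.MANUAL and estimates[level].net() > 0:
            return level
    return Level.MANUAL


# Made-up numbers for a single step, such as disabling a user account.
step = {
    Level.PARTIAL: Estimate(benefit=8, mistake_impact=20,
                            mistake_likelihood=0.05, upkeep_cost=3),
    Level.FULL: Estimate(benefit=10, mistake_impact=20,
                         mistake_likelihood=0.4, upkeep_cost=4),
}
print(choose_level(step).name)
```

With these made-up numbers, full automation loses out because a mistake is too likely, so the sketch settles on partial automation. Changing any one estimate can flip the answer, which is a reminder of how subjective the decision is.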
Automation in Incident Response
Now that we have some of the basics, let’s get back to incident response. First, let’s talk about how we think of incident response. Most people think of it as a process with basically these steps: identification, investigation, containment, eradication, and recovery.
While these are great steps for thinking about the process, I don’t think they are as useful for thinking about the technology involved because the phases have a lot of technology overlap.
I think about incident response as two types of activities:
- Investigation: Collecting and analyzing data to answer investigative questions (the identification and investigation phases)
- Mitigation: Taking actions to reduce further damage (the containment, eradication, and recovery phases)
Let’s dive into each of them a bit more, but we’ll save the details for the follow-on posts.
When responding to an incident, there are often many questions to answer at different times and the investigation work answers them. Here are some common questions:
- Triage: Is the computer compromised? How badly?
- Deep Dive / Forensics: Who did it, when did it happen, what were they going after?
- Hunting / Scoping: Which other computers have this file?
The investigation process can be broken down into three general steps:
- Data Collection: Get some data to analyze
- Data Analysis: Analyze the data and get some results
- Inference: Answer the questions based on the results
In the next posting, we’ll break these three steps down further to really evaluate where we should automate, but a fully automated investigation process would look something like this:
- Computer identifies what questions need to be answered based on the incident type.
- Computer knows what data types (such as volatile data, registry keys, etc.) are needed to answer the question and what computers have the needed data.
- Computer analyzes the collected data to answer the questions, with techniques that range from comparing to IOCs to looking for user behavior that is anomalous for their role.
- Computer continues to collect and analyze data until it can make a conclusion about the answer based on the analysis results.
I’d propose that this is currently possible for a very limited set of questions and scenarios. For example, the question “does this file exist on other computers” can be fully automated. Other questions, though, fit into the partially automated category. Triage questions are easier to automate than deep dive questions because they are better understood. We’ll dive into this in the next posting, which covers automating the collection and analysis.
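To make the fully automated case concrete, here is a minimal Python sketch of the “does this file exist on other computers” question, following the collection, analysis, and inference steps described earlier. It assumes that each host’s file contents are already available as an in-memory dictionary, which stands in for whatever an endpoint agent would really gather. The host names, paths, file contents, and function names are all hypothetical and not taken from any particular tool.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hash file content so hosts can be compared by hash value."""
    return hashlib.sha256(data).hexdigest()


def collect(inventories: dict[str, dict[str, bytes]]) -> dict[str, set[str]]:
    """Data collection: reduce each host's files to a set of SHA-256 hashes."""
    return {
        host: {sha256_of(content) for content in files.values()}
        for host, files in inventories.items()
    }


def analyze(host_hashes: dict[str, set[str]], target_hash: str) -> dict[str, bool]:
    """Data analysis: record whether each host has the target hash."""
    return {host: target_hash in hashes for host, hashes in host_hashes.items()}


def infer(results: dict[str, bool]) -> list[str]:
    """Inference: answer the question by listing the hosts that have the file."""
    return sorted(host for host, found in results.items() if found)


# Hypothetical stand-in for data that an endpoint agent would gather.
inventories = {
    "hr-laptop-01": {"C:/Users/a/invoice.exe": b"malicious payload"},
    "dev-desktop-07": {"C:/Tools/notes.txt": b"meeting notes"},
    "fin-laptop-03": {"C:/Temp/update.exe": b"malicious payload"},
}
suspicious = sha256_of(b"malicious payload")
hits = infer(analyze(collect(inventories), suspicious))
print(hits)
```

No human is needed between the steps here because the question has an unambiguous answer: a hash either matches or it does not. Questions such as “who did it” have no comparably simple inference step, which is part of why they tend to stay partially automated.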
When we aren’t investigating during a response, we’re making mitigation-based changes to reduce damage and risk. For example, a host could be removed from the network, a user account could be disabled, or the system could be wiped.
We can break mitigation into two steps for the topic of automation:
- Picking the mitigation approach: Use knowledge of the attack and the corporate network to decide what changes to make to best reduce the risk.
- Implementing it: Make the changes.
For the topic of automation, the idea is that the software could automatically decide what changes to make for a given incident and could automatically carry them out.
The obvious risk with automating mitigation is that mistakes can be expensive. If you incorrectly collect too much data, it wastes time, but that is not as bad as shutting off network access to some key people in the company by accident.
Like with investigation work, the practical solution for many companies is to have partial automation for mitigation. For example, have a human approve most actions before they happen, except in well-known situations where they can be performed automatically. We’ll cover this more in a future posting.
Automation is needed to quickly resolve incidents, but we need to think critically about where to apply the automation. We need to understand what types of questions can be automatically answered and what types of mitigations can be automatically applied. In the next article, we’ll break the investigation work into smaller steps to evaluate which can and should be automated and which are better left manual given the current technology.
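One way to picture partial automation for mitigation is an approval gate. The Python sketch below is hypothetical: the action names, the PRE_APPROVED set, and the approve callback are assumptions for illustration, and the “mitigation” is only a returned string rather than a real change to a host or account.

```python
from typing import Callable

# Hypothetical (action, incident type) pairs trusted enough to run unattended.
PRE_APPROVED = {("disable_account", "known_phishing_credential_theft")}


def mitigate(action: str, incident_type: str, target: str,
             approve: Callable[[str], bool]) -> str:
    """Run well-known mitigations automatically; ask a human for everything else."""
    request = f"{action} on {target} for {incident_type}"
    if (action, incident_type) in PRE_APPROVED:
        return f"AUTO: {request}"
    if approve(request):
        return f"APPROVED: {request}"
    return f"SKIPPED: {request}"


def deny_all(request: str) -> bool:
    """A stand-in for a human reviewer who rejects every request."""
    return False


print(mitigate("disable_account", "known_phishing_credential_theft",
               "jsmith", deny_all))
print(mitigate("isolate_host", "unknown_malware", "fin-laptop-03", deny_all))
```

The first call runs without asking anyone because the pair is on the pre-approved list, while the second is held back by the reviewer. Growing or shrinking that list is how a team would move individual mitigations between the partially and fully automated levels.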