What are the Four Major Types of Data Collection for Forensic Analysis?

When a digital forensics team begins work, it must decide how to collect the data that will support the investigation. Tools such as Cyber Triage and Autopsy can work with a number of formats, so investigators must choose the approach that best fits their case.

At the highest level, the decision is between choosing a fast and efficient method that may not capture all material or spending the time and effort to create a complete record by capturing all information, including much that may never be needed. It’s a trade-off between speed and thoroughness.

In some cases, investigators may know from the beginning where they will focus their efforts. A ransomware attack, for instance, leaves particular virtual footprints in well-understood parts of the data storage. In other cases the target is much less clear: an employee hiding illicit information, for example, may use any part of the disk system.

Hashes only

One of the simplest techniques is to calculate a single value for each file with a one-way hash function or a checksum. These functions are very efficient because they reduce any file to a short value, typically between 32 and 256 bits long. Common choices include the hash functions MD5 (128 bits) and SHA-256 (256 bits) and checksums such as CRC-64 (64 bits).
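
As a minimal sketch of this step, the Python function below reads a file in chunks and reduces it to three short values. The function name `file_digests` and the chunk size are illustrative assumptions, and CRC-32 from the standard `zlib` module stands in for the checksum family because the Python standard library does not ship a CRC-64.

```python
import hashlib
import zlib
from pathlib import Path


def file_digests(path: Path, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Return MD5, SHA-256 and CRC-32 values for one file.

    The file is read in chunks so that large files never have to fit
    in memory. `file_digests` is a hypothetical helper name.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    crc = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            md5.update(chunk)
            sha256.update(chunk)
            crc = zlib.crc32(chunk, crc)  # running 32-bit checksum
    return {
        "md5": md5.hexdigest(),        # 128 bits, 32 hex characters
        "sha256": sha256.hexdigest(),  # 256 bits, 64 hex characters
        "crc32": f"{crc:08x}",         # 32 bits, 8 hex characters
    }
```

Whatever the size of the input file, the result is only a few dozen characters per function, which is why hash-only collection is so cheap to store and transmit.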

The forensic investigator analyzes these values by looking for matches, either inside the file system or against databases that track the hash values of known malware or illicit content. Looking up the values for the files discovered in an investigation is a fast way to see if there is any connection to the well-known historical record.
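
The two kinds of matching described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the "database" is a plain in-memory set, and real tools query much larger curated hash sets.

```python
import hashlib
from collections import defaultdict


def sha256_hex(data: bytes) -> str:
    """Hash a byte string and return the hex digest."""
    return hashlib.sha256(data).hexdigest()


def match_known(file_hashes: dict[str, str], known_bad: set[str]) -> list[str]:
    """Return the names of files whose hash appears in a known-bad set."""
    return sorted(name for name, digest in file_hashes.items() if digest in known_bad)


def find_duplicates(file_hashes: dict[str, str]) -> list[list[str]]:
    """Group files on the same system that share an identical hash."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, digest in file_hashes.items():
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Both lookups are simple set or dictionary operations, so they stay fast even when the collection covers a very large number of files.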

While working only with hashes can be dramatically more efficient, it can fail to identify new threats or illicit data. Hash values can only be compared to previously recorded values, so the search can only make connections to previous investigations. If a new virus appears or there are new versions of illicit material, the search will miss them. To make matters worse, hash functions are designed to be extremely sensitive, so a small change in the file can change many of the output bits. Two hash values that happen to be close to each other say nothing about whether the underlying files are similar.
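
The sensitivity described above is easy to observe. The sketch below, using a hypothetical helper named `differing_bits`, counts how many of the 256 SHA-256 output bits differ between two inputs.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the SHA-256 output bits that differ between two inputs."""
    digest_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    digest_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(digest_a ^ digest_b).count("1")


if __name__ == "__main__":
    original = b"quarterly report, final version"
    altered = b"quarterly report, final version."  # one extra character
    print(differing_bits(original, altered))
```

For inputs that differ by even one character, the count is typically near half of the 256 bits, which is what would be expected from two unrelated files. A lightly modified piece of malware therefore looks completely new to an exact-match hash lookup.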

Full disk image

The opposite of collecting only the hash values is the full disk image, built by making a copy of the entire storage volume. The approach is often used when investigators aren’t sure of the focus of their investigation and want to preserve all data in case any bit may be valuable.
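
At its core, imaging is a block-by-block copy of everything on the source. The sketch below is a simplified illustration rather than a substitute for a real imaging tool: the function name `image_volume` and the block size are assumptions, the source could be a raw device path (which normally requires elevated privileges), and it omits write blocking, bad-sector handling, and forensic container formats. Computing a hash of the copied bytes is added here as an assumption, reflecting the common practice of recording a value that can later show the image is unchanged.

```python
import hashlib
from pathlib import Path


def image_volume(source: Path, image: Path, block_size: int = 4 * 1024 * 1024) -> str:
    """Copy every byte of `source` into `image`, block by block.

    Returns the SHA-256 of the bytes that were copied so the image can
    be re-checked later. `source` may be a regular file or a raw device.
    """
    digest = hashlib.sha256()
    with source.open("rb") as src, image.open("wb") as dst:
        while block := src.read(block_size):
            dst.write(block)
            digest.update(block)
    return digest.hexdigest()
```

Because nothing is filtered, the image file grows to the full size of the source, including unused and deleted areas.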

The approach is often expensive and time-consuming because making a complete copy requires separate storage at least as large as the original volume. While this may not be much of an issue with smaller devices like smartphones, it can be prohibitive with larger workstations or enterprise servers.

Full disk images do have some limitations. Some malware can only be identified by capturing an image of the processes running in memory. It may be encrypted on the disk, or it may never reside there at all, arriving over the network instead. Full disk images focus on data storage, so they will miss this kind of evidence.

Full file/artifact collection

A cousin of disk imaging is often called the “full file” solution. This will make complete copies of entire sections of the storage system but it won’t necessarily capture all of the different sections. It may ignore less important areas like old text documents or out-of-date system software.
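
A rough sketch of this idea in Python follows. The function name `collect_areas` and the choice of areas are hypothetical, and the sketch works on a mounted file system rather than on raw sectors.

```python
import shutil
from pathlib import Path


def collect_areas(root: Path, areas: list[str], dest: Path) -> list[Path]:
    """Copy every file under the chosen sub-directories of `root` into `dest`.

    Directories not listed in `areas` are skipped entirely.
    """
    copied: list[Path] = []
    for area in areas:
        area_root = root / area
        if not area_root.is_dir():
            continue  # this area does not exist on the system being collected
        for path in sorted(area_root.rglob("*")):
            if path.is_file():
                target = dest / path.relative_to(root)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target)  # copy2 also tries to keep timestamps
                copied.append(target)
    return copied
```

A call such as `collect_areas(Path("/mnt/evidence"), ["Users", "ProgramData"], Path("/cases/001"))`, with all paths hypothetical, would copy those two areas in full and leave everything else behind.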

The solution has many of the same benefits and limitations as a complete disk image. It can still be expensive in time and materials because so much data is captured from particular regions, but there are some savings from ignoring areas that are less likely to be valuable. If the investigators come to need a particular file later in the investigation, it will be found in the full file collection as long as it was stored in one of the collected areas.

The process of limiting the collection to broad areas can be tricky for investigators if there’s not much guidance available at the start. Eliminating some areas from the collection can be very efficient – unless crucial information happens to be found in those areas.

Hybrid / Smart

Some investigators prefer to use more modern tools that create hybrid images focused on the most likely places to find evidence. These tools use a set of filter functions that may mix hard rules with adaptive technologies like machine learning to identify the files most likely to be useful to an investigation.

When the filter identifies potential evidence, it will make copies of those files. Some will also make copies of nearby files that may be in the same directory.
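
The selection logic can be sketched as a filter plus a "neighbors" step. Everything specific here is an assumption made for illustration: the suffix list is a hypothetical hard rule, the function names are invented, and a real tool would combine many more rules, possibly with a trained model in place of `looks_suspicious`.

```python
from pathlib import Path

# Hypothetical hard rule: file types that often deserve a closer look.
SUSPICIOUS_SUFFIXES = {".exe", ".dll", ".ps1"}


def looks_suspicious(path: Path) -> bool:
    """Stand-in filter function; a real one could be far more elaborate."""
    return path.suffix.lower() in SUSPICIOUS_SUFFIXES


def select_for_collection(root: Path, include_neighbors: bool = True) -> set[Path]:
    """Return the files the filter flags, plus files in the same directory."""
    selected: set[Path] = set()
    for path in root.rglob("*"):
        if path.is_file() and looks_suspicious(path):
            selected.add(path)
            if include_neighbors:
                # Nearby files may give context for the flagged one.
                selected.update(p for p in path.parent.iterdir() if p.is_file())
    return selected
```

Turning `include_neighbors` off yields the smallest possible collection, while leaving it on trades some extra storage for context around each flagged file.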

What are the Key Takeaways for the C-Suite?

A digital forensics and incident response (DFIR) team can adjust how much data is gathered for an investigation. Managers should know:

  • There’s a direct trade-off between the amount of information that’s gathered and the cost in time, storage and analysis.
  • Many files are unlikely to be useful and so an efficient DFIR team can often safely ignore them.
  • It’s impossible, though, to guarantee that a file won’t be useful to some investigator or analysis later.
  • The rules and filter functions that determine the most useful files are based upon past investigations. This knowledge is often quite useful, but it may not cover new attacks or forms of abuse.