Benoit Hamelin, CTO
March 10, 2017
When actively monitoring endpoints to detect signs of cyber attacks, preserving visibility through the endpoint sensor is crucial. A likely attack scheme for malware is to stop the sensor process, do its malware deeds, then restart the sensor process, or even leave it dead. However, losing connectivity with a sensor is a likely event due to various system actions and outside circumstances. This article discusses ways to distinguish between the various scenarios behind an endpoint sensor connectivity loss, and to figure out when to sound the red alert.
The first step when trying to figure out why contact was lost is to examine all possibilities. Here are the likely scenarios:

1. Deliberate disablement of the agent by an attacker:
   a. the agent is terminated;
   b. the agent is suspended or otherwise silenced without being killed.
2. The agent crashes on its own.
3. The host machine goes down:
   a. it is shut down;
   b. it is restarted;
   c. it is a virtual machine and gets interrupted.
4. Network failure:
   a. the machine alone loses connectivity;
   b. the whole organization loses connectivity.
A few simple measures can help distinguish between most of these scenarios. Let’s start from the bottom of the list and go up.
Scenario 4b is easy to distinguish from all others: connectivity is then lost for all the machines within the same organization. A phone call to the IT staff is in order, but sometimes the Internet is just that tough to deal with. So, it comes down to diagnosing a single machine going incommunicado.
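As a toy illustration of this first triage step, here is a minimal sketch, in Python, of deciding whether a connectivity loss is organization-wide or confined to a single host; the data structures and the 90% threshold are illustrative assumptions, not part of the SNOW product.

```python
# A toy sketch of the first triage step: if (nearly) all of an organization's
# endpoints have gone silent at once, the problem is the organization's
# connectivity, not any single host. All names and thresholds are illustrative.
def classify_outage(silent_hosts: set, all_hosts: set,
                    org_wide_threshold: float = 0.9) -> str:
    """Return a coarse diagnosis from the set of currently silent endpoints."""
    if not silent_hosts:
        return "no outage"
    if len(silent_hosts) >= org_wide_threshold * len(all_hosts):
        return "organization-wide connectivity loss: call the IT staff"
    return "single hosts incommunicado: diagnose each one"


print(classify_outage({"db-01", "web-02"}, {"db-01", "web-02", "dc-01", "ws-17"}))
```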
The way Arcadia’s endpoint agents will handle this case is by keeping them in contact with one another across an ad hoc peer-to-peer network deployed over the customer’s network. In other words, each SNOW-defended endpoint within a LAN maintains a TCP connection with a subgroup of other endpoints (the size of this subgroup depending on the machine’s purpose and resource constraints). Should any of these TCP connections fail, its peers know it instantly and can report the fact back to the central analytics cloud. This set of peer-to-peer communication features is under development as this is written.
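To make this concrete, here is a minimal sketch of one such peer watcher, assuming each agent sends a one-byte heartbeat every few seconds to the peers watching it; the port, timeout, field names and report_to_cloud() uplink are illustrative, not the actual SNOW protocol.

```python
# Minimal sketch of a peer watcher: hold a TCP connection to one peer and
# report the instant it fails. Assumes the peer sends a heartbeat byte every
# few seconds; report_to_cloud() stands in for the real uplink.
import socket
import time

HEARTBEAT_TIMEOUT = 15.0  # seconds of silence before declaring the peer lost


def report_to_cloud(event: dict) -> None:
    """Placeholder for the agent's authenticated uplink to the analytics cloud."""
    print("cloud <-", event)


def watch_peer(host: str, port: int) -> None:
    conn = socket.create_connection((host, port), timeout=HEARTBEAT_TIMEOUT)
    try:
        while True:
            if conn.recv(1) == b"":          # orderly close: the peer went away
                raise ConnectionError("peer closed the connection")
    except OSError as exc:                   # timeout, reset or unreachable network
        report_to_cloud({
            "event": "peer_connectivity_lost",
            "peer": f"{host}:{port}",
            "observed_at": time.time(),
            "detail": str(exc),
        })
```

Each endpoint would run one such watcher per peer in its subgroup, so that a single machine going silent is observed, and reported, by several neighbors at once.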
In addition, the agent is aware of its host’s connectivity problems. When the machine regains connectivity, the agent reports how long it was down, so that we can corroborate this information with what was reported by its peers. This way, if an agent disablement is being disguised by the attacker as a network failure, there will be a gap in the offline time interval during which the agent was not alive. This is enough information to warrant a deeper investigation, hence to raise an alert.
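The corroboration itself boils down to an interval check: do the agent’s own records of being alive cover the whole window during which its peers saw the host offline? A sketch, with made-up names and an arbitrary tolerance:

```python
# Check whether the agent's local liveness marks (heartbeats it logged while
# the network was down) cover the offline interval reported by its peers.
# The 60-second tolerance is arbitrary; all names are illustrative.
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), UNIX timestamps


def uncovered_gap(offline: Interval, alive_marks: List[float],
                  max_gap: float = 60.0) -> bool:
    """True if the liveness marks leave a hole wider than max_gap inside offline."""
    start, end = offline
    marks = sorted(t for t in alive_marks if start <= t <= end)
    checkpoints = [start] + marks + [end]
    return any(b - a > max_gap for a, b in zip(checkpoints, checkpoints[1:]))


# Peers saw the host offline from t=1000 to t=1600, but the agent's own log
# stops at t=1100 and resumes at t=1500: a 400-second hole, enough to suspect
# the agent itself was disabled during the "network outage".
if uncovered_gap((1000.0, 1600.0), [1030.0, 1080.0, 1100.0, 1500.0, 1560.0]):
    print("alert: agent was not alive for part of the offline interval")
```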
Obviously, machine failures get detected as communication failures. However, many of these failure modes can be disambiguated using supplemental clues. In scenario 3a, the agent gets a message that the machine is being shut down, and is typically given time to stop of its own volition. It can take advantage of that moment to report a final telemetry beacon indicating the situation.
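On a POSIX host, that final beacon can be hooked on the termination signal the agent receives at shutdown; the sketch below uses a hypothetical send_beacon() uplink, and on Windows the analogous hook would come from the service control manager’s shutdown notification.

```python
# A sketch of the "final beacon on shutdown" idea for a POSIX host, where the
# agent receives SIGTERM when the machine is being brought down; send_beacon()
# and the payload fields are illustrative, not the actual SNOW telemetry.
import signal
import sys
import time


def send_beacon(payload: dict) -> None:
    """Placeholder for the agent's telemetry uplink to the analytics cloud."""
    print("beacon ->", payload)


def on_shutdown(signum, frame):
    """Use the grace period granted at shutdown to report why we are stopping."""
    send_beacon({
        "event": "host_shutting_down",
        "signal": signal.Signals(signum).name,
        "sent_at": time.time(),
    })
    sys.exit(0)


signal.signal(signal.SIGTERM, on_shutdown)
signal.pause()  # stands in for the agent's main loop
```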
In scenario 3b, the agent is forcibly stopped when the machine goes down and restarted when it comes back up. In addition, the machine’s uptime counter gets reset. Therefore, the restarted agent reports the uptime counter: if it is low enough, the diagnostic is complete. Virtual machines being interrupted (scenario 3c), on the other hand, look like network failures until they are resumed. When this happens, there is a gap in the telemetry stream that corresponds to the duration of the interruption. So when the agent beacons again, without any indication of having been restarted and with no telemetry buffered for the silent interval, we understand the machine is virtual and was interrupted. Further indications that the machine is virtual (such as the presence of VMware Tools or access to AWS local queries) reinforce the diagnostic.
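Putting the clues from scenarios 3b and 3c together, the diagnosis made when an agent beacons again can be sketched as a small decision function; every input below is assumed to have been gathered elsewhere by the agent and the cloud.

```python
# A sketch of the reconnection-time diagnosis: a low uptime counter means the
# host rebooted (3b); a silent gap with no restart, no buffered telemetry and
# VM indicators means the virtual machine was interrupted (3c).
def diagnose_reconnect(silence_seconds: float, uptime_seconds: float,
                       agent_was_restarted: bool,
                       telemetry_buffered_during_gap: bool,
                       host_looks_virtual: bool) -> str:
    if uptime_seconds < silence_seconds:
        return "host rebooted (scenario 3b)"
    if (not agent_was_restarted and not telemetry_buffered_during_gap
            and host_looks_virtual):
        return "virtual machine was interrupted (scenario 3c)"
    return "unexplained silence: investigate"


# Example: silent for 15 minutes, uptime of 2 minutes at reconnection.
print(diagnose_reconnect(900.0, 120.0, agent_was_restarted=True,
                         telemetry_buffered_during_gap=False,
                         host_looks_virtual=False))  # -> host rebooted (scenario 3b)
```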
The SNOW endpoint agent is programmed by mere humans, so yes, certain situations get it to crash. There! I said it. Two measures facilitate the detection and remediation of crashes.
First, agent processes are registered with the operating system so as to be restarted as soon as they go down. This way, following a crash, the restarted agent reports immediately to the central analytics cloud, carrying the forensics information it accumulated before the crash event.
Second, the endpoint agent is actually composed of two processes (both registered with the OS, as described above) tracking each other’s life cycle. If either of these two processes goes down, the other reports the fact to the cloud and repeatedly attempts to bring it back up. Therefore, effectively disabling the agent requires bringing down both processes at once. For any one of these processes to crash is odd, but possible; for both of them to crash simultaneously is highly unlikely, unlikely enough to raise an alert regarding the possibility of deliberate termination.
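A POSIX-flavoured sketch of one half of such a watchdog pair follows, assuming each process knows its sibling’s PID and the command line needed to respawn it; report_to_cloud() and the command are illustrative.

```python
# One half of the mutual watchdog: poll the sibling process, report its death
# to the cloud, and keep trying to bring it back up.
import os
import subprocess
import time


def report_to_cloud(event: dict) -> None:
    print("cloud <-", event)


def sibling_alive(pid: int) -> bool:
    """Signal 0 checks for the process's existence without affecting it."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False


def watch_sibling(pid: int, respawn_cmd: list) -> None:
    child = None  # set once we have had to respawn the sibling ourselves
    while True:
        alive = (child.poll() is None) if child else sibling_alive(pid)
        if not alive:
            report_to_cloud({"event": "sibling_agent_down", "pid": pid,
                             "observed_at": time.time()})
            child = subprocess.Popen(respawn_cmd)  # retried on every failed check
            pid = child.pid
        time.sleep(2)
```

Both processes run this loop against each other, so bringing the agent down for good means killing both within the same couple of seconds.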
The last point makes it clear that scenario 1a cannot go undetected: the agent is restarted right away by the OS or by its watchdog, so either the malware’s actions get recorded despite the attacker’s intent, or the attacker keeps killing the agent and generates so much noise that he draws the attention of hunters. Scenario 1b is much more pernicious, as its fingerprint is very similar to that of scenario 3c. That said, some heuristics can work in our favor:

- the host carries independent indications of being virtual, such as VMware Tools being installed or local AWS queries being answered;
- the agent’s peers lost their connections to the host for the whole duration of the telemetry gap, as they would if the entire machine had been frozen;
- the offline duration the agent reports upon reconnection agrees with what its peers observed.
If any of these heuristics fails to hold, an alert should be raised and a full investigation conducted on the target host.
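These checks lend themselves to a simple alert rule: escalate unless every one of them holds. A toy sketch, with every input assumed to have been gathered by the agent and its peers:

```python
# A toy sketch of the alert rule above; every input is assumed to have been
# gathered elsewhere (VM indicators on the host, interval reports from peers).
def should_alert(host_confirmed_virtual: bool,
                 peers_lost_host_for_whole_gap: bool,
                 reported_duration_matches_peers: bool) -> bool:
    """Raise the red flag unless every heuristic for a benign scenario 3c holds."""
    return not (host_confirmed_virtual
                and peers_lost_host_for_whole_gap
                and reported_duration_matches_peers)


if should_alert(host_confirmed_virtual=True,
                peers_lost_host_for_whole_gap=False,
                reported_duration_matches_peers=True):
    print("alert: possible agent suspension disguised as a VM interruption")
```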
The last frontier in agent disablement attacks is full agent communications spoofing: an attacker reverse-engineers the communications protocol between the agent, its peers and the central analytics cloud, then responds correctly to neighbor requests and plays dumb with the cloud. This is not an easy attack to pull off, as it takes lots of resources to perform the reverse engineering and then to communicate without tripping any behavior-normality heuristic. However, it underscores a fundamental weakness of endpoint protection: up to now, it has been assumed that agents would communicate without any external injection of trust. Therefore, agents can authenticate the cloud, but agents cannot authenticate each other, nor can the cloud authenticate an agent. In other words, the cloud is never sure whether it is really speaking to an agent, or to a dog. We Arcadians are hard at work on this problem as this is written… stay tuned.