# Machine Intelligence: an Anomaly Detection Blog


There are many possible ways to detect “data exfiltration” (data theft), but in many cases this involves either manual inspection of raw data or the application of rules or signatures for specific behavioral violations. An alternative approach is to detect data exfiltration with automated behavioral anomaly detection, using data that you’re probably already collecting and storing, and without a DLP-specific security tool.

The key thing to note when using behavioral anomaly detection to expose data exfiltration is that you will be looking for deviations in behavior among users or systems - that is, you’re assuming that users or systems that exfiltrate data act differently than the typical user or system. The deviation is measured either against that entity’s own history (temporal deviation) or against others within a population (peer analysis).
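As a minimal sketch of both ideas (all entity names, byte counts, and the 3-sigma threshold here are hypothetical illustrations, not Prelert's implementation), each kind of deviation can be expressed as a simple z-score against a different reference set:

```python
from statistics import mean, stdev

def zscore(value, reference):
    """How many standard deviations `value` sits from the reference set's mean."""
    mu, sigma = mean(reference), stdev(reference)
    return 0.0 if sigma == 0 else (value - mu) / sigma

# Temporal deviation: compare a user's latest byte count to their own history.
own_history = [120, 130, 110, 125, 115]   # MB sent per day, hypothetical
today = 900
temporal = zscore(today, own_history)

# Peer analysis: compare the same value to the rest of the population today.
peers_today = [118, 140, 95, 130, 122, 105]
peer = zscore(today, peers_today)

THRESHOLD = 3.0  # flag anything more than 3 sigma from the reference
print(temporal > THRESHOLD, peer > THRESHOLD)  # True True
```

The same value can trip one test and not the other: a backup server moving terabytes nightly is normal against its own history but extreme against its peers, which is why the two views complement each other.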

### Review

In part 1 of this post, we talked about the failed paradigm of using thresholds and rules or 'eyeballs on timecharts' to monitor a critical app or service.

Thresholds are notorious for generating 'noise.' It is tremendously difficult to create a sufficiently accurate combination of thresholds and rules to monitor anything but the most egregious indicators of system failure. Some KPIs (key performance indicators), like response time for a standard query, may seem straightforward. One might suspect that this should never be more than, say, 1,000 milliseconds. But you can be pretty much guaranteed that the actual response time will vary widely depending on physical distances, other server workloads, and network congestion. As a result, we would generate a large percentage of false positives with such alert conditions. Given the difficulty of defining accurate alert conditions for any KPI, the number of KPIs chosen to be monitored this way is often exceedingly low.
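The false-positive problem is easy to demonstrate numerically. In this hypothetical sketch (the latencies and the 1,000 ms cut-off are invented for illustration), every measurement comes from a perfectly healthy system, yet the static threshold still alerts on almost a third of them:

```python
# Hypothetical response times (ms) for the same standard query, varying
# with distance, server load, and congestion -- all from healthy requests.
healthy = [220, 480, 1150, 640, 1320, 310, 980, 1510, 420, 870]

THRESHOLD_MS = 1000  # the static alert condition

alerts = [t for t in healthy if t > THRESHOLD_MS]
false_positive_rate = len(alerts) / len(healthy)
print(false_positive_rate)  # 0.3 -- nearly a third of healthy requests alert
```

Lowering the threshold makes this worse; raising it means missing real degradations, which is exactly the bind that pushes teams toward learned baselines instead of fixed numbers.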

In a previous blog post about optimizing the performance of the Engine API, I mentioned that choosing the proper bucketSpan offers a possible performance improvement, and I also alluded to bucketSpan affecting the timeliness and quality of your results. In effect, there is a three-way balance between performance, timeliness of results, and quality of results that I’d like to dig into further here.

• Quality of results - The choice of bucketSpan provides different “views” into the data. In general, if you want to maximize detection and minimize your false alarm rate, choose a bucketSpan roughly equal to the typical duration of an anomaly you would want to know about. At first that sounds like generic advice, but let’s look at it in the context of an example: analyzing a log for the occurrence rate (count) of events with some error code. Imagine 2,880 errors were suddenly seen in the span of 1 minute, and then they stopped. This would be highly anomalous and interesting to know about. With a 1-minute bucketSpan, that burst stands out starkly; with a 1-day bucketSpan, the same 2,880 events are absorbed into the daily total and the spike may barely register.
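This trade-off can be sketched numerically (the steady error rate and the burst are hypothetical, chosen to match the example above):

```python
# Hypothetical event stream: a steady 2 errors/minute over one day,
# plus a burst of 2,880 errors in a single minute.
MINUTES_PER_DAY = 24 * 60
per_minute = [2] * MINUTES_PER_DAY
per_minute[600] += 2880  # the one-minute burst

def bucket_counts(counts, span_minutes):
    """Sum per-minute counts into buckets of `span_minutes` minutes."""
    return [sum(counts[i:i + span_minutes])
            for i in range(0, len(counts), span_minutes)]

# 1-minute bucketSpan: the burst bucket is ~1,440x a typical bucket.
one_min = bucket_counts(per_minute, 1)
# 1-day bucketSpan: the burst merely doubles the daily total
# (2*1440 + 2880 = 5760 vs. a normal day's 2880), and you would
# learn about it up to a day late.
one_day = bucket_counts(per_minute, MINUTES_PER_DAY)
print(max(one_min), one_day[0])  # 2882 5760
```

The same data, viewed at two spans, goes from "unmissable spike" to "slightly busy day" - which is why matching the span to the anomaly duration you care about matters.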

Static code analysis has long been touted as a must-have for high-quality software. Unfortunately, my experience with it in previous jobs didn't live up to the hype. Within the last few years the majority of compilers have added built-in static code analysis capabilities, so I thought it would be interesting to see how good they are.

The two static code analysis tools for C++ that I've integrated into the Prelert build system are those provided with clang and Visual Studio 2013.

Burdened by heavy traffic, a major metropolitan city worked with us to find a solution that would help it become a “smart city.” The city knew it needed to collect metrics and data points related to travel time for cars and buses, accidents, construction zones, and congestion in general. Once this massive amount of data was collected, however, how were they to prioritize which projects to work on first to have the greatest impact on clearing congestion? How could they identify significant increases in journey times, or identify which roads were most significant so they could be sure to clear any accidents there first?

Since the city already calculated incident data and average journey times for a large number of the roads (which they divided into segments for data collection purposes), Prelert was able to easily analyze that data and correlate the journey times with the incident data. This correlated data was then charted such that the incidents were prioritized by impact on journey times, and then displayed in real-time on a map. At a glance, it was clear which traffic incidents and accidents caused the worst congestion at a given time.
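The prioritization step can be sketched in a few lines. All segment names and journey times below are hypothetical stand-ins, not the city's data; the idea is simply to rank open incidents by the extra journey time they cause:

```python
from statistics import mean

# Hypothetical per-segment data: baseline journey times (minutes) and
# journey times observed while an incident was open on that segment.
segments = {
    "A12-north":   {"baseline": [8.0, 8.5, 7.9], "during_incident": [22.0, 25.5]},
    "ring-road-3": {"baseline": [12.0, 11.5],    "during_incident": [13.0, 12.8]},
    "bridge-st":   {"baseline": [5.0, 5.2, 4.9], "during_incident": [15.1, 14.7]},
}

def impact(times):
    """Extra minutes per trip attributable to the incident."""
    return mean(times["during_incident"]) - mean(times["baseline"])

# Clear the incidents with the largest journey-time impact first.
ranked = sorted(segments, key=lambda s: impact(segments[s]), reverse=True)
print(ranked)  # ['A12-north', 'bridge-st', 'ring-road-3']
```

Note that the minor slowdown on the busy ring road ranks last: what matters is the deviation from each segment's own baseline, not the absolute journey time.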

According to a TRAC Research survey on IT performance management challenges, the top two issues were 'problems reported by end-users before IT finds them' and 'too much time spent troubleshooting.'

Despite crazy advances in every other field of IT technology, this problem really hasn't changed much in the last 20 years!

Whether you are in WebOps, DevOps, ITOps or App Support your job is not easy. Things move faster every year and get more complex while business pressures put a premium on people and time. A number of IT organizations, large and small, have begun to overcome these challenges by taking a lesson from the big data analytics and Internet of Things (IoT) playbooks and incorporating machine learning anomaly detection in their management stack.

The good news is we can show you how to get there through the following 3 incremental steps.

As with any piece of software, there are performance considerations. If you’ve followed any of our developer blogs, you’ll quickly realize that Prelert’s engineers take creating high-performance software seriously. But performance is not only a matter of how the software is architected; it is also a matter of how you use it. Here we will discuss some operational techniques that will optimize the performance of the Anomaly Detective Engine API.

Tip #1) Improve the quality of the data

Recently, in the context of trying to understand how to quantify unusually common categories, I have found myself needing to study various properties of distributions on categorical data. By way of motivation, here is what I mean by “the unusually common.” Suppose we have a set of categories, we know their relative frequencies, and someone has labeled a collection of these categories as interesting - maybe they were some application’s log messages at the time of a performance problem. The unusually common would be the unusually frequent categories in this labeled collection. I won’t discuss that problem any further in this post; instead I’ll cover some background material on distributions on categorical data, and return to the problem of the unusually common in my next post.

Common building-block distributions used to describe categorical data are the Bernoulli and categorical distributions. In fact, the Bernoulli is really just a special case of the categorical distribution with two categories. The categorical distribution is the distribution of a random variable that takes one of $$m$$ categories with probabilities $$\{p_i\}=\{p_1,p_2,...,p_m\}$$. The distributions I’m going to focus on result from counting the number of occurrences of each category in a collection of independent samples of such random variables: these are the binomial and multinomial distributions, respectively.
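For concreteness, if $$n$$ independent samples are drawn and $$n_i$$ of them fall into category $$i$$, the multinomial probability mass function is

$$P(n_1, n_2, \dots, n_m) = \frac{n!}{n_1!\,n_2!\cdots n_m!}\,p_1^{n_1} p_2^{n_2} \cdots p_m^{n_m}, \qquad \sum_{i=1}^{m} n_i = n,$$

and the binomial is just the $$m=2$$ special case, $$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$$ for $$k$$ occurrences of the first category out of $$n$$ trials.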

In a previous blog, I showed how easy it is to analyze multiple metrics simultaneously by adding multiple “detectors” to your job configuration definition for the Anomaly Detective Engine API. Now, let’s take it a step further by expanding analysis across instances of things by using “byFieldName” and “partitionFieldName.”

The concepts of the “by field” and "partition field" were originally developed for the Anomaly Detective Splunk app. In Splunk, there is a notion of a “group by clause” where one can get separate instances of things simply by naming them in the by-clause, and partition fields are specified using the "partitionfield=<fieldname>" option to the prelertautodetect command. In the Engine API, you can leverage the same capabilities using “byFieldName” and “partitionFieldName." I'll elaborate on the difference between these in a future post, but for now let’s just jump to a simple example.
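As a minimal, hypothetical sketch of such a job configuration (the data field names `status` and `airline` are invented for illustration, and the exact JSON shape is an assumption to check against the Engine API documentation, not a verbatim sample):

```json
{
  "analysisConfig": {
    "bucketSpan": 3600,
    "detectors": [
      {
        "function": "count",
        "byFieldName": "status",
        "partitionFieldName": "airline"
      }
    ]
  },
  "dataDescription": {
    "fieldDelimiter": ","
  }
}
```

Read loosely, this asks the engine to model event counts per `status` value, split out per `airline` - one detector definition fanning out across every instance of both fields.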