
ULTIMATE GUIDE TO BUILDING A MACHINE LEARNING ANOMALY DETECTION SYSTEM
PART 1: DESIGN PRINCIPLES
Anomaly detection is an imperative for digital businesses today,
but it is a complex task to design and build a truly effective
system in-house.

INTRODUCTION
It has become a business imperative for high velocity online businesses to
analyze patterns of data streams and look for anomalies that can reveal
something unexpected. Most online companies already use data metrics to tell
them how the business is doing, and detecting anomalies in the data can lead to
saving money or creating new business opportunities. This is where the online
world is going; everything is data- and metric-driven to gauge the state of the
business right now.

The types of metrics that companies capture and compare differ from industry to
industry, and each company has its own key performance indicators (KPIs).
What's more, regardless of industry, all online businesses measure the
operational performance of their infrastructure and applications. Monitoring and
analyzing these data patterns in real-time can help detect subtle, and sometimes
not-so-subtle, unexpected changes whose root causes warrant investigation.

Automated anomaly detection is a technique of machine learning, and it is a
complex endeavor. There might be hundreds, thousands or even millions of
metrics that help a business determine what is happening right now compared to
what it has seen in the past or what it expects to see in the future. Data patterns
can evolve as well as interact, making it difficult to understand what data models
or algorithms to apply. Companies that use the right models can detect
even the most subtle anomalies. Those that do not apply the right models can
suffer through storms of false-positives or, worse, fail to detect a significant
number of anomalies, leading to lost revenue, dissatisfied customers, broken
machinery or missed business opportunities.

Anodot was founded in 2014 with the purpose of creating a commercial system
for real-time analytics and automated anomaly detection. Our technology has
been built by a core team of highly skilled and experienced data scientists and
technologists who have developed and patented numerous machine learning
algorithms to isolate and correlate issues across multiple parameters in real-time.

The techniques described within this white paper series are grounded in data
science principles and have been adapted or utilized extensively by the
mathematicians and data scientists at Anodot. The veracity of these techniques
has been proven in practice across hundreds of millions of metrics from Anodot's
large customer base. A company that wants to create its own automated
anomaly detection system would encounter challenges similar to those described
within this document series.

This document, Part 1 in a three-part series, covers the design principles of
creating an anomaly detection system. Part 1 will explore various types of
machine learning techniques, and the main design principles that affect how that
learning takes place. Part 2 of the document series will explore a general
framework for learning normal behavior of a time series of data. Part 3 will cover
the processes of identifying and correlating abnormal behavior. Each of the
documents will discuss the technical challenges and Anodot's solutions to these
challenges.

WHY COMPANIES NEED ANOMALY DETECTION


If anomaly detection is so complex and full of technical challenges, why do it? To
detect the unknown.

In a high velocity business, many things occur simultaneously, and different
people/roles are charged with monitoring those activities. For example, at the
level of the underlying infrastructure, a technical IT group carefully monitors the
operation and performance of the network, the servers, the communication links,
and so on. At the business application level, an entirely different group monitors
factors such as web page load times, database response time, and user
experience. At the business level, analysts watch shopping cart conversions by
geography and by user profile, conversions per advertising campaign, or
whatever KPIs are important to the business.

Anomalies in one area can affect other areas, but the association might never be
made if the metrics are not analyzed on a holistic level. This is what a large-scale
anomaly detection system should do.

Still, the question arises, why care about anomalies, especially if they simply seem
to be just blips in the business that appear from time to time? Those blips might
represent significant opportunities to save money (or prevent losing it) and to
potentially create new business opportunities. Consider these real-life incidents:

An e-commerce company sells gift cards and sees an unexpected increase
in the number of cards purchased. While that sounds like a great thing,
there is also a corresponding drop in the revenue expected for the gift
cards. Something strange is going on, and it turns out to be a price glitch,
something quite common for e-commerce companies. Without looking at
these two metrics together, it is hard to understand that there is a
business incident that could cost the company a lot of money if not caught
and addressed quickly. The relationship between these two metrics is
shown in Figure 1.

Figure 1 Detecting business incidents

A mobile game company notices a decrease in installations of one of its
games, but may discover it several weeks in and not know why. With an
anomaly detection system, it is easy to determine that a problem with the
cross-promotion mechanism (serving up the wrong ads, or ads with the
wrong link) is what led to the decline.


As businesses grow, more incidents go undetected unless an anomaly detection
system is directed to make sense of the massive volume of metrics available to
every online business. Of course, not every metric is directly tied to money, but
most metrics are tied to revenue in some way. Say an online news site counts
visitors to its website. By itself, the visitor count doesn't lead to revenue, but the
more visitors the news site gets, the more opportunity there is to generate
revenue from ads on the pages, or to convert people who read the news site from
free to paid subscribers.

Most companies today tend to do manual detection of anomalous incidents.
There are two main ways to do this:
One is to create a lot of dashboards, create daily/weekly reports and have
people monitor them to watch for spikes or dips. Then they investigate
whatever looks strange. Obviously, this method isn't scalable to more than
a dozen or so key metrics. A business could potentially detect major
anomalies but would miss many smaller incidents. Moreover, the people
would need to know what to look for, so they will miss things they didn't
think to track or analyze. This process also has the business looking in the
rear-view mirror at events that happened in the past, which delays the
insights by days or weeks.
The second method is to use a system that depends on setting upper and
lower thresholds for each metric. An alert can be generated if a
measurement goes outside those thresholds. The downside here is that
setting these thresholds is complicated as it must be done for each KPI,
which is difficult to do if there are many thousands of them. Setting
thresholds requires an intricate understanding of the behavior of each
metric over time. Also, if the thresholds are too high or too low, there
could be a lot of false-positive alerts, or conversely, a lot of missed
anomalies. In short, finding anomalies by setting thresholds is an
impractical endeavor.
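To make this second manual approach concrete, the short Python sketch below applies
hand-set upper and lower bounds to incoming measurements. The metric names and
threshold values are purely hypothetical; the point is that every one of these limits must
be chosen and maintained by hand, which is what makes the approach impractical at scale.

```python
# Minimal sketch of static threshold alerting (the manual approach described
# above), with hypothetical metrics and hand-picked bounds. Every KPI needs
# its own carefully chosen limits, which is why this does not scale.

# Hand-set (lower, upper) bounds per metric -- hypothetical values.
THRESHOLDS = {
    "checkout_conversions_per_min": (50, 400),
    "page_load_time_ms": (0, 1500),
}

def check_thresholds(metric_name, value):
    """Return an alert string if the value falls outside its static bounds."""
    lower, upper = THRESHOLDS[metric_name]
    if value < lower or value > upper:
        return f"ALERT: {metric_name}={value} outside [{lower}, {upper}]"
    return None

# Example: a spike in page load time triggers an alert.
print(check_thresholds("page_load_time_ms", 2300))
print(check_thresholds("checkout_conversions_per_min", 120))  # None, within bounds
```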

The solution, then, is automated anomaly detection, where computers look at
this data, sift through it automatically and quickly, highlight abnormal behavior,
and alert on it. This is far easier said than done, for the computers need to be
taught what an anomaly is compared to normal activity. This is where machine
learning comes in.


MACHINE LEARNING METHODS


Anomaly detection is a branch of machine learning, which itself is a type of
artificial intelligence that provides computers with the ability to learn without
being explicitly programmed. Machine learning focuses on the development of
computer programs that can teach themselves to grow and change when exposed
to new data. The process of machine learning involves searching through data to
look for patterns and adjusting program actions accordingly. There are two main
types of machine learning methods: supervised and unsupervised.

Supervised learning problems are ones in which the computer can be given a set
of data and a human tells the computer how to classify that data. Future datasets
can then be evaluated against what the machine has already learned. Supervised
learning involves giving the machine many examples and classifications that it
can compare against. This is not feasible for anomaly detection, since it involves a
sequence of data where no one has marked where there are anomalies. The
machine has to detect for itself where the anomalies are, if they even exist.
Thus, a mathematical algorithm that processes the data must be able to detect
an anomaly without being given any examples of what an anomaly is. This is
known as unsupervised machine learning.

A third category exists, known as semi-supervised machine learning, and this is
where Anodot fits in. This is discussed in more detail in the section titled
"Definition of Incidents."


WHAT IS AN ANOMALY?
The challenge for both machines and humans is identifying an anomaly. Very
often the problem is ill-posed, making it hard to tell what an anomaly is. Consider
the set of images in Figure 2.

Figure 2 What is an anomaly?

In this collection of dog pictures, which one is the anomaly? Is it the sleeping dog,
because all the others are awake? Is it the dog with the very long fur? Is it the dog
with balls in his mouth, or the one with a tongue hanging out? Depending on
what criteria are used to define an anomaly, there could be many answers to the
question. Thus, there must be some constraints on identifying an anomaly;
otherwise almost any answer can be considered correct.

That is the philosophical part of anomaly detection. Fortunately, many metrics
from online systems are expressed in time series signals rather than images. In
time series signals, an anomaly is any unexpected change in a pattern in one or
more of the signals. Figures 3, 4 and 5 present examples of unexpected changes
in time series signals. In these examples, the anomalies are shown in orange to
make them easier to visually identify.


Figure 3 Anomalies in a single time series signal

Figure 4 Anomalies in a single time series signal

Figure 5 Anomalies in multiple time series signals

Viewing these examples, humans would probably categorize the signal changes
as anomalies because humans have some concept in mind as to what
"unexpected" means. We look at a long history of the signal and create a pattern
in our minds of what is normal. Based on this history, we expect certain things to
happen in the future.

That is the crux of the question, "What is an anomaly?" It is still ill-defined.

Looking at the example in Figure 6 below, we could say that the two days
highlighted in the red box are anomalous because every other day is much
higher. We would be correct in some sense, but it is necessary to see more
history to know if those days are real anomalies or not. There are a lot of
definitions that go into an algorithm to determine with a high degree of
confidence what an anomaly is.


Figure 6 What part of the signal is anomalous?

But anomaly detection is not impossible. There are obvious things that most
people agree on about what it means to be the same, and that's true for time series
signals as well, especially in the online world where companies measure number
of users, number of clicks, revenue numbers and other similar metrics. People do
have a notion of what "normal" is, and they understand it because, in the end, it
affects revenue. By understanding normal, they also understand when
something is abnormal. Thus, knowing what an anomaly is isn't completely
philosophical or abstract.

That leads to a discussion of the main design principles when building an
anomaly detection system. A company must look at why it needs such a system
and then decide what methods are the right ones for its own use cases.

DESIGN PRINCIPLES OF ANOMALY DETECTION


Based on our experience, there are five main design considerations when
building an automated anomaly detection system:
1. Timeliness - How quickly does the company need an answer to
determine if something is an anomaly or not? Does the determination
need to be in real-time, or is it okay for the system to determine it was an
anomaly after a day, week, month or year has already passed?
2. Scale - Does the system need to process hundreds of metrics, or millions?
Will the datasets be on a large scale or a relatively small scale?
3. Rate of change - Does the data tend to change rapidly, or is the system
being measured relatively static?
4. Conciseness - If there are a lot of different metrics being measured, must
the system produce an answer that tells the whole picture, or does it
suffice to detect anomalies at each metric level by itself?
5. Definition of incidents - Are the expected incidents well-defined? Is
anything known about them in advance in terms of what types of things
can be anomalous in the data? Can incidents be categorized over time?


TIMELINESS AND SCALE


Most online businesses need real-time decision-making capabilities, which means
they have to know what's happening with their data right now. If something
anomalous happens (for example, a sudden drop in site visitors or installs, or an
unexpected spike in sales of a particular product), they need to know quickly in
order to make and act on decisions right away. Each incident could lead to
different types of issues or opportunities that are worthy of investigation. A
sudden increase in product purchases might mean the company needs to
increase its inventory because a celebrity endorsed this product and it is now
wildly popular. Selling out without having new inventory coming in could mean
significant lost revenue. Or, a sudden spike in purchases of a specific product
could mean that the price was entered incorrectly, and shoppers are effectively
getting a 90% discount so they are running up sales.

Non-real-time decision-making can be used for anything that relates to
longer-term planning, such as capacity planning, budget planning, marketing campaign
analysis, A/B testing analysis, scheduled maintenance and data cleaning. The
anomalies are used for a retrospective analysis of what happened, which helps in
making decisions about the future. For example, an anomalous increase in the
number of purchases after a marketing campaign can lead to a decision to invest
in similar campaigns in the future. Real-time decision-making isn't necessary for
those activities. It's sufficient to get yesterday's anomalies a week from now; i.e.,
doing anomaly detection on the data retroactively.

The distinction of when anomaly detection must take place in real-time or not is
critically important in understanding the type of machine learning algorithms
that can be used.

If real-time decision-making is required, the system must use online machine
learning algorithms. Online machine learning algorithms process data
sequentially, and the latest data point is used to update the best predictor for
future data at each step. In other words, the machine learns as the data comes in
piece by piece, rather than waiting for an entire dataset to be available from
which to learn.

Some characteristics of online machine learning algorithms are that they scale
more easily to more metrics and to large datasets, and it is not possible to iterate
over the data: once data is read, it is not considered again. This method is more
prone to presenting false-positives because the algorithm gets a data point and
has to produce a result; it cannot go back to fix that result at a later time. We will
go into more detail about online machine learning in Part 2 of this document
series.

If time is not a critical factor in decision-making, batch machine learning
algorithms can be utilized. A company can collect data over a period of time and
then apply the algorithm to that complete dataset for a list of anomalies that
happened in that time period. There is no time urgency, so the algorithm can go
over the data again and again to improve what the machine learns. A lot of
algorithms do improve by repetitively going over the data, which ultimately
results in fewer false-positives. However, this process is computationally
expensive and scales poorly.

Consider the need to learn the average number of users that come to a particular
website every day. There are two ways to learn this. One is to collect all the data
over a period of several months and then compute that average. The alternative
is to compute it as the data comes in. The latter method means that the company
cannot go back and fix issues in what was previously learned about the average.
The algorithm sees the data, uses it and then sets it aside; however, it does not
go back and fix anything if there were anomalies that skewed the calculation of
the average. This example illustrates why online machine learning algorithms
tend to be more prone to false-positives. On the other hand, online learning
methods tend to be more scalable; it is easier to scale them to a much higher
number of data points/metrics to learn on.
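The minimal Python sketch below contrasts the two approaches using the website-visitor
example; the daily visitor counts are hypothetical. The batch version keeps the full history
and can recompute the average later (for example, after excluding a day identified as
anomalous), while the online version folds each value into a running mean once and
cannot undo it afterwards.

```python
# Batch vs. online computation of an average, as a minimal illustration of the
# trade-off described above (hypothetical daily visitor counts).

def batch_average(values):
    """Batch: the full history is available, so the average can be recomputed
    later, for example after removing points that turned out to be anomalies."""
    return sum(values) / len(values)

class OnlineAverage:
    """Online: each point updates the running mean once and is then discarded,
    so a skewed update cannot be undone later."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count   # incremental mean update
        return self.mean

daily_visitors = [1200, 1310, 1280, 9500, 1250]     # 9500 is an anomalous day

online = OnlineAverage()
for v in daily_visitors:
    online.update(v)
print(online.mean)                                   # 2908.0, skewed by the anomaly

# In batch mode the whole history is still on hand, so the anomalous day can be
# dropped and the average recomputed; the online learner has no such option.
cleaned = [v for v in daily_visitors if v != 9500]
print(batch_average(cleaned))                        # 1260.0
```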

At Anodot, we focus on real-time anomaly detection at a massive scale, which
determines the types of algorithms we can use. Indeed, we use online machine
learning algorithms.

RATE OF CHANGE
In our experience, most of the businesses that we work with have constant
change in their metrics. Their environment changes; they release new products
or new versions of their applications, which in turn changes the patterns of how
people use them.

The graph in Figure 7 below is an example from an Anodot customer; it
represents a technical metric related to an application. It is obvious that the
metric had a pattern that was fairly consistent for a while, and then it completely
changed its behavior at some point in time, and it stayed this way for a long time.


Figure 7 This metric has a change in its pattern

Some systems change very slowly. They tend to be closed systems that are not
impacted by outside events. For example, automated manufacturing processes
tend to be closed systems that do not change much over time. When a company
manufactures some sort of widget, the process is typically fairly steady. The
machine operates with the same amount of force to stamp out each widget; the
widget is heated in a furnace that is within a strict temperature range; the
conveyor belt moves the widgets down the line at a constant speed; and so on
throughout the manufacturing process. A metric with a slow rate of change might
look like the graphic shown in Figure 8 below.

Figure 8 This process has a slow rate of change

The rate of change has implications for the learning algorithms that an anomaly
detection system should use. If a system has constant changes, which most
online businesses do, then the system needs adaptive algorithms that know to
take into account that things change. However, if the rate of change is very slow,
the system can collect a year's worth of data and learn what is normal from that
dataset. Then the model should not need to be updated for a long time. For
example, the process to manufacture a widget is not going to change, so the
anomaly detection system can apply an algorithm that does not have to be
adaptive to frequent changes. On the other hand, data pertaining to an e-
commerce website will change frequently because that is just the nature of the
business.

At Anodot, the most interesting problem to solve is the one that changes
frequently, so we utilize highly adaptive algorithms that take this into account.


CONCISENESS

Conciseness means the system takes multiple metrics into account at once for a
holistic look at what is happening.

As an example of what is meant by conciseness: consider the human body as a
type of system. It is possible to measure a person's vital signs, including blood
pressure, body temperature, pulse rate and respiration rate. Assume that a
person is wearing sensors for all those vital signs, and measurements are taken
every minute. Under normal conditions (i.e., no sickness), measurement ranges
of the vital signs are stable; a person's body temperature does not fluctuate by
much throughout a normal day. However, if the person comes down with the flu,
the signals from the body will appear anomalous: temperature goes up, pulse
rate might change as the person becomes lethargic, respiration rate changes as
breathing becomes more labored, and so on. All of these metrics, these vital
signs, become anomalous more or less together.

Now the question is, how does a doctor look at all of these vital signs? If he or she
looks at each measurement by itself, it will not be clear what is going on. When
the pulse rate decreases, by itself, it does not tell the doctor very much. Only the
combination of these metrics can begin to tell a story of what is going on.

And so it is with many business systems. Businesses measure many metrics, and
all of them together tell a story about an issue. For example, a buggy deployment
of a new version of a service may lead to high latency in page loads, high bounce
rates, reduced number of visitors, and many more anomalous metrics. Viewed
individually, each metric may not point to the root cause, but together they tell a
clear story. For proper root cause analysis on most incidents, a company needs
concise anomalies.

In terms of design of the anomaly detection system, there are two methods that
take conciseness into consideration: univariate anomaly detection and
multivariate anomaly detection.

UNIVARIATE ANOMALY DETECTION


With univariate anomaly detection, the system looks at each metric by itself,
learning its normal patterns and yielding a list of anomalies for each single
metric. Oftentimes, it is difficult to perform root cause analysis of an issue
because it is hard to see the forest for the trees. The advantage of univariate
anomaly detection is that it is a lot easier to do than other methods. It is easier to
scale in terms of computation. Less data is needed to learn what is normal
because the system looks at each metric by itself, as opposed to looking at
combinations of metrics. It is possible to model a lot of different types of metric
behaviors. However, when something unexpected happens that affects a lot of
metrics, the system yields a storm of anomalies. Now someone has to sift
through them to understand what is happening.

MULTIVARIATE ANOMALY DETECTION


Multivariate anomaly detection techniques take input from all the signals
together as one, without separating them out. In the example of the human
body, take all of a person's vital signs and put them into a black box that outputs
a single model of what is normal for all of them together. When the person has
the flu, the anomaly describes the flu based on all the vital signs together.

There are downsides to using multivariate anomaly detection techniques. For one
thing, these methods are very hard to scale. They are best when used with just
several hundred or fewer metrics. Also, it is often hard to interpret the cause of
the anomaly. All of the metrics are taken as input, but the output simply says
there is something strange (an anomaly) without identifying which metric(s) it is
associated with. In the healthcare analogy, the doctor would put in the vital signs
and receive "the patient is sick," without any further explanation about why.
Without having insight into what is happening with each metric, it is hard to know
which one(s) affect the output, making it hard to interpret the results.

Another technical issue with these multivariate techniques is that they require all
the measured metrics to be somewhat homogeneous in their behavior; i.e., the
signal type must be more or less similar. If the set of signals or metrics behave
very differently from each other, then these techniques tend to not work well.

A HYBRID APPROACH
The univariate method causes alert storms that make it hard to diagnose why
there is an anomaly, and the multivariate methods are hard to apply. Anodot
utilizes a hybrid approach to take advantage of the good aspects of each method,
without the technical challenges they present. Anodot learns what is normal for
each one of the metrics by themselves, and after detecting anomalies the system
checks if it can combine them at the single metric level into groups and then give
an interpretation to that group.

We never have a model that indicates how all the metrics should behave
together. Instead we have a model for each metric by itself, but when some of
them become anomalous, we look for smart ways to combine related anomalies
into a single incident. This hybrid approach offers a practical way to achieve very
good results. The main challenge in this approach is how to know which metrics
are related to each other. We will describe how to use machine learning
algorithms to automatically discover these relationships in the third part of this
series.
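As a rough illustration of this hybrid idea (a sketch, not Anodot's actual implementation),
the Python example below takes per-metric anomalies produced by a univariate detector
and merges them into a single incident when the metrics are related and their anomalies
overlap in time. The metric names and the relationship map are hypothetical and
hard-coded here; discovering those relationships automatically is the subject of Part 3.

```python
# Minimal sketch of the hybrid idea: detect anomalies per metric, then group
# anomalous metrics into one incident when they are related and overlap in time.

# Per-metric anomalies as (metric, start_ts, end_ts) -- hypothetical output of a
# univariate detector.
anomalies = [
    ("gift_cards_sold", 100, 130),
    ("gift_card_revenue", 105, 135),
    ("unrelated_metric", 500, 510),
]

# Assumed-known relationships between metrics (e.g., same business process).
related = {
    "gift_cards_sold": {"gift_card_revenue"},
    "gift_card_revenue": {"gift_cards_sold"},
}

def group_into_incidents(anomalies, related):
    """Greedily merge time-overlapping anomalies on related metrics."""
    incidents = []
    for metric, start, end in anomalies:
        for incident in incidents:
            overlaps = any(start <= e and s <= end for _, s, e in incident)
            linked = any(metric in related.get(m, set()) for m, _, _ in incident)
            if overlaps and linked:
                incident.append((metric, start, end))
                break
        else:
            incidents.append([(metric, start, end)])
    return incidents

print(group_into_incidents(anomalies, related))
# -> the two gift card anomalies form one incident; the unrelated one stands alone
```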

Table 1 summarizes the characteristics of all three approaches to conciseness of
data.

Univariate Anomaly Detection:
- Learn normal model for each metric
- Anomaly detection at the single metric level
- Easier to scale to large datasets and many metrics
- Easier to model many types of behaviors
- Causes anomaly storms - can't see the forest for the trees

Multivariate Anomaly Detection:
- Learn a single model for all metrics
- Anomaly detection of a complete incident
- Hard to scale
- Hard to interpret the anomaly
- Often requires metric behavior to be homogeneous

Hybrid Approach:
- Learn normal model for each metric
- Combine anomalies into single incidents if metrics are related
- Scalable
- Make interpretation from groups of anomalies
- Can combine multiple types of metric behaviors
- Requires additional methods for discovering the relationships

Table 1 Characteristics of univariate, multivariate and hybrid anomaly detection methods

DEFINITION OF INCIDENTS
The last design principle asks the question, "Are incidents well-defined?" While
the answer to this question is typically "no," as incidents for an online business
are almost never well-defined, we will cover it because it provides the opportunity
to further discuss supervised versus unsupervised learning and apply it to the
design principle. In addition, it may be that over time, a business can define some
incidents, leading to semi-supervised learning techniques.

A well-defined incident is one in which all (or at least most) of the potential
causes of anomalies can be enumerated. This typically applies to a closed system
with a very limited number of metrics. It might apply, for example, to a relatively
simple machine where the product designers or engineers have written
documentation of what could go wrong. That list could be used to learn how to
detect anomalies. However, for an e-commerce website, it would be a Sisyphean
task to try to list what could go wrong and break those things down to tangible
incidents where mathematical models could be applied.


If a system has well-defined incidents, it is possible to apply supervised learning
techniques. There is a well-defined set of incidents; now the system simply must
classify the data into these sets. It requires labeled examples of anomalies, but it
will be limited to the types of things the company is trying to use its system to
find. If there is a list of a hundred different things that could go wrong and the
system is trained on all of them, but then tomorrow the 101st thing happens that
was not previously considered, the system will not be able to find it because it
was not trained to do so.

Supervised learning methods are very powerful because when they are properly
trained on a set of well-defined incidents, they will catch those incidents and
provide an exact definition for them. In terms of learning algorithms, they are
usually more accurate than unsupervised learning methods. There are fewer
false-positives, but a prerequisite is having a well-defined set of incidents to start
with, and that is usually where companies encounter problems. It is very difficult
to prepare a list of well-defined incidents and every possible thing that could
happen.

The flipside is unsupervised learning methods, where a system learns what is
normal over time. Anomalies are detected whenever the data that is now
presented deviates from that normal model. The good thing is that such a system
can detect any type of incident, known or unknown. The disadvantage is that the
system is dependent on how "normal" has been defined. If the learning system
does not do a good job of defining what is normal, it will get poor results with its
anomaly detection. The system needs to know what the criteria are for being
normal, and the detection technique is sensitive to that.

Then there are semi-supervised learning methods. Sometimes people can
categorize examples where "this is a real anomaly, but that is not a real
anomaly." It usually covers a very small subset of all the examples that can be
identified. But getting even a few can be helpful. The question then becomes, can
the unsupervised techniques be improved to be a little bit more supervised? For
example, can it help a company choose what the normal model is that it needs to
use? Can it help in choosing other things in the algorithms that learn what normal
is and detect what is abnormal?


The answer is yes. There is a whole field called semi-supervised learning, and this
is a technique that Anodot uses. We collect some feedback in our system to
improve how we learn what is normal and how we detect anomalies based on a
few examples that we get from our users. This helps in making the assumption of
what is normal.

Table 2 summarizes the characteristics of these learning methods.

Supervised methods (when incidents are well-defined):
- Requires a well-defined set of incidents to identify
- Learning a model to classify data points as normal or abnormal
- Requires labeled examples of anomalies
- Cannot detect new types of incidents

Unsupervised methods (when incidents are not well-defined):
- Learning a normal model only
- Statistical test to detect anomalies
- Can detect any type of anomaly, known or unknown

Hybrid Approach:
- Use a few labeled examples to improve detection of unsupervised methods, OR
- Use unsupervised detection for unknown cases, supervised detection to classify already known cases

Table 2 Characteristics of supervised and unsupervised learning, and a hybrid method

SUMMARY
Building an automated anomaly detection system for large scale analytics is a
tremendously complex endeavor. The sheer volume of metrics, as well as data
patterns that evolve and interact, make it challenging to understand what data
models to apply. Numerous algorithm design principles must be considered, and
if any of them are overlooked or misjudged, the system might overwhelm users with
false-positives or fail to detect important anomalies that can affect the business. Data
scientists must consider, for example:
Timeliness - how quickly a determination must be made on whether
something is an anomaly or not
Scale - how many metrics must be processed, and what volume of data
each metric has
Rate of change - how quickly a data pattern changes, if at all
Conciseness - whether all the metrics must be considered holistically
when looking for anomalies, or if they can be analyzed and assessed
individually
Definition of incidents - how well anomalous incidents can be defined in
advance

Anodot has built an anomaly detection system with these design considerations
in mind. In parts 2 and 3 of the series, we will explain how these design
considerations played into building the Anodot system. Part 2 looks at various
ways that an anomaly detection system can learn the normal behavior of time
series data. The concept of what is normal is a critical consideration in deciding
what is abnormal; i.e., an anomaly. And part 3 explores the processes of
identifying and correlating abnormal behavior.

For more information, please contact Anodot:

North America
669-600-3120
info.us@anodot.com

International
+972-9-7718707
info@anodot.com

ABOUT ANODOT
Anodot was founded in 2014, and since its launch in January 2016 has been
providing valuable business insights through anomaly detection to its customers
in financial technology (fin-tech), ad-tech, web apps, mobile apps, e-commerce
and other data-heavy industries. Over 40% of the company's customers are
publicly traded companies, including Microsoft, VF Corp, Waze (a Google
company), and many others. Anodot's real-time business incident detection uses
patented machine learning algorithms to isolate and correlate issues across
multiple parameters in real time, supporting rapid business decisions. Learn
more at http://www.anodot.com/.

Copyright 2017, Anodot. All trademarks, service marks and trade names
referenced in this material are the property of their respective owners.


ULTIMATE GUIDE TO BUILDING A MACHINE LEARNING ANOMALY DETECTION SYSTEM
PART 2: LEARNING NORMAL TIME SERIES BEHAVIOR

Anomaly detection is an imperative for online businesses today,
and building an effective system in-house is a complex task. It is
a particular challenge to first learn the normal behavior of data
metrics, in order to identify events that differ from the norm; i.e.,
anomalies.

INTRODUCTION
Anomaly detection helps companies determine when something changes in their
normal business patterns. When done well, it can give a company the insight it
needs to investigate the root cause of the change, make decisions, and take
actions that can save money (or prevent losing it) and potentially create new
business opportunities.

High velocity online businesses need real-time anomaly detection; waiting for
days or weeks after the anomaly occurs is simply too late to have a material
impact on a fast-paced business. This puts constraints on the system to learn to
identify anomalies quickly, even if there are a million or more relevant metrics
and the underlying data patterns are complicated.

Automated anomaly detection is a technique of machine learning, and it is a
tremendously complex endeavor. In this series of white papers, Anodot aims to
help people understand some of the sophisticated decisions behind the
algorithms that comprise an automated anomaly detection system for large scale
analytics. In Part 1 of this white paper series, we outlined the various types of
machine learning and the critical design principles of an anomaly detection
system. We highly recommend reading Part 1 to get the foundational information
necessary to comprehend this document.

In Part 2, we will continue the discussion with information about how systems
can learn what normal behavior looks like, in order to identify anomalous
behavior. Part 3 of our white paper series will cover the processes of identifying
and correlating abnormal behavior. In each of the documents, we discuss the
general technical challenges and Anodot's solutions to these challenges.

The techniques described within this paper are well grounded in data science
principles and have been adapted or utilized extensively by the mathematicians
and data scientists at Anodot. The veracity of these techniques has been proven
in practice across hundreds of millions of metrics from Anodot's large customer
base. A company that wants to create its own automated anomaly detection
system would encounter challenges like those described within this document.

A GENERAL FRAMEWORK FOR LEARNING NORMAL BEHAVIOR


The general process of any anomaly detection method is to take data, learn what
is normal, and then apply a statistical test to determine whether any data point
for the same time series in the future is normal or abnormal.

Consider the data pattern in Figure 1 below. The shaded area was produced
by such a statistical analysis. We could, therefore, apply statistical tests
such that any data point outside of the shaded area is defined as abnormal and
anything within it is normal.

Figure 1 A general scheme for anomaly detection.


The graph below shows a normal distribution, characterized by its average and
standard deviation. Given a large number of data points, 99.7% of the data points
submitted should fall within the average, plus or minus three times the standard
deviation. This model is illustrated with the formula in Figure 2.

Figure 2 The mathematical formula for average standard deviation.

Making this assumption means that if the data comes from a known distribution,
then 99.7% of the data points should fall within these bounds. If a data point is
outside these bounds, it can be called an anomaly because the probability of it
happening normally is very small.

This is a very simple model to use and to estimate. It is well known and taught in
basic statistics classes, requiring only computation of the average and the
standard deviation. However, assuming any type of data will behave like the
normal distribution is naive; most data does not behave this way. This model is,
therefore, simple to apply, but usually much less accurate than other models.
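As a concrete illustration of this simple model only, the Python sketch below estimates
the average and standard deviation from historical samples and flags a new value that
falls outside the average plus or minus three standard deviations. The sample values
are hypothetical.

```python
import statistics

# Minimal sketch of the simple model described above: learn the average and
# standard deviation from history, then flag a new point that falls outside
# average +/- 3 standard deviations (the 99.7% bound). As the text notes, this
# is naive for most real metrics.

def is_anomaly(history, new_value, k=3.0):
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return abs(new_value - mean) > k * std

# Hypothetical history of a fairly stable metric.
history = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.1, 9.7, 10.0, 10.2]
print(is_anomaly(history, 10.4))   # False, within the expected band
print(is_anomaly(history, 25.0))   # True, far outside average +/- 3*std
```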

There are many different distributions that can be assumed on data; however,
given a very large dataset, there are most likely many different types of behavior
in the data. This very fact has been thoroughly researched for hundreds of years,
and even more so in the last 50 years as data science has become important in
the computing world. But the question is, given a huge amount of literature,
techniques and models to choose from, how can someone choose only one
model?


The answer is that it is not possible to choose just one.

At Anodot, we look at a vast amount of time series data and see a wide variety of
data behaviors, many kinds of patterns, and diverse distributions that are
inherent to that data. There is not one type of distribution that fits all possible
metrics. There has to be some way to classify each signal to decide which should
be modeled with a normal distribution, and which should be modeled with a
different type of distribution and technique.

Choosing just one model does not work, and we have seen it even within a single
company when they measure many different metrics. Each metric behaves
differently. In Part 1 of this document series, we used the example of a person's
vital signs operating as a complete system. Continuing with that example, the
technique for modeling the normal behavior of a person's heart rate may be very
different from that which models his or her temperature reading.

The body's temperature is a reading that is very close to a normal distribution. If
we were to graph it for a single person over time, we would see that it fits the
normal distribution very well, but a lot of other vital signs are not like that at all.

If we look at a person's heart rate, it changes constantly throughout the day
depending on whether the person is awake or not, or whether he or she is doing
something strenuous like exercising. It is multimodal and seasonal: multimodal
because the person's heart rate changes to different states based on activities,
and seasonal because it is likely to follow the daily human cycle (sleep/awake).
The heart has a different rate while the person is running compared to when he
or she is relaxing; even while running, the rate changes to different modes of
operation during a steady run or strenuous sprints.


A SINGLE MODEL DOES NOT FIT ALL METRICS


In the Anodot system, every dataset that comes in goes through a classification
phase where we categorize it according to what type of model it fits best. We
have created a bank of model types that each fits one of these signal types, some
of which are illustrated in Figure 3.

Figure 3 A sampling of some of the data models Anodot uses.

For companies that choose to build their own anomaly detection system, this is
often where the first part of the complexity comes into play. Most open source
techniques deal with "smooth metrics." These metrics are not normally distributed,
but they tend to be very regularly sampled and have stationary behavior: their
behavior does not change rapidly, and they do not exhibit other, more complex
behaviors. Applying open source techniques only covers a fraction of what is
measured, and if they are applied on metrics that are not smooth, the result will
either be a lot of false-positives or there will be many anomalies that are not
detected (i.e., false-negatives), because the metric is fitted with the wrong model.

Not everything is smooth and stationary, and those models only work on a
fraction of the metrics. Worse, it is difficult to know which metrics are like this.
Those datasets would somehow have to be identified.

Consider the pattern in the signal shown in Figure 4. If the smooth techniques are
applied on this data, the little spikes that seem completely normal would be
considered anomalous and would generate alerts every minute. The smooth
model would not work here.


Figure 4 A metric with an unusual pattern doesn't fit a smooth model.

Knowing what a data pattern looks like in order to apply an appropriate model is
a very complex task.

If a company has 10 metrics, it is possible to graph the data points and review
them with a statistician. With only 10 metrics, this is feasible to do manually;
however, with
many thousands or millions of metrics, there is no practical way to do this
manually. The company would have to design an algorithm that would determine
the proper data model to use for each metric.

There is another aspect we have observed quite often with the data we see from
our customers: the model that is right today may not be right tomorrow. In
Figure 5, we see how a metric's behavior can change overnight.

Figure 5 Sudden change in metric behavior.

We have seen this happen many times, and each time, it was totally unexpected;
the data starts out one way and then goes into a totally different mode. It may
start kind of smooth and then change to steep peaks and valleys, and stay there.
That pattern becomes the new normal. It is acceptable to say, at the beginning of
the new pattern, that the behavior is anomalous, but if it persists, we must call it
the new normal.


Let us consider how this affects the company building its own detection system.
The company's data scientist will spend several weeks classifying the data for the
company's 1,000 metric measurements and make a determination for a metric
model. It could be that a week from now, what the data scientist did in classifying
the model is irrelevant for some of them, but it may not be clear for which ones.

What is needed, then, is an automated process that constantly looks at the
changing nature of data signals and decides what the right model is for the
moment. It is not static.

THE IMPORTANCE OF MODELING SEASONALITY


Other important aspects that should be included in the algorithms and the model
are whether the data has seasonal patterns and what the seasonal periods are. A
seasonal pattern exists when a series is influenced by seasonal factors (e.g., the
quarter of the year, the month, the hour of the day or the day of the week).
Seasonality is frequently, but not always, of a fixed and known period. This is
illustrated in Figure 6 below.

Figure 6 Illustration of a single seasonal pattern.

We know that many different metrics that are measured have seasonal patterns,
but the pattern might be unknown. Nevertheless, it is important to take the
seasonal pattern into consideration for the model. Why? If the model of what is
normal knows to account for a metric's seasonal pattern, then it is possible to
detect the anomalies in samples that vary from the seasonal pattern. Without
considering the seasonal pattern, too many samples might be falsely identified as
anomalies.

Often we see, not just a single seasonal pattern, but multiple seasonal patterns
and even different types of multiple seasonal patterns, like the two examples
shown in Figure 7.


Figure 7 Illustration of multiple seasonal patterns.

Figure 7 shows an example of a real metric with two seasonal patterns working
together at the same time. In this case, they are weekly and daily seasonal
patterns. The image shows that Fridays and weekends tend to be lower, while the
other days of the week are higher. There is a pattern that repeats itself week after
week, so this is the weekly seasonal pattern. There is also a daily seasonal pattern
that illustrates the daytime hours and nighttime hours; the pattern tends to be
higher during the day and lower during the night.

These two patterns are intertwined in a complicated way. There is almost a sine
wave for the weekly pattern, and another faster wave for the daily pattern. In
signal processing, this is called amplitude modulation, and it is normal for this
metric. If we do not account for the fact that these patterns co-exist, then we do
not know what normal is. If we know how to detect it and take it into account, we
can detect very fine anomalies like the ones shown in orange in Figure 7 above.
The values in orange indicate a drop in activity which may be normal on a
weekend but not on a weekday. If we do not know to distinguish between these
patterns, we will not understand the anomaly, so we either miss it or we create
false-positives.

Figure 8 below shows an example of another type of multiple seasonal patterns,
one with additive signals.


Figure 8 An Illustration of multiple seasonal patterns with additive signals.

In the example above, we see a clear daily pattern. In addition, we see an event
that occurs every four hours which causes a spike that lasts for an hour and then
comes down. The spikes are normal because of a process or something that
happens regularly. The orange line shows an anomaly that would be very hard to
detect if we did not take into account that there is both the daily pattern and the
spikes every four hours. We call this pattern additive because the spikes are
added to what normally happens during the day; the pattern shows a consistent
spike every four hours on top of the daily pattern.

CAN A SEASONAL PATTERN BE ASSUMED?


At Anodot, we have observed millions of metrics and built algorithms that detect
the seasonal patterns, if any, that exist in them. Some of them (in fact, most of
them) do not have a seasonal pattern. Out of millions of metrics that Anodot has
seen, about 14% of them have a season to them, meaning 86% of the metrics
have no season at all. Out of the metrics with a seasonal pattern, we have
observed that 70% had a 24-hour pattern to them, and 26% had weekly patterns.
The remainder of the metrics with a seasonal pattern had other types of
patterns: four hours, six hours, and so on.

If we assume there are no seasonal patterns in any of the metrics and we apply
standard techniques, we are either going to be very insensitive to anomalies or
too sensitive to them, depending on what technique we use. However, making
assumptions about the existence of a seasonal pattern has its issues as well.


There are two problems with assuming a seasonal pattern (e.g., daily or weekly).
First, it may require too many data points to obtain a reasonable baseline (in case
there is no seasonal pattern in the metric), or it would produce a poor normal
model altogether (if there is a different seasonal pattern in the metric). If we
assume a weekly seasonal pattern for all of our metrics, it would require many
more data points to converge to a metric baseline. Not only does this take time,
but the process might not converge to the right distribution if it is a variable
metric.

Second, if the wrong seasonal pattern is assumed, the resulting normal model
may be completely off. For example, if the data is assumed to have a daily
seasonal pattern, but it actually has a 7-hour pattern, then comparing 8 AM one day
to 8 AM another day is not relevant. We would need to compare 8 AM one day to
3 PM that same day. Improperly defining the seasons will lead to many false-
positives due to the poor initial baseline.

Figure 9 Comparing a 7-hour seasonal pattern with an assumed 24-hour seasonal pattern

For this reason, some tools require the user to define the season in order for the
tool to estimate the baseline. Of course, this is not scalable for more than a few
dozen metrics. What is needed is a system that will automatically and accurately
detect seasonality (if it exists). If this capability is not built into the system,
assumptions will have to be made that are going to cause an issue, either from
the statistics side in needing more data, or from the accuracy side in identifying
anomalies.

EXAMPLE METHODS TO DETECT SEASONALITY


Now that we have established the importance of determining if seasonality is
present in the data, we will briefly discuss a few common methods to detect it.

One method uses Fourier transform of signals, a technique in mathematics that
takes a signal, transforms it to the frequency domain and finds frequencies that
are local maximums (peaks) in the power of the Fourier transform. Those peaks
tend to occur where there are seasonal patterns. This technique is fast and
efficient, but it does not work well when there are multiple seasonal patterns.
Additionally, it is difficult to detect low-frequency signal patterns like weekly,
monthly or yearly, and this technique is very sensitive to any missing data. Also,
issues like aliasing in the Fourier transform can cause multiple peaks to be
present, some of which are not the actual seasonal frequency, but rather artifacts
of the Fourier transform computation.
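The Python sketch below is a bare-bones illustration of this Fourier approach, not a
production detector: it removes the mean from a series, takes the power spectrum with
a fast Fourier transform, and reports the period of the strongest peak. The hourly series
is synthetic, with an artificial 24-hour cycle.

```python
import numpy as np

# Minimal sketch of the Fourier-transform approach described above: find the
# frequency with the strongest peak in the power spectrum and report the
# corresponding period. Assumes regularly sampled (hourly) data.

def dominant_period(values):
    values = np.asarray(values, dtype=float)
    detrended = values - values.mean()           # remove the zero-frequency component
    power = np.abs(np.fft.rfft(detrended)) ** 2  # power spectrum
    freqs = np.fft.rfftfreq(len(values), d=1.0)  # frequencies in cycles per sample
    peak = np.argmax(power[1:]) + 1              # skip the zero-frequency bin
    return 1.0 / freqs[peak]                     # period, in samples

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                       # two weeks of hourly samples
series = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)
print(dominant_period(series))                   # close to 24 (a daily season)
```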

Another technique is autocorrelation of signals, also known as serial correlation
or the autocorrelation function (ACF): the correlation of a signal with itself at
different points in time. Informally, it is the similarity between observations as a
function of the
time lag between them. It is a mathematical tool for finding repeating patterns.
Compared to the Fourier transform method, it is more accurate and less sensitive
to missing data, but it is computationally expensive.
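The Python sketch below is a brute-force illustration of the autocorrelation approach
(it is not Anodot's Vivaldi algorithm, described next): it computes the correlation of the
series with itself at every candidate lag and returns the first lag that forms a clear local
maximum. Computing the full set of coefficients this way is exactly what makes the naive
method computationally expensive.

```python
import numpy as np

def seasonal_lag(values, threshold=0.5):
    """Return the first lag that is a clear local maximum of the autocorrelation."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    max_lag = len(x) // 2
    # Brute force: one correlation per candidate lag (acf[i] corresponds to lag i + 1).
    acf = np.array([np.corrcoef(x[:-lag], x[lag:])[0, 1] for lag in range(1, max_lag)])
    for i in range(1, len(acf) - 1):
        if acf[i] > threshold and acf[i] >= acf[i - 1] and acf[i] >= acf[i + 1]:
            return i + 1          # multiples of the period also peak; take the smallest
    return None

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
series = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)
print(seasonal_lag(series))       # 24, the daily season
```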

Anodot developed a proprietary algorithm which we call Vivaldi (patent pending).
At a high level, Vivaldi implements detection using the ACF method, but
overcomes its shortcomings by applying a smart subsampling technique,
computing only a small subset of ACF coefficients, thus reducing computational
complexity. In addition, to accurately identify multiple seasonal patterns, the
method is applied on multiple filtered versions of the time series. The method
has been proven to be accurate both theoretically and empirically, while very fast
to compute.
Finding maximum(s) in Fourier transform of signal:
- Challenging to detect low frequency seasons
- Challenging to discover multiple seasons
- Sensitive to missing data

Finding maximums in autocorrelation of signal:
- Computationally expensive
- More robust with regard to gaps

Anodot Vivaldi method:
- Based on autocorrelation
- Uses smart subsampling to reduce computational complexity
- Provably accurate

Table 1 Techniques to detect seasonality.


REAL-TIME DETECTION AT SCALE REQUIRES ONLINE[1] ADAPTIVE LEARNING ALGORITHMS
Companies that want immediate insight into changes in their business operations
need prompt notification of anomalies in their data. This means that the
algorithms used to detect anomalies must have all the properties that we have
discussed above (i.e., detecting seasonality, automatically determining the
proper model, etc.), but they also must adapt to changing conditions in the
datasets they are processing. Unexpected changes in the data can be anomalies,
at least initially, but they can also be indicative of a change in the data pattern.
This can happen if, for example, e-commerce sales or the number of visitors to a
website suddenly surges due to a successful marketing campaign.

This underscores the need for online adaptive learning algorithms which learn
the model with every new data point that comes in. This type of learning
algorithm does not wait to receive a batch of data points to learn from; rather, it
updates what has been learned so far with every new data point that arrives.
These so-called online learning algorithms do not have to be adaptive, but by
nature they usually are, which means that every new data point changes what
has been learned up to that time.

We can contrast an online learning model to a model that uses data in batch
mode. For example, a video surveillance system that needs to recognize human
images will learn to recognize faces by starting with a dataset of a million pictures
that includes faces and non-faces. It learns what a face is and what a non-face is
in batch mode before it starts receiving any real data points.

[1] What do we mean by "online" machine learning? This is not a reference to the Internet or the World Wide Web.
Rather, "online" is a data science term that means the learning algorithm takes every data point, uses each one to
update the model and then does not concern itself with that data point ever again. The algorithm never looks back at
the history of all the data points, but rather goes through them sequentially.

It is not necessarily real-time because time is not a factor here. In an e-commerce example, time can be a factor, but in
the general sense of an online learning algorithm, it just means that if there are 1,000 data points to learn from, the
algorithm goes through them one by one to learn from them, throws them away and then moves on to the next one. It
is more of a sequential learning algorithm. The word "online" is widely used in the world of data science but it has
nothing to do with the Internet; this is simply the term used in literature. For more information, see the Wikipedia entry
about online machine learning.


In the online learning paradigm, the machine never iterates over the data. It gets
a single data point, learns what it can from it, and then throws it away. It gets
another data point, learns what it can from it, throws it away, and so on. The
machine never goes back to previously used data to relearn things; this is similar
to how our brains learn. When we encounter something, we learn what we can
from it and move on, rather than storing it for later use.

An online adaptive learning algorithm works by initializing a model of what is normal. It takes a new data point in the next second, minute, hour or whatever
timeframe is appropriate. First, the machine tests if the current data point is an
anomaly or not, based on what it already knows. If it marks the data point as not
being an anomaly, then it updates the current model about what is normal based
on that data point. And then it repeats the process as individual new data points
come in sequentially.
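As a rough illustration only, the loop below sketches this read-test-update cycle; the `model` object and its two methods are hypothetical placeholders, not part of any particular library.

```python
def online_detect(stream, model):
    """Skeleton of the online adaptive loop described above.

    `model` is a hypothetical placeholder exposing two methods:
      is_anomalous(x) -> bool   # test the point against what is normal so far
      update(x)                 # fold the point into the model
    """
    for x in stream:                      # one pass, one point at a time
        anomaly = model.is_anomalous(x)   # 1. test against the current model
        yield x, anomaly                  # 2. report the decision immediately
        if not anomaly:
            model.update(x)               # 3. learn from the point (simplest
                                          #    policy: skip anomalous points;
                                          #    a weighted variant appears later)
        # the data point is never revisited
```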

The machine never goes back to previously viewed data points to put them into a
current context. The machine cannot say, "Based on what I see now, I know that
this data point from five days ago is actually not an anomaly." It cannot consider,
"Maybe I should have done something different." The luxury of going back in time
and reviewing the data points again does not exist, which is one of the
shortcomings of this paradigm. The advantage of this approach is that it is fast
and adaptive; it can produce a result now and there is no need to wait to collect a
lot of data before results can be produced. In cases where a rapid analysis is
needed, the advantages of this approach far outweigh its disadvantages.

There are various examples of online adaptive learning models that learn the normal behavior of time series data in the data science, statistics and signal processing literature. Among them are the Simple Moving Average, Double/Triple Exponential smoothing (Holt-Winters), and Kalman Filters + ARIMA and their variations.

The following is an example of how a simple moving average is calculated and how it is applied to anomaly detection. We want to compute the average over a time series, but we do not want the average from the


beginning of time until present. Instead, we want the average during a window of
time because we know we need to be adaptive and things could change over
time. In this case, we have a moving average with a window size of seven days,
and we measure the metric every day. For example, we look at the stock price at
the end of every trading day. The simple moving average would compute the
average of the stock price over the last seven days. Then we compare tomorrow's
value to that average and see if it deviates significantly. If it does deviate
significantly from the average value, it is an anomaly and if not, then it is not an
anomaly. Using a simple moving average is a straightforward way of considering
whether we have an anomaly or not.
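A minimal sketch of that idea in Python follows; the window size, the three-standard-deviation threshold and the toy price series are illustrative assumptions, not a prescribed configuration.

```python
import statistics
from collections import deque

def sma_anomaly_flags(values, window=7, threshold=3.0):
    """Flag points that deviate strongly from a trailing moving average.

    Compare each new value to the mean of the previous `window` values and
    flag it if it is more than `threshold` standard deviations away.
    """
    history = deque(maxlen=window)
    flags = []
    for v in values:
        if len(history) == window:
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history) or 1e-9   # avoid division by zero
            flags.append(abs(v - mean) / stdev > threshold)
        else:
            flags.append(False)                          # not enough history yet
        history.append(v)
    return flags

# Example: daily closing prices with one obvious spike on the last day.
prices = [100, 101, 99, 100, 102, 101, 100, 100, 101, 140]
print(sma_anomaly_flags(prices))   # only the final point is flagged True
```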

The other models listed above are (much) more complex versions of that, but if one can understand a simple moving average, then the other models can be understood as well.

THE IMPACT OF LEARNING RATE: AVOIDING PITFALLS


All of these adaptive online algorithms have some notion of learning rate. In the
stock price example, we looked at the average value over the last seven days of
the stock price and then compared the next day to that value. In this example,
the seven-day period is a parameter known as the learning rate. Why not 30
days? Why not 180 days? The shorter we make the learning rate, the more of an
effect each daily data point has on the moving average. If we make it a moving
average of the last three days, it will learn any changes that happen faster. If we
make it 365 days, then it will learn very slowly because every day will have a very
small effect on that average.

If our learning rate is too slow, meaning our moving average window is very large,
then we would adapt very slowly to any changes in that stock price. If there are
big changes in the stock price, then the baseline (the confidence interval of where the average should be) will be very large, and we will be very insensitive to changes.

If we make the rate too fast (i.e., the window is very small), then we will adapt too quickly and we might miss things. We might think that anomalies are not anomalies because we are adapting to them too quickly.
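One common way to make the learning rate explicit is an exponentially weighted moving average, where a single parameter alpha plays the role of the learning rate (an alpha of roughly 2 / (window + 1) behaves like a moving average over that window). The sketch below, with illustrative alpha values, shows how a fast learner absorbs a level shift within a few points while a slow learner keeps treating it as far from normal.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average: larger alpha == faster learning."""
    avg, out = None, []
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        out.append(avg)
    return out

# A level shift from 100 to 120: the fast learner adapts within a few points,
# the slow learner still treats 120 as far from "normal" much later.
data = [100] * 30 + [120] * 30
fast = ewma(data, alpha=0.5)    # ~3-point window: adapts quickly, may hide anomalies
slow = ewma(data, alpha=0.005)  # ~400-point window: adapts slowly, stays sensitive
print(round(fast[35], 1), round(slow[35], 1))   # e.g. ~119.7 vs ~100.6
```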


These scenarios are depicted in Figure 10 below.

Figure 10 The effect of learning rate on detecting anomalies.

How do we know what the learning rate should be? If we have a small number of time series (100 or fewer), we could inspect and change the parameters as needed. However, a manual method will not work when we have a large number
of time series, so the algorithms need to automatically tune themselves.

There are many different metrics and each one has its own behavior. The rate at
which these metrics change could be fast or slow depending on what they are;
there is no one set of parameters that fits them all well. Auto-tuning these
parameters is necessary to provide an accurate baseline for millions of metrics.
This is something that is often overlooked by companies building anomaly
detection systems (incidentally, auto-tuning is built into the Anodot system).
Auto-tuning is not an easy task, but it is an important one for achieving more
accurate results.

There is another pitfall to be aware of. If we have a metric and we tune the
learning rate to fit it well when it behaves normally, what happens when there is
an anomaly?

Consider a scenario where we have a data point that is an anomaly. Recall that
the three steps of online learning are to read the sample, update the model using
the data point, and move on to the next data point. What happens if a data point
is an anomaly? Do we update the model with the new data point or not?


Continuing the example of learning the stock price model using the moving average method, if we include an anomalous data point in the learning process, the learned model of the stock price becomes anomalous as well. If we use it the next day to compute the next moving average, then we completely shift the average toward that anomaly. Is that okay or not okay? Good or bad? What happens in reality is, if we allow it to shift the average (or the parameters of the model) as usual and the anomaly persists beyond that single data point, we will start shifting the normal behavior towards that anomaly. If the anomaly lasts for a while, then at some point we will say this is the new normal, and we might even miss other anomalies that come along later. Or, whenever it goes back to normal, we will say that is an anomaly as well.

Updating the model with every data point (including anomalous ones) is one strategy, but it is not a very good one.

ADAPTING THE LEARNING RATE


A better strategy is to adapt the learning rate by assigning weight to the validity of
the data points, with an anomaly carrying a lower weight than a normal value.
This is a tactic that Anodot uses. Whenever Anodot sees that a data point is an
anomaly, the system assigns that value a very low weight when adapting the
model parameters.

Going back to the moving average example, if we have a seven-day moving


average and we get the next data point and see it is outside the expected range,
we categorize it as anomalous compared to the previous average. The Anodot
system will use the anomalous data point to update the moving average; but
instead of using it as is, Anodot gives it a weight as though it is one of 1,000 data
points. That data point will affect the average, but only in a very small way.

Why not just ignore the anomaly in terms of learning and not include it in the model? Quite often anomalies happen because something really has changed, and it is okay that it changed. We want to pick up on an


anomaly when it changes, but we eventually want to adapt to it and go back to that new state. A metric could spike and stay high for a very long time. Perhaps somebody made a change in the measured item and it is perfectly fine. We want to know about it in the beginning, but after a while we want the system to adapt to that new state. If we do not let it affect what has been learned, then the system will never adapt to the new state, and will be stuck in the previous state.

An example of this would be a company that does a stock split. All of a sudden,
the stock price is cut in half; instead of it being $100 a share, it suddenly drops to
$50. It will stay around $50 for a while and the anomaly detection system must
adapt to that new state. Identifying that the drop is an anomaly is not a bad thing,
especially if we are unaware there was a split, but eventually we want our normal
value to go down to around that $50 state.

Another example would be a merger. One company acquires another company,


the stock price goes up or down and it may stay at that new value for a long time.
The valuation of the company has changed quite suddenly, and the system
eventually needs to adapt to that new valuation.

In the online world, these types of changes happen a lot. For example, a company
has a Web application and after a large marketing campaign, the number of users
quickly increases 25 percent. If the campaign was good, the number of users may
stay elevated for the long term. When a SaaS company adds a new customer, its
application metrics will jump, and that is normal. They might want to know about
that anomaly in the beginning, but then they will want the anomaly detection
system to learn the new normal.

These kinds of events happen frequently; we must not ignore them by preventing those data points from ever affecting the model. On the other hand, we do not want the system to learn too quickly, otherwise all anomalies will appear very short, and if the metric goes back to the previous normal state, our measurements will be off. There is a fine balance between how quickly we adapt and how sensitive we remain to anomalies.

In the Anodot system, when we see anomalies, we adapt the learning rate in the
model by giving the anomalous data points a lower weight. If the anomaly
persists for a long enough time, we begin to apply higher and higher weights until
the anomalous data points have a normal weight like any other data point, and
then we model to that new state. If it goes back to normal, then nothing happens;
it just goes back to the previous state and everything is okay.
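The sketch below illustrates this weighting idea in the simplest possible form; it is not Anodot's implementation, and the threshold, window and weight-ramp values are illustrative assumptions (the 1-in-1,000 starting weight echoes the example above).

```python
def adaptive_mean(values, window=7, threshold=3.0, low_weight=0.001):
    """Running mean/variance that down-weights anomalous points.

    A normal point is folded in with weight 1/window (roughly a moving
    average); an anomalous point starts with a tiny weight (1/1000 here,
    echoing the example in the text) that grows the longer the anomaly
    persists, so a lasting change eventually becomes the new normal.
    All numbers are illustrative, not Anodot's actual parameters.
    """
    mean, var = None, 1.0
    anomaly_run = 0
    for v in values:
        if mean is None:
            mean = v
            yield v, mean, False
            continue
        is_anomaly = abs(v - mean) > threshold * (var ** 0.5)
        if is_anomaly:
            anomaly_run += 1
            w = min(1.0 / window, low_weight * anomaly_run)  # weight ramps up
        else:
            anomaly_run = 0
            w = 1.0 / window
        diff = v - mean
        mean += w * diff                         # exponentially weighted mean
        var = (1 - w) * (var + w * diff * diff)  # exponentially weighted variance
        yield v, mean, is_anomaly
```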


These two approaches to updating the learning rate are shown below. In Figure
11, the model is updated without weighting the anomalies. In this instance, most
of the anomaly is actually missed by the model being created.

Figure 11 Updating a model using an anomaly as a fully-weighted data point.

In Figure 12, anomalies are weighted differently to minimize their impact on the normal model, unless it becomes apparent that the anomalies are the new normal. This method allows the anomaly to be fully captured.

Figure 12 Updating a model by assigning a lower weight to anomalies.

OTHER METHODS FOR LEARNING NORMAL BEHAVIORAL PATTERNS
This paper covers online adaptive learning methods that fit with the design
principles laid out in Part 1 of this document series. These are the methods that
Anodot has selected for its solution; however, there are other methods for
learning normal behaviors in data patterns. We summarize them in the table
below according to the design criteria defined in Part 1.

Normal Behavioral Learning Methods

Name              Adaptive?   Real-time?   Scalable?   Uni/Multivariate
Holt-Winters      Yes         Yes          Yes         Univariate
ARIMA + Kalman    Yes         Yes          Yes         Both
HMM               No          Yes          No          Multivariate
GMM               No          No           No          Both
DBScan            No          No           No          Multivariate
K-Means           No          No           No          Multivariate

Table 2: Other Normal Behavioral Learning Methods


SUMMARY
This document outlines a general framework for learning normal behavior in a
time series of data. This is important because any anomaly detection system needs a model of normal behavior to determine whether a new data point is normal or abnormal.

There are many patterns and distributions that are inherent to data. An anomaly
detection system must model the data, but a single model does not fit all metrics.
It is especially important to consider whether seasonality is present in the data
pattern when selecting a model.

Real-time detection of anomalies at scale requires online adaptive learning


algorithms, and there are various learning models that can be found in data
science, statistics and signal processing literature. Anodot has chosen a model
that adapts its learning rate to give anomalies their due treatment without over-
emphasizing their impact on the model going forward.

In Part 3 of this series, we will look at the processes of identifying and correlating
abnormal behavior, which help to distill the list of anomalies down to the most
significant ones that warrant investigation. Without these important processes, a
system could identify too many anomalies to investigate in a reasonable amount
of time.

For more information, please contact Anodot:

North America
669-600-3120
info.us@anodot.com

International
+972-9-7718707
info@anodot.com


ABOUT ANODOT
Anodot was founded in 2014, and since its launch in January 2016 has been
providing valuable business insights through anomaly detection to its customers
in financial technology (fin-tech), ad-tech, web apps, mobile apps, e-commerce
and other data-heavy industries. Over 40% of the company's customers are
publicly traded companies, including Microsoft, VF Corp, Waze (a Google
company), and many others. Anodot's real-time business incident detection uses
patented machine learning algorithms to isolate and correlate issues across
multiple parameters in real time, supporting rapid business decisions. Learn
more at http://www.anodot.com/.

Copyright 2017, Anodot. All trademarks, service marks and trade names
referenced in this material are the property of their respective owners.


ULTIMATE GUIDE TO BUILDING A MACHINE
LEARNING ANOMALY DETECTION SYSTEM
PART 3: CORRELATING ABNORMAL BEHAVIOR

INTRODUCTION
Many high velocity online business systems today have reached a point of such
complexity that it is impossible for humans to pay attention to everything
happening within the system. There are simply too many metrics and too many
data points for the human brain to discern. Most online companies already use
data metrics to tell them how the business is doing, and detecting anomalies in
the data can lead to saving money or creating new business opportunities. Thus,
it has become imperative for companies to use machine learning in large scale
systems to analyze patterns of data streams and look for anomalies.

Consider an airline's pricing system that calculates the price it should charge for
each and every seat on all of its routes in order to maximize revenue. Seat pricing
can change multiple times a day based on thousands of factors, both internal and
external to the company. The airline must consider those factors when deciding
to increase, decrease or hold a fare steady. An anomaly in any given factor can be
an opportunity to raise the price of a particular seat to increase revenue, or lower
the price to ensure the seat gets sold.

Automated anomaly detection is a technique of machine learning, and it is a


complex endeavor. Anodot is using this series of white papers to help explain and
clarify some of the sophisticated decisions behind the algorithms that comprise
an automated anomaly detection system for large scale analytics. In Part 1 of this
white paper series, we outlined the critical design principles of an anomaly
detection system. In Part 2 we continued the discussion with information about
how systems can learn what normal behavior looks like in order to identify
anomalous behavior. We recommend first reading parts 1 and 2 to gain the
foundational information necessary to comprehend this document.

Here in Part 3, the final document of our white paper series, we will cover the
processes of identifying, ranking and correlating abnormal behavior. Many of the
aspects we discuss in this document are unique to Anodot, such as ranking and
scoring anomalies and correlating metrics together. Most other vendors that


provide anomaly detection solutions do not include these steps in their analysis. We believe these steps are a real differentiator and a major reason why Anodot's solution not only brings accurate anomalies to light with minimal false positives and negatives, but also puts them into the context of the full story to provide actionable information.

There are five steps necessary to learn and identify anomalies:


1. Metrics collection (universal scale, to millions of metrics)
2. Normal behavior learning
3. Abnormal behavior learning
4. Behavioral topology learning
5. Feedback-based learning

Steps 1 and 2 were covered in detail in the previous two white papers. This
document covers steps 3 and 4. Step 5 is not in the scope of this white paper
series.

ABNORMAL BEHAVIOR LEARNING AND SCORING


The objective of any anomaly detection system is to, well, detect anomalies. But
not all anomalies are equal. Some are more significant than others, and the
reaction an anomaly causes might depend upon how significant it is.

In our earlier documents, we used the example of the human body as a complex
system with many metrics and data points for each metric. Body temperature is
one of those metrics; an individual's body temperature typically changes by
about a half to one degree between its highest and lowest points each day. A
slight temperature rise to, say, 37.8°C (100.0°F), would be anomalous but not a cause for great concern, as taking an aspirin might help lower the temperature back to normal. However, an anomalous rise to 40°C (104.0°F) will certainly
warrant a trip to the doctor for treatment. These are both anomalies, but one is
more significant than the other in terms of what it means within the overall
system.

In a complex business system, how do we understand which anomaly is more


significant than another? Let's consider this at the individual metric level, as shown in Figure 1a below. Figure 1a shows a set of anomalies: some are small, some are big, some last longer, some are shorter in duration. Though not shown
in this illustration, some anomalies might have a pattern to them, and some
patterns could be a square or a linear increase or decrease. Looking at the chart


with the human eye, one could posit what is more or less significant based on
intuition, and this method can be encoded into an algorithm.

For every anomaly found in a metric, there is a notion of how far it deviates from normal as well as how long the anomaly lasts. These notions are called deviation and duration, respectively. As for the anomalies seen in Figure 1a, some of them contain many data points, which means that the data series was anomalous for quite a while (i.e., had a longer duration), and some of them have fewer data points (i.e., the anomaly had a shorter duration). In some cases, the peak of the data points is higher (i.e., a greater deviation from normal), and for others, the peak is lower (i.e., less of a deviation from normal). There are other conditions around both duration and deviation, and they all need to be considered in the statistical model. In the case of Anodot, the input is a set of statistics related to each anomaly, and the output is a score (on a scale of 0 to 100) of how significant the anomaly is.
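As a toy illustration of scoring relative to a metric's own history (a simple stand-in for the probabilistic Bayesian models mentioned later in this section, not Anodot's actual method), the sketch below ranks a new anomaly's combined deviation-and-duration magnitude against past anomalies of the same metric.

```python
from bisect import bisect_left

def significance_score(deviation, duration, history):
    """Score an anomaly from 0-100 relative to past anomalies of the metric.

    `history` is a list of (deviation, duration) pairs for previously seen
    anomalies on the same metric. The score is the percentile of the new
    anomaly's combined magnitude within that history - a toy stand-in for
    the probabilistic ranking described in the text.
    """
    magnitude = deviation * duration            # crude combined statistic
    past = sorted(d * t for d, t in history)
    if not past:
        return 50.0                             # no context yet: neutral score
    rank = bisect_left(past, magnitude)
    return round(100.0 * rank / len(past), 1)

past_anomalies = [(2.0, 3), (5.0, 1), (1.5, 10), (8.0, 4), (3.0, 2)]
print(significance_score(deviation=6.0, duration=6, history=past_anomalies))  # high score
print(significance_score(deviation=1.2, duration=1, history=past_anomalies))  # low score
```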

Having such a score provides the ability to filter anomalies based on their significance. In some cases, the user would want to be alerted only if the score, or significance, is very high; in other cases, the user would want to see all anomalies. For example, if a business is looking at a metric that represents the company's revenue, then the user would probably want to see anomalies pertaining to anything that happens, even if they are very small. But if the same business is looking at the number of users coming into its application from a specific location like Zimbabwe (assuming the company does not do a lot of business in Zimbabwe), then maybe the user only wants to see the big anomalies, i.e., highly significant anomalies. In the Anodot system, this is configured using a simple slider, as seen in Figure 2a.

The user needs this input mechanism because all the anomaly detection is
unsupervised, and the system has no knowledge of what the user cares about
more.

Note that the significance slider in the Anodot system does not adjust the
baseline or the normal behavior model; it only defines which anomalies the user
chooses to consume. This helps users focus on what is most important to them,
preventing alert fatigue. If there are too many alerts, such as one for every single
anomaly, the alerts eventually become overwhelming and meaningless.


Scoring occurs through machine learning since the scores are relative to past
anomalies of that metric, not an absolute value.

Figure 1a: A single metric with several instances of abnormal behavior

Figure 1b: Anomalies ranked by significance


Consider the anomalies shown in Figure 1b. Even without looking at the assigned
numbers for some of the anomalies, a person looking at the signal would
probably come up with similar scores. How? It is not based on each anomaly's absolute amount of deviation from normal; rather, it is based on the fact that the high-peak anomalies deviated a lot more, and the smaller ones deviated less than the bigger ones. Even for human eyes, it is all relative.

Now suppose the big spike (the one labeled 90) was not there. Without a significant anomaly to compare to, the other anomalies would look bigger and more significant. In fact, we would probably change the scale of the graph.

This is an important distinction because there are other scoring mechanisms that
look at the absolute deviation without context of what happened in the past.
Anodot initially took this approach but we saw quickly, from a human
perspective, that when people look at a long history of a time series and see the
anomalies within it, in their minds they consider the anomalies relative to each
other as well as relative to normal. Anodot's algorithms now mimic this human
thought process using probabilistic Bayesian models.


In the screenshot in Figure 2b, the significance slider is set to 70, meaning that
only the orange anomalies would be alerted on, and not the gray ones, which fall
below that score.

Figure 2a, The significance slider in the Anodot system lets users select the level of anomalies to be alerted
on.

Figure 2b, With significance set at 70, users would be alerted on the two orange alerts that are above 70,
but not the smaller gray alerts below 70.

BEHAVIORAL TOPOLOGY LEARNING


The next step in the overall process of learning and identifying anomalies in a
system is behavioral topology learning. In the first document of this white paper
series, we discussed learning system design principles and covered the
conciseness of anomalies. Conciseness refers to the idea that the system considers multiple metrics simultaneously, to view what is happening holistically.


If there are many anomalies at the single metric level and they are not combined
into a story that describes the whole incident, then it is very hard to understand
what is going on. However, combining them into a concise story requires an
understanding of which metrics are related, because otherwise the system runs
the risk of combining things that are completely unrelated. The individual metrics
could be anomalous at the same time just by chance.

Behavioral topology learning provides the means to learn the actual relationships
among different metrics. This type of learning is not well-known; consequently,
many solutions do not work this way. Moreover, finding these relationships at
scale is a real challenge. If there are millions of metrics, how can the relationships
among them be discovered efficiently?

As shown in Figure 3, there are several ways to figure out which metrics are
related to each other.

Figure 3: Methods of relating metrics to each other


ABNORMAL BASED SIMILARITY

The first method of relating metrics to each other is abnormal based similarity.
Intuitively, human beings know that when something is anomalous, it will
typically affect more than one key performance indicator (KPI). In the other
papers in this series, we have been using the example of the human body. When
someone has the flu, the illness will affect his or her temperature, and possibly
also heart rate, skin pH, and so on. Many parts of this system (called a body) will be affected in a related way.

When an automatic anomaly detection system takes in these measurements, it


does not know that the temperature, heart rate and skin pH are from the same
person (unless someone tells the system that fact). However, if the person gets
the flu several times, several of his or her vital signs will become anomalous at
the same time, thus there is a high likelihood that some of the anomalies on their
measurements will overlap.

The chance of two metrics having a single concurrent anomaly is high if you are measuring many things. If we were to simply rely on anomalies happening together to determine that they are related, it would cause many mistakes. But the probability of them being anomalous twice at the same time is much lower. Three times, even lower. The more often the metrics are anomalous at similar times, the more likely it is that they are related.

The metrics don't always have to be anomalous together. A person's temperature could increase but his or her heart rate might not increase at the same time, depending on the illness. But we know that many illnesses do cause changes to the vital signs together.

Based on these intuitions, one can design algorithms that find the abnormal
based similarity between metrics. One way to find abnormal based similarity is to
apply clustering algorithms. One possible input to the clustering algorithm would
be the representation of each metric as anomalous or not over time (vectors of
0s and 1s); the output is groups of metrics that are found to belong to the same
cluster. There are a variety of clustering algorithms, including K-means,
hierarchical clustering and the Latent Dirichlet Allocation algorithm (LDA). LDA is
one of the more advanced algorithms, and Anodot's abnormal based similarity processes have been developed on LDA with some additional enhancements.[i]
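As a rough sketch of the clustering idea (using scikit-learn's standard LDA rather than Anodot's enhanced version), each metric can be represented by the time buckets in which it was anomalous, and the resulting soft group memberships indicate which metrics tend to be anomalous together; the toy matrix below is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows are metrics, columns are time buckets: 1 means "anomalous in that bucket".
# Metrics 0-2 tend to be anomalous together; metrics 3-4 form a second group.
anomaly_matrix = np.array([
    [1, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
])

# Soft clustering: each metric gets a distribution over "topics" (groups),
# so a metric can partially belong to more than one group. With this toy
# input, the first three metrics usually share one group and the last two
# share the other.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
membership = lda.fit_transform(anomaly_matrix)    # shape: (n_metrics, n_groups)
print(np.round(membership, 2))
```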


The advantage that LDA has over most other clustering algorithms is that those algorithms allow a data point (or a metric, in this case) to belong to only one group. There could be hundreds of different groups, but in the end, a metric will belong to just one. Often, it is not that clear-cut. For example, on a mobile app, its latency metric could be in a group with the metric related to the application's revenue, but it could also be related to the latency of that app on desktops alone. By using clustering algorithms that force a choice of just one group, the system might miss out on important relationships. LDA clusters things in such a way that they can belong to more than one group, i.e., "soft clustering" as opposed to "hard clustering."

Another advantage of LDA concerns how similarity is measured. Most clustering algorithms use a distance function that requires the items being measured to be strictly similar; the LDA algorithm allows a metric to be only partially similar to the other metrics. This comes back to the softness of the algorithm: it allows partial similarity for a metric to still belong to a group. In the context of learning metric relationships, this is an important feature because, for example, application latency does not always have to be anomalous when the revenue is anomalous. It is not always the case that latency goes up anomalously and revenue goes down, and there can be times when the revenue becomes anomalous but the latency does not go up or down accordingly. The anomaly detection system must be able to take that partiality into account.

The primary issue with abnormal based similarity is that it does not scale well (we discuss scaling later in the paper). In addition, it requires seeing enough historical data containing anomalies so it can capture these relationships. Are
there additional types of information that can help capture the metric topology
with less (or no) history? We will discuss two additional methods of capturing
relationships between metrics next.


NAME SIMILARITY

Another method for determining relationships among metrics is name similarity.


Every metric in a system must be given a name that is not just free-form text. In
the industry of data handling, there are recommended naming conventions for
metrics, typically comprised of key value pairs describing what is being measured
and its source. For example, say we are measuring the revenue of an app for
Android in the US. For simplicity's sake, we will call this app XYZ. The key value
pair describing what we are measuring and the source would be XYZ together
with US. Thus, the revenue metric might have a name like
appName=XYZ.Country=US.what=revenue.

This particular app is also available in Germany, so the name for the metric that
measures revenue there might be something like
appName=XYZ.Country=Germany.what=revenue. By looking at the similarity
between these two metric names, we have a measure of how similar they are. If
they are very similar, then we say they should be grouped because they probably
describe the same system. It is reasonable to associate metrics using this
method; it is essentially based on term similarity, by comparing terms to see
whether they are equal and how much overlap they have.
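A minimal sketch of such a term-overlap comparison is shown below, using the hypothetical XYZ metric names from the example above; the Jaccard measure and the helper function names are illustrative choices, not any particular product's implementation.

```python
def parse_metric_name(name):
    """Split a dot-separated key=value metric name into a set of pairs."""
    return {tuple(part.split("=", 1)) for part in name.split(".")}

def name_similarity(a, b):
    """Jaccard overlap of the key=value pairs of two metric names (0..1).

    A simple stand-in for the term-similarity idea described above.
    """
    pa, pb = parse_metric_name(a), parse_metric_name(b)
    return len(pa & pb) / len(pa | pb)

us = "appName=XYZ.Country=US.what=revenue"
de = "appName=XYZ.Country=Germany.what=revenue"
latency = "appName=ABC.Country=US.what=latency"
print(name_similarity(us, de))       # 0.5 -> likely the same system
print(name_similarity(us, latency))  # 0.2 -> probably unrelated
```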

NORMAL BEHAVIOR SIMILARITY

A third method of determining relationships among metrics is normal behavior


similarity, which looks at the metrics under normal circumstances as opposed to
the abnormal based similarities. This method asks questions like, "Do the metrics have the same shape?" and "Do they look the same when the signal is normal?" What they look like when abnormal does not matter. For example, if we look at revenue for the XYZ app on different platforms such as Android and iPhone, they will probably look quite similar; the signals for these two metrics will most likely have the same shape. However, if we compare the application latency on Android to the app's revenue on that platform, they won't be similar.

Normal behavior similarity is the weakest method of the three discussed in this document because it is always possible to find correlations if one looks hard enough. The question is how to do it intelligently without getting a lot of false positives.


The most commonly used method of performing normal behavior similarity


comparisons is with linear correlation. Here people use measures such as the
Pearson correlation coefficient, which is a measure of the linear dependence
(correlation) between two variables (metrics). This method requires some
caution. For example, it is necessary to de-trend the data, meaning that if there is a linear trend constantly going up or down, it must be subtracted from the original
time series before computing the Pearson correlation. Otherwise, any metric that
is trending up will be correlated with anything else that is trending up, resulting in
a lot of false positives.

It is also necessary to remove seasonal patterns from the metrics; otherwise anything with a seasonal pattern will be correlated with anything else that has the same seasonal pattern. If two metrics both have a 24-hour seasonal pattern, the result will be a very high similarity score regardless of whether they are related or not. In fact, many metrics do have the same seasonal patterns but they are not related at all. For instance, we could have two online apps that are not related, but if we look at the number of visitors to both apps throughout the day, we will see the same pattern because both apps are primarily used in the US and have the same type of users. It could be the XYZ app and a totally unrelated news application.

Unlike abnormal based similarity, which creates very few false positives but depends on anomalies actually happening (a relatively rare event) and therefore needs more time to pass, normal behavior similarity requires much less data in order to be computed. However, if not done right (e.g., if the data patterns are not de-trended and de-seasonalized), this method can create many false positives.
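The sketch below shows one way to apply those precautions before computing the Pearson correlation; the seasonal period is assumed to be known (for example, from the seasonality detection discussed earlier in this series), and the function and example data are illustrative.

```python
import numpy as np
from scipy.signal import detrend
from scipy.stats import pearsonr

def normal_behavior_similarity(x, y, season=24):
    """Pearson correlation after removing the trend and a known seasonal period.

    Removing the linear trend and differencing at the seasonal lag avoids the
    spurious correlations described above (anything trending up correlates with
    anything else trending up, and any two daily patterns correlate).
    """
    x = detrend(np.asarray(x, dtype=float))
    y = detrend(np.asarray(y, dtype=float))
    # Seasonal differencing: subtract the value one season earlier.
    xs, ys = x[season:] - x[:-season], y[season:] - y[:-season]
    return pearsonr(xs, ys)[0]

# Two unrelated hourly metrics that both have a daily cycle.
t = np.arange(24 * 10)
daily = np.sin(2 * np.pi * t / 24)
a = daily + np.random.normal(0, 0.3, t.size)
b = 2 * daily + np.random.normal(0, 0.3, t.size)
print(round(pearsonr(a, b)[0], 2))                  # high: misleadingly "similar"
print(round(normal_behavior_similarity(a, b), 2))   # near 0 after de-seasonalizing
```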

The Pearson correlation is a simple algorithm and is quite easy to implement, but
there are better approaches that are less prone to false positives, such as the
pattern dictionary based approach. Suppose each time series metric can be partitioned into segments, where each segment is classified into one of N prototypical patterns defined in a dictionary of known patterns, like a daily sine wave, a sawtooth, a square wave-like pattern, or other classifiable
shapes. Once the user has a dictionary of typical shapes, he or she can describe
each metric based on what shapes appeared in it at each segment.

As an example, from 8 AM to 12 PM, the metric had shape number 3 from the
dictionary of shapes, and from 12 PM to 5 PM, it had shape number 10. This


changes how the time series is represented, with a more compressed representation that also describes attributes at a high level, rather than just the values. From there, it is relatively easy to do clustering or any type of similarity grouping based on the new representation. It is also easier to discount the weight of very common shapes in the dictionary by using techniques from document analysis (such as TF-IDF weights[ii]). It is safe to assume that things are correlated if, at every point in time, they have similar shapes.

The main challenge in the shape dictionary based approach is how to create the
dictionary. A variety of algorithms can be employed for learning the dictionary,
but they all follow a similar approach: Given a (large) set of time series metric
segments, apply a clustering technique (or soft clustering technique such as LDA)
on all the segments, and then use the representations of the clusters as the
dictionary of shapes. Given a new segment of a metric, find the most
representative cluster in the dictionary and use its index as the new
representation of the segment.
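As an illustrative sketch of the dictionary idea, the code below uses k-means centroids as the dictionary of shapes (a deliberately simpler stand-in for the stacked-autoencoder approach mentioned next); the segment length, the number of shapes and the helper names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_shape_dictionary(series_list, segment_len=24, n_shapes=8):
    """Learn a small dictionary of prototypical segment shapes with k-means.

    Each series is cut into fixed-length segments, each segment is normalized,
    and the cluster centroids become the "dictionary of shapes".
    """
    segments = []
    for s in series_list:
        s = np.asarray(s, dtype=float)
        for i in range(0, len(s) - segment_len + 1, segment_len):
            seg = s[i:i + segment_len]
            seg = (seg - seg.mean()) / (seg.std() + 1e-9)   # shape, not scale
            segments.append(seg)
    return KMeans(n_clusters=n_shapes, n_init=10, random_state=0).fit(segments)

def encode(series, km, segment_len=24):
    """Represent a series as the sequence of dictionary indices of its segments."""
    s = np.asarray(series, dtype=float)
    codes = []
    for i in range(0, len(s) - segment_len + 1, segment_len):
        seg = s[i:i + segment_len]
        seg = (seg - seg.mean()) / (seg.std() + 1e-9)
        codes.append(int(km.predict([seg])[0]))
    return codes
```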

One of the most promising algorithms tested at Anodot for creating such a
dictionary is a Neural-Network based approach (Deep Learning), namely, Stacked
Autoencoders. Stacked autoencoders are a multi-layer Neural Network designed
to discover a high-level representation of the input vectors in the form of activations of the output nodes. Training stacked autoencoders is done with a set of
segments of the time series; the activated nodes at the output of the network are
the dictionary representing prototypical shapes of the input segments. The
details of implementing this deep learning technique to accomplish this task are
out of the scope of this white paper.

USER INPUT

There are additional methods of establishing relationships among metrics that do not require sophisticated algorithms; one is direct user input. If a user says that all XYZ app metrics are related, this fact can be encoded into the learning model. It is a technical process, not an algorithmic one, but this type of direct input can be useful if the user can provide it.

The second method is indirect input, in which the user manipulates the metrics to
create new metrics out of them. If there is revenue of XYZ app in multiple
countries, the user can now create a new metric by calculating the sum of the
revenue from all the countries. It can be assumed that if it makes sense to create


a composite metric of multiple metrics, then the individual metrics are likely
related to each other.

Anodot uses both methods, depending on what information is available.

A MATTER OF SCALE
Of the various methods discussed above, one of the major challenges is scale. How can these comparisons be applied at very large scale? The algorithm-based methods are computationally expensive when there are a lot of metrics to work with: they either require a lot of machines or a lot of time to get results. How can it be done efficiently on a large scale, such as a billion metrics?

One method is to group the metrics. We would start by sorting one billion metrics into, say, 100 different groups of roughly related metrics. We can then go into each group and perform the heavy computation, because the number of metrics in each group is much smaller. If we have a group of one million metrics and we separate it into 10 groups, we end up with 10 groups of 100,000 metrics each, which is a much smaller, more manageable number. A mechanism is needed to enable fast and accurate partitioning.

How can this be done without knowing what things are similar? A locality sensitive hashing (LSH) algorithm can help here. For every metric a company measures, the system computes a hash value that determines which group the metric belongs to. Then, additional algorithms can be run on each group separately. This
breaks one big problem into a lot of smaller problems that can be parsed out to
different machines for faster results. This methodology does have a certain
probability of false positives and false negatives; however, the algorithm can tune
the system, depending on how many false positives and false negatives users are
willing to tolerate.
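A minimal sketch of the bucketing idea using random-hyperplane LSH is shown below; the signature length and the representation of each metric as a vector of its recent behavior are illustrative assumptions.

```python
import numpy as np

def lsh_buckets(metric_matrix, n_bits=8, seed=0):
    """Group metrics with random-hyperplane locality sensitive hashing.

    Each metric (a row vector summarizing its recent behavior) is hashed to
    an `n_bits`-bit signature; metrics sharing a signature land in the same
    bucket, and the expensive similarity algorithms then run only within
    buckets. More bits -> smaller buckets, fewer false positives but more
    false negatives; running several rounds with different seeds reduces
    the false negatives, as described above.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(metric_matrix, dtype=float)
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) > 0                       # one sign bit per hyperplane
    buckets = {}
    for idx, row in enumerate(bits):
        buckets.setdefault(tuple(row), []).append(idx)
    return buckets

# Example: 1,000 metrics, each summarized by 48 recent values.
metrics = np.random.default_rng(1).standard_normal((1000, 48))
groups = lsh_buckets(metrics, n_bits=8)
print(len(groups), "buckets for", len(metrics), "metrics")
```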

In this case, false positive means that two things are grouped together, despite
not exhibiting characteristics that would cause them to be grouped together.
False negative means that two things are put into separate groups when they
should be in the same group. The tuning mechanism allows the user to specify
the size of the groups based on the total number of metrics, as well as the
tolerance of false positives and false negatives that he or she is willing to accept.
One way to reduce the number of false negatives is to run the groups through


the algorithms a few times, changing the size of the group each time. If the
groups are small enough, they can run rapidly while not being computationally
expensive.

THE IMPORTANCE OF EACH STEP


The real goal of performing anomaly detection in a business system is not to
merely identify unusual things that are happening within that system, but to use
the insights about when and where the anomalies happen to understand the
underlying cause(s) and hopefully uncover opportunities to improve the business.
A large-scale business system can have hundreds of thousands or even millions
of metrics to be measured. A well-known social network that is used by billions of
people around the world is estimated to have 10 billion metrics.

Any large-scale system with a high number of metrics will yield many anomalies, perhaps too many for the business to investigate in a meaningful time frame. This is why all of the steps discussed across our series of three white papers are important. Each step helps reduce the number of anomalies to a manageable number of truly significant insights. This is illustrated in Figure 4 below.

Figure 4: The importance of all steps in an anomaly detection system

This chart illustrates the importance of all the steps in an anomaly detection
system: normal behavior learning, abnormal behavior learning, and behavioral
topology learning. Consider a company that is tracking 4 million metrics. Out of
this, we found 158,000 single metric anomalies in a given week, meaning any
anomaly on any metric. This is the result of using our system to do anomaly
detection only at the single metric level, without anomaly scoring and without


metric grouping. Without the means to filter things, the system gives us all the anomalies, and that is typically a very large number. Even though we started with 4 million metrics, 158,000 anomalies is still a very big number, too big to effectively investigate; thus, we need the additional techniques to whittle down that number.

If we look at only the anomalies that have a high significance score (in this case, a score of 70 or above), the number of anomalies drops off dramatically, by more than two orders of magnitude, to just over 910. This is the number of significant anomalies we had for single metrics out of 4 million metrics for one week: 910 of them. Better, but still too many to investigate thoroughly.

The bottom of the funnel shows how many grouped anomalies with high significance we end up with after applying behavioral topology learning techniques. This is a further large reduction, from 910 to 147. This number of anomalies is far more manageable to investigate. Any organization with 4 million metrics is large enough to have numerous people assigned to dig into the distilled number of anomalies, typically looking at those anomalies that are relevant to their areas of responsibility.

Figure 4 does not necessarily show the accuracy of the anomalies; rather, it shows why all these steps are important; otherwise the number of anomalies can be overwhelming. Even if they are good anomalies (they found the right things), it would be impossible to investigate everything in a timely manner. Users would simply stop paying attention because it would take them too long to understand what is happening. This demonstrates the importance of grouping, reducing the forest of 158,000 anomalies to 147 grouped anomalies per week. This goes back to the notion of conciseness covered in the design principles white paper (Part 1 of this series). Concise anomalies help to tell the story of what is happening without being overwhelming, enabling a human to investigate more quickly. Then the business can take advantage of an opportunity that might be presented through the anomaly, or take care of any problem that the anomaly has highlighted.

THE ARCHITECTURE OF AN ANOMALY DETECTION SYSTEM


In a generic sense, any large scale anomaly detection system should follow the
design principles we outlined in the first part of this white paper series. In Figure
5 below, we use Anodot's system as an example to describe the architecture and components of a typical system. Where possible, we will point out how the
Anodot system might differ from others.


Figure 5: The architecture of Anodot's large scale anomaly detection system

The most important requirement of this architecture is that it be scalable to a


very large number of metrics. Anodot achieves this by performing most of the
normal behavior learning as the data flows into our system. We perform machine
learning on the data stream itself. This is shown in the central part of the illustration, Anodotd, labeled "Online Baseline Learning."

The flow of data comes from Customer Data Sources, as shown at the bottom of
the illustration, into what we call Anodotd, or Anodot Daemon, which does the
learning. When a data point comes in from a metric, the system already has the
pre-constructed normal model for that metric in its memory. If there is an
anomaly, it scores it using the abnormal model and sends it to the Anomaly
Events Queue (we use Kafka) on the left side of the illustration. If there is no
anomaly, Anodotd simply updates the model that it has so far and stores that
model in the database.

Many machine learning systems do not work this way. They pull data from a
database, do their learning and then push the data back to a database. However,
if you want the system to scale and find anomalies on 100% of the metrics (because it is unknown which metrics are important), then the learning must be done on all the samples as they come in. If the data has already been stored in a database and then must be pulled out in order to do the learning, the system will not be able to scale up. There is no database system in the world that can both read efficiently and write rapidly at this scale. Enlarging the database system is a possibility, but it will increase costs significantly. Certainly, to get the system to scale, learning
must be done on the data stream itself.


The components on the left side of


the illustration perform the
processes of identifying and
correlating abnormal behavior.
Anomaly events on single metrics
pass through the queue to the
Grouper component which checks
whether to group single metric
anomalies based on the
information from the metric
relationship graph.

All information about the


anomalies is passed to the
Metadata Indexing and Search
Engine. Any changes to an anomaly
will be updated in this Engine.
Anodot uses Elastic Search for this
process, which does not affect the
design of the system.
Figure 6: Anodot in numbers, per day

On the right side of Figure 5 is Hadoop/Spark HIVE offline learning. There are
some processes that Anodot runs offline, for example, the behavioral topology
learning or seasonality detection can be run offline; we do not have to run this
process on the data stream itself. Discovering that one metric is related to
another is not something that will change from data point to data point. Finding
that something has a weekly seasonal pattern does not have to be detected on
every data point that comes in for that metric. There is a price to pay when
processes run on the data stream, often in the form of accuracy. With online
learning, there is no luxury of going back and forth; thus, Anodot performs these
activities offline. This combination of online and offline learning optimizes
accuracy and efficiency.

Not all anomaly detection systems have all these components, but Anodot
believes they are all important to yield the best results in an efficient and timely
manner.


Figure 7: A screenshot of the Anodot Anoboard Dashboard showing anomalies detected in time series
data

THE HUMAN ELEMENT


This white paper series has focused on the technical elements necessary to build an anomaly detection system, a recipe, as it were. But what about the chef? It is not enough to pick up the ingredients at the market; someone still must cook the meal. This brings us to the human factor of an anomaly detection system: the team needed to build the system.

At a minimum, you will need a team of data scientists with a specialty in time
series data and online machine learning. Just as chefs and doctors have their own
specialties, data scientists do as well. While there is a shortage of data scientists
in the market in general, the scarcity is even more acutely felt when searching for
particular specialties such as time series, and you may find yourself in
competition for talent with companies such as Google, Facebook and other
industry giants.

Besides the data scientists, you need a team of developers and other experts to build a system around the algorithms: one that is efficient at scalable stream processing, has solid backend systems, and offers an easy-to-use user interface. At the bare minimum, you would need backend developers creating the data flows, storage, and management of the large scale backend system, in addition to UI experts and developers, QA and product management.

Note that this team not only has to develop and deploy the system, but also maintain it over time.


While one might be tempted to skimp on UI for an internally-developed solution,


this is a mistake. An early investment in UI means that the eventual anomaly
detection system will be able to be used widely in the organization by multiple
teams, with multiple needs. If the UI is not simple enough for everyone to learn
easily, the data science and business intelligence teams will forever find
themselves as the frustrating bottleneck, providing retroactive reports and alerts.

Conversely, the more people using the anomaly detection system within the organization (and the more metrics being analyzed), the more powerful the insights it can provide. For example, at Anodot, we have customers that have hundreds of people on dozens of different teams (from sales to executive management to BI to monitoring to DevOps) using the Anodot system to alert them to anomalies relevant to their areas of responsibility.

Based on our own experience and discussions with our customers who have faced the build or buy decision, we estimate that it would take a minimum of 12 person-years of effort (from a team of data scientists, developers, UI and QA) to build even the most rudimentary anomaly detection system. And this basic system could still encounter various technical issues that are far beyond the scope of this paper.

SUMMARY
Across this series of three white papers, we have covered the critical processes
and various types of learning of a large scale anomaly detection system.

In Part 1, we discussed what an anomaly is, and why a business would


want to detect anomalies. We outlined the five main design considerations
when building an automated anomaly detection system: timeliness, scale,
rate of change, conciseness, and definition of incidents. And finally, we
discussed supervised and unsupervised machine learning methods.

In Part 2, we detailed the processes of learning the normal behavior of


time series data. After all, we have to know what is normal for a business system in order to identify what is not normal: an anomaly. We talked
about creating data models, uncovering seasonality, and the importance
of online adaptive learning models.


In Part 3, this document, we discussed how to identify and correlate


abnormal behavior to determine the significance of anomalies. This
process is critical for distilling the total number of discovered anomalies
into a much smaller number of only the most important anomalies.
Without this distillation process, there would be too many alerts to
investigate in a timely and cost effective manner.

Hopefully these documents have given the reader some insight into the complexity of designing and developing a large scale anomaly detection system. The Anodot system has been carefully designed using sophisticated data science principles and algorithms, and as a result, we can provide our customers with truly meaningful information about the anomalies in their business systems.

For more information, please contact Anodot:

North America
669-600-3120
info.us@anodot.com

International
+972-9-7718707
info@anodot.com

ABOUT ANODOT
Anodot provides valuable business insights through anomaly
detection. Automatically uncovering outliers in vast amounts of time series data,
Anodot's business incident detection uses patented machine learning algorithms
to isolate and correlate issues across multiple parameters in real-time,
supporting rapid business decisions. Anodot customers in fintech, ad-tech, web
apps, mobile apps and other data-heavy industries use Anodot to drive real
business benefits like significant cost savings, increased revenue and upturn in
customer satisfaction. The company was founded in 2014, is headquartered in
Raanana, Israel, and has offices in Silicon Valley and Europe. Learn more
at: http://www.anodot.com/.


Copyright 2017, Anodot. All trademarks, service marks and trade names
referenced in this material are the property of their respective owners.

[i] Anodot uses an enhanced version of the latent Dirichlet allocation (LDA) algorithm in a unique way to
calculate abnormal based similarity. In natural language processing, LDA is a generative statistical
model that allows sets of observations to be explained by unobserved groups that explain why some
parts of the data are similar. For example, if observations are words collected into documents, it posits
that each document is a mixture of a small number of topics and that each word's creation is
attributable to one of the document's topics.

In LDA, each document may be viewed as a mixture of various topics, where each document is
considered to have a set of topics that are assigned to it via LDA. In practice, this results in more
reasonable mixtures of topics in a document.

For example, an LDA model might have topics that can be classified as CAT_related and DOG_related. A topic has probabilities of generating various words, such as "milk," "meow" and "kitten," which can be classified and interpreted by the viewer as CAT_related. Naturally, the word "cat" itself will have high probability given this topic. The DOG_related topic likewise has probabilities of generating each word: "puppy," "bark" and "bone" might have high probability. Words without special relevance, such as "the," will have roughly even probability between classes (or can be placed into a separate category). A topic is not strongly defined, neither semantically nor epistemologically. It is identified on the basis of supervised labeling and (manual) pruning on the basis of their likelihood of co-occurrence. A lexical word may occur in several topics with a different probability, however, with a different typical set of neighboring words in each topic. (Wikipedia)

[ii] TF-IDF is short for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. (Wikipedia)
