We frequently find ourselves observing and reacting to changes in our customers and our website. Here are a few of the more common situations that we experience:
An enticing product is added to one of our comparison pages, causing an increase in people selecting products on that comparison page.
A bug is released onto an important landing page, preventing people from continuing onto the next page.
News stories over several days cause a spike in search engine traffic to some of our guide pages.
These types of situations are quickly noticed by people.
‘Oh, traffic has gone up on the site!’ says someone watching real-time data stream at the right moment.
‘Why has this KPI dropped in the last hour?’ asks someone watching a KPI dashboard.
‘This landing page is getting loads of traffic,’ says an analyst looking at landing pages in a weekly trends analysis.
There are a few challenges here. The first is that people with other things to do can’t reliably watch real-time data streams around the clock. Watching distracts from more important tasks, and some things, like meetings or lunch hours, take entire teams away at once. As a result, things get missed. Sometimes something important that could have been known within 10 minutes becomes known three hours later. Sometimes it’s someone outside the most relevant team who spots the unexpected blip that pushes the team into a flurry of action. Not ideal.
The second challenge is that it’s time consuming to check thousands of time series regularly for changes. Although we have a fairly manageable number of KPIs and metrics we’re interested in, splitting them by other variables often makes them unmanageable. We have many thousands of pages on our site; hundreds of groups of these pages; hundreds of values for the device, browser and operating system used by the people accessing our site; a staggering number of search terms from search engine traffic; and many other ways to turn a quick-to-check list of important metrics into a tedious, soul-destroying, coffee-demanding endurance checklist. It’s enough to make analysts want to say, “Let’s just look at the top-level metrics and maybe check some likely variable splits if any of the top-level metrics are looking funny.”
To be clear, these are still challenges for us and we haven’t investigated all the likely detection methods available. But we’ve progressed and continue to do so. My guess is that some of the useful things we’ve found in this area could be useful to some of you, too.
What we want from this area
We primarily want these three things:
To detect large and important changes quickly. For us this means detecting large spikes or dips in important metrics as they’re happening, usually on the timescale of minutes.
To detect less time-sensitive changes eventually, which for us usually means the day after, stretching to a week or a month for less dramatic changes.
To keep the overall number of false positives small, so that each set of points identified as anomalous is actually investigated. We want to save time and get important things noticed quickly. We don’t want an algorithm that constantly says ‘check everything, every minute, every day’.
Some approaches we use
The approaches we’ve taken range from the simple to the complex. A few that we’ve looked at are rules based on domain knowledge, Robust Principal Component Analysis and changepoint detection.
Approach one: Rules based on domain knowledge
There are some simple situations based on domain knowledge and experience that we can look out for in the data. Examples include:
Time series starting or stopping, corresponding to items being added or removed on site. For instance, if a page were added to the site, the number of daily pageviews of that page would be zero until the very first pageviews. Similarly, if a page is removed, the daily pageviews would be zero for each day after the removal date. We can categorise these time series with a simple rule like ‘if a time series has zero counts for the first X time steps, categorise this time series as “new time series started”’.
Minimum/maximum thresholds, corresponding to known danger zones. For instance, if we had zero pageviews of our results pages between 13:00 and 13:30 on a Monday, we would be certain that something was wrong with our site (or our tracking). On the opposite end of the spectrum, we also have known upper limits where our servers might struggle to handle the traffic.
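Rules like these are trivial to automate. Here’s a minimal Python sketch; the window length and thresholds are made-up examples, not our real values:

```python
def rule_flags(counts, lead_zeros=24, low=1, high=50_000):
    """Apply simple domain-knowledge rules to a sequence of counts
    (e.g. hourly pageviews). All thresholds are illustrative only."""
    flags = []
    # 'New time series started': zero counts for the first X time steps.
    if all(c == 0 for c in counts[:lead_zeros]):
        flags.append("new time series started")
    else:
        # Minimum threshold: a known danger zone, e.g. zero pageviews
        # of an important page during a normally busy window.
        if min(counts) < low:
            flags.append("below minimum threshold")
    # Maximum threshold: traffic levels our servers might struggle with.
    if max(counts) > high:
        flags.append("above maximum threshold")
    return flags
```

Each flagged series can then go straight onto a checklist for a human to look at, rather than someone having to eyeball every series.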
Approach two: Robust Principal Component Analysis
A number of useful articles about this method have already been written on the web, so I won’t go into much technical detail here.
This method is useful for finding temporary spikes and dips in time series. It can account for seasonalities as well as work with multiple variables simultaneously. We’ve found it useful in alerting us to unusual changes in less important variables from the previous day, as well as pointing us to any spikes and dips in seasonal time series.
Robust Principal Component Analysis in our case works in the following way:
- Start with a time series, such as hourly views of A Very Important Page over the last 52 weeks.
- Choose an interval of time that corresponds to a season in our variable — a week, for example.
- Convert our time series into a collection of these intervals (in our example, a collection of week-long time series).
- Create a matrix D with each row as one of these intervals, so row one: week one; row two: week two; and so on.
- The actual RPCA step: find a way of representing this matrix as a low rank matrix L, plus a sparse matrix S, plus a (small) error matrix E.
A low rank matrix has rows that are mostly linear combinations of a small subset of rows; a sparse matrix has mostly zeroes as its elements; and the error matrix has very small, normally-distributed values. Our interpretation is that L gives what is expected to happen if everything behaves normally, given the other data points in the cycle/season, and S shows the effect of the anomalous behaviour.
Robust Principal Component Analysis can be done with Netflix's Surus Package in R. If you provide the time series and the periodicity, it will perform the matrix decomposition for you.
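For readers without R, the decomposition itself can be sketched in a few dozen lines of Python with numpy. This is a bare-bones version of the standard inexact augmented Lagrange multiplier (ALM) algorithm for RPCA, not the Surus package itself, and the default penalty uses the common 1/sqrt(max(m, n)) heuristic:

```python
import numpy as np

def rpca(D, s_penalty=None, max_iter=500, tol=1e-7):
    """Decompose D into L (low rank) + S (sparse) + a small error,
    using the inexact augmented Lagrange multiplier method."""
    m, n = D.shape
    if s_penalty is None:
        s_penalty = 1.0 / np.sqrt(max(m, n))  # common default
    norm_D = np.linalg.norm(D, "fro")
    spectral = np.linalg.norm(D, 2)
    Y = D / max(spectral, np.abs(D).max() / s_penalty)  # dual variable
    mu, rho = 1.25 / spectral, 1.5
    S = np.zeros_like(D)
    for _ in range(max_iter):
        # Update L: singular value shrinkage of D - S + Y/mu.
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Update S: element-wise soft thresholding.
        R = D - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - s_penalty / mu, 0.0)
        Z = D - L - S
        Y = Y + mu * Z
        mu *= rho
        if np.linalg.norm(Z, "fro") / norm_D < tol:
            break
    return L, S
```

With each row of D holding one week of hourly counts (shape 52 × 168 in the example above), the large entries of S point at the anomalous hours.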
As an example, we took hourly pageviews of a page on our site over the course of a year and looked at the weeks with the largest spikes or dips. Here are the top two results:
The algorithm that finds the L and S matrices does so by minimising this cost function:
0.5*EuclideanNorm(X-L-S)^2 + LPenalty*NuclearNorm(L) + SPenalty*L1Norm(S)
The first term is half the squared euclidean (Frobenius) norm of X - L - S, measuring how far X is from L + S. The second term, the nuclear norm of L, approximates the rank of L. The third term is the L1 norm of S, i.e. the sum of the absolute values of the elements of the matrix.
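As a concrete check of what the three terms compute, here’s a tiny numpy example (the matrix and the penalty values are made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 4.0]])       # tiny toy 'data' matrix
L = X.copy()                     # X is rank 1, so L can equal X exactly
S = np.zeros_like(X)             # no anomalies in this toy example
l_penalty, s_penalty = 1.0, 0.5  # made-up coefficients

fit = 0.5 * np.linalg.norm(X - L - S, "fro") ** 2  # squared euclidean term
nuclear = np.linalg.norm(L, "nuc")  # nuclear norm: sum of singular values
l1 = np.abs(S).sum()                # L1 norm: sum of absolute values
cost = fit + l_penalty * nuclear + s_penalty * l1
```

Here `fit` and `l1` are zero and `nuclear` is 5 (the single non-zero singular value of L), so the whole cost reduces to `l_penalty * 5`.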
The coefficients (LPenalty and SPenalty) allow us to control how much we care about optimising the sensitivity of the anomaly detection. For instance, if the coefficient for the L-1 norm of S is relatively small, then we’ll see more data points being treated as anomalous. Conversely, if it’s relatively large, we’ll see very few data points treated as anomalous and L being quite close to the original matrix.
Approach three: Changepoint detection
Many changepoint detection methods take the following approach:
- Choose a time series.
- Assume that a known probability distribution with an unknown mean generates the data at each point.
- Find a partition of the series such that the mean is constant within each segment.
How the partition is chosen varies between methods. The changepoint package in R is effective at changepoint detection using several different methods.
Here are some examples using dummy data with the means (in orange) determined using the cpt.mean function from the package. The dummy data points are sampled from three separate normal distributions with different means and variances. The method used is BinSeg (an implementation of the method in this paper: https://www.ime.usp.br/~abe/lista/pdfXz71qDkDx1.pdf) with different values for Q (the maximum number of changepoints).
A useful working method for us is to use the BinSeg method with a small max number of changepoints, apply this to a set of time series and rank the series in descending order of largest step changes.
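The cpt.mean/BinSeg machinery lives in R’s changepoint package, but the idea is easy to sketch in Python. This toy version greedily splits at the point that most reduces within-segment squared error, up to a maximum of q changepoints; it’s a simplified binary segmentation without the package’s penalty handling:

```python
import numpy as np

def best_split(x, min_seg=2):
    """Best single changepoint in x: the index minimising total
    within-segment squared error, plus the error reduction ('gain')."""
    total = ((x - x.mean()) ** 2).sum()
    best_i, best_gain = None, 0.0
    for i in range(min_seg, len(x) - min_seg + 1):
        left, right = x[:i], x[i:]
        cost = (((left - left.mean()) ** 2).sum()
                + ((right - right.mean()) ** 2).sum())
        if total - cost > best_gain:
            best_i, best_gain = i, total - cost
    return best_i, best_gain

def binseg(x, q=3):
    """Greedy binary segmentation: repeatedly split the segment whose
    best split gives the largest gain, until q changepoints are found."""
    x = np.asarray(x, dtype=float)
    segments, cps = [(0, len(x))], []
    while len(cps) < q:
        best = None
        for a, b in segments:
            i, gain = best_split(x[a:b])
            if i is not None and (best is None or gain > best[0]):
                best = (gain, a + i, (a, b))
        if best is None:
            break
        _, cp, seg = best
        cps.append(cp)
        segments.remove(seg)
        segments.extend([(seg[0], cp), (cp, seg[1])])
    return sorted(cps)
```

Given a set of series, sorting them by the gain (or by the jump in segment means) at their top changepoint gives the descending ranking of step changes described above.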
In conclusion, we:
- Use Robust Principal Component Analysis to identify spikes and dips
- Use changepoint detection to identify step changes in data
- Use simple size metrics based on the above algorithms to prioritise anomalies when working with many time series
- Use rules based on domain knowledge to ensure known event indicators aren’t missed