### 1. Introduction

Over the next few months I am going to be doing a literature review of common techniques for detecting drift in data sets. What is data drift? My definition will likely change over the course of this review. However, as a starting point, I like to think about it in terms of changes over time in the underlying distribution of whatever process is generating your data. This can also be called non-stationarity. It is actually quite common in real-world settings and can be caused by a variety of factors. One very common cause is that the population itself changes. Another is that the techniques or tools used to observe the population change.

It is important to distinguish between data drift, anomalous data, and noisy data. All three refer to deviations from some reference distribution of data. However, a key distinction is the volume and consistency of the deviations. Deviations that are low volume and/or inconsistent in their placement might be considered noise in the data-generating process. Deviations that are consistent but low volume might point to an endemic problem with the data-generating process, but are not indicative of drifting data. On the other hand, high-volume, consistent deviations may be an indication that the data distribution has actually moved significantly. This latter case is just one type of data drift, and we’ll discuss other types below.

Lastly, it’s important to consider the impact of the data drift. In most cases the data is being used for some machine learning task, whether classification or regression. Drifting data can impact the performance of these models. However, sometimes the data may change in ways that do not affect model performance. Understanding this distinction can help to diagnose model degradation.

For this literature review, I will be sharing an article a week here, starting with A Survey on Concept Drift Adaptation. This paper covers a lot of ground: some of it is interesting for drift detection on its own, but much of it is dedicated to machine learning algorithms that can adapt to drift. For the purpose of this post, we’ll henceforth refer to data drift as concept drift.

### 2. Types of data drift

As can be seen in figure 1 below, the paper distinguishes between two different kinds of drift. First, there is *real concept drift*, namely changes in your data that impact your posterior predictive distribution p(y|X). Second, there is *virtual concept drift*, meaning changes in p(X) that don’t impact the posterior distribution. It is important to distinguish these because in many industrial applications of machine learning, data changes all the time, but it may not matter unless it has a meaningful impact on the posterior distribution. To put it another way, if your model is still performing well, it may not matter that the data changed.
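To make the distinction concrete, here is a minimal sketch of virtual drift. The uniform ranges and the 0.5 decision boundary are illustrative assumptions of mine, not values from the paper: p(X) shifts toward larger inputs, but the labeling rule p(y|X) never changes.

```python
import random

random.seed(0)

def label(x):
    # Fixed labeling rule: p(y|X) never changes.
    return 1 if x > 0.5 else 0

def sample(virtual_drift=False):
    # Under virtual drift the input distribution p(X) shifts toward
    # larger values of x, but the labeling rule above is untouched.
    x = random.uniform(0.5, 1.5) if virtual_drift else random.uniform(0.0, 1.0)
    return x, label(x)

before = [sample() for _ in range(1000)]
after = [sample(virtual_drift=True) for _ in range(1000)]
```

A model that learned the true boundary at x = 0.5 keeps predicting correctly after the shift: monitoring p(X) alone would flag this as drift, while monitoring accuracy would not.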

There are a couple of drivers of concept drift, each of which may or may not impact a model’s performance. First, the class prior probabilities can change. For instance, when you start collecting data, Class A appears 60% of the time; perhaps over time this changes such that Class B now appears 60% of the time. Another change may be in the input space: the class-conditional distribution p(X|y), or in other words the groupings of features that correspond to the original classes, has changed in some way.
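These two drivers can be sketched with a toy generator. The Gaussian class means and the 60/40 split here are illustrative assumptions of mine, not values from the paper:

```python
import random

random.seed(1)

def sample(p_a=0.6, conditional_shift=0.0):
    """Draw one (feature, class) pair.

    p_a               -- prior probability of Class A; lowering it to 0.4
                         models a prior shift where Class B becomes dominant.
    conditional_shift -- offset added to the class means, modeling a change
                         in the class-conditional distribution p(X|y).
    """
    y = "A" if random.random() < p_a else "B"
    mu = {"A": 0.0, "B": 3.0}[y] + conditional_shift
    return random.gauss(mu, 1.0), y

original = [sample() for _ in range(10_000)]            # Class A ~60% of the time
prior_shift = [sample(p_a=0.4) for _ in range(10_000)]  # Class B ~60% of the time
cond_shift = [sample(conditional_shift=1.0) for _ in range(10_000)]
```

Note that under the prior shift p(X|y) is untouched, while under the conditional shift the class frequencies stay the same but the feature groupings move.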

The last point to make here is that there are different time scales and patterns that data drift can take. Perhaps the one most people are aware of is sudden drift, where a meaningful change in the data distribution occurs over a short time window. There is also incremental drift, which takes longer and has many intermediate steps between the starting distribution and the point where it stabilizes. Similarly, there is gradual drift, which is like incremental drift but not as smooth: the data bounces back and forth between the two distributions until it ultimately settles in the new one. Finally, there are recurring concepts, such as seasonal patterns in the data. These may not be data drift, but can look like data drift if you haven’t seen enough data yet. All these types are shown in figure 2 above, taken from the paper.
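The four patterns can be sketched as functions returning the mean of the data-generating distribution at time t. The change points and window sizes below are arbitrary choices for illustration, not from the paper:

```python
import random

random.seed(2)

def sudden(t, change=500):
    # The distribution jumps to the new concept in a single step.
    return 0.0 if t < change else 1.0

def incremental(t, start=400, end=600):
    # Many small intermediate steps between the old and new concept.
    if t < start:
        return 0.0
    if t >= end:
        return 1.0
    return (t - start) / (end - start)

def gradual(t, start=400, end=600):
    # Samples bounce between the two concepts, with the new one
    # becoming more likely until it takes over completely.
    if t < start:
        return 0.0
    if t >= end:
        return 1.0
    return 1.0 if random.random() < (t - start) / (end - start) else 0.0

def recurring(t, period=200):
    # The old concept returns on a fixed cycle, e.g. seasonality.
    return 1.0 if (t // period) % 2 else 0.0
```

Plotting each function over t = 0..1000 reproduces the shapes in the paper's figure: a step, a ramp, a noisy ramp, and a square wave.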

### 3. Algorithms for drift detection

When it comes to detecting change in the data, the survey outlines four main categories of techniques: *sequential analysis*, *control charts*, *monitoring two distributions*, and *contextual* approaches. *Sequential analysis* techniques compare new observations to some trailing or cumulative statistic of the original data; if the deviation of a new observation exceeds some user-defined threshold, an alarm is set off. For example, the Page-Hinkley test computes the distance between the observed data and the mean of all observations up to that point. *Control charts* monitor predictive performance, continuously updating the minimum error rate of online predictions and its variance. Using this variance and minimum error rate, new observations can be monitored according to how many standard deviations from the minimum they are. Similar to sequential analysis, *monitoring two distributions* tracks the data distribution over time. However, rather than comparing the data distribution to trailing statistics, this approach compares the current batch of data to some reference batch. This can be the original batch that a model was trained on, or a periodic snapshot a fixed temporal distance in the past. Finally, we have *contextual* approaches, in which time is explicitly modeled as an input feature to decision models. If a given batch of training data identifies that time feature as important, the batch is determined to be temporally unstable and is not included.
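As a concrete example of the sequential-analysis family, here is a minimal sketch of the Page-Hinkley test for detecting an upward shift in a stream's mean. The `delta` and `lambda_` values are illustrative tuning parameters of my own choosing, not values from the paper:

```python
class PageHinkley:
    """Raises an alarm when the cumulative deviation of observations
    from the running mean rises far above its historical minimum."""

    def __init__(self, delta=0.005, lambda_=5.0):
        self.delta = delta      # tolerance for normal fluctuations
        self.lambda_ = lambda_  # user-defined alarm threshold
        self.n = 0              # observations seen so far
        self.mean = 0.0         # running mean of the stream
        self.cum = 0.0          # cumulative deviation U_t
        self.cum_min = 0.0      # minimum of U_t observed so far

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        # Alarm when the cumulative deviation climbs well past its minimum.
        return (self.cum - self.cum_min) > self.lambda_

# A stream whose mean jumps from 0 to 1 at index 100 (sudden drift).
detector = PageHinkley()
stream = [0.0] * 100 + [1.0] * 100
alarm_at = next((i for i, x in enumerate(stream) if detector.update(x)), None)
```

During the stable phase the cumulative sum only drifts down by `delta` per step, so no alarm fires; once the mean jumps, each observation adds roughly its distance from the stale running mean, and the alarm fires a few steps after the change point.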

### 4. Closing Thoughts

The parts of this survey that I covered provide a good foundation in some of the common problems of modeling non-stationary data. Furthermore, it gives a good theoretical grounding in how to think about types of data drift. Finally, it outlines some of the common detection methods as of 2014. What struck me the most about this survey, and in particular the methods for change detection, was the absence of now-fashionable frameworks such as neural networks, which are nearly unavoidable in every other ML subdomain. While the paper was written in 2014, it still feels very much grounded in classical machine learning techniques. On one hand this is refreshing. On the other hand, many of the techniques lack an elegance that comes from a unified or holistic approach. I am excited to dive further into this literature and see where it has come in the last five years.

### Citations

- Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4), 44.