[ML] Improve handling of bucket count variation for mean value anomaly detection #1386

Open
@tveasey

Currently, we use a worst-case estimate of how changes in the count of values in a bucket affect the variance of the bucket mean. This is safe in the sense that it does not generate false positives, but it can lead to large changes in the model plot bounds, and potentially to false negatives, when the count of values in the bucket is low.

Specifically, we assume all measurements are independent, so the variance of the mean statistic is proportional to 1 / "number of samples" in the bucket. If the rate of values is highly variable, this can lead to large increases in the width of the model bounds when the count is low. Unfortunately, this is not calibrated to the actual data behaviour. For example, at the other extreme, if all measurements in each bucket were perfectly correlated, the variance of the mean statistic would not change at all as a function of bucket count.
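To make the two extremes concrete, here is a minimal sketch. It assumes an equicorrelated model in which every pair of measurements in a bucket shares the same correlation `rho`. That model, and the names `variance_of_mean`, `sigma2` and `rho`, are illustrative assumptions and not something the current code uses.

```python
def variance_of_mean(sigma2: float, n: int, rho: float) -> float:
    """Variance of the mean of n values, each with variance sigma2.

    Assumes (hypothetically) that every pair of values has the same
    correlation rho, in which case
        Var(mean) = sigma2 * (1 + (n - 1) * rho) / n.
    """
    if n < 1:
        raise ValueError("a bucket needs at least one value")
    return sigma2 * (1.0 + (n - 1) * rho) / n


# Independent measurements: the variance shrinks as 1 / n.
independent = variance_of_mean(4.0, 8, 0.0)

# Perfectly correlated measurements: the variance ignores the bucket count.
correlated = variance_of_mean(4.0, 8, 1.0)
```

Real data will generally sit somewhere between these two cases, which is why a fixed 1 / n scaling can overstate how much the bounds should widen for low counts.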

It would be possible to estimate the relationship between the bucket count and the sufficient statistics related to data variation for all the residual distributions we fit, since we know the sample count for each bucket. This would also be a more accurate way of calibrating heavy-tailed distributions, like the log-normal, to observed changes in the seasonal variation.

A computationally feasible formulation would be to use linear regression. For example, for the normal model we could fit the linear model (x_i - m)^2 = [c_i s_i] [p_1 p_2]^t for parameters p_1 and p_2, where x_i are the observed bucket values, m is the mean of the x_i, and c_i and s_i are the bucket count and seasonal variance scale, respectively. If we solve this in the least squares sense, we only need to maintain a small set of statistics rather than all bucket values, which is what makes this tractable for us in the streaming setting. The same formulation carries over to the log mean and log variance we estimate for the log-normal distribution, and so on.
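As a rough sketch of why only a small set of statistics is needed: the least squares solution depends on the data only through the 2x2 normal-equation matrix and a length-2 right-hand side, both of which are running sums. The class name `StreamingVarianceRegression` and its methods are hypothetical, the mean `m` is assumed to be supplied by the caller for each bucket, and no ageing or weighting of old buckets is included. Exactly what `c` encodes (the raw count or some function of it) is also left to the caller.

```python
from dataclasses import dataclass


@dataclass
class StreamingVarianceRegression:
    """Least squares fit of (x_i - m)^2 = p_1 * c_i + p_2 * s_i from running sums."""

    # Entries of the symmetric 2x2 matrix A^T A, where row i of A is [c_i, s_i].
    cc: float = 0.0
    cs: float = 0.0
    ss: float = 0.0
    # Entries of the vector A^T y, where y_i = (x_i - m)^2.
    cy: float = 0.0
    sy: float = 0.0

    def add(self, x: float, m: float, c: float, s: float) -> None:
        """Fold one bucket into the running sums; the bucket value is then discarded."""
        y = (x - m) ** 2
        self.cc += c * c
        self.cs += c * s
        self.ss += s * s
        self.cy += c * y
        self.sy += s * y

    def solve(self) -> tuple[float, float]:
        """Solve the 2x2 normal equations for (p_1, p_2)."""
        det = self.cc * self.ss - self.cs * self.cs
        if abs(det) < 1e-12 * max(self.cc * self.ss, 1e-300):
            raise ValueError("regressors are collinear; parameters are not identifiable")
        p1 = (self.ss * self.cy - self.cs * self.sy) / det
        p2 = (self.cc * self.sy - self.cs * self.cy) / det
        return p1, p2


# Usage: synthetic buckets constructed so that (x - m)^2 = 2 * c + 3 * s exactly.
fit = StreamingVarianceRegression()
for c, s in [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0)]:
    m = 10.0
    x = m + (2.0 * c + 3.0 * s) ** 0.5
    fit.add(x, m, c, s)
p1, p2 = fit.solve()
```

The state is five floats regardless of how many buckets have been seen. Note that the fit is only identifiable when c_i and s_i are not proportional to each other across buckets, hence the collinearity check in `solve`.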
