Faced with these challenges, we started building CueObserve. Below is how we're solving them.
1. Create a Virtual Dataset
Datasets are similar to aggregated SQL views of the data. We write a SQL GROUP BY query with aggregate functions to roll up the data, map its columns as dimensions and metrics, and save it as a virtual Dataset.
Below is a sample GROUP BY query for BigQuery:
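Something along these lines, using a hypothetical orders table (all table and column names here are illustrative):

```sql
SELECT
  TIMESTAMP_TRUNC(CreatedAt, DAY) AS OrderDate,  -- timestamp, rolled up to days
  City AS CityName,                              -- dimension
  SUM(Amount) AS TotalAmount,                    -- metric
  COUNT(*) AS OrderCount                         -- metric
FROM `my_project.my_dataset.orders`
GROUP BY 1, 2
```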
2. Define an Anomaly
We can now define one or more anomaly detection jobs on the dataset. A job can monitor a metric at an aggregate level or split the metric by a dimension.
When we split a metric by a dimension, we limit the number of unique dimension values. We use one of three rules to do so (sketched in pandas after the list):
- Top N: keep the N dimension values that contribute the most to the metric.
- Min % Contribution: keep only the dimension values whose share of the metric exceeds a minimum percentage.
- Minimum Average Value: keep only the dimension values whose average metric value exceeds a minimum.
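As a rough pandas sketch of the three rules (the function and parameter names below are mine, not CueObserve's):

```python
import pandas as pd

def limitDimensionValues(datasetDf: pd.DataFrame, dimensionCol: str,
                         metricCol: str, rule: str, threshold: float) -> list:
    """Pick the dimension values to monitor, per the chosen split rule."""
    grouped = datasetDf.groupby(dimensionCol)[metricCol]
    if rule == "Top N":
        # Rank values by their share of the total metric, keep the top N
        contribution = grouped.sum() / datasetDf[metricCol].sum() * 100
        return contribution.nlargest(int(threshold)).index.tolist()
    if rule == "Min % Contribution":
        # Keep values whose share of the total metric exceeds the minimum percentage
        contribution = grouped.sum() / datasetDf[metricCol].sum() * 100
        return contribution[contribution >= threshold].index.tolist()
    # Minimum Average Value: keep values whose average metric clears the threshold
    average = grouped.mean()
    return average[average >= threshold].index.tolist()
```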
3. Execute Dataset SQL
As the first step in the anomaly detection process, we execute the dataset's SQL query and fetch the result as a Pandas dataframe. This dataframe acts as the source data both for identifying dimension values and for running anomaly detection.
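CueObserve supports several SQL engines; for the BigQuery case, fetching the result could look roughly like this (assuming the dataset's saved query is held in a datasetSql string):

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment
datasetDf = client.query(datasetSql).to_dataframe()  # source dataframe for all later steps
```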
4. Generate Sub-dataframes
Next, we create the new dataframes on which the actual anomaly detection will run. We find the relevant dimension values and create sub-dataframes by filtering on them. Which dimension values qualify is determined by one of the three split rules described above. For example, if the split rule is Top N, an internal method determines the Top N dimension values and returns a list of dicts, each containing the dimension value string, its percentage contribution, and the sub-dataframe (sketched after the snippet below).
A sub-dataframe is simply the source dataframe filtered to one dimension value, with every column dropped except the timestamp and metric columns:
tempDf = datasetDf[datasetDf[dimensionCol] == dimVal][[timestampCol, metricCol]]
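Here is a sketch of what that internal method might look like for the Top N rule. The function name and the dict keys other than “df” are guesses; the “df” key and the aggregateDf call are taken from the next step:

```python
def getTopNSubDataframes(datasetDf, dimensionCol, timestampCol, metricCol, topN):
    """Build one sub-dataframe dict per Top N dimension value (illustrative)."""
    # Percentage contribution of each dimension value to the total metric
    contribution = (
        datasetDf.groupby(dimensionCol)[metricCol].sum()
        / datasetDf[metricCol].sum() * 100
    )
    subDataframes = []
    for dimVal, pct in contribution.nlargest(topN).items():
        # Filter to one dimension value; keep only the timestamp and metric columns
        tempDf = datasetDf[datasetDf[dimensionCol] == dimVal][[timestampCol, metricCol]]
        subDataframes.append({
            "dimVal": dimVal,                         # the dimension value string
            "contribution": pct,                      # its percentage contribution
            "df": aggregateDf(tempDf, timestampCol),  # aggregated sub-dataframe (next step)
        })
    return subDataframes
```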
5. Aggregate Sub-dataframes
One important step in preparing the sub-dataframes is the aggregation on timestamp, applied as each sub-dataframe is built:
"df": aggregateDf(tempDf, timestampCol)
This aggregation groups the filtered sub-dataframe by the timestamp column and sums the metric column. We also rename the timestamp column to “ds” and the metric column to “y”, as Prophet requires the dataframe columns to be named as such.
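A minimal version of that helper, assuming each sub-dataframe carries exactly one metric column besides the timestamp:

```python
def aggregateDf(tempDf, timestampCol):
    # The metric is the only column other than the timestamp in a sub-dataframe
    metricCol = [col for col in tempDf.columns if col != timestampCol][0]
    # Group by timestamp and sum the metric within each period
    aggDf = tempDf.groupby(timestampCol, as_index=False)[metricCol].sum()
    # Prophet expects the time column named "ds" and the value column named "y"
    return aggDf.rename(columns={timestampCol: "ds", metricCol: "y"})
```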
Continue reading: https://towardsdatascience.com/running-timeseries-anomaly-detection-at-scale-on-sql-data-4407eb3d3bd3