Monitoring availability metrics at scale creates a familiar problem: you have a time series, you need to know when it drops, and you need to know this automatically — without someone staring at a dashboard.
This post walks through a statistical algorithm I built to do exactly that. It detects dips in any continuous metric (availability, reachability, error rate) and returns precise start and end timestamps for each event. No ML required — just a modified z-score, two rolling windows, and a few transition rules.
The Problem #
A “dip” in a time series sounds easy to define: the value goes down. But in practice:
- Metrics fluctuate constantly — you don’t want to fire on every small wobble
- Some drops are so brief they’re noise (a single anomalous minute)
- Some recoveries are partial — the metric bounces back briefly before dropping again
- The absolute threshold that matters varies by day, because the baseline isn’t constant
A naive threshold (if value < 0.999, it's a dip) breaks quickly. You either miss real events or drown in false alarms.
The Assumptions #
The algorithm makes two working assumptions:
- The distribution is stationary — the metric doesn’t have predictable time-of-day patterns where lower values are expected. If your metric does have those patterns, you’d extend this with a time-varying baseline.
- Under normal conditions, the metric is roughly normally distributed — a bell curve around a stable central value, with dips appearing as strong negative deviations.
Both hold well for high-level availability metrics aggregated over many requests.
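Both assumptions are easy to sanity-check on your own data. As a toy illustration (entirely synthetic values; numpy and pandas assumed), a stationary, roughly normal metric produces z-scores that rarely stray far below the baseline:

```python
import numpy as np
import pandas as pd

# Hypothetical: one day of minute-level availability, stationary and roughly normal
rng = np.random.default_rng(42)
metric = pd.Series(0.9995 + rng.normal(0, 0.0001, size=1440))

# Normalise against the median and the standard deviation
z = (metric - metric.median()) / metric.std()

# Under these assumptions, strong negative deviations are rare
print((z < -3).mean())  # a tiny fraction of points
```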
Step 1: Compute a Modified Z-Score #
Rather than comparing against a fixed threshold, we normalise each data point against the current distribution of the signal. This makes the algorithm adaptive — the “what counts as a dip” question is answered relative to the recent behaviour of the metric itself.
For each data point $x_i$:
$$z_i = \frac{x_i - \text{reference\_val}}{\sigma}$$
Where:
- `reference_val` is either the median of the dataset (adaptive) or a fixed SLA value (e.g. 0.99999) — configurable
- $\sigma$ is the standard deviation of the full time-series window (typically 24 hours)
A data point is flagged as a potential dip if $z_i < -1$ — i.e. the value is more than one standard deviation below the reference.
```python
stddev = dip_timeline_reindex.std()
reference_val = (
    dip_timeline_reindex.median() if reference == "median" else reference_val
)
comp = dip_timeline_reindex.sub(reference_val) / stddev
dip_threshold = (comp < -1) * 1  # binary: 1 = possible dip, 0 = normal
```

Using the median as reference (rather than the mean) makes the baseline robust to outliers — a single deep dip doesn’t pull the reference value down and cause the algorithm to miss subsequent events.
Step 2: Filter Out Noise with a Minimum Duration Window #
A single minute below threshold is almost certainly noise. We only want to flag an event as a real dip start if it’s sustained — specifically, if the next min_window minutes (default: 5) are also flagged.
We track two additional signals:
- `shift`: the lagged value of `is_dip` — tells us what the previous minute’s state was
- `roll_sum`: a forward-looking rolling sum over the next `max_window` minutes — tells us what’s coming
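The forward-looking sum is the less obvious of the two: shifting the series backwards by `max_window - 1` before applying a trailing rolling sum turns it into a look-ahead. On a toy series (with `max_window` shortened to 3 for readability):

```python
import pandas as pd

max_window = 3  # shortened for illustration; the default is 15
is_dip = pd.Series([0, 0, 1, 1, 1, 0, 0])

# Previous minute's state
shift = is_dip.shift(1).fillna(0)

# roll_sum[i] = is_dip[i] + ... + is_dip[i + max_window - 1]
# (windows that run past either edge of the series are filled with 0)
roll_sum = is_dip.shift(-max_window + 1).rolling(window=max_window).sum().fillna(0)

print(roll_sum.tolist())  # [0.0, 0.0, 3.0, 2.0, 1.0, 0.0, 0.0]
```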
```python
dip_finder["shift"] = dip_finder["is_dip"].shift(1).fillna(0)
dip_finder["roll_sum"] = (
    dip_finder.is_dip.shift(-max_window + 1)
    .rolling(window=max_window)
    .sum()
    .fillna(0)
)
```

Step 3: Label Dip Transitions #
With these three signals (is_dip, shift, roll_sum), we can precisely label each minute as a dip start, dip end, or neither:
$$\text{start\_end}_i = \begin{cases} \text{“dip\_start”} & \text{if } \text{is\_dip}_i = 1 \wedge \text{shift}_i = 0 \wedge \text{roll\_sum}_i \geq \text{min\_window} \\ \text{“dip\_end”} & \text{if } \text{is\_dip}_i = 0 \wedge \text{shift}_i = 1 \wedge \text{roll\_sum}_i = 0 \\ \text{NaN} & \text{otherwise} \end{cases}$$
In plain English:
- Dip start: the current minute is below threshold, the previous minute was not, and at least `min_window` of the coming `max_window` minutes are also below threshold
- Dip end: the current minute is above threshold, the previous minute was not, and the next `max_window` minutes are all above threshold — meaning we’ve genuinely recovered, not just bounced
The max_window check on dip end (default: 15 minutes) is deliberate. Without it, a brief recovery in the middle of a sustained dip would split it into two separate events, making the duration statistics meaningless.
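The effect is easy to demonstrate on a toy timeline (hypothetical values; `max_window` shortened to 3 for readability). A brief two-minute recovery in the middle of a dip keeps `roll_sum` positive, so the dip-end condition only fires once the recovery is sustained:

```python
import pandas as pd

max_window = 3
# A dip, a brief 2-minute recovery, the dip again, then a real recovery
is_dip = pd.Series([1, 1, 0, 0, 1, 1, 0, 0, 0])

shift = is_dip.shift(1).fillna(0)
roll_sum = is_dip.shift(-max_window + 1).rolling(window=max_window).sum().fillna(0)

# The dip_end condition: above threshold now, below last minute,
# and nothing below threshold in the look-ahead window
dip_end = (is_dip == 0) & (shift == 1) & (roll_sum == 0)

print(dip_end.tolist())
# Only index 6 qualifies; the bounce at index 2 is not labelled an end
```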
```python
dip_finder["start_end"] = dip_finder.apply(
    lambda dip: (
        "dip_end"
        if (dip["is_dip"] == 0) & (dip["shift"] == 1) & (dip["roll_sum"] == 0)
        else (
            "dip_start"
            if (dip["is_dip"] == 1)
            & (dip["shift"] == 0)
            & (dip["roll_sum"] >= min_window)
            else np.nan
        )
    ),
    axis=1,
)
```

Step 4: Remove Consecutive Dip Starts #
In practice, you can get runs of consecutive dip_start labels during a noisy entry into a dip. We only want the first one. This is handled by grouping consecutive identical labels and keeping only the first occurrence:
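The trick used here, cumulative-summing label changes to form group ids, is a standard pandas idiom worth seeing in isolation (toy labels):

```python
import pandas as pd

s = pd.Series(["dip_start", "dip_start", "dip_end", "dip_start"])

# A new group id starts whenever the label changes
group_id = (s != s.shift()).cumsum()

# Rank within each run of identical labels, then keep only the first
rank = s.groupby(group_id).cumcount() + 1
keep = s[rank == 1]

print(keep.tolist())  # ['dip_start', 'dip_end', 'dip_start']
```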
```python
dips_only["consecutive_count"] = (
    dips_only["start_end"]
    .groupby((dips_only["start_end"] != dips_only["start_end"].shift()).cumsum())
    .cumcount() + 1
)
dips_only = dips_only.loc[~(dips_only["consecutive_count"] > 1)].copy()
```

Step 5: Pair Starts and Ends #
Finally, dip_start and dip_end events are paired sequentially. Only valid pairs are retained — a start without a following end (e.g. a dip still ongoing at the end of the observation window) is excluded. Duration is computed in both timedelta and minutes.
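The pairing itself can be sketched as follows. This is a minimal version, not the original implementation: it assumes the surviving labelled events sit in a small `events` frame and that, after deduplication, starts and ends alternate beginning with a start:

```python
import pandas as pd

# Hypothetical labelled events after the dedup step
events = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 10:20", "2024-01-01 12:00"]
        ),
        "start_end": ["dip_start", "dip_end", "dip_start"],
    }
)

starts = events.loc[events.start_end == "dip_start", "ts"].reset_index(drop=True)
ends = events.loc[events.start_end == "dip_end", "ts"].reset_index(drop=True)

# Pair sequentially; a trailing start with no matching end
# (a dip still ongoing) is dropped
n = min(len(starts), len(ends))
dip_start_end_df = pd.DataFrame({"dip_start": starts[:n], "dip_end": ends[:n]})

print(len(dip_start_end_df))  # 1 valid pair; the 12:00 start is still open
```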
```python
dip_start_end_df["duration"] = dip_start_end_df.dip_end - dip_start_end_df.dip_start
dip_start_end_df["duration_min"] = dip_start_end_df["duration"].dt.total_seconds() / 60
```

The Parameters #
The algorithm has five configurable parameters:
| Parameter | Default | What it controls |
|---|---|---|
| `min_window` | 5 min | Minimum sustained duration to call something a dip start |
| `max_window` | 15 min | Minimum recovery duration to call something a dip end |
| `reference` | `"median"` | Whether to normalise against the dataset median or a fixed SLA value |
| `reference_val` | 0.99999 | The SLA value, if `reference = "SLA"` |
| `smooth` | `False` | Apply a rolling average before detection (trades sensitivity for noise reduction) |
The smooth parameter deserves a note: it was initially appealing as a way to eliminate false alarms, but in practice it caused the algorithm to miss short but real dips — the smoothing would average away a 3-minute event entirely. For most use cases, leaving it off and relying on min_window to filter noise is the better approach.
Why Not ML? #
A few reasons this approach was chosen over a machine learning model:
Interpretability. When an alert fires, you can trace exactly why: the z-score was below -1 for more than 5 consecutive minutes, the reference value was X, the standard deviation was Y. There’s no black box.
No training data required. The algorithm works on the current 24-hour window. You don’t need historical labelled examples of dips to get started.
Robustness to distribution shift. If the baseline level of a metric drifts over time (e.g. availability naturally improves as infrastructure scales), the median-based reference value adapts automatically.
Low computational overhead. The entire algorithm is vectorised pandas operations — it runs on minute-level data for a 24-hour window in milliseconds.
What This Enables #
The output is a clean dataframe of (dip_start, dip_end, duration, duration_min) tuples. This feeds naturally into downstream analysis: which contributing entities were responsible for the dip during each window, how severe it was relative to the reference, and whether the pattern matches known failure modes.
The algorithm is written in Python for prototyping but is straightforward to port to Scala for production pipeline integration — all the logic is standard window functions and groupby operations that map directly to Spark or Flink semantics.
The full function signature with all parameters:
```python
def find_dip_start_end(
    dip_timeline: pd.DataFrame,
    min_window: int = 5,
    max_window: int = 15,
    smooth: bool = False,
    reference: str = "median",
    reference_val: float = 0.99999,
    metric: str = "avg_availability",
) -> pd.DataFrame:
    ...
```