Customer segmentation is one of those problems that sounds straightforward until you actually sit down with the data. In this post I’ll walk through an approach I built for segmenting customers based on their HTTP traffic patterns — the kind of traffic data that tells you not just how much a customer uses a service, but how they use it.
The goal was a “look-alike” segmentation: group customers with similar traffic behaviour into cohorts, so you can make data-driven inferences about what one customer in a group is likely to need based on what others in the same group are already doing.
The Data #
The starting point is monthly HTTP traffic data aggregated at the account level, covering a 12-month rolling window. The raw features are:
- Total Requests — volume of traffic
- Threat Requests — traffic flagged as malicious or suspicious
- API Requests — programmatic/API traffic
- Media Requests — image and video serving traffic
- Web Requests — standard browser/web traffic
The challenge with monthly time series data at the account level: some accounts have gaps — months with zero traffic because they were inactive, not because the data is missing. These need to be treated differently from genuinely absent data points.
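One way to make that distinction explicit is to reindex each account onto the full 12-month window, so that inactive months become explicit zeros rather than silently absent rows. A minimal sketch, assuming a long-format frame with `account_id`, `month`, and `total_requests` columns (names hypothetical):

```python
import pandas as pd

# Hypothetical long-format monthly traffic table: account a1 is
# missing February (inactive), not absent from the dataset
df = pd.DataFrame({
    "account_id": ["a1", "a1", "a2"],
    "month": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-01-01"]),
    "total_requests": [100, 120, 50],
})

# Reindex each account onto the full 12-month window so inactive
# months appear as explicit zero rows
months = pd.date_range("2024-01-01", periods=12, freq="MS")
full_index = pd.MultiIndex.from_product(
    [df["account_id"].unique(), months], names=["account_id", "month"]
)
df = (
    df.set_index(["account_id", "month"])
      .reindex(full_index)
      .fillna({"total_requests": 0})
      .reset_index()
)
```

After this step every account contributes exactly 12 rows, and a zero means "inactive that month" by construction.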
Feature Engineering #
The raw time series needs to be collapsed into a single row per account that captures the character of their traffic, not just a snapshot. Three transformations do most of the work:
1. Traffic composition as percentages
Rather than using raw counts for threats, API, media, and web traffic (which would be dominated by account size), each is expressed as a percentage of total requests:
```python
# Each traffic type as a share of total requests.
# Note: months with zero total_requests produce NaN/inf here and
# need handling (filtering or filling) before aggregation.
df["api_pct"] = df["api_requests"] / df["total_requests"]
df["media_pct"] = df["media_requests"] / df["total_requests"]
df["web_pct"] = df["web_requests"] / df["total_requests"]
df["threats_pct"] = df["threat_requests"] / df["total_requests"]
```

This makes the features size-agnostic: a small account with 80% API traffic looks like a large account with 80% API traffic, which is exactly the behaviour a look-alike model needs.
2. 75th Percentile aggregation
Instead of taking the mean across the 12 months, we take the 75th percentile of each feature per account. This was chosen over the mean for robustness: traffic distributions are highly right-skewed, and the 75th percentile gives a stable representation of typical high-usage behaviour without being pulled by outlier months.
```python
features = (
    df.groupby("account_id")[["total_requests", "api_pct", "media_pct", "web_pct", "threats_pct"]]
    .quantile(0.75)
    .reset_index()
)
```

3. Relative Standard Deviation (Coefficient of Variation)
To capture traffic stability — not just level — we compute the Relative Standard Deviation (RSD) of total requests for each account:
$$\text{RSD} = \frac{\sigma}{\mu}$$
An account with RSD of 10% has very consistent traffic month to month. An account with RSD of 90% has highly variable traffic — potentially seasonal, growing rapidly, or irregularly active. This becomes its own segmentation dimension.
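Computed per account from the monthly series, this is a short groupby. A sketch, assuming the same long-format monthly frame:

```python
import pandas as pd

# Hypothetical monthly totals: a1 is stable, a2 is highly variable
monthly = pd.DataFrame({
    "account_id": ["a1"] * 3 + ["a2"] * 3,
    "total_requests": [100, 110, 90, 10, 200, 90],
})

# RSD = std / mean of monthly totals per account
stats = monthly.groupby("account_id")["total_requests"].agg(["mean", "std"])
stats["rsd"] = stats["std"] / stats["mean"]
```

Here `a1` comes out at RSD = 0.10 (steady) while `a2` lands above 0.9 (erratic), matching the intuition above.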
Outlier Removal: Why Standard Methods Failed #
The standard approaches to outlier removal — IQR-based filtering and standard deviation thresholds — both failed in this case. The reason: the traffic distribution across accounts is extremely right-skewed. Large enterprise accounts send orders of magnitude more traffic than small accounts, which means they appear as statistical outliers under any simple threshold method and get removed — even though they’re exactly the accounts you most want to segment correctly.
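A quick synthetic illustration of the failure mode: draw account volumes from a log-normal (heavily right-skewed) distribution and apply the classic 1.5×IQR rule. A substantial fraction of accounts gets flagged, and they are precisely the largest ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed account traffic volumes
traffic = rng.lognormal(mean=10, sigma=2, size=10_000)

q1, q3 = np.percentile(traffic, [25, 75])
upper = q3 + 1.5 * (q3 - q1)

# The 1.5*IQR rule flags a sizeable share of accounts,
# all of them above the median: the big legitimate accounts
flagged = traffic > upper
```

On this synthetic data roughly a tenth of accounts exceed the upper fence, every one of them a high-volume account rather than a data error.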
The solution was to use k-Nearest Neighbours with anomaly detection to identify outliers relative to their own cluster neighbourhood, rather than relative to the global distribution.
The approach:
- Fit a k-NN model on the feature space
- For each point, compute its distance to its k nearest neighbours
- Flag points whose distance to neighbours exceeds a threshold as anomalies
This lets a large enterprise account be a legitimate member of a “large enterprise” cluster, while still flagging accounts that are genuinely anomalous — e.g. accounts with corrupted data, test accounts with synthetic traffic patterns, or accounts that don’t fit any natural grouping.
In practice, BigQuery ML’s ML.DETECT_ANOMALIES on a k-NN model handles this cleanly in SQL, making it easy to run as part of a data pipeline without a separate Python environment.
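Outside SQL, the same idea fits in a few lines of NumPy: compute each point's mean distance to its k nearest neighbours and flag the extremes. A sketch on synthetic data, with a hypothetical 99th-percentile cut-off standing in for a tuned threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))           # 200 accounts, 5 scaled features
X = np.vstack([X, [[8.0] * 5]])         # one obviously anomalous account

# Pairwise Euclidean distances, with self-distance masked out
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)

k = 5
# Mean distance from each point to its k nearest neighbours
knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)

# Flag points whose neighbourhood distance is extreme
threshold = np.percentile(knn_dist, 99)
outliers = np.where(knn_dist > threshold)[0]
```

The planted anomaly (index 200) scores far above everything else, while dense clusters of genuinely large accounts would score low because they sit close to each other.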
Bucketing into Cohort Dimensions #
With outliers removed, each feature is bucketed into discrete bins. The bucketing strategy differs by feature:
Tercile division (3 buckets) for continuous volume features where the full distribution matters:
- Total Requests (traffic volume) → buckets 1, 2, 3
- Traffic Relative Standard Deviation (variability) → buckets 1, 2, 3
Median split (2 buckets) for percentage features where the distribution was bimodal or showed near-identical tercile edges:
- Threats % → 0 (below median) / 1 (above median)
- API % → 0 / 1
- Media % → 0 / 1
- Web % → 0 / 1
The decision between terciles and median split was empirical: for some features, attempting a tercile split produced bins whose boundaries were nearly identical (e.g. the 33rd and 66th percentiles were both 0%), making the middle bucket meaningless. A median split was more informative in those cases.
The final cohort identifier is a JSON string joining all bucket assignments:
```json
{"traffic": "3", "variability": "1", "threats": "1", "api": "0", "media": "1", "web": "1"}
```

This gives each account a human-readable, interpretable identity. `traffic: 3, api: 0, media: 1` tells you immediately: high-volume account, below-median API traffic, above-median media serving. You don't need to look up what cluster 47 means.
The Pipeline in Full #
```text
Monthly time series (12 months, per account)
        ↓
Calculate percentage features (API, media, web, threats as % of total)
        ↓
Compute 75th percentile per account → one row per account
        ↓
Compute Relative Standard Deviation per account
        ↓
k-NN anomaly detection → remove outliers
        ↓
Tercile bucketing (traffic volume, variability)
Median split (API %, media %, web %, threats %)
        ↓
Join buckets → cohort string
        ↓
Output: one cohort identifier per account
```

Product Recommendations from Cohort Attach Rates #
Once accounts are grouped into cohorts, a natural downstream application is product recommendations. Within each cohort, you can calculate the attach rate of each product — what fraction of accounts in this cohort use each product — and use that to recommend products to accounts in the same cohort that don’t yet have them.
```python
# Attach rate per product per cohort: share of accounts in the
# cohort that hold each product. Note the level-aware division --
# the numerator is indexed by (cohort_id, product), the denominator
# by cohort_id alone, so a plain `/` would fail to align.
attach_rates = (
    cohort_products
    .groupby(["cohort_id", "product"])["account_id"]
    .nunique()
    .div(cohort_products.groupby("cohort_id")["account_id"].nunique(),
         level="cohort_id")
    .reset_index(name="attach_rate")
)

# Top N products per cohort
recommendations = (
    attach_rates
    .sort_values("attach_rate", ascending=False)
    .groupby("cohort_id")
    .head(15)
)
```

The logic: if 70% of accounts in cohort `{traffic: 3, api: 0, media: 1, web: 1}` use product X, and a given account in that cohort doesn't, product X is worth surfacing as a recommendation. This is collaborative filtering at the cohort level — no per-account interaction history required.
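Turning cohort attach rates into per-account recommendations is then an anti-join: every product popular in the account's cohort, minus the products the account already holds. A sketch with a small hypothetical `cohort_products` frame:

```python
import pandas as pd

# Hypothetical account-product holdings with cohort assignments
cohort_products = pd.DataFrame({
    "account_id": ["a1", "a1", "a2", "a3"],
    "cohort_id":  ["c1", "c1", "c1", "c1"],
    "product":    ["X",  "Y",  "X",  "Z"],
})

# Attach rate: share of accounts in the cohort holding each product
attach_rates = (
    cohort_products.groupby(["cohort_id", "product"])["account_id"].nunique()
    .div(cohort_products.groupby("cohort_id")["account_id"].nunique(),
         level="cohort_id")
    .rename("attach_rate")
    .reset_index()
)

# Candidates per account = cohort products minus current holdings
accounts = cohort_products[["account_id", "cohort_id"]].drop_duplicates()
candidates = accounts.merge(attach_rates, on="cohort_id")
held = cohort_products[["account_id", "product"]].drop_duplicates()
recs = (
    candidates.merge(held, on=["account_id", "product"],
                     how="left", indicator=True)
    .query("_merge == 'left_only'")
    .sort_values(["account_id", "attach_rate"], ascending=[True, False])
    .drop(columns="_merge")
)
```

In this toy cohort, product X has a 2/3 attach rate, so account `a3` (which holds only Z) gets X as its top recommendation.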
What Worked and What Didn’t #
What worked well:
- The 75th percentile aggregation was more stable than mean or median for capturing typical account behaviour
- The k-NN outlier approach preserved the high-volume accounts that standard IQR methods would have removed
- The JSON cohort string made the output self-documenting — downstream consumers could understand a cohort without joining to a lookup table
What I’d do differently:
- The fixed 12-month window treats a fast-growing account the same as a stable one with the same average. A growth-rate feature would add a useful dimension
- The tercile/binary bucketing is interpretable but lossy. For downstream ML use cases (rather than human-readable segmentation), keeping the continuous features would be better
- The RSD is a useful variability measure but sensitive to accounts with very few active months. Weighting by number of active months would improve it
Closing Thought #
The appeal of this approach is its interpretability. Every account gets a cohort label that a non-technical stakeholder can read and understand. That matters more than it sounds — in practice, a segmentation model that product teams can interrogate and trust gets used. One that produces opaque cluster IDs gets ignored.
The algorithm is also deliberately simple. No neural networks, no complex dimensionality reduction. Just feature engineering, a sensible aggregation strategy, and a principled bucketing scheme. Simple enough to explain in a meeting, robust enough to run as a production pipeline.