[{"content":"","date":"19 August 2026","externalUrl":null,"permalink":"/","section":"Inês Garcia","summary":"","title":"Inês Garcia","type":"page"},{"content":"","date":"19 August 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/churn/","section":"Tags","summary":"","title":"Churn","type":"tags"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/data-science/","section":"Tags","summary":"","title":"Data Science","type":"tags"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/machine-learning/","section":"Tags","summary":"","title":"Machine Learning","type":"tags"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/statistics/","section":"Tags","summary":"","title":"Statistics","type":"tags"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/survival-analysis/","section":"Tags","summary":"","title":"Survival Analysis","type":"tags"},{"content":"","date":"13 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"A few years ago, I built a churn model for a B2B SaaS product. Logistic regression, binary label, 30-day prediction window. It performed fine. The business used it. I moved on.\nWhat bothered me was a question the model couldn\u0026rsquo;t answer: how long does a customer actually stay?\nA customer who churns at month 2 and one who churns at month 18 are identical in a binary classifier — both labeled \u0026ldquo;churned.\u0026rdquo; But they\u0026rsquo;re completely different business problems. One is an onboarding failure. 
The other is a pricing issue, a competitive loss, or accumulated dissatisfaction that took a year and a half to complete. The interventions are different. The teams responsible are different. The economics are different.\nI eventually built a survival analysis instead. This post is what I learned.\nWhat survival analysis does differently # Survival analysis treats time-to-event as the outcome. Instead of \u0026ldquo;will this customer churn?\u0026rdquo; it asks \u0026ldquo;when will this customer churn, and what accelerates or delays that?\u0026rdquo;\nThe technical framing: you\u0026rsquo;re modelling a duration variable that, for some customers, you haven\u0026rsquo;t fully observed yet. A customer who joined six months ago and hasn\u0026rsquo;t churned has survived at least six months. You don\u0026rsquo;t know when they\u0026rsquo;ll leave. Standard classification models either drop these customers or treat them as permanent non-churners — both wrong. Survival analysis handles them correctly as censored observations: they contribute the information they have (at least six months of survival) without overclaiming.\nThis matters more than it sounds. In most SaaS datasets, the majority of your customers are still active. A model that ignores them or misrepresents them is working with a fundamentally biased sample.\nTwo tools, two questions # Kaplan-Meier: what does retention actually look like? # The Kaplan-Meier estimator produces a survival curve: at each point in time, what fraction of customers are still subscribed?\nfrom lifelines import KaplanMeierFitter kmf = KaplanMeierFitter() kmf.fit(durations=df[\u0026#39;duration_months\u0026#39;], event_observed=df[\u0026#39;churned\u0026#39;]) kmf.plot_survival_function() The curve steps down at each churn event. The shaded band is the 95% confidence interval — it widens at longer durations where fewer customers remain. 
Don\u0026rsquo;t over-interpret the tail.\nThe useful thing is you can fit separate curves for segments and compare them directly:\nimport matplotlib.pyplot as plt fig, ax = plt.subplots() for segment in df[\u0026#39;segment\u0026#39;].unique(): mask = df[\u0026#39;segment\u0026#39;] == segment kmf.fit( durations=df.loc[mask, \u0026#39;duration_months\u0026#39;], event_observed=df.loc[mask, \u0026#39;churned\u0026#39;], label=segment ) kmf.plot_survival_function(ax=ax, ci_show=False) Higher curve = better retention. Wider separation between curves = segment is a strong predictor of churn. The log-rank test tells you whether the difference is statistically significant:\nfrom lifelines.statistics import multivariate_logrank_test result = multivariate_logrank_test( df[\u0026#39;duration_months\u0026#39;], df[\u0026#39;segment\u0026#39;], df[\u0026#39;churned\u0026#39;] ) print(f\u0026#34;p = {result.p_value:.2e}\u0026#34;) Cox Proportional Hazards: what\u0026rsquo;s driving it? # Kaplan-Meier shows that segments differ. Cox tells you why — which factors increase or decrease churn risk, controlling for everything else simultaneously.\nThe model:\nh(t) = h₀(t) · exp(β₁x₁ + β₂x₂ + ... + βₚxₚ) The output is a hazard ratio per covariate. HR = 1.4 means 40% higher churn risk. HR = 0.7 means 30% lower churn risk. 
HR = 1.0 means no effect.\nfrom lifelines import CoxPHFitter cph = CoxPHFitter(penalizer=0.01) cph.fit(cox_df, duration_col=\u0026#39;duration_months\u0026#39;, event_col=\u0026#39;churned\u0026#39;) cph.plot(hazard_ratios=True) # forest plot — the most useful output The concordance index tells you how well the model ranks customers by risk:\nprint(f\u0026#34;C-index: {cph.concordance_index_:.3f}\u0026#34;) # 0.5 = random, 0.7+ = useful, 0.8+ = strong What I found # The dataset was enterprise contract customers for a B2B SaaS product — around 23,000 subscription records across ten product lines, roughly 44% churned.\nRetention curve:\nMilestone Retention 6 months 96.8% 12 months 76.3% 24 months 55.7% 36 months 36.2% Median lifetime: 30 months.\nOne important caveat for this population: these are contract customers with at least annual terms. Near-zero churn before month 6 is structurally expected — they can\u0026rsquo;t easily leave mid-contract. The meaningful signal starts at the first renewal window. The 20-point drop between month 6 and month 12 is where annual contracts come up for renewal. That\u0026rsquo;s not a surprise; it\u0026rsquo;s the shape of the problem.\nThe Cox model (C-index 0.78) surfaces the factors that differentiate which accounts make it through renewal:\nAccount segment dominated the results. Some segments showed 10–16× higher churn risk than the reference segment, controlling for product, region, and revenue. These differences reflect contractual dynamics — competitive renewal processes, procurement cycles, shorter initial terms — not product quality. Region was consistently protective. EMEA and North American accounts churned 30–40% less than the APAC baseline, all else equal. Product line mattered significantly. Some products showed 6× higher churn than others. Worth separating in the model rather than lumping all product lines together. MRR had a near-zero independent effect once segment and product were controlled for. 
The apparent revenue effect in simpler models was confounding. The assumption check you can\u0026rsquo;t skip # Cox PH has a key assumption: hazard ratios are constant over time. A segment that has 3× higher churn risk at month 12 must also have 3× higher churn risk at month 24. In practice this often fails — and it failed here.\nThe test:\nfrom lifelines.statistics import proportional_hazard_test results = proportional_hazard_test(cph, cox_df, time_transform=\u0026#39;rank\u0026#39;) failing = results.summary[results.summary[\u0026#39;p\u0026#39;] \u0026lt; 0.05].index.tolist() 15 covariates failed. Alarming until you notice the pattern: most were one-hot dummies from the same two or three source variables (account segment, product line, downgrade count). The violation isn\u0026rsquo;t spread randomly — it\u0026rsquo;s concentrated in a few original variables.\nThe fix: stratify on the source variable, not each dummy.\nStratification lets each group have its own baseline hazard curve while still estimating shared hazard ratios for the remaining covariates. You lose the hazard ratio for the stratified variable; you get correct estimates for everything else.\ncph_strat = CoxPHFitter(penalizer=0.01) cph_strat.fit( cox_strat_df, duration_col=\u0026#39;duration_months\u0026#39;, event_col=\u0026#39;churned\u0026#39;, strata=[\u0026#39;account_segment\u0026#39;, \u0026#39;product_line\u0026#39;, \u0026#39;downgrade_count\u0026#39;, \u0026#39;region\u0026#39;] ) What happened when I did this: MRR\u0026rsquo;s hazard ratio collapsed to 1.00 (p = 0.93). Once segment, product, region, and downgrade behaviour were properly controlled, MRR had no independent effect on churn. The apparent MRR signal in the non-stratified model was pure confounding.\nThe C-index dropped to 0.505 — barely above chance. This is correct: I stratified away all the predictive variance. Within a fixed stratum, the only remaining covariate (MRR) doesn\u0026rsquo;t predict churn. 
The non-stratified model (C = 0.78) is still the right tool for ranking accounts by risk. The stratified model is the right tool for getting assumption-correct estimates of specific effects.\nThis trade-off — model correctness vs. predictive power — is the main practical tension in survival analysis with real business data. The resolution is usually better features. Behavioral data from the first 60–90 days of a subscription (usage volume, feature adoption, login frequency) tends to satisfy the PH assumption more cleanly than firmographic segment labels, and it\u0026rsquo;s far more predictive.\nScoring individual accounts # Beyond population-level curves, Cox PH generates a predicted survival function per customer. This is how the analysis becomes operational.\n# Rank all active accounts by churn risk features = cox_df.drop(columns=[\u0026#39;duration_months\u0026#39;, \u0026#39;churned\u0026#39;]) risk = cph.predict_partial_hazard(features.loc[cox_df[\u0026#39;churned\u0026#39;] == 0]) risk_percentile = risk.rank(pct=True) Sort descending. The top accounts are your highest renewal risk. Hand the list to whoever runs renewal conversations — with enough lead time to act.\nWhat the method gets you that classification doesn\u0026rsquo;t # Time-aware LTV. Survival curves give you the expected retention trajectory per segment. LTV is the area under the curve, weighted by revenue — not an average tenure calculated from churned accounts only. Danger periods. The hazard function shows when in the lifecycle churn risk peaks. For contract customers that\u0026rsquo;s structurally at renewal, but the height of the spike varies by segment. That variance is where the model earns its keep. Evaluation. If you run a retention program, you can compare survival curves before and after with a log-rank test. The same framework that generates the analysis can measure whether an intervention worked. Where it falls short # It only sees what billing sees. 
A customer who has stopped using the product is \u0026ldquo;active\u0026rdquo; until the contract formally ends. Usage signals — logins, API calls, feature adoption — are the leading indicators. This model doesn\u0026rsquo;t have them.\nNo competing risks. Voluntary cancellation, payment failure, and non-renewal are all treated as the same event. They\u0026rsquo;re not. Each has a different cause and a different response. Separating them is the right next step.\nPoint-in-time features are hard. The look-ahead bias problem is real and underappreciated. A feature like \u0026ldquo;total downgrades\u0026rdquo; uses lifetime data — including future events for active customers. For a descriptive analysis that\u0026rsquo;s fine. For a live risk scorer it\u0026rsquo;s not. Building it correctly requires historical snapshots of customer state, not just current state. Most data warehouses don\u0026rsquo;t have this by default.\nThe short version # Survival analysis is the right tool when the question is about duration, not just outcome. For SaaS churn specifically:\nUse Kaplan-Meier to describe retention by segment and identify when churn concentrates. Use Cox PH to isolate which factors drive churn controlling for each other. Check the proportional hazards assumption — it will fail for some variables, and that failure is informative. Use the partial hazard score to rank active accounts by risk and operationalize the model. The biggest improvement available is better features: behavioral data from early in the subscription, built from point-in-time historical snapshots. lifelines is the right Python library to start with. The documentation is genuinely good.\n","date":"13 April 2026","externalUrl":null,"permalink":"/posts/survival-analysis-for-churn/","section":"Posts","summary":"A few years ago, I built a churn model for a B2B SaaS product. Logistic regression, binary label, 30-day prediction window. It performed fine. The business used it. 
I moved on.\nWhat bothered me was a question the model couldn’t answer: how long does a customer actually stay?\n","title":"Why I stopped using logistic regression for churn","type":"posts"},{"content":" Hi, I\u0026rsquo;m Inês Garcia # Senior Product \u0026amp; Decision Data Scientist with 7+ years of experience working with large-scale behavioral and traffic data — TB-scale, millions of users. I work at the intersection of statistical modeling, machine learning, and product strategy, mostly in high-growth technology environments.\nI built this blog to write about things I\u0026rsquo;m learning, building, and thinking about. Expect data science, engineering, the occasional opinion, and a few rabbit holes.\nBased in Lisbon, Portugal.\nWhat I\u0026rsquo;m working on # I\u0026rsquo;m currently deep into open-source analytics agents — autonomous, LLM-powered systems that can query, analyze, and reason over data without step-by-step human intervention. The research question I\u0026rsquo;m exploring: can a Senior Data Analyst at a company like Cloudflare build a production-grade analytics agent that runs entirely on Cloudflare\u0026rsquo;s own infrastructure (Workers AI, Durable Objects, Vectorize), with no data leaving the network?\nThe agent accepts natural language questions, decides which tools to call (GraphQL API, SQL, internal metric definitions), executes them, reflects on its own output, and returns a grounded, auditable answer — with the exact query it ran attached to every number it cites.\nI\u0026rsquo;m mapping this against real production systems (Meta\u0026rsquo;s multi-agent data warehouse, Cloudflare\u0026rsquo;s Security Analytics AI Assistant) and open-source patterns (LangGraph, smolagents, DSPy) to figure out where the state of the art actually is, and where the gaps are.\nExperience # Cloudflare Nov 2021 – Present Senior Data Analyst, Data Intelligence \u0026amp; Analytics · Lisbon Working with TB-scale traffic and product usage datasets to 
identify behavioral patterns, adoption drivers, and revenue opportunities. Built a near-real-time anomaly detection system using statistical baselines (median modeling with dynamic standard deviation thresholds). Developed survival analysis models on subscription datasets. Built statistical attribution models integrating product usage, CRM, and billing data. Led NLP analysis of internal JIRA ticket data to surface operational bottlenecks. aicep — Portuguese Agency for Trade \u0026amp; Investment Jul 2020 – Nov 2021 Data Scientist, Product Department · Lisbon Led data science initiatives for trade and investment analysis. Designed econometric prediction models and built automated Python pipelines that reduced manual data processing by 95%. GroupM — Marketing and Advertising Jan 2020 – Jul 2020 Data Scientist, Business \u0026amp; Science Department · Lisbon Built automated Python pipelines integrating social media and campaign performance data for large-scale marketing analytics. PwC Portugal Jul 2017 – Jun 2019 Data Analyst, Research and Analysis Department · Lisbon Delivered quantitative analysis for executive consulting decisions. Automated data processing workflows (VBA, SQL), reducing manual effort by 80%. Managed and mentored a team of nine analysts. 
Education # Data Science Post Graduation 2024 Instituto Superior Técnico · Universidade Nova de Lisboa Supervised \u0026 Unsupervised Learning, Deep Learning, Time Series Forecasting Postgraduate Diploma in Intelligence Management and Security 2020 NOVA IMS · Universidade Nova de Lisboa Structured Analytical Techniques, Social Network Intelligence, Economic and Competitive Intelligence Strategy Intelligence MA 2020 Instituto Superior de Ciências Sociais e Políticas International Relations Bachelor Degree 2014 Instituto Superior de Ciências Sociais e Políticas Skills # Programming: Python, SQL, PySpark, Scala, JavaScript\nData Science \u0026amp; Statistics: Statistical Modeling, Anomaly Detection, NLP, Econometrics, Time Series Forecasting, Machine Learning, Behavioral Analysis, Cohort Analysis\nData Platforms: BigQuery, Airflow, Distributed Data Processing\nVisualization: PowerBI, Tableau, Looker, Plotly\nCertifications # Natural Language Processing with Classification and Vector Spaces — DeepLearning.ai / Coursera (2021) Mathematics for Machine Learning — Imperial College London / Coursera (2020) Data Science Bootcamp — Ironhack Portugal (2019, 420h) ","date":"9 April 2026","externalUrl":null,"permalink":"/about/","section":"About","summary":"Hi, I’m Inês Garcia # Senior Product \u0026 Decision Data Scientist with 7+ years of experience working with large-scale behavioral and traffic data — TB-scale, millions of users. 
I work at the intersection of statistical modeling, machine learning, and product strategy, mostly in high-growth technology environments.\n","title":"About","type":"about"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/agents/","section":"Tags","summary":"","title":"Agents","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"AI","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/analytics/","section":"Tags","summary":"","title":"Analytics","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/career/","section":"Tags","summary":"","title":"Career","type":"tags"},{"content":"I need to say something that makes some data analysts uncomfortable: the job is changing. Not disappearing — changing. And the analysts who understand the change will thrive. The ones who don\u0026rsquo;t will spend the next five years fighting it.\nHere\u0026rsquo;s what I\u0026rsquo;ve learned building analytics agents at Cloudflare.\nWhat\u0026rsquo;s Actually Changing # The fear is usually: \u0026ldquo;AI will write SQL, so SQL skills become worthless.\u0026rdquo;\nThat\u0026rsquo;s wrong. Here\u0026rsquo;s what\u0026rsquo;s actually happening:\nThe volume of SQL is exploding. Ten times more queries will be run by organizations in 2027 than in 2024. Most of those will be generated by agents, not humans. But someone has to:\nDefine what the queries should look like Verify that the answers are correct Encode the domain knowledge that makes answers trustworthy Build the evaluation systems that catch errors Explain what the data means to stakeholders That\u0026rsquo;s the data analyst\u0026rsquo;s job. It just got more important, not less.\nThe New Skills That Matter # 1. Evaluation design\nThe most valuable skill in analytics agent development is building evaluation datasets. 
This means curating real questions, verifying correct answers, and designing metrics that distinguish good responses from bad ones. It requires deep data knowledge. It\u0026rsquo;s not automatable. And most teams are terrible at it.\nIf you build evaluation datasets well, you become indispensable.\n2. Data contract design\nAgents need to know what data is reliable, what\u0026rsquo;s sampled, what\u0026rsquo;s experimental, and what the business definitions are. Defining and documenting these data contracts — the \u0026ldquo;analytical truth layer\u0026rdquo; — is the domain of the data analyst. Do this well and every agent in your organization depends on your work.\n3. Prompt and tool architecture\nWriting agent system prompts is a new craft that combines:\nDeep domain knowledge (what does your data mean?) Clear technical specification (what can the agent do?) Edge case awareness (what will go wrong?) Data analysts have the domain knowledge. They just need to learn the new syntax.\n4. Stakeholder translation\nAn analytics agent that produces a technically correct answer but can\u0026rsquo;t explain it to a PM is useless. Someone needs to translate between the agent\u0026rsquo;s output and business decisions. That\u0026rsquo;s a data analyst job.\nThe Mindset Shift # The analysts I\u0026rsquo;ve seen struggle with agents share a common mindset: they see the agent as a replacement for their skills. Write SQL → agent writes SQL. Create analysis → agent creates analysis. Therefore: I\u0026rsquo;m obsolete.\nThe analysts who thrive have a different mindset: the agent is a multiplier on my domain knowledge. I know what our traffic data means. The agent is fast and tireless. Together, we can answer 10x more questions than I could alone — and the hard questions still need me.\nYour value is your judgment, your domain knowledge, and your ability to catch the agent when it\u0026rsquo;s wrong. Those things are not automatable. 
They become more valuable as agents handle routine work.\nThe Concrete Action Plan # This week:\nPick one analytics question you answer repeatedly — manually, weekly, quarterly Write down the exact query you use and why Note every \u0026ldquo;gotcha\u0026rdquo; — the sampling issues, the definitions, the edge cases This month:\nBuild a 10-question mini-evaluation dataset for that question type Try one open-source analytics agent framework (smolagents, LangChain, Cloudflare Agents SDK) Run your 10 questions through it. Measure accuracy. This quarter:\nIdentify your organization\u0026rsquo;s most common analytics questions Document them in a structured format: question, expected answer type, required data, known gotchas Present a proposal: \u0026ldquo;Here\u0026rsquo;s what an analytics agent for our team could answer automatically.\u0026rdquo; This year:\nOwn the evaluation system for your team\u0026rsquo;s analytics agent Become the person who catches agent errors before they reach stakeholders Write the internal cookbook — the documented patterns that work The era of analytics agents is not coming. It\u0026rsquo;s here. The question is whether you\u0026rsquo;ll lead the transition or follow it.\n","date":"8 April 2026","externalUrl":null,"permalink":"/posts/data-analyst-survival-guide-agentic-era/","section":"Posts","summary":"I need to say something that makes some data analysts uncomfortable: the job is changing. Not disappearing — changing. And the analysts who understand the change will thrive. 
The ones who don’t will spend the next five years fighting it.\n","title":"The Data Analyst's Survival Guide to the Agentic Era","type":"posts"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/airflow/","section":"Tags","summary":"","title":"Airflow","type":"tags"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/bigquery/","section":"Tags","summary":"","title":"Bigquery","type":"tags"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/billing/","section":"Tags","summary":"","title":"Billing","type":"tags"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/data-engineering/","section":"Tags","summary":"","title":"Data Engineering","type":"tags"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/data-pipelines/","section":"Tags","summary":"","title":"Data Pipelines","type":"tags"},{"content":"Earlier this year I shipped a pipeline rewrite I\u0026rsquo;m genuinely proud of. It replaced a 2,200-line SQL monolith — one of those files that everyone\u0026rsquo;s afraid to touch — with a clean layered architecture that handles 14 products, runs daily, and can be extended by adding a handful of config files.\nThis post is about how it\u0026rsquo;s built, why the design works, and the trade-offs I made along the way.\nA note on the execution layer: the pipelines in this post run on Jetflow, a configuration-driven data ingestion framework built at Cloudflare that ingests 141 billion rows per day. Jetflow is going open source — you\u0026rsquo;ll see why it matters to this design as we go.\nThe Problem # The existing pipeline was a single large SQL file. It worked — mostly. But it had accumulated years of product-specific logic, hardcoded dates, and patches on patches. Adding support for a new product meant understanding the whole file before you could safely modify any of it. 
When something broke, there was no way to rerun just one product; you reran everything or nothing.\nThe pipeline\u0026rsquo;s job is to take raw daily usage events and produce monthly billing-ready numbers: one row per customer, per product, per billing metric, for the current month. This feeds directly into the data that determines what customers owe.\nAt its core, the logic isn\u0026rsquo;t that complex. The complexity had accreted around it.\nThe Design # The rewrite has five conceptual layers:\nLayer 0: Product Metadata Registry ↓ Layer 1: Daily Unpivot Views (wide → long format) ↓ Layer 2: Monthly Aggregation (via a shared table function) ↓ Layer 3: Cross-product Union View ↓ Layer 4: Consumer Views (billing, cap tracking, etc.) Let me walk through each one.\nLayer 0: The Product Metadata Registry # The most important decision in the whole design was centralizing business logic into a single metadata view.\nEvery product has:\nA raw metric name in the source table A human-readable display name A billing product identifier (for cap matching) A unit of measure (TB, 1M requests, GB-month, etc.) Whether billing rolls up at the zone level or the account level A monthly aggregation rule — more on this below All of this lives in one place: a BigQuery view built from an UNNEST of a typed array literal. No separate table. No migration scripts. If you need to change the aggregation rule for one product, you change one row in one file.\nThe aggregation rules are where the real business logic lives:\nRule Meaning Example product SUM Add up all daily values API requests, operations LAST_DAY Take the value from the last day of the month Hostnames, seats (headcount-style metrics) SUM_AVG_30 Sum, then divide by 30 Storage billed by TB-month SUM_AVG_720 Sum, then divide by 720 Object storage billed by byte-hours MAX Highest value seen in the month Seat-count maximums These rules aren\u0026rsquo;t arbitrary — they reflect how different product contracts actually work. 
Object storage isn\u0026rsquo;t billed by how much you stored today; it\u0026rsquo;s billed by the average you stored over the month. Encoding this in a metadata registry means the aggregation logic is DRY across all products.\nLayer 1: Daily Unpivot Views # The upstream daily usage tables are in wide format — one column per metric. A CDN usage table might have 26 columns: total requests, data transfer, and then both of those broken out across 12 geographic regions.\nThe obvious fix would be to reshape those source tables. We didn\u0026rsquo;t do that — and not by choice. Those tables had too many existing consumers (dashboards, other pipelines, downstream models) that depended on their exact schema. Changing the shape of a wide table in a shared data lake is the kind of thing that breaks things quietly and at a distance. So the source tables were off-limits.\nInstead, we added a layer of read-only views on top of them. Each product gets an UNPIVOT view that melts its wide daily table into a long format with two columns: product_metric (the column name as a string) and usage_value (the numeric value). The source tables are untouched; everything downstream of the views gets a consistent schema.\nSELECT * FROM ( SELECT product_name, event_date, account_id, zone_id, CAST(requests AS FLOAT64) AS requests, CAST(bytes AS FLOAT64) AS bytes, -- ... all other metric columns cast to FLOAT64 FROM cdn_usage_daily ) UNPIVOT (usage_value FOR product_metric IN (requests, bytes, ...)) A few things worth noting:\nAll metric columns are explicitly cast to FLOAT64 in a subquery before the UNPIVOT, because the metric columns no longer exist by name once UNPIVOT has run. Source columns vary in type. Making this explicit prevents type errors downstream and ensures the union across all products later works cleanly.\nBot Management only has one metric, so it uses a plain SELECT with a hardcoded string literal instead of an UNPIVOT clause — because the UNPIVOT overhead isn\u0026rsquo;t worth it for a single column.\nZone-level vs. account-level products are different. 
Products like CDN bill per zone (a customer might have 50 zones). Products like object storage or email security bill per account. This distinction is flagged in the metadata registry and shapes the filtering logic.\nLayer 2: A Shared Table Function for Aggregation # This is the most technically interesting piece.\nInstead of writing aggregation SQL for each of 14 products, I wrote a single BigQuery table function — a parameterized, reusable piece of SQL that accepts a table as input and returns an aggregated table as output.\nCREATE OR REPLACE TABLE FUNCTION monthly_usage_aggregate(vb_daily_table TABLE\u0026lt;...\u0026gt;, event_date_var DATE) AS ( WITH mapping_enrichment AS ( SELECT t.*, m.monthly_aggregation_rule, m.billable_level, m.metric_unit, -- ... FROM vb_daily_table t JOIN product_mapping m USING (product_name, product_metric) WHERE DATE_TRUNC(event_date, MONTH) = DATE_TRUNC(event_date_var, MONTH) HAVING billing_product IS NOT NULL -- drop non-billable metrics ), monthly_aggregations AS ( SELECT DATE_TRUNC(event_date, MONTH) AS month_date, -- dimensions ... CASE monthly_aggregation_rule WHEN \u0026#39;SUM\u0026#39; THEN SUM(usage_value) WHEN \u0026#39;LAST_DAY\u0026#39; THEN ARRAY_AGG(usage_value ORDER BY event_date DESC LIMIT 1)[OFFSET(0)] WHEN \u0026#39;SUM_AVG_30\u0026#39; THEN SUM(usage_value) / 30 WHEN \u0026#39;SUM_AVG_720\u0026#39; THEN SUM(usage_value) / 720 WHEN \u0026#39;MAX\u0026#39; THEN MAX(usage_value) END AS usage_value FROM mapping_enrichment GROUP BY ALL ) SELECT m.*, SAFE_DIVIDE(usage_value, multiplier) AS normalized_usage FROM monthly_aggregations m JOIN uom_multipliers USING (metric_unit) ) Each product then calls this function with its own unpivot view:\nSELECT * FROM monthly_usage_aggregate( (SELECT * FROM cdn_usage_daily_unpivot WHERE is_billable_zone), \u0026#39;{event_date}\u0026#39; ) This is a sophisticated use of BigQuery\u0026rsquo;s table-valued function feature. The aggregation logic is defined once and tested once. 
Adding a new product doesn\u0026rsquo;t require touching it at all.\nOne design detail I\u0026rsquo;m particularly pleased with: the HAVING billing_product IS NOT NULL clause at the end of mapping_enrichment. Upstream daily tables often contain metrics that are tracked for observability but aren\u0026rsquo;t billed — intermediate counts, debug signals, things like that. Rather than maintaining an explicit exclusion list, the HAVING clause silently drops anything that doesn\u0026rsquo;t have a billing product mapping. The metadata registry acts as an allowlist.\nBandwidth deduplication required special handling. CDN pricing is regional — a customer might have a cap specifically for North America traffic. For non-bandwidth metrics, a simple SUM works. For Data Transfer / Bandwidth, there\u0026rsquo;s a COALESCE hierarchy that resolves which cap level applies (regional → global → null geo → raw sum), preventing double-counting across pricing tiers.\nLayer 3: The Union View # All 14 monthly aggregate tables are UNION ALL\u0026rsquo;d into a single view:\nSELECT * FROM cdn_usage_monthly UNION ALL SELECT * FROM workers_usage_monthly UNION ALL -- ... 12 more products WHERE month_date \u0026lt; CURRENT_DATE() The month_date \u0026lt; CURRENT_DATE() filter excludes the in-progress current month, which would be partial data. Downstream consumers query this one view and get coverage across all products.\nLayer 4: Consumer Views # The consumer layer is where the original motivation for this rewrite becomes visible. 
The view that reports usage-against-cap for the current month — previously 80 lines of Bot Management-only SQL — is now 44 lines that cover all 14 products, because it\u0026rsquo;s querying the union view and the product metadata registry rather than hardcoded product-specific logic.\nThe Orchestration # Jetflow as the execution layer # Each layer (helper views, unpivot views, monthly aggregates) is a Jetflow pipeline: a YAML file that declares a consumer, optional transformers, and one or more loaders. Jetflow handles the streaming execution, Parquet conversion, GCS writes, and BigQuery loads. This is what a monthly aggregate pipeline config looks like:\n# cdn.yaml job_name: cdn_monthly_aggregate schedule: daily pipeline: - stage: bigquery_storage # runs the SQL, streams results as Arrow batch_size: 64000 retries: 3 - stage: parquet_file_transformer - stage: gcs_loader # writes Parquet to object storage, partitioned by month partition: month_date partition_type: MONTH retries: 5 - stage: bq_loader # loads from GCS into BigQuery blocking: true The SQL file it references contains just the call to the shared table function. Jetflow handles everything else: parallelism, retries, Arrow-native streaming, memory management. If you\u0026rsquo;re not familiar with it, the Cloudflare blog post is worth reading — the short version is that it achieves 2–5 million rows per second per database connection by keeping data in columnar Arrow format end-to-end, avoiding the row→column→row conversions that slow down most ELT frameworks.\nThe dry-run flag # One contribution I made to Jetflow itself while building this pipeline: a --dry-run flag. 
During development, you often want to validate that a query is syntactically correct and will produce the expected schema without actually writing data or burning quota.\nThe implementation wires through three layers:\nCLI flag — --dry-run added to ApplicationFlags, threaded into JobConfiguration BigQuery SDK — sets DryRun: true on the query config before submission, which triggers BigQuery\u0026rsquo;s built-in query validation mode Consumer logic — if dry-run mode is active, logs \u0026ldquo;Query validates successfully.\u0026rdquo; and returns immediately after BigQuery responds, skipping GCS writes and BQ loads entirely Dry-run jobs in BigQuery don\u0026rsquo;t reach Done state, so the consumer branches between job.LastStatus() (immediate, for dry runs) and job.Wait(ctx) (blocking, for real runs). The mock interface was updated to match, so unit tests cover both paths.\nIn practice: make compose-up dry-run=true validates the entire pipeline in seconds without touching production data.\nThree DAGs, one daily sequence # Three Airflow DAGs run daily:\nHelper views (1 AM) — deploys/refreshes the product metadata registry and unit multiplier lookup via Jetflow\u0026rsquo;s bq_view_batch loader Monthly aggregates (2 AM) — first creates or replaces the BigQuery table function via BigQueryInsertJobOperator, then triggers all 14 Jetflow product pipelines Unpivot views (3 AM) — deploys the 14 per-product wide-to-long views via Jetflow Each DAG has a dependency check sensor to verify upstream data freshness before running (commented out in the staging branch for ad-hoc testing flexibility).\nThe monthly aggregates DAG runs all 14 products as a single Jetflow task today. 
That\u0026rsquo;s a known limitation — a future version should use per-product Airflow task groups for parallel execution and isolated retries.\nThe Makefile Shortcut # One small ergonomic improvement that turned out to be disproportionately useful: a make upload target that pushes a DAG directly to the staging Airflow environment.\nupload: @gcloud auth print-identity-token \u0026gt; /dev/null || (echo \u0026#34;Not authenticated. Run gcloud auth login.\u0026#34; \u0026amp;\u0026amp; exit 1) @gsutil cp $(DAG) gs://$(STAGING_BUCKET)/dags/$(notdir $(DAG:.py=))_$(TICKET).py @echo \u0026#34;View at: $(AIRFLOW_UI)?dag_id=$(notdir $(DAG:.py=))_$(TICKET)\u0026#34; Before this, testing a DAG change meant pushing to a branch and waiting for CI/CD. With this, make upload DAG=dags/monthly_aggregates.py TICKET=my-branch gets you into staging Airflow in seconds, with the DAG namespaced by ticket so it doesn\u0026rsquo;t collide with the production DAG. The target also validates that you have an active gcloud auth session before attempting anything, with a clear error message if not.\nTrade-offs and Known Limitations # I\u0026rsquo;m not going to pretend this design is perfect. From the README I wrote for the team:\nFile proliferation. 14 products × 3 files each = 42 files just for the monthly aggregation layer. That\u0026rsquo;s manageable now; it might not scale to 40 products. A templating approach would reduce this.\nThe table function lives in the DAG. The CREATE OR REPLACE TABLE FUNCTION DDL is embedded as a Python string constant in the Airflow DAG rather than in a standalone SQL file. That\u0026rsquo;s because Jetflow\u0026rsquo;s pipeline stages don\u0026rsquo;t natively support DDL execution — so the DAG falls back to Airflow\u0026rsquo;s BigQueryInsertJobOperator to run it before the Jetflow task. It works, but it\u0026rsquo;s awkward: the function isn\u0026rsquo;t version-controlled as SQL, and you need to read Python to find it. 
I filed a ticket to get DDL support added to Jetflow.\nDynamic UNPIVOT is possible but not implemented. BigQuery\u0026rsquo;s INFORMATION_SCHEMA.COLUMNS lets you discover metric columns dynamically, which would make the unpivot views auto-updating when new metrics are added upstream. The current approach requires a config change for each new column. I prototyped a dynamic version (a SQL script that generates and executes a CREATE OR REPLACE VIEW via EXECUTE IMMEDIATE) but didn\u0026rsquo;t ship it in this iteration — the static approach is more readable and debuggable.\nNo automated validation. Row count checks and data freshness alerts exist at the DAG level but not at the per-product level. A product with zero rows for the month would not trigger an alert today.\nWhat I\u0026rsquo;d Do Differently # Design the metadata registry first, before writing any pipeline code. I ended up retrofitting some of the aggregation rules into the metadata view partway through, which required adjusting the table function. Starting with a fully-specified schema for the registry would have saved a few iterations.\nShip the dynamic UNPIVOT from day one. The static UNPIVOT views are the most maintenance-heavy part of the design. Every time an upstream team adds a new metric column, someone needs to update a YAML file. The dynamic version doesn\u0026rsquo;t have this problem.\nPer-product parallel task groups from the start. Retrofitting Airflow task groups into an existing DAG is messier than designing for them upfront.\nThe Result # The rewrite went from a 2,200-line SQL monolith to:\n1 product metadata view (the registry) 1 unit-of-measure lookup view 14 unpivot views (one per product) 14 monthly aggregate pipeline configs 1 aggregation table function (~100 lines) 1 cross-product union view 3 Airflow DAGs Total lines of SQL in the critical path: under 200. 
Everything else is configuration.\nAdding a new product now means: adding a row to the metadata registry, adding an unpivot view config, and adding a pipeline config. No changes to shared code. No risk of breaking other products.\nThat\u0026rsquo;s the goal of any refactor: make the easy thing the right thing.\n","date":"15 March 2026","externalUrl":null,"permalink":"/posts/modular-billing-pipeline-from-monolith-to-product-per-file/","section":"Posts","summary":"Earlier this year I shipped a pipeline rewrite I’m genuinely proud of. It replaced a 2,200-line SQL monolith — one of those files that everyone’s afraid to touch — with a clean layered architecture that handles 14 products, runs daily, and can be extended by adding a handful of config files.\n","title":"From Monolith to Modular: Rebuilding a Billing Data Pipeline From Scratch","type":"posts"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/jetflow/","section":"Tags","summary":"","title":"Jetflow","type":"tags"},{"content":"","date":"15 March 2026","externalUrl":null,"permalink":"/tags/sql/","section":"Tags","summary":"","title":"Sql","type":"tags"},{"content":"","date":"5 March 2026","externalUrl":null,"permalink":"/tags/llms/","section":"Tags","summary":"","title":"LLMs","type":"tags"},{"content":"","date":"5 March 2026","externalUrl":null,"permalink":"/tags/open-source/","section":"Tags","summary":"","title":"Open Source","type":"tags"},{"content":"In January 2024, Hugging Face published a benchmark that most people in the data world missed. They compared open-source LLMs against GPT-3.5 and GPT-4 on agent tasks — using a dataset that requires web search and calculator use, the fundamentals of any analytics agent.\nThe result: Mixtral-8x7B beat GPT-3.5 on agent tasks, out of the box, with no fine-tuning.\nThis is the moment open source won the AI agent war for data teams. 
Here\u0026rsquo;s why it matters.\nThe Benchmark Details # The HuggingFace team created a dataset combining:\nHotpotQA: multi-hop questions requiring combining information from multiple sources GSM8K: grade-school math requiring precise calculation (not estimation) GAIA: hard general AI assistant tasks requiring multiple steps These map well to analytics tasks: multi-source data questions, precise numeric calculations, and complex multi-step investigations.\nResults:\nGPT-4: ~87% Mixtral-8x7B: ~77% ← beats GPT-3.5 GPT-3.5: ~75% OpenHermes-2.5: ~60% Llama2-70b: ~45% Key finding: Mixtral is within 10 points of GPT-4 on agent tasks, and surpasses GPT-3.5, without any agent-specific fine-tuning. With fine-tuning — which HuggingFace explicitly recommends — the gap narrows further.\nWhy Open Source Matters for Data Teams # For most data teams, the reason to care about open source is not political. It\u0026rsquo;s practical:\nPrivacy. Sending your company\u0026rsquo;s query logs, financial metrics, or user behavior data to OpenAI\u0026rsquo;s API is a meaningful data governance decision. Running Mixtral locally — or on an inference provider like Cloudflare Workers AI — means the data never leaves your infrastructure.\nCost at scale. An analytics agent running 500 queries per day against GPT-4 costs ~$300/month. The same agent on Workers AI with Llama 3.1 70B costs ~$15/month. For production workloads, this is a real constraint.\nCustomization. Open-source models can be fine-tuned on your domain. 
A Mixtral fine-tuned on your specific metric definitions and query patterns will outperform a generic GPT-4 call on your specific tasks.\nThe Practical Recommendation # For a data team building an analytics agent today:\nStart with Workers AI, Together AI, or Groq for fast, cheap, private inference Use Mixtral-8x7B or Llama 3.1 70B as your base model Fine-tune on 50–100 examples of your specific query patterns — this is what the HuggingFace team says would push Mixtral past GPT-4 Evaluate every change against your test dataset The era of \u0026ldquo;we need GPT-4 or it doesn\u0026rsquo;t work\u0026rdquo; is over for most analytics use cases. Open source is here. The question is whether your team is ready to use it.\n","date":"5 March 2026","externalUrl":null,"permalink":"/posts/open-source-won-the-ai-agent-war/","section":"Posts","summary":"In January 2024, Hugging Face published a benchmark that most people in the data world missed. They compared open-source LLMs against GPT-3.5 and GPT-4 on agent tasks — using a dataset that requires web search and calculator use, the fundamentals of any analytics agent.\n","title":"Open Source Won the AI Agent War — Here's What That Means for Data Teams","type":"posts"},{"content":"In February 2024, the Berkeley AI Research Lab published a paper that quietly explained everything. Not \u0026ldquo;how to build AI\u0026rdquo; — but why the move from single LLM calls to multi-component systems is inevitable. And once you read it, you see analytics differently.\nThe paper is called \u0026ldquo;The Shift from Models to Compound AI Systems.\u0026rdquo; The lead authors are Matei Zaharia (who created Spark and co-founded Databricks) and Omar Khattab (creator of DSPy). 
These are serious people.\nTheir core claim: state-of-the-art AI results are increasingly coming not from better models, but from cleverly engineered systems that combine models with other components.\nWhy This Is True for Analytics # The Berkeley team identifies four reasons why compound systems beat monolithic models. All four apply directly to analytics work:\nReason 1: Some tasks are easier to improve via system design.\nTheir example: a coding model that gets 30% correct on a benchmark. You can spend years and compute making it 35% correct. Or you can build a system that samples 100 solutions, runs unit tests on each, and returns the one that passes — which gets you to 80% on today\u0026rsquo;s model.\nThe analytics equivalent: an LLM that generates SQL queries that are correct 60% of the time. You can use a better model and get to 65%. Or you can add a query validation step, a reflection loop, and a \u0026ldquo;test on sample data before running on production\u0026rdquo; step — and get to 90% on the same model.\nReason 2: Systems can be dynamic.\nModels are trained on historical data. Your metrics change. New tables get added. APIs change. A system that can retrieve current documentation dynamically will always outperform a model that memorized stale knowledge six months ago.\nReason 3: Improving control and trust is easier with systems.\n\u0026ldquo;LLMs still hallucinate, but a system combining LLMs with retrieval can increase user trust by providing citations or automatically verifying facts.\u0026rdquo;\nThis is the entire value proposition of an analytics agent: not just an answer, but a cited answer with the query visible. Your stakeholders can see exactly what data was used. That\u0026rsquo;s fundamentally different from an LLM that just says a number.\nReason 4: Performance goals vary widely.\nA \u0026ldquo;show me the traffic chart\u0026rdquo; question can run on a cheap fast model. 
\u0026ldquo;Explain this anomaly in the context of our seasonal patterns and current active campaigns\u0026rdquo; needs a more capable model. A compound system routes intelligently.\nThe Three Challenges They Identify # The Berkeley team is honest about what\u0026rsquo;s hard:\nDesign space is vast. For RAG alone, the combinations of retriever + reranker + LLM + verification are enormous. There\u0026rsquo;s no standard answer yet.\nOptimization is hard. You can\u0026rsquo;t backpropagate through a SQL query. New tools like DSPy instead search over prompts and few-shot examples to optimize the whole pipeline against a target metric.\nOperations are harder. How do you monitor a system where one question might generate 12 API calls, 3 LLM invocations, and a code execution step? LLMOps is a new discipline.\nWhat This Means for Data Teams # The BAIR paper ends with a prediction: \u0026ldquo;Compound AI systems will remain the best way to maximize the quality and reliability of AI applications going forward.\u0026rdquo;\nFor data teams, this means:\nAnalytics agents are not a passing fad. They are the production-grade pattern. The future belongs to teams that can engineer systems, not just use models. The bottleneck is no longer compute or model quality — it\u0026rsquo;s system design and evaluation. That\u0026rsquo;s actually good news for data analysts. System design and evaluation? That\u0026rsquo;s what we\u0026rsquo;ve been doing for years. We just called it \u0026ldquo;analytics engineering.\u0026rdquo;\n","date":"22 February 2026","externalUrl":null,"permalink":"/posts/why-analytics-agents-work-berkeley/","section":"Posts","summary":"In February 2024, the Berkeley AI Research Lab published a paper that quietly explained everything. Not “how to build AI” — but why the move from single LLM calls to multi-component systems is inevitable. 
And once you read it, you see analytics differently.\n","title":"The Berkeley AI Lab Figured Out Why Analytics Agents Work (And It's Not About AI)","type":"posts"},{"content":"In August 2025, Meta published an engineering blog post that changed how I think about analytics agents. It\u0026rsquo;s called \u0026ldquo;Creating AI Agent Solutions for Warehouse Data Access and Security,\u0026rdquo; and it describes a multi-agent system they built for their internal data warehouse.\nMost coverage focused on the security angle. I want to focus on something more fundamental: the cookbook they built for agents to follow.\nThe Problem: Tribal Knowledge at Scale # Meta has thousands of data assets, hundreds of teams, and strict privacy requirements. The challenge is that knowing what data to use requires expertise that lives in people\u0026rsquo;s heads. Which tables are restricted? What are the approved alternatives? What does this data actually measure?\nTraditionally, this knowledge was tribal — you asked your teammate. At Meta\u0026rsquo;s scale, that doesn\u0026rsquo;t work.\nTheir solution: encode tribal knowledge as text resources that agents can read. Every table has a summary. Every team\u0026rsquo;s data practices are documented as SOPs. Every access pattern is a retrievable resource. The LLM becomes a lookup mechanism for knowledge that was previously locked in human memory.\nThe Architecture: Don\u0026rsquo;t Monolith Your Agent # The clearest insight from Meta\u0026rsquo;s design: they did not build one analytics agent. 
They built a multi-agent system with clear separation of concerns.\nTriage agent: understands the user\u0026rsquo;s intent Alternative-finder agent: knows all the non-restricted alternatives to the table you want Partial-preview agent: safely provides a small sample for exploration Access-request agent: drafts the formal permission request Owner agents: independently handle each team\u0026rsquo;s access approval workflow Each agent is small, specialized, and testable. The complexity is in the coordination, not in any individual agent.\nThis maps directly to analytics: instead of one big \u0026ldquo;answer my analytics question\u0026rdquo; agent, build:\nA routing agent: classifies questions into security, performance, traffic, cost A security analyst agent: specializes in WAF, bot, DDoS data A traffic analyst agent: specializes in request volume, geography, path analysis An orchestrator: combines outputs The Evaluation Flywheel # Meta\u0026rsquo;s most important engineering decision: they built the evaluation first.\nBefore shipping, they curated a dataset of real requests with verified outcomes. They run evaluation daily. Every agent decision is logged. Analysts can review decisions and provide feedback. Feedback updates the evaluation set.\nThis is a flywheel: more usage → more feedback → better evaluations → better agent → more trust → more usage.\nThe implication for your data team: if you skip evaluation, you\u0026rsquo;re not building a production analytics agent. You\u0026rsquo;re building a demo.\nThe Guardrail Architecture # Meta is explicit: LLMs cannot be trusted for risk decisions. 
They use:\nRule-based risk computation (not LLM-based) as the final gate LLM for suggestion, rules for decision Audit logs for every agent action Budget limits (how much data can one query access per day) For analytics agents, this translates to:\nLLM can suggest a WAF rule; human must approve it LLM can suggest a query; guardrail must validate it (no COUNT(*) on sampled data) Every agent query is logged with the reasoning chain The Practical Takeaway # Meta\u0026rsquo;s recipe, adapted for your data team:\nRepresent your data as text resources — table summaries, field descriptions, usage patterns — and index them in a vector store Build specialized sub-agents rather than one monolithic agent Never trust the LLM for risk decisions — add rule-based guardrails Build evaluation before shipping — curate real questions with verified answers Log everything — every agent decision should be reviewable This is not experimental. Meta shipped it to 70,000+ employees. The patterns work.\n","date":"14 January 2026","externalUrl":null,"permalink":"/posts/what-metas-data-warehouse-ai-taught-me/","section":"Posts","summary":"In August 2025, Meta published an engineering blog post that changed how I think about analytics agents. It’s called “Creating AI Agent Solutions for Warehouse Data Access and Security,” and it describes a multi-agent system they built for their internal data warehouse.\n","title":"What Meta's Data Warehouse AI Taught Me About Building Analytics Agents","type":"posts"},{"content":"","date":"11 September 2025","externalUrl":null,"permalink":"/tags/anomaly-detection/","section":"Tags","summary":"","title":"Anomaly Detection","type":"tags"},{"content":"Monitoring availability metrics at scale creates a familiar problem: you have a time series, you need to know when it drops, and you need to know this automatically — without someone staring at a dashboard.\nThis post walks through a statistical algorithm I built to do exactly that. 
It detects dips in any continuous metric (availability, reachability, error rate) and returns precise start and end timestamps for each event. No ML required — just a modified z-score, two rolling windows, and a few transition rules.\nThe Problem # A \u0026ldquo;dip\u0026rdquo; in a time series sounds easy to define: the value goes down. But in practice:\nMetrics fluctuate constantly — you don\u0026rsquo;t want to fire on every small wobble Some drops are so brief they\u0026rsquo;re noise (a single anomalous minute) Some recoveries are partial — the metric bounces back briefly before dropping again The absolute threshold that matters varies by day, because the baseline isn\u0026rsquo;t constant A naive threshold (if value \u0026lt; 0.999, it's a dip) breaks quickly. You either miss real events or drown in false alarms.\nThe Assumptions # The algorithm makes two working assumptions:\nThe distribution is stationary — the metric doesn\u0026rsquo;t have predictable time-of-day patterns where lower values are expected. If your metric does have those patterns, you\u0026rsquo;d extend this with a time-varying baseline. Under normal conditions, the metric is roughly normally distributed — a bell curve around a stable central value, with dips appearing as strong negative deviations. Both hold well for high-level availability metrics aggregated over many requests.\nStep 1: Compute a Modified Z-Score # Rather than comparing against a fixed threshold, we normalise each data point against the current distribution of the signal. This makes the algorithm adaptive — the \u0026ldquo;what counts as a dip\u0026rdquo; question is answered relative to the recent behaviour of the metric itself.\nFor each data point $x_i$:\n$$z_i = \\frac{x_i - \\text{reference_val}}{\\sigma}$$\nWhere:\nreference_val is either the median of the dataset (adaptive) or a fixed SLA value (e.g. 
0.99999) — configurable σ is the standard deviation of the full time series window (typically 24 hours) A data point is flagged as a potential dip if $z_i \u0026lt; -1$ — i.e. the value is more than one standard deviation below the reference.\nsttdev = dip_timeline_reindex.std() reference_val = ( dip_timeline_reindex.median() if reference == \u0026#34;median\u0026#34; else reference_val ) comp = dip_timeline_reindex.sub(reference_val) / sttdev dip_threshold = (comp \u0026lt; -1) * 1 # binary: 1 = possible dip, 0 = normal Using the median as reference (rather than the mean) makes the baseline robust to outliers — a single deep dip doesn\u0026rsquo;t pull the reference value down and cause the algorithm to miss subsequent events.\nStep 2: Filter Out Noise with a Minimum Duration Window # A single minute below threshold is almost certainly noise. We only want to flag an event as a real dip start if it\u0026rsquo;s sustained — specifically, if at least min_window (default: 5) of the minutes in the look-ahead window, the current one included, are flagged.\nWe track two additional signals:\nshift: the lagged value of is_dip — tells us what the previous minute\u0026rsquo;s state was roll_sum: a forward-looking rolling sum of is_dip over the current minute and the max_window - 1 minutes that follow — tells us what\u0026rsquo;s coming dip_finder[\u0026#34;shift\u0026#34;] = dip_finder[\u0026#34;is_dip\u0026#34;].shift(1).fillna(0) dip_finder[\u0026#34;roll_sum\u0026#34;] = ( dip_finder.is_dip.shift(-max_window + 1) .rolling(window=max_window) .sum() .fillna(0) ) Step 3: Label Dip Transitions # With these three signals (is_dip, shift, roll_sum), we can precisely label each minute as a dip start, dip end, or neither:\n$$\\text{start_end}_i = \\begin{cases} \\text{\u0026ldquo;dip_start\u0026rdquo;} \u0026amp; \\text{if } \\text{is_dip}_i = 1 \\wedge \\text{shift}_i = 0 \\wedge \\text{roll_sum}_i \\geq \\text{min_window} \\\\ \\text{\u0026ldquo;dip_end\u0026rdquo;} \u0026amp; \\text{if } \\text{is_dip}_i = 0 \\wedge \\text{shift}_i = 1 \\wedge \\text{roll_sum}_i = 0 \\\\ \\text{NaN} \u0026amp; 
\\text{otherwise} \\end{cases}$$\nIn plain English:\nDip start: the current minute is below threshold, the previous minute was not, and at least min_window of the next max_window minutes (the current one included) are below threshold Dip end: the current minute is above threshold, the previous minute was below it, and the next max_window minutes (the current one included) are all above threshold — meaning we\u0026rsquo;ve genuinely recovered, not just bounced The max_window check on dip end (default: 15 minutes) is deliberate. Without it, a brief recovery in the middle of a sustained dip would split it into two separate events, making the duration statistics meaningless.\ndip_finder[\u0026#34;start_end\u0026#34;] = dip_finder.apply( lambda dip: ( \u0026#34;dip_end\u0026#34; if (dip[\u0026#34;is_dip\u0026#34;] == 0) \u0026amp; (dip[\u0026#34;shift\u0026#34;] == 1) \u0026amp; (dip[\u0026#34;roll_sum\u0026#34;] == 0) else ( \u0026#34;dip_start\u0026#34; if (dip[\u0026#34;is_dip\u0026#34;] == 1) \u0026amp; (dip[\u0026#34;shift\u0026#34;] == 0) \u0026amp; (dip[\u0026#34;roll_sum\u0026#34;] \u0026gt;= min_window) else np.nan ) ), axis=1, ) Step 4: Remove Consecutive Dip Starts # In practice, you can get runs of consecutive dip_start labels during a noisy entry into a dip. We only want the first one. This is handled by grouping consecutive identical labels and keeping only the first occurrence:\ndips_only[\u0026#34;consecutive_count\u0026#34;] = ( dips_only[\u0026#34;start_end\u0026#34;] .groupby((dips_only[\u0026#34;start_end\u0026#34;] != dips_only[\u0026#34;start_end\u0026#34;].shift()).cumsum()) .cumcount() + 1 ) dips_only = dips_only.loc[~(dips_only[\u0026#34;consecutive_count\u0026#34;] \u0026gt; 1)].copy() Step 5: Pair Starts and Ends # Finally, dip_start and dip_end events are paired sequentially. Only valid pairs are retained — a start without a following end (e.g. a dip still ongoing at the end of the observation window) is excluded. 
Duration is computed in both timedelta and minutes.\ndip_start_end_df[\u0026#34;duration\u0026#34;] = dip_start_end_df.dip_end - dip_start_end_df.dip_start dip_start_end_df[\u0026#34;duration_min\u0026#34;] = dip_start_end_df[\u0026#34;duration\u0026#34;].dt.total_seconds() / 60 The Parameters # The algorithm has five configurable parameters:\nParameter Default What it controls min_window 5 min Minimum sustained duration to call something a dip start max_window 15 min Minimum recovery duration to call something a dip end reference \u0026quot;median\u0026quot; Whether to normalise against the dataset median or a fixed SLA value reference_val 0.99999 The SLA value, if reference = \u0026quot;SLA\u0026quot; smooth False Apply a rolling average before detection (trades sensitivity for noise reduction) The smooth parameter deserves a note: it was initially appealing as a way to eliminate false alarms, but in practice it caused the algorithm to miss short but real dips — the smoothing would average away a 3-minute event entirely. For most use cases, leaving it off and relying on min_window to filter noise is the better approach.\nWhy Not ML? # A few reasons this approach was chosen over a machine learning model:\nInterpretability. When an alert fires, you can trace exactly why: the z-score was below -1 for more than 5 consecutive minutes, the reference value was X, the standard deviation was Y. There\u0026rsquo;s no black box.\nNo training data required. The algorithm works on the current 24-hour window. You don\u0026rsquo;t need historical labelled examples of dips to get started.\nRobustness to distribution shift. If the baseline level of a metric drifts over time (e.g. availability naturally improves as infrastructure scales), the median-based reference value adapts automatically.\nLow computational overhead. 
The entire algorithm is vectorised pandas operations — it runs on minute-level data for a 24-hour window in milliseconds.\nWhat This Enables # The output is a clean dataframe of (dip_start, dip_end, duration, duration_min) tuples. This feeds naturally into downstream analysis: which contributing entities were responsible for the dip during each window, how severe it was relative to the reference, and whether the pattern matches known failure modes.\nThe algorithm is written in Python for prototyping but is straightforward to port to Scala for production pipeline integration — all the logic is standard window functions and groupby operations that map directly to Spark or Flink semantics.\nThe full function signature with all parameters:\ndef find_dip_start_end( dip_timeline: pd.DataFrame, min_window: int = 5, max_window: int = 15, smooth: bool = False, reference: str = \u0026#34;median\u0026#34;, reference_val: float = 0.99999, metric: str = \u0026#34;avg_availability\u0026#34;, ) -\u0026gt; pd.DataFrame: ... ","date":"11 September 2025","externalUrl":null,"permalink":"/posts/automatic-dip-detection-statistical-approach/","section":"Posts","summary":"Monitoring availability metrics at scale creates a familiar problem: you have a time series, you need to know when it drops, and you need to know this automatically — without someone staring at a dashboard.\nThis post walks through a statistical algorithm I built to do exactly that. It detects dips in any continuous metric (availability, reachability, error rate) and returns precise start and end timestamps for each event. 
No ML required — just a modified z-score, two rolling windows, and a few transition rules.\n","title":"Automatic Dip Detection in Time Series: A Statistical Approach","type":"posts"},{"content":"","date":"11 September 2025","externalUrl":null,"permalink":"/tags/time-series/","section":"Tags","summary":"","title":"Time Series","type":"tags"},{"content":"","date":"18 June 2025","externalUrl":null,"permalink":"/tags/feature-engineering/","section":"Tags","summary":"","title":"Feature Engineering","type":"tags"},{"content":"","date":"18 June 2025","externalUrl":null,"permalink":"/tags/segmentation/","section":"Tags","summary":"","title":"Segmentation","type":"tags"},{"content":"Customer segmentation is one of those problems that sounds straightforward until you actually sit down with the data. In this post I\u0026rsquo;ll walk through an approach I built for segmenting customers based on their HTTP traffic patterns — the kind of traffic data that tells you not just how much a customer uses a service, but how they use it.\nThe goal was a \u0026ldquo;look-alike\u0026rdquo; segmentation: group customers with similar traffic behaviour into cohorts, so you can make data-driven inferences about what one customer in a group is likely to need based on what others in the same group are already doing.\nThe Data # The starting point is monthly HTTP traffic data aggregated at the account level, covering a 12-month rolling window. The raw features are:\nTotal Requests — volume of traffic Threat Requests — traffic flagged as malicious or suspicious API Requests — programmatic/API traffic Media Requests — image and video serving traffic Web Requests — standard browser/web traffic The challenge with monthly time series data at the account level: some accounts have gaps — months with zero traffic because they were inactive, not because the data is missing. 
These need to be treated differently from genuinely absent data points.\nFeature Engineering # The raw time series needs to be collapsed into a single row per account that captures the character of their traffic, not just a snapshot. Three transformations do most of the work:\n1. Traffic composition as percentages\nRather than using raw counts for threats, API, media, and web traffic (which would be dominated by account size), each is expressed as a percentage of total requests:\ndf[\u0026#34;api_pct\u0026#34;] = df[\u0026#34;api_requests\u0026#34;] / df[\u0026#34;total_requests\u0026#34;] df[\u0026#34;media_pct\u0026#34;] = df[\u0026#34;media_requests\u0026#34;] / df[\u0026#34;total_requests\u0026#34;] df[\u0026#34;web_pct\u0026#34;] = df[\u0026#34;web_requests\u0026#34;] / df[\u0026#34;total_requests\u0026#34;] df[\u0026#34;threats_pct\u0026#34;] = df[\u0026#34;threat_requests\u0026#34;] / df[\u0026#34;total_requests\u0026#34;] This makes the features size-agnostic — a small account with 80% API traffic looks like a large account with 80% API traffic, which is the right behaviour for a look-alike model.\n2. 75th Percentile aggregation\nInstead of taking the mean across the 12 months, we take the 75th percentile of each feature per account. This was chosen over the mean for robustness: traffic distributions are highly right-skewed, and the 75th percentile gives a stable representation of typical high-usage behaviour without being pulled by outlier months.\nfeatures = ( df.groupby(\u0026#34;account_id\u0026#34;)[[\u0026#34;total_requests\u0026#34;, \u0026#34;api_pct\u0026#34;, \u0026#34;media_pct\u0026#34;, \u0026#34;web_pct\u0026#34;, \u0026#34;threats_pct\u0026#34;]] .quantile(0.75) .reset_index() ) 3. 
Relative Standard Deviation (Coefficient of Variation)\nTo capture traffic stability — not just level — we compute the Relative Standard Deviation (RSD) of total requests for each account:\n$$\\text{RSD} = \\frac{\\sigma}{\\mu}$$\nAn account with RSD of 10% has very consistent traffic month to month. An account with RSD of 90% has highly variable traffic — potentially seasonal, growing rapidly, or irregularly active. This becomes its own segmentation dimension.\nOutlier Removal: Why Standard Methods Failed # The standard approaches to outlier removal — IQR-based filtering and standard deviation thresholds — both failed in this case. The reason: the traffic distribution across accounts is extremely right-skewed. Large enterprise accounts send orders of magnitude more traffic than small accounts, which means they appear as statistical outliers under any simple threshold method and get removed — even though they\u0026rsquo;re exactly the accounts you most want to segment correctly.\nThe solution was to use k-Nearest Neighbours with anomaly detection to identify outliers relative to their own cluster neighbourhood, rather than relative to the global distribution.\nThe approach:\nFit a k-NN model on the feature space For each point, compute its distance to its k nearest neighbours Flag points whose distance to neighbours exceeds a threshold as anomalies This lets a large enterprise account be a legitimate member of a \u0026ldquo;large enterprise\u0026rdquo; cluster, while still flagging accounts that are genuinely anomalous — e.g. accounts with corrupted data, test accounts with synthetic traffic patterns, or accounts that don\u0026rsquo;t fit any natural grouping.\nIn practice, BigQuery ML\u0026rsquo;s ML.DETECT_ANOMALIES on a k-NN model handles this cleanly in SQL, making it easy to run as part of a data pipeline without a separate Python environment.\nBucketing into Cohort Dimensions # With outliers removed, each feature is bucketed into discrete bins. 
The bucketing strategy differs by feature:\nTercile division (3 buckets) for continuous volume features where the full distribution matters:\nTotal Requests (traffic volume) → buckets 1, 2, 3 Traffic Relative Standard Deviation (variability) → buckets 1, 2, 3 Median split (2 buckets) for percentage features where the distribution was bimodal or showed near-identical tercile edges:\nThreats % → 0 (below median) / 1 (above median) API % → 0 / 1 Media % → 0 / 1 Web % → 0 / 1 The decision between terciles and median split was empirical: for some features, attempting a tercile split produced bins whose boundaries were nearly identical (e.g. the 33rd and 66th percentile were both 0%), making the middle bucket meaningless. A median split was more informative in those cases.\nThe final cohort identifier is a JSON string joining all bucket assignments:\n{\u0026#34;traffic\u0026#34;: \u0026#34;3\u0026#34;, \u0026#34;variability\u0026#34;: \u0026#34;1\u0026#34;, \u0026#34;threats\u0026#34;: \u0026#34;1\u0026#34;, \u0026#34;api\u0026#34;: \u0026#34;1\u0026#34;, \u0026#34;media\u0026#34;: \u0026#34;2\u0026#34;, \u0026#34;web\u0026#34;: \u0026#34;1\u0026#34;} This gives each account a human-readable, interpretable identity. traffic: 3, api: 1, media: 2 tells you immediately: high-volume account, below-median API traffic, above-median media serving. 
You don\u0026rsquo;t need to look up what cluster 47 means.\nThe Pipeline in Full # Monthly time series (12 months, per account) ↓ Calculate percentage features (API, media, web, threats as % of total) ↓ Compute 75th percentile per account → one row per account ↓ Compute Relative Standard Deviation per account ↓ k-NN anomaly detection → remove outliers ↓ Tercile bucketing (traffic volume, variability) Median split (API %, media %, web %, threats %) ↓ Join buckets → cohort string ↓ Output: one cohort identifier per account Product Recommendations from Cohort Attach Rates # Once accounts are grouped into cohorts, a natural downstream application is product recommendations. Within each cohort, you can calculate the attach rate of each product — what fraction of accounts in this cohort use each product — and use that to recommend products to accounts in the same cohort that don\u0026rsquo;t yet have them.\n# Attach rate per product per cohort attach_rates = ( cohort_products .groupby([\u0026#34;cohort_id\u0026#34;, \u0026#34;product\u0026#34;])[\u0026#34;account_id\u0026#34;] .count() / cohort_products.groupby(\u0026#34;cohort_id\u0026#34;)[\u0026#34;account_id\u0026#34;].nunique() ).reset_index(name=\u0026#34;attach_rate\u0026#34;) # Top N products per cohort recommendations = ( attach_rates .sort_values(\u0026#34;attach_rate\u0026#34;, ascending=False) .groupby(\u0026#34;cohort_id\u0026#34;) .head(15) ) The logic: if 70% of accounts in cohort {traffic:3, api:1, media:2, web:1} use product X, and a given account in that cohort doesn\u0026rsquo;t, product X is worth surfacing as a recommendation. 
This is collaborative filtering at the cohort level — no per-account interaction history required.\nWhat Worked and What Didn\u0026rsquo;t # What worked well:\nThe 75th percentile aggregation was more stable than mean or median for capturing typical account behaviour The k-NN outlier approach preserved the high-volume accounts that standard IQR methods would have removed The JSON cohort string made the output self-documenting — downstream consumers could understand a cohort without joining to a lookup table What I\u0026rsquo;d do differently:\nThe fixed 12-month window treats a fast-growing account the same as a stable one with the same average. A growth-rate feature would add a useful dimension The tercile/binary bucketing is interpretable but lossy. For downstream ML use cases (rather than human-readable segmentation), keeping the continuous features would be better The RSD is a useful variability measure but sensitive to accounts with very few active months. Weighting by number of active months would improve it Closing Thought # The appeal of this approach is its interpretability. Every account gets a cohort label that a non-technical stakeholder can read and understand. That matters more than it sounds — in practice, a segmentation model that product teams can interrogate and trust gets used. One that produces opaque cluster IDs gets ignored.\nThe algorithm is also deliberately simple. No neural networks, no complex dimensionality reduction. Just feature engineering, a sensible aggregation strategy, and a principled bucketing scheme. Simple enough to explain in a meeting, robust enough to run as a production pipeline.\n","date":"18 June 2025","externalUrl":null,"permalink":"/posts/traffic-based-customer-segmentation/","section":"Posts","summary":"Customer segmentation is one of those problems that sounds straightforward until you actually sit down with the data. 
In this post I’ll walk through an approach I built for segmenting customers based on their HTTP traffic patterns — the kind of traffic data that tells you not just how much a customer uses a service, but how they use it.\n","title":"Traffic-Based Customer Segmentation: A Practical Approach with Quantile Bucketing and k-NN Anomaly Detection","type":"posts"},{"content":"","date":"9 April 2025","externalUrl":null,"permalink":"/tags/bi/","section":"Tags","summary":"","title":"BI","type":"tags"},{"content":"","date":"9 April 2025","externalUrl":null,"permalink":"/tags/dashboards/","section":"Tags","summary":"","title":"Dashboards","type":"tags"},{"content":"","date":"9 April 2025","externalUrl":null,"permalink":"/tags/data/","section":"Tags","summary":"","title":"Data","type":"tags"},{"content":"","date":"9 April 2025","externalUrl":null,"permalink":"/tags/data-culture/","section":"Tags","summary":"","title":"Data Culture","type":"tags"},{"content":"There\u0026rsquo;s a hard truth hiding in your analytics platform. Let me show you how to find it.\nOpen your BI tool. Look at the list of dashboards. Find the one that took you — or someone on your team — two weeks to build. The one with the carefully color-coded KPI tiles, the year-over-year comparisons, the trend lines going back 18 months.\nNow look at the last time it was opened.\nIf it was more than three weeks ago, you\u0026rsquo;re not alone. You\u0026rsquo;re in the majority.\nThe Pattern Nobody Talks About # The analytics industry is built on a quiet fiction: that dashboards get used.\nVendors show you demos of executives making snap decisions in front of real-time data walls. Conference talks describe \u0026ldquo;data-driven cultures\u0026rdquo; where everyone from the CEO to the customer support rep checks their metrics every morning. 
Hiring decks promise that this analyst hire will \u0026ldquo;transform the way we use data.\u0026rdquo;\nAnd then, in the real world, the dashboard you spent two weeks on gets opened once — at the presentation where you launched it — and then sits there, slowly aging, like milk nobody noticed was left on the counter.\nThis isn\u0026rsquo;t a you problem. It\u0026rsquo;s a structural one.\nWhy Dashboards Die # 1. They were built to show what exists, not to answer what matters # Most dashboards are shaped by what data is available, not by what a decision-maker actually needs. You have a users table, a transactions table, an events table — so you build a dashboard that shows everything in those tables. The metrics feel important because they\u0026rsquo;re measurable.\nBut measurability is not the same as relevance.\nBenn Stancil, one of the most incisive writers on analytics culture, puts it directly: most users don\u0026rsquo;t want complicated analysis. They want to know what is happening. A Gong customer success team wants to know what their customers are doing this week — not an AI-generated \u0026ldquo;health score\u0026rdquo; that blends fifteen signals into a composite number nobody can explain.\nThe more abstracted a dashboard is from a specific question, the faster it gets abandoned.\nThe fix: Before you open your BI tool, write one sentence: \u0026ldquo;This dashboard will help [person] decide [thing] by showing them [specific metric] in context of [baseline or target].\u0026rdquo; If you can\u0026rsquo;t write that sentence, don\u0026rsquo;t build the dashboard.\n2. They\u0026rsquo;re used for theater, not decisions # Here\u0026rsquo;s a harder truth: many dashboards were never meant to be used regularly. 
They were built to exist — to signal that a team is data-driven, to satisfy a stakeholder who asked for \u0026ldquo;a dashboard on this,\u0026rdquo; or to provide political cover for a decision that was already made.\nWhen executives need data to justify a choice they\u0026rsquo;ve already committed to emotionally, they use the dashboard once (at the meeting where they present the decision) and never again. The dashboard served its purpose. That purpose was theater.\nThis is not cynical — it\u0026rsquo;s human. Leaders under pressure need ways to make difficult calls feel objective. Data analysis provides that cover. The problem is that teams spend real time building dashboards for this use case, when what was actually needed was a one-page summary for a single meeting.\nThe fix: When someone asks for a \u0026ldquo;dashboard,\u0026rdquo; ask what decision it will support and when. If the answer is \u0026ldquo;the executive presentation next Thursday,\u0026rdquo; build a focused one-pager, not a permanent dashboard.\n3. They\u0026rsquo;re too noisy to read quickly # The Geckoboard research on dashboard design identifies the deepest UX problem: every piece of non-data visual noise — decorative elements, unnecessary gridlines, redundant labels, too many metrics — degrades the signal.\nThe statistician Edward Tufte called this the data-ink ratio: the fraction of your visualization that actually communicates data versus ink that just exists. Dashboards with bad data-ink ratios force cognitive work. After a few sessions of effort, most people give up.\nThere\u0026rsquo;s also the problem of missing context. A number without a comparison is nearly useless. \u0026ldquo;42 leads today\u0026rdquo; tells you nothing. \u0026ldquo;42 leads today, versus a 7-day average of 38 and a monthly target of 45\u0026rdquo; tells you something. Most dashboards show the number. They skip the context.\nThe fix: Audit your dashboard. 
Remove every element that doesn\u0026rsquo;t directly communicate data. For every metric, add a comparison (yesterday, last week, target, average). If a number can\u0026rsquo;t justify its context, reconsider whether it belongs.\n4. The builders and the users are different people # Analytics teams build dashboards. Business teams use them. These two groups have fundamentally different mental models of what a dashboard should do.\nAnalysts are trained to value rigor, completeness, and nuance. They build dashboards that reflect this training: comprehensive, carefully labeled, with drill-down capability and filters for every dimension.\nBusiness users want answers in seconds. They want to glance at a screen and know if things are okay or not. They don\u0026rsquo;t want to apply three filters before they can see the number they need.\nKatie Bauer\u0026rsquo;s framing of analysts as explorers is useful here. Most dashboard work is scouting — routine maintenance, answering basic operational questions. But scouts too often design their reports as if they\u0026rsquo;re presenting a grand discovery. The mismatch creates dashboards that feel overwhelming to navigate.\nThe fix: Get a non-analyst to use your dashboard without instructions. Watch where they get stuck. Ask what question they came in with. Redesign around what they actually did, not what you hoped they\u0026rsquo;d do.\nThe Contradiction at the Heart of Analytics # Here\u0026rsquo;s where it gets uncomfortable.\nEvery practitioner who has been in this industry long enough arrives at a version of the same observation: the dashboards don\u0026rsquo;t get used, the insights are rare, and the decisions are mostly emotional anyway. Data provides the post-hoc rationalization, not the input.\nStancil asked his audience: In your entire career, how often did you find a truly meaningful insight in your data? The average answer was once every two years.\nAnd yet — companies keep hiring data teams. 
The BI industry keeps growing. More dashboards get built.\nThe explanation is uncomfortable but probably correct: the value of a data team isn\u0026rsquo;t always in the outputs they produce. It\u0026rsquo;s in the belief that data-driven decisions are being made. As long as that belief is maintained, the system is funded. The dashboards may never be opened — but they have to exist.\nThis doesn\u0026rsquo;t mean your work doesn\u0026rsquo;t matter. It means you should be clear-eyed about which dashboards you\u0026rsquo;re building for genuine use versus which ones are institutional theater. Build the former well. Build the latter efficiently.\nWhat Actually Works # After synthesizing what practitioners across the industry have learned, here\u0026rsquo;s what produces dashboards that actually get used:\nMonitoring, not investigation, as the design goal. A dashboard should answer \u0026ldquo;is everything okay?\u0026rdquo; in under 10 seconds. Investigation (why is something not okay?) requires a different tool: ad-hoc analysis, a notebook, a conversation.\nOne question, answered well. A dashboard with five metrics you\u0026rsquo;ve chosen carefully and provided context for is worth more than a dashboard with fifty metrics. Resist the pressure to include everything.\nFeedback loops built in. Ask the people you built the dashboard for what they look at, what they never look at, and whether the dashboard has changed how they work. Build this conversation into your process.\nRewards for boring, reliable work. The most valuable thing an analytics team does is maintain accurate, consistent, trusted reporting — the kind where everyone agrees on what the numbers mean and nobody questions whether the pipeline is broken. This work is low-glamour and high-value. Make it visible.\nNatural language as a complement, not a replacement. The new generation of conversational analytics tools (ask-your-data interfaces built on LLMs) reduce the friction of getting an answer from data. 
They won\u0026rsquo;t replace dashboards — the ambiguity of human language and the complexity of business logic mean that static, trusted views still have a role. But they can handle the \u0026ldquo;I just want to check one thing\u0026rdquo; use case that clutters most dashboards with filters and drill-downs.\nThe Honest Ending # The dashboard you built that nobody opens isn\u0026rsquo;t a failure of craft. It might be a perfectly designed dashboard. The problem is almost certainly upstream: a misalignment between what was built and what was actually needed, a context where the data was needed for theater rather than decisions, or an organization that hasn\u0026rsquo;t yet built the culture of trust that makes dashboards worth opening.\nThe most important thing you can do isn\u0026rsquo;t redesign the dashboard. It\u0026rsquo;s get closer to the decisions.\nFind out what questions people actually have before they go into important meetings. Find out what numbers make them nervous. Find out what they\u0026rsquo;re checking in Excel because they don\u0026rsquo;t trust the BI tool. Build toward those needs.\nThe dashboards that get used every day aren\u0026rsquo;t the impressive ones. They\u0026rsquo;re the ones that answer exactly one question, reliably, in under 10 seconds.\nStart there.\nSources and Further Reading # Geckoboard — Effective Dashboard Design: A Step-by-Step Guide (2023) — geckoboard.com Benn Stancil — The Insight Industrial Complex (Feb 2023) — benn.substack.com Benn Stancil — Disband the Analytics Team (Mar 2024) — benn.substack.com Benn Stancil — Searching for Insight (Nov 2024) — benn.substack.com Benn Stancil — Does Data Make Us Cowards? 
(Nov 2021) — benn.substack.com Benn Stancil — Go Crazy, Folks, Go Crazy (Feb 2026) — benn.substack.com Katie Bauer — Analysts Are Explorers (Jul 2022) — wrongbutuseful.substack.com Michal Szudejko — Natural Language Visualization and the Future of Data Analysis (Nov 2025) — towardsdatascience.com ","date":"9 April 2025","externalUrl":null,"permalink":"/posts/why-dashboards-are-read-once-and-never-opened-again/","section":"Posts","summary":"There’s a hard truth hiding in your analytics platform. Let me show you how to find it.\nOpen your BI tool. Look at the list of dashboards. Find the one that took you — or someone on your team — two weeks to build. The one with the carefully color-coded KPI tiles, the year-over-year comparisons, the trend lines going back 18 months.\n","title":"The Dashboard You Built That Nobody Opens","type":"posts"},{"content":"","date":"10 February 2025","externalUrl":null,"permalink":"/tags/development-economics/","section":"Tags","summary":"","title":"Development Economics","type":"tags"},{"content":"","date":"10 February 2025","externalUrl":null,"permalink":"/tags/econometrics/","section":"Tags","summary":"","title":"Econometrics","type":"tags"},{"content":"","date":"10 February 2025","externalUrl":null,"permalink":"/tags/input-output-analysis/","section":"Tags","summary":"","title":"Input-Output Analysis","type":"tags"},{"content":"","date":"10 February 2025","externalUrl":null,"permalink":"/tags/trade/","section":"Tags","summary":"","title":"Trade","type":"tags"},{"content":"In 2020, I was handed a PDF — an ILO working paper titled Spotting Export Potential and Implications for Employment in Developing Countries (Cheong, Decreux \u0026amp; Spies, 2018) — and asked to turn it into a working algorithm.\nThe paper describes a methodology developed by the International Trade Centre to identify a country\u0026rsquo;s unrealized export opportunities, and then estimate how many jobs realizing those opportunities would create. 
Across six developing countries. At the product-market-sector level.\nMy first instinct was: this sounds like a job for machine learning. My second instinct, after actually reading the paper, was: no, it really isn\u0026rsquo;t. And my third insight, after spending several weeks getting the implementation wrong before getting it right, was about something much more fundamental than the ML vs econometrics debate: it was about how thinking in matrices rather than loops is not just a performance concern — it\u0026rsquo;s an epistemological one.\nThis post is about all three.\nThe Paper: What It Actually Does # The methodology has two parts.\nPart one computes the Export Potential Indicator (EPI) — a score for every (exporting country, product, target market) triple that represents how much more a country could export given its current supply capacity, the target market\u0026rsquo;s demand, and how easy it is for those two to trade with each other. The gap between potential and actual exports is \u0026ldquo;untapped potential.\u0026rdquo;\nThe formula structure is:\nEPI(country, product, market) = min(supply, demand) × ease_of_exporting The supply component projects future market share based on current export share and relative GDP growth. The demand component projects future import volume adjusted for tariff advantages and bilateral distance. The ease component is a ratio of actual to hypothetical trade, capturing proximity, language, and commercial history.\nPart two translates that unrealized export potential into employment, using Leontief input-output analysis. 
If a sector\u0026rsquo;s exports increase by $X, production must increase by at least $X (direct effect), and that production requires inputs from upstream sectors, who need inputs of their own — a multiplier chain formalized as:\ndy = (I - BA)^{-1} dx Where A is the matrix of technical coefficients (input intensities), B is a diagonal matrix of domestic supply shares, and (I - BA)^{-1} is the Leontief inverse — the total production change required throughout the economy per unit of final demand increase. Employment follows proportionally: dl = diag(l/y) · dy.\nThe paper applies this across Benin, Ghana, Guatemala, Morocco, Myanmar, and the Philippines. The output is sector-level employment creation estimates, disaggregated by gender and skill level.\nThe Implementation Mistake I Made First # My initial implementation looked like this:\nfor country in countries: for product in products: for market in markets: epi = compute_epi(country, product, market) results.append((country, product, market, epi)) This is wrong. Not just slow — wrong in the way that obscures what you\u0026rsquo;re actually doing.\nThe EPI is not a scalar function applied to individual triples. It is a computation over tensors. Supply is a matrix indexed by (country, product). Demand is a matrix indexed by (market, product). Ease is a matrix indexed by (country, market). The EPI is their combination — a three-dimensional array.\nWhen you implement it as nested loops, you lose the structure. You can\u0026rsquo;t see that you\u0026rsquo;re taking the element-wise minimum of two matrices projected into the same space. You can\u0026rsquo;t see that ease is being broadcast across all products. 
You can\u0026rsquo;t vectorize it later because the logic is buried in conditional branches inside the loop body.\nThe matrix formulation, by contrast, forces clarity:\n# supply[country, product], demand[market, product], ease[country, market] supply_projected = supply * (1 + gdp_growth_relative)[:, np.newaxis] demand_projected = demand * (1 + pop_growth + rev_elasticity * gdppc_growth) # EPI: for each (country, product, market) epi = np.minimum( supply_projected[:, :, np.newaxis], # (countries, products, 1) demand_projected.T[np.newaxis, :, :] # (1, products, markets), transposed from (markets, products) ) * ease[:, np.newaxis, :] # (countries, 1, markets) Now the math is transparent. The broadcasting operations correspond exactly to the formula in the paper. You can audit each step against the appendix. And it runs in seconds instead of hours.\nThe Leontief computation is even more explicit:\nA = Z / y # technical coefficients matrix B = np.diag(d / (m + d)) # domestic supply share diagonal matrix leontief_inverse = np.linalg.inv(np.eye(n) - B @ A) # employment multiplier dl = np.diag(l / y) @ leontief_inverse @ dx Four lines. Directly from the technical appendix. Zero ambiguity.\nWhy ML Would Have Been the Wrong Choice # By the time I had the correct implementation running, the ML question had answered itself. But it\u0026rsquo;s worth articulating why.\n1. The question is \u0026ldquo;how many jobs\u0026rdquo;, not \u0026ldquo;which sector will grow\u0026rdquo; # Machine learning is excellent at prediction. Given historical data on which sectors in which countries expanded exports, a gradient boosted tree could probably rank future opportunities with decent accuracy. But that\u0026rsquo;s not what the ILO/ITC methodology is trying to answer.\nThe question is: if we help a country realize its export potential in sector X, how many jobs would that create, and where — directly in sector X, and indirectly in the upstream industries that supply it?\nThat question requires a structural model. 
You need technical coefficients (how much steel goes into making cars). You need labor intensity ratios (how many workers per unit of output). You need the IO matrix to trace the supply chain. A black-box model trained on historical correlations cannot give you a number that a policymaker can use to justify a budget allocation.\nMullainathan \u0026amp; Spiess (2017), in what remains the best paper on this topic, put it precisely: machine learning solves the problem of prediction, while many economic applications revolve around parameter estimation. The policy question here is a parameter estimation problem.\n2. Interpretability isn\u0026rsquo;t a luxury — it\u0026rsquo;s the deliverable # The EPI gives three named, decomposable components: supply, demand, ease. A country looking at its results can say: \u0026ldquo;We have high untapped potential in processed food exports to Europe — supply capacity is there, European demand for this product is growing, but our ease score is low because of non-tariff barriers.\u0026rdquo; That\u0026rsquo;s an actionable diagnosis.\nA machine learning model ranking export opportunities would rank the same sector highly — but could not tell you why. The diagnosis drives the policy response. Without it, you have a priority list with no prescription.\n3. Developing countries don\u0026rsquo;t have ML-scale data # ML models for gravity-based trade prediction typically require bilateral trade data across many country-pairs over many years, plus a rich feature set. That data exists for OECD countries. For Benin or Myanmar, the statistical infrastructure is thinner, the time series shorter, and key variables (input-output tables, detailed employment surveys) may be from outdated vintages.\nThe Leontief approach is robust to this. Technical coefficients are considered relatively stable over medium time horizons — a 2012 IO table is still useful in 2018. The employment calculation needs only a sectoral employment snapshot, not a panel. 
You can work with what exists.\nThis is a genuine practical advantage, not a consolation prize.\n4. The assumptions are features # The ILO paper is notably transparent about its assumptions: constant returns to scale, stable technical coefficients, no macroeconomic feedback through exchange rates, no skill mismatch. Each assumption is named, its direction of bias is discussed, and readers are told when and why the results might be over- or understated.\nThis is what makes the methodology defensible in a policy setting. A government minister can challenge a specific assumption. An NGO can ask what happens if we relax the constant returns assumption for agriculture. The model is auditable.\nA deep learning model for the same task would have implicit assumptions embedded in architecture choices, training data selection, and regularization hyperparameters. These are not auditable in the same way. In development economics — where the outputs influence resource allocation for millions of people — that opacity is a serious problem.\nWhere ML Does Add Value in This Context # This isn\u0026rsquo;t an argument against ML in economics. It\u0026rsquo;s an argument for knowing which tool solves which problem.\nThere are genuine ML applications adjacent to this work:\nFeature construction for the EPI supply component: The paper uses a modified PRODY index (GDP-per-capita-weighted export intensity) that requires careful handling of re-exports and tariff preferences. An ML model trained on trade data could potentially identify products with genuine comparative advantage more robustly, by learning the complex interaction between tariff preferences, re-export patterns, and true production capacity.\nAnomaly detection in trade data: The methodology requires a \u0026ldquo;reliability check\u0026rdquo; to identify reporters whose trade statistics are inconsistent with their trading partners\u0026rsquo; mirror statistics. 
This is pattern recognition — a natural ML task.\nDemand elasticity estimation: The demand component uses estimated revenue elasticities of import demand. These come from econometric estimates in the literature. Modern ML approaches (double/debiased ML, causal forests) could potentially improve these estimates, particularly for products with unusual demand curves.\nBut the core structural calculation — the Leontief inverse, the employment multiplier chain — remains an econometric construct. Its value is precisely that it is grounded in an explicit theory of production.\nWhat the Debate Gets Wrong # The econometrics vs ML debate in economics is often framed as a competition, with ML seen as the newer, more powerful approach that economists are reluctantly adopting.\nThis framing misses the point in both directions.\nThe economists who argue ML is \u0026ldquo;just curve fitting\u0026rdquo; are wrong: ML methods like causal forests and double/debiased ML can recover structural parameters in high-dimensional settings where classical approaches break down.\nThe ML practitioners who argue econometrics is \u0026ldquo;just old statistics\u0026rdquo; are also wrong: an econometric model forces you to specify what you\u0026rsquo;re trying to estimate, makes causal assumptions explicit, and connects to theory in a way that constrains what conclusions you can draw.\nFor the export potential problem specifically, the right framing is simpler: the ILO/ITC methodology works because it is fit for purpose. It asks a structural question (what is the employment multiplier of export growth in sector X?) and answers it with a structural tool (Leontief IO analysis). Substituting ML would be like using a regression model to invert a matrix — not wrong in principle, but solving the problem badly when the right solution is available.\nThe Technical Lesson That Stayed With Me # Five years on, the thing I think about most from this project is not the econometrics vs ML question. 
It\u0026rsquo;s the matrix formulation lesson.\nWhen I switched from nested loops to vectorized operations, the code didn\u0026rsquo;t just get faster. It became easier to verify. Each matrix operation corresponded to a named economic concept. The broadcasting rules enforced dimensional consistency that the loop version hid. Bugs that would have taken hours to diagnose appeared immediately as shape mismatches.\nThere\u0026rsquo;s an epistemological point here: how you implement a computation shapes how you understand it. Writing np.linalg.inv(np.eye(n) - B @ A) forces you to know what n is, what B and A represent, and why you\u0026rsquo;re subtracting rather than adding. Writing a nested loop lets you avoid all of that — until something goes wrong.\nThe Technical Appendix of the ILO paper is written entirely in matrix notation for exactly this reason. The notation is the specification. Implementation is translation, and the closer the translation stays to the original language, the less gets lost.\nReferences # Cheong, D., Decreux, Y., \u0026amp; Spies, J. (2018). Spotting Export Potential and Implications for Employment in Developing Countries. ILO STRENGTHEN Working Paper No. 5. Decreux, Y., \u0026amp; Spies, J. (2016). Export Potential Assessments: A Methodology to Identify Export Opportunities for Developing Countries. ITC. Mullainathan, S., \u0026amp; Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2), 87–106. Hausmann, R., Hwang, J., \u0026amp; Rodrik, D. (2007). What You Export Matters. Journal of Economic Growth, 12(1), 1–25. Leontief, W. (1941). The Structure of the American Economy. Oxford University Press. O\u0026rsquo;Hagan, J., \u0026amp; Mooney, D. (1983). Input-Output Multipliers in a Small Open Economy. Economic and Social Review, 14(4), 273–280. 
","date":"10 February 2025","externalUrl":null,"permalink":"/posts/export-potential-econometrics-vs-ml/","section":"Posts","summary":"In 2020, I was handed a PDF — an ILO working paper titled Spotting Export Potential and Implications for Employment in Developing Countries (Cheong, Decreux \u0026 Spies, 2018) — and asked to turn it into a working algorithm.\nThe paper describes a methodology developed by the International Trade Centre to identify a country’s unrealized export opportunities, and then estimate how many jobs realizing those opportunities would create. Across six developing countries. At the product-market-sector level.\n","title":"Why I Used Econometrics Instead of ML to Estimate Export Potential — And What I Learned Implementing It","type":"posts"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]