Topics

Overview of class topics. Click on the titles for details.

Session 1: Course Overview & Scale

Class Dates: 2026-08-26
Introduce the three problem domains: retail recommender systems, graph-based segmentation, and demand forecasting. Discuss how these tasks look on ‘small data’ (single-machine notebooks) versus ‘large-scale’ settings with millions of users, items, or time series. Define basic ML concepts (features, targets, train/validation/test splits, empirical risk) and highlight what changes when data no longer fits in memory (data locality, communication cost, approximate vs exact algorithms).
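The train/validation/test split mentioned above can be sketched in a few lines. This is a minimal single-machine version; `train_val_test_split` is a hypothetical helper name, and the docstring notes how the same idea is usually adapted once data is sharded across machines.

```python
import random

def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows once, then slice into train/validation/test.

    On small data this is a single in-memory shuffle; at scale the same
    idea is typically implemented by hashing a stable key (e.g. a user id)
    so each record lands in the same split on every machine.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test
```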

Session 2: Problem Formulation & Planning

Class Dates: 2026-09-02
Review supervised learning fundamentals (regression/classification, loss functions, overfitting) with emphasis on evaluation at scale: how full cross-validation becomes expensive and when a single holdout or time-based split is more appropriate. Contrast common metrics (MSE, MAE, MAPE, precision/recall, ranking metrics) and discuss the computational cost of computing metrics on large datasets. Frame each project type formally: recommendation as ranking/utility estimation, segmentation as graph partitioning, forecasting as multi-series prediction. Use short quizzes on appropriate metric choice and evaluation strategies under compute constraints.
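The metrics contrasted above are each a one-line aggregate; a minimal sketch (function names are illustrative, not from any particular library):

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error.

    Undefined when a true value is zero; callers must filter zeros
    or switch to a symmetric variant.
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(relevant, retrieved):
    """Set-based precision and recall for a retrieval/recommendation task."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Each of these is a single pass over the data, so at scale the cost is dominated by producing the predictions, not the metric itself; ranking metrics are the exception, since they require sorting or top-k selection per user.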

Session 3: Recommenders & Frequent Patterns

Class Dates: 2026-09-09
Introduce recommender systems as large, sparse matrix problems. Cover non-personalized methods (global popularity, item popularity within segments) and simple item-item similarity from co-occurrence. Explain frequent itemset mining for market-basket analysis: definition of support, confidence, and association rules. Present FP-growth: how FP-trees compress transaction databases and avoid candidate explosion, and why it scales better than Apriori. Discuss scaling issues: building FP-trees per partition, memory limits, pruning by minimum support, and how these patterns can be used to derive item-item edges or candidate sets in large retail systems. Include quick exercises on computing supports and small FP-trees.
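The support and confidence definitions above can be made concrete with a brute-force count over toy baskets. This is exactly the Apriori-style enumeration that FP-growth avoids; the helper names are illustrative:

```python
from itertools import combinations

def itemset_supports(transactions, max_size=2):
    """Support (fraction of baskets) for every itemset up to max_size.

    Brute-force enumeration: fine for toy data, but the number of
    candidate itemsets explodes combinatorially on real retail logs,
    which is the motivation for FP-trees.
    """
    n = len(transactions)
    counts = {}
    for basket in transactions:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {itemset: c / n for itemset, c in counts.items()}

def confidence(supports, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: P(both) / P(antecedent)."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return supports[both] / supports[tuple(sorted(antecedent))]
```

For example, if "a" appears in 3 of 4 baskets and {"a", "b"} in 2 of 4, the rule a → b has support 0.5 and confidence 2/3.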

Session 4: Graph Basics & Connectivity

Class Dates: 2026-09-16
Define graphs formally (nodes, edges, weighted/unweighted, directed/undirected) with examples from retail: product co-purchase graphs, user-item bipartite graphs. Introduce degree, paths, and neighborhoods (ego-nets) as a way to analyze local structure. Cover connected components (weak/strong) and their interpretation in co-purchase or social graphs (e.g., isolated subgraphs vs a giant component). Discuss the computational challenges of processing large graphs: storing adjacency information across partitions, iteratively propagating labels (for components), and the difference between small networkx-style analyses and cluster-scale graph processing. Use small toy graphs for cut/degree/component quizzes.
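The label-propagation view of connected components mentioned above can be sketched on an in-memory edge list. The loop below mirrors the iterative message-passing formulation used by cluster-scale graph engines (each pass, a node adopts the minimum label in its neighborhood), as opposed to the pointer-chasing union-find you would use on one machine; the function name is illustrative.

```python
def connected_components(edges):
    """Connected components of an undirected graph via min-label propagation.

    Every node starts labeled with its own id; on each pass, a node takes
    the minimum label among itself and its neighbors. Iteration stops when
    no label changes, at which point all nodes in a component share the
    smallest node id in that component.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    label = {node: node for node in adj}
    changed = True
    while changed:
        changed = False
        for node in adj:
            best = min([label[node]] + [label[nb] for nb in adj[node]])
            if best != label[node]:
                label[node] = best
                changed = True
    return label
```

The number of passes grows with the graph's diameter, which is exactly why distributed component-finding on long chains is slower than on dense, small-diameter graphs.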

Session 5: Time Series & Many-Series Forecasting

Class Dates: 2026-09-23
Cover time-series fundamentals: trend, seasonality, residuals, stationarity. Discuss classical baseline models (naïve last-value, moving averages, simple exponential smoothing) and appropriate error metrics (RMSE, MAE, MAPE). Emphasize the “many-series” setting in retail (thousands of product × store series) and what breaks from single-series methods: memory/compute constraints, the need for shared models or hierarchical structures, and proper time-based evaluation (rolling or expanding windows). Highlight how naive random splits cause leakage in time-series and why backtesting protocols are needed at scale. Include quick exercises computing simple baselines and errors on short synthetic series.
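The baselines above fit in a few lines each, which is part of the point: at many-series scale, these are the cheap reference every fancier model must beat. A minimal sketch with illustrative function names:

```python
def naive_forecast(series, horizon):
    """Naive last-value baseline: repeat the final observation."""
    return [series[-1]] * horizon

def moving_average_forecast(series, horizon, window=3):
    """Flat forecast from the mean of the last `window` observations."""
    avg = sum(series[-window:]) / window
    return [avg] * horizon

def rmse(actual, forecast):
    """Root mean squared error between two aligned sequences."""
    n = len(actual)
    return (sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n) ** 0.5
```

In a backtest, these would be re-fit and scored on each rolling window, never on a random split, since a random split lets the model peek at the future.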

Session 6: Factorization & Sequential Patterns

Class Dates: 2026-09-30
Present low-rank matrix factorization for recommenders: representing user-item interaction matrices as products of user and item latent factor matrices. Explain ALS-style objectives for implicit and explicit feedback, regularization, and how alternating optimization lends itself to distributed implementations (solving for user factors given item factors and vice versa). Discuss trade-offs between ALS and SGD at scale (synchronization, convergence, fault tolerance). Introduce sequential pattern mining via PrefixSpan: frequent subsequences, prefix-projected databases, and why PrefixSpan reduces database scans compared to older methods. Emphasize scaling issues: pattern explosion and the need for minimum support, maximum pattern length, and parallel prefix projections. Show how sequential patterns can inform next-item recommendation and session-level behavior modeling.
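The alternating structure of ALS shows up even in the smallest possible case: rank-1 factors on a fully observed matrix, where each solve has a closed form. This sketch deliberately omits regularization, confidence weights, and missing entries; real implicit-feedback ALS adds all three and distributes the per-row solves across workers. The function name is illustrative.

```python
def als_rank1(R, iters=50):
    """Rank-1 ALS on a fully observed matrix R (list of lists).

    Alternates two closed-form least-squares solves: fix the item factors
    v and solve each user factor u[i] = <R[i,:], v> / <v, v>, then fix u
    and solve each item factor v[j] symmetrically. Each half-step only
    needs the other side's factors, which is what makes the algorithm
    easy to parallelize.
    """
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v
```

On an exactly rank-1 matrix the iteration reaches a fixed point where `u[i] * v[j]` reconstructs `R[i][j]`; on real sparse interaction data, each sweep only decreases the (regularized) squared loss.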

Session 7: Communities, Triangles & Structure

Class Dates: 2026-10-07
Dive deeper into graph algorithms for segmentation. Introduce community detection objectives (modularity, conductance) and explain heuristic algorithms like Louvain/Leiden that optimize modularity in a multi-level fashion. Provide an intuitive overview of spectral clustering and the graph Laplacian (no heavy proofs, but clear conceptual links to partitioning). Introduce triangle counting and clustering coefficients as measures of local transitivity and “tight-knit” neighborhoods. Discuss how these metrics help characterize communities in product or customer graphs, and the challenges of counting triangles in large, skewed graphs (approximate methods, wedge sampling, special handling of high-degree nodes).
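Triangle counting and the local clustering coefficient discussed above can be made concrete with an exact per-node count on a toy graph; the quadratic-per-node neighbor-pair loop below is precisely what blows up on high-degree nodes, motivating wedge sampling and degree-ordered enumeration at scale. The function name is illustrative.

```python
from itertools import combinations

def triangles_and_clustering(edges):
    """Exact per-node triangle counts and local clustering coefficients.

    For each node, check which pairs of its neighbors are themselves
    connected (closed wedges). The clustering coefficient is the fraction
    of a node's wedges that are closed, a measure of local transitivity.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    tri, cc = {}, {}
    for node, nbrs in adj.items():
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        tri[node] = closed
        wedges = len(nbrs) * (len(nbrs) - 1) / 2
        cc[node] = closed / wedges if wedges else 0.0
    return tri, cc
```

A node in a "tight-knit" neighborhood has a coefficient near 1.0; a hub connecting otherwise unrelated products sits near 0, which is why these scores help characterize community structure.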

Session 8: Parallel ETL & Distributed Patterns

Class Dates: 2026-10-14
Explore algorithmic patterns behind ETL and large-scale analytics: map, shuffle, reduce, join, and dataflow graphs. Discuss complexity at the distributed level: computation cost vs communication cost, with shuffles as a key source of overhead. Analyze how different partitioning strategies (hash, range, custom keys) affect load balance and skew, especially in retail logs with heavy-tailed popularity distributions. Show how naive implementations of aggregations and joins become bottlenecks, and how to reason about more efficient algorithms. Include exercises where students choose partition keys and sketch map-reduce decompositions for concrete ETL tasks.
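The map → shuffle → reduce pattern can be modeled in a single process to make the cost structure visible. In the sketch below (function name illustrative), the "shuffle" is the grouping of mapper output by `hash(key) % n_partitions`; on a real cluster this is where a few hugely popular keys pile onto one partition and create skew.

```python
from collections import defaultdict

def map_shuffle_reduce(records, mapper, reducer, n_partitions=4):
    """Single-process model of the map -> shuffle -> reduce pattern.

    mapper(record) yields (key, value) pairs; the shuffle groups values
    by key into hash partitions; reducer(key, values) folds each group.
    """
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    for record in records:
        for key, value in mapper(record):
            # The 'shuffle': route each pair to a partition by key hash.
            partitions[hash(key) % n_partitions][key].append(value)
    result = {}
    for part in partitions:
        for key, values in part.items():
            result[key] = reducer(key, values)
    return result
```

Word count is the canonical instance: `mapper` emits `(word, 1)` per token and `reducer` sums. Swapping the hash for a range or custom partitioner is exactly the exercise of choosing partition keys for load balance.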

Session 9: Model Selection & Large-Scale Experiments

Class Dates: 2026-10-21
Revisit the bias-variance trade-off and model capacity, now constrained by computational budgets. Discuss hyperparameter tuning strategies (grid search, random search, simple adaptive methods) and why exhaustive hyperparameter sweeps may be impossible at scale. Cover the risk of “metric hacking” when repeatedly probing the same validation set across many runs, even with large data. Emphasize strategies for responsible model selection: limited sweeps, early stopping, and pre-defined evaluation protocols. Use examples from recommenders, segmentation, and forecasting where small hyperparameter tweaks may or may not justify heavy compute.
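Random search under a fixed evaluation budget, one of the tuning strategies above, can be sketched as follows; the function name and the list-valued search space are illustrative assumptions (continuous distributions work just as well).

```python
import random

def random_search(evaluate, space, budget, seed=0):
    """Random hyperparameter search under a fixed evaluation budget.

    `space` maps each hyperparameter name to a list of candidate values;
    `evaluate(cfg)` returns a validation score (higher is better). With a
    large grid, sampling `budget` configurations covers the influential
    dimensions far more cheaply than an exhaustive sweep, and the fixed
    budget itself is a guard against metric hacking.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```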

Session 10: Optimization & Iterative Algorithms

Class Dates: 2026-10-28
Provide a conceptual overview of gradient-based optimization: gradient descent, stochastic gradient descent, and mini-batching. Explain why full-batch methods become impractical with very large datasets, and how stochastic/mini-batch approaches trade variance for speed. Introduce data parallelism vs model parallelism and the idea of parameter servers/asynchronous updates (conceptually, not implementation-level). Connect these ideas back to specific algorithms used in the course (e.g., ALS vs SGD for factorization, iterative graph algorithms like PageRank). Emphasize that in practice, “good enough, fast” often wins over theoretically optimal but slow solutions.
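The variance-for-speed trade of mini-batch SGD can be seen on the smallest possible problem: fitting y ≈ w·x by least squares, where each step uses the gradient on a sampled mini-batch instead of the full dataset. A minimal sketch (function name illustrative):

```python
import random

def sgd_linear(data, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Mini-batch SGD for 1-D least squares: fit y ~ w * x.

    Each step descends the gradient of the mean squared error on one
    sampled mini-batch. The gradient is noisy relative to the full-batch
    gradient, but each iteration costs O(batch_size) instead of O(n),
    which is the core scaling argument for stochastic methods.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)
        # d/dw of mean((w*x - y)^2) over the batch = mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w
```

On noiseless data the iterates contract toward the true slope at every step; with noisy data the mini-batch variance leaves a residual wobble that a decaying learning rate would damp out.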

Session 11: Robustness, Drift & Monitoring

Class Dates: 2026-11-04
Discuss robustness and uncertainty in large-scale systems: impact of noisy labels, missing data, and outliers when logs are huge but not perfectly clean. Introduce covariate shift and concept drift with examples from retail (seasonality, changing customer behavior, product additions). Cover monitoring concepts from a theoretical perspective: what to track (distributional changes, performance on holdout slices), how to construct ‘cheap checks’ (schema checks, basic distribution checks) versus more detailed periodic evaluations. Emphasize that even with big data, computational limits constrain how often and how deeply models can be monitored.
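One of the 'cheap checks' above, a distributional sanity check on incoming batches, can be as simple as a mean-shift alert: O(n) per batch, easy to compute incrementally, and far cheaper than a full model re-evaluation. The function name and the three-sigma threshold are illustrative choices.

```python
def mean_shift_alert(reference, current, threshold=3.0):
    """Flag drift if the current batch mean moves more than `threshold`
    reference standard deviations away from the reference mean.

    A crude covariate-shift detector: it catches gross distributional
    changes (e.g. a broken upstream feed) but not subtle concept drift,
    which still requires periodic evaluation against labeled holdouts.
    """
    n = len(reference)
    ref_mean = sum(reference) / n
    ref_std = (sum((x - ref_mean) ** 2 for x in reference) / n) ** 0.5
    cur_mean = sum(current) / len(current)
    return abs(cur_mean - ref_mean) > threshold * ref_std
```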

Session 12: Algorithm-System Trade-offs & Cases

Class Dates: 2026-11-11
Synthesize algorithmic trade-offs across the three domains: for each project type, compare simple baselines vs more complex models under constraints of latency, memory, cost, and interpretability. Introduce the idea of Pareto frontiers (no formal math required) to reason about accuracy vs efficiency trade-offs. Present short case-study vignettes from industry-scale systems (e.g., simpler models in high-traffic settings, approximate algorithms for triangle counting or PageRank). Encourage students to map their project choices onto this framework: where does their baseline sit, where does their improved model sit, and is there a ‘sweet spot’ in between?
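The Pareto-frontier framing above has a direct computational reading: given candidate models scored as (accuracy, cost) pairs, keep only those not dominated by a model that is at least as accurate and no more costly. A minimal sketch (function name illustrative):

```python
def pareto_frontier(points):
    """Non-dominated (accuracy, cost) points, cheapest first.

    Sort by cost (breaking ties by higher accuracy), then sweep: a point
    joins the frontier only if it beats the best accuracy seen so far
    among all cheaper-or-equal models. O(n log n) overall.
    """
    frontier = []
    best_acc = float("-inf")
    for acc, cost in sorted(points, key=lambda p: (p[1], -p[0])):
        if acc > best_acc:
            frontier.append((acc, cost))
            best_acc = acc
    return frontier
```

Plotting a project's baseline and improved model on these two axes makes the 'sweet spot' question concrete: a model off the frontier is strictly worse than some alternative, while points on the frontier represent genuine trade-offs.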

Session 13: Synthesis & Presentation Prep

Class Dates: 2026-11-18
Unify the course’s theoretical themes: learning from large-scale structured data (interactions, graphs, sequences) under constraints of computation, memory, and reliability. Revisit the central concepts: sparse matrices and factorization, graph communities and link analysis, many-series forecasting, frequent pattern and sequential mining, and distributed algorithm patterns for ETL. Highlight open research and practice questions in scalable recommenders, graph ML, and retail forecasting. Facilitate discussion where students explicitly articulate how their project deviated from a naive small-data solution and what approximations or design choices scale forced on them.