MSA 8050: Scalable Data Analysis

The course integrates algorithmic theory, scalable computing systems, and project-based practice for modern data science on large datasets. It covers core algorithms for machine learning, recommender systems, graph mining, frequent pattern mining, and forecasting, and examines how these methods change when deployed on parallel and distributed systems such as Apache Spark. Theoretical discussions are combined with hands-on technical work in Spark, ETL and ELT pipeline design, workflow orchestration, experiment tracking, and scalable model development. Working on one of three retail analytics projects, students build and evaluate end-to-end solutions that improve both technical performance and analytical quality over a baseline implementation.

Schedule

Class Schedule
| Session | Date | Topic | In Class | Project Milestone |
| 1 | 2026-08-26 | Course Overview & Scale | ACT01 | Optional background & project preference survey (ungraded) |
| 2 | 2026-09-02 | Problem Formulation & Planning | ACT02 | M01: Team roster, project choice, and 1-page execution plan |
| 3 | 2026-09-09 | Recommenders & Frequent Patterns | ACT03 | |
| 4 | 2026-09-16 | Graph Basics & Connectivity | ACT04 | M02: Data understanding & initial ETL report (schema, data quality, first ETL run) |
| 5 | 2026-09-23 | Time Series & Many-Series Forecasting | ACT05 | |
| 6 | 2026-09-30 | Factorization & Sequential Patterns | ACT06 | M03: Baseline system implemented and evaluated (ETL+features, metrics, runtime notes) |
| 7 | 2026-10-07 | Communities, Triangles & Structure | ACT07 | |
| 8 | 2026-10-14 | Parallel ETL & Distributed Patterns | ACT08 | |
| 9 | 2026-10-21 | Model Selection & Large-Scale Experiments | ACT09 | M04: Improved system implemented, with comparative technical and quality metrics and initial trade-off analysis |
| 10 | 2026-10-28 | Optimization & Iterative Algorithms | ACT10 | |
| 11 | 2026-11-04 | Robustness, Drift & Monitoring | ACT11 | |
| 12 | 2026-11-11 | Algorithm-System Trade-offs & Cases | ACT12 | M05: Draft final report and slides with full baseline vs improved metrics and narrative |
| 13 | 2026-11-18 | Synthesis & Presentation Prep | ACT13 | |
| 14 | 2026-12-02 | Last day of class: Presentations | | |
| | 2026-12-09 | -- no class -- | | M06: Final reports and video |

Documents

Topics

Project

The course projects are a curated set of retail analytics problems designed with consistent scope, data scale, and technical rigor; they form the primary analytical work of the course. Each project follows a structured sequence of milestones that guides the development of an end-to-end scalable data analytics system. The sequence begins with data understanding and ETL/ELT pipeline construction, then progresses through baseline and improved modeling, systematic evaluation of analytical quality and system performance, and comprehensive reporting. Projects are developed and versioned in the internal GitLab repository and executed on the ARC to ensure reproducibility in a realistic computing environment. The milestone sequence culminates in a final report and presentation, producing a polished technical artifact suitable for inclusion in a professional portfolio.

Project Milestones

| Milestone | Due |
| M01: Team roster, project choice, and 1-page execution plan | 2026-09-02 |
| M02: Data understanding & initial ETL report (schema, data quality, first ETL run) | 2026-09-16 |
| M03: Baseline system implemented and evaluated (ETL+features, metrics, runtime notes) | 2026-09-30 |
| M04: Improved system implemented, with comparative technical and quality metrics and initial trade-off analysis | 2026-10-21 |
| M05: Draft final report and slides with full baseline vs improved metrics and narrative | 2026-11-11 |
| M06: Final reports and video | 2026-12-09 |