MSA 8050: Scalable Data Analysis

Listen to the introduction

Transcript

Speaker	Text
Alex	This is the brief on MSA 8050, Scalable Data Analysis. You’ve got the analytics basics down, but what happens when you’re suddenly staring at an absolute mountain of data? Think of this course as taking your existing skills and hooking them up to an industrial-sized engine to tackle scalable analytics, which is honestly crucial for modern retail and other data-heavy industries. First, we dive into the fusion of algorithmic theory and massive scale. We’ll cover core concepts like machine learning and forecasting, but we really push you to ask: does a standard algorithm still work when the data is literally too big for one computer? No way. That’s exactly where distributed systems like Apache Spark step in to save the day. Second, get ready for hands-on, end-to-end retail analytics. You’re gonna build a complete solution from scratch: designing data extraction pipelines, running a baseline, and building a far superior model. And why retail? Because it generates an absurd amount of messy, complex data, making it the perfect proving ground for testing scalable systems in the wild. Finally, you’ll turn these skills into a polished professional portfolio piece. You’ll use real industry tools like GitLab for version control and the university’s ARC supercomputing cluster. Think of ARC and GitLab as a flight simulator for data scientists. They give you concrete proof that you can handle real-world pressures before even hitting the job market. Ultimately, MSA 8050 takes your foundational analytics knowledge and scales it up to solve massive industry problems, taking you from raw data pipelines to a professional-grade presentation.

The course integrates algorithmic theory, scalable computing systems, and project-based practice for modern data science on large datasets. It covers core algorithms for machine learning, recommender systems, graph mining, frequent pattern mining, and forecasting, and examines how these methods change when deployed on parallel and distributed systems such as Apache Spark. Theoretical discussions are combined with hands-on technical work in Spark, ETL and ELT pipeline design, workflow orchestration, experiment tracking, and scalable model development. Through one of three retail analytics projects, students build and evaluate end-to-end solutions that improve on a baseline implementation in both technical performance and analytical quality.
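As a rough illustration of how a method changes once data is split across machines, the sketch below computes a mean from per-partition partial results instead of a single pass over one list. It is plain Python rather than Spark, and the names `partition`, `local_aggregate`, `merge`, and `distributed_mean` are hypothetical helpers invented for this example, not course material or Spark APIs.

```python
from functools import reduce
from typing import Iterable, List, Tuple

# A partial result that one worker can compute alone: (sum, count).
Partial = Tuple[float, int]


def partition(values: List[float], n_parts: int) -> List[List[float]]:
    """Split values into n_parts chunks, standing in for data spread over workers."""
    return [values[i::n_parts] for i in range(n_parts)]


def local_aggregate(chunk: Iterable[float]) -> Partial:
    """Work done independently on each partition: accumulate sum and count."""
    total, count = 0.0, 0
    for v in chunk:
        total += v
        count += 1
    return total, count


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial results; associative, so merge order does not matter."""
    return a[0] + b[0], a[1] + b[1]


def distributed_mean(values: List[float], n_parts: int = 4) -> float:
    """Mean computed from merged per-partition partials, not from the raw list."""
    partials = [local_aggregate(chunk) for chunk in partition(values, n_parts)]
    total, count = reduce(merge, partials, (0.0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return total / count
```

The design point is that averaging per-partition means would give a wrong answer when partitions differ in size, so each partition reports a mergeable `(sum, count)` pair instead. Distributed engines rely on the same kind of associative merge step for their aggregations.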

Schedule

Class Schedule
Session	Date	Topic	In Class	Project Milestone
1	2026-08-26	Course Overview & Scale	ACT01	Optional background & project preference survey (ungraded)
2	2026-09-02	Problem Formulation & Planning	ACT02	M01: Team roster, project choice, and 1-page execution plan
3	2026-09-09	Recommenders & Frequent Patterns	ACT03
4	2026-09-16	Graph Basics & Connectivity	ACT04	M02: Data understanding & initial ETL report (schema, data quality, first ETL run)
5	2026-09-23	Time Series & Many-Series Forecasting	ACT05
6	2026-09-30	Factorization & Sequential Patterns	ACT06	M03: Baseline system implemented and evaluated (ETL+features, metrics, runtime notes)
7	2026-10-07	Communities, Triangles & Structure	ACT07
8	2026-10-14	Parallel ETL & Distributed Patterns	ACT08
9	2026-10-21	Model Selection & Large-Scale Experiments	ACT09	M04: Improved system implemented, with comparative technical and quality metrics and initial trade-off analysis
10	2026-10-28	Optimization & Iterative Algorithms	ACT10
11	2026-11-04	Robustness, Drift & Monitoring	ACT11
12	2026-11-11	Algorithm-System Trade-offs & Cases	ACT12	M05: Draft final report and slides with full baseline vs improved metrics and narrative
13	2026-11-18	Synthesis & Presentation Prep	ACT13
14	2026-12-02	Last day of class: Presentations
	2026-12-09	-- no class --		M06: Final reports and video

Documents

Course Syllabus (PDF)

Topics

Project

The course projects are a curated set of retail analytics problems designed with consistent scope, data scale, and technical rigor, and they form the primary analytical work of the course. Each project follows a structured sequence of milestones that guides the development of an end-to-end scalable data analytics system. The sequence begins with data understanding and ETL/ELT pipeline construction, then progresses through baseline and improved modeling, systematic evaluation of analytical quality and system performance, and comprehensive reporting. Projects are developed and versioned in the internal GitLab repository and executed on the ARC cluster to ensure reproducibility in a realistic computing environment. The milestone sequence culminates in a final report and presentation, producing a polished technical artifact suitable for inclusion in a professional portfolio.
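The baseline-versus-improved comparison described above can be sketched as a small evaluation harness that records one quality metric and one runtime figure per model. This is a minimal sketch under stated assumptions: the two forecasters, the `evaluate` helper, and the toy weekly series are all invented for illustration and are not the course's actual projects, data, or required metrics.

```python
import time
from typing import Callable, Dict, List, Sequence

Forecaster = Callable[[Sequence[float], int], List[float]]


def naive_last_value(history: Sequence[float], horizon: int) -> List[float]:
    """Hypothetical baseline: repeat the last observed value."""
    return [history[-1]] * horizon


def seasonal_naive(history: Sequence[float], horizon: int, season: int = 7) -> List[float]:
    """Hypothetical improved model: repeat the value from one season earlier."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Quality metric: average absolute difference between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def evaluate(name: str, model: Forecaster,
             train: Sequence[float], test: Sequence[float]) -> Dict[str, object]:
    """Record one quality metric (MAE) and one system metric (wall-clock seconds)."""
    started = time.perf_counter()
    predictions = model(train, len(test))
    elapsed = time.perf_counter() - started
    return {"model": name, "mae": mean_absolute_error(test, predictions), "seconds": elapsed}


# Toy data invented for this example: four identical weeks of daily sales.
week = [10.0, 12.0, 11.0, 13.0, 20.0, 30.0, 25.0]
series = week * 4
train, test = series[:-7], series[-7:]

results = [
    evaluate("baseline", naive_last_value, train, test),
    evaluate("improved", seasonal_naive, train, test),
]
```

Keeping the quality metric and the runtime in the same record makes trade-offs visible side by side, which is the kind of comparison the later milestones ask for.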

Project Milestones

Milestone	Due
M01: Team roster, project choice, and 1-page execution plan	2026-09-02
M02: Data understanding & initial ETL report (schema, data quality, first ETL run)	2026-09-16
M03: Baseline system implemented and evaluated (ETL+features, metrics, runtime notes)	2026-09-30
M04: Improved system implemented, with comparative technical and quality metrics and initial trade-off analysis	2026-10-21
M05: Draft final report and slides with full baseline vs improved metrics and narrative	2026-11-11
M06: Final reports and video	2026-12-09