This is the brief on MSA 8050: Scalable Data Analysis. You've got the analytics basics down, but what happens when you're suddenly staring at an absolute mountain of data? Think of this course as taking your existing skills and hooking them up to an industrial-sized engine to tackle scalable analytics, which is crucial for modern retail and other data-heavy industries.

First, we dive into the fusion of algorithmic theory and massive scale. We'll cover core concepts like machine learning and forecasting, but we really push you to ask: does a standard algorithm still work when the data is literally too big for one computer? Usually not. That's exactly where distributed systems like Apache Spark step in.

Second, get ready for hands-on, end-to-end retail analytics. You're going to build a complete solution from scratch: designing data extraction pipelines, running a baseline, and building a measurably better model. And why retail? Because it generates an enormous amount of messy, complex data, making it the perfect proving ground for testing scalable systems in the wild.

Finally, you'll turn these skills into a polished professional portfolio piece. You'll use real industry tools like GitLab for version control and the university's ARC supercomputing cluster. Think of ARC and GitLab as a flight simulator for data scientists: they give you proof that you can handle real-world workloads before you even hit the job market.

Ultimately, MSA 8050 takes your foundational analytics knowledge and scales it up to solve massive industry problems, taking you from raw data pipelines to a professional-grade presentation.
The course integrates algorithmic theory, scalable computing systems, and project-based practice for modern data science on large datasets. It covers core algorithms for machine learning, recommender systems, graph mining, frequent pattern mining, and forecasting, and examines how these methods change when deployed on parallel and distributed systems such as Apache Spark. The course combines theoretical discussions with hands-on technical work in Spark, ETL and ELT pipeline design, workflow orchestration, experiment tracking, and scalable model development. Through one of three retail analytics projects, students build and evaluate end-to-end solutions that improve both technical performance and analytical quality over a baseline implementation.
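To make the "how methods change at scale" point concrete, here is a minimal sketch in plain Python (no Spark required; the data and function names are illustrative, not from the course). It shows the reformulation distributed aggregation relies on: you cannot average per-partition means directly, but per-partition (sum, count) pairs combine associatively into the exact global mean, which is the same pattern Spark uses for aggregations over partitioned data.

```python
from functools import reduce

def partial_stats(partition):
    """Map step: each worker reduces its own partition to (sum, count)."""
    return (sum(partition), len(partition))

def combine(a, b):
    """Reduce step: partial results merge associatively across workers."""
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(partitions):
    """Global mean from per-partition partials; no node ever sees all data."""
    total, count = reduce(combine, (partial_stats(p) for p in partitions))
    return total / count

# Hypothetical data split across three "machines".
partitions = [[1, 2, 3], [4, 5], [6]]
print(distributed_mean(partitions))  # 3.5
```

Note that naively averaging the per-partition means (2.0, 4.5, 6.0) would give 4.17, not the true mean 3.5; choosing combinable partial statistics is exactly the kind of redesign the course asks about.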
The course projects comprise a curated set of retail analytics problems designed with consistent scope, data scale, and technical rigor, forming the primary analytical work of the course. Each project follows a structured sequence of milestones that guide the development of an end-to-end scalable data analytics system, beginning with data understanding and ETL/ELT pipeline construction and progressing through baseline and improved modeling, systematic evaluation of analytical quality and system performance, and comprehensive reporting. Projects are developed and versioned in the internal GitLab repository and executed on ARC, the university's supercomputing cluster, to ensure reproducibility in a realistic computing environment. The milestone sequence culminates in a final report and presentation, producing a polished technical artifact suitable for inclusion in a professional portfolio.
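The ETL stage that opens the milestone sequence can be sketched in a few lines. This is a hypothetical illustration, not the course's actual pipeline: the schema, field names, and in-memory "warehouse" are stand-ins, and a real project would run this logic in Spark against files on ARC rather than over an inline string.

```python
import csv
import io

# Hypothetical messy retail extract: order 2 is missing its amount.
RAW = "order_id,amount\n1,19.99\n2,\n3,5.00\n"

def extract(text):
    """Extract: parse raw CSV text into dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows failing a data-quality check, cast types."""
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
            for r in rows if r["amount"]]

def load(rows):
    """Load: stand-in for writing to a warehouse table."""
    return {r["order_id"]: r["amount"] for r in rows}

table = load(transform(extract(RAW)))
print(table)  # {1: 19.99, 3: 5.0}
```

Separating the three stages this way is what makes the M02 deliverables (schema notes, data-quality findings, a first ETL run) natural by-products of the code rather than extra work.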
Project Milestones
Milestone | Due
M01: Team roster, project choice, and 1-page execution plan | 2026-09-02
M02: Data understanding & initial ETL report (schema, data quality, first ETL run) | 2026-09-16
M03: Baseline system implemented and evaluated (ETL+features, metrics, runtime notes) | 2026-09-30
M04: Improved system implemented, with comparative technical and quality metrics and initial trade-off analysis | 2026-10-21
M05: Draft final report and slides with full baseline vs. improved metrics and narrative |