CS738 (Winter 2026)

Data Engineering for Data Science

 

Calendar Description

Introduction to data engineering issues in data science. Data management technology objectives. Structured data management: relational database technology, database workloads (OLTP vs. OLAP). Big data issues: dealing with volume (geo-distributed, cluster-parallel, and cloud-native data management), dealing with variety (data type-native systems, NoSQL database systems), dealing with velocity (streaming data management), and big data processing platforms (MapReduce, Spark). Data preparation pipeline: data acquisition, data integration (data warehouses, data lakes, lakehouses), dataset selection, data quality and cleaning, data provenance management. Introduction to several current topics in database research, such as large language models and vector databases.

Open to Master of Data Science and Artificial Intelligence students and others without an undergraduate course on database systems (instructor approval required).

Course Logistics

Use of Generative AI Tools

Syllabus & Schedule (Subject to adjustments)

Marking Scheme (Tentative)

University Policies & Statements