CS738 (Winter 2026)

Data Engineering for Data Science

Calendar Description

Introduction to data engineering issues in data science. Data management technology objectives. Structured data management: relational database technology, database workloads (OLTP vs OLAP). Big data issues: dealing with volume (geo-distributed, cluster parallel, and cloud-native data management), dealing with variety (data type-native systems, NoSQL database systems), dealing with velocity (streaming data management), and big data processing platforms (MapReduce, Spark). Data preparation pipeline: data acquisition, data integration (data warehouses, data lakes, data lakehouses), dataset selection, data quality and cleaning, data provenance management. Introduction to several current topics in database research, such as large language models and vector databases.

Open to Master of Data Science and Artificial Intelligence students and others without an undergraduate course on database systems (instructor approval required).

Course Logistics

TBD

Syllabus & Schedule (Subject to adjustments)

Week         Lecture  Topic                                                                              Speaker
1 (Jan 5)    1        Course introduction; Structured Data Management: Introduction to Database Systems  Tamer Özsu
             2        Relational model of data, relational calculus & algebra                            Tamer Özsu
2 (Jan 12)   1        Relational algebra, SQL                                                            Tamer Özsu
             2        Database Workloads (OLTP & OLAP: HTAP systems)
3 (Jan 19)   1        Big data: Dealing with volume                                                      Tamer Özsu
             2        Big data: Dealing with volume                                                      Tamer Özsu
4 (Jan 26)   1        Big data: Dealing with variety                                                     Tamer Özsu
             2        Big data: Dealing with variety                                                     Tamer Özsu
5 (Feb 2)    1        Big data: Dealing with velocity                                                    Tamer Özsu
             2        Big data: Dealing with velocity                                                    Tamer Özsu
6 (Feb 9)    1        Cloud computing & cloud-native data management
             2        Introduction to data preparation pipeline                                          Tamer Özsu
7 (Feb 16)            Reading week - no classes
8 (Feb 23)   1        Midterm exam
             2        Data acquisition                                                                   Tamer Özsu
9 (Mar 2)    1        Data integration: Data warehouses                                                  Tamer Özsu
             2        Data integration: Data lakes
10 (Mar 9)   1        Data integration: Data lakehouses
             2        Data profiling
11 (Mar 16)  1        Data quality & data cleaning
             2        Data quality & data cleaning
12 (Mar 23)  1        Data provenance
             2        LLMs and Data Management
13 (Mar 30)  1        LLMs and Data Management
             2        Vector databases

Marking Scheme (Tentative)