CS738 (Winter 2025)

Data Engineering for Data Science

Calendar Description

Introduction to data engineering issues in data science. Data management technology objectives. Relational database technology, relational algebra, SQL, transactions, data modelling methodology, entity-relationship models. NoSQL databases including key-value stores, document databases, wide-column stores, graph databases. Overview of big data processing platforms. Data integration including data warehousing, data lakes, ETL and ELT approaches. Data preparation for analysis, data quality, data cleaning. Introduction to several current topics in database research, such as data mining, managing data streams, distributed/parallel databases, HTAP architectures.

Open to Master of Data Science and Artificial Intelligence students and others without an undergraduate course on database systems (instructor approval required).

Course Logistics

TBD

Syllabus & Schedule (Subject to adjustments)

Week | Lecture | Topic | Speaker
1 (Jan 6) | 1 | Course introduction; Introduction to Database Systems | Tamer Özsu
 | 2 | Relational model of data, relational calculus & algebra | Tamer Özsu
2 (Jan 13) | 1 | Relational algebra, SQL | Tamer Özsu
 | 2 | SQL | Tamer Özsu
3 (Jan 20) | 1 | Data modeling | Tamer Özsu
 | 2 | Catch-up day or in-class discussion |
4 (Jan 27) | 1 | Big data: Dealing with volume | Tamer Özsu
 | 2 | Big data: Dealing with volume | Tamer Özsu
5 (Feb 3) | 1 | Big data: Dealing with variety | Tamer Özsu
 | 2 | Big data: Dealing with variety | Tamer Özsu
6 (Feb 10) | 1 | Big data: Dealing with velocity | Tamer Özsu
 | 2 | Big data: Dealing with velocity | Tamer Özsu
7 (Feb 17) | | Reading week - no classes |
8 (Feb 24) | 1 | Midterm exam |
 | 2 | Catch-up day or in-class discussion |
9 (Mar 3) | 1 | Data integration: Data warehouses | Tamer Özsu
 | 2 | Data integration: Data lakes | Renée Miller
10 (Mar 10) | 1 | Cloud computing & cloud-native data management |
 | 2 | OLAP & OLTP: HTAP systems | Anil Goel
11 (Mar 17) | 1 | Data preparation: the pipeline |
 | 2 | Data quality & data cleaning |
12 (Mar 24) | 1 | Data quality & data cleaning |
 | 2 | Data provenance |
13 (Mar 31) | 1 | Vector databases | Jianguo Wang
 | 2 | Data management issues in LLMs | Theo Rekatsinas

Marking Scheme (Tentative)