CS738 (Winter 2025)

Data Engineering for Data Science

Calendar Description

Introduction to data engineering issues in data science. Data management technology objectives. Relational database technology, relational algebra, SQL, transactions, data modelling methodology, entity-relationship models. NoSQL databases including key-value stores, document databases, wide-column stores, graph databases. Overview of big data processing platforms. Data integration including data warehousing, data lakes, ETL and ELT approaches. Data preparation for analysis, data quality, data cleaning. Introduction to several current topics in database research, such as data mining, managing data streams, distributed/parallel databases, HTAP architectures.

Open to Master of Data Science and Artificial Intelligence students and others without an undergraduate course on database systems (instructor approval required).

Course Logistics

TBD

Syllabus & Schedule (Subject to adjustments)

Week  Lecture  Topic                                                          Slides
1     1        Course introduction and scoping
      2        Relational model of data
2     1        SQL
      2        Advanced SQL, Relational algebra & relational calculus
3     1        Data modeling
      2        Relational DBMS internals (query processing)
4     1        Relational DBMS internals (transaction processing)
      2        Big data and NoSQL
5     1        Big data and text processing
      2        Big data and data streams
6     1        Big data and graph processing
      2        Big data and scaling: Classical relational distributed DBMS
7              Reading week - no classes
8     1        Big data processing platforms: MapReduce
      2        Big data processing platforms: MapReduce/Spark
9     1        Cloud computing & cloud-native data management
      2        Privacy in big data
10    1        Data integration: Data warehouses
      2        Data integration: Data lakes
11    1        OLAP & OLTP: HTAP systems
      2        Data mining
12    1        Data preparation: the pipeline
      2        Data quality & data cleaning
13    1        Data quality & data cleaning
      2        Data provenance

Marking Scheme (Tentative)