For many years, the preferred parallel data management architecture was shared-nothing, in which multiple compute units, each consisting of CPU + memory + storage, are connected by a network. Parallel computation in a shared-nothing architecture is usually performed through explicit message passing. This architecture has been reflected in cloud data centres in the form of a rack-and-blade design, where each blade contains some compute elements, some memory, usually some storage, and perhaps some accelerators, all tightly connected, and multiple blades are mounted on a rack. This converged data centre architecture is now starting to experience difficulties in adequately serving emerging big data applications with highly varying requirements. An important promise of cloud computing is elasticity, i.e., accommodating these extreme application needs, but only when they arise. The converged data centre architecture, however, is stressed in providing this elasticity and usually over-provisions (i.e., allocates at the beginning of a computation task the maximum resources that may be needed at some point), resulting in low resource utilization. Most data centres have moved to shared-storage solutions by deploying SANs that are connected to the compute units through high-speed networks; this is called disaggregated storage. More recent work has focused on also disaggregating memory, where a bank of memory units is accessible to the compute units over very high-speed, low-latency networks; this is called disaggregated memory. The combination of the two to achieve a fully disaggregated platform architecture has not yet been studied extensively.
At the same time, there is significant interest in using hardware accelerators for data management. Most of this work has focused on GPUs, but there is growing interest in FPGAs as well. Many of the proposed solutions target a single machine, although parallel environments and CPU-GPU combinations have also been investigated. There is some work on using FPGAs in data management systems and some on CPU-FPGA combinations, but no work yet considers a heterogeneous CPU-GPU-FPGA platform. With the increasing role of data management in AI and ML workloads, hardware accelerators will become increasingly important. Current platforms connect these accelerators directly to the compute units, but they can be disaggregated in the same way storage and memory are.
The purpose of this seminar is to study and debate the design issues of a disaggregated and heterogeneous computing platform (DHCP) and its impact on data management. The design space is quite large, so when more focus is needed, we will fix the workload and concentrate on graph processing.
The course is based on weekly paper readings, student presentations, discussions, and a term project.
Weeks 1 and 2: I will present some background material to set the stage. There are background readings for this part of the course; I expect you to read them and come prepared to discuss them (see below for details).
Week 3: Prof. Samer Al-Kiswany and his student will present their work on RDMA. We will then have a discussion on CXL.
Weeks 4-12: We will spend class time on paper presentations and discussions. Each week will include two paper presentations, each followed by a discussion.
Week 13: Brief project presentations (see below).
Discussions will be over Piazza. Please go here and add yourself to the class.
See the schedule. Please note that we might adjust it as needed depending on class size.
Please review the materials concerning academic integrity and academic honesty.