February 26th, Montreal, Co-located with HPCA 2023
Welcome to the First Workshop for Heterogeneous and Composable Memory (HCM 2023)!
Memory systems are becoming heterogeneous and composable. Increasing memory heterogeneity and composability
helps expand memory capacity, improve memory utilization cost-effectively, and reduce total cost of ownership.
Heterogeneous and composable memory (HCM) provides a feasible path to terabyte-
or even petabyte-scale big-memory systems, meeting the performance and efficiency requirements of
emerging big-data applications.
However, building and using HCM
is challenging. We must answer a series of questions, such as how to interconnect memory
components based on different memory technologies (e.g., using Compute Express Link, or CXL),
how to organize those memory components in HCM for high performance, how to evolve (or even
revolutionize) existing system software that traditionally manages small-capacity, homogeneous
memory systems, and how to build memory abstractions and programming constructs for HCM management.
In general, HCM brings many unique opportunities and challenges, and we still lack knowledge on how to
build and use it.
The workshop on HCM aims to deepen our understanding of HCM and to bring together
researchers from academia and industry to share early discoveries, successful examples, and opinions
on the opportunities and challenges surrounding HCM.
Program
Opening Remarks 8:00 am - 8:10 am ET
Session 1 8:10 am - 9:50 am ET
Abstract:
Deep learning-based recommendation systems are resource-intensive and require large amounts of memory to achieve high accuracy. To meet these demands, hyperscalers have scaled up their recommendation models (RMs) to consume tens of terabytes of memory. Additionally, these models must be fault-tolerant and trained for long periods without accuracy degradation.
In this talk, we present TrainingCXL, an innovative solution that leverages CXL 3.0 to efficiently process large-scale RMs in disaggregated memory while ensuring training is failure-tolerant with low overhead. By integrating persistent memory (PMEM) and GPU as Type-2 devices in a cache-coherent domain, we enable direct access to PMEM without software intervention. TrainingCXL employs computing and checkpointing logic near the CXL controller to manage persistency actively and efficiently. To ensure fault tolerance, we use the unique characteristics of RMs to take checkpointing off the critical path of their training. We also employ an advanced checkpointing technique that relaxes the updating sequence of embeddings across training batches. The evaluation shows that TrainingCXL achieves significant performance improvements, including a 5.2x speedup and 72.6% energy savings compared to modern PMEM-based recommendation systems.
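As a concrete, purely illustrative sketch of taking checkpointing off the training critical path, the C program below snapshots an embedding table from a background thread while a stand-in training loop keeps updating it. TrainingCXL itself realizes this idea in hardware near the CXL controller, so everything here, from the row-granular locking to the table sizes, is an assumption for exposition rather than the system's actual design.

```c
/* Minimal software sketch (assumed, not TrainingCXL's hardware logic):
 * a background thread snapshots the embedding table row by row while
 * training continues, keeping checkpointing off the critical path.
 * Because rows are copied at different times, a snapshot may mix updates
 * from adjacent batches -- the relaxed update sequence the abstract
 * exploits. Compile with -pthread. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define ROWS 1024
#define DIM    64

static float table[ROWS][DIM];   /* live embedding table   */
static float snap[ROWS][DIM];    /* fuzzy checkpoint image */
static pthread_mutex_t row_lock[ROWS];
static bool done = false;
static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;

static bool is_done(void) {
    pthread_mutex_lock(&done_lock);
    bool d = done;
    pthread_mutex_unlock(&done_lock);
    return d;
}

static void *checkpointer(void *arg) {
    (void)arg;
    while (!is_done()) {
        for (int r = 0; r < ROWS; r++) {
            /* Row-granular copy: the trainer is only ever blocked
             * for the duration of one row, not the whole table. */
            pthread_mutex_lock(&row_lock[r]);
            memcpy(snap[r], table[r], sizeof(snap[r]));
            pthread_mutex_unlock(&row_lock[r]);
        }
    }
    return NULL;
}

int main(void) {
    for (int r = 0; r < ROWS; r++)
        pthread_mutex_init(&row_lock[r], NULL);

    pthread_t t;
    pthread_create(&t, NULL, checkpointer, NULL);

    for (int batch = 0; batch < 100; batch++) {   /* stand-in trainer */
        for (int r = 0; r < ROWS; r++) {
            pthread_mutex_lock(&row_lock[r]);
            for (int d = 0; d < DIM; d++)
                table[r][d] += 0.001f;            /* fake update */
            pthread_mutex_unlock(&row_lock[r]);
        }
    }

    pthread_mutex_lock(&done_lock);
    done = true;
    pthread_mutex_unlock(&done_lock);
    pthread_join(t, NULL);
    puts("fuzzy snapshot taken off the critical path");
    return 0;
}
```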
Speaker Bio:
Junhyeok Jang is a highly accomplished Ph.D. candidate under the supervision of Prof. Myoungsoo Jung at CAMELab of KAIST. His research expertise is focused on the cutting-edge field of hardware and software co-design for large-scale machine learning applications, with a particular emphasis on recommendation systems and graph neural networks (GNNs). Mr. Jang's extensive work in this field has led to numerous breakthroughs, including his pioneering research in developing TrainingCXL, a highly efficient and failure-tolerant system for processing large-scale recommendation models in disaggregated memory pools.
Abstract: As memory systems are becoming increasingly heterogeneous with different latencies, bandwidths, and device characteristics, we are met with the question: where to place data and when to move data?
In this talk, I will discuss the problems with using traditional caching for data tiering, and I will show how this technique is inappropriate for heterogeneous memory systems. Further, I will discuss some recent and ongoing work from the UC Davis
Computer Architecture Research Group (DArchR) that uses software hints to enable both transparent and application-specific data movement, showing large benefits over traditional hardware caching.
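One concrete flavor of such software hints on Linux is madvise(2). The sketch below is my example, not DArchR's actual mechanism: it marks a region as cold so the kernel can deprioritize or demote those pages rather than relying purely on hardware caching heuristics. MADV_COLD requires Linux 5.4 or later.

```c
/* Illustrative use of a software hint: tell the kernel a region has
 * gone cold so its pages become preferred reclaim/demotion candidates
 * (MADV_COLD, Linux >= 5.4). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL << 20;                      /* 64 MiB region */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 1, len);                          /* hot phase: touch it */

    /* Hint that the region is now cold; the kernel may move these
     * pages to a slower tier or reclaim them first. */
    if (madvise(buf, len, MADV_COLD) != 0)
        perror("madvise(MADV_COLD)");

    munmap(buf, len);
    return 0;
}
```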
Speaker Bio:
Jason Lowe-Power is an Assistant Professor at University of California, Davis where he leads the Davis Computer Architecture Research Lab (DArchR). His research interests include optimizing data movement in heterogeneous systems,
hardware support for security, and simulation infrastructure. Professor Lowe-Power is also the Chair of the Project Management Committee for the gem5 open-source simulation infrastructure. He received his PhD in 2017 from the University of Wisconsin-Madison, and has received
an NSF CAREER Award and a Google Research Scholar Award.
Abstract: Cloud providers seek to deploy CXL-based memory pools to reduce memory fragmentation as well as their embedded-carbon footprint. However, the design space of CXL-based memory systems is large. Key questions center around the size, reach, topology, and cost of the memory pool.
Pooling also requires navigating complex design constraints around performance, virtualization, and management. This talk discusses why cloud providers are working to deploy CXL memory pools, key design constraints, and observations in designing towards practical deployment.
Speaker Bio:
Daniel S. Berger is a Senior Researcher at Microsoft Azure Systems Research and an Affiliate Assistant Professor at the University of Washington. His research focuses on improving memory efficiency, sustainability, and robustness in public clouds. He is the recipient of the 2018-2019 Mark Stehlik Postdoctoral Fellowship at Carnegie Mellon University and the 2021 ACM SOSP Best Paper Award.
Abstract:
Resource disaggregation offers a cost-effective solution to resource scaling, utilization, and failure handling in data centers by physically separating the hardware devices of a server. Servers are architected as pools of processor, memory,
and storage devices, organized as independent failure-isolated components interconnected by a high-bandwidth network. A critical challenge, however, is the high performance penalty of accessing data from a remote memory module over the network. Addressing this challenge is difficult as
disaggregated systems have high runtime variability in network latencies/bandwidth, and page migration can significantly delay critical path cache line accesses in other pages. In this talk, we present a characterization analysis on different data movement strategies in fully disaggregated
systems and discuss their performance overheads in a variety of workloads. Then, we describe our new adaptive and software-transparent mechanism that can significantly alleviate data movement overheads in fully disaggregated memory systems. We demonstrate how our proposed hardware mechanism achieves
high system performance and robustness in data movement across a wide variety of emerging workloads at different network and architecture configurations. We conclude by providing future research directions on designing intelligent architectures and adaptive approaches for modern computing systems.
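To make the migrate-versus-access-remotely tradeoff concrete, here is a back-of-the-envelope break-even model; the latency numbers are assumptions for illustration, not measurements from the talk.

```c
/* Illustrative cost model (assumed numbers): migrating a 4 KiB page
 * pays a one-time network cost, while remote cache-line accesses pay
 * per access. Migration wins once expected accesses to the page
 * exceed the break-even point. */
#include <stdio.h>

int main(void) {
    double remote_ns  = 900.0;    /* assumed remote cache-line latency */
    double local_ns   = 100.0;    /* assumed local access latency      */
    double migrate_ns = 40000.0;  /* assumed one-time page migration   */

    /* n * remote_ns > migrate_ns + n * local_ns
       =>  n > migrate_ns / (remote_ns - local_ns)  */
    double breakeven = migrate_ns / (remote_ns - local_ns);
    printf("migrate once the page is accessed more than %.0f times\n",
           breakeven);
    return 0;
}
```

With these assumed numbers, migration pays off once a 4 KiB page is touched more than 50 times, which is why adaptive mechanisms that estimate per-page reuse can outperform any fixed policy under variable network latencies.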
Speaker Bio:
Christina Giannoula is a Postdoctoral Researcher at the University of Toronto, working with Prof. Gennady Pekhimenko and the EcoSystem research group. She is also working with the SAFARI research group, led by Prof. Onur Mutlu. She received her Ph.D. in October 2022 from the School of Electrical and Computer Engineering (ECE) at the National Technical University of Athens (NTUA), advised by Prof. Georgios Goumas,
Prof. Nectarios Koziris and Prof. Onur Mutlu. Her research interests lie in the intersection of computer architecture, computer systems and high-performance computing. Specifically, her research focuses on the hardware/software co-design of emerging applications, including graph processing, pointer-chasing data structures, machine learning workloads, and sparse linear algebra, with modern computing paradigms, such as large-scale multicore systems, disaggregated memory systems and near-data processing architectures.
She has several publications and awards for her research on these topics.
Abstract:
CXL is a dynamic multi-protocol interconnect technology designed to support accelerators and memory devices. CXL provides a rich set of protocols that include I/O semantics similar to PCIe (i.e., CXL.io), caching protocol semantics (i.e., CXL.cache),
and memory access semantics (i.e., CXL.mem) over the PCIe PHY. The CXL 2.0 specification enabled additional usage models beyond CXL 1.1, while remaining fully backwards compatible with CXL 1.1 (and CXL 1.0). CXL 2.0 enables dynamic resource allocation, including memory and accelerator disaggregation across multiple domains. It enables switching, managed hot-plug, security enhancements, persistent memory support, memory error reporting, and telemetry. CXL 3.0 adds new fabric capabilities to build large scale-out systems while doubling the bandwidth, with full backwards compatibility with CXL 1.0 and CXL 2.0.
The availability of commercial IP blocks, verification IPs, and industry-standard internal interfaces enables CXL to be widely deployed across the industry. These, along with a well-defined compliance program, will ensure smooth interoperability across CXL devices in the industry.
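On Linux, a CXL Type-3 memory expander typically surfaces as a CPU-less NUMA node, so CXL.mem capacity is reachable with ordinary loads and stores. The libnuma sketch below assumes node 1 is the CXL-backed node (check numactl -H on a real system) and should be linked with -lnuma.

```c
/* Hypothetical sketch: allocate from a CXL-backed NUMA node with
 * libnuma and touch it with plain stores. Node 1 is an assumption;
 * the real node id depends on the platform topology. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int cxl_node = 1;                 /* assumed CXL-backed node id */
    size_t len = 16UL << 20;          /* 16 MiB */
    void *p = numa_alloc_onnode(len, cxl_node);
    if (!p) { perror("numa_alloc_onnode"); return 1; }

    /* Preferred-node policy: stores land on the CXL node, though the
     * kernel may fall back to other nodes under memory pressure. */
    memset(p, 0, len);
    numa_free(p, len);
    return 0;
}
```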
Speaker Bio:
Dr. Debendra Das Sharma is an Intel Senior Fellow in the Data Platforms and Artificial Intelligence Group and chief architect of the I/O Technology and Standards Group at Intel Corporation. He drives PCI Express, Compute Express Link (CXL), Intel’s Coherency interconnect, and multichip package interconnect. He is a member of the Board of Directors of PCI-SIG and a lead contributor to PCIe specifications since its inception. He is a co-inventor and founding member of the CXL consortium and co-leads the CXL Technical Task Force. He is also the co-inventor of Universal Chiplet Interconnect Express (UCIe) and chairs the 100+ member UCIe consortium.
Dr. Das Sharma holds 160+ US patents and is a frequent keynote speaker, distinguished lecturer, invited speaker, and panelist at the Hot Interconnects, PCI-SIG Developers Conference, CXL consortium events, Open Server Summit, Open Fabrics Alliance, Flash Memory Summit, SNIA SDC, and Intel Developer Forum. He has a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur and a Ph.D. in Computer Engineering from the University of Massachusetts, Amherst. He has been awarded the Distinguished Alumnus Award from Indian Institute of Technology, Kharagpur in 2019, the IEEE Region 6 Outstanding Engineer Award in 2021, the first PCI-SIG Lifetime Contribution Award in 2022, and the IEEE Circuits and Systems Industrial Pioneer Award in 2022.
Abstract: CXL-based memory expansion decouples CPU and memory within a single server and enables flexible server design with different generations and types of memory technologies.
It can balance fleet-wide resource utilization and address the memory bandwidth and capacity scaling challenges in hyperscale datacenters. Without efficient memory management, however, such systems can significantly degrade application-level performance.
We propose a novel OS-level, application-transparent page placement mechanism (TPP) for efficient CXL-memory management. TPP employs lightweight mechanisms to identify hot and cold pages and place them in the appropriate memory tiers. It decouples page allocation from page reclamation logic, which are tightly coupled in today's Linux kernel.
At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow memory tiers to the fast tier node. Both promotion and demotion mechanisms work transparently without prior knowledge of an application's memory access behavior. TPP improves Linux's performance by up to 18% and outperforms state-of-the-art solutions for tiered memory by 10–17%.
TPP has been in active use in Meta's datacenters for over a year, and parts of it have been merged into the Linux kernel since v5.18.
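For readers who want to experiment with the upstreamed pieces, the sketch below toggles the two kernel knobs that tiered-memory placement builds on: reclaim-based demotion of cold pages, and NUMA-balancing mode 2 (NUMA_BALANCING_MEMORY_TIERING) for hot-page promotion. Paths and semantics are kernel-version dependent, so treat this as an assumption to verify against your kernel's documentation; it must run as root.

```c
/* Minimal sketch of enabling tiered-memory demotion and promotion on a
 * recent Linux kernel (paths and values are kernel-version dependent;
 * verify before use). Run as root. */
#include <stdio.h>

static int write_knob(const char *path, const char *val) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void) {
    /* Let reclaim demote cold pages to a slower tier. */
    write_knob("/sys/kernel/mm/numa/demotion_enabled", "true");
    /* Mode 2: NUMA balancing promotes hot pages from slow tiers. */
    write_knob("/proc/sys/kernel/numa_balancing", "2");
    return 0;
}
```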
Coffee Break 10:05 am - 10:20 am ET
Session 2 10:20 am - 12:20 pm ET
Abstract:
Memory is one of the most expensive, yet over-provisioned and underutilized, resources in current data centers. Systems for remote memory aim to improve memory utilization by allowing memory pooling across a rack of hosts. In this talk, I will provide a high-level overview of the research in this space and discuss a taxonomy of remote and disaggregated memory systems architected at different levels of the computing stack. I will discuss the benefits that these systems can provide for emerging ML applications and cloud operators, as well as the tradeoffs between the various approaches.
Speaker Bio:
Irina Calciu is a co-founder at Graft, a cloud-native startup that makes the AI of the 1% accessible to the 99%. Irina is broadly interested in machine learning systems, as well as parallel and distributed systems, with a focus on algorithms and systems for rack-scale computing. Before Graft, Irina was a Sr. Researcher at VMware Research, working on novel software-hardware co-design solutions for memory disaggregation. Irina completed her PhD at Brown University, working with Maurice Herlihy and Justin Gottschlich (Intel Labs) on algorithms for non-uniform memory access (NUMA) architectures and hybrid transactional memory. Irina has co-authored papers at top conferences, obtaining Best Paper awards at ASPLOS and TRANSACT, and holds more than 15 issued patents. She served as a program co-chair for ATC 2021 and on numerous program committees for top systems conferences, including OSDI, ASPLOS, and ATC.
Panelists Bio:
Manoj Wadekar is a Hardware Systems Technologist driving storage and memory technology and roadmaps at Meta. Manoj has been designing and building server, storage, and network solutions for over 30 years. He leads the Composable Memory Systems group in OCP. Manoj has evangelized memory and storage disaggregation, NVMe over Fabrics, and Lossless Ethernet (DCB/CEE) at industry conferences. Before joining Meta, he held engineering positions at eBay, QLogic, and Intel.
Michele Gazzetti is a Research Software Engineer at IBM Research Europe. His research interests include control-plane software management for composable systems and performance evaluation of workloads leveraging composable resources. Michele is also an active member of the OpenFabrics Management Framework Working Group, part of the OpenFabrics Alliance.
Jason Lowe-Power is an Assistant Professor at University of California, Davis where he leads the Davis Computer Architecture Research Lab (DArchR). His research interests include optimizing data movement in heterogeneous systems,
hardware support for security, and simulation infrastructure. Professor Lowe-Power is also the Chair of the Project Management Committee for the gem5 open-source simulation infrastructure. He received his PhD in 2017 from the University of Wisconsin-Madison, and has received
an NSF CAREER Award and a Google Research Scholar Award.
Daniel S. Berger is a Senior Researcher at Microsoft Azure Systems Research and an Affiliate Assistant Professor at the University of Washington. His research focuses on improving memory efficiency, sustainability, and robustness in public clouds. He is the recipient of the 2018-2019 Mark Stehlik Postdoctoral Fellowship at Carnegie Mellon University and the 2021 ACM SOSP Best Paper Award.
Attending
Venue:
Hotel Bonaventure Montreal
900 Rue De La Gauchetière O, Montreal, Quebec H5A 1E4, Canada
Date:
Feb 26 (Sunday) morning
Call for Papers
Workshop papers may be related to, but are not limited to, the following topics:
HCM architectures, such as interconnect technologies (such as CXL), memory pooling, and memory disaggregation;
Operating system designs to support HCM, such as memory profiling methods, page migration and allocation, and huge pages;
Characterization of HCM from the perspectives of performance, energy consumption, and reliability;
Use cases for HCM, such as deep-learning training and scientific applications;
Tools (such as simulators or platforms) for HCM research and engineering;
New programming models and program constructs to enable easy programming of HCMs;
HCM in virtualization environments;
New algorithms and performance models to manage and use HCM;
Runtime systems to manage HCM.
Submission
All submissions should be made electronically through the EasyChair website. Submissions must be double-blind,
i.e., authors should remove their names, institutions, and any hints found in references to earlier work. When discussing
past work, they should refer to themselves in the third person, as if they were discussing another researcher's work.
Furthermore, authors must identify any conflict of interest with the PC chair or PC members. Each paper is a
2-page abstract in the IEEE conference format. The page limit includes figures, tables, and appendices,
but does not include references, for which there is no page limit. Papers should be submitted in PDF format,
using the required IEEE conference template.
We encourage researchers from all institutions to submit their work for review.
Preliminary results of interesting ideas and work-in-progress are welcome. Acceptance at HCM'23
does not preclude future publication of the work at a major conference. Submissions that are likely to generate
vigorous discussion will be favored!