Session 1A: Synthesis for Architectures
(Location: Grande A/B)
12:00 PDT – 12:15 PDT

Explainable Port Mapping Inference with Sparse Performance Counters for AMD’s Zen Architectures
Fabian Ritter and Sebastian Hack (Saarland University)

Abstract: Performance models are instrumental for optimizing performance-sensitive code. When modeling the use of functional units of out-of-order x86-64 CPUs, data availability varies by the manufacturer: Instruction-to-port mappings for Intel’s processors are available, whereas information for AMD’s designs is lacking. The reason for this disparity is that standard techniques to infer exact port mappings require hardware performance counters that AMD does not provide. In this work, we modify the port mapping inference algorithm of the widely used uops.info project to not rely on Intel’s performance counters. The modifications are based on a formal port mapping model with a counter-example-guided algorithm powered by an SMT solver. We investigate to what extent AMD’s processors comply with this model and where unexpected performance characteristics prevent an accurate port mapping. Our results provide valuable insights for creators of CPU performance models as well as for software developers who want to achieve peak performance on recent AMD CPUs.
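
To make the counterexample-guided idea concrete, the following toy Python sketch infers the port set of a single micro-op from throughput measurements alone, with brute-force enumeration of candidate mappings standing in for the paper's SMT solver; the three-port machine, the throughput model, and the simulated measurements are all invented for illustration.

# Toy sketch (not the paper's algorithm): counterexample-guided inference of the
# port set used by one instruction's micro-op. Brute-force enumeration stands in
# for the SMT solver; ports, throughput model, and "measurements" are simplified.
from itertools import combinations

PORTS = (0, 1, 2)
TRUE_PORTS = frozenset({0, 2})          # ground truth we pretend not to know

def throughput(ports, blocked):
    """Simplified model: one uop/cycle per usable (non-blocked) port."""
    return len(set(ports) - set(blocked))

def measure(blocked):
    """Stand-in for running a microbenchmark on real hardware."""
    return throughput(TRUE_PORTS, blocked)

# All candidate port sets (non-empty subsets of PORTS).
candidates = [frozenset(c) for r in range(1, len(PORTS) + 1)
              for c in combinations(PORTS, r)]

experiments_run = {}
while len(candidates) > 1:
    # Find an experiment (set of blocked ports) on which two surviving
    # candidates disagree -- the "counterexample" that refines the model.
    found = None
    for r in range(len(PORTS) + 1):
        for blocked in combinations(PORTS, r):
            preds = {throughput(c, blocked) for c in candidates}
            if len(preds) > 1 and blocked not in experiments_run:
                found = blocked
                break
        if found is not None:
            break
    if found is None:
        break                            # remaining candidates are indistinguishable
    observed = measure(found)
    experiments_run[found] = observed
    candidates = [c for c in candidates if throughput(c, found) == observed]

print("experiments:", experiments_run)
print("inferred port set(s):", candidates)
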
12:15 PDT – 12:30 PDT

Longnail: High-Level Synthesis of Portable Custom Instruction Set Extensions for RISC-V Processors from Descriptions in the Open-Source CoreDSL Language
Julian Oppermann, Brindusa Mihaela Damian-Kosterhon, Florian Meisel, and Tammo Mürmann (Technical University of Darmstadt); Eyck Jentzsch (MINRES Technologies GmbH); Andreas Koch (Technical University of Darmstadt)

Abstract: In the RISC-V ecosystem, custom instruction set architecture extensions (ISAX) are an energy-efficient and cost-effective way to accelerate modern embedded workloads. However, exploring different combinations of base cores and ISAXes for a specific application requires automation and a level of portability across microarchitectures that is not provided by existing approaches. To that end, we present an end-to-end flow for ISAX specification, generation, and integration into a number of host cores having a range of different microarchitectures. For ISAX specification, we propose CoreDSL, a novel behavioral architecture description language that is concise, easy to learn, and open source. Hardware generation is handled by Longnail, a domain-specific high-level synthesis tool that compiles CoreDSL specifications into hardware modules compatible with the recently introduced SCAIE-V extension interface, which we rely on for automatic integration into the host cores. We demonstrate our tooling by generating ISAXes using a mix of features, including complex multi-cycle computations, memory accesses, branch instructions, custom registers, and decoupled execution across four embedded cores and evaluate the quality of results on a 22nm ASIC process.
12:30 PDT – 12:45 PDT

SEER: Super-Optimization Explorer for High-Level Synthesis using E-graph Rewriting
Jianyi Cheng (University of Cambridge and Intel); Samuel Coward (Imperial College London and Intel); Lorenzo Chelini, Rafael Barbalho, and Theo Drane (Intel)

Abstract: High-level synthesis (HLS) is a process that automatically translates a software program in a high-level language into a low-level hardware description. However, the hardware designs produced by HLS tools still suffer from a significant performance gap compared to manual implementations. This is because the input HLS programs must still be written using hardware design principles. Existing techniques either leave the program source unchanged or perform a fixed sequence of source transformation passes, potentially missing opportunities to find the optimal design. We propose a super-optimization approach for HLS that automatically rewrites an arbitrary software program into efficient HLS code that can be used to generate an optimized hardware design. We developed a toolflow named SEER, based on the e-graph data structure, to efficiently explore equivalent implementations of a program at scale. SEER provides an extensible framework, orchestrating existing software compiler passes and hardware optimizers. Our work is the first attempt to exploit e-graph rewriting for large software compiler frameworks, such as MLIR. Across a set of open-source benchmarks, we show that SEER achieves up to 38× the performance within 1.4× the area of the original program. Via an Intel-provided case study, SEER demonstrates the potential to outperform manually optimized designs produced by hardware experts.
12:45 PDT – 13:00 PDT

HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis
Hanchen Ye, Hyegang Jun, and Deming Chen (University of Illinois Urbana-Champaign)

Abstract: Dataflow architectures are growing in popularity due to their potential to mitigate the challenges posed by the memory wall inherent to the Von Neumann architecture. At the same time, high-level synthesis (HLS) has demonstrated its efficacy as a design methodology for generating efficient dataflow architectures within a short development cycle. However, existing HLS tools rely on developers to explore the vast dataflow design space, ultimately leading to suboptimal designs. This phenomenon is especially concerning as the size of the HLS design grows. To tackle these challenges, we introduce HIDA, a new scalable and hierarchical HLS framework that can systematically convert an algorithmic description into a dataflow implementation on hardware. We first propose a collection of efficient and versatile dataflow representations for modeling the hierarchical dataflow structure. Capitalizing on these representations, we develop an automated optimizer that decomposes the dataflow optimization problem into multiple levels based on the inherent dataflow hierarchy. Using FPGAs as an evaluation platform, working with a set of neural networks modeled in PyTorch, HIDA achieves up to 8.54× higher throughput compared to the state-of-the-art (SOTA) HLS optimization tool. Furthermore, despite being fully automated and able to handle various applications, HIDA achieves 1.29× higher throughput over the SOTA RTL-based neural network accelerators on an FPGA.
Session 1B: Optimizing ML Communication
(Location: Grande C)
12:00 PDT – 12:15 PDT

TCCL: Discovering Better Communication Paths for PCIe GPU Clusters
Heehoon Kim, Junyeol Ryu, and Jaejin Lee (Seoul National University)

Abstract: Exploiting parallelism to train deep learning models requires GPUs to cooperate through collective communication primitives. While systems like DGX, equipped with proprietary interconnects, have been extensively studied, the systems where GPUs mainly communicate through PCIe have received limited attention. This paper introduces TCCL, a collective communication library designed explicitly for such systems. TCCL has three components: a profiler for multi-transfer performance measurement, a pathfinder to discover optimal communication paths, and a modified runtime of NCCL to utilize the identified paths. The focus is on ring-based collective communication algorithms that apply to popular communication operations in deep learning, such as AllReduce and AllGather. The evaluation results of TCCL on three different PCIe-dependent GPU clusters show that TCCL outperforms the state-of-the-art communication libraries, NCCL and MSCCL, by up to 2.07×. We also evaluate TCCL with DL training workloads with various combinations of parallelism types.
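
As a rough illustration of what a pathfinder for ring-based collectives has to decide, the sketch below picks the ring order whose bottleneck link bandwidth is largest, given a made-up pairwise bandwidth matrix; TCCL's profiler measures such numbers on real PCIe topologies, and its search is far more scalable than this brute force.

# Illustrative sketch only: choose a ring order for AllReduce over GPUs whose
# pairwise transfer bandwidths differ (as they do on PCIe trees). The matrix
# below is hypothetical.
from itertools import permutations

# bw[i][j]: achievable GB/s when GPU i sends to GPU j (invented numbers).
bw = [
    [0, 24, 12, 6],
    [24, 0, 6, 12],
    [12, 6, 0, 24],
    [6, 12, 24, 0],
]
n = len(bw)

def ring_bottleneck(order):
    """A ring pipeline is limited by its slowest directed link."""
    return min(bw[order[i]][order[(i + 1) % n]] for i in range(n))

# Fix GPU 0 as the start to avoid counting rotations of the same ring twice.
best = max(((0,) + rest for rest in permutations(range(1, n))),
           key=ring_bottleneck)
print("best ring:", best, "bottleneck GB/s:", ring_bottleneck(best))
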
12:15 PDT – 12:30 PDT

Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
Chang Chen, Xiuhong Li, and Qianchao Zhu (Peking University); Jiangfei Duan (Chinese University of Hong Kong); Peng Sun and Xingcheng Zhang (Shanghai AI Lab); Chao Yang (Peking University)

Abstract: Efficiently training large language models (LLMs) necessitates the adoption of hybrid parallel methods, integrating multiple communication collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often achieved through communication and computation overlaps. However, existing overlap methodologies tend to lean towards either fine-grained kernel fusion or limited operation scheduling, constraining performance optimization in heterogeneous training environments. This paper introduces Centauri, an innovative framework that encompasses comprehensive communication partitioning and hierarchical scheduling schemes for optimized overlap. We propose a partition space comprising three inherent abstraction dimensions: primitive substitution, topology-aware group partitioning, and workload partitioning. These dimensions collectively create a comprehensive optimization space for efficient overlap. To determine the efficient overlap of communication and computation operators, we decompose the scheduling tasks in hybrid parallel training into three hierarchical tiers: operation, layer, and model. Through these techniques, Centauri effectively overlaps communication latency and enhances hardware utilization. Evaluation results demonstrate that Centauri achieves up to 1.49× speedup over prevalent methods across various parallel training configurations.
12:30 PDT – 12:45 PDT

T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Suchita Pati (University of Wisconsin-Madison and AMD); Shaizeen Aga, Mahzabeen Islam, and Nuwan Jayasena (AMD); Matthew D. Sinclair (University of Wisconsin-Madison and AMD)

Abstract: Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3 which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer’s output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer’s compute and communication. It further uses compute-enhanced memories for communication’s attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3’s benefits persist as models scale: geomean 29% for sublayers in ∼500-billion parameter models, PALM and MT-NLG.
12:45 PDT – 13:00 PDT

Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM
Charles Block, Gerasimos Gerogiannis, and Charith Mendis (University of Illinois at Urbana-Champaign); Ariful Azad (Indiana University); Josep Torrellas (University of Illinois at Urbana-Champaign)

Abstract: Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging from scientific computing to graph neural networks. Typically, when SpMM is executed in a distributed platform, communication costs dominate. Such costs depend on how communication is scheduled. If it is scheduled in a sparsity-unaware manner, such as with collectives, execution is often inefficient due to unnecessary data transfers. On the other hand, if communication is scheduled in a fine-grained sparsity-aware manner, communicating only the necessary data, execution can also be inefficient due to high software overhead. We observe that individual sparse matrices often contain regions that are denser and regions that are sparser. Based on this observation, we develop a model that partitions communication into sparsity-unaware and sparsity-aware components. Leveraging the partition, we develop a new algorithm that performs collective communication for the denser regions, and fine-grained, one-sided communication for the sparser regions. We call the algorithm Two-Face. We show that Two-Face attains an average speedup of 2.11x over prior work when evaluated on a 4096-core supercomputer. Additionally, Two-Face scales well with the machine size.
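
The cost trade-off driving the partitioning can be sketched with a simple per-block model; the constants and the decision rule below are illustrative, not Two-Face's actual model. Collective transfers pay a sparsity-independent cost, while one-sided transfers pay per referenced row plus a per-message overhead.

# A minimal sketch: choose, per block of the sparse matrix, between a collective
# transfer (cost independent of sparsity) and fine-grained one-sided transfers
# (pay per referenced row, plus per-message overhead). Constants are invented.
ALPHA = 5.0    # per-message software/latency overhead (us)
BETA = 0.01    # per-element transfer cost (us)
ROWS_PER_BLOCK = 1024
COLS_B = 128   # columns of the dense matrix B

def collective_cost():
    # Broadcast the whole block of B rows, whether or not they are needed.
    return ALPHA + BETA * ROWS_PER_BLOCK * COLS_B

def one_sided_cost(nonzero_rows):
    # Fetch only the B rows that this block's nonzeros actually reference.
    return nonzero_rows * (ALPHA + BETA * COLS_B)

def schedule(block_nonzero_rows):
    """Per-block plan: 'collective' for denser blocks, 'one-sided' otherwise."""
    plan = []
    for nnz_rows in block_nonzero_rows:
        use_collective = collective_cost() <= one_sided_cost(nnz_rows)
        plan.append("collective" if use_collective else "one-sided")
    return plan

# Example: block 0 touches 900 of 1024 rows (dense-ish), block 1 touches 3 rows.
print(schedule([900, 3]))   # -> ['collective', 'one-sided']
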

Session 1C: Case Studies and Experience
(Location: Grande D/E)
12:00 PDT – 12:15 PDT

A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors
Reese Kuper (University of Illinois at Urbana-Champaign); Ipoom Jeong (University of Illinois Urbana-Champaign); Yifan Yuan, Ren Wang, Narayan Ranganathan, and Nikhil Rao (Intel Labs); Jiayu Hu (Tencent); Sanjay Kumar and Philip Lantz (Intel Labs); Nam Sung Kim (University of Illinois Urbana-Champaign)

Abstract: As semiconductor power density no longer stays constant as the technology process scales down, we need different solutions if we are to continue scaling application performance. To this end, modern CPUs are integrating capable data accelerators on the chip, aiming to improve performance and efficiency for a wide range of applications and usages. One such accelerator is the Intel® Data Streaming Accelerator (DSA), introduced with Intel® 4th Generation Xeon® Scalable CPUs (Sapphire Rapids). DSA targets data movement operations in memory that are common sources of overhead in datacenter workloads and infrastructure. In addition, it supports a wider range of operations on streaming data, such as CRC32 calculations, computation of deltas between data buffers, and data integrity field (DIF) operations. This paper aims to introduce the latest features supported by DSA, dive deep into its versatility, and analyze its throughput benefits through a comprehensive evaluation with both microbenchmarks and real use cases. Along with the analysis of its characteristics and the rich software ecosystem of DSA, we summarize several insights and guidelines for the programmer to make the most out of DSA, and use an in-depth case study of DPDK Vhost to demonstrate how these guidelines benefit a real application.
12:15 PDT – 12:30 PDT

Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples
Phitchaya Mangpo Phothilimthana (Google DeepMind); Saurabh Kadekodi (Google); Soroush Ghodrati (University of California San Diego); Selene Moon (Google); Martin Maas (Google DeepMind)

Abstract: Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual “what-if” analyses that assess the impact of anticipated or hypothetical new storage policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such hypothetical full-resolution I/O traces by carefully combining down-sampled I/O traces collected from multiple disks attached to multiple storage servers. Applying this approach to real-world traces that are already routinely sampled at Google, we show that our synthesized traces achieve 95–99.5% accuracy in read/write request numbers, 90–97% accuracy in utilization, and 80–99.8% accuracy in read latency compared to metrics collected from actual disks. We demonstrate how Thesios enables diverse counterfactual I/O trace synthesis and analyses of hypothetical policy, hardware, and server changes through four case studies: (1) studying the effects of changing a disk’s utilization, fullness, and capacity, (2) evaluating a new data placement policy, (3) analyzing the impact on power and performance of deploying disks with reduced rotations-per-minute (RPM), and (4) understanding the impact of increased buffer cache size on a storage server. Without Thesios, such counterfactual analyses would require costly and potentially risky A/B experiments in production.
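
A minimal sketch of the stitching idea, under the simplifying assumption that each of N similar disks logs a 1-in-N sample of its requests: merging the per-disk samples by arrival time approximates one disk's full-resolution trace. The trace schema and sampling model below are invented for illustration, not Thesios's actual format.

# Sketch: merge N down-sampled per-disk traces by timestamp into one
# synthesized full-resolution trace.
import heapq
import random

random.seed(0)
N = 4  # number of disks, each sampling 1/N of its requests

def sampled_trace(disk_id, n_requests=1000, rate=N):
    """Generate a toy down-sampled trace: (arrival_time_ms, disk_id, op, bytes)."""
    t = 0.0
    out = []
    for i in range(n_requests):
        t += random.expovariate(1.0)          # ~1 request per ms on average
        if i % rate == disk_id:               # keep a 1-in-N sample
            op = "read" if random.random() < 0.7 else "write"
            out.append((t, disk_id, op, random.choice([4096, 65536, 1 << 20])))
    return out

# Merge the per-disk samples by arrival time into one synthesized trace.
synthesized = list(heapq.merge(*(sampled_trace(d) for d in range(N))))

reads = sum(1 for _, _, op, _ in synthesized if op == "read")
print(f"{len(synthesized)} synthesized requests, {reads} reads")
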
12:30 PDT – 12:45 PDT

A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs
Michael Davies, Ian McDougall, Selvaraj Anandaraj, Deep Machchhar, Rithik Jain, and Karthikeyan Sankaralingam (University of Wisconsin-Madison)

Abstract: We are in the age of AI, with rapidly changing algorithms and a somewhat synergistic change in hardware. MLPerf is a recent benchmark suite that serves as a way to compare and evaluate hardware. However, it has several drawbacks: it is dominated by CNNs, does a poor job of capturing the diversity of AI use cases, and represents only a sliver of production AI use cases. This paper performs a longitudinal study of state-of-the-art AI applications spanning vision, physical simulation, vision synthesis, language and speech processing, and tabular data processing, across three generations of hardware to understand how the AI revolution has panned out. We call this collection of applications and execution scaffolding the CaSiO suite. The paper reports on data gathered at the framework level, device API level, and hardware and microarchitecture level. The paper provides insights on the hardware-software revolution with pointers to future trends.
12:45 PDT – 13:00 PDT

Expanding Datacenter Capacity with DVFS Boosting: A safe and scalable deployment experience
Leonardo Piga, Iyswarya Narayanan, Aditya Sundarrajan, Matt Skach, and Qingyuan Deng (Meta); Biswadip Maity (University of California Irvine); Manoj Chakkaravarthy, Alison Huang, Abhishek Dhanotia, and Parth Malani (Meta)

Abstract: The COVID-19 pandemic created unexpected demand for our physical infrastructure. We increased our computing supply by growing our infrastructure footprint as well as by expanding existing capacity using various techniques, among them DVFS boosting. This paper describes our experience in deploying DVFS boosting to expand capacity. There are several challenges in deploying DVFS boosting at scale. First, frequency scaling incurs additional power demand, which can exacerbate power over-subscription and incur unexpected capacity loss for the services due to power capping. Second, heterogeneity is commonplace in any large scale infrastructure. We need to deal with the service and hardware heterogeneity to determine the optimal setting for each service and hardware type. Third, there exists a long tail of services with scarce resources and support for performance evaluation. Finally and most importantly, we need to ensure that large scale changes to CPU frequency do not risk the reliability of the services and the infrastructure. We present our solution that has overcome the above challenges and has been running in production for over 3 years. It created 12 MW of supply, which is equivalent to building and populating half a datacenter in our fleet. In addition to the real world performance of our solution, we also share our key takeaways to improve fleetwide efficiency via DVFS boosting in a safe manner.


Session 1D: Attacks and Mitigations
(Location: Scripps I/II)
12:00 PDT – 12:15 PDT

Rubix: Reducing the Overhead of Secure Rowhammer Mitigations via Randomized Line-to-Row Mapping
Anish Saxena, Saurav Mathur, and Moinuddin Qureshi (Georgia Tech)

Abstract: Modern systems mitigate Rowhammer using victim refresh, which refreshes the neighbours of an aggressor row when it encounters a specified number of activations. Unfortunately, complex attack patterns like Half-Double break victim-refresh, rendering current systems vulnerable. Instead, recently proposed secure Rowhammer mitigations perform mitigative action on the aggressor rather than the victims. Such schemes employ mitigative actions such as row-migration or access-control and include AQUA, SRS, and Blockhammer. While these schemes incur only modest slowdowns at Rowhammer thresholds of a few thousand, they incur prohibitive slowdowns (15%-600%) for lower thresholds that are likely in the near future. The goal of our paper is to make secure Rowhammer mitigations practical at such low thresholds. Our paper provides the key insight that benign applications encounter thousands of hot rows (receiving more activations than the threshold) due to the memory mapping, which places spatially proximate lines in the same row to maximize the row-buffer hit rate. Unfortunately, this causes a row to receive activations for many frequently used lines. We propose Rubix, which breaks the spatial correlation in the line-to-row mapping by using an encrypted address to access the memory, reducing the likelihood of hot rows by 2 to 3 orders of magnitude. To aid row-buffer hits, Rubix randomizes a group of 1-4 lines. We also propose Rubix-D, which dynamically changes the line-to-row mapping. Rubix-D minimizes hot-rows and makes it much harder for an adversary to learn the spatial neighbourhood of a row. Rubix reduces the slowdown of AQUA (from 15% to 1%), SRS (from 60% to 2%), and Blockhammer (from 600% to 3%) while incurring a storage overhead of less than 1 kilobyte.
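
The following toy Python sketch illustrates the flavor of a randomized line-to-row mapping: a keyed Feistel permutation scrambles the line address before it is split into row and column bits, while groups of four consecutive lines stay in the same row to retain row-buffer locality. The cipher, DRAM geometry, and group size are illustrative and not Rubix's actual design.

# Toy illustration only: encrypt the line address with a keyed permutation
# before splitting it into (row, column), so spatially adjacent lines land in
# unrelated rows. GROUP consecutive lines stay together in one row.
import hashlib

ROW_BITS, COL_BITS = 14, 8          # 2^14 rows, 256 lines per row (illustrative)
GROUP = 4                           # lines randomized as a group of 4
KEY = b"per-boot secret key"

def feistel(x, bits, rounds=4):
    """Small keyed Feistel permutation over `bits`-bit values (bits even)."""
    half = bits // 2
    mask = (1 << half) - 1
    l, r = x >> half, x & mask
    for i in range(rounds):
        digest = hashlib.blake2b(KEY + bytes([i]) + r.to_bytes(4, "little"),
                                 digest_size=4).digest()
        l, r = r, l ^ (int.from_bytes(digest, "little") & mask)
    return (l << half) | r

def line_to_row_col(line_addr):
    group = line_addr // GROUP                   # randomize at group granularity
    offset = line_addr % GROUP
    scrambled = feistel(group, ROW_BITS + COL_BITS - 2)
    row = scrambled >> (COL_BITS - 2)
    col = (scrambled & ((1 << (COL_BITS - 2)) - 1)) * GROUP + offset
    return row, col

# Adjacent groups of lines map to unrelated rows; lines within a group share one.
for line in range(12):
    print(line, line_to_row_col(line))
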
12:15 PDT – 12:30 PDT

TAROT: A CXL SmartNIC-Based Defense Against Multi-bit Errors by Row-Hammer Attacks
Chihun Song (University of Illinois Urbana-Champaign); Michael Jaemin Kim (Seoul National University); Tianchen Wang, Houxiang Ji, Jinghan Huang, and Ipoom Jeong (University of Illinois Urbana-Champaign); Jaehyun Park, Hwayong Nam, Minbok Wi, and Jung Ho Ahn (Seoul National University); Nam Sung Kim (University of Illinois Urbana-Champaign)

Abstract: Row Hammer (RH) has been demonstrated as a security vulnerability in modern systems. Although commodity CPUs can handle RH-induced single-bit errors in DRAM through ECC, RH can still give rise to multi-bit uncorrectable errors (UEs) and crash the systems. Meanwhile, recent work has indicated that the DRAM cells vulnerable to RH are determined by manufacturing imperfections and resulting defects. Taking one step further from the recent work, we first conduct RH experiments on contemporary DRAM modules for 3 weeks. This demonstrates that RH-induced UEs occur only at specific DRAM addresses (RH-UE-vulnerable addresses) and the percentage of such addresses is small in these DRAM modules. Second, to protect the systems from RH-induced UEs, we propose two RH defense solutions: H-TAROT and S-TAROT (TArgeted ROw-Hammer Therapy). H-TAROT is a software-based solution running on the host CPU. It obtains RH-UE-vulnerable addresses during the system boot and then periodically accesses such addresses before UEs may occur. Since it accesses only a small percentage of addresses, it does not incur a notable performance penalty for throughput applications (e.g., a 1.5% increase in execution time of the SPECrate 2017 benchmark suite running on a system even with 128GB of DRAM). Yet, it imposes a significant performance penalty on latency-sensitive applications (e.g., a 28.2% increase in tail latency of Redis). To minimize the performance penalty, for a system with a SmartNIC (SNIC), S-TAROT offloads H-TAROT from the host CPU to the SNIC CPU. Our experiment shows that S-TAROT increases the execution time and tail latency of the SPECrate 2017 benchmark suite and Redis by only 0.1% and 1.0%, respectively.
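
The host-side "therapy" loop can be sketched in a few lines. This is a toy stand-in: a bytearray models DRAM, the vulnerable addresses are hypothetical, and the real system discovers them via boot-time RH testing and can run the loop on the SmartNIC CPU instead of the host.

# Minimal sketch: periodically touch the small set of RH-UE-vulnerable rows so
# they are re-activated well before errors can build up.
import threading
import time

memory = bytearray(1 << 20)                      # stand-in for host DRAM
vulnerable_offsets = [0x1040, 0x8F00, 0xC3B0]    # hypothetical vulnerable rows
REFRESH_PERIOD_S = 0.064                         # touch them every "refresh window"

def therapy_loop(stop):
    while not stop.is_set():
        for off in vulnerable_offsets:
            _ = memory[off]                      # a read suffices to re-activate the row
        stop.wait(REFRESH_PERIOD_S)

stop = threading.Event()
t = threading.Thread(target=therapy_loop, args=(stop,), daemon=True)
t.start()
time.sleep(0.5)                                  # ... application runs here ...
stop.set()
t.join()
print("touched", len(vulnerable_offsets), "vulnerable rows every",
      REFRESH_PERIOD_S * 1000, "ms")
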
12:30 PDT – 12:45 PDT

Pythia: Compiler-Guided Defense Against Non-Control Data Attacks
Sharjeel Khan, Bodhisatwa Chatterjee, and Santosh Pande (Georgia Tech)

Abstract: Modern C/C++ applications are susceptible to Non-Control Data Attacks, where an adversary attempts to exploit memory corruption vulnerabilities for security breaches such as privilege escalation, control-flow manipulation, etc. One such popular class of non-control data attacks is Control-flow Bending, where the attacker manipulates the program data to flip branch outcomes, and divert the program control flow into alternative paths to gain privileges. Unfortunately, despite tremendous advancements in software security, state-of-the-art defense mechanisms such as Control-flow Integrity (CFI) are ineffective against control-flow bending attacks, especially those involving flipping of branch predicates. In this work, we propose a performance-aware defensive scheme, Pythia, which utilizes ARM’s pointer authentication (PA) hardware support for dynamic integrity checking of forward slices of vulnerable program variables that can be affected by input channels, and backward slices of branch variables (including pointers) that may be misused by the attacker. We first show that a naive scheme of protecting all vulnerabilities can suffer from an average runtime overhead of 47.88%. We then demonstrate how overheads can be minimized by selectively adding ARM PA-based canaries for statically-allocated program variables and creating secure sections for dynamically-allocated variables to avoid overflows by input channels. Our experiments show that employing this hybrid approach results in an average overhead of 13.07%. We empirically show that Pythia offers better security than the state-of-the-art data flow integrity (DFI) technique, especially in pointer-intensive code and C++ applications, with 92% of branches secured on average and 100% secured in the case of 3 applications.
12:45 PDT – 13:00 PDT

Everywhere All at Once: Co-Location Attacks on Public Cloud FaaS
Zirui Neil Zhao (University of Illinois Urbana-Champaign); Adam Morrison (Tel Aviv University); Christopher W. Fletcher and Josep Torrellas (University of Illinois Urbana-Champaign)

Abstract: Microarchitectural side-channel attacks exploit shared hardware resources, posing significant threats to modern systems. A pivotal step in these attacks is achieving physical host colocation between attacker and victim. This step is especially challenging in public cloud environments due to the widespread adoption of the virtual private cloud (VPC) and the ever-growing size of the data centers. Furthermore, the shift towards Function-as-a-Service (FaaS) environments, characterized by dynamic function instance placements and limited control for attackers, compounds this challenge. In this paper, we present the first comprehensive study on risks of and techniques for co-location attacks in public cloud FaaS environments. We develop two physical host fingerprinting techniques and propose a new, inexpensive methodology for large-scale instance co-location verification. Using these techniques, we analyze how Google Cloud Run places function instances on physical hosts and identify exploitable placement behaviors. Leveraging our findings, we devise an effective strategy for instance launching that achieves 100% probability of co-locating the attacker with at least one victim instance. Moreover, the attacker co-locates with 61%–100% of victim instances in three major Cloud Run data centers.


Session 2A: Binary Analysis
(Location: Grande A/B)
14:30 PDT – 14:45 PDT

Plankton: Reconciling Binary Code and Debug Information
Anshunkang Zhou and Chengfeng Ye (The Hong Kong University of Science and Technology); Heqing Huang (City University of Hong Kong); Yuandao Cai and Charles Zhang (The Hong Kong University of Science and Technology)

Abstract: Static analysis has been widely used in large-scale software defect detection. Despite recent advances, it is still not practical enough because it requires compilation interference to obtain analyzable code. Directly translating the binary code using a binary lifter mitigates this practicality problem by being non-intrusive to the building system. However, existing binary lifters cannot produce precise enough code for rigorous static analysis even in the presence of the debug information. In this paper, we propose a new binary lifter Plankton together with two new algorithms that can fill the gaps between the low- and high-level code to produce high-quality LLVM intermediate representations (IRs) from binaries with debug information, enabling full-fledged static analysis with minor precision loss. Plankton shows comparable static analysis results with traditional compilation interference solutions, producing only 17.2% differences while being much more practical, outperforming existing lifters by 76.9% on average.
14:45 PDT – 15:00 PDT

What You Trace is What You Get: Dynamic Stack-Layout Recovery for Binary Recompilation
Fabian Parzefall, Chinmay Deshpande, Felicitas Hetzelt, and Michael Franz (University of California Irvine)

Abstract: Users of proprietary and/or legacy programs without vendor support are denied the significant advances in compiler technologies of the past decades. Adapting these technologies to operate directly on binaries without source code is often infeasible. Binary recompilers attempt to bridge this gap by “lifting” binary executables to compiler-level intermediate representations (IR) and “lowering” them back down to executable form, enabling application of the full range of analyses and transformations available in modern compiler infrastructures. Past approaches could not recover local variables in lifted programs with sufficient precision, which is a necessary prerequisite for many compiler-related applications, including performance optimization. They have relied on heuristics failing on certain input programs, or on conservative over-approximations yielding imprecise results. In this paper, we present a novel approach, WYTIWYG, to recover function-local variables within lifted binaries. Our approach is fully automated and preserves functionality for user-provided inputs. This is accomplished by decomposing the recovery of local variables into a series of instrumentation-based dynamic binary analyses. We conduct an extensive set of careful evaluations on the SPECint 2006 benchmark suite, including direct comparisons with two previously published state-of-the-art binary recompilers. Our approach recompiles fully optimized commercial off-the-shelf binaries compiled with the latest compilers. Using performance of recompiled binaries as an indicator of IR-quality, our approach significantly outperforms similar recompilers by 1.18× on average. Furthermore, WYTIWYG accelerates legacy binaries optimized by older compilers by an astounding 1.22×.
15:00 PDT – 15:15 PDT

Accurate Disassembly of Complex Binaries Without Use of Compiler Metadata
Soumyakant Priyadarshan, Huan Nguyen, and R. Sekar (Stony Brook University)

Abstract: Accurate disassembly of stripped binaries is the first and foremost step in binary analysis, instrumentation and reverse engineering. Complex instruction sets such as the x86 pose major challenges in this context because it is very difficult to distinguish between code and embedded data. To make progress, many recent approaches have either made optimistic assumptions (e.g., absence of embedded data) or relied on additional compiler-generated metadata (e.g., relocation info and/or exception handling metadata). Unfortunately, many complex binaries do contain embedded data, while lacking the additional metadata needed by these techniques. We therefore present a novel approach for accurate disassembly that uses statistical properties of data to detect code, and behavioral properties of code to flag data. We present new static analysis and data-driven probabilistic techniques that are then combined using a prioritized error correction algorithm to achieve results that are 3× to 4× more accurate than the best previous results.
15:15 PDT – 15:30 PDT

FITS: Inferring Intermediate Taint Sources for Effective Vulnerability Analysis of IoT Device Firmware
Puzhuo Liu (Chinese Academy of Sciences); Yaowen Zheng (Nanyang Technological University); Chengnian Sun (University of Waterloo); Chuan Qin, Dongliang Fang, Mingdong Liu, and Limin Sun (Chinese Academy of Sciences)

Abstract: Finding vulnerabilities in firmware is vital as any firmware vulnerability may lead to cyberattacks on the physical IoT devices. Taint analysis is one promising technique for finding firmware vulnerabilities thanks to its high coverage and scalability. However, sizable closed-source firmware makes it extremely difficult to analyze the complete data-flow paths from taint sources (i.e., interface library functions such as recv) to sinks. We observe that certain custom functions in binaries can be used as intermediate taint sources (ITSs). Compared to interface library functions, using custom functions as taint sources can significantly shorten the data-flow paths for analysis. However, inferring ITSs is challenging due to the complexity and customization of firmware. Moreover, the debugging information and symbol table of binaries in firmware are stripped; therefore, prior techniques for inferring taint sources are not applicable, leaving only laborious manual analysis. To this end, this paper proposes FITS to automatically infer ITSs. Specifically, FITS represents each function with a novel behavioral feature representation that captures the static and dynamic properties of the function, and ranks custom functions as taint sources through behavioral clustering and similarity scoring. We evaluated FITS on 59 large, real-world firmware samples. The inference results of FITS are accurate: at least one of the top-3 ranked custom functions can be used as an ITS with 89% precision. ITSs helped Karonte find 15 more bugs and helped the static taint engine find 339 more bugs. More importantly, 21 bugs have been awarded CVE IDs and rated high severity with media coverage.

Session 2B: Side Channels
(Location: Grande C)
14:30 PDT – 14:45 PDT

Avoiding Instruction-Centric Microarchitectural Timing Channels Via Binary-Code Transformations
Michael Flanders, Reshabh K Sharma, Alexandra E. Michael, Dan Grossman, and David Kohlbrenner (University of Washington)

Abstract: With the end of Moore’s Law-based scaling, novel microarchitectural optimizations are being patented, researched, and implemented at an increasing rate. Previous research has examined recently published patents and papers and demonstrated ways these upcoming optimizations present new security risks via novel side channels. As these side channels are introduced by microarchitectural optimization, they are not generically solvable in source code. In this paper, we build program analysis and transformation tools for automatically mitigating the security risks introduced by future instruction-centric microarchitectural optimizations. We focus on two classes of optimizations that are not yet deployed: silent stores and computation simplification. Silent stores are known to leak secret data being written to memory by dropping in-flight stores that will have no effect. Computation simplification is known to leak operands to arithmetic instructions by shortcutting trivial computations at execution time. This presents problems that classical constant-time techniques cannot handle: register spills, address calculations, and the micro-ops of complex instructions are all potentially leaky. To address these problems, we design, implement, and evaluate a process and tool, cio, for detecting and mitigating these types of side channels in cryptographic code. cio is a backstop, providing verified mitigation for novel microarchitectural side-channels when more specialized and efficient hardware or software tools, such as microcode patches, are not yet available.
14:45 PDT – 15:00 PDT

Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud
Zirui Neil Zhao (University of Illinois Urbana-Champaign); Adam Morrison (Tel Aviv University); Christopher W. Fletcher and Josep Torrellas (University of Illinois Urbana-Champaign)

Abstract: Last-level cache side-channel attacks have been mostly demonstrated in highly-controlled, quiescent local environments. Hence, it is unclear whether such attacks are feasible in a production cloud environment. In the cloud, side channels are flooded with noise from activities of other tenants and, in Function-as-a-Service (FaaS) workloads, the attacker has a very limited time window to mount the attack. In this paper, we show that such attacks are feasible in practice, although they require new techniques. We present an end-to-end, cross-tenant attack on a vulnerable ECDSA implementation in the public FaaS Google Cloud Run environment. We introduce several new techniques to improve every step of the attack. First, to speed-up the generation of eviction sets, we introduce L2-driven candidate address filtering and a Binary Search-based algorithm for address pruning. Second, to monitor victim memory accesses with high time resolution, we introduce Parallel Probing. Finally, we leverage power spectral density from signal processing to easily identify the victim’s target cache set in the frequency domain. Overall, using these mechanisms, we extract a median value of 81% of the secret ECDSA nonce bits from a victim container in 19 seconds on average.
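
The group-testing flavor of eviction-set pruning can be sketched as follows. This is a simplified illustration in the spirit of the paper's binary-search-based address pruning rather than its exact algorithm, and the eviction test is simulated instead of timed on real hardware.

# Simplified sketch of eviction-set pruning: repeatedly discard a whole group
# of candidate addresses whose removal still leaves an evicting set.
import random

ASSOCIATIVITY = 8
random.seed(1)

# Simulated ground truth: which candidates actually map to the victim's cache set.
candidates = list(range(4000))
congruent = set(random.sample(candidates, 40))

def evicts(addr_subset):
    """True if the subset contains a full eviction set (simulated timing test)."""
    return len(congruent & set(addr_subset)) >= ASSOCIATIVITY

def prune(cands):
    """Shrink a working set while it still evicts, discarding groups wholesale."""
    while len(cands) > ASSOCIATIVITY:
        groups = [cands[i::ASSOCIATIVITY + 1] for i in range(ASSOCIATIVITY + 1)]
        for g in groups:
            rest = [a for a in cands if a not in set(g)]
            if evicts(rest):          # this whole group was unnecessary
                cands = rest
                break
        else:
            break                     # every group is needed; cannot shrink further
    return cands

minimal = prune(candidates)
print(len(minimal), "addresses;", "still evicts:", evicts(minimal))
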
15:00 PDT – 15:15 PDT

Pathfinder: High-Resolution Control-Flow Attacks Exploiting the Conditional Branch Predictor
Hosein Yavarzadeh and Archit Agarwal (University of California San Diego); Max Christman (University of North Carolina at Chapel Hill); Christina Garman (Purdue University); Daniel Genkin (Georgia Tech); Andrew Kwong (University of North Carolina at Chapel Hill); Daniel Moghimi (Google); Deian Stefan (University of California San Diego); Kazem Taram (Purdue University); Dean Tullsen (University of California San Diego)

Abstract: This paper introduces novel attack primitives that enable adversaries to leak (read) and manipulate (write) the path history register (PHR) and the prediction history tables (PHTs) of the conditional branch predictor in high-performance CPUs. These primitives enable two new classes of attacks: first, adversaries can recover the entire control-flow history of a victim program by exploiting the read primitives, as demonstrated by a practical secret-image recovery based on capturing the entire control flow of libjpeg routines. Second, adversaries can launch extremely high-resolution transient attacks by exploiting the write primitives. We demonstrate this with a key recovery attack against AES based on extracting intermediate values.
15:15 PDT – 15:30 PDT

Pentimento: Data Remanence in Cloud FPGAs
Colin Drewes (Stanford University); Olivia Weng and Andres Meza (University of California San Diego); Alric Althoff (ARM); David Kohlbrenner (University of Washington); Ryan Kastner (University of California San Diego); Dustin Richmond (University of California Santa Cruz)

Abstract: Remote attackers can recover “FPGA pentimento” – long-removed data belonging to a prior user or proprietary design image on a cloud FPGA. Just as a pentimento of a painting can be exposed by infrared imaging, FPGA pentimentos can be exposed by signal timing sensors. The data constituting an FPGA pentimento is imprinted on the device through bias temperature instability effects on the underlying transistors. Measuring this degradation using a time-to-digital converter allows an attacker to (1) extract proprietary details or keys from an encrypted FPGA design image available on the AWS marketplace and (2) recover information from a previous user of a cloud-FPGA. These threat models are validated on AWS F1, with successful AES key recovery under one model.
Session 2C: Memory Optimizations
(Location: Grande D/E)
14:30 PDT – 14:45 PDT

Kimbap: A Node-Property Map System for Distributed Graph Analytics
Hochan Lee (University of Texas at Austin); Roshan Dathathri (Microsoft Research); Keshav Pingali (University of Texas at Austin)

Abstract: Most distributed graph analytics systems such as Gemini, Gluon, and SympleGraph support a computational model in which node properties are updated iteratively using properties of adjacent neighbors of those nodes. However, there are many algorithms that cannot be expressed in this model, such as the Louvain algorithm for community detection and the Shiloach-Vishkin algorithm for connected components. These algorithms may be more efficient or may produce better quality output than simpler algorithms that can be expressed using updates only from adjacent vertices. This paper describes Kimbap, a distributed graph analytics programming framework, and its high-performance implementation that addresses this problem. Kimbap supports general vertex-centric algorithms by permitting the computation at a node to read and write properties of any node in the graph, not just its adjacent neighbors. The programming model allows programmers to specify iterative graph analytics applications, while the Kimbap compiler automatically generates the required communication code, and the Kimbap runtime organizes and synchronizes node-property pairs across the distributed-memory machines. The underlying system uses a distributed node-property map that is optimized for highly concurrent sparse reductions by using a graph-partition-aware sparse representation and by avoiding thread conflicts, thereby eliminating a major bottleneck that throttles performance in systems like Pregel that also support general vertex programs. Our experiments on CPU clusters with up to 256 machines (roughly 12000 threads total) show that (1) Louvain clustering algorithm in Kimbap is on average 4× faster than the state-of-the-art hand-optimized implementation for the same algorithm and (2) Kimbap matches or outperforms the state-of-the-art distributed graph analytics system for algorithms that can be expressed in both systems.
14:45 PDT – 15:00 PDT

TrackFM: Far-out Compiler Support for a Far Memory World
Brian R. Tauro (Illinois Institute of Technology); Brian Suchy, Simone Campanoni, and Peter Dinda (Northwestern University); Kyle C. Hale (Illinois Institute of Technology)

Abstract: Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.
15:00 PDT – 15:15 PDT

Scaling Up Memory Disaggregated Applications with SMART
Feng Ren, Mingxing Zhang, and Kang Chen (Tsinghua University); Huaxia Xia (Meituan); Zuoning Chen (Chinese Academy of Engineering); Yongwei Wu (Tsinghua University)

Abstract: Recent developments in RDMA networks are leading to the trend of memory disaggregation. However, the performance of each compute node is still limited by the network, especially when it needs to perform a large number of concurrent fine-grained remote accesses. According to our evaluations, existing IOPS-bound disaggregated applications do not scale well beyond 32 cores, and therefore do not take full advantage of today’s many-core machines. After an in-depth analysis of the internal architecture of RNIC, we found three major scale-up bottlenecks that limit the throughput of today’s disaggregated applications: (1) implicit contention of doorbell registers, (2) cache thrashing caused by excessive outstanding work requests, and (3) wasted IOPS from unsuccessful CAS retries. However, the solutions to these problems involve many low-level details that are not familiar to application developers. To ease the burden on developers, we propose Smart, an RDMA programming framework that hides the above details by providing an interface similar to one-sided RDMA verbs. We take 44 and 16 lines of code to refactor the state-of-the-art disaggregated hash table (RACE) and persistent transaction processing system (FORD) with Smart, improving their throughput by up to 132.4× and 5.2×, respectively. We have also refactored Sherman (a recent disaggregated B+ Tree) with Smart and an additional speculative lookup optimization (48 lines of code changed), which changes its memory access pattern from bandwidth-bound to IOPS-bound and leads to a speedup of 2.0×. Smart is publicly available at https://github.com/madsys-dev/smart.
15:15 PDT – 15:30 PDT

CC-NIC: a Cache-Coherent Interface to the NIC
Henry N. Schuh and Arvind Krishnamurthy (Google and University of Washington); David Culler (Google); Henry M. Levy (Google and University of Washington); Luigi Rizzo (Google); Samira Khan (Google and University of Virginia); Brent E. Stephens (Google and University of Utah)

Abstract: Emerging interconnects make peripherals, such as the network interface controller (NIC), accessible through the processor’s cache hierarchy, allowing these devices to participate in the CPU cache coherence protocol. This is a fundamental change from the separate I/O data paths and read-write transaction primitives of today’s PCIe NICs. Our experiments show that the I/O data path characteristics cause NICs to prioritize CPU efficiency at the expense of inflated latency, an issue that can be mitigated by the emerging low-latency coherent interconnects. But, the coherence abstraction is not suited to current host-NIC access patterns. Applying existing signaling mechanisms and data structure layouts in a cache-coherent setting results in extraneous communication and cache retention, limiting performance. Redesigning the interface is necessary to minimize overheads and benefit from the new interactions coherence enables. This work contributes CC-NIC, a host-NIC interface design for coherent interconnects. We model CC-NIC using Intel’s Ice Lake and Sapphire Rapids UPI interconnects, demonstrating the potential of optimizing for coherence. Our results show a maximum packet rate of 1.5Gpps and 980Gbps packet throughput. CC-NIC has 77% lower minimum latency, and 88% lower at 80% load, than today’s PCIe NICs. We also demonstrate application-level core savings. Finally, we show that CC-NIC’s benefits hold across a range of interconnect performance characteristics.
Session 2D: ML Inference Systems
(Location: Scripps I/II)
14:30 PDT – 14:45 PDT

SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, and Zeyu Wang (Carnegie Mellon University); Zhengxin Zhang (Tsinghua University); Rae Ying Yee Wong (Stanford University); Alan Zhu and Lijie Yang (Carnegie Mellon University); Xiaoxiang Shi (Shanghai Jiao Tong University); Chunan Shi (Peking University); Zhuoming Chen and Daiyaan Arfeen (Carnegie Mellon University); Reyna Abhyankar (University of California San Diego); Zhihao Jia (Carnegie Mellon University)

Abstract: This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8× for distributed LLM inference and by 2.6-3.5× for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/
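
The verification step can be illustrated with a toy greedy sketch: walk the speculated token tree, accept each branch the large model agrees with, and emit one token from the large model where speculation first diverges. Real SpecInfer batches all tree nodes through the LLM in a single tree-parallel decoding pass and supports stochastic sampling; both "models" below are trivial stand-ins, with tree values chosen so that part of the tree is accepted.

# Toy sketch of token-tree verification (greedy variant, simplified).
def llm_next(prefix):
    """Stand-in for the large model's greedy next token."""
    return (sum(prefix) * 31 + 7) % 50

def tree_verify(prefix, tree):
    """tree: {token: subtree}. Accept the longest speculated path the LLM agrees
    with, plus one token from the LLM itself where speculation first diverges."""
    accepted = []
    node = tree
    while True:
        want = llm_next(prefix + accepted)
        if want in node:              # a speculated branch matches -> accept, descend
            accepted.append(want)
            node = node[want]
        else:                         # mismatch (or leaf): emit the LLM's token and stop
            accepted.append(want)
            return accepted

# A speculative token tree from the draft models, with two candidate branches.
prefix = [3, 14, 15]
tree = {49: {18: {}, 3: {11: {}}}, 12: {}}
print(tree_verify(prefix, tree))      # -> [49, 18, 26]
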
14:45 PDT – 15:00 PDT

ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, and Junyeol Lee (Hanyang University); Du-seong Chang (KT Corporation); Jiwon Seo (Hanyang University)

Abstract: This paper presents ExeGPT, a distributed system designed for constraint-aware LLM inference. ExeGPT finds and runs with an optimal execution schedule to maximize inference throughput while satisfying a given latency constraint. By leveraging the distribution of input and output sequences, it effectively allocates resources and determines optimal execution configurations, including batch sizes and partial tensor parallelism. We also introduce two scheduling strategies based on Round-Robin Allocation and Workload-Aware Allocation policies, suitable for different NLP workloads. We evaluate ExeGPT on six LLM instances of T5, OPT, and GPT-3 and five NLP tasks, each with four distinct latency constraints. Compared to FasterTransformer, ExeGPT achieves up to 15.2× improvements in throughput and 6× improvements in latency. Overall, ExeGPT achieves an average throughput gain of 2.9× across twenty evaluation scenarios. Moreover, when adapting to changing sequence distributions, the cost of adjusting the schedule in ExeGPT is reasonably modest. ExeGPT proves to be an effective solution for optimizing and executing LLM inference for diverse NLP workloads and serving conditions.
15:00 PDT – 15:15 PDT

Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
Sohaib Ahmad and Hui Guan (University of Massachusetts Amherst); Brian D. Friedman and Thomas Williams (Nokia Bell Labs); Ramesh K. Sitaraman (University of Massachusetts Amherst); Thomas Woo (Nokia Bell Labs)

Abstract: Existing machine learning inference-serving systems largely rely on hardware scaling by adding more devices or using more powerful accelerators to handle increasing query demands. However, hardware scaling might not be feasible for fixed-size edge clusters or private clouds due to their limited hardware resources. A viable alternate solution is accuracy scaling, which adapts the accuracy of ML models instead of hardware resources to handle varying query demands. This work studies the design of a high-throughput inference-serving system with accuracy scaling that can meet throughput requirements while maximizing accuracy. To achieve the goal, this work proposes to identify the right amount of accuracy scaling by jointly optimizing three sub-problems: how to select model variants, how to place them on heterogeneous devices, and how to assign query workloads to each device. It also proposes a new adaptive batching algorithm to handle variations in query arrival times and minimize SLO violations. Based on the proposed techniques, we build an inference-serving system called Proteus and empirically evaluate it on real-world and synthetic traces. We show that Proteus reduces accuracy drop by up to 3× and latency timeouts by 2-10× with respect to baseline schemes, while meeting throughput requirements.
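
A greedy caricature of accuracy scaling, with invented model variants and numbers (Proteus solves the joint variant-selection, placement, and workload-assignment problem rather than this simple heuristic): start every device on the most accurate variant, then downgrade whichever device buys the most throughput per unit of accuracy lost until the demand is met.

# Sketch of accuracy scaling as a simple greedy heuristic (illustrative only).
VARIANTS = [                      # (name, accuracy, queries/s on one device)
    ("resnet152", 0.78, 100),
    ("resnet50",  0.76, 220),
    ("resnet18",  0.70, 450),
]
N_DEVICES = 4
DEMAND_QPS = 1200

assignment = [0] * N_DEVICES      # index into VARIANTS, 0 = most accurate

def total_qps():
    return sum(VARIANTS[v][2] for v in assignment)

while total_qps() < DEMAND_QPS:
    best = None
    for dev, v in enumerate(assignment):
        if v + 1 >= len(VARIANTS):
            continue
        acc_loss = VARIANTS[v][1] - VARIANTS[v + 1][1]
        qps_gain = VARIANTS[v + 1][2] - VARIANTS[v][2]
        score = qps_gain / acc_loss            # throughput gained per accuracy lost
        if best is None or score > best[0]:
            best = (score, dev)
    if best is None:
        raise RuntimeError("demand cannot be met even with the cheapest variants")
    assignment[best[1]] += 1

served_acc = sum(VARIANTS[v][1] * VARIANTS[v][2] for v in assignment) / total_qps()
print([VARIANTS[v][0] for v in assignment], total_qps(), "qps,",
      f"throughput-weighted accuracy {served_acc:.3f}")
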
15:15 PDT – 15:30 PDT

SpotServe: Serving Generative Large Language Models on Preemptible Instances
Xupeng Miao (Carnegie Mellon University); Chunan Shi (Peking University); Jiangfei Duan (The Chinese University of Hong Kong); Xiaoli Xi (Carnegie Mellon University); Dahua Lin (Chinese University of Hong Kong and Sensetime Research); Bin Cui (Peking University); Zhihao Jia (Carnegie Mellon University)

Abstract: The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary cost for serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer accesses to spare GPU resources at a much cheaper price than regular instances but may be preempted by the cloud provider at any time. Serving LLMs on preemptible instances requires addressing challenges induced by frequent instance preemptions and the necessity of migrating instances to handle the preemptions. This paper presents SpotServe, the first distributed LLM serving system on preemptible instances. Several key techniques of SpotServe realize fast and reliable serving of generative LLMs on cheap preemptible instances. First, SpotServe dynamically adapts the LLM parallelization configuration for dynamic instance availability and fluctuating workload, while balancing the trade-off among the overall throughput, inference latency and monetary costs. Second, to minimize the cost of migrating instances for dynamic reparallelization, the task of migrating instances is formulated as a bipartite graph matching problem in SpotServe, which uses the Kuhn-Munkres algorithm to identify an optimal migration plan that minimizes communication cost. Finally, to take advantage of the grace period offered by modern cloud platforms, we introduce stateful inference recovery, a new inference mechanism that commits inference progress at a much finer granularity and allows SpotServe to cheaply resume inference upon preemption. We evaluate SpotServe on real spot instance preemption traces and various popular LLMs and show that SpotServe can reduce the P99 tail latency by 2.4 – 9.1× compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances. The code is publicly available at: https://github.com/Hsword/SpotServe.
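
The migration-matching step can be sketched with an off-the-shelf Hungarian-method (Kuhn-Munkres) solver: assign positions in the new parallel configuration to surviving instances so that the state that must move is minimized. The cost matrix below is invented; SpotServe derives it from the parameter shards and KV caches each instance already holds.

# Minimal sketch of the migration-matching step via the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[p][i] = GB that instance i would have to fetch to serve position p.
cost = np.array([
    [0.0, 6.0, 6.0, 6.0],   # position 0's shard already lives on instance 0
    [6.0, 0.5, 6.0, 6.0],   # instance 1 holds most of position 1's state
    [6.0, 6.0, 6.0, 0.0],   # position 2's shard lives on instance 3
])                           # only 3 positions remain after a preemption

positions, instances = linear_sum_assignment(cost)
plan = {int(p): int(i) for p, i in zip(positions, instances)}
print("migration plan:", plan,
      "| total data moved:", cost[positions, instances].sum(), "GB")
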
Session 3A: Dynamic Analysis and Instrumentation
(Location: Grande A/B)
16:00 PDT – 16:15 PDT

Flexible Non-intrusive Dynamic Instrumentation for WebAssembly
Ben L. Titzer, Elizabeth Gilbert, Bradley Wei Jie Teo, Yash Anand, Kazuyuki Takayama, and Heather Miller (Carnegie Mellon University)

Abstract: A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging, are all made easier in a virtual setting. As a portable, low-level bytecode, WebAssembly offers inexpensive in-process sandboxing with high performance. Yet to date, Wasm engines have not offered much insight into executing programs, supporting at best bytecode-level stepping and basic source maps, but no instrumentation capabilities. In this paper, we show the first non-intrusive dynamic instrumentation system for WebAssembly in the open-source Wizard Research Engine. Our innovative design offers a flexible, complete hierarchy of instrumentation primitives that support building high-level, complex analyses in terms of low-level, programmable probes. In contrast to emulation or machine code instrumentation, injecting probes at the bytecode level increases expressiveness and vastly simplifies the implementation by reusing the engine’s JIT compiler, interpreter, and deoptimization mechanism rather than building new ones. Wizard supports both dynamic instrumentation insertion and removal while providing consistency guarantees, which is key to composing multiple analyses without interference. We detail a fully-featured implementation in a high-performance multi-tier Wasm engine (https://github.com/titzer/wizard-engine), show novel optimizations specifically designed to minimize instrumentation overhead, and evaluate performance characteristics under load from various analyses. This design is well-suited for production engine adoption as probes can be implemented to have no impact on production performance when not in use.
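
The probe concept itself is easy to illustrate outside of Wasm; the toy interpreter below attaches and removes callbacks at instruction offsets without rewriting the program. It is only a conceptual stand-in for Wizard's probe API, which operates on Wasm bytecode inside the engine.

# Toy stand-in: dynamically insertable/removable probes on a tiny stack interpreter.
class TinyVM:
    def __init__(self, code):
        self.code = code                 # list of (opcode, arg)
        self.probes = {}                 # offset -> list of callbacks

    def insert_probe(self, offset, fn):
        self.probes.setdefault(offset, []).append(fn)

    def remove_probe(self, offset, fn):
        self.probes[offset].remove(fn)

    def run(self):
        stack, pc = [], 0
        while pc < len(self.code):
            for fn in self.probes.get(pc, ()):   # fire probes before the instruction
                fn(pc, list(stack))
            op, arg = self.code[pc]
            if op == "push":
                stack.append(arg)
            elif op == "add":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            pc += 1
        return stack

# Count executions of the instruction at offset 2 (a tiny "execution frequency" analysis).
counts = {}
def counter(pc, _stack):
    counts[pc] = counts.get(pc, 0) + 1

vm = TinyVM([("push", 2), ("push", 3), ("add", None)])
vm.insert_probe(2, counter)
print(vm.run(), counts)      # -> [5] {2: 1}
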
16:15 PDT – 16:30 PDT

ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices
Ye Li, Jian Tan, Bin Wu, Xiao He, and Feifei Li (Alibaba)

Abstract: Years of experience operating large-scale microservice systems have strengthened our belief that the anomalies inevitably arising in individual components demand a quantification of their influence on end-to-end performance indicators. On a causal graph that represents the complex dependencies between system components, anomalies detected at scattered locations, even when they look similar, can have different implications and call for contrasting remedial actions. To this end, we design ShapleyIQ, an online monitoring and diagnosis service that effectively improves system stability. It is guided by a rigorous analysis of Shapley values on a causal graph. Notably, a new splitting-invariance property addresses, by decomposition, the exponential computational complexity of generic Shapley values. This service has been deployed on a core infrastructure system of Alibaba Cloud for over a year, with more than 15,000 operations for 86 services across 2,546 machines. Since then, it has drastically improved DevOps efficiency, and system failures have been reduced by 83.3%. We also conduct an offline evaluation on the open-source microservice system TrainTicket, pinpointing the root causes of performance issues in hindsight. Extensive experiments and test cases show that our system achieves 97.3% accuracy in identifying the top-1 root causes for these datasets, significantly outperforming baseline algorithms by at least 28.7% in absolute difference.
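For intuition, the Python sketch below computes exact Shapley values for a three-component toy system whose characteristic function maps each subset of anomalous services to an end-to-end latency increase. The services and numbers are made up, and the exhaustive subset enumeration is practical only for tiny examples, which is precisely the exponential blow-up that the splitting-invariance property sidesteps.

from itertools import combinations
from math import factorial

# Toy characteristic function: end-to-end latency increase (ms) caused by
# each subset of anomalous services. All names and numbers are invented.
LATENCY = {
    frozenset(): 0,
    frozenset({"db"}): 80,
    frozenset({"cache"}): 30,
    frozenset({"auth"}): 10,
    frozenset({"db", "cache"}): 90,
    frozenset({"db", "auth"}): 95,
    frozenset({"cache", "auth"}): 35,
    frozenset({"db", "cache", "auth"}): 100,
}
PLAYERS = ["db", "cache", "auth"]
N = len(PLAYERS)

def v(subset):
    return LATENCY[frozenset(subset)]

def shapley(i):
    """Exact Shapley value of component i by enumerating all subsets."""
    others = [p for p in PLAYERS if p != i]
    total = 0.0
    for k in range(len(others) + 1):
        for s in combinations(others, k):
            weight = factorial(k) * factorial(N - k - 1) / factorial(N)
            total += weight * (v(set(s) | {i}) - v(s))
    return total

for p in PLAYERS:
    print(p, "->", round(shapley(p), 2), "ms of end-to-end impact")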
16:30 PDT – 16:45 PDT

Loupe: Driving the Development of OS Compatibility Layers
Hugo Lefeuvre (University of Manchester) ;Gaulthier Gain (University of Liege) ;Vlad-Andrei Bădoiu and Daniel Dinca (University Politehnica of Bucharest) ;Vlad-Radu Schiller (University of Manchester) ;Costin Raiciu (University Politehnica of Bucharest) ;Felipe Huici (Unikraft.io) ;Pierre Olivier (University of Manchester)

Abstract: Supporting mainstream applications is fundamental for a new OS to have impact. It is generally achieved by developing a layer of compatibility allowing applications developed for a mainstream OS like Linux to run unmodified on the new OS. Building such a layer, as we show, results in large engineering inefficiencies due to the lack of efficient methods to precisely measure the OS features required by a set of applications. We propose Loupe, a novel method based on dynamic analysis that determines the OS features that need to be implemented in a prototype OS to bring support for a target set of applications and workloads. Loupe guides and boosts OS developers as they build compatibility layers, prioritizing which features to implement in order to quickly support many applications as early as possible. We apply our methodology to 100+ applications and several OSes currently under development, demonstrating high engineering effort savings vs. existing approaches: for example, for the 62 applications supported by the OSv kernel, we show that using Loupe would have required implementing only 37 system calls vs. 92 for the non-systematic process followed by OSv developers. We study our measurements and extract novel key insights. Overall, we show that the burden of building compatibility layers is significantly less than what previous works suggest: in some cases, only as few as 20% of system calls reported by static analysis, and 50% of those reported by naive dynamic analysis need an implementation for an application to successfully run standard benchmarks.
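A naive version of the underlying measurement can be sketched with strace on Linux: run the application and collect the set of distinct system calls it issues. Loupe goes well beyond this (it also checks which calls can be stubbed or faked under real workloads), so the Python sketch below only illustrates the dynamic-analysis starting point and assumes strace is installed.

import re
import subprocess
import sys
import tempfile

def unique_syscalls(cmd):
    """Run cmd under strace and return the set of distinct syscall names."""
    with tempfile.NamedTemporaryFile(mode="r", suffix=".strace") as log:
        subprocess.run(["strace", "-f", "-qq", "-o", log.name] + cmd, check=False)
        names = set()
        for line in log:
            m = re.match(r"^(?:\d+\s+)?([A-Za-z0-9_]+)\(", line)
            if m:
                names.add(m.group(1))
        return names

if __name__ == "__main__":
    calls = unique_syscalls(sys.argv[1:] or ["/bin/true"])
    print(len(calls), "distinct syscalls:", ", ".join(sorted(calls)))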
16:45 PDT – 17:00 PDT

Amanda: Unified Instrumentation Framework for Deep Neural Networks
Yue Guan, Yuxian Qiu, and Jingwen Leng (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Fan Yang (Microsoft Research) ;Shuo Yu (Shanghai Jiao Tong University) ;Yunxin Liu (Tsinghua University) ;Yu Feng and Yuhao Zhu (University of Rochester) ;Lidong Zhou (Microsoft Research) ;Yun Liang (Peking University) ;Chen Zhang, Chao Li, and Minyi Guo (Shanghai Jiao Tong University)

Abstract: The success of deep neural networks (DNNs) has sparked efforts to analyze (e.g., tracing) and optimize (e.g., pruning) them. These tasks have specific requirements and ad-hoc implementations in current execution backends like TensorFlow/PyTorch, which require developers to manage fragmented interfaces and adapt their code to diverse models. In this study, we propose a new framework called Amanda to streamline the development of these tasks. We formalize the implementation of these tasks as neural network instrumentation, which involves introducing instrumentation into the operator level of DNNs. This allows us to abstract DNN analysis and optimization tasks as instrumentation tools on various DNN models. We build Amanda with two levels of APIs to achieve a unified, extensible, and efficient instrumentation design. The user-level API provides a unified operator-grained instrumentation API for different backends. Meanwhile, internally, we design a set of callback-centric APIs for managing and optimizing the execution of original and instrumentation code in different backends. Through these design principles, the Amanda framework can accommodate a broad spectrum of use cases, such as tracing, profiling, pruning, and quantization, across different backends (e.g., TensorFlow/PyTorch) and execution modes (graph/eager mode). Moreover, our efficient execution management ensures that the performance overhead is typically kept within 5%.
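As a flavor of operator-grained instrumentation in an existing backend, the sketch below traces output shapes with stock PyTorch forward hooks; Amanda’s own unified API differs, so this only illustrates the kind of task the framework abstracts across backends.

import torch
import torch.nn as nn

# Trace the output shape of every operator with stock PyTorch forward hooks.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
trace = []

def probe(module, inputs, output):
    trace.append((module.__class__.__name__, tuple(output.shape)))

handles = [m.register_forward_hook(probe) for m in model]

model(torch.randn(2, 8))
for name, shape in trace:
    print(name, shape)

for h in handles:          # instrumentation can be detached cleanly
    h.remove()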
Session 3B: Security
(Location: Grande C)
16:00 PDT – 16:15 PDT

GIANTSAN: Efficient Memory Sanitization with Segment Folding
Hao Ling (The Hong Kong University of Science and Technology) ;Heqing Huang (City University of Hong Kong) ;Chengpeng Wang, Yuandao Cai, and Charles Zhang (The Hong Kong University of Science and Technology)

Abstract: Memory safety sanitizers, the sharp weapon for detecting invalid memory operations during execution, employ runtime metadata to model the memory and help find memory errors hidden in the programs. However, location-based methods, the most widely deployed memory sanitization methods thanks to their high compatibility, face the low protection density issue: the number of bytes safeguarded by one metadata is limited. As a result, numerous memory accesses require loading excessive metadata, leading to a high runtime overhead. To address this issue, we propose a new shadow encoding with segment folding to increase the protection density. Specifically, we characterize neighboring bytes with identical metadata by building novel summaries, called folded segments, on those bytes to reduce unnecessary metadata loadings. The new encoding uses less metadata to safeguard large memory regions, speeding up memory sanitization. We implement our designed technique as GiantSan. Our evaluation using the SPEC CPU 2017 benchmark shows that GiantSan outperforms the state-of-the-art methods with 59.10% and 38.52% less runtime overhead than ASan and ASan–, respectively. Moreover, under the same redzone setting, GiantSan produces 463 fewer false negatives than ASan and ASan– in testing the real-world project PHP.
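The protection-density argument can be pictured with a toy model: ASan-style shadow memory keeps one metadata byte per 8-byte granule, so a large access touches many metadata entries, whereas a folded summary for an aligned region can answer the same check with a single load. The Python sketch below is purely illustrative and is not GiantSan’s actual shadow encoding.

GRANULE = 8          # bytes protected per shadow byte (ASan-style)
SEGMENT = 4096       # bytes summarized by one folded-segment entry (toy value)

def asan_metadata_loads(offset, size):
    """One shadow byte must be consulted per 8-byte granule touched."""
    return (offset + size - 1) // GRANULE - offset // GRANULE + 1

def folded_metadata_loads(offset, size):
    """A single summary lookup suffices if the access stays inside one
    folded segment known to be fully addressable."""
    if offset // SEGMENT == (offset + size - 1) // SEGMENT:
        return 1
    return asan_metadata_loads(offset, size)   # conservative fallback

for size in (8, 64, 1024):
    print(f"{size:5d}-byte access: {asan_metadata_loads(0, size):4d} vs "
          f"{folded_metadata_loads(0, size)} metadata loads")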
16:15 PDT – 16:30 PDT

Enforcing C/C++ Type and Scope at Runtime for Control-Flow and Data-Flow Integrity
Mohannad Ismail and Christopher Jelesnianski (Virginia Tech) ;Yeongjin Jang (Samsung Research America) ;Changwoo Min (Igalia) ;Wenjie Xiong (Virginia Tech)

Abstract: Control-flow hijacking and data-oriented attacks are becoming more sophisticated. These attacks, especially data-oriented attacks, can result in critical security threats, such as leaking an SSL key. Data-oriented attacks are hard to defend against with acceptable performance due to the sheer amount of data pointers present. The root cause of such attacks is using pointers in unintended ways; fundamentally, these attacks rely on abusing pointers to violate the original scope they were used in or the original types that they were declared as. This paper proposes Scope Type Integrity (STI), a new defense policy that enforces all pointers (both code and data pointers) to conform to the original programmer’s intent, as well as Runtime Scope Type Integrity (RSTI) mechanisms to enforce STI at runtime leveraging ARM Pointer Authentication. STI gathers information about the scope, type, and permissions of pointers. This information is then leveraged by RSTI to ensure pointers are legitimately utilized at runtime. We implemented three defense mechanisms of RSTI, with varying levels of security and performance tradeoffs to showcase the versatility of RSTI. We employ these three variants on a variety of benchmarks and real-world applications for a full security and performance evaluation of these mechanisms. Our results show that they have overheads of 5.29%, 2.97%, and 11.12%, respectively.
16:30 PDT – 16:45 PDT

Lightweight Fault Isolation: Practical, Efficient, and Secure Software Sandboxing
Zachary Yedidia (Stanford University)

Abstract: Software-based fault isolation (SFI) is a longstanding technique that allows isolation of one or more processes from each other with minimal or no use of hardware protection mechanisms. The demand for SFI systems has been increasing due to the advent of cloud and serverless computing, which require systems to run untrusted code with low latency and low context switch times. SFI systems must optimize for a combination of performance, trusted code base (TCB) size, scalability, and implementation complexity. With the rise of ARM64 in both cloud and personal computers, we revisit classic SFI in the context of ARM64 and present a new multi-sandbox SFI scheme that is practical to implement, efficient, and maintains a small TCB. Our technique, called Lightweight Fault Isolation (LFI), supports tens of thousands of 4GiB sandboxes in a single address space and does full software isolation of loads, stores, and jumps with a runtime overhead of 7% on the compatible subset of the SPEC 2017 benchmark suite. In addition to providing low runtime and code size overheads compared to existing multi-sandbox systems, LFI is implemented independently of existing compiler toolchains, has a small static verifier to reduce TCB size, is hardened against basic Spectre attacks, and has broad software support, including for language mechanisms like exceptions and ISA features such as SIMD.
16:45 PDT – 17:00 PDT

FreePart: Hardening Data Processing Software via Framework-based Partitioning and Isolation
Ali Ahad (University of Virginia) ;Gang Wang (University of Illinois at Urbana-Champaign) ;Chung Hwan Kim (University of Texas at Dallas) ;Suman Jana (Columbia University) ;Zhiqiang Lin (Ohio State University) ;Yonghwi Kwon (University of Virginia)

Abstract: Data processing oriented software, especially machine learning applications, is heavily dependent on standard frameworks/libraries such as TensorFlow and OpenCV. As those frameworks have gained significant popularity, the exploitation of vulnerabilities in the frameworks has become a critical security concern. While software isolation can minimize the impact of exploitation, existing approaches suffer from difficulty analyzing complex program dependencies or excessive overhead, making them ineffective in practice. We propose FreePart, a framework-focused software partitioning technique specialized for data processing applications. It is based on an observation that the execution of a data processing application, including data flows and usage of critical data, is closely related to the invocations of framework APIs. Hence, we conduct a temporal partitioning of the host application’s execution based on the invocations of framework APIs and the data objects used by the APIs. By focusing on data accesses at runtime instead of static program code, it provides effective and practical isolation from the perspective of data. Our evaluation on 23 applications using popular frameworks (e.g., OpenCV, Caffe, PyTorch, and TensorFlow) shows that FreePart is effective against all attacks composed of 18 real-world vulnerabilities with a low overhead (3.68%).
Session 3C: ML Cluster Scheduling
(Location: Grande D/E)
16:00 PDT – 16:15 PDT

DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads
Seah Kim (University of California Berkeley) ;Hyoukjun Kwon (University of California Irvine) ;Jinook Song, Jihyuck Jo, Yu-Hsin Chen, Liangzhen Lai, and Vikas Chandra (Meta)

Abstract: Emerging real-time multi-model ML (RTMM) workloads such as AR/VR and drone control involve dynamic behaviors at various granularities: task, model, and layer within a model. Such dynamic behaviors introduce new challenges to the system software in an ML system since the overall system load is not completely predictable, unlike traditional ML workloads. In addition, RTMM workloads require real-time processing, involve highly heterogeneous models, and target resource-constrained devices. Under such circumstances, an effective scheduler becomes ever more important for utilizing the underlying hardware well, given the unique characteristics of RTMM workloads. Therefore, we propose a new scheduler, DREAM, which effectively handles various dynamicity in RTMM workloads targeting multi-accelerator systems. DREAM quantifies the unique requirements for RTMM workloads and utilizes the quantified scores to drive scheduling decisions, considering the current system load and other inference jobs on different models and input frames. DREAM utilizes tunable parameters that provide fast and effective adaptivity to dynamic workload changes. In our evaluation of five RTMM workload scenarios, DREAM reduces the overall UXCost, which is an RTMM-equivalent of the energy-delay product (EDP) defined in the paper, by 32.2% and 50.0% in the geometric mean (up to 80.8% and 97.6%) compared to state-of-the-art baselines, which shows the efficacy of our scheduling methodology.
16:15 PDT – 16:30 PDT

SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers
Daliang Xu (Peking University) ;Mengwei Xu (State Key Laboratory of Networking and Switching Technology) ;Chiheng Lou (Peking University) ;Li Zhang (State Key Laboratory of Networking and Switching Technology) ;Gang Huang, Xin Jin, and Xuanzhe Liu (Peking University)

Abstract: SoC-Cluster, a novel server architecture composed of massive mobile system-on-chips (SoCs), is gaining popularity in industrial edge computing due to its energy efficiency and compatibility with existing mobile applications. However, we observe that the deployed SoC-Cluster servers are not fully utilized, because the hosted workloads are mostly user-triggered and exhibit strong tidal patterns. To harvest the free cycles, we propose to co-locate deep learning tasks on them. We present SoCFlow, the first framework that can efficiently train deep learning models on SoC-Cluster. To deal with the intrinsic inadequacy of commercial SoC-Cluster servers, SoCFlow incorporates two novel techniques: (1) the group-wise parallelism with delayed aggregation that can train deep learning models fast and scalably without being influenced by the network bottleneck; (2) the data-parallel mixed-precision training algorithm that fully unleashes the capability of the heterogeneous processors in mobile SoCs. We have fully implemented SoCFlow and demonstrated its effectiveness through extensive experiments. The experiments show that SoCFlow significantly and consistently outperforms all baselines regarding the training speed while preserving the convergence accuracy, e.g., 1.6×–740× convergence speedup with 32 SoCs. Compared to a commodity GPU (NVIDIA V100) under the same power budget, SoCFlow achieves comparable training speed but reduces energy consumption by 2.31×–10.23× with the same convergence accuracy.
16:30 PDT – 16:45 PDT

Training Job Placement in Clusters with Statistical In-Network Aggregation
Bohan Zhao and Wei Xu (Tsinghua University) ;Shuo Liu, Yang Tian, and Qiaoling Wang (Huawei) ;Wenfei Wu (Peking University)

Abstract: In-Network Aggregation (INA) offloads the gradient aggregation in distributed training (DT) onto programmable switches, where the switch memory could be allocated to jobs in either synchronous or statistical multiplexing mode. Statistical INA has advantages in switch memory utilization, control-plane simplicity, and management safety, but it faces the problem of cross-layer resource efficiency in job placement. This paper presents NetPack, a job placement system for clusters with statistical INA that aims to maximize the utilization of both computation and network resources. NetPack periodically batches and places jobs into the cluster. When placing a job, NetPack runs a steady state estimation algorithm to acquire the available resources in the cluster, heuristically values each server according to its available resources (GPU and bandwidth), and runs a dynamic programming algorithm to efficiently search for servers with the highest value for the job. Our prototype and experiments demonstrate that NetPack outperforms prior job placement methods by 45% in terms of average job completion time on production traces.
16:45 PDT – 17:00 PDT

RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing
Zheng Wang (University of California San Diego) ;Yuke Wang and Jiaqi Deng (University of California Santa Barbara) ;Da Zheng (Amazon) ;Ang Li (Pacific Northwest National Laboratory) ;Yufei Ding (University of California San Diego)

Abstract: Ensuring high-quality recommendations for newly onboarded users requires the continuous retraining of Deep Learning Recommendation Models (DLRMs) with freshly generated data. To serve the online DLRM retraining, existing solutions use hundreds of CPU computing nodes designated for input preprocessing, causing significant power consumption that surpasses even the power usage of GPU trainers. To this end, we propose RAP, an end-to-end DLRM training framework that supports Resource-aware Automated GPU sharing for DLRM input Preprocessing and Training. The core idea of RAP is to accurately capture the remaining GPU computing resources during DLRM training for input preprocessing, achieving superior training efficiency without requiring additional resources. Specifically, RAP utilizes a co-running cost model to efficiently assess the costs of various input preprocessing operations, and it implements a resource-aware horizontal fusion technique that adaptively merges smaller kernels according to GPU availability, circumventing any interference with DLRM training. In addition, RAP leverages a heuristic searching algorithm that jointly optimizes both the input preprocessing graph mapping and the co-running schedule to maximize the end-to-end DLRM training throughput. The comprehensive evaluation shows that RAP achieves 1.99× speedup on average over the sequential GPU-based DLRM input preprocessing baseline. In addition, the end-to-end training throughput of RAP is only 3.24% lower than the ideal case, which has no input preprocessing overhead.
Session 3D: ML Quantization and Memory Optimizations
(Location: Scripps I/II)
16:00 PDT – 16:15 PDT

MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN
Renze Chen (Peking University) ;Zijian Ding (University of California Los Angeles) ;Size Zheng and Chengrui Zhang (Peking University) ;Jingwen Leng (Shanghai Jiao Tong University) ;Xuanzhe Liu and Yun Liang (Peking University)

Abstract: Recently, the memory consumption of Deep Neural Networks (DNNs) has been increasing rapidly, mainly due to the long lifetimes and large shapes of tensors. Graph scheduling has emerged as an effective memory optimization technique, which determines the optimal execution, re-computation, swap-out, and swap-in timings for each operator/tensor. However, it often hurts performance significantly and can only manipulate tensors’ lifetimes but not shapes, limiting the optimization space. We find that graph transformation, which can change the tensor shapes and graph structure, creates a new tradeoff space between memory and performance. Nevertheless, graph transformations have so far been applied separately, with a primary focus on optimizing performance rather than memory. In this paper, we propose MAGIS, a DNN memory optimization framework that coordinates graph transformation with graph scheduling. MAGIS uses a hierarchical tree to represent Fission Transformation (F-Trans), a type of transformation which can effectively reduce tensor shapes in a sub-graph. To keep the complexity low, we build a light-weight search space based on graph structure analysis. MAGIS decomposes graph scheduling into graph transformation and re-ordering and designs an incremental scheduling algorithm to alleviate the scheduling overhead after each graph transformation step to efficiently coordinate them. Experimental results show that compared to state-of-the-art works, MAGIS only uses 15%∼85% of their peak memory usage with the same latency constraint and obtains a better Pareto boundary in dual-objective optimization of memory and performance. Our code is now available at https://github.com/pku-liang/MAGIS.
16:15 PDT – 16:30 PDT

8-bit Transformer Inference and Fine-tuning for Edge Accelerators
Jeffrey Yu, Kartik Prabhu, Yonatan Urman, Robert M. Radway, Eric Han, and Priyanka Raina (Stanford University)

Abstract: Transformer models achieve state-of-the-art accuracy on natural language processing (NLP) and vision tasks, but demand significant computation and memory resources, which makes it difficult to perform inference and training (fine-tuning) on edge accelerators. Quantization to lower precision data types is a promising way to reduce computation and memory resources. Prior work has employed 8-bit integer (int8) quantization for Transformer inference, but int8 lacks the precision and range required for training. 8-bit floating-point (FP8) quantization has been used for Transformer training, but prior work only quantizes the inputs to matrix multiplications and leaves the rest of the operations in high precision. This work conducts an in-depth analysis of Transformer inference and fine-tuning at the edge using two 8-bit floating-point data types: FP8 and 8-bit posit (Posit8). Unlike FP8, posit has variable-length exponent and fraction fields, leading to higher precision for values around 1, making it well suited for storing Transformer weights and activations. As opposed to prior work, we evaluate the impact of quantizing all operations in both the forward and backward passes, going beyond just matrix multiplications. Specifically, our work makes the following contributions: (1) We perform Transformer inference in FP8 and Posit8, achieving less than 1% accuracy loss compared to BFloat16 through operation fusion, without the need for scaling factors. (2) We perform Transformer fine-tuning in 8 bits by adapting low-rank adaptation (LoRA) to Posit8 and FP8, enabling 8-bit GEMM operations with increased multiply-accumulate efficiency and reduced memory accesses. (3) We design an area- and power-efficient posit softmax, which employs bitwise operations to approximate the exponential and reciprocal functions. The resulting vector unit in the Posit8 accelerator, which performs both softmax computation and other element-wise operations in Transformers, is on average 33% smaller and consumes 35% less power than the vector unit in the FP8 accelerator, while maintaining the same level of accuracy. Our work demonstrates that both Posit8 and FP8 can achieve inference and fine-tuning accuracy comparable to BFloat16, while reducing the accelerator’s area by 30% and 34%, and power consumption by 26% and 32%, respectively.
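For readers unfamiliar with LoRA, the sketch below shows the low-rank update applied on top of a frozen, quantized base weight. Plain symmetric int8 with a scale factor is used only as a stand-in for the FP8/posit formats and fused operations studied in the paper.

import torch

# Minimal LoRA-style step on a frozen, quantized base weight. The low-rank
# update W + (alpha/r) * B @ A is the standard LoRA formulation; int8 with a
# scale is a stand-in for the FP8/posit formats in the paper.
torch.manual_seed(0)
d, r, alpha = 64, 4, 8.0
W = torch.randn(d, d)

scale = W.abs().max() / 127
W_q = torch.clamp((W / scale).round(), -127, 127).to(torch.int8)  # frozen base

A = (0.01 * torch.randn(r, d)).requires_grad_()   # trainable low-rank factors
B = torch.zeros(d, r, requires_grad=True)         # zero-init: delta starts at 0

def forward(x):
    base = x @ (W_q.float() * scale).T            # dequantize the frozen weight
    return base + (alpha / r) * (x @ A.T @ B.T)   # add the low-rank update

x = torch.randn(2, d)
loss = forward(x).pow(2).mean()
loss.backward()                                    # gradients reach only A and B
print("loss:", loss.item(), "dL/dB norm:", B.grad.norm().item())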
16:30 PDT – 16:45 PDT

Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization
Zhanhong Tan, Zijian Zhu, and Kaisheng Ma (Tsinghua University)

Abstract: Memory is a critical design consideration in current data-intensive DNN accelerators, as it profoundly determines energy consumption, bandwidth requirements, and area costs. As DNN structures become more complex, a larger on-chip memory capacity is required to reduce data movement overhead, but at the expense of silicon costs. Some previous works have proposed memory-oriented optimizations, such as different data reuse and layer fusion schemes. However, these methods are not general and potent enough to cope with various graph structures. In this paper, we explore the intrinsic connection between network structures and memory features to optimize both hardware and mapping. First, we introduce a graph-level execution scheme with a corresponding dataflow and memory management method. This scheme enables the execution of arbitrary graph patterns with high data reuse and low hardware overhead. Subsequently, we propose Cocco, a hardware-mapping co-exploration framework leveraging graph-level features of networks. It aims to minimize communication overhead, such as energy consumption and bandwidth requirements, with a smaller memory capacity. We formulate the graph-partition scheduling and memory configuration search as an optimization problem and employ a genetic-based method to achieve efficient co-exploration for large and irregular networks. Experiments demonstrate that Cocco obtains lower external memory access, lower bandwidth requirements, and more stable optimization for graph partition compared to the greedy algorithm and dynamic programming introduced in prior works. Cocco also reduces the costs by 1.89% to 50.33% using co-exploration compared to other typical methods.
16:45 PDT – 17:00 PDT

Atalanta: A Bit is Worth a “Thousand” Tensor Values
Alberto Delmas Lascorz and Mostafa Mahmoud (University of Toronto) ;Ali Hadi Zadeh (University of Toronto and 1QBit) ;Milos Nikolic, Kareem Ibrahim, and Christina Giannoula (University of Toronto) ;Ameer Abdelhadi (McMaster University) ;Andreas Moshovos (University of Toronto and Vector Institute)

Abstract: Atalanta is a lossless, hardware/software co-designed compression technique for the tensors of fixed-point quantized deep neural networks. Atalanta increases effective memory capacity, reduces off-die traffic, and/or helps to achieve the desired performance/energy targets while using smaller off-die memories during inference. Atalanta is architected to deliver nearly identical coding efficiency compared to Arithmetic Coding while avoiding its complexity, overhead, and bandwidth limitations. Indicatively, the Atalanta decoder and encoder units each use less than 50B of internal storage. In hardware, Atalanta is implemented as an assist over any machine learning accelerator transparently compressing/decompressing tensors just before the off-die memory controller. This work shows the performance and energy efficiency of Atalanta when implemented in a 65nm technology node. Atalanta reduces the data footprint of weights and activations to 60% and 48% respectively on average over a wide set of 8-bit quantized models and complements a wide range of quantization methods. Integrated with a Tensorcore-based accelerator, Atalanta boosts the speedup and energy efficiency to 1.44× and 1.37×, respectively. Atalanta is effective at compressing the stashed activations during training for fixed-point inference.
Session 4A: Accelerators
(Location: Grande C)
10:00 PDT – 10:15 PDT

Harp: Leveraging Quasi-Sequential Characteristics to Accelerate Sequence-to-Graph Mapping of Long Reads
Yichi Zhang, Dibei Chen, Gang Zeng, and Jianfeng Zhu (Tsinghua University) ;Zhaoshi Li (MetaX Integrated Circuits) ;Longlong Chen, Shaojun Wei, and Leibo Liu (Tsinghua University)

Abstract: Read mapping is a crucial task in computational genomics. Recently, there has been a significant paradigm shift from sequence-to-sequence mapping (S2S) to sequence-to-graph mapping (S2G). The S2G mapping incurs high graph processing overheads and leads to an unnoticed shift of performance hotspots. This presents a substantial challenge to current software implementations and hardware accelerators. This paper introduces Harp, a novel S2G mapping acceleration system. Harp leverages the structural similarities between genome graphs and sequences, significantly reducing graph processing overhead by exploiting their quasi-sequential characteristic. The Harp accelerator is co-designed with two main algorithmic components: (1) HarpTree, a compact data structure that explicitly reveals the quasi-sequential characteristic, enabling simplifications of graph processing algorithms in S2G mapping, and (2) HarpExt, a multi-stage seed-extension algorithm that mitigates graph operation overhead while maintaining the sight of S2G mapping. Harp achieves an average 140× speedup over the latest S2G mapping software and outperforms the state-of-the-art S2G accelerator by 23.6× while reducing the chip area by 72% in S2G mapping of long reads.
10:15 PDT – 10:30 PDT

GSCore: Efficient Radiance Field Rendering via Architectural Support for 3D Gaussian Splatting
Junseo Lee, Seokwon Lee, Jungi Lee, Junyong Park, and Jaewoong Sim (Seoul National University)

Abstract: This paper presents GSCore, a hardware acceleration unit that efficiently executes the rendering pipeline of 3D Gaussian Splatting with algorithmic optimizations. GSCore builds on the observations from an in-depth analysis of Gaussian-based radiance field rendering to enhance computational efficiency and bring the technique to wide adoption. In doing so, we present several optimization techniques, Gaussian shape-aware intersection test, hierarchical sorting, and subtile skipping, all of which are synergistically integrated with GSCore. We implement the hardware design of GSCore, synthesize it using a commercial 28nm technology, and evaluate the performance across a range of synthetic and real-world scenes with varying image resolutions. Our evaluation results show that GSCore achieves a 15.86× speedup on average over the mobile consumer GPU with a substantially smaller area and lower energy consumption.
10:30 PDT – 10:45 PDT

BeeZip: Towards An Organized and Scalable Architecture for Data Compression
Ruihao Gao, Zhichun Li, Guangming Tan, and Xueqi Li (Chinese Academy of Sciences)

Abstract: Data compression plays a critical role in operating systems and large-scale computing workloads. Its primary objective is to reduce network bandwidth consumption and memory/storage capacity utilization. Given the need to manipulate hash tables, and execute matching operations on extensive data volumes, data compression software has transformed into a resource-intensive CPU task. To tackle this challenge, numerous prior studies have introduced hardware acceleration methods. For example, they have utilized Content-Addressable Memory (CAM) for string matches, incorporated redundant historical copies for each matching component, and so on. While these methods amplify the compression throughput, they often compromise an essential aspect of compression performance: the compression ratio (C.R.). Moreover, hardware accelerators face significant resource costs, especially in memory, when dealing with new large sliding window algorithms. We introduce BeeZip, the first hardware acceleration system designed explicitly for compression with a large sliding window. BeeZip tackles the hardware-level challenge of optimizing both compression ratio and throughput. BeeZip offers architectural support for compression algorithms with the following distinctive attributes: 1) A two-stage compression algorithm adapted for accelerator parallelism, decoupling hash parallelism and match execution dependencies; 2) An organized hash hardware accelerator named BeeHash engine enhanced with dynamic scheduling, which orchestrates hash processes with structured parallelism; 3) A hardware accelerator named HiveMatch engine for the match process, which employs a new scalable parallelism approach and a heterogeneous scale processing unit design to reduce memory resource overhead. Experimental results show that on the Silesia dataset, BeeZip achieves an optimal throughput of 10.42GB/s (C.R. 2.96) and the best C.R. of 3.14 (throughput of 5.95GB/s). Under similar compression ratios, compared to single-threaded/36-threaded software implementations, BeeZip offers accelerator speedups of 23.2×/2.45×, respectively. Against all accelerators we know, BeeZip consistently demonstrates a superior compression ratio, improving by at least 9%.
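The hash-then-match pipeline that BeeZip decouples into its BeeHash and HiveMatch engines can be pictured with an ordinary greedy LZ-style matcher in software. The toy Python code below is not BeeZip’s algorithm; it ignores sliding-window limits, lazy matching, and entropy coding.

MIN_MATCH = 4

def compress(data: bytes):
    """Greedy LZ-style tokenizer: hash 4-byte prefixes, then extend matches."""
    table = {}                       # stage 1: 4-byte prefix -> last position
    out, i = [], 0
    while i < len(data):
        key = data[i:i + MIN_MATCH]
        cand = table.get(key)
        table[key] = i
        if cand is not None and len(key) == MIN_MATCH:
            length = 0               # stage 2: extend the candidate match
            while (i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
            if length >= MIN_MATCH:
                out.append(("match", i - cand, length))
                i += length
                continue
        out.append(("literal", data[i]))
        i += 1
    return out

print(compress(b"abcabcabcabcxyz"))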
10:45 PDT – 11:00 PDT

ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations
Xiaoyang Lu (Illinois Institute of Technology) ;Boyu Long, Xiaoming Chen, and Yinhe Han (Chinese Academy of Sciences) ;Xian-He Sun (Illinois Institute of Technology)

Abstract: Sparse matrix-matrix multiplication (SpMM) is a critical computational kernel in numerous scientific and machine learning applications. SpMM involves massive irregular memory accesses and poses great challenges to conventional cache-based computer architectures. Recently dedicated SpMM accelerators have been proposed to enhance SpMM performance. However, current SpMM accelerators still face challenges in adapting to varied sparse patterns, fully exploiting inherent parallelism, and optimizing cache performance. To address these issues, we introduce ACES, a novel SpMM accelerator in this study. First, ACES features an adaptive execution flow that dynamically adjusts to diverse sparse patterns. The adaptive execution flow balances parallel computing efficiency and data reuse. Second, ACES incorporates locality-concurrency co-optimizations within the global cache. ACES utilizes a concurrency-aware cache management policy, which considers data locality and concurrency for optimal replacement decisions. Additionally, the integration of a non-blocking buffer with the global cache enhances concurrency and reduces computational stalls. Third, the hardware architecture of ACES is designed to integrate all innovations. The architecture ensures efficient support across the adaptive execution flow, advanced cache optimizations, and fine-grained parallel processing. Our performance evaluation demonstrates that ACES significantly outperforms existing solutions, providing a 2.1× speedup and marking a substantial advancement in SpMM acceleration.
11:00 PDT – 11:15 PDT

Explainable-DSE: An Agile and Explainable Exploration of Efficient HW/SW Codesigns of Deep Learning Accelerators Using Bottleneck Analysis
Shail Dave (Arizona State University) ;Tony Nowatzki (University of California Los Angeles) ;Aviral Shrivastava (Arizona State University)

Abstract: Effective design space exploration (DSE) is paramount for hardware/software codesigns of deep learning accelerators that must meet strict execution constraints. For their vast search space, existing DSE techniques can require excessive trials to obtain a valid and efficient solution because they rely on black-box explorations that do not reason about design inefficiencies. In this paper, we propose Explainable-DSE – a framework for the DSE of accelerator codesigns using bottleneck analysis. By leveraging information about execution costs from bottleneck models, our DSE is able to identify bottlenecks and reason about design inefficiencies, thereby making bottleneck-mitigating acquisitions in further explorations. We describe the construction of bottleneck models for DNN accelerators. We also propose an API for expressing domain-specific bottleneck models and interfacing them with the DSE framework. Acquisitions of our DSE systematically cater to multiple bottlenecks that arise in executions of multi-functional workloads or multiple workloads with diverse execution characteristics. Evaluations for recent computer vision and language models show that Explainable-DSE mostly explores effectual candidates, achieving codesigns of 6× lower latency in 47× fewer iterations vs. non-explainable DSEs using evolutionary or ML-based optimizations. By taking minutes or tens of iterations, it enables opportunities for runtime DSEs.
Session 4B: Serverless Computing 1
(Location: Grande D/E)
10:00 PDT – 10:15 PDT

λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions
Benjamin Carver (George Mason University) ;Runzhou Han (Iowa State University) ;Jingyuan Zhang (George Mason University) ;Mai Zheng (Iowa State University) ;Yue Cheng (University of Virginia)

Abstract: The metadata service (MDS) sits on the critical path for distributed file system (DFS) operations, and therefore it is key to the overall performance of a large-scale DFS. Common “serverful” MDS architectures, such as a single server or cluster of servers, have a significant shortcoming: either they are not scalable, or they make it difficult to achieve an optimal balance of performance, resource utilization, and cost. A modern MDS requires a novel architecture that addresses this shortcoming. To this end, we design and implement λFS, an elastic, high-performance metadata service for large-scale DFSes. λFS scales a DFS metadata cache elastically on a FaaS (Function-as-a-Service) platform and synthesizes a series of techniques to overcome the obstacles that are encountered when building large, stateful, and performance-sensitive applications on FaaS platforms. λFS takes full advantage of the unique benefits offered by FaaS—elastic scaling and massive parallelism—to realize a highly-optimized metadata service capable of sustaining up to 4.13× higher throughput, 90.40% lower latency, 85.99% lower cost, 3.33× better performance-per-cost, and better resource utilization and efficiency than a state-of-the-art DFS for an industrial workload.
10:15 PDT – 10:30 PDT

FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture
Chuhao Xu, Yiyu Liu, Zijun Li, Quan Chen, and Han Zhao (Shanghai Jiao Tong University) ;Deze Zeng (China University of Geosciences) ;Qian Peng, Xueqi Wu, Haifeng Zhao, and Senbo Fu (Huawei) ;Minyi Guo (Shanghai Jiao Tong University)

Abstract: In serverless computing, an idle container is not recycled immediately, in order to mitigate time-consuming cold container startups. These idle containers still occupy memory, exacerbating the memory shortage of today’s data centers. Offloading their cold memory to a remote memory pool could potentially resolve this problem. However, existing offloading policies either hurt the Quality of Service (QoS) or are too coarse-grained for serverless computing scenarios. We therefore propose FaaSMem, a dedicated memory offloading mechanism tailored for serverless computing with a memory pool architecture. It is based on our finding that the memory a serverless container allocates in different stages has different usage patterns. Specifically, FaaSMem proposes Page Bucket (Pucket) to segregate memory pages into different segments, and applies segment-wise offloading policies to them. FaaSMem also proposes a semi-warm period during the keep-alive stage, to seek a sweet spot between the offloading effort and the remote access penalty. Experimental results show that FaaSMem reduces the average local memory footprint by 9.9% – 79.8% and improves the container deployment density to 108% – 218%, with negligible 95%-ile latency increase.
10:30 PDT – 10:45 PDT

CodeCrunch: Improving Serverless Performance via Function Compression and Cost-Aware Warmup Location Optimization
Rohan Basu Roy (Northeastern University) ;Tirthak Patel (Rice University) ;Rohan Garg (Nutanix) ;Devesh Tiwari (Northeastern University)

Abstract: Serverless computing has a critical problem of function cold starts. To minimize cold starts, state-of-the-art techniques predict function invocation times to warm them up. Warmed-up functions occupy space in memory and incur a keep-alive cost, which can become exceedingly prohibitive under bursty load. To address this issue, we design CodeCrunch, which introduces the concept of serverless function compression and exploits server heterogeneity to make serverless computing more efficient, especially under high memory pressure.
10:45 PDT – 11:00 PDT

RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing
Hanfei Yu (Louisiana State University) ;Rohan Basu Roy (Northeastern University) ;Christian Fontenot (Louisiana State University) ;Devesh Tiwari (Northeastern University) ;Jian Li (Stony Brook University) ;Hong Zhang (University of Waterloo) ;Hao Wang (Louisiana State University) ;Seung-Jong Park (Missouri University of Science and Technology)

Abstract: Serverless computing has grown rapidly as a new cloud computing paradigm that promises ease-of-management, cost-efficiency, and auto-scaling by shipping functions via self-contained virtualized containers. Unfortunately, serverless computing suffers from severe cold-start problems—starting containers incurs non-trivial latency. Full container caching is widely applied to mitigate cold-starts, yet has recently been outperformed by two lines of research: partial container caching and container sharing. However, both partial container caching and container sharing techniques exhibit drawbacks. Partial container caching effectively deals with burstiness while leaving cold-start mitigation halfway; container sharing reduces cold-starts by enabling containers to serve multiple functions while suffering from excessive memory waste due to over-packed containers. This paper proposes RainbowCake, a layer-wise container pre-warming and keep-alive technique that effectively mitigates cold-starts with sharing awareness at minimal waste of memory. With structured container layers and sharing-aware modeling, RainbowCake is robust and tolerant to invocation bursts. We seize the container-sharing opportunities hidden in the startup process of standard container techniques. RainbowCake breaks the container startup process into three stages and manages different container layers individually. We develop a sharing-aware algorithm that makes event-driven layer-wise caching decisions in real-time. Experiments on OpenWhisk clusters with real-world workloads show that RainbowCake reduces function startup latency by 68% and memory waste by 77% compared to state-of-the-art solutions.
11:00 PDT – 11:15 PDT

Flame: A Centralized Cache Controller for Serverless Computing
Yanan Yang, Laiping Zhao, Yiming Li, and Shihao Wu (Tianjin University) ;Yuechan Hao and Yuchi Ma (Huawei) ;Keqiu Li (Tianjin University)

Abstract: Caching functions is a promising way to mitigate cold-start overhead in serverless computing. However, as caching also increases the resource cost significantly, how to make caching decisions is still challenging. We find that the prior “local cache control” designs are insufficient to achieve high cache efficiency due to the workload skewness across servers. In this paper, inspired by the idea of software defined network management, we propose Flame, an efficient cache system to manage cached functions with a “centralized cache control” design. By decoupling the cache control plane from local servers and setting up a separate centralized controller, Flame is able to make caching decisions considering a global view of cluster status, enabling an optimized cache-hit ratio and resource efficiency. We evaluate Flame with real-world workloads and the evaluation results show that it can reduce the cache resource usage by 36% on average while improving the cold-start ratio by nearly 7× over the state-of-the-art method.
Session 4C: Power and Energy
(Location: Scripps I)
10:00 PDT – 10:15 PDT

Characterizing Power Management Opportunities for LLMs in the Cloud
Pratyush Patel (Microsoft Azure and University of Washington) ;Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini (Microsoft Azure)

Abstract: Recent innovations in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
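The oversubscription argument is back-of-the-envelope arithmetic: if inference servers rarely approach their provisioned peak, a rack budgeted for peak draw can host more of them. The numbers below are hypothetical illustrations, not measurements from the paper.

rack_budget_kw = 40.0
nameplate_per_server_kw = 4.0       # provisioned peak draw per GPU server
observed_p99_per_server_kw = 3.0    # what inference actually draws (p99)

conservative = int(rack_budget_kw // nameplate_per_server_kw)
oversubscribed = int(rack_budget_kw // observed_p99_per_server_kw)

print("servers provisioned for peak draw:   ", conservative)
print("servers with power oversubscription: ", oversubscribed)
print("extra capacity: %.0f%%" % (100 * (oversubscribed / conservative - 1)))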
10:15 PDT – 10:30 PDT

Going Green for Less Green: Optimizing the Cost of Reducing Cloud Carbon Emissions
Walid A. Hanafy and Qianlin Liang (University of Massachusetts Amherst) ;Noman Bashir (MIT) ;Abel Souza, David Irwin, and Prashant Shenoy (University of Massachusetts Amherst)

Abstract: The continued exponential growth of cloud datacenter capacity has increased awareness of the carbon emissions incurred when executing large compute-intensive workloads. To reduce carbon emissions, cloud users often temporally shift their batch workloads to periods with low carbon intensity. While such time shifting can increase job completion times due to delayed execution, the cost savings from cloud purchase options, such as reserved instances, also decrease when users operate in a carbon-aware manner. This happens because carbon-aware adjustments change the demand pattern by periodically leaving resources idle, which creates a trade-off between carbon emissions and cost. In this paper, we present GAIA, a carbon-aware scheduler that enables users to address the three-way trade-off between carbon, performance, and cost in cloud-based batch schedulers. Our results quantify the carbon-performance-cost trade-off in cloud platforms and show that compared to existing carbon-aware scheduling policies, our proposed policies can double the amount of carbon savings per percentage increase in cost, while decreasing the performance overhead by 26%.
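The basic time-shifting knob can be sketched in a few lines: given an hourly carbon-intensity forecast, pick the start hour that minimizes the job’s total emissions. GAIA’s contribution lies in trading this off against completion time and cloud purchase options; the forecast and job parameters below are made up.

# Hourly carbon-intensity forecast (gCO2/kWh) and a 3-hour batch job
# drawing 1.2 kW; all values are made up for illustration.
forecast = [420, 410, 380, 300, 220, 180, 190, 250, 340, 400, 430, 440]
job_hours = 3
power_kw = 1.2

def best_start(forecast, job_hours):
    """Return (total carbon intensity, start hour) of the greenest window."""
    return min((sum(forecast[s:s + job_hours]), s)
               for s in range(len(forecast) - job_hours + 1))

intensity_sum, start = best_start(forecast, job_hours)
print("run the job starting at hour", start)
print("estimated emissions: %.0f gCO2" % (intensity_sum * power_kw))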
10:30 PDT – 10:45 PDT

FOCAL: A First-Order Carbon Model to Assess Processor Sustainability
Lieven Eeckhout (Ghent University)

Abstract: Sustainability in general and global warming in particular are grand societal challenges. Computer systems demand substantial materials and energy resources throughout their entire lifetime. A key question is how computer engineers and scientists can reduce the environmental impact of computing. To overcome the inherent data uncertainty, this paper proposes FOCAL, a parameterized first-order carbon model to assess processor sustainability using first principles. FOCAL’s normalized carbon footprint (NCF) metric guides computer architects to holistically optimize chip area, energy and power consumption to reduce a processor’s environmental footprint. We use FOCAL to analyze and categorize a broad set of archetypal processor mechanisms into strongly, weakly or less sustainable design choices, providing insight and intuition into how to reduce a processor’s environmental footprint with implications to both hardware and software. A case study illustrates a pathway for designing strongly sustainable multicore processors delivering high performance while at the same time reducing their environmental footprint.
10:45 PDT – 11:00 PDT

Predict; Don’t React for Enabling Efficient Fine-Grain DVFS in GPUs
Srikant Bharadwaj (Microsoft) ;Shomit Das (Qualcomm) ;Kaushik Mazumdar and Bradford M. Beckmann (AMD) ;Stephen Kosonocky (Uhnder)

Abstract: With the continuous improvement of on-chip integrated voltage regulators (IVRs) and fast, adaptive frequency control, dynamic voltage-frequency scaling (DVFS) transition times have shrunk from the microsecond to the nanosecond regime, providing immense opportunity to improve energy efficiency. The key to unlocking the continued improvement in V/f circuit technology is the creation of new, smarter DVFS mechanisms that better adapt to rapid fluctuations in workload demand. It is particularly important to optimize fine-grain DVFS mechanisms for graphics processing units (GPUs) as the chips become ever more important workhorses in the datacenter. However, GPU’s massive amount of thread-level parallelism makes it uniquely difficult to determine the optimal V/f state at run-time. Existing solutions—mostly designed for single-threaded CPUs and longer time scales—fail to consider the seemingly chaotic, highly varying nature of GPU workloads at short time scales. This paper proposes a novel prediction mechanism, PCSTALL, that is tailored for emerging DVFS capabilities in GPUs and achieves near-optimal energy efficiency. Using the insights from our fine-grained workload analysis, we propose a wavefront-level program counter (PC) based DVFS mechanism that improves program behavior prediction accuracy by 32% on average as compared to the best performing prior predictor for a wide set of GPU applications at 1µs DVFS time epochs. Compared to the current state-of-the-art, our PC-based technique achieves a 19% average improvement when optimized for the Energy-Delay² Product (ED²P) at 50µs time epochs, reaching 32% when operated with 1µs DVFS technologies.
11:00 PDT – 11:15 PDT

SUIT: Secure Undervolting with Instruction Traps
Jonas Juffinger (Graz University of Technology) ;Stepan Kalinin (North Carolina State University) ;Daniel Gruss (Graz University of Technology) ;Frank Mueller (North Carolina State University)

Abstract: Modern CPUs dynamically scale voltage and frequency for efficiency. However, excessively low voltages can result in security-critical errors. Hence, vendors use a generous safety margin to avoid errors at the cost of higher energy overheads. In this work, we present SUIT, a novel hardware-software co-design to reduce the safety margin substantially without compromising reliability or security. We observe that not all instructions are equally affected by undervolting faults and that most faultable instructions are infrequent in practice. Hence, SUIT addresses infrequent faultable instructions via two separate DVFS curves, a conservative and an efficient one. For frequent faultable instructions, SUIT statically relaxes the critical path in hardware. Consequently, the instruction is not faultable anymore on the efficient DVFS curve at the cost of performance overheads for this specific instruction. For infrequent faultable instructions, SUIT introduces a trap mechanism preventing execution on the efficient curve. With this trap mechanism, SUIT temporarily switches to the conservative DVFS curve and switches back if no faultable instruction was executed within a certain time frame. We evaluate all building blocks of SUIT, using both measurements on real hardware and simulations, showing a performance overhead of 3.79% and a CPU efficiency gain of 20.8% on average on SPEC CPU2017.
Session 4D: Static Analysis and Verification
(Location: Scripps II)
10:00 PDT – 10:15 PDT

Kaleidoscope: Precise Invariant-Guided Pointer Analysis
Tapti Palit and Pedro Fonseca (Purdue University)

Abstract: Pointer analysis techniques are crucial for many software security mitigation approaches. However, these techniques suffer from imprecision; hence, the reported points-to sets are a superset of the actual points-to sets that can possibly form during program execution. To improve the precision of pointer analysis techniques, we propose Kaleidoscope. By using an invariant-guided optimistic (IGO) pointer analysis approach, Kaleidoscope makes optimistic assumptions during the pointer analysis that it later validates at runtime. If these optimistic assumptions do not hold true at runtime, Kaleidoscope falls back to an imprecise baseline analysis, thus preserving soundness. We show that Kaleidoscope reduces the average points-to set size by 13.15× across a set of 9 applications over the current state-of-the-art pointer analysis framework. Furthermore, we demonstrate how Kaleidoscope can implement control flow integrity (CFI) to increase the security of traditional CFI policies.
10:15 PDT – 10:30 PDT

Lifting Micro-Update Models from RTL for Formal Security Analysis
Adwait Godbole and Kevin Cheang (University of California Berkeley) ;Yatin A. Manerkar (University of Michigan) ;Sanjit A. Seshia (University of California Berkeley)

Abstract: Hardware execution attacks exploit subtle microarchitectural interactions to leak secret data. While checking programs for the existence of such attacks is essential, verification of software against the full hardware implementation does not scale. Verification using abstract formal models of the hardware can help provide strong security guarantees while leveraging abstraction to achieve scalability. However, handwriting accurate abstract models is tedious and error-prone. Hence, we need techniques to generate models that enable sound yet scalable security analysis automatically. In this work, we propose micro-update models as a modelling framework that enables sound and abstract modelling of microarchitectural features. We also develop algorithms to generate micro-update models from RTL semi-automatically. We implement our modelling and generation framework in a prototype tool called PAUL. We evaluate our approach by synthesizing micro-update models for the Sodor5Stage processor and components from the cva6 (Ariane) processor. We demonstrate how these models can be generated hierarchically, thus increasing scalability to larger designs. We observe up to 8× improvement in run time when performing analysis with the generated models as compared to the source RTL.
10:30 PDT – 10:45 PDT

Formal Mechanised Semantics of CHERI C: Capabilities, Undefined Behaviour, and Provenance
Vadim Zaliva and Kayvan Memarian (University of Cambridge) ;Ricardo Almeida (University of Edinburgh) ;Jessica Clarke (University of Cambridge) ;Brooks Davis (SRI International) ;Alexander Richardson (University of Cambridge) ;David Chisnall (Microsoft) ;Brian Campbell and Ian Stark (University of Edinburgh) ;Robert N. M. Watson and Peter Sewell (University of Cambridge)

Abstract: Memory safety issues are a persistent source of security vulnerabilities, with conventional architectures and the C codebase chronically prone to exploitable errors. The CHERI research project has shown how one can provide radically improved security for that existing codebase with minimal modification, using unforgeable hardware capabilities in place of machine-word pointers in CHERI dialects of C, implemented as adaptions of Clang/LLVM and GCC. CHERI was first prototyped as extensions of MIPS and RISC-V; it is currently being evaluated by Arm and others with the Arm Morello experimental architecture, processor, and platform, to explore its potential for mass-market adoption, and by Microsoft in their CHERIoT design for embedded cores. There is thus considerable practical experience with CHERI C implementation and use, but exactly what CHERI C’s semantics is (or should be) remains an open question. In this paper, we present the first attempt to rigorously and comprehensively define CHERI C semantics, discuss key semantics design questions relating to capabilities, provenance, and undefined behaviour, and clarify them with semantics in multiple complementary forms: in prose, as an executable semantics adapting the Cerberus C semantics, and mechanised in Coq. This establishes a solid foundation for CHERI C, for those porting code to it, for compiler implementers, and for future semantics and verification.
10:45 PDT – 11:00 PDT

Lightweight, Modular Verification for WebAssembly-to-Native Instruction Selection
Alexa VanHattum (Wellesley College) ;Monica Pardeshi (Carnegie Mellon University) ;Chris Fallin (Fastly) ;Adrian Sampson (Cornell University) ;Fraser Brown (Carnegie Mellon University)

Abstract: Language-level guarantees—like module runtime isolation for WebAssembly (Wasm)—are only as strong as the compiler that produces a final, native-machine-specific executable. The process of lowering language-level constructions to ISA-specific instructions can introduce subtle bugs that violate security guarantees. In this paper, we present Crocus, a system for lightweight, modular verification of instruction-lowering rules within Cranelift, a production retargetable Wasm native code generator. We use Crocus to verify lowering rules that cover WebAssembly 1.0 support for integer operations in the ARM aarch64 backend. We show that Crocus can reproduce 3 known bugs (including a 9.9/10 severity CVE), identify 2 previously-unknown bugs and an underspecified compiler invariant, and help analyze the root causes of a new bug.
11:00 PDT – 11:15 PDT

Verifying Rust Implementation of Page Tables in a Software Enclave Hypervisor
Zhenyang Dai (Tsinghua University and Ant Group) ;Shuang Liu (Ant Group) ;Vilhelm Sjoberg (CertiK) ;Xupeng Li (Columbia University) ;Yu Chen (Tsinghua University) ;Wenhao Wang (Chinese Academy of Sciences) ;Yuekai Jia (Tsinghua University and Ant Group) ;Sean Noble Anderson (Portland State University) ;Laila Elbeheiry (Max Planck Institute for Software Systems) ;Shubham Sondhi (CertiK) ;Yu Zhang (Yale University) ;Zhaozhong Ni (CertiK) ;Shoumeng Yan (Ant Group) ;Ronghui Gu (Columbia University) ;Zhengyu He (Ant Group)

Abstract: As trusted execution environments (TEE) have become the cornerstone for secure cloud computing, it is critical that they are reliable and enforce proper isolation, of which a key ingredient is spatial isolation. Many TEEs are implemented in software such as hypervisors for flexibility, and in a memory-safe language, namely Rust, to alleviate potential memory bugs. Still, even if memory bugs are absent from the TEE, it may contain semantic errors such as mis-configurations in its memory subsystem which break spatial isolation. In this paper, we present the verification of the memory subsystem of a software TEE in Rust, namely HyperEnclave. We prove spatial isolation for the secure enclave through correct configuration of page tables for an early prototype of HyperEnclave. To formally model Rust code, we introduce a lightweight formal semantics for the Mid-level Intermediate Representation (MIR) of Rust. To make verification scalable for such a complex system, we combine the MIR semantics with a layered proof framework.
Session 5A: Compiler and Optimization Techniques
(Location: Grande C)
11:45 PDT – 12:00 PDT

C4CAM: A Compiler for CAM-based In-memory Accelerators
Hamid Farzaneh and Joao Paulo Cardoso De Lima (TU Dresden) ;Mengyuan Li (University of Notre Dame) ;Asif Ali Khan (TU Dresden) ;Xiaobo Sharon Hu (University of Notre Dame) ;Jeronimo Castrillon (TU Dresden)

Abstract: Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning, among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level TorchScript code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.
12:00 PDT – 12:15 PDT

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
Erik Orm Hellsten (Lund University) ;Artur Souza (Federal University of Minas Gerais) ;Johannes Lenfers (University of Münster) ;Rubens Lacouture and Olivia Hsu (Stanford University) ;Adel Ejjeh (University of Illinois at Urbana-Champaign) ;Fredrik Kjolstad (Stanford University) ;Michel Steuwer (University of Edinburgh) ;Kunle Olukotun (Stanford University) ;Luigi Nardi (Lund University and Stanford University)

Abstract: We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO’s effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art auto-tuners by delivering on average 1.36×–1.56× faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.9×–3.9× faster.
12:15 PDT – 12:30 PDT

Merlin: Multi-tier Optimization of eBPF Code for Performance and Compactness
Jinsong Mao (University of Massachusetts Amherst) ;Hailun Ding (Rutgers University) ;Juan Zhai and Shiqing Ma (University of Massachusetts Amherst)

Abstract: eBPF (extended Berkeley Packet Filter) significantly enhances observability, performance, and security within the Linux kernel, playing a pivotal role in various real-world applications. Implemented as a register-based kernel virtual machine, eBPF features a customized Instruction Set Architecture (ISA) with stringent kernel safety requirements, e.g., a limited number of instructions. This constraint necessitates substantial optimization efforts for eBPF programs to meet performance objectives. Despite the availability of compilers supporting eBPF program compilation, existing tools often overlook key optimization opportunities, resulting in suboptimal performance. In response, this paper introduces Merlin, an optimization framework leveraging customized LLVM passes and bytecode rewriting for intermediate representation (IR) transformation and bytecode refinement. Merlin employs two primary optimization strategies, i.e., instruction merging and strength reduction. These optimizations are deployed before eBPF verification. We evaluate Merlin across 19 XDP programs (drawn from the Linux kernel, Meta, hXDP, and Cilium) and three eBPF-based systems (Sysdig, Tetragon, and Tracee, each comprising several hundred eBPF programs). The results show that all optimized programs pass the kernel verification. Meanwhile, Merlin can reduce the number of instructions by 73% and runtime overhead by 60% compared with the original programs. Merlin can also improve the throughput by 0.59% and reduce the latency by 5.31%, compared to the state-of-the-art technique K2, while being 10⁶ times faster and more scalable to larger and more complex programs without additional manual efforts.
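As a rough illustration of the two rewrite classes named in the abstract, the Python sketch below applies strength reduction and instruction merging to a toy three-address program. The instruction encoding and opcode names are hypothetical; this is not Merlin's actual pass pipeline or the eBPF ISA.

```python
# Illustrative sketch only (not Merlin): two peephole rewrites of the kind the
# abstract names, applied to a toy three-address instruction list.

def strength_reduce(prog):
    """Replace multiplications by powers of two with cheaper shifts."""
    out = []
    for op, dst, a, b in prog:
        if op == "mul" and isinstance(b, int) and b > 0 and (b & (b - 1)) == 0:
            out.append(("lsh", dst, a, b.bit_length() - 1))  # x * 2^k  ->  x << k
        else:
            out.append((op, dst, a, b))
    return out

def merge_adds(prog):
    """Fold consecutive 'r = r + imm' instructions on the same register into one."""
    out = []
    for ins in prog:
        prev = out[-1] if out else None
        if (prev and ins[0] == prev[0] == "add_imm"
                and ins[1] == ins[2] == prev[1] == prev[2]):
            out[-1] = ("add_imm", ins[1], ins[2], prev[3] + ins[3])
        else:
            out.append(ins)
    return out

prog = [("mul", "r1", "r1", 8), ("add_imm", "r2", "r2", 4), ("add_imm", "r2", "r2", 12)]
print(merge_adds(strength_reduce(prog)))
# [('lsh', 'r1', 'r1', 3), ('add_imm', 'r2', 'r2', 16)]
```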
12:30 PDT – 12:45 PDT

Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors
Samuel Thomas and James Bornholt (University of Texas at Austin)

Abstract: Embedded applications extract the best power–performance trade-off from digital signal processors (DSPs) by making extensive use of vectorized execution. Rather than handwriting the many customized kernels these applications use, DSP engineers rely on auto-vectorizing compilers to quickly produce effective code. Building these compilers is a large and error-prone investment, and each new DSP architecture or application-specific ISA customization must repeat this effort to derive a new high-performance compiler. We present Isaria, a framework for automatically generating vectorizing compilers for DSP architectures. Isaria uses equality saturation to search for vectorized DSP code using a system of rewrite rules. Rather than hand-crafting these rules, Isaria automatically synthesizes sound rewrite rules from an ISA specification, discovers phase structure within these rules that improve compilation performance, and schedules their application at compile time while pruning intermediate states of the search. We use Isaria to generate a compiler for an industrial DSP architecture, and show that the resulting kernels outperform existing DSP libraries by up to 6.9× and are competitive with those generated by expert-built compilers. We also demonstrate how Isaria can speed up exploration of new ISA customizations by automatically generating a high-quality vectorizing compiler.
12:45 PDT – 13:00 PDT

Fast Instruction Selection for Fast Digital Signal Processing
Alexander J Root (Stanford University) ;Maaz Bin Safeer Ahmad (Adobe) ;Dillon Sharlet (Independent Researcher) ;Andrew Adams and Shoaib Kamil (Adobe) ;Jonathan Ragan-Kelley (MIT CSAIL)

Abstract: Modern vector processors support a wide variety of instructions for fixed-point digital signal processing. These instructions support a proliferation of rounding, saturating, and type conversion modes, and are often fused combinations of more primitive operations. While these are common idioms in fixed-point signal processing, it is difficult to use these operations in portable code. It is challenging for programmers to write down portable integer arithmetic in a C-like language that corresponds exactly to one of these instructions, and even more challenging for compilers to recognize when these instructions can be used. Our system, Pitchfork, defines a portable fixed-point intermediate representation, FPIR, that captures common idioms in fixed-point code. FPIR can be used directly by programmers experienced with fixed-point, or Pitchfork can automatically lift from integer operations into FPIR using a term-rewriting system (TRS) composed of verified manual and automatically-synthesized rules. Pitchfork then lowers from FPIR into target-specific fixed-point instructions using a set of target-specific TRSs. We show that this approach improves runtime performance of portably-written fixed-point signal processing code in Halide, across a range of benchmarks, by geomean 1.31× on x86 with AVX2, 1.82× on ARM Neon, and 2.44× on Hexagon HVX compared to a standard LLVM-based compiler flow, while maintaining or improving existing compile times.
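To make the lifting step concrete, the sketch below shows a single term-rewriting rule that recognizes a common fixed-point idiom in a toy expression IR and replaces it with a coarser primitive. The tuple-based expression encoding and the primitive name "rounding_shift_right" are assumptions for illustration and do not reflect Pitchfork's actual FPIR or rule set.

```python
# Minimal term-rewriting sketch, assuming a toy tuple-based expression IR.

def lift_rounding_shift(expr):
    """Rewrite ((x + (1 << (n-1))) >> n)  ==>  ('rounding_shift_right', x, n)."""
    if isinstance(expr, tuple) and expr[0] == ">>" and isinstance(expr[2], int):
        lhs, n = expr[1], expr[2]
        if (isinstance(lhs, tuple) and lhs[0] == "+" and
                isinstance(lhs[2], tuple) and lhs[2][0] == "<<" and
                lhs[2][1] == 1 and lhs[2][2] == n - 1):
            return ("rounding_shift_right", lift_rounding_shift(lhs[1]), n)
    if isinstance(expr, tuple):
        # recurse into sub-expressions so nested occurrences are also lifted
        return (expr[0],) + tuple(lift_rounding_shift(e) for e in expr[1:])
    return expr

e = (">>", ("+", "x", ("<<", 1, 7)), 8)   # (x + 128) >> 8, with 128 spelled as 1 << 7
print(lift_rounding_shift(e))             # ('rounding_shift_right', 'x', 8)
```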

Session 5B: Emerging and Non-Traditional Technologies
(Location: Grande D/E)
11:45 PDT – 12:00 PDT

An Encoding Scheme to Enlarge Practical DNA Storage Capacity by Reducing Primer-Payload Collisions
Yixun Wei (University of Minnesota Twin Cities) ;Bingzhe Li (University of Texas at Dallas) ;David H.C. Du (University of Minnesota Twin Cities)

Abstract: Deoxyribonucleic Acid (DNA), with its ultra-high storage density and long durability, is a promising long-term archival storage medium and is attracting much attention today. A DNA storage system encodes and stores digital data with synthetic DNA sequences and decodes DNA sequences back to digital data via sequencing. Many encoding schemes have been proposed to enlarge DNA storage capacity by increasing DNA encoding density. However, only increasing encoding density is insufficient because enhancing DNA storage capacity is a multifaceted problem. This paper assumes that random accesses are necessary for practical DNA archival storage. We identify all factors affecting DNA storage capacity under current technologies and systematically investigate the practical DNA storage capacity with several popular encoding schemes. The investigation shows that collisions between primers and DNA payload sequences are a major factor limiting DNA storage capacity. Based on this discovery, we designed a new encoding scheme called Collision Aware Code (CAC) to trade some encoding density for the reduction of primer-payload collisions. Compared with the best result among the five existing encoding schemes, CAC can extricate 120% more primers from collisions and increase the DNA tube capacity from 211.96 GB to 295.11 GB. We also evaluate CAC’s recoverability from DNA storage errors; the results show that CAC’s recoverability is comparable to that of existing encoding schemes.
12:00 PDT – 12:15 PDT

Design of Novel Analog Compute Paradigms with Ark
Yu-Neng Wang (Stanford University) ;Glenn Cowan (Concordia University) ;Ulrich Rührmair (TU Berlin and University of Connecticut) ;Sara Achour (Stanford University)

Abstract: Previous efforts on reconfigurable analog circuits mostly focused on specialized analog circuits, produced through careful co-design, or on highly reconfigurable, but relatively resource inefficient, accelerators that implement analog compute paradigms. This work deals with an intermediate point in the design space: specialized reconfigurable circuits for analog compute paradigms. This class of circuits requires new methodologies for performing co-design, as prior techniques are typically highly specialized to conventional circuit classes (e.g., filters, ADCs). In this context, we present Ark, a programming language for describing analog compute paradigms. Ark enables progressive incorporation of analog behaviors into computations, and deploys a validator and dynamical system compiler for verifying and simulating computations. We use Ark to codify the design space for three different exemplary circuit design problems, and demonstrate that Ark helps explore design trade-offs and evaluate the impact of non-idealities on the computation.
12:15 PDT – 12:30 PDT

LightRidge: An End-to-end Agile Design Framework for Diffractive Optical Neural Networks
Yingjie Li (University of Utah and University of Maryland) ;Ruiyang Chen, Minhan Lou, Berardi Sensale-Rodriguez, and Weilu Gao (University of Utah) ;Cunxi Yu (University of Utah and University of Maryland)

Abstract: To lower the barrier to diffractive optical neural networks (DONNs) design, exploration, and deployment, we propose LightRidge, the first end-to-end optical ML compilation framework, which consists of (1) precise and differentiable optical physics kernels that enable complete explorations of DONNs architectures, (2) optical physics computation kernel acceleration that significantly reduces the runtime cost in training, emulation, and deployment of DONNs, and (3) versatile and flexible optical system modeling and a user-friendly domain-specific language (DSL). As a result, the LightRidge framework enables efficient end-to-end design and deployment of DONNs, and significantly reduces the efforts for programming, hardware-software codesign, and chip integration. Our results are experimentally conducted with physical optical systems, where we demonstrate: (1) the optical physics kernels precisely correlated to low-level physics and systems, (2) significant speedups in runtime with physics-aware emulation workloads compared to the state-of-the-art commercial system, (3) effective architectural design space exploration verified by the hardware prototype and on-chip integration case study, and (4) novel DONN design principles including successful demonstrations of advanced image classification and image segmentation tasks using DONNs architecture and topology.
12:30 PDT – 12:45 PDT

EagleEye: Nanosatellite constellation design for high-coverage, high-resolution sensing
Zhuo Cheng, Bradley Denby, Kyle McCleary, and Brandon Lucia (Carnegie Mellon University)

Abstract: Advances in nanosatellite technology and low launch costs have led to more Earth-observation satellites in low-Earth orbit. Prior work shows that satellite images are useful for geospatial analysis applications (e.g., ship detection, lake monitoring, and oil tank volume estimation). To maximize its value, a satellite constellation should achieve high coverage and provide high-resolution images of the targets. Existing homogeneous constellation designs cannot meet both requirements: a constellation with low-resolution cameras provides high coverage but only delivers low-resolution images; a constellation with high-resolution cameras images smaller geographic areas. We develop EagleEye, a novel mixed-resolution, leader-follower constellation design. The leader satellite has a low-resolution, high-coverage camera to detect targets with onboard image processing. The follower satellite(s), equipped with a high-resolution camera, receive commands from the leader and take high-resolution images of the targets. The leader must consider actuation time constraints when scheduling follower target acquisitions. Additionally, the leader must complete both target detection and follower scheduling in a limited time. We propose an ILP-based algorithm to schedule follower satellite target acquisition, based on positions identified by a leader satellite. We evaluate on four datasets and show that EagleEye achieves 11–194% more coverage compared to existing solutions.
12:45 PDT – 13:00 PDT

Energy Efficient Convolutions with Temporal Arithmetic
Rhys Gretsch and Peiyang Song (University of California Santa Barbara) ;Advait Madhavan (University of Maryland College Park) ;Jeremy Lau and Timothy Sherwood (University of California Santa Barbara)

Abstract: Convolution is an important operation at the heart of many applications, including image processing, object detection, and neural networks. While data movement and coordination operations continue to be important areas for optimization in general-purpose architectures, for computation fused with sensor operation, the underlying multiply-accumulate (MAC) operations dominate power consumption. Non-traditional data encoding has been shown to reduce the energy consumption of this arithmetic, with options including everything from reduced-precision floating point to fully stochastic operation, but all of these approaches start with the assumption that a complete analog-to-digital conversion (ADC) has already been done for each pixel. While analog-to-time converters have been shown to use less energy, arithmetically manipulating temporally encoded signals beyond simple min, max, and delay operations has not previously been possible, meaning operations such as convolution have been out of reach. In this paper we show that arithmetic manipulation of temporally encoded signals is possible, practical to implement, and extremely energy efficient. The core of this new approach is a negative log transformation of the traditional numeric space into a ‘delay space’ where scaling (multiplication) becomes delay (addition in time). The challenge lies in dealing with addition and subtraction. We show these operations can also be done directly in this negative log delay space, that the associative and commutative properties still apply to the transformed operations, and that accurate approximations can be built efficiently in hardware using delay elements and basic CMOS logic elements. Furthermore, we show that these operations can be chained together in space or operated recurrently in time. This approach fits naturally into the staged ADC readout inherent to most modern cameras. To evaluate our approach, we develop a software system that automatically transforms traditional convolutions into delay space architectures. The resulting system is used to analyze and balance error from both a new temporal equivalent of quantization and delay element noise, resulting in designs that improve the energy per pixel of each convolution frame by more than 2× compared to the state of the art, while improving the energy-delay product by four orders of magnitude.
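The negative-log transformation can be illustrated numerically: encoding a value x as the delay d = -ln(x) turns scaling into addition of delays, and sums can be recovered in delay space via the log-sum-exp identity. The sketch below is a simplified software model that uses the exact identity; it does not model the hardware approximation with delay elements and CMOS logic described in the paper.

```python
import math

# Numerical sketch of the 'delay space' transform: values in (0, 1] are encoded
# as delays d = -ln(x). Multiplication becomes addition of delays; addition is
# computed here with the exact log-sum-exp identity.

def to_delay(x):      return -math.log(x)
def from_delay(d):    return math.exp(-d)

def delay_mul(dx, dy):
    return dx + dy                                   # x * y  <->  d_x + d_y

def delay_add(dx, dy):
    lo, hi = min(dx, dy), max(dx, dy)
    return lo - math.log1p(math.exp(-(hi - lo)))     # -ln(x + y), exact form

x, y = 0.25, 0.5
dx, dy = to_delay(x), to_delay(y)
print(from_delay(delay_mul(dx, dy)))   # 0.125  == x * y
print(from_delay(delay_add(dx, dy)))   # 0.75   == x + y
```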

Session 5C: Memory: Allocation and Management
(Location: Scripps I)
11:45 PDT – 12:00 PDT

Getting a Handle on Unmanaged Memory
Nick Wanninger, Tommy McMichen, Simone Campanoni, and Peter Dinda (Northwestern University)

Abstract: The inability to relocate objects in unmanaged languages brings with it a menagerie of problems. Perhaps the most impactful is memory fragmentation, which has long plagued applications such as databases and web servers. These issues either fester or require Herculean programmer effort to address on a per-application basis because, in general, heap objects cannot be moved in unmanaged languages. In contrast, managed languages like C# cleanly address fragmentation through the use of compacting garbage collection techniques built upon heap object movement. In this work, we bridge this gap between unmanaged and managed languages through the use of handles, a level of indirection allowing heap object movement. Handles open the door to seamlessly employing runtime features from managed languages in existing, unmodified code written in unmanaged languages. We describe a new compiler and runtime system, ALASKA, that acts as a drop-in replacement for malloc. Without any programmer effort, the ALASKA compiler transforms pointer-based code to utilize handles, with optimizations to minimize performance impact. A codesigned runtime system manages this new level of indirection and exploits heap object movement via an extensible service interface. We investigate the overheads of ALASKA on large benchmarks and applications spanning multiple domains. To show the power and extensibility of handles, we use ALASKA to eliminate fragmentation on the heap through defragmentation, reducing memory usage by up to 40% in Redis.
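The following toy sketch illustrates the handle indirection idea in isolation (it is not the ALASKA compiler or runtime): the program holds stable handles, and the runtime may move objects during compaction as long as it updates the handle table.

```python
# Toy model of handle-based indirection, assuming a simulated heap of slots.

class HandleHeap:
    def __init__(self):
        self.table = {}        # handle id -> current slot in self.slots
        self.slots = []        # simulated heap: a list that can contain holes (None)
        self.next_id = 0

    def alloc(self, obj):
        h = self.next_id; self.next_id += 1
        self.table[h] = len(self.slots)
        self.slots.append(obj)
        return h                       # the program only ever sees the handle

    def free(self, h):
        self.slots[self.table.pop(h)] = None

    def deref(self, h):
        return self.slots[self.table[h]]

    def compact(self):
        """Slide live objects to the front; handles stay valid."""
        live = sorted(self.table.items(), key=lambda kv: kv[1])
        self.slots = [self.slots[idx] for _, idx in live]
        self.table = {h: new_idx for new_idx, (h, _) in enumerate(live)}

heap = HandleHeap()
a, b, c = heap.alloc("A"), heap.alloc("B"), heap.alloc("C")
heap.free(b)
heap.compact()                  # "C" moves down into the freed slot
print(heap.deref(c))            # still "C": the handle survived the move
```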
12:00 PDT – 12:15 PDT

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety
Nathaniel Wesley Filardo (University of Cambridge and Microsoft) ;Brett F. Gutstein, Jonathan Woodruff, Jessica Clarke, and Peter Rugg (University of Cambridge) ;Brooks Davis (SRI International) ;Mark Johnston (University of Cambridge) ;Robert Norton (Microsoft) ;David Chisnall (SCI Semiconductor) ;Simon W. Moore (University of Cambridge) ;Peter G. Neumann (SRI International) ;Robert N. M. Watson (University of Cambridge)

Abstract: Violations of temporal memory safety (“use after free”, “UAF”) continue to pose a significant threat to software security. The CHERI capability architecture has shown promise as a technology for C and C++ language reference integrity and spatial memory safety. Building atop CHERI, prior works – CHERIvoke and Cornucopia – have explored adding heap temporal safety. The most pressing limitation of Cornucopia was its impractical “stop-the-world” pause times. We present Cornucopia Reloaded, a re-designed drop-in replacement implementation of CHERI temporal safety, using a novel architectural feature – a per-page capability load barrier, added in Arm’s Morello prototype CPU and CHERI-RISC-V – to nearly eliminate application pauses. We analyze the performance of Reloaded as well as Cornucopia and CHERIvoke on Morello, using the CHERI-compatible SPEC CPU2006 INT workloads to assess its impact on batch workloads and using pgbench and gRPC QPS as surrogate interactive workloads. Under Reloaded, applications no longer experience significant revocation-induced stop-the-world periods, without additional wall- or CPU-time cost over Cornucopia and with median 87% of Cornucopia’s DRAM traffic overheads across SPEC CPU2006 and < 50% for pgbench.
12:15 PDT – 12:30 PDT

Characterizing a Memory Allocator at Warehouse Scale
Zhuangzhuang Zhou (Cornell University) ;Vaibhav Gogte, Nilay Vaish, Chris Kennelly, Patrick Xia, Svilen Kanev, and Tipp Moseley (Google) ;Christina Delimitrou (MIT) ;Parthasarathy Ranganathan (Google)

Abstract: Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings. We present the first comprehensive characterization study of TCMalloc, a memory allocator used by warehouse-scale applications in Google’s production fleet. Our characterization reveals a profound diversity in the memory allocation patterns, allocated object sizes and lifetimes, for large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we optimize TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. For the applications with the highest memory allocation usage, we observe up to 8.1% and 6.3% improvement in throughput and memory usage respectively. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.
12:30 PDT – 12:45 PDT

More Apps, Faster Hot-Launch on Mobile Devices via Fore/Background-aware GC-Swap Co-design
Jiacheng Huang (City University of Hong Kong and Wuhan University) ;Yunmo Zhang and Junqiao Qiu (City University of Hong Kong) ;Yu Liang (ETH Zürich) ;Rachata Ausavarungnirun (King Mongkut’s University of Technology North Bangkok) ;Qingan Li (Wuhan University) ;Chun Jason Xue (Mohamed bin Zayed University of Artificial Intelligence)

Abstract: Faster app launching is crucial for the user experience on mobile devices. Apps launched from a background cached state, called hot-launching, have much better performance than apps launched from scratch. To increase the number of hot-launches, leading mobile vendors now cache more apps in the background by enabling swap. Recent work also proposed reducing the Java heap to increase the number of cached apps. However, this paper found that existing methods deteriorate app hot-launch performance while increasing the number of cached apps. To simultaneously improve the number of cached apps and hot-launch performance, this paper proposes Fleet, a foreground/background-aware GC-swap co-design framework. To enhance app-caching capacity, Fleet limits the tracing range of GC to background objects only, avoiding touching long-lifetime foreground objects. To improve hot-launch performance, Fleet identifies objects that will be accessed during the next hot-launch and uses runtime information to guide the swap scheme in the OS. In addition, Fleet aggregates small objects with similar access patterns into the same pages to improve swap efficiency. We implemented Fleet in AOSP and evaluated its performance with different types of apps. Experimental results show that Fleet achieves a 1.59× faster hot-launch time and caches 1.21× more apps than Android.
12:45 PDT – 13:00 PDT

MiniMalloc: A Lightweight Memory Allocator for Hardware-Accelerated Machine Learning
Michael D. Moffitt (Google)

Abstract: We present a new approach to static memory allocation, a key problem that arises in the compilation of machine learning models onto the resources of a specialized hardware accelerator. Our methodology involves a recursive depth-first search that limits exploration to a special class of canonical solutions, dramatically reducing the size of the search space. We also develop a spatial inference technique that exploits this special structure by pruning unpromising partial assignments and backtracking more effectively than otherwise possible. Finally, we introduce a new mechanism capable of detecting and eliminating dominated solutions from consideration. Empirical results demonstrate orders of magnitude improvement in performance as compared to the previous state-of-the-art on many benchmarks, as well as a substantial reduction in library size.
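For readers unfamiliar with the underlying problem, the sketch below sets up a miniature version of static memory allocation: buffers with sizes and live intervals are packed into a fixed arena by a depth-first search with backtracking that always tries the lowest feasible offset first. It illustrates the search space only and is not MiniMalloc's canonical-solution algorithm, spatial inference, or dominance pruning.

```python
# Toy static-allocation search, assuming half-open live intervals (start, end).

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def assign(buffers, budget, placed=None, i=0):
    """buffers: list of (size, (start, end)); returns {index: offset} or None."""
    placed = {} if placed is None else placed
    if i == len(buffers):
        return placed
    size, live = buffers[i]
    conflicts = [(placed[j], buffers[j][0]) for j in placed
                 if overlaps(live, buffers[j][1])]
    # candidate offsets: 0 and the end of every conflicting placement
    for off in sorted({0, *[o + s for o, s in conflicts]}):
        fits = off + size <= budget and all(
            off + size <= o or o + s <= off for o, s in conflicts)
        if fits:
            result = assign(buffers, budget, {**placed, i: off}, i + 1)
            if result is not None:
                return result
    return None                          # backtrack

bufs = [(32, (0, 4)), (32, (2, 6)), (64, (6, 8))]
print(assign(bufs, budget=64))           # {0: 0, 1: 32, 2: 0}
```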

Session 5D: Quantum Architecture
(Location: Scripps II)
11:45 PDT – 12:00 PDT

Codesign of quantum error-correcting codes and modular chiplets in the presence of defects
Sophia Fuhui Lin and Joshua Viszlai (University of Chicago) ;Kaitlin N. Smith (Infleqtion) ;Gokul Subramanian Ravi (University of Chicago) ;Charles Yuan (MIT CSAIL) ;Frederic T. Chong (University of Chicago) ;Benjamin J. Brown (IBM T. J. Watson Research Center)

Abstract: Fabrication errors pose a significant challenge in scaling up solid-state quantum devices to the sizes required for fault-tolerant (FT) quantum applications. To mitigate the resource overhead caused by fabrication errors, we combine two approaches: (1) leveraging the flexibility of a modular architecture, (2) adapting the procedure of quantum error correction (QEC) to account for fabrication defects. We simulate the surface code adapted to defective qubit arrays to find metrics that characterize how defects affect fidelity. We then use our simulations to determine the impact of defects on the resource overhead of realizing a fault-tolerant quantum computer on a chiplet-based modular architecture. Our QEC simulation adapts the syndrome readout circuit for the surface code to account for an arbitrary distribution of defects. Our simulations show that our strategy for dealing with fabrication defects demonstrates an exponential suppression of logical failure, where error rates of non-defective physical qubits are ∼ 0.1% for a circuit-based noise model. This is a typical regime on which we imagine running the defect-free surface code. We use our numerical results to establish post-selection criteria for assembling a device with defective chiplets. Using our criteria, we then evaluate the resource overhead in terms of the average number of physical qubits fabricated for a logical qubit to obtain a target logical error rate. We find that an optimal choice of chiplet size, based on the defect rate and target performance, is essential to limiting any additional error correction overhead due to defects. When the optimal chiplet size is chosen, at a defect rate of 1% the resource overhead can be reduced to below 3X and 6X respectively for the two defect models we use, for a wide range of target performance. Without tolerance to defects, the overhead grows exponentially as we increase the number of physical qubits in each logical qubit to achieve better performance, and also grows faster with an increase in the defect rate. When the target code distance is 27, the resource overhead of the defect-intolerant, modular approach is 45X and more than 10⁵X higher than the super-stabilizer approach, respectively, at a defect rate of 0.1% and 0.3%. We also determine cutoff fidelity values that help identify whether a qubit should be disabled or kept as part of the QEC code.
12:00 PDT – 12:15 PDT

MECH: Multi-Entry Communication Highway for Superconducting Quantum Chiplets
Hezi Zhang and Keyi Yin (University of California San Diego) ;Anbang Wu (University of California Santa Barbara) ;Hassan Shapourian and Alireza Shabani (Cisco Quantum Lab) ;Yufei Ding (University of California San Diego)

Abstract: Chiplet architecture is an emerging architecture for quantum computing that could significantly increase qubit resources with its great scalability and modularity. However, as the computing scale increases, communication between qubits would become a more severe bottleneck due to the long routing distances. In this paper, we propose a multi-entry communication highway (MECH) mechanism to trade ancillary qubits for program concurrency, and build a compilation framework to efficiently manage and utilize the highway resources. Our evaluation shows that this framework significantly outperforms the baseline approach in both the circuit depth and the number of operations on typical quantum benchmarks. This implies a more efficient and less error-prone compilation of quantum programs.
12:15 PDT – 12:30 PDT

QuFEM: Fast and Accurate Quantum Readout Calibration Using the Finite Element Method
Siwei Tan, Liqiang Lu, Hanyu Zhang, Jia Yu, Congliang Lang, Yongheng Shang, Xinkui Zhao, and Mingshuai Chen (Zhejiang University) ;Yun Liang (Peking University) ;Jianwei Yin (Zhejiang University)

Abstract: Quantum readout noise turns out to be the most significant source of error, which greatly affects the measurement fidelity. Matrix-based calibration has been demonstrated to be effective in various quantum platforms. However, existing methodologies are fundamentally limited in either scalability or accuracy. Inspired by the classical finite element method (FEM), a formal method to model the complex interaction between elements, we present our calibration framework named QuFEM. First, we apply a divide-and-conquer strategy that formulates the calibration as a series of tensor products with noise matrices. These matrices are iteratively characterized together with the calibrated probability distribution, aiming to capture the inherent locality of qubit interactions. Then, to accelerate the end-to-end calibration, we propose a sparse tensor-product engine to exploit the sparsity in the intermediate values. Our experiments show that QuFEM achieves 2.5×10³× speedup in the 136-qubit calibration compared to the state-of-the-art matrix-based calibration technique [50], and provides 1.2× and 1.4× fidelity improvement on the 18-qubit and 36-qubit real-world quantum devices.
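The matrix-based calibration that QuFEM scales up can be sketched for the idealized case of independent qubits: per-qubit confusion matrices are combined with a tensor (Kronecker) product and inverted to recover the ideal distribution. The NumPy toy below shows only this uncorrelated baseline, not QuFEM's iterative, locality-aware characterization or its sparse tensor-product engine.

```python
import numpy as np

# Baseline readout calibration sketch, assuming independent (uncorrelated) qubits.

def confusion(p01, p10):
    # column = true state, row = measured state
    # p01 = P(measure 1 | prepared 0), p10 = P(measure 0 | prepared 1)
    return np.array([[1 - p01, p10],
                     [p01,     1 - p10]])

A = np.kron(confusion(0.02, 0.05), confusion(0.03, 0.04))   # 2-qubit assignment matrix

p_true = np.array([0.5, 0.0, 0.0, 0.5])      # ideal Bell-state statistics
p_meas = A @ p_true                          # what the noisy readout reports
p_cal  = np.linalg.solve(A, p_meas)          # calibrated estimate
print(np.round(p_meas, 3), np.round(p_cal, 3))
```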
12:30 PDT – 12:45 PDT

A Fault-Tolerant Million Qubit-Scale Distributed Quantum Computer
Junpyo Kim, Dongmoon Min, Jungmin Cho, Hyeonseong Jeong, Ilkwon Byun, Junhyuk Choi, Juwon Hong, and Jangwoo Kim (Seoul National University)

Abstract: A million qubit-scale quantum computer is essential to realize quantum supremacy. Modern large-scale quantum computers integrate multiple quantum computers located in dilution refrigerators (DR) to overcome each DR’s non-scaling cooling budget. However, a large-scale multi-DR quantum computer introduces its unique challenges (i.e., slow and erroneous inter-DR entanglement, increased qubit scale), and they make the baseline error handling mechanism ineffective by increasing the number of gate operations and the inter-DR communication latency to decode and correct errors. Without resolving these challenges, it is impossible to realize a fault-tolerant large-scale multi-DR quantum computer. In this paper, we propose a million qubit-scale distributed quantum computer which uses a novel error handling mechanism enabling fault-tolerant multi-DR quantum computing. First, we apply a low-overhead multi-DR error syndrome measurement (ESM) sequence to reduce both the number of gate operations and the error rate. Second, we apply a scalable multi-DR error decoding unit (EDU) architecture to maximize both the decoding speed and accuracy. Our multi-DR error handling SW-HW co-design improves the ESM latency, ESM errors, EDU latency, and EDU accuracy by 3.7 times, 2.4 times, 685 times, and 6.1 · 10¹⁰ times, respectively. With our scheme applied to assumed voltage-scaled CMOS and mature ERSFQ technologies, we successfully build a fault-tolerant million qubit-scale quantum computer.
12:45 PDT – 13:00 PDT

Promatch: Extending the Reach of Real-Time Quantum Error Correction with Adaptive Predecoding
Narges Alavisamani, Suhas Vittal, and Ramin Ayanzadeh (Georgia Tech) ;Poulami Das (University of Texas at Austin) ;Moinuddin Qureshi (Georgia Tech)

Abstract: Fault-tolerant quantum computing relies on Quantum Error Correction (QEC), which encodes logical qubits into data and parity qubits. Error decoding is the process of translating the measured parity bits into types and locations of errors. To prevent a backlog of errors, error decoding must be performed in real-time (i.e., within 1 µs on superconducting machines). Minimum Weight Perfect Matching (MWPM) is an accurate decoding algorithm for surface code, and recent research has demonstrated real-time implementations of MWPM (RT-MWPM) for a distance of up to 9. Unfortunately, beyond d=9, the number of flipped parity bits in the syndrome, referred to as the Hamming weight of the syndrome, exceeds the capabilities of existing RT-MWPM decoders. In this work, our goal is to enable larger distance RT-MWPM decoders by using adaptive predecoding that converts high Hamming weight syndromes into low Hamming weight syndromes, which are accurately decoded by the RT-MWPM decoder. An effective predecoder must balance both accuracy (as any erroneous decoding by the predecoder contributes to the overall Logical Error Rate, termed LER) and coverage (as the predecoder must ensure that the Hamming weight of the syndrome is within the capability of the final decoder). In this paper, we propose Promatch, a real-time adaptive predecoder that predecodes both simple and complex patterns using a locality-aware, greedy approach. Our approach ensures two crucial factors: 1) high accuracy in prematching flipped bits, ensuring that the decoding accuracy is not hampered by the predecoder, and 2) enough coverage adjusted based on the main decoder’s capability given the time constraints. Promatch represents the first real-time decoding framework capable of decoding surface codes of distances 11 and 13, achieving an LER of 2.6 × 10⁻¹⁴ for distance 13. Moreover, we demonstrate that running Promatch concurrently with the recently proposed Astrea-G achieves LER equivalent to MWPM LER, 3.4 × 10⁻¹⁵, for distance 13, representing the first real-time accurate decoder for up to a distance of 13.

Session 6A: Bug Finding and Testing
(Location: Grande C)
14:30 PDT – 14:45 PDT

Greybox Fuzzing for Concurrency Testing
Dylan Wolff, Shi Zheng, Gregory J. Duck, Umang Mathur, and Abhik Roychoudhury (National University of Singapore)

Abstract: Uncovering bugs in concurrent programs is a challenging problem owing to the exponentially large search space of thread interleavings. Past approaches towards concurrency testing are either optimistic — relying on random sampling of these interleavings — or pessimistic — relying on systematic exploration of a reduced (bounded) search space. In this work, we suggest a fresh, pragmatic solution neither focused only on formal, systematic testing, nor solely on unguided sampling or stress-testing approaches. We employ a biased random search which guides exploration towards neighborhoods which will likely expose new behavior. As such it is thematically similar to greybox fuzz testing, which has proven to be an effective technique for finding bugs in sequential programs. To identify new behaviors in the domain of interleavings, we prune and navigate the search space using the “reads-from” relation. Our approach is significantly more efficient at finding bugs per schedule exercised than other state-of-the-art concurrency testing tools and approaches. Experiments on widely used concurrency datasets also show that our greybox fuzzing inspired approach gives a strict improvement over a randomized baseline scheduling algorithm in practice via a more uniform exploration of the schedule space. We make our concurrency testing infrastructure “Reads-From Fuzzer” (RFF) available for experimentation and usage by the wider community to aid future research.
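A minimal sketch of this coverage-guided flavor of schedule search is given below: schedules are thread interleavings of a toy two-thread program, the feedback signal is the set of reads-from pairs a schedule induces, and schedules exposing new pairs are kept as seeds and mutated further. The program, mutation operator, and feedback encoding are illustrative assumptions, not RFF's implementation.

```python
import random

# Each thread is a list of ("w", var, value) or ("r", var) operations.
THREADS = {
    0: [("w", "x", 1), ("r", "y")],
    1: [("w", "y", 1), ("w", "x", 2), ("r", "x")],
}

def run(schedule):
    """Execute the interleaving; return the reads-from pairs it induces."""
    pc = {t: 0 for t in THREADS}
    last_write = {}                        # var -> (thread, index) of latest write
    rf = set()
    for t in schedule:
        op = THREADS[t][pc[t]]
        if op[0] == "w":
            last_write[op[1]] = (t, pc[t])
        else:
            rf.add(((t, pc[t]), last_write.get(op[1])))
        pc[t] += 1
    return rf

def mutate(schedule):
    """Swap two positions; any permutation of the thread-id multiset is a valid interleaving."""
    s = schedule[:]
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

random.seed(0)
seen, seeds = set(), [[0, 0, 1, 1, 1]]
for _ in range(200):
    cand = mutate(random.choice(seeds))
    new = run(cand) - seen
    if new:                                # new reads-from behaviour: keep the seed
        seen |= new
        seeds.append(cand)
print(len(seen), "distinct reads-from pairs discovered")
```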
14:45 PDT – 15:00 PDT

Multi-Dimensional and Message-Guided Fuzzing for Robotic Programs in Robot Operating System
Jia-Ju Bai (Beihang University) ;Hao-Xuan Song and Shi-Min Hu (Tsinghua University)

Abstract: An increasing number of robotic programs are implemented based on Robot Operating System (ROS), which provides many practical tools and libraries for robot development. To improve robot reliability and security, several recent approaches apply fuzzing to ROS programs for bug detection. However, these approaches still have some main limitations, including inefficient test case generation, ineffective program feedback and weak generality/automation. In this paper, we design a new fuzzing framework named ROFER, to effectively test robotic programs in ROS for bug detection. Compared to existing ROS fuzzing approaches, ROFER has two novel techniques: (1) a dimension-level mutation method that considers the contribution of each input dimension to testing coverage, to generate efficient test cases from multiple dimensions; (2) a message-guided fuzzing approach that uses a new coverage metric named message feature, to reflect the robot’s possible state transitions affected by multiple ROS nodes. We evaluate ROFER on 13 common robotic programs in ROS2, and it finds 88 real bugs, 46 of which have been confirmed by ROS developers. We compare ROFER to four state-of-the-art ROS fuzzing approaches, and it finds more bugs with higher testing coverage.
15:00 PDT – 15:15 PDT

CSSTs: A Dynamic Data Structure for Partial Orders in Concurrent Execution Analysis
Hünkar Can Tunç (Aarhus University) ;Ameya Prashant Deshmukh (Indian Institute of Technology Bombay) ;Berk Cirisci (Amazon Web Services) ;Constantin Enea (Ecole Polytechnique) ;Andreas Pavlogiannis (Aarhus University)

Abstract: Dynamic analyses are a standard approach to analyzing and testing concurrent programs. Such techniques observe program traces σ and analyze them to infer the presence or absence of bugs. At its core, each analysis maintains a partial order P that represents order dependencies between the events of σ. Naturally, the scalability of the analysis largely depends on maintaining P efficiently. The standard data structure for this task has thus far been Vector Clocks. These, however, are slow for analyses that follow a non-streaming style, costing O(n) time for inserting (and propagating) each new ordering in P, where n is the size of σ, while they cannot handle the deletion of existing orderings. In this paper we develop Collective Sparse Segment Trees (CSSTs), a simple but elegant data structure for maintaining a partial order P. CSSTs thrive when the width k of P is much smaller than the size n of its domain, allowing inserting, deleting, and querying for orderings in P to run in O(log n) time. For a concurrent trace, k normally equals the number of its threads, and is orders of magnitude smaller than its size n, making CSSTs fitting for this setting. Our experiments confirm that CSSTs are the best data structure currently to handle a range of dynamic analyses from existing literature.
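As a simplified stand-in for the kind of per-thread-pair index such a structure maintains, the sketch below stores orderings of the form "event a of thread t happens before event b of thread u" in a dense range-minimum segment tree, so the query "is event i of t ordered before event j of u?" becomes a suffix-minimum check in O(log n). It is fixed-capacity and omits deletion and sparsity, both of which are essential to the actual CSSTs.

```python
INF = float("inf")

class PairIndex:
    """Dense range-min tree over events of thread t; capacity n must be a power of two."""
    def __init__(self, n):
        self.n = n
        self.tree = [INF] * (2 * n)

    def insert(self, a, b):
        """Record: event a of thread t happens before event b of thread u."""
        i = a + self.n
        self.tree[i] = min(self.tree[i], b)
        i //= 2
        while i:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def ordered(self, i, j):
        """Is event i of t (hence every earlier event of t) before event j of u?"""
        lo, hi, best = i + self.n, 2 * self.n - 1, INF
        while lo <= hi:
            if lo % 2 == 1:
                best = min(best, self.tree[lo]); lo += 1
            if hi % 2 == 0:
                best = min(best, self.tree[hi]); hi -= 1
            lo //= 2; hi //= 2
        return best <= j

idx = PairIndex(8)
idx.insert(3, 5)           # event 3 of t happens before event 5 of u
print(idx.ordered(2, 6))   # True: 2 <= 3 and 5 <= 6
print(idx.ordered(4, 9))   # False: no ordering is known from event 4 onward
```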
15:15 PDT – 15:30 PDT

UBFuzz: Finding Bugs in Sanitizer Implementations
Shaohua Li and Zhendong Su (ETH Zurich)

Abstract: In this paper, we propose a testing framework for validating sanitizer implementations in compilers. Our core components are (1) a program generator specifically designed for producing programs containing undefined behavior (UB), and (2) a novel test oracle for sanitizer testing. The program generator employs Shadow Statement Insertion, a general and effective approach for introducing UB into a valid seed program. The generated UB programs are subsequently utilized for differential testing of multiple sanitizer implementations. Nevertheless, discrepant sanitizer reports may stem from either compiler optimization or sanitizer bugs. To accurately determine if a discrepancy is caused by sanitizer bugs, we introduce a new test oracle called crash-site mapping. We have incorporated our techniques into UBfuzz, a practical tool for testing sanitizers. Over a five-month testing period, UBfuzz successfully found 31 bugs in both GCC and LLVM sanitizers. These bugs reveal the serious false negative problems in sanitizers, where certain UBs in programs went unreported. This research paves the way for further investigation in this crucial area of study.

Session 6B: Processing-In-Memory (PIM) for ML
(Location: Grande D/E)
14:30 PDT – 14:45 PDT

AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference
Jaehyun Park, Jaewan Choi, Kwanhee Kyung, Michael Jaemin Kim, and Yongsuk Kwon (Seoul National University) ;Nam Sung Kim (University of Illinois Urbana-Champaign) ;Jung Ho Ahn (Seoul National University)

Abstract: The Transformer-based generative model (TbGM), comprising summarization (Sum) and generation (Gen) stages, has demonstrated unprecedented generative performance across a wide range of applications. However, it also demands immense amounts of compute and memory resources. Especially, the Gen stages, consisting of the attention and fully-connected (FC) layers, dominate the overall execution time. Meanwhile, we reveal that the conventional system with GPUs used for TbGM inference cannot efficiently execute the attention layer, even with batching, due to various constraints. To address this inefficiency, we first propose AttAcc, a processing-in-memory (PIM) architecture for efficient execution of the attention layer. Subsequently, for the end-to-end acceleration of TbGM inference, we propose a novel heterogeneous system architecture and optimizations that strategically use xPU and PIM together. It leverages the high memory bandwidth of AttAcc for the attention layer and the powerful compute capability of the conventional system for the FC layer. Lastly, we demonstrate that our GPU-PIM system outperforms the conventional system with the same memory capacity, improving performance and energy efficiency of running a 175B TbGM by up to 2.81× and 2.67×, respectively.
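A back-of-the-envelope NumPy sketch of the imbalance described above: with batching, the FC layers share one weight matrix across requests (a GEMM with reuse), while each request attends over its own KV cache (per-request GEMVs with essentially no reuse), so the attention layer's arithmetic intensity stays low. The shapes below are arbitrary toy values, not the 175B-parameter configuration from the paper.

```python
import numpy as np

d, seq, batch = 1024, 2048, 8
W  = np.random.rand(d, d).astype(np.float32)            # shared FC weight
X  = np.random.rand(batch, d).astype(np.float32)        # one token per request
KV = [np.random.rand(seq, d).astype(np.float32) for _ in range(batch)]

fc_out = X @ W.T                                         # GEMM: weight reused across the batch
attn   = [kv @ x for kv, x in zip(KV, X)]                # GEMV per request: no reuse of KV

# rough FLOPs per byte of operand data read (outputs ignored)
flops_fc,   bytes_fc   = 2 * batch * d * d,   W.nbytes + X.nbytes
flops_attn, bytes_attn = 2 * batch * seq * d, sum(kv.nbytes for kv in KV)
print("FC   arithmetic intensity:", round(flops_fc / bytes_fc, 1))    # ~4.0
print("Attn arithmetic intensity:", round(flops_attn / bytes_attn, 2))  # ~0.5
```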
14:45 PDT – 15:00 PDT

SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration
Cong Li, Zhe Zhou, Size Zheng, Jiaxi Zhang, Yun Liang, and Guangyu Sun (Peking University)

Abstract: Generative large language models’ (LLMs) inference suffers from inefficiency because of the token dependency brought by autoregressive decoding. Recently, speculative inference has been proposed to alleviate this problem, which introduces small language models to generate draft tokens and adopts the original large language model to conduct verification. Although speculative inference can enhance the efficiency of the decoding procedure, we find that it presents variable resource demands due to the distinct computation patterns of the models used in speculative inference. This variability impedes the full realization of speculative inference’s acceleration potential in current systems. To tackle this problem, we propose SpecPIM to accelerate speculative inference on the PIM-enabled system. SpecPIM aims to boost the performance of speculative inference by extensively exploring the heterogeneity brought by both the algorithm and the architecture. To this end, we construct the architecture design space to satisfy each model’s disparate resource demands and dedicate the dataflow design space to fully utilize the system’s hardware resources. Based on the co-design space, we propose a design space exploration (DSE) framework to provide the optimal design under different target scenarios. Compared with speculative inference on GPUs and existing PIM-based LLM accelerators, SpecPIM achieves 1.52×/2.02× geomean speedup and 6.67×/2.68× geomean higher energy efficiency.
15:00 PDT – 15:15 PDT

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization
Cong Li and Zhe Zhou (Peking University) ;Yang Wang (Microsoft Research Asia) ;Fan Yang (Nankai University) ;Ting Cao and Mao Yang (Microsoft Research) ;Yun Liang and Guangyu Sun (Peking University)

Abstract: DRAM-based processing-in-memory (DRAM-PIM) has gained commercial prominence in recent years. However, their integration for deep learning acceleration poses inherent challenges. Existing DRAM-PIMs are limited in computational capabilities, primarily applicable for element-wise and GEMV operators. Unfortunately, these operators contribute only a small portion of the execution time in most DNN workloads. Current systems still necessitate powerful hosts to handle a significant portion of compute-heavy operators. To expand the applicability of commodity DRAM-PIMs in accelerating deep learning, we introduce a novel PIM-DL framework. The philosophy behind PIM-DL is to replace the compute-heavy GEMM operations in linear layers with Lookup-Tables (LUTs). Such LUT-based neural networks (LUT-NNs) substantially reduce multiplications in DNN inference, rendering them suitable for efficient execution on DRAM-PIMs. To accurately convert DNNs into LUT-NNs and achieve optimal inference serving performance, we first introduce an enhanced LUT-NN (eLUT-NN) algorithm for model calibration, then we propose an Auto-Tuner capable of optimizing the mapping parameters on diverse DRAM-PIM platforms. We evaluate PIM-DL on off-the-shelf UPMEM PIM-DIMM products and simulated HBM-PIM/AiM platforms across multiple contemporary DNN workloads. Compared with GEMM-based inference on DRAM-PIMs, PIM-DL achieves 22.6×~37.1× speedup. Compared with CPU/GPU-based inference, PIM-DL achieves up to 3.54×/1.20× speedup.
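The GEMM-to-LUT substitution can be sketched in a product-quantization style: inputs are split into sub-vectors, each sub-vector is snapped to its nearest codebook centroid, and the matrix product is replaced by sums of precomputed centroid-times-weight table entries. The NumPy toy below uses random (untrained) codebooks, so it shows the mechanics only, not eLUT-NN's calibration or accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_sub, n_cent = 64, 16, 8, 16
sub = d_in // n_sub

W = rng.standard_normal((d_in, d_out)).astype(np.float32)
C = rng.standard_normal((n_sub, n_cent, sub)).astype(np.float32)   # codebooks (untrained)

# Offline: one table per sub-space, T[s][c] = C[s, c] @ W[s*sub:(s+1)*sub]
T = np.stack([C[s] @ W[s * sub:(s + 1) * sub] for s in range(n_sub)])

def lut_matmul(x):
    """Approximate x @ W using table lookups instead of multiplications."""
    out = np.zeros((x.shape[0], d_out), dtype=np.float32)
    for s in range(n_sub):
        xs = x[:, s * sub:(s + 1) * sub]
        # nearest centroid per row in this sub-space
        codes = np.argmin(((xs[:, None, :] - C[s]) ** 2).sum(-1), axis=1)
        out += T[s][codes]                   # pure lookup + accumulate
    return out

x = rng.standard_normal((4, d_in)).astype(np.float32)
print(np.abs(lut_matmul(x) - x @ W).mean())  # approximation error of the LUT path
```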
15:15 PDT – 15:30 PDT

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing
Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, and Sanghyeon Lee (KAIST) ;Hyungkyu Ham and Gwangsun Kim (POSTECH) ;Divya Mahajan (Georgia Tech) ;Jongse Park (KAIST)

Abstract: Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but are less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation, while it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a “blocked” mode, allowing only either NPU or PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous management of memory read/write operations and PIM commands. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that compared to GPU-only, NPU-only, and a naïve NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3×, 2.4× and 1.6× throughput improvement, respectively.

Session 6C: Optimization of Tensor Programs
(Location: Scripps I)
14:30 PDT – 14:45 PDT

Felix: Optimizing Tensor Programs with Gradient Descent
Yifan Zhao, Hashim Sharif, Vikram Adve, and Sasa Misailovic (University of Illinois Urbana-Champaign)

Abstract: Obtaining high-performance implementations of tensor programs such as deep neural networks on a wide range of hardware remains a challenging task. Search-based tensor program optimizers can automatically find high-performance programs on a given hardware platform, but the search process in existing tools suffers from low efficiency, requiring hours or days to discover good programs due to the size of the search space. We present Felix, a novel gradient-based compiler optimization framework for tensor-based programs. Felix creates a differentiable space of tensor programs that is amenable to search by gradient descent. Felix applies continuous relaxation on the space of programs and creates a differentiable estimator of program latency, allowing efficient search of program candidates using gradient descent, in contrast to conventional approaches that search over a non-differentiable objective function over a discrete search space. We perform an extensive evaluation on six deep neural networks for vision and natural language processing tasks on three GPU-based platforms. Our experiments show that Felix surpasses the performance of off-the-shelf inference frameworks – PyTorch, Tensorflow, and TensorRT – within 7 minutes of search time on average. Felix also finds optimized programs significantly faster than TVM Ansor, a state-of-the-art search-based optimizer for tensor programs.
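The continuous-relaxation idea can be illustrated with a toy objective: a discrete choice (say, a tile size) is relaxed into a softmax distribution over candidates, the expected latency under that distribution is differentiable in the logits, and plain gradient descent concentrates the distribution on the fastest candidate. The candidate latencies below are made up and this is not Felix's latency estimator; it only shows the relaxation mechanics.

```python
import numpy as np

candidates = np.array([8, 16, 32, 64, 128])              # hypothetical tile sizes
latency    = np.array([9.1, 5.2, 3.8, 4.6, 7.3])         # made-up cost per candidate

theta = np.zeros(len(candidates))                        # relaxation parameters (logits)
for step in range(300):
    p = np.exp(theta - theta.max()); p /= p.sum()        # softmax over candidates
    expected = (p * latency).sum()                       # differentiable objective
    grad = p * (latency - expected)                      # d expected / d theta
    theta -= 0.5 * grad                                  # gradient descent step

p = np.exp(theta - theta.max()); p /= p.sum()
print("chosen tile size:", candidates[p.argmax()], "probabilities:", p.round(3))
```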
14:45 PDT – 15:00 PDT

Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions
Chunwei Xia, Jiacheng Zhao, and Qianqi Sun (Chinese Academy of Sciences) ;Zheng Wang (University of Leeds) ;Yuan Wen (University of Aberdeen) ;Teng Yu (Thewake Systems) ;Xiaobing Feng and Huimin Cui (Chinese Academy of Sciences)

Abstract: Optimizing deep neural network (DNN) execution is important but becomes increasingly difficult as DNN complexity grows. Existing DNN compilers cannot effectively exploit optimization opportunities across operator boundaries, leaving room for improvement. To address this challenge, we present Souffle, an open-source compiler that optimizes DNN inference across operator boundaries. Souffle creates a global tensor dependency graph using tensor expressions, traces data flow and tensor information, and partitions the computation graph into subprograms based on dataflow analysis and resource constraints. Within a subprogram, Souffle performs local optimization via semantic-preserving transformations, finds an optimized program schedule, and improves instruction-level parallelism and data reuse. We evaluated Souffle using six representative DNN models on an NVIDIA A100 GPU. Experimental results show that Souffle consistently outperforms six state-of-the-art DNN optimizers by delivering a geometric mean speedup of up to 3.7× over TensorRT and 7.8× over Tensorflow XLA.
15:00 PDT – 15:15 PDT

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
Jason Ansel, Edward Yang, and Horace He (Meta) ;Natalia Gimelshein (OpenAI) ;Animesh Jain, Michael Voznesensky, Bin Bao, David Berard, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, and Will Feng (Meta) ;Jiong Gong (Intel) ;Michael Gschwind, Brian Hirsh, Sherlock Huang, Laurent Kirsch, Michael Lazos, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, and Bert Maher (Meta) ;Yunjie Pan (University of Michigan) ;Christian Puhrsch, Matthias Reso, Mark Saroufim, Helen Suk, and Michael Suo (Meta) ;Phil Tillet (OpenAI) ;Eikan Wang (Intel) ;Xiaodong Wang, William Wen, Shunting Zhang, and Xu Zhao (Meta) ;Keren Zhou (OpenAI) ;Richard Zou, Ajit Mathews, Gregory Chanan, Peng Wu, and Soumith Chintala (Meta)

Abstract: This paper introduces two extensions to the popular PyTorch machine learning framework, TorchDynamo and TorchInductor, which implement the torch.compile feature released in PyTorch 2. TorchDynamo is a Python-level just-in-time (JIT) compiler that enables graph compilation in PyTorch programs without sacrificing the flexibility of Python. It achieves this by dynamically modifying Python bytecode before execution and extracting sequences of PyTorch operations into an FX graph, which is then JIT compiled using one of many extensible backends. TorchInductor is the default compiler backend for TorchDynamo, which translates PyTorch programs into OpenAI’s Triton for GPUs and C++ for CPUs. Results show that TorchDynamo is able to capture graphs more robustly than prior approaches while adding minimal overhead, and TorchInductor is able to provide a 2.27× inference and 1.41× training geometric mean speedup on an NVIDIA A100 GPU across 180+ real-world models, which outperforms six other compilers. These extensions provide a new way to apply optimizations through compilers in eager mode frameworks like PyTorch.
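For reference, the torch.compile entry point described in the paper is used as follows (requires PyTorch 2.x; TorchDynamo captures the graph and TorchInductor is the default backend):

```python
import torch

# Minimal torch.compile usage: wrap a function (or nn.Module) and call it as usual.
def f(x, y):
    return torch.sin(x) + torch.cos(y)

compiled_f = torch.compile(f)            # default backend: TorchInductor
x, y = torch.randn(1000), torch.randn(1000)
print(torch.allclose(compiled_f(x, y), f(x, y), atol=1e-5))
```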
15:15 PDT – 15:30 PDT

Optimal Kernel Orchestration for Tensor Programs with Korch
Muyan Hu (University of Illinois at Urbana-Champaign) ;Ashwin Venkatram (AMD) ;Shreyashri Biswas (Carnegie Mellon University) ;Balamurugan Marimuthu (Sambanova Systems) ;Bohan Hou and Gabriele Oliaro (Carnegie Mellon University) ;Haojie Wang and Liyan Zheng (Tsinghua University) ;Xupeng Miao (Carnegie Mellon University) ;Jidong Zhai (Tsinghua University) ;Zhihao Jia (Carnegie Mellon University)

Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7× on V100 GPUs and up to 1.6× on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.

Session 6D: Variational Quantum Computing
(Location: Scripps II)
14:30 PDT – 14:45 PDT

Elivagar: Efficient Quantum Circuit Search for Classification
Sashwat Anagolum (Pennsylvania State University) ;Narges Alavisamani (Georgia Tech) ;Poulami Das (The University of Texas at Austin) ;Moinuddin Qureshi (Georgia Tech) ;Yunong Shi (Amazon Quantum Technologies)

Abstract: Designing performant and noise-robust circuits for Quantum Machine Learning (QML) is challenging — the design space scales exponentially with circuit size, and there are few well-supported guiding principles for QML circuit design. Although recent Quantum Circuit Search (QCS) methods attempt to search for such circuits, they directly adopt designs from classical Neural Architecture Search (NAS) that are misaligned with the unique constraints of quantum hardware, resulting in high search overheads and severe performance bottlenecks. We present Élivágar, a novel resource-efficient, noise-guided QCS framework. Élivágar innovates in all three major aspects of QCS — search space, search algorithm and candidate evaluation strategy — to address the design flaws in current classically-inspired QCS methods. Élivágar achieves hardware-efficiency and avoids an expensive circuit-mapping co-search via noise- and device topology-aware candidate generation. By introducing two cheap-to-compute predictors, Clifford noise resilience and representational capacity, Élivágar decouples the evaluation of noise robustness and performance, enabling early rejection of low-fidelity circuits and reducing circuit evaluation costs. Due to its resource-efficiency, Élivágar can further search for data embeddings, significantly improving performance. Based on a comprehensive evaluation of Élivágar on 12 real quantum devices and 9 QML applications, Élivágar achieves 5.3% higher accuracy and a 271× speedup compared to state-of-the-art QCS methods.
14:45 PDT – 15:00 PDT

Red-QAOA: Efficient Variational Optimization through Circuit Reduction
Meng Wang (University of British Columbia and Pacific Northwest National Laboratory) ;Bo Fang and Ang Li (Pacific Northwest National Laboratory) ;Prashant J. Nair (University of British Columbia)

Abstract: The Quantum Approximate Optimization Algorithm (QAOA) addresses combinatorial optimization challenges by converting inputs to graphs. However, the optimal parameter searching process of QAOA is greatly affected by noise. Larger problems yield bigger graphs, requiring more qubits and making their outcomes highly noise-sensitive. This paper introduces Red-QAOA, leveraging energy landscape concentration via a simulated annealing-based graph reduction. Red-QAOA creates a smaller (distilled) graph with nearly identical parameters to the original graph. The distilled graph produces a smaller quantum circuit and thus reduces noise impact. At the end of the optimization, Red-QAOA employs the parameters from the distilled graph on the original graph and continues the parameter search on the original graph. Red-QAOA outperforms state-of-the-art Graph Neural Network (GNN)-based pooling techniques on 3200 real-world problems. Red-QAOA reduced node and edge counts by 28% and 37%, respectively, with a mean square error of only 2%.
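To give a feel for the simulated-annealing-based graph reduction the abstract describes, here is a toy sketch that removes nodes while trying to keep a simple graph statistic (average degree) close to the original's; the objective, acceptance rule, and parameters are illustrative and are not Red-QAOA's actual distillation criterion.

```python
# Toy simulated-annealing graph reduction: repeatedly try removing a node and
# keep the change if it does not distort average degree too much (or with a
# temperature-dependent probability). Objective and parameters are illustrative.
import math, random

random.seed(0)
n = 40
edges = {(i, j) for i in range(n) for j in range(i + 1, n) if random.random() < 0.15}
nodes = set(range(n))

def avg_degree(nodes, edges):
    if not nodes:
        return 0.0
    deg = sum(1 for (u, v) in edges if u in nodes and v in nodes) * 2
    return deg / len(nodes)

target = avg_degree(nodes, edges)
temp = 1.0
while len(nodes) > n // 2 and temp > 1e-3:
    cand = random.choice(sorted(nodes))
    trial = nodes - {cand}
    delta = abs(avg_degree(trial, edges) - target) - abs(avg_degree(nodes, edges) - target)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        nodes = trial            # accept the removal
    temp *= 0.97                 # cool down

print(f"kept {len(nodes)} of {n} nodes; avg degree {avg_degree(nodes, edges):.2f} vs target {target:.2f}")
```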
15:00 PDT – 15:15 PDT

VarSaw: Application-tailored Measurement Error Mitigation for Variational Quantum Algorithms
Siddharth Dangwal (University of Chicago) ;Gokul Subramanian Ravi (University of Michigan) ;Poulami Das (University of Texas at Austin) ;Kaitlin N. Smith (Infleqtion) ;Jonathan Mark Baker (University of Texas at Austin) ;Frederic T. Chong (University of Chicago)

Abstract: For potential quantum advantage, Variational Quantum Algorithms (VQAs) need high accuracy beyond the capability of today’s NISQ devices, and thus will benefit from error mitigation. In this work we are interested in mitigating measurement errors which occur during qubit measurements after circuit execution and tend to be the most error-prone operations, especially detrimental to VQAs. Prior work, JigSaw, has shown that measuring only small subsets of circuit qubits at a time and collecting results across all such ‘subset’ circuits can reduce measurement errors. Then, running the entire (‘global’) original circuit and extracting the qubit-qubit measurement correlations can be used in conjunction with the subsets to construct a high-fidelity output distribution of the original circuit. Unfortunately, the execution cost of JigSaw scales polynomially in the number of qubits in the circuit, and when compounded by the number of circuits and iterations in VQAs, the resulting execution cost quickly turns insurmountable. To combat this, we propose VarSaw, which improves JigSaw in an application-tailored manner, by identifying considerable redundancy in the JigSaw approach for VQAs: spatial redundancy across subsets from different VQA circuits and temporal redundancy across globals from different VQA iterations. VarSaw then eliminates these forms of redundancy by commuting the subset circuits and selectively executing the global circuits, reducing computational cost (in terms of the number of circuits executed) over naive JigSaw for VQA by 25x on average and up to 1000x, for the same VQA accuracy. Further, it can recover, on average, 45% of the infidelity from measurement errors in the noisy VQA baseline. Finally, it improves fidelity by 55%, on average, over JigSaw for a fixed computational budget. VarSaw can be accessed here: https://github.com/siddharthdangwal/VarSaw
15:15 PDT – 15:30 PDT

ProxiML: Building Machine Learning Classifiers for Photonic Quantum Computing
Aditya Ranjan (Northeastern University) ;Tirthak Patel (Rice University) ;Daniel Silver, Harshitta Gandhi, and Devesh Tiwari (Northeastern University)

Abstract: Quantum machine learning has shown early promise and potential for productivity improvements for machine learning classification tasks, but has not been systematically explored on photonic quantum computing platforms. Therefore, this paper presents the design and implementation of ProxiML – a novel quantum machine learning classifier for photonic quantum computing devices with multiple noise-aware design elements for effective model training and inference. Our extensive evaluation on a photonic device (Xanadu’s X8 machine) demonstrates the effectiveness of the ProxiML machine learning classifier (over 90% accuracy on a real machine for challenging four-class classification tasks) and its competitive classification accuracy compared to previously reported machine learning classifiers on other quantum platforms – revealing the previously unexplored potential of Xanadu’s X8 machine.

Session 7A: Architecture Support for ML
(Location: Grande C)
16:00 PDT – 16:15 PDT

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning
Kai Zhong and Zhenhua Zhu (Tsinghua University) ;Guohao Dai (Shanghai Jiao Tong University and Infinigence-AI) ;Hongyi Wang, Xinhao Yang, and Haoyu Zhang (Tsinghua University) ;Jin Si (Beijing University of Posts and Telecommunications) ;Qiuli Mao (Tsinghua University) ;Shulin Zeng (Tsinghua University and Infinigence-AI) ;Ke Hong (Tsinghua University) ;Genghan Zhang (Stanford University) ;Huazhong Yang and Yu Wang (Tsinghua University)

Abstract: Recently, sparse tensor algebra (SpTA) has played an increasingly important role in machine learning. However, due to the unstructured sparsity of SpTA, general-purpose processors (e.g., GPUs and CPUs) are inefficient because they underutilize hardware resources. Sparse kernel accelerators are optimized for specific tasks. However, their dedicated processing units and data paths cannot effectively support other SpTA tasks with different dataflow and various sparsity, resulting in performance degradation. This paper proposes FEASTA, a Flexible and Efficient Accelerator for Sparse Tensor Algebra. To process general SpTA tasks with various sparsity efficiently, we design FEASTA meticulously at three levels. At the dataflow abstraction level, we apply Einstein Summation on the sparse fiber tree data structure to model the unified execution flow of general SpTA as joining and merging the fiber tree. At the instruction set architecture (ISA) level, a general SpTA ISA is proposed based on the execution flow. It includes different types of instructions for dense and sparse data, achieving flexibility and efficiency at the instruction level. At the architecture level, an instruction-driven architecture consisting of configurable and high-performance function units is designed, supporting the flexible and efficient ISA. Evaluations show that FEASTA achieves a 5.40× geomean energy-efficiency improvement over a GPU across various workloads. FEASTA delivers 1.47× and 3.19× higher performance on sparse matrix multiplication kernels compared to a state-of-the-art sparse matrix accelerator and a CPU extension. Across diverse kernels, FEASTA achieves 1.69–12.70× energy efficiency over existing architectures.
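The "joining and merging the fiber tree" view mentioned above corresponds to the standard sparse-tensor-algebra co-iteration primitives: intersecting coordinates for multiplication-like operators and uniting them for addition-like operators. Below is a small, purely illustrative software sketch of these two primitives over sorted (coordinate, value) fibers; FEASTA implements them in hardware function units.

```python
# Illustration of the two primitive fiber operations behind sparse tensor
# algebra: "join" (coordinate intersection, e.g. elementwise multiply) and
# "merge" (coordinate union, e.g. elementwise add) over sorted (coord, value)
# fibers. This is the standard software view, not FEASTA's hardware datapath.
def join(fiber_a, fiber_b):
    out, i, j = [], 0, 0
    while i < len(fiber_a) and j < len(fiber_b):
        (ca, va), (cb, vb) = fiber_a[i], fiber_b[j]
        if ca == cb:
            out.append((ca, va * vb)); i += 1; j += 1
        elif ca < cb:
            i += 1
        else:
            j += 1
    return out

def merge(fiber_a, fiber_b):
    out, i, j = [], 0, 0
    while i < len(fiber_a) and j < len(fiber_b):
        (ca, va), (cb, vb) = fiber_a[i], fiber_b[j]
        if ca == cb:
            out.append((ca, va + vb)); i += 1; j += 1
        elif ca < cb:
            out.append((ca, va)); i += 1
        else:
            out.append((cb, vb)); j += 1
    return out + fiber_a[i:] + fiber_b[j:]

a = [(0, 2.0), (3, 1.0), (7, 4.0)]
b = [(3, 5.0), (5, 1.0), (7, 2.0)]
print(join(a, b))    # [(3, 5.0), (7, 8.0)]
print(merge(a, b))   # [(0, 2.0), (3, 6.0), (5, 1.0), (7, 6.0)]
```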
16:15 PDT – 16:30 PDT

CMC: Video Transformer Acceleration via CODEC Assisted Matrix Condensing
Zhuoran Song, Chunyu Qi, Fangxin Liu, Naifeng Jing, and Xiaoyao Liang (Shanghai Jiao Tong University)

Abstract: Video Transformers (VidTs) have reached the forefront of accuracy in various video understanding tasks. Despite their remarkable achievements, the processing requirements for a large number of video frames still present a significant performance bottleneck, impeding their deployment to resource-constrained platforms. While accelerators meticulously designed for Vision Transformers (ViTs) have emerged, they may not be the optimal solution for VidTs, primarily for two reasons. These accelerators tend to overlook the inherent temporal redundancy that characterizes VidTs, limiting their chance for further performance enhancement. Moreover, incorporating a sparse attention prediction module within these accelerators incurs a considerable overhead. To this end, we turn our attention to the video CODEC, which is essential for video preprocessing and can be utilized to detect the temporal and spatial similarity in raw video frames, showcasing the potential of exploiting the temporal and spatial redundancy in VidTs while avoiding significant prediction costs. This paper proposes CMC, the first CODEC-assisted algorithm-accelerator co-design framework for VidT acceleration. Specifically, on the algorithm side, we offer CODEC-friendly inter- and intra-matrix prediction algorithms to identify the informative data on the fly. We then design a recovery algorithm so that we can safely skip the computation on non-informative data in the temporal and spatial domains and recover their results by copying the informative data’s features to preserve accuracy. On the hardware side, we propose to augment the video CODEC so that it efficiently implements the inter- and intra-matrix prediction algorithms with negligible costs. Additionally, we propose a specialized CMC architecture that includes a recovery engine with fine-grained buffer management to translate the computational savings in the algorithm into real speedup. Experiments show that CMC achieves 2.1×, 8.8×, and 42.4× speedups over the state-of-the-art ViT accelerator HeatViT, an A100 GPU, and a Xeon CPU with negligible accuracy loss.
16:30 PDT – 16:45 PDT

Tandem Processor: Grappling with Emerging Operators in Neural Networks
Soroush Ghodrati, Sean Kinzer, Hanyang Xu, and Rohan Mahapatra (University of California San Diego) ;Yoonsung Kim (KAIST) ;Byung Hoon Ahn (University of California San Diego) ;Dong Kai Wang (University of Illinois Urbana-Champaign) ;Lavanya Karthikeyan (University of California San Diego) ;Amir Yazdanbakhsh (Google DeepMind) ;Jongse Park (KAIST) ;Nam Sung Kim (University of Illinois Urbana-Champaign) ;Hadi Esmaeilzadeh (University of California San Diego)

Abstract: With the ever-increasing prevalence of neural networks and the upheaval brought by language models, it is time to rethink neural acceleration. Up to this point, the broader research community, including ourselves, has disproportionately focused on GEneral Matrix Multiplication (GEMM) operations. The supporting argument was that the large majority of neural operations are GEMM. This argument guided the research in Neural Processing Units (NPUs) for the last decade. However, scant attention was paid to non-GEMM operations, which were largely overlooked. As deep learning evolved and progressed, these operations have grown in diversity, and a large variety of structural patterns has emerged that interweaves them with the GEMM operations. However, conventional NPU designs have taken rather simplistic approaches, supporting these operations either through a number of dedicated blocks or by falling back to general-purpose processors. This work sets out to challenge the conventional wisdom in neural accelerator design and explore the architecture of an on-chip companion, dubbed Tandem Processor, that complements the rather optimized GEMM unit in neural accelerators. This processor needs to be specialized to keep up with the GEMM unit; and yet needs to be programmable to address the (1) structural and (2) operational variations. To strike a balance between specialization and programmability, on the one hand, we specialize its memory access logic with a novel ISA/microarchitecture that alleviates the register file and its associated load/store operations. On the other hand, the calculations of the non-GEMM layers are only supported through primitive arithmetic/logic vector operations. Therefore, programmability is offered at the mathematical level. The enhancements due to the specialization of the memory access logic in the Tandem Processor and its tight integration with the GEMM unit sustain the throughput and the utilization of the neural accelerator. Comprehensive evaluations of the proposed design based on the end-to-end execution of seven diverse DNNs including emerging language models show significant performance improvements and energy reduction enabled by leveraging the Tandem Processor. We provide the RTL code that is synthesizable both for FPGA and ASIC implementations in addition to the associated compiler as part of the open-source GeneSys project (https://actlab-genesys.github.io/). We also present the chip floorplan and post-layout analysis. This work is the result of 10 years of effort in building real NPUs that support end-to-end neural network execution.
16:45 PDT – 17:00 PDT

Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs
Zhewen Pan and Joshua San Miguel (University of Wisconsin-Madison) ;Di Wu (University of Central Florida)

Abstract: In recent years, hardware architectures optimized for general matrix multiplication (GEMM) have been well studied to deliver better performance and efficiency for deep neural networks. With trends towards batched, low-precision data, e.g., FP8 format in this work, we observe that there is growing untapped potential for value reuse. We propose a novel computing paradigm, value-level parallelism, whereby unique products are computed only once, and different inputs subscribe to (select) their products via temporal coding. Our architecture, Carat, employs value-level parallelism and transforms multiplication into accumulation, performing GEMMs with efficient multiplier-free hardware. Experiments show that, on average, Carat improves iso-area throughput and energy efficiency by 1.02× and 1.06× over a systolic array and 3.2× and 4.3× when scaled up to multiple nodes.
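A toy software analogue of the value-level parallelism idea: with low-precision operands, many (a, b) pairs repeat, so each unique product is computed once and all occurrences reuse it. The data and the small-integer stand-in for FP8 values below are made up for illustration.

```python
# Toy illustration of value-level parallelism: with low-precision inputs many
# operand pairs repeat, so each unique product is computed once and every
# occurrence "subscribes" to the cached result. Data values are made up.
import random

random.seed(1)
a = [random.randint(-8, 7) for _ in range(1024)]   # stand-in for FP8-like data
b = [random.randint(-8, 7) for _ in range(1024)]

pairs = list(zip(a, b))
unique = set(pairs)
products = {p: p[0] * p[1] for p in unique}         # each unique product once

dot = sum(products[p] for p in pairs)               # every pair reuses the table
print(f"{len(pairs)} multiplies requested, {len(unique)} actually computed")
print("dot product:", dot, "==", sum(x * y for x, y in pairs))
```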

Session 7B: Program and Configuration Synthesis
(Location: Grande D/E)
16:00 PDT – 16:15 PDT

NetRen: Service Migration-Driven Network Renascence with Synthesizing Updated Configuration
Rongxin Han, Jingyu Wang, Qi Qi, Haifeng Sun, Chaowei Xu, Zhaoyang Wan, and Zirui Zhuang (Beijing University of Posts and Telecommunications) ;Yichuan Yu (Huawei) ;Jianxin Liao (Beijing University of Posts and Telecommunications)

Abstract: Changes in enterprise networks require updated configurations. However, manual configuration is slow to update, yields poor performance, and handles only limited scenarios, leaving updated networks unavailable. Therefore, we propose an efficient network renascence framework, NetRen, which synthesizes OSPF/BGP configurations driven by service and traffic migration. We follow a workflow of sketch extraction, configuration synthesis, and repair. Initially, comprehensive graphs are constructed to represent configuration sketches. We propose a GraphTrans synthesizer that benefits from the Transformer’s long-range focus and parallel reasoning. Training samples that encode the optimization relationship enable the synthesizer to learn a mapping that optimizes performance based on configurations. To overcome the satisfiability barrier, configurations from the synthesizer are fed to a stepwise configuration repairer as well-initialized solutions, achieving rapid configuration repair. Experiments demonstrate that the consistency of network configurations output by the GraphTrans synthesizer averages 98%. NetRen achieves a 312.4× increase in synthesis efficiency and a 5.83% improvement in network performance.
16:15 PDT – 16:30 PDT

Hydride: A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures
Akash Kothari, Abdul Rafae Noor, Muchen Xu, Hassam Uddin, Dhruv Baronia, Stefanos Baziotis, Vikram Adve, and Charith Mendis (University of Illinois at Urbana-Champaign) ;Sudipta Sengupta (Amazon Web Services)

Abstract: As modern hardware architectures evolve to support increasingly diverse, complex instruction sets for meeting the performance demands of modern workloads in image processing, deep learning, etc., it has become ever more crucial for compilers to provide robust support for the evolution of their internal abstractions and retargetable code generation to keep pace with emerging instruction sets. We propose Hydride, a novel approach to compiling for complex, emerging hardware architectures. Hydride uses vendor-defined pseudocode specifications of multiple hardware ISAs to automatically design retargetable instructions for AutoLLVM IR, an extensible compiler IR which consists of (formally defined) language-independent and target-independent LLVM IR instructions to compile to those ISAs, and automatically generated instruction selection passes to lower AutoLLVM IR to each of the specified hardware ISAs. Hydride also includes a code synthesizer that automatically generates code generation support for schedule-based languages, such as Halide, to optimally generate AutoLLVM IR. Our results show that Hydride is able to represent 3,557 instructions from the x86, Hexagon, and ARM architectures combined, including the (Intel) SSE2, SSE4, AVX, AVX2, AVX512, (Qualcomm) Hexagon HVX, and (ARM) NEON vector ISAs, using only 397 AutoLLVM IR instructions. We created a new Halide compiler with Hydride using only a formal semantics of Halide IR, leveraging the auto-generated AutoLLVM IR and back-ends for the three hardware architectures. Across kernels from deep learning and image processing, this compiler performs just as well as the mature, production Halide compiler on Hexagon and outperforms it on x86 by 8% and on ARM by 3%. Hydride also outperforms the production Halide’s LLVM back end by 12% on x86, 100% on HVX, and 26% on ARM across the same kernels.
16:30 PDT – 16:45 PDT

RTL-Repair: Fast Symbolic Repair of Hardware Design Code
Kevin Laeufer, Brandon Fajardo, Abhik Ahuja, Vighnesh Iyer, Borivoje Nikolić, and Koushik Sen (University of California Berkeley)

Abstract: We present RTL-Repair, a semantics-based repair tool for register transfer level circuit descriptions. Compared to the previous state-of-the-art tool, RTL-Repair generates more correct repairs within seconds instead of minutes or even hours. We imagine that RTL-Repair could thus be integrated into an IDE to give developers repair suggestions promptly. Our new SMT-based one-step fault localization and repair algorithm for digital hardware designs uses optimization to generate minimal changes that a user can easily understand. A novel adaptive windowing approach allows us to avoid scalability issues by focusing the repair search on the parts of the test that matter the most. RTL-Repair provides repairs that pass their testbench for 9 out of 12 real bugs collected from open-source hardware projects. Two repairs fully match the ground truth, one partially, four more repairs change the correct expression but in a way that overfits the testbench, and only three repairs differ strongly from the ground truth.
16:45 PDT – 17:00 PDT

SIRO: Empowering Version Compatibility in Intermediate Representations via Program Synthesis
Bowen Zhang and Wei Chen (The Hong Kong University of Science and Technology) ;Peisen Yao (Zhejiang University) ;Chengpeng Wang, Wensheng Tang, and Charles Zhang (The Hong Kong University of Science and Technology)

Abstract: This paper presents Siro, a new program transformation framework that translates between different versions of Intermediate Representations (IR), aiming to better address the issue of IR version incompatibility on IR-based software, such as static analyzers. We introduce a generic algorithm skeleton for Siro based on the divide-and-conquer principle. To minimize labor-intensive tasks of the implementation process, we further employ program synthesis to automatically generate translators for IR instructions within vast search spaces. Siro is instantiated on LLVM IR and has effectively helped to produce ten well-functioning IR translators for different version pairs, each taking less than three hours. From a practical perspective, we utilize these translators to assist static analyzers and fuzzers in reporting bugs and achieving accuracy of 91% and 95%, respectively. Remarkably, Siro has already been deployed in real-world scenarios and makes existing static analyzers available to safeguard the Linux kernel by uncovering 80 new vulnerabilities.

Session 7C: Storage Optimizations in Software
(Location: Scripps I)
16:00 PDT – 16:15 PDT

MemSnap μCheckpoints: A Data Single Level Store for Fearless Persistence
Emil Tsalapatis, Ryan Hancock, Rakeeb Hossain, and Ali José Mashtizadeh (University of Waterloo)

Abstract: Single level stores (SLSes) have recently resurfaced as a system for persisting application data. SLSes like EROS, Aurora, and TreeSLS use application checkpointing to replace file-based APIs. These systems checkpoint at a coarse granularity and must be combined with file persistence mechanisms like WALs, undermining the benefits of their SLS design. We present MemSnap, a new system that completes the single level store vision by eliminating the need for the WAL API. MemSnap introduces µCheckpoints that persist updates to memory for individual write transactions concurrently with other threads. We introduce a novel per-thread dirty set tracking mechanism in the kernel and use it to transparently persist application data. We use virtual memory techniques to prevent modifications to in-flight µCheckpoints, without blocking the application. MemSnap-based persistence has 4.5×–30× lower latency than file-based random IO and is within 2× of direct disk IO latency. We integrate MemSnap with the SQLite, RocksDB, and PostgreSQL databases, gaining performance while retaining ACID. MemSnap increases the throughput of SQLite by 5× over file APIs and achieves a 4× throughput improvement for RocksDB compared to Aurora.
16:15 PDT – 16:30 PDT

Grafu: Unleashing the Full Potential of Future Value Computation for Out-of-core Synchronous Graph Processing
Tsun-Yu Yang (The Chinese University of Hong Kong) ;Cale England (Oklahoma State University) ;Yi Li and Bingzhe Li (University of Texas at Dallas) ;Ming-Chang Yang (The Chinese University of Hong Kong)

Abstract: As graphs have grown exponentially in recent years, out-of-core graph systems have been developed to process large-scale graphs by keeping massive data in storage. Among them, many systems process graphs iteration by iteration and provide synchronous semantics, which allows easy programmability by enforcing the computation dependency of vertex values between iterations. On the other hand, although future value computation is an effective IO optimization for out-of-core graph systems that computes vertex values of future iterations in advance, it is challenging to take full advantage of it while guaranteeing iteration-based dependency. In fact, based on our investigation, even state-of-the-art work along this direction leaves a wide gap to optimality in IO reduction and further requires substantial computational overhead as well as extra memory consumption. This paper presents Grafu, an out-of-core graph system that unleashes the full potential of future value computation while providing synchronous semantics. For this goal, three main designs are proposed to optimize future value computation from different perspectives. First, we propose a new elevator execution order to significantly increase the number of future-computed vertices for considerable IO reduction. Second, unlike existing work that uses high-cost barriers to ensure dependency under future value computation, we present a conditional barrier to alleviate this computational overhead by adaptively removing barriers while guaranteeing dependency. Third, we introduce a new graph reordering technique, greedy coloring, to reduce the extra memory consumption required for future value computation. Our evaluation on various billion-scale graphs reveals that Grafu significantly outperforms other state-of-the-art systems.
16:30 PDT – 16:45 PDT

CrossPrefetch: Accelerating I/O Prefetching for Modern Storage
Shaleen Garg and Jian Zhang (Rutgers University) ;Rekha Pitchumani (Samsung) ;Manish Parashar (University of Utah) ;Bing Xie (Microsoft) ;Sudarsun Kannan (Rutgers University)

Abstract: We introduce CrossPrefetch, a novel cross-layered I/O prefetching mechanism that operates across the OS and a user-level runtime to achieve optimal performance. Existing OS prefetching mechanisms have rigid interfaces that give applications no information about prefetch effectiveness, suffer from high concurrency bottlenecks, and are inefficient in utilizing available system memory. CrossPrefetch addresses these limitations by dividing responsibilities between the OS and runtime, minimizing overhead, and achieving low cache misses, low lock contention, and higher I/O performance. CrossPrefetch tackles the limitations of rigid OS prefetching interfaces by maintaining and exporting cache state and prefetch effectiveness to user-level runtimes. It also addresses scalability and concurrency bottlenecks by distinguishing between regular I/O and prefetch operation paths and introducing fine-grained prefetch indexing for shared files. Finally, CrossPrefetch designs low-interference access pattern prediction combined with support for adaptive and aggressive techniques to exploit memory capacity and storage bandwidth. Our evaluation of CrossPrefetch, encompassing microbenchmarks, macrobenchmarks, and real-world workloads, shows performance gains of 1.22×–3.7× in I/O throughput. We also evaluate CrossPrefetch across different file systems and local and remote storage configurations.
16:45 PDT – 17:00 PDT

Palantir: Hierarchical Similarity Detection for Post-Deduplication Delta Compression
Hongming Huang (City University of Hong Kong and Huawei) ;Peng Wang (Huawei) ;Qiang Su (City University of Hong Kong) ;Hong Xu (The Chinese University of Hong Kong) ;Chun Jason Xue (City University of Hong Kong and Mohamed bin Zayed University of Artificial Intelligence) ;André Brinkmann (Johannes Gutenberg University Mainz)

Abstract: Deduplication compresses backup data by identifying and removing duplicate blocks. However, deduplication cannot detect when two blocks are very similar, which opens up opportunities for further data reduction using delta compression. Most existing works find similar blocks by characterizing each block by a set of features and matching similar blocks using coarse-grained super-features. If two blocks share a super-feature, delta compression only needs to store their delta for the new block. Existing delta compression techniques constrain their super-features to find only matching blocks that are likely to be very similar. Palantir introduces hierarchical super-features with different sensitivities to block similarities to find more candidates of similar blocks and increase overall backup compression including deduplication by 7.3% over N-Transform and Odess and 26.5% over Finesse. The overhead of Palantir for storing additional super-features is overcome by exploiting temporal localities of backup streams, and the throughput penalty is within 7.7%. Palantir also introduces a false positive filter to discard matching blocks that are counterproductive for the overall data reduction.
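For readers unfamiliar with feature/super-feature matching, the sketch below shows the classic scheme that post-deduplication delta compression builds on: derive several features per block from window hashes, group them into coarser super-features, and treat blocks sharing any super-feature as delta candidates. The hash, transforms, and constants are illustrative and do not reflect Palantir's hierarchical design.

```python
# Sketch of the classic super-feature scheme for similarity detection: derive
# N features per block from window hashes under different linear transforms,
# then group consecutive features into coarser super-features. Two blocks that
# share any super-feature become delta-compression candidates.
# Window size, transforms, and grouping are illustrative.
import hashlib, random

MASK = (1 << 32) - 1

def window_hash(block, start, window):
    h = 0
    for byte in block[start:start + window]:
        h = (h * 257 + byte) & MASK
    return h

def features(block, n_features=12, window=48):
    # feature_i = max over all windows of ((a_i * h + b_i) mod 2^32)
    transforms = [(2 * i + 3, 7 * i + 1) for i in range(n_features)]
    feats = [0] * n_features
    for start in range(len(block) - window + 1):
        h = window_hash(block, start, window)
        for i, (a, b) in enumerate(transforms):
            feats[i] = max(feats[i], (a * h + b) & MASK)
    return feats

def super_features(feats, group=4):
    # a super-feature is a hash over a group of consecutive features
    return [hashlib.sha1(repr(feats[i:i + group]).encode()).hexdigest()[:8]
            for i in range(0, len(feats), group)]

random.seed(7)
blk1 = bytes(random.getrandbits(8) for _ in range(4096))
blk2 = blk1[:3000] + b"\x00\x00" + blk1[3002:]    # nearly identical block
print("shared:", set(super_features(features(blk1))) & set(super_features(features(blk2))))
```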

Session 7D: Graph Neural Networks
(Location: Scripps II)
16:00 PDT – 16:15 PDT

TGLite: A Lightweight Programming Framework for Continuous-Time Temporal Graph Neural Networks
Yufeng Wang and Charith Mendis (University of Illinois at Urbana-Champaign)

Abstract: In recent years, Temporal Graph Neural Networks (TGNNs) have achieved great success in learning tasks for graphs that change over time. These dynamic/temporal graphs represent topology changes as either discrete static graph snapshots (called DTDGs), or a continuous stream of timestamped edges (called CTDGs). Because continuous-time graphs have richer time information, it will be crucial to have abstractions for programming CTDG-based models so that practitioners can easily explore new designs and optimizations in this space. A few recent frameworks have been proposed for programming and accelerating TGNN models, but these either do not support continuous-time graphs, lack easy composability, and/or do not facilitate CTDG-specific optimizations. In this paper, we propose a lightweight framework called TGLite to fill this apparent gap in the status quo. It provides abstractions that serve as composable building blocks for implementing TGNN models for CTDGs. It introduces a novel TBlock representation for capturing message-flow dependencies between nodes, with explicit support for temporal-related attributes, which is well-suited for common TGNN computation patterns. TBlocks serve as a central representation on which many different operators can be defined, such as temporal neighborhood sampling, scatter/segmented computations, as well as optimizations tailored to CTDGs. We use TGLite to implement four existing TGNN models. Compared to the TGL framework, TGLite is able to accelerate runtime performance of training (1.06 − 3.43×) and inference (1.09 − 4.65×) of these models on V100 and A100 GPUs across different experimental settings. Notably, when scaling to larger datasets, TGL runs out-of-memory in some cases on the V100 while TGLite is able to run successfully.
16:15 PDT – 16:30 PDT

MaxK-GNN: Extremely Fast GPU Kernel Design for Accelerating Graph Neural Networks Training
Hongwu Peng and Xi Xie (University of Connecticut) ;Kaustubh Shivdikar (Northeastern University) ;Md Amit Hasan, Jiahui Zhao, Shaoyi Huang, and Omer Khan (University of Connecticut) ;David Kaeli (Northeastern University) ;Caiwen Ding (University of Connecticut)

Abstract: In the acceleration of deep neural network training, the graphics processing unit (GPU) has become the mainstream platform. GPUs face substantial challenges when executing Graph Neural Networks (GNNs), such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges. However, the memory traffic involved with Sparse-Dense Matrix Matrix Multiplication (SpMM) is still significant. We argue that drastic performance improvements can only be achieved by the vertical optimization of algorithm and system innovations, rather than treating the speedup optimization as an “after-thought” (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity and provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after nonlinearity; (ii) We design a coalescing enhanced forward computation with row-wise product-based Sparse Matrix-Matrix Multiplication (SpGEMM) Kernel using CBSR for input feature matrix fetching and strategic placement of a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with outer product-based and Sampled Sparse Matrix Dense Matrix Multiplication (SSpMM) Kernel. We conduct extensive evaluations of MaxK-GNN and report the system training time. Experiments show that the MaxK-GNN system can approach the speedup limit given by Amdahl’s law. We achieve comparable accuracy to SOTA GNNs, but at a significantly increased speed: a 3.22×/4.24× speedup (versus theoretical limits of 5.52×/7.27×) on Reddit compared to the DGL and GNNAdvisor implementations. Our implementation is available on GitHub.
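The MaxK nonlinearity at the heart of the abstract keeps only the k largest entries of each feature row and zeroes the rest, making the activations sparse (which the CBSR format then stores compactly). A minimal PyTorch sketch follows, with illustrative shapes and k.

```python
# Minimal sketch of a MaxK-style nonlinearity: keep only the k largest entries
# of each row of the node-feature matrix and zero the rest, which makes the
# activation output sparse (the paper stores it in its CBSR format).
# Tensor shapes and k are illustrative.
import torch

def maxk(x: torch.Tensor, k: int) -> torch.Tensor:
    topk_vals, topk_idx = x.topk(k, dim=-1)
    out = torch.zeros_like(x)
    return out.scatter(-1, topk_idx, topk_vals)

features = torch.randn(5, 16)            # 5 nodes, 16-dim hidden features
sparse_feats = maxk(features, k=4)
print((sparse_feats != 0).sum(dim=-1))   # each row keeps at most 4 nonzeros
```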
16:30 PDT – 16:45 PDT

Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures
Kun Wu (University of Illinois at Urbana-Champaign) ;Mert Hidayetoğlu (Stanford University) ;Xiang Song (AWS AI) ;Sitao Huang (University of California Irvine) ;Da Zheng and Israt Nisa (AWS AI) ;Wen-Mei Hwu (Nvidia and University of Illinois at Urbana-Champaign)

Abstract: Relational graph neural networks (RGNNs) are graph neural networks with dedicated structures for modeling the different types of nodes and edges in heterogeneous graphs. While RGNNs have been increasingly adopted in many real-world applications due to their versatility and accuracy, they pose performance and system design challenges: inherent memory-intensive computation patterns, the gap between the programming interface and kernel APIs, and heavy programming effort required to optimize kernels caused by their coupling with data layout and heterogeneity. To systematically address these challenges, we propose Hector, a novel two-level intermediate representation and its code generator framework that (a) captures the key properties of RGNN models, and opportunities to reduce memory accesses in inter-operator scheduling and materialization, (b) generates code with flexible data access schemes to eliminate redundant data copies, and (c) decouples model semantics, data layout, and operators-specific optimizations from each other to reduce programming effort. By building on one general matrix multiply (GEMM) template and a node/edge traversal template, Hector achieves up to 9.9× speed-up in inference and 43.7× speed-up in training compared with the state-of-the-art public systems on select models, RGCN, RGAT and HGT, when running heterogeneous graphs provided by Deep Graph Library (DGL) and Open Graph Benchmark (OGB). In addition, Hector does not trigger any out-of-memory (OOM) exception in these tests. We also propose linear operator reordering and compact materialization to further accelerate the system by up to 3.8×. As an indicator of the reduction of programming effort, Hector takes in 51 lines of code expressing the three models and generates a total of 8K lines of CUDA and C++ code. Through profiling, we found that higher memory efficiency allows Hector to accommodate larger input and therefore attain higher throughput in forward propagation, while backward propagation is bound by latency introduced by atomic updates and outer products.
16:45 PDT – 17:00 PDT

Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks
Yu Gan, Guiyang Liu, Xin Zhang, Qi Zhou, Jiesheng Wu, and Jiangwei Jiang (Alibaba)

Abstract: Cloud microservices are being scaled up due to the rising demand for new features and the convenience of cloud-native technologies. However, the growing scale of microservices complicates the remote procedure call (RPC) dependency graph, exacerbates the tail-of-scale effect, and makes many of the empirical rules for detecting the root cause of end-to-end performance issues unreliable. Additionally, existing open-source microservice benchmarks are too small to evaluate performance debugging algorithms at a production-scale with hundreds or even thousands of services and RPCs. To address these challenges, we present Sleuth, a trace-based root cause analysis (RCA) system for large-scale microservices using unsupervised graph learning. Sleuth leverages a graph neural network to capture the causal impact of each span in a trace, and trace clustering using a trace distance metric to reduce the amount of traces required for root cause localization. A pre-trained Sleuth model can be transferred to different microservice applications without any retraining or with few-shot fine-tuning. To quantitatively evaluate the performance and scalability of Sleuth, we propose a method to generate microservice benchmarks comparable to a production-scale. The experiments on the existing benchmark suites and synthetic large-scale microservices indicate that Sleuth has significantly outperformed the prior work in detection accuracy, performance, and adaptability on a large-scale deployment.

Session 8A: Caching and Prefetching
(Location: Grande C)
10:00 PDT – 10:15 PDT

PDIP: Priority Directed Instruction Prefetching
Bhargav Reddy Godala (Princeton University) ;Sankara Prasad Ramesh (University of California San Diego) ;Gilles A. Pokam, Jared Stark, and Andre Seznec (Intel) ;Dean Tullsen (University of California San Diego) ;David I. August (Princeton University)

Abstract: Modern server workloads have large code footprints which are prone to front-end bottlenecks due to instruction cache capacity misses. Even with the aggressive fetch directed instruction prefetching (FDIP), implemented in modern processors, there are still significant front-end stalls due to I-Cache misses. A major portion of misses that occur on a BPU-predicted path are tolerated by FDIP without causing stalls. Prior work on instruction prefetching, however, has not been designed to work with FDIP processors. Their singular goal is reducing I-Cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (that are not fully hidden by FDIP), and only targeting them. In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches for only targets where FDIP struggles – along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates them with a trigger for future prefetch. At a 43.5KB budget, PDIP achieves up to 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
10:15 PDT – 10:30 PDT

Limoncello: Prefetchers for Scale
Akanksha Jain (Google) ;Hannah Lin (Google and University of Washington) ;Carlos Villavieja (Google) ;Baris Kasikci (Google and University of Washington) ;Chris Kennelly, Milad Hashemi, and Parthasarathy Ranganathan (Google)

Abstract: This paper presents Limoncello, a novel software system that dynamically configures data prefetchers for high-utilization systems. We demonstrate that in resource-constrained environments, such as large data centers, traditional methods of hardware prefetching can increase memory latency and decrease available memory bandwidth. To address this issue, Limoncello disables hardware prefetchers when memory bandwidth utilization is high, and it leverages targeted software prefetching to reduce cache misses when hardware prefetchers are disabled. Limoncello is software-centric and does not require any modifications to hardware. Our evaluation of the deployment on Google’s fleet reveals that Limoncello unlocks significant performance gains for high-utilization systems: It improves application throughput by 10%, due to a 15% reduction in memory latency, while maintaining minimal change in cache miss rate for targeted library functions.
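The policy sketched in the abstract (disable hardware prefetchers when bandwidth utilization is high, lean on targeted software prefetching meanwhile, and re-enable them when pressure subsides) can be pictured as a small hysteresis control loop. The sketch below is hypothetical: the thresholds, utilization samples, and toggle hook are placeholders, not the deployed fleet implementation.

```python
# Hypothetical sketch of the control policy described in the abstract: watch
# memory-bandwidth utilization and toggle the hardware prefetchers around a
# high/low watermark (with hysteresis), relying on targeted software prefetches
# in library code while they are off. Thresholds, samples, and the toggle hook
# are placeholders, not the actual system.
HIGH_WATERMARK = 0.80          # illustrative thresholds
LOW_WATERMARK = 0.60

def set_hw_prefetchers(enabled: bool) -> None:
    # Placeholder for a platform-specific mechanism (e.g. a model-specific register).
    print(f"hardware prefetchers {'enabled' if enabled else 'disabled'}")

def control_loop(utilization_samples) -> None:
    hw_on = True
    for util in utilization_samples:           # one sample per polling interval
        if hw_on and util > HIGH_WATERMARK:
            set_hw_prefetchers(False)          # bandwidth is scarce: cut speculative traffic
            hw_on = False
        elif not hw_on and util < LOW_WATERMARK:
            set_hw_prefetchers(True)           # headroom is back: re-enable
            hw_on = True

control_loop([0.40, 0.72, 0.85, 0.88, 0.70, 0.55, 0.30])
```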
10:30 PDT – 10:45 PDT

PATHFINDER: Practical Real-Time Learning for Data Prefetching
Lin Jia, James Patrick Mcmahon, Sumanth Gudaparthi, Shreyas Singh, and Rajeev Balasubramonian (University of Utah)

Abstract: Data prefetching is vital in high-performance processors and a large body of research has introduced a number of different approaches for accurate prefetching: stride detection, address correlating prefetchers, delta pattern detection, irregular pattern detection, etc. Most recently, a few works have leveraged advances in machine learning and deep neural networks to design prefetchers. These neural-inspired prefetchers observe data access patterns and develop a trained model that can then make accurate predictions for future accesses. A significant impediment to the success of these prefetchers is their high implementation cost, for both inference and training. These models cannot be trained in real-time, i.e., they have to be trained beforehand with a large benchmark suite. This results in a large model (that increases the overhead for inference), and the model can only successfully predict patterns that are similar to patterns in the training set. In this work, we explore the potential of using spiking neural networks to learn and predict data access patterns, specifically address deltas. While prior work has leaned on the recent success of trained artificial neural networks, we hypothesize that spiking neural networks are a better fit for real-time data prefetching. Spiking neural networks rely on the STDP algorithm to learn while performing inference – it is a low-cost and local learning algorithm that can quickly observe and react to the current stream of accesses. It is therefore possible to achieve high accuracy on previously unseen access patterns with a relatively small spiking neural network. This paper makes the case that spiking neurons and STDP offer a complexity-effective approach to leverage machine learning for data prefetching. We show that the proposed Pathfinder prefetcher is competitive with other state-of-the-art prefetchers on a range of benchmarks. Pathfinder can be implemented at 0.5 W with an area footprint of only 0.23 mm² (12 nm technology). This work shows that neural-inspired prefetching can be both practical and capable of high performance, but more innovations in SNN/STDP prefetch algorithms are required to fully maximize their potential.
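For context on the learning rule, the sketch below shows plain pair-based STDP, the local update the abstract leans on: a synapse is potentiated when the presynaptic spike precedes the postsynaptic one and depressed otherwise, with exponentially decaying magnitude. Constants are illustrative and unrelated to Pathfinder's hardware parameters.

```python
# Tiny illustration of pair-based STDP, the local learning rule the abstract
# builds on: strengthen a synapse when the presynaptic spike precedes the
# postsynaptic spike, weaken it otherwise, with exponentially decaying
# magnitude in the spike-time difference. Constants are illustrative.
import math

A_PLUS, A_MINUS = 0.05, 0.055     # potentiation / depression amplitudes
TAU = 20.0                        # time constant (in timesteps)

def stdp_delta(t_pre: float, t_post: float) -> float:
    dt = t_post - t_pre
    if dt >= 0:                                # pre fired before post: potentiate
        return A_PLUS * math.exp(-dt / TAU)
    return -A_MINUS * math.exp(dt / TAU)       # post fired first: depress

w = 0.5
for t_pre, t_post in [(10, 14), (30, 31), (52, 48)]:
    w = min(1.0, max(0.0, w + stdp_delta(t_pre, t_post)))
print(f"final weight: {w:.3f}")
```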
10:45 PDT – 11:00 PDT

Skip It: Take Control of Your Cache!
Shashank Anand and Michal Friedman (ETH Zurich) ;Michael Giardino (Huawei) ;Gustavo Alonso (ETH Zurich)

Abstract: Mechanisms to explicitly manage the presence of data in caches are fundamental for the correctness and performance of modern systems. These operations, while critical, often incur significant performance penalties even when carefully used. Moreover, these mechanisms are implemented in proprietary and often undocumented hardware, so research into optimizations and novel designs is mostly limited to slow, simplified software simulations. In this paper, we design microarchitectural extensions to support two types of user-controlled cache writebacks to main memory. Furthermore, we propose Skip It, a mechanism built on top of our extensions that substantially reduces redundant writebacks. We implemented these designs on the open-source BOOM out-of-order RISC-V CPU. The performance in hardware is ≈100 cycles which favorably compares to similar operations in commercially available server-class platforms. In addition, Skip It performs as well as or better than state-of-the-art software techniques for avoiding unnecessary writebacks.
11:00 PDT – 11:15 PDT

RPG^2: Robust Profile-Guided Runtime Prefetch Generation
Yuxuan Zhang, Nathan Sobotka, and Soyoon Park (University of Pennsylvania) ;Saba Jamilan (University of California Santa Cruz) ;Tanvir Ahmed Khan (Columbia University) ;Baris Kasikci (University of Washington and Google) ;Gilles A Pokam (Intel) ;Heiner Litz (University of California Santa Cruz) ;Joseph Devietti (University of Pennsylvania)

Abstract: Data cache prefetching is a well-established optimization to overcome the limits of the cache hierarchy and keep the processor pipeline fed with data. In principle, accurate, well-timed prefetches can sidestep the majority of cache misses and dramatically improve performance. In practice, however, it is challenging to identify which data to prefetch and when to do so. In particular, data can be easily requested too early, causing eviction of useful data from the cache, or requested too late, failing to avoid cache misses. Competition for limited off-chip memory bandwidth must also be balanced between prefetches and a program’s regular “demand” accesses. Due to these challenges, prefetching can both help and hurt performance, and the outcome can depend on program structure, decisions about what to prefetch and when to do it, and, as we demonstrate in a series of experiments, program input, processor microarchitecture, and their interaction as well. To try to meet these challenges, we have designed the RPG2 system for online prefetch injection and tuning. RPG2 is a pure-software system that operates on running C/C++ programs, profiling them, injecting prefetch instructions, and then tuning those prefetches to maximize performance. Across dozens of inputs, we find that RPG2 can provide speedups of up to 2.15×, comparable to the best profile-guided prefetching compilers, but can also respond when prefetching ends up being harmful and roll back to the original code – something that static compilers cannot. RPG2 improves prefetching robustness by preserving its performance benefits, while avoiding slowdowns.

Session 8B: Memory: Address Translation and Tiering
(Location: Grande D/E)
10:00 PDT – 10:15 PDT

METAL: Caching Multi-level Indexes in Domain-Specific Architectures
Anagha Molakalmur Anil Kumar and Aditya Prasanna (Simon Fraser University) ;Jonathan Balkind (University of California Santa Barbara) ;Arrvindh Shriraman (Simon Fraser University)

Abstract: State-of-the-art domain-specific architectures (DSAs) work with sparse data and need hardware support for index data structures [31, 43, 57, 61]. Indexes are more space-efficient for sparse data and reduce DRAM bandwidth if data reuse can be managed. However, indexes exhibit dynamic accesses, chase pointers, and need to walk-and-search. This inflates the working set and thrashes the cache. We observe that the cache organization itself is responsible for this behavior. We develop METAL, a portable caching idiom that enables DSAs to employ index data structures. METAL decouples reuse of the index metadata from data reuse and optimizes it independently. We propose two ideas: i) IX-Cache: a cache that leverages range tags to short-circuit index walks and reduce the working set. IX-Cache helps capture the trade-off between wider index nodes that maximize reach and nodes closer to the leaves that minimize walk latency. ii) Reuse Patterns: an interface to explicitly manage the cache. Patterns orchestrate cache insertions and bypasses as we dynamically traverse different index regions. METAL improves performance vs. streaming DSAs by 7.8×, address-caches by 4.1×, and a state-of-the-art DSA-cache [50] by 2.4×. We reduce DRAM energy by 1.6× vs. the prior state-of-the-art.
10:15 PDT – 10:30 PDT

GMT: GPU Orchestrated Memory Tiering for the Big Data Era
Chia-Hao Chang, Jihoon Han, and Anand Sivasubramaniam (Pennsylvania State University) ;Vikram Sharma Mailthody, Zaid Qureshi, and Wen-Mei Hwu (NVIDIA Research)

Abstract: As the demand for processing larger datasets increases, GPUs need to reach deeper into their (memory) hierarchy to directly access capacities that only storage systems (SSDs) can hold. However, the state-of-the-art mechanisms to reach storage either employ software stacks running on the host CPUs as intermediaries (e.g. Dragon, HMM), which has been noted to perform poorly and not able to meet the throughput needs of GPU cores, or directly access SSDs through NVMe queues (BaM) which does not benefit from lower latencies that may be possible by having the host memory as an intermediate tier. This paper presents the design and implementation of GPU Memory Tiering (GMT) by implementing a GPU-orchestrated 3-tier hierarchy comprising GPU memory, host memory and SSDs, where the GPU orchestrates most of the transfers that are bandwidth/latency sensitive. Additionally, it is important to not blindly transfer pages from the GPU memory to host memory upon an eviction, and GMT employs a reuse-prediction based practical insertion policy to perform discretionary page placement/bypass. An implementation and evaluation on an actual platform demonstrates that GMT performs 50% better than the state-of-the-art 2-tier strategy (BaM) and over 350% better than the state-of-the-art 3-tier strategy that is orchestrated by host CPUs (HMM), over a number of GPU applications with diverse memory access characteristics.
10:30 PDT – 10:45 PDT

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
Cong Guo (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Rui Zhang (Ant Group) ;Jiale Xu, Jingwen Leng, Zihan Liu, Ziyu Huang, and Minyi Guo (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Hao Wu, Shouren Zhao, Junping Zhao, and Ke Zhang (Ant Group)

Abstract: Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, where the memory capacity of a single acceleration device like a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., 10×) of GPUs’ native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator’s efficiency degrades quickly for popular memory reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that those memory reduction techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation problems for the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism, which can fuse or combine non-contiguous memory blocks with a virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by 15% (up to 33%) across eight LLM models on an A100 GPU with 80 GB of memory. GMLake is completely transparent to the DNN models and memory reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machine-learning/glake/tree/main/GMLake.
10:45 PDT – 11:00 PDT

Direct Memory Translation for Virtualized Clouds
Jiyuan Zhang (University of Illinois Urbana-Champaign) ;Weiwei Jia (University of Rhode Island) ;Siyuan Chai, Peizhe Liu, Jongyul Kim, and Tianyin Xu (University of Illinois Urbana-Champaign)

Abstract: Virtual memory translation has become a key performance bottleneck of memory-intensive workloads in virtualized cloud environments. On the x86 architecture, a nested translation needs to sequentially fetch up to 24 page table entries (PTEs). This paper presents Direct Memory Translation (DMT), a hardware-software extension for x86-based virtual memory that minimizes translation overhead while maintaining backward compatibility with x86. In DMT, the OS manages last-level PTEs in a contiguous physical memory region, termed Translation Entry Areas (TEAs). DMT establishes a direct mapping from each virtual page in a Virtual Memory Area (VMA) to the corresponding PTE in a TEA. Since processes manage memory with a handful of major VMAs, the mapping can be maintained per VMA and effectively stored in a few dedicated registers. DMT further optimizes virtualized memory translation via guest-host cooperation by directly allocating guest TEAs in physical memory, bypassing intermediate virtualization layers. DMT is inherently scalable—it takes one, two, and three memory references in native, virtualized, and nested virtualized setups. Its scalability enables hardware-assisted translation for nested virtualization. Our evaluation shows that DMT significantly speeds up page walks by an average of 1.58x (1.65x with THP) in a virtualized setup, resulting in 1.20x (1.14x with THP) speedup of application execution on average.
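The direct mapping described above reduces a last-level PTE lookup to simple address arithmetic once a VMA's PTEs live contiguously in a TEA. Below is a small sketch of that arithmetic, assuming standard x86-64 4 KiB pages and 8-byte PTEs; the specific base addresses are illustrative.

```python
# Arithmetic behind a direct VMA -> TEA mapping as described in the abstract:
# with last-level PTEs for a VMA stored contiguously, the PTE address of any
# virtual page is base + index * entry_size, one memory reference away.
# Base addresses are illustrative; 4 KiB pages and 8-byte PTEs are the
# standard x86-64 parameters.
PAGE_SHIFT = 12        # 4 KiB pages
PTE_SIZE = 8           # bytes per page-table entry

def pte_address(vaddr: int, vma_start: int, tea_base: int) -> int:
    page_index = (vaddr - vma_start) >> PAGE_SHIFT
    return tea_base + page_index * PTE_SIZE

vma_start = 0x7f00_0000_0000       # start of a large mapped region
tea_base  = 0x1_2345_0000          # physical base of that VMA's TEA
vaddr     = 0x7f00_0004_a123       # some address inside the VMA

print(hex(pte_address(vaddr, vma_start, tea_base)))
# page index = 0x4a, so the PTE lives at tea_base + 0x4a * 8 = 0x123450250
```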
11:00 PDT – 11:15 PDT

WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers
Hongliang Qu and Zhibin Yu (Chinese Academy of Sciences)

Abstract: Recently, page-table self-replication (PTSR) has been proposed to reduce the page-table caused NUMA effect for large-memory workloads on NUMA servers. However, PTSR may improve or hurt performance of an application, depending on its characteristics and the co-located applications. This is hard for users to know, but current PTSR can only be manually enabled/disabled by users. To address this issue, we propose WASP (Workload-Aware Self-Replication) to automatically enable/disable PTSR to reduce the page-table caused NUMA effect. WASP innovates two techniques. First, it identifies a set of indicators, which are generally available on most processor architectures, to indicate if PTSR should be enabled/disabled. Second, WASP devises a hierarchical as well as gradual mechanism using these indicators to enable/disable PTSR automatically for a specific workload to reduce the page-table caused NUMA effect during its execution. We implement WASP in Linux and evaluate it on x86 and ARM NUMA servers. Experimental results show that WASP can automatically enable/disable PTSR successfully according to workload characteristics, achieving at least the same performance improvement as that obtained by manually enabling PTSR on both x86 and ARM NUMA servers.

Session 8C: High Performance Systems
(Location: Scripps I/II)
10:00 PDT – 10:15 PDT

Supporting Descendants in SIMD-Accelerated JSONPath
Mateusz Gienieczko and Filip Murlak (University of Warsaw) ;Charles Paperman (CRIStAL, Université de Lille, INRIA)

Abstract: Harnessing the power of SIMD can bring tremendous performance gains in data processing. In querying streamed JSON data, the state of the art leverages SIMD to fast forward significant portions of the document. However, it does not provide support for descendant, which excludes many real-life queries and makes formulating many others hard. In this work, we aim to change this: we consider the fragment of JSONPath that supports child, descendant, wildcard, and labels. We propose a modular approach based on novel depth-stack automata that process a stream of events produced by a state-driven classifier, allowing fast forwarding parts of the input document irrelevant at the current stage of the computation. We implement our solution in Rust and compare it with the state of the art, confirming that our approach allows supporting descendants without sacrificing performance, and that reformulating natural queries using descendants brings impressive performance gains in many cases.
10:15 PDT – 10:30 PDT

Boost Linear Algebra Computation Performance via Efficient VNNI Utilization
Hao Zhou and Qiukun Han (Enflame Tech) ;Heng Shi (Enflame Tech and Shanghai Jiao Tong University) ;Yalin Zhang (Enflame Tech Inc.) ;Jianguo Yao (Enflame Tech and Shanghai Jiao Tong University)

Abstract: Intel’s Vector Neural Network Instruction (VNNI) provides higher efficiency for dense linear algebra (DLA) computations than conventional SIMD instructions. However, existing auto-vectorizers frequently deliver suboptimal utilization of VNNI, either failing to recognize VNNI’s unique computation pattern in the innermost loops/basic blocks or producing inferior code through constrained and rudimentary peephole optimizations/pattern-matching techniques. Auto-tuning frameworks might generate proficient code but are hampered by the need for sophisticated pattern templates and extensive search processes. This paper introduces a novel compilation methodology that generates high-performance VNNI-enabled code. By leveraging DLA’s salient characteristics to identify opportunities to utilize VNNI, it pinpoints the most effective strategies for feeding VNNI’s inputs and hiding VNNI’s execution latency via efficient memory access, register/cache reuse, and instruction-level parallelism. A tailored static cost analysis model guides this exploration, determining critical parameters for effective code transformation and generation. The evaluation on DLA and DNN workloads shows that our framework outperforms state-of-the-art industrial compilers and research works.
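As background, a single AVX-512 VNNI instruction such as VPDPBUSD multiplies four unsigned 8-bit values with four signed 8-bit values per 32-bit lane and accumulates the sum into that lane, which is why it suits int8 DLA kernels. Below is a scalar Python model of one lane; real code operates on 16 such lanes per 512-bit register, and the saturating variants are omitted.

```python
# Scalar model of what one 32-bit lane of the AVX-512 VNNI instruction
# VPDPBUSD computes: four unsigned-8-bit x signed-8-bit products summed into a
# 32-bit accumulator. Real code operates on 16 such lanes per 512-bit register;
# saturating variants are omitted here.
def vpdpbusd_lane(acc: int, a_bytes: list[int], b_bytes: list[int]) -> int:
    assert len(a_bytes) == len(b_bytes) == 4
    for a, b in zip(a_bytes, b_bytes):
        assert 0 <= a <= 255 and -128 <= b <= 127
        acc += a * b
    return acc

# one lane of an int8 dot product, as a DLA kernel would feed it
activations = [17, 200, 3, 96]        # unsigned 8-bit inputs
weights     = [-5, 12, 77, -1]        # signed 8-bit inputs
print(vpdpbusd_lane(0, activations, weights))   # 17*-5 + 200*12 + 3*77 + 96*-1 = 2450
```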
10:30 PDT – 10:45 PDT

A shared compilation stack for distributed-memory parallelism in stencil DSLs
George Bisbas (Imperial College London) ;Anton Lydike, Emilien Bauer, Nick Brown, and Mathieu Fehr (University of Edinburgh) ;Lawrence Mitchell (Unaffiliated) ;Gabriel Rodriguez-Canal and Maurice Jamieson (University of Edinburgh) ;Paul H J Kelly (Imperial College London) ;Michel Steuwer (Technische Universität Berlin) ;Tobias Grosser (University of Cambridge)

Abstract: Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. Tensorflow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem.
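As a rough illustration of the message-passing abstraction for distributed stencils, the toy below splits a 1D domain across "ranks", exchanges one-cell halos, and applies a 3-point stencil locally. The decomposition, the kernel, and the absence of real MPI are simplifications of ours, not the dialect introduced in the paper.

```python
# Toy 1D halo-exchange stencil (plain Python, ranks simulated as list slices).

def split(domain, nranks):
    n = len(domain) // nranks
    return [domain[r * n:(r + 1) * n] for r in range(nranks)]

def exchange_halos(parts):
    """Pad each local part with its neighbours' boundary cells (zero at the ends)."""
    padded = []
    for r, part in enumerate(parts):
        left = parts[r - 1][-1] if r > 0 else 0.0              # "recv" from left rank
        right = parts[r + 1][0] if r < len(parts) - 1 else 0.0  # "recv" from right rank
        padded.append([left] + part + [right])
    return padded

def stencil_step(padded):
    return [[(p[i - 1] + p[i] + p[i + 1]) / 3.0 for i in range(1, len(p) - 1)]
            for p in padded]

domain = [float(i) for i in range(16)]
parts = split(domain, nranks=4)
print(stencil_step(exchange_halos(parts)))   # one distributed time step
```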
10:45 PDT – 11:00 PDT

SlimSLAM: An Adaptive Runtime for Visual-Inertial Simultaneous Localization and Mapping
Armand Behroozi and Yuxiang Chen (University of Michigan) ;Vlad Fruchter, Lavanya Subramanian, and Sriseshan Srikanth (Meta) ;Scott Mahlke (University of Michigan)

Abstract: Simultaneous localization and mapping (SLAM) algorithms track an agent’s movements through an unknown environment. SLAM must be fast and accurate to avoid adverse effects such as motion sickness in AR/VR headsets and navigation errors in autonomous robots and drones. However, accurate SLAM is computationally expensive and target platforms are often highly constrained. Therefore, to maintain real-time functionality, designers must either pay a large up-front cost to design specialized accelerators or reduce the algorithm’s functionality, resulting in poor pose estimation. We find that most SLAM algorithms are statically configured for the worst-case input and mismanage their computation budget. Thus, we present SlimSLAM, a domain-specific runtime scheduler which adapts SLAM algorithmic parameters based on input needs, minimizing computation while maintaining accuracy. SlimSLAM exploits information from a SLAM algorithm’s state to detect and adjust over-provisioned parameters in real-time. We demonstrate SlimSLAM on the state-of-the-art real-time visual-inertial SLAM algorithm: HybVIO. We show that SlimSLAM is able to provide speedups up to 5.4× on the EuRoC, TUM-VI Room, and Monado VR datasets and outperforms other adaptive approaches on average by 2.3×-1.3× with iso-accuracy. This enables more accurate SLAM on constrained computation platforms such as the Raspberry Pi 4 (RPi4) where SlimSLAM is faster and more accurate than both HybVIO’s static RPi4 configuration as well as other SLAM algorithms.
11:00 PDT – 11:15 PDT

Compiling Loop-Based Nested Parallelism for Irregular Workloads
Yian Su (Northwestern University) ;Mike Rainey (Carnegie Mellon University) ;Nick Wanninger, Nadharm Dhiantravan, and Jasper Liang (Northwestern University) ;Umut A. Acar (Carnegie Mellon University) ;Peter Dinda and Simone Campanoni (Northwestern University)

Abstract: Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.
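A tiny sequential simulation of the heartbeat idea behind HBC is sketched below: parallelism is only "promoted" into a real task when a fixed interval has elapsed since the last promotion, so per-task overhead is amortized however irregular the iterations are. The heartbeat value, the work items, and the single-threaded queue are placeholders of ours; the real system compiles this control into the binary and hands promoted tasks to parallel workers.

```python
import time

HEARTBEAT = 0.001   # promotion interval in seconds (assumed; tuned in practice)

def parallel_for(items, body, heartbeat=HEARTBEAT):
    queue = [list(items)]          # in a real runtime, other workers would steal from here
    last_promo = time.perf_counter()
    promotions = 0
    while queue:
        work = queue.pop()
        while work:
            now = time.perf_counter()
            if now - last_promo >= heartbeat and len(work) > 1:
                half = len(work) // 2
                queue.append(work[half:])   # heartbeat fired: split off half the iterations
                work = work[:half]
                last_promo, promotions = now, promotions + 1
            body(work.pop(0))               # otherwise keep running sequentially
    return promotions

n = parallel_for(range(20000), lambda i: sum(range(i % 64)))
print("tasks promoted:", n)
```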

Session 8D: IoT and Embedded
(Location: Fairway I/IV)
10:00 PDT – 10:15 PDT

TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays
Massimo Giordano, Rohan Doshi, and Qianyun Lu (Stanford University) ;Boris Murmann (University of Hawaii)

Abstract: The proliferation of smart IoT devices has given rise to tinyML, which deploys deep neural networks on resource-constrained systems, benefitting from custom hardware that optimizes for low silicon area and high energy efficiency amidst tinyML’s characteristic small model sizes (50-500 KB) and low target frequencies (1-100 MHz). We introduce a novel custom latch array integrated with a compute memory fabric, achieving 8 μm²/B density and 11 fJ/B read energy, surpassing synthesized implementations by 7x in density and 5x in read energy. This advancement enables dataflows that do not require activation buffers, reducing memory overheads. By optimizing systolic vs. combinational scaling in a 2D compute array and using bit-serial instead of bit-parallel compute, we achieve a reduction of 4.8x in area and 2.3x in multiply-accumulate energy. To study the advantages of the proposed architecture and its performance at the system level, we architect tinyForge, a design space exploration to obtain Pareto-optimal architectures and compare the trade-offs with respect to traditional approaches. tinyForge comprises (1) a parameterized template for memory hierarchies and compute fabric, (2) estimations of power, area, and latency for hardware components, (3) a dataflow optimizer for efficient workload scheduling, and (4) a genetic algorithm performing multi-objective optimization to find Pareto-optimal architectures. We evaluate the performance of our proposed architecture on all of the MLPerf Tiny Inference Benchmark workloads, and the BERT-Tiny transformer model, demonstrating its effectiveness in lowering the energy per inference while addressing the introduced area overheads. We show the importance of storing all the weights on-chip, reducing the energy per inference by 7.5x vs. utilizing off-chip memories. Finally, we demonstrate the potential of the custom latch arrays and bit-serial digital compute arrays to reduce by up to 1.8x the energy per inference, 2.2x the latency per inference, and 3.7x the silicon area.
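The multi-objective selection step that an explorer like tinyForge's genetic algorithm converges to can be illustrated with a plain Pareto filter over (energy, area) candidates; the candidate names and numbers below are invented for illustration.

```python
def pareto_front(points):
    """points: list of (name, energy, area); smaller is better for both objectives."""
    front = []
    for name, e, a in points:
        dominated = any(e2 <= e and a2 <= a and (e2 < e or a2 < a)
                        for _, e2, a2 in points)
        if not dominated:
            front.append((name, e, a))
    return front

candidates = [("A", 10.0, 4.0), ("B", 7.0, 5.0), ("C", 12.0, 3.5), ("D", 8.0, 6.5)]
print(pareto_front(candidates))   # D is dominated by B; A, B, C trade off energy vs. area
```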
10:15 PDT – 10:30 PDT

MulBERRY: Enabling Bit-Error Robustness for Energy-Efficient Multi-Agent Autonomous Systems
Zishen Wan (Georgia Tech) ;Nandhini Chandramoorthy, Karthik Swaminathan, and Pin-Yu Chen (IBM Research) ;Kshitij Bhardwaj (Lawrence Livermore National Lab) ;Vijay Janapa Reddi (Harvard University) ;Arijit Raychowdhury (Georgia Tech)

Abstract: The adoption of autonomous swarms, consisting of a multitude of unmanned aerial vehicles (UAVs), operating in a collaborative manner, has become prevalent in mainstream application domains for both military and civilian purposes. These swarms are expected to collaboratively carry out navigation tasks and employ complex reinforcement learning (RL) models within the stringent onboard size, weight, and power constraints. While techniques such as reducing onboard operating voltage can improve the energy efficiency of both computation and flight missions, they can lead to on-chip bit failures that are detrimental to mission safety and performance. To this end, we propose MulBERRY, a multi-agent robust learning framework to enhance bit error robustness and energy efficiency for resource-constrained autonomous UAV swarms. MulBERRY supports multi-agent robust learning, both offline and on-device, with adaptive and collaborative agent-server optimizations. For the first time, MulBERRY demonstrates the practicality of robust low-voltage operation on multi-UAV systems leading to energy savings in both compute and mission quality-of-flight. We conduct extensive system-level experiments on autonomous multi-UAV navigation by leveraging algorithm-level robust learning techniques, and hardware-level bit error, thermal, and power characterizations. Through evaluations, we demonstrate that MulBERRY achieves robustness-performance-efficiency co-optimizations for autonomous UAV swarms. We also show that MulBERRY generalizes well across fault patterns, environments, UAV types, UAV agent numbers, and RL policies, with up to 18.97% reduction in flight energy, 22.07% increase in the number of successful missions, and 4.16× processing energy reduction.
10:30 PDT – 10:45 PDT

Exploiting Human Color Discrimination for Memory- and Energy-Efficient Image Encoding in Virtual Reality
Nisarg Ujjainkar and Ethan Shahan (University of Rochester) ;Kenneth Chen, Budmonde Duinkharjav, and Qi Sun (New York University) ;Yuhao Zhu (University of Rochester)

Abstract: Virtual Reality (VR) has the potential of becoming the next ubiquitous computing platform. Continued progress in the burgeoning field of VR depends critically on an efficient computing substrate. In particular, DRAM access energy is known to contribute to a significant portion of system energy. Today’s framebuffer compression system alleviates the DRAM traffic by using a numerically lossless compression algorithm. Being numerically lossless, however, is unnecessary to preserve perceptual quality for humans. This paper proposes a perceptually lossless, but numerically lossy, system to compress DRAM traffic. Our idea builds on top of long-established psychophysical studies that show that humans cannot discriminate colors that are close to each other. The discrimination ability becomes even weaker (i.e., more colors are perceptually indistinguishable) in our peripheral vision. Leveraging the color discrimination (in)ability, we propose an algorithm that adjusts pixel colors to minimize the bit encoding cost without introducing visible artifacts. The algorithm is coupled with lightweight architectural support that, in real-time, reduces the DRAM traffic by 66.9% and outperforms existing framebuffer compression mechanisms by up to 20.4%. Psychophysical studies on human participants show that our system introduces little to no perceptual fidelity degradation.
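The core idea can be illustrated with a toy color-snapping routine: if observers cannot distinguish colors within a tolerance that grows with eccentricity, pixel values can be snapped to a coarser grid before framebuffer compression, shrinking the residuals the encoder must store. The linear tolerance model and its constants below are placeholders of ours, not the paper's calibrated perceptual model.

```python
def snap_channel(value, tolerance):
    """Quantize an 8-bit channel to a step no larger than the allowed tolerance."""
    step = max(1, int(tolerance))
    return min(255, round(value / step) * step)

def encode_pixel(rgb, eccentricity_deg):
    # assumed model: tolerance widens linearly toward the periphery (placeholder numbers)
    tol = 1 + 0.5 * eccentricity_deg
    return tuple(snap_channel(c, tol) for c in rgb)

print(encode_pixel((123, 200, 37), eccentricity_deg=0))    # foveal: nearly unchanged
print(encode_pixel((123, 200, 37), eccentricity_deg=30))   # peripheral: coarser colors
```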
10:45 PDT – 11:00 PDT

MicroVSA: An Ultra-Lightweight Vector Symbolic Architecture-based Classifier Library for Always-On Inference on Tiny Microcontrollers
Nuntipat Narkthong and Shijin Duan (Northeastern University) ;Shaolei Ren (University of California Riverside) ;Xiaolin Xu (Northeastern University)

Abstract: Artificial intelligence (AI) on tiny edge devices has become feasible thanks to the emergence of high-performance microcontrollers (MCUs) and lightweight machine learning (ML) models. Nevertheless, the cost and power consumption of these MCUs and the computation requirements of these ML algorithms still present barriers that prevent the widespread inclusion of AI functionality on smaller, cheaper, and lower-power devices. Thus, there is an urgent need for a more efficient ML algorithm and implementation strategy suitable for lower-end MCUs. This paper presents MicroVSA, a library of optimized implementations of a low-dimensional computing (LDC) classifier, a recently proposed variant of vector symbolic architecture (VSA), for 8, 16, and 32-bit MCUs. MicroVSA achieves up to 21.86x speedup and 8x less flash utilization compared to the vanilla LDC. Evaluation results on the three most common always-on inference tasks – myocardial infarction detection, human activity recognition, and hot word detection – demonstrate that MicroVSA outperforms traditional classifiers and achieves comparable accuracy to tiny deep learning models, while requiring only a few tens of bytes of RAM and easily fitting in tiny 8-bit MCUs. For instance, our model for recognizing human activity from inertial sensor data only needs 2.46 KiB of flash and 0.02 KiB of RAM and can complete one inference in 0.85 ms on a 32-bit ARM Cortex-M4 MCU or 11.82 ms on a tiny 8-bit AVR MCU, whereas the RNN model running on a higher-end ARM Cortex-M3 requires 62.0 ms. Our study suggests that ubiquitous ML deployment on low-cost tiny MCUs is possible, and more study on VSA model training, model compression, and implementation techniques is needed to further lower the cost and power of ML on edge devices.
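A minimal VSA-style classifier sketch in Python is shown below to convey why so little RAM is needed: training bundles binary hypervectors per class, and inference is just an encode plus a Hamming-distance lookup. The dimension, encoding, and toy data are ours; MicroVSA itself ships optimized LDC implementations for 8/16/32-bit MCUs.

```python
import random
random.seed(0)

D = 256                                   # hypervector dimension (assumed)
def rand_hv():
    return [random.getrandbits(1) for _ in range(D)]

def bind(a, b):    return [x ^ y for x, y in zip(a, b)]            # XOR binding
def bundle(vs):    return [1 if sum(col) * 2 > len(vs) else 0      # bitwise majority
                           for col in zip(*vs)]
def hamming(a, b): return sum(x != y for x, y in zip(a, b))

feature_keys = [rand_hv() for _ in range(4)]       # one random key per input feature
level_hvs = {lvl: rand_hv() for lvl in range(4)}   # quantized feature values

def encode(sample):                                # sample: 4 small quantized ints
    return bundle([bind(feature_keys[i], level_hvs[v]) for i, v in enumerate(sample)])

train = {"walk": [(0, 1, 1, 0), (0, 1, 2, 0)], "run": [(3, 3, 2, 3), (2, 3, 3, 3)]}
class_hvs = {c: bundle([encode(s) for s in xs]) for c, xs in train.items()}

def classify(sample):
    q = encode(sample)
    return min(class_hvs, key=lambda c: hamming(q, class_hvs[c]))

print(classify((0, 1, 1, 0)), classify((3, 2, 3, 3)))   # should print: walk run
```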
11:00 PDT – 11:15 PDT

Energy-Adaptive Buffering for Efficient, Responsive, and Persistent Batteryless Systems
Harrison Williams and Matthew Hicks (Virginia Tech)

Abstract: Batteryless energy harvesting systems enable a wide array of new sensing, computation, and communication platforms untethered by power delivery or battery maintenance demands. Energy harvesters charge a buffer capacitor from an unreliable environmental source until enough energy is stored to guarantee a burst of operation despite changes in power input. Current platforms use a fixed-size buffer chosen at design time to meet constraints on charge time or application longevity, but static energy buffers are a poor fit for the highly volatile power sources found in real-world deployments: fixed buffers waste energy both as heat when they reach capacity during a power surplus and as leakage when they fail to charge the system during a power deficit. To maximize batteryless system performance in the face of highly dynamic input power, we propose REACT: a responsive buffering circuit which varies total capacitance according to net input power. REACT uses a variable capacitor bank to expand capacitance to capture incoming energy during a power surplus and reconfigures internal capacitors to reclaim additional energy from each capacitor as power input falls. Compared to fixed-capacity systems, REACT captures more energy, maximizes usable energy, and efficiently decouples system voltage from stored charge—enabling low-power and high-performance designs previously limited by ambient power. Our evaluation on real-world platforms shows that REACT eliminates the tradeoff between responsiveness, efficiency, and longevity, increasing the energy available for useful work by an average of 25.6% over static buffers optimized for reactivity and capacity, improving event responsiveness by an average of 7.7× without sacrificing capacity, and enabling programmer-directed longevity guarantees.

Session 9A: Accelerated Applications
(Location: Grande C)
11:45 PDT – 12:00 PDT

JUNO: Optimizing High-Dimensional Approximate Nearest Neighbour Search with Sparsity-Aware Algorithm and Ray-Tracing Core Mapping
Zihan Liu (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Wentao Ni (Shanghai Jiao Tong University) ;Jingwen Leng (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Yu Feng (University of Rochester) ;Cong Guo (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Quan Chen (Shanghai Jiao Tong University) ;Chao Li and Minyi Guo (Shanghai Jiao Tong University and Shanghai Qi Zhi Institute) ;Yuhao Zhu (University of Rochester)

Abstract: Approximate nearest neighbor (ANN) search is a widely applied technique in modern intelligent applications, such as recommendation systems and vector databases. Therefore, efficient and high-throughput execution of ANN search has become increasingly important. In this paper, we first characterize the state-of-the-art product quantization-based method of ANN search and identify a significant source of inefficiency in the form of unnecessary pairwise distance calculations and accumulations. To improve efficiency, we propose Juno, an end-to-end ANN search system that adopts a carefully designed sparsity- and locality-aware search algorithm. We also present an efficient hardware mapping that utilizes ray tracing cores in modern GPUs with pipelined execution on tensor cores to execute our sparsity-aware ANN search algorithm. Our evaluations on four datasets from 1 to 100 million search points demonstrate 2.2×-8.5× improvements in search throughput. Moreover, our algorithmic enhancements alone achieve a maximal 2.6× improvement on the hardware without the acceleration of the RT core.
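The product-quantization kernel that Juno starts from, together with a simple pruning rule standing in for its sparsity-aware algorithm, is sketched below in Python. The pruning threshold, data, and codebooks are made up, and the real system additionally maps the search onto ray-tracing and tensor cores.

```python
import random
random.seed(1)

M, K, SUB = 4, 8, 2        # subspaces, centroids per subspace, dims per subspace

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

codebooks = [[[random.random() for _ in range(SUB)] for _ in range(K)] for _ in range(M)]

def encode(vec):           # PQ code: nearest centroid id in each subspace
    return [min(range(K), key=lambda k: l2(vec[m * SUB:(m + 1) * SUB], codebooks[m][k]))
            for m in range(M)]

points = [[random.random() for _ in range(M * SUB)] for _ in range(200)]
codes = [encode(p) for p in points]

def search(query, topk=3, prune=1.0):
    # 1) per-subspace lookup tables of query-to-centroid distances
    lut = [[l2(query[m * SUB:(m + 1) * SUB], codebooks[m][k]) for k in range(K)]
           for m in range(M)]
    # 2) accumulate approximate distances, skipping points whose partial distance
    #    is already large -- the short-circuit standing in for "sparsity awareness"
    scored = []
    for idx, code in enumerate(codes):
        d, alive = 0.0, True
        for m in range(M):
            d += lut[m][code[m]]
            if d > prune:          # generous, made-up threshold
                alive = False
                break
        if alive:
            scored.append((d, idx))
    return sorted(scored)[:topk]

print(search([random.random() for _ in range(M * SUB)]))
```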
12:00 PDT – 12:15 PDT

ngAP: Non-blocking Large-scale Automata Processing on GPUs
Tianao Ge (Hong Kong University of Science and Technology Guangzhou) ;Tong Zhang (Samsung) ;Hongyuan Liu (Hong Kong University of Science and Technology Guangzhou)

Abstract: Finite automata serve as compute kernels for various applications that require high throughput. However, despite the increasing compute power of GPUs, their potential in processing automata remains underutilized. In this work, we identify three major challenges that limit GPU throughput. 1) The available parallelism is insufficient, resulting in underutilized GPU threads. 2) Automata workloads involve significant redundant computations since a portion of states matches with repeated symbols. 3) The mapping between threads and states is switched dynamically, leading to poor data locality. Our key insight is that processing automata “one-symbol-at-a-time” serializes the execution, and thus needs to be revamped. To address these challenges, we propose Non-blocking Automata Processing, which allows parallel processing of different symbols in the input stream and also enables further optimizations: 1) We prefetch a portion of computations to increase the chances of processing multiple symbols simultaneously, thereby utilizing GPU threads better. 2) To reduce redundant computations, we store repeated computations in a memoization table, enabling us to substitute them with table lookups. 3) We privatize some computations to preserve the mapping between threads and states, thus improving data locality. Experimental results demonstrate that our approach outperforms the state-of-the-art GPU automata processing engine by an average of 7.9× and up to 901× across 20 applications.
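A small sketch of the memoization idea is given below: an NFA transition is a pure function of the active-state set and the input symbol, so repeated combinations can be served from a table instead of being recomputed. The toy automaton (matching "ab" anywhere) and the table layout are ours, not ngAP's GPU data structures.

```python
nfa = {                      # state -> {symbol: set of next states}
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {},
}
START, ACCEPT = {0}, 2

memo = {}                    # (frozenset of states, symbol) -> frozenset of next states

def step(states, symbol):
    key = (states, symbol)
    if key not in memo:                      # a table lookup replaces recomputation
        nxt = set(START)                     # the pattern may start at any position
        for s in states:
            nxt |= nfa[s].get(symbol, set())
        memo[key] = frozenset(nxt)
    return memo[key]

def run(text):
    states, hits = frozenset(START), []
    for i, ch in enumerate(text):
        states = step(states, ch)
        if ACCEPT in states:
            hits.append(i)
    return hits

print(run("xxabyabab"), "memoized entries:", len(memo))   # matches end at 3, 6, 8
```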
12:15 PDT – 12:30 PDT

Marple: Scalable Spike Sorting for Untethered Brain-Machine Interfacing
Eugene Sha, Andy Liu, Kareem Ibrahim, Mostafa Mahmoud, and Christina Giannoula (University of Toronto) ;Ameer Abdelhadi (McMaster University) ;Andreas Moshovos (University of Toronto and Vector Institute)

Abstract: Spike sorting is the process of parsing electrophysiological signals from neurons to identify if, when, and which particular neurons fire. Spike sorting is a particularly difficult task in computational neuroscience due to the growing scale of recording technologies and complexity in traditional spike sorting algorithms. Previous spike sorters can be divided into software-based and hardware-based solutions. Software solutions are highly accurate but operate on recordings after-the-fact, often require high-power GPUs to process in a timely fashion, and cannot be used in portable applications. Hardware solutions suffer in terms of accuracy due to the simplification of mechanisms for implementation’s sake and process only up to 128 inputs. This work answers the question: “How much computation power and memory storage is needed to sort spikes from 1000s of channels to keep up with advances in probe technology?” We analyze the computational and memory requirements for modern software spike sorters to identify their potential bottlenecks – namely in the template memory storage. We architect Marple, a highly optimized hardware pipeline for spike sorting which incorporates a novel mechanism that reduces the template memory storage by 8–11x. Marple is scalable, uses a flexible vector-based back-end to perform neuron identification, and a fixed-function front-end to filter the incoming streams into areas of interest. The implementation is projected to use just 79mW in 7nm, when spike sorting 10K channels at peak activity. We further demonstrate, for the first time, a machine learning replacement for the template matching stage.
12:30 PDT – 12:45 PDT

Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism
Mahyar Emami and Sahand Kashani (EPFL) ;Keisuke Kamahori (University of Tokyo) ;Mohammad Sepehr Pourghannad (Sharif University) ;Ritik Raj (Indian Institute of Technology Roorkee) ;James R. Larus (EPFL)

Abstract: The demise of Moore’s Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers’ productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.
12:45 PDT – 13:00 PDT

ORIANNA: An Accelerator Generation Framework for Optimization-based Robotic Applications
Yuhui Hao (Tianjin University) ;Yiming Gan (Chinese Academy of Sciences) ;Bo Yu (Shenzhen Institute of Artificial Intelligence and Robotics for Society) ;Qiang Liu (Tianjin University) ;Yinhe Han (Chinese Academy of Sciences) ;Zishen Wan (Georgia Tech) ;Shaoshan Liu (Shenzhen Institute of Artificial Intelligence and Robotics for Society)

Abstract: Despite extensive efforts, existing approaches to design accelerators for optimization-based robotic applications have limitations. Some approaches focus on accelerating general matrix operations, but they fail to fully exploit the specific sparse structure commonly found in many robotic algorithms. On the other hand, certain methods require manual design of dedicated accelerators, resulting in inefficiencies and significant non-recurring engineering (NRE) costs. The Orianna framework represents a significant advancement as it overcomes the limitations of existing approaches and provides a solution to generate accelerators for robotic applications that integrate multiple optimization-based algorithms. The framework leverages a common abstraction, the factor graph, to overcome these challenges. Application designers can easily define their objectives and constraints using the Orianna software framework. The Orianna compiler then translates the high-level code into basic matrix operations and executes them in an out-of-order manner on the generated accelerator. We demonstrate its effectiveness by implementing a real FPGA-based prototype. Orianna accelerators achieve a 6.5× performance improvement and a 15.1× energy reduction compared to an advanced Intel CPU. Furthermore, Orianna achieves comparable performance to state-of-the-art dedicated accelerators while utilizing only 36.1% of hardware resources.

Session 9B: SSDs
(Location: Grande D/E)
11:45 PDT – 12:00 PDT

AERO: Adaptive Erase Operation for Improving Lifetime and Performance of Modern NAND Flash-Based SSDs
Sungjun Cho (POSTECH) ;Beomjun Kim (Kyungpook National University) ;Hyunuk Cho (POSTECH) ;Gyeongseob Seo (Kyungpook National University) ;Onur Mutlu (ETH Zürich) ;Myungsuk Kim (Kyungpook National University) ;Jisung Park (POSTECH)

Abstract: This work investigates a new erase scheme in NAND flash memory to improve the lifetime and performance of modern solid-state drives (SSDs). In NAND flash memory, an erase operation applies a high voltage (e.g., > 20 V) to flash cells for a long time (e.g., > 3.5 ms), which degrades cell endurance and potentially delays user I/O requests. While a large body of prior work has proposed various techniques to mitigate the negative impact of erase operations, no work has yet investigated how erase latency should be set to fully exploit the potential of NAND flash memory; most existing techniques use a fixed latency for every erase operation which is set to cover the worst-case operating conditions. To address this, we propose AERO (Adaptive ERase Operation), a new erase scheme that dynamically adjusts erase latency to be just long enough for reliably erasing target cells, depending on the cells’ current erase characteristics. AERO accurately predicts such near-optimal erase latency based on the number of fail bits during an erase operation. To maximize its benefits, we further optimize AERO in two aspects. First, at the beginning of an erase operation, AERO attempts to erase the cells for a short time (e.g., 1 ms), which enables AERO to always obtain the number of fail bits necessary to accurately predict the near-optimal erase latency. Second, AERO aggressively yet safely reduces erase latency by leveraging a large reliability margin present in modern SSDs. We demonstrate the feasibility and reliability of AERO using 160 real 3D NAND flash chips, showing that it enhances SSD lifetime over the conventional erase scheme by 43% without change to existing NAND flash chips. Our system-level evaluation using eleven real-world workloads shows that an AERO-enabled SSD reduces read tail latency by 34% on average over a state-of-the-art technique.
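A back-of-the-envelope sketch of the control loop, as we read the abstract, is given below: issue a short probe pulse, count the cells that still fail, and predict a just-long-enough total erase latency, capped at the conventional worst case. The probe length, the linear predictor, its coefficients, and the toy device model are placeholders, not the paper's calibrated mechanism.

```python
PROBE_MS = 1.0          # short initial pulse (the abstract mentions about 1 ms)
WORST_CASE_MS = 3.5     # conventional fixed erase latency upper bound

def predict_erase_latency(fail_bits, a=0.004, b=1.2):
    """Hypothetical linear predictor: more surviving fail bits => longer erase."""
    return min(WORST_CASE_MS, max(PROBE_MS, a * fail_bits + b))

def adaptive_erase(block, issue_pulse):
    fail_bits = issue_pulse(block, PROBE_MS)     # device reports the fail-bit count
    remaining = predict_erase_latency(fail_bits) - PROBE_MS
    if remaining > 0:
        issue_pulse(block, remaining)            # finish with one just-long-enough pulse
    return PROBE_MS + max(0.0, remaining)

# Toy device model: lightly worn blocks erase quickly, aged blocks leave more fail bits.
def fake_pulse(block, ms):
    return max(0, block["wear"] * 10 - int(ms * 200))

print(adaptive_erase({"wear": 5}, fake_pulse))    # young block: well below 3.5 ms
print(adaptive_erase({"wear": 80}, fake_pulse))   # aged block: close to the worst case
```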
12:00 PDT – 12:15 PDT

BypassD: Enabling fast userspace access to shared SSDs
Sujay Yadalam (University of Wisconsin-Madison) ;Chloe Alverti (National Technical University of Athens) ;Vasileios Karakostas (University of Athens) ;Jayneel Gandhi (Meta) ;Michael Swift (University of Wisconsin-Madison)

Abstract: Modern storage devices, such as Optane NVMe SSDs, offer ultra-low latency of a few microseconds and high bandwidth of multiple gigabytes per second. At these speeds, the kernel software I/O stack is a substantial source of overhead. Userspace approaches avoid kernel software overheads but face challenges in supporting shared storage without major changes to file systems, the OS or the hardware. We propose a new I/O architecture, BypassD, for fast, userspace access to shared storage devices. BypassD takes inspiration from virtual memory: it uses virtual addresses to access a device and relies on hardware for translation and protection. Like memory-mapping a file, the OS kernel constructs a mapping for file contents in the page table. Userspace I/O requests then use virtual addresses from these mappings to specify which file and file offset to access. BypassD extends the IOMMU hardware to translate file offsets into device Logical Block Addresses. Existing applications require no modifications to use BypassD. Our evaluation shows that BypassD reduces latency for 4KB accesses by 42% compared to standard Linux kernel and performs close to userspace techniques like SPDK that do not support device sharing. By eliminating software overheads, BypassD improves performance of real workloads, such as the WiredTiger storage engine, by ∼20%.
12:15 PDT – 12:30 PDT

Eliminating Storage Management Overhead of Deduplication over SSD Arrays Through a Hardware/Software Co-Design
Yuhong Wen, Xiaogang Zhao, and You Zhou (Huazhong University of Science and Technology) ;Tong Zhang (Rensselaer Polytechnic Institute and ScaleFlux) ;Shangjun Yang, Changsheng Xie, and Fei Wu (Huazhong University of Science and Technology)

Abstract: This paper presents a hardware/software co-design solution to efficiently implement block-layer deduplication over SSD arrays. By introducing complex and varying dependency over the entire storage space, deduplication is infamously subject to high storage management overheads in terms of CPU/memory resource usage and I/O performance degradation. To fundamentally address this problem, one intuitive idea is to offload deduplication storage management from host into SSDs, which is motivated by the redundant dual address mapping in host-side deduplication layer and intra-SSD flash translation layer (FTL). The practical implementation of this idea is nevertheless challenging because of the array-wide deduplication vs. per-SSD FTL management scope mismatch. Aiming to tackle this challenge, this paper presents a solution, called ARM-Dedup, that makes SSD FTL deduplication-oriented and array-aware and accordingly re-architects deduplication software to achieve lightweight and high-performance deduplication over an SSD array. We implemented an ARM-Dedup prototype based on the Linux Dmdedup engine and mdraid software RAID over FEMU SSD emulators. Experimental results show that ARM-Dedup has good scalability and can improve system performance significantly, such as by up to 272% and 127% higher IOPS in synthetic and real-world workloads, respectively.
12:30 PDT – 12:45 PDT

LazyBarrier: Reconstructing Android IO Stack for Barrier-Enabled Flash Storage
Yuanyi Zhang, Heng Zhang, Wenbin Cao, Xing He, Daejun Park, Jinyoung Choi, and SungJun Park (Samsung Electronics)

Abstract: Costly synchronous write—a.k.a., fsync()—is a common approach to preserve the storage order of data. Android smartphones use No-Barrier by default to improve fsync() performance. Recently, Barrier-Enabled IO Stack (BEIOS) provides an efficient order-preserving method for flash storage by barrier write commands (BWCs). However, our evaluation shows that BEIOS suffers from performance degradation in multi-threaded scenarios since BWCs increase linearly with the number of threads. This paper proposes LazyBarrier, a new order-preserving IO stack mainly involving order-preserving write (OPW) Model, the Lazy Issue algorithm, modified F2FS, and Barrier Scheduling. OPW Model is based on Finite Automata and defines state transition in OPW. Lazy Issue is a novel algorithm that minimizes BWCs. F2FS is modified to support OPW, and Barrier Scheduling solves the ordering problem in multi-queue block layer. The experiments on Android smartphones show that, compared with BEIOS, LazyBarrier improves OPW performance by 73% and 29% under the FIO benchmark and in real applications respectively. And LazyBarrier also outperforms BEIOS by 19% in dbench workload in servers.
12:45 PDT – 13:00 PDT

Achieving Near-Zero Read Retry for 3D NAND Flash Memory
Min Ye (City University of Hong Kong) ;Qiao Li (Xiamen University) ;Yina Lv (City University of Hong Kong) ;Jie Zhang (Peking University) ;Tianyu Ren (City University of Hong Kong) ;Daniel Wen (YEESTOR Microelectronics) ;Tei-Wei Kuo (National Taiwan University) ;Chun Jason Xue (City University of Hong Kong and Mohamed bin Zayed University of Artificial Intelligence)

Abstract: As the flash-based storage devices age with program/erase (P/E) cycles, they require an increasing number of read retries for error correction, which in turn deteriorates their read performance. The design of read-retry methods is critical to flash read performance. Current flash chips embed predefined read retry tables (RRT) for retry, but these tables fail to consider the read granularity and error behaviors. We characterize different types of real flash chips, based on which we further develop models for the correlation among the optimal read offsets of read voltages required for reading each page. By leveraging characterization observations and the models, we propose a methodology to generate a tailored RRT for each flash model. We introduce a dynamic read retry procedure to pick up proper read voltages from the table, followed by a proximity-search method for fine-tuning the read offsets. Experiments on real flash chips show that the proposed methodology can achieve near-zero retries. It reduces the average number of read retries to below 0.003 for data with high retention time at 8K P/E cycles, whereas the state-of-the-art approaches incur over 3 read retries on average once the flash is aged to 3K P/E cycles.

Session 9C: ML Systems and Optimizations
(Location: Scripps I/II)
11:45 PDT – 12:00 PDT

SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
Wei Niu, Md Musfiqur Rahman Sanim, and Zhihao Shu (University of Georgia) ;Jiexiong Guan (William & Mary) ;Xipeng Shen (North Carolina State University) ;Miao Yin (University of Texas at Arlington) ;Gagan Agrawal (University of Georgia) ;Bin Ren (William & Mary)

Abstract: This work is motivated by recent developments in Deep Neural Networks, particularly the Transformer architectures underlying applications such as ChatGPT, and the need for performing inference on mobile devices. Focusing on emerging transformers (specifically the ones with computationally efficient Swin-like architectures) and large models (e.g., Stable Diffusion and LLMs) based on transformers, we observe that layout transformations between the computational operators cause a significant slowdown in these applications. This paper presents SmartMem, a comprehensive framework for eliminating most layout transformations, with the idea that multiple operators can use the same tensor layout through careful choice of layout and implementation of operations. Our approach is based on classifying the operators into four groups, and considering combinations of producer-consumer edges between the operators. We develop a set of methods for searching such layouts. Another component of our work is developing efficient memory layouts for 2.5 dimensional memory commonly seen in mobile devices. Our experimental results show that SmartMem outperforms 5 state-of-the-art DNN execution frameworks on mobile devices across 18 varied neural networks, including CNNs, Transformers with both local and global attention, as well as LLMs. In particular, compared to DNNFusion, SmartMem achieves an average speedup of 2.8×, and outperforms TVM and MNN with speedups of 6.9× and 7.9×, respectively, on average.
12:00 PDT – 12:15 PDT

Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations
Dongning Ma (Villanova University) ;Fred Lin, Alban Desmaison, Joel Coburn, Daniel Moore, and Sriram Sankar (Meta) ;Xun Jiao (Villanova University and Meta)

Abstract: Deep neural networks (DNNs) have been widely adopted in various safety-critical applications such as computer vision and autonomous driving. However, as technology scales and applications diversify, coupled with the increasing heterogeneity of underlying hardware architectures, silent data corruption (SDC) has been emerging as a pronounced threat to the reliability of DNNs. Recent reports from industry hyperscalers underscore the difficulty in addressing SDC due to their “stealthy” nature and elusive manifestation. In this paper, we propose Dr. DNA, a novel approach to enhance the reliability of DNN systems by detecting and mitigating SDCs. Specifically, we formulate and extract a set of unique SDC signatures from the Distribution of Neuron Activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference. We perform an extensive evaluation across 3 vision tasks, 5 different datasets, and 10 different models, under 4 different error models. Results show that Dr. DNA achieves 100% SDC detection rate for most cases, 95% detection rate on average and >90% detection rate across all cases, representing 20% – 70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.
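A toy rendering of the signature idea is sketched below: record an envelope of activation values per layer on clean inputs, then flag an inference when a layer's activations fall outside that envelope. The min/max envelope and tolerance are simplifications of ours; the paper's signatures use richer distribution statistics.

```python
def build_signature(clean_runs):
    """clean_runs: list of runs; each run is {layer_name: [activation values]}."""
    sig = {}
    for run in clean_runs:
        for layer, acts in run.items():
            lo, hi = min(acts), max(acts)
            old = sig.get(layer, (lo, hi))
            sig[layer] = (min(old[0], lo), max(old[1], hi))
    return sig

def looks_corrupted(run, sig, tolerance=1.5):
    for layer, acts in run.items():
        lo, hi = sig[layer]
        span = max(hi - lo, 1e-6)
        if max(acts) > hi + tolerance * span or min(acts) < lo - tolerance * span:
            return True      # out-of-envelope activations => suspect a silent corruption
    return False

clean = [{"conv1": [0.1, 0.4, 0.9, 0.3], "fc": [0.2, 0.5, 0.7, 0.1]} for _ in range(4)]
sig = build_signature(clean)
print(looks_corrupted({"conv1": [0.2, 0.5, 0.8, 0.4], "fc": [0.3, 0.6, 0.2, 0.1]}, sig))   # False
print(looks_corrupted({"conv1": [512.0, 0.5, 0.8, 0.4], "fc": [0.3, 0.6, 0.2, 0.1]}, sig)) # True
```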
12:15 PDT – 12:30 PDT

GPU-based Private Information Retrieval for On-Device Machine Learning Inference
Maximilian Lam (Harvard University) ;Jeff Johnson (Meta) ;Wenjie Xiong (Virginia Tech) ;Kiwan Maeng (Pennsylvania State University) ;Udit Gupta (Harvard University) ;Yang Li, Liangzhen Lai, and Ilias Leontiadis (Meta) ;Minsoo Rhu (KAIST and Meta) ;Hsien-Hsin S. Lee (Intel) ;Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks (Harvard University) ;Edward Suh (Meta and Cornell University)

Abstract: On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables each on the order of 1-10 GBs of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second—a > 100× throughput improvement over a CPU-based baseline—while maintaining model accuracy.
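For intuition about what PIR provides, the classic two-server XOR scheme is sketched below; it is a deliberately different and much simpler protocol than the cryptographic single-server PIR the paper accelerates on GPUs, but it shows the shape of the problem: the client recovers one embedding row while neither server learns which row was requested.

```python
import secrets

def xor_rows(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(database, selection_bits):
    """Each server XORs together the rows whose selection bit is 1."""
    acc = bytes(len(database[0]))
    for row, bit in zip(database, selection_bits):
        if bit:
            acc = xor_rows(acc, row)
    return acc

def client_query(n_rows, wanted):
    mask_a = [secrets.randbelow(2) for _ in range(n_rows)]   # uniformly random vector
    mask_b = mask_a.copy()
    mask_b[wanted] ^= 1                                      # differs only at the target row
    return mask_a, mask_b

# 8 "embedding rows" of 4 bytes each (stand-ins for embedding-table entries)
db = [bytes([i] * 4) for i in range(8)]
q_a, q_b = client_query(len(db), wanted=5)
recovered = xor_rows(server_answer(db, q_a), server_answer(db, q_b))
print(recovered == db[5])   # True: the two answers XOR to exactly the requested row
```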
12:30 PDT – 12:45 PDT

RECom: A Compiler Approach to Accelerate Recommendation Model Inference with Massive Embedding Columns
Zaifeng Pan (Renmin University of China) ;Zhen Zheng (Alibaba) ;Feng Zhang and Ruofan Wu (Renmin University of China) ;Hao Liang (Alibaba) ;Dalin Wang (Renmin University of China) ;Xiafei Qiu, Junjie Bai, and Wei Lin (Alibaba) ;Xiaoyong Du (Renmin University of China)

Abstract: Embedding columns are important for deep recommendation models to achieve high accuracy, but they can be very time-consuming during inference. Machine learning (ML) compilers are used broadly in real businesses to optimize ML models automatically. Unfortunately, no existing work uses compilers to automatically accelerate the heavy embedding column computations during recommendation model inferences. To fill this gap, we propose RECom, the first ML compiler that aims at optimizing the massive embedding columns in recommendation models on the GPU. RECom addresses three major challenges. First, generating an efficient schedule on the GPU for the massive operators within embedding columns is difficult. Existing solutions usually lead to numerous small kernels and also lack inter-subgraph parallelism. We adopt a novel codegen strategy that fuses massive embedding columns into a single kernel and maps each column into a separate thread block on the GPU. Second, the complex shape computations under dynamic shape scenarios impede further graph optimizations. We develop a symbolic expression-based module to reconstruct all shape computations. Third, ML frameworks inevitably introduce redundant computations due to robustness considerations. We develop a subgraph optimization module that performs graph-level simplifications based on the entire embedding column context. Experiments on both in-house and open-source models show that RECom can achieve 6.61× and 1.91× over state-of-the-art baselines in terms of end-to-end inference latency and throughput, respectively. RECom’s source code is publicly available at https://github.com/AlibabaResearch/recom.
12:45 PDT – 13:00 PDT

NDPipe: Exploiting Near-data Processing for Scalable Inference and Continuous Training in Photo Storage
Jungwoo Kim and Seonggyun Oh (DGIST) ;Jaeha Kung (Korea University) ;Yeseong Kim and Sungjin Lee (DGIST)

Abstract: This paper proposes a novel photo storage system called NDPipe, which accelerates the performance of training and inference for image data by leveraging near-data processing in photo storage servers. NDPipe distributes storage servers with inexpensive commodity GPUs in a data center and uses their collective intelligence to perform inference and training near image data. By efficiently partitioning deep neural network (DNN) models and exploiting the data parallelism of many storage servers, NDPipe can achieve high training throughput with low synchronization costs. NDPipe optimizes the near-data processing engine to maximally utilize system components in each storage server. Our results show that, given the same energy budget, NDPipe exhibits 1.39× higher inference throughput and 2.64× faster training speed than typical photo storage systems.

Session 9D: Quantum Software
(Location: Fairway I/IV)
11:45 PDT – 12:00 PDT

Exploiting the Regular Structure of Modern Quantum Architectures for Compiling and Optimizing Programs with Permutable Operators
Yuwei Jin, Fei Hua, Yanhao Chen, and Ari Hayes (Rutgers University) ;Chi Zhang (University of Pittsburgh) ;Eddy Z. Zhang (Rutgers University)

Abstract: A critical feature of today’s quantum circuits is that they contain permutable two-qubit operators. The flexibility in ordering these permutable two-qubit gates creates more compiler optimization opportunities, but it also imposes significant challenges due to the additional degree of freedom. Our contributions are two-fold. We first propose a general methodology that can find structured solutions for scalable quantum hardware. It breaks down the complex compilation problem into two sub-problems that can be solved at small scale. Second, we show how such a structured method can be adapted to practical cases that handle the sparsity of the input problem graphs and the noise variability in real hardware. We evaluate our method on IBM and Google architecture coupling graphs with up to 1,024 qubits and demonstrate better results in both depth and gate count – by up to 72% reduction in depth and 66% reduction in gate count. Our real-machine experiments on IBM Mumbai show that we can find a better expected minimal energy than the state-of-the-art baseline.
12:00 PDT – 12:15 PDT

One Gate Scheme to Rule Them All: Introducing a Complex Yet Reduced Instruction Set for Quantum Computing
Jianxin Chen and Dawei Ding (DAMO Academy) ;Weiyuan Gong (Harvard University) ;Cupjin Huang (DAMO Academy) ;Qi Ye (DAMO Academy and Tsinghua University)

Abstract: The design and architecture of a quantum instruction set are paramount to the performance of a quantum computer. This work introduces a gate scheme for qubits with XX+YY coupling that directly and efficiently realizes any two-qubit gate up to single-qubit gates. First, this scheme enables high-fidelity execution of quantum operations, especially when decoherence is the primary error source. Second, since the scheme spans the entire SU(4) group of two-qubit gates, we can use it to attain the optimal two-qubit gate count for algorithm implementation. These two advantages in synergy give rise to a quantum Complex yet Reduced Instruction Set Computer (CRISC). Though the gate scheme is compact, it supports a comprehensive array of quantum operations. This may seem paradoxical but is realizable due to the fundamental differences between quantum and classical computer architectures. Using our gate scheme, we observe marked improvements across various applications, including generic n-qubit gate synthesis, quantum volume, and qubit routing. Furthermore, the proposed scheme also realizes a gate locally equivalent to the commonly used CNOT gate with a gate time of π/2g, where g is the two-qubit coupling strength. The AshN scheme is also completely impervious to ZZ error, the main coherent error in transversely coupled systems, as the control parameters implementing the gates can be easily adjusted to take the ZZ term into account.
12:15 PDT – 12:30 PDT

MorphQPV: Exploiting Isomorphism in Quantum Programs to Facilitate Confident Verification
Siwei Tan, Debin Xiang, and Liqiang Lu (Zhejiang University) ;Junlin Lu (Peking University) ;Qiuping Jiang (Ningbo University) ;Mingshuai Chen and Jianwei Yin (Zhejiang University)

Abstract: Unlike classical computing, quantum program verification (QPV) is much more challenging due to the non-duplicability of quantum states, which collapse after measurement. Prior approaches rely on deductive verification, which scales poorly, or they require exhaustive assertions that cannot ensure the program is correct for all inputs. In this paper, we propose MorphQPV, a confident assertion-based verification methodology. Our key insight is to leverage the isomorphism in quantum programs, which implies a structure-preserving relation between the program’s runtime states. In the assertion statement, we define a tracepoint pragma to label the verified quantum state and an assume-guarantee primitive to specify the expected relation between states. Then, we characterize the ground-truth relation between states using an isomorphism-based approximation, which can effectively obtain the program states under various inputs while avoiding repeated executions. Finally, the verification is formulated as a constraint optimization problem with a confidence estimation model to enable rigorous analysis. Experiments suggest that MorphQPV reduces the number of program executions by 107.9× when verifying the 27-qubit quantum lock algorithm and improves the probability of success by 3.3×-9.9× when debugging five benchmarks.
12:30 PDT – 12:45 PDT

Fermihedral: On the Optimal Compilation for Fermion-to-Qubit Encoding
Yuhao Liu, Shize Che, and Junyu Zhou (University of Pennsylvania) ;Yunong Shi (AWS Quantum Technologies) ;Gushu Li (University of Pennsylvania)

Abstract: This paper introduces Fermihedral, a compiler framework focusing on discovering the optimal Fermion-to-qubit encoding for targeted Fermionic Hamiltonians. Fermion-to-qubit encoding is a crucial step in harnessing quantum computing for efficient simulation of Fermionic quantum systems. Utilizing Pauli algebra, Fermihedral redefines complex constraints and objectives of Fermion-to-qubit encoding into a Boolean Satisfiability problem which can then be solved with high-performance solvers. To accommodate larger-scale scenarios, this paper proposes two new strategies that yield approximately optimal solutions, mitigating the overhead from the exponentially large number of clauses. Evaluation across diverse Fermionic systems highlights the superiority of Fermihedral, showcasing substantial reductions in implementation costs, gate counts, and circuit depth in the compiled circuits. Real-system experiments on IonQ’s device affirm its effectiveness, notably enhancing simulation accuracy.
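For context, a Fermion-to-qubit encoding maps each Fermionic operator to a Pauli string; the textbook Jordan-Wigner mapping below (standard material, not taken from the paper) shows what such an encoding looks like and why string weight matters: longer Pauli strings mean costlier circuits, which is what a SAT-optimal encoding aims to reduce.

```python
def jordan_wigner(j, n_modes, kind="annihilation"):
    """Return the ladder operator a_j (or a_j^dagger) as a list of (coefficient, pauli_string)."""
    z_prefix = "Z" * j                                  # parity chain over lower modes
    x_string = z_prefix + "X" + "I" * (n_modes - j - 1)
    y_string = z_prefix + "Y" + "I" * (n_modes - j - 1)
    sign = +0.5j if kind == "annihilation" else -0.5j
    return [(0.5, x_string), (sign, y_string)]

for term in jordan_wigner(2, n_modes=4):
    print(term)
# Prints (0.5, 'ZZXI') and (0.5j, 'ZZYI'): the Z-chain grows with the mode index,
# i.e., the per-term weight that a better encoding tries to keep small.
```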
12:45 PDT – 13:00 PDT

OnePerc: A Randomness-aware Compiler for Photonic Quantum Computing
Hezi Zhang and Jixuan Ruan (University of California San Diego) ;Hassan Shapourian and Ramana Rao Kompella (Cisco Systems) ;Yufei Ding (University of California San Diego)

Abstract: The photonic platform holds great promise for quantum computing. Nevertheless, the intrinsic probabilistic characteristic of its native fusion operations introduces substantial randomness into the computing process, posing significant challenges to achieving scalability and efficiency in program execution. In this paper, we introduce a randomness-aware compilation framework designed to concurrently achieve scalability and efficiency. Our approach leverages an innovative combination of offline and online optimization passes, with a novel intermediate representation serving as a crucial bridge between them. Through a comprehensive evaluation, we demonstrate that this framework significantly outperforms the most efficient baseline compiler in a scalable manner, opening up new possibilities for realizing scalable photonic quantum computing.

Session 10A: FPGAs and Reconfigurable Hardware
(Location: Grande C)
16:00 PDT – 16:15 PDT

FPGA Technology Mapping Using Sketch-Guided Program Synthesis
Gus Henry Smith, Benjamin Kushigian, Vishal Canumalla, and Andrew Cheung (University of Washington) ;Steven Lyubomirsky (OctoAI) ;Sorawee Porncharoenwase, René Just, Gilbert Louis Bernstein, and Zachary Tatlock (University of Washington)

Abstract: FPGA technology mapping is the process of implementing a hardware design expressed in high-level HDL (hardware design language) code using the low-level, architecture-specific primitives of the target FPGA. As FPGAs become increasingly heterogeneous, achieving high performance requires hardware synthesis tools that better support mapping to complex, highly configurable primitives like digital signal processors (DSPs). Current tools support DSP mapping via handwritten special-case mapping rules, which are laborious to write, error-prone, and often overlook mapping opportunities. We introduce Lakeroad, a principled approach to technology mapping via sketch-guided program synthesis. Lakeroad leverages two techniques—architecture-independent sketch templates and semantics extraction from HDL—to provide extensible technology mapping with stronger correctness guarantees and higher coverage of mapping opportunities than state-of-the-art tools. Across representative microbenchmarks, Lakeroad produces 2–3.5× the number of optimal mappings compared to proprietary state-of-the-art tools and 6–44× the number of optimal mappings compared to popular open-source tools, while also providing correctness guarantees not given by any other tool.
Figure 1. Even given a simple input design (input 1), the state-of-the-art (SOTA) hardware synthesis tool for Xilinx FPGAs frequently fails to efficiently use programmable primitives like DSPs. Lakeroad, on the other hand, can utilize all features of programmable primitives given just a short description of an FPGA architecture (input 2) and the vendor-provided simulation models of the primitive (input 3).
16:15 PDT – 16:30 PDT

TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs
Neha Prakriya, Yuze Chi, Suhail Basalama, Linghao Song, and Jason Cong (University of California Los Angeles)

Abstract: Despite the increasing adoption of FPGAs in compute clouds, there remains a significant gap in programming tools and abstractions which can leverage network-connected, cloud-scale, multi-die FPGAs to generate accelerators with high frequency and throughput. We propose TAPA-CS, a task-parallel dataflow programming framework which automatically partitions and compiles a large design across a cluster of FPGAs while achieving high frequency and throughput. TAPA-CS has three main contributions. First, it is an open-source framework which allows users to leverage virtually “unlimited” accelerator fabric, high-bandwidth memory (HBM), and on-chip memory. Second, given as input a large design, TAPA-CS automatically partitions the design to map to multiple FPGAs, while ensuring congestion control, resource balancing, and overlapping of communication and computation. Third, TAPA-CS couples coarse-grained floor-planning with interconnect pipelining at the inter- and intra-FPGA levels to ensure high frequency. FPGAs in our multi-FPGA testbed communicate through a high-speed 100Gbps Ethernet infrastructure. We have evaluated the performance of TAPA-CS on designs, including systolic-array based CNNs, graph processing workloads such as page rank, stencil applications, and KNN. On average, the 2-, 3-, and 4-FPGA designs are 2.1×, 3.2×, and 4.4× faster than the single FPGA baselines generated through Vitis HLS. TAPA-CS also achieves a frequency improvement between 11%-116% compared with Vitis HLS.
16:30 PDT – 16:45 PDT

Zoomie: A Software-like Debugging Tool for FPGAs
Tianrui Wei and Kevin Laeufer (University of California Berkeley) ;Katie Lim (University of Washington) ;Jerry Zhao and Koushik Sen (UC Berkeley) ;Jonathan Balkind (University of California Santa Barbara) ;Krste Asanovic (University of California Berkeley)

Abstract: FPGA prototyping has long been an indispensable technique in pre-silicon verification as well as enabling early-stage software development. FPGAs themselves have also gained popularity as hardware accelerators deployed in datacenters. However, FPGA development brings a plethora of problems. These issues constitute a high barrier towards mass adoption of agile development surrounding FPGA-based projects. To address these problems, we have built Zoomie for fast incremental compilation, reusing verification infrastructure, and a software-inspired approach towards open-source emulation. We show that Zoomie achieves 18× speedup over the vendor toolchain in incremental compilation time for million-gate designs. At the same time, Zoomie also provides a software-like debugging experience with breakpoints, stepping the design, and forcing values in a running design.
16:45 PDT – 17:00 PDT

HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description
Kingshuk Majumder and Uday Bondhugula (Indian Institute of Science)

Abstract: The emergence of machine learning, image, and audio processing on edge devices has motivated research towards power-efficient custom hardware accelerators. Though FPGAs are an ideal target for custom accelerators, the difficulty of hardware design and the lack of vendor-agnostic, standardized hardware compilation infrastructure have hindered their adoption. This paper introduces HIR, an MLIR-based intermediate representation (IR) and a compiler to design hardware accelerators for affine workloads. HIR replaces the traditional datapath + FSM representation of hardware with datapath + schedules. We implement a compiler that automatically synthesizes the finite-state machine (FSM) from the schedule description. The IR also provides high-level language features, such as loops and multi-dimensional tensors. The combination of explicit schedules and high-level language abstractions allows HIR to express synchronization-free, fine-grained parallelism, as well as high-level optimizations such as loop pipelining and overlapped execution of multiple kernels. Built as a dialect in MLIR, it draws on best IR practices learnt from communities like those of LLVM. While offering rich optimization opportunities and a high-level abstraction, the IR enables sharing of optimizations, utilities, and passes with software compiler infrastructure. Our evaluation shows that the generated hardware design is comparable in performance and resource usage with Vitis HLS. We believe that such a common hardware compilation pipeline can help accelerate research in language design for hardware description.
Keywords: HDL, HLS, MLIR, Verilog, accelerator, FPGA

Session 10B: Serverless Computing 2
(Location: Grande D/E)
16:00 PDT – 16:15 PDT

FaaSGraph: Enabling Scalable, Efficient, and Cost-Effective Graph Processing with Serverless Computing
Yushi Liu, Shixuan Sun, Zijun Li, and Quan Chen (Shanghai Jiao Tong University) ;Sen Gao and Bingsheng He (National University of Singapore) ;Chao Li and Minyi Guo (Shanghai Jiao Tong University)

Abstract: Graph processing is widely used in cloud services; however, current frameworks face challenges in efficiency and cost-effectiveness when deployed under the Infrastructure-as-a-Service model due to its limited elasticity. In this paper, we present FaaSGraph, a serverless-native graph computing scheme that enables efficient and economical graph processing through the co-design of graph processing frameworks and serverless computing systems. Specifically, we design a data-centric serverless execution model to efficiently power heavy computing tasks. Furthermore, we carefully design a graph processing paradigm to seamlessly cooperate with the data-centric model. Our experiments show that FaaSGraph improves end-to-end performance by up to 8.3× and reduces memory usage by up to 52.4% compared to state-of-the-art IaaS-based methods. Moreover, FaaSGraph delivers steady 99th-percentile performance under highly fluctuating workloads and reduces monetary cost by 85.7%.
16:15 PDT – 16:30 PDT

In-Storage Domain-Specific Acceleration for Serverless Computing
Rohan Mahapatra, Soroush Ghodrati, Byung Hoon Ahn, Sean Kinzer, Shu-Ting Wang, Hanyang Xu, and Lavanya Karthikeyan (University of California San Diego) ;Hardik Sharma (Google) ;Amir Yazdanbakhsh (Google DeepMind) ;Mohammad Alian (University of Kansas) ;Hadi Esmaeilzadeh (University of California San Diego)

Abstract: While (I) serverless computing is emerging as a popular form of cloud execution, datacenters are going through major changes: (II) storage disaggregation at the system infrastructure level and (III) integration of domain-specific accelerators at the hardware level. Each of these three trends individually provides significant benefits; however, when combined, the benefits diminish. On the convergence of these trends, the paper makes the observation that for serverless functions, the overhead of accessing disaggregated storage overshadows the gains from accelerators. Therefore, to benefit from all these trends in conjunction, we propose In-Storage Domain-Specific Acceleration for Serverless Computing (dubbed DSCS-Serverless). The idea contributes a serverless model that utilizes a programmable accelerator embedded within computational storage to unlock the potential of acceleration in disaggregated datacenters. Our results with eight applications show that integrating a comparatively small accelerator within the storage (DSCS-Serverless) that fits within the storage’s power constraints (25 Watts) significantly outperforms a traditional disaggregated system that utilizes an NVIDIA RTX 2080 Ti GPU (250 Watts). Further, the work highlights that disaggregation, the serverless model, and the limited power budget for computation in storage devices require a different design than the conventional practice of integrating microprocessors and FPGAs. This insight is in contrast with current practices of designing computational storage devices, which are yet to address the challenges associated with the shifts in datacenters. In comparison with two such conventional designs that use ARM cores or a Xilinx FPGA, DSCS-Serverless provides 3.7× and 1.7× end-to-end application speedup, 4.3× and 1.9× energy reduction, and 3.2× and 2.3× better cost efficiency, respectively.
16:30 PDT – 16:45 PDT

FUYAO: DPU-enabled Direct Data Transfer for Serverless Computing
Guowei Liu, Laiping Zhao, Yiming Li, Zhaolin Duan, Sheng Chen, and Yitao Hu (Tianjin University) ;Zhiyuan Su (Inspur Electronic Information Industry) ;Wenyu Qu (Tianjin University)

Abstract: Serverless computing typically relies on third-party forwarding to transmit data between functions. This method couples control flow and data flow together, resulting in significantly slow data transmission. This challenge makes it difficult for the serverless computing paradigm to meet the low-latency requirements of web services. To solve this problem, we propose decoupling the control flow from the data flow, enabling direct data transfer between functions. We introduce Fuyao, the first intermediate data transfer solution capable of reducing data transfer latency to the sub-millisecond level. Fuyao provides four different data transfer methods to cater to diverse data transfer requirements within or between nodes. For function pairs that communicate frequently, Fuyao builds a stateful direct connection between them, enabling rapid inter-function data exchange. We evaluate Fuyao using representative real-world benchmarks. Experimental results show that Fuyao outperforms state-of-the-art systems by up to 57× in latency.
16:45 PDT – 17:00 PDT

DataFlower: Exploiting the Data-flow Paradigm for Serverless Workflow Orchestration
Zijun Li, Chuhao Xu, Quan Chen, Jieru Zhao, Chen Chen, and Minyi Guo (Shanghai Jiao Tong University)

Abstract: Serverless computing that runs functions with auto-scaling is a popular task execution pattern in the cloud-native era. By connecting serverless functions into workflows, tenants can achieve complex functionality. Prior research adopts the control-flow paradigm to orchestrate a serverless workflow. However, the control-flow paradigm inherently results in long response latency, due to the heavy data persistence overhead, sequential resource usage, and late function triggering. Our investigation shows that the data-flow paradigm has the potential to resolve the above problems, with careful design and optimization. We propose DataFlower, a scheme that achieves the data-flow paradigm for serverless workflows. In DataFlower, a container is abstracted to be a function logic unit and a data logic unit. The function logic unit runs the functions, and the data logic unit handles the data transmission asynchronously. Moreover, a host-container collaborative communication mechanism is used to support efficient data transfer. Our experimental results show that compared to state-of-the-art serverless designs, DataFlower reduces the 99%-ile latency of the benchmarks by up to 35.4%, and improves the peak throughput by up to 3.8X.

Session 10C: ML Sparsity and Dynamic Shapes
(Location: Scripps I/II)
16:00 PDT – 16:15 PDT

Fractal: Joint Multi-Level Sparse Pattern Tuning of Accuracy and Performance for DNN Pruning
Yue Guan, Changming Yu, Yangjie Zhou, Jingwen Leng, Chao Li, and Minyi Guo (Shanghai Jiao Tong University and Shanghai Qizhi Institute)

Abstract: Model pruning, which eliminates redundant parameters and reduces computational complexity, has emerged as a viable strategy for efficient deep neural network (DNN) deployment. Owing to the irregular memory access and computation patterns of sparse DNN models after pruning, existing work has proposed various structured sparse patterns to enhance sparse DNN performance. In this work, we propose a unique perspective that understands existing sparse pattern designs as computation skipping after tiling the tensor computation into multi-level hierarchies. This unified perspective opens up a new design space of multi-level sparse tiling to maximize the sparsity benefits of DNNs, as opposed to the single-level choice in current practice. We present Fractal, an auto-tuning system for sparse patterns that identifies the optimal multi-level sparse tiling pattern. We introduce PatternIR, a novel high-level intermediate representation (IR), to express a diverse range of multi-level sparse patterns. By leveraging insights from prior dense operator optimizations, we translate PatternIR into low-level compiler IRs, facilitating further operator optimization and code generation. Our evaluations demonstrate that Fractal yields substantial average speedups of up to 3.16× on CUDA Cores and 2.52× on Tensor Cores of GPUs compared to the state-of-the-art dense baseline under 75% sparsity, while keeping accuracy degradation minimal compared to prior sparse operator libraries.
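The "computation skipping after multi-level tiling" view can be illustrated with a toy dense-times-sparse kernel that tiles the pruned weight matrix and then skips work at two levels: whole tiles that are entirely zero, and zero sub-blocks inside a non-zero tile. A minimal NumPy sketch; the tile sizes and the implied pruning pattern are illustrative, not Fractal's IR or code generator.

```python
import numpy as np

def two_level_sparse_matmul(x, w, t1=32, t2=8):
    """y = x @ w, skipping zeros at two tiling levels of the (pruned) weight w.
    Level 1: t1 x t1 tiles of w; level 2: t2 x t2 sub-blocks inside a tile."""
    m, k = x.shape
    k2, n = w.shape
    assert k == k2 and k % t1 == 0 and n % t1 == 0 and t1 % t2 == 0
    y = np.zeros((m, n), dtype=x.dtype)
    for i in range(0, k, t1):
        for j in range(0, n, t1):
            tile = w[i:i+t1, j:j+t1]
            if not tile.any():                 # level-1 skip: the whole tile is zero
                continue
            for ii in range(0, t1, t2):
                for jj in range(0, t1, t2):
                    sub = tile[ii:ii+t2, jj:jj+t2]
                    if not sub.any():          # level-2 skip: zero sub-block
                        continue
                    y[:, j+jj:j+jj+t2] += x[:, i+ii:i+ii+t2] @ sub
    return y
```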
16:15 PDT – 16:30 PDT

Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization
Feng Yu, Guangli Li, Jiacheng Zhao, Huimin Cui, and Xiaobing Feng (Chinese Academy of Sciences) ;Jingling Xue (University of New South Wales)

Abstract: In recent times, dynamic-shape neural networks have gained widespread usage in intelligent applications to address complex tasks, introducing challenges in optimizing tensor programs due to their dynamic nature. As the operators’ shapes are determined at runtime in dynamic scenarios, the compilation process becomes expensive, limiting the practicality of existing static-shape tensor compilers. To address the need for effective and efficient optimization of dynamic-shape neural networks, this paper introduces MikPoly, a novel dynamic-shape tensor compiler based on micro-kernel polymerization. MikPoly employs a two-stage optimization approach, dynamically combining multiple statically generated micro-kernels using a lightweight cost model based on the shape of a tensor operator known at runtime. We evaluate the effectiveness of MikPoly by employing popular dynamic-shape operators and neural networks on two representative accelerators, namely GPU Tensor Cores and Ascend NPUs. Our experimental results demonstrate that MikPoly effectively optimizes dynamic-shape workloads, yielding an average performance improvement of 1.49× over state-of-the-art vendor libraries.
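The two-stage idea, generating a small library of micro-kernels offline and combining them at runtime once the operator shape is known, can be sketched as a lookup guided by a lightweight cost model. The tile sizes and the cost model below are hypothetical placeholders for illustration, not MikPoly's implementation.

```python
# Toy sketch: pick a pre-generated micro-kernel tile size for a GEMM whose
# (M, N, K) shape only becomes known at runtime.
import math

MICRO_KERNELS = [(128, 128), (64, 128), (32, 64), (16, 16)]   # (tile_m, tile_n), compiled offline

def cost(shape, tile):
    """Hypothetical cost model: number of tile invocations, penalized by padding waste."""
    M, N, K = shape
    tm, tn = tile
    tiles = math.ceil(M / tm) * math.ceil(N / tn)
    waste = (math.ceil(M / tm) * tm * math.ceil(N / tn) * tn) / (M * N)
    return tiles * waste

def polymerize(shape):
    """At runtime, choose the micro-kernel with the lowest modeled cost for this shape."""
    return min(MICRO_KERNELS, key=lambda t: cost(shape, t))

print(polymerize((1000, 768, 768)))   # prints the tile choice with the lowest modeled cost
```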
16:30 PDT – 16:45 PDT

DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores
Ruibo Fan, Wei Wang, and Xiaowen Chu (Hong Kong University of Science and Technology)

Abstract: Sparse Matrix-Matrix Multiplication (SpMM) is a building-block operation in scientific computing and machine learning applications. Recent advancements in hardware, notably Tensor Cores (TCs), have created promising opportunities for accelerating SpMM. However, harnessing these hardware accelerators to speed up general SpMM necessitates considerable effort. In this paper, we undertake a comprehensive analysis of the state-of-the-art techniques for accelerating TC-based SpMM and identify crucial performance gaps. Drawing upon these insights, we propose DTC-SpMM, a novel approach with systematic optimizations tailored for accelerating general SpMM on TCs. DTC-SpMM encapsulates diverse aspects, including efficient compression formats, reordering methods, and runtime pipeline optimizations. Our extensive experiments on modern GPUs with a diverse range of benchmark matrices demonstrate remarkable performance improvements in SpMM acceleration by TCs in conjunction with our proposed optimizations. The case study also shows that DTC-SpMM speeds up end-to-end GNN training by up to 1.91× against popular GNN frameworks.
16:45 PDT – 17:00 PDT

SoD2: Statically Optimizing Dynamic Deep Neural Network Execution
Wei Niu and Gagan Agrawal (University of Georgia) ;Bin Ren (William & Mary)

Abstract: Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution, are becoming common. This paper presents SoD2, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains. Our evaluation results show that SoD2 runs up to 3.9× faster than these systems while saving up to 88% peak memory consumption.
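The flavor of determining shapes as known constants, symbolic constants, or expressions over them can be illustrated with a tiny symbolic shape-propagation sketch. The operator set and names below are illustrative only; this is not SoD2's operator classification or RDP implementation.

```python
# Toy illustration of propagating tensor shapes whose dimensions are known
# constants, symbolic constants, or expressions over them.
from functools import reduce
from operator import mul
import sympy as sp

B, L = sp.symbols("B L", positive=True, integer=True)   # symbolic batch size and sequence length

def numel(shape):
    return reduce(mul, shape, 1)

def matmul_shape(a, b):
    assert a[-1] == b[0], "inner dimensions must agree"
    return (*a[:-1], b[1])

def reshape_shape(a, new_shape):
    # The reshape is valid only if element counts match symbolically.
    assert sp.simplify(numel(a) - numel(new_shape)) == 0
    return new_shape

x = (B, L, 768)                        # input: symbolic B, L and known constant 768
w = (768, 3072)                        # static weight
h = matmul_shape(x, w)                 # -> (B, L, 3072)
h2 = reshape_shape(h, (B * L, 3072))   # -> (B*L, 3072): an expression over symbolic constants
print(h, h2)
```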

Session 10D: Trusted Computing
(Location: Fairway I/IV)
16:00 PDT – 16:15 PDT

SEVeriFast: Minimizing the root of trust for fast startup of SEV microVMs
Benjamin Holmes (MIT and Vassar College) ;Jason Waterman (Vassar College) ;Dan Williams (Virginia Tech)

Abstract: Serverless computing platforms rely on fast container initialization to provide low latency and high throughput for requests. While hardware enforced trusted execution environments (TEEs) have gained popularity, confidential computing has yet to be widely adopted by latency-sensitive platforms due to its additional initialization overhead. We investigate the application of AMD’s Secure Encrypted Virtualization (SEV) to microVMs and find that current startup times for confidential VMs are prohibitively slow due to the high cost of establishing a root of trust for each new VM. We present SEVeriFast, a new bootstrap scheme for SEV VMs that reevaluates current microVM techniques for fast boot, such as eliminating bootstrap stages and bypassing guest kernel decompression. Counter-intuitively, we find that introducing an additional bootstrap component and reintroducing kernel compression optimizes the cold boot performance of SEV microVMs by reducing the cost of measurement on the critical boot path and producing a minimal root of trust. To our knowledge, SEVeriFast is the first work to explore the trade-offs associated with booting confidential microVMs and provide a set of guiding principles as a step toward confidential serverless. We show that SEVeriFast improves cold start performance of SEV VMs over current methods by 86-93%.
16:15 PDT – 16:30 PDT

sIOPMP: Scalable and Efficient I/O Protection for TEEs
Erhu Feng (Shanghai Jiao Tong University) ;Dahu Feng (Tsinghua University) ;Dong Du and Yubin Xia (Shanghai Jiao Tong University) ;Wenbin Zheng and Siqi Zhao (Alibaba DAMO Academy) ;Haibo Chen (Shanghai Jiao Tong University)

Abstract: Trusted Execution Environments (TEEs), like Intel SGX/TDX, AMD SEV-SNP, and ARM TrustZone/CCA, have been widely adopted in prevailing architectures. However, these TEEs typically do not consider I/O isolation (e.g., defending against malicious DMA requests) as a first-class citizen, which may degrade I/O performance. Traditional methods like using an IOMMU or software I/O can degrade throughput by at least 20% for I/O-intensive workloads. The main reason is that the isolation requirements for I/O devices differ from those for CPUs. This paper proposes a novel I/O isolation mechanism for TEEs, named sIOPMP (scalable I/O Physical Memory Protection), with three key features. First, we design a Multi-stage-Tree-based checker, supporting more than 1,000 hardware regions. Second, we classify devices into hot and cold, and support an unlimited number of devices with the mountable entry. Third, we propose a remapping mechanism to switch devices between hot and cold status for dynamic I/O workloads. Evaluation results show that sIOPMP introduces only negligible performance overhead for both benchmarks and real-world workloads, and improves network throughput by 20%–38% compared with the IOMMU-based mechanisms or software I/O adopted in TEEs.
16:30 PDT – 16:45 PDT

A Midsummer Night’s Tree: Efficient and High Performance Secure SCM
Samuel Thomas (Brown University) ;Kidus Workneh (University of Colorado Boulder) ;Jac McCarty (Bryn Mawr College) ;Joseph Izraelevitz and Tamara Lehman (University of Colorado Boulder) ;R. Iris Bahar (Colorado School of Mines)

Abstract: Secure memory is a highly desirable property to prevent memory corruption-based attacks. The emergence of nonvolatile, storage class memory (SCM) devices presents new challenges for secure memory. Metadata for integrity verification, organized in a Bonsai Merkle Tree (BMT), is cached on-chip in volatile caches and may be lost on a power failure. As a consequence, care is required to ensure that metadata updates are always propagated into SCM. To optimize metadata updates, state-of-the-art approaches propose lazy-update, crash-consistent metadata schemes. However, few consider the implications of their optimizations on on-chip area, which leads to inefficient utilization of scarce on-chip space. In this paper, we propose A Midsummer Night’s Tree (AMNT), a novel “tree within a tree” approach that provides crash-consistent integrity with low run-time overhead while limiting the on-chip area used for security metadata. Our approach offloads the potential hardware complexity of our technique to software to keep area overheads low. Our proposed mechanism results in significant improvements (a 41% reduction in execution overhead on average versus the state-of-the-art) for in-memory storage applications while significantly reducing the on-chip area required to implement our protocol.
16:45 PDT – 17:00 PDT

Veil: A Protected Services Framework for Confidential Virtual Machines
Adil Ahmad (Arizona State University) ;Botong Ou and Congyu Liu (Purdue University) ;Xiaokuan Zhang (George Mason University) ;Pedro Fonseca (Purdue University)

Abstract: Confidential virtual machines (CVMs) enabled by AMD SEV provide a protected environment for sensitive computations on an untrusted cloud. Unfortunately, CVMs are typically deployed with huge and vulnerable operating system kernels, exposing the CVMs to attacks that exploit kernel vulnerabilities. Veil is a versatile CVM framework that efficiently provides critical system services, like shielding sensitive programs, that cannot be entrusted to the buggy kernel. Veil leverages a new hardware primitive, virtual machine privilege levels (VMPL), to install a privileged security monitor inside the CVM. We overcome several challenges in designing Veil, including (a) creating unlimited secure domains with a limited number of VMPLs, (b) establishing resource-efficient domain switches, and (c) maintaining backward compatibility with commodity kernels with only minor changes. Our evaluation shows that Veil incurs no discernible performance slowdown during normal CVM execution while incurring a modest overhead (2–64%) when running its protected services across real-world use cases.

Session 11A: Cryptography and Privacy
(Location: Grande C; Ends 18:45 PDT)
17:30 PDT – 17:45 PDT

Accelerating Multi-Scalar Multiplication for Efficient Zero Knowledge Proofs with Multi-GPU Systems
Zhuoran Ji and Zhiyuan Zhang (Shandong University) ;Jiming Xu (Ant Group) ;Lei Ju (Shandong University)

Abstract: Zero-knowledge proof is a cryptographic primitive that allows for the validation of statements without disclosing any sensitive information, foundational in applications like verifiable outsourcing and digital currency. However, the extensive proof generation time limits its widespread adoption. Even with GPU acceleration, proof generation can still take minutes, with Multi-Scalar Multiplication (MSM) accounting for about 78.2% of the workload. To address this, we present DistMSM, a novel MSM algorithm tailored for distributed multi-GPU systems. At the algorithmic level, DistMSM adapts Pippenger’s algorithm for multi-GPU setups, effectively identifying and addressing bottlenecks that emerge during scaling. At the GPU kernel level, DistMSM introduces an elliptic curve arithmetic kernel tailored for contemporary GPU architectures. It optimizes register pressure with two innovative techniques and leverages tensor cores for specific big integer multiplications. Compared to state-of-the-art MSM implementations, DistMSM offers an average 6.39× speedup across various elliptic curves and GPU counts. An MSM task that previously took seconds on a single GPU can now be completed in mere tens of milliseconds. It showcases the substantial potential and efficiency of distributed multi-GPU systems in ZKP acceleration.
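For background, Pippenger's bucket method, the algorithm DistMSM adapts for multi-GPU execution, splits each scalar into c-bit windows, accumulates points into 2^c − 1 buckets per window, and combines the buckets with a running-sum trick. The minimal single-threaded sketch below uses plain integers under addition as a stand-in for elliptic-curve points, so the group arithmetic stays readable; it is not DistMSM's GPU kernel.

```python
def pippenger_msm(scalars, points, c=4):
    """Compute sum(k_i * P_i) with the bucket method.
    'points' are plain integers here as a stand-in for elliptic-curve points."""
    nbits = max(s.bit_length() for s in scalars)
    nwin = (nbits + c - 1) // c
    total = 0
    for w in reversed(range(nwin)):                 # most-significant window first
        buckets = [0] * ((1 << c) - 1)              # bucket[b-1] collects points whose window value is b
        for s, p in zip(scalars, points):
            b = (s >> (w * c)) & ((1 << c) - 1)
            if b:
                buckets[b - 1] += p
        # Running-sum trick: computes sum_b b * bucket[b] using only additions.
        running, window_sum = 0, 0
        for acc in reversed(buckets):
            running += acc
            window_sum += running
        total = (total << c) + window_sum           # shift by c bits (c doublings), then add
    return total

scalars = [13, 7, 255]
points = [3, 5, 11]
assert pippenger_msm(scalars, points) == sum(s * p for s, p in zip(scalars, points))
```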
17:45 PDT – 18:00 PDT

LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models
Juntaek Lim, Youngeun Kwon, and Ranggi Hwang (KAIST) ;Kiwan Maeng (Pennsylvania State University) ;Edward Suh (FAIR at Meta and Cornell University) ;Minsoo Rhu (KAIST)

Abstract: Differential privacy (DP) is widely employed in industry as a practical standard for privacy protection. While private training of computer vision or natural language processing applications has been studied extensively, the computational challenges of training recommender systems (RecSys) with DP have not been explored. In this work, we first present a detailed characterization of private RecSys training using DP-SGD, root-causing its several performance bottlenecks. Specifically, we identify that DP-SGD’s noise sampling and noisy gradient update stages suffer from severe compute and memory bandwidth limitations, respectively, causing significant performance overhead when training private RecSys. Based on these findings, we propose LazyDP, an algorithm-software co-design that addresses the compute and memory challenges of training RecSys with DP-SGD. Compared to a state-of-the-art DP-SGD training system, we demonstrate that LazyDP provides an average 119× training throughput improvement while also ensuring that mathematically equivalent, differentially private RecSys models are trained.
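For context on why those stages dominate, recall the standard DP-SGD step: per-example gradients are clipped, summed, and perturbed with Gaussian noise of the same dimensionality as the model before the update; for embedding-heavy RecSys models that noise tensor spans the entire embedding table. The sketch below is the textbook DP-SGD update with illustrative hyperparameters, not LazyDP's optimized pipeline.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.0):
    """One standard DP-SGD update: clip each example's gradient to L2 norm <= clip,
    sum the clipped gradients, add Gaussian noise scaled to the clipping bound, update."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    # The noise has the full dimensionality of the parameters -- the costly part
    # for models dominated by huge embedding tables.
    noise = np.random.normal(0.0, noise_mult * clip, size=params.shape)
    grad = (summed + noise) / len(per_example_grads)
    return params - lr * grad

params = np.zeros(10)
grads = [np.random.randn(10) for _ in range(4)]
params = dp_sgd_step(params, grads)
```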
18:00 PDT – 18:15 PDT

BitPacker: Enabling High Arithmetic Efficiency in Fully Homomorphic Encryption Accelerators
Nikola Samardzic and Daniel Sanchez (MIT)

Abstract: Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data. Though FHE is slow on a CPU, recent hardware accelerators compensate for most of FHE’s overheads, enabling real-time performance in complex programs like deep neural networks. However, the state-of-the-art FHE scheme, CKKS, is inefficient on accelerators. CKKS represents encrypted data using integers of widely different sizes (typically 30 to 60 bits). This leaves many bits unused in registers and arithmetic datapaths. This overhead is minor in CPUs, but accelerators are dominated by multiplications, so poor utilization causes large area and energy overheads. We present BitPacker, a new implementation of CKKS that keeps encrypted data packed in fixed-size words, enabling near-full arithmetic efficiency in accelerators. BitPacker is the first redesign of an FHE scheme that targets accelerators. On a state-of-the-art accelerator, BitPacker improves performance by gmean 59% and by up to 3×, and reduces energy by gmean 59%. BitPacker does not reduce precision and can be applied to all prior accelerators without hardware changes.
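The storage-layout idea, keeping variable-width values contiguous in fixed-size machine words rather than one value per oversized register, can be illustrated with a generic bit-packer. This is only a toy of the packing concept with arbitrary widths and values; it is not BitPacker's CKKS-specific encoding or datapath.

```python
def pack(values, width, word_bits=64):
    """Pack 'width'-bit integers back-to-back into fixed-size words."""
    words, cur, used = [], 0, 0
    for v in values:
        assert 0 <= v < (1 << width)
        cur |= v << used
        used += width
        while used >= word_bits:
            words.append(cur & ((1 << word_bits) - 1))
            cur >>= word_bits
            used -= word_bits
    if used:
        words.append(cur)
    return words

def unpack(words, width, count, word_bits=64):
    bits = 0
    for i, w in enumerate(words):
        bits |= w << (i * word_bits)          # reassemble the contiguous bit stream
    return [(bits >> (i * width)) & ((1 << width) - 1) for i in range(count)]

vals = [123456789, 987654321, 42, 2**35 - 1]
assert unpack(pack(vals, width=36), width=36, count=len(vals)) == vals
```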
18:15 PDT – 18:30 PDT

ZENO: A Type-based Optimization Framework for Zero Knowledge Neural Network Inference
Boyuan Feng, Zheng Wang, Yuke Wang, Shu Yang, and Yufei Ding (University of California Santa Barbara)

Abstract: Zero knowledge Neural Networks draw increasing attention for guaranteeing computation integrity and privacy of neural networks (NNs) based on zero-knowledge Succinct Non-interactive ARgument of Knowledge (zkSNARK) security scheme. However, the performance of zkSNARK NNs is far from optimal due to the million-scale circuit computation with heavy scalar-level dependency. In this paper, we propose a type-based optimizing framework for efficient zero-knowledge NN inference, namely ZENO (ZEro knowledge Neural network Optimizer). We first introduce ZENO language construct to maintain high-level semantics and the type information (e.g., privacy and tensor) for allowing more aggressive optimizations. We then propose privacy-type driven and tensor-type driven optimizations to further optimize the generated zkSNARK circuit. Finally, we design a set of NN-centric system optimizations to further accelerate zkSNARK NNs. Experimental results show that ZENO achieves up to 8.5× end-to-end speedup than state-of-the-art zkSNARK NNs. We reduce proof time for VGG16 from 6 minutes to 48 seconds, which makes zkSNARK NNs practical.
18:30 PDT – 18:45 PDT

Performance-aware Scale Analysis with Reserve for Homomorphic Encryption
Yongwoo Lee, Seonyoung Cheon, and Dongkwan Kim (Yonsei University) ;Dongyoon Lee (Stony Brook University) ;Hanjun Kim (Yonsei University)

Abstract: Thanks to its ability to compute on encrypted data and its efficient fixed-point execution, the RNS-CKKS fully homomorphic encryption (FHE) scheme is a promising solution for privacy-preserving machine learning services. However, writing an efficient RNS-CKKS program is challenging due to its manual scale management requirement. Each ciphertext has a scale value and a maximum scale capacity. Since each RNS-CKKS multiplication increases the scale, programmers should properly rescale a ciphertext by reducing the scale and capacity together. Existing compilers reduce the programming burden by automatically analyzing and managing the scales of ciphertexts, but they either conservatively rescale ciphertexts and thus give up further optimization opportunities, or require time-consuming scale management space exploration. This work proposes a new performance-aware static scale analysis for RNS-CKKS programs, which generates an efficient scale management plan without expensive space exploration. This work analyzes the scale budget, called “reserve”, of each ciphertext in a backward manner from the end of the program and redistributes the budgets to the ciphertexts, thus enabling performance-aware scale management. This work also designs a new type system for the proposed scale analysis and ensures the correctness of the analysis result. This work achieves a 41.8% performance improvement over EVA, which uses conservative static scale analysis. It also shows performance improvement similar to that of exploration-based Hecate, yet with 15526× faster scale management.

Session 11B: Scheduling
(Location: Grande D/E)
17:30 PDT – 17:45 PDT

Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
Zizhao Mo, Huanle Xu, and Chengzhong Xu (University of Macau)

Abstract: Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such as computation and communication. This heterogeneity poses a significant challenge for the elastic scheduling of deep learning workloads. Unfortunately, existing elastic schedulers often overlook the impact of heterogeneity on scaling efficiency, resulting in considerably prolonged job completion times. In this paper, we present Heet, a new Heterogeneity-aware system explicitly developed for elastic training in DL clusters. Heet addresses two critical issues. First, it utilizes a 3-D collaborative filtering method to accurately measure the scaling efficiency of all elastic configurations on heterogeneous hosts, substantially reducing profiling overhead. Second, Heet introduces a unique price function to effectively balance scaling efficiency and scheduling efficiency. Building upon this function, Heet incorporates a scalable mechanism that employs minimum-weight full bipartite matching and opportunistic resource trading to generate dynamic scheduling decisions. Evaluations conducted on cloud clusters and large-scale simulations demonstrate that Heet can reduce job completion time by up to 2.46× compared to existing solutions.
17:45 PDT – 18:00 PDT

Efficient Microsecond-scale Blind Scheduling with Tiny Quanta
Zhihong Luo, Sam Son, and Dev Bali (University of California Berkeley) ;Emmanuel Amaro (VMware Research) ;Amy Ousterhout (University of California San Diego) ;Sylvia Ratnasamy (University of California Berkeley) ;Scott Shenker (ICSI and University of California Berkeley)

Abstract: A longstanding performance challenge in datacenter-based applications is how to efficiently handle incoming client requests that spawn many very short (μs-scale) jobs that must be handled with high throughput and low tail latency. When no assumptions are made about the duration of individual jobs, or even about the distribution of their durations, this requires blind scheduling with frequent and efficient preemption, which is not scalably supported for μs-level tasks. We present Tiny Quanta (TQ), a system that enables efficient blind scheduling of μs-level workloads. TQ performs fine-grained preemptive scheduling and does so with high performance via a novel combination of two mechanisms: forced multitasking and two-level scheduling. Evaluations with a wide variety of μs-level workloads show that TQ achieves low tail latency while sustaining 1.2× to 6.8× the throughput of prior blind scheduling systems.
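The case for blind, fine-grained preemption can be reproduced with a toy simulation: when job durations are heavy-tailed and unknown to the scheduler, run-to-completion lets a few long jobs inflate the tail latency of the many short ones, while a tiny scheduling quantum keeps short jobs moving. The workload mix, quantum, and single-worker model below are illustrative assumptions, not TQ's mechanisms; all jobs are assumed to arrive at time 0.

```python
import random

def simulate(jobs, quantum=None):
    """Completion time of each job on one worker; all jobs arrive at time 0.
    quantum=None -> run-to-completion in arrival order; otherwise round-robin."""
    if quantum is None:
        t, done = 0.0, []
        for d in jobs:
            t += d
            done.append(t)
        return done
    remaining, done = list(jobs), [0.0] * len(jobs)
    t, queue = 0.0, list(range(len(jobs)))
    while queue:
        i = queue.pop(0)
        slice_ = min(quantum, remaining[i])
        t += slice_
        remaining[i] -= slice_
        if remaining[i] > 1e-12:
            queue.append(i)            # preempted: go to the back of the queue
        else:
            done[i] = t
    return done

random.seed(0)
# 99% ~1 us jobs, 1% ~1000 us jobs; the scheduler does not know which is which.
jobs = [1.0 if random.random() < 0.99 else 1000.0 for _ in range(2000)]
for name, q in [("run-to-completion", None), ("round-robin, 2 us quantum", 2.0)]:
    lat = sorted(l for l, d in zip(simulate(jobs, q), jobs) if d == 1.0)
    print(f"{name}: p99 latency of short jobs = {lat[int(0.99 * len(lat))]:.0f} us")
```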
18:00 PDT – 18:15 PDT

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines
Seyedali Jokar Jandaghi and Kaveh Mahdaviani (University of Toronto) ;Amirhossein Mirhosseini (University of Michigan) ;Sameh Elnikety (Microsoft Research) ;Cristiana Amza and Bianca Schroeder (University of Toronto)

Abstract: In an effort to increase the utilization of data center resources, cloud providers have introduced a new type of virtual machine (VM) offering, called a burstable VM (BVM). Our work is the first to study the characteristics of burstable VMs (based on traces from production systems at a major cloud provider) and resource allocation approaches for BVM workloads. We propose new approaches for BVM resource allocation and use extensive simulations driven by field data to compare them with two baseline approaches used in practice. We find that traditional approaches based on a fixed oversubscription ratio or on the Central Limit Theorem do not work well for BVMs: they lead to either low utilization or high server capacity violation rates. Based on the lessons learned from our workload study, we develop a new approach to BVM scheduling, called Audible, using a non-parametric statistical model, which makes the approach lightweight and workload-independent and obviates the need to train machine learning models and tune their parameters. We show that Audible achieves high system utilization while being able to enforce stringent requirements on server capacity violations.
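The convolution idea in the title can be illustrated directly: if each BVM's CPU usage is modeled as an independent discrete distribution (a histogram over cores used), the distribution of aggregate demand on a host is the convolution of the individual histograms, from which a violation probability for a given capacity follows. The sketch below is a toy under assumed histograms and an independence assumption; it is not the paper's production model.

```python
import numpy as np

def aggregate_demand(histograms):
    """Convolve per-VM usage PMFs (index = cores used) into the PMF of total demand."""
    total = np.array([1.0])                  # PMF of "0 cores" before adding any VM
    for h in histograms:
        total = np.convolve(total, np.asarray(h, dtype=float))
    return total

def violation_probability(total_pmf, capacity):
    """Probability that summed demand exceeds the server's physical core count."""
    return float(total_pmf[capacity + 1:].sum())

# Hypothetical 2-core burstable VMs: mostly idle, occasionally bursting to 2 cores.
vm = [0.6, 0.3, 0.1]                         # P(0 cores), P(1 core), P(2 cores)
pmf = aggregate_demand([vm] * 24)            # pack 24 such VMs onto one host
print("P(demand > 16 cores) =", round(violation_probability(pmf, capacity=16), 4))
```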
18:15 PDT – 18:30 PDT

CPS: A Cooperative Para-virtualized Scheduling Framework for Manycore Machines
Yuxuan Liu, Tianqiang Xu, Zeyu Mi, Zhichao Hua, Binyu Zang, and Haibo Chen (Shanghai Jiao Tong University)

Abstract: Today’s cloud platforms offer large virtual machine (VM) instances with multiple virtual CPUs (vCPUs) on manycore machines. These machines typically have a deep memory hierarchy to enhance communication between cores. Although previous research has primarily focused on addressing the performance scalability issues caused by the double scheduling problem in virtualized environments, it mainly concentrated on solving the preemption problem of synchronization primitives and on the traditional NUMA architecture. This paper specifically targets a new aspect of scalability issues caused by the absence of runtime hypervisor-internal states (RHS). We demonstrate two typical RHS problems, namely the invisible pCPU (physical CPU) load and dynamic cache group mapping. These RHS problems result in a collapse in VM performance and low CPU utilization because the guest VM lacks visibility into the latest runtime internal states maintained by the hypervisor, such as pCPU load and vCPU-pCPU mappings. Consequently, the guest VM makes inefficient scheduling decisions. To address the RHS issue, we argue that the solution lies in exposing the latest scheduling decisions made by both the guest and host schedulers to each other. Hence, we present a cooperative para-virtualized scheduling framework called CPS, which facilitates the proactive exchange of timely scheduling information between the hypervisor and guest VMs. To ensure effective scheduling decisions for VMs, a series of techniques is proposed based on the exchanged information. We have implemented CPS in Linux KVM and have designed corresponding solutions to tackle the two RHS problems. Evaluation results demonstrate that CPS significantly improves the performance of PARSEC by 81.1% and FxMark by 1.01x on average for the two identified problems.

Session 11C: ML Training Optimizations
(Location: Scripps I/II)
17:30 PDT – 17:45 PDT

PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, and Yinhe Han (Chinese Academy of Sciences)

Abstract: With the rapid up-scaling of transformer-based large language models (LLMs), training these models places increasing demands on novel parallel training techniques. Tensor partitioning is an extensively researched parallel technique, encompassing data and model parallelism, and has a significant influence on LLM training performance. However, existing state-of-the-art parallel training systems are based on an incomplete tensor partitioning space, where the distribution of partitioned sub-operators is limited to the spatial dimension. We discover that introducing the temporal dimension into the tensor partitioning of an LLM training instance provides extra opportunities to avoid collective communication across devices, saving memory space and also overlapping device-to-device communication with computation. In this paper, we propose a new tensor partition primitive that distributes sub-operators along both the spatial and temporal dimensions to further explore communication and memory overhead reduction over current solutions. This new primitive creates a broader parallelization space and leads to parallel solutions that achieve better training throughput with lower peak memory occupancy compared to state-of-the-art techniques. To efficiently deploy optimized parallel transformer model training to multiple devices, we further present an optimization algorithm that can find optimal parallel solutions from our spatial-temporal tensor partition space within acceptable search time. Our evaluation shows that our optimized tensor partitioning achieves up to 1.68× training throughput with 69% of the peak memory occupancy of state-of-the-art distributed training systems when training LLMs. Upon scaling to 32 GPUs, the geo-mean speedup across benchmarks is 1.30×. When applied in 3D parallelism, up to 1.46× training throughput can be achieved.
17:45 PDT – 18:00 PDT

AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
Zhenbo Sun, Huanqi Cao, Yuanwei Wang, Guanyu Feng, Shengqi Chen, Haojie Wang, and Wenguang Chen (Tsinghua University)

Abstract: Large language models (LLMs) have demonstrated powerful capabilities, requiring huge memory with their increasing sizes and sequence lengths, thus demanding larger parallel systems. The broadly adopted pipeline parallelism introduces even heavier and unbalanced memory consumption. Recomputation is a widely employed technique to mitigate the problem but introduces extra computation overhead. This paper proposes AdaPipe, which aims to find the optimized recomputation and pipeline stage partitioning strategy. AdaPipe employs adaptive recomputation to maximize memory utilization and reduce the computation cost of each pipeline stage. A flexible stage partitioning algorithm is also adopted to balance the computation between different stages. We evaluate AdaPipe by training two representative models, GPT-3 (175B) and Llama 2 (70B), achieving up to 1.32× and 1.22× speedup on clusters with NVIDIA GPUs and Ascend NPUs respectively.
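The stage-partitioning half of the problem can be sketched in isolation: given per-layer costs (which, with adaptive recomputation, would already reflect any re-executed forward work), split the layer sequence into a fixed number of contiguous pipeline stages so that the most-loaded stage is as light as possible. Below is the classic linear-partition dynamic program for that step; the costs are illustrative and the sketch omits AdaPipe's cost model and its joint recomputation decisions.

```python
import itertools

def balance_stages(costs, num_stages):
    """Split 'costs' into contiguous stages minimizing the maximum stage cost.
    Returns (bottleneck_cost, list of stages as index lists)."""
    n = len(costs)
    prefix = [0.0] + list(itertools.accumulate(costs))
    INF = float("inf")
    # dp[s][i]: best bottleneck when the first i layers form s stages
    dp = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    dp[0][0] = 0.0
    for s in range(1, num_stages + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):                    # last stage holds layers j..i-1
                cand = max(dp[s - 1][j], prefix[i] - prefix[j])
                if cand < dp[s][i]:
                    dp[s][i], cut[s][i] = cand, j
    stages, i = [], n
    for s in range(num_stages, 0, -1):                   # backtrack the chosen cuts
        j = cut[s][i]
        stages.append(list(range(j, i)))
        i = j
    return dp[num_stages][n], stages[::-1]

layer_costs = [4, 2, 7, 3, 3, 6, 1, 5]                   # e.g. per-layer forward + recompute time
print(balance_stages(layer_costs, num_stages=3))
```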
18:00 PDT – 18:15 PDT

EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree
Zhaodong Chen (University of California Santa Barbara) ;Andrew Kerr, Richard Cai, Jack Kosaian, and Haicheng Wu (NVIDIA) ;Yufei Ding (University of California San Diego) ;Yuan Xie (The Hong Kong University of Science and Technology)

Abstract: As deep learning models become increasingly complex, deep learning compilers are critical for enhancing system efficiency and unlocking hidden optimization opportunities. Although excellent speedups have been achieved in inference workloads, existing compilers face significant limitations in training. Firstly, the training computation graph involves intricate operations that are challenging to fuse, such as normalization, loss functions, and reductions, which limits optimization opportunities like kernel fusion. Secondly, the training graph’s additional edges connecting forward and backward operators pose challenges in finding optimal and feasible partitions for kernel fusion. More importantly, existing compilers can neither generate kernels with state-of-the-art performance on modern GPUs nor accommodate diverse fusion patterns. In this paper, we introduce Epilogue Visitor Tree (EVT), a novel compiler that overcomes these limitations. EVT employs novel graph-level compilation passes to unlock hidden fusion and optimization opportunities. It also incorporates a novel integer linear programming-based partitioner that efficiently finds optimal and feasible partitions in complex joint forward-backward graphs. Moreover, we present the Epilogue Visitor Abstraction and introduce the EVT operator compiler that automatically generates flexible epilogues that can be integrated with high-performance main loop implementations from CUTLASS and other SOTA libraries. EVT is evaluated on diverse training workloads across domains and achieves 1.26∼3.1× speedup. (Work partially done during an internship at NVIDIA. The source code to reproduce the results is publicly available at https://github.com/apuaaChen/EVT_AE; the CUDA templates of EVT are actively maintained in the official CUTLASS repository at https://github.com/NVIDIA/cutlass.)
18:15 PDT – 18:30 PDT

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training
Hongzheng Chen (Cornell University) ;Cody Hao Yu and Shuai Zheng (Boson AI) ;Zhen Zhang (Amazon Web Services) ;Zhiru Zhang (Cornell University) ;Yida Wang (Amazon Web Services)

Abstract: Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at the price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language, Slapo, to decouple model execution from definition. Specifically, Slapo works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, Slapo progressively optimizes the model “as needed” through high-level primitives, thus largely preserving programmability and debuggability for users. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way using Slapo, we are able to improve training throughput by up to 2.92× on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.41× on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.

Session 11D: More Processing-In-Memory
(Location: Fairway I/IV)
17:30 PDT – 17:45 PDT

BVAP: Energy and Memory Efficient Automata Processing for Regular Expressions with Bounded Repetitions
Ziyuan Wen, Lingkun Kong, Alexis Le Glaunec, Konstantinos Mamouras, and Kaiyuan Yang (Rice University)

Abstract: Regular pattern matching is pervasive in applications such as text processing, malware detection, network security, and bioinformatics. Recent studies have demonstrated specialized in-memory automata processors with superior energy and memory efficiencies than existing computing platforms. Yet, they lack efficient support for the construct of bounded repetition that is widely used in regular expressions (regexes). This paper presents BVAP, a software-hardware co-designed in-memory Bit Vector Automata Processor. It is enabled by a novel theoretical model called Action-Homogeneous Nondeterministic Bit Vector Automata (AH-NBVA), its efficient hardware implementation, and a compiler that translates regexes into hardware configurations. BVAP is evaluated with a cycle-accurate simulator in a 28nm CMOS process, achieving 67-95% higher energy efficiency and 42-68% lower area, compared to state-of-the-art automata processors (CA, eAP, and CAMA), across a set of real-world benchmarks.
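The bit-vector flavor of such automata processing can be grounded with the textbook Shift-And algorithm: each bit of a machine word tracks whether a prefix of the pattern currently matches, and one shift-AND step per input character advances every state in parallel. BVAP's AH-NBVA model generalizes this style of bit-parallel state update to regexes with bounded repetitions; the sketch below is only the classic exact-matching baseline, not the paper's hardware or compiler.

```python
def shift_and_search(pattern, text):
    """Bit-parallel exact matching: returns end positions of matches of 'pattern' in 'text'."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)    # bit i set iff pattern[i] == ch
    accept = 1 << (len(pattern) - 1)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # Advance every partially matched prefix by one character, in parallel.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos)
    return hits

print(shift_and_search("abab", "xababab"))   # -> [4, 6]
```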
17:45 PDT – 18:00 PDT

IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System
Minseok Seo and Xuan Truong Nguyen (Seoul National University and Inter-university Semiconductor Research Center) ;Seok Joong Hwang (SAPEON) ;Yongkee Kwon, Guhyun Kim, Chanwook Park, Ilkon Kim, Jaehan Park, Jeongbin Kim, Woojae Shin, Jongsoon Won, Haerang Choi, Kyuyoung Kim, Daehan Kwon, and Chunseok Jeong (SK hynix) ;Sangheon Lee, Yongseok Choi, Wooseok Byun, and Seungcheol Baek (SAPEON) ;Hyuk-Jae Lee (Seoul National University and Inter-university Semiconductor Research Center) ;John Kim (KAIST)

Abstract: Accelerating end-to-end inference of transformer-based large language models (LLMs) is a critical component of AI services in datacenters. However, the diverse compute characteristics of LLMs’ end-to-end inference present challenges, as previously proposed accelerators only address certain operations or stages (e.g., self-attention, the generation stage, etc.). To address the unique challenges of accelerating end-to-end inference, we propose IANUS – Integrated Accelerator based on NPU-PIM Unified Memory System. IANUS is a domain-specific system architecture that combines a Neural Processing Unit (NPU) with Processing-in-Memory (PIM) to leverage both the NPU’s high computation throughput and the PIM’s high effective memory bandwidth. In particular, IANUS employs a unified main memory system where the PIM memory is used both for PIM operations and as the NPU’s main memory. The unified main memory system ensures that memory capacity is efficiently utilized and the movement of shared data between the NPU and the PIM is minimized. However, it introduces new challenges since normal memory accesses and PIM computations cannot be performed simultaneously. Thus, we propose novel PIM Access Scheduling that manages not only the scheduling of normal memory accesses and PIM computations but also the workload mapping across the PIM and the NPU. Our detailed simulation evaluations show that IANUS improves the performance of GPT-2 by 6.2× and 3.2×, on average, compared to the NVIDIA A100 GPU and a state-of-the-art accelerator, respectively. As a proof of concept, we develop a prototype of IANUS with a commercial PIM, an NPU, and an FPGA-based PIM controller to demonstrate the feasibility of IANUS.
18:00 PDT – 18:15 PDT

PIM-STM: Software Transactional Memory for Processing-In-Memory Systems
André Lopes, Daniel Castro, and Paolo Romano (IST/INESC-ID)

Abstract: Processing-In-Memory (PIM) is a novel approach that augments existing DRAM memory chips with lightweight logic. By allowing computations to be offloaded to the PIM system, this architecture circumvents the data-bottleneck problem that affects many modern workloads. This work tackles the problem of how to build efficient software implementations of the Transactional Memory (TM) abstraction by introducing PIM-STM, a library that provides a range of diverse TM implementations for UPMEM, the first commercial PIM system. Via an extensive study, we assess the efficiency of alternative choices in the design space of TM algorithms on this emerging architecture. We further quantify the impact of using different memory tiers of the UPMEM system (which have different trade-offs in terms of latency versus capacity) to store the metadata used by different TM implementations. Finally, we assess the gains achievable in terms of performance and memory efficiency when using PIM-STM to accelerate TM applications originally conceived for conventional CPU-based systems.
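As background on the abstraction such a library provides, the sketch below shows a deliberately simplified word-based STM: transactions buffer writes, record the versions they read, and validate those versions at commit time under a single global lock, retrying on conflict. It illustrates the TM programming model only; it is not PIM-STM's UPMEM-specific algorithms or its memory-tier placement.

```python
import threading

class TMVar:
    """A transactional variable: a value plus a version number."""
    def __init__(self, value):
        self.value, self.version = value, 0

_commit_lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.reads, self.writes = {}, {}      # var -> observed version / buffered value

    def read(self, var):
        if var in self.writes:                # read-your-own-writes
            return self.writes[var]
        self.reads.setdefault(var, var.version)
        return var.value

    def write(self, var, value):
        self.writes[var] = value              # buffered until commit

    def commit(self):
        with _commit_lock:                    # simplification: one global commit lock
            for var, seen in self.reads.items():
                if var.version != seen:       # another transaction changed it: abort
                    return False
            for var, value in self.writes.items():
                var.value, var.version = value, var.version + 1
            return True

def atomic(fn):
    """Retry 'fn(txn)' until its commit validates."""
    while True:
        txn = Transaction()
        result = fn(txn)
        if txn.commit():
            return result

acct_a, acct_b = TMVar(100), TMVar(0)
atomic(lambda t: (t.write(acct_a, t.read(acct_a) - 30), t.write(acct_b, t.read(acct_b) + 30)))
print(acct_a.value, acct_b.value)             # -> 70 30
```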
18:15 PDT – 18:30 PDT

CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators
Songyun Qu and Shixin Zhao (Chinese Academy of Sciences) ;Bing Li (Capital Normal University) ;Yintao He, Xuyi Cai, Lei Zhang, and Ying Wang (Chinese Academy of Sciences)

Abstract: In recent years, various computing-in-memory (CIM) processors have been presented, showing superior performance over traditional architectures. To unleash the potential of various CIM architectures, which differ in aspects such as device precision, crossbar size, and crossbar number, it is necessary to develop compilation tools that are fully aware of the CIM architectural details and implementation diversity. However, due to the lack of architectural support in current popular open-source compilation stacks such as TVM, existing CIM designs either deploy networks manually or build their own compilers, which is time-consuming and labor-intensive. Although some works expose specific CIM device programming interfaces to compilers, they are often bound to a fixed CIM architecture and lack the flexibility to support CIM architectures with different computing granularity. On the other hand, existing compilation work usually considers the scheduling of only a limited set of operation types (such as crossbar-bound matrix-vector multiplication). Unlike conventional processors, CIM accelerators are characterized by their diverse architectures, circuits, and devices, which cannot be abstracted at a single level if we seek to fully explore the advantages brought by CIM. Therefore, we propose CIM-MLC, a universal multi-level compilation framework for general CIM architectures. In this work, we first establish a general hardware abstraction for CIM architectures and computing modes to represent various CIM accelerators. Based on the proposed abstraction, CIM-MLC can compile tasks onto a wide range of CIM accelerators having different devices, architectures, and programming interfaces. More importantly, compared with existing compilation work, CIM-MLC can explore mapping and scheduling strategies across multiple architectural tiers in CIM, which form a tractable yet effective design space, to achieve better scheduling and instruction generation results. Experimental results show that CIM-MLC achieves 3.2× inference speedup on average compared to prior CIM-oriented compilation work.