The 9th EMC2 - Energy Efficient Machine Learning and Cognitive Computing
Co-located with the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024
Workshop Objective
With the advent of ChatGPT and other language models, generative AI and LLMs have captured the imagination of the whole world. A new wave of intelligent computing, driven by recent advances in machine learning and cognitive algorithms coupled with process technology and new design methodologies, has the potential to usher in unprecedented disruption in the way modern computing systems are designed and deployed. These new and innovative approaches often provide an attractive and efficient alternative not only in terms of performance but also power, energy, and area. This disruption is easily visible across the whole spectrum of computing systems, ranging from low-end mobile devices to large-scale data centers. Applications that benefit from efficient machine learning include computer vision and image processing, augmented/mixed reality, language understanding, speech and gesture recognition, malware detection, autonomous driving, and many more. Naturally, these applications have diverse requirements for performance, energy, reliability, accuracy, and security that demand a holistic approach to designing the hardware, software, and intelligence algorithms to achieve the best outcome.
Call for Papers
The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of ML and LLMs as they are practiced today and as they evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:
- Keynotes, invited talks and discussion panels by leading researchers from industry and academia
- Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
- Independent publication of proceedings through IEEE CPS
We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.
Topics for the Workshop
- Neural network architectures for resource constrained applications
- Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
- Power and performance efficient memory architectures suited for neural networks
- Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
- Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
- Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
- Optimizations to improve performance of training techniques including on-device and large-scale learning
- Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
- Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
08:15 - 08:30
Welcome
Welcome and Opening Remarks
Satyam Srivastava, d-Matrix
08:30 - 09:00
Invited Talk
Dense-and-Sparse Quantization Methods for Efficient LLM Serving
Amir Gholami, UC Berkeley
The availability of unprecedented amounts of unsupervised training data, along with neural scaling laws, has resulted in a surge in model size and in the compute required for serving and training LLMs. However, the main performance bottleneck for serving these models is increasingly shifting from compute to memory bandwidth. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, I will discuss our ongoing work on a new type of quantization scheme called dense-and-sparse quantization, which enables lossless compression to ultra-low precision, down to 2 bits, while achieving state-of-the-art accuracy. Dense-and-sparse quantization achieves this by decomposing the parameters and KV cache values into two components: a sparse component that includes outliers and sensitive values in the network, and a dense component that is amenable to low-precision compression. This allows us to achieve lossless compression of model parameters down to 3 bits, and of KV cache values down to 2 bits, enabling serving a LLaMA-7B model on a single A100 GPU even with a context length of 1M tokens.
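For readers who want a concrete picture of the decomposition described in this abstract, here is a minimal sketch: the largest-magnitude (outlier) weights are split off into a sparse full-precision matrix, and the dense remainder is uniformly quantized to a few bits. The function name, outlier fraction, and bit width are illustrative assumptions; this is not the speaker's actual implementation.

```python
# Minimal sketch of dense-and-sparse weight decomposition (illustrative only).
import torch

def dense_and_sparse_quantize(w: torch.Tensor, bits: int = 3, outlier_frac: float = 0.005):
    # 1. Keep the largest-magnitude entries as a sparse, full-precision component.
    thresh = torch.quantile(w.abs().flatten(), 1.0 - outlier_frac)
    outlier_mask = w.abs() > thresh
    sparse = (w * outlier_mask).to_sparse()

    # 2. Uniformly quantize the dense remainder to `bits` bits.
    dense = w * ~outlier_mask
    qmax = 2 ** (bits - 1) - 1
    scale = dense.abs().max() / qmax
    q = torch.clamp(torch.round(dense / scale), -qmax - 1, qmax)  # integer codes
    dense_deq = q * scale                                          # dequantized dense part
    return dense_deq, sparse, scale

w = torch.randn(1024, 1024)
dense_deq, sparse, scale = dense_and_sparse_quantize(w)
w_hat = dense_deq + sparse.to_dense()
print("relative reconstruction error:", ((w - w_hat).norm() / w.norm()).item())
```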
09:00 - 09:30
Invited Talk
Hardware-aware Algorithms for Language Modeling
Tri Dao, Princeton University
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. In the first half, we describe FlashAttention, a fast and memory-efficient exact attention algorithm. By making attention algorithms IO-aware (accounting for reads and writes between levels of GPU memory), FlashAttention is 4-8x faster than optimized baselines, enabling 4-16x longer context in Transformers and yielding higher-quality models. We will also describe optimizations for long-context LLM inference, leading to 2-8x faster end-to-end inference time. In the second half, we focus on subquadratic-time architectures such as structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks. The resulting Mamba architecture matches or exceeds the performance of strong modern Transformers on language modeling, validated at 1B and 3B scales on both pretraining and downstream evaluation, while enjoying 5x higher inference throughput and linear scaling in sequence length.
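As a rough illustration of the IO-aware idea, the sketch below computes exact attention one key/value block at a time using an online softmax, so the full sequence-by-sequence score matrix is never materialized. This is a plain PyTorch numerical sketch under assumed shapes and block size, not the fused FlashAttention GPU kernel.

```python
# Online-softmax attention over key/value blocks (numerical sketch only).
import math
import torch

def tiled_attention(q, k, v, block: int = 128):
    # q, k, v: (seq, d). Keys/values are streamed in blocks; running max,
    # denominator, and weighted sum are updated incrementally.
    seq, d = q.shape
    scale = 1.0 / math.sqrt(d)
    m = torch.full((seq, 1), float("-inf"))   # running row-wise max
    l = torch.zeros(seq, 1)                   # running softmax denominator
    acc = torch.zeros(seq, d)                 # running weighted sum of values
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                # scores for this key block
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)
        correction = torch.exp(m - m_new)     # rescale previous partial sums
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ vb
        m = m_new
    return acc / l

q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / math.sqrt(64), dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4))
```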
09:30 - 10:00
Paper Session #1
Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference
Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas
Demystifying Neuro-Symbolic AI via Workload Characterization and Cross-Layer Co-Design
Zishen Wan
Is Flash Attention Stable?
Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
10:00 - 10:15
Break
Break/Q&A
10:15 - 10:45
Invited Talk
Embracing Machine Learning for the Next Generation of Domain Specific Accelerators
Amir Yazdanbakhsh, Google DeepMind
In recent years, computer architecture research has been enriched by the advent of machine learning (ML) techniques. In this talk, we will discuss the interplay between ML and the design of domain-specific architectures. Then, we will delve into the synergies between these two domains, highlighting their collaborative potential to advance the design of computer systems. We will explore the diverse range of opportunities that ML affords for optimizing various aspects across the entire compute stack, from algorithmic to system-level optimization. Furthermore, we will embark on a journey towards Architecture 2.0, envisioning a future where ML-assisted architecture research takes center stage. This discussion will emphasize the importance of nurturing and fostering a community that embraces data-driven solutions in computer architecture design.
10:45 - 11:15
Invited Talk
Put LLMs on device? Challenges and new opportunities
Zechun Liu, Meta Reality Labs
Large language models (LLMs) are permeating various facets of human life, influencing not only communication and work but also shaping everyday entertainment experiences. Because of limitations in memory size and computational cost, there is an increasing demand for techniques that allow LLMs to be deployed on smartphones and mobile devices. To address these challenges, a new research direction has emerged that focuses on downsizing LLMs for on-device inference, including techniques such as model compression, deployment acceleration, and small model design. In this talk, we will discuss the constraints and solutions for optimizing models in on-device use cases, as well as practical methods for LLM quantization.
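As a concrete, deliberately simplified illustration of the kind of low-bit weight quantization used to fit LLMs on phones, the sketch below applies group-wise asymmetric uniform quantization to a weight matrix. The bit width, group size, and function name are assumptions for illustration, not the speaker's method.

```python
# Group-wise asymmetric low-bit weight quantization (illustrative sketch).
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group: int = 64):
    # Each group of `group` consecutive weights gets its own scale/zero-point.
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group, group)
    wmin = wg.min(dim=-1, keepdim=True).values
    wmax = wg.max(dim=-1, keepdim=True).values
    qmax = 2 ** bits - 1
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((wg - wmin) / scale), 0, qmax)   # integer codes
    deq = (q * scale + wmin).reshape(out_f, in_f)                # dequantized weights
    return q.to(torch.uint8), scale, wmin, deq

w = torch.randn(256, 1024)
q, scale, zero, deq = quantize_groupwise(w)
print("mean abs quantization error:", (w - deq).abs().mean().item())
```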
11:15 - 11:45
Invited Talk
Rapid LLM deployments: with great power comes great responsibility
Esha Choukse, Microsoft Research
With the ubiquitous use cases of modern LLMs, the deployment scale of these models is unprecedented. This has led to a large-scale datacenter expansion with GPUs that is currently running into an energy wall worldwide. This talk will focus on the properties of generative LLMs that can be used to make the deployment of these models more power-efficient. The talk will also introduce POLCA and Splitwise, two techniques to reduce the power consumption of LLM serving.
11:45 - 12:05
Invited Talk
12:05 - 13:40
Lunch Break
Poster Session / Lunch Break
13:40 - 14:00
Paper Session #2
Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference
Deepika Bablani, Jeffrey McKinstry, Steven Esser, Rathinakumar Appuswamy, Dharmendra Modha
Accurate Block Quantization in LLMs with Outliers
Ilya Soloveychik, Nikita Trukhanov
14:00 - 15:00
Keynote
Efficient Multi-modal LLM
Song Han, MIT, NVIDIA
This talk presents efficient multi-modal LLM innovations and system implementations. I'll first present VILA, a visual language model pre-training recipe that goes beyond visual instruction tuning, enabling multi-image reasoning and in-context learning. This is followed by SmoothQuant and AWQ for LLM quantization, and the TinyChat inference library. AWQ and TinyChat make VILA 2.7B deployable on Jetson Orin Nano, bringing new opportunities for mobile vision applications. Second, I'll present efficient representation learning, including EfficientViT for high-resolution vision, accelerating SAM by 48x without performance loss, and condition-aware neural networks (CAN), a novel way to add control to diffusion models. Third, I'll present StreamingLLM, a KV cache optimization technique for long conversations, and LongLoRA, which uses sparse, shifted attention for long-context LLMs. Finally, I'll present PockEngine for efficient LLM fine-tuning. Many of these techniques have been incorporated into NVIDIA's large language model optimization library, TensorRT-LLM.
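To make the KV cache optimization mentioned for StreamingLLM more concrete, the sketch below implements a cache policy that keeps a few initial "attention sink" tokens plus a sliding window of the most recent tokens, evicting everything in between. The class name, sink count, and window size are illustrative assumptions, not the talk's actual implementation.

```python
# Sink-plus-sliding-window KV cache policy (illustrative sketch).
from collections import deque
import torch

class SinkAndWindowKVCache:
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sink_k, self.sink_v = [], []
        # deque with maxlen automatically evicts the oldest non-sink entries.
        self.recent_k, self.recent_v = deque(maxlen=window), deque(maxlen=window)

    def append(self, k: torch.Tensor, v: torch.Tensor):
        if len(self.sink_k) < self.n_sink:
            self.sink_k.append(k); self.sink_v.append(v)     # keep sinks forever
        else:
            self.recent_k.append(k); self.recent_v.append(v) # rolling window

    def keys_values(self):
        k = torch.stack(self.sink_k + list(self.recent_k))
        v = torch.stack(self.sink_v + list(self.recent_v))
        return k, v

cache = SinkAndWindowKVCache(n_sink=4, window=8)
for _ in range(20):
    cache.append(torch.randn(64), torch.randn(64))
k, v = cache.keys_values()
print(k.shape)   # 12 cached positions: 4 sinks + 8 most recent
```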
15:00 - 16:00
Panel
The Path to AGI: Directions and Challenges
Recent advances in AI have profoundly affected our perception of the technology. Not only are there domains where AI has reached superhuman performance levels, it is also starting to influence our interaction with machines (courtesy of LLMs). At the same time, observations of emergent properties have excited many in the community as a potential existence proof of future AGI. How and when we get there is an open question, though. In this panel we discuss the various elements across algorithm innovation, computing infrastructure, training and data generation, and even ethics and governance that need to come together for us to create truly intelligent systems.
Moderator: Raj Parihar
Hadi S. Esmaeilzadeh, UCSD
Amir Yazdanbakhsh, Google
Tatiana Shpeisman, Modular
Hyoukjun Kwon, UCI
Sree Ganesan, d-Matrix
16:00 - 16:15
Break
Break
16:15 - 16:45
Invited Talk
Neural Networks are not Matrix Multiplications
Hadi S. Esmaeilzadeh, UCSD
With the ever-increasing prevalence of neural networks and language models, it is time to rethink neural acceleration. Past research has focused disproportionately on General Matrix Multiplication (GEMM) operations, assuming they dominate neural computations. However, non-GEMM operations have grown in diversity and complexity, interweaving with GEMM. Conventional Neural Processing Units (NPUs) have taken simplistic approaches to handling these operations. In this talk, I will discuss our most recent work that challenges the conventional wisdom in neural accelerator design and explores the architecture of a specialized “Tandem Processor” that complements the GEMM unit. The Tandem Processor is specialized for memory access logic with a novel Instruction Set Architecture (ISA) and microarchitecture that alleviates the register file, while offering programmability at the mathematical level for non-GEMM layers. This balances specialization and programmability, sustaining the throughput and utilization of the neural accelerator. We provide the Register Transfer Level (RTL) code that is synthesizable both for Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) implementations in addition to the associated compiler as part of the open-source GeneSys initiative (https://actlab-genesys.github.io/). Tandem and GeneSys are the result of more than a decade of pioneering work in NPU design and open-sourcing accelerators.
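To illustrate why non-GEMM operations interleave so tightly with GEMM in modern networks, the sketch below traces a small PyTorch block and labels each node as GEMM or non-GEMM. The classification is purely illustrative; the actual GeneSys compiler, ISA, and Tandem Processor design are far more involved.

```python
# Label operators in a small traced block as GEMM vs. non-GEMM (illustrative).
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

GEMM_OPS = {nn.Linear, nn.Conv2d}   # operators served by the GEMM unit

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(64, 256), nn.Linear(256, 64)
        self.norm = nn.LayerNorm(64)
    def forward(self, x):
        y = torch.nn.functional.gelu(self.fc1(self.norm(x)))
        return x + self.fc2(y)       # residual add: another non-GEMM op

traced = symbolic_trace(Block())
mods = dict(traced.named_modules())
for node in traced.graph.nodes:
    if node.op == "call_module":
        kind = "GEMM" if type(mods[node.target]) in GEMM_OPS else "non-GEMM"
    elif node.op in ("call_function", "call_method"):
        kind = "non-GEMM"
    else:
        continue                     # skip placeholders/outputs
    print(f"{node.name:12s} -> {kind}")
```

Running this prints an alternating pattern (norm, fc1, gelu, fc2, add), showing how non-GEMM layers sit between every GEMM in even a tiny block.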
16:45 - 17:15
Invited Talk
Efficient AI Programming with Mojo and Max
Tatiana Shpeisman, Modular
As AI grows in its capabilities and ubiquity, it becomes increasingly important to improve the efficiency of AI applications. At the same time, developing and deploying new applications requires the ability to quickly iterate on the design and the flexibility to modify it as required by the deployment scenario. In this talk, we will describe how Mojo and the Modular MAX platform help to achieve this goal. We will give a brief overview of Mojo and MAX and illustrate by example how AI programmers can benefit from MAX and its extensibility via Mojo.
17:15 - 17:45
Invited Talk
CHAI: Clustered Head Attention for Efficient LLM Inference
Bilge Acun, Meta
Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of gigabytes of memory. Multi-Head Attention is one of the key components of LLMs and can account for over 50% of an LLM's memory and compute requirements. We observe a high degree of redundancy across heads in terms of which tokens they attend to. Based on this insight, we propose Clustered Head Attention (CHAI). CHAI combines heads with a high amount of correlation for self-attention at runtime, thus reducing both memory and compute. In our experiments, we show that CHAI is able to reduce the memory requirements for storing the K,V cache by up to 21.4% and inference time latency by up to 1.73x without any fine-tuning required. CHAI achieves this with at most a 3.2% deviation in accuracy across 3 different models (OPT-66B, LLaMA-7B, LLaMA-33B) and 5 different evaluation datasets.
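To make the head-clustering idea concrete, the sketch below greedily groups attention heads whose attention maps are highly similar (by cosine similarity), so each cluster could reuse a single head's attention computation and K cache. The threshold, clustering rule, and tensor shapes are illustrative assumptions; this is not the published CHAI algorithm.

```python
# Greedy clustering of attention heads by attention-map similarity (sketch).
import torch

def cluster_heads(attn: torch.Tensor, threshold: float = 0.9):
    # attn: (heads, seq, seq) softmax-ed attention maps.
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    flat = flat / flat.norm(dim=-1, keepdim=True)        # unit vectors per head
    sim = flat @ flat.T                                  # pairwise cosine similarity
    clusters, assigned = [], torch.zeros(h, dtype=torch.bool)
    for i in range(h):
        if assigned[i]:
            continue
        members = [j for j in range(h) if not assigned[j] and sim[i, j] >= threshold]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters   # each cluster could share one head's attention / K cache

# Synthetic example: 16 heads made of 4 distinct patterns plus small noise.
heads, seq = 16, 128
base = torch.softmax(torch.randn(4, seq, seq), dim=-1)
attn = base.repeat_interleave(4, dim=0) + 0.005 * torch.rand(heads, seq, seq)
attn = attn / attn.sum(dim=-1, keepdim=True)
print(cluster_heads(attn))    # redundant heads collapse into a few clusters
```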
17:45 - 18:15
Invited Talk
Teaching LLMs to Use Tools at Scale
Shishir Patil, UC Berkeley
In this talk, we will explore our approach to integrating Large Language Models (LLMs) with various tools via APIs. Bridging LLMs with APIs presents a significant challenge, primarily because of the models' struggles to generate precise input arguments and their propensity to hallucinate API calls. Gorilla LLM, trained with our novel Retriever-Aware Training (RAT), surpasses the performance of all open-source LLMs on writing API calls. Gorilla also presents a novel PL-inspired metric to measure hallucination, a failure mode commonly encountered in LLMs. Gorilla is an open-source project that has served hundreds of thousands of user requests, with enterprise adoption and an energetic community supporting it. We'll also spotlight the Berkeley Function Calling Leaderboard, which evaluates an LLM's ability to call functions (tools) accurately. We'll conclude with lessons learned from our deployment experiences and present open research questions to enable wider integration of LLMs in applications.
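For readers new to tool use, here is a hedged sketch of retrieval-augmented prompting for API-call generation in the spirit of retriever-aware approaches: relevant API documentation is retrieved and placed in the prompt so the model grounds its call in a real signature. The toy lexical retriever, the documentation strings, and the prompt format are all illustrative assumptions; this is not Gorilla's actual training or inference code.

```python
# Retrieval-augmented prompt construction for API-call generation (sketch).
API_DOCS = [
    "torch.linalg.svd(A) -> computes the singular value decomposition of A",
    "numpy.fft.fft(a) -> computes the one-dimensional discrete Fourier transform",
    "requests.get(url, params=None) -> sends an HTTP GET request",
]

def retrieve(query: str, docs: list[str]) -> str:
    # Toy lexical retriever: pick the doc sharing the most words with the query.
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query: str) -> str:
    doc = retrieve(query, API_DOCS)
    return (
        "You can call exactly one API.\n"
        f"Relevant API documentation: {doc}\n"
        f"User request: {query}\n"
        "Respond with a single API call:"
    )

print(build_prompt("compute the Fourier transform of my signal"))
```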