The 10th EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Co-located with the IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)

Sunday, March 02, 2025
Las Vegas, NV, USA
Room: Mesquite 3

Workshop Objective

With the advent of ChatGPT and other language models, Generative AI and LLMs have captured the imagination of the whole world! A new wave of intelligent computing, driven by recent advances in machine learning and cognitive algorithms coupled with process technology and new design methodologies, has the potential to usher in unprecedented disruption in the way modern computing systems are designed and deployed. These new and innovative approaches often provide an attractive and efficient alternative not only in terms of performance but also power, energy, and area. This disruption is easily visible across the whole spectrum of computing systems, ranging from low-end mobile devices to large-scale data centers. Applications that benefit from efficient machine learning include computer vision and image processing, augmented/mixed reality, language understanding, speech and gesture recognition, malware detection, autonomous driving, and many more. Naturally, these applications have diverse requirements for performance, energy, reliability, accuracy, and security that demand a holistic approach to designing the hardware, software, and intelligence algorithms to achieve the best outcome.

Call for Papers

The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of MLLMs as they are practiced today and as they evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates an active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent quality assessment of the papers.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
08:00 - 08:15
Welcome

Welcome and Opening Remarks

Sushant Kondguli, Meta

08:15 - 08:15
Session

Session 1 Start

Session Chair - Sushant Kondguli, Meta

08:15 - 09:00
Keynote

The uncrossable, unbridgeable von Neumann abyss: For inference at low-latency and low-cost, build on the other side

Dharmendra Modha, IBM link

This keynote describes NorthPole, a chip that is the result of nearly two decades of work by scientists at IBM Research and an outgrowth of a 14-year partnership with the United States Department of Defense (Defense Advanced Research Projects Agency (DARPA), Office of the Under Secretary of Defense for Research and Engineering, and Air Force Research Laboratory). NorthPole, implemented using a 12-nm process onshore in the US, outperforms all existing specialized chips for running neural networks on ResNet50 and Yolov4, even those using more advanced technology processes.

09:00 - 09:30
Invited Talk

Pushing the Frontiers of Language Model Efficiency: From Innovative Architecture to Any‐Shape LLMs and Multiplication‐Less Reparameterization

Yingyan (Celine) Lin, Georgia Tech link

As large language models (LLMs) continue to transform applications in natural language understanding, generation, and reasoning, there is an ever‐increasing need for more efficient, flexible, and high‐performing solutions. In this talk, I will introduce our recent work in advancing LLM efficiency through innovative architecture, any‐shape flexibility, and multiplication‐less reparameterization. Collectively, these techniques provide a multi‐layered strategy for building faster, lighter, and more agile LLMs, extending the frontiers of efficient language intelligence across diverse deployment scenarios.
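As a rough illustration of one idea behind multiplication-less reparameterization (this sketch is not the speaker's method; the function names and the power-of-two rounding scheme are illustrative assumptions), weights can be rounded to signed powers of two so that each multiply in a matrix-vector product becomes a sign flip plus a bit shift:

```python
import numpy as np

def to_power_of_two(w, eps=1e-12):
    """Round each weight to the nearest signed power of two."""
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + eps)).astype(int)
    return sign, exp

def shift_matvec(sign, exp, x):
    """Matrix-vector product using only sign flips and scaling by 2**exp
    (a bit shift in integer arithmetic) instead of general multiplies."""
    # ldexp(x, e) computes x * 2**e without a full multiply
    return (sign * np.ldexp(x[None, :], exp)).sum(axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
x = rng.normal(size=8)

sign, exp = to_power_of_two(W)
approx = shift_matvec(sign, exp, x)
exact = W @ x  # compare against the exact product
print(approx, exact)
```

Rounding in the log domain keeps each quantized weight within a factor of sqrt(2) of the original, which bounds the error of the shift-only product relative to the exact one.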

09:30 - 10:00
Invited Talk

Inference at the Edge: Tackling the challenges of deploying large-scale models on-device.

Kimish Patel, Meta link

The advent of generative AI has driven increased engagement with AI capabilities, ranging from the use of diffusion models for creative work to leveraging multi-modal models for AI assistants. Form factors such as smartphones and wearables make privacy-, latency-, power-, and cost-aware deployment of such capabilities possible by leveraging on-chip AI accelerators for low-latency, energy-efficient inference. However, the black-box nature of such accelerators makes them harder to leverage without significantly affecting the productivity of ML and production engineers. By providing native integration points to such accelerators, the ExecuTorch framework enables deployment of PyTorch models on these platforms while preserving the ease of authoring, profiling, and debugging that is native to PyTorch. In this talk, we provide an overview of the ExecuTorch stack and how it brings ML developers and silicon vendors together to enable on-device AI within the PyTorch-native ecosystem.

10:00 - 10:15
Break

Break/Q&A

10:15 - 11:00
Invited Talk

Efficient Large Language Models and Generative AI

Song Han, MIT link

The rapid advancement of generative AI, particularly large language models (LLMs), presents unprecedented computational challenges. The autoregressive nature of LLMs makes inference memory-bound, and generating long sequences further compounds the memory demand. I will present efficiency optimization techniques including quantization (SmoothQuant, AWQ, SVDQuant) and KV cache optimization (StreamingLLM, QUEST, DuoAttention), followed by efficient model architectures: HART, a hybrid autoregressive transformer for efficient visual generation, and SANA, an efficient diffusion model deployable on the edge.
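As a minimal, hypothetical sketch of the weight-quantization idea behind methods like those named above (this is not an implementation of SmoothQuant, AWQ, or SVDQuant; the function names are illustrative), per-channel symmetric int8 quantization might look like:

```python
import numpy as np

def quantize_int8(W):
    """Per-output-channel symmetric int8 quantization of a weight matrix."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)
q, scale = quantize_int8(W)

# int8 storage cuts weight memory 4x vs fp32 while keeping error small
err = np.abs(dequantize(q, scale) - W).max()
print(err)
```

Real methods add calibration data, activation smoothing, or low-rank corrections on top of this basic scheme to preserve accuracy at lower bit widths.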

11:00 - 11:30
Invited Talk

ML Workloads in AR/VR and Their Implications to the ML System Design

Hyoukjun Kwon, UC Irvine link

Augmented and virtual reality (AR/VR) combines many machine learning (ML) models to implement complex applications. Unlike traditional ML workloads, those in AR/VR involve (1) multiple concurrent ML pipelines (cascaded ML models with control/data dependencies), (2) highly heterogeneous modalities and corresponding model structures, and (3) heavy dynamic behavior based on the user context and inputs. In addition, AR/VR requires (4) real-time execution of those ML workloads on (5) energy-constrained wearable form-factor devices. Altogether, this creates significant challenges for ML system design targeting AR/VR. In this talk, I will first demystify the ML workloads in AR/VR via a recent open benchmark, XRBench, which was developed with industry collaborators at Meta to reflect real use cases. Using these workloads, I will list the challenges and implications of AR/VR ML workloads for ML system designs. Based on that, I will present hardware and software system design examples tailored for AR/VR ML workloads. Finally, I will discuss research opportunities in the AR/VR ML system design domain.

11:30 - 12:00
Paper Session #1

LLM Inference Acceleration via Efficient Operation Fusion

Mahsa Salmani, Ilya Soloveychik

ATM-Net: Adaptive Termination and Multi-Precision Neural Networks for Energy-Harvested Edge Intelligence

Neeraj Solanki, Sepehr Tabrizchi, Samin Sohrabi, Jason Schmidt, Arman Roohi

12:00 - 12:59
Lunch Break

Lunch Break

13:00 - 13:00
Session

Session 2 Start

Session Chair - Eric Qin, Meta

13:05 - 13:15
Invited Talk
13:15 - 14:00
Keynote

Memory-Centric Computing: Enabling Fundamentally Efficient Computing Systems

Onur Mutlu, ETH Zurich link

Computing is bottlenecked by data. Large amounts of application data overwhelm storage capability, communication capability, and computation capability of the modern machines we design today. As a result, many key applications’ performance, efficiency, and scalability are bottlenecked by data movement. In this talk, we describe three major shortcomings of modern architectures in terms of 1) dealing with data, 2) taking advantage of the vast amounts of data, and 3) exploiting different semantic properties of application data. We argue that an intelligent architecture should be designed to handle data well. We posit that handling data well requires designing architectures based on three key principles: 1) data-centric, 2) data-driven, 3) data-aware. We give examples for how to exploit these principles to design a much more efficient and higher performance computing system. We especially discuss recent research that aims to fundamentally reduce memory latency and energy, and practically enable computation close to data, with at least two promising directions: 1) processing using memory, which exploits the fundamental operational properties of memory chips to perform massively-parallel computation in memory, with low-cost changes, 2) processing near memory, which integrates sophisticated additional processing capability in memory chips, the logic layer of 3D-stacked technologies, or memory controllers to enable near-memory computation with high memory bandwidth and low memory latency. We show both types of architectures can enable order(s) of magnitude improvements in performance and energy consumption of many important workloads, such as machine learning, graph analytics, database systems, video processing, climate modeling, genome analysis. We discuss how to enable adoption of such fundamentally more intelligent architectures, which we believe are key to efficiency, performance, and sustainability. 
We conclude with some research opportunities in and guiding principles for future computing architecture and system designs.

14:00 - 14:30
Invited Talk

Efficient Inference and Finetuning on the Edge – Towards A Low Power e2e “near Edge” use case, enabling Agentic Applications

Kingsuk Maitra, Qualcomm link

Finetuning and inference options on edge devices and/or the cloud will be discussed. A novel use case of identifying sources of pollution in river water in India will be presented. Data feeds from sensor measurements combined with satellite imagery data will be used to train and fine-tune a graph-transformer-type network capable of multi-modal fusion of the geo-spatial and water conductivity data. Subsequent integration into an agentic framework using an on-prem and/or on-cloud network in a hybrid setting will be discussed. Augmentation of the agentic workflow with an in-house proprietary perception engine will also be introduced. Finally, the talk will highlight how the different blocks, from the geo-spatial satellite imagery data collection engine to the real-time sensor measurements, integrated via the perception engine, are fed into the specialized LLM network via our Playground dashboard, available via QCOMM cloud AI, and orchestrated into actionable insights.

14:30 - 15:00
Invited Talk

nanoML: Pushing the Limits of Edge AI with Weightless Neural Networks

Lizy Kurian John, UT Austin link

Mainstream artificial neural network models, such as Deep Neural Networks (DNNs) are computation-heavy and energy-hungry. Weightless Neural Networks (WNNs) are natively built with RAM-based neurons and represent an entirely distinct type of neural network computing compared to DNNs. WNNs are extremely low-latency, low-energy, and suitable for efficient, accurate, edge inference. The WNN approach derives an implicit inspiration from the decoding process observed in the dendritic trees of biological neurons, making neurons based on Random Access Memories (RAMs) and/or Lookup Tables (LUTs) ready-to-deploy neuromorphic digital circuits. This talk will describe the state of the art of Weightless Neural Networks, and their applications for edge inferencing. It will also present Differentiable Weightless Neural Network (DWN), a model based on interconnected lookup tables. Training of DWNs is enabled by a novel Extended Finite Difference technique for approximate differentiation of binary values. Techniques such as Learnable Mapping, Learnable Reduction, and Spectral Regularization to further improve the accuracy and efficiency of DWN models for edge inferencing will be discussed.
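A toy, hypothetical sketch of a WiSARD-style weightless discriminator (the class name and structure are illustrative, not taken from the talk) shows how RAM/LUT neurons replace multiply-accumulate arithmetic with table lookups:

```python
import numpy as np

class WisardDiscriminator:
    """One discriminator of a WiSARD-style weightless network: each
    'neuron' is a lookup table addressed by a fixed tuple of input bits."""
    def __init__(self, n_bits, tuple_size, rng):
        # randomly partition the input bits among the RAM nodes
        self.mapping = rng.permutation(n_bits).reshape(-1, tuple_size)
        self.rams = [dict() for _ in range(len(self.mapping))]

    def _addresses(self, x):
        # pack each bit tuple into a hashable RAM address
        for ram, idx in zip(self.rams, self.mapping):
            yield ram, tuple(x[idx])

    def train(self, x):
        for ram, addr in self._addresses(x):
            ram[addr] = 1  # remember that this bit pattern was seen

    def score(self, x):
        # response = number of RAM nodes that recognize their address
        return sum(ram.get(addr, 0) for ram, addr in self._addresses(x))

rng = np.random.default_rng(0)
# two binary "classes" over 16 input bits: all-ones vs all-zeros
ones, zeros = np.ones(16, dtype=int), np.zeros(16, dtype=int)
d_ones = WisardDiscriminator(16, 4, rng)
d_zeros = WisardDiscriminator(16, 4, rng)
d_ones.train(ones)
d_zeros.train(zeros)

probe = ones.copy(); probe[:3] = 0  # a noisy all-ones pattern
print(d_ones.score(probe), d_zeros.score(probe))
```

Training and inference are pure memory reads and writes, which is why hardware realizations map directly onto RAMs or LUTs with no multipliers at all.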

15:00 - 15:15
Break

Break/Q&A

15:15 - 15:30
Talk

Intro to SCF

Sushant Kondguli, Meta

15:30 - 16:30
Panel

The evolution of scaling laws and what it means for sustainable AI computing

We are witnessing a paradigm shift in the development of AI technology. For the last several years, the creation of frontier language models has largely followed the recipe of the AI scaling law: larger models + more data + more compute leads to better capability. Lately, though, this recipe has become unsustainable, even for the largest organizations. This has spurred clever innovations across data conditioning, reinforcement learning, knowledge distillation, inference-time computing, and more. Not only have these innovations helped overcome the wall of traditional scaling, they have also lowered the barrier to accessing top-tier AI technology and shifted some computational demand from training to inference. This panel will explore the implications and possibilities of this new landscape of AI computing and how to make it more sustainable.

Moderator: Satyam Srivastava

Onur Mutlu, ETH Zurich

Huichu Liu, Meta

Yingyan (Celine) Lin, Georgia Tech

Song Han, MIT

16:30 - 17:00
Invited Talk

Overcoming Memory Limitations for On-Device AI and LLM in Wearable Augmented Reality Systems

Huichu Liu, Meta link

Wearable augmented reality (AR) devices have the potential to revolutionize user experiences with AI-enabled capabilities. However, the power and size constraints of current memory technologies pose significant challenges to realizing this vision. To overcome these limitations, we must address the memory power and area challenges that hinder the performance of AR/AI workloads in wearable devices. To achieve this goal, we will explore state-of-the-art logic, memory, and advanced packaging technologies, such as 3D SRAM, 3D IPM, and CIM. In this talk, we will introduce three key approaches to overcoming memory limitations and enabling on-device AI and LLM in wearable AR systems: (1) Integrate logic and memory with advanced 3D technologies; (2) Co-optimize memory for workload characteristics; and (3) Design for the end-to-end system. By optimizing memory design and technology solutions for specific workloads and end-to-end systems, we can unlock the full potential of wearable AR devices.

17:00 - 17:15
Close

Closing Remarks

Satyam Srivastava, d-Matrix