The 10th EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Co-located with the IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2025

Sunday, March 02, 2025
Las Vegas, NV, USA
Room: Mesquite 3

Workshop Objective

With the advent of ChatGPT and other language models, generative AI and LLMs have captured the imagination of the whole world! A new wave of intelligent computing, driven by recent advances in machine learning and cognitive algorithms coupled with process technology and new design methodologies, has the potential to usher in unprecedented disruption in the way modern computing systems are designed and deployed. These new and innovative approaches often provide an attractive and efficient alternative not only in terms of performance but also power, energy, and area. This disruption is easily visible across the whole spectrum of computing systems, ranging from low-end mobile devices to large-scale data centers. Applications that benefit from efficient machine learning include computer vision and image processing, augmented/mixed reality, language understanding, speech and gesture recognition, malware detection, autonomous driving, and many more. Naturally, these applications have diverse requirements for performance, energy, reliability, accuracy, and security that demand a holistic approach to designing the hardware, software, and intelligence algorithms to achieve the best outcome.

Call for Papers

The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of ML and LLMs as they are practiced today and as they evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants. Our forum facilitates an active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems

Agenda

08:00 - 08:15
Welcome

Welcome and Opening Remarks

Satyam Srivastava, d-Matrix

08:15 - 09:00
Keynote

The uncrossable, unbridgeable von Neumann abyss: For inference at low-latency and low-cost, build on the other side

Dharmendra Modha, IBM

This keynote describes NorthPole, a chip that is the result of nearly two decades of work by scientists at IBM Research and an outgrowth of a 14-year partnership with the United States Department of Defense (the Defense Advanced Research Projects Agency (DARPA), the Office of the Under Secretary of Defense for Research and Engineering, and the Air Force Research Laboratory). NorthPole, implemented using a 12-nm process on-shore in the US, outperforms all existing specialized chips for running neural networks on ResNet50 and YOLOv4, even those using more advanced technology processes.

09:00 - 09:30
Invited Talk

Craylm v0.5: Unifying LLM Inference and Training for RL Agents

Greg Diamos, Celestial AI

Craylm is a fully open source, CC-0 Licensed (unrestricted commercial use), integrated LLM inference and training platform. We created Craylm to simplify the development of reinforcement learning agents with advanced reasoning and memory capabilities, similar to those of DeepSeek R1. By integrating inference and training engines into a single platform, Craylm enables the seamless generation and utilization of reasoning trajectories for training updates, streamlining the development process. Craylm builds on top of the vLLM inference engine, the Megatron-LM training framework, and the HuggingFace model hub. It unifies the capabilities of these tools into a single platform, enabling users to easily perform LLM inference and training, and build higher level applications such as LLM-Agents.

09:30 - 10:00
Invited Talk

Inference at the Edge: Tackling the challenges of deploying large-scale models on-device.

Kimish Patel, Meta

The advent of generative AI has resulted in increased engagement of AI capabilities, ranging from the use of diffusion models for creative work to leveraging multi-modal models for AI assistants. Form factors such as smartphones and wearables make privacy-, latency-, power-, and cost-aware deployment of such capabilities possible by leveraging on-chip AI accelerators for low-latency, energy-efficient inference. However, the black-box nature of such accelerators makes them harder to leverage without significantly affecting the developer efficiency of ML and production engineers. By providing native integration points to such accelerators, the ExecuTorch framework enables deployment of PyTorch models on these platforms while preserving the ease of authoring, profiling, and debugging that is native to PyTorch. In this talk, we provide an overview of the ExecuTorch stack and how it brings both ML developers and silicon vendors together to enable on-device AI within the PyTorch-native ecosystem.

10:00 - 10:15
Break

Break/Q&A

10:15 - 11:00
Invited Talk

Efficient Large Language Models and Generative AI

Song Han, MIT

The rapid advancement of generative AI, particularly large language models (LLMs), presents unprecedented computational challenges. The autoregressive nature of LLMs makes inference memory bounded. Generating long sequences further compounds the memory demand. I will present efficiency optimization techniques including quantization (SmoothQuant, AWQ, SVDQuant) and KV cache optimization (StreamingLLM, QUEST, DuoAttention), followed by efficient model architectures—HART, a hybrid autoregressive transformer for efficient visual generation, and SANA, an efficient diffusion model deployable on the edge.
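To illustrate why the KV cache is a central target for the optimizations the abstract mentions, here is a back-of-the-envelope sketch of how cache size grows linearly with sequence length. This is my own illustration, not material from the talk, and the model shape is an illustrative assumption (roughly a Llama-2-7B-like configuration in fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys and values (factor of 2) each store seq_len x head_dim entries
    # per KV head per layer, per sequence in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed shape: 32 layers, 32 KV heads, head dimension 128, fp16 (2 bytes).
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB for a single 4k-token sequence
```

Doubling the sequence length or the batch size doubles this footprint, which is why long-sequence generation is memory bound and why techniques like StreamingLLM and DuoAttention aim to shrink or evict parts of the cache.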

11:00 - 11:30
Invited Talk

ML Workloads in AR/VR and Their Implications to the ML System Design

Hyoukjun Kwon, UC Irvine

Augmented and virtual reality (AR/VR) combines many machine learning (ML) models to implement complex applications. Unlike traditional ML workloads, those in AR/VR involve (1) multiple concurrent ML pipelines (cascaded ML models with control/data dependencies), (2) highly heterogeneous modalities and corresponding model structures, and (3) heavy dynamic behavior based on user context and inputs. In addition, AR/VR requires (4) real-time execution of those ML workloads on (5) energy-constrained wearable form-factor devices. Altogether, this creates significant challenges for ML system design targeting AR/VR. In this talk, I will first demystify the ML workloads in AR/VR via a recent open benchmark, XRBench, which was developed with industry collaborators at Meta to reflect real use cases. Using these workloads, I will list the challenges and implications of AR/VR ML workloads for ML system designs. Based on that, I will present hardware and software system design examples tailored for AR/VR ML workloads. Finally, I will discuss research opportunities in the AR/VR ML system design domain.

11:30 - 12:00
Paper Session #1

LLM Inference Acceleration via Efficient Operation Fusion

Mahsa Salmani, Ilya Soloveychik

ATM-Net: Adaptive Termination and Multi-Precision Neural Networks for Energy-Harvested Edge Intelligence

Neeraj Solanki, Sepehr Tabrizchi, Samin Sohrabi, Jason Schmidt, Arman Roohi

12:00 - 13:00
Lunch Break

Lunch Break

13:00 - 13:45
Keynote

Memory-Centric Computing: Enabling Fundamentally Efficient Computing Systems

Onur Mutlu, ETH Zurich

13:45 - 14:15
Invited Talk

To Be Announced

14:15 - 14:45
Invited Talk

nanoML: Pushing the Limits of Edge AI with Weightless Neural Networks

Lizy Kurian John, UT Austin

Mainstream artificial neural network models, such as Deep Neural Networks (DNNs), are computation-heavy and energy-hungry. Weightless Neural Networks (WNNs) are natively built with RAM-based neurons and represent an entirely distinct type of neural network computing compared to DNNs. WNNs are extremely low-latency, low-energy, and suitable for efficient, accurate edge inference. The WNN approach derives an implicit inspiration from the decoding process observed in the dendritic trees of biological neurons, making neurons based on Random Access Memories (RAMs) and/or Lookup Tables (LUTs) ready-to-deploy neuromorphic digital circuits. This talk will describe the state of the art in Weightless Neural Networks and their applications for edge inference. It will also present the Differentiable Weightless Neural Network (DWN), a model based on interconnected lookup tables. Training of DWNs is enabled by a novel Extended Finite Difference technique for approximate differentiation of binary values. Techniques such as Learnable Mapping, Learnable Reduction, and Spectral Regularization to further improve the accuracy and efficiency of DWN models for edge inference will be discussed.
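The RAM-based neuron described above can be sketched in a few lines: a tuple of input bits is used as an address into a lookup table, and training simply writes to the addressed cell, with no multiply-accumulate arithmetic at all. This is my own minimal illustration of the general idea, not code from the talk or from the DWN work:

```python
# Minimal sketch of a single RAM-based weightless neuron (illustrative only).
class RAMNeuron:
    def __init__(self, n_inputs):
        # One table cell per possible binary input pattern.
        self.table = [0] * (2 ** n_inputs)

    def _addr(self, bits):
        # Interpret the tuple of input bits as a table index.
        addr = 0
        for b in bits:
            addr = (addr << 1) | b
        return addr

    def train(self, bits):
        # Training is a single memory write: remember this pattern.
        self.table[self._addr(bits)] = 1

    def fire(self, bits):
        # Inference is a single memory read.
        return self.table[self._addr(bits)]

n = RAMNeuron(3)
n.train((1, 0, 1))
print(n.fire((1, 0, 1)), n.fire((0, 1, 1)))  # 1 0
```

Inference here is one table lookup per neuron, which is why WNNs map so directly onto RAMs and FPGA LUTs; the DWN work discussed in the talk additionally makes such tables trainable by gradient-style methods.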

14:45 - 15:00
Break

Break/Q&A

15:00 - 16:00
Panel

Panel Discussion

Moderator: To Be Announced

16:00 - 16:30
Invited Talk

Invited Talk

To Be Announced

16:30 - 17:00
Invited Talk

Invited Talk

To Be Announced

17:00 - 17:30
Invited Talk

Sponsor Talk / Fireside Chat

To Be Announced

17:30 - 18:00
Invited Talk

Invited Talk

To Be Announced

18:00 - 18:15
Close

Closing Remarks

Sushant Kondguli, Meta