EMC^2: EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Sunday, March 22, 2026 Room: Ft. Pitt

EMC2 Competition: AI Infrastructure Demos, sponsored by Runara.ai — Learn more →

EMC2 Social: Join us after the workshop! 6:15 – 8:30 PM at The Lobby Bar, The Landing Hotel at Rivers Casino. Learn more →

Program
Panel
Accepted Papers
Competition
EMC2 Social
CFP

$500 cash prize + Certificate of Recognition

Winners eligible for Summer Internship opportunities at Runara.ai

Since this is an open-scope, applied competition, proposals will be evaluated based on their real-world application potential instead of rigid benchmark metrics. Judging prioritizes: production readiness and robustness, clarity and usefulness of metrics, real measurable end-to-end impact, clean system design, and practical relevance to modern AI infrastructure.

For any questions please contact raj@runara.ai

16:00 - 17:00 Sunday, March 22, 2026 Ft. Pitt

As the cumulative demand for AI computing continues to grow, new bottlenecks in technology deployment are emerging. These range from compute capacity to memory supply to cluster reliability to simply finding enough power to supply the infrastructure. Since these aren’t easily solved with brute force addition of more chips, potential solutions encompass new model architectures, new computing architectures, high performance cooling designs, robust multi-device software stacks, and even orbital datacenters. In this panel, we will aim to identify the most pressing requirements for the computing stack of coming years. We will discuss both the promising technologies and the hypes to define a blueprint of the computing architecture where energy is the currency of success.

Satyam Srivastava

d-Matrix

    Panelists
  
      Nishil Talati link
    
UIUC

      Esha Choukse link
    
Microsoft

      Josep Torrellas link
    
UIUC

      Zain Asgar link
    
Gimlet AI

      Gennady Pekhimenko link
    
University of Toronto

Paper Session #1 (1:30 PM – 2:15 PM)

piPE-SA: Enabling Deeply Pipelined Processing Elements in Systolic Arrays

Jiayi Wang, Chenyi Wang and Ang Li
CacheFlex: Explicitly Control What You Need in Your Cache

Jingqun Zhang, Weihang Li, Maohua Nie, Yung-Jen Cheng, Jiayi Wang, Shwet Chitnis and Ang Li
IEI: A Composite Infrastructure Efficiency Index for GPU Inference

Raj Parihar

Paper Session #2 (2:45 PM – 3:30 PM)

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang and Murali Annavar
I-Fuse: Profile-Guided Speculative Load Micro-op Fusion for Data Center Applications

Deepanjali Mishra, Tanvir Ahmed Khan, Gilles Pokam, Heiner Litz and Akshitha Sriraman
Store or Recompute? Characterizing the Carbon Tradeoff of KV Cache Retention in LLM Inference

Abnash Bassi, Jaylen Wang, Fiodar Kazhamiaka, Daniel S. Berger and Akshitha Sriraman

08:00 - 08:15

Welcome

Welcome and Opening Remarks

Sushant Kondguli, Meta

08:15 - 09:00

Keynote

Future of Energy-Efficient Cognitive Computing: A Six-Word Story on Sentience, Systems, and Sustainability

Parthasarathy Ranganathan, Google

We are at a pivotal inflection point in the design of computing systems. On one hand, demand for computing is accelerating at phenomenal rates, fueled by the AI revolution and increasingly deep processing on massive data volumes. On the other hand, Moore’s Law is slowing down. This widening supply-demand gap is forcing us to revisit traditional assumptions around systems design. In this talk, we will discuss Google’s experience designing and deploying large-scale AI systems optimized for efficiency, reliability, and velocity. Building on these lessons, we will identify key challenges and opportunities for future innovation, highlighting how the next generation of energy-efficient cognitive computing will be AI-driven, vertically integrated, and uncomfortably exciting!

09:00 - 09:30

Invited Talk

Architecting Responsible AI: Efficient and Carbon-Aware Caching for AI and Its Security Implications

Sihang Liu, University of Waterloo

While generative AI offers transformative potential, its rapid scaling has introduced significant challenges in sustainability and responsible deployment. This talk explores the role of caching in AI systems, including both performance and its broader implications for sustainability and security. First, I will present our recent work on improving the efficiency of video generation systems through caching techniques. I will then discuss our context caching system that explores the trade-off between operational carbon savings and the increased embodied carbon associated with storage. Finally, I will highlight our recent findings on novel security vulnerabilities arising from caching mechanisms used in image generation services.

09:30 - 10:00

Invited Talk

Energy Consumption in AI Datacenters: Can We Address this Challenge?

Josep Torrellas, UIUC

The global electricity demand from data centers, AI, and cryptocurrencies in expected to be around 800TW/h by the end of 2026. Currently, companies like Oracle are building 500MW AI campuses. Training a single model takes 100+ MW over many months, with clusters of about 100K GPUs, and the projected AI cluster size is expected to increase several times in the next few years. Clearly, power will be the primary limiter to the growth of AI. Given this scenario, what can we, researchers, do? In this talk, I will discuss the problem and suggest some techniques that we can use to try to mitigate it. Some of the ideas are being developed in the context of the ACE Center for Evolvable Computing, a center funded by SRC and DARPA for efficient distributed computing.

10:00 - 10:30

Break

10:30 - 11:00

Invited Talk

Memory-Centric Computing: Solving Computing's Memory Problem

Onur Mutlu, ETH Zurich

Computing is bottlenecked by data. Large amounts of application data overwhelm the storage capability, communication capability, and computation capability of the modern machines we design today. As a result, many key applications’ performance, efficiency, and scalability are bottlenecked by data movement. In this talk, we describe three major shortcomings of modern computers in terms of 1) dealing with data, 2) taking advantage of vast amounts of data, and 3) exploiting different semantic properties of application data. We argue that an intelligent computing architecture should be designed to handle data well. We posit that handling data well requires designing architectures based on three key principles: 1) data-centric, 2) data-driven, 3) data-aware. We give examples of how to exploit these principles to design a much more efficient and higher performance computing system. We especially discuss recent research that aims to fundamentally reduce memory latency and energy, and practically enable computation close to data, with at least two promising directions: 1) processing using memory, which exploits the fundamental operational properties of memory chips to perform massively-parallel computation in memory, with low-cost changes, 2) processing near memory, which integrates sophisticated additional processing capability in memory chips, the logic layer of 3D-stacked technologies, or memory controllers to enable near-memory computation with high memory bandwidth and low memory latency. We show both types of architectures can enable order(s) of magnitude improvements in performance and energy consumption of many important workloads, such as artificial intelligence, machine learning, graph analytics, database systems, video processing, climate modeling, genome analysis. We discuss how to enable adoption of such fundamentally more intelligent architectures, which are key to efficiency, performance, and sustainability. We conclude with some research opportunities in and guiding principles for future computing architecture and system designs.

An accompanying overview of modern memory-centric computing ideas & systems can be found at A Modern Primer on Processing in Memory (updated February 2025).

A shorter invited paper from IMW 2025 is at Memory-Centric Computing: Solving Computing’s Memory Problem (May 2025).

11:00 - 11:30

Invited Talk

Efficient and Scalable Agentic AI With Heterogeneous Systems

Zain Asgar, Gimlet AI

AI agents are dynamic, multi-stage systems composed of diverse operations with very different resource characteristics, ranging from compute- and memory-intensive model execution to bandwidth- and IO-bound data processing. Yet most deployments today run on homogeneous infrastructure, despite inference itself being inherently heterogeneous across its phases. In this talk, we present a system for orchestrating agent workloads across a mix of CPUs, GPUs, and accelerators spanning vendors and hardware tiers. The system automatically decomposes agentic workloads into fine-grained execution graphs, compiles them into hardware-specific fragments, and dynamically places and stitches them across distributed infrastructure to meet latency and performance goals. By mapping each stage of execution to the hardware best suited for it, we show how we can achieve step-function performance gains and significant improvements in cost efficiency. We highlight results showing how emerging hardware like SRAM-centric chips, combined with existing GPU infrastructure, can outperform traditional homogeneous deployments.

11:30 - 12:00

Invited Talk

Private Neural Recommendation with Homomorphic Encryption and Orion

Brandon Reagen, NYU

AI is great. It powers many of our favorite services and drives industries. However, today’s solutions pose a tradeoff between utility and privacy where receiving the highest quality service often requires disclosing private information. In this talk I will show how things don’t have to be this way. Fully Homomorphic Encryption (FHE) is a cryptographic method that enables computation directly on encrypted data, never disclosing sensitive inputs while still enabling access to high-quality services. I will then cover the challenges of using this technology, which include both extreme performance overhead and programming difficulty, and how our Orion framework addresses them. Finally, I will highlight our most recent work that demonstrates how FHE can be applied to neural recommendation. Time permitting, a demonstration will be given.

12:00 - 13:15

Lunch Break

13:30 - 14:15

Paper Session #1

piPE-SA: Enabling Deeply Pipelined Processing Elements in Systolic Arrays

Jiayi Wang, Chenyi Wang and Ang Li

CacheFlex: Explicitly Control What You Need in Your Cache

Jingqun Zhang, Weihang Li, Maohua Nie, Yung-Jen Cheng, Jiayi Wang, Shwet Chitnis and Ang Li

IEI: A Composite Infrastructure Efficiency Index for GPU Inference

Raj Parihar

14:15 - 14:45

Invited Talk

Making the Accuracy-Efficiency Trade-Off in Agentic Systems

Esha Choukse, Microsoft

For decades, systems researchers have balanced competing objectives: latency versus throughput, performance versus power, and speed versus cost. Traditionally, accuracy was a fixed requirement—non-negotiable and external to the systems equation. But as modern computing increasingly hosts AI-driven workloads, accuracy itself has become a tunable system variable. In this talk, I’ll argue that the next frontier in systems design lies in treating accuracy as a first-class performance knob, to be traded for efficiency in a principled way. Drawing from our recent work, I will show how software–hardware co-design can explicitly expose, quantify, and manage the accuracy-efficiency tradeoff.

14:45 - 15:30

Paper Session #2

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang and Murali Annavar

I-Fuse: Profile-Guided Speculative Load Micro-op Fusion for Data Center Applications

Deepanjali Mishra, Tanvir Ahmed Khan, Gilles Pokam, Heiner Litz and Akshitha Sriraman

Store or Recompute? Characterizing the Carbon Tradeoff of KV Cache Retention in LLM Inference

Abnash Bassi, Jaylen Wang, Fiodar Kazhamiaka, Daniel S. Berger and Akshitha Sriraman

15:30 - 16:00

Break

16:00 - 17:00

Panel

Architecting an energy-first stack for the AI age

As the cumulative demand for AI computing continues to grow, new bottlenecks in technology deployment are emerging. These range from compute capacity to memory supply to cluster reliability to simply finding enough power to supply the infrastructure. Since these aren’t easily solved with brute force addition of more chips, potential solutions encompass new model architectures, new computing architectures, high performance cooling designs, robust multi-device software stacks, and even orbital datacenters. In this panel, we will aim to identify the most pressing requirements for the computing stack of coming years. We will discuss both the promising technologies and the hypes to define a blueprint of the computing architecture where energy is the currency of success.

Moderator: Satyam Srivastava, d-Matrix

Panelists:

Nishil Talati, UIUC
Esha Choukse, Microsoft
Josep Torrellas, UIUC
Zain Asgar, Gimlet AI
Gennady Pekhimenko, University of Toronto

17:00 - 17:30

Invited Talk

Agile and evolvable software construction in the era of rapidly evolving hardware accelerator designs

Charith Mendis, UIUC

Modern AI workloads have become exceedingly abundant and important in the current computing landscape. As a result, there have been numerous software and hardware innovations aimed at accelerating these workloads. However, we observe a subtle disconnect between the software and hardware communities. Most software innovations target well-established hardware platforms such as CPUs (e.g., x86, ARM) and GPUs (e.g., NVidia GPUs), while hardware innovations produce plenty of other tensor accelerator designs (e.g., Gemmini, Feather, Trainium) each year.

We asked the question, why aren’t the software community using these accelerators or even evaluating on them? The simple yet undeniable reason is the lack of standardized software tooling compared to CPUs and GPUs. For an architecture to be used, properly designed compiler backends, correctness, and performance testing tools should be abundant (e.g., CUDA ecosystem).

In this talk, I will describe how we bridge this gap by automatically generating the necessary software tools for a large class of accelerators through the Accelerator Compiler Toolkit (ACT) ecosystem. Central to ACT is an ISA definition language, TAIDL, that for the first time standardizes the hardware-software interfaces for a large class of accelerators. Departing from the traditional approach of manually constructing test oracles, performance models, or retargetable compiler backends, we instead introduce agile and evolvable methodologies to automatically generate such necessary tooling using both formal methods and machine learning techniques for any TAIDL-defined accelerator interface. I will show how such automation enables rapid software prototyping, making rapidly evolving accelerator designs usable by the software community.

17:30 - 18:00

Invited Talk

Making Long-Context LLM Inference Practical: Leveraging Token Importance for Efficient Inference

Urmish Thakker, Ex-SambaNova Systems

As LLM context windows scale to hundreds of thousands or even millions of tokens, inference efficiency is increasingly limited by KV-cache memory growth and the high cost of prompt prefilling. Prior approaches attempt to mitigate these costs through heuristic cache eviction or prompt compression techniques, but often rely on architecture-specific assumptions or require retraining to recover accuracy. In this talk, I show that LLMs implicitly encode token importance signals that reveal which parts of context matter most for downstream computation. By exploiting these signals, it is possible to selectively retain or process only the most relevant tokens, significantly improving both decoding throughput and prefill throughput while preserving model quality in long-context inference.

Energy Efficient Machine Learning and Cognitive Computing

11th Edition

Co-located with ASPLOS 2026 in Pittsburgh, PA

emoji_events EMC2 Competition: AI Infrastructure Demos

Overview

Competition Tracks

Hardware and Model Flexibility

Submission Requirements

Selection & Presentation

Awards

$500 cash prize + Certificate of Recognition

Evaluation Criteria

forum Architecting an energy-first stack for the AI age

Moderator

Panelists

description Accepted Papers

Paper Session #1 (1:30 PM – 2:15 PM)

piPE-SA: Enabling Deeply Pipelined Processing Elements in Systolic Arrays

CacheFlex: Explicitly Control What You Need in Your Cache

IEI: A Composite Infrastructure Efficiency Index for GPU Inference

Paper Session #2 (2:45 PM – 3:30 PM)

Fast NF4 Dequantization Kernels for Large Language Model Inference

I-Fuse: Profile-Guided Speculative Load Micro-op Fusion for Data Center Applications

Store or Recompute? Characterizing the Carbon Tradeoff of KV Cache Retention in LLM Inference

local_bar EMC2 Social

description Workshop Objective

chat Call for Papers

format_list_bulleted Topics for the Workshop

We will follow that same formatting guidelines and duplicate submission policies as ASPLOS.

08:00 - 08:15

Welcome

Welcome and Opening Remarks

Sushant Kondguli, Meta

08:15 - 09:00

Keynote

Future of Energy-Efficient Cognitive Computing: A Six-Word Story on Sentience, Systems, and Sustainability

09:00 - 09:30

Invited Talk

Architecting Responsible AI: Efficient and Carbon-Aware Caching for AI and Its Security Implications

09:30 - 10:00

Invited Talk

Energy Consumption in AI Datacenters: Can We Address this Challenge?

10:00 - 10:30

Break

Break

10:30 - 11:00

Invited Talk

Memory-Centric Computing: Solving Computing's Memory Problem

11:00 - 11:30

Invited Talk

Efficient and Scalable Agentic AI With Heterogeneous Systems

11:30 - 12:00

Invited Talk

Private Neural Recommendation with Homomorphic Encryption and Orion

12:00 - 13:15

Lunch Break

Lunch Break

13:15 - 13:30

Invited Talk

Sponsor Talk

Max Sbabo, d-Matrix

13:30 - 14:15

Paper Session #1

piPE-SA: Enabling Deeply Pipelined Processing Elements in Systolic Arrays

Jiayi Wang, Chenyi Wang and Ang Li

CacheFlex: Explicitly Control What You Need in Your Cache

Jingqun Zhang, Weihang Li, Maohua Nie, Yung-Jen Cheng, Jiayi Wang, Shwet Chitnis and Ang Li

IEI: A Composite Infrastructure Efficiency Index for GPU Inference

Raj Parihar

14:15 - 14:45

Invited Talk

Making the Accuracy-Efficiency Trade-Off in Agentic Systems

14:45 - 15:30

Paper Session #2

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang and Murali Annavar

I-Fuse: Profile-Guided Speculative Load Micro-op Fusion for Data Center Applications

Deepanjali Mishra, Tanvir Ahmed Khan, Gilles Pokam, Heiner Litz and Akshitha Sriraman

Store or Recompute? Characterizing the Carbon Tradeoff of KV Cache Retention in LLM Inference

Abnash Bassi, Jaylen Wang, Fiodar Kazhamiaka, Daniel S. Berger and Akshitha Sriraman

EMC2 Competition: AI Infrastructure Demos

Architecting an energy-first stack for the AI age

Accepted Papers

EMC2 Social

Workshop Objective

Call for Papers

Topics for the Workshop