EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

Overview

The deployment of language models is rapidly shifting from datacenters to edge devices such as laptops, smartphones, and embedded platforms, driven by the demand for interactive, low-latency, and privacy-preserving applications. In this context, Small Language Models (SLMs) have emerged as practical candidates, yet their inference exposes the inefficiencies of conventional accelerators. While GPUs and NPUs process the GEMM-heavy prefill stage efficiently, they remain underutilized during the GEMV-dominated decoding phase, resulting in limited throughput and excessive energy consumption at the edge. To overcome this challenge, we present EdgeCIM, a hardware-software co-design framework for accelerating decoder-only SLM inference. EdgeCIM directly targets the memory-bound decoding stage through a tiled hierarchy of SRAM-based digital Compute-in-Memory (CIM) macros coupled with a design space exploration process that co-optimizes latency and energy. By addressing this critical bottleneck, EdgeCIM enables real-time, energy-efficient inference under the strict constraints of edge deployment.

System Model

EdgeCIM consists of two main components: an optimization algorithm and an analytical simulator. Its primary objective is to explore the hardware design space and identify the optimal CIM-based hardware configuration for accelerating the decoding phase of decoder-only SLMs. The framework takes as input the target SLM configuration and a predefined hardware design space. The optimization algorithm samples candidate architectures and evaluates them using an objective function, while the simulator models the CIM-based accelerator and reports key performance metrics such as latency, energy, and area. These metrics are fed back into the optimization engine, which iteratively refines the search toward optimal solutions. The outcome is a hardware configuration that minimizes the objective function, along with its optimized parameters and performance metrics.
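
To make this loop concrete, the sketch below shows one way such a search could be structured in Python: an optimizer iterates over candidate configurations, an analytical simulator reports latency, energy, and area, and the best feasible design is retained. The names (`explore`, `simulate`, `Metrics`) and the area budget are illustrative assumptions, not the actual EdgeCIM interfaces.

```python
# Minimal sketch of an EdgeCIM-style exploration loop (hypothetical names, not the real API).
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Metrics:
    latency_s: float   # latency to generate the target number of tokens
    energy_j: float    # energy consumed for the same workload
    area_mm2: float    # silicon area of the candidate accelerator

def explore(candidates: Iterable,
            simulate: Callable[[object], Metrics],
            score: Callable[[Metrics], float],
            area_budget_mm2: float = 110.0):
    """Evaluate candidate hardware configurations and keep the best feasible one."""
    best_cfg, best_metrics, best_score = None, None, float("inf")
    for cfg in candidates:                 # optimization algorithm samples architectures
        m = simulate(cfg)                  # analytical simulator reports the metrics
        if m.area_mm2 > area_budget_mm2:   # respect the area constraints of edge devices
            continue
        s = score(m)                       # objective function (see Proposed Technique)
        if s < best_score:
            best_cfg, best_metrics, best_score = cfg, m, s
    return best_cfg, best_metrics
```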

At the hardware level, the EdgeCIM accelerator adopts a hierarchical tiled architecture. The chip is organized into clusters, each containing multiple tiles, and every tile integrates an array of SRAM-based digital CIM processing elements (PEs). Each PE performs GEMV operations. By exposing a range of configurable architectural parameters, EdgeCIM defines a rich design space that can be systematically explored to adapt the hardware to different SLM workloads while meeting the strict constraints of edge deployment.
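
As a concrete illustration of this design space, one candidate configuration can be described by a small record of the parameters above; the field names and example value ranges below are assumptions for exposition rather than the exact parameter set used by EdgeCIM.

```python
# Illustrative description of one point in the design space
# (field names and value ranges are assumed for exposition).
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class HwConfig:
    clusters: int        # clusters per chip
    tiles: int           # tiles per cluster
    active_tiles: int    # tiles computing concurrently; the rest preload the next block
    pes: int             # SRAM-based digital CIM PEs per tile
    bus_width_bits: int  # on-chip bus width

def enumerate_space():
    """Yield candidate configurations from example parameter ranges."""
    for c, t, a, p, w in product((1, 2, 4), (4, 8, 16), (2, 4, 8), (8, 16), (64, 128, 256)):
        if a <= t:  # active tiles cannot exceed the tiles available
            yield HwConfig(c, t, a, p, w)
```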

Proposed Technique

To accelerate the GEMV-heavy decoding phase of SLMs, EdgeCIM introduces a set of hardware-software co-design techniques. At the dataflow level, large projection, attention, linear, and feed-forward layers are partitioned into smaller blocks that are streamed sequentially from off-chip memory through the accelerator. This block-based mapping folds computation over time, enabling efficient execution under on-chip storage constraints. In parallel, an active-tile pipelining strategy overlaps computation with memory transfers: while a subset of tiles executes GEMV operations, the remaining tiles preload the next block from DRAM. This reduces bandwidth pressure and hides memory latency without sacrificing throughput.
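
The following NumPy sketch models this dataflow at a behavioral level: a large weight matrix is folded into row blocks that are streamed one at a time, and the block needed next is fetched while the current one is being consumed, mirroring the active-tile pipelining described above. The block size, matrix dimensions, and double-buffering structure are illustrative assumptions, not the hardware implementation.

```python
# Behavioral sketch of block-based GEMV with overlap between compute and preload
# (NumPy model for exposition; sizes are arbitrary assumptions).
import numpy as np

def blocked_gemv(weights: np.ndarray, x: np.ndarray, rows_per_block: int) -> np.ndarray:
    """Compute y = W @ x one row block at a time, as if streamed from off-chip memory."""
    n_rows = weights.shape[0]
    y = np.empty(n_rows, dtype=weights.dtype)
    next_block = weights[0:rows_per_block]       # first block is preloaded before compute starts
    for start in range(0, n_rows, rows_per_block):
        block = next_block                       # active tiles operate on the resident block
        end = min(start + rows_per_block, n_rows)
        nxt = min(end + rows_per_block, n_rows)
        next_block = weights[end:nxt]            # idle tiles preload the following block, overlapped with compute
        y[start:end] = block @ x                 # GEMV on the resident block
    return y

# Example: a 2048x2048 projection folded into 256-row blocks.
W = np.random.randn(2048, 2048).astype(np.float32)
x = np.random.randn(2048).astype(np.float32)
assert np.allclose(blocked_gemv(W, x, 256), W @ x, atol=1e-3)
```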

At the software level, EdgeCIM integrates a genetic algorithm (GA)-driven design space exploration framework implemented in Python. The optimizer samples candidate hardware configurations, varying architectural parameters such as the number of clusters, tiles, active tiles, PEs, and bus widths, and evaluates them using an analytical simulator implemented in C++. The simulator models the CIM-based accelerator and reports performance metrics, which are fed back into the optimization loop. Through iterative refinement, the GA converges toward Pareto-optimal designs that adapt to diverse SLM workloads and edge deployment requirements. Formally, the optimization problem is expressed as:

minimize_{h ∈ H}  L(h)^α × E(h)^(1−α),   subject to 0 ≤ α ≤ 1,

where L(h) and E(h) denote the latency and energy of generating a specified number of tokens under configuration h, and H represents the hardware design space. The tunable parameter α allows the framework to prioritize either low latency or high energy efficiency, depending on deployment requirements. By combining hardware-aware mapping with a formal optimization framework, EdgeCIM systematically identifies configurations that optimize performance for real-time edge inference.
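
As a small worked example of how α steers this trade-off, the snippet below evaluates the objective for one hypothetical configuration at several α values; the latency and energy numbers are made up purely for illustration.

```python
# Illustration of the objective L(h)^alpha * E(h)^(1 - alpha) with placeholder metrics.
def objective(latency_s: float, energy_j: float, alpha: float) -> float:
    return (latency_s ** alpha) * (energy_j ** (1.0 - alpha))

latency_s, energy_j = 0.8, 2.5   # hypothetical metrics for one configuration h
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: objective={objective(latency_s, energy_j, alpha):.3f}")
# alpha = 1 optimizes purely for latency, alpha = 0 purely for energy,
# and intermediate values trade them off via a weighted geometric mean.
```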

Results

We evaluate EdgeCIM on a diverse set of decoder-only SLMs, including TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B–3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B–4B). Across all benchmarks, EdgeCIM sustains an average of 336.4 tokens/s and 173 tokens/J while satisfying the area constraints of edge devices. The resulting hardware configurations occupy between 18.4 and 103.6 mm², confirming the feasibility of integrating EdgeCIM into mobile and embedded platforms.

The results further show that EdgeCIM consistently delivers substantial performance gains compared to state-of-the-art commercial edge accelerators. On LLaMA3.2-1B, it achieves 7.3× higher throughput than NVIDIA’s Orin Nano and 2.44× higher than Orin AGX, while improving energy efficiency by 49.6× over Orin Nano. For SmolLM2-1.7B, throughput improves by 6.36× relative to Orin Nano and 4× over Orin Nano Super, underscoring the scalability of EdgeCIM across different model sizes. On larger workloads such as LLaMA3.2-3B, EdgeCIM surpasses Qualcomm SA8255P, Snapdragon X Elite, and Snapdragon 8 Elite Mobile by 9.95×, 7.57×, and 5.93×, respectively. Overall, these results confirm that EdgeCIM sustains real-time, energy-efficient inference across diverse SLMs while operating within the strict constraints of edge deployment.

How to download/use the Dataset

  • You can download the dataset from the following link: Coming Soon
  • You can download the code from the following link: Coming Soon
  • How to use the dataset and Python code: Coming Soon

Acknowledgments

This project was carried out in collaboration with the University of California, Irvine (UCI).