# The Future of Low-bitwidth Reconfigurable and Parallel AI Computing

## 2023/12/12

## Masato Motomura, AI Computing Research Unit (ArtIC), Tokyo Institute of Technology (Tokyo Tech)





## Introduction: Myself and Our Group (ArtIC@Tokyo Tech)









## **DRP Research Started Around 2000**

#### An Accelerator IP core in an SoC

#### Filling the gap between CPU and hard-wired logic





### DRP: 1<sup>st</sup> Presented at Microprocessor Forum 21 Years Ago

#### November 25, 2002


#### New NEC Array Speeds Data

NEC Introduces Its Dynamically Reconfigurable 512-Processor Array

#### By Max Baron



Digital media and communications are in their infancy. Most of their development and deployment roadmaps are still in the future, but they promise to become an indispensable part of everyday life. For computer architects, the new applications represent both challenges and rewards. The workloads are data intensive and require performance levels that are often impractical to implement with general-purpose processors. Challenge and opportunity are engendering specialized architectures that are competing for a chance to show their might and enjoy a slice of revenues that may rival those of the PC market. Two years ago, in Japan, NEC's research team started looking at an interesting engine that could be used in the new applications.

On October 16, 2002, at the annual Microprocessor Forum, Masa Motomura, an architect at NEC's System ULSI Development Division, unveiled details of the company's new massively parallel architecture, a dynamically reconfigurable processor (DRP). The new architecture can be used as a network processor or as a DSP engine in applications requiring high performance.

#### DRP Brings Together Three Powerful Concepts

The DRP is not the first-ever massively parallel engine, nor will it be the last, but the innovative features that set it apart really demand a second look. Three notable features stand out from the rest. To begin with, most arrays are designed as network processors or as DSP engines; the DRP can perform both functions. It can also pinch-hit as a semi-efficient, but working, general-purpose processor.

Second, NEC's architects have created an architecture that can change its array configuration on a cycle-by-cycle basis, making these changes indistinguishable, timing-wise, from instructions. Most other designs have defined longer reconfiguration delays that work best if the resulting interconnections are kept fixed for the duration of a thread.

Finally, the DRP applies a different solution to the propagation delays that must be taken into account as data moves across the chip. Where most other architectures are synchronizing units via clocked registers and processing elements (PE), the DRP can define multiple propagation paths to become one pipe stage—a small asynchronous engine walled between clocked registers to make it cooperate with other parts of the array.

#### PE Architecture Supports Flow-Through Data

Figure 1 shows the DRP's byte-wide processing element, which consists of a data-management unit (DMU) and an ALU designed to operate on 8-bit and 1-bit data. The DMU can execute 25 instructions that include inversion, shifting, masking, and constant generation, using 8-bit and 1-bit operands. A special command named WIRE is used to cause the DMU to pass the operand unchanged to the PE's outputs. The ALU can execute 23 arithmetic/logic instructions on 8-bit data and can use a carry propagation path to process data that is wider than 8 bits. Like the DMU, the ALU has a WIRE command.
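The carry propagation path described above can be illustrated with a minimal sketch. It assumes a hypothetical byte-wide ALU model; only the `WIRE` command is named in the article, so the `ADD` opcode, the function names, and the byte ordering are illustrative assumptions rather than NEC's actual instruction set.

```python
MASK8 = 0xFF  # each PE operates on one byte


def alu(op: str, a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """Hypothetical 8-bit ALU model: returns (result_byte, carry_out)."""
    if op == "WIRE":
        # WIRE passes the operand through unchanged (flow-through data).
        return a & MASK8, 0
    if op == "ADD":
        # ADD is an assumed opcode name used only for this illustration.
        total = (a & MASK8) + (b & MASK8) + carry_in
        return total & MASK8, total >> 8
    raise ValueError(f"unsupported op: {op}")


def add_wide(x: int, y: int, n_pes: int) -> int:
    """Add two (8 * n_pes)-bit values by chaining byte-wide PEs via the carry path."""
    carry, result = 0, 0
    for i in range(n_pes):  # PE i handles byte i, least significant byte first
        a = (x >> (8 * i)) & MASK8
        b = (y >> (8 * i)) & MASK8
        byte, carry = alu("ADD", a, b, carry)
        result |= byte << (8 * i)
    return result  # the final carry out is dropped, so the sum wraps around
```

For example, `add_wide(0x12FF, 0x0001, 2)` chains two PEs: the low byte overflows and its carry ripples into the high byte.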



Figure 1. Simplified block diagram of NEC's processing element.



Masa Motomura of NEC unveiled details about DRP at MPF 2002. Photo by Ross Mehan.

## Guess Who He is…

#### ···· Long Story, Hmm

Aside from the usual multiwindowed programmer interface, NEC's compiler offers graphic views of the scheduled dataflow graph and the scheduled state-transition diagram. Place-and-route-determined connections are also displayed to help in analyzing critical-path delays. The programmer can assign a critical-path delay to be used by the high-level synthesis program. The program will divide the implementation into multiple states to fit within the critical-path-delay budget. It is expected that the visual display of information could help in speeding up place-and-route work but will be of limited use in programming and debugging complex code. The DRP has been provided with internal logic to help debug programs.
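The idea of dividing an implementation into states under a critical-path-delay budget can be sketched as a greedy scheduler. This is not NEC's synthesis algorithm, which the article does not describe; the function name, the delay values, and the greedy policy are assumptions chosen for illustration.

```python
def schedule_states(ops, deps, budget):
    """Greedily assign operations to states under a per-state delay budget.

    ops:    dict mapping op name -> delay, listed in topological order
    deps:   dict mapping op name -> list of predecessor op names
    budget: maximum combinational path delay allowed within one state
    Returns a dict mapping op name -> state index.
    """
    state, finish = {}, {}
    for name, delay in ops.items():
        if delay > budget:
            raise ValueError(f"{name} alone exceeds the delay budget")
        preds = deps.get(name, [])
        # An op cannot run in an earlier state than any of its predecessors.
        st = max((state[p] for p in preds), default=0)
        # Predecessors in earlier states are registered, so they arrive at time 0.
        start = max((finish[p] for p in preds if state[p] == st), default=0)
        if start + delay > budget:
            st, start = st + 1, 0  # path too long: open a new state
        state[name], finish[name] = st, start + delay
    return state
```

With a chain `a -> b -> c -> d` of delays 3, 3, 3, 2 and a budget of 7, the sketch places `a` and `b` in state 0 and `c` and `d` in state 1.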

#### The DRP in Action

System designers must be able to take advantage of the chip's most prominent features: applicability beyond digital signal processing, one-cycle datapath change, and dataflow. NEC's architects have endowed the PE with capabilities that can support general data-intensive processing, but they had to add eight 32-bit multipliers to meet DSP needs such as could be encountered in high-end image processing. NEC's compiler provides a seamless environment for writing code aimed at PEs and multipliers.

## **DRP Features Tiled CGRA Architecture**



## Execution Model (1): Spatial Mapping



## **Execution Model (2):** Temporal Sequencing



## **Putting DRP in Execution Model Landscape**



# Putting DRP in Execution Model Landscape – In 3D




## **Recent Evolution: DRP-AI for Neural Networks**



Now used in Renesas's MCU/MPU products. Total shipment of DRP chips is still rapidly expanding!

#### DRP-AI Demo & Its New Gen. Exposure at ISSCC 2024



## DRP: Early-Coming/Ever-Evolving in SDH/SDC Movement



DRP's Spatial-then-Temporal Processing Style Led Me to the <u>Structure-Oriented Computing</u> Concept







## **AI's Energy Problem**

#### AI Technology is Now Omnipresent in Our Society









- Generative AI for Text/Image
- Smart Robotics
- Autonomous Drones
- Smart Social Infrastructure

Serious concern: its energy consumption and environmental impact

## What Do We Know About It?



(Stanford Report 2023)

released the most emissions and required the most power consumption.

(IEEE Spectrum 2023)


## And … What Can We Do About It?



## Hence

We Should Make AI Computing Several Orders of Magnitude More <u>Energy Conscious</u>




(Forbes 2023)



## **But, How?**

**Answer: Interplay Among Algorithm-Architecture-Real Chip** 




## **Observation: AI Computing Landscape**

#### It is All About How to Handle Large-Volume Inputs and Outputs



## **AI Computing:** Driven by Energy Minimization Principle



### **Architectural Shifts from Sequence to Structure**



## **Analogy: A Bit Dangerous yet Potentially Useful**




## **Real World Example: Tenstorrent**



Mix of Sequence-Structure Strategy Depends on Each Architecture

Finding the best mix – on each side – is the heart of architecture design





## **Lottery Ticket Hypothesis**





## **HNN Utilizes Fixed Random Weights**

- Fixed at initial random numbers
- Weights are no longer variables but are (random) constants
- Binary weights {-1, +1} show better accuracy than multi-bit weights
  - [V. Ramanujan+, CVPR2020]
- Enhance computation efficiency





## **HNN Needs a Supermask**

- A supermask is binary {0, 1} information for selecting connections
- Conventional NNs do not need supermask
- A supermask provides the trade-off between accuracy and sparsity
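The two ideas above (fixed random binary weights plus a binary supermask) can be sketched as a single layer. This is a minimal illustration, not Hiddenite's implementation: the function names are hypothetical, Python's built-in generator stands in for whatever weight source is used, and the top-k selection by score is assumed in the spirit of the cited [V. Ramanujan+, CVPR2020] approach.

```python
import random


def random_binary_weights(n_out, n_in, seed):
    """Weights are random constants in {-1, +1}, fully determined by the seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n_in)] for _ in range(n_out)]


def topk_supermask(scores, keep_ratio):
    """Binary {0, 1} mask keeping the highest-scoring fraction of connections.

    Assumption: scores are learned per connection; ties at the threshold
    may keep slightly more than the requested fraction.
    """
    flat = sorted((s for row in scores for s in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def hnn_layer(x, weights, mask):
    """Forward pass: only connections selected by the supermask contribute."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]
```

Lowering `keep_ratio` makes the layer sparser, which is the accuracy-versus-sparsity trade-off the supermask controls; the weights themselves are never trained.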





## **Key Contributions of This Work**

## The first HNN inference chip, *Hiddenite*:

Hidden Network Inference Tensor Engine

1. On-chip weight generation
   - eliminates the need for storing and loading weights
2. On-chip supermask expansion
   - reduces the model parameters to load
3. A high-density 4D parallel processor
   - improves efficiency by maximizing data re-use

Items 1 and 2 together form on-chip model construction (supermask plus generated weights).




## **4D PE Tensor: Dataflow**



## **Generating RNG Seeds by Hashing**



Hashed seeds eliminate the need to store weights without accuracy degradation
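A minimal sketch of the idea: derive a seed from a weight block's position by hashing, then regenerate the same binary weights on demand instead of storing them. The hash function and random number generator used on the chip are not given here, so SHA-256, Python's built-in generator, the key string, and the function names are all stand-in assumptions.

```python
import hashlib
import random


def seed_from_position(layer, out_channel, base_key=b"hypothetical-model-key"):
    """Hash (layer, channel) into a 32-bit seed; the key is an assumed placeholder."""
    payload = base_key + f"{layer}:{out_channel}".encode()
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:4], "big")


def generate_weights(layer, out_channel, fan_in):
    """Regenerate the {-1, +1} weights of one output channel from its hashed seed."""
    rng = random.Random(seed_from_position(layer, out_channel))
    return [1 if rng.getrandbits(1) else -1 for _ in range(fan_in)]
```

Because the seed depends only on the position, calling `generate_weights(3, 7, 16)` twice yields identical weights, so nothing has to be kept in memory between uses.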

## **Total External Memory Access Reduction**



Hiddenite drastically reduces power-consuming external memory accesses

## Accuracy vs. Model Size on ImageNet





- \* [V. Ramanujan+, CVPR2020]
- \*\* [J. Faraone+, CVPR 2018]

- Comparable or better accuracies
- Smaller model size than binary model

## **Hiddenite Chip Summary**

## Micrograph



## **Specification Table**

| Item                | Value                                                                              |
|---------------------|------------------------------------------------------------------------------------|
| Technology          | TSMC 40nm CMOS (LP)                                                                |
| Package             | QFN80 (48 Signal Pins)                                                             |
| Chip Size           | 3mm x 3mm                                                                          |
| Core Area           | SRAM: 3.78mm<sup>2</sup><br>Logic: 0.58mm<sup>2</sup><br>Total: 4.36mm<sup>2</sup> |
| Core V<sub>DD</sub> | 0.8-1.1V                                                                           |
| I/O V<sub>DD</sub>  | 3.3V                                                                               |
| Gate Count          | 746K Gates                                                                         |
| SRAM                | AMEM: 8Mb<br>SMEM: 256kb<br>ZMEM: 128kb<br>Total: 8.375Mb                          |

## **Measured Results on ImageNet**



Efficiency on ResNet50: 18.2-to-16.0TOPS/W at 0.77V
 Maximum frequencies: 614-to-573MHz at 1.1V

## What Has Hiddenite Achieved?

- Hiddenite is the first HNN inference chip
- Drastically reduces external memory access by
  - On-chip model construction
    - On-chip weight generation
    - On-chip supermask expansion
  - Slice-based layer-fusion processing
- SOTA accuracy relative to model size by score distillation

**Algorithm** 

**Architecture** 

**Real Chip** 


SOTA-level computation efficiency

We also presented a new Strong Lottery Ticket training algorithm at <u>ICML 2022</u>.

"Multicoated Supermasks Enhance Hidden Networks"

(Block diagram labels: SEU, WGU, AMEM, 2D BS PE Tensor, CTRL, PPU)



## **Combinatorial Optimization Appears Everywhere**



**NP-Hard: Notoriously Difficult for Present Computers** 



## **Serial and Parallel Annealing Policies**

#### **Traditional Method**

#### **Our Proposal (ISSCC 2020)**


N: #Spins



SCA<sup>[6]</sup> can realize O(N) times faster spin update than SA
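The O(N) difference comes from how many sequential steps one sweep over N spins needs. The sketch below contrasts a serial sweep with a synchronous parallel sweep on an Ising model with symmetric couplings `J` (zero diagonal) and biases `h`. It is a simplified illustration, not the exact SCA formulation: the pinning term `q` that discourages simultaneous flips, the flip-probability form, and all function names are assumptions.

```python
import math
import random


def local_field(J, h, s, i):
    """Effective field on spin i from all other spins plus its bias."""
    return sum(J[i][j] * s[j] for j in range(len(s))) + h[i]


def energy(J, h, s):
    """Ising energy E = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i."""
    n = len(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(h[i] * s[i] for i in range(n))


def serial_sweep(J, h, s, T, rng):
    """Serial policy: spins are updated one at a time, N sequential steps."""
    s = list(s)
    for i in range(len(s)):
        dE = 2 * s[i] * local_field(J, h, s, i)  # energy change if spin i flips
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            s[i] = -s[i]
    return s


def parallel_sweep(J, h, s, T, q, rng):
    """Parallel policy: every spin decides from the same old state, in one step."""
    new = []
    for i in range(len(s)):  # iterations are independent, so hardware can run them at once
        dE = 2 * s[i] * local_field(J, h, s, i) + 2 * q  # q penalizes flipping
        p_flip = 0.5 * (1 - math.tanh(dE / (2 * T)))  # equals 1 / (1 + exp(dE / T))
        new.append(-s[i] if rng.random() < p_flip else s[i])
    return new
```

In `serial_sweep` each update depends on the previous one, whereas every iteration of the loop in `parallel_sweep` reads only the old state, which is what allows a fully spin-parallel architecture to update all N spins in a single step.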

## **Comparison of Annealing Algorithms**

![](_page_41_Figure_1.jpeg)

## **Motivation for Applying Multi-Annealing Algorithms**

![](_page_42_Figure_1.jpeg)


- Optimal policy depends on the Ising model (i.e., Problem to solve)
  - RPA works better in most cases
  - DA is better for Ising models having many negative couplings

## **Amorphica: Metamorphic Annealing Architecture**

![](_page_43_Figure_1.jpeg)

- Near Memory, Fully Spin-Parallel Architecture
- SA/DA/SCA/RPA algorithms are applied with dynamic reconfigurability
- Very close to what a Binary Neural Network (BNN) inference chip looks like

## **Amorphica Chip Summary**

### Micrograph

### **Specification Table**

(Micrograph: the visible block labels are all WMEM.)

| Item                | Value                                                 |
|---------------------|-------------------------------------------------------|
| Technology          | TSMC 40nm CMOS (LP)                                   |
| Package             | QFN80                                                 |
| Chip Size           | 3mm x 3mm                                             |
| Core Area           | SRAM: 3.55mm<sup>2</sup><br>Logic: 1.48mm<sup>2</sup> |
| Core V<sub>DD</sub> | 0.8-1.1V                                              |
| I/O V<sub>DD</sub>  | 3.3V                                                  |
| Max Frequency       | 336MHz@1.1V<br>134MHz@0.8V                            |
| Gate Count          | 1.2M Gates                                            |
| SRAM                | WMEM: 8Mb<br>DMEM: 64Kb<br>IMEM: 64Kb<br>Total: 8.125Mb |

## **Comparison to GPU (Nvidia RTX2080-Ti)**

![](_page_45_Figure_1.jpeg)

Up to a 58x speed-up can be achieved at around 1/500 of the power consumption. Since energy is power times time, that is roughly 58 × 500 ≈ 30k times more energy efficient.

## **Key Contributions of This Work**

![](_page_46_Figure_1.jpeg)

## Wrap Up The Two Showcases

![](_page_47_Figure_1.jpeg)

## Vision: SoCs/SiPs for the Smart-X Society

#### **General SoC/SiP View**

![](_page_48_Figure_2.jpeg)

**SoC** (System on Chip), **SiP** (System in Package) for **Smart-X** Systems, e.g.,

- Mobile Devices
- Mobilities
- Wearable Devices

=> Ensemble of Domain-Specific Engines on some common **low-bitwidth** reconfigurable and parallel architecture foundation.

This vision explains why we value real chip implementation (as opposed to using FPGAs).

## **Key Takeaways**

Importance of the Interplay Among <u>Algorithm-Architecture-Real Chip</u>

![](_page_49_Figure_2.jpeg)


# Many Thanks to Collaborators! Questions?

![](_page_50_Picture_1.jpeg)

![](_page_50_Picture_2.jpeg)

## **CGRA: Past and Present**

### CGRA boom in the late '90s to '00s

- Lots of academic projects and startups
  - PipeRench, Chameleon, IPFlex, etc.
- Most of them "Hyped-out"

## Dynamically Reconfigurable Processor (DRP) started by NEC, succeeded by Renesas

![](_page_51_Picture_7.jpeg)

![](_page_51_Figure_8.jpeg)

RF CMOS Radio frequency complementary metal-oxide semiconductor

**DRP** alone survived and continued its growth, and is now glowing beyond the age of 20!

## **Challenges of Full-connection Annealing Processors**

![](_page_52_Figure_1.jpeg)


![](_page_53_Figure_1.jpeg)



![](_page_54_Figure_1.jpeg)
