### NOC-BASED SUPPORT OF HETEROGENEOUS CACHE-COHERENCE MODELS FOR ACCELERATORS

Davide Giri Paolo Mantovani Luca P. Carloni

Columbia University New York, USA ACM/IEEE NOCS 2018 Torino, Italy

# SOC TRENDS

- Heterogeneity
  - Custom accelerators
- NoC
- Shared memory

Challenges

- Scalability
- Programmability



Mobileye EyeQ5, 2020.



Qualcomm Snapdragon 835, 2017.

## LOOSELY-COUPLED ACCELERATORS

Major speedups and energy savings:

- Highly parallel and customized datapath
- Aggressively banked private local memory (PLM)



What should the cache coherence model for accelerators be?

- We identified three main models in the literature

## **ACCELERATOR MODELS: FULLY COHERENT**

Coherent with entire cache hierarchy

- Same coherence model as the processor

Programming requirements

Race free accelerator execution

Implementation variants

- Generally bus-based
- Accelerators may own a cache
  - With a cache: IBM CAPI, [Y. Shao et al., MICRO '16], [M. J. Lyons et al., TACO '12]
  - Without a cache: ARM ACE-lite



## **ACCELERATOR MODELS: NON COHERENT**

Not coherent with cache hierarchy

- Caches are by-passed

Programming requirements

- Race free accelerator execution
- Flush all caches prior to accelerator execution

Implementation variants

- Generally NoC-based and DMA-based
  - [Y. Chen et al., ICCD '13], [E. Cota et al., DAC '15], [Y. Shao et al., MICRO '16]



## ACCELERATOR MODELS: LLC COHERENT

Coherent with LLC only

Processors' private caches are by-passed

#### Programming requirements

- Race free accelerator execution
- Flush processors' private caches prior to accelerator execution

Implementation variants

No implementation in literature

- First proposed by [E. Cota et al., DAC '15]
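
The three models differ, from the programmer's side, mainly in which caches must be flushed before the accelerator starts. The mapping can be sketched in Python as follows; the enum, the function name, and the cache-level labels are illustrative assumptions and are not taken from ESP.

```python
from enum import Enum


class CoherenceModel(Enum):
    FULLY_COHERENT = "fully-coherent"
    LLC_COHERENT = "llc-coherent"
    NON_COHERENT = "non-coherent"


def caches_to_flush(model):
    """Return the cache levels software must flush before the accelerator runs.

    The labels "private" and "llc" are hypothetical names used only in this sketch.
    """
    if model is CoherenceModel.FULLY_COHERENT:
        return set()                 # coherent with the entire hierarchy: no flush
    if model is CoherenceModel.LLC_COHERENT:
        return {"private"}           # processors' private caches are by-passed
    if model is CoherenceModel.NON_COHERENT:
        return {"private", "llc"}    # all caches are by-passed
    raise ValueError(f"unknown coherence model: {model!r}")
```

Race-free execution is required in all three cases and is not captured by this mapping.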



## CONTRIBUTIONS

#### Protocol.

- Variation of MESI to support 3 coherence models for accelerators (NoC-based)

#### **Coherence Models.**

- Show how each model can outperform the others in some cases
- Show that the best choice of model varies at runtime

#### Architecture.

Design of a multi-core NoC-based architecture that supports:

- Three models of coherence for accelerators
- Run-time selection of the coherence model for each accelerator
- Coexistence of heterogeneous coherence models for accelerators

## OUR SOC PLATFORM

### Our design is based on an instance of **Embedded Scalable Platforms (ESP)** [L. P. Carloni, DAC '16]

- Socketed tiles
- NoC
- Easy integration and reuse of heterogeneous components

#### We added a cache hierarchy to ESP

- Now it can run multi-processor and multi-accelerator applications on Linux SMP



## ESP: NOC

- 2D-mesh
- 1-cycle hops
- 6 physical planes to prevent deadlock and to provide sufficient bandwidth
- Point-to-point ordering required to prevent deadlock
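
As a rough illustration of what 1-cycle hops on a 2D mesh imply, the sketch below computes the minimum hop count between two tiles and the corresponding zero-load traversal time. It assumes minimal (shortest-path) routing and ignores contention and injection/ejection overheads; the slides do not state the routing algorithm, and the function names are hypothetical.

```python
def min_hops(src, dst):
    """Minimum number of hops between two tiles given as (row, col) mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def zero_load_cycles(src, dst, cycles_per_hop=1):
    """Lower bound on traversal cycles under the assumptions stated above."""
    return min_hops(src, dst) * cycles_per_hop


# Corner to corner on a 4x4 mesh: 3 hops in each dimension.
print(zero_load_cycles((0, 0), (3, 3)))  # prints 6
```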



## **ESP: PROCESSOR TILE**

Main components

- Single processor core
- L2 private cache (added for this work)

#### In this work

- Up to 2 processor tiles
- 64KB private caches
- Off-the-shelf processor with L1 write-through caches



## **ESP: MEMORY TILE**

Main components

- Memory controller
- LLC and directory
  - Added for this work
  - Can be split over multiple tiles

In this work

- Up to 2 memory tiles
- Up to 2MB aggregate LLC



## **ESP: ACCELERATOR TILE**

#### Main components

- Any accelerator complying with a simple interface
- A small TLB
- A DMA controller and/or a private cache (added for this work)
- Support for **run-time selection of coherence model** through one I/O write to the configuration registers
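
Run-time selection through a single I/O write can be pictured with a toy model of the tile's configuration registers. This is only a sketch: the register offset, the numeric encoding of the models, and all names below are hypothetical and do not reflect the actual ESP register map.

```python
class ConfigRegisters:
    """Toy stand-in for an accelerator tile's memory-mapped configuration registers."""

    COHERENCE_REG = 0x0  # hypothetical register offset
    ENCODING = {         # hypothetical encoding of the three models
        "non-coherent": 0,
        "llc-coherent": 1,
        "fully-coherent": 2,
    }

    def __init__(self):
        self.regs = {}
        self.writes = 0  # counts I/O writes so the "one write" claim can be checked

    def io_write(self, offset, value):
        self.regs[offset] = value
        self.writes += 1


def select_coherence(tile, model):
    """Select the coherence model for the next invocation with one I/O write."""
    tile.io_write(ConfigRegisters.COHERENCE_REG, ConfigRegisters.ENCODING[model])


tile = ConfigRegisters()
select_coherence(tile, "llc-coherent")
```

Because the selection is a plain register write, the driver can change the model between two invocations of the same accelerator.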



## OUR PROTOCOL

We modified a classic MESI directory-based cache-coherence protocol

- to make it work over a NoC (atomic operations)
- to support all coherence models for accelerators (recalls, flush, LLC-coherent requests)

#### **Directory controller**

- Write-back: add a Valid state and dirty bit
- Recalls
- Flush
- LLC-coherent read/write requests

### Private cache controller

- L1 invalidation
- Recalls
- Flush
- Atomic operations

### **OUR PROTOCOL: DIRECTORY CONTROLLER EXCERPT**

| State \ Request       | LLC-coherent Read                                          | LLC-coherent Write                                             |
|-----------------------|------------------------------------------------------------|----------------------------------------------------------------|
| Invalid               | Read memory<br>Send data to requestor<br>Go to Valid state | Read memory if misaligned<br>Write to LLC<br>Go to Valid state |
| Valid                 | Send data to requestor                                     | Write to LLC                                                   |
| Shared                | -                                                          | -                                                              |
| Exclusive             | -                                                          | -                                                              |
| Modified              | -                                                          | -                                                              |
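
The Invalid and Valid rows of this excerpt can be written as a small transition function. The sketch below is a Python rendering of those two rows only; the function name, the state abbreviations, and the action strings are illustrative. It refuses the Shared, Exclusive, and Modified states because the excerpt does not specify them.

```python
def llc_coherent_request(state, request, misaligned=False):
    """Return (actions, next_state) for an LLC-coherent request at the directory.

    state:      "I" (Invalid) or "V" (Valid)
    request:    "read" or "write"
    misaligned: the condition the table calls "misaligned"; only affects writes
    """
    if state not in ("I", "V"):
        raise NotImplementedError("the excerpt only specifies the I and V states")
    actions = []
    if request == "read":
        if state == "I":
            actions.append("read memory")       # fetch the line from memory first
        actions.append("send data to requestor")
    elif request == "write":
        if state == "I" and misaligned:
            actions.append("read memory")       # fetch the line before writing to it
        actions.append("write to LLC")
    else:
        raise ValueError(f"unknown request: {request!r}")
    return actions, "V"                          # both rows end in the Valid state
```

For example, `llc_coherent_request("I", "write", misaligned=True)` yields the actions `["read memory", "write to LLC"]` and the next state `"V"`.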

## **EXPERIMENTAL SETUP**

#### We designed 4 custom accelerators:

- Sort (merge and bitonic sort combined)
- Sparse Matrix-Vector Multiplication
- FFT-1D and FFT-2D

These accelerators represent a good mix of **memory access pattern** characteristics:

- Varying footprint size (32KB to 20MB)
- Streaming vs. irregular pattern
- Temporal and spatial locality

### ESP's GUI:

#### The CAD flow from GUI to bitstream is fully automated.

[Figure: ESP GUI, "Check and Update SoC Configuration" panel showing a 4×4 grid of tiles: accelerator tiles (sort, spmv, fft1d, fft2d), each with a "Cache" checkbox, two processor tiles, and two memory tiles (one labeled "Memory & Debug").]

We deployed our SoC on FPGA and we executed applications on Linux SMP.



October 4th, 2018

### **RESULTS: MULTIPLE ACCELERATORS**





### **RESULTS: FULLY-COHERENT ACCELERATORS**

The fully-coherent model can win for workloads whose data structures fit the accelerator's private cache.

No flush needed.



### **RESULTS: SUMMARY**

- The best coherence model varies with the accelerator workload size and with the number of active accelerators in the system.
- LLC-coherent and fully-coherent models can significantly reduce accesses to DRAM.



## CONCLUSIONS

- There is **no absolute winner** among the coherence models.
- Workload size, cache size, and the number of active accelerators all influence the best choice; hence, the best choice can vary at run time.
- We proposed a cache-coherence protocol that supports all three coherence models in a NoC-based SoC:
  - Fully-coherent, LLC-coherent, non-coherent.
- We designed a NoC-based SoC architecture enabling
  - **Coexistence** of heterogeneous coherence models operating simultaneously.
  - **Run-time selection** of the coherence model for each accelerator.

### THANK YOU!

Any questions?


# BACKUP

## **ESP: PROGRAMMABILITY**

- The accelerator driver is invoked by an application to offload a task.
- Accelerator tiles handle virtual memory without interrupting the processor cores
- We use locks to enforce race free execution of the accelerators. Additionally:
  - During the execution of non-coherent accelerators, we ensure that there exists only a single copy of the data.
  - For **LLC-coherent** accelerators data can be present both in DRAM and in the LLC.
- The flush phase becomes a negligible overhead for large accelerator workloads
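
The offload sequence described above (take a lock, flush whatever the selected model requires, then run the accelerator) can be sketched as follows. This is a simplified model and not the ESP driver: a single lock per driver object, the flush ordering, and all names are assumptions made for illustration.

```python
import threading

# Flushes required before execution, in order (private caches first so that
# dirty data reaches the LLC before the LLC itself is flushed). Labels are hypothetical.
FLUSH_ORDER = {
    "fully-coherent": (),
    "llc-coherent": ("private",),
    "non-coherent": ("private", "llc"),
}


class AcceleratorDriver:
    """Toy driver that serializes offloads and records what it did."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []

    def offload(self, model, task):
        with self._lock:                      # enforce race free execution
            for cache in FLUSH_ORDER[model]:
                self.log.append(f"flush {cache}")
            self.log.append("run")
            return task()                     # stands in for the accelerator run


driver = AcceleratorDriver()
result = driver.offload("non-coherent", lambda: sum(range(10)))
```

The flush cost is paid once per invocation, which is consistent with it becoming negligible as the workload grows.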



## ESP: CACHES

- Designed in SystemC and implemented through HLS.
- Configurable number of sets, ways, sharers, and owners.
- The device driver can select which caches to flush.

For this work:

- LLC: 2 MB
- Private caches: 64KB



### **OUR PROTOCOL: DIRECTORY CONTROLLER EXCERPT**

| | REQUESTS | | | | | DMA REQUESTS | | RESPONSES | |
|-----|---------------------------------------------------------------------------|-------------------------------------------------------------------------|-----------------------------------------------------------|------------------------------------------------------------------------|------------------------------------------------------------|---------------------------------|----------------------------------|--------------------|-------------------------------------------------------|
|     | GetS                                                                      | GetM                                                                    | PutS                                                      | PutM                                                                   | Evict                                                      | Read                            | Write                            | Inv-Ack            | Data                                                  |
| I   | read mem,<br>Excl. Data to req,<br>owner = req / E                        | read mem,<br>Data to req,<br>owner = req / M                            | Put-Ack to req                                            | Put-Ack to req                                                         |                                                            | read mem,<br>Data to req<br>/ V | [read mem],<br>write LLC,<br>/ V |                    |                                                       |
| V   | Excl. Data to req,<br>owner = req / E                                     | Data to req,<br>owner = req / M                                         | Put-Ack to req                                            | Put-Ack to req                                                         | [write mem]<br>/ I                                         | Data to req                     | write LLC                        |                    |                                                       |
| S   | Data to req,<br>sharers += req                                            | Data to req,<br>Inval. to sharers,<br>owner = req,<br>clear sharers / M | Put-Ack to req,<br>sharers -= req<br>/ V (if last sharer) | Put-Ack to req,<br>sharers -= req<br>/ V (if last sharer)              | [write mem],<br>Inval. to<br>sharers, clear<br>sharers / I |                                 |                                  |                    |                                                       |
| E   | Fwd-GetS to owner,<br>sharers+=req+owner,<br>clear owner / S <sup>D</sup> | Fwd-GetM<br>to owner,<br>owner = req<br>/ M                             | Put-Ack to req,<br>if req is owner:<br>- clear owner / V  | write LLC,<br>Put-Ack to req,<br>if req is owner:<br>- clear owner / V | Fwd-GetM<br>to owner,<br>clear owner<br>/ EI <sup>D</sup>  |                                 |                                  |                    |                                                       |
| M | Fwd-GetS to owner,<br>sharers+=req+owner,<br>clear owner / S <sup>D</sup> | Fwd-GetM<br>to owner,<br>owner = req | Put-Ack to req | write LLC,<br>Put-Ack to req,<br>if req is owner:<br>- clear owner / V | Fwd-GetM<br>to owner,<br>clear owner<br>/ MI <sup>D</sup> | | | | |
| S<sup>D</sup> | stall | stall | Put-Ack to req,<br>sharers -= req | Put-Ack to req,<br>sharers -= req | stall | | | | write LLC,<br>/ V (if no sharers),<br>/ S (otherwise) |
| EI<sup>D</sup> | stall | stall | Put-Ack to req,<br>sharers -= req | Put-Ack to req,<br>sharers -= req | | | | [write mem]<br>/ I | write mem<br>/ I |
| MI<sup>D</sup> | stall | stall | Put-Ack to req,<br>sharers -= req | Put-Ack to req,<br>sharers -= req | | | | | write mem<br>/ I |
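
To make the table easier to read, the sketch below encodes a few of its cells (GetS, GetM, and PutS in the stable states I, V, and S) as a Python class. It is a partial illustration only: the E and M rows, the transient states, and the response columns are omitted, and the class and action strings are hypothetical names rather than the authors' implementation.

```python
class DirectoryLine:
    """Directory entry for one cache line, covering only states I, V and S."""

    def __init__(self):
        self.state = "I"
        self.owner = None
        self.sharers = set()

    def handle(self, msg, req):
        """Apply one request from cache `req`; return the directory's actions."""
        s = self.state
        fetch = ["read mem"] if s == "I" else []   # V already holds the line in the LLC
        if msg == "GetS" and s in ("I", "V"):
            actions = fetch + [f"Excl. Data to {req}"]
            self.owner, self.state = req, "E"
        elif msg == "GetM" and s in ("I", "V"):
            actions = fetch + [f"Data to {req}"]
            self.owner, self.state = req, "M"
        elif msg == "GetS" and s == "S":
            actions = [f"Data to {req}"]
            self.sharers.add(req)
        elif msg == "GetM" and s == "S":
            actions = [f"Data to {req}", "Inval. to sharers"]
            self.owner, self.state = req, "M"
            self.sharers.clear()
        elif msg == "PutS" and s == "S":
            actions = [f"Put-Ack to {req}"]
            self.sharers.discard(req)
            if not self.sharers:                   # last sharer left: back to Valid
                self.state = "V"
        else:
            raise NotImplementedError(f"{msg} in state {s} is outside this sketch")
        return actions
```

Note how a line with no remaining sharers returns to V rather than I: the data stays valid in the LLC even though no private cache holds it.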