











# Introduction to Reconfigurable Computing













- Configurable computing (CC) attempts to increase performance and silicon utilization efficiency through logic recycling, using FPGAs and FPGA-like devices
- Hardware algorithms can be "paged" into and out of CC modules, much as operating systems perform software paging
  - Factors impacting performance:
    - → Logic speed
    - → Speed of reconfiguration
    - → Flexibility of configuration
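The paging analogy above can be sketched as a small simulation. This is only a toy model under stated assumptions: the function name `run_schedule`, the least-recently-used eviction policy, the task names, and the timing values are all hypothetical and do not describe any particular device.

```python
from collections import OrderedDict


def run_schedule(tasks, slots, reconfig_us, exec_us):
    """Return (total time in microseconds, number of configuration loads).

    tasks       -- hardware algorithms requested, in order (hypothetical names)
    slots       -- configurations that can be resident at once (assumption)
    reconfig_us -- cost of paging one configuration in (assumption)
    exec_us     -- run time of a task once its configuration is resident
    """
    resident = OrderedDict()  # keys ordered from least to most recently used
    total = 0.0
    loads = 0
    for name in tasks:
        if name in resident:
            resident.move_to_end(name)        # hit: no reconfiguration needed
        else:
            if len(resident) >= slots:
                resident.popitem(last=False)  # page out the LRU configuration
            resident[name] = True             # page the new configuration in
            total += reconfig_us
            loads += 1
        total += exec_us
    return total, loads


tasks = ["fir", "fft", "fir", "viterbi", "fir", "fft"]
print(run_schedule(tasks, slots=2, reconfig_us=30.0, exec_us=100.0))  # (720.0, 4)
```

In this model the reconfiguration cost only matters on a miss, so both faster reconfiguration and more resident slots reduce the total time.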













# **Resource Utilization**

### **Standard Microprocessor**

- → Specialized unit for each essential task
- → Unit functionality fixed
- → Idle units lower silicon utilization
- → Basic algorithms fixed

### **Reconfigurable Processor**

- → Each unit specialized to fit task
- → Unit functionality alterable at run time
- → Idle units reconfigured for new tasks
- → Basic algorithms can be tailored to application
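The utilization contrast can be illustrated with a deliberately idealized model. Everything here is an assumption made for illustration: the unit mix, the workload phases, the per-change reconfiguration cost, and the premise that every reconfigurable unit does useful work during a phase.

```python
def fixed_utilization(units, phases):
    """Fraction of unit-cycles doing useful work on a fixed-function processor.

    units  -- dict mapping unit type to number of units (hypothetical mix)
    phases -- list of (cycles, unit type needed) tuples
    """
    total_units = sum(units.values())
    total_cycles = sum(cycles for cycles, _ in phases)
    # Only units of the needed type are busy; all others sit idle.
    busy = sum(cycles * units.get(kind, 0) for cycles, kind in phases)
    return busy / (total_units * total_cycles)


def reconfigurable_utilization(phases, reconfig_cycles):
    """Same metric when every unit is retargeted to the current phase.

    Idealized: all units are useful during a phase, and all are stalled
    for reconfig_cycles whenever the required operation changes.
    """
    work = sum(cycles for cycles, _ in phases)
    changes = sum(1 for a, b in zip(phases, phases[1:]) if a[1] != b[1])
    return work / (work + changes * reconfig_cycles)


units = {"add": 2, "mul": 1, "shift": 1}
phases = [(1000, "add"), (1000, "mul"), (1000, "add")]
print(fixed_utilization(units, phases))
print(reconfigurable_utilization(phases, reconfig_cycles=100))
```

The model ignores that retargeted units could also shorten each phase; it only shows how idle fixed-function units depress utilization while reconfiguration trades that idle time for a reconfiguration stall.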

















- FPGAs can support multiple memory ports
- FPGAs outperform DSPs:
  - → Parallelism in the algorithm
  - → Simple operations in a fixed sequence
  - → FPGAs provide greater computational density using less power
  - → Large data sets, low resolution (8 to 12 bits)
  - → Simple control
- DSPs outperform FPGAs:
  - → MAC operations
  - → Complex arithmetic
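As a concrete reference point for the terms above, here is a minimal sketch of a low-resolution FIR filter written as a sequence of multiply-accumulate (MAC) steps. The function name `fir_fixed_point` and the saturating input stage are illustrative assumptions. The inner loop is how a sequential MAC unit would walk the taps; because the per-tap products are independent, a spatial implementation could in principle compute them side by side.

```python
def fir_fixed_point(samples, taps, bits=8):
    """Direct-form FIR over integers, with inputs saturated to `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    x = [max(lo, min(hi, s)) for s in samples]  # saturate to signed range

    out = []
    for n in range(len(x)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]  # one multiply-accumulate (MAC) step
        out.append(acc)
    return out


print(fir_fixed_point([1, 2, 3, 4], [1, 1]))  # [1, 3, 5, 7]
```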





# MPRG











- Colt prototype
- HP 0.5 um, 3 metal, **PGA-132** (MOSIS)
- 16 FUs, XBar, DPs
- 5.5 mm x 6.1 mm
- 50 MHz
- Full-scale device: Stallion









# 2nd Generation Processor: The Stallion

- Successor of the Colt chip
- Six data ports achieving basic pipelined dataflow control
- Smart crossbar that passes programming and data words to and from the data ports and meshes
- Two IFU meshes and 4 multipliers
- Ready for fabrication



# **The Stallion Organization**





# **Example Sub-Mesh Mapping**

[Figure: factorial mapped onto a sub-mesh. Recoverable labels: left and right ports, load, delay, pass, select, valid, overflow, result.]

- 4x4 sub-matrix of IFUs
- Factorial computation
- Demonstrates conditional execution capabilities
- Configured in < 30 usec
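To illustrate what conditional execution means for a factorial dataflow, here is a minimal software sketch. It is not the actual sub-mesh mapping: the function names, the use of a two-input select in place of a branch, and the fixed step count are all assumptions chosen to mimic hardware that evaluates every step and only chooses which value to keep.

```python
def select(cond, if_true, if_false):
    """Model of a two-input multiplexer (hypothetical building block)."""
    return if_true if cond else if_false


def factorial_select(n, steps):
    """Compute n! with a fixed number of steps and no data-dependent branch.

    Returns (result, valid); valid is True only if `steps` was large enough
    for the counter to reach zero.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    counter, result = n, 1
    for _ in range(steps):
        done = counter == 0
        # Both candidate values exist each step; the select picks one.
        result = select(done, result, result * counter)
        counter = select(done, counter, counter - 1)
    return result, counter == 0


print(factorial_select(5, steps=8))  # (120, True)
```

Once the counter reaches zero the selects simply hold the result, which is why extra steps are harmless and too few steps are reported through the `valid` flag.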



#### **Features**

- Each slot contains a single port
- Clusters connected using a module to bridge adjacent slots
- Bridging extendible to other system boards
- System is inherently scalable











# Core Computing Component

- Xilinx FPGA (currently used in test-bed)
- Problem: pipeline processing is fast but not readily modified with current ASIC design practice
- Solution:
  - Colt chip (fabricated and tested)
    - → 0.8 um HP CMOS process fabricated by MOSIS
    - → Run time configurable
    - → 50 MHz clock
  - Stallion chip (designed but not yet fabricated)
    - → 0.5 um HP CMOS process
    - → 64 functional units in mesh
    - → Dedicated multiplier
    - → Six data ports
    - → 100 MHz clock

