

# **AI Acceleration**

Salil Raje, Executive Vice President, Software and IP Products









# Welcome to All Developers!

| Developers                         | Tools                                | Ecosystem (logos on slide)                        |
|------------------------------------|--------------------------------------|---------------------------------------------------|
| Data scientists                    | Frameworks: Python, APIs             | DeePhi (深鉴科技), Caffe, MXNet, FFmpeg, TensorFlow |
| SaaS developers                    | FaaS Platform                        | AWS, Huawei, Alibaba Cloud (Aliyun), XIEMIN       |
| Application developers             | SDX: C++, OpenCL, Libraries          | Linux, Xen                                        |
| Embedded developers                | Embedded Software: MPSoC             |                                                   |
| Hardware-aware software developers | HLS: C++ IP Functions                |                                                   |
| System integrators                 | IP Integrator: System Integration    |                                                   |
| Hardware developers                | Vivado Design Suite: RTL Full Design |                                                   |

# Training vs. Inference



Migrate trained model to inference hardware





# Inference Projected Growth



Barclays Research, Company Reports May 2018





# Inference Challenges



The rate of AI innovation



Performance at low latency



Low power consumption



Whole app acceleration



### The Rate of AI Model Innovation

**APPLICATIONS:** Classification, Object Detection, Segmentation, Speech Recognition Engine, Anomaly Detection

CNN; RNN, LSTM; MLP

Diverse models over a broad range of applications



#### The Rate of AI Model Innovation: Classification






Sources:

- https://arxiv.org/pdf/1605.07678.pdf
- https://arxiv.org/pdf/1608.06993.pdf
- https://arxiv.org/pdf/1709.01507.pdf
- https://arxiv.org/pdf/1611.05431.pdf



# **Network Complexity is Growing**

#### AlexNet



#### GoogLeNet



#### DenseNet







# Inference is Moving to Lower Precision





# Rate of Innovation Outpaces Silicon Cycles

#### AlexNet



#### GoogLeNet



#### DenseNet



#### Silicon lifecycle

# Only Adaptable Hardware Addresses Inference Challenges

Domain Specific Architectures (DSAs) on Adaptable Platforms

- Custom data flow
- Custom memory hierarchy
- Custom precision

# Xilinx Acquires DeePhi







# Example: DeePhi LSTM

Custom data flow

LSTM for speech recognition

Custom memory hierarchy

Sparse matrix implementation in memory

Custom precision

12-bit weights, 16-bit activations
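To make "sparse matrix implementation in memory" concrete, here is a minimal sketch of one common sparse layout, compressed sparse row (CSR), with a matrix-vector product over it. CSR is an assumption chosen for illustration; the slides do not say which format DeePhi uses, and the function names `to_csr` and `csr_matvec` are hypothetical.

```python
from typing import List, Tuple


def to_csr(dense: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense row-major matrix to CSR: (values, column indices, row pointers)."""
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                # Store only the non-zero weights and where they live.
                values.append(w)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r ends inside `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values: List[float], col_idx: List[int],
               row_ptr: List[int], x: List[float]) -> List[float]:
    """Multiply a CSR matrix by a dense vector, touching only stored non-zeros."""
    y: List[float] = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical pruned weight matrix: most entries are zero.
weights = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, -3.0],
]
values, col_idx, row_ptr = to_csr(weights)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
print(values, col_idx, row_ptr, y)
```

Only 3 of the 12 weights are stored and multiplied here, which is why a pruned network can cut both memory traffic and compute.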









# **Example:** xDNN

Custom data flow

Optimized for latest CNN

Custom memory hierarchy

Optimized on-chip memory

Custom precision

INT8
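As an illustration of what INT8 precision means for a set of weights, the following sketch applies symmetric per-tensor quantization. This is a generic textbook scheme, not the xDNN quantizer; `quantize_int8` and `dequantize` are hypothetical helper names, and the weight values are made up.

```python
from typing import List, Tuple


def quantize_int8(xs: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(x) for x in xs)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [1.27, -0.64, 0.0, 0.32]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each weight now fits in one byte, and the rounding error per weight is at most half of `scale`.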





#### Low Latency is Critical for Inference









High throughput **OR** low latency

High throughput AND low latency
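One way to see why throughput and latency often trade off is a toy batching model: an accelerator with a fixed per-batch overhead gets more throughput from larger batches, but the first request in a batch must wait for the others to arrive. The sketch below is an assumption made for illustration, not a measurement of any device; `batch_metrics` and all of its numbers are hypothetical.

```python
from typing import Tuple


def batch_metrics(batch_size: int, arrival_rate_hz: float,
                  fixed_overhead_s: float, per_item_s: float) -> Tuple[float, float]:
    """Return (peak throughput in items/s, worst-case latency in s) for one batch size."""
    # Time spent processing one batch: fixed cost plus a per-item cost.
    compute_s = fixed_overhead_s + batch_size * per_item_s
    # The first request in a batch waits for the remaining ones to arrive.
    fill_wait_s = (batch_size - 1) / arrival_rate_hz
    throughput = batch_size / compute_s
    worst_latency_s = fill_wait_s + compute_s
    return throughput, worst_latency_s


# Hypothetical workload: 100 requests/s, 4 ms fixed overhead, 1 ms per item.
for b in (1, 8, 64):
    tput, lat = batch_metrics(b, arrival_rate_hz=100.0,
                              fixed_overhead_s=0.004, per_item_s=0.001)
    print(f"batch={b:3d}  throughput={tput:7.1f} items/s  "
          f"worst-case latency={lat * 1000:6.1f} ms")
```

In this model, both numbers rise with batch size. Getting high throughput at batch size 1 requires a small `fixed_overhead_s`, which is the "AND" case on the slide.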



#### Low Latency: Xilinx's Unique Advantage





#### AI Inference Acceleration

Leveraging AI Engines

Majority of Adaptable & Scalar Engines available for Whole App Acceleration

- (1) Measured on EC2 Xeon Platinum 8124 Skylake, c5.18xlarge AWS instance, Intel Caffe: https://github.com/intel/caffe
- (2) V100 numbers taken from Nvidia Technical Overview, "Deep Learning Platform, Giant Leaps in Performance and Efficiency for AI Services"
- (3) Versal Core Series
- (4) GoogLeNet V1 throughput (Img/sec)




### Low-Latency CNN Inference Performance



Sources: Alveo - Published (INT8); Versal - Projected (INT8), 65% PL reserved for whole application; GPU 1 - P4 Published (INT8); GPU 2 - V100 Published (FP16/FP32); GPU 3 - T4 Projected




DeePhi Pruning Technology: 1.3x - 8x performance improvement, depending on the network




#### Power Is Critical for Inference Applications



16x perf/watt vs. GPU

# Whole Application Acceleration: Smart City / Security





# **Enabling the Development Community**



#### **IN SUMMARY**



Match the speed of AI innovation

Give the best performance at low latency

Give the best power results

Accelerate the whole application



#### Xilinx

Building the Adaptable, Intelligent World