
MLP to be further optimized, which in turn improved the total inferences per second even further. For that, a small price in latency was paid, since the data travels through two shells and makes more hops in the switch.

7.4 Analysis

CRUN enables all three acceleration scenarios: local, network, and distributed. In the datacenter, the FPGA can be easily managed through PCIe, and users can seamlessly access a local FPGA or multiple remote ones. Since its deployment aims to be distributed throughout the datacenter rather than concentrated in clusters, no extra FPGA resources are consumed to turn the board into a standalone device.

Even in its preliminary state, CRUN shows its potential. The machine learning trial presents a considerable performance gain, and CRUN is the only solution among those studied that meets the requirements.

It is important to note that the latency values in the CRUN trial were only possible due to the use of TRex, which leverages DPDK to bypass the Linux kernel network stack. Also, all trials were run on bare metal, which means that some performance degradation is expected when running the application in a VM. This degradation can be mitigated with SR-IOV, which is supported by most modern NICs.
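
For context, the kernel-bypass pattern that DPDK provides, and on which TRex builds, is sketched below in a minimal form. This illustrates the polling model only and is not TRex's actual code; the port index and the queue and pool sizes are placeholder values.

// Minimal DPDK receive-loop sketch: the NIC is polled directly from
// user space, so packets never traverse the Linux kernel stack.
#include <cstdlib>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

int main(int argc, char **argv) {
    if (rte_eal_init(argc, argv) < 0)        // bind to hugepages, PMDs, cores
        return EXIT_FAILURE;

    // Placeholder sizing; real deployments tune these per NIC and workload.
    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbufs", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        return EXIT_FAILURE;

    const uint16_t port = 0;                 // first DPDK-bound port
    struct rte_eth_conf conf = {};           // default port configuration
    rte_eth_dev_configure(port, 1, 1, &conf);
    rte_eth_rx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, 512, rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    struct rte_mbuf *bufs[32];
    for (;;) {
        // Busy-poll the RX queue: latency is bounded by this loop, not by
        // kernel interrupts, context switches, and socket buffers.
        const uint16_t n = rte_eth_rx_burst(port, 0, bufs, 32);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);       // a real app would process here
    }
}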

The CRUN latency values presented in Table 7.3 are averages; in fact, the maximum observed latency could reach around 290 µs, which must be taken into account when using the system. Jitter is also expected in the other implementations, but those values were not obtained.

DPDK and SR-IOV should also be leveraged when building the DMA path for local acceleration. With this, even better latency results are expected for both bare-metal and virtualized systems, since the data does not need to traverse the switches in the network. Still, local acceleration has somewhat limited scalability.
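
For illustration, SR-IOV virtual functions are typically exposed through the PF driver's sysfs interface. A minimal sketch follows; the interface name enp1s0f0 and the VF count are placeholders, and root privileges plus an SR-IOV-capable driver are assumed.

#include <fstream>
#include <string>

// Write a single integer to a sysfs attribute; each write uses a fresh
// open(), as the kernel expects a whole value per write.
static bool write_sysfs(const std::string &path, int value) {
    std::ofstream node(path);
    if (!node)
        return false;
    node << value;
    node.flush();
    return static_cast<bool>(node);
}

int main() {
    const std::string pf = "enp1s0f0";    // placeholder PF interface name
    const std::string attr = "/sys/class/net/" + pf + "/device/sriov_numvfs";
    write_sysfs(attr, 0);                 // the kernel requires a reset to 0
    return write_sysfs(attr, 4) ? 0 : 1;  // before allocating new VFs
}

Each VF can then be passed through to a VM, giving the guest near-native access to the NIC without traversing the hypervisor's virtual switch.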

7.4.1 Hardware

CRUN is also the most complex of the implementations used for comparison. The mere fact that the application uses HDL already adds a barrier for most developers. On the other hand, the use of standard interfaces such as AXI4-Stream and the abstraction of the I/O signals facilitate the development.

As an example, only a few modifications in the control logic were needed to adapt the hardware design between SDAccel™ and CRUN. Still, further work is needed to study how to provide a platform where the AHU can be easily integrated and developed with HLS support.
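
As an illustration of what such HLS support could look like, the sketch below shows a free-running AHU skeleton exposed purely through AXI4-Stream, in Xilinx HLS C++ style. The 512-bit data width, the function name, and the pass-through body are assumptions, not the actual CRUN interface.

// Sketch of an AHU skeleton with AXI4-Stream-only I/O in Xilinx HLS C++.
// The real AHU logic (e.g. the MLP) would replace the pass-through body.
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<512, 0, 0, 0> axis_word;   // data plus keep/last sideband

void ahu_top(hls::stream<axis_word> &in, hls::stream<axis_word> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return  // free-running, no control bus

    axis_word w;
    do {
#pragma HLS PIPELINE II=1
        w = in.read();
        // Accelerator logic would transform w.data here; the shell's network
        // components handle framing, so only the payload is seen.
        out.write(w);
    } while (!w.last);                     // one packet per invocation
}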

When analyzing the network connectivity, the importance of the P4 components built with the SDNet™ packet processor is clear. On one hand, they can be easily developed, modified, and integrated with the set of tools provided by Xilinx®. On the other hand, they are also the components with the largest effect on the area and latency of the shell. Currently, the only other option that seems to provide similar functionality is the P4FPGA project [60], but it has not been investigated.

Another option would be to develop similar functionality using HDL or HLS. Even though the performance and area of the final design could be improved by optimization, it is a very challenging, time-consuming, and error-prone task that would most probably not result in such a complete tool. Thus, SDNet™ seems to be the only design option that is easy, powerful, and flexible enough for the functionality required here. Yet, it is a vendor-specific and proprietary tool, limited to Xilinx's FPGAs only.

7.4.2 Software

In any cloud environment, software support is needed to allow automated management of the whole datacenter's infrastructure, especially with HWA support. From the related work presented in Chapter 4, one can point out that most of the efforts leverage a modified version of OpenStack [58]. In fact, OpenStack is a strong candidate to fill this position in NFV clouds [87], at least among open source solutions.

HWA support in OpenStack is being developed by the Cyborg project [59], which aims to provide a management framework for various types of accelerator resources, such as FPGAs, GPUs, and ASICs. Official OpenStack releases are still not mature in this area, and Cyborg requires vendors to deliver their own Cyborg driver before an HWA can be deployed. Specifically for FPGAs, unfortunately, no such vendor support is available yet. Furthermore, in the first phase, it is expected that only PCIe connectivity will be provided.

Thus, the in-house development of a management software called BRO was proposed. The motivation is to avoid the steep learning curve of modifying OpenStack and to obtain simple and quick VIM-like functionality for proof-of-concept purposes only.

The required functionality could be mapped to the VIM component in MANO's architecture, but one can point out that MAPPER is a rather complex component and goes beyond the VIM responsibilities of the MANO architecture. Again, the division in MANO's architecture can be fuzzy, but MAPPER is better compared with the NFVO, providing service-level management, while DEPLOYER and CONNECTOR provide the automation of the infrastructure.
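
As a rough illustration of this division, the hypothetical sketch below casts MAPPER as the service-level orchestrator and DEPLOYER/CONNECTOR as the infrastructure automation. All types and signatures are illustrative only, not BRO's actual interfaces.

#include <string>
#include <vector>

// Hypothetical service request: which bitstream to load and how many AHUs.
struct AhuRequest { std::string bitstream; int count; };

class Deployer {             // VIM-like: program shells/AHUs onto FPGAs
public:
    virtual std::vector<std::string> deploy(const AhuRequest &r) = 0;
    virtual ~Deployer() = default;
};

class Connector {            // VIM-like: set up network paths to the AHUs
public:
    virtual void connect(const std::string &ahu_id,
                         const std::string &client_addr) = 0;
    virtual ~Connector() = default;
};

class Mapper {               // NFVO-like: map a service onto resources
public:
    Mapper(Deployer &d, Connector &c) : deployer_(d), connector_(c) {}
    void provision(const AhuRequest &r, const std::string &client) {
        for (const auto &id : deployer_.deploy(r))  // place the AHUs
            connector_.connect(id, client);         // then wire them up
    }
private:
    Deployer &deployer_;
    Connector &connector_;
};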