Fabio Garzia

From Run-Time Reconfigurable Coarse-Grain Arrays to Application-Specific Accelerator Design

Thesis for the degree of Doctor of Technology to be presented with due permission for public examination and criticism in Tietotalo Building, Auditorium TB104, at Tampere University of Technology, on the 4th of December 2009, at 12 noon.

Tampereen teknillinen yliopisto - Tampere University of Technology Tampere 2009


ISBN 978-952-15-2280-2 (printed) ISBN 978-952-15-2286-4 (PDF) ISSN 1459-2045


adoption of mechanisms to reduce the external overheads gives improvements of 10× on average. The logic to reduce the communication overhead occupies 0.6% of the total area but gives a speed-up of 15× for the transfer speed and a 3× reduction of the overall cost of the transfers. The new reconfiguration infrastructure gives an 8% improvement in the maximum working frequency. Dynamic reconfiguration allows the cost of reconfiguration to be hidden behind the CREMA processing activity. The same consideration applies to the control operations.

In addition, the author considers the internal causes of performance degradation and proposes a new model that can be easily adapted to a chosen application. For this purpose the author presents a template called CREMA, which can be tailored to the application requirements but keeps the possibility to share its internal resources using run-time reconfiguration. This new model can be used as a method to realize application-specific accelerators. The new design presents an application-specific accelerator that is 3×–4.5× smaller than the previous general-purpose device and 1.5×–5× faster. Mapping of SDR kernels shows figures that approach the real-time specifications.


this Thesis. I wish to thank PhD. Claudio Brunelli, who guided me in my first steps as a young research scientist, and PhD. Tapani Ahonen, who always gave me precious suggestions and friendly support. I also acknowledge PhD. Fabio Campi and PhD. Claudio Mucci from Bologna, who gave me useful feedback whenever I asked. I cannot forget that without Fabio I would not have ended up in Tampere in the first place.

I would like to acknowledge the Reviewers, Prof. Jorma Skyttä and Prof. Steve Wilton, who provided valuable and insightful comments to improve this Thesis. I also wish to thank MSc. Stephen Burgess for the time he dedicated to improving the readability of this work.

This Thesis would not have been possible without the contribution of several undergraduate students, who alternated in our research group during the last years. In particular, I would like to thank Federico Cinelli, Andrea Ferro, Carmelo Giliberto, Andreas Hubmer, Waqar Hussain, Martin Knecht, Juha Kylliäinen, David Markvica, Riccardo Mastria, Markus Moisio, and Davide Rossi. A special acknowledgment goes to Roberto Airoldi, who is currently exploiting this work as part of his PhD research.

Finally, I cannot forget the unbounded support of the Garzia family and Heini, whose confidence always gave me the motivation to push ahead with my work.

The work was financially supported by the Graduate School in Electronics, Telecommunications and Automation (GETA) and Tampere University of Technology. Research grants were received from the Tuula ja Yrjö Neuvo Foundation, the HPY:n Tutkimussäätiö and the Nokia Foundation, which are gratefully acknowledged.

Tampere, November 2009 Fabio Garzia


List of Publications . . . ix

List of Figures . . . xi

List of Tables . . . xv

List of Abbreviations . . . xvii

1. Introduction . . . 1

1.1 Objective and Scope of Research . . . 1

1.2 Main Contributions . . . 2

1.2.1 Author’s contribution . . . 3

1.3 Thesis Outline . . . 4

2. Run-time Reconfigurability: a Survey . . . 5

2.1 Run-time Reconfigurable Devices . . . 6

2.2 Fine-grain Architectures . . . 8

2.2.1 GARP . . . 8

2.2.2 Chess Array . . . 10

2.2.3 XiRISC . . . 11

2.2.4 MOLEN . . . 11

2.2.5 Embedded FPGAs . . . 13


2.3 Coarse-grain Architectures . . . 14

2.3.1 Kress Array . . . 14

2.3.2 DREAM . . . 16

2.3.3 Morphosys . . . 16

2.3.4 PACT XPP . . . 19

2.3.5 ADRES . . . 19

2.3.6 BUTTER . . . 22

2.3.7 Montium . . . 22

2.4 Multi-core Systems . . . 24

2.4.1 REMARC . . . 24

2.4.2 RAW . . . 25

2.4.3 Picoarray . . . 27

2.4.4 RAA . . . 27

2.5 Hybrid Approaches . . . 28

2.6 Dynamic Partial Reconfiguration . . . 29

2.7 Concluding Remarks . . . 32

3. Accelerating Applications: the Example of BUTTER . . . 33

3.1 A Platform for Application Development . . . 33

3.1.1 COFFEE . . . 33

3.1.2 MILK . . . 34

3.1.3 Building the platform . . . 34

3.2 Programming BUTTER . . . 37

3.3 3D Graphics . . . 39

3.3.1 Mapping of the 4x4 matrix-vector multiplication . . . 41

3.3.2 Result evaluation . . . 44


3.5.1 W-CDMA acceleration: correlation points . . . 52

3.5.2 FFT . . . 53

3.6 Satellite Navigation . . . 56

3.6.1 Tracking channel for a GPS device . . . 57

3.7 Concluding Remarks . . . 59

4. Limits of run-time reconfigurability . . . 61

4.1 Communication Overhead . . . 61

4.1.1 DMA transfers . . . 63

4.1.2 Pipelined switched interconnections . . . 64

4.1.3 Ping-pong memories . . . 67

4.1.4 Results evaluation . . . 70

4.2 Reconfiguration Overhead . . . 70

4.2.1 Fast dynamic reconfiguration . . . 71

4.3 Control Overhead . . . 74

4.3.1 Register chain . . . 75

4.4 Concluding Remarks . . . 76

5. CREMA: Coarse grain REconfigurable array with Mapping Adaptiveness . . . 77

5.1 CREMA . . . 78

5.1.1 Routing parameters . . . 79

5.1.2 PE core parameters . . . 81

5.1.3 Reconfiguration memory parameters . . . 84


5.1.4 Design example: the 4x4 matrix-vector multiplication . . . 84

5.2 CREMA Peripherals . . . 88

5.2.1 CREMA control unit . . . 88

5.2.2 Infrastructure for the run-time reconfiguration . . . 91

5.3 Case Study: Realizing a Platform for SDR . . . 91

5.3.1 Mapping kernels for synchronization purposes . . . 91

5.3.2 Calculation of maximum value inside the array . . . 93

5.3.3 Mapping FFT for OFDM-based protocols . . . 97

5.3.4 Putting it all together: a platform for SDR . . . 97

5.4 Concluding Remarks . . . 99

6. Application-Specific Accelerator Design . . . 101

6.1 CREMA Configuration Flow . . . 101

6.1.1 Text description . . . 103

6.1.2 FIRETOOL . . . 106

6.1.3 Simulation and Synthesis . . . 109

6.2 Design of Application-Specific Accelerators . . . 110

6.2.1 High-level synthesis . . . 111

6.2.2 NISC . . . 111

6.2.3 Dealing with FPGA . . . 112

6.3 Concluding Remarks . . . 114

7. Conclusions . . . 115

7.1 Main Results . . . 115

7.2 Future Development . . . 116

Bibliography . . . 117


[P1] Garzia, F.; Brunelli, C.; Kylliäinen, J.; Moisio, M. and Nurmi, J. “Design of a C library for the implementation of 3D graphics applications on a SoC”, Proceedings of the 2006 International Conference on IP Based SoC Design (IP-SOC ’06), Design and Reuse S.A., 2006, 371-374

[P2] Garzia, F.; Brunelli, C.; Ferro, A. and Nurmi, J. “Implementation of a 2D low-pass image filtering algorithm on a reconfigurable device”, Proceedings of the 3rd International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC ’07), Univ. Montpellier II, 2007, 166-170

[P3] Garzia, F.; Brunelli, C.; Nieminen, L.; Mastria, R. and Nurmi, J. “Implementation of a tracking channel of a GPS receiver on a reconfigurable machine”, Proceedings of the International Conference of Computer as a Tool (EUROCON ’07), IEEE, 2007, 875-881

[P4] Garzia, F.; Brunelli, C.; Rossi, D. and Nurmi, J. “Implementation of a floating-point matrix-vector multiplication on a reconfigurable architecture”, Proceedings of the 15th Reconfigurable Architecture Workshop (RAW ’08), IEEE, 2008, 1-6

[P5] Garzia, F.; Brunelli, C. and Nurmi, J. “A Pipelined Infrastructure for the Distribution of the Configuration Bitstream in a Coarse-grain Reconfigurable Array”, Proceedings of the 4th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC ’08), Univ. Montpellier II, 2008, 188-191


[P6] Garzia, F.; Brunelli, C.; Giliberto, C.; Airoldi, R. and Nurmi, J. “Implementation of W-CDMA Slot Synchronization on a Reconfigurable System-on-Chip”, Proceedings of the 2008 International Symposium on System-on-Chip (SOC ’08), IEEE, 2008

[P7] Garzia, F.; Brunelli, C.; Giliberto, C. and Nurmi, J. “Implementation of W-CDMA Cell Search on a Runtime Reconfigurable Coarse-grain Array”, Proceedings of the International Conference of Computer as a Tool (EUROCON ’09), IEEE, 2009

[P8] Garzia, F.; Ahonen, T. and Nurmi, J. “A Switched Interconnection Infrastructure to Tightly-couple a RISC Processor Core with a Coarse Grain Reconfigurable Array”, Proceedings of the 5th International Conference on Ph.D. Research in Microelectronics and Electronics (PRIME ’09), IEEE, 2009, 16-19

[P9] Garzia, F.; Hussain, W. and Nurmi, J. “CREMA: A Coarse-grain REconfigurable Array with Mapping Adaptiveness”, Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL 2009), IEEE, 2009, 708-712

[P10] Garzia, F.; Airoldi, R. and Nurmi, J. “Mapping of the FFT on a Reconfigurable Architecture targeted to SDR Applications”, Proceedings of the International Symposium on System-on-Chip (SOC 2009), 2009, 157-160

[P11] Garzia, F.; Airoldi, R.; Ahonen, T.; Milojevic, D. and Nurmi, J. “Implementation of the W-CDMA Cell Search on a MPSoC designed for Software Defined Radios”, Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS 2009), 2009, 30-35


3 Pipelined configurable gate array (pGA) (Copyright © 2003 IEEE [14]). . . 12

4 The MOLEN architecture (Copyright © 2004 IEEE [67]). . . 12

5 Reconfigurable platform architecture customized with 3 eFPGAs and an 8-port NoC (Copyright © 2005 IEEE [37]). . . 13

6 KressArray-III structure with hierarchical global interconnections (Copyright © 1998 SPIE [28]). . . 15

7 Hardware structure of the Dynamically REconfigurable Architecture for Mobile systems (Copyright © 2000 Springer-Verlag [8]). . . 17

8 MorphoSys 8 x 8 RC array with 2D mesh and complete quadrant connectivity (Copyright © 2000 IEEE [56]). . . 18

9 XPP architecture (Copyright © 2003 KLUWER Academic Publishers [7]). . . 20

10 ADRES core (Copyright © 2003 Springer-Verlag Berlin Heidelberg [41]). . . 21

11 BUTTER PE (Copyright © 2007 University of Montpellier [P2]). . . 21

12 BUTTER interconnections (Copyright © 2007 IEEE [P3]). . . 23

13 Montium TP (Copyright © 2007 Hindawi [57]). . . 24

14 REMARC architecture (Copyright © 1998 ACM [43]). . . 25


15 RAW architecture (Copyright © 2002 IEEE [62]). . . 26

16 PicoArray concept (Copyright © 2009 picoChip [47]). . . 27

17 MORPHEUS architecture (Copyright © 2007 IEEE [63]). . . 28

18 The CHAMELEON template (Copyright © 2008 IEEE [53]). . . 29

19 Design layout with two reconfigurable modules (Copyright © 2002 Xilinx [38]). . . 30

20 Caronte architecture overview (Copyright © 2005 IEEE [21]). . . 31

21 A platform for application development. . . 36

22 State diagram of the 3D graphics pipeline. . . 40

23 Context used for the mapping of the 4x4 matrix-vector multiplication (Copyright © 2008 IEEE [P4]). . . 42

24 Organization of input and output vertices in the local memories. . . 43

25 2D filtering algorithm: each pixel is processed taking into account its neighbors (Copyright © 2007 RECOSOC [P2]). . . 45

26 Context used for the mapping of the low-pass image filter. . . 46

27 Partition of the image in slices (Copyright © 2007 RECOSOC [P2]). . . 47

28 State diagram of the H.264 decoder. . . 48

29 Second context used in the H.264 decoder implementation. . . 49

30 Context used for the mapping of the correlations (Copyright © 2008 IEEE [P6]). . . 53

31 Organization of the results in the output memory (Copyright © 2008 IEEE [P6]). . . 54

32 Context used for the mapping of the FFT (Copyright © 2009 IEEE [P10]). . . 55

33 Block diagram of the tracking channel (Copyright © 2007 IEEE [P3]). . . 58

34 Context used for the mapping of the GPS kernel. . . 58

35 Transfer patterns (Copyright © 2009 IEEE [P8]). . . 63


38 Double buffering in the data flow. The DMA controller is loading block 3 while storing block 1. The core processor is processing block 2 (Copyright © 2001 IEEE [32]). . . 68

39 Ping-pong memories, DMA, and control interface of BUTTER. . . 68

40 Local logic for the pipelined configuration (Copyright © 2008 RECOSOC [P5]). . . 73

41 Flow of the configuration words through the array (Copyright © 2008 RECOSOC [P5]). . . 73

42 Register chain for quick BUTTER activation. . . 75

43 Design blocks defined at design-time (Copyright © 2009 IEEE [P9]). . . 79

44 Interconnections supported in the template: (a) local (NN), (b) loop, (c) interleaved, (d) global. . . 80

45 Different setup for the input multiplexers. . . 81

46 PE core template. . . 82

47 Example of different PE architectures. . . 83

48 Column/row data swapping of four values using variable-length delay chains. . . 89

49 FFT data reordering using delay chains. . . 90

50 Context for the calculation of correlations. . . 92

51 Two columns of the context for the square modulus. . . 93

52 Implementation of a selection mechanism. . . 94

53 Two columns of the context for the search of the maximum. . . 96

54 CREMA Configuration Flow. . . 102


55 Graphical interface for the definition of a context. . . 104

56 RISC versus NISC architecture. . . 113


3 Performance comparison in clock cycles between different image filter implementations (Copyright © Springer-Verlag 2008 [13]). . . 47

4 Performance comparison between SW and reconfigurable implementation (Copyright © 2009 IEEE [P7]). . . 54

5 Speed-up in the applications under analysis. . . 59

6 Synthesis results of CAPPUCCINO, BUTTER and peripherals on an Altera EP2S180 FPGA device. . . 66

7 Comparison of performance between BUTTER, DMA and memories (Device: Altera EP2S180F). . . 66

8 Parameters related with the two case studies, case 1 referring to integer implementation, case 2 integer plus floating-point. . . 85

9 Comparison of synthesis results between the two versions of CREMA (Device: Altera EP2S180F). . . 86

10 Comparison of MIPS for the test case between BUTTER and the two versions of CREMA. . . 87

11 Comparison of FPGA synthesis results between BUTTER and CREMA tailored to SDR (Device: Altera EP2S180F). . . 98


CGRA Coarse-Grain Reconfigurable Array
DMA Direct Memory Access
DPR Dynamic Partial Reconfiguration
DSP Digital Signal Processor
ESL Electronic System Level
FFT Fast Fourier Transform
FPGA Field-Programmable Gate Array
FSM Finite State Machine
GNSS Global Navigation Satellite Systems
GPP General Purpose Processor
GPS Global Positioning System
GUI Graphical User Interface
HDL Hardware Description Language
HW Hardware

IC Integrated Circuit


IP Intellectual Property
JTAG Joint Test Action Group
MIMD Multiple-Instruction Multiple-Data
MIPS Million Instructions Per Second
NEWS North-East-West-South
NISC No Instruction Set Computer
NN Nearest Neighbour
NOC Network-on-Chip
OFDM Orthogonal Frequency Division Multiplexing
OS Operating System
PE Processing Element
PLD Programmable Logic Device
RC Reconfigurable Computing
RISC Reduced Instruction Set Computer
RLC Reconfigurable Logic Cell
SDR Software Defined Radio
SIMD Single-Instruction Multiple-Data
SOC System-on-Chip
VHDL VHSIC Hardware Description Language
VHSIC Very-High-Speed Integrated Circuits
VLIW Very Long Instruction Word
WCDMA Wideband Code Division Multiple Access


ations in parallel and often in pipeline. The development time is reduced through the coupling of such accelerators with general-purpose processing units that are programmable using software languages.

The introduction of run-time reconfigurability has enhanced the processing capabilities of reconfigurable devices. The accelerator can benefit from resource sharing, since the same PEs can be used to perform different operations at different times.

Several research groups have focused in recent years on the development of different kinds of fine- or coarse-grain reconfigurable devices, loosely or tightly coupled with a General Purpose Processor (GPP), integrated into its pipeline or acting as a coprocessor. However, only a few of them have been commercially exploited.

There are several reasons. First of all, the achievable performance is usually degraded by different factors connected with the management of dynamic reconfiguration and the non-exploitable bandwidth of reconfigurable arrays. Another weak point is that these devices are characterized by large resource utilization and complexity, mainly due to the flexibility they provide. However, when the flexibility is not fully exploited, the added cost becomes difficult to justify.

1.1 Objective and Scope of Research

This research aims to increase the performance obtainable by a run-time reconfigurable device. The first part of the work focuses on the implementation of different


applications on a specific coarse-grain reconfigurable array. From this analysis the author identifies the main causes of the performance degradation. These can be divided into two groups:

• external causes: timing overheads due to data transfers, reconfiguration and control methods;

• intrinsic causes: the reduced speed of the device due to its high complexity.

At this point the research focuses on two different aspects. The first is how to reduce the impact of the external causes by lowering the overheads due to communication, dynamic reconfiguration and control mechanisms. The author then analyzes the reasons for the increased complexity and searches for a solution.

1.2 Main Contributions

To summarize, the main contributions are the following:

• realization of a VHDL model of a platform based on a reconfigurable device called BUTTER and a GPP called CAPPUCCINO;

• mapping of different applications on the platform and analysis of the results;

• definition of the main external causes for the performance degradation in the re- configurable platform: communication, reconfiguration and control overheads;

• introduction of a pipelined communication infrastructure to increase the bandwidth of data transfers;

• adoption of ping-pong memories to limit the number of required data transfers;

• use of a mechanism to enable fast dynamic reconfiguration;

• implementation of a control mechanism to reduce the control overhead;

• design of a reconfigurable architecture adaptable to mapped applications;

• realization of a methodology that starts from an algorithm written in C and generates a VHDL model of a system based on it.
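The ping-pong memory contribution above relies on the classic double-buffering argument: while the accelerator processes one buffer, the DMA fills the other, so in steady state only the longer of the two activities is paid for. The following behavioral sketch in Python illustrates the idea with hypothetical cycle counts; it is not the thesis's VHDL implementation.

```python
# Behavioral sketch of ping-pong (double) buffering: with a single buffer,
# every data block is first transferred and then processed; with two buffers,
# the DMA fills one buffer while the accelerator processes the other, so the
# transfer time is hidden behind processing.

def run(blocks, dma_cycles, proc_cycles, ping_pong):
    """Return the total cycles needed to transfer and process all blocks."""
    if not ping_pong:
        # Single buffer: transfer and processing are strictly sequential.
        return len(blocks) * (dma_cycles + proc_cycles)
    # Double buffer: after the first fill, each block costs only
    # max(transfer, processing) cycles, since the two activities overlap.
    total = dma_cycles  # fill the first buffer
    for _ in blocks:
        total += max(dma_cycles, proc_cycles)
    return total

blocks = list(range(8))
sequential = run(blocks, dma_cycles=100, proc_cycles=150, ping_pong=False)
overlapped = run(blocks, dma_cycles=100, proc_cycles=150, ping_pong=True)
print(sequential, overlapped)  # 2000 vs 1300: transfers hidden behind processing
```

With these illustrative numbers the transfer cost all but disappears from the total, which is the effect the ping-pong memories are introduced to obtain.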


libraries to use with the platform. The author tested an application based on the library on the FPGA prototype.

The author contributed to the work in [P2] and [P3] by integrating the BUTTER device into the above-mentioned platform. The author also supervised the mapping described in the two papers and was responsible for the application prototyping on the platform.

The author introduced the ping-pong memory interface in the platform [P4]. The author also took care of the implementation of the matrix-vector multiplication kernel on the platform.

The author carried out all the work in [P5] realizing a new configuration interface for BUTTER.

The author contributed to the mapping of SDR applications on the platform as de- scribed in [P6] and later in [P7].

The author carried out all of the work in [P8], inserting a switched pipelined interface in the platform.

The author designed the CREMA template described in [P9]. The author was responsible for the parameterized reconfiguration infrastructure and the management of the variable-length configuration words. He also took care of the porting of the matrix-vector multiplication on CREMA. The graphical tool used to generate CREMA configurations was designed by the author.

The author contributed to the implementation of the FFT on BUTTER [P10]. The author was also responsible for the mapping on the platform based on the switched pipelined interconnections.

The author took care of the FPGA implementation of the multi-core system described in [P11]. He also contributed to the application mapping. The author intends to integrate CREMA into one or more clusters of the multi-processor system.


1.3 Thesis Outline

The thesis is organized as follows. In Chapter 2 the author presents a survey of devices featuring run-time reconfigurability. In Chapter 3 the author analyzes the mapping of several applications on a reconfigurable array called BUTTER. In Chapter 4 the author considers the main issues related to the previous mappings and introduces some solutions. In Chapter 5 the author proposes a new design based on an application-specific approach. Finally, in Chapter 6 the new design is evaluated as a method to generate application-specific accelerators.


hardware solutions, even though the increased flexibility of such devices always has a cost in terms of speed, resource utilization, and power. The cost can be quantified in roughly one order of magnitude if compared with Application-Specific Integrated Circuit (ASIC) implementations, and therefore two orders of magnitude if compared with full custom approaches [16].

The history of digital design has seen numerous solutions proposed for reconfigurability, the most popular being FPGAs (Field-Programmable Gate Arrays). In small production quantities, an FPGA is nowadays considered a viable alternative to manufacturing a dedicated hardware circuit [66]. ASIC production costs have grown so large that only a few companies can afford to implement their designs on such a chip. In addition, FPGAs are fast to program and guarantee enough performance for many embedded applications. Nonetheless, they are still a step behind ASICs in that they are too power hungry to be widely used in mobile systems [5], or not fast and reliable enough for hard real-time embedded applications. Nevertheless, in most cases they are a good solution for prototyping and the realization of test boards.

Classical (static) reconfigurability is convenient because manufacturing a chip that can be used for different purposes means increased aggregate demand for the product and therefore reduced cost per piece. On the other hand, the FPGA is programmed once to be used for a fixed function. Occasionally the system can be updated for bug fixing or feature addition.

During the last decade reconfigurability has reached a new dimension: from static it has evolved to dynamic. Dynamic reconfiguration has a unique advantage in that


some hardware circuit can be reused for different purposes inside the same application. Think about a system where there is a single hardware accelerator that can serve as a video decoder, Global Positioning System (GPS) receiver, or graphics engine.

Assuming that the user will run only one of the supported applications at a time, the resource utilization is largely reduced. However, in this case the flexibility introduced has a drawback (usually limited performance or increased power consumption).

In the next sections we will focus on the general characteristics of such run-time reconfigurable devices, laying out a general scheme for classification, and then presenting several examples from academia and industry. Finally, some concluding remarks are given.

2.1 Run-time Reconfigurable Devices

A reconfigurable device is a piece of hardware whose functionality is not fixed at manufacturing time, but in the field. Therefore the functionality is specified only before use and it can be changed occasionally. The idea is that the device implements a fixed role in the system where it is placed, but it is possible to reconfigure it to fix bugs or to upgrade its features.

In contrast, it is possible to change the functionality of a run-time reconfigurable device during use. The idea is that the device can change its role according to the current task of the system where it is placed. For example, a run-time reconfigurable device in a mobile phone system could act as an audio decoder during a phone call, and as a video decoder during a video call.

In the embedded systems domain, a typical example of run-time reconfiguration is a system with a general-purpose processor (GPP) and a reconfigurable accelerator (RA) acting as coprocessor. In such a system the GPP can be used either to control only the accelerator operations (loosely-coupled model), or can be an integral part of the processing (tightly-coupled model). In general we can assume that a reconfigurable device is always associated with some “static” hardware that controls the run-time operations (reconfiguration and processing).
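The loosely-coupled model just described can be made concrete with a small conceptual sketch: the GPP only programs the accelerator's control interface (select a configuration, start it, check for completion) while the accelerator does the processing. All register names and the square-kernel configuration below are hypothetical illustrations, not the interface of any device discussed in this thesis; a real system would use memory-mapped registers and poll or take an interrupt.

```python
# Conceptual sketch of a loosely-coupled GPP + reconfigurable accelerator.
# CONFIG selects the loaded context, START triggers processing, DONE signals
# completion. Register names and contexts are hypothetical.

class ReconfigurableAccelerator:
    def __init__(self):
        self.regs = {"CONFIG": 0, "START": 0, "DONE": 0}
        # Loaded configuration contexts: id -> behavior of the fabric.
        self.configs = {1: lambda d: [x * x for x in d]}

    def write(self, reg, value):
        self.regs[reg] = value

    def process(self, data):
        # Behavior depends entirely on the selected configuration context.
        result = self.configs[self.regs["CONFIG"]](data)
        self.regs["DONE"] = 1
        return result

def gpp_control(acc, data):
    """The GPP's role in the loosely-coupled model: configure, start, wait."""
    acc.write("CONFIG", 1)   # reconfiguration step: select a context
    acc.write("START", 1)    # kick off the accelerator
    out = acc.process(data)  # in hardware the GPP would poll DONE instead
    return out

print(gpp_control(ReconfigurableAccelerator(), [1, 2, 3]))  # [1, 4, 9]
```

In the tightly-coupled model the same control actions would instead be dedicated instructions inside the GPP's own pipeline, as GARP and XiRISC do later in this chapter.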

An important characteristic of run-time reconfigurable hardware is the time needed for reconfiguration. If it has to be done at run-time, the reconfiguration time should be sufficiently small in order not to degrade the overall performance. It is difficult to


there are exceptions.

Xilinx introduced in the late 90s an FPGA with the possibility to reconfigure only part of the chip. This feature, called Dynamic Partial Reconfiguration (DPR), motivated research on different reconfigurable architectures uniquely based on FPGA chips.

We will talk about DPR in the last section of this chapter.

Another possibility is to insert a small FPGA into a heterogeneous system, where it can serve as a reconfigurable accelerator. Thus the first class of run-time reconfigurable architectures we analyze is characterized by FPGA-like reconfigurable accelerators. These accelerators are made of several processing elements organized in a one-dimensional or two-dimensional array with a granularity of 2, 4, or 6 bits. They can be classified as fine-grain architectures.

Finally some FPGAs embed dedicated logic that can serve as fixed hardware. For example, Xilinx provides FPGA chips with embedded PowerPC processors. This feature has been used in some fine-grain architectures.

An alternative to such solutions are the coarse-grain architectures. They present a similar global structure, but a granularity of 8, 16 or 32 bits. In this case the PEs are based on ALU-like functional units, where the functionality can be defined using a bitcode. In addition, the interconnections between PEs can also be reconfigured, typically via multiplexing logic. The main idea behind such devices is that any kind of functionality can be mapped using a combination of different PEs, and during the processing each PE implements only a single and fixed operation.
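The coarse-grain model described above can be sketched in a few lines of Python: each PE applies one ALU-like operation selected by a configuration bitcode, and multiplexer-style routing selects which value each PE reads. This is an illustrative toy (the opcode table, routing scheme, and the example context are invented for the sketch), not a model of any specific device in this survey.

```python
# Illustrative model of a coarse-grain reconfigurable column: each PE's
# function is chosen by a configuration bitcode (an opcode here), and a
# multiplexer select picks which operand the PE reads. During processing
# each PE implements a single, fixed operation.

OPS = {0: lambda a, b: a + b,   # opcode 0: addition
       1: lambda a, b: a - b,   # opcode 1: subtraction
       2: lambda a, b: a * b}   # opcode 2: multiplication

def run_column(pe_configs, inputs):
    """Feed 'inputs' through a column of PEs. Each PE config is a pair
    (opcode, mux_select): the multiplexer routes inputs[mux_select] as one
    operand, the other operand being the value flowing down the column."""
    values = list(inputs)
    for opcode, mux in pe_configs:
        values[0] = OPS[opcode](values[0], values[mux])
    return values[0]

# A "context" mapping the expression (x + y) * z onto two PEs:
context = [(0, 1),   # PE 1: add, operand routed from input y
           (2, 2)]   # PE 2: multiply, operand routed from input z
print(run_column(context, [2, 3, 4]))  # (2 + 3) * 4 = 20
```

Reconfiguring the array amounts to writing a different list of (opcode, mux) pairs, which is exactly the bitcode-plus-multiplexer mechanism the text describes.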

In other situations, the PE can be more complex and execute a sequence of operations. These microcoded operations are stored in a local memory, with reconfiguration achieved by selecting a different sequence of instructions. The difference between a complex PE and a simple microprocessor becomes increasingly small.

Considering the previous requirements, a single GPP could be considered as a run-time reconfigurable device. In this case the hardware is completely fixed, and the


software defines the current functionality. By changing the software, we modify the system behavior. Typically the cost in terms of time is negligible. Nevertheless, a GPP cannot be identified as a reconfigurable device.

First of all, an application mapped in software onto a generic GPP is 100×–1000× slower than a dedicated hardware solution. This is far too slow for a reconfigurable device. Therefore we should add a new characteristic for a reconfigurable device: the computational density [19]. A reconfigurable device can achieve better performance per unit of silicon area than a GPP.

On the other hand, by increasing the number of processors and reducing their size, the computational density of a multi-processor system approaches that of a reconfigurable machine. For such reasons, some multiprocessor systems have been classified in the literature as reconfigurable machines, and we will see some examples later.

2.2 Fine-grain Architectures

Fine-grain architectures are characterized by PEs with granularity of 2, 4 or 6 bits.

Usually the number of PEs is much larger than in coarse-grain devices, and the programming model is different. Since most applications work with 8-, 16- or 32-bit operations, fine-grain devices require a combination of PEs to perform a single operation. This leads to mapping inefficiency and a waste of resources. On the other hand, they are more flexible, and the insertion of glue logic or smaller-granularity operations is possible.

The following subsections present several examples of fine-grain architectures and the ways they address different design issues.

2.2.1 GARP

The GARP architecture [29] was proposed in the late 90s. GARP proposes a tight coupling between a GPP and a fine-grain reconfigurable array. To reduce the control overhead, the array is embedded in the processor datapath and dedicated instructions allow control of it. There are remarkable differences between a GARP array and a standard FPGA. GARP is composed of PEs called blocks. In each row there is always one control block and 23 logic blocks (see Fig. 1). The number of rows is not


Fig. 1. Array organization. In addition to the memory buses, a wire network carries signals between array blocks (Copyright © 1997 IEEE [29]).

fixed and can be decided at design-time. However the authors proposed an array of 32 rows as a possible implementation.

Each block processes two 2-bit operands, and all the interconnections are based on 2-bit wires. In order to provide enough memory bandwidth, four memory buses run along the columns. The local transfers between blocks are performed using dedicated wiring. These wires run in horizontal and vertical directions, connecting all the blocks along their path. However, there is no direct connection between crossing wires as in FPGAs.

Mapping an operation on GARP bears some restrictions that reduce the flexibility but increase the efficiency. Due to the architectural choices, only one operation should be mapped onto a single row. Since only 16 2-bit blocks are required to implement one 32-bit operation, this always leaves 7 unused blocks. In many cases they can be used for additional logic like rounding, control functions, or support for wider data types. These guarantee a large flexibility, and in the worst case only 30% of the array is idle.


Fig. 2. CHESS layout and nearest neighbour wiring (Copyright © 1999 ACM [40]).

2.2.2 Chess Array

Chess Array [40] has a fairly different structure. Its PEs are 4-bit ALUs. This size has been chosen to ease the mapping of applications based on an 8-, 12-, or 16-bit word length. The ALUs are interleaved with switchboxes providing nearest-neighbour (NN) local interconnections. The obtained array resembles a chessboard (Fig. 2).

In addition to the existing hardware it is possible to insert embedded memory blocks of size 256 words × 8 bits. These memories can be used as data buffers or translation tables in the mapping of multimedia processing algorithms.

Configuration is quite flexible. Each ALU can be configured either by the configuration memory or by a neighbor ALU. This enables the mapping of control schemes in the 2D array. CHESS demonstrates that an architecture can be very efficient if the design choices address a specific application datawidth. The CHESS array is powerful when accelerating medium-grain multimedia applications, but we cannot expect the same performance when we consider 32-bit acceleration.


16 Reconfigurable Logic Cells (RLC), based on two 4-input 2-output LUTs (see Fig. 3). A dedicated carry signal and other horizontal wires help the mapping of special functions in one clock cycle.

XiRISC resembles the GARP architecture with some interesting differences. First, the 4-bit granularity and the 16 RLCs per row support two 32-bit operations per row without necessarily leaving holes in the mapping, while preserving the possibility of lower-grain operations. Another feature of XiRISC is the adoption of a dedicated mechanism to guarantee fast reconfiguration.

The VLIW processor with its increased throughput sustains a high overall performance. XiRISC proved to be a flexible architecture, providing speedups between 4× and 14× for a wide range of applications.

2.2.4 MOLEN

The MOLEN Polymorphic Processor [67] presents an architecture similar to XiRISC (Fig. 4), in the form of a RISC processor tightly coupled with a reconfigurable processor (RP) composed of a configuration manager and fine-grain reconfigurable hardware (FPGA-based). In this case the reconfigurable processor is physically separated from the GPP. Memory accesses are controlled through an arbitration mechanism, while some registers are shared between the two processors.

Eight dedicated instructions have been added to the ISA of the GPP to control some operations of the reconfigurable logic. This mechanism reduces the cost related to the RP management, without increasing the complexity of the GPP.

The interesting novelty of MOLEN is that the entire system can be mapped onto a Xilinx FPGA chip, using the PowerPC embedded in the FPGA as the GPP, while the FPGA fabric is used for the memory and the accelerator. However, this architecture is less amenable to run-time reconfiguration unless the feature to reconfigure only part of the FPGA is used (see Section 2.6).


Fig. 3. Pipelined configurable gate array (pGA) (Copyright © 2003 IEEE [14]).

Fig. 4. The MOLEN architecture (Copyright © 2004 IEEE [67]).


Fig. 5. Reconfigurable platform architecture customized with 3 eFPGAs and an 8-port NoC (Copyright © 2005 IEEE [37]).

2.2.5 Embedded FPGAs

We saw that a standard FPGA device is not suitable for run-time reconfiguration. However, it is possible to include a small FPGA block in a fixed architecture. This way we have a static part and a reconfigurable part that can be modified at run-time. When the FPGA block is small, reconfiguration is not too time-consuming.

This type of small FPGA is called an embedded FPGA (eFPGA) in the literature. One of the problems of eFPGAs is their power consumption: FPGA flexibility is often paid for with power inefficiency. To solve this problem, alternative low-power FPGA structures intended for embedded use have been proposed [34].

A NoC-based system integrating three eFPGAs is presented in [37]. The system is depicted in Fig. 5. It is meant to execute multimedia applications using hardware accelerators mapped onto the eFPGA.

A quantitative analysis of eFPGAs targeted to arithmetic implementations is carried out in [68]. The paper also presents a methodology to tailor the FPGA to a specific application domain and proposes an architecture for eFPGAs targeted to arithmetic applications.

2.3 Coarse-grain Architectures

Coarse-grain architectures offer the possibility to map an 8-, 16-, or 32-bit arithmetic or logic operation onto a single PE. This feature optimizes the logic utilization, but it makes it very difficult to map any operation of finer grain. In general we cannot say whether a coarse grain is better than a fine grain. However, the comparison is possible if we define a specific application domain. Knowing the application domain in advance allows tailoring the architecture to the application requirements.

2.3.1 Kress Array

The KressArray [28] was one of the first coarse-grain architectures proposed in the literature. It is composed of PEs called rDPUs (reconfigurable DataPath Units) connected together using NEWS (North-East-West-South) local routes. Operands can be fed using either the local interconnect to the boundary PEs or the buses that feed PEs of the same row. Fig. 6 depicts the KressArray-III prototype chip containing 9 rDPUs.

The KressArray presents a control unit based on a data sequencer and a 2-D addressable memory. The two blocks cooperate to guarantee high-speed parallel access to the data. The entire device is called Map-oriented Machine with Parallel Data Access (MoM-PDA). A typical platform would be characterized by the MoM-PDA acting as an accelerator and a GPP acting as its host.

A framework called Code-X is provided to program this architecture. The framework accepts a program coded in a superset of the C language and is able to partition the code between the accelerator and the host based on a performance analysis. The accelerator-targeted tasks are then processed by a dedicated compiler, X-C, that accepts a subset of the C language. This compiler produces the code for the data sequencer and the rDPUs.


Fig. 6. KressArray-III structure with hierarchical global interconnections (Copyright © 1998 SPIE [28]).


2.3.2 DREAM

The Dynamically REconfigurable Architecture for Mobile systems (DREAM) [8] is an array of Reconfigurable Processing Units (RPUs). Each RPU can perform data processing operations as well as control operations.

The array structure is depicted in Fig. 7. The RPUs are organized in clusters of four elements that share a configuration memory unit (CMU). The array is characterized by local NN interconnections between RPUs and global interconnections managed through switching boxes (SWB). A Communication Switching Box (CSB) controls two CMUs for fast reconfiguration, plus the SWB.

DREAM is another example of an architecture tailored to a specific application domain. Its communication infrastructure has been designed to support efficient loop structures, synchronization, and communication mechanisms typical of mobile communication systems.

2.3.3 Morphosys

Morphosys [56] is a coarse-grain, integrated reconfigurable system-on-chip composed of a reconfigurable array of PEs, a modified RISC core (TinyRISC), and an efficient memory interface unit.

Each PE of the array, called Reconfigurable Cell (RC, see Fig. 8), has 8-bit or 16-bit granularity and is characterized by an ALU-multiplier and a register file. The RC is configured using a 32-bit context word that is stored in a dedicated register of the RC. Reconfiguration is achieved by loading a new context word from the context memory. The context memory can host more than one context and is located close to the array, providing the possibility to switch contexts in a few clock cycles.

The basic interconnection scheme of the RC array is a two-dimensional mesh providing NN connectivity. In addition, Morphosys is characterized by quadrant connectivity: the array is divided into four clusters of 16 RCs each, called quadrants. Each RC can get the output of any other RC in the same quadrant. At a higher level of networking there are interconnections between neighboring quadrants.

Morphosys is designed to meet the requirements of different kinds of applications. For this reason it is characterized by a large size and complex mechanisms to increase the


Fig. 7. Hardware structure of the Dynamically REconfigurable Architecture for Mobile systems (Copyright © 2000 Springer-Verlag [8]).


Fig. 8. MorphoSys 8 x 8 RC array with 2D mesh and complete quadrant connectivity (Copyright © 2000 IEEE [56]).

2.3.4 XPP

XPP [7] is based on a hierarchical structure similar to Morphosys, whose building blocks are called processing array elements (PAEs). A PAE contains input and output registers, plus a dedicated ALU that can perform special 3-input operations in addition to the standard 2-input arithmetic and logic instructions.

The PAEs are organized in processing array clusters (PACs, see Fig. 9) that resemble the quadrants introduced in Morphosys. Each PAC is connected to its own control machine (CM), while a supervising control machine (SCM) is used to synchronize the processing between different PACs.

Communication between PAEs is possible through a packet-oriented network. Data packets guarantee the exchange of operands and results between PAE objects, while event packets can be used for particular data management operations (skipping the processing of data bits, merging bitstreams). Packet synchronization techniques automatically handle pipeline stalls and similar situations.

The novelty of XPP is the event-driven computation supported through the packet-oriented network. Once again, the success of such an approach could be strictly related to the application domain.

2.3.5 ADRES

ADRES [41] is similar to XiRISC, but is characterized by coarse-grain arithmetic. In practice, ADRES presents the typical stages of a processor (fetch, decode, execute), in which the functional units and the register file are also accessible by a reconfigurable coarse-grain array. The processor and the array work in separate cycles following a processor/coprocessor model. This makes data synchronization easy and data sharing possible using the units shared between the two parts (see Fig. 10). ADRES allows a simplified programming model and reduces communication costs.


Fig. 9. XPP architecture (Copyright © 2003 KLUWER Academic Publishers [7]).


Fig. 10. ADRES core (Copyright © 2003 Springer-Verlag Berlin Heidelberg [41]).

Fig. 11. BUTTER PE (Copyright © 2007 University of Montpellier [P2]).


2.3.6 BUTTER

BUTTER [11] has been developed by our research group with a major contribution by the author. It is meant to be a coprocessor for multimedia and 3D graphics acceleration. In its basic version it is composed of a 4x8 array of PEs, but its model is parameterized and it is possible to choose the size, i.e. the number of rows and columns, at design time.

The interesting novelty of the BUTTER design is the possibility to implement it on an FPGA device. In fact, a general-purpose platform based on BUTTER and a RISC core has been successfully synthesized on an FPGA. The reasons behind this design choice are that a coarse-grain array is easier to program than an FPGA device, and that it could broaden the adoption of FPGAs in a wider range of applications. For this reason BUTTER is designed to be as general-purpose as possible, while still not being too resource demanding to fit on an FPGA chip.

BUTTER PEs (Fig. 11) provide functional units for arithmetic and logic operations. The granularity is fixed at 32 bits, and the arithmetic operations support both integer and single-precision floating-point values. Lately, support for subword computation has also been introduced to increase the parallelism in the case of 8-bit and 16-bit integer numbers [13].

BUTTER is characterized by a set of unidirectional interconnections between PEs. The selection of the connection at run-time contributes to the run-time reconfigurability. The possible interconnections are NN from north, north-west, north-east and west, interleaved from two cells up, plus global horizontal and vertical (see Fig. 12). Besides that, BUTTER presents all the typical issues of a run-time reconfigurable array, and for this reason it is used as a case study in this work.

2.3.7 Montium

Montium [58] has a quite different structure from the previous architectures. Instead of a matrix of PEs, its architecture resembles a VLIW processor (see Fig. 13). The parallelism in this case is achieved using a one-dimensional array of 16-bit ALUs, controlled by a Communication and Configuration Unit (CCU). The ALUs are fed by 10 dedicated memories, coupled with Address Generation Units.


Fig. 12. BUTTER interconnections (Copyright © 2007 IEEE [P3]).


Fig. 13. Montium TP (Copyright © 2007 Hindawi [57]).

The flexible interconnection infrastructure together with the dedicated sequencer allows direct mapping of any application based on intensive data processing.

2.4 Multi-core Systems

In some cases the PEs of a CGRA become so complex that they resemble complete microprocessors. In some other cases multi-core architectures present computational density and performance close to reconfigurable machines. The following are some examples.

2.4.1 REMARC

REMARC [43] is the first proposed reconfigurable architecture resembling a multi-core system. REMARC is composed of an 8x8 array of 16-bit PEs. The PEs, called "nanoprocessors", are very simple processors that execute microcode stored in a local memory. A 16-bit ALU supports all the common arithmetic and logic operations.

Fig. 14. REMARC architecture (Copyright © 1998 ACM [43]).

The interconnections between PEs consist of NN links and global channels that run along rows and columns (see Fig. 14). A global control unit manages the operations of the array and the data transfers between the array and the outside world.

2.4.2 RAW

Reconfigurable Architecture Workstation (RAW) [62] is a multi-core system composed of 32-bit modified MIPS2000 processors, organized in a 4x4 mesh (Fig. 15). In RAW the distinction between reconfigurable architectures and multi-core systems is rather blurred. RAW supports both static and dynamic scheduling for the network transactions. We could say that when using static scheduling, the behavior and performance are similar to reconfigurable arrays, while dynamic scheduling is a mechanism typical of multi-core systems.


Fig. 15. RAW architecture (Copyright © 2002 IEEE [62]).


Fig. 16. PicoArray concept (Copyright © 2009 picoChip [47]).

2.4.3 Picoarray

PicoArray [20] advances the multi-core approach, introducing more complex PEs and increasing their number. The PC102 device is characterized by 332 nodes called Array Elements (AEs), each of which can be either a processor (32-bit, 3-way VLIW) or a coprocessor. The AEs are organized in a two-dimensional structure and communicate through a network of 32-bit buses and programmable switches, called picoBus (see Fig. 16). The inter-processor communication is based on a Time-Division Multiplexing scheme to guarantee conflict-free communication between AEs. All the transfers are scheduled at compile time, avoiding the cost of run-time arbitration.

2.4.4 RAA

Reprogrammable Algorithm Accelerator (RAA) [54] is composed of a two-dimensional array of tiny DSPs coupled with local memories. Local communication is supported through NN interconnections, while a global bus is used to transfer data and configuration bitstreams from the outside world. Context switching is used to provide fast reconfiguration and also to virtualize the array dimension, folding the array into multiple configurations.

Fig. 17. MORPHEUS architecture (Copyright © 2007 IEEE [63]).

2.5 Hybrid Approaches

Technology trends towards multi-core and many-core architectures have stimulated the development of systems characterized by heterogeneous grain-size accelerators. The capability to put more and more logic on the same chip has made this approach feasible. The architectures proposed in the literature are based on the combination of existing solutions coming from the fine- and coarse-grain classes.

MORPHEUS [44] is one of them. MORPHEUS stands for Multi-purpOse dynamically Reconfigurable Platform for intensive HEterogeneoUS processing and is the result of an EU-funded project involving 18 partners from European academia and industry.

The MORPHEUS architecture is depicted in Fig. 17. It includes an ARM9 controller, three reconfigurable units of different granularity, software-based dynamic reconfiguration, a NoC-based interconnect, and the possibility for Dynamic Frequency Scaling. The reconfigurable units are: an eFPGA (M2000), the pGA (part of the XiRISC processor described in Section 2.2.3), and XPP (see Section 2.3.4).

Another heterogeneous architecture is CHAMELEON [57]. CHAMELEON is a template (see Fig. 18) that combines different tiled accelerators. The tiles can be ASIC blocks, embedded FPGAs, or Montium TPs (see Section 2.3.7). Therefore accelerators can be dedicated hardware implemented in ASIC technology (fixed logic, highest efficiency), FPGAs (a very versatile reconfigurable architecture), or they can be realized by programming a Montium tile. In [53] the CHAMELEON template has been successfully applied to create a system targeted to Software Defined Radio applications.

Fig. 18. The CHAMELEON template (Copyright © 2008 IEEE [53]).

2.6 Dynamic Partial Reconfiguration

Dynamic Partial Reconfiguration (DPR) relies on the possibility to modify the configuration of a portion (slice) of the FPGA device while the rest of the logic is untouched and may be active. In this way DPR has introduced run-time reconfigurability inside a classical FPGA chip.

DPR must be supported at the hardware level as well as the programming level. It is thus first a choice of the FPGA vendor to provide such a feature. Xilinx introduced DPR in the Virtex and Spartan FPGA families in the late 90s, and so far is the only FPGA vendor that supports it.

The biggest design issues when dealing with DPR are its restrictions [38]. The feasibility and affordability of DPR are guaranteed only if specific slices are reconfigured. For example, the height of a reconfigurable module is always the entire height of the FPGA device, while the width is a multiple of 4 slices (meaning that we must always reconfigure blocks of 4 slices). These blocks also have a fixed position on the FPGA chip. These limits mean that an efficient design should be characterized by reconfigurable modules of the same size. For instance, if we have a platform with a reconfigurable module that can be either an FFT block or a correlator, the size of the two logic blocks should be the same for the highest efficiency. Obviously this is not the common situation.

Fig. 19. Design layout with two reconfigurable modules (Copyright © 2002 Xilinx [38]).

Another issue is the interface. When replacing the logic inside a reconfigurable module, we need to adopt the same type of ports, and we require logic that supports interfacing to the reconfigurable modules. This issue is solved using bus macros provided by the FPGA vendor, which can be used to connect static and reconfigurable blocks (Fig. 19).

In recent years, Xilinx has provided an increasing number of features to ease DPR. In addition, several research groups have been involved in related research activities.

In [39], DPR is used to create a NoC-based system with fine-grain reconfigurable modules exchangeable at run-time. The results are evaluated on a Virtex FPGA and a Virtex-II FPGA.

A bus-based system composed of static and reconfigurable modules is presented in [64]. The system is targeted to a Virtex-II 1000 FPGA. The authors also develop a design methodology to exploit DPR.

A more mature work concerning DPR is the Caronte architecture [21], developed at Politecnico di Milano. The architecture is composed of a static part (processor and memory) plus some reconfigurable modules and a reconfiguration manager (see Fig. 20). One interesting aspect of the work is that the proposed DPR design methodology uses only tools provided by Xilinx and is therefore adoptable by any designer willing to experiment with DPR.

Fig. 20. Caronte architecture overview (Copyright © 2005 IEEE [21]).

The Caronte architecture is the target of a specification-to-bitstream autonomous design flow called Earendil [51]. Earendil uses only standard design tools and is composed of three phases: High Level Reconfiguration (HLR), Validation (VAL), and Low Level Reconfiguration (LLR). HLR performs a first software/hardware partitioning by analyzing the input specifications. VAL is a refinement phase, where the initial decisions are evaluated and the design is optimized. Finally, LLR automatically generates the low-level design, defining what goes to the static part and what to the reconfigurable blocks.

The DPR approach has been further exploited in [52] to realize a multi-FPGA system. The system is composed of different FPGA devices, providing the possibility to combine static blocks and dynamically reconfigurable blocks on different devices. This adds a degree of freedom to the approaches described earlier. One of the main challenges of such a system is to reduce the impact of the overhead due to configuration management and communication between different chips. One of the proposed solutions is to manage the reconfiguration using a Linux OS running on a statically-placed embedded processor. This solution has been adopted in the RAPTOR2000 platform [65].

2.7 Concluding Remarks

In this brief survey we described several examples of reconfigurable architectures. Except for the multi-core architectures, all of them present a very similar structure: a GPP (RISC or VLIW) coupled with a reconfigurable accelerator. The accelerator can be integrated in the processor pipeline or interfaced as a coprocessor. In all the approaches the accelerator has a very large bandwidth and throughput, larger than the processor word width. Run-time reconfigurability allows the reuse of the accelerator resources for different purposes. In some cases the accelerator is tailored to specific application domains, while in other cases it is intended for general-purpose use.

Now we move from theory to practice. We will try to quantify the speed-up that such accelerators can achieve by analyzing the application mapping on one of them. The choice, dictated by practical reasons, falls on BUTTER. From such an analysis, we expect to understand how much we can gain from run-time reconfigurability and what its limitations are.


solutions, are treated in detail in the next chapter.

3.1 A Platform for Application Development

Application development first requires the hardware platform where the application should run. Since BUTTER is meant to work as a coprocessor, we need a GPP for the platform. Its main purpose will be to control the accelerator operations. We can choose any microcontroller or microprocessor as long as it is programmable in C, since the programming tools for BUTTER support only the C language. This condition simplifies the application mapping, since it is relatively easy to write or find an application coded in C.

For the GPP, we selected a general-purpose RISC core called CAPPUCCINO, developed in our group. CAPPUCCINO was the result of combining two separate devices, COFFEE and MILK.

3.1.1 COFFEE

COFFEE (a COre For FrEE [35], [61]) is a general-purpose RISC core, characterized by a 6-stage pipeline. It supports 32-bit and 16-bit instructions, and its pipelined integer multiplier makes it a good candidate for embedded signal-processing applications. It is programmable in C, thanks to a compiler based on GNU GCC.


In terms of interfaces, COFFEE has a Harvard memory interface (separate ports for data and instruction memory) plus a dedicated coprocessor interface. The coprocessor interface allows the connection between the core and up to 4 different coprocessors. The data transfers are fast and controlled using dedicated instructions, but the interface supports only transfers between core and coprocessor registers, and no more than 8 registers per coprocessor are addressable. The interface is therefore not suitable for connecting, for example, COFFEE with BUTTER, since the reconfigurable array needs to be fed with a much larger amount of data.

3.1.2 MILK

MILK [12] is a floating-point coprocessor providing support for IEEE 754 floating-point arithmetic (addition and subtraction, multiplication and division). Originally MILK was interfaced to COFFEE through the coprocessor interface. Later the two devices were merged in order to remove the overhead related to data transfers between their register sets. CAPPUCCINO is the result of this fusion [11]. In practice, the functional units of MILK were transferred to the pipeline of COFFEE, and the operands are taken directly from COFFEE registers. Thus CAPPUCCINO supports integer and floating-point calculations, and the floating-point divider is also used to implement integer division.

3.1.3 Building the platform

CAPPUCCINO and BUTTER are the main components of our platform. We fixed some other requirements for the platform: we need some peripherals for testing and debugging of all the applications we target, and we aim to put the platform on an FPGA.

An overview of the platform is depicted in Fig. 21. The two processing cores are interfaced to several peripherals through a 32-bit bus. CAPPUCCINO is directly connected to the instruction memory, where the application text segment should be loaded. BUTTER is connected to two local data memories, a configuration memory, and a control interface. All of the control registers that are used to activate BUTTER operations reside in the control interface and are accessed by the RISC core through the bus. The two local data memories are meant to provide source operands and store the results of BUTTER processing. These memories are organized into 16 banks,


Besides the data memories, a configuration memory is used to host the configuration bitstream that specifies the current functionality implemented by the array. This memory is organized in 4 banks, one per row, and provides 15 configuration bits to each PE of BUTTER.

The local data and configuration memories are characterized by a 32-bit bus interface that is used to exchange data with the other system peripherals, in particular the system memory.

Among the system peripherals, we have a data memory, a graphics memory, a text buffer, and some dedicated interfaces. The system data memory holds the data segment of the application (including heap and stack). The graphics memory is used for 3D graphics and video applications. It is composed of a double frame buffer and a z-buffer. The double frame buffer is connected through the VGA interface to a VGA port. It holds color information for each pixel of the screen connected to the VGA interface. Since it is a double buffer, it is possible to update a frame while the old one is displayed, and switch to the new one only when the update is complete. The z-buffer is used only for 3D graphics and holds the depth information associated with the pixels of a graphics application.

Besides the graphics memory, we also have a text buffer that can be connected to the VGA port. The text buffer is basically just a frame buffer where each pixel is associated with only one bit. The text buffer can be used with dedicated C libraries to display textual output on a screen.

Finally, a codec interface is used to control the audio ports, and a UART interface is used for the serial port. Notice that the instruction memory and the system data memory can be monitored from outside through a dedicated JTAG interface provided with the FPGA development board.

Table 1 collects all the information about the implementation of the platform on an Altera Stratix II EP2S180F FPGA device. To get some information about the performance, we collected the maximum operating frequencies of BUTTER and


Fig. 21. A platform for application development.


UART              455       398      0.39%    -

VGA Interface     135       95       0.11%    -

Graphic Memory    270       20       0.2%     3,932,160

Text Buffer       49        4        0.03%    81,920

Platform          96,150    15,367   64%      7,424,000

Table 2. Comparison of performance between BUTTER and CAPPUCCINO (Device: Altera EP2S180F)

Device          Maximum Frequency [MHz]
                Fast Model    Slow Model

CAPPUCCINO      96.89         57.73

BUTTER          44.99         25.08

CAPPUCCINO in Table 2. Such numbers will be useful to evaluate the acceleration obtained using BUTTER.

3.2 Programming BUTTER

The programming flow of the platform described above has to take into account the separation of roles between the RISC core and the accelerator. The platform is characterized by a loosely-coupled model of computation, meaning that the RISC core should mainly take care of the control, while BUTTER performs the actual computation. However, we cannot map the entire application onto BUTTER, only pieces of code that are suitable to exploit its data-level parallelism (characterized by SIMD or MIMD operations) and that do not contain many control instructions. Thus the model is loosely coupled in the sense that the RISC core and the accelerator are never performing processing operations at the same time. As we stated already, the RISC core can be programmed in C and can handle not only integer additions and multiplications, but also floating-point operations. If such operations are not performed on a large set of data, mapping on BUTTER usually does not produce a significant speed-up.

When mapping an application on the platform, the starting point is to code the application in the C language. This first software version of the application can run on the CAPPUCCINO core, and we can evaluate how many cycles the software implementation requires. In addition, we can do preliminary testing and debugging to remove problems related to the peripherals and/or the platform itself. Usually such a software implementation is slow, but it provides a good comparison point for the final accelerated version.

For the second step, the application developer must decide which part of the code should be mapped on BUTTER and which one kept on the RISC core. This choice depends on whether the mapping on BUTTER produces a real speed-up or not. Typically only some critical sections (or kernels) are mapped on the accelerator. This requires some effort from the developer: the operation is to a large extent manual and can be characterized by several substeps. BUTTER is composed of a limited number of PEs, and thus it is not always possible to map an entire kernel on it. This means that a kernel must be split into stages, where only one stage is mapped onto the array at a time. In the literature, the set of parameters that defines the mapping of some operations on the array is called a configuration context. One kernel requires one or more contexts, and usually an application is characterized by more than one kernel.

The definition of the contexts is almost completely manual. The only automatic step is the coding of the contexts into a configuration bitstream. This is done using a graphical user interface in which the user defines the contexts needed, and the tool generates a set of configuration files. These are C header files and can be used together with C libraries to control BUTTER operations using the RISC core.

In the next sections we describe in detail some of the applications we mapped on the platform using the procedure described above.


sors (GPU). Since our hardware is primarily targeted to embedded applications, and is much simpler than any GPU, we thought that the usage of such libraries was not needed. Accordingly, we created some simpler 3D libraries [P1] to use with the CAPPUCCINO core and BUTTER. With such libraries we developed some test applications to use on the platform.

Any graphics application is based on a loop in which each iteration generates one frame to be displayed on the screen (Fig. 22). The programming flow of a 3D application requires the programmer to create a three-dimensional world, in which he can instantiate some three-dimensional objects. Each object is characterized by a set of vertices that will be used to eventually draw the object on the screen. These vertices have three coordinates that define their position in the 3D world.

In the first stage of the main loop iteration, all the objects are subjected to some geometrical transformations. Since the transformation parameters can change frame by frame, the impression of the viewer is that the objects are moving in the 3D world.

From the implementation point of view, these transformations are applied to the vertices of the objects.

The second part of the main loop iteration corresponds to the lighting and rendering, that is, the drawing of the object on the screen. The rendering first requires the actual drawing of the surfaces, starting from the vertices. This can be done using different algorithms, as complex as the surface to be drawn. Then the surfaces are projected onto the screen, which involves some additional operations such as hidden surface removal, whereby surfaces are removed if they are hidden behind other objects.

The way an object is drawn starting from its vertices is not unique. One method is to consider the surface of the object as composed of several polygons, with the vertices acting as the vertices of these polygons. Both simple geometrical objects and meshes can have a very large number of polygons. Another way is to use the vertices as control points to generate the surfaces through complex mathematical formulas. Our libraries support only the first method.


Fig. 22. State diagram of the 3D graphics pipeline.


3.3.1 Mapping of the 4x4 matrix-vector multiplication

The transformation is performed in every iteration of the main loop and applied to every vertex of the 3D world. Typically each vertex is characterized by 3 coordinates: its position along the x-, y-, and z-axis of the world. A transformation can be modeled as the multiplication of the vertex by a transformation matrix. A rotation or scaling needs only a 3x3 matrix, while a translation requires a geometrical trick: the 3-dimensional space is projected into a 4-dimensional space, where a 4x4 matrix can model translations as well. This means that any transformation can be represented by a 4x4 matrix. To apply the transformation, we just need to multiply the vertex by the matrix. This operation is mapped on the array [P4].

We assume that the matrix is fixed (as it is within each iteration step), while the vertices are the input operands. BUTTER produces the transformed vertices. The mapping requires two contexts, one of which is depicted in Fig. 23. This context performs the actual computation: it executes the scalar product between a 4-element vector and a 4x4 matrix. The 4x4 matrix must be loaded beforehand into the immediate registers of the first two rows, and this is the role of the second context.

From the implementation point of view, the way the data is organized in the local memories is worth noting (Fig. 24). BUTTER processes one vertex at a time, producing one transformed vertex per cycle. One vertex is composed of 4 elements, stored in 4 banks of the local memory. Since 16 banks are available, we can use the interconnections between memory and the array (buffered switches) to exploit all the available memory. This means that during the processing we need to change the settings of these input and output buffers: first we take the operands from the first set of four banks, then from the second set, and so on. Changing the buffer settings during processing has a cost in clock cycles that is part of the so-called control overhead (see Section 4.3).


Fig. 23. Context used for the mapping of the 4x4 matrix-vector multiplication (© 2008 IEEE [P4]).


Fig. 24. Organization of input and output vertices in the local memories.


3.3.2 Result evaluation

The 4x4 matrix-vector multiplication processes one vector per clock cycle. Thus the processing of n vertices takes n clock cycles. CAPPUCCINO, on the other hand, requires 196·n cycles. Considering for example the figures for the slow timing model (CAPPUCCINO @ 57 MHz, BUTTER @ 25 MHz), the processing on CAPPUCCINO takes n·3.4 µs, while BUTTER takes n·40 ns. This means an average speed-up of 85×. However, this speed-up does not take into account the control overhead described before, which depends on the number of vectors to be processed.

3.4 Multimedia Applications

The multimedia field offers more examples of applications suitable for acceleration with BUTTER. In the desktop computer world, general-purpose processors are equipped with so-called SIMD extensions in order to support data parallelism on 8/16-bit data. In the embedded world, we can use accelerators like BUTTER. In the next sections we analyze example applications related to image processing, audio and video compression.

3.4.1 Image processing

The processing of a still image is by nature a SIMD operation. Image filters are implemented using masks that slide over the image pixels, generating each new pixel from a linear combination of neighboring pixels. The programmer has to map a sum-of-products onto the coarse-grain array. The tricky part is again related to the management of inputs and outputs.

In the case of a generic image filter, the inputs are pixels of the original image, while the outputs are pixels of the filtered one. For each output we need to multiply a set of inputs by coefficients (determined by the filter type) and then add all the results together.

The point is that the inputs come from a 2D neighborhood (see Fig. 25), i.e. from different rows of the image, while the accelerator processes the inputs row by row.

In [P2] we described the mapping of a low-pass filter onto BUTTER. Mapping and routing are shown in Fig. 26. We need two contexts to perform the complete pro-
