
2. FUNCTIONAL VERIFICATION

2.6 Simulation

2.6.1 Emulators

Emulators are special-purpose computers built to speed up simulation. Depending on the use case, they are typically tens to thousands of times faster than software simulators [11]. When the design size reaches roughly tens to hundreds of millions of gates, the performance of a software simulator starts to drop and the benefits of using an emulator become visible [30]. An emulator can be implemented in two ways: indirect implementation and direct implementation [13].

In indirect implementation, the logic simulator itself is implemented in hardware. An example is the cycle-based EVE machine. It is built from primitive logic processors (LPs), each of which has four input channels and a Boolean function (e.g. AND or OR) that it executes continuously. The more LPs the system has, the more Boolean functions can be evaluated in parallel, which increases performance, although communication between the LPs then takes more time. In addition, the emulator includes hardware elements for registers, latches and memory arrays, which extend the functionality that the emulator supports. [13]
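The LP idea can be illustrated with a minimal sketch. It assumes a toy model in which each LP holds one Boolean function with at most four inputs and the LPs are evaluated once per cycle in a pre-sorted order. The class and function names are hypothetical and do not describe the actual EVE machine.

```python
from typing import Callable, Dict, List


class LogicProcessor:
    """Toy LP: one Boolean function with at most four input channels."""

    def __init__(self, output: str, inputs: List[str],
                 func: Callable[..., bool]) -> None:
        if len(inputs) > 4:
            raise ValueError("an LP has at most four input channels")
        self.output = output
        self.inputs = inputs
        self.func = func

    def evaluate(self, signals: Dict[str, bool]) -> bool:
        # Read the named input signals and apply the Boolean function.
        return self.func(*(signals[name] for name in self.inputs))


def run_cycle(lps: List[LogicProcessor],
              signals: Dict[str, bool]) -> Dict[str, bool]:
    """Evaluate every LP once, in the given (already levelized) order."""
    values = dict(signals)
    for lp in lps:
        values[lp.output] = lp.evaluate(values)
    return values


# y = (a AND b) OR (c AND d), spread over three LPs.
lps = [
    LogicProcessor("n1", ["a", "b"], lambda a, b: a and b),
    LogicProcessor("n2", ["c", "d"], lambda c, d: c and d),
    LogicProcessor("y", ["n1", "n2"], lambda n1, n2: n1 or n2),
]
result = run_cycle(lps, {"a": True, "b": False, "c": True, "d": True})
print(result["y"])  # True
```

In this sketch the first two LPs have no mutual dependency, so hardware could evaluate them in parallel. The third LP has to wait for both results, which is where the communication cost between LPs shows up.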

In direct implementation, the design is mapped directly to the programmable hardware of the emulator [13]. The programmable hardware can consist of FPGAs, processor arrays or programmable ASICs [11-13]. Because the design typically cannot fit on a single hardware element, it is partitioned into multiple sub-circuits that are mapped to the programmable hardware elements. The programmable components are connected to each other through the pins of their packages on printed circuit boards (PCBs). There are two connection architectures: direct and indirect. [12]

In the direct architecture, the components are connected directly to each other with physical wiring, whereas in the indirect architecture they are connected through routing chips.

Direct routing is a straightforward connection method, but it is restricted by the limited number of I/O pins on the components. The limited pin count restricts the signals that can be fed into and out of a programmable component. Thus, part of the component’s logic resources may go unused because the module that would be mapped there cannot receive the signals it requires to function. The module must then be mapped to another component to which the required signals can be connected through available pins. One way to avoid this limitation is to use some of the components’ logic for time-multiplexed routing, or to choose indirect routing. [12]
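The pin limitation can be made concrete with a small feasibility check. This is a minimal sketch with hypothetical names and numbers: a module fits a component only if both its logic and its boundary signals fit, and a `mux_factor` greater than one models several signals sharing one pin through time-multiplexing.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    logic_capacity: int  # hypothetical unit: gates
    io_pins: int


@dataclass
class Module:
    name: str
    logic_size: int        # gates required by the module
    external_signals: int  # signals crossing the module boundary


def fits(module: Module, component: Component, mux_factor: int = 1) -> bool:
    """Return True if the module can be mapped to the component.

    mux_factor is the number of signals that share one physical pin.
    """
    pins_needed = -(-module.external_signals // mux_factor)  # ceiling division
    return (module.logic_size <= component.logic_capacity
            and pins_needed <= component.io_pins)


fpga = Component("fpga0", logic_capacity=100_000, io_pins=400)
dma = Module("dma", logic_size=60_000, external_signals=700)

print(fits(dma, fpga))                # False: logic fits, pins do not
print(fits(dma, fpga, mux_factor=2))  # True: 350 pins are enough
```

With these illustrative numbers, 40 % of the component's logic would stay unused even in the successful case, and without multiplexing the module could not be placed there at all.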

In the time-multiplexed approach, a given connection wire between two processing elements is shared by several signals in turns. This may require dividing the emulation clock into sub-cycles so that signalling between the components and their corresponding logic circuits occurs correctly. [12]
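A minimal sketch of this sharing scheme follows. It assumes that each emulation cycle is split into one sub-cycle per signal and that sender and receiver agree on a fixed signal order. The function names and signal values are hypothetical.

```python
from typing import Dict, List


def multiplex(signals: Dict[str, int], order: List[str]) -> List[int]:
    """Sender side: drive one signal value onto the wire per sub-cycle."""
    return [signals[name] for name in order]


def demultiplex(wire_values: List[int], order: List[str]) -> Dict[str, int]:
    """Receiver side: latch each sub-cycle value back into its signal."""
    return dict(zip(order, wire_values))


order = ["p", "q"]  # two signals share one physical wire
emulation_cycles = [{"p": 1, "q": 0}, {"p": 0, "q": 0}, {"p": 1, "q": 1}]

received = []
for cycle in emulation_cycles:
    wire = multiplex(cycle, order)            # len(order) sub-cycles
    received.append(demultiplex(wire, order))

print(received == emulation_cycles)  # True
```

Each emulation cycle here costs `len(order)` wire transfers, which is why the emulation clock has to be divided into sub-cycles and why heavier multiplexing lowers the achievable emulation speed.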

An example of indirect routing with programmable interconnect is the crossbar component. It can connect any of its pins together and thus improves inter-chip routing. The disadvantage is that the size of the crossbar grows significantly as the number of components and their connections increases. In that case, a single crossbar can be divided into multiple smaller components to reduce the size of each one. Time-multiplexing and crossbar interconnection can be combined to further improve inter-chip routing in the emulator, as Figure 6 demonstrates [12]. For example, in Figure 6 the use of the pq-line of Chip 1 is divided in time between signals p and q. In the crossbar of Multiplexer (MUX) Chip 1 these signals are then routed forward.

Figure 6. Example of indirect inter-chip routing between emulator’s processing units using time-multiplexed and crossbar schemes [12].
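The crossbar behaviour and its size growth can be sketched as follows. The cost model, one switch per unordered pair of pins, is an assumption made for illustration and is not taken from [12]. The class and function names are hypothetical.

```python
from typing import Dict


class Crossbar:
    """Toy crossbar: any two free pins can be connected to each other."""

    def __init__(self, pin_count: int) -> None:
        self.pin_count = pin_count
        self.links: Dict[int, int] = {}

    def connect(self, a: int, b: int) -> None:
        if a == b:
            raise ValueError("cannot connect a pin to itself")
        for pin in (a, b):
            if not 0 <= pin < self.pin_count:
                raise ValueError(f"pin {pin} does not exist")
            if pin in self.links:
                raise ValueError(f"pin {pin} is already connected")
        self.links[a] = b
        self.links[b] = a

    def peer(self, pin: int) -> int:
        """Return the pin that the given pin is routed to."""
        return self.links[pin]


def crosspoints(pin_count: int) -> int:
    """Assumed cost model: one switch per unordered pair of pins."""
    return pin_count * (pin_count - 1) // 2


xbar = Crossbar(pin_count=8)
xbar.connect(0, 5)
print(xbar.peer(5))                     # 0
print(crosspoints(8), crosspoints(16))  # 28 120
```

Under this model, doubling the pin count roughly quadruples the number of switches, which is the motivation for splitting one large crossbar into several smaller ones.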

Concurrent execution of the RTL design requires correct timing control from the emulator. It must propagate clock signals to the different processing units without skew to ensure the intended design behaviour. Clock skew can by itself cause wrong signal values to propagate, or cause hold-time violations, which in turn cause metastability. Most timing issues can be fixed by decreasing the emulation clock frequency, which on the other hand decreases emulation performance and its benefit. In addition, this procedure does not suit SoC designs with multiple different clocks and clock phases.

For internally generated secondary clocks, for example, clock skew can be decreased by duplicating the clock generation block in each processing unit that requires it. Thus, slow inter-chip path delays between these units can be avoided. [12]
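The link between skew and hold-time violations can be illustrated with the usual hold-slack relation from synchronous timing analysis. This is a minimal sketch: all delay values are hypothetical integers in nanoseconds, and skew is taken as the capture clock arrival time minus the launch clock arrival time.

```python
def hold_slack(clk_to_q: int, min_data_delay: int,
               hold_time: int, skew: int) -> int:
    """Hold slack in ns; a negative value means a hold-time violation.

    New data must not reach the capturing register earlier than
    hold_time after the (skewed) capturing clock edge.
    """
    return (clk_to_q + min_data_delay) - (hold_time + skew)


# Secondary clock routed between chips: large skew (hypothetical values).
print(hold_slack(clk_to_q=5, min_data_delay=10, hold_time=3, skew=20))  # -8

# Clock generator duplicated locally: small skew.
print(hold_slack(clk_to_q=5, min_data_delay=10, hold_time=3, skew=2))   # 10
```

Lowering the emulation clock frequency does not appear in this relation, which is consistent with the point that slowing the clock fixes most but not all timing issues.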

Using an emulator requires a host workstation that partitions and compiles the design code for emulation and programs the processing units, for example FPGAs, to which the design is mapped. The workstation is connected to the emulator via high-speed channels, and it also programs the routing between the processing units [12]. Because of these phases, the design compilation time for an emulator might be 5 to 50 times longer than for a software simulator. Compile times also differ between emulators: for example, emulators based on FPGAs have longer compile times than emulators based on processor arrays. [11]
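The trade-off between longer compilation and faster execution can be worked through with simple arithmetic. This sketch assumes that total turnaround is compile time plus run time and that the emulator run time is the simulator run time divided by a speedup factor. The numbers are hypothetical and only chosen from within the ranges quoted above.

```python
def break_even_run_time(sim_compile: float, compile_factor: float,
                        speedup: float) -> float:
    """Simulator run time above which the emulator has shorter turnaround.

    Derived from: compile_factor * C + R / speedup < C + R,
    where C is the simulator compile time and R the simulator run time.
    """
    if speedup <= 1:
        raise ValueError("the emulator must be faster than the simulator")
    extra_compile = sim_compile * (compile_factor - 1)
    saved_fraction = 1 - 1 / speedup
    return extra_compile / saved_fraction


def turnaround(compile_time: float, run_time: float) -> float:
    return compile_time + run_time


# Hypothetical case: 10 min simulator compile, 5x longer emulator compile,
# 100x faster execution on the emulator.
threshold = break_even_run_time(sim_compile=10, compile_factor=5, speedup=100)
print(round(threshold, 1))  # about 40.4 minutes of simulator run time
```

For test runs shorter than the threshold, the software simulator finishes first despite its slower execution, so short debug iterations may still be better served by simulation.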

In addition, the host workstation provides the run-time control and debugging interface to the emulator. However, communication between the workstation and the emulator should be kept to a minimum to maximize the performance boost that the emulator offers. [11,13] Multiple users can use modern emulators simultaneously with remote access through host workstations.

Emulators cannot be used to verify the timing properties of an RTL design because the design is mapped to the emulator’s own hardware. The hardware and signal routing inside the emulator are not the same as in the design’s eventual implementation on another chip, so the results from the emulator would not correlate with the actual implementation. For the same reason, emulators are limited to 2-state logic instead of the 4-state logic used in software simulators. The unknown state (X or U) cannot exist in real hardware, and the high-impedance state (Z) cannot be observed by a digital circuit, which leaves only the logical values ‘1’ and ‘0’. [11]
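The practical effect of the 2-state restriction can be shown with a small sketch of an AND gate under both value sets. The 4-state rule follows the usual Verilog-style resolution, where 0 dominates and any other unknown input gives X. Coercing X and Z to 0 mirrors the convention of SystemVerilog 2-state types. How a particular emulator resolves such values is an assumption here.

```python
def and4(a: str, b: str) -> str:
    """4-state AND over the values '0', '1', 'X' and 'Z'."""
    if a == "0" or b == "0":
        return "0"  # a known 0 dominates, whatever the other input is
    if a == "1" and b == "1":
        return "1"
    return "X"      # any remaining X or Z input makes the result unknown


def to_2state(value: str) -> str:
    """Assumed coercion: unknown and high-impedance values become '0'."""
    return value if value in ("0", "1") else "0"


# Software simulator view: the uninitialized input stays visible as X.
print(and4("1", "X"))                        # X

# Emulator view: the same input silently becomes a defined value.
print(and4(to_2state("1"), to_2state("X")))  # 0
```

An uninitialized register that a 4-state simulation would flag as X can therefore look like a valid 0 on the emulator, so such checks are better done in software simulation first.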

In document AXI-Stream VIP Optimization for Simulation Acceleration : a Case Study (pages 23-26)