
With the modifications discussed above, a single parser can serve eight 10 Gbps ports, so the throughput of one packet parser instance is 80 Gbps. To support an aggregate throughput of 640 Gbps, eight packet parser instances are required. Table 7 outlines the number of components required in each instance along with the total area and power dissipation. The ASIC technology used is 28 nm Fully Depleted Silicon on Insulator (FD-SOI), the operating conditions are (SS, 0.9 V, 125 °C), and the synthesis tool is Synopsys Design Compiler J-2014.09-SP4.
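The throughput bookkeeping above can be sanity-checked in a few lines (a minimal sketch; the constant names are illustrative only):

```python
# Per-instance and aggregate throughput of the parser described above.
PORTS_PER_PARSER = 8      # one parser serves eight ports
PORT_RATE_GBPS = 10       # each port runs at 10 Gbps

parser_rate = PORTS_PER_PARSER * PORT_RATE_GBPS  # 80 Gbps per instance
instances_for_640g = 640 // parser_rate          # 8 instances for 640 Gbps

assert parser_rate == 80
assert instances_for_640g == 8
```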

Table 7. Area and power dissipation values for components of an 80 Gbps packet parser

Component                          Number of instances   Total area (μm²)   Total power (mW)
PHV                                8                     149181.28          80.24
APCU                               8                     3587.76            47.20
Header counter                     8                     2766.08            11.36
Payload counter                    8                     2766.08            11.36
Instruction Memory (256×72)        1                     44564.47           9.88
Read port for Instruction Memory   1                     10450.56           14.83
Parameter Memory (32×448)          1                     34426.00           3.46
Read port for Parameter Memory     1                     8684.00            42.74
NHRU                               1                     920.33             0.73
BC                                 1                     452.49             0.48
BCE                                1                     293.65             0.42
Total                              -                     258092.70          222.70
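The "Total" row of Table 7 can be re-derived from the per-component rows (a cross-check only; the tuples simply copy the table values):

```python
# (component, instances, total area in μm², total power in mW) from Table 7
rows = [
    ("PHV",                            8, 149181.28, 80.24),
    ("APCU",                           8,   3587.76, 47.20),
    ("Header counter",                 8,   2766.08, 11.36),
    ("Payload counter",                8,   2766.08, 11.36),
    ("Instruction Memory (256x72)",    1,  44564.47,  9.88),
    ("Read port, Instruction Memory",  1,  10450.56, 14.83),
    ("Parameter Memory (32x448)",      1,  34426.00,  3.46),
    ("Read port, Parameter Memory",    1,   8684.00, 42.74),
    ("NHRU",                           1,    920.33,  0.73),
    ("BC",                             1,    452.49,  0.48),
    ("BCE",                            1,    293.65,  0.42),
]
total_area = round(sum(r[2] for r in rows), 2)   # 258092.7 μm²
total_power = round(sum(r[3] for r in rows), 2)  # 222.7 mW

assert total_area == 258092.7
assert total_power == 222.7
```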

The total area of an 8-threaded packet parser that sustains 80 Gbps throughput is 258092.7 μm². To sustain an aggregate throughput of 640 Gbps, 8 parser instances are required. Since the instruction memory and the parameter memory are relatively small, they are hosted on memories built from registers, so independent read ports can be added easily. In other words, there is one instruction memory and one parameter memory shared by all 8 parser instances, each of which runs 8 independent threads. Another benefit of sharing the instruction and parameter memories is that the initialization process takes less time. The area of this parser can now be compared with that of the RMT parser. In [26], parser components are categorized into four classes. Table 8 outlines these classes and their corresponding components in the proposed architecture.

Table 8. Correspondence of packet parser components in RMT and the proposed architecture

Component class   Components in RMT parser                      Equivalent components in this architecture
1                 TCAM                                          APCU, Parameter memory with 8 read ports
2                 SRAM                                          Instruction memory with 8 read ports
3                 Header identification and field extraction    NHRU, header counter, payload counter, BC, BCE
4                 PHV                                           PHV

The total area of the components equivalent to the TCAM-SRAM pair is 0.79 M gates in this architecture, in contrast to the 1.6 M gate figure of the RMT parser. For class 3 components, the areas in the RMT parser and in this architecture are 0.35 M and 0.17 M gates, respectively. The total area of all PHV instances in this architecture is 3.65 M gates and matches the value provided in [26]. The total gate counts are 4.6 M and 5.6 M gates for this architecture and for the RMT parser instances, respectively. As will be discussed in the following section, the area difference grows as more parser instances are instantiated to support higher aggregate throughput values.
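The per-class gate counts quoted above can be summed to confirm the totals (figures in millions of gates, copied from the text):

```python
# Per-class gate counts (M gates) for the RMT parser and this architecture.
rmt = {"tcam_sram": 1.6, "class3": 0.35, "phv": 3.65}
this_arch = {"tcam_sram_equiv": 0.79, "class3": 0.17, "phv": 3.65}

rmt_total = sum(rmt.values())          # 5.6 M gates
this_total = sum(this_arch.values())   # 4.61 M gates, quoted as 4.6 M

assert abs(rmt_total - 5.6) < 0.01
assert abs(this_total - 4.6) < 0.02
```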

3.5.1 Discussion of results

A 50% saving in area has been achieved for the Match-Action memories of the parser by implementing an alternative mechanism for protocol-independent packet parsing. In the TCAM-based approach, the search key is compared with all entries of the TCAM, whereas in this architecture the search key is compared with the relevant values only. In addition, since a non-lookup mechanism is used for maintaining the boundary between headers, the number of next-header entries does not need to match the number of TCAM entries. Hence, these values are hosted on memories built from registers, to which an independent port can easily be added. This is not possible with TCAMs: in TCAM-based solutions, as more parser instances are added, each instance must have its own TCAM instance.

The area difference becomes more noticeable as the number of parser instances is increased to sustain higher throughputs. In the RMT parser, the TCAM-SRAM pair must be replicated for each parser instance, which does not scale to the tens of parser instances needed for very high throughputs. In this architecture, on the other hand, the memories are shared simply by adding extra read ports.

The elimination of the TCAM is beneficial not only from a chip-area perspective but also in terms of power dissipation. According to [20], the power requirement of an 80 Gbps non-programmable packet parser that does not contain a TCAM is around 400 mW. Compared with that figure, an instance of the designed 80 Gbps packet parser achieves programmability at roughly 50% of that power requirement. No power dissipation figure is provided in [20] for a TCAM-based programmable parser; for a programmable parser that employs a TCAM, the power dissipation would be far above this figure. Information regarding TCAMs is scarce, as they do not come with standard cell libraries by default. As a result, a direct power comparison between the proposed solution and a TCAM-based parser is not possible.

However, it can be confidently said that this architecture is far more power efficient.
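The scaling argument above can be illustrated with the memory and read-port areas from Table 7. This is a first-order sketch under the stated figures only; it compares shared memories plus per-instance read ports against naive replication of both memories, and since a real TCAM is considerably larger than the register-based parameter memory, the gap it shows understates the benefit:

```python
# Area figures (μm²) from Table 7.
INSTR_MEM, INSTR_PORT = 44564.47, 10450.56
PARAM_MEM, PARAM_PORT = 34426.00, 8684.00

def area_shared(n):
    # This architecture: one copy of each memory, one read port per instance.
    return (INSTR_MEM + PARAM_MEM) + n * (INSTR_PORT + PARAM_PORT)

def area_replicated(n):
    # TCAM-style scaling: every instance carries its own memories and port.
    return n * (INSTR_MEM + PARAM_MEM + INSTR_PORT + PARAM_PORT)

# Sharing wins as soon as there is more than one parser instance.
for n in range(2, 9):
    assert area_shared(n) < area_replicated(n)
```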

The achieved throughput can be enhanced by increasing the operating frequency. The use of register-based memories makes frequency scaling considerably easier, because registers are not the limiting factor; the limiting factors are memories and the critical paths of combinational components such as the field extractors. By internally pipelining the field extractors, potential timing violations can be eliminated. Actual SRAMs and TCAMs, on the other hand, cannot be clocked beyond a certain point. Synthesis experiments revealed that the timing constraints are still met even at 2.0 GHz. All the results in this chapter, however, correspond to an operating frequency of 1.19 GHz.
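Assuming throughput scales linearly with the clock (a first-order assumption that ignores any other bottlenecks), the headroom up to 2.0 GHz can be estimated as follows:

```python
# First-order throughput estimate if the clock is raised from 1.19 GHz
# (the frequency used for all results in this chapter) to 2.0 GHz.
BASE_FREQ_GHZ, BASE_RATE_GBPS = 1.19, 80.0
TARGET_FREQ_GHZ = 2.0

rate_at_2ghz = BASE_RATE_GBPS * TARGET_FREQ_GHZ / BASE_FREQ_GHZ  # ≈ 134 Gbps

assert 134 < rate_at_2ghz < 135
```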

4 AN ON-THE-FLY PACKET PRE-PROCESSOR

The implementation results of the programmable packet parser in Chapter 3 are promising. Due to the small area footprint of this parser, there is ample silicon real estate that can be utilized to enhance functionality and throughput. In this chapter, packet processing capabilities will be added to the packet parser of Chapter 3. The content of this chapter is based on PIV.