Implemented Optimizations - Design and Implementation of a Secure Real-time Transport Protocol

5. UVGRTP

5.3 Implemented Optimizations

uvgRTP includes a few optimizations that help make it the fastest library at sending HEVC. Only one of them is H26X-specific; the others can be used with any media type.

uvgRTP imposes one restriction on its input to achieve optimal processing speed: the memory given to uvgRTP by the calling application must not be read-only. This restriction allows uvgRTP to, for example, encrypt packets in place, and it speeds up the HEVC start code lookup significantly. The requirement is realistic in practice: the memory given to uvgRTP most often comes from an encoder, which has most likely allocated it from the operating system and written to it as well. If the memory is nevertheless read-only, uvgRTP makes a copy of it and operates on the copy instead.

5.3.1 HEVC Start Code Lookup

When uvgRTP is given a frame of HEVC data, as per RFC 7798, it needs to find the NAL units in the frame. Each frame can contain multiple NAL units, and they are separated by either a 3-byte start code (0x000001) or a 4-byte start code (0x00000001). When uvgRTP gets an HEVC frame from the application, it finds the NAL units by scanning the input data for these start codes, which mark the NAL unit boundaries. Especially for larger frames, this can be a very time-consuming process. The process is called Start Code Lookup (SCL), and Algorithm 4 shows a naïve way of finding the start of a NAL unit.

Algorithm 4: HEVC start code lookup, simple way

It is clear why this is not an optimal way to find the start codes: the code must examine every byte individually, which is time-consuming for large frames and involves a lot of unnecessary work.
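A minimal C++ sketch of such a byte-by-byte search is given below. The function name `find_nal_start_simple` and its return convention (the offset of the first byte after the start code, or -1 if none is found) are assumptions made for this illustration and are not uvgRTP's actual API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only (not part of uvgRTP).
// Scans `data` byte by byte, starting at offset `from`, for the pattern
// 00 00 01. A 4-byte start code (00 00 00 01) ends with the same three bytes,
// so matching the 3-byte pattern covers both variants.
// Returns the offset of the first byte after the start code, or -1.
std::ptrdiff_t find_nal_start_simple(const uint8_t* data, size_t len, size_t from)
{
    for (size_t i = from; i + 3 <= len; ++i) {
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1)
            return static_cast<std::ptrdiff_t>(i + 3);
    }
    return -1;
}
```

Every byte of the frame is compared against the pattern even though the vast majority of bytes in an encoded frame cannot start a start code, which is the unnecessary work referred to above.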

To understand the optimization, consider what the algorithm is trying to find: a sequence of bytes consisting mostly of zeros. uvgRTP uses two optimizations to speed up the HEVC start code lookup. The first is to go through the HEVC frame 8 bytes at a time and to check with a special bitwise mask [65] whether the 8-byte chunk contains a zero byte. If the chunk does not contain a zero, uvgRTP checks whether the end of the HEVC frame has been reached: if so, the search terminates, and otherwise it proceeds to the next 8-byte chunk. If the chunk does contain a zero, uvgRTP runs the simple HEVC start code lookup described above on that 8-byte chunk. If no start code is found, the code resumes the bitmask scan from the next chunk. If a start code is found, its location is returned to the caller, and the scan continues in search of further start codes. This approach is faster than the simple version shown in Algorithm 4 because the input is processed 8 bytes at a time instead of 1 byte at a time. Algorithm 5 shows a simplified version of the improved HEVC start code lookup.

Algorithm 5: HEVC start code lookup using a bitmask
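As a hedged illustration of the chunked search, the following C++ sketch uses the widely known "has zero byte" bit trick for 64-bit words. Whether this is exactly the mask cited as [65] is an assumption, and the names `has_zero_byte` and `find_nal_start_bitmask` are hypothetical, not uvgRTP's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if at least one of the 8 bytes in `v` is zero.
static inline bool has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Hypothetical helper for illustration only (not part of uvgRTP).
// Returns the offset of the first byte after the next start code, or -1.
std::ptrdiff_t find_nal_start_bitmask(const uint8_t* data, size_t len, size_t from)
{
    size_t pos = from;
    while (pos + 8 <= len) {                    // bounds check on every chunk
        uint64_t chunk;
        std::memcpy(&chunk, data + pos, 8);     // portable unaligned load
        if (has_zero_byte(chunk)) {
            // Slow path: a start code may begin anywhere in this chunk and
            // may extend up to two bytes into the next one.
            for (size_t i = pos; i < pos + 8 && i + 3 <= len; ++i)
                if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1)
                    return static_cast<std::ptrdiff_t>(i + 3);
        }
        pos += 8;
    }
    // Fewer than 8 bytes remain: finish byte by byte.
    for (size_t i = pos; i + 3 <= len; ++i)
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1)
            return static_cast<std::ptrdiff_t>(i + 3);
    return -1;
}
```

Because every start code begins with a zero byte, a chunk without any zero byte cannot contain the beginning of a start code and can be skipped after a single 64-bit test.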

There is one additional optimization that improves the performance of the algorithm further. Because uvgRTP is processing a memory chunk of arbitrary length, it must keep track of its current position so that it does not read beyond the end of the chunk. uvgRTP checks the current index after each 8-byte read, but most of these bounds checks can be eliminated.

This is where the constraint that the memory cannot be read-only comes in. If uvgRTP only has to check the current index when the haszero bitmask check reports a zero byte, the number of checks is reduced significantly. This is achieved by setting one byte in the last 8-byte chunk of the input frame to zero.

This way, if the 8-byte chunk does not contain a zero, uvgRTP can simply proceed to the next chunk, yet it can still detect when the end has been reached. When the end has been reached, the byte that was overwritten with a zero is restored to its original value. This simple optimization makes the algorithm run significantly faster, and the result is faster than FFmpeg’s implementation of HEVC start code lookup [66]. Algorithm 6 shows a simplified version of the HEVC start code lookup with the bounds-check optimization enabled.

Algorithm 6: HEVC start code lookup with bounds-check skipping
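The sentinel idea can be sketched in C++ as follows. This is a simplified illustration under stated assumptions rather than uvgRTP's actual code: the names `scan_bytes` and `find_nal_start_sentinel` are hypothetical, and the sentinel is placed in the last byte of the last full 8-byte chunk so that it is never read by the byte-wise check of the preceding chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

static inline bool has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Byte-wise check of the start positions in [begin, end). May read up to two
// bytes past `end`, but never past `len`.
static std::ptrdiff_t scan_bytes(const uint8_t* data, size_t len, size_t begin, size_t end)
{
    for (size_t i = begin; i < end && i + 3 <= len; ++i)
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1)
            return static_cast<std::ptrdiff_t>(i + 3);
    return -1;
}

// Hypothetical helper for illustration only (not part of uvgRTP).
// The buffer must be writable: one byte is temporarily overwritten.
std::ptrdiff_t find_nal_start_sentinel(uint8_t* data, size_t len, size_t from)
{
    if (len < from + 8)
        return scan_bytes(data, len, from, len);

    // Last full 8-byte chunk on the grid that starts at `from`.
    const size_t last  = from + ((len - from) / 8 - 1) * 8;
    const size_t guard = last + 7;
    const uint8_t saved = data[guard];
    data[guard] = 0;                            // plant the sentinel

    std::ptrdiff_t result = -1;
    size_t pos = from;
    for (;;) {
        uint64_t chunk;
        std::memcpy(&chunk, data + pos, 8);
        if (!has_zero_byte(chunk)) { pos += 8; continue; }  // no bounds check here
        if (pos == last) break;                 // bounds check only after a zero hit
        result = scan_bytes(data, len, pos, pos + 8);
        if (result >= 0) break;
        pos += 8;
    }

    data[guard] = saved;                        // restore the original byte
    if (result < 0)
        result = scan_bytes(data, len, last, len);  // last chunk and tail, unmodified
    return result;
}
```

The fast path now consists of one load and one mask test per chunk. The trade-off is that the buffer is briefly modified during the search, so in this sketch it must not be read concurrently by another thread.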

To compare the algorithm with FFmpeg’s implementation of HEVC start code lookup, both were run on a 754 MB video file containing 602 HEVC start codes. Each result is the average of 20 runs. The speed in Table 7 tells how many gigabits of HEVC data the algorithm can scan per second.

HEVC start code lookup type                 Speed (Gbit/s)
uvgRTP, simple                              4.94
uvgRTP, bitmask only                        16.7
uvgRTP, bitmask + bounds-check skipping     18.1
FFmpeg                                      13.5

Table 7: HEVC start code lookup benchmark results

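As can be seen from Table 7, uvgRTP's optimized implementation is significantly faster than FFmpeg’s: with both optimizations enabled it reaches 18.1 Gbit/s against FFmpeg’s 13.5 Gbit/s, which is about 34% faster.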

5.3.2 Scatter/gather I/O

Scatter/gather I/O, or vectored I/O, is a technique where the input or output of a program is constructed from multiple separate buffers. For example, when an application gives uvgRTP a payload of HEVC data, uvgRTP must prepend the RTP and media-specific headers to the actual payload.

There are two ways to proceed. One is to allocate a new buffer, copy the headers into it in the right order, and then append the user-specified payload, but this causes several unnecessary copy operations. The other is scatter/gather I/O: uvgRTP places the headers before the payload by calling sendmsg(2)/WSASend() with scatter/gather I/O buffers, which removes the need for any extra copies in user space. Figure 42 shows how the UDP payload is constructed from multiple buffers residing in different memory locations.

Figure 42: Scatter/gather I/O
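On Linux, the gathering step can be sketched with sendmsg(2) and an iovec array as shown below. The helper name `send_with_headers` and its parameter layout are hypothetical and not uvgRTP's API, and the socket is assumed to be already connected to the receiver.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only (not part of uvgRTP).
// Sends one datagram whose contents are gathered from three separate buffers.
// No buffer is copied in user space; the kernel reads each one in order.
ssize_t send_with_headers(int fd,
                          const uint8_t* rtp_hdr,   size_t rtp_len,
                          const uint8_t* media_hdr, size_t media_len,
                          const uint8_t* payload,   size_t payload_len)
{
    iovec iov[3];
    // sendmsg() does not modify the buffers; const_cast is only needed
    // because iov_base is declared as a non-const pointer.
    iov[0].iov_base = const_cast<uint8_t*>(rtp_hdr);
    iov[0].iov_len  = rtp_len;
    iov[1].iov_base = const_cast<uint8_t*>(media_hdr);
    iov[1].iov_len  = media_len;
    iov[2].iov_base = const_cast<uint8_t*>(payload);
    iov[2].iov_len  = payload_len;

    msghdr msg{};               // zero-initialized: no address, no control data
    msg.msg_iov    = iov;
    msg.msg_iovlen = 3;

    return sendmsg(fd, &msg, 0);
}
```

The receiver sees a single contiguous datagram with the RTP header first, the media-specific header second, and the payload last, even though the three parts live in unrelated memory locations on the sender.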

Scatter/gather I/O is also a very useful feature when fragmenting the input frame into smaller packets. A packet on a typical Ethernet link is limited to 1500 bytes (the MTU), so if the user-provided payload is larger than that, it must be split into packets of at most 1500 bytes each. Scatter/gather I/O makes this very easy: the system call only takes a pointer to the memory, so each consecutive send call just advances the payload pointer by the size of the previous fragment. Using this technique, uvgRTP’s send stack is completely zero-copy on both Windows and Linux.
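The pointer-advancing fragmentation described above might look like the following C++ sketch. It is a simplification under stated assumptions: `send_fragmented` is a hypothetical name, the socket is assumed to be connected, and the same header bytes are reused for every fragment, whereas a real RTP sender would update fields such as the sequence number for each packet.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only (not part of uvgRTP).
// Splits `payload` into fragments of at most `max_frag` bytes and sends each
// one as its own datagram, prefixed by `hdr`. The payload is never copied:
// only the pointer and length in iov[1] change between calls.
// Returns the number of datagrams sent, or -1 on error.
int send_fragmented(int fd, const uint8_t* hdr, size_t hdr_len,
                    const uint8_t* payload, size_t len, size_t max_frag)
{
    if (max_frag == 0)
        return -1;

    iovec iov[2];
    iov[0].iov_base = const_cast<uint8_t*>(hdr);
    iov[0].iov_len  = hdr_len;

    msghdr msg{};
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;

    int sent = 0;
    for (size_t off = 0; off < len; off += max_frag) {
        iov[1].iov_base = const_cast<uint8_t*>(payload + off);
        iov[1].iov_len  = std::min(max_frag, len - off);
        if (sendmsg(fd, &msg, 0) < 0)
            return -1;
        ++sent;
    }
    return sent;
}
```

In practice `max_frag` would be chosen so that the fragment plus all headers fits in the MTU. With a 1500-byte MTU, a 20-byte IPv4 header, an 8-byte UDP header, and a 12-byte RTP header, that leaves at most 1460 bytes before any media-specific header is accounted for.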

5.3.3 In-place Encryption

In-place encryption is a technique where the plaintext is overwritten with the ciphertext as it is being encrypted. uvgRTP uses this technique if the calling application has requested it. This requires that the input frame is writable and that the caller does not need the memory after calling push_frame().

In-place encryption saves a memory copy. However, it is disabled when scatter/gather I/O is used, for example when fragmenting a large HEVC NAL unit, because the encryption routines used by uvgRTP do not accept scatter/gather-type inputs.

5.3.4 System Call Clustering

Each time uvgRTP wants to send or receive data, it must execute a system call. System calls are the function calls that an application makes to interact with the outside world. If an application wants to read input from the keyboard, open a file, or send a UDP packet over the network, it must execute a system call.

During the system call, execution switches from the context of the application to the context of the kernel, i.e., it does a context switch to kernel mode. During a context switch to the kernel, the CPU’s instruction and data caches are populated with kernel code and data, and application code is evicted from the cache. Some of the entries in the Translation Lookaside Buffer (TLB) are also evicted. When the system call has finished, execution returns to user mode, which may result in cache misses due to the evictions done in kernel mode. This has serious consequences for the application’s performance. [67]

A system call is executed for each packet, which means that for an HEVC frame of 100 KB, around 70 system calls would be executed. To reduce the number of system calls, uvgRTP features what is called System Call Clustering (SCC) [68]. This optimization creates a large buffer that contains all the datagrams to be sent and passes that buffer as a parameter to the kernel in a single system call. The kernel copies the buffer and sends all the datagrams in it, using only one system call.

On Windows, SCC is implemented using WSASendTo() [69], and on Linux it is implemented using sendmmsg(2) [70]. SCC is the single most significant media-generic optimization uvgRTP has to offer.
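For the Linux case, the clustering idea can be sketched with sendmmsg(2) as follows. This is an illustrative sketch, not uvgRTP's implementation: `send_clustered` is a hypothetical name, the socket is assumed to be connected, and all datagrams share one header buffer for brevity, whereas real RTP packets would each carry their own header.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             // sendmmsg() is a Linux-specific extension
#endif
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only (not part of uvgRTP).
// Describes every fragment of `payload` up front and hands the whole batch to
// the kernel, ideally with a single system call.
// Returns the number of datagrams sent, or -1 on error.
int send_clustered(int fd, const uint8_t* hdr, size_t hdr_len,
                   const uint8_t* payload, size_t len, size_t max_frag)
{
    if (max_frag == 0 || len == 0)
        return 0;

    const size_t count = (len + max_frag - 1) / max_frag;
    std::vector<iovec>   iov(count * 2);
    std::vector<mmsghdr> msgs(count);           // value-initialized to zero

    for (size_t i = 0; i < count; ++i) {
        const size_t off = i * max_frag;
        iov[2 * i].iov_base     = const_cast<uint8_t*>(hdr);
        iov[2 * i].iov_len      = hdr_len;
        iov[2 * i + 1].iov_base = const_cast<uint8_t*>(payload + off);
        iov[2 * i + 1].iov_len  = std::min(max_frag, len - off);
        msgs[i].msg_hdr.msg_iov    = &iov[2 * i];
        msgs[i].msg_hdr.msg_iovlen = 2;
    }

    // sendmmsg() may accept fewer messages than requested, so loop until the
    // whole batch has been handed over.
    size_t done = 0;
    while (done < count) {
        const int n = sendmmsg(fd, msgs.data() + done,
                               static_cast<unsigned int>(count - done), 0);
        if (n < 0)
            return -1;
        done += static_cast<size_t>(n);
    }
    return static_cast<int>(done);
}
```

Compared with calling sendmsg(2) once per fragment, the per-datagram work inside the kernel stays the same, but the number of transitions between user mode and kernel mode drops from one per datagram to, in the common case, one per frame.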

In document Design and Implementation of a Secure Real-time Transport Protocol Library for High-Speed Video Streaming (pages 55-60)