
Aaro Altonen

DESIGN AND IMPLEMENTATION OF A SECURE REAL-TIME TRANSPORT PROTOCOL LIBRARY FOR HIGH-SPEED VIDEO STREAMING

Faculty of Information Technology and Communication Sciences

Master of Science Thesis

April 2021


ABSTRACT

Aaro Altonen: Design and Implementation of a Secure Real-time Transport Protocol Library for High-Speed Video Streaming

Master of Science Thesis
Tampere University
Master's Programme in Information Technology
April 2021

The amount of transmitted video in current telecommunication networks keeps increasing every year. Efficient transport technologies are required to provide a backbone for high-bandwidth, low-latency communications. This thesis presents a novel Real-time Transport Protocol (RTP) library called uvgRTP for secure high-speed video streaming. It has built-in support for the AVC, HEVC, and VVC video codecs and provides a secure communication channel between the endpoints by implementing the SRTP and ZRTP specifications. It attains a higher peak goodput and a lower average latency than the state-of-the-art technology when streaming HEVC video. With its permissive 2-Clause BSD license, intuitive API, and support for both Linux and Windows, this RTP library is a good option for secure real-time video streaming.

Keywords: Real-time Transport Protocol (RTP), High-Efficiency Video Coding (HEVC), Secure RTP (SRTP), Zimmermann RTP (ZRTP), cryptography, video streaming

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Aaro Altonen: Design and Implementation of a Secure Real-time Transport Protocol Library for High-Speed Video Streaming

Master of Science Thesis
Tampere University
Master's Programme in Information Technology
April 2021

The amount of video in today's telecommunication networks grows every year. Efficient data transfer technologies are needed to support high-bandwidth, low-latency communication. This thesis presents a new real-time transport protocol library called uvgRTP for secure, high-speed video streaming. It includes built-in support for the AVC, HEVC, and VVC video codecs and provides a secure communication channel between endpoints by implementing the SRTP and ZRTP specifications. uvgRTP achieves a higher peak rate and a lower average latency for streamed HEVC video than other state-of-the-art solutions.

This RTP library, which runs on Linux and Windows, is licensed under the 2-Clause BSD license, and offers an intuitive API, is a viable option for secure real-time video transmission.

Keywords: Real-time Transport Protocol (RTP), High-Efficiency Video Coding (HEVC), Secure RTP (SRTP), Zimmermann RTP (ZRTP), cryptography, video streaming

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


PREFACE

First and foremost, I want to thank Associate Professor Jarno Vanne for providing me an opportunity to work on such an interesting project and for acting as the supervisor of this thesis. Many thanks to Professor Jari Nurmi for acting as an examiner.

I also want to thank Marko Viitanen, whose work laid the foundations of this library, and Joni Räsänen, with whom I have had many interesting discussions regarding this project and many others.

Finally, I want to express my gratitude towards all those people who have acted as cheerleaders during my studies and given me motivation to continue, but especially to my mother, who has encouraged me to apply myself and who finds the silver lining in the worst of situations.

Tampere, 24 April 2021 Aaro Altonen


CONTENTS

1. INTRODUCTION
2. VIDEO COMPRESSION
   2.1 Raw Video
   2.2 Video Compression Techniques
   2.3 High-Efficiency Video Coding
3. MEDIA STREAMING
   3.1 Real-time Messaging Protocol
   3.2 Dynamic Adaptive Streaming over HTTP
   3.3 Real-time Transport Protocol
       3.3.1 RTP Packet Format
       3.3.2 Synchronization Sources
       3.3.3 Mixers and Translators
   3.4 RTP Control Protocol
       3.4.1 RTCP Sender Report
       3.4.2 RTCP Receiver Report
       3.4.3 RTCP Source Description
       3.4.4 RTCP Goodbye
       3.4.5 RTCP Application-specific Message
       3.4.6 RTCP Transmission Interval
   3.5 HEVC Streaming Over RTP
       3.5.1 Single NAL Unit Packets
       3.5.2 Aggregation Packets
       3.5.3 Fragmentation Units
4. SECURITY
   4.1 Asymmetric Encryption
       4.1.1 Rivest–Shamir–Adleman
       4.1.2 Diffie–Hellman Key Exchange
       4.1.3 Elliptic-curve Cryptography
   4.2 Symmetric Encryption
       4.2.1 Mode of Operation
       4.2.2 Advanced Encryption Standard
   4.3 Hash-based Message Authentication
   4.4 Secure RTP
       4.4.1 Encryption
       4.4.2 Key Derivation
       4.4.3 Message Authentication
   4.5 Zimmermann RTP
       4.5.1 Cryptographic Routines
       4.5.2 ZRTP Packet Rejection
       4.5.3 Diffie–Hellman Mode
       4.5.4 Multistream Mode
       4.5.5 Preshared Mode
5. UVGRTP
   5.1 Overview of the Architecture
   5.2 Public API and Usage
   5.3 Implemented Optimizations
       5.3.1 HEVC Start Code Lookup
       5.3.2 Scatter/gather I/O
       5.3.3 In-place Encryption
       5.3.4 System Call Clustering
   5.4 Configuration
   5.5 RTP Sender
       5.5.1 Generic Send Process
       5.5.2 HEVC Send Process
   5.6 RTP Receiver
       5.6.1 RTP Packet Dispatcher
       5.6.2 Generic Reception Process
       5.6.3 HEVC Reception Process
   5.7 Security
       5.7.1 ZRTP-managed Keys
       5.7.2 User-managed Keys
       5.7.3 Operating Principle
   5.8 RTCP
       5.8.1 RTCP State Management
       5.8.2 RTCP Runner
6. MEASUREMENTS AND COMPARISONS
   6.1 Features
   6.2 Test Setup
   6.3 Unencrypted HEVC Goodput Evaluation
   6.4 Unencrypted HEVC Latency Evaluation
   6.5 Encrypted HEVC Goodput Evaluation
   6.6 Encrypted HEVC Latency Evaluation
   6.7 Discussion
7. CONCLUSIONS
REFERENCES


LIST OF SYMBOLS AND ABBREVIATIONS

RTP Real-time Transport Protocol

RTCP RTP Control Protocol

FPS Frames per Second

RFC Request for Comments

HEVC High Efficiency Video Coding

UDP User Datagram Protocol

SSRC Synchronization Source

CSRC Contributing Source

DCT Discrete Cosine Transform

CABAC Context-adaptive Binary Arithmetic Coding

TCP Transmission Control Protocol

P2P Peer-to-peer

HTTP Hypertext Transfer Protocol

MPD Media Presentation Description

OOB Out-of-Band

CNAME Canonical name

MSW Most Significant Word

LSW Least Significant Word

SRTP Secure RTP

SRTCP Secure RTCP

ZRTP Zimmermann RTP

MIKEY Multimedia Internet Keying

VVC Versatile Video Coding

SCC System Call Clustering

SCL Start Code Lookup

VoIP Voice over IP

MTU Maximum Transmission Unit

FU Fragmentation Unit

NAL Network Abstraction Layer

RSA Rivest-Shamir-Adleman

ECC Elliptic-curve Cryptography

AES Advanced Encryption Standard

DSA Digital Signature Algorithm

HMAC Hash-based Message Authentication Code

CM Counter mode

CFB Cipher Feedback

IV Initialization Vector

ROC Roll-over counter

UMTS Universal Mobile Telecommunications System

DOS Denial-of-Service

SIP Session Initiation Protocol

PBX Private Branch Exchange

NIST National Institute of Standards and Technology


TLB Translation Lookaside Buffer

E2EE End-to-End Encryption

ICE Interactive Connectivity Establishment

QoS Quality-of-Service

QoE Quality-of-Experience


1. INTRODUCTION

Video traffic in IP networks is estimated to grow 90% annually and to account for 82% of the total traffic by 2022 [1]. At the same time, Virtual Reality (VR) applications are gaining popularity, demanding both high bandwidth and low latency from the supporting software. As progress is made in screen and video coding technologies, 4K, 8K, and higher resolutions are becoming an expectation for video services.

This growth is a multidimensional problem that has to be solved on multiple fronts. Video coding standards, such as Advanced Video Coding (AVC/H.264) [2] and High-Efficiency Video Coding (HEVC/H.265) [3], provide tools to compress digital video into a more manageable size. For 4K UHD video, HEVC can attain 40% [4] higher coding efficiency for the same objective quality and 64% [5] higher coding efficiency for the same subjective quality as AVC. Versatile Video Coding (VVC/H.266) [6] attains even higher coding efficiency for the same objective and subjective quality than HEVC while also providing better tools for compressing 8K and 16K video.

Real-time Transport Protocol (RTP) is a network protocol used for real-time audio and video transmission. It was designed especially for Voice over IP (VoIP) applications that need to stream multimedia in real time, but it can also be used to stream offline content. The latest specification, RFC 3550 [7], defines the structure of an RTP packet and introduces the RTP Control Protocol (RTCP), which is used to monitor the session among participants and to provide Quality-of-Service (QoS) metrics.

RTP is a widely used and actively developed protocol, and several open-source RTP implementations are available [8]-[16], but they all have problems. For example, they either do not provide built-in support for video codecs, do not support the latest RTP specification, or are full-fledged multimedia frameworks and thus not ideal for lightweight applications. Furthermore, most of the libraries support neither Secure RTP (SRTP) nor any key management protocol, such as Zimmermann RTP (ZRTP), making them ill-suited for secure video transport. Some of the libraries also require a lot of code to get a simple video call working.

uvgRTP [17] was created to address all these issues. Originally it was meant to be a thread-safe, easy-to-use alternative to the LIVE555 multimedia framework. The need for it became evident when Kvazzup [18], an open-source video conferencing application, needed an RTP implementation with a permissive license and built-in support for HEVC.

uvgRTP is a novel RFC 3550-compliant RTP library with built-in support for AVC, HEVC, VVC, and Opus, transparent media encryption using SRTP and ZRTP, and a user-friendly API. It is licensed under the permissive 2-Clause BSD license and runs on both Windows and Linux. Additionally, uvgRTP is able to attain a goodput of 4.8 Gbit/s and an average latency of 6.9 ms for HEVC in ideal conditions.

Even though uvgRTP supports both AVC and VVC, this thesis focuses on HEVC, as it represents the state of the art and is supported by both FFmpeg and LIVE555, two very well-known multimedia frameworks.

This thesis is organized as follows. Chapter 2 gives a general overview of video compression and a short introduction to HEVC. Chapter 3 describes media streaming and introduces RTP and RTCP. Chapter 4 covers the security aspects of multimedia streaming, why security is needed, and ways of implementing it. Chapter 5 gives an in-depth look at the implemented RTP library: its architecture, optimizations, operation, and feature set. Chapter 6 compares different RTP libraries and presents HEVC throughput benchmarks for uvgRTP, FFmpeg, and LIVE555. Chapter 7 concludes the thesis.


2. VIDEO COMPRESSION

Video compression is the act of compacting the information in a video to a smaller size. Uncompressed, so-called raw video contains all the captured information. This information often contains many redundancies, or it can be presented in a different, more concise form. Compression is necessary, for example, when video is transported over a network or saved to disk, as the space required by compressed video is often many times smaller than that of uncompressed video.

2.1 Raw Video

Raw or uncompressed video is a format where the information of the video has not been compacted. It is often produced by video cameras and consumed by monitors, and image processing work is typically done on raw image data. Though useful for these purposes, raw video is rarely stored or transported as such; instead, the information is compacted into a smaller form that requires less bandwidth or storage space. The data rate of raw video is directly proportional to color depth, resolution, and refresh rate. For example, one minute of 30 Frames per Second (FPS) Full HD video takes 1920 × 1080 × 30 × 60 = 3 732 480 000 bytes, or about 3.7 gigabytes, at one byte per pixel. This amount of data for 60 seconds of video is unacceptable for most applications.
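The arithmetic above can be reproduced with a short sketch; the 4-byte-per-pixel variant is an illustrative extension of the text's one-byte-per-pixel example, not a figure from the thesis:

```python
def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=1):
    """Size of uncompressed video: every pixel of every frame is stored."""
    return width * height * bytes_per_pixel * fps * seconds

# The example from the text: one minute of 30 FPS Full HD at one byte per pixel.
full_hd_minute = raw_video_bytes(1920, 1080, 30, 60)
print(full_hd_minute)  # 3732480000 bytes, ~3.7 GB

# With hypothetical 4-byte RGBA pixels the same clip is four times larger.
rgba_minute = raw_video_bytes(1920, 1080, 30, 60, bytes_per_pixel=4)
print(rgba_minute)     # 14929920000 bytes
```

The linear dependence on each parameter is visible directly in the formula: doubling the resolution, frame rate, or recording length doubles the storage requirement.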

The two most common raw video formats are RGB and YUV. In RGB, each pixel has red, green, and blue values, and sometimes an alpha value which tells the transparency of the pixel. Thus, each pixel takes 4 bytes of space if each color component takes 8 bits. Figure 1 shows the format of an RGB pixel. [19]

Figure 1: RGB pixel format [19]

YUV is another popular raw video format. It consists of three signals: Y, which defines the luminance, and U and V, which define the chrominance (blue and red, respectively). The YUV design takes the human perception of brightness and color into account. It was originally used to provide color images for televisions while remaining backwards-compatible with monochrome televisions, which would simply ignore the U and V parts of the signal. Figure 2 shows the format of YUV 4:2:0. [20]

Figure 2: YUV 4:2:0 pixel format [20]


As can be seen from Figure 2, there are twice as many Y samples as U and V samples combined, as the human eye is much more sensitive to changes in brightness than to changes in color.
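The effect of 4:2:0 subsampling on frame size can be quantified with a small sketch: the chroma planes have half the width and half the height of the luma plane, so an 8-bit frame takes 1.5 bytes per pixel instead of 3:

```python
def yuv420_frame_bytes(width, height):
    """Bytes in one 8-bit YUV 4:2:0 frame: a full-resolution Y plane plus
    U and V planes subsampled by two in both dimensions."""
    y_plane = width * height
    chroma_plane = (width // 2) * (height // 2)
    return y_plane + 2 * chroma_plane

print(yuv420_frame_bytes(1920, 1080))  # 3110400, i.e. 1.5 bytes per pixel
```

Compared to 4-byte RGBA, the same Full HD frame shrinks from about 8.3 MB to about 3.1 MB before any actual compression is applied.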

2.2 Video Compression Techniques

Video compression is a specialized form of compression where image data is condensed into a smaller size. This can be achieved with entropy coding, such as Context-adaptive Binary Arithmetic Coding (CABAC); by stripping high-frequency signals away using the Discrete Cosine Transform (DCT); or by encoding only the changing parts of an image instead of the entire image.

The simplest form of video coding is intra-frame coding. This technique does not use any reference frames; each frame is encoded individually, much like in typical image compression such as JPEG. Figure 3 shows the intra-frame coding flow.

Figure 3: Intra-frame coding flow [21]

The process starts by splitting the raw video frame into macroblocks, the basic processing units of video compression, typically 16×16 pixels in size. Each block goes through a DCT, which converts the block into frequency coefficients. Unimportant frequencies may be stripped out during quantization to save space, at which point the process becomes lossy as data is permanently discarded. After that, depending on the codec, the block may be converted to some other format and is finally entropy-coded using Huffman coding, CABAC, or similar. [21]
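The transform-and-quantize step can be illustrated with a naive 2-D DCT-II followed by uniform quantization. This is only for intuition: real codecs use fast integer approximations of the transform, and the block contents and quantization step below are made up for the example:

```python
import math

def dct2(block):
    """Naive 2-D DCT-II of a square block (O(n^4), for illustration only)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    for x in range(n) for y in range(n))
            cu = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
            cv = math.sqrt(1 / n) if v == 0 else math.sqrt(2 / n)
            out[u][v] = cu * cv * s
    return out

def quantize(coeffs, step):
    """Uniform quantization: small (mostly high-frequency) coefficients
    become 0, which is where the lossy size reduction comes from."""
    return [[round(c / step) for c in row] for row in coeffs]

# A nearly flat 8x8 block with faint checkerboard noise: after the DCT and
# quantization, essentially all energy sits in the single DC coefficient.
block = [[100 + (x + y) % 2 for y in range(8)] for x in range(8)]
q = quantize(dct2(block), step=10)
nonzero = sum(1 for row in q for c in row if c != 0)
print(q[0][0], nonzero)  # 80 1
```

Only one coefficient out of 64 survives quantization here, which is exactly the kind of sparsity the subsequent entropy coder exploits.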

The other, more efficient way of compressing video is to exploit the connection between consecutive frames by using other frames as references. This is called inter-frame coding, and Figure 4 shows the difference between intra- and inter-frame coding.


Figure 4: Inter-frame vs intra-frame encoding [22]

Using inter-frame coding, a lot of space can be saved, as only the motion is encoded instead of the whole frame. During inter-frame coding, the encoder searches an already encoded reference frame for a block matching the current one. When a match is found, the difference between the blocks, the prediction error, is calculated. The match also yields a motion vector, which tells how much the block has moved relative to the reference frame. This information is then encoded using DCT, quantization, and entropy coding, just like in intra-frame coding. The spatial requirements of inter-frames are, however, much smaller than those of intra-frames. [23] Figure 5 shows the inter-frame coding flow.

Figure 5: Inter-frame coding flow [23]
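The block-matching search described above can be sketched as an exhaustive search over a small window. Real encoders use far faster search patterns, and all the names and frame contents here are illustrative:

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def extract(frame, x, y, bs):
    """Cut a bs-by-bs block whose top-left corner is at (x, y)."""
    return [row[x:x + bs] for row in frame[y:y + bs]]

def motion_search(ref, cur, x, y, bs, radius):
    """Find the motion vector (dx, dy) into the reference frame that best
    predicts the current frame's block at (x, y)."""
    block = extract(cur, x, y, bs)
    best_mv, best_cost = (0, 0), float("inf")
    h, w = len(ref), len(ref[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = x + dx, y + dy
            if 0 <= rx <= w - bs and 0 <= ry <= h - bs:
                cost = sad(extract(ref, rx, ry, bs), block)
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# A 4x4 textured block sits at (4, 4) in the reference frame and has moved
# to (6, 5) in the current frame.
ref = [[0] * 16 for _ in range(16)]
cur = [[0] * 16 for _ in range(16)]
for i in range(4):
    for j in range(4):
        ref[4 + i][4 + j] = 10 + 4 * i + j
        cur[5 + i][6 + j] = 10 + 4 * i + j

mv, err = motion_search(ref, cur, 6, 5, 4, 3)
print(mv, err)  # (-2, -1) 0: the block came from two columns left, one row up
```

A zero prediction error as in this toy case means only the motion vector needs to be transmitted; in real video the residual is non-zero and goes through the DCT/quantization path of Figure 5.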

There are two types of inter-frames: predictive frames (P-frames) and bi-predictive frames (B-frames). Using inter-frames also introduces a new concept called the group of pictures (GOP), shown in Figure 6.


Figure 6: Group of pictures [24]

The video is divided into GOPs, and a GOP defines how inter- and intra-frames are ordered. A GOP can contain intra-frames (I-frames), P-frames, which reference older frames, and B-frames, which can reference both older and future frames, as shown in Figure 6. I-frames mark the beginning of a GOP. [25]

Video compression is a very complex subject, and unlike in text compression, data loss is acceptable to a certain degree, which makes it possible to compress video of acceptable quality to very small sizes compared to raw video. Several competing video codecs are available, such as the previously mentioned AVC, HEVC, and VVC, but also AV1 [26] and VP9 [27].

2.3 High-Efficiency Video Coding

High-Efficiency Video Coding (HEVC/H.265), published in 2013, is a state-of-the-art video codec. Compared to its predecessor AVC, HEVC is able to attain, for 4K UHD video, 40% [4] higher coding efficiency for the same objective quality and 64% [5] higher coding efficiency for the same subjective quality.

As video consumption increases every year, the need for efficient compression technologies is pressing. AVC cannot achieve acceptable bitrates and complexity when, for example, the resolution of the video is increased. Therefore, HEVC was created with two clear design goals in mind: better support for parallelism during video coding and better support for higher resolutions, such as 4K UHD. [28]

One of the innovations of HEVC over AVC is the extension from macroblocks to Coding Tree Units (CTUs), which can be up to 64×64 pixels in size. The large size of a CTU allows the data to be compressed more efficiently. In HEVC, both intra- and inter-picture prediction can be performed on these CTUs. [28]

HEVC also features two loop filters, the Deblocking Filter (DBF) and Sample-adaptive Offset (SAO), which improve picture quality by removing and smoothing compression artifacts.


3. MEDIA STREAMING

Media streaming in the 21st century is a part of everyday life. YouTube and Netflix keep gaining popularity, and remote work increases the use of video conferencing software. Video content can be divided roughly into two categories: real-time and non-real-time. Non-real-time content is not produced in real time but is streamed, for example, from offline storage, as is the case with YouTube and Netflix. Real-time streaming happens, for example, during a video conference, where video is captured from a webcam and streamed over the network to the other participants. Real-time streaming has strict deadlines: if video coding, streaming, decoding, or displaying takes too long, the disturbance is visible and leads to user dissatisfaction. Real-time streaming can be implemented in many ways, depending on which is more important: latency or correctness.

Real-time streaming can also be roughly divided into two categories: producer-consumer streaming and participatory streaming. An example of producer-consumer streaming is watching a livestream on Twitch, where 100% packet delivery and correctness are more important than whether a viewer receives the video 500 milliseconds later than some other viewer. For participatory streaming, low latency is of utmost importance, as a 0.5 - 2 second latency, for example in audio, can be detrimental in a video conference.

For low-latency applications, real-time streaming over the User Datagram Protocol (UDP) is preferred, as UDP provides lower latencies than the Transmission Control Protocol (TCP) by forgoing QoS features such as packet retransmission and packet ordering. These QoS features can be implemented on top of UDP if needed. Several protocols are available for real-time streaming, some running over UDP, some over TCP. Peer-to-peer (P2P) applications often prefer UDP, as UDP hole punching can be employed to allow traffic to flow through a firewall.

3.1 Real-time Messaging Protocol

Real-time Messaging Protocol (RTMP) [29] is a proprietary protocol developed by Adobe for multimedia streaming, designed especially for Flash player communications. It can be used over TCP, UDP, and the Hypertext Transfer Protocol (HTTP).

When an RTMP session is started, a special handshake is performed between the parties. The handshaking procedure is shown in Figure 7.

Figure 7: RTMP handshaking procedure


The handshake starts with a C0 message sent by the client, containing the RTMP version used by the client. After the C0 message, the client sends a C1 message. The C1 and S1 messages share the same format: a 4-byte timestamp field, a 4-byte field filled with zeros, and the rest of the message, 1528 bytes, filled with random data. The value of the timestamp field specifies the epoch for the stream. If the versions sent in the C0/S0 messages do not match, the session is terminated.

The C2/S2 messages are very similar in structure to the C1/S1 messages. The first 4-byte field contains the remote peer's timestamp sent in the C1/S1 message, so, for example, C2 contains the timestamp sent in message S1 and vice versa. The second field, also 4 bytes long, is likewise a timestamp and tells, in RTMP time, when the previous message was received. The rest of both C2 and S2 contains the random data sent in the previous message, so, for example, S2 contains the random data sent in the C1 message. When the C2 and S2 messages have been exchanged, the session can start.
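Under the field layout described above, the C1 and C2 messages can be sketched as follows. The helper names are illustrative, and only the field sizes come from the handshake description:

```python
import os
import struct

def build_c1(epoch_timestamp):
    """C1/S1: a 4-byte timestamp, 4 zero bytes, and 1528 random bytes."""
    return struct.pack(">I", epoch_timestamp) + b"\x00" * 4 + os.urandom(1528)

def build_c2(peer_message, read_timestamp):
    """C2/S2 echoes the peer's epoch timestamp and random data, and records
    the RTMP time at which the peer's C1/S1 message was read."""
    return (peer_message[0:4]                  # peer's epoch timestamp
            + struct.pack(">I", read_timestamp)
            + peer_message[8:])                # peer's 1528 random bytes

s1 = build_c1(0)
c2 = build_c2(s1, 42)
print(len(s1), len(c2))  # 1536 1536
```

The echoed random bytes let each side verify that the peer actually received its handshake message rather than replying blindly.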

Figure 8 shows the structure of an RTMP data packet.

Figure 8: RTMP payload format

The packet starts with a 2-bit format field that determines the format of the chunk basic header. The second field indicates the version of the header, which in this case is 1. The third field determines the Chunk Stream ID of the packet. Next follows a 24-bit timestamp field, which counts upwards from the epoch negotiated in the C1/S1 messages. The length field gives the length of the chunk. The Message Type ID indicates what the packet is carrying; examples of message types are Virtual Control and Video packets. The Message Stream ID allows the receiver to differentiate data sources, although it is often the same for all chunks. The rest of the packet is payload. Figure 8 shows only one of the RTMP packet formats; the rest can be found in the specification [29].

The typical operating mode of RTMP is to create a logical channel, called a message stream, from some source such as live video, and to split that stream into chunks. These chunks form a chunk stream, which is then transported over the network in chunks following, for example, the format defined in Figure 8. RTMP allows the client and server to multiplex/demultiplex audio and video into the same chunk stream using Message Stream IDs. Stream behavior can be controlled using in-band control messages.

RTMP is an example of a widely used streaming protocol. It supports a multitude of transport protocols and enables stream encryption and HTTP tunneling. However, as HTML5 gains popularity, RTMP is becoming less relevant. It is also not a good option for real-time peer-to-peer streaming, as it is built upon the notion of a client/server model.


3.2 Dynamic Adaptive Streaming over HTTP

Dynamic Adaptive Streaming over HTTP (DASH) [30] is another very widely used media streaming protocol. Its main design goal is to be a bandwidth-conservative, bitrate-adaptive solution for streaming both live and offline content [31]. Streaming content over HTTP circumvents the problem of firewall hole punching, as the HTTP port 80 is almost always open and the client thus only needs to listen to data coming to that port.

The operating principle of DASH is simple: the content is split into data segments, and the user is provided with a Media Presentation Description (MPD). The user can then request data from the DASH server using HTTP GET requests. Figure 9 shows how DASH works on a high level.

Figure 9: Client/server model of DASH [30]

On the left in Figure 9 is the DASH server, which holds the data segments of, for example, video files. On the right is the DASH client, which requests these data segments from the server using the segment-identifying data given in the MPD manifest provided at the start of the session. The client can adapt, for example, to changing network conditions by requesting lower-bandwidth video from the server.
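The client-side adaptation can be sketched in a few lines, assuming the MPD simply lists the available representation bitrates. The bitrates and the safety margin below are illustrative, not values from the DASH standard:

```python
def pick_representation(bitrates_bps, measured_throughput_bps, safety=0.8):
    """Choose the highest-bitrate representation that fits within a safety
    margin of the measured throughput; fall back to the lowest otherwise."""
    usable = [b for b in sorted(bitrates_bps)
              if b <= measured_throughput_bps * safety]
    return usable[-1] if usable else min(bitrates_bps)

# Hypothetical representations listed in an MPD: 1, 3, and 8 Mbit/s.
mpd_bitrates = [1_000_000, 3_000_000, 8_000_000]
print(pick_representation(mpd_bitrates, 5_000_000))  # 3000000
print(pick_representation(mpd_bitrates, 500_000))    # 1000000
```

Because the decision is made per segment, the stream can step down smoothly when throughput drops and step back up when conditions improve.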

3.3 Real-time Transport Protocol

The previously introduced streaming protocols, while providing valuable service in their own domains, are not suitable for real-time peer-to-peer video streaming, as they are designed around a client/server model. This model does not fit, for example, video conferencing software, where all participants are equal and both send and receive data.

Real-time Transport Protocol (RTP) is a protocol for streaming real-time video, audio, and even text. Its specification, RFC 3550 [7], published in 2003, describes the overall RTP packet format. RTP is a very widely adopted real-time streaming protocol, often used in Voice over IP (VoIP) applications to carry video and audio streams. RFC 3550 is divided into two parts: the first describes RTP itself and the second the RTP Control Protocol (RTCP), the more complex of the two.

RTP specifies useful concepts such as Synchronization Sources (SSRC), which can be used to differentiate senders; extension headers, which allow implementations to experiment with payload-format-independent functionality; and mixers and translators, which have use cases especially in low-bandwidth networks.


3.3.1 RTP Packet Format

Figure 10 shows the packet format for RTP.

Figure 10: RTP Packet format

An RTP packet consists of a few mandatory fields, and the rest of the 1500-byte datagram is reserved for payload. Media formats may add additional headers, which are placed in the payload area of the packet.

The first field is V, which defines the version of the RTP protocol. For RFC 3550, this field must be set to 2. If a received packet does not have 2 in the version field, it must be discarded.

The second and third fields are the Padding and Extension bits, respectively. If the padding bit is set, the payload of the packet is padded, and the last byte of the payload defines the padding length. These padding bytes are not part of the payload and must be removed.

The extension bit defines whether the packet has any extension headers. If the bit is set, the extension headers are located right after the Contributing Source (CSRC) entries.

After the X bit comes the CC field, which tells how many CSRC entries the packet has. It is often zero unless mixers are used.

The M bit, or Marker bit, is a free-to-use bit. It does not carry any information from the viewpoint of RTP and can be used, for example, to signal that the current packet is the last fragment of a larger frame.

The PT field defines the payload type of the media carried in the payload. Each media type has a specific payload type; for HEVC it is 96.

Next comes the sequence number, a 16-bit unsigned integer. This means that after 65536 packets, the counter wraps around and sequence numbers start repeating.

The timestamp field is a 32-bit counter that defines the media-specific progression of the stream. Different media types have different advance rates, and often the counter increment is more complicated than simply increasing it by one.

Next comes the SSRC of the sender, which shall be unique among all participants of the session. After the SSRC there are optional CSRC entries; if mixers are not used, these fields are not part of the packet. There can be at most 15 CSRC entries, as the CC field is 4 bits wide.


Following the CSRC entries is another optional field, the extension header, which can be used to provide additional information about the payload data.

Finally, the rest of the datagram is dedicated to payload. If indicated by the P bit in the header, the payload contains padding, and the last byte of the payload tells how many padding bytes there are.

If a 1500-byte datagram is used and there are no extension headers, contributing sources, or padding, the payload carried in each packet is at most 1446 bytes.
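The fixed 12-byte header walked through above can be packed with a few lines. The field values here are examples (payload type 96 for HEVC as in the text), and the function name is illustrative:

```python
import struct

def build_rtp_header(pt, seq, timestamp, ssrc, marker=False):
    """Pack the fixed RTP header: V=2, P=0, X=0, CC=0 in the first byte,
    the marker bit and 7-bit payload type in the second byte, followed by
    the 16-bit sequence number, 32-bit timestamp, and 32-bit SSRC."""
    byte0 = 2 << 6                           # version 2, no padding/extension/CSRC
    byte1 = (int(marker) << 7) | (pt & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

hdr = build_rtp_header(pt=96, seq=1, timestamp=90000, ssrc=0xDEADBEEF)
print(len(hdr), hex(hdr[0]), hdr[1] & 0x7F)  # 12 0x80 96
```

The `!` in the format string selects network (big-endian) byte order, which is what the wire format requires.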

3.3.2 Synchronization Sources

One of the most important concepts of RTP is the SSRC. RTP sessions consist of participants that either only send, only receive, or both send and receive media. Each participant is assigned a unique 32-bit identifier called the SSRC, which gives the receiver the ability to differentiate senders. This is especially important in multicast sessions, where participants receive packets from and send packets to the same IP address. Figure 11 presents a multicast RTP session where the yellow boxes represent RTP packets with unique SSRCs.

Figure 11: RTP multicast session

As Figure 11 depicts, it would be impossible, for example, for participant 3333 to differentiate the incoming packets and pass them to the correct video decoding instances had the packets not carried unique identifiers. In typical unicast sessions the SSRCs do not provide much value, as the IP address/port pair is often enough to differentiate participants. This is, however, not always the case if, for example, translators or mixers are used.

Though very unlikely, SSRC conflicts are possible. Two participants having the same SSRC can have devastating consequences on video call quality, so as soon as a conflict is noticed, the conflicting participants resolve it amongst themselves without disrupting the rest of the session. Algorithm 1 shows in pseudocode how an SSRC conflict is resolved.


Algorithm 1: SSRC conflict resolution

When an RTP packet is received, the first thing that is checked is whether the sender is a new participant. If so, an entry is created in the participant table. Each new participant is put on a probation period which ends after a few successfully received packets. When the participant has been accepted to the session and its first real packet has been received, its transport address is saved to the participant table, which already contains the participant's SSRC.

For each received packet, the packet's transport address is matched against the address in the participant table. If they do not match, two nodes are transmitting from different addresses while using the same SSRC. If the SSRC of the received packet does not collide with our own SSRC, the packet is simply dropped and reception continues. All subsequent packets with an invalid transport address/SSRC combination are likewise discarded.

If the SSRC matches our own SSRC and we notice this for the first time, we add an entry to a table that records SSRC conflicts. We then remove ourselves from the session by sending an RTCP BYE message, generate a new SSRC, and rejoin the session. If the transport address has already been logged in the SSRC conflict table, we add a new entry to the table, discard the packet, and continue packet reception.
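The resolution logic above can be sketched in a few lines. This is a simplification of the collision handling just described (probation and the RTCP BYE transmission are reduced to comments), and all names are illustrative:

```python
import random

class Participant:
    def __init__(self, own_ssrc):
        self.own_ssrc = own_ssrc
        self.table = {}       # SSRC -> transport address of known senders
        self.conflicts = []   # logged (address, ssrc) collisions

    def on_packet(self, ssrc, addr):
        """Decide what to do with a packet arriving from (ssrc, addr)."""
        if ssrc != self.own_ssrc and ssrc not in self.table:
            self.table[ssrc] = addr            # new participant (probation omitted)
            return "accepted"
        if ssrc == self.own_ssrc:
            # Collision with our own SSRC: log it, leave the session with an
            # RTCP BYE, pick a new random SSRC, and rejoin.
            self.conflicts.append((addr, ssrc))
            self.own_ssrc = random.getrandbits(32)
            return "bye-and-rejoin"
        if self.table[ssrc] != addr:
            return "dropped"                   # third-party SSRC collision
        return "accepted"

p = Participant(own_ssrc=1111)
print(p.on_packet(2222, ("10.0.0.2", 5000)))  # accepted
print(p.on_packet(2222, ("10.0.0.9", 5000)))  # dropped (address mismatch)
print(p.on_packet(1111, ("10.0.0.7", 5000)))  # bye-and-rejoin
```

The key distinction is between a third-party collision, which the receiver resolves locally by dropping packets, and a collision with its own SSRC, which forces it to re-identify itself.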

RTP also defines CSRC identifiers, which are often used in conjunction with a mixer. In that case, the SSRC of a packet is that of the mixer, and the CSRC entries contain the SSRCs of the participants who contributed to the packet, i.e., whose data is transmitted in it.


3.3.3 Mixers and Translators

Varying internet link speeds pose challenges to a media streaming session. If one or more participants are behind a low-speed link, they may dictate the audio and video quality for the whole session, which is often much worse than what most participants would be able to serve. Another problem is reachability: some participants may be behind an application-level firewall which blocks all IP traffic. To address these issues, RFC 3550 introduced two concepts: mixers and translators.

Translators can be used to bypass application-level firewalls, or to act as relay nodes for those participants whose bandwidth is low.

A node can then send its RTP packets to a translator, which forwards them to all other participants, provided the translator has enough bandwidth. Figure 12 shows how translators work in a multicast RTP session.

Figure 12: RTP multicast session with two translators

In Figure 12, an RTP multicast session is divided into two parties by an application-level firewall (orange block). The multicast sender is not able to penetrate the firewall, so instead it sends the RTP packets to the reachable participants and to a translator which is connected through the firewall to another translator. This second translator then sends the RTP packets to the rest of the participants. All RTP participants receive the original packet without any modifications added by the translators.

Translators are invisible components of an RTP session, as they do not leave any marks in the CSRC field of the RTP packet. In Figure 12, even though the participants of the local multicast group receive the packet from the translators, the SSRC field of the RTP packet contains the multicast sender’s SSRC.

Mixers are similar to translators, but instead of only forwarding packets to other participants, mixers re-encode and/or merge incoming streams. This can help participants with low-speed links to join a session. Mixers have their own SSRC, and packets sent by a mixer have their CSRC list filled with the SSRC values of the participants that contributed to the packet. Figure 13 shows an example of mixer functionality.


Figure 13: Mixer and two senders

In Figure 13, two senders send RTP packets directly to a mixer. Because the receiver does not have enough bandwidth to receive all media streams, the mixer re-encodes the incoming streams with a lower sampling rate and merges the two streams into one. The packet produced by the mixer has the SSRC of the mixer, and the CSRC list contains the SSRC values of the two senders.

Mixers and translators might introduce loops into the session if they mistakenly forward packets to the same multicast address the packets originated from. This can also happen indirectly, by forwarding packets to another translator which then forwards the packets back to the network they came from. Loops and SSRC collisions are detected the same way, as both cause RTP packets to have the same SSRC but a different transport address. Algorithm 1 shows how a looped packet is dealt with.

3.4 RTP Control Protocol

RTCP is an addition to RTP that sends periodic reports to the other participants of a session to provide information about channel conditions and the participants themselves, and to perform session management.

The most important responsibility of RTCP is to monitor and report channel conditions to other participants of the session. This is done by estimating the number of dropped packets, measuring the overall jitter of received packets, and estimating the propagation delay of packets to gain information about the distance between participants. RTCP reports this information periodically to those participants from whom it receives packets, so that they can adjust their sending parameters if the report indicates packet loss or high jitter. As RTP is transported over UDP and QoS is not provided by the transport protocol, RTCP provides an Out-of-Band (OOB) channel for QoS monitoring and adjustment.

Another important responsibility of RTCP is to provide additional information about media streams and their senders. RTCP can be used to distribute the Canonical Name (CNAME) of each participant, which can be used to group media streams with different SSRCs under one sender, as the CNAME is required to be unique whereas the SSRC may change due to SSRC conflicts. Knowing the CNAME of a participant, and thus all SSRCs that belong to it, a receiver can perform inter-media synchronization using the NTP-RTP timestamp pairs provided in Sender Reports.


The final responsibility of RTCP is to provide session management and bandwidth usage control. RTCP is used to accept or reject participants joining a session and, in doing so, it must control the used bandwidth carefully so as not to saturate the sender’s upload bandwidth as the number of participants in the session grows. RTCP also monitors SSRC conflicts and loops introduced by misconfigured RTP mixers and translators.

3.4.1 RTCP Sender Report

RTP session participants are divided into two categories: senders and receivers. Senders either only send, or both send and receive RTP packets, whereas receivers only receive packets. Senders transmit Sender Reports, whereas receivers transmit Receiver Reports to all senders.

These Sender and Receiver Reports provide QoS monitoring and synchronization information for the session. Based on the reported values, senders can downgrade the quality of the video they are sending if, for example, the number of packets reported lost by the participants is high. Figure 14 shows the format of an RTCP Sender Report.

Figure 14: RTCP Sender Report

The first two fields, version and padding, are the same as in RTP packets. The RC field tells the report count of this Sender Report, which denotes how many report blocks the packet contains. The PT field tells the RTCP packet type, which is 200 for Sender Reports. The length field contains the length of the Sender Report in 32-bit words minus one.


What follows are the sender’s own gathered statistics, starting with the sender’s SSRC.

As the RTP clock is monotonic and its initial value is random, it is very hard to synchronize different media streams using RTP timestamp data alone. To make synchronization easier, the Sender Report contains both an NTP and an RTP timestamp that point to the same instant in time.

The two following fields are the Most Significant Word (MSW) and Least Significant Word (LSW) of an NTP timestamp, both 32 bits long. The next field is the RTP equivalent of that NTP timestamp. Using this RTP value and the associated NTP value, the receiver can synchronize media streams by converting the RTP timestamps received in RTP packets to NTP timestamps and then either advancing or delaying playback based on the calculations.
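That conversion can be sketched as follows, assuming one NTP/RTP timestamp pair has been taken from the latest Sender Report and the RTP clock rate is known (90 kHz is the usual video clock). This is an illustrative Python sketch; the function name is hypothetical.

```python
def rtp_to_ntp(rtp_ts, sr_rtp_ts, sr_ntp_seconds, clock_rate):
    """Map an RTP timestamp to wall-clock time using the NTP/RTP pair
    carried in the most recent Sender Report."""
    elapsed = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF   # RTP timestamps wrap at 2**32
    return sr_ntp_seconds + elapsed / clock_rate
```

With a 90 kHz clock, a packet 4500 RTP ticks after the Sender Report pair maps to 0.05 seconds after the report’s NTP time.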

The next fields denote how many packets and octets have been sent; these values can be used to adjust the packet and byte loss estimates of the receiver. These fields conclude the sender statistics. What follows is a report block for each sender from which this participant has received packets; there are as many report blocks as the RC field at the beginning of the packet indicates.

The report blocks start with the SSRC value of the sender from whom the packets have been received.

The next 32-bit field contains two values: the fraction of packets lost and the cumulative number of packets lost. Receivers constantly estimate how many packets are lost; the number of packets expected over a certain time period can be calculated from RTP sequence numbers.

The fraction lost field gives the number of lost packets divided by the number of expected packets since the last sent Sender or Receiver Report.

The cumulative packet loss reports the total packet loss since the start of the session. It is calculated by subtracting the number of packets received from the number of packets expected.

The next field tells the highest sequence number received, including wrap arounds. The lowest 16 bits give the actual highest sequence number received and the upper 16 bits contain the num- ber of times the sequence number counter has wrapped around.
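The loss statistics and the extended highest sequence number described above can be sketched as follows. This is an illustrative Python sketch, not uvgRTP’s implementation; the function names are hypothetical.

```python
def extended_highest_seq(cycles, highest_seq16):
    """Pack the wrap-around count (upper 16 bits) and the highest
    sequence number received (lower 16 bits) into the 32-bit field."""
    return (cycles << 16) | highest_seq16

def loss_statistics(expected, received, expected_prev, received_prev):
    """Cumulative loss and the per-interval fraction lost reported in
    SR/RR report blocks, from current and previous packet counters."""
    cumulative_lost = expected - received
    expected_interval = expected - expected_prev
    received_interval = received - received_prev
    lost_interval = expected_interval - received_interval
    # fraction lost is an 8-bit fixed-point value: lost / expected * 256
    if expected_interval <= 0:
        fraction = 0
    else:
        fraction = max(0, (lost_interval << 8) // expected_interval)
    return cumulative_lost, fraction
```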

The next field tells the statistical variance of RTP packet interarrival time. It is a smoothed absolute value of the difference in packet spacing at the receiver compared to the sender. Each RTP receiver has a separate clock running in RTP timestamp units for each incoming RTP stream. This enables the receiver to estimate the interarrival jitter by comparing its internal RTP clock values to the timestamp values in incoming RTP packets.

Let Si be the RTP timestamp of packet i and Ri the time of arrival at receiver in RTP timestamps for that same packet. The difference in packet spacing D for packets i and i+1 can be calculated as follows:

D(i, i+1) = (Ri+1 − Ri) − (Si+1 − Si) = (Ri+1 − Si+1) − (Ri − Si). (1)

The interarrival jitter for packet i can be calculated from Formula 1, the previous packet i−1, and the following formula:

J(i) = J(i − 1) + (|D(i − 1, i)| − J(i − 1)) / 16. (2)
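Formulas 1 and 2 can be sketched as a running estimate over send/receive timestamp pairs, both expressed in RTP timestamp units. This is an illustrative Python sketch; the function name is hypothetical.

```python
def interarrival_jitter(send_ts, recv_ts):
    """Running jitter estimate: for each consecutive packet pair compute
    D per Formula 1 and update J per Formula 2."""
    jitter = 0.0
    for i in range(1, len(send_ts)):
        d = (recv_ts[i] - send_ts[i]) - (recv_ts[i - 1] - send_ts[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter
```

For example, three packets sent 3000 ticks apart but received with spacings of 3000 and 3100 ticks yield |D| values of 0 and 100, giving a jitter estimate of 100/16 = 6.25.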

The next field in the Sender Report is the last SR timestamp (LSR), which contains the middle 32 bits of the NTP timestamp of the most recent Sender Report received from this source. The field is set to zero if no Sender Report has been received.

The last field of the Sender Report, the delay since last SR (DLSR), tells the time between receiving the last report and sending this report, expressed in units of 1/65536 seconds. These values are used to estimate the round-trip propagation delay by subtracting LSR and DLSR from the reception time of this report, which allows senders to approximate their distance to receivers.
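The round-trip estimate described above can be sketched as follows, with all three inputs expressed in the same 1/65536-second units (the middle 32 bits of an NTP timestamp). This is an illustrative Python sketch; the function name is hypothetical.

```python
def round_trip_time(arrival, lsr, dlsr):
    """RTT estimate at report reception: arrival time, the LSR field and
    the DLSR field, all in 1/65536-second units; returns seconds."""
    rtt_units = arrival - lsr - dlsr
    return rtt_units / 65536.0
```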


3.4.2 RTCP Receiver Report

RTCP Receiver Report is the second periodically sent report. This report is utilized by those participants that do not themselves send any data but only receive data from other participants.

RTCP Receiver Report is sent to every participant from whom the receiver receives data. Each time an RTP packet is received, sender statistics are updated for that participant and when the RTCP timeout period has ended, an RTCP Receiver Report is generated which contains an entry for each sender. Figure 15 shows the format of an RTCP Receiver Report.

Figure 15: RTCP Receiver Report

The format of the RTCP Receiver Report is very similar to that of the Sender Report, but because these reports are sent by participants that do not send RTP data, the report omits the sender statistics. The fields of the packet are explained in the previous section.

3.4.3 RTCP Source description

Some applications may want to distribute additional information about themselves to other participants of the session, for example the real name, email address, or geographical location of the participant. This information is sent in an RTCP SDES packet; Figure 16 shows the format of such a packet.


Figure 16: RTCP Source Description

The first three fields contain the version, whether the packet contains padding, and the source count of the packet, respectively. The payload type for SDES packets is 202. The length field tells the length of the whole packet in 32-bit words minus one.

What follows is the SSRC of the participant that the following SDES items describe. Thus, one SDES packet can describe multiple different participants. Each SDES item is aligned to a 32-bit boundary, starts with an 8-bit identifier field, followed by an 8-bit length field and finally containing the SDES item data, encoded using UTF-8 [32].

Every SDES packet must contain the CNAME item for each participant, which further distinguishes the participants from each other. Other possible items are NAME, EMAIL, and PHONE, which give the real name, email address, and phone number of the participant, respectively.

3.4.4 RTCP Goodbye

Before a participant terminates the session, it is expected to send an RTCP BYE packet to all members of the session to inform them that it is exiting. Upon receiving a BYE packet, the receiver removes the sender from its participant table. RTCP has participant timeout monitoring capabilities, so it is not strictly necessary for a participant to send a BYE packet to be excluded from the participant tables of receivers. Figure 17 shows the packet format of an RTCP BYE report.

Figure 17: RTCP BYE

The first three fields contain the version, whether the packet contains padding, and the source count (SC) of the packet, respectively. The fourth field tells the type of this packet, which is 203 for RTCP BYE packets. The length field tells the length of the packet in 32-bit words minus one.

What follows is one or more fields containing SSRC values; the SC field tells how many SSRC entries follow. Typically, there is only one SSRC, the sender’s, but if a mixer leaves a session, its RTCP BYE packet contains its own SSRC and the SSRC values of all those participants that have been contributing to the mixer’s packets.


The last two fields of the packet are optional and contain the length of the message and a message telling why the participant is leaving. This could be, for example, the ASCII string “microphone malfunction”.

3.4.5 RTCP Application-specific message

As new features and applications are implemented, a need may arise for experimental RTCP packet types providing functionality not covered by Sender Reports, Receiver Reports, or Source Descriptions. RTCP APP packets allow applications to exchange any kind of data, and if a new packet type is deemed valuable, it may be registered as an official IANA-acknowledged RTCP packet type. Figure 18 shows the packet format of an RTCP APP packet.

Figure 18: RTCP Application-specific message

The RTCP APP packet starts with the version and padding bit fields. The field that follows tells the subtype of the RTCP APP packet, which can be used by the application to distinguish different kinds of RTCP APP packets. The payload type field contains the value for RTCP APP packets, which is 204, and the length field tells the packet length in 32-bit words minus one. The SSRC/CSRC field identifies the sender of the packet.

The name field contains a 4-character name chosen by the person defining the set of RTCP APP packets. It could contain, for example, the name of the application which uses these experimental RTCP APP packets and handles the allocation of subtypes.

The last field of the packet is the actual data. The byte count of the data can be calculated from the length of the whole packet.

3.4.6 RTCP Transmission Interval

Consider an audio-only, unicast RTP session with several hundred participants. There is only one RTP sender at a time, but several hundred RTP receivers. At the same time, however, there are several hundred RTCP senders and receivers, all transmitting Sender and Receiver Reports constantly. This QoS signaling can, in the worst case, saturate the current speaker’s upload bandwidth or some other participant’s download bandwidth and cause noticeable audio degradation as packets get dropped. To fix this issue, RTCP specifies carefully considered transmission intervals such that RTCP never interferes with RTP sending, and RTCP senders are favored over RTCP receivers so that all receivers can gather the most essential information about the senders and perform inter-media synchronization if needed. Algorithm 2 shows pseudocode for the RTCP transmission interval calculation.


Algorithm 2: RTCP interval calculation

As Algorithm 2 shows, the RTCP report transmission interval depends on several parameters, some of them changing as the session progresses. The allocated bandwidth for RTCP (rtcp_bw) is an application-defined value that gives the total bandwidth RTCP can use during the session, in units of kilobytes per second. The average RTCP packet size (rtcp_pkt_size) is updated constantly during the session and includes the IP and UDP headers on top of the RTCP headers and data. Using the average RTCP packet size in the calculation allows the algorithm to adapt to changes in RTCP payload data.

The calculation starts by checking how large a portion of the members are RTP senders. If the share of senders is at most 25%, the algorithm favors senders in its bandwidth usage calculation by giving them precedence over RTP receivers. If more than 25% of the members are senders, all members of the session are treated equally.

The algorithm then proceeds to set a guaranteed lower bound for the transmission interval, denoted by t_d, in case the constants C and n yield too small a value. For example, if an application had set the RTCP bandwidth to 200 kilobytes per second, the session had 10 members, and the average RTCP packet size were a realistic 200 bytes, the transmission interval would be 0.01 seconds. This would result in far more signaling than is needed. That is why the lower bound is set to 2.5 seconds for new participants that have not yet sent any RTCP data and to 5 seconds for other participants.


Once the deterministic transmission interval t_d has been selected, the algorithm proceeds to account for accidental but possible synchronization of session members. The problem with synchronization is as follows. The core assumption is that RTCP bandwidth usage is distributed smoothly over the entire RTCP transmission interval: even if all participants sent a packet every two seconds, the packet transmissions would spread over the entire interval, i.e., there would be no spikes. Such transmission spikes, with multiple participants sending packets at the exact same time, could cause congestion and, in the worst case, packet loss. As participants may observe the session parameters at the same time and thus calculate the exact same deterministic transmission interval, this would lead to accidental synchronization and the issues associated with it.

Sally Floyd and Van Jacobson demonstrated [36] that introducing randomization to the transmission interval negates the problems associated with accidental synchronization. RTCP implements this by multiplying the deterministic transmission interval by a random number uniformly distributed in the range [0.5, 1.5].

When a group of nodes, counted in the order of thousands, joins a session simultaneously, each of them observes the session as containing from zero to a few participants when in reality it contains thousands, as all the new nodes are also members of the session. This causes a massive influx of RTCP packets, as the new members all send their first packet within 3.75 seconds of joining the session. In low-bandwidth networks, this will saturate the entire bandwidth and cause several packets to be dropped, which further distorts the interval calculation as RTCP packets from all participants are not received.

To counter this, RTCP uses what is called timer reconsideration, a back-off mechanism for the RTCP interval timer. As RTCP packets are sent periodically, a timer is used to schedule packet transmissions. Under the timer reconsideration scheme, when the timer has run out and it is time to send an RTCP packet, the scheduler first recalculates the RTCP transmission interval using the newest data and checks again whether it actually is time to send an RTCP packet. If so, the packet is sent; otherwise, the timer is restarted, and the packet is sent later.

This mechanism causes RTCP to use slightly less than the allocated bandwidth, which is compensated for by dividing the calculated interval by e − 3/2 ≈ 1.21828. [37]
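The interval calculation described above can be sketched roughly as follows. This is an illustrative Python sketch that follows the prose description (cf. RFC 3550); the bandwidth is simplified to bytes per second here and the function name and parameters are hypothetical.

```python
import math
import random

def rtcp_interval(members, senders, rtcp_bw, we_sent, avg_pkt_size, initial):
    """Sketch of the RTCP transmission interval calculation: rtcp_bw and
    avg_pkt_size are in bytes (per second / per packet); returns seconds."""
    if senders <= 0.25 * members:
        # Senders share 25 % of the RTCP bandwidth, receivers the rest.
        if we_sent:
            n, bw = senders, 0.25 * rtcp_bw
        else:
            n, bw = members - senders, 0.75 * rtcp_bw
    else:
        n, bw = members, rtcp_bw
    t_min = 2.5 if initial else 5.0            # guaranteed lower bound
    t_d = max(t_min, n * avg_pkt_size / bw)    # deterministic interval
    t = t_d * random.uniform(0.5, 1.5)         # avoid accidental synchronization
    return t / (math.e - 1.5)                  # compensate timer reconsideration
```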

3.5 HEVC Streaming Over RTP

While RFC 3550 only specifies the basic format of an RTP packet, it leaves the format-specific packetization rules for other specifications to define. HEVC streaming over RTP has been defined in RFC 7798 [33] which specifies the payload format and packetization rules.

RFC 7798 defines three different packet types for transporting Network Abstraction Layer (NAL) units efficiently: single NAL unit packets, aggregation packets, and fragmentation units (FU). The Maximum Transfer Unit (MTU) specifies the maximum allowed size of an Ethernet frame. Usually, the MTU is set to 1500 bytes, but Ethernet also supports jumbo frames with a size of 9000 bytes [34]. To get the maximum payload size for an RTP packet, the Ethernet, IP, UDP, and RTP headers are subtracted from 1500 bytes, giving 1446 bytes of payload data for an IPv4 packet.

3.5.1 Single NAL Unit Packets

The single NAL unit packet is the simplest of the three formats. Each packet contains exactly one NAL unit, and any NAL unit less than 1446 bytes long can be sent as a single NAL unit packet. Figure 19 shows the format of such a packet.


Figure 19: Single NAL unit packet

The format is very simple. The F bit must always be zero, and the Type field tells the packet type of an RFC 7798-compatible RTP packet; for single NAL unit packets, the type is the type of the NAL unit being transported. The Layer ID is required to be zero, but in the future it may have use cases with scalable or 360-degree video. TID tells the temporal ID of the NAL unit plus 1.

For this packet format, the NAL header can be copied directly from the NAL unit.

The optional DONL field can be used to signal the 16 least significant bits of the decoding order number for this NAL unit. It can be useful, for example, in the case of H.264’s MTAP aggregation packets, where the sender sends interleaved slices of multiple pictures [35]; in that case, the receiver must pass the NAL units to the decoder in the correct order. If the stream contains only temporally adjacent NAL units, DONL is not necessary.
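The payload header layout described above can be sketched as a simple bit-unpacking routine. This is an illustrative Python sketch of the 16-bit field layout (F, Type, Layer ID, TID); the function name is hypothetical.

```python
def parse_nal_header(hdr):
    """Split a 16-bit HEVC payload header into its fields:
    F (1 bit), Type (6 bits), LayerId (6 bits), TID (3 bits)."""
    f       = (hdr >> 15) & 0x1
    ntype   = (hdr >> 9)  & 0x3F
    layerid = (hdr >> 3)  & 0x3F
    tid     =  hdr        & 0x7
    return f, ntype, layerid, tid
```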

3.5.2 Aggregation Packets

HEVC contains a few NAL units that are smaller than the 1446-byte limit derived from the MTU. While these NAL units could be sent as individual single NAL unit packets, it is more efficient to pack them together and send one larger RTP packet that contains all of these small NAL units. To support this functionality, RFC 7798 specifies aggregation packets. Figure 20 shows the format of an aggregation packet with two NAL units packed together.

Figure 20: Aggregation packet


The format of the packet is very similar to a single NAL unit packet. The biggest difference is that there is a top-level NAL header which has the value 48 as its NAL type, denoting an aggregation packet. This allows the receiver to distinguish between single NAL unit packets and aggregation packets.

Then, for each NAL unit, there is a 16-bit field for the size of the NAL unit, followed by the NAL unit header, copied directly from the input NAL unit, and the NAL unit data. Each NAL unit must have its own size field because the individual sizes may vary and cannot be inferred from the RTP packet size.
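Building an aggregation-packet payload from small NAL units can be sketched as follows. This is an illustrative Python sketch (the RTP header itself is omitted, and the function name and default field values are hypothetical).

```python
import struct

def build_aggregation_payload(nal_units, layer_id=0, tid=1):
    """Pack small NAL units into one aggregation-packet payload: a
    payload header with type 48, then per unit a 16-bit size field
    followed by the unit itself (its own header included)."""
    hdr = (48 << 9) | (layer_id << 3) | tid   # F bit = 0
    out = struct.pack(">H", hdr)
    for nal in nal_units:
        out += struct.pack(">H", len(nal)) + nal
    return out
```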

3.5.3 Fragmentation Units

If a NAL unit is larger than 1446 bytes, it must be split into smaller pieces to improve the likelihood of reliable transmission. RFC 7798 provides a way to do this by specifying a format for FUs. Figure 21 shows the format of an FU.

Figure 21: Fragmentation unit

An FU packet starts with a NAL header whose NAL type is set to 49 so the receiver can distinguish FUs from single NAL unit and aggregation packets. After the NAL header there is an FU header, which contains a Start (S) bit, an End (E) bit, and the FU type. The S bit tells whether the fragment is the first fragment of a NAL unit, the E bit tells whether it is the last, and the FU type gives the NAL type of the fragmented NAL unit. The FU payload data follows right after the FU header. An FU header must never have both the S and E bits set at the same time.

RFC 7798 further requires that all FUs of a fragmented NAL unit have the same timestamp, and RFC 3550 requires that sequence numbers are incremented by one.

Using the timestamps and sequence numbers of the packets that had the S and E bits set, the receiver can calculate how many FUs the fragmented NAL unit was split into. When the first, the last, and every fragment in between have been received, the receiver can concatenate the FU payloads into one large frame, essentially reconstructing the fragmented NAL unit. If even one fragment is lost during transmission, the whole NAL unit must be dropped, as it cannot be reconstructed.
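Fragmentation along these rules can be sketched as follows. This is an illustrative Python sketch, not uvgRTP’s implementation; the function and its parameters are hypothetical.

```python
def fragment_nal_unit(nal, max_payload=1446):
    """Split one oversized NAL unit into FU payloads. Each fragment gets
    a 2-byte payload header (type 49, LayerId/TID copied from the NAL
    unit), a 1-byte FU header with the S/E flags and the original NAL
    type, and a slice of the NAL unit body; the original 2-byte NAL
    header is not repeated inside the fragments."""
    nal_type = (nal[0] >> 1) & 0x3F
    payload_hdr = bytes([(nal[0] & 0x81) | (49 << 1), nal[1]])
    body = nal[2:]
    chunk = max_payload - 3                   # room left by the two headers
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    fragments = []
    for i, piece in enumerate(pieces):
        s = 0x80 if i == 0 else 0             # Start bit
        e = 0x40 if i == len(pieces) - 1 else 0   # End bit
        fragments.append(payload_hdr + bytes([s | e | nal_type]) + piece)
    return fragments
```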


4. SECURITY

Security is an essential part of any information system, real-time video streaming not being an exception. Both real-time and non-real-time streaming solutions have various security features available, from encryption to packet authentication.

For RTP streaming, SRTP is used to encrypt the traffic between participants. SRTP provides only the higher-level encryption and authentication mechanisms but leaves the actual key management and negotiation unspecified. There are two common ways of providing key management for an SRTP-compatible application: Multimedia Internet KEYing (MIKEY) and ZRTP. Of the two, ZRTP has gained popularity in unicast sessions and is the focus of this thesis.

Cryptographic ciphers can be roughly divided into two categories: symmetric and asymmetric encryption. In asymmetric encryption, encryption and decryption are done using different keys, whereas in symmetric encryption the same key is used for both operations. Both have their use cases and benefits over one another. RTP implementations wanting to provide security features through SRTP and ZRTP use both encryption modes, utilizing Elliptic-Curve Cryptography (ECC) and Diffie-Hellman key exchange for key negotiation and providing ciphering through the Advanced Encryption Standard (AES) and Twofish.

4.1 Asymmetric Encryption

As stated, in asymmetric encryption, or so-called public-key cryptography, the keys used for encryption and decryption are different. Asymmetric encryption allows, for example, the implementation of digital signatures and thus provides a backbone for HTTP over Transport Layer Security (TLS), data encryption, and key exchange over an insecure channel. The need for asymmetric encryption is quite obvious: it is not always possible to establish a secure channel over which a key can be transferred. Asymmetric encryption fixes this problem by providing endpoints a way to exchange keys, encrypt data, or sign data using information that is publicly available.

Figure 22 shows the high-level flow of a public-key based cryptography routine.

Figure 22: Public key cryptography cipher flow [38]

As Figure 22 shows, encryption and decryption happen using different keys. The basic idea behind public-key cryptography is that instead of having one privately kept key, a person has two keys: one which can be shared publicly and one that must be kept strictly private. Bob generates a key pair for himself and publishes the public key of the pair, for example, on his website. If


Alice wishes to send Bob a message only he can decrypt, she encrypts the message using Bob’s public key. When Bob receives the message, he can decrypt it using his private key. This relationship also works in reverse: Bob can sign a document with his private key, and Alice can verify that it truly is Bob’s signature using the public key provided by Bob.

These keys share a mathematical relationship that allows present-day cryptographic features such as key exchange over an insecure channel, digital signatures, and public-key based data encryption. In the following sections, three key concepts relating to asymmetric encryption are discussed in more detail.

4.1.1 Rivest–Shamir–Adleman

Rivest-Shamir-Adleman (RSA) [39] is one of the most popular public-key cryptosystems, built upon a mathematical concept called modular exponentiation. There are four requirements given to any public-key based cryptosystem:

• decrypting an encrypted message gives the original message
o formally, D(E(M)) = M

• encrypting a decrypted message gives the original message
o formally, E(D(M)) = M

• E and D are easy to compute

• sharing E publicly does not compromise the privacy required for D

RSA fulfills all these requirements by using relatively cheap computational operations and by deriving the encryption and decryption keys in a way that does not compromise the privacy of the decryption key when the encryption key is shared.

Encryption with RSA is set up by first choosing a pair of positive integers (e, n). These two numbers form the public key of the RSA system and are used only for enciphering the message M. The message M is expressed as an integer or, if too long, as a series of integers. Likewise, decryption is done using a pair of positive integers (d, n), where n is the same for both operations.

Encryption is done by raising the message M to the power e modulo n and decryption is done by raising the ciphertext C to the power d modulo n. Mathematically expressed:

C = E(M) = M^e mod n (3)

M = D(C) = C^d mod n (4)

As the encryption and decryption routines of RSA are simple, the crux of the system lies in selecting secure values for e, d, and n. Understanding how they are derived requires an understanding of modular exponentiation and prime numbers.

Euler’s totient function, denoted as φ(q), gives the number of positive integers up to q that are relatively prime to q, i.e., the count of numbers from 1 to q whose only common divisor with q is 1. For example, φ(8) = 4, as 8 is relatively prime to 1, 3, 5, and 7. For prime numbers:

φ(q) = q − 1 (5)

as prime numbers are only divisible by one and themselves. The totient function is multiplicative, so for distinct primes q and p:

φ(qp) = φ(q) ∗ φ(p) = (q − 1) ∗ (p − 1) (6)

The function gcd(n, m) gives the greatest common divisor of a pair of numbers; for example:

gcd(17, 13) = 1 (7)


as two distinct prime numbers have only one common divisor, the number 1. Using the information presented above, the rationale for the public and private key derivation of RSA can be presented.

To calculate n, a user first selects two large primes by generating random numbers. The authors of [39] recommend using 100-digit primes; according to the prime number theorem, the process of selecting two random 100-digit primes takes around 461 tries [40]. In 2021, the recommendation is to use numbers of at least 3072 bits, i.e., 384 bytes [41].

When a user has selected values for q and p, n is calculated as:

n = q ∗ p (8)

This makes n publicly shareable without compromising security, as factoring n into q and p is infeasible with current-day computers. A user must never reveal the values of q and p, only n, as otherwise the security of the system is lost.

Calculating d is done by selecting a positive integer larger than max(q, p) that is relatively prime to φ(n), meaning:

gcd(d, φ(n)) = 1 (9)

gcd(d, φ(q) ∗ φ(p)) = 1 (10)

gcd(d, (q − 1) ∗ (p − 1)) = 1 (11)

When a suitable value for d has been selected, e is calculated as the multiplicative inverse of d modulo φ(n), that is:

e ∗ d = 1 mod φ(n) (12)

Now that the values for e, d, and n have been calculated, a user can encrypt messages using Formula 3 and decrypt them using Formula 4. For example, a user might take the ASCII value of each character of a string, encrypt the characters individually, and then concatenate the encrypted values into a string to get a ciphertext. Likewise, the receiver would then decrypt the message one character at a time.
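The whole derivation can be walked through with deliberately tiny toy numbers. This is an illustrative Python sketch following the order used above (pick d first, then derive e); real RSA requires primes of at least 3072 bits.

```python
import math

# Toy numbers only; never use primes this small in practice.
q, p = 61, 53
n = q * p                     # Formula 8: n = 3233
phi = (q - 1) * (p - 1)       # Formula 6: phi(n) = 3120

d = 2753                      # chosen larger than max(q, p), Formula 9
assert math.gcd(d, phi) == 1
e = pow(d, -1, phi)           # multiplicative inverse, Formula 12

message = 65                  # e.g. the ASCII value of 'A'
cipher = pow(message, e, n)   # Formula 3
plain = pow(cipher, d, n)     # Formula 4: recovers the message
```

Python’s three-argument pow performs the modular exponentiation, and pow with exponent −1 (Python 3.8+) computes the modular inverse.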

4.1.2 Diffie-Hellman Key Exchange

The Diffie-Hellman key exchange protocol [42] is used to establish a shared secret between two participants over an insecure channel by leveraging very large prime numbers and modular exponentiation. On a high level, the key exchange works by selecting a privately kept secret value, combining it with a publicly available value, exchanging the combinations, and applying the private secret value to the received combination. Figure 23 shows the Diffie-Hellman key exchange using a paint color analogy.


Figure 23: Diffie-Hellman key exchange using paint colors [43]

In Figure 23, Alice and Bob combine their private paint colors with a public color to create a mixture that is then shared with the other person. Here the underlying assumption is that separating the original colors from the mixture is very hard. Once they have received each other’s paint mixtures, they add their own private paint to the mixture and thus create a common secret.

Mathematically, what happens is that Alice and Bob agree on a public key pair (g, p), where g is called the generator and p is called the prime modulus. When this public key pair has been chosen, Alice and Bob select private keys A_S and B_S, respectively, which are recommended to be of the same length as RSA private keys [41]. When the keys have been selected, Alice calculates:

𝐴_P = 𝑔^(𝐴_S) mod 𝑝 (13)

where g is the generator, A_S is Alice’s secret key, and p is the prime modulus. Bob uses the same formula to calculate his publicly shareable value:

𝐵_P = 𝑔^(𝐵_S) mod 𝑝 (14)

When these have been calculated, Alice and Bob exchange the public values with each other, and then, for example, Alice can calculate the shared secret K_AB using:

𝐾_AB = 𝐵_P^(𝐴_S) mod 𝑝 (15)

𝐾_AB = (𝑔^(𝐵_S))^(𝐴_S) mod 𝑝 (16)

𝐾_AB = 𝑔^(𝐵_S ∗ 𝐴_S) mod 𝑝 (17)


Formula 17 shows that the calculation of K_AB is the same for both Alice and Bob, and thus they have established a shared secret K_AB over an insecure channel using modular exponentiation. A malicious third party could find K_AB by calculating:

𝐾_AB = 𝐵_P^(log_𝑔(𝐴_P)) mod 𝑝 (18)

This is, however, infeasible for large values of p with current-day computers, which is where the security of the Diffie-Hellman key exchange stems from.
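Formulas 13–17 can be sketched in a few lines of code. The prime modulus below is a toy value chosen for readability; real exchanges use moduli of at least 3072 bits:

```python
import secrets

# Public parameters: a toy prime modulus p and generator g.
p, g = 23, 5

# Private keys A_S and B_S, kept secret by Alice and Bob.
a_s = secrets.randbelow(p - 2) + 1
b_s = secrets.randbelow(p - 2) + 1

a_p = pow(g, a_s, p)  # Alice's public value (Formula 13)
b_p = pow(g, b_s, p)  # Bob's public value (Formula 14)

# After exchanging a_p and b_p, both ends derive the same secret
# (Formulas 15-17) without ever transmitting a_s or b_s.
k_alice = pow(b_p, a_s, p)
k_bob = pow(a_p, b_s, p)
assert k_alice == k_bob
```

An eavesdropper sees only p, g, a_p, and b_p; recovering the secret requires solving the discrete logarithm of Formula 18.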

There is one issue, though, that both RSA and regular Diffie-Hellman suffer from: the size of the keys. As computers get more efficient, factoring large numbers or solving discrete logarithms becomes more and more feasible. From an RSA/Diffie-Hellman standpoint, this can be temporarily mitigated by using larger key sizes, and currently the recommended key size is at least 3072 bits [41]. Unfortunately, key sizes this large make the process computationally more expensive and, as [44] shows, decryption time grows exponentially with key size, so increasing the size is not going to be a long-term solution. That is why ECC has gained popularity over non-ECC solutions such as traditional Diffie-Hellman.

4.1.3 Elliptic-curve Cryptography

An alternative to cryptography based on modular exponentiation is elliptic-curve cryptography, which uses algebraic curves to derive cryptographic keys. Figure 24 shows an example of an elliptic curve over real numbers.

Figure 24: Elliptic curve [45]

The curve in Figure 24 takes the form:

𝐸_(𝑎,𝑏): 𝑦² = 𝑥³ + 𝑎𝑥 + 𝑏 (19)

where a and b are the parameters defining the curve. Several different curves are available and, for example, SSH uses [46] twisted Edwards curves [47] as an alternative to RSA keys.

When parties wish to use ECC to establish a shared secret, they must agree on a set of values called domain parameters. The parameter set, (p, a, b, G, n, h), specifies everything needed to calculate the shared secret. Parameter p specifies the finite field F_p, a and b define the curve
