Real-time Transport Protocol

3. MEDIA STREAMING

3.3 Real-time Transport Protocol

The previously introduced streaming protocols, while providing valuable service in their domain, are not suitable for real-time peer-to-peer video streaming because they are designed with a client/server model in mind. This model does not fit, for example, video conferencing software where all participants are equal and both send and receive data.

Real-time Transport Protocol (RTP) is a protocol for streaming real-time video, audio and even text. It is defined in RFC 3550 [7], published in 2003, which describes the overall format of RTP packets. RTP is a very widely adopted real-time streaming protocol and is often used in Voice over IP (VoIP) applications to carry video and audio streams. RFC 3550 is divided into two parts: the first describes RTP itself and the second describes the RTP Control Protocol (RTCP), which is the more complex of the two.

RTP specifies useful concepts such as Synchronization Sources (SSRC), which can be used to differentiate senders; extension headers, which allow implementations to experiment with payload-format-independent functionality; and mixers and translators, which have use cases especially in low-bandwidth networks.

3.3.1 RTP Packet Format

Figure 10 shows the packet format for RTP.

Figure 10: RTP Packet format

An RTP packet consists of a few mandatory header fields, and the rest of the 1500-byte datagram is available for the payload. Media formats may define additional headers of their own, which are placed in the payload area of the packet.

The first field is V, which defines the version of RTP in use. For RFC 3550, this field must be set to 2. If a received packet does not have 2 in the version field, it must be discarded.

The second and third fields are the Padding (P) and Extension (X) bits, respectively. If the padding bit is set, the payload of the packet is padded, and the last byte of the payload gives the padding length. These padding bytes are not part of the payload and must be removed.
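The padding rule can be illustrated with a minimal Python sketch. The helper name `strip_padding` is hypothetical, and the sketch assumes, as RFC 3550 specifies, that the length stored in the last byte counts that byte itself.

```python
def strip_padding(payload: bytes, padding_bit: bool) -> bytes:
    """Return the payload without RTP padding (hypothetical helper)."""
    if not padding_bit:
        return payload
    if not payload:
        raise ValueError("padding bit set on an empty payload")
    pad_len = payload[-1]  # last byte counts the padding bytes, itself included
    if pad_len == 0 or pad_len > len(payload):
        raise ValueError("invalid padding length")
    return payload[:-pad_len]


# Three padding bytes: two zero bytes plus the length byte.
print(strip_padding(b"abc\x00\x00\x03", padding_bit=True))
```

A receiver would typically treat an invalid padding length as a malformed packet and discard it; raising an exception here is only one way to surface that.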

The extension bit defines whether the packet has any extension headers present. If the bit is set, the extension headers are located right after the Contributing Source (CSRC) entries.

After the X bit comes the CC (CSRC count) field, which tells how many CSRC entries the packet has. It is often zero unless mixers are used.

The M bit, or marker bit, is a free-to-use bit. It does not carry any information from the viewpoint of RTP and can be used, for example, to signal that the current packet is the last fragment of a larger frame.

The PT field defines the payload type of the media carried in the payload. Each media format is identified by a payload type number; HEVC has no statically assigned number, so a value from the dynamic range (96 to 127) is used, here 96.

Next comes the sequence number, a 16-bit unsigned integer. After 65536 packets the counter wraps around and sequence numbers start repeating.
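The wraparound has a practical consequence: a receiver cannot compare two sequence numbers with a plain `<`. The following Python sketch shows one common way to handle this, modular comparison over half the number space. The function names are hypothetical and the technique is not taken from the text above.

```python
SEQ_MOD = 1 << 16  # size of the 16-bit sequence number space


def next_seq(seq: int) -> int:
    """Advance a sequence number, wrapping from 65535 back to 0."""
    return (seq + 1) % SEQ_MOD


def seq_newer(a: int, b: int) -> bool:
    """True if a is ahead of b by less than half the number space."""
    return a != b and ((a - b) % SEQ_MOD) < SEQ_MOD // 2


print(next_seq(65535))      # wraps around to 0
print(seq_newer(0, 65535))  # 0 follows 65535 after the wrap
```

The half-space rule assumes that packets are never reordered or delayed by more than 32768 sequence numbers.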

The timestamp field is a 32-bit counter that defines the media-specific progression of the stream. Different media types have different advance rates, and often the counter increment is more complicated than simply increasing it by one.
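As a worked example of a media-specific increment, the sketch below computes the per-frame timestamp step for video. It assumes a 90 kHz media clock, which video payload formats commonly use. The clock rate, frame rates and function names are illustrative assumptions, not values from the text.

```python
from fractions import Fraction

TS_MOD = 1 << 32  # the timestamp is a 32-bit counter


def timestamp_increment(clock_rate_hz: int, frames_per_second: Fraction) -> Fraction:
    """Clock ticks that elapse between two consecutive frames."""
    return Fraction(clock_rate_hz) / frames_per_second


def advance(timestamp: int, increment: int) -> int:
    """Add the increment and wrap at 2**32."""
    return (timestamp + increment) % TS_MOD


print(timestamp_increment(90_000, Fraction(30)))           # 30 fps
print(timestamp_increment(90_000, Fraction(30_000, 1001)))  # 29.97 fps
```

With these assumed values, every packet belonging to one frame carries the same timestamp, and the counter advances by the computed step only when the next frame starts.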

Next comes the SSRC of the sender. The SSRC shall be unique among all participants of the session.

After the SSRC there are optional CSRC entries. If mixers are not used, these fields are not part of the packet. Because the CC field is 4 bits wide, there can be at most 15 CSRC entries.

Following the CSRC entries is another optional field, the extension header, which can be used to provide additional information about the payload data.

Finally, the rest of the datagram is dedicated to the payload. If indicated by the P bit in the header, the payload contains padding, and the last byte of the payload tells how many padding bytes it has.
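The field layout described above can be summarized as a small parser. This is a minimal Python sketch under the RFC 3550 layout, not the implementation used in this work; the `RtpPacket` class and `parse_rtp` function are hypothetical names. Malformed packets are reported as `None`, which stands in for discarding them.

```python
import struct
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RtpPacket:
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrc: List[int]
    extension_data: bytes
    payload: bytes


def parse_rtp(datagram: bytes) -> Optional[RtpPacket]:
    """Parse one RTP packet; return None if it must be discarded."""
    if len(datagram) < 12:
        return None
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", datagram, 0)
    if b0 >> 6 != 2:              # V: version must be 2
        return None
    padding = bool(b0 & 0x20)     # P bit
    extension = bool(b0 & 0x10)   # X bit
    cc = b0 & 0x0F                # CC: number of CSRC entries
    offset = 12
    if len(datagram) < offset + 4 * cc:
        return None
    csrc = [struct.unpack_from("!I", datagram, offset + 4 * i)[0] for i in range(cc)]
    offset += 4 * cc
    ext = b""
    if extension:
        if len(datagram) < offset + 4:
            return None
        # 16-bit profile-defined value, then length in 32-bit words
        _profile, words = struct.unpack_from("!HH", datagram, offset)
        end = offset + 4 + 4 * words
        if len(datagram) < end:
            return None
        ext = datagram[offset + 4:end]
        offset = end
    payload = datagram[offset:]
    if padding:
        if not payload or payload[-1] == 0 or payload[-1] > len(payload):
            return None
        payload = payload[:-payload[-1]]  # drop padding, length byte included
    return RtpPacket(padding, extension, bool(b1 & 0x80), b1 & 0x7F,
                     seq, ts, ssrc, csrc, ext, payload)


header = struct.pack("!BBHII", 0x80, 96, 1, 3000, 0x1111)
print(parse_rtp(header + b"data"))
```

The example datagram at the end uses version 2, payload type 96 and no optional fields, so the payload starts directly after the 12-byte fixed header.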

If a datagram of size 1500 bytes is used and there are no extension headers, contributing sources, or padding, the payload carried in each packet is at most 1446 bytes.
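The 1446-byte figure can be reproduced with simple arithmetic. The header breakdown in this Python sketch is an assumption: the text does not list the subtracted headers, but the 54-byte difference is consistent with Ethernet (14), IPv4 without options (20), UDP (8) and the fixed RTP header (12).

```python
# Assumed per-packet header sizes in bytes (not listed in the text).
HEADER_BYTES = {"ethernet": 14, "ipv4": 20, "udp": 8, "rtp_fixed": 12}


def max_payload(total_bytes: int = 1500, csrc_count: int = 0) -> int:
    """Payload bytes left after the assumed headers and any CSRC entries."""
    overhead = sum(HEADER_BYTES.values()) + 4 * csrc_count
    return total_bytes - overhead


print(max_payload())              # no CSRC entries
print(max_payload(csrc_count=2))  # each CSRC entry costs 4 more bytes
```

If the 1500 bytes are instead read as the IP-layer MTU, the Ethernet header would not be subtracted and the result would be 14 bytes larger.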

3.3.2 Synchronization Sources

One of the most important aspects of RTP is the notion of an SSRC. RTP sessions consist of participants either only sending, only receiving or sending and receiving media. Each participant is assigned a unique identifier called SSRC. This 32-bit identifier gives the receiver the ability to differentiate senders. This is especially important in multicast sessions where participants receive packets from and send packets to the same IP address. Figure 11 presents a multicast RTP session where the yellow boxes represent RTP packets with unique SSRCs.

Figure 11: RTP multicast session

As Figure 11 depicts, it would be impossible, for example, for participant 3333 to differentiate the incoming packets and hand them to the correct video decoding instances had the packets not been equipped with unique identifiers. In typical unicast sessions the SSRCs do not provide much value, as the IP address/port pair is often enough to differentiate participants. This is, however, not always the case if, for example, translators or mixers are used.
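The demultiplexing role of the SSRC can be sketched in a few lines of Python. The function name and the `(ssrc, payload)` tuple representation are hypothetical; a real receiver would hand each stream to its own decoder instead of collecting payloads in lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def demultiplex(packets: Iterable[Tuple[int, bytes]]) -> Dict[int, List[bytes]]:
    """Group payloads by sender SSRC, preserving arrival order per sender."""
    streams: Dict[int, List[bytes]] = defaultdict(list)
    for ssrc, payload in packets:
        streams[ssrc].append(payload)
    return dict(streams)


# All packets arrive from the same multicast address; only the SSRC differs.
received = [(0x1111, b"a1"), (0x2222, b"b1"), (0x1111, b"a2")]
print(demultiplex(received))
```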

Though very unlikely, SSRC conflicts are possible. Two participants having the same SSRC can have devastating consequences for video call quality. As soon as the conflict has been noticed, the conflicting participants resolve it amongst themselves without disrupting the rest of the session. Algorithm 1 shows in pseudocode how an SSRC conflict is resolved.

Algorithm 1: SSRC conflict resolution

When an RTP packet is received, the first thing that is checked is whether the sender is a new participant. If so, an entry is created in the participant table. Each participant is put on a probation period which ends after a few successfully received packets. When the participant has been accepted to the session and its first real packet has been received, its transport address is saved to the participant table, which already contains the participant’s SSRC.

For each received packet, the packet's transport address is matched against the address in the participant table. If these do not match, it means there are two nodes transmitting from different addresses while using the same SSRC identifier. If the SSRC of the received packet does not collide with our own SSRC, the packet is simply dropped, and the reception process is continued. All subsequent packets with an invalid transport address/SSRC combination are also discarded.

If the SSRC matches our own SSRC and this is the first time we notice this, we add an entry to a table that records SSRC conflicts. Then we remove ourselves from the session by sending an RTCP BYE message, generate a new SSRC for ourselves and rejoin the session. If the transport address has already been logged to the SSRC conflict table, we add a new entry to the table, discard the packet and continue packet reception.
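The steps described above can be sketched in Python as follows. This is an illustration of the prose, not a reproduction of Algorithm 1: the probation period is omitted, sending the RTCP BYE is replaced by appending to a list, repeated conflicts are logged as a counter, and all class, method and attribute names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Address = Tuple[str, int]  # (IP address, port)


class Session:
    def __init__(self, own_ssrc: int, own_addr: Address) -> None:
        self.own_ssrc = own_ssrc
        self.own_addr = own_addr
        self.participants: Dict[int, Address] = {own_ssrc: own_addr}
        self.conflicts: Dict[Address, int] = {}  # address -> times logged
        self.sent_bye: List[int] = []            # stand-in for RTCP BYE

    def on_packet(self, ssrc: int, addr: Address) -> str:
        known = self.participants.get(ssrc)
        if known is None:                 # new participant
            self.participants[ssrc] = addr
            return "accept"
        if known == addr:                 # address matches the table
            return "accept"
        if ssrc != self.own_ssrc:         # third-party collision or loop
            return "drop"
        if addr in self.conflicts:        # already logged: log again, drop
            self.conflicts[addr] += 1
            return "drop"
        self.conflicts[addr] = 1          # first time this conflict is seen
        self.sent_bye.append(self.own_ssrc)
        new_ssrc = self._new_ssrc()
        del self.participants[self.own_ssrc]
        self.own_ssrc = new_ssrc
        self.participants[new_ssrc] = self.own_addr
        return "rejoined"

    def _new_ssrc(self) -> int:
        while True:
            candidate = random.getrandbits(32)
            if candidate not in self.participants:
                return candidate


session = Session(0x1111, ("10.0.0.1", 5004))
print(session.on_packet(0x1111, ("10.0.0.9", 5004)))  # our SSRC, foreign address
```

In the usage line, a packet carrying our own SSRC arrives from a foreign address, so the session logs the conflict, records a BYE for the old SSRC and continues under a freshly generated one.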

RTP also defines CSRC identifiers which are often used in conjunction with a mixer. In that case, the SSRC of a packet would be that of the mixer and CSRC entries would contain SSRCs of those participants who had contributed to that packet, i.e., whose data is transmitted in that packet.

3.3.3 Mixers and Translators

Varying internet link speeds pose challenges to a media streaming session. If one or more participants are behind a low-speed link, they may dictate the audio and video quality for the whole session, which is then often much worse than what most participants would be able to serve. Another problem is reachability. Some participants may be behind an application-level firewall which blocks all IP traffic. To address these issues, RFC 3550 introduced two concepts: mixers and translators.

Translators can be used to bypass application-level firewalls, or to scale up the quality or the number of participants when they act as relay nodes for participants whose bandwidth is low. Such a node can send its RTP packets to a translator, which then forwards them to all other participants, provided the translator has enough bandwidth. Figure 12 shows how translators work in a multicast RTP session.

Figure 12: RTP multicast session with two translators

In Figure 12, the participants of an RTP multicast session are divided into two parties by an application-level firewall (orange block). The multicast sender is not able to penetrate the firewall, so instead it sends the RTP packets to the reachable participants and to a translator, which is connected through the firewall to another translator. This other translator then sends the RTP packets to the rest of the participants. All RTP participants receive the original packet without any modifications added by the translators.

Translators are invisible components of an RTP session as they do not leave any marks in the CSRC field of the RTP packet. In Figure 12, even though the participants of the local multicast group would receive the packet from the translators, the SSRC field of the RTP packet would contain the multicast sender’s SSRC.

Mixers are similar to translators, but instead of only forwarding packets to other participants, mixers are used to re-encode and/or merge incoming streams. This can help participants with low-speed links to join a session. Mixers have their own SSRC, and packets sent by a mixer have the CSRC list filled with the SSRC values of those participants that contributed to the packet. Figure 13 shows an example of mixer functionality.

Figure 13: Mixer and two senders

In Figure 13, there are two senders that send RTP packets directly to a mixer. Because the receiver does not have enough bandwidth to receive all media streams, the mixer re-encodes the incoming streams with a lower sampling rate and merges the two streams into one. The packet produced by the mixer has the SSRC of the mixer, and the CSRC list contains the SSRC values of the two senders.
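The header a mixer would put on such a merged packet can be sketched in Python as follows. The function name, SSRC values and payload type are illustrative assumptions. The limit of 15 contributors follows from the 4-bit CC field.

```python
import struct
from typing import Sequence


def build_mixed_header(mixer_ssrc: int, contributors: Sequence[int],
                       seq: int, timestamp: int, payload_type: int) -> bytes:
    """Fixed RTP header with the mixer's SSRC, followed by the CSRC list."""
    if len(contributors) > 15:
        raise ValueError("CC is a 4-bit field; at most 15 CSRC entries fit")
    b0 = (2 << 6) | len(contributors)  # V=2, P=0, X=0, CC=number of CSRCs
    b1 = payload_type & 0x7F           # M=0
    header = struct.pack("!BBHII", b0, b1, seq, timestamp, mixer_ssrc)
    return header + b"".join(struct.pack("!I", c) for c in contributors)


# Mixer 0x9999 merging the streams of senders 0x1111 and 0x2222.
header = build_mixed_header(0x9999, [0x1111, 0x2222], seq=7,
                            timestamp=160, payload_type=0)
print(header.hex())
```

A translator, by contrast, would forward the original header unchanged, so neither the SSRC nor the CSRC list would reveal its presence.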

Mixers and translators might introduce loops into the session if they mistakenly forward packets to the same multicast address the packets originated from. This can also happen indirectly, when packets are forwarded to another translator which then forwards them to the network they came from. Loops and SSRC collisions are detected the same way, as both cause RTP packets to have the same SSRC but a different transport address. Algorithm 1 shows how a looped packet is dealt with.

In document Design and Implementation of a Secure Real-time Transport Protocol Library for High-Speed Video Streaming (pages 15-20)