This lesson explains the Real-time Transport Protocol (RTP) and how it fits into the SMPTE ST 2110 stack.
• Why isn't UDP enough?
• RTP Packets
• Secure RTP
• Maximum Transmission Unit
• Mapping MPEG Transport Streams to RTP
RTP (Real-time Transport Protocol) is one of the most fundamental elements in modern IP-based media systems, including SMPTE ST 2110, VoIP, streaming, and conferencing. It is defined in RFC 3550, published by the IETF in 2003 as the successor to the original RTP specification (RFC 1889), and is part of the Internet protocol suite.
RTP provides end-to-end transport for real-time data, such as audio and video.
It is not responsible for guaranteed delivery. That’s intentional. Instead, it focuses on sequencing, timing, and synchronization.
It’s usually paired with UDP as its underlying transport because UDP allows continuous packet flow without retransmission delays.
Why isn't UDP enough?
Short answer: UDP moves packets; RTP makes media usable.
UDP deliberately lacks almost everything real-time media needs.
What UDP provides (and why it’s not enough)
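UDP gives you only:
• Source and destination ports
• A length field
• A checksum
UDP does NOT provide:
• Ordering
• Timing
• Loss detection
• Payload or stream identification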
For real-time media, that’s a problem.
What UDP is missing — and how RTP fixes it
| Capability | UDP | RTP |
|---|---|---|
| Transport | ✅ | ✅ (over UDP) |
| Ordering | ❌ | ✅ |
| Timing | ❌ | ✅ |
| Loss detection | ❌ | ✅ |
| Payload type | ❌ | ✅ |
| Stream ID | ❌ | ✅ |
| A/V sync | ❌ | ✅ (via RTCP) |
| Media awareness | ❌ | ✅ |
Why not just “add this in the application”?
Because:
• Everyone would do it differently
• Interoperability would collapse
• Monitoring and tooling would be impossible
RTP is a standardized media contract on top of UDP.
UDP delivers packets; RTP delivers time-based media.
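To make the "contract on top of UDP" concrete, here is a minimal sketch of wrapping a media payload in the 12-byte RTP fixed header before it is handed to a UDP socket. The function name `build_rtp_packet`, the dynamic payload type 96, and the SSRC value are illustrative assumptions; the field layout follows RFC 3550.

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 96, marker: bool = False) -> bytes:
    """Prepend a 12-byte RTP fixed header (no padding, extension, or CSRCs).

    Hypothetical helper for illustration, not a complete RTP sender.
    """
    byte0 = RTP_VERSION << 6                            # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + 7-bit PT
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,                  # 16-bit sequence number
                         timestamp & 0xFFFFFFFF,        # 32-bit media timestamp
                         ssrc & 0xFFFFFFFF)             # 32-bit stream ID
    return header + payload


packet = build_rtp_packet(b"\x00" * 4, seq=1001, timestamp=48000, ssrc=0x12345678)
print(len(packet))  # 16: 12 header bytes + 4 payload bytes
```

The resulting bytes would then be sent as an ordinary UDP datagram, for example with `socket.sendto`. UDP itself is unchanged; everything media-related lives in those 12 bytes.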
While RTP (Real-time Transport Protocol) is usually placed at OSI Layer 5 (the Session layer), in practice it is most accurately described as operating between the Transport layer (Layer 4) and the Application layer (Layer 7).
Why does RTP reside in Layer 5 (Session)?
RTP provides session management features for real-time media streams, such as source identification (SSRC), sequencing, and timing.
These functions align with the OSI Session layer's responsibilities: establishing, managing, and terminating sessions/dialogues between applications.
The actual ST 2110 payload resides in Layer 6 (Presentation), so RTP is the “shim” that ST 2110 uses to make UDP packets behave like real-time media.
Quick Summary
• OSI model (academic/theoretical) → Layer 5 (Session)
• TCP/IP model (real-world/practical) → Application layer (above UDP)
• Most precise modern answer → Layer 5, because RTP manages real-time media sessions rather than being a pure end-user application like HTTP or SMTP.
So if you're studying for networking certifications (CCNA, CompTIA Network+, etc.) or discussing OSI layers strictly, say: RTP operates at Layer 5 (Session layer).

RTP Packets
Key features:
• Packetization: Encapsulates media (like video frames or audio samples) into packets that fit within network MTUs.
• Synchronization source (SSRC): A unique ID per stream, ensuring receivers can distinguish between multiple media sources.
• Contributing source (CSRC): Used when a stream is mixed (e.g., multiple talkers in a conference).
• Minimal control overhead: Only 12 bytes of base header, which is very efficient for high-rate video/audio.
Version (V) 2 bits
Indicates the RTP version. Version 2 is the current version and is used everywhere.
Padding (P) 1 bit
If set, extra padding bytes are added at the end of the packet (useful for alignment).
Extension (X) 1 bit
Indicates the presence of an optional extension header that follows the CSRC list (for extra metadata).
CSRC Count (CC) 4 bits
Specifies how many Contributing Source (CSRC) identifiers are included. Usually 0 unless mixing multiple sources (like in conferencing).
Marker (M) 1 bit
A flag that can mark special events such as the end of a video frame or the start of an audio talk-spurt.
Payload Type (PT) 7 bits
Identifies the format of the media payload (e.g., PCM audio, H.264, JPEG-XS). Both sender and receiver must agree on this mapping.
Sequence Number 16 bits
Increments by 1 for each RTP packet sent. Used by the receiver to detect packet loss or out-of-order arrival.
Timestamp 32 bits
Represents the sampling instant of the first byte in the payload. Used to synchronize playback and align media streams (e.g., audio/video lip-sync).
SSRC (Synchronization Source) 32 bits
Unique ID chosen by the sender to identify the stream. Receivers use it to distinguish multiple concurrent RTP sources.
CSRC List (Optional) 0–15 entries, 32 bits each
Lists contributing sources if the payload is a mix (e.g., an audio mixer combining multiple microphones).
How Timestamps and Sequence Numbers Work Together
Sequence Number: increments by 1 per packet (packet order integrity).
Timestamp: increases by the number of samples or the frame duration covered by each packet (playback timing).
Example for audio:
Sequence: 1001, 1002, 1003 → identifies packet order.
Timestamp: 48000, 48160, 48320 → advances by 160 samples per packet on the 48 kHz sample clock.
PTP (IEEE 1588) provides the timebase so timestamps are accurate and synchronized.
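The interplay between sequence numbers and timestamps described above can be sketched in a few lines. This is a simplified illustration, not a full receiver: the function names are hypothetical, packets are assumed to arrive in order, and the 160-samples-per-packet figure is taken from the audio example.

```python
SEQ_MOD = 1 << 16  # the sequence number is 16 bits and wraps to 0 after 65535


def packets_lost_between(prev_seq: int, seq: int) -> int:
    """Count packets missing between two consecutively received packets.

    Assumes in-order arrival; a real receiver must also handle
    reordering and duplicates.
    """
    return (seq - prev_seq - 1) % SEQ_MOD


def timestamp_delta_seconds(ts_delta: int, clock_rate: int = 48000) -> float:
    """Convert a timestamp difference (in clock ticks) to seconds."""
    return ts_delta / clock_rate


samples_per_packet = 160  # assumption, matching the audio example
timestamps = [48000 + i * samples_per_packet for i in range(3)]

print(timestamps)                      # [48000, 48160, 48320]
print(packets_lost_between(1001, 1002))  # 0: nothing missing
print(packets_lost_between(1001, 1004))  # 2: packets 1002 and 1003 are missing
print(packets_lost_between(65534, 1))    # 2: packets 65535 and 0 are missing
```

The modulo arithmetic is what keeps loss detection working across the 16-bit wraparound; the timestamp, by contrast, tells the receiver when to play the samples, independent of how many packets carried them.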
RTP Header Extension
Where it lives: Immediately after the fixed 12-byte RTP header and any CSRC entries, but only if the X bit (Extension) in the header is set to 1.
Purpose: Used to carry optional metadata that doesn’t fit into the standard header.
Structure:
Profile-specific ID (16 bits): Identifies the meaning/format of the extension (e.g., SMPTE 2110-21 uses 0xABAC).
Length (16 bits): Number of 32-bit words in the extension data, not counting this 4-byte extension header.
Extension Data: The custom information itself.
In 2110 systems, the extension is often used for flow timing metadata or ancillary synchronization.
Payload Variable length (up to MTU limit)
The actual media data — such as video frame segments, audio samples, or ancillary data. The type and interpretation are defined by the Payload Type (PT) field.
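As a rough illustration of how the fixed header, CSRC list, header extension, and payload sit one after another, the following sketch parses a raw RTP packet with Python's `struct` module. The `RtpPacket` type and `parse_rtp` function are hypothetical names, and the parser deliberately leaves any trailing padding inside `payload`.

```python
import struct
from typing import NamedTuple, Optional, Tuple


class RtpPacket(NamedTuple):
    padding: bool
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]
    ext_profile: Optional[int]  # None when the X bit is clear
    ext_data: bytes
    payload: bytes              # still includes padding if the P bit is set


def parse_rtp(data: bytes) -> RtpPacket:
    if len(data) < 12:
        raise ValueError("shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", data, 0)
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    cc = b0 & 0x0F                       # CSRC count
    offset = 12
    csrcs = struct.unpack_from(f"!{cc}I", data, offset) if cc else ()
    offset += 4 * cc
    ext_profile, ext_data = None, b""
    if b0 & 0x10:                        # X bit: extension follows the CSRC list
        ext_profile, ext_words = struct.unpack_from("!HH", data, offset)
        offset += 4
        ext_data = data[offset:offset + 4 * ext_words]
        offset += 4 * ext_words
    return RtpPacket(bool(b0 & 0x20), bool(b1 & 0x80), b1 & 0x7F,
                     seq, ts, ssrc, csrcs, ext_profile, ext_data, data[offset:])
```

A receiver would typically check `ssrc` and `payload_type` first to decide which decoder or essence handler the `payload` bytes belong to.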
RTP Padding (P bit)
Where it lives: At the end of the packet (payload section).
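Purpose: Used when the payload must be padded to a specific byte boundary. For example, some encryption algorithms require fixed block sizes.
How it works: When the P bit is set, the last byte of the packet holds the number of padding bytes, counting itself. The receiver strips that many bytes from the end of the payload.
Example:
[Payload Data][xx][xx][03]
The final byte (03) means there are 3 padding bytes in total, including the count byte itself.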
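A receiver-side sketch of the padding rule, assuming the payload bytes and the P bit have already been extracted from the packet (the function name `strip_rtp_padding` is hypothetical):

```python
def strip_rtp_padding(payload: bytes, padding_bit: bool) -> bytes:
    """Remove RTP padding; the last byte counts the padding bytes, itself included."""
    if not padding_bit:
        return payload
    if not payload:
        raise ValueError("P bit set but payload is empty")
    pad_len = payload[-1]
    if pad_len == 0 or pad_len > len(payload):
        raise ValueError("invalid padding count")
    return payload[:-pad_len]


print(strip_rtp_padding(b"DATA\x00\x00\x03", True))  # b'DATA'
```

The sanity check on `pad_len` matters in practice: a corrupted or malicious count byte would otherwise make the receiver discard real media data or slice past the start of the buffer.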
RTP Control Protocol (RTCP)
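RFC 3550 also defines a companion protocol: RTCP (RTP Control Protocol). RTCP carries periodic sender and receiver reports (packet loss, jitter, and the timing information used for A/V sync) alongside the media stream.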
How RTP Works in Real Time
RTP Is Transport, Not Reliability
RTP does not guarantee delivery or retransmit lost packets.
This makes it perfect for live, real-time content, where a missing frame is better than a delayed one.
SMPTE ST 2110 uses RTP as the encapsulation layer for all essence types (video, audio, metadata).
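ST 2022-7 (seamless protection switching) can be used alongside it for redundancy: the same RTP stream is sent over two network paths and the receiver reconstructs one clean stream.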
RTP (RFC 3550) is a lightweight, real-time transport protocol designed for synchronization, sequencing, and timing of audio/video data across IP networks, which are the backbone of modern IP broadcast and streaming systems.
Secure RTP
SRTP Master Key Identifier (MKI)
Where it lives: In Secure RTP (SRTP) packets, optionally appended after the encrypted payload but before the authentication tag.
Purpose: Identifies which cryptographic key was used to encrypt the RTP payload. This is important when multiple SRTP keys are in use, for example, during rekeying events or multi-party conferences.
Structure (optional, variable length):
[Encrypted Payload][MKI][Auth Tag]
MKI length is negotiated during session setup (via SDP). It tells the receiver which master key to use to decrypt the payload. Used mainly in environments where keys are rotated frequently for security compliance.
Authentication Fields (SRTP Auth Tag)
Where it lives: At the end of the SRTP packet — after the payload and optional MKI.
Purpose: Provides integrity and authenticity for the RTP packet. It ensures that the packet wasn’t tampered with and came from a trusted sender.
How it works: Calculated using an HMAC (Hash-based Message Authentication Code) algorithm like HMAC-SHA1.
Common tag lengths: 80 bits (10 bytes) or 32 bits (4 bytes).
The receiver recomputes the tag and compares it — mismatch means discard.
Example structure (end of packet):
[Encrypted Payload][MKI (optional)][Auth Tag]
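The tag check can be sketched with Python's standard `hmac` module. This is a simplified illustration under stated assumptions: the session authentication key is passed in directly (real SRTP derives it from the master key), the function names are hypothetical, and the 32-bit rollover counter (ROC) that RFC 3711 appends to the authenticated data is supplied by the caller.

```python
import hashlib
import hmac
import struct


def srtp_auth_tag(auth_key: bytes, authenticated_portion: bytes,
                  roc: int, tag_len: int = 10) -> bytes:
    """HMAC-SHA1 over (RTP header + encrypted payload + ROC), truncated to tag_len bytes."""
    message = authenticated_portion + struct.pack("!I", roc)
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:tag_len]


def verify_and_strip(auth_key: bytes, packet: bytes, roc: int,
                     mki_len: int = 0, tag_len: int = 10) -> bytes:
    """Check the trailing auth tag and return header + encrypted payload.

    Packet layout assumed: [header + encrypted payload][MKI][auth tag].
    """
    if len(packet) < mki_len + tag_len:
        raise ValueError("packet too short for MKI and auth tag")
    tag = packet[-tag_len:]
    body = packet[:len(packet) - tag_len - mki_len]
    expected = srtp_auth_tag(auth_key, body, roc, tag_len)
    if not hmac.compare_digest(tag, expected):   # constant-time comparison
        raise ValueError("authentication failed: discard packet")
    return body
```

Note that the MKI sits between the payload and the tag but is not itself covered by the tag; it only tells the receiver which key to use before the check runs.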
Putting It All Together
Here’s how these optional fields stack up in the full RTP/SRTP packet:
[RTP Header][CSRC List (optional)][Header Extension (optional)][Payload][Padding (optional)][MKI (optional)][Auth Tag]
Maximum Transmission Unit
In SMPTE ST 2110 and other uncompressed media systems, each video frame can be several megabytes or more (about 5 MB for 1080p 4:2:2 10-bit, about 20 MB for UHD). It must be split into many RTP packets, and each RTP packet must fit within the MTU (usually 1500 bytes).
| Network Type | Common MTU (bytes) | Notes |
|---|---|---|
| Ethernet (standard) | 1500 | Default for most IP networks. |
| Ethernet Jumbo Frames | 9000 | Used in data centers and 2110 networks for higher efficiency. |
| Wi-Fi | 2304 (max theoretical) | Often lower in practice due to overhead. |
| VPN / GRE tunnels | 1400–1476 | Smaller MTU because of encapsulation overhead. |
If you use Jumbo Frames (9000 bytes), you can send more payload per packet → fewer packets per frame → lower CPU load and switch congestion. This is why IP video engineers often set: MTU = 9000 (Jumbo) across all NICs and switches in a 2110 plant.
If an RTP packet is larger than the MTU, IP will fragment it into multiple smaller packets. That’s bad for real-time streams because:
• Losing any one fragment loses the whole RTP packet
• Reassembly adds latency and CPU load at the receiver
Hence, in professional media networks, fragmentation is avoided by design. Packet sizes are kept under the MTU limit. Every device in the path (switches, routers, NICs) must agree on the MTU. If not, packets might get dropped or fragmented midstream.
On the wire, each packet consists of: Ethernet + IP + UDP + RTP headers + payload
→ the IP + UDP + RTP headers plus the payload must not exceed the MTU (the Ethernet header itself is not counted in the MTU).
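A back-of-the-envelope calculation shows the effect of the MTU on packet count. The sketch below assumes IPv4 without options (20 bytes), UDP (8 bytes), and the 12-byte RTP fixed header, and it ignores the extra per-packet payload header that ST 2110-20 adds, so real packet counts will be somewhat higher.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8
RTP_HEADER = 12   # fixed header only, no CSRC list or extension


def max_rtp_payload(mtu: int) -> int:
    """Largest RTP payload that fits in one unfragmented IP packet."""
    return mtu - IPV4_HEADER - UDP_HEADER - RTP_HEADER


def packets_per_frame(frame_bytes: int, mtu: int) -> int:
    return math.ceil(frame_bytes / max_rtp_payload(mtu))


# 1080p, 4:2:2, 10-bit: 20 bits (2.5 bytes) per pixel on average
frame_bytes = 1920 * 1080 * 5 // 2

print(frame_bytes)                           # 5184000
print(max_rtp_payload(1500))                 # 1460
print(packets_per_frame(frame_bytes, 1500))  # 3551
print(max_rtp_payload(9000))                 # 8960
print(packets_per_frame(frame_bytes, 9000))  # 579
```

Under these assumptions, jumbo frames cut the per-frame packet count by roughly a factor of six, which is where the reduced per-packet processing load comes from.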
Mapping MPEG Transport Streams to RTP