2. Multimedia
1. Introduction
The internet was originally designed primarily for data communication.
Initially, most of the traffic was data such as files, emails, and web content.
As more people started using the internet, several varied applications emerged.
The widespread use of the internet for communication and entertainment fueled the need for multimedia
communication over the internet.
Today, voice, images, videos, animation, graphics, etc. are commonly exchanged on the internet.
Multimedia networking applications like internet telephony, video on demand, internet TV, and video conferencing are widely used.
However, multimedia traffic has strict timing requirements and hence it imposes
several restrictions on the communication.
It imposes additional and complex requirements on the architecture of the internet
and the communication protocols.
Hence, additional, specialized protocols for handling multimedia have been developed. These include the Real-time Transport Protocol (RTP), its companion Real-time Transport Control Protocol (RTCP), and the Real-time Streaming Protocol (RTSP).
We can divide audio and video services into three broad categories: Streaming stored
audio/video, Streaming live audio/video and Interactive audio/video as shown below.
The concept of "streaming means that a user can play the file
immediately after the downloading has started "
Streaming media is multimedia that is delivered and consumed in a
continuous manner from a source.
➢ Streaming stored audio/video: Streaming stored audio/video refers to on-demand requests for compressed audio/video files. In this service, the multimedia files are compressed and stored on a server, and these files are downloaded by the client through the internet. This is also called on-demand audio/video.
Examples of stored audio files are podcasts, audio books, etc. Examples of stored video files are movies, video clips, video recordings, etc. Popular sources include YouTube, Spotify, Netflix, etc.
➢ Streaming live audio/video: Streaming live audio/video refers to the broadcasting of audio, radio, video and TV content through the internet. It is the real-time delivery of content as it is captured. In this category the audio/video content is broadcast online through the internet. A user can listen to the audio and view the video in real time, as soon as the broadcast starts.
An example of this is Internet radio and TV. Today most of the TV and radio
stations broadcast their programs on the internet. Live telecast of sports
and other events are also done through the internet.
➢ Interactive audio/video: Interactive audio/video refers to
the use of the internet for interactive audio/video
applications.
People use these kinds of applications to communicate with one another interactively.
Good examples of this category are video conferencing and internet telephony.
2. Digitizing Audio and Video
Audio and video signals are essentially analog in nature.
A computer network sends and receives data in the form of binary bits.
Hence, in order to transmit audio and video content through
computer network the signals need to be digitized.
2.1 Digitizing Audio
Digitization is the process of converting an analog signal into a digital signal. There are three steps in the digitization of sound.
1. Sampling -Sound is captured as an analog signal from a
microphone or other sound input device.
The analog signal is sampled at regular intervals by an
analog-to-digital converter (ADC).
Sampling involves measuring the amplitude (loudness) of the
sound wave at discrete points in time.
The sampling rate is the number of samples of the analog sound taken per second.
A higher sampling rate implies that more samples are taken
during the given time interval and ultimately, the quality of
reconstruction is better.
The sampling rate is measured in hertz (Hz).
2. Quantization - The analog-to-digital converter performs this function to create a series of digital values from the given analog signal.
Quantization is representing the sampled values of the amplitude by a finite set of levels, i.e. converting a continuous-amplitude sample into a discrete-amplitude value.
The following figure shows how an analog signal gets
quantized
The number of bits used to represent the value of each sample is known as the sample size, bit depth, or resolution.
The amplitude value is rounded to the nearest digital value
based on the bit depth (e.g., 16-bit, 24-bit).
Commonly used sample sizes are either 8 bits or 16 bits
The larger the sample size, the more accurately the data will
describe the recorded sound.
3. Encoding - Encoding is the process of converting the data, or a given sequence of characters, symbols, etc., into a specified format for the secure transmission of data. There are various encoding formats that determine how the digital audio data is structured and stored.
Common formats include WAV (Waveform Audio File
Format), AIFF (Audio Interchange File Format), MP3 (MPEG
Audio Layer III) etc.
To digitize the signal, it must be sampled at particular
intervals.
According to the Nyquist theorem, a signal must be sampled at least at twice its highest frequency for proper representation.
Ex. Voice signals, which have a maximum frequency of 4 kHz, are sampled 8000 times per second. Each sample is quantized with 8 bits. Hence, 1 second of voice signal generates 8000 x 8 = 64 kbps.
Typically, music is sampled at 44,100 samples per second with 16 bits per sample, resulting in a digital signal of 705.6 kbps per channel.
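The bit-rate arithmetic above can be checked with a small sketch; this is only an illustration of the sampling-rate times bit-depth calculation, and the function name is an assumption, not part of any standard library.

```python
def digital_bit_rate(sampling_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed digital audio in bits per second."""
    return sampling_rate_hz * bits_per_sample * channels

print(digital_bit_rate(8000, 8))        # voice: 64000 bps = 64 kbps
print(digital_bit_rate(44100, 16))      # music, one channel: 705600 bps = 705.6 kbps
print(digital_bit_rate(44100, 16, 2))   # stereo CD audio: 1411200 bps, about 1.411 Mbps
```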
2.2 Digitizing Video
A video is made up of a sequence of images called frames.
When an image appears on the retina of the human eye, the
image is retained for several milliseconds.
Consequently, if a sequence of images is displayed at the appropriate rate, the eye does not notice that it is looking at discrete images. This is how smooth motion is produced in videos.
The display rate is usually 25 (or 30) frames per second. Each frame is refreshed twice to avoid flickering. Hence, either the sender sends 50 frames per second or the receiver stores and refreshes each frame twice.
Each frame is composed of a grid of picture elements or pixels.
A pixel is the smallest unit of a digital image or graphic
The total number of pixels in an image = number of rows x number of
columns.
Each pixel stores color information.
The pixel depth or color depth is the number of data bits each pixel
represents.
Black and white image: Each pixel stores 1 bit ie. 0 (black) or 1 (white)
Grayscale image: Each pixel stores 8 bits representing 256 shades.
Color image : Each pixel stores 24 bits (8 bits each for RGB)
For example, if a color image has 1024 x 768 pixels, then the total memory required per image is 1024 x 768 x 24 = 18,874,368 bits.
For one second of video, the data rate required would be 25 x 2 x 1024 x 768 x 24 ≈ 944 Mbps. Clearly, it is not possible to store and transmit such a large amount of data over a network. Hence, compression is needed for multimedia.
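The same arithmetic for the video example (1024 x 768 pixels, 24 bits per pixel, 25 frames per second refreshed twice) can be verified with a short sketch; the function name is illustrative only.

```python
def raw_video_bit_rate(width, height, bits_per_pixel=24, fps=25, refresh_factor=2):
    """Uncompressed bit rate of a video stream in bits per second."""
    bits_per_frame = width * height * bits_per_pixel
    return bits_per_frame * fps * refresh_factor

print(1024 * 768 * 24)                 # 18874368 bits for a single frame
print(raw_video_bit_rate(1024, 768))   # 943718400 bps, roughly 944 Mbps
```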
3. Audio and Video Compression
"Data compression or bit-rate reduction is the process of encoding
information using fewer bits than the original representation."
Any compression method is either lossy or lossless
1. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression.
2. Lossy compression reduces bits by removing unnecessary or less important information.
These schemes are designed based on research into how people perceive the data in question. They take into account human characteristics such as psychoacoustics (which combines the physiology of sound, how our bodies receive sound, with the psychology of sound, how our brains interpret sound) for audio, and psychovisuals (the study of the psychology of vision) for images and video.
3.1 Audio Compression
Audio compression can be used for speech or music. Two categories of techniques are used for audio compression:
I. Predictive encoding
II. Perceptual encoding
I. Predictive encoding:
➢ This is a lossless compression technique.
➢ The basic idea behind predictive encoding is that it requires fewer bits to transmit the
difference between two samples rather than transmitting the samples themselves.
➢ A digitized audio signal consists of 8000 samples per second (for speech) and 44100 samples
per second (for music).
➢ Since audio is a continuously varying signal the difference between successive samples will
not be very large. Hence, the difference can be encoded in fewer number of bits. The first
sample will be sent as a whole.
➢ For example, consider 6 sample values (encoded in 8 bits): 196, 190, 194, 193, 200, 202. The differences are -6, +4, -1, +7, +2, which can be encoded in fewer bits than are required to encode each original value. This type of compression is normally used for speech.
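A minimal sketch of the predictive (differential) idea for the sample values above: the first sample is kept whole and only the small differences are transmitted. This is a simplified illustration, not a complete speech codec.

```python
def predictive_encode(samples):
    """Return the first sample plus the differences between successive samples."""
    diffs = [samples[i] - samples[i - 1] for i in range(1, len(samples))]
    return samples[0], diffs

def predictive_decode(first, diffs):
    """Rebuild the original samples from the first value and the differences (lossless)."""
    samples = [first]
    for d in diffs:
        samples.append(samples[-1] + d)
    return samples

first, diffs = predictive_encode([196, 190, 194, 193, 200, 202])
print(diffs)                            # [-6, 4, -1, 7, 2]: small values, fewer bits needed
print(predictive_decode(first, diffs))  # [196, 190, 194, 193, 200, 202]
```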
II. Perceptual encoding:
➢This is a lossy compression technique.
➢ This is the most common compression technique used to create CD-quality audio. This type of audio needs at least 1.411 Mbps uncompressed, so it cannot be sent over the Internet without compression.
➢MP3 (MPEG audio layer 3), a part of the MPEG standard uses this technique.
➢Perceptual encoding uses the science of psychoacoustics, which is the study of how
people perceive sound.
➢The idea is based on some flaws in our auditory system. Compression is done by
considering that some sounds can mask other sounds. Masking can happen in
frequency and time
A. Frequency masking: In frequency masking, a sound in a particular frequency range can
partially or totally mask a sound in another frequency range.
B. Temporal masking: In temporal masking, a loud sound can numb our ears for a short time
even after the sound has stopped.
➢ MP3 uses frequency and temporal masking to compress audio signals. The technique analyzes the signal and divides its audio frequencies into several groups:
➢ i. Totally masked: No bits are allocated to the frequency ranges that are totally masked. The user is not going to hear these frequencies.
➢ ii. Partially masked: A small number of bits are allocated to the frequency ranges that are partially masked.
➢ iii. Not masked: A larger number of bits are allocated to the frequency ranges that are not masked.
➢ MP3 produces three data rates, according to the range of frequencies in the original analog signal: 96 kbps, 128 kbps, and 160 kbps.
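A toy sketch of the perceptual bit-allocation idea: each frequency band is classified by how strongly it is masked, and bits are allocated accordingly. The band labels and bit counts here are invented for illustration and are not actual MP3 parameters.

```python
# Assumed, illustrative bit budget per masking category (not real MP3 values).
BITS_PER_CATEGORY = {"totally_masked": 0, "partially_masked": 2, "not_masked": 6}

def allocate_bits(band_categories):
    """Map each frequency band's masking category to a number of allocated bits."""
    return [BITS_PER_CATEGORY[category] for category in band_categories]

bands = ["not_masked", "partially_masked", "totally_masked", "not_masked"]
print(allocate_bits(bands))   # [6, 2, 0, 6] -- masked bands get few or no bits
```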
3.2 Video Compression
Video compression technology is a set of techniques for reducing and removing redundancy in
video data.
The compressed video must have a much smaller size compared to the uncompressed video.
This allows the video to be saved in a smaller file or sent over a network more quickly.
Video compression may be lossy, in which case the image quality is reduced compared to the
original image.
For lossy compression, the goal is to develop compression techniques that are efficient and
result in perceptually lossless quality.
In effect, even though the compressed video is different from the original uncompressed video,
the differences are not easily visible to the human eye.
Video data consumes a large amount of data and sending it in the raw format over the internet
is not feasible. Hence, video data is compressed.
Two options are possible:
i. Since a video is composed of images, compress each image (JPEG).
ii. Compress the video (MPEG)
Image Compression: JPEG
The Joint Photographic Experts Group (JPEG) is the most widely used standard for compressing
images.
In a grayscale image, each pixel is represented using 8 bits (256 gray levels).
A color image uses 24 bits per pixel, 8 bits for each basic color i.e. R, G and B.
To simplify the compression, the image is divided into blocks of size 8 x 8 and the compression algorithm is applied to each block.
Three phases of JPEG
i. Discrete Cosine Transform: The purpose of DCT is to transform an image from the spatial
domain to a frequency domain. The DCT transforms the 64 values such that the first value is
called the DC value which is the fundamental color of the block multiplied by a DCT coefficient.
The rest of the values are called AC values and represent the changes between adjacent pixels. This gives an idea of the redundancies in the pixels, which helps in compression.
To understand the concept, let us consider three cases:
Case 1: Suppose all 64 pixels in the block have the same value. The DCT gives a non-zero value for the first coefficient, i.e. T(0,0). The value of T(0,0) is the average of the pixel values (multiplied by a constant) and is called the DC value. The rest of the values in T(m,n), called AC values, represent changes in the pixel values. But because there are no changes, the rest of the values are 0.
Case 2: We have a block with two different grayscale values, with a sharp change between them. When we apply DCT to this block, we get a DC value as well as non-zero AC values.
Case 3: We have a block with continuously changing pixel values, with no sharp change between neighboring pixels. DCT gives a DC value with many non-zero AC values.
ii. Quantization: The next step is quantization, which is applied to the table T. Quantization reduces the number of bits needed for encoding. Each value is divided by a constant and the fractional part is dropped (lossy). A standard 8 x 8 quantization table provides the divisors.
iii. Data Compression: In order to group the 0s together, the table is read diagonally in a zigzag manner as shown. Compression is done by removing the redundant 0s and applying a lossless compression method such as Huffman coding to the frequently occurring values.
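A short sketch of the quantization and zigzag steps on an 8 x 8 block of DCT coefficients. The single divisor of 10 stands in for the standard quantization table, and the block values are invented, so this is only an illustration of the idea.

```python
def quantize(block, divisor=10):
    """Divide each DCT coefficient by a constant and drop the fractional part (lossy)."""
    return [[int(value // divisor) for value in row] for value in [None] or block for row in [value]] if False else [[int(v // divisor) for v in row] for row in block]

def zigzag(block):
    """Read the block diagonal by diagonal so that the trailing zeros group together."""
    n = len(block)
    out = []
    for d in range(2 * n - 1):
        coords = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        if d % 2 == 0:
            coords.reverse()   # alternate the direction on every other diagonal
        out.extend(block[r][c] for r, c in coords)
    return out

# An assumed block: a large DC coefficient and AC values that shrink away from T(0,0).
block = [[800 if (r, c) == (0, 0) else max(0, 40 - 10 * (r + c)) for c in range(8)]
         for r in range(8)]
print(zigzag(quantize(block))[:12])   # [80, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0]
```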
https://www.youtube.com/watch?v=X5LXecsGYIc&pp=ygUcSlBFRyBjb21wcmVzc2lvbnNob3J0IHZpZGVvcw%3D%3D
Video Compression: MPEG
Moving Picture Experts Group (MPEG) uses lossy compression within each frame similar to JPEG,
which means pixels from the original images are permanently discarded.
It also uses interframe coding, which further compresses the data by encoding only the differences between periodic frames. MPEG performs the actual compression using the discrete cosine transform (DCT) method.
The two types of compression used in MPEG are:
Spatial Compression :Within a single frame, areas of similar color and texture can be coded with
fewer bits than the original frame, thus reducing the data rate with a minimal loss in noticeable visual
quality. JPEG compression works in a similar way to compress still images. Spatial, or intraframe,
compression is used to create standalone video frames called I-frames (short for intraframe).
Temporal Compression: Instead of storing complete frames, temporal (interframe) compression
stores only what has changed from one frame to the next, which dramatically reduces the amount of
data that needs to be stored while still achieving high-quality images
Video is stored in three types of frames:
1. An I-frame (Intra-coded picture) is a complete, independent frame, like a JPG or BMP image file. It is not related to any other frames. I-frames are present at regular intervals.
2. A P-frame (Predicted picture) holds only the changes in the image from a previous frame. For
example, in a scene where a car moves across a stationary background, only the car's
movements need to be encoded. The encoder does not need to store the unchanging
background pixels in the P-frame, thus saving space. P-frames are also known as delta-frames.
3. A B-frame (Bidirectional predicted picture) saves even more space by using differences
between the current frame and both the preceding and following frames to specify its content
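A toy sketch of the temporal (interframe) idea behind P-frames: only the pixels that changed since the previous frame are stored. The frames here are short invented lists of pixel values, not real video data.

```python
def p_frame(previous, current):
    """Store only (index, new_value) pairs for pixels that changed since the previous frame."""
    return [(i, cur) for i, (prev, cur) in enumerate(zip(previous, current)) if cur != prev]

def reconstruct(previous, delta):
    """Rebuild the current frame from the previous frame plus the stored changes."""
    frame = list(previous)
    for i, value in delta:
        frame[i] = value
    return frame

i_frame = [10, 10, 10, 10, 10, 10]     # complete, independent frame
frame_2 = [10, 10, 99, 10, 10, 10]     # one pixel changed (the "moving car")
delta = p_frame(i_frame, frame_2)
print(delta)                           # [(2, 99)] -- far smaller than the full frame
print(reconstruct(i_frame, delta))     # [10, 10, 99, 10, 10, 10]
```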
4. Streaming Stored Audio/Video
In recent years, audio/video streaming has become a popular class of applications and a major
consumer of network bandwidth. Clients request compressed audio/video files, which are
resident on servers. Four approaches with varying complexities are as follows:
i. First approach: Using a Web Server
ii. Second approach: Using a Web Server with a Metafile
iii. Third approach: Using a Media Server
iv. Fourth approach: Using a Media Server and RTSP
4.1 First Approach: Using a Web Server
In this approach, the audio/video file resides on a Web server and is an ordinary object in the server's file
system, just like any other file. A user downloads the file by establishing a TCP connection with the
Web server and sending an HTTP request for the object. Upon receiving such a request, the Web
server bundles the file in an HTTP response message and sends the response message back into the
TCP connection.
Advantages
1. This approach is very simple.
2. It does not involve streaming.
3. No special protocol or additional mechanism is needed.
4. The multimedia file is fetched just as any other file from the server.
Disadvantages
1. The size of the audio/video file is very large even after compression.
2. The file needs to be downloaded completely before it can be played.
3. A user may need to wait some time before the file can be played.
4.2 Second Approach: Using a Web Server with a Metafile
In the second approach, the web server stores two files:
i. The actual audio/video file.
ii. The metafile that holds information about the audio/video file.
The media player is directly connected to the web server. When the client sends a request for
the audio/video file, the server sends the metafile as response. The metafile is passed to
the media player which downloads the file from the web server.
Advantages
1. It does not involve streaming.
2. No special protocol or additional mechanism is needed.
3. The metafile gives additional information about the multimedia file.
4. The media player is directly connected to the web server. Thus, the file is played as soon as it
is downloaded.
Disadvantages
1. The file needs to be downloaded completely before it can be played.
2. The browser and the media player both use HTTP (which uses TCP). The connection-oriented approach is acceptable for fetching the metafile but not for the actual audio/video file.
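A minimal sketch of the second approach described above: the browser fetches the metafile over HTTP, hands its contents to the media player, and the player then fetches the media file from the URL inside the metafile. The URLs and the one-line metafile format are assumptions made for illustration.

```python
from urllib.request import urlopen

# Hypothetical URL; a real metafile (e.g. a .m3u file) simply names where the media lives.
METAFILE_URL = "http://media.example.com/song.m3u"

def browser_fetch_metafile(url):
    """The browser asks the web server for the metafile over HTTP."""
    with urlopen(url) as response:
        return response.read().decode().strip()   # assumed: body is a single media URL

def media_player_fetch(media_url):
    """The media player downloads the media file named in the metafile."""
    with urlopen(media_url) as response:
        return response.read()

media_url = browser_fetch_metafile(METAFILE_URL)
audio_bytes = media_player_fetch(media_url)
```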
4.3 Third Approach: Using a Media Server
In the third approach, there is another media server to store the audio/video files. The client and
the web server use HTTP for communication. For faster downloading, the media player and media
server communicate using a protocol based on the connectionless protocol UDP as shown.
Process
i. The HTTP client sends a GET request to web server.
ii. The server sends the metafile.
iii. The metafile is passed to the media player.
iv. The media player uses the URL in the metafile to request the media file (using a UDP-based protocol).
v. The media server sends the media file.
Advantages
1. Faster than the above approach because UDP is faster than TCP.
2. No special protocol or additional mechanism is needed.
3. The metafile gives additional information about the multimedia file.
4. The media player is directly connected to the media server. Thus, the file is played as soon as it is
downloaded.
Disadvantages
1. The file needs to be downloaded completely before it can be played.
2. Requires an additional media server.
4.4 Fourth Approach: Using a Media Server and RTSP
Real Time Streaming Protocol (RTSP) is a control protocol designed to add more functionality to the streaming process. It is an application-level protocol that enables control over the delivery of data with real-time properties.
It controls the playing of the audio/video and includes operations like pausing the playback, fast forwarding,
rewinding etc. It is a client-server multimedia presentation protocol.
Steps
i. The HTTP client sends a GET message to the web server.
ii. The web server sends the metafile to the client.
iii. The metafile is passed to the media player.
iv. The media player sets up a connection with the media server.
v. The media server responds.
vi. The media player sends a play message to start playing the audio/video.
vii. The audio/video file is downloaded by using another protocol that runs over UDP
viii. The client ends the playing by sending a TEARDOWN message.
ix. The media server responds.
Advantages
1. Fastest downloading and streaming.
2. RTSP is a special protocol for controlling the streaming process.
3. The file is downloaded/played using underlying UDP. Hence it is faster.
Disadvantages
Requires a special protocol to be used.
Comparison of the four approaches
First approach (Web server): The web server stores the media files. The client fetches the media file directly from the web server. Does not use a metafile. The client and web server use HTTP. The media file has to be downloaded before playing.
Second approach (Web server with metafile): The web server stores the metafile as well as the media file. The client fetches the metafile from the web server and the media player fetches the media file from the web server. Uses a metafile. The client, web server and media player use HTTP. The media file has to be downloaded before playing.
Third approach (Media server): The web server stores the metafiles and the media server stores the media files. The client fetches the metafile from the web server and the media player fetches the media file from the media server. Uses a metafile. The client and web server use HTTP; the media player and media server use UDP. The media file has to be downloaded before playing.
Fourth approach (Media server and RTSP): The web server stores the metafiles and the media server stores the media files. The client fetches the metafile from the web server and the media player fetches the media file from the media server. Uses a metafile. The client and web server use HTTP while the media player and media server use RTSP. The media file is played as it is received (streaming).
5. Streaming Live Audio/Video
Streaming live audio/video is similar to broadcasting audio and video by radio and TV stations.
Instead of broadcasting through the air, the stations broadcast through the Internet.
There are several similarities between streaming stored audio/video and streaming live audio/video. Both are sensitive to delay, and neither can accept retransmission.
However, there is a difference. In the first application, the communication is unicast and on-
demand. In the second, the communication is multicast and live.
Live streaming can use the multicast services of IP and protocols such as UDP and RTP for faster data delivery. Presently, however, live streaming still uses TCP and multiple unicasting instead of multicasting.
6. Real-Time Interactive Audio/Video
In real-time streaming applications, the file is played at the client side as it is being received.
The requirement is that once the playback begins, it should proceed according to the original
timing of the recording.
In real-time interactive audio/video, people communicate with one another in real time. Video
conferencing, internet telephony, voice over IP are examples of such applications.
This imposes strict limits on transmission time, delay and jitter.
The current network protocols cannot fully satisfy these requirements. Moreover, multimedia applications need services that TCP does not provide and that go beyond what UDP offers.
Before we look at these protocols, we will examine some important characteristics of real-time audio/video communication.
i. Time Relationship: Data over a TCP/IP network is sent in the form of packets. For real-time transmission, the network must preserve the time relationship between packets. Let us consider an example where a real-time video server creates live video and sends it online. It creates four packets, each with 10 seconds of video. The server sends the packets at times 0s, 10s, 20s and 30s respectively. Let us assume that it takes 1s for each packet to travel and reach the destination. The packets arrive at the client at times 1s, 11s, 21s and 31s respectively. Hence, the receiver must start the playback with a 1s delay after the server has started the transmission. This is shown in the following diagram.
ii. Jitter: In the above example, all packets arrive after the same amount of delay, i.e. 1s. Hence the receiver can manage this delay by starting the playback 1s late. However, if the packets do not arrive after the same amount of delay, it causes a problem called jitter. Jitter is the variation in delay between packets of a real-time stream.
For example, the first packet may arrive after a 1s delay, the second may arrive after a 5s delay (at 15s) and the third may arrive after a 7s delay (at 27s). After the client has finished playing the first packet, there is a gap since the second has not yet arrived. The same happens for the third packet. This gap is the effect of jitter.
iii. Timestamp: A timestamp is one solution to jitter. If each packet carries a timestamp that shows when the packet was produced relative to the first packet, then the receiver can add this timestamp to the time at which it starts the playback and calculate when each packet should be played. For example, suppose the first packet has timestamp 0s, the second has timestamp 10s, the third has timestamp 20s and the fourth has 30s. If the receiver starts playback of the first packet at 8s, the second will be played at 18s and the third at 28s. This ensures that there are no gaps in the playback despite jitter.
iv. Playback buffer: Since the packet arrival time and the playback time are different, each packet must be buffered before it can be played. This buffer is called the playback buffer. When the first packet starts arriving at the receiver, the client delays the playback until a threshold is reached. The threshold is measured in time units of data, and replay does not start until the buffered data equals the threshold value. In the previous example, the first packet starts arriving at 1s and the threshold is 7s. Hence, the playback time is 8s. The following figure shows the buffer at different times for the previous example (a short sketch after this list of characteristics also illustrates the calculation).
v. Ordering: Ordering assigns a sequence number to each packet. This is used in case packets are lost or reordered. A timestamp alone does not tell the receiver whether a packet has been lost. Hence ordering is a way to ensure that the packets are played back in the correct order.
vi. Multicasting: Multimedia plays an important role in audio and video conferencing, which requires two-way communication between senders and receivers. The data is distributed from a sender to multiple receivers using multicasting.
vii. Translation: The sender and receiver may not be capable of handling data at the same speed. For example, a source may create a high-quality video signal at 5 Mbps while the receiver may have a bandwidth of less than 1 Mbps. Hence, real-time traffic requires a translator that changes the high-bandwidth video signal to a lower-quality, narrow-bandwidth signal. Translation means changing the encoding of a payload to a lower quality to match the bandwidth of the receiving network.
viii. Mixing: A mixer combines several streams of traffic into one stream. If there is more than
one source that is sending data at the same time, the traffic contains multiple streams. These
streams must be converted into one stream by a mixer. A mixer adds signals coming from
different sources to create one single signal.
ix. Support from transport layer protocols: The application layer requires the services of the transport layer for real-time multimedia delivery. TCP is not suitable for interactive traffic since it does not support timestamping or multicasting, although it does provide ordering and sequencing. Moreover, the overheads of TCP are not suitable for real-time data delivery.
UDP is more suitable for interactive multimedia traffic because it supports multicasting and has
no retransmission strategy, but UDP does not provide time-stamping, sequencing or mixing.
Hence we need the services of another transport layer protocol like RTP to handle this
deficiency.
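The sketch below, referred to in the playback-buffer discussion above, replays the timestamp example: packets are produced every 10 s, arrive with varying delays (jitter), and each packet is scheduled at the playback start time plus its timestamp. The packet list simply encodes the numbers from the example.

```python
# (timestamp, arrival_time) pairs from the example: 1 s, 5 s and 7 s network delays.
packets = [(0, 1), (10, 15), (20, 27)]

threshold = 7                                # seconds of data buffered before playback
playback_start = packets[0][1] + threshold   # 1 s + 7 s = 8 s

for timestamp, arrival in packets:
    play_at = playback_start + timestamp     # 8 s, 18 s, 28 s
    on_time = arrival <= play_at             # packet must arrive before it is needed
    print(f"timestamp={timestamp:2}s arrival={arrival:2}s play_at={play_at:2}s on_time={on_time}")
```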
Requirements of Real-Time Interactive Multimedia
1. In order to ensure playback timing and jitter removal timestamps are required.
2. In order to ensure the order of data packets, a sequence number is required.
3. Most of the real-time multimedia applications are video conferencing, live streaming where several
clients receive data. Therefore, the multicast mode is preferred.
4. In order to combine audio and video streams from multiple sources within a single audio/video session,
mixers are required.
5. In order to use high-bit-rate streams over a low-bandwidth network, translators are required.
Transport layer protocol limitations:
a. TCP forces the receiver application to wait for retransmission(s) in case of packet loss, which
causes large delays.
b. TCP cannot support multicast.
c. TCP congestion control mechanisms decrease the congestion window when packet losses are detected ("slow start"). Audio and video, on the other hand, have strict timing rates that cannot be suddenly decreased.
d. TCP headers are larger than a UDP header (40 bytes for TCP compared to 8 bytes for UDP).
e. TCP doesn't contain the necessary timestamp and encoding information needed by the receiving
application.
f. TCP doesn't allow packet loss. In audio/video, some packet loss is tolerable and can be ignored.
Comparison of the three categories
Streaming stored audio/video: Compressed media files are stored on a server and played by downloading/streaming. The client and server communicate using standard protocols such as TCP and UDP; a special protocol such as RTSP is used for streaming. Unicast and on-demand. Not real-time. Example: YouTube etc.
Streaming live audio/video: Audio/video content is broadcast over the internet. Uses multicast services of IP and protocols such as UDP and RTP for faster data delivery. Multicast and live. Real-time. Examples: Internet radio and TV.
Interactive live audio/video: Audio/video content is transmitted as soon as it is captured and played as it is being received. Special-purpose protocols such as RTP, RTCP, SIP and RTSP are used. Multicast, live and interactive. Real-time. Examples: video conferencing, Voice over IP etc.
Protocols
To overcome the TCP limitations and satisfy the requirements of real-time interactive transmission, several protocols have been developed to support real-time traffic over the Internet. These are:
1. RTP (Real-time Transport Protocol): Used for real-time data transport (extends UDP; sits between the application layer and UDP).
2. RTCP (Real-time Transport Control Protocol): Used to exchange control information between sender and receiver; works in conjunction with RTP.
3. SIP (Session Initiation Protocol): Provides mechanisms for establishing calls over IP.
4. RTSP (Real-Time Streaming Protocol): Allows the user to control the display (rewind, fast forward, pause, resume, etc.).
5. RSVP (Resource Reservation Protocol): Intended to reserve resources in a computer network to obtain different qualities of service (QoS).
7. Real-time Transport Protocol (RTP)
Real-time Transport Protocol (RTP) is the protocol designed to handle real-time traffic on the Internet. RTP does not have a delivery mechanism of its own; it must be used with UDP. RTP lies between UDP and the application layer.
RTP ensures real-time delivery and also provides means for:
i. Jitter elimination/reduction
ii. Synchronization of several audio and/or video streams that belong to the same multimedia
session.
iii. Multiplexing of audio/video streams that can belong to different sources.
iv. Translation of streams from one encoding type to another.
v. Time-stamping, sequencing, and mixing facilities.
RTP Packet Format
RTP packets are encapsulated in UDP datagrams. RTP uses a temporary even-numbered UDP port for communication. The RTP packet header format is shown below.
The fields are
Version: This 2-bit field defines the version number. The current version is 2.
P: This field is 1 bit long. If its value is 1, it denotes the presence of padding at the end of the packet; if it is 0, there is no padding.
X: This field is also 1 bit long. If its value is 1, it indicates an extra extension header between the basic header and the data; if it is 0, there is no extension.
Contributor count: This 4-bit field indicates the number of contributors. The maximum possible number of contributors is 15, as a 4-bit field allows values from 0 to 15.
M: This 1-bit field is used as an end marker by the application to indicate the end of its data.
Payload type: This 7-bit field indicates the type of payload, i.e. the type of encoding or format used for the payload data carried in the RTP packet.
Sequence Number: This 16-bit field is used to number the RTP packets and helps in sequencing. The sequence number of the first packet is chosen at random, and every subsequent packet's sequence number is incremented by 1. This field helps to detect lost or out-of-order packets.
Timestamp: This 32-bit field is used to find the time relationship between different RTP packets. The timestamp of the first packet is chosen at random; the timestamp of each subsequent packet is the previous timestamp plus the time taken to produce the first byte of the current packet.
Synchronization Source Identifier: If there is only one source, this 32-bit field defines the source.
However, if there are several sources, the mixer is the synchronization source and the other
sources are contributors. The value of the source identifier is a random number chosen by the
source
Contributor Identifier: This is also a 32-bit field, used for source identification when there is more than one source in a session. The mixer uses the synchronization source identifier, and the remaining sources (maximum 15) use contributor identifiers.
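A minimal sketch that packs the fixed 12-byte RTP header described above (version, P, X, contributor count, M, payload type, sequence number, timestamp and synchronization source identifier) using Python's struct module; the field values are arbitrary examples and contributor identifiers are omitted.

```python
import struct

def pack_rtp_header(seq, timestamp, ssrc, payload_type=0,
                    version=2, padding=0, extension=0, cc=0, marker=0):
    """Pack the fixed 12-byte RTP header in network byte order."""
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | cc   # V, P, X, CC
    byte1 = (marker << 7) | payload_type                              # M, payload type
    # !BBHII = two single bytes, a 16-bit sequence number, a 32-bit timestamp, a 32-bit SSRC
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

header = pack_rtp_header(seq=1, timestamp=0, ssrc=0x12345678)
print(len(header), header.hex())   # 12 bytes
```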
8. Real Time Control Protocol
RTP does not provide any mechanism for error control, flow control, congestion control, or quality feedback.
For that purpose, RTCP is added as a companion to RTP to provide end-to-end monitoring of data delivery and QoS.
It conveys a number of statistics and other information about an RTP flow between recipients
and senders.
RTCP is responsible for the following functions:
I. Feedback on performance of the application and the network.
II. To control the flow and quality of data.
III. Correlation and synchronization of different media streams generated by the same sender (e.g. combined audio and video).
RTCP Message Types
Sender report: The sender report is sent periodically by each active sender in a conference to report transmission and reception statistics for all RTP packets sent during the interval. The report contains an absolute timestamp, the number of seconds elapsed since midnight on January 1, 1970. This helps the receiver synchronize the received RTP messages.
Receiver report: The receiver report is used by passive participants, those that do not send RTP packets. It informs the sender and other receivers about the quality of service.
Source description message: The source sends a source description message at fixed intervals to give extra information about itself, such as the name of the source, its e-mail address, telephone number, or the address of the source controller.
Bye message: A source sends a BYE message to shut down a stream and announce that it is leaving the conference. This message is a direct announcement to the other sources about the absence of a source. It is also useful to a media mixer.
Application-specific message: This is a type of RTCP packet designed to carry application-specific data that is not covered by the standard RTCP packet types.
9. Voice over IP (VoIP)
Voice over IP, or Internet telephony, is a real-time interactive audio/video application. Voice over Internet Protocol (VoIP) is a technology that allows you to make voice calls using a broadband Internet connection instead of a regular phone line.
VoIP services convert your voice into a digital signal that travels over the Internet.
VoIP can allow you to make a call directly from a computer, a special VoIP phone, or a
traditional phone connected to a special adapter.
The idea is to use the Internet as a telephone network with some additional capabilities
Two protocols have been designed to handle this type of communication:
I. Session Initiation Protocol (SIP)
II. H.323
Advantages
1. Cost: free VoIP-to-VoIP calls, low-cost calls from VoIP to the Public Switched Telephone Network (PSTN)
2. Lower bandwidth requirements
3. Low-cost or free software and hardware
4. Mobility: works over any internet connection
5. Offers a wide range of features: from call forwarding, call blocking, caller ID and voicemail to remote management, automatic call distribution and interactive voice response
Disadvantages
1. Quality: lower quality compared to the telephone network
2. VoIP is dependent on the internet connection
3. Lost or delayed packets cause gaps in the signal
4. Hard to determine the caller's geographic location
5. Security: most VoIP services do not support encryption
9.1 Session Initiation Protocol (SIP)
SIP, designed by the IETF (Internet Engineering Task Force), is an application-layer protocol that establishes, manages, and terminates a multimedia session.
It can be used to create two-party, multiparty, or multicast sessions. It is independent of the underlying transport-layer protocol, i.e. it can run over UDP, TCP or SCTP.
SIP is a text-based protocol. It defines six messages.
Each message has the header and body.
The header consists of several lines that describe the structure of the message, the caller's capabilities, the media type, and so on.
The SIP message body describes the session to be initiated. For example, in a SIP phone call the
body usually includes audio codec types, sampling rates, server IP addresses and so on
1. INVITE
INVITE is used to initiate a session with a user agent.
A session is considered established if an INVITE has received a success response or an ACK has been sent.
A successful INVITE request establishes a dialog between the two user agents which continues until a BYE is
sent to terminate the session.
An INVITE sent within an established dialog is known as a re-INVITE.
Re-INVITE is used to change the session characteristics or refresh the state of a dialog.
2. ACK
ACK is used to acknowledge the final responses to an INVITE method. An ACK always goes in the direction
of INVITE.
3. REGISTER
The REGISTER request performs the registration of a user agent. This request is sent by a user agent to a registrar server; it allows a SIP client to register its contact information with a SIP registrar server.
4. CANCEL
CANCEL is used to terminate a session which is not established. User agents use this request to
cancel a pending call attempt initiated earlier.
5. BYE
BYE is the method used to terminate an established session. This is a SIP request that can be sent by
either the caller or the callee to end a session.
6. OPTIONS
OPTIONS method is used to query a SIP server or client about its capabilities, configuration, or the
options it supports.
SIP addresses
In a regular telephone network, each telephone user has a unique telephone number that identifies
that user.
SIP is very flexible. In SIP an email address, an IP address, telephone number, or any other type of
address can be used to identify the sender and receiver.
However, the address needs to be in the SIP format (scheme).
A SIP user is assigned a unique address known as a URI (Uniform Resource Identifier) that is similar in
format to an email address such as sip:user@host. The user part of the address could be an
alphanumeric user id or a phone number. The host part is an IP address, or a domain name.
sip:username@IPv4-address (username and IPv4 address)
sip:phone-number@domain-name (phone number and domain name)
sip:username@domain-name (username and domain name)
sip:phone-number@IPv4-address (phone number and IPv4 address)
SIP Session
A simple session using SIP consists of three modules:
i. Establishing a session: This module requires a three-way handshake. The caller sends an
INVITE message using any transport layer protocol to begin the communication. If the callee
is willing to start the session, the callee sends a reply message. The caller responds with an
ACK message.
ii. Communicating: After the session is established the caller and the callee communicate using
two temporary ports
iii. Terminating a session: Either party can terminate the session with the BYE message.
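A sketch that assembles the text of a SIP INVITE request, illustrating the header-plus-body structure described above. The addresses, Call-ID and SDP body are invented placeholders; this is not a complete or validated SIP implementation.

```python
def build_invite(caller, callee, call_id="1234@caller.example.com"):
    """Assemble a minimal, illustrative SIP INVITE request (header lines + SDP body)."""
    body = "v=0\r\nm=audio 49170 RTP/AVP 0\r\n"     # assumed audio session description
    headers = [
        f"INVITE sip:{callee} SIP/2.0",             # request line
        f"From: sip:{caller}",
        f"To: sip:{callee}",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        "Content-Type: application/sdp",
        f"Content-Length: {len(body)}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body

print(build_invite("alice@example.com", "bob@example.org"))
```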
Tracking the Callee
In many cases, the callee is not at his/her location or terminal. Also, the callee may not have a fixed IP
address.
SIP uses a callee tracking mechanism to find the IP address of the callee.
SIP uses some network elements, each identified by an SIP address
▪ SIP uses some servers as Registrars
▪ A user in the SIP system is registered with at least one registrar server which knows the IP address of
the callee.
▪ When a caller wants to call a callee whose IP address is not known, it sends an INVITE message to the
email address
▪ The message goes to a proxy server which sends a lookup message to the registrar servers
▪ The registrar server responds to the proxy server with the callee's IP address
▪ The proxy server inserts the IP address in the caller's INVITE
▪ The message is sent to the callee
H.323
H.323 is a standard by the ITU (International Telecommunication Union) to allow telephones on the public telephone network to talk to computers connected to the internet.
H.323 is widely used in IP-based videoconferencing, Voice over Internet Protocol (VoIP) and Internet telephony.
The Architecture of H.323
Terminals: A terminal, or client, is an endpoint where H.323 data streams and signaling originate and terminate. A terminal must support audio communication; video and data communication support is optional.
A gateway connects the internet to the telephone network. A gateway provides data format
translation, control signaling translation, audio and video codec translation and call setup and
termination functionality on both sides of the network. The gateway transforms a telephone
network message to an Internet message
The gatekeeper server on the LAN plays the role of the registrar server discussed for the SIP protocol. Gatekeepers are needed to ensure reliable communication. A gatekeeper provides central management and control services like address translation, bandwidth management, routing, etc.
H.323 Protocols
H.323 uses a number of protocols to establish and maintain voice communication.
Audio/video codecs: H.323 uses several audio codecs like G.711 or G.723.1 and video codec standards like H.261 and H.263 for compression.
H.245 protocol: allows the parties to negotiate the compression method.
Q.931 protocol: used for establishing and terminating connections.
H.225 protocol: used for registration with the gatekeeper.
H.323 Example
This example shows the steps used by an H.323 terminal to communicate with a telephone. Let us assume that the number of the terminal is 111 and that of the telephone is 112.
1. The terminal broadcasts an Admission Request (ARQ) message to the gatekeeper with the number 112. The gatekeeper checks its database of registered endpoints to see whether it contains the number 112. If so, the gatekeeper checks whether 111 is allowed to call 112 and whether there is enough bandwidth. If the call can be connected, the gatekeeper replies with an Admission Confirm (ACF) message containing the IP address of 112.
2. The terminal and gatekeeper communicate using H.245 to negotiate the bandwidth.
3. Terminal 111 opens a call signaling channel to the address provided by the gatekeeper. The gatekeeper, the gateway and the telephone communicate using Q.931 to set up a connection.
4. The terminal, the gatekeeper, the gateway and the telephone communicate using H.245 to negotiate the compression method.
5. The terminal, gateway and the telephone exchange audio using RTP under the management
of RTCP
6. The terminal, the gatekeeper, the gateway and the telephone communicate using Q.931 to terminate the communication. Each of the two endpoints informs the gatekeeper about the completed call.