MPEG-FAQ 4.1: What is MPEG-Audio then ?
From: "Harald Popp" <POPP@iis.fhg.de>
Date: Fri, 25 Mar 1994 19:09:06 +0100
Q. What is MPEG?
A. MPEG is an ISO committee that proposes standards for
compression of Audio and Video. MPEG deals with 3 issues:
Video, Audio, and System (the combination of the two into one
stream). You can find more info on the MPEG committee in other
parts of this document.
Q. I've heard about MPEG Video. So this is the same compression
applied to audio?
A. Definitely no. The eye and the ear, even if they are only a
few centimeters apart, work very differently. The ear has
a much higher dynamic range and resolution. It can pick out
more details but it is "slower" than the eye.
The MPEG committee chose to recommend 3 compression methods
and named them Audio Layer-1, Layer-2, and Layer-3.
Q. What does it mean exactly?
A. MPEG-1, IS 11172-3, describes the compression of audio
signals using high performance perceptual coding schemes.
It specifies a family of three audio coding schemes,
simply called Layer-1,-2,-3, with increasing encoder
complexity and performance (sound quality per bitrate).
The three codecs are compatible in a hierarchical
way, i.e. a Layer-N decoder is able to decode bitstream data
encoded in Layer-N and all Layers below N (e.g., a Layer-3
decoder may accept Layer-1,-2 and -3, whereas a Layer-2
decoder may accept only Layer-1 and -2.)
Q. So we have a family of three audio coding schemes. What does
the MPEG standard define, exactly?
A. For each Layer, the standard specifies the bitstream format
and the decoder. It does *not* specify the encoder, which
leaves room for future improvements, but an informative chapter
gives an example of an encoder for each Layer.
Q. What do the three audio Layers have in common?
A. All Layers use the same basic structure. The coding scheme can
be described as "perceptual noise shaping" or "perceptual
subband / transform coding".
The encoder analyzes the spectral components of the audio
signal by calculating a filterbank or transform and applies
a psychoacoustic model to estimate the just noticeable
noise-level. In its quantization and coding stage, the
encoder tries to allocate the available number of data
bits in a way that meets both the bitrate and masking
requirements.
The decoder is much less complex. Its only task is to
synthesize an audio signal out of the coded spectral
components.
All Layers use the same analysis filterbank (polyphase with
32 subbands). Layer-3 adds an MDCT to increase
the frequency resolution.
All Layers use the same "header information" in their
bitstream, to support the hierarchical structure of the
standard.
All Layers use a bitstream structure that contains parts that
are more sensitive to bit errors ("header", "bit
allocation", "scalefactors", "side information") and parts
that are less sensitive ("data of spectral components").
All Layers may use 32, 44.1 or 48 kHz sampling frequency.
All Layers are allowed to work with similar bitrates:
Layer-1: from 32 kbps to 448 kbps
Layer-2: from 32 kbps to 384 kbps
Layer-3: from 32 kbps to 320 kbps
Q. What are the main differences between the three Layers, from a
A. From Layer-1 to Layer-3,
complexity increases (mainly true for the encoder),
overall codec delay increases, and
performance increases (sound quality per bitrate).
Q. Which Layer should I use for my application?
A. Good question. Of course, it depends on all your requirements.
But as a first approach, you should consider the available
bitrate of your application as the Layers have been
designed to support certain areas of bitrates most
efficiently, i.e. with a minimum drop of sound quality.
Let us look a little closer at the strong domains of each
Layer.
Layer-1: Its ISO target bitrate is 192 kbps per audio
channel.
Layer-1 is a simplified version of Layer-2. It is most useful
at "high" bitrates around or above
192 kbps. A version of Layer-1 is used as "PASC" with the
Layer-2: Its ISO target bitrate is 128 kbps per audio
channel.
Layer-2 is identical with MUSICAM. It has been designed as a
trade-off between sound quality per bitrate and encoder
complexity. It is most useful at
"medium" bitrates of 128 or even 96 kbps per audio
channel. The DAB (EU 147) proponents have decided to use
Layer-2 in the future Digital Audio Broadcasting network.
Layer-3: Its ISO target bitrate is 64 kbps per audio channel.
Layer-3 merges the best ideas of MUSICAM and ASPEC. It has
been designed for best performance at "low" bitrates
around 64 kbps or even below. The Layer-3 format specifies
a set of advanced features that all address one goal: to
preserve as much sound quality as possible even at rather
low bitrates. Today, Layer-3 is already in use in various
telecommunication networks (ISDN, satellite links, and so
on) and speech announcement systems.
Q. So how does MPEG audio work?
A. Well, first you need to know how sound is stored in a
computer. Sound is pressure differences in air. When picked up
by a microphone and fed through an amplifier this becomes
voltage levels. The voltage is sampled by the computer a
number of times per second. For CD audio quality you need to
sample 44100 times per second and each sample has a resolution
of 16 bits. In stereo this gives you about 1.4 Mbit per second
and you can probably see the need for compression.
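As a quick sanity check of that 1.4 Mbit figure, here is a small Python sketch. The function name pcm_bitrate is made up for this illustration; the numbers are the CD parameters quoted above.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

cd_stereo = pcm_bitrate(44100, 16, 2)
print(cd_stereo)         # 1411200 bits per second
print(cd_stereo / 1e6)   # 1.4112 Mbit per second
```

So one minute of CD-quality stereo is roughly 10 MByte before any compression.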
To compress audio, MPEG tries to remove the irrelevant parts
of the signal and the redundant parts of the signal. Parts of
the sound that we do not hear can be thrown away. To do this
MPEG Audio uses psychoacoustic principles.
Q. Tell me more about sound quality. How good is MPEG audio
compression? And how do you assess that?
A. Today, there is no alternative to expensive listening tests.
During the ISO-MPEG-1 process, 3 international listening tests
were performed with many trained listeners,
supervised by Swedish Radio. They took place in July 1990,
March 1991 and November 1991. Another international listening
test was performed by CCIR, now ITU-R, in 1992.
All these tests used the "triple stimulus, hidden reference"
method and the so-called CCIR impairment scale to assess the
The listening sequence is "ABC", with A = original, BC = pair
of original / coded signal in random order, and the
listener has to rate both B and C with a number
between 1.0 and 5.0. The meaning of these values is:
5.0 = transparent (this should be the original signal)
4.0 = perceptible, but not annoying (first differences
3.0 = slightly annoying
2.0 = annoying
1.0 = very annoying
With perceptual codecs (like MPEG audio), all traditional
parameters (like SNR, THD+N, bandwidth) are especially
Fraunhofer-IIS (among others) works on objective quality
assessment tools, like the NMR meter (Noise-to-Mask-Ratio),
too. If you need more information about NMR, please
Q. Now that I know how to assess quality, come on, tell me the
results of these tests.
A. Well, for details you should study one of those AES papers
listed below. One main result is that for low bitrates (60
or 64 kbps per channel, i.e. a compression ratio of around
12:1), Layer-2 scored between 2.1 and 2.6, whereas Layer-3
scored between 3.6 and 3.8.
This is a significant increase in sound quality, indeed!
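To see where a ratio of "around 12:1" comes from, here is a rough Python sketch. It assumes the uncompressed source is 16-bit PCM; the function name and the choice of sampling rates to try are assumptions made for this illustration, not something the test reports specify.

```python
def compression_ratio(coded_bps_per_channel, sample_rate_hz, bits_per_sample=16):
    """Ratio of raw PCM bitrate to coded bitrate, for one audio channel."""
    raw_bps = sample_rate_hz * bits_per_sample
    return raw_bps / coded_bps_per_channel

print(compression_ratio(64000, 48000))             # 12.0
print(round(compression_ratio(64000, 44100), 2))   # 11.03
print(round(compression_ratio(60000, 44100), 2))   # 11.76
```

Depending on the sampling rate, 60 to 64 kbps per channel lands between about 11:1 and 12:1.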
Furthermore, the selection process for critical sound material
showed that it was rather difficult to find worst-case
material for Layer-3 whereas it was not so hard to find
such items for Layer-2.
For medium and high bitrates (120 kbps or more per channel),
Layer-2 and Layer-3 scored rather similarly, i.e. even
trained listeners found it difficult to detect differences
between original and reconstructed signal.
Q. So how does MPEG achieve this compression ratio?
A. Well, with audio you basically have two alternatives. Either
you sample less often or you sample with less resolution (fewer
than 16 bits per sample). If you want quality you can't do much
with the sample frequency. Humans can hear sounds with
frequencies from about 20 Hz to 20 kHz. According to the Nyquist
theorem you must sample at a rate of at least twice the highest
frequency you want to reproduce. Allowing for imperfect
filters, a 44.1 kHz sampling rate is a fair minimum. So
you either set out to prove the Nyquist theorem is wrong or
go to work on reducing the resolution. The MPEG committee
chose the latter.
Now, the real reason for using 16 bits is to get a good
signal-to-noise (s/n) ratio. The noise we're talking
about here is quantization noise from the digitizing
process. For each bit you add, you get about 6 dB
better s/n. (6 dB corresponds to a doubling of
the signal amplitude.) CD-audio achieves about 90 dB s/n. This
matches the dynamic range of the ear fairly well. That is, you
will not hear any noise coming from the system itself (well,
there are still some people arguing about that, but let's not
worry about them for the moment).
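The "6 dB per bit" rule of thumb is easy to check with a few lines of Python. This is only the idealized figure 20*log10(2^bits); the function name is made up here, and real converters fall somewhat short of it, which fits the "about 90 dB" quoted for CD audio.

```python
import math

def ideal_snr_db(bits):
    """Rule-of-thumb s/n of an ideal quantizer: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits)

for bits in (1, 8, 16):
    print(bits, round(ideal_snr_db(bits), 1))
# 1 6.0
# 8 48.2
# 16 96.3
```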
So what happens when you sample to 8 bit resolution? You get
a very noticeable noise floor in your recording. You can
easily hear this in silent moments in the music or between
words or sentences if your recording is a human voice.
Waitaminnit. You don't notice any noise in loud passages,
right? This is the masking effect and is the key to MPEG Audio
coding. Stuff like the masking effect belongs to a science
called psycho-acoustics that deals with the way the human
brain perceives sound.
And MPEG uses psychoacoustic principles when it does its
Q. Explain this masking effect.
A. OK, say you have a strong tone with a frequency of 1000Hz.
You also have a tone nearby of say 1100Hz. This second tone is
18 dB lower. You are not going to hear this second tone. It is
completely masked by the first 1000Hz tone. As a matter of
fact, any relatively weak sound near a strong sound is
masked. If you introduce another tone at 2000Hz also 18 dB
below the first 1000Hz tone, you will hear this.
You will have to turn down the 2000Hz tone to something like
45 dB below the 1000Hz tone before it will be masked by the
first tone. So the further you get from a sound the less
masking effect it has.
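The two examples above can be turned into a toy rule in Python. Be warned: the straight line drawn between the two data points (18 dB at 100 Hz distance, 45 dB at 1000 Hz distance) is purely an assumption for illustration, and the function names are invented. Real psychoacoustic models are far more elaborate.

```python
# Toy masking rule built only from the two data points in the text:
#   100 Hz from the masker  -> a tone 18 dB below it is masked
#   1000 Hz from the masker -> a tone must be 45 dB below it to be masked
# The linear interpolation between these points is an ASSUMPTION.

def masking_threshold_rel_db(offset_hz):
    """Level (dB relative to the masker) at or below which a tone is masked."""
    return -18.0 - 27.0 * (abs(offset_hz) - 100.0) / 900.0

def is_masked(masker_hz, tone_hz, tone_rel_db):
    """True if a tone at tone_rel_db (relative to the masker) is inaudible."""
    return tone_rel_db <= masking_threshold_rel_db(tone_hz - masker_hz)

print(is_masked(1000, 1100, -18))   # True:  close to the masker
print(is_masked(1000, 2000, -18))   # False: too far away at this level
print(is_masked(1000, 2000, -45))   # True:  quiet enough to be masked
```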
The masking effect means that you can raise the noise floor
around a strong sound because the noise will be masked anyway.
And raising the noise floor is the same as using fewer bits
and using fewer bits is the same as compression. Do you get it?
Q. I don't get it.
A. Well, let me try to explain how the MPEG Audio Layer-2 encoder
goes about its thing. It divides the frequency spectrum (20Hz
to 20kHz) into 32 subbands. Each subband holds a little slice
of the audio spectrum. Say, in the upper region of subband 8,
a 6500Hz tone with a level of 60dB is present. OK, the
coder calculates the masking effect of this sound and finds
that there is a masking threshold for the entire 8th
subband (all sounds w. a frequency...) 35dB below this tone.
The acceptable s/n ratio is thus 60 - 35 = 25 dB. This equals
about 4 bits of resolution. In addition there are masking effects
on bands 9-13 and on bands 5-7, the effect decreasing with the distance
from band 8.
In a real-life situation you have sounds in most bands and the
masking effects are additive. In addition the coder considers
the sensitivity of the ear for various frequencies. The ear
is a lot less sensitive in the high and low frequencies. Peak
sensitivity is around 2 - 4kHz, the same region that the human
The subbands should match the ear, that is each subband should
consist of frequencies that have the same psychoacoustic
properties. In MPEG Layer 2, each subband is 750Hz wide
(with 48 kHz sampling frequency). It would have been better if
the subbands were narrower in the low frequency range and
wider in the high frequency range. That is the trade-off
Layer-2 took in favour of a simpler approach.
Layer-3 has a much higher frequency resolution (18 times
more) - and that is one of the reasons why Layer-3 has a much
better low bitrate performance than Layer-2.
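For the curious, the subband widths follow directly from the sampling frequency. The Python sketch below assumes 32 equal-width subbands covering 0 Hz up to half the sampling frequency, and takes the "18 times" factor at face value by dividing each subband width by 18. The function name is invented for this illustration.

```python
def subband_width_hz(sample_rate_hz, num_subbands=32):
    """Width of one of num_subbands equal bands between 0 and fs/2."""
    return (sample_rate_hz / 2) / num_subbands

for fs in (32000, 44100, 48000):
    width = subband_width_hz(fs)
    print(fs, width, round(width / 18, 1))
# 32000 500.0 27.8
# 44100 689.0625 38.3
# 48000 750.0 41.7
```

At 48 kHz that is 750 Hz per subband for Layer-2 against roughly 42 Hz for the finer Layer-3 resolution.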
But there is more to it. I have explained concurrent masking,
but the masking effect also occurs before and after a strong
sound (pre- and postmasking).
A. Yes, if there is a significant (30 - 40dB ) shift in level.
The reason is believed to be that the brain needs some
processing time. Premasking is only about 2 to 5 ms. The
postmasking can last up to 100 ms.
Other bit-reduction techniques involve considering tonal and
non-tonal components of the sound. For a stereo signal you
may have a lot of redundancy between channels. All MPEG
Layers may exploit these stereo effects by using a "joint-
stereo" mode, with a most flexible approach for Layer-3.
Furthermore, only Layer-3 further reduces the redundancy
by applying Huffman coding.
Q. What is the downside?
A. The coder calculates masking effects by an iterative process
until it runs out of time. It is up to the implementor to
spend bits in the least obtrusive fashion.
For Layer 2 and Layer 3, the encoder works on 24 ms of sound
(with 1152 samples and fs = 48 kHz) at a time. For some
material, the time-window can be a problem. This is
normally in a situation with transients where there are large
differences in sound level over the 24 ms. The masking is
calculated on the strongest sound and the weak parts will
drown in quantization noise. This is perceived as a "noise-
echo" by the ear. Layer 3 addresses this problem
specifically by using a smaller analysis window (4 ms), if
the encoder encounters an "attack" situation.
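The 24 ms figure is just the frame length divided by the sampling frequency. A small Python sketch (function name made up here) shows how the window grows at the lower sampling frequencies:

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """Time span covered by one frame, in milliseconds."""
    return 1000 * samples_per_frame / sample_rate_hz

for fs in (48000, 44100, 32000):
    print(fs, round(frame_duration_ms(1152, fs), 2))
# 48000 24.0
# 44100 26.12
# 32000 36.0
```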
Q. Tell me about the complexity. What are the hardware demands?
A. Alright. First, we have to distinguish between decoder and
encoder.
Remember: MPEG coding is asymmetrical, with a much
larger workload on the encoder than on the decoder.
For a stereo decoder, various real-time implementations exist
for Layer-2 and Layer-3. They are either based on single-DSP
solutions or on dedicated MPEG audio decoder chips. So
you need not worry about decoder complexity.
For a stereo Layer-2-encoder, various DSP based solutions with
one or more DSPs exist (with different quality, also).
For a stereo Layer-3-encoder achieving ISO reference quality,
the current real-time implementations use two DSP32C and
Q. How many audio channels?
A. MPEG-1 allows for two audio channels. These can be either
single (mono), dual (two mono channels), stereo or
joint stereo (intensity stereo (Layer-2 and Layer-3) or m/s-
stereo (Layer-3 only)).
In normal (l/r) stereo one channel carries the left audio
signal and one channel carries the right audio signal. In
m/s stereo one channel carries the sum signal (l+r) and the
other the difference (l-r) signal. In intensity stereo the
high frequency part of the signal (above 2kHz) is combined.
The stereo image is preserved but only the temporal envelope
is transmitted.
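The m/s idea is simple enough to sketch in Python. This uses the plain sum and difference exactly as described above; any scaling a real codec applies is left out, and the function names and sample values are invented for this illustration.

```python
def ms_encode(left, right):
    """Turn l/r sample lists into sum (l+r) and difference (l-r) lists."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover l/r from sum and difference: l = (m+s)/2, r = (m-s)/2."""
    left = [(m + s) / 2 for m, s in zip(mid, side)]
    right = [(m - s) / 2 for m, s in zip(mid, side)]
    return left, right

left = [100, 200, -50]
right = [98, 203, -50]
mid, side = ms_encode(left, right)
print(mid)    # [198, 403, -100]
print(side)   # [2, -3, 0]
print(ms_decode(mid, side))   # ([100.0, 200.0, -50.0], [98.0, 203.0, -50.0])
```

When both channels are nearly identical, the difference signal stays close to zero and needs very few bits. That is the redundancy m/s stereo exploits.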
In addition MPEG allows for pre-emphasis, copyright marks and
original/copy marks. MPEG-2 allows for several channels in
the same stream.
Q. What about the audio codec delay?
A. Well, the standard gives some figures of the theoretical
Layer-1: 19 ms (<50 ms)
Layer-2: 35 ms (100 ms)
Layer-3: 59 ms (150 ms)
The practical values are significantly above that. As they
depend on the implementation, exact figures are hard to
give. So the figures in brackets are just rough thumb
Yes, for some applications, a very short delay is of critical
importance. E.g. in a feedback link, a reporter can only talk
intelligibly if the overall delay is below around 10 ms.
If broadcasters want to apply MPEG audio coding, they have to
use "N-1" switches in the studio to overcome this problem
(or appropriate echo-cancellers) - or they have to forget
about MPEG altogether.
But with most applications, these figures are small enough to
present no extra problem. At least, if one can accept a Layer-
2 delay, one can most likely also accept the higher Layer-3
Q. OK, I am hooked! Where can I find more technical
information about MPEG audio coding, especially about Layer-
A. Well, there is a variety of AES papers, e.g.
K. Brandenburg, G. Stoll, ...: "The ISO/MPEG-Audio Codec: A
Generic Standard for Coding of High Quality Digital Audio",
92nd AES, Vienna 1992, pp.3336
E. Eberlein, H. Popp, ...: "Layer-3, a Flexible Coding
Standard", 94th AES, Berlin 93, pp.3493
K. Brandenburg, G. Zimmer, ...: "Variable Data-Rate Recording
on a PC Using MPEG-Audio Layer-3", 95th AES, New York 93
B. Grill, J. Herre,... : "Improved MPEG-2 Audio Multi-Channel
Encoding", 96th AES, Amsterdam 94
And for further information, please contact email@example.com
Q. Where can I get more details about MPEG audio?
A. Still more details? No shit. You can get the full ISO spec
from Omnicom. The specs do a fairly good job of obscuring
exactly how these things are supposed to work... Jokes aside,
there is no description of the coder in the specs. The specs
describe in great detail the bitstream and suggest
Originally written by Morten Hjerde <100034,firstname.lastname@example.org>,
modified and updated by Harald Popp (email@example.com).
Audio & Multimedia ("Music is the *BEST*" - F. Zappa)
Fraunhofer-IIS-A, Weichselgarten 3, D-91058 Erlangen, Germany