Dialogue Enhancement With MPEG-H - An Update
This Express Paper was selected on the basis of a submitted synopsis that has been peer reviewed by at least two qualified
anonymous reviewers. The complete manuscript was not peer reviewed. This express paper has been reproduced from the
author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no
responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved.
Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio
Engineering Society.
ABSTRACT
Difficulties in following speech on TV due to loud background sounds are a common issue in broadcasting. Object-
based audio (OBA) systems like MPEG-H Audio can solve this problem by providing a personalized speech level.
Recently, international broadcasters have employed dialogue enhancement (DE) together with OBA, providing
customization and improved accessibility to their audiences, e.g., during the football World Cup 2022. To also add
customizable dialogues to material produced without OBA, deep neural networks (DNNs) can be applied to
separate dialogues from the music and effects of the final audio mix. One of the technologies used for this is
MPEG-H Dialog+, which has recently been adopted for the new “Clear Speech” service of the on-demand platform
of the German public broadcaster ARD. This paper reviews the current state of DE, detailing real-world adoptions,
with particular focus on the MPEG-H Audio system. The intention is to provide an up-to-date overview of
successful implementations of DE solutions into production workflows as an example for further adoptions and
developments.
¹ WDR is a constituent member of ARD, the joint organisation of Germany’s regional public-service broadcasters.
Rieger et al. Dialogue Enhancement with MPEG-H Audio
Figure 1. Survey by WDR and Fraunhofer IIS on understanding speech on TV carried out on
a national scale in Germany [2].
accessibility service “Clear Speech” [3] (discussed in Sec. 4.2). Other European content providers are building similar DE services, for example the Swedish public broadcaster SVT [4].

This paper reviews the current state of object-based personalization and related technologies (Sec. 2.1). It details real-world adoptions of DE technologies with emphasis on the MPEG-H Audio system (Sec. 2.2). For that purpose, it examines cases where an object-based production is available as well as cases in which the objects are estimated by a deep neural network (DNN) (Sec. 3). It is meant to provide an overview of successful implementations of DE technology into existing production workflows in Brazil and South Korea (Sec. 4.1), as examples for further adoptions and developments.

2. Object-based Dialogue Enhancement
Sound mixing is an important asset for TV productions, as the composition of music, effects, and ambience not only serves as a background for the dialogue but is part of the storytelling. It is a complex task to produce content that, on the one hand, immerses viewers into the scene by providing a compelling mix, and, on the other hand, prevents loud background sounds from masking speech, allowing the audience to fully understand the dialogue without high listening effort. The topic of DE therefore has a long history, dating back more than 30 years.

2.1 Related Works
As early as 1991, BBC Research & Development published a study documenting regular complaints about speech that was difficult to understand – mostly due to loud background noises and music [1]. The goal of the study was to understand how much the background level would have to be lowered to significantly improve the intelligibility of speech on TV. However, the results were inconclusive; it only became clear later that the optimum volume difference between dialogue and music & effects (M&E) is highly personal. The study also pointed out that the broadcasting system at that time was not capable of transmitting an additional track with increased speech level [1].

In 2011, the BBC and Fraunhofer IIS jointly conducted a public field test during the Wimbledon Tennis Championships, in which viewers had the opportunity to personalize the dialogue level. The results showed that while one part of the subjects preferred to clearly enhance the dialogue level, the other part of the viewers chose to increase the ambience sounds [5]. The preferred loudness differences between dialogue and background were investigated in more detail in further studies that also showed a significant difference between expert and non-expert listeners [6].

Furthermore, in 2019, the BBC conducted a public trial together with the University of Salford, using a narrative importance approach [7]. That way, non-speech sounds are grouped together based on their importance to the narrative, preventing the loss of significant sounds due to global background attenuation [8]. Participants in the trial could not only enhance the dialogue, but also narratively important sounds. The trial was followed by a survey, showing that out of the 299 participants 73% rated the content more enjoyable or easier to understand when using the personalization feature.
comprises various content types, such as nature documentaries, sports programs, and feature films. The audio stems are manually edited to exclude any parts where non-speech sounds are present in the dialogue stem or speech is present in the M&E stem. This helps prevent faulty training where, for example, sounds are incorrectly identified as speech and separated. DS cannot always separate dialogue perfectly and some degree of coloration or distortion might be introduced. There are ongoing efforts to improve the quality of DS as well as to automatically control the final quality [29].

MPEG-H Dialog+ combines DS with automatic remixing, allowing global or time-varying background attenuation, or a combination of both. Users who prefer minimal background signal might benefit from global background attenuation, where the relative background level is attenuated by the same specified amount over the entire signal. Since globally attenuating the background might decrease ambience and sounds of narrative importance [8], time-varying background attenuation only lowers the background level when dialogue is present. The parameters for DS and remixing can be adjusted, and it is possible to combine global and time-varying attenuation [2]. Finally, the automatic authoring produces audio and metadata ready to be used in an object-based workflow. MPEG-H Dialog+ was shown to successfully enable personalization [2], and to reduce listening effort (LE), as discussed below.

3.3 Reducing Listening Effort (LE)
DE and the DS network of MPEG-H Dialog+ were evaluated in [30], where a multimodal evaluation of LE was carried out on subjects without hearing impairments and using TV excerpts. Pupil size was considered as a physiological indicator of LE together with self-reported LE and word recall rate. A common trend was observed across all measures of LE: It was shown that background music and effects cause significant LE in broadcast audio, even under ideal listening conditions. Decreasing the background level via DE consistently reduced the LE. This was shown to be the case both when the original audio objects were used for DE and when DE was enabled by the DS network of MPEG-H Dialog+. It was concluded that background music and effects can carry vital information and play an important role in engaging and entertaining the audience [31]. They do, however, come with a clear LE cost, requiring care at the production stage (cf. e.g., the recommendations in [32]), and the personalization functionalities provided by DE for users.

4. Dialogue Enhancement Production Workflows
In the following, we outline three current real-world use cases, where DE was implemented in broadcast and streaming workflows. The examples from Globo² (Brazil) and SBS³ (South Korea) are live, object-based productions where speech and background sounds were provided as separate components. The described service by German public-service broadcaster ARD works with DS in a channel-based offline production environment. While DE for live transmission is normally faced with challenging latency requirements down to a maximum of a few frames, post-production workflows can typically be covered with offline processing, providing the option of more lookahead and processing time for DS and remixing.

4.1 Object-based Live Broadcast
The football World Cup 2022 in Qatar was a recent major event for which object-based DE was provided in several broadcasting and streaming services [33]. Both Globo and SBS implemented interactive, personalized MPEG-H Audio in their services. This enabled viewers to switch between different presets, including an option with attenuated background sounds for improved ease of listening. Additionally, due to the object-based production, advanced personalization options were possible: the balance between dialogue and background could be changed by users with a slider in the user interface, see Figure 2.

At the Globo facilities, two live broadcast services were set up during the World Cup: one for the current SBTVD 2.5 broadcast standard with HE-AAC or AAC-LC and MPEG-H Audio [12], and one for SBTVD TV 3.0 (via DASH streaming), which is currently in the standardization phase and specifies MPEG-H Audio as the sole mandatory audio codec [34] [35]. Globo received the broadcast signals from the World Cup’s host broadcaster at its facilities in Rio de Janeiro. The local commentary by Globo sports channels was added to the live feed from Qatar.
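
The global and time-varying background attenuation described above for MPEG-H Dialog+, as well as a preset with attenuated background in an object-based service, can be illustrated with a minimal sketch. It assumes that a dialogue stem and a background stem are already available as separate mono sample sequences, together with a per-sample speech-activity flag. The names `db_to_gain`, `remix`, `speech_active`, `global_att_db`, and `ducking_att_db` are hypothetical and chosen for illustration only; this is not the MPEG-H Dialog+ implementation, and it omits the gain smoothing that a practical system would likely need to avoid audible level jumps.

```python
def db_to_gain(level_db: float) -> float:
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)


def remix(dialogue, background, speech_active,
          global_att_db=0.0, ducking_att_db=0.0):
    """Remix two stems with attenuated background (illustrative only).

    dialogue, background: equal-length sequences of mono samples.
    speech_active: per-sample booleans, assumed to come from some
        speech-activity analysis that is not part of this sketch.
    global_att_db: attenuation applied to the background everywhere.
    ducking_att_db: additional attenuation applied to the background
        only while speech is flagged as active.
    """
    if not (len(dialogue) == len(background) == len(speech_active)):
        raise ValueError("all inputs must have the same length")
    out = []
    for d, b, active in zip(dialogue, background, speech_active):
        att_db = global_att_db + (ducking_att_db if active else 0.0)
        out.append(d + db_to_gain(-att_db) * b)
    return out


# Toy signals: speech in the first half, constant background throughout.
dialogue = [0.5, 0.5, 0.0, 0.0]
background = [0.2, 0.2, 0.2, 0.2]
speech_active = [True, True, False, False]

# Global attenuation lowers the background over the entire signal.
mix_global = remix(dialogue, background, speech_active, global_att_db=6.0)

# Time-varying attenuation lowers it only where dialogue is present.
mix_ducked = remix(dialogue, background, speech_active, ducking_att_db=12.0)

# Both can be combined, as the text notes.
mix_both = remix(dialogue, background, speech_active,
                 global_att_db=3.0, ducking_att_db=9.0)
```

In the time-varying case the background keeps its original level in the speech-free second half, which is how ambience and narratively important sounds are preserved when nobody is talking.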
² Globo is a Brazilian free-to-air television network, owned by media conglomerate Grupo Globo, and the largest commercial TV network in Latin America.
³ Seoul Broadcasting System (SBS) is one of the leading South Korean television and radio broadcasters and the largest private broadcaster in South Korea.
To allow different presets and user personalization, the signal was enriched with MPEG-H metadata by an AMAU (Audio Monitoring and Authoring Unit) and passed along together with the video signals to the corresponding video encoders for simultaneous delivery over the TV 2.5 and TV 3.0 systems [36].

processing in this use case is part of a containerized automatic workflow. It receives the audio mix from an archive or production. After DS processing and automatic remixing, stereo audio with enhanced dialogue is rendered to a file that is loudness-normalized according to EBU R128 [38]. The new audio mix is then muxed with video versions of different bitrates and encoded in parallel.
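
The loudness-normalization step mentioned above can be sketched as a simple gain calculation. EBU R128 specifies a target programme loudness of -23 LUFS. The sketch assumes that the integrated loudness of the rendered file has already been measured by a separate meter following ITU-R BS.1770, which is not implemented here. The function names `normalization_gain_db` and `apply_gain` and the measured value of -18.5 LUFS are hypothetical, and true-peak limiting is left out, so this is only an illustration of the principle rather than the workflow actually used.

```python
R128_TARGET_LUFS = -23.0  # target programme loudness in EBU R128


def normalization_gain_db(measured_lufs: float,
                          target_lufs: float = R128_TARGET_LUFS) -> float:
    """Return the static gain in dB that moves the measured loudness to the target."""
    return target_lufs - measured_lufs


def apply_gain(samples, gain_db: float):
    """Scale all samples by a gain given in decibels."""
    factor = 10.0 ** (gain_db / 20.0)
    return [factor * s for s in samples]


# Hypothetical measurement of the remixed file: 4.5 LU above target.
measured = -18.5
gain_db = normalization_gain_db(measured)  # -4.5 dB

# Apply the same static gain to every sample of the (toy) signal.
normalized = apply_gain([0.8, -0.4, 0.2], gain_db)
```

A single static gain leaves the balance between the enhanced dialogue and the background untouched, so the normalization does not undo the remixing.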
[16] E. Vincent, T. Virtanen and S. Gannot, Audio Source Separation and Speech Enhancement, John Wiley & Sons, 2018.
[17] A. S. Master, L. Lu, H. M. Lehtonen, H. Mundt, H. Purnhagen and D. Darcy, “Dialog Enhancement via Spatio-Level Filtering and Classification,” in AES Convention 149, Virtual, 2020.
[18] J. T. Geiger, P. Grosche and Y. L. Parodi, “Dialogue Enhancement of Stereo Sound,” in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015.
[19] A. Craciun, C. Uhle and T. Bäckström, “An Evaluation of Stereo Speech Enhancement Methods for Different Audio-Visual Scenarios,” in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015.
[20] J. Paulus, M. Torcoli, C. Uhle, J. Herre, S. Disch and H. Fuchs, “Source Separation for Enabling Dialogue Enhancement in Object-based Broadcast with MPEG-H,” Journal of the Audio Engineering Society (JAES), vol. 67, no. 7/8, pp. 510-521, 2019.
[21] C. Uhle, O. Hellmuth and J. Weigel, “Speech Enhancement of Movie Sound,” in AES Convention 125, San Francisco, USA, 2008.
[22] N. L. Westhausen, R. Huber, H. Baumgartner, R. Sinha, J. Rennies and B. T. Meyer, “Reduction of Subjective Listening Effort for TV Broadcast Signals With Recurrent Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3541-3550, 2021.
[23] J. Paulus and M. Torcoli, “Sampling Frequency Independent Dialogue Separation,” in European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 2022.
[24] D. Petermann, G. Wichern, Z.-Q. Wang and J. Le Roux, “The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022.
[25] M. Torcoli and E. A. P. Habets, “Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023.
[26] “IDC,” Audionamix, [Online]. Available: https://audionamix.com/instant-dialogue-cleaner/. [Accessed March 2023].
[27] “RX Dialogue Isolate,” iZotope, [Online]. Available: https://www.izotope.com/en/products/rx/features/dialogue-isolate.html. [Accessed March 2023].
[28] “Clarity Vx Pro,” Waves, [Online]. Available: https://www.waves.com/plugins/clarity-vx-pro. [Accessed March 2023].
[29] M. Torcoli, J. Paulus, T. Kastner and C. Uhle, “Controlling the Remixing of Separated Dialogue with a Non-Intrusive Quality Estimate,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Virtual, 2021.
[30] M. Torcoli, T. Robotham and E. A. P. Habets, “Dialogue Enhancement and Listening Effort in Broadcast Audio: A Multimodal Evaluation,” in 14th International Conference on Quality of Multimedia Experience (QoMEX), Lippstadt, Germany, 2022.
[31] L. Ward, B. G. Shirley, Y. Tang and W. J. Davies, “The Effect of Situation-Specific Non-Speech Acoustic Cues on the Intelligibility of Speech in Noise,” in INTERSPEECH, Stockholm, Sweden, 2017.
[32] D. Geary, M. Torcoli, J. Paulus, C. Simon, D. Straninger, A. Travaglini and B. Shirley, “Loudness Differences for Voice-Over-Voice Audio in TV and Streaming,” Journal of the Audio Engineering Society (JAES), vol. 68, no. 11, pp. 810-818, 2020.
[33] Fraunhofer IIS, “Football Fans Around the World Experience the Worldcup in Immersive and Personalized MPEG-H Audio,” 2022. [Online]. Available: https://www.iis.fraunhofer.de/en/pr/2022/20221208_mpeg-h_audio_worldcup.html. [Accessed April 2023].
[34] Brazilian Digital Terrestrial Television System Forum; Brazilian Ministry of Communications, “Testing and Evaluation Report: TV 3.0 Project - Audio Coding,” SBTVD Forum, Brazil, 2021.
[35] A. Murtaza, S. Meltzer, Y. Grewe, N. Faecks, M. Raulet and L. Gregory, “MPEG-H Audio System for SBTVD TV 3.0 Call for Proposals,” SET International Journal of Broadcast Engineering, 2021, doi: 10.18580/setijbe.2021.3.