
Audio Engineering Society

Convention Express Paper 84


Presented at the 154th Convention
2023 May 13-15, Espoo, Helsinki, Finland

This Express Paper was selected on the basis of a submitted synopsis that has been peer reviewed by at least two qualified
anonymous reviewers. The complete manuscript was not peer reviewed. This express paper has been reproduced from the
author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no
responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved.
Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio
Engineering Society.

Dialogue Enhancement with MPEG-H Audio: An Update on Technology and Adoption

Daniela Rieger, Christian Simon, Matteo Torcoli, and Harald Fuchs


Fraunhofer Institute for Integrated Circuits, Am Wolfsmantel 33, 91058 Erlangen, Germany
Correspondence should be addressed to Daniela Rieger ([email protected])

ABSTRACT
Difficulties in following speech on TV due to loud background sounds are a common issue in broadcasting. Object-
based audio (OBA) systems like MPEG-H Audio can solve this problem by providing a personalized speech level.
Recently, international broadcasters have employed dialogue enhancement (DE) together with OBA, providing
customization and improved accessibility to their audiences, e.g., during the football World Cup 2022. To also add
customizable dialogues to material produced without OBA, deep neural networks (DNNs) can be applied to
separate dialogues from the music and effects of the final audio mix. One of the technologies used for this is
MPEG-H Dialog+, which has recently been adopted for the new “Clear Speech” service of the on-demand platform
of the German public broadcaster ARD. This paper reviews the current state of DE, detailing real-world adoptions,
with particular focus on the MPEG-H Audio system. The intention is to provide an up-to-date overview of
successful implementations of DE solutions into production workflows as an example for further adoptions and
developments.

1. Introduction

Difficulties in understanding speech on TV and the resulting complaints are a common problem that broadcasters have been dealing with for more than 30 years [1]. The urgency of these complaints has been confirmed by a study with over 2,000 participants carried out by Westdeutscher Rundfunk (WDR¹) and Fraunhofer IIS in 2020 [2]: It showed that 68% of all participants, and 90% of the subjects aged 60 and above, often or very often had problems understanding speech on TV (see Figure 1). 83% of all study participants liked the possibility to switch to a dialogue-enhanced version, including those who do not normally struggle with speech intelligibility. This shows that the option to personalize sound is much appreciated. Using object-based audio (OBA), the MPEG-H Audio system can offer such options by providing an improved speech level that can also be adapted by the user. Additionally, the study highlights the need to provide dialogue enhancement (DE) for the large amount of existing content which contains only the final audio mix. In this case, deep-learning-based solutions such as MPEG-H Dialog+ offer the possibility to automatically separate speech from the background and remix the content to a new speech-enhanced version that is easier to understand. In 2022, the German broadcaster ARD adopted MPEG-H Dialog+ for the on-demand segment of their new accessibility service “Clear Speech” [3] (discussed in Sec. 4.2). Other European content providers are building similar DE services, for example the Swedish public broadcaster SVT [4].

¹ WDR is a constituent member of ARD, the joint organisation of Germany’s regional public-service broadcasters.

Figure 1. Survey by WDR and Fraunhofer IIS on understanding speech on TV carried out on a national scale in Germany [2].

This paper reviews the current state of object-based personalization and related technologies (Sec. 2.1). It details real-world adoptions of DE technologies with emphasis on the MPEG-H Audio system (Sec. 2.2). For that purpose, it examines cases where an object-based production is available as well as cases in which the objects are estimated by a deep neural network (DNN) (Sec. 3). It is meant to provide an overview of successful implementations of DE technology into existing production workflows in Brazil and South Korea (Sec. 4.1), as examples for further adoptions and developments.

2. Object-based Dialogue Enhancement

Sound mixing is an important asset for TV productions, as the composition of music, effects, and ambience not only serves as a background for the dialogue but is part of the storytelling. It is a complex task to produce content that, on the one hand, immerses viewers into the scene by providing a compelling mix, and, on the other hand, prevents loud background sounds from masking speech, allowing the audience to fully understand the dialogue without high listening effort. The topic of DE therefore has a long history, dating back more than 30 years.

2.1 Related Works

As early as 1991, BBC Research & Development published a study documenting regular complaints about speech that was difficult to understand – mostly due to loud background noises and music [1]. The goal of the study was to understand how much the background level would have to be lowered to significantly improve the intelligibility of speech on TV. However, the results were inconclusive; it only became clear later that the optimum volume difference between dialogue and music & effects (M&E) is highly personal. The study also pointed out that the broadcasting system at that time was not capable of transmitting an additional track with increased speech level [1].

In 2011, the BBC and Fraunhofer IIS jointly conducted a public field test during the Wimbledon Tennis Championships, in which viewers had the opportunity to personalize the dialogue level. The results showed that while some viewers preferred to clearly enhance the dialogue level, others chose to increase the ambience sounds [5]. The preferred loudness differences between dialogue and background were investigated in more detail in further studies that also showed a significant difference between expert and non-expert listeners [6].

Furthermore, in 2019, the BBC conducted a public trial together with the University of Salford, using a narrative importance approach [7]. In this approach, non-speech sounds are grouped together based on their importance to the narrative, preventing the loss of significant sounds due to global background attenuation [8]. Participants in the trial could not only enhance the dialogue, but also narratively important sounds. The trial was followed by a survey, showing that 73% of the 299 participants rated the content as more enjoyable or easier to understand when using the personalization feature.


2.2 MPEG-H Audio

MPEG-H Audio is a next generation audio system based on the open international standard ISO/IEC 23008-3, MPEG-H 3D Audio [9]. It is part of major broadcasting standards such as DVB [10], ATSC 3.0 [11], and SBTVD [12] [13]. MPEG-H Audio supports audio scenes consisting of object-based content as well as channel-based and scene-based content, or any combination of them, and thus enables an immersive audio experience via broadcast, streaming, and music services. Besides immersive sound, MPEG-H Audio offers personalization and “Universal Delivery”, the latter allowing the optimal rendering of a production on all kinds of devices [14]. To enable these personalized sound experiences, MPEG-H Audio relies on metadata, which is produced (or “authored”) and transmitted together with the audio content. A renderer provides an audio playback that is optimized for the individual user, end device, and surroundings [15].

Due to the object-based nature of an MPEG-H Audio scene, the individual components of the mix, such as dialogue and M&E, are not mixed together before transmission, but represented separately within the MPEG-H Audio stream. Thus, an MPEG-H Audio receiving device allows the individual adjustment of these components during rendering and playback. The personalization options are defined during production, e.g., the configuration of “presets”, which are different representations of the audio scene based on metadata and user interaction [14]. Users could, for instance, select a preset with dialogue enhancement, where the relative level of M&E is attenuated when dialogue is active. Additionally, broadcasters can allow users to individually adjust the levels of dialogue and background objects in defined ranges, enabling them to set their preferred balance. To ensure consistent playback loudness, also in cases with user interaction, the content loudness is normalized during the rendering process based on metadata [14].
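To make this interaction model concrete, the following is a minimal sketch in Python. It is not the MPEG-H Audio API; the class, function names, default gains, and interaction ranges are illustrative assumptions. It shows the general idea of combining an authored object gain with a user offset clamped to the authored range, then re-normalizing the rendered mix to a constant playback loudness.

```python
# Minimal sketch (hypothetical names, not the MPEG-H Audio API) of preset-based
# object gains, range-limited user interaction, and loudness re-normalization.
from dataclasses import dataclass
import numpy as np

@dataclass
class AudioObject:
    name: str
    samples: np.ndarray        # PCM samples, shape (channels, n); both objects share a shape
    gain_db: float = 0.0       # default gain authored for the selected preset
    min_gain_db: float = -6.0  # interaction range authored by the broadcaster (assumed values)
    max_gain_db: float = +6.0

def apply_user_gain(obj: AudioObject, user_offset_db: float) -> np.ndarray:
    """Clamp the requested offset to the authored range and apply it as a linear gain."""
    gain_db = float(np.clip(obj.gain_db + user_offset_db, obj.min_gain_db, obj.max_gain_db))
    return obj.samples * 10.0 ** (gain_db / 20.0)

def render(dialogue: AudioObject, background: AudioObject,
           dialogue_offset_db: float = 0.0, background_offset_db: float = 0.0,
           target_level_db: float = -23.0) -> np.ndarray:
    """Mix the two objects and re-normalize to a constant playback level."""
    mix = (apply_user_gain(dialogue, dialogue_offset_db)
           + apply_user_gain(background, background_offset_db))
    # Crude RMS-based level estimate as a stand-in for a proper loudness measurement.
    level_db = 10.0 * np.log10(np.mean(mix ** 2) + 1e-12)
    return mix * 10.0 ** ((target_level_db - level_db) / 20.0)
```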
3. Deep-learning-based Dialogue Enhancement

Today, broadcasters and service providers still rely on a great amount of material created with non-object-based workflows, for which only the final audio mix is available, often in stereo format. For such cases, Dialogue Separation (DS) can be applied to separate speech and background elements (M&E) from the final audio mix and to create audio objects that users can interact with. DS can be used as a pre-processing step to enable OBA so that personalization is offered to the audience for both object-based and non-object-based productions.

3.1 Related Works

DS shares relevant characteristics with both speech enhancement and source separation. Both have been major research topics for decades, cf. e.g., [16]. Early works specifically addressing DS for TV developed signal processing strategies for extracting the dialogue from a final audio mix. These strategies exploited characteristics specific to dialogue in TV productions, e.g., the fact that dialogue is usually amplitude-panned in a stereo mix [17], typically located in the phantom center [18], or in any case a direct component correlated across channels [19], or a combination of these characteristics [20]. A more general approach was proposed in [21], where feature extraction is followed by a shallow neural network. This idea anticipated more recent works, in which deep neural networks (DNNs) are applied in the Short-Time Fourier Transform (STFT) domain [22] [23] [24] [25].

Recent advances in deep learning brought significant improvements in the quality of DS. While research in the field continues and further improvements are to be expected, some products delivering remarkable quality are already on the market. Besides the MPEG-H Dialog+ technology discussed further below, it is worth noting a selection of other post-production tools: IDC by Audionamix [26], RX Dialogue Isolate by iZotope [27], and Clarity Vx Pro by Waves [28].
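As a rough illustration of the scheme shared by these STFT-domain approaches, the sketch below applies a time-frequency mask to the mixture spectrogram and resynthesizes dialogue and background. The mask predictor is a placeholder standing in for a trained DNN; sample rate, frame length, and the mono input are illustrative assumptions.

```python
# Sketch of mask-based dialogue separation in the STFT domain (not a specific
# published system); the mask predictor below is a stand-in for a trained DNN.
import numpy as np
from scipy.signal import stft, istft

def predict_speech_mask(magnitude: np.ndarray) -> np.ndarray:
    """Placeholder for a trained DNN; returns a neutral mask of 0.5 per bin."""
    return np.full_like(magnitude, 0.5)

def separate_dialogue(mix: np.ndarray, fs: int = 48000, nperseg: int = 1024):
    """Split a (mono, for brevity) final mix into dialogue and background estimates."""
    _, _, Z = stft(mix, fs=fs, nperseg=nperseg)          # complex spectrogram
    mask = predict_speech_mask(np.abs(Z))                # values in [0, 1]
    _, dialogue = istft(mask * Z, fs=fs, nperseg=nperseg)
    _, background = istft((1.0 - mask) * Z, fs=fs, nperseg=nperseg)
    return dialogue, background
```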


3.2 MPEG-H Dialog+

MPEG-H Dialog+ is a DNN-based tool performing 1) DS, 2) automatic remixing, and 3) automatic metadata authoring. Dialog+ has been specifically developed to fill the gap between traditionally produced material available only as a final audio mix and object-based audio. The core DS solution contained in MPEG-H Dialog+ is under active development, but the version that is currently deployed in broadcasting and to service providers was described in [2], and updated with improvements similar to the ones proposed in [25]. It consists of a deep fully convolutional neural network, trained with original TV material for which the individual audio stems are available from production. The material comprises various content types, such as nature documentaries, sports programs, and feature films. The audio stems are manually edited to exclude any parts where non-speech sounds are present in the dialogue stem or speech is present in the M&E stem. This helps prevent faulty training where, for example, sounds are incorrectly identified as speech and separated. DS cannot always separate dialogue perfectly and some degree of coloration or distortion might be introduced. There are ongoing efforts to improve the quality of DS as well as to automatically control the final quality [29].

MPEG-H Dialog+ combines DS with automatic remixing, allowing global and time-varying background attenuation, or a combination of both. Users who prefer minimal background signal might benefit from global background attenuation, where the relative background level is attenuated by the same specified amount over the entire signal. Since globally attenuating the background might decrease ambience and sounds of narrative importance [8], time-varying background attenuation only lowers the background level when dialogue is present. The parameters for DS and remixing can be adjusted, and it is possible to combine global and time-varying attenuation [2]. Finally, the automatic authoring produces audio and metadata ready to be used in an object-based workflow. MPEG-H Dialog+ was shown to successfully enable personalization [2], and to reduce listening effort (LE), as discussed below.
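The remixing step can be illustrated with the following minimal sketch, assuming the dialogue and background signals delivered by DS. The frame size, attenuation values, and the simple energy-based activity test are illustrative assumptions, not the MPEG-H Dialog+ implementation, and gain smoothing is omitted for brevity.

```python
# Sketch of combined global and time-varying (ducking) background attenuation.
# Mono signals of equal length are assumed; parameters are illustrative only.
import numpy as np

def remix(dialogue: np.ndarray, background: np.ndarray, fs: int = 48000,
          global_att_db: float = 3.0, ducking_att_db: float = 9.0,
          frame_ms: float = 20.0, activity_thresh_db: float = -50.0) -> np.ndarray:
    frame = int(fs * frame_ms / 1000.0)
    n_frames = len(dialogue) // frame
    # Global attenuation applied over the entire signal.
    gain = np.full(len(background), 10.0 ** (-global_att_db / 20.0))
    for i in range(n_frames):
        seg = dialogue[i * frame:(i + 1) * frame]
        level_db = 10.0 * np.log10(np.mean(seg ** 2) + 1e-12)
        if level_db > activity_thresh_db:
            # Crude dialogue-activity test: duck the background only while speech is present.
            gain[i * frame:(i + 1) * frame] *= 10.0 ** (-ducking_att_db / 20.0)
    return dialogue + gain * background
```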
3.3 Reducing Listening Effort (LE)

DE and the DS network of MPEG-H Dialog+ were evaluated in [30], where a multimodal evaluation of LE was carried out on subjects without hearing impairments and using TV excerpts. Pupil size was considered as a physiological indicator of LE together with self-reported LE and word recall rate. A common trend was observed across all measures of LE: It was shown that background music and effects cause significant LE in broadcast audio, even under ideal listening conditions. Decreasing the background level via DE consistently reduced the LE. This was shown to be the case both when the original audio objects were used for DE and when DE was enabled by the DS network of MPEG-H Dialog+. It was concluded that background music and effects can carry vital information and play an important role in engaging and entertaining the audience [31]. They do, however, come with a clear LE cost, requiring care at the production stage (cf. e.g., the recommendations in [32]), and the personalization functionalities provided by DE for users.

4. Dialogue Enhancement Production Workflows

In the following, we outline three current real-world use cases, where DE was implemented in broadcast and streaming workflows. The examples from Globo² (Brazil) and SBS³ (South Korea) are live, object-based productions where speech and background sounds were provided as separate components. The described service by German public-service broadcaster ARD works with DS in a channel-based offline production environment. While DE for live transmission is normally faced with challenging latency requirements down to a maximum of a few frames, post-production workflows can typically be covered with offline processing, providing the option of more lookahead and processing time for DS and remixing.

² Globo is a Brazilian free-to-air television network, owned by media conglomerate Grupo Globo, and the largest commercial TV network in Latin America.
³ Seoul Broadcasting System (SBS) is one of the leading South Korean television and radio broadcasters and the largest private broadcaster in South Korea.

4.1 Object-based Live Broadcast

The football World Cup 2022 in Qatar was a recent major event for which object-based DE was provided in several broadcasting and streaming services [33]. Both Globo and SBS implemented interactive, personalized MPEG-H Audio in their services. This enabled viewers to switch between different presets, including an option with attenuated background sounds for improved ease of listening. Additionally, due to the object-based production, advanced personalization options were possible: the balance between dialogue and background could be changed by users with a slider in the user interface, see Figure 2.

At the Globo facilities, two live broadcast services were set up during the World Cup: one for the current SBTVD 2.5 broadcast standard with HE-AAC or AAC-LC and MPEG-H Audio [12], and one for SBTVD TV 3.0 (via DASH streaming), which is currently in the standardization phase and specifies MPEG-H Audio as the sole mandatory audio codec [34] [35]. Globo received the broadcast signals from the World Cup’s host broadcaster in their facilities in Rio de Janeiro. The local commentary by Globo sports channels was added to the live feed from Qatar.


To allow different presets and user personalization, the signal was enriched with MPEG-H metadata by an AMAU (Audio Monitoring and Authoring Unit) and passed along together with the video signals to the corresponding video encoders for simultaneous delivery over the TV 2.5 and TV 3.0 systems [36].

Figure 2. TV Globo football World Cup broadcast with interactive DE user interface.

At SBS, the target application was the broadcaster’s Android app, receiving an HTTP Live Streaming (HLS) signal. The four available presets for the users to choose from were “Basic”, “Enhanced Dialogue”, “Site”, and “Dialogue Only”. The live workflow also consisted of an AMAU-to-encoder setup, fed by a mixing desk providing the international feed and Korean commentary. In this case, the objects for the streaming service were produced in parallel to the regular master audio for the broadcast service.

4.2 DNN-based VoD Service

In contrast to the two OBA scenarios described above, the German public broadcaster ARD implemented a DE service called “Klare Sprache” (“Clear Speech”) into their Video on Demand (VoD) service [37] using pure stereo audio. It is based on existing non-object-based content for which the separate audio stems are not available, but only the final audio mix.

The DS technology used for their VoD service is MPEG-H Dialog+, implemented in a centralized transcoding service. The MPEG-H Dialog+ processing in this use case is part of a containerized automatic workflow. It receives the audio mix from an archive or production. After DS processing and automatic remixing, stereo audio with enhanced dialogue is rendered to a file that is loudness-normalized according to EBU R128 [38]. The new audio mix is then muxed with video versions of different bitrates and encoded in parallel.

The created “Clear Speech” version is offered as an alternative accessibility service in the VoD service “ARD Mediathek”, see Figure 3. In this workflow, the produced MPEG-H Dialog+ output is a stereo rendering from the MPEG-H Audio scene authoring computed in the background. An optional ADM (Audio Definition Model) [39] output could be directly encoded and transmitted in case an OBA workflow is established.

Figure 3. Movies with “Clear Speech” audio in the German VoD service “ARD Mediathek”.
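The core audio steps of this offline workflow can be sketched as follows, reusing the hypothetical separate_dialogue and remix helpers from the sketches in Sec. 3. File I/O, video muxing, and a proper EBU R128 meter are omitted or replaced by placeholders; only the order of operations reflects the description above.

```python
# Sketch of the offline "Clear Speech" processing chain: DS, remixing, and
# loudness normalization toward the EBU R128 target. The loudness measurement
# is a placeholder; an R128-compliant meter would be used in practice.
import numpy as np

TARGET_LOUDNESS_DB = -23.0  # EBU R128 [38]

def measure_integrated_loudness(audio: np.ndarray) -> float:
    """Placeholder for an EBU R128 integrated-loudness measurement (RMS proxy here)."""
    return 10.0 * np.log10(np.mean(audio ** 2) + 1e-12)

def clear_speech_version(mix: np.ndarray, fs: int) -> np.ndarray:
    """Produce a dialogue-enhanced, loudness-normalized version of a final mix (mono for brevity)."""
    dialogue, background = separate_dialogue(mix, fs)   # DS step (sketch from Sec. 3.1)
    enhanced = remix(dialogue, background, fs)          # ducked background (sketch from Sec. 3.2)
    gain_db = TARGET_LOUDNESS_DB - measure_integrated_loudness(enhanced)
    return enhanced * 10.0 ** (gain_db / 20.0)
```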


5. Conclusion

This paper reviewed the current state of Dialogue Enhancement (DE) in TV services, with emphasis on the MPEG-H Audio system. It showed real-world adoptions and implementations of DE technologies into current production and transmission workflows. DE can be a part of native object-based productions or enabled by Dialogue Separation (DS). While recent studies have shown that decreasing the relative level of the background via DE consistently reduces the listening effort, other studies have found that the optimal balance between dialogue and background is highly personal and situation-dependent. Only the personalization achieved through DE meets the needs and preferences of the audience in almost every situation.

Next to the relative balance of dialogue and background, other factors can make the TV content difficult to access [40]. Object-based audio (OBA) provides the tools to improve accessibility, e.g., by efficiently delivering multiple audio versions including audio descriptions, different languages and versions with simplified vocabulary or slower speech pace [40]. Employing DE and OBA means taking one concrete step towards improved accessibility and user satisfaction, as shown by the studies reviewed in this paper. Moreover, DNN-based DS can assist in driving the shift towards OBA, making existing content personalized and interactive.

6. Acknowledgement

Sincere thanks go to the Fraunhofer IIS Accessibility Solutions team and to Jennifer Karbach, Yannik Grewe, and Adrian Murtaza for their valuable support.

References

[1] C. Mathers, “A Study of Sound Balances for the Hard of Hearing,” BBC White Paper, 1991.

[2] M. Torcoli, C. Simon, J. Paulus, D. Straninger, A. Riedel, V. Koch, S. Wits, D. Rieger, H. Fuchs, C. Uhle, S. Meltzer and A. Murtaza, “Dialog+ in Broadcasting: First Field Tests Using Deep-Learning-Based Dialogue Enhancement,” in Technical Papers IBC 2021 (International Broadcasting Convention), Virtual, 2021.

[3] Verband Deutscher Tonmeister e.V., “KI in der Audioproduktion,” vdt Magazin, no. 2, 2023.

[4] A. Bidner, J. Lindberg, O. Lindman and K. Skorupska, “Deploying Enhanced Speech Feature Decreased Audio Complaints at SVT Play VOD Service,” in 9th Machine Intelligence and Digital Interaction (MIDI) Conference, Poland, 2021.

[5] H. Fuchs, S. Tuff and C. Bustad, “Dialogue Enhancement - Technology and Experiments,” EBU Technical Review - Q2, 2012.

[6] M. Torcoli, A. Freke-Morin, J. Paulus, C. Simon and B. Shirley, “Background Ducking to Produce Esthetically Pleasing Audio for TV with Clear Speech,” in AES Convention 146, Dublin, Ireland, 2019.

[7] L. Ward, M. Paradis, B. Shirley, L. Russon, R. Moore and R. Davies, “Casualty Accessible and Enhanced (A&E) Audio: Trialling object-based accessible TV audio,” in AES Convention 147, New York, USA, 2019.

[8] L. Ward, “Improving Broadcast Accessibility for Hard of Hearing Individuals: using object-based audio personalisation and narrative importance,” PhD thesis, University of Salford, 2020.

[9] ISO/IEC 23008-3:2022, “High Efficiency Coding and Media Delivery in Heterogeneous Environments – Part 3: 3D Audio.”

[10] ETSI Standard TS 101 154 v2.3.1, “Specification for the Use of Video and Audio Coding in Broadcasting Applications Based on the MPEG-2 Transport Stream,” Feb. 2017.

[11] ATSC Standard 3.0 A/342:2021 Part 3, “MPEG-H System,” Mar. 2021.

[12] ABNT NBR 15602-2:2020, “Digital Terrestrial Television - Video Coding, Audio Coding and Multiplexing Part 2: Audio Coding,” 2020.

[13] TV 3.0 Project, “Terrestrial TV Evolution in Brazil,” [Online]. Available: https://forumsbtvd.org.br/tv3_0/#panel-phase2. [Accessed April 2023].

[14] Y. Grewe, P. Eibl, D. Rieger, M. Torcoli, C. Simon and U. Scuda, “MPEG-H Audio Production Workflows for a Next Generation Audio Experience in Broadcast, Streaming, and Music,” in AES Convention 151, Virtual, 2021.

[15] P. Eibl, Y. Grewe, D. Rieger and U. Scuda, “Production Tools for the MPEG-H Audio System,” in AES Convention 151, Virtual, 2021.


[16] E. Vincent, T. Virtanen and S. Gannot, Audio Source Separation and Speech Enhancement, John Wiley & Sons, 2018.

[17] A. S. Master, L. Lu, H. M. Lehtonen, H. Mundt, H. Purnhagen and D. Darcy, “Dialog Enhancement via Spatio-Level Filtering and Classification,” in AES Convention 149, Virtual, 2020.

[18] J. T. Geiger, P. Grosche and Y. L. Parodi, “Dialogue Enhancement of Stereo Sound,” in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015.

[19] A. Craciun, C. Uhle and T. Bäckström, “An Evaluation of Stereo Speech Enhancement Methods for Different Audio-Visual Scenarios,” in 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 2015.

[20] J. Paulus, M. Torcoli, C. Uhle, J. Herre, S. Disch and H. Fuchs, “Source Separation for Enabling Dialogue Enhancement in Object-based Broadcast with MPEG-H,” Journal of the Audio Engineering Society (JAES), vol. 67, no. 7/8, pp. 510-521, 2019.

[21] C. Uhle, O. Hellmuth and J. Weigel, “Speech Enhancement of Movie Sound,” in AES Convention 125, San Francisco, USA, 2008.

[22] N. L. Westhausen, R. Huber, H. Baumgartner, R. Sinha, J. Rennies and B. T. Meyer, “Reduction of Subjective Listening Effort for TV Broadcast Signals With Recurrent Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3541-3550, 2021.

[23] J. Paulus and M. Torcoli, “Sampling Frequency Independent Dialogue Separation,” in European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 2022.

[24] D. Petermann, G. Wichern, Z.-Q. Wang and J. Le Roux, “The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022.

[25] M. Torcoli and E. A. P. Habets, “Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 2023.

[26] “IDC,” Audionamix, [Online]. Available: https://audionamix.com/instant-dialogue-cleaner/. [Accessed March 2023].

[27] “RX Dialogue Isolate,” iZotope, [Online]. Available: https://www.izotope.com/en/products/rx/features/dialogue-isolate.html. [Accessed March 2023].

[28] “Clarity Vx Pro,” Waves, [Online]. Available: https://www.waves.com/plugins/clarity-vx-pro. [Accessed March 2023].

[29] M. Torcoli, J. Paulus, T. Kastner and C. Uhle, “Controlling the Remixing of Separated Dialogue with a Non-Intrusive Quality Estimate,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Virtual, 2021.

[30] M. Torcoli, T. Robotham and E. A. P. Habets, “Dialogue Enhancement and Listening Effort in Broadcast Audio: A Multimodal Evaluation,” in 14th International Conference on Quality of Multimedia Experience (QoMEX), Lippstadt, Germany, 2022.

[31] L. Ward, B. G. Shirley, Y. Tang and W. J. Davies, “The Effect of Situation-Specific Non-Speech Acoustic Cues on the Intelligibility of Speech in Noise,” in INTERSPEECH, Stockholm, Sweden, 2017.

[32] D. Geary, M. Torcoli, J. Paulus, C. Simon, D. Straninger, A. Travaglini and B. Shirley, “Loudness Differences for Voice-Over-Voice Audio in TV and Streaming,” Journal of the Audio Engineering Society (JAES), vol. 68, no. 11, pp. 810-818, 2020.

[33] Fraunhofer IIS, “Football Fans Around the World Experience the Worldcup in Immersive and Personalized MPEG-H Audio,” 2022. [Online]. Available: https://www.iis.fraunhofer.de/en/pr/2022/20221208_mpeg-h_audio_worldcup.html. [Accessed April 2023].

[34] Brazilian Digital Terrestrial Television System Forum; Brazilian Ministry of Communications, “Testing and Evaluation Report: TV 3.0 Project - Audio Coding,” SBTVD Forum, Brazil, 2021.

[35] A. Murtaza, S. Meltzer, Y. Grewe, N. Faecks, M. Raulet and L. Gregory, “MPEG-H Audio System for SBTVD TV 3.0 Call for Proposals,” SET International Journal of Broadcast Engineering, 2021, doi: 10.18580/setijbe.2021.3.


[36] A. Murtaza, “Audio Advances Showcased During the FIFA World Cup 2022,” DVB Scene, no. 61, p. 11, 2023.

[37] ARD Digital, “Tonspur Klare Sprache,” ARD Digital, 2023. [Online]. Available: https://www.ard-digital.de/klaresprache. [Accessed March 2023].

[38] European Broadcasting Union (EBU), “Loudness Normalisation and Permitted Maximum Level of Audio Signals,” EBU Recommendation 128, 2020.

[39] ITU-R BS.2076-2, “Audio Definition Model,” 2019.

[40] C. Simon, M. Torcoli and J. Paulus, “MPEG-H Audio for Improving Accessibility in Broadcasting and Streaming,” 2019.
