¹¹institutetext: Deezer Research, Paris, France
¹¹email: [email protected]

Perceived Femininity in Singing Voice: Analysis and Prediction

Yuexuan Kong Viet-Anh Tran Romain Hennequin

Abstract

This paper focuses on the often-overlooked aspect of perceived voice femininity in singing voices. While existing research has examined perceived voice femininity in speech, the same concept has not yet been studied in singing voice. The analysis of gender bias in music content could benefit from such study. To address this gap, we design a stimuli-based survey to measure perceived singing voice femininity (PSVF), and collect responses from 128 participants. Our analysis reveals intriguing insights into how PSVF varies across different demographic groups. Furthermore, we propose an automatic PSVF prediction model by fine-tuning an x-vector model, offering a novel tool for exploring gender stereotypes related to voices in music content analysis beyond binary sex classification. This study contributes to a deeper understanding of the complexities surrounding perceived femininity in singing voices by analyzing survey and proposes an automatic tool for future research.

1 Introduction

Recently, in the field of music content analysis, numerous studies have delved into social stereotypes and biases surrounding gender and sex, notably regarding lyrics [2, 19, 20]. From 1960 to 2010, popular song lyrics of male solo artists became more sexist over the years, while this behavior is less noticeable for the other categories of artists[2]. Sexual behaviors and objectification in lyrics of both female and male are more frequently linked to male artists, and lyrics of male artists contain more gender bias than those of female artists[2, 20].

While these studies employ the concept of gender, they primarily frame the problem by associating content with the singer’s biological sex. Yet, gender and sex are fundamentally different; gender being a socially constructed notion that may or may not align with one’s biological sex. More importantly, recent scientific consensus defines the notion of gender more as a continuous variable than a discrete one[18]. Furthermore, in social psychology, on top of gender, perceived masculinity and femininity by humans have often been considered an important component while investigating the influence on social behaviors [5].

In speech, there has been a surge in research exploring the relationship between perceived voice femininity and social impressions. A previous research points out how perceived gender in voices is a socially and culturally constructed concept[8]. Some studies focus on social stereotypes of voices of different femininity [14, 1]; while others develop tools to predict perceived voice femininity in speech, aiding transgender individuals in their gender affirming surgeries [6, 5].

Despite advancements in perceived voice femininity in speech, research on perceived femininity in singing voice is lacking. There are various research on gender estimation from singing voice [10, 24, 25]. However currently, there is no data analysis of perceived singing voice femininity or model capable of predicting the average perceived singing voice femininity (PSVF), despite its potential usefulness. In speech, research shows that the perception of leadership capacity is influenced by pitch in voices[13], which is often related to perceived femininity[16]. In singing voice, similarly to speech, the message carried by the lyrics can be biased by the perceived femininity of the singing voice, rather than the biological sex of the singer when it is unknown to listeners. Thus having access to the information of PSVF can be of great help for sociological studies about gender representation in music and in media. This approach offers a broader perspective on voice characteristics compared to binary classification of sex often used in music content analysis. Moreover, introducing an automatic model for PSVF prediction could facilitate the analysis of bias, since collecting PSVF of songs for a large catalog is time consuming.

In this work, as a first step towards PSVF analysis, we have three main contributions:

•

We design a stimuli-based PSVF survey, consisting of 1200 audio segments lasting 3 seconds each, balanced across five languages, four age groups, and two sexes, with 7258 valid responses gathered from 128 participants. The dataset is publicly available at https://github.com/deezer/perceived_singing_voice_femininity.
•

We use the survey responses to compare and analyze results among different groups of singers and participants, to test certain hypothesis on PSVF and gain a better understanding.
•

Finally, we use the dataset to fine-tune a modified x-vector model to perform regression instead of binary classification of PSVF, paving the way for large-scale analysis of biases in music corpora coupled to PSVF.

2 Related Work

The first research on automatic perceived voice femininity prediction in speech is conducted by using classification methods[26]. They use a Multi-Layer Perceptron (MLP) for transgender self-assessment of voice femininity. This model is trained using binary gender data and flexible thresholds to categorize voices into masculine, feminine, and androgynous classes. Its accuracy, measured against speaker self-assessments, stands at 88%. However, this system cannot describe perceived femininity on a continuous scale. Chen et al. [4, 5] used Linear Discriminant Analysis(LDA) to align the results of their model with human perception of voice on a continuous scale and conducted an analysis of acoustic factors that impact human perception. To assist transgender individuals in voice training following gender-affirming treatments, Doukhan et al. [6] conducted a survey of human perception of transgender voices and trained a binary classification system based on x-vector[21]. They then used isotonic regression calibration procedure to transform the results into a continuous scale to predict the perceived femininity of transgender individuals[3]. To predict the voice femininity percentage, they deployed an x-vector architecture which is a time-delayed neural network that embeds voice characteristics.

However, to the best of our knowledge, due to the lack of data, singing voice has not been studied in the context of perceived femininity analysis prediction.

3 Human Perception of Singing Voice Femininity

In this section, we conduct a detailed evaluation of human perception of singing voice’s femininity, with the aim of enhancing our comprehension of perceived singing voice femininity (PSVF) and differences in perceptions towards various groups of singers (language, sex, age). We use the term voice femininity as previous research on voice femininity in speech[6, 7].

3.1 Survey Design and Data Collection

In our study, we use the test set of STraDa[15]. STraDa contains 200 songs that are distributed equally across two sexes of the lead singer, 4 age groups of the lead singer (20-34, 35-49, 50-64, 65+) and 5 languages (Mandarin, English, French, Spanish, German), whose collection process is detailed in STraDa [15]. We refer these groups as subgroups of singers. All songs are annotated by the authors manually, and only cis-gender singers are used, which means their gender corresponds to their biological sex. To increase the number of segments of survey, we choose 6 segments of 3 seconds from each song that contain vocals.

The 1200 segments are randomly divided into 20 surveys, each containing 60 segments. The surveys are administered through the online platform JotForm¹¹1https://www.jotform.com. We ask each participant to rate the perceived femininity of the corresponding singing voice in the segment. Each segment is rated using a Likert scale, with the following values corresponding from 2 to -2: "definitely feminine", "rather feminine", "I don’t know", "rather masculine" and "definitely masculine". Positive values correspond to feminine and negative values correspond to masculine. We collect participants’ metadata (gender, language, and age group) at the beginning of the survey. We refer these groups as subgroups of participants. Additionally, after each question, participants are asked if they recognize the singer in the corresponding segment, to eliminate cases where participants’ perceptions are biased by prior knowledge. At the end of the survey, we ask participants whether they experienced any difficulties during the listening task. We share this survey through community mailing lists and in several universities in France and China, and obtain 128 participants. We remove responses indicating prior knowledge of the singer, two participants who have reported difficulties during the task were excluded, resulting in 126 participants and 7258 valid responses for 1200 segments. We release these data to facilitate future research on PSVF²²2https://github.com/deezer/perceived_singing_voice_femininity.

To summarize the survey results and analyze the PSVF, as well as to compare perceptions among different subgroups of segments and participants, we introduce the term average correspondence (AC). This is calculated as the percentage of instances within each subgroup where the averaged PSVF across all participants aligns with the singer’s biological sex. For example, within the subgroup of female singers’ segments, the average PSVF was positive in 96.7% of all segments, indicating a perception of femininity that aligns with the singers’ sex. It serves as a metric to indicate whether the singing voices within one subgroup differ more from the stereotypical perception of voices associated with sex compared to another subgroup.

3.2 Comparison of Different Subgroups’ PSVF

Our hypothesis is that subgroups might exhibit higher AC for singing voices they are more accustomed to or that are physiologically similar to their own (e.g., french participants to french songs). However, Table 1 reveals no significant differences in AC among the various participant subgroups. This suggests that factors such as gender, language, and age do not significantly influence the AC of participants for different subgroups of singers in our study. All subgroups with different gender/age/language exhibit higher AC for male singers, singers of age group 35-49 and mandarin tracks. These findings could be helpful in future research, as they suggest that participants’ PSVF may not be strongly influenced by demographic factors, and may be more universally applicable.

	female	male
nb. participants	53	73
AC female singers	82.9	87.9
AC male singers	96.7	96.4

	20-34	35-49	50-65
nb. participants	96	24	4
AC 20-34 singers	96.3	93.3	87.2
AC 35-49 singers	97.0	97.7	94.9
AC 50-64 singers	87.6	82.2	81.0
AC 65+ singers	92.5	87.4	81.8

	fr	en	sp	man	ge
nb. participants	67	108	15	24	8
AC fr tracks	85.8	87.9	89.6	85.1	85.9
AC en tracks	92.0	93.3	93.0	91.4	93.2
AC sp tracks	90.0	89.6	80.9	87.4	85.9
AC man tracks	97.7	97.9	95.9	92.0	94.3
AC ge tracks	92.7	94.6	92.1	89.8	94.7

Table 1: AC (%) of different subgroups. Answers of participants older than 65 years old do not cover all age groups, therefore we do not report the ACs.

Furthermore, to compare uncertainty across different subgroups, we calculate the proportion of segments where the averaged answer falls between -0.5 and +0.5 for each subgroup of singers (denoted as Unsure). The results are shown in Table 2.

	Gender		Age				Language
	female	male	20–34	35–49	50–64	65+	fr	en	sp	man	ge
Unsure (%)	6.8	2.3	2.3	0.3	9.1	4.1	7.1	2.1	5.4	0.8	4.5

Table 2: Unsure (%) of different subgroups of singers.

Participants perceive female voices, voices of singers aged 50-64, and French singing voices as more neutral, represented by higher Unsure in Table 2. This suggests that on average, participants find these voices less stereotypical compared to voices typically associated with both sexes. While we acknowledge that observations of Unsure may partly result from sampling bias in STraDa, we would like to shed more lights in potential reasons for these similarities beyond sampling bias. For singer’s sex, female voices might have become less stereotypical over the time, with the increase of representation of female singers in genres such as blues and gospel singing, where female singers tend to have deeper and more powerful voices that were more typically associated with male voices[9]. For language, music in Mandarin is known to differ from Western music in tonality, rhythm, and melody [17], which can impact how individuals perceive femininity in singing voices. Additionally, traditional Chinese folk songs typically have higher-pitched female vocals, which have characteristics that resemble more to stereotypical voices that people perceive as feminine voice. For age groups, it may be due to changes in the human voice that occur with age. As individuals age, respiratory changes result in less efficient air movement, and larynx changes reduce vocal fold adjustments during voice production [12, 23]. This could explain lower AC for older age groups where the fundamental frequency of female voices become closer to the fundamental frequency of male voices[22] and that people perceive these voices as more neutral.

3.3 Comparison of Human Perception and Machine’s Singer Sex Classification Results

By using the training set of STraDa[15], the authors trained a singer sex classifier and evaluated it on the testing set of STraDa, the same dataset used for the survey. We compare our AC results with the findings from [15] to justify our decision to fine-tune the automatic model from [15] for large-scale catalog labeling.

The subgroups that exhibit higher AC for PSVF and where the automatic evaluation system renders higher recall in [15] are respectively the same. While this could be explained by sampling bias in the testing set of STraDa, we find that only 41 out of 114 cases are the same where PSVF and automatic sex classification system do not match with the singer’s sex. This suggests that sampling bias alone could not explain the reasons behind this similarity.

This similarity also shows that potentially by fine-tuning this x-vector model, we could obtain a model that align its automatic prediction with PSVF.

4 Automatic Perceived Singing Voice Femininity Prediction System

Within the context of sociological analysis, where PSVF serves as a factor that broadens the perspective for analyzing biases present in music content such as lyrics, we require an automatic prediction model to label a large catalog.

In this section, we fine-tune on the x-vector model used in [15] to align it with PSVF. Moreover, in speech, Doukhan et al. also deployed a system based on x-vector to predict the perceived femininity which showed such system’s potential in predicting perceived femininity[6].

4.1 Training Details

We employ the identical set of 1200 segments with human annotations we obtain from the survey. It is worth noting that a binary coding is insufficient to fully capture PSVF, which is more of a continuous variable than a binary one[6]. We calculate the average of all responses for each one of the 1200 segments. Since the scale used in the survey is from -2 to 2, we rescale each answer and obtain a value that ranges from 0 to 1, as the reference for the automatic PSVF prediction system. We use a 5-fold cross-validation technique to assess the performance of the automatic PSVF prediction system, resulting in 860 segments for training and 240 segments for testing for each fold.

The architecture used in [15] is a x-vector embedding model. It contains 5 time-delayed neural network (TDNN) blocks and outputs an embedding of 64 dimensions, then a linear layer that projects embeddings into two classes. It outputs two neurons for sex classification, thus we replace the last layer by one neuron that outputs a value from 0 to 1 for a continuous prediction. We freeze the first two blocks of the TDNN and retrain other layers by using our training data. We use L1 loss to optimize the model.

We adapt all pre-processing steps used in [15]. We use mel-spectrograms using 24 Mel filters, as input for the TDNN model to extract x-vector embeddings. To improve the quality and quantity of our training data, we integrate source separation using Spleeter[11] for half of training samples and speed change[27] as a method of data augmentation.

4.2 Results

We use Mean Average Error (MAE) to measure the performance of our model, which is a direct indicator of how far the prediction falls from the survey result. Our model shows an average MAE of 0.10 with a standard deviation of 0.01 across five folds, indicating that, on average, our model’s predictions deviate by 0.1 from the results obtained from the survey on a scale of 0 to 1. Considering the relatively limited size of our training data, this performance stands as a first step towards an automatic PSVF prediction model. We firmly believe that employing this model has the potential to offer a multifaceted lens through which biases in music content can be examined. It diverges from the conventional approach of studying biases related to sex, allowing for a broader exploration of perspectives within the realm of music content analysis.

5 Conclusion

In this paper, we adapt a concept that is recently introduced into speaking voice into the field of singing voice: perceived voice femininity. It shifts away from binary classification based on biological sex, and moves towards a richer understanding of how singing voices are gradually perceived in terms of femininity on a continuous scale. In doing so, our goal is to enable large-scale analysis of music corpora in terms of gender stereotypes, for instance by linking lyrics to perceived singing voice femininity (PSVF) of lead singers.

To achieve such goal, we create a stimuli-based survey to investigate PSVF of humans and will publish all related data for future use. We then analyze PSVF data to show how PSVF of different singers vary across different subgroups of participants. Trained on this data, we provide the community an automatic prediction tool that could be used on a large music catalog to conduct sociologist research related to gender stereotypes.

This study fills in the gap in research on analysis of PSVF, yet there is room for improvement. Firstly, future research could enhance the survey by adding singers of more diverse gender identities and gathering data from a more diverse range of individuals, thus aiming for a more representative estimation of PSVF. Additionally, influences of hyperparameter choices, model choices and segment lengths for automatic model could be investigated further. Lastly, investigating the underlying reasons why certain subgroups’ singing voices diverge more from stereotypical voices could bring a better understanding of human perception of singing voice femininity.

References

[1] Arnocky, S., Hodges-Simeon, C.R., Ouellette, D., Albert, G.: Do men with more masculine voices have better immunocompetence? Evolution and Human Behavior 39(6), 602–610 (2018)
[2] Betti, L., Abrate, C., Kaltenbrunner, A.: Large scale analysis of gender bias and sexism in song lyrics. EPJ Data Science 12(1), 10 (2023)
[3] Chakravarti, N.: Isotonic median regression: a linear programming approach. Mathematics of operations research 14(2), 303–308 (1989)
[4] Chen, F., Togneri, R., Maybery, M., Tan, D.: An objective voice gender scoring system and identification of the salient acoustic measures. In: INTERSPEECH. pp. 1848–1852 (2020)
[5] Chen, F., Togneri, R., Maybery, M., Tan, D.W.: Acoustic characterization and machine prediction of perceived masculinity and femininity in adults. Speech Communication 147, 22–40 (2023)
[6] Doukhan, D., Devauchelle, S., Girard-Monneron, L., Ruz, M.C., Chaddouk, V., Wagner, I., Rilliard, A.: Voice passing: a non-binary voice gender prediction system for evaluating transgender voice transition. In: INTERSPEECH 2023. vol. 24, pp. 5207–5211. ISCA (2023)
[7] Doukhan, D., Poels, G., Rezgui, Z., Carrive, J.: Describing gender equality in french audiovisual streams with a deep learning approach. VIEW Journal of European Television History and Culture 7(14), 103–122 (2018)
[8] Eidsheim, N.S.: The race of sound: Listening, timbre, and vocality in African American music. Duke University Press (2019)
[9] Griffin, F.J.: When malindy sings: a meditation on black women’s vocality. In: Uptown conversation: The new jazz studies, pp. 102–125. Columbia University Press (2004)
[10] Hamasaki, M., Goto, M., Nakano, T.: Songrium: a music browsing assistance service with interactive visualization and exploration of protect a web of music. In: Proceedings of the 23rd International Conference on World Wide Web. pp. 523–528 (2014)
[11] Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5(50), 2154 (2020). https://doi.org/10.21105/joss.02154, https://doi.org/10.21105/joss.02154
[12] Hirano, M., Kurita, S., Sakaguchi, S.: Ageing of the vibratory tissue of human vocal folds. Acta oto-laryngologica 107(5-6), 428–433 (1989)
[13] Klofstad, C.A., Anderson, R.C., Peters, S.: Sounds like a winner: Voice pitch influences perception of leadership capacity in both men and women. Proceedings of the Royal Society B: Biological Sciences 279(1738), 2698–2704 (2012)
[14] Ko, S.J., Judd, C.M., Stapel, D.A.: Stereotyping based on voice in the presence of individuating information: Vocal femininity affects perceived competence but not warmth. Personality and Social Psychology Bulletin 35(2), 198–211 (2009)
[15] Kong, Y., Tran, V.A., Hennequin, R.: Strada: A singer traits dataset. In: Interspeech (2024)
[16] Munson, B.: The acoustic correlates of perceived masculinity, perceived femininity, and perceived sexual orientation. Language and speech 50(1), 125–142 (2007)
[17] Rahn, J.: " chinese harmony" and contemporary non-tonal music theory. Canadian University Music Review 19(2), 115–124 (1999)
[18] Reilly, D.: Gender can be a continuous variable, not just a categorical one: Comment on hyde, bigler, joel, tate, and van anders (2019). (2019)
[19] Shakespeare, D., Porcaro, L., Gómez, E., Castillo, C.: Exploring artist gender bias in music recommendation. arXiv preprint arXiv:2009.01715 (2020)
[20] Smiler, A.P., Shewmaker, J., Hearon, B.V.: From “i want to hold your hand” to “promiscuous”: Sexual stereotypes in popular music lyrics, 1960–2008. Sexuality & Culture 21, 1083–1105 (2017), https://api.semanticscholar.org/CorpusID:148678686
[21] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S.: X-vectors: Robust dnn embeddings for speaker recognition. In: ICASSP (2018)
[22] Stathopoulos, E.T., Huber, J.E., Sussman, J.E.: Changes in acoustic characteristics of the voice across the life span: Measures from individuals 4–93 years of age (2011)
[23] Tarafder, K.H., Datta, P.G., Tariq, A.: The aging voice. Bangabandhu Sheikh Mujib Medical University Journal 5(1), 83–86 (2012)
[24] Weninger, F., Durrieu, J.L., Eyben, F., Richard, G., Schuller, B.: Combining monaural source separation with long short-term memory for increased robustness in vocalist gender recognition. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2196–2199. IEEE (2011)
[25] Weninger, F., Wöllmer, M., Schuller, B.: Automatic assessment of singer traits in popular music: Gender, age, height and race (2011)
[26] Williams, J., Paudel, P.: Application of deep feedforward neural network in transgender vocal analysis
[27] Yang, Y.Y., Hira, M., Ni, Z., Astafurov, A., Chen, C., Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E.Z., et al.: Torchaudio: Building blocks for audio and speech processing. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6982–6986. IEEE (2022)