¹¹institutetext: School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
¹¹email: {yisan,luojingl,juboyuan}@stu.xjtu.edu.cn, [email protected]

Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation

Yishan Lv Jing Luo Boyuan Ju Xinyu Yang

{}^{(\textrm{{\char 0\relax}})}

Abstract

Recently, symbolic music generation has become a focus of numerous deep learning research. Structure as an important part of music, contributes to improving the quality of music, and an increasing number of works start to study the hierarchical structure. In this study, we delve into the multi-level structures within music from macro-level and micro-level hierarchies. At the macro-level hierarchy, we conduct phrase segmentation algorithm to explore how phrases influence the overall development of music, and at the micro-level hierarchy, we design skeleton notes extraction strategy to explore how skeleton notes within each phrase guide the melody generation. Furthermore, we propose a novel Phrase-level Cross-Attention mechanism to capture the intrinsic relationship between macro-level hierarchy and micro-level hierarchy. Moreover, in response to the current lack of research on Chinese-style music, we construct our Small Tunes Dataset: a substantial collection of MIDI files comprising 10088 Small Tunes, a category of traditional Chinese Folk Songs. This dataset serves as the focus of our study. We generate Small Tunes songs utilizing the extracted skeleton notes as conditions, and experiment results indicate that our proposed model, Small Tunes Transformer, outperforms other state-of-the-art models. Besides, we design three novel objective evaluation metrics to evaluate music from both rhythm and melody dimensions.

Keywords:

Symbolic Music Generation Hierarchical Structures Cross Attention Chinese Folk Songs

1 Introduciton

Music stands as a treasure within human civilization. In recent years, music generation has become a focus of deep learning research. Many sequence models has been employed to generate symbolic music[14, 3, 18, 10]. Following the introduction of Music Transformer[7], which utilizes a transformer-based architecture for music generation, several Transformer-based models have made a progress in generating complete melodies[28, 22, 4].

Structure is of great significance to music, recently, plenty of works start to study the hierarchical structural features within the music. [25] studies the phrase-level hierarchy of music, [28, 21] study the bar-level hierarchy of music, [16] studies the phrase & bar-level hierarchies of music. While beneath the bar-level hierarchy, there exists a micro-level hierarchy in which skeleton notes play an important role. In this study, we delve into the hierarchical structure, exploring the intrinsic relationship among macro-level hierarchy and micro-level hierarchy.

Figure 1 illustrates the difference between the phrase & bar-level hierarchies and our macro & micro-level hierarchies. Specifically, a melody comprises several phrases, with each phrase comprising several bars. In the phrase & bar-level hierarchies, the bar serves as the fundamental structure unit. Within a bar there are several notes, among which some play a crucial role in guiding the melody generation. These significant notes, known as skeleton notes, are extracted to establish the micro-level hierarchy in our macro & micro-level hierarchies. For Chinese Folk Songs, most phrases are relatively short and the distinction between phrase-level and bar-level hierarchies is not that obvious, so we examine phrases instead of bars as the macro-level hierarchy.

Accordingly, we conduct phrase segmentation on the melody at the macro-level hierarchy, and design a skeleton notes extraction strategy within each phrase at the micro-level hierarchy. Especially, we define a new type of skeleton note for Chinese Folk Songs. Building upon this, we propose a novel Phrase-level Cross-Attention mechanism, which enables the model a deep understanding of musical features from both macro-level and micro-level hierarchical structures.

We construct our own dataset: Small Tunes¹¹1Small Tunes, as known as XiaoDiao in Chinese phonetics, is a category of Chinese Folk Songs. For details, see https://en.chinaculture.org/library/2008-01/11/content_71371_3.htm Dataset (STD), and utilize it to train our model: Small Tunes Transformer (STT). Utilizing the extracted skeleton notes as conditions, STT is capable of generating Small Tunes songs with clear structure and captivating melody. We design 3 novel metrics to evaluate the quality of music from pitch and rhythm dimensions. The experiment results indicate that STT outperforms other state-of-the-art models on all 5 subjective evaluation metrics and 5 out of 6 objective evaluation metrics. Besides, we add 6 ablative groups to study the impact of changes in macro-level and micro-level hierarchies on music generation, thereby exploring the hierarchical structural features within music.

Our main contributes can be summarized as follows:

•

We propose STT, a Transformer-based model, incorporating the novel Phrase-level Cross-Attention mechanism, to explores the hierarchical structures within music from both macro-level and micro-level hierarchies.
•

We design three objective evaluation metrics: Theme Pitch Corresponding (TPC) and Theme Rhythm Corresponding (TRC) evaluate the coherence corresponding to the theme from pitch and rhythm dimensions, and Pentatonic Scale Consistency (PSC) evaluates consistency in a Chinese-style scale dimension.
•

We construct our own dataset: STD, a large-scale dataset containing 10088 MIDI files, covering almost all recorded Small Tunes songs in China.

Refer to caption — Figure 1: Two multi-level hierarchies: phrase & bar-level hierarchies (left) and our macro & micro-level hierarchies (right). The dashed boxes indicate levels that are not considered in the respective hierarchies.

2 Related Work

Music Transformer [7] is the first work to utilize the Transformer-based architecture to generate music with coherent structure. Drawlody[12], a music generation system, composes music by converting a user-input melody curve into melody. MusicVAE [18] utilizes a hierarchical decoder to generate music with long-term structure. WuYun [24] leverages music theory to prioritize the generation of structurally important notes as the skeleton, gradually filling in ornamental notes to complete the melody. While WuYun effectively generates coherent melodies, it lacks consideration for structural features within music. In this paper, we build upon the principles of WuYun to explore the intrinsic relationship between macro-level and micro-level hierarchies in music.

In recent years, an increasing number of works have focused on the structural features of music. These studies can be categorized based on their exploration of intrinsic structural hierarchies into four types: phrase-level, bar-level, phrase & bar-level, and others. 1) phrase-level: MusicFrameworks [1], a Transformer-LSTM architecture, processes music sequences by incorporating chord, melody, and rhythm features. [2] generates music by imitating the structure, melody, and style of a given seed song. [25] explores the form, harmony, and texture features to enhance the structure within music. Theme Transformer [19] centers on theme-based conditioning, generating music using thematic material as the condition. 2) bar-level: Melons [28], a Transformer-based music generation model, represents music sequences as graphs based on eight custom-defined structural types. Popmnet [21] generates pop music with a well-organized structure by establishing relationships of repetition and sequence between all bars. 3) phrase & bar-level: Hyperbolic Music Transformer [8] enhances the structure of music by leveraging hyperbolic theory. [9] utilizes a data-driven approach to analyze the structure of symbolic music. [16] proposes the Phrase and Bar Countdown events to study the phrase & bar-level hierarchies within music. 4) Others: [6] explores repetitive patterns at the motif-level. [13] progressively expands a music fragment into a complete melody across the motif, phrase, and section levels. [20] explores structural elements at the note, chord, and section levels in music to enhance its quality.

Most of the aforementioned works concentrate on music generation within Western music genres such as Western pop music, while research on Chinese-style music, especially Chinese Folk Songs, remains relatively limited. Although some researchers have employed sequence-to-sequence models for Chinese-style music generation, such as MG-VAE [15] for regional-style Chinese Folk Songs composition, [27] generates melody and arrangement for Chinese pop-style songs.

3 Method

3.1 Phrase Segmentation

The structure of Chinese Small Tunes is unique, often presenting orderly structural patterns. The distinctive hierarchical structure in Chinese Small Tunes reflects traditional style of Chinese Folk Songs. Most of the phrases within Small Tunes are relatively short, and thus we examine the phrases as macro-level hierarchy.

We dedicate to produce the accurate phrase segmentation of Small Tunes, which is significant to explore the intrinsic structural features within music. We apply a deep learning method to get phrase segmentation. The model architecture we select is a convolutional neural network with conditional random field [26], and 1168 labeled Chinese Folk Songs in public data set Essen Folksong Database are used to train the model. Then the phrase segmentation of each song in our dataset can be produced using the trained model. The phrase segmentation of a song is defined as $S=\{s_{1},s_{2},\ldots,s_{n}\}$ , where $n$ is the length of sequence, and for instance, $s_{i}=k$ indicates that the $i^{th}$ note belongs to the $k^{th}$ segment.

3.2 Skeleton Extraction

A melody consists of structural notes and ornamental notes, these structural notes, called skeleton [17], is the underlying framework of a full melody. Based on the melodic skeleton, a full-fledged melody can be composed by filling into ornamental notes. The skeleton notes, which tend to be more prominent in auditory perception, are selected as the micro-level hierarchy for our study.

Skeleton notes can be divided into pitch and rhythm dimensions. One skeleton note extracted from the pitch dimension contributes to the stability and harmony, while one from the rhythm dimension is of importance of the rhythm of melody development.

For the pitch dimension, we define a Small Tunes Trembling Tote, which often occurs in the Chinese Small Tunes, featuring traditional Chinese style. The Small Tunes Trembling Note starts and ends with the note which has the same pitch, among them there exists some other ornamental notes with shorter duration. Figure 2(b) shows one piece of a famous Chinese Folk Song Molihua (or Jasmine Flower) as example.

For the rhythm dimension, we select three types of skeleton notes according to [24], which are metrical accent, syncopation, and long note. After conducting phrase segmentation on a single song, we extract the skeleton notes from each phrase, thereby obtaining the skeleton note sequence. Figure 3 illustrates an example of skeleton extraction result.

3.3 Music Representation

REMI [5] is a a widely used method for symbolic music representation. However, we utilize a triplet format of $\{pitch,duration,segment\}$ instead of REMI to represent symbolic music sequences for the following reasons: 1) The REMI representation results in an excessively long input sequence, complicating melody modeling. 2) Tokens such as bar and position in REMI appear irregularly at the beginning or middle of sequences, disrupting the alignment between skeleton notes and full notes sequences during Phrase-level Cross-Attention (as discussed later). Conversely, the triplet format, which includes only pitch, duration, and segment attributes, represents each note as a single token after concatenation. This ensures a one-to-one correspondence between the skeleton notes sequence and the full notes sequence during Phrase-level Cross-Attention, thereby enhancing modeling efficiency.

The pitch and duration values are obtained directly, while the segment value is derived from the outcome of phrase segmentation. After being converted into the digital format, the symbolic music token sequence can be fed into the model as input. Figure 2(a) illustrates the music representation.

For pitch sequence $P:\{p_{1},p_{2},\ldots,p_{n}\}$ , duration sequence $D:\{d_{1},d_{2},\ldots,d_{n}\}$ , segment sequence $S:\{s_{1},s_{2},\ldots,s_{n}\}$ . $P,D,S\in R^{n\times 1}$ , we embed them as $P_{emb},D_{emb},S_{emb}\in R^{n\times d_{model}}$ where $d_{model}$ represents the embedding dimension. Then we utilize a fusion layer to merge the pitch, duration and segment information, resulting in what we denote as Music Fusion (MF) in Equation 1, where $W_{MF}$ represents a trainable linear, and $\oplus$ is a vector concatenation operation.

MF=W_{MF}\cdot(P_{emb}\oplus D_{emb}\oplus S_{emb})

(1)

The positional encoding is illustrated in Equation 2, $PE_{i}$ is the original positional encoding of transformer where $I=\{0,1,\ldots,n-1\}$ represents the index of the music sequence, besides, we propose an additional positional encoding $PE_{s}$ to embed the phrase segment $S:\{s_{1},s_{2},\ldots,s_{n}\}$ .

PE=PE_{i}+PE_{s}

(2)

Now, the input of encoder and decoder block is as follows:

input=MF+PE

(3)

3.4 Model Architecture

We model a song from macro-level and micro-level hierarchies. At the macro-level hierarchy, a Small Tunes song consists of multiple phrases, which intricately interweave and connect with each other. At the micro-level hierarchy, skeleton notes within each phrase play a pivotal role in guiding the melody generation. In order to better study the intrinsic features among these hierarchical structures, we propose a novel Phrase-level Cross-Attention.

The skeleton notes sequence input of encoder block and the full notes sequence input of decoder block are denoted as $G_{input}$ and $H_{input}$ respectively. After being processed by the encoder block, $G_{input}$ serves as the key and value inputs for the Phrase-level Cross-Attention in decoder block, denoted as $G^{{}^{\prime}}:\{g_{1},g_{2},\ldots,g_{m}\}\in R^{m\times d_{model}}$ , and after being processed by the Masked Relative Self-Attention[7] and Add & Norm layer, $H_{input}$ serves as the query input for the Phrase-level Cross-Attention, denoted as $H^{{}^{\prime}}:\{h_{1},h_{2},\ldots,h_{n}\}\in R^{n\times d_{model}}$ . Where $m$ and $n$ are the length of skeleton notes sequence and the length of full notes sequence respectively, and $d_{model}$ is the embedding dimension. The query ( $Q$ ), key ( $K$ ) and value ( $V$ ) are show as Equation 4, where $W_{Q}$ , $W_{K}$ , $W_{V}$ are three trainable linear layers.

Q,K,V=W_{Q}\cdot H^{{}^{\prime}},W_{K}\cdot G^{{}^{\prime}},W_{V}\cdot G^{{}^{% \prime}}

(4)

We design a Phrase-level Mask Matrix to ensure that the melody generation of one phrase only attends the skeleton notes within the same phrase, thereby the skeleton notes can guide the melody generation of the corresponding phrase. For explanation purposes, we provide an example as follows. Given the $k^{th}(k\in 1,2,\ldots)$ phrase, after performing phrase segmentation operations as mentioned earlier, we obtain the phrase segmentation labels: $S^{g}:\{s_{1}^{g},s_{2}^{g},\ldots,s_{m}^{g}\}$ for the skeleton notes sequence and $S^{h}:\{s_{1}^{h},s_{2}^{h},\ldots,s_{n}^{h}\}$ for the full notes sequence. Based on this result, we can extract the skeleton notes subsequence $G_{k}^{{}^{\prime}}:\{g_{i},g_{i+1},\ldots,g_{j}\}$ and the full notes subsequence $H_{k}^{{}^{\prime}}:\{h_{p},h_{p+1},\ldots,h_{q}\}$ within the $k^{th}$ phrase, according to Equation 5.

s_{i}^{g}=s_{i+1}^{g}=\dots=s_{j}^{g}=k=s_{p}^{h}=s_{p+1}^{h}=\dots=s_{q}^{h}

(5)

Where $i$ , $p$ are the index of the first note in one phrase and $j$ , $q$ are the index of the last note. Furthermore, after obtaining the index $i$ , $j$ , $p$ and $q$ , the $k^{th}$ block matrix can be represented as Equation 6, where $r$ stands for row, $c$ stands for column, $0$ represents no masking required while $-\infty$ indicates masking.

M^{k}=\left\{\begin{aligned} &\hfill\phantom{00}0,&&p\leq r\leq q\text{ and }i% \leq c\leq j\\ &-\infty,&&\hfill\phantom{p\leq r\leq q}\hfill\text{others}\end{aligned}\right.

(6)

Performing the same operation on each phrase yields a total of $n_{p}$ block matrices, where $n_{p}$ is the number of phrases. Combining these matrices yields the Phrase-level Mask Matrix $M$ .

Finally, the output of Phrase-level Cross-Attention can be obtained as Equation 7. Figure 5 illustrates the architecture of our model.

Att(Q,K,V,M)=softmax(\frac{Q\cdot K^{T}}{\sqrt{d_{model}}}+M)\cdot V

(7)

4 EXPERIMENT

4.1 Experiment Settings

4.1.1 Dataset.

There has been abundant research on Western music genres like classical and pop music, while studies on Chinese-style songs remain relatively limited. Chinese Folk Songs, a unique music genre of Chinese-style songs, with strong regional characteristics[11, 23], captivating melody and richest numbers, can be traced back to the Classic of Poetry (or Shijing) over 3,000 years ago. Small Tunes¹¹1https://chinglohsiu.github.io/files/MGD.html, a category of Chinese Folk Songs, is popular among towns or countries and is characterized by fixed melody and lyrics, orderly structure, and subtle, melodious tunes. Small Tunes serve as the focus of our study.

We construct our dataset, named the Small Tunes Dataset¹¹1https://chinglohsiu.github.io/files/MGD.html (STD), a large-scale collection of 10088 Small Tunes songs. STD encompasses almost all recorded Small Tunes songs from 31 provinces in China, each meticulously transcribed into MIDI format by us. For model training, we select songs with a time signature denominator of 4.

4.1.2 Baseline Models.

In order to explore the advantages of the model architecture, we select three models as our baseline models:

•

Music Transformer (MT), which is the first Transformer-based model to generate symbolic music[7].
•

WuYun, which uses the skeleton notes as a condition but lack of any segment information[24].
•

Music Transformer with Phrase and Bar Countdown events (MT+P
h &BC), which introduces Phrase and Bar Countdown events to enhance structural coherence [16].

4.1.3 Experiment Configurations.

We utilize 7280 songs from our STD after data preprocessing, with 90% selected as training set to train the model and the remaining 10% as test set to evaluate the performance of the model. The number of layers for both encoder and decoder is 6. The embedding dimension $d_{model}$ is 256, learning rate is 0.001, batch size is 16, and the optimizer we select is Adam with $\epsilon=10^{-8}$ , $\beta_{1}=0.9$ , $\beta_{2}=0.999$ .

4.2 Subjective Evaluation

To assess the quality of the generated music, we conduct a subjective evaluation. Specifically, we invite 10 music experts with professional music training and instrument-playing experience to rate 10 songs generated by STT, three state-of-the-art models and human composers(Ground Truth) on five aspects:

•

Melody: Whether the melody is clear and captivating.
•

Rhythm: Whether the rhythm features consistency.
•

Structure: Whether the melody features a hierarchical structure in its phrases.
•

Skeleton: Whether there are any notes that audibly stand out, playing a role of the musical skeleton.
•

Overall: The overall auditory perception of the entire song.

Figure 6 shows the results of the subjective evaluation. The results indicate that our model, STT, outperforms other state-of-the-art models across all subjective evaluation metrics. This suggests that STT is capable of generating melodies that are more captivating, structures that are clearer, and themes that are more consistently coherent compared to other models. Specially, experts note that STT exhibits a more prominent hierarchical structure in the generated melody compared to other baseline models. However, compared to human compositions, the music generated by STT still exhibits some flaws, indicating room for improvement.

4.3 Objective Evaluation

To ensure a comprehensive assessment of the generated music, we also perform an objective evaluation using six metrics. Specifically, we propose three objective evaluation mechanisms as follows:

Theme Rhythm Correspondence (TRC)

TRC=\min_{i}D_{\text{hamming }}\left(R_{\text{theme }},R_{i}\right)

(8)

We propose Theme Rhythm Correspondence to evaluate the rhythm of the generated melody in relation to the theme piece. For this study, the first two bars, as prompt during the generation phase, are selected as the theme sequence. $R_{theme}$ is the binary onset vector of the theme piece ( $1$ represents an onset, otherwise $0$ ), similarly, $R_{i}$ is the binary onset vector of the $i^{th}$ melody with the same length as the theme piece, and $D_{hamming}(R_{theme},R_{i})$ is the hamming distance to compute the difference of the two melodies $R_{theme}$ , $R_{i}$ . The smaller the TRC value, the more rhythmically similar the generated melody is to the theme, reflecting better rhythmic coherence.

Theme Pitch Correspondence (TPC)

TPC=\min_{i}D_{\text{hamming }}\left(P_{\text{theme }},P_{i}\right)

(9)

Similarly, we propose Theme Pitch Correspondence to evaluate the generated melody, with $P_{theme}$ and $P_{i}$ denote the pitch sequence of theme and the $i^{th}$ piece, respectively.

Pentatonic Scale Consistency (PSC)

S_{i}=\left\{\begin{aligned} &\hfill\phantom{00}10,&&p_{i}\in\{C,D,E,G,A\}\\ &\hfill\phantom{00}6,&&p_{i}\in\{F,F^{\#},B,B^{b},\}\\ &-10,&&\hfill\phantom{p\leq r\leq}\hfill\text{others}\end{aligned}\right.

(10)

\text{PSC}=\frac{1}{n}\left(\sum_{i=1}^{n}s_{i}\right)

(11)

We propose Pentatonic Scale Consistency to evaluate the consistency of generated melody in the pitch scale dimension. Traditional Chinese songs are mostly composed using the Chinese Pentatonic Scale, a distinctive system in Chinese music. This scale consists of five tones: $C$ , $D$ , $E$ , $G$ , and $A$ , which satisfy the perfect fifth intervals. Additionally, four tones ( $F$ , $F^{\#}$ , $B$ and $B^{b}$ ) can be added to play ornamental roles. We rate each note in the melody: assign 10 points if it belongs to $\{C,D,E,G,A\}$ , 6 points if it belongs to $\{F,F^{\#},B,B^{b}\}$ , and deduct 10 points if it does not adhere to the rules of the pentatonic scale. Finally, compute the average score across all notes to obtain the PSC. PSC evaluates whether a melody follows the pattern of the Chinese pentatonic scale.

Moreover, we utilize Rhythm Consistency (RC), Pitch Entropy (PE) and Pitch Class Entropy (PCE) from MusPy²²2https://salu133445.github.io/muspy/metrics.html to evaluate the pitch consistency of melody.

4.3.1 Comparison Result.

To evaluate the performance, we compare our model, STT, against three baseline models and human compositions (Ground Truth). Table 1. shows the result of comparative experiment. STT outperforms other baseline models in all metrics except RC. The closest TPC and TRC values among all baseline models indicate that our model generates more coherent melodies in both pitch and rhythm dimensions. This suggests that Phrase-level Cross-Attention mechanism effectively learns structural features of Small Tunes songs at both macro level and micro level hierarchies. The PSC, PCE and PE value of our model closely match those of ground truth, indicating its capability to generate Small Tunes songs with more consistent melodies. Although STT slightly lags behind the MT model by 1.3% in the RC metric, its close proximity to the ground truth indicates that both models perform well in generating melodies with consistent rhythm.

Table 1: Objective evaluation results of comparative experiments. For all metrics, models with values closer to the ground truth demonstrate better performance.

Model	TPC	TRC	RC	PSC	PCE	PE
STT(ours)	4.61	2.67	85.2%	9.75	2.29	2.66
WuYun	6.26	3.59	85.1%	9.78	2.28	2.62
MT	5.39	3.2	86.2%	9.82	2.25	2.58
MT+Ph&BC	7.70	4.75	86.5%	9.90	2.23	2.53
Ground Truth	3.78	1.91	87.3%	9.73	2.31	2.65

4.3.2 Ablation Result.

To explore the underlying features of hierarchical structure in the Chinese Small Tunes, we design 6 ablative groups focusing on two key aspects: phrase segmentation and skeleton notes extraction. In addition to the phrase segmentation utilized in our method, we also employ three phrase segmentation methods:

•

No use of phrase segmentation, treating the music sequence as a single segment (abbreviated as No Segment).
•

Selection of 2 bars as the phrase unit, a rule-based approach to phrase segmentation (abbreviated as 2 Bars).
•

Expansion of the phrase boundaries from our phrase segmentation result, combining two phrases into a larger unit (abbreviated as Expansion).

Based on these phrase segmentation methods, we additionally design a skeleton notes extraction method, which reduces the number of extracted skeleton notes by randomly removing 50% skeleton notes within each phrase.

Table 2: Objective evaluation results of ablation experiments. Phrase and Skeleton are abbreviations for the ablation methods of phrase segmentation and skeleton notes extraction, respectively. For all metrics, models with values closer to the ground truth demonstrate better performance.

Group	Phrase	Skeleton	TPC	TRC	RC	PSC	PCE	PE
1(STT)	-	-	4.61	2.67	85.2%	9.75	2.29	2.66
2(WuYun)	No Segment	-	6.26	3.59	85.1%	9.78	2.28	2.62
3	2 Bars	-	6.58	3.76	84.5%	9.76	2.30	2.67
4	Expansion	-	5.11	3.17	85.0%	9.80	2.26	2.60
5	-	Remove 50%	4.94	2.92	85.5%	9.80	2.25	2.57
6	No Segment	Remove 50%	5.25	3.18	85.1%	9.81	2.24	2.55
7	2 Bars	Remove 50%	7.21	4.02	84.9%	9.77	2.29	2.62
8	Expansion	Remove 50%	5.23	3.12	85.8%	9.82	2.24	2.56
Ground Truth			3.78	1.91	87.3%	9.73	2.31	2.65

Table 2 shows the result of ablation experiment. We construct 6 ablation groups (Group 2-8) by adjusting the phrase segmentation and skeleton notes extraction strategies. Group 1-4 and 5-8 each employ the same skeleton notes extraction strategy within their groups but utilize different phrase segmentation strategies. Group 1 and Group 5 respectively achieve the best performance in TPC and TRC, indicating that our phrase segmentation strategy contributes to generating coherent melodies. Furthermore, Group 1 outperforms Group 2-4 in almost all metrics except PCE, suggesting that inappropriate segment boundaries are detrimental to capturing the structural features within Small Tunes songs. Moreover, Group 1 outperforms Group 5, indicating that an appropriate number of skeleton notes contribute to guiding the melody generation and constructing the hierarchical structure.

5 CONCLUSION

In order to study the hierarchical structural features within music, we delve into multi-level hierarchies: at the macro-level hierarchy, we conduct phrase segmentation algorithm to study the impact of phrase on the overall structural organization, and at the micro-level hierarchy, we design a skeleton notes extraction strategy to explore how skeleton notes within phrases influence the melody generation. Building upon this, we propose a novel Phrase-level Cross-Attention to capture the intrinsic relationship among multi-level hierarchies. Moreover, we train our proposed model: Small Tunes Transformer on our own established dataset: Small Tunes Dataset, providing a new perspective for the composition of Chinese-style music. We design three novel metrics to evaluate music from rhythm and melody dimensions. The experiment results indicate that our model outperforms other state-of-the-art models on both subjective and objective evaluations. Additionally, we add several ablative groups to deeply explore the intrinsic features within hierarchical structures. In future work, we aim to extend our study of macro and micro-level hierarchies within music, particularly focusing on polyphonic compositions.

References

[1] Dai, S., Jin, Z., Gomes, C., Dannenberg, R.B.: Controllable deep melody generation via hierarchical music structure representation. In: Proceedings of the 22nd International Society for Music Information Retrieval Conference. pp. 143–150 (2021)
[2] Dai, S., Ma, X., Wang, Y., Dannenberg, R.B.: Personalised popular music generation using imitation and structure. Journal of New Music Research 51(1), 69–85 (2022)
[3] Dong, H.W., Hsiao, W.Y., Yang, L.C., Yang, Y.H.: Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32, pp. 34–41 (2018)
[4] Guo, Z., Kang, J., Herremans, D.: A domain-knowledge-inspired music embedding space and a novel attention mechanism for symbolic music modeling. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 5070–5077 (2023)
[5] Hsiao, W.Y., Liu, J.Y., Yeh, Y.C., Yang, Y.H.: Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 178–186 (2021)
[6] Hu, Z., Ma, X., Liu, Y., Chen, G., Liu, Y., Dannenberg, R.B.: The beauty of repetition: An algorithmic composition model with motif-level repetition generator and outline-to-music generator in symbolic music generation. IEEE Trans. Multim. 26, 4320–4333 (2024)
[7] Huang, C.Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N.M., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M., Eck, D.: Music transformer: Generating music with long-term structure. In: International Conference on Learning Representations (2018)
[8] Huang, W., Yu, Y., Xu, H., Su, Z., Wu, Y.: Hyperbolic music transformer for structured music generation. IEEE Access 11, 26893–26905 (2023)
[9] Jiang, J., Chin, D., Zhang, Y., Xia, G.: Learning hierarchical metrical structure beyond measures. In: Proceedings of the 23rd International Society for Music Information Retrieval Conference (2022)
[10] Johnson, D.D., Keller, R.M., Weintraut, N.: Learning to create jazz melodies using a product of experts. In: ICCC. pp. 151–158 (2017)
[11] Li, J., Luo, J., Ding, J., Zhao, X., Yang, X.: Regional classification of chinese folk songs based on crf model. Multimedia tools and applications 78, 11563–11584 (2019)
[12] Liang, Q., Wang, Y.: Drawlody: Sketch-based melody creation with enhanced usability and interpretability. IEEE Transactions on Multimedia (2024)
[13] Lu, P., Tan, X., Yu, B., Qin, T., Zhao, S., Liu, T.Y.: Meloform: Generating melody with musical form based on expert systems and neural networks. In: Proceedings of the 23rd International Society for Music Information Retrieval Conference. pp. 567–574 (2022)
[14] Luo, J., Yang, X., Herremans, D.: Bandcontrolnet: Parallel transformers-based steerable popular music generation with fine-grained spatiotemporal features. arXiv preprint arXiv:2407.10462 (2024)
[15] Luo, J., Yang, X., Ji, S., Li, J.: MG-VAE: deep chinese folk songs generation with specific regional styles. In: Proceedings of the 7th Conference on Sound and Music Technology (CSMT) Revised Selected Papers. pp. 93–106 (2020)
[16] Naruse, D., Takahata, T., Mukuta, Y., Harada, T.: Pop music generation with controllable phrase lengths. In: Proceedings of the 23rd International Society for Music Information Retrieval Conference. pp. 125–131 (2022)
[17] Povel, D.J., et al.: Melody generator: A device for algorithmic music construction. Journal of Software Engineering and Applications 3(07), 683 (2010)
[18] Roberts, A., Engel, J., Raffel, C., Hawthorne, C., Eck, D.: A hierarchical latent vector model for learning long-term structure in music. In: International conference on machine learning. pp. 4364–4373 (2018)
[19] Shih, Y.J., Wu, S.L., Zalkow, F., Muller, M., Yang, Y.H.: Theme transformer: Symbolic music generation with theme-conditioned transformer. IEEE Transactions on Multimedia (2022)
[20] Wu, G., Liu, S., Fan, X.: The power of fragmentation: a hierarchical transformer model for structural segmentation in symbolic music generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 1409–1420 (2023)
[21] Wu, J., Liu, X., Hu, X., Zhu, J.: Popmnet: Generating structured pop music melodies using neural networks. Artificial Intelligence 286, 103303 (2020)
[22] Wu, S.L., Yang, Y.H.: The jazz transformer on the front line: Exploring the shortcomings of ai-composed music through quantitative measures. In: Proceedings of the 21st International Society for Music Information Retrieval Conference. pp. 142–149 (2020)
[23] Yang, X., Luo, J., Wang, Y., Zhao, X., Li, J.: Combining auditory perception and visual features for regional recognition of chinese folk songs. In: Proceedings of the 2018 10th International Conference on Computer and Automation Engineering. pp. 75–81 (2018)
[24] Zhang, K., Wu, X., Zhang, T., Huang, Z., Tan, X., Liang, Q., Wu, S., Sun, L.: Wuyun: exploring hierarchical skeleton-guided melody generation using knowledge-enhanced deep learning. arXiv preprint arXiv:2301.04488 (2023)
[25] Zhang, X., Zhang, J., Qiu, Y., Wang, L., Zhou, J.: Structure-enhanced pop music generation via harmony-aware learning. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 1204–1213 (2022)
[26] Zhang, Y., Xia, G.: Symbolic melody phrase segmentation using neural network with conditional random field. In: Proceedings of the 8th Conference on Sound and Music Technology: Selected Papers from CSMT. pp. 55–65. Springer (2021)
[27] Zhu, H., Liu, Q., Yuan, N.J., Qin, C., Li, J., Zhang, K., Zhou, G., Wei, F., Xu, Y., Chen, E.: Xiaoice band: A melody and arrangement generation framework for pop music. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 2837–2846 (2018)
[28] Zou, Y., Zou, P., Zhao, Y., Zhang, K., Zhang, R., Wang, X.: Melons: generating melody with long-term structure using transformers and structure graph. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 191–195 (2022)