MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners


Fang-Duo Tsai¹, Shih-Lun Wu², Weijaw Lee¹, Sheng-Ping Yang¹, Bo-Rui Chen¹, Hao-Chung Cheng¹, Yi-Hsuan Yang¹

¹ National Taiwan University
² Massachusetts Institute of Technology


Abstract


We propose MuseControlLite, a lightweight mechanism designed to fine-tune text-to-music generation models for precise conditioning using various time-varying musical attributes and reference audio signals. The key finding is that positional embeddings, which have seldom been used by text-to-music generation models in the conditioner for text conditions, are critical when the condition of interest is a function of time. Using melody control as an example, our experiments show that simply adding rotary positional embeddings to the decoupled cross-attention layers increases control accuracy from 56.6% to 61.1%, while requiring 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We evaluate various forms of musical attribute control, audio inpainting, and audio outpainting, demonstrating improved controllability over MusicGen-Large and Stable Audio Open ControlNet at a significantly lower fine-tuning cost, with only 85M trainable parameters. Source code, model checkpoints, and demo examples are available at: https://musecontrollite.github.io/web/.
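To make the role of rotary positional embeddings more concrete, the following is a minimal sketch of the standard rotary formulation in plain Python. It is not taken from the MuseControlLite code base: the function names `rope` and `dot` and the frequency base of 10000 are assumptions chosen for illustration. The sketch shows the property that matters for time-varying conditions: after rotation, the dot product between a query and a key depends only on their relative position.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`.

    `base` is an assumed frequency base (the commonly used default),
    not a value confirmed by this page.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Each pair of dimensions gets its own rotation frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2 steps) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
```

Here `score_a` and `score_b` agree up to floating-point error, because the rotations cancel except for the offset between the two positions.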


Bibtex

                
                @misc{tsai2025musecontrollitemultifunctionalmusicgeneration,
                    title={MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners}, 
                    author={Fang-Duo Tsai and Shih-Lun Wu and Weijaw Lee and Sheng-Ping Yang and Bo-Rui Chen and Hao-Chung Cheng and Yi-Hsuan Yang},
                    year={2025},
                    eprint={2506.18729},
                    archivePrefix={arXiv},
                    primaryClass={cs.SD},
                    url={https://arxiv.org/abs/2506.18729}, 
                }
                  

Melody-conditioned Comparison

The melody-conditioned comparison includes the following systems:

  • Stable-audio ControlNet
  • MusicGen-stereo-Large
  • Ours (v1)
  • Ours (v2)
  • Ours (v3)

We further classify the models

Details on Each Baseline

Ours (v1): Uses a one-hot 12-pitch-class chromagram as the condition, the same as MusicGen. Both the reviewers and we found this version to be perceptually poor.
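As an illustration of what a one-hot 12-pitch-class condition looks like, the sketch below keeps only the strongest pitch class in each frame. It assumes the chromagram is already available as a list of 12-value frames; the function name `one_hot_chroma` is hypothetical, and the actual feature extraction used for v1 is not described on this page.

```python
def one_hot_chroma(frames):
    """Convert 12-bin chroma frames to one-hot frames (strongest pitch class only).

    `frames` is a list of 12-element lists of non-negative energies.
    Ties are resolved in favor of the lowest pitch-class index.
    """
    result = []
    for frame in frames:
        if len(frame) != 12:
            raise ValueError("each chroma frame must have 12 bins")
        peak = max(range(12), key=lambda i: frame[i])
        result.append([1 if i == peak else 0 for i in range(12)])
    return result

# Two toy frames: pitch class 3 dominates the first, pitch class 9 the second.
frame_a = [0.1] * 12
frame_a[3] = 0.9
frame_b = [0.2] * 12
frame_b[9] = 0.7
condition = one_hot_chroma([frame_a, frame_b])
```

Because only one pitch class survives per frame, octave information and any simultaneous notes are discarded.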

Ours (v2): Uses a top-4 128-pitch-class CQT as the condition. The objective and subjective evaluations in the paper use this version.

Ours (v3): Uses a top-4 128-pitch-class CQT as the condition, but processes the melody of the two audio channels separately. This results in melody accuracy 7.6% higher than ControlNet on the no-singing Song Describer Dataset.
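The top-4 selection used by v2 and v3 can be sketched as follows, assuming the CQT is given as a list of 128-bin frames per channel. The helper names `top_k_mask` and `stereo_condition` are hypothetical, and keeping the original magnitudes of the selected bins (rather than binarizing them) is an assumption; this page does not specify either detail.

```python
def top_k_mask(frame, k=4):
    """Keep the k largest bins of one frame and set all other bins to 0.0."""
    order = sorted(range(len(frame)), key=lambda i: frame[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(frame)]

def stereo_condition(left_frames, right_frames, k=4):
    """Apply the top-k selection to each channel independently (as in v3)."""
    left = [top_k_mask(f, k) for f in left_frames]
    right = [top_k_mask(f, k) for f in right_frames]
    return left, right

# Toy 128-bin frames with distinct values so the top-4 bins are unambiguous.
left_frame = [float(i) for i in range(128)]          # peaks at bins 124..127
right_frame = [float(127 - i) for i in range(128)]   # peaks at bins 0..3
left, right = stereo_condition([left_frame], [right_frame])
```

In this toy example the two channels end up with different active bins, which is the information a per-channel condition preserves.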

Pretrained Backbone (Original Stable-audio): We also include samples generated by the original Stable-audio Open with only the text condition, to showcase the expressiveness of the pretrained backbone.

Melody Control
Columns: Reference, Text Prompt, Ours (v3), Ours (v2), Stable Audio ControlNet, MusicGen Stereo Large, Ours (v1), Original Stable Audio

A heartfelt, warm acoustic guitar performance, evoking a sense of tenderness and deep emotion, with a melody that truly resonates and touches the heart.

A vibrant MIDI electronic composition with a hopeful and optimistic vibe.

This track composed of electronic instruments gives a sense of opening and clearness.

This track composed of electronic instruments gives a sense of opening and clearness.

Hopeful instrumental with guitar being the lead and tabla used for percussion in the middle giving a feeling of going somewhere with positive outlook.

A string ensemble opens the track with legato, melancholic melodies. The violins and violas play beautifully, while the cellos and bass provide harmonic support for the moving passages. The overall feel is deeply melancholic, with an emotionally stirring performance that remains harmonious and a sense of clearness.

An exceptionally harmonious string performance with a lively tempo in the first half, transitioning to a gentle and beautiful melody in the second half. It creates a warm and comforting atmosphere, featuring cellos and bass providing a solid foundation, while violins and violas showcase the main theme, all without any noise, resulting in a cohesive and serene sound.

Pop solo piano instrumental song. Simple harmony and emotional theme. Makes you feel nostalgic and wanting a cup of warm tea sitting on the couch while holding the person you love.

A whimsical string arrangement with rich layers, featuring violins as the main melody, accompanied by violas and cellos. The light, playful melody blends harmoniously, creating a sense of clarity.

An instrumental piece primarily featuring acoustic guitar, with a lively and nimble feel. The melody is bright, delivering an overall sense of joy.

A joyful saxophone performance that is smooth and cohesive, accompanied by cello. The first half features a relaxed tempo, while the second half picks up with an upbeat rhythm, creating a lively and energetic atmosphere. The overall sound is harmonious and clear, evoking feelings of happiness and vitality.

A cheerful piano performance with a smooth and flowing rhythm, evoking feelings of joy and vitality.

An instrumental piece primarily featuring piano, with a lively rhythm and cheerful melodies that evoke a sense of joyful childhood playfulness. The melodies are clear and bright.

fast and fun beat-based indie pop to set a protagonist-gets-good-at-x movie montage to.

A lively 70s style British pop song featuring drums, electric guitars, and synth violin. The instruments blend harmoniously, creating a dynamic, clean sound without any noise or clutter.

A soothing acoustic guitar song that evokes nostalgia, featuring intricate fingerpicking. The melody is both sacred and mysterious, with a rich texture.

Audio Outpainting Comparison

The audio outpainting comparison includes two systems:

  • MusicGen-stereo-medium
  • Ours

We provide the first 15 seconds as a reference, and both systems generate the next 15 seconds. Neither uses additional conditions such as text or melody.
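The outpainting setup can be pictured as a frame-level mask that marks which part of the clip is given and which part must be generated. The sketch below is only an illustration: the function name `outpainting_mask` and the frame rate of 10 frames per second are hypothetical, and how either system actually consumes the reference audio is not described on this page.

```python
def outpainting_mask(total_seconds=30.0, known_seconds=15.0, frames_per_second=10):
    """Return a per-frame mask: 1 for reference frames, 0 for frames to generate.

    `frames_per_second` is an illustrative value, not the model's real rate.
    """
    total = int(round(total_seconds * frames_per_second))
    known = int(round(known_seconds * frames_per_second))
    if known > total:
        raise ValueError("known segment cannot be longer than the clip")
    return [1 if i < known else 0 for i in range(total)]

mask = outpainting_mask()
```

With the defaults, the first half of the mask is 1 (the 15-second reference) and the second half is 0 (the 15 seconds to be generated).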

Music Continuation
Columns: Reference, MusicGen, Ours