The updated melody-conditioned comparison includes four baselines:
Ours (v1): Uses a one-hot 12-pitch-class chromagram as the condition, as in MusicGen. Both the reviewers and we found its output perceptually poor.
Ours (v2): Uses a top-4 128-pitch-class CQT as the condition, as in Stable Audio ControlNet, which sounds better.
Pretrained Backbone (Original Stable Audio): We also include samples generated by the original Stable Audio Open with only the text condition, to showcase the expressiveness of the pretrained backbone.
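
To make the difference between the two melody conditions concrete, here is a minimal sketch of how each could be derived from a precomputed time-frequency matrix. It is an illustration under stated assumptions, not the code used for these samples: the function names `one_hot_chromagram` and `top_k_cqt` are hypothetical, the inputs are assumed to be non-negative magnitude arrays of shape (bins, frames), and the top-4 condition is assumed to be a binary mask (the actual condition may keep magnitudes instead).

```python
import numpy as np

def one_hot_chromagram(chroma: np.ndarray) -> np.ndarray:
    """v1-style condition (assumed form): keep only the strongest of the
    12 pitch classes in every frame. `chroma` has shape (12, T)."""
    frames = np.arange(chroma.shape[1])
    strongest = chroma.argmax(axis=0)            # (T,) index of the loudest pitch class
    out = np.zeros_like(chroma, dtype=np.float32)
    out[strongest, frames] = 1.0
    return out

def top_k_cqt(cqt_mag: np.ndarray, k: int = 4) -> np.ndarray:
    """v2-style condition (assumed form): keep the k strongest of the
    128 CQT bins in every frame as a binary mask. `cqt_mag` has shape (128, T)."""
    frames = np.arange(cqt_mag.shape[1])
    # argpartition places the k largest entries of each column in the last k rows
    top = np.argpartition(cqt_mag, -k, axis=0)[-k:]   # (k, T)
    out = np.zeros_like(cqt_mag, dtype=np.float32)
    out[top, frames] = 1.0                            # frames broadcasts to (k, T)
    return out
```

The one-hot chromagram discards octave information and everything except a single pitch class per frame, whereas the top-4 CQT keeps up to four pitches per frame at their absolute register, which is one plausible reason the v2 samples follow the reference melody more closely.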

Melody Control

| Reference | Text Prompt | Ours(v2) | Stable Audio ControlNet | MusicGen Stereo Large | Ours(v1) | Original Stable Audio |
|---|---|---|---|---|---|---|
| | A heartfelt, warm acoustic guitar performance, evoking a sense of tenderness and deep emotion, with a melody that truly resonates and touches the heart. | | | | | |
| | A vibrant MIDI electronic composition with a hopeful and optimistic vibe. | | | | | |
| | This track composed of electronic instruments gives a sense of opening and clearness. | | | | | |
| | This track composed of electronic instruments gives a sense of opening and clearness. | | | | | |
| | Hopeful instrumental with guitar being the lead and tabla used for percussion in the middle giving a feeling of going somewhere with positive outlook. | | | | | |
| | A string ensemble opens the track with legato, melancholic melodies. The violins and violas play beautifully, while the cellos and bass provide harmonic support for the moving passages. The overall feel is deeply melancholic, with an emotionally stirring performance that remains harmonious and a sense of clearness. | | | | | |
| | An exceptionally harmonious string performance with a lively tempo in the first half, transitioning to a gentle and beautiful melody in the second half. It creates a warm and comforting atmosphere, featuring cellos and bass providing a solid foundation, while violins and violas showcase the main theme, all without any noise, resulting in a cohesive and serene sound. | | | | | |
| | Pop solo piano instrumental song. Simple harmony and emotional theme. Makes you feel nostalgic and wanting a cup of warm tea sitting on the couch while holding the person you love. | | | | | |
| | A whimsical string arrangement with rich layers, featuring violins as the main melody, accompanied by violas and cellos. The light, playful melody blends harmoniously, creating a sense of clarity. | | | | | |
| | An instrumental piece primarily featuring acoustic guitar, with a lively and nimble feel. The melody is bright, delivering an overall sense of joy. | | | | | |
| | A joyful saxophone performance that is smooth and cohesive, accompanied by cello. The first half features a relaxed tempo, while the second half picks up with an upbeat rhythm, creating a lively and energetic atmosphere. The overall sound is harmonious and clear, evoking feelings of happiness and vitality. | | | | | |
| | A cheerful piano performance with a smooth and flowing rhythm, evoking feelings of joy and vitality. | | | | | |
| | An instrumental piece primarily featuring piano, with a lively rhythm and cheerful melodies that evoke a sense of joyful childhood playfulness. The melodies are clear and bright. | | | | | |
| | fast and fun beat-based indie pop to set a protagonist-gets-good-at-x movie montage to. | | | | | |
| | A lively 70s style British pop song featuring drums, electric guitars, and synth violin. The instruments blend harmoniously, creating a dynamic, clean sound without any noise or clutter. | | | | | |
| | A soothing acoustic guitar song that evokes nostalgia, featuring intricate fingerpicking. The melody is both sacred and mysterious, with a rich texture. | | | | | |

Music Continuation

| Reference | MusicGen | Ours |
|---|---|---|

In the Highlighted Audio section, we display some samples that the reviewers questioned. We also include samples generated by the original Stable Audio Open to showcase the expressiveness of the pretrained backbone. In our opinion, Stable Audio lacks knowledge of concepts that are absent from its training data (it was originally trained with tags from FMA and Freesound). For example, it performs poorly on terms such as "Jazz," "melodic," and "legato." Because we did not focus on improving text adherence, the expressiveness of our model is bounded above by that of the pretrained backbone.

Dynamics Control

| Generated Music | Text | Feature Plots | Original Stable Audio |
|---|---|---|---|
| | a recording of a melodic piano solo. | | |
| | jazz band with piano, drum and guitar, high quality. | | |

Melody Control

| Reference | Generated Music v1 | Generated Music v2 | Text | Feature Plots | Original Stable Audio |
|---|---|---|---|---|---|
| | | | jazz band, high quality. | | |

Rhythm Control

| Reference | Generated Music | Text | Feature Plots | Original Stable Audio |
|---|---|---|---|---|
| | | A cello quartet playing harmonized legato notes in a perfectly treated chamber music studio. | | |

For audio inpainting, we mask the segment from 5 s to 25 s of the original 30-second audio.
Audio outpainting follows the same procedure, except that the mask covers 10 s to 30 s.
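
The two masking windows can be sketched as boolean sample masks. This is only an illustration: the 44.1 kHz sample rate, the helper names `time_mask` and `apply_mask`, and the choice to zero out the masked region in the waveform are all assumptions, since the page does not state the domain in which the mask is applied (it may well be applied to latents rather than raw samples).

```python
import numpy as np

SAMPLE_RATE = 44_100  # assumed sample rate; not stated on this page

def time_mask(total_s: float, start_s: float, end_s: float,
              sr: int = SAMPLE_RATE) -> np.ndarray:
    """Boolean mask over samples: True where the audio is to be regenerated."""
    mask = np.zeros(int(round(total_s * sr)), dtype=bool)
    mask[int(round(start_s * sr)):int(round(end_s * sr))] = True
    return mask

def apply_mask(audio: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero the masked samples (an assumed convention); `audio` has shape (..., n)."""
    return np.where(mask, 0.0, audio)

# Inpainting: the middle 20 s are masked, 5 s of context remain on each side.
inpaint_mask = time_mask(30.0, 5.0, 25.0)
# Outpainting: only the first 10 s are kept as context.
outpaint_mask = time_mask(30.0, 10.0, 30.0)
```

In both settings 20 of the 30 seconds are regenerated. The difference is that inpainting has context on both sides of the gap, while outpainting must continue from the first 10 seconds alone.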

Audio Inpainting

| Reference | Generated Music |
|---|---|

Audio Inpainting with Dynamics Control

| Reference | Generated Music | Feature Plots |
|---|---|---|
| | | |
| | | |
| | | |

Audio Inpainting with Melody Control

| Reference | Generated Music | Feature Plots |
|---|---|---|
| | | |
| | | |
| | | |

Audio Inpainting with Rhythm Control

| Reference | Generated Music | Feature Plots |
|---|---|---|
| | | |
| | | |
| | | |

Audio Outpainting

| Reference | Generated Music |
|---|---|

Audio Outpainting with Dynamics Control

| Reference | Generated Music | Feature Plots |
|---|---|---|
| | | |
| | | |
| | | |

Audio Outpainting with Melody Control

| Reference | Generated Music | Feature Plots |
|---|---|---|
| | | |
| | | |
| | | |

Audio Outpainting with Rhythm Control

| Reference | Generated Music | Feature Plots |
|---|---|---|
| | | |
| | | |
| | | |

In the Musical Attribute Control section, the title of each table names the conditions supplied to the model. Attributes that are not named in the title are not provided.

Melody, Rhythm & Dynamics Control

| Reference | Generated Music | Text | Feature Plots |
|---|---|---|---|
| | | A high-quality cello solo with deep, rich vibrato and expressive bowing. | |
| | | A bluesy piano solo with expressive slides and swing feel. | |
| | | An energetic jazz drum solo with intricate snare work and cymbal accents. | |

Dynamics Control

| Generated Music | Text | Feature Plots |
|---|---|---|
| | a recording of a melodic piano solo. | |
| | jazz band with piano, drum and guitar, high quality. | |
| | acoustic guitar solo. | |

Melody Control

| Reference | Generated Music | Text | Feature Plots |
|---|---|---|---|
| | | Electrical guitar solo, smoothly. | |
| | | A cinematic orchestral soundtrack with swelling strings and dramatic percussion. | |
| | | jazz band, high quality. | |

Rhythm Control

| Reference | Generated Music | Text | Feature Plots |
|---|---|---|---|
| | | A cello quartet playing harmonized legato notes in a perfectly treated chamber music studio. | |
| | | A jazz guitar playing silky smooth chords in a dimly lit jazz club with perfectly placed microphones. | |
| | | A xylophone ringing brightly inside a top-tier music conservatory rehearsal room. | |