MuseControlLite:
Multiple Time-varying Controls for Music Generation


Updated Melody-conditioned Comparison

The updated melody-conditioned comparison covers four systems:

  • Stable-audio ControlNet
  • MusicGen-stereo-Large
  • Ours (v1)
  • Ours (v2)

Details on Each Baseline

Ours (v1): Uses a one-hot 12-pitch-class chromagram as the condition, the same condition as MusicGen. Both the reviewers and we found the results with this condition to be perceptually poor.
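
As a rough illustration of what a one-hot 12-pitch-class condition looks like, the sketch below keeps only the strongest pitch class in each frame. It is a minimal sketch under stated assumptions, not the feature extraction used by either system: the function name `one_hot_chroma`, the time-major list layout, and the convention that pitch class 0 is C are all hypothetical choices made for this example.

```python
def one_hot_chroma(frames):
    """Keep only the strongest pitch class per frame.

    frames: list of frames, each a list of 12 pitch-class energies
    (hypothetical layout; index 0 is assumed to be C).
    Returns a list of frames with a single 1.0 per frame.
    """
    result = []
    for energies in frames:
        if len(energies) != 12:
            raise ValueError("each frame needs 12 pitch-class energies")
        # Index of the pitch class with the highest energy in this frame.
        strongest = max(range(12), key=lambda pc: energies[pc])
        result.append([1.0 if pc == strongest else 0.0 for pc in range(12)])
    return result


# Toy frame: pitch class 9 (A) dominates, pitch class 4 (E) is weaker.
frame = [0.0] * 12
frame[9], frame[4] = 0.8, 0.5
encoded = one_hot_chroma([frame])
```

In the toy frame the weaker E is discarded entirely, which shows how much harmonic and octave information a one-hot pitch-class condition leaves out.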

Ours (v2): Uses a top-4 128-pitch-class CQT as the condition, the same condition as Stable-audio ControlNet. We find that this condition yields better-sounding results.
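
The top-4 selection can be sketched as follows: for every frame, keep the four largest of the 128 bins and zero the rest. This is an illustrative sketch only. The function name `top_k_bins`, the list layout, the choice to keep the original magnitudes rather than binarize them, and the use of MIDI-style bin indices in the toy frame are assumptions that the text above does not specify.

```python
def top_k_bins(frames, k=4):
    """Keep the k largest bins in each frame and zero the others.

    frames: list of frames, each a list of per-pitch magnitudes
    (128 bins in the setting described above; layout is hypothetical).
    """
    result = []
    for magnitudes in frames:
        # Indices of the k largest magnitudes in this frame.
        ranked = sorted(range(len(magnitudes)),
                        key=lambda b: magnitudes[b], reverse=True)
        keep = set(ranked[:k])
        result.append([m if b in keep else 0.0
                       for b, m in enumerate(magnitudes)])
    return result


# Toy frame with five active bins; the weakest one (bin 76) should be dropped.
frame = [0.0] * 128
for pitch, magnitude in [(60, 0.9), (64, 0.7), (67, 0.6), (72, 0.4), (76, 0.2)]:
    frame[pitch] = magnitude
sparse = top_k_bins([frame])
```

Compared with a one-hot pitch-class condition, this representation retains octave information and up to four simultaneous pitches per frame.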

Pretrained Backbone (Original Stable-audio): We also include samples generated by the original Stable-audio Open with only the text condition, to showcase the expressiveness of the pretrained backbone.

Melody Control
Columns: Reference | Text Prompt | Ours (v2) | Stable Audio ControlNet | MusicGen Stereo Large | Ours (v1) | Original Stable Audio

A heartfelt, warm acoustic guitar performance, evoking a sense of tenderness and deep emotion, with a melody that truly resonates and touches the heart.

A vibrant MIDI electronic composition with a hopeful and optimistic vibe.

This track composed of electronic instruments gives a sense of opening and clearness.

This track composed of electronic instruments gives a sense of opening and clearness.

Hopeful instrumental with guitar being the lead and tabla used for percussion in the middle giving a feeling of going somewhere with positive outlook.

A string ensemble opens the track with legato, melancholic melodies. The violins and violas play beautifully, while the cellos and bass provide harmonic support for the moving passages. The overall feel is deeply melancholic, with an emotionally stirring performance that remains harmonious and a sense of clearness.

An exceptionally harmonious string performance with a lively tempo in the first half, transitioning to a gentle and beautiful melody in the second half. It creates a warm and comforting atmosphere, featuring cellos and bass providing a solid foundation, while violins and violas showcase the main theme, all without any noise, resulting in a cohesive and serene sound.

Pop solo piano instrumental song. Simple harmony and emotional theme. Makes you feel nostalgic and wanting a cup of warm tea sitting on the couch while holding the person you love.

A whimsical string arrangement with rich layers, featuring violins as the main melody, accompanied by violas and cellos. The light, playful melody blends harmoniously, creating a sense of clarity.

An instrumental piece primarily featuring acoustic guitar, with a lively and nimble feel. The melody is bright, delivering an overall sense of joy.

A joyful saxophone performance that is smooth and cohesive, accompanied by cello. The first half features a relaxed tempo, while the second half picks up with an upbeat rhythm, creating a lively and energetic atmosphere. The overall sound is harmonious and clear, evoking feelings of happiness and vitality.

A cheerful piano performance with a smooth and flowing rhythm, evoking feelings of joy and vitality.

An instrumental piece primarily featuring piano, with a lively rhythm and cheerful melodies that evoke a sense of joyful childhood playfulness. The melodies are clear and bright.

fast and fun beat-based indie pop to set a protagonist-gets-good-at-x movie montage to.

A lively 70s style British pop song featuring drums, electric guitars, and synth violin. The instruments blend harmoniously, creating a dynamic, clean sound without any noise or clutter.

A soothing acoustic guitar song that evokes nostalgia, featuring intricate fingerpicking. The melody is both sacred and mysterious, with a rich texture.

Updated Audio Outpainting Comparison

The updated Audio Outpainting comparison covers two systems:

  • MusicGen-stereo-medium
  • Ours

We provide the first 15 seconds as a reference, and both systems generate the next 15 seconds. Neither uses any additional condition, such as text or melody.

Music Continuation
Columns: Reference | MusicGen | Ours

Highlighted Audio

In the Highlighted Audio section, we present some samples that the reviewers questioned. We also include samples generated by the original Stable-audio Open to showcase the expressiveness of the pretrained backbone. In our opinion, Stable-audio handles poorly the concepts that are absent from its training data (it was originally trained with tags from FMA and Freesound). For example, it does not perform well with terms like "Jazz," "melodic," and "legato." Since we did not focus on enhancing text adherence, the expressiveness of our model is bounded by that of the pretrained backbone.

Dynamics Control
Columns: Generated Music | Text | Feature Plots | Original Stable Audio

a recording of a melodic piano solo.

Features

jazz band with piano, drum and guitar, high quality.

Features

Melody Control
Columns: Reference | Generated Music v1 | Generated Music v2 | Text | Feature Plots | Original Stable Audio

jazz band, high quality.

Features

Rhythm Control
Columns: Reference | Generated Music | Text | Feature Plots | Original Stable Audio

A cello quartet playing harmonized legato notes in a perfectly treated chamber music studio.

Features

Audio Inpainting and Audio Outpainting

For audio inpainting, we mask the span from 5s to 25s of the original 30-second audio and consider two settings:

  1. Unconditional inpainting: The model uses only the unmasked segment of the audio as a condition, without any text or musical attribute conditions.
  2. Inpainting with musical attribute condition: The musical attribute condition is masked wherever the audio is unmasked, so that it guides only the region being inpainted.

The same procedure applies to audio outpainting, except that the audio is masked from 10s to 30s.
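
The complementary masking described above can be sketched as a pair of binary "keep" masks over time. This is a minimal sketch, not the released implementation: the helper names `region_mask` and `keep_masks` and the one-frame-per-second resolution are hypothetical, chosen only to make the 30-second examples easy to read.

```python
def region_mask(total_s, start_s, end_s, frames_per_second=1):
    """Return 1 for frames inside [start_s, end_s) and 0 elsewhere."""
    n_frames = int(total_s * frames_per_second)
    lo = int(start_s * frames_per_second)
    hi = int(end_s * frames_per_second)
    return [1 if lo <= i < hi else 0 for i in range(n_frames)]


def keep_masks(total_s, start_s, end_s, frames_per_second=1):
    """Build keep-masks for the audio and the musical attribute condition.

    The audio is kept outside the region to generate; the attribute
    condition is kept only inside it, i.e. it is masked wherever the
    audio is unmasked.
    """
    generate = region_mask(total_s, start_s, end_s, frames_per_second)
    audio_keep = [1 - g for g in generate]
    attribute_keep = list(generate)
    return audio_keep, attribute_keep


# Inpainting: regenerate 5s to 25s of a 30-second clip.
inpaint_audio, inpaint_attr = keep_masks(30, 5, 25)
# Outpainting: keep the first 10s and generate 10s to 30s.
outpaint_audio, outpaint_attr = keep_masks(30, 10, 30)
```

For the unconditional setting, only the audio keep-mask would be used and the attribute condition would be dropped entirely.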

Audio Inpainting
Columns: Reference | Generated Music

Audio Inpainting with Dynamics Control
Columns: Reference | Generated Music | Feature Plots
[3 feature plots]

Audio Inpainting with Melody Control
Columns: Reference | Generated Music | Feature Plots
[3 feature plots]

Audio Inpainting with Rhythm Control
Columns: Reference | Generated Music | Feature Plots
[3 feature plots]

Audio Outpainting
Columns: Reference | Generated Music

Audio Outpainting with Dynamics Control
Columns: Reference | Generated Music | Feature Plots
[3 feature plots]

Audio Outpainting with Melody Control
Columns: Reference | Generated Music | Feature Plots
[3 feature plots]

Audio Outpainting with Rhythm Control
Columns: Reference | Generated Music | Feature Plots
[3 feature plots]

Musical Attribute Control

In the Musical Attribute Control section, each table uses the conditions named in its heading. Attributes that are not explicitly named are not provided to the model.

Melody, Rhythm & Dynamics Control
Columns: Reference | Generated Music | Text | Feature Plots

A high-quality cello solo with deep, rich vibrato and expressive bowing.

Features

A bluesy piano solo with expressive slides and swing feel.

Features

An energetic jazz drum solo with intricate snare work and cymbal accents.

Features

Dynamics Control
Columns: Generated Music | Text | Feature Plots

a recording of a melodic piano solo.

Features

jazz band with piano, drum and guitar, high quality.

Features

acoustic guitar solo.

Features

Melody Control
Columns: Reference | Generated Music | Text | Feature Plots

Electrical guitar solo, smoothly.

Features

A cinematic orchestral soundtrack with swelling strings and dramatic percussion.

Features

jazz band, high quality.

Features

Rhythm Control
Columns: Reference | Generated Music | Text | Feature Plots

A cello quartet playing harmonized legato notes in a perfectly treated chamber music studio.

Features

A jazz guitar playing silky smooth chords in a dimly lit jazz club with perfectly placed microphones.

Features

A xylophone ringing brightly inside a top-tier music conservatory rehearsal room.

Features