
Amazing Free AI Composer: ACE-Step Now Available

MattVidPro
5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ACE-Step is an Apache 2.0 open-source AI music generator built on a 3.5B-parameter open-weight model, with lyric generation and multiple editing workflows.

Briefing

A new Apache 2.0 open-source AI music generator called ACE-Step has been released with a large 3.5B-parameter open-weight model, bringing lyric generation, multi-style music, and several editing workflows into a single toolset. The headline appeal is not just that it can produce full songs quickly on high-end hardware, but that it's designed for iteration: community members can experiment with fine-tuning and LoRA adapters, while users can repaint sections, edit lyrics, and extend tracks to refine results.

Performance claims set expectations clearly. On an NVIDIA A100, the system generates about a minute of audio in a little over two seconds; at lower "step" settings, an RTX 4090 can produce a minute in under two seconds. A MacBook M2 Max can run it too, but noticeably slower. The model's quality knob is its inference step count: pushing to 60 steps roughly doubles generation time, and higher settings can increase pronunciation quirks. Even so, the creator's testing suggests that consumer-grade hardware can still produce usable outputs, just not with one-click convenience.

Beyond raw generation, ACE-Step is positioned as a music-production assistant. It supports mainstream music styles via multiple description formats (short tags, descriptive text, and use-case scenarios) and claims coverage of up to 19 languages, with the top-performing languages including English, Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, and Korean. The model can generate realistic instrumental tracks with complex multi-instrument arrangements while maintaining coherence, and it also supports a range of vocal styles.

A standout feature is audio inpainting: users can upload audio, remove a chunk, and have the model regenerate the missing section so the surrounding material stays consistent. There's also "lyric editing," described as a flow-edit method that allows local lyric modifications while preserving melody and vocals; it is useful when generated lyrics are close but not perfect. For longer projects, the system includes repainting (regenerating a time window), tag editing, and extension workflows to build songs incrementally.

The ecosystem is also part of the pitch. Two LoRA adapters, lyrics-to-vocal and text-to-sample, are presented as demonstrations of subtask fine-tuning. Planned additions include a rap-focused fine-tune aimed at rap generation and "AI rap battles"; StemGen, a ControlNet for generating individual instrument stems from multi-track data; and singing-to-accompaniment, which turns raw vocals into a complementary backing track.

Hands-on tests show the tradeoffs. With mid-level settings, the model often produces a verse-and-hook structure rather than fitting every lyric line into a short target duration. It handles country ballads more smoothly than rap, where lyric delivery and pronunciation can break down. Still, the outputs can be listenable, and the workflow encourages fixing mistakes through higher step counts, repainting, and targeted extensions, turning generation into an iterative creative loop rather than a single-shot result. For developers and AI music enthusiasts with substantial GPUs, ACE-Step is framed as one of the most customizable open-source music generation releases available now, even if it doesn't yet match the polish of top closed-source competitors.

Cornell Notes

ACE-Step is a new Apache 2.0 open-source AI music generator built around a 3.5B-parameter open-weight model. It can generate instrumental tracks and vocals, supports lyric generation, and includes tools for iteration such as audio inpainting, repainting (regenerating a time window), lyric editing, and extensions to build longer songs. Performance depends heavily on inference steps and hardware: an NVIDIA A100 can generate about a minute of audio in a little over two seconds at lower step settings, while a consumer GPU like an RTX 3090 can still produce a minute in around five seconds. Early tests suggest it's more reliable for styles like country ballads than for rap, where pronunciation and lyric delivery can falter. The main value is the combination of open access and production-style controls that let users refine outputs instead of accepting a single generation.

What makes ACE-Step more than a basic "text-to-music" generator?

ACE-Step includes multiple refinement workflows: audio inpainting (upload audio, remove a chunk, regenerate the missing section), repainting (set start/end times and regenerate only that segment), lyric editing (local lyric modifications while preserving melody/vocals), and extension (grow a track after an initial generation). This turns music creation into an iterative process where users can correct near-misses rather than restarting from scratch.
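The repainting idea above can be sketched in a few lines. The splice logic is generic; the `regenerate` callable is a hypothetical stand-in for the model call, and ACE-Step's actual interface may differ:

```python
def repaint(audio: list[float], sr: int, start_s: float, end_s: float, regenerate) -> list[float]:
    """Regenerate only the window [start_s, end_s); keep surrounding audio verbatim."""
    a, b = int(start_s * sr), int(end_s * sr)
    # The model sees the left and right context so the new window stays consistent.
    new_window = regenerate(audio[:a], audio[b:], b - a)
    return audio[:a] + new_window + audio[b:]

# Toy stand-in "model": fills the window with silence of the requested length.
fixed = repaint([0.1] * 16, sr=4, start_s=1.0, end_s=2.0,
                regenerate=lambda left, right, n: [0.0] * n)
print(len(fixed))  # 16: total length is unchanged, only the window differs
```

The point is the contract, not the stub: the surrounding samples are preserved exactly, so only the masked span can change.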

How do inference steps and hardware affect generation speed and quality?

Generation time scales with inference steps. At lower step settings, an RTX 4090 can produce about a minute of audio in under two seconds, an NVIDIA A100 takes a little over two seconds, and an RTX 3090 around five seconds. Raising the quality setting to around 60 steps roughly doubles generation time, and higher step counts can also introduce pronunciation quirks. On a MacBook M2 Max it runs, but much slower than on NVIDIA GPUs.
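As a rough illustration of that scaling, assuming generation time grows linearly with step count (the baseline step count and timing below are hypothetical; the source only states that around 60 steps roughly doubles the time):

```python
def estimate_generation_time(base_seconds: float, base_steps: int, target_steps: int) -> float:
    """Rough estimate assuming generation time scales linearly with inference steps."""
    return base_seconds * (target_steps / base_steps)

# Hypothetical baseline: ~2.3 s for a minute of audio at 30 steps.
# Doubling to 60 steps then roughly doubles the time, matching the claim above.
print(estimate_generation_time(2.3, 30, 60))  # -> 4.6
```

This linear model is a simplification; real throughput also depends on batch size, precision, and the hardware in question.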

Which music capabilities are claimed, and what do tests suggest about style differences?

ACE-Step is described as supporting mainstream styles via tags or descriptive prompts, producing realistic instrumental tracks with multi-instrument arrangements, and generating a range of vocal styles. In practice, the system handled country ballads more smoothly than rap/hip-hop: the beat arrived, but the rapped delivery and pronunciation were inconsistent, supporting the idea that a dedicated rap fine-tune is needed.

What do the LoRA adapters and upcoming modules aim to improve?

Two LoRA adapters are highlighted, lyrics-to-vocal and text-to-sample, demonstrating subtask fine-tuning for specific parts of the workflow. Upcoming additions include a rap-specialized fine-tune for rap generation and "AI rap battles," StemGen (a ControlNet) to generate individual instrument stems for later adjustment, and singing-to-accompaniment to create a backing track from a vocal input.

How do lyric editing and repainting help when lyrics don't fit the generated timing?

When a short target duration can't accommodate all lyric lines, ACE-Step may generate only a verse-and-hook portion. Repainting lets users regenerate a specific time window, while lyric editing allows local lyric changes without fully discarding the melody/vocal structure. Together, they support a workflow of generating a workable base, then fixing mismatches in targeted sections.

Review Questions

  1. What specific editing tools in ACE-Step let users modify only part of a song (rather than regenerating everything)?
  2. How would you expect increasing inference steps to change both speed and lyric pronunciation behavior?
  3. Why might a rap-focused fine-tune be necessary even if the base model can generate rap-like music?

Key Points

  1. ACE-Step is an Apache 2.0 open-source AI music generator built on a 3.5B-parameter open-weight model, with lyric generation and multiple editing workflows.

  2. High-end GPUs dramatically reduce generation time: an NVIDIA A100 and an RTX 4090 can each generate about a minute of audio in roughly two seconds at lower step settings.

  3. Inference steps are a major quality/speed lever; pushing toward 60 steps increases generation time and can worsen lyric pronunciation.

  4. ACE-Step supports audio inpainting, repainting (time-window regeneration), lyric editing (local lyric changes with melody/vocals preserved), and extensions to build longer tracks.

  5. The model claims multi-style support and up to 19 languages, with the top-performing languages including English, Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, and Korean.

  6. LoRA adapters and upcoming modules target workflow gaps: lyrics-to-vocal, text-to-sample, rap specialization, StemGen for instrument stems, and singing-to-accompaniment.

  7. Early hands-on results suggest country ballads are more reliable than rap, where lyric delivery and pronunciation can be inconsistent.

Highlights

ACE-Step pairs open-source access with production-style controls (repainting, lyric editing, inpainting, and extension) so users can refine outputs instead of accepting a single generation.
On an NVIDIA A100, the system can generate about a minute of audio in a little over two seconds at lower step settings; an RTX 4090 can do it in under two seconds.
The model's style performance varies: country ballads come out more coherent, while rap/hip-hop shows weaker rapped delivery and pronunciation.
Upcoming StemGen and singing-to-accompaniment modules point toward more "studio-like" workflows: adjustable instrument stems and vocal-to-backing generation.

Topics

Mentioned

  • LoRA
  • CFG
  • VRAM
  • GPU
  • A100
  • RTX
  • M2