
This Shouldn’t Be Possible… Open Source AI Music (SUNO LEVEL)

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Heart Moola is an open-weight, multimodal song-generation model designed to run locally and offline, not just through APIs.

Briefing

Open-source AI music generation can now run locally on a typical gaming PC—producing multi-minute songs with lyrics and instrumentation without relying on cloud APIs. The standout model highlighted is Heart Moola, an open-weight, multimodal LLM-based system for song generation that the creator claims performs near the top of the field on lyric clarity while remaining feasible to run on consumer hardware.

Heart Moola is presented as a practical alternative to closed, top-performing music generators such as Suno V4.5 and Suno V5. In side-by-side listening tests, the closed models are described as “all-encompassing” and strong at both lyrics and overall musical feel, while Heart Moola, despite being smaller (about 3B parameters), is credited with delivering surprisingly comparable results, especially when the goal is lyric-focused output. The model’s architecture breaks down into a text tokenizer, an audio encoder, a codec tokenizer, and a local decoder that produces the generated music. The project is also positioned as “open-source, weights and all,” with an Apache 2.0 license and a GitHub repository that includes code and a tutorial.

Beyond raw quality, the practical message is how to make it work offline. The walkthrough centers on using Google’s Antigravity tool to install and configure the Heart Moola repo quickly. System requirements are emphasized: an NVIDIA GPU with CUDA, roughly 10–12GB of VRAM as a minimum, and 16GB+ recommended for smooth performance. CPU-only setups are described as possible but extremely slow. On Windows, VRAM can be checked via Task Manager under the GPU section. The creator also notes that some setups may require “lazy loading,” and that model loading can take longer than generation itself.
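Outside of Task Manager, the same VRAM thresholds can be checked from a terminal. A minimal sketch, assuming `nvidia-smi` is installed with the NVIDIA driver; the 10GB/16GB cutoffs come from the video:

```shell
# VRAM thresholds mentioned in the walkthrough (in MiB)
MIN_MIB=10240   # ~10 GB minimum
REC_MIB=16384   # 16 GB+ recommended

# Classify a VRAM amount (in MiB) against those thresholds
vram_check() {
  if [ "$1" -ge "$REC_MIB" ]; then
    echo "ok: meets recommended"
  elif [ "$1" -ge "$MIN_MIB" ]; then
    echo "ok: meets minimum"
  else
    echo "insufficient VRAM"
  fi
}

# On a machine with the NVIDIA driver, query the first GPU's total VRAM:
#   total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
#   vram_check "$total"
vram_check 16384   # example with a 16GB card
```

The query flags are standard `nvidia-smi` options; the helper just encodes the video's two thresholds so the result reads at a glance.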

In live tests, the model generates songs within tens of seconds once loaded, including attempts with the lyrics left blank. One experiment suggests the model can still produce vocal content even when lyrics are meant to be empty, though the creator attributes the unexpected result to whether the lyrics file was actually saved and used. Another key operational detail is that the system may unload checkpoints automatically, forcing a reload that adds time; the workaround is simply to rerun and continue.

Heart Moola is also described as multilingual, with demonstrated capability in Chinese as well as Japanese, Korean, and Spanish. The project roadmap includes improvements such as inference acceleration scripts and streaming inference for web applications, plus reference-audio conditioning for finer control over generated music. A larger 7B variant is also planned, expected to require more VRAM but to move closer to the quality of full-scale closed models.

Overall, the central takeaway is that open-weight music generation is no longer just a research curiosity. With the right GPU and a guided setup flow, users can generate lyric-driven songs locally, iterate on prompts and tags, and look forward to upcoming upgrades that could make the experience more controllable and production-ready—potentially becoming the “stable diffusion of AI music.”

Cornell Notes

Heart Moola is an open-weight, multimodal AI model for generating songs locally on a PC, including lyrics and musical output. The creator compares its results to closed top models like Suno V4.5 and Suno V5, arguing that Heart Moola’s smaller size (about 3B parameters) still delivers strong, lyric-focused quality, especially given that it can run offline. Setup is framed as practical: use an NVIDIA CUDA GPU with at least ~10–12GB VRAM (16GB+ recommended) and rely on Google Antigravity to install and configure the repo quickly. Once loaded, generation can take on the order of tens of seconds, though checkpoint unloading can add reload time. Planned upgrades include faster/streaming inference, reference-audio conditioning, and a larger 7B variant for higher fidelity and broader capability.

What makes Heart Moola different from closed music generators like Suno V4.5 or Suno V5?

Heart Moola is presented as open-source with open weights (Apache 2.0 license) and code available on GitHub, enabling local/offline generation. The creator frames it as smaller (about 3B parameters) and therefore potentially different in quality, but still competitive, particularly for lyric clarity, while the closed models are described as broader, “all-encompassing” systems.

What hardware and software conditions are needed to run it locally?

The walkthrough emphasizes an NVIDIA GPU with CUDA support. VRAM guidance is roughly 10–12GB minimum, with 16GB+ recommended for smooth performance. CPU-only operation is possible but “probably going to take forever.” On Windows, VRAM can be checked in Task Manager under the GPU section. The creator also mentions that some setups may need lazy loading and that model loading can be slower than generation.

How does Antigravity fit into the installation process?

Antigravity is used as a guided installer/configurator. The user downloads the Heart Moola GitHub zip, points Antigravity to the folder/zip, and asks it to set up the repo for local testing. The creator recommends fast mode with “Gemini 3 Flash” and notes that Antigravity can apply environment-specific fixes (for example, a custom PyTorch nightly build and a Windows MP3 encoding workaround).
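The agent-driven setup roughly corresponds to a handful of manual steps. A dry-run sketch only: the archive name, requirements file, entry-point script, and CUDA wheel tag below are all assumptions, not details confirmed by the video:

```shell
# Dry-run sketch of what Antigravity automates; each step is printed
# rather than executed, since filenames here are hypothetical.
run() { echo "+ $*"; }

run unzip HeartMoola-main.zip -d heartmoola     # the GitHub zip download (name assumed)
run python -m venv heartmoola/.venv             # isolated environment
run pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126  # nightly fix; match your CUDA version
run pip install -r heartmoola/requirements.txt  # assumed requirements file
run python heartmoola/generate.py               # assumed entry-point script
```

Swapping `run` for direct execution turns the sketch into real commands; the point is that the tool is doing ordinary clone/venv/pip work, plus the environment-specific patches the creator mentions.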

How are lyrics and musical style controlled during generation?

The model uses files in the Heart Moola assets folder: lyrics.ext for lyric text and tags.ext for style/instrument tags. The creator shows tags like “piano” and mood/tempo descriptors (e.g., “happy,” “slow,” “romantic”) in CSV-like rows. The workflow involves pasting new lyrics/tags, saving them, and then generating output.
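The paste-save-generate loop can be scripted. A sketch using the file names exactly as the summary spells them (`lyrics.ext`, `tags.ext`); the real extension and assets path may differ, so verify against the repo:

```shell
# Write example lyrics and a CSV-like tags row into the assets folder
# (paths/file names follow the summary; the lyric text is a made-up example).
mkdir -p assets

cat > assets/lyrics.ext <<'EOF'
[verse]
Running every note offline tonight
EOF

# Mood/tempo/instrument tags in one CSV-like row, as shown in the video
printf 'piano,happy,slow,romantic\n' > assets/tags.ext

# After saving both files, rerun the generation step so it picks them up.
```

This also guards against the failure mode described later: if the lyrics file is not actually saved before rerunning, the model may sing stale or unexpected content.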

What operational issue can affect results or timing during repeated runs?

The creator reports that checkpoints can unload automatically, which makes subsequent generations slower because the model must reload into GPU memory. This can also lead to confusion if lyrics weren’t saved correctly before rerunning—one test where lyrics were expected to be absent produced vocal output, later attributed to the lyrics file not being saved/used.

What upgrades are planned for Heart Moola?

The roadmap includes releasing scripts for inference acceleration and streaming inference (useful for websites), adding reference audio conditioning for more fine-grained controllable music generation, and “hot song generation” (as described in the transcript). A larger 7B variant is also planned, expected to require more VRAM but to better compete with full-scale models.

Review Questions

  1. What trade-offs does the transcript suggest between Heart Moola’s smaller parameter count and the quality of closed, larger music models?
  2. Why might a user see longer delays between generations even if the model generates quickly once loaded?
  3. How do lyrics.ext and tags.ext work together to shape the output, and what mistake could cause “unexpected” vocal content?

Key Points

  1. Heart Moola is an open-weight, multimodal song-generation model designed to run locally and offline, not just through APIs.

  2. The transcript frames Heart Moola as competitive with closed models on lyric clarity, despite being smaller (about 3B parameters).

  3. Local setup centers on an NVIDIA CUDA GPU with roughly 10–12GB VRAM minimum and 16GB+ recommended for stability and speed.

  4. Google Antigravity is used to automate repo setup by reading the GitHub instructions and applying system-specific fixes (including a PyTorch nightly build in one case).

  5. Generation can be fast once the model is loaded, but checkpoint unloading can force reloads that add time.

  6. Lyrics and style control come from assets files: lyrics.ext for text and tags.ext for CSV-like style/instrument prompts.

  7. Planned improvements include inference acceleration, streaming inference, reference-audio conditioning, and a larger 7B variant for higher fidelity.

Highlights

Heart Moola is positioned as open-weight music generation that can run entirely locally, producing multi-minute songs with lyrics.
The workflow emphasizes practical hardware thresholds: NVIDIA CUDA GPUs with ~10–12GB VRAM minimum and 16GB+ recommended.
Once loaded, the model can generate quickly (tens of seconds), but checkpoint unloading can slow repeated runs.
Planned features—streaming inference and reference-audio conditioning—aim to make controllable, web-friendly AI music more realistic.
A larger 7B variant is expected to raise quality closer to top closed models, at the cost of higher VRAM needs.

Topics

  • Open-Source AI Music
  • Heart Moola
  • Local Inference
  • GPU VRAM Requirements
  • Antigravity Setup

Mentioned