This Shouldn’t Be Possible… Open Source AI Music (SUNO LEVEL)
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Open-source AI music generation can now run locally on a typical gaming PC—producing multi-minute songs with lyrics and instrumentation without relying on cloud APIs. The standout model highlighted is Heart Moola, an open-weight, multimodal LLM-based system for song generation that the creator claims performs near the top of the field on lyric clarity while remaining feasible to run on consumer hardware.
Heart Moola is presented as a practical alternative to closed, top-performing music generators such as Suno v4.5 and Suno v5. In side-by-side listening tests, the closed models are described as “all-encompassing” and strong at both lyrics and overall musical feel, while Heart Moola—despite being smaller (about 3B parameters)—is credited with delivering surprisingly comparable results, especially when the goal is lyric-focused output. The model’s architecture is broken down into components including a text tokenizer, an audio encoder, a codec tokenizer, and a local decoder that produces the generated music. The project is also positioned as “open-source weights and all,” with an Apache 2.0 license and a GitHub repository that includes code and a tutorial.
Beyond raw quality, the practical message is how to make it work offline. The walkthrough centers on using Google’s Antigravity tool to install and configure the Heart Moola repo quickly. System requirements are emphasized: an NVIDIA GPU with CUDA, roughly 10–12GB VRAM as a minimum, and 16GB+ recommended for smooth performance. CPU-only setups are described as possible but extremely slow. On Windows, VRAM can be checked via Task Manager. The creator also notes that some setups may require “lazy loading,” and that loading the model can take longer than generation itself.
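Since the walkthrough checks VRAM via Task Manager on Windows, a cross-platform alternative is to query `nvidia-smi` directly. The sketch below is an assumption-laden illustration, not part of the repo; the 10–12GB minimum and 16GB+ recommendation come from the transcript:

```python
import shutil
import subprocess

def query_vram_mib():
    """Return total VRAM of GPU 0 in MiB via nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip())

vram = query_vram_mib()
if vram is None:
    print("No NVIDIA GPU detected (nvidia-smi not found)")
elif vram < 10 * 1024:
    print(f"{vram} MiB VRAM: below the ~10-12GB minimum the video suggests")
elif vram < 16 * 1024:
    print(f"{vram} MiB VRAM: meets the minimum; 16GB+ recommended")
else:
    print(f"{vram} MiB VRAM: comfortable headroom for smooth generation")
```

On a machine without an NVIDIA driver the function simply returns `None`, which mirrors the video's point that CPU-only operation is technically possible but not practical.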
In live tests, the model generates songs within seconds once loaded, including attempts with missing lyrics. One experiment suggests the model can still produce vocal content even when the lyrics are intended to be blank, though the creator attributes the unexpected result to whether the lyrics file was actually saved and picked up by the run. Another key operational detail is that the system may unload checkpoints automatically, forcing a reload that adds time; the workaround is simply to rerun and continue.
Heart Moola is also described as multilingual, with demonstrated capability in Chinese as well as Japanese, Korean, and Spanish. The project roadmap includes improvements such as inference acceleration scripts and streaming inference for web applications, plus reference-audio conditioning for finer control over generated music. A larger 7B variant is also planned, expected to require more VRAM but to move closer to the quality of full-scale closed models.
Overall, the central takeaway is that open-weight music generation is no longer just a research curiosity. With the right GPU and a guided setup flow, users can generate lyric-driven songs locally, iterate on prompts and tags, and look forward to upcoming upgrades that could make the experience more controllable and production-ready—potentially becoming the “Stable Diffusion of AI music.”
Cornell Notes
Heart Moola is an open-weight, multimodal AI model for generating songs locally on a PC, including lyrics and musical output. The creator compares its results to closed top models like Suno v4.5 and Suno v5, arguing that Heart Moola’s smaller size (about 3B parameters) still delivers strong, lyric-focused quality—especially given that it can run offline. Setup is framed as practical: use an NVIDIA CUDA GPU with at least ~10–12GB VRAM (16GB+ recommended) and rely on Google Antigravity to install and configure the repo quickly. Once loaded, generation can take on the order of tens of seconds, though checkpoint unloading can add reload time. Planned upgrades include faster/streaming inference, reference-audio conditioning, and a larger 7B variant for higher fidelity and broader capability.
What makes Heart Moola different from closed music generators like Suno v4.5 or Suno v5?
What hardware and software conditions are needed to run it locally?
How does Antigravity fit into the installation process?
How are lyrics and musical style controlled during generation?
What operational issue can affect results or timing during repeated runs?
What upgrades are planned for Heart Moola?
Review Questions
- What trade-offs does the transcript suggest between Heart Moola’s smaller parameter count and the quality of closed, larger music models?
- Why might a user see longer delays between generations even if the model generates quickly once loaded?
- How do lyrics.ext and tags.ext work together to shape the output, and what mistake could cause “unexpected” vocal content?
Key Points
1. Heart Moola is an open-weight, multimodal song-generation model designed to run locally and offline, not just through APIs.
2. The transcript frames Heart Moola as competitive with closed models on lyric clarity, despite being smaller (about 3B parameters).
3. Local setup centers on an NVIDIA CUDA GPU with roughly 10–12GB VRAM minimum and 16GB+ recommended for stability and speed.
4. Google Antigravity is used to automate repo setup by reading the GitHub instructions and applying system-specific fixes (including PyTorch nightly in one case).
5. Generation can be fast once the model is loaded, but checkpoint unloading can force reloads that add time.
6. Lyrics and style control come from assets files: lyrics.ext for text and tags.ext for CSV-like style/instrument prompts.
7. Planned improvements include inference acceleration, streaming inference, reference-audio conditioning, and a larger 7B variant for higher fidelity.
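To make the two-file control scheme from point 6 concrete, the contents might look something like the sketch below. The file names follow the transcript; the section markers and tag vocabulary are illustrative assumptions, not confirmed conventions from the repo:

```text
# lyrics.ext — lyric text, optionally broken into sections (illustrative)
[verse]
Neon rivers running through the midnight code
[chorus]
Turn the signal up, let the static go

# tags.ext — CSV-like style/instrument prompt (illustrative)
pop, female vocals, electric guitar, upbeat
```

As the review questions note, if the lyrics file is edited but not saved before a run, the model may generate against stale or empty lyrics—one likely explanation for “unexpected” vocal content.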