Music Generation | Christine Payne | OpenAI Scholars Demo Day 2018
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Music generation can be framed as next-token prediction, but success depends on converting musical events into tokens that represent time, polyphony, and duration.
Briefing
Christine Payne’s demo centers on a practical bottleneck in neural music generation: turning music—where multiple notes can occur at once and notes can last for varying durations—into a token sequence that a language-model-style system can learn and generate reliably. She trained an LSTM-based music model on classical piano, then reused the same neural net for jazz, producing new pieces in each style. Framing music generation as a language modeling problem makes generation straightforward—predict the next “token,” feed it back, and continue—but the hard part is defining tokens that represent time, polyphony (multiple simultaneous notes), and note duration in a way the model can handle.
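The generation loop itself is simple once tokens are defined. Below is a minimal sketch of that predict-and-feed-back loop; the `model` object and its `next_token_logits` method are hypothetical stand-ins for Payne's trained LSTM, whose exact interface the talk does not specify.

```python
import math
import random

def generate(model, prompt_tokens, num_steps, temperature=1.0):
    """Language-model-style music generation: predict the next token,
    append it, and repeat with the extended sequence as the new context."""
    tokens = list(prompt_tokens)
    for _ in range(num_steps):
        # Hypothetical call: score every vocabulary entry given the context.
        logits = model.next_token_logits(tokens)
        weights = [math.exp(logit / temperature) for logit in logits]
        next_token = random.choices(range(len(weights)), weights=weights)[0]
        tokens.append(next_token)  # feed the sample back in
    return tokens
```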
Payne argues that common "one note at a time" approaches break down for general music because real compositions can involve any number of notes at any moment, with no fixed pitch range or consistent sampling rate. Her solution is to redefine what a "musical time step" means and to encode music into tokens using two alternative schemes. The first, "chord-wise encoding," treats each time step as a multi-hot vector over piano keys: for each of the 88 keys, the encoding marks whether a note is sounding (0/1), and each distinct combination becomes one token. While the theoretical space is enormous (2^88 combinations), classical piano music is far more constrained, and she reports an effective vocabulary of around 55,000 combinations. The second, "note-wise encoding," is closer to character-level modeling: it represents notes sequentially with a much smaller vocabulary of note events and wait (time-advance) tokens, and it naturally supports notes that are held for a long time, an important capability for instruments like the violin. Both schemes are sketched below.
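The sketch below makes the two encodings concrete. It assumes the score has already been rasterized into a piano roll (a list of time steps, each an 88-element 0/1 list over the keys); the token spellings ("p39", "endp39", "wait") are one plausible convention in the spirit of her note-wise scheme, not necessarily the exact strings she used.

```python
def chordwise_tokens(piano_roll):
    """Chord-wise: one token per time step, the 88-key multi-hot vector
    spelled as a bitstring. 2**88 combinations are possible in theory,
    but classical piano reportedly uses roughly 55,000 in practice."""
    return ["".join(str(bit) for bit in step) for step in piano_roll]

def notewise_tokens(piano_roll):
    """Note-wise: emit "p<key>" when a key turns on, "endp<key>" when it
    turns off, and "wait" to advance time, so a note held across many
    steps stays a single on/off pair rather than repeated tokens."""
    tokens, prev = [], [0] * 88
    for step in piano_roll:
        for key in range(88):
            if step[key] and not prev[key]:
                tokens.append(f"p{key}")      # note starts sounding
            elif prev[key] and not step[key]:
                tokens.append(f"endp{key}")   # note stops sounding
        tokens.append("wait")                 # advance one time step
        prev = step
    return tokens

# Two time steps: key 39 held throughout, key 43 added on the second step.
roll = [[0] * 88, [0] * 88]
roll[0][39] = 1
roll[1][39] = roll[1][43] = 1
assert notewise_tokens(roll) == ["p39", "wait", "p43", "wait"]
```

The trade-off is visible here: chord-wise has a huge token space but only one token per step, while note-wise has a tiny vocabulary but longer sequences, and only note-wise keeps a held note as a single event.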
To test whether these encodings produce musically meaningful output, Payne runs a simple human evaluation: pairs of songs are presented, one human-composed and one AI-composed, and participants guess which is which. Scores are modest, typically around 2 out of 3, not far above the 1.5 expected from random guessing on so few pairs, suggesting the generated music is hard to distinguish from human work even if it still struggles with deeper musical structure.
The most persistent limitation is long-term coherence. Generated pieces often sound good for the first 30 seconds to a minute, but the model loses the thread afterward, producing music with no larger-scale plan. Payne points to this as a challenge shared with language-model text generation: producing not just locally plausible continuations but longer-term structure. She also notes that chord-wise encoding can "memorize" training pieces (prompting with Mozart can yield extended continuations that closely resemble the source) yet has trouble moving into genuinely new patterns beyond the training distribution.
In Q&A, she clarifies that her training data comes from classical archives of MIDI submissions spanning a broad set of famous composers (e.g., Beethoven sonatas and the complete Chopin), and that violin-focused datasets were smaller, which limited experimentation. She also discusses future directions: incorporating music-theory constraints (such as thirds and scales), fine-tuning on mixtures of composers to produce quirky blends (e.g., Chopin plus jazz), and improving representations or modeling strategies so that themes return and evolve across repetitions rather than either looping verbatim or wandering into disconnected ideas.
Cornell Notes
Christine Payne treats music generation as a language-model problem: once a model can predict the next token, generation becomes iterative. The key challenge is defining tokens that represent polyphony and time. She proposes two encodings: chord-wise encoding uses a multi-hot 88-key representation per time step (with an effective vocabulary around 55,000 note combinations in classical piano), while note-wise encoding models notes sequentially with a smaller vocabulary and better support for longer note durations. Human evaluation in a “guess human vs AI” task suggests participants struggle to tell AI from human music. The remaining gap is long-term structure—outputs often sound coherent only for the first 30–60 seconds, then drift.
- Why does language-model-style generation require special music encodings?
- How does chord-wise encoding represent simultaneous notes, and what vocabulary size does it imply?
- What problem does note-wise encoding address, and why is it useful for longer notes?
- What does the human evaluation suggest about the quality of generated music?
- What limitation persists even when early generations sound good?
- What trade-off appears between memorization and novelty in chord-wise encoding?
Review Questions
- How do chord-wise and note-wise encodings differ in how they represent time steps and polyphony?
- What evidence from the human evaluation supports the claim that the generated music is hard to distinguish from human work?
- Why does Payne consider long-term structure (beyond 30–60 seconds) the central remaining challenge?
Key Points
1. Music generation can be framed as next-token prediction, but success depends on converting musical events into tokens that represent time, polyphony, and duration.
2. Chord-wise encoding uses an 88-key multi-hot vector per time step, yielding a manageable effective vocabulary (about 55,000 combinations) for classical piano despite the enormous theoretical space.
3. Note-wise encoding models notes sequentially with a smaller vocabulary and better support for long-held notes, which matters for instruments like the violin.
4. Human-vs-AI guessing results suggest the generated outputs can be musically convincing: participants struggle to tell them apart reliably.
5. The biggest remaining failure mode is long-term structure: generations often sound coherent only for the first 30–60 seconds before drifting.
6. Chord-wise encoding can memorize training pieces (e.g., Mozart continuations) but has difficulty producing genuinely novel patterns beyond the training distribution.
7. Future improvements include incorporating music-theory constraints and training strategies that blend composers (e.g., Chopin plus jazz) while improving long-range planning.