
Music Generation | Christine Payne | OpenAI Scholars Demo Day 2018

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Music generation can be framed as next-token prediction, but success depends on converting musical events into tokens that represent time, polyphony, and duration.

Briefing

Christine Payne’s demo centers on a practical bottleneck in neural music generation: turning music—where multiple notes can occur at once and notes can last for varying durations—into a token sequence that a language-model-style system can learn and generate reliably. She trained an LSTM-based music model on classical piano, then reused the same neural net for jazz, producing new pieces in each style. Framing music generation as a language modeling problem makes generation straightforward—predict the next “token,” feed it back, and continue—but the hard part is defining tokens that represent time, polyphony (multiple simultaneous notes), and note duration in a way the model can handle.
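
As a rough sketch of that generation loop, assuming a hypothetical `model` object whose `predict_next` method returns a probability for each token in the vocabulary (none of these names come from Payne's actual code):

```python
import random

def generate(model, prompt_tokens, length=500, temperature=1.0):
    """Autoregressive sampling: predict a token, append it, feed it back.

    `model.predict_next(tokens)` is a hypothetical stand-in for the LSTM;
    it is assumed to return a dict mapping each candidate token to its
    probability given the sequence so far.
    """
    tokens = list(prompt_tokens)
    for _ in range(length):
        probs = model.predict_next(tokens)
        # Temperature reshapes the distribution: >1 adds variety, <1 plays safe.
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        tokens.append(random.choices(list(probs.keys()), weights=weights)[0])
    return tokens
```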

Payne argues that common “one note at a time” approaches break down for general music because real compositions can involve any number of notes at any moment, with no fixed pitch range or consistent sampling rate. Her solution is to redefine what a “musical time step” means and to encode music into tokens using two alternative schemes. The first, “chord-wise encoding,” treats each time step as a multi-hot vector over piano keys: for each of the 88 keys, the encoding marks whether a note is sounding (0/1). This yields a vocabulary of possible note combinations; while the theoretical space is enormous (2^88), classical piano music is far more constrained, and she reports an effective vocabulary of around 55,000 combinations. The second, “note-wise encoding,” is closer to character-level modeling: it represents notes sequentially with a smaller vocabulary of note identities and wait tokens that advance time, and it naturally supports notes that last longer, an important capability for instruments like the violin.
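
To make the two schemes concrete, here is a toy comparison of how a C-major triad held for two time steps might be tokenized under each; the token spellings are invented for illustration and are not Payne's actual vocabulary:

```python
# Chord-wise: one token per time step, naming the full set of sounding keys
# (MIDI pitches 60, 64, 67 = C4, E4, G4).
chordwise = ["C60-64-67", "C60-64-67"]

# Note-wise: individual note events plus explicit wait tokens that advance time.
notewise = ["p60", "p64", "p67", "wait",           # strike three notes, step forward
            "hold60", "hold64", "hold67", "wait"]  # sustain them one more step
```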

To test whether these encodings produce musically meaningful outputs, Payne runs a simple human evaluation: pairs of songs are presented, one human-composed and one AI-composed, and participants guess which is which. Discrimination turns out to be low: people often score only around 2 or 3 correct out of the set of pairs, suggesting the generated music is difficult to distinguish from human work, even if it still struggles with deeper musical structure.

The most persistent limitation is long-term coherence. Early segments often sound good for the first 30 seconds to a minute, but the model loses the thread afterward, generating music that doesn’t maintain a larger-scale plan. Payne points to this as a shared challenge with language-model generation: producing not just locally plausible continuations, but longer-term structure. She also notes that chord-wise encoding can “memorize” training pieces—prompting with Mozart can yield extended continuations that resemble the source—yet it has trouble moving into genuinely new patterns beyond the training distribution.

In Q&A, she clarifies that her training data comes from classical archives of MIDI submissions spanning a broad set of famous composers (e.g., the Beethoven sonatas and the complete Chopin), and that violin-focused datasets were smaller, which limited experimentation. She also discusses future directions: incorporating music theory constraints (like thirds and scales), fine-tuning on mixtures of composers to produce quirky blends (e.g., Chopin plus jazz), and improving representations or modeling strategies so that themes repeat and then expand into longer ideas rather than drifting into disconnected material.

Cornell Notes

Christine Payne treats music generation as a language-model problem: once a model can predict the next token, generation becomes iterative. The key challenge is defining tokens that represent polyphony and time. She proposes two encodings: chord-wise encoding uses a multi-hot 88-key representation per time step (with an effective vocabulary around 55,000 note combinations in classical piano), while note-wise encoding models notes sequentially with a smaller vocabulary and better support for longer note durations. Human evaluation in a “guess human vs AI” task suggests participants struggle to tell AI from human music. The remaining gap is long-term structure—outputs often sound coherent only for the first 30–60 seconds, then drift.

Why does language-model-style generation require special music encodings?

Language modeling predicts the next item given a prompt, then feeds the prediction back to continue. For music, the system must first translate musical events into tokens. General music complicates this because multiple notes can occur simultaneously (polyphony), notes can last for different durations, and there’s no fixed number of notes or consistent pitch range. Payne’s work focuses on encoding time steps and note sets so an LSTM can learn patterns that correspond to musical structure rather than artifacts of sampling.
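
A minimal sketch of that translation step, assuming note events have already been parsed out of MIDI into (pitch, start, end) triples with times in seconds (the sampling rate and helper types here are assumptions, not details from the talk):

```python
from collections import namedtuple

# pitch is a MIDI note number; start/end are in seconds.
Note = namedtuple("Note", ["pitch", "start", "end"])

def to_timesteps(notes, steps_per_second=12):
    """Quantize note events onto a fixed grid of per-step key sets."""
    total_steps = int(max(n.end for n in notes) * steps_per_second) + 1
    grid = [set() for _ in range(total_steps)]
    for n in notes:
        first = int(n.start * steps_per_second)
        last = int(n.end * steps_per_second)
        for t in range(first, last + 1):
            grid[t].add(n.pitch - 21)  # piano key index 0-87 (A0 = MIDI 21)
    return grid

# Overlapping notes of different lengths: polyphony and duration both land on the grid.
print(to_timesteps([Note(60, 0.0, 1.0), Note(64, 0.5, 2.0)])[:8])
```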

How does chord-wise encoding represent simultaneous notes, and what vocabulary size does it imply?

Chord-wise encoding defines each time step as an 88-key multi-hot vector: each piano key is marked 1 if a note is played at that moment and 0 otherwise. In theory, that creates 2^88 possible combinations, but classical piano is more constrained. Payne reports an effective vocabulary of about 55,000 combinations across most classical music, making the learning problem more manageable than the raw combinatorics suggest.
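
A quick sketch of why that works in practice: the theoretical space is huge, but deduplicating the combinations actually observed in a corpus gives a modest vocabulary (`corpus_timesteps()` below is a hypothetical stand-in for the parsed training data):

```python
# Theoretical space of on/off patterns over 88 keys:
print(2 ** 88)  # 309485009821345068724781056

def build_vocab(timestep_iter):
    """Assign a token id to each distinct key combination seen in the data."""
    vocab = {}
    for keys in timestep_iter:
        combo = frozenset(keys)
        if combo not in vocab:
            vocab[combo] = len(vocab)
    return vocab

# vocab = build_vocab(corpus_timesteps())
# len(vocab) comes out around 55,000 on classical piano, per Payne's report.
```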

What problem does note-wise encoding address, and why is it useful for longer notes?

Note-wise encoding represents music more sequentially, akin to character-level prediction: it encodes note events one after another, interleaved with wait tokens that advance time. Because it is built around sequential note events, it can more easily represent notes that last longer, something Payne highlights as especially important when modeling instruments like the violin, where sustained tones matter.
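
A hedged sketch of that idea, reusing the per-step key sets from the quantization example above (the strike/hold/wait token names are illustrative, not Payne's published scheme):

```python
def encode_notewise(grid):
    """Flatten per-timestep key sets into a sequential token stream.

    A newly struck key emits 'p<key>', a key continuing from the previous
    step emits 'hold<key>', and 'wait' advances time by one step, so a
    sustained violin tone is just a longer run of hold tokens.
    """
    tokens = []
    previous = set()
    for keys in grid:
        for k in sorted(keys):
            tokens.append(f"hold{k}" if k in previous else f"p{k}")
        tokens.append("wait")
        previous = keys
    return tokens

# One key held for four steps:
print(encode_notewise([{39}, {39}, {39}, {39}]))
# ['p39', 'wait', 'hold39', 'wait', 'hold39', 'wait', 'hold39', 'wait']
```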

What does the human evaluation suggest about the quality of generated music?

Payne runs a simple task with pairs of songs: one human-composed piece and one AI-composed piece. Participants are asked to guess which is which. They distinguish the two poorly, with many landing around 2 or 3 correct out of the set, and even some self-identified professional musicians fail to reliably tell them apart. That outcome suggests the encodings can produce musically plausible outputs, even if deeper structure remains weak.

What limitation persists even when early generations sound good?

Long-term coherence. Payne observes that generated pieces often sound good for the first 30 seconds and sometimes up to a minute, but then the music loses continuity, with no long-term plan carrying it forward. In Q&A, she also says she wants models to capture theme development across repeated motifs (a short idea repeats, then expands into longer ideas), which doesn't happen consistently yet.

What trade-off appears between memorization and novelty in chord-wise encoding?

Chord-wise encoding can strongly reproduce training-set material. When prompted with Mozart, it can continue for 45 seconds to a minute in a way that closely resembles Mozart. But it struggles to move away from the training distribution into interesting new patterns, which Payne treats as a dead-end risk and a reason to explore note-wise encoding.

Review Questions

  1. How do chord-wise and note-wise encodings differ in how they represent time steps and polyphony?
  2. What evidence from the human evaluation supports the claim that the generated music is hard to distinguish from human work?
  3. Why does Payne consider long-term structure (beyond 30–60 seconds) the central remaining challenge?

Key Points

  1. Music generation can be framed as next-token prediction, but success depends on converting musical events into tokens that represent time, polyphony, and duration.

  2. Chord-wise encoding uses an 88-key multi-hot vector per time step, yielding a manageable effective vocabulary (about 55,000 combinations) for classical piano despite enormous theoretical possibilities.

  3. Note-wise encoding models notes sequentially with a smaller vocabulary and better support for representing longer-lasting notes, which matters for instruments like the violin.

  4. Human “human vs AI” guessing results suggest generated outputs can be musically convincing, since participants struggle to tell them apart reliably.

  5. The biggest remaining failure mode is long-term structure: generations often sound coherent only for the first 30–60 seconds before drifting.

  6. Chord-wise encoding can memorize training pieces (e.g., Mozart continuations) but has difficulty producing genuinely novel patterns beyond the training distribution.

  7. Future improvements include incorporating music-theory constraints and training strategies that blend composers (e.g., Chopin plus jazz), while improving long-range planning.

Highlights

Payne’s core move is redefining musical tokens: chord-wise multi-hot vectors for simultaneous notes versus note-wise sequential events for better duration handling.
A simple human evaluation—guessing which pieces are human-made—produced low discrimination, implying the encodings can generate convincing music.
Chord-wise encoding can extend Mozart-like continuations for 45 seconds to a minute, but it tends to cling to training-set patterns rather than inventing new ones.
Even when early segments sound right, the model often fails to maintain longer-term musical structure beyond roughly a minute.

Topics

  • Music Generation
  • Token Encoding
  • Chord-Wise Encoding
  • Note-Wise Encoding
  • Long-Term Structure

Mentioned

  • Christine Payne
  • LSTM