History of Large Language Models (LLMs) | From 1940 to 2023
Based on AI Researcher's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Large language models didn’t arrive fully formed; they emerged through a sequence of breakthroughs that shifted computing from hand-written language rules to statistical learning, then to neural networks that learn representations from data. The through-line from 1940 to 2023 is a steady replacement of brittle, explicit linguistic instructions with models that infer patterns—first probabilistically, later through deep learning and attention—unlocking the modern era of general-purpose text generation.
In the earliest phase, researchers laid the mathematical groundwork for neural computation. In 1943, Warren McCulloch and Walter Pitts published a study introducing a mathematical model of artificial neural networks, framing brain-like computation as networks of simple connected elements that together could perform powerful computation. By the 1950s and 1960s, language systems began to take shape as rule-based programs. A landmark example was the 1954 IBM experiment, which translated Russian sentences into English using fixed rules. A decade later, Joseph Weizenbaum's ELIZA (mid-1960s) simulated a therapist in conversation, demonstrating how structured scripts could produce convincing natural-language interaction.
The next major shift came in the 1980s and 1990s, when probabilistic and statistical methods gained traction. Karen Spärck Jones's work on inverse document frequency (IDF), a term-weighting idea she introduced in the early 1970s, became central to modern information retrieval and search ranking. During this period, machine learning approaches also entered mainstream NLP: hidden Markov models were used for tasks such as speech recognition, reflecting a broader move toward learning patterns from data rather than relying solely on manually designed rules.
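To make the IDF idea concrete, here is a minimal illustrative sketch (not from the video); the toy documents, the smoothing constant, and the logarithm base are all choices that vary across real retrieval systems:

```python
import math

def idf(term: str, documents: list[str]) -> float:
    """Inverse document frequency: terms that appear in few documents score higher."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    # Smoothed variant; exact formulas differ across IR systems.
    return math.log(n_docs / (1 + n_containing))

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]
print(idf("the", docs))    # appears in 2 of 3 docs -> 0.0, carries little signal
print(idf("stock", docs))  # appears in 1 of 3 docs -> ~0.41, more discriminative
```

Weighting query terms this way is what lets a search engine rank a document mentioning a rare, specific word above one that merely repeats common words.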
By the mid-2000s, deep learning and word embeddings changed what language models could represent. Word embeddings enabled systems to map words into continuous vector spaces, capturing semantic relationships that rules struggled to encode. After 2010, recurrent neural networks (RNNs) became a key tool for modeling context, since they are designed for sequential data and can carry information forward through time.
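For a sense of what "semantic relationships in a continuous vector space" means in practice, here is a toy sketch; the three-dimensional vectors below are invented for illustration and not taken from any trained model, which would use hundreds of learned dimensions:

```python
import math

# Toy "embeddings": nearby directions stand in for related meanings.
embeddings = {
    "king":  [0.9, 0.7, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 for similar directions, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["king"], embeddings["queen"]))  # ~0.98: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # ~0.30: unrelated words
```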
From 2015 onward, progress accelerated rapidly. In 2015, Google developed a neural machine translation system, and in 2017 the Transformer architecture arrived, built on attention mechanisms that improved how models handle long-range dependencies. That architecture set the stage for major releases: OpenAI's GPT-1 (2018) and Google's BERT (2018) both used Transformers but targeted different goals, GPT for generative tasks and BERT for language understanding. The momentum continued with GPT-2 (2019), NVIDIA's Megatron-LM (2019), and GPT-3 (2020). By 2023, GPT-4 was launched as a multimodal large language model, and the trajectory points toward smaller, more efficient models that could run at lower cost on low-power devices such as smartphones.
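For readers who want to see the mechanism itself, below is a minimal sketch of the scaled dot-product attention at the core of the Transformer, stripped of the multi-head projections, masking, and learned weights used in real models; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position compares its query against every key, then takes a
    softmax-weighted mix of the values. All pairs interact in one step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # weighted mix of values

# Three token positions with four-dimensional vectors (random, for illustration).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```

Because every position attends to every other position directly, distant tokens influence each other in a single step rather than being passed along token by token, which is the long-range-dependency advantage over RNNs noted above.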
Cornell Notes
From 1940 to 2023, large language models evolved from rule-based language processing to statistical learning and then to deep neural networks that learn representations from data. Early work by Warren McCulloch and Walter Pitts provided a mathematical basis for neural computation, while 1950s–1960s systems like IBM’s 1954 Russian-to-English experiment and ELIZA relied on explicit rules. In the 1980s–1990s, probabilistic methods gained ground, including Karen Sparck Jones’s inverse document frequency and hidden Markov models for speech recognition. Mid-2000s word embeddings and post-2010 recurrent neural networks improved semantic representation and context handling. The Transformer architecture and attention mechanisms drove the modern wave, enabling GPT and BERT-style systems and rapid scaling from GPT-1 through GPT-4.
- What early idea made neural-network language modeling possible?
- How did rule-based systems attempt to handle language before machine learning dominated?
- What did statistical NLP contribute that rules couldn't?
- Why did word embeddings and RNNs matter for modern language understanding?
- What changed with Transformers, and how did that lead to GPT and BERT?
- How did the scaling timeline progress from GPT-1 to GPT-4?
Review Questions
- Which milestone best represents the shift from explicit language rules to data-driven modeling, and what evidence from the timeline supports that?
- How do attention-based Transformers differ in purpose from RNN-based context modeling, based on the progression described?
- What distinguishes GPT-style generative training from BERT-style understanding training in the timeline’s framing?
Key Points
1. McCulloch and Pitts's 1943 neural-network mathematical model provided an early foundation for later neural approaches to language.
2. Rule-based NLP dominated early systems, including IBM's 1954 Russian-to-English experiment and ELIZA's mid-1960s scripted conversation.
3. Inverse document frequency (IDF) helped formalize statistical ranking and became central to search relevance.
4. Hidden Markov models represented a move toward learning patterns from data, including for speech recognition.
5. Word embeddings enabled semantic representation in continuous space, while RNNs improved context handling for sequential text.
6. Transformers introduced attention mechanisms that accelerated performance and enabled the GPT and BERT families of systems.
7. The GPT lineage progressed from GPT-1 (2018) to GPT-2 (2019), GPT-3 (2020), and GPT-4 (2023), alongside parallel efforts like NVIDIA's Megatron-LM.