History of Large Language Models (LLMs) | From 1940 to 2023
Based on AI Researcher's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Large language models didn’t arrive fully formed; they emerged through a sequence of breakthroughs that shifted computing from hand-written language rules to statistical learning, then to neural networks that learn representations from data. The through-line from 1940 to 2023 is a steady replacement of brittle, explicit linguistic instructions with models that infer patterns—first probabilistically, later through deep learning and attention—unlocking the modern era of general-purpose text generation.
In the earliest phase, researchers laid the mathematical groundwork for neural computation. In 1943, Warren McCulloch and Walter Pitts published a study introducing a mathematical model of artificial neural networks, framing brain-like computation as networks of simple connected elements that together could perform powerful computation. By the 1950s and 1960s, language systems began to take shape as rule-based programs. A landmark example was the 1954 IBM experiment, which translated Russian sentences into English using fixed rules. A decade later, Joseph Weizenbaum's ELIZA (mid-1960s) simulated a therapist in conversation, demonstrating how structured scripts could produce convincing natural-language interaction.
The next major shift came in the 1980s and 1990s, when probabilistic and statistical methods gained traction. Karen Spärck Jones's work on inverse document frequency (IDF), a term-weighting idea she introduced in the early 1970s, became central to modern information retrieval and search ranking. During this period, machine learning approaches also entered mainstream NLP: hidden Markov models were used for tasks such as speech recognition, reflecting a broader move toward learning patterns from data rather than relying solely on manually designed rules.
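To make the IDF idea concrete, here is a minimal illustrative sketch (not from the video); the toy documents, the smoothing constant, and the logarithm base are all choices that vary across real retrieval systems:

```python
import math

def idf(term: str, documents: list[str]) -> float:
    """Inverse document frequency: terms that appear in few documents score higher."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    # Smoothed variant; exact formulas differ across IR systems.
    return math.log(n_docs / (1 + n_containing))

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]
print(idf("the", docs))    # appears in 2 of 3 docs -> 0.0, carries little signal
print(idf("stock", docs))  # appears in 1 of 3 docs -> ~0.41, more discriminative
```

Weighting query terms this way is what lets a search engine rank a document mentioning a rare, specific word above one that merely repeats common words.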
By the mid-2000s, deep learning and word embeddings changed what language models could represent. Word embeddings enabled systems to map words into continuous vector spaces, capturing semantic relationships that rules struggled to encode. After 2010, recurrent neural networks (RNNs) became a key tool for modeling context, since they are designed for sequential data and can carry information forward through time.
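For a sense of what "semantic relationships in a continuous vector space" means in practice, here is a toy sketch; the three-dimensional vectors below are invented for illustration and not taken from any trained model, which would use hundreds of learned dimensions:

```python
import math

# Toy "embeddings": nearby directions stand in for related meanings.
embeddings = {
    "king":  [0.9, 0.7, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 for similar directions, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["king"], embeddings["queen"]))  # ~0.98: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # ~0.30: unrelated words
```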
From 2015 onward, progress accelerated rapidly. In 2015, Google developed a neural machine translation system, and in 2017 the Transformer architecture arrived, built on attention mechanisms that improved how models handle long-range dependencies. That architecture set the stage for major releases: OpenAI's GPT-1 (2018) and Google's BERT (2018) both used Transformers but targeted different goals, GPT for generative tasks and BERT for language understanding. The momentum continued with GPT-2 (2019), NVIDIA's Megatron-LM (2019), and GPT-3 (2020). By 2023, GPT-4 was launched as a multimodal large language model, and the trajectory points toward smaller, more efficient models that could run at lower cost on low-power devices such as smartphones.
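For readers who want to see the mechanism itself, below is a minimal sketch of the scaled dot-product attention at the core of the Transformer, stripped of the multi-head projections, masking, and learned weights used in real models; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position compares its query against every key, then takes a
    softmax-weighted mix of the values. All pairs interact in one step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # weighted mix of values

# Three token positions with four-dimensional vectors (random, for illustration).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```

Because every position attends to every other position directly, distant tokens influence each other in a single step rather than being passed along token by token, which is the long-range-dependency advantage over RNNs noted above.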
Cornell Notes
From 1940 to 2023, large language models evolved from rule-based language processing to statistical learning and then to deep neural networks that learn representations from data. Early work by Warren McCulloch and Walter Pitts provided a mathematical basis for neural computation, while 1950s–1960s systems like IBM’s 1954 Russian-to-English experiment and ELIZA relied on explicit rules. In the 1980s–1990s, probabilistic methods gained ground, including Karen Sparck Jones’s inverse document frequency and hidden Markov models for speech recognition. Mid-2000s word embeddings and post-2010 recurrent neural networks improved semantic representation and context handling. The Transformer architecture and attention mechanisms drove the modern wave, enabling GPT and BERT-style systems and rapid scaling from GPT-1 through GPT-4.
- What early idea made neural-network language modeling possible?
- How did rule-based systems attempt to handle language before machine learning dominated?
- What did statistical NLP contribute that rules couldn't?
- Why did word embeddings and RNNs matter for modern language understanding?
- What changed with Transformers, and how did that lead to GPT and BERT?
- How did the scaling timeline progress from GPT-1 to GPT-4?
Review Questions
- Which milestone best represents the shift from explicit language rules to data-driven modeling, and what evidence from the timeline supports that?
- How do attention-based Transformers differ in purpose from RNN-based context modeling, based on the progression described?
- What distinguishes GPT-style generative training from BERT-style understanding training in the timeline’s framing?
Key Points
1. McCulloch and Pitts's 1943 neural-network mathematical model provided an early foundation for later neural approaches to language.
2. Rule-based NLP dominated early systems, including IBM's 1954 Russian-to-English experiment and ELIZA's mid-1960s scripted conversation.
3. Inverse document frequency (IDF) helped formalize statistical ranking and became central to search relevance.
4. Hidden Markov models represented a move toward learning patterns from data, including for speech recognition.
5. Word embeddings enabled semantic representation in continuous space, while RNNs improved context handling for sequential text.
6. Transformers introduced attention mechanisms that accelerated performance and enabled the GPT and BERT families of systems.
7. The GPT lineage progressed from GPT-1 (2018) to GPT-2 (2019), GPT-3 (2020), and GPT-4 (2023), alongside parallel efforts like NVIDIA's Megatron-LM.