
Seventy years of AI swing like a pendulum between hand-coded logic and learning from data, and every winter for one camp was the opening for the other. This episode traces that arc from myth and the Dartmouth workshop, through two funding collapses, to the transformer and today's agents.
This is Phase 0, Episode 2 of the OCDevel AI course: a narrative history of artificial intelligence, built around two throughlines. The first is a pendulum between symbolic, logic-based AI and connectionist, learn-from-data AI, where each era's failure seeded the next era's breakthrough. The second is the hype cycle that produced two named "AI winters."
We start in prehistory: the golem and Talos myths, Hobbes calling reasoning "reckoning," Leibniz's dream of settling arguments by calculation, and Boole's Laws of Thought turning logic into algebra. Foundations follow: Turing's machine (1936) and his imitation game (1950), and the McCulloch-Pitts neuron (1943).
The field is named at the Dartmouth workshop (1956). The symbolic golden years bring the Logic Theorist, the Physical Symbol System Hypothesis, ELIZA, SHRDLU, and Rosenblatt's perceptron. Then the first winter: Minsky and Papert's Perceptrons and the Lighthill report. The expert-systems boom (DENDRAL, MYCIN, XCON, Japan's Fifth Generation) ends in a second winter.
The connectionist revival arrives with backpropagation (1986), then SVMs and Deep Blue (1997). Deep learning ignites with ImageNet, AlexNet (2012), and AlphaGo (2016). Finally the transformer (2017), GPT-3, foundation models, ChatGPT, and today's agents. Closing lesson: Sutton's Bitter Lesson.
News brief: Microsoft's seven MAI models at Build, the Mayo Clinic healthcare model, NVIDIA's Cosmos 3, Anthropic's IPO filing, a Trump AI executive order, and xAI's Grok updates.
Let's start with the news. The big story this week came out of Microsoft's Build conference on June second, where Microsoft AI announced a family of seven homegrown models. The pitch was the "agent-first" era, and the strategy was clear: lean less on OpenAI, and lower costs for developers.
The seven cover reasoning, code, images, transcription, and voice. The flagship is MAI-Thinking-1, a reasoning model with roughly thirty-five billion active parameters and a two hundred fifty-six thousand token context window. Microsoft says it was trained from scratch with zero distillation on clean, commercially licensed enterprise data. Reportedly, blind raters prefer it to Sonnet four-point-six, and it matches Opus four-point-six on coding. Microsoft claims ninety-seven percent on a 2025 math contest and fifty-three percent on a coding benchmark. Treat all of those as vendor-reported for now.
There's also MAI-Code-1-Flash, a tiny five-billion-parameter coding model already in GitHub Copilot and VS Code; MAI-Image-2.5, which reportedly tops some image leaderboards; MAI-Transcribe-1.5, claimed to be state of the art and five times faster across forty-three languages; and MAI-Voice-2, which adapts to a voice from a short sample. These aren't Azure-only. They're on OpenRouter, Fireworks, and Baseten, with tunable weights. Microsoft also pitched Frontier Tuning, reinforcement learning inside a customer's compliance boundary, so agents learn a business's workflows without data leaving the building.
Also at Build, Microsoft and Mayo Clinic announced a healthcare frontier model for clinical reasoning, pairing Mayo's de-identified data with Microsoft's cloud. Mayo will own the model; Microsoft plans to distribute it through Azure.
Over at NVIDIA, on June first, they launched Cosmos 3, an open world foundation model for physical AI. It combines vision reasoning, world generation, and action prediction in one system, aimed at generating synthetic data and training robots and self-driving cars, cutting cycles from months to days.
On the business side, Anthropic confidentially filed for an initial public offering on June first at a valuation near nine hundred sixty-five billion dollars, reportedly eclipsing OpenAI for the first time, with a revenue run-rate around forty-seven billion. Some outlets are now floating bubble comparisons.
In Washington, President Trump signed an AI executive order letting developers voluntarily give the federal government early access to frontier models for up to thirty days before release. It explicitly avoids mandatory licensing.
And from xAI, Grok got Composer two-point-five for long-running tasks, a new image-to-video preview model, and a batch of web connectors. For context, current flagships include Claude Opus four-point-eight, GPT five-point-five, Grok four-point-three, and Gemini three-point-one Pro. Now, on to our history.
Welcome back. This is the second episode in our Phase Zero history of artificial intelligence, and today we're going to walk the whole arc, from ancient myths about artificial minds all the way to the tool-using agents on the frontier right now. It's a long story, about seventy years of serious research and a couple thousand years of dreaming before that, but I promise it hangs together. So before we dive into dates and names, let me give you the two ideas to hold onto, because they're the throughlines that make the whole history make sense.
The first idea is the pendulum. AI has swung, over and over, between two fundamentally different bets about what intelligence is. On one side you have the symbolic camp, sometimes called logic-based or knowledge-based AI. Their bet is that thinking is manipulating symbols according to rules, that if you can write down enough facts and enough logic, reasoning falls out. On the other side you have the connectionist or statistical camp. Their bet is that intelligence emerges from learning patterns out of data, the way a brain learns from experience, rather than being hand-coded by an engineer. And here's the key: each swing of the pendulum is a reaction to the previous era's failure. Today's whole stack, neural networks trained by gradient descent, transformers, massive scale, that's the connectionist pole, and right now it's dominant. But the symbolic ideas keep coming back, and we'll see them returning at the very end of this episode in the form of agents.
The second idea is the hype cycle, and it's brutal. The pattern goes like this: researchers overpromise, funding inflates on the hype, the field fails to deliver on those promises, the money collapses, and we get what's called an "AI winter." Then, quietly, underground, progress continues, somebody has a breakthrough, and the new hype begins. We've had two named winters, one in the mid-nineteen seventies and one in the late nineteen eighties into the early nineteen nineties. If you remember nothing else, remember that AI has always been a story about the gap between what was promised and what the era's computers and data could actually deliver.
There's a third thread woven through all of it, and it's worth naming now because it'll keep recurring: the repeated discovery that general methods which leverage raw computation tend to outrun cleverly hand-engineered human knowledge. That insight eventually got crystallized into something called the Bitter Lesson, and we'll close on it. But keep it in your back pocket the whole way through.
So let's go all the way back, before computers existed at all.
The dream of building an artificial mind is ancient. The Greeks told stories of Talos, a giant bronze automaton that guarded the island of Crete, marching its perimeter. There's the Jewish legend of the golem of Prague, a figure of clay animated by sacred words inscribed on it. There's Pygmalion's statue Galatea, brought to life. The point is that the idea of manufacturing a mind predates the computer by millennia. It's one of humanity's oldest fantasies.
Then philosophy started to make the idea precise. In sixteen fifty-one, Thomas Hobbes wrote in Leviathan that reasoning is nothing but reckoning, that is, thinking is just computation, adding and subtracting symbols. For that line, Hobbes sometimes gets called the grandfather of AI. Rene Descartes pushed a mechanistic view of the body, treating an animal as a kind of automaton, though he carved out the human mind as something separate, a dualism. And then there's Gottfried Leibniz, in the late sixteen hundreds, who had a dream that sounds startlingly modern. He imagined a universal symbolic language and a calculus of reasoning, a formal system so complete that when two people disagreed, they could simply say, "Let us calculate," and turn the crank to find the answer. He even built a mechanical calculator, the Stepped Reckoner. Leibniz was reaching for symbolic AI three hundred years early.
The next step was turning logic itself into something mechanical. In eighteen fifty-four, George Boole published The Laws of Thought, and gave us Boolean algebra, which reduces logic to algebra over true and false, over one and zero. That is, quite literally, the foundation of every digital circuit running today. Then Gottlob Frege, in eighteen seventy-nine, built modern predicate logic, a far more expressive system. And in the early twentieth century, the mathematician David Hilbert posed a sweeping challenge: could all of mathematics be mechanized? Could you build a procedure that decides, for any mathematical statement, whether it's provable? That challenge, posed formally in nineteen twenty-eight, was called the decision problem, and it's exactly the question that provoked the next character in our story. Along the way, Kurt Godel dropped his incompleteness theorems in nineteen thirty-one, showing that any consistent formal system rich enough to express arithmetic contains true statements it cannot prove, which already hinted that the dream of total mechanization had limits.
Which brings us to the foundations, the nineteen thirties through the nineteen fifties, when the theory of computation was actually built.
In nineteen thirty-six, Alan Turing wrote a paper called "On Computable Numbers." In it he defined what we now call the Turing machine, an abstract device that formalized, for the first time, exactly what we mean by "computation" and "algorithm." And he used it to answer Hilbert's decision problem, in the negative. There are well-defined questions, like the famous halting problem, that no algorithm can ever decide. So the very man who gave us the theoretical universal computer also drew its hard boundary.
Fourteen years later, in nineteen fifty, Turing wrote a very different kind of paper, "Computing Machinery and Intelligence," in the journal Mind. He asked, "Can machines think?" and then, deciding that question was too slippery, he replaced it with a game. The imitation game, which we now call the Turing test. The setup is simple: if a human interrogator, typing back and forth through text, can't reliably tell whether they're talking to a machine or a person, then for practical purposes the machine has passed. Turing predicted machines would clear that bar by around the year two thousand. He was a little optimistic on the timeline, but the operational framing was brilliant and we still argue about it today.
Now, at the same time, a second tradition was forming, and this is where the pendulum's other pole is born. In nineteen forty-three, two researchers, Warren McCulloch and Walter Pitts, published a paper on what they called a logical calculus of nervous activity. They built the first mathematical model of an artificial neuron, a simple binary threshold unit: it sums its inputs, and if the sum crosses a threshold, it fires. They showed that networks of these units could compute logical functions. That little threshold neuron is the direct ancestor of every neural network we use today. Then, in nineteen forty-nine, Donald Hebb gave us the first concrete idea of how such things might learn, with a rule often summarized as "neurons that fire together wire together." Connections that get used together get strengthened.
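To make that threshold unit concrete, here is a minimal Python sketch. The function names, the weights, and the thresholds are chosen by hand for illustration and are not taken from the 1943 paper; the negative weight used for NOT is a modern simplification of what McCulloch and Pitts described as an inhibitory input, and "crosses the threshold" is read here as "greater than or equal to."

```python
def mp_neuron(inputs, weights, threshold):
    """Binary threshold unit: output 1 if the weighted sum reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Hand-picked weights and thresholds (illustrative assumptions).
def AND(a, b):
    return mp_neuron([a, b], [1, 1], threshold=2)  # fires only when both inputs are on

def OR(a, b):
    return mp_neuron([a, b], [1, 1], threshold=1)  # fires when at least one input is on

def NOT(a):
    return mp_neuron([a], [-1], threshold=0)       # a negative weight acts as inhibition

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```

Nothing here learns; the weights are wired in by the programmer. Learning is the piece that Hebb's rule, and later the perceptron, set out to supply.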
Around the same time, Norbert Wiener published Cybernetics in nineteen forty-eight, studying feedback, control, and communication in both animals and machines, and that same year Claude Shannon founded information theory, giving us the bit and the concept of entropy. Shannon also wrote early on chess-playing programs, so he had a foot in both worlds.
And notice what's happened by the early nineteen fifties. Two camps are already in place. On one side, the symbolic, logical tradition, Turing machines and formal logic. On the other, the connectionist and cybernetic tradition, neurons and feedback loops. That tension is baked in at the very birth of the field, and the pendulum starts swinging from here.
Now to the official birth of AI: Dartmouth, nineteen fifty-six.
In the summer of nineteen fifty-five, four researchers wrote a proposal for a summer research project to be held the following year at Dartmouth College. The four were John McCarthy of Dartmouth, Marvin Minsky of Harvard, Nathaniel Rochester of IBM, and Claude Shannon of Bell Labs. And it was in this proposal that John McCarthy coined the term "artificial intelligence," partly, the story goes, to distinguish the new field from cybernetics and from Norbert Wiener's shadow.
The central conjecture of that proposal is worth saying slowly, because it set the tone for decades. They wrote that every aspect of learning, or any other feature of intelligence, can in principle be so precisely described that a machine can be made to simulate it. That's an enormous claim. And they proposed to make real progress on it with a two-month study by about ten people in a single summer. That breathtaking optimism, the idea that you could crack thinking in a summer, set the template for the chronic overpromising that would define the field.
The attendees read like a founding pantheon. Beyond the four organizers, you had Herbert Simon and Allen Newell from Carnegie and the RAND Corporation. You had Arthur Samuel of IBM, who built a checkers-playing program and who coined the term "machine learning" in nineteen fifty-nine. You had Ray Solomonoff and Oliver Selfridge. These names will echo through the next twenty years.
And they came with results. Newell, Simon, and Cliff Shaw had built a program called the Logic Theorist, often called the first AI program. It proved thirty-eight of the first fifty-two theorems in chapter two of Whitehead and Russell's Principia Mathematica, and for one of them it found a proof more elegant than the one in the original book. Then Newell and Simon followed up with the General Problem Solver, a system meant to be a universal reasoner using a technique called means-ends analysis, repeatedly asking, "What's the difference between where I am and where I want to be, and what reduces that difference?" It worked on toy problems. It did not scale. Hold that phrase, "it did not scale," because it's going to come back again and again.
Out of Dartmouth, the institutions formed. McCarthy went on to found the Stanford AI Lab; Minsky co-founded the MIT AI Lab. And McCarthy invented the programming language LISP in nineteen fifty-eight, which became the lingua franca of symbolic AI for a generation.
So now we're in the golden years, roughly nineteen fifty-six to nineteen seventy-four, and this era belongs to symbolic AI.
There's a nickname for this paradigm: GOFAI, which stands for Good Old-Fashioned AI, a term coined later by the philosopher John Haugeland in nineteen eighty-five. The core idea is clean and confident: intelligence is the manipulation of symbols according to rules, and reasoning is search through a space of possibilities. Newell and Simon eventually formalized their faith into something called the Physical Symbol System Hypothesis, stated in their nineteen seventy-five Turing Award lecture and published the following year, which claims that a physical symbol system has the necessary and sufficient means for general intelligent action. In other words, symbols are not just one path to intelligence, they're the whole story. That's the high-water mark of the symbolic worldview.
The engine of this era was search. You represent a problem as states, you lay out the moves between them as a tree, and you search through that tree for a solution, using heuristics, rules of thumb, to avoid checking everything. This is the era that gave us the A-star search algorithm, out of the Stanford Research Institute in nineteen sixty-eight. But search had a fatal weakness lurking inside it: the branching factor. Every move multiplies the number of possibilities, so the search tree grows exponentially. On toy problems that's fine. On real problems, the tree explodes faster than any computer can handle. Again, did not scale.
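The arithmetic behind that explosion is easy to check. In this small sketch, the function name and the branching factor of ten are illustrative assumptions, not figures for any real game or problem.

```python
def leaf_count(branching_factor, depth):
    """Number of positions at the bottom of a full search tree of the given depth."""
    return branching_factor ** depth

# A branching factor of 10 is an illustrative assumption.
for depth in (2, 5, 10, 20):
    print(f"depth {depth:>2}: {leaf_count(10, depth):,} positions")
```

Each extra level of lookahead multiplies the work by the branching factor, which is why heuristics that prune the tree mattered so much, and why they were still not enough on real problems.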
Now, while the symbolic camp was riding high, the connectionist pole had its own brief, brilliant moment. In nineteen fifty-seven and fifty-eight, Frank Rosenblatt, at the Cornell Aeronautical Laboratory, built the Perceptron. This was a trainable single-layer neural network for recognizing patterns, and crucially, Rosenblatt had a mathematical proof that it would converge, that it would learn correctly, as long as the data was what's called linearly separable, meaning you could split the categories with a straight line. The Perceptron wasn't just software; the Mark One Perceptron was physical hardware, demonstrated on June twenty-third, nineteen sixty, wired to a camera with a twenty-by-twenty grid of sensors, its weights literally set by motor-driven dials. And the hype was something else. The New York Times reported that the Navy expected a machine that would eventually be able to walk, talk, see, write, and reproduce itself. That is peak overpromise, and remember it, because the bill is about to come due.
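Here is a minimal sketch of a perceptron-style learning rule in Python, trained on logical AND, which is linearly separable. The function names, the learning rate, the zero starting weights, and the epoch cap are all assumptions for illustration; this is the textbook form of the rule, not a reconstruction of the Mark One hardware.

```python
def predict(weights, bias, x):
    """Fire (1) if the weighted sum plus bias is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                # Nudge the weights toward the correct answer for this sample.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # a full pass with no errors: training has converged
            break
    return weights, bias

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
for x, target in AND_DATA:
    print(x, "target:", target, "predicted:", predict(weights, bias, x))
```

Because a single line can separate the one "true" case of AND from the three "false" cases, the loop stops after a handful of passes with every example classified correctly.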
This era also produced some unforgettable demos in what were called microworlds, tiny closed domains where the hard problems could be sidestepped. There was SHRDLU, built by Terry Winograd at MIT between nineteen sixty-eight and nineteen seventy. It manipulated colored blocks in a simulated tabletop world, and it understood natural-language commands like "put the red block on top of the green one," tracking context as the conversation went. It felt like genuine understanding. But the magic depended entirely on that closed little blocks world, and it didn't scale out to the messy real one.
And there was ELIZA, written by Joseph Weizenbaum at MIT between nineteen sixty-four and sixty-six. ELIZA was a pattern-matching chatbot, the very first chatbot, and its most famous script, called DOCTOR, imitated a particular style of therapist by mostly reflecting your statements back at you as questions. ELIZA had no understanding whatsoever, none, it was just clever text substitution. And yet people formed real emotional attachments to it, confiding in it, asking for privacy with it. That phenomenon got a name, the ELIZA effect, our tendency to read a mind into a machine that's just shuffling symbols. The episode disturbed Weizenbaum so deeply that he became one of AI's sharpest critics. There was also Shakey the robot, built at the Stanford Research Institute between nineteen sixty-six and seventy-two, the first mobile robot that could perceive its environment, reason about it, and act, using a planning system called STRIPS.
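To show how little machinery that kind of reflection needs, here is a tiny ELIZA-style responder in Python. The patterns, the pronoun swaps, and the fallback line are hypothetical, invented for this sketch; they are not Weizenbaum's DOCTOR script.

```python
import re

# Hypothetical pronoun swaps and rules, for illustration only.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}

RULES = [
    (re.compile(r"i feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
]

def reflect(fragment):
    """Swap first-person words for second-person ones."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)

def respond(statement):
    cleaned = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(cleaned)
        if match:
            # Echo the user's own words back inside a canned question.
            return template.format(reflect(match.group(1)))
    return "Please go on."  # fallback when no pattern matches

print(respond("I feel ignored by my boss."))
```

That call prints "Why do you feel ignored by your boss?" There is no model of bosses or feelings anywhere in the program, only text substitution, which is exactly why the ELIZA effect unsettled Weizenbaum.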
But the promises had outrun the delivery, and the bill came due. Welcome to the first AI winter, roughly nineteen seventy-four to nineteen eighty.
The most famous blow landed on the connectionists. In nineteen sixty-nine, Marvin Minsky and Seymour Papert published a book simply titled Perceptrons, and in it they proved, rigorously, that a single-layer perceptron, Rosenblatt's machine, cannot compute a function as simple as XOR, exclusive-or. Here's the intuition: a single perceptron draws exactly one straight line to separate its categories. But the four cases of XOR are arranged so that no single straight line can ever separate the "true" cases from the "false" ones. So the machine that the Navy thought would reproduce itself couldn't even solve XOR. That book is widely credited with draining funding out of neural-network research for about fifteen years.
Now I owe you an important nuance here, because the popular story is a little unfair. Minsky and Papert knew perfectly well that a multi-layer network, a network with a hidden layer in the middle, could solve XOR. That wasn't the open problem. The open problem was that nobody knew how to train a multi-layer network. There was no known learning algorithm for the deeper case. So the critique was really, "this works, but we can't teach the deeper version yet." And tragically, Rosenblatt himself didn't get to answer it; he died in a boating accident in nineteen seventy-one.
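Both halves of that story fit in a few lines of Python. In this sketch, the hidden-layer wiring is chosen by hand (nothing is learned), and all names are illustrative assumptions. The grid search over single-unit weights is only a demonstration on a small set of values; the real impossibility result is the proof, not the search.

```python
from itertools import product

def step(total, threshold):
    """Threshold unit: 1 if the input sum reaches the threshold, else 0."""
    return 1 if total >= threshold else 0

def xor_two_layer(a, b):
    h_or = step(a + b, 1)          # hidden unit 1: at least one input is on
    h_and = step(a + b, 2)         # hidden unit 2: both inputs are on
    return step(h_or - h_and, 1)   # output: "or, but not and"

def single_unit_solves_xor(grid):
    """Try every weight pair and threshold in the grid on one threshold unit."""
    for w1, w2, t in product(grid, repeat=3):
        if all(step(w1 * a + w2 * b, t) == (a ^ b)
               for a, b in product((0, 1), repeat=2)):
            return True
    return False

for a, b in product((0, 1), repeat=2):
    print(a, b, "XOR:", xor_two_layer(a, b))

grid = [x / 2 for x in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
print("single unit found:", single_unit_solves_xor(grid))
```

The two-layer version gets all four cases right with hand-set weights, while the single unit fails for every setting tried. What was missing in nineteen sixty-nine was a way to find those hidden-layer weights automatically.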
The symbolic camp was hitting its own wall at the same time, and it was that combinatorial explosion we talked about, the search trees blowing up exponentially on anything resembling a real problem. The toy successes simply would not transfer. There was a specific, very public failure too: machine translation. A report called the ALPAC report, back in nineteen sixty-six, had already shown that automatic translation was vastly harder than promised, and it killed most translation funding.
Then came the report that really named the winter, at least in Britain. In nineteen seventy-three, Sir James Lighthill, commissioned by the UK Science Research Council, delivered a withering assessment of the entire field. His verdict was that in no part of the field had the discoveries made so far produced the major impact that had been promised. His central technical critique was, once again, the combinatorial explosion. The Lighthill report led the UK to slash AI funding almost everywhere, sparing only a couple of universities like Edinburgh and Sussex. It was even followed by a televised BBC debate. And across the Atlantic, the US defense research agency, DARPA, was making its own cuts out of frustration with unmet promises. By the late nineteen seventies, "AI" had become a term that serious researchers actively avoided putting on a grant application.
But winters end, and the next spring came from the symbolic side, with a commercial twist. This is the expert-systems boom, roughly nineteen eighty to nineteen ninety-three.
The idea behind expert systems was to stop trying to build general intelligence and instead capture the specific knowledge of a human expert in a narrow domain, as a big pile of if-then rules. The motto was "knowledge is power." The pioneer was a system called DENDRAL, developed at Stanford from the mid-nineteen sixties into the early eighties, which inferred the molecular structure of compounds from mass-spectrometry data. Then came MYCIN, also at Stanford in the mid-seventies, which diagnosed blood infections using about six hundred if-then rules plus a scheme of certainty factors to handle uncertainty. In trials, MYCIN performed at or above the level of junior doctors. And yet it was never deployed in actual clinical practice, blocked by questions of liability and trust. Who do you sue when the program is wrong?
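As a rough illustration of if-then rules with certainty factors, here is a minimal Python sketch. The two rules, their findings, and their numbers are invented for this example and have no medical meaning. The combining formula is the one commonly described for MYCIN when two pieces of evidence both support a conclusion; treat it as a simplified assumption, since the full scheme also handled negative and mixed evidence.

```python
def combine_positive(cf_a, cf_b):
    """Combine two positive certainty factors that support the same conclusion."""
    return cf_a + cf_b * (1 - cf_a)

# Hypothetical rules: (required findings, conclusion, certainty factor).
RULES = [
    ({"gram_negative", "rod_shaped"}, "organism_x", 0.6),
    ({"hospital_acquired"}, "organism_x", 0.4),
]

def evaluate(findings):
    """Fire every rule whose conditions are all present and accumulate belief."""
    belief = {}
    for conditions, conclusion, cf in RULES:
        if conditions <= findings:  # set inclusion: all conditions are satisfied
            belief[conclusion] = combine_positive(belief.get(conclusion, 0.0), cf)
    return belief

print(evaluate({"gram_negative", "rod_shaped", "hospital_acquired"}))
```

With both rules firing, the belief in the conclusion comes out to about 0.76: more than either rule alone, but never above one. Every rule and every number in such a system had to be elicited from a human expert by hand.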
The system that really lit the gold rush was XCON, also called R1, built around nineteen eighty by Digital Equipment Corporation with Carnegie Mellon, led by John McDermott. XCON configured orders for DEC's VAX computers, making sure every order had the right parts. It was the first big commercial expert-system success. By the late eighties it held more than fifteen thousand rules and reportedly saved the company around twenty-five million dollars a year. Suddenly every corporation wanted one, and an industry of expert-system companies and specialized LISP machines sprang up.
The hype went international. In nineteen eighty-two, Japan launched the Fifth Generation Computer Systems project, a government-led, roughly ten-year effort, funded at over four hundred million dollars, aimed at building massively parallel machines based on logic programming. It scared the West into responding: the US formed a research consortium called MCC, the UK launched the Alvey program, and Europe started ESPRIT. And like so much in this story, the Fifth Generation project largely failed to meet its grand goals.
So we got our second winter. And the cause this time is one of the most important lessons in the whole history, so let me dwell on it. Expert systems ran into what's called the knowledge-engineering bottleneck. Hand-encoding human expertise as rules turned out to be painfully slow, expensive, and brittle. The rules didn't generalize beyond exactly what you wrote. As you added more, they started contradicting each other. And they had no common sense at all, no grasp of the obvious background facts every human shares. On top of that, the technology platform collapsed: around nineteen eighty-seven, the market for specialized LISP machines, from companies like Symbolics, cratered, because ordinary desktop computers had simply gotten fast enough to do the job. DARPA pulled back again, and once more "AI" became a dirty word. Researchers rebranded, calling their work "machine learning" or "knowledge-based systems" to avoid the stigma. But the durable lesson, the one that aimed the whole future of the field, was this: the brittleness of hand-coded knowledge pushed everyone toward a different idea, learning from data.
And that's exactly the pendulum swinging back. So now, the statistical turn and the connectionist revival, roughly nineteen eighty-six to two thousand ten.
The pivotal moment came in nineteen eighty-six, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper in Nature called "Learning representations by back-propagating errors." This popularized backpropagation, the algorithm for training multi-layer neural networks, the very thing that had been missing back when Minsky and Papert wrote Perceptrons. The beautiful result was that the hidden units in the middle of the network learn useful internal representations on their own, automatically. That directly answered the XOR critique that had frozen the field for fifteen years, and it reignited connectionism. The movement even had its bible, a pair of volumes on parallel distributed processing.
Now, a crucial nuance, because backpropagation was actually invented several times before nineteen eighty-six. A researcher named Seppo Linnainmaa published the core technique, reverse-mode automatic differentiation, back in nineteen seventy. Paul Werbos, in nineteen seventy-four, was the first to propose applying it specifically to train neural networks. And others, like Parker and a young Yann LeCun, hit on it around nineteen eighty-five. Under the hood, backpropagation is really just the chain rule from calculus, applied efficiently, backward through the network, to compute how much each individual weight contributed to the error. Then you combine that with gradient descent to nudge every weight in the right direction. The math, in other words, had existed for years. What was missing, as always, was enough data and enough compute. The pioneers were, once again, right too early.
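Here is a minimal sketch of that chain-rule bookkeeping in Python, on about the smallest network that has a hidden layer: one input, one hidden unit, one output. The starting weights, the learning rate, the squared-error loss, and the sigmoid activation are all assumptions picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)  # hidden activation
    y = sigmoid(w2 * h)  # output activation
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    """Apply the chain rule backward, from the loss to each weight."""
    h, y = forward(w1, w2, x)
    delta_out = (y - target) * y * (1 - y)         # dLoss/d(output pre-activation)
    grad_w2 = delta_out * h                        # how much w2 contributed
    delta_hidden = delta_out * w2 * h * (1 - h)    # error passed back through w2
    grad_w1 = delta_hidden * x                     # how much w1 contributed
    return grad_w1, grad_w2

# Illustrative starting point (assumed values).
w1, w2, x, target = 0.5, -0.3, 1.0, 1.0
learning_rate = 0.5

loss_before = loss(w1, w2, x, target)
for _ in range(100):
    g1, g2 = backprop(w1, w2, x, target)
    w1 -= learning_rate * g1  # gradient descent: step against the gradient
    w2 -= learning_rate * g2
loss_after = loss(w1, w2, x, target)

print("loss before:", loss_before, "loss after:", loss_after)
```

The backward pass reuses the error signal from the output when computing the hidden unit's share of the blame. That reuse is what makes the method efficient when a network has millions of weights rather than two.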
But here's the twist: even as neural nets came back, they didn't immediately take over. Through the nineteen nineties and two thousands, the field's center of gravity moved to statistical and probabilistic machine learning, a broad turn that we can summarize as going "from knowledge to data." The star technique was the Support Vector Machine, introduced by Corinna Cortes and Vladimir Vapnik in nineteen ninety-five. SVMs find the maximum-margin boundary between classes and use a clever idea called the kernel trick to handle complicated, nonlinear data. They were elegant, they had beautiful theory, and they worked so well that, for a while, they actually pushed neural networks back to the margins of the field.
There was a whole toolkit in this wave. Judea Pearl gave us Bayesian networks in nineteen eighty-eight and later the formal study of causal inference, principled ways to reason about uncertainty. There were decision trees, from Ross Quinlan's algorithms, and then random forests from Leo Breiman in two thousand one, and boosting methods like AdaBoost from Freund and Schapire in nineteen ninety-seven. Hidden Markov Models dominated speech recognition. And in language, statistical natural-language processing quietly replaced the hand-written grammar rules of the symbolic era, with the field adopting a new mantra: there's no data like more data. The whole worldview had flipped, from encoding knowledge by hand to fitting models to data.
And right in the middle of this, in May nineteen ninety-seven, came one of the most famous moments in AI history: IBM's Deep Blue defeated the reigning world chess champion, Garry Kasparov, three and a half to two and a half. It was the first time a sitting world champion lost a match to a machine under tournament conditions. But here's the nuance that matters, and it cuts against the headlines. Deep Blue did not learn anything, and it was not a neural network. It was brute-force search, evaluating roughly two hundred million positions per second, paired with a handcrafted evaluation function written by human experts. So Deep Blue was actually a triumph of the symbolic-search tradition plus raw computing power. It's a wonderful reminder that the pendulum doesn't move in a clean line.
And while all this was happening, the enablers for the next era were quietly building in the background: Moore's Law kept making compute cheaper, the World Wide Web was generating digital data at an explosive rate, and storage kept getting cheaper. The fuel was accumulating.
Which brings us to the question that defines the modern era: why did deep learning suddenly start working, around two thousand six to two thousand sixteen?
The answer comes down to three ingredients arriving at once. First, data. In two thousand nine, Fei-Fei Li launched ImageNet, a dataset of roughly fourteen million hand-labeled images across about twenty thousand categories, labeled by crowds of workers through Amazon Mechanical Turk. Its annual competition, starting in twenty ten, became the proving ground for computer vision. Second, compute. Graphics processing units, GPUs, originally built for video games, turned out to be perfect for the massively parallel matrix math that neural networks need, and NVIDIA's CUDA platform, from two thousand seven, made them programmable for it. Third, algorithms: better ways to initialize networks, a simple activation function called ReLU that largely sidestepped the vanishing-gradient problem that plagued older functions, and dropout, a regularization trick from Hinton's group in twenty twelve. There was also foundational work in two thousand six on deep belief nets, from Hinton and colleagues, that helped revive the very phrase "deep learning."
I want to flag one architecture that was, yet again, right too early. Yann LeCun, at Bell Labs, built convolutional neural networks, CNNs, in a system called LeNet, refined from nineteen eighty-nine into the late nineties. CNNs are designed for images: they use local filters, weight sharing, and pooling to exploit the spatial structure of a picture, and LeNet was good enough to read handwritten digits on checks and ZIP codes. It was the right architecture, sitting there for a decade, waiting for the data and compute to catch up.
And in twenty twelve, it all came together in a single explosive result. A system called AlexNet, built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, entered the ImageNet competition and won it, crushing the field. Its top-five error rate was about fifteen percent, against the runner-up's roughly twenty-six percent. That's a ten-point jump in a competition where people fought over fractions of a percent. AlexNet had around sixty million parameters and eight layers, and it was trained in about five or six days on two consumer NVIDIA gaming GPUs, using ReLU, dropout, and data augmentation. Within a single year, the entire field pivoted to deep learning. That's the spark that lit the modern AI fire.
The breakthroughs came fast after that. In twenty thirteen, Tomas Mikolov and colleagues at Google released word2vec, which learned dense vector representations of words from their context, and gave us that famous, almost magical bit of arithmetic: take the vector for "king," subtract "man," add "woman," and you land near "queen." Meaning had become geometry. For handling sequences like sentences, the workhorse was the LSTM, or Long Short-Term Memory network, actually invented way back in nineteen ninety-seven by Hochreiter and Schmidhuber, which used gates to solve the vanishing-gradient problem in recurrent networks and capture long-range dependencies. LSTMs powered machine translation and speech from roughly twenty fourteen to twenty seventeen, and they were the reigning sequence architecture right up until the transformer.
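The "meaning as geometry" idea can be shown with a toy example. The vectors below are made up by hand, with three readable axes (royalty, male, female), purely as an assumption for illustration; real word2vec vectors are learned from text, have hundreds of dimensions, and have no such labeled axes.

```python
import math

# Hand-made toy vectors: (royalty, male, female). Not learned from any data.
VECTORS = {
    "king":   (1.0, 1.0, 0.0),
    "queen":  (1.0, 0.0, 1.0),
    "man":    (0.0, 1.0, 0.0),
    "woman":  (0.0, 0.0, 1.0),
    "prince": (0.8, 1.0, 0.0),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))

print(analogy("king", "man", "woman"))  # queen
```

In this toy the arithmetic lands exactly on "queen" because the axes were designed that way. The striking thing about word2vec was that similar structure emerged on its own from word contexts, with nobody designing the axes.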
And then, in March twenty sixteen, came another watershed. DeepMind's AlphaGo defeated Lee Sedol, an eighteen-time world champion, four games to one, at the game of Go. This mattered enormously because Go's branching factor is so vast that the brute-force approach that beat Kasparov at chess simply doesn't work; many experts thought a champion-level Go program was still a decade or more away. AlphaGo combined deep neural networks, a policy network and a value network, with Monte Carlo Tree Search, reinforcement learning, and self-play. In the second game it played a move, the famous Move thirty-seven, so strange and creative that commentators thought it was a mistake, until it proved decisive. Lee Sedol struck back with his own brilliant Move seventy-eight to win game four. A year later, AlphaGo Zero learned the game entirely from scratch, with no human games at all, just self-play, and a generalized version called AlphaZero went on to master chess and shogi too. This was the arrival of deep reinforcement learning as a serious force.
Now we reach the architecture that made everything since possible: the transformer and the foundation-model era, from twenty seventeen onward. I'll keep this at the level of intuition; we'll save the actual attention math for later in the course.
In June twenty seventeen, a group of researchers, mostly at Google, published a paper with a confident title: "Attention Is All You Need." It introduced the Transformer, an architecture that threw out recurrence and convolution entirely and relied on a mechanism called self-attention. Here's the intuition for why it beat the LSTMs. A recurrent network reads a sequence one token at a time, in order, which is slow to train and makes it hard to connect words that are far apart in a sentence. Attention lets every token look directly at every other token, all at once, and weigh which ones actually matter for it. So long-range relationships are captured directly, and, just as importantly, the whole sequence can be processed in parallel. That parallelism is the secret. It meant you could finally train on enormous datasets using GPUs and TPUs, and scale models to sizes nobody had attempted.
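We're saving the real attention math for later, but here's a stripped-down sketch of the core move, every token scoring every other token and taking a weighted average. I'm simplifying on purpose: the real Transformer first passes each token through learned query, key, and value projections and runs several attention heads side by side, whereas this sketch uses the raw token vectors for all three roles. The three two-dimensional "tokens" are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        # Every token scores every other token directly, however far apart.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all the token vectors.
    outputs = [[sum(w * value[c] for w, value in zip(row, tokens))
                for c in range(dim)]
               for row in weights]
    return weights, outputs


# Made-up tokens: the first and third are similar, the second is different.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
weights, outputs = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

The loop over queries runs one at a time here, but no row depends on any other row. That independence is what lets real implementations compute the whole thing as a few big matrix multiplications on a GPU.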
The transformer split into two influential branches almost immediately. In twenty eighteen, Google released BERT, a bidirectional encoder pretrained by masking out words and asking the model to fill them in, then fine-tuned for specific tasks. BERT made the "pretrain, then fine-tune" recipe mainstream. The other branch was OpenAI's GPT line, the generative, decoder-style models. GPT-1 arrived in twenty eighteen, GPT-2 in twenty nineteen with one and a half billion parameters, released in stages because the team worried about misuse. And then GPT-3, in twenty twenty, with one hundred seventy-five billion parameters, and a breakthrough capability called in-context learning, or few-shot prompting. You could teach GPT-3 a brand-new task just by showing it a few examples in the prompt, with no retraining, no gradient updates at all. The capability simply emerged from sheer scale.
That word "scale" became a strategy, thanks to scaling laws. In twenty twenty, researchers at OpenAI showed that a model's loss falls in a predictable power-law relationship with model size, data, and compute. Then in twenty twenty-two, the Chinchilla work from DeepMind corrected the recipe, showing that the big models of the day had actually been under-trained, that you need far more data per parameter than people had assumed. Suddenly, scaling up wasn't a gamble; it was a plan with a formula. In between those two results, in twenty twenty-one, a large group at Stanford gave the whole paradigm a name in a paper on the opportunities and risks of foundation models, single models trained on broad data that get adapted to countless tasks, and they flagged the risks too: homogenization, bias, and concentration of power.
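To put rough numbers on that Chinchilla correction, here's a back-of-the-envelope sketch. The figures aren't from this episode: the "about twenty training tokens per parameter" rule of thumb is the commonly quoted approximation of the Chinchilla result, and the parameter and token counts are the approximate publicly reported ones for GPT-3 and Chinchilla. Treat all of them as ballpark values, not exact constants.

```python
# Commonly quoted approximation of the Chinchilla finding, not an exact constant.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(params):
    """Rough training-token budget suggested by the rule of thumb."""
    return TOKENS_PER_PARAM * params


def tokens_per_param(params, tokens):
    """How much data a model actually saw per parameter."""
    return tokens / params


# Approximate publicly reported figures.
gpt3_ratio = tokens_per_param(params=175e9, tokens=300e9)
chinchilla_ratio = tokens_per_param(params=70e9, tokens=1.4e12)

print(f"GPT-3:      {gpt3_ratio:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla_ratio:.1f} tokens per parameter")
print(f"Rule-of-thumb budget for 175B parameters: "
      f"{compute_optimal_tokens(175e9):.2e} tokens")
```

By this rough measure, GPT-3 saw under two tokens per parameter, more than ten times short of the rule of thumb. That's what "under-trained" means here.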
But a raw pretrained language model just predicts the next token; it isn't naturally helpful or aligned to follow your instructions. The fix was instruction tuning combined with a technique called reinforcement learning from human feedback, or RLHF. The idea, made concrete in OpenAI's InstructGPT work in twenty twenty-two, is that humans rank different model outputs, those rankings train a reward model, and the language model is then optimized against that reward. That's what turns a text-predictor into an assistant that actually tries to do what you ask.
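Here's a minimal sketch of the "rankings train a reward model" step. It shows one common way to turn a human preference into a training signal, a pairwise ranking loss that is small when the reward model scores the human-preferred answer higher and large when it gets the order wrong. I'm not claiming this matches any lab's exact training code, and the reward scores plugged in at the bottom are made up.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    x = reward_rejected - reward_chosen
    # softplus(x) = log(1 + exp(x)), computed without overflowing for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Made-up reward scores for two answers to the same prompt.
agrees_with_human = pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=0.0)
undecided = pairwise_ranking_loss(reward_chosen=0.0, reward_rejected=0.0)
disagrees_with_human = pairwise_ranking_loss(reward_chosen=0.0, reward_rejected=2.0)

print(round(agrees_with_human, 3))
print(round(undecided, 3))
print(round(disagrees_with_human, 3))
```

Minimizing this loss over many ranked pairs pushes the reward model toward human judgments. The language model is then tuned to produce outputs that the reward model scores highly.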
And then, on November thirtieth, twenty twenty-two, all of this reached the public in one product: ChatGPT. Built on GPT-3.5 with RLHF, it reached around one hundred million users in roughly two months, the fastest-growing consumer app of its time. It triggered the boom we're living through now, with GPT-4 arriving in March twenty twenty-three and a whole wave of competitors right behind it.
Which brings us, finally, to the present, the agentic era. And I'll keep this brief and conceptual, because it's the frontier this whole course is building toward. The shift now is from chatbots to agents. Models don't just talk anymore; they call tools and APIs, they browse the web, they run code, they take multi-step actions toward a goal. They're starting to close the loop between language and the world. The enabling ideas include function and tool calling, a pattern called ReAct that interleaves reasoning and acting, retrieval-augmented generation that lets a model look facts up, explicit planning loops, multi-agent systems, and emerging standards like the Model Context Protocol for connecting models to tools.
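To make the "reason, act, observe" loop concrete, here's a toy sketch in the spirit of ReAct. Everything in it is a stand-in: `scripted_model` is a hypothetical function that fakes what a language model would return, the calculator is a made-up tool, and no real model API or protocol is being called. The point is only the shape of the loop.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def calculator(expression):
    """Toy tool: evaluates 'number operator number', e.g. '17 * 23'."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


TOOLS = {"calculator": calculator}


def scripted_model(history):
    """Hypothetical stand-in for a language model call (hard-coded replies)."""
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return {"thought": "I should compute this with a tool.",
                "action": "calculator", "input": "17 * 23"}
    return {"thought": "I have the result.",
            "final": f"The answer is {observations[-1]}"}


def run_agent(question, model, tools, max_steps=5):
    """Alternate model reasoning with tool calls until a final answer appears."""
    history = [("question", question)]
    for _ in range(max_steps):
        step = model(history)
        history.append(("thought", step["thought"]))
        if "final" in step:
            return step["final"], history
        # Act: run the requested tool, then feed the result back as an observation.
        observation = tools[step["action"]](step["input"])
        history.append(("action", f'{step["action"]}({step["input"]})'))
        history.append(("observation", observation))
    return "Stopped: step limit reached.", history


answer, trace = run_agent("What is 17 times 23?", scripted_model, TOOLS)
print(answer)
for kind, text in trace:
    print(f"{kind}: {text}")
```

The step limit matters in practice: an agent that can call tools in a loop needs some bound so a confused model can't run forever.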
And here's the beautiful thing, the thing that ties this whole episode together. Listen to what agents reintroduce: explicit planning, tool use, memory, search over possibilities. Those are symbolic-flavored ideas, the very concepts from the GOFAI era, now layered on top of statistical foundation models. The pendulum is swinging again, but this time it's looking less like a swing and more like a synthesis of the two historical poles.
So let me close by gathering the durable lessons, because they're the real point of this history. The first is the Bitter Lesson, which Rich Sutton articulated in twenty nineteen, looking back over seventy years of AI. His claim is that general methods that leverage computation tend to win, by a large margin, over approaches that build in human domain knowledge, because the general methods ride the relentlessly falling cost of compute. Chess, Go, speech, vision, language, every one of them eventually went this way. But here's the nuance you should carry, because the Bitter Lesson is contested and often flattened into a lazy slogan, "scale is all you need." That's not quite Sutton's point. He's really talking about general meta-methods, search and learning, that discover complexity on their own as compute grows, rather than literally just adding more parameters.
The second lesson is the pendulum itself, now concrete. Symbolic on one side: Dartmouth, GOFAI, expert systems. Connectionist and statistical on the other: the perceptron, backprop, SVMs, deep learning. And the key pattern is that each pole's winter was the other pole's opening. The third lesson is "right too early," the recurring tragedy of ideas that arrived before their time: the McCulloch-Pitts neuron, Rosenblatt's perceptron, Werbos and Linnainmaa's backpropagation, LeCun's convolutional nets, Hochreiter and Schmidhuber's LSTM. Over and over, the idea preceded the data and the compute by fifteen, twenty, sometimes forty years.
The fourth lesson is that the hype-to-winter-to-breakthrough cycle isn't bad luck; it's structural. It simply tracks the gap between what people promise and what the era's compute and data can actually deliver. And the fifth, which is really a summary of the other four, explains why today's stack looks exactly the way it does. Neural networks, descending from McCulloch and Pitts through Rosenblatt to Rumelhart and Hinton. Trained by gradient descent and backpropagation, from Linnainmaa and Werbos through to nineteen eighty-six. Scaled up on GPUs and giant datasets, from ImageNet and AlexNet through the scaling laws. And tied together by the transformer, the architecture, from twenty seventeen, that finally made all that scale parallelizable. Every piece of the modern system is an answer to a specific failure earlier in this story. That's the history, and it's the foundation we'll build the rest of this course on.