"The Accumulation of Failures" Gave Birth to LLMs (Part 1) — From the Limits of Mathematics to the Geometry of Meaning

LLMs like ChatGPT and Claude were not born from a goal of "let's build an LLM." They grew out of byproducts of solving entirely separate problems, accumulated over 80 years: Turing's halting problem, Shannon's information theory, the setbacks and revival of the perceptron, and the geometry of meaning in Word2Vec. This first part traces that history, following the path from the limits of mathematics to embeddings.
2026.06.24

Introduction

Large language models (LLMs) such as ChatGPT and Claude did not emerge from a goal of "let's build an LLM."

One person was trying to prove the limits of mathematics, another to reduce noise on telephone lines, another to fix accuracy problems in machine translation. The byproducts of these completely different efforts accumulated over 80 years and became LLMs.

This article traces "how those unintended connections came about" in order. No knowledge of ML or mathematics is assumed.

The starting point was Fireship's YouTube video. Inspired by its compact summary of LLM history, I decided to write this article to dig deeper into "what problem each invention was actually trying to solve."

Part 1 (this article) covers the birth of the computer concept, information theory, neural network training, and the technology of converting words into numbers. Part 2 follows the birth of Attention for handling context, through the Transformer, the GPT-3 scaling revolution, and the current era of alignment and efficiency.

Chapter 1 — What Does It Mean to "Compute"? — Alan Turing (1936)

The Problem He Was Trying to Solve

In the mathematics world of the 1930s, there was a dream:

"Every mathematical proposition should be decidable as 'true or false' by an algorithm."

This was called the Entscheidungsproblem (Decision Problem), posed by David Hilbert, a giant of mathematics. In short, it was the question: "Can mathematics be completely mechanized?"

British mathematician Alan Turing set out to settle whether this was possible.

Turing's Translation: From a Mathematical Question to a Programming Question

Turing's brilliant insight was to translate the question into a different question.

If an "algorithm for deciding mathematical propositions" existed, it would work like this:

Keep searching for a proof → Stop when a proof is found → Loop forever if not found

In other words, the question "Can mathematics be mechanized?" can be recast as a question about programs: "Can we determine in advance whether a program will stop or loop forever?" This is the halting problem.

Therefore, if we can prove the halting problem is unsolvable, we can also prove that mechanizing mathematics is impossible. That is what Turing aimed for.

The "Halting Problem": The Contradiction Born from Self-Reference

"Can we build a universal program (decider H) that, for any program, can determine before execution whether it will halt or loop?"

The answer is No. The reason has the same structure as the "liar's paradox."

"This sentence is false."
→ If true, then false. If false, then true. Either way, a contradiction.

We can construct the same trap with decider H:

  1. Use decider H to build a "contrarian program D"
    • If H says "it will halt" → D loops forever
    • If H says "it will loop" → D stops immediately
  2. Apply D to itself (have it judge itself)
  3. No matter what answer H gives, a contradiction arises

This is the same structure as the "lying sentence." The moment something references itself, any answer leads to contradiction. Decider H cannot exist in principle.
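
The trap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_contrarian` and `refute` are hypothetical names, the infinite loop is represented by the string "LOOP" instead of being executed (so the sketch itself finishes), and the two candidate deciders are deliberately naive stand-ins for decider H.

```python
def make_contrarian(decider):
    """Build the contrarian program D from a claimed halting decider.

    `decider(program, argument)` is assumed to return True when it claims
    that program(argument) halts, and False when it claims it loops.
    """
    def contrarian(program):
        if decider(program, program):
            return "LOOP"  # stand-in for `while True: pass`
        return "HALT"      # stop immediately
    return contrarian


def refute(decider):
    """Return (what the decider claims about D(D), what D(D) actually does)."""
    d = make_contrarian(decider)
    claimed_halts = decider(d, d)
    actually_halts = d(d) == "HALT"
    return claimed_halts, actually_halts


# Two naive candidate deciders; both are wrong about D applied to itself.
always_yes = lambda program, argument: True
always_no = lambda program, argument: False

for candidate in (always_yes, always_no):
    claimed, actual = refute(candidate)
    print(f"claims halts={claimed}, actually halts={actual}")
```

Whatever the candidate answers about D applied to itself, D does the opposite, so the claim and the actual behavior never agree.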

Thus, the dream that mathematics could be completely mechanized was proven impossible.

The Unintended Byproduct: The Concept of a Computer That "Can Do Anything"

What matters is not the conclusion of the proof, but what was needed to make the proof.

To prove that the halting problem is unsolvable, "what it means to compute" first had to be rigorously defined. What was born for this purpose was the Turing machine: an abstract model of a machine that reads and writes symbols on a tape and changes state according to rules.

Turing further noticed a decisive property of this machine. A program itself is data.

What does this mean?

The tape of a Turing machine can hold "input data." However, "the procedure for how to operate (the program)" can also be written on the same tape. That means procedures are a kind of data too. In that case——"a universal machine that can accept any program description as input and execute it" can be built.

This is called the Universal Turing Machine (UTM).

Why this idea was revolutionary becomes clear when compared to the world before it:

Before UTM:  Computational task A → Build dedicated machine A
             Computational task B → Build dedicated machine B
             Computational task C → Build dedicated machine C

After UTM:   Any task → The same 1 machine + different programs (input)

This is your laptop. Whether it's a browser, a game, or LLM inference——the hardware doesn't change. Only the "data" called the program changes.
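
To make "a program is data" concrete, here is a minimal sketch of one machine running two different programs. The function name `run_machine`, the rule format, and the two sample programs are all assumptions made for this illustration; a real universal Turing machine encodes the program on the tape itself, whereas this sketch simply passes it in as a dictionary.

```python
def run_machine(program, tape, state="start", max_steps=1_000):
    """Run a Turing-machine program that is passed in as plain data.

    `program` maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is "L" or "R" and "_" stands for a blank cell.
    """
    cells = dict(enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, "_")
        write, move, state = program[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip("_")


# Two different "programs" for the same machine: each is just a dictionary.
FLIP_BITS = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
APPEND_ONE = {
    ("start", "1"): ("1", "R", "start"),
    ("start", "_"): ("1", "R", "halt"),
}

print(run_machine(FLIP_BITS, "1011"))   # 0100
print(run_machine(APPEND_ONE, "111"))   # 1111
```

The machine (`run_machine`) never changes; only the dictionary handed to it does.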

John von Neumann read Turing's paper and, in the 1940s, turned this concept into an actual computer design (the von Neumann architecture). Programs are stored in the same memory as data, and nearly every computer today follows this design.

The decisive thing happened when this design became the industry standard. Any algorithm expressible on this common architecture runs on any compatible hardware. Anyone who writes a program can run it on compatible machines worldwide. Even without owning specific dedicated hardware, you can participate as long as you have the idea for an algorithm. This is the foundation of the software industry, and also the reason why LLM training code runs on GPU clusters around the world.

However, the 1936 proof also settled one more thing. The "limits of computability" that Turing himself drew apply to LLMs as well. ChatGPT and Claude cannot, in principle, solve the halting problem. They cannot decide for every possible program whether it halts, and so cannot fully guarantee that an arbitrary program is correct. The same paper that gave birth to computers also defined a permanent ceiling for all AI running on those computers.

In trying to prove the limits of mathematics, the concept of "a computer that can do anything" was born.

Side note: Fourteen years later, in 1950, Turing posed the question "Can machines think?" and proposed the Turing Test to see whether humans and machines can be distinguished. It is the conceptual ancestor of conversation tests for evaluating LLM capabilities. Also, the highest honor in computer science is called the ACM Turing Award, known as the "Nobel Prize of computer science." As we'll see later, the pioneers of deep learning also received this award.

Chapter 2 — Can "Information" Be Measured Mathematically? — Claude Shannon (1948)

The Problem He Was Trying to Solve

In the 1940s, Claude Shannon, a researcher at Bell Labs, had a practical concern:

"Over a noisy telephone line, how accurately can information be transmitted?"

This was a telephone engineering problem, completely unrelated to AI or machine learning.

Why It Was Necessary to Define "What Is Information"

Telephone engineers of the time dealt with noise intuitively. Amplify the signal, improve cable quality, repeat the message. However, they were unable to answer a fundamental question:

"Over this noisy line, how accurately can information theoretically be transmitted? Where is the limit?"

The reason they couldn't answer is simple. Because they couldn't mathematically define what noise was destroying.

Think about it. Noise destroys signals. So what exactly is lost when a signal "breaks"?

For example, if the message "THE CAT SAT ON THE MAT" is partially corrupted by noise:

  • The "C" in "CAT" disappears → The meaning is broken. Fatal.
  • "THE" becomes "TH_" → The reader can reconstruct it. No problem.
  • The second "THE" disappears → The reader already predicts it. Small information loss.

Here lies an important discovery. Not every part of a message carries equal information. Predictable parts carry little information; unpredictable parts carry a lot. Noise truly causes harm only when it destroys parts with high information content.

In other words, to fight noise, you first need to know "what should be protected." To know what to protect, you need to define "what information is."

Defining "Quantity of Information"

Shannon's answer: "Information is the reduction of uncertainty. The more unpredictable an event, the greater the information it carries."

"Tomorrow the sun will rise in the east"——everyone knows this. Information content is nearly zero.
"It snows in Tokyo in midsummer"——unexpected. High information content.

This intuition was formalized in information entropy——a formula that "calculates overall unpredictability based on the likelihood (probability) of each event," giving larger values when prediction is harder.
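
As a rough sketch, the standard entropy formula H = -Σ p·log2(p) fits in one line of Python. The function name `entropy_bits` and the example probabilities are assumptions chosen for illustration.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), measured in bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# A near-certain event (the sun rises in the east) carries almost no information.
print(entropy_bits([0.999999, 0.000001]))

# A fair coin flip is maximally unpredictable for two outcomes: exactly 1 bit.
print(entropy_bits([0.5, 0.5]))

# Eight equally likely outcomes need 3 bits.
print(entropy_bits([0.125] * 8))
```

The harder the outcome is to predict, the larger the value.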

Incidentally, the word "bit" also appeared in this era, but Shannon didn't invent it himself. The concept of binary numbers with 0 and 1 goes back to Leibniz (1703), and the word "bit" was coined by colleague John Tukey and then popularized by Shannon in his paper. Shannon's essential invention was not "0 or 1" but the mathematics of quantifying information itself.

Side note: Anthropic's AI assistant "Claude" is widely said to be named after Claude Shannon. An AI bearing the name of the man who created information theory is quite a suggestive choice.

The Engineering Shannon's Answer Unleashed

By defining information, everything became quantifiable:

  1. Messages can be measured — you can calculate how many bits of information they contain
  2. Channels can be measured — you can calculate how many bits per second a line can reliably carry (channel capacity)
  3. Redundancy can be designed — you can calculate how many bits to add for error correction

And the most important result is the noisy-channel coding theorem: as long as the information rate stays below the channel capacity, appropriate encoding can make the probability of error as small as desired, however noisy the line is.

Before Shannon, engineers thought that on noisy lines, degraded communication quality was unavoidable. Shannon proved this was not so. Noise defines a limit, but below that limit, communication can be made as reliable as desired, as long as the right encoding is used.

The Connection to Compression: Parts with Less Information Can Be Discarded

Shannon's definition of information quantity produced another revolution——data compression.

According to the source coding theorem, the theoretical minimum size a message can be compressed to equals its entropy (information content). That means:

  • "AAAAAAAAAA"——completely predictable, entropy nearly zero → can be compressed to nearly nothing
  • Random noise——completely unpredictable, maximum entropy → incompressible
  • Natural language——in between, many patterns → can be significantly compressed

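JPEG, MP3, and ZIP files all build on this principle. Lossless formats like ZIP remove only redundancy: predictable (low information content) parts are encoded compactly and reconstructed exactly when decompressing. Lossy formats like JPEG and MP3 go one step further: they first discard detail that human eyes and ears barely notice, then encode what remains compactly.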
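
The three cases above can be checked with Python's standard `zlib` module (the same DEFLATE family used in ZIP files). This is a sketch for illustration: the helper name `compression_ratio`, the fixed seed, and the sample sentences are assumptions, and exact ratios depend on the input.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller = more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
repetitive = b"A" * 10_000
noise = bytes(rng.getrandbits(8) for _ in range(10_000))
english = (
    "Not every part of a message carries equal information. "
    "Predictable parts carry little information; unpredictable parts carry a lot. "
    "Noise truly causes harm only when it destroys parts with high information content. "
    "To fight noise, you first need to know what should be protected. "
    "To know what to protect, you need to define what information is."
).encode("ascii")

for label, data in [("repetitive", repetitive), ("english", english), ("noise", noise)]:
    print(f"{label:10s} ratio = {compression_ratio(data):.3f}")
```

The repetitive input shrinks to almost nothing, the random bytes do not shrink at all, and English text lands in between.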

LLM tokenizers (BPE, Byte Pair Encoding) are also compression in this sense. Frequently occurring sequences ("the," "ing," "tion") become a single token, while rare sequences are split into smaller pieces, down to individual characters or bytes. A tokenizer is not a Shannon-optimal encoder in the strict sense, but it applies the same idea: spend fewer symbols on what is frequent and predictable.
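
A toy version of the BPE merge loop shows the idea. This is a simplified character-level sketch, not how any production tokenizer is implemented; the function names and the sample text are assumptions for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe_tokenize(text, num_merges):
    """Start from single characters and repeatedly merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


text = "the cat sat on the mat the cat"
tokens = bpe_tokenize(text, num_merges=6)
print(len(text), "characters ->", len(tokens), "tokens")
print(tokens)
```

Frequent character sequences end up as single tokens, so the same text is described with fewer symbols.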

The Connection to LLMs: From Training to Inference

Training: Cross-Entropy Loss

The central question when training an LLM is "how accurately can the model predict the next word?"

The metric used to measure this "accuracy of prediction" is cross-entropy loss——derived directly from Shannon's entropy.

Loss = -(sum of correct probability × log(model's predicted probability))

When an LLM is learning, the model continually tries to minimize this value. In other words, reducing "surprise" about the next token is the entire purpose of training. Shannon's definition of "information = reduction of uncertainty" becomes the learning objective as-is.
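
For a single next-token prediction, the "correct probability" is 1 for the true token and 0 for every other token, so the sum collapses to minus the log of the probability the model gave the true token. The sketch below assumes a made-up three-word vocabulary and made-up probabilities; `next_token_loss` is a hypothetical helper name.

```python
import math


def next_token_loss(predicted_probs, correct_token):
    """Cross-entropy for one prediction: -log(probability given to the correct token)."""
    return -math.log(predicted_probs[correct_token])


# Hypothetical model outputs for the next word after "The cat sat on the ..."
confident_right = {"mat": 0.90, "dog": 0.05, "moon": 0.05}
unsure = {"mat": 0.34, "dog": 0.33, "moon": 0.33}
confident_wrong = {"mat": 0.01, "dog": 0.98, "moon": 0.01}

for name, probs in [("confident and right", confident_right),
                    ("unsure", unsure),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name:20s} loss = {next_token_loss(probs, 'mat'):.3f}")
```

The less surprised the model is by the true token, the lower the loss; being confidently wrong is punished hardest.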

Inference: Temperature Is a Knob for Entropy

Shannon's concept of "surprise" also directly connects to controlling LLM output. The Temperature parameter is a knob that directly manipulates the entropy of the output distribution.

P(token) = softmax(logit / temperature)

Temperature | Effect on distribution                               | Shannon entropy
→ 0         | Concentrates on highest-probability token            | Entropy → 0 (zero surprise)
= 1         | Model's original distribution                        | Natural entropy
> 1         | Distribution flattens, all tokens become more equal  | Entropy rises (surprise increases)

Low temperature = always choosing the most predictable (low information) token. High temperature = sampling from a higher-entropy distribution = potentially more creative but less consistent.

When you raise Temperature for "write more creatively," you are literally increasing the Shannon entropy of the output.
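
The effect is easy to check numerically. In this sketch the logits are made-up numbers for three hypothetical tokens, and the function names are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def entropy_bits(probabilities):
    """Shannon entropy of a distribution, in bits."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for temperature in (0.1, 1.0, 5.0):
    probs = softmax_with_temperature(logits, temperature)
    rounded = [round(p, 3) for p in probs]
    print(f"T={temperature}: probs={rounded}, entropy={entropy_bits(probs):.3f} bits")
```

At low temperature almost all probability sits on one token and the entropy is near zero; at high temperature the distribution approaches uniform and the entropy approaches its maximum (log2 of 3, about 1.585 bits, for three tokens).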

The Overall Connection

Shannon (1948):  Defined information = surprise = entropy

Compression:    Remove low-entropy (predictable) parts, preserve high-entropy

Tokenizer:      BPE encodes frequent sequences compactly on the same principle

Training loss:  Cross-entropy = model's "surprise" at the correct token

Temperature:    Controls entropy of output distribution at inference

A single mathematical framework applies to every layer of building and operating LLMs. Shannon's mathematics, which aimed to reduce noise on telephone lines, became both "the definition of smartness" and "the creativity adjustment knob" of LLMs 80 years later.

Chapter 3 — Can Machines Learn? — Optimism, Setbacks, Revival

The Age of Optimism: The Perceptron (1958)

The first demonstration that "machines can learn" was Frank Rosenblatt's Perceptron.

The mechanism is simple. The simplest neural network——a circuit with just a single "neuron."

Input        Weight         Sum            Output
  x1 ----(w1)----\
  x2 ----(w2)-----→  Σ(xi·wi) + bias  → threshold decision → 0 or 1
  x3 ----(w3)----/

  1. Receive multiple inputs (numbers)
  2. Multiply each by a "weight" and sum them up
  3. Output "Yes (1)" if the total exceeds a threshold, "No (0)" otherwise

Let's look at a concrete example. A perceptron that judges "Is this email spam?":

Inputs:
  x1 = Contains "free prize"? (1 or 0)
  x2 = From a known contact? (1 or 0)
  x3 = Has attachment? (1 or 0)

Learned weights:
  w1 = +0.9 (strong spam signal)
  w2 = -0.8 (strong non-spam signal)
  w3 = +0.3 (weak spam signal)

Score = (1)(0.9) + (0)(-0.8) + (1)(0.3) = 1.2
Threshold = 0.5
1.2 > 0.5 → "Spam"

"Learning" is the adjustment of these weights. Starting with random weights, gradually correcting the weights while checking answers. Repeat millions of times, and the weights converge to appropriate values.

Before the perceptron, all programs ran on rules handwritten by humans. The perceptron was a revolutionary demonstration of the concept that machines find rules themselves from data.
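
The weight-adjustment loop can be sketched with the classic perceptron learning rule. Several things here are assumptions made for the illustration: the task (learning logical AND from four examples), the learning rate of 1.0, the threshold at zero with a bias term, and starting from zero weights instead of random ones so the run is repeatable.

```python
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, inputs):
    """Weighted sum followed by a threshold decision."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Nudge the weights a little every time the prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0, or +1
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train_perceptron(AND_SAMPLES)
print("learned weights:", weights, "bias:", bias)
for inputs, target in AND_SAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "(expected", target, ")")
```

Nobody wrote the rule "output 1 only when both inputs are 1"; the weights that implement it were found from the examples.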

In 1958, the New York Times reported: "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." An optimistic mood prevailed.

The Setback: The First AI Winter (1969)

Marvin Minsky and Seymour Papert threw cold water on this.

The perceptron has a mathematical limitation: a single perceptron cannot learn even a function as simple as XOR.

A question might arise here. "As Turing proved in 1936, a universal Turing machine can perform any computation. Why would something as trivial as XOR be a problem?"

The answer is that Turing's universality and the perceptron's learning are completely different capabilities.

  • Turing's universality: If a human writes the correct program, any computation is possible. XOR? Writable in 1 line with an if statement.
  • The perceptron's promise: The machine discovers rules itself from data. Humans don't write the rules.

What XOR destroyed was not "computational capability" but the "learning mechanism" itself.

Let's look concretely. XOR (exclusive OR) is the simple rule: "if both inputs are 1 or both are 0, output 0; if only one is 1, output 1." Plotted on a 2D plane:

[Figure: the four XOR inputs plotted on a 2D plane, with ● and ○ marking the two output classes]

The perceptron tries to separate Yes and No with a single straight line. But looking at the figure above, it is impossible to separate ● and ○ with one straight line. No matter how you draw the line, they always mix.
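
A brute-force check makes the same point without a picture. This sketch tries every line on a coarse grid of weights and biases (the grid itself is an arbitrary assumption) and counts how many of the four points each line classifies correctly. A grid search is not a proof, but it matches the geometric argument.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def best_single_line_score(truth_table):
    """Best number of correct answers (out of 4) any single line on the grid achieves."""
    grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    best = 0
    for w1, w2, bias in product(grid, repeat=3):
        correct = sum(
            (1 if w1 * x1 + w2 * x2 + bias > 0 else 0) == target
            for (x1, x2), target in truth_table.items()
        )
        best = max(best, correct)
    return best


print("AND:", best_single_line_score(AND), "of 4 correct")
print("XOR:", best_single_line_score(XOR), "of 4 correct")
```

For AND, some line gets all four points right. For XOR, the best any line can do is three out of four.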

If it were just this, one might say "XOR is a special case." But what Minsky and Papert proved was not just about XOR. They showed that an entire class of problems requiring non-linear decision boundaries——parity determination (distinguishing odd from even), symmetry detection, connectivity judgment, and more——are all impossible for perceptrons. XOR is simply the most minimal and most embarrassing counterexample.

And the reason this was devastating was the gap between expectations and reality. The media reported "a machine that walks and talks." Mathematicians proved that the machine can't even learn XOR. The conclusion for funders: "Neural networks = dead end." This triggered the first AI winter. Research funding froze, and many neural network researchers lost their jobs.

However, their book left one door open. In paraphrase:

"A multi-layer network might be able to solve this problem. However, how to train it is unknown."

This "however" defined the research agenda for the next 17 years.

The Revival: Backpropagation (1986)

What resolved the "however" was backpropagation, popularized in 1986 by Rumelhart, Hinton, and Williams.

A multi-layer network is multiple perceptrons connected together. Each perceptron has its own weights, grouped into layers.

[Figure: a multi-layer neural network with an input layer (Layer 1), a hidden layer (Layer 2), and an output layer (Layer 3)]

The middle Layer 2 in this figure is called a hidden layer. "Hidden" doesn't mean something is being concealed. The input layer is where users pass data, the output layer is where users receive results——both are "visible" to the user. The middle layers, on the other hand, are internal workspaces that users don't directly touch. Not visible from the outside, hence "hidden."

When layers increase, what increases is these hidden layers. There is always exactly one input layer and one output layer. How many hidden layers are stacked between them is "the depth of the network," and this is the meaning of "deep" in "deep learning." This concept of "hidden" will reappear in Chapter 5, so keep it in mind.

The flow of learning is intuitive:

  1. Initialize all weights with random values (if every weight started out identical, all neurons would learn the same thing)
  2. Input data, pass through all layers, produce a prediction ("this image is a cat")
  3. Check the answer and calculate how much was wrong
  4. Track backward from output to input "which weights contributed to this mistake"
  5. Slightly adjust each weight according to its contribution (weights are overwritten; no snapshots remain)
  6. Repeat this millions of times

Only the final output knows the correct answer. No one knows "what the intermediate layers should correctly output." There is no correct answer for "the correct edge detection result of Layer 1 for a cat image."

Input → [Layer1] → [Layer2] → [Layer3] → Output vs. Correct Answer

                                       Only scoring point

So the error signal is propagated backward from output. Using the mathematical chain rule——the principle that "if A affects B, and B affects C, then A's effect on C is found by multiplying the two effects"——the responsibility of each layer is calculated in sequence.

Forward pass (left → right):
  Input → Layer1 → Layer2 → Layer3 → Output → "Wrong!"

Backward pass (right → left):
  "Wrong!" → Layer3's responsibility → Layer2's responsibility → Layer1's responsibility

  "Layer3, your responsibility is 40% → adjust weights by 40%"
  "Layer2, your responsibility is 35% → adjust weights by 35%"
  "Layer1, your responsibility is 25% → adjust weights by 25%"

The error propagates backward——hence backpropagation.

It was shown that multi-layer networks can be trained this way, solving the problem that Minsky and others called "unknown."
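
The chain rule at work can be sketched on the smallest possible "deep" network: one weight per layer, two layers. Everything concrete here is an assumption for illustration (the tanh activation, the squared-error loss, the starting weights, the learning rate). The sketch computes each weight's responsibility by multiplying effects backward, checks one of them against a brute-force numerical estimate, and then runs a few update steps.

```python
import math


def forward(w1, w2, x):
    hidden = math.tanh(w1 * x)  # layer 1
    output = w2 * hidden        # layer 2
    return hidden, output


def loss(w1, w2, x, target):
    _, output = forward(w1, w2, x)
    return (output - target) ** 2


def backward(w1, w2, x, target):
    """Chain rule: start at the loss and multiply effects back toward the input."""
    hidden, output = forward(w1, w2, x)
    d_loss_d_output = 2 * (output - target)
    grad_w2 = d_loss_d_output * hidden                  # layer 2's responsibility
    d_loss_d_hidden = d_loss_d_output * w2              # pass the error back one layer
    grad_w1 = d_loss_d_hidden * (1 - hidden ** 2) * x   # layer 1's responsibility
    return grad_w1, grad_w2


w1, w2, x, target = 0.8, -0.4, 1.5, 0.5

# Sanity check: compare the chain-rule gradient with a finite-difference estimate.
eps = 1e-5
analytic_w1, _ = backward(w1, w2, x, target)
numeric_w1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
print("chain rule:", analytic_w1, "numerical:", numeric_w1)

# Repeat: forward, backward, small adjustment.
initial_loss = loss(w1, w2, x, target)
for _ in range(100):
    g1, g2 = backward(w1, w2, x, target)
    w1 -= 0.1 * g1
    w2 -= 0.1 * g2
final_loss = loss(w1, w2, x, target)
print("loss before:", initial_loss, "loss after:", final_loss)
```

Real networks do the same thing with matrices instead of single numbers, but the backward multiplication of effects is identical in spirit.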

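Why does adding layers allow solving the problem? The core is the difference between linear and non-linear. A perceptron (single layer) can only draw one straight line. This is the linear limitation. In a multi-layer network, each hidden neuron draws its own line, and the next layer combines them into curved or complex-shaped decision boundaries. This is non-linear. The non-linear step inside each neuron (the threshold, or more generally an activation function) is essential here: without it, stacked layers would collapse into a single straight line again.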

More layers means more complex decision boundaries can be drawn. This is the foundation of deep learning.
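
XOR itself shows how combining lines works. In the sketch below the weights are set by hand, not learned, and the threshold values are assumptions chosen for clarity: one hidden neuron draws an "OR" line, another draws an "AND" line, and the output neuron fires when the first is on and the second is off.

```python
def step(value):
    """Threshold activation: the non-linear part of each neuron."""
    return 1 if value > 0 else 0


def xor_network(x1, x2):
    # Hidden layer: two neurons, each drawing its own straight line.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output layer: "OR but not AND" is exactly XOR.
    return step(h_or - h_and - 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", xor_network(x1, x2))
```

Two straight lines plus one combining neuron are enough to carve out the region that a single line could not.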

However, there was a large wall between theory and practice. Training multi-layer networks requires enormous computation, and with 1980s-90s hardware, training could not be run at practical speeds.

Breaking the Speed Wall: The Rise of GPUs (2012)

Even after backpropagation was established, multi-layer networks remained far from practical for a while. Not enough data. Computation too slow.

In 2012, the situation changed dramatically at the ImageNet challenge (a contest to classify over 1 million images into 1,000 categories). A model called AlexNet achieved overwhelming results: against the second-place top-5 error rate of 26.2%, AlexNet achieved 15.3%, a difference of about 11 percentage points. In the machine learning world, a year-over-year improvement of 1-2 points is normal, so calling an 11-point gap "dominant" is an understatement. Many computer vision researchers shifted en masse to deep learning the following year, rewriting the industry's common sense.

What AlexNet used was a GPU (graphics chip). Why were GPUs so effective? Until then, training was done on CPUs, but CPUs and GPUs have fundamentally different design philosophies.

CPUs have a small number of high-performance cores (4-16) designed to handle complex tasks sequentially. GPUs, on the other hand, have thousands of simple cores (4,000+) designed to simultaneously execute large numbers of simple computations.

CPU:  4 skilled mathematicians solving complex problems one by one
GPU:  4,000 elementary school students solving addition all at once

Neural network training is largely matrix multiplication——the simple repeated computation of multiplying thousands of weights by thousands of inputs and summing them. And each multiplication is independent of the others; w3×x3 does not need to wait for the result of w1×x1.

CPU approach (sequential):
  w1×x1 → done → w2×x2 → done → w3×x3 → done → ... → w1000×x1000 → done
  Total: 1000 steps

GPU approach (parallel):
  w1×x1, w2×x2, w3×x3, ... w1000×x1000 → all done simultaneously
  Total: 1 step

Training AlexNet would have taken weeks to months on the CPUs of the time, but completed in days on GPUs. Same mathematics, same results, just massively parallelized. GPUs broke through the speed wall of training, proving that the combination of "GPU + large data + multi-layer networks" works at a practical level.

However, AlexNet was an 8-layer network. "Deeper (more layers) should mean smarter," but trying to go deeper, another wall stands in the way.

Side note: One of AlexNet's authors, Geoffrey Hinton, also appears on the 1986 backpropagation paper. Surviving the AI winter, the same person led the revival of deep learning 26 years later. Hinton received the Turing Award in 2018 along with Yann LeCun and Yoshua Bengio, and together the three are called the "Godfathers of Deep Learning."

Side note: NVIDIA, which made these GPUs, was originally a company making graphics chips for games. 3D rendering in games also involves "executing the same computation on a large number of pixels in parallel," which is structurally very similar to the matrix operations of neural networks. The AI boom made NVIDIA one of the most valuable companies in the world. The fact that gaming chips became the optimal tool for AI training is itself one of the "unintended connections" traced in this article.

Going Even Deeper: Overcoming Vanishing Gradients (2010s)

Even after GPUs solved the speed problem, making networks deeper itself had another wall——the vanishing gradient problem.

As we saw with backpropagation, the only scoring point is at the final output. The error signal has no choice but to travel backward through the layers. And each time it passes through a layer, the signal is shrunk by multiplication. Each multiplicative coefficient is typically less than 1 (e.g., 0.3), so multiplying a small number repeatedly causes it to rapidly approach zero.

3 layers:   0.3 × 0.3 × 0.3 = 0.027              ← small but usable
10 layers:  0.3^10 ≈ 0.000006                     ← nearly zero
50 layers:  0.3^50 ≈ 7 × 10^-27                   ← effectively zero

The closer to the final layer, the stronger the feedback; the closer to the first layer, the more the feedback vanishes:

Layer1      Layer2      Layer3   ...   Layer50     Output
  ←×0.3←    ←×0.3←     ←×0.3←  ...   ←×0.3←    Error!

Layer50: "30% wrong, adjust!"                ← Can learn
Layer25: "0.000001% wrong"                   ← Barely learns
Layer1:  "0.0000000000% wrong"               ← Frozen, cannot learn

This is like having a broken foundation of a building that can't be fixed. Layer 1 is supposed to learn basic features (edges, simple patterns), but feedback doesn't reach it so it can't learn. If the foundation is nonsense, nothing built on top of it matters.

This wall was overcome gradually through a combination of techniques.

Earlier I explained that the multiplicative coefficient at each layer was less than 1, like "0.3." Where does this coefficient come from? The answer is the activation function——the function that determines how each neuron transforms its output.

What had been used for a long time was Sigmoid. Sigmoid was designed to mimic biological neurons. Real neurons don't simply switch on and off; they activate smoothly in response to stimulus strength. Sigmoid reproduces this with a smooth S-shaped curve that converts any input to a value in the range 0-1.

However, in backpropagation, the "responsibility coefficient" of each layer depends on the slope (gradient) of this activation function. The slope of Sigmoid is at most 0.25, and usually less.

Slope of Sigmoid:
  Extremely large/small input → slope ≈ 0.0 (nearly flat)
  Input near 0 (best case) → slope ≈ 0.25

  Best case for each layer: ×0.25
  After 10 layers: 0.25^10 ≈ 0.000001 → signal dies
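
These numbers can be reproduced directly. The sketch uses the standard sigmoid and its derivative s·(1 - s); the function names and the sample inputs are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_slope(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1 - s)


for x in (-10, -2, 0, 2, 10):
    print(f"input {x:>3}: slope = {sigmoid_slope(x):.6f}")

# Even in the best case (slope 0.25 at every layer), ten layers shrink the signal
# to roughly one millionth of its original size.
best_case_after_10_layers = sigmoid_slope(0) ** 10
print("best case after 10 layers:", best_case_after_10_layers)
```

The slope peaks at 0.25 when the input is 0 and falls toward zero on both sides, so the best case is already a steady shrink.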

Biological plausibility and mathematical elegance——the very reasons Sigmoid was considered "clever" caused it to kill gradients. Its strengths were its weaknesses.

ReLU: Why "Crude" Won

ReLU (Rectified Linear Unit) is surprisingly simple: "If positive, pass through; if negative, zero." That's it.

Mathematicians initially dismissed this. Too crude. Has a non-differentiable point. Looks nothing like a neuron.

But ReLU's slope for positive values is exactly 1.0. Not 0.25, not 0.1, but 1.

Comparison of coefficients during backpropagation:

Sigmoid:  ×0.25 → ×0.25 → ×0.25 → ... → signal dies
          After 10 layers: 0.25^10 ≈ 0.000001

ReLU:     ×1.0  → ×1.0  → ×1.0  → ... → signal survives
          After 10 layers: 1.0^10 = 1.0

ReLU doesn't "amplify" the signal. It simply stopped killing it. Sigmoid was actively compressing the signal at each layer. ReLU lets it through as-is.
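
A minimal sketch of ReLU and its slope, with hypothetical helper names. It assumes every neuron along the path receives a positive input, which is the case the comparison above describes; on the negative side the slope is 0, so a neuron that is switched off passes no gradient at all.

```python
def relu(x):
    """If positive, pass through; if negative, zero."""
    return max(0.0, x)


def relu_slope(x):
    """Slope is exactly 1 for positive inputs and 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def gradient_through(slopes):
    """Multiply the per-layer slopes, as backpropagation does along one path."""
    gradient = 1.0
    for slope in slopes:
        gradient *= slope
    return gradient


active_path = [relu_slope(2.0)] * 10   # ten layers, all with positive inputs
print("ReLU, 10 active layers:", gradient_through(active_path))
print("ReLU, one inactive layer:", gradient_through(active_path + [relu_slope(-1.0)]))
```

Along active neurons the gradient arrives at full strength, however many layers it crosses.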

In the 1980s, networks only had 2-3 layers, and Sigmoid's problem wasn't apparent. Only when layers became deep did the hidden cost of "clever design" become fatal. At scale, the crudeness of preserving a signal beats the elegance of destroying it.

That said, Sigmoid was not "wrong." It was correct, but couldn't withstand scale. It is like training wheels on a bicycle: the right design for a beginner, but in the way for a racer. In fact, Sigmoid is still used today for specific purposes, such as Yes/No classification at the output layer (where 0-1 probabilities are needed) and gate control within memory cells. In the hidden layers of deep networks, however, ReLU and its variants are now the standard.

Residual Connections (ResNet, 2015): Another Solution

While ReLU suppressed signal attenuation within each layer, residual connections (ResNet) attacked the vanishing gradient problem from a completely different angle.

In a normal network, each layer is required to "receive input and produce complete output." With residual connections, the input is added directly to the output via a shortcut path.

Normal path:          Input → [Layer] → Output

With residual:        Input → [Layer] → Layer result + Input = Output
                        │                               ↑
                        └───────────────────────────────┘
                               Shortcut (input is added directly)

What does this change? The work of the layer fundamentally changes.

Without residual connection:
  Input = 5, expected output = 5.3
  What the layer must learn: "output 5.3" ← learns the whole thing

With residual connection:
  Input = 5, expected output = 5.3
  Output = layer result + input
  What the layer must learn: "output 0.3" ← learns only the small correction
  Final output: 0.3 + 5 = 5.3

The role of the layer changes from "generating complete output" to "learning a small correction (residual) to the input." That's why it's called a residual connection.

Consider a familiar example: if you want to weigh a dog that's too light for the scale to measure——hold the dog and weigh yourself together, then subtract your own weight to get the dog's weight. Because you have a known baseline of your own weight (input), the small difference (the dog = the residual) can be measured accurately. Residual connections work the same way: by retaining the input as a baseline, the small corrections the layer needs to learn become detectable.

And for gradients, there is also a decisive difference. In backpropagation, gradients flow through both paths:

Backward pass:

  Total gradient = gradient through layer path     (may shrink: ×0.25)
                 + gradient through shortcut path  (passes through as-is: ×1.0)

  Total = (shrunk signal) + (complete signal)

Even if the gradient shrinks to 0.001 through the layer path, the shortcut path adds 1.0. The total is about 1.001. The shortcut functions as insurance, and the signal doesn't die. And since gradients also flow through the layer path, the layer itself can also learn——because the feedback doesn't disappear.

What if the layer can't find any useful correction? Just output 0. Output = 0 + input = input passes through unchanged. Because the layer has the option to "do nothing," performance doesn't degrade when adding more layers.
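
The arithmetic above can be sketched as follows. It is a deliberately simplified model: each layer is treated as a single number (its slope), every layer is assumed to have the same slope of 0.001, and the function names are made up for this illustration.

```python
def residual_block(layer, x):
    """Output = layer result + input (the shortcut adds the input back)."""
    return layer(x) + x


def plain_gradient(layer_slope, depth):
    """Without shortcuts, the gradient is multiplied by the layer slope at every layer."""
    return layer_slope ** depth


def residual_gradient(layer_slope, depth):
    """With shortcuts, each block contributes (layer slope + 1), since d(input)/d(input) = 1."""
    return (layer_slope + 1.0) ** depth


# The worked example: the layer only has to learn the small correction 0.3.
print(residual_block(lambda x: 0.3, 5))   # about 5.3
# A layer with nothing useful to add outputs 0, and the input passes through.
print(residual_block(lambda x: 0.0, 5))   # 5.0

# Gradient after 50 stacked layers when each layer path shrinks the signal to 0.001.
print("plain:   ", plain_gradient(0.001, 50))
print("residual:", residual_gradient(0.001, 50))
```

Without shortcuts the gradient is astronomically small after 50 layers; with them it stays close to 1.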

ResNet successfully trained a 152-layer network with this mechanism and won ImageNet 2015.

ReLU and ResNet: Different Approaches, Same Goal

ReLU and ResNet are solutions from different angles to the same problem of vanishing gradients. ReLU prevents gradient attenuation within each layer (keeping the coefficient at 1.0), and ResNet secures a detour for gradients across layers. In modern deep learning, combining both has made networks with 100+ layers practical.

GPUs broke the speed wall, and ReLU and ResNet broke the depth wall. With both the speed wall and the depth wall fallen, modern deep learning became possible.

Chapter 4 — What Does It Mean to Turn Words into Numbers? — The Geometry of Meaning

Computers Can Only Handle Numbers

Up to this point, the story has been about "how to train accurately." And all the successes up to this point dealt with data that was already numbers to begin with. Images are numerical pixel values (each pixel is color information from 0-255). Sound is numerical waveforms. Stock prices, temperatures, sensor data——all numbers from the start. They can be input directly into neural networks.

But words are not numbers. "Cat," "economy," "beautiful"——these are symbols that cannot be input as-is into neural network matrix calculations. To build an LLM, the fundamental problem of "how to convert words into numbers?" first had to be solved.

The simplest approach: assign numbers like "cat=1, dog=2, sky=3…". But this leaves no meaningful relationship between "cat" and "dog." The information that "cats and dogs are similar" and "cats and spaceships are far apart" is not encoded in the numbers.

Word2Vec: Meaning as Coordinates in Space (2013)

The idea shown by Mikolov et al. at Google: represent words as "coordinates (vectors)" in a high-dimensional space.

What does "high-dimensional" mean? The maps we use in daily life are 2D (east-west and north-south); inside a building, 3D (plus up-down). Word2Vec represents a single word with coordinates in hundreds of dimensions; 300 is a typical choice and the figure used in this article. Humans cannot imagine a 300-dimensional space, but mathematically, distances and directions can be calculated just as in 2D or 3D.

Why are as many as 300 dimensions needed? Because the meaning of language is multi-faceted. The word "cat" is characterized along countless axes: "Is it an animal?", "Is it a pet?", "How big?", "Is it dangerous?", "Is it cute?", and so on. Two or three dimensions can't express this richness of meaningful differences. With 300 dimensions, there is room to capture many different aspects of meaning (the learned dimensions rarely line up with neatly nameable axes like these; the list is an aid to intuition).

The training method is simple: learn so that "words appearing in similar contexts receive nearby coordinates." "Cat" and "dog" both appear in contexts like "pet," "food," "walk," so their coordinates end up close together.
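The intuition that similar contexts imply similar meaning can be sketched with simple counting. Note that this is not Word2Vec's actual training procedure, which learns vectors with a shallow neural network; the corpus and the `shared_contexts` measure here are hypothetical and only illustrate why "cat" and "dog" end up close.

```python
from collections import defaultdict

# Tiny hypothetical corpus; real training uses billions of words.
corpus = [
    "the cat eats food",
    "the dog eats food",
    "the cat is a pet",
    "the dog is a pet",
    "the spaceship reaches orbit",
    "the spaceship burns fuel",
]


def context_sets(sentences):
    """Map each word to the set of words that share a sentence with it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for word in words:
            contexts[word].update(w for w in words if w != word)
    return contexts


def shared_contexts(contexts, a, b):
    """Count the context words that two words have in common."""
    return len(contexts[a] & contexts[b])


contexts = context_sets(corpus)
cat_dog = shared_contexts(contexts, "cat", "dog")
cat_spaceship = shared_contexts(contexts, "cat", "spaceship")

print(cat_dog)        # 6: the, eats, food, is, a, pet
print(cat_spaceship)  # 1: only "the"
```

Word2Vec turns the same signal into dense coordinates instead of raw overlap counts.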

The "map of meaning" that Word2Vec learned had surprising properties:

"King" coordinates − "Man" coordinates + "Woman" coordinates ≈ "Queen" coordinates

Arithmetic of meaning is possible. Language became mathematics.
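The arithmetic can be tried on handcrafted toy vectors. The two axes ("royalty" and "gender") and all the values below are assumptions chosen so the example works exactly; real Word2Vec vectors are learned, have hundreds of dimensions, and satisfy the relation only approximately.

```python
import math

# Handcrafted 2D toy vectors: axis 0 = "royalty", axis 1 = "gender" (assumed axes).
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],  # unrelated distractor word
}


def cosine_similarity(a, b):
    """How closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest in direction to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


result = analogy("king", "man", "woman")
print(result)  # queen
```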


Side note: There is a predecessor to the idea that "meaning lies in relationships." In 1998, Stanford graduate students Larry Page and Sergey Brin described PageRank in their paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine". At the time, search engines ranked pages mainly by keyword frequency within pages, but PageRank measured a page's importance by how many other pages linked to it and how important those linking pages were, that is, by the structural relationships of the Web. The idea that "surrounding relationships, rather than content itself, express the essence" is strikingly similar to Word2Vec's "the meaning of a word is determined by surrounding context (co-occurring words)." And the Google born from this PageRank later established Google Brain, whose researchers co-authored the 2017 "Attention Is All You Need" (Transformer) paper. A search engine company invented the core architecture of LLMs: this too is one of the "unintended connections" we have been tracing.

How to Measure "Closeness"? — Cosine Similarity

We said "cats and dogs are close" and "cats and spaceships are far apart," but what does "close" mean in 300-dimensional space?

Intuitively, we might want to use the distance between two points (Euclidean distance). However, in high-dimensional space, direction captures semantic similarity better than distance.

Let's look at the reason with a concrete example. Suppose we have a long document and a short document both about "cats."

Short document: "cat" appears 5 times  → vector magnitude: small
Long document: "cat" appears 50 times → vector magnitude: large (extends 10x in the direction of the 5-occurrence document)

Euclidean distance between the two: large (far apart)
Direction both are pointing: the same

Both are "documents written about cats," yet when measured by distance, they are judged as "far apart." Measured by direction, they are "the same" — which is correct.

Cosine similarity measures how much two vectors point in the same direction.

Cosine similarity:
  Exactly the same direction → 1.0 (same meaning)
  Perpendicular (unrelated) → 0.0
  Opposite directions       → -1.0 (opposite meaning)
  Example (conceptual image):

  "dog" ↗    ← nearly the same direction as "cat" → cosine similarity ≈ 0.9
  "cat" →
  "spaceship" ↓  ← completely different direction → cosine similarity ≈ 0.1
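The short-document versus long-document example can be worked through in code. The two-axis count vectors below are hypothetical (counts of "cat" and of one other word); the long document is simply the short one scaled by 10.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical word-count vectors: [count of "cat", count of "sleep"].
short_doc = [5.0, 1.0]
long_doc = [50.0, 10.0]  # same topic, 10x longer

distance = euclidean_distance(short_doc, long_doc)
similarity = cosine_similarity(short_doc, long_doc)

print(distance)    # about 45.9: "far apart" by distance
print(similarity)  # 1.0 up to rounding: "the same" by direction

# The reference points from the table above.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])      # -1.0
```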


This is the foundation of today's RAG (Retrieval-Augmented Generation) and semantic search. A user's question is vectorized, cosine similarity is calculated against the vectors of documents in the database, and the document with the closest direction is retrieved. The idea of "measuring meaning by direction," which started with Word2Vec, has become the search engine of today's AI applications.
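The retrieval step described above can be sketched as follows. The document titles and the three-dimensional vectors are made up for illustration; a real system would obtain the vectors from an embedding model and usually search a vector database rather than a Python dictionary.

```python
import math


def cosine_similarity(a, b):
    """How closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical precomputed document embeddings.
documents = {
    "How to feed a kitten": [0.9, 0.1, 0.0],
    "Rocket engine basics": [0.0, 0.2, 0.9],
    "Dog training tips":    [0.7, 0.6, 0.1],
}

# Stands in for the embedding of a user's question about cats.
query_vector = [0.8, 0.2, 0.0]


def retrieve(query, docs, top_k=1):
    """Return the titles of the top_k documents closest in direction to the query."""
    ranked = sorted(
        docs,
        key=lambda title: cosine_similarity(query, docs[title]),
        reverse=True,
    )
    return ranked[:top_k]


best = retrieve(query_vector, documents)
print(best)  # ['How to feed a kitten']
```

In a RAG pipeline, the retrieved text would then be passed to the LLM together with the question.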

The technique of representing words and documents as vectors in this way is called embedding.

Why Embedding Is Indispensable for LLMs

The importance of Embedding goes beyond RAG and search. It is the very entry point through which LLMs process language.

The internals of an LLM are a neural network — a chain of matrix multiplications. Matrix multiplication requires numerical vectors. You cannot feed the string "cat" directly into a neural network.

When inputting the sentence "The cat is sleeping" into an LLM:

Step 1: Tokenization
  "The" "cat" "is" "sleeping" → [67, 4521, 12, 2201]  (illustrative IDs)

Step 2: Embedding (convert each token ID to a 300-dimensional vector)
  4521 → [0.12, -0.34, 0.56, ..., 0.78]  ← meaning vector for "cat"
  12   → [0.01, -0.02, 0.03, ..., 0.01]  ← meaning vector for "is"
  ...

Step 3: These vectors enter the neural network

Without Embedding, an LLM cannot even receive input. The technology of "converting words into meaningful vectors," established by Word2Vec, is built into the literal entry point — the first layer — of an LLM.
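The three steps can be sketched as a table lookup. Everything here is a simplification under stated assumptions: the vocabulary and IDs are hypothetical, the tokenizer just splits on spaces (real LLM tokenizers use subword units), and the table holds random numbers where a trained model would hold learned values.

```python
import random

EMBEDDING_DIM = 4  # tiny for readability; the text above uses 300

# Hypothetical vocabulary mapping each token to an ID.
vocab = {"The": 0, "cat": 1, "is": 2, "sleeping": 3}

random.seed(0)
# Embedding table: one row of EMBEDDING_DIM numbers per token ID.
# Random here; in a trained model these values are learned.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in vocab
]


def tokenize(text):
    """Step 1: split on whitespace and look up each token's ID."""
    return [vocab[word] for word in text.split()]


def embed(token_ids):
    """Step 2: replace each token ID with its row of the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = tokenize("The cat is sleeping")
vectors = embed(token_ids)  # Step 3: these vectors are what the network receives

print(token_ids)     # [0, 1, 2, 3]
print(len(vectors))  # 4 vectors, one per token
```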


Up to this point, we have solved the problem of "how to turn words into numbers." However, the next problem awaits. Simply converting each word into a vector says nothing about word order (context). "The dog chased the cat" and "The cat chased the dog" are built from exactly the same word vectors, yet their meanings are opposite.
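A short sketch shows the loss of word order. It assumes the simplest possible way of combining word vectors, summing them, and uses handcrafted integer vectors so the comparison is exact; both choices are illustrative, not how an LLM combines tokens.

```python
# Handcrafted toy word vectors (hypothetical values).
word_vectors = {
    "the":    [1, 0],
    "dog":    [9, 3],
    "chased": [2, 8],
    "cat":    [8, 4],
}


def sentence_vector(sentence):
    """Sum the word vectors: a bag-of-words representation that ignores order."""
    total = [0, 0]
    for word in sentence.lower().split():
        total = [t + v for t, v in zip(total, word_vectors[word])]
    return total


dog_chases = sentence_vector("The dog chased the cat")
cat_chases = sentence_vector("The cat chased the dog")

print(dog_chases)                # [21, 15]
print(cat_chases)                # [21, 15]
print(dog_chases == cat_chases)  # True: opposite meanings, identical vectors
```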

How do we handle context? This question is the theme of the next chapter. And the first researchers to seriously tackle this question were those working on machine translation.


Continued in Part 2: Chapter 5 traces how Attention emerged from the structural limitations of RNNs, and how the Transformer evolved "from a patch to an architecture." Then, through the scale revolution of GPT-3, alignment via RLHF, and the current era of efficiency — the second half of how LLMs were completed "without being designed."

