But what is a neural network? | Deep learning chapter 1
Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
A simple digit-recognition neural network maps 784 pixel inputs to 10 digit scores using two hidden layers of 16 neurons each.
Briefing
Handwritten-digit recognition becomes feasible once a neural network is treated as a layered math machine: each “neuron” computes a weighted sum of earlier activations, adds a bias, then passes the result through a squashing function. In the simplest setup, a 28×28 grayscale image (784 pixel values) feeds a network with 16 neurons in each of two hidden layers and 10 output neurons—one per digit—so the network’s final decision is simply the output neuron with the highest activation. The striking part is that the same architecture can generalize across wildly different pixel patterns that humans still read as the same digit.
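To make the sizing concrete, here is a minimal NumPy sketch. The layer sizes come from the video; the output activations are random stand-ins (no trained network is involved), purely to illustrate the decision rule:

```python
import numpy as np

# Layer sizes from the video: 784 pixel inputs, two hidden layers of 16,
# and 10 outputs (one score per digit).
layer_sizes = [784, 16, 16, 10]

# Placeholder output activations -- in a real network these would come
# from a forward pass through trained weights, not a random generator.
rng = np.random.default_rng(0)
output_activations = rng.random(10)

# The network's final decision is simply the output neuron
# with the highest activation.
predicted_digit = int(np.argmax(output_activations))
print(predicted_digit)
```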
The structure starts with an input layer of 784 neurons, each holding a number between 0 and 1 representing a pixel’s brightness. Hidden layers then transform those numbers into new activations, and the output layer converts the final hidden representation into 10 activations interpreted as “how much the system thinks” the image matches each digit. The middle layers are where abstraction is expected to emerge: a digit like 9 might be decomposed into subcomponents such as loops and line segments, with earlier neurons responding to simpler features (like edges) and later neurons responding to combinations (like the loop-plus-line configuration that distinguishes 9 from 8 or 4). Even if the exact internal features aren’t guaranteed, the layered design provides a plausible path from raw pixels to meaningful parts.
To make that transformation concrete, each neuron in a hidden layer is connected to every neuron in the previous layer. Every connection has a weight, and each neuron also has a bias. The neuron computes a weighted sum of the previous layer’s activations; weights act like learned “dials” that emphasize certain pixel regions or patterns and suppress others. For instance, if a neuron’s weights are mostly positive in a region and negative around it, the weighted sum becomes large when the image contains a contrast pattern consistent with an edge. Because the raw weighted sum can be any real number, the network applies a nonlinearity to map it into a usable range. In this classic introductory version, that nonlinearity is the sigmoid (logistic) function, which pushes very negative inputs toward 0 and very positive inputs toward 1, with a smooth transition around 0. The bias shifts the threshold at which the neuron begins to activate meaningfully.
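A sketch of that single-neuron computation, assuming NumPy and random stand-in weights rather than learned values:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
prev_activations = rng.random(784)   # pixel brightnesses in [0, 1]
weights = rng.normal(size=784)       # one weight per incoming connection
bias = -10.0                         # shifts where the neuron starts to activate

# Weighted sum of the previous layer's activations, plus bias, then squash.
activation = sigmoid(weights @ prev_activations + bias)
print(activation)  # a single number between 0 and 1
```

A strongly negative bias like the one above means the weighted sum must clear a high threshold before the activation moves meaningfully away from 0.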
With two hidden layers of 16 neurons each, the model contains roughly 13,000 trainable parameters (weights and biases). “Learning” means finding a setting for these parameters that makes the network output the correct digit for many examples—an alternative to manually engineering edge detectors and digit-combiners by hand. Mathematically, the layer-to-layer computation is compactly expressed using matrix-vector products: activations form vectors, weights form matrices, biases form vectors, and the sigmoid is applied component-wise. This structure turns a complicated overall mapping—from 784 inputs to 10 outputs—into repeated, efficient linear algebra operations.
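The same computation in compact matrix-vector form, a = σ(Wa + b) applied layer by layer, again with random stand-in parameters. The script also verifies the parameter count: 784×16 + 16×16 + 16×10 = 12,960 weights plus 16 + 16 + 10 = 42 biases, for 13,002 in total:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

layer_sizes = [784, 16, 16, 10]
rng = np.random.default_rng(0)

# One weight matrix and one bias vector per layer transition.
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in layer_sizes[1:]]

# Total trainable parameters: 12,960 weights + 42 biases = 13,002.
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print(n_params)  # 13002

def forward(a):
    """Apply a = sigmoid(W a + b) for each layer in turn."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

scores = forward(rng.random(784))  # 10 digit scores
print(int(np.argmax(scores)))      # predicted digit
```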
A final reality check comes from a discussion with Lisha Li: while sigmoid was historically used to model neuron activation, modern deep networks often favor ReLU (rectified linear unit) because it trains more easily. ReLU outputs zero for negative inputs and passes positive inputs through unchanged, and it has become the go-to choice for enabling very deep architectures to learn effectively.
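For comparison, a minimal sketch of the two nonlinearities side by side:

```python
import numpy as np

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Smooth squashing into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(xs))  # smooth values between 0 and 1
```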
Cornell Notes
The core idea is that a neural network is a layered function that converts pixel values into digit scores. Each neuron computes a weighted sum of activations from the previous layer, adds a bias, and then applies a nonlinearity (in this intro version, sigmoid) to produce an activation between 0 and 1. For handwritten digits, a 28×28 image becomes 784 inputs, which feed two hidden layers of 16 neurons, ending in 10 output neurons whose largest activation indicates the predicted digit. The layered design is motivated by the hope that early layers capture simple features like edges, while later layers combine them into digit-specific patterns. Learning is the process of adjusting about 13,000 weights and biases so the network outputs the correct digit from data.
- Why does the architecture use layers instead of trying to map pixels directly to digits?
- What exactly does a single neuron compute in this model?
- How do weights and biases relate to detecting patterns like edges?
- What does “learning” mean in this context?
- How is the layer-to-layer computation written more compactly?
- Why does the discussion shift from sigmoid to ReLU?
Review Questions
- In the 28×28 digit model, how many input neurons and output neurons are used, and what do their activations represent?
- Describe the role of weights versus biases in a neuron’s computation, including how sigmoid changes the output range.
- Explain how matrix-vector multiplication corresponds to moving activations from one layer to the next.
Key Points
1. A simple digit-recognition neural network maps 784 pixel inputs to 10 digit scores using two hidden layers of 16 neurons each.
2. Each neuron computes a weighted sum of previous activations, adds a bias, then applies a nonlinearity to produce an activation between 0 and 1 (sigmoid in this intro).
3. Layering is motivated by the expectation that early layers can capture simpler features (like edges) while later layers combine them into digit-level patterns.
4. Weights act like learned pattern detectors by emphasizing some input regions and suppressing others; biases shift the activation threshold.
5. The model’s capacity comes from roughly 13,000 trainable parameters (weights and biases) that must be tuned during learning.
6. Layer transitions can be expressed compactly as matrix-vector products plus bias vectors, followed by element-wise nonlinearities.
7. Modern practice often replaces sigmoid with ReLU because ReLU is easier to train and supports very deep networks.