App 1: The Engine (Autograd from Scratch)
Build your own backpropagation engine in ~100 lines of Python. The secret sauce behind every neural network.
The Engine Under the Hood
Every neural network you have met in this book — the Committee, the Eyeball, the Attention Seeker — learns the same way: Gradient Descent.
But what actually computes the gradients? How does the computer know which knob to turn and by how much?
The answer is Autograd — automatic differentiation. And today, we are building one from scratch.

The Idea
Remember the Blind Hiker from ML 7? He takes a step, checks if he went downhill, and adjusts.
Autograd is the eyes for the Blind Hiker. It tells him exactly which direction is downhill, for every single parameter, all at once.
Here is the trick:
- You do a forward pass — feed data through the network, get a loss.
- Autograd traces every operation (add, multiply, etc.) into a graph.
- Then it walks backward through the graph (chain rule), computing how much each value contributed to the loss.
That is backpropagation. That is the entire magic.
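Before building anything, the three steps above can be sketched in a few lines of plain Python. Here the chain rule is applied by hand to y = tanh(w*x + b) and checked against a finite-difference estimate (the numbers and variable names are arbitrary, purely for illustration):

```python
import math

# Forward pass: y = tanh(w*x + b)
w, x, b = 0.5, 2.0, -1.0
pre = w * x + b          # intermediate node in the graph
y = math.tanh(pre)

# Backward pass by hand (chain rule):
# dy/dpre = 1 - tanh(pre)^2, and dpre/dw = x, so dy/dw = (1 - tanh(pre)^2) * x
dy_dw = (1 - math.tanh(pre)**2) * x

# Sanity check with a finite difference: nudge w, see how much y moves
eps = 1e-6
numeric = (math.tanh((w + eps) * x + b) - y) / eps

print(dy_dw, numeric)  # the two estimates agree to several decimal places
```

Autograd automates exactly this: it records `pre` as a node between `w` and `y`, then multiplies the local derivatives together, node by node, walking backward.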
Building a Value Class
We need a Value object that:
- Holds a number
- Remembers what created it (which operation, which inputs)
- Can compute gradients backward
```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # closure that pushes grad to the inputs
        self._prev = set(_children)     # the Values this one was computed from
        self._op = _op                  # which operation produced it (for debugging)

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1, so the gradient passes through
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # d(tanh x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data**other, (self,), f'**{other}')
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += (other * self.data**(other - 1)) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other**-1

    def __radd__(self, other):   # handles float + Value
        return self + other

    def __rmul__(self, other):   # handles float * Value
        return self * other

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0          # d(loss)/d(loss) = 1: the seed of backprop
        for v in reversed(topo):
            v._backward()
```

That is it. About 70 lines. This is the entire engine that powers neural networks.
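The one subtle part is the ordering in backward(): a node's _backward must fire before those of the nodes it was built from. The topological sort guarantees this. Here is the same idea on a standalone toy graph mirroring d = a*b + c, where e stands for the intermediate product a*b (the dict-based graph is just an illustration, not part of the engine):

```python
# Toy dependency graph: each node maps to the nodes it was computed from.
# Mirrors d = a*b + c, with e = a*b as the intermediate node.
graph = {'a': [], 'b': [], 'c': [], 'e': ['a', 'b'], 'd': ['e', 'c']}

topo, visited = [], set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in graph[v]:
            build_topo(child)
        topo.append(v)   # a node is appended only after all of its inputs

build_topo('d')
print(list(reversed(topo)))  # ['d', 'c', 'e', 'b', 'a']: outputs before inputs
```

Walking the reversed list means that by the time we reach e, the gradient flowing into it from d has already been accumulated, so e can correctly pass it on to a and b.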
Let's Test It
```python
# Build a tiny expression
a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c   # d = 2*(-3) + 10 = 4

d.backward()

print(f"d = {d}")            # Value(data=4.0000, grad=1.0000)
print(f"a.grad = {a.grad}")  # -3.0 (how much d changes if a changes)
print(f"b.grad = {b.grad}")  # 2.0
print(f"c.grad = {c.grad}")  # 1.0
```

a.grad = -3.0 means: "if you increase a by a tiny bit, d goes down by 3x that amount." That is a gradient. That is what PyTorch computes for millions of parameters.
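If you do not trust that -3.0, you can verify it numerically without any autograd at all: nudge a by a tiny eps and re-run the forward pass (a throwaway check in plain floats, with f standing in for the expression above):

```python
def f(a, b=-3.0, c=10.0):
    # the same expression as above: d = a*b + c
    return a * b + c

eps = 1e-6
a = 2.0
numeric_grad = (f(a + eps) - f(a)) / eps
print(numeric_grad)  # approximately -3.0, matching a.grad from the engine
```

This finite-difference trick is the standard way to sanity-check any autograd implementation: if the two numbers disagree, one of your _backward closures has a bug.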
Building a Neural Network on Top
Now we add a Neuron, Layer, and MLP — just like PyTorch:
```python
import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

Training It
```python
# A tiny dataset: 4 inputs, 4 targets
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # targets

# Create a network: 3 inputs -> 4 -> 4 -> 1 output
model = MLP(3, [4, 4, 1])

# Training loop
for step in range(100):
    # Forward pass
    predictions = [model(x) for x in xs]
    loss = sum((pred - y)**2 for pred, y in zip(predictions, ys))

    # Backward pass
    for p in model.parameters():
        p.grad = 0.0  # reset gradients
    loss.backward()

    # Update weights (gradient descent!)
    for p in model.parameters():
        p.data -= 0.05 * p.grad

    if step % 10 == 0:
        print(f"Step {step}, Loss: {loss.data:.4f}")

# Check predictions
for x, y in zip(xs, ys):
    print(f"Target: {y}, Predicted: {model(x).data:.4f}")
```

Run that. Watch the loss go down. Watch the predictions get closer to the targets.
You just trained a neural network on an engine you built yourself.
What PyTorch Does
PyTorch does exactly this — but faster, on GPUs, with thousands of operations, and for millions of parameters. When you call loss.backward() in PyTorch, it runs the same algorithm we just wrote.
The difference? PyTorch uses tensors (arrays of numbers) instead of individual scalars, and it is written in C++ for speed. But the idea is identical.
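To get a feel for the tensor version, here is the multiply rule from __mul__ applied to whole lists at once. This is a toy sketch of vectorization, not actual PyTorch code:

```python
# Scalar rule in our engine:  a.grad += b.data * out.grad
# With tensors, the same rule runs elementwise over many numbers at once.
a = [2.0, 4.0, 6.0]
b = [3.0, 5.0, 7.0]
out_grad = [1.0, 1.0, 1.0]   # upstream gradient, all ones

a_grad = [bi * g for bi, g in zip(b, out_grad)]
print(a_grad)  # [3.0, 5.0, 7.0], one rule applied to every element
```

One graph node now carries thousands of numbers, so the bookkeeping overhead of building and walking the graph is paid once per operation instead of once per scalar.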
What You Built
| Our Engine | PyTorch Equivalent |
|---|---|
| Value | torch.Tensor with requires_grad=True |
| _backward() | Autograd backward hooks |
| model.parameters() | model.parameters() (same name!) |
| Manual gradient descent loop | optimizer.step() |
Inspired by Andrej Karpathy's micrograd.
Next up: we use the real PyTorch to build a GPT.