App 1: The Engine (Autograd from Scratch)
Build your own backpropagation engine in ~100 lines of Python. The secret sauce behind every neural network.
The Engine Under the Hood
Every neural network you have met in this book — the Committee, the Eyeball, the Attention Seeker — learns the same way: Gradient Descent.
But what actually computes the gradients? How does the computer know which knob to turn and by how much?
The answer is Autograd — automatic differentiation. And today, we are building one from scratch.

The Idea
Remember the Blind Hiker from ML 7? He takes a step, checks if he went downhill, and adjusts.
Autograd is the eyes for the Blind Hiker. It tells him exactly which direction is downhill, for every single parameter, all at once.
Here is the trick:
- You do a forward pass — feed data through the network, get a loss.
- Autograd traces every operation (add, multiply, etc.) into a graph.
- Then it walks backward through the graph (chain rule), computing how much each value contributed to the loss.
That is backpropagation. That is the entire magic.
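Before building anything, the three steps above can be sketched in a few lines of plain Python. Here the chain rule is applied by hand to y = tanh(w*x + b) and checked against a finite-difference estimate (the numbers and variable names are arbitrary, purely for illustration):

```python
import math

# Forward pass: y = tanh(w*x + b)
w, x, b = 0.5, 2.0, -1.0
pre = w * x + b          # intermediate node in the graph
y = math.tanh(pre)

# Backward pass by hand (chain rule):
# dy/dpre = 1 - tanh(pre)^2, and dpre/dw = x, so dy/dw = (1 - tanh(pre)^2) * x
dy_dw = (1 - math.tanh(pre)**2) * x

# Sanity check with a finite difference: nudge w, see how much y moves
eps = 1e-6
numeric = (math.tanh((w + eps) * x + b) - y) / eps

print(dy_dw, numeric)  # the two estimates agree to several decimal places
```

Autograd automates exactly this: it records `pre` as a node between `w` and `y`, then multiplies the local derivatives together, node by node, walking backward.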
Building a Value Class
We need a Value object that:
- Holds a number
- Remembers what created it (which operation, which inputs)
- Can compute gradients backward
```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # closure that pushes grad to the inputs
        self._prev = set(_children)     # the Values this one was computed from
        self._op = _op                  # which operation produced it (for debugging)

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1, so the gradient passes through
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # d(tanh x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data**other, (self,), f'**{other}')
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += (other * self.data**(other - 1)) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other**-1

    def __radd__(self, other):   # handles float + Value
        return self + other

    def __rmul__(self, other):   # handles float * Value
        return self * other

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0          # d(loss)/d(loss) = 1: the seed of backprop
        for v in reversed(topo):
            v._backward()
```

That is it. About 70 lines. This is the entire engine that powers neural networks.
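The one subtle part is the ordering in backward(): a node's _backward must fire before those of the nodes it was built from. The topological sort guarantees this. Here is the same idea on a standalone toy graph mirroring d = a*b + c, where e stands for the intermediate product a*b (the dict-based graph is just an illustration, not part of the engine):

```python
# Toy dependency graph: each node maps to the nodes it was computed from.
# Mirrors d = a*b + c, with e = a*b as the intermediate node.
graph = {'a': [], 'b': [], 'c': [], 'e': ['a', 'b'], 'd': ['e', 'c']}

topo, visited = [], set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in graph[v]:
            build_topo(child)
        topo.append(v)   # a node is appended only after all of its inputs

build_topo('d')
print(list(reversed(topo)))  # ['d', 'c', 'e', 'b', 'a']: outputs before inputs
```

Walking the reversed list means that by the time we reach e, the gradient flowing into it from d has already been accumulated, so e can correctly pass it on to a and b.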
Let's Test It
```python
# Build a tiny expression
a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c   # d = 2*(-3) + 10 = 4

d.backward()

print(f"d = {d}")            # Value(data=4.0000, grad=1.0000)
print(f"a.grad = {a.grad}")  # -3.0 (how much d changes if a changes)
print(f"b.grad = {b.grad}")  # 2.0
print(f"c.grad = {c.grad}")  # 1.0
```

a.grad = -3.0 means: "if you increase a by a tiny bit, d goes down by 3x that amount." That is a gradient. That is what PyTorch computes for millions of parameters.
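If you do not trust that -3.0, you can verify it numerically without any autograd at all: nudge a by a tiny eps and re-run the forward pass (a throwaway check in plain floats, with f standing in for the expression above):

```python
def f(a, b=-3.0, c=10.0):
    # the same expression as above: d = a*b + c
    return a * b + c

eps = 1e-6
a = 2.0
numeric_grad = (f(a + eps) - f(a)) / eps
print(numeric_grad)  # approximately -3.0, matching a.grad from the engine
```

This finite-difference trick is the standard way to sanity-check any autograd implementation: if the two numbers disagree, one of your _backward closures has a bug.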
Building a Neural Network on Top
Now we add a Neuron, Layer, and MLP — just like PyTorch:
```python
import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

Training It
```python
# A tiny dataset: 4 inputs, 4 targets
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # targets

# Create a network: 3 inputs -> 4 -> 4 -> 1 output
model = MLP(3, [4, 4, 1])

# Training loop
for step in range(100):
    # Forward pass
    predictions = [model(x) for x in xs]
    loss = sum((pred - y)**2 for pred, y in zip(predictions, ys))

    # Backward pass
    for p in model.parameters():
        p.grad = 0.0  # reset gradients
    loss.backward()

    # Update weights (gradient descent!)
    for p in model.parameters():
        p.data -= 0.05 * p.grad

    if step % 10 == 0:
        print(f"Step {step}, Loss: {loss.data:.4f}")

# Check predictions
for x, y in zip(xs, ys):
    print(f"Target: {y}, Predicted: {model(x).data:.4f}")
```

Run that. Watch the loss go down. Watch the predictions get closer to the targets.
You just trained a neural network on an engine you built yourself.
What PyTorch Does
PyTorch does exactly this — but faster, on GPUs, with thousands of operations, and for millions of parameters. When you call loss.backward() in PyTorch, it runs the same algorithm we just wrote.
The difference? PyTorch uses tensors (arrays of numbers) instead of individual scalars, and it is written in C++ for speed. But the idea is identical.
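To get a feel for the tensor version, here is the multiply rule from __mul__ applied to whole lists at once. This is a toy sketch of vectorization, not actual PyTorch code:

```python
# Scalar rule in our engine:  a.grad += b.data * out.grad
# With tensors, the same rule runs elementwise over many numbers at once.
a = [2.0, 4.0, 6.0]
b = [3.0, 5.0, 7.0]
out_grad = [1.0, 1.0, 1.0]   # upstream gradient, all ones

a_grad = [bi * g for bi, g in zip(b, out_grad)]
print(a_grad)  # [3.0, 5.0, 7.0], one rule applied to every element
```

One graph node now carries thousands of numbers, so the bookkeeping overhead of building and walking the graph is paid once per operation instead of once per scalar.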
What You Built
| Our Engine | PyTorch Equivalent |
|---|---|
| Value | torch.Tensor with requires_grad=True |
| _backward() | Autograd backward hooks |
| model.parameters() | model.parameters() (same name!) |
| Manual gradient descent loop | optimizer.step() |
Inspired by Andrej Karpathy's micrograd.
Next up: we use the real PyTorch to build a GPT.