Andrej Karpathy · Neural Networks: Zero to Hero
Building micrograd: Backpropagation from Scratch
Karpathy builds a tiny automatic-differentiation engine (micrograd) one line at a time, so backpropagation stops being magic and becomes something you can read.
The spelled-out intro to neural networks and backpropagation: building micrograd (2:28:59)
What backpropagation actually is
Backpropagation is just the chain rule applied recursively through a computation graph. Every value remembers how it was produced, so we can walk backwards and ask 'how much did this input nudge the output?' for every node at once.
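As a small illustration of the chain-rule claim, the sketch below differentiates one expression by hand, walking backwards from the output, and compares the result with a finite-difference estimate. The expression L = (a * b + c) * f and the input values are chosen purely for illustration; they are assumptions of this example, not part of the engine.

```python
# Illustrative expression (an assumption of this sketch): L = (a * b + c) * f
def forward(a, b, c, f):
    e = a * b      # intermediate node
    d = e + c      # intermediate node
    return d * f   # output L

a, b, c, f = 2.0, -3.0, 10.0, -2.0

# Chain rule, applied from the output backwards one node at a time.
dL_dd = f              # L = d * f  ->  dL/dd = f
dL_de = dL_dd * 1.0    # d = e + c  ->  dd/de = 1
dL_da = dL_de * b      # e = a * b  ->  de/da = b

# Finite-difference estimate of the same derivative.
h = 1e-6
numeric = (forward(a + h, b, c, f) - forward(a, b, c, f)) / h

print(dL_da, numeric)
```

The two numbers should agree closely; an autograd engine automates exactly the backward bookkeeping done by hand here.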
The Value object
Each number is wrapped in a Value that stores its data, the child nodes (inputs) that produced it, and a gradient initialized to zero. Keeping a reference to those inputs is what later lets us replay the graph in reverse.
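    class Value:
        def __init__(self, data, _children=(), _op=''):
            self.data = data
            self.grad = 0.0                 # d(output)/d(this value), filled in by the backward pass
            self._backward = lambda: None   # leaf nodes have nothing to propagate
            self._prev = set(_children)     # the input nodes that produced this value
            self._op = _op                  # label of the operation, for inspection

💡 Pro tip: Keep _prev as a set. It collapses duplicate inputs such as a + a into a single entry. Visiting each node only once during the backward pass is handled separately, by the visited set in the topological sort.
Local derivatives for + and *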
Addition passes the gradient straight through to both inputs. Multiplication routes each input's gradient through the OTHER input's value. These two local rules, composed, cover most of a network.
- self = first operand
- other = second operand
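For comparison with the multiplication rule, here is a minimal sketch of the addition rule in isolation. It assumes the same Value fields described earlier plus a no-op default for _backward; the sample values are illustrative only.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # assumed default: leaves propagate nothing
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # The local derivative of a sum is 1 for each input,
            # so the output gradient passes straight through.
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

a = Value(2.0)
b = Value(-3.0)
c = a + b
c.grad = 1.0      # seed the output gradient by hand
c._backward()
print(a.grad, b.grad)

# The same node on both sides: the two += updates land on one object.
x = Value(4.0)
y = x + x
y.grad = 1.0
y._backward()
print(x.grad)
```

The second case shows why the updates use += rather than plain assignment: with x + x, both contributions must be summed to give the correct derivative of 2.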
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # Each input's gradient is the other input's value times the output gradient.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

Topological order, then backward
To backpropagate, sort the graph topologically so every node comes after its inputs, set the output gradient to 1, then call each node's local _backward in reverse. The += is essential: gradients accumulate when a node is used more than once.
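To make the accumulation point concrete, the following self-contained sketch runs the whole procedure on a graph where one node is used twice. The class is a pared-down stand-in for the Value described above, and the expression c = a * b + a is an illustrative assumption.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)   # appended only after all of its inputs
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a = Value(2.0)
b = Value(-3.0)
c = a * b + a    # a feeds both the product and the sum
c.backward()
print(a.grad, b.grad)   # dc/da = b + 1, dc/db = a
```

Here a receives one contribution through the product and one through the sum. If _backward assigned with = instead of +=, one of the two would be overwritten.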
Reverse-mode automatic differentiation

    def backward(self):
        topo = []
        visited = set()

        def build(v):
            # Depth-first walk: a node is appended only after all of its inputs.
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0                 # d(output)/d(output) = 1
        for node in reversed(topo):     # output first, inputs last
            node._backward()

Training loop: the same four steps forever
Forward pass to get a loss, zero the gradients, backward pass to fill them, then nudge each parameter against its gradient. Every deep-learning framework is an elaboration of these four lines.
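    for k in range(20):
        # forward pass (xs is assumed to hold the inputs paired with the targets ys)
        ypred = [n(x) for x in xs]
        loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

        # zero the gradients
        for p in n.parameters():
            p.grad = 0.0

        # backward pass
        loss.backward()

        # update: step each parameter against its gradient
        for p in n.parameters():
            p.data += -0.05 * p.grad

💡 Pro tip: Forgetting to zero the gradients is the #1 beginner bug: old gradients silently accumulate across iterations, so each update uses a stale sum and training can diverge.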
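The effect of skipping the zeroing step can be seen on a single parameter. This is a minimal sketch using a stripped-down Value with only multiplication; the names w and x and their values are hypothetical.

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

w, x = Value(3.0), Value(2.0)   # hypothetical parameter and input

(w * x).backward()
first = w.grad                  # correct gradient: d(w*x)/dw = x

(w * x).backward()              # second pass WITHOUT zeroing
stale = w.grad                  # the old gradient is still in there

w.grad = 0.0                    # zero, then run the pass again
(w * x).backward()
fresh = w.grad

print(first, stale, fresh)
```

The middle value is twice the correct gradient, because the += in _backward adds to whatever was left over from the previous pass.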
The takeaway
By the end you can hand-build a scalar autograd engine, wire it into a small MLP, and train it with gradient descent — the exact machinery underneath PyTorch, minus the tensors.