Andrej Karpathy · Neural Networks: Zero to Hero

Building micrograd: Backpropagation from Scratch

Name: The spelled-out intro to neural networks and backpropagation: building micrograd
Uploaded: 2026-06-24
Description: Karpathy builds a tiny automatic-differentiation engine (micrograd) one line at a time, so backpropagation stops being magic and becomes something you can read.

The spelled-out intro to neural networks and backpropagation: building micrograd (2:28:59)

1▶ 0:00
What backpropagation actually is
Backpropagation is just the chain rule applied recursively through a computation graph. Every value remembers how it was produced, so we can walk backwards and ask 'how much did this input nudge the output?' for every node at once.
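As a rough illustration of the chain-rule idea, the sketch below works one tiny expression by hand and compares it with a finite-difference estimate. The function f, the helper numerical_grad, and the input values are hypothetical choices made for this example, not code from the video.

```python
# Hypothetical example expression: L = a * b + c
def f(a, b, c):
    return a * b + c


def numerical_grad(fn, args, i, h=1e-6):
    """Central finite-difference estimate of d fn / d args[i]."""
    up = list(args)
    down = list(args)
    up[i] += h
    down[i] -= h
    return (fn(*up) - fn(*down)) / (2 * h)


a, b, c = 2.0, -3.0, 10.0

# Chain rule by hand, with the intermediate node d = a * b and L = d + c:
#   dL/dd = 1 (addition passes the gradient through)
#   dd/da = b, dd/db = a (multiplication uses the other operand)
dL_dd = 1.0
dL_da = dL_dd * b
dL_db = dL_dd * a
dL_dc = 1.0

# Compare each hand-derived gradient with the numerical estimate.
for i, (name, analytic) in enumerate([("a", dL_da), ("b", dL_db), ("c", dL_dc)]):
    estimate = numerical_grad(f, (a, b, c), i)
    print(f"dL/d{name}: analytic={analytic}, numerical={estimate:.6f}")
```

The finite-difference check is only a sanity test; the rest of the walkthrough replaces it with exact local derivatives stored on each node.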
2▶ 12:45
The Value object
Each number is wrapped in a Value that stores its data, the input Values that produced it (passed in as _children, kept as _prev), the operation that combined them (_op), and a gradient initialized to zero. Keeping these back-pointers is what later lets us replay the graph in reverse.
```
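class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data                # the wrapped scalar
        self.grad = 0.0                 # d(output)/d(this value), filled in by backward()
        self._backward = lambda: None   # leaf nodes have nothing to propagate
        self._prev = set(_children)     # the input Values that produced this one
        self._op = _op                  # label of the producing operation, e.g. '*'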
```
💡 Pro tip: Keep _prev as a set. When the same node appears twice as an input (as in a + a), it is stored once. Visiting each node only once during the backward pass is handled separately, by the visited set inside backward().
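To see what the graph bookkeeping looks like in practice, here is a minimal sketch that builds two small expressions and inspects _prev and _op. The forward-only __add__ and the __repr__ are assumptions added for this illustration; gradients are deliberately left out here.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op

    def __repr__(self):
        return f"Value(data={self.data})"

    # Forward-only addition (assumed for this sketch): records inputs and op label.
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')


a = Value(2.0)
b = Value(-3.0)

c = a + b   # two distinct inputs
d = a + a   # the same input used twice

print(c, c._op, c._prev)   # c remembers both a and b
print(d, d._op, d._prev)   # the set keeps a only once
print(a._prev)             # a leaf has no inputs
```

Value defines no __eq__, so Python hashes nodes by identity; two different nodes holding the same number stay distinct members of the set.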
3▶ 44:30
Local derivatives for + and *
Addition passes the gradient straight through to both inputs. Multiplication routes each input's gradient through the OTHER input's value. These two local rules, composed, cover most of a network.
$\frac{\partial ( a \cdot b )}{\partial a} = b, \frac{\partial ( a + b )}{\partial a} = 1$
- $a$ = first operand
- $b$ = second operand
```
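def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    def _backward():
        # chain rule: local derivative (the other operand) times the upstream gradient
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward  # stored now, called later during the backward pass
    return out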
```
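The addition rule can be written the same way. The sketch below pairs an assumed __add__ (mirroring the __mul__ shown in this section) with a manual call to _backward on a single node, so each local rule can be checked in isolation. The no-op default for _backward and the specific input values are assumptions of this example.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # assumed no-op default for leaf nodes
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # local derivative of a + b is 1 for both inputs
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # local derivative of a * b with respect to a is b, and vice versa
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out


# Multiplication: each input receives the other input's value.
a, b = Value(2.0), Value(-3.0)
prod = a * b
prod.grad = 1.0     # pretend prod is the final output
prod._backward()

# Addition: the gradient passes straight through to both inputs.
x, y = Value(2.0), Value(5.0)
total = x + y
total.grad = 1.0
total._backward()

print(a.grad, b.grad)
print(x.grad, y.grad)
```

Calling _backward by hand like this only propagates one step; chaining the steps in the right order is what the topological sort in the next section is for.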
4▶ 1:26:00
Topological order, then backward
To backpropagate, sort the graph topologically so every node comes after its inputs, set the output gradient to 1, then call each node's local _backward in reverse order, so a node's gradient is complete before it is passed on to its inputs. The += is essential: gradients accumulate when a node is used more than once. This procedure is reverse-mode automatic differentiation.
```
def backward(self):
    topo = []
    visited = set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)
    build(self)
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()
```
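Putting the pieces together, the following sketch assembles the snippets from this walkthrough into one small class and runs a full backward pass. It assumes a no-op default for _backward on leaf nodes and an __add__ written in the same style as __mul__; the expressions and numbers are made up for the demonstration.

```python
class Value:
    """Minimal scalar autograd node assembled from the snippets above."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # assumed no-op default for leaf nodes
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)   # appended only after all of its inputs
        build(self)
        self.grad = 1.0          # d(output)/d(output)
        for node in reversed(topo):
            node._backward()


# a feeds two operations, so its gradient has two contributions.
a = Value(2.0)
b = Value(-3.0)
out = a * b + a
out.backward()
# Analytically: d(out)/da = b + 1 and d(out)/db = a.
print(out.data, a.grad, b.grad)

# The same node used twice in one operation also accumulates.
x = Value(4.0)
twice = x + x
twice.backward()
print(x.grad)
```

If the += in _backward were replaced with =, the second contribution to a.grad would overwrite the first rather than add to it, and x.grad would come out as 1 instead of 2.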
5▶ 2:13:00
Training loop: the same four steps forever
Forward pass to get a loss, zero the gradients, backward pass to fill them, then nudge each parameter against its gradient. Every deep-learning framework is an elaboration of these four steps.
```
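for k in range(20):
    # 1. forward pass (xs is assumed to hold the training inputs, n the model)
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
    # 2. zero the gradients left over from the previous step
    for p in n.parameters():
        p.grad = 0.0
    # 3. backward pass
    loss.backward()
    # 4. update: step each parameter against its gradient
    for p in n.parameters():
        p.data += -0.05 * p.grad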
```
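💡 Pro tip: Forgetting to zero the gradients is a classic beginner bug. Because _backward uses +=, gradients from earlier iterations silently pile up, so every update steps along a stale, inflated gradient and training can stall or diverge.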
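The four steps can be tried end to end without an MLP. The sketch below swaps the network n for a hypothetical two-parameter linear model (w * x + b) fitted to made-up data from y = 2x + 1. The Value class here is a compact assumed version that also wraps plain numbers, and the iteration count of 200 is an arbitrary choice for this toy problem.

```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # assumed no-op default for leaf nodes
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)  # wrap plain numbers
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)  # wrap plain numbers
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


# Hypothetical toy data generated from y = 2x + 1.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [-1.0, 1.0, 3.0, 5.0]

w, b = Value(0.0), Value(0.0)
params = [w, b]

def forward():
    """Sum of squared errors of the linear model over the toy data."""
    total = Value(0.0)
    for x, y in zip(xs, ys):
        diff = w * x + b + (-y)     # prediction minus target
        total = total + diff * diff
    return total

losses = []
for k in range(200):
    loss = forward()                # 1. forward pass
    for p in params:                # 2. zero the gradients
        p.grad = 0.0
    loss.backward()                 # 3. backward pass
    for p in params:                # 4. step against the gradient
        p.data += -0.05 * p.grad
    losses.append(loss.data)

print(w.data, b.data, losses[0], losses[-1])
```

Commenting out the zeroing step is a quick way to watch the bug from the tip above: each iteration then adds its gradient on top of all previous ones.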

The takeaway

By the end you can hand-build a scalar autograd engine, wire it into a small MLP, and train it with gradient descent — the exact machinery underneath PyTorch, minus the tensors.
