If you’ve ever stared at a tiny autograd engine and thought, “This is elegant, but how do I make it run like the vectorized stacks in PyTorch or TensorFlow?”, this project is going to scratch that itch. At AI Tech Inspire, we spotted a compact reimagining of Andrej Karpathy’s micrograd—rebuilt around NumPy tensors and carefully engineered to handle broadcasting in the backward pass. It keeps the clarity developers love while nudging the design toward real-world vectorized workloads.
Key facts at a glance
- Revisits Andrej Karpathy’s micrograd, but targets vectorized NumPy tensors instead of Python floats.
- Implements gradient calculation for NumPy tensors, including correct handling of broadcasting in the backward pass.
- Builds an autodiff and neural network library analogous to micrograd, but tensor-first and vectorized.
- Demonstrates a CNN on MNIST with 97%+ accuracy.
- Code available at: https://github.com/gumran/mgp. Feedback is invited by the author.
Why vectorizing micrograd-style engines matters
Karpathy’s micrograd is a masterclass in clarity: it shows reverse-mode autodiff with just Python floats and a lightweight computation graph. But production-grade deep learning leans heavily on vectorization. Moving from scalar nodes to NumPy tensors unlocks:
- Performance via batch operations and SIMD-optimized routines under the hood.
- Cleaner model definitions that operate on batches, channels, and spatial dimensions.
- Faithful alignment with how modern frameworks model shapes and broadcasting.
That last point—broadcasting—sounds trivial until you have to get the gradients right. This project tackles that head-on.
The core idea: a micrograd-like engine, but with tensors
At a high level, the library wraps NumPy arrays in a Tensor-like object carrying data, grad, and a handle to the backward function that produced it. Operators (+, *, matmul, conv, etc.) create new nodes in the graph. Calling .backward() walks the graph in reverse topological order, accumulating gradients along the way—just like micrograd, but at tensor granularity.
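The overall shape of such an engine fits in a few dozen lines. Here is a minimal sketch of the pattern described above; the names (`Tensor`, `_backward`, `parents`) are illustrative and may not match the repo's actual API:

```python
import numpy as np

class Tensor:
    """Minimal autodiff node: wraps a NumPy array plus gradient plumbing."""
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._parents = parents
        self._backward = lambda: None  # set by the op that created this node

    def __add__(self, other):
        out = Tensor(self.data + other.data, parents=(self, other))
        def _backward():
            # Placeholder: a broadcast-aware version would first reduce
            # out.grad back to each input's original shape (see below).
            self.grad = self.grad + out.grad
            other.grad = other.grad + out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Reverse topological order: parents are visited after all children.
        topo, seen = [], set()
        def build(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t._parents:
                    build(p)
                topo.append(t)
        build(self)
        self.grad = np.ones_like(self.data)  # d(loss)/d(loss) = 1
        for t in reversed(topo):
            t._backward()
```

Every operator follows the same shape: compute forward data, record parents, and attach a closure that routes the upstream gradient.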
In a scalar world, backprop rules are straightforward. In a tensor world, there’s shape semantics to respect: dimensions might be broadcast, reduced, or expanded. The data path is easy—NumPy handles broadcasting automatically. The gradient path is where things get interesting.
Key takeaway: when a tensor was broadcast in the forward pass, its gradient must be summed back over the broadcasted dimensions to match the original shape.
Broadcasting in the backward pass: where many homegrown engines stumble
Consider y = x + b where x has shape (N, D) and b has shape (D,). NumPy implicitly broadcasts b across the batch dimension. Forward is simple; backward requires care:
- `dy/dx = 1` elementwise, so `grad_x = grad_y` (same shape as `x`).
- `dy/db = 1` for each row, but `b` was broadcast. Therefore `grad_b` must sum `grad_y` across the batch axis to get back to shape `(D,)`.
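In plain NumPy, the rule for this example looks like the following sketch (shapes are the ones from the text; the finite-difference check at the end is just to convince yourself the summation is right):

```python
import numpy as np

np.random.seed(0)
N, D = 4, 3
x = np.random.randn(N, D)
b = np.random.randn(D)          # broadcast across the batch in x + b
grad_y = np.ones((N, D))        # upstream gradient, same shape as y

grad_x = grad_y                 # shape (N, D): broadcasting didn't touch x
grad_b = grad_y.sum(axis=0)     # shape (D,): sum over the broadcast batch axis

# Finite-difference check on one element of b confirms the rule:
eps = 1e-6
b_pert = b.copy()
b_pert[0] += eps
numeric = ((x + b_pert).sum() - (x + b).sum()) / eps
assert abs(numeric - grad_b[0]) < 1e-4
```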
Generalizing this, a reliable pattern is a `sum_to_shape(g, shape)` helper that:
- Reduces over axes introduced by broadcasting (e.g., where `shape` had 1, or where dims were absent),
- Applies `keepdims` as needed during reduction, then reshapes to the target `shape`.
Many developers rediscover this rule the hard way: gradients explode or silently mis-shape when reduction axes aren’t handled consistently. This project bakes that logic into each op’s backward, so you can write vectorized models with confidence.
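A sketch of such a helper, following the two-step pattern above (`sum_to_shape` is the name used in this article; the repo's utility may differ):

```python
import numpy as np

def sum_to_shape(g, shape):
    """Reduce gradient g back to `shape`, undoing NumPy broadcasting."""
    # 1) Sum away leading axes that broadcasting prepended.
    while g.ndim > len(shape):
        g = g.sum(axis=0)
    # 2) Sum (with keepdims) over axes where the original dim was 1
    #    but was stretched during the forward pass.
    for axis, dim in enumerate(shape):
        if dim == 1 and g.shape[axis] != 1:
            g = g.sum(axis=axis, keepdims=True)
    return g.reshape(shape)

g = np.ones((4, 3))
print(sum_to_shape(g, (3,)))    # [4. 4. 4.]
print(sum_to_shape(g, (1, 3)))  # [[4. 4. 4.]]
```

Each binary op's backward then calls this helper once per input, and broadcasting "just works" for arbitrary shapes.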
Layers and losses: familiar building blocks, minimal baggage
The library includes the usual suspects—activations like relu, linear layers, and convolutional blocks—implemented in a way that mirrors small, readable engines. Because everything composes through the same autodiff graph, extending it tends to follow a consistent recipe:
- Define the forward via NumPy ops,
- Capture whatever intermediates are needed for backward (e.g., saved tensors, masks),
- Implement `backward` with shape-aware reductions for any broadcasted inputs.
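The three-step recipe can be illustrated with a standalone ReLU, written here as plain forward/backward functions rather than the repo's actual op classes:

```python
import numpy as np

def relu_forward(x):
    # Step 1: forward via NumPy ops.
    out = np.maximum(x, 0.0)
    # Step 2: capture what backward needs (here, a mask of active units).
    cache = (x > 0.0)
    return out, cache

def relu_backward(grad_out, cache):
    # Step 3: route the upstream gradient. ReLU is elementwise with no
    # broadcasting, so no shape-aware reduction is needed here; an op
    # like `add` or `mul` with mixed shapes would apply one.
    return grad_out * cache

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
out, cache = relu_forward(x)
grad_in = relu_backward(np.ones_like(x), cache)
# grad_in is 1 where x > 0 and 0 elsewhere.
```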
If you’ve ever prototyped a new operation for research or education, this design aims for that sweet spot: simple enough to read in a sitting, but vectorized enough to test on real data.
A compact CNN hits 97%+ on MNIST
To show the engine’s viability beyond toy examples, the author trains a small CNN on MNIST and reports over 0.97 accuracy. On a dataset as well-trodden as MNIST, that’s not a leaderboard bid—but it’s a strong proof that a micrograd-inspired, NumPy-tensor stack can power meaningful models end-to-end.
What likely sits inside that CNN:
- Convolutional layers operating on `(N, C, H, W)` tensors,
- `relu` non-linearities,
- Pooling or strided convs,
- One or more fully connected layers before the classifier head.
For developers, the real value is educational: you can trace gradients, inspect grad buffers, and verify broadcasting logic in a network that actually learns.
Why this angle is worth your time
There are several established ways to build and train models today, from Hugging Face model hubs to custom stacks atop PyTorch. So why explore a minimalist NumPy autodiff engine?
- Understand the craft: Broadcasting-aware gradients are foundational knowledge for debugging nontrivial models.
- Prototype new ops: Test-drive custom layers without wrestling a large framework’s extension APIs.
- Teach and learn: Perfect for workshops, study groups, or personal deep dives into reverse-mode autodiff.
- Bridge to acceleration: A clean NumPy core can be a stepping stone to GPU variants via libraries like CUDA-backed ecosystems.
Practical tips: what to look for in the code
- `Tensor` structure: Note how it stores `data`, `grad`, and references to parents and `_backward` closures.
- Topological ordering: Backprop typically starts from a scalar loss; ensure the graph walk visits each node once and accumulates gradients with `+=` semantics.
- Broadcast-aware reductions: Find the utility that maps `grad` to the original input shape. Look for patterns like `while g.ndim > len(shape): g = g.sum(axis=0)` paired with axis-wise sums where `shape[i] == 1`.
- Numerical stability: Check losses (e.g., log-softmax + NLL) for safe computations that avoid `log(0)` and large intermediate magnitudes.
- Convolution details: Inspect how im2col-like transforms or direct loops are handled; vectorization choices can influence both clarity and speed.
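On the numerical-stability point, the standard trick is to shift logits by their row max before exponentiating. A sketch (not necessarily how the repo writes it):

```python
import numpy as np

def log_softmax(z):
    # Subtracting the row max makes exp() safe from overflow, and the
    # log-sum-exp is taken over shifted values, so log(0) never occurs.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll_loss(logits, targets):
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

logits = np.array([[1000.0, 0.0], [0.0, 1000.0]])  # naive softmax overflows
loss = nll_loss(logits, np.array([0, 1]))
# loss is finite and near 0: the correct class dominates each row.
```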
How to try it
Getting started is straightforward. Clone the repository and explore the examples and tests:
```shell
git clone https://github.com/gumran/mgp
```
From there, run a simple training script or open a notebook to step through forward and backward passes. If you’re new to autodiff internals, sprinkle print statements on grad buffers or add gradient checks by comparing to finite differences on small tensors.
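A generic finite-difference gradient check looks like this; it works for any scalar-valued function of a small tensor, so you can compare it against whatever analytic gradients the engine produces (the helper name is illustrative):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of scalar-valued f at x (small tensors only)."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps
        fp = f(x)
        x[idx] = orig - eps
        fm = f(x)
        x[idx] = orig  # restore
        g[idx] = (fp - fm) / (2 * eps)
    return g

# Example: loss = sum((x @ w)**2); analytic grad wrt w is 2 * x.T @ (x @ w).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
w = rng.standard_normal((3, 2))
analytic = 2 * x.T @ (x @ w)
numeric = numerical_grad(lambda w_: ((x @ w_) ** 2).sum(), w)
assert np.allclose(analytic, numeric, atol=1e-4)
```

This is O(n) forward passes per tensor, so keep it to tiny shapes; it is a debugging tool, not a training path.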
Comparisons and next steps
How does this differ from heavyweight frameworks?
- Scope: It’s intentionally compact—ideal for learning and prototyping—not a drop-in replacement for PyTorch or TensorFlow.
- Device support: NumPy typically runs on CPU. If you crave GPU, one natural path is exploring CuPy or JIT-powered stacks like JAX (different API philosophy, but similar taste for vectorization).
- Ecosystem: You won’t get a zoo of pretrained models. The trade-off is transparency—you control every op and gradient.
Questions we’re asking at AI Tech Inspire as we explore the codebase:
- How easily can new ops be added while preserving broadcasting correctness?
- Could mixed precision or simple graph optimizations be introduced without bloating the design?
- What’s the cleanest pattern for parameter management, optimizers, and regularization while keeping the learning curve low?
Bottom line
If micrograd made you appreciate the elegance of reverse-mode autodiff, this NumPy-tensor take shows how to push that elegance into vectorized territory. It respects the mental model developers already use in mainstream frameworks while keeping the code compact enough to audit in an afternoon.
Curious? Check the repo at github.com/gumran/mgp, skim the broadcasting-aware backward implementations, and run the MNIST CNN. You might come away with fresh intuition—and a lightweight playground for your next idea.