I'm unreasonably excited just typing this, because this week I've had countless conversations with myself about it. But now I get to have a more structured conversation— with you. Welcome.
There's a lot to compress into this post, because it's been a week of learning. And I don't mean just new skills; I also mean learning the very nature of how I learn.
On Learning
I've been thinking a lot about this quote (people who have taken Practical Deep Learning for Coders by Jeremy Howard will be familiar with this idea):
"You don't teach someone baseball by handing them a textbook on the physics of parabolic motion."
That idea stuck with me because I'm genuinely torn between a deep, low-level understanding and chasing the thrill of high-level deployment. So I sit down for hours, grinding through linear algebra, working out backpropagation by hand. It's intense and exhausting. But it's also weirdly satisfying; there's a kind of beauty in visualizing the math and seeing how it all fits together.
And yet... it always feels a little distant from the real world.
Then I actually build something. A high-level deployment, maybe an image classifier using Gradio and Hugging Face. And at first, it feels almost too easy. A few lines of code, a slick interface, and it works — eventually. A bug hits every now and again and you descend into temporary madness. But strangely, that's when it starts to feel real. Because once it does work, and you see it in action, and you start thinking of all the clever ways you can use it, it starts to feel exciting.
A mathematician named Paul Lockhart put it even more scathingly in his essay A Mathematician's Lament. He imagines a world where kids aren't allowed to paint until they've memorized pigment formulas, and music is off-limits until they've done years of harmony analysis. And that's basically how we treat math and science. Theory now, joy later (if ever).
That vision came flooding back this week as I zigzagged between two very different approaches to understanding neural networks:
- A low-level mathematical breakdown: pure theory, equations, backpropagation, the inner workings of each layer.
- A high-level, practical deployment: using Gradio and Hugging Face to actually build and host a working image classifier that can do something.
So let's learn how a neural network is able to tell a dog from a cat. In 2 Parts.
PART 1: Low Level
Convolution
Put simply, convolution is a mathematical operation that combines two functions to produce a third. For a more intuitive understanding, I recommend this video by 3Blue1Brown. In the context of images, one function is the input image and the other is a small kernel (or filter). You slide the kernel across the image and compute the dot product at every location. The result is a new image (called a feature map) that highlights certain features of the original image.
Think of it like this: if the input image is a massive haystack, convolution helps you design tiny, reusable magnets (kernels) to efficiently scan for needles (features like edges or curves).
Given an image I and a kernel K of size F×F, convolution is defined (in 2D) as:
S(i, j) = ∑_{u=1}^{F} ∑_{v=1}^{F} K(u, v) · I(i + u − 1, j + v − 1)
Where:
- S(i, j): The output feature map at location (i, j).
- I: The input image.
- K: The kernel (filter).
- F: The kernel size (e.g., 3 for a 3×3 kernel).
You're essentially computing a weighted sum of pixels in the local neighborhood around (i, j).
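To make that concrete, here's a minimal NumPy sketch of that weighted sum (the function name conv2d is my own; note that deep learning libraries actually compute this cross-correlation form and still call it convolution):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide an F x F kernel over a 2D image, taking the weighted sum at each position."""
    H, W = image.shape
    F = kernel.shape[0]                      # assumes a square F x F kernel
    out = np.zeros((H - F + 1, W - F + 1))   # 'valid' output: no padding
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + F, j:j + F])
    return out
```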
Famous Kernels in Computer Vision (Before Deep Learning Took Over)
Before CNNs were learning kernels automatically, people used handcrafted kernels for basic image processing. Some notable examples:
| Name | Purpose | Kernel Example |
|---|---|---|
| Sobel | Edge detection | `[-1 0 1; -2 0 2; -1 0 1]` |
| Gaussian | Blurring/smoothing | `[1 2 1; 2 4 2; 1 2 1]` (normalized) |
| Laplacian | Second derivative; detects edges and changes in intensity | `[0 -1 0; -1 4 -1; 0 -1 0]` |
| Emboss | Gives a 3D shadow effect | `[-2 -1 0; -1 1 1; 0 1 2]` |
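As a quick illustration, here's the Sobel kernel from the table applied to a stand-in image (this reuses the hypothetical conv2d sketch from earlier; img here is just random noise standing in for a real grayscale photo):

```python
import numpy as np

# Sobel kernel from the table: responds to left-to-right intensity changes
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

img = np.random.rand(64, 64)     # stand-in for a real grayscale image
edges = conv2d(img, sobel_x)     # conv2d is the sketch defined above
print(edges.shape)               # (62, 62): large values mark vertical edges
```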
CNNs essentially learn their own custom kernels through training, ones that aren't just for edges or textures, but specialized for things like "cat ear with 3-pixel tilt" or "whisker-like curve."
In a typical deep learning model, multiple such kernels are used at each layer. Each kernel detects a different pattern. Instead of being handcrafted, the model learns the values in each kernel during training.
So with, say, 32 kernels scanning a 64×64 image, you end up with 32 filtered versions of the original image, each one highlighting a different type of feature. These 32 filtered images are called feature maps.
This feature detection process continues layer by layer, with deeper layers detecting more abstract features (first layer: edges; second layer: corners and textures; later layers: eyes, paws, whiskers, etc.).
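In PyTorch terms, a layer with 32 learnable kernels looks roughly like this (a sketch, assuming a single-channel 64×64 input):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)  # 32 learnable 3x3 kernels
x = torch.randn(1, 1, 64, 64)   # a batch containing one 64x64 grayscale image
feature_maps = conv(x)
print(feature_maps.shape)       # torch.Size([1, 32, 64, 64]) -> 32 feature maps
```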
ReLU Function
The concept of activation functions has been discussed a lot throughout this series, especially in the introduction to deep learning blog.
After a convolution operation (i.e. filtering the image for features), you're left with a bunch of numerical values; some positive, some negative. But here's the thing:
Without activation functions, your neural network is just stacking a bunch of linear equations. And guess what? A stack of linear equations is still just... one big linear equation.
That's where activation functions come in, and ReLU is the reigning champion (especially between the hidden layers).
What is ReLU?
ReLU stands for Rectified Linear Unit, and it's defined by a simple rule:
f(x) = max(0, x)
In other words:
- If the value is positive, keep it.
- If it's negative, squash it to zero.
That's it. Just hard pruning.
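Here's that hard pruning on a tiny made-up feature map, just to see it in action:

```python
import torch
import torch.nn.functional as F

fmap = torch.tensor([[ 1.5, -0.3],
                     [-2.0,  0.7]])
print(F.relu(fmap))   # negatives squashed to zero, positives kept
# tensor([[1.5000, 0.0000],
#         [0.0000, 0.7000]])
```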
Why Use It?
- Non-Linearity: ReLU introduces non-linearities into the model, allowing the network to learn complex, non-linear mappings from inputs to outputs (like distinguishing between a cat and a pug in bad lighting).
- Computationally Cheap: Compared to sigmoid or tanh, ReLU is faster; it's just a comparison with zero.
- Sparse Representations: By zeroing out negative values, it creates sparsity in the network. Sparse models are often more efficient and generalize better.
- Helps Avoid Vanishing Gradients (kind of): Unlike sigmoid or tanh, which squish inputs into narrow ranges, ReLU doesn't compress large inputs. This means gradients during backpropagation stay healthier (though it has its own issue: the "dying ReLU" problem).
Pooling
After filtering with convolutions and applying ReLU to keep only the strong, positive signals, you're left with feature maps that are rich but still quite large. With dozens of kernels per layer, the total amount of data grows well beyond the original image, even though only a handful of regions contain the features the network actually cares about. So you can "pool" those strong responses and get rid of all the whitespace.
Pooling achieves that by shrinking things down intelligently.
What Is Pooling?
Pooling is a downsampling technique. It reduces the size of the feature maps while retaining the most important information. The goal is:
- Less data to process
- Less chance of overfitting
- More robustness to small translations or distortions (e.g., the cat moved a little)
How Does It Work?
The most common type is Max Pooling.
Max Pooling:
You slide a small window (e.g. 2×2 or 3×3) over the feature map and pick the maximum value inside that window.
If your feature map looks like this:
```
[1 3 2 4
 5 6 1 2
 7 8 9 4
 3 2 1 0]
```
...and you apply a 2×2 max pooling with stride 2 (i.e. no overlap), you'll get:
```
[6 4
 8 9]
```
This kind of pooling only notices the loudest parts of the image.
Equation:
Let's define it more generally. For a window W ⊂ ℝⁿ×ⁿ, max pooling is:
MaxPool(W) = max{x : x ∈ W}
There's also Average Pooling, which takes the mean instead of the max, but max pooling is generally preferred for classification tasks — it keeps the strong, defining features.
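You can verify that little 4×4 example with PyTorch's built-in pooling (a quick sketch):

```python
import torch
import torch.nn.functional as F

fmap = torch.tensor([[1., 3., 2., 4.],
                     [5., 6., 1., 2.],
                     [7., 8., 9., 4.],
                     [3., 2., 1., 0.]])
# max_pool2d expects (batch, channels, H, W), hence the two extra dimensions
pooled = F.max_pool2d(fmap[None, None], kernel_size=2, stride=2)
print(pooled.squeeze())
# tensor([[6., 4.],
#         [8., 9.]])
```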
Why Pool?
- Dimensionality Reduction: Reduces number of computations in subsequent layers.
- Translation Invariance: Slight movements in the input image (e.g., the cat blinked or shifted slightly) don't change the output drastically.
- Feature Emphasis: By taking the maximum values, we emphasize the most prominent features, which are usually more relevant.
Flattening
At this point, we've cleaned the image, filtered it, activated it, and squeezed out the most important spatial features with pooling.
Flattening is the step where we say:
"Alright, enough image stuff. Let's go full neural net now."
What Is Flattening?
Flattening takes the multi-dimensional output of the previous layer (typically 2D or 3D arrays of features) and unrolls it into a 1D vector. It's like taking every little number from each feature map and laying them all in a straight line.
So a 3D array like:
Shape: (32, 32, 16)
...becomes a single array:
Shape: (16384,), since 32 × 32 × 16 = 16,384
This vector is then passed into the Fully Connected Layers, i.e. traditional neural network territory. Here's where the model starts making actual decisions based on all the condensed image data, like "cat" or "dog", all from just a (really) long string of numbers!
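As a one-line sketch of that reshape (using the shapes above; PyTorch usually keeps channels first, but the element count is the same either way):

```python
import torch

fmaps = torch.randn(32, 32, 16)   # pooled feature maps: 32 x 32 spatial, 16 channels
flat = fmaps.flatten()            # unroll everything into a single 1D vector
print(flat.shape)                 # torch.Size([16384])
```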
For the Rest....
At this point, the preprocessing is done. All the heavy image manipulation and filtering is out of the way. Now it's just numbers, weights, biases, and matrix multiplication from here on out.
For the rest of the process, including the math behind how those weights are updated, what backpropagation is, and how a neural net actually learns, check out my earlier post:
👉 Neural Networks: An Introduction to Deep Learning
Feel free to check that out for the equations and brain gymnastics.
PART 2: Practical Application
The Model: Training Before Deployment
Before you can deploy a model, you have to train it. If Part 1 was the theory, this is the doing. And trust me, clicking "run" after hours of debugging feels incredible.
Using the fastai library (which is built on top of PyTorch), you can get a working image classifier in surprisingly few lines of code.
Step 1: Point to Your Data
```python
from pathlib import Path

path = Path(r"C:\Users\Fanny\OneDrive - Fanny Fushayi\Computer Science\Building_AI\Computer_Vision\Cat_v_Dog")
```
You should organize your images into subfolders (e.g., Cat/ and Dog/), because fastai uses folder names to infer class labels.
Step 2: Load and Prepare the Data
```python
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(
    path,
    train='.',                    # use subfolder names as class labels
    valid_pct=0.2,                # 20% validation split
    item_tfms=Resize(224),        # resize all images to 224x224
    batch_tfms=aug_transforms(),  # data augmentation
    bs=6                          # batch size
)
```
What is Data Augmentation?
Think of data augmentation as training your model with optical illusions. It helps the model generalize better by transforming your input images: rotating, flipping, zooming, lighting changes, etc. This simulates real-world variety and helps avoid overfitting.
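If you want to peek at the knobs, aug_transforms exposes most of them directly. The values below are illustrative, not the ones I used:

```python
from fastai.vision.all import aug_transforms

# Illustrative settings: horizontal flips, small rotations and zooms, mild lighting changes
tfms = aug_transforms(
    do_flip=True,
    max_rotate=10.0,
    max_zoom=1.1,
    max_lighting=0.2,
)
```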
Step 3: Show a Batch
```python
dls.show_batch(max_n=9)
```
This lets you visually confirm that your images are being loaded, labeled, and transformed correctly. If you see upside-down cats or inverted dogs, don't panic. That's the augmentation at work.
Step 4: Train the Model
```python
learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(2)
```
What Is Fine-Tuning?
resnet34 is a pretrained model: it has already learned to recognize patterns from millions of images (thanks, ImageNet). Fine-tuning means we keep its earlier layers (those that detect general features like edges, textures, and shapes) and only train the final layers on our specific task: classifying cats and dogs.
It's like hiring an experienced detective and only teaching them the details of this new case. (** now you know why companies require experience on job posts)
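Under the hood, fine_tune first trains just the new head with the pretrained body frozen, then unfreezes everything for the remaining epochs. You can spell that out yourself if you prefer (a sketch; freeze_epochs=1 is fastai's default anyway):

```python
# Equivalent in spirit: one frozen epoch, then two epochs on the whole network
learn.fine_tune(2, freeze_epochs=1)

# Or step by step:
learn.freeze()          # train only the new final layers
learn.fit_one_cycle(1)
learn.unfreeze()        # now train all layers
learn.fit_one_cycle(2)
```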
Step 5: Save Your Model
```python
learn.export("My_model.pkl")
```
This saves your trained model to a .pkl file, a serialized file that can be loaded later for inference.
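Before uploading anything, it's worth a quick local sanity check that the exported file loads and predicts. A sketch ('some_cat.jpg' is a hypothetical test image):

```python
from fastai.vision.all import load_learner, PILImage

learn = load_learner("My_model.pkl")
pred, idx, probs = learn.predict(PILImage.create("some_cat.jpg"))
print(pred, probs[idx])   # predicted label and its probability
```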
You're done with training. Let's go deploy this thing.
Deployment: Taking It Online
Now, this part gets messy. Not because it's hard but because you'll encounter issues that don't feel "ML-related." Like GitHub version control, Python environments, GPU access, OS differences; basically, the adult stuff.
Quick Setup Summary for Hugging Face + GitHub Deployment:
- Create a Hugging Face account
- Create a new Space (select Gradio as the SDK)
- Clone the space repo using Git:
git clone https://huggingface.co/spaces/your-username/your-space-name
- Add your files:
- model.pkl
- app.py
- requirements.txt (this lists the packages to install, like fastai and gradio; see the example after this list)
- Git basics:
```bash
git add .
git commit -m "Initial commit"
git push
```
Done right, your app launches automatically.
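For reference, a minimal requirements.txt for this setup might look like the following; pin exact versions if the default runtime gives you trouble (more on that below):

```
fastai
gradio
```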
Challenges I Faced
- GPU issues: My CPU was crawling (180 minutes to train that model — actually, to fine-tune it). You're going to need a GPU; thankfully, you can get one for free on Colab or Kaggle (which comes with learning those platforms as well).
- Version hell: Some packages (like torch) don't play nicely on older Hugging Face runtimes. I had to pin specific versions in requirements.txt to make it work.
- Windows/Linux path madness: fastai sometimes uses PosixPath, which breaks on Windows. Temporarily patching with a pathlib hack fixed it (see the snippet after this list).
- Git rage moments: If you forget to commit changes, nothing pushes. If you forget requirements.txt, nothing works. If you forget you need to rename model.pkl to match your loading code, it crashes in silence.
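For the curious, the pathlib patch mentioned above is usually some variant of this (a workaround sketch, not something to leave in production code):

```python
import pathlib
from fastai.vision.all import load_learner

# Temporarily pretend PosixPath is WindowsPath so the pickled Learner loads on Windows
temp = pathlib.PosixPath
pathlib.PosixPath = pathlib.WindowsPath
learn = load_learner("model.pkl")
pathlib.PosixPath = temp   # restore the original class afterwards
```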
The Final app.py Code
```python
import gradio as gr
from fastai.vision.all import *

# Load the model
learn = load_learner('model.pkl')

# Define categories (fastai usually infers this automatically)
categories = ('Dog', 'Cat')

def classify_image(img):
    pred, idx, probs = learn.predict(img)
    return dict(zip(categories, map(float, probs)))

# Create interface
demo = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=2),
    title="🐱🐶 Cat vs Dog Classifier",
    description="Upload an image to classify whether it's a cat or dog!"
)

if __name__ == "__main__":
    demo.launch()
```
Final Thoughts
The practical side of ML felt easy, until it wasn't. Most of my bugs weren't in matrix multiplication; they were in getting files into the right folders or figuring out why Hugging Face wouldn't boot.
But once it worked? I couldn't stop playing with it. I tried random photos, my camera feed, comic cats, even weird color-inverted images. It was fun. It made the whole "low-level theory" suddenly click. Now I knew what all that math was for.
And that's why this wasn't just a guide — it was a learning arc.
Everything but a guide.
Obviously, this is surface level, and if you were to go down the rabbit hole, I'm sure you'd need both at some point: the theory and the tooling. There's a lot of nuanced discussion that can come out of all this. Another project I did was a family facial recognition system, because someone said we all look alike... Not very true, apparently, because it took just one epoch and only 10 images per person to hit 90% accuracy. Go figure. (NB: this aside is outside the technical section; I'm not claiming it was easy for the NN. It probably managed because the augmentations gave it more than 10 effective samples per person, and because the pretrained model could already recognise most things anyway.)
The point is: image classification opens the door to a ton of fun and weird ideas. You can experiment fast, build cool things, and learn a lot along the way (often by breaking stuff).
Oh, and also: if you want to go the extra mile, you can use the Hugging Face Inference API to get more control over the UI, and even embed your model directly into your own website, building a super duper awesome webpage around your model (or other people's) as you wish. Something like this: Et Voila
```python
import requests

response = requests.post(
    "https://api-inference.huggingface.co/models/your-username/your-model-name",
    headers={"Authorization": "Bearer YOUR_HUGGINGFACE_TOKEN"},
    files={"file": open("my_cat_image.jpg", "rb")}
)
print(response.json())
```
With a bit of front-end work, you could turn this into a slick embedded app. No need to use Gradio's hosted UI, just wire it into your own HTML/CSS/JS setup and roll your own thing. Super useful if you want to integrate it into a portfolio or blog.
Anywaysssss... however far you decide to go, it's a playground. Just don't forget to git commit before you break something again.