ML 9: The Eyeball (CNNs)
How computers see cats, dogs, and hotdogs.
Seeing the World
Normal Neural Networks (MLPs) struggle with images. An image is just a grid of pixels, so a 1000x1000 grayscale image has 1,000,000 inputs. Feed all 1 million pixels into a Dense layer with even 1,000 neurons and that single layer needs about a billion weights. It's too much.
Enter the Convolutional Neural Network (CNN).
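To put rough numbers on the "too much" problem, here is a small back-of-the-envelope sketch. The 1,000-neuron dense layer is an assumption picked for illustration (the text does not name a layer size), and the helper names `dense_params` and `conv_params` are made up for this example.

```python
# Rough parameter counts: one dense layer vs. one convolutional layer.
# Assumptions: a 1000x1000 grayscale image (1 channel) and a
# hypothetical dense layer with 1,000 neurons.

def dense_params(num_inputs: int, num_units: int) -> int:
    """Every input connects to every unit, plus one bias per unit."""
    return num_inputs * num_units + num_units

def conv_params(kernel_h: int, kernel_w: int, in_channels: int, num_filters: int) -> int:
    """Each filter has kernel_h * kernel_w * in_channels weights, plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * num_filters

pixels = 1000 * 1000
dense_total = dense_params(pixels, 1000)
conv_total = conv_params(3, 3, 1, 32)

print(f"Dense layer: {dense_total:,} parameters")
print(f"Conv layer:  {conv_total:,} parameters")
```

The conv layer stays tiny because the same small filter is reused at every position in the image, so its size does not grow with the number of pixels.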

The Scanner
Instead of looking at the whole image at once, a CNN looks at small chunks. Imagine looking at a picture through a paper towel roll. You scan across.
- Filters (Kernels): Small grids of numbers (often 3x3) that each look for one specific pattern.
  - One filter looks for Vertical Lines.
  - One filter looks for Horizontal Lines.
  - One filter looks for Curves.
- Pooling: Shrinking the image by summarizing each small area as a single value. "Okay, this area is generally dark."
- Layers: Stacked so that each one builds on the one before it.
  - Layer 1 sees Lines.
  - Layer 2 combines lines to see Shapes (Eyes, Ears).
  - Layer 3 combines shapes to see Objects (Cat Face).
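The scanning and shrinking steps can be sketched in plain Python. This is a minimal illustration, not how Keras implements it: the tiny 6x6 image, the hand-written vertical-edge filter, and the helper names `convolve2d` and `max_pool` are all assumptions for the example. In a real CNN the filter values are learned during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel across the image (no padding, step of 1).

    Like CNN libraries, this does not flip the kernel, so strictly
    speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Toy 6x6 image: dark (0) on the left, bright (1) in the last two columns.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge)  # 4x4 map of edge strength
pooled = max_pool(feature_map)                  # 2x2 summary of that map

for row in feature_map:
    print(row)
print(pooled)
```

The feature map is zero where the image is flat and positive where the filter window straddles the edge. The pooled version is a quarter of the size but still says "there is an edge on the right side".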
How a CNN Sees
[Interactive demo: filters transforming real images]
CNNs use many such filters to detect features (edges, textures, shapes).
Feature Maps
A CNN doesn't "see" a cat. It sees:
- A map of where the fluffy texture is.
- A map of where the pointy ears are.
- A map of where the whiskers are.
If all those maps light up, it guesses "CAT".
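As a toy version of "all the maps light up", the sketch below checks whether every feature map has at least one strong activation. The map names, the numbers, and the 0.5 threshold are invented for illustration. A real CNN does not use a hard rule like this: its final layer learns a weighted combination of the features.

```python
# Hypothetical feature maps: each grid holds one filter's activations
# (0 = nothing found, 1 = strong match) over regions of the image.
feature_maps = {
    "fluffy_texture": [[0.1, 0.8], [0.9, 0.7]],
    "pointy_ears":    [[0.0, 0.9], [0.1, 0.0]],
    "whiskers":       [[0.0, 0.1], [0.6, 0.2]],
}

def strongest_activation(feature_map):
    """Return the largest value anywhere in the map."""
    return max(max(row) for row in feature_map)

def looks_like_cat(maps, threshold=0.5):
    """Guess 'cat' only if every feature map lights up somewhere."""
    return all(strongest_activation(m) > threshold for m in maps.values())

is_cat = looks_like_cat(feature_maps)
print("CAT" if is_cat else "not a cat")

# Same image, but the whisker detector finds nothing.
no_whiskers = dict(feature_maps, whiskers=[[0.0, 0.1], [0.1, 0.2]])
is_cat_without_whiskers = looks_like_cat(no_whiskers)
print("CAT" if is_cat_without_whiskers else "not a cat")
```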
The Code (Keras/TensorFlow)
from tensorflow.keras import layers, models
model = models.Sequential()
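# 1. The Scanner (Conv2D)
# 32 filters, 3x3 size. Input: 28x28 grayscale image; output: 26x26x32 (one feature map per filter).
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# 2. The Shrinker (MaxPooling)
# Keeps the largest value in each 2x2 block: 26x26x32 -> 13x13x32.
model.add(layers.MaxPooling2D((2, 2)))
# 3. Another Scanner
# 64 filters, 3x3 size: 13x13x32 -> 11x11x64.
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# 4. Flatten and Decide (Standard Neural Net at the end)
# Flatten turns 11x11x64 into 7,744 numbers; Dense scores 10 classes.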
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))
Summary
CNNs revolutionized Computer Vision. Before them, vision systems leaned on hand-crafted features and struggled with messy real-world photos. Now, your phone can unlock with your face, and your car can see stop signs (mostly).
Next up: What if the data is a sequence, like a sentence?