Seeing the World

Normal Neural Networks (MLPs) struggle with images. An image is just a grid of pixels. A 1000x1000 image has 1,000,000 inputs. If you feed all 1 Million pixels into a Dense Neural Network, it explodes. It's too much.

Enter the Convolutional Neural Network (CNN).

The Eye

The Scanner

Instead of looking at the whole image at once, a CNN looks at small chunks. Imagine looking at a picture through a paper towel roll. You scan across.

Filters (Kernels): Small 3x3 grids that look for specific things.
- One filter looks for Vertical Lines.
- One filter looks for Horizontal Lines.
- One filter looks for Circles.
Pooling: Shrinking the image. "Okay, this area is generally dark."
Layers:
- Layer 1 sees Lines.
- Layer 2 combines lines to see Shapes (Eyes, Ears).
- Layer 3 combines shapes to see Objects (Cat Face).

How a CNN Sees

Watch filters transform real images

Original

Filter

-1

Finds edges

Filtered

CNNs use many filters like these to detect features (edges, textures, shapes)

Feature Maps

A CNN doesn't "see" a cat. It sees:

A map of where the fluffy texture is.
A map of where the pointy ears are.
A map of where the whiskers are.

If all those maps light up, it guesses "CAT".

The Code (Keras/TensorFlow)

from tensorflow.keras import layers, models
 
model = models.Sequential()
 
# 1. The Scanner (Conv2D)
# 32 filters, 3x3 size.
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
 
# 2. The Shrinker (MaxPooling)
model.add(layers.MaxPooling2D((2, 2)))
 
# 3. Another Scanner
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
 
# 4. Flatten and Decide (Standard Neural Net at the end)
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))

Summary

CNNs revolutionized AI. Before them, Computer Vision was garbage. Now, your phone can unlock with your face, and your car can see stop signs (mostly).

Next up: What if the data is a sequence, like a sentence?