Chief Idiot, 2 min read

ML 9: The Eyeball (CNNs)

How computers see cats, dogs, and hotdogs.

Seeing the World

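Normal neural networks (MLPs) struggle with images. An image is just a grid of pixels, and a 1000x1000 grayscale image has 1,000,000 inputs (three times that in color). If you feed all 1 million pixels into a dense layer, every neuron needs its own weight for every single pixel, so the weight count explodes. It's too much.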
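To put a rough number on "it explodes", here is a back-of-the-envelope sketch. The 1,000-neuron layer size and the 4-byte (float32) weights are assumptions picked for illustration, not taken from any real model.

```python
# Rough weight count for one dense layer on a 1000x1000 grayscale image.
pixels = 1000 * 1000        # inputs, as in the text
hidden_units = 1000         # assumed size of the first dense layer

# Every neuron connects to every pixel (biases ignored).
dense_weights = pixels * hidden_units

# Assumed storage: 4 bytes per weight (float32).
gigabytes = dense_weights * 4 / 1e9

print(f"Weights in one dense layer: {dense_weights:,}")
print(f"Memory just for those weights: {gigabytes:.1f} GB")
```

That is a billion weights before the network has learned anything, and it is only the first layer.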

Enter the Convolutional Neural Network (CNN).


The Scanner

Instead of looking at the whole image at once, a CNN looks at small chunks. Imagine looking at a picture through a paper towel roll. You scan across.

  1. Filters (Kernels): Small grids of numbers, typically 3x3, that each look for one specific thing. Nobody hand-codes them; the network learns them during training.

    • One filter looks for Vertical Lines.
    • One filter looks for Horizontal Lines.
    • One filter looks for Circles.
  2. Pooling: Shrinking the image by summarizing each small area with a single number. "Okay, this area is generally dark."

  3. Layers:

    • Layer 1 sees Lines.
    • Layer 2 combines lines to see Shapes (Eyes, Ears).
    • Layer 3 combines shapes to see Objects (Cat Face).
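Pooling (step 2) is the easiest piece to try by hand. Below is a minimal NumPy sketch of 2x2 max pooling, the variant that keeps the strongest value in each block. The helper name `max_pool_2x2` and the tiny 4x4 "image" are made up for illustration.

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h, w = image.shape
    # Trim an odd row/column so the image splits evenly into 2x2 blocks.
    trimmed = image[: h - h % 2, : w - w % 2]
    # Reshape so each 2x2 block gets its own pair of axes, then take the max.
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Made-up 4x4 image; bigger numbers mean brighter pixels.
image = np.array([
    [1, 3, 2, 0],
    [5, 2, 1, 1],
    [0, 0, 4, 8],
    [1, 2, 3, 6],
])

pooled = max_pool_2x2(image)
print(pooled)
# [[5 2]
#  [2 8]]
```

The 4x4 image becomes 2x2: a quarter of the numbers, but the strongest signal from each area survives.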

How a CNN Sees

[Interactive demo: an original image, a filter, and the filtered result.]

Example filter (finds edges):

  -1 -1 -1
  -1  8 -1
  -1 -1 -1

CNNs use many filters like these to detect features (edges, textures, shapes).
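The edge-finding filter above (8 in the middle, -1 all around) can be tried out with a few lines of NumPy. This is a sketch with no padding and a stride of 1; the function name `apply_filter` and the 5x5 toy image are assumptions for illustration.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Multiply the patch by the kernel element-wise and add it all up.
            patch = image[r : r + kh, c : c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

edge_kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
])

# Toy image (made up): dark on the left, bright on the right.
image = np.tile([0, 0, 1, 1, 1], (5, 1))

filtered = apply_filter(image, edge_kernel)
print(filtered)
# [[-3.  3.  0.]
#  [-3.  3.  0.]
#  [-3.  3.  0.]]
```

The kernel's numbers add up to zero, so a flat area (the all-bright right side) gives 0. Only the spot where dark meets bright produces a non-zero response.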

Feature Maps

A CNN doesn't "see" a cat. It sees:

  • A map of where the fluffy texture is.
  • A map of where the pointy ears are.
  • A map of where the whiskers are.

If all those maps light up, it guesses "CAT".
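Here is a toy sketch of that "all the maps light up" idea. Everything in it is made up for illustration: the feature maps, the weights, the bias, and the helper names `make_map` and `cat_score`. A real CNN learns its weights during training and usually ends in a dense layer like the one in the code further down.

```python
import numpy as np

def make_map(peak):
    """Hypothetical 4x4 feature map with one hot spot of the given strength."""
    feature_map = np.zeros((4, 4))
    feature_map[1, 2] = peak
    return feature_map

def cat_score(feature_maps, weights, bias):
    # Summarize each map by its strongest response (global max pooling).
    strengths = np.array([fm.max() for fm in feature_maps])
    # Weighted vote, squashed into a 0..1 score with a sigmoid.
    return 1.0 / (1.0 + np.exp(-(strengths @ weights + bias)))

# Assumed values: each feature counts equally, and one feature alone is not enough.
weights = np.array([2.0, 2.0, 2.0])
bias = -4.0

# Order of maps: fluffy texture, pointy ears, whiskers.
cat_maps = [make_map(0.9), make_map(0.8), make_map(0.95)]
rug_maps = [make_map(0.9), make_map(0.0), make_map(0.1)]   # fluffy, but no ears

print("cat photo:", cat_score(cat_maps, weights, bias))
print("fluffy rug:", cat_score(rug_maps, weights, bias))
```

With all three maps lit, the score lands above 0.5. With only the fluffy map lit, it stays well below.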

The Code (Keras/TensorFlow)

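from tensorflow.keras import layers, models

model = models.Sequential()

# Input: 28x28 grayscale images (1 channel).
model.add(layers.Input(shape=(28, 28, 1)))

# 1. The Scanner (Conv2D)
# 32 filters, 3x3 size. Output: 26x26x32.
model.add(layers.Conv2D(32, (3, 3), activation='relu'))

# 2. The Shrinker (MaxPooling). Output: 13x13x32.
model.add(layers.MaxPooling2D((2, 2)))

# 3. Another Scanner
# 64 filters, 3x3 size. Output: 11x11x64.
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# 4. Flatten and Decide (Standard Neural Net at the end)
model.add(layers.Flatten())                        # 11*11*64 = 7,744 values
model.add(layers.Dense(10, activation='softmax'))  # 10 class probabilities

# Print each layer's output shape and parameter count.
model.summary()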
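To see where the shapes in that model come from without installing TensorFlow, the arithmetic can be traced in plain Python. This sketch assumes the Keras defaults (no padding, stride 1, one bias per filter). The helper names `conv_out` and `conv_params` are made up for illustration.

```python
def conv_out(size, kernel):
    """Output width/height of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size, channels = 28, 1
params = 0

# 1. Conv2D, 32 filters of 3x3 -> 26x26x32
params += conv_params(3, channels, 32)
size, channels = conv_out(size, 3), 32

# 2. MaxPooling2D 2x2 -> 13x13x32 (no parameters to learn)
size = size // 2

# 3. Conv2D, 64 filters of 3x3 -> 11x11x64
params += conv_params(3, channels, 64)
size, channels = conv_out(size, 3), 64

# 4. Flatten, then Dense(10): one weight per input per class, plus 10 biases
flat = size * size * channels
params += flat * 10 + 10

print("flattened length:", flat)      # 7744
print("total parameters:", params)    # 96266
```

Under those assumptions the two scanners together need fewer than 19,000 parameters. Most of the total sits in the small dense layer at the end.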

Summary

CNNs revolutionized AI. Before them, computer vision leaned on hand-built feature detectors and was far less reliable. Now, your phone can unlock with your face, and your car can see stop signs (mostly).

Next up: What if the data is a sequence, like a sentence?
