Chief Idiot · 2 min read

ML 8: The Memorizer (Overfitting)

Why studying by rote memorization fails the test.

The Smartest Idiot

Imagine a student, Steve. Steve wants to pass the math exam. Instead of learning the formulas, Steve memorizes the exact answers to the practice questions.

practice_Q1: "2 + 2 = ?" -> Steve knows "4".
practice_Q2: "3 x 5 = ?" -> Steve knows "15".

Steve gets 100% on the Practice Test (Training Data).

But on the Real Exam (Test Data):

Real_Q1: "2 + 3 = ?" -> Steve panics. He has never seen this question. He guesses "4" because it looks like "2 + 2".

Steve has Overfit the data.
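Steve's whole "strategy" is just a lookup table. Here is a toy Python sketch of it (the dictionary and the guess string are illustrative, not real ML code):

```python
# Steve's "learning": a lookup table of memorized practice answers.
steve = {"2 + 2": "4", "3 x 5": "15"}

def answer(question):
    # Perfect recall on memorized questions; a panicked guess otherwise.
    return steve.get(question, "4 (panicked guess)")

print(answer("2 + 2"))  # aces the practice test: "4"
print(answer("2 + 3"))  # real exam: "4 (panicked guess)"
```

Perfect on the practice set, helpless one step outside it. That gap is the whole story of this post.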

The Memorizer

Underfitting vs Overfitting

  1. Underfitting (The Slacker):

    • Didn't study.
    • Fails Practice Test. Fails Real Exam.
    • Model is too simple (e.g., trying to draw a straight line through a circle).
  2. Overfitting (The Memorizer):

    • Memorized the noise.
    • Aces Practice Test. Fails Real Exam.
    • Model is too complex (e.g., a squiggly line that touches every single dot).
  3. Good Fit (The Student):

    • Understands the general concepts.
    • Does okay on Practice. Does okay on Real Exam.
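The three students above can be demonstrated with polynomial curve fitting. In this minimal NumPy sketch (the noisy sine data and the degree choices are my own assumptions), degree 1 is the slacker, degree 3 the student, and degree 15 the memorizer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy "practice test"

x_exam = np.linspace(0.03, 0.97, 20)   # "real exam": unseen points
y_exam = np.sin(2 * np.pi * x_exam)    # the true answers

def mse(degree, xs, ys):
    coeffs = np.polyfit(x, y, degree)  # always fit on the training data only
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

for degree in (1, 3, 15):              # slacker, student, memorizer
    print(f"degree {degree:2d}: train={mse(degree, x, y):.3f}  "
          f"exam={mse(degree, x_exam, y_exam):.3f}")
```

The degree-15 fit drives its training error toward zero by chasing the noise, and pays for it on the exam points; the straight line fails both.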

Steve's Study Strategy

[Chart: practice questions Q1–Q5 and scores — Practice Test: 85%, Real Exam: 80%. Passed!]

How to fix Overfitting?

  1. More Data: Harder to memorize 1 million questions than 10.
  2. Regularization: Punish the model for being too complex. (Like fining the student for writing really long answers).
    • L1 (Lasso): Pushes the weights of useless features to exactly zero (deletes them).
    • L2 (Ridge): Shrinks all weights toward zero (keeps them small, deletes none).
  3. Early Stopping: Stop training before Steve starts memorizing.
  4. Dropout: (Used in Neural Nets) Randomly knock out neurons during training so they don't rely on each other too much.

The Golden Rule

NEVER TEST ON YOUR TRAINING DATA. Always split your data:

  • Train Set: For the model to learn.
  • Test Set: For the final exam.

If Train Score >> Test Score, you are Overfitting.
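A minimal NumPy sketch of the golden rule (the synthetic data here is my own assumption): a 1-nearest-neighbor model is a pure memorizer, so scoring it on its own training set gives a perfect, meaningless 1.0, while the held-out test set tells the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy labels

# The golden rule: split BEFORE training. 75% train, 25% test.
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

def predict_1nn(q):
    # 1-nearest-neighbor = pure memorization: copy the label of the
    # closest training point.
    dists = np.linalg.norm(X[train] - q, axis=1)
    return y[train][np.argmin(dists)]

train_acc = np.mean([predict_1nn(X[i]) == y[i] for i in train])
test_acc = np.mean([predict_1nn(X[i]) == y[i] for i in test])
print(train_acc, test_acc)  # train is 1.0 by construction; test is lower
```

Every training point's nearest neighbor is itself, so the train score is guaranteed to be perfect; only the test score measures generalization.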

Summary

We want models that Generalize, not models that Memorize. A model that knows the existing data perfectly is usually useless for new data.

Next up: Teaching computers to see.
