ML 12: The Dog Trainer (Reinforcement Learning)
Training AI with treats and newspapers.
Pavlov's AI
Supervised Learning: "Here is input, here is answer." Unsupervised Learning: "Here is input, good luck."
Reinforcement Learning (RL): "Here is an environment. Do whatever. If you do well, you get a cookie (+1). If you die, you get a slap (-1)." Nobody tells the AI the right answer. It only sees the score go up or down after it acts.
The Agent and The Environment
- Agent: The Gamer (AI).
- Environment: The Game (Super Mario).
- Action: Jump, Run, Duck.
- State: Where Mario is, where the Goomba is.
- Reward: Coins (+), Winning (+), Dying (-).
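Here is a minimal sketch of that loop in Python. Everything in it is made up for illustration: `CorridorEnv` is a hypothetical five-tile world (not Super Mario, and not any real library's API) with a pit at tile 0 and a flag at tile 4. The point is just the shape: the agent sees a state, picks an action, and the environment answers with a new state and a reward.

```python
import random

class CorridorEnv:
    """Hypothetical toy world: tiles 0..4. Tile 0 is a pit, tile 4 is the flag."""

    def __init__(self):
        self.start = 2
        self.state = self.start

    def reset(self):
        # Put the agent back on the starting tile and return the first state.
        self.state = self.start
        return self.state

    def step(self, action):
        # action is -1 (step left) or +1 (step right)
        self.state += action
        if self.state == 0:
            return self.state, -1, True   # fell in the pit: slap
        if self.state == 4:
            return self.state, +1, True   # reached the flag: cookie
        return self.state, 0, False       # nothing happened yet

def run_episode(env, policy, max_steps=100):
    """Play one game: state -> action -> (new state, reward), until it ends."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the Agent decides
        state, reward, done = env.step(action)  # the Environment responds
        total_reward += reward
        if done:
            break
    return total_reward

def random_policy(state):
    return random.choice([-1, +1])

def always_right(state):
    return +1

print(run_episode(CorridorEnv(), always_right))  # prints 1
```

The `reset`/`step` split is a common way to structure RL environments, but the names and return values here are an assumption of this sketch, not a standard you can rely on.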

Exploration vs Exploitation
The Agent has a dilemma:
- Exploit: Do what I know gives points (Jump on Goomba).
- Explore: Try something new (Jump down that weird pipe). Maybe it's death. Maybe it's a secret level with 1000 coins.
If you never explore, you never find the optimal path. If you never exploit, you die randomly.
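One simple way to balance the two is the epsilon-greedy rule: with a small probability `epsilon`, pick a random action (explore), otherwise pick the action with the best current estimate (exploit). The sketch below assumes the estimates live in a plain dict. The action names and numbers are invented for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from q_values, a dict of action -> estimated reward."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))    # explore: try anything
    return max(q_values, key=q_values.get)   # exploit: take the best known

# Hypothetical estimates: we know the Goomba pays off, the pipe is a mystery.
estimates = {"jump_on_goomba": 1.0, "enter_pipe": 0.0}

print(epsilon_greedy(estimates, epsilon=0.0))  # prints jump_on_goomba
```

With `epsilon=0.0` the agent never explores, and with `epsilon=1.0` it acts completely at random. A value like 0.1 is a common starting point, often lowered as training goes on.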
[Interactive demo: Q-Learning Robot. Green = good path, red = danger zone.]
Q-Learning and Deep Q-Networks (DQN)
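The AI builds a "Cheat Sheet" (Q-Table) of (State, Action) -> Expected Reward, where "expected reward" means the total reward it expects to collect from that point on, not just the next coin. Every time it acts, it nudges the matching entry toward what actually happened.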
- "If I see a pit and I Jump -> +10 survival."
- "If I see a pit and I Run -> -100 death."
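To see how the cheat sheet gets filled in, here is a small tabular Q-learning sketch. It uses the standard update Q(s, a) += alpha * (reward + gamma * best next Q - Q(s, a)). The world is a hypothetical five-tile corridor (pit at tile 0, flag at tile 4), and the +10 and -100 rewards are borrowed from the bullets above purely for illustration. `alpha`, `gamma`, `epsilon` and the episode count are assumed values, not tuned ones.

```python
import random

ACTIONS = ["right", "left"]
PIT, START, FLAG = 0, 2, 4

def step(state, action):
    """Hypothetical corridor: returns (next_state, reward, done)."""
    next_state = state + (1 if action == "right" else -1)
    if next_state == PIT:
        return next_state, -100, True   # death
    if next_state == FLAG:
        return next_state, 10, True     # survival
    return next_state, 0, False

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # The cheat sheet: every (state, action) entry starts at 0.
    q = {(s, a): 0.0 for s in range(PIT, FLAG + 1) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            # Epsilon-greedy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            # No future reward after the game ends.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the entry toward "reward now + discounted best future".
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

After training, the best action on every non-terminal tile should be "right", away from the pit. Notice that tiles far from the flag still get positive values: the reward "leaks" backward through the table one step at a time, shrunk by `gamma` at each step.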
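A table stops working when there are too many states to list (think every possible screen of pixels). A Deep Q-Network (DQN) swaps the table for a neural network that estimates the same values. DeepMind used DQN to play Atari games. AlphaGo, which beat a world champion at Go, is also reinforcement learning, but not Q-learning: it combined policy and value networks with Monte Carlo tree search.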
Summary
In RL, we don't teach the AI how to win. We just give it the goal. It figures out crazy strategies we never thought of (like glitching the game or playing weird moves).
Next up: The Artist using AI against itself.