Course Launching Spring 2026 — This website is under active development. Content and materials are being continuously updated.

Learning Modules

Each module below lists its topics, lab, readings, and deliverables.
Module 1
From AlphaZero to RLHF
  • Classical RL → modern trajectory optimization
  • Overview of RL
  • What is an agent?
  • Case Studies: AlphaZero, InstructGPT, Gato
Lab: 'Cleaning robot' alignment challenge (warmup activity)
Readings: Textbook, Chapter 1
Module 2
Reinforcement Learning Foundations
  • Agent-environment loop
  • Rewards, MDPs, value functions
  • Bellman equations for V* and Q*
  • Tabular Q-learning and convergence to Q*
Lab: Tabular value iteration in GridWorld (see the sketch below)
Readings: Textbook, Chapters 2-4
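
Since the Module 2 lab centers on tabular value iteration, here is a minimal sketch of the idea on a toy 4x4 GridWorld. The grid layout, step cost, and convergence threshold are illustrative assumptions, not the lab specification.

```python
# Tabular value iteration on a toy 4x4 GridWorld (illustrative, not the lab spec).
import numpy as np

N = 4                        # grid is N x N, states indexed 0..N*N-1
GOAL = N * N - 1             # bottom-right corner is terminal
GAMMA = 0.9                  # discount factor
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Deterministic transition: -1 reward per move, 0 once the goal is reached."""
    if state == GOAL:
        return state, 0.0
    row, col = divmod(state, N)
    drow, dcol = action
    row = min(max(row + drow, 0), N - 1)
    col = min(max(col + dcol, 0), N - 1)
    return row * N + col, -1.0

V = np.zeros(N * N)
while True:
    delta = 0.0
    for s in range(N * N):
        # Bellman optimality backup: V(s) <- max_a [ r(s, a) + gamma * V(s') ]
        backups = []
        for a in ACTIONS:
            s_next, r = step(s, a)
            backups.append(r + GAMMA * V[s_next])
        best = max(backups)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-6:         # converged when no state value changes appreciably
        break

print(V.reshape(N, N))       # optimal values; the goal corner stays at 0
```
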
Module 3
Deep Reinforcement Learning: Value-based Agents
  • DQN, Double DQN, Dueling networks
  • Experience replay
  • Value estimation instability
  • Reward hacking
Lab: CartPole with DQN variants using Stable-Baselines3 (see the sketch below)
Readings: Textbook, Chapters 6-8
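
A hedged sketch of how the Module 3 lab can be set up, assuming a recent Stable-Baselines3 (2.x) with Gymnasium; the hyperparameters below are placeholders rather than the values the course will use.

```python
# DQN on CartPole with Stable-Baselines3 (hyperparameters are illustrative).
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")
model = DQN(
    "MlpPolicy",             # small fully connected Q-network
    env,
    learning_rate=1e-3,
    buffer_size=50_000,      # experience replay buffer size
    exploration_fraction=0.2,
    verbose=1,
)
model.learn(total_timesteps=50_000)

# Greedy evaluation rollout for one episode.
obs, _ = env.reset()
done, episode_return = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    episode_return += reward
    done = terminated or truncated
print("episode return:", episode_return)
```
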
Module 4
Deep Reinforcement Learning: Policy Gradients and PPO
  • REINFORCE, A2C
  • PPO vs TRPO vs SAC
  • Why PPO dominates modern pipelines like RLHF
Lab: PPO on MiniGrid (see the sketch below)
Readings: Textbook, Chapter 13
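
A sketch of the Module 4 lab setup, assuming the Farama minigrid package; the environment id, wrapper, and hyperparameters are illustrative choices, not the lab's required configuration.

```python
# PPO on a small MiniGrid task with Stable-Baselines3.
import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the MiniGrid environments)
from minigrid.wrappers import FlatObsWrapper
from stable_baselines3 import PPO

# FlatObsWrapper flattens MiniGrid's dict observation into a vector an MlpPolicy can use.
env = FlatObsWrapper(gym.make("MiniGrid-Empty-5x5-v0"))

model = PPO("MlpPolicy", env, n_steps=512, batch_size=64, verbose=1)
model.learn(total_timesteps=100_000)
model.save("ppo_minigrid_empty")
```
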
Module 5
Safety, Generalization, and Exploration
  • Exploration in RL
  • Evaluation beyond rewards
  • Generalization, adversarial robustness
  • Intrinsic motivation (curiosity, prediction bonuses; see the sketch below)
Lab: TBD
Readings: Textbook, Chapters 2, 8
Deliverable: Project 1
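
The prediction-bonus idea behind curiosity-driven exploration fits in a few lines: a learned forward model predicts the next state, and the agent gets extra reward where that prediction is poor. This is a minimal PyTorch sketch with made-up dimensions, not a full curiosity module such as ICM.

```python
# Prediction-error intrinsic reward: reward the agent where its forward model is wrong.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 4      # illustrative sizes

forward_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM)
)
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward(state, action, next_state, beta=0.1):
    """Return beta * squared prediction error per transition, updating the model as we go."""
    pred = forward_model(torch.cat([state, action], dim=-1))
    error = ((pred - next_state) ** 2).mean(dim=-1)   # one error per batch element
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()
    return beta * error.detach()   # added to the environment reward during training

# Toy usage with random tensors standing in for a batch of real transitions.
s = torch.randn(32, STATE_DIM)
a = torch.randn(32, ACTION_DIM)   # a real agent would use one-hot or continuous actions
s_next = torch.randn(32, STATE_DIM)
print(intrinsic_reward(s, a, s_next).shape)   # torch.Size([32])
```
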
Module 6
Human-in-the-Loop RL
  • Human preferences
  • Reward modeling theory (see the preference model sketch below)
  • Preference ambiguity
  • Reward modeling pitfalls and overspecification
Lab: TBD
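
Human preferences in this module are usually formalized with the Bradley-Terry model over pairs of trajectories or responses; a sketch of the objective, writing the learned reward model as r_theta:

```latex
% Bradley-Terry model: probability that segment tau^A is preferred to segment tau^B
P(\tau^A \succ \tau^B)
  = \frac{\exp\big(r_\theta(\tau^A)\big)}
         {\exp\big(r_\theta(\tau^A)\big) + \exp\big(r_\theta(\tau^B)\big)}
  = \sigma\big(r_\theta(\tau^A) - r_\theta(\tau^B)\big)

% Reward-model training: maximize the likelihood of the human labels
% (here tau^A is the segment the labeler preferred)
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(\tau^A, \tau^B) \sim \mathcal{D}}
      \Big[ \log \sigma\big(r_\theta(\tau^A) - r_\theta(\tau^B)\big) \Big]
```
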
Module 7
RLHF Pipeline
  • SFT → Reward model → PPO loop
  • Label efficiency, reward model overfitting
  • Oversight protocols (e.g., Constitutional AI)
  • Tools: Hugging Face TRL, TRLX, gpt2
Lab: Collect binary preferences, train a reward model, and fine-tune toy LMs (see the sketch below)
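
The reward-model step of that lab reduces to a pairwise loss; a minimal PyTorch sketch, assuming scores for the chosen and rejected responses have already been computed (the lab itself is expected to use Hugging Face TRL utilities rather than a hand-rolled loss).

```python
# Pairwise (Bradley-Terry) reward-model loss: push the score of the human-preferred
# "chosen" response above the score of the "rejected" one.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the preference labels under a Bradley-Terry model."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy check with random scores standing in for reward-model outputs on a batch of pairs.
chosen = torch.randn(16, requires_grad=True)
rejected = torch.randn(16, requires_grad=True)
loss = reward_model_loss(chosen, rejected)
loss.backward()
print(float(loss))
```
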
Module 8
Offline and Batch RL
  • Behavioral cloning
  • CQL, IQL
  • Decision Transformers
Lab: Decision Transformer implementation (see the returns-to-go sketch below)
Deliverable: Project 2
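
The data-preparation step at the heart of the Decision Transformer lab is computing returns-to-go, the targets the policy is conditioned on; a minimal sketch (the transformer itself is omitted):

```python
# Returns-to-go: at each timestep, the sum of remaining (optionally discounted) rewards.
# A Decision Transformer conditions its action predictions on these targets.
from typing import List

def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 0.0, 2.0]))   # [3.0, 2.0, 2.0]
```
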
Module 9
Model-based RL and World Models
  • Planning vs learning (see the planning sketch below)
  • DreamerV3, MuZero
  • Sim-to-real intro (MuJoCo, PyBullet)
Lab: DreamerV3 introduction
Readings: Textbook, Chapter 8
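
The planning-versus-learning contrast can be made concrete with a tiny planner: given a one-step dynamics model (learned or hand-written), score random action sequences and execute the best first action. This is a generic random-shooting sketch, not DreamerV3 or MuZero.

```python
# Random-shooting planning with a one-step model: sample candidate action sequences,
# roll each out through the model, and return the first action of the best sequence.
import numpy as np

def plan(model, state, action_dim, horizon=10, num_candidates=256, rng=None):
    """model(state, action) -> (next_state, reward); states and actions are 1-D arrays."""
    rng = rng or np.random.default_rng()
    best_action, best_return = None, -np.inf
    for _ in range(num_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

# Toy "learned model": the reward favors steering the state toward the origin.
def toy_model(s, a):
    s_next = s + 0.1 * a
    return s_next, -float(np.sum(s_next ** 2))

print(plan(toy_model, np.ones(2), action_dim=2))
```
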
Module 10
Hierarchical RL
  • Temporal abstraction and long-horizon credit assignment
  • Options framework and Option-Critic architecture (see the definitions below)
  • Multi-level policies (HIRO) and subgoal learning
Lab: Hierarchical RL experiments
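
The options framework in the usual notation (an option bundles where it may start, how it acts, and when it terminates), together with the SMDP-style backup over options; this is the textbook formulation rather than anything course-specific.

```latex
% An option o is a triple: initiation set, intra-option policy, termination condition
o = \langle \mathcal{I}_o,\ \pi_o,\ \beta_o \rangle, \qquad
\mathcal{I}_o \subseteq \mathcal{S}, \quad
\pi_o : \mathcal{S} \to \Delta(\mathcal{A}), \quad
\beta_o : \mathcal{S} \to [0, 1]

% SMDP backup over options, where k is the (random) number of steps o runs before terminating
Q(s, o) = \mathbb{E}\Big[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}
          + \gamma^{k} \max_{o'} Q(s_{t+k}, o') \ \Big|\ s_t = s,\ o \Big]
```
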
Module 11
Inverse RL and Reward Inference
  • IRL, AIRL
  • Maximum-entropy (MaxEnt) IRL (see the formulation below)
  • Preference-based IRL
Lab: Toy Car IRL implementation
Deliverable: Project 3
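
The MaxEnt formulation referenced above, in its simplest (deterministic-dynamics) form: trajectories are exponentially more likely the higher their cumulative reward under the inferred reward function, and the reward parameters are fit to the expert demonstrations.

```latex
% Maximum-entropy IRL: trajectory distribution induced by the inferred reward r_theta
P_\theta(\tau) \;\propto\; \exp\Big( \sum_{t} r_\theta(s_t, a_t) \Big)

% Fit the reward by maximizing the likelihood of the expert demonstrations
\max_\theta \;\; \mathbb{E}_{\tau \sim \mathcal{D}_{\text{expert}}}
    \big[ \log P_\theta(\tau) \big]
```
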
Module 12
Multi-agent RL and Emergence
  • Social dilemmas, coordination
  • Tool use
  • Case study: OpenAI hide-and-seek
Lab: PettingZoo or Melting Pot multi-agent experiment (see the sketch below)
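
For orientation, the agent-by-agent interaction loop that PettingZoo uses looks roughly like the sketch below. It assumes a recent PettingZoo release, and the rock-paper-scissors environment is only a stand-in for whichever environment the lab settles on.

```python
# Random agents in a PettingZoo AEC environment (environment choice is illustrative).
from pettingzoo.classic import rps_v2

env = rps_v2.env()
env.reset(seed=42)
for agent in env.agent_iter():                 # cycles through agents turn by turn
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None                          # finished agents must step with None
    else:
        action = env.action_space(agent).sample()
    env.step(action)
env.close()
```
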
Module 13
RL in Structured & Constrained Domains
  • Constrained MDPs (see the formulation below)
  • RL with graphs
  • Domain-specific challenges
Lab: TBD
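
The constrained-MDP setting in its standard form: maximize return while keeping an expected cumulative cost below a budget. Nothing here is specific to the course; it is the usual textbook statement.

```latex
% Constrained MDP: c_t is a per-step cost signal alongside the reward r_t, d is the budget
\max_{\pi} \;\; \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} c_t \Big] \;\le\; d
```
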
Module 14
Frontiers in Aligned RL
  • Scalable oversight
  • Deception, red-teaming
  • RLAIF
Lab: TBD
Deliverable: Project 4

Learning Path

This course is structured as a 14-module progression from foundational concepts to advanced applications. Each module builds on the ones before it, with hands-on labs and assigned readings. Self-paced learners should plan on 8-10 hours per module for a thorough understanding.