CS/Stat 184(0): Introduction to Reinforcement Learning



Modern AI systems often need the ability to make sequential decisions in an unknown, uncertain, possibly hostile environment, by actively interacting with the environment to collect relevant data. Reinforcement Learning (RL) is a general framework that can capture the interactive learning setting and has been used to design intelligent agents that achieve high-level performance in challenging applications such as Go, computer games, robotic manipulation, health care, and education.

This course provides an introduction to reinforcement learning covering a range of problem formulations, algorithms, and theory. The three main themes of the course are (1) Markov decision processes (Bellman equations/optimality, planning, UCB, unknown environments, linear quadratic control, exploration, imitation learning), (2) bandits (epsilon-greedy, UCB, Thompson sampling, contextual bandits, linear bandits, exploration in MDPs), and (3) deep RL and methods for large-scale systems (policy gradient methods, Monte Carlo tree search, Q-learning, imitation learning).

There will also be an Embedded Ethics lecture on ethical issues arising in reinforcement learning. The assignments will focus on a mix of algorithmic and statistical principles, along with their programming implementations.

After taking this course, students will be able to understand fundamental RL algorithms and their analysis.

Lectures will go through these algorithms and their analysis, and every homework will have a programming component to give students more hands-on experience with the concepts.

Staff and Organization


Instructor: Lucas Janson

TFs: Anvit Garg, Nowell Closser

CAs: Jayden Personnat, Sibi Raja, Alex Cai, Ethan Tan, Neil Shah, Jason Wang, Russell Li, Sid Bharthulwar, Andrew Gu, Ian Moore

Lecture time: Monday/Wednesday 10:30am - 11:45am

Lecture location: Room 118, Yenching Auditorium, 2 Divinity Avenue

Calendar: The course calendar is below. Please double-check the Google Calendar for the most up-to-date information on location changes and cancellations before you arrive. This calendar provides an overview of lectures, sections, office hours, homework, and project deadlines.




Discussion: Ed discussion board

Contact Info:

Please communicate with the course staff (including the instructor) only by making a post that is "Private", i.e., "Visible to you and staff only" in Ed. Any course-related email sent directly to the instructors will not receive a timely response.

Announcements:

Please make sure you monitor for (and receive) announcements from both the official Canvas class mailing list and from Ed. Ed is a convenient way to send out some announcements, such as homework corrections and clarifications. It is important for you to make sure you get these announcements in a timely manner.

Prerequisites


Lectures will focus on algorithm design and analysis. We require a background in: calculus & linear algebra (e.g., AM22a, Math 21b), probability theory (e.g., Stat 110), and programming in Python. The following topics are recommended but not required: linear regression, supervised learning, algorithms.

Homeworks will have a programming component, and we expect students to be comfortable with programming in Python (or committed to quickly learning it). We will use Python as the programming language in all homeworks.

Grading Policies


Participation 5%; Homework 45% (HW0: 5%, HW1-HW4: 40% total); Midterm 20%; Project 30%.
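As a rough illustration of how these weights combine, the sketch below computes an overall percentage from per-component scores. It assumes each component is scored on a 0-100 scale and combined linearly; the function name, the dictionary keys, and the example scores are hypothetical, and the mapping from percentage to letter grade is not specified here.

```python
# Weights in percentage points, copied from the grading breakdown above.
WEIGHTS = {
    "participation": 5,
    "hw0": 5,
    "hw1_to_hw4": 40,
    "midterm": 20,
    "project": 30,
}


def overall_percentage(component_scores):
    """Combine per-component scores (each on a 0-100 scale) into one percentage.

    Assumption: components are combined as a plain weighted sum.
    """
    if set(component_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the components listed in WEIGHTS")
    weighted = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
    return weighted / 100


# Hypothetical scores, for illustration only.
example = {
    "participation": 100,
    "hw0": 100,
    "hw1_to_hw4": 85,
    "midterm": 80,
    "project": 90,
}
print(overall_percentage(example))  # 87.0
```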

The course is letter-graded by default, but you may switch to SAT/UNSAT if you prefer.

In order to pass the course, you must attempt and submit every homework, even if it is submitted for zero credit (as per the late policy below). We will also have an "embedded ethics" lecture with 1-2 corresponding questions, either incorporated into a homework or as a standalone short assignment (with the grading scheme adjusted appropriately). All homeworks are mathematical and have a programming component (we use Python and Gymnasium, formerly OpenAI Gym).

Participation: 5% of the grade will be participation. People can participate in the course in many different ways, including regular attendance of lectures (there will be a form after each class where students can record their attendance), participating in section, in the Ed forum, and more. At the end of the term, you will write a paragraph on how you participated in the course. The requirements to get the full 5% contribution will not be too onerous, and regularly attending the lectures will suffice. If for some reason you are not able to regularly attend all the lectures, then increased participation in Ed and section will be sufficient. If you have another responsibility that prevents you from attending all the lectures, please let us know by making a post that is "Private", i.e., "Visible to you and staff only" in Ed, and we will take this into consideration.

Homework Policies: Collaboration is permitted, though each student must understand, write, and hand in their own submission. In particular, it is acceptable for students to discuss problems with each other; it is not acceptable for students to look at another student's written answers. It is also not acceptable to publicly post your solution (even a partial one) on Ed, but you are encouraged to ask public questions on Ed. You must also indicate on each homework with whom you collaborated and what online resources you used.

Each student will have 96 cumulative hours of late time (as measured on Gradescope), which will be forgiven. After this cumulative amount of time has passed, any assignment that is turned in late will receive zero credit. Furthermore, only up to 48 hours of late time may be used on any one assignment; any assignment turned in more than 48 hours late will receive zero credit. You are expected to track your own late time. Your grades on Gradescope will not reflect late time.
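Since you are expected to track your own late time, the sketch below shows one way the two limits might interact: a 96-hour cumulative budget and a 48-hour cap per assignment. The function name, the example hours, and the edge-case handling are assumptions. In particular, the policy does not say whether a zero-credit submission still consumes late hours (this sketch assumes it does not) or how partial hours are rounded, so Gradescope and the course staff remain the authority.

```python
from dataclasses import dataclass

TOTAL_LATE_HOURS = 96.0         # cumulative budget stated in the policy
MAX_LATE_PER_ASSIGNMENT = 48.0  # per-assignment cap stated in the policy


@dataclass
class LateResult:
    name: str
    hours_late: float
    gets_credit: bool
    budget_left: float


def apply_late_policy(submissions):
    """Process (name, hours_late) pairs in order and track the late budget.

    Assumption (not stated in the syllabus): a submission that receives
    zero credit does not consume any of the remaining budget.
    """
    budget = TOTAL_LATE_HOURS
    results = []
    for name, hours_late in submissions:
        if hours_late <= 0:
            gets_credit = True  # submitted on time
        elif hours_late > MAX_LATE_PER_ASSIGNMENT or hours_late > budget:
            gets_credit = False  # over the cap, or budget exhausted
        else:
            gets_credit = True
            budget -= hours_late
        results.append(LateResult(name, hours_late, gets_credit, budget))
    return results


# Hypothetical late hours, for illustration only.
for r in apply_late_policy([("HW1", 30), ("HW2", 50), ("HW3", 48), ("HW4", 20)]):
    print(r.name, r.gets_credit, r.budget_left)
```

In this hypothetical run, HW1 keeps credit and leaves 66 hours, HW2 loses credit because it is more than 48 hours late, HW3 keeps credit and leaves 18 hours, and HW4 loses credit because 20 hours exceeds the remaining 18.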

The final homework score for HW1-4 will be determined by summing up the total points earned across all four assignments. This sum will then be divided by the total possible points to calculate the overall percentage score for the HW1-4 component of the course.
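The calculation described above pools points across the four assignments rather than averaging four separate percentages. A minimal sketch, with a hypothetical function name and made-up point totals:

```python
def hw1_to_hw4_percentage(earned, possible):
    """Return the pooled HW1-4 percentage: total earned / total possible."""
    if len(earned) != len(possible):
        raise ValueError("earned and possible must have the same length")
    total_possible = sum(possible)
    if total_possible <= 0:
        raise ValueError("total possible points must be positive")
    return 100 * sum(earned) / total_possible


# Hypothetical point totals for HW1..HW4, for illustration only.
earned = [40, 45, 30, 50]
possible = [50, 50, 40, 60]
print(hw1_to_hw4_percentage(earned, possible))  # 82.5
```

Because the points are pooled, an assignment worth more points counts for more. With these made-up numbers, averaging the four per-assignment percentages would instead give roughly 82.1.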

We highly encourage you to use LaTeX. We will also accept neatly written handwritten homework.

Homeworks must be submitted through Gradescope. PDF files of the homeworks can be accessed on Gradescope. PDF and LaTeX files for the homeworks will also be uploaded to Canvas.

Regrading Policy: All homework regrading requests must be submitted on Gradescope within seven days after the grades are released. For example, if we release the grades on Monday, then you have until midnight the following Monday to submit any regrade requests. If you feel that we have made an error in grading your homework, please let us know with a written explanation. This policy is to ensure that we can address any concerns in a timely and fair manner. Office hours and in-person discussions are limited to knowledge-related questions. Grade-related questions must be submitted by making a post that is "Private", i.e., "Visible to you and staff only" in Ed.

Project: Please see the course project page.


Diversity and Inclusiveness

While many academic disciplines have historically been dominated by one cross section of society, the study of and participation in STEM disciplines is a joy that the instructors hope that everyone can pursue, regardless of their socio-economic background, race, gender, etc. We encourage students to both be mindful of these issues, and, in good faith, try to take steps to fix them.

You should expect to be treated by your classmates and the course staff with respect. You belong here, and we are here to help you learn and enjoy this course. If any incident occurs that challenges this commitment to a supportive and inclusive environment, please let the instructors know so that the issue can be addressed. We are personally committed to this and subscribe to Harvard's Values for Inclusion.

Honor Code

  • You must always understand and write up your own solutions.
  • Collaborations only where explicitly allowed.
  • Do not use forums like Course Hero, Chegg, etc.
  • Properly cite any outside materials you use for your homeworks. Do not directly search for answers on the internet. If you are unclear about whether some online material can be used, please ask the course staff first.
  • No sharing of your solutions within or outside class at any time.
  • Do not use AI tools to explicitly obtain answers. Think of AI tools as you would a message board or collaborator: you can use them for assistance (and if you do, you should cite them), but you may not directly ask them for the answer.
  • The above is not an exhaustive list, and in general, common sense rules about academic integrity apply. If you are in doubt about something, please ask us whether it is OK before you do it. Also see the Harvard College Honor Code.

    Course Materials

    Slides will be posted before each lecture, and annotated slides (with all notes taken on them by the instructor during lecture) will be posted after each lecture. We will make reasonable attempts to record and post each lecture. It is possible some lectures may not be recorded, in which case we will not be able to do any make-ups of that lecture. We encourage the students to attend the lectures in person (see the Participation Policy) and participate in the class discussion.

    Section materials will also be posted by the TFs. These materials serve as the reference material for the course content.

    In addition to the lecture slides, there is also a draft textbook being written for this course, available here. This material should closely follow the lecture content, notation, and structure, with some additional material and examples (you are not responsible for the additional material and examples, though you will hopefully find them helpful). Feedback on any aspect of this draft is welcome and appreciated via the Ed forum.

    Other references to learn more (but that are not necessarily aligned with the course in content or level) are Reinforcement Learning Theory and Algorithms and Reinforcement Learning: An Introduction.

    Assignment Schedule (tentative)

    Assignment Deadline
    Homework 0 11:59pm ET 9/12/2024
    Homework 1 11:59pm ET 9/29/2024
    Homework 2 11:59pm ET 10/13/2024
    Homework 3 11:59pm ET 11/11/2024
    Project Proposal 11:59pm ET 11/18/2024
    Homework 4 11:59pm ET 11/26/2024
    Project Milestone 11:59pm ET 12/04/2024
    Final Project 11:59pm ET 12/13/2024

    Class Meeting Schedule (tentative)

    Lecture Slides Textbook
    9/4/24 MDPs: Introduction to RL and Markov Decision Processes Slides, Annotated Slides 1.1-1.2.4
    9/9/24 MDPs: Dynamic Programming Slides, Annotated Slides 1.2.4-1.4.3
    9/11/24 MDPs: Discounted, Infinite Horizon MDPs Slides, Annotated Slides 1.5.1-1.5.2
    9/16/24 MDPs: Value and Policy Iteration Slides, Annotated Slides 1.5.3
    9/18/24 Control: Optimal Control in Linear Quadratic Regulator (LQRs) Slides, Annotated Slides 2.1-2.4.1
    9/23/24 Control: Control for Nonlinear Systems (Iterative LQR) Slides, Annotated Slides 2.5.1-2.6.4
    9/25/24 Bandits: Introduction to Bandits Slides, Annotated Slides 3.1-3.4.1
    9/30/24 Bandits: Explore-Then-Commit (ETC), ε-greedy Slides, Annotated Slides 3.4.1-3.5
    10/2/24 Bandits: Upper Confidence Bound (UCB) Slides, Annotated Slides 3.6.1-3.6.2
    10/7/24 Bandits: Thompson Sampling Slides, Annotated Slides 3.7
    10/9/24 Learning: Supervised Learning Slides, Annotated Slides 4.1-4.3
    10/14/24 Holiday (no class)
    10/16/24 Lecture Cancelled: Registrar accidentally let another class book our lecture room!
    10/21/24 Learning: Fitted Dynamic Programming Slides, Annotated Slides 5.1-5.4
    10/23/24 Midterm
    10/28/24 PG: (Stochastic) Gradient Descent & Policy Gradient Slides, Annotated Slides 6.1-6.4
    10/30/24 PG: Estimation and Baselines Slides, Annotated Slides 6.5
    11/4/24 PG: Trust Region Methods and Natural PG Slides, Annotated Slides 6.6-6.7
    11/6/24 Embedded EthiCS Readings, Slides
    11/11/24 PG: NPG and PPO Slides, Annotated Slides 6.8-6.9
    11/13/24 PG: PPO & Importance Sampling Slides, Annotated Slides 6.9-6.10
    11/18/24 Imitation Learning: Behavior Cloning & DAgger Slides, Annotated Slides 7.1-7.4
    11/20/24 MCTS: Monte Carlo Tree Search Slides, Annotated Slides 8.1-8.3
    11/25/24 Exploration: Exploration in MDPs and UCB-VI Slides, Annotated Slides 9.1-9.5
    11/27/24 Holiday (no class)
    12/2/24 Bandits: Linear Bandits Slides, Annotated Slides 3.8
    12/4/24 Bandits: Contextual Bandits Slides, Annotated Slides 3.8-3.9