Systems for Large (Language) Models



In this course, we will go over various systems advances that enable training billion-parameter transformer models such as GPT-3 and LLaMA. Such “foundation models”, trained on internet-scale data, have enabled breakthroughs in generative AI such as ChatGPT. However, working with these models is often considered impossible without access to industry-scale GPU clusters, primarily due to their sheer size. We will go over advances in systems that democratize research on such large-scale models and make working with them possible even with modest resources. By the end of this course, you will know how to train and finetune state-of-the-art models that contain billions of parameters and cannot fit into a single GPU’s memory. While we will primarily work with transformer-based language models, the focus will be on learning general principles that apply broadly to any large model.

Requirements to Register

This is an advanced course that requires grasping involved concepts and quickly implementing and testing out ideas in code. We thus require a good level of familiarity with deep learning frameworks such as PyTorch, as well as libraries such as HuggingFace's Transformers, Accelerate, and Datasets, and Microsoft's DeepSpeed. We also expect knowledge of Transformers and a solid grasp of deep learning fundamentals, including batch sizes, backpropagation, optimizers, and learning rate schedules. Knowledge of C/C++/CUDA is beneficial. This course also requires some research maturity, since we will be critiquing state-of-the-art papers with the goal of identifying possible research projects in this nascent space of systems for LLMs.

The course will be taught in English and all instruction will be in-person. Remote participation may be possible in special circumstances; please reach out to the instructors for permission.


Contact Details

We will use Moodle for all course-related discussions. We want this class to be an interactive and fun learning experience for everyone, and we thus encourage you to participate in discussions both in class and on Moodle.

If you have any issues with the course, do not hesitate to send an email, and the instructors and/or TAs will get back to you as soon as possible.


Grading

70% of your grade will be based on your "lead" presentation (see "Lecture Format" below for further details), 20% on mini-projects (more details below), and 10% on class participation.

Course Content

The goal of the course is to give you both conceptual knowledge of and hands-on experience with various techniques that allow for training, finetuning, and (if time permits) inference of modern, very large models (at least a billion parameters). Thus, we will both read (and critique) papers and implement parts of them as mini-projects.

We will go over the following broad topics:

  • Training and finetuning of Large Models
    • Reducing memory footprint (2 lectures; see the first code sketch after this list)
      • Gradient Checkpointing, Gradient Accumulation
      • Mixed Precision Training, Weight Quantization
    • Parallelism (2 lectures)
      • Data Parallel, Model Parallel, Pipeline Parallel
      • ZeRO and ZeRO-Offload + other ZeRO optimizations
    • Parameter Efficient Finetuning (1 lecture; see the second code sketch after this list)
      • Low Rank Adapters (LoRA) and Quantized Low Rank Adapters (QLoRA)
  • (If time permits) Faster Inference
    • Mixture of Experts
    • Flash Attention
    • Paged Attention
    • Multi-Query Attention
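
To give a flavor of what the mini-projects will involve, below is a minimal sketch of two techniques from the first lecture, gradient accumulation and mixed precision training, in plain PyTorch. The model, data, and hyperparameters are placeholders rather than course-provided code, and running the snippet requires a CUDA GPU:

    import torch

    # Stand-ins for a real model and dataset (illustrative only).
    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    data = [(torch.randn(4, 1024).cuda(), torch.randn(4, 1024).cuda())
            for _ in range(32)]

    scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid fp16 gradient underflow
    accumulation_steps = 8                # effective batch size = micro-batch size * 8

    for step, (inputs, targets) in enumerate(data):
        with torch.cuda.amp.autocast():   # forward pass runs in mixed (fp16/fp32) precision
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
        # Divide the loss so the accumulated gradient matches one large-batch step.
        scaler.scale(loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)        # unscales gradients; skips the step on inf/nan
            scaler.update()
            optimizer.zero_grad()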
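
Similarly, the parameter-efficient finetuning lecture centers on low-rank adapters. The following is a rough sketch of the core LoRA idea, freezing a pretrained weight and training only a low-rank update; the class name, rank, and initialization here are illustrative and not the paper's reference implementation:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen nn.Linear with a trainable low-rank update B @ A."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():  # freeze the pretrained weights
                p.requires_grad = False
            d_out, d_in = base.weight.shape
            # Only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
            self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero-init
            self.scaling = alpha / r

        def forward(self, x):
            # frozen pretrained path + scaled low-rank trainable path
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    layer = LoRALinear(nn.Linear(1024, 1024), r=8)
    print(layer(torch.randn(2, 1024)).shape)  # torch.Size([2, 1024])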

Lecture Format

Each lecture will focus on a subtopic from the list above (e.g., "Reducing memory footprint" will be the subtopic of lecture 1). Each subtopic comes with a list of readings (papers, blog posts, etc.) and possibly videos (e.g., talks). Everyone in the class is expected to go through this material before coming to class. Additionally, each lecture will have a set of "lead" presenters: a pre-assigned group of two students. The lead presenters will give a 45-60 minute presentation on the subtopic of the lecture, covering (at least) the following points (everyone is welcome to be creative and add more):

  1. A comprehensive explanation of the topic.
  2. An assessment of the strengths and weaknesses of the presented approach.

The set of lead presenters for a lecture will rotate amongst all students in the class, so that everyone is equally involved in the course. The first lecture will be delivered by the instructors to set an example of what is expected in the presentation.

Mini Projects: For hands-on experience, we will require you to do mini-projects, each implementing a technique from the upcoming lecture. Details of what to implement for each lecture will be released on Moodle. You will be required to submit a 1-2 page report detailing the experiment you ran, along with the results obtained, before the lecture on that topic. There are 5 lectures in total, and you can choose to do the mini-project for any 3 of them. For example, the mini-project details for Lecture 1 are already out on Moodle; the report for it will be due before lecture 1 (i.e., 6th December).

Seminar Schedule

Lectures will be held in building E1 5, room 029, from 16:00 to 18:00.

Below is a tentative schedule; you will be informed of any changes.