This course focuses on efficient AI computation, with the goal of reducing the cost of AI model training and inference. Students will learn systematic methods for implementing parallel and distributed computations for computer vision and language models, such as CNNs and Transformers, across multiple computing cores or nodes. They will also learn techniques for co-designing machine learning models, data curation methods, computing algorithms, and system architectures.
Techniques to be studied include systolic arrays, low-bitwidth arithmetic, model pruning, quantization, distillation, low-rank fine-tuning, dynamic selection of submodels (e.g., experts) based on the input, speculative decoding, synthetic data generation with Stable Diffusion, data and model security, scheduling for efficient memory access, reasoning with reinforcement learning, and test-time compute for reasoning.
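To give a concrete flavor of one of these techniques, the sketch below illustrates symmetric per-tensor int8 post-training quantization of a weight matrix. It is a minimal illustrative example written with NumPy; the function names and the choice of quantization scheme are assumptions for illustration, not code provided by the course.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative sketch).

    A single scale factor maps the largest-magnitude float32 weight
    to +/-127, so every weight is stored in one byte.
    """
    scale = np.max(np.abs(weights)) / 127.0
    if scale == 0.0:  # all-zero tensor: avoid division by zero
        scale = 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the error introduced.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs error:", np.max(np.abs(w - w_hat)))
print("memory: float32 =", w.nbytes, "bytes, int8 =", q.nbytes, "bytes")
```

Storing weights in int8 rather than float32 cuts weight memory and memory traffic by roughly a factor of four, which is one reason quantization figures prominently in efficient accelerator design.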
As part of the programming assignments, students will use large language models to generate code that leverages the AI acceleration techniques learned in this course.
Upon successful completion of this course, students will be equipped to tackle the challenging tasks of designing and using energy-efficient, high-performance AI accelerators.