21. 2. 2023
|
1 hour
|
|
Introduction and Overview
|
Ryan/Mrinmaya/Ce/Florian
|
The lecturers will contextualize large language models within NLP and computer science more broadly. In doing so, we will motivate why the topic warrants a separate course. We will also go over the course schedule and logistics.
|
Introductory Slides
|
|
21. 2. 2023
|
1 hour
|
Probabilistic Foundations
|
Basic Measure Theory
|
Ryan
|
Language modeling is about placing probability on infinite sets of strings. Measure theory is the primary tool used for the rigorous study of probability theory. This lecture shows why defining a language model rigorously requires a careful measure-theoretic treatment. We use the classic infinite coin toss model as an illuminating example. Then, we will get into some basic measure-theoretic definitions that will be useful in formally defining language models.
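As a numerical companion to the coin-toss example, the following sketch (illustrative only, not taken from the lecture notes) considers a model that at each step emits EOS with probability p and otherwise flips a coin. The mass on strings of length exactly n is (1 − p)ⁿ·p, and summing over all lengths approaches 1 whenever p > 0, so the set of infinite sequences receives measure zero; when p = 0, all mass escapes to infinite sequences.

```python
# Infinite coin toss model, sketched numerically: at each step the model
# emits EOS with probability p_eos, otherwise continues with a coin flip.
# The probability of halting at length exactly n is (1 - p_eos)^n * p_eos.

def mass_on_finite_strings(p_eos: float, max_len: int) -> float:
    """Total probability the model halts within max_len steps."""
    return sum((1 - p_eos) ** n * p_eos for n in range(max_len + 1))

print(mass_on_finite_strings(0.1, 1000))  # approaches 1.0 as max_len grows
print(mass_on_finite_strings(0.0, 1000))  # 0.0: all mass on infinite sequences
```

This is a geometric series; the measure-theoretic machinery in the lecture makes the limiting statement about infinite sequences precise.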
|
|
Du, Li, et al. A Measure-Theoretic Characterization of Tight Language Models. arXiv, 2022.
|
24. 2. 2023
|
1 hour
|
Defining a Language Model
|
Ryan
|
We will continue to introduce definitions and facts from basic measure theory, building up to a formal definition of a language model, which will be our working definition throughout the class.
|
|
|
28. 2. 2023
|
2 hours
|
Tight Language Models
|
Ryan
|
The primary goal of this lecture is to introduce the notion of tightness, which will be a recurring theoretical concept in the first part of the course. Informally, a language model is tight when it only places probability mass on finite strings. We introduce the Borel-Cantelli lemmata and prove a precise characterization of tight language models.
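The notion of tightness can be illustrated numerically (a hedged sketch, not the lecture's proof): a model whose per-step EOS probability decays geometrically leaks probability mass to infinite strings and is therefore not tight, whereas a constant EOS probability yields a tight model.

```python
# Numerical sketch of (non-)tightness. finite_mass sums the probability of
# halting within max_len steps, given a per-step EOS probability schedule.

def finite_mass(p_eos_at, max_len):
    mass, survive = 0.0, 1.0
    for n in range(max_len):
        p = p_eos_at(n)
        mass += survive * p        # halt exactly at step n
        survive *= 1 - p           # survive to step n + 1
    return mass

# Decaying EOS probability 2^-(n+2): the finite-string mass converges to a
# value strictly below 1, so the model is not tight.
print(finite_mass(lambda n: 2.0 ** -(n + 2), 10_000))
# Constant EOS probability: the finite-string mass approaches 1 (tight).
print(finite_mass(lambda n: 0.25, 10_000))
```

The decaying schedule is exactly the situation the Borel–Cantelli lemmata diagnose: the EOS probabilities are summable, so with positive probability EOS never occurs.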
|
|
Du, Li, et al. A Measure-Theoretic Characterization of Tight Language Models. arXiv, 2022.
,
Chen, Yining, et al. Recurrent Neural Networks as Weighted Language Recognizers. arXiv, 2017.
|
3. 3. 2023
|
1 hour
|
Modeling Foundations
|
The Language Modeling Task
|
Ryan
|
In this lecture, we introduce the language modeling task, which we define to be any attempt to learn a language model from finite data. We will discuss various objectives that one might wish to optimize to induce a language model from data. We also discuss various regularization techniques and their use in combating overfitting.
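The most common such objective is maximum-likelihood estimation, which minimizes the corpus-level negative log-likelihood. A minimal sketch (the toy model and corpus are invented for illustration):

```python
import math

# Average per-string negative log-likelihood (in nats): the quantity that
# maximum-likelihood estimation minimizes over the training corpus.

def negative_log_likelihood(log_prob, corpus):
    return -sum(log_prob(s) for s in corpus) / len(corpus)

# A toy "model" that assigns each symbol probability 1/4 independently.
log_prob = lambda s: len(s) * math.log(0.25)
print(negative_log_likelihood(log_prob, ["ab", "abc"]))  # 2.5 * log 4 ≈ 3.466
```

Regularization then amounts to adding a penalty term to this objective (or, e.g., applying dropout during optimization) so the learned model does not merely memorize the finite corpus.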
|
|
|
7. 3. 2023
|
2 hours
|
Finite-State Language Models
|
Ryan
|
Finite-state language models have a storied history in NLP. They are a natural generalization of n-gram models, which were the standard in the field from the 1980s until the late 2010s. On the theory side, we introduce probabilistic finite-state automata as a generalization of finite-state automata from the classical theory of computation. Additionally, we give a simple, closed-form characterization of tightness. We also show how the model of Bengio et al. (2003), the first successful neural language model, is naturally viewed as a probabilistic finite-state automaton.
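As a concrete instance of a finite-state LM, here is a minimal bigram model (one state per preceding symbol), trained by relative-frequency estimation with add-one smoothing. The corpus is invented for illustration:

```python
from collections import defaultdict

BOS, EOS = "<s>", "</s>"

def train_bigram(corpus):
    """Return an add-one-smoothed conditional P(cur | prev)."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = {EOS}
    for sent in corpus:
        toks = [BOS] + sent + [EOS]
        vocab.update(sent)
        for prev, cur in zip(toks, toks[1:]):
            counts[prev][cur] += 1

    def prob(cur, prev):
        total = sum(counts[prev].values())
        return (counts[prev][cur] + 1) / (total + len(vocab))

    return prob

prob = train_bigram([["the", "cat"], ["the", "dog"]])
print(prob("cat", "the"))  # (1 + 1) / (2 + 4) ≈ 0.333
```

Each conditioning context `prev` corresponds to a state of the probabilistic finite-state automaton, and `prob` gives its outgoing transition weights.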
|
|
Bengio, Yoshua, et al. A neural probabilistic language model. J. Mach. Learn. Res., 2003.
|
10. 3. 2023
|
1 hour
|
Pushdown Language Models
|
Ryan
|
In many ways, human language is more naturally modeled by a context-free grammar than by a finite-state automaton. This lecture discusses how to use weighted context-free grammars, specifically when implemented as weighted pushdown automata, to construct language models. In the case of a 1-stack pushdown language model, we give an iterative algorithm to determine tightness. We also discuss pushdown language models with more than one stack; in this case, determining whether such a language model is tight is undecidable. Pushdown language models are more than a historical artifact: the definitions provided in this lecture will serve as a basis for proofs about the capacity of recurrent neural networks. Indeed, our proof that it is undecidable to determine the tightness of a recurrent neural language model with infinite precision is as simple as demonstrating an encoding of a 2-stack pushdown language model as a recurrent neural network.
|
|
|
14. 3. 2023
|
2 hours
|
Neural Network Modeling
|
Recurrent Neural Language Models
|
Ryan
|
Finite-state language models, by construction, can only look at a finite amount of context. Recurrent neural networks are a formalism that overcomes this limitation. In this lecture, we give a formal definition of a recurrent neural language model (RNNLM). We give examples of tight and non-tight RNNLMs as well as characterize the vanishing gradient problem.
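A minimal Elman RNN language model step can be sketched as follows (dimensions and random initialization are arbitrary, chosen only for illustration): the hidden state is updated as h_t = tanh(W h_{t−1} + U x_t + b), and the next-symbol distribution is softmax(E h_t), so the hidden state can in principle carry unboundedly long context.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 8                                  # vocabulary size, hidden dimension
W = rng.normal(size=(D, D))                  # recurrence weights
U = rng.normal(size=(D, V))                  # input weights
b = np.zeros(D)
E = rng.normal(size=(V, D))                  # output embedding

def step(h, x_id):
    """One RNNLM step: update the state and return the next-symbol distribution."""
    x = np.eye(V)[x_id]                      # one-hot input symbol
    h_new = np.tanh(W @ h + U @ x + b)       # Elman recurrence
    logits = E @ h_new
    p = np.exp(logits - logits.max())        # numerically stable softmax
    return h_new, p / p.sum()

h = np.zeros(D)
h, p = step(h, 2)
print(p.sum())  # a valid distribution: sums to 1
```

The repeated multiplication by W in backpropagation through this recurrence is the source of the vanishing gradient problem characterized in the lecture.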
|
|
|
17. 3. 2023
|
1 hour
|
Variants of RNNLMs
|
Ryan
|
We discuss several popular variants of the RNN, most notably the LSTM and GRU. We give a formal argument showing that these variants mitigate the vanishing gradient problem.
|
|
|
21. 3. 2023
|
2 hours
|
Representational Capacity of RNNLMs
|
Ryan
|
In this lecture, we explore the representational capacity of RNNLMs. We show that, if the activation function is a hard thresholding operation, then RNNLMs have the same expressive capacity as a finite-state LM. However, we show that RNNLMs can implicitly represent finite-state LMs that are much larger. Additionally, if the activation function is a saturated sigmoid or a ReLU and we assume infinite precision arithmetic, we show how an RNN can emulate a Turing machine.
|
|
Siegelmann, Hava T., and Eduardo D. Sontag. On the Computational Power of Neural Nets. Computational Learning Theory, 1992.
|
24. 3. 2023
|
1 hour
|
Transformer-based Language Models
|
Ryan
|
Introduced in 2017 by Vaswani et al., Transformers have quickly become the most popular architecture for neural language modeling. They are the basis for recent large language models, e.g., GPT-3 and PaLM. This lecture defines the Transformer and overviews its key components, e.g., residual connections, layer normalization, and position embeddings.
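At the heart of the architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, as in Vaswani et al. (2017). A single-head, unmasked sketch with arbitrary dimensions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # convex combinations of rows of V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))
K = rng.normal(size=(6, 16))
V = rng.normal(size=(6, 16))
print(attention(Q, K, V).shape)  # (4, 16)
```

Each output row is a convex combination of the value rows, weighted by query-key similarity; the full Transformer wraps this in multiple heads, residual connections, and layer normalization.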
|
|
|
28. 3. 2023
|
2 hours
|
Efficient Attention
|
Ryan
|
There is an ever-growing bag of tricks that speed up the computation of the attention mechanism in Transformer-based language models. This lecture overviews these tricks and various generalizations of the Transformer, which are becoming increasingly necessary for scaling up Transformer LMs on academic hardware. We will also discuss multi-headed attention, sparse attention, and Transformer variants tailored for long documents. Where possible, we prove guarantees for the methods.
|
|
|
31. 3. 2023
|
1 hour
|
Representational Capacity of Transformer-based Language Models
|
Ryan
|
Inspired by the Turing completeness of RNNs, we study the representational capacity of Transformers. Although the connection to automata is not as straightforward as with RNNs, we discuss how to think about Transformers as formal models and show that, assuming an unbounded number of layers and infinite precision, Transformers are Turing complete.
|
|
|
4. 4. 2023
|
2 hours
|
Modeling Potpourri
|
Tokenization
|
Ryan
|
Throughout the class, we have assumed access to the alphabet Σ. This lecture discusses how we should choose Σ. We discuss various facts about natural language that influence Σ, e.g., morphology and syntax. Then, we introduce the byte-pair encoding algorithm, an automatic procedure for inducing Σ, and give an analysis of its correctness and runtime.
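The core of byte-pair encoding is a greedy merge loop: repeatedly replace the most frequent adjacent symbol pair with a new merged symbol. A hedged sketch (the toy word list is invented; production tokenizers additionally track word frequencies and end-of-word markers):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Greedy BPE: learn num_merges pair merges from a list of words."""
    words = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for w in words for p in zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        words = [apply_merge(w, best, merged) for w in words]
    return merges, words

def apply_merge(w, pair, merged):
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
            out.append(merged); i += 2        # replace the pair
        else:
            out.append(w[i]); i += 1
    return out

merges, segmented = bpe_merges(["lower", "lowest", "low"], 2)
print(merges)  # first merges: ('l', 'o'), then ('lo', 'w')
```

Each learned merge adds one symbol to Σ, so the number of merges directly controls the size of the induced alphabet.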
|
|
|
|
|
Easter Break
|
|
|
|
|
|
18. 4. 2023
|
2 hours
|
Modeling Potpourri
|
Generating Text from a Language Model
|
Ryan
|
A popular use case for language modeling is the generation of text. This lecture overviews various strategies for deterministically and stochastically generating text. We discuss beam search, ancestral sampling, as well as various sampling adaptors, e.g., top-k, nucleus, and locally typical sampling.
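Two of the sampling adaptors mentioned above can be sketched directly (illustrative implementations over a toy distribution): top-k keeps the k most probable symbols, while nucleus (top-p) sampling keeps the smallest prefix of the probability-sorted symbols whose cumulative mass reaches p; both renormalize before sampling.

```python
import numpy as np

def top_k(p, k):
    """Zero out all but the k most probable symbols, then renormalize."""
    keep = np.argsort(p)[-k:]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def nucleus(p, top_p):
    """Keep the smallest high-probability prefix with mass >= top_p."""
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

p = np.array([0.5, 0.3, 0.1, 0.1])
print(top_k(p, 2))       # renormalized over the two most probable symbols
print(nucleus(p, 0.75))  # the top-0.75 nucleus here is the same two symbols
```

Ancestral sampling then draws from the adapted distribution at each step; beam search instead deterministically keeps the highest-scoring partial strings.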
|
|
|
21. 4. 2023
|
1 hour
|
Training, Fine Tuning and Inference
|
Transfer Learning
|
Mrinmaya
|
|
Slides
|
|
25. 4. 2023
|
2 hours
|
Parameter efficient finetuning
|
Mrinmaya
|
|
Slides
|
|
28. 4. 2023
|
1 hour
|
Prompting and zero-shot inference
|
Mrinmaya
|
|
Slides
|
|
2. 5. 2023
|
2 hours
|
Parallelism and Scaling up
|
Scaling up
|
Ce
|
|
Slides
|
|
5. 5. 2023
|
1 hour
|
Parallelism
|
Ce
|
|
Slides
|
|
9. 5. 2023
|
2 hours
|
Applications and the Benefits of Scale
|
Multimodality
|
Mrinmaya
|
|
Slides
|
|
12. 5. 2023
|
1 hour
|
Additional Topics
|
Mrinmaya
|
|
Slides
|
|
16. 5. 2023
|
2 hours
|
Analysis
|
Analysis and Probing
|
Tiago/Ryan
|
Many language models are uninterpretable, i.e., it is hard to know why a language model prefers one prediction to another. This lecture overviews a variety of recent techniques for better understanding language models’ behavior and interpreting their predictions.
|
Slides
|
|
19. 5. 2023
|
1 hour
|
Cognitive Modeling
|
Ethan/Alex/Ryan
|
Language models show remarkable linguistic capabilities. This lecture treats the question: Do language models process language as humans do? The performance of language models on a wide variety of cognitive benchmarks is discussed in an attempt to tease apart how language models are similar and dissimilar to human language processing. We will also discuss the implications of language modeling for language science.
|
Slides 1
,
Slides 2
|
|
23. 5. 2023
|
2 hours
|
Security and Misuse
|
Security and Misuse
|
Florian
|
Machine learning models are remarkably brittle and prone to all kinds of exploits. Language models are no different: we will see how tampering with model inputs or training data can lead to arbitrarily bad outcomes. We will also discuss how language models could be exploited for nefarious purposes such as large-scale spam campaigns. On the other hand, language models could also prove useful as a defensive tool, e.g., for automated online content moderation or for dispelling misinformation.
|
Slides
|
|
26. 5. 2023
|
1 hour
|
Harms and Ethical Concerns
|
Florian
|
Language models work extremely well, until they don’t! What are some of the harms that large-scale deployment of language models can bring? We will discuss ways in which models can perpetrate or exacerbate issues in training data (biases, toxicity, etc.) and the difficulty in aligning models with particular ethical principles or truths.
|
Slides
|
|
30. 5. 2023
|
2 hours
|
Memorization and Privacy
|
Florian
|
We look into language models’ remarkable ability to memorize training data, and the risks this may pose for privacy or copyright. We will look at different ways to define memorization and privacy for textual models, and understand the different threats they aim to address. We will then review methods for provably guaranteeing the confidentiality and privacy of machine learning systems, and debate their adequacy in the context of textual models.
|
Slides
|
|
2. 6. 2023
|
1 hour
|
The data lifecycle
|
Florian
|
So far, most of the course has been about models. But what would these models be without the right data? We will discuss the lifecycle of modern training sets for language models, to understand how design choices in the data collection and maintenance process influence the model’s “world view”. We will review emerging guidelines and best practices for managing and documenting machine learning datasets across their lifetime.
|
Slides
|
|