On the Training Dynamics of Transformers for Language Tasks
Jing Yang
UVA ECE
Time: 2025-03-19, 12:00 - 13:00 ET
Location: Olsson 105 and Zoom
Abstract
Transformers have demonstrated remarkable success in language tasks, yet their theoretical training dynamics remain insufficiently understood. In this talk, I will present a fine-grained, non-asymptotic analysis of how a one-layer transformer learns in two fundamental settings: next-token prediction (NTP) and regular language recognition.
For NTP, we introduce a mathematical framework based on partial orders to characterize essential structural properties of training datasets. We design a two-stage training algorithm where the feed-forward layer undergoes pre-processing before the attention layer is trained. Our results show that both layers converge sub-linearly toward max-margin solutions, while the cross-entropy loss enjoys a linear convergence rate. Additionally, we establish that the trained transformer exhibits non-trivial generalization under dataset shifts, providing insights into its robust language modeling capabilities.
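As a toy illustration of the setting above (not the algorithm analyzed in the talk), the sketch below trains a one-layer transformer for next-token prediction with the cross-entropy loss in two stages, updating the feed-forward layer first and the attention layer second; the model sizes, random data, and learning rates are arbitrary placeholders.

```python
# Illustrative sketch only: a one-layer transformer trained for next-token
# prediction with a two-stage schedule (feed-forward layer first, attention
# layer second). Sizes, data, and hyperparameters are arbitrary placeholders.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ = 16, 32, 8

class OneLayerTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.attn = nn.MultiheadAttention(DIM, num_heads=1, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(DIM, 4 * DIM), nn.ReLU(), nn.Linear(4 * DIM, DIM))
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):
        h = self.embed(x)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        return self.head(self.ffn(h + a))  # logits for the next token at each position

model = OneLayerTransformer()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(VOCAB, (64, SEQ + 1))  # toy random "corpus"
inputs, targets = tokens[:, :-1], tokens[:, 1:]

def run_stage(params, steps):
    """Run plain gradient descent on the given subset of parameters."""
    opt = torch.optim.SGD(params, lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
        loss.backward()
        opt.step()
    return loss.item()

# Stage 1: adapt the feed-forward layer while the attention layer stays fixed.
print("stage 1 loss:", run_stage(model.ffn.parameters(), 50))
# Stage 2: train the attention layer.
print("stage 2 loss:", run_stage(model.attn.parameters(), 50))
```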
To further explore the fundamental mechanisms behind transformer-based language processing, we analyze regular language recognition tasks, focusing on even pairs and parity check. While even pairs can be learned directly, solving parity check requires integrating Chain-of-Thought (CoT) reasoning. Our theoretical analysis identifies two distinct training phases: an initial rapid growth of the attention layer, followed by a logarithmic progression of the linear layer toward a max-margin hyperplane. These findings highlight how transformers encode structured linguistic dependencies and reasoning processes, with experimental validation reinforcing our theoretical results.
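For readers unfamiliar with the two tasks, the short sketch below spells out their usual formulations in the formal-language literature (the talk's exact setup may differ): even pairs reduces to a single sequence-level decision comparing the first and last symbol, whereas parity is a global property of the whole string that a chain-of-thought trace can decompose into one step per token.

```python
# Toy definitions of the two regular-language tasks, following the common
# formulations in the literature (an assumption; the talk's setup may differ).
# Even pairs: accept a binary string iff the numbers of "01" and "10"
# substrings are equal, equivalently the first symbol equals the last.
# Parity: accept iff the number of 1s is even.

def even_pairs(s: str) -> bool:
    return s == "" or s[0] == s[-1]   # single sequence-level decision

def parity(s: str) -> bool:
    return s.count("1") % 2 == 0      # global property of the whole string

def parity_with_cot(s: str):
    """Chain-of-thought-style trace: emit the running parity after each symbol,
    so every step depends only on the previous step and the current token."""
    trace, p = [], 0
    for c in s:
        p ^= int(c)
        trace.append(p)
    return trace                      # final entry 0 means "even number of 1s"

assert even_pairs("0110")                        # first and last symbols match
assert not parity("0111")                        # three 1s -> odd
assert parity_with_cot("0111") == [0, 1, 0, 1]   # step-by-step running parity
```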
Bio
Jing Yang is an Associate Professor in the Department of Electrical and Computer Engineering at the University of Virginia, with a courtesy appointment in the Department of Computer Science. Previously, she was an Assistant and then tenured Associate Professor at the Pennsylvania State University. She received her B.S. from the University of Science and Technology of China (USTC), and her M.S. and Ph.D. from the University of Maryland, College Park, all in Electrical Engineering.
Her recent research interests include transformers and large language models, reinforcement learning, and privacy-preserving machine learning. She is a recipient of the National Science Foundation CAREER Award (2015) and the WICE Early Achievement Award (2020), and was recognized as one of the 2020 N2Women: Stars in Computer Networking and Communications.