[Deep Learning Theory Team Seminar] Talk by Prof. Difan Zou (HKU) on Understanding the Working Mechanism of Transformers

Event Description

Venue: Online and the Open Space at the RIKEN AIP Nihonbashi office
Language: English
Title: Understanding the Working Mechanism of Transformers: Model Depth and Multi-head Attention
Speaker: Prof. Difan Zou, HKU, https://difanzou.github.io/
Abstract:
In this talk, I will discuss our recent work on the working mechanism of the Transformer architecture, including the learning capabilities and limitations associated with model depth and with the multi-head attention mechanism across different tasks. In the first part, we design a series of learning tasks based on actual sequences and systematically evaluate how Transformers of different depths perform in terms of memory, reasoning, generalization, and context generalization. We show that a Transformer with a single attention layer performs excellently on memory tasks but cannot complete more complex tasks. Reasoning and generalization require at least two layers, while context generalization may require three layers.
In the second part, we consider the sparse linear regression problem, explore the role of multi-head attention in trained Transformer models, and reveal how it works in different Transformer layers. First, we find experimentally that every attention head in the first layer is important for the final performance, whereas in subsequent layers usually only one attention head plays an important role. We then propose a preprocess-then-optimize working mechanism and prove theoretically that a multi-layer Transformer (with multiple heads in the first layer and only one head in each subsequent layer) can implement it. Moreover, for sparse linear regression, we prove that this mechanism is superior to naive gradient descent and ridge regression, which is consistent with the experimental findings. These results deepen the understanding of the advantages of multi-head attention and the role of model depth, and provide a new perspective for revealing more complex mechanisms inside the Transformer.
Bio: Dr. Difan Zou is an assistant professor in the Department of Computer Science and the Institute of Data Science at HKU. He received his PhD from the Department of Computer Science at the University of California, Los Angeles (UCLA). His research interests are broadly in machine learning, deep learning theory, graph learning, and interdisciplinary research between AI and other subjects. His research has been published in top-tier machine learning conferences (ICML, NeurIPS, COLT, ICLR) and journals (IEEE Trans., Nature Comm., PNAS, etc.). He serves as an area chair/senior PC member for NeurIPS and AAAI, and as a PC member for ICML, ICLR, COLT, etc.

Date

November 20, 2024, 14:00 to 15:00

Organizer / Contact

RIKEN AIP Public

Venue

Item	Details
Place	Name not set
Address	Online and the Open Space at the RIKEN AIP Nihonbashi office
