Attention

Transformer - Attention Is All You Need

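Brief introduction to the famous paper about Transformers: Attention Is All You Need

(The Chinese version of this post was translated from the English original by ChatGPT; if anything looks wrong, please refer to the English original.)

Transformer paper: Attention Is All You Need

GitHub link of Transformer

1. Background

- Encoder-decoder
- Self-attention
- Feed-forward network
- Positional encoding
- Dataset: WMT 2014 English-French translation task
- Evaluation metric: BLEU (bilingual evaluation understudy)

BLEU measures the similarity between a candidate text and one or more reference texts using a modified n-gram precision. The idea is that once a word in the reference has been matched, it cannot be matched again, so each candidate n-gram count is clipped to the largest count of that n-gram in any single reference:

$\text{Count}_{\text{clip}} = \min(\text{Count}, \text{Max\_Ref\_Count})$

$$ \text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^N \mathbf{w_n} \log P_n\right) $$

- $\text{BP}$ (brevity penalty): equal to 1 when the candidate is at least as long as the reference translation; a shorter candidate of length $c$ against a reference of length $r$ is penalized by $\exp(1 - r/c)$.
- $N$: the maximum n-gram order considered.
- $P_n$: the modified (clipped) precision for n-grams of order $n$.
- $\mathbf{w_n}$: the weight of each modified precision.

Before the Transformer, sequence-to-sequence models, especially those built on LSTMs, dominated downstream NLP tasks. The Transformer relies on self-attention instead of recurrence, which makes the computation more parallelizable and efficient. Replacing the LSTM with self-attention achieved higher accuracy on translation tasks.

Pre-trained models based on the Transformer: BERT (uses the Transformer encoder), GPT (uses the Transformer decoder), ALBERT.

2. Seq2Seq and Attention

For details, see the Stanford CS224N introduction to Sequence-to-Sequence Models and Attention.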
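As a side note on the BLEU metric from the Background section, the clipped count, modified precision, and brevity penalty can be sketched in a few lines of Python. This is only a minimal sentence-level illustration under several assumptions that are not from the original post: the function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are hypothetical, tokens are produced by whitespace splitting, the weights are uniform ($\mathbf{w_n} = 1/N$), no smoothing is applied, and the effective reference length is taken to be the one closest to the candidate length. Reported BLEU scores are normally computed at corpus level with an established tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision P_n of a candidate against its references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Max_Ref_Count: the most times each n-gram occurs in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Count_clip = min(Count, Max_Ref_Count), summed over candidate n-grams.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """BP = 1 if c >= r, else exp(1 - r / c)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Assumption: use the reference length closest to the candidate length.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; without smoothing the score is 0
    log_sum = sum(math.log(p) / max_n for p in precisions)
    return brevity_penalty(candidate, references) * math.exp(log_sum)


reference = "the cat is on the mat".split()

# Clipping in action: "the" appears 7 times in the candidate but only
# twice in the reference, so the unigram precision is 2/7, not 7/7.
print(modified_precision(["the"] * 7, [reference], 1))

# With max_n=2: P_1 = 5/6 and P_2 = 3/5, so BLEU = sqrt(0.5), about 0.7071.
print(bleu("the cat sat on the mat".split(), [reference], max_n=2))
```

The first example shows why clipping matters: without it, a candidate that just repeats a common word would get a perfect unigram precision.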