
The relative recency of transformer architectures and the ubiquity with which they have upended language tasks speak to the rapid rate of progress in machine learning and artificial intelligence. There's no better time than now to gain a deep understanding of the inner workings of transformer architectures, especially with transformer models making big inroads into diverse new applications like predicting chemical reactions and reinforcement learning.

The vanilla Transformer uses six of these encoder layers (each a self-attention layer followed by a feed-forward layer), paired with a stack of six decoder layers.
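As a rough illustration, here is a minimal PyTorch sketch of one such encoder layer and a six-layer stack. The hyperparameters (d_model=512, n_heads=8, d_ff=2048) follow the original base configuration, but the class and variable names are illustrative, and the decoder side (which adds cross-attention over the encoder output) is omitted for brevity.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: a self-attention sublayer followed by a feed-forward sublayer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sublayer with residual connection and layer normalization.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sublayer with residual connection and layer normalization.
        x = self.norm2(x + self.ff(x))
        return x

# The vanilla Transformer stacks six such encoder layers.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])

x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
print(encoder(x).shape)       # torch.Size([2, 10, 512])
```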

In the vanilla Transformer model, the residual summing operation is followed by layer normalization, a method for improving training that, unlike batch normalization, is not sensitive to minibatch size.
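To make that ordering concrete (sometimes called post-LN: sublayer output plus residual, then layer norm), here is a small PyTorch sketch of a sublayer wrapper. The class name and tensor shapes are assumptions for illustration, not taken from the article.

```python
import torch
import torch.nn as nn

class PostLNSublayer(nn.Module):
    """Residual addition followed by layer normalization, as in the vanilla Transformer."""
    def __init__(self, sublayer, d_model=512):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)  # normalizes over the feature dimension, not the batch

    def forward(self, x):
        # 1. apply the sublayer, 2. add the residual, 3. layer-normalize the sum
        return self.norm(x + self.sublayer(x))

# LayerNorm statistics are computed per token over the feature dimension,
# so the behavior does not depend on minibatch size (unlike batch normalization).
block = PostLNSublayer(nn.Linear(512, 512))
out_small = block(torch.randn(1, 10, 512))   # batch of 1
out_large = block(torch.randn(64, 10, 512))  # batch of 64, same per-token computation
```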
