Speed benchmark einsum vs matmul in XL-Attention

The original Transformer computes attention over only a fixed-length, limited segment of the input. The major drawback of this architecture is that no information can flow across separate segments, which prevents the Transformer from modeling long-term dependencies. Transformer-XL is an enhancement to the vanilla Transformer that lets it store the most recent hidden states in a fixed-length memory …
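
To make the fixed-length memory idea concrete, here is a minimal sketch in plain Python. It is not the post's benchmark code: the names `update_memory`, `attention_context`, and `mem_len` are hypothetical, and each hidden state is reduced to a one-element list so the bookkeeping is easy to follow. The assumption is that the memory keeps only the most recent `mem_len` hidden states and that attention for a new segment sees the memory followed by the segment itself.

```python
from typing import List

Vector = List[float]


def update_memory(memory: List[Vector], hidden: List[Vector], mem_len: int) -> List[Vector]:
    """Append the new segment's hidden states and keep only the last mem_len."""
    if mem_len <= 0:
        # Guard: a slice of [-0:] would return the whole list, not an empty one.
        return []
    return (memory + hidden)[-mem_len:]


def attention_context(memory: List[Vector], hidden: List[Vector]) -> List[Vector]:
    """States the current segment may attend to: cached memory, then the segment."""
    return memory + hidden


# Toy run: two segments of three (one-dimensional) hidden states each.
memory: List[Vector] = []
segments = [
    [[1.0], [2.0], [3.0]],
    [[4.0], [5.0], [6.0]],
]
for segment in segments:
    context = attention_context(memory, segment)
    memory = update_memory(memory, segment, mem_len=4)
```

In this toy run the second segment attends over six states (three cached plus its own three), while the memory afterwards holds only the four most recent ones. The cached states are what let information cross the segment boundary.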